A look into the black box: AI training data set C4 also draws from murky sources

AI chatbots learn about the world from written language. The text incorporated into their language models during training largely determines the quality of their later answers and conversations with people. Large corpora of texts, books and material scraped from the Internet serve as machine fodder. Not all providers of large language models speak openly about what they have trained their products on – OpenAI, for example, keeps the training data of GPT-4 and ChatGPT secret, which is why researchers speak of a black box here, as with other proprietary (closed, mostly commercial) models. However, even open-source projects are not always precise in their disclosures, and offshoots of the leaked LLaMA model trained purely on synthetic, distilled data sets (generated via the OpenAI API) are increasingly appearing.

A Washington Post investigative team has looked inside the black boxes, examining 15 million websites that serve as the source of a particularly essential data set for machine-learning training: the Colossal Clean Crawled Corpus (C4), a web-scraped collection of English-language texts based on a single Common Crawl snapshot of indexed web pages. The snapshot was subsequently heavily cleaned and filtered: data was excluded, block lists were applied, duplicates were removed, and personal identifiers were anonymized – the finished data set comprises around 750 gigabytes. Websites without at least 99 percent English-language content were excluded.
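The cleaning steps described above can be sketched in a few lines. This is a minimal illustration only, assuming a language-detection score is already available and using a stand-in blocklist; the real C4 pipeline applies considerably more heuristics:

```python
# Minimal sketch of C4-style cleaning: an English-only threshold, a word
# blocklist, a sentence-punctuation heuristic and naive line de-duplication.
# BLOCKLIST and the english_score parameter are illustrative assumptions.
import re

BLOCKLIST = {"lorem"}  # stand-in for the "bad words" list actually used

def clean_page(text: str, english_score: float) -> list[str]:
    if english_score < 0.99:  # keep only pages detected as >=99% English
        return []
    kept, seen = [], set()
    for line in text.splitlines():
        line = line.strip()
        if not line.endswith((".", "!", "?", '"')):  # drop non-sentence lines
            continue
        if any(w in BLOCKLIST for w in re.findall(r"\w+", line.lower())):
            continue
        if line in seen:  # naive de-duplication
            continue
        seen.add(line)
        kept.append(line)
    return kept
```

Run against a page mixing navigation text, duplicates and blocked words, only the clean sentences survive, which is how a multi-terabyte crawl shrinks to roughly 750 gigabytes.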

Journalists Nitasha Tiku and Kevin Schaul and data reporter Szu Yu Chen, together with researchers from the Allen Institute for AI, examined the websites from which C4 gets its data and found all sorts of inconsistencies. For example, the copyright symbol appears more than 200 million times in the corpus, and pirate sites such as b-ok.org, which knowingly infringes copyright to distribute content illegally, are among the domains from which the data set draws content – b-ok.org sits at rank 190, with 14 million tokens and 0.009 percent of the total corpus. At least 27 other sites officially flagged in the US for counterfeiting and product piracy can be found in the data set.

Major daily newspapers figure prominently among the top sources: the New York Times at rank 4, the Los Angeles Times, The Guardian, Forbes and HuffPost at ranks 6 through 9, and the Washington Post at rank 11. Wikipedia is ranked 2nd out of 15 million sites, and the subscription-only online library scribd.com is in 3rd place. The journalists also found that data had been scraped from websites containing voter information from Colorado and Florida, both of which are in the top 100 C4 sources. Websites such as Kickstarter and Patreon, through which artists and creatives earn an income via monthly subscriptions, are likewise scraped for C4; marketing ideas and artistic projects, i.e. intellectual property, could be tapped here. In view of the numerous copyright notices identified in the data set, the dispute over authorship and its protection could be fueled further.

According to Nitasha Tiku and her colleagues, the C4 data set is dominated by text scraped from websites in the fields of journalism, medicine, content creation, science, public relations, advertising and marketing – areas considered particularly affected by AI text generators and in which the automation of text generation is likely to cause even greater upheaval.

Of particular interest is an interactive infographic that breaks down the contents of C4 into categories, with the size of the fields corresponding to their quantitative share in the data set. Business and industry (16 percent) and technology (15 percent), but also news and media (13 percent), art and entertainment (11 percent) and research and health (9 percent) make up a large part. Work and education (7 percent) is about the same as hobbies and leisure (8 percent) and home and garden (6 percent). Law and government are also represented (4 percent). US websites and English-language content dominate in all areas. The investigative team and the Allen researchers could not categorize all websites because some of them are no longer accessible on the Internet.

A look into the black box of a data set used to train AI chatbots: millions of websites, clustered by topic. In the Washington Post, the infographic is clickable and reveals different layers of information as you scroll the page.

(Image: Washington Post)

Curiously, the data source that contributes by far the most to the corpus is Google's search engine for worldwide full-text patents (patents.google.com): 720 million tokens come from this source, accounting for 0.46 percent of the entire data set. For comparison: the English-language Wikipedia follows in second place with 290 million tokens (a 0.19 percent share). A token is the smallest unit of meaning that machine learning breaks text into – a word, subword or character sequence. Tokens can be embedded in a vector space, where the model can later retrieve them. This technique, tokenization, is fundamental to Natural Language Processing (NLP), for example when building transformer models such as those behind ChatGPT, or for text classification with BERT transformers.
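As an illustration of the concept, a naive word-level tokenizer with a toy vocabulary table can be written in a few lines. Real LLM pipelines use subword schemes such as BPE or SentencePiece rather than this simple splitting:

```python
# Illustrative only: split text into lowercase words and punctuation marks,
# then map each distinct token to an integer id (the basis for embedding
# lookups in a vector space).
import re

def tokenize(text: str) -> list[str]:
    # words and standalone punctuation become separate tokens
    return re.findall(r"\w+|[^\w\s]", text.lower())

vocab: dict[str, int] = {}  # token -> integer id, built on the fly

def encode(tokens: list[str]) -> list[int]:
    return [vocab.setdefault(t, len(vocab)) for t in tokens]

ids = encode(tokenize("Tokens unlock text for machine learning."))
# seven tokens, including the final period, each getting its own id
```

Counting sources by tokens rather than pages, as the Post's analysis does, therefore measures the actual amount of text a site contributes.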

Media and propaganda sites not known for their trustworthiness are not, or not fully, filtered out of the data set: articles from Russia Today (RT.com, rank 65) and the right-wing populist site Breitbart News (rank 159) can be found in it. The white supremacist site vdare.com (rank 993) is also represented, as are extreme fringes of various religious groups, some of which preach hatred of and prejudice against other groups.

Numerous private blogs, including in the tech sector, find their way into C4. Social networks such as Facebook and Twitter, on the other hand, are not represented because they have forbidden scraping for training AI models. Nobody knows exactly what is done with user data inside corporations like Facebook and Google. Elon Musk has meanwhile announced that he will set up his own AI company whose chatbot, TruthGPT, is to compete with OpenAI's ChatGPT; it does not seem impossible that Twitter data would then become part of the training basis. According to the Post's research, C4's filters also overlooked sources of conspiracy theories: 4chan.org, threecentpartriots.com (very far down the rankings) and the racist site stormfront.org are all represented.

Schaul and Szu Yu Chen have built a search engine that can be used to look up the URLs of the roughly 15 million referenced websites. The search engine provides quantitative information: for each website from which C4 scrapes data, it gives the absolute number of tokens and their percentage share of the entire data set. C4 is a standard data set that was previously considered fairly unproblematic for model training and is fundamental to numerous large language models (LLMs), presumably including GPT-4 and ChatGPT. C4 served as a training basis for AI systems such as Google's Flan-T5 and Facebook's LLaMA, and it is incorporated into the data sets of non-profit open-source initiatives such as the new AI project RedPajama.
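The token counts and percentage shares reported by the search tool can be cross-checked with simple arithmetic. Using the figures quoted in this article (an assumption; the tool itself is authoritative), each entry implies the same overall corpus size:

```python
# Sanity check: token count / (share / 100) should give the total corpus size,
# and the figures for different sites should roughly agree.
sources = {
    "patents.google.com": (720_000_000, 0.46),  # rank 1
    "en.wikipedia.org":   (290_000_000, 0.19),  # rank 2
}

implied_totals = {
    domain: tokens / (pct / 100) for domain, (tokens, pct) in sources.items()
}
# Both work out to roughly 155 billion tokens for the whole corpus
# (the rounding of the published percentages explains the small spread).
```

That the two independent entries agree to within a few percent suggests the published counts and shares are internally consistent.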

Training data set of RedPajama compared to LLaMA by data source

(Image: Heise)

C4 contributes only part of a trained model's data; numerous other data collections always feed in as well. GPT-3, for example, contained 41 Common Crawl runs (snapshots taken across the web at different points in time), as well as the entire English-language Wikipedia and a set of web links that Reddit users had rated as particularly useful sources of information, such as collections of open-access novels by lesser-known authors.

The composition of the training data and the quality and scope of certain content are central to assessing how AI systems arrive at their output. According to Tiku, Schaul and the research team at the Allen Institute for AI, precise examination of the training data is therefore an important contribution to making the processes inside large language models comprehensible. This should also be relevant for legislators' decisions on AI regulation.

Since the creators of the data set explicitly excluded non-English material, according to the project description at Hugging Face (English only), a separate look at German-language media is of limited use – even though some German media are rudimentarily represented: about 71,000 tokens come from Heise, corresponding to 0.00005 percent of the data set, an amount GPT-4 could generate in two queries today. The BILD newspaper is represented with 42,000 tokens, Golem.de with 7,300, ZEIT with 5,800 and Chip.de with 190. Only SPIEGEL is more strongly represented, with 4.1 million tokens – probably because numerous English-language articles are available there.

In the case of tokens, it is not entirely clear which subunit is meant: a token can be a word, a sentence or a meaningful component of a word. They are used to unlock unstructured text for machine learning. What this lack of data from cultures where English is not the primary language means is the subject of another article. The large language models, produced mainly in the USA and other English-speaking countries, have blind spots where foreign languages are concerned, and language conveys not only grammar but also themes, values and diversity – it is therefore doubtful whether data sets such as the colossal crawl can adequately reflect European realities at all.

The full research can be found in the Washington Post; the search tool for checking websites in the C4 data set is embedded in its last third. The C4 data set is scientifically documented in "Documenting the English Colossal Clean Crawled Corpus" and "Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus", both from 2021. This research already opened up the data set with an interactive web interface to an indexed copy of C4.

