
How is the knowledge base used by ChatGPT created? (1/6)

When you type a query into ChatGPT or another language model, it's easy to get the impression that you're interacting with a system that scours the internet in real time to produce an answer. In reality, the process is quite different, and understanding it is essential for digital literacy in modern education.

Andrej Karpathy, one of the world’s leading experts on artificial intelligence, offers a mental model that fundamentally changes our perception of these tools. He compares the process of creating an AI model to systematically writing and reading a textbook. The first phase of this process is called “pre-training,” and its scope is surprisingly “finite,” meaning that it involves a precisely defined, measurable amount of data, rather than an infinite number of sources.

Choosing material for a “digital textbook”

It all starts with organizations like Common Crawl, which have been archiving billions of web pages for years; by 2024, about 2.7 billion pages had been indexed. Although this is a huge amount of raw material, in its original form it is unprocessed. Alongside the desired text, it contains redundant elements of the pages' digital structure (navigation menus, advertising blocks, and program code) that contribute nothing to learning.

In order to shape this vast amount of information into systematic knowledge, the data undergoes a process of careful selection and purification:

  1. URL filtering: Web pages from “blacklists” that contain malware, unwanted content, or hate speech are removed.
  2. Text extraction: Algorithms remove visual code (HTML) to isolate only pure text – sentences and paragraphs.
  3. Language filtering: Most of today's models are optimized for English. Some datasets automatically prioritize pages with a high percentage of English text, which explains why current models perform best in that language.
  4. Personal data removal: Sensitive information such as addresses or personally identifiable information (PII) is detected and removed.

All the “knowledge” fits in a backpack

What remains after this selection process is a top-quality dataset (such as FineWeb). Karpathy highlights a key insight here that demystifies the notion of infinity:

“Even though the internet is vast, we’re dealing exclusively with text that we filter. We end up with about 44 terabytes of data. You could almost fit that on a large hard drive these days that you could hold in your hand.”

Karpathy uses a powerful analogy: An AI model is like a highly compressed archive of the internet. Once training is complete, the model no longer has direct access to that data, but has “compressed” its knowledge into its parameters. Just as a digital photograph loses some of its original detail to take up less space, an AI model doesn’t memorize text verbatim, but rather memorizes statistical relationships and patterns between pieces of information.
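The idea of "compressing text into statistical patterns" can be illustrated with the simplest possible statistical language model, a bigram counter. This toy is an assumption for illustration only; it bears no resemblance to a real neural network's scale, but it shows the same principle: after training, the original word order is gone, and only relationships between words remain.

```python
# Toy illustration of "compression into statistics": a bigram model
# stores transition counts between words, not the training text itself.
# Purely illustrative - real LLMs learn billions of neural parameters.
from collections import Counter, defaultdict

corpus = "the cat sat on the mat . the dog sat on the rug .".split()

# Count how often each word follows each other word.
counts: dict[str, Counter] = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

# The original sentences are no longer stored; what remains is a table
# of probabilities, e.g. P(next = "sat" | current = "cat"):
p_sat_given_cat = counts["cat"]["sat"] / sum(counts["cat"].values())
print(p_sat_given_cat)  # 1.0 - in this corpus, "cat" was always followed by "sat"
```

Just as the model above can tell you that "sat" tends to follow "cat" without storing either sentence, a trained LLM reconstructs plausible text from learned statistical relationships rather than retrieving it verbatim.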

What does this mean for teachers and the education system?

Understanding this process is essential for using these tools correctly in the classroom:

  1. The model doesn't "Google": It's important to emphasize that a language model doesn't search the internet for answers (except in specific cases where the user enters a URL directly or uses search plugins). It generates an answer from learned patterns rather than copying it verbatim from a source. That is why the model sometimes produces convincing but incorrect facts: it reconstructs them based on statistical probability, not by checking a database.
  2. Knowledge is limited in time: Since training is the process of "reading" a specific, fixed dataset, the model has no information about events that occurred after that phase ended, unless it is connected to external tools.

So when we use artificial intelligence, we are not accessing an omniscient machine, but an impressive, measurable and refined summary of human knowledge stored in digital form.

Source: This article is based on an analysis of Andrej Karpathy's technical lecture Deep Dive into LLMs like ChatGPT and is the first in a series exploring the deep architecture of language models. In the next installment we look at the phenomenon of tokenization: why models sometimes miscount letters, and how this affects their understanding of language.