{"id":1358,"date":"2026-01-15T11:09:40","date_gmt":"2026-01-15T11:09:40","guid":{"rendered":"https:\/\/brain.hr\/?p=1358"},"modified":"2026-01-27T12:15:23","modified_gmt":"2026-01-27T12:15:23","slug":"kako-nastaje-baza-znanja-kojom-se-sluzi-chatgpt","status":"publish","type":"post","link":"https:\/\/brain.hr\/en\/kako-nastaje-baza-znanja-kojom-se-sluzi-chatgpt\/","title":{"rendered":"How is the knowledge base used by ChatGPT created?"},"content":{"rendered":"<p>When you type a query into ChatGPT or some other language model, it\u2019s easy to get the impression that you\u2019re interacting with a system that\u2019s scouring the internet in real time to provide you with an answer. However, the truth is technologically much more precise and extremely important for understanding digital literacy in modern education.<\/p>\n\n\n\n<p>Andrej Karpathy, one of the world\u2019s leading experts on artificial intelligence, offers a mental model that fundamentally changes our perception of these tools. He compares the process of creating an AI model to systematically writing and reading a textbook. The first phase of this process is called \u201cpre-training,\u201d and its scope is surprisingly \u201cfinite,\u201d meaning that it involves a precisely defined, measurable amount of data, rather than an infinite number of sources.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Choosing material for a \u201cdigital textbook\u201d<\/strong><\/h3>\n\n\n\n<p>It all starts with organizations like <em>Common Crawl<\/em>, which have been archiving billions of web pages for years. By 2024, about 2.7 billion pages had been indexed. Although it\u2019s a huge amount of raw material, it\u2019s in its original form unprocessed. 
In addition to the desired text, it also contains redundant elements of the original digital structure (such as navigation menus, advertising blocks, and program code) that do not contribute to learning.<\/p>\n\n\n\n<p>In order to shape this vast amount of information into systematic knowledge, the data undergoes a process of careful selection and purification:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>URL filtering: Web pages from \u201cblacklists\u201d that contain malware, unwanted content, or hate speech are removed.<\/li>\n\n\n\n<li>Text extraction: Algorithms strip away the HTML markup to isolate pure text \u2013 sentences and paragraphs.<\/li>\n\n\n\n<li>Language filtering: Most of today\u2019s models are optimized for English. Some datasets automatically give priority to pages with a high percentage of English text, which explains why current models perform best in that language.<\/li>\n\n\n\n<li>Personal data removal: Sensitive personally identifiable information (PII), such as addresses, is detected and removed.<\/li>\n<\/ol>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>All the \u201cknowledge\u201d fits in a backpack<\/strong><\/h3>\n\n\n\n<p>What remains after this precise selection process is a top-quality dataset (like the <em>FineWeb dataset<\/em>). Karpathy highlights a key insight here that demystifies the notion of infinity:<\/p>\n\n\n\n<p><em>\u201cEven though the internet is vast, we\u2019re dealing exclusively with text that we filter. We end up with about 44 terabytes of data. You could almost fit that on a large hard drive these days that you could hold in your hand.\u201d<\/em><\/p>\n\n\n\n<p>Karpathy uses a powerful analogy: an AI model is like a highly compressed archive of the internet. Once training is complete, the model no longer has direct access to that data, but has \u201ccompressed\u201d its knowledge into its parameters. 
Just as a digital photograph loses some of its original detail to take up less space, an AI model doesn\u2019t memorize text verbatim, but rather memorizes statistical relationships and patterns between pieces of information.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>What does this mean for teachers and the education system?<\/strong><\/h3>\n\n\n\n<p>Understanding this process is essential for using these tools correctly in the classroom:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>The model doesn\u2019t \u201cGoogle\u201d: It\u2019s important to emphasize that a language model doesn\u2019t search the internet for answers (except in specific cases where the user enters a URL directly into it or uses search plugins). It generates an answer according to learned patterns, rather than copying it verbatim from a source. That is why the model sometimes generates convincing but incorrect facts \u2013 it reconstructs them based on statistical probability, not by checking a database.<\/li>\n\n\n\n<li>Knowledge is limited in time: Since training is the process of \u201creading\u201d a specific, fixed set of data, the model does not possess information about events that occurred after that phase ended, unless it is connected to external tools.<\/li>\n<\/ol>\n\n\n\n<p>So when we use artificial intelligence, we are not accessing an omniscient machine, but an impressive, measurable, and refined summary of human knowledge stored in digital form.<\/p>\n\n\n\n<p><strong>Source:<\/strong> This article is based on an analysis of Andrej Karpathy\u2019s technical lecture <a href=\"https:\/\/www.youtube.com\/watch?v=7xTGNNLPyMI\">Deep Dive into LLMs like ChatGPT<\/a> and is the first in a series of articles in which we will explore the deep architecture of language models. 
In the next installment, we will look at the phenomenon of tokenization and find out why models sometimes make mistakes when counting letters and how this affects their understanding of language.<\/p>","protected":false},"excerpt":{"rendered":"<p>When you type a query into ChatGPT or some other language model, it\u2019s easy to get the impression that you\u2019re interacting with a system that\u2019s scouring the internet in real time to provide you with an answer. However, the truth is technologically much more precise and extremely important for understanding digital literacy in modern education. Andrej Karpathy, one of the world\u2019s leading experts on artificial intelligence, [&hellip;]<\/p>\n","protected":false},"author":8,"featured_media":1359,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"_et_pb_use_builder":"off","_et_pb_old_content":"","_et_gb_content_width":"","footnotes":""},"categories":[14],"tags":[],"class_list":["post-1358","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-radovi"],"acf":{"radovi_source_url":"","radovi_button_label":"Pro\u010ditajte izvorni 
rad"},"_links":{"self":[{"href":"https:\/\/brain.hr\/en\/wp-json\/wp\/v2\/posts\/1358","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/brain.hr\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/brain.hr\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/brain.hr\/en\/wp-json\/wp\/v2\/users\/8"}],"replies":[{"embeddable":true,"href":"https:\/\/brain.hr\/en\/wp-json\/wp\/v2\/comments?post=1358"}],"version-history":[{"count":3,"href":"https:\/\/brain.hr\/en\/wp-json\/wp\/v2\/posts\/1358\/revisions"}],"predecessor-version":[{"id":1465,"href":"https:\/\/brain.hr\/en\/wp-json\/wp\/v2\/posts\/1358\/revisions\/1465"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/brain.hr\/en\/wp-json\/wp\/v2\/media\/1359"}],"wp:attachment":[{"href":"https:\/\/brain.hr\/en\/wp-json\/wp\/v2\/media?parent=1358"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/brain.hr\/en\/wp-json\/wp\/v2\/categories?post=1358"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/brain.hr\/en\/wp-json\/wp\/v2\/tags?post=1358"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}