LLMs 101
“AI is the science and engineering of making intelligent machines.” -John McCarthy, the Stanford computer science professor who coined the term “artificial intelligence” in 1956

In November 2022, engineers at OpenAI were agonizing over a decision to release a new chatbot.
They hoped the release would create a flywheel: more people would use their AI, which would mean more user feedback, which would mean more rapid improvement to the model, which would attract even more users. At the time, however, the company’s executives were worried the chatbot wasn’t good enough for public release. The AI in question was ChatGPT.
Within two months of launch, it was the fastest-growing consumer technology in history, with some 30 million users and a reported $10 billion investment from Microsoft. And depending on your vantage, it either felt like an overnight revolution or a decades-long evolution.
To much of the public, the tool’s capabilities were so impressive they seemed almost like sorcery. But for AI researchers, ChatGPT was just the latest iteration of a research field known as natural language processing, or NLP. If AI is, broadly, the science of making intelligent machines, then NLP is the discipline that tries to teach the intelligent use of language to those machines. Since the field of AI emerged around 1950,[1] NLP has been a mostly underappreciated and underfunded field of inquiry, with only a handful of researchers in academic and big tech research labs who believed we could teach computers to read and write at human levels.[2] NLP is underfunded no more.
Like all AI, a language model has three components: a model, compute to power the model, and the data the model learns from. Technological advances around each of these three components—GPUs (compute), transformer architecture (models), and the internet (data)—have come together to make ChatGPT and the current wave of LLMs and AI applications possible.
In this primer, we’ll walk through each of these breakthroughs as we explain how language models work. More importantly, we’ll explain why quality data—specifically human-generated content like articles, podcasts, videos, books, and images—will be the difference between success and failure in an AI-enabled world.
The Compute Breakthrough: Graphics Processing Units
Compute is processing power.[3] The more processing power, the bigger the model can be and the more data it can process. Whether in the cloud or on a local device, compute is essentially a cluster of microchips that performs the basic computing processes that make everything else work.
The breakthrough to more powerful chips, and therefore greater compute, traces back to the video game boom of the 1990s.[4] At the time, mainstream chips were serial processors, which could execute only one task at a time. These serial processors steadily improved—my 10-year-old mind was blown when I upgraded from the original 8-bit Nintendo to a 16-bit Sega. But to get from Street Fighter to The Last of Us, developers needed a chip capable of splitting up and running multiple, complex processing tasks at the same time. A startup called Nvidia[5] met that need by designing a new type of chip—the GPU.
These GPU chips made it possible to process a lot more data, a lot more efficiently. And as interest and investment in language models have exploded, so has demand for GPUs. That demand has led to a global chip shortage and made compute perhaps the biggest cost and bottleneck to advancing AI models and adoption. (The dearth of GPUs even drove some desperate engineers to repurpose old video game chips as a substitute.)
But that shortage is easing as more public and private funding goes into chip manufacturing. Nvidia is ramping up production and, at its GTC conference in March 2024, announced a new generation of AI chips, Blackwell, said to be up to 30 times more powerful than the current H100s.
With more compute, the next bottleneck to AI is likely to be high-quality training data. However, let’s not get ahead of ourselves. First, we need to talk about models.
The Model Breakthrough: Transformer Architecture
Crack open any AI model and, at its core, it’s statistics and probability. Models learn statistical patterns from a given dataset and then apply those lessons to any new data they encounter. Doing that with any kind of efficiency, though, required an innovation: a deep learning architecture known as the transformer.
The transformer architectures[6] that are the basis of LLMs like GPT, Claude, and Gemini were first described in the influential 2017 paper “Attention Is All You Need.” The paper proposed an architecture that could take advantage of GPUs’ ability to process information in parallel and effectively predict the next word. With this architecture and sufficient compute, a language model could train on huge amounts of text, identify relationships and patterns, and then use those relationships and patterns to predict the next word and generate responses to prompts.
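To make “attention” a little less abstract, here is a minimal sketch of the scaled dot-product attention operation at the heart of the transformer, written in NumPy. The token vectors below are random stand-ins, not weights or embeddings from any real model.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compare each token's query against every token's key, then use the
    resulting weights to blend the value vectors."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                           # token-to-token similarity scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax -> attention weights
    return weights @ V                                        # weighted blend of value vectors

# Toy example: 4 tokens, each represented by an 8-dimensional vector (self-attention).
rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 8))
updated = scaled_dot_product_attention(tokens, tokens, tokens)
print(updated.shape)  # (4, 8): one context-aware vector per token
```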
How exactly LLMs do this is incredibly complex. Fortunately, it’s not important to be deeply versed in things like model weights, parameters, backpropagation, and forward passes to effectively use AI, much like you don’t need to understand the physics behind an internal combustion engine to drive a car.
It is, however, critical to understand three core concepts of what transformer-based language models do: 1) convert language into math, specifically vector math, 2) predict the next word, and 3) improve predictably with more data and bigger models, a principle known as scaling laws.
1) Convert language into math
Language models convert words—or “tokens,” which are often words but can be parts or sequences of words—into vectors[7] through a process known as word embedding (word2vec is one well-known technique). A vector can be represented as a set of numbers that tell you the position, size (or “magnitude,” in mathematical terms), and direction of the vector. This essentially converts words, a language machines don’t understand, into math, a language that they do. This mathematical representation allows models to calculate relationships between vectors and use those calculations to predict the next word.
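As a rough illustration, here is what that looks like with made-up three-dimensional vectors (real embeddings run to hundreds or thousands of dimensions). Cosine similarity between two vectors is one common way to measure how related two words are.

```python
import numpy as np

# Made-up 3-dimensional embeddings for illustration only.
embeddings = {
    "king":  np.array([0.80, 0.65, 0.10]),
    "queen": np.array([0.78, 0.70, 0.15]),
    "apple": np.array([0.10, 0.20, 0.90]),
}

def cosine_similarity(a, b):
    """Close to 1.0 means the vectors point in nearly the same direction."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(embeddings["king"], embeddings["queen"]))  # high: related words
print(cosine_similarity(embeddings["king"], embeddings["apple"]))  # lower: unrelated words
```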
For a deeper explanation of how vectors and language models work, check out this fantastic Ars Technica explainer.
2) Predict the next word
Whenever the model is prompted, it responds by predicting the next word repeatedly. For example, if you train a model on a bunch of lullabies and then ask the AI to complete the line, “twinkle, twinkle, little ___,” it will analyze the patterns across the lullabies and most likely predict “star.”
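Here is a toy version of that idea: a next-word predictor built from simple bigram counts over a couple of lullaby lines. Real LLMs learn vastly richer patterns with billions of parameters, but the basic move, scoring likely next words and picking one, is the same.

```python
from collections import Counter, defaultdict

# A tiny "training set" standing in for a corpus of lullabies.
corpus = (
    "twinkle twinkle little star how I wonder what you are "
    "twinkle twinkle little star up above the world so high"
)

# Count which word follows which (bigram counts).
counts = defaultdict(Counter)
words = corpus.lower().split()
for current, nxt in zip(words, words[1:]):
    counts[current][nxt] += 1

def predict_next(word):
    """Return the word seen most often after `word` in the training text."""
    return counts[word].most_common(1)[0][0]

print(predict_next("little"))  # -> "star"
```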
That prediction isn’t guaranteed because AI is not like traditional software. So, what’s the difference between AI and traditional software?
Traditional software is deterministic: given the same input, it always produces the same output. LLMs, however, are probabilistic, driven by statistics. The answers a model generates and the words it predicts change, even if the prompt remains the same. Think of it as rolling 10 different dice 10 times. The dice and the action of rolling are the same each time, but the combination of numbers invariably changes.
Like a fancier version of a bell curve, an LLM’s potential answers fall on a probability curve. The exact shape of that curve is controlled by a model’s weights and parameters. For instance, a parameter known as “temperature” controls how random the outputs are. Temperature was initially a value between 0 and 1, but the scale has since been extended to run from 0 to 2. At 0, there is no randomness to the next word that the model picks, making responses more deterministic and uniform, but also less creative and interesting. At 2, the model takes more risks picking the next word, resulting in more random outputs, more creativity, and more hallucinations.
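Here is a small sketch of how temperature reshapes that probability curve before the model picks a word. The candidate words and raw scores are invented for illustration, not taken from any real model.

```python
import numpy as np

def sample_next_word(logits, words, temperature, rng):
    """Turn raw model scores (logits) into probabilities and sample one word.
    Lower temperature sharpens the distribution; higher temperature flattens it."""
    if temperature == 0:                      # temperature 0: always take the top-scoring word
        return words[int(np.argmax(logits))]
    scaled = np.array(logits) / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()                      # softmax over the temperature-scaled scores
    idx = rng.choice(len(words), p=probs)
    return words[idx]

words = ["star", "bat", "light", "car"]       # invented candidates for "twinkle, twinkle, little ___"
logits = [4.0, 1.5, 1.0, 0.2]                 # invented raw scores
rng = np.random.default_rng(42)
print([sample_next_word(logits, words, 0.0, rng) for _ in range(5)])  # always "star"
print([sample_next_word(logits, words, 2.0, rng) for _ in range(5)])  # more variety
```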
3) Scaling Laws
Initially, transformer-based models didn’t seem all that impressive, except to the researchers who understood the technology. As I once heard Dario Amodei, co-founder and CEO of Anthropic and previously VP of Research at OpenAI, explain to a room full of AI builders:
When we put out GPT-2, some of the stuff that was considered most impressive at the time was, “Oh, my God. You give these five examples of English to French translation. Just offer it straight into the language model. Then you put in a sixth sentence in English and it actually translates into French. It actually understands the pattern.” That was crazy to us, even though the translation was terrible. It was almost worse than if you were to just take a dictionary and substitute word for word.
Our view is that this is the beginning of something amazing because there’s no limit and you can continue to scale it up. There’s no reason why the patterns we’ve seen before won’t continue to hold. The objective of predicting the next word is so rich and there’s so much you can push against that it just absolutely has to work. Some people looked at it and were like, “You made a bot that translates really badly.”
What Amodei and researchers saw in GPT-1 and GPT-2, and the rest of us discovered with ChatGPT, was the power of scaling laws. These laws essentially say that performance improves predictably as you give a model more parameters, more compute, and…more data.
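As a rough illustration of what a scaling law looks like, the snippet below computes loss from a simple power law in model size. The constants are made up for illustration, not fitted values from any published paper; the point is just that loss falls smoothly and predictably as the parameter count grows.

```python
# Illustrative power-law scaling curve: loss ~ (N_C / N) ** ALPHA.
# The constants below are invented for illustration, not real fitted values.
N_C = 1e13     # hypothetical "critical" parameter count
ALPHA = 0.08   # hypothetical scaling exponent

def loss(num_parameters):
    """Hypothetical training loss for a model with `num_parameters` parameters."""
    return (N_C / num_parameters) ** ALPHA

for n in (1e8, 1e9, 1e10, 1e11):
    print(f"{n:.0e} parameters -> loss {loss(n):.2f}")  # loss shrinks as the model grows
```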
The Data Breakthrough: The Internet
Even with the most sophisticated deep learning architecture and the most powerful processing available, a language model is only as good as the data—the words—that it is trained on.
ChatGPT, for better or worse, was trained on the world’s largest dataset: the internet. To keep improving, especially as more models come to market, the biggest LLMs—OpenAI’s GPT, Anthropic’s Claude, Google’s Gemini—are ravenous for data. What’s more, if AI scaling laws hold, the demand for data—specifically, “unstructured data”—will only increase.
Structured data is what we typically picture when someone says “data”—it’s the rows and columns in a spreadsheet. Unstructured data is data without a clear and consistent format. It’s Slack messages, Zoom recordings, emails, GDocs, and images. (There is also semi-structured data, but we won’t get into that here.) Before LLMs, unstructured data was too difficult and expensive for most companies to bother capturing and storing.
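To make the distinction concrete, here is one hypothetical record expressed both ways:

```python
# Structured: the same facts as rows and columns (here, a list of dicts).
structured = [
    {"customer_id": 1042, "plan": "pro", "renewal_date": "2024-09-01"},
]

# Unstructured: the same information buried in free-form text, like a support email.
unstructured = (
    "Hi team, customer 1042 called about their Pro plan. They want to confirm "
    "the renewal goes through on September 1st and asked about adding two seats."
)
```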

To put numbers to it: today, unstructured data accounts for more than 90% of all the data that companies generate, but less than 2% of it is captured and stored. The amount they are generating (and capturing and storing) is rapidly increasing. But just because companies have the data doesn’t mean they are willing to let models train on it.
This leaves LLMs with two big problems: 1) they are running out of data, and 2) the data they have is of very mixed quality.
LLMs need more data. Data is important enough that LLM builders will go to great lengths to get it—with or without explicit permission.[8] But the publicly available data is running out, possibly as soon as 2026. How these LLMs will acquire the content they crave is an open question, and much of the answer will be determined in legislatures and courtrooms. (We cover these issues in more detail in an upcoming post.)
LLMs need better data. As any human who has ever logged on to the internet knows, it’s a mixed bag that contains both the totality of humankind’s collective knowledge and an equal or greater amount of junk—expired eBay listings, controversial subreddits, derelict MySpace pages, and more nefarious sources of misinformation, overt and covert bias, and discrimination.
As LLM hallucinations[9] frequently remind us, even the most powerful AI models are subject to the old rule—“garbage in, garbage out.” If a model is trained on low-quality data, it will be more likely to produce low-quality results. Put another way, if you train a model on the entire internet, you can weight the model to prefer Wikipedia over 4chan, but the trolling is still somewhere in the model’s probability curve.
A flood of talent and funding has led to new fine-tuning techniques and retrieval-augmented generation (RAG), which make it possible to customize AI and improve performance and accuracy for a specific use case. It is more technical than we’ll get into here, but we like this explanation from Snorkel AI that compares fine-tuning to a doctor updating their expertise and RAG to checking a patient’s chart. Both techniques can be, and increasingly are, used together to optimize performance. These techniques use content in different ways, but they all require content to teach the AI what good looks like.
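To give a flavor of the retrieval step in RAG, here is a minimal sketch that reuses the vector-similarity idea from earlier. The embed_text function is a crude stand-in for a real embedding model, and the documents and question are invented; in practice, the assembled prompt would be sent to the LLM of your choice.

```python
import numpy as np

def embed_text(text):
    """Stand-in for a real embedding model: a crude bag-of-characters vector."""
    vec = np.zeros(26)
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1
    return vec / (np.linalg.norm(vec) or 1.0)

def retrieve(question, documents, k=2):
    """Return the k documents most similar to the question."""
    q = embed_text(question)
    scored = sorted(documents, key=lambda d: float(q @ embed_text(d)), reverse=True)
    return scored[:k]

documents = [
    "Our refund policy allows returns within 30 days of purchase.",
    "The API rate limit is 100 requests per minute per key.",
    "Support hours are 9am to 5pm Eastern, Monday through Friday.",
]
question = "How long do customers have to return a product?"
context = "\n".join(retrieve(question, documents))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
print(prompt)  # this grounded prompt is what the language model would actually see
```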
AI is intelligent, but it isn’t omniscient. If you can’t define good, you can never train the AI to get there. And defining good is really just telling the AI what good content—articles, videos, podcasts, and graphics—to focus on. What defines good content hasn’t changed: it’s content that knows its audience, backs up claims with authoritative sources, has a voice and style, uses the medium to support the message, and is regularly reviewed and updated. Who creates good content hasn’t changed either: it’s the best human creatives, not bots.
Footnotes:
1. Some people date the birth of AI as a field to the mid-1800s and the work of Ada Lovelace; others to 1950, when Alan Turing published his foundational paper “Computing Machinery and Intelligence”; still others to the 1956 Dartmouth summer conference, where the term “artificial intelligence” was coined.
2. By “human levels” we mean that people generally view ChatGPT as about as good as the average human.
3. For AI, this processing power is often measured in floating point operations, shortened to FLOPs.
4. Interesting note: games are often where cutting-edge infrastructure is built. First, it seems, we build it for World of Warcraft, Elder Scrolls, and Zelda; then we sell it to the Fortune 500.
5. Nvidia launched its original GPU in 1999, the same year it went public with a stock priced at $12 per share.
6. For simplicity, we focus this primer on transformer architectures. There are a number of other types of AI models in the current wave, including diffusion architectures, which apply similar principles to images and video rather than text.
7. For a better technical explanation of vectors: https://mathinsight.org/vector_introduction
8. While OpenAI does not disclose exactly what comprises its training dataset, it generally describes it as “all publicly available data on the internet.” In an investigative piece in April 2024, the New York Times reported that OpenAI built a speech recognition tool to help transcribe YouTube videos to serve as training data.
9. A hallucination is when a model presents invented information as fact.
Written by N2+AI’s Das Rush, Madelyn Goodman, and Tessa Stuart
Edited by N2 CEO Joe Flood and COO Chris Edmonds
Feature image prompt: Create an illustration that visually represents the three core components of a language model: the model (symbolized by a neural network or transformer architecture), compute power (symbolized by GPUs or a data center), and data (symbolized by an interconnected web, or icons representing articles, podcasts, videos, books, and images). The illustration should convey the synergy between these elements, highlighting their role in the development of modern NLP and AI applications. Use a futuristic and cohesive design to emphasize the cutting-edge nature of these technological advancements. Do not include words in the image.
Like this post? Check out other pieces in our content series:
- N2+AI builds on what we’ve learned and tailors it for what’s next
- Our take on the importance of writers to AI
Interested in working with us? We are currently looking for beta customers for our AI content services. Reach out to ai@n2comms.com.