Understanding the Foundations of Language Modeling

Building LLMs from First Principles

SuNaAI Lab

Technical Guide Series


Chapter 1: The Hidden Engine Behind LLMs

Everyone talks about the size of GPT-4 or Claude — trillions of parameters, petaflops of compute, massive data pipelines. But beneath all this scale lies a single truth: the secret to powerful language models is efficiency, not brute force.

Imagine standing in front of a skyscraper. You see the gleaming windows, the impressive height, the architectural marvel. But what truly holds it up? Not the glass facade, but the steel beams, the concrete foundations, the engineering principles that allow it to reach such heights without collapsing.

Large language models are similar. When we marvel at GPT-4's capabilities or Claude's reasoning, we're seeing the final result. But beneath the surface lies something more fundamental: a deep understanding of how to build intelligence efficiently.

The Series Overview

This comprehensive guide breaks down how large language models are actually built — not from hype, but from first principles. We'll start from the basics of tokenization and architecture, move through systems optimization and data scaling, and eventually reach evaluation and fine-tuning. Each chapter builds on the previous one, creating a complete picture of modern language modeling.

This first part is about understanding the foundational philosophy of language modeling — how efficiency governs everything: from how text is represented, to how GPUs are used, to how models are scaled and evaluated. By the end, you'll see that every design choice in LLMs is about doing more with less.

Chapter 2: The Core Philosophy — Efficiency Everywhere

Building a large language model is an optimization puzzle. You have limited compute, limited data, and limited memory. The question is not "How do we build a bigger model?" but rather: "How do we build the most intelligent model possible given fixed constraints?"

Let me tell you a story. Imagine you're a chef in a prestigious restaurant. You have:

  • 32 cooking stations (GPUs) — each capable of complex operations
  • 2 weeks to perfect your menu (training time)
  • Limited ingredients (data) — you can't just order everything
  • Limited storage (memory) — your pantry has constraints

How do you create the most exceptional dining experience under these constraints? This is exactly the challenge of language modeling. The brilliant insight is that efficiency — not raw power — is what separates good models from extraordinary ones.

The Three Constraints

1. Compute

Every GPU hour costs money. Every parallel operation consumes resources. The question: how do you maximize learning per compute unit?

2. Data

Data isn't free. Quality matters more than quantity. The question: how do you extract maximum signal from limited examples?

3. Memory

Model weights, activations, gradients all consume memory. The question: how do you fit bigger models in available space?

Every single design choice in LLMs — architecture, data, parallelism, optimization — is about how to use limited resources wisely. That's what makes this field so elegant: it's not engineering excess; it's engineering precision.

The Efficiency Principle

Efficiency = Intelligence per Resource

A model that achieves GPT-4's performance with half the parameters is more efficient. A training method that reaches the same loss in half the time is more efficient. An inference system that generates text twice as fast is more efficient.

Chapter 3: Tokenization — Teaching Machines to Read

Computers don't understand words. They understand numbers — specifically, tokens (integer IDs). Tokenization is how we bridge the gap between human language and machine computation.

Imagine teaching a child to read. You don't start with whole books — you start with letters and syllables. Then words, then sentences. In language modeling, tokenization serves a similar purpose: breaking down the complex structure of human language into digestible building blocks for the model.

The Tokenization Process

Tokenization Pipeline
Input:  "Language modeling is fun"

Step 1: Split into tokens
        ["Language", " modeling", " is", " fun"]

Step 2: Map to integer IDs
        [1034, 5721, 42, 98]

Step 3: Convert to embeddings
        Each ID → 768-dimensional vector
        [1034] → [0.23, -0.45, 0.12, ..., 0.89]
        [5721] → [-0.12, 0.67, -0.23, ..., -0.34]
        ...

Each token ID corresponds to a learned vector in a high-dimensional space (typically 768 or 4096 dimensions). During training, the model learns relationships between these embeddings — words close in meaning end up close in vector space. This is how the model understands that "cat" and "feline" are related.
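
To make the lookup concrete, here is a minimal PyTorch sketch of the ID-to-vector step. The vocabulary size, embedding dimension, and token IDs are illustrative assumptions, not values from any particular model:

Embedding Lookup Sketch (PyTorch)
import torch
import torch.nn as nn

# Hypothetical sizes, for illustration only.
vocab_size, d_model = 50_000, 768

# One learned vector per token ID, trained jointly with the rest of the model.
embedding = nn.Embedding(vocab_size, d_model)

token_ids = torch.tensor([[1034, 5721, 42, 98]])  # shape: (batch=1, seq_len=4)
vectors = embedding(token_ids)                    # shape: (1, 4, 768)
print(vectors.shape)                              # torch.Size([1, 4, 768])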

Byte Pair Encoding (BPE)

Modern tokenizers use techniques like Byte Pair Encoding (BPE) that merge frequent character pairs. This balances between too granular (character-level) and too large (word-level) tokenization.

Why BPE Works

  • Subword units: Handles out-of-vocabulary words by breaking them into known pieces
  • Frequency-based: Common words stay whole, rare words get split intelligently
  • Compact: Fewer tokens per sentence means less compute
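
To see how the merging works, here is a toy sketch of the BPE training loop: count adjacent symbol pairs over a tiny corpus and merge the most frequent pair. The corpus and number of merges are arbitrary; real tokenizers learn tens of thousands of merges over huge corpora.

Toy BPE Merge Sketch (Python)
from collections import Counter

# Toy corpus split into characters; "_" marks the end of a word.
words = [list("low_"), list("lower_"), list("lowest_"), list("newest_")]

def most_frequent_pair(words):
    pairs = Counter()
    for w in words:
        for a, b in zip(w, w[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0]

def apply_merge(words, pair):
    a, b = pair
    merged = []
    for w in words:
        out, i = [], 0
        while i < len(w):
            if i < len(w) - 1 and (w[i], w[i + 1]) == pair:
                out.append(a + b)   # fuse the pair into a single symbol
                i += 2
            else:
                out.append(w[i])
                i += 1
        merged.append(out)
    return merged

for step in range(3):  # learn three merges
    pair = most_frequent_pair(words)
    words = apply_merge(words, pair)
    print(f"merge {step + 1}: {pair} -> {''.join(pair)}")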

Efficiency Insight

Fewer tokens = fewer steps = less compute per sentence.

Efficient tokenization can save enormous amounts of compute at scale. A tokenizer that encodes the same text in 50,000 tokens instead of 100,000 halves the work per batch, effectively doubling throughput. This is why tokenization isn't just a preprocessing step — it's a critical optimization target.

Chapter 4: Transformer Architecture — The Brain of Modern LLMs

Before 2017, language models relied on RNNs and LSTMs — sequential architectures that processed text one word at a time. Then came the Transformer, with a simple yet revolutionary idea: "Attention is all you need."

The Transformer architecture, introduced in the landmark 2017 paper "Attention Is All You Need," changed everything. It wasn't just an improvement — it was a fundamental shift in how we build language models.

The Architecture Structure

Transformer Architecture
Input Embeddings
    ↓
Position Encoding
    ↓
┌─────────────────────┐
│ Multi-Head          │
│ Attention           │
└─────────────────────┘
    ↓
Layer Normalization
    ↓
┌─────────────────────┐
│ Feed-Forward        │
│ Network             │
└─────────────────────┘
    ↓
Layer Normalization
    ↓
    ... (Repeat N times)
    ↓
Output

Stack this structure dozens or hundreds of times — you get GPT, Llama, Claude, Gemini. The beauty is in its simplicity and scalability.
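
As a concrete sketch of one such block, here is a minimal pre-norm Transformer layer in PyTorch. It is illustrative rather than any specific model's implementation; real models add causal masking, dropout, rotary embeddings, and other details, and the sizes below are assumptions:

Minimal Transformer Block Sketch (PyTorch)
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """One attention + feed-forward block with residual connections (pre-norm variant)."""

    def __init__(self, d_model=768, n_heads=12, d_ff=3072):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x):
        # Attention sub-layer with a residual connection.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        # Feed-forward sub-layer with a residual connection.
        return x + self.ffn(self.norm2(x))

x = torch.randn(1, 16, 768)          # (batch, seq_len, d_model)
print(TransformerBlock()(x).shape)   # torch.Size([1, 16, 768])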

Understanding Attention

Attention allows the model to "look back" at all previous words and decide which ones are relevant. It's like giving the model selective memory and focus.

The Attention Analogy

Imagine reading a paragraph — your mind doesn't process every word equally. You selectively pay attention to important ones to understand meaning. When you read "The cat sat on the mat," your attention focuses on "cat," "sat," and "mat" as the key elements. That's exactly what the Attention mechanism does.
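
In code, the core of that mechanism is only a few lines. The sketch below shows single-head scaled dot-product attention, softmax(QKᵀ/√d)·V, where the queries, keys, and values are projections of the token embeddings; the causal mask reflects decoder-only models, and the tensor sizes are illustrative:

Scaled Dot-Product Attention Sketch (PyTorch)
import math
import torch

def attention(q, k, v, causal=True):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d)  # pairwise relevance scores
    if causal:
        # Each position may only attend to itself and earlier positions.
        mask = torch.triu(torch.ones_like(scores), diagonal=1).bool()
        scores = scores.masked_fill(mask, float("-inf"))
    weights = torch.softmax(scores, dim=-1)          # attention weights sum to 1
    return weights @ v                               # weighted mix of the values

seq_len, d = 6, 64
q = k = v = torch.randn(seq_len, d)
print(attention(q, k, v).shape)  # torch.Size([6, 64])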

The Parallelism Revolution

Unlike RNNs, Transformers process tokens in parallel — they see the entire sequence at once. This is what made large-scale pretraining possible. Parallelism = efficiency.

RNN/LSTM:     sequential processing → 100 tokens = 100 steps
Transformer:  parallel processing   → 100 tokens = 1 step

Modern Improvements

SwiGLU Activations

Instead of ReLU, modern models use SwiGLU (Swish-Gated Linear Unit), which provides smoother gradients and better performance.

RMSNorm Normalization

Root Mean Square Layer Normalization is more efficient than LayerNorm, reducing computational overhead.

Rotary Position Embeddings (RoPE)

Rotary embeddings encode position information more effectively, especially for longer sequences.
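
For the first two of these, here is a hedged PyTorch sketch. The hidden sizes are arbitrary, and real models differ in details such as the feed-forward width ratio:

SwiGLU and RMSNorm Sketch (PyTorch)
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Scale by the root-mean-square of activations (no mean subtraction, no bias)."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x / rms

class SwiGLU(nn.Module):
    """Feed-forward layer with a Swish-gated linear unit: W2(SiLU(W1 x) * W3 x)."""
    def __init__(self, d_model, d_hidden):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_hidden, bias=False)  # gate projection
        self.w3 = nn.Linear(d_model, d_hidden, bias=False)  # value projection
        self.w2 = nn.Linear(d_hidden, d_model, bias=False)  # output projection

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

x = torch.randn(2, 8, 512)
print(SwiGLU(512, 1376)(RMSNorm(512)(x)).shape)  # torch.Size([2, 8, 512])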

The Transformer's simplicity and scalability made it the perfect architecture for the "scaling era" of AI. Every modern LLM is built on this foundation.

Chapter 5: Systems — Where Efficiency Meets Hardware

A model is only as efficient as the hardware that runs it. GPUs are fast, but data movement is slow — so system design is all about minimizing communication and maximizing computation.

Let's talk about GPUs. These aren't just "faster CPUs" — they're specialized machines designed for parallel computation. Understanding how they work is crucial for understanding why certain model architectures and training strategies are more efficient.

GPU Architecture Basics

The GPU Kitchen Analogy

Imagine a restaurant kitchen. The chefs (GPU cores) are fast, but if they keep running to the storage room (DRAM) for ingredients, they waste time. The solution? Keep ingredients nearby — in fast local memory (cache). Modern GPU optimization is all about minimizing trips to "storage" and keeping data close to computation.

GPU Memory Hierarchy
┌─────────────────────────────────────┐
│  High Bandwidth Memory (HBM)        │
│  ~1.5TB/s bandwidth                 │
│  SLOW to access (but large)         │
└──────────────┬──────────────────────┘
               │
               ↓
┌─────────────────────────────────────┐
│  L2 Cache                           │
│  ~3TB/s bandwidth                   │
│  Medium speed                       │
└──────────────┬──────────────────────┘
               │
               ↓
┌─────────────────────────────────────┐
│  L1 Cache / Shared Memory           │
│  ~19TB/s bandwidth                  │
│  FAST (but small)                   │
└─────────────────────────────────────┘

Parallelism Types

When one GPU isn't enough, we use multiple — but how? Different parallelism strategies solve different bottlenecks.

Type                   Description                         Analogy
Data Parallelism       Split data across GPUs              Each chef cooks different dishes
Tensor Parallelism     Split model weights across GPUs     Each chef handles part of the same recipe
Pipeline Parallelism   Split model layers across GPUs      Assembly line of chefs
Sequence Parallelism   Split input sequence across GPUs    Divide sentences among chefs

The Golden Rule of Systems Optimization

Minimize communication, maximize computation.

Data movement between GPUs is orders of magnitude slower than on-chip computation. The best parallelization strategy is the one that keeps each GPU computing as much as possible and communicating as little as necessary.
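
A back-of-envelope sketch makes the rule tangible. All of the numbers below (model size, batch size, per-GPU FLOP rate, interconnect bandwidth) are assumptions chosen only to illustrate the comparison:

Compute vs. Communication Estimate (Python)
# Rough data-parallel estimate for one optimizer step (illustrative numbers only).
params = 7e9                  # 7B-parameter model
bytes_per_grad = 2            # bf16 gradients
flops_per_token = 6 * params  # ~6*N FLOPs per token (forward + backward rule of thumb)
tokens_per_step = 4e6         # global batch size in tokens
gpu_flops = 300e12            # assumed sustained FLOP/s per GPU
interconnect_bw = 200e9       # assumed effective all-reduce bandwidth, bytes/s
n_gpus = 64

compute_s = flops_per_token * tokens_per_step / (gpu_flops * n_gpus)
# An all-reduce moves roughly 2x the gradient bytes per GPU across the interconnect.
comm_s = 2 * params * bytes_per_grad / interconnect_bw

print(f"compute per step: ~{compute_s:.1f}s, gradient all-reduce: ~{comm_s:.2f}s")
# The goal: keep communication a small fraction of compute (or overlap it entirely).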

Chapter 6: Inference — When the Model Starts Talking

Training teaches the model language. Inference is when we actually use it — when it starts generating text.

There's a fundamental difference between training and inference. Training is about learning — processing millions of examples in parallel, updating weights, improving performance. Inference is about application — taking what you've learned and using it to generate one response at a time.

The Two Phases of Inference

Prefill Phase

The model reads the prompt — all at once. This is compute-bound (lots of math, but efficient).

Example: User sends "Explain quantum computing"
Model processes entire prompt in parallel

Decoding Phase

The model generates one token at a time. This is memory-bound (needs to store previous states).

Example: Model generates:
"Quantum" → "computing" → "uses" → "qubits"...

The Writing Analogy

It's like writing a sentence — you read what's already written (prefill), then think of the next word (decode). The challenge is that decoding happens sequentially, which makes it harder to parallelize than training.

KV Caching

During decoding, we reuse the Key/Value pairs from previous tokens instead of recomputing them. This drastically speeds up inference.

KV Caching Concept
Without KV Cache:
Token 1: Compute all attention → Generate Token 2
Token 2: Recompute all attention → Generate Token 3
... (slow, redundant)

With KV Cache:
Token 1: Compute attention, cache K/V → Generate Token 2
Token 2: Use cached K/V, only compute new → Generate Token 3
... (fast, efficient)
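
A minimal sketch of a decode loop using a cache. The `model` callable and its `past_key_values` format are hypothetical stand-ins, loosely modeled on the interfaces of common inference libraries:

KV-Cached Greedy Decoding Sketch (Python)
import torch

def generate(model, prompt_ids, max_new_tokens=32):
    """Greedy decoding with a KV cache: after prefill, each step feeds only the newest token."""
    past_kv = None
    input_ids = prompt_ids              # prefill: process the whole prompt at once
    generated = prompt_ids.tolist()

    for _ in range(max_new_tokens):
        logits, past_kv = model(input_ids, past_key_values=past_kv)
        next_id = int(torch.argmax(logits[-1]))  # most likely next token
        generated.append(next_id)
        # Decode: only the new token is fed in; K/V for earlier tokens come from the cache.
        input_ids = torch.tensor([next_id])
    return generated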

Speculative Decoding

Use a smaller "draft" model to propose several tokens ahead, then verify them with the main model in a single pass, speeding things up without losing accuracy.
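
A simplified sketch of the idea is below. It uses a greedy-acceptance rule for clarity (real speculative sampling uses a probabilistic accept/reject step that preserves the target model's distribution), and `draft_model` / `target_model` are hypothetical callables that return per-position logits:

Speculative Decoding Sketch (Python, simplified)
import torch

def speculative_step(target_model, draft_model, ids, k=4):
    """Draft k tokens cheaply, then keep the longest prefix the target model agrees with."""
    draft, ctx = [], ids[:]
    for _ in range(k):                                    # small model guesses k tokens
        nxt = int(torch.argmax(draft_model(ctx)[-1]))
        draft.append(nxt)
        ctx = ctx + [nxt]

    logits = target_model(ids + draft)                    # one big-model pass scores all drafts
    accepted = []
    for i, tok in enumerate(draft):
        target_choice = int(torch.argmax(logits[len(ids) + i - 1]))
        if target_choice != tok:
            accepted.append(target_choice)                # replace the first disagreement
            break
        accepted.append(tok)                              # agreement: keep the drafted token
    return ids + accepted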

Inference Optimization Summary

Inference optimization is where theory meets engineering — making the model not just smart, but usable in real time. Techniques like KV caching and speculative decoding are what make ChatGPT feel instant rather than sluggish.

Chapter 7: Scaling Laws — The Science of Bigger Models

Scaling laws answer one question: "If I have more compute, should I train a bigger model, or train longer on more data?"

For years, the field operated on intuition. "Bigger models are better." "More data helps." But in 2020, researchers discovered something remarkable: there are predictable mathematical relationships governing how models improve with scale.

The Key Papers

Kaplan et al. (2020)

Discovered that loss decreases predictably as a power-law function of model size, dataset size, and compute.

Hoffmann et al. (2022)

Found the optimal compute-data-parameters balance: for a given compute budget, there's an ideal model size and dataset size.

Compute-Optimal Training

For a given compute budget (C), there's an optimal balance between model size (N) and data (D). The widely cited rule of thumb from Hoffmann et al.'s results is roughly 20 training tokens per parameter:

Optimal Data Rule of Thumb
D* ≈ 20 × N

Where:
  D* = Optimal number of training tokens
  N  = Number of parameters

Example:
  1.4B parameter model → ~28B tokens optimal
  7B parameter model   → ~140B tokens optimal
  70B parameter model  → ~1.4T tokens optimal
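
A tiny sketch that applies this rule of thumb, together with the common approximation that training cost is about C ≈ 6·N·D FLOPs. Both are rough heuristics rather than exact laws:

Compute-Optimal Planning Sketch (Python)
def chinchilla_plan(n_params):
    """Rough compute-optimal plan: ~20 training tokens per parameter, C ≈ 6*N*D FLOPs."""
    tokens = 20 * n_params
    flops = 6 * n_params * tokens
    return tokens, flops

for n in (1.4e9, 7e9, 70e9):
    tokens, flops = chinchilla_plan(n)
    print(f"{n / 1e9:>5.1f}B params -> ~{tokens / 1e9:,.0f}B tokens, ~{flops:.1e} training FLOPs")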

The Balanced Diet Analogy

Think of it like a balanced diet — too much protein (parameters) and not enough vegetables (data), and your system becomes inefficient. Scaling laws tell us the "nutrition ratio" for models.

The Big Picture

Scaling laws are the physics of deep learning — they reveal predictable patterns in how intelligence grows with data and compute. This is why companies can plan massive training runs with confidence: the relationships are mathematical, not just empirical.

Chapter 8: Data Composition — The Model's Diet

The quality of a language model depends on what it reads. Models are reflections of the data they consume.

Imagine training a model only on Reddit comments. It would be snarky, casual, and maybe a bit unhinged. Train it only on academic papers? It would be formal, precise, but perhaps overly rigid. The point is: data diversity defines the model's personality.

Common Data Categories

Academic

  • arXiv papers
  • PubMed articles
  • Scholarly books

Internet

  • Wikipedia
  • StackExchange
  • Web pages

Prose

  • Books3
  • Fiction
  • Non-fiction

Code

  • GitHub
  • Documentation
  • Stack Overflow

Dialogue

  • Reddit
  • Subtitles
  • Forums

The Balanced Diet Analogy

Just like a human diet needs balance (proteins, carbs, vitamins), models need diverse data — too much Reddit and it becomes chaotic; too much arXiv and it becomes robotic. The art is in finding the right mix.

Key Takeaway

Data diversity defines the model's "personality." LLMs are reflections of the data they consume. Understanding this helps you understand why different models excel in different domains.

Chapter 9: Evaluation — How We Measure Intelligence

Evaluation is like school exams for AI — from vocabulary tests to PhD-level thesis defense.

How do you measure intelligence? For humans, we have IQ tests, standardized exams, job performance. For AI, it's more complex — but we've developed a hierarchy of evaluation that goes from basic fluency to complex reasoning.

The Evaluation Hierarchy

1. Perplexity
What it tests: Language fluency
Measures how "surprised" the model is by text. Lower = better language understanding.

2. Benchmarks
What it tests: Reasoning, knowledge
MMLU (general knowledge), GSM8K (math), HumanEval (coding). Standardized tests for models.

3. Instruction Following
What it tests: Obedience
AlpacaEval, WildBench. Can the model follow complex instructions accurately?

4. Test-time Compute
What it tests: Logical depth
Chain-of-Thought, ensembling. Does giving the model more "thinking time" help?

5. LM-as-a-Judge
What it tests: Quality scoring
GPT-4 evaluating GPT-3.5 outputs. Using stronger models to evaluate weaker ones.

6. Full Systems
What it tests: Real-world use
RAG systems, agents. How does the model perform in actual applications?

Each level builds on the previous one. You can't have good instruction following without good language fluency. You can't have effective real-world systems without solid reasoning capabilities.
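
To make the first level concrete: perplexity is the exponential of the average per-token negative log-likelihood. A minimal sketch follows, where the logits are random stand-ins for real model outputs:

Perplexity Sketch (PyTorch)
import torch
import torch.nn.functional as F

def perplexity(logits, target_ids):
    """exp(mean cross-entropy): how 'surprised' the model is by the true next tokens."""
    nll = F.cross_entropy(logits, target_ids)  # average negative log-likelihood per token
    return torch.exp(nll)

vocab_size, seq_len = 50_000, 128
logits = torch.randn(seq_len, vocab_size)           # stand-in for model predictions
targets = torch.randint(0, vocab_size, (seq_len,))  # stand-in for the true next tokens
print(float(perplexity(logits, targets)))           # huge for an untrained model; low for a good one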

Chapter 10: Data Curation — The True Foundation

All of this — scaling, training, evaluation — depends on one invisible process: data curation.

Data doesn't fall from the sky. It's scraped, cleaned, filtered, licensed, and structured. This invisible work is what separates successful models from failures.

The Data Pipeline

Data Curation Pipeline
Raw Sources
  ↓
┌─────────────────────┐
│  1. Collection      │
│  • Web scraping     │
│  • API access       │
│  • Datasets         │
└──────────┬──────────┘
           ↓
┌─────────────────────┐
│  2. Cleaning        │
│  • Remove HTML      │
│  • Extract text     │
│  • Fix encoding     │
└──────────┬──────────┘
           ↓
┌─────────────────────┐
│  3. Filtering       │
│  • Quality check    │
│  • Toxicity filter  │
│  • Deduplication    │
└──────────┬──────────┘
           ↓
┌─────────────────────┐
│  4. Licensing       │
│  • Fair use         │
│  • Permissions      │
│  • Attribution      │
└──────────┬──────────┘
           ↓
    Training Dataset

The Cooking Analogy

Training a model without good data is like cooking with spoiled ingredients — no matter how good your chef (model) is, the dish will fail. Data curation is the invisible work that ensures quality from the start.

Common Challenges

Format Diversity

HTML, PDFs, JSON, plain text — each requires different extraction methods.

Quality Control

How do you automatically detect high-quality text at scale?

Legal Issues

Fair use, copyright, licensing — the legal landscape is complex.

Scale

Processing trillions of tokens requires massive infrastructure.

Chapter 11: Data Processing — Turning Raw Web Data into Gold

Once we have the data, we need to process it — because raw web data is messy, repetitive, and unsafe.

Web data is a wasteland. It's full of ads, duplicates, broken tags, and harmful content. Processing transforms this chaos into clean, usable training data.

The Three Pillars of Data Processing

1. Transformation — From HTML/PDF to Text

Raw web pages aren't ready for models. They must be cleaned, structured, and converted to text.

  • Removing ads, navigation bars, and code snippets
  • Extracting readable paragraphs and preserving formatting
  • Rewriting malformed sections for consistency
Analogy: Converting a messy scanned newspaper into a clean eBook

2. Filtering — Keep Quality, Remove Harm

Not all internet text is useful or safe. We train classifiers that automatically score quality and detect problems.

  • Score text quality (is it coherent and informative?)
  • Detect toxicity, hate speech, or spam
  • Remove harmful or low-value documents
Analogy: A sieve that lets only gold dust (high-quality sentences) pass through

3. Deduplication — Don't Teach the Model to Memorize

This is one of the most important but under-discussed steps. Models shouldn't memorize — they should generalize.

  • Use Bloom Filters or MinHash algorithms to detect near-duplicates
  • Remove duplicates before training
  • Prevents overfitting and verbatim copying
Analogy: If a student reads the same textbook page 100 times, they'll memorize instead of learning broadly
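
A toy sketch of the MinHash idea from the list above. Real pipelines use optimized libraries plus locality-sensitive hashing so they never compare every pair of documents; the shingle size and number of hash functions here are arbitrary choices:

MinHash Near-Duplicate Sketch (Python)
import hashlib

def shingles(text, n=5):
    """Overlapping character n-grams form the document's feature set."""
    text = " ".join(text.lower().split())
    return {text[i:i + n] for i in range(max(1, len(text) - n + 1))}

def minhash_signature(features, num_hashes=64):
    """For each seeded hash function, keep the minimum hash over all features."""
    return [
        min(int(hashlib.md5(f"{seed}:{feat}".encode()).hexdigest(), 16) for feat in features)
        for seed in range(num_hashes)
    ]

def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots approximates the documents' set overlap."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = minhash_signature(shingles("The quick brown fox jumps over the lazy dog."))
b = minhash_signature(shingles("The quick brown fox jumped over the lazy dog!"))
print(estimated_jaccard(a, b))  # high value -> likely near-duplicates, keep only one copy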

The Efficiency Goal

Minimize perplexity given a token budget.

Train on the cleanest possible subset of data that gives maximum performance per token. Quality beats quantity — this is efficiency in action.

Chapter 12: Alignment — Making the Model Useful

Once data processing is done, we train the base model — a pure next-token predictor. But a base model isn't polite or safe. Alignment is what turns a raw model into a helpful assistant.

If pretraining is teaching a child every book in the library, alignment is teaching them how to behave and converse. A base model can complete text, but it can't follow instructions or refuse harmful tasks.

The Goals of Alignment

Follow Instructions

Respond meaningfully to prompts ("Summarize this", "Explain like I'm five")

Tune Style

Adjust tone, format, and coherence (formal, concise, friendly)

Ensure Safety

Refuse harmful or unethical queries ("How to hack...", etc.)

Phase 1: Supervised Fine-Tuning (SFT)

We start with a curated dataset of prompt–response pairs. These examples are human-annotated or generated from existing assistants.

SFT Example
[
  {
    "role": "system",
    "content": "You are a helpful assistant."
  },
  {
    "role": "user",
    "content": "What is 1 + 1?"
  },
  {
    "role": "assistant",
    "content": "The answer is 2."
  }
]

We fine-tune the model to maximize P(response | prompt). The model already knows the knowledge; we're showing it the format and tone to use.
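
A minimal sketch of that objective: compute the cross-entropy only over response tokens, masking the prompt so the model is graded on what the assistant says rather than on the user's words. The token IDs and logits below are toy stand-ins:

SFT Loss Masking Sketch (PyTorch)
import torch
import torch.nn.functional as F

IGNORE = -100  # labels with this value are excluded from the loss

# Toy example: 4 prompt tokens followed by 3 response tokens.
input_ids = torch.tensor([[11, 54, 9, 201, 7, 88, 13]])
labels = input_ids.clone()
labels[:, :4] = IGNORE                      # do not train on the prompt tokens

logits = torch.randn(1, 7, 50_000)          # stand-in for model outputs over a 50k vocab
# Shift so position t predicts token t+1, then average the loss over response tokens only.
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, 50_000),
    labels[:, 1:].reshape(-1),
    ignore_index=IGNORE,
)
print(float(loss))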

The Student Analogy

It's like teaching a student model answers for exam questions — they already know the knowledge; you're showing them the format and tone to use.

Chapter 13: Preference Learning — Learning What Humans Like

After SFT, the model can follow instructions — but not necessarily in the best way. Preference learning teaches the model human preferences.

Generate multiple candidate answers to the same prompt, and ask humans (or other models) which one is better. This data teaches the model human preferences — conciseness, clarity, helpfulness.

Preference Learning Example
Prompt: "What's the best way to train a language model?"

Response A: "Use a large dataset and train for a long time."
Response B: "Use a small dataset and train briefly."

Preference: A > B

The Student Analogy

If SFT teaches a student the correct answers, preference learning teaches them which answers teachers prefer most.

Algorithms for Learning from Feedback

1. PPO — Proximal Policy Optimization

This is what OpenAI used for ChatGPT's RLHF. You train a reward model that scores how good a response is. The model then generates text and adjusts itself via reinforcement learning to maximize that reward.

Analogy: A debate coach giving feedback after every answer — the model learns by trial and error, guided by rewards. But PPO is computationally heavy: it requires training a separate reward model and maintaining an additional value (critic) model during optimization.

2. DPO — Direct Preference Optimization

A newer, simpler algorithm. Instead of training a reward model, it directly compares preferred and dispreferred answers. The model is optimized so that it assigns higher probability to preferred responses.

Analogy: Training a student by simply saying "Answer A was better than B — remember that pattern." No separate "reward function" needed.
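
A sketch of the DPO objective itself. It needs the summed log-probabilities that the current policy and a frozen reference model assign to the chosen and rejected responses; the four values below are toy numbers, and beta is a strength hyperparameter:

DPO Loss Sketch (PyTorch)
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """-log sigmoid(beta * [(logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)])."""
    chosen_margin = beta * (policy_chosen - ref_chosen)        # implicit reward of preferred answer
    rejected_margin = beta * (policy_rejected - ref_rejected)  # implicit reward of dispreferred answer
    return -F.logsigmoid(chosen_margin - rejected_margin).mean()

loss = dpo_loss(
    policy_chosen=torch.tensor([-12.0]), policy_rejected=torch.tensor([-15.0]),
    ref_chosen=torch.tensor([-13.0]),    ref_rejected=torch.tensor([-14.0]),
)
print(float(loss))  # shrinks as the policy prefers the chosen response more than the reference does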

3. GRPO — Group Relative Policy Optimization

A recent advancement. It samples a group of responses for each prompt and scores every response relative to the group average, removing the need for a separate value function (simplifying PPO even further). This makes it more stable, efficient, and scalable for large datasets.

Analogy: If DPO compares two answers, GRPO compares multiple students at once and finds who's consistently best — more reliable and faster.

Chapter 14: Verifiers — Teaching Models to Check Themselves

Even with feedback learning, models still make factual or logical errors. That's where verifiers come in.

Verifiers are like proofreaders or referees — they don't write, they check. They ensure that model outputs are correct, safe, and appropriate.

Type                Description                             Example
Formal Verifiers    External tools that check correctness   Code compilers, math solvers
Learned Verifiers   Another model that scores outputs       LMs-as-judges (e.g., GPT-4 evaluating GPT-3.5)

Why Verifiers Matter

Models can generate plausible-sounding but incorrect information. Verifiers catch these errors before they reach users, improving reliability and trustworthiness.
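
As a small illustration of a formal verifier, the sketch below runs a model-generated Python function against unit tests in a subprocess. The candidate code and the tests are toy placeholders; a real pipeline would add sandboxing and resource limits:

Formal Verifier Sketch (Python)
import subprocess
import sys
import textwrap

def verify_code(candidate: str, tests: str, timeout=5) -> bool:
    """Return True only if the generated code passes the tests (an external, 'formal' check)."""
    program = textwrap.dedent(candidate) + "\n" + textwrap.dedent(tests)
    try:
        result = subprocess.run(
            [sys.executable, "-c", program],
            capture_output=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0

candidate = """
def add(a, b):
    return a + b
"""
tests = """
assert add(1, 1) == 2
assert add(-2, 5) == 3
"""
print(verify_code(candidate, tests))  # True -> keep the output; False -> reject or retry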

Why All This Comes Back to Efficiency

Notice how every step — data curation, cleaning, deduplication, SFT, DPO, PPO — comes back to one idea: doing more with less. Efficiently using data, compute, and human feedback is what separates average models from truly scalable systems like GPT-4 or Claude 3.

If pretraining builds the brain, alignment teaches the heart — and efficiency keeps both running in sync.

The future of language modeling isn't about building bigger models — it's about building smarter systems that extract maximum intelligence from every resource we have.

This guide covered the foundations. The next chapters will dive deeper into each topic, with practical examples and hands-on implementations.