# 04. The Art and Science of Data for LLMs

> ⚡ Compute Note: You can run this notebook on CPU.

Welcome to the fourth part of our series! So far, we have built and optimized a GPT model from scratch. However, a powerful model architecture is only half of the equation. The other, arguably more important, half is the **data** it's trained on.

The principle of "garbage in, garbage out" has never been more true than in the age of LLMs. The quality, diversity, and cleanliness of your training data directly determine your model's capabilities, its biases, and its failure modes. In this notebook, we will explore:

1.  **What large-scale datasets look like**: We'll load a subset of a massive web-scraped dataset.
2.  **Common data quality issues**: We'll inspect raw data to find boilerplate, code, and other artifacts.
3.  **Filtering and cleaning techniques**: We'll discuss and implement simple heuristics to improve data quality.
4.  **The impact of curation**: We'll compare our raw dataset to a highly filtered one to see the difference.

For this tutorial, we'll rely heavily on the 🤗 `datasets` library, which is the standard for accessing and processing massive datasets efficiently.

I highly recommend watching the [Stanford CS336 Data lectures](https://www.youtube.com/watch?v=WePxmeXU1xg&list=PLoROMvodv4rOY23Y0BoGoBGgQ1zmU_MT_&index=13).

### Setup

First, let's install the necessary libraries. We need `datasets` to download and process our data, and `matplotlib` for visualization.

In [4]:
%pip install datasets matplotlib

Note: you may need to restart the kernel to use updated packages.


In [2]:
import datasets
from datasets import load_dataset
import matplotlib.pyplot as plt
import re

# Set default figure size for plots
plt.rcParams['figure.figsize'] = (10, 6)

### 1. Exploring a Raw Web Dataset: C4

The **Colossal Clean Crawled Corpus (C4)** dataset was created by Google for training their T5 models. It is built from a Common Crawl scrape of the public internet and has undergone some basic cleaning (such as dropping pages that contain words from a blocklist, and deduplication). However, it's still considered relatively "raw" compared to more modern, heavily curated datasets.

Let's load a small part of the C4 dataset to see what it looks like. We'll use `streaming=True` to avoid downloading the entire dataset, which runs to hundreds of gigabytes for the English configuration alone!

In [None]:
# Load the C4 dataset in streaming mode
c4_dataset = load_dataset("allenai/c4", "en", streaming=True, split='train')

# Let's look at the first few examples
print("--- Raw C4 Dataset Examples ---")
for i, example in enumerate(iter(c4_dataset.take(5))):
    print(f"\n--- Example {i+1} ---")
    # Print the first 500 characters of the text
    print(example['text'][:500])

Resolving data files:   0%|          | 0/1024 [00:00<?, ?it/s]


--- Raw C4 Dataset Examples ---

--- Example 1 ---
Beginners BBQ Class Taking Place in Missoula!
Do you want to get better at making delicious BBQ? You will have the opportunity, put this on your calendar now. Thursday, September 22nd join World Class BBQ Champion, Tony Balay from Lonestar Smoke Rangers. He will be teaching a beginner level class for everyone who wants to get better with their culinary skills.
He will teach you everything you need to know to compete in a KCBS BBQ competition, including techniques, recipes, timelines, meat select

--- Example 2 ---
Discussion in 'Mac OS X Lion (10.7)' started by axboi87, Jan 20, 2012.
I've got a 500gb internal drive and a 240gb SSD.
When trying to restore using disk utility i'm given the error "Not enough space on disk ____ to restore"
But I shouldn't have to do that!!!
Any ideas or workarounds before resorting to the above?
Use Carbon Copy Cloner to copy one drive to the other. I've done this several times going from larger HDD to smal

### 2. Identifying Data Quality Issues

As you look through examples like those above, you will notice recurring problems (Example 2, for instance, opens with a forum thread header rather than prose):
- **Boilerplate Text**: Phrases like "log in," "terms of use," or cookie consent notices.
- **Code and Markup**: Snippets of JavaScript, HTML, or CSS that are not natural language.
- **Strange Formatting**: Excessive line breaks, weird characters, or garbled text.
- **Non-Prose Content**: Lists, tables, or other structured data that doesn't read like a book.

Training a model on this kind of data can teach it to generate undesirable content. Our goal is to filter the dataset to keep only high-quality, natural language prose.
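
The issues above can be made measurable with a few cheap per-document signals. The following is a minimal sketch, not part of the original notebook: `describe_document` and its signal names are hypothetical, and the particular markup characters and terminal punctuation marks are assumptions chosen for illustration.

```python
def describe_document(text: str) -> dict:
    """Return a few cheap signals that hint at common web-data issues."""
    lines = [line for line in text.splitlines() if line.strip()]
    n_chars = len(text)
    return {
        "n_chars": n_chars,
        "n_lines": len(lines),
        # Low values suggest code, markup, or heavy formatting.
        "alnum_ratio": sum(c.isalnum() for c in text) / n_chars if n_chars else 0.0,
        # Prose lines usually end in terminal punctuation; menus and tables do not.
        "punct_line_ratio": (
            sum(line.rstrip().endswith((".", "!", "?", '"')) for line in lines) / len(lines)
            if lines else 0.0
        ),
        # Characters that are common in markup and code but rare in prose.
        "markup_chars": sum(text.count(ch) for ch in "<>{};"),
    }


# Two toy documents (made up for this example).
samples = {
    "prose": "She opened the door. The room was empty.\nNobody had been there for years.",
    "markup": "<div class=\"nav\"><a href=\"/\">Home</a></div>\nfunction f() { return 1; }",
}
for name, text in samples.items():
    print(name, describe_document(text))
```

Printing these signals for a handful of streamed examples is a quick way to decide which thresholds are worth turning into filters.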

### 3. Data Filtering Techniques

Data filtering is a deep and complex field, but we can apply some simple yet powerful heuristics. Here are a few common ones:

1.  **Length Filtering**: Remove documents that are too short or too long.
2.  **Character Filtering**: Remove documents with a high percentage of non-alphanumeric characters.
3.  **Boilerplate Removal**: Remove documents containing common web boilerplate phrases (e.g., "JavaScript is disabled").
4.  **Repetition Removal**: Remove documents with highly repetitive lines or n-grams.
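
Of the four heuristics, repetition removal is the least obvious to implement, so here is a minimal sketch of one way it could look. The function names, the thresholds (`max_dup_lines`, `max_top_ngram`), and the `min_ngrams` guard are all hypothetical choices for illustration, not tuned values from any production pipeline.

```python
from collections import Counter


def duplicate_line_fraction(text: str) -> float:
    """Fraction of non-empty lines that repeat an earlier line in the same document."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        return 0.0
    counts = Counter(lines)
    # Every occurrence of a line beyond its first counts as a duplicate.
    duplicates = sum(n - 1 for n in counts.values())
    return duplicates / len(lines)


def top_ngram_fraction(text: str, n: int = 3, min_ngrams: int = 10) -> float:
    """Share of all word n-grams taken up by the single most common n-gram."""
    words = text.lower().split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    # Very short texts give meaningless ratios, so report no repetition for them.
    if len(ngrams) < min_ngrams:
        return 0.0
    most_common_count = Counter(ngrams).most_common(1)[0][1]
    return most_common_count / len(ngrams)


def is_repetitive(text: str, max_dup_lines: float = 0.3, max_top_ngram: float = 0.2) -> bool:
    """Flag a document if either repetition signal exceeds its (illustrative) threshold."""
    return (duplicate_line_fraction(text) > max_dup_lines
            or top_ngram_fraction(text) > max_top_ngram)


# Toy inputs: a repeated navigation menu versus a single sentence of prose.
menu = "Home\nAbout\nContact\nHome\nAbout\nContact\nHome\nAbout\nContact"
prose = "The quick brown fox jumps over the lazy dog while the cat sleeps nearby."
print(is_repetitive(menu), is_repetitive(prose))
```

A check like this could be added as one more rule inside a filter function, typically after the length filter so that very short documents are already gone.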

Let's create a simple filtering function that combines a few of these ideas.

In [10]:
def is_high_quality(example):
    """A simple heuristic-based filter for data quality."""
    text = example['text']
    
    # 1. Length filter
    if len(text) < 200 or len(text) > 100000:
        return False
    
    # 2. Boilerplate filter
    boilerplate_phrases = [
        "terms of use", "privacy policy", "cookie policy", 
        "subscribe to our newsletter", "enable javascript"
    ]
    text_lower = text.lower()  # lowercase once instead of once per phrase
    if any(phrase in text_lower for phrase in boilerplate_phrases):
        return False
        
    # 3. Character filter (reject documents with a high proportion of non-alphanumeric chars)
    # This can be a proxy for code or heavily formatted text.
    # Note: whitespace and punctuation both count as non-alphanumeric here.
    alphanumeric_chars = sum(c.isalnum() for c in text)
    if alphanumeric_chars / len(text) < 0.75:
        return False
        
    return True

# The .filter() method applies our function to each example
filtered_c4 = c4_dataset.filter(is_high_quality)

print("--- Filtered C4 Dataset Examples ---")
for i, example in enumerate(iter(filtered_c4.take(5))):
    print(f"\n--- Example {i+1} ---")
    print(example['text'][:500])

--- Filtered C4 Dataset Examples ---

--- Example 1 ---
Beginners BBQ Class Taking Place in Missoula!
Do you want to get better at making delicious BBQ? You will have the opportunity, put this on your calendar now. Thursday, September 22nd join World Class BBQ Champion, Tony Balay from Lonestar Smoke Rangers. He will be teaching a beginner level class for everyone who wants to get better with their culinary skills.
He will teach you everything you need to know to compete in a KCBS BBQ competition, including techniques, recipes, timelines, meat select

--- Example 2 ---
Discussion in 'Mac OS X Lion (10.7)' started by axboi87, Jan 20, 2012.
I've got a 500gb internal drive and a 240gb SSD.
When trying to restore using disk utility i'm given the error "Not enough space on disk ____ to restore"
But I shouldn't have to do that!!!
Any ideas or workarounds before resorting to the above?
Use Carbon Copy Cloner to copy one drive to the other. I've done this several times going from larger HDD to
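
Notice that the first two filtered examples are identical to the raw ones: both documents pass all three rules. To see what a filter actually removes, it helps to record why each document was rejected. The sketch below is one possible way to do that; `rejection_reason` is a hypothetical variant of the filter above with the same thresholds, and the toy documents are made up so the example runs without downloading anything.

```python
from collections import Counter
from typing import Optional

BOILERPLATE_PHRASES = (
    "terms of use", "privacy policy", "cookie policy",
    "subscribe to our newsletter", "enable javascript",
)


def rejection_reason(example: dict) -> Optional[str]:
    """Return None if the example passes, else the name of the first rule that rejects it."""
    text = example["text"]
    if len(text) < 200:
        return "too_short"
    if len(text) > 100_000:
        return "too_long"
    lowered = text.lower()
    if any(phrase in lowered for phrase in BOILERPLATE_PHRASES):
        return "boilerplate"
    if sum(c.isalnum() for c in text) / len(text) < 0.75:
        return "low_alnum_ratio"
    return None


# Toy documents standing in for a sample of the stream.
toy_docs = [
    {"text": "Short snippet."},
    {"text": "Please read our privacy policy before continuing. " * 10},
    {"text": "x = {'a': [1, 2], 'b': (3, 4)}; " * 20},
    {"text": "Sourdough needs time more than anything else " * 10},
]

# Count how many documents each rule rejects, plus how many survive.
reasons = Counter(rejection_reason(doc) or "kept" for doc in toy_docs)
print(reasons)
```

The same tally could be run over a slice of the real stream, for example `c4_dataset.take(1000)`, to check whether any single rule is discarding far more than expected. The character rule deserves particular attention: because spaces count as non-alphanumeric, ordinary English prose tends to score only a little above the 0.75 cutoff.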

While our simple filter helps, professional dataset creation involves much more sophisticated pipelines. Let's look at a dataset that has already undergone this process.

### 4. A Look at a Highly Curated Dataset: FineWeb

**FineWeb**, created by the Hugging Face team, is a great example of a state-of-the-art, highly filtered dataset. It starts from Common Crawl but applies a rigorous filtering and deduplication pipeline, resulting in over 15 trillion tokens of high-quality text.
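
Deduplication is the part of such a pipeline that our simple filter does not touch at all. Production pipelines generally rely on fuzzy matching (for example, MinHash) to catch near-duplicates; the sketch below swaps in the much simpler exact-match variant, hashing a normalized copy of each document. The `normalize` and `deduplicate` helpers are hypothetical names, and the normalization rules are an assumption for illustration.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variations hash identically."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def deduplicate(docs):
    """Yield each document whose normalized text has not been seen before."""
    seen = set()
    for doc in docs:
        digest = hashlib.sha256(normalize(doc["text"]).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield doc


# Toy documents: the first two differ only in case and whitespace.
docs = [
    {"text": "Hello   world.\nThis is a page."},
    {"text": "hello world. this is a page."},
    {"text": "A completely different page."},
]
unique_docs = list(deduplicate(docs))
print(len(unique_docs))
```

Because `deduplicate` is a generator, it can wrap a streamed dataset without loading it into memory, although the `seen` set still grows with the number of unique documents.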

Let's load a sample of FineWeb-Edu, a subset of FineWeb further filtered for educational content, and compare it to the raw C4 examples.

I also recommend reading the FineWeb-Edu dataset card to get an idea of how quality is assessed for production datasets.
Reference: https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu

In [12]:
# Load a sample of the FineWeb-Edu dataset
# We'll use a smaller 10B token sample for demonstration
fineweb_dataset = load_dataset("HuggingFaceFW/fineweb-edu", "sample-10BT", streaming=True, split='train')

print("--- FineWeb Dataset Examples ---")
for i, example in enumerate(iter(fineweb_dataset.take(5))):
    print(f"\n--- Example {i+1} ---")
    print(example['text'][:500])

Resolving data files:   0%|          | 0/2410 [00:00<?, ?it/s]

--- FineWeb Dataset Examples ---

--- Example 1 ---
The Independent Jane
For all the love, romance and scandal in Jane Austen’s books, what they are really about is freedom and independence. Independence of thought and the freedom to choose.
Elizabeth’s refusal of Mr. Collins offer of marriage showed an independence seldom seen in heroines of the day. Her refusal of Mr. Darcy while triggered by anger showed a level of independence that left him shocked and stunned.
The freedom she exhibited in finally accepting him in direct defiance of Lady Cath

--- Example 2 ---
Taking Play Seriously
By ROBIN MARANTZ HENIG
Published: February 17, 2008
On a drizzly Tuesday night in late January, 200 people came out to hear a psychiatrist talk rhapsodically about play -- not just the intense, joyous play of children, but play for all people, at all ages, at all times. (All species too; the lecture featured touching photos of a polar bear and a husky engaging playfully at a snowy outpost in norther

### Conclusion: Quality Over Quantity

Comparing the raw C4 examples with the FineWeb examples, the difference is clear. The FineWeb text is much cleaner, reads more like natural prose, and is free of the distracting artifacts common in raw web data.

However, extending these heuristics to a production pipeline is not straightforward. Many nuances go into tuning the filtering thresholds and other hyperparameters at that scale.
