import requests
import yaml
url = "https://huggingface.co/datasets/roneneldan/TinyStories/raw/main/Evaluation%20prompts.yaml"
response = requests.get(url)
data = yaml.safe_load(response.text)
len(data)
44
Vishal Bakshi
April 27, 2025
In this notebook I’m going to generate story completions using the TinyStories 1M, 8M and 28M models. The actual HF model sizes for these models are 3.7M, 19.7M and 52M, respectively. Since I’m training 5M, 25M, 60M and 125M models, these three TinyStories models will serve as proxies for my first three sizes, and I expect my 125M model to generate stories that receive higher scores than my 60M model (how much higher is TBD).
Ronen Eldan, one of the TinyStories paper authors, stated in this HF model card discussion:
we used temp=0, beams=5
So I’ll be using those two settings during inference.
Here are some key architectural details for my initial quick-and-dirty models:
Model Name | Hidden Dim | Intermediate Dim | Number of Layers | Number of Params |
---|---|---|---|---|
5M | 64 | 256 | 13 | 4_949_696 |
25M | 256 | 1024 | 8 | 24_776_960 |
60M | 512 | 2048 | 6 | 57_940_480 |
125M | 768 | 3072 | 8 | 124_662_528 |
Here are my initial scoring categories, based on type of language model capability:
Referencing Figure 4 in the TinyStories paper, I would expect my models to achieve LLM Judge scores close to the following (TSL = TinyScale Lab):
Similar to | Hidden Dim | Num Layers | Eval Loss | Creativity | Grammar | Consistency | Plot |
---|---|---|---|---|---|---|---|
TSL-5M | 64 | 12 | 2.02 | 4.84 | 6.19 | 4.75 | 4.39 |
TSL-25M | 256 | 8 | 1.38 | 6.54 | 7.72 | 8.02 | 7.23 |
TSL-60M | 512 | Average of 4 and 8 scores | 1.23 | 6.8 | 8.35 | 8.7 | 7.31 |
TSL-125M | 768 | 8 | 1.18 | 7.02 | 8.62 | 9.34 | 7.34 |
Mapping the Figure 4 scores to the official 1M, 8M and 28M models directly:
TinyStories | Hidden Dim | Num Layers | Eval Loss | Creativity | Grammar | Consistency | Plot |
---|---|---|---|---|---|---|---|
1M | 64 | 8 | 2.08 | 4.68 | 6.14 | 4.45 | 4.40 |
8M | 256 | 8 | 1.38 | 6.54 | 7.72 | 8.02 | 7.23 |
28M | 512 | 8 | 1.20 | 6.85 | 8.34 | 8.95 | 7.26 |
Two of the scoring categories I’m using are not assessed quantitatively in the TinyStories paper: Factual Knowledge and Reasoning. If my LLM Judge scores match Figure 4 for the other four categories and match my manual evaluations for all six categories, I should be able to expect the LLM Judge to assess these two categories correctly as well.
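One way to make "match Figure 4" concrete is to compute the per-category gap between the LLM Judge scores and the reference values in the table above. This is only a minimal sketch, assuming scores are kept in plain dicts. The function names, the 0.5 tolerance, and the placeholder judge scores are hypothetical choices for illustration and do not come from the paper.

```python
# Reference scores for the official 1M model, copied from the table above.
REFERENCE_1M = {"Creativity": 4.68, "Grammar": 6.14, "Consistency": 4.45, "Plot": 4.40}

def score_gaps(judge, reference):
    """Return judge-minus-reference for every category in the reference."""
    return {cat: round(judge[cat] - ref, 2) for cat, ref in reference.items()}

def within_tolerance(judge, reference, tol=0.5):
    """True if every category is within `tol` points (tol is an arbitrary choice)."""
    return all(abs(gap) <= tol for gap in score_gaps(judge, reference).values())

# Placeholder: swap in real LLM Judge output here. Copying the reference
# only exercises the code path and yields zero gaps.
judge_scores = dict(REFERENCE_1M)
gaps = score_gaps(judge_scores, REFERENCE_1M)
print(gaps)
print(within_tolerance(judge_scores, REFERENCE_1M))
```

The same check could be repeated against the 8M and 28M rows once real judge scores are available.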
Lucky for me, the TinyStories authors have published their evaluation prompts.
I’ll walk through some basic generation code to make sure it works before I apply it to the full dataset.
To my knowledge, you want tokz.padding_side to be “left” during batched inference, but the default here is “right”. The examples below show the difference for batched prompts: with right padding, next-token prediction starts from a pad token, while with left padding it starts from the last token in the prompt.
(50256, 50256, 50256, 'right')
inputs = tokz(data, padding=True, truncation=True, return_tensors="pt").to("cuda")
tokz.decode(inputs.input_ids[0])
"Once upon a time, there lived a bunny in a field. Her name was Lucy. Lucy loved to have feasts and parties with her bunny friends. One day, when Lucy was about to leave for a feast at a friend's house, she realized she's starting to feel sick. She was so weak she could<|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|>"
tokz.padding_side = "left"
inputs = tokz(data, padding=True, truncation=True, return_tensors="pt").to("cuda")
tokz.decode(inputs.input_ids[0])
"<|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|>Once upon a time, there lived a bunny in a field. Her name was Lucy. Lucy loved to have feasts and parties with her bunny friends. One day, when Lucy was about to leave for a feast at a friend's house, she realized she's starting to feel sick. She was so weak she could"
tensor([50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
50256, 50256, 50256, 50256, 50256, 7454, 2402, 257, 640, 11,
612, 5615, 257, 44915, 287, 257, 2214, 13, 2332, 1438,
373, 22162, 13, 22162, 6151, 284, 423, 730, 5773, 290,
4671, 351, 607, 44915, 2460, 13, 1881, 1110, 11, 618,
22162, 373, 546, 284, 2666, 329, 257, 26951, 379, 257,
1545, 338, 2156, 11, 673, 6939, 673, 338, 3599, 284,
1254, 6639, 13, 1375, 373, 523, 4939, 673, 714],
device='cuda:0')
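To see the left versus right padding difference without loading a tokenizer, here is a toy sketch using plain Python lists. The `pad_batch` helper is hypothetical (it is not part of the tokenizer API), and the two short prompts are just the first few ids from the tensor above.

```python
PAD = 50256  # pad/eos token id shown in the output above

def pad_batch(batch, side):
    """Pad lists of token ids to a common length on the given side."""
    width = max(len(ids) for ids in batch)
    padded = []
    for ids in batch:
        fill = [PAD] * (width - len(ids))
        padded.append(fill + ids if side == "left" else ids + fill)
    return padded

# Two toy prompts of different lengths.
batch = [[7454, 2402, 257, 640, 11], [7454, 2402]]

right = pad_batch(batch, "right")
left = pad_batch(batch, "left")

# Generation continues from the final position of each row.
print([row[-1] for row in right])  # [11, 50256]
print([row[-1] for row in left])   # [11, 2402]
```

With right padding the shorter prompt ends on a pad token, whereas with left padding every row ends on a real prompt token.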
Reusing the generation code I used for the TinyHackathon competition, but setting do_sample=False and num_beams=5:
def _generate(model, prompts, max_length=384, min_length=120):
    # Left padding so every prompt ends on a real token, not a pad token
    tokz.padding_side = "left"
    inputs = tokz(prompts, padding=True, truncation=True, return_tensors="pt").to("cuda")
    model.eval()
    with torch.no_grad():
        # Beam search without sampling (temp=0, beams=5 in the authors' terms)
        outputs = model.generate(
            inputs.input_ids,
            attention_mask=inputs.attention_mask,
            max_length=max_length,
            min_length=min_length,
            num_beams=5,
            do_sample=False,
            pad_token_id=tokz.eos_token_id
        )
    completions = []
    for j, output in enumerate(outputs):
        # Every padded prompt has the same width, so slice it off to keep only new tokens
        input_length = inputs.input_ids[j].size(0)
        completion_tokens = output[input_length:]
        completion_text = tokz.decode(completion_tokens, skip_special_tokens=True)
        completions.append(completion_text)
    assert outputs.shape[0] == len(prompts)
    assert outputs.shape[1] == max_length
    assert len(completions) == len(prompts)
    return completions
completions = _generate(model, data)
not sleep.
Lucy asked her mom, "What's wrong, Mommy?" Her mom replied, "It's okay, sweetie. I'll help you."
Lucy smiled and said, "I'm sorry, Mommy. I'll help you." Her mom smiled and said, "It's okay, Lucy. I'm glad you're safe."
Lucy smiled and said, "Thank you, Mommy. I love you." Her mom smiled and said, "I love you too, Lucy."
go to the hospital. The little boy was very sad and he didn't want to go to the hospital.
His mom said, "Don't worry, I'll help you." But the little boy didn't listen. He said, "I'm sorry, mom. I won't do it again."
His mom smiled and said, "It's okay, I'll help you."
The little boy was so happy and thanked his mom. From that day on, he always made sure to always be careful when playing outside.
room.
The little girl asked her daddy, "Daddy, can you help me?"
Daddy said, "Yes, I can help you."
The little girl was so happy. She said, "Thank you, Daddy!"
Daddy smiled and said, "You're welcome, sweetheart. I'm glad you're safe."
The little girl smiled and said, "Thank you, Daddy!"
Daddy smiled and said, "You're welcome, sweetheart. I'm glad you're safe."
The little girl smiled and said, "I'm glad you're safe."
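The slicing step in `_generate` relies on left padding giving every row in the batch the same prompt width, so everything past that width is newly generated text. A toy sketch with plain lists follows; the token ids and the `split_completion` name are made up for illustration.

```python
def split_completion(output_ids, prompt_width):
    """Drop the (left-padded) prompt portion and keep only generated ids."""
    return output_ids[prompt_width:]

PAD = 50256
# Left-padded prompts: both rows are 4 ids wide.
prompts = [[PAD, PAD, 10, 11], [20, 21, 22, 23]]
# Pretend model output: each prompt followed by three generated ids.
outputs = [p + [100, 101, 102] for p in prompts]

prompt_width = len(prompts[0])
completions = [split_completion(o, prompt_width) for o in outputs]
print(completions)  # [[100, 101, 102], [100, 101, 102]]
```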
I’ll now iterate through a list of all three models, generate story completions, and save them to a CSV file for evaluation.
df = pd.DataFrame({"prompt": data, "1M": [None]*len(data), "8M": [None]*len(data), "28M": [None]*len(data)})
df.head()
| | prompt | 1M | 8M | 28M |
---|---|---|---|---|
0 | Once upon a time, there lived a bunny in a fie... | None | None | None |
1 | One day a girl walked into the living room and... | None | None | None |
2 | Once upon a time, there lived a hamster in the... | None | None | None |
3 | Jack asked his mom if he could ride the bike a... | None | None | None |
4 | Alice was bored and wanted to find some advent... | None | None | None |
| | prompt | 1M | 8M | 28M |
---|---|---|---|---|
0 | Once upon a time, there lived a bunny in a fie... | not sleep.\n\nLucy asked her mom, "What's wro... | hardly move.\n\nLucy's friend, a wise old owl... | barely move. \n\nLucy's bunny friends noticed... |
1 | One day a girl walked into the living room and... | , she heard a voice.\n\n"What are you doing he... | she heard a voice.\n\n"Who are you?" the voic... | a voice came from behind her.\n\n"What do you... |
2 | Once upon a time, there lived a hamster in the... | it was too late.\n\nThe next day, the hamster... | the mouse was trying to help him.\n\nThe hams... | he had to help the mouse.\n\nHe used all his ... |
3 | Jack asked his mom if he could ride the bike a... | thank you" to his daughter.\n\nThe next day, J... | no" and he knew that he had to be careful.\n\n... | Don't ride too fast, be careful!"\n\nSo Jack s... |
4 | Alice was bored and wanted to find some advent... | go to the park?"\n\nTom said, "I don't want t... | go on an adventure together?"\n\nBen smiled a... | go on an adventure?"\n\nBen thought for a mom... |
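The fill-and-save loop behind the table above can be sketched roughly as follows. This is a standard-library version under stated assumptions: `fake_generate` is a hypothetical stand-in for loading each model and calling `_generate`, the helper names are made up, and the CSV is built as a string rather than written with pandas.

```python
import csv
import io

def fill_completions(prompts, model_names, generate_fn):
    """Build one row per prompt with a completion column per model."""
    rows = [{"prompt": p} for p in prompts]
    for name in model_names:
        completions = generate_fn(name, prompts)
        assert len(completions) == len(prompts)
        for row, completion in zip(rows, completions):
            row[name] = completion
    return rows

def to_csv(rows, fieldnames):
    """Serialize rows to CSV text; write this string to a file to persist it."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

# Hypothetical stand-in for loading a model and calling _generate.
def fake_generate(model_name, prompts):
    return [f"<{model_name} completion {i}>" for i, _ in enumerate(prompts)]

model_names = ["1M", "8M", "28M"]
rows = fill_completions(["Once upon a time", "One day"], model_names, fake_generate)
csv_text = to_csv(rows, ["prompt"] + model_names)
print(csv_text)
```

Keeping one column per model means each prompt's three completions sit side by side, which is convenient when scoring them later.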
While we’re at it, I’ll calculate the average prompt and completion length in tokens.
{'input_ids': [7454, 2402, 257, 640, 11, 612, 5615, 257, 44915, 287, 257, 2214, 13, 2332, 1438, 373, 22162, 13, 22162, 6151, 284, 423, 730, 5773, 290, 4671, 351, 607, 44915, 2460, 13, 1881, 1110, 11, 618, 22162, 373, 546, 284, 2666, 329, 257, 26951, 379, 257, 1545, 338, 2156, 11, 673, 6939, 673, 338, 3599, 284, 1254, 6639, 13, 1375, 373, 523, 4939, 673, 714], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
The prompts average 62 tokens and the completions average 140, 153 and 164 tokens for the three models, so a prompt plus its completion averages about 200 tokens in length. This is a different tokenizer than the one I’m using, so results will vary for my trained models.
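The averaging itself is simple. Here is a minimal sketch with a tokenizer passed in as a function; the `average_token_length` name and the whitespace tokenizer are stand-ins for illustration, so the numbers it produces will not match the token counts reported above.

```python
from statistics import mean

def average_token_length(texts, tokenize):
    """Mean number of tokens per text under the given tokenizer function."""
    return mean(len(tokenize(text)) for text in texts)

# Stand-in tokenizer: whitespace splitting. With a Hugging Face tokenizer,
# something like `lambda t: tokz(t)["input_ids"]` would take its place.
texts = ["Once upon a time", "One day a girl walked"]
print(average_token_length(texts, str.split))  # 4.5
```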
I have established two key elements for my evals:
TinyStories | Hidden Dim | Num Layers | Eval Loss | Creativity | Grammar | Consistency | Plot |
---|---|---|---|---|---|---|---|
1M | 64 | 8 | 2.08 | 4.68 | 6.14 | 4.45 | 4.40 |
8M | 256 | 8 | 1.38 | 6.54 | 7.72 | 8.02 | 7.23 |
28M | 512 | 8 | 1.20 | 6.85 | 8.34 | 8.95 | 7.26 |
Model Name | Hidden Dim | Intermediate Dim | Number of Layers | Number of Params |
---|---|---|---|---|
5M | 64 | 256 | 13 | 4_949_696 |
25M | 256 | 1024 | 8 | 24_776_960 |
60M | 512 | 2048 | 6 | 57_940_480 |
125M | 768 | 3072 | 8 | 124_662_528 |
My next steps:
Both steps will take considerable time, so I’ll break them down into smaller steps and publish blog posts and videos along the way. Make sure to subscribe to my YouTube channel or check the TinyScale Lab playlist for the latest content!