TinyScale Lab Update: Setting Eval Targets and Generating Completions for LLM Judge Development

LLM
deep learning
TinyScale-Lab
In this TinyScale Lab update, I tackle how to evaluate language models before building them. Using the TinyStories paper, I set target scores for my upcoming 5M-125M parameter models and generate reference completions using existing TinyStories models (1M, 8M, 28M).
Author

Vishal Bakshi

Published

April 27, 2025

Background

In this notebook I’m going to generate story completions using the TinyStories 1M, 8M and 28M models. The actual parameter counts of these HF models are 3.7M, 19.7M and 52M, respectively. Since I’m training 5M, 25M, 60M and 125M models, these three TinyStories models will serve as proxies for my first three sizes, and I will expect my 125M model to generate stories that receive higher scores than my 60M (how much higher is TBD).

Ronen Eldan, one of the TinyStories paper authors, stated in this HF model card discussion:

we used temp=0, beams=5

So I’ll be using those two settings during inference.

Here are some key architectural details for my initial quick-and-dirty models:

| Model Name | Hidden Dim | Intermediate Dim | Number of Layers | Number of Params |
|------------|------------|------------------|------------------|------------------|
| 5M         | 64         | 256              | 13               | 4_949_696        |
| 25M        | 256        | 1024             | 8                | 24_776_960       |
| 60M        | 512        | 2048             | 6                | 57_940_480       |
| 125M       | 768        | 3072             | 8                | 124_662_528      |
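
As a quick sanity check on those parameter counts, here’s a minimal counting sketch. The architecture assumptions are mine, inferred rather than stated here: LLaMA-style blocks (SwiGLU MLP, RMSNorm), untied input/output embeddings, a 32,000-token vocabulary, and no biases. Under those assumptions the formula reproduces the table exactly:

# Parameter count sketch. Assumptions (inferred, not stated above): LLaMA-style
# blocks with SwiGLU MLPs and RMSNorm, untied embeddings, vocab size 32,000.
def n_params(d, inter, n_layers, vocab=32_000):
    embed = 2 * vocab * d               # input embeddings + LM head (untied)
    attn  = 4 * d * d                   # Q, K, V, O projections
    mlp   = 3 * d * inter               # gate, up, down projections (SwiGLU)
    norms = 2 * d                       # two RMSNorms per block
    return embed + n_layers * (attn + mlp + norms) + d  # + final norm

for d, inter, n_layers in [(64, 256, 13), (256, 1024, 8), (512, 2048, 6), (768, 3072, 8)]:
    print(f"{n_params(d, inter, n_layers):_}")
# 4_949_696, 24_776_960, 57_940_480, 124_662_528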

Here are my initial scoring categories, based on type of language model capability:

  • Foundational language capabilities: Grammar and Context-Tracking (Consistency)
  • Emergent capabilities: Factual Knowledge, Reasoning, Creativity
  • Story-related capabilities: Plot

Referencing Figure 4 in the TinyStories paper, I would expect to achieve LLM Judge scores for these models close to the following (TSL = TinyScale Lab):

| Similar to | Hidden Dim | Num Layers         | Eval Loss | Creativity | Grammar | Consistency | Plot |
|------------|------------|--------------------|-----------|------------|---------|-------------|------|
| TSL-5M     | 64         | 12                 | 2.02      | 4.84       | 6.19    | 4.75        | 4.39 |
| TSL-25M    | 256        | 8                  | 1.38      | 6.54       | 7.72    | 8.02        | 7.23 |
| TSL-60M    | 512        | avg of 4 and 8     | 1.23      | 6.8        | 8.35    | 8.7         | 7.31 |
| TSL-125M   | 768        | 8                  | 1.18      | 7.02       | 8.62    | 9.34        | 7.34 |

(My 60M model has 6 layers, which falls between Figure 4’s 4-layer and 8-layer configurations at hidden dim 512, so its targets are the average of those two rows’ scores.)

Mapping the Figure 4 scores to the official 1M, 8M and 28M models directly:

| TinyStories | Hidden Dim | Num Layers | Eval Loss | Creativity | Grammar | Consistency | Plot |
|-------------|------------|------------|-----------|------------|---------|-------------|------|
| 1M          | 64         | 8          | 2.08      | 4.68       | 6.14    | 4.45        | 4.40 |
| 8M          | 256        | 8          | 1.38      | 6.54       | 7.72    | 8.02        | 7.23 |
| 28M         | 512        | 8          | 1.20      | 6.85       | 8.34    | 8.95        | 7.26 |

Two of my scoring categories, Factual Knowledge and Reasoning, are not assessed quantitatively in the TinyStories paper. If my LLM Judge scores match Figure 4 for the other four categories, and match my manual evaluations for all six, I should expect the LLM Judge to assess these two categories correctly as well.
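
To make these targets easy to check programmatically later on, here’s a minimal sketch encoding them as a dict. The structure and key names are my own (hypothetical), and the two categories without literature targets are set to None:

# Target scores from the tables above, for programmatic comparison against
# future eval runs. Key names are hypothetical; None = no literature target.
targets = {
    "TSL-5M":   dict(eval_loss=2.02, creativity=4.84, grammar=6.19, consistency=4.75, plot=4.39, factual_knowledge=None, reasoning=None),
    "TSL-25M":  dict(eval_loss=1.38, creativity=6.54, grammar=7.72, consistency=8.02, plot=7.23, factual_knowledge=None, reasoning=None),
    "TSL-60M":  dict(eval_loss=1.23, creativity=6.80, grammar=8.35, consistency=8.70, plot=7.31, factual_knowledge=None, reasoning=None),
    "TSL-125M": dict(eval_loss=1.18, creativity=7.02, grammar=8.62, consistency=9.34, plot=7.34, factual_knowledge=None, reasoning=None),
}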

Evaluation Prompts

Lucky for me, the TinyStories authors have published their evaluation prompts.

import requests
import yaml

# download the 44 evaluation prompts published with the TinyStories dataset
url = "https://huggingface.co/datasets/roneneldan/TinyStories/raw/main/Evaluation%20prompts.yaml"
response = requests.get(url)
data = yaml.safe_load(response.text)
len(data)
44
data[0]
"Once upon a time, there lived a bunny in a field. Her name was Lucy. Lucy loved to have feasts and parties with her bunny friends. One day, when Lucy was about to leave for a feast at a friend's house, she realized she's starting to feel sick. She was so weak she could"

Generating Story Completions

I’ll walk through some basic generation code to make sure it works before I apply it to the full dataset.

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import pandas as pd

model = AutoModelForCausalLM.from_pretrained("roneneldan/TinyStories-1M").to("cuda")
tokz = AutoTokenizer.from_pretrained("roneneldan/TinyStories-1M")
tokz.pad_token_id  # None: this tokenizer ships without a pad token
tokz.pad_token = tokz.eos_token  # so reuse EOS for padding

To my knowledge, you want tokz.padding_side to be “left” during batched inference, and the default here is “right”. The cells below show the difference for a batched prompt: with right padding, next-token prediction starts from a pad token; with left padding, it starts from the last token of the prompt.

tokz.bos_token_id, tokz.eos_token_id, tokz.pad_token_id, tokz.padding_side
(50256, 50256, 50256, 'right')
inputs = tokz(data, padding=True, truncation=True, return_tensors="pt").to("cuda")
tokz.decode(inputs.input_ids[0])
"Once upon a time, there lived a bunny in a field. Her name was Lucy. Lucy loved to have feasts and parties with her bunny friends. One day, when Lucy was about to leave for a feast at a friend's house, she realized she's starting to feel sick. She was so weak she could<|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|>"
tokz.padding_side = "left"
inputs = tokz(data, padding=True, truncation=True, return_tensors="pt").to("cuda")
tokz.decode(inputs.input_ids[0])
"<|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|>Once upon a time, there lived a bunny in a field. Her name was Lucy. Lucy loved to have feasts and parties with her bunny friends. One day, when Lucy was about to leave for a feast at a friend's house, she realized she's starting to feel sick. She was so weak she could"
inputs.input_ids[0]
tensor([50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
        50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
        50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
        50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
        50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
        50256, 50256, 50256, 50256, 50256,  7454,  2402,   257,   640,    11,
          612,  5615,   257, 44915,   287,   257,  2214,    13,  2332,  1438,
          373, 22162,    13, 22162,  6151,   284,   423,   730,  5773,   290,
         4671,   351,   607, 44915,  2460,    13,  1881,  1110,    11,   618,
        22162,   373,   546,   284,  2666,   329,   257, 26951,   379,   257,
         1545,   338,  2156,    11,   673,  6939,   673,   338,  3599,   284,
         1254,  6639,    13,  1375,   373,   523,  4939,   673,   714],
       device='cuda:0')
inputs.attention_mask[0].shape
torch.Size([119])

Reusing the generation code I used for the TinyHackathon competition, but setting do_sample=False and num_beams=5:

def _generate(model, prompts, max_length=384, min_length=120):
    tokz.padding_side = "left"  # left-pad so generation continues from the prompt's last token
    inputs = tokz(prompts, padding=True, truncation=True, return_tensors="pt").to("cuda")

    model.eval()
    with torch.no_grad():
        outputs = model.generate(
            inputs.input_ids,
            attention_mask=inputs.attention_mask,
            max_length=max_length,
            min_length=min_length,
            num_beams=5,       # matching the authors' reported setting
            do_sample=False,   # deterministic, i.e. the authors' temp=0 setting
            pad_token_id=tokz.eos_token_id
        )

        completions = []

        # strip the (left-padded) prompt tokens from each output, keeping only the completion
        for j, output in enumerate(outputs):
            input_length = inputs.input_ids[j].size(0)
            completion_tokens = output[input_length:]
            completion_text = tokz.decode(completion_tokens, skip_special_tokens=True)
            completions.append(completion_text)

        assert outputs.shape[0] == len(prompts)
        assert outputs.shape[1] == max_length
        assert len(completions) == len(prompts)
        return completions

completions = _generate(model, data)
print(completions[0])
 not sleep.

Lucy asked her mom, "What's wrong, Mommy?" Her mom replied, "It's okay, sweetie. I'll help you."

Lucy smiled and said, "I'm sorry, Mommy. I'll help you." Her mom smiled and said, "It's okay, Lucy. I'm glad you're safe."

Lucy smiled and said, "Thank you, Mommy. I love you." Her mom smiled and said, "I love you too, Lucy."
print(completions[-1])
 go to the hospital. The little boy was very sad and he didn't want to go to the hospital. 

His mom said, "Don't worry, I'll help you." But the little boy didn't listen. He said, "I'm sorry, mom. I won't do it again." 

His mom smiled and said, "It's okay, I'll help you." 

The little boy was so happy and thanked his mom. From that day on, he always made sure to always be careful when playing outside.
print(completions[22])
 room. 

The little girl asked her daddy, "Daddy, can you help me?" 

Daddy said, "Yes, I can help you." 

The little girl was so happy. She said, "Thank you, Daddy!" 

Daddy smiled and said, "You're welcome, sweetheart. I'm glad you're safe." 

The little girl smiled and said, "Thank you, Daddy!"

Daddy smiled and said, "You're welcome, sweetheart. I'm glad you're safe." 

The little girl smiled and said, "I'm glad you're safe."

I’ll now iterate through a list of all three models, generate story completions, and save them to a CSV for evaluation.

model_names = ["roneneldan/TinyStories-1M", "roneneldan/TinyStories-8M", "roneneldan/TinyStories-28M"]
df = pd.DataFrame({"prompt": data, "1M": [None]*len(data), "8M": [None]*len(data), "28M": [None]*len(data)})
df.head()
prompt 1M 8M 28M
0 Once upon a time, there lived a bunny in a fie... None None None
1 One day a girl walked into the living room and... None None None
2 Once upon a time, there lived a hamster in the... None None None
3 Jack asked his mom if he could ride the bike a... None None None
4 Alice was bored and wanted to find some advent... None None None
# note: _generate uses the module-level `tokz`, so reassigning it here
# swaps in each model's tokenizer
for name in model_names:
    model = AutoModelForCausalLM.from_pretrained(name).to("cuda")
    tokz = AutoTokenizer.from_pretrained(name)
    tokz.pad_token = tokz.eos_token
    completions = _generate(model, data)
    df[name.split("-")[-1]] = completions
df.head()
prompt 1M 8M 28M
0 Once upon a time, there lived a bunny in a fie... not sleep.\n\nLucy asked her mom, "What's wro... hardly move.\n\nLucy's friend, a wise old owl... barely move. \n\nLucy's bunny friends noticed...
1 One day a girl walked into the living room and... , she heard a voice.\n\n"What are you doing he... she heard a voice.\n\n"Who are you?" the voic... a voice came from behind her.\n\n"What do you...
2 Once upon a time, there lived a hamster in the... it was too late.\n\nThe next day, the hamster... the mouse was trying to help him.\n\nThe hams... he had to help the mouse.\n\nHe used all his ...
3 Jack asked his mom if he could ride the bike a... thank you" to his daughter.\n\nThe next day, J... no" and he knew that he had to be careful.\n\n... Don't ride too fast, be careful!"\n\nSo Jack s...
4 Alice was bored and wanted to find some advent... go to the park?"\n\nTom said, "I don't want t... go on an adventure together?"\n\nBen smiled a... go on an adventure?"\n\nBen thought for a mom...
df.to_csv("2025-04-27-evals.csv", index=False)

While we’re at it, I’ll calculate the average prompt and completion length in tokens.

tokz(data[0])
{'input_ids': [7454, 2402, 257, 640, 11, 612, 5615, 257, 44915, 287, 257, 2214, 13, 2332, 1438, 373, 22162, 13, 22162, 6151, 284, 423, 730, 5773, 290, 4671, 351, 607, 44915, 2460, 13, 1881, 1110, 11, 618, 22162, 373, 546, 284, 2666, 329, 257, 26951, 379, 257, 1545, 338, 2156, 11, 673, 6939, 673, 338, 3599, 284, 1254, 6639, 13, 1375, 373, 523, 4939, 673, 714], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
toks = 0
for p in data: toks += len(tokz(p).input_ids)
toks//44
62
toks = 0
for p in df["1M"]: toks += len(tokz(p).input_ids)
toks//44
164
toks = 0
for p in df["8M"]: toks += len(tokz(p).input_ids)
toks//44
153
toks = 0
for p in df["28M"]: toks += len(tokz(p).input_ids)
toks//44
140
62+164, 62+153, 62+140
(226, 215, 202)

The prompt (62 tokens on average) plus completion (140, 153 or 164 tokens on average) comes out to roughly 200–225 tokens per story. These models use a different tokenizer than the one I’m using, so results will vary for my trained models.
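
Once my own models are trained, the same measurement can be repeated with their tokenizer. A sketch (the repo name below is a placeholder, not a real checkpoint):

# Hypothetical: re-measure average prompt length with my own tokenizer.
# "vishalbakshi/tinyscale-lab-tokenizer" is a placeholder repo name.
my_tokz = AutoTokenizer.from_pretrained("vishalbakshi/tinyscale-lab-tokenizer")
sum(len(my_tokz(p).input_ids) for p in data) // len(data)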

Closing Thoughts

I have established two key elements for my evals:

  • Targets based on the literature (for my reference models and experiment models)
| TinyStories | Hidden Dim | Num Layers | Eval Loss | Creativity | Grammar | Consistency | Plot |
|-------------|------------|------------|-----------|------------|---------|-------------|------|
| 1M          | 64         | 8          | 2.08      | 4.68       | 6.14    | 4.45        | 4.40 |
| 8M          | 256        | 8          | 1.38      | 6.54       | 7.72    | 8.02        | 7.23 |
| 28M         | 512        | 8          | 1.20      | 6.85       | 8.34    | 8.95        | 7.26 |

| Model Name | Hidden Dim | Intermediate Dim | Number of Layers | Number of Params |
|------------|------------|------------------|------------------|------------------|
| 5M         | 64         | 256              | 13               | 4_949_696        |
| 25M        | 256        | 1024             | 8                | 24_776_960       |
| 60M        | 512        | 2048             | 6                | 57_940_480       |
| 125M       | 768        | 3072             | 8                | 124_662_528      |
  • Generations using reference models for evaluation prompts from literature
    • 44 prompts (62 tokens on average)

My next steps:

  • Evaluate a sample of prompts from each model by hand for my six scoring categories:
    • Foundational language capabilities: Grammar and Context-Tracking (Consistency)
    • Emergent capabilities: Factual Knowledge, Reasoning, Creativity
    • Story-related capabilities: Plot
  • Prompt different LLMs, iterating on prompts until the LLM Judge’s scores match mine 90%+ of the time (a rough starting sketch follows this list).
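
As a rough sketch of where that judge prompt might start (the wording, JSON format, and helper below are hypothetical, to be iterated on until the scores agree with my manual evaluations):

# Hypothetical first-draft judge prompt covering all six scoring categories.
JUDGE_TEMPLATE = """Below is the beginning of a children's story and a completion
written by a small language model.

Story beginning: {prompt}
Completion: {completion}

Score the completion from 1 to 10 on each of: grammar, consistency
(context-tracking), factual knowledge, reasoning, creativity, and plot.
Respond with JSON only, e.g.:
{{"grammar": 7, "consistency": 6, "factual_knowledge": 5, "reasoning": 5, "creativity": 6, "plot": 6}}"""

def build_judge_prompt(prompt: str, completion: str) -> str:
    return JUDGE_TEMPLATE.format(prompt=prompt, completion=completion)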

Both steps will take considerable time, so I’ll break them down into smaller steps and publish blog posts and videos along the way. Make sure to subscribe to my YouTube channel or check the TinyScale Lab playlist for the latest content!