TinyScale Lab Update: Training Cost Analysis and Evaluation Infrastructure Plans

LLM
deep learning
TinyScale-Lab
I share my progress on the TinyScaleLab project, where I'm studying small language models trained on the TinyStories dataset. I discuss the architecture of my four model sizes (5M, 25M, 60M, and 125M parameters), compare training costs between L4 and A100 GPUs, and outline my plans for developing an evaluation framework using LLM judges. This project aims to investigate both the training dynamics of tiny models and their language capabilities across grammar, context tracking, factual knowledge, reasoning, creativity, and storytelling.
Author

Vishal Bakshi

Published

April 27, 2025

Background

In this notebook I'll share results from some quick-and-dirty training runs I executed to get a rough but reasonable estimate of training time and cost on L4 and A100 GPUs using Google Colab Pro.

Model Sizes

In these experiments I'm training four model sizes: 5M, 25M, 60M, and 125M parameters. I've chosen to roughly follow the TinyStories models, using the hidden and intermediate dimensions of the TinyStories-1M, -8M, -28M, and -33M models; in each case there is a 4x increase from the hidden dimension to the intermediate dimension in the MLP layers. These are just initial architectural choices that might change over the course of the project as I learn more about what results in a better-performing model.

For now, I'm using 8 attention heads (for all models) and the Llama-2 tokenizer with a 32,000-token vocabulary.

| Model Name | Hidden Dim | Intermediate Dim | Number of Layers | Number of Params |
|------------|------------|------------------|------------------|------------------|
| 5M         | 64         | 256              | 13               | 4_949_696        |
| 25M        | 256        | 1024             | 8                | 24_776_960       |
| 60M        | 512        | 2048             | 6                | 57_940_480       |
| 125M       | 768        | 3072             | 8                | 124_662_528      |
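
As a sanity check, these parameter counts can be reproduced by instantiating each config with the transformers library and counting parameters. Here's a minimal sketch for the 25M model (biases and attention settings omitted for brevity; the full configs I'm using are in the Appendix):

from transformers import LlamaConfig, LlamaForCausalLM

# 25M config from the table above (library defaults everywhere else).
config = LlamaConfig(
    vocab_size=32000,
    hidden_size=256,
    intermediate_size=1024,
    num_hidden_layers=8,
    num_attention_heads=8,
    max_position_embeddings=512,
)

# Instantiate on CPU with random weights and count parameters.
model = LlamaForCausalLM(config)
print(sum(p.numel() for p in model.parameters()))  # ~24.8M, matching the table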

Training Dataset

I've tokenized the TinyStories_all_data.tar.gz dataset, which contains 4.9M stories generated by GPT-3.5 and GPT-4, using the meta-llama/Llama-2-7b-hf tokenizer. I haven't performed any data cleaning (yet). The total number of tokens in this dataset is a little over 1B: 1_028_013_532.
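
For reference, here's a minimal sketch of the token-counting step (not my exact pipeline): load the Llama-2 tokenizer from the gated meta-llama/Llama-2-7b-hf repo and count tokens per story. The example story below is made up.

from transformers import AutoTokenizer

# Gated repo: requires accepting Meta's license and authenticating with Hugging Face.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# Hypothetical story standing in for one entry from TinyStories_all_data.tar.gz.
story = "Once upon a time, there was a little robot who loved to read."
token_ids = tokenizer(story)["input_ids"]
print(len(token_ids))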

Training Duration

I’m training all initial runs for 1 epoch.

Training GPUs

I trained the 5M and 25M models on both L4 (22.5 GB VRAM) and A100 (40GB VRAM) GPUs. I trained the 60M and 125M models on the A100 GPU.

Results

All models were trained for 1 epoch (~1.03B tokens):

| Hardware  | Model Size | Time (hr) | Batch Size | Max Memory (% of VRAM) | Cost  |
|-----------|------------|-----------|------------|------------------------|-------|
| L4        | 5M         | 0.87      | 384        | 20%                    | $0.18 |
| L4        | 25M        | 1.45      | 288        | 65%                    | $0.30 |
| A100-40GB | 5M         | 0.32      | 2048       | 78%                    | $0.25 |
| A100-40GB | 25M        | 0.35      | 1536       | 98%                    | $0.27 |
| A100-40GB | 60M        | 0.54      | 1152      | 86%                    | $0.41 |
| A100-40GB | 125M       | 1.10      | 512        | 99%                    | $0.84 |
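
Dividing cost by training time, these runs imply hourly rates of roughly $0.21/hr for the L4 and $0.77/hr for the A100-40GB on Colab Pro. A quick back-of-the-envelope check using the numbers from the table:

# Back-calculate the approximate hourly GPU rate implied by each run above.
runs = [
    ("L4", "5M", 0.87, 0.18),
    ("L4", "25M", 1.45, 0.30),
    ("A100-40GB", "5M", 0.32, 0.25),
    ("A100-40GB", "25M", 0.35, 0.27),
    ("A100-40GB", "60M", 0.54, 0.41),
    ("A100-40GB", "125M", 1.10, 0.84),
]
for gpu, size, hours, cost in runs:
    print(f"{gpu} ({size}): ${cost / hours:.2f}/hr")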

Takeaways

From this analysis, only the 5M model makes sense to train on the L4. It's 3 cents cheaper per epoch to train the 25M model on the A100, though at 98% memory usage I'm flirting with OOM, so I should reduce the batch size.

I'll need to perform longer training runs to get a sense of how many full epochs are needed to produce coherent language-generating models, but from my TinyHackathon experience, it took 20 epochs for the 60M model to perform decently (a 3/5 overall LLM Judge score). I would expect the 125M model to require fewer epochs, and the smaller models more, to achieve comparable performance. But we'll see!
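
As a very rough extrapolation (assuming cost scales linearly with epoch count and ignoring any evaluation or checkpointing overhead), here's what 20 epochs would cost at the single-epoch A100-40GB prices from the Results table:

# Naive cost extrapolation: single-epoch A100-40GB cost x number of epochs.
single_epoch_cost = {"5M": 0.25, "25M": 0.27, "60M": 0.41, "125M": 0.84}
epochs = 20  # roughly what the 60M model needed in my TinyHackathon experience
for model_size, cost in single_epoch_cost.items():
    print(f"{model_size}: ~${cost * epochs:.2f} for {epochs} epochs")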

Appendix

Here are the LlamaConfig objects for each model:

5M

from transformers import LlamaConfig

# 5M model: 13 layers, 64-dim hidden, 256-dim MLP
config = LlamaConfig(
    vocab_size=32000,
    hidden_size=64,
    intermediate_size=256,
    num_hidden_layers=13,
    num_attention_heads=8,
    max_position_embeddings=512,
    rope_theta=10000.0,
    attention_bias=False,
    mlp_bias=True,
    attn_implementation="flash_attention_2",
    torch_dtype="bfloat16"
)

25M

config = LlamaConfig(
    vocab_size=32000,
    hidden_size=256,
    intermediate_size=1024,
    num_hidden_layers=8,
    num_attention_heads=8,
    max_position_embeddings=512,
    rope_theta=10000.0,
    attention_bias=False,
    mlp_bias=True,
    attn_implementation="flash_attention_2",
    torch_dtype="bfloat16"
)

60M

config = LlamaConfig(
    vocab_size=32000,
    hidden_size=512,
    intermediate_size=2048,
    num_hidden_layers=6,
    num_attention_heads=8,
    max_position_embeddings=512,
    rope_theta=10000.0,
    attention_bias=False,
    mlp_bias=True,
    attn_implementation="flash_attention_2",
    torch_dtype="bfloat16"
)

125M

config = LlamaConfig(
    vocab_size=32000,
    hidden_size=768,
    intermediate_size=3072,
    num_hidden_layers=8,
    num_attention_heads=8,
    max_position_embeddings=512,
    rope_theta=10000.0,
    attention_bias=False,
    mlp_bias=True,
    attn_implementation="flash_attention_2",
    torch_dtype="bfloat16"
)
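
And a minimal sketch of how one of these configs becomes a trainable model (assumes a CUDA GPU, a PyTorch build with bfloat16 support, and the flash-attn package installed for the flash_attention_2 setting above):

import torch
from transformers import LlamaForCausalLM

# Build a model from the config defined above (random weights) and move it to
# the GPU in bfloat16, matching the dtype used in the training runs.
model = LlamaForCausalLM(config).to("cuda", dtype=torch.bfloat16)
print(f"{sum(p.numel() for p in model.parameters()):,} parameters")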