TinyScale Lab Update: Training Cost Analysis and Evaluation Infrastructure Plans
Background
In this notebook I’ll share results from some quick-and-dirty training runs I executed to get a rough but reasonable estimate of training time and cost using L4 and A100 GPUs on Google Colab Pro.
Model Sizes
In these experiments, I’m training four model sizes: 5M, 25M, 60M, and 125M parameters. I’ve chosen to roughly follow the TinyStories models, reusing the hidden and intermediate dimensions of the TinyStories-1M, -8M, -28M, and -33M models; in each case the MLP intermediate dimension is 4x the hidden dimension. These are just initial architectural choices and may change over the course of the project as I learn more about what makes a better-performing model.
For now, I’m using 8 attention heads for all models and the Llama-2 tokenizer, which has a 32000-token vocabulary.
Model Name | Hidden Dim | Intermediate Dim | Number of Layers | Number of Params |
---|---|---|---|---|
5M | 64 | 256 | 13 | 4_949_696 |
25M | 256 | 1024 | 8 | 24_776_960 |
60M | 512 | 2048 | 6 | 57_940_480 |
125M | 768 | 3072 | 8 | 124_662_528 |
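As a sanity check, the parameter counts in the table can be reproduced from the dimensions above. The helper below is my own back-of-the-envelope sketch (it assumes untied input/output embeddings, two RMSNorms per layer, and a gated MLP with gate/up/down projections, and it ignores any bias terms, which would only add a few thousand parameters):

```python
# Rough parameter-count estimate for a Llama-style model (a sketch, not the
# training code): untied embeddings + LM head, attention, gated MLP, norms.
def estimate_params(vocab: int, hidden: int, intermediate: int, layers: int) -> int:
    embeddings = 2 * vocab * hidden          # token embeddings + untied LM head
    attention = 4 * hidden * hidden          # q, k, v, o projections (no bias)
    mlp = 3 * hidden * intermediate          # gate, up, down projections (no bias)
    norms = 2 * hidden                       # two RMSNorms per layer
    return embeddings + layers * (attention + mlp + norms) + hidden  # + final norm

for name, h, i, l in [("5M", 64, 256, 13), ("25M", 256, 1024, 8),
                      ("60M", 512, 2048, 6), ("125M", 768, 3072, 8)]:
    print(name, estimate_params(32000, h, i, l))
# 5M 4949696, 25M 24776960, 60M 57940480, 125M 124662528
```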
Training Dataset
I’ve tokenized the TinyStories_all_data.tar.gz dataset, which contains 4.9M stories generated by GPT-3.5 and GPT-4, using the meta-llama/Llama-2-7b-hf tokenizer. I haven’t performed any data cleaning (yet). The total number of tokens in this dataset is a little over 1B: 1_028_013_532.
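For illustration, here is a minimal sketch of how a token count like that can be computed with this tokenizer. The actual preprocessing pipeline is more involved; the `stories` list below is just a stand-in for the text extracted from TinyStories_all_data.tar.gz:

```python
from transformers import AutoTokenizer

# Llama-2 tokenizer (a gated repo: requires accepting the license on the Hub).
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# Stand-in for the ~4.9M stories extracted from TinyStories_all_data.tar.gz.
stories = ["Once upon a time, there was a little robot who loved to count tokens."]

total_tokens = 0
batch_size = 1000
for i in range(0, len(stories), batch_size):
    encodings = tokenizer(stories[i : i + batch_size])  # no truncation: count every token
    total_tokens += sum(len(ids) for ids in encodings["input_ids"])

print(f"{total_tokens:_}")  # the full dataset comes out to a little over 1B tokens
```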
Training Duration
I’m training all initial runs for 1 epoch.
Training GPUs
I trained the 5M and 25M models on both L4 (22.5 GB VRAM) and A100 (40GB VRAM) GPUs. I trained the 60M and 125M models on the A100 GPU.
Results
All models were trained for 1 epoch (~1.03B tokens):
Hardware | Model Size | Time (hr) | Batch Size | Max Memory (% of VRAM) | Cost (USD) |
---|---|---|---|---|---|
L4 | 5M | 0.87 | 384 | 20% | $0.18 |
L4 | 25M | 1.45 | 288 | 65% | $0.30 |
A100-40GB | 5M | 0.32 | 2048 | 78% | $0.25 |
A100-40GB | 25M | 0.35 | 1536 | 98% | $0.27 |
A100-40GB | 60M | 0.54 | 1152 | 86% | $0.41 |
A100-40GB | 125M | 1.10 | 512 | 99% | $0.84 |
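The Max Memory column is peak GPU memory as a percentage of the card’s total VRAM. As a rough sketch of one way to capture that number with PyTorch (the actual runs may have logged it differently, e.g. via nvidia-smi):

```python
import torch

def peak_memory_fraction(device: int = 0) -> float:
    """Peak GPU memory allocated during the run, as a fraction of total VRAM (a sketch)."""
    peak = torch.cuda.max_memory_allocated(device)                   # bytes held by tensors at the peak
    total = torch.cuda.get_device_properties(device).total_memory    # total VRAM on the card
    return peak / total

# Usage: call torch.cuda.reset_peak_memory_stats() before training, then after training:
# print(f"max memory: {peak_memory_fraction():.0%}")
```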
Takeaways
From this analysis, only the 5M model makes sense to train on the L4. Training the 25M model on the A100 is 3 cents cheaper overall and about 4x faster, though at 98% memory utilization I’m flirting with OOM, so I should reduce the batch size.
I’ll need to perform longer training runs to get a sense of how many full epochs are needed to produce coherent language-generating models, but from my TinyHackathon experience, it took 20 epochs for the 60M model to perform decently (3/5 LLM Judge overall score). I would expect the 125M model to require fewer epochs, and the smaller models more epochs, to achieve comparable performance. But we’ll see! A naive extrapolation from the measurements above is sketched below.
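To make the epoch question concrete, here’s a back-of-the-envelope extrapolation from the single-epoch A100 measurements in the table above, assuming time and cost scale linearly with the number of epochs (which ignores evaluation and checkpointing overhead):

```python
# Measured single-epoch (hours, USD) on the A100-40GB, taken from the results table.
per_epoch = {
    "5M": (0.32, 0.25),
    "25M": (0.35, 0.27),
    "60M": (0.54, 0.41),
    "125M": (1.10, 0.84),
}

def extrapolate(model: str, epochs: int) -> tuple[float, float]:
    """Linearly scale measured per-epoch time and cost to a multi-epoch run."""
    hours, cost = per_epoch[model]
    return epochs * hours, epochs * cost

# e.g. 20 epochs of the 60M model: about 10.8 A100-hours and roughly $8.20.
print(extrapolate("60M", 20))
```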
Appendix
Here are the `LlamaConfig` objects for each model:
5M
```python
from transformers import LlamaConfig

config = LlamaConfig(
    vocab_size=32000,
    hidden_size=64,
    intermediate_size=256,
    num_hidden_layers=13,
    num_attention_heads=8,
    max_position_embeddings=512,
    rope_theta=10000.0,
    attention_bias=False,
    mlp_bias=True,
    attn_implementation="flash_attention_2",
    torch_dtype="bfloat16"
)
```
25M
```python
config = LlamaConfig(
    vocab_size=32000,
    hidden_size=256,
    intermediate_size=1024,
    num_hidden_layers=8,
    num_attention_heads=8,
    max_position_embeddings=512,
    rope_theta=10000.0,
    attention_bias=False,
    mlp_bias=True,
    attn_implementation="flash_attention_2",
    torch_dtype="bfloat16"
)
```
60M
```python
config = LlamaConfig(
    vocab_size=32000,
    hidden_size=512,
    intermediate_size=2048,
    num_hidden_layers=6,
    num_attention_heads=8,
    max_position_embeddings=512,
    rope_theta=10000.0,
    attention_bias=False,
    mlp_bias=True,
    attn_implementation="flash_attention_2",
    torch_dtype="bfloat16"
)
```
125M
```python
config = LlamaConfig(
    vocab_size=32000,
    hidden_size=768,
    intermediate_size=3072,
    num_hidden_layers=8,
    num_attention_heads=8,
    max_position_embeddings=512,
    rope_theta=10000.0,
    attention_bias=False,
    mlp_bias=True,
    attn_implementation="flash_attention_2",
    torch_dtype="bfloat16"
)
```
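To double-check any of these configs, you can instantiate a randomly initialized model from the config and count its parameters. A minimal sketch using the 125M dimensions (I leave out `attn_implementation` and `torch_dtype` here so it runs on CPU without flash-attn installed):

```python
from transformers import LlamaConfig, LlamaForCausalLM

config = LlamaConfig(
    vocab_size=32000,
    hidden_size=768,
    intermediate_size=3072,
    num_hidden_layers=8,
    num_attention_heads=8,
    max_position_embeddings=512,
    rope_theta=10000.0,
)
model = LlamaForCausalLM(config)      # randomly initialized weights
print(f"{model.num_parameters():_}")  # should land close to 124_662_528
```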