TinyScale Lab Update: Training Cost Analysis and Evaluation Infrastructure Plans
Background
In this notebook I’ll share results from some quick-and-dirty training runs I executed to get a rough but reasonable estimate of training time and cost using L4 and A100 GPUs on Google Colab Pro.
Model Sizes
In these experiments, I’m training four model sizes: 5M, 25M, 60M, and 125M parameters. I’ve chosen to roughly follow the TinyStories models, reusing the hidden and intermediate dimensions of the TinyStories-1M, -8M, -28M, and -33M models; in each case the MLP intermediate dimension is 4x the hidden dimension. These are just initial architectural choices and may change over the course of the project as I learn more about what makes a better-performing model.
For now, I’m using 8 attention heads for all models and the Llama-2 tokenizer, which has a 32000-token vocabulary.
Model Name | Hidden Dim | Intermediate Dim | Number of Layers | Number of Params |
---|---|---|---|---|
5M | 64 | 256 | 13 | 4_949_696 |
25M | 256 | 1024 | 8 | 24_776_960 |
60M | 512 | 2048 | 6 | 57_940_480 |
125M | 768 | 3072 | 8 | 124_662_528 |
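As a sanity check, the parameter counts in the table can be reproduced from the dimensions above. The helper below is my own back-of-the-envelope sketch (it assumes untied input/output embeddings, two RMSNorms per layer, and a gated MLP with gate/up/down projections, and it ignores any bias terms, which would only add a few thousand parameters):

```python
# Rough parameter-count estimate for a Llama-style model (a sketch, not the
# training code): untied embeddings + LM head, attention, gated MLP, norms.
def estimate_params(vocab: int, hidden: int, intermediate: int, layers: int) -> int:
    embeddings = 2 * vocab * hidden          # token embeddings + untied LM head
    attention = 4 * hidden * hidden          # q, k, v, o projections (no bias)
    mlp = 3 * hidden * intermediate          # gate, up, down projections (no bias)
    norms = 2 * hidden                       # two RMSNorms per layer
    return embeddings + layers * (attention + mlp + norms) + hidden  # + final norm

for name, h, i, l in [("5M", 64, 256, 13), ("25M", 256, 1024, 8),
                      ("60M", 512, 2048, 6), ("125M", 768, 3072, 8)]:
    print(name, estimate_params(32000, h, i, l))
# 5M 4949696, 25M 24776960, 60M 57940480, 125M 124662528
```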
Training Dataset
I’ve tokenized the TinyStories_all_data.tar.gz dataset, which contains 4.9M stories generated by GPT-3.5 and GPT-4, using the meta-llama/Llama-2-7b-hf tokenizer. I haven’t performed any data cleaning (yet). The total number of tokens in this dataset is a little over 1B: 1_028_013_532.
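For illustration, here is a minimal sketch of how a token count like that can be computed with this tokenizer. The actual preprocessing pipeline is more involved; the `stories` list below is just a stand-in for the text extracted from TinyStories_all_data.tar.gz:

```python
from transformers import AutoTokenizer

# Llama-2 tokenizer (a gated repo: requires accepting the license on the Hub).
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# Stand-in for the ~4.9M stories extracted from TinyStories_all_data.tar.gz.
stories = ["Once upon a time, there was a little robot who loved to count tokens."]

total_tokens = 0
batch_size = 1000
for i in range(0, len(stories), batch_size):
    encodings = tokenizer(stories[i : i + batch_size])  # no truncation: count every token
    total_tokens += sum(len(ids) for ids in encodings["input_ids"])

print(f"{total_tokens:_}")  # the full dataset comes out to a little over 1B tokens
```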
Training Duration
I’m training all initial runs for 1 epoch.
Training GPUs
I trained the 5M and 25M models on both L4 (22.5 GB VRAM) and A100 (40GB VRAM) GPUs. I trained the 60M and 125M models on the A100 GPU.
Results
All models were trained for 1 epoch (~1.03B tokens):
Hardware | Model Size | Time (hr) | Batch Size | Max Memory (% of VRAM) | Cost (USD) |
---|---|---|---|---|---|
L4 | 5M | 0.87 | 384 | 20% | $0.18 |
L4 | 25M | 1.45 | 288 | 65% | $0.30 |
A100-40GB | 5M | 0.32 | 2048 | 78% | $0.25 |
A100-40GB | 25M | 0.35 | 1536 | 98% | $0.27 |
A100-40GB | 60M | 0.54 | 1152 | 86% | $0.41 |
A100-40GB | 125M | 1.10 | 512 | 99% | $0.84 |
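The Max Memory column is peak GPU memory as a percentage of the card’s total VRAM. As a rough sketch of one way to capture that number with PyTorch (the actual runs may have logged it differently, e.g. via nvidia-smi):

```python
import torch

def peak_memory_fraction(device: int = 0) -> float:
    """Peak GPU memory allocated during the run, as a fraction of total VRAM (a sketch)."""
    peak = torch.cuda.max_memory_allocated(device)                   # bytes held by tensors at the peak
    total = torch.cuda.get_device_properties(device).total_memory    # total VRAM on the card
    return peak / total

# Usage: call torch.cuda.reset_peak_memory_stats() before training, then after training:
# print(f"max memory: {peak_memory_fraction():.0%}")
```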
Takeaways
From this analysis, only the 5M model makes sense to train on the L4. Training the 25M model on the A100 is 3 cents cheaper overall and about 4x faster, though at 98% memory utilization I’m flirting with OOM, so I should reduce the batch size.
I’ll need to perform longer training runs to get a sense of how many full epochs are needed to produce coherent language-generating models, but from my TinyHackathon experience, it took 20 epochs for the 60M model to perform decently (3/5 LLM Judge overall score). I would expect the 125M model to require fewer epochs, and the smaller models more epochs, to achieve comparable performance. But we’ll see! A naive extrapolation from the measurements above is sketched below.
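To make the epoch question concrete, here’s a back-of-the-envelope extrapolation from the single-epoch A100 measurements in the table above, assuming time and cost scale linearly with the number of epochs (which ignores evaluation and checkpointing overhead):

```python
# Measured single-epoch (hours, USD) on the A100-40GB, taken from the results table.
per_epoch = {
    "5M": (0.32, 0.25),
    "25M": (0.35, 0.27),
    "60M": (0.54, 0.41),
    "125M": (1.10, 0.84),
}

def extrapolate(model: str, epochs: int) -> tuple[float, float]:
    """Linearly scale measured per-epoch time and cost to a multi-epoch run."""
    hours, cost = per_epoch[model]
    return epochs * hours, epochs * cost

# e.g. 20 epochs of the 60M model: about 10.8 A100-hours and roughly $8.20.
print(extrapolate("60M", 20))
```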
Appendix
Here are the `LlamaConfig` objects for each model:
5M
```python
from transformers import LlamaConfig

config = LlamaConfig(
    vocab_size=32000,
    hidden_size=64,
    intermediate_size=256,
    num_hidden_layers=13,
    num_attention_heads=8,
    max_position_embeddings=512,
    rope_theta=10000.0,
    attention_bias=False,
    mlp_bias=True,
    attn_implementation="flash_attention_2",
    torch_dtype="bfloat16"
)
```
25M
```python
config = LlamaConfig(
    vocab_size=32000,
    hidden_size=256,
    intermediate_size=1024,
    num_hidden_layers=8,
    num_attention_heads=8,
    max_position_embeddings=512,
    rope_theta=10000.0,
    attention_bias=False,
    mlp_bias=True,
    attn_implementation="flash_attention_2",
    torch_dtype="bfloat16"
)
```
60M
```python
config = LlamaConfig(
    vocab_size=32000,
    hidden_size=512,
    intermediate_size=2048,
    num_hidden_layers=6,
    num_attention_heads=8,
    max_position_embeddings=512,
    rope_theta=10000.0,
    attention_bias=False,
    mlp_bias=True,
    attn_implementation="flash_attention_2",
    torch_dtype="bfloat16"
)
```
125M
```python
config = LlamaConfig(
    vocab_size=32000,
    hidden_size=768,
    intermediate_size=3072,
    num_hidden_layers=8,
    num_attention_heads=8,
    max_position_embeddings=512,
    rope_theta=10000.0,
    attention_bias=False,
    mlp_bias=True,
    attn_implementation="flash_attention_2",
    torch_dtype="bfloat16"
)
```
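To double-check any of these configs, you can instantiate a randomly initialized model from the config and count its parameters. A minimal sketch using the 125M dimensions (I leave out `attn_implementation` and `torch_dtype` here so it runs on CPU without flash-attn installed):

```python
from transformers import LlamaConfig, LlamaForCausalLM

config = LlamaConfig(
    vocab_size=32000,
    hidden_size=768,
    intermediate_size=3072,
    num_hidden_layers=8,
    num_attention_heads=8,
    max_position_embeddings=512,
    rope_theta=10000.0,
)
model = LlamaForCausalLM(config)      # randomly initialized weights
print(f"{model.num_parameters():_}")  # should land close to 124_662_528
```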