Estimating Storage and CPU RAM Requirements for Indexing 12.6M Documents

python
information retrieval
deep learning
I index 100k, 250k, 500k, 1M and 2M documents using T4 and RTX6000Ada instances and estimate the storage and CPU RAM requirements for a 12.6M document collection.
Author

Vishal Bakshi

Published

February 12, 2025

Background

After a few days of flailing about trying to index the 12.6M document Genomics dataset (from UKPLab/DAPR) in Google Colab Pro using RAGatouille, I decided to plan the attempt in a more organized way. In this blog post I’ll share my findings and next actions.

Here’s an example text from the corpus:

The 33D1 rat MoAb92  identifies a low-density Ag on mouse (marginal zone) spleen DC. The antibody does not stain DC in cryostat sections and does not react with LC. No biochemical data on the Ag are available. Nonetheless, this antibody has proved extremely useful for C lysis of mouse spleen DC.\r\n

The average length of text in this corpus is ~540 characters.
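
As a rough sanity check, assuming passages is the Hugging Face datasets split used in the indexing code below (an assumption about my setup, not code from the post), the average can be computed like this:

# Rough check of the average passage length (in characters).
# Assumes `passages` is a Hugging Face datasets split with a "text" column,
# as used in the indexing code below.
avg_chars = sum(len(t) for t in passages["text"]) / len(passages)
print(f"Average passage length: {avg_chars:.0f} characters")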

RAG.index

The main function of interest is RAG.index, which takes a list of documents and indexes them in preparation for retrieval.

index_path = RAG.index(
    index_name=f"{dataset_name}_index",
    collection=passages[:ndocs]["text"],
    document_ids=passages[:ndocs]["_id"]
)
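
For context, RAG here is a RAGatouille RAGPretrainedModel. A minimal setup sketch looks something like the following; the checkpoint name is an assumption and the dataset-loading code is omitted:

from ragatouille import RAGPretrainedModel

# Assumption: a standard ColBERTv2 checkpoint; the exact checkpoint may differ.
RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")

# `passages` is assumed to be a Hugging Face datasets split with "_id" and
# "text" columns (from UKPLab/DAPR); `ndocs` is the number of documents to index.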

I used the following code to log CPU RAM usage during indexing, with ndocs defined globally:

import threading
import time
from datetime import datetime

import psutil

def memory_monitor(stop_event, readings):
    # Sample this process's resident memory (in GB) every 5 seconds
    # until the stop event is set.
    while not stop_event.is_set():
        mem = psutil.Process().memory_info().rss / 1024 / 1024 / 1024
        readings.append((datetime.now(), mem))
        time.sleep(5)

def log_memory_during_index():
    # Run the memory monitor in a background thread while RAG.index runs.
    stop_event = threading.Event()
    readings = []
    monitor_thread = threading.Thread(target=memory_monitor, args=(stop_event, readings))
    monitor_thread.start()
    
    try:
        index_path = RAG.index(
            index_name=f"{dataset_name}_index",
            collection=passages[:ndocs]["text"],
            document_ids=passages[:ndocs]["_id"]
        )
    finally:
        stop_event.set()
        monitor_thread.join()
    
    return index_path, readings

index_path, memory_readings = log_memory_during_index()

Memory Logging Results

I used two machines for these experiments:

  • T4 GPU (16 GB vRAM, 51 GB RAM) using Google Colab Pro.
  • RTX6000Ada (48 GB vRAM, 128 GB RAM) using Jarvis Labs.

I chose the following document counts to index:

  • 100k
  • 250k
  • 500k
  • 1M
  • 2M

Here are the results:

RTX6000Ada (48GB vRAM, 128GB RAM)

# Docs   index_path Size   Max RAM   Time
100k     0.41 GB           6.96 GB   4 min
250k     1.1 GB            8.4 GB    6.4 min
500k     2.2 GB            11.4 GB   12 min
1M       4.5 GB            16.3 GB   24 min
2M       9.1 GB            24 GB     47 min

T4 w/High-RAM (16GB vRAM, 51GB RAM)

# Docs   index_path Size   Max RAM   Time
100k     0.41 GB           6.5 GB    8 min
250k     1.1 GB            8.8 GB    20 min
500k     2.2 GB            11.8 GB   36 min
1M       4.5 GB            18.8 GB   78 min
2M       9.1 GB            28.6 GB   145 min

I also used the A100 instance on Google Colab Pro for some initial experiments. It’s interesting to note the difference in speed of encoding 25k passages:

GPU          seconds per 25k passages
RTX6000Ada   12
A100         22
T4           44

Extrapolating to 12.6M Documents

I’ll start with the easier one: the size of the directory created by RAG.index. Doubling the number of documents approximately doubles its size, so if 1M documents take up 4.5 GB of space, I expect 12.6M documents to take up ~54 GB. I’ll set my storage size to 100 GB just in case.
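
As a quick sanity check on that estimate, here is the linear extrapolation using the per-million-document index sizes from the 100k and 2M runs (assuming linear scaling holds all the way to 12.6M):

# Linear extrapolation of index size to 12.6M documents.
low_rate  = 0.41 / 0.1   # ~4.1 GB per 1M docs (100k run)
high_rate = 9.1 / 2.0    # ~4.55 GB per 1M docs (2M run)
target_m  = 12.6         # millions of documents
print(f"Estimated index size: {low_rate * target_m:.0f}-{high_rate * target_m:.0f} GB")
# -> Estimated index size: 52-57 GB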

Estimating the maximum RAM used (CPU RAM, not GPU vRAM) for 12.6M documents is a bit more involved. I’m planning to use the RTX6000Ada machine, so I’ll use its numbers.

RTX6000Ada (48GB vRAM, 128GB RAM)

# Docs   Max RAM   Increase
100k     6.96 GB
250k     8.4 GB    20%
500k     11.4 GB   36%
1M       16.3 GB   43%
2M       24 GB     47%

The growth in the percent increase is slowing down. Let’s say it plateaus at a 50% increase going from 2M to 4M documents (one doubling). 2M to 12.6M is ~2.66 doublings (is that a word?): 24 GB x 1.5^2.66 ≈ 70 GB. If I were using the Colab numbers: 28.6 GB x 1.5^2.66 ≈ 84 GB. When I tried to index 12.6M documents with an A100 High-RAM (83.5 GB CPU RAM) instance on Google Colab Pro, the runtime crashed as it ran out of System RAM, so this checks out.
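
The same compounding calculation in code (the 50% growth per doubling is an assumption, not a measurement):

import math

# Assumed: peak CPU RAM grows ~50% per doubling of document count beyond 2M.
growth = 1.5
doublings = math.log2(12.6e6 / 2e6)   # ~2.66 doublings from 2M to 12.6M

rtx_peak   = 24.0 * growth ** doublings   # RTX6000Ada baseline (2M docs)
colab_peak = 28.6 * growth ** doublings   # T4 Colab baseline (2M docs)
print(f"Estimated peak CPU RAM: ~{rtx_peak:.0f} GB (RTX6000Ada), ~{colab_peak:.0f} GB (Colab)")
# -> ~70 GB (RTX6000Ada), ~84 GB (Colab)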

Finally, let’s say the time it takes to index documents doubles every time the number of documents doubles from 2M onwards: 47 min x 2^2.66 ≈ 300 minutes, or about 5 hours. At about $1/hr, this would cost around $5 on an RTX6000Ada.
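
And the corresponding time estimate, under the assumption that indexing time doubles with each doubling of documents beyond 2M:

import math

doublings = math.log2(12.6e6 / 2e6)      # ~2.66
est_min   = 47 * 2 ** doublings          # 47 min measured for 2M docs on the RTX6000Ada
est_hr    = est_min / 60
print(f"Estimated indexing time: ~{est_min:.0f} min (~{est_hr:.1f} h), ~${est_hr:.0f} at $1/hr")
# -> ~296 min (~4.9 h), ~$5 at $1/hr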

I should note that in all my experiments, the GPU vRAM usage didn’t go past 3-4 GB.

While the peak CPU RAM usage varied, in all instances the plots looked like the following (2M documents on RTX6000Ada):

System RAM Usage over Indexing Time

I couldn’t figure out from my profiler which function call caused the largest spike. Also note the spike near the end, just before indexing finishes.

Final Thoughts

Time will tell if these calculations are worth anything, but it seems like my best option is to use Jarvis Labs’ RTX6000Ada machine with 128GB CPU RAM. Once I successfully index the 12.6M-document Genomics dataset, I’ll have a better estimate for how much it will cost to index the largest dataset in the DAPR collection: MIRACL (32.9M documents). Stay tuned!