Estimating Storage and CPU RAM Requirements for Indexing 12.6M Documents
Background
After a few days of flailing about trying to index the 12.6M-document Genomics dataset (from UKPLab/DAPR) in Google Colab Pro using RAGatouille, I decided to plan the attempt in a more organized way. In this blog post I’ll share my findings and next actions.
Here’s an example text from the corpus:
The 33D1 rat MoAb92 identifies a low-density Ag on mouse (marginal zone) spleen DC. The antibody does not stain DC in cryostat sections and does not react with LC. No biochemical data on the Ag are available. Nonetheless, this antibody has proved extremely useful for C lysis of mouse spleen DC.\r\n
The average length of text in this corpus is ~540 characters.
RAG.index
The main function of interest is RAG.index, which takes a list of documents and indexes them in preparation for retrieval:
```python
index_path = RAG.index(
    index_name=f"{dataset_name}_index",
    collection=passages[:ndocs]["text"],
    document_ids=passages[:ndocs]["_id"]
)
```
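For context, RAG is a RAGatouille RAGPretrainedModel and passages is a Hugging Face Dataset holding the corpus. A minimal setup sketch looks roughly like the code below; the ColBERT checkpoint and the exact DAPR config/split names are assumptions on my part and may differ slightly on the Hub:

```python
from datasets import load_dataset
from ragatouille import RAGPretrainedModel

dataset_name = "Genomics"
ndocs = 100_000  # how many passages to index in a given run

# Load the DAPR Genomics corpus (config/split names shown here are approximate).
passages = load_dataset("UKPLab/dapr", f"{dataset_name}-corpus", split="test")

# RAGatouille wrapper around a pretrained ColBERTv2 checkpoint.
RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")
```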
I used the following code to log the RAM usage, with ndocs defined globally:
```python
import threading
import time
from datetime import datetime

import psutil


def memory_monitor(stop_event, readings):
    # Sample this process's resident set size (in GB) every 5 seconds.
    while not stop_event.is_set():
        mem = psutil.Process().memory_info().rss / 1024 / 1024 / 1024
        readings.append((datetime.now(), mem))
        time.sleep(5)


def log_memory_during_index():
    stop_event = threading.Event()
    readings = []
    monitor_thread = threading.Thread(target=memory_monitor, args=(stop_event, readings))
    monitor_thread.start()
    try:
        index_path = RAG.index(
            index_name=f"{dataset_name}_index",
            collection=passages[:ndocs]["text"],
            document_ids=passages[:ndocs]["_id"],
        )
    finally:
        stop_event.set()
        monitor_thread.join()
    return index_path, readings


index_path, memory_readings = log_memory_during_index()
```
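Since the readings are just (timestamp, GB) pairs, a few lines of matplotlib are enough to produce the RAM-over-time plots referenced later in this post. A sketch, assuming matplotlib is available in the environment:

```python
import matplotlib.pyplot as plt

# Plot process RSS over the course of an indexing run.
timestamps, gigabytes = zip(*memory_readings)
plt.plot(timestamps, gigabytes)
plt.xlabel("Time")
plt.ylabel("CPU RAM used (GB)")
plt.title(f"RAM usage while indexing {ndocs} documents")
plt.show()
```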
Memory Logging Results
I used two machines for these experiments:
- T4 GPU (16GB vRAM, 51GB RAM) using Google Colab Pro.
- RTX6000Ada (48GB vRAM, 128GB RAM) using Jarvis Labs.
I chose the following numbers of documents to index:
- 100k
- 250k
- 500k
- 1M
- 2M
Here are the results:
RTX6000Ada (48GB vRAM, 128GB RAM)
| # Docs | index_path Size | Max RAM | Time |
|---|---|---|---|
| 100k | 0.41 GB | 6.96 GB | 4 min |
| 250k | 1.1 GB | 8.4 GB | 6.4 min |
| 500k | 2.2 GB | 11.4 GB | 12 min |
| 1M | 4.5 GB | 16.3 GB | 24 min |
| 2M | 9.1 GB | 24 GB | 47 min |
T4 w/High-RAM (16GB vRAM, 51GB RAM)
| # Docs | index_path Size | Max RAM | Time |
|---|---|---|---|
| 100k | 0.41 GB | 6.5 GB | 8 min |
| 250k | 1.1 GB | 8.8 GB | 20 min |
| 500k | 2.2 GB | 11.8 GB | 36 min |
| 1M | 4.5 GB | 18.8 GB | 78 min |
| 2M | 9.1 GB | 28.6 GB | 145 min |
I also used the A100 instance on Google Colab Pro for some initial experiments. It’s interesting to note the difference in speed of encoding 25k passages:
| GPU | seconds/25k |
|---|---|
| RTX6000Ada | 12 |
| A100 | 22 |
| T4 | 44 |
Extrapolating to 12.6M Documents
I’ll start with the easier one: the size of the directory created by RAG.index. Doubling the number of documents approximately doubles its size, so if 1M documents take up 4.5GB of space I expect 12.6M documents to take up ~57GB. I’ll set my storage size to 100GB just in case.
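Written out, the storage estimate is just a linear extrapolation from the 1M-document row above:

```python
# Index size scales roughly linearly with the number of documents.
gb_per_million_docs = 4.5                      # measured at 1M documents
estimated_size_gb = gb_per_million_docs * 12.6 # ~57 GB for 12.6M documents
print(f"{estimated_size_gb:.0f} GB")
```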
The maximum RAM used (by the CPU, not the GPU vRAM) for 12.6M documents is a bit more involved. I’m planning to use the RTX6000Ada machine so I’ll use its numbers.
RTX6000Ada (48GB vRAM, 128GB RAM)
| # Docs | Max RAM | Increase |
|---|---|---|
| 100k | 6.96 GB | – |
| 250k | 8.4 GB | 20% |
| 500k | 11.4 GB | 36% |
| 1M | 16.3 GB | 43% |
| 2M | 24 GB | 47% |
The percent increase per doubling is slowing down. Let’s say it plateaus at a 50% increase per doubling from 2M documents onward. 2M to 12.6M is ~2.66 doublings (log2(12.6/2) ≈ 2.66; is that a word?). 24 GB × 1.5^2.66 ≈ 70GB. If I were using the Colab numbers instead: 28.6 GB × 1.5^2.66 ≈ 84GB. When I tried to index 12.6M documents with an A100 High-RAM (83.5GB CPU RAM) instance on Google Colab Pro, the runtime crashed as it ran out of system RAM, so this checks out.
Finally, let’s say the time it takes to index documents doubles each time the number of documents doubles from 2M onwards. 47 min × 2^2.66 ≈ 300 minutes, or about 5 hours. At about $1/hr, this would cost about $5 on an RTX6000Ada.
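Here are the same growth assumptions as a small script, so the numbers are easy to re-derive. The 50%-per-doubling RAM growth and the 2×-per-doubling time growth are the assumptions stated above, not measurements:

```python
import math

doublings = math.log2(12.6 / 2)            # ~2.66 doublings from 2M to 12.6M docs

# Peak CPU RAM, assuming ~50% growth per doubling beyond 2M documents.
ram_rtx6000ada = 24.0 * 1.5 ** doublings   # ~70 GB
ram_colab_t4   = 28.6 * 1.5 ** doublings   # ~84 GB

# Indexing time, assuming it doubles per doubling beyond 2M documents.
minutes  = 47 * 2 ** doublings             # ~296 min, i.e. about 5 hours
cost_usd = (minutes / 60) * 1.0            # ~$5 at roughly $1/hr

print(f"RAM: {ram_rtx6000ada:.0f} GB / {ram_colab_t4:.0f} GB, "
      f"time: {minutes:.0f} min, cost: ${cost_usd:.0f}")
```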
I should note that in all my experiments, the GPU vRAM usage didn’t go past 3-4 GB.
While the peak CPU RAM usage varied, in all instances the plots looked like the following (2M documents on RTX6000Ada):
I couldn’t figure out from my profiler which function call caused the largest spike. Also note the spike near the end, just before indexing finishes.
Final Thoughts
Time will tell if these calculations are worth anything, but it seems like my best option is to use Jarvis Labs’ RTX6000Ada machine with 128GB CPU RAM. Once I successfully index the 12.6M-document Genomics dataset, I’ll have a better estimate for how much it will cost to index the largest dataset in the DAPR collection: MIRACL (32.9M documents). Stay tuned!