Description
I am downloading SRA files with prefetch and then reading them in C++ by iterating over ngs::ReadCollection. All calculations are running in the cloud on AWS. I have found a small number of seemingly "pathological" SRA runs that take longer to read as the iteration progresses through the file.
For example, the graph below shows the time required to read sequential 0.1% chunks of ERR3212419 (the x-axis is the cumulative number of reads read, as a percentage of the total number of reads in the SRA run). As shown in the graph, the first 16% of reads can be read from disk relatively quickly (approximately 2 seconds per 0.1% chunk). However, the time to read the same number of reads then jumps to approximately 12 seconds per chunk, and then jumps again to over 100 seconds per chunk. (I stopped after loading 21% of the reads.)
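For anyone who wants to reproduce this kind of measurement, the per-chunk timing can be sketched roughly as below. This is a minimal sketch, not the exact code behind the graph: `ReadChunk`, `split_into_chunks`, and `time_chunks` are hypothetical helper names, and the callback passed to `time_chunks` is assumed to be where the actual NGS iteration happens (for example, walking the iterator returned by `ngs::ReadCollection::getReadRange(first, count)` with `nextRead()`). Read indices are assumed to be 1-based.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <vector>

// Hypothetical helper type: a contiguous block of reads,
// identified by its 1-based first read index and its length.
struct ReadChunk {
    std::uint64_t first;
    std::uint64_t count;
};

// Split reads 1..total_reads into num_chunks nearly equal contiguous chunks.
// With num_chunks = 1000, each chunk covers about 0.1% of the run.
// Chunks that would be empty (total_reads < num_chunks) are skipped.
std::vector<ReadChunk> split_into_chunks(std::uint64_t total_reads,
                                         std::uint64_t num_chunks) {
    assert(num_chunks > 0);
    std::vector<ReadChunk> chunks;
    const std::uint64_t base = total_reads / num_chunks;
    const std::uint64_t extra = total_reads % num_chunks;
    std::uint64_t first = 1;
    for (std::uint64_t i = 0; i < num_chunks; ++i) {
        // The first `extra` chunks take one additional read each.
        const std::uint64_t count = base + (i < extra ? 1 : 0);
        if (count == 0) continue;
        chunks.push_back({first, count});
        first += count;
    }
    return chunks;
}

// Call read_chunk on each chunk in order and record the wall-clock
// seconds each call takes. read_chunk is supplied by the caller and is
// assumed to read every read in the chunk from the SRA file.
template <typename ReadChunkFn>
std::vector<double> time_chunks(const std::vector<ReadChunk>& chunks,
                                ReadChunkFn&& read_chunk) {
    std::vector<double> seconds;
    seconds.reserve(chunks.size());
    for (const ReadChunk& c : chunks) {
        const auto start = std::chrono::steady_clock::now();
        read_chunk(c);
        const auto stop = std::chrono::steady_clock::now();
        seconds.push_back(std::chrono::duration<double>(stop - start).count());
    }
    return seconds;
}
```

Plotting the returned per-chunk seconds against the chunk index gives a curve of the same kind as the graph above.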
Is there a way to read this SRA record (and records like it), so that the time required to read different parts of the file is even? This is important because I would like to read SRA records (from disk) in parallel, and the uneven time-to-read makes for significant load imbalances. In this example, parallel workers reading near the beginning of the file finish much faster than the parallel worker reading near the end of the file.
