Here are at UMD we have a proving ground for large scale archives. This week we are loading some small, medium and large collections into our new repository system, DRASTIC, to see how it performs.
First we are going to load them without triggering any automated workflow in response, i.e. no follow up processing will happen. This will give us a performance metric for bulk loading collections. Then we can compare this metric with a variety of other measurements. These include running the Elasticsearch and Brown Dog-based workflows (indexing, extraction, full text conversion) with the collections already residing in the repository, as well as ingest and workflow in tandem. We expect the workflow to introduce delays, as limited worker processes will have to perform both ingest and workflow tasks.
The loading process is recursive. Our collection files are supplied by an Nginx web server that also produces tidy JSON indexes of folder content. By creating an ingest task for a top level folder URL, we instruct the workers to ingest the entire collection recursively. This means that the first folder ingest task will create more ingest tasks for all its sub-folders. The sub-folders tasks create ingest tasks for *their* sub-folders, and so on..
Sidenote: If we allow these recursive folder ingest tasks to execute right away, we may quickly fill our task queue with ingest tasks. I managed to fill up a server’s disk space in this way. If dealing with very large collections, in the millions, this may put too much strain on your message queue system, especially if ingest tasks will also result in further workflow tasks. The solution we can up with was to delay execution of the recursive folder ingest tasks, until the queue size was below a certain threshold. The threshold has to be low enough not to overwhelm your message queue system, but also high enough that you don’t create a “deadlock” situation, where the queue is completely filled with folder ingest tasks and none of them can run. (This can happen if you have a folder that happens to have more sub-folders than your chosen max queued tasks number.)
Benchmark Collections (Federal Record Groups):
- 267 – Records of the Supreme Court of the United States (small)
- 1.5 Gigabytes in 4,268 files
- Ingest time 0:16:33 (w/o workflow)
- 064 – Records of the National Archives and Records Administration
- 5.3 Gigabytes in 9432 files
- Attempt #1 incomplete
- 1,408 files ingested in 29 minutes
- failed lookup of ACLs for parents
- fixed by increasing consistency level to LOCAL_QUORUM
- Attempt #2 complete
- 9426 files ingested in 1:41:24
- 060 – General Records of the Department of Justice
- 2.6 GB in 26087 files
- Ingested Attempt #1 complete
- 26087 files in 2:03:43
- C* combined table space: 376724740 – 370229768 = 6,494,972 (3 replicas per object)
- 443 – Records of the National Institutes of Health (medium)
- 174.5 Gigabytes 28,275 files
- Attempt #1: Ingest time 1:04:16 (ONLY 22,693 FILES WERE INGESTED)
- Some Drastic Cassandra operations were timed out on the Cassandra client side (python driver). Will increase the timeout from 10 to 60 seconds.
- Attempt #2: TBD
- 266 – Records of the Securities and Exchange Commission (large)
- 1678.8 Gigabytes in 7.4 Million files
- Ingest time – TBD
I’ll keep updating this post as the benchmarks are determined.