The Digital Curation Innovation Center (DCIC) is building a 100-million-file data observatory (called CI-BER, for “cyberinfrastructure for billions of electronic records”) to analyze big record sets, provide training datasets, and teach students practical digital curation skills. The DCIC is contributing to a $10.5M project called “Brown Dog”, funded by the National Science Foundation’s Data Infrastructure Building Blocks (DIBBs) program, with partners at the National Center for Supercomputing Applications (NCSA) at the University of Illinois.
Brown Dog is a set of extensible and distributed data transformation services being developed by our partners at NCSA. These web-scale services include file format conversion, named the Data Access Proxy (DAP), and metadata extraction from file contents, named the Data Tilling Service (DTS). For more information on how you can use the Brown Dog services, please see the project website.
Our case study explores how the Brown Dog services can be applied within a large organization’s archives to reveal the data within the diverse file formats of archival collections. We are developing a model architecture for a born-digital repository that uses the Brown Dog services in its workflows and includes a scalable mix of search and analysis services, notably Indigo (Cassandra), Elasticsearch, and Kibana. We also develop file conversion and extraction tools that enhance our understanding of the archival materials in our case study. These tools are contributed to the growing Brown Dog tools catalog, which is designed to accept community contributions.
Lastly, Brown Dog is designed to provide services to the general public at web scale. To prepare for this demand, the 100 million files in the CI-BER data set are used to systematically test the Brown Dog service APIs. These tests include load tests, to ensure that performance does not degrade under load, and qualitative tests of the services’ response to diverse file formats.
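
As a rough illustration of the load-testing idea, the sketch below sends the same batch of requests at increasing levels of concurrency and compares median latency against the lowest level. It is a minimal sketch under stated assumptions, not the project's actual test harness: the function names, the `tolerance` threshold, and `fake_service` (a stand-in that simply sleeps instead of calling a real conversion or extraction endpoint) are all hypothetical.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor


def timed_call(send_request, payload):
    """Return the wall-clock seconds that one request takes."""
    start = time.perf_counter()
    send_request(payload)
    return time.perf_counter() - start


def run_load_level(send_request, payloads, concurrency):
    """Send every payload using `concurrency` worker threads and summarize latency."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(lambda p: timed_call(send_request, p), payloads))
    return {
        "concurrency": concurrency,
        "requests": len(latencies),
        "median_s": statistics.median(latencies),
        "max_s": max(latencies),
    }


def degrades(results, tolerance=2.0):
    """True if any later level's median latency exceeds `tolerance` times the first level's."""
    baseline = results[0]["median_s"]
    return any(r["median_s"] > tolerance * baseline for r in results[1:])


def fake_service(payload):
    # Hypothetical stand-in for a real HTTP call to a service endpoint.
    time.sleep(0.01)


if __name__ == "__main__":
    files = [f"file-{i}" for i in range(40)]  # placeholder file identifiers
    results = [run_load_level(fake_service, files, c) for c in (1, 4, 8)]
    for r in results:
        print(r)
    print("degraded:", degrades(results))
```

In a real test, `fake_service` would be replaced by a function that submits a file to the service under test, and the payloads would be drawn from the test collection so that the run also exercises a mix of file formats.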