Managing Unstructured Data: 5 Key Strategies for Unlocking the Science of the Past

Selected blog by Jennifer Proctor

While archival science is used to processing collections, sorting them into folders, and creating finding aids, the product of its work is not usually thought of as data, but rather as ‘research materials’, ‘primary sources’, ‘field notes’ or similar. This is slowly starting to change as researchers realize that these collections may contain charts, tables, lexicons, linguistic samples, weather observations, etc., which, with a little effort, can be combined with more recent data to get a valuable picture of change over time or to rediscover information that has been lost in part or in full. However, making these kinds of materials discoverable to those researchers takes a different approach than traditional archival processing, and the scientists may then need particular kinds of assistance in converting this unstructured data into a form they can use.

Get Comfortable with Statistics and Visualizations

The exponential growth of digital records means a significant portion of text- and image-based data is part of massive digital collections which can only be processed using computational analysis and computational finding aids. A picture is worth a thousand words, and the field of data visualization agrees, demonstrating that graphs, infographics, and charts are much more efficient at sense-making than a body of text. This is why computational finding aids, which set out a numerical description of a collection, are often visualizations. Their growth is driven by the fact that these kinds of descriptions are the only practical means of characterizing massive digital collections. But, to quote another proverb, necessity is the mother of invention: far from being a poor substitute for traditional description, computational finding aids are an innovation capable of empowering researchers of the digital age in new ways.

First, these kinds of description can automatically intake and describe enormous volumes of material, offering hope of someday reaching a point where archivists are able to handle the quantity of born digital material being produced every day.

Second, the development of tools and methods for creating computational finding aids using algorithmic description and data visualizations, backed by databases and served online, provides new ways of unlocking value from even the most traditional collections. These kinds of finding aids can be immediately adapted by users through custom queries – they are interactive and dynamic. While some might argue this kind of finding aid is really not a finding aid, but rather only the structure of the database designed to store a collection, this findability model is also very similar to the Semantic Web model and may be the future of all kinds of data storage.
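As a rough illustration of a database-backed description that users can adapt through custom queries, the sketch below loads a handful of item records into an in-memory SQLite database and summarizes them by format and by decade. The records, table layout, and function names are hypothetical and chosen only for this example; a real collection would have its own schema.

```python
import sqlite3

# Hypothetical item-level records: (identifier, format, year created).
ITEMS = [
    ("item-001", "text", 1902),
    ("item-002", "image", 1905),
    ("item-003", "text", 1911),
    ("item-004", "table", 1911),
    ("item-005", "text", 1923),
]


def build_catalog(items):
    """Load item records into an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE items (id TEXT PRIMARY KEY, format TEXT, year INTEGER)"
    )
    conn.executemany("INSERT INTO items VALUES (?, ?, ?)", items)
    return conn


def count_by_format(conn):
    """Return {format: count}, a minimal numerical description."""
    rows = conn.execute("SELECT format, COUNT(*) FROM items GROUP BY format")
    return dict(rows.fetchall())


def count_by_decade(conn, fmt=None):
    """Return {decade: count}, optionally restricted to one format."""
    # Integer division groups years into decades (1902 -> 1900).
    sql = "SELECT (year / 10) * 10 AS decade, COUNT(*) FROM items"
    params = ()
    if fmt is not None:
        sql += " WHERE format = ?"
        params = (fmt,)
    sql += " GROUP BY decade ORDER BY decade"
    return dict(conn.execute(sql, params).fetchall())


if __name__ == "__main__":
    catalog = build_catalog(ITEMS)
    print(count_by_format(catalog))
    print(count_by_decade(catalog, fmt="text"))
```

The counts returned here are the kind of numbers a visualization could be drawn from, and the optional `fmt` argument stands in for the custom queries a researcher might run.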

Dip into Development

Transforming unstructured and/or non-digital data into a usable and discoverable form can require specialized tools from Natural Language Processing, Big Data, Data Visualization, and other fields. For example, transcription of linguistic texts often involves characters which are not part of any standard language character set. This leads to some frustrating challenges, like text boxes that can’t print and/or can’t save these special characters. Luckily, this problem can be addressed in part by using the markup language LaTeX, a standard tool in linguistics for which a variety of free and paid processing software options exist.
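Separately from LaTeX, one quick way to find out whether a given storage format can hold the special characters in a transcription is to test whether they survive being encoded and decoded. The sketch below is a minimal check under that assumption; the function names and the sample string are illustrative only.

```python
def can_round_trip(text, encoding):
    """Return True if `text` survives encoding and decoding unchanged."""
    try:
        return text.encode(encoding).decode(encoding) == text
    except UnicodeError:
        # Raised when the encoding has no representation for a character.
        return False


def unsupported_characters(text, encoding):
    """List the distinct characters in `text` that `encoding` cannot store."""
    return sorted({ch for ch in text if not can_round_trip(ch, encoding)})


if __name__ == "__main__":
    sample = "\u0283 \u014b \u026c"  # three phonetic symbols, space-separated
    for name in ("utf-8", "latin-1"):
        print(name, can_round_trip(sample, name), unsupported_characters(sample, name))
```

A check like this can be run against a candidate tool's export format before committing to it for a whole collection.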

Whatever the particular challenges encountered, the search for a work-around begins in the same place that software development typically begins: requirements analysis. Identify the specific traits needed and/or problems faced. In the linguistics example, this generated a list including:

  • must be able to correctly parse linguistic symbols
  • needs the widest possible variety of supported language character sets
  • should have some way of holding transcriptions of words or phrases in multiple languages in order to enhance findability and data matching (old linguistics documents can have outdated placenames and tribal names, plus the value of these resources to the communities whose heritage they represent means full translation may be an eventual outcome of a researcher’s re-use)
  • the chosen tool should be in wide use or at least be usable by a wide variety of researchers and other cultural heritage institutions (the software itself should not be a barrier to use of the collection)

Once a set of needs is identified, it’s important to begin by researching currently available software tools for solving those or similar problems. Compare and contrast their approaches and evaluate the successes and drawbacks of each. If one matches particularly well, it may be immediately usable. If one is close, partnering with the development team may turn it into a full solution. However, as this use is growing at different rates in different fields, it may be the case that the tools needed for a particular collection of data are only in early development or simply don’t exist yet. Librarians are often introduced to some manner of coding in contemporary Master of Library and Information Science curricula, and this case is an opportunity to use those skills.

Decide How Much Transformation Is Desirable and Feasible on the Archival Side

Archivists are perpetually faced with backlogs – they have dozens of ways they could spend any given working hour – so it’s important to prioritize. Identify what level of processing is necessary to make a dataset findable. For datasets with a high anticipated value for reuse, such as materials on lost indigenous languages or historical weather data, this level should be considered the minimum. For other collections, processing may involve only scanning pages as images and creating metadata, possibly including linked data (to enhance findability through virtual connection of related datasets within the institution and beyond), and indexes to key datasets within larger collections.

Also consider what degree of processing an interested researcher would consider reasonable. In linguistics, transcribing a few dozen pages in LaTeX is a completely reasonable commitment, so a small institution may not need to invest time in coding this information itself. Old genealogical data, however, such as the documents of the Overseas Pension Project, require a high level of processing just to be findable and connected within the collection. For example, determining which of seven Henry Muellers who served with the NY 7th Volunteers a particular document is about would be an important precursor to entering the individual record into a database structure. The identification may or may not be within the capabilities of the average user of genealogical collections, and the entering of records into a database likely is not. Asking too much of an interested user group discourages reuse.
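One way to gauge how much of this identification work a collection needs is to separate the documents that match exactly one known person from those that match several or none. The sketch below does this by name and unit only; the records, field names, and identifiers are invented for illustration and do not come from any actual project data.

```python
# Hypothetical service records; every value here is invented for illustration.
SERVICE_RECORDS = [
    {"person_id": "P1", "name": "Henry Mueller", "unit": "NY 7th Volunteers"},
    {"person_id": "P2", "name": "Henry Mueller", "unit": "NY 7th Volunteers"},
    {"person_id": "P3", "name": "Karl Schmidt", "unit": "NY 7th Volunteers"},
]


def candidates_for(document, records):
    """Return the ids of every known person matching the document's name and unit."""
    return [
        rec["person_id"]
        for rec in records
        if rec["name"] == document["name"] and rec["unit"] == document["unit"]
    ]


def triage(documents, records):
    """Split documents into those with exactly one candidate and those needing review."""
    resolved, needs_review = {}, {}
    for doc in documents:
        matches = candidates_for(doc, records)
        if len(matches) == 1:
            resolved[doc["doc_id"]] = matches[0]
        else:
            # Several candidates (or none): a person has to decide.
            needs_review[doc["doc_id"]] = matches
    return resolved, needs_review
```

The size of the `needs_review` group gives a rough sense of how much expert identification would be required before records could be entered into a database.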

Finally, consider constraining factors like the scale and scope of a collection and the cost and time commitment for particular processing levels, and aim for a similar level of processing across the full body of the collection. Without taking these into consideration from the start, a processing (or restructuring) project is more likely to be abandoned before it reaches a level of completeness which is usable to researchers. Striking the right balance of processing and efficiency is therefore vitally important.

Connect with Subject Expert(s)

Making complete sense of unstructured data may require specialized knowledge, so it’s important to involve experts, not only in the appraisal and outreach phases, but also in processing. This may be as minimal as asking an expert to provide or confirm for the archivists any relevant subject-specific terminology to describe the data – ‘signposts’ or column headers to fill out a new structure for that data. On the other hand, it can be more intense. For example, old linguistics documents often make use of idiosyncratic or outdated phonetic symbols. Experts can identify these and, if necessary, can help create a key to ‘translate’ each archaic symbol into contemporary standards to prevent data loss and increase findability. Full transcription of such documents would be impossible without a knowledge of phonetics that archivists, or even subject librarians, may not possess. Larger institutions, or those attached to or partnered with universities, can usually find volunteers to do such transcriptions as long as the technology is in place to create and manage the documents correctly, though any sufficiently valuable dataset will interest some experts within its field enough to engage them.
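Once an expert has supplied such a key, applying it can be largely mechanical. The sketch below shows one possible approach: a substitution table applied to a transcription, with a count of each substitution kept so the change can be documented. The three-entry key is illustrative only, and a real key would need to be provided and confirmed by a subject expert for the specific documents.

```python
import unicodedata

# Illustrative key only: a subject expert would supply and confirm the real one.
SYMBOL_KEY = {
    "\u0161": "\u0283",   # s with caron  -> esh
    "\u017e": "\u0292",   # z with caron  -> ezh
    "\u010d": "t\u0283",  # c with caron  -> t + esh
}


def modernize(text, key):
    """Replace each archaic symbol and report how many substitutions were made.

    Both the text and the key are normalized to NFC first so that accented
    letters typed as base letter plus combining mark are still matched.
    """
    nfc_key = {unicodedata.normalize("NFC", old): new for old, new in key.items()}
    text = unicodedata.normalize("NFC", text)
    counts = {old: text.count(old) for old in nfc_key if old in text}
    return text.translate(str.maketrans(nfc_key)), counts
```

Keeping the returned counts alongside the key gives a simple record of what was changed in the transcription process.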

Learn to Think of Metadata as Part of the Collection

When it comes to data reuse, having data about data is vital. Whether it is documentation of transformations (such as the key for a linguistics document showing substitutions made in the transcription process), the provenance of that data, the methodology of its collection, or even background and other research by the original data creator, having data about data can make or break information for reuse in all fields and contexts. To meet that need, archivists must begin compiling it from the beginning of their work with a collection: collecting that information, keeping it with the data itself, and including it in easily findable and downloadable ways with all disseminations of the dataset. Experts can help with understanding what data about the data to collect, but the value of archival training is also important; while other fields interested in unstructured data – business, data science, etc. – will often say not to keep duplicates of datasets (except perhaps a backup copy), archivists know that the decision to deaccession copies must be based on a variety of factors, including what, if any, transformation has been done in the process of giving unstructured data structure, the long-term preservation considerations for old and new formats, and the costs of storing different media types.
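One simple way to keep this information with the data itself is to write a metadata file next to every data file, so the two travel together whenever the dataset is copied or downloaded. The sketch below assumes a JSON sidecar file; the file-naming convention and the metadata fields are hypothetical choices for this example, not an established standard.

```python
import json
from pathlib import Path


def write_with_sidecar(data_path, text, metadata):
    """Save `text` to `data_path` and its metadata to a JSON file beside it.

    Returns the path of the sidecar file, named '<data file name>.metadata.json'.
    """
    data_path = Path(data_path)
    data_path.write_text(text, encoding="utf-8")
    sidecar = data_path.with_name(data_path.name + ".metadata.json")
    # ensure_ascii=False keeps special characters readable in the JSON file.
    sidecar.write_text(
        json.dumps(metadata, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return sidecar


def read_sidecar(data_path):
    """Load the metadata stored beside `data_path`."""
    data_path = Path(data_path)
    sidecar = data_path.with_name(data_path.name + ".metadata.json")
    return json.loads(sidecar.read_text(encoding="utf-8"))
```

A metadata dictionary passed to `write_with_sidecar` might hold entries such as the provenance of the source, the collection methodology, and the list of transformations applied during transcription.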

Curation of unstructured digital and digitized data is truly a space where archival skills, data management, subject expertise, and computer science must meet, not only to manage data collections, but also to unlock that data for use.