IEEE Big Data 2017: 2nd CAS workshop

Workshop Title: 2nd Computational Archival Science (CAS) workshop
Wednesday, December 13, 2017
Westin Copley Plaza
10 Huntington Avenue, Boston, MA 02116
Boston, USA

PART OF: IEEE Big Data 2017
*** There is a 1-day registration option ***

14 presentations from France, Netherlands, UK, Canada, US, Taiwan; 2 demos from GE, US; Student panel on new curricula.

9:00 – 9:15 Welcome

  • Workshop Chairs:
    Mark Hedges (KCL), Victoria Lemieux (UBC), Richard Marciano (U. Maryland)

    • Join our Google Group at:
    • Foundational Paper: Dec. 2017, “Archival records and training in the Age of Big Data”, Marciano, Lemieux, Hedges, Esteva, Underwood, Kurtz, Conrad, accepted for publication. See: LINK.
      • 8 topics: (1) Evolutionary prototyping and computational linguistics, (2) Graph analytics, digital humanities and archival representation, (3) Computational finding aids, (4) Digital curation, (5) Public engagement with (archival) content, (6) Authenticity, (7) Confluences between archival theory and computational methods: cyberinfrastructure and the Records Continuum, and (8) Spatial and temporal analytics.
      • [In: “Advances in Librarianship – Re-Envisioning the MLIS: Perspectives on the Future of Library and Information Science Education”, Editors: Lindsay C. Sarin, Johnna Percell, Paul T. Jaeger, & John Carlo Bertot.]

9:15 – 10:35 Session 1: Exploring Archival Data (talks: 20 mins each)

  • #1: Building new knowledge from distributed scientific corpus; HERBADROP & EUROPEANA: two concrete case studies for exploring big archival data
    [Pascal Dugenie, Nuno Freire, Daan Broeder — CINES, FR & MEERTENS Institut, NL & INESC-ID/Europeana DSI, NL]

    Slides — Paper

    • *Abstract: This paper presents approaches for building new knowledge using emerging methods and big data technologies together with archival practices. Two case studies have been considered. The first one, called HERBADROP, is concerned with the preservation and analysis of herbarium images. The second one, called EUROPEANA, investigates how to facilitate the re-use of cultural heritage language resources for research purposes. The common point between these two case studies is that they are both concerned with the use of valuable heritage resources within the EUDAT (European Data) infrastructure. HERBADROP leverages the data services provided by EUDAT for long-term preservation, while EUROPEANA leverages EUDAT to achieve citability and persistent identification of cultural heritage datasets. EUDAT is an initiative of some of the main European data centers, together with community research infrastructure organisations, to build a common eInfrastructure for general research data management. In this paper, we show how technological trends may offer new research potential in the domain of computational archival science, in particular appraising the challenges of producing quality, meaning, knowledge and value from quantity, and tracing data and analytic provenance across complex big data platforms and knowledge production ecosystems.
  • #2: An Infrastructure and Application of Computational Archival Science to Enrich and Integrate Big Digital Archival Data: Using Taiwan Indigenous Peoples Open Research Data (TIPD) as Example
    [Ji-Ping Lin — Academia Sinica, TW]

    Slides — Paper

    • *Abstract: This paper highlights research on constructing a big archival data set called Taiwan Indigenous Peoples Open Research Data (TIPD), based on contemporary census and household registration data sets from 2013-2017. TIPD utilizes record linkage, geocoding, and high-performance in-memory computing technology to construct various dimensions of Taiwan Indigenous Peoples (TIPs) demographics and developments. Computational archival science and data science are embedded in collecting, cleaning, processing, exploring, and enriching individual digital records. TIPD consists of three categories of archival open data: (1) categorical data, (2) household structure and characteristics data, and (3) population dynamics data, including cross-sectional time-series categorical data, longitudinally linked population dynamics data, life tables, household statistics, micro genealogy data, marriage practice and ethnic identity data, internal migration data, geocoded data, etc. TIPD big archival data not only help unveil contemporary TIPs demographics and various developments, but also help overcome research barriers and unleash creativity for TIPs studies.
    • Keywords: identity; genealogy; in-memory computing; open data; record linkage; TIPD
  • #3: Computational Curation of a Digitized Record Series of WWII Japanese-American Internment
    [William Underwood, Richard Marciano, Sandra Laib, Carl Apgar, Luis Beteta, Waleed Falak, Marisa Gilman, Riss Hardcastle, Keona Holden, Yun Huang, David Baasch, Brittni Ballard, Tricia Glaser, Adam Gray, Leigh Plummer, Zeynep Diker, Mayanka Jha, Aakanksha Singh, and Namrata Walanj — University of Maryland, USA]

    Slides — Paper

    • *Abstract: This paper describes the linguistic analysis of index note cards from record series of the World War II Japanese-American Internment Camps that are in the custody of the National Archives. It also describes the use of GATE Developer, and an extension of ANNIE, a GATE plugin, in linguistic processing of information specific to index note cards in order to extract metadata supporting access and archival decisions regarding record release and withdrawal. The content of the index cards will be interpreted as OWL/RDF statements. Those statements will be stored in a graph database and used with objects such as digital maps and photos to produce an interactive user interface to exhibit events at relocation centers.
    • Keywords: NLP, NER, World War II Japanese-American Internment Camps, Computational Archival Science
  • #4: The Cybernetics Thought Collective Project: Using Computational Methods to Reveal Intellectual Context in Archival Material
    [Bethany Anderson, Christopher Prom, Kevin Hamilton, James Hutchinson, Mark Sammons, and Alex Dolski — University of Illinois at Urbana-Champaign, USA]

    Slides — Paper

    • *Abstract: This paper discusses “The Cybernetics Thought Collective: A History of Science and Technology Portal Project,” a collaborative effort among four institutions that maintain archival records vital to the exploration of cybernetic history—the University of Illinois at Urbana-Champaign, the American Philosophical Society, the British Library, and MIT. With recent grant funding from the NEH, the multi-institutional team is developing a prototype web-portal and analysis-engine to provide access to archival material related to the development of the field of cybernetics, which influenced the development of modern computing and provided a common language to articulate similar questions about behavior across disciplines—regardless of whether the subject of study was animal, machine, or social group. The project is also enabling the digitization of the personal archives of four founders of cybernetics—Heinz von Foerster, W. Ross Ashby, Warren S. McCulloch, and Norbert Wiener. Using computational methods based on advanced machine-learning algorithms to yield network and entity relationship maps from the digitized texts, this project seeks to create access to archival material that enables humanities scholars to better understand the development of cybernetic ideas and enables scientists and engineers to reuse and access cybernetic data.
    • Keywords: digital archives, cybernetics, named entities, natural language processing, machine learning

10:35 – 10:45 Questions and Discussion

10:45 – 11:05 Coffee break

11:05 – 12:25 Session 2: Curation and Appraisal (talks: 20 mins each)

  • #5: Towards Automated Quality Curation of Video Collections from a Realistic Perspective
    [Todd Goodall, Maria Esteva, Sandra Sweat, and Alan Bovik — University of Texas, USA]

    Slides — Paper

    • *Abstract: We investigate the use of automated Video Quality Assessment (VQA) algorithms to evaluate digital video collections. These algorithms are driven by well-defined natural scene statistics (NSS), which capture the behavior of natural distortion-free videos. Because human vision has adapted to these real-world statistics over the course of evolution, quality predictions delivered by these NSS-based VQA algorithms correlate well with human opinions of quality. In particular, we expect these algorithms to accurately predict quality on sizable and diverse video collections. To test this hypothesis, we gathered a testbed of video clips that represent a larger video art collection. Next, we conducted a human study in which users scored the quality of the clips. Enabled by the human study, we trained three VQA algorithms (Video BLIINDS, BRISQUE, and VIIDEO) using our testbed collection to assess a real-world digital video art collection from our university museum. Two of the algorithms provided good automatic predictions of the quality of the videos. These same algorithms also highlighted limitations that arise when assessing artistic collections. We present current research progress and discuss future directions for testbed and algorithm improvement. Our ongoing effort furthers the field of Computational Archival Science by applying computational models of human perception to video appraisal and preservation tasks.
  • #6: Line Detection in Binary Document Scans: A Case Study with the International Tracing Service Archives
    [Benjamin Lee — United States Holocaust Memorial Museum, USA]

    Slides — Paper

    • *Abstract: In this short paper, I present my in-progress work on a method of line detection in binary document scans that is capable of differentiating solid and dotted lines. This method entails post-processing candidate lines detected using the progressive probabilistic Hough line transform by filtering out false positives. Solid lines are identified by performing a cut on the average pixel value of the pixels along each candidate line, and dotted lines are identified by performing a cut on the dominant frequency of the Fast Fourier Transform of the same pixel values along each candidate line. I demonstrate the efficacy of this method by running this algorithm on a subset of binary TIF images from the International Tracing Service digitized archives, one of the world’s largest collections of Holocaust-related documents. In the case of the International Tracing Service archive, classifying documents based on line structure provides an effective method of extracting information from the documents in an automated fashion, an otherwise intractable endeavor due to low scan quality and the prevalence of handwritten text throughout the archive. My proposed method of identifying line structure represents the first step in this proposed pipeline of classifying International Tracing Service documents by line structure.
    • Keywords: line detection; dotted lines; computational archival science; International Tracing Service; Holocaust research
  • #7: Auto-Categorization & Future Access to Digital Archives
    [Nathaniel Payne and Jason Baron — University of British Columbia, CAN & Of Counsel, Drinker Biddle & Reath LLP, USA]

    Slides — Paper

    • *Abstract: Archivists and records managers would benefit from a greater understanding of the use and effectiveness of various machine learning methods, especially in the related context of electronic discovery. However, the binary classification methods used in advanced search techniques in the e-discovery space may or may not prove efficacious where the information task involves sorting records into multiple categories. A survey of the landscape of machine learning methods reveals areas of potential weakness, which in turn serve as a starting point for future research in the computational archives space.
    • Keywords: auto-categorization, auto-classification, binary classification, e-discovery, machine learning, coverage, detail
  • #8: Heuristics for Assessing Computational Archival Science (CAS) Research: The Case of the Human Face of Big Data Project
    [Myeong Lee, Yuheng Zhang, Shiyun Chen, Edel Spencer, Jhon Dela Cruz, Hyeonggi Hong, and Richard Marciano — University of Maryland, USA]

    Slides — Paper

    • *Abstract: Computational Archival Science (CAS) has been proposed as a trans-disciplinary field that combines computational and archival thinking. To provide grounded evidence, a foundational paper explored eight initial themes that constitute potential building blocks [1]. In order for a CAS community to emerge, further studies are needed to test this framework. While the foundational paper for CAS provides a conceptual and theoretical basis of this new field, there is still a need to articulate useful guidelines and checkpoints that validate a CAS research agenda. In this position paper, we propose heuristics for assessing emerging CAS-related studies that researchers from traditional fields can use in their research design stage. The Human Face of Big Data project, a digital curation and interface design project for urban renewal data, is presented and analyzed to demonstrate the validity of the suggested heuristics.
    • Keywords: Computational Archival Science; assessment heuristics; urban renewal; data platform; digital curation

12:25 – 12:45 Session 3: CAS Methods (talk: 20 min)

  • #9: What Can a Knowledge Complexity Approach Reveal About Big Data and Archival Practice?
    [Nicola Horsley — The Netherlands Institute for Permanent Access to Digital Research Resources, NL]

    Slides — Paper

    • *Abstract: As one of the major technological concepts driving ICT development today, big data has been touted as offering new forms of analysis of research data. Its application has reached out across disciplines but some research sources and archival practices do not sit comfortably within the computational turn and this has sparked concerns that cultural heritage collections that cannot be structured, represented, or, indeed, digitised accordingly may be excluded and marginalised by this new paradigm. This work-in-progress paper reports on the contribution of the KPLEX project’s knowledge complexity approach to understanding the relationship between big data and archival practice.
    • Keywords: big data; knowledge complexity; digital humanities

12:45 – 2:00 Lunch

2:00 – 3:00 Session 3: CAS Methods, cont. (talks: 20 mins each)

    • #10: Protecting Privacy in the Archives: Preliminary Explorations of Topic Modeling for Born-Digital Collections
      [Tim Hutchinson — University of Saskatchewan Library, CAN]

      Slides — Paper

      • *Abstract: Natural language processing (NLP) is an area of increased interest for digital archivists, although most research to date has focused on digitized rather than born-digital collections. This study in progress explores whether NLP techniques can be used effectively to surface documents requiring restrictions due to their personal information content. This phase of the research focuses on using topic modeling to find records relating to human resources. Early results show some promise, but suggest that topic modeling on its own will not be sufficient; other techniques to be explored include sentiment analysis and named entity extraction.
      • Keywords: topic modeling, natural language processing, NLP, personal information, digital archives
    • #11: Identifying Epochs in Text Archives
      [Tobias Blanke and Jon Wilson — King’s College London, UK]

      Slides — Paper

      • *Abstract: This paper develops an automated approach to the ‘distant reading’ of textual archives in order to classify epochs in the use of language and examine their particular characteristics. It classifies epochs by applying a series of standardised dictionaries to map the semantics of government documents, using the changing frequency of terms in these dictionaries to identify moments of rupture in language. It then tests a variety of techniques to chart the relationship between the changing shape of individual linguistic elements and aggregate patterns, particularly topic models and word2vec word embeddings. The result is a set of largely automated tools for understanding the structure of digital textual archives.
      • Keywords: Computational Archives, Digital History, Cultural Analytics
    • #12: GraphQL for Archival Metadata: An Overview of the EHRI GraphQL API
      [Mike Bryant — King’s College London, UK]

      Slides — Paper

      • *Abstract: The European Holocaust Research Infrastructure (EHRI) portal provides transnational access to archival metadata relating to the Holocaust. A GraphQL API has recently been added to the portal in order to expand access to EHRI data in structured form. The API defines a schema which mediates access to EHRI’s graph-based data store, catering to both targeted and bulk metadata retrieval across a range of interrelated data types. This short paper provides an overview of the GraphQL API and illustrates a number of use-cases for the capturing of structured archival metadata.
      • Keywords: Archives, APIs, Structured data.

3:00 – 3:40 Session 4: Creation and Management of Current Records (talks: 20 mins each)

    • #13: The Blockchain Litmus Test
      [Tyler Smith — Adventium Labs, USA]

      Slides — Paper

      • *Abstract: Bitcoin’s underlying blockchain database is a novel approach to recordkeeping that has the potential to decentralize big data. Bitcoin’s success has inspired a multitude of spinoff projects hoping to use blockchain as a distributed database for records-management innovation in other fields. Misconceptions and exaggerations about blockchain and its capabilities are pervasive in the media. Drawing on perspectives from archival science, dependable computing, and secure computing, this paper surveys current applications, research, and critiques of blockchain to provide an objective assessment of its benefits and limitations. Based on the findings of the survey, this paper proposes three criteria that predict success for blockchain-based data management projects, briefly: dependability, security, and trust.
      • Keywords: bitcoin; blockchain; archival science; dependability; security; trust; dependable and secure computing
    • #14: A Typology of Blockchain Recordkeeping Solutions and Some Reflections on their Implications for the Future of Archival Preservation
      [Victoria Lemieux — University of British Columbia, CAN]

      Slides — Paper

      • *Abstract: This paper presents a synthesis of original research documenting several cases of the application of blockchain technology to land transaction, medical, and financial record keeping. Using a thematic synthesis of the cases, the paper describes a typology of blockchain solutions for managing current records representing three distinct design patterns. It then considers the different types of solutions in relation to implications for recordkeeping and long-term preservation of authentic records.
      • Keywords: blockchain; distributed ledger; recordkeeping; digital preservation

3:40 – 4:05 Questions and Discussion

4:05 – 4:25 Coffee break

4:25 – 4:55 Demos

4:55 – 5:15 Student Session:

      • Seven graduate students at U. Maryland participated in a fall 2017 seminar exploring the eight case studies proposed in the 2017 Foundational Paper: “Archival records and training in the Age of Big Data”, Marciano, Lemieux, Hedges, Esteva, Underwood, Kurtz, Conrad, to be published in “Advances in Librarianship – Re-Envisioning the MLIS: Perspectives on the Future of Library and Information Science Education”, Editors: Lindsay C. Sarin, Johnna Percell, Paul T. Jaeger, & John Carlo Bertot.

        The case studies included: (1) Evolutionary prototyping and computational linguistics, (2) Graph analytics, digital humanities and archival representation, (3) Computational finding aids, (4) Digital curation, (5) Public engagement with (archival) content, (6) Authenticity, (7) Confluences between archival theory and computational methods: cyberinfrastructure and the Records Continuum, and (8) Spatial and temporal analytics.

        Students offered to discuss educational takeaways and methods of incorporating CAS into Master of Library and Information Science (MLIS) education, in order to better address the needs of today’s MLIS graduates looking to employ ‘traditional’ archival principles in conjunction with computational methods.

5:15 Closing Remarks

      Introduction to workshop:
      The large-scale digitization of analog archives, the emerging diverse forms of born-digital archives, and the new ways in which researchers across disciplines (as well as the public) wish to engage with archival material are resulting in disruptions to traditional archival theories and practices. Increasing quantities of ‘big archival data’ present challenges for the practitioners and researchers who work with archival material, but also offer enhanced possibilities for scholarship through the application of computational methods and tools to the archival problem space, and, more fundamentally, through the integration of ‘computational thinking’ with ‘archival thinking’.

      Our working definition of Computational Archival Science (CAS) is:

      Contributing to the development of the theoretical foundations of a new trans-discipline of computer and archival science

      This workshop will explore the conjunction (and its consequences) of emerging methods and technologies around big data with archival practice and new forms of analysis and historical, social, scientific, and cultural research engagement with archives. We aim to identify and evaluate current trends, requirements, and potential in these areas, to examine the new questions that they can provoke, and to help determine possible research agendas for the evolution of computational archival science in the coming years. At the same time, we will address the questions and concerns scholarship is raising about the interpretation of ‘big data’ and the uses to which it is put, in particular appraising the challenges of producing quality – meaning, knowledge and value – from quantity, tracing data and analytic provenance across complex ‘big data’ platforms and knowledge production ecosystems, and addressing data privacy issues.

      This is the 2nd workshop at IEEE Big Data addressing Computational Archival Science, following the 1st CAS workshop. It builds on three earlier workshops on ‘Big Humanities Data’ organized by the same chairs at the 2013-2015 conferences, and more directly on a symposium held in April 2016 at the University of Maryland.

      Research topics covered:
      Topics covered by the workshop include, but are not restricted to, the following:

      • Application of analytics to archival material, including text-mining, data-mining, sentiment analysis, network analysis.
      • Analytics in support of archival processing, including e-discovery, identification of personal information, appraisal, arrangement and description.
      • Scalable services for archives, including identification, preservation, metadata generation, integrity checking, normalization, reconciliation, linked data, entity extraction, anonymization and reduction.
      • New forms of archives, including Web, social media, audiovisual archives, and blockchain.
      • Cyber-infrastructures for archive-based research and for development and hosting of collections
      • Big data and archival theory and practice
      • Digital curation and preservation
      • Crowd-sourcing and archives
      • Big data and the construction of memory and identity
      • Specific big data technologies (e.g. NoSQL databases) and their applications
      • Corpora and reference collections of big archival data
      • Linked data and archives
      • Big data and provenance
      • Constructing big data research objects from archives
      • Legal and ethical issues in big data archives

      Program Chairs:
      Dr. Mark Hedges
      Department of Digital Humanities (DDH)
      King’s College London, UK

      Prof. Victoria Lemieux
      School of Library, Archival and Information Studies
      University of British Columbia, Canada

      Prof. Richard Marciano
      Digital Curation Innovation Center (DCIC)
      College of Information Studies
      University of Maryland, USA

      Program Committee Members:
      The program chairs will serve on the Program Committee, as will the following:

      Dr. Maria Esteva
      Data Intensive Computing
      Texas Advanced Computing Center (TACC), USA

      Dr. Bill Underwood
      Digital Curation Innovation Center (DCIC)
      College of Information Studies
      University of Maryland, USA

      Prof. Michael Kurtz
      Digital Curation Innovation Center (DCIC)
      College of Information Studies
      University of Maryland, USA

      Mark Conrad
      National Archives and Records Administration (NARA)

      Dr. Tobias Blanke
      Department of Digital Humanities
      King’s College London, UK