IEEE Big Data 2017: 2nd CAS workshop

Workshop Title: 2nd Computational Archival Science (CAS) workshop
Wednesday, December 13, 2017
Westin Copley Plaza
10 Huntington Avenue, Boston, MA 02116, USA

PART OF: IEEE Big Data 2017
*** There is a 1-day registration option ***

14 presentations from France, Netherlands, UK, Canada, US, Taiwan; 2 demos from GE, US; Student panel on new curricula.


9:00 – 9:15 Welcome

  • Workshop Chairs:
    Mark Hedges (KCL), Victoria Lemieux (UBC), Richard Marciano (U. Maryland)


      • Join our Google Group at:
    • Foundational Paper: Dec. 2017, “Archival records and training in the Age of Big Data”, Marciano, Lemieux, Hedges, Esteva, Underwood, Kurtz, Conrad, accepted for publication. See: LINK.
      • 8 topics: (1) Evolutionary prototyping and computational linguistics, (2) Graph analytics, digital humanities and archival representation, (3) Computational finding aids, (4) Digital curation, (5) Public engagement with (archival) content, (6) Authenticity, (7) Confluences between archival theory and computational methods: cyberinfrastructure and the Records Continuum, and (8) Spatial and temporal analytics.
      • [In: “Advances in Librarianship – Re-Envisioning the MLIS: Perspectives on the Future of Library and Information Science Education”, Editors: Lindsay C. Sarin, Johnna Percell, Paul T. Jaeger, & John Carlo Bertot.]

9:15 – 10:35 Session 1: Exploring Archival Data (talks: 20 mins each)

  • #1: Building new knowledge from distributed scientific corpus; HERBADROP & EUROPEANA: two concrete case studies for exploring big archival data
    [Pascal Dugenie, Nuno Freire, Daan Broeder — CINES, FR & MEERTENS Institut, NL & INESC-ID/Europeana DSI, NL]


    • Computational Methods: EUDAT automated scalable e-infrastructure, integrated computational services

    • Archival Concepts: Trusted digital repositories (TDR), OCR, cultural heritage platforms
  • #2: An Infrastructure and Application of Computational Archival Science to Enrich and Integrate Big Digital Archival Data: Using Taiwan Indigenous Peoples Open Research Data (TIPD) as Example
    [Ji-Ping Lin — Academia Sinica, TW]


    • Computational Methods: Record linking, GIS

    • Archival Concepts: Big archival data
  • #3: Computational Curation of a Digitized Record Series of WWII Japanese-American Internment
    [William Underwood, Richard Marciano, Sandra Laib, Carl Apgar, Luis Beteta, Waleed Falak, Marisa Gilman, Riss Hardcastle, Keona Holden, Yun Huang, David Baasch, Brittni Ballard, Tricia Glaser, Adam Gray, Leigh Plummer, Zeynep Diker, Mayanka Jha, Aakanksha Singh, and Namrata Walanj — University of Maryland, USA]


    • Computational Methods: NLP, NER, GIS, graph database, linked data

    • Archival Concepts: Digital curation, automated metadata extraction
  • #4: The Cybernetics Thought Collective Project: Using Computational Methods to Reveal Intellectual Context in Archival Material
    [Bethany Anderson, Christopher Prom, Kevin Hamilton, James Hutchinson, Mark Sammons, and Alex Dolski — University of Illinois at Urbana-Champaign, USA]


    • Computational Methods: NLP, NER, machine learning

    • Archival Concepts: Geographically dispersed archives

10:35 – 10:45 Questions and Discussion

10:45 – 11:05 Coffee break

11:05 – 12:25 Session 2: Curation and Appraisal (talks: 20 mins each)

  • #5: Towards Automated Quality Curation of Video Collections from a Realistic Perspective
    [Todd Goodall, Maria Esteva, Sandra Sweat, and Alan Bovik — University of Texas, USA]


    • Computational Methods: Feature computing from video records, automated quality prediction, scalable HPC

    • Archival Concepts: Collection assessment, quality-aware metadata for video collections to inform appraisal, preservation, and access decisions, quality detection in videos

  • #6: Line Detection in Binary Document Scans: A Case Study with the International Tracing Service Archives
    [Benjamin Lee — United States Holocaust Memorial Museum, USA]


    • Computational Methods: Line detection, image segmentation

    • Archival Concepts: Classification of archival images
  • #7: Auto-Categorization & Future Access to Digital Archives
    [Nathaniel Payne and Jason Baron — University of British Columbia, CAN & Of Counsel, Drinker Biddle & Reath LLP, USA]


    • Computational Methods: Auto-categorization, auto-classification, e-discovery, machine learning

    • Archival Concepts: Recordkeeping
  • #8: Heuristics for Assessing Computational Archival Science (CAS) Research: The Case of the Human Face of Big Data Project
    [Myeong Lee, Yuheng Zhang, Shiyun Chen, Edel Spencer, Jhon Dela Cruz, Hyeonggi Hong, and Richard Marciano — University of Maryland, USA]


    • Computational Methods: Heuristics for CAS research

    • Archival Concepts: Iterative design, value-sensitive design

12:25 – 12:45 Session 3: CAS Methods (talk: 20 min)

  • #9: What Can a Knowledge Complexity Approach Reveal About Big Data and Archival Practice?
    [Nicola Horsley — The Netherlands Institute for Permanent Access to Digital Research Resources, NL]


    • Computational Methods: Digital narrative with big data

    • Archival Concepts: Knowledge complexity in archives

12:45 – 2:00 Lunch

    2:00 – 3:00 Session 3: CAS Methods, continued (talks: 20 mins each)

    • #10: Protecting Privacy in the Archives: Preliminary Explorations of Topic Modeling for Born-Digital Collections
      [Tim Hutchinson — University of Saskatchewan Library, CAN]


      • Computational Methods: NLP, NER, sentiment analysis

      • Archival Concepts: PII
    • #11: Identifying Epochs in Text Archives
      [Tobias Blanke and Jon Wilson — King’s College London, UK]


      • Computational Methods: Cultural analytics, topic modeling

      • Archival Concepts: Classification of time-coded textual collections into epochs and periods
    • #12: GraphQL for Archival Metadata: An Overview of the EHRI GraphQL API
      [Mike Bryant — King’s College London, UK]


      • Computational Methods: APIs for cultural heritage materials, graph databases

      • Archival Concepts: Structured data interfaces to archival materials

    3:00 – 3:40 Session 4: Creation and Management of Current Records (talks: 20 mins each)

    • #13: The Blockchain Litmus Test
      [Tyler Smith — Adventium Labs, USA]


      • Computational Methods: Blockchain, secure computing

      • Archival Concepts: Decentralized recordkeeping
    • #14: A Typology of Blockchain Recordkeeping Solutions and Some Reflections on their Implications for the Future of Archival Preservation
      [Victoria Lemieux — University of British Columbia, CAN]


      • Computational Methods: Blockchain, computational validation, distributed ledger, computational trust

      • Archival Concepts: Recordkeeping, digital preservation, archival trust

    3:40 – 4:05 Questions and Discussion

    4:05 – 4:25 Coffee break

    4:25 – 4:55 Demos

    4:55 – 5:15 Student Session:

    • Moderator: Michael Kurtz
    • Students: Jennifer Proctor, Claire McDonald, Will Thomas

      Seven graduate students at the U. Maryland participated in a fall 2017 seminar exploring the eight case studies proposed in the 2017 Foundational Paper: “Archival records and training in the Age of Big Data”, Marciano, Lemieux, Hedges, Esteva, Underwood, Kurtz, Conrad, LINK, to be published in “Advances in Librarianship – Re-Envisioning the MLIS: Perspectives on the Future of Library and Information Science Education”, Editors: Lindsay C. Sarin, Johnna Percell, Paul T. Jaeger, & John Carlo Bertot.

      Students offered to discuss educational takeaways and methods of incorporating CAS into Master of Library and Information Science (MLIS) education, in order to better address the needs of today’s MLIS graduates who want to combine ‘traditional’ archival principles with computational methods.

    5:15 Closing Remarks


    Introduction to workshop:
    The large-scale digitization of analog archives, the emerging diverse forms of born-digital archive, and the new ways in which researchers across disciplines (as well as the public) wish to engage with archival material are resulting in disruptions to traditional archival theories and practices. Increasing quantities of ‘big archival data’ present challenges for the practitioners and researchers who work with archival material, but also offer enhanced possibilities for scholarship through the application of computational methods and tools to the archival problem space, and, more fundamentally, through the integration of ‘computational thinking’ with ‘archival thinking’.

    Our working definition of Computational Archival Science (CAS) is:

    Contributing to the development of the theoretical foundations of a new trans-discipline of computer and archival science

    This workshop will explore the conjunction (and its consequences) of emerging methods and technologies around big data with archival practice and new forms of analysis and historical, social, scientific, and cultural research engagement with archives. We aim to identify and evaluate current trends, requirements, and potential in these areas, to examine the new questions that they can provoke, and to help determine possible research agendas for the evolution of computational archival science in the coming years. At the same time, we will address the questions and concerns scholarship is raising about the interpretation of ‘big data’ and the uses to which it is put, in particular appraising the challenges of producing quality – meaning, knowledge and value – from quantity, tracing data and analytic provenance across complex ‘big data’ platforms and knowledge production ecosystems, and addressing data privacy issues.

    This is the 2nd workshop at IEEE Big Data addressing Computational Archival Science. It builds on three earlier workshops on ‘Big Humanities Data’ organized by the same chairs at the 2013-2015 conferences, and more directly on a symposium held in April 2016 at the University of Maryland.

    Research topics covered:
    Topics covered by the workshop include, but are not restricted to, the following:

    • Application of analytics to archival material, including text-mining, data-mining, sentiment analysis, network analysis.
    • Analytics in support of archival processing, including e-discovery, identification of personal information, appraisal, arrangement and description.
    • Scalable services for archives, including identification, preservation, metadata generation, integrity checking, normalization, reconciliation, linked data, entity extraction, anonymization and reduction.
    • New forms of archives, including Web, social media, audiovisual archives, and blockchain.
    • Cyber-infrastructures for archive-based research and for development and hosting of collections
    • Big data and archival theory and practice
    • Digital curation and preservation
    • Crowd-sourcing and archives
    • Big data and the construction of memory and identity
    • Specific big data technologies (e.g. NoSQL databases) and their applications
    • Corpora and reference collections of big archival data
    • Linked data and archives
    • Big data and provenance
    • Constructing big data research objects from archives
    • Legal and ethical issues in big data archives

    Program Chairs:
    Dr. Mark Hedges
    Department of Digital Humanities (DDH)
    King’s College London, UK

    Prof. Victoria Lemieux
    School of Library, Archival and Information Studies
    University of British Columbia, Canada

    Prof. Richard Marciano
    Digital Curation Innovation Center (DCIC)
    College of Information Studies
    University of Maryland, USA

    Program Committee Members:
    The program chairs will serve on the Program Committee, as will the following:

    Dr. Maria Esteva
    Data Intensive Computing
    Texas Advanced Computing Center (TACC), USA

    Dr. Bill Underwood
    Digital Curation Innovation Center (DCIC)
    College of Information Studies
    University of Maryland, USA

    Prof. Michael Kurtz
    Digital Curation Innovation Center (DCIC)
    College of Information Studies
    University of Maryland, USA

    Mark Conrad
    National Archives and Records Administration (NARA)

    Dr. Tobias Blanke
    Department of Digital Humanities
    King’s College London, UK