IEEE Big Data 2018: 3rd CAS workshop

Workshop Title: 3rd Computational Archival Science (CAS) workshop
Wednesday, Dec. 12, 2018
Seattle, WA

PART OF: IEEE Big Data 2018
*** There is a 1-day registration option ***

Important dates:

  • Oct 22, 2018 (extended from Oct 10): Due date for full workshop paper submissions
  • Oct 29, 2018: Notification of paper acceptance to authors
  • Nov 15, 2018: Camera-ready of accepted papers
  • Dec 12, 2018 (Wednesday): Workshop

Paper Submission
All papers accepted for the workshop will be included in the Proceedings published by the IEEE Computer Society Press, made available at the IEEE Big Data Conference.

Please submit a full-length paper (up to 10 pages, IEEE 2-column format) through the online submission system. We also encourage submission of short papers (up to 4 pages) reporting work in progress.

Papers should be formatted according to the IEEE Computer Society Proceedings Manuscript Formatting Guidelines (see “Formatting Instructions” below).
Formatting Instructions
8.5″ x 11″ (DOC, PDF)
LaTeX Formatting Macros

Introduction to workshop:
The large-scale digitization of analog archives, the emerging diverse forms of born-digital archives, and the new ways in which researchers across disciplines (as well as the public) wish to engage with archival material are disrupting traditional archival theories and practices. Increasing quantities of ‘big archival data’ present challenges for the practitioners and researchers who work with archival material, but also offer enhanced possibilities for scholarship through the application of computational methods and tools to the archival problem space, and, more fundamentally, through the integration of ‘computational thinking’ with ‘archival thinking’.

Our working definition of Computational Archival Science (CAS) is:

A transdisciplinary field that integrates computational and archival theories, methods and resources, both to support the creation and preservation of reliable and authentic records/archives and to address large-scale records/archives processing, analysis, storage, and access, with the aim of improving efficiency, productivity and precision, in support of recordkeeping, appraisal, arrangement and description, preservation and access decisions, and engaging and undertaking research with archival material.

This workshop will explore the conjunction (and its consequences) of emerging methods and technologies around big data with archival practice and new forms of analysis and historical, social, scientific, and cultural research engagement with archives. We aim to identify and evaluate current trends, requirements, and potential in these areas, to examine the new questions that they can provoke, and to help determine possible research agendas for the evolution of computational archival science in the coming years. At the same time, we will address the questions and concerns scholarship is raising about the interpretation of ‘big data’ and the uses to which it is put, in particular appraising the challenges of producing quality – meaning, knowledge and value – from quantity, tracing data and analytic provenance across complex ‘big data’ platforms and knowledge production ecosystems, and addressing data privacy issues.

This is the 3rd workshop at IEEE Big Data addressing Computational Archival Science, following on from workshops in 2016 and 2017.

It also builds on three earlier workshops on ‘Big Humanities Data’ organized by the same chairs at the 2013-2015 conferences, and more directly on a symposium held in April 2016 at the University of Maryland.


  • Join our Google Group at:
  • Foundational Paper: “Archival records and training in the Age of Big Data”, Marciano, R., Lemieux, V., Hedges, M., Esteva, M., Underwood, W., Kurtz, M. & Conrad, M. In J. Percell, L. C. Sarin, P. T. Jaeger, J. C. Bertot (Eds.), Re-Envisioning the MLS: Perspectives on the Future of Library and Information Science Education (Advances in Librarianship, Volume 44B, pp. 179-199). Emerald Publishing Limited. May 17, 2018.
    • 8 topics: (1) Evolutionary prototyping and computational linguistics, (2) Graph analytics, digital humanities and archival representation, (3) Computational finding aids, (4) Digital curation, (5) Public engagement with (archival) content, (6) Authenticity, (7) Confluences between archival theory and computational methods: cyberinfrastructure and the Records Continuum, and (8) Spatial and temporal analytics.
  • Lessons learned from the CAS#1 and CAS#2 workshops on archival concept mappings to computational methods:

    Archival Concepts | Computational Methods
    Support accessibility to large historical European Commission archival holdings | Topic modeling for concept extraction from large EC archival holdings
    Going from paper catalog entries to digital catalogs; matching records in distributed databases | Graph and probabilistic databases
    Technology-assisted review; accessibility of presidential and federal e-mail accessioned into the National Archives | Analytics, predictive coding to address PII
    Provenance of scientific data records (datasets); trust in authenticity of the data, transparency and reuse | DataONE extensions to PROV (Provenance data model)
    Need for a service to assign globally unique persistent identifiers to data sets in order to support accessibility, reference and reuse | Scalable, robust automated computational service for data content comparison
    Enriched Archival Science concepts | Linguistic models and graph theory
    Provenance in terms of why, who and how | Abstraction and ontology construction
    Web Archives Research Objects: disciplinary perspective, legal agreements, motivations, interpretation, designs, … | Research Objects Framework used to analyze the computational methods used in web archives research; Research Objects in Computational Science
    Appraisal | Analysis Tab: file format characterization, file format policies, Bulk extractor (identifies PII), content preview, tagging
    Corpus: One Billion Requests for Linguistic Services | Text mining
    Trusted digital repositories (TDR), OCR, cultural heritage platforms | EUDAT automated scalable e-infrastructure, integrated computation services
    Annotation, entity extraction, NLP, machine learning | Archival materials contextual discovery
    Collection assessment, quality-aware metadata for video collections to inform appraisal, preservation, and access decisions, quality detection in videos | Feature computing from video records, automated quality prediction, scalable HPC
    Classification of archival images | Line detection, image segmentation
    Recordkeeping | Auto-categorization, auto-classification, e-discovery, machine learning
    Iterative design, value-sensitive design | Heuristics for CAS research
    Knowledge complexity in archives | Digital narrative with big data
    Personally Identifiable Information (PII) | NLP, NER, sentiment analysis
    Classification of time-coded textual collections into epochs and periods | Cultural analytics, topic modeling
    Structured data interfaces to archival materials | APIs for cultural heritage materials, graph databases
    Decentralized recordkeeping | Blockchain, secure computing, trustworthiness
    Recordkeeping, digital preservation, archival trust | Blockchain, computational validation, distributed ledger, computational trust

Recommended Research topics for the CAS#3 Workshop:
Topics covered by the workshop include, but are not restricted to, the following:

  • Application of analytics to archival material, including text-mining, data-mining, sentiment analysis, network analysis.
  • Analytics in support of archival processing, including e-discovery, identification of personal information, appraisal, arrangement and description.
  • Scalable services for archives, including identification, preservation, metadata generation, integrity checking, normalization, reconciliation, linked data, entity extraction, anonymization and reduction.
  • New forms of archives, including Web, social media, audiovisual archives, and blockchain.
  • Cyber-infrastructures for archive-based research and for development and hosting of collections
  • Big data and archival theory and practice
  • Digital curation and preservation
  • Crowd-sourcing and archives
  • Big data and the construction of memory and identity
  • Specific big data technologies (e.g. NoSQL databases) and their applications
  • Corpora and reference collections of big archival data
  • Linked data and archives
  • Big data and provenance
  • Constructing big data research objects from archives
  • Legal and ethical issues in big data archives

Program Chairs:
Dr. Mark Hedges
Department of Digital Humanities (DDH)
King’s College London, UK

Prof. Victoria Lemieux
School of Library, Archival and Information Studies
University of British Columbia, Canada

Prof. Richard Marciano
Digital Curation Innovation Center (DCIC)
College of Information Studies
University of Maryland, USA

Program Committee Members:
The program chairs will serve on the Program Committee, as will the following:

Dr. Maria Esteva
Data Intensive Computing
Texas Advanced Computing Center (TACC), USA

Dr. Bill Underwood
Digital Curation Innovation Center (DCIC)
College of Information Studies
University of Maryland, USA

Prof. Michael Kurtz
Emeritus Associate Director of the Digital Curation Innovation Center (DCIC)
College of Information Studies
University of Maryland, USA

Mark Conrad
National Archives and Records Administration (NARA)

Dr. Tobias Blanke
Department of Digital Humanities
King’s College London, UK