IEEE Big Data 2016: 1st CAS workshop

Workshop Title: Computational Archival Science: digital records in the age of big data
Thursday, December 8, 2016
Hyatt Regency Washington on Capitol Hill
400 New Jersey Avenue, NW
Washington, D.C., USA, 20001

PART OF: IEEE Big Data 2016
http://cci.drexel.edu/bigdata/bigdata2016/
*** There is a 1-day registration option ***


FINAL PROGRAM:
Keynote, 10 presentations from Belgium, Germany, UK, Canada, USA (universities, government agencies, companies), Panel, Breakout sessions.

8:45 – 9:00 Welcome

  • Workshop Organizers:
    Mark Hedges (1), Richard Marciano (2), Victoria Lemieux (3), Maria Esteva (4), Bill Underwood (2), Michael Kurtz (2), Myeong Lee (2), and Mary Kendig (2)

    (1) KCL, (2) U. Maryland, (3) UBC, (4) TACC

9:00 – 9:45 Keynote (30 min + 15 min discussion)

  • “Collaboration is the Thing”, Mark Conrad [Archives Specialist, National Archives and Records Administration (U.S.A.)]
    Slides

9:45 – 10:45 Session 1 (3 talks: 20 mins each)

  • #1: Exploring Archives with Probabilistic Models: Topic Modelling for the Valorisation of Digitised Archives of the European Commission
    [Simon Hengchen, Mathias Coeckelbergs, Seth van Hooland, Ruben Verborgh, Thomas Steiner — U. Libre de Bruxelles, Ghent U. (Belgium), Google Germany]

    Slides, Paper

    • Computational Method: Topic Modelling for concept extraction from large EC archival holdings

    • Archival concept: Support accessibility to large historical European Commission archival holdings

  • #2: Traces Through Time: A Probabilistic Approach to Connected Archival Data
    [Sonia Ranade — The UK National Archives]
    Slides, Paper

    • Computational Method: Graph and Probabilistic Databases

    • Archival concepts: Going from paper catalog entries to digital catalogs, Matching records in distributed databases.

  • #3: Opening Up Dark Digital Archives Through The Use of Analytics to Identify Sensitive Content
    [Jason Baron, Bennett Borden — Drinker Biddle & Reath LLP (Washington D.C.)]
    Slides, Paper

    • Computational Methods: Analytics, predictive coding to address PII

    • Archival concepts: Technology-assisted review; accessibility of presidential and federal e-mail accessioned into the National Archives

10:45 – 11:05 Coffee break

11:05 – 12:45 Session 2 (5 talks: 20 mins each)

  • #4: Computational Provenance in DataONE: Implications for Cultural Heritage Institutions
    [Robert Sandusky — U. of Illinois at Chicago Library]
    Slides, Paper

    • Computational Method: DataONE extensions to PROV (Provenance data model)

    • Archival concepts: Provenance of scientific data records (datasets). Trust in the authenticity of the data, transparency, and reuse

  • #5: Content-based Comparison for Collections Identification
    [Weijia Xu, Ruizhu Huang, Maria Esteva, Jawon Song, Ramona Walls — U. Texas at Austin, TACC]
    Slides, Paper

    • Computational method: scalable, robust automated computational service for data content comparison.

    • Archival concepts: need for a service to assign globally unique persistent identifiers to data sets in order to support accessibility, reference and reuse.

  • #6: Breaking Down the Invisible Wall to Enrich Archival Science and Practice
    [Kenneth Thibodeau — US National Archives (retired)]
    Slides, Paper

    • Computational Method: Linguistic Models and Graph Theory

    • Archival Science: Enriched Archival Science concepts

  • #7: Mind the explanatory gap: Quality from Quantity
    [Jenny Bunn — UCL (UK)]
    Slides, Paper

    • Methods: Abstraction and ontology construction

    • Archival concepts: provenance in terms of why, who and how

  • #8: Understanding Computational Web Archives Research Methods Using Research Objects
    [Emily Maemura, Christoph Becker, Ian Milligan — U. of Toronto, U. of Waterloo (Canada)]
    Slides, Paper

    • Computational methods: Research Objects framework used to analyze the computational methods used in web archives research; Research Objects in Computational Science

    • Archival concepts: Web Archives Research Objects — Disciplinary perspective, legal agreements, Motivations, Interpretation, Designs, …

12:45 – 2:00 Lunch

2:00 – 2:40 Session 3 (2 talks: 20 mins each)

  • #9: Appraising Digital Archives with Archivematica
    [Michael Shallcross — U. Michigan Bentley Historical Library]
    Slides, Paper

    • Computational Methods: Analysis Tab — File Format Characterization, File Format policies, Bulk extractor (Identifies PII), Content Preview, Tagging

    • Archival concepts: Appraisal

  • #10: Mining and Analysing One Billion Requests to Linguistic Services
    [Marco Büchler, G. Franzini, E. Franzini, T. Eckart — Georg-August U. Göttingen, U. Leipzig (Germany)]
    Slides, Paper

    • Computational Method: Text mining

    • Archival concepts: Corpus — One Billion Requests for Linguistic Services

2:40 – 3:30 Panel: The future for research and education in CAS

  • Panelists: [Bill Underwood (summary & position), Maria Esteva (position), Victoria Lemieux (position), Mark Hedges (position), Richard Marciano (position), Mary Kendig (position)]


3:30 – 4:00 Coffee break & Posters [U. British Columbia & U. Maryland]


4:00 – 5:00 Reporting back and next steps [Led by Maria Esteva and Vicki Lemieux]



Introduction to workshop:

The large-scale digitization of analogue archives, the emerging diverse forms of born-digital archive, and the new ways in which researchers across disciplines (as well as the public) wish to engage with archival material are resulting in disruptions to traditional archival theories and practices. Increasing quantities of ‘big archival data’ present challenges for the practitioners and researchers who work with archival material, but also offer enhanced possibilities for scholarship through the application of computational methods and tools to the archival problem space, and, more fundamentally, through the integration of ‘computational thinking’ with ‘archival thinking’.

Our working definition of Computational Archival Science (CAS) is:

An interdisciplinary field concerned with the application of computational methods and resources to large-scale records/archives processing, analysis, storage, long-term preservation, and access, with the aim of improving efficiency, productivity and precision in support of appraisal, arrangement and description, preservation and access decisions, and engaging and undertaking research with archival material.

This workshop will explore the conjunction (and its consequences) of emerging methods and technologies around big data with archival practice and new forms of analysis and historical, social, scientific, and cultural research engagement with archives. We aim to identify and evaluate current trends, requirements, and potential in these areas, to examine the new questions that they can provoke, and to help determine possible research agendas for the evolution of computational archival science in the coming years. At the same time, we will address the questions and concerns scholarship is raising about the interpretation of ‘big data’ and the uses to which it is put, in particular appraising the challenges of producing quality – meaning, knowledge and value – from quantity, tracing data and analytic provenance across complex ‘big data’ platforms and knowledge production ecosystems, and addressing data privacy issues.

This is the first workshop at IEEE Big Data addressing Computational Archival Science, although it builds on three earlier workshops on ‘Big Humanities Data’ organized by the same chairs at the 2013-2015 conferences, and more directly on a symposium held in April 2016 at the University of Maryland (http://dcicblog.umd.edu/cas/).

Research topics covered:
Topics covered by the workshop include, but are not restricted to, the following:

  • Application of analytics to archival material, including text-mining, data-mining, sentiment analysis, network analysis.
  • Analytics in support of archival processing, including appraisal, arrangement and description.
  • Scalable services for archives, including identification, preservation, metadata generation, integrity checking, normalization, reconciliation, linked data, entity extraction, anonymization and reduction.
  • New forms of archives, including Web, social media, audiovisual archives, and blockchain.
  • Cyber-infrastructures for archive-based research and for development and hosting of collections
  • Big data and archival theory and practice
  • Digital curation and preservation
  • Crowd-sourcing and archives
  • Big data and the construction of memory and identity
  • Specific big data technologies (e.g. NoSQL databases) and their applications
  • Corpora and reference collections of big archival data
  • Linked data and archives
  • Big data and provenance
  • Constructing big data research objects from archives

Program Chairs:
Dr. Mark Hedges
Department of Digital Humanities (DDH)
King’s College London, UK

Dr. Tobias Blanke
Department of Digital Humanities (DDH)
King’s College London, UK

Prof. Richard Marciano
Digital Curation Innovation Center (DCIC)
College of Information Studies
University of Maryland, USA

Prof. Michael Kurtz
Digital Curation Innovation Center (DCIC)
College of Information Studies
University of Maryland, USA

Dr. Bill Underwood
Digital Curation Innovation Center (DCIC)
College of Information Studies
University of Maryland, USA

Prof. Victoria Lemieux
School of Library, Archival and Information Studies
University of British Columbia, Canada

Dr. Maria Esteva
Data Intensive Computing
Texas Advanced Computing Center (TACC), USA