IEEE Big Data 2018: 3rd CAS workshop

Workshop Title: 3rd Computational Archival Science (CAS) workshop
Wednesday, Dec. 12, 2018, Seattle, WA

PART OF: IEEE Big Data 2018
http://cci.drexel.edu/bigdata/bigdata2018/
*** There is a 1-day registration option ***


9:00 – 9:10 Welcome: ELLIOTT BAY, Floor 1

  • Workshop Chairs:
    Mark Hedges (1), Victoria Lemieux (2), Richard Marciano (3)

    (1) KCL, (2) UBC, (3) U. Maryland

9:10 – 9:45 Keynote #1:

  • “Reclaiming our Story: Using Digital Archives to Preserve the History of WWII Japanese-American incarceration”
    Geoff Froh
    Deputy Director at Densho.org in Seattle

9:45 – 10:05 Coffee break: GRAND FOYER


10:05 – 11:40 SESSION 1: Computational Thinking & Computational Archival Science

  • 10:05-10:30 #1:Introducing Computational Thinking into Archival Science Education
    [William Underwood, David Weintrop, Michael Kurtz, and Richard Marciano — University of Maryland, USA]

    Slides — Paper

    ABSTRACT: The discipline of professional archivists is rapidly changing. Most contemporary records are created, stored, maintained, used and preserved in digital form. Most graduate programs and continuing education programs in Archival Studies address this challenge by introducing students to information technology as it relates to digital records. We propose an approach to addressing this challenge based on introducing computational thinking into the graduate archival studies curriculum.
  • 10:30-10:50 #2:Automating the Detection of Personally Identifiable Information (PII) in Japanese-American WWII Incarceration Camp Records
    [Richard Marciano, William Underwood, Mohammad Hanaee, Connor Mullane, Aakanksha Singh, and Zayden Tethong — University of Maryland, USA]

    Slides - Paper

    ABSTRACT: We describe computational treatments of archival collections through a case study involving World War II Japanese-American Incarceration Camps. We focus on automating the detection of personally identifiable information or PII. The paper also discusses the emergence of computational archival science (CAS) and the development of a computational framework for library and archival education. Computational Thinking practices are applied to Archival Science practices. These include: (1) data creation, manipulation, analysis, and visualization, (2) designing and constructing computational models, and (3) computer programming, developing modular computational solutions, and troubleshooting and debugging. We conclude with PII algorithm accuracy, transparency, and performance considerations and future developments.
  • 10:50-11:15 #3:Computational Archival Practice: Towards a Theory for Archival Engineering
    [Kenneth Thibodeau - National Archives and Records Administration (retired), USA]

    Slides - Paper

    ABSTRACT: The value of computational archival science is realized only in the delivery of products and services. The ultimate value of archival science is its contribution to the construction of information about the past. Archival engineering offers a systematic basis for delivering value. The paper articulates concepts that can be melded with traditional archival theory to expand the applicable domain and to develop quantified, testable and verifiable archival methods.
  • 11:15-11:40 #4:Stirring The Cauldron: Redefining Computational Archival Science (CAS) for The Big Data Domain
    [Nathaniel Payne - The University of British Columbia, CAN]

    Slides - Paper

    ABSTRACT: Over the past 10 years, digitization, big data, and technology advancement have had a significant impact on the work done by computer scientists, information scientists, and archivists. Together, each of these groups has contributed to unlock new areas of trans-disciplinary research that are critical for forward progression in the world of big data, while collectively spurring the creation of a new inter-disciplinary field: Computational Archival Science (CAS). Unfortunately, significant gaps exist, including the lack of a comprehensive definition of CAS. This paper closes those gaps by proposing a new, comprehensive definition of Computational Archival Science (CAS) while simultaneously highlighting key big data challenges that exist both in industry and academia. The paper also proposes important areas of future research especially in the context of big data and artificial intelligence.

11:40 – 12:10 Discussion and Feedback

  • Michael Kurtz & Bill Underwood, [University of Maryland, USA]
      Questions:

    1. Thesis: “The shifting landscape of archival work means that in order to succeed in future archival tasks, it is essential that computational thinking is included as part of their (archivists) training.” Taken from Bill’s Framework document.
    2. Is the approach of using computational thinking knowledge areas, knowledge units, and topics a sound basis for the Computational Framework for Library and Archival Education?
    3. Evaluate integrating computational thinking into Archival Science Knowledge Units as illustrated in Figure 3.
    4. What are the challenges in working with archival educators to introduce computational thinking in graduate Archival Science curricula? What strategies could overcome them?

12:10 – 1:30 Lunch: GRAND II & III


1:30 – 2:20 SESSION 2: Machine Learning in Support of Archival Functions

  • 1:30-1:55 #5:Protecting Privacy in the Archives: Supervised Machine Learning and Born-Digital Records
    [Tim Hutchinson - University of Saskatchewan Library – University Archives & Special Collections, CAN]

    Slides - Paper

    ABSTRACT: This paper documents the iterations attempted in developing training sets for supervised machine learning relating to identification of documents relating to human resources and containing personal information. Overall, these results show promise, although we have so far been unable to propose a more systematic approach to developing training sets. This suggests that supervised machine learning could be a viable approach for a “triage” method of reviewing collections for restrictions.
  • 1:55-2:20 #6:Computer-Assisted Appraisal and Selection of Archival Materials
    [Christopher Lee - University of North Carolina, USA]

    Slides - Paper

    ABSTRACT: Despite progress on various technologies to support both digital preservation and description of archival materials, we have still seen relatively little progress on software support for the core activities of selection and appraisal. There are two considerations that make selection and appraisal of digital materials substantially different from selection and appraisal of analog materials: that digital materials exist at multiple levels of representation and that they are directly machine readable. There are great opportunities to better assist selection and appraisal of digital materials, including use of digital forensics tools, natural language processing, and machine learning.

2:20 – 3:35 SESSION 3: Metadata and Enterprise Architecture

  • 2:20-2:45 #7:Measuring Completeness as Metadata Quality Metric in Europeana
    [Péter Király and Marco Büchler — Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen and Georg-August-Universität Göttingen, DEU]

    Slides - Paper

    ABSTRACT: Europeana, the European digital platform for cultural heritage, has a heterogeneous collection of metadata records ingested from more than 3200 data providers. The original nature and context of these records were different. In order to create effective services upon them we should know the strength and weakness or in other words the quality of these data. This paper proposes a method and an open source implementation to measure some structural features of these data, such as completeness, multilinguality, uniqueness, record patterns, to reveal quality issues.
  • 2:45-3:10 #8:In-place Synchronisation of Hierarchical Archival Descriptions
    [Mike Bryant, Linda Reijnhoudt, and Boyan Simeonov — King’s College London, Data Archiving and Networked Services, and Ontotext, GBR, NLD, and BGR]

    Slides - Paper

    ABSTRACT: This short paper describes work undertaken by the European Holocaust Research Infrastructure (EHRI) project to achieve reliable and repeatable harvesting of hierarchical archival metadata that is robust to structural changes and reorganisation of the source material.
  • 3:10-3:35 #9:The Utility of Enterprise Architecture for Records Professionals
    [Shadrack Katuu - University of South Africa, ZAF]

    Slides - Paper

    ABSTRACT: Modern institutions invest large amounts of resources to build technology platforms and business applications to support organizational activities which will fulfil their institutional mandate. Enterprise architecture (EA) has emerged as an approach to improve the alignment between the organization’s business and their technology platforms. This article is drawn from a research project investigating the utility of EA for records and archives specialists.

3:35 – 4:55 SESSION 4: Data Management

  • 3:35-4:00 #10:Framing the scope of the common data model for machine-actionable Data Management Plans
    [Tomasz Miksa, João Cardoso, and José Borbinha — SBA Research & TU Wien and INESC-ID & Instituto Superior Técnico, AUT and PRT]

    Slides - Paper

    ABSTRACT: Currently, research requires processing data at a large scale. Data is not anymore a collection of static documents, but often a continuous stream of information flowing into information systems. Researchers need to manage their data efficiently not only to keep it safe, but also to ensure that it can be later correctly interpreted and reused. Existing solutions are not sufficient. Traditional Data Management Plans are manually created text documents that describe how research data will be handled. Yet, researchers must implement all actions by themselves. Machine-actionable Data Management Plans are a new approach that allows systems to act on behalf of researchers and other stakeholders involved in data management, to help them manage data in an efficient and scalable way. This paper summarises the results of work performed by the Research Data Alliance working group on Data Management Plan Common Standards to realise this vision. The paper describes results of consultations and proof of concept tools that help in: identifying needs for information of stakeholders involved in data management; defining the scope of the common data model for Machine-actionable Data Management Plans to allow for exchange of information between systems; identifying necessary services and components of infrastructure that support automation of data management tasks.
  • 4:00-4:10 10 min discussion / break

  • 4:10 – 4:30 Coffee break: GRAND FOYER

  • 4:30-4:55 #11:The Blockchain Litmus Test
    [Tyler Smith - Adventium Labs, USA]

    Slides — Paper

    ABSTRACT: Bitcoin’s underlying blockchain database is a novel approach to recordkeeping that has the potential to decentralize big data. Bitcoin’s success has inspired a multitude of spinoff projects hoping to use blockchain as a distributed database for records-management innovation in other fields. Misconceptions and exaggerations about blockchain and its capabilities are pervasive in the media. Drawing on perspectives from archival science, dependable computing, and secure computing, this paper surveys current applications, research, and critiques of blockchain to provide an objective assessment of its benefits and limitations. Based on the findings of the survey, this paper proposes three criteria that predict success for blockchain-based data management projects, briefly: dependability, security, and trust.

4:55 – 5:45 SESSION 5: Social and Cultural Institution Archives

  • 4:55-5:20 #12:A Case Study in Creating Transparency in Using Cultural Big Data: The Legacy of Slavery Project
    [Ryan Cox, Sohan Shah, William Frederick, Tammie Nelson, Will Thomas, Greg Jansen, Noah Dibert, Michael Kurtz, and Richard Marciano — Maryland State Archives and University of Maryland, USA]

    Slides - Paper

    ABSTRACT: The Maryland State Archives (MSA) and the Digital Curation Innovation Center (DCIC) of the University of Maryland’s iSchool are collaborating on a digital project that utilizes digital strategies and technologies to create an in-depth understanding of the African-American experience in Maryland during the era of slavery. Utilizing crowdsourcing for transcription, data cleaning and transformation techniques, and data visualization strategies, the joint project team is creating new avenues for understanding the complex web of relationships that undergirded the institution of slavery. iSchool students, full participants on the project team, are learning digital curation and other technical skills while gaining insights into the multiple uses of how cultural Big Data can penetrate the past and illuminate the present.
  • 5:20-5:45 #13:Jupyter Notebooks for Generous Archive Interfaces
    [Mari Wigham, Liliana Melgar, and Roeland Ordelman — Netherlands Institute for Sound and Vision and University of Amsterdam, NLD]

    Slides - Paper

    ABSTRACT: To help scholars to extract meaning, knowledge and value from large volumes of archival content, such as the Dutch Common Lab Research Infrastructure for the Arts and Humanities (CLARIAH), we need to provide more ‘generous’ access to the data than can be provided with generalised search and visualisation tools alone. Our approach is to use Jupyter Notebooks in combination with the existing archive APIs (Application Programming Interfaces). This gives access to both the archive metadata and a wide range of analysis and visualisation techniques. We have created notebooks and modules of supporting functions that enable the overview, investigation and analysis of the archive. We demonstrate the value of our approach in preliminary tests of its use in scholarly research, and give our observations of the potential value for archivists. Finally, we show that good archive knowledge is essential to create correct and meaningful visualisations and statistics.

5:45 CLOSING

  • Next Steps

 
Program Chairs:
Dr. Mark Hedges
Department of Digital Humanities (DDH)
King’s College London, UK

Prof. Victoria Lemieux
School of Library, Archival and Information Studies
University of British Columbia, Canada

Prof. Richard Marciano
Digital Curation Innovation Center (DCIC)
College of Information Studies
University of Maryland, USA

Program Committee Members:
The program chairs will serve on the Program Committee, as will the following:

Dr. Maria Esteva
Data Intensive Computing
Texas Advanced Computing Center (TACC), USA

Dr. Bill Underwood
Digital Curation Innovation Center (DCIC)
College of Information Studies
University of Maryland, USA

Prof. Michael Kurtz
Emeritus Associate Director of the Digital Curation Innovation Center (DCIC)
College of Information Studies
University of Maryland, USA

Mark Conrad
National Archives and Records Administration (NARA)

Dr. Tobias Blanke
Department of Digital Humanities
King’s College London, UK



Introduction to workshop:
The large-scale digitization of analog archives, the emerging diverse forms of born-digital archives, and the new ways in which researchers across disciplines (as well as the public) wish to engage with archival material, are resulting in disruptions to traditional archival theories and practices. Increasing quantities of ‘big archival data’ present challenges for the practitioners and researchers who work with archival material, but also offer enhanced possibilities for scholarship through the application of computational methods and tools to the archival problem space, and, more fundamentally, through the integration of ‘computational thinking’ with ‘archival thinking’.

Our working definition of Computational Archival Science (CAS) is:

A transdisciplinary field that integrates computational and archival theories, methods and resources, both to support the creation and preservation of reliable and authentic records/archives and to address large-scale records/archives processing, analysis, storage, and access, with the aim of improving efficiency, productivity and precision, in support of recordkeeping, appraisal, arrangement and description, preservation and access decisions, and engaging and undertaking research with archival material.

This workshop will explore the conjunction (and its consequences) of emerging methods and technologies around big data with archival practice and new forms of analysis and historical, social, scientific, and cultural research engagement with archives. We aim to identify and evaluate current trends, requirements, and potential in these areas, to examine the new questions that they can provoke, and to help determine possible research agendas for the evolution of computational archival science in the coming years. At the same time, we will address the questions and concerns scholarship is raising about the interpretation of ‘big data’ and the uses to which it is put, in particular appraising the challenges of producing quality – meaning, knowledge and value – from quantity, tracing data and analytic provenance across complex ‘big data’ platforms and knowledge production ecosystems, and addressing data privacy issues.

This is the 3rd workshop at IEEE Big Data addressing Computational Archival Science, following on from workshops in 2016 and 2017.

It also builds on three earlier workshops on ‘Big Humanities Data’ organized by the same chairs at the 2013-2015 conferences, and more directly on a symposium held in April 2016 at the University of Maryland (http://dcicblog.umd.edu/cas/dcickcl-invited-cas-symposium-apr-2016/).

RESOURCES and EXAMPLES of CAS can be found at the “COMPUTATIONAL ARCHIVAL SCIENCE (CAS)” Portal: http://dcicblog.umd.edu/cas. Also:

  • Join our Google Group at: computational-archival-science@googlegroups.com
  • Foundational Paper: “Archival records and training in the Age of Big Data”, Marciano, R., Lemieux, V., Hedges, M., Esteva, M., Underwood, W., Kurtz, M. & Conrad, M. In J. Percell, L. C. Sarin, P. T. Jaeger, J. C. Bertot (Eds.), Re-Envisioning the MLS: Perspectives on the Future of Library and Information Science Education (Advances in Librarianship, Volume 44B, pp. 179-199). Emerald Publishing Limited. May 17, 2018.
    • 8 topics: (1) Evolutionary prototyping and computational linguistics, (2) Graph analytics, digital humanities and archival representation, (3) Computational finding aids, (4) Digital curation, (5) Public engagement with (archival) content, (6) Authenticity, (7) Confluences between archival theory and computational methods: cyberinfrastructure and the Records Continuum, and (8) Spatial and temporal analytics.
  • Lessons learned from the CAS#1 and CAS#2 workshops on archival concept mappings to computational methods:

    Archival Concepts | Computational Methods
    Support accessibility to large historical European Commission archival holdings | Topic Modeling for concept extraction from large EC archival holdings
    Going from paper catalog entries to digital catalogs, Matching records in distributed databases. | Graph and Probabilistic Databases
    Technology assisted review accessibility of presidential and federal e-mail accessioned into National Archives | Analytics, predictive coding to address PII
    Provenance of scientific data records (datasets). Trust in authenticity of the data, transparency and reuse | DataONE extensions to PROV (Provenance data model)
    Need for a service to assign globally unique persistent identifiers to data sets in order to support accessibility, reference and reuse. | Scalable, robust automated computational service for data content comparison.
    Enriched Archival Science concepts | Linguistic Models and Graph Theory
    Provenance in terms of why, who and how | Abstraction and ontology construction
    Web Archives Research Objects: Disciplinary perspective, legal agreements, Motivations, Interpretation, Designs, … | Research Objects Framework used to analyze the computational methods used in web archives research; Research Objects in Computational Science
    Appraisal | Analysis Tab: File Format Characterization, File Format policies, Bulk extractor (Identifies PII), Content Preview, Tagging
    Corpus: One Billion Requests for Linguistic Services | Text mining
    Trusted digital repositories (TDR), OCR, cultural heritage platforms | EUDAT automated scalable e-infrastructure, integrated computation services
    Annotation, entity extraction, NLP, machine learning | Archival materials contextual discovery
    Collection assessment, quality-aware metadata for video collections to inform appraisal, preservation, and access decisions, quality detection in videos | Feature computing from video records, automated quality prediction, scalable HPC
    Classification of archival images | Line detection, image segmentation
    Recordkeeping | Auto-categorization, auto-classification, e-discovery, machine learning
    Iterative design, value-sensitive design | Heuristics for CAS research
    Knowledge complexity in archives | Digital narrative with big data
    Personally Identifiable Information (PII) | NLP, NER, sentiment analysis
    Classification of time-coded collections of textual collections into epochs and periods | Cultural analytics, topic modeling
    Structured data interfaces to archival materials | APIs for cultural heritage materials, graph databases
    Decentralized recordkeeping | Blockchain, secure computing, trustworthiness
    Recordkeeping, digital preservation, archival trust | Blockchain, computational validation, distributed ledger, computational trust

Recommended Research topics for the CAS#3 Workshop:
Topics covered by the workshop include, but are not restricted to, the following:

  • Application of analytics to archival material, including text-mining, data-mining, sentiment analysis, network analysis.
  • Analytics in support of archival processing, including e-discovery, identification of personal information, appraisal, arrangement and description.
  • Scalable services for archives, including identification, preservation, metadata generation, integrity checking, normalization, reconciliation, linked data, entity extraction, anonymization and reduction.
  • New forms of archives, including Web, social media, audiovisual archives, and blockchain.
  • Cyber-infrastructures for archive-based research and for development and hosting of collections
  • Big data and archival theory and practice
  • Digital curation and preservation
  • Crowd-sourcing and archives
  • Big data and the construction of memory and identity
  • Specific big data technologies (e.g. NoSQL databases) and their applications
  • Corpora and reference collections of big archival data
  • Linked data and archives
  • Big data and provenance
  • Constructing big data research objects from archives
  • Legal and ethical issues in big data archives