Kolloquium "Phänomenologie der Digital Humanities" #8 (7.2.2025)

Tracking and Representing the Provenance of Gender Data in the Digital Humanities

Lisa Poggel, Chair of Digital Humanities, FU Berlin

Overview

Introduction
  1. Provenance, Data Provenance, Research Data Provenance
  2. Representing Provenance
  3. Representing Gender
The Project
  1. Research Aims and Questions
  2. Preliminary Findings
  3. Outlook and Timeline

Provenance, Data Provenance, Research Data Provenance

Provenance

Art history: a record of the history of ownership of a piece of art

Archival science/archaeology: the context in which objects and records are found and created; this context aids their interpretation, and the materials in turn serve as evidence of it.

Sources: Glavic (2021: 3f.), Anderson (2024)

W3C Definition of Provenance

Source: W3C (2010)

(Research) Data Provenance

(or lineage, pedigree, parentage, genealogy, ...)

Databases: record of how the data item was derived from other data items by a set of transformations; explains how the result of an operation was derived from its inputs

Experimental sciences: metadata record of the process of experiment workflows, annotations, and notes about experiments; ensures reproducibility and trustworthiness

Data science: record that describes the origins and processing of data; enables responsible (i.e. fair, accountable, transparent and explainable) AI

Research Data Management: degree to which a data set and the data elements and data values it contains are equipped with verifiable information about its origin. Documentation of where data comes from and of the processes and methods by which it was produced; answers the questions of why and how the data were produced, where, when and by whom.

Sources: Glavic (2021: 3f.), Simmhan, Plale and Gannon (2005: 1), Data Provenance Initiative (2024), Werder, Ramesh and Zhang (2022), Stein and Taentzer (2023), eResearch Alliance Universität Göttingen (2025)

Types of Provenance

Provenance: any information describing the production process of an end product, which can be anything from a piece of digital data to a physical object.

Provenance meta-data: meta-data describing an arbitrary production process using an arbitrary data model and model of computation

Information system provenance: meta-data collected for an information-disseminating process that can be computed based on the input, the output, and the parameters of the process. (= processes producing digital data within information systems)

Workflow provenance: specializes information system provenance by further restricting the type of production processes to so-called workflows. (= directed graph where nodes represent arbitrary functions or modules [...] with some input, output, and parameters)

Data provenance: allows to track the processing of individual data items (e.g., tuples) at the “highest resolution,” i.e., the provenance itself is at the level of individual data items (and the operations they undergo). Collecting data provenance typically applies on structured data models and declarative query languages with clearly defined semantics of individual operators.

Source: Herschel, Diestelkämper and Ben Lahmar (2017: 882f.)
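
To make "provenance at the level of individual data items" concrete, here is a minimal sketch of tuple-level lineage tracking. It is not taken from Herschel et al.; the table names, the `Row` class and all values are hypothetical and only illustrate how each output tuple can carry the ids of the input tuples it was derived from.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    """A data item plus the ids of the input tuples it was derived from."""
    values: tuple
    lineage: frozenset


def base_table(name, rows):
    """Wrap raw tuples; each input tuple gets an id such as 'persons:0'."""
    return [Row(tuple(r), frozenset({f"{name}:{i}"})) for i, r in enumerate(rows)]


def select(rows, predicate):
    """Selection: surviving tuples keep their lineage unchanged."""
    return [r for r in rows if predicate(r.values)]


def join(left, right, left_col, right_col):
    """Equi-join: an output tuple depends on both matching input tuples."""
    return [
        Row(l.values + r.values, l.lineage | r.lineage)
        for l in left
        for r in right
        if l.values[left_col] == r.values[right_col]
    ]


# Hypothetical example data (not from any real dataset).
persons = base_table("persons", [("p1", "Person A"), ("p2", "Person B")])
genders = base_table("genders", [("p1", "female", "register X"),
                                 ("p2", "male", "letter Y")])

result = select(join(persons, genders, 0, 0), lambda v: v[1] == "Person A")
for row in result:
    print(row.values, sorted(row.lineage))
```

Each result row can be traced back to exactly the input tuples that produced it, which is the "highest resolution" mentioned above.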

Different Motivations and Goals

  • monitoring data quality
  • debugging systems
  • explaining results and ensuring understandability
  • protecting the integrity of data
  • protecting the accuracy and reliability of data
  • guaranteeing transparency
  • ensuring reproducibility and reusability
  • precondition for plausibility and trustworthiness

Sources: highlighted terms from Herschel, Diestelkämper and Ben Lahmar (2017: 882f.), Simmhan, Plale and Gannon (2005), IBM (2024), among others

FAIR data consortium

The FAIR data consortium subsumes data provenance under the reusability aspect and describes data provenance as a precondition for reusability:

"For others to reuse your data, they should know where the data came from (i.e., clear story of origin/history, see R1), who to cite and/or how you wish to be acknowledged. Include a description of the workflow that led to your data: Who generated or collected it? How has it been processed? Has it been published before? Does it contain data from someone else that you may have transformed or completed? Ideally, this workflow is described in a machine-readable format."

Source: FAIR Data Consortium

NFDI Sektion (Meta)daten, Terminologien, Provenienz

Tasks of Sektion (Meta)daten, Terminologien, Provenienz according to Sektionskonzept (2021):

In the subject area of provenance, the section addresses legal, technical and cultural aspects of the context in which (meta)data originate (e.g. in experiments, lab notebooks, digitisation processes etc.) and drafts proposals for uniform and traceable documentation procedures that answer the questions of the what, where, when, who, how and why of data creation and data processing. To this end, the section develops recommendations for representing provenance in a possible NFDI core metadata format.

Source: Koepler et al. (2021)

From the Charter of the Cookbooks, Guidance and Best Practices Working Group within the Sektion (Meta)daten, Terminologien, Provenienz:

A common understanding of (meta)data, terminology, provenance and related sub-concepts is core in data-driven research to foster the provision of FAIR data. However, knowledge and implementation of metadata standards, data repositories, terminologies as well as provenance concepts differ within and across disciplines. In order to create or reuse subject- and application-specific metadata that is at the same time semantically rich, machine-actionable and interoperable, and to interlink data (i.e. FAIR data), a common understanding of quality parameters for metadata is required.

Source: Arndt et al. (2022)

Representing Provenance

Different provenance types, different approaches

Provenance meta-data: e.g. Historical Context Ontology (HiCO)

Information system provenance: e.g. commercial data historians like Clarify for gathering process manufacturing data; tools for recording the computing environment like R E2ETools

Workflow provenance: e.g. TaDiRAH; tools like VisTrails or LabelFlow

Data provenance: e.g. Wikidata references, ProvSQL

Different types of provenance: e.g. PROV-O, CRMdig

Example: Representation of provenance in RDF

RDF Provenance

Source: Massari, Peroni, Tomasi and Heibi (2023), see also Sikos (2021), Sikos and Philp (2020)
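
As a rough illustration of what provenance statements in RDF can look like, the following sketch serialises a few PROV-O statements as N-Triples using plain Python strings. It does not reproduce the figure from Massari et al.; the `http://example.org/` namespace, the resource names and the timestamp are hypothetical, and only the three PROV-O properties (`prov:wasDerivedFrom`, `prov:wasAttributedTo`, `prov:generatedAtTime`) come from the W3C vocabulary.

```python
PROV = "http://www.w3.org/ns/prov#"
XSD = "http://www.w3.org/2001/XMLSchema#"
EX = "http://example.org/"  # hypothetical namespace for illustration only


def triple(subject, predicate, obj, datatype=None):
    """Serialise one statement as an N-Triples line."""
    if datatype is None:
        rendered = f"<{obj}>"                # object is an IRI
    else:
        rendered = f'"{obj}"^^<{datatype}>'  # object is a typed literal
    return f"<{subject}> <{predicate}> {rendered} ."


def describe_provenance(entity, source, agent, generated_at):
    """Return PROV-O statements about where an entity comes from."""
    return [
        triple(EX + entity, PROV + "wasDerivedFrom", EX + source),
        triple(EX + entity, PROV + "wasAttributedTo", EX + agent),
        triple(EX + entity, PROV + "generatedAtTime", generated_at,
               datatype=XSD + "dateTime"),
    ]


lines = describe_provenance(
    entity="gender-statement-1",     # hypothetical statement about a person
    source="parish-register-1",      # hypothetical source document
    agent="annotator-1",             # hypothetical agent
    generated_at="2025-02-07T10:00:00Z",
)
print("\n".join(lines))
```

A real project would more likely use an RDF library and one of the approaches surveyed in the sources above (e.g. named graphs or reification) to attach such statements to individual triples.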

Representing Gender

Why gender?

When it comes to managing gender data, common challenges and beliefs seem to be:

Gender trouble
Certainly within the community of researchers working with cultural data, the desire to compare and aggregate diverse sources held together by a thin red thread of potential narrative cohesion, is only increasing. The KPLEX project (kplex-project.eu) is investigating these barriers to meaning-making. Our team has adopted a comparative, multidisciplinary, and multi-sectoral approach to this problem, focussing on key challenges to the knowledge creation capacity of cultural data such as the terms we use to speak about data in a cultural context, the manner in which data that are not digitised or shared become “hidden” from aggregation systems, the fact that data lacks the objectivity often ascribed to the term and the subtle ways in which data that are complex almost always become simplified before they can be aggregated.

Source: Edmond and Nugent Folan (2017)

Representation of gender in common standards

CIDOC-CRM
  • Entity E76 Gender removed in 2001
  • Gender modelled via P2 has type
  • Alternatives have been discussed but are complex, e.g. via a gender assignment event [1]
Text Encoding Initiative (TEI)
  • Standard approach is the @sex attribute with values M, F, O (other), N (none)
  • But any external standard can be used
  • Alternative approaches exist but are complex, e.g. via <trait type="gender"> [2]
GND
  • Allowed values for gender in GND entities are "male" and "female", without reference
  • Other values are allowed in free form in a different field, but only if a reference is provided
Wikidata
  • Gender modelled via P21 sex or gender
  • Used for animals and humans; conflates sex, gender identity and modality

Sources: [1] Andrews et al. (2024), [2] Flanders (2021)

Of course, not all is bad...

Homosaurus
  • Expansive linked data vocabulary for contemporary LGBTQ terms
  • Not the only one, e.g. GSSO ontology
Wigedi
  • Wikimedia-funded research project that reevaluated the Wikidata model and recommended to:
    • Require references for gender statements
    • Define standards for references
    • Remove P21 sex or gender
    • Separate gender identity and modality
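
Two of these recommendations (require references, separate identity and modality) can be sketched as a small data structure. This is only an illustration under stated assumptions: the class and field names are hypothetical and are not part of Wikidata, the Wigedi project, or any standard mentioned above.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Reference:
    """Hypothetical minimal reference: where a claim comes from and when it was consulted."""
    source: str
    retrieved: str  # ISO 8601 date


@dataclass(frozen=True)
class GenderStatement:
    """A gender claim that is only valid if at least one reference is attached."""
    person_id: str
    gender_identity: str
    references: Tuple[Reference, ...]
    gender_modality: Optional[str] = None  # kept separate from identity

    def __post_init__(self):
        if not self.references:
            raise ValueError("a gender statement needs at least one reference")


ref = Reference(source="hypothetical biography, p. 12", retrieved="2025-02-06")
statement = GenderStatement("person-1", "woman", (ref,))
print(statement)
```

Constructing a `GenderStatement` with an empty `references` tuple raises a `ValueError`, so unreferenced claims cannot enter the dataset unnoticed.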

...but it could be better.

The Project
  1. Research Aims and Questions
  2. Preliminary Findings
  3. Outlook and Timeline

Research Aims and Questions

  1. Take stock: How is gender represented in DH datasets? Is provenance information provided? How is it represented and communicated? Which standards, ontologies, vocabularies are used?
    Methods: Kuckartz' Qualitative Content Analysis; Kitchenham’s guidelines for Systematic Review (2004)
  2. Identify issues: How may the quality of gender data in digital humanities projects be assessed? Is gender data FAIR? Is it interpretable, plausible and trustworthy? How does provenance information affect data quality?
  3. Understand their origins: Why are certain standards and modelling approaches adopted/rejected? What are common requirements and constraints? Which types of provenance are tracked and how? What are typical data integration and reconciliation workflows?
    Methods: Expert interviews with stakeholders in DH projects; requirement elicitation techniques
  4. Formulate recommendations: Which workflows and modelling approaches improve data quality in a linked data setting? How should provenance information be made available? Which types of provenance should be tracked when dealing with (historical) gender data? What strategies have proven successful in low resource settings?
    Output: Create repository with content analysis and interview data; publish workflows on SSH Open Marketplace

TL;DR

Digital humanities projects working with prosopographical data face a dilemma: historical gender data is inaccurate, messy, and mostly binary, but leaving it out means rendering gender as a social category of difference invisible. Approaches to tracking and representing the provenance of historical gender data vary and standardization is needed to improve interoperability and interpretability.

Preliminary Findings

  1. Take stock: How is gender represented in DH datasets? Is provenance information provided? How is it represented and communicated? Which standards, ontologies, vocabularies are used?
    Methods: Kuckartz' Qualitative Content Analysis; Kitchenham’s guidelines for Systematic Review (2004)

Trial run: DH 2024

  1. Collect data
    Extract URLs from Books of Abstracts of ADHO conferences
  2. Clean and process data
    Keep only URLs likely to point to DH datasets
  3. Conduct content analysis
    Identify DH datasets and evaluate datasets with respect to a set of variables
    • Develop annotation variables and coding scheme
    • Annotate content
    • Evaluate agreement between annotators with ICR scores
    • Repeat
  4. Evaluate results

Data collection

Scrapy (Python) was used to scrape ADHO conference abstracts 2013–2023 from conference websites and repositories. Source formats by conference year:

  • XML-TEI: 2013–2016, 2018, 2020, 2022, and 2023
  • Plaintext: 2017
  • PDF to plaintext conversion: 2019

Code available at: https://github.com/lipogg/dh-projects-scraper
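
The URL extraction step can be illustrated with a few lines of standard-library Python. This is a simplified sketch, not the code from the repository above: the regular expression and the `extract_urls` helper are assumptions made for illustration, and the sample text is invented.

```python
import re

# Simplified pattern (an assumption): http(s) URLs up to the next whitespace,
# quote, angle bracket or closing bracket.
URL_PATTERN = re.compile(r"""https?://[^\s<>"')\]]+""")


def extract_urls(text):
    """Return all URL candidates in a text, without trailing punctuation."""
    return [match.rstrip(".,;:") for match in URL_PATTERN.findall(text)]


sample = (
    "Our dataset is available at https://example.org/data. "
    "See also (http://example.com/a?b=1), or write to me@example.org."
)
print(extract_urls(sample))
```

E-mail addresses are ignored here because they lack an `http(s)://` prefix; URL-like strings without a scheme would need additional handling in the validation step.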

Data cleaning and preprocessing

  • Deduplication
    • remove duplicate URLs within the same conference year
  • Validation
    • validate URLs; remove e-mail addresses and URL-like strings
  • Pre-filtering
    • filter out very frequent domains:
      • sciencedirect
      • culturalanalytics
      • orcid.org
      • twitter.com
      • reddit.com
      • doi.org
      • zenodo.org
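
The three cleaning steps could look roughly like this in plain Python. The sketch makes several assumptions that the slides do not state: records are `(year, url)` pairs, deduplication means dropping repeated URLs within one conference year, validation only checks scheme and host, and domains are matched as substrings of the host name. The function names are hypothetical.

```python
from urllib.parse import urlparse

# Domain list taken from the slide; matched as substrings of the host name.
FREQUENT_DOMAINS = (
    "sciencedirect", "culturalanalytics", "orcid.org",
    "twitter.com", "reddit.com", "doi.org", "zenodo.org",
)


def is_valid_url(url):
    """Very rough validation: http(s) scheme and a dotted host name."""
    parts = urlparse(url)
    return parts.scheme in ("http", "https") and "." in parts.netloc


def clean(records):
    """Deduplicate per year, validate, and pre-filter (year, url) pairs."""
    seen = set()
    kept = []
    for year, url in records:
        url = url.strip()
        if (year, url) in seen:  # assumed: duplicates within the same year
            continue
        seen.add((year, url))
        if not is_valid_url(url):
            continue
        host = urlparse(url).netloc.lower()
        if any(domain in host for domain in FREQUENT_DOMAINS):
            continue
        kept.append((year, url))
    return kept


records = [
    (2019, "https://example.org/d"),
    (2019, "https://example.org/d"),            # duplicate in the same year
    (2020, "https://example.org/d"),            # same URL, different year
    (2019, "mailto:someone@example.org"),       # not an http(s) URL
    (2019, "https://doi.org/10.1234/x"),        # frequent domain
    (2019, "https://www.sciencedirect.com/a"),  # frequent domain
]
print(clean(records))
```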

Annotation

  • Qualitative content analysis
  • Mixed deductive-inductive category system development
  • Selection variable funnel and content variables
  • Quantitative evaluation

(Mayring, 2000)

Selection Variable Funnel

  • Is the URL …
    • online?
    • a Digital Humanities resource?
    • a dataset?
    • a single dataset?
    • unrestricted?
  • And does it …
    • contain personal data?
    • contain gender data?
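
One way to read the funnel is as an ordered sequence of yes/no checks, where a URL drops out at the first "no". The following sketch assumes that reading; the variable keys and the `funnel_stage` helper are hypothetical names, not the project's coding scheme.

```python
# Selection variables in funnel order (keys are hypothetical names).
FUNNEL = [
    "online", "digital_humanities", "dataset", "single_dataset",
    "unrestricted", "personal_data", "gender_data",
]


def funnel_stage(annotation):
    """Return the first variable answered 'no', or None if all are 'yes'."""
    for variable in FUNNEL:
        if not annotation.get(variable, False):
            return variable
    return None


# Hypothetical annotation: online DH resource that is not a dataset.
annotation = {"online": True, "digital_humanities": True, "dataset": False}
print(funnel_stage(annotation))
```

Only URLs for which `funnel_stage` returns `None` would proceed to the content variables.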

Inter-Coder Reliability Scores


First annotation round (5 coders)
Variable              Fleiss’ Kappa   Strength of agreement
Online                0.76            Substantial
Digital Humanities    0.58            Moderate
Dataset               0.32            Fair
Single dataset        0.35            Fair
Unrestricted          0.43            Moderate
Personal Data         0.46            Moderate
Gender Data           0.59            Moderate

Second annotation round (3 coders)
Variable              Fleiss’ Kappa   Strength of agreement
Online                0.64            Substantial
Digital Humanities    0.63            Substantial
Dataset               0.68            Substantial
Single dataset        0.77            Substantial
Unrestricted          0.68            Substantial
Personal Data         0.48            Moderate
Gender Data           0.63            Substantial

Result interpretation based on Landis and Koch (1977)
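
For readers unfamiliar with the measure, here is a minimal sketch of how Fleiss' Kappa and the Landis and Koch labels can be computed. The slides do not say which tool produced the scores above, so this is a generic textbook implementation; the input matrices below are toy data, not the project's annotations.

```python
def fleiss_kappa(counts):
    """Fleiss' Kappa; counts[i][j] = number of coders assigning item i to category j."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number of ratings")
    n_categories = len(counts[0])
    # Mean observed agreement across items.
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_items
    # Agreement expected by chance, from overall category proportions.
    totals = [sum(row[j] for row in counts) for j in range(n_categories)]
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in totals)
    return (p_bar - p_e) / (1 - p_e)


def interpret(kappa):
    """Strength of agreement following Landis and Koch (1977)."""
    if kappa < 0:
        return "Poor"
    if kappa <= 0.20:
        return "Slight"
    if kappa <= 0.40:
        return "Fair"
    if kappa <= 0.60:
        return "Moderate"
    if kappa <= 0.80:
        return "Substantial"
    return "Almost perfect"


# Toy data: 4 items, 3 coders, categories (yes, no).
perfect = [[3, 0], [0, 3], [3, 0], [0, 3]]
split = [[2, 1], [1, 2], [2, 1], [1, 2]]
print(fleiss_kappa(perfect), interpret(fleiss_kappa(perfect)))
print(round(fleiss_kappa(split), 3), interpret(fleiss_kappa(split)))
```

Note that the formula is undefined when all ratings fall into a single category, because chance agreement is then 1.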

Selection Variables

Gender expressions

Common non-gender expressions

  • Uncertainty: Undetermined, Indeterminate, Contested
  • Inapplicability: Not applicable
  • Unavailability: Not provided, ?, Unknown
  • The “other” category: used for “other” genders, unknown values, animals and organizations
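
When harmonising datasets, these expressions could be mapped to the categories above so that "no gender recorded" is not confused with a gender value. The sketch below is an assumption about how such a mapping might be set up; the dictionary and function names are hypothetical, and the value lists contain only the expressions named on this slide.

```python
# Expressions from the slide, lower-cased for matching.
NON_GENDER_CATEGORIES = {
    "uncertainty": {"undetermined", "indeterminate", "contested"},
    "inapplicability": {"not applicable"},
    "unavailability": {"not provided", "?", "unknown"},
}


def classify_expression(value):
    """Return the non-gender category of a value, or None if it is not listed."""
    normalized = value.strip().lower()
    for category, expressions in NON_GENDER_CATEGORIES.items():
        if normalized in expressions:
            return category
    if normalized == "other":
        # Per the slide, "other" is used with several meanings and
        # cannot be resolved without provenance information.
        return "ambiguous"
    return None


for value in ["Unknown", "Contested", "Other", "female"]:
    print(value, "->", classify_expression(value))
```

Values that return `None` are not necessarily gender expressions; they are simply not covered by the list above.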

Gender expressions by structuredness of gender data

Standards used

  • XML-TEI: <sex> used four times, with remarks: sex refers to “performed gender”, “female/male named individuals”
  • CIDOC-CRM
  • Wikidata
  • FOAF:gender
  • rdaGr2:gender (RDF)

Gender expressions by availability of provenance information

Provenance of gender data

Gender expressions by provenance

Limitations

  • Global North overrepresented in DH conferences and conference abstracts
  • Established projects favoured over small datasets
  • Gender categories beyond the binary less easily enumerated for unstructured data

See e.g. Siddiqui (2023)

Outlook and Timeline

  1. Additional questions:
    • What is the language of the data / project?
    • What is the stage of the project (onset / maintenance / ...)?
    • What types of provenance are represented and how?
    • How is provenance information communicated?
    Reinclude DOIs and Zenodo links? Incorporate Kitchenham's guidelines?
  2. Timeline (circled numbers refer to the research aims):
    • 2025: Take stock ①; Identify issues ②
    • 2026: Conduct interviews ③; Create repository ④; Write thesis
    • 2027: Publish repository ④; Write thesis

Thank you! Questions, suggestions, criticism, …?

Literature

Anderson, B. G. (2024). Kindred contexts: Archives, archaeology, and the concept of provenance. Archival Science, 24(4), 761–781. https://doi.org/10.1007/s10502-024-09459-5
Andrews, T. L., Deierl, M., & Ebel, C. (2024). Gender Assignment as an Event—A Contemporary Approach for the Adequate Depiction of Historical Gender Categories. Digital Scholarship in the Humanities, 39(1), 5–12. https://doi.org/10.1093/llc/fqad100
Arndt, S., Burger, F., Dreyer, B., Espinoza, S., Fischer, B., Fuhrmans, M., Henzen, C., Hofman, V., Jaeger, D., Linke, D., Löbe, M., Mathiak, B., Rose, T., Saalbach, C., Shutsko, A., Soeding, E., Stein, R., & Terzijska, D. (2022). “Cookbooks, Guidance, and Best Practices”—Working Group Charter (NFDI section-metadata). Zenodo. https://doi.org/10.5281/zenodo.6758256
Edmond, J., & Nugent Folan, G. (2017). Data, Metadata, Narrative. Barriers to the Reuse of Cultural Sources. In E. Garoufallou, S. Virkus, R. Siatri, & D. Koutsomiha (Eds.), Metadata and Semantic Research (pp. 253–260). Springer International Publishing. https://doi.org/10.1007/978-3-319-70863-8_25
Flanders, J. (2021). Gender in the machine: Representing gender in digital publication frameworks. Digital humanities and gender history. https://doi.org/10.22032/dbt.48960
Glavic, B. (2021). Data Provenance. Foundations and Trends® in Databases, 9(3–4), 209–441. https://doi.org/10.1561/1900000068
Herschel, M., Diestelkämper, R., & Ben Lahmar, H. (2017). A survey on provenance: What for? What form? What from? The VLDB Journal, 26(6), 881–906. https://doi.org/10.1007/s00778-017-0486-1
Koepler, O., Schrade, T., Neumann, S., Stotzka, R., Wiljes, C., Blümel, I., Bracht, C., Hamann, T., Arndt, S., & Hunold, J. (2021). Sektionskonzept Meta(daten), Terminologien und Provenienz zur Einrichtung einer Sektion im Verein Nationale Forschungsdateninfrastruktur (NFDI) e.V. https://doi.org/10.5281/zenodo.5619089
Massari, A., Peroni, S., Tomasi, F., & Heibi, I. (2024). Representing provenance and track changes of cultural heritage metadata in RDF: A survey of existing approaches (arXiv:2305.08477). arXiv. https://doi.org/10.48550/arXiv.2305.08477
R1.2: (Meta)data are associated with detailed provenance. (n.d.). GO FAIR. Retrieved February 6, 2025, from https://www.go-fair.org/fair-principles/r1-2-metadata-associated-detailed-provenance/
Siddiqui, N. (2023). An Undue Burden: Race, Gender, and Mobility in Digital Humanities Conferences. DH2023, Graz. https://dh-abstracts.library.virginia.edu/works/12417
Sikos, L. F. (2021). The Evolution of Context-Aware RDF Knowledge Graphs. In L. F. Sikos, O. W. Seneviratne, & D. L. McGuinness (Eds.), Provenance in Data Science: From Data Models to Context-Aware Knowledge Graphs (pp. 1–10). Springer International Publishing. https://doi.org/10.1007/978-3-030-67681-0_1
Sikos, L. F., & Philp, D. (2020). Provenance-Aware Knowledge Representation: A Survey of Data Models and Contextualized Knowledge Graphs. Data Science and Engineering, 5(3), 293–316. https://doi.org/10.1007/s41019-020-00118-0
Simmhan, Y. L., Plale, B., & Gannon, D. (2005). A survey of data provenance techniques. https://www.semanticscholar.org/paper/A-survey-of-data-provenance-techniques-Simmhan-Plale/54c2b2f2a709f5c14d0f46265729a06e69f1955f
W3C. (2010). What Is Provenance—XG Provenance Wiki. https://www.w3.org/2005/Incubator/prov/wiki/What_Is_Provenance
Zhao, F. (2023). A systematic review of Wikidata in Digital Humanities projects. Digital Scholarship in the Humanities, 38(2), 852–874. https://doi.org/10.1093/llc/fqac083