Rethinking Collections as Data
Without data there are no Labs. This chapter covers identifying collections and assessing their suitability for Labs, and how to describe them and make them accessible and reusable. It also touches on strategies for dealing with messy data as well as some useful basic concepts: different forms of collections, digitisation, metadata and preservation. It closes with a case study looking at making data available.
About digital collections
Cultural heritage institutions collect a wide range of materials. Since the early 2000s, these materials have been increasingly digitised and published in digital libraries, in archival portals or on museum websites. Digitisation, along with techniques such as OCR, can alter a collection to the extent that its uses are limited. It is therefore important to document the digitisation process in as much detail as possible, as this feeds into the level of transparency of the collection. Collecting and preserving born-digital materials, such as web archives, social media, video games and software, is becoming increasingly commonplace.
Long-term thinking and planning about collections ensures their use for decades to come. This process would normally be a task for the parent organisation as it raises pertinent questions about the longevity of an institution. However, when a Lab publishes data in any form, digital preservation of that collection should be kept in mind. Considerations should include adding DOIs and deciding how to deal with the metadata, digital objects and associated data that constitute the collections themselves. The Digital Preservation Coalition provides an exhaustive resource about this area.
Collections as Data
Providing data-level access to digitised and born-digital collections from galleries, libraries, archives and museums is at the heart of GLAM Labs activities. Users are increasingly generating their own data and experimenting together with GLAM Labs to jointly generate new datasets. Access to collections in bulk means opening data and metadata associated with digitised and born-digital cultural heritage collections for use in new ways. A great example of a team working to facilitate the publication of collections as data is the Mellon-funded initiative Always Already Computational: Collections as Data, which aimed to find a way to document, exchange experience, and share knowledge for ‘supporting users who want to work with collections as data’ (Padilla, 2019).
When sharing collections as data, several aspects need to be considered. What data is available to share? What is in the datasets and how were they constructed? In addition, each dataset will have distinct rights statements, or a lack thereof. A decision needs to be made about how much time, if any at all, is spent on data cleaning and curation before sharing. Also, how will the data be made available to users?
Requests to use collections as data often come from an external partner or user. In addition to helping to facilitate external requests for data, many Labs proactively gather collections data that could be of interest to broad audiences. A list of digitised collections is a great starting point for considering what has the potential to be used computationally. However, this list may not exist in a single place, especially in a distributed context, as this example shows.
Example: Digital Assets, ÖNB Labs
Information about digitised collections at the Austrian National Library is highly fragmented and distributed across several departments and storage formats. One year after launching ÖNB Labs formally, the team is still in the process of locating additional, hidden digitised assets.
Gathering information about past and ongoing digitisation projects provides a perfect starting point in order to have a record of digitised and born-digital collections. This can be achieved by consulting knowledgeable people within the institution, such as curators, custodians of library systems, or IT specialists. Some data might not appear to be a collection at first glance, such as records in a digital library catalogue, but can be very relevant as the following example shows.
Example: Delpher, KB National Library of the Netherlands
The KB publishes around 100 million pages of digitised text on the platform Delpher. The Centrum voor Wiskunde en Informatica (CWI) worked with the anonymised search log files of the platform to research user interest and behaviour in the digitised newspaper section of the search platform (Bogaard et al., 2019). The developed dashboard has been subsequently shared with the KB Lab and is now available for internal purposes.
To facilitate reuse of collections, it is important they are described in detail. The more information that can be shared about the development of the dataset, the better researchers (and a Lab) are able to work with it, as this provides transparency which is crucial for source criticism.
The implication of providing transparency for each dataset is that a Lab has to be open and communicative about the data and collections that they have. This is more challenging than it sounds within a single organisation. Acquisition and preservation policies change over time, as do documentation and responsibilities. More often than not, collections are big and messy and documentation varies widely.
Shared knowledge between users and the organisation about the collections and the data is crucial for successful outcomes from collaborations. Providing documentation, for example, about the original purpose of a project, selection and digitisation strategy, implementation, technical details, and subsequently communicating those appropriately to users, is a time-consuming endeavour, but worthwhile.
By providing transparent information about the provided datasets, it becomes possible to examine sets for (hopefully unintended) bias. Very often this bias creeps in during the selection process for practical reasons, such as book size, typeface or even copyright issues. This might result in a digitised collection that is not representative of the physical one, as seen in the example below.
Example: Sample Generator, BL Labs
The BL Labs competition winner of 2013, Pieter Francois, developed a tool to search 1.9 million records of books from the 19th century held at the BL. Of these, 2.6% were digitised. He wanted to know whether the 2.6% that were digitised were representative of the larger sample. The tool enabled researchers to select representative samples of books based on filtered search terms of both digital and physical items from a larger corpus for further analysis. It gave BL Labs a deeper understanding of the distribution of the digitised material that the British Library holds relative to the physical collections both over time and by topic.
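A representativeness check of this kind can be sketched in a few lines. The following is a minimal illustration and not the Sample Generator itself: the record fields (`decade`, `digitised`), the helper names and the toy catalogue are all hypothetical. It compares each decade's share of the digitised subset with its share of the full catalogue, then draws a proportionally stratified sample.

```python
import random
from collections import Counter, defaultdict


def share_by_group(records, key):
    """Return each group's share of the records as a fraction of the total."""
    counts = Counter(r[key] for r in records)
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}


def representation_gap(all_records, key="decade"):
    """Digitised share minus overall share, per group.

    Positive values mean the group is over-represented among digitised items.
    """
    overall = share_by_group(all_records, key)
    digitised = share_by_group([r for r in all_records if r["digitised"]], key)
    return {g: digitised.get(g, 0.0) - overall[g] for g in overall}


def stratified_sample(records, key, size, seed=0):
    """Draw a sample whose group proportions roughly follow the full set."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for r in records:
        groups[r[key]].append(r)
    sample = []
    for members in groups.values():
        quota = round(size * len(members) / len(records))
        sample.extend(rng.sample(members, min(quota, len(members))))
    return sample


# Hypothetical toy catalogue: the 1850s are digitised far more often than the 1810s.
catalogue = (
    [{"decade": 1810, "digitised": i < 1} for i in range(60)]
    + [{"decade": 1850, "digitised": i < 9} for i in range(40)]
)
gaps = representation_gap(catalogue)
sample = stratified_sample(catalogue, "decade", size=10)
```

A large positive or negative gap for a group is a prompt to document the selection criteria that produced it, not a verdict on the collection.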
Other issues with bias that could be relevant to research are related to the more ethical concerns about representativeness of gender and ethnicity. As stated by Zaagsma (among others), digitisation is far from neutral (2019).
Ideally, a Lab would provide open access to the data being made available through the Lab. However, owing to a variety of reasons (including, but not limited to, copyright, donor agreements, and other collection-dependent restrictions), a Lab may need to work with restricted data. Ideally, access is then still possible for research purposes. There are two main ways to provide this:
If open access is not possible due to restrictions, the Lab could opt to provide data for research purposes under certain conditions. This is naturally only an option if the copyright holders agree to this or have entered an agreement with the organisation that this is allowed. Researchers can contact the Lab to request data. The Lab can then set up an agreement or contract with the researcher defining the regulations attached to the use of the data after which it can be shared.
Example: Restricted data with off-site access, KB National Library of the Netherlands
The KB has set up a Data Services team in the Collections Department for everything related to the delivery and accessibility of the KB's digital collections. For all collections available on Delpher, the KB has entered into agreements with rights holders' organisations that the data may be shared for research purposes. Researchers sign a standard agreement when data is requested and they are required to delete the data once the agreement ends.
Example: Restricted data with on-site access, The Royal Danish Library
For certain types of restricted data, the DK provides researchers with a service where they create stand-alone, internal computational clusters, ensuring that the data is not available beyond its agreed purpose. In order to comply with the EU General Data Protection Regulation (GDPR), the library creates and keeps a log file for six months in order to recreate what the researcher has done on the cluster. In more complex use cases (such as the Danish web archive data), the DK provides a developer / consultant to collaborate with the researcher and ensure regulations are followed.
Rights and licensing
The rights status of a data collection or item is not always clear. Collections may even contain orphan works. While these issues around rights statements are complex, it is important to be aware of them and to be able to have an informed conversation with legal advisors about using collections within the legal framework. Lab teams are often the group which is well placed to advocate for the broad use of collections and data that have unknown rights statuses or with complex implementation requests. It is therefore important that a Lab member is well-versed in the IPR regulations of the country and understands the flexibility that may exist in the law.
Providing access to data and collections comes with its own set of concerns when discussing licensing. Legal restraints and the lack of open licences limit the use of data. Legislation differs between countries, and as such there is no one-size-fits-all (collections) standard. Labs need to consider a managed-risk approach to licensing.
Licences that are commonly used in Labs (and even the entire cultural heritage community) are the Creative Commons licences. They are usually described by their abbreviations, such as CC-BY-SA. A full list of CC licences and their corresponding possibilities for reuse can be found in the diagram below.
Curated versus messy data
Publishing data as a Lab can be done in more than one way. Depending on the intent, timing and necessity, a dataset can be released straight out of the digitisation process. This results in messy data, which may not be suitable for all reuse purposes. However, it is a fast way to share collections and is often seen within the Lab community. Another way is to curate a dataset before publishing. This requires quite an effort and is not always possible, but it does provide users with a clean and easy-to-use collection.
There are several steps concerned with curating a dataset. The following diagram presents a possible option where the data is firstly explored, then selected and extracted, after which it is cleaned and normalised using tools such as OpenRefine, described through controlled vocabularies, and finally enriched using techniques such as Named Entity Recognition and Linked Data.
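Cleaning and normalisation are often done interactively in a tool such as OpenRefine. The sketch below only illustrates the kind of transformation involved, using the Python standard library. The record fields, the helper names and the small controlled vocabulary are hypothetical and would differ for each collection.

```python
import re
import unicodedata

# Hypothetical controlled vocabulary: variant spellings mapped to a preferred term.
PLACE_VOCABULARY = {
    "wien": "Vienna",
    "vienna": "Vienna",
    "'s-gravenhage": "The Hague",
    "den haag": "The Hague",
}


def clean_text(value):
    """Normalise the Unicode form, collapse whitespace and trim the ends."""
    value = unicodedata.normalize("NFC", value)
    return re.sub(r"\s+", " ", value).strip()


def normalise_year(value):
    """Extract a four-digit year from strings such as '[1887?]'; None if absent."""
    match = re.search(r"\b(1[0-9]{3}|20[0-9]{2})\b", value)
    return int(match.group(1)) if match else None


def curate(record):
    """Return a cleaned copy of one record; the original is kept for transparency."""
    place = clean_text(record.get("place", ""))
    return {
        "title": clean_text(record.get("title", "")),
        "year": normalise_year(record.get("date", "")),
        "place": PLACE_VOCABULARY.get(place.lower(), place),
        "source": record,
    }


raw = {"title": "  Neue   Freie\tPresse ", "date": "[1887?]", "place": "WIEN "}
curated = curate(raw)
```

Keeping the untouched source record next to the curated values is one way to let users see exactly what the curation step changed.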
A wide range of data benefits from the curated approach, including library catalogues, as the following example shows.
Example: Migration of a library catalogue into RDA linked open data, Biblioteca Virtual Miguel de Cervantes
The catalogue of the Biblioteca Virtual Miguel de Cervantes contains about 200,000 records which were originally created in compliance with the MARC21 standard. The library wanted to open up their catalogue through linked open data. To do this, they mapped the contents of the database by means of an automated procedure to RDF triples which employ the RDA vocabulary to describe the entities, as well as their properties and relationships. A specific online interface was then built to query this newly created database. In addition, the data is publicly available and easily linked to other applications (Candela et al., 2018).
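To give a feel for what such a mapping involves, here is a minimal sketch that turns a simplified MARC-like record into N-Triples lines. It is not the procedure used by the Biblioteca Virtual Miguel de Cervantes: the `example.org` namespaces, the property names and the function names are placeholders, and a real mapping would use the published RDA element IRIs and a full MARC parser.

```python
# Hypothetical namespaces; a real mapping would use the published RDA element IRIs.
VOCAB = "http://example.org/vocab/"
BASE = "http://example.org/work/"

# MARC21 (tag, subfield) -> illustrative property name.
FIELD_MAP = {
    ("245", "a"): "title",
    ("100", "a"): "creatorName",
    ("260", "c"): "dateOfPublication",
}


def literal(value):
    """Escape a string as an N-Triples literal."""
    escaped = (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'


def record_to_triples(record_id, fields):
    """Map a simplified MARC record to N-Triples lines.

    `fields` is a list of (tag, subfield, value) tuples; unmapped fields are skipped.
    """
    subject = f"<{BASE}{record_id}>"
    lines = []
    for tag, subfield, value in fields:
        prop = FIELD_MAP.get((tag, subfield))
        if prop is not None:
            lines.append(f"{subject} <{VOCAB}{prop}> {literal(value)} .")
    return lines


marc_record = [
    ("100", "a", "Cervantes Saavedra, Miguel de"),
    ("245", "a", 'El ingenioso hidalgo "Don Quijote"'),
    ("300", "a", "2 v."),  # no mapping defined, so it is skipped
]
triples = record_to_triples("42", marc_record)
```

Keeping the field mapping in a plain table such as `FIELD_MAP` makes it easy to publish the mapping alongside the data, which supports the transparency discussed earlier in this chapter.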
Most Labs open data without any curation. Users can then explore it, and decide how a collection might fit into their research. Technological solutions can sometimes provide workarounds, and data that is too messy for some uses might be easily analysed with other methods.
If the messiness of the data is detrimental to a particular research project, data cleaning should be incorporated into the project when proposing it. The cleaning can then be done by project partners, in collaboration with the Lab, or by the user community through a crowdsourcing platform. The costs and effort needed to clean the data should then be factored into the project, as the cleaning cannot be done by the Lab alone.
Other dataset examples
In addition to the main digital collections of the institution, other types of datasets can be shared by the Lab.
Extracting data from a larger set produces a collection that qualifies for different usage. These extractions are usually time-consuming and sharing the end results benefits the Lab community. An example of a derived dataset is the KBK-1M set of the KB Lab.
Example: KBK-1M, KB Lab
During a researcher-in-residence programme at the Dutch KB Lab, the researchers and the Labs team extracted all illustrations and captions from a larger set of digitised newspapers. This set (KBK-1M) is now on offer as a derived set so other researchers do not have to re-extract the data.
Data that is suitable as training data in deep learning applications is much sought after. Providing accurate training data in adequate quantities is a prerequisite for a multitude of research projects. Going one step further and not only offering training data but also sharing the pre-trained model (or to be more specific: the weights for the model, which are the output of the training process) for reuse significantly lowers the entry barrier for using the collection in a machine learning context and provides useful information for researchers working in this field.
Some users generate data that might be useful for others, and if they are willing to share it and this task falls within the scope of the Lab, the following questions should be considered:
Does the Lab have the technical infrastructure to accommodate incoming data from users?
How does the Lab ensure transparency about the creation of the data?
Who owns the rights to the created data? Who is the author?
Can the Lab accommodate possible necessary embargoes or other access limitations?
Is the Lab able to ensure (to a justifiable extent) that the offered data complies with existing national and transnational legal frameworks?
Crowdsourcing projects frequently exist in parallel to Labs, offering the possibility to collaborate and reintegrate the user-generated data back into the organisation through the Lab. Various forms of user-generated data exist and crowdsourcing initiatives are not the only source. For example, the ÖNB Labs works with user-generated data from Transkribus in the following fashion.
Example: Transkribus integration, ÖNB Labs
At the time of writing, the team of ÖNB Labs is enabling their users to upload collections of their Labs data to Transkribus, a platform to train and apply models for handwritten text recognition (HTR) and optical character recognition (OCR) on digital images. The result (user-generated text recognition) can then be re-integrated into Labs to be shared with and re-used by other Lab users. Doing so in a manner that satisfies all requirements concerning transparency and quality of data, sustainability, as well as all legal aspects, is a process that is anticipated to take the better part of a year to prepare and implement.
Case study: Data Foundry, National Library of Scotland
The National Library of Scotland launched its Data Foundry in September 2019. The Data Foundry is the Library's data delivery platform, and is a part of its Digital Scholarship Service. Initial data collection offerings included digitised collections, metadata collections, map and spatial data, and organisational data, with further collections, such as web archive data, collection usage data and audiovisual data, planned for future release.
The Data Foundry is based on three core principles:
Open: The National Library of Scotland publishes data openly and in re-useable formats.
Transparent: The provenance of data is taken seriously, and there is openness about how and why it has been produced.
Practical: Datasets are presented in a variety of file formats to ensure that they are as accessible as possible.
This has involved cross-Library effort to produce data collections openly and in consistent formats, bringing together curators, rights experts, developers and metadata specialists, and has resulted in a mode of delivering data which seeks to establish — and continue to advance — best practice.
All data provided on the Data Foundry has been rights assessed, and licences and rights statements are made available clearly with each dataset: both on the web page and in the readme file associated with the dataset. The Library does not assert further copyright control over the datasets that it produces, and information about the licensing and rights statements used, as well as the Open Data Publication Plan, is available on the Data Foundry.
This forms one of the five aims of the Digital Scholarship Service: 'Practise and promote transparency in our data creation processes'. Contextualising the data creation process maintains the thread from the original, physical object to the object-as-data. As there are no existing standards or processes for how to present information about how and why items and collections have been digitised and presented as data, the National Library of Scotland currently includes this information within the METS files of digitised material, and within the data in metadata collections.
Furthermore, each dataset is placed in context by a series of declarations on the web page on which it is presented, such as: whether OCR has been cleaned up; how many files the dataset includes, and in what format; how many words and lines are included (for text-based collections); and the years covered by the dataset. This information is a key part of the Data Foundry's design and serves to provide an at-a-glance contextualisation of data which, without this information, can feel rather abstract.
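Declarations of this kind can be generated directly from the files. The sketch below is an illustration under stated assumptions and not the Data Foundry's own tooling: the function name `describe_dataset` is hypothetical, the dataset is assumed to be a flat folder of UTF-8 plain-text files, and the year of each item is assumed to appear in its file name.

```python
import re
import tempfile
from pathlib import Path


def describe_dataset(folder, suffix=".txt"):
    """Summarise a folder of plain-text files for an at-a-glance declaration."""
    files = sorted(Path(folder).glob(f"*{suffix}"))
    words = lines = 0
    years = set()
    for path in files:
        text = path.read_text(encoding="utf-8")
        words += len(text.split())
        lines += len(text.splitlines())
        # Assumption: the year of each item is encoded in its file name.
        match = re.search(r"(1[0-9]{3}|20[0-9]{2})", path.stem)
        if match:
            years.add(int(match.group(1)))
    return {
        "files": len(files),
        "format": suffix.lstrip("."),
        "words": words,
        "lines": lines,
        "years": (min(years), max(years)) if years else None,
    }


# Usage with two small, made-up files in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "gazetteer_1842.txt").write_text(
        "Aberdeen is a city.\nIt lies north.\n", encoding="utf-8"
    )
    Path(tmp, "gazetteer_1868.txt").write_text("Edinburgh.\n", encoding="utf-8")
    summary = describe_dataset(tmp)
```

The resulting dictionary could be written into the readme file or rendered on the dataset's web page, so that the published figures always match the files on offer.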
The library more broadly is transparent about its workings, and the Data Foundry provides a platform for organisational data, such as financial information and environmental data.
From the start, the vision for the Digital Scholarship Service's data offerings included the importance of making datasets available in a variety of formats, and in a consistent way, to enable users of varying skills and needs to use the collections. This involves making data available as downloads, based on feedback from the user community; offering trials of big datasets; and ensuring that all digitised collections are available as both METS / ALTO and plain text formats. Metadata collections are provided in MARC and Dublin Core, to help bring library metadata to new audiences, and organisational datasets are provided in regularly updated CSV files.
The Lab Data Recipe
This recipe puts together a collection as a dataset in a quick-and-dirty way. You may get messy, so wearing protective clothing such as a nice set of emotional armour is advised. Make sure it's comfortable, because you might need to wear it a long time and it might get sweaty. This recipe can be applied to any type of data, but here text is used as the main ingredient.
• A bucketload of digitised images and corresponding text.
• If possible: metadata.
• An enthusiastic Labber.
• A (boundary-pushing) legal advisor.
• A liberal sprinkling of resilience.
Please note that cooking times may vary, as institutions have different styles of ovens, energy levels, and appetites for risk.
1. Dissect your collection and find out what it contains and how it was created. This might make no sense at all and be completely random and biased, but don't be alarmed. You may need to talk to other people in your organisation about this, but don't worry, they are usually quite happy to talk about their work and giving them cake helps. This is how you build relationships (and diabetes).
2. Document everything you have learned in step 1. You don't have to do this alone and copy / paste is an excellent approach.
3. Prepare a pitch for your legal advisor on why the set should be made available under an open licence.
NOTE: This step is only necessary if your legal advisor does not like to push boundaries.
NOTE: If you have data that is in copyright, include the workaround to provide access in your pitch. Read the chapter on Sharing Data for helpful tips.
4. Stir the documentation, your pitch, the legal advisor and your organisation's management vigorously in a big pot (ideally in a locked meeting room) until the decision is made to publish the collection as a dataset.
NOTE: This step may take some time and this is where you might get dirty. Don't take it personally as you are pushing a boundary and might feel the boundary push-back.
5. When you have been given the green light (if you choose to wait for that, we're not suggesting anything here...), the collection is ready to be published as data.
6. Serve with some herbs of your choice, all documentation, a clear rights statement with the open licence and contact information on a public platform.
Collections as data for GLAM Labs means:
Enabling computationally driven use of the collections.
Identifying collections and assessing their suitability for Labs projects.
Making collections accessible and reusable.
Dealing with messy data.
Considering related work in digitisation, metadata, rights and preservation.