LIBER Digital Preservation Workshop – 2nd post

Christopher J. Prom’s Technology Watch Report for the Digital Preservation Coalition, Preserving Email (DPC, 2011), makes the point that “Email communications saturate many people’s lives and comprise a large portion of Internet traffic, but they are rarely captured in large-scale Internet preservation projects … Information professionals must make the case for preserving it …” In his presentation to the 2nd LIBER Workshop on Digital Preservation, Ivan Bosserup described a project to preserve emails and attached documents at the Royal Library of Denmark. It was born in the Manuscripts Department out of a concern to capture and preserve the correspondence of scholars, authors in a variety of fields, and other notable individuals in Danish national life.

The project was piloted in 2009-10 with six scholars in different disciplines, all academics with high levels of trust in the Royal Library who entered into written agreements with the library. By capturing emails at source, the academics become active participants in the archiving of their own correspondence: they are supplied with a copy of the software to install on their PCs, which allows them to select emails to save to the archive, categorising them as ‘Essential to archive’, ‘Nice to keep’, ‘Discard’ etc. The emails are then added to the repository workflow. An index of correspondents is created on receipt by the archive and entries for them are added automatically (?) to the metadata.
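To make the last step a little more concrete, here is a minimal sketch in Python of how an index of correspondents might be built from the emails a scholar has chosen to keep. It is not the Royal Library's software, which I have not seen: the function name `index_correspondents`, the `ARCHIVE_LABELS` set and the input format (pairs of category label and raw message text) are all assumptions made for illustration. Only the standard library `email` package is used.

```python
from collections import Counter
from email import message_from_string
from email.utils import getaddresses

# Hypothetical set of labels that send a message on to the archive,
# based on the categories mentioned in the presentation.
ARCHIVE_LABELS = {"Essential to archive", "Nice to keep"}


def index_correspondents(labelled_messages):
    """Count how often each address appears in From/To/Cc of kept messages.

    labelled_messages: iterable of (label, raw_message_text) pairs.
    """
    index = Counter()
    for label, raw in labelled_messages:
        if label not in ARCHIVE_LABELS:
            continue  # e.g. 'Discard' never reaches the archive
        msg = message_from_string(raw)
        headers = []
        for name in ("From", "To", "Cc"):
            headers.extend(msg.get_all(name, []))
        # getaddresses splits header values into (display name, address) pairs
        for _display_name, address in getaddresses(headers):
            if address:
                index[address.lower()] += 1
    return index
```

Feeding it a scholar's selected messages returns a counter keyed by email address, which could then be matched against name authority records before anything is written to the metadata.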

The beauty of this system is that it reduces the barriers to email archiving, involving minimal work by archive staff and making deposit part of the daily workflow of scholars. It is possible to envisage other archives implementing similar solutions, particularly since start-up funding was minimal. While it raises numerous further questions (what automated processes do you need to catalogue this volume of material; how much does it cost to store and curate; and what happens in relation to IPR for the other correspondents), it is a valuable lesson in what can be done in partnership with authors.

It’s hard not to envisage a latter-day Charles Darwin archiving letters as he works. While they would not be lost to posterity, what would we want to know about the correspondence as it comes into the archive to make it useful for posterity?

Links to full presentations from the 2nd LIBER Workshop on Digital Preservation and other blog posts can be found at


2nd LIBER workshop on digital preservation: “Partnerships in curating European digital resources”

On 7-8 May I was in Florence for this LIBER workshop, hosted by the Fondazione Rinascimento Digitale, Firenze, which summed up its approach as “Do not go at this game alone”. The title concealed a wider exploration of both digital curation and partnerships than I had expected. There were no easy answers, but some trends are becoming clearer: preserving research data is at the top of the list for many institutions, a growing number of libraries are outsourcing ejournal preservation, and the value of archiving the web, social media, and email for future researchers is increasingly being recognised.

Inge Angevaare highlighted in her introduction the need for different partnerships to support the preservation of different types of digital asset (data, etheses, ejournals, web archives). Digital preservation is both expensive and IT intensive, and it re-positions the library in the information chain. For digitised works the library has the whole process in its hands; research data curation starts at production (by researchers); with web sites we may not even know who owns them; and ejournals are licensed to, not owned by, libraries. Some categories may be technically beyond the research library. She challenged libraries to question why we need local services or collections when Connection rather than Collection is the key. As well as considering the familiar questions “Who are your users?” and “What do you want to collect?”, she asked “How safe do you want the content to be?” Some content, e.g. research data, is naturally riskier and more expensive to curate than the rest.

Liz Lyon, UKOLN, picked up this theme in her presentation on partnering for research data, drawing on her Informatics Transform paper in the International Journal of Digital Curation, 2012. Research data has many stakeholders in addition to libraries and researchers: institutions, universities, national academies and professional societies, publishers, institutional and subject repositories, and funders. She emphasised the discipline-focussed nature of the data and the global network of expertise around it. A recent report by Mary Auckland for RLUK, Reskilling for Research (2012), identifies an alarming skills gap when it comes to managing the data, e.g. only 2% capacity exists now for metadata creation but 16% will be required. There was a nice mention of Cambridge University Library as one of the few libraries to employ a data scientist (my colleague, Anna Collins).

Later in the day Joy Davidson had a succinct list of the skills institutions need to put policy into practice. It focussed on practical skills within the institution rather than formal education, and on collaboration between various staff within the institution, not just library staff. They would make a great starting point for a research data manager’s job description and included:

  • Ability to understand researchers’ practices and data holdings.
  • Understanding of existing institutional policies and infrastructure
  • Knowledge of central support systems and responsibilities
  • Knowing when not to re-invent the wheel
  • Work with existing committees rather than suggest a new one; use existing codes of practice for researchers; use these policy statements to your advantage – flag up risks of gaps in provision
  • Ability to advise on risks and benefits
  • Help researchers and senior management to understand funders’ expectations
  • Ability to advise on licensing data
  • Ability to support selection and appraisal
  • Support description of data
  • Ability to support data management plans
  • Ability to define infrastructure requirements, e.g. implications of cloud computing
  • Costing curation

I had heard Eric Meyer of the Oxford Internet Institute before on the subject of Web archiving but was struck this time around by the sheer size of the gap he describes, not just the scale of the data but the potential value of what is already being lost and the tools that will be needed to make it usable in future.

We also need to take more interest in the other sources of digital content alongside web archives, from mobile phones, Facebook, and Twitter.

While the Internet Archive Wayback approach is okay for individual sites, it is more challenging to archive dynamic, personalised feeds like Facebook. See Twitterverse for one view of the complexity of that environment. Many elements will disappear over the next few years. Tools, e.g. visualisation, built for the live web are of little or no use for web archives because there is no access to the infrastructure via APIs. Access to the transactions themselves is important for some areas of research.

Trends in viewing data in, say, 2020 are likely to include life logging, searching across collections, knowledge bases that change over time (e.g. Wikipedia), web tracks, predicting the future from the past, virtual worlds, and images of a changing world … time series.

How do we support this?

    • Capturing a changing world in a fine-grained way requires the development of triggers for more frequent data collection, e.g. a tsunami or the Arab Spring
    • Archive logs about how a web site was used in the long term. Allow them to be donated to your archive for future sociological research. More ambitious would be to archive web traffic itself.
    • “Archive in a box”: a change from looking at past pages to collections of pages and collections of collections, and a shift towards data analysis and data sets rather than reference. E-research, data sharing, linked data, APIs.
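
The idea of triggers for more frequent collection can be sketched very simply. The Python below is my own illustration, not something shown at the workshop: the trigger terms, the two intervals and the function name `crawl_interval` are all hypothetical, and a real service would need a far better event detector than keyword matching on headlines.

```python
from datetime import timedelta

# Hypothetical trigger terms and intervals, chosen only for illustration.
TRIGGER_TERMS = ("tsunami", "earthquake", "uprising")
ROUTINE_INTERVAL = timedelta(days=30)
EVENT_INTERVAL = timedelta(hours=1)


def crawl_interval(recent_headlines):
    """Return a shorter crawl interval when any headline mentions a trigger term."""
    text = " ".join(recent_headlines).lower()
    if any(term in text for term in TRIGGER_TERMS):
        return EVENT_INTERVAL
    return ROUTINE_INTERVAL
```

A harvester could call this before scheduling its next visit to a set of seed sites, falling back to the routine interval once the event has dropped out of the news.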

I’ll blog further on this workshop since there were no fewer than three presentations on ejournal archiving and a nice practical demonstration of how one library is archiving emails.