On 7-8 May I was in Florence for this LIBER workshop, hosted by the Fondazione Rinascimento Digitale, Firenze, which summed up its approach as “Do not go at this game alone”. The title concealed a wider exploration of both digital curation and partnerships than I had expected. There were no easy answers, but some trends are becoming clearer: preserving research data is at the top of the list for many institutions, a growing number of libraries are outsourcing ejournal preservation, and the value of archiving the web, social media, and email for future researchers is increasingly being recognised.
Inge Angevaare highlighted in her introduction the need for different partnerships to support the preservation of different types of digital asset (data, etheses, ejournals, web archives). Preservation is both expensive and IT intensive, and it re-positions the library in the information chain. For digitised works the library has the whole process in its hands; research data curation starts at production (by researchers); with web sites we may not even know who owns them; and ejournals are licensed to, not owned by, libraries. Some categories may be technically beyond the research library. She challenged libraries to question why we need local services or collections when Connection rather than Collection is the key. As well as considering familiar questions, “Who are your users?” and “What do you want to collect?”, she asks “How safe do you want the content to be?” Some content is naturally riskier and more expensive to curate than others, e.g. research data.
Liz Lyon, UKOLN, picked up this theme in her presentation on partnering for research data, drawing on her Informatics Transform paper in the International Journal of Digital Curation, 2012. Research data has many stakeholders in addition to libraries and researchers: institutions and universities, national academies and professional societies, publishers, institutional and subject repositories, and funders. She emphasised the discipline-focussed nature of the data and the global network of expertise around it. A recent report by Mary Auckland for RLUK, Reskilling for Research (2012), identifies an alarming skills gap when it comes to managing the data: for example, only 2% of the required capacity exists now for metadata creation, but 16% will be needed. There was a nice mention of Cambridge University Library as one of the few libraries to employ a data scientist (my colleague, Anna Collins).
Later in the day Joy Davidson had a succinct list of the skills institutions need to put policy into practice, which focussed on practical skills within the institution rather than formal education, and on collaboration between various staff within the institution – not just library staff. They would make a great starting point for a research data manager’s job description and included:
- Ability to understand researchers’ practices and data holdings.
- Understanding of existing institutional policies and infrastructure
- Knowledge of central support systems and responsibilities
- Knowing when not to re-invent the wheel
- Working with existing committees rather than suggesting new ones; using existing codes of practice for researchers; using these policy statements to your advantage – flagging up the risks of gaps in provision
- Ability to advise on risks and benefits
- Help researchers and senior management to understand funders’ expectations
- Ability to advise on licensing data
- Ability to support selection and appraisal
- Support description of data
- Ability to support data management plans
- Ability to define infrastructure requirements, e.g. implications of cloud computing
- Costing curation
I had heard Eric Meyer of the Oxford Internet Institute before on the subject of Web archiving but was struck this time around by the sheer size of the gap he describes, not just the scale of the data but the potential value of what is already being lost and the tools that will be needed to make it usable in future.
We also need to take more interest in the other sources of digital content alongside web archives, from mobile phones, Facebook, and Twitter.
While the Internet Archive’s Wayback approach is adequate for individual sites, it is far more challenging to archive dynamic, personalised feeds like Facebook. See the Twitterverse for one view of the complexity of that environment; many of its elements will disappear over the next few years. Tools, e.g. for visualisation, built for the live web are of little or no use on web archives because there is no access to the infrastructure via APIs. Access to the transactions themselves is important for some areas of research.
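As a small aside of mine (not from the workshop), the snapshot-oriented Wayback model can be seen in the Internet Archive’s public CDX API, which lists discrete captures of a URL. This sketch just builds a query URL for that API; the endpoint and parameters are the documented ones, but treat it as an illustration rather than production code:

```python
# Sketch: querying the Wayback Machine's CDX API for page snapshots.
# Each returned row is one frozen capture of a URL -- workable for a
# static site, but a personalised feed has no single "capture".
from urllib.parse import urlencode

CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"

def cdx_query_url(site: str, year: int) -> str:
    """Build a CDX API query listing captures of `site` during `year`."""
    params = urlencode({
        "url": site,
        "from": f"{year}0101",       # captures from 1 January...
        "to": f"{year}1231",         # ...to 31 December
        "output": "json",            # one capture per JSON row
        "fl": "timestamp,original",  # capture time and original URL only
    })
    return f"{CDX_ENDPOINT}?{params}"

print(cdx_query_url("example.org", 2012))
```

Fetching that URL returns a list of timestamped snapshots – exactly the page-at-a-time view of the web that the richer, transaction-level research described above cannot be built on.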
Trends in viewing data in, say, 2020 are likely to include life logging, searching across collections, knowledge bases changing over time (e.g. Wikipedia), web tracks, predicting the future from the past, virtual worlds, images of a changing world, and time series.
How do we support this?
- Capturing a changing world in a fine-grained way requires developing triggers for more frequent data collection, e.g. for a tsunami or the Arab Spring
- Archive logs of how a web site was used over the long term, and allow them to be donated to your archive for future sociological research. More ambitious still would be to archive the web traffic itself.
- “Archive in a box” – a change from looking at past pages to collections of pages, and collections of collections, and a shift towards data analysis and datasets rather than reference. E-research, data sharing, linked data, APIs.
I’ll blog further on this workshop since there were no fewer than three presentations on ejournal archiving and a nice practical demonstration of how one library is archiving emails.