Monday 18 July 2016

Linked Data in Sydney, 2015

I was fortunate to attend the biennial Linked Open Data, Libraries, Archives, Museums summit in July 2015 in Sydney, Australia, and played a small role on the organising committee. The event showcases useful projects and brings together a disparate community of experts (mapped here: https://graphcommons.com/graphs/0f874303-97c2-4e53-abc6-83a13a1a2030)


What is Linked Data?

Linked Data is a way of structuring online and other data to improve its accuracy, visibility and connectedness. The technology has been available for more than a decade and has mainly been used by publishing and media organisations such as the BBC and Reuters. For archives, libraries and museums, Linked Data holds the prospect of providing a richer experience for users, better connectivity between pools of data, new ways of cataloguing collections, and improved access for researchers and the public.
It could, for example, provide the means to unlock research data or mix it with other types of data such as maps, or to search digitised content including books and image files and collection metadata. New, more robust, services are currently being developed by international initiatives such as Europeana which should make its adoption by libraries and archives much easier. There remain many challenges, however, and this conference provided the opportunity to explore these.
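To make the idea concrete (a minimal sketch of my own, not an example from the summit): Linked Data expresses statements as subject-predicate-object triples in which the parties are identified by URIs, so that data published by different institutions can be merged and queried together. In Python, with the rdflib library:

from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, RDFS

ARCHIVE = Namespace("http://example.org/archive/")  # placeholder namespace for illustration
g = Graph()

churchill_papers = ARCHIVE["churchill-papers"]      # hypothetical collection URI
g.add((churchill_papers, RDF.type, URIRef("http://purl.org/dc/dcmitype/Collection")))
g.add((churchill_papers, RDFS.label, Literal("Papers of Winston Churchill")))
# The 'linked' part: pointing at a shared, external URI that others also use
g.add((churchill_papers, URIRef("http://purl.org/dc/terms/subject"),
       URIRef("http://dbpedia.org/resource/Winston_Churchill")))

print(g.serialize(format="turtle"))

Because the subject term is a shared DBpedia URI rather than a local text string, anyone else's data about the same person can be joined to this record automatically.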
The conference comprised a mix of quick-fire discussions, parallel breakout sessions, 2-minute introductions to interesting projects, and the Challenge entries.
Quick-fire points from delegates
  • Need for improved visualisation of data (current visualisations are not scalable or require too much IT input for archivists and librarians to realistically use)
  • Need to build Linked Data creation and editing into vendor systems (the Step change model which we pursued at King’s Archives in a Jisc-funded project)
  • Exploring where text mining and Natural Language Processing overlap with LOD
  • World War One Linked Data: what next? (less of a theme this time around as the anniversary has already started)
  • LOD in archives: a particular challenge? (archives are lagging libraries and galleries in their implementation of Linked Data)
  • What will be the next ‘Getty vocabularies’: a popular vocabulary that can encourage wider use of LOD?
  • Fedora 8 and LOD in similar open source or proprietary content management systems (how can Linked Data be used with these popular platforms?)
  • Linked Data is an off-putting term implying a data-centric set of skills (perhaps Linked Open Knowledge as an alternative?)
  • Building a directory of cultural heritage organisation LOD: how do we find available data sets? (such as Linked Open Vocabularies)
  • Implementing the Europeana Data Model: next steps (stressing the importance of Europeana in the Linked Data landscape)
  • Can we connect different entities across different vocabularies to create new knowledge? (a lot of vocabularies have been created, but how do they communicate?)
Day One sessions
OASIS Deep Image Indexing (http://www.synaptica.com/oasis/).
This talk showcased a new product called OASIS from Synaptica, aimed at art galleries, which facilitates the identification, annotation and linking of parts of images. These elements can be linked semantically and described using externally-managed vocabularies such as the Getty suite of vocabularies or classifications like Iconclass. This helps curators in their work and gives end users an enriched appreciation of paintings and other art. It is the latest example of annotation services that overlay useful information and utilise agreed international standards like the Open Annotation Data Model and the IIIF standard for image zoom.
We were shown two examples: Botticelli’s The Birth of Venus and Holbein’s The Ambassadors, with impressive zooming of well-known paintings and detailed descriptions of features. Future development will allow for crowdsourcing to identify key elements and utilising image recognition software to find these elements on the Web (‘find all examples of images of dogs in 16th century public works of art embedded in the art but not indexed in available metadata’).
This product mirrors the implementation of IIIF by an international consortium that includes leading US universities, the Bodleian, the BL, the Wellcome and others. Two services have evolved which offer archives the chance to provide deep zoom and image interoperability for their users: Mirador, and the Wellcome’s Universal Viewer (http://showcase.iiif.io/viewer/mirador/). These get around the problem of having to create differently sized derivatives of images for different uses, and of having to publish very large images on the internet when download speeds might be slow.
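Both viewers rest on the IIIF Image API, whose request URLs encode exactly the region and size the client wants, so servers can deliver tiles on demand instead of pre-cut derivatives. A minimal sketch of the version 2 URL pattern (the server and image identifier below are hypothetical):

# IIIF Image API 2.x request pattern:
# {scheme}://{server}/{prefix}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}
def iiif_url(server, identifier, region="full", size="800,",
             rotation="0", quality="default", fmt="jpg"):
    return f"https://{server}/iiif/{identifier}/{region}/{size}/{rotation}/{quality}.{fmt}"

# Whole image, scaled to 800 pixels wide
print(iiif_url("images.example.org", "ambassadors"))
# Deep zoom: a 1000x1000 pixel detail at full resolution, offset (2048, 1024)
print(iiif_url("images.example.org", "ambassadors",
               region="2048,1024,1000,1000", size="full"))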
Digital New Zealand
Chris McDowall of Digital New Zealand explored how best to make LOD work for non-LOD people. Linked Open Data uses a lot of acronyms and presumes a level of technical knowledge of systems that many practitioners do not have. This is a particular bugbear of mine, which is why this talk resonated. Chris’ advocacy of cross developer/user meetups also chimed with my own thinking: LOD will never be properly adopted if it is assumed to be the province of ‘techies’. Developers often don’t know what they are developing because they don’t understand the content or its purpose: they are not curators.
He stressed the importance of vocabulary cross-walks and the need for good communication in organisations to make services stable and sustainable. Again, this chimed with my own thinking: much work needs to be done to ‘sell’ the benefits of Linked Data to sceptical senior management.
These benefits might include context building around archive collections, gamification of data to encourage re-use, and serendipity searches and prompts which can aid researchers. Linked Data offers truly targeted searching, in contrast to the ‘faith-based technology’ of existing search engines (a really memorable expression).
He warned that the infrastructure demands of LOD should not be underestimated, particularly from researchers making a lot of simultaneous queries: he mooted a pared down type of LOD for wider adoption.
Chris finished by highlighting a number of interesting use cases of LOD in libraries as part of the Linked Data for Libraries (LD4L) project, a collaboration between Harvard, Cornell and Stanford (https://wiki.duraspace.org/pages/viewpage.action?pageId=41354028). See also Richard Wallis’ presentation on the benefits of Linked Data for libraries: http://swib.org/swib13/slides/wallis_swib13_108.pdf
Schema.org
Richard Wallis of OCLC explored the potential of Schema.org, a growing vocabulary of high-level terms agreed by the main search engines to make content more searchable. Schema.org helps power the search result boxes one sees at the top of Google search return pages. Richard suggested the creation of an extension relevant to archives to add to the one for bibliographic material. The advantage of schema.org is that it can easily be added to web pages, resulting in appreciable improvement in ranking and the possibility of generating user-centred suggestions in search results. For an archive, this might mean a Google user searching for the papers of Winston Churchill is offered related suggestions, such as booking tickets to a talk about the papers, or viewing Google Maps information showing the opening times and location of the archive.
The group discussion centred on the potential elements (would the extension refer to theses, research data, or university systems that contain archive data, such as finance and student information?), and on the need for use cases setting out potential benefits. I agreed to be part of an international team, through the W3C, to help set one up.
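To give a flavour of what this looks like in practice (my own sketch, not from the talk, and using types that already exist in the core vocabulary since the archives extension was still only a proposal): the markup is simply JSON-LD dropped into a web page inside a script tag. All details below are invented for illustration.

import json

# Hypothetical archive description using existing core schema.org types
markup = {
    "@context": "https://schema.org",
    "@type": "Library",                     # a core type; an archives extension would be more precise
    "name": "Example University Archives",  # placeholder institution
    "url": "https://archives.example.ac.uk/",
    "openingHours": "Mo-Fr 09:30-17:00",
    "address": {
        "@type": "PostalAddress",
        "addressLocality": "London",
        "addressCountry": "GB",
    },
}

# Embedded in a page as: <script type="application/ld+json"> ... </script>
print(json.dumps(markup, indent=2))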
Dork shorts/Speedos – these are impromptu lightning talks lasting a few minutes, which highlight a project, idea or proposal. View here: http://summit2015.lodlam.net/about/speedos/
Highlights:
Cultuurlink (http://cultuurlink.beeldengeluid.nl/app/#/): Introduction by Johan Oomen
This Dutch service facilitates the linking of different controlled vocabularies and thesauri, and helps address the problems faced by many cultural organisations: ‘which thesaurus do I use?’ and ‘how do I avoid reinventing the thesaurus wheel?’. The service allows users to upload a SKOS vocabulary, link it with one of four supported vocabularies and visualise the results.
The service helps different types of organisation to connect their vocabularies, for example an audio-visual archive with a museum’s collections. The approach also allows content from one repository to be enhanced or deepened through contextual information from another. The example of Vermeer’s Milkmaid was cited: enhancing the discoverability of information on the painting held in the Rijksmuseum in Amsterdam through connecting the collection data held on the local museum
management system with DBPedia and with the Getty Art and Architecture Thesaurus. This sort of approach builds on the prototypes developed in the last few years to align vocabularies (and to ‘Skosify’ data – turn it into Linked Data) around shared Europeana initiatives (see http://semanticweb.cs.vu.nl/amalgame/).
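The core of such an alignment is small: SKOS mapping properties assert that a concept in one vocabulary corresponds to a concept in another. A hedged sketch with Python's rdflib (the concept URIs below are illustrative, not Cultuurlink's actual data):

from rdflib import Graph, URIRef
from rdflib.namespace import SKOS

g = Graph()
local = URIRef("http://thesaurus.example.nl/concept/melkmeid")         # hypothetical local term
getty_aat = URIRef("http://vocab.getty.edu/aat/300000000")             # illustrative AAT-style URI
dbpedia = URIRef("http://dbpedia.org/resource/The_Milkmaid_(Vermeer)")

# skos:exactMatch links equivalent concepts across vocabularies;
# skos:closeMatch hedges a weaker alignment
g.add((local, SKOS.exactMatch, getty_aat))
g.add((local, SKOS.closeMatch, dbpedia))

print(g.serialize(format="turtle"))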
Research Data Services project: Introduction by Ingrid Mason
This is a pan-Australian research data management project focusing on the repackaging of cultural heritage data for academic re-use. Linked Data will be used to describe a ‘meta-collection’ of the country’s cultural data, one that brings together academic users of data and curators. It will utilise the Australia-wide research data nodes for high speed retrieval (https://www.rds.edu.au/project-overview
and http://www.intersect.org.au/).
Tim Sherratt on historians using LOD
This fascinating short explained how historians have been creating LOD for years – and haven’t even known they were doing it – identifying links and narratives in text as part of the painstaking historical process. How can Linked Data be used to mimic and speed up this historical research process? Tim showed a working example, and a step-by-step guide is available: http://discontents.com.au/stories-for-machines-data-for-humans/
and listen to the talk: http://summit2015.lodlam.net/2015/07/10/lod-book/
Jon Voss on Historypin
Jon explained how the popular historical mapping service, Historypin, is dealing with the problem of ‘roundtripping’, where heritage data is enhanced or augmented through crowdsourcing and returned to its source. This is of particular interest to Europeana, whose data might pass through many
hands. It highlights a potential difficulty of LOD: validating the authenticity and quality of data that has been distributed and enriched.
Chris McDowall of Digital New Zealand
Chris explained how to search across different types of data source in New Zealand, for example matching and searching for people using phonetic algorithms to generate sound-alike suggestions and fuzzy name matching: http://digitalnz.github.io/supplejack/.
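As a flavour of how such sound-alike matching works (a simplified sketch of the classic Soundex algorithm, not Supplejack's actual implementation, which is linked above):

def soundex(name: str) -> str:
    # Map consonants to the classic Soundex digit groups
    mapping = {}
    for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")]:
        for ch in letters:
            mapping[ch] = digit
    name = name.lower()
    code = name[0].upper()          # keep the first letter as-is
    prev = mapping.get(name[0], "")
    for ch in name[1:]:
        digit = mapping.get(ch, "")
        if digit and digit != prev: # skip adjacent duplicate codes
            code += digit
        if ch not in "hw":          # h and w do not separate duplicate codes
            prev = digit
    return (code + "000")[:4]       # pad or truncate to four characters

# Variant spellings of the same surname collapse to one code
print(soundex("Smith"), soundex("Smythe"))   # S530 S530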
Axes Project (http://www.axes-project.eu/): Introduction from Martijn Kleppe
This 6 million Euro EU-funded project aims to make audio-visual material more accessible and has been trialled with thousands of hours of video footage, and expert users, from the BBC. Its purpose is to help users mine vast quantities of audio-visual material in the public domain as accurately and quickly as possible. The team have developed tools using open source frameworks that allow users to detect people, places, events and other entities in speech and images and to annotate and refine these results. This sophisticated tool set utilises face, speech and place recognition to zero in on precise fragments without the need for accompanying (longhand) metadata. The results are undeniably impressive – with a speedy, clear interface locating the parts of each video with filtering and similarity options. The main use for the toolset to date is with film studies and journalism students but it unquestionably has wider application.
The Axes website also highlights a number of interesting projects in this field (http://www.axes-project.eu/?page_id=25). Two stand out: Cubrik (http://www.cubrikproject.eu/), another FP7 multinational project which mixes crowd and machine analysis to refine and improve searching of multimedia assets; and the PATHS prototype (http://www.paths-project.eu/), ‘an interactive personalised tour guide through existing digital library collections. The system will offer suggestions about items to look at and assist in their interpretation. Navigation will be based around the metaphor of a path through the collection.’ The project created an API and user interface, and launched a tested exemplar with Europeana to demonstrate the potential of new discovery journeys to open access to already-digitised collections.
Loom project (http://dxlab.sl.nsw.gov.au/making-loom/): Introduction from Paula Bray of State Library of New South Wales
The NSW State Library sought to find new ways of visualising their collections by date and geography through their DX Lab, an experimental data laboratory similar to BL Labs, which I have worked with in the UK. One visually arresting visualisation shows the proportions of collections relevant to particular geographical locations in the city of Sydney. Accompanied by approving gasps from the audience, this showed an iceberg graphic superimposed onto a map, showing the proportion of collections about a place that had been digitised and the proportion yet to be digitised – a striking way of communicating the fragility of some collections and the work still to be done to make them
accessible to the public.
LODLAM challenge
  1. Open Memory Project. This Italian entry won the main prize. It uses Linked Data to re-connect victims of the Holocaust in wartime Italy. The project was thought-provoking and moving and has the potential to capture the public imagination.
  2. Polimedia is a service designed to answer questions from the media and journalists by querying multi-media libraries, identifying fragments of speech. It won second prize for its innovative solution to the challenge of searching video archives.
  3. LodView goes LAM is new Italian software designed to make it easier for novices to publish data as Linked Data. A visually beautiful and engaging interface makes this a joy to look at.
  4. EEXCESS is a European project to augment books and other research and teaching materials with contextual information, and to develop sophisticated tools to measure usage. This is an exciting, ambitious project to assemble different sources using Linked Data to enable a new kind of publication made up of a portfolio of assets.
  5. Preservation Planning Ontology is a proposal for using Linked Data in the planning of digital preservation by archives. It has been developed by Artefactual Systems, the Canadian company behind AtoM and Archivematica software. This made the shortlist as it is a good example of a ‘behind the scenes’ management use of Linked Data to make preservation workflows easier.
A selection of other entries:
Public Domain City extracts curious images from digitised content. This is similar to BL Labs’ Mechanical Curator, a way of mining digitised books for interesting images and making them available to social media to improve the profile and use of a collection.
Project Mosul uses Linked Data to digitally recreate damaged archaeological heritage from Iraq. A good example of using this technology to protect and recreate heritage damaged in conflict and disaster.
The Muninn Project combines 3D visualisations and printing using Linked Data taken from First World War source material.
LOD Stories is a way of creating story maps between different pots of data about art and visualising the results. The project is a good example of the need to make Linked Data more appealing and useful, in this case by building ‘family trees’ of information about subjects to create picture narratives.
Get your coins out of your pocket is a Linked Data engine about Roman coinage and the stories it has to tell – geographically and temporally. The project uses nodegoat as an engine for volunteers to map useful information: http://nodegoat.net/.
Graphity is a Danish project to improve access to digitised historical Danish newspapers and enhance them with maps and other content using Linked Data.
Dutch Ships and Sailors brings together multiple historical data sources and uses Linked Data to make them searchable.
Corbicula is a way of automating the extraction of data from collection management systems and publishing it as Linked Data.
Day two sessions
Day two sessions focused on the future. A key session led by Richard Wallis explained how Google is moving from a page ranking approach to a triple confidence assertion approach to generating search results. The way in which Google generates its results will therefore move closer to the LOD method of attributing significance to results.
Highlights
  • Need for a vendor manifesto to encourage systems vendors, such as Ex Libris, to build LOD into their systems (Corey Harper of New York University proposed this and is working closely with Ex Libris to bring this about)
  • Depositing APIs/documentation for maximum re-use (APIs are often a weak link – adoption of LOD won’t happen if services break or are unreliable)
  • Uses identified (mining digitised newspaper archives was cited)
  • Potential piggy-backing from Big Pharma investment in Big Data (massive investment by drugs companies to crunch huge quantities of data – how far can the heritage sector utilise even a fraction of that?)
  • Need to validate LOD: the quality issue – need for an assertion testing service (LOD won’t be used if its quality is questionable. Do curators (traditional guardians of quality) manage this?)
  • Training in Linked Data needs to be addressed
  • Need to encourage fundraising and make LOD sustainable: what are we going to do with LOD in the next ten years? (Will the test of the success of Linked Open Data be if the term drops out of use when we are all doing it without noticing? Will 5 Star Linked Data be realised? http://5stardata.info/)
Summary
There were several key learning points from this conference:
  • The divide between technical experts and policy and decision makers remains significant: more work is needed to provide use cases and examples of improved efficiencies or innovative public engagement opportunities that the technology provides
  • The re-use and publication of Linked Data is becoming important and this brings challenges in terms of IPR, reliability of APIs and quality of data
  • Easy-to-use tools and widgets will help spread its use, avoiding complicated and unsustainable technical solutions that depend on project funding
  • Working with vendors to incorporate Linked Data tools in library and archive systems will speed its adoption
  • The Linked Data community ought to work towards the day Linked Data is business as usual and the term goes out of use

Wednesday 26 June 2013

LODLAM 2013

I attended the LODLAM conference in Montreal last week. This was the second Linked Open Data in Libraries, Archives & Museums conference, and around 100 delegates gathered at the National Library and Archives of Quebec, drawn from cultural institutions, universities, private companies such as Axiell, and other organisations from around the globe such as OCLC and the BBC. Attendees included representatives from the US, Canada, the UK and Europe, New Zealand and Australia, and other countries.



National Library & Archives of Quebec

The un-conference format provided plenty of room for brainstorming and breakout sessions on a multiplicity of themes - most running in parallel, which meant there was a lot of running between seminar rooms! The conference also had a competitive element in its run-up, where teams of Linked Data experts developed or refined showcase projects. The semi-finalists lined up before four X-Factor judges before a winner was announced. Networking was also the name of the game as we tried to link up our respective projects and think about the development of more robust services - many of us ran out of business cards, I am sure.

The breakout sessions covered all areas of Linked Data, not least controlled vocabularies, mapping and patterns, the user experience, teaching Linked Data, ontologies such as CIDOC-CRM, natural language processing, historical mapping, annotation tools, World War One and many other subjects. Several sessions were an opportunity to view new tools such as KARMA, GNOSS and PUND.IT, which have been designed to facilitate the mark-up and annotation of web pages and the aggregation of different entities and types of information. One very informative session covered the Getty release of its vocabularies as Linked Data, which should prove enormously useful to the whole community. The World War One session was an opportunity for participants to gain an overview from across the globe, including trench mappers and authority experts from the US, Finland, Australia and New Zealand. The BBC showed off its World Service audio file transcription project which contains a significant crowdsourcing element to help refine data.

The LODLAM challenge finalists were universally excellent. Linked Jazz showed off its superb visualisation tools to depict relationships and influences between jazz artists. Free your Metadata was a song combo that got the floor buzzing with its appeal for data reconciliation and cleansing as a prerequisite for successful linking. Mismuseos combined museum metadata and images for museums. The winner was PUND.IT, a... 'client-server annotation system which lets you express semantics about any kind of web content through labeled relations among annotated items, linking them to the Web of Data. Annotations can be shared and organized into private or public notebooks which can openly accessed to build engaging visualizations'.

The conference as a whole showed the vast array of work that has been undertaken throughout the GLAM sector to create new datasets and develop or refine tools that will allow new, more robust, services to be developed - a good mix of the theoretical and the practical. The need for training that demystifies the subject, a stress on users including the front-end experience and visualisation, and quantifying the benefits to organisations through use/economic cases appear to be priorities.


Montreal

Monday 17 June 2013

Historypin mapping tool

Good news that Historypin have completed development work on a test version of the new mapping tool for displaying Linked Data links from archive catalogues on the Historypin Google maps interface.

The tool interrogates a portion of the many thousands of place names associated with catalogue entries on AIM25, which aggregates collection-level catalogue descriptions from archives held in around 130 institutions in the London area, including learned societies, universities, museums and local authorities. AIM25's 17,000 catalogues relate to collections containing a wealth of information about every corner of the globe, spanning around 500 years of history and covering topics as diverse as scientific discovery, war and international relations, exploration and travel, biography, politics and religion.



The mapping tool was an experimental component of Step change, which sought to release the UK Archival Thesaurus (UKAT), the key UK subject vocabulary, as a Linked Data service; to develop a new editing tool for the creation of semantic archive catalogues; and to embed the tool in Axiell's CALM, one of the most popular proprietary archive cataloguing applications, used by some 400 institutional customers in the UK and Europe.

The mapping project interrogates UKAT for relevant 'Scope and Content' related place names (ie catalogues tagged up with place names relevant to the source material and not biographical or other contextual information). These catalogue titles are then flagged on the map, which is badged as a Historypin themed channel. Users can expand titles to read relevant information about the collections and make arrangements for visiting the archives or to follow links to digital surrogates of the source material.
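Under the hood, this kind of tool amounts to a query against the published Linked Data for catalogue entries tagged with places and their coordinates. A sketch using the standard SPARQL protocol (the endpoint URL and the exact properties used by AIM25 are assumptions here, not the documented schema):

import requests

ENDPOINT = "http://data.aim25.ac.uk/sparql"   # hypothetical endpoint location

query = """
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
SELECT ?collection ?title ?lat ?long WHERE {
  ?collection dcterms:title ?title ;
              dcterms:spatial ?place .
  ?place geo:lat ?lat ; geo:long ?long .
} LIMIT 100
"""

resp = requests.get(ENDPOINT, params={"query": query},
                    headers={"Accept": "application/sparql-results+json"})
for row in resp.json()["results"]["bindings"]:
    print(row["title"]["value"], row["lat"]["value"], row["long"]["value"])

Each result can then be dropped onto the map as a pin, with the title expanding to the catalogue description.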

The sub-project was not without significant obstacles. Issues included performance problems associated with too many simultaneous queries; difficulty in visualising different levels of granularity - what constitutes a 'place': geo-co-ordinates, a bounded area or a subjective understanding of the 'local'?; and the problem of flagging up diverse locations listed in single entries (something that often happens with the personal papers of individuals whose careers might include study, employment or family life across the world). Workarounds included the inclusion of a granularity filter to overcome the problem of excessive clustering of returns, and limits placed on simultaneous queries. The resulting map only returns a portion of the many tens of thousands of places described in AIM25 and doesn't yet link with a conventional Historypin map showing pinned photographs of local places. This will follow soon, to develop a more useful and integrated service combining photographs contributed by partner archives with relevant catalogue entries that provide useful contextual information.





Tuesday 4 December 2012

Linked Data in Archives, Libraries and Museums Meeting, 3rd December 2012

I organised a roundtable meeting of around 35 archivists, curators and Linked Data specialists drawn from the UK cultural sector, who met at King's College on the afternoon of 3 December. The audience included representatives of major institutions such as the British Library, British Museum and Imperial War Museum, from AIM25 partner organisations and from other key players including the Collections Trust, Mimas, JISC, Wikipedia, Historypin and Culture24. Software vendors were represented by Axiell CALM, Adlib and MODES.

The focus of the meeting was working out possible practical 'next steps' for Linked Data in archives, libraries and museums, following the completion of a number of successful projects over the past 18 months - among them a clutch of JISC Discovery programme initiatives such as Step change - and ahead of upcoming events including the JISC Discovery meeting planned for February and the LODLAM conference in Montreal in summer 2013.

The meeting opened with a number of presentations. Gordon McKenna of Collections Trust reviewed Europeana initiatives, including the Linked Heritage project and a recent partner survey that revealed ongoing IP worries in the sector over access to material. He raised the point that partner-publishers arguably need more content to connect to (successful Linked Data is not just about what you publish but what you consume), and that understanding user requirements better was also a key concern.

Andrew Gray, Wikipedian in Residence at the British Library, described the exciting work currently being carried out on authority files and introduced 'Wikidata' - the new DBpedia. He stressed the value of controlled vocabularies within the ALM sector and the need to demystify the language used in Linked Data projects, as this was potentially putting off users.

Adrian Stevenson of Mimas reviewed the groundbreaking work of LOCAH, upon which Step change and other projects have built, and raised a number of important points, including the need for more, easier-to-use, tools and the complexities of dealing with duplication, inconsistency and currency in the data. He called for more co-operation among cultural partners (not just ALM practitioners). Adrian rounded up by previewing the new World War One aggregation site, which, while not using Linked Data per se, is a good example of a cross-cultural aggregation project where different archives will sometimes demonstrate variable levels of technical knowledge and expertise (for example concerning APIs) and consequently often need active support to make their data available.

Geoff Browell reviewed the Step change and Trenches to Triples projects and their rationale - to encourage the creation of archival Linked Data by making it part of the normal cataloguing/indexing process, and to do this through the incorporation of editing and publishing tools installed in CALM, Adlib and other archive software commonly used by the archival community. The experience of Cumbria on the Step change project shows that users need to come first and that there is a real demand for the release of key datasets such as the Victoria County History as Linked Data.

Bruce Tate of the Institute of Historical Research concluded the first session by previewing the enlarged 'Connected histories' project, whose API will soon be available to consumers, including a new georeferencing tool to map content held in British History Online. He also reviewed a recent impact measurement survey. This chimed with several speakers in the meeting, who argued that the community needs more, and better quality, information on how Linked Data might help different audiences, including academics and the general public, in order to sell the concept (and secure necessary investment) to internal audiences within institutions (senior management) and to funders like the Research Councils.

The second half of the meeting comprised three discussions led by leading practitioners.

Nick Stanhope of Historypin led on community engagement and the opportunity afforded by new crowdsourcing tools being developed by Historypin to help crowdsource Linked Data - for example the verification of people, places and their relationships. He stressed the role of storytelling, which Linked Data ought to seek to capture.

Robert Baxter of Cumbria Archive Service and Step change argued that most archives need full Google visibility for their records as a starting point (which many do not currently have) and reiterated the need to sell Linked Data more effectively within institutions. A more intuitive 'stepping stones' approach is needed to support research discovery (something also raised by Nick Stanhope); Linked Data and other tools ought to support this view of research as exploration or journey.

Richard Light reviewed the important development work carried out on the MODES software and the reviews undertaken by CIDOC-CRM. He focused on next steps, raising the questions of whose job it is to publish data and the value of an 'open ended distributed database of cultural history'. Among his recommendations were that:

· Publishers of authorities ought to publish as Linked Data as a matter of course
· Software vendors in the sector should be encouraged to provide some form of "web termlist" facility so that recorders can easily add Linked Data identifiers
· There needs to be agreement on the need for sector-specific guidelines for structuring Linked Data resources (the "mortar" in the "wall"), and ideally a working group actually producing some
· There should be an exploration of how we get "horizontal" resources for the common entity types (people, places, etc.) so we have some concepts/URLs we can actually share
Several key themes emerged from the afternoon:

· Advocacy: the role of case studies and impact assessments in building business cases for internal and sectoral/funder investment
· Audiences: A renewed focus on the user and consumer of data, their stories and research journey
· Accessibility: to simplify data creation by involving vendors, minimising the variety of editing tools and using agreed master authorities to cut down URI duplication; to create a registry of tools and develop suitable plug-ins and mediation services, but to do so based on sector agreement, not project-by-project. The Mellon-funded Research Base is one such initiative to minimise duplication.
Other themes included:

· Licensing - this still remains a stumbling block due to lack of clarity around Creative Commons licenses - CC0 or CC-BY?
· Training - practitioners need technical support and training to get the best from Linked Data
· Cultural sector - this should be viewed in the round and not just archives, libraries and museums but the broader sector including galleries and other arts organisations, aggregators and funders. The Arts Council and national film archive community were two such organisations or communities of interest that were cited.


Friday 30 November 2012

ALiCat - 8 out of 10 archivists...?

ALiCat (Archival Linked-data Cataloguer) is one of the outputs of the Step change project. It is an editing tool for collection-level records. Of course AIM25 has its own web-based editing tool, and AIM25 archivists are also able to upload EAD, so they can use desktop tools such as CALM, or whatever else they choose, to produce their records. There are other web-based EAD editing tools too, such as the Archives Hub's excellent EAD editor.

Once the tortured acronym is expanded, we can see that ALiCat is an attempt to give archivists the ability to assign and amend the linkages between the resources and the pertinent terms, both within the body of the record and in those "access points" that are used for indexing the record.

So initially ALiCat presents a reasonably straightforward tabbed form for inputting those ISAD(G) elements that archivists know and love.


One is alerted to ALiCat's USP by the fact that the index terms are ever present on the page and colour coded according to their type (AIM25 uses a subset of those EAD elements that can be contained within <controlaccess>).

The aim of the broader project was to assign persistent URIs and (the beginnings of) consumable semantic representations for both the AIM25 index terms and the records that they are associated with. Much of this was done retrospectively on the existing data. The role of ALiCat was to provide an interface for tweaking these enhancements (which were not always perfect) and to provide a possible method for archivists to include Linked Data in their records at the point of creation.

On a technical level, ALiCat allowed the developers at ULCC (me) to demonstrate the use of the data.aim25.ac.uk data service. The architecture of ALiCat is completely reliant on the RESTful output (and input) of data.aim25.ac.uk. ALiCat uses the jQuery javascript framework to manage requests to data.aim25.ac.uk and display the results.

Do archivists want the extra tasks associated with finding and assigning meaning - or at least that meaning expressed at the end of a LOD URI? Well, there are obvious benefits to recording these links, which I'm sure have been extolled at length in this blog and others. One of the problems ALiCat is trying to solve is how to assimilate this process into the workflow of the archivist in an efficient manner. The method ALiCat uses to try and solve this problem is to provide some useful and well integrated UI tools for suggesting and searching for URIs as the archivist edits.

Terms can be highlighted by an editor, who is then offered the opportunity to look up the term using a selection of LOD data services. The selection is relatively small at the moment, but the exercise of consuming data formatted according to common and open formats (XML, JSON), standards (RDF) and vocabularies (including SKOS, GeoNames and FOAF) should mean that the task of adding more third-party look-up services is well within reach.

Of course, one of the services used to look up terms was data.aim25.ac.uk itself. This helped us to make sure that the search service provided was more or less in line with others in similar fields.


If no suitable results can be found, the editor is given the opportunity to define a term according to the data structures used by AIM25-UKAT and coin a new URI in the data.aim25.ac.uk domain.

In addition to looking up individual terms, ALiCat offers editors an option that will send all the text in a given ISAD(G) element to a suggestion service. This is triggered when an archivist moves on from editing a given field (losing focus). The suggestion service will return the text with 'found' terms tagged and highlighted, and an associated URI (or URIs) assigned. Some terms may need a few extra steps to disambiguate them; the process is much the same as when looking up individual terms. The editor can then select terms by dragging them into the access point list on the right; suggested terms that are not selected are stripped out when saving. The suggestion services available for use are OpenCalais and the match service from data.aim25.ac.uk (though the latter lacks the linguistic analysis of OpenCalais).
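A rough client-side sketch of what a term look-up against such a service might look like (the path and parameters below are placeholders, not the documented data.aim25.ac.uk interface):

import requests

# Hypothetical term look-up against a SKOS-style REST service
def lookup_term(term, service="http://data.aim25.ac.uk/search"):   # placeholder URL
    resp = requests.get(service, params={"q": term, "format": "json"})
    resp.raise_for_status()
    return resp.json()

# Each candidate would carry a label and a dereferenceable URI the editor can assign
for candidate in lookup_term("Rudolf Steiner").get("results", []):
    print(candidate.get("label"), candidate.get("uri"))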

Once the record has been saved, the Linked Data URIs are included both in the access points - rendered in the EAD thus:

<controlaccess>
...
<persname source="AIM25-UKAT" uri="http://data.aim25.ac.uk/id/person/steinerrudolf1861-1925philosopher">Rudolf Steiner</persname>
...
</controlaccess>

The URIs are also embedded in the text of the record as RDFa:

in 1913. It had its origins in the spiritual philosophy of <span property="http://xmlns.com/foaf/0.1/Person" resource="http://data.aim25.ac.uk/id/person/steinerrudolf1861-1925philosopher">Rudolf Steiner</span> (1861-1925).




Friday 5 October 2012

Lessons learned part 2

5. Usability

The recent JISC programme round-up in Birmingham highlighted a number of potential common problems or issues thrown up by the projects, including the practicality of APIs and licensing, but I would add user experience to the list.

Linked Data discussions in the field of libraries, archives and museums have generally, and until recently, been confined to discussion of the complex technical challenges involved in making systems work. This is understandable, but recent discussions, such as those held at the 2nd Linked Data in Libraries conference in Edinburgh on 21st September, have focused attention on making Linked Data as comprehensible and usable as possible by the general public. Papers on the consumption of data included introductions to the important work on visualisation currently under way. Linked Data provides the opportunity to move beyond the conventional cataloguing paradigm and its corollary, published lists and tables of data that risk being seen as visually unappealing and stodgy (sometimes unfairly). The complex relationships described in Linked Data don't translate easily to tables but rather lend themselves to the dynamic graphs and representations now becoming common, for example in the display of statistics by national governments. Linked Data provides an opportunity to begin to see data in new ways for library, archive and museum users.

Fundamentally, for Linked Data to be more widely adopted, there needs to be a focus on the user experience and on demonstrating the value added by combining sources, mapping sources using Linked Data and other practical improvements.

Step change set out to address the usability concern from the outset, but the project has highlighted how much work needs to be done in this area. CALM improvements include the display of relevant external links alongside catalogue records - for example British National Bibliography entries. User testing established that archivists need to exercise discretion in the links they set up and make visible (whatever is happening in the back-end). Links must work (accurate and complete data is returned speedily), but must also be relevant (for example appropriate to the level of record being displayed). Branding starts becoming important to distinguish the origin of data and mitigate a tendency for users to view the data in archives, libraries and museums websites as coming from that one source (that repository). Users will need to start viewing such websites more as they have learned to interrogate a page of Google search returns (as coming from multiple sources). A simple 'Linked Data' logo should be adopted to provide users with a shorthand way of recognising that an additional level of useful information is now available and can be trusted (because an information professional has actively checked the source and chosen to link it).

Next steps:

Further user testing is now under way following the release of the Linked Data CALM and its front-end. This will take place in Cumbria involving users of CALM and members of the public familiar with the current archive website. This will drive improvements ready for the release of CALM version 10. Work is under way on creating RDFa and the rendering of selected terms to display useful external content in an attractive way, while not confusing the user with excess information.


Friday 7 September 2012

Lessons learned

The Step change project has identified a number of useful 'lessons learned' - more will follow in future posts.

1. Data quality

The creation of RDF and linking with similar resources might expose legacy catalogue data as uneven, inadequate or inaccurate. It is likely that many existing catalogues, though adequate for basic online searching, are not up to the task in a Linked Data environment. Date ranges cited in archive catalogues are often too broad to identify components of collections; geographical designations are insufficiently specific or too fuzzy (does 'London' mean Charing Cross or Croydon, the City or London, Canada? Which units are being described, and are they historically accurate?). The reality is that many catalogues predate not only RDF but the internet, and arguably are not fit for purpose in a Google-enabled search environment, either being inaccessible to search engines or not optimised for web-crawling.

Next steps:

Review of links: while an archivist or librarian might be familiar with their own collections, they are likely to be unfamiliar with each other's content, or content from unrelated sources (such as maps, audio-visual material or database content). A real example encountered in Step change was the join-up between archive collection descriptions and bibliographic information using the BNB, where archivists accessing the live service in CALM were often unable to identify, and therefore select for linking, the correct edition of an author's publication to match the relevant archive description by, or about, that author - the service returned ambiguous or difficult-to-interpret bibliographic data. Confronted with practical problems such as these, the professional focus group, which convened to review the markup tool embedded in CALM, recommended the implementation of an editing stage in CALM to preview possible selections of Linked Data join-ups, in order to minimise potential mistakes and make mark-up more efficient by reducing the need for time-consuming corrections post facto.

Knowledge transfer: furthermore, the linking preview problem clearly exposes the cross-disciplinary knowledge gap that hinders join-up between collections, except at the level of broad categories mapped across domains. Librarians, archivists, museum curators, academic experts and GIS and data curators simply don't know enough about each other's data to make the truly informed decisions that will underpin the entity relationship-identification and relationship-building at the heart of the successful implementation of Linked Data methodologies.

Outcome and next steps: Axiell is considering incorporating an improved editing tool in future releases of CALM. For the mapping component of the project for AIM25, a preview tool has been developed and installed in the ALiCat cataloguing utility that uses the Google Maps API and GeoNames to preview the names of places in micromaps, allowing the archivist to make speedier, more accurate choices of placenames before hitting the 'save' button.
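The GeoNames half of that preview is easy to sketch: its public JSON search API returns candidate places with coordinates that can seed a micromap. The endpoint and parameters below are GeoNames' real search interface, though the username is a placeholder and the surrounding code is illustrative, not ALiCat's actual implementation:

import requests

def geonames_candidates(placename, username="demo", max_rows=5):
    # GeoNames' JSON search endpoint; register a free username for real use
    resp = requests.get("http://api.geonames.org/searchJSON",
                        params={"q": placename, "maxRows": max_rows,
                                "username": username})
    return resp.json().get("geonames", [])

# Disambiguating 'London' before the archivist hits save
for place in geonames_candidates("London"):
    print(place["name"], place["countryName"], place["lat"], place["lng"])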

Step change's publication of UKAT as a Linked Data service helps overcome the knowledge gap as it at least provides an agreed subject, place, person and corporate name listing as a common starting point in describing certain entities. What it doesn't do is capture relationships and more work needs to be done to describe subject and domain-specific triples. A publicly-supported triplestore would be an important infrastructure development that would give professionals confidence that Linked Data is here to stay, and to encourage investment to embed in conventional cataloguing. Further steps are necessary, though, not least sponsorship of co-working between different knowledge professionals using cross-domain data - to properly document the challenges of mixing and matching library, archive and museum metadata and linking it with, say, research outputs in the arts and humanities.

The problem of inadequate catalogues is difficult to resolve - cataloguing backlogs are a higher priority than retroconversion, and if a catalogue is useful to potential researchers it is usually deemed adequate. Training should be provided to potential cataloguers to understand better the implications of online search strategies and search engine optimisation (aside from Linked Data), which are probably poorly understood by most archivists. The use of certain agreed vocabularies should be encouraged where these exist as Linked Data, and the AIM25-UKAT service helps supply this need for an indexing tool that coincidentally creates RDF without archivists necessarily being aware that this is happening. Some agreement should be reached on other specialist vocabularies, name authorities and place data (including historical places - at least in the UK) to create established hubs. These will potentially be more robust, avoid a fragile cat's cradle of APIs prone to network disruption, and serve as trustworthy and authentic points of reference.

2. The value of public-private partnership

Step change was built on a good working relationship with a charity (We Are What We Do - responsible for Historypin) and a commercial vendor (Axiell). The rationale behind their involvement was that, for Linked Data use to become widespread in libraries, archives and museums, it should be made available through the trusted suppliers upon which professionals have come to depend. Goodwill on both sides and in both cases enabled the team to overcome serious problems with enforced development staff absences. These challenges do point to a potential over-dependency on a relatively small number of experts able to combine knowledge of RDF technologies with knowledge of library, archive and museum data and practices.

The Axiell experience demonstrated, through the focus group and demo at the national CALM user group, and perhaps unexpectedly, that there is substantial interest from the archive community for Linked Data tools and understanding of their utility.

Next steps: Axiell is releasing the embedded ALiCat markup tool in CALM version 9.3 and has agreed to further iterations and improvements in future releases. Crucially, these will be timetabled in response to user feedback. Similar partnerships ought to be explored with other software suppliers such as Adlib, and a meeting is planned with the UK Adlib user community and representatives from Adlib with this in mind.

3. Technical limitations of APIs

Considerable staff time needed to be set aside for dealing with poor quality responses to queries and trying to fine-tune services. Service reliability is essential if Linked Data approaches are to work. A significant obstacle was local firewalls and authentication protocols, and persuading local IT to address these concerns. Change requests for an experimental Linked Data project involving archive catalogues were understandably deemed to be low priority. They also carried a cost implication that needs to be factored into budgets.

Next steps: the cost implications of technical implementation need to be quantified and documentation published to provide institutional IT with context to make informed technical decisions - and persuade managers to authorise expenditure.

4. Value of co-operation

Step change sought to build a number of professional relationships to help leverage goodwill and kickstart a more strategic appreciation of the types of datasets that ought to be output as RDF. So far, datasets have mainly been confined to the library and museum sectors and have been created in an ad hoc way by interested experts, rather than with end users in mind. Discussions were held with The National Archives with a view to using the National Register of Archives dataset as a prototype name authority service. This, and other heavily used TNA services such as the Manorial Documents Register, would prove particularly valuable to the types of local authority archives participating in Step change, with their focus on local history. Test data relating to women in the NRA was released via TNA Labs through Talis' Kasabi service. The withdrawal of support for that service at very short notice provides a salutary lesson that the availability of commercial services cannot be taken for granted. The National Archives is currently renewing its backend systems and will review the status of the NRA, MDR, Archon and other databases in due course.

Discussions were held with other interested parties, not least those representing geographical data. Testing is due to commence with historical placenames supplied as part of the JISC DEEP project concerning the English Place-Name Survey, relating to Cumbria, with a view to correctly locating and mapping catalogues.

As part of the CALM development work, a set of configuration instructions was published by Axiell to enable archivists to execute XSLT transforms and link to other services as they become available. The British Museum collections were identified as a good contender with which to test these instructions, on account of the high quality data they provide and the mutual benefit involved: local institutions are able to demonstrate a link back to a major national collection held in London, and the BM is able to demonstrate that museum objects of local significance are being accessed by local people in an intelligent 'Linked Datery' way (for example mapping archaeological finds in the collection and linking with local catalogues or historical society publications). Work on testing this approach is still ongoing and conclusions will be presented in a future post.
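For readers unfamiliar with the mechanism, applying an XSLT transform to exported catalogue XML is a one-liner in most toolkits. A minimal sketch with Python's lxml (the file names are illustrative, not Axiell's actual configuration):

from lxml import etree

# Load an exported EAD record and a stylesheet mapping it to the target service
record = etree.parse("ead_record.xml")                   # hypothetical export
transform = etree.XSLT(etree.parse("calm_to_linkeddata.xsl"))  # hypothetical stylesheet

result = transform(record)
print(str(result))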

Next steps: more cross-sectoral cooperation and scoping is required to think strategically about the kinds of datasets that different audiences need as Linked Data - archivists and different types of users: schools, the general public, genealogists, academics, researchers. Large national datasets that could benefit from unlocking include the Clergy of the Church of England Database, British History Online and the Victoria County History. Testing is due to begin with DEEP data and is ongoing with BM data.