Thursday 24 March 2011

Semantic Analysis of AIM25 EAD


Rory and I met with Richard Gartner and Gareth Knight at CeRch today, to catch up with their investigations into using GATE and OpenCalais to process the EAD outputs from AIM25.

Results look very encouraging. OpenCalais, in particular, returns a set of identified entities (personal names, place names, corporate names) once it has processed a document. Richard G has written regular expressions that locate these entities in the body of the EAD and wrap them in the appropriate EAD tags (<persname> etc).
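To illustrate the kind of wrapping step described above, here is a minimal sketch in Python. It is not Richard's actual code: the entity type labels, the EAD_TAGS mapping and the wrap_entities function are assumptions made for illustration, and the sketch assumes the input text contains no existing markup.

```python
import re

# Hypothetical mapping from an extraction service's entity types to EAD elements.
EAD_TAGS = {"Person": "persname", "Place": "geogname", "Organization": "corpname"}


def wrap_entities(text, entities):
    """Wrap each entity name found in plain `text` in its EAD element.

    `entities` maps an entity string to its type, e.g. {"Greenwich": "Place"}.
    """
    if not entities:
        return text
    # Try longer names first so a name containing another name wins the match.
    names = sorted(entities, key=len, reverse=True)
    # Only match whole names, not fragments inside longer words.
    pattern = re.compile(
        "|".join(r"(?<!\w)" + re.escape(name) + r"(?!\w)" for name in names)
    )

    def tag(match):
        name = match.group(0)
        element = EAD_TAGS[entities[name]]
        return f"<{element}>{name}</{element}>"

    return pattern.sub(tag, text)


if __name__ == "__main__":
    sample = "Letters from Charles Darwin to the National Maritime Museum, Greenwich."
    found = {
        "Charles Darwin": "Person",
        "National Maritime Museum": "Organization",
        "Greenwich": "Place",
    }
    print(wrap_entities(sample, found))
```

Sorting the names longest-first is a design choice to stop a short place name from being tagged inside a longer corporate name.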

This suggests that the way forward for enhancing the existing data entry processes for AIM25 will involve dispatching the EAD-compliant data entered by collections managers to OpenCalais, and returning the data, with enhanced markup, for checking by the submitter. This hook should be easy enough to insert for manual, form-based entry; for batch entry processes we will need to assess whether any significant delays are introduced.

We've also started to consider ideas for a URI scheme for the entities identified. Our current working hypothesis is that this will involve defining a "data" namespace for AIM25, binding to http://data.aim25.ac.uk/. Within that we can develop a structure along the lines /person, /place, /corporate_body, and append our unique IDs for each entity. Further research is necessary, particularly into the Cabinet Office recommendations for Designing URI Sets for the UK Public Sector.
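As a rough sketch of how such a scheme might be applied, the Python below mints a URI from an entity type and a local ID. The base address and path segments come from the working hypothesis above; the mint_uri function and the example ID are hypothetical, and the final scheme may well differ.

```python
from urllib.parse import quote

# Base namespace and path segments from the working hypothesis; subject to change.
BASE = "http://data.aim25.ac.uk"
PATHS = {"person", "place", "corporate_body"}


def mint_uri(entity_type, entity_id):
    """Build a URI of the form BASE/<entity_type>/<entity_id>."""
    if entity_type not in PATHS:
        raise ValueError(f"Unknown entity type: {entity_type}")
    # Percent-encode the ID so unusual characters cannot break the path.
    return f"{BASE}/{entity_type}/{quote(str(entity_id), safe='')}"


if __name__ == "__main__":
    # 1234 is an invented ID used purely for illustration.
    print(mint_uri("person", 1234))
```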

These URIs can then be used in identifier attributes for our EAD elements (<persname>, etc.), and thence easily transformed into an RDFa format for the Web-based HTML rendering of the AIM25 catalogues.
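A small sketch of that transformation, using Python's standard XML library rather than whatever stylesheet we eventually adopt. Several things here are assumptions: that the URI is carried in the EAD authfilenumber attribute, that the foaf, geo and rdfs prefixes are declared elsewhere in the HTML page, and that these vocabulary terms are the ones we settle on. The ead_to_rdfa function and the example URI are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Illustrative choice of RDF types per EAD element; not an agreed profile.
RDFA_TYPES = {
    "persname": "foaf:Person",
    "corpname": "foaf:Organization",
    "geogname": "geo:SpatialThing",
}


def ead_to_rdfa(ead_fragment):
    """Turn tagged EAD entities that carry a URI into RDFa-annotated spans."""
    root = ET.fromstring(ead_fragment)
    for element in root.iter():
        rdf_type = RDFA_TYPES.get(element.tag)
        uri = element.get("authfilenumber")  # assumed home of the entity URI
        if rdf_type is None or uri is None:
            continue  # leave anything we cannot identify untouched
        element.tag = "span"
        element.attrib.clear()
        element.set("about", uri)
        element.set("typeof", rdf_type)
        element.set("property", "rdfs:label")
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    fragment = (
        '<p>Papers of <persname authfilenumber="http://data.aim25.ac.uk/person/1234">'
        "Charles Darwin</persname>.</p>"
    )
    print(ead_to_rdfa(fragment))
```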

Next steps include further investigating how to implement and assert relationships between our entities and other open datasets (e.g. our_entity is_the_same_as your_entity), and how to make the authority data, duly marked-up, available as open metadata.
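One common way to express an "is the same as" relationship is the owl:sameAs property. The sketch below writes such a statement as an N-Triples line. Using owl:sameAs is an assumption on our part (skos:exactMatch is a frequently used, weaker alternative), and the same_as_triple function and both example URIs are illustrative only.

```python
# The OWL property URI is standard; the choice to use it here is an assumption.
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def same_as_triple(our_uri, their_uri):
    """Return one N-Triples line asserting that two URIs name the same thing."""
    return f"<{our_uri}> <{OWL_SAME_AS}> <{their_uri}> ."


if __name__ == "__main__":
    print(
        same_as_triple(
            "http://data.aim25.ac.uk/person/1234",  # invented AIM25 identifier
            "http://dbpedia.org/resource/Charles_Darwin",
        )
    )
```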

Rory and I can now start to consider suitable approaches to embedding this in our development copy of the existing AIM25 system, and we'll continue to liaise closely with CeRch for advice on the relative merits of GATE and OpenCalais processing, and guidance on URI implementation.

Wednesday 23 March 2011

The challenge of adding new records

Five new institutions were being recruited as part of the project to ensure clean data and a 'level playing field' to test the value of Linked Data. These include the National Maritime Museum, Zoological Society and British Postal Museum. The fragility of archive institutions in the current economic climate has been highlighted by news of the reorganisation of Hammersmith and Fulham Archives Service, an OMP partner, following local authority budget cuts. It is still hoped that their records can be included at a later stage, not least because it will enhance their public profile via internet searching, and thereby encourage more active use of collections by researchers, but in the meantime the Royal Botanic Gardens, Kew, have been recruited as a substitute.

The project also highlights the challenge of adding value to archive services through Linked Data or similar projects in the midst of reorganisation and the roll-out of other projects. The National Maritime Museum Archives, for example, are entering a period of public closure prior to the opening of new reading rooms and other public areas as part of a major investment. New catalogues and an archives/library management system are also being installed and tested in 2011. Participation in OMP appealed to the NMM, providing an opportunity to bring rich Linked Data to new audiences as part of a larger, more public, initiative. It also represents a challenge, not least for NMM staff, who are being asked to prioritise the records that we are seeking to include in the project (a mix of items that are heavily used by the public, for which they receive many written enquiries, or which are underused and for which they hope to improve access), and to prepare the EAD in a flavour that can readily be imported into AIM25.

I have visited the Zoo, the British Postal Museum and Wandsworth Heritage Services to examine their systems and records. The latter two use CALM, with which AIM25 is familiar, but the Zoo uses a library system, EOSi, requiring import of records in MARC21. A recurring theme of the OMP and other projects is that for new archive IT projects to be rolled out successfully the active support of busy institutional IT services is often indispensable: to set up export tools, develop database tools and amend websites. This places a brake on project delivery times as, understandably, bespoke work on archive databases might come a long way down an IT service's priority list. It also reflects the need for archivists to understand what databases can do and how data can easily be shared, not least in order to communicate effectively with IT helpdesks and keep senior management on board. This is knowledge that is usually acquired through hands-on experience and trial and error, which in turn highlights the value of the many informal support networks among archivists who turn to each other for advice and guidance on how to make data work harder.

Monday 7 March 2011

National Archives Discovery Event

There were several interesting presentations on Linked Data at an event hosted by The National Archives on behalf of the National Archives Discovery Network on 2 March. The Network is a forum for aggregation services such as AIM25, SCAN, Genesis and the Hub, mapping services such as Vision of Britain, and major institutions such as the British Library, along with other information specialists. The event attracted more than 100 delegates. Presentations included keynote addresses by Richard Wallis of Talis and John Sheridan of the National Archives; a review of the state of play with EAC by Bill Stockting; and reviews of Linked Data projects carried out on Government data and by the BBC, along with the Hub's LOCAH project.

Other talks included reviews of progress on History Pin, the new Google-led initiative which embeds archive digital content in Google Streetview; updates on recent crowdsourcing projects such as the Bentham initiative; and news on cataloguing software including ICA AToM.

Slideshare reviews of the presentations will be available shortly.  

Project Overview for Programme Startup Meeting

What content and metadata are you working with?

Archival catalogue data; ISAD(G)/EAD; Collection level descriptions; Authority files (People, Organisations, Places, Subjects)


How will this data be made available?


Once the Linked Data research team (CeRch) has established appropriate ontologies, schemas and URI schemes, the Open Metadata will be published in SKOS format (much as UKAT currently is).


What are your use cases for the data?


The use of linked data ontologies within the AIM25 system will provide many opportunities to associate AIM25 records automatically and intelligently with other information resources; and it will allow other information resources to locate and link to archives information in AIM25, enhancing discovery, and supporting the aggregation of AIM25 data into dynamic searches and aggregators across the sector.


The archival authority files that will be published as open data contain a wealth of information of interest and value that could be reused in many ways, in other archive and library systems, as well as in historical, biographical and genealogical contexts. The data could also be extended and enriched in the course of its reuse, and the derived datasets in turn be available for reuse in AIM25.


What benefits to your institution and the sector do you anticipate?

  • Improved discovery/discoverability
  • Improved linking and interoperability with other web resources
  • Improved take-up of authoritative archives data and metadata
  • Assessment of added value of linked/semantic data to online archives and cataloguing

Technical approaches / challenges
  • Agreeing and defining ontologies, schemata
  • Implementing effective tools in short timescale
  • Implementing the FLISM popup menu interface

Sunday 6 March 2011

The Project Plan

Aims, objectives and final outputs of the project


The Open Metadata Pathway or Pathfinder project will deliver a robustly validated demonstrator of the effectiveness of opening up archival catalogues to widened automated linking and discovery through embedding RDFa metadata in Archives in the M25 area (AIM25) collection level catalogue descriptions. It will also implement as part of the AIM25 system the automated publishing of the system's high quality authority metadata as open datasets. The project will include an assessment of the effectiveness of automated semantic data extraction through natural language processing tools (using GATE) and measure the effectiveness of the approach through statistical analysis and review by key stakeholders (users and archivists). All outputs of the project will be integrated into AIM25 resources and workflows, ensuring the sustainability of the benefits to the community.


Summary objectives



Standards-based cataloguing with thesaurus support is both time-consuming and constrained by subjective and contemporary views about subject choice and relevance. Use of automated semantic metadata extraction through natural language processing tools and Linked Data offers the possibility of upgraded harvesting and wider and more effective subject searching.

The project will deliver a robustly validated pilot embedding RDFa metadata in AIM25 archival collection catalogues, opening up archival catalogues to widened automated linking and discovery. This will include creation of metadata profiles and URI schemes and an assessment of the effectiveness of automated semantic metadata extraction through natural language processing tools (using GATE). The outputs of the project will be integrated into the AIM25 resources and workflows, ensuring that AIM25 content continues to be available in linked data form.

A large amount of accumulated authority metadata (subject terms, personal and place names, geographical names) exists in AIM25 SQL database tables and is already normalised in appropriate standard forms (e.g. NCA Rules). This is used to provide search and access points to the collection records. The project will reimplement these rich metadata resources as embedded RDFa within the online catalogues, and ensure the resulting datasets are openly available for reuse under appropriate open licensing tools (e.g. ODL, GPL, Creative Commons) – in consultation with the community and the Programme Manager.

For the benefit of the data creators, workflow and input systems will be revised to support new metadata creation techniques, including authority-based ones. For the benefit of the end user, these key terms in the catalogues will be implemented as clickable hotspots, offering context-specific linking and searching to other systems. Existing features and functionality will not be compromised.

The pilot system will be used to demonstrate and evaluate the effectiveness of reimplementing existing search tools and entry points within the system using SPARQL, as well as creating an API enabling external services to use the same retrieval tools.

Dual input will be undertaken of over 1,140 entries from CALM and AdLib from six partner institutions, including editorial confirmation of ISAD(G) compliance and creation of UKAT and NCA Name Authority files.

Results and outputs will be evaluated at key milestones by a representative panel of archivists from AIM25 members and users to assess the usability and accessibility.

Outputs
·     A working model of an enhanced AIM25 web application, for demonstration and evaluation purposes, to include SPARQL APIs; reimplementation of existing end-user tools for searches, views and queries using RDF query tools and AJAX.
·     AIM25 authority metadata in linked data format, published with an Open Metadata licence, including SKOS implementation of the AIM25 thesaurus data;
·     An AIM25 data profile based on the public schemas and ontologies identified for each domain (e.g. DBpedia) and a URI scheme for entities in the AIM25 namespace;
·     Reimplementation of existing AIM25 data creation tools to include RDFa creation, assisted by natural language processing of catalogues via the GATE service;
·     A published report detailing ongoing and summative evaluation of the techniques used and final outputs;
·     Dissemination activities for the AIM25 partnership and the wider archives and access and discovery communities;
·     Optimised user searching of AIM25.

Wider Benefits to Sector & Achievements for Host Institution


Among the contributions the project will make to the sector and host institution are:
·     Make open metadata about archives held in libraries, museums and archive repositories available through the delivery of an open, running pilot system demonstrating an enhanced version of the AIM25 system featuring embedded RDFa, a SPARQL-based query engine and SPARQL endpoint API. The records in the pilot system will number 1140 (increasing to 16,140 when the project outputs are implemented for the live AIM25 system).
·     Make the rich, validated and reviewed authority datasets of AIM25 available in an open format, under open licensing terms, for reuse by the archival and wider community. These include tens of thousands of entries including thesaurus terms (UKAT based, with local and MeSH additions), and personal, place and corporate names structured to NCA rules.
·     Deliver a detailed account of the process and outcomes of creating and implementing linked data profiles for ISAD(G)/EAD based archival metadata and offer a clear articulation of how established descriptions and authority metadata standards may be delivered and maintained as open metadata.
·     Provide a coherent analysis and examples for the archival and wider access and discovery community of the value, effectiveness and potential of the approach to delivery using RDF, in terms of widening access and deepening use, and provide an opportunity to learn how the approach optimises the use of archival staff time.
·     Produce knowledge and practice that enhances and optimises AIM25, including a working model, and which may be of benefit to other institutions holding archives.
·     Deliver optimised user searching tools and techniques for use in the AIM25 system that AIM25 will commit to implementing in its live system as soon as possible after the completion of the project. (A full, live launch across AIM25 has been excluded from the project scope owing to the limited timescale available.)
Risk analysis
Risk: Difficulty in recruiting and retaining staff
Probability: 1; Severity: 3; Score: 3
Action to prevent / manage risk: Most staff are already employed by partners and this time will be bought out. The project will also distribute knowledge throughout the project to limit the effects if a staff member leaves. Given the short duration of the project, gaps will be filled by the use of agency staff or internal secondments.

Risk: New partners are unable to supply numbers of descriptions
Probability: 2; Severity: 2; Score: 4
Action to prevent / manage risk: Utilise new accession material from existing partners. Fall back on existing data.

Risk: A complete testbed and evaluation cannot be implemented within the time frame
Probability: 2; Severity: 2; Score: 4
Action to prevent / manage risk: Project management team will closely monitor progress of objectives and outputs. If necessary, with the agreement of the Programme Manager, some activities can be re-scoped to ensure effective outputs are achieved.

Risk: Failure to meet project milestones
Probability: 2; Severity: 3; Score: 6
Action to prevent / manage risk: Produce project plan with clear objectives. Continuous project assessment and close communication between project manager, technical leads, and JISC programme manager to ensure targets are realistic, achievable and focused on project goals.
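The scores in the risk analysis are consistent with the usual convention of multiplying probability by severity. The plan does not state the rule explicitly, so the short Python check below treats it as an assumption; the variable and function names are illustrative.

```python
# (risk, probability, severity) as listed in the risk analysis.
RISKS = [
    ("Difficulty in recruiting and retaining staff", 1, 3),
    ("New partners are unable to supply numbers of descriptions", 2, 2),
    ("A complete testbed and evaluation cannot be implemented within the time frame", 2, 2),
    ("Failure to meet project milestones", 2, 3),
]


def risk_score(probability, severity):
    """Assumed scoring rule: score = probability x severity."""
    return probability * severity


scores = [risk_score(p, s) for _, p, s in RISKS]

if __name__ == "__main__":
    for (name, _, _), score in zip(RISKS, scores):
        print(f"{score}  {name}")
```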
IPR

IPR in all reports and other documents produced by the project will be retained jointly by King’s College London and ULCC but made freely available on a non-exclusive licence as required/advised by JISC. All software and data created during the project will be made available to the community on an open licence. We will respect the licence model of all third-party software used during the project, most of which is made available under open source licences.
Project team relationships and end user engagement

The project will be overseen by a board comprising: Patricia Methven, Director of AIM25 (Chair); Kevin Ashley (Director, Digital Curation Centre); Mark Hedges (Deputy Director, CeRch); Geoffrey Browell (Senior Archivist, King’s College Archives Services); Richard Davis (ULCC Digital Archives); and five nominated members of AIM25 reflecting new and existing partners. Input from other leading figures from JISC digital archives projects will be invited. The project will be managed by Geoff Browell with specialist and technical support from Richard Davis and Gareth Knight. Project staff will be ex officio members.
End-user Engagement
The project will establish a project blog to record progress and invite comment. The project team will work proactively with other RDTF activities and projects, including LOCAH and CHALICE, to identify synergistic goals and approaches. We will also work with the Open/Linked Data and Semantic Web communities to ensure the maximum dissemination opportunities for outputs, and for developing the new AIM25 API. Services such as LinkedData.org and PTWS.com will be used to publicise the availability of the data. Project outputs will be made available on the project website. Dissemination to the wider archival, museum and library communities will be offered through the professional conferences and press of ARA, CILIP, RLUK, SCONUL and the Museums Association. Websites such as Culture24 and the Museums, Libraries and Archives Council will also be notified. A regional dissemination event will be hosted by the AIM25 partnership in addition to hosted JISC events.


Timeline, workplan and methodology

Work package 1: Project management
This covers management activity throughout the project. It will assemble the project team; prepare the detailed project plan; establish the steering group; and agree the configuration of the project testbed. Cross-institutional, cross-partnership involvement will require close liaison between all partners, including existing AIM25 partners. There will be monthly meetings; at least four focus groups, two from each of the user and archival communities, to undertake the evaluation; and ad hoc communication. Deliverables: Detailed project plan; progress and risk assessment reports; project and focus group meetings; exit and sustainability plan; ongoing coordination; liaison with JISC programme manager. Led by King’s College London Archives.

Work package 2: Testbed record selection and creation
Import into the existing AIM25 system of 1140 new ISAD(G)-compliant collection (fonds) descriptions directly from proprietary software, CALM or AdLib as appropriate, through an established automated ingest protocol developed in association with the Archives Hub. The entries will cover the full archival holdings of the National Maritime Museum (700 collections), and the most significant records of the Zoological Society of Great Britain (100) and the British Postal Museum (100). Those for the London Boroughs of Hammersmith and Fulham (100 collections representing 8% of collection level descriptions) and Wandsworth (100 collections representing 80% of their collection level descriptions) are de facto the collections regarded by custodians as being of significant wider interest and those which have been prioritised for cataloguing (the Borough percentages do not reflect physical extent). An additional 40 new descriptions will be added by King’s College London representing accessions for 2009/10, 2.5% of the full total for King’s already available on AIM25. Name authority and subject terms will be added for these entries in the normal way through experienced externally contracted staff. Collections are defined according to their provenance and range from one to a thousand boxes. Deliverables: Creation and configuration of collection descriptions for testbed content. Led by King’s College London Archives.
Work package 3: Metadata profiling and processing
Analysis of testbed materials to define metadata requirements. This will include a review of relevant and recent outputs in the field, such as LOCAH and CHALICE. To draw out the rich seams of information in the narrative texts of the ISAD(G) descriptions (including personal and corporate names, place names and dates) the project will use GATE (General Architecture for Text Engineering) – a Java-based natural language processor developed by the University of Sheffield – to parse unstructured content and identify key entities. The outputs of GATE processing will be evaluated in conjunction with existing authority records in AIM25. A URI scheme will be defined to enable the resulting metadata to be published as open data. Entities will be tagged and identified with a URI and marked-up text will be exported to EAD. Deliverables: Creation of an RDF enriched corpus; creation of a URI scheme utilising GATE outputs and existing authority records within AIM25; creation of style sheets to transform GATE outputs to EAD; definition of requirements for RDF triple store. Led by King’s College London CeRch.

Work package 4: Implementation
Implementation of WP3 recommendations within a copy of the current AIM25 system. This will include an RDF triple store, re-implementation of existing search tools and entry points within the system using SPARQL, and creation of an API enabling external services to use the same retrieval tools. Implementation of a tool to support highlighting of key terms in cataloguing as clickable hotspots, offering content-specific linking and searching to other systems with Web APIs likely to be of use to researchers/end-users. Convert AIM25 authority records to RDF and publish as open metadata. Define and implement enhanced AIM25 browsing interface, data entry interface and APIs. Deliverables: Working model of RDFa-enhanced AIM25 system (including end-user and data-entry enhancements); tools to create and publish open metadata; published open metadata and exemplar for evaluation. Led by ULCC.

Work package 5: Evaluation
Evaluation of outputs of WP3 and WP4 using statistical assessment, web analytics and structured survey techniques. Conduct of two focus groups with archivists, new and existing AIM25 partners, and two drawn from academic users from a variety of disciplines to compare existing AIM25 and open metadata AIM25 searches. Deliverables: Definition of evaluation approach; statistical user and community evaluation of approach to open metadata, GATE processing and enhancements. Led by King’s College London Archives Services.

Work package 6: Dissemination
The project will establish a project blog to record progress and invite comment. The project team will work proactively with other RDTF activities and projects, including LOCAH and CHALICE, to identify synergistic goals and approaches. We will also work with the Open/Linked Data and Semantic Web communities to ensure the maximum dissemination opportunities for outputs, and for developing the new AIM25 API. Services such as LinkedData.org and PTWS.com will be used to publicise the availability of the data. Project outputs will be made available on the project website. Dissemination to the wider archival, museum and library communities will be offered through the professional conferences and press of ARA, CILIP, RLUK, SCONUL and the Museums Association. Websites such as Culture24 and the Museums, Libraries and Archives Council will also be notified. A regional dissemination event will be hosted by the AIM25 partnership in addition to hosted JISC events.

2011    Feb    Mar    Apr    May    Jun    July
WP1     X      X      X      X      X      X
WP2     X      X      X      -      -      -
WP3     X      X      X      -      -      -
WP4     -      -      X      X      X      X
WP5     -      -      -      -      X      X
WP6     X      -      -      X      X      -

Budget
Directly Incurred

Staff                                     August 10 - July 11    August 11 - July 12    TOTAL £
Grade 6, 10 days & 9% FTE                 £2340.80               £                      £2340.80
Grade 6, 27 days & 24.5% FTE              £5526.90               £                      £5526.90
Grade 8, point 46, 8 days, 7% FTE         £2616.00               £                      £2616.00
Grade 7, point 43, 29 days, 26% FTE       £9483                  £                      £9483
Indexer A, Grade 2, 6 months, 35% FTE     £3545.30               £                      £3545.30
Indexer B, Grade 2, 6 months, 35% FTE     £3545.30               £                      £3545.30
External Contractor                       £2679.42               £                      £2679.42
Total Directly Incurred Staff (A)         £29736.72              £                      £29736.72




Non-Staff                                 August 10 - July 11    August 11 - July 12    TOTAL £
Travel and expenses                       £800                   £                      £800
Hardware/software                         £1000                  £                      £1000
Dissemination                             £800                   £                      £800
Evaluation                                £400                   £                      £400
Other                                     £1000                  £                      £1000
Total Directly Incurred Non-Staff (B)     £4,000                 £                      £4,000




Directly Incurred Total (C) (A+B=C)       £33,736.72             £                      £33,736.72





Directly Allocated                             August 10 - July 11    August 11 - July 12    TOTAL £
Staff Grade 7, point 38, 6 months, 20% FTE     £5317.33               £                      £5317.33
Estates                                        £4956.00               £                      £4956.00
Other                                          £                      £                      £
Directly Allocated Total (D)                   £10273.33              £                      £10273.33




Indirect Costs (E)                        £30,734.68             £                      £30,734.68

Total Project Cost (C+D+E)                £74,744.73             £                      £74,744.73
Amount Requested from JISC                £40000                 £                      £40000
Institutional Contributions               £34744.73              £                      £34744.73




Percentage Contributions over the life of the project
JISC        54%
Partners    46%
Total       100%
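
As a cross-check on the budget arithmetic, the figures above can be re-added with a few lines of Python. This is only a sketch: the variable names and the percent helper are illustrative, and rounding percentages half-up to whole numbers is an assumption about how the contribution split was derived.

```python
from decimal import Decimal, ROUND_HALF_UP

# Figures copied from the budget tables above.
staff = [Decimal(x) for x in
         ("2340.80", "5526.90", "2616.00", "9483", "3545.30", "3545.30", "2679.42")]
non_staff = [Decimal(x) for x in ("800", "1000", "800", "400", "1000")]
directly_allocated = [Decimal("5317.33"), Decimal("4956.00")]
indirect = Decimal("30734.68")
jisc_request = Decimal("40000")

directly_incurred = sum(staff) + sum(non_staff)                      # (C) = (A) + (B)
total_cost = directly_incurred + sum(directly_allocated) + indirect  # (C) + (D) + (E)
institutional = total_cost - jisc_request


def percent(part, whole):
    """Share of `whole` as a whole-number percentage, rounded half-up."""
    return (part / whole * 100).quantize(Decimal("1"), rounding=ROUND_HALF_UP)


if __name__ == "__main__":
    print("Directly incurred (C):", directly_incurred)
    print("Total project cost:", total_cost)
    print("Institutional contribution:", institutional)
    print("JISC share (%):", percent(jisc_request, total_cost))
    print("Partner share (%):", percent(institutional, total_cost))
```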