The National Archives (UK)

mentionthewar · March 11, 2013

Possible to get a translator to work on the main catalogue (Discovery) for this?

http://discovery.nationalarchives.gov.uk/SearchUI/

An example item would be:
http://discovery.nationalarchives.gov.uk/SearchUI/Details?uri=C9156576

I can see there might be a problem deciding which fields to map to what. Happy to chat about it.

Jo Pugh

adamsmith · March 11, 2013

as for fields:
Reference --> Loc. in Archive
Description --> Title
Date --> Date
Held by --> Archive
Legal status --> Rights
Access conditions --> ??? (maybe add to rights?)
Publication note: --> note (or abstract?)
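
A minimal sketch of the proposed mapping as a lookup table, assuming a catalogue record is available as a plain dict keyed by the labels shown in the Search UI. The names `FIELD_MAP` and `map_record` and the sample values are hypothetical and only illustrate the idea; this is not the translator's actual code.

```python
# Proposed Discovery label -> Zotero field, as listed in this post.
# None marks a field whose target is still undecided.
FIELD_MAP = {
    "Reference": "Loc. in Archive",
    "Description": "Title",
    "Date": "Date",
    "Held by": "Archive",
    "Legal status": "Rights",
    "Access conditions": None,   # maybe add to Rights? undecided
    "Publication note": "Note",  # or Abstract? undecided
}


def map_record(record):
    """Split a record into mapped Zotero fields and leftover fields."""
    mapped, unmapped = {}, {}
    for field, value in record.items():
        target = FIELD_MAP.get(field)
        if target is None:
            unmapped[field] = value
        else:
            mapped[target] = value
    return mapped, unmapped


# Invented sample values, for illustration only.
example = {"Reference": "INF 3/108", "Access conditions": "Open"}
mapped, unmapped = map_record(example)
print(mapped)    # fields with an agreed target
print(unmapped)  # fields still awaiting a decision
```

Keeping undecided fields in a separate dict makes it easy to see what is still unmapped while the discussion continues.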

Won't happen immediately but seems well worthwhile doing.

mentionthewar · March 12, 2013

That would be amazing! It's possible that the underlying fields in the XML might be a better guide to the matching than the visible fields

http://discovery.nationalarchives.gov.uk/DiscoveryAPI/xml/informationasset/C3454320

compared to:

http://discovery.nationalarchives.gov.uk/SearchUI/Details?uri=C3454320

should work (I think). I can post the XML if it doesn't.

adamsmith · March 12, 2013

working with XML would be much better in general, yes - unfortunately their API is (currently?) restricted http://discovery.nationalarchives.gov.uk/SearchUI/api.htm
so I can't see that XML, nor can we use it for Zotero, which really is too bad.
If you think the XML provides better guidance for field mapping, you can post it to a public gist at gist.github.com (where it'll be easier to read than here) and link to it.
Any comments on the mapping above?

adamsmith · March 12, 2013

actually - never mind. I do get the xml, it's just not properly displayed in FF.

mentionthewar · March 19, 2013

It seems there might be a problem with the character set (certainly the fact that it is genuinely utf-16 seems a little implausible to me!).

Let me explain how it works. There are 7 levels to Discovery. Level 1 is a department like INF - the Ministry of Information. Level 6 is where a piece of art like INF 3/108 is sitting and level 7 would be an item level description - like COPY 1/400/23 (which is a photo in a box where the box is COPY 1/400).

In Discovery references must be formed by working up the <ParentIAID> chain and gluing together the <Reference>s.

The tricky thing is that we don't want all of them. We are only interested in levels 1 ("INF"), 3 ("3") and 6 ("108"). So although we need to follow the chain of <ParentIAID>s upwards, we need to throw away levels 2, 4 and 5. These tell us archivally useful things about the object but they are not part of forming the reference and if we include their <Reference>s, the value we generate will be a nonsense and generate an error when we come to look it up.

All clear as mud so far?

If we follow this rule there may be some issues since this is not completely uniform but there is probably a list of the rulebreakers which could be provided.

Does that make sense?

Jo

mentionthewar · March 19, 2013

<SourceLevelId> tells you what level you are at. Probably should have mentioned that.
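
The reference-building rule described above can be sketched roughly as follows. This assumes the records have already been fetched into a dict keyed by IAID, each with `ParentIAID`, `Reference` and `SourceLevelId` values. The function name, the invented IAIDs and the joining format (department, a space, then the remaining parts separated by slashes) are assumptions inferred from the examples INF 3/108 and COPY 1/400/23, not documented API behaviour.

```python
# Levels whose <Reference> contributes to the citable reference
# (1 = department, 3 = series, 6 = piece, 7 = item).
KEPT_LEVELS = {1, 3, 6, 7}


def build_reference(iaid, records):
    """Walk up the ParentIAID chain and glue the kept References together."""
    parts = []
    current = iaid
    while current is not None:
        record = records[current]
        if record["SourceLevelId"] in KEPT_LEVELS:
            parts.append(record["Reference"])
        current = record.get("ParentIAID")
    parts.reverse()  # collected bottom-up, wanted top-down
    if not parts:
        return ""
    department, rest = parts[0], parts[1:]
    return department if not rest else department + " " + "/".join(rest)


# Invented IAIDs and placeholder values for levels 2, 4 and 5.
records = {
    "A1": {"SourceLevelId": 1, "Reference": "INF", "ParentIAID": None},
    "A2": {"SourceLevelId": 2, "Reference": "ignored-2", "ParentIAID": "A1"},
    "A3": {"SourceLevelId": 3, "Reference": "3", "ParentIAID": "A2"},
    "A4": {"SourceLevelId": 4, "Reference": "ignored-4", "ParentIAID": "A3"},
    "A5": {"SourceLevelId": 5, "Reference": "ignored-5", "ParentIAID": "A4"},
    "A6": {"SourceLevelId": 6, "Reference": "108", "ParentIAID": "A5"},
}
print(build_reference("A6", records))
```

In practice each step up the chain would be a separate API request, which is the main cost of this approach.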

amme2 · March 19, 2013

Yes, as an archivist that kind of abstraction is familiar from elsewhere. To clarify, are you saying the Discovery Reference (in the Search UI, not the XML <Reference> tag, hmm this is confusing!) is nearly always formed from a combination of Levels 1 + (optionally, depending on what hierarchical level you wish to refer to) 3 + 6 + 7?

But in any case the rulebreakers must be somewhere coded into the Discovery API already else how does the Search UI itself know how to form a proper Reference?

adamsmith · March 19, 2013

thanks. That explanation makes perfect sense, but it'll be much easier to just scrape the info from the Reference field in the human readable discovery entry if we have to make multiple requests anyway.

Moving on -
- what about "Rights" - there were some questions about that on twitter

- I see that some items have a title field and that the "Description" field is quite long. Here's my current thinking on this:
If there is a title: Title --> Title, Description --> Abstract
If there isn't: first line of Description --> Title, whole Description --> Abstract

- To get a full mapping of field --> Zotero, I've pasted an XML entry here: http://titanpad.com/6QNLqhVl6l
please add the respective Zotero fields (maybe in bold?) as you see fit.
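
The title/abstract rule proposed in this post could look something like the sketch below. The function name is hypothetical, and it assumes the Title and Description values have already been extracted as plain strings.

```python
def title_and_abstract(title, description):
    """Return (title, abstract) following the fallback rule above."""
    description = (description or "").strip()
    if title and title.strip():
        # A real title exists: use it, and keep the description as abstract.
        return title.strip(), description
    # No title: the first line of the description stands in for it.
    first_line = description.splitlines()[0] if description else ""
    return first_line, description


print(title_and_abstract("", "First line of text\nSecond line of text"))
```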

amme2 · March 19, 2013

Zotero's author field. I've had a look at how this is handled from NARA's ARC catalog - appears to pick up the department ie in Discovery's terms <Title> where <SourceLevelID> = 1, or the top of the 'hierarchy' box in the SearchUI.

More recent descriptions I notice have a separate <Creator> tag though...

amme2 · March 19, 2013

URL - I presume the SearchUI uri makes most sense here? Or would you expect this to point directly at a digital/digitised resource, if there was one?

To really complicate things: my initial catalyst for wishing Discovery could talk to Zotero was wanting to cite a specific dated instance of this http://discovery.nationalarchives.gov.uk/SearchUI/Details?uri=C18648 from the UK Government web archive...

milh0use · March 19, 2013

Quoting amme2
"are you saying the Discovery Reference [...] is nearly always formed from a combination of Levels 1 + (optionally, depending on what hierarchical level you wish to refer to) 3 + 6 + 7? "

Yes that is correct. I suggested to mentionthewar that there might be exceptions, but now that I think about it, I don't think there are. There are (legacy?) divisions called "subclasses" where at either level 3 or 6 ("series" and "piece") there are Reference fields which contain a forward slash character, but if you treat them as strings and concatenate them, they should still work. The problem really occurs in the other direction where you want to convert a Citable Reference into an IAID. The fact that the slashes are not necessarily delimiting references at different catalogue hierarchy levels makes this task virtually impossible to do reliably.

For example, "PRO 30/89/40" looks like department "PRO", series "PRO 30", piece "PRO 30/89", item "PRO 30/89/40". It is actually series "PRO 30/89", and when you look up that IA in Discovery you will see that the Reference field says "30/89". Painful when the "/" is usually used to delimit the different hierarchy levels.

I don't think there is a simple rule that allows you to work out which bits of the Citable Reference are the series, piece and item references. The number of slashes can vary even within a piece :(
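
To illustrate why the reverse direction is ambiguous, here is a small sketch that lists every way a Citable Reference could be divided into series, piece and item if slashes may also occur inside a single level. The function name is hypothetical and the code only enumerates readings; it makes no attempt to pick the right one.

```python
def candidate_splits(citable_reference):
    """List every (department, series, piece, item) reading of a reference."""
    department, _, rest = citable_reference.partition(" ")
    tokens = rest.split("/")
    candidates = []
    for i in range(1, len(tokens) + 1):       # series = tokens[:i]
        for j in range(i, len(tokens) + 1):   # piece = tokens[i:j]
            series = "/".join(tokens[:i])
            piece = "/".join(tokens[i:j])
            item = "/".join(tokens[j:])
            if item and not piece:
                continue  # an item cannot exist without a piece
            candidates.append((department, series, piece, item))
    return candidates


for candidate in candidate_splits("PRO 30/89/40"):
    print(candidate)
```

For "PRO 30/89/40" this yields four readings, and the one with series "30/89" is not distinguishable from the others by the string alone.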

adamsmith · March 19, 2013

URLs are for links to full-texts, so that will stay empty. We'll attach the SearchUI as a Snapshot which also saves the URL.

Author would only rarely be filled: authors are cited, so the field can't hold anyone who isn't the actual creator of the content. But where we have:
<CreatorName>
  <CreatorName>
    <Corporate_Body_Name_Text>The National Archives</Corporate_Body_Name_Text>
    <Corporate_Body_Date_Start>2003</Corporate_Body_Date_Start>
    <Corporate_Body_Date_End>0</Corporate_Body_Date_End>
    <Birth_Date>0</Birth_Date>
    <Death_Date>0</Death_Date>
  </CreatorName>
</CreatorName>

we'll use that.
Is the XML format documented somewhere? I'd like to see how exactly the creator entries work.
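
As a rough illustration, the corporate creator in a fragment like the one above could be pulled out with Python's standard `xml.etree.ElementTree`. This is only a sketch: the function name is hypothetical, the fragment is treated as a standalone document, and it assumes no XML namespaces, which may not hold for the full API response.

```python
import xml.etree.ElementTree as ET

# The fragment quoted in this post, treated as a standalone document.
SAMPLE = """<CreatorName>
  <CreatorName>
    <Corporate_Body_Name_Text>The National Archives</Corporate_Body_Name_Text>
    <Corporate_Body_Date_Start>2003</Corporate_Body_Date_Start>
    <Corporate_Body_Date_End>0</Corporate_Body_Date_End>
    <Birth_Date>0</Birth_Date>
    <Death_Date>0</Death_Date>
  </CreatorName>
</CreatorName>"""


def corporate_creators(xml_text):
    """Return the non-empty Corporate_Body_Name_Text values in the XML."""
    root = ET.fromstring(xml_text)
    names = []
    for element in root.iter("Corporate_Body_Name_Text"):
        text = (element.text or "").strip()
        if text:
            names.append(text)
    return names


print(corporate_creators(SAMPLE))
```

How personal (non-corporate) creators are encoded is not shown in the fragment, so they are left out here.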

mentionthewar · March 19, 2013

Fair enough!

Rights - I notice this is left blank in just about everything I have in Zotero unless it's explicitly CC. So let's leave that can of worms alone and just leave it blank here too.

Title - When things don't have a title, the reference should be the title, I reckon.

An example of what citation output is supposed to look like can be found at www.nationalarchives.gov.uk/records/citing-documents.htm

There you can see that citations are either supposed to look like:

The National Archives (TNA): Public Record Office (PRO) INF 3/140

OR:

The National Archives (TNA): Public Record Office (PRO) C 139 Chancery: Inquisitions Post Mortem, Series I, Henry VI

(But the latter is a citation of a scoundrel, since it refers to 181 different files.)

But let me have a go with the Titanpad

adamsmith · April 29, 2013

I've now added a translator for the National Archives of the UK

Your version of Zotero will automatically update within 24 hours, or you can update manually using the "Update Now" button in the "General" tab of the Zotero preferences.

This should work for search results and item displays. Let me know how it works. Requests for additional/changed data are welcome, as are requests for other multiple views that aren't yet supported.

rtbell · August 27, 2014

It seems that the catalogue for this archive has changed recently, so the translator isn't working at the moment. I thought it would just be a case of updating the code to match the new URL structure, but it now appears that we can't access the XML via their API, even though it seems that the API hasn't changed much (the URLs for the XML data remain the same so far as I can see, but apparently you have to apply for access).

Can anyone confirm that they've tightened access to the API, or am I just missing something?

adamsmith · August 28, 2014

There's nothing helpful on their website. It sounds like they may have the API locked down by IP
http://labs.nationalarchives.gov.uk/wordpress/index.php/2011/09/the-national-archives-api/
but that post is older than the working translator, so I have no idea. If someone with a contact at the archive can find out if there's still a way to get at the XML data, that'd be great.
It's obviously possible to scrape from the page, but that seems like such a waste (and would require completely rewriting the translator).

I've tweeted them, but if anyone has any contacts that'd be very helpful to find out.

rtbell · August 28, 2014

Thanks - good to know I wasn't just missing something. I've emailed the Discovery@ address provided with the API documentation. I'm also going to be at the archives this week and next so if we don't hear back I might see if it's possible to corner someone about it.

emmareisz · September 3, 2014

Thanks for looking into this, Adam and rtbell. It's unfortunate that TNA closed down the old catalogue with no transition period. I will email them to beg them to get in touch with you! I have no special contacts there - just another user - but the more voices the better.

rtbell · September 4, 2014

I heard back - they're looking into it and are aware of this thread, so hopefully something will get sorted soon. However, there's no telling how quickly they'll be able to act. So, in the meantime, I've cobbled together a very rough-and-ready workaround that scrapes from the page. It's imperfect and limited to single items at the moment, as I really just put it together to serve my own purposes, but I thought I'd share just in case we're waiting a while for a real solution.

I've uploaded the code to: https://gist.github.com/rt-bell/95723931b04144db3633

If anyone uses it and has any problems, let me know.

adamsmith · September 5, 2014

Thanks - I've put rtbell's translator up for everyone to use

Your version of Zotero will automatically update within 24 hours, or you can update manually using the "Update Now" button in the "General" tab of the Zotero preferences. If you're using Standalone, restart Zotero and your browser after updating.

This is - as they note - a makeshift translator and I just did a very cursory review in the hope that we'll get the real deal via API back asap. If the API situation hasn't changed in a couple of months we can revisit this.

emmareisz · September 5, 2014

Thanks, rtbell and Adam. I've given it a try and it's working correctly for me on a few different types of record. Thanks so much.

torros · September 5, 2014

thx Adam, that will help a lot!

milh0use · September 10, 2014

Hi Adam (and Zotero users),

Sorry about the problems caused by the temporary loss of the API. It is back online now. Would you be able to test the API-based Parser and switch back to it if everything works okay?

We have identified an issue with the API being unable to return data for non-National Archives records (those records that have recently been imported into Discovery from Access2Archives, the National Register of Archives and Archon), but those records have a different format of Discovery URL anyway and aren't offered for citation by the Zotero browser plugin. We've noted this API issue and will add it to the backlog.

Incidentally, I notice that there was talk of using the Discovery "Title" field (where present) for the Zotero "title" mapping, but that the current parser always uses a standard text of "The National Archive of the UK, COPY 1/2". Is this intentional?

Steven

adamsmith · September 10, 2014

Great, thanks, much appreciate you coming by for the heads-up. I'll test this tonight (or if rtbell is motivated, I'll also accept a pull request).

Quoting milh0use
"Incidentally, I notice that there was talk of using the Discovery 'Title' field (where present) for the Zotero 'title' mapping, but that the current parser always uses a standard text of 'The National Archive of the UK, COPY 1/2'. Is this intentional?"

no, it's not. I'll look into it.

adamsmith · September 11, 2014

OK, this uses the API again. We're again using the "Title" field, too, though I'm finding that it doesn't exist for most entries, so if there are better ideas I'm happy to hear them.

Thanks again everyone for reporting to us and the Nat'l Archive, and to the folks at the archive for fixing it.

milh0use · September 11, 2014

I have to admit, I don't know what proportion of the records have the title field populated. In Discovery, I believe we use the first n characters (followed by ellipsis) of the ScopeContent->Description field where title is absent.

Thanks for updating so quickly.

adamsmith · September 11, 2014

Quoting milh0use
"In Discovery, I believe we use the first n characters (followed by ellipsis) of the ScopeContent->Description field where title is absent."

yes--and that makes total sense for a catalog display. Having ellipses in seemingly random places on import in Zotero, however, isn't satisfactory, so we'd need a different solution. It'd be possible to try to parse the description up to a period or so.
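
One way the "up to a period" idea might look is sketched below. The function name is hypothetical and the rule is deliberately naive: it cuts at the first period that is followed by whitespace or the end of the text, so abbreviations such as "Mr. Smith" would cut the title short.

```python
import re


def title_from_description(description):
    """Use the text before the first sentence-ending period as the title."""
    text = description.strip()
    # Split once at a period followed by whitespace or end of string.
    return re.split(r"\.(?:\s|$)", text, maxsplit=1)[0]


print(title_from_description("Photograph of a dog. Taken outdoors."))
```

A description with no period at all is returned whole, so very long descriptions would still need some other limit.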

emmareisz · September 12, 2014

Thanks to milh0use for coming on here and helping get this fixed.

I've had an email from TNA adding a little detail to milh0use's comment above re. the title field:
"most records don't have a title but instead for display and indexing purposes we create one from the first 100 characters of the description. If there are more than 100 we end after the next whole word with ellipses."
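
For comparison, the display rule quoted from TNA's email can be approximated as below. This is one reading of that sentence, not TNA's actual code: the function name, the three-dot ellipsis and the choice to finish the word that straddles the 100-character mark are all assumptions.

```python
def display_title(description, limit=100):
    """Approximate a Discovery-style generated title from a description."""
    if len(description) <= limit:
        return description
    end = limit
    # Extend to the end of the word that is in progress at the limit.
    while end < len(description) and not description[end].isspace():
        end += 1
    if end >= len(description):
        return description  # the final word ran to the end; nothing was cut
    return description[:end].rstrip() + "..."


print(display_title("word " * 30))
```

This is the behaviour that makes sense for catalogue display but produces the stray ellipses that are unwanted on import into Zotero.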

The email from TNA also mentioned that whatever arrangements Zotero makes for the title field should reflect the fact that Discovery now includes catalogues from numerous other UK archives (although it seems that non-TNA records are not currently being scraped successfully by Zotero). So "The National Archives of the UK" should not be hard-coded into the title field and instead if the archive name is going to be in the title then that should be taken from the relevant field.

TNA gave an email address at ResourceDiscoveryDevelopment at nationalarchives.gsi.gov.uk for more detailed discussions of the structure of the data.

adamsmith · September 12, 2014

correct, we currently don't import items not from TNA, so that's not an issue at this point, but if we do (though it sounds like that might not currently be possible in the API?) we'll take that into account.