The National Archives (UK)

Possible to get a translator to work on the main catalogue (Discovery) for this?

http://discovery.nationalarchives.gov.uk/SearchUI/

An example item would be:
http://discovery.nationalarchives.gov.uk/SearchUI/Details?uri=C9156576

I can see there might be a problem deciding which fields to map to what. Happy to chat about it.

Jo Pugh
  • as for fields:
    Reference --> Loc. in Archive
    Description --> Title
    Date --> Date
    Held by --> Archive
    Legal status --> Rights
    Access conditions --> ??? (maybe add to rights?)
    Publication note: --> note (or abstract?)

    Won't happen immediately but seems well worthwhile doing.
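The proposed mapping above can be written down as a small lookup table. This is only a sketch of the idea: the names `FIELD_MAP` and `map_record` are hypothetical, the target names are the Zotero field labels as written in the list, and the two fields with open questions (Access conditions, Publication note) are deliberately left unmapped rather than guessed.

```python
# Hypothetical lookup table: Discovery display label -> Zotero field label,
# copied from the mapping proposed in the thread.
FIELD_MAP = {
    "Reference": "Loc. in Archive",
    "Description": "Title",
    "Date": "Date",
    "Held by": "Archive",
    "Legal status": "Rights",
}


def map_record(record):
    """Split a scraped record into mapped fields and fields still undecided."""
    mapped, unmapped = {}, {}
    for label, value in record.items():
        target = FIELD_MAP.get(label)
        if target is None:
            # e.g. "Access conditions" or "Publication note": no decision yet
            unmapped[label] = value
        else:
            mapped[target] = value
    return mapped, unmapped


mapped, unmapped = map_record(
    {"Reference": "INF 3/108", "Date": "1939-1946", "Access conditions": "Open"}
)
```

Keeping the undecided fields in a separate dictionary makes it easy to see what is being dropped until the mapping is settled.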
  • That would be amazing! It's possible that the underlying fields in the XML might be a better guide to the matching than the visible fields

    http://discovery.nationalarchives.gov.uk/DiscoveryAPI/xml/informationasset/C3454320

    compared to:

    http://discovery.nationalarchives.gov.uk/SearchUI/Details?uri=C3454320


    should work (I think). I can post the XML if it doesn't.
  • working with XML would be much better in general, yes - unfortunately their API is (currently?) restricted http://discovery.nationalarchives.gov.uk/SearchUI/api.htm
    so I can't see that XML, nor can we use it for Zotero, which really is too bad.
    If you think the XML provides better guidance for field mapping, you can post it to a public gist at gist.github.com (where it'll be easier to read than here) and link to it.
    Any comments on the mapping above?
  • actually - never mind. I do get the xml, it's just not properly displayed in FF.
  • It seems there might be a problem with the character set (certainly the fact that it is genuinely utf-16 seems a little implausible to me!).

    Let me explain how it works. There are 7 levels to Discovery. Level 1 is a department like INF - the Ministry of Information. Level 6 is where a piece of art like INF 3/108 is sitting and level 7 would be an item level description - like COPY 1/400/23 (which is a photo in a box where the box is COPY 1/400).

    In Discovery references must be formed by working up the <ParentIAID> chain and gluing together the <Reference>s.

    The tricky thing is that we don't want all of them. We are only interested in levels 1 ("INF"), 3 ("3") and 6 ("108"). So although we need to follow the chain of <ParentIAID>s upwards, we need to throw away levels 2, 4 and 5. These tell us archivally useful things about the object but they are not part of forming the reference and if we include their <Reference>s, the value we generate will be a nonsense and generate an error when we come to look it up.
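As a rough illustration of the rule just described, here is a sketch that walks the parent chain and keeps only the levels that contribute to the reference. Everything about the data shape is an assumption: the records are plain dictionaries keyed by made-up IAIDs, `SourceLevelId` is treated as an integer, and the separators (a space after the department, slashes after that) are inferred from examples such as INF 3/108 and COPY 1/400/23. Level 7 is included on the assumption that item-level references append it.

```python
# Levels assumed to contribute to the reference: department, series, piece, item.
CITED_LEVELS = (1, 3, 6, 7)


def build_reference(iaid, records):
    """Walk up the ParentIAID chain and glue together the wanted References."""
    parts = {}
    current = iaid
    while current is not None:
        record = records[current]
        level = record["SourceLevelId"]
        if level in CITED_LEVELS:
            parts[level] = record["Reference"]
        # Levels 2, 4 and 5 are skipped but still followed upwards.
        current = record.get("ParentIAID")
    if 1 not in parts:
        raise ValueError("no department (level 1) found in the chain")
    lower = "/".join(parts[lvl] for lvl in CITED_LEVELS[1:] if lvl in parts)
    return f"{parts[1]} {lower}" if lower else parts[1]


# Made-up IAIDs and placeholder References for the discarded levels.
RECORDS = {
    "D1": {"SourceLevelId": 1, "Reference": "INF", "ParentIAID": None},
    "V2": {"SourceLevelId": 2, "Reference": "ignored-2", "ParentIAID": "D1"},
    "S3": {"SourceLevelId": 3, "Reference": "3", "ParentIAID": "V2"},
    "X4": {"SourceLevelId": 4, "Reference": "ignored-4", "ParentIAID": "S3"},
    "X5": {"SourceLevelId": 5, "Reference": "ignored-5", "ParentIAID": "X4"},
    "P6": {"SourceLevelId": 6, "Reference": "108", "ParentIAID": "X5"},
}

reference = build_reference("P6", RECORDS)
```

In a real translator each step up the chain would be a separate request for the parent record, which is the cost to weigh against simply reading the finished reference off the page.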

    All clear as mud so far?

    If we follow this rule there may be some issues since this is not completely uniform but there is probably a list of the rulebreakers which could be provided.

    Does that make sense?

    Jo
  • <SourceLevelId> tells you what level you are at. Probably should have mentioned that.
  • Yes, as an archivist that kind of abstraction is familiar from elsewhere. To clarify, are you saying the Discovery Reference (in the Search UI, not the XML <Reference> tag, hmm this is confusing!) is nearly always formed from a combination of Levels 1 + (optionally, depending on what hierarchical level you wish to refer to) 3 + 6 + 7?

    But in any case the rulebreakers must be somewhere coded into the Discovery API already else how does the Search UI itself know how to form a proper Reference?
  • thanks. That explanation makes perfect sense, but it'll be much easier to just scrape the info from the Reference field in the human readable discovery entry if we have to make multiple requests anyway.

    Moving on -
    - what about "Rights" - there were some questions about that on twitter

    - I see that some items have a title field and that the "Description" field is quite long. Here's my current thinking on this:
    If there is a title: Title --> Title, Description --> Abstract
    If there isn't: first line of Description --> Title, whole Description --> Abstract

    - To get a full mapping of field --> Zotero, I've pasted an XML entry here: http://titanpad.com/6QNLqhVl6l
    please add the respective Zotero fields (maybe in bold?) as you see fit.
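The title/abstract rule proposed in this post could look roughly like the following. It is a sketch only: `title_and_abstract` is a hypothetical helper, and "first line" is taken literally as the text before the first line break.

```python
def title_and_abstract(title, description):
    """Return (title, abstract) following the fallback rule proposed above."""
    description = (description or "").strip()
    if title and title.strip():
        # A real title exists: use it, and keep the description as the abstract.
        return title.strip(), description
    # No title: borrow the first line of the description.
    first_line = description.splitlines()[0].strip() if description else ""
    return first_line, description


with_title = title_and_abstract("Poster", "A long description")
without_title = title_and_abstract(None, "First line\nSecond line")
```

Either way the whole description is kept as the abstract, so nothing is lost when the first line is reused as the title.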
  • Zotero's author field. I've had a look at how this is handled from NARA's ARC catalog - appears to pick up the department ie in Discovery's terms <Title> where <SourceLevelID> = 1, or the top of the 'hierarchy' box in the SearchUI.

    More recent descriptions I notice have a separate <Creator> tag though...
  • URL - I presume the SearchUI uri makes most sense here? Or would you expect this to point directly at a digital/digitised resource, if there was one?

    To really complicate, eg my initial catalyst for wishing Discovery could talk to zotero was wanting to cite a specific dated instance of this http://discovery.nationalarchives.gov.uk/SearchUI/Details?uri=C18648 from the UK Government web archive...
  • Quoting amme2
    "are you saying the Discovery Reference [...] is nearly always formed from a combination of Levels 1 + (optionally, depending on what hierarchical level you wish to refer to) 3 + 6 + 7? "

    Yes that is correct. I suggested to mentionthewar that there might be exceptions, but now that I think about it, I don't think there are. There are (legacy?) divisions called "subclasses" where at either level 3 or 6 ("series" and "piece") there are Reference fields which contain a forward slash character, but if you treat them as strings and concatenate them, they should still work. The problem really occurs in the other direction where you want to convert a Citable Reference into an IAID. The fact that the slashes are not necessarily delimiting references at different catalogue hierarchy levels makes this task virtually impossible to do reliably.

    For example, "PRO 30/89/40" looks like department "PRO", series "PRO 30", piece "PRO 30/89", item "PRO 30/89/40". It is actually series "PRO 30/89", and when you look up that IA in Discovery you will see that the Reference field says "30/89". Painful when the "/" is usually used to delimit the different hierarchy levels.

    I don't think there is a simple rule that allows you to work out which bits of the Citable Reference are the series, piece and item references. The number of slashes can vary even within a piece :(
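To make the ambiguity concrete, here is a sketch of the naive approach that assumes every slash separates one hierarchy level from the next. The function name and the level labels are illustrative only; the point is that the result for "PRO 30/89/40" contradicts what the catalogue actually holds.

```python
def naive_split(citable_reference):
    """Naively treat each '/' as a boundary between hierarchy levels."""
    department, _, rest = citable_reference.partition(" ")
    parts = rest.split("/") if rest else []
    result = {"department": department}
    # Assign the slash-separated parts to series, piece, item in order.
    result.update(zip(("series", "piece", "item"), parts))
    return result


guess = naive_split("PRO 30/89/40")
# The guess labels "30" as the series, but per the example above the series
# is PRO 30/89, whose own Reference field reads "30/89".
```

Nothing in the string itself says which slashes sit inside a single level's Reference, so converting a Citable Reference back to an IAID would need a lookup rather than string parsing.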
  • URLs are for links to full-texts, so that will stay empty. We'll attach the SearchUI as a Snapshot which also saves the URL.

    Author would only rarely be filled - after all, authors appear in citations, so the field can't hold anyone who isn't the actual creator of the content - but where we have:
    <CreatorName>
    <CreatorName>
    <Corporate_Body_Name_Text>The National Archives</Corporate_Body_Name_Text>
    <Corporate_Body_Date_Start>2003</Corporate_Body_Date_Start>
    <Corporate_Body_Date_End>0</Corporate_Body_Date_End>
    <Birth_Date>0</Birth_Date>
    <Death_Date>0</Death_Date>
    </CreatorName>
    </CreatorName>

    we'll use that.
    Is the XML format documented somewhere? I'd like to see how exactly the creator entries work.
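For what it's worth, pulling the corporate creator out of a fragment like the one quoted above might look like this. It is a sketch using Python's standard `xml.etree.ElementTree`; the sample is trimmed from the fragment in this post, and since the format isn't documented here, the nesting and the helper name `corporate_creators` are assumptions.

```python
import xml.etree.ElementTree as ET

# Sample trimmed from the fragment quoted in the thread.
SAMPLE = """<CreatorName>
  <CreatorName>
    <Corporate_Body_Name_Text>The National Archives</Corporate_Body_Name_Text>
    <Corporate_Body_Date_Start>2003</Corporate_Body_Date_Start>
    <Corporate_Body_Date_End>0</Corporate_Body_Date_End>
  </CreatorName>
</CreatorName>"""


def corporate_creators(xml_text):
    """Collect non-empty corporate body names, wherever they are nested."""
    root = ET.fromstring(xml_text)
    names = []
    for node in root.iter("Corporate_Body_Name_Text"):
        text = (node.text or "").strip()
        if text:
            names.append(text)
    return names


creators = corporate_creators(SAMPLE)
```

Searching by tag name rather than by a fixed path keeps the sketch working whether or not the outer wrapper element is present.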
  • Fair enough!

    Rights - I notice this is left blank in just about everything I have in Zotero unless it's explicitly CC. So let's leave (that can of worms and just leave) it blank here too.

    Title - When things don't have a title, the reference should be the title, I reckon.

    An example of what citation output is supposed to look like can be found at www.nationalarchives.gov.uk/records/citing-documents.htm

    There you can see that citations are either supposed to look like:

    The National Archives (TNA): Public Record Office (PRO) INF 3/140

    OR:

    The National Archives (TNA): Public Record Office (PRO) C 139 Chancery: Inquisitions Post Mortem, Series I, Henry VI

    (But the latter is a citation of a scoundrel, since it refers to 181 different files.)

    But let me have a go with the Titanpad
  • I've now added a translator for the National Archives of the UK

    Your version of Zotero will automatically update within 24 hours, or you can update manually using the "Update Now" button in the "General" tab of the Zotero preferences.

    This should work for search results and item displays. Let me know how it works. Requests for additional/changed data are welcome, as are requests for other multiple views that aren't yet supported.
  • edited August 27, 2014
    It seems that the catalogue for this archive has changed recently, so the translator isn't working at the moment. I thought it would just be a case of updating the code to match the new URL structure, but it now appears that we can't access the XML via their API, even though it seems that the API hasn't changed much (the URLs for the XML data remain the same so far as I can see, but apparently you have to apply for access).

    Can anyone confirm that they've tightened access to the API, or am I just missing something?
  • There's nothing helpful on their website. It sounds like they may have the API locked down by IP
    http://labs.nationalarchives.gov.uk/wordpress/index.php/2011/09/the-national-archives-api/
    but that post is older than the working translator, so I have no idea. If someone with contact at the archive can find out if there's still a way to get at the XML data, that'd be great.
    It's obviously possible to scrape from the page, but that seems like such a waste (and would require completely rewriting the translator).

    I've tweeted them, but if anyone has any contacts that'd be very helpful to find out.
  • Thanks - good to know I wasn't just missing something. I've emailed the Discovery@ address provided with the API documentation. I'm also going to be at the archives this week and next so if we don't hear back I might see if it's possible to corner someone about it.
  • edited September 3, 2014
    Thanks for looking into this, Adam and rtbell. It's unfortunate that TNA closed down the old catalogue with no transition period. I will email them to beg them to get in touch with you! I have no special contacts there - just another user - but the more voices the better.
  • I heard back - they're looking into it and are aware of this thread, so hopefully something will get sorted soon. However, there's no telling how quickly they'll be able to act. So, in the meantime, I've cobbled together a very rough-and-ready workaround that scrapes from the page. It's imperfect and limited to single items at the moment, as I really just put it together to serve my own purposes, but I thought I'd share just in case we're waiting a while for a real solution.

    I've uploaded the code to: https://gist.github.com/rt-bell/95723931b04144db3633

    If anyone uses it and has any problems, let me know.
  • Thanks - I've put rtbell's translator up for everyone to use

    Your version of Zotero will automatically update within 24 hours, or you can update manually using the "Update Now" button in the "General" tab of the Zotero preferences. If you're using Standalone, restart Zotero and your browser after updating.

    This is - as they note - a makeshift translator and I just did a very cursory review in the hope that we'll get the real deal via API back asap. If the API situation hasn't changed in a couple of months we can revisit this.
  • edited September 5, 2014
    Thanks, rtbell and Adam. I've given it a try and it's working correctly for me on a few different types of record. Thanks so much.
  • thx Adam, that will help a lot!
  • Hi Adam (and Zotero users),

    Sorry about the problems caused by the temporary loss of the API. It is back online now. Would you be able to test the API-based Parser and switch back to it if everything works okay?

    We have identified an issue with the API being unable to return data for non-National Archives records (those records that have recently been imported into Discovery from Access2Archives, the National Register of Archives and Archon), but those records have a different format of Discovery URL anyway and aren't offered for citation by the Zotero browser plugin. We've noted this API issue and will add it to the backlog.

    Incidentally, I notice that there was talk of using the Discovery "Title" field (where present) for the Zotero "title" mapping, but that the current parser always uses a standard text of "The National Archive of the UK, COPY 1/2". Is this intentional?

    Steven
  • Great, thanks, much appreciate you coming by for the heads-up. I'll test this tonight (or if rtbell is motivated, I'll also accept a pull request).
    Incidentally, I notice that there was talk of using the Discovery "Title" field (where present) for the Zotero "title" mapping, but that the current parser always uses a standard text of "The National Archive of the UK, COPY 1/2". Is this intentional?
    no, it's not. I'll look into it.
  • OK, this uses the API again. We're again using the "Title" field, too, though I'm finding that it doesn't exist for most entries, so if there are better ideas I'm happy to hear them.

    Thanks again everyone for reporting to us and the Nat'l Archive and for the folks at the archive for fixing it.
  • I have to admit, I don't know what proportion of the records have the title field populated. In Discovery, I believe we use the first n characters (followed by ellipsis) of the ScopeContent->Description field where title is absent.

    Thanks for updating so quickly.
    "In Discovery, I believe we use the first n characters (followed by ellipsis) of the ScopeContent->Description field where title is absent."
    Yes, and that makes total sense for a catalog display. Having ellipses in seemingly random places on import in Zotero, however, isn't satisfactory, so we'd need a different solution. It'd be possible to try to parse the description up to a period or so.
  • Thanks to milh0use for coming on here and helping get this fixed.

    I've had an email from TNA adding a little detail to milh0use's comment above re. the title field:
    "most records don't have a title but instead for display and indexing purposes we create one from the first 100 characters of the description. If there are more than 100 we end after the next whole word with ellipses."
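One possible reading of the rule quoted from TNA, sketched in code. This is not TNA's implementation: `display_title` is a hypothetical name, and "end after the next whole word" is interpreted here as finishing whatever word is in progress at character 100.

```python
def display_title(description, limit=100):
    """Approximate the display title Discovery derives from a description."""
    if len(description) <= limit:
        return description
    end = limit
    # Extend past the limit until the word in progress is complete.
    while end < len(description) and not description[end].isspace():
        end += 1
    if end >= len(description):
        # The extension swallowed the rest of the text, so nothing was cut.
        return description
    return description[:end].rstrip() + "..."


short = display_title("A short description")
long_title = display_title("word " * 30)
```

A translator that wants to avoid the trailing ellipsis would have to apply its own rule to the full description instead of reusing this derived title.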

    The email from TNA also mentioned that whatever arrangements Zotero makes for the title field should reflect the fact that Discovery now includes catalogues from numerous other UK archives (although it seems that non-TNA records are not currently being scraped successfully by Zotero). So "The National Archives of the UK" should not be hard-coded into the title field and instead if the archive name is going to be in the title then that should be taken from the relevant field.

    TNA gave an email address at ResourceDiscoveryDevelopment at nationalarchives.gsi.gov.uk for more detailed discussions of the structure of the data.
  • correct, we currently don't import items not from TNA, so that's not an issue at this point, but if we do (though it sounds like that might not currently be possible in the API?) we'll take that into account.