Improving PDF Quality for Zotero's new PDF recognizer

When importing the PDF from https://www.oneducation.net/wp-content/uploads/2018/03/10.17899_on_ed.2018.1.1.pdf , Zotero's new PDF recognizer recognises author and title but not year or any other information. What can be done to improve this?
  • The DOI in that PDF redirects properly, but there's no metadata for it in any of the DOI registration agencies we currently check. @adamsmith, how common is it for there to be a redirect but no metadata? Is there a way to determine the RA from the DOI?

    In the absence of canonical metadata, we could probably do a bit better at extracting what's in the PDF, but that's obviously a worse option. The embedded metadata available on the page itself isn't great either.
  • It should be very rare for an item to resolve and have no metadata (it would have to be from one of the RAs we don't currently support), so I think something else is going on here.

    You can find out the RA using the datacite prefix API:
    curl "https://api.datacite.org/prefixes/10.17899"

    and this is a datacite DOI, so it has metadata and
    curl -LH "Accept: application/vnd.citationstyles.csl+json" https://doi.org/10.17899/on_ed.2018.1.1

    looks good. I'll have to take a look at what goes wrong. Probably a bug in the search translator.
  • It looks like there's no title in the metadata, which Zotero requires:

    https://data.datacite.org/application/citeproc+json/10.17899/on_ed.2018.1.1
  • ah yes, right. That's weird -- title is mandatory in Datacite for all currently active versions of their metadata schema. I'll try to track down if this is a bug on their end.
  • Yes, that's a Datacite bug. They have the title, it just doesn't make it into any of their JSON formats. Reported here: https://github.com/datacite/datacite/issues/324
  • edited March 26, 2018
    I found three more things that appear incorrect in the JSON format.

    1) The "type" should not be "report" but "journal article" instead.
    "type": "report",

    2) The title of the journal ended up in the "abstract" field.
    "abstract": "On Education. Journal for Research and Debate",
    3) The date should be month year (03/2018) but it is only 2018.

    Also the XML data for the articles of the first issue seem incomplete. Only Merry (2018) shows the publisher information and the CC BY-NC license – the others do not.
  • That's a combination of what the publisher deposits with Datacite and limits of the Datacite data model -- there's nothing we can do about those.
  • Thanks for tracking down the bug, @adamsmith . I contacted Datacite regarding the other issues.
  • I'd generally not contact DataCite with data quality issues unless you have a strong reason to believe that the problem lies with their services rather than with the data deposited with them (which was the case for the issue I filed & linked to above). Your first point of contact should typically be the journal's publisher.
  • edited March 28, 2018
    Well, I registered the article metadata myself with da|ra. You are probably right that da|ra should be my first point of contact.
  • To add to what adamsmith said: mapping DataCite metadata to Citeproc JSON used for citation formatting can sometimes be tricky. The issues in this particular example are a) SeriesInformation goes into the DataCite Description field and thus shows up in the abstract in Citeproc JSON, b) there is no controlled vocabulary for resourceType, so "Article" is not recognized, and we default to "report", and c) the DataCite metadata don't contain a publication month.

    The fix would be a) for DataCite to correctly parse the relatedIdentifier "isPartOf" with the ISSN, b) for DataCite to come up with a controlled vocabulary that includes JournalArticle (unlikely in the short term), and c) to use the "date issued" metadata for a more specific publication date.

    Martin (DataCite Technical Director)
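
The checks discussed in this thread can be sketched in Python. This is a rough illustration, not anything Zotero or DataCite actually run: the function names (doi_prefix, prefix_lookup_url, fetch_csl, metadata_problems) are made up for this example, and the sample record is hand-written to mirror the problems reported above rather than copied from a real API response. Only the two URLs and the Accept header come from the curl commands quoted earlier.

```python
import json
import urllib.request

# Media type used in the curl example from the thread.
CSL_JSON = "application/vnd.citationstyles.csl+json"


def doi_prefix(doi):
    """Return the registrant prefix, e.g. '10.17899' for '10.17899/on_ed.2018.1.1'."""
    prefix, sep, suffix = doi.strip().partition("/")
    if not (prefix.startswith("10.") and sep and suffix):
        raise ValueError(f"not a DOI: {doi!r}")
    return prefix


def prefix_lookup_url(doi):
    """Build the DataCite prefix lookup URL quoted in the thread."""
    return f"https://api.datacite.org/prefixes/{doi_prefix(doi)}"


def fetch_csl(doi):
    """Request CSL JSON from doi.org via content negotiation (needs network access)."""
    request = urllib.request.Request(
        f"https://doi.org/{doi}", headers={"Accept": CSL_JSON}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def metadata_problems(csl):
    """List the kinds of problems raised in the thread for a CSL JSON record."""
    problems = []
    if not csl.get("title"):
        problems.append("missing title")
    if csl.get("type") == "report":
        problems.append("type is 'report'")
    date_parts = csl.get("issued", {}).get("date-parts", [[]])
    first = date_parts[0] if date_parts else []
    if len(first) < 2:
        problems.append("issued date has no month")
    return problems


# Hand-written sample (an assumption, not a real response).
sample = {
    "type": "report",
    "abstract": "On Education. Journal for Research and Debate",
    "issued": {"date-parts": [[2018]]},
}
print(prefix_lookup_url("10.17899/on_ed.2018.1.1"))
print(metadata_problems(sample))
```

To check a live record, one could call metadata_problems(fetch_csl("10.17899/on_ed.2018.1.1")), though the result depends on what the registration agency returns at that moment.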