Improving PDF quality for Zotero's new PDF recognizer

zurpher · March 20, 2018

When importing the PDF from https://www.oneducation.net/wp-content/uploads/2018/03/10.17899_on_ed.2018.1.1.pdf , Zotero's new PDF recognizer recognises author and title but not year or any other information. What can be done to improve this?

dstillman · March 23, 2018

The DOI in that PDF redirects properly, but there's no metadata for it in any of the DOI registration agencies we currently check. @adamsmith, how common is it for there to be a redirect but no metadata? Is there a way to determine the RA from the DOI?

In the absence of canonical metadata, we could probably do a bit better at extracting what's in the PDF, but that's obviously a worse option. The embedded metadata available on the page itself isn't great either.

adamsmith · March 23, 2018

It should be very rare for an item to resolve and have no metadata (it'd be from one of the RAs we don't currently support), so I think something else is going on here.

You can find out the RA using the datacite prefix API:
curl "https://api.datacite.org/prefixes/10.17899"

and this is a datacite DOI, so it has metadata and
curl -LH "Accept: application/vnd.citationstyles.csl+json" https://doi.org/10.17899/on_ed.2018.1.1

looks good. I'll have to take a look at what goes wrong. Probably a bug in the search translator.
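
For anyone who wants to script the two lookups above, here is a minimal Python sketch that builds the same requests without sending them. The helper names (`doi_prefix`, `build_prefix_lookup`, `build_csl_request`) are made up for illustration; only the URLs and the Accept header are taken from the curl commands.

```python
from urllib.request import Request

CSL_JSON = "application/vnd.citationstyles.csl+json"


def doi_prefix(doi: str) -> str:
    """Return the registrant prefix, i.e. everything before the first slash."""
    prefix, sep, suffix = doi.partition("/")
    if not sep or not suffix or not prefix.startswith("10."):
        raise ValueError(f"not a DOI: {doi!r}")
    return prefix


def build_prefix_lookup(doi: str) -> Request:
    """Request for the DataCite prefix API, used to check who registered the DOI."""
    return Request(f"https://api.datacite.org/prefixes/{doi_prefix(doi)}")


def build_csl_request(doi: str) -> Request:
    """Content-negotiation request asking doi.org for CSL JSON."""
    return Request(f"https://doi.org/{doi}", headers={"Accept": CSL_JSON})


doi = "10.17899/on_ed.2018.1.1"
print(build_prefix_lookup(doi).full_url)
print(build_csl_request(doi).full_url)
```

Passing either request to `urllib.request.urlopen` should behave much like the curl calls, since `urlopen` follows redirects by default.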

dstillman · March 23, 2018

It looks like there's no title in the metadata, which Zotero requires:

https://data.datacite.org/application/citeproc+json/10.17899/on_ed.2018.1.1
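
A quick way to spot this kind of gap is to check a CSL JSON item for the fields you depend on. The sketch below is only an illustration: `missing_required_fields` is a hypothetical helper, not Zotero code, and the sample record is hand-written to resemble the one discussed here (the `issued` structure is assumed from the usual CSL JSON layout).

```python
def missing_required_fields(csl_item: dict, required=("title",)) -> list:
    """Return the required keys that are absent or empty in a CSL JSON item."""
    return [key for key in required if not str(csl_item.get(key) or "").strip()]


# Hand-written sample: it has a type, an abstract and a year, but no title.
record = {
    "type": "report",
    "abstract": "On Education. Journal for Research and Debate",
    "issued": {"date-parts": [[2018]]},
}

print(missing_required_fields(record))  # prints ['title']
```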

adamsmith · March 23, 2018

ah yes, right. That's weird -- title is mandatory in Datacite for all currently active versions of their metadata schema. I'll try to track down if this is a bug on their end.

adamsmith · March 23, 2018

Yes, that's a Datacite bug. They have the title, it just doesn't make it into any of their JSON formats. Reported here: https://github.com/datacite/datacite/issues/324

zurpher · March 26, 2018

I found three more things that appear incorrect in the JSON format.

1) The "type" should not be "report" but "journal article" instead.

"type": "report",

2) The title of the journal ended up in the "abstract" field.

"abstract": "On Education. Journal for Research and Debate",

3) The date should be month year (03/2018) but it is only 2018.

Also the XML data for the articles of the first issue seem incomplete. Only Merry (2018) shows the publisher information and the CC BY-NC license – the others do not.

adamsmith · March 26, 2018

That's a combination of what the publisher deposits with Datacite and limits of the Datacite data model -- there's nothing we can do about those.

zurpher · March 26, 2018

Thanks for tracking down the bug, @adamsmith . I contacted Datacite regarding the other issues.

adamsmith · March 26, 2018

I'd generally not contact DataCite with data quality issues unless you have a strong reason to believe that the problem lies with their services rather than the data deposited with them (which was the case for the issue I filed & linked to above). Your first point of contact should typically be the journal's publisher.

zurpher · March 28, 2018

Well, I registered the article metadata myself with da|ra. You are probably right that da|ra should be my first point of contact.

mfenner · April 3, 2018

To add to what adamsmith said: mapping DataCite metadata to Citeproc JSON used for citation formatting can sometimes be tricky. The issues in this particular example are a) SeriesInformation goes into the DataCite Description field and thus shows up in the abstract in Citeproc JSON, b) there is no controlled vocabulary for resourceType, so "Article" is not recognized, and we default to "report", and c) the DataCite metadata don't contain a publication month.

The fix would be for a) that DataCite correctly parses the relatedIdentifier "isPartOf" with the ISSN, b) that DataCite comes up with a controlled vocabulary that includes JournalArticle (unlikely in the short term), and c) that "date issued" metadata is used for a more specific publication date.

Martin (DataCite Technical Director)
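
Points b) and c) can be illustrated with a small Python sketch. This is not DataCite's actual mapping code: the function names and the one-entry vocabulary are assumptions, and the date handling only covers plain `YYYY`, `YYYY-MM` and `YYYY-MM-DD` values.

```python
# Hypothetical controlled vocabulary; the thread says none exists yet.
RESOURCE_TYPE_TO_CSL = {"JournalArticle": "article-journal"}


def csl_type(resource_type, default="report"):
    """Map a resourceType string to a CSL type, defaulting when it is unknown."""
    return RESOURCE_TYPE_TO_CSL.get(resource_type or "", default)


def issued_date_parts(dates, publication_year):
    """Prefer a date of type 'Issued'; otherwise fall back to the year alone."""
    for entry in dates:
        parts = str(entry.get("date") or "").split("-")
        if entry.get("dateType") == "Issued" and all(p.isdigit() for p in parts):
            return [int(p) for p in parts[:3]]
    return [publication_year]


print(csl_type("Article"))         # free-text value, not in the vocabulary
print(csl_type("JournalArticle"))  # value from the proposed vocabulary
print(issued_date_parts([{"dateType": "Issued", "date": "2018-03"}], 2018))
print(issued_date_parts([], 2018))
```

With no issued date deposited, the last call can only return the year, which matches what the record in this thread shows.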