Zotero retrieves wrong PDF from org's library
I work for a small organisation that publishes research reports (all open-access). Each report has a landing page with a link to download a PDF of the report. We register DOIs (through Crossref) that resolve to the landing pages.
I recently found that when saving reports from our site via the Zotero connector, or when taking an existing reference and running "Find PDF", the wrong PDFs are retrieved/attached. I.e., if I go to the landing page for Report A and choose "Save to Zotero (DOI)" via the connector, it imports all the correct metadata for Report A, but attaches the PDF for Report B (from the same organisation).
In the Zotero logs (debug ID D847758907), I can see that a search for Report A's DOI is sent to https://services.zotero.org/oa/search and an open-access PDF URL is found, but it downloads from the URL for Report B's PDF. I understand Zotero gets its info on PDF locations from Crossref and Unpaywall, so I checked the records for the reports in both places, and didn't see any links to PDFs, correct or otherwise.
I checked with our web team and they confirmed that they do not include links to PDFs when registering DOIs, as they understood this to be best practice (the whole point of DOIs being to ensure people can get to a page where they can access a document, even if the location of the PDF changes).
So our guess is that the Zotero PDF service is using some heuristic for getting PDFs from our site, which goes wrong sometimes (which may well be because our site is or was configured in an atypical/unhelpful way). So the idea has come up to include the PDF URL in the CrossRef metadata as a <resource mime_type="application/pdf"> element, while still having the DOI resolve to the landing page. We know we would have to update the metadata if we change how PDFs are organised, but that would be manageable at our scale.
So, three questions:
- If we start registering PDF links, would that be expected to fix the issue with incorrect PDFs being retrieved?
- If so, how long after we make the change should we expect to see "correct" behaviour from Zotero?
- If this isn't the best way to help Zotero find the correct PDFs for our reports, what should we do instead?
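For anyone who wants to repeat the Crossref check, here is a minimal Python sketch of how one might look for registered PDF links in a Crossref record. It assumes the public Crossref REST API (https://api.crossref.org/works/{doi}) and its `link` field; the sample DOI and URL are made up for illustration, and nothing here says how Zotero itself reads the record.

```python
import json
import urllib.parse
import urllib.request


def crossref_pdf_links(work):
    """Return the URLs of links in a Crossref work record that are declared as PDFs."""
    links = work.get("link", [])
    return [
        link["URL"]
        for link in links
        if link.get("content-type") == "application/pdf" and "URL" in link
    ]


def fetch_crossref_work(doi):
    """Fetch the 'message' part of the Crossref REST API record for a DOI."""
    url = "https://api.crossref.org/works/" + urllib.parse.quote(doi)
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)["message"]


# Hypothetical record shaped like a Crossref response (not one of our real DOIs).
sample_work = {
    "DOI": "10.5555/example-report-a",
    "link": [
        {
            "URL": "https://example.org/reports/report-a.pdf",
            "content-type": "application/pdf",
            "intended-application": "text-mining",
        }
    ],
}
print(crossref_pdf_links(sample_work))
```

An empty list for a real DOI would match what we saw: no PDF links registered, correct or otherwise. To check a live record, pass the result of `fetch_crossref_work("<your DOI>")` to `crossref_pdf_links`.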
1. It checks whether you have access to it directly by following the DOI/URL (either because that site is OA or because you have access via IP where you are and
2. Then it checks whether there is an open-access PDF on Unpaywall (https://unpaywall.org/)
So I'd check the data for 2. for your PDF, which sounds like it might be wrong. I think changing what you have in CrossRef and making sure it's marked as open access might help (Unpaywall uses their data extensively), but it depends on how Unpaywall got the wrong PDF link (so it might also be worthwhile reporting to them).
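To check the data for 2., something like the following Python sketch could pull out the fields that matter from an Unpaywall record. It assumes the Unpaywall v2 REST API (https://api.unpaywall.org/v2/{doi}?email=...) and its `is_oa`, `best_oa_location` and `oa_locations` fields; the email address and the sample record are placeholders, not real data.

```python
import json
import urllib.parse
import urllib.request


def summarise_unpaywall(record):
    """Pick out the OA flag and any PDF URLs from an Unpaywall record."""
    best = record.get("best_oa_location") or {}
    locations = record.get("oa_locations") or []
    return {
        "is_oa": record.get("is_oa"),
        "best_pdf": best.get("url_for_pdf"),
        "all_pdfs": [loc["url_for_pdf"] for loc in locations if loc.get("url_for_pdf")],
    }


def fetch_unpaywall_record(doi, email):
    """Fetch the Unpaywall record for a DOI (the API asks for a contact email)."""
    query = urllib.parse.urlencode({"email": email})
    url = "https://api.unpaywall.org/v2/" + urllib.parse.quote(doi) + "?" + query
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)


# Hypothetical record for a report that Unpaywall does not consider open access.
sample_record = {
    "doi": "10.5555/example-report-a",
    "is_oa": False,
    "best_oa_location": None,
    "oa_locations": [],
}
print(summarise_unpaywall(sample_record))
```

If `best_pdf` points at a different report's file, that would be the thing to report to Unpaywall; if `is_oa` is false and there are no PDF URLs at all, the wrong link is probably coming from somewhere else.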
https://www.3ieimpact.org/evidence-hub/publications/impact-evaluations/impacts-judicial-reform-small-claims-procedures-court
the PDF for this report is retrieved:
https://www.3ieimpact.org/evidence-hub/publications/impact-evaluations/evaluation-secondary-school-teacher-training-under
Happy to provide other examples if useful.
Not sure what all info from the logs would be useful, but here are some relevant lines (full log report available at the debug ID posted above, if any devs want to take a look):
@adamsmith, can you say more about how (1) works in this case? When Zotero "checks if you have access directly because the site is OA", what is it checking exactly? Is it relying on the site metadata in some way? Or looking for a "download" link and seeing if it can initiate a download without hitting a paywall? I'm trying to figure out why Zotero would fail to recognise that reports can be freely downloaded from our site (where hopefully it's fairly obvious to humans).
Thanks for the suggestion to check Unpaywall. I don't see anything in the JSON records for either report that would explain why it associates the PDF for Report B with the metadata for Report A. But I do see that our reports are incorrectly classified as is_oa: false in their data. I submitted a correction for one DOI and will see what comes of that, and may ask if they can advise on our situation more broadly.
1) would have been the case, if, say, there's a citation_pdf_url link on your landing page, but I'm not seeing anything of the kind.
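A quick way to confirm whether a landing page carries that tag is to scan its HTML for `<meta name="citation_pdf_url" ...>`. The Python sketch below does only that, using the standard library; the sample HTML and URL are invented, and it makes no claim about what else Zotero might look at on the page.

```python
from html.parser import HTMLParser


class CitationPdfFinder(HTMLParser):
    """Collect the content of every <meta name="citation_pdf_url"> tag."""

    def __init__(self):
        super().__init__()
        self.pdf_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = (attributes.get("name") or "").lower()
        content = attributes.get("content")
        if name == "citation_pdf_url" and content:
            self.pdf_urls.append(content)


def find_citation_pdf_urls(html):
    """Return all citation_pdf_url values found in an HTML document."""
    finder = CitationPdfFinder()
    finder.feed(html)
    return finder.pdf_urls


# Hypothetical landing page head, for illustration only.
sample_html = """
<html><head>
  <meta name="citation_title" content="Report A">
  <meta name="citation_pdf_url" content="https://example.org/reports/report-a.pdf">
</head><body></body></html>
"""
print(find_citation_pdf_urls(sample_html))
```

Running `find_citation_pdf_urls` on the saved source of each landing page should show whether the tag is present and, if so, whether it points at the right report's PDF.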