Zotero retrieves wrong PDF from org's library

I work for a small organisation that publishes research reports (all open-access). Each report has a landing page with a link to download a PDF of the report. We register DOIs (through Crossref) that resolve to the landing pages.

I recently found that when saving reports from our site via the Zotero connector, or when taking an existing reference and running "Find PDF", the wrong PDFs are retrieved/attached. I.e., if I go to the landing page for Report A and choose "Save to Zotero (DOI)" via the connector, it imports all the correct metadata for Report A, but attaches the PDF for Report B (from the same organisation).

In the Zotero logs (debug ID D847758907), I can see that a search for Report A's DOI is sent to https://services.zotero.org/oa/search and an open-access PDF URL is found, but it downloads from the URL for Report B's PDF. I understand Zotero gets its info on PDF locations from Crossref and Unpaywall, so I checked the records for the reports in both places, and didn't see any links to PDFs, correct or otherwise.

I checked with our web team and they confirmed that they do not include links to PDFs when registering DOIs, as they understood this to be best practice (the whole point of DOIs being to ensure people can get to a page where they can access a document, even if the location of the PDF changes).

Our guess is that the Zotero PDF service uses some heuristic for getting PDFs from our site, and that this sometimes goes wrong (which may well be because our site is or was configured in an atypical/unhelpful way). So the idea has come up to include the PDF URL in the Crossref metadata as a <resource mime_type="application/pdf"> element, while still having the DOI resolve to the landing page. We know we would have to update the metadata if we change how PDFs are organised, but that would be manageable at our scale.
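
To check whether a registered PDF link actually shows up in the Crossref record, something like the following could be used. This is only a minimal sketch: it assumes the public Crossref REST API at api.crossref.org, where full-text links appear in a work's `link` list with a `content-type` field, and the function names are made up for illustration. It says nothing about how Zotero itself uses that data.

```python
import json
import urllib.parse
import urllib.request


def crossref_works_url(doi: str) -> str:
    """Build the Crossref REST API URL for a DOI."""
    return "https://api.crossref.org/works/" + urllib.parse.quote(doi, safe="/")


def pdf_links(work: dict) -> list[str]:
    """Return URLs from the record's `link` entries whose content type is PDF."""
    return [
        entry["URL"]
        for entry in work.get("link", [])
        if entry.get("content-type") == "application/pdf" and "URL" in entry
    ]


def fetch_work(doi: str) -> dict:
    """Fetch a work record (network call; the record sits under `message`)."""
    with urllib.request.urlopen(crossref_works_url(doi)) as response:
        return json.load(response)["message"]
```

Calling `pdf_links(fetch_work("10.23846/PWPIE132"))` would list any PDF links Crossref holds for that DOI. An empty list would match what we currently see in the record.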

So, three questions:

  1. If we start registering PDF links, would that be expected to fix the issue with incorrect PDFs being retrieved?

  2. If so, how long after we make the change should we expect to see "correct" behaviour from Zotero?

  3. If this isn't the best way to help Zotero find the correct PDFs for our reports, what should we do instead?

Thanks in advance.
  • Can you post a specific example where the problem happens? It is often easier to study practical cases.
  • I agree that an example would help, but Zotero basically tries two things to get the PDF for a DOI/URL:
    1. It checks whether you can access the PDF directly by following the DOI/URL (either because the site is OA or because you have access via your IP address).
    2. Then it checks whether there is an open-access PDF on Unpaywall (https://unpaywall.org/).

    So I'd check the data for 2. for your PDF, which sounds like it might be wrong. I think changing what you have in Crossref and making sure it's marked as open access might help (Unpaywall uses their data extensively), but that depends on how Unpaywall got the wrong PDF link, so it might also be worth reporting to them.
  • Thanks, I'd meant to post an example. When importing this report:
    https://www.3ieimpact.org/evidence-hub/publications/impact-evaluations/impacts-judicial-reform-small-claims-procedures-court

    the PDF for this report is retrieved:
    https://www.3ieimpact.org/evidence-hub/publications/impact-evaluations/evaluation-secondary-school-teacher-training-under

    Happy to provide other examples if useful.

    Not sure which info from the logs would be useful, but here are some relevant lines (full log report available at the debug ID posted above, if any devs want to take a look):
    (3)(+0000002): Looking for open-access PDFs for 10.23846/PWPIE132

    (3)(+0000000): HTTP POST "{"doi":"10.23846/PWPIE132"}" to https://services.zotero.org/oa/search

    (3)(+0000001): HTTP GET https://api.zotero.org/keys/current succeeded with 200

    (3)(+0000001): HTTP POST https://services.zotero.org/oa/search succeeded with 200

    (3)(+0000000): Found 1 open-access PDF URL

    (3)(+0000000): Downloading file from https://www.3ieimpact.org/sites/default/files/2021-09/IE135-Nepal-SSDP.pdf

    @adamsmith, can you say more about how (1) works in this case? When Zotero "checks if you have access directly because the site is OA", what is it checking exactly? Is it relying on the site metadata in some way? Or looking for a "download" link and seeing if it can initiate a download without hitting a paywall? I'm trying to figure out why Zotero would fail to recognise that reports can be freely downloaded from our site (where hopefully it's fairly obvious to humans).

    Thanks for the suggestion to check Unpaywall. I don't see anything in the JSON records for either report that would explain why it associates the PDF for Report B with the metadata for Report A. But I do see that our reports are incorrectly classified as is_oa: false in their data. I submitted a correction for one DOI and will see what comes of that, and may ask if they can advise on our situation more broadly.
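
For anyone wanting to repeat the Unpaywall check across many DOIs, here is a rough sketch. It assumes the public Unpaywall v2 API (which asks for an email address as a query parameter) and its `is_oa` and `oa_locations` / `url_for_pdf` fields. The function names and the example email are placeholders.

```python
import json
import urllib.parse
import urllib.request


def unpaywall_url(doi: str, email: str) -> str:
    """Build the Unpaywall v2 lookup URL for a DOI."""
    query = urllib.parse.urlencode({"email": email})
    return f"https://api.unpaywall.org/v2/{urllib.parse.quote(doi, safe='/')}?{query}"


def summarise(record: dict) -> dict:
    """Pull out the OA flag and any PDF URLs Unpaywall lists for the record."""
    locations = record.get("oa_locations") or []
    return {
        "is_oa": record.get("is_oa"),
        "pdf_urls": [loc["url_for_pdf"] for loc in locations if loc.get("url_for_pdf")],
    }


def fetch_record(doi: str, email: str) -> dict:
    """Fetch the Unpaywall record for a DOI (network call)."""
    with urllib.request.urlopen(unpaywall_url(doi, email)) as response:
        return json.load(response)
```

A record like the ones described above, with `is_oa: false` and no OA locations, would come back from `summarise` with an empty `pdf_urls` list.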
  • Yeah, I don't know what's going on here -- I don't see where Zotero would get the wrong (or the right, for that matter) PDF information from. Zotero devs would have to say.

    (1) would apply if, say, there were a citation_pdf_url meta tag on your landing page, but I'm not seeing anything of the kind.
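
To check a landing page for a citation_pdf_url tag yourself, a small sketch along these lines should work. It only looks for `<meta name="citation_pdf_url" content="...">` in the page's HTML. It is not Zotero's actual detection logic, which isn't shown in this thread, and the class and function names are made up.

```python
from html.parser import HTMLParser


class CitationPdfFinder(HTMLParser):
    """Collect the content of every <meta name="citation_pdf_url"> tag."""

    def __init__(self):
        super().__init__()
        self.pdf_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = (attributes.get("name") or "").lower()
        if name == "citation_pdf_url" and attributes.get("content"):
            self.pdf_urls.append(attributes["content"])


def find_citation_pdf_urls(html: str) -> list[str]:
    """Return all citation_pdf_url values found in an HTML string."""
    finder = CitationPdfFinder()
    finder.feed(html)
    return finder.pdf_urls
```

Feeding it the saved source of a landing page returns an empty list when no such tag is present.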