Available for beta testing: improved PDF retrieval with Unpaywall integration

  • Is it normal that re-running the same tool still always brings new 'finds'? The first time I ran it, it found and attached PDFs for about 10% of the 1000 records that I had downloaded from Google Scholar through Publish or Perish. By rerunning the tool a few times, I am now up to 150, and it is still discovering new PDFs. I just wonder why that is the case.
  • Is it normal that re-running the same tool still always brings new 'finds'?
    It depends on exactly what you're doing.

    1) Are you running it from the same network each time?

    2) Are you actually selecting 1000 records, right-clicking, and selecting "Find Available PDFs", or are you trying in smaller batches?

    Recall that, when you use "Add Item by Identifier" or "Find Available PDF", Zotero will actually load the DOI/URL page before checking for OA sources. The latter can change over time as Unpaywall updates its data and we incorporate it (which happens no more than once a week), but the former can change based on whether you have access to a PDF from your current network and whether a site is blocking you. If you try to find PDFs for 1000 items at a time, there's a decent chance some sites will start blocking you for making automated downloads.

    We should probably put in some automatic per-site rate-limiting to keep those requests under control no matter how many items you select.
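    For illustration, a rough TypeScript sketch of that lookup order (the helper names are made up; this is not Zotero's actual code):

    ```typescript
    interface Item {
      doi?: string;
      url?: string;
    }

    // Stand-in: load the DOI/URL page and try to pull a PDF out of it.
    // Whether this works can change from run to run, depending on your
    // current network access and on whether the site is blocking you.
    async function tryPublisherPage(pageURL: string): Promise<Blob | null> {
      void pageURL;
      return null; // placeholder
    }

    // Stand-in: check open-access copies in the Unpaywall-derived data,
    // which is refreshed no more than about once a week.
    async function tryOpenAccessCopies(doi: string): Promise<Blob | null> {
      void doi;
      return null; // placeholder
    }

    async function findAvailablePDF(item: Item): Promise<Blob | null> {
      const pageURL = item.url ?? (item.doi ? `https://doi.org/${item.doi}` : undefined);

      // 1) The resolved DOI/URL page is tried first.
      if (pageURL) {
        const pdf = await tryPublisherPage(pageURL);
        if (pdf) return pdf;
      }

      // 2) Only then are OA sources consulted.
      if (item.doi) {
        return tryOpenAccessCopies(item.doi);
      }
      return null;
    }
    ```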
  • edited September 3, 2018
    1) Are you running it from the same network each time?
    I was.
    2) Are you actually selecting 1000 records, right-clicking, and selecting "Find Available PDFs", or are you trying in smaller batches?
    I was trying the former, but have switched to the latter. 10-item batches seem to work perfectly: not in the sense that they find all PDFs, but they do seem to download all the PDFs that are available in OA.
    Recall that, when you use "Add Item by Identifier" or "Find Available PDF", Zotero will actually load the DOI/URL page before checking for OA sources. The latter can change over time as Unpaywall updates its data and we incorporate it (which happens no more than once a week),
    These runs were right after one another, so that can't be it.
    but the former can change based on whether you have access to a PDF from your current network and whether a site is blocking you.
    That was probably it. Would we be able to see that in the log?
    We should probably put in some automatic per-site rate-limiting to keep those requests under control no matter how many items you select.
    What would be even better would be something like what 'Publish or Perish' does for Google Scholar (quite effectively). A mix of this new Unpaywall integration with something like PoP for Google Scholar and Microsoft Academic would be a dream come true...
  • I've noticed that it sometimes can't retrieve a PDF despite a correct link being attached to an item. I have 5 items with links to PDFs and an open-access page, but it still doesn't download them. Is there something wrong with the links? I imagine there might be with the last page because of its odd structure, but the first 4 are seemingly OK. The Debug ID is D533727933. In case you need the links separately, I can post them as well.
  • @sdspieg:
    What would be even better would be something like what 'Publish or Perish' does for Google Scholar (quite effectively).
    Which is what?

    @y.sapolovych:
    I've noticed that it sometimes can't retrieve a PDF despite a correct link being attached to an item.
    It looks like you have some items with DOIs that resolve directly to PDFs as well as direct PDF URLs in the URL field (which is generally incorrect, since that's not normally the URL you would cite). In the latest beta, those should be handled properly. Thanks for reporting.
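    A small sketch of one way to recognize that a URL already points at a PDF rather than at a landing page (not Zotero's actual check; it just looks at the Content-Type header and the %PDF magic bytes):

    ```typescript
    async function looksLikeDirectPDF(url: string): Promise<boolean> {
      const res = await fetch(url, { redirect: "follow" });
      if (!res.ok) return false;

      const type = res.headers.get("content-type") ?? "";
      if (type.toLowerCase().includes("application/pdf")) return true;

      // Fallback: a PDF file starts with the bytes "%PDF".
      // (A real implementation would avoid downloading the whole body just to sniff it.)
      const head = new Uint8Array(await res.arrayBuffer());
      return head.length >= 4 &&
        head[0] === 0x25 && head[1] === 0x50 && head[2] === 0x44 && head[3] === 0x46;
    }
    ```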
  • edited September 6, 2018
    It's a bit more sophisticated than that: "To avoid hitting the maximum allowable request limit, Publish or Perish now uses an adaptive request rate limiter. This limits the number of requests that are sent to Google Scholar within a given period, both short-term (during the last 60 seconds) and medium term (during the last hour).

    To achieve the required reduction in requests, Publish or Perish delays subsequent requests for a variable amount of time (up to 1 minute). The higher the recent request rate, the longer the delays."

    And it really does work quite well, especially in combination with the ability to specify the year: this allows us to download 1000 hits per year for any given search query, unfortunately without also downloading the actual PDFs. The program is primarily intended for bibliometric analysis, but the developer has shown some interest in also including a way to download the URLs. We currently export the bibliographic information in *.bib format from PoP and then import it into Zotero. But if the fix that Dan just mentioned for Yevhen's issue works, that will make us EXTREMELY happy (corpus linguistics) campers :)
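    A toy TypeScript sketch of an adaptive request limiter like the one described in the PoP quote above (the window sizes, budgets, and delay formula are assumptions for illustration, not PoP's actual algorithm):

    ```typescript
    class AdaptiveRateLimiter {
      private timestamps: number[] = []; // times of recent requests (ms since epoch)

      constructor(
        private shortWindowMs = 60_000,   // "last 60 seconds"
        private longWindowMs = 3_600_000, // "last hour"
        private shortBudget = 20,         // assumed per-minute budget
        private longBudget = 400,         // assumed per-hour budget
        private maxDelayMs = 60_000       // delays of "up to 1 minute"
      ) {}

      private count(windowMs: number, now: number): number {
        return this.timestamps.filter(t => now - t <= windowMs).length;
      }

      // Call before each request: the busier the recent history, the longer
      // the wait, capped at maxDelayMs.
      async beforeRequest(): Promise<void> {
        const now = Date.now();
        this.timestamps = this.timestamps.filter(t => now - t <= this.longWindowMs);

        const shortLoad = this.count(this.shortWindowMs, now) / this.shortBudget;
        const longLoad = this.count(this.longWindowMs, now) / this.longBudget;
        const load = Math.min(1, Math.max(shortLoad, longLoad)); // 0 = idle, 1 = at budget

        const delay = Math.round(load * this.maxDelayMs);
        if (delay > 0) {
          await new Promise<void>(resolve => setTimeout(resolve, delay));
        }
        this.timestamps.push(Date.now());
      }
    }
    ```

    Each search request would then simply be preceded by `await limiter.beforeRequest()`.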
  • In the latest beta, Zotero will automatically wait 1 second between requests to a given domain (not including doi.org, but including domains that DOIs redirect to), and continue with other items from different domains in the meantime. It will also automatically back off if sites return certain error codes, including the "Too Many Requests" error code. If a given domain returns an error code for more than 5 requests in a row, Zotero will skip remaining items for that domain for the remainder of the run.

    We can adjust this further if we can identify throttling behaviors of specific sites, but since you can always retry retrieval, and Zotero will skip items that already have files, the main concern here is just not overloading servers and obeying backoff instructions, not making sure it always works on the first try.
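    A rough sketch of that per-domain behavior (not Zotero's actual implementation; the doi.org exemption and redirect handling are left out):

    ```typescript
    type Job = () => Promise<Response>;

    // One queue per domain: requests to the same domain are spaced out and
    // backed off, while different domains proceed independently.
    class DomainQueue {
      private delayMs = 1_000;       // base spacing between requests to one domain
      private consecutiveErrors = 0;
      private skipped = false;
      private chain: Promise<void> = Promise.resolve();

      enqueue(job: Job): Promise<Response | null> {
        const run = this.chain.then(async (): Promise<Response | null> => {
          if (this.skipped) return null; // this domain is skipped for the rest of the run
          await new Promise<void>(resolve => setTimeout(resolve, this.delayMs));
          try {
            const res = await job();
            if (res.status === 429 || res.status >= 500) {
              this.consecutiveErrors++;
              this.delayMs *= 2;                                   // back off
              if (this.consecutiveErrors > 5) this.skipped = true; // give up on this domain
              return null;
            }
            this.consecutiveErrors = 0;
            this.delayMs = 1_000;
            return res;
          } catch {
            this.consecutiveErrors++;
            if (this.consecutiveErrors > 5) this.skipped = true;
            return null;
          }
        });
        this.chain = run.then(() => undefined); // keep the queue going either way
        return run;
      }
    }

    const queues = new Map<string, DomainQueue>();

    function fetchThrottled(url: string): Promise<Response | null> {
      const domain = new URL(url).hostname;
      let queue = queues.get(domain);
      if (!queue) {
        queue = new DomainQueue();
        queues.set(domain, queue);
      }
      return queue.enqueue(() => fetch(url, { redirect: "follow" }));
    }
    ```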
  • This is now available in Zotero 5.0.56, along with a new progress window. Thanks for the help testing!
This discussion has been closed.