Find Available PDFs power use case; obstacles and ways forward?

I'm planning to use "Find Available PDFs" for thousands of entries. I know this is not the typical use case. This is for a large computer-aided lit review I'm doing, in which Zotero does some of the heavy lifting of managing bib entries and retrieving and storing the full text. I plan to connect to my university's VPN so that I can access as many paywalled PDFs as possible. Many of these items will not have PDFs available and will be culled afterward. (Also, I'm not using Sci-Hub for this project.)

I've tried to dig through the forums a bit to understand what difficulties may lie ahead. I know that a couple years ago Zotero started using Unpaywall, but I'm not sure what those implications are, if any. I did read in the Zotero blog that 'When you use "Add Item by Identifier" or "Find Available PDF", Zotero will load the page associated with the item’s DOI or URL and try to find a PDF to download before looking for OA copies. This will work if you have direct or VPN-based access to the PDF.' I'm not sure what my university VPN will do with that, but there's one way to find out.

I've found through trial and error that using the Find Available PDF option on only 300 entries seems to trigger some sort of temporary lock-out for whatever service Zotero is calling. Does Unpaywall block requests of this size? The progress window seems to hang indefinitely after doing only a few items in the 300 queue. Doing 5 at a time seems to work fine. This makes me wonder if there's some non-obvious issue with the Find Available PDF option and how it queues items when making that many requests. Doing it 5 at a time for thousands of papers is a lot of unproductive manual work. I'm fine with limiting my rate, if there's a way to do that. I can take a custom code approach, but I'd like some guidance on how and where to do that.
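To make the custom-code idea concrete, here's a minimal sketch of the batching approach I'd try (this is just an illustration, not anything Zotero provides: `handler` is a hypothetical stand-in for whatever actually triggers Find Available PDF on a batch, and the batch size and pause are guesses to be tuned):

```python
import time

def chunked(seq, size):
    """Split a sequence into consecutive batches of at most `size` items."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]

def process_in_batches(items, handler, batch_size=5, pause_seconds=60):
    """Call `handler` on one small batch at a time, pausing between
    batches so remote services see a slow, polite request rate."""
    for batch in chunked(items, batch_size):
        handler(batch)  # hypothetical: trigger Find Available PDF here
        time.sleep(pause_seconds)
```

The point is just that the rate limiting lives in one place, so the batch size and pause can be adjusted after each lock-out without touching anything else.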

What difficulties or possible issues should I be aware of?
What approach should I take for retrieving this many PDFs?
  • The freeze isn't expected -- what exactly do you see happening and how long have you let Zotero run?

    What I would expect is that several of the commercial resources you'll be querying this way will start locking you out of PDF downloads if you do this for large numbers of items at once (because you look like a bot downloading their PDFs) -- that shouldn't affect Unpaywall, though.
  • I've done a lot more testing now. I would like to be clear, though, that this post wasn't about slyly raising the possibility of issues with Find Available PDFs (shortened: Find PDF). My questions are quite genuinely focused on getting through thousands of items. I also don't want to abuse any arrangement Zotero has with another organization, like Unpaywall.

    But as to the hanging: it lasts 25+ minutes for a single entry. The trouble seems to be items with ProQuest URLs. (I wish I could sort by URL, but I can't list it as a column.) But the trouble doesn't quite stop there.

    Even stranger, if I cancel one and start the entry below it, both as single selections, sometimes the title of the previous item will appear in the processing window, as if I had selected them together (which I did not). For example, say I start Find PDF on an item with a ProQuest URL, then cancel it and run Find PDF on the item below it instead. That one won't progress either. But if I close and reopen Zotero and run Find PDF again, that second (non-ProQuest) item will progress.

    I should mention that I have a number of add-ons, although I don't know if any of them make any difference -- e.g., ZotFile, Zutilo, DOI Manager. I think all are fairly standard add-ons.

    And now, right as I preview this post, the items with ProQuest URLs are back to working for Find PDF!
  • I should have included an example URL from the item's entry. But as I mentioned, they're now working. It must have been a ProQuest issue, and Zotero wasn't able to move on, so to speak. https://search.proquest.com/openview/ca17c76cd80a0096a5cc137af7240730/1?pq-origsite=gscholar&cbl=2026366&diss=y
  • We’d want to see a Debug ID for reproducing this, with a few minutes of it hanging.
  • To bring the subject back to the point of it all, the question I raised wasn't a bug report, but instead a genuine request for guidance or advice. The feedback here from adamsmith is that using Find Available PDF for large numbers of entries will likely lock me out of some publisher sites if they're hit enough, but not Unpaywall. That's quite good to know. I'm a bit surprised. This may test Unpaywall's good graces.

    It sounds like the way forward is just trial and error. Somewhere between a rapid and a slow rate--running all items as one big batch versus running a few items every few minutes--there's a sweet spot where I'm unlikely to be locked out, but who knows where it is. And it probably varies by publisher. I will mention that no single publisher dominates the search results I've imported into Zotero, so maybe that randomness will help.

    I think Zotero has huge potential to be used in these kinds of reviews, and these types of reviews have huge potential for science itself, but I realize it's not standard use. I do very much appreciate that Zotero has the functionality to help me do this, because it would be a lot more work otherwise.

    Related to the temporary hanging issue, I just saw there was another thread yesterday mentioning ProQuest having issues. Good to know it wasn't just me. Re: dstillman's request, there's no way I can submit a Debug ID if reproducing the issue requires ProQuest to be having whatever problem it had. Hopefully, if a site has that kind of issue in the future and it affects someone trying Find Available PDFs, they'll find this post in a search and recognize the problem. The workaround is to cancel Find Available PDFs, restart Zotero, and then select only items whose URLs do not point to the site/publisher that is having the issue.
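    Since lock-outs and flaky sites are apparently the expected failure mode, the "sweet spot" hunting above could be automated with a retry-with-backoff wrapper. A minimal sketch, assuming the fetching is driven from an external script (`fetch` is a hypothetical callable returning None on failure; none of this is part of Zotero):

    ```python
    import random
    import time

    def backoff_delays(base=2.0, factor=2.0, cap=300.0, attempts=6):
        """Yield one delay (in seconds) per retry attempt: exponential
        growth, capped, with +/-50% jitter so retries don't land in
        lockstep against the same server."""
        delay = base
        for _ in range(attempts):
            yield min(delay, cap) * random.uniform(0.5, 1.5)
            delay *= factor

    def fetch_with_backoff(fetch, **kwargs):
        """Try `fetch()` repeatedly, sleeping a backoff delay after each
        failure. Returns the first non-None result, else None."""
        for delay in backoff_delays(**kwargs):
            result = fetch()
            if result is not None:
                return result
            time.sleep(delay)
        return None
    ```

    Items whose host is mid-outage would then just fail their few attempts and get skipped, instead of freezing the whole queue.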
  • To bring the subject back to the point of it all, the question I raised wasn't a bug report, but instead a genuine request for guidance or advice.
    Understood, but you were asking based on problems you shouldn't be experiencing, so the advice is to (help) get them fixed and continue working without those issues.
    It does indeed appear that the broken ProQuest import freezes the Find PDF function, but that would likely be a bug, too: translators will fail occasionally, and the tool should handle that "gracefully," i.e. by trying Unpaywall and then moving on to the next item.
  • Yes, as adamsmith says, it shouldn’t freeze here. We’ll try to reproduce this.

    (And another clarification: Unpaywall servers aren’t involved here. We don’t send any data to them. So there’s no “testing their good graces”.)
  • About a year ago I asked a similar question on the developer list, so I've been preparing to do this for a while. I'm finally getting back to this piece of the chain. The ProQuest issue really did just coincidentally pop up the day I was posting about it here. Call it bad luck. I didn't want to hit anyone's service or server hard if it might cause a problem. I figured if it might, someone would chime in with "don't do that," or "do it, but do it like this." I'm new to the realm of sending out a bunch of automated requests for things.

    I went and reread the blog post about Unpaywall: "if you save an item from a webpage where Zotero can’t find or access a PDF, Zotero will automatically search for an open-access PDF using data from Unpaywall." Very subtle: you're using their data, but that doesn't necessarily mean you're hitting their server. I didn't catch that the first time, but maybe you can see why I assumed you were. Great, the situation is even better than I hoped! Thanks for the clarification.
  • I'm guessing y'all won't be shocked to hear this, but I've gotten the hanging issue again. And it's not ProQuest. This will be a bug thread after all. Here's the ID: D120808672

    It was apparently trying to access this URL: https://stud.epsilon.slu.se/16926/

    When I go to that URL, I receive the page:
    EPrints System Error
    Error connecting to MySQL server: Too many connections. To fix this increase max_connections in my.cnf:

    [mysqld]
    max_connections=300

    Let me know if I can do anything further to help.