Automated way to run "Find Available PDFs"

yttuncel · January 20, 2022

Hi all,

I'm working on a survey paper, which involves running text-mining scripts on thousands of PDFs. For this I've done the following so far:

1) Dumped search results from IEEE Xplore, Scopus and Web of Science to a CSV file using their own APIs (their web interfaces didn't allow me to download the metadata of all the search results, so I had to use the APIs).

2) Converted the CSV to .bib and imported the article metadata into Zotero. I'm able to import the most important fields for my use-case (title, authors, year, doi, url).

3) I used Zotero to detect and merge all duplicates. I am left with 20000+ papers after this step.
For this step I used the js script by marcelparciak at the following link: https://forums.zotero.org/discussion/comment/347581/#Comment_347581

Now I'm trying to obtain the PDFs of these papers. Doing Select All + Find Available PDFs skips most of the papers for some reason. I've also installed the zotero-scihub add-on, since Scopus sometimes messes up the URL resolver and Zotero is then unable to fetch the PDF even though my institution has access to the full text.

Question: Can someone help me write a JS script that would call the "Find Available PDFs" function from the right-click menu on items one by one, similar to the script in step 3 above? Ideally, for failed items it would then call the "Update Scihub PDF" function in the same menu.

I'd appreciate any input about this!

djross3 · January 20, 2022

(Possibly an easier workflow overall would simply be to save DOIs for all of the references you'd like, then, after easily removing all duplicate DOIs, automatically import all of the references into Zotero via DOI, which should give you better metadata and save the PDFs too. That is, assuming these are generally journal articles within the last 10-20 years that have DOIs. I'm not sure how easy it would be to replace your current workflow with this one, but it might be worth considering.)

yttuncel · January 20, 2022

Hi @djross3, thank you for the answer.
That change is actually pretty easy: I dumped everything into a .csv and removed the duplicates in Excel (along with some additional filtering).
I am now left with 33067 entries all with DOIs (all papers are published after 2011).
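
For anyone who would rather script the deduplication than do it in Excel, here is a minimal sketch. It assumes the CSV has a column named `doi` (a hypothetical column name) and that the same DOI may show up in different forms: bare, `doi:`-prefixed, or as a doi.org URL. DOIs are case-insensitive, so they are lower-cased before comparison.

```python
import csv
import io

# Prefixes that commonly wrap a DOI; stripped so equal DOIs compare equal.
DOI_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(raw):
    """Return a lower-cased DOI with any URL or 'doi:' prefix removed."""
    doi = raw.strip().lower()
    for prefix in DOI_PREFIXES:
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
            break
    return doi.strip()


def unique_dois(rows, column="doi"):
    """Return normalized DOIs in first-seen order, skipping blanks and repeats."""
    seen = set()
    result = []
    for row in rows:
        doi = normalize_doi(row.get(column) or "")
        if doi and doi not in seen:
            seen.add(doi)
            result.append(doi)
    return result


# Made-up sample data; a real run would open the exported CSV file instead.
sample = io.StringIO(
    "title,doi\n"
    "A,10.1000/ABC\n"
    "B,https://doi.org/10.1000/abc\n"
    "C,doi:10.1000/xyz\n"
    "D,\n"
)
print(unique_dois(csv.DictReader(sample)))
```

Rows A and B collapse into one entry here, and row D is dropped because it has no DOI.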

You mentioned importing into Zotero via DOI. How do I go about this? Using the magic wand tool? I doubt it can handle 33k items at once. Or is there a programmatic way of doing this?

djross3 · January 20, 2022

I'm not sure about the best way to do that automatically, although the magic wand can accept batches, so maybe try a dozen or one hundred at a time, and you'd get there eventually. (There probably is some way to do this automatically, but that's not something I've tried.)

Regarding the speed, however, note that any way you do this you are likely to run into access limits on websites, so slowing it down to a few dozen or a hundred at a time might be a good idea anyway.
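
Splitting the DOI list into batches like this is easy to script. The sketch below is only an illustration: the DOIs are generated placeholders, `make_batches` is a hypothetical helper, and it assumes each batch would be pasted into the magic wand dialog with one identifier per line.

```python
def make_batches(dois, size=100):
    """Yield consecutive slices of at most `size` DOIs."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(dois), size):
        yield dois[start:start + size]


# Placeholder DOIs for illustration only; real ones would come from the CSV.
dois = [f"10.1000/example.{n}" for n in range(33067)]
groups = list(make_batches(dois, size=100))

print(len(groups))               # number of batches
print(len(groups[-1]))           # size of the final, partial batch
print("\n".join(groups[0][:3]))  # first few lines of a paste-ready batch
```

At 100 per batch, 33067 DOIs come to 331 batches, so a smaller or larger batch size is a trade-off between manual effort and the risk of hitting access limits.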

(Note that the main reason I suggested this is that it's the best way to get good metadata for each entry, rather than trying to import it piece by piece from elsewhere. And it may also solve your PDF question by doing it that way.)

yttuncel · January 20, 2022

Yeah doing it in batches makes sense, I'll try to find a way to do it programmatically, perhaps through the use of pyzotero.

I think the metadata that's important for the "Find PDF" function is the DOI and nothing else. In any case I'm importing the DOIs of the articles correctly, so doing it this way versus importing via .bib won't make a big change. I'll still try to do it this way for the sake of more complete metadata as you pointed out.
Doing the importing in batches I think will help my case, but my original question is still outstanding. If anybody can help me do this (or my original question) through a script, I'd appreciate it a lot.

dstillman · January 20, 2022

Just to clarify, Find Available PDF will only work if you have direct or VPN-based access to the PDF, since Zotero doesn't have access to any web-based proxy you use. If you use a web-based proxy, only open-access PDFs will be automatically retrieved via that feature (but you can save items with gated PDFs from the browser using the Zotero Connector).

yttuncel · January 20, 2022

@dstillman The PC Zotero is running on is in my institution's network. Doesn't that mean I have access to the PDFs?

Is there a way of triggering this function through the js API? Perhaps I should use pyzotero to add 30k items one by one.
Also, do you have any suggestions/pointers about the general flow I'm following?

adamsmith · January 20, 2022

pyzotero won't help you here since it uses the web API which doesn't have that functionality.
If the PC is in the network then yes, you should have access, but this isn't 100% (Zotero just follows the DOI and checks for access, but you may have access through an aggregator like EBSCO or ProQuest, not the publisher in some cases).

No specific thoughts on writing a plugin -- technically feasible, but seems overkill given that the functionality can be triggered in batch in various ways.

yttuncel · January 20, 2022

Yes, that's what I've figured out in the past 2 hours, pyzotero is not useful for my case.
Could you please tell me more about those various ways? I can't seem to think of any that would allow me to batch import (say 50 items), wait for the Find PDF to do its thing, run the other plugin (zotero-scihub) for those that failed to find a PDF, and repeat for the next batch.

Maybe I should sketch up a Python script to download the PDFs (using DOIs or article URLs) and then import them into Zotero via pyzotero?

dstillman · January 21, 2022

If you think Find Available PDFs isn't working correctly, we'd want to see a Debug ID for an attempt that's not working.

Beyond that I'm not really understanding what you're asking here, given that Zotero already allows you to run Find Available PDFs on any number of items.

adamsmith · January 21, 2022

I think the point is that making 30k requests for PDFs to presumably a small number of campus resources in short order is almost certainly going to get you locked out of most of them, so you'd want to stagger this.
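
Staggering of this kind can be sketched generically, outside Zotero. In the example below, `run_staggered` and `handle_batch` are hypothetical names: `handle_batch` stands in for whatever processes one batch, and the pause length is an arbitrary assumption rather than a known safe rate for any publisher.

```python
import time


def run_staggered(batches, handle_batch, pause_seconds=60.0, sleep=time.sleep):
    """Process batches in order, pausing between consecutive batches.

    `sleep` is injectable so the loop can be exercised without real waiting.
    Returns the total number of items handed to `handle_batch`.
    """
    processed = 0
    for index, batch in enumerate(batches):
        if index > 0:
            sleep(pause_seconds)  # no pause before the first batch
        handle_batch(batch)
        processed += len(batch)
    return processed


# Tiny demo with a very short pause; a real run would use a much longer one.
handled = []
total = run_staggered([["a", "b"], ["c"]], handled.append, pause_seconds=0.01)
print(total, handled)
```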

yttuncel · January 21, 2022

@dstillman Debug ID: D646255144

This is one such entry where I have access to the full text through my institution, but Zotero fails to fetch the PDF. I can see in the debug log that it goes to the correct PDF address (ScienceDirect), which I think is scraped from the Scopus link in the item's URL field, but it fails to download the PDF for some reason.

This happens for a lot of papers in the library.

dstillman · January 23, 2022

@yttuncel: You'd likely have more luck with ScienceDirect URLs in the Zotero beta. Elsevier recently started doing some annoying anti-bot stuff that required some changes on our end, and those changes aren't available in 5.0.96.3.

agoldenvein · June 3, 2023

Is there a way to adjust the frequency/delay for fetching available PDFs? Running this on many PDFs tends to fail after a little while. I'm guessing as you stated it's based on the rate-limit of whatever service you're calling. Is there a way for the user to manually adjust the rate so we could maybe get it to the point where we don't have to go back and re-run the find available PDFs function all the time?
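
Nothing in this thread confirms that Zotero exposes such a setting. As a generic illustration of the idea in a standalone script, the sketch below retries a failing operation with a growing delay (exponential backoff). `retry_with_backoff` and `attempt` are hypothetical names, and the delay values are arbitrary assumptions.

```python
import time


def retry_with_backoff(attempt, max_tries=5, base_delay=1.0, factor=2.0,
                       sleep=time.sleep):
    """Call `attempt()` until it returns a truthy value or tries run out.

    The wait between tries starts at `base_delay` seconds and is multiplied
    by `factor` after each failure. Returns the truthy result, or None.
    """
    delay = base_delay
    for try_number in range(1, max_tries + 1):
        result = attempt()
        if result:
            return result
        if try_number < max_tries:
            sleep(delay)  # wait longer after every failed try
            delay *= factor
    return None


# Demo: a stand-in operation that fails twice and then succeeds.
outcomes = iter([None, None, "ok"])
waits = []
print(retry_with_backoff(lambda: next(outcomes), sleep=waits.append), waits)
```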