Automated way to run "Find Available PDFs"
Hi all,
I'm working on a survey paper that involves running text-mining scripts on thousands of PDFs. Here's what I've done so far:
1) Dumped search results from IEEE Xplore, Scopus, and Web of Science to a CSV file using their APIs (their web interfaces didn't allow me to download the metadata of all the search results, so I had to use the APIs).
2) Converted the CSV to .bib and imported the article metadata into Zotero. I'm able to import the most important fields for my use case (title, authors, year, DOI, URL).
3) Used Zotero to detect and merge all duplicates; I'm left with 20,000+ papers after this step.
For this step I used the JS script by marcelparciak at the following link: https://forums.zotero.org/discussion/comment/347581/#Comment_347581
Now I'm trying to obtain the PDFs of these papers. Doing Select All + Find Available PDFs skips most of the papers for some reason. I've also installed the zotero-scihub add-on, since Scopus sometimes mangles the URL resolver and Zotero is then unable to fetch the PDF even though my institution has access to the full text.
Question: Can someone help me write a JS script that would call the "Find Available PDFs" function from the right-click menu one item at a time, similar to the script in step 3 above? Ideally, for failed items it would then call the "Update Sci-Hub PDF" function from the same menu.
I'd appreciate any input about this!
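For reference, here is the kind of sketch I have in mind, to be pasted into Tools → Developer → Run JavaScript with "Run as async function" checked. This is untested at this scale: `Zotero.Attachments.addAvailablePDF()` is the internal function behind "Find Available PDFs" in current Zotero versions, but the delay value and the failure handling are my own guesses.

```javascript
// Sketch for Zotero's Tools → Developer → Run JavaScript window.
// Assumption: Zotero.Attachments.addAvailablePDF() is the internal
// function behind "Find Available PDFs"; the 2 s delay and the error
// handling are guesses, not tested on a 20k-item library.
var items = Zotero.getActiveZoteroPane().getSelectedItems();
var failed = [];
for (let item of items) {
    if (!item.isRegularItem()) continue;           // skip notes/attachments
    if (await item.getBestAttachment()) continue;  // already has a file
    try {
        let attachment = await Zotero.Attachments.addAvailablePDF(item);
        if (!attachment) failed.push(item.key);
    } catch (e) {
        failed.push(item.key);
    }
    await Zotero.Promise.delay(2000);  // throttle to avoid publisher rate limits
}
return `No PDF found for ${failed.length} of ${items.length} items: ${failed.join(', ')}`;
```

The collected keys could then be used to select the failed items and run "Update Sci-Hub PDF" from the right-click menu manually, since I don't know whether zotero-scihub exposes a callable function.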
That change is actually pretty easy: I dumped everything into a .csv, and in Excel I removed the duplicates (along with some additional filtering).
I am now left with 33,067 entries, all with DOIs (all papers were published after 2011).
You mentioned importing into Zotero via DOI. How do I go about this? Using the magic wand tool? I doubt it can handle 33k items at once. Or is there a programmatic way of doing this?
Regarding speed, however, note that any way you do this you are likely to run into access limits on websites, so slowing it down to a few dozen or a hundred at a time might be a good idea anyway.
(Note that the main reason I suggested this is that it's the best way to get good metadata for each entry, rather than trying to import it piece by piece from elsewhere. It may also solve your PDF question along the way.)
I think the only metadata that matters for the "Find PDF" function is the DOI. In any case I'm importing the DOIs of the articles correctly, so doing it this way versus importing via .bib won't make a big difference. I'll still try it this way for the sake of more complete metadata, as you pointed out.
I think importing in batches will help my case, but my original question is still outstanding. If anybody can help me do this (or my original question) through a script, I'd appreciate it a lot.
Is there a way of triggering this function through the JS API? Perhaps I should use pyzotero to add the 30k items one by one.
Also, do you have any suggestions/pointers about the general flow I'm following?
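On the pyzotero idea: rather than adding items one by one, the Zotero web API accepts up to 50 items per write request (pyzotero's `create_items` has the same cap), so batching the upload is worth it. A minimal sketch of just the batching arithmetic; the DOI strings here are placeholders:

```javascript
// Split an array of item payloads into batches of 50, the Zotero web
// API's limit for a single write request. The payloads are placeholders.
function chunk(items, size = 50) {
    const batches = [];
    for (let i = 0; i < items.length; i += size) {
        batches.push(items.slice(i, i + size));
    }
    return batches;
}

// e.g. 33,067 DOIs → 662 write requests
const dois = Array.from({ length: 33067 }, (_, i) => `10.0000/example.${i}`);
console.log(chunk(dois).length); // 662
```

Each batch could then be passed to a single `create_items` call (or raw HTTP POST), with a pause between batches to respect rate limits.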
If the PC is in the network then yes, you should have access, but this isn't 100% (Zotero just follows the DOI and checks for access, but you may have access through an aggregator like EBSCO or ProQuest, not the publisher in some cases).
No specific thoughts on writing a plugin -- technically feasible, but seems overkill given that the functionality can be triggered in batch in various ways.
Could you please tell me more about those various ways? I can't think of any that would let me import a batch (say 50 items), wait for Find PDF to do its thing, run the other plugin (zotero-scihub) for those that failed to find a PDF, and repeat for the next batch.
Maybe I should write a Python script to download the PDFs (using DOIs or article URLs) and import them into Zotero afterwards via pyzotero?
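If you do go the external-script route, one source worth trying before Sci-Hub is the Unpaywall REST API, which maps a DOI to a legal open-access PDF where one exists. A minimal sketch for Node 18+ (for the global `fetch`); the endpoint and the `best_oa_location` field are from Unpaywall's documentation, but the error handling is my own guess, and the same logic ports directly to Python with `requests`:

```javascript
// Sketch: look up an open-access PDF for a DOI via the Unpaywall REST
// API (https://unpaywall.org/products/api). The API is free but
// requires an email address as a query parameter.
function unpaywallUrl(doi, email) {
    return `https://api.unpaywall.org/v2/${doi}?email=${email}`;
}

// Requires Node 18+ for the global fetch().
async function findOpenAccessPdf(doi, email) {
    const resp = await fetch(unpaywallUrl(doi, email));
    if (!resp.ok) return null;          // unknown DOI, rate limit, etc.
    const data = await resp.json();
    const loc = data.best_oa_location;  // null when no OA copy is known
    return loc ? loc.url_for_pdf : null;
}
```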
Beyond that I'm not really understanding what you're asking here, given that Zotero already allows you to run Find Available PDFs on any number of items.
This is one such entry where I have access to the full text through my institution, but Zotero fails to fetch the PDF. I can see in the debug log that it goes to the correct PDF address (ScienceDirect), which I think is scraped from the Scopus link in the item's URL field, but it fails to download the PDF for some reason.
This happens for a lot of papers in the library.