Available for beta testing: improved PDF retrieval with Unpaywall integration

dstillman · August 16, 2018

The latest Zotero beta includes new functionality to help you find PDFs for items in your Zotero library.

While Zotero has always been able to save PDFs automatically as you save items from the web, that's only been possible when saving from the browser, and generally only when the PDF is available and accessible on the page you're saving from.

In the latest beta, if you save an item from a page where Zotero can't find or access a PDF, Zotero can now automatically search for an open-access PDF using data from Unpaywall and attach the PDF to your item. It can do the same when you create an item with "Add Item by Identifier", and a new "Find Available PDF" option in the item context menu lets you retrieve PDFs for existing items in your library. We run our own lookup service for these searches with no logging of the contents of requests.

When you use "Add Item by Identifier" or "Find Available PDF", Zotero will also load the page associated with the item's DOI or URL and try to find a PDF to download from there before looking for OA copies. This will work if you have direct or VPN-based institutional access to the PDF. (For web-based proxies, only open-access PDFs will be automatically retrieved using this new functionality. You can of course continue to save items with gated PDFs from the browser using the Zotero Connector.) Zotero won't currently check the DOI or URL page when saving from the browser, since loading it would result in additional requests and data leakage (to at least the DOI resolver) for many items that you save, and it would only be useful if 1) you weren't already on that page, 2) it wasn't already in Unpaywall, and 3) the PDF was OA or you had direct access.

For best results, you should try this out with the beta version of the Zotero Connector for Firefox available from the same page. (It will still basically work when saving from the current Zotero Connector for Chrome or Safari, but the save popup may not fully reflect what's happening.)

If there are other sources of PDFs you'd like Zotero to use, you can also set up custom PDF resolvers.
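
As a rough illustration, the sketch below builds a JSON value of the kind stored in the `extensions.zotero.findPDFs.resolvers` preference discussed later in this thread. The field names (`name`, `method`, `url`, `mode`, `selector`, `attribute`) are assumptions based on one reading of the custom resolver documentation and should be checked against it. The `#pdf-link` selector is hypothetical, and `https://example.com/{doi}` is a placeholder address, not a real PDF host.

```python
import json

# Hypothetical resolver entry. Field names are assumptions to verify
# against the custom PDF resolver documentation.
resolver = {
    "name": "My Custom Resolver",
    "method": "GET",
    "url": "https://example.com/{doi}",  # {doi} is a placeholder for the item's DOI
    "mode": "html",                      # assumed: look for the PDF link in an HTML page
    "selector": "#pdf-link",             # hypothetical CSS selector for the link element
    "attribute": "href",                 # attribute assumed to hold the PDF URL
}

# The preference is assumed to hold a JSON array of resolver objects.
pref_value = json.dumps([resolver])
print(pref_value)
```

The resolver only works if the target site actually serves a PDF at the resolved address, so a placeholder host such as example.com will never return one.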

bjohas · August 17, 2018

Excellent! I’ve wanted this feature for a long time!

I assume this would also work without an existing DOI, i.e., based on author, title, and year only?

Also, if the request results in a PDF that has a DOI, is there a way this could be added to the metadata? (In the same way as it would be if you had dropped the PDF into Zotero.)

adamsmith · August 17, 2018

Thanks Dan. That makes sense. How are those sources prioritized? I love unpaywall, but I generally have a preference for the "authoritative" copy where I have access, e.g. because author-copies may not have the same pagination.

I'm not sure if this is a universal preference (I'd suspect it would be).

bjohas · August 17, 2018

@adamsmith - yes, I agree. Official copy where possible would be helpful.

djross3 · August 17, 2018

@adamsmith, this was also my concern when I saw this feature being added. It might not matter so much for someone who typically uses a style that usually doesn't require citing page numbers (just main arguments from the paper), but I actually find myself spending a lot of my time specifically trying to figure out whether the paper I have actually has the final pagination or not. In some cases it's very clear (unformatted manuscripts), but not other times, especially when the final article from the publisher is very hard to track down (e.g., working papers and their website is no longer online!), and it's hard to know if the proof the author has uploaded is actually final or not.

So, in short, having some kind of warning that this is an unofficial copy would be relevant. An unofficial copy is better than nothing, but possibly confusing. Unaware users (students, for example) might not know to watch out for this situation.

Having this as optional, either as a Zotero preference, or as a case-by-case "do you want to save this author draft version?" dialog box, could be a useful part of this feature.

I suspect that usage of this feature will correlate some with fields of study, for example more in fields where arxiv.org is a normal place to get current papers, and less in fields where citing (published) page numbers is crucial.

DWL-SDCA · August 17, 2018

FWIW if the version available on arxiv or ssrn isn't a reprint of the final published version, my preference is to list the archive number and not pagination.

djross3 · August 17, 2018

Agreed. And that's why it's important to differentiate the type of PDF being retrieved when adding it to the library!

dstillman · August 18, 2018

@bjohas:

I assume this would also work without an existing DOI, i.e., based on author, title, and year only?

Are you asking if it works without a DOI currently or if it could? Currently the OA lookup requires a DOI, and I suspect that won't change. (The manual actions will use the URL field too, as I explain above.) Every record in Unpaywall has a DOI, and using that as a key allows us to make this very fast. I think the proper place to handle items without a DOI would be when updating metadata for existing items, after which a DOI would hopefully be available for PDF retrieval. We're planning to implement that feature soon. (The initial version of that will probably use identifiers to start, but we'll hopefully be able to support identifier-less items later.)

@adamsmith:

How are those sources prioritized? I love unpaywall, but I generally have a preference for the "authoritative" copy where I have access, e.g. because author-copies may not have the same pagination.

Unpaywall sources are ordered 'publishedVersion', 'acceptedVersion', 'submittedVersion', and we try them in that order. We're planning to show the version in the UI somewhere — we had been thinking in a new field in the right-hand pane, but maybe it'd be better to just name the attachment item something like "Full-Text PDF", "Accepted Version (PDF)", and "Submitted Version (PDF)". (Right now it names the title based on the parent metadata, the same as the filename, but that's sort of pointless.)
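
A minimal sketch of that ordering, assuming each candidate is a plain dictionary with a `version` key. The `pdf_url` key, the example URLs, and the choice to put unknown versions last are illustrative assumptions, not a description of Zotero's actual code.

```python
# Preference order described above: published, then accepted, then submitted.
VERSION_ORDER = ["publishedVersion", "acceptedVersion", "submittedVersion"]

def order_locations(locations):
    """Return locations sorted by version preference; unknown versions go last."""
    def rank(location):
        version = location.get("version")
        if version in VERSION_ORDER:
            return VERSION_ORDER.index(version)
        return len(VERSION_ORDER)
    # sorted() is stable, so entries with the same version keep their input order.
    return sorted(locations, key=rank)

# Hypothetical records for illustration only.
locations = [
    {"version": "submittedVersion", "pdf_url": "https://example.org/preprint.pdf"},
    {"version": "publishedVersion", "pdf_url": "https://example.org/final.pdf"},
    {"version": "acceptedVersion", "pdf_url": "https://example.org/accepted.pdf"},
]

for location in order_locations(locations):
    print(location["version"], location["pdf_url"])
```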

gutzhr · August 18, 2018

How to set up the custom PDF resolvers?

dstillman · August 18, 2018

I linked to instructions above.

adamsmith · August 18, 2018

I didn't know this about unpaywall -- that's neat. I (and I think Bjoern) were also referring to this part, though:

When you use "Add Item by Identifier" or "Find Available PDF", Zotero will also load the page associated with the item's DOI or URL and try to find a PDF to download from there.

I understand this to use the existing translators, or is that incorrect?
If so, could there be a setting to try this first, then go to unpaywall (or is that already the case)?
And I think Bjoern's question about URLs was also referring to this part, which presumably would/could work with just a URL.

gutzhr · August 18, 2018

Could you give an example of how to set up the custom PDF resolvers? It does not work correctly after I change the resolvers string. I think I am doing something wrong.
I also did not know how to find the "Add Item by Identifier" button in Zotero.

adamsmith · August 18, 2018

Could you post the JSON you're trying here (between html <code> tags) and describe what you're trying to do and what's not working? There are examples in the documentation already.

gutzhr · August 18, 2018

I tried the following steps:
1. Look up "extensions.zotero.findPDFs.resolvers"
2. Edit the value
3. Paste the JSON code into "extensions.zotero.findPDFs.resolvers"

adamsmith · August 18, 2018

OK, but what's the JSON code that you're using and what's happening/not happening?

gutzhr · August 18, 2018

I cannot download the PDF file from the webpage that is defined in the JSON code, such as "https://example.com/{doi}".

adamsmith · August 18, 2018

Right, but that's because example.com doesn't host PDFs.

Could you maybe take a step back and explain what you're trying to do _specifically_? Why do you want to add a PDF resolver and what site do you want to add?

gutzhr · August 18, 2018

I have a webpage that hosts PDFs, of course. I can download PDFs from the webpage, but I cannot download them through Zotero after defining the page in the JSON code. I think the aim of the PDF resolver is to download PDFs automatically.

adamsmith · August 18, 2018

I think you're misunderstanding what the resolver is supposed to do, but unless you're willing to share _specifics_ (starting with which webpage) of what you're trying to do, we really can't help further.

bwiernik · August 18, 2018

Right now it names the title based on the parent metadata, the same as the filename, but that's sort of pointless.

It is really useful to have PDF filename be based on the metadata to facilitate emailing and sharing with colleagues. If the Zotero item title is used to describe the nature of the PDF, I would hope that the filename would still be based on item metadata.

The rename dialog for attachments could be improved, I think, by having separate fields for item title and filename, with the checkbox making the two fields the same.

dstillman · August 18, 2018

@adamsmith:

I (and I think Bjoern) were also referring to this part, though:

"When you use "Add Item by Identifier" or "Find Available PDF", Zotero will also load the page associated with the item's DOI or URL and try to find a PDF to download from there."

I understand this to use the existing translators, or is that incorrect?

Yes, DOI/URL uses existing translators.

If so, could there be a setting to try this first, then go to unpaywall (or is that already the case)?

Yes, sorry, it tries DOI, followed by URL, followed by OA (avoiding repeats). I've clarified the order above.
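
The order can be sketched as follows. This is only an illustration: the function name, the tuple format, and the use of exact string comparison to detect repeats are assumptions, and the DOI and URLs in the usage example are made up.

```python
def lookup_order(doi=None, url=None, oa_urls=()):
    """Build candidate pages in the order DOI, URL, then OA, skipping repeats."""
    candidates = []
    if doi:
        # doi.org is the public DOI resolver.
        candidates.append(("doi", "https://doi.org/" + doi))
    if url:
        candidates.append(("url", url))
    candidates.extend(("oa", oa_url) for oa_url in oa_urls)

    seen = set()
    ordered = []
    for source, target in candidates:
        if target in seen:
            continue  # already tried via an earlier source
        seen.add(target)
        ordered.append((source, target))
    return ordered

# The URL field repeats the DOI page here, so it is tried only once.
steps = lookup_order(
    doi="10.1234/abcd",
    url="https://doi.org/10.1234/abcd",
    oa_urls=["https://example.org/oa.pdf"],
)
for source, target in steps:
    print(source, target)
```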

@gutzhr:

I have a webpage that hosts PDFs, of course. I can download PDFs from the webpage, but I cannot download them through Zotero after defining the page in the JSON code. I think the aim of the PDF resolver is to download PDFs automatically.

You can use Help → Debug Output Logging → View Output to see what it's doing. A couple things to note:

1) For both Unpaywall and custom resolvers, Zotero forces the PDF and page URLs to HTTPS, even if the URLs returned from the resolver are HTTP.

2) If the PDF download results in an HTML page (such as a login page or captcha), it will currently fail. I was planning to add an option for custom resolvers to pop up a browser window that lets you log in as necessary when that happens, but I'd like to find a way to avoid showing the Mozilla open/save dialog for the PDF, as would happen now if we did that.
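
Both notes can be illustrated with a short sketch. The helper names are hypothetical. Checking for the `%PDF-` signature at the start of the file is one common way to tell a PDF from an HTML page, and is an assumption here rather than a description of how Zotero decides.

```python
from urllib.parse import urlsplit, urlunsplit

def force_https(url):
    """Rewrite an http:// URL to https://, leaving other schemes untouched."""
    parts = urlsplit(url)
    if parts.scheme == "http":
        parts = parts._replace(scheme="https")
    return urlunsplit(parts)

def looks_like_pdf(first_bytes):
    """Assumed check: PDF files begin with the '%PDF-' signature."""
    return first_bytes.startswith(b"%PDF-")

print(force_https("http://example.com/paper.pdf"))
print(looks_like_pdf(b"%PDF-1.7\n"))            # a real PDF header
print(looks_like_pdf(b"<!DOCTYPE html><html>"))  # e.g. a login page or captcha
```

A resolver that only serves its files over plain HTTP would therefore fail even when the URL it returns is otherwise correct.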

@bwiernik:

It is really useful to have PDF filename be based on the metadata to facilitate emailing and sharing with colleagues. If the Zotero item title is used to describe the nature of the PDF, I would hope that the filename would still be based on item metadata.

Yes, for sure. I just meant it was pointless having the attachment title also be the filename. (For translators we already show "Full Text PDF" or similar in most cases while renaming the file. This would just be doing the same thing for retrieved PDFs.)

The rename dialog for attachments could be improved, I think, by having separate fields for item title and filename, with the checkbox making the two fields the same.

Issue created.

zuphilip · August 20, 2018

Excellent feature to come next! Thank you.

The messages during the PDF retrieval are in English only, although I run Zotero in German. Can these strings be localized as well? I am happy to translate any missing strings once they are in Transifex.
- Searching for available PDFs...
- Checking 1 item
- Checking X items
- No PDF found
- 1 PDF added
- X PDFs added

Moreover, when selecting multiple items I see the wrong strings "Checking (null) items" and "(null) PDFs added".

sdspieg · August 20, 2018

Wow. This sounds fantastic! Thanks to the whole team for yet another great functionality. I think/hope this will also increase the appetite for the text mining functionality (that we used to have with the papermachine plugin, and that Olga Scrivner is still working on too).

dstillman · August 21, 2018

@zuphilip:

I am happy to translate any missing strings when they are in transifex.

Available now.

when selecting multiple items I see the wrong strings "Checking (null) items" and "(null) PDFs added"

Fixed, thanks.

bjohas · August 24, 2018

I'm just trying this - I'm on the latest Zotero beta, but cannot see the option in the item context menu (assuming that this is a right / ctrl-click) on the selected item in the middle pane?

adamsmith · August 24, 2018

It appears in the context menu for items that
a) don't already have a PDF and
b) have a URL or DOI

dstillman · August 24, 2018

(And that's specifically for single items. If you select more than one item, it always shows the option rather than checking fields for each item (but it won't make any requests if there are no DOIs or URLs).)
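
Put together, the visibility rule described in the last two posts might look like the sketch below. Items are modeled as plain dictionaries, and the `has_pdf`, `doi`, and `url` keys are hypothetical names chosen for illustration.

```python
def show_find_pdf_option(selected_items):
    """Decide whether to show "Find Available PDF" for the current selection."""
    if not selected_items:
        return False
    if len(selected_items) > 1:
        # Multiple items: always show the option without checking each item.
        return True
    item = selected_items[0]
    has_pdf = bool(item.get("has_pdf"))
    has_identifier = bool(item.get("doi") or item.get("url"))
    # Single item: needs a DOI or URL and must not already have a PDF.
    return not has_pdf and has_identifier

print(show_find_pdf_option([{"doi": "10.1234/abcd"}]))
print(show_find_pdf_option([{"doi": "10.1234/abcd", "has_pdf": True}]))
print(show_find_pdf_option([{"title": "No identifiers"}, {"title": "Also none"}]))
```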

bjohas · August 25, 2018

AHA! Many thanks!

dstillman · August 27, 2018

In the latest beta, the PDF attachment title will be set to "Full Text", "Accepted Version", or "Submitted Version" based on the version information from Unpaywall.
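
A minimal sketch of that mapping, assuming the three titles correspond to the Unpaywall version names in the order given earlier in the thread. The function name and the `None` result for an unrecognized version are assumptions.

```python
# Assumed correspondence between Unpaywall version names and attachment titles.
ATTACHMENT_TITLES = {
    "publishedVersion": "Full Text",
    "acceptedVersion": "Accepted Version",
    "submittedVersion": "Submitted Version",
}

def attachment_title(version):
    """Return the attachment title for a version, or None if it is not recognized."""
    return ATTACHMENT_TITLES.get(version)

for version in ("publishedVersion", "acceptedVersion", "submittedVersion"):
    print(version, "->", attachment_title(version))
```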

bwiernik · August 27, 2018

Cool!

sdspieg · September 1, 2018

But so how come I don't see this option in the context menu? I have loads of items (e.g. from CrossRef through Publish or Perish) that have an (often short) DOI and/or a URL but no PDF. But it still doesn't give me the option, whether I select one item or many...
UPDATE: I wasn't running the beta. Now I do see it.