Feature Request: Use OpenURL resolver for Find PDF

adamsmith · October 10, 2022

I just used Find Available PDF on a larger scale for the first time (importing a large number of items via DOI from OpenAlex, then retrieving the PDFs).
I made sure to run the Find PDF command on campus and it did reasonably well, finding 44/100 of fairly obscure PDFs, but I noticed that it didn't find a number of PDFs for which our library actually does provide access.
In almost all cases, this was when we get access not through the publisher (i.e. the target of the DOI) but via a third-party database like JSTOR. Library Lookup identified these reliably (and Zotero was able to get the PDF from the linked database in most cases).

So: Would it be possible to take advantage of library resolvers when running Find Available PDF?

It would require some extra work to follow the lookup links, but since there aren't actually that many different resolvers, that seems doable. Thoughts?
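For reference, a minimal sketch of what a DOI lookup link to a resolver could look like, assuming the resolver accepts OpenURL 1.0 (Z39.88-2004) key/value queries. The function name and the resolver base URL are hypothetical, and real resolvers may expect additional parameters:

```python
from urllib.parse import urlencode


def build_openurl(resolver_base: str, doi: str) -> str:
    """Return an OpenURL 1.0 lookup link for a DOI (illustrative only)."""
    params = {
        "url_ver": "Z39.88-2004",
        "ctx_ver": "Z39.88-2004",
        # The DOI is passed as an identifier of the referent.
        "rft_id": f"info:doi/{doi}",
    }
    # Some resolver base URLs already carry a query string.
    separator = "&" if "?" in resolver_base else "?"
    return resolver_base + separator + urlencode(params)


if __name__ == "__main__":
    # Hypothetical resolver base URL, not a real endpoint.
    print(build_openurl("https://resolver.example.edu/openurl",
                        "10.1017/S1537592720004703"))
```

The extra work mentioned above starts after this step: the page returned for such a link still has to be read to find the full-text link.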

dstillman · October 10, 2022

And when you're on campus the whole chain of redirects and linked pages works based on your IP address without any sort of cookie-based authentication or other interaction?

E.g., the EBSCOhost resolver I have access to is behind an EZproxy login link that requires web-based authentication. I don't know what happens if that link is loaded on campus. Even if EZproxy generally does work automatically on campus, there are almost certainly some resolvers that are behind authentication, so I assume we wouldn't be able to support all resolvers.

If we did do this, it seems like we'd need some translator-like files that could extract the lookup links for individual resolvers, and some community involvement in creating and maintaining those.

I'd also worry that they would break after site changes without any real visibility. Some could be automatically tested, but not ones that didn't work without authentication off campus.

And, finally, getting the rate-limiting right might be tricky. We take some steps in Find Available PDF to throttle requests to individual domains and I believe stop trying a given domain on repeated failure, but here we'd need to send all requests (that failed a publisher lookup, at least) to the resolver, and we'd want to avoid sending too many too fast and also avoid sending hundreds or thousands of requests if it's just not working (e.g., because you're off campus).
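To make the rate-limiting concern concrete, here is a rough sketch of a gate that spaces out requests to one resolver and gives up after a run of consecutive failures. This is not how Find Available PDF is implemented; `ResolverGate`, `lookup_all`, and the `fetch` callable are hypothetical names, and the interval and failure limit are arbitrary:

```python
import time
from typing import Callable, Dict, Iterable, Optional


class ResolverGate:
    """Space out requests to one resolver and stop after repeated failures."""

    def __init__(self, min_interval: float = 1.0, max_failures: int = 3,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.min_interval = min_interval
        self.max_failures = max_failures
        self._clock = clock
        self._sleep = sleep
        self._last_request: Optional[float] = None
        self._consecutive_failures = 0

    def wait_for_turn(self) -> bool:
        """Sleep until a request is allowed; return False once we have given up."""
        if self._consecutive_failures >= self.max_failures:
            return False
        if self._last_request is not None:
            remaining = self.min_interval - (self._clock() - self._last_request)
            if remaining > 0:
                self._sleep(remaining)
        self._last_request = self._clock()
        return True

    def record(self, success: bool) -> None:
        """Reset the failure count on success, otherwise increment it."""
        self._consecutive_failures = 0 if success else self._consecutive_failures + 1


def lookup_all(dois: Iterable[str],
               fetch: Callable[[str], Optional[str]],
               gate: ResolverGate) -> Dict[str, str]:
    """Try each DOI through `fetch`, stopping early if the resolver keeps failing."""
    found: Dict[str, str] = {}
    for doi in dois:
        if not gate.wait_for_turn():
            break  # e.g. off campus: do not send hundreds of doomed requests
        url = fetch(doi)
        gate.record(url is not None)
        if url is not None:
            found[doi] = url
    return found
```

With `max_failures=3`, a batch of a thousand items sent while off campus would stop after three requests instead of hitting the resolver a thousand times.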

So all in all, a tricky problem. But it's an interesting idea, and we have been talking about trying to use OpenURL resolvers more, since we've been getting a huge number of submissions to the directory.

adamsmith · October 10, 2022

And when you're on campus the whole chain of redirects and linked pages works based on your IP address without any sort of cookie-based authentication or other interaction?

Yes, at least for our 360 (Serials Solutions) resolver, that's the case, but I think that's due to a feature in EZproxy. If you look at the full-text URL produced by our link resolver, EZproxy is actually in there, but as I understand it (and from what I see), it just gets skipped when I'm inside the university's IP range:

https://nq5hl7cp9d.search.serialssolutions.com/log?L=NQ5HL7CP9D&D=RCA&J=PERSONPO&P=Link&PT=EZProxy&A=Ethics,+Epistemology,+and+Openness+in+Research+with+Human+Participants&H=41ceba9e43&U=https://libezproxy.syr.edu/login?url=https://www.cambridge.org/core/product/identifier/S1537592720004703/type/journal_article
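To show how the link above is layered, here is a small sketch that peels off the resolver's logging redirect and the EZproxy login wrapper to reach the publisher URL. The parameter names `U` and `url` are taken from this one example link only, so treating them as general is an assumption, and `unwrap_target` is a hypothetical helper:

```python
from typing import Optional
from urllib.parse import parse_qs, urlsplit


def unwrap_target(link: str) -> Optional[str]:
    """Return the innermost target URL of a resolver link, or None if absent."""
    # Outer layer: the resolver's log URL carries the next hop in "U".
    resolver_params = parse_qs(urlsplit(link).query)
    proxied = resolver_params.get("U", [None])[0]
    if proxied is None:
        return None
    # Inner layer: the EZproxy login URL carries the publisher page in "url".
    proxy_params = parse_qs(urlsplit(proxied).query)
    # Fall back to the "U" value itself if there is no proxy wrapper.
    return proxy_params.get("url", [proxied])[0]


if __name__ == "__main__":
    example = (
        "https://nq5hl7cp9d.search.serialssolutions.com/log?L=NQ5HL7CP9D&D=RCA"
        "&J=PERSONPO&P=Link&PT=EZProxy"
        "&A=Ethics,+Epistemology,+and+Openness+in+Research+with+Human+Participants"
        "&H=41ceba9e43&U=https://libezproxy.syr.edu/login?url="
        "https://www.cambridge.org/core/product/identifier/S1537592720004703"
        "/type/journal_article"
    )
    print(unwrap_target(example))
```

This only illustrates the structure of the link. Access still depends on the request coming from inside the IP range or going through the proxy.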

The authentication requirement doesn't appear to be universal, though.
JHU lets you use their lookup and they then authenticate you once you click on the resource link.

That also seems to be the case for the Primo resolvers, so we can check/test those.

Unfortunately, it looks like none of the SFX resolvers exist anymore, so a lot of the current data we have is wrong (since SFX is/was an Ex Libris service, I think most of these would have moved to Primo).

If we did do this, it seems like we'd need some translator-like files that could extract the lookup links for individual resolvers, and some community involvement in creating and maintaining those.

Agreed, yes. I'd expect these to be pretty simple, but not so simple as to be guessable. One thing we'd want to be reasonably sure about is that the resolver page is actually structurally identical for the same resolver (though I'd suspect that is the case).
On the plus side, we could have a list of supported resolvers, so we don't try the ones we don't understand (yet).
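A rough sketch of what such a list of supported resolvers with per-resolver extractors might look like. Everything here is hypothetical: the registry layout, the function names, and especially the link-matching rule for 360 Link, which is guessed from the single example URL earlier in this thread:

```python
from html.parser import HTMLParser
from typing import Callable, Dict, List, Optional
from urllib.parse import urlsplit


class _HrefCollector(HTMLParser):
    """Collect the href of every <a> tag on a page."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def extract_360_links(page_html: str) -> List[str]:
    """Guess at full-text links on a 360 Link page (assumed pattern, untested)."""
    collector = _HrefCollector()
    collector.feed(page_html)
    return [h for h in collector.hrefs if "/log?" in h and "P=Link" in h]


# Resolvers we "understand", keyed by host suffix; anything else is skipped.
SUPPORTED: Dict[str, Callable[[str], List[str]]] = {
    ".search.serialssolutions.com": extract_360_links,
}


def extractor_for(resolver_url: str) -> Optional[Callable[[str], List[str]]]:
    """Return the extractor for a resolver URL, or None if it is unsupported."""
    host = urlsplit(resolver_url).hostname or ""
    for suffix, extractor in SUPPORTED.items():
        if host.endswith(suffix):
            return extractor
    return None
```

Keying on the host suffix only works if pages from the same resolver product really are structurally identical across institutions, which is the open question above.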