Web connector not picking up a publishing site that has DOI Links

seeingtheforest · March 11, 2020

Report ID: 1424510495

Hi, I'd like to pull lists of books/articles in from the following site (https://www.developmentbookshelf.com/action/showPublications), but Zotero requires me to go to each specific article/book page in order to do so. The main browsing list simply has links, but the links have the DOIs embedded in them. Is there a way to get Zotero to scrape these lists? Mendeley is able to scrape these links as well as download most of the PDFs, so I suppose I can use that and then sync with Zotero, but I'd rather not.

Also, the same organization has a sister site that doesn't use DOIs but has ISBNs (https://developmentbookshop.com/ebooks). I'd love to be able to do a batch process rather than open each page. Mendeley isn't successful here, but it does recognize a book reference on the book pages given the ISBN - Zotero just saves a web page reference.

They have another site with smaller, non-DOI papers that I'd love to be able to batch link (and even download the PDFs if possible) as well (https://answers.practicalaction.org/our-resources/collection/biofuel-and-biomass-1). Mendeley doesn't work here other than saving a webpage.

I've tried it on both Chrome and Firefox to no avail.

Thanks!

adamsmith · March 12, 2020

1. At least currently, Zotero expects DOIs to be in text visible on the page in order to scrape them, not embedded in links, so no, there's no easy way to do this. @dstillman we could consider specifically also going through a.href?

For 2 and 3 one would have to write custom import code -- doable, but probably not very high priority -- first time this site has ever been mentioned here.

seeingtheforest · March 12, 2020

Thanks for the response! Looks like while you were writing, I was editing my post to compare to Mendeley.

1. It can scrape all of these AND find copies of most of the PDFs.
2. It recognizes references given the ISBN on the page. Zotero just makes a web page copy.
3. Same as Zotero - web page copy.

Is #1 easy enough to do with a href?

I wouldn't expect a big effort for the other two, though for #2, is there an adjustment that can be made in order to scrape the ISBN and create a proper reference rather than saving a web page?

It's a fantastic resource if you guys haven't seen it - a huge repository of quality materials for development work (my field).

dstillman · March 12, 2020

@adamsmith:

we could consider specifically also going through a.href?

I guess we could do that. I doubt it would be useful very often, but there's probably no real downside (assuming there's no performance impact, but we're already scanning all text nodes).
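
For illustration only, here is a minimal Python sketch of the idea: collect DOI-like strings from the visible text and, in addition, from each link's href. This is not the Zotero DOI translator itself. The regex, the `DOICollector` class, the `find_dois` helper, and the sample HTML with its `10.1234/...` placeholder DOIs are all assumptions made up for the example.

```python
import re
from html.parser import HTMLParser
from urllib.parse import unquote

# Loose DOI pattern (an assumption for this sketch, not Zotero's own regex).
DOI_RE = re.compile(r"10\.\d{4,9}/[^\s\"'<>?#]+")


class DOICollector(HTMLParser):
    """Collects DOI-like strings from text nodes and from a[href] values."""

    def __init__(self):
        super().__init__()
        self.dois = []

    def _scan(self, text):
        # Decode %2F etc. first, since DOIs inside URLs are often escaped.
        for match in DOI_RE.findall(unquote(text)):
            doi = match.rstrip(".,;)")  # trim trailing sentence punctuation
            if doi not in self.dois:
                self.dois.append(doi)

    def handle_starttag(self, tag, attrs):
        # The extra step discussed above: also look inside link targets.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self._scan(value)

    def handle_data(self, data):
        # Visible text nodes.
        self._scan(data)


def find_dois(html):
    parser = DOICollector()
    parser.feed(html)
    parser.close()
    return parser.dois


# Hypothetical page: one DOI appears only in a link, one in visible text.
sample = """
<ul>
  <li><a href="/doi/book/10.1234/example.001">Some book title</a></li>
  <li>Some article. doi:10.1234/example.002.</li>
</ul>
"""
print(find_dois(sample))
```

Both sources feed the same de-duplicated list, so a DOI that appears in the text and in a link is only reported once.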

@nixsee:

is there an adjustment that can be made in order to scrape the ISBN

We could do that, but the main problem is that there aren't proper APIs for many ISBN lookups like there are for DOIs. Granted, you can paste multiple ISBNs into Add Item by Identifier and Zotero will do the lookups, but that's a bit more intentional than just clicking the folder icon and retrieving metadata for any ISBN that might appear on the page. (I don't know what Mendeley is doing. It's possible they're doing lookups in their own database, but they're a bit more casual about data accuracy and sharing — identifiers seem to be frequently linked to wrong or incomplete items based on other people's entries.)
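
As a rough sketch of the "find ISBNs on the page" half of this (not the lookup half, which is the actual problem), the following Python pulls ISBN-shaped strings out of page text and keeps only those with a valid check digit. The candidate regex and the names `is_valid_isbn` and `find_isbns` are assumptions for the example, not anything Zotero ships; the check-digit rules are the standard ISBN-10 and ISBN-13 ones.

```python
import re

# Candidate runs of 10 or 13 digits, allowing hyphens/spaces and a final X.
# This pattern is an assumption for the sketch and will miss unusual layouts.
ISBN_CANDIDATE = re.compile(r"\b(?:97[89][- ]?)?(?:\d[- ]?){9}[\dXx]\b")


def is_valid_isbn(isbn):
    """Check the ISBN-10 or ISBN-13 check digit of a bare string (no hyphens)."""
    if len(isbn) == 10:
        if not re.fullmatch(r"\d{9}[\dX]", isbn):
            return False
        # ISBN-10: weights 10..1, X counts as 10, total divisible by 11.
        total = sum((10 - i) * (10 if ch == "X" else int(ch))
                    for i, ch in enumerate(isbn))
        return total % 11 == 0
    if len(isbn) == 13 and isbn.isdigit():
        # ISBN-13: weights alternate 1, 3; total divisible by 10.
        total = sum(int(ch) * (1 if i % 2 == 0 else 3)
                    for i, ch in enumerate(isbn))
        return total % 10 == 0
    return False


def find_isbns(text):
    """Return valid ISBNs found in text, normalized and de-duplicated."""
    found = []
    for match in ISBN_CANDIDATE.finditer(text):
        isbn = re.sub(r"[- ]", "", match.group()).upper()
        if is_valid_isbn(isbn) and isbn not in found:
            found.append(isbn)
    return found


page_text = ("ISBN 978-0-306-40615-7 (print), ISBN 0-306-40615-2 (older), "
             "and a mistyped 978-0-306-40615-8")
print("\n".join(find_isbns(page_text)))
```

The checksum only filters out typos and stray digit runs; it says nothing about whether good metadata exists for a given ISBN. The resulting list could be pasted into Add Item by Identifier, which does the lookups.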

seeingtheforest · March 12, 2020

Also, I can't necessarily write code, but I am generally capable of splicing together code snippets - if there are limits to what you're able to help with here, I could try my hand at it if you could point me in the right direction (translators that might be similar and need tweaking for example, or good user guides for learning to make translators).

seeingtheforest · March 12, 2020

Looks like we posted at the same time again - thanks for the detailed response!

It seems to me that being able to scrape ISBNs is more efficient, even if having to double check them, than doing it all manually.

If you decide to do any of this, how might these sorts of changes be implemented? In upcoming general updates for Zotero? Something I'd have to install myself?

dstillman · March 12, 2020

It seems to me that being able to scrape ISBNs is more efficient, even if having to double check them, than doing it all manually.

You have to check imported data in any tool, but we're not going to knowingly give people garbage data — that's just not really our style. People on Twitter complain all the time about junk data in Mendeley — e.g., the wrong item showing up for a given identifier — and that's not something we're interested in emulating. (That's not to say that there can't be incorrect data in Zotero, but when there is it comes from bad website data or temporary translator bugs, not from reusing unverified data from other people's libraries.)

To clarify, though, you don't have to enter metadata manually in Zotero — you can just paste the ISBN into Add Item by Identifier to retrieve the metadata. The question is just turning that on for a site that might have dozens of ISBNs listed without the existence of proper APIs.

If you decide to do any of this, how might these sorts of changes be implemented? In upcoming general updates for Zotero?

They would be updates to the DOI translator and/or an addition of a generic ISBN translator. You'd get them automatically.

if there are limits to what you're able to help with here, I could try my hand at it if you could point me in the right direction

Thanks, but these are more policy decisions than technical undertakings. They're fairly trivial on a technical level.

You're certainly welcome to work on a translator for that site, though.

adamsmith · March 12, 2020

I guess we could do that. I doubt it would be useful very often, but there's probably no real downside (assuming there's no performance impact, but we're already scanning all text nodes).

I think it may be more often than you think, especially for scraping reference lists -- CrossRef DOI display guidelines for those explicitly allow the format:

Galli, S.J., and M. Tsai. 2010. Mast cells in allergy and infection: versatile effector and regulatory cells in innate and adaptive immunity. Eur. J. Immunol. 40:1843–1851. Crossref

dstillman · March 13, 2020

Ah, OK, this makes a lot of sense, then.

I've opened a pull request with this change (and some other improvements to the DOI translator).

seeingtheforest · March 14, 2020

Thanks very much! Looking forward to the functionality. I'm endlessly impressed with the responsiveness and helpfulness in this community - it is making my waffling between Zotero, Mendeley and Calibre much easier to resolve :)