EBSCO Host
After selecting all articles, the little red-bordered grey box at the bottom right appeared as before, but without the article currently being processed displaying in the box (which it DOES do when, for instance, I do the same in Google Scholar). After a while, the boxes started 'moving up' more quickly than before (although I still didn't 'see' the actual articles in them). And in the end, all 20 were nicely downloaded, with no more duplicates.
When I tried again with 50, however, things still seemed to get stuck at the first box. When I clicked on that box, it just disappeared and nothing further happened.
Let me ask 3 questions:
* SHOULD this process be sensitive to external inputs, etc.? (I noticed before, too, that I couldn't use that computer for anything else, or the download would not work.)
* Is there a way to create a log for this script, so that it records exactly what it does and what succeeds or fails?
* Could the problems be related to the fact that I am putting the items in a sub-sub-folder?
I will keep trying and will post the messages from the Firefox error console again, in the hope that we can fully 'fix' this!
-Stephan
EBSCOhost is not the fastest server to respond, so downloading 50 items may take some time and the red box might disappear. The way the translators work, the items are not displayed in the red box until we have collected all of the metadata for an item (which involves going to the item page, retrieving the RIS file, and if there is an associated PDF, going to the PDF page to retrieve the PDF link = 2-3 page downloads from EBSCO). This may take quite a while, and is further delayed by going through a proxy.
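To picture what that per-item round trip looks like in translator code, here is a rough sketch of the flow described above (the helpers getRISUrl, getPDFUrl, and extractPDFLink are hypothetical placeholders, not the actual EBSCOhost translator code):

```javascript
// Rough sketch of the per-item flow: item page -> RIS file -> PDF page.
function scrapeItem(itemUrl) {
	// 1. Load the item page to find the RIS export link and a possible PDF page
	Zotero.Utilities.processDocuments(itemUrl, function (doc) {
		var risUrl = getRISUrl(doc); // placeholder: locate the RIS export URL
		// 2. Fetch the RIS file and build the item from it
		Zotero.Utilities.doGet(risUrl, function (risText) {
			var item = new Zotero.Item("journalArticle");
			// ... populate item fields from risText ...
			var pdfPageUrl = getPDFUrl(doc); // placeholder: locate the PDF viewer page
			if (pdfPageUrl) {
				// 3. Load the PDF page just to extract the actual PDF link
				Zotero.Utilities.processDocuments(pdfPageUrl, function (pdfDoc) {
					item.attachments.push({
						url: extractPDFLink(pdfDoc), // placeholder
						title: "EBSCOhost Full Text PDF",
						mimeType: "application/pdf"
					});
					item.complete();
				});
			} else {
				item.complete();
			}
		});
	});
}
```

Only once that whole chain finishes for an item does it show up in the red progress box, which is why a slow EBSCO server or proxy makes the display lag so far behind.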
Keep in mind that when you download 50 articles, there are 50 simultaneous connections being made to the EBSCO servers. This may be a problem on some networks. On Windows (XP SP2+ and Vista up to SP2), for instance, I believe this may run into the maximum limit on half-open connections (set to 10). Your router may also be configured to not allow that many connections (that number is typically much higher than 50, but on a shared network you might be reaching it). Furthermore, the proxy might be imposing similar limits (in which case, we could consider making these connections sequentially instead of concurrently; see the sketch below). The point is that performing certain other tasks could push the number of connections even further (e.g. running a BitTorrent client).
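For illustration, a fully sequential version of the request logic could look roughly like this (a sketch only; fetchOneItem is a hypothetical placeholder for the per-item scraping code, and this is not what the translator currently does):

```javascript
// Sketch of the sequential alternative: only one item's request chain is
// in flight at a time. fetchOneItem scrapes a single item and must call
// the supplied callback when it finishes (or fails).
function fetchSequentially(itemUrls) {
	var queue = itemUrls.slice(); // copy so we can shift() safely
	(function next() {
		if (!queue.length) return; // all items processed
		fetchOneItem(queue.shift(), next);
	})();
}
```

The trade-off, of course, is that a strictly sequential import of 50 items would take noticeably longer than the current concurrent approach.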
As far as using the computer for, let's say, writing in Word: this should not have any effect on Zotero's ability to scrape pages.

I've added a bunch of debug messages that could help you determine where the items choke. The updated translator is at the same location (link in previous post).

I don't believe the sub-sub-folder would matter. Do you mean a sub-sub-collection, or are you talking about folders on your computer? Sometimes there may be a problem with saving PDFs that have very long names; this would appear in the debug log.
Edit: Corrected "Vista up to SP1" to "Vista up to SP2"
Thanks much. I am running this on a Windows 7 home PC (and am not running a BitTorrent client). I am still trying now (with the new version) with 50 items, but, like yesterday, it doesn't download anything. I'll post more details (and the error log, if I actually get any errors) later, but I was just wondering: is there any way to tell that the script is actually finished? Normally, AFTER it shows all the items it has captured on the right of the screen in separate 'stacked' boxes, it STILL takes quite a while before I can actually start using Firefox again. That suggests to me that it may still be doing something, and I'm sometimes afraid that doing something else at that point might still mess up the process. So I wondered whether there is any way of telling that the whole process has run its course...
(* the Source file is always http://global.ebsco-content.com.library3.webster.edu/interfacefiles/12.41.7.0.1/css/ehost/master_bundle.css)
*Warning: Selector expected. Ruleset ignored due to bad selector.
* Warning: Expected ',' or '{' but found '/'. Ruleset ignored due to bad selector.
Line: 1
* Unknown pseudo-class or pseudo-element 'relative'. Ruleset ignored due to bad selector.
* Warning: Unknown pseudo-class or pseudo-element 'text'. Ruleset ignored due to bad selector.
And at the end - nothing ended up in my Zotero folder (My Library/Title of Project/Title of query). I also looked in the root folder (as it sometimes saves items there too), but there was nothing there either. Any further ideas?
1) Before we do much more debugging, let's try to simplify the system a bit. I'm sorry if I missed this in your posts and you are in fact already doing this, but could you try using Zotero as a Firefox extension and accessing EBSCO via Firefox?
If that does not make a difference and you are still not able to import items:
2) Update the translator to the latest version from https://raw.github.com/aurimasv/translators/EBSCOhost/EBSCOhost.js (same link as above, but I added some more debugging code)
3) And please generate a Debug ID by following the instructions in the _first_ section (Debug Output Logging) of this page: http://www.zotero.org/support/debug_output#debug_output_logging
When you get to step 5, hit Disable and then open the log by clicking View Output. Then select the entire log and copy and paste it to a website like pastebin.com or gist.github.com.
4) If using Firefox solves the problem, let us know and we'll go on to debugging the Chrome connector itself.
Here are some more results, from records 40-60 of a search where the first two pages (0-20 and 21-40) went fine: https://gist.github.com/3772358 . Debug ID: D533709939
In both of your reports, one item never completes (it never manages to retrieve the RIS data).
In the first one (https://gist.github.com/3770531), item (35) never completes and in the second (https://gist.github.com/3772358), item (10) fails to complete.
I'm guessing that the page requests time out. I need to think about how to fix this, but it will probably require an update to the Zotero client. I'll give you an update in a day or two, but if I don't, bump this thread.
Thanks for all of your reports.
Looking at the code, it seems that this should have generated a log message (utilities_translate.js#L415 -> translate.js#L1155). Since there is no log message, (A) was there no error, (B) was the error mishandled and never made it into the log, or (C) am I not following the code correctly?
If there was no error, there should have been a follow-up log message from the EBSCOhost translator (Last seen message is EBSCOhost.js#L332 which should then be followed by EBSCOhost.js#L32 if doGet calls back)
In either case, I am thinking that we should allow translators to handle errors from xmlhttprequest. Before I try to devise a detailed proposal, I was wondering if there is already something in the works, or maybe even already in the master branch (I suppose I could try to browse through the code, but I thought I'd ask anyway). Otherwise, I have some ideas about how this could be handled in a user-friendly way.
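Purely as an illustration of what letting translators handle request errors might look like, here is one possible shape for such an API (hypothetical; the extra onError argument shown here does not exist in Zotero.Utilities.doGet today):

```javascript
// Hypothetical sketch only: the onError argument does not exist in Zotero.
// The idea is that a failed or timed-out request would be reported to the
// translator instead of the item silently hanging.
Zotero.Utilities.doGet(risUrl, // risUrl: the per-item RIS export URL from the surrounding code
	function (text) {
		// parse the RIS text as usual
	},
	function () {
		// all requests finished
	},
	/* hypothetical */ function (error, url) {
		Zotero.debug("EBSCOhost: request to " + url + " failed: " + error);
		// e.g. mark this item as failed and let the rest of the import continue
	}
);
```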
I glanced through the debug log, and what stood out to me most was:
Translate: Unknown RIS item type: Book. Defaulting to journalArticle
Obviously that's an unrelated problem, but it's rather annoying that books are being captured as articles, which throws away metadata and makes it useless.
Anyway, I guess I'll have to try capturing fewer at a time. But why does Zotero try to capture all 50 at once? Why doesn't it queue them and capture a few at a time? This would also make debugging easier.
Ok, worse than that: this attempt to capture items in Zotero caused Firefox to allocate over 2 GB of memory, and Firefox won't release it. I tried to exit Firefox, and the GUI closed, but the process remained running and still didn't release memory. I had to kill Firefox. This is repeatable. Using Firefox 16.0.1.
Bottom line: Zotero seems to be just plain unreliable. It's a shame, because it is so useful when it works. But it seems like when I need it most, it lets me down. All I can use it for is capturing bibliographic data on single items, and then I just copy and paste into my editor.
Which EBSCO database are you searching/using?
Beyond that, see aurimas above.
I understand that EBSCO isn't returning data in compliance with the RIS spec, but capitalization is a very minor thing. Why isn't Zotero more robust? I can't help but think of Jon Postel's famous axiom: "Be conservative in what you do, be liberal in what you accept from others."
I wish that all resources had metadata in standard formats embedded in their HTML. And I wish that I didn't have to use EBSCO. But I'm afraid it's my only choice.
@aurimas - any thoughts on whether to do that in RIS or in the EBSCO translator?
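To make the two options concrete, a rough sketch of what either fix might look like (hypothetical code, not the actual translator source):

```javascript
// (a) In the RIS translator: look up the TY value case-insensitively,
//     so "Book", "book" and "BOOK" all map to the same Zotero item type.
var TYPE_MAP = { "BOOK": "book", "JOUR": "journalArticle" /* , ... */ };
function resolveType(ty) {
	return TYPE_MAP[ty.trim().toUpperCase()] || "journalArticle";
}

// (b) In the EBSCOhost translator: normalize the type line before handing
//     the RIS text to the RIS import translator.
risText = risText.replace(/^TY\s+-\s+(.+)$/m,
	function (match, ty) { return "TY  - " + ty.toUpperCase(); });
```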
Could you describe your set-up in a bit more detail? I'm not quite sure I understand the library search via EBSCO.
The closer we can replicate this, the better we can troubleshoot. I don't want to raise expectations too much; EBSCO will likely remain a bit finicky, but we can probably improve.
@Dan, Simon, etc.: As far as the number of connections to the server is concerned, I think we should impose these limits using the value in network.http.max-connections-per-server. IMO this should be done in Zotero.HTTP.*

See the updated translator linked by me a couple of posts above (https://raw.github.com/aurimasv/translators/EBSCOhost/EBSCOhost.js). I'll push this version out to everyone when I get a chance to remove all the debugging code.

This sounds serious. I will try to replicate it, but I have not seen this behavior yet. Are you pressing the folder icon multiple times (for instance, do you try to import again after a failed attempt)?
The best suggestion I can give you right now is to use the translator from the link above and import ~10 items or so at a time.
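Regarding the max-connections-per-server idea above, the gist would be something like the following (a sketch of chrome-level Zotero code, not translator code; doRequest is a placeholder for whatever Zotero.HTTP call is actually used):

```javascript
// Read Firefox's own per-server connection limit and keep at most that
// many EBSCO requests in flight at once.
Components.utils.import("resource://gre/modules/Services.jsm");
var maxPerServer = Services.prefs.getIntPref("network.http.max-connections-per-server");

var pending = [];  // URLs waiting to be requested
var inFlight = 0;  // requests currently running

function startNext() {
	while (inFlight < maxPerServer && pending.length) {
		inFlight++;
		// doRequest is a placeholder; it must invoke the callback when the
		// request completes or fails so the next one can start.
		doRequest(pending.shift(), function () {
			inFlight--;
			startNext();
		});
	}
}
```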
The books - those are books in the university library?
I don't think we've ever seen or dealt with that, so there's a good chance it wouldn't work well.
Is that search publicly accessible or do you need to be signed in?
How about if you go through EBSCO directly? Does your U allow that? Do you see any difference?
I'll leave it to Simon to comment on ways we could support queuing in the translator architecture, but I agree that it would be a good thing. We've always said that Zotero wasn't really designed with the download-huge-pages-of-results-at-a-time use case in mind, but since people are clearly going to do it, it'd be nice if we handled it a bit better. (Among other things, a queue would make it less likely that people would get banned for violating the TOS—which, essentially, they probably are.)