Retrieving metadata for PDFs

chestervonwinchester · July 8, 2015

I know this issue has been brought up multiple times. I also realize that much of the issue is a direct result of Google's protocols, not Zotero's. Nonetheless...

It appears I have been locked out of retrieving metadata for multiple days with no error message from Zotero. It simply hangs and then says no reference can be found, with no indication that any query limit has been reached on Google Scholar. I have waited as long as overnight and tried again in the morning (having read that Scholar locks people out for just an hour after too many attempts), to no avail. This has been going on for 2+ days now.

Google has definitely flagged my IP because when I visit scholar through the browser, it makes me jump through hoops to use it. The problem is that there is no indication of this on the Zotero side, and there is no indication of how long I am locked out.

Is there any fix for this?

---

edit: version info

Zotero 4.0.27 on OS X 10.6

aurimas · July 8, 2015

> Google has definitely flagged my IP because when I visit scholar through the browser, it makes me jump through hoops to use it.

Can you describe this in more detail? Google appears to have recently added additional CAPTCHAs that we are not accounting for. I've encountered it once but, unfortunately, was not paying close enough attention to the underlying code, so I don't have a fix for this yet.

chestervonwinchester · July 8, 2015

Ok, I would estimate about 1 or 2 times out of 10, I am confronted with a simple text CAPTCHA wall upon visiting Scholar. By CAPTCHA wall, I mean I'm redirected to a page with your standard run-of-the-mill text CAPTCHA. After completing it, I'm redirected to scholar.

The remainder of the time, I'm confronted with an image label verification tool, only after searching on scholar.

The latter looks like:

http://imgur.com/tPSKaV6

followed by:

http://imgur.com/OTRn9tO

aurimas · July 8, 2015

OK, great. That's what I've seen as well. We should already be supporting the simple CAPTCHA, though I've seen it fail once and I'm not entirely sure why yet. As for the image verification, we hadn't seen that before today, so we'll need to come up with something there.

After you solve the CAPTCHA in your browser, does Zotero Standalone (that's what you're using right?) start working with Retrieve Metadata again?

You could use Zotero in Firefox if Zotero Standalone does not start working. This way, whatever you see in Firefox (e.g. CAPTCHA or image verification) is what Zotero would see and you can simply solve those in the browser. If you go this route, could you install Zotero Beta, which contains some additional debug logging code to help us track down why we're not handling some of these CAPTCHAs as smoothly?

With Zotero Beta, if you start getting stuck on not being able to retrieve metadata or you get locked out from Google Scholar, could you submit a Debug Log for an attempt to retrieve metadata for one PDF? Also, after submitting the debug log, go to Google Scholar in Firefox and see what kind of page you get (e.g. CAPTCHA, "We're sorry...", image verification).

chestervonwinchester · July 8, 2015

> After you solve the CAPTCHA in your browser, does Zotero Standalone (that's what you're using right?) start working with Retrieve Metadata again?

I am using the standalone. No, solving the CAPTCHA doesn't seem to do anything. It seems once the image verification started, it's trumped the CAPTCHA.

Actually - and I just noticed this - the CAPTCHA page only shows up if I use scholar as a search engine from the url bar (I'm using Chrome). In any case, I've realized now also that even AFTER solving this CAPTCHA redirect, I'm still confronted with the image verification tool after searching on scholar. This happens EVERY time I visit the site.

This is the CAPTCHA redirect page:

http://imgur.com/mvHBaGp

> You could use Zotero in Firefox if Zotero Standalone does not start working. This way, whatever you see in Firefox (e.g. CAPTCHA or image verification) is what Zotero would see and you can simply solve those in the browser.

My only issue with this is that I'm trying to attach metadata to PDFs that already exist in my standalone library. I haven't used the browser plugin, so forgive me if there's an easy way to sync these up - I haven't looked into it.

I'm not sure I have the time to install the Beta version right now. I could post a debug log from the standalone if that helps?

aurimas · July 8, 2015

> My only issue with this is that I'm trying to attach metadata to PDFs that already exist in my standalone library. I haven't used the browser plugin, so forgive me if there's an easy way to sync these up - I haven't looked into it.

There actually is a very easy way to link the two. In fact, when you first start Zotero in Firefox, it will prompt you to do so automatically. If you don't get the prompt or something doesn't work, see https://www.zotero.org/support/kb/sharing_data_directory. Note that to use Zotero for Firefox after it is linked with Zotero Standalone, you will have to close Zotero Standalone.

I think I have enough information regarding Google Scholar CAPTCHAs at this point, and I don't need the debug log from the Beta. So feel free to install the official Zotero release instead. It seems like, currently, using the Zotero Firefox extension is going to be the only way to get out of being blocked from Google Scholar. Once you start having issues with Retrieve Metadata, you will have to visit Google Scholar in Firefox and solve their CAPTCHAs until you can search their website again.

We'll try to release an update for this as soon as possible, but this will probably take a week or so.

chestervonwinchester · July 8, 2015

Great, thanks. I appreciate your help!

tamunro · August 2, 2015

I've also just been locked out of Google Scholar after downloading 150 records. I believe the problem is not the total number but the very high rate of requests Zotero sends: it appears to send several per second with no delay. I'd be very grateful if a throttle delay for bulk requests could be added, as done by OpenRefine and web scrapers. Those can send hundreds of Google Scholar queries before getting blocked. A delay of 1-1.5 seconds is usually enough, and indeed I think it should be the default. I think waiting an extra thirty seconds per page of results is no great sacrifice, but getting locked out could be a disaster.

p.s. Thanks for all the great work! Zotero certainly slays Endnote for importing web sources.

adamsmith · August 2, 2015

Requests are throttled already -- I don't know OpenRefine, but my guess would be that they're not downloading the BibTeX data, which a) Google may be especially wary of and b) requires an additional request per item.

aurimas · August 2, 2015

Two additional requests actually (which are not throttled in between). I also wouldn't be surprised if OpenRefine maintains some special treatment from Google, since it was started by Google.
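
For illustration only, here is a minimal Python sketch of the kind of throttle discussed above, where every request (including the follow-up requests for each item) is spaced by a minimum interval. This is not Zotero's actual code. The names `Throttle`, `fetch_all`, and `send` are hypothetical, and the count of three requests per item (one search plus two follow-ups) is an assumption based on the comments above. The interval would be something like the 1-1.5 seconds suggested earlier.

```python
import time
from typing import Callable, Iterable, Optional


class Throttle:
    """Enforce a minimum gap between consecutive calls to wait()."""

    def __init__(
        self,
        min_interval: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = min_interval
        self._clock = clock      # injectable so the class can be tested without real delays
        self._sleep = sleep
        self._last: Optional[float] = None

    def wait(self) -> float:
        """Block until min_interval has passed since the previous call.

        Returns the number of seconds slept (0.0 on the first call).
        """
        delay = 0.0
        if self._last is not None:
            elapsed = self._clock() - self._last
            delay = max(0.0, self.min_interval - elapsed)
            if delay > 0:
                self._sleep(delay)
        self._last = self._clock()
        return delay


def fetch_all(
    items: Iterable[str],
    requests_per_item: int,
    throttle: Throttle,
    send: Callable[[str, int], None],
) -> None:
    """Send every request for every item, throttling each one individually.

    `send` is a hypothetical callback that performs request number `step`
    for `item`; the follow-up requests are throttled too, not just the first.
    """
    for item in items:
        for step in range(requests_per_item):
            throttle.wait()
            send(item, step)
```

With `Throttle(1.5)` and three requests per item, 150 records would take roughly 150 × 3 × 1.5 seconds, or a bit over 11 minutes, which is the trade-off between speed and the risk of a lockout.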