Query Limit reached?
The point, if we can stay on it, has to do with query limits.
I'm not suggesting scanning everyone's PDFs. I'm suggesting caching metadata results on zotero.org after a user successfully obtains metadata for a given query via Google Scholar. Queries could be sent to zotero.org in the first instance and, if there is an exact match, the cached metadata returned. The lines between ubound and lbound are all that need to be stored (https://github.com/zotero/zotero/blob/master/chrome/content/zotero/recognizePDF.js#L131). It could be as simple as including those lines as a hidden field in the citation entry, which could be extracted when it is synchronised with the Zotero servers. At the very minimum, storing those lines plus the results from Google and aggregating those data would give a better idea of the feasibility of finding a workaround.
Also, it appears that CrossRef does limit queries. I'm blocked at the moment and the PDF contains a DOI on the first line. I even removed the spaces in the DOI of the .txt version and it still won't work.
$ head fulltext\(1\).txt
PSYCHOMETRIKA -- VOL . 75, NO . 4, D ECEMBER 2010 DOI:10.1007/S11336-010-9182-4
There's what I think is a bug in Zotero right now, where metadata retrieval via DOI will not proceed if Zotero determines that Google Scholar has blocked you. But if you do metadata retrieval on that file only, it should connect to CrossRef.
Yes, right-clicking and selecting Retrieve Metadata on that file only gives an error: http://i.imgur.com/v79TQ3C.png?1
DOI : 10.1007/ S 11336-010-9182-4
The spaces in the DOI prevent Zotero from treating this as a DOI, so it jumps to Google Scholar queries. I'm not sure why there's no space in your pdftotext output on either side of the S. I do see that (just like on my system) you get a space after the D in DECEMBER.
There are no spaces because I already took them out to test the theory you just described. However, the blocking error still appears even without said spaces, which is why I suggested that CrossRef may be limiting as well. If taking out the spaces had fixed the problem, I would have suggested adding a few \s* to the DOI regex, such as:
/DOI\s*:\s*10\.\s*[0-9]{4,}\s*\/\s*[^\s]*\s*[^\s\.,]/.test('DOI : 10.1007/ S 11336-010-9182-4')
true
If I touch up the PDF to remove the spaces, the extracted DOI looks like
DOI : 10.1007/S11336-010-9182-4
and retrieving metadata succeeds. I'd be interested to investigate why your touch-up did not work, because pdftotext does output the same DOI (Edit: I mean the same DOI as I see with my touch-up). If you're inclined to tinker with this, you can enable debugging in Preferences -> Advanced -> General and perform the Retrieve Metadata action. If you click on View Output afterwards, you should see exactly what's going on. I don't think we would consider this, because it would mess up a lot more than it would fix.
Edit: Following your edit: that regex does match the string, but it would only extract "DOI : 10.1007/ S 1", and even if we fix it to be
/10\.\s*[0-9]{4,}\s*\/\s*[^\s]*\s*[^\s\.,]*/
it would still not be acceptable. I understand that this regex would fix the problem you are experiencing right now, but Zotero must be able to correctly handle various other cases as well. E.g. who's to say that the DOI could not be "10.1005/123 S 11336-010-9182-4"? Also, what about correctly formatted DOIs?
/10\.\s*[0-9]{4,}\s*\/\s*[^\s]*\s*[^\s\.,]*/.test('DOI : 10.1007/S11336-010-9182-4 hello!!')
true
It matches, but it would extract "10.1007/S11336-010-9182-4 hello!!" as the DOI.
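One way to make the tradeoff concrete: a hypothetical two-pass extraction (not Zotero's code) that keeps the strict pattern and only strips spaces as a fallback, on the text after a "DOI:" label. Note that the fallback still inherits the over-matching problem whenever other words follow the DOI on the same line, which is exactly the objection raised here; it only helps when the spaced-out DOI ends the line:

```javascript
// Illustrative two-pass extraction, not Zotero's actual implementation.
// A reasonably strict DOI pattern: stops at whitespace and refuses a
// trailing "." or "," instead of swallowing everything.
const DOI_RE = /10\.[0-9]{4,}\/[^\s]*[^\s.,]/;

function extractDOI(text) {
  // Pass 1: look for a well-formed DOI as-is.
  let m = text.match(DOI_RE);
  if (m) return m[0];
  // Pass 2: pdftotext sometimes inserts stray spaces inside the DOI,
  // so take the rest of the line after a "DOI:" label, strip internal
  // spaces, and retry the strict pattern. Caveat: if other words follow
  // the DOI on the same line, they get glued on, i.e. the same
  // over-matching problem as the permissive regex.
  const labeled = text.match(/DOI\s*:\s*([^\n]+)/i);
  if (labeled) {
    const squeezed = labeled[1].replace(/\s+/g, '');
    m = squeezed.match(DOI_RE);
    if (m) return m[0];
  }
  return null;
}

console.log(extractDOI('DOI : 10.1007/ S 11336-010-9182-4'));
// prints "10.1007/S11336-010-9182-4"
```

This recovers the Psychometrika example, but it is only a sketch of the design space, not a proposal for master.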
https://github.com/zotero/zotero/blob/master/chrome/content/zotero/recognizePDF.js#L145
and
https://github.com/zotero/zotero/blob/master/chrome/content/zotero/recognizePDF.js#L191
So what's going on in your case is that CrossRef is failing for some reason, or not getting queried at all, and then Zotero reports hitting the GS query limit. [Overlapping with aurimas here, but leaving this in for the code passages.]
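To make that flow concrete, here is a rough sketch of the fallback order as I understand it from those passages. All function names here are hypothetical; the linked recognizePDF.js lines are the real code:

```javascript
// Hypothetical sketch of the fallback order, not Zotero's actual code.
function recognize(text, lookups) {
  const doi = /10\.[0-9]{4,}\/[^\s]*[^\s.,]/.exec(text);
  if (doi) {
    const item = lookups.crossref(doi[0]); // try CrossRef first
    if (item) return item;
  }
  // No DOI found, or CrossRef failed: fall back to Google Scholar.
  if (lookups.scholarBlocked()) {
    // This is the confusing part: the user sees a Scholar query-limit
    // error even when CrossRef was the step that actually failed.
    throw new Error('Query limit reached. Try again later.');
  }
  return lookups.scholar(text);
}
```

The upshot: the "query limit" message does not necessarily mean Google Scholar was ever queried for that file.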
You can look at the debug output
http://www.zotero.org/support/debug_output
to see what's happening. It's quite detailed for Retrieve Metadata.
Clearly I'm not suggesting that the one-off hack above be merged into master or I would have submitted a pull request. I'm just trying to point out a few ideas to deal with the problem at hand of rate limiting.
There are several small changes which could make things easier on end users until a better solution is found.
For example, a DOI for a JSTOR article can be constructed by taking the identifier after stable/ in the article's URL and prepending 10.2307/, which CrossRef will then resolve. The URL is included in the first page of each JSTOR PDF. As it stands now, though, JSTOR PDFs don't contain a DOI, so they are, unfortunately, routed to Google Scholar. I'm sure there are dozens of tricks like this which could be found and added as alternatives to querying Google Scholar. Many large publishers put the DOI in the document's URL, which is often in the PDF metadata even if the DOI itself is not.
Besides, I think ultimately people are barking up the wrong tree. IMHO, the core problem here is the lack of (access to) full-text indices of the scientific literature. Ideally we shouldn't have to rely on a commercial party like Google for full-text lookups.
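A sketch of that JSTOR trick, assuming the stable URL can be found in the extracted first page. The function name is mine; 10.2307/ is JSTOR's registrant prefix, but not every stable ID is guaranteed to resolve, so the result should be treated as a candidate to verify against CrossRef:

```javascript
// Illustrative only: derive a candidate DOI from a JSTOR stable URL
// found in extracted PDF text. Verify the result against CrossRef
// before trusting it; not every stable ID resolves as a DOI.
function jstorDOI(text) {
  const m = text.match(/jstor\.org\/stable\/([0-9]+)/i);
  return m ? '10.2307/' + m[1] : null;
}

console.log(jstorDOI('Source: http://www.jstor.org/stable/1234567'));
// prints "10.2307/1234567"
```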
I have gone/am going through the same situation. Importing large numbers of PDFs is not a pleasant experience. I've offered many ideas which may help alleviate the problem in the short term. I think it's short-sighted to just brush off these complaints.
aurimasv, e.g., submitted the last major patch that improved retrieve metadata.
But someone has to implement these things, and you'll have to accept that people have different priorities. Specific suggestions like yours are appreciated, but as I'm sure you know, they are usually the easiest and always the least work-intensive part of coding...
And you'll have to leave the decision about what level of rudeness is acceptable here to the people who provide the support here on a daily basis ;)
I had recently decided to adopt Zotero and move over from Qiqqa/Papers. I have an existing library of about 1700 papers with pdfs and wasn't happy with the errors Qiqqa had made to the metadata, either when finding metadata within existing pdfs or when pulling down new papers. I switched to Papers which, whilst better in many ways, made a bunch of (different) errors, mainly to journal titles.
I tried out shifting a few papers over to Zotero and was really impressed by its ability to find the correct metadata for my pdfs. As you can imagine, I did not want to import my library over (as an RIS file etc) as it would be imported with all those errors, so this ability was a big reason for making the switch.
So I spent hours building nested collections etc only to run up against this brick wall a couple of hundred papers in. As the forums are not a place to vent frustration or anger, I won't. All I'll say is that I'm sure you can imagine how very frustrated I am. Whilst I really like the look of Zotero and recommend it to my students, I am now thinking about going back to Papers and manually correcting all its errors. Please note, I decided to move over to Zotero even though I'd just spent $70 on a Papers license, because I was so impressed by it. So the decision to now give up on Zotero is not taken lightly.
myqlarson is right. People adopting a piece of software like this are likely to be doctoral students or those even further down the academic path and are therefore likely to come with their own collection of pdfs. This query limit is likely to put a lot of people off adopting your software. It is putting me off right now.
I'm trying to get someone to test an idea I have.
No one is happy with the current situation, and there are certainly ways we could improve things, but the potential solutions that would have the biggest impact (e.g., running the detection very slowly in the background, some sort of server-based solution from zotero.org) are nowhere near as trivial as some of the people in this thread make them out to be.
Dan I appreciate that you understand that this is a problem and that it is not an easy fix. Thanks for staying on it.
Aurimas - do you mean import into BibTeX or something similar? You access this function by clicking Scholar's cite link, and then there are a bunch of links to do this through BibTeX, EndNote, etc. If I choose the BibTeX option it sends me to a page with the BibTeX data:
@article{perez2012etiology,
title={The etiology of psychopathy: A neuropsychological perspective},
author={Perez, Pamela R},
journal={Aggression and Violent Behavior},
year={2012},
publisher={Elsevier}
}
If I choose the EndNote option it downloads a .enw file. I double-checked whether this is because the one-day (or however long it is) block is over, and it is not: I tried retrieving metadata for a PDF and got the same query limit error.
When I went to Scholar initially it asked me to fill in one of those barely legible "you are not a bot, type this word" tests (what are these called?), but it still won't allow me to retrieve metadata.
Let me know if there are any other tests you want me to run.
With so many PDFs it will take forever to complete this task.
I tried experimenting with a VPN to change my IP address, but no dice. Deleting cookies with CCleaner and the like, ditto. Nothing seems to work.
This is perplexing. How does Google know, even though the IP has been changed multiple times? Also, is there some clever way I can convince Google I'm not a bot, and allow me to continue retrieving metadata from all those PDFs?
Be nice if Zotero would come up with a workaround.
After reading this thread, I still need help. I retrieved metadata for <100 PDF files yesterday and then received the "Query limit reached. Try again later." error. It has been 24 hours and I have tried again, even one PDF at a time, but to no avail.
I see that the last post here was approximately 2 weeks ago. I fully understand that this issue is very complex and that the developers are perhaps being under-appreciated (as I feel constantly in my current employment!).
I am wondering, however, if anyone has yet found a solution or a way to work around this problem? I see there were many programming/code re-writing suggestions posted, but I have very little experience with this sort of thing and am almost certain to create way more problems for myself. Any luck with even minor tweaks that might fix this issue?
Thank you so much!
For reference, the patch is here: https://github.com/zotero/zotero/pull/433