Query Limit reached?
The point, if we can stay on it, has to do with query limits.
I'm not suggesting scanning everyone's PDFs. I'm suggesting caching metadata results on zotero.org after a user successfully obtains metadata for a given query via Google Scholar. Queries could be sent to zotero.org in the first instance and, if there is an exact match, the cached metadata returned. The lines between ubound and lbound are all that need to be stored (https://github.com/zotero/zotero/blob/master/chrome/content/zotero/recognizePDF.js#L131). It could be as simple as including those lines as a hidden field in the citation entry, which could be extracted when it is synchronised with the Zotero servers. At the very minimum, storing those lines plus the results from Google and aggregating those data would give a better idea of the feasibility of finding a workaround.
Also, it appears that CrossRef does limit queries. I'm blocked at the moment and the PDF contains a DOI on the first line. I even removed the spaces in the DOI of the .txt version and it still won't work.
$ head fulltext\(1\).txt
PSYCHOMETRIKA -- VOL . 75, NO . 4, D ECEMBER 2010 DOI:10.1007/S11336-010-9182-4
There's what I think is a bug in Zotero right now, where metadata retrieval via DOI will not proceed if Zotero determines that Google Scholar has blocked you. But if you do metadata retrieval on that file only, it should connect to CrossRef.
Yes, right-clicking and selecting Retrieve Metadata on that file only gives an error: http://i.imgur.com/v79TQ3C.png?1
DOI : 10.1007/ S 11336-010-9182-4
The spaces in the DOI prevent Zotero from treating this as a DOI, so it jumps to Google Scholar queries. I'm not sure why there's no space in your pdftotext output on either side of the S. I do see that (just like on my system) you get a space after the D in DECEMBER.
There are no spaces because I already took them out to test the theory you just described. However, the blocking error still appears even without said spaces, which is why I suggested that CrossRef may be limiting as well. If taking out the spaces had fixed the problem, I would have suggested adding a few \s* to the DOI regex, such as:
/DOI\s*:\s*10\.\s*[0-9]{4,}\s*\/\s*[^\s]*\s*[^\s\.,]/.test('DOI : 10.1007/ S 11336-010-9182-4')
true
If I touch up the PDF to remove the spaces, the extracted DOI looks like
DOI : 10.1007/S11336-010-9182-4
and retrieving metadata succeeds. I'd be interested to investigate why your touch-up did not work, because pdftotext does output the same DOI (Edit: I mean the same DOI as I see with my touch-up). If you're inclined to tinker with this, you can enable debugging in Preferences -> Advanced -> General and perform the Retrieve Metadata action. If you click on View Output afterwards, you should see exactly what's going on. I don't think we would consider this, because it would mess up a lot more than it would fix.
Edit: Following your edit: that regex does match the string, but it would only extract "DOI : 10.1007/ S 1", and even if we fix it to be
/10\.\s*[0-9]{4,}\s*\/\s*[^\s]*\s*[^\s\.,]*/
it would still not be acceptable. I understand that this regex would fix the problem you are experiencing right now, but Zotero must be able to correctly handle various other cases as well. E.g. who's to say that the DOI could not be "10.1005/123 S 11336-010-9182-4"? Also, what about correctly formatted DOIs?
/10\.\s*[0-9]{4,}\s*\/\s*[^\s]*\s*[^\s\.,]*/.test('DOI : 10.1007/S11336-010-9182-4 hello!!')
true
It matches, but it would extract "10.1007/S11336-010-9182-4 hello!!" as the DOI.
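One way to make the tradeoff concrete: a hypothetical two-pass extraction (not Zotero's code) that keeps the strict pattern and only strips spaces as a fallback, on the text after a "DOI:" label. Note that the fallback still inherits the over-matching problem whenever other words follow the DOI on the same line, which is exactly the objection raised here; it only helps when the spaced-out DOI ends the line:

```javascript
// Illustrative two-pass extraction, not Zotero's actual implementation.
// A reasonably strict DOI pattern: stops at whitespace and refuses a
// trailing "." or "," instead of swallowing everything.
const DOI_RE = /10\.[0-9]{4,}\/[^\s]*[^\s.,]/;

function extractDOI(text) {
  // Pass 1: look for a well-formed DOI as-is.
  let m = text.match(DOI_RE);
  if (m) return m[0];
  // Pass 2: pdftotext sometimes inserts stray spaces inside the DOI,
  // so take the rest of the line after a "DOI:" label, strip internal
  // spaces, and retry the strict pattern. Caveat: if other words follow
  // the DOI on the same line, they get glued on, i.e. the same
  // over-matching problem as the permissive regex.
  const labeled = text.match(/DOI\s*:\s*([^\n]+)/i);
  if (labeled) {
    const squeezed = labeled[1].replace(/\s+/g, '');
    m = squeezed.match(DOI_RE);
    if (m) return m[0];
  }
  return null;
}

console.log(extractDOI('DOI : 10.1007/ S 11336-010-9182-4'));
// prints "10.1007/S11336-010-9182-4"
```

This recovers the Psychometrika example, but it is only a sketch of the design space, not a proposal for master.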
https://github.com/zotero/zotero/blob/master/chrome/content/zotero/recognizePDF.js#L145
and
https://github.com/zotero/zotero/blob/master/chrome/content/zotero/recognizePDF.js#L191
So what's going on in your case is that CrossRef is failing for some reason, or not getting queried at all, and then Zotero reports hitting the GS query limit. [Overlapping with aurimas here, but leaving this in for the code passages.]
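To make that flow concrete, here is a rough sketch of the fallback order as I understand it from those passages. All function names here are hypothetical; the linked recognizePDF.js lines are the real code:

```javascript
// Hypothetical sketch of the fallback order, not Zotero's actual code.
function recognize(text, lookups) {
  const doi = /10\.[0-9]{4,}\/[^\s]*[^\s.,]/.exec(text);
  if (doi) {
    const item = lookups.crossref(doi[0]); // try CrossRef first
    if (item) return item;
  }
  // No DOI found, or CrossRef failed: fall back to Google Scholar.
  if (lookups.scholarBlocked()) {
    // This is the confusing part: the user sees a Scholar query-limit
    // error even when CrossRef was the step that actually failed.
    throw new Error('Query limit reached. Try again later.');
  }
  return lookups.scholar(text);
}
```

The upshot: the "query limit" message does not necessarily mean Google Scholar was ever queried for that file.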
You can look at the debug output
http://www.zotero.org/support/debug_output
to see what's happening. It's quite detailed for Retrieve Metadata.
Clearly I'm not suggesting that the one-off hack above be merged into master or I would have submitted a pull request. I'm just trying to point out a few ideas to deal with the problem at hand of rate limiting.
There are several small changes which could make things easier on end users until a better solution is found.
For example, a DOI for a JSTOR article can be constructed by taking the identifier after stable/ in the article's URL and prepending 10.2307/, which CrossRef will then resolve. The URL is included in the first page of each JSTOR PDF. As it stands now, though, JSTOR PDFs don't contain a DOI, so they are, unfortunately, routed to Google Scholar. I'm sure there are dozens of tricks like this which could be found and added as alternatives to querying Google Scholar. Many large publishers put the DOI in the document's URL, which is often in the PDF metadata even if the DOI itself is not.
Besides, I think ultimately people are barking up the wrong tree. IMHO, the core problem here is the lack of (access to) full-text indices of the scientific literature. Ideally we shouldn't have to rely on a commercial party like Google for full-text lookups.
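A sketch of that JSTOR trick, assuming the stable URL can be found in the extracted first page. The function name is mine; 10.2307/ is JSTOR's registrant prefix, but not every stable ID is guaranteed to resolve, so the result should be treated as a candidate to verify against CrossRef:

```javascript
// Illustrative only: derive a candidate DOI from a JSTOR stable URL
// found in extracted PDF text. Verify the result against CrossRef
// before trusting it; not every stable ID resolves as a DOI.
function jstorDOI(text) {
  const m = text.match(/jstor\.org\/stable\/([0-9]+)/i);
  return m ? '10.2307/' + m[1] : null;
}

console.log(jstorDOI('Source: http://www.jstor.org/stable/1234567'));
// prints "10.2307/1234567"
```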
I have gone/am going through the same situation. Importing large numbers of PDFs is not a pleasant experience. I've offered many ideas which may help alleviate the problem in the short term. I think it's short-sighted to just brush off these complaints.
aurimasv, e.g., submitted the last major patch that improved retrieve metadata.
But someone has to implement these things, and you'll have to accept that people have different priorities. Specific suggestions like yours are appreciated, but as I'm sure you know, they are usually the easiest and always the least work-intensive part of coding...
And you'll have to leave the decision about what level of rudeness is acceptable here to the people who provide the support here on a daily basis ;)
I had recently decided to adopt Zotero and move over from Qiqqa/Papers. I have an existing library of about 1700 papers with pdfs and wasn't happy with the errors Qiqqa had made to the metadata, either when finding metadata within existing pdfs or when pulling down new papers. I switched to Papers which, whilst better in many ways, made a bunch of (different) errors, mainly to journal titles.
I tried out shifting a few papers over to Zotero and was really impressed by its ability to find the correct metadata for my pdfs. As you can imagine, I did not want to import my library over (as an RIS file etc) as it would be imported with all those errors, so this ability was a big reason for making the switch.
So I spent hours building nested collections etc only to run up against this brick wall a couple of hundred papers in. As the forums are not a place to vent frustration or anger, I won't. All I'll say is that I'm sure you can imagine how very frustrated I am. Whilst I really like the look of Zotero and recommend it to my students, I am now thinking about going back to Papers and manually correcting all its errors. Please note, I decided to move over to Zotero even though I'd just spent $70 on a Papers license, because I was so impressed by it. So the decision to now give up on Zotero is not taken lightly.
myqlarson is right. People adopting a piece of software like this are likely to be doctoral students or those even further down the academic path and are therefore likely to come with their own collection of pdfs. This query limit is likely to put a lot of people off adopting your software. It is putting me off right now.
I'm trying to get someone to test an idea I have.
No one is happy with the current situation, and there are certainly ways we could improve things, but the potential solutions that would have the biggest impact (e.g., running the detection very slowly in the background, some sort of server-based solution from zotero.org) are nowhere near as trivial as some of the people in this thread make them out to be.
Dan I appreciate that you understand that this is a problem and that it is not an easy fix. Thanks for staying on it.
Aurimas - do you mean import into BibTeX or something similar? You access this function by clicking Scholar's cite link, and then there are a bunch of links to do this through BibTeX, EndNote, etc. If I choose the BibTeX option it sends me to a page with the BibTeX data:
@article{perez2012etiology,
title={The etiology of psychopathy: A neuropsychological perspective},
author={Perez, Pamela R},
journal={Aggression and Violent Behavior},
year={2012},
publisher={Elsevier}
}
If I choose the EndNote option it downloads a .enw file. I double-checked whether this is because the one-day (or however long it is) block is over, and it is not: I tried retrieving metadata for a PDF and got the same query limit error.
When I went to Scholar initially it asked me to fill in one of those barely legible "you are not a bot, type this word" tests (what are these called?), but it still won't allow me to retrieve metadata.
Let me know if there are any other tests you want me to run.
With so many PDFs it will take forever to complete this task.
I tried experimenting with a VPN to change my IP address, but no dice. Deleting cookies with CCleaner and the like, ditto. Nothing seems to work.
This is perplexing. How does Google know, even though the IP has been changed multiple times? Also, is there some clever way I can convince Google I'm not a bot, and allow me to continue retrieving metadata from all those PDFs?
Be nice if Zotero would come up with a workaround.
After reading this thread, I still need help. I retrieved metadata for <100 PDF files yesterday and then received the "Query limit reached. Try again later." error. It has been 24 hours and I have tried again, even one PDF at a time, but to no avail.
I see that the last post here was approximately 2 weeks ago. I fully understand that this issue is very complex and that the developers are perhaps being under-appreciated (as I feel constantly in my current employment!).
I am wondering, however, if anyone has yet found a solution or a way to work around this problem? I see there were many programming/code re-writing suggestions posted, but I have very little experience with this sort of thing and am almost certain to create way more problems for myself. Any luck with even minor tweaks that might fix this issue?
Thank you so much!
For reference, the patch is here: https://github.com/zotero/zotero/pull/433