Query Limit reached?

I am new to using the standalone program. I have been using the retrieve metadata function for my PDFs and I received an error that said "query limit reached, try again later." Can someone explain what this means and what "try again later" means (e.g., how long I would have to wait)? Thanks
  • Zotero queries Google Scholar (among other sources) to retrieve metadata for PDFs. Google Scholar has a mechanism that locks you out when it suspects you're a bot - i.e. when you make a large number of rapid requests.
    Since this isn't something that Zotero does, we can't give you exact information on how long it will take until this works again for you. Usually it's less than a day. You can check whether the block has lifted by trying scholar.google.com directly.
  • I, too, am new to using the standalone program. I was trying to integrate all the old PDFs I had on my computer (probably a few hundred), and I triggered this message. Only problem is that it has been well over a day and I still cannot retrieve metadata. I can use Google Scholar just fine, although I usually have to submit a captcha first.
  • Unfortunately, we have no influence on the types of restrictions Google Scholar imposes. If you have a way to change your IP address (e.g. by moving to another network), that would likely work.
  • I just ran into this problem too. I started using Zotero yesterday and today I uploaded ~100 PDFs from one of my directories and asked it to retrieve metadata for them. Bang! Google Scholar shut me down. Now I need to jump to another IP (and potentially get that one blocked) or wait until some unspecified later time (date? year?) to continue developing this aspect of my Zotero collections.

    If I may, let me offer a suggestion to Zotero developers. Rather than flooding Google Scholar with requests, one right after another, insert random brief pauses between requests. This will spread out the request load to Google Scholar and make it less likely that users will run into this difficulty. You could even make user options related to this, like the --wait and --random-wait flags to wget.
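
    To make the idea concrete, here is a rough sketch of the kind of randomized pause I have in mind (all names are made up for illustration; this is not Zotero's actual code):

    ```js
    // Sketch of randomized inter-request pauses, in the spirit of wget's
    // --wait/--random-wait. Everything here is illustrative, not Zotero code.
    function randomDelay(baseMs) {
        // vary the wait between 0.5x and 1.5x the base value, like --random-wait
        const ms = Math.round(baseMs * (0.5 + Math.random()));
        return new Promise(resolve => setTimeout(resolve, ms));
    }

    async function retrieveAllWithPauses(pdfs, lookUpOne, baseMs = 2000) {
        for (const pdf of pdfs) {
            await lookUpOne(pdf);       // one metadata query (e.g. a Google Scholar lookup)
            await randomDelay(baseMs);  // pause before hitting the server again
        }
    }
    ```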
  • Zotero already does that to some extent, but Google isn't stupid, of course, and the types of bots that Google Scholar is trying to keep out could do exactly the same, so they account for the possibility of such random pauses.
    And once you make the pauses too long, you're affecting performance and confusing users.
  • Adam,

    Thanks for your quick response.

    I'm glad to hear that Zotero already implements my suggestion to some extent. That should make it quite easy to extend that feature enough to work well with Google Scholar when adding metadata for numerous PDFs.

    You are right that Google isn't stupid. However, they also aren't omniscient. Their algorithms need to guess based on usage patterns whether a query comes from a user or a bot. If their algorithms produce lots of false positives - flagging real users as bots - they'll lose popularity. So, their algorithms will generally err on the side of letting in bots as long as those bots aren't overloading their servers or pulling down enough data to set up rival services. So, the answer is to simply slow down the rate of queries and add a large variance to the query interval.

    Regarding performance, that is an excellent point. Zotero should be concerned with its performance. Indeed, consider this: I put in a directory of ~100 PDFs, as I mentioned above. Zotero collected metadata for fewer than 50 of these before Google Scholar shut it down. Now I need to wait perhaps a day until I can try again for the rest. That comes out at a rate of approximately one successful query per half hour. That is incredibly poor performance and it requires the user to retry until success is achieved. Even if Zotero waited on average 3 minutes between queries, that would be a 10x increase over the current system of rapid requests followed by long shut-out periods. An average wait time of 18 seconds would be a 100x increase over the current system. In other words, my suggestion would greatly improve Zotero's performance, not detract from it.
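
    To spell out the rough arithmetic behind those numbers (assuming ~50 successful lookups followed by roughly a day of lock-out under the current behavior):

    ```js
    // Rough throughput comparison; all numbers are approximate.
    const currentPerHour     = 50 / 24;    // ~2 per hour, i.e. about one per half hour
    const threeMinPerHour    = 60 / 3;     // 20 per hour -> roughly a 10x improvement
    const eighteenSecPerHour = 3600 / 18;  // 200 per hour -> roughly a 100x improvement
    ```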

    As for confusing users, I contend that the users of Zotero are intelligent enough to understand simple informative dialogs and progress bars. "Note that you have submitted 100 queries for PDF metadata. To avoid overloading the servers containing this metadata, Zotero will queue these requests and send them periodically. Progress can be tracked in the progress bar located ..." This is, I'd say, quite a bit less confusing than the current system of failing with a vague error message.

    Thanks again.

    Dean
  • edited June 25, 2013
    but then you're screwing users with 10 PDFs who'd otherwise just sail through but now have to wait 30 minutes, which is also in the vicinity of a 50-fold slowdown. And this affects daily use of Zotero rather than initial import.

    Anyway, playing cat-and-mouse with Google's bot protection isn't a dev priority, so this is unlikely to change anytime soon. Zotero is working on a solution that will enable it to use its own data for some queries, which would further reduce the number of hits on Google Scholar (already, PDFs with a DOI early in the text go through CrossRef and don't hit Google).

    Zotero is also open source, so if you want to patch this for your own use you're obviously welcome to.
    edit: relevant code is here:
    https://github.com/zotero/zotero/blob/master/chrome/content/zotero/recognizePDF.js#L70
  • @Simon - looking at the code, though, couldn't we hidden pref the GS delay? That way someone importing a whole library could set the pref to 60k and probably sail through. Or too obscure?
  • I don't think we want to pref this. Maybe we could keep a log of recent lookups and make the rate adaptive, though.
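    A rough sketch of what that kind of adaptive rate could look like (purely illustrative; names and thresholds are made up, not a committed design):

    ```js
    // Illustrative only: remember when recent lookups happened and stretch the
    // delay as the request rate within the window goes up.
    class AdaptiveDelay {
        constructor(baseMs = 2000, windowMs = 10 * 60 * 1000) {
            this.baseMs = baseMs;       // delay used while the recent-lookup count is low
            this.windowMs = windowMs;   // how far back to look
            this.recent = [];           // timestamps of lookups within the window
        }

        nextDelay() {
            const now = Date.now();
            this.recent = this.recent.filter(t => now - t < this.windowMs);
            this.recent.push(now);
            // e.g. the first 20 lookups in the window go at the base delay,
            // then back off roughly linearly with the backlog
            const over = Math.max(0, this.recent.length - 20);
            return this.baseMs + over * 1000;
        }
    }
    ```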
  • Adam,

    Thanks again.

    You are right that 10 PDFs would take 30 minutes if Zotero automatically went to an average of 3 minutes between queries regardless of the number of queries. (And it would take 3 minutes for 10 PDFs if the average wait was set at 18 seconds.) However, there is no need for the code to be this simplistic. It would be very easy to leave the delay at the current 2 seconds for small numbers of queries and bump this up as the number of queries increased. Algorithm adaptation based on number of items is a quite common coding technique -- many sorting libraries work like this behind the scenes, switching from a sorting algorithm with a low constant factor but poor big-O performance for small N to an algorithm with a large constant factor but good big-O performance for large N, giving users the best of both worlds.
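
    Something along these lines is all I have in mind (a rough sketch; the thresholds are invented for illustration):

    ```js
    // Purely illustrative: pick the delay based on how many PDFs are queued,
    // much as hybrid sorts pick an algorithm based on N. Thresholds are made up.
    function delayForBatch(queueLength) {
        if (queueLength <= 10) return 2 * 1000;   // small batches: keep the current ~2-second delay
        if (queueLength <= 50) return 30 * 1000;  // medium batches: back off somewhat
        return 3 * 60 * 1000;                     // large imports: slow right down to stay under the radar
    }
    ```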

    In any case, I'm sorry to hear that these simple code changes are not a development priority, since I see that this problem has been around for at least four years now and could probably be adequately addressed with minor effort. Still, I am pleased that Zotero is working on an alternative solution and I hope that it is rolled out soon.

    You are also right that this affects initial import rather than daily use. Indeed, my impression so far is that Zotero has many nice features for daily use but is less well suited for many aspects of initial import. This is, unfortunately, a huge hurdle to widespread adoption of an otherwise apparently excellent tool.

    Think of it this way: If I am an established researcher, with a library of hundreds or thousands of documents that I want to organize and share with my research group, then I will care greatly about how difficult it is to accomplish this initial import. After all, I have a system that I am using now but it isn't great, which is why I am looking around for an alternative. However, if the Zotero alternative provides huge hurdles to adoption, then it isn't worth it for me to switch over to it. I might as well stick with what I am doing.

    Of course, there is an alternative to being the established researcher with the large library. I could be just starting out. Even here, though, one must realize that the usual pattern for a would-be researcher to become an actual researcher is for that person to join an existing research group that has an extant library of hundreds or thousands of documents, which brings us back to the first case.

    I wonder what proportion of potential Zotero users give up due to a difficult adoption path and what proportion decide to press on despite it.

    Thanks also for providing the link to the relevant code. I will certainly consider patching it and going from there. That is, if I decide to press on with Zotero.


    Dean
  • Ooops! That is what I get for responding too slowly. I'd be happy with either the pref or the rate adaptation approach!
  • By the way, my apologies if I slip into lecturing mode too much. With it being summer and all, I'm not getting my weekly dose of lecturing in.
  • Yet another reason to move to Mendeley. Much more reliable metadata extraction (actual metadata extraction, not look-up).
  • no, Mendeley also looks up metadata (including in their own data). I think they fall back to XMP if all else fails, but that's usually too bad to be of any use. On the other side of the coin, their web import is much worse.

    But there's no point to posts like this. We have a pretty good idea of what Mendeley does better and worse.
  • Thanks for clarifying. I'm not sure I'd say "there's no point" since you were able to clarify my error. Thanks for doing that.

    Here's some bigger picture input and comments below on why I'm going back to Mendeley. While I've found the Zotero help forums and the developers that post regularly (like you) helpful, I have to say that Mendeley's responses have always been more positive in my experience. I know you're both different kinds of entities with different levels of staffing, etc, but I often feel like Zotero's responses are more like "No, we don't/won't do that" or "You can develop that if you want to" and Mendeley's are more like "We don't do that now but we're working on it." Of course this is only from my own exchanges and maybe isn't a representative sample of all support posts.

    Most of us aren't developers, and while we don't mind spending a few hours poking around in software and tweaking it to help with our research workflow, we really want a program that "just works" (more or less) and removes as many manual steps as possible. I think Zotero is still missing this point in a few areas (see below).

    I know you said you know all the differences between Z and M, but here are the reasons I'm switching back to M, just for your reference. I suppose some are just user preferences, but for me they're my big hang-ups with Z (By the way, I thought the last Z release was a major improvement. Made a few things easier. Keep up the great work!).

    1) Search: Being able to search the text of PDFs. Also, something about their search seems more intuitive, though I can't tell if it's catching items in my group collections, too. If you guys could develop a search function that searches across libraries in one search, that would be a MAJOR plus for Zotero.

    2) More storage: That will be hard to match as an academic project. That you can sync by WebDAV is a big plus, but I've spent some time trying to make that work, and it just isn't working with the size of my library and the money I have to spend on storage right now.

    3) Watched folder: I REALLY like being able to drop files somewhere and have them read in, rather than having to drag and drop to get them into my citation manager.

    4) File organization: I really hate having to right click and click "rename PDF from parent metadata" whenever I want to rename the PDFs. Love that M does that automatically.

    5) Auto metadata extraction: Even if you're using the exact same methods of metadata extraction, Mendeley does it automatically. In Zotero I have to select items and right click and click "Retrieve metadata". Doesn't sound like much but when trying to organize thousands of files it's really time consuming.

    I've never really used the web importer much in either program, just b/c of my work flow, so I don't miss that feature in M.

    Don't get me wrong, Z is a leader! You guys should be commended for the program and how much it's come along over the past few years. Just wanted to give you my reasons for switching so you can add them to your customer feedback log.
  • We've talked about just writing "we're working on it" in response to requests that are unlikely to be implemented. Sounds positive, user feels good, you don't have to deal with a follow-up explaining why not...
    I've always felt like users of software, especially software designed for academics, are smart enough to get direct and honest answers, not customer-rep talk.

    That said, if your main workflow is importing PDFs directly, Mendeley is the better product for you without doubt. As a warning, you'll spend a fair amount of time cleaning up data. PDF look-up is imperfect, even in Mendeley
  • I know where you're coming from. In my office, there are some people who have to give the "customer servicey" responses and I'm the one who has to be more direct and critical sometimes. And given the fixed number of hours in the day, time's better spent developing than responding to comments and making users feel good :) I feel your pain!

    I think you hit the nail on the head. Right now I'm consolidating my citable PDFs, which is where I've spent most of my time getting hung up in Z. I can see how it would be (and it has been) relatively simple to use with a few files at once. Maybe I should try working web import into my workflow, too. I just like having the PDFs in my library and synced, so I've always just downloaded them and then brought them into my management software.
  • Just wondering: would it be technically possible to leave the choice of the delay between queries to the individual users? Maybe adding it into the preference panel?
  • It would be technically possible, but Dan rejected putting it in the prefs above (and he's right - too many prefs aren't a good idea). But if someone wants to play with the code to come up with a more adaptive delay routine, I'll assume patches would be welcome.
  • I am having the same issue :( -- Love Zotero anyway :)
  • When I do research, I'll harvest perhaps 100 or more articles, documents, etc. over the course of a few days and check them as I go, and I still get dinged by GS. Anyone doing systematic reviews will understand.

    Have the folks at Zotero discussed this problem with the folks at Google? The problem is 1. not going to go away unless it is addressed head-on, and 2. only going to get worse (there is more to search, and more people searching). And 3. metadata is important (and is becoming a type of currency).

    If it would make Google happy, we could always log on through Google Scholar profiles, as that would at least tell them we were real people.
  • But I don't understand - when you do systematic reviews, why on earth would you download the PDFs individually and then retrieve metadata? Why not go through actual databases (or even through Google Scholar directly) and use Zotero there? My understanding is that for a systematic review you enter certain search terms in multiple databases and import all relevant articles. You should use Zotero right on import there.

    The major reason Zotero devs see this as a not-super-crucial issue is that the only time you should hit this during normal Zotero use is when initially importing a large set of PDFs that weren't managed by a different ref manager.

    With normal use, retrieve metadata should just not be something that you use all that much - not just because of the query limit, but also because google scholar data is incomplete and frequently even incorrect.
  • edited September 14, 2013
    But in either case, I'm not really sure that there is anyone we can talk to. I have sent a couple emails to Google about this, but there was never any response. Maybe someone knows an actual person that we can email, because generic Google support channels seem to be giant black holes.
  • I've mentioned this to Simon once and he wasn't hopeful about the talking to google option, either.
  • Adamsmith's point is well-taken. Perhaps Google needs to understand better how researchers organise information when they do what they do. I do tend toward the life sciences databases, but they don't capture other, non-peer-reviewed works, or draft papers posted by academics who have (regretfully) not published in open access journals (especially if their research was funded publicly -- another issue for another place....)

    The co-inventor of Google Scholar is Anurag Acharya. His Google Scholar profile is here: http://scholar.google.com/citations?hl=en&user=nGEWZbkAAAAJ&cstart=20&sortby=pubdate&view_op=list_works&pagesize=100

    Email: acha@google.com

    In an interview in 2007: "By the way, although Acharya described the service in terms of outreach to other countries and other languages, he assured me that Google is "happy to work with anyone interested." In fact, the company is currently in negotiations with a Canadian scholarly society. Acharya said that content from the new digitization program should start entering Google Scholar before the end of the year."

    That speaks for itself.

    The other co-inventor is Alex Verstak, but Dr Acharya seems to be the one who is most often engaging in discussions with users.

    There is also Microsoft Academic Search.
  • we've looked at MS Academic Search - it's nowhere near where it would need to be in order to be useful as a GS alternative for retrieve metadata, unfortunately (though we do have a translator for importing directly from it, of course).
  • edited September 30, 2013

    But I don't understand - when you do systematic reviews, why on earth would you download the PDFs individually and then retrieve metadata? Why not go through actual databases (or even through Google Scholar directly) and use Zotero there? My understanding is that for a systematic review you enter certain search terms in multiple databases and import all relevant articles. You should use Zotero right on import there.
    — adamsmith Sep 14th 2013

    Because not everyone uses Zotero in Firefox. Some people use Zotero as a standalone programme. I download PDFs individually when doing a systematic review and then import them into Zotero in one go, but unfortunately I can't retrieve the metadata without getting blocked.

    I think the solution is to set up a metadata server which is used in the first instance. Libraries are already synced with zotero.org, so why not query the metadata there first, since it has access to everyone's library anyway, with a fallback to other sources?

    Another option might be to display the Captcha test for the user to solve. Using the standalone version, it's very difficult to get unblocked: changing IPs and/or solving the Captcha doesn't resolve the block.

  • Because not everyone uses Zotero in Firefox. Some people use Zotero as a standalone programme. I download PDFs individually when doing a systematic review and then import them into Zotero in one go, but unfortunately I can't retrieve the metadata without getting blocked.
    that has nothing to do with Zotero for Firefox or Standalone. Standalone works with all common browsers (via extension for Chrome&Safari, bookmarklet for all others) and allows you the same one-click import of data, so downloading the PDFs individually and then retrieving metadata is still an inefficient way of doing things. Obviously you can use Zotero in any way you want, but it'll work better when you use it as it's intended to be used.
    so why not query the metadata there first since it has access to everyone's library anyway with a fallback to other sources?
    While some degree of this is planned, the technical difficulties are formidable: Zotero almost certainly can't index the full text of PDFs the way google does for a combination of technical, legal, and privacy reasons. Hash sums will work for some files, but it's increasingly common for publishers to embed identifying information into PDFs (such as download date and university), so those will not work for all that many files. There are more concerns here, both in terms of technical feasibility and in terms of privacy/permissions.
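    The hashing itself would be the easy part; a minimal sketch, assuming a metadata lookup service that doesn't currently exist (the endpoint here is entirely hypothetical):

    ```js
    // Minimal sketch of a hash-based lookup with a fallback. The lookup URL and
    // response format are hypothetical; only the hashing is standard.
    const crypto = require('crypto');
    const fs = require('fs');

    function md5OfFile(path) {
        return crypto.createHash('md5').update(fs.readFileSync(path)).digest('hex');
    }

    async function lookUpByHash(path, fallback) {
        const hash = md5OfFile(path);
        const res = await fetch(`https://metadata.example.org/lookup?md5=${hash}`);  // hypothetical service
        if (res.ok) return res.json();
        return fallback(path);  // e.g. fall back to a text-based Google Scholar query
    }
    ```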
  • How one chooses to use Zotero doesn't change the fact that there's a demand for bulk metadata retrieval, and even if it's only one time in the majority of cases, it's likely to be the first experience for new users. Using what seems to be a killer feature of a spiffy new app and then finding out that it only works for the first 50 or so PDFs and then locks you out of Google Scholar may not be the first impression that Zotero is aiming for.

    While some degree of this is planned, the technical difficulties are formidable: Zotero almost certainly can't index the full text of PDFs the way google does for a combination of technical, legal, and privacy reasons.
    — adamsmith

    You wouldn't need to, just the first n characters needed to uniquely identify an article, or just a line or two every n lines up to a maximum. It wouldn't need to be a foolproof, full index system, just enough to catch 50% or more of requests that would otherwise go to Google.
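
    Roughly what I'm picturing, as a sketch (the normalisation and the value of n are just examples):

    ```js
    // Sketch of a lightweight text fingerprint: the first n characters of the
    // extracted text, normalised a little so layout differences don't matter.
    function textFingerprint(extractedText, n = 500) {
        return extractedText
            .replace(/\s+/g, ' ')  // collapse whitespace
            .trim()
            .slice(0, n)
            .toLowerCase();
    }
    ```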

    Displaying the Captcha to the user might also help mitigate any negative UX because it would show the user that it's a limitation with Google, not Zotero.

  • oh, I don't disagree that this should work better, I just think that systematic reviews are not a major reason and if people are using retrieve metadata for systematic reviews, they're likely misunderstanding something about Zotero.
    As I say above, initial imports are the major use case for this.


    I'd be very skeptical about Zotero running pdftotext on every PDF people store in Zotero. To me, telling Zotero users that their PDFs are, essentially, read by Zotero's server sounds like a bad idea. Even if devs disagree, it would require an entirely new process on the server and so would most certainly not be simple technically. Since false positives are a major issue and much worse than failures, the system does need to be quite good at preventing those (which is why first n characters doesn't work well). We might get some percentage via hash sums, which would never produce false positives - that's the current plan, I believe. And we already do get a good number via DOIs, which go through CrossRef (which doesn't have a query limit we're aware of), but I'm not super-optimistic that that will reduce lock-outs dramatically (I could be wrong, though).
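
    For what it's worth, the DOI case is the straightforward one because a DOI can usually be pulled out of the extracted text with a simple pattern; a rough sketch, not the actual recognizePDF code:

    ```js
    // Rough sketch of spotting a DOI in extracted PDF text. The pattern is the
    // commonly used one for modern DOIs; this is not the actual Zotero code.
    function findDOI(extractedText) {
        const match = extractedText.match(/10\.\d{4,9}\/[-._;()\/:a-zA-Z0-9]+/);
        return match ? match[0] : null;
    }
    ```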


    I'm not sure we know exactly how Google determines the lock-out; it might check multiple things like IP, cookies, etc. I don't think Standalone can display the captcha (though Simon would have to say for sure) and I doubt whether it would make a difference if it did: Standalone doesn't save any cookies afaik and that would be the only other identifying feature.

    Beyond that, Zotero's entire code is open, so any specific recommendations on technical remedies are best discussed in terms of specific patches.