Retrieve Metadata For PDF Unexpected Error - Workaround & request for more info

Omer Barak · October 14, 2013

Hi,

I'm using Zotero standalone 4.0.11 on Windows 7 64bit and Chrome 30.

I kept getting "An unexpected error occurred" when retrieving metadata for PDFs.
What solved the problem was bypassing (that is, not using) my University proxy server. Alas, I must be connected to the proxy in order to openly access certain magazines and other databases...

The proxy server is defined through Windows' "global" internet options with the "automatic configuration script" option (this script is basically just a text file with the proxy address that also defines exceptions for sites that must use a direct, no-proxy connection; see http://en.wikipedia.org/wiki/Proxy_auto-config).
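
For readers unfamiliar with PAC files: a real PAC file is JavaScript that defines a FindProxyForURL(url, host) function returning a string such as "DIRECT" or "PROXY host:port". The sketch below only models that decision in Python for illustration. The proxy address and the exception patterns are made-up placeholders, not any institution's actual configuration.

```python
from fnmatch import fnmatchcase

# Hypothetical values, for illustration only.
PROXY = "PROXY proxy.example.edu:8080"
DIRECT_HOST_PATTERNS = ["localhost", "*.example.edu"]


def find_proxy_for_url(url: str, host: str) -> str:
    """Model of the decision a PAC file's FindProxyForURL(url, host) makes."""
    host = host.lower()
    for pattern in DIRECT_HOST_PATTERNS:
        # Shell-style matching, similar in spirit to PAC's shExpMatch helper.
        if fnmatchcase(host, pattern):
            return "DIRECT"  # exception: connect without the proxy
    return PROXY  # everything else goes through the proxy
```

With these placeholder values, a request to intranet.example.edu would go direct, while a request to www.jstor.org would be sent through the proxy.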

So I have 3 comments:
1. The Retrieve Metadata feature should have better error messages... even "General Error connecting to X" would have been more informative, and would have helped me find the cause...

2. I don't know if it's possible, but it would be nice if the metadata retrieval was done in a way so a proxy server won't interfere with it.

3. Is there a specific web-service address (e.g. api.scholar.google.com) that the metadata retrieval is connecting to? If so, please tell me so I can ask my University IT to put it in the proxy exception list to allow direct access and work around this issue.

Thanks,
Omer

dstillman · October 14, 2013

Can you provide a Debug ID for a metadata retrieval attempt that's failing?

Omer Barak · October 16, 2013

UPDATE: when working with the university proxy but from within campus (connected to the local university network) the metadata retrieval works.

So I've submitted three debug logs:
Debug ID: D865055045 - Not Working – From home using proxy server.
Debug ID: D420619875 - Working – From home not using proxy server.
Debug ID: D1903733636 - Working – From campus using proxy server.

Thanks for the prompt response!
Omer

Omer Barak · October 16, 2013

Another UPDATE:
I've just used the chrome connector to import a paper from the IEEE Xplore site (first time I'm importing from that site).
First, the regular Zotero pop-up showed in Chrome. It marked that Fulltext download failed (a red X next to it) - but then Zotero Standalone opened an "Authentication Required" dialog box asking me for my proxy username and password. The exact text in the dialog box was:
The proxy {ADDRESS REMOVED} is requesting a username and password. The site says: "remote access service".
After inserting those details metadata retrieval began working perfectly.
So, I conclude the problem is that trying to retrieve metadata doesn't make the authentication window appear.

I should mention that the authentication window never appeared before when I imported from several other sites (ACM DL, Google Scholar, Citeseer...) but when I browse to these sites using Chrome, a Chrome proxy authentication window appears (though it doesn't seem to affect Zotero's behaviour).

To sum up this long post... I restarted my computer so that the authentication dialog box would appear again and submitted three more debug logs:

Debug ID: D494160086 - *Chrome Connector* debug log when importing from IEEE Xplore.

Debug ID: D1758286705 - *Zotero Standalone* debug log when importing from IEEE Xplore. The steps I took were: Starting Chrome, Going to IEEE site, Clicking the "Save to Zotero" icon, pressing OK on the authentication dialog box (my details were already filled in).

Debug ID: D1454335864 - Successful metadata retrieval after the authentication process finished.

Thanks Again,
Omer

dstillman · October 17, 2013

Zotero Standalone currently doesn't work great with authenticated PAC setups. It's true that the PDF metadata lookup doesn't trigger the authentication prompt, but most translator-based saves won't either. (Among other things, having Standalone, which might be in the background, throw up an auth prompt in the middle of a save from a browser isn't really ideal.) The prompt you did see was actually just a side effect of a particular saving technique used for some pages, but that's the exception.

The real solution here is probably to trigger the proxy auth prompt when Standalone first starts up. We actually do that now for general proxies by hitting a static page under our control, but since the domain it's on wouldn't be in a PAC file, that doesn't help with PAC setups. What we'll probably do here is try to parse the configured PAC file (which isn't entirely straightforward, since it's arbitrary JS code) and pick out a proxied host at random and make a HEAD request to it to trigger the auth prompt. (We could just use a big site like JSTOR that's likely to be in the PAC file, but there's no guarantee it would be, and we don't really want to send a lot of extra fake requests at a single site.)

dstillman · October 17, 2013

(I also suspect most institutions don't proxy Google Scholar, so most people with authenticated PAC setups don't have trouble with Retrieve Metadata — they just get failed file saves from some sites.)

dstillman · October 17, 2013

(Ticket created, but we'll post here if there's any progress.)

Omer Barak · October 17, 2013

I understand...
Well, like you said, parsing every possible configuration of PAC files doesn't make sense.
But, you could choose a random site from a list of "Big Sites" (JSTOR, IEEE, ACM, Citeseer, Springer... ). There are at least a dozen, and since HEAD requests are light-weight, I don't think the impact is that bad.

An even better solution would be to publish the address of your static page and tell institutions that if they want Zotero to work for their users they should include that site in their PAC files (reasonable request, since that's what PAC files are for). You can post (or privately send me) the address and I volunteer to be the first to include it in my institution's file...

Also, maybe there's a way to pass the required details from Chrome? Since Chrome itself does pop-up the auth...

Finally, as a fix for now, is there a way I can manually force Zotero to authenticate? (Other than going to IEEE and initiating a save)

dstillman · October 17, 2013

One concern I had with just picking a big site at random from a list was that there'd be no way to guarantee it was proxied by a given institution. However, it looks like we can actually easily test from code whether a given site would be proxied before making a request, so testing a few URLs from a list and making a HEAD request to the first one found is probably a reasonable approach. (Hopefully we can also determine if a PAC file is configured to begin with in order to determine whether the checks are actually necessary.)
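
A rough sketch of that approach, not Zotero's actual implementation: shuffle a list of candidate sites, find the first one that would be proxied, and send it a single HEAD request. The would_be_proxied callback is a hypothetical stand-in for whatever proxy-resolution check the application has available, and the candidate list is illustrative. Note that Python's urllib does not evaluate PAC files itself, so this only shows the shape of the logic.

```python
import random
import urllib.error
import urllib.request
from typing import Callable, Iterable, Optional

# Illustrative candidates based on sites suggested in this thread.
CANDIDATE_URLS = [
    "http://www.jstor.org/",
    "http://ieeexplore.ieee.org/",
    "http://www.sciencedirect.com/",
    "http://web.ebscohost.com/",
]


def send_head(url: str) -> None:
    """Send a HEAD request and discard the outcome; only issuing it matters."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=10):
            pass
    except urllib.error.URLError:
        pass  # a failed response is fine for this purpose


def trigger_proxy_auth(
    would_be_proxied: Callable[[str], bool],  # hypothetical check, supplied by caller
    candidates: Iterable[str] = CANDIDATE_URLS,
    head: Callable[[str], None] = send_head,
    rng: Optional[random.Random] = None,
) -> Optional[str]:
    """Send one HEAD request to a randomly chosen proxied candidate.

    Returns the URL that was requested, or None if no candidate is proxied.
    """
    urls = list(candidates)
    (rng or random).shuffle(urls)  # spread requests so no single site gets them all
    for url in urls:
        if would_be_proxied(url):
            head(url)
            return url
    return None
```

Passing the proxy check and the request function in as parameters keeps the selection logic separate from the networking, so it can be exercised without touching the network.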

"An even better solution would be to publish the address of your static page and tell institutions that if they want Zotero to work for their users they should include that site in their PAC files (reasonable request, since that's what PAC files are for)."

I don't think that's a realistic approach. This needs to just work.

"Finally, as a fix for now, is there a way I can manually force Zotero to authenticate? (Other than going to IEEE and initiating a save)"

Not easily, but we can probably fix this for 4.0.13.

If you want, you can download a copy of your institution's PAC file, add http://zotero.org.s3.amazonaws.com/proxy-auth as a proxied site, put it on a server somewhere (or possibly use a file:// URL), and point your system to that instead. But please don't ask your institution to add that, since it should be easy to fix this on our end and I don't want to set a precedent for bothering IT departments with that. Also remember to switch back to the real PAC URL after 4.0.13 so that you stay up to date with your institution's proxied sites.

adamsmith · October 17, 2013

(@Dan - I'd suggest putting JSTOR at the top of the list you use - that's almost universally supported, including internationally & requires proxy)

dstillman · October 17, 2013

Yeah, I mentioned JSTOR above, but I don't want every Standalone user with a PAC setup to hit JSTOR on startup, so I think the site needs to be picked at random. Suggestions for others (along with the ones suggested by Omer above) are welcome.

adamsmith · October 17, 2013

www.sciencedirect.com, web.ebscohost.com, gateway.proquest.com (or search.proquest.com), http://www.tandfonline.com/doi/full , *.sagepub.com/content (or asr.sagepub.com if you want a specific journal)

dstillman · October 17, 2013

That's great — thanks.

dstillman · October 17, 2013

OK, implemented for 4.0.13. See that thread for more details.

sylphche · October 21, 2013

Hello, can you please help me with the same issue? I cannot export the metadata from the PDF files. I am trying to do this at work - unfortunately I don't understand very well what a proxy is and how it prevents Zotero from accessing the metadata. Also I don't know how to bypass a proxy (the solution given by Omer above).
Can you please tell me step by step what I should try in order to get my metadata exporting working?
I have done it a couple of weeks ago from the same computer at work which no one else uses, so they cannot have changed anything on the PC - it worked fine. Now, I just can't understand why it doesn't want to work. The error message doesn't give any indication to the cause.
Many thanks!

Omer Barak · October 21, 2013

Hi sylphche,

I'm not sure if you can change the proxy setting at your workplace (sometimes organizations block it). Also, since that error message is very general, it might be something else...

But if you want to try it anyway the steps are (on Windows 7/Vista):

1. Open Internet Explorer by clicking the Start button, All Programs, and then clicking Internet Explorer.
2. Click the Tools button, and then click Internet Options.
3. Click the Connections tab, and then click LAN settings.
4. Note which boxes are checked (so you can check them again if you need to re-enable proxy).
5. Now, to disable proxy, uncheck all boxes.

Omer

adamsmith · October 21, 2013

sylphche - you should probably start a new thread and provide more details (how does it fail, what error message are you getting, an error report ID). Omer's problem is very specific and there's a very high chance that the reason you're seeing problems with retrieve metadata is entirely different from his.

sylphche · October 21, 2013

Thank you all,

I will start a new thread, only I am still finding my way in these forums and didn't know the accepted ways of posting things.