Unable to retrieve metadata for PDFs

dstillman · April 10, 2018

This discussion was created from comments split from: "Retrieve metadata for PDF" fails with "an unexpected error occurred".

b0c5 · April 9, 2018

same error here

adamsmith · April 9, 2018

Could you say more, j.cossio?
Which Zotero version, what document?

Retrieve metadata has completely changed since this was reported last year, so it's definitely not the same error under the hood.

b0c5 · April 9, 2018

I am using Zotero 5.0.44.
It happens for multiple documents. Next time I find an example I'll update here.

b0c5 · April 9, 2018

@adamsmith It basically happens for every PDF. Any idea what could be going on? These are PDFs of recent articles, with OCR.

adamsmith · April 9, 2018

Could you produce a debug ID for dragging a PDF to Zotero and the failed (automatic) retrieve metadata?
https://www.zotero.org/support/debug_output

b0c5 · April 9, 2018

I think I just did.
D2057122540

b0c5 · April 9, 2018

By the way, I am behind a proxy. Could that be the source of the problem?

That said, the Zotero Connector works fine, and Retrieve Metadata worked fine until recent updates.

adamsmith · April 9, 2018

Proxy could be related, but if you can submit debug IDs, you should be able to retrieve metadata. Let's see what dstillman & adomasven see in the debug.

dstillman · April 10, 2018

@j.cossio: You're getting a 403 error from your proxy server. As adamsmith says, it's odd that you're able to submit debug output (which is a POST to https://repo.zotero.org), but for some reason your POST requests to https://recognize.zotero.org aren't working, so you'll have to debug that.
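For anyone debugging the same thing, here is a minimal sketch of how one might compare what the proxy does with the two hosts mentioned above. It is not how Zotero itself makes these requests: the empty POST body, the bare base URLs, and the helper names (`describe`, `probe`, `report`) are assumptions made for illustration, and the servers may well answer an empty POST with an error of their own. The point is only to see whether the request gets through the proxy at all.

```python
import urllib.error
import urllib.request

# Hosts named in this thread. POSTing an empty body to the base URL is an
# assumption for illustration, not Zotero's actual request format.
HOSTS = ["https://repo.zotero.org", "https://recognize.zotero.org"]


def describe(outcome):
    """Turn a status code (int) or an error message (str) into a short verdict."""
    if isinstance(outcome, int):
        if outcome == 403:
            return "refused (HTTP 403)"
        if outcome == 407:
            return "proxy requires authentication (HTTP 407)"
        return f"reachable (HTTP {outcome})"
    # For https URLs, Python reports a proxy that rejects the CONNECT
    # request with an error message starting like this.
    if "Tunnel connection failed" in outcome:
        return "blocked by proxy: " + outcome
    return "no connection: " + outcome


def probe(url, timeout=10):
    """POST an empty body to url using the system/environment proxy settings."""
    request = urllib.request.Request(url, data=b"", method="POST")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server (or proxy) answered, but with an error status.
        return err.code
    except (urllib.error.URLError, OSError) as err:
        # No usable HTTP response at all (DNS failure, refused tunnel, timeout).
        return str(getattr(err, "reason", err))


def report(urls=HOSTS, probe_fn=probe):
    """Probe each URL and return a {url: verdict} mapping."""
    return {url: describe(probe_fn(url)) for url in urls}
```

Running `report()` from the affected machine and printing the result shows one verdict per host. If the two hosts come out differently, that is something concrete to hand to the network admins.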

b0c5 · April 10, 2018

@dstillman Requests to https://recognize.zotero.org are a feature of newer versions I guess? In earlier versions I could retrieve metadata without problems. Probably my proxy is blocking this domain. I'll check with the network admins.

dstillman · April 10, 2018

Yes, that's new (though if your network access is based on some sort of whitelist, many things in Zotero are likely to be broken).

b0c5 · April 10, 2018

@dstillman Syncing with the zotero.org library fails. What is the server used in this case?
Is there a list of all the domain names accessed by Zotero?

dstillman · April 10, 2018

No, there's no such list. Zotero is a web-connected tool, and various things won't work properly without at least the same access as your web browser.

Zotero's own infrastructure is hosted on AWS. While a DNS-based access restriction could whitelist *.zotero.org, those IPs can change every minute, and any restriction that didn't take that into account would result in things breaking regularly.

Other functions in Zotero require access to any site you save in the browser (e.g., to save files) and to various other services that can change at any time (e.g., to retrieve metadata of various kinds).

b0c5 · April 10, 2018

@dstillman Is there at least a domain name for syncing the library?

dstillman · April 10, 2018

api.zotero.org and stream.zotero.org, but again, the associated IPs can change literally every 60 seconds.
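To illustrate why IP-based rules are fragile here, a minimal sketch that takes snapshots of what a hostname resolves to and compares them. The helper names (`resolve`, `changed`) are made up for this example, and how often the addresses actually differ will depend on your resolver and its caching.

```python
import socket


def resolve(host, port=443):
    """Return the set of IP addresses the local resolver gives for host right now."""
    infos = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return {info[4][0] for info in infos}


def changed(before, after):
    """Return (added, removed) addresses between two snapshots, each sorted."""
    return sorted(after - before), sorted(before - after)
```

Calling `resolve("api.zotero.org")` twice, a few minutes apart, and passing both results to `changed` shows whether the address set moved in between. A rule keyed on the hostname survives that; a rule keyed on the addresses from the first snapshot may not.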

cwru53 · July 30, 2018

I am having the same issue, under the same circumstances. My company tightly controls outside access and I just need the PDF metadata retrieval options to work.

Do you have any info or documentation stating exactly what is required for metadata access to PDFs to work? Is there a way to do this locally? Did this change when pdfxchange was removed?

I need information to submit to the network admins on how the software behaves and what it is required to access for this feature.

adamsmith · July 30, 2018

This has always been the case and has nothing to do with pdftools (which I think you're referring to; they also weren't removed, they're just automatically bundled now). It's possible this used to work for you previously because you were running the Firefox add-on. If that's the case, you could try to make the (accurate) argument that nothing has really changed. Zotero still has and requires exactly the same sort of access to the internet as a standalone app as it did as a browser extension.

For your other questions:
There's no way to do this locally, no. Zotero needs to query online databases -- both its own and others such as CrossRef/DataCite or Worldcat -- to be able to match papers to metadata. The metadata isn't in the paper itself; it's online.

dstillman's answer from April 10 really is the best we can give you on this. In order for Zotero to be able to retrieve metadata it needs to be able to send and receive data over standard http and https ports. It uses multiple and potentially changing domains for that and there's simply no way to reliably list them all.

stroom · January 15, 2019

Same here, sort of: my laptop at work does not get metadata, but at home or with a VPN there's no problem. How can I solve this?

adamsmith · January 16, 2019

As Dan says above, this is basically about Zotero having to be able to access the internet. You'll need to talk to your work IT about how to do that -- might be a proxy configuration, might be some sort of firewall.