Zotero doesn't recognize the existing Metadata inside my own PDF

regelausnahme · August 8, 2018

Dear Developers,

Whenever I write documents (using LibreOffice), I add the metadata as new properties using terms from Dublin Core (https://en.wikipedia.org/wiki/Dublin_Core). After I export the document to PDF and add it to Zotero, this metadata is not recognized.

This also applies to many documents I get from colleagues, which contain all the necessary metadata.

As far as I understand, this is because Zotero doesn't look inside the file for metadata at all. And since many of these documents are not meant to be published officially, they cannot and will not be found anywhere on the web.

Nonetheless, I would really like to use Zotero to manage these kinds of internal papers as well. A second problem is that quite a few of these papers must not be given to third parties, since that would be a security breach that risks leaking personal information without consent.

This leads me to several conclusions:
a) I would really appreciate it if all online functions were marked as such. This would enable me to give informed consent before sending you, as a third party, parts of my documents or personal data. Although your approaches are really useful and helpful in many cases, as long as these functions are not marked, I am afraid that using them is a security breach in this case.
b) This could also be a legal problem under the EU's new GDPR (DSGVO in German), since in some cases the first pages of a document contain personal data of other people, which I am not allowed to pass on to third parties outside a certain group of people.

Thanks for your great work!
I really do appreciate it and I do like Zotero.

dstillman · August 8, 2018

As you saw, we don't look at embedded PDF metadata. It's simply not good enough to use in most cases, and we don't have a good way of knowing if it is. The only possibility might be to check for the presence of the embedded metadata within the first few pages of document text and use the metadata if at least some of it appears, but it's not clear to me if that would even be sufficient in your case.
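
A rough sketch of that check, in Python purely for illustration. This is not Zotero code: the function names, the normalization rule, and the `min_confirmed` threshold are all assumptions, and it presumes the embedded metadata and the text of the first few pages have already been extracted by some other tool.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse runs of punctuation/whitespace into single spaces."""
    return re.sub(r"[\W_]+", " ", text.lower()).strip()


def corroborated_fields(metadata: dict[str, str], page_text: str) -> dict[str, str]:
    """Return only the metadata fields whose value also occurs in the page text."""
    # Pad with spaces so a value only matches on whole-word boundaries.
    haystack = f" {normalize(page_text)} "
    confirmed = {}
    for field, value in metadata.items():
        needle = normalize(value)
        if needle and f" {needle} " in haystack:
            confirmed[field] = value
    return confirmed


def should_use_embedded_metadata(
    metadata: dict[str, str], page_text: str, min_confirmed: int = 1
) -> bool:
    """Hypothetical rule: trust the metadata if at least min_confirmed fields appear in the text."""
    return len(corroborated_fields(metadata, page_text)) >= min_confirmed


# Made-up example data: two fields appear on the first page, one is filler.
embedded = {
    "title": "Internal Report on Widgets",
    "creator": "Jane Doe",
    "subject": "Untitled1",
}
first_pages = "Internal  Report\non Widgets\n\nJane Doe, 2018\n\nThis report is confidential."

confirmed = corroborated_fields(embedded, first_pages)
use_it = should_use_embedded_metadata(embedded, first_pages)
```

With the made-up data above, the title and creator are confirmed and the filler subject is dropped. Whether one confirmed field should be enough is exactly the open question.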

For the second part, we'll be putting out an updated privacy policy shortly that outlines every Zotero function that accesses the network, how to disable or avoid each one, and what our retention policies are for each type of data. In the case of PDF metadata retrieval, as we noted when we announced it, the PDF recognizer service "doesn’t require a Zotero account and doesn’t log any data about the content or results of searches". If you still don't want to use it, you can disable it from the General pane of the preferences.

regelausnahme · August 8, 2018

Thanks for your quick and informative answer.
Regarding the first point: this would already help a lot!
Of course the result can be insufficient or wrong, but so can the automatic retrieval or, in some cases, even the ISBN search. In any case, the user has to take a quick look at the result and keep or discard it.
So the program doesn't need to be able to judge the quality of the metadata it finds inside the PDF, as long as it recognizes the metadata, since the user will already be doing the quality check. (Of course, the software can recommend using the metadata retrieval for better results.)

Thanks and enjoy the day!

adamsmith · August 8, 2018

No, that's a bit too cavalier about false positives. Repeated incorrect, let alone nonsensical, imports both make for a bad user experience and undermine trust in Zotero.

bwiernik · August 8, 2018

If a feature to extract metadata from PDFs is added, I almost feel like it should only be used if specifically requested by the user (as a separate context menu option?). Meaningful embedded metadata in PDFs is so rare that it would almost always only be useful if the user specifically put it there.