Identify PDFs via hashes

Would it make sense to extend the current "Retrieve Metadata for PDF" feature by a) storing hashes of all child PDF attachments synced via Zotero Storage, together with the DOIs of their parent items, and b) matching unidentified PDFs against those hashes and, when a match is found, looking up the metadata via the associated DOI?

Since this would only require storing PDF hashes and DOIs in aggregate, I don't see any privacy concerns. One caveat is that some publishers embed the download date in their PDFs, so in those cases the file hash won't be the same across copies of the same article.
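
To make the idea concrete, here is a rough Python sketch of the matching step. It is only an illustration under assumptions, not how Zotero works: `HashIndex`, `register`, and `lookup` are made-up names, SHA-256 is an assumed choice of hash, and the DOI in the usage note is a placeholder.

```python
import hashlib
from pathlib import Path
from typing import Optional


def pdf_hash(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file's raw bytes."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        # Read in chunks so large PDFs are not loaded into memory at once.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


class HashIndex:
    """Hypothetical aggregate store: file hash -> DOI, with no user information."""

    def __init__(self) -> None:
        self._doi_by_hash = {}  # maps hex digest (str) to DOI (str)

    def register(self, pdf_path: Path, doi: str) -> None:
        """Record the hash of an already identified attachment with its parent's DOI."""
        self._doi_by_hash[pdf_hash(pdf_path)] = doi

    def lookup(self, pdf_path: Path) -> Optional[str]:
        """Return the DOI for a byte-identical PDF, or None if the hash is unknown."""
        return self._doi_by_hash.get(pdf_hash(pdf_path))
```

Usage would be something like `index.register(Path("known.pdf"), "10.1234/placeholder")` followed by `index.lookup(Path("unidentified.pdf"))`, with the returned DOI then used for the metadata lookup. Because the hash covers the raw bytes, a copy with a different embedded download date would simply return `None`, which is the caveat mentioned above.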
  • I believe Mendeley does something like this.
    I don't think there is a privacy concern, but Zotero would be exploiting user-generated data to provide a service. To my (German) mind that should require some type of opt-in.
    In general I think Zotero could be thinking about such issues more. For example, there are also a lot of questions from altmetrics folks about using Zotero data. Again, I'm not sure there's a big privacy concern per se, but I think distributing aggregated user data in any way for altmetrics and related projects would require some type of opt-in.
    I think in both cases most users would willingly give that permission, but Zotero should ask for it.