PDFs and metadata

mroberts1839 · September 12, 2009

Like many others, I have a large collection of disorganized PDFs that I would like to add to my Zotero collections. I've been trying to get metadata for my PDFs, but with only limited success. I really like this feature of Zotero, but I feel it needs some work to become genuinely useful.

First, when I import a PDF by dragging it from my desktop into Zotero, sometimes it just sits for a long time. I understand that Zotero is copying the PDF file into its database, but it seems like there should be some kind of visual feedback, to tell me what's happening.

Second, it's too easy for the "Retrieve Metadata for PDF" search to misfire, and when it does, the fallback is too awkward. To correctly associate metadata when Zotero fails, I gather that I must (1) manually steer my browser to WorldCat, (2) manually enter and search for the name of the author or title, (3) select the appropriate search result, (4) click on the icon in the URL address field to import the reference to Zotero, and then (5) drag my PDF onto the newly created item. Depending upon the title, this may mean scrolling clear to the other end of the list of items, grabbing the PDF, then dragging it back to the new item, etc.

Needless to say, this is a clunky procedure. The quasi-automatic approach is a good idea, but is certain to fail in many cases, no matter how "smart" you try to make the heuristic. For comparison, look at how an application like Papers <http://mekentosj.com/papers/> handles the same trick. In Papers, you can get a list of search results from Google Scholar, allowing you to pick one to associate the metadata with your PDF. You can adjust the search right from inside Papers, even by selecting a swatch of text from the PDF and picking the author/title/etc. search type. You don't need to create the reference by hand or drag the PDF to associate it with the reference.

Would it be possible to do something similar in Zotero (i.e., eliminate some of the manual steps)?

Next, I like that Zotero can rename the PDF file into something consistent. It would be great if it could also update the metadata inside the PDF file, so that it could be read by other applications.

enozkan · September 12, 2009

I am going to second the point about the misfiring "Retrieve Metadata for PDF" functionality, and the suggested solution from Papers. Comparing the two, Zotero comes up with incorrect or incomplete metadata much more often (I have actually not seen Papers give incorrect or incomplete citation data). It is probably because of the repository it uses, so making multiple repositories available, with an option to select among them, might help (for example, Google Scholar citations miss many fields, and manual editing is often required to make the citation ready to go into a manuscript).

P.S. I don't know the workings of Zotero's metadata retrieval algorithm, so I might be way off here.

dvs0826 · September 12, 2009

One idea I had for improving metadata retrieval is to scan the document for the word "copyright". The name of the journal and the year should be close by. Assuming Zotero does have to fall back to a repository search, this extra information should give much more accurate results.
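A minimal sketch of that idea, assuming the text of the first few pages has already been extracted into a plain string. The function name, the 80-character window, and the sample text are all made up for illustration; this is not how Zotero works, and it only pulls out the year, not the journal name.

```python
import re

# Hypothetical markers to look for: the word "copyright" or the © sign.
COPYRIGHT_RE = re.compile(r"(?:copyright|\u00a9)", re.IGNORECASE)
# Plausible publication years: 1500-1999 or 2000-2099.
YEAR_RE = re.compile(r"\b(1[5-9]\d{2}|20\d{2})\b")


def guess_copyright_year(text, window=80):
    """Return the first plausible year found within `window` characters
    after a copyright marker, or None if there is no such year."""
    for match in COPYRIGHT_RE.finditer(text):
        nearby = text[match.end():match.end() + window]
        year = YEAR_RE.search(nearby)
        if year:
            return int(year.group(1))
    return None


# Invented sample text, standing in for text extracted from a PDF.
sample = (
    "Journal of Examples, Vol. 3\n"
    "Copyright (c) 2005 Example Press. All rights reserved."
)
print(guess_copyright_year(sample))
```

A year found this way could be added to the repository query to narrow the results, though a heuristic like this will still miss documents whose copyright notice sits outside the scanned pages.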

mroberts1839 · September 12, 2009

One other issue: on some PDFs that I try to fetch metadata for, Zotero gives an error that they "do not contain OCR'd text" when in fact they do. Why is that?

Looking over the support forum, I find several other discussions of this issue, and a fair amount of head-scratching over better algorithms for automatic metadata retrieval. While it might be possible to scan for certain keywords like "copyright" to get to the page with the author/title, etc., I think this is a pretty difficult problem to solve in a general way.

The strength of the design of an application like Papers is that it recognizes this difficulty and provides a decent UI for (1) refining the search to a repository (and I strongly agree with enozkan's comment, above, that being able to specify different repositories would be very good), and (2) selecting one item from a list of search results to then retrieve the metadata.

The approach, then, is not to place all eggs in the basket of one-click metadata retrieval, focusing all effort on trying to make that more consistently successful, but also to provide a UI to help when the auto-retrieval fails for some reason.

adamsmith · September 12, 2009

The OCR thing can have a bunch of reasons - including that the metadata retrieval only searches the first couple of pages (three or so).

I like the "pick and choose" idea. Does Papers do that for all PDFs or only for ones where it's less certain?

mroberts1839 · September 12, 2009

I am pretty new to Papers, but it seems to have two modes.

If there is but one search result from Google Scholar, it opens a dialog to tell you this and ask if you want to just go ahead and match the metadata. If there are multiple results, then it shows them inside Papers. You select the one that you want, and then click on a "Match" button to grab the metadata and associate it with the PDF.
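Roughly, that two-mode flow could be sketched like this. It is only a guess at the logic from the outside, not Papers' actual code; `confirm` and `pick` are hypothetical callbacks standing in for the dialog and the result list.

```python
def choose_match(results, confirm, pick):
    """Decide which search result to match to a PDF.

    results: list of candidate records from a repository search.
    confirm: callback asking the user to accept a single result (returns bool).
    pick:    callback letting the user choose one of several results.
    """
    if not results:
        return None  # nothing found; the user would have to refine the search
    if len(results) == 1:
        # Single hit: just ask whether to go ahead with the match.
        only = results[0]
        return only if confirm(only) else None
    # Several hits: show the list and let the user select one.
    return pick(results)


# Invented example data; the callbacks here simply auto-accept.
hits = ["Result A", "Result B"]
print(choose_match(hits, confirm=lambda r: True, pick=lambda rs: rs[0]))
```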

Papers doesn't seem to have the same smarts as Zotero for parsing web pages, but the interface for getting the metadata and associating it with the PDF is much smoother.

Seems like the Zotero interface could be spiffed up a bit without much fuss.

eudinaesis · September 13, 2009

I had similar problems - I was trying to get metadata off a number of full ebooks, and it gave the OCR error. I'm guessing it's that "first couple of pages" thing, because these are exact copies, complete with blank pages, etc. Is there any reason it doesn't *directly* access the PDF's metadata? For example:
pdfinfo Vanberg-Pol_const_rev_germany.pdf
Title: THE POLITICS OF CONSTITUTIONAL REVIEW IN GERMANY
Author: GEORG VANBERG
Creator: AdobePS5.dll Version 5.2
Producer: Acrobat Distiller 5.0.5 (Windows)
CreationDate: Tue Jan 11 21:32:15 2005
ModDate: Sat Jan 29 16:11:45 2005
Tagged: no
Pages: 209
Encrypted: no
Page size: 303.12 x 497.52 pts
File size: 1086661 bytes
Optimized: yes
PDF version: 1.3

... now, most of that is useless, but the Title / Author is nice!
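As a rough illustration of how little is needed to pick those two fields out, here is a sketch that parses pdfinfo-style "Key: value" output like the listing above. It works on text that has already been captured; actually running pdfinfo is left out, and the function name is invented for this example.

```python
def parse_pdfinfo(output):
    """Turn 'Key: value' lines (as printed by pdfinfo) into a dict.

    Only the first colon on each line is treated as the separator,
    so values such as timestamps keep their own colons.
    """
    info = {}
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


# A few lines copied from the pdfinfo listing in this post.
sample = """Title: THE POLITICS OF CONSTITUTIONAL REVIEW IN GERMANY
Author: GEORG VANBERG
CreationDate: Tue Jan 11 21:32:15 2005
Pages: 209
Encrypted: no"""

info = parse_pdfinfo(sample)
# Keep only the fields that are useful for a citation, if present.
useful = {key: info[key] for key in ("Title", "Author") if info.get(key)}
print(useful)
```

Embedded Title and Author fields are often empty or wrong in practice, so values like these would probably be better used as search terms for a repository lookup than trusted as the final metadata.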

anatolica · September 15, 2009

I second the idea of being able to read and write PDF metadata through Zotero, just as Zotero can rename the file from metadata. I have searched for a way to write metadata into the PDF attachments, here in the forum and elsewhere on the web, but could not come up with a useful one.

It is really important for the attachments to be useful in terms of metadata outside of the zotero environment as well, which would make things even easier.

mroberts1839 · September 15, 2009

W.r.t. the "Retrieve Metadata for PDF" feature, I notice that 2.0b7 now checks the first three pages of a PDF (instead of the first two). This is good, though when I tried to get metadata for a PDF that didn't work for me in the previous version of Zotero, I still got the "does not contain OCR'd text" error. Looking at the PDF, I see that the text doesn't begin until the 4th page. First three pages are just images.

So, I'm wondering if this setting could be parameterized. I.e., surface it on the hidden preferences page in Firefox ("about:config"). That way, we could easily tweak the heuristic to accommodate "problem" PDFs.
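To make the request concrete, here is a sketch of what a configurable page limit might look like. It is purely illustrative: `max_pages` stands in for a hypothetical hidden preference, the pages are represented as already-extracted strings, and none of this reflects Zotero's actual code.

```python
def first_text_page(page_texts, max_pages=3):
    """Return the 0-based index of the first page that contains any text,
    looking only at the first `max_pages` pages; None if none is found."""
    for index, text in enumerate(page_texts[:max_pages]):
        if text.strip():
            return index
    return None


# A PDF like the one described: three image-only pages, text on the 4th.
pages = ["", "", "", "Title page text"]
print(first_text_page(pages))               # default limit of three pages
print(first_text_page(pages, max_pages=5))  # limit raised via the setting
```

With the default limit the text on the 4th page is never reached, which would explain the "does not contain OCR'd text" error; raising the limit lets the same check succeed.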

studiosus · October 17, 2009

I second this as well - most of the time I get the same "do not contain OCR'd text" error, and there are lots of files where ISBN numbers etc. are on the 5th, 6th, or 7th page. It would be great to improve this functionality.