PDFs and metadata
Like many others, I have a large collection of disorganized PDFs that I would like to add to my Zotero collections. I've been trying to get metadata for my PDFs, but with only limited success. I really like this feature of Zotero, but I feel it needs some work to become genuinely useful.
First, when I import a PDF by dragging it from my desktop into Zotero, sometimes it just sits for a long time. I understand that Zotero is copying the PDF file into its database, but it seems like there should be some kind of visual feedback, to tell me what's happening.
Second, it's too easy for the "Retrieve Metadata for PDF" search to misfire, and when it does, the fallback is too awkward. To correctly associate metadata when Zotero fails, I gather that I must (1) manually steer my browser to WorldCat, (2) manually enter and search for the name of the author or the title, (3) select the appropriate search result, (4) click on the icon in the URL address field to import the reference into Zotero, and then (5) drag my PDF onto the newly created item. Depending upon the title, this may mean scrolling clear to the other end of the list of items, grabbing the PDF, then dragging it back to the new item, etc.
Needless to say, this is a clunky procedure. The quasi-automatic approach is a good idea, but is certain to fail in many cases, no matter how "smart" you try to make the heuristic. For comparison, look at how an application like Papers <http://mekentosj.com/papers/> handles the same trick. In Papers, you can get a list of search results from Google Scholar, allowing you to pick one to associate the metadata with your PDF. You can adjust the search right from inside Papers, even by selecting a swatch of text from the PDF and picking the author/title/etc search type. You don't need to create the reference by hand or drag the PDF to associate it with the reference.
Would it be possible to do something similar in Zotero (i.e., eliminate some of the manual steps)?
Next, I like that Zotero can rename the PDF file into something consistent. It would be great if it could also update the metadata inside the PDF file, so that it could be read by other applications.
P.S. I don't know the workings of Zotero's metadata retrieval algorithm, so I might be way off here.
Looking over the support forum, I find several other discussions of this issue, and a fair amount of head-scratching over better algorithms for automatic metadata retrieval. While it might be possible to scan for certain keywords like "copyright" to get to the page with the author/title, etc., I think this is a pretty difficult problem to solve in a general way.
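To make the keyword idea concrete, here is a minimal sketch in Python. It is not how Zotero works (I don't know its algorithm); it just assumes the text of each page has already been extracted into a list of strings, and the function name, keyword list, and page limit are all made up for illustration.

```python
from typing import Optional, Sequence

# Hypothetical defaults: neither the keyword list nor the page limit
# comes from Zotero; both are assumptions for illustration.
DEFAULT_KEYWORDS = ("copyright",)


def find_front_matter_page(
    pages: Sequence[str],
    keywords: Sequence[str] = DEFAULT_KEYWORDS,
    max_pages: int = 10,
) -> Optional[int]:
    """Return the index of the first early page mentioning a keyword, or None."""
    # Only look at the first few pages, where title/copyright pages usually sit.
    for index, text in enumerate(pages[:max_pages]):
        lowered = text.lower()
        if any(keyword.lower() in lowered for keyword in keywords):
            return index
    return None
```

A scan like this only says which page to look at; it still has to be followed by some way of pulling the author and title out of that page.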
The strength of the design of an application like Papers is that it recognizes this difficulty and provides a decent UI for (1) refining the search against a repository (and I strongly agree with enozkan's comment, above, that being able to specify different repositories would be very good), and (2) selecting one item from a list of search results and then retrieving its metadata.
The approach, then, is not to place all eggs in the basket of one-click metadata retrieval, focusing all effort on trying to make that more consistently successful, but also to provide a UI to help when the auto-retrieval fails for some reason.
I like the "pick and choose" idea. Does Papers do that for all PDFs or only for ones where it's less certain?
If there is but one search result from Google Scholar, it opens a dialog to tell you this and ask if you want to just go ahead and match the metadata. If there are multiple results, then it shows them inside Papers. You select the one that you want, and then click on a "Match" button to grab the metadata and associate it with the PDF.
Papers doesn't seem to have the same smarts as Zotero for parsing web pages, but the interface for getting the metadata and associating it with the PDF is much smoother.
Seems like the Zotero interface could be spiffed up a bit without much fuss.
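For what it's worth, the single-result versus multiple-result flow described above for Papers is simple to express. Here is a rough sketch in Python; the `Candidate` type, the `choose_match` function, and the two callbacks are hypothetical names standing in for the search result and the two dialogs, not anything from Papers or Zotero.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    """One search result that could supply metadata (hypothetical type)."""
    title: str
    author: str


def choose_match(
    candidates: Sequence[Candidate],
    confirm_single: Callable[[Candidate], bool],
    pick_from_list: Callable[[Sequence[Candidate]], Optional[Candidate]],
) -> Optional[Candidate]:
    """Decide which search result, if any, should be matched to the PDF."""
    if not candidates:
        # Nothing found: the user would have to refine the search.
        return None
    if len(candidates) == 1:
        # One result: ask for confirmation before matching.
        only = candidates[0]
        return only if confirm_single(only) else None
    # Several results: show the list and let the user pick one (or none).
    return pick_from_list(candidates)
```

The two callbacks stand in for the confirmation dialog and the results list, so the decision logic stays separate from whatever UI ends up presenting it.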
pdfinfo Vanberg-Pol_const_rev_germany.pdf
Title: THE POLITICS OF CONSTITUTIONAL REVIEW IN GERMANY
Author: GEORG VANBERG
Creator: AdobePS5.dll Version 5.2
Producer: Acrobat Distiller 5.0.5 (Windows)
CreationDate: Tue Jan 11 21:32:15 2005
ModDate: Sat Jan 29 16:11:45 2005
Tagged: no
Pages: 209
Encrypted: no
Page size: 303.12 x 497.52 pts
File size: 1086661 bytes
Optimized: yes
PDF version: 1.3
... now, most of that is useless, but the Title / Author is nice!
It is really important for the attachments to carry useful metadata outside of the Zotero environment as well, which would make things even easier.
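As a rough illustration of how little it takes to pick those fields out, here is a Python sketch that parses text in the "Key: value" layout of the pdfinfo output above. The function name `parse_pdfinfo` is my own, and the sample is just a few lines copied from that output; the sketch does not run pdfinfo itself.

```python
def parse_pdfinfo(output: str) -> dict[str, str]:
    """Turn 'Key: value' lines into a dict, splitting on the first colon only."""
    fields: dict[str, str] = {}
    for line in output.splitlines():
        # partition() splits at the first colon, so values that contain
        # colons (such as the time in CreationDate) stay intact.
        key, sep, value = line.partition(":")
        if sep and key.strip():
            fields[key.strip()] = value.strip()
    return fields


# A few lines copied from the pdfinfo output shown above.
SAMPLE = """Title: THE POLITICS OF CONSTITUTIONAL REVIEW IN GERMANY
Author: GEORG VANBERG
CreationDate: Tue Jan 11 21:32:15 2005
Pages: 209
"""

info = parse_pdfinfo(SAMPLE)
print(info.get("Title"), "/", info.get("Author"))
```

Using `.get()` matters here because plenty of PDFs have no Title or Author set at all.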
So, I'm wondering if this setting could be parameterized, i.e., exposed on the hidden preferences page in Firefox ("about:config"). That way, we could easily tweak the heuristic to accommodate "problem" PDFs.