Problem with PDF metadata retrieval

dda-gre · August 28, 2009

Hi folks,

I use a lot of the Copernicus open-access journals, which are super nice because they link directly to their PDFs, so you can right-click a link to download the PDF into Zotero.

But then, when I try to retrieve the metadata, it invariably fails... which is kind of sad, given that regular Google (not Scholar) leads straight to the right page...

Example:
http://www.atmos-chem-phys.net/9/5155/2009/acp-9-5155-2009.html

where you get the PDF named acp-9-5155-2009.pdf

Currently, I have to create the library item from the web page and then manually link the PDF, which is quite painful once you have more than a screenful of items. But then maybe I missed an obvious trick here...

dda

dda-gre · August 28, 2009

OK, forget about my question.

I finally found out that creating the library item from the web page, and then dragging the link to the PDF onto the newly created item, did exactly what I needed...

oops

dda-gre · August 28, 2009

Yet it still surprises me that Zotero couldn't retrieve anything from the PDFs I got from that journal, as if it didn't look inside the PDF at all.

dda

Bionatsci · August 28, 2009

Edit: sorry, started writing before you posted again - most of it is still relevant though, and it should be even easier than dragging the link.

If you create the item from the web page (using the symbol in the Firefox address bar) it should add the PDF automatically if you tick "automatically attach associated PDFs and other files when saving references" in the preferences (gear menu -> preferences -> general tab). I've checked your link and it works for that article.

For any sites where this doesn't work you can drag any direct link to a PDF onto a Zotero item and it will attach the PDF to that item. You may also want to report the lack of automatic PDF saving, after searching the forum to see if it is a known issue, as the solution may be just a quick fix to the translator code.

Bionatsci · August 28, 2009

I'm also surprised that "retrieve metadata" doesn't catch this, especially as searching for the title in Google Scholar brings this up as the first result and it is possible to grab the metadata from that result.

Does anyone acquainted with the "retrieve metadata" code fancy taking a look at this? It could be indicative of a more general bug, as I would guess the title should be the next thing passed to Google Scholar if no DOI is found.

dstillman · August 28, 2009

Zotero doesn't really have any way to know what the title is—it just gets plaintext from the PDF, and whitespace isn't a very reliable guide. As far as I can tell, the problem with this document, with Zotero's current implementation, is that 1) there's no DOI and 2) the first page of author footnotes—particularly what comes through pdftotext (which you can see by examining the hidden .zotero-ft-cache file in the directory)—passes Zotero's test for body text, and so it queries Google for a bunch of universities and addresses.
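To make the "no DOI" case concrete, here is a minimal sketch of looking for a DOI in the plaintext that comes out of a PDF. This is not Zotero's actual code: the pattern, the `find_doi` name, and the sample strings (including the placeholder DOI `10.1000/xyz123`) are all assumptions for illustration.

```python
import re

# Hypothetical pattern for DOI-like strings: "10.", a registrant code, "/", a suffix.
# Real DOIs can be messier; this is only an approximation.
DOI_PATTERN = re.compile(r'\b10\.\d{4,9}/[^\s"<>]+')

def find_doi(text):
    """Return the first DOI-like string in text, or None if there is none."""
    match = DOI_PATTERN.search(text)
    if match is None:
        return None
    # Trailing punctuation usually belongs to the sentence, not to the DOI.
    return match.group(0).rstrip(".,;:)")

# A first page made only of affiliations yields no DOI at all.
print(find_doi("Department of Chemistry, Example University, Example City"))
# A page that does mention an identifier (placeholder DOI) yields it.
print(find_doi("see doi:10.1000/xyz123."))
```

When this kind of lookup comes back empty, whatever text is picked for the search query is all there is to go on, which is why a first page full of affiliations causes trouble.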

It should probably try grabbing text from a few pages in—at least as a fallback—rather than just trying different passages from the first page that looks like body text. For what it's worth, many phrases in this paper don't appear to return results when doing a quoted search on Google Scholar, so it's possible Google indexed a different version.
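A rough sketch of that fallback, assuming the plaintext comes from pdftotext with its default form-feed page breaks. The function names and the sample text are made up for illustration, and this is not how Zotero itself is written.

```python
def split_pages(plaintext):
    """Split pdftotext-style output into pages on form-feed characters."""
    return [page for page in plaintext.split("\f") if page.strip()]

def quoted_phrase(page_text, start_word=0, length=8):
    """Build a quoted search phrase from consecutive words of one page."""
    words = page_text.split()
    return '"' + " ".join(words[start_word:start_word + length]) + '"'

# Made-up two-page document: affiliations first, body text second.
sample = (
    "Author One, Example University\f"
    "Introduction text that reads like the real body of the paper\f"
)
pages = split_pages(sample)

# Take the phrase from the second page rather than the first.
print(quoted_phrase(pages[1], length=6))
```

Varying `start_word` gives several different phrases from the same page, so a few queries can be tried before giving up.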

Bionatsci · August 30, 2009

Aha, I get it now.

Grabbing text from a few pages in as a fallback sounds like a good idea, as long as the algorithm stays away from the last page or so, where the reference list/bibliography usually sits. This would only really be an issue with short review articles, which are often only 3 or 4 pages long but have about a page of references.
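Something like the following is what I have in mind for picking pages. It is just a sketch: the `fallback_pages` name and the defaults of skipping one page at each end are my own assumptions, not anything in the current code.

```python
def fallback_pages(num_pages, skip_first=1, skip_last=1):
    """Return 0-based page indices to try, in order of preference."""
    # Prefer pages that are neither front matter nor the likely reference list.
    middle = list(range(skip_first, num_pages - skip_last))
    if middle:
        return middle
    # Too short to skip both ends: fall back to the first page only.
    return [0] if num_pages > 0 else []

print(fallback_pages(4))  # [1, 2]
print(fallback_pages(2))  # [0]
```

For a 4-page review this tries the two middle pages and never touches the final page of references.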

Thanks for the explanation.