retrieve pdf metadata problem

Iddo · October 31, 2008

Hi Again,
Strange problem - I watched the retrieve pdf metadata screencast and tried it on a number of PDFs.

The first time I dragged and dropped 3 PDFs I got the option "retrieve metadata from PDF" when I right clicked. I tried it but it didn't work (it opened a window with a small rotating circle which never stopped). I canceled and tried again - still nothing. I then tried it with a few other PDFs I imported and to my surprise I no longer had the option to retrieve PDF metadata at all when I right click.

Any ideas what is going on?

Iddo

Iddo · November 1, 2008

Any ideas?

rmjanjua · November 3, 2008

Iddo,
I posted a comment a while ago but never got a response from anyone. It has been mentioned in the past but the problem remains. The metadata retrieved is also incorrect and incomplete. The only engine used to retrieve is Google Scholar.
I also anxiously await the ability to drag my PDFs into Z without having them duplicated into the Z files.

Iddo · November 3, 2008

Hmm... this is something Zotero supposedly supports, so we are not talking about a future feature but about an existing one which doesn't work (for me at least).

Tjowens · November 3, 2008

Which version of the preview are you both using? I just tried a few different PDFs in the newest version, Sync3.2, and the metadata came in fine for me. The best way for the community to refine the tool is to mention specific PDFs that fail and where one can find them in a database.

rmjanjua · November 3, 2008

I am using the most current one but the problem was there before I updated it a few days ago. It actually always fails to retrieve the journal and incorrectly assigns an author. For example:
Anatomical variations in the origin of the human ophthalmic artery with special reference to the cavernous sinus and surrounding meninges.
Matsumura Y, Nagashima M.
For this article, when dragging the PDF to Z and trying to retrieve the metadata, the author I got was "Organs C.T." without the journal and pages, but it did get the volume and issue.
In the meantime a window keeps appearing at the top of my screen asking me to enter a URL.
Thanks for your help and look forward to your suggestions.

Iddo · November 3, 2008

I am using version 1.5sync 3.2.
I just don't have the option to get the meta-data.
On my desktop I never had the option and on my laptop I had it and now I don't.

Simon · November 3, 2008

Iddo, you need to install the PDF indexer for this feature to work. Go to "Search" in the Zotero preferences and click the "Check for installer" button. This should enable the option for you.

Iddo · November 3, 2008

Hi Simon,
Interesting - why is this not installed as a default? I would have never guessed I need to do that.

O.K. now I have the option - I tried two PDF files and it did not find any metadata - can you direct me to a free PDF I can download that you know for certain has retrievable metadata, so I can try and see if it actually works?

Thanks,
Iddo

noksagt · November 3, 2008

Interesting - why is this not installed as a default? I would have never guessed I need to do that.

It uses platform-dependent binaries that are distributed under a different license.

Tjowens · November 4, 2008

All the PDFs in the screencast work; I believe those came from ERIC. This PDF from BioMed Central worked fine for me. You can find the original article here.

Iddo · November 4, 2008

yup - it works!

So apparently many PDFs from JSTOR don't have metadata - what a shame :(
So basically what you are saying is that most PDFs I will try to import this way will not have metadata? Is there any conceivable way around it, apart of course from typing all the data myself, which is something I don't really want to do for the hundreds of PDF files I already have?

sean · November 4, 2008

Until recently, JSTOR did not include a text layer in its PDFs. Without that layer, the PDF is effectively just an image, and there's no way for Zotero (or any other tool) to read or recognize anything. If the PDFs are from JSTOR, you could always go to JSTOR and reimport those resources.
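
A rough way to tell in advance whether a PDF is in this situation is to look at how much text an extractor gets out of it. The sketch below assumes the text has already been extracted (for example with pdftotext); the function name and the 50-character threshold are illustrative assumptions, not anything Zotero is known to do.

```python
def has_text_layer(extracted_text: str, min_chars: int = 50) -> bool:
    """Guess whether a PDF has a usable text layer.

    `extracted_text` is whatever a text extractor produced for the file.
    `min_chars` is an arbitrary threshold chosen for illustration.
    """
    # Count only letters and digits so page breaks and whitespace don't count.
    meaningful = sum(1 for ch in extracted_text if ch.isalnum())
    return meaningful >= min_chars


scanned = "\f\f\f"  # stand-in for an image-only PDF: nothing but page breaks
born_digital = "Anatomical variations in the origin of the human ophthalmic artery " * 2

print(has_text_layer(scanned))       # False
print(has_text_layer(born_digital))  # True
```

A file that fails this check would need OCR, or a fresh download from the publisher, before any metadata lookup could work.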

Iddo · November 5, 2008

True - I just thought that JSTOR of all places would be more organized on this point - apparently not - so I guess I can't expect more from smaller places.

Just a thought - why not use an OCR application and build an algorithm that can try to extract the title, author, year of publication etc. from the front page of a PDF?
It won't be 100% (probably not even 70%) but it might be better than typing everything by hand.

noksagt · November 5, 2008

Ideas for more intelligent parsing of PDFs have been brought up before. A regex search for identifiers (PMID, arXiv, DOI, etc.) would be useful. For recognition of titles, etc., a search against a large pool of data (as the Zotero server may one day have) could benefit heuristic identification.
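
As an illustration of the identifier search mentioned above, here is a minimal Python sketch. The patterns are simplified assumptions (they are not complete grammars for DOIs, PMIDs or arXiv IDs), the function name is made up, and the sample identifiers are invented.

```python
import re

# Patterns are simplified assumptions, not complete grammars for each identifier.
DOI_RE = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+")
PMID_RE = re.compile(r"\bPMID:?\s*(\d{1,8})\b", re.IGNORECASE)
ARXIV_RE = re.compile(r"\barXiv:\s*(\d{4}\.\d{4,5}(?:v\d+)?)", re.IGNORECASE)


def find_identifiers(text: str) -> dict:
    """Return the first DOI, PMID and arXiv ID found in `text` (or None)."""
    doi = DOI_RE.search(text)
    pmid = PMID_RE.search(text)
    arxiv = ARXIV_RE.search(text)
    return {
        # Trailing punctuation is usually sentence punctuation, not part of the DOI.
        "doi": doi.group(0).rstrip(".,;)") if doi else None,
        "pmid": pmid.group(1) if pmid else None,
        "arxiv": arxiv.group(1) if arxiv else None,
    }


# Made-up identifiers, for illustration only.
sample = "doi:10.1234/example.5678. PMID: 12345678 arXiv:0811.1234v2"
print(find_identifiers(sample))
```

Any identifier found this way could be looked up directly instead of relying on a full-text match.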

I don't know if OCR has been discussed before. It seems rather heavy and platform-dependent to be a reasonable dependency to me, but the idea of allowing end users to plug in command-line apps for indexing has been discussed. If custom commands could be run on file attachment or for indexing, you could insert your favorite OCR app into the chain before pdftotext.
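
To make the "chain" idea concrete, here is a hypothetical Python sketch of how user-supplied commands might be placed ahead of pdftotext. Nothing here reflects how Zotero actually runs its indexer: the function, the "{pdf}" template syntax, and the OCR tool name are all assumptions.

```python
def build_index_chain(pdf_path, pre_steps=()):
    """Return the argv lists to run, in order, to index one PDF.

    `pre_steps` are user-supplied command templates; "{pdf}" is replaced
    with the file path. The template syntax is an assumption of this sketch.
    """
    chain = [[part.replace("{pdf}", pdf_path) for part in step] for step in pre_steps]
    # pdftotext writes to standard output when the output file is "-".
    chain.append(["pdftotext", pdf_path, "-"])
    return chain


# "my-ocr-tool" and its flag are placeholders for whatever OCR program a user prefers.
for argv in build_index_chain("paper.pdf", [["my-ocr-tool", "--in-place", "{pdf}"]]):
    print(argv)
```

Each argv list could then be handed to something like subprocess.run, stopping the chain if a step fails.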

Iddo · November 5, 2008

Sounds interesting.
You talk about "a large pool of data" - is this something you are currently actively looking into or are we talking about a distant future?

noksagt · November 5, 2008

I'm not a Zotero developer, but the Zotero server is under active development. Recommendations have been touted as a planned feature & a recommendation system may benefit similarly from a large data pool. There have been no announced plans for heuristics to be used to add metadata, so that would probably be a more distant feature.

Iddo · November 5, 2008

O.K. thanks :)

enozkan · November 7, 2008

I have to second rmjanjua on his comment. Metadata retrieved is very inaccurate in cases when Google Scholar is used as the repository. Can we set the repository for data retrieval to something else? Honestly, as a biochemist, >99.5% of the articles I read are accurately recorded by PubMed (for which we have to thank the American government), and on the rare occasion PubMed is used as the repository, the bibliography retrieved is always accurate.

Engin

rmjanjua · November 9, 2008

Engin,
One of the options that I would like to see is the ability to pick your repository. I agree that PubMed is great and actually have been happy with all of my records taken directly from it. How do you save your PDFs? I have been saving them all in YEP with tags. This allows me to pull up all PDFs related to a subject and view them at a glance. Although having PDFs associated with a citation in Z is functional, it does not have the visual feature of YEP and ends up duplicating the file on your drive.

Rashid

Simon · November 9, 2008

PubMed stores only abstracts, not full-text, and thus would probably not work too well. Our algorithm currently tries to extract a random text snippet and searches for that; even if we tried to extract only the abstract, I'd imagine our success rate with any kind of algorithm would be very low. PubMed Central would probably work properly as a repository for the articles in it, but has far less content than PubMed or Google Scholar.
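
For readers wondering what "extract a random text snippet and search for that" might look like, here is a minimal Python sketch. Only the general idea comes from the post above; the snippet length, the exact-phrase quoting and the function names are assumptions, not Zotero's actual code.

```python
import random


def random_snippet(text, n_words=8, rng=random):
    """Pick `n_words` consecutive words from a random position in `text`."""
    words = text.split()
    if len(words) <= n_words:
        return " ".join(words)
    start = rng.randrange(len(words) - n_words + 1)
    return " ".join(words[start:start + n_words])


def phrase_query(snippet):
    """Wrap the snippet in quotes so a search engine treats it as an exact phrase."""
    return '"' + snippet.replace('"', "") + '"'


body = ("Until recently some archives did not include a text layer in their "
        "files so there was nothing for a full text search to match against")
print(phrase_query(random_snippet(body, n_words=6, rng=random.Random(0))))
```

An exact-phrase search like this can only match where the repository indexes the full text, which is why an abstracts-only source would rarely produce a hit.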

enozkan · November 14, 2008

rmjanjua,

I do not save my PDFs any specific way any more, now that I'm using Zotero. But I see your point.

Simon,

It does make me wonder, however, how other software manages accuracy in metadata retrieval (I am specifically thinking of "Papers", which is available only on Macs). I guess their algorithm is different, and if so, what are the chances of improving the current Zotero algorithm?

Simon · November 14, 2008

As far as I can tell, Papers takes a similar approach to ours, grabbing metadata from Google Scholar. They try the DOI first, which might be worth looking into. After that, it seems like they make you select text from the PDF, which we aren't going to do. Are there PDFs that Papers does a better job with than Zotero?
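
The "try the DOI first" order described above can be sketched as a simple fallback. This is only an illustration: both lookup functions are hypothetical callables supplied by the caller, the DOI pattern is simplified, and the DOI in the example is made up.

```python
import re

DOI_RE = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+")  # simplified DOI pattern


def identify(text, lookup_by_doi, search_by_snippet):
    """Try a DOI lookup first; fall back to a text-snippet search.

    Both lookup functions are hypothetical callables supplied by the caller.
    Each returns a metadata dict, or None when nothing matches.
    """
    match = DOI_RE.search(text)
    if match:
        record = lookup_by_doi(match.group(0).rstrip(".,;)"))
        if record is not None:
            return record
    return search_by_snippet(text)


# Stand-in lookups with a made-up DOI, for illustration only.
known = {"10.1234/example.5678": {"title": "Example article", "source": "doi"}}
by_doi = known.get
by_snippet = lambda text: {"title": "Best guess", "source": "snippet"}

print(identify("See doi:10.1234/example.5678.", by_doi, by_snippet)["source"])  # doi
print(identify("No identifier on this page", by_doi, by_snippet)["source"])     # snippet
```

The point of the ordering is that a DOI is an exact key, so the fuzzier snippet search is only needed when no identifier is found or the lookup comes back empty.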

Rintze · November 16, 2008

PubMed stores only abstracts, not full-text, and thus would probably not work too well.

Would it be feasible to first use Google Scholar to identify PDF papers by using random text snippets, and when that gives a positive hit, to send the Google Scholar metadata of the identified paper (e.g. author names, titles, DOIs) to PubMed?

Madigania · December 3, 2008

Back to the original problem, solved for Iddo above but not for me:
"I just don't have the option to get the meta-data." when right clicking on a PDF in the Zotero library.

I followed Simon's instructions:
"Iddo, you need to install the PDF indexer for this feature to work. Go to "Search" in the Zotero preferences and click the "Check for installer" button. This should enable the option for you."
I confirmed that I do have PDF indexing (version 3.02), but I wonder if that is the problem? I'm using Zotero 1.07, and the updater says that is the most recent version (for Apple OS X.5), but Iddo stated 1.5 syncing with 3.2.

So this new feature only works on the Zotero 1.5 sync preview?

dstillman · December 3, 2008

So this new feature only works on the Zotero 1.5 sync preview?

That is correct. It's a new feature that will be in 1.5.

hamptondan · February 27, 2009

I'm having problems with accuracy. For instance, a PDF of "Management of Large Segmental Tibial Defects Using a Cylindrical Mesh Cage" is being recognized as "Tumoral calcinosis in infants: a report of three cases and review of the literature". They have the correct journal but the wrong year.

gary.pajer · March 2, 2009

I haven't had any success at all with this feature. WinXP, Zotero 1.5b1, and yes, the two necessary plug-ins are installed. I drag the PDF into the center pane, making sure that it's an entry by itself (not attached to an entry). Right click, and the context menu entry for Retrieve PDF Metadata is there. The info window (with progress bar) opens, the wheel spins, then "No matching references found". Every time. Every PDF, including the one suggested above by Tjowens on Nov 4 2008. If I go to Google Scholar by hand, the article(s) are found easily, the Zotero icon appears in the address bar, I click on it, and an entry is created and its fields properly populated.

Any hints?

Tjowens · March 3, 2009

Can you confirm that your full text search plugins are functioning? Try doing some searches for terms that appear inside your PDFs and see if Zotero is searching through their full text.

I just tried this PDF again and it worked quite nicely, so it looks like this is not a general issue but something specific to your configuration.

gary.pajer · March 4, 2009

Evidently ... I have three Zotero installations, and it works fine on two of them. When I get a chance I'll try to uninstall/reinstall the plug-ins and, if necessary, Zotero. Unless you have a better suggestion.

Thanks,
Gary

bauct · March 31, 2009

I tried the above article, and the text is indexed correctly, but it cannot retrieve metadata! PDF indexers are up to date. Running 1.5b2.

I tried another article that is definitely found in Google Scholar, and it also did not find metadata.

I am using NitroPDF, not Adobe Reader. Any connection? The PDF document properties do show the right title and author...