Indexing

jeffreylacasse · March 10, 2010

I'm running the latest version of Zotero on a Mac, with pdftotext and pdfinfo 3.02 both installed.

I can't get it to index my PDFs correctly. If I pull a PDF into Zotero and ask it to "Retrieve Metadata from PDF", it always gives the error "Could not read data from PDF." I have 56 indexed and 1156 unindexed, so most of my files are unindexed.

I searched the forums and I'm unclear where to even start. I originally had all my Zotero files in my .mac iDisk directory; I moved them to the desktop and there was no improvement.

Clearly this limits the utility of Zotero for me and I'd love to have this functionality. Any ideas?

jeffreylacasse · March 10, 2010

I should add that when clearing the index and rebuilding it, I always end up with 56 indexed, 0 partial, 1156 unindexed, words=9966.

The metadata retrieval for PDFs does not work regardless of the type of article. I tried about ten different ones to make sure it wasn't a specific kind of PDF (from a specific journal, for instance) causing the problem.

adamsmith · March 10, 2010

First, make sure that the PDFs in question do in fact contain a text layer, i.e. in your PDF reader, can you select and copy text?
Then try rebuilding your index once more and see if you can submit an error report afterwards. Maybe the process gets stuck at one specific place and just stops there. Post the error ID and any error text that seems relevant here; it may even help you make sense of the problem yourself.

jeffreylacasse · March 10, 2010

I verified that the PDFs do have a text layer.

Here is the debug ID for when I try to retrieve PDF metadata: D118947533.

Here is the debug ID for my attempt at indexing all my files: D2000249714.

Help appreciated!

kithairon · March 11, 2010

I have struggled with indexing as well: I have some 300 indexed and 650 non-indexed files. I'm on a Mac, with the latest versions of Firefox and Zotero and an updated pdftotext.
So far I have identified four different reasons why Zotero doesn't index a PDF even after repeated attempts (via the re-index option in the preferences or manually with the context menu's re-index item). The obvious one first: a) no text layer in the PDF; but also b) the PDF is password-encrypted; c) the title of the PDF on import contains spaces; and, tricky for anything non-English, d) accents and umlauts in filenames (I haven't figured out the complete list of no-nos, but it definitely doesn't take to any specifically French, German or latinised Indic characters in the filename). After removing the offending spaces and non-English characters, Zotero indexes the file on a reimport pretty much OK. (I have one or two files that remain adamantly unindexable despite the above precautions.)
Obviously, it would be great if non-English characters could be handled by Zotero. My current solution is that I trim anything potentially alien from the filename before importing, scrape the data for the main item (or fill in the info panel by hand), then use the 'rename file from metadata' option on the now-indexed PDF – and the attachment's name is reinstated in its previously unacceptable glory, all bristling with spaces and accents.
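
For anyone who wants to pre-screen a folder for causes c) and d) before importing, here is a minimal Python sketch. It only flags filenames that contain spaces or non-ASCII characters; it does not detect a missing text layer or encryption. The function names and the "storage" folder are made up for illustration and are not anything Zotero provides.

```python
from pathlib import Path


def filename_problems(name: str) -> list[str]:
    """Return the reasons a file name might trip up indexing (causes c and d)."""
    problems = []
    if " " in name:
        problems.append("contains spaces")
    if not name.isascii():
        problems.append("contains non-ASCII characters")
    return problems


def scan_folder(folder: str) -> dict[str, list[str]]:
    """Map each suspicious PDF path under `folder` to its list of problems."""
    flagged = {}
    # Note: the "*.pdf" pattern is case-sensitive on some file systems.
    for pdf in Path(folder).rglob("*.pdf"):
        problems = filename_problems(pdf.name)
        if problems:
            flagged[str(pdf)] = problems
    return flagged


if __name__ == "__main__":
    # Hypothetical location; point this at wherever your PDFs live.
    for path, problems in scan_folder("storage").items():
        print(f"{path}: {', '.join(problems)}")
```

Anything the script lists is a candidate for renaming before import; files it does not list may still fail for the other reasons (a or b).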
kithairon

dstillman · March 11, 2010

Obviously, it would be great if non-English characters could be handled by Zotero.

It's a Mozilla bug.

https://www.zotero.org/trac/ticket/957

jeffreylacasse · March 11, 2010

Non-English characters are not a concern at my end; I'm still mystified as to why this isn't working.

Did the debug logs provide any useful information?

jeffreylacasse · March 11, 2010

.

adamsmith · March 11, 2010

Have some patience; it has only been 24 hours since your initial post.

Also, since the developers are quite busy, try following my initial suggestion and, instead of creating debug output, see if you can create an error report. Error reports are pretty easy to read and you might be able to figure out what's going on yourself.
You could also have a look at the debug output. If Zotero crashes while indexing a specific file, that should be pretty easy to discern.

By the way, for PDFs that are not indexed: do you have green arrows next to "Indexed: No" in the item information? What happens when you press the green arrows?

jeffreylacasse · March 11, 2010

When I press the green arrows, absolutely nothing happens.

erazlogo · March 11, 2010

@jeffreylacasse: see kithairon's answer above. Your PDFs are either secured, or their file names contain extended characters. Try this: rename the PDF to something simple, then click the green arrows. If that doesn't work, open your file in Acrobat Professional and see if it shows "(SECURED)" next to the file title. If the file is secured, you can use a program to unlock it, then index.
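
If you don't have Acrobat Professional, the pdfinfo tool you already installed may be able to answer the same question. Below is a rough Python sketch, assuming pdfinfo is on your PATH and that its report includes an "Encrypted:" line whose value starts with "yes" or "no" (the builds I know of print one, but treat that as an assumption for your version). The function names are made up for illustration.

```python
import subprocess


def is_encrypted(pdfinfo_output: str) -> bool:
    """Return True if the report has an 'Encrypted:' line whose value starts with 'yes'."""
    for line in pdfinfo_output.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "Encrypted":
            return value.strip().lower().startswith("yes")
    # No 'Encrypted:' line found; treated here as not encrypted (an assumption).
    return False


def check_file(path: str) -> bool:
    """Run pdfinfo on `path` and report whether it says the file is encrypted."""
    result = subprocess.run(["pdfinfo", path], capture_output=True, text=True)
    return is_encrypted(result.stdout)
```

Calling `check_file` on a PDF returns True when pdfinfo reports the file as encrypted; a False result does not rule out the other causes discussed above.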

jeffreylacasse · March 15, 2010

Thanks. They don't appear to be secured, nor do the file names contain extended characters, but I've just given up on it for the moment. Zotero works fine for citing in Word, etc.; maybe in time the other functions will work in a more user-friendly manner.