A PDF attachment fails to be indexed

joehill · January 21, 2014

Hello,

I'm trying to index an OCR'd PDF but nothing happens. I logged the output, which is submitted under the bug ID below. The strange thing is that, when I changed the name of the PDF, cleared the log, then tried again, I got the same error message but the new error persisted in identifying the file by its old file name (although the directory was correct). I changed the name back and it tried to index the file with the correct filename and directory name, but gave the same error. No idea what's happening.

Thanks,

Joe

Bug ID: D1955072479

dstillman · January 21, 2014

Try deleting the pdfinfo* and pdftotext* files in your Zotero data directory and reinstalling them via the Search pane of the Zotero preferences.

dstillman · January 21, 2014

The strange thing is that, when I changed the name of the PDF, cleared the log, then tried again, I got the same error message but the new error persisted in identifying the file by its old file name (although the directory was correct).

Are you sure you didn't just rename the attachment item in Zotero? The filename of the associated file isn't changed unless you check the box to do so (or relink via Zotero if you renamed the file externally).

joehill · January 21, 2014

Yes, I checked multiple times. I just tested again and the same thing happens. The log starts with 2 identical lines ([JavaScript Error: "Brubaker and Cooper - 2000 - Beyond Identity.pdf was not indexed" {file: "chrome://zotero/content/xpcom/fulltext.js" line: 529}]), using the old file name. But after a number of lines with JavaScript Errors there's a separator of =====, under which more details are given, which use the correct/current file name (which has an extra hyphen before the .pdf). This is not part of the log from before I changed the name, because in each case I've cleared the log between changing names.

But that's not the main problem--the problem is that this PDF doesn't index.

2nd bug ID: D174204464

dstillman · January 21, 2014

Clearing debug output doesn't affect the errors that appear at the top. If you restart Firefox I think you'll find that the filenames are consistent. (If not, let me know, because that would be very odd.)

In any case, if you go into that directory, is there a .zotero-ft-cache file, and if so, do you see text from the PDF in it?
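A minimal sketch of that check, for anyone who would rather script it than open the folder by hand. Only the file name .zotero-ft-cache comes from this thread; the function name, the UTF-8 encoding, and the 200-character preview length are assumptions made for illustration.

```python
from pathlib import Path
from typing import Optional


def read_fulltext_cache(attachment_dir: Path, max_chars: int = 200) -> Optional[str]:
    """Return the first max_chars of .zotero-ft-cache, or None if it is missing.

    The encoding is an assumption; undecodable bytes are replaced, not raised.
    """
    cache = attachment_dir / ".zotero-ft-cache"
    if not cache.is_file():
        return None
    return cache.read_text(encoding="utf-8", errors="replace")[:max_chars]


if __name__ == "__main__":
    import sys

    # Pass the attachment's storage folder as the first argument.
    folder = Path(sys.argv[1]) if len(sys.argv) > 1 else Path(".")
    preview = read_fulltext_cache(folder)
    if preview is None:
        print("No .zotero-ft-cache in", folder)
    else:
        print("Cache starts with:", repr(preview))
```

A result of None means no cache file was written for that attachment, while an empty string means the file exists but holds no text.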

joehill · January 22, 2014

Yes I just restarted Zotero (I'm using standalone) and it not just has new errors. (D253448755)

I opened up the cache file and there's no PDF text there. Here is all the text there is (repeated for every time I hit the index button):

Title: Ryso0470 1..47
Creator: 3B2 Total Publishing 6.03d/W
Producer: Acrobat Distiller 3.01 voor Windows
CreationDate: Tue Feb 29 09:41:02 2000
ModDate: Mon Jan 22 14:48:57 2007
Tagged: no
Pages: 47
Encrypted: yes (print:yes copy:no change:yes addNotes:yes)
Page size: 595 x 841 pts
File size: 275838 bytes
Optimized: yes
PDF version: 1.4

joehill · January 22, 2014

(Sorry, typo: the above should read "it now just has new errors", as in, it only mentions the current file name.)

dstillman · January 22, 2014

That's .zotero-ft-info, not .zotero-ft-cache.

dstillman · January 22, 2014

(repeated for every time I hit the index button)

Huh, that's a bug we've never noticed (though it's irrelevant to this, and I suspect it doesn't cause any problems).

joehill · January 22, 2014

Oh right, I figured maybe that was intentional.

dstillman · January 22, 2014

But see above: you didn't say what's in .zotero-ft-cache.

joehill · January 22, 2014

Sorry, I misunderstood what you were asking for. There is no .zotero-ft-cache in the directory.

dstillman · January 22, 2014

Oh, well, the problem is probably this:

Encrypted: yes

It's an encrypted PDF, which the PDF tools we're currently using probably won't read. We're planning to switch to a new PDF text extraction tool in the not-too-distant future, and it might do better on protected PDFs like this.
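A rough way to spot this case without reading the file by eye is to parse the "Key: value" lines of .zotero-ft-info and look at the Encrypted field. The sketch below assumes the file always has the layout pasted earlier in this thread; parse_info and is_encrypted are hypothetical helper names, not part of Zotero.

```python
from typing import Dict

# Excerpt of the .zotero-ft-info contents posted earlier in this thread.
SAMPLE_INFO = """\
Title: Ryso0470 1..47
Producer: Acrobat Distiller 3.01 voor Windows
CreationDate: Tue Feb 29 09:41:02 2000
Pages: 47
Encrypted: yes (print:yes copy:no change:yes addNotes:yes)
PDF version: 1.4
"""


def parse_info(text: str) -> Dict[str, str]:
    """Split each 'Key: value' line on its first colon; later keys overwrite earlier ones."""
    fields: Dict[str, str] = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def is_encrypted(text: str) -> bool:
    """True when the Encrypted field starts with 'yes' (a missing field counts as no)."""
    return parse_info(text).get("Encrypted", "no").lower().startswith("yes")


if __name__ == "__main__":
    print("Encrypted:", is_encrypted(SAMPLE_INFO))
    print("Pages:", parse_info(SAMPLE_INFO).get("Pages"))
```

Splitting on the first colon only keeps values such as the CreationDate timestamp intact, since they contain colons of their own.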

joehill · January 22, 2014

Ah, I see. I was assuming you're using the same pdftotext I have installed on my system (3.03), but I have no problem extracting it using the command line. (This one is readable without a password.) Thanks for looking into this.

dstillman · January 22, 2014

Yeah, we're still using 3.02. I think 3.03 may have improved this. You can try running the Zotero version from the command line to confirm that that's the problem, using the command line from the debug output; it will probably show a proper error message.
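If you would rather capture the error text than read it off the terminal, a small wrapper can run the command and return whatever the tool prints to stderr. This is only a sketch: run_and_capture is a hypothetical helper, and the exact command and arguments should be copied from the debug output rather than guessed.

```python
import subprocess
from typing import Sequence, Tuple


def run_and_capture(command: Sequence[str]) -> Tuple[int, str]:
    """Run a command and return (exit code, stderr text) without raising on failure."""
    result = subprocess.run(list(command), capture_output=True, text=True)
    return result.returncode, result.stderr


if __name__ == "__main__":
    import sys

    # Usage: python this_script.py <command copied from the debug output...>
    if len(sys.argv) > 1:
        code, err = run_and_capture(sys.argv[1:])
        print("exit code:", code)
        print("stderr:", err.strip() or "(empty)")
```

A non-zero exit code together with a message on stderr would support the encryption explanation. An exit code of zero would suggest the problem lies elsewhere.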

But if 3.03 works, I think on Linux you can safely swap in the 3.03 pdftotext binary in place of the existing Zotero one. (We use a custom pdfinfo build, since the standard pdfinfo build doesn't support text file output, and custom versions of both on Windows to prevent console windows from popping up.)

joehill · January 22, 2014

Sorry for the basic question, but can you help me find the pdftotext binary? I'm looking in the Zotero standalone folder and can't find it.

dstillman · January 22, 2014

The PDF tools are at the root of the Zotero data directory.

joehill · January 22, 2014

Aha, got it. Ok, I moved the Zotero version to a .bk extension and symlinked to the system's version of pdftotext (and updated the "version" file to "3.03") and it does indeed index this pdf file now. Thanks for solving this for me!
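For reference, the three steps (back up the bundled binary with a .bk extension, symlink the system binary in its place, update the version file) can be sketched as below. The actual file names in the Zotero data directory are not given in this thread, so bundled_name and version_file_name are parameters you must fill in yourself, and swap_in_system_tool is a hypothetical helper. It assumes a platform with symlink support, such as Linux.

```python
from pathlib import Path


def swap_in_system_tool(data_dir: Path, bundled_name: str, system_binary: Path,
                        version: str, version_file_name: str) -> Path:
    """Back up the bundled tool, symlink the system one in its place, record the version.

    Returns the path of the backup. All file names are supplied by the caller
    because this sketch does not know what Zotero calls them on a given platform.
    """
    bundled = data_dir / bundled_name
    backup = bundled.with_name(bundled.name + ".bk")
    if not system_binary.is_file():
        raise FileNotFoundError(f"system binary not found: {system_binary}")
    if backup.exists():
        raise FileExistsError(f"backup already exists: {backup}")
    bundled.rename(backup)             # raises FileNotFoundError if the bundled tool is missing
    bundled.symlink_to(system_binary)  # the old path now points at the system binary
    (data_dir / version_file_name).write_text(version)
    return backup
```

To undo the swap, delete the symlink, rename the .bk file back to its original name, and restore the previous contents of the version file.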

joehill · January 22, 2014

And of course .zotero-ft-cache now contains the output of pdftotext 3.03.