A PDF attachment fails to be indexed

Hello,

I'm trying to index an OCR'd PDF but nothing happens. I logged the output, which is submitted under the bug ID below. The strange thing is that, when I changed the name of the PDF, cleared the log, then tried again, I got the same error message, but the new error still identified the file by its old file name (although the directory was correct). I changed it back, and it tried to index the file with the correct filename and directory name, but gave the same error. No idea what's happening.

Thanks,

Joe

Bug ID: D1955072479
  • edited January 21, 2014
    Try deleting the pdfinfo* and pdftotext* files in your Zotero data directory and reinstalling them via the Search pane of the Zotero preferences.
  • edited January 21, 2014
    "The strange thing is that, when I changed the name of the PDF, cleared the log, then tried again, I got the same error message but the new error persisted in identifying the file by its old file name (although the directory was correct)."
    Are you sure you didn't just rename the attachment item in Zotero? The filename of the associated file isn't changed unless you check the box to do so (or relink via Zotero if you renamed the file externally).
  • Yes, I checked multiple times. I just tested again and the same thing happens. The log starts with two identical lines using the old file name:

    [JavaScript Error: "Brubaker and Cooper - 2000 - Beyond Identity.pdf was not indexed" {file: "chrome://zotero/content/xpcom/fulltext.js" line: 529}]

    But after a number of lines with JavaScript errors there's a separator of =====, under which more details are given, which use the correct/current file name (which has an extra hyphen before the .pdf). This is not part of the log from before I changed the name, because in each case I've cleared the log between changing names.

    But that's not the main problem--the problem is that this PDF doesn't index.

    2nd bug ID: D174204464
  • Clearing debug output doesn't affect the errors that appear at the top. If you restart Firefox I think you'll find that the filenames are consistent. (If not, let me know, because that would be very odd.)

    In any case, if you go into that directory, is there a .zotero-ft-cache file, and if so, do you see text from the PDF in it?
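
For anyone checking this on their own machine, here is a minimal Python sketch of that test. It only assumes the two file names mentioned in this thread (.zotero-ft-info and .zotero-ft-cache); the function name and the folder path you pass in are hypothetical.

```python
from pathlib import Path


def fulltext_files(attachment_dir):
    """Report which full-text files exist in an attachment folder.

    Returns a dict mapping each file name to its size in bytes,
    or to None when the file is missing.
    """
    folder = Path(attachment_dir)
    report = {}
    for name in (".zotero-ft-info", ".zotero-ft-cache"):
        path = folder / name
        report[name] = path.stat().st_size if path.is_file() else None
    return report
```

A None (or zero) entry for .zotero-ft-cache alongside a populated .zotero-ft-info would suggest that the metadata step ran but text extraction did not.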
  • Yes I just restarted Zotero (I'm using standalone) and it not just has new errors. (D253448755)

    I opened up the cache file and there's no PDF text there. Here is all the text there is (repeated for every time I hit the index button):

    Title: Ryso0470 1..47
    Creator: 3B2 Total Publishing 6.03d/W
    Producer: Acrobat Distiller 3.01 voor Windows
    CreationDate: Tue Feb 29 09:41:02 2000
    ModDate: Mon Jan 22 14:48:57 2007
    Tagged: no
    Pages: 47
    Encrypted: yes (print:yes copy:no change:yes addNotes:yes)
    Page size: 595 x 841 pts
    File size: 275838 bytes
    Optimized: yes
    PDF version: 1.4
  • (Sorry, typo: the above should read "it now just has new errors", as in, it only mentions the current file name.)
  • That's .zotero-ft-info, not .zotero-ft-cache.
  • "(repeated for every time I hit the index button)"
    Huh, that's a bug we've never noticed (though it's irrelevant to this, and I suspect it doesn't cause any problems).
  • Oh right, I figured maybe that was intentional.
  • But see above: you didn't say what's in .zotero-ft-cache.
  • Sorry, I misunderstood what you were asking for. There is no .zotero-ft-cache in the directory.
  • edited January 22, 2014
    Oh, well, the problem is probably this:
    Encrypted: yes
    It's an encrypted PDF, which the PDF tools we're currently using probably won't read. We're planning to switch to a new PDF text extraction tool in the not-too-distant future, and it might do better on protected PDFs like this.
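
As a rough illustration of spotting this automatically, the sketch below parses pdfinfo-style "Key: value" text, such as the contents of .zotero-ft-info posted above, and checks the Encrypted field. The function names are made up for this example, and the sample is an excerpt of the output quoted in this thread.

```python
def parse_pdfinfo(text):
    """Parse 'Key: value' lines into a dict (later duplicates win)."""
    fields = {}
    for line in text.splitlines():
        # Split on the first colon only, so values such as
        # timestamps keep their own colons.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def is_encrypted(fields):
    """True when the Encrypted field starts with 'yes'."""
    return fields.get("Encrypted", "no").lower().startswith("yes")


sample = """Title: Ryso0470 1..47
CreationDate: Tue Feb 29 09:41:02 2000
Pages: 47
Encrypted: yes (print:yes copy:no change:yes addNotes:yes)
PDF version: 1.4
"""

info = parse_pdfinfo(sample)
print(is_encrypted(info))  # prints True
```

Because the info file in this thread contains the same block repeated once per indexing attempt, letting later duplicates overwrite earlier ones is harmless here.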
  • Ah, I see. I was assuming you're using the same pdftotext I have installed on my system (3.03), but I have no problem extracting it using the command line. (This one is readable without a password.) Thanks for looking into this.
  • edited January 22, 2014
    Yeah, we're still using 3.02. I think 3.03 may have improved this. To confirm that that's the problem, you can try running the Zotero version from the command line, using the command line from the debug output; it will probably show a proper error message.

    But if 3.03 works, I think on Linux you can safely swap in the 3.03 pdftotext binary in place of the existing Zotero one. (We use a custom pdfinfo build, since the standard pdfinfo build doesn't support text file output, and custom versions of both on Windows to prevent console windows from popping up.)
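
If you'd rather script the comparison than type it by hand, here is a small Python sketch that runs a command and captures its exit code and error output. It is generic on purpose: the exact pdftotext command line should be copied from your own debug output, and the paths in the commented example are placeholders, not real Zotero locations.

```python
import subprocess


def run_tool(command):
    """Run a command (a list of arguments) and return
    (exit code, stdout, stderr) with the output decoded as text."""
    result = subprocess.run(command, capture_output=True, text=True)
    return result.returncode, result.stdout, result.stderr


# Hypothetical usage; substitute the command from the debug output:
# code, out, err = run_tool(["/path/to/data-dir/pdftotext", "input.pdf", "output.txt"])
# print(code, err)
```

Running the same input through both the bundled binary and the system one, and comparing exit codes and stderr, should show whether the older version is the one rejecting the file.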
  • Sorry for the basic question, but can you help me find the pdftotext binary? I'm looking in the Zotero standalone folder and can't find it.
  • The PDF tools are at the root of the Zotero data directory.
  • Aha, got it. OK, I moved the Zotero version to a .bk extension, symlinked to the system's version of pdftotext, and updated the "version" file to "3.03", and it does indeed index this PDF file now. Thanks for solving this for me!
  • And of course .zotero-ft-cache now contains the output of pdftotext 3.03.
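
For reference, the workaround described above (back up the bundled binary, symlink the system one, update the version file) could be sketched in Python roughly as follows. All four arguments are assumptions you have to supply yourself: this thread doesn't give the exact file names in the data directory, and whether the version file needs a trailing newline is not stated, so this writes the bare string.

```python
import os
from pathlib import Path


def swap_in_system_binary(zotero_tool, system_tool, version_file, new_version):
    """Replace the bundled tool with a symlink to the system one.

    The bundled file is kept next to it with a '.bk' suffix so the
    change can be undone. Returns the path of the backup.
    """
    zotero_tool = Path(zotero_tool)
    backup = zotero_tool.with_name(zotero_tool.name + ".bk")
    if backup.exists():
        # Refuse to overwrite an earlier backup.
        raise FileExistsError(f"backup already exists: {backup}")
    zotero_tool.rename(backup)
    os.symlink(system_tool, zotero_tool)
    Path(version_file).write_text(new_version)
    return backup
```

To undo it, delete the symlink, rename the .bk file back, and restore the old version string. Per the note above about custom builds, this seems reasonable only for pdftotext on Linux.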