Problem with indexing PDFs

salvadore · January 22, 2010

Hello,

I cannot seem to index all of my text PDFs. 1578997951
Is there a way to find out what the problem is?

salvadore · January 22, 2010

Furthermore, after importing a few hundred PDFs and entries from EndNote, I started to get this error:
1352818398

The search engine stops working and I need to restart Firefox to make Zotero work again.
Thanks.

salvadore · January 22, 2010

One more note: the same files that would not index automatically can be indexed manually, but as you know, that is time-consuming.

adamsmith · January 22, 2010

Are you using the Retrieve Metadata function?
Because it uses Google Scholar, Google locks you out after a while because you look like a robot ;-).

salvadore · January 22, 2010

No, I haven't. This is what I did:
- I imported my database from EndNote with NO documents attached;
- I ran OCR on my 400 PDF files so that I could then index them with Zotero's indexing function;
- I attached each file to an entry through a *link* using ZotFile. To my knowledge this was the only way to rename my files in a standardized fashion and move them to a folder of my choice, as opposed to Zotero's numbered directories. I also did this to avoid conflicts with Dropbox. It took me some time, but I was very pleased with the result.
- I then tried to maximize the number of indexed files in a variety of ways: I rebuilt the index from scratch, I tried "index unindexed items", and I cleared the index and built it again. No matter what I do, Zotero does not seem capable of indexing more than 150 files. Even allowing that a portion of my 400 files may not have converted to text (10-20% at most), I still don't understand why the indexing process stops at around 150 items instead of continuing until at least 300 files have been indexed.

Thanks.

adamsmith · January 22, 2010

hmm - Zotero certainly isn't limited to indexing 150 files - I have almost 1000 indexed.

reh · January 26, 2010

Hello,
I have a similar problem, but from the beginning, not after an import.
Probably no PDFs are indexed; I fear the 82 indexed entries are all websites.
I installed the pdftotext files from the search options dialog and tried to index single files that contain text.
Suspecting an XP problem, I copied my data to a Linux machine and tried it there: same effect :-(((
It doesn't matter whether I created the PDFs myself or downloaded them from somewhere.
Running pdftotext on Linux, I can extract the text, so my PDFs seem to be OK.

salvadore · January 26, 2010

Some of mine are indexed, but just a few. On top of this problem, I noticed that when I try to manually index the PDFs that are not indexed, Zotero crashes. Here is the error:
425236630

This is becoming quite frustrating actually, I hope someone will help.

adamsmith · January 26, 2010

salvadore - what OS are you on?
If you're on Linux or Mac, could you try to run pdftotext on one of the files that crashes Zotero? (I have no idea how to do that, or whether it's even possible, on Windows, but if it is, do the same there.)

@reh - what happens if you manually try to index one of the files in question (select the file and click on the green arrow-circle next to "Indexed: No")?
See if you get an error message that you could post.
In the Search tab of your preferences, do you have both pdftotext and pdfinfo shown as installed?

Also, for both of you: which Zotero version are you using?

salvadore · January 26, 2010

> salvadore - what OS are you on?
Same problem under both Vista and XP.

> if you're on Linux or Mac, could you try to run pdftotext on one of the files that crashes zotero?
Sorry, I have never used a Mac or Linux in my life.

> Also, for both of you which Zotero version are you using?
I am running the latest version of Zotero, and I have both PDF tools installed.

dstillman · January 26, 2010

salvadore: Too many errors in there, most unrelated to Zotero. Restart Firefox and provide a Report ID and Debug ID for just the indexing attempt. Also, what do you mean by "crash"?

salvadore · January 26, 2010

First of all, thank you for your concern!
By "crash" I mean that my Zotero database stops responding and all the entries disappear.
I tried to manually index a PDF linked to one of my entries, and this is what happened:
- The two green "recycle"-like arrows disappeared and the "Indexed" field still showed "No".
- I went to another entry to see what would happen, and the central pane displayed the message "An error has occurred. Please restart Firefox....."
This time the "Report Errors" option was grayed out and I could not report the error. I closed Zotero without closing Firefox, and when I tried to reopen Zotero I received an error message.

[removed non-Zotero error — D.S.]

dstillman · January 26, 2010

If Report Errors is grayed out then there's no error, but we still need a Debug ID.

salvadore · January 26, 2010

For some reason the error text was cut from the posting - sorry! Here is a new error for you, following the same actions as described above:
[JavaScript Error: "uncaught exception: [Exception... "Component returned failure code: 0x80520012 (NS_ERROR_FILE_NOT_FOUND) [nsIFile.moveTo]" nsresult: "0x80520012 (NS_ERROR_FILE_NOT_FOUND)" location: "JS frame :: chrome://zotero/content/xpcom/attachments.js :: _moveOrphanedDirectory :: line 1230" data: no]"]

648032580

dstillman · January 26, 2010

Again, we need a Debug ID. Please follow the link.

salvadore · January 26, 2010

Sorry, here it is: The Debug ID is D286758233

salvadore · January 26, 2010

Here is one more case:
The Debug ID is D904080953.

reh · January 27, 2010

> @reh - what happens if you manually try to index one of the files in question
The arrow disappears briefly; nothing else happens.

> do you have both pdftotext and pdfinfo shown as installed?
Yes, on both operating systems.

> which Zotero version are you using?
The latest beta. Today I installed 2.0rc2: same problem.
Firefox is 3.0.3 on XP and 3.6 on Linux.

> see if you get an error message that you could post.
(XP) This seems to be the problem: Cache file doesn't exist!
The Debug ID is D1719467896.

I'm not sure how the permissions were set on the virtual Ubuntu machine, but here on XP there are no write restrictions.

dstillman · January 27, 2010

You both are receiving the "Cache file doesn't exist" message, and you both are trying to index files on drives other than C:, which I'm guessing is the issue. Can you provide any details on those drives?

This obviously shouldn't take down Zotero in any case, though, so we'll take a look.

dstillman · January 27, 2010

salvadore: Your Zotero crash was due to a bug that I've just fixed in the latest dev build. Your indexing failure, at least with the file you provided debug output for, was due to a Firefox limitation that prevents Zotero from indexing files with filenames containing extended characters. I've added some error logging for this to the latest dev build. The good news is that it looks like this will soon be fixed in the Mozilla codebase, though the fix probably won't be available until Firefox 3.7.
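
For anyone who wants to check a folder of linked PDFs for this limitation, here is a minimal sketch. It assumes that "extended characters" means any non-ASCII character in the file name, which may be broader than what Firefox actually trips over; the function names are made up for illustration and are not part of Zotero.

```python
from pathlib import Path
from typing import List


def has_extended_chars(name: str) -> bool:
    """Return True if the file name contains any non-ASCII character."""
    return not name.isascii()


def find_extended_pdfs(folder: str) -> List[Path]:
    """List *.pdf files under `folder` whose names contain non-ASCII characters.

    Note: the "*.pdf" pattern is case-sensitive on case-sensitive filesystems.
    """
    return sorted(
        p for p in Path(folder).rglob("*.pdf") if has_extended_chars(p.name)
    )


if __name__ == "__main__":
    import sys

    # Folder to scan; defaults to the current directory.
    start = sys.argv[1] if len(sys.argv) > 1 else "."
    for pdf in find_extended_pdfs(start):
        print(pdf)
```

Renaming the listed files to plain ASCII names would be one way to test whether the file name is what blocks indexing.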

reh: I'm not sure why you're getting the indexing failure, assuming that PDF does indeed have embedded text, but the non-C: drive would be my best guess. You might be able to learn more by running pdftotext from the command line using the same arguments that are shown in the debug output.

reh · January 27, 2010

It's an NTFS partition on my second hard drive.
I set the storage preferences to the default and installed the PDF tools:
Error running pdftotext
The Debug ID is D69316096.

On Linux my home directory is a mounted NFS device (XFS).
I set the storage preferences to the default and the permissions on the zotero directory to 777.
Same effect as yesterday: no cache file.

dstillman · January 27, 2010

Well, like I said, you'd have to run pdftotext from the command line to have any hope of figuring out what the actual error is.
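
If typing the command by hand is awkward, a small wrapper can run it and show the exit code and error text. This is only a sketch: `run_tool` is a hypothetical helper, and the command line has to be copied from your own debug output, since the exact arguments are not shown in this thread.

```python
import subprocess
from typing import List, Optional, Tuple


def run_tool(argv: List[str]) -> Tuple[Optional[int], str, str]:
    """Run a command and return (exit code, stdout, stderr).

    The exit code is None when the program could not be started at all,
    for example because the binary is missing or is not a valid executable.
    """
    try:
        result = subprocess.run(
            argv, capture_output=True, text=True, errors="replace"
        )
    except OSError as exc:
        return None, "", str(exc)
    return result.returncode, result.stdout, result.stderr


if __name__ == "__main__":
    import sys

    if len(sys.argv) < 2:
        sys.exit("usage: python run_tool.py <binary> [arguments from the debug output]")
    code, out, err = run_tool(sys.argv[1:])
    print("exit code:", code)
    print("stdout:", out)
    print("stderr:", err)
```

A result of `None` for the exit code means the operating system refused to start the binary at all, which points at the binary itself rather than at the PDF.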

reh · January 27, 2010

I didn't read your post until after posting mine.

I changed into the zotero directory and tried to run pdftotext-Linux-i686 [string from log] on the command line (hopefully correctly): command not found.
The same arguments with pdftotext work and create the cache file.

Running pdftotext-Win32.exe [string from log]: program could not be executed (translated from the German message).

dstillman · January 27, 2010

> I made a cd to the zotero dir and tried to use pdftotext-Linux-i686 [string from log] on a commandline (hopefully correct): command not found.

Try "./pdftotext-Linux-i686" instead.

reh · January 27, 2010

bash: No such file or directory (translated from German)

reh · January 27, 2010

If I try to execute it without arguments, I get: cannot execute binary file (the executable bit is set correctly). If I open it in MC, it is displayed as if it had been called with more.

On XP I get: program too big to fit in memory.

dstillman · January 27, 2010

Well, you're pretty much on your own for these. They're just executable files, and they work on both Linux and XP. If you're having trouble, you can try erasing the binaries, restarting Firefox, and reinstalling them. If they don't work, there's some other problem on your system.
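
One quick way to tell a damaged download from a system problem is to look at the first bytes of the file. ELF binaries for Linux start with the bytes `\x7fELF`, and Windows executables start with `MZ`. The sketch below checks only that and the file size; `executable_kind` is a hypothetical helper, and a correct header does not prove the rest of the file is intact.

```python
import os


def executable_kind(path: str) -> str:
    """Guess the executable format from the first bytes of the file."""
    with open(path, "rb") as fh:
        head = fh.read(4)
    if head == b"\x7fELF":
        return "ELF (Linux)"
    if head[:2] == b"MZ":
        return "MZ/PE (Windows)"
    return "unknown"


if __name__ == "__main__":
    import sys

    # Pass the paths of the downloaded binaries as arguments.
    for binary in sys.argv[1:]:
        size = os.path.getsize(binary)
        print(f"{binary}: {executable_kind(binary)}, {size} bytes")
```

If a binary reports "unknown" or a size that differs from a copy that works on another machine, erasing it and downloading again is the obvious next step.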

reh · January 28, 2010

Someone on a university network installed Zotero and gave me the PDF binaries from that system (their file sizes are very different from mine). With these, indexing works on Linux.

But normally I work with XP.
Unfortunately I can't find these PDF tools for manual download.
(But I found another thread with the same problem: http://forums.zotero.org/discussion/7681/pdfinfo-pdftotext-crash-program-too-big-to-fit-in-memory/)
In another post I found a link to http://www.zotero.org/download/xpdf/pdfinfo-Linux-i686-3.02
but access to http://www.zotero.org/download/xpdf/ is not allowed.

The automatic install is a good thing, but I think it should be possible to download the proper version manually if needed. Please link to it somewhere.

For the developers:
Our setup here is a PC with 64-bit Linux running several 32-bit virtual machines (VMware),
and a separate PC (Athlon dual core 4850e) with 32-bit XP.

reh · January 29, 2010

Please, can someone tell me where I can manually download pdftotext-Win32.exe?

I have already installed it several times, also with the cache disabled, but it doesn't work. It seems that the problem is not on my PC but with the automatic download.
Maybe Zotero should do some form of check (a checksum?) to verify that the download was correct.

dstillman · January 29, 2010

http://www.zotero.org/download/xpdf/pdftotext-Win32.exe-3.02

Checksumming is planned, but, of course, that would only indicate a failure in your case, not fix it. The auto-download works for most people, so it's likely an issue either with your computer or a network glitch (or Firefox still had the corrupted version cached).
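
Until something like that exists, a checksum can be compared by hand against a copy that is known to work, such as the binaries from another machine mentioned above. This is a minimal sketch that assumes SHA-256 as the hash; no official checksum is given in this thread, and `sha256_of` is a hypothetical helper.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 65536) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    import sys

    # Print one digest per file so two copies can be compared by eye.
    for path in sys.argv[1:]:
        print(sha256_of(path), path)
```

Two copies of the same release should give identical digests; a mismatch suggests that one of the downloads was truncated or corrupted.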

dstillman · January 29, 2010

If you're downloading manually, you need to remove the "-3.02" suffix, and you should create a pdftotext-Win32.exe.version file that contains "3.02". (The same applies to the Linux version.)
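
Those two steps can be sketched as follows. The destination folder is an assumption: pass the directory where Zotero keeps its PDF tools, which earlier in this thread is described only as "the zotero dir". The function name is hypothetical, and setting the executable bit is an extra step that only matters on Linux.

```python
import shutil
from pathlib import Path


def install_manual(downloaded: str, dest_dir: str, version: str = "3.02") -> Path:
    """Copy a manually downloaded binary into place without its version suffix
    and write the matching .version file next to it."""
    src = Path(downloaded)
    suffix = "-" + version
    if not src.name.endswith(suffix):
        raise ValueError(f"expected a file name ending in {suffix!r}: {src.name}")

    # e.g. pdftotext-Win32.exe-3.02 becomes pdftotext-Win32.exe
    target = Path(dest_dir) / src.name[: -len(suffix)]
    shutil.copyfile(src, target)

    # Assumption: mark the copy as executable (needed on Linux only).
    target.chmod(target.stat().st_mode | 0o111)

    # e.g. pdftotext-Win32.exe.version containing "3.02"
    version_file = target.with_name(target.name + ".version")
    version_file.write_text(version)
    return target


if __name__ == "__main__":
    import sys

    # usage: python install_manual.py <downloaded file> <destination folder>
    print(install_manual(sys.argv[1], sys.argv[2]))
```

Restart Firefox afterwards so that the tools are detected again.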