Indexing questions

realtime99 · March 15, 2012

Hi, a few questions on indexing.

When indexing pdf files that are just images of the pages and do not include text information under the image, sometimes when I individually reindex the item, the status remains index 'no', but other times it changes to index 'yes'. This is true even if I select another item after reindexing and then re-select the first. Why is this? Is there some difference between these pdf's that I am missing?

Also, I am trying to rebuild my index from scratch for a library of about 5000 items. My computer takes between 20 seconds and 5 minutes to index a pdf, so that is about 100 hours to do the whole thing. Is there any way to index a subset of pdf attachments without selecting them all one-by-one? I have tried to set up searches to do this, but even in search groups that show pdf attachments in normal text and their parent records as greyed-out, using shift-click selects all files, not just pdfs, and thus the 'reindex item' option is not visible.

I cannot interact with Zotero in any way during indexing, even to minimize the window. Is that expected behavior?

Lastly, I would like to OCR the pdfs that do not have any text information, so they can be indexed and searched. As mentioned above, sometimes these pdfs are shown as indexed even if they have no text to index. Is there any way to distinguish these pdfs from those that are actually properly indexed by their text, without opening every pdf individually? Even if not in zotero, is there a characteristic of these pdfs (such as a very small file) that I can search for to aid in this task?

I am using Zotero standalone in Win 7 x64.

Thanks!

adamsmith · March 15, 2012

1. presumably because they have a small amount of text somewhere
2. Re-building your index shouldn't take that long if you do this from the Zotero preferences. Manual re-indexing isn't something you should typically have to do - there is an option, I believe, to only index un-indexed items.
You can select just the pdfs in a saved search with parent items greyed out by using ctrl+a (cmd+a on a mac) instead of shift-select.
3. No, I don't believe there is a way to distinguish them - if anything, files w/o text will tend to be larger, but that's not a firm rule either, as there are also files with a text and an image layer.

dstillman · March 15, 2012

manual re-indexing isn't something you should typically have to do

The only catch here is that file syncing doesn't currently trigger indexing, so attachments that came from another computer would generally be unindexed. We'll try to address this soon.

aurimas · March 15, 2012

As far as OCR goes, you may find this question helpful http://superuser.com/questions/107678/batch-ocr-for-many-pdf-files-not-already-ocred

Tesseract (http://code.google.com/p/tesseract-ocr/) should do what you want, though I haven't actually tried it.

EDIT: Also, maybe SimpleOCR (http://www.simpleocr.com/Info.asp). You can probably find more tools by googling batch ocr

realtime99 · March 15, 2012

Thanks for the helpful responses. adamsmith, you were right about 1.

On 2, ctrl+A is super-helpful--is that shortcut documented anywhere?

Also on 2, how long should it take? I just indexed a 400 pg. book and it took about 2 minutes. 5000 items at 1 minute each is 83 hours--are you saying that it will be much faster doing it through preferences? Remember, I am building the index from scratch.
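For reference, the estimate above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the helper name `total_hours` is made up for illustration, and the per-item times are the ones mentioned in this thread (20 seconds, 1 minute, 5 minutes), not measurements.

```python
def total_hours(items, seconds_per_item):
    """Total indexing time in hours if every item takes the same time."""
    return items * seconds_per_item / 3600


# Per-item times quoted in this thread; 5000 is the library size.
for label, secs in [("20 s", 20), ("1 min", 60), ("5 min", 300)]:
    print(f"{label} per item: {total_hours(5000, secs):.1f} hours")
# 20 s per item: 27.8 hours
# 1 min per item: 83.3 hours
# 5 min per item: 416.7 hours
```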

On 3, I just tried searching my storage folder for .zotero-ft-cache files smaller than 5k, and it seemed to grab those pesky non-OCR'd pdfs that have a small bit of hidden text. Would that make sense? Is that file the index information for each pdf?
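The search described above could be scripted roughly as follows. This is a sketch, not a tested Zotero tool: it assumes each attachment sits in its own subfolder of the storage folder with its `.zotero-ft-cache` file next to it, the 5k cutoff is just the threshold I tried, and the function name `find_small_caches` is made up for illustration.

```python
import os
import sys

CACHE_NAME = ".zotero-ft-cache"


def find_small_caches(storage_dir, max_bytes=5 * 1024):
    """Return (folder, cache_size, pdf_names) for every folder whose
    .zotero-ft-cache file is smaller than max_bytes."""
    hits = []
    for root, _dirs, files in os.walk(storage_dir):
        if CACHE_NAME not in files:
            continue  # no cache file here, so nothing to compare
        size = os.path.getsize(os.path.join(root, CACHE_NAME))
        if size < max_bytes:
            # List the PDFs in the same folder as candidates for OCR.
            pdfs = sorted(f for f in files if f.lower().endswith(".pdf"))
            hits.append((root, size, pdfs))
    return sorted(hits)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python find_small_caches.py <path to the Zotero storage folder>
    for folder, size, pdfs in find_small_caches(sys.argv[1]):
        print(f"{size:6d} bytes  {folder}  {', '.join(pdfs)}")
```

A small cache only suggests that little text was extracted; short PDFs with real text would show up too, so the list is a set of candidates to check rather than a definitive answer.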

And last, is it the case that I should expect Zotero to totally freeze during indexing so that I cannot even minimize the window? If I have to cancel the process halfway through, will it still retain the indexing it has done?

Thanks again.

adamsmith · March 15, 2012

2. Ctrl+a is just the generic "select all" shortcut pretty much anywhere on your system. I actually found the shift+click behavior a bit puzzling myself, I would have expected that to also only select the attachments.

I don't know how long it should take. Obviously it depends on your computer too, though W7 64-bit sounds like it'd be on a pretty fast machine. Someone just talked about a similar-size library, and much of it was indexed within 13 hours, so it seems like an ideal overnight task.
On the other hand, 2 minutes strikes me as quite long; I've never seen that. I don't know if tools like pdftotext run slower on Windows.

3. I believe the indexing will be retained, yes. And yes, I think indexing currently takes over Zotero completely; that will change in the future.
I'm not 100% sure about your ft-cache method, but it sounds right.