Finding unindexed items

PeterSmith · March 6, 2009

I have about 170 unindexed items, out of the 2,000 I have in Zotero.

That's fine. I understand why they are unindexed, and for some of the reasons I can fix the situation (e.g., by replacing doc files with pdfs, or using newer pdfs that contain text rather than images).

My question is: how, other than checking each item individually, do I identify the 170 items? I can't find a way to search for them ... so I'm wondering what I'm missing.

mark · March 7, 2009

Yep, I would like to be able to easily find unindexed items, too. So far I've been doing a search for all PDF attachments and browsing through them to spot the unindexed ones, but that doesn't make sense with a large library.

asallen · June 24, 2009

I'd be interested in this functionality too -- finding unindexed pdfs in a large library.

mguelck · August 26, 2009

I know it's a bit of a pain but here's a workaround:

1. Enable "Debug Output Logging" (Preferences -> Advanced).
2. Click Preferences -> Search -> Rebuild Index. Tell Zotero to only index unindexed items in the dialog that shows up.
3. Disable "Debug Output Logging"
4. Click the "View Output" button just below.
5. Search for the entries with "Cache file doesn't exist!". In the line prior to those entries you will find the path to the PDF-file that could not be indexed enclosed in the first set of quotation marks.
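Steps 4 and 5 are tedious to do by eye on a long log. Here is a minimal Python sketch, assuming the debug output has been saved as plain text and that the path really does sit in the first pair of double quotes on the line just before each "Cache file doesn't exist!" entry, as described above. The function name and the sample log lines are made up for illustration; real Zotero debug output will look different.

```python
import re

MARKER = "Cache file doesn't exist!"
# First double-quoted string on a line.
FIRST_QUOTED = re.compile(r'"([^"]*)"')


def unindexed_paths(log_text):
    """Return the first quoted string from the line preceding each marker line."""
    paths = []
    previous = ""
    for line in log_text.splitlines():
        if MARKER in line:
            match = FIRST_QUOTED.search(previous)
            if match:
                paths.append(match.group(1))
        previous = line
    return paths


# Hypothetical sample, not real Zotero output.
sample_log = '''indexing "/home/me/zotero/storage/ABCD1234/scan.pdf" "other"
Cache file doesn't exist!
some unrelated line
indexing "/home/me/zotero/storage/EFGH5678/ok.pdf" "other"
done
'''

for path in unindexed_paths(sample_log):
    print(path)
```

To run it on a real log, read the saved file into a string and pass that to unindexed_paths() in place of sample_log.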

mark · August 27, 2009

Great job, but really, if our workarounds have to look like this it is quite clear that we need a real solution badly.

Tjowens · August 27, 2009

I just opened a ticket to add "is indexed" and "unindexed" to the Advanced Search options for Attachment Content. This would let you run searches, or even create a saved search to track unindexed items.

komrade · August 27, 2009

Yeah, agree this would be a great addition.

Just as a tip, when I find un-indexed items, google often has OCR copies of the pdfs that I take HTML snapshots of.

If you have pdfs that you scanned yourself, you can even co-opt google to OCR your stuff for you!

Any other tips on indexing welcome!

mark · August 28, 2009

Adobe Acrobat Professional has a relatively good OCR engine (supporting several languages). I always use it to do a quick OCR job of stuff I've scanned myself.

nico.hesser · September 4, 2009

I wanted to try the solution mguelck posted, but I couldn't find a "Debug Output Logging" button or a "View Output" button in the Zotero settings. I think I found the right setting in about:config for the debug logging, but couldn't find anything that looks like "View Output". Are those buttons only in the zotero 2.0 version or am I missing something?

Would be great if you could point me in the right direction, I have nearly 1,000 files and 170 of them are unindexed :-(

dstillman · September 4, 2009

Are those buttons only in the zotero 2.0 version

Yes. In 1.0 you need to use the manual method:

http://www.zotero.org/support/debug_output

Note that this is very slow on Windows.

jszchen · May 31, 2011

I don't know if this has been resolved yet (very old thread by now), but if not, this seemed to do the trick for me on pdfs:
- Do an advanced search with two conditions: attachment is pdf and attachment contains the character "a" so that every indexed pdf would have it.
- tag matching items as "indexed" (the pdfs, not the parent item).
- Do another advanced search for attachment is pdf and tag is not "indexed."

PeterSmith · August 22, 2011

Hmmm ... bit too complex for me.

How about: Do an advanced search with two conditions: attachment is pdf and attachment DOES NOT contain the character "a" so that every indexed pdf would have it.

Odd, that doesn't work the way I expected it to. Oh well.

JonEP · March 18, 2012

Hi, I've just begun using the indexing feature of Zotero. It would be great to add the following advanced search feature:

"Indexing" : "status equals" : with options "indexed", "partially indexed" and "unindexed".

JonEP · November 17, 2012

I still think this would be a very useful feature. :)

mark · November 17, 2012

I agree and note that it is somewhat tantalizing to know that Zotero already has to be aware of this internally — the info just needs to be exposed to the advanced search...

taylo469 · March 19, 2013

Does anyone know if it would be possible for this feature to be added? I just recently discovered Zotero, and I agree that this would be a helpful feature.

AndySymons · June 15, 2016

Re OCR for PDF files: I always use Apple's PDFScanner. http://www.pdfscannerapp.com
It's the best £10.99 I spent on an app! Scanning is just one thing it does. You can also import PDFs to process them, combine more than one PDF into a single document, or split a document out into several.
It's great if you scanned or photographed a book. You can rotate pages, deskew them, and crop them. If you scan two book pages at a time on one 'page', you can crop to two separate pages, and provided you lined them all up the same, you can do the whole book with one command.
And it has OCR in several languages.

AndySymons · June 15, 2016

On finding out which files are not indexed:

Indexing is a great feature, so first: thanks Zotero for thinking of it.
However, it would be even better if it worked all the time. In my case I have 1643 indexed files and 422 unindexed. Reindexing them (even if I completely rebuild the index) does not change these numbers. There is no straightforward way of finding and fixing the 'rogue' files.

I thank mguelck for his imaginative solution. I tried it and got an error log; unfortunately it did not help in my case...

The first line says "[JavaScript Error: "Conflicts have suspended automatic syncing.", which is as unhelpful as it is worrying. What am I supposed to do about it?!

Then there are lots of messages at the top like "[JavaScript Error: "2010 What is postphenomenology.pdf was not indexed" {file: "chrome://zotero/content/xpcom/fulltext.js" line: 730}]. I can find the PDF, but what am I supposed to do with it?

The only *apparently* useful message was "[JavaScript Error: "Weikop - 2001 - Culture Wars â€˜The Enemy Withinâ€™.pdf was not indexed -- PDFs with filenames containing extended characters cannot currently be indexed due to a Firefox limitation" {file: "chrome://zotero/content/xpcom/fulltext.js" line: 730}]". I could find the PDF file in question, change its name and reindex it. Cause: see BUG ALERT below!
The bad news is, after doing that it simply changed the error message to: "[JavaScript Error: "Weikop - 2001 - BOOK REVIEW Culture Wars.pdf was not indexed" {file: "chrome://zotero/content/xpcom/fulltext.js" line: 730}]"
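For anyone with hundreds of these messages, the filenames can be pulled out of the log mechanically. Here is a rough Python sketch that assumes the messages have exactly the shape quoted above. The first two sample lines are copied from this post; the third filename ("Example article.pdf") is hypothetical, and the function name is mine.

```python
import re

# [JavaScript Error: "<filename> was not indexed[ -- <reason>]" {file: ...
NOT_INDEXED = re.compile(
    r'\[JavaScript Error: "([^"]+?) was not indexed(?: -- ([^"]*?))?" \{file:'
)


def failed_files(log_text):
    """Return (filename, reason) pairs; reason is None when none is given."""
    return [(m.group(1), m.group(2)) for m in NOT_INDEXED.finditer(log_text)]


sample_log = '''[JavaScript Error: "2010 What is postphenomenology.pdf was not indexed" {file: "chrome://zotero/content/xpcom/fulltext.js" line: 730}]
[JavaScript Error: "Weikop - 2001 - BOOK REVIEW Culture Wars.pdf was not indexed" {file: "chrome://zotero/content/xpcom/fulltext.js" line: 730}]
[JavaScript Error: "Example article.pdf was not indexed -- PDFs with filenames containing extended characters cannot currently be indexed due to a Firefox limitation" {file: "chrome://zotero/content/xpcom/fulltext.js" line: 730}]
'''

for filename, reason in failed_files(sample_log):
    print(filename, "|", reason or "no reason given")
```

This only lists which files failed. It says nothing about why, which is the part the log itself leaves out.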

Back to the log file... then there is a second section with a lot of messages like "(1)(+0000002): Invalid character set null / (2)(+0000000): File is not text in indexFile()". There is no indication at all of which file is involved or what the error means.

I pasted all the log file text into a WORD doc and searched on "Cache file doesn't exist!" as mguelck suggested, but it did not occur at all. :-(

So, please Zotero, let's have a simple function to identify
(a) which files cannot be indexed and
(b) what we have to do to correct the problem.

>>> BUG ALERT. The problem with the extended characters in the filename was caused by Zotero itself. The extended characters in question (scrambled in the error message) are simply quotation marks in the original. They got into the filename because I used the Zotero function [right click on PDF attachment and] "Rename file from Parent Metadata". The title has legitimate quotes in it. A workaround is to remove them, rename the file, then put them back in the title. Tedious though! Zotero should either change the "Rename file from Parent Metadata" function to remove extended characters, or change the indexing app so that it does not mind extended characters.
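To check a batch of filenames for this, here is a small Python sketch. It assumes "extended characters" means anything outside plain ASCII, and that the scrambling in the error message comes from UTF-8 bytes being displayed as Windows-1252; both are my inferences, not something confirmed in the log. The first sample filename is a reconstruction of the original with typographic quotes.

```python
def has_extended_chars(filename):
    """True if the name contains any non-ASCII character (assumed meaning)."""
    return not filename.isascii()


def as_mojibake(text):
    """Show how UTF-8 text looks when its bytes are misread as Windows-1252."""
    return text.encode("utf-8").decode("cp1252")


names = [
    # Reconstructed original name with typographic quotes (U+2018, U+2019).
    "Weikop - 2001 - Culture Wars \u2018The Enemy Within\u2019.pdf",
    "Weikop - 2001 - BOOK REVIEW Culture Wars.pdf",
]

flagged = [n for n in names if has_extended_chars(n)]
for name in flagged:
    print(name)
    print("  displayed as:", as_mojibake(name))
```

Only the first name is flagged, and its scrambled form has the same pattern as the one in the error message.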

Cheers,
Andy

adamsmith · June 15, 2016

No statement on the need to better handle indexing error messages, but you should be able to find unindexed PDFs much more easily using a saved search created with

Attachment Content --> does not contain --> . (that's a period; set the field to "regex" using the little arrow to its left)
Attachment File Type --> is --> PDF
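Why this works: as a regex, a period matches any single character, so every attachment with at least some indexed text "contains" it, and only attachments with no indexed content are left over. Here is a toy Python illustration of that logic. Python's re module stands in for Zotero's search here, and the dictionary of attachments is invented for the example.

```python
import re

ANY_CHAR = re.compile(r".")

# Hypothetical library: attachment name -> indexed text ("" if nothing was indexed).
indexed_text = {
    "paper-with-text.pdf": "Introduction. This paper ...",
    "scanned-images-only.pdf": "",
    "no-letter-a.pdf": "XYZ 123",
}

# "does not contain ." (regex): true only when there is no content at all.
unindexed = [name for name, text in indexed_text.items()
             if not ANY_CHAR.search(text)]

# For comparison, "does not contain a" also flags indexed text that lacks an "a".
missing_a = [name for name, text in indexed_text.items() if "a" not in text]

print(unindexed)
print(missing_a)
```

The comparison shows why the regex period is a safer test than searching for a particular letter.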

dstillman · June 15, 2016

Unrelated to indexing, but:

The first line says "[JavaScript Error: "Conflicts have suspended automatic syncing.", which is as unhelpful as it is worrying. What am I supposed to do about it?!

It says what to do right after that: "Click the sync icon to resolve them."

dstillman · June 15, 2016

As for the indexing, indexing of filenames with extended characters was something we fixed in 2011, and it works for me for a filename with typographic quotes on OS X. Can you provide exact steps to reproduce? (In other words, download an item from this place, run "Rename File from Parent Metadata", etc.)

The error message (which, bear in mind, is for developers, not for end users) may actually be misleading. The actual test in the code is whether the indexing cache file exists after indexing. There's then some code that (still) checks for extended characters and adds that last part to the message, but you're getting the same message without the extended characters part, which suggests that there's some other issue here for you.
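The logic described above can be paraphrased in a few lines of Python. This is a sketch of the description only, not the actual code in fulltext.js; the function name is hypothetical, and "extended characters" is assumed to mean non-ASCII.

```python
def indexing_error_message(filename, cache_file_exists):
    """Return None on success, else a message shaped like those quoted in this thread."""
    # The real test: does the indexing cache file exist after indexing?
    if cache_file_exists:
        return None
    message = f"{filename} was not indexed"
    # The extended-characters note is appended afterwards, whatever the real cause was.
    if not filename.isascii():  # assumption: "extended" means non-ASCII
        message += (" -- PDFs with filenames containing extended characters "
                    "cannot currently be indexed due to a Firefox limitation")
    return message


print(indexing_error_message("Culture Wars \u2018The Enemy Within\u2019.pdf", False))
print(indexing_error_message("BOOK REVIEW Culture Wars.pdf", False))
```

Seen this way, renaming the file only removes the second half of the message. If the cache file is still missing after indexing, the first half stays.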

sdspieg · June 3, 2018

you should be able to find unindexed PDF much easier using a saved search created with

Attachment Content --> does not contain --> . (that's a period; set the field to "regex" using the little arrow to its left)
Attachment File Type --> is --> PDF

Thanks for that workaround! But is there any way to run this on just a sublibrary? When I try this, I now only see my top-level libraries...

adamsmith · June 3, 2018

What do you mean by "sublibrary"? Collection? Group? You can include a collection in the search and select the group at the top of the advanced search.

sdspieg · June 3, 2018

Sorry - I meant both collection and various levels of subcollections in Groups. See here - I'd like to be able to search on just the 'EBSCO' sub-sub-sub-sub-subcollection :) Which does not show up in the list. And yes, I know that we are not your average user, but I suspect many might like to see the hierarchy of their collections reflected in the 'Search in library' dialog box...

No work-arounds for this? Maybe with sqlite-tools?

noksagt · June 3, 2018

Select the top-level library & then match on collection.

sdspieg · June 3, 2018

Brilliant! Thanks much...

kld123509945 · August 30, 2018

"
Attachment Content --> does not contain --> . (that's a period; set the field to "regex" using the little arrow to its left)
Attachment File Type --> is --> PDF
"
I don't think this still works in Zotero 5.0.55 when I search for unindexed PDFs, as shown in this figure:

http://imglf5.nosdn0.126.net/img/M3B0VGdGQVdIRGo5dzJBOEpBSExNUGtZWkVKc3cxYlh4RHVGZDBnSlBkeC9hRkNjY0tFSDF3PT0.png?imageView&thumbnail=500x0&quality=96&stripmeta=0

Please help me check it, thanks!

adamsmith · August 30, 2018

Works for me. Did you set the search to regex?

kld123509945 · August 31, 2018

@adamsmith Thanks for your reply!

I have chosen regex, but this method still doesn't work, as shown in this figure:

http://imglf3.nosdn0.126.net/img/M3B0VGdGQVdIRGowRHcvRTRRSUtrNWExNjNhRnBiSklDRnEzTXNFVWx5MzdFVUdzK1FGbjBBPT0.png?imageView&thumbnail=1680x0&quality=96&stripmeta=0

Are there any settings I missed? Thanks!

adamsmith · August 31, 2018

Odd. Try setting it to "Phrase" instead and use

Attachment Content -- does not contain -- %

(% is a wildcard in the phrase search). See if that works?

kld123509945 · August 31, 2018

I have rebuilt the index of the PDFs and now this method works
"
Attachment Content --> does not contain --> . (that's a period; set the field to "regex" using the little arrow to its left)
Attachment File Type --> is --> PDF
"
I think the earlier failure could be attributed to missing ".zotero-ft-info" files. Thanks!

btw, the "Attachment Content -- does not contain -- %" method still doesn't work, as shown in this figure:

http://imglf4.nosdn0.126.net/img/M3B0VGdGQVdIRGowRHcvRTRRSUtrNDBjQ3V0T2c3YThZZ251a0JPRVYxcE1VNmpNcFZsclhBPT0.png?imageView&thumbnail=1705y1011&type=png&quality=96&stripmeta=0