Finding unindexed items
I have about 170 unindexed items, out of the 2,000 I have in Zotero.
That's fine. I understand why they are unindexed, and for some of the reasons I can fix the situation (i.e., by replacing doc files with pdfs, or using newer pdfs that contain text rather than images).
My question is, how, other than checking each item individually, do I identify the 170 items. I can't find a way to search for them ... so I'm wondering what I'm missing.
1. Enable "Debug Output Logging" (Preferences -> Advanced).
2. Click Preferences -> Search -> Rebuild Index. Tell Zotero to only index unindexed items in the dialog that shows up.
3. Disable "Debug Output Logging"
4. Click the "View Output" button just below.
5. Search for the entries with "Cache file doesn't exist!". In the line prior to those entries you will find the path to the PDF-file that could not be indexed enclosed in the first set of quotation marks.
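If the debug output is long, step 5 can be done with a small script instead of by eye. This is only a rough sketch: it assumes you have pasted the debug output into a text variable or file, and that the line before each "Cache file doesn't exist!" entry carries the PDF path in its first pair of quotation marks, as described above. The function name `unindexed_paths` is made up for this example.

```python
import re

# Marker text taken from step 5 above.
MARKER = "Cache file doesn't exist!"


def unindexed_paths(log_text: str) -> list[str]:
    """Return the first quoted string on the line before each marker line."""
    paths = []
    lines = log_text.splitlines()
    for i, line in enumerate(lines):
        if MARKER in line and i > 0:
            # First set of quotation marks on the preceding line.
            match = re.search(r'"([^"]*)"', lines[i - 1])
            if match:
                paths.append(match.group(1))
    return paths
```

Usage would be something like `unindexed_paths(open("debug.txt", encoding="utf-8").read())`, where `debug.txt` is a hypothetical file holding the pasted output.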
Just as a tip, when I find un-indexed items, google often has OCR copies of the pdfs that I take HTML snapshots of.
If you have pdfs that you scanned yourself, you can even co-opt google to OCR your stuff for you!
Any other tips on indexing welcome!
Would be great if you could point me in the right direction, I have nearly 1,000 files and 170 of them are unindexed :-(
http://www.zotero.org/support/debug_output
Note that this is very slow on Windows.
- Do an advanced search with two conditions: attachment is pdf and attachment contains the character "a" so that every indexed pdf would have it.
- tag matching items as "indexed" (the pdfs, not the parent item).
- Do another advanced search for attachment is pdf and tag is not "indexed."
How about: Do an advanced search with two conditions: attachment is pdf and attachment DOES NOT contain the character "a" (every indexed pdf would contain it, so only the unindexed ones should match).
Odd, that doesn't work the way I expected it to. Oh well.
"Indexing" : "status equals" : with options "indexed", "partially indexed" and "unindexed".
It's the best £10.99 I spent on an app! Scanning is just one thing it does. You can also import PDFs to process them, combine more than one PDF into a single document, or split a document out into several.
It's great if you scanned or photographed a book. You can rotate pages, deskew them, and crop them. If you scan two book pages at a time on one 'page', you can crop to two separate pages, and provided you lined them all up the same, you can do the whole book with one command.
And it has OCR in several languages.
Indexing is a great feature, so first: thanks Zotero for thinking of it.
However, it would be even better if it worked all the time. In my case I have 1643 indexed files and 422 unindexed. Reindexing them (even if I completely rebuild the index) does not change these numbers. There is no straightforward way of finding and fixing the 'rogue' files.
I thank mguelck for his imaginative solution. I tried it and got an error log; unfortunately it did not help in my case...
The first line says "[JavaScript Error: "Conflicts have suspended automatic syncing.", which is as unhelpful as it is worrying. What am I supposed to do about it?!
Then there are lots of messages at the top like "[JavaScript Error: "2010 What is postphenomenology.pdf was not indexed" {file: "chrome://zotero/content/xpcom/fulltext.js" line: 730}]. I can find the PDF, but what am I supposed to do with it?
The only *apparently* useful message was "[JavaScript Error: "Weikop - 2001 - Culture Wars ‘The Enemy Within’.pdf was not indexed -- PDFs with filenames containing extended characters cannot currently be indexed due to a Firefox limitation" {file: "chrome://zotero/content/xpcom/fulltext.js" line: 730}]". I could find the PDF file in question, change its name and reindex it. Cause: see BUG ALERT below!
The bad news is, after doing that it simply changed the error message to: "[JavaScript Error: "Weikop - 2001 - BOOK REVIEW Culture Wars.pdf was not indexed" {file: "chrome://zotero/content/xpcom/fulltext.js" line: 730}]"
Back to the log file... then there is a second section with a lot of messages like "(1)(+0000002): Invalid character set null / (2)(+0000000): File is not text in indexFile()". There is no indication at all of which file is involved or what the error means.
I pasted all the log file text into a WORD doc and searched on "Cache file doesn't exist!" as mguelck suggested, but it did not occur at all. :-(
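For anyone else with a log full of "was not indexed" messages, here is a rough sketch for pulling just the file names out of the pasted text. It assumes the messages look like the ones quoted above (`[JavaScript Error: "<name> was not indexed ...`); the function name `not_indexed_files` is invented for this example, and other Zotero versions may word the message differently.

```python
import re

# Matches the file name at the start of a "was not indexed" error message.
NOT_INDEXED = re.compile(r'\[JavaScript Error: "([^"\n]+?) was not indexed')


def not_indexed_files(log_text: str) -> list[str]:
    """List each file name reported as not indexed, once, in log order."""
    names = []
    for match in NOT_INDEXED.finditer(log_text):
        name = match.group(1)
        if name not in names:
            names.append(name)
    return names
```

This only tells you which files failed, not why, so it covers (a) but not (b).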
So, please Zotero, let's have a simple function to identify
(a) which files cannot be indexed and
(b) what we have to do to correct the problem.
>>> BUG ALERT. The problem with the extended characters in the filename was caused by Zotero itself. The extended characters in question (scrambled in the error message) are simply quotation marks in the original. They got into the filename because I used the Zotero function [right click on PDF attachment and] "Rename file from Parent Metadata". The title has legitimate quotes in it. A workaround is to remove them, rename the file, then put them back in the title. Tedious though! Zotero should either change the "Rename file from Parent Metadata" function to remove extended characters, or change the indexing app so that it does not mind extended characters.
Cheers,
Andy
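Since the post above ties at least one failure to extended characters in the filename, a quick way to spot candidates is to list PDFs whose names are not plain ASCII. This is a minimal sketch, not a Zotero feature: `pdfs_with_extended_chars` is a hypothetical helper, and the folder you pass in (for example your attachment storage folder) is an assumption about where your files live. As noted further down in the thread, the extended-characters part of the message may not be the real cause, so treat the result as a list of suspects only.

```python
from pathlib import Path


def pdfs_with_extended_chars(root: str) -> list[Path]:
    """Return PDFs under root whose file names contain non-ASCII characters."""
    hits = []
    for path in Path(root).rglob("*"):
        # Compare the suffix case-insensitively so ".PDF" is found too.
        if path.is_file() and path.suffix.lower() == ".pdf" and not path.name.isascii():
            hits.append(path)
    return sorted(hits)
```

Curly quotation marks such as ‘ and ’ count as non-ASCII here, so a file renamed from a title containing them would show up.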
Attachment Content --> does not contain --> . (that's a period; set the field to "regex" using the little arrow to its left)
Attachment File Type --> is --> PDF
The error message (which, bear in mind, is for developers, not for end users) may actually be misleading. The actual test in the code is whether the indexing cache file exists after indexing. There's then some code that (still) checks for extended characters and adds that last part to the message, but you're getting the same message without the extended characters part, which suggests that there's some other issue here for you.
Thanks for that workaround! But is there any way to run this on just a sublibrary? When I try this, I now only see my top-level libraries...
Sorry - I meant both collection and various levels of subcollections in Groups. See here - I'd like to be able to search on just the 'EBSCO' sub-sub-sub-sub-subcollection :) Which does not show up in the list. And yes, I know that we are not your average user, but I suspect many might like to see the hierarchy of their collections reflected in the 'Search in library' dialog box...
No work-arounds for this? Maybe with sqlite-tools?
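On the sqlite idea: a very tentative sketch follows. It assumes the Zotero database has tables named `itemAttachments`, `fulltextItems`, `collectionItems` and `collections` with the columns used below. That is my understanding of the schema, not something confirmed in this thread, so check it against your own database first. Only ever run this against a copy of zotero.sqlite, never the live file. The function name is made up for the example.

```python
import sqlite3

# Assumed schema (verify before use):
#   itemAttachments(itemID, parentItemID, contentType, path)
#   fulltextItems(itemID, ...)        one row per indexed attachment
#   collectionItems(collectionID, itemID)
#   collections(collectionID, collectionName)
QUERY = """
SELECT ia.itemID, ia.path
FROM itemAttachments AS ia
JOIN collectionItems AS ci
  ON ci.itemID = COALESCE(ia.parentItemID, ia.itemID)
JOIN collections AS c
  ON c.collectionID = ci.collectionID
LEFT JOIN fulltextItems AS ft
  ON ft.itemID = ia.itemID
WHERE c.collectionName = ?
  AND ia.contentType = 'application/pdf'
  AND ft.itemID IS NULL
ORDER BY ia.itemID
"""


def unindexed_pdfs_in_collection(conn: sqlite3.Connection, name: str) -> list[tuple]:
    """PDF attachments in the named collection that have no full-text row."""
    return conn.execute(QUERY, (name,)).fetchall()
```

With a copy of the database this would be called as `unindexed_pdfs_in_collection(sqlite3.connect("zotero-copy.sqlite"), "EBSCO")`, where the file name is hypothetical. It matches on the collection name only, so two subcollections with the same name would both be included, and partially indexed files are not reported.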
"
Attachment Content --> does not contain --> . (that's a period; set the field to "regex" using the little arrow to its left)
Attachment File Type --> is --> PDF
"
I don't think this still works in Zotero 5.0.55. When I search for the unindexed PDFs this way, I get the result shown in this figure:
http://imglf5.nosdn0.126.net/img/M3B0VGdGQVdIRGo5dzJBOEpBSExNUGtZWkVKc3cxYlh4RHVGZDBnSlBkeC9hRkNjY0tFSDF3PT0.png?imageView&thumbnail=500x0&quality=96&stripmeta=0
Please help me check it, thanks!
I have chosen regex, but this method still doesn't work, as shown in this figure:
http://imglf3.nosdn0.126.net/img/M3B0VGdGQVdIRGowRHcvRTRRSUtrNWExNjNhRnBiSklDRnEzTXNFVWx5MzdFVUdzK1FGbjBBPT0.png?imageView&thumbnail=1680x0&quality=96&stripmeta=0
Is there any settings I didn't click? Thanks!
Attachment Content -- does not contain -- %
(% is a wildcard in the phrase search). See if that works?
"
Attachment Content --> does not contain --> . (that's a period; set the field to "regex" using the little arrow to its left)
Attachment File Type --> is --> PDF
"
I think the earlier failure could be attributed to a missing ".zotero-ft-info" file. Thanks!
btw, the "Attachment Content -- does not contain -- %" method still doesn't work, just like this figure:
http://imglf4.nosdn0.126.net/img/M3B0VGdGQVdIRGowRHcvRTRRSUtrNDBjQ3V0T2c3YThZZ251a0JPRVYxcE1VNmpNcFZsclhBPT0.png?imageView&thumbnail=1705y1011&type=png&quality=96&stripmeta=0
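If a missing ".zotero-ft-info" file really is the tell-tale sign, the attachment folders could be checked directly on disk. The sketch below is only that, a sketch: it assumes each attachment sits in its own subfolder of a storage directory and that ".zotero-ft-info" is written next to the PDF once indexing has run. Neither assumption is confirmed in this thread, and `folders_missing_ft_info` is a made-up name.

```python
from pathlib import Path

# File name as mentioned in the post above.
INFO_FILE = ".zotero-ft-info"


def folders_missing_ft_info(storage_root: str) -> list[Path]:
    """Return subfolders that hold a PDF but no .zotero-ft-info file."""
    missing = []
    for folder in sorted(Path(storage_root).iterdir()):
        if not folder.is_dir():
            continue
        has_pdf = any(
            p.is_file() and p.suffix.lower() == ".pdf" for p in folder.iterdir()
        )
        if has_pdf and not (folder / INFO_FILE).exists():
            missing.append(folder)
    return missing
```

Each returned folder name could then be matched to its attachment in Zotero via "Show File" on the item.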