Finding unindexed items

I have about 170 unindexed items, out of the 2,000 I have in Zotero.

That's fine. I understand why they are unindexed, and in some cases I can fix the situation (e.g., by replacing doc files with PDFs, or using newer PDFs that contain text rather than images).

My question is: how, other than checking each item individually, do I identify the 170 items? I can't find a way to search for them ... so I'm wondering what I'm missing.
  • Yep, I would like to be able to easily find unindexed items, too. So far I've been doing a search for all PDF attachments and browsing through them to spot the unindexed ones, but that doesn't make sense with a large library.
  • I'd be interested in this functionality too -- finding unindexed pdfs in a large library.
  • I know it's a bit of a pain but here's a workaround:

    1. Enable "Debug Output Logging" (Preferences -> Advanced).
    2. Click Preferences -> Search -> Rebuild Index. Tell Zotero to only index unindexed items in the dialog that shows up.
    3. Disable "Debug Output Logging"
    4. Click the "View Output" button just below.
    5. Search for the entries with "Cache file doesn't exist!". In the line prior to those entries you will find the path to the PDF-file that could not be indexed enclosed in the first set of quotation marks.
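
Step 5 can also be done with a short script instead of searching by eye. This is a minimal Python sketch, assuming the debug output has been saved as plain text and follows the pattern described above (each marker line is preceded by a line whose first quoted string is the PDF path). The function name `unindexed_paths` and the sample log lines are hypothetical, not real Zotero output.

```python
import re
from typing import List

MARKER = "Cache file doesn't exist!"

def unindexed_paths(debug_output: str) -> List[str]:
    """Return the first quoted string on the line before each marker line."""
    lines = debug_output.splitlines()
    paths = []
    for i, line in enumerate(lines):
        if MARKER in line and i > 0:
            # First pair of double quotes on the previous line.
            match = re.search(r'"([^"]*)"', lines[i - 1])
            if match:
                paths.append(match.group(1))
    return paths

# Hypothetical log excerpt, only to show the expected shape of the input.
sample = (
    'Running indexer on "/data/storage/ABCD1234/scan.pdf" "/tmp/out"\n'
    "Cache file doesn't exist!\n"
    "Some unrelated line\n"
)
print(unindexed_paths(sample))
```

If the real log format differs (for example, the path is not the first quoted string), the regular expression needs adjusting.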
  • Great job, but really, if our workarounds have to look like this it is quite clear that we need a real solution badly.
  • I just opened a ticket to add "is indexed" and "unindexed" as options for Attachment Content in Advanced Search. This would let you run searches, or even create a saved search, to track unindexed items.
  • edited August 27, 2009
    Yeah, agree this would be a great addition.

    Just as a tip: when I find unindexed items, Google often has OCR copies of the PDFs, and I take HTML snapshots of those.

    If you have PDFs that you scanned yourself, you can even co-opt Google to OCR your stuff for you!

    Any other tips on indexing welcome!
  • Adobe Acrobat Professional has a relatively good OCR engine (supporting several languages). I always use it to do a quick OCR job of stuff I've scanned myself.
  • I wanted to try the solution mguelck posted, but I couldn't find a "Debug Output Logging" button or a "View Output" button in the Zotero settings. I think I found the right setting in about:config for the debug logging, but couldn't find anything that looks like "View Output". Are those buttons only in the Zotero 2.0 version or am I missing something?

    Would be great if you could point me in the right direction, I have nearly 1,000 files and 170 of them are unindexed :-(
  • edited September 10, 2009
    Are those buttons only in the zotero 2.0 version
    Yes. In 1.0 you need to use the manual method:

    http://www.zotero.org/support/debug_output

    Note that this is very slow on Windows.
  • I don't know if this has been resolved yet (this is a very old thread by now), but if not, this seemed to do the trick for me on PDFs:
    - Do an advanced search with two conditions: attachment is PDF, and attachment contains the character "a" (nearly every indexed PDF will contain it).
    - Tag the matching items as "indexed" (the PDFs, not the parent items).
    - Do another advanced search for attachment is pdf and tag is not "indexed."
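
In effect, the three steps compute a set difference: all PDFs minus the PDFs whose indexed text contains "a". This is a minimal Python sketch of that logic only. It does not talk to Zotero, and the function name and file names are made up for illustration.

```python
def still_untagged(all_pdfs, pdfs_containing_a):
    """Return the PDFs that never received the "indexed" tag."""
    tagged = set(pdfs_containing_a)  # steps 1 and 2: search, then tag the hits
    # Step 3: every PDF that does not carry the tag.
    return [pdf for pdf in all_pdfs if pdf not in tagged]

# Hypothetical library contents.
all_pdfs = ["article.pdf", "scanned-book.pdf", "report.pdf"]
hits = ["article.pdf", "report.pdf"]
print(still_untagged(all_pdfs, hits))
```

One caveat of the approach: an indexed PDF whose text happens to contain no "a" would also end up in the result.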
  • Hmmm ... a bit too complex for me.

    How about: do an advanced search with two conditions, attachment is PDF and attachment DOES NOT contain the character "a", so that only the unindexed PDFs are left?

    Odd, that doesn't work the way I expected it to. Oh well.
  • Hi, I've just begun using the indexing feature of Zotero. It would be great to add the following advanced search feature:

    "Indexing" : "status equals" : with options "indexed", "partially indexed" and "unindexed".
  • I still think this would be a very useful feature. :)
  • I agree and note that it is somewhat tantalizing to know that Zotero already has to be aware of this internally — the info just needs to be exposed to the advanced search...
  • Does anyone know if it would be possible for this feature to be added? I just recently discovered Zotero, and I agree that this would be a helpful feature.
  • Re OCR for PDF files: I always use PDFScanner on the Mac. http://www.pdfscannerapp.com
    It's the best £10.99 I spent on an app! Scanning is just one thing it does. You can also import PDFs to process them, combine more than one PDF into a single document, or split a document into several.
    It's great if you scanned or photographed a book. You can rotate pages, deskew them, and crop them. If you scan two book pages at a time on one 'page', you can crop to two separate pages, and provided you lined them all up the same, you can do the whole book with one command.
    And it has OCR in several languages.
  • On finding out which files are not indexed:

    Indexing is a great feature, so first: thanks Zotero for thinking of it.
    However, it would be even better if it worked all the time. In my case I have 1643 indexed files and 422 unindexed. Reindexing them (even if I completely rebuild the index) does not change these numbers. There is no straightforward way of finding and fixing the 'rogue' files.

    I thank mguelck for his imaginative solution. I tried it and got an error log; unfortunately it did not help in my case...

    The first line says "[JavaScript Error: "Conflicts have suspended automatic syncing.", which is as unhelpful as it is worrying. What am I supposed to do about it?!

    Then there are lots of messages at the top like "[JavaScript Error: "2010 What is postphenomenology.pdf was not indexed" {file: "chrome://zotero/content/xpcom/fulltext.js" line: 730}]". I can find the PDF, but what am I supposed to do with it?

    The only *apparently* useful message was "[JavaScript Error: "Weikop - 2001 - Culture Wars ‘The Enemy Within’.pdf was not indexed -- PDFs with filenames containing extended characters cannot currently be indexed due to a Firefox limitation" {file: "chrome://zotero/content/xpcom/fulltext.js" line: 730}]". I could find the PDF file in question, change its name and reindex it. Cause: see BUG ALERT below!
    The bad news is, after doing that it simply changed the error message to: "[JavaScript Error: "Weikop - 2001 - BOOK REVIEW Culture Wars.pdf was not indexed" {file: "chrome://zotero/content/xpcom/fulltext.js" line: 730}]"

    Back to the log file... then there is a second section with a lot of messages like "(1)(+0000002): Invalid character set null / (2)(+0000000): File is not text in indexFile()". There is no indication at all what file is involved or what the error means.

    I pasted all the log file text into a Word document and searched for "Cache file doesn't exist!" as mguelck suggested, but it did not occur at all. :-(

    So, please Zotero, let's have a simple function to identify
    (a) which files cannot be indexed and
    (b) what we have to do to correct the problem.


    >>> BUG ALERT. The problem with the extended characters in the filename was caused by Zotero itself. The extended characters in question (scrambled in the error message) are simply quotation marks in the original. They got into the filename because I used the Zotero function [right click on PDF attachment and] "Rename file from Parent Metadata". The title has legitimate quotes in it. A workaround is to remove them, rename the file, then put them back in the title. Tedious though! Zotero should either change the "Rename file from Parent Metadata" function to remove extended characters, or change the indexing app so that it does not mind extended characters.


    Cheers,
    Andy
  • No statement on the need to better handle indexing error messages, but you should be able to find unindexed PDFs much more easily using a saved search created with

    Attachment Content --> does not contain --> . (that's a period; set the field to "regex" using the little arrow to its left)
    Attachment File Type --> is --> PDF
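
The reason this works: as a regular expression, a period matches any single character, so any attachment with at least some indexed text "contains" it, and "does not contain" leaves only the attachments with no indexed text. This is a minimal Python sketch of that logic, not of how Zotero evaluates the search internally. The dictionary of attachment texts and the function name are invented for illustration.

```python
import re

# "." matches any single character (except a newline, by default in Python).
ANY_CHARACTER = re.compile(".")

def unindexed(attachment_text):
    """Return names of attachments whose indexed text contains no character."""
    return [
        name
        for name, text in attachment_text.items()
        if ANY_CHARACTER.search(text or "") is None
    ]

# Hypothetical data: file name -> indexed text (None when nothing was indexed).
library = {
    "scanned-book.pdf": None,
    "article.pdf": "Postphenomenology is ...",
    "empty.pdf": "",
}
print(unindexed(library))
```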
  • Unrelated to indexing, but:
    The first line says "[JavaScript Error: "Conflicts have suspended automatic syncing.", which is as unhelpful as it is worrying. What am I supposed to do about it?!
    It says what to do right after that: "Click the sync icon to resolve them."
  • edited June 15, 2016
    As for the indexing, indexing of filenames with extended characters was something we fixed in 2011, and it works for me for a filename with typographic quotes on OS X. Can you provide exact steps to reproduce? (In other words, download an item from this place, run "Rename File from Parent Metadata", etc.)

    The error message (which, bear in mind, is for developers, not for end users) may actually be misleading. The actual test in the code is whether the indexing cache file exists after indexing. There's then some code that (still) checks for extended characters and adds that last part to the message, but you're getting the same message without the extended characters part, which suggests that there's some other issue here for you.
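
Since the check is whether a cache file exists after indexing, one way to look for affected files outside Zotero is to scan the storage folder for PDFs that have no cache file next to them. This is a rough Python sketch, assuming each attachment lives in its own subfolder of the storage directory and that the cache file is named `.zotero-ft-cache`. Both are assumptions about the on-disk layout, so check your own data directory before relying on it. The function name is made up.

```python
from pathlib import Path

# Assumption: name of the full-text cache file stored beside each attachment.
CACHE_NAME = ".zotero-ft-cache"

def pdfs_without_cache(storage_dir):
    """List PDFs one level below storage_dir that have no cache file beside them."""
    missing = []
    # Note: the "*.pdf" pattern is case-sensitive on most non-Windows systems.
    for pdf in sorted(Path(storage_dir).glob("*/*.pdf")):
        if not (pdf.parent / CACHE_NAME).exists():
            missing.append(pdf)
    return missing
```

Calling `pdfs_without_cache` with the path to your storage folder returns a list of `Path` objects to inspect. It only reads the file system and changes nothing.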
  • you should be able to find unindexed PDFs much more easily using a saved search created with

    Attachment Content --> does not contain --> . (that's a period; set the field to "regex" using the little arrow to its left)
    Attachment File Type --> is --> PDF

    Thanks for that workaround! But is there any way to run this on just a sublibrary? When I try this, I now only see my top-level libraries... 

  • what do you mean by "sublibrary"? Collection? Group? You can include a collection in the search and select the group at the top of the advanced search.
  • edited June 3, 2018

    Sorry - I meant both collection and various levels of subcollections in Groups. See here - I'd like to be able to search on just the 'EBSCO' sub-sub-sub-sub-subcollection :) Which does not show up in the list. And yes, I know that we are not your average user, but I suspect many might like to see the hierarchy of their collections reflected in the 'Search in library' dialog box... 

    No work-arounds for this? Maybe with sqlite-tools? 

  • Select the top-level library & then match on collection.
  • Brilliant! Thanks much...
  • edited August 30, 2018
    "
    Attachment Content --> does not contain --> . (that's a period; set the field to "regex" using the little arrow to its left)
    Attachment File Type --> is --> PDF
    "
    I don't think this still works in Zotero 5.0.55. I'm searching for unindexed PDFs as shown in this screenshot:

    http://imglf5.nosdn0.126.net/img/M3B0VGdGQVdIRGo5dzJBOEpBSExNUGtZWkVKc3cxYlh4RHVGZDBnSlBkeC9hRkNjY0tFSDF3PT0.png?imageView&thumbnail=500x0&quality=96&stripmeta=0

    Please help me check it, thanks!
  • Works for me. Did you set the search to regex?
  • @adamsmith Thanks for your reply!

    I have chosen regex, but this method still doesn't work, as shown in this screenshot:

    http://imglf3.nosdn0.126.net/img/M3B0VGdGQVdIRGowRHcvRTRRSUtrNWExNjNhRnBiSklDRnEzTXNFVWx5MzdFVUdzK1FGbjBBPT0.png?imageView&thumbnail=1680x0&quality=96&stripmeta=0

    Is there a setting I missed? Thanks!
  • odd. Try instead setting it to "Phrase" and use

    Attachment Content -- does not contain -- %

    (% is a wildcard in the phrase search). See if that works?
  • I have rebuilt the index of the PDFs and now this method works
    "
    Attachment Content --> does not contain --> . (that's a period; set the field to "regex" using the little arrow to its left)
    Attachment File Type --> is --> PDF
    "
    I think the earlier failure could be attributed to a missing ".zotero-ft-info" file. Thanks!

    By the way, the "Attachment Content -- does not contain -- %" method still doesn't work, as shown in this screenshot:

    http://imglf4.nosdn0.126.net/img/M3B0VGdGQVdIRGowRHcvRTRRSUtrNDBjQ3V0T2c3YThZZ251a0JPRVYxcE1VNmpNcFZsclhBPT0.png?imageView&thumbnail=1705y1011&type=png&quality=96&stripmeta=0