Search index is lost
I had my whole library (approx. 10k titles across several groups) indexed for full-text search, with fewer than 100 unindexed papers remaining, probably all scanned PDFs. Now I checked my Index Statistics again, and more than 90% of the files are Unindexed, while only 752 files (an oddly specific number) are indexed.
The same thing happened last week: I rebuilt the index, which held for several days but was then lost again. Any idea where such a bug could come from?
Another possibility is that you used ZotFile to move the attachment items, which I believe actually deletes and recreates the items, which would clear the index entries.
There's not really any code in Zotero to clear full-text index entries for existing items other than the code that runs when you clear the index manually.
ZotFile can be ruled out as it also happened to a colleague who does not have ZotFile installed (I have it installed but am using the zotero storage now).
But you're saying that you did something on another computer and then just synced your existing library on this computer and the index shrunk? What exactly did you do on the other computer? If you can reproduce this, can you provide a Debug ID for a sync where the index shrinks in size?
I switched to another computer, just because I went home. Then I checked the search index and it looked good. Anyway, I chose to rebuild the index with the option to index only unindexed files. At some point it then ran a large automatic sync, for which I do not see the reason. Afterwards, I had 7k fewer files indexed. I checked the storage directory: a large number of folders, but only 2k, had been changed recently. Also, a backup of the zotero.sqlite database was created, and the new database was only half as big as the original one.
Here is the Debug ID: D1498775293
And another more extensive one, which also catches the end of that sync: D1317263503
No idea what triggered this extensive full-text sync, as only a handful of files should have changed in the meantime.
It is quite annoying, I basically rebuild the index every day and sometime later it is largely depopulated again. A colleague of mine came to me with the same problem.
Could it be that building the search index on a second PC changes the properties of the PDF in some way so that the first PC thinks the PDF has changed and thus throws it out of the index?
Why exactly are you reindexing things at all? Generally speaking, you shouldn't need to use any of the manual indexing/reindexing functionality — this all should just work automatically, indexing things immediately on the computer where you added them and on idle after syncing on other computers. The main reason to use manual indexing would be if you OCRed items yourself after adding them to Zotero, but in that case there wouldn't be content on other computers to begin with, so the current behavior wouldn't be a problem.
I only started doing that because I had the feeling that the background indexing was not filling up the library, but maybe I was just not patient enough. So if I understand correctly, background indexing does not reset the full-text status, while forced indexing does? And I just need to have Zotero running for a very long time to get the index completed? I see now that the index builds up very, very slowly in the background. It was much, much faster with the forced indexing of unindexed items. For my laptop it means that I will probably have to run it for a week straight to get the 10,000 items indexed.
"Forced indexing" would be re-processing the attachment file itself — i.e., extracting its text and adding it to the index as new data, which will then sync elsewhere. It doesn't use the queued content from the sync — it replaces it by reprocessing the local file.
I strongly suspect all the problems you're having are just from trying to manually intervene here (which is one reason I've wanted to de-emphasize/hide/remove the index stats and manual index functions). If you just let it do its thing, it should pretty much just work.
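To make the distinction between queued content from a sync and "forced indexing" a bit more concrete, here is a rough toy model in Python. It is only a sketch of the behavior described above: `ToyIndex`, `extract_text`, `outbox`, and the item keys are all hypothetical names made up for illustration and do not mirror Zotero's actual code or database.

```python
# Toy model only: every name here is hypothetical and does not mirror Zotero's internals.

def extract_text(path, files):
    """Stand-in for extracting text from a local attachment file."""
    return files[path]


class ToyIndex:
    def __init__(self):
        self.indexed = {}  # item key -> text currently in the local index
        self.queued = {}   # item key -> content received via sync, not yet processed
        self.outbox = {}   # item key -> content that would be uploaded on the next sync

    def receive_from_sync(self, key, text):
        # Content indexed on another computer arrives and waits in the queue.
        self.queued[key] = text

    def process_queue_on_idle(self):
        # Background indexing: move queued content into the index.
        # Nothing is re-extracted and nothing new is marked for upload.
        for key in list(self.queued):
            self.indexed[key] = self.queued.pop(key)

    def force_reindex(self, key, path, files):
        # Forced indexing: ignore any queued content, re-extract the text
        # from the local file, and treat the result as new data to sync.
        self.queued.pop(key, None)
        text = extract_text(path, files)
        self.indexed[key] = text
        self.outbox[key] = text
```

In this model, letting the queue drain on idle leaves `outbox` empty, while every forced reindex adds an entry to it, which is the kind of extra sync traffic manual reindexing would be expected to cause.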
If you had hidden/removed the index statistics, we would have filed an issue saying that full-text search is not working. So maybe one could change the wording in the statistics to explicitly state the number of items 'in queue for background indexing' or something similar.
'In queue':
'Unindexed': (only for the ones like non-OCR PDFs that cannot be indexed)
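As a rough illustration of what I mean, the statistics could be bucketed something like this. This is only a sketch: the status strings `indexed`, `queued`, and `unindexed` and the function `summarize` are my own assumptions for the example, not Zotero's actual internals.

```python
from collections import Counter


def summarize(statuses):
    """Count attachments per proposed statistics label.

    `statuses` is a list of per-attachment status strings; the values
    'indexed', 'queued', and 'unindexed' are assumed for this example.
    """
    counts = Counter(statuses)
    return {
        "Indexed": counts["indexed"],
        "In queue": counts["queued"],      # waiting for background indexing
        "Unindexed": counts["unindexed"],  # e.g. non-OCR PDFs that cannot be indexed
    }
```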
For a colleague it is even worse: he has less than 10% of the library indexed, and Zotero won't index the rest.
And if you look at an attachment that's not indexed, what does it say for the index status in the right-hand pane? Can you provide the 8-character folder name from Show File for an attachment that's not indexed?
D1106067759
D108860634
But between them, I left the computer idle for the whole night and nothing happened.
And here are some folder names from `queued` attachments:
NN8C4PLP, 9496UYPW
And here from some attachments with status `unknown`:
QVEVMHIB, T672LXH4
It's certainly possible that something on your system is preventing it from thinking the system is idle, but there's a good chance that's at the system level rather than the framework. Idle detection is just one of those things in computing that tends not to be very reliable, since it depends on all sorts of signals (input, audio/video playback, network activity, automated processes) that can sometimes misfire.
And these are some of the items that still have 'unknown' indexing status: DIWXAP75, N64EGMWQ, BRSWJ77K, PRYYJ36I, TX6KZYAJ
If you search for words or phrases from those PDFs, as long as they contain extractable text, they should be found.
I still recommend mostly not worrying about this, but "Index Unindexed Items" should now be a way to force immediate indexing of any full-text content synced from other computers.