Search index is lost

jlaehne · February 19, 2020

I had my whole library (approx. 10k titles across several groups) indexed for full text search - with <100 unindexed papers remaining, probably all scanned PDFs. Now I checked my Index Statistics again, and more than 90% of the files are Unindexed, while only an odd number of 752 files is indexed.

I already had the same problem last week and rebuilt the index, which held for several days but was then lost again. Any idea where such a bug could come from?

dstillman · February 19, 2020

One possibility would be that you're somehow starting afresh with a new database and pulling down items via sync, though you'd presumably notice that.

Another possibility is that you used ZotFile to move the attachment items, which I believe actually deletes and recreates the items, which would clear the index entries.

There's not really any code in Zotero to clear full-text index entries for existing items other than the code that runs when you clear the index manually.

jlaehne · February 19, 2020

Well, after I tried to index a small remainder on my home computer, it actually started syncing a lot and the index is depopulating. So it looks like #1, but why is it suddenly rebuilding the database?

ZotFile can be ruled out, as it also happened to a colleague who does not have ZotFile installed (I have it installed but am using the Zotero storage now).

dstillman · February 19, 2020

Well, #1 was specifically somehow ending up with a new, empty database (say, from changing your Zotero data directory setting without moving your data) and then syncing to pull down data from the online library, not just ongoing syncing.

But you're saying that you did something on another computer and then just synced your existing library on this computer and the index shrank? What exactly did you do on the other computer? If you can reproduce this, can you provide a Debug ID for a sync where the index shrinks in size?

jlaehne · February 19, 2020

Well, I will try to catch a similar situation and produce a Debug ID.

I switched to another computer just because I went home. I then checked the search index and it looked good. Anyway, I chose to rebuild the index with the option to index only unindexed files. At some point Zotero then started a large automatic sync, for which I do not see the motivation. Afterwards, I had 7k fewer files indexed. I checked the storage directory: a large number of folders, but only about 2k, had been changed recently. Also, a backup of the zotero.sqlite database was created, and the new one was only half as big as the original.

jlaehne · February 21, 2020

I finally managed to catch part of a sync where the index is shrinking. The Debug ID is D1498775293
And another more extensive one, which also catches the end of that sync: D1317263503

No idea what triggered this extensive full-text sync, as only a handful of files should have been changed in the meantime.

It is quite annoying, I basically rebuild the index every day and sometime later it is largely depopulated again. A colleague of mine came to me with the same problem.

jlaehne · February 28, 2020

It does not happen any more since I gave up building the search index both on my office and home PCs. Now I have it only in the office and it is stable.

Could it be that building the search index on a second PC changes the properties of the PDF in some way so that the first PC thinks the PDF has changed and thus throws it out of the index?

dstillman · February 29, 2020

Yes, exactly that. If you reindex items on one computer, as far as Zotero is concerned that's new full-text content that hasn't yet been synced to other computers, so it will mark the items on other computers as needing to be indexed ("Queued", in the right-hand pane) until the new content is processed in the background, and in the meantime the items won't show up in full-text searches. I guess technically it'd be better if they were simply marked as needing to be reindexed without throwing out the old content, but this mostly doesn't come up.

Why exactly are you reindexing things at all? Generally speaking, you shouldn't need to use any of the manual indexing/reindexing functionality — this all should just work automatically, indexing things immediately on the computer where you added them and on idle after syncing on other computers. The main reason to use manual indexing would be if you OCRed items yourself after adding them to Zotero, but in that case there wouldn't be content on other computers to begin with, so the current behavior wouldn't be a problem.

jlaehne · March 1, 2020

I reindexed the library because it was far from completely indexed (less than 10%), and full-text search does not make much sense in that state. Then I went to the other computer and also rebuilt the index (for unindexed items only) to have a complete search index there as well. But that killed the index on the first one again ...

I only started that because I had the feeling that the background indexing was not filling up the library, but maybe I was just not patient enough. So if I understand correctly, background indexing does not reset the full-text status, while forced indexing does? And I just need to have Zotero running for a very long time to get the index completed? I see now that the index builds up very, very slowly in the background; it was much, much faster with the forced indexing of unindexed items. For my laptop it means that I will probably have to run it for a week straight to get the 10,000 items indexed.

dstillman · March 1, 2020

"Background indexing" (I'll use your terms) is simply processing the full-text content that was synced from another computer — i.e., adding that content to the local index. Before those items are processed, they'll show as "Queued" in the right-hand pane. This should happen pretty quickly, but only when the computer is idle and Zotero is open. In normal usage where you're using multiple computers, you mostly shouldn't notice it. If you add a few attachments on one computer and then go to another, they'll be processed the first time the computer is idle for more than 30 seconds after the new items are synced down.

"Forced indexing" would be re-processing the attachment file itself — i.e., extracting its text and adding it to the index as new data, which will then sync elsewhere. It doesn't use the queued content from the sync — it replaces it by reprocessing the local file.

I strongly suspect all the problems you're having are just from trying to manually intervene here (which is one reason I've wanted to de-emphasize/hide/remove the index stats and manual index functions). If you just let it do its thing, it should pretty much just work.
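
As a rough illustration of the two paths described in this post, here is a toy Python model. It is not Zotero's code: the `Computer` class, its method names, and the attachment key `ABCD1234` are all hypothetical, and the model only mirrors the behavior described in this thread (a forced reindex produces new content that syncs, and a receiving computer shows the item as "Queued" and unsearchable until it processes the queue while idle).

```python
class Computer:
    """Toy model of one synced install. All names here are hypothetical."""

    def __init__(self, files):
        self.files = dict(files)  # attachment key -> text extractable from the local file
        self.index = {}           # attachment key -> text currently searchable
        self.queue = {}           # attachment key -> synced text awaiting idle processing
        self.outbox = {}          # attachment key -> new content to upload on next sync

    def force_reindex(self, key):
        """'Forced indexing': re-extract text from the local file as new data."""
        text = self.files[key]
        self.index[key] = text
        self.outbox[key] = text  # treated as new content that must sync elsewhere

    def sync_to(self, other):
        """Deliver new content; the receiver queues it and drops its old entry."""
        for key, text in self.outbox.items():
            other.index.pop(key, None)
            other.queue[key] = text
        self.outbox.clear()

    def process_queue_when_idle(self):
        """'Background indexing': add synced content to the local index."""
        self.index.update(self.queue)
        self.queue.clear()

    def status(self, key):
        if key in self.queue:
            return "Queued"
        return "Indexed" if key in self.index else "Unindexed"

    def search(self, word):
        return sorted(k for k, text in self.index.items() if word in text)


office = Computer({"ABCD1234": "notes on full-text search"})
home = Computer({"ABCD1234": "notes on full-text search"})

office.force_reindex("ABCD1234")   # office index is complete
home.force_reindex("ABCD1234")     # same file reindexed at home
home.sync_to(office)               # office now sees "new" content

print(office.status("ABCD1234"), office.search("search"))
office.process_queue_when_idle()
print(office.status("ABCD1234"), office.search("search"))
```

In this model the office index loses the item between the sync and the next idle pass, which matches the shrinking index reported earlier in the thread.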

dstillman · March 1, 2020

(That said, there might be some things we can do to make the background processing happen faster, or perhaps to allow searches to work before indexing, even if they're a bit slower. While I think the current behavior works for most people if they just ignore it, it's obviously not ideal if you download a lot of new data (say, while setting up a new synced computer) and then try to find things right away.)

jlaehne · March 1, 2020

Well, I guess it all started because we migrated from Mendeley with several large group databases. So that put several thousand items in the queue. Then a colleague came to me and was very unhappy that the full text search was not functioning as desired and we realized the index was not complete and tried to speed up the process.

If you hid or removed the index statistics, we would simply have filed an issue saying that full-text search is not functioning. So maybe one could change the wording in the statistics to explicitly state the number of items 'in queue for background indexing' or something similar.

jlaehne · March 1, 2020

I would suggest two separate items in the statistics:

'In queue':
'Unindexed': (only for the ones like non-OCR PDFs that cannot be indexed)

jlaehne · March 4, 2020

After waiting a few days, I realized that the indexing stalls after a while. On my laptop, I have about half of my library indexed, with the process not proceeding any more. New additions are indexed, but Zotero is not indexing the remainder of the existing library. If I choose to reindex, it will build up the rest, but as discussed above, the index on my office PC will then be lost.

For a colleague it is even worse, he has less than 10% of the library indexed and Zotero won't index the rest.

dstillman · March 4, 2020

Can you provide a Debug ID for leaving your computer idle for several minutes, with Zotero open?

And if you look at an attachment that's not indexed, what does it say for the index status in the right-hand pane? Can you provide the 8-character folder name from Show File for an attachment that's not indexed?

jlaehne · March 5, 2020

Well, the weird thing is that once I start the output logging, it does index! Here are two Debug IDs where it proceeded with indexing:
D1106067759
D108860634

But between them, I left the computer idle for the whole night and nothing happened.

And here are some folder names from `queued` attachments:
NN8C4PLP, 9496UYPW
And here from some attachments with status `unknown`:
QVEVMHIB, T672LXH4

dstillman · March 5, 2020

Debug output logging wouldn’t have any effect on indexing. It’s not inconceivable there could be some difference between Zotero being in the background vs. not, though that shouldn’t matter.

jlaehne · March 5, 2020

Could it be that the idle criteria are too strict, e.g. that any possible other background process keeps it from running?

dstillman · March 5, 2020

We just use the idle detection functionality available in the (Mozilla) framework we use. It's not something we control directly.

It's certainly possible that something on your system is preventing it from thinking the system is idle, but there's a good chance that's at the system level rather than the framework. Idle detection is just one of those things in computing that tends not to be very reliable, since it depends on all sorts of signals (input, audio/video playback, network activity, automated processes) that can sometimes misfire.
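
To make the stalling scenario concrete, here is a small Python sketch of idle-gated queue processing. It is only an assumption-laden illustration, not the framework's idle API: `drain_queue` and its parameters are hypothetical, and the only detail taken from this thread is the 30-second idle threshold mentioned earlier. It shows that if the reported idle time never exceeds the threshold, queued items are never processed.

```python
IDLE_THRESHOLD_SECONDS = 30  # from the thread: "idle for more than 30 seconds"


def drain_queue(queue, idle_readings, threshold=IDLE_THRESHOLD_SECONDS):
    """Process one queued item per idle reading that exceeds the threshold.

    queue: attachment keys waiting to be indexed (hypothetical keys).
    idle_readings: successive idle durations in seconds, as some idle
        detector might report them.
    Returns (processed, still_pending).
    """
    pending = list(queue)
    processed = []
    for idle_seconds in idle_readings:
        if not pending:
            break
        if idle_seconds > threshold:
            processed.append(pending.pop(0))
    return processed, pending


# A system that is regularly reported as idle makes progress:
print(drain_queue(["A", "B", "C"], [5, 40, 10, 45]))

# A system where something keeps resetting the idle timer never does:
print(drain_queue(["A", "B", "C"], [5, 12, 29, 30]))
```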

dstillman · March 5, 2020

As I say, though, we might be able to do more of the processing faster or to make searches work before everything is indexed. I've created a ticket to track that.

jlaehne · March 5, 2020

Here I got a Debug ID where the computer is idle and Zotero seems to detect that, but says there is nothing to index (even though I still have over 2000 unindexed items): D866971299

And these are some of the items, which still have 'unknown' indexing status: DIWXAP75, N64EGMWQ, BRSWJ77K, PRYYJ36I, TX6KZYAJ

dstillman · March 5, 2020

The stats aren't correct — you should ignore them. I'm not sure what causes the Unknown, and I'll look into that, but it doesn't mean the files aren't indexed.

If you search for words or phrases from those PDFs, as long as they contain extractable text, they should be found.

dstillman · March 5, 2020

Actually, I think that happens (at least in the case you're seeing) specifically when an item is indexed locally and is reindexed on another computer without the content changing. So that's a bug, and we'll fix it, but it's happening specifically because you've been reindexing, and it doesn't mean the item isn't indexed. The opposite, in fact.

dstillman · March 9, 2020

The latest Zotero beta should fix a number of issues here. The state will no longer be reset to "Unknown" if you reindex the same PDF elsewhere, and both manual reindexing from the info pane and "Index Unindexed Items" will now use the synced full-text content if available rather than reindexing the file to avoid triggering additional full-text content syncing and indexing back and forth. (If you rebuild the index completely, it will go back to the original files.)

I still recommend mostly not worrying about this, but "Index Unindexed Items" should now be a way to force immediate indexing of any full-text content synced from other computers.
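
The changed behavior described in this post can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the actual Zotero implementation: the function names, the dictionaries, and the returned source labels are all hypothetical. It only encodes the two rules given above: indexing an unindexed item prefers synced full-text content when it is available, while a complete rebuild goes back to the original files.

```python
def index_unindexed(key, local_files, synced_content, index):
    """Index one item, preferring synced content over re-extracting the file.

    Returns "synced" or "file" to show which source was used (hypothetical
    labels, for illustration only).
    """
    if key in synced_content:
        # Use content already synced from another computer, so no new
        # full-text content is produced that would have to sync back.
        index[key] = synced_content[key]
        return "synced"
    # Nothing synced for this item: fall back to the local attachment file.
    index[key] = local_files[key]
    return "file"


def rebuild_index(local_files, index):
    """Complete rebuild: discard the index and reprocess the original files."""
    index.clear()
    index.update(local_files)


index = {}
files = {"ABCD1234": "text from local file", "EFGH5678": "text from OCRed file"}
synced = {"ABCD1234": "text synced from other computer"}

print(index_unindexed("ABCD1234", files, synced, index))
print(index_unindexed("EFGH5678", files, synced, index))
```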