(auto-) index

migugg · February 10, 2023

Hi,
I am confused about Zotero's indexing features and about how the bwiernik plugin called "auto-index" relates to Zotero's built-in indexing.
I recently realized that in the "Search" tab of my settings, around half of my items appear as not indexed. I began to search on Zotero and found a) the mentioned plugin, but b) a forum item https://forums.zotero.org/discussion/comment/239221#Comment_239221 where a user is advised to uninstall said plugin to make indexing work.
Otherwise, there is no documentation of what this plugin is needed for. So I am confused as to:
a) why are my items only partially indexed
b) would the plugin solve the issue or rather the opposite?
thanks for help

migugg · February 10, 2023

I should also add that it adds to the confusion that the Zotero support page https://www.zotero.org/support/searching says that indexing is automatic, while the plugin page https://www.zotero.org/support/plugins says that auto-index is "A Zotero extension which keeps the full-text index updated." But if indexing is already working in the background, why does anyone need the extension? At a minimum, I suggest that if the extension merely duplicates functionality of Zotero, it be removed from the official plugin page. Or, if it adds functionality, that both the search support page and the plugin page clarify what the plugin does that Zotero does not already do.
Also confusing: on the search support page mentioned above, there is mention that items can only be indexed if they contain searchable text, which obviously makes sense. But the settings do not make clear whether the "unindexed" items are or include those that cannot be indexed because they simply do not contain any searchable text.
In my specific case, I thought I might figure this out from the count, but the count does not add up: my library contains 11000ish items, of which my settings count says 3700ish are indexed, 4600 unindexed and 234 partially indexed. This means that around 2000 are missing from the count altogether. Is the answer that the 2000 missing are the ones that do not contain searchable text, that 3700 could be indexed and that for 4600 the indexing failed? Or is it that 3700 could be indexed, 4600 do not contain searchable text and 2000ish failed? Or something else?
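
To make the arithmetic explicit, here is a minimal sketch using the rounded figures above (the variable names are illustrative only and the numbers are approximate):

```python
# Rounded figures from the post above; variable names are illustrative only.
total_items = 11_000   # "11000ish" items in the library
indexed = 3_700        # "3700ish" indexed
unindexed = 4_600      # unindexed
partial = 234          # partially indexed

counted = indexed + unindexed + partial
missing = total_items - counted

print(counted)  # items that appear in the Search tab statistics
print(missing)  # items that appear in no category at all
```

With these rounded inputs the three categories cover 8534 items, which leaves roughly 2500 items unaccounted for.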

migugg · February 10, 2023

I have now tried to rebuild the index by selecting "index unindexed items". This worked well until it stopped at roughly 7000 indexed, 1000 unindexed and 400 partially indexed. The total has not changed, which leaves me to assume that the 2000ish missing from the count are the ones that do not contain searchable text. But I receive no alert as to why the remaining 1000 unindexed items cannot be indexed. Repeatedly clicking "index unindexed items" changes nothing, and closing and restarting Zotero does not help either.

Another issue: It is not clear to me:
a) why are the max characters per item and the max pages set so low? Is there any reason for this? Clearly 100 pages is less than most books, so for people with books as PDFs this does not make sense. Given that most people will never see this setting, this seems problematic.
b) the max character per item (500 000=200 pages) and the max pages (100) default seem to be very different: Why is this? And which one has precedence? I.e. if a book is say 450 000 characters on 160 pages, will it stop indexing at 100 pages? Or not index at all? Or is this the reason why items appear as "partially indexed"?
Given the above, would it not make sense first of all to explain in more detail how this works and second, and more importantly, to set the max values higher and so that they roughly match each other, say max characters at 1 million and max pages at 400? Or is there any reason for these low numbers?
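
To show why the precedence question matters, here is a rough sketch of the example book under one possible rule, namely that indexing stops at whichever limit is reached first. This is a hypothetical model, not a description of what Zotero actually does, and it assumes the text is spread evenly across pages. The function name and parameters are made up for illustration.

```python
def chars_indexed(total_chars, total_pages, max_chars, max_pages):
    """Characters covered if indexing stops at whichever limit is hit first.

    Hypothetical model only: assumes text is spread evenly across pages
    and that both limits apply to the same file.
    """
    chars_per_page = total_chars / total_pages
    # Characters contained in the pages allowed by the page limit.
    by_pages = min(total_pages, max_pages) * chars_per_page
    return min(total_chars, max_chars, by_pages)


# The example book: 450 000 characters on 160 pages, default limits.
print(chars_indexed(450_000, 160, 500_000, 100))
# Same book with the page limit raised to 400.
print(chars_indexed(450_000, 160, 500_000, 400))
```

Under this assumed rule the page limit cuts the book off after 281 250 characters even though it is below the character limit, while with a 400-page limit the whole book would be covered.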

Finally, it is unclear what happens if I increase the max character/pages numbers. I thought that if I increased the numbers, then maybe it would index the unindexed items because they were too long. But nothing seems to happen. Maybe it's a bug, or maybe this is how it is intended, and Zotero simply indexes the rest of the text of already indexed items in the background? Or do I need to rebuild the entire index from scratch to have the rest of already indexed items indexed?

Again, some help/explanation would be good. And apologies for the long posts, but I thought others might be similarly confused.

migugg · February 23, 2023

As I have not received any comment, I would like to bump this

Haffner64 · October 23, 2024

I have the same questions. Have you found a solution to the problem in the meantime? If so, would you describe this solution here?

migugg · October 25, 2024

No, I have not found a solution, nor have I ever found an explanation. It is mystifying, in particular because this covers a whole page of settings, which can be user configured. Any help from the devs would really be appreciated.
I should also add that my library index has moved now to 57 partial, 1100 not indexed, and I have no idea why the partially indexed have gone down.

v4u6h4n · April 26, 2025

I'm also interested in figuring out how to resolve this issue. Even just an "automatic full indexing doesn't work yet" from a dev would be good; just so I know I'm not missing something.

dstillman · April 26, 2025

Full-text indexing in the desktop app is automatic. PDFs with readable text layers are indexed when they're added, and PDFs added on another computer have their full-text content synced (if you haven't disabled "Sync full-text content") and indexed on idle.

Files added via the mobile apps or the web library are not currently indexed automatically, so those would remain unindexed on other computers unless you index them. That's obviously something we need to fix.

"Partial" indexing means that the content was longer than the max page/character settings. A manual reindex of the file would cause it to be fully indexed. The default settings were set many years ago and haven't been adjusted since — we could probably do so now given computer performance increases, but it would ideally be done as part of a larger technical overhaul of the indexing system, which isn't on the immediate horizon.
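
As a rough illustration of the rule described above, the three states could be modelled as follows. This is a sketch under stated assumptions, not Zotero's code: the function name, its parameters, and the way the page and character limits are combined are all hypothetical.

```python
def index_status(has_text_layer, pages, chars, max_pages, max_chars):
    """Classify a file in the way the description above suggests.

    Hypothetical model: treating either limit as sufficient for
    "partial" is an assumption, not confirmed behaviour.
    """
    if not has_text_layer:
        # No readable text layer, so there is nothing to index.
        return "unindexed"
    if pages > max_pages or chars > max_chars:
        # Content is longer than the configured limits.
        return "partial"
    return "indexed"


print(index_status(True, 30, 80_000, 100, 500_000))
print(index_status(True, 160, 450_000, 100, 500_000))
print(index_status(False, 12, 0, 100, 500_000))
```

In this model, raising the limits changes the status only once the file is indexed again, which matches the note above that a manual reindex is needed.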

If you're experiencing some problem that you don't think is described by all that, you should report it in a new thread.

As to why that plugin exists, you'd have to ask @emilianoeheyns what problem it was trying to solve. For now, I've removed it from the plugins page to avoid confusion (and because it hasn't been updated in years and wouldn't be compatible with Zotero 7 anyway).

emilianoeheyns · April 26, 2025

I don't recall. I had some massive PDFs back then that may not have been fully indexed automatically, but I was also a coder with enough time on my hands (a condition which no longer exists) who would gladly have spent 4 hours automating a 1-hour task. It's very possible it wasn't addressing a substantial problem then, and from the description @dstillman gives it doesn't seem to address one now.

And no, it isn't Zotero 7 compatible.

v4u6h4n · April 26, 2025

@dstillman

Thanks for the heads up :-)

Just wondering what the consequences would be for my performance if I pushed the defaults up high enough to automatically index everything. Am I only going to see increased resource usage when I actually use advanced search to search the contents of all attachments, or will it also affect general performance even when the indexes aren't being queried?

migugg · May 20, 2025

@v4u6h4n: Try it. I have set it to Words: 9 Mio, Pages: 5000 and it works perfectly fine.
This is on a library with approximately 12 000 items and probably 3000 attachments, on a MacBook Pro with 8GB memory.

@dstillman: Is there any way to identify which pdfs have only been partially indexed?

best

migugg · May 20, 2025

@dstillman: "The default settings were set many years ago and haven't been adjusted since — we could probably do so now given computer performance increases, but it would ideally be done as part of a larger technical overhaul of the indexing system, which isn't on the immediate horizon."

But this setting is hidden and can lead to confusion or, worse, to people having text unindexed without even realising it. Given that computers can easily deal with it now and that it would be a very simple thing for you to change, why not do it immediately and raise it to a reasonable number? Say 1000 pages and 2 Mio characters?
The larger technical overhaul can then come later whenever it is convenient to you.