Indexing in Zotero standalone -- limit characters or pages?

JonEP · March 14, 2012

I've been using Zotero for several years now. I have traditionally not used the indexing feature -- I have many PDFs, and many of them are books, and I was under the impression that setting Firefox to work indexing them would slow down Firefox.

Now that I'm using the standalone version, I'd like to ask a couple of questions about indexing and performance.

1. Is there a reason why the indexing defaults are set to a limit of 100 pages per file and 500,000 characters per file? If Zotero standalone is doing the indexing, should Firefox be able to work relatively speedily? Put another way, is there a performance drawback to setting Zotero to index as high as 350 pages, and perhaps 1 or 2 million characters per item?

2. I store all of my PDFs in one folder in My Documents. My Zotero database has links for each entry, rather than the actual PDF. Will indexing from Zotero follow those links and actually index material that is stored in a separate folder, rather than within Zotero's own directory/program location?

3. Along the same lines, all of those PDFs are synced between desktop and laptop using an external syncing agent (Sugarsync), while Zotero itself is synced across computers using Zotero's own sync feature. The directory structures of both computers are identical. Will indexing work in this situation? Or will indexing take place independently on the two Zotero installations, and then cause a conflict when syncing occurs?

Thanks for your guidance.

adamsmith · March 14, 2012

1. The performance issue is not during the indexing, but due to a very large index - the larger your index, the slower full-text searches will be. That's a trade-off; you'll have to see what's more important to you. I think the balance has tipped somewhat in favor of indexing more, since the quick-search fields in both the client and the word-processor plugin are now restricted to title, author, and year by default.

2. yes

3. indexing will be independent, will work on both computers, and won't cause a sync issue (the index isn't synced).

JonEP · March 14, 2012

Thanks. I'm going for 1500000 characters and 300 pages. Will report back on my experience...

JonEP · March 15, 2012

14 hours later, my indexing stats are:

Indexed: 3921
Partially: 1267
Unindexed: 1618

The statistics have stayed like that for the past hour or so, at least, so it seems that Zotero has stopped indexing at this stage. I am quite sure that 99% of my PDFs have text layers (i.e., they are not image-only). Is there anything that can be done to nudge Zotero to continue indexing? It would be nice if Zotero had some sort of indexing status indicator, letting us know that it is still on the job, or that it thinks that it has done all it can...

JonEP · March 15, 2012

Quick follow up --
I just realized I could "only index unindexed items" so I've done that. Still not sure why Zotero stopped indexing, though.

EDIT:
OK, Zotero has definitely stopped indexing, with half of my items either incompletely indexed or unindexed. What to do? Thanks.

dstillman · March 15, 2012

Assuming you have 3.0.3 (which fixed an indexing-related bug), the things that are unindexed are quite likely unindexable. If you find them in the items list you can look at the files and see why. If you think something should be indexed that isn't, provide an example.

"Index Unindexed Items" doesn't currently reindex partially indexed items, so if you changed your settings you'd have to clear the index and rebuild.

The indexing process currently freezes the UI, so there shouldn't be any mystery as to when it's done.

JonEP · March 15, 2012

Thanks for your response, Dan.

Those items are all (or at least 99%) indexable via Acrobat's own convoluted indexing features, and also are indexable via Windows after I've installed the PDF ifilter. So it seems odd that so many of them would be unindexable by Zotero.

I'd love to provide examples of things that should be indexed but aren't, but it isn't clear to me how to figure out which items have been included in Zotero's index and which items have not. Is it possible to do that?

Finally, regarding "if you've changed your settings you'd have to clear the index and rebuild" -- are there settings that I might be able to change in order to more thoroughly index items? Or were you referring to the limits on characters and pages? Currently, I am indexing 1.5 million characters max per document, and 300 pages max per document. I doubt that of my 5000 or so items there are as many as 1267 that exceed those limits. Are there other settings I might change?

Thanks.

dstillman · March 15, 2012

"it isn't clear to me how to figure out which items have been included in Zotero's index and which items have not. Is it possible to do that?"

Look at "Indexed" in the right-hand pane when the attachment is selected.

"are there settings that I might be able to change in order to more thoroughly index items. Or were you referring to the limits on characters and pages?"

Yes, I was referring to the character/page limits.

"I doubt that of my 5000 or so items there are as many as 1267 that exceed those limits."

It'd be how many exceeded the settings when they were originally indexed, not the new settings.

Note that manual indexing in the right-hand pane does index a partially indexed attachment completely.
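
To make the point about old versus new limits concrete, here is a minimal Python sketch of how an attachment's status could be decided at indexing time. It is a model for illustration only, not Zotero's actual code: the `Attachment` class, the `index_status` function, and the rule that exceeding either limit yields a partial index are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Attachment:
    pages: int  # total pages in the file
    chars: int  # extractable characters (0 for an image-only scan)


def index_status(att: Attachment, max_pages: int, max_chars: int) -> str:
    """Classify an attachment under the limits in force when it is indexed.

    Assumption: exceeding EITHER limit leaves the item partially indexed.
    """
    if att.chars == 0:
        return "unindexed"  # nothing to extract
    if att.pages > max_pages or att.chars > max_chars:
        return "partial"    # indexing stopped at a limit
    return "indexed"


# A hypothetical 250-page book with 600,000 characters of text.
book = Attachment(pages=250, chars=600_000)

# Status recorded when the book was indexed under the default limits.
stored = index_status(book, max_pages=100, max_chars=500_000)

# Status the same book would get if reindexed under the raised limits.
after_rebuild = index_status(book, max_pages=300, max_chars=1_500_000)

print(stored, after_rebuild)
```

In this model the stored status stays "partial" after the limits are raised, because it reflects the limits at the time of the original indexing. It only changes once the item is actually reindexed, by a rebuild or by manual indexing in the right-hand pane.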

JonEP · March 15, 2012

After setting my character limit per item to 3 million and the maximum number of pages to be indexed to 500, and then setting Zotero to rebuild the index, Zotero seems to begin rebuilding the index and then stalls: I get a circling symbol ('please wait') and a (Not Responding) indication. Should I assume Zotero is still working on it? Or is that an error, and should I force a restart?

dstillman · March 15, 2012

That means it's still working.

dstillman · March 15, 2012

(We'll switch to doing indexing in the background in a future version.)

JonEP · March 15, 2012

Thanks. Still spinning, 4 hours later. I'll check back in the morning and hope it has reached some resolution. Odd that the 1.5 million characters and max pages 300 setting was "relatively" fast at doing the indexing, whereas this time around it seems to be a debilitating slog...

JonEP · March 16, 2012

So, I left it all night, but this morning the "not responding" situation was still in place and I had to force a close of Zotero. I reset the limits again, this time for 2 million characters and 400 pages per item. Initially, Zotero seemed to be doing its indexing (a spinning indicator, but no "not responding" message), but now it is "not responding" again and I wonder if it will indeed complete the indexing or if it has simply encountered an error.

I am finding that perhaps Zotero is not up to the task of indexing a large library....?

dstillman · March 16, 2012

If you weren't on Windows you could run with real-time debug output (adjusting those steps for Standalone, since that page hasn't yet been updated) to see if it's actually doing anything, but the debug console on Windows is prohibitively slow—even a small indexing operation would take an incredibly long time with real-time debug output enabled. The best you can do is check for disk access and CPU usage and try to gauge from that whether it's hung or still processing. You can also start Standalone from the command line with the -jsconsole flag to open the Error Console, but if there were an error Zotero would be much more likely to just stop working rather than hang.

We're not aware of any issues that would cause Zotero to hang during a large indexing attempt, but unfortunately you're not on a platform that allows us to debug this.
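
The CPU and disk check suggested above can be roughly automated. The following Python sketch compares two samples of a process's cumulative CPU time and I/O bytes. It assumes the third-party psutil package is installed and that the process is named `zotero.exe`; the function names are hypothetical and the thresholds are arbitrary.

```python
import time


def looks_active(before, after, min_cpu_seconds=0.5, min_io_bytes=1):
    """Compare two (cpu_seconds, io_bytes) samples taken some time apart.

    Returns True if the process used CPU or moved data in between.
    """
    cpu_delta = after[0] - before[0]
    io_delta = after[1] - before[1]
    return cpu_delta >= min_cpu_seconds or io_delta >= min_io_bytes


def sample(proc):
    """Read cumulative CPU time and I/O bytes from a psutil.Process."""
    cpu = proc.cpu_times()
    io = proc.io_counters()  # available on Windows and Linux
    return (cpu.user + cpu.system, io.read_bytes + io.write_bytes)


def watch(name="zotero.exe", interval=30):
    """Sample the named process twice; return None if it is not running."""
    import psutil  # third-party: pip install psutil

    for proc in psutil.process_iter(["name"]):
        if (proc.info["name"] or "").lower() == name:
            before = sample(proc)
            time.sleep(interval)
            after = sample(proc)
            return looks_active(before, after)
    return None


if __name__ == "__main__":
    print(watch())
```

A result of False over a single interval does not prove a hang, so it is worth repeating the check a few times before forcing a close.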