Indexing 100,000+ PDFs

trilobutt · June 13, 2013

Using Zotero 4.0 Standalone, 64-bit, on Linux Mint.

My situation is a bit weird: I have over 100,000 PDFs, all standard research articles that I have accumulated over the years. I added the PDFs without indexing ("Store copy of file" for the PDFs, with all indexing options set to 0). Now that I'm trying to build the index through the options menu using the default numbers, it doesn't seem to work. The CPU usage has certainly been there over several days, and I noticed the PDF tools running several times, but after an emergency shutdown I went to restart the indexing and noticed that nothing had been indexed: the number of indexed characters and items is still 0.

These are not protected documents. Indexing manually with the green arrow works... but doing that for 100,000 items would be fairly tedious. Is it just a matter of letting it run for several days (weeks?), or is there any way to do it in "chunks" rather than all in one go?

dstillman · June 29, 2013

Yes, you can select multiple PDFs, right-click, and choose Reindex Items.

It's complicated, but you should be able to do this with some saved searches and a temporary collection into which you drag batches of items. Keep in mind that, when viewing a search in the middle pane, you can press Ctrl-A to select only the matching items (in black) and not the non-matches (in gray), so you'll need to construct a search that only matches the PDFs. But since child items don't currently show up as being in collections, what you'd have to do is create two saved searches:

1) [Collection] [is] ["Temp Collection"], ["Include parent and child items of matching items"], save as "Temp Search"

2) [Saved Search] [is] ["Temp Search"], [Attachment File Type] [is] [PDF]

Then save the second search, which should show in black only the PDFs of the items in that temporary collection. Press Ctrl-A, right-click, and choose Reindex Items.
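
To make the logic of the two saved searches above easier to follow, here is a rough Python model of what they select. This is only an illustration, not Zotero code: the Item class, its fields, and the function names are all made up for the example, and it assumes that only top-level items are members of a collection.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Item:
    """Hypothetical stand-in for a library item (not a Zotero class)."""
    item_id: int
    title: str
    parent_id: Optional[int] = None    # None for top-level items
    file_type: Optional[str] = None    # "pdf" for PDF attachments
    in_temp_collection: bool = False   # assumed: only parents sit in collections


def temp_search(items: List[Item]) -> List[Item]:
    """Search 1: items in the temp collection, plus their parents and children."""
    matched = {i.item_id for i in items if i.in_temp_collection}
    parents = {
        i.parent_id
        for i in items
        if i.item_id in matched and i.parent_id is not None
    }
    return [
        i for i in items
        if i.item_id in matched or i.item_id in parents or i.parent_id in matched
    ]


def pdf_search(items: List[Item]) -> List[Item]:
    """Search 2: of the results of search 1, keep only the PDF attachments."""
    return [i for i in temp_search(items) if i.file_type == "pdf"]


# Made-up sample library: one article in the temp collection, one outside it.
library = [
    Item(1, "Article A", in_temp_collection=True),
    Item(2, "a.pdf", parent_id=1, file_type="pdf"),
    Item(3, "Snapshot", parent_id=1, file_type="html"),
    Item(4, "Article B"),
    Item(5, "b.pdf", parent_id=4, file_type="pdf"),
]
selected = pdf_search(library)
```

In this sample only the PDF attached to "Article A" ends up in the selection, which corresponds to the items shown in black in the second search.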

(Obviously, if all the PDFs have "pdf" in the title you can also just go to the temporary collection and use the quick search bar and then Ctrl-A.)
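
If it helps to plan the batches before dragging anything around, the splitting itself can be sketched in a few lines of Python. This is only a planning aid under assumptions: the batch size of 2,000 is an arbitrary number picked for illustration, the IDs are placeholders, and nothing here talks to Zotero.

```python
from typing import Iterator, List


def batches(item_ids: List[int], batch_size: int) -> Iterator[List[int]]:
    """Yield consecutive slices of item_ids, each at most batch_size long."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(item_ids), batch_size):
        yield item_ids[start:start + batch_size]


# Placeholder IDs standing in for 100,000 PDFs; 2,000 per batch is an assumption.
all_ids = list(range(100_000))
plan = list(batches(all_ids, 2_000))
print(len(plan))  # number of batches to work through
```

With these numbers the library splits into 50 batches, each small enough to drop into the temporary collection and reindex before moving on to the next.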