Full-text search of PDFs: is there an upper limit?

benifex · February 22, 2019

I am planning to filter references for a systematic review by searching on text strings in attached PDFs. I've done a few test searches, and I've found that sometimes I click 'Search' and nothing much seems to happen. Other times it spits out results just fine. So for example I am running a test search right now:

Creator contains [name] AND
Attachment content contains [text]

The search terms are drawn from a PDF which is attached and which I re-indexed just before the search. So it should work and find that one result, yet it has been running for 20 minutes now with no results showing. CPU usage for Zotero has been tracking around 30%, though, so it is doing _something_.

The search I am planning to run will be on ~3,000 PDFs (yeah, too many social scientists writing too much...) with a search expression of ~20 terms connected by OR.

Will this be too much? My Plan B is writing some regex in R.
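
For what it's worth, here is a rough sketch of what that Plan B could look like, written in Python rather than R. It assumes the text of each PDF has already been extracted to a plain string by some other tool, and the names `build_or_pattern`, `filter_documents`, and the sample file names are all hypothetical, not anything Zotero provides. It combines the terms into one case-insensitive regex joined by OR and lets any run of whitespace stand in for a space, since extracted PDF text often breaks phrases across lines.

```python
import re


def build_or_pattern(terms):
    """Compile one case-insensitive regex that matches any of the terms (logical OR)."""
    if not terms:
        raise ValueError("at least one search term is required")
    alternatives = []
    for term in terms:
        # Escape each word so characters like '.' or '(' are matched literally,
        # and allow any whitespace (including line breaks) between words.
        words = [re.escape(word) for word in term.split()]
        alternatives.append(r"\s+".join(words))
    return re.compile("|".join(alternatives), re.IGNORECASE)


def filter_documents(texts, terms):
    """Return the keys of documents whose text matches at least one term.

    texts: dict mapping a document key (assumed here to be a file name)
           to its already-extracted plain text.
    """
    pattern = build_or_pattern(terms)
    return [key for key, text in texts.items() if pattern.search(text)]


# Hypothetical sample data standing in for extracted PDF text.
sample_texts = {
    "a.pdf": "Social capital\nand trust in institutions",
    "b.pdf": "Nothing relevant here",
    "c.pdf": "A study of TRUST.",
}
matches = filter_documents(sample_texts, ["social capital", "trust"])
print(matches)
```

With ~20 terms and ~3,000 documents this is a single pass over each text, so the regex itself should not be the bottleneck; extracting the text from the PDFs would be the slow part.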

@bwiernik I thought you might have some wisdom?

dstillman · February 22, 2019

We'd want to see a Debug ID for an operation that took longer than expected, but note that the latest Zotero beta greatly speeds up the "Attachment Content" search condition when you use a single word rather than a phrase (and it also fixes some bugginess with "Attachment Content" searches in general).

benifex · February 27, 2019

Thanks @dstillman. I will hold off on installing the beta until I am about to begin the full-text (FT) searches, since it is still a beta and waiting should allow time for further development.

I will also repeat the FT search and post a Debug ID.