Can I index linked pdfs?

CellWhisperer · January 8, 2016

Hi,

I've built my whole library on linked pdfs (not attached ones). From what I understand, Zotero can search within the text of a pdf, whereas right now I can only search through fields and tags.

Is there any way (e.g. via a plugin) for Zotero to index and search through linked pdfs?

Thanks for your help!

adamsmith · January 8, 2016

Zotero does index linked pdf files by default. How are you testing this?

CellWhisperer · January 8, 2016

Hi Adam,

My library contains around 1447 entries, all but 206 are linked to a pdf.

I'm using the search function of Windows Explorer in the root folder of my library for a very specific term: it returns 17 files. I can open each of them in Adobe Reader and verify that the term is indeed there.
However, in Zotero (Firefox plugin), if I select My Library and enter the same search term, nothing comes up.

I know that no search can be exhaustive, since a lot of (old) pdfs have badly rendered characters that can't be recognized as text, but the files found by Windows Explorer clearly contain searchable text, so it's not an OCR problem.

Zotero's search prefs tell me that both pdftotext and pdfinfo are up to date (v3.02a), and the characters/pages values are set to their defaults (FYI none of the 17 pdfs found earlier in Windows Explorer is over 100 pages). The stats show 841 files indexed, 1 partially indexed and 723 not indexed (weird that this adds up to 1565 rather than 1447-206 = 1241..?).

EDIT: I should add that this test gave the same results before and after trying to rebuild the index (using the "Index non-indexed files" option).
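The mismatch in the stats can be spelled out with a few lines of arithmetic. This is only a sketch using the numbers quoted in the post; the idea that the stats count every attachment rather than one pdf per item is an assumption, not something the preferences pane states.

```python
# Numbers quoted in the post above.
total_items = 1447
items_without_pdf = 206
indexed, partial, unindexed = 841, 1, 723

# Items that should have exactly one linked pdf.
expected_pdf_items = total_items - items_without_pdf

# Files the index statistics actually report.
reported_files = indexed + partial + unindexed

# Assumption: any surplus comes from extra attachments
# (several files on one item, or non-pdf attachments).
surplus = reported_files - expected_pdf_items

print(expected_pdf_items, reported_files, surplus)
```

If the surplus is real, it points at items that carry more than one attachment, which is worth checking before blaming the indexer.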

adamsmith · January 8, 2016

Right, but how _exactly_ are you searching for them? I believe the default setting in Zotero's search bar is just title, creator, year. You'll need to select "Everything" for full text to be included in the search.

CellWhisperer · January 8, 2016

It's confirmed: I'm an idiot. I had stopped even reading what the search field shows when it's empty ("Fields & tags")...

Indeed, if I select "Everything", I now get 15 files. So only 2 files rightfully show up in the Windows Explorer search but not in Zotero.

After checking those 2 files, I can't find any reason why they wouldn't show up in Zotero: the pdfs seem OK in Adobe Reader, and I can select/copy text with no problem.
Is there any reason for this discrepancy? I know I'll miss occurrences in older files, but I want to make sure that I find as much as I possibly can.

adamsmith · January 8, 2016

Hard to say. Do those files show up as indexed in Zotero when you look at them there (select the file/attachment and it will say on the right)?
If not, are you able to index them manually (by clicking on the round arrow next to "No")?

Is there anything else unusual about the files, like read-only protection or similar? That sometimes trips up the tools Zotero uses for indexing.
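To double-check a single file outside Zotero, one option is to run pdftotext on it directly and look for the term. This is a minimal sketch, assuming a `pdftotext` binary is on the PATH; the helper names `extract_text` and `contains_term` are made up for illustration, and Zotero's own indexing may not behave identically.

```python
import subprocess
from pathlib import Path


def extract_text(pdf_path: Path, max_pages: int = 100) -> str:
    """Return the text pdftotext extracts from the first max_pages pages.

    Raises FileNotFoundError if pdftotext is not installed, and
    CalledProcessError if pdftotext fails (e.g. on a protected file).
    """
    result = subprocess.run(
        ["pdftotext", "-enc", "UTF-8", "-l", str(max_pages), str(pdf_path), "-"],
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def contains_term(text: str, term: str) -> bool:
    """Case-insensitive match that ignores line breaks and repeated spaces."""
    def normalize(s: str) -> str:
        return " ".join(s.lower().split())
    return normalize(term) in normalize(text)


# Hypothetical usage, with a made-up file name and search term:
#   text = extract_text(Path("some_article.pdf"))
#   print(contains_term(text, "my specific term"))
```

If `extract_text` returns an empty or garbled string for a file that looks fine in Adobe Reader, the problem is more likely in the pdf itself than in Zotero's index.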

CellWhisperer · January 8, 2016

Yep, you're right: they were showing up as not indexed. I was able to index them manually, and now they do show up in the search.

Why were they not indexed earlier? Should I try to rebuild the index from scratch?

adamsmith · January 8, 2016

You could try re-building the index from scratch (do it overnight -- it takes time and slows down/freezes Zotero). I honestly don't know why indexing sometimes just stops(?) and why re-indexing unindexed files doesn't, well, index unindexed files.

Maybe Dan has an idea?

CellWhisperer · January 13, 2016

I rebuilt the index from scratch, and considerably fewer files were left out. After a few rounds of reindexing unindexed files, the seemingly irreducible number of non-indexed files is down to 171.

Does that mean that's the number of image-based pdfs, which Zotero can't OCR, plus pdfs with badly rendered text?

adamsmith · January 13, 2016

Yes, that sounds like a more plausible number, too, given what you say about the provenance of the files, no?

CellWhisperer · January 13, 2016

I was expecting a much lower number, given that they are all scientific articles coming from peer-reviewed journals.

However, while I was building up the library, a number of items ended up linked to a pdf but also to a "PubMed entry" (I think that's what gets created when you add a reference via the "magic wand" and a PMID, which I did when I started with Zotero). After deleting these links, the number of unindexed files is down to 12, and I can account for half of them (.doc files). I've decided I don't care enough about the other half to try and track them down, so it's all good for me now.

Thanks for your patience Adam!
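The kind of breakdown described in the last post (spotting that .doc files make up part of the unindexed remainder) can be sketched as a simple tally by file extension. This is an illustrative sketch only: the function name `count_by_extension` and the example paths are hypothetical, and getting the list of unindexed files out of Zotero is left aside.

```python
from collections import Counter
from pathlib import Path


def count_by_extension(paths):
    """Tally files by lowercased extension; files without one go under '(none)'."""
    return Counter(Path(p).suffix.lower() or "(none)" for p in paths)


# Made-up example paths standing in for a list of unindexed attachments.
example = ["scan_1998.PDF", "draft.doc", "notes.doc", "article.pdf", "README"]
for ext, n in count_by_extension(example).most_common():
    print(ext, n)
```

Anything that is not a pdf in such a tally would not be expected to go through pdftotext at all.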