how to phrase-search in PDF attachments?

hello,

I have most of my articles as PDF files (indexed). Simple search works fine if I look for a single word or multiple words, but if I use quotation marks (" ") Zotero finds only a few of the expected ca. 50 (apparently only those containing the phrase in the title or in saved HTML attachments).

If I use advanced search through attachment content (phrase, incl. binary files or not, it does not make any difference), it only brings me saved web pages and no PDF articles. So how can I make it search PDFs for a phrase?

v.
  • Hello,

    Could you please answer my question? I really do not know whether there is no answer because the solution is so simple, or because it is not possible to phrase-search in PDFs.
  • Phrase search should work for PDFs that have been indexed. Have you tried for something simple (like "the")? Note that the text content of some PDFs is surprising: some have poor OCR or strange white space or use characters that you would not normally type.
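To make the point about surprising PDF text concrete, here is a rough Python sketch. It is not what Zotero does internally; the `extracted` string is made up for illustration, and `normalize` and `contains_phrase` are hypothetical helper names. It only shows how a line break or a ligature in extracted text can defeat an exact phrase match.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Fold ligatures, drop soft hyphens, and collapse whitespace runs."""
    text = unicodedata.normalize("NFKC", text)  # e.g. the "fi" ligature becomes "fi"
    text = text.replace("\u00ad", "")           # remove soft hyphens
    return re.sub(r"\s+", " ", text).strip().lower()


def contains_phrase(text: str, phrase: str) -> bool:
    """Phrase match after both sides have been normalized."""
    return normalize(phrase) in normalize(text)


# Made-up example of what PDF-to-text output can look like:
# a line break inside the phrase, a ligature, and a double space.
extracted = "a classic hot\ndog stand, with a \ufb01ne  mustard"

print("hot dog" in extracted)                      # naive match on the raw text
print(contains_phrase(extracted, "hot dog"))       # match after normalization
print(contains_phrase(extracted, "fine mustard"))  # ligature and spacing folded
```

The naive check on the raw text misses the phrase, while the normalized comparison finds it. Whether a given indexer applies any such normalization is a separate question.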
  • Search as such works fine. My problem is that, for example, I would like to find all articles mentioning "hot dog" (food), but Zotero also displays articles on a very different subject that merely contain both words, like "I get very unhappy when I see a dog who can barely move because he’s so darn hot."
  • I meant that you should search for '"the"' i.e., with the quotes to use the "advanced quick (phrase-based)" search.

    '"hot dog"' (with the quotes) will perform a phrase-search through PDFs as well, but it may not lead to successful results due to the limitations of pdf->text conversion that I mentioned.

    This will confirm that such searches are possible. But, again,
  • Thanks, but it does not seem to work. As I examine what has been found and what has not, it appears that the word "the" was found only in titles, not in PDFs. I mean that only those items were displayed whose titles contain "the". So it does look like Zotero searched only through titles, not inside the actual files.
  • Phrase search within PDFs appears to work fine for me in Zotero 2.0b7.2. I guess it is possible that this has changed between versions.

    Vinthund - which version of Zotero are you using?
  • b6.5. Maybe I should try newer version.
  • I was thinking more of 1.0 compared to 2.0, but upgrading shouldn't do any harm (as with all these things, it is probably a good idea to back up your library beforehand).

    If it fixes your problem, great, if not you will have at least got a host of other bug fixes, which should help prevent other problems cropping up.

    Be aware that the note and attachment tabs have disappeared from the latest version (the notes tab will return at some point, but not the attachments tab) so if you are particularly attached to the notes tab you may want to consider delaying the upgrade. There are of course alternative ways (a couple of buttons in the toolbar) to deal with notes and attachments in the latest version.

    Another thing to check is that your PDFs are properly indexed. Does "Indexed: Yes" appear in the info pane for PDFs? If not, you will need to check that the "pdftotext" and "pdfinfo" tools are installed in the search pane of Zotero preferences, and install them if not. You can then index your PDFs by clicking "rebuild index" in the same pane. If your PDFs are not indexed that would explain why you are only finding words which appear in titles, rather than inside the PDFs.
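If you want to look at the extracted text yourself, a minimal Python sketch along these lines may help. It assumes a `pdftotext` binary is available on your PATH, which is not necessarily the copy Zotero uses; `extract_text` and `phrase_in_pdf` are hypothetical helper names, and `article.pdf` in the usage comment is a placeholder.

```python
import re
import shutil
import subprocess
from typing import Optional


def extract_text(pdf_path: str, tool: str = "pdftotext") -> Optional[str]:
    """Return the text the tool extracts, or None if the tool is not on PATH."""
    if shutil.which(tool) is None:
        return None
    # "pdftotext file.pdf -" writes the extracted text to standard output.
    result = subprocess.run(
        [tool, pdf_path, "-"], capture_output=True, text=True, check=True
    )
    return result.stdout


def phrase_in_pdf(pdf_path: str, phrase: str, tool: str = "pdftotext") -> Optional[bool]:
    """True/False if the phrase occurs in the extracted text, None if no tool."""
    text = extract_text(pdf_path, tool)
    if text is None:
        return None

    def collapse(s: str) -> str:
        # Treat any run of whitespace (including line breaks) as one space.
        return re.sub(r"\s+", " ", s).lower()

    return collapse(phrase) in collapse(text)


# Usage (placeholder file name):
#   phrase_in_pdf("article.pdf", "hot dog")
```

If this reports that the phrase is missing from a PDF you know contains it, the problem is more likely in the text extraction than in the search itself.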
  • "If your PDFs are not indexed that would explain why you are only finding words which appear in titles, rather than inside the PDFs."

    But that would be too simple an explanation ;)

    Yes, they are indexed, and I hope they are properly indexed. At least "Indexed: Yes" does appear, and I am able to look for single words in attached files. It is useful anyway, but if I could look for phrases, that would save me some effort.

    I wanted to recommend Zotero to some of my colleagues, but I shall wait until I can explain to them how the phrase search works. Or could this problem be software-dependent, so maybe it would work fine on their Windows? I run Ubuntu and FF 3.5.3, if that matters.
  • "Yes, they are indexed, and I hope they are properly indexed."
    You could try re-indexing them. Also, note the maximum characters and pages to index per file.
    "I run Ubuntu and FF 3.5.3, if that matters."
    Me too. Test on the file at:
    http://finaid.georgetown.edu/sample.pdf

    Using the quick search bar, the following should all lead to that PDF being found:
    • Test
    • Test PDF
    • Test your
    • "Test"
    • "Test PDF"
    And this should not lead to it being found:
    • "Test your"

    Using advanced search, all searches are phrase searches and you should not use quotation marks. So, an advanced search with the single criterion being that attachment content contains either:
    • Test
    • Test PDF
    will lead to a match. Searches for:
    • Test your
    • "Test"
    • "Test PDF"
    • "Test your"
    will not lead to matches.
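The two behaviours described above can be sketched roughly in Python. This is only an illustration of the rules as stated in this thread, not Zotero's actual code: `quick_search` and `advanced_contains` are hypothetical names, matching is simplified to case-insensitive substring checks, and `SAMPLE` is a made-up stand-in text, not the content of the linked PDF.

```python
import re

# Made-up stand-in text: contains "test PDF" as a phrase and "your" elsewhere.
SAMPLE = (
    "This is a test PDF document. If you can read this, "
    "you have Adobe Acrobat Reader installed on your computer."
)


def quick_search(query: str, text: str) -> bool:
    """Quick search: quoted parts are phrases, bare words may appear anywhere."""
    parts = re.findall(r'"([^"]*)"|(\S+)', query)
    terms = [(phrase or word).lower() for phrase, word in parts]
    haystack = text.lower()
    return all(term in haystack for term in terms if term)


def advanced_contains(query: str, text: str) -> bool:
    """Advanced search: the whole query is one phrase; quotes are taken literally."""
    return query.lower() in text.lower()


for q in ['Test', 'Test PDF', 'Test your', '"Test"', '"Test PDF"', '"Test your"']:
    print(f"{q!r}: quick={quick_search(q, SAMPLE)}, "
          f"advanced={advanced_contains(q, SAMPLE)}")
```

With this stand-in text, the quick search accepts every query except `"Test your"`, and the advanced search accepts only `Test` and `Test PDF`, which mirrors the lists above.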
  • edited October 7, 2009
    Thanks, I did the test and it worked as you had said it would. But still many files were not found (even though they were no longer than 100 pages, the phrase was not split across separate lines, and the words in the phrase were not divided by more than one space; could anything else matter?).

    After that I rebuilt the index and now it seems to work much better - for instance I get 100 hits whereas previously I got 9 using exactly the same phrase :) and I hope they are accurate.

    Thanks everyone.