Searching note content outside Zotero

cheflo · January 31, 2020

Searching in Zotero reveals which items contain the search terms, but not where in those items the matches are. This means I have to click through each matched item's attachment and search again inside it. In contrast, if I search a text document using grep, it returns the line with the match and optionally any number of lines before and after it. This is great for quickly finding relevant matches, and it would be very useful if I could search through Zotero’s notes and PDF attachments in this way.

The PDF attachments are easy enough: I store them in a dedicated dir and could just run pdftotext on them, just like Zotero does. However, it seems like the notes are only contained within the SQLite database, and not in the Zotero storage directory like web snapshots and other files. This means I need to perform an SQL query to dump all the note content to files (preferably with the parent item as the file name) and then convert from HTML to Markdown using pandoc. I am not familiar with the database structure of Zotero (or that much with SQL either, for that matter). Would such a query be straightforward? Do you have any advice on where to start?
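A minimal sketch of such a dump, in case it is a useful starting point. It assumes the sqlite3 command-line tool is installed and that, as far as I know, the database keeps notes in an itemNotes table (itemID, parentItemID, note) and item keys in the key column of the items table; the schema may differ between Zotero versions. dump_notes is a made-up helper name, and it is safer to run it on a copy of zotero.sqlite than on the live database.

```shell
# Hypothetical helper: dump_notes DB_COPY OUT_DIR
# Writes each note's HTML to OUT_DIR/<parent key>_<note itemID>.html
# Assumed schema: itemNotes(itemID, parentItemID, note), items(itemID, key)
dump_notes() {
    db=$1
    out_dir=$2
    mkdir -p "$out_dir"
    # One row per note: the note's itemID and the parent item's key
    # (standalone notes have no parent, so fall back to the note's own key).
    sqlite3 -list "$db" "
        SELECT n.itemID, COALESCE(p.key, i.key)
        FROM itemNotes AS n
        JOIN items AS i ON i.itemID = n.itemID
        LEFT JOIN items AS p ON p.itemID = n.parentItemID;" |
    while IFS='|' read -r note_id key; do
        # Fetch the note body separately so newlines in the HTML do not break the loop.
        sqlite3 -list "$db" "SELECT note FROM itemNotes WHERE itemID = $note_id;" \
            < /dev/null > "$out_dir/${key}_${note_id}.html"
    done
}
```

Something like `dump_notes ~/zotero-copy.sqlite notes_html` (both paths are placeholders) would then leave one HTML file per note, which can be converted with pandoc and searched with grep.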

cheflo · January 31, 2020

After posting this I found out that the brilliant Better BibTeX extension has an option to include notes when exporting to .bib files. It can also automatically keep these files up to date, which is quite convenient. The only downside is that it exports a collection as a single file, so tools like grep can’t indicate which item has the matching string simply by displaying the file name, but there is probably some BibTeX or BibLaTeX processing tool that could help me with that.
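One way around the single-file limitation, without any dedicated BibTeX tool, is to let awk remember the citation key of the entry it is currently in and print that key next to every matching line. This is only a sketch: bibgrep is a made-up name, and it assumes every entry starts on its own line in the form @type{key, which is what a typical .bib export looks like.

```shell
# Hypothetical helper: bibgrep PATTERN FILE.bib
# Prints "citekey: matching line" for every line that matches PATTERN.
bibgrep() {
    awk -v pat="$1" '
        # A line starting with @ opens a new entry: keep only its citation key.
        /^@/ { key = $0; sub(/^@[^{]*[{]/, "", key); sub(/,.*$/, "", key) }
        # Print any matching line, prefixed with the key of the enclosing entry.
        $0 ~ pat { print key ": " $0 }
    ' "$2"
}
```

For example, `bibgrep 'sqlite' library.bib` (library.bib being a placeholder name) would list each hit together with the key of the item it belongs to. The pattern is an awk regular expression, and backslashes in it go through awk's escape processing first.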

I am still interested in a reply to my original question, but do you think this second approach would be smoother (and are there any additional solutions that I have overlooked)?

cheflo · January 31, 2020

Something like this works well for parsing out the relevant fields from the BibLaTeX files created by Better BibTeX:


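#!/usr/bin/env bash
# Usage: pass the .bib file exported by Better BibTeX as the first argument.
# Split the export into one file per entry (xx00, xx01, ...)
csplit -kq "$1" "/^@/-1" "{*}"
rm xx00  # An empty file that is created when splitting

mkdir -p notes
mkdir -p abstracts
for split_file in xx*; do
    # Create new file name from the attachment's base name (path and extension stripped)
    file_name=$(grep "file =" "$split_file")
    file_name="${file_name##*/}"
    file_name="${file_name%.*}"
    # Entries without an attachment have no "file =" line; fall back to the split file's name
    [ -n "$file_name" ] || file_name=$split_file
    # Extract and save the relevant sections
    rg 'annotation = \{.*?\},\n' "$split_file" \
    --multiline --multiline-dotall --no-line-number > "notes/${file_name}.txt"
    rg 'abstract = \{.*?\},\n' "$split_file" \
    --multiline --multiline-dotall --no-line-number > "abstracts/${file_name}.txt"
    rm "$split_file"
done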

Afterwards all files in the created dirs can be searched with grep/rg.
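As a rough illustration of that search step, the snippet below builds two throwaway dirs that stand in for notes/ and abstracts/ (the file contents are made up) and searches them recursively, so each hit is reported with the file it came from.

```shell
# Demo data standing in for the notes/ and abstracts/ dirs (contents are made up).
demo=$(mktemp -d)
mkdir -p "$demo/notes" "$demo/abstracts"
printf '%s\n' 'annotation = {Compare with the sqlite approach},' > "$demo/notes/smith2020.txt"
printf '%s\n' 'abstract = {Nothing relevant here},' > "$demo/abstracts/smith2020.txt"

# -r recurses into the dirs, -i ignores case, -n adds line numbers,
# -C 1 shows one line of context around each match.
hits=$(cd "$demo" && grep -r -i -n -C 1 'SQLite' notes abstracts)
printf '%s\n' "$hits"
```

Because each item now has its own file, the file name at the start of every hit identifies the item, which is what the single .bib export could not provide.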

emilianoeheyns · February 1, 2020

If you just want the notes, BBT also has a "Collected notes" exporter.

emilianoeheyns · February 1, 2020

Export translators can only export a single file each. It's just how Zotero works. I mean *technically* it would be possible to hand-craft a zip file that contains multiple files inside a translator, but that would be madness.

I'll probably try it someday.

cheflo · February 1, 2020

Thanks @emilianoeheyns ! I originally thought I would only use this for notes, but realized it might be useful for grabbing abstracts as well, so it is quite convenient that these are included in the .bib file.

I actually tried Collected notes briefly before, but it generates an error for me (pasted here, https://pastebin.com/JQYM3D4d, expires in a week). I don't need to use it myself but might

Also, THANK YOU for making and maintaining Better BibTeX!! I have just started out with it, but it has already been fantastically useful and enabled some of the features I considered switching to JabRef for.

emilianoeheyns · February 1, 2020

Thanks, fixed. I'll roll it out in a new release when I have feedback on one more open issue.

And you're welcome :)

If you want abstracts + notes, you might also want to look at https://github.com/retorquere/zotero-report-customizer

cheflo · February 1, 2020

Thanks for the link!