Exporting metadata AND (cached) full-text in some format
I've (sort of) asked this before, and we are (sort of) able to do this now, but still: does anybody know of an elegant way to export Zotero libraries/collections so that both the metadata AND the full text end up in a single file (JSON, CSV, ...) that third-party tools can read (for text mining, corpus annotation, machine learning, etc.)? Thanks!
I apologize for asking this; we MAY be able to figure it out ourselves, but it would take us days or even weeks, with no guarantee of success.
Look at the code for retrieving a list of items in a collection, and the code lower down on retrieving item attachments.
If you're willing to use R, you can do it like this:
# install.packages(c("magrittr", "dplyr", "DBI", "RSQLite", "quanteda", "readtext"))
library(magrittr)  # provides the %>% pipe
library(DBI)       # dbConnect(), dbListTables(), dbGetQuery()
library(readtext)  # readtext()
# connect to Zotero's SQLite database
con <- dbConnect(drv = RSQLite::SQLite(),
                 dbname = "~/Zotero/zotero.sqlite")
# get names of all tables in the database
alltables <- dbListTables(con)
# bring the items and itemNotes tables into R
table.items <- dbGetQuery(con, 'select * from items')
table.itemNotes <- dbGetQuery(con, 'select * from itemNotes')
# bring in Zotero fulltext cache plaintext
textDF <- readtext(paste0("~/Zotero/storage", "/*/.zotero-ft-cache"),
                   docvarsfrom = "filepaths")
# isolate "key" (8-character alphanumeric directory in storage/) in docvar1 associated with plaintext
textDF$docvar1 <- gsub(pattern = "^.*storage\\/", replacement = "", x = textDF$docvar1)
textDF$docvar1 <- gsub(pattern = "\\/.*", replacement = "", x = textDF$docvar1)
# bring in itemID (and some other metadata) and that's all
textDF <- textDF %>%
  dplyr::rename(key = docvar1) %>%
  # assumed step: join to the items table on key, since filter() below
  # needs itemID and only table.items provides it
  dplyr::left_join(table.items, by = "key") %>%
  dplyr::filter(!is.na(itemID), !itemID %in% table.itemNotes$itemID)
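To get from there to the single file asked about at the top, something like the following might work. It is only a sketch: attachment_titles() and export_corpus() are hypothetical helper names, and the table and column names (itemAttachments, itemData, itemDataValues, fields) reflect my understanding of the Zotero client schema, which may differ between versions. Note that the cached full text belongs to the attachment item, while the title lives on its parent item, hence the extra joins.

```r
# Hypothetical helper: for each attachment key, look up the title of its parent item.
# Schema names are assumptions about the Zotero client database.
attachment_titles <- function(con) {
  DBI::dbGetQuery(con, "
    SELECT att.key    AS key,
           parent.key AS parentKey,
           v.value    AS title
    FROM itemAttachments ia
    JOIN items att        ON att.itemID = ia.itemID
    JOIN items parent     ON parent.itemID = ia.parentItemID
    JOIN itemData d       ON d.itemID = parent.itemID
    JOIN fields f         ON f.fieldID = d.fieldID AND f.fieldName = 'title'
    JOIN itemDataValues v ON v.valueID = d.valueID
  ")
}

# Hypothetical helper: merge full text (columns key, text) with metadata
# (any columns, keyed by key) and write everything to one CSV file.
export_corpus <- function(text_df, meta_df, path) {
  text_only <- as.data.frame(text_df)[, c("key", "text")]
  # all.x = TRUE keeps texts that have no matching metadata row
  out <- merge(text_only, meta_df, by = "key", all.x = TRUE)
  write.csv(out, path, row.names = FALSE, fileEncoding = "UTF-8")
  invisible(out)
}

# Usage, assuming con and textDF (with a key column) exist as above:
# export_corpus(textDF, attachment_titles(con), "zotero_corpus.csv")
```

If JSON is preferred, jsonlite::write_json() on the merged data frame should be a drop-in replacement for write.csv().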
For a full example, you can see how I used this to calculate tf-idf for each indexed Zotero item and then added new tags based on those results to the database here: https://ntrlshrp.gitlab.io/post/zotfidf/
Hope that is helpful!
The part that's iffy is the re-writing to the database (i.e. the part in the blogpost that starts with if(NEW_TAGS)). That almost certainly breaks the database, at a minimum in the sense that it will lead to unpredictable behavior when syncing, so I'd very much discourage anyone from doing this with a database they want to keep working with.
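For the read-only part, one way to stay on the safe side is to never touch the live file at all. A minimal sketch, assuming the default data directory: copy zotero.sqlite to a temporary file and open the copy with RSQLite's SQLITE_RO flag, so that any accidental write fails instead of altering anything. open_zotero_copy() is a hypothetical helper name, not part of any package.

```r
# Hypothetical helper: copy the Zotero database to a temp file and
# open the copy read-only, leaving the original untouched.
open_zotero_copy <- function(db_path = "~/Zotero/zotero.sqlite") {
  src <- path.expand(db_path)
  stopifnot(file.exists(src))
  tmp <- tempfile(fileext = ".sqlite")
  if (!file.copy(src, tmp)) {
    stop("could not copy ", src)
  }
  # SQLITE_RO opens the connection read-only; write statements will error
  DBI::dbConnect(RSQLite::SQLite(), dbname = tmp, flags = RSQLite::SQLITE_RO)
}

# Usage:
# con <- open_zotero_copy()
# table.items <- DBI::dbGetQuery(con, "select * from items")
# DBI::dbDisconnect(con)
```

Working on a copy should also sidestep "database is locked" errors if Zotero happens to be running, though the copy is only a snapshot from the moment it was taken.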
I can make this work in RStudio until
Which is where I get:
And so for 'dbname =', I entered the path to my zotero.sqlite, which I guess it accepted. But what if I only want to use one group collection? How can I specify that?
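One possible approach, sketched under assumptions about the schema rather than tested against a real library: look up the collection by name within the group, collect the keys of the attachments belonging to its items, and keep only the full-text rows with those keys. collection_attachment_keys() is a hypothetical helper, and the table names used (groups, collections, collectionItems, itemAttachments) are my understanding of the Zotero client database. Subcollections and standalone attachments filed directly in the collection are not covered.

```r
# Hypothetical helper: keys of all child attachments of the items in one
# collection of one group library. Schema names are assumptions.
collection_attachment_keys <- function(con, group_name, collection_name) {
  DBI::dbGetQuery(con, '
    SELECT att.key AS key, parent.itemID AS parentItemID
    FROM "groups" g
    JOIN collections c      ON c.libraryID = g.libraryID
    JOIN collectionItems ci ON ci.collectionID = c.collectionID
    JOIN items parent       ON parent.itemID = ci.itemID
    JOIN itemAttachments ia ON ia.parentItemID = parent.itemID
    JOIN items att          ON att.itemID = ia.itemID
    WHERE g.name = ? AND c.collectionName = ?
  ', params = list(group_name, collection_name))
}

# Usage, with placeholder names for the group and the collection:
# keys <- collection_attachment_keys(con, "My Group", "My Collection")
# textDF <- textDF[textDF$key %in% keys$key, ]
```

The filter on textDF assumes the storage directory name has already been renamed to key, as in the R snippet earlier in the thread.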