Exporting metadata AND (cached) full-text in some format

sdspieg · January 22, 2019

I've (sort of) asked this before and we are (sort of) able to do this now, but still: does anybody know of an elegant way to export Zotero libraries/collections in a way that includes both the metadata AND the full text in a single file (JSON, csv, ...) that can then be read by third-party tools (for textmining, corpus annotation, machine learning purposes, etc.)? Thanks!

sdspieg · January 23, 2019

Nobody? We are really struggling with this...

bwiernik · January 23, 2019

The Zotero Beta now has a way to pass arbitrary JavaScript code to Zotero so that you can access the JavaScript API. You can use that API to access a collection, its items, and their full text content.

sdspieg · January 23, 2019

Thanks. I'm happy to hear it is possible, even though I still have no idea how exactly to do it. Could you provide an example? Or even refer us to some documentation on how to do this? Please also note that other people (whom I don't know personally) have approached me about this, also on the forum, so we're not the only ones who'd love to find out how to export our collections from Zotero and then import them into tools like Voyant or ITMS or the new Thresher Quickcode or...
I apologize for asking this, and we MAY be able to figure this out. But it would take us days or even weeks. With no guarantee of success.

bwiernik · January 23, 2019

The JS API is partly described here: https://www.zotero.org/support/dev/client_coding/javascript_api

Look at the code for retrieving a list of items in a collection, and the code lower down on retrieving item attachments.

sdspieg · May 27, 2019

If anybody else is interested in this and has some JS coding skills: please do contact me. We are still extremely eager to find an elegant way to export both the metadata AND the (cached) full text from Zotero libraries/collections into a format that can be processed by textmining tools. Thanks!

dstillman · May 28, 2019

By full text, you mean the contents of the .zotero-ft-cache file?

sdspieg · May 28, 2019

Yes I do

ntrlshrp · March 10, 2020

@sdspieg, I'm not sure if you're still seeking a solution 10 months later, but to future visitors who want to "export both the metadata AND the (cached) full text from Zotero libraries/collections into a format that can be processed by textmining tools":

If you're willing to use R, you can do it like this:

#####
# install.packages(c("magrittr", "dplyr", "DBI", "RSQLite", "quanteda", "readtext"))
# (dplyr is called below via dplyr::, so it must be installed as well)
library(magrittr)
library(DBI)
library(RSQLite)
library(quanteda)
library(readtext)

# connect to Zotero's SQLite database
con <- dbConnect(drv = RSQLite::SQLite(),
                 dbname = "~/Zotero/zotero.sqlite")

# get names of all tables in the database
alltables <- dbListTables(con)

# bring the items and itemNotes tables into R
table.items <- dbGetQuery(con, 'select * from items')
table.itemNotes <- dbGetQuery(con, 'select * from itemNotes')

# bring in Zotero fulltext cache plaintext
textDF <- readtext(paste0("~/Zotero/storage", "/*/.zotero-ft-cache"),
                   docvarsfrom = "filepaths")

# isolate "key" (8-character alphanumeric directory in storage/) in docvar1 associated with plaintext
textDF$docvar1 <- gsub(pattern = "^.*storage\\/", replacement = "", x = textDF$docvar1)
textDF$docvar1 <- gsub(pattern = "\\/.*", replacement = "", x = textDF$docvar1)

# bring in itemID (and some other metadata) and that's all
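textDF <- textDF %>%
  dplyr::rename(key = docvar1) %>%
  # join on the shared "key" column (stated explicitly to avoid the join message)
  dplyr::left_join(table.items, by = "key") %>%
  # drop cache files with no matching item, and drop notes
  dplyr::filter(!is.na(itemID), !itemID %in% table.itemNotes$itemID)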
#####

For a full example, you can see how I used this to calculate tf-idf for each indexed Zotero item and then added new tags based on those results to the database here: https://ntrlshrp.gitlab.io/post/zotfidf/

*** NB: This directly accesses Zotero's local SQLite database, which is considered programmatically more brittle / fragile than working with the local JavaScript API (see https://www.zotero.org/support/dev/client_coding/javascript_api).
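
For readers who would rather not use R, here is a rough Python sketch of the same idea, writing one JSON object per cached full-text file into a single file. It is only a sketch under assumptions: the function name export_fulltext and the output layout are made up for illustration, and it relies on the same things the R code relies on (a zotero.sqlite file next to a storage/ folder, and an items table with key and itemID columns). Like the R version, it only attaches itemID as metadata.

```python
import json
import sqlite3
from pathlib import Path


def export_fulltext(data_dir, out_path):
    """Write one JSON line per cached full-text file; return how many were written.

    Hypothetical helper: data_dir is assumed to contain zotero.sqlite and storage/.
    """
    data_dir = Path(data_dir).expanduser()
    con = sqlite3.connect(data_dir / "zotero.sqlite")
    try:
        # Map each storage key to its itemID (the R code joins on the same column).
        item_ids = dict(con.execute("SELECT key, itemID FROM items"))
    finally:
        con.close()

    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for cache in sorted(data_dir.glob("storage/*/.zotero-ft-cache")):
            key = cache.parent.name  # the 8-character folder name in storage/
            if key not in item_ids:
                continue  # no matching row in items, so skip this cache file
            record = {
                "key": key,
                "itemID": item_ids[key],
                "text": cache.read_text(encoding="utf-8", errors="replace"),
            }
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count


# Example call (paths are assumptions, adjust to your own data directory):
# export_fulltext("~/Zotero", "zotero_fulltext.jsonl")
```

The JSON-lines layout (one record per line) is just one choice that most text-mining tools can read; a CSV writer would work the same way.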

Hope that is helpful!

adamsmith · March 10, 2020

This is cool. To be clear, the code posted above is completely fine to use. There is no issue with reading from the Zotero database (you'll want Zotero closed while running the code so it's not locked). It might stop working if the database structure changes, but I'd actually expect it to be fairly stable (and easy to fix if it does). In any case, worst case is that it doesn't run.

The part that's iffy is the re-writing to the database (i.e. the part in the blogpost that starts with if(NEW_TAGS)). That almost certainly breaks the database, at a minimum in the sense that it will lead to unpredictable behavior when syncing, so I'd very much discourage anyone from doing this with a database they want to keep working with.

sdspieg · March 15, 2020

@ntrlshrp - thanks much. We're looking into this.

ntrlshrp · March 22, 2020

Thanks, @adamsmith, for your kudos and for reminding me of the read-only recommendation here: https://www.zotero.org/support/dev/client_coding/direct_sqlite_database_access. I've put that front and center on the post to warn readers.
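
One way to make the read-only recommendation hard to violate by accident is to open the database in read-only mode. A minimal Python sketch, assuming only the standard sqlite3 module (the helper name open_read_only is made up for illustration):

```python
import sqlite3
from pathlib import Path


def open_read_only(db_path):
    """Open a SQLite file so that any attempt to write raises an error."""
    # SQLite's URI syntax accepts mode=ro to request a read-only connection.
    uri = Path(db_path).expanduser().resolve().as_uri() + "?mode=ro"
    return sqlite3.connect(uri, uri=True)


# Example call (the path is an assumption, adjust to your own data directory):
# con = open_read_only("~/Zotero/zotero.sqlite")
```

With a connection opened this way, an INSERT or UPDATE fails with sqlite3.OperationalError instead of changing the file. As noted above, it is still a good idea to close Zotero first so the database is not locked.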

sdspieg · March 23, 2020

I can make this work in RStudio until

> textDF <- readtext(paste0("~/Zotero/storage", "/*/.zotero-ft-cache"),
+ docvarsfrom = "filepaths")

Which is where I get:

Error in list_files(file, ignore_missing, TRUE, verbosity) :
File '' does not exist

And so for 'dbname =', I entered the path to my zotero.sqlite. Which I guess it accepted. But what if I only want to use one group collection? How can I specify that?
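
A possible reading of that error (an assumption, the thread does not confirm it) is that the wildcard path matched no files, for example because the Zotero data directory is not at ~/Zotero on this machine. A small Python sketch for checking what the pattern actually matches; the helper name find_cache_files is made up for illustration:

```python
from pathlib import Path


def find_cache_files(data_dir):
    """List every .zotero-ft-cache file under <data_dir>/storage."""
    storage = Path(data_dir).expanduser() / "storage"
    if not storage.is_dir():
        # Fail with a clear message instead of silently matching nothing.
        raise FileNotFoundError(f"No storage folder at {storage}")
    return sorted(storage.glob("*/.zotero-ft-cache"))


# Example call (the path is an assumption; use the folder that holds your zotero.sqlite):
# print(len(find_cache_files("~/Zotero")))
```

If this raises or returns an empty list, the storage path passed to readtext probably needs to point at the same folder that was used for 'dbname ='.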