Exporting items with text extracted from pdf as single xml-files

sdspieg · June 3, 2018

We are still experimenting with new ways to apply various textmining tools to entire Zotero libraries that also contain full text. So far, we have been doing that by exporting these libraries as RDF files with the options to export notes and files enabled. That creates a folder with a single RDF file holding all of the metadata (including a reference to the various attachments), plus a bunch of subfolders that contain these attachments (the individual PDF files). Our textmining pipeline (the Integrated TextMining Suite) currently runs pdf2text and various linguistic pre-processing tools on all of these PDFs, and then lets users run a variety of textmining tools on the result. So this works, but it is not very efficient, especially since Zotero presumably already indexes these PDF attachments.

So could somebody please offer us a suggestion for how we might be able to export a Zotero library in a format that would still provide a reference to the actual PDF, but would also already contain the underlying text WITHIN the (machine-readable) file format?

dstillman · June 3, 2018

The extracted text isn't available for export, but it's present in .zotero-ft-cache files in the 'storage' directories. You might be able to modify an export translator to include the 8-character item key, which would let you access the appropriate file in a script. (I'm not positive if the item key is available to export translators, but I think it is.)
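For anyone who wants to try the script side of this, here is a minimal sketch in Python. It assumes that each subfolder of 'storage' is named after the 8-character key of the item whose file it holds, and that the cache file is plain text that can be read as UTF-8. The function name read_cached_text and the example path are made up for illustration, not part of Zotero.

```python
from pathlib import Path
from typing import Optional

# File name mentioned in the post above.
CACHE_NAME = ".zotero-ft-cache"


def read_cached_text(storage_dir, item_key: str) -> Optional[str]:
    """Return the cached full text stored under storage_dir/<item_key>/.

    Returns None when that folder has no cache file (for example because
    the attachment was never indexed).
    """
    cache_file = Path(storage_dir) / item_key / CACHE_NAME
    if not cache_file.is_file():
        return None
    # Assumption: the cache is UTF-8 text; undecodable bytes are replaced
    # rather than raising an error.
    return cache_file.read_text(encoding="utf-8", errors="replace")


if __name__ == "__main__":
    # Hypothetical location and key; adjust to your own data directory.
    text = read_cached_text(Path.home() / "Zotero" / "storage", "ABCD2345")
    print("no cache found" if text is None else text[:200])
```

The key passed in would come from whatever the modified export translator writes into the export file.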

sdspieg · June 3, 2018

Thanks Dan! We'll look into it. And if we manage to find a solution, we'll obviously share it. If anybody else reading this with experience with export translators would be able to lend a helping hand, that would be great, of course. To be continued...

adamsmith · June 3, 2018

The item key is available to translators (see CSV.js). I'm not 100% sure about the key for attachments (though I'd assume so), which is what you'd need to get the storage directories.

y.sapolovych · June 26, 2018

There may be a more brutal and straightforward way while we can't modify the translator - just to grab these cache files directly out of our storage.
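A rough sketch of that brute-force approach, in Python. It assumes the layout described earlier in the thread (one subfolder per key under 'storage', each possibly holding a .zotero-ft-cache file) and that the cache files are UTF-8 text. The function name collect_cached_text and the example path are ours, not anything Zotero provides.

```python
from pathlib import Path
from typing import Dict


def collect_cached_text(storage_dir) -> Dict[str, str]:
    """Map each storage subfolder name (the key) to its cached full text.

    Subfolders without a .zotero-ft-cache file are skipped.
    """
    texts: Dict[str, str] = {}
    for folder in sorted(Path(storage_dir).iterdir()):
        cache_file = folder / ".zotero-ft-cache"
        if folder.is_dir() and cache_file.is_file():
            # Assumption: UTF-8 text; undecodable bytes are replaced.
            texts[folder.name] = cache_file.read_text(
                encoding="utf-8", errors="replace"
            )
    return texts


if __name__ == "__main__":
    # Hypothetical path to the Zotero data directory.
    all_texts = collect_cached_text(Path.home() / "Zotero" / "storage")
    print(f"{len(all_texts)} cache files found")
```

This returns everything in storage, so it does not yet solve the problem of restricting the result to one collection.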

But so as not to deal with thousands of them, we would want to identify only the subfolders with PDFs that belong to a certain collection. Is it possible (and that might be a question for another topic) to do that in a non-manual way (i.e. not via Item->Show file for each item)?

adamsmith · June 26, 2018

Not without scripting either within local Zotero or via the server API, no.
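To illustrate the server API route, here is a hedged sketch in Python using only the standard library. It assumes the Zotero Web API (version 3) endpoint /users/<userID>/collections/<collectionKey>/items returns a JSON array of items, child attachments included, each with a top-level "key" and a "data" object carrying "itemType" and "contentType". It also assumes results are paged with the limit and start parameters. The function names and the placeholder IDs are made up for illustration; group libraries would use a /groups/<groupID> prefix instead.

```python
import json
import urllib.request
from typing import Iterable, List

API_ROOT = "https://api.zotero.org"


def fetch_collection_items(user_id: str, collection_key: str, api_key: str) -> List[dict]:
    """Download all items of one collection, page by page (network access)."""
    items: List[dict] = []
    start, page_size = 0, 100
    while True:
        url = (
            f"{API_ROOT}/users/{user_id}/collections/{collection_key}/items"
            f"?format=json&limit={page_size}&start={start}"
        )
        request = urllib.request.Request(
            url,
            headers={"Zotero-API-Key": api_key, "Zotero-API-Version": "3"},
        )
        with urllib.request.urlopen(request) as response:
            page = json.load(response)
        items.extend(page)
        if len(page) < page_size:  # a short page means we reached the end
            return items
        start += page_size


def pdf_attachment_keys(items: Iterable[dict]) -> List[str]:
    """Keep only the keys of PDF attachments from a list of API items."""
    keys = []
    for item in items:
        data = item.get("data", {})
        if (
            data.get("itemType") == "attachment"
            and data.get("contentType") == "application/pdf"
        ):
            keys.append(item["key"])
    return keys


if __name__ == "__main__":
    # Placeholder values; substitute your own user ID, collection key, API key.
    found = fetch_collection_items("0000000", "COLLKEY1", "your-api-key")
    print(pdf_attachment_keys(found))
```

The resulting keys should correspond to the names of the 'storage' subfolders, so they can be used to pick out just the cache files for that collection.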