Exporting items with text extracted from PDFs as single XML files
We are still experimenting with new ways to apply various text-mining tools to entire Zotero libraries that also contain full text. So far, we have been doing that by exporting these libraries as RDF files with the export notes and export files options enabled. That creates a folder with a single RDF file holding all of the metadata (including references to the attachments), plus a number of subfolders that contain the attachments themselves (the individual PDF files). Our text-mining pipeline (the Integrated TextMining Suite) currently runs pdf2text and various linguistic pre-processing tools on all of these PDFs, and then lets users run a variety of text-mining tools on the result. So this works, but it is not very efficient, especially since Zotero presumably already indexes these PDF attachments.
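To make the export layout concrete, here is a minimal Python sketch of how such a folder could be indexed before any text extraction runs. It assumes only what is described above: one RDF file at the top level and PDFs somewhere in subfolders. The function name `index_export` is hypothetical, and nothing here parses the RDF itself.

```python
from pathlib import Path


def index_export(export_dir):
    """Return (rdf_path, {subfolder_name: [pdf paths]}) for one export folder."""
    root = Path(export_dir)
    # Assumption: the metadata lives in a single top-level .rdf file;
    # if there are several, the first one in sorted order is used.
    rdf_files = sorted(root.glob("*.rdf"))
    rdf_path = rdf_files[0] if rdf_files else None

    pdfs_by_folder = {}
    # Collect PDFs anywhere below the export folder, grouped by the
    # name of the subfolder that directly contains them.
    for pdf in sorted(root.rglob("*.pdf")):
        pdfs_by_folder.setdefault(pdf.parent.name, []).append(pdf)
    return rdf_path, pdfs_by_folder
```

Each PDF path returned this way is what a pdf2text step would then have to process one by one, which is the part we would like to avoid.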
So could somebody please offer us a suggestion for how we might be able to export a Zotero library in a format that would still provide a reference to the actual PDF, but would also already contain the underlying text WITHIN the (machine-readable) file format?
But so as not to have to deal with thousands of them, we would want to identify the subfolders with PDFs connected to a certain collection. Is it possible (and this might be a question for another topic) to do that in a non-manual way (i.e. without going through Item -> Show File for each item)?
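A minimal sketch of one possible non-manual route, in Python, under loudly stated assumptions: that the library's SQLite database has `collections`, `collectionItems`, `items` and `itemAttachments` tables with the columns used below, and that each stored attachment sits in a subfolder named after its item key. Both should be checked against your own Zotero installation. The function name `storage_keys_for_collection` is hypothetical.

```python
import sqlite3

# Assumption: table and column names match the local Zotero database.
QUERY = """
SELECT DISTINCT att.key
FROM collections AS c
JOIN collectionItems AS ci ON ci.collectionID = c.collectionID
JOIN itemAttachments AS ia
     ON ia.parentItemID = ci.itemID OR ia.itemID = ci.itemID
JOIN items AS att ON att.itemID = ia.itemID
WHERE c.collectionName = ?
  AND ia.contentType = 'application/pdf'
ORDER BY att.key
"""


def storage_keys_for_collection(conn, collection_name):
    """Return keys of PDF attachments in a collection (assumed folder names)."""
    # Matches child attachments of items in the collection as well as
    # standalone attachments filed in the collection directly.
    return [row[0] for row in conn.execute(QUERY, (collection_name,))]
```

It would be safest to run this against a copy of the database rather than the live file, for example via `sqlite3.connect("file:/path/to/copy.sqlite?mode=ro", uri=True)`, where the path is a placeholder. Note that the query matches collections by name only and does not descend into subcollections.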