Exporting records from Zotero and importing them into Quid
Has anybody ever succeeded in doing this (see quid.com)? Apparently, the only way to get data in is through CSV (see http://resources.quid.com/guide/uploading-your-own-data/). If anybody has any idea how to do this, I'd be very grateful. I'd also welcome pointers to any other papermachines-like graph/visualization tools that could be hooked up to a Zotero library (with PDFs). Papermachines was great, but is no longer developed (to the best of my knowledge), and I can't get it to work any more :( More importantly, all of this stuff is progressing so quickly now, with things like spaCy, TensorFlow, etc., that I'm surprised nobody has built on the work that Cora has done to add some more functionality to it...
Please expand further on your comment ... By Cora are you referring to Common Reference Architecture?
My main point was that papermachines may still be a nice 'base' to build upon. There are quite a few new open source tools out there (like the ones I mentioned) that might really raise our ability to 'analyze' large text corpora to a new level. If anybody is working on this, I'd certainly be interested to find out. I was unaware of things like spaCy and Parsey McParface until recently, but they seem to be making the previous suite of cutting-edge NLP tools that papermachines was based on obsolete, especially if you also throw in TensorFlow and deep learning in general.
Still, in the shorter term, I hope that somebody will be able to help me convert Zotero SQLite libraries into text-only CSV files that I could then import into Quid.
I discussed using Zotero to export an NLP training corpus here:
https://forums.zotero.org/discussion/comment/263806#Comment_263806
The custom-built "Hierarchical JSON" translator in that thread might be a good starting point.
For example, you might run its output through a JSON-to-CSV script to make the export format compatible with Quid.
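A minimal sketch of such a JSON-to-CSV step, in Python. I am assuming a nested layout here (collections holding "items" and child "collections", with "title", "tags" and "notes" on each item); those key names and the sample data are hypothetical, so adjust them to whatever the translator really emits.

```python
import csv
import io
import json

# Hypothetical sample; the real translator output may use different keys.
SAMPLE = """
{
  "collections": [
    {
      "name": "NLP",
      "items": [
        {"title": "Paper A", "tags": ["spaCy", "parsing"], "notes": ["read this"]}
      ],
      "collections": [
        {
          "name": "Deep learning",
          "items": [
            {"title": "Paper B", "tags": ["TensorFlow"], "notes": []}
          ],
          "collections": []
        }
      ]
    }
  ]
}
"""

FIELDS = ["collection", "title", "tags", "notes"]


def flatten(collections, parent_path=""):
    """Yield one flat dict per item, walking nested collections recursively."""
    for coll in collections:
        path = f"{parent_path}/{coll['name']}" if parent_path else coll["name"]
        for item in coll.get("items", []):
            yield {
                "collection": path,
                "title": item.get("title", ""),
                # Lists are joined so that each item stays on a single CSV row.
                "tags": "; ".join(item.get("tags", [])),
                "notes": " | ".join(item.get("notes", [])),
            }
        yield from flatten(coll.get("collections", []), path)


def to_csv(data):
    """Return CSV text with a header row and one row per item."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(flatten(data.get("collections", [])))
    return buf.getvalue()


data = json.loads(SAMPLE)
rows = list(flatten(data["collections"]))
csv_text = to_csv(data)
```

The collection path is kept as its own column so that the hierarchy is not lost when the data is flattened.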
I would hazard a guess that Quid is based on IBM Watson or similar. In a brief search of the quid.com site I couldn't find any clues to the underlying NLP platform. Only this ... And apparently Quid uses WebGL to visualise nodes.
Now, ideally, I'm looking for a platform where a private knowledge base can be analysed. In Quid and Watson you are required to add to the common knowledge base. But nevertheless Watson is interesting.
Does anybody know a way to get the export WITH the entire PDF-extracted body of the text nicely placed in one column?
I looked in the /zotero/translators/ folder, opened my amended version of Hierarchical JSON in the Geany editor, and see that I added some extra options under "displayOptions":
"exportNotes": true,
"exportTags": true,
"exportFileData": false
{
	"translatorID": "0a1250df-1678-4b09-88ee-ce5b7578d62a",
	"label": "Hierarchical JSON",
	"creator": "Laurence Diver",
	"target": "json",
	"minVersion": "4.0",
	"maxVersion": "",
	"priority": 50,
	"configOptions": {
		"getCollections": true
	},
	"displayOptions": {
		"exportNotes": true,
		"exportTags": true,
		"exportFileData": false
	},
	"inRepository": false,
	"translatorType": 2,
	"lastUpdated": "2016-10-17 12:00:00"
}
But I need to reflect on the experimental code (in the body of the script) that extracts notes and tags for visualisation (I am targeting D3.js for node visualisation).
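To make the target concrete, here is a rough Python sketch of turning items and their tags into the nodes/links shape that many D3 force-layout examples consume. This is not the translator's code; the input records, the "item:"/"tag:" id prefixes and the function name are my own assumptions for illustration.

```python
import json


def build_graph(items):
    """Build a node-link dict: one node per item and per tag, one link per tagging."""
    nodes, links, seen = [], [], set()

    def add_node(node_id, kind):
        # Tags shared by several items must appear only once as a node.
        if node_id not in seen:
            seen.add(node_id)
            nodes.append({"id": node_id, "type": kind})

    for item in items:
        item_id = "item:" + item["title"]
        add_node(item_id, "item")
        for tag in item.get("tags", []):
            tag_id = "tag:" + tag
            add_node(tag_id, "tag")
            links.append({"source": item_id, "target": tag_id})
    return {"nodes": nodes, "links": links}


# Hypothetical input records, standing in for exported Zotero items.
items = [
    {"title": "Paper A", "tags": ["NLP", "spaCy"]},
    {"title": "Paper B", "tags": ["NLP"]},
]
graph = build_graph(items)
graph_json = json.dumps(graph, indent=2)
```

Items that share a tag end up linked through the same tag node, which is what lets clusters show up in the visualisation.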
The key is to draw on methods in the Zotero JavaScript API, not SQLite as you wrote earlier (unless I'm missing a key point here about metadata saved in SQLite).
Cloud services such as IBM Watson can read PDF text from PDF URLs, so do you need to extract the PDF text yourself?
http://okfnlabs.org/blog/2016/04/19/pdf-tools-extract-text-and-data-from-pdfs.html
http://www.unixuser.org/~euske/python/pdfminer/#source
It should be feasible to right-click on the PDF, get its path, run pdf2txt.py on it, and then save the extracted text into a note.
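The extraction half of that could look roughly like the Python sketch below. It assumes pdfminer's pdf2txt.py is on the PATH and uses its -o option for the output file; the helper names are hypothetical, and saving the text into a Zotero note is left out because that would go through the Zotero JavaScript API.

```python
import shutil
import subprocess
from pathlib import Path


def build_command(pdf_path, txt_path, tool="pdf2txt.py"):
    """Return the command line that writes the PDF's text to txt_path."""
    return [tool, "-o", str(txt_path), str(pdf_path)]


def extract_text(pdf_path, tool="pdf2txt.py"):
    """Run the extraction tool and return the extracted text as a string."""
    pdf_path = Path(pdf_path)
    txt_path = pdf_path.with_suffix(".txt")
    if shutil.which(tool) is None:
        # Fail early with a clear message instead of a cryptic OSError.
        raise FileNotFoundError(f"{tool} not found on PATH")
    subprocess.run(build_command(pdf_path, txt_path, tool), check=True)
    return txt_path.read_text(encoding="utf-8", errors="replace")
```

With a real attachment path from Zotero, extract_text(path) would return the string to store in the note.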
I repeat that IBM Watson (a similar offering to Quid) only requires the URLs of the PDF files. No preprocessing is required client side.
I looked in
~/.mozilla/firefox/[name_of_profile]/zotero/papermachines/processors
and, using SearchMonkey (an Ubuntu tool), simply searched for “pdf”.
Two files were found:
d3.layout.cloud.js
extract.py
In extract.py
Line Number: 48
Extract text from PDF or HTML files
Line Number: 55
self.pdftotext = self.extra_args[0]
Line Number: 67
if not os.path.exists(self.pdftotext):
Line Number: 68
logging.error('pdftotext not found!')
Line Number: 108
elif fname.endswith('.pdf'):
Line Number: 110
self.pdftotext,
In d3.layout.cloud.js
Line Number: 2
// Algorithm due to Jonathan Feinberg, http://static.mrfeinberg.com/bv_ch03.pdf
It is encouraging to see that d3.js was used for the tag cloud visualisation (WebGL is used in Quid).
Diving deeper into extract.py, the class for text extraction draws on
tikaPath = os.path.join(self.cwd, 'lib', 'tika-app-1.2.jar')
So Apache Tika (Java) is used for text extraction.
Tika is listed in the link I posted earlier.
http://okfnlabs.org/blog/2016/04/19/pdf-tools-extract-text-and-data-from-pdfs.html
I did try to get hold of the Quid API for such experiments. I might be wrong, but I had the impression that Quid.com prefers to deal directly with enterprises. I found no public information on their API. If you have a link to an open Quid.com API and it is not confidential, I can look at it.
But I would add that CSV export alone will not cut it for my needs, since node visualisation in D3.js typically works from JSON. In my view it would be better to export just the URLs of the text documents (text extracted from the PDF URLs) rather than putting large blobs of extracted text into columns.
So I'm now looking at the flow of data between Zotero and more open source text analytics tools that have open APIs in Python et al.
Also, I think it is safer not to depend on a single A.I. tools vendor but to create an abstracted, open API that allows A.I. tools from multiple vendors to be tested.
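As a rough illustration of what I mean by an abstracted API, here is a Python sketch with one small interface and interchangeable backends. Everything in it is hypothetical: the TextAnalyzer interface, the two toy backends and the compare helper are stand-ins, and a real backend would wrap a vendor service or a library such as spaCy behind the same method.

```python
import re
from collections import Counter
from typing import Dict, List, Protocol


class TextAnalyzer(Protocol):
    """Hypothetical common interface that every backend must satisfy."""

    def keywords(self, text: str, top_n: int) -> List[str]:
        ...


def _words(text: str) -> List[str]:
    # Lowercased alphabetic tokens longer than three letters.
    return [w for w in re.findall(r"[a-z]+", text.lower()) if len(w) > 3]


class FrequencyBackend:
    """Toy backend: the most frequent words."""

    def keywords(self, text: str, top_n: int) -> List[str]:
        return [w for w, _ in Counter(_words(text)).most_common(top_n)]


class LongestWordBackend:
    """Toy backend: the longest distinct words, ties broken alphabetically."""

    def keywords(self, text: str, top_n: int) -> List[str]:
        distinct = sorted(set(_words(text)), key=lambda w: (-len(w), w))
        return distinct[:top_n]


def compare(backends: Dict[str, TextAnalyzer], text: str, top_n: int = 2) -> Dict[str, List[str]]:
    """Run the same text through every backend so results can be set side by side."""
    return {name: backend.keywords(text, top_n) for name, backend in backends.items()}


results = compare(
    {"frequency": FrequencyBackend(), "longest": LongestWordBackend()},
    "Zotero exports text; text mining tools analyse text from Zotero.",
)
```

The calling code only ever sees the keywords method, so swapping one vendor for another, or for a private local instance, means changing only the dictionary passed to compare.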
The other problem I see is that not every A.I. tools vendor allows a private instance of their technology for analysing enterprise confidential corpora. Not every enterprise wishes to share their data for others to benefit from training their common models.
...
One point I find in this forum is that development topics such as text analytics, A.I., visualisation become buried in the body of general discussions. Might it make sense to organise the forum into sub-forums to discuss such matters of interest?
And yes - the interests of the two of us are really quite similar. I am currently working with a team in the US and one in Australia (both academic researchers) to hook Zotero up to various open source text mining tools in R and/or Python. One of the (R-based) systems is fully set up now and we're doing some alpha testing. We will hopefully be able to post something soon. The other (Python-based) one uses (among other things) spaCy and TensorFlow.
And yes, I also think we should have a separate sub-forum for text analytics (including viz, AI/deep learning, etc.). Sebastian - whom should we recommend this to?
But so back to Sebastian's answer - ok, so be it...