Exporting records from Zotero and Importing them into Quid

sdspieg · December 3, 2016

Has anybody ever succeeded in doing this (see quid.com)? Apparently, the only way to do this is through csv (see http://resources.quid.com/guide/uploading-your-own-data/). If anybody has any idea how to do this, I'd be very grateful. Oh, and also for pointers to any other papermachines-like graph/visualization tools that could be hooked up to a Zotero library (with pdfs). Papermachines was great, but is no longer developed (to the best of my knowledge). I also can't get it to work any more :( But more importantly: all of this stuff is progressing so quickly now - with things like spaCy, TensorFlow, etc. So I'm surprised that nobody has built on the work that Cora has done to add some more functionality to it...

dragonfly · December 3, 2016

Thanks for your links above .. they add to my own research into visualisation .. and, like you, I have dropped experimenting with papermachines.

Please expand further on your comment ...

"So I'm surprised that nobody has built on the work that Cora has done to add some more functionality to it..."

By Cora are you referring to Common Reference Architecture?

sdspieg · December 3, 2016

:) No. I was referring to Cora Johnson-Roberson, who created papermachines.

But my main point was that papermachines may still be a nice 'base' to build upon. There are quite a few new open source tools out there (like the ones I mentioned) that might really raise our ability to 'analyze' large text corpora to a new level. And so if anybody is working on this, I'd certainly be interested to find out. I was unaware of things like spaCy and Parsey McParface until recently, but they seem to be making the previous suite of cutting-edge NLP tools that papermachines was based on obsolete (especially if you also throw in TensorFlow and deep learning in general).

Still - in the shorter term - I hope that somebody will be able to help me with converting Zotero sqlite libraries into text-only csv files that I could then import into Quid.

dragonfly · December 3, 2016

Thanks for the clarification.

I discussed here ..

https://forums.zotero.org/discussion/comment/263806#Comment_263806

using Zotero to export an NLP training corpus.

The "Hierarchical JSON" custom built translator in that thread might be a good starting point.
e.g. you might run the output through a JSON to CSV script to make the export format compatible with Quid.
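
A minimal sketch of such a JSON-to-CSV step, assuming (hypothetically) that the export is a flat JSON list of item objects with fields such as "title" and "tags". The real Hierarchical JSON output may nest items under collections, so the field names and layout here are illustrative only.

```python
import csv
import io
import json


def json_items_to_csv(json_text, columns):
    """Convert a JSON array of item objects into CSV text.

    Assumption: the input is a flat list of objects. Missing fields
    become empty cells; list values (e.g. tags) are joined into one cell.
    """
    items = json.loads(json_text)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns)
    writer.writeheader()
    for item in items:
        row = {}
        for column in columns:
            value = item.get(column, "")
            if isinstance(value, list):
                value = "; ".join(str(v) for v in value)
            row[column] = value
        writer.writerow(row)
    return buffer.getvalue()
```

The csv module takes care of quoting, so titles or notes containing commas and line breaks still land in a single cell.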

I would hazard a guess that Quid is based on IBM Watson or similar. In a brief search of the quid.com site I couldn't find any clues to the underlying NLP platform used. Only this ...

"We’re building something that’s never been built before. Tackling technical problems on the frontier of intelligence. Creating a platform that collects, organizes and interprets all of the world’s human knowledge. We’re creating cutting-edge algorithms, analytical platforms, scalable infrastructure and high-performance visual frameworks to bring you powerful insights you won’t find anywhere else."

And apparently Quid uses WebGL for visualisation of nodes.

Now, ideally, I'm looking for a platform where a private knowledge base can be analysed. In Quid and Watson you are required to add to the common knowledge base. But nevertheless Watson is interesting.

sdspieg · December 4, 2016

Thanks dragonfly! I tried, and it did work - but it doesn't include the actual body of the text. Only a reference to where the pdf is located (e.g. http://zotero.org/users/NNNNN/items/3PFSW849). So it's the same as the regular CSV-export option.
Does anybody know a way to get it WITH the entire pdf-extracted body of the text nicely placed in one column?

dragonfly · December 4, 2016

I'm not running that experiment just now so I need to refresh my memory.
I looked in the /zotero/translators/ folder to view Hierarchical JSON (my amended version) in the Geany editor and see that I added some extra options under "displayOptions":
"exportNotes": true,
"exportTags": true,
"exportFileData": false


{
	"translatorID": "0a1250df-1678-4b09-88ee-ce5b7578d62a",
	"label": "Hierarchical JSON",
	"creator": "Laurence Diver",
	"target": "json",
	"minVersion": "4.0",
	"maxVersion": "",
	"priority": 50,
	"configOptions": {
		"getCollections": true
	},
	"displayOptions": {
		"exportNotes": true,
		"exportTags": true,
		"exportFileData": false
	},
	"inRepository": false,
	"translatorType": 2,
	"lastUpdated": "2016-10-17 12:00:00"
}

But I need to reflect on the experimental code used (in the body of the script) for extracting notes and tags for visualisation (I am targeting D3.js for node visualisation).
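
As a rough illustration of what a tag-based node export could look like, here is a minimal sketch that turns items into the nodes/links structure commonly fed to D3 force layouts. The item fields ("title", "tags") and the "group" attribute are assumptions for the example, not the translator's actual output.

```python
import json


def items_to_graph(items):
    """Build a nodes/links dict linking each item to each of its tags.

    Assumption: every item is a dict with a "title" and an optional
    list of "tags". Tag nodes are prefixed so they cannot collide
    with item titles.
    """
    nodes = {}
    links = []
    for item in items:
        title = item["title"]
        nodes.setdefault(title, {"id": title, "group": "item"})
        for tag in item.get("tags", []):
            tag_id = f"tag:{tag}"
            nodes.setdefault(tag_id, {"id": tag_id, "group": "tag"})
            links.append({"source": title, "target": tag_id})
    return {"nodes": list(nodes.values()), "links": links}


def graph_to_json(items):
    """Serialise the graph so a D3 script can load it."""
    return json.dumps(items_to_graph(items), indent=2)
```

Items that share a tag end up connected through the same tag node, which is what makes clusters visible in a node visualisation.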

The key is to draw on methods in Zotero Javascript API .. not sqlite as you wrote earlier (unless I'm missing a key point here about metadata saved in sqlite).

"I still hope that somebody will be able to help me with converting Zotero sqlite-libraries into text-only csv files that I could then import in Quid."

dragonfly · December 4, 2016

P.S. .. some afterthoughts ..

"Does anybody know a way to get it WITH the entire pdf-extracted body of the text nicely placed in one column?"

Are you not using zotfile to extract annotated text from pdf files and place as notes in your library before applying the export translator to your library?

Cloud services such as IBM Watson can read pdf text from pdf URLs, so do you need to extract the pdf text yourself?

sdspieg · December 5, 2016

I do have to extract the text (and I mean the actual pdf-extracted text of the document and not the annotations or highlights in the pdf) and get it in a column in that CSV file in order for Quid to process it. Just like we had to do that in Papermachines ('extract text') before running the textmining tools. If I had time, I'd look into Cora's source code for that on github. But I was hoping somebody would have a ready-made solution for this already :)

dragonfly · December 5, 2016

I am now looking at PDFMiner (python) which is listed here ...

http://okfnlabs.org/blog/2016/04/19/pdf-tools-extract-text-and-data-from-pdfs.html

http://www.unixuser.org/~euske/python/pdfminer/#source

It should be feasible to right click on the pdf, get the pdf path, and run pdf2txt.py and then save such extracted text into a note.
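
A minimal sketch of the extraction step only, assuming PDFMiner's pdf2txt.py is installed and on the PATH (it prints the extracted text to standard output when no output file is given). Saving the result into a Zotero note is left out here, and the function name and `command` parameter are hypothetical.

```python
import subprocess


def extract_pdf_text(pdf_path, command=("pdf2txt.py",)):
    """Run an external extractor on one PDF and return its stdout.

    Assumption: the tool takes the PDF path as its last argument and
    writes plain text to standard output. Another tool with the same
    calling convention can be swapped in via `command`.
    """
    result = subprocess.run(
        [*command, str(pdf_path)],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the tool fails
    )
    return result.stdout
```

Usage would be along the lines of `text = extract_pdf_text("/path/to/article.pdf")`, with the path taken from the attachment in question.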

sdspieg · December 5, 2016

Thanks! And that's all fine and dandy, but I am talking of corpora of some thousands (sometimes tens of thousands) of articles. Paper machines processed them all nicely...

dragonfly · December 5, 2016

batch process?

I repeat that IBM Watson (similar offering to Quid) only requires the URLs for the pdf files. No preprocessing is required client side.
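
For the batch route, a minimal sketch might look like the following: walk a folder of PDFs, run some extractor over each one, and write the text into a single CSV column. The folder layout, the column names, and the `extract_text` callable are all assumptions; any function that takes a PDF path and returns a string would do.

```python
import csv
from pathlib import Path


def build_text_csv(pdf_root, csv_path, extract_text):
    """Write one CSV row (file, text) per PDF found under pdf_root.

    extract_text is any callable taking a PDF path and returning a
    string. Returns the number of PDFs written.
    """
    count = 0
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["file", "text"])
        for pdf_path in sorted(Path(pdf_root).rglob("*.pdf")):
            try:
                text = extract_text(pdf_path)
            except Exception as error:  # keep going if one PDF fails
                print(f"skipped {pdf_path}: {error}")
                continue
            # Collapse whitespace so each document stays compact in its cell.
            writer.writerow([str(pdf_path), " ".join(text.split())])
            count += 1
    return count
```

Skipping failures rather than aborting matters for corpora of thousands of articles, where a handful of scanned or corrupt PDFs is almost guaranteed.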

sdspieg · December 5, 2016

I guess I'll just have to have somebody look into the github code of paper machines to see how this was done there. I just find it so strange that not more people have tried doing this. Especially since the paper machines source code - and many other new tools - are open source... I like reading as much as the other guy, but when I'm dealing with a large corpus, I really like getting a feel for what's in there before delving into it... But I guess not.

dragonfly · December 5, 2016

Being curious I dived into my old experimental installation of papermachines in

~/.mozilla/firefox/[name_of_profile]/zotero/papermachines/processors

and using SearchMonkey (Ubuntu tool) simply searched “pdf”

Two files were found ..

d3.layout.cloud.js
extract.py

In extract.py

Line 48: Extract text from PDF or HTML files
Line 55: self.pdftotext = self.extra_args[0]
Line 67: if not os.path.exists(self.pdftotext):
Line 68: logging.error('pdftotext not found!')
Line 108: elif fname.endswith('.pdf'):
Line 110: self.pdftotext,

In d3.layout.cloud.js

Line 2: // Algorithm due to Jonathan Feinberg, http://static.mrfeinberg.com/bv_ch03.pdf

It is encouraging to see that d3.js was used for the tag cloud visualisation (WebGL is used in Quid).

Diving deeper into extract.py the class for text extraction draws on

tikaPath = os.path.join(self.cwd, 'lib', 'tika-app-1.2.jar')

So Apache Tika (java) is used for text extraction.
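
For anyone wanting to try Tika directly, here is a minimal sketch of calling the tika-app jar from Python using its --text option, which prints plain text to standard output. This is not necessarily how extract.py invokes it; the function names are hypothetical and it assumes Java is installed and the jar path is correct.

```python
import subprocess


def tika_command(tika_jar, document_path, java="java"):
    """Build the command line for plain-text extraction with tika-app."""
    return [java, "-jar", str(tika_jar), "--text", str(document_path)]


def extract_with_tika(tika_jar, document_path):
    """Run tika-app on one document and return the extracted text.

    Assumption: `java` is on the PATH and tika_jar points at a
    tika-app jar such as lib/tika-app-1.2.jar.
    """
    result = subprocess.run(
        tika_command(tika_jar, document_path),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if Tika fails
    )
    return result.stdout
```

Starting a JVM per document is slow for large corpora, so this is only meant to show the shape of the call.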

Tika is listed in the link I posted earlier.

http://okfnlabs.org/blog/2016/04/19/pdf-tools-extract-text-and-data-from-pdfs.html

sdspieg · December 21, 2016

I remain interested in anybody who might be able to produce a csv export capability for items with pdf-attachment. It would work like the current one, but would also put the raw text extracted from the pdf in a column. We might even be willing to provide some modest financial compensation for anybody who would be willing and able to do this. And we would still make the tool publicly available to anybody who might be interested in it. Thanks!

dragonfly · December 21, 2016

I have my own interest in building a connector between Zotero (stand alone) and back end A.I. and analytical tools. As I wrote earlier in some discussion I looked at IBM Watson. So in due course I'll get there (hopefully in the next month).

I did try to get hold of Quid API for such experiments. I might be wrong but I had the impression that Quid.com prefers to deal directly with enterprises. There is no public information on their API that I found. If you have a link to Quid.com open API and it is not confidential I can look at it.

But I would add that csv data export (alone) will not cut it for my needs since visualisation tools (D3.js) require json. In my view it would be better to export just the URLs to text documents (text extracted from pdf URLs) rather than including large blobs of extracted text in columns.

So I'm now looking at the flow of data between Zotero and more open source text analytics tools which have open APIs in Python et al.

Also I think it is safer not to be dependent on one A.I. tools vendor but to create an abstracted and open API which allows multiple vendor A.I. tools to be tested.
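
To make the idea of an abstracted API a bit more concrete, here is a minimal sketch of a common interface that different analysis backends could implement. All class and function names are hypothetical, and the word-count backend is just a local stand-in for a real vendor or open source tool.

```python
from abc import ABC, abstractmethod
from collections import Counter


class TextAnalyzer(ABC):
    """Common interface each analysis backend implements (hypothetical)."""

    name = "base"

    @abstractmethod
    def analyze(self, text):
        """Return a dict of results for one document."""


class WordCountAnalyzer(TextAnalyzer):
    """Trivial local backend standing in for a real NLP service."""

    name = "wordcount"

    def analyze(self, text):
        counts = Counter(text.lower().split())
        return {"top_terms": counts.most_common(3)}


def run_backends(text, backends):
    """Run the same document through every backend, keyed by name."""
    return {backend.name: backend.analyze(text) for backend in backends}
```

A wrapper for a hosted service or a local library would be another subclass, so tools can be swapped or compared without touching the export side.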

The other problem I see is that not every A.I. tools vendor allows a private instance of their technology for analysing enterprise confidential corpora. Not every enterprise wishes to share their data for others to benefit from training their common models.

...

One point I find in this forum is that development topics such as text analytics, A.I., visualisation become buried in the body of general discussions. Might it make sense to organise the forum into sub-forums to discuss such matters of interest?

sdspieg · December 21, 2016

Yes. Quid is commercial. But I am talking to them (as a customer) to fill a short-term need. But that's all it is. Their stuff IS pretty cool, but you can't export, embed, etc.

And yes - the interests of the two of us are really quite similar. I am currently working with a team in the US and one in Australia (both academic researchers) to hook Zotero up to various open source textmining tools in R and/or python. One of the (R-based) systems is fully set up now and we're doing some alpha testing. We will hopefully be able to post something soon. The other (python-based) one uses (among other things) spaCy and TensorFlow.

And yes, I also think we should have a separate sub-forum for text analytics (including viz, ai/deep learning, etc.) Sebastian - whom should we recommend this to?

adamsmith · December 21, 2016

Zotero got rid of forum categories with the forum update. I don't think there are that many people currently interested in the topic, so a separate forum would seem like overkill anyway. I'd start by just keeping it in one thread.

sdspieg · December 21, 2016

I think if users of Zotero were more aware of what text analytics is and what it can do for them, far more WOULD be interested. Put differently - if PhD committees would stop letting people get away with mostly cherry-picked literature reviews that only regurgitate the 'well-trodden path' (typically in one language); if peer reviewers of journal articles would do the same - then things would change much faster. We started doing these things because there was no alternative. There IS now. So I honestly see no excuse anymore for people NOT exploring the entire universe of (maybe even just peer-reviewed) writings on the topics they are inquiring into. Of course as a starting point for more detailed (and - for the time being still mostly) 'mandraulic' work. But still as an obligatory first step. This would also lead away from the almost surrealistically stovepiped academic world and towards the 'cumulative knowledge building' that should be its primary objective. Are we really going to have to wait for AI to do this for us? [sorry for this digression, but it really bothers me...]
But so back to Sebastian's answer - ok, so be it...

adamsmith · December 21, 2016

Right -- this wasn't a statement on the methods but on the suitability of Zotero's (principally tech) support forums for detailed discussions about them.