Analyzing Zotero collections with Voyant

corajr · April 10, 2017

Hi all! Some of you may know me as the former dev of Paper Machines, which once upon a time let you visualize your Zotero collections. Regrettably I've not been able to maintain it, but I wanted to let folks know about an alternative for those still interested in text analysis.

I've put together a little extension that can export a full-text Zotero collection to Voyant (https://github.com/corajr/zotero-voyant-export/). Voyant seems well-suited to provide the rich, multifaceted data display I always dreamed of for Paper Machines — it even has topic modeling these days! Hopefully it will provide a better experience for that use case.

I've only tried this extension with the 5.0 beta, on Linux and Mac; it's a bit rough-and-ready but seems to work. Please let me know, here or via GitHub issues, if you have any thoughts on how to make it better or more useful for your specific case.

adamsmith · April 10, 2017

Cool! cc @sdspieg

Rintze · April 10, 2017

@sdspieg, this is probably of interest to you. (edit: @adamsmith beat me to it!)

sdspieg · April 10, 2017

Tears come to my eyes :) ! Thanks Cora, thanks Sebastian, thanks Rintze! I'm off trying it out. And also, Cora - Olga Scrivner at Indiana University Bloomington has been working on a way to get Zotero libraries into her new Integrated Text Mining Suite. See here - https://languagevariationsuite.shinyapps.io/TextMining/ . She has a couple of grad students working on it, but we had problems getting the Zotero export just 'right'. Maybe we can see whether your export tool could be adjusted for her suite as well?

sdspieg · April 10, 2017

Hmmmm. Is this working for anybody? I've exported a smallish collection (193 articles; 64 MB). That works nice and snappy, and the zip file looks fine - the folder structure, the xmls, etc. But then uploading the zip to the cloud takes a long time (over 5 minutes for this one) and ends up with an error: Error (Document terms) and a red arrow.
The local server version imports much faster, then says 'uploading' for a while, and then goes back to the initial web page instead of opening the actual program.
Does anybody know whether there's a log created somewhere?
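
One quick check before uploading is to list what the exported zip actually contains. The following is only a minimal sketch in Python: it assumes nothing about the extension's folder layout or file names, and the idea that zero-byte entries are worth a look is a guess, not a confirmed cause of the error above.

```python
import zipfile


def summarize_export(zip_path):
    """Return (name, size_in_bytes) for every file entry in the zip."""
    with zipfile.ZipFile(zip_path) as archive:
        return [
            (info.filename, info.file_size)
            for info in archive.infolist()
            if not info.is_dir()  # skip pure directory entries
        ]


def empty_entries(entries):
    """Names of zero-byte entries, which a downstream parser may reject."""
    return [name for name, size in entries if size == 0]
```

Calling `empty_entries(summarize_export("export.zip"))` (the file name here is hypothetical) returns the names of any empty files, which could then be removed from the collection before retrying the upload.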

sdspieg · April 10, 2017

Okok - I realize that this is off topic. Sorry, I'll take this elsewhere :)

adamsmith · April 10, 2017

Cora says "here or via github issues" so this seems a fine place to discuss issues.

corajr · April 10, 2017

I'm not surprised it's slow to import on Voyant -- this extension leaves all the heavy lifting to it (file conversions etc.), which if I understand right uses Apache Tika underneath for most files.

Voyant's local server should display a log screen when launched; the relevant entry for import should show up as "trombone: TOOL: corpus.CorpusCreator." Please look to see if there are any errors there, and possibly try bumping up the memory allotment to 2048 or 4096 MB.

If there are no errors on Voyant's side, you could enable Zotero's debug log under Preferences -> General and retry the export; the extension will output there as it processes items. Any relevant issues on the Zotero side would likely be interspersed between lines containing "doExport."
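
If the debug output gets long, a small script can pull out just the neighborhood of the export. This is a rough sketch in Python, assuming the debug output has been copied into a plain text file; the only thing it relies on is that the relevant lines contain the string "doExport", and the function name and `context` parameter are made up for illustration.

```python
def lines_near_marker(log_text, marker="doExport", context=2):
    """Return lines containing `marker` plus `context` lines on each side."""
    lines = log_text.splitlines()
    keep = set()
    for index, line in enumerate(lines):
        if marker in line:
            start = max(0, index - context)
            stop = min(len(lines), index + context + 1)
            keep.update(range(start, stop))
    # Emit the kept lines in their original order, without duplicates.
    return [lines[index] for index in sorted(keep)]
```

For example, `lines_near_marker(open("zotero-debug.txt").read(), context=5)` (file name hypothetical) would return each "doExport" line together with the five lines before and after it.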

(BTW, the v0.0.1 add-on was missing the update URL; if you remove it and install v0.0.4 from https://github.com/corajr/zotero-voyant-export/releases/tag/v0.0.4 future updates should happen automatically.)

DWL-SDCA · April 10, 2017

I have just started to experiment with Voyant but I've had the opposite result. This wasn't a Zotero file but a pdf of a report that includes Titles and abstracts of 450 items. The report ran quickly. I added a large number of words to the list of stop-words (publishing company names, "copyright", etc.) The analysis ran quickly and produced results. This was just a quick test. To be useful I'll need to make modifications to my source document to eliminate running-heads, section titles etc. Over the next week or two I'll post back with my thoughts. My main database (not in Zotero) is approaching 600 thousand English language records of journal articles, technical reports, theses, proceedings, etc. concerning the approaches to safety by 30+ professional disciplines beginning with items published in the mid-17th century about the safety of farmers, mariners, and miners. I'm looking forward to examining time trends of word use, topics of interest, and other things.

sdspieg · April 10, 2017

Ok, so here are some excerpts from the output of the console [any way to enter code in such a way that readers can scroll down?]:

Apr 10, 2017 10:32:27 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
WARNING: No Unicode mapping for CID+55 (55) in font DejaVuSansBold

SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
2017-04-10 22:32:34.174:WARN:/:qtp989110044-14: trombone: ERROR: An error occurred during multi-threaded document expansion.
java.lang.IllegalStateException: An error occurred during multi-threaded document expansion.

at java.lang.Thread.run(Unknown Source)
Caused by:
java.util.concurrent.ExecutionException: java.io.IOException: Unable to parse document: UNKNOWN: null
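
For anyone facing a long console dump like this one, the chain of Java exceptions can be pulled out with a few lines of Python. This is just a sketch: the regular expression is a heuristic of my own for fully qualified class names ending in "Exception" or "Error", not anything Voyant provides.

```python
import re

# Heuristic: lowercase package segments followed by a class name that
# ends in "Exception" or "Error", e.g. java.io.IOException.
JAVA_EXCEPTION = re.compile(
    r"\b(?:[a-z_]\w*\.)+(?:[A-Z]\w*)?(?:Exception|Error)\b"
)


def exception_classes(log_text):
    """Return the distinct exception class names in order of appearance."""
    seen = []
    for match in JAVA_EXCEPTION.finditer(log_text):
        name = match.group(0)
        if name not in seen:
            seen.append(name)
    return seen
```

In a trace with "Caused by:" sections, the last name returned is usually the innermost cause, which is often the most useful one to search for or report.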

Anita Lucchesi · June 26, 2017

Ah, ok. Now I see. Such a pity, @corajr.

We are working on a quite large collection of academic journal articles (ca 10k items). As I had previous experience with Paper Machines, I thought it would be a good tool for us, as we are also interested in topic modelling by time. I checked this forum in November 2016 and didn't find any information that Paper Machines was no longer maintained.

Then, last month, when we finally started working on the data, we began struggling with blank pages (or no feedback at all) from Paper Machines. In my previous work with it, the collection was smaller and I was working on a Windows machine; back then (2012) I did not have any problems. Now the collection is bigger and, as I have to use other packages for Mac OS X, I am using standalone Zotero on my Mac with Paper Machines installed as an add-on. Unfortunately, even after splitting the collection into smaller subcollections, the extraction to Paper Machines works, but after running it for more than 24 hours I still get a blank page as the result (empty HTML) or the message "No log file found."

I really liked the layout of the topic modelling via Paper Machines using Mallet. I'm afraid the results in Voyant are not as helpful as those in Paper Machines for those interested in looking at time, since topic modelling by time was built into the Zotero add-on.

I will try to export the full-text Zotero collection to Voyant. I have already tried to run part of my data (865 articles; ca 2GB) on it (inserted manually, Zotero PDF collections converted into plain txt), but I also got the "Error (Document terms)", as @sdspieg did.

Would be interested to know whether @DWL-SDCA got the same problem when trying to run more data than in the first test mentioned.

And, just out of curiosity, would there be any possible "revival" of Paper Machines, as a sort of Wayback Machine for tools?

DWL-SDCA · June 26, 2017

I apologize for not writing back in a timely way. I ended my experiments at the point where it became clear that 1) handling stopwords and editing the source document (to eliminate headers/footers, etc.) required more preparation time than I was willing to spend; and most importantly 2) I do not have the necessary knowledge and skills in text analysis to make this powerful tool useful for _my_ needs. Gaining those skills and that knowledge would be both fun and nice but I must weigh priorities. That gain would cause me to spend less time on my real work. I made a judgement that working with Voyant was a luxury that would require more time than I could personally afford. My job is curating a database and editing several hundred bibliographic records each day for addition to the database. While it would be nice to have a graphic analysis of the sum of all the abstracts of the items in the database, I do not have time to pursue that goal.

[An added problem is that the items are drawn from scholarly journal articles from more than 30 distinct professions and published in 150+ nations. Many of these professions use different terms to label very similar concepts. While, during the editing process, we add terms that make an abstract understandable to persons not in the authors' profession, we add the terms as "explainers" and keep the authors' own words. Also there is the problem of the same word being used for very different things. Take, for example, the word "football". Is the article about soccer, American football, Gaelic football, Australian-rules football, touch/flag football, or any of the other games called football or translated to football? Each of these games has very different rules and equipment and injury risks. When adding an article we identify which game is being discussed and add the term the author didn't think to mention {few European readers of an Italian journal article could be expected to think an article about football concerned anything other than soccer or that an article about "football" in an American journal by an author from the U.S. would concern soccer}. We do this term-explanation not only for search/query purposes but so that someone browsing and reading can quickly identify what the article is about. These administrative edits and explanations make interpretations of text analysis more troublesome than useful.]

tobyto · December 26, 2017

Hi! The Voyant extension seems very interesting. I have downloaded the plugin and have successfully installed it. However, when I right-clicked "Export Collection..." the option of 'Export to Voyant' didn't show up. Any recommendation on this?

ohelloworld · January 13, 2018

I have installed the voyant extension, but after I right-clicked on the collections, the 'export to voyant' didn't show up. Any suggestions?

kcorwin · January 22, 2018

It seems that since the last update I have the same issue. I also posted an issue on the GitHub page and offered to help. The "UI" file of the extension needs to be updated, but the documentation is not ample enough to help a very basic coder like myself implement a fix.

ilc168 · July 11, 2018

Appreciate your work on this. Some collections are working (so far smaller ones, ~20 files); a larger meta one is not (~300 files). The exported corpus is empty. Any suggestions? I'll post an update with any troubleshooting that works.

joshuawagner · August 15, 2018

This is amazing! Thank you! When I first saw the post I tried this out and something didn't work out. Just in case, I tried it again today, and it worked great! I look forward to digging into this further.