Looking for best practices in e-content knowledge management

sdspieg · April 18, 2020

Just a question for 'heavy' users of Zotero. So I essentially use Zotero as part of a broader knowledge management setup. Here's what that roughly looks like:

software
- 95% of the content I read and annotate is now in electronic format. I keep my ebook collection in Calibre
- Zotero is my research storage, into which I download, and in which I keep and annotate, all academic and other expert text-based content I use for my research. Initially, these were sorted by research project. My Group libraries are now increasingly switching to a subject-centric model with a 'Paper' collection specifically for references for the papers/books I write. I still regret that Zotero doesn't handle ebook formats (epub, mobi, ...) better - e.g. showing their icon as it does for PDFs, indexing them, etc. - but the Zotero ecosystem still has so many more pluses than minuses that it's a no-brainer for me. Not least because of this amazing community forum.
- For things I read outside of Zotero, I use Diigo for annotation purposes.
process
- reading/annotating
  - I read and highlight passages and add notes to almost all of my ebooks on a Kindle. I sync all of my ebooks with Calibre, and every few months I import my Kindle annotations to Diigo.
  - I also regularly read/annotate PDF items from Zotero itself with Zotfile.
- for all (or at least most) of the full-text files, we can now also export full groups from Zotero with metadata and (cached) text, which allows us to run various corpus analytics (text mining) tools on the corpus: topic modelling and various kinds of information extraction (entity and relation extraction, reference extraction, hypothesis extraction, etc.).

remaining issues
- annotations
  - Diigo vs Zotero/Zotfile for regular annotation - I wish I didn't have to use them both. For instance, things like The Economist I MUCH prefer reading and annotating on my Kindle, but then I can't import those notes into Zotero. Reading PDF files still sucks on a 6" Kindle, so those I prefer reading on my computer. There, annotating a PDF is a LOT easier in Diigo, but then you upload the PDF to the Diigo cloud, and so you can't get these notes into Zotero.
  - hierarchical annotation (/coding) - we often 'code' corpora with more complex, layered (and often overlapping) coding schemes. So, for instance, certain text spans in an item might be about Topic A, others about Topic B. And within those, you might have subtopics (A1, A2, A3, B1, B2, etc.). That can't be done from within Zotero now. So we do it outside of Zotero in other (QDA) programs, but then we often lose the metadata.
  - (semi-)automated annotation - AI is making tools for this increasingly user-friendly (and free for academic use). A great example is spaCy/prodi.gy. But as far as I know, there is no way to get the fully annotated text file back into Zotero.
  - So we now often end up with different versions of the same text items. One version will be structured nicely in Zotero, which we can then use as an e-content management tool but also as a pure bibliographical management tool for adding references and bibliographies to our own publications. Other versions will end up in different file formats (increasingly jsonl, though), with various annotations. It would, it seems to me, be so much nicer if all of this could be 'integrated' or at least hooked up in a more elegant (SOA) way.
- bibliometrics
  - many exciting developments there too, with new, very large, far more user-friendly and 'open' datasets like Lens and Dimensions (with a LOT of advantages over the incumbent oligopolists Web of Science and Scopus), but also amazing power tools to visualize knowledge landscapes (also over time) like CiteSpace.
  - But this too leads to an entirely different 'pipeline' for the bibliometric work.
- why all of this matters (IMO)
  - we as humans have (so far) mostly encoded and transmitted the 'knowledge' we have built in text;
  - this text-encoded knowledge base remains, by far, the single most important 'raw materials' for scientists, across ALL disciplines;
  - the digitization of this knowledge base opens up entirely new avenues to access and enrich it - also in far more collaborative ways than most disciplines have used so far;
  - the oligopolistic 'ownership/market structure' behind that knowledge base has been impeding this 'unleashing' of its true epistemic potential [And their reaction to the coronavirus almost looks like an admission of guilt: when we REALLY need progress, we are willing to open up our treasure trove. As though other, more persistent, medical, educational, economic, security, etc. issues would NOT deserve the speediest possible solutions];
  - but so right now the content itself (think EBSCO, ProQuest, etc), the (bibliographical/bibliometric) metadata of that content (think WoS, Scopus) AND the tools that allow us to navigate all of it (think EndNote, etc.) - are still mostly closed access/source and in the hands of the Elseviers and Clarivates of this world;
  - the tide, however, IS changing - with OA content, (mostly) open source tools (NLP, visualization, etc.);
  - I see Zotero as being on the 'right' side of this epic battle that is unfolding before our very eyes;
  - but I still wish that the players on that 'right' side would also keep their eyes on the overall software architecture that will allow us to finally fully integrate the different aspects of this epistemic tooling: corpus management, corpus analytics/visualization, and the new meta-analytical layers this will undoubtedly spawn. And we're far away from that now...
But so on a more practical note: If anybody has any thoughts/suggestions about how some of the ways I have described in this post could be done in smarter ways, I'd be all ears! Thanks much.
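
To make the 'hooked up' idea from the annotations list above a bit more concrete, here is a minimal sketch of standoff annotations stored as jsonl, where every span carries the key of the Zotero item it belongs to, so layered and overlapping codes can live outside Zotero without losing the link to the metadata. Everything in it is an assumption for illustration: the `Span` fields, the `"A/A1"` code-path convention, and the `ITEMKEY1`-style keys are hypothetical, and nothing here is an existing Zotero or QDA feature.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Span:
    item_key: str            # key of the Zotero item (placeholder values below)
    start: int               # character offset into the item's text, inclusive
    end: int                 # character offset, exclusive
    code: str                # hierarchical code path, e.g. "A/A1" (hypothetical convention)
    source: str = "manual"   # e.g. "manual" or "model"


def to_jsonl(spans):
    """Serialize spans as one JSON object per line."""
    return "\n".join(json.dumps(asdict(s), sort_keys=True) for s in spans)


def from_jsonl(text):
    """Rebuild Span objects from jsonl text, skipping blank lines."""
    return [Span(**json.loads(line)) for line in text.splitlines() if line.strip()]


def spans_for(spans, item_key, code_prefix):
    """Spans on one item coded with code_prefix or any of its subcodes."""
    return [
        s for s in spans
        if s.item_key == item_key
        and (s.code == code_prefix or s.code.startswith(code_prefix + "/"))
    ]


def overlaps(a, b):
    """True if two spans on the same item share at least one character."""
    return a.item_key == b.item_key and a.start < b.end and b.start < a.end


spans = [
    Span("ITEMKEY1", 0, 120, "A"),
    Span("ITEMKEY1", 40, 90, "A/A1"),
    Span("ITEMKEY1", 80, 200, "B/B2", source="model"),
    Span("ITEMKEY2", 10, 50, "A/A2"),
]
restored = from_jsonl(to_jsonl(spans))
topic_a = spans_for(restored, "ITEMKEY1", "A")
```

Because the spans only point at an item key and character offsets, the bibliographic record can stay in Zotero while manual and model-generated codes accumulate in separate files.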

dghumphrey · April 27, 2020

Thanks for writing this great post. I'm relatively new to academia and also trying to figure out a smoother reading/reviewing workflow based around an e-ink reader. I don't mind paying a fair price for the reader/software/ecosystem but haven't found something that `just works`. It feels like all the pieces are there but not connected.

For me the dream workflow would be:
- 10 inch / full size ereader
- Seamless import from a web-clipping platform like Pocket or another bookmarklet
- Connection to Zotero library
- Good rendering of pdf including charts/images
- Great highlight and mark up with stylus or finger
- Perfect note extraction into Zotero library with correct location in the citation

On a Kindle I can just about handle the PDF rendering and markup, but the note extraction is painful and manual, i.e. it doesn't extract highlights with the original location needed to cite a paragraph.

carvalhar · September 30, 2020

Hey developers, this sounds like a good request for a new Zotero improvement!!

netbuoy · October 20, 2021

SO much THIS!!!!!!!

tracylefebvre · November 2, 2021

Yes! If annotation extraction was available from Kindle/epub it would be amazing

mivanits · December 5, 2021

Hi there -- I ended up writing my own tool for the Kindle notes -> Zotero part of this problem. It's still very buggy, but I'd appreciate any feedback!

https://github.com/mivanit/kindle-clippings-zotero

- it does not require jailbreaking your Kindle; it's possible to set this up to run automatically any time you plug your Kindle into your preferred computer
- it's specifically designed to work with non-Kindle-store items
- it tries to find matching Zotero items in a semi-smart way, asks if any match, and saves your choice in a JSON cache
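
For readers wondering what the matching step in that last bullet could look like, here is a rough sketch of the general idea; it is not the code from the linked repository. It assumes the common `My Clippings.txt` layout (title line, a metadata line containing `Location N-M`, then the highlight text, with entries separated by a line of `==========`), uses the standard library's `difflib` as a stand-in for 'semi-smart' matching, and skips the interactive confirmation. The function names, the sample entry, and the title list are all made up for illustration.

```python
import difflib
import json
import re

SEPARATOR = "=========="  # assumed entry delimiter in My Clippings.txt


def parse_clippings(raw):
    """Split raw clippings text into dicts with title, location and text."""
    clips = []
    for chunk in raw.split(SEPARATOR):
        lines = [ln.strip("\ufeff \r\t") for ln in chunk.strip().splitlines()]
        lines = [ln for ln in lines if ln]
        if len(lines) < 3:
            continue  # bookmarks and empty entries carry no highlight text
        meta = lines[1]
        loc = re.search(r"Location (\d+)(?:-(\d+))?", meta)
        clips.append({
            "title": lines[0],
            "location": loc.group(0) if loc else None,
            "text": " ".join(lines[2:]),
        })
    return clips


def best_match(title, zotero_titles, cache, cutoff=0.6):
    """Return the closest library title, remembering the answer in cache."""
    if title in cache:
        return cache[title]
    # Drop a trailing "(Author)" part before comparing titles.
    bare = re.sub(r"\s*\([^)]*\)\s*$", "", title).lower()
    lowered = {t.lower(): t for t in zotero_titles}
    hits = difflib.get_close_matches(bare, list(lowered), n=1, cutoff=cutoff)
    cache[title] = lowered[hits[0]] if hits else None
    return cache[title]


# Made-up sample entry and library titles, for illustration only.
sample = (
    "\ufeffAn Example Book Title (Jane Doe)\n"
    "- Your Highlight on page 10 | Location 150-152 | Added on Monday, December 6, 2021 10:15:00 AM\n"
    "\n"
    "Example highlighted passage.\n"
    "==========\n"
)
library_titles = ["An example book title", "Something unrelated"]

cache = {}
clips = parse_clippings(sample)
matched = best_match(clips[0]["title"], library_titles, cache)
cache_as_json = json.dumps(cache)  # this string could be saved and reloaded on the next run
```

A real tool would still want a confirmation prompt before trusting a fuzzy match, since short or generic titles can easily match the wrong item.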