Limited proof of concept "CSV to Zotero" import

PChemist · February 2, 2020

I have developed a limited proof of concept "workflow" for transferring a .csv table to Zotero database and just wanted to gauge public interest in this feature.

Briefly, I do not have time to develop an extension, so I aimed at the following workflow. My starting point was a table in an Excel sheet and my goal was to generate a file that can be imported to Zotero via existing mechanisms. Additionally, since none of the existing item types is suitable for me, I had to pick one and decide how to match my fields to the fields of the picked item type. The idea was to design a new table with columns matching selected fields and use worksheet formulas to populate it. Then a combination of worksheet formula and VBA could be used to generate a file for import to Zotero at the last step.

I have pursued two approaches. The most comprehensive import/export format for Zotero is "RDF", so that was my first target. I used a combination of worksheet formulas and VBA to generate the desired output. RDF is very verbose, however. Python with and rdf library is probably more suitable for this job than VBA. Regardless, I have a set of VBA macros generating and RDF for one item type. This is rather a working draft.

My second approach relies on JSON format. I consider CSL JSON format supported by Zotero to be limited and unsuitable for present purpose. However, I have started to figuring how to use the client Javascript API. Further, source code inspection revealed that Zotero supports a much more comprehensive version of JSON import/export format via JS API. So my current approach is to use a combination of worksheet formula and VBA to generate these JSON records, which then can be copy/pasted into a JS API template, with the resulting code performing the actual import procedure.

The present workflow supports a limited set of fields (which can be straightforwardly extended), tags (I use a new line separated list within a cell), note attachments (I picked " ;:; " as note delimiter to minimize the chances of delimiter/content collision), and URL link attachments (new line separated list, including optional link title separated from the link using the same pattern " ;:; ").

I can provide further details to anyone interested.

danb · February 6, 2020

For lack of a local API I wrote a library that creates an interface by running js through MozRepl. (https://gist.github.com/danbe/6547077), I have an updated version for Python 3 (can provide if anyone is interested), but I haven't used it in a while and can't guarantee that it still works.

In principle, this can create any type of item, including attachments, linked attachments, and notes.
It can also query the currently selected item.

I originally wrote this library to import a larger number of journal articles that I had stored in a folder structure. This worked (the library checks for every possible error code), but I never had a good feeling about the js approach.

PChemist · February 7, 2020

It sounds interesting, would check it out. Can you provide Python 3 version, @danb ?

Personally, I would prefer Python. There are a couple of Python libraries for the http API, but I thought there was no such option for the client API, which is the only option for me, as I do not enable sync with Zotero.org.

I had somewhat hard time trying to get started with JS API due to scarce documentation, but I feel I have a better understanding of it now after a few days of playing with it. Importantly, here I can take advantage of both JS introspection and the original source code inspection.

emilianoeheyns · February 7, 2020

There's also https://github.com/retorquere/zotero-better-bibtex/tree/master/test/fixtures/debug-bridge. It's basically mozrepl but you can send it JS snippets over http and get results back by returning anything that's JSON-serializable. The README describes running from VStudio, but you can use anything you want (such as python) to talk the the http interface.

The javascript you send is ran as if it is the body of

async function run() {
  [your code here]
  return { some: { fancy: { ['object'] } }
}
run()

I use this in my test suite to drive Zotero.

danb · February 7, 2020

@PChemGuy of course. I'll dig it out and will upload it later. The biggest challenge was to get all of the corner cases for unicode strings right. You'll see that 3/4 of the code is error checking. It will also undo a transaction on error.

@emilianoheyns That looks very similar to what I did. In the original version I used strings, but then switched to JSON for transmitting the variables.

danb · February 8, 2020

I have uploaded the improved version of the local client Python library at https://gist.github.com/danbe/6547077

I can't test this version at the moment, please let me know if you find any problems.

RadialThing · February 8, 2020

I use this PowerShell script to convert pre-prepared excel->csv files into RIS files for further import. Basically I pick the fields I want from the CSV and add or concatenate the appropriate RIS fields. Then I import the RIS file. I use those particular fields because they help with coursework I am importing and I want to keep info like start page and end page in a course book for example.

This was the simplest approach for me to get the data I imported.

-----
param(
[string]$infile = "infile",
[string]$course = "course",
[string]$outfile = "outfile"
)
Import-Csv -Delimiter "," $infile | ForEach-Object {
$Row =
"TY - " + $_."Type" +
"`nTI - " + $_."Citation Title" +
"`nT3 - " + $_."Citation Chapter Title" +
"`nSN - " + $_."Citation ISBN" +
"`nSN - " + $_."Citation ISSN" +
"`nUR - " + $_."Citation Source" +
"`nKW - " + $course + " | " + $_."Section Name" +
"`nKW - " + $_."Citation Tags" +
"`nKW - " + $course + " | " + $_."Section Name" + " | " + $_."Citation Tags" +
"`nAU - " + $_."Citation Author" +
"`nA3 - " + $_."Citation Chapter Author" +
"`nJF - " + $_."Citation Journal Title" +
#"`nDA - " + $_."Citation Publication Date" +
# Publication date is not reliable in the Excel export in ExLibris
"`nCY - " + $_."Citation Place of publication" +
"`nET - " + $_."Citation Edition" +

"`nN1 - " + $course + " | " + $_."Section Name" + " | " + $_."Citation Public Note" + " | Vol:" + $_."Citation Volume" + " | Iss:" + $_."Citation Issue" + " | Pages:" + $_."Citation Pages" + " | StP:" + $_."Citation Start Page" + " | EnP:" + $_."Citation End Page" + $_."Citation Notes" +

"`nDO - " + $_."Citation DOI" +

"`nER -";
Write-Output $Row } | Out-File $outfile

-----

danb · February 10, 2020

@KallistiMan Thanks for sharing! Did you run into any issues when importing many files? I tried RIS import a few years ago, but when I tried to import a large numbers of files, it would sometimes fail in the middle with no easy way to find the problem and re-start the remaining items.

Btw, the library I mentioned can also link files if one likes to keep them in their original folder.
https://forums.zotero.org/discussion/32504/linking-attachments-by-ris-import/

seeingtheforest · February 16, 2020

Wow. I'm looking to do precisely this with about 4000 PDFs. I simply want to add tags, notes and change title, while only linking to the external files. The technical coding stuff seems a bit beyond me though - are you able to share any sample excel/vba or some instructions on how to use the Python script that you made?

danb · February 18, 2020

@nixsee I'd be happy to, but I just found a problem:

MozRepl doesn't seem to be working anymore as of Zotero 5.0.82. I downloaded the latest version from GitHub and packaged it, when I try to install it, I get an error message that the version is incompatible with Zotero.

I found an alternative, which is the Zotero debug bridge.
https://github.com/retorquere/zotero-better-bibtex/releases/tag/debug-bridge

This bridge uses POST requests instead of a telnet connection and executes commands differently. I plan to re-write the library using this debug bridge, but I am sure there will be a couple of issues.

In the meantime, there are two alternative ideas using RDF and a new import plugin.
Maybe you can make this work:
https://forums.zotero.org/discussion/81329/create-zotero-library-using-disk-directory-structure/p1

emilianoeheyns · February 18, 2020

When you call the debug bridge with url query parameters, those will be available inside your code as query, which is a javascript object. But I usually just prepend variables to the script before I POST it.

danb · February 18, 2020

thanks, Emiliano.
I already got it to work partially and can already add notes. Can the bridge do arbitrary statements, or only expressions? I tried to execute Zotero.DB.beginTransaction() but got TypeError: Zotero.DB.beginTransaction is not a function.

emilianoeheyns · February 18, 2020

It should be able to do arbitrary code, except yield statements. The debug bridge executes as if you've called

async function debug_bridge(query) {
  <your code>
}

In that block you can await promises (such as async functions and the functions returned by coroutine).

I'm getting the same error, but you can do transactional code using

await Zotero.DB.executeTransaction(async () => {
  <your stuff here>
})

danb · February 18, 2020

I got it to work, thank you!

danb · February 19, 2020

@nixsee I fixed the MozRepl issue, the library is now using the Zotero Debug Bridge.

Installation instructions are on the gist page at the bottom https://gist.github.com/danbe/6547077

The usage example at the bottom of zotero.py is pretty complete.
If you run zotero.py, it will create a few items. Don't forget to either add the files to import or comment out the list of attachments.

First, load the CSV file into a list
listfromcsv = [['Title1', [('first name', 'last name')]]]

then add all items
with zotero() as z:
for title, authors in listfromcsv:
book = book(title=title, creators = [author(first, last) for first, last in authors])
z.addItem(book, - attachments, tags, and notes go here -)

Here is an example how to create a book, and add it to the library together with a couple of attachments, tags, and notes:

with zotero() as z:
# first you create the item you want to add
book0 = book(
abstractNote='''This is the abstract''', # leave out values if not needed
accessDate='2013-01-01',
archive='Archive',
archiveLocation='Archive Location',
callNumber='Call 1234',
creators=[ # list of authors, editors etc. Take from CSV file
author('Author First 2', 'Author Last 2'),
contributor('Contributor First 1', 'Contributor Last 1'),
editor('Editor First 1', 'Editor Last 1')
seriesEditor('S. Editor First 1', 'S. Editor Last 1'),

],

date='1234-12-34',
edition='1st Edition',
ISBN='12345678x',
language='SomeLanguage',
libraryCatalog='library catalog',
numberOfVolumes='10 Volumes',
numPages='3141',
place='Publisher place',
publisher='Publisher',
rights='Rights',
series='Series',
seriesNumber='Series Number 1',
shortTitle='A short title',
title='Item with file attachments', # take from CSV file
url='http://a.b.c',
volume='12')

# Add the book to the Zotero library, together with tags, files, and notes
# Attachments are linked or imported
# notes can contain html

report = z.addItem(
book0,
# to test, add files to folder. If not, comment out next line
attachmentMode='import', # the default. to link use 'link'
attachmentList=['book toc.pdf', 'book chapter 2.pdf', 'Test.png'],
tags=['defaultTag_type0', ('tag2', 0), ('tag_type_1', 1)],
notes=['A simple text note.', 'another note'])

emilianoeheyns · February 19, 2020

The return isn't necessary BTW, and may be unwanted, because it will immediately terminate the code you sent (since you are effectively returning from the function that's running the code)

danb · February 20, 2020

thanks, that is a good point. Probably the transactions are not needed for local operations anyway, but it helps for debugging the program and keep Zotero from crashing.

emilianoeheyns · February 20, 2020

Transactions are more for performance I think (where they can make a big difference). Individual items are always saved in a transaction, either one by one, or bundled.

danb · February 20, 2020

Btw., thanks for the debug bridge, it is much more stable than the telnet interface in MozRepl. MozRepl had all kinds of corner cases and the Unicode handling wasn't easy.

Do you think the bridge could eventually become part of Zotero as a client API?

emilianoeheyns · February 20, 2020

Seriously doubt it.