Data import from EndNote via bibutils to Zotero

This is a continuation of a discussion started here
http://forums.zotero.org/discussion/8651/converter-endnote-xml-to-mods-or-rdf/#Item_1
and here
http://forums.zotero.org/discussion/8726/data-loss-on-import-of-bibutils-generated-mods-file/#Item_3

I believe this is a better way to get data and attachments out of EndNote than direct RIS export, which is the currently recommended procedure.

Here is a report on further progress.

1. Export data from EndNote in the XML format.

The tests were done with EndNote X. The output file is apparently unicode UTF-8 without BOM. Bibutils tool probably interprets it as ASCII (?), scrambling non-ASCII characters. Therefore,

2. Open the EndNote XML file with BBEdit, change the encoding to UTF-8 with BOM, and save. The same can probably be done with the free TextWrangler http://www.barebones.com/products/textwrangler/

3. As bibutils binaries for Intel Mac OS X are for some reason currently unavailable on the bibutils site http://www.scripps.edu/~cdputnam/software/bibutils/ , I have to switch to Ubuntu Linux, running in a virtual machine on my Mac.

3.1 There I convert the EndNote XML (Unicode) into the intermediate MODS XML format (Unicode as well), which unfortunately cannot yet be read properly by Zotero as is.

3.2 Then I convert the MODS into BibTeX. I have tried first with RIS, but there were problems with non-ASCII characters. Since those characters get a special encoding in BibTeX, it works OK there (not a lot of testing yet, however).

3.3 Then I convert the EndNote representation of the PDF attachment storage location into something Zotero can read. This is done by searching for the substring 'url="internal-pdf://' and replacing it with e.g. 'url="file:///PDF-Folder/' .

You will have to create a folder 'PDF-Folder' at the root level of the computer running Zotero and put all EndNote attachments there (the folder may be deleted after import).

All three transformations (3.1-3.3) can be combined into a single Unix shell command:

endx2xml < YourEndNoteFile-as-unicode-with-BOM.xml | xml2bib | sed 's|url="internal-pdf://|url="file:///PDF-Folder/|' > YourOutputBibTeXfile.bib

(This assumes your working directory is where the bibutils programs are.)

It is not difficult to write a shell script containing basically the long command above, maybe with minor additions.
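
Such a script might look roughly like the sketch below. It only wraps the long command above in two functions. It assumes endx2xml and xml2bib are on the PATH; the names rewrite_pdf_urls, endnote_to_bibtex and the PDF_PREFIX variable are made up for illustration and are not part of bibutils. Unlike the one-liner, the substitution uses the g flag, so it also catches several attachments on one line.

```shell
#!/bin/sh
# Sketch only: EndNote XML -> MODS -> BibTeX, with attachment URLs rewritten.
# Assumes the bibutils tools endx2xml and xml2bib are on the PATH.
# PDF_PREFIX is a hypothetical setting; its default matches the example above.
# It must not contain '|' or '&', which are special in the sed expression.
PDF_PREFIX=${PDF_PREFIX:-file:///PDF-Folder/}

# Read text on stdin, replace EndNote's internal-pdf:// scheme, write to stdout.
rewrite_pdf_urls() {
    sed "s|url=\"internal-pdf://|url=\"${PDF_PREFIX}|g"
}

# endnote_to_bibtex IN.xml OUT.bib
endnote_to_bibtex() {
    endx2xml < "$1" | xml2bib | rewrite_pdf_urls > "$2"
}

# Convert only when called with exactly two file names.
if [ "$#" -eq 2 ]; then
    endnote_to_bibtex "$1" "$2"
fi
```

Saved under a hypothetical name such as endnote2bib.sh, it would be called as: sh endnote2bib.sh YourEndNoteFile.xml YourOutputBibTeXfile.bib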

If anybody has suggestions on how to convert "UTF-8 without BOM" into "UTF-8 with BOM" using Unix shell tools, that could then be added to the script.

Then it would be great if somebody would write a couple of lines of PHP code and install our script onto a publicly available server (Zotero team??).

I could write a unix shell script, though... I could call myself a beginner unix shell programmer if I really wanted to begin (I don't). A professional can probably do the task in half an hour at most.

I'm not sure, but probably I could find somebody to write an accompanying PHP webserver script - with the aim of installing it on our intranet.

I do not have any public servers at my disposal.

P.S. Could anybody perhaps compile the Intel Mac binary from the bibutils open source? There is a broken link on the bibutils home page, and the author doesn't reply. I'd be happy to do all Unix shell scripting on my Mac directly (Mac is Unix).
  • Sorry, actually I wanted to post in the Import/Export category - a wrong click, apparently. Could a moderator please move the thread there?
  • Including a BOM in a UTF-8 file is "not recommended" according to this wikipedia article (and my recollection), and I have a hard time believing that bibutils has any problem properly recognizing a UTF-8 file without it (I vaguely recall discussing this with Chris years back in fact, so am pretty certain he's aware of the issues around it).
  • BTW, bibutils is available via both macports and apt-get.
  • Bibutils tool probably interprets it as ASCII (?), scrambling non-ASCII characters.
    You can use the '-i, --input-encoding' argument in bibutils.
    3. As bibutils binaries for Intel Mac OS X are for some reason currently unavailable
    It is in darwinports & refbase has a mirror of an older, working version: http://bibutils.refbase.org/
    I have tried first with RIS, but there were problems with non-ASCII characters.
    You could specify the output encoding in bibutils, so this might be surmountable.
    If anybody has suggestions on how to convert "UTF-8 without BOM" into "UTF-8 with BOM" using Unix shell tools, that could then be added to the script.
    I don't think this is needed, as bibutils lets you specify both input and output encoding. However, one way to do it is to use: uconv --add-signature
    Then it would be great if somebody would write a couple of lines of PHP code and install our script onto a publicly available server (Zotero team??).
    This sounds fairly hacky for something that should "just work." I'm glad it worked for you, but effort should be spent improving import/export of the programs in question, instead of playing with glue.
    P.S. Could anybody perhaps compile the Intel Mac binary from the bibutils open source?
    It does compile quite easily, without having to drag in dependencies.
  • Including a BOM in a UTF-8 file is "not recommended" according to this wikipedia article (and my recollection), and I have a hard time believing...
    Yes, it is not recommended. But my (preliminary) test produced correct results with it and scrambled Czech names without it. As the BOM is included only in an intermediate file, I couldn't be less concerned with the official recommendations, if it worked.

    Thank you for the information on how to get bibutils. I could not find it via apt-get, but it is available through macports. I'm reluctant to install macports, but it's probably less painful than round trips from Mac to Linux.

    Now, how do I add BOM to the beginning of a file? :)
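
One possible way to do this with plain shell tools is sketched below. The function name add_bom is made up for illustration, and the sketch assumes od, tr, printf and cat behave as POSIX describes. It writes the three BOM bytes (EF BB BF) in front of the file unless they are already there. Keep in mind that, as said above, a BOM in UTF-8 is not recommended, and the bibutils input-encoding option may make this step unnecessary.

```shell
#!/bin/sh
# Sketch only: add_bom IN OUT
# Copies IN to OUT, prepending the UTF-8 BOM (bytes EF BB BF)
# unless IN already starts with one. IN and OUT must be different files.
add_bom() {
    # First three bytes of IN as lowercase hex, with whitespace removed.
    first3=$(od -An -tx1 -N3 "$1" | tr -d ' \n')
    if [ "$first3" = "efbbbf" ]; then
        cat "$1" > "$2"
    else
        # \357\273\277 is EF BB BF in octal, which printf understands portably.
        { printf '\357\273\277'; cat "$1"; } > "$2"
    fi
}

# Run only when called with exactly two file names.
if [ "$#" -eq 2 ]; then
    add_bom "$1" "$2"
fi
```

The check for an existing BOM is there so that running the step twice does not stack two BOMs at the start of the intermediate file.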
  • edited September 19, 2009
    I couldn't be less concerned with the official recommendations, if it worked.
    You should be, because it either means a bug in bibutils, or a bug in your use of it.

    As for apt-get and macports: I don't know what to tell you, but here's what I see in my terminal:

    $ aptitude show bibutils
    Package: bibutils
    State: not installed
    Version: 3.40-4

    As for macports, I strongly recommend it.
  • noksagt, bdarcus, thank you.

    I've now installed bibutils via macports. Then Chris responded and fixed the link on his site.

    @bdarcus, with the BOM, it was probably a bug in my use of it - I couldn't reproduce it (on either Linux or Mac). I get correct Unicode in the MODS file.

    There are apparently problems with Unicode in conversion to RIS and BibTeX - more on it later, when I have some spare time.
  • Well, now I've spent most of my weekend on the tests - the procedure description and the results can be downloaded from http://home.arcor.de/web_bill_be58/Zotero-Put2web/EndNote_Import_Tests.zip

    The results overview can also be downloaded separately:
    http://home.arcor.de/web_bill_be58/Zotero-Put2web/results_overview.pdf

    --------
    In short:

    There really are a lot of issues, no matter which path you follow - they are just different :( But there is hope, too.

    I have prepared a sample bibliography, containing only items common in natural sciences and engineering, then transferred it in different ways to Zotero.
    - EndNote produces XML apparently containing all information about the items.
    - bibutils on the whole does a good job converting EndNote XML into MODS, though there are some issues.
    - Zotero cannot yet practically be used to import (bibutils) MODS, for two main reasons:
    a. There is a problem between bibutils and Zotero, which is explained here http://forums.zotero.org/discussion/8726/data-loss-on-import-of-bibutils-generated-mods-file/#Item_7 . I circumvented it by manually editing the MODS XML.
    b. Zotero doesn't recognize most item types I've tested (patent, magazine article, report, conference paper, book section, thesis). In other words, MODS import is simply not there yet.

    There are some more issues to discuss, but not today.
  • In the meantime I have informed Chris of the results of my tests and he produced the next version of bibutils (v4.4). He also advised me on the proper use of bibutils to get Unicode RIS & BibTeX output. I'm not sure when I'll have time to run more tests - but here is a short summary:

    The proper syntax to get RIS or BibTeX output in Unicode is as follows:

    xml2ris -o unicode < infile > outfile
    xml2bib -o unicode < infile > outfile
    ----

    bibutils now recognizes "web page" as a reference type, and imports the "language" and "DOI" fields from EndNote XML. Some other improvements have also been made.
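
For anyone repeating the tests, the two Unicode commands above could be wrapped roughly as in the sketch below. It assumes xml2ris and xml2bib are on the PATH and uses only the "-o unicode" option given in this post; the function name mods_to_unicode is made up for illustration.

```shell
#!/bin/sh
# Sketch only: mods_to_unicode IN.mods.xml BASENAME
# Writes BASENAME.ris and BASENAME.bib from one MODS file,
# using the "-o unicode" syntax quoted in the post above.
mods_to_unicode() {
    # The BibTeX step runs only if the RIS step succeeded.
    xml2ris -o unicode < "$1" > "$2.ris" &&
    xml2bib -o unicode < "$1" > "$2.bib"
}

# Run only when called with exactly two arguments.
if [ "$#" -eq 2 ]; then
    mods_to_unicode "$1" "$2"
fi
```

Producing both outputs from the same MODS file makes it easier to compare how each format comes through in Zotero.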
  • edited October 10, 2009
    this is really valuable work, thanks Ben!