Data loss on import of bibutils generated MODS file

Using bibutils to convert EndNote exported XML into MODS, then reading data into Zotero is an interesting alternative, as XML is probably the least broken EndNote export format.

I tried it and found that a lot of information was lost.

I have uploaded the sample file.
http://home.arcor.de/web_bill_be58/Zotero-Put2web/Library_MODS.xml

Some of the problems:

- The books by Berliner imported without year, publisher, number of pages
- The book (by Berliner) Title concatenated with Subtitle without space or other separator

The following citations were imported as "books":

- The Thesis by Sittner
- "Scotch Tape Test" without any given document type (Web Page in EndNote XML, but bibutils issued a warning and discarded the doc type)
- A Conference Paper by Meng - without pages, place, conference name

----

There are probably other issues - I didn't check any further
  • Piping the intermediary MODS into bibutils' xml2bib to produce BibTeX (sample file: http://home.arcor.de/web_bill_be58/Zotero-Put2web/Library_bib.bib.zip) and importing that into Zotero produced much better results: the thesis was imported as a thesis, and journal articles were imported with volume, issue, and pages. Only a conference paper (Meng) was still imported as a book, despite being tagged as "@Proceedings{Meng2005" in the source.
  • It seems I have figured out a procedure that, though not perfect, is probably better than the others for getting references from EndNote into Zotero.

    1. Export from EndNote as XML
    2. Send the file through two bibutils programs:

    endx2xml Your-EndNote.xml > New-MODS-File.xml

    xml2ris New-MODS-File.xml > New-RIS-File.ris

    (Include the paths in the file names if necessary; piping is presumably possible.)

    Then, as proposed earlier:
    http://forums.zotero.org/discussion/5311/importing-endnote-libaray-including-pdf-attachments/#Item_10

    Copy the folder named "PDF" from the EndNote storage into your root directory (the startup disk on my Mac).

    Search and replace in the RIS file:
    String to search for: "UR - internal-pdf://"
    Replacement string: "L1 - file:///PDF/"

    Read the RIS file into Zotero. At first impression, the import is cleaner than a direct import from EndNote RIS, and the PDFs are all there!

    It is certainly possible to combine the two conversions and the search-and-replace in a single shell script (and put it on a web server), but not today, and probably not by me (my Unix skills are very limited).
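
The search-and-replace step above could be sketched in JavaScript roughly as follows. This is only an illustration: the function name `rewritePdfLinks` and the sample record are made up, and it assumes that RIS tags are written with two spaces before the dash ("UR  - "), which is why the pattern tolerates any amount of spacing and the output uses "L1  - ".

```javascript
// Sketch only: rewrite EndNote's internal PDF links in RIS text so that they
// point at a copied "PDF" folder. rewritePdfLinks is a hypothetical name.
function rewritePdfLinks(risText, pdfRoot = "file:///PDF/") {
  // Match "UR - internal-pdf://" at the start of a line, tolerating any run
  // of spaces or tabs around the dash (assumption: spacing may vary).
  const pattern = /^UR[ \t]*-[ \t]*internal-pdf:\/\//gm;
  // A replacer function avoids special handling of "$" in pdfRoot.
  return risText.replace(pattern, () => "L1  - " + pdfRoot);
}

// Made-up sample record, for illustration only.
const sample = [
  "TY  - JOUR",
  "TI  - An example record",
  "UR  - internal-pdf://example.pdf",
  "UR  - http://example.org/",
  "ER  - ",
].join("\n");

console.log(rewritePdfLinks(sample));
```

Ordinary web URLs in UR fields are left untouched; only lines that start with an internal-pdf link are rewritten.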

    Note: the link to the Mac/Intel binary on the bibutils home page http://www.scripps.edu/~cdputnam/software/bibutils/ is broken :(
  • edited September 15, 2009
    The books by Berliner
    Both of these are books in a series. I think that it would be more correct for the MODS XML to have <relatedItem type="series"> instead of <relatedItem type="host">, but I don't know what your EndNote XML data looked like or if bibutils can get this right. Zotero does the right thing when these are labeled as being part of a series.

    However, I think the MODS XML translator could be improved to use the originInfo of the record in preference to that of relatedItems (particularly when originInfo is absent from those relatedItems).

    Note that NO references import with the total number of pages; this is a relatively recent field in Zotero. Also note that the total number of pages isn't enumerated by "Spin Labeling. Theory and Applications."
    The book (by Berliner) Title concatenated with Subtitle without space or other separator
    Ticket created. I'd need to test this more, but the translator could use something like:

        // title
        for each(var titleInfo in mods.m::titleInfo) {
            // dropping other title types so they don't overwrite the main title
            // we have same behaviour in the MARC translator
            if(!titleInfo.@type.toString()) {
                if (titleInfo.m::title.length()){
                    newItem.title = titleInfo.m::title.text().toString();
                    if (titleInfo.m::subTitle.length()) {
                        newItem.title = newItem.title + ": " + titleInfo.m::subTitle.text().toString();
                    }
                } else {
                    newItem.title = titleInfo.*.text(); // including text from sub elements
                }
            }
        }

    The following citations were imported as "books":

    - The Thesis by Sittner
    Ticket created. This reflects a bug in Zotero (for both import and export). Zotero uses the genre "theses", but the proper MARC genre is "thesis".
    - "Scotch Tape Test" without any given document type (Web Page in EndNote XML, but bibutils issued a warning and discarded the doc type)
    Garbage in leads to garbage out. There is no way for Zotero to know how to type this entry. Perhaps bibutils could be improved here, though.
    - A Conference Paper by Meng - without pages, place, conference name
    The Zotero MODS translator does not currently handle the "conference publication" genre. There are many genres that should be added in addition to this one.
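
A genre lookup of the kind discussed here might look roughly like the sketch below. The table is an assumption made for illustration, not the translator's actual list: it covers only the genre terms that come up in this thread, and the fallback to "book" simply mirrors the behaviour reported above.

```javascript
// Illustrative mapping from MODS genre terms to Zotero item types.
// Assumption: this table is made up for the sketch and is far from complete.
const GENRE_TO_ITEM_TYPE = new Map([
  ["book", "book"],
  ["thesis", "thesis"],
  ["theses", "thesis"], // non-standard spelling mentioned in this thread
  ["conference publication", "conferencePaper"],
  ["web page", "webpage"], // genre term bibutils is reported to emit
]);

// Return the item type for the first recognised genre, else the fallback.
// itemTypeForGenres is a hypothetical helper name.
function itemTypeForGenres(genres, fallback = "book") {
  for (const genre of genres) {
    const key = String(genre).trim().toLowerCase();
    if (GENRE_TO_ITEM_TYPE.has(key)) {
      return GENRE_TO_ITEM_TYPE.get(key);
    }
  }
  return fallback;
}

console.log(itemTypeForGenres(["Conference publication"]));
console.log(itemTypeForGenres(["something unknown"]));
```

Keeping the mapping in one table makes it easy to add further genres without touching the lookup logic.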
  • The RIS sample file, which I could successfully import into Zotero, as well as the original EndNote file are here


    http://home.arcor.de/web_bill_be58/Zotero-Put2web/EndNOte-and-RIS.zip

    Note, Journal Article / Volume, Issue, Pages, Date are not retrieved on import from MODS, but are read from the RIS produced from this MODS file.
  • This is very useful, Ben. Thanks for documenting it. I also very much like your idea of making this a serverside script. A web service converting EndNote XML to good RIS would be a killer application.
  • but I don't know what your EndNote XML data looked like or if bibutils can get this right. Zotero does the right thing when these are labeled as being part of a series.
    EndNote XML just has the series title as a "secondary" title. What does EndNote's interface call it? It may be that bibutils could be improved slightly by assuming that all secondary titles for books are the series title; I don't know. However, as I noted, Zotero should use the top-level originInfo when possible.
    - "Scotch Tape Test" without any given document type (Web Page in EndNote XML, but bibutils issued a warning and discarded the doc type)
    Garbage in leads to garbage out. There is no way for Zotero to know how to type this entry. Perhaps bibutils could be improved here, though.
    Bibutils can probably be improved, as the LoC gives an example of a webpage in MODS, although (at first glance) it seems hard to differentiate it from a computer program.
    Note
    Please give at least one example of a note not being transferred via MODS.
    Journal Article / Volume, Issue, Pages, Date are not retrieved on import from MODS
    Presumably, you mean the vol/iss/pages/date, etc. from journal articles (as the article's title & publication are kept). These are all absent due to a similar issue re. the root level vs. the relatedItem branch, described above. The bibutils-produced MODS XML is valid; "part" is allowed at the root level. So, Zotero should get this data. However, the LOC's example & most other software put the part branch beneath the "host" "relatedItem" branch. I'd argue that bibutils should probably export this same way (as it is more expected).
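
The "root level first, then relatedItem" lookup described in this thread could be sketched as below. It is not the translator's code: records are plain objects rather than MODS XML, and the object shape, the helper name `resolveField`, and the sample values are all assumptions made for illustration.

```javascript
// Sketch only. Hypothetical record shape:
//   { originInfo, part, relatedItems: [{ type, originInfo, part }] }
function resolveField(record, field) {
  // Prefer the value on the record itself (the root level).
  if (record[field] !== undefined) {
    return record[field];
  }
  // Otherwise fall back to the first host or series relatedItem that has it.
  for (const related of record.relatedItems || []) {
    const isContainer = related.type === "host" || related.type === "series";
    if (isContainer && related[field] !== undefined) {
      return related[field];
    }
  }
  return undefined;
}

// Made-up data: "part" at the root, as bibutils is described as writing it.
const rootStyle = {
  part: { volume: "12", issue: "3", pages: "45-67" },
  relatedItems: [{ type: "host" }],
};

// Made-up data: "part" beneath the host relatedItem, as in the LoC example.
const hostStyle = {
  relatedItems: [{ type: "host", part: { volume: "12", issue: "3", pages: "45-67" } }],
};

console.log(resolveField(rootStyle, "part").volume);
console.log(resolveField(hostStyle, "part").volume);
```

The same helper would serve for originInfo, so both layouts yield the same volume, issue, pages, and publication data.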
  • Note, Journal Article / Volume, Issue, Pages, Date are not retrieved
    Please give at least one example of a note not being transferred via MODS.
    Written in haste before leaving for work :) Also note that English is not my native language. It was meant as "Note that Journal Article..." But note that the phrase as I wrote it is IMHO grammatically correct with the verbal meaning of "note", too :)
  • I'd argue that bibutils should probably export this same way (as it is more expected).
    The meaning was probably "The bibutils author should upgrade the software to conform better to the standards".

    Is bibutils still in active development? I have sent an email to Chris Putnam, the author of the software, concerning a broken link (see above) and have not received a response yet.

    If it is still being developed, noksagt, could you perhaps contact Chris about the above point?

    Otherwise, what is your opinion: is any workaround possible on the Zotero side or by the user? (I mean realistic possibilities only, not that some third person should take the bibutils source and implement your suggestions.)
  • edited September 19, 2009
    Chris has been continually improving it AFAIK, but has been kind of incommunicado with me as well. I'd guess he just fixes stuff as it comes up.

    In general, bibutils is really good. I personally would like to see it moved to an open SCM repo (say GitHub) with an issue tracker, easier contribution, etc. I'd also like to see it combined into a single binary (right now there are what, 20?), and to see bindings developed for common scripting languages (he did already move the core to a library, so this is easier; there's one for haskell, for example).

    All of which is to also underline the point that bibutils seems to me a good basis for any conversion web service.
  • @noksagt
    Garbage in leads to garbage out. There is no way for Zotero to know how to type this entry. Perhaps bibutils could be improved here, though.
    Chris (whom I informed of the results of my tests and of this discussion) has released bibutils v4.4.

    Among other fixes and improvements, endx2xml (MODS) now recognizes the "web page" genre. Zotero, however, doesn't recognize this genre.

    I have put the files produced by the new bibutils at
    http://home.arcor.de/web_bill_be58/Zotero-Put2web/EndNote_via_bibutils_4.4_imp.zip

    The bibutils-produced MODS file "endx2MODS hand edited bbedit-reflow.xml" contains, among others, the item "i-PULSE", which is a web page and gets imported as a "book" by Zotero.
  • Ticket created. This reflects a bug in Zotero (for both import and export). Zotero uses the genre "theses", but the proper MARC genre is "thesis".
    I have finally submitted a patch for this.
  • The book (by Berliner) Title concatenated with Subtitle without space or other separator
    Ticket created.
    There is also now a patch for this.
  • - A Conference Paper by Meng - without pages, place, conference name
    The Zotero MODS translator does not currently handle the "conference publication" genre. There are many genres that should be added in addition to this one.
    Patch that partially addresses this: a majority of the standard marcgt genres, including "conference publication", are now mapped.
  • Patches have been committed.