Import fails, then generates duplicates

olivier.cailloux · December 18, 2009

Hello,

I am trying to import a BIB file you can find here [http://www.lamsade.dauphine.fr/mcda/biblio/index.html] (link down the page, export Bibtex, or direct link [http://www.lamsade.dauphine.fr/mcda/biblio/Biblio/complete-bibliography.bib]).

I converted the input BIB file to UTF-8 (the original is in LATIN1) with iconv (on Debian). In Zotero, I created a collection for that purpose (empty at first). I select this one, and click on the wheel icon / import / bibtex. It says that errors were encountered while importing. I observe the following:
- part of the entries (280, original file seems to count over 5k entries) seem to be imported correctly.
- however, they are placed inside my "normal" default personal collection.
- I move them to the other one (they get copied instead, but that's not related I guess).
- Another big problem: there are plenty of duplicates, which are not visible in the Zotero GUI but appear when I try to re-export to a BibTeX file. The generated BibTeX file has many more than the 280 entries shown, and contains plenty of duplicates. E.g., entries corresponding to keys auriol_robust_2007, auriol_robust_2007-1, auriol_robust_2007-2, auriol_robust_2007-3 are identical (the original imported file contains one entry referencing the author Auriol).

Any idea what's going wrong?

dstillman · December 18, 2009

Export Library currently includes items in the trash folder.

We'll take a look at the import issue.

dstillman · December 18, 2009

An updated BibTeX translator that fixes this has been pushed to the repository. Your copy of Zotero should auto-update within 24 hours, or you can update manually by clicking Update Now in the General pane of the Zotero prefs. Let us know if you have further problems.

olivier.cailloux · December 22, 2009

Many thanks for your fast reaction. Import (of the same file) now works much better, although I still see a few glitches.

1) Accents encoded in the imported BibTeX file using backslashes, e.g., Tsouki\`as for Tsoukiàs (which is a correct BibTeX way to do it, AFAIK [http://www.bibtex.org/SpecialSymbols/]) are imported as-is with no interpretation of the encoding, e.g., I get an author named Tsouki\`as instead of Tsoukiàs. Naturally, when re-exporting it is all messed up. See for example the entry "@INBOOK{Tsoukias06inbook,
author={Yannis Dimopoulos and Pavlos Moraitis and Alexis Tsouki\`as}"

2) When re-exporting, the key generation algorithm does not properly account for special characters, see for example the entry (after re-export)
"@inproceedings{abdellaoui_`closing_1994,
title = {The `closing in' method: An experimental tool to investigate individual choice patterns under risk}".

3) The original file seems to count 5,898 entries (as per "grep -c '^@[^{]*{' complete-bibliography-utf8.bib"). After import, I see 5,899 entries in my library (in Zotero I select all entries in the just-created collection and it says "5899 selected entries" (translated from French)). Unfortunately I have no idea how I could discover which one has been created / split / duplicated.
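
One way to narrow down the extra entry, sketched here under assumptions, is to tally entries per type in the original file and in a re-exported file and compare the two tallies. The names `countByType` and `diffCounts` are hypothetical. The pattern follows the grep above, so it also counts `@string`, `@comment` and `@preamble` blocks if the file has any. Zotero may rename some entry types on export, so a difference is a hint rather than proof.

```javascript
// Hypothetical helper: tally BibTeX entries by type (case-insensitive).
function countByType(bibText) {
  const counts = new Map();
  // Same idea as grep '^@[^{]*{': an "@", a type name, then an opening brace.
  for (const match of bibText.matchAll(/^@([^{]*)\{/gm)) {
    const type = match[1].trim().toLowerCase();
    counts.set(type, (counts.get(type) || 0) + 1);
  }
  return counts;
}

// Hypothetical helper: list every type whose count differs between two tallies.
function diffCounts(before, after) {
  const types = new Set([...before.keys(), ...after.keys()]);
  const changed = [];
  for (const type of types) {
    const a = before.get(type) || 0;
    const b = after.get(type) || 0;
    if (a !== b) changed.push({ type, before: a, after: b });
  }
  return changed;
}
```

Reading both files into strings and passing the two tallies to `diffCounts` would show which entry type gained an item, which at least reduces the number of entries to inspect by hand.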

About the duplicates problem, it was indeed due to the export of other unwanted collections. When selecting only those 5k+ entries I want and right-clicking 'export', I do not see the duplicates any more.

Once again thank you for your help.

olivier.cailloux · January 4, 2010

I can provide a precise way to produce what seems very likely to be a bug in Zotero import and export of Bibtex files features. See points 1 and 2 above (third point is not valid I think).

If nobody has time to work on solving these bugs now, I understand that, Zotero being a free project. But shouldn't a bug report be opened to keep track of what is wrong? I followed http://www.zotero.org/support/reporting_bugs but as Zotero does not offer an error report in this case, I can't find a way to open a bug report. What more can I do?

I have observed that the Zotero team usually takes this kind of issue seriously, so I am a bit surprised to see no answer to my post... Is it because it is considered not important? (I think it is very likely that people trying to import large BibTeX files in foreign languages with accents will encounter this bug sooner or later.) Or would it be very difficult to solve? (I have the feeling it would not be.) Please tell me if I am wrong. Or maybe these are not bugs, in which case I'd be very happy to know why...

We are thinking about switching the maintenance of our MCDA bibliography to Zotero instead of a BibTeX file as it is currently, but we need to have good BibTeX export to be able to do that (for legacy reasons). I think this move would be very beneficial for everyone (for our research community as well as for the ease of maintenance), but this issue is currently blocking.

Any help would be appreciated!

olivier.cailloux · January 11, 2010

Bump... Is anyone reading this?

noksagt · January 11, 2010

Accents encoded in the imported bibtex file using backslashes, e.g., Tsouki\`as for Tsoukiàs are imported as-is with no interpretation of the encoding

Yes. More of these can be added to the mapping table. As a work-around, enclose your accented character in brackets (e.g. \`{a} ).

When re-exporting, the key generation algorithm does not properly account for special characters

There has been past discussion on the topic of key generation. The " ' " is banned, but is the " ` "? What is your specific objection?

Can't help you much with the off-by-one reference issue. I haven't seen it. Import the two files into JabRef & see if anything sticks out?

olivier.cailloux · January 12, 2010

1. Incorrect import of accents.
=> yes, I'll try to do that. But that can't be done manually; it requires writing a script (as the input file is huge). Not extremely difficult, but correcting Zotero would be cleaner and more beneficial to everyone. Can I change the mapping table Zotero uses somehow? I guess that would be even simpler than writing an ad-hoc script for transforming the input file. (Then I could submit the patch and if it is integrated in Zotero it would be beneficial to everybody.)
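
For reference, the ad-hoc script mentioned here could be quite short. This is a minimal sketch, assuming the only forms to rewrite are a backslash, one accent symbol and a single letter (such as \`a); the function name `braceAccents` is hypothetical and the regex has not been checked against the whole MCDA file.

```javascript
// Hypothetical pre-processing step: rewrite \`a as \`{a}, \'e as \'{e}, and so on.
// Only symbol accents directly followed by a letter are touched; forms that
// already use braces (e.g. \'{e} or \'{\i}) are left alone.
function braceAccents(bibText) {
  // In the replacement string, "\\" is a literal backslash, $1 the accent, $2 the letter.
  return bibText.replace(/\\([`'^"~])([A-Za-z])/g, "\\$1{$2}");
}

console.log(braceAccents("Alexis Tsouki\\`as"));
```

Letter-named accent commands (such as \c or \v) are deliberately not handled, since a backslash followed by letters could be any LaTeX command.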

2. The generated key, in the example, is abdellaoui_`closing_1994. This is an incorrect Bibtex key (needs to be corrected manually before running bibtex). Also even if it was a correct key I don't feel it's a very practical key to use, but anyway the main problem is that bibtex does not parse it.

Thanks for answering, I was beginning to feel lonely ;-).

dstillman · January 12, 2010

https://www.zotero.org/trac/browser/extension/trunk/translators/BibTeX.js

Patches welcome. You can post them to zotero-dev.

mark · January 12, 2010

Olivier, I just want to say that reading your posts makes me smile. You have a very gentle attitude. Thank you!

olivier.cailloux · January 13, 2010

Hum, don't know how to read that... Anyway, I'll take it as a positive remark.

(I guess that's gentle again...)

mark · January 13, 2010

It's a positive remark for sure!

olivier.cailloux · March 5, 2010

I just finished a patch correcting several bugs in the bibtex translation. I tried to send an e-mail to zotero-dev@googlegroups.com but it has been rejected (I should be a member of the group).

Is it possible to become a member using my usual e-mail address (which is not from google)?

adamsmith · March 5, 2010

yes - you just need to create a google account I believe - but you don't actually need a gmail address for that.

olivier.cailloux · March 29, 2010

I just submitted a patch [ http://groups.google.com/group/zotero-dev/web/diff2?hl=en ] to the google dev group to the BibTeX translator which fixes a few issues.

First, a test case. (Simply copy in the clipboard and import that in
zotero.)
@inProceedings{akey,
author={M. Myself},
title={The `best' method: {A} ``final'' method to prove that my method
is the very best.},
booktitle={My very best accented \'{e}xp\'er\'{\i}ments},
editor={B. Myself},
publisher={Myself Publishing Inc.},
pages={141-155},
year={1994}
}

1) Importing from BibTeX. When transforming an input string containing an encoded accented character (such as "Tsouki\`as"), the character was left as is (resulting in "Tsouki\`as" instead of the expected "Tsoukiàs"). (This was simply a 'g' missing from mapped = mapped.replace(/[{}]/, "");)
2) Importing from BibTeX. One regexp has been changed to account for patterns such as \'{\i}.
3) Importing from BibTeX. BibTeX quotes (` and ') are now transformed to
unicode English quotes.
4) Exporting to BibTeX. There was a bug when generating the citation
key: the string ` was not excluded from the key, although this character
is not allowed in a BibTeX key. While there, I also removed the other special characters.
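
The fix in point 1 comes down to how JavaScript's String.prototype.replace treats a regular expression without the g flag: only the first match is replaced. A small illustration follows; the key is written in the braced style that the mapping table appears to use, and the variable names are only for this example.

```javascript
// A table key in the braced style, here the LaTeX form of "à".
const mapped = "\\`{a}";

// Without "g", only the first brace is removed, leaving a stray "}".
const firstOnly = mapped.replace(/[{}]/, "");

// With "g", every brace is removed, giving the bare form used in "Tsouki\`as".
const allBraces = mapped.replace(/[{}]/g, "");

const author = "Tsouki\\`as";
console.log(author.includes(firstOnly)); // false: the stray brace prevents a match
console.log(author.includes(allBraces)); // true: the bare form is found
```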

Someone mentioned that special characters were accepted on his system (Ubuntu). However, they are not accepted on mine, so keys containing them do not seem fully compatible. Plus, I don't think having apostrophes or other strange characters in a BibTeX key is really what the user wants. Finally, it has been suggested that the whole BibTeX key generation should be done differently. I fully agree, but what I suggest is a simple change (one line of code, and it's a simplification rather than something more complex) in the meantime.
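
To make the two positions on key cleaning concrete, here is a minimal sketch of both. Neither function is the translator's actual key-cleaning expression; the names and the whitelist in the second one are assumptions chosen for illustration.

```javascript
// Option A (narrow): drop only the backtick from a generated key.
function stripBacktick(key) {
  return key.replace(/`/g, "");
}

// Option B (strict, hypothetical whitelist): keep letters, digits, "_", "-", ":" and ".".
function stripToWhitelist(key) {
  return key.replace(/[^A-Za-z0-9_:.-]/g, "");
}

console.log(stripBacktick("abdellaoui_`closing_1994"));
console.log(stripToWhitelist("o'brien_`x'_2001"));
```

Option A changes as few existing keys as possible, while option B trades that stability for keys that are more likely to be accepted everywhere.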

There is also a note marked TODO in the patch; feel free to remove it if irrelevant: I noticed a warning in the debug logs about cleanString being deprecated, and I thought you might want to know.

About the dollar sign used in BibTeX to input mathematics: I could not find an adequate solution, so I did not change anything regarding this. Currently (as before the patch) the dollar sign is imported as-is with the mathematical formula, and when re-exported the dollars are changed to literal dollar characters in BibTeX (because Zotero cannot distinguish between an intended dollar sign and a "begin mathematics" markup).

One idea could be to change $blah blah$ to blah blah (or other markup) so that at least when re-exported from Zotero to BibTeX the inverse conversion can be done and users get their dollar signs back (as goes the famous quote, "I want my dollars back"). But it is not very clean and I didn't implement this. I guess it's better to manually deal with these mathematical expressions when needed.
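
For illustration only, the round trip described above might look like the following sketch. It is not part of the patch. The `<math>` delimiters and both function names are hypothetical, and edge cases (nested dollars, adjacent formulas, an escaped dollar inside a formula) are not handled.

```javascript
// Hypothetical delimiters; any markup that cannot occur in normal text would do.
const OPEN = "<math>";
const CLOSE = "</math>";

// Import direction: unescaped $...$ pairs become marked spans, \$ becomes a plain $.
function importMath(field) {
  return field
    .replace(/(^|[^\\])\$([^$]+)\$/g, (_, pre, body) => pre + OPEN + body + CLOSE)
    .replace(/\\\$/g, () => "$");
}

// Export direction: marked spans turn back into $...$, every other $ is escaped.
function exportMath(field) {
  return field
    .split(/(<math>.*?<\/math>)/)
    .map((part) =>
      part.startsWith(OPEN) && part.endsWith(CLOSE)
        ? "$" + part.slice(OPEN.length, part.length - CLOSE.length) + "$"
        : part.replace(/\$/g, () => "\\$")
    )
    .join("");
}

const source = "Cost \\$5 for $x^2$";
console.log(exportMath(importMath(source)) === source);
```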

noksagt · May 3, 2010

Exporting to BibTeX. There was a bug when generating the citation
key: the string ` was not excluded from the key, although this character
is not allowed in a BibTeX key. While there, I also removed the other special characters.

Someone mentioned that special characters were accepted on his system (ubuntu). However, it is not on mine thus it does not seem fully compatible. Plus, I don't think having apostrophes or other strange characters in a bibtex key is really what the user wants.

It is very common to have dashes and underscores in methodically generated keys (as well as some of the so-called special characters you removed). I disagree with your patch to citeKeyCleanRe. The rest of it looks fine. Again, I haven't tested.

olivier.cailloux · May 12, 2010

Mmmh, the automatic e-mail when the thread is updated does not seem to work for me; I did not see your comment.

What should be done about these special characters? If the suggestion is to leave the key generation as is (thus implying that it is up to me to solve my latex system not accepting these special characters in keys - it is after all possible that there is a misconfiguration somewhere on my box), then I guess it is a simple matter of integrating my patch but excluding the last change (the one to the key generation).

ajlyon · May 12, 2010

If the ` character is really not valid, then it should be excluded. Other symbols should probably be left for now-- if they are shown to cause trouble, then they can be removed in subsequent patches.

noksagt · May 12, 2010

If the ` character is really not valid, then it should be excluded.

I don't think it is. This works for me with no errors or warnings on multiple platforms:

test.bib:

@inProceedings{`,
author={M. Myself},
title={The `best' method: {A} ``final'' method to prove that my method
is the very best.},
booktitle={My very best accented \'{e}xp\'er\'{\i}ments},
editor={B. Myself},
publisher={Myself Publishing Inc.},
pages={141-155},
year={1994}
}

test.tex:

\documentclass{article}
\begin{document}
Test~\cite{`}
\bibliographystyle{unsrt}
\bibliography{test}
\end{document}

olivier.cailloux · May 26, 2010

... then I guess what's left to do is a simple matter of integrating my patch but excluding the last change (the one to the key generation).

ajlyon · May 26, 2010

This is one of the many cases where Zotero developers have been less than prompt in integrating patches.

[Off-topic, but important:
As I've mentioned before on zotero-dev, translator review should be done in a more transparent way that encourages authors. It gets a little old writing patches that moulder in the forums and Google Group without being committed or rejected.

If Zotero introduces a system of review and pushing, where patches need 1-2 reviewers then they can be pushed by non-core devs (or core devs will promptly push), I would be much, much happier. There are several of us who would, I'm sure, be glad to serve as reviewers.]

dstillman · May 26, 2010

FWIW, on all things BibTeX, I pretty much wait for noksagt to give the green light. It's not clear to me that there's even consensus on this issue. If there's a final patch, point me to it and I'll commit it.

noksagt · May 26, 2010

This is one of the many cases where Zotero developers have been less than prompt in integrating patches.

I wouldn't overstate this. The patch in question is fairly simple, but implements multiple changes. Two of these changes are contentious (using BibTeX entities on UTF-8 export for quote characters and removing several valid characters from generated keys). There has not yet been a revised patch to perform only 1-3 in olivier.cailloux's Mar. 29 posting & he only responded to my comments about the parts of the patch that should be removed today. (He also didn't submit a unified diff, so can't be applied to the current trunk version.)

I'd personally be fine with the following:


diff --git a/translators/BibTeX.js b/translators/BibTeX.js
index 2775430..74991e1 100644
--- a/translators/BibTeX.js
+++ b/translators/BibTeX.js
@@ -1111,6 +1111,8 @@ var reversemappingTable = {
     "{\\textunderscore}"              : "\u2017", // DOUBLE LOW LINE
     "{\\textquoteleft}"               : "\u2018", // LEFT SINGLE QUOTATION MARK
     "{\\textquoteright}"              : "\u2019", // RIGHT SINGLE QUOTATION MARK
+    "`"                               : "\u2018", // LEFT SINGLE QUOTATION MARK
+    "'"                               : "\u2019", // RIGHT SINGLE QUOTATION MARK
     "{\\quotesinglbase}"              : "\u201A", // SINGLE LOW-9 QUOTATION MARK
     "{\\textquotedblleft}"            : "\u201C", // LEFT DOUBLE QUOTATION MARK
     "{\\textquotedblright}"           : "\u201D", // RIGHT DOUBLE QUOTATION MARK
@@ -1687,14 +1689,14 @@ function getFieldValue(read) {
        
        if(value.length > 1) {
                // replace accented characters (yucky slow)
-               value = value.replace(/{(\\[`"'^~=a-z])([A-Za-z])}/g, "$1{$2}");
+               value = value.replace(/{?(\\[`"'^~=a-z]){?\\?([A-Za-z])}/g, "$1{$2}");
                for (var mapped in reversemappingTable) { // really really slow!
                        var unicode = reversemappingTable[mapped];
                        if (value.indexOf(mapped) != -1) {
                                Zotero.debug("Replace " + mapped + " in " + value + " with " + unicode);
                                value = value.replace(mapped, unicode, "g");
                        }
-                       mapped = mapped.replace(/[{}]/, "");
+                       mapped = mapped.replace(/[{}]/g, "");
                        if (value.indexOf(mapped) != -1) {
                                Zotero.debug("Replace(2) " + mapped + " in " + value + " with " + unicode);
                                value = value.replace(mapped, unicode, "g")
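
The effect of the two import-side changes in this diff can be tried outside Zotero with a minimal sketch. Several assumptions apply: the three-entry table is a hypothetical excerpt rather than the real reversemappingTable, the braces in the regex are escaped for readability, and split/join stands in for value.replace(mapped, unicode, "g"), whose third argument is, as far as I know, a non-standard Firefox extension. The function name `decodeAccents` is also hypothetical.

```javascript
// Tiny, hypothetical excerpt of a LaTeX-to-Unicode table (the real one is far larger).
const reverseMappingSample = {
  "\\`{a}": "\u00E0", // à
  "\\'{e}": "\u00E9", // é
  "\\'{i}": "\u00ED", // í
};

function decodeAccents(value) {
  // Normalise {\'e} and \'{\i} to the \'{e} style used by the table keys.
  value = value.replace(/\{?(\\[`"'^~=a-z])\{?\\?([A-Za-z])\}/g, "$1{$2}");
  for (const [mapped, unicode] of Object.entries(reverseMappingSample)) {
    // Replace every occurrence of the braced form.
    value = value.split(mapped).join(unicode);
    // Then try the key with all braces removed (the "g" fix), e.g. \'e.
    const bare = mapped.replace(/[{}]/g, "");
    value = value.split(bare).join(unicode);
  }
  return value;
}

console.log(decodeAccents("My very best accented \\'{e}xp\\'er\\'{\\i}ments"));
console.log(decodeAccents("Alexis Tsouki\\`as"));
```

With this sample table, all three accent styles in the test title and the unbraced form in the author name are decoded.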

ajlyon · May 26, 2010

Thanks for the detailed explanation of what still needs to be done.

I am mainly concerned that the current set of processes don't lend themselves to good developer feedback and encouraging new contributors. Problems are discussed here, which lead to patches posted to the files section of zotero-dev, and referred to in a separate thread there. They might also be referenced in an issue in Trac. The status of a patch and its various incarnations can only be followed by combining forum and zotero-dev discussions, then finding the patches in the Files section of zotero-dev.

Can't we just do all this in Trac?

[Again, apologies for hijacking this thread, but I just realized that I have a half-dozen patches/translators in limbo, and a less patient casual hacker would have long ago stopped bothering to submit them.]

fbennett · May 26, 2010

[I agree with ajlyon. With a hat-tip to Randy Newman, and no disrespect to Utah, what we have at the moment is a Beehive State where "nobody seems to know".]

dstillman · May 27, 2010

[We're currently without a dedicated translator developer, which is why things have been slow lately. A new interface for better community-based development of translators is in the works. Feel free to bump neglected things in the meantime (as you both have been doing).]

olivier.cailloux · May 29, 2010

"I am mainly concerned that the current set of processes don't lend themselves to good developer feedback and encouraging new contributors." --ajlyon

Thanks a lot for these comments. I was beginning to feel a bit lonely ;-). Indeed I have been disappointed about the lack of reaction and the feeling that I can do nothing but wait (forever?) for someone to integrate the patch... I had thus decided to stop submitting patches for Zotero in the future. (I prefer to work with people who welcome patches and help developers help them.)

That said, I understand that it's not easy to react timely, that misunderstandings can occur, etc. I am simply happy to read that people understand that the current situation is not perfect and are thinking about it.

"There has not yet been a revised patch to perform only 1-3 in olivier.cailloux's Mar. 29 posting & he only responded to my comments about the parts of the patch that should be removed today. (He also didn't submit a unified diff, so can't be applied to the current trunk version.)"

=> that illustrates, IMHO, a problem in communication (probably partly because of me). I did not know what was expected from me. I did not notice I forgot to answer some comments (even now I don't know which ones). And I don't know what a "unified diff" is.

Note that I'm not blaming anyone or complaining, just mentioning how true the first sentence by ajlyon is.

dstillman · May 29, 2010

I haven't been following the above, but if I import the test input from noksagt above via the clipboard, I get "My very best accented \’exp\’er\’iments" in the Proceedings Title field. If I then export that, I get this in the .bib file:

booktitle = {My very best accented {\textbackslash}’exp{\textbackslash}’er{\textbackslash}’iments},

If I then import that via the clipboard, I get "My very best accented \textbackslash’exp\textbackslash’er\textbackslash’iments" in Proceedings Title.

Is this all expected?

olivier.cailloux · May 29, 2010

... Not expected, at least not after applying the patch: this is what it's supposed to correct!

On my box (i.e., patch applied), I get (after importing the test case from the clipboard) the proceedings title "My very best accented éxpéríments" (correct accents) ; and expected export.

*Before* applying the patch (i.e. with the BibTeX.js file as provided by the current zotero version), I also get something slightly different than you: the proceedings title in zotero is "My very best accented éxp\'er\'ıments".

dstillman · May 29, 2010

After applying the patch quoted by noksagt above?

fbennett · May 29, 2010

Before applying the patch, I get the "Before" behavior described by Olivier: only the first of the accented characters is transformed.

noksagt's version of the patch wouldn't apply on my system (patch reports it as corrupted, maybe something to do with C&P). So I put in the changes by hand. The editor I used generated a backup copy BibTeX.js~, which Zotero preferred to the edited version. Deleting this file got the patched version to load.

On my system (Linux), it behaves as Olivier describes, both with the test data posted by noksagt, and with the original test data posted by Olivier.

I have prepared a unified diff that satisfies noksagt's requirements to zotero-dev. A notice of the patch, with a link to the patch itself, is here:

http://groups.google.com/group/zotero-dev/browse_thread/thread/dfa542b5e643e505#