Import fails, then generates duplicates
Hello,
I am trying to import a BIB file you can find here [http://www.lamsade.dauphine.fr/mcda/biblio/index.html] (link down the page, export Bibtex, or direct link [http://www.lamsade.dauphine.fr/mcda/biblio/Biblio/complete-bibliography.bib]).
I converted the input BIB file to UTF-8 (the original is in LATIN1) with iconv (on Debian). In Zotero, I created a collection for that purpose (empty at first). I selected it and clicked on the wheel icon / import / bibtex. Zotero reported that errors were encountered while importing. I observe the following:
- only some of the entries (280; the original file seems to contain over 5k entries) seem to be imported correctly.
- however, they are placed inside my "normal" default personal collection.
- I move them to the other one (they get copied instead, but that's not related I guess).
- Another big problem: there are plenty of duplicates, which are not visible in the Zotero GUI but appear when I try to re-export to a BibTeX file. The generated BibTeX file has many more than the 280 entries shown, and contains plenty of duplicates. E.g., the entries corresponding to keys auriol_robust_2007, auriol_robust_2007-1, auriol_robust_2007-2, auriol_robust_2007-3 are identical (the original imported file contains one entry referencing the author Auriol).
Any idea what's going wrong?
We'll take a look at the import issue.
1) Accents encoded in the imported BibTeX file using backslashes, e.g., Tsouki\`as for Tsoukiàs (which is a correct BibTeX way to do it, AFAIK [http://www.bibtex.org/SpecialSymbols/]), are imported as-is with no interpretation of the encoding, e.g., I get an author named Tsouki\`as instead of Tsoukiàs. Naturally, when re-exporting it is all messed up. See for example the entry "@INBOOK{Tsoukias06inbook,
author={Yannis Dimopoulos and Pavlos Moraitis and Alexis Tsouki\`as}"
2) When re-exporting, the key generation algorithm does not properly account for special characters, see for example the entry (after re-export)
"@inproceedings{abdellaoui_`closing_1994,
title = {The `closing in' method: An experimental tool to investigate individual choice patterns under risk}".
3) The original file seems to contain 5,898 entries (as per "grep -c '^@[^{]*{' complete-bibliography-utf8.bib"). After import, I see 5,899 entries in my library (in Zotero I select all entries in the just-created collection and it says "5899 selected entries" (translated from French)). Unfortunately I have no idea how I could discover which one has been created / split / duplicated.
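One way to start narrowing down the off-by-one count, independently of Zotero, is to list the citation keys in the source file and look for repeats. The following is only a minimal sketch: extractKeys and duplicateKeys are hypothetical helper names, and the regexp is an assumption modelled on the grep pattern above. Unlike the grep, it skips @string, @preamble and @comment blocks, which are not bibliography entries.

```javascript
// Hypothetical helper: collect the citation key of every "@type{key," entry.
function extractKeys(bibText) {
  const keys = [];
  // Same idea as the grep pattern, but capturing the type and the key.
  const entryStart = /^@([^{\s]+)\s*\{\s*([^,\s]*)/gm;
  let match;
  while ((match = entryStart.exec(bibText)) !== null) {
    const type = match[1].toLowerCase();
    // These blocks start with "@" but are not bibliography entries.
    if (type === "string" || type === "preamble" || type === "comment") continue;
    keys.push(match[2]);
  }
  return keys;
}

// Hypothetical helper: return the keys that occur more than once.
function duplicateKeys(keys) {
  const counts = new Map();
  for (const key of keys) counts.set(key, (counts.get(key) || 0) + 1);
  return [...counts].filter(([, n]) => n > 1).map(([key]) => key);
}
```

Under Node, the file could be read with fs.readFileSync("complete-bibliography-utf8.bib", "utf8") and passed to extractKeys. This will not by itself show what Zotero did with an entry, since the keys are regenerated on export, but it does tell whether the grep count and the number of real entries agree.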
About the duplicates problem, it was indeed due to the export of other unwanted collections. When selecting only those 5k+ entries I want and right-clicking 'export', I do not see the duplicates any more.
Once again thank you for your help.
If nobody has time to work on solving these bugs now, I understand that, Zotero being a free project. But shouldn't a bug report be opened to keep track of what is wrong? I followed http://www.zotero.org/support/reporting_bugs but as Zotero does not offer to send an error report I can't find a way to open a bug report. What more can I do?
I have observed that the Zotero team usually takes this kind of issue seriously, so I am a bit surprised to see no answer to my post... Is it because it is considered unimportant? (I think it is very likely that people trying to import large BibTeX files in foreign languages with accents will encounter this bug sooner or later.) Or would it be very difficult to solve? (I have the feeling it would not be.) Please tell me if I am wrong. Or maybe these are not bugs, in which case I'd be very happy to know why...
We are thinking about switching the maintenance of our MCDA bibliography to Zotero instead of a BibTeX file as it is currently, but we need to have good BibTeX export to be able to do that (for legacy reasons). I think this move would be very beneficial for everyone (for our research community as well as for the ease of maintenance), but this issue is currently blocking.
Any help would be appreciated!
Can't help you much with the off-by-one reference issue. I haven't seen it. Import the two files into JabRef & see if anything sticks out?
=> yes, I'll try to do that. But that can't be done manually; it requires writing a script (as the input file is huge). Not extremely difficult, but correcting Zotero would be cleaner and more beneficial to everyone. Can I change the mapping table Zotero uses somehow? I guess that would be even simpler than writing an ad-hoc script for transforming the input file. (Then I could submit the patch, and if it is integrated in Zotero it would be beneficial to everybody.)
2. The generated key, in the example, is abdellaoui_`closing_1994. This is an incorrect Bibtex key (needs to be corrected manually before running bibtex). Also even if it was a correct key I don't feel it's a very practical key to use, but anyway the main problem is that bibtex does not parse it.
Thanks for answering, I was beginning to feel lonely ;-).
Patches welcome. You can post them to zotero-dev.
(I guess that's gentle again...)
Is it possible to become a member using my usual e-mail address (which is not from google)?
First, a test case. (Simply copy it to the clipboard and import it into
Zotero.)
@inProceedings{akey,
author={M. Myself},
title={The `best' method: {A} ``final'' method to prove that my method
is the very best.},
booktitle={My very best accented \'{e}xp\'er\'{\i}ments},
editor={B. Myself},
publisher={Myself Publishing Inc.},
pages={141-155},
year={1994}
}
1) Importing from BibTeX. When transforming an input string containing an
encoded accented character (such as "Tsouki\`as"), the character was left
as is (resulting in "Tsouki\`as" instead of the expected "Tsoukiàs").
(This was simply a 'g' flag missing from mapped = mapped.replace(/[{}]/, "");)
2) Importing from BibTeX. One regexp has been changed to account for patterns such as \'{\i}.
3) Importing from BibTeX. BibTeX quotes (` and ') are now transformed to
unicode English quotes.
4) Exporting to BibTeX. There was a bug when generating the citation
key: the string ` was not excluded from the key, although this character
is not allowed in a BibTeX key. While there, I also removed the other special characters.
Someone mentioned that special characters were accepted on his system (Ubuntu). However, they are not accepted on mine, so such keys do not seem fully portable. Plus, I don't think having apostrophes or other strange characters in a BibTeX key is really what the user wants. Finally, it has been suggested that the whole BibTeX key generation should be done differently. I fully agree, but what I suggest is a simple change (one line of code, and it's a simplification rather than something more complex) in the meantime.
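To make import items 1 to 3 concrete, here is a minimal stand-alone sketch of the decoding step. It is not the translator's code: decodeBibtexValue and sampleTable are hypothetical names, the table holds only a handful of illustrative entries (the double-quote entries in particular are an assumption), and plain split/join is used for the global replacement. The regexp is the one from the patch. Note that in this sketch the accent entries must come before the bare quote entries, otherwise the ' in \'e would be turned into a quotation mark first.

```javascript
// Illustrative sample only; the real reversemappingTable is far larger.
const sampleTable = {
  "\\`{a}": "\u00E0", // a with grave accent
  "\\'{e}": "\u00E9", // e with acute accent
  "\\'{i}": "\u00ED", // i with acute accent
  "``": "\u201C",     // assumed entry: left double quotation mark
  "''": "\u201D",     // assumed entry: right double quotation mark
  "`": "\u2018",      // left single quotation mark
  "'": "\u2019",      // right single quotation mark
};

function decodeBibtexValue(value, table) {
  // Normalise {\'e} and \'{\i} to the \'{e} form used by the table keys.
  value = value.replace(/{?(\\[`"'^~=a-z]){?\\?([A-Za-z])}/g, "$1{$2}");
  for (const mapped of Object.keys(table)) {
    const unicode = table[mapped];
    // Braced form, e.g. \'{e}
    value = value.split(mapped).join(unicode);
    // Brace-less form, e.g. \'e (needs BOTH braces stripped, hence the g flag).
    const bare = mapped.replace(/[{}]/g, "");
    if (bare !== mapped && bare.length > 0) {
      value = value.split(bare).join(unicode);
    }
  }
  return value;
}

// The effect of the missing flag on a single table key:
const withoutFlag = "\\'{e}".replace(/[{}]/, "");  // leaves the closing brace: \'e}
const withFlag = "\\'{e}".replace(/[{}]/g, "");    // strips both braces: \'e
```

With this sample table, the booktitle of the test case decodes to "My very best accented éxpéríments" and "Tsouki\`as" to "Tsoukiàs". Protective braces such as {A} are left untouched here; this sketch does not try to handle them.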
There is also a note marked TODO in the patch; feel free to remove it
if irrelevant. I noticed a warning in the debug logs about cleanString being deprecated, and I thought you might want to know about it.
About the dollar sign used in BibTeX to enter mathematics: I could not find an adequate solution, so I did not change anything in this respect. Currently (as before the patch) the dollar signs are imported as-is along with the mathematical formula, and on re-export they are written out as literal dollar characters in BibTeX (because Zotero cannot distinguish between an intended dollar sign and a "begin mathematics" markup).
One idea could be to change $blah blah$ to blah blah (or other markup) so that, at least when re-exporting from Zotero to BibTeX, the inverse conversion can be done and users can get their dollar signs back (as the famous quote goes, "I want my dollars back"). But it is not very clean and I didn't implement it. I guess it's better to deal with these mathematical expressions manually when needed.
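For illustration only, the round-trip idea could look roughly like the sketch below. Nothing here exists in Zotero: the marker pair is a purely hypothetical placeholder (the post does not settle on a markup), and importValue / exportValue are made-up names. The sketch assumes that an escaped \$ in the BibTeX source means a literal dollar sign and that math spans contain no dollar sign themselves.

```javascript
// HYPOTHETICAL marker pair; any markup that cannot occur in real data would do.
const OPEN = "<math>";
const CLOSE = "</math>";

// BibTeX -> stored value: wrap $...$ spans in markers, then unescape \$.
function importValue(bibtexValue) {
  // (?<!\\) skips dollar signs that are escaped as \$ in the source.
  const marked = bibtexValue.replace(
    /(?<!\\)\$([^$]+?)(?<!\\)\$/g,
    (whole, body) => OPEN + body + CLOSE
  );
  return marked.replace(/\\\$/g, () => "$");
}

// Stored value -> BibTeX: escape literal dollars, then restore math delimiters.
function exportValue(storedValue) {
  const escaped = storedValue.replace(/\$/g, () => "\\$");
  return escaped.split(OPEN).join("$").split(CLOSE).join("$");
}
```

The obvious weakness, in line with the "not very clean" remark above, is that a title which literally contains the marker text would be mangled on export.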
What should be done about these special characters? If the suggestion is to leave the key generation as is (thus implying that it is up to me to solve my latex system not accepting these special characters in keys - it is after all possible that there is a misconfiguration somewhere on my box), then I guess it is a simple matter of integrating my patch but excluding the last change (the one to the key generation).
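As a point of comparison for this discussion, here is a minimal sketch of a key generator that only ever emits lowercase ASCII letters, digits, underscores and hyphens. It is not Zotero's algorithm: buildCiteKey is a hypothetical name, the author_word_year shape simply imitates the keys quoted earlier in this thread, and the stop-word list is an assumption made for the example.

```javascript
// Hypothetical sketch: author_firstword_year restricted to [a-z0-9_-].
function buildCiteKey(authorLastName, title, year) {
  const stopWords = new Set(["a", "an", "the"]); // assumed list, for illustration

  // Decompose accented letters (NFD) so that stripping non-ASCII characters
  // keeps the base letter, e.g. "à" becomes "a" instead of vanishing.
  const fold = (s) => s.normalize("NFD").toLowerCase();

  const firstWord = fold(title)
    .split(/\s+/)
    .map((word) => word.replace(/[^a-z0-9]/g, ""))
    .find((word) => word.length > 0 && !stopWords.has(word)) || "";

  const raw = [fold(authorLastName), firstWord, String(year)].join("_");
  // Drop anything BibTeX might choke on, including ` and '.
  return raw.replace(/[^a-z0-9_-]/g, "");
}
```

For the entry quoted earlier, buildCiteKey("Abdellaoui", "The `closing in' method: An experimental tool to investigate individual choice patterns under risk", 1994) gives abdellaoui_closing_1994, with no backquote in the key.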
test.bib:
@inProceedings{`,
author={M. Myself},
title={The `best' method: {A} ``final'' method to prove that my method
is the very best.},
booktitle={My very best accented \'{e}xp\'er\'{\i}ments},
editor={B. Myself},
publisher={Myself Publishing Inc.},
pages={141-155},
year={1994}
}
test.tex:
\documentclass{article}
\begin{document}
Test~\cite{`}
\bibliographystyle{unsrt}
\bibliography{test}
\end{document}
[Off-topic, but important:
As I've mentioned before on zotero-dev, translator review should be done in a more transparent way that encourages authors. It gets a little old writing patches that moulder in the forums and Google Group without being committed or rejected.
If Zotero introduces a system of review and pushing, where patches need 1-2 reviewers then they can be pushed by non-core devs (or core devs will promptly push), I would be much, much happier. There are several of us who would, I'm sure, be glad to serve as reviewers.]
I'd personally be fine with the following:
diff --git a/translators/BibTeX.js b/translators/BibTeX.js
index 2775430..74991e1 100644
--- a/translators/BibTeX.js
+++ b/translators/BibTeX.js
@@ -1111,6 +1111,8 @@ var reversemappingTable = {
"{\\textunderscore}" : "\u2017", // DOUBLE LOW LINE
"{\\textquoteleft}" : "\u2018", // LEFT SINGLE QUOTATION MARK
"{\\textquoteright}" : "\u2019", // RIGHT SINGLE QUOTATION MARK
+ "`" : "\u2018", // LEFT SINGLE QUOTATION MARK
+ "'" : "\u2019", // RIGHT SINGLE QUOTATION MARK
"{\\quotesinglbase}" : "\u201A", // SINGLE LOW-9 QUOTATION MARK
"{\\textquotedblleft}" : "\u201C", // LEFT DOUBLE QUOTATION MARK
"{\\textquotedblright}" : "\u201D", // RIGHT DOUBLE QUOTATION MARK
@@ -1687,14 +1689,14 @@ function getFieldValue(read) {
if(value.length > 1) {
// replace accented characters (yucky slow)
- value = value.replace(/{(\\[`"'^~=a-z])([A-Za-z])}/g, "$1{$2}");
+ value = value.replace(/{?(\\[`"'^~=a-z]){?\\?([A-Za-z])}/g, "$1{$2}");
for (var mapped in reversemappingTable) { // really really slow!
var unicode = reversemappingTable[mapped];
if (value.indexOf(mapped) != -1) {
Zotero.debug("Replace " + mapped + " in " + value + " with " + unicode);
value = value.replace(mapped, unicode, "g");
}
- mapped = mapped.replace(/[{}]/, "");
+ mapped = mapped.replace(/[{}]/g, "");
if (value.indexOf(mapped) != -1) {
Zotero.debug("Replace(2) " + mapped + " in " + value + " with " + unicode);
value = value.replace(mapped, unicode, "g")
I am mainly concerned that the current set of processes don't lend themselves to good developer feedback and encouraging new contributors. Problems are discussed here, which lead to patches posted to the files section of zotero-dev, and referred to in a separate thread there. They might also be referenced in an issue in Trac. The status of a patch and its various incarnations can only be followed by combining forum and zotero-dev discussions, then finding the patches in the Files section of zotero-dev.
Can't we just do all this in Trac?
[Again, apologies for hijacking this thread, but I just realized that I have a half-dozen patches/translators in limbo, and a less patient casual hacker would have long ago stopped bothering to submit them.]
Thanks a lot for these comments. I was beginning to feel a bit lonely ;-). Indeed I have been disappointed by the lack of reaction and the feeling that I can do nothing but wait (forever?) for someone to integrate the patch... I had thus decided to stop submitting patches for Zotero in the future. (I prefer to work with people who welcome patches and help developers help them.)
That said, I understand that it's not easy to react timely, that misunderstandings can occur, etc. I am simply happy to read that people understand that the current situation is not perfect and are thinking about it.
"There has not yet been a revised patch to perform only 1-3 in olivier.cailloux's Mar. 29 posting & he only responded to my comments about the parts of the patch that should be removed today. (He also didn't submit a unified diff, so can't be applied to the current trunk version.)"
=> that illustrates, IMHO, a problem in communication (probably partly because of me). I did not know what was expected from me. I did not notice I forgot to answer some comments (even now I don't know which ones). And I don't know what a "unified diff" is.
Note that I'm not blaming anyone or complaining, just mentioning how true the first sentence by ajlyon is.
booktitle = {My very best accented {\textbackslash}’exp{\textbackslash}’er{\textbackslash}’iments},
If I then import that via the clipboard, I get "My very best accented \textbackslash’exp\textbackslash’er\textbackslash’iments" in Proceedings Title.
Is this all expected?
On my box (i.e., patch applied), I get (after importing the test case from the clipboard) the proceedings title "My very best accented éxpéríments" (correct accents), and the expected export.
*Before* applying the patch (i.e., with the BibTeX.js file as provided by the current Zotero version), I also get something slightly different from what you get: the proceedings title in Zotero is "My very best accented éxp\'er\'ıments".
noksagt's version of the patch wouldn't apply on my system (patch reports it as corrupted, maybe something to do with C&P). So I put in the changes by hand. The editor I used generated a backup copy BibTeX.js~, which Zotero preferred to the edited version. Deleting this file got the patched version to load.
On my system (Linux), it behaves as Olivier describes, both with the test data posted by noksagt, and with the original test data posted by Olivier.
I have prepared a unified diff that satisfies noksagt's requirements and posted it to zotero-dev. A notice of the patch, with a link to the patch itself, is here:
http://groups.google.com/group/zotero-dev/browse_thread/thread/dfa542b5e643e505#