Porting Word citations to LaTeX with \cite{}

edited September 28, 2018
I was trying to convert a Word document, with citations created using the Zotero plug-in, to a LaTeX format that I could just copy-paste across.

I opened an arbitrary .csl file just to see if I had a chance of doing it.

My question: is there a "name variable" in that style editor that would allow me to put the BBT "Citation key" field in the in-text citations? (For reference, other examples are things like "author", "title" and "editor"; there's a whole list on page 30 here: https://media.readthedocs.org/pdf/citation-style-language/1.0-20100321/citation-style-language.pdf) I tried loads of things to see if I got lucky. I have a feeling this is impossible, but thought I'd ask.

One possible solution I don't know how to implement is to batch convert the citation keys of BBT to an unused field that is common to most formats, e.g. short title or something like that - which I could then reference in the csl file. Any ideas?

Cheers.
  • Maybe @emilianoheyns, who created BBT, has an idea, but I'm not aware of any way to do this until Zotero has proper citekey support:
    - BBT citekeys are not mapped to CSL
    - There is no way to access them via Zotero's server API that I know of (though this is the part that I may be wrong about)
    - There is no reasonably accessible local API for Zotero

    If there is a way to access them via the server API, it would be possible to copy them over to a different field using pyzotero
    https://github.com/urschrei/pyzotero
  • Thanks for the response. I figured it was a lost cause. I have to transplant my PhD thesis from Word to LaTeX in three days, so I haven't got time to investigate much myself. But if anyone does find out how to do this, I'm sure it would be very useful in the future...
  • I'm expecting citekeys in Zotero 5.1 (or 6.0 depending on versioning) and then this will be trivial. Until then any solution will be rather involved.
  • My recommendation for you in your current situation would be to convert your citations/bibliography to regular text (with the Unlink Citations button in Word), then use Pandoc (https://pandoc.org/) to convert the Word document to LaTeX. As adamsmith says, there isn't currently a very reasonable way to get from Zotero's Word citations to a stable BibTeX citekey.
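
    The Pandoc step suggested above could be scripted roughly as follows. This is a minimal sketch, assuming pandoc is installed and on the PATH; the file names are hypothetical placeholders, and the citations must already have been unlinked in Word, since this does nothing to recover citekeys.

```python
import shutil
import subprocess


def build_pandoc_command(docx_path: str, tex_path: str) -> list[str]:
    """Return a pandoc invocation that converts a .docx file to standalone LaTeX."""
    # -f/-t select the input and output formats, -s asks for a complete
    # document (preamble included), -o names the output file.
    return ["pandoc", "-f", "docx", "-t", "latex", "-s", "-o", tex_path, docx_path]


def convert_docx_to_latex(docx_path: str, tex_path: str) -> bool:
    """Run pandoc if it is available; return True when the conversion ran."""
    if shutil.which("pandoc") is None:
        return False  # pandoc is not installed, nothing was converted
    subprocess.run(build_pandoc_command(docx_path, tex_path), check=True)
    return True


if __name__ == "__main__":
    # Hypothetical file names, printed rather than executed.
    print(" ".join(build_pandoc_command("thesis.docx", "thesis.tex")))
```

    The result is plain LaTeX with the citations as ordinary text, so any \cite{} commands would still have to be put in by hand.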
  • edited September 28, 2018
    "BBT citekeys are not mapped to CSL"
    But they can be via a BBT trick:
    https://retorque.re/zotero-better-bibtex/configuration/#citeprocnotecitekey
    "There is no way to access them via Zotero's server API that I know of (though this is the part that I may be wrong about)"
    This is correct. BBT keys are not available via the API, nor can they be made available. BBT relies on a running Zotero to do its work.
    "There is no reasonably accessible local API for Zotero"
    BBT does have some APIs, but I'm not sure how you'd want to use them.

    edit: ugh it would be so nice to have markdown editing here.
  • Wow. That’s incredibly helpful.
  • The post-edit is probably a lot more helpful than the single link I accidentally submitted while meaning to preview :)
  • Sorry to revive this thread but I too am trying to work in the space between Word and LaTeX...

    Given that BBT has citekeys, is it not then trivial to create a CSL that wraps the references in '\cite{' and '}'? I notice the 'BibTeX generic citation style' *almost* does it, but just spits out the citekey instead of wrapping it.

    Am I right in thinking that's all it would take, or am I missing something? Does one even need BBT for this, given that Zotero's built-in citekeys are sufficient for most purposes (the chances of a conflicting author/year are quite slim)?
  • You do need BBT because Zotero doesn't have built-in citekeys, so the citekey in the CSL style and in Zotero's exported bibtex isn't necessarily the same.

    You could indeed easily modify the existing style, yes. BBT actually offers an option, linked in the description of the citeproc functionality:
    https://retorque.re/zotero-better-bibtex/installation/preferences/hidden-preferences/#citeprocnotecitekey
    It's also not quite what you want, but likely easier to work with than the regular CSL one.
  • Ah great, thanks @adamsmith

    One further question -- if I receive a document from someone containing Zotero references to sources I don't have in my local library, is that a problem? (I'm trying to create a publishing workflow for a journal, and I won't obviously have all the sources that authors use).
  • yeah, that won't work well, I'm afraid.
    In that case you'd have to just work with the CSL Bibtex style entirely, which is not without its flaws.
    To get high-quality bibtex, you need the items in Zotero (from which you can then export them) but they'd have a different internal ID and wouldn't be linked to the citations in Word.
  • Would the reference extractor be capable of extracting data of sufficient quality to build a toolchain around?
  • the data quality is fine, but I can't think of a way to update the citation metadata in the document? You'd have to do:
    - extract data
    - import into Zotero
    - add citekeys
    - replace existing citations with newly imported ones
    - Apply bibtex citation style with BBT citekeys

    It's the penultimate step that I think would be very tricky to do.
  • Yeah, you're right, that would easily get into fuzzy matching and that kind of ugliness. Not a great fan other than for tech proof of concepts.

    BTW wrt @laurence80386
    "the chances of a conflicting author/year are quite slim"
    when you say "putting together a workflow for a journal", do you mean you writing for a journal, or you working at a journal and managing the influx of documents? Because if the latter, at some volume "slim chances" become near-certainties.
  • I'm meaning publishing. Basically we're (notionally) accepting both Word and LaTeX docs, but I want to use LaTeX for typesetting so need to copy the Word documents' text into LaTeX for that purpose -- which means using BibTeX to ensure the links between references and the bibliography are automatically generated.

    Once the PDF is camera-ready the originals can be archived; the only place where a conflict is possible is within a single document, and if Smith has two publications in 2009 (i.e. two separate instances of \cite{smith_2009}) then that can be detected at proofing (I guess?).
  • edited February 5, 2020
    For more context (again sorry for hijacking the thread...) I'd like to receive Zotero-based Word files, convert the CSL to BibTeX (including the '\cite{...}'), copy that text into LaTeX, then compile the PDF. Does that sound plausible?

    My main concern (but there may be others) is if I don't have the references in my local library -- do they travel with the Word document, in at least sufficient detail to be able to change CSL for the purposes of the workflow I've just described?
  • edited February 6, 2020

    "the only place where a conflict is possible is within a single document, and if Smith has two publications in 2009 (i.e. two separate instances of \cite{smith_2009}) then that can be detected at proofing (I guess?)"

    Not sure that's correct. An article could cite smith_2009 twice, or cite two separate smiths that both published in 2009, or one smith that published two articles in 2009.

    the references do travel with the Word document (which is why the ref extractor can extract them) and you could, while extracting, save the item ID or key (I think one of them is in there) in the extra field somehow, replace the cite in the doc with \cite{key or item ID}, and then stitch that together somehow (whether that goes through Zotero or not is a different matter).

    As an author, I'd be pretty nervous that either I or someone not-I would have to check that all cites remain in order, and that the required sentence-case-to-title-case translation has been done properly (not a trivial matter to automate, trust me). On top of that, some people post-edit the in-doc cites to get parencite-equivalent output, and CSL and bib(la)tex don't have a perfect mapping between fields, which means compromises or even data loss during conversion...
  • "the references do travel with the Word document (which is why the ref extractor can extract them) and you could, while extracting, save the item ID or key (I think one of them is in there) in the extra field somehow, replace the cite in the doc with \cite{key or item ID}, and then stitch that together somehow (whether that goes through Zotero or not is a different matter)."
    Getting the item ID out of the citations is actually going to be quite messy. It is indeed in the document, but it is not typically extracted by the reference extractor (we're referring to this app: https://rintze.zelle.me/ref-extractor/ ) so that would require even more scripting. But on the positive side, all citeable data is included in the document and can be used to switch between citation styles.

    Given what you describe, I'd start by simply testing out two different and easy-to-implement workflows:
    1. Completely rely on CSL, including for the formatted bibtex. This will correctly match citekeys and bibtex, but the bibtex itself may have problems. Modifying the existing CSL style to include the cite command will indeed be trivial.
    2. Rely on CSL (again, modified) for the cite command in text, then use the ref extractor to import into Zotero and export using Zotero's (or BBT's) bibtex, set to just include author_year as the citekey. This is more likely to mismatch citation and key, but will give you better/correct bibtex.

    If either of them is good enough for you, you're set with minimal effort. If you need something that performs better, you can follow some of Emiliano's thoughts further -- I think with those, it may be possible to get to a 100% reliable solution (at least for all standard item types) but that'll require much more significant up-front scripting.
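
    On the point that all citeable data travels with the Word document: a minimal sketch of pulling it out of a single citation field might look like this. It assumes the field code has the form "ADDIN ZOTERO_ITEM CSL_CITATION" followed by a JSON object with a "citationItems" list, each entry carrying "itemData" and "uris"; the sample data is a made-up placeholder. In a real .docx the field code can be split over several runs and would need to be joined first, which is not shown here.

```python
import json

# Assumed prefix of a Zotero citation field code in a Word document.
FIELD_PREFIX = "ADDIN ZOTERO_ITEM CSL_CITATION"


def parse_zotero_field(field_code: str) -> list[dict]:
    """Return the CSL-JSON items embedded in one Zotero citation field code."""
    text = field_code.strip()
    if not text.startswith(FIELD_PREFIX):
        return []  # some other Word field, e.g. a page number
    payload = json.loads(text[len(FIELD_PREFIX):])
    items = []
    for cited in payload.get("citationItems", []):
        data = dict(cited.get("itemData", {}))
        # Keep the Zotero item URIs so the entry can be traced back later.
        data["_uris"] = cited.get("uris", [])
        items.append(data)
    return items


# Placeholder field code, not taken from a real document.
SAMPLE_FIELD = (
    'ADDIN ZOTERO_ITEM CSL_CITATION {"citationID":"abc",'
    '"citationItems":[{"id":1,'
    '"uris":["http://zotero.org/users/0/items/AAAA1111"],'
    '"itemData":{"id":1,"type":"article-journal","title":"Placeholder title",'
    '"author":[{"family":"Smith","given":"A."}],'
    '"issued":{"date-parts":[[2009]]}}}]}'
)

if __name__ == "__main__":
    for item in parse_zotero_field(SAMPLE_FIELD):
        print(item["title"], item["_uris"])
```

    The last path segment of each URI is the item key in the author's library, which is the piece that would be needed to stitch \cite{} commands and exported entries back together.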
  • @emilianoheyns
    "Not sure that's correct. An article could cite smith_2009 twice, or cite two separate smiths that both published in 2009, or one smith that published two articles in 2009."
    I think the first of these problems isn't really a problem, particularly since I'm using ACM referencing (which is just a number referencing an item in the bibliography, e.g. '[23]'). Multiple references to the same bibliographical item are simply multiple instances of this, i.e. multiple insertions of `\cite{smith_2009}`. You're right that there's a potential conflict if Zotero's author_title_year key applies to two items, but that seems extremely unlikely, does it not? Two authors (or the same author) would need to have the same name and have written about the same subject within the same year. The likelihood seems remote to me. Even when the same author does have more than one publication in the same year, I'm finding the default title key is reasonable, e.g. `hildebrandt_smart_2015` and `hildebrandt_radbruchs_2015`.

    But maybe I am missing something...?
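
    Whether keys actually collide in a given submission can be checked mechanically rather than at proofing. The following is a minimal sketch, assuming the bibliography is available as BibTeX text; the regular expression is a naive stand-in for a real BibTeX parser, and the entry bodies are placeholders (only the key names come from this thread).

```python
import re
from collections import Counter

# Naive pattern for the start of an entry: "@type{key,".
ENTRY_KEY = re.compile(r"@\w+\s*\{\s*([^,\s]+)\s*,")


def duplicate_citekeys(bibtex: str) -> dict[str, int]:
    """Return every citekey that is defined more than once, with its count."""
    counts = Counter(ENTRY_KEY.findall(bibtex))
    return {key: n for key, n in counts.items() if n > 1}


# Placeholder entries: two different works that both got the key smith_2009.
SAMPLE_BIB = """
@article{smith_2009,
  title = {First placeholder},
}
@book{smith_2009,
  title = {Second placeholder},
}
@article{hildebrandt_smart_2015,
  title = {Third placeholder},
}
@article{hildebrandt_radbruchs_2015,
  title = {Fourth placeholder},
}
"""

if __name__ == "__main__":
    print(duplicate_citekeys(SAMPLE_BIB))
```

    Citing the same item many times in the text is not flagged by this, since only entry definitions are counted; a key that is defined twice is the case that would silently send a citation to the wrong bibliography entry.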
  • @adamsmith

    Unfortunately your first option (relying entirely on BibTeX) won't work, because we're soliciting submissions from both computer science and the humanities, and the latter simply won't be willing to learn how it works.

    As to your second suggestion:
    "2. Rely on CSL (again, modified) for the cite command in text, then use the ref extractor to import into Zotero and export using Zotero's (or BBT's) bibtex, set to just include author_year as the citekey. This is more likely to mismatch citation and key, but will give you better/correct bibtex."
    Can I clarify the process you're suggesting:

    (i) use a BibTeX CSL to format the citations in the document once received from the author (they will use ACM while writing; I will convert to BibTeX for publishing)
    (ii) extract the citations from the document with the ref extractor to get a BibTeX file
    (iii) copy the body text into LaTeX and create the relevant separate BibTeX file
    (iv) link the two and compile

    I'm not sure what you meant about including author_year as citekey. Can this be chosen in Zotero's bibtex? I notice the default is author_titlekeyword_year, as per the examples I gave above -- I presume, per the above, that I can stick with the default?

    Is there any need for step (ii) (extracting the refs and importing them into Zotero)? If I insert a bibliography at the end of the authored document, this will already be in BibTeX, which I can then copy and paste into the bibtex file in LaTeX. Is that right? (This is why I was asking if the authored Word document includes all the metadata about the inserted references, without which this step will of course not work, but it sounds from what you're saying that this will not be a problem.)

    Strange request, but in order that I can test this, would one of you mind uploading a Word document with Zotero references somewhere on the web? This will let me emulate receiving an independently-authored document with references that have no connection with my own personal library, to see if the above process is feasible.

    Thank you for your help so far!!
  • edited February 6, 2020
    I can't help with a Word document -- I don't use Word unless I absolutely must (and I am one of those rare exceptions who writes mostly in humanities, but uses latex for everything).

    WRT extracting bibtex using the ref extractor -- I have looked at what I think is the source of the ref extractor, and I *think* it uses the bibtex-csl style (https://github.com/citation-style-language/styles/blob/master/bibtex.csl) to convert the CSL to bibtex, but if that is correct, it turns an item with title
    JR3: paper not in a group & other
    into
    @article{Author,
    title={JR3: paper not in a group & other},
    }
    and that, aside from the problem that it doesn't title-case the title, simply isn't valid bibtex. And I don't see how CSL could do a better job than this, given what I know about converting text to valid latex and what little I know about what CSL stylers can do to the text.

    For conversion to valid latex, you're looking at either passing it through Zotero, or the closest competition I know, astrocite (even though astrocite doesn't pass my full test suite, and does not title-case).
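
    To make the "isn't valid bibtex" point concrete: the unescaped & in the title above is a LaTeX special character. A minimal sketch of the escaping a converter has to do at the very least is below. It is an illustration only, not what Zotero, BBT or astrocite actually do, and it does not attempt title-casing or brace protection.

```python
import re

# LaTeX special characters and a text-mode replacement for each.
LATEX_SPECIALS = {
    "\\": r"\textbackslash{}",
    "&": r"\&",
    "%": r"\%",
    "$": r"\$",
    "#": r"\#",
    "_": r"\_",
    "{": r"\{",
    "}": r"\}",
    "~": r"\textasciitilde{}",
    "^": r"\textasciicircum{}",
}
_SPECIALS_PATTERN = re.compile("|".join(re.escape(ch) for ch in LATEX_SPECIALS))


def escape_latex(text: str) -> str:
    """Escape LaTeX special characters in a single pass over the text."""
    # One pass means braces introduced by a replacement are never re-escaped.
    return _SPECIALS_PATTERN.sub(lambda m: LATEX_SPECIALS[m.group(0)], text)


def bibtex_title_field(title: str) -> str:
    """Format a title as a BibTeX field line with special characters escaped."""
    return "title = {" + escape_latex(title) + "},"


if __name__ == "__main__":
    print(bibtex_title_field("JR3: paper not in a group & other"))
    # prints: title = {JR3: paper not in a group \& other},
```

    Escaping is only the first of the problems; deciding which words need {braces} to survive a style's case changes is the part that needs knowledge of the title itself.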
  • edited February 6, 2020
    @emilianoheyns

    That's not my experience with BibTeX CSL -- I get things like this:
    @article{rawls_two_1955,
    title = {Two {Concepts} of {Rules}},
    volume = {64},
    issn = {0031-8108},
    url = {https://www.jstor.org/stable/2182230},
    doi = {10.2307/2182230},
    number = {1},
    urldate = {2019-02-13},
    journal = {The Philosophical Review},
    author = {Rawls, John},
    year = {1955},
    keywords = {website-bibliography},
    pages = {3--32}
    }
    which looks pretty comprehensive, no...?
  • My option 1) was rely entirely on CSL, not rely entirely on Bibtex.

    What you have in your example, though, is not Bibtex produced by the CSL style. It's Bibtex exported by Zotero (which is generated by fairly elaborate javascript). It's absolutely crucial that you understand that distinction because it is at the heart of what makes this difficult.

    One of the problems of using the CSL style is, in fact, that you can't use short titles for the citekey, since CSL can't modify individual elements and you'd end up with spaces in the citekey, so you're stuck with just author(s)_year and you'd have to adjust the export (by either using BBT or modifying the javascript) accordingly. This also, of course, makes citekey issues much more likely.
  • I would recommend that rather than doing this in the abstract, you spend 2 hours or so actually experimenting. You'll likely run into many of the issues we're trying to raise here, and it'll be much easier to discuss if and how they can be solved once you have a clearer sense of how this would work for you. Getting to a set-up with a couple of Word documents to test with shouldn't take you very long, so we don't have to do all of this as a whiteboard coding exercise.
  • Apologies @adamsmith, I misunderstood your previous message. Yes I think I need to do some more experimentation to better understand what's happening under the hood -- I'll create a 'clean' file with external citations and see what happens (and doesn't happen) when I follow the procedure I have in mind.
  • As @adamsmith points out, that's not produced by the CSL style, so either I'm mistaken that ref-extractor uses the CSL style (it could technically use an online Zotero translator server, but I see no evidence of that, and it claims everything happens in the browser), or you are indeed passing it through Zotero.

    But {Two {Concepts} of {Rules}}, is still very likely wrong -- it's much more likely that it should be {Two Concepts of Rules},. This may sound nit-picky (OK, it *is* nit-picky), and I don't mean to be belligerent, but if you're the accepting journal, and you're going to potentially change the submitters' bibliography... I don't know, that sounds tricky to me.

    I must admit that my focus in bibliography production is pretty narrow, and cannot easily be accused of being pragmatic.
  • edited February 6, 2020
    "that's not produced by the CSL style, so either I'm mistaken that ref-extractor uses the CSL style (it could technically use an online Zotero translator server, but I see no evidence of that, and it claims everything happens in the browser), or you are indeed passing it through Zotero."
    Reference Extractor doesn't use CSL to generate BibTeX output. It feeds extracted CSL-JSON to the Citation.js library (https://citation.js.org/), and I'm pretty sure its BibTeX generator (https://github.com/citation-js/citation-js/tree/master/packages/plugin-bibtex) doesn't rely on CSL.

    I also just discovered that Citation.js has the (rather hidden) option of using the original item ID as BibTeX citekey (https://github.com/larsgw/citation.js/issues/181), which should be easy to enable in Reference Extractor.
  • Well there you go, option #3 I wasn't aware of.

    I haven't played with citation.js so I don't have an opinion on the conversion it offers. I do know that conversion between CSL (which is what can be extracted from a word doc with zotero references) and bibtex is a lot more complex than most people appreciate. I for one did not know what I was getting into when I started.
  • By the way, this might be a good feature request for pandoc (https://pandoc.org), which should be able to convert .docx files to LaTeX, and already can use CSL-JSON for citation rendering.
  • It's not just that it can use it, it's the primary path for most (but not all) pandoc output; for most conversions, if you offer it bibtex, it first converts that to CSL, and that conversion is not lossless. It mostly works reasonably well, but if you can use a pure-CSL path through pandoc (unless you're basically using it as latexmk), you get the best results.

    Other things to look out for are cross-references (specifically, but not exclusively, xrefs that point to a page number).