Frequency of a cited item in a document

Hi all,

I was wondering how to know how many times an item (e.g. a reference) is cited in a document. Is it possible to create some output indicating each reference and its frequency in the document? Notice this is different from the total number of references in the document.

Thanks
  • no, nothing easy/built-in, sorry.
  • edited November 4, 2016
    It would be rather trivial for anybody with some knowledge of JavaScript to generate that type of data for Word .docx documents with a customized version of http://rintze.zelle.me/ref-extractor/, though.

    (I'm still wondering if this is a frequent enough request to add this as a feature to the regular version)
  • (I would personally love that feature)
  • Any thoughts on how to best report these count results, if I make it a standard feature?

    Since I haven't built in citeproc-js or Citation.js (https://larsgw.github.io/citation.js/) yet, I really only have the item metadata itself to work with. Because multi-item citations don't contain item-specific pre-rendered citations (see e.g. https://github.com/rmzelle/ref-extractor/wiki#multi-item-zotero-citation), I can't easily show formatted citations and their counts (e.g. "(Doe, 2002): 2x"). I also don't seem to be able to easily extract the formatted bibliographic entry for a given item.

    What I could do, is (optionally) add the citation count to the "note" field in the CSL JSON. Alternatively, I could add an option to generate a CSV or TSV file with a few metadata fields (URIs, title) and the citation count.
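
    A minimal sketch of the CSV/TSV option in Python, assuming the items are already available as CSL JSON dictionaries and the counts as a plain dictionary keyed by item id. The function name, the column names, and the sample data are hypothetical and are not taken from ref-extractor.

```python
import csv
import io


def counts_to_tsv(items, cite_counts):
    """Return a TSV report with one row per item, most-cited first."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["id", "title", "times_cited"])
    # Items without a recorded count are reported as cited 0 times.
    ranked = sorted(items, key=lambda item: cite_counts.get(item["id"], 0), reverse=True)
    for item in ranked:
        writer.writerow([item["id"], item.get("title", ""), cite_counts.get(item["id"], 0)])
    return buffer.getvalue()


# Hypothetical sample data, for illustration only.
items = [
    {"id": "doe2002", "title": "First example title"},
    {"id": "smith2010", "title": "Second example title"},
]
cite_counts = {"doe2002": 2, "smith2010": 5}

report = counts_to_tsv(items, cite_counts)
print(report)
```

    Sorting by count puts the most-cited items at the top of the report, which may or may not be the order a reader wants.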
  • edited November 5, 2016
    I think adding it to 'note', something like:

    Times cited: 13

    Would be good if there were some sort of toggle to include versus exclude it.
  • edited November 6, 2016
    This comment has nothing to do with the technical aspects of programming this feature but it may help illuminate why someone might want a count of used citations.

    @bwiernik. What I, personally, have encountered through the years with demands on my own work; and what several of my university colleagues require is a table that lists each source and the number of times it is cited. The complicated part was a demand that, in the case of an edited book, the number of times each chapter was cited _and_ the sum of all citations of the book should be listed.

    I've never heard of this as a requirement when submitting a manuscript for publication. It is, however, a not-uncommon requirement for students' reports and theses. I first encountered this when I was an undergraduate student at a small liberal arts college in the 1960s. This was back in the days of printed indices, note cards, and hand-typed manuscripts. (We were also required to provide our deck of notecards for cited and examined but not-cited references -- the times cited indicated by a number on each index card.)

    After this forum thread reminded me of this aspect of writing long past, I asked my colleagues about why this source listing was required. To my surprise their answers were not obscure and unreasonable. The source citation list is a quick way to identify how strongly each school of thought influenced the student. It is intended to be used by the student during the writing process as a gauge to assess how broadly differing points-of-view are included. I followed up by pointing out that it appears to me that most students compile the list only after the document is complete. I asked if they told their students the purpose of the requirement. To a person they expressed an Ah-Ha moment. No, they had not explained the reason for the requirement. Each person said though it was obvious to them, it was now obvious that their students didn't know why. I said that until now, I didn't have a clue that the citation count requirement was considered useful to the student and not merely some holdover formality from long ago.

    Another aspect of manuscript requirement minutiae is including the proportion of words in quotations to the total number of words in the document. That drove me near crazy in the '60s and '70s. I remember being docked a full grade letter by one professor because I had under-counted the number of non-quote words and had calculated the proportion of quote-words to total words instead of her unique requirement that the proportion be the number of quote-words to non-quote-words. She or an assistant actually counted each word.

    As I wrote this I also remember the days when a template transparency was placed over each typewritten page to test if text strayed into the margins. If the text entered a margin the page (and usually the full document) had to be retyped. The margins had to be wide enough for holding comments. Needless to say my fellow students and I had to pay a typist to do the work. Retyping due to margin errors was the typist's responsibility. Changes because of modifications, edits, deletions, small revisions, required paying again.
  • @segarra, @bwiernik, I just updated http://rintze.zelle.me/ref-extractor/, which now has a new toggle 'Store cite counts in "note" field'. When checked, the cite count is prepended to the "note" field ("Extra" in Zotero) in the format "Times cited: n".

    Again, feedback welcome.
  • @bwiernik @DWL-SDCA In my opinion, it is necessary to reduce the cited times of a certain bibliography in the paper.

    The bibliographies are intended to guide the reader to a detailed reading of the chapters in which they are interested, and the reader should turn to the reference for a quick look. Therefore, multiple citations of the same bibliography are not helpful for the reader to survey the background knowledge of this article. If a bibliography is cited in the discussion section, the duplicate citation of the bibliography in the introduction section should be indispensable, otherwise such citation should not be adopted.

    On the contrary, the bibliographies of a good paper should be relatively comprehensive to the previous research, which limits the cited times of a certain bibliography. If a reference is cited to prove a point of view, another appropriate reference will expand the scope of the study's contribution to this research scope when discussing another point of view.

    @segarra I put forward a solution.

    First, download the references from Zotero through the plug-in of @Rintze. Then try to convert them into RIS format files and import them into Endnote X8. After that, re-insert these references in the full text. Finally, check the cited times and the location distribution of each reference through the "Edit & Manage Citations" of Endnote X8.
  • Let's not give people guidance on scientific writing here. You don't know who you're talking to -- chances are, they're experienced academics themselves -- and disciplinary traditions vary a lot (e.g., in the humanities you may be exploring one text which can be cited dozens of times in an article, books differ from articles, etc.)

    As noted by Rintze, for the original question, the number cited is now included in export from his tool.
  • A small note in support of @DWL-SDCA's large comment: I sometimes supply a rather long list of reference calls to support a statement, especially synthetic statements in the introduction. It can turn out that some of these references are not called again anywhere else in the write-up, which can make them seem superfluous if they are secondary sources. It would thus be useful to identify references that are only cited once to clean some of them out.
  • Reference Extractor linked above can do that.
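
    A rough Python sketch of how that clean-up could look once the counts are in the export: it reads CSL JSON and lists the items whose "note" field starts with "Times cited: 1". It assumes the count sits at the very start of the note, as described earlier in the thread; the function name and the sample export are hypothetical.

```python
import json
import re

# The "Times cited: n" prefix described earlier in the thread.
TIMES_CITED = re.compile(r"^Times cited: (\d+)")


def cited_once(csl_json_text):
    """Return the ids of items whose note records exactly one citation."""
    once = []
    for item in json.loads(csl_json_text):
        match = TIMES_CITED.match(item.get("note", ""))
        if match and int(match.group(1)) == 1:
            once.append(item["id"])
    return once


# Hypothetical export, for illustration only.
sample_export = json.dumps([
    {"id": "doe2002", "note": "Times cited: 13"},
    {"id": "smith2010", "note": "Times cited: 1\nsome other note text"},
    {"id": "lee1999"},
])

singles = cited_once(sample_export)
print(singles)
```

    Items without a count in their note are skipped here rather than treated as cited once.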
  • As a PhD student with an imminent viva, I would like to generate a reference list sorted by citation frequency, as a rough metric of "importance".

    My source is R Markdown, so I don't think I can use Reference Extractor. The only approach I can currently think of is hacking apa.csl.
  • @earcanal What exactly do you mean by “citation frequency”? How many times your own works have been cited?
  • If you have an Rmd document, you can use R (or python or whatever you like) to analyze frequencies:

    Read the whole doc in as a string, extract all citation markers using a regex, and then count the frequency of each marker.

    Modifying the citation style won't work, I believe -- CSL doesn't have a citation-frequency count variable.
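
    A minimal sketch of the regex approach in Python. The pattern is only an approximation of pandoc's citekey syntax, not a full parser: it skips an "@" that follows a word character (e-mail addresses) or a backslash (bookdown-style \@ref cross-references), and it would still pick up any "@word" inside code chunks or YAML. The function name and the sample text are hypothetical.

```python
import re
from collections import Counter

# Approximate citekey pattern: "@" not preceded by a word character, "@" or
# a backslash, then a key that starts and ends with a letter, digit or "_".
CITEKEY = re.compile(
    r"(?<![\w@\\])@([A-Za-z0-9_](?:[A-Za-z0-9_:.#$%&+?<>~/-]*[A-Za-z0-9_])?)"
)


def count_citekeys(text):
    """Count how often each citation marker occurs in the text."""
    return Counter(CITEKEY.findall(text))


# Hypothetical sample text, for illustration only.
sample = (
    "Intro [@doe2002; @smith2010, p. 3]. As @doe2002 argued. "
    "Contact: a@b.org. See Figure \\@ref(fig:plot)."
)

counts = count_citekeys(sample)
for key, n in counts.most_common():
    print(key, n)
```

    For a real document, the text would come from reading the .Rmd file(s) into a string first.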
  • @bwiernik: number of times I've cited a work in my work.

    @adamsmith: I was hoping to avoid that approach (or at least find some existing code to do it, because I'm lazy). A benefit of using CSL is that the results would be a human-readable reference list, rather than a list of IDs. Couldn't you modify the CSL to count citations and sort on that?
  • Couldn't you modify the CSL to count citations and sort on that?
    No. CSL just isn't aware of citation counts at all.
  • Fair enough. Probably my ignorance of how CSL works. I assumed it was XSLT, so you could count() any type of node you can select.
  • @emilianoeheyns Can you link to your pandoc filter that produces live Zotero citations?

    You can export your document from pandoc to Word with live Zotero citations. Emiliano has written a pandoc filter for that. From there, you can use Reference Extractor, which has the option to add the number of times cited to Extra. You can write a CSL file to include the 'note' variable to show the Extra field, or, if you want to be sure that only the times cited is shown, you could add 'annote: ' before that number and then add the 'annote' field in your CSL.

    It might even be easier to write a small pandoc filter to count the unique occurrences of the citekeys in your markdown document and then do the same as above.
  • edited June 22, 2021
    Going through live citations + ref extractor could do it, but if all you need are the counts, this should get the job done:

    citation_count = {}

    count = {
      Cite = function(el)
        for _, item in pairs(el.citations) do
          if citation_count[item.id] == nil then
            citation_count[item.id] = 0
          end
          citation_count[item.id] = citation_count[item.id] + 1
        end
      end,
    }

    function Pandoc(el)
      pandoc.walk_block(pandoc.Div(el.blocks), count)
      for id, n in pairs(citation_count) do
        print(id, n)
      end
      os.exit(0)
    end


    run as

    pandoc --lua-filter count-cite.lua main.md
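
    For anyone who would rather stay in Python, a rough equivalent can walk the JSON AST that pandoc emits with `-t json`, in which each Cite element carries a list of citations with a "citationId". This is a sketch, not a drop-in replacement for the Lua filter: the function names are hypothetical, and the sample AST is hand-written and trimmed (a real one has more fields) so that the block runs without pandoc installed.

```python
import json
from collections import Counter


def count_cites(node, counts):
    """Recursively walk a pandoc JSON AST fragment and tally citation ids."""
    if isinstance(node, dict):
        if node.get("t") == "Cite":
            citations, _inlines = node["c"]
            for citation in citations:
                counts[citation["citationId"]] += 1
        for value in node.values():
            count_cites(value, counts)
    elif isinstance(node, list):
        for child in node:
            count_cites(child, counts)
    return counts


def cite(key):
    """Build a trimmed, hand-written citation entry (illustration only)."""
    return {"citationId": key, "citationPrefix": [], "citationSuffix": [],
            "citationMode": {"t": "NormalCitation"}}


sample_ast = {
    "meta": {},
    "blocks": [
        {"t": "Para", "c": [
            {"t": "Cite", "c": [[cite("doe2002"), cite("smith2010")],
                                [{"t": "Str", "c": "[@doe2002; @smith2010]"}]]},
            {"t": "Space"},
            {"t": "Cite", "c": [[cite("doe2002")],
                                [{"t": "Str", "c": "@doe2002"}]]},
        ]},
    ],
}

# With pandoc available, the AST would instead come from something like
# json.load(sys.stdin) fed by `pandoc -t json main.md`.
counts = count_cites(sample_ast["blocks"], Counter())
for key, n in counts.most_common():
    print(key, n)
```

    Like the Lua filter, this walks only the document blocks, so citations that appear solely in the metadata are not counted.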
  • Thanks, that's pretty close! This gives me a count of citation keys across the files which make up my thesis:

    pandoc --lua-filter count-cite.lua *.Rmd | sort -k2,2nr

    Is there a way to inject something like this into a pandoc pipeline so that it generates a sorted, human-readable reference list as a PDF?
  • If you change the code to

    citation_count = {}

    count = {
      Cite = function(el)
        for _, item in pairs(el.citations) do
          if citation_count[item.id] == nil then
            citation_count[item.id] = 0
          end
          citation_count[item.id] = citation_count[item.id] + 1
        end
      end,
    }

    function Pandoc(el)
      pandoc.walk_block(pandoc.Div(el.blocks), count)
      for id, n in pairs(citation_count) do
        print('[@' .. id .. ', pp.' .. n .. ']')
      end
      os.exit(0)
    end


    and run

    pandoc --lua-filter count-cite.lua main.md | pandoc --bibliography=biblio.bib --citeproc

    you'll get a document that first lists all citekeys with their counts, and then a (presumably adequately sorted) bibliography. Since the in-doc citations are recognizable in the output (<span class="citation" data-cites="<citekey>">(Clark 2013, 1)</span>) and the lines in the bibliography carry the ID (<div id="<citekey>" class="csl-entry" role="doc-biblioentry">), you could use a further Lua script, XSLT, or Python to move the former to the end of the latter, et voilà. That's however more Lua code than I can put together in a few minutes, so I'll leave this as an exercise for the reader.
  • Thanks, this ticks the human-readable requirement:

    pandoc --lua-filter count-cite.lua 01-introduction.Rmd |
    pandoc --bibliography=references.bib --citeproc > foo.html


    but the references are still in ascending alphabetical order.
  • edited June 24, 2021
    If you mean you want them sorted on cite count, this works for me:

    pandoc -s --bibliography=biblio.bib --citeproc main.md | ./count-and-sort.py

    #!/usr/bin/env python3

    import xml.etree.ElementTree as ET
    import sys

    class CiteCount:
        def __init__(self, f):
            tree = ET.parse(f)
            root = tree.getroot()

            self.cited = {}
            self.collect(root)
            self.sort(root)
            tree.write(sys.stdout.buffer)

        def collect(self, root):
            for node in root.findall('.//{http://www.w3.org/1999/xhtml}span'):
                if node.attrib.get('class') == 'citation':
                    for key in node.attrib['data-cites'].split(' '):
                        if not key in self.cited: self.cited[key] = 0
                        self.cited[key] += 1

        def key(self, key):
            if key.startswith('ref-'): key = key[4:]
            return key

        def sort(self, root):
            body = root.find('.//{http://www.w3.org/1999/xhtml}body')
            body[:] = [node for node in body if node.tag == '{http://www.w3.org/1999/xhtml}div' and node.attrib.get('role') == 'doc-bibliography']

            bib = [node for node in root.findall('.//{http://www.w3.org/1999/xhtml}div') if node.attrib.get('role') == 'doc-bibliography'][0]
            bib[:] = sorted(bib[:], reverse=True, key=lambda node: self.cited[self.key(node.attrib['id'])])

            for node in root.findall('.//{http://www.w3.org/1999/xhtml}div'):
                if node.attrib.get('role') == 'doc-biblioentry':
                    count = ET.SubElement(node, '{http://www.w3.org/1999/xhtml}span')
                    count.text = f': {self.cited[self.key(node.attrib["id"])]}'

    CiteCount(sys.stdin)


    a pandoc-lua script should be able to do the same, but I'm less familiar with lua.
  • Fantastic, that also works for me. Thanks for writing this.

    Do you know why all HTML tags are prefixed with html:?

    e.g.


    <html:html xmlns:html="http://www.w3.org/1999/xhtml" lang="" xml:lang="">
    <html:head>
    <html:meta charset="utf-8" />
    <html:meta content="pandoc" name="generator" />
    <html:meta content="width=device-width, initial-scale=1.0, user-scalable=yes" name="viewport" />
    <html:title>01-introduction</html:title>
    <html:style>
    ...


    I need to remove the prefix to view the output in a browser.
  • edited June 24, 2021
    Do you know why all HTML tags are prefixed with html:?
    It's the xhtml namespace. It should work but I'd probably need to declare it as xhtml. This will just strip them:

    #!/usr/bin/env python3

    import xml.etree.ElementTree as ET
    import sys

    def citekey(key):
        if key.startswith('ref-'): key = key[4:]
        return key

    tree = ET.parse(sys.stdin)
    root = tree.getroot()

    for node in root.findall('.//*'):
        node.tag = node.tag.split('}')[-1]

    cited = {}
    for node in root.findall('.//span'):
        if node.attrib.get('class') == 'citation':
            for key in node.attrib['data-cites'].split(' '):
                if not key in cited: cited[key] = 0
                cited[key] += 1

    body = root.find('.//body')
    body[:] = [node for node in body if node.tag == 'div' and node.attrib.get('role') == 'doc-bibliography']

    bib = [node for node in root.findall('.//div') if node.attrib.get('role') == 'doc-bibliography'][0]
    bib[:] = sorted(bib[:], reverse=True, key=lambda node: cited[citekey(node.attrib['id'])])

    for node in root.findall('.//div'):
        if node.attrib.get('role') == 'doc-biblioentry':
            count = ET.SubElement(node, 'span')
            count.text = f': {cited[citekey(node.attrib["id"])]}'

    tree.write(sys.stdout.buffer)
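
    An alternative to stripping the namespace, sketched under the assumption that the goal is only to lose the html: prefix in the output: ElementTree can be told to serialize the XHTML namespace as the default one via `ET.register_namespace`. The function name and the sample markup are hypothetical, and the registration is global to the process.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def serialize_without_prefix(xhtml_text):
    """Parse XHTML and serialize it with XHTML as the default namespace."""
    # An empty prefix makes ElementTree write xmlns="..." on the root element
    # instead of prefixing every tag with "html:".
    ET.register_namespace("", XHTML_NS)
    root = ET.fromstring(xhtml_text)
    return ET.tostring(root, encoding="unicode")


# Hypothetical sample markup, for illustration only.
sample = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    '<div id="ref-doe2002" role="doc-biblioentry">Doe</div>'
    '</body></html>'
)

result = serialize_without_prefix(sample)
print(result)
```

    With this approach the tags keep their namespace internally, so lookups such as `findall` would still need the `{http://www.w3.org/1999/xhtml}` form used in the first version of the script.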
  • Great. Is there a GitHub repo that would make a good home for this? It's very useful!
  • I don't readily know of a repo where it would fit, but feel free to put it up somewhere.