Frequency of a cited item in a document

segarra · November 4, 2016

Hi all,

I was wondering how to find out how many times an item (e.g. a reference) is cited in a document. Is it possible to create some output indicating each reference and its frequency in the document? Note that this is different from the total number of references in the document.

Thanks

adamsmith · November 4, 2016

no, nothing easy/built-in, sorry.

Rintze · November 4, 2016

It would be rather trivial for anybody with some knowledge of JavaScript to generate that type of data for Word .docx documents with a customized version of http://rintze.zelle.me/ref-extractor/, though.

(I'm still wondering if this is a frequent enough request to add this as a feature to the regular version)

bwiernik · November 5, 2016

(I would personally love that feature)

Rintze · November 5, 2016

Any thoughts on how to best report these count results, if I make it a standard feature?

Since I haven't built in citeproc-js or Citation.js (https://larsgw.github.io/citation.js/) yet, I really only have the item metadata itself to work with. Because multi-item citations don't contain item-specific pre-rendered citations (see e.g. https://github.com/rmzelle/ref-extractor/wiki#multi-item-zotero-citation), I can't easily show formatted citations and their counts (e.g. "(Doe, 2002): 2x"). I also don't seem to be able to easily extract the formatted bibliographic entry for a given item.

What I could do, is (optionally) add the citation count to the "note" field in the CSL JSON. Alternatively, I could add an option to generate a CSV or TSV file with a few metadata fields (URIs, title) and the citation count.

bwiernik · November 5, 2016

I think adding it to 'note', something like:

Times cited: 13

Would be good if there were some sort of toggle to include versus exclude it.

DWL-SDCA · November 5, 2016

This comment has nothing to do with the technical aspects of programming this feature but it may help illuminate why someone might want a count of used citations.

@bwiernik. What I have personally encountered over the years in demands on my own work, and what several of my university colleagues require, is a table that lists each source and the number of times it is cited. The complicated part was a demand that, in the case of an edited book, both the number of times each chapter was cited _and_ the sum of all citations of the book be listed.

I've never heard of this as a requirement when submitting a manuscript for publication. It is, however, a not-uncommon requirement for students' reports and theses. I first encountered this when I was an undergraduate student at a small liberal arts college in the 1960s. This was back in the days of printed indices, note cards, and hand-typed manuscripts. (We were also required to provide our deck of notecards for cited and examined but not-cited references -- the times cited indicated by a number on each index card.)

After this forum thread reminded me of this aspect of writing long past, I asked my colleagues about why this source listing was required. To my surprise their answers were not obscure and unreasonable. The source citation list is a quick way to identify how strongly each school of thought influenced the student. It is intended to be used by the student during the writing process as a gauge to assess how broadly differing points-of-view are included. I followed up by pointing out that it appears to me that most students compile the list only after the document is complete. I asked if they told their students the purpose of the requirement. To a person they expressed an Ah-Ha moment. No, they had not explained the reason for the requirement. Each person said that though it was obvious to them, it was now obvious that their students didn't know why. I said that until now, I didn't have a clue that the citation count requirement was considered useful to the student and not merely some holdover formality from long ago.

Another aspect of manuscript requirement minutiae is including the proportion of words in quotations to the total number of words in the document. That drove me near crazy in the '60s and '70s. I remember being docked a full grade letter by one professor because I had under-counted the number of non-quote words and had calculated the proportion of quote-words to total words instead of her unique requirement that the proportion be the number of quote-words to non-quote-words. She or an assistant actually counted each word.

As I wrote this I also remembered the days when a template transparency was placed over each typewritten page to test if text strayed into the margins. If the text entered a margin, the page (and usually the full document) had to be retyped. The margins had to be wide enough to hold comments. Needless to say, my fellow students and I had to pay a typist to do the work. Retyping due to margin errors was the typist's responsibility. Changes because of modifications, edits, deletions, or small revisions required paying again.

Rintze · November 16, 2016

@segarra, @bwiernik, I just updated http://rintze.zelle.me/ref-extractor/, which now has a new toggle 'Store cite counts in "note" field'. When checked, the cite count is prepended to the "note" field ("Extra" in Zotero) in the format "Times cited: n".

Again, feedback welcome.
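
For anyone who wants to post-process such an export, here is a minimal Python sketch. It assumes the exported CSL JSON is a list of items whose "note" field begins with "Times cited: n", as described above; the function name cite_counts and the sample items are hypothetical and only stand in for a real export.

```python
import json
import re

# Assumed format of the prefix written into the "note" field.
TIMES_CITED = re.compile(r"^Times cited: (\d+)")

def cite_counts(items):
    """Return (count, title) pairs, most-cited first, for items that carry a count."""
    rows = []
    for item in items:
        match = TIMES_CITED.match(item.get("note", ""))
        if match:
            title = item.get("title", str(item.get("id", "?")))
            rows.append((int(match.group(1)), title))
    return sorted(rows, key=lambda row: row[0], reverse=True)

# Hypothetical sample standing in for an exported CSL JSON file.
sample = json.loads("""[
  {"id": "a", "title": "First work", "note": "Times cited: 2"},
  {"id": "b", "title": "Second work", "note": "Times cited: 13\\nsome other note"},
  {"id": "c", "title": "No count here"}
]""")

for count, title in cite_counts(sample):
    print(f"{count}\t{title}")
```

With a real export, json.load() on the saved file would replace the inline sample. Items without a count are skipped rather than treated as zero, since a missing prefix may only mean the toggle was off.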

kld123509945 · January 17, 2018

@bwiernik @DWL-SDCA In my opinion, it is necessary to limit the number of times any given reference is cited in a paper.

References are intended to guide the reader to a detailed reading of the chapters they are interested in, and the reader should be able to turn to the reference for a quick look. Therefore, multiple citations of the same reference do not help the reader survey the background knowledge of the article. If a reference is cited in the discussion section, a duplicate citation of it in the introduction section should be kept only if it is indispensable.

Rather, the reference list of a good paper should cover previous research relatively comprehensively, which limits how often any single reference is cited. If one reference is cited to support one point of view, citing another appropriate reference when discussing a different point of view broadens the scope of the study's contribution.

@segarra I would suggest the following solution.

First, download the references from Zotero using @Rintze's tool. Then convert them to an RIS file and import it into EndNote X8. After that, re-insert these references in the full text. Finally, check the number of times each reference is cited, and where, through "Edit & Manage Citations" in EndNote X8.

adamsmith · January 17, 2018

Let's not give people guidance on scientific writing here. You don't know who you're talking to -- chances are, they're experienced academics themselves -- and disciplinary traditions vary a lot (e.g., in the humanities you may be exploring one text which can be cited dozens of times in an article, books differ from articles, etc.)

As noted by Rintze, for the original question, the number of times cited is now included in the export from his tool.

syr · July 22, 2020

A small note in support of @DWL-SDCA's large comment: I sometimes supply a rather long list of reference calls to support a statement, especially synthetic statements in the introduction. It can turn out that some of these references are not called again anywhere else in the write-up, which can make them seem superfluous if they are secondary sources. It would thus be useful to identify references that are only cited once to clean some of them out.

bwiernik · July 22, 2020

Reference Extractor linked above can do that.

earcanal · June 22, 2021

As a PhD student with an imminent viva, I would like to generate a reference list sorted by citation frequency, as a rough metric of "importance".

My source is R Markdown, so I don't think I can use Reference Extractor. The only approach I can currently think of is hacking apa.csl.

bwiernik · June 22, 2021

@earcanal What exactly do you mean by “citation frequency”? How many times your own works have been cited?

adamsmith · June 22, 2021

If you have an Rmd document, you can use R (or python or whatever you like) to analyze frequencies:

Read the whole doc in as a string, extract all citation markers using a regex, and then count the frequency of each marker.

Modifying the citation style won't work, I believe -- CSL doesn't have a citation-frequency count variable.
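
A minimal Python sketch of the regex approach, under stated assumptions: citekeys consist of letters, digits, and the characters _ : . - (pandoc allows more), and every @key in the text is a citation. The function name count_citekeys and the sample string are hypothetical. Code chunks or YAML that happen to contain a matching @word would be counted too, so treat the result as approximate.

```python
import re
from collections import Counter

# Assumed citekey syntax: "@" plus letters, digits, and _ : . - inside the key.
# The key must end in a letter, digit, or underscore, so a sentence-final "."
# is not swallowed. The lookbehind skips e-mail-like text (user@example.com).
CITEKEY = re.compile(
    r"(?<![\w.])@([A-Za-z0-9_][A-Za-z0-9_:.-]*[A-Za-z0-9_]|[A-Za-z0-9_])"
)

def count_citekeys(text):
    """Count every @citekey occurrence in a Markdown / R Markdown string."""
    return Counter(CITEKEY.findall(text))

# Hypothetical sample; in practice, read the .Rmd file into a string instead.
sample = (
    "Intro [@doe2002; @smith1999]. As @doe2002 argues "
    "[see also -@doe2002, p. 3]. Mail: user@example.com"
)

# Print the keys from most to least cited.
for key, n in count_citekeys(sample).most_common():
    print(key, n)
```

Each occurrence is counted, so a key that appears twice inside one bracketed citation counts twice.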

earcanal · June 22, 2021

@bwiernik: number of times I've cited a work in my work.

@adamsmith: I was hoping to avoid that approach (or at least find some existing code to do it, because I'm lazy). A benefit of using CSL is that the results would be a human-readable reference list, rather than a list of IDs. Couldn't you modify the CSL to count citations and sort on that?

adamsmith · June 22, 2021

Couldn't you modify the CSL to count citations and sort on that?

No. CSL just isn't aware of citation counts at all.

earcanal · June 22, 2021

Fair enough. Probably my ignorance of how CSL works. I assumed it was XSLT, so you could count() any type of node you can select.

bwiernik · June 22, 2021

@emilianoeheyns Can you link to your pandoc filter that produces live Zotero citations?

You can export your document from pandoc to Word with live Zotero citations. Emiliano has written a pandoc filter for that. From there, you can use Reference Extractor, which has the option to add the number of times cited to Extra. You can write a CSL file that includes the 'note' variable to show the Extra field, or, if you want to be sure that only the times cited is shown, you could add 'annote: ' before that number and then add the 'annote' field in your CSL.

It might even be easier to write a small pandoc filter to count the unique occurrences of the citekeys in your markdown document and then do the same as above.
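
One way to sketch that counting step in Python rather than as a filter is to walk pandoc's JSON output (pandoc -t json). This assumes the JSON AST represents each citation as a Cite node whose first content element is a list of citations carrying a citationId field. The names count_cites and _cite, and the hand-built sample tree, are hypothetical.

```python
from collections import Counter

def count_cites(node, counts=None):
    """Recursively walk a pandoc JSON AST and tally citationId values of Cite nodes."""
    if counts is None:
        counts = Counter()
    if isinstance(node, dict):
        if node.get("t") == "Cite":
            citations, _inlines = node["c"]
            for citation in citations:
                counts[citation["citationId"]] += 1
        for value in node.values():
            count_cites(value, counts)
    elif isinstance(node, list):
        for value in node:
            count_cites(value, counts)
    return counts

def _cite(*keys):
    """Build a minimal Cite node shaped like pandoc's JSON output (hand-written sample)."""
    citations = [
        {"citationId": key, "citationPrefix": [], "citationSuffix": [],
         "citationMode": {"t": "NormalCitation"},
         "citationNoteNum": 0, "citationHash": 0}
        for key in keys
    ]
    return {"t": "Cite", "c": [citations, [{"t": "Str", "c": "placeholder"}]]}

# Hand-built stand-in for json.load(sys.stdin) on real pandoc output.
sample = {
    "pandoc-api-version": [1, 22],
    "meta": {},
    "blocks": [
        {"t": "Para", "c": [_cite("doe2002", "smith1999"), {"t": "Space"}, _cite("doe2002")]},
    ],
}

for key, n in count_cites(sample).most_common():
    print(key, n)
```

With a real document, the sample would be replaced by json.load(sys.stdin) and the script fed from pandoc -t json main.md. Because it walks the parsed document, text inside code blocks is not mistaken for a citation.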

emilianoeheyns · June 22, 2021

Going through live citations + ref extractor could do it, but if all you need are the counts, this should get the job done:

citation_count = {}

count = {
  Cite = function(el)
    for _, item in pairs(el.citations) do
      if citation_count[item.id] == nil then
        citation_count[item.id] = 0
      end
      citation_count[item.id] = citation_count[item.id] + 1
    end
  end,
}

function Pandoc(el)
    pandoc.walk_block(pandoc.Div(el.blocks), count)
    for id, n in pairs(citation_count) do
      print(id, n)
    end
    os.exit(0)
end

ran as

pandoc --lua-filter count-cite.lua main.md

earcanal · June 23, 2021

Thanks, that's pretty close! This gives me a count of citation keys across the files which make up my thesis:

pandoc --lua-filter count-cite.lua *.Rmd | sort -k 2 -n -r

Is there a way to inject something like this into a pandoc pipeline so that it generates a sorted, human-readable reference list as a PDF?

emilianoeheyns · June 23, 2021

If you change the code to

citation_count = {}

count = {
  Cite = function(el)
    for _, item in pairs(el.citations) do
      if citation_count[item.id] == nil then
        citation_count[item.id] = 0
      end
      citation_count[item.id] = citation_count[item.id] + 1
    end
  end,
}

function Pandoc(el)
    pandoc.walk_block(pandoc.Div(el.blocks), count)
    for id, n in pairs(citation_count) do
      print('[@' .. id .. ', pp.' .. n .. ']')
    end
    os.exit(0)
end

and run

pandoc --lua-filter count-cite.lua main.md | pandoc --bibliography=biblio.bib --citeproc

you'll get a document that first lists all citekeys with their counts, and then a (presumably adequately sorted) bibliography. Since the in-doc citations are recognizable in the output (<span class="citation" data-cites="<citekey>">(Clark 2013, 1)</span>) and the lines in the bibliography carry the ID (<div id="<citekey>" class="csl-entry" role="doc-biblioentry">), you could use a further Lua script, XSLT, or Python to move the former to the end of the latter, et voilà. That's more Lua code than I can put together in a few minutes, however, so I'll leave this as an exercise for the reader.

earcanal · June 24, 2021

Thanks, this ticks the human-readable requirement:

pandoc --lua-filter count-cite.lua 01-introduction.Rmd |
    pandoc --bibliography=references.bib --citeproc > foo.html

but the references are still in ascending alphabetical order.

emilianoeheyns · June 24, 2021

If you mean you want them sorted on cite count, this works for me:

pandoc -s --bibliography=biblio.bib --citeproc main.md | ./count-and-sort.py

#!/usr/bin/env python3

import xml.etree.ElementTree as ET
import sys

class CiteCount:
  def __init__(self, f):
    tree = ET.parse(f)
    root = tree.getroot()

    self.cited = {}
    self.collect(root)
    self.sort(root)
    tree.write(sys.stdout.buffer)

  def collect(self, root):
    for node in root.findall('.//{http://www.w3.org/1999/xhtml}span'):
      if node.attrib.get('class') == 'citation':
        for key in node.attrib['data-cites'].split(' '):
          if not key in self.cited: self.cited[key] = 0
          self.cited[key] += 1

  def key(self, key):
    if key.startswith('ref-'): key = key[4:]
    return key

  def sort(self, root):
    body = root.find('.//{http://www.w3.org/1999/xhtml}body')
    body[:] = [node for node in body if node.tag == '{http://www.w3.org/1999/xhtml}div' and node.attrib.get('role') == 'doc-bibliography']

    bib = [node for node in root.findall('.//{http://www.w3.org/1999/xhtml}div') if node.attrib.get('role') == 'doc-bibliography'][0]
    # .get(..., 0) keeps entries that are listed but never cited in the text from raising KeyError
    bib[:] = sorted(bib[:], reverse=True, key=lambda node: self.cited.get(self.key(node.attrib['id']), 0))

    for node in root.findall('.//{http://www.w3.org/1999/xhtml}div'):
      if node.attrib.get('role') == 'doc-biblioentry':
        count = ET.SubElement(node, '{http://www.w3.org/1999/xhtml}span')
        count.text = f': {self.cited.get(self.key(node.attrib["id"]), 0)}'

CiteCount(sys.stdin)

a pandoc-lua script should be able to do the same, but I'm less familiar with lua.

earcanal · June 24, 2021

Fantastic, that also works for me. Thanks for writing this.

Do you know why all HTML tags are prefixed with html:?

e.g.


<html:html xmlns:html="http://www.w3.org/1999/xhtml" lang="" xml:lang="">
<html:head>
  <html:meta charset="utf-8" />
  <html:meta content="pandoc" name="generator" />
  <html:meta content="width=device-width, initial-scale=1.0, user-scalable=yes" name="viewport" />
  <html:title>01-introduction</html:title>
  <html:style>
...

I need to remove the prefix to view the output in a browser.

emilianoeheyns · June 24, 2021

Do you know why all HTML tags are prefixed with html:?

It's the xhtml namespace. It should work but I'd probably need to declare it as xhtml. This will just strip them:

#!/usr/bin/env python3

import xml.etree.ElementTree as ET
import sys

def citekey(key):
  if key.startswith('ref-'): key = key[4:]
  return key

tree = ET.parse(sys.stdin)
root = tree.getroot()

for node in root.findall('.//*'):
  node.tag = node.tag.split('}')[-1]

cited = {}
for node in root.findall('.//span'):
  if node.attrib.get('class') == 'citation':
    for key in node.attrib['data-cites'].split(' '):
      if not key in cited: cited[key] = 0
      cited[key] += 1

body = root.find('.//body')
body[:] = [node for node in body if node.tag == 'div' and node.attrib.get('role') == 'doc-bibliography']

bib = [node for node in root.findall('.//div') if node.attrib.get('role') == 'doc-bibliography'][0]
# .get(..., 0) keeps entries that are listed but never cited in the text from raising KeyError
bib[:] = sorted(bib[:], reverse=True, key=lambda node: cited.get(citekey(node.attrib['id']), 0))

for node in root.findall('.//div'):
  if node.attrib.get('role') == 'doc-biblioentry':
    count = ET.SubElement(node, 'span')
    count.text = f': {cited.get(citekey(node.attrib["id"]), 0)}'

tree.write(sys.stdout.buffer)
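
An alternative to stripping the namespace, offered as an untested-in-this-thread sketch: ElementTree maps the XHTML namespace to the "html" prefix by default, and registering an empty prefix for it should make the serializer emit a default xmlns declaration instead. The tiny input document below is a hand-written stand-in for pandoc's output.

```python
import xml.etree.ElementTree as ET

XHTML = "http://www.w3.org/1999/xhtml"

# Registering an empty prefix replaces ElementTree's built-in "html" prefix
# for this namespace, so elements are written without the "html:" prefix.
# Note that this changes module-wide state for all later serialization.
ET.register_namespace("", XHTML)

# Hand-written stand-in for pandoc's standalone HTML output.
source = f'<html xmlns="{XHTML}"><body><div role="doc-bibliography"/></body></html>'
root = ET.fromstring(source)

# Lookups still need the namespace in braces; only the output changes.
body = root.find(f".//{{{XHTML}}}body")

output = ET.tostring(root, encoding="unicode")
print(output)
```

The trade-off is that every find or findall call keeps the long {namespace} form, as in the first version of the script, whereas stripping the namespace allows the shorter tag names.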

earcanal · June 24, 2021

Great. Is there a GitHub repo that would make a good home for this? It's very useful!

emilianoeheyns · June 24, 2021

I don't readily know of a repo where it would fit, but feel free to put it up somewhere.

earcanal · June 25, 2021

https://gist.github.com/paulsharpeY/0e67002b41d5d43c4c4ef9ddd3124fa0

kretender · May 31, 2024

An alternative, more manual approach that saves time if you only need this once:

In a versatile text editor such as Notepad++, run a regex on the .Rmd file (e.g. '\[@[a-zA-Z0-9]{1,}?\]' for citations with a single citekey, or '@[a-zA-Z0-9]+(?!(,|;|\]))' for in-text citations with multiple citekeys), depending on the citekey format.

Export the list (in Notepad++, using marks), paste it into a spreadsheet, and sort.

skanlecon · January 17, 2025

It's 2025; has any plugin implemented this feature?