Frequency of a cited item in a document
Hi all,
I was wondering how to find out how many times an item (e.g. a reference) is cited in a document. Is it possible to create some output listing each reference and its frequency in the document? Note that this is different from the total number of references in the document.
Thanks
(I'm still wondering if this is a frequent enough request to add this as a feature to the regular version)
Since I haven't built in citeproc-js or Citation.js (https://larsgw.github.io/citation.js/) yet, I really only have the item metadata itself to work with. Because multi-item citations don't contain item-specific pre-rendered citations (see e.g. https://github.com/rmzelle/ref-extractor/wiki#multi-item-zotero-citation), I can't easily show formatted citations and their counts (e.g. "(Doe, 2002): 2x"). I also don't seem to be able to easily extract the formatted bibliographic entry for a given item.
What I could do is (optionally) add the citation count to the "note" field in the CSL JSON. Alternatively, I could add an option to generate a CSV or TSV file with a few metadata fields (URIs, title) and the citation count.
Times cited: 13
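The CSV/TSV idea mentioned above could look roughly like the following. This is only a minimal Python sketch, not how Reference Extractor does it: the function name `citation_counts_tsv` and the input shape (a list of CSL JSON-like dicts, one per citation occurrence) are assumptions made for illustration.

```python
import csv
import io
from collections import Counter


def citation_counts_tsv(cited_items):
    """Return TSV text with one row per unique item and its citation count.

    `cited_items` is a hypothetical list of CSL JSON-like dicts, one per
    citation occurrence, so a source cited twice appears twice.
    """
    counts = Counter(item["id"] for item in cited_items)

    # Keep the first occurrence of each item as the source of its metadata.
    first_seen = {}
    for item in cited_items:
        first_seen.setdefault(item["id"], item)

    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["id", "title", "times_cited"])
    # Most-cited first; ties are broken by id so the order is stable.
    for item_id, n in sorted(counts.items(), key=lambda kv: (-kv[1], str(kv[0]))):
        writer.writerow([item_id, first_seen[item_id].get("title", ""), n])
    return out.getvalue()
```

The output can be opened directly in a spreadsheet; switching `delimiter` to a comma would give CSV instead.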
Would be good if there were some sort of toggle to include versus exclude it.
@bwiernik: What I have personally encountered over the years in demands on my own work, and what several of my university colleagues require, is a table that lists each source and the number of times it is cited. The complicated part was a demand that, in the case of an edited book, the number of times each chapter was cited _and_ the sum of all citations of the book should be listed.
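For the edited-book case, the per-chapter counts can be rolled up into a per-book total once each chapter is linked to its book. A minimal Python sketch, assuming you already have per-item counts; the `chapter_to_book` mapping and all keys below are hypothetical, since nothing in this thread provides that link automatically:

```python
from collections import defaultdict


def rollup_by_book(item_counts, chapter_to_book):
    """Sum per-item citation counts into per-book totals.

    `item_counts` maps an item key to its citation count.
    `chapter_to_book` is a hypothetical, hand-maintained mapping from a
    chapter's key to the key of the edited book that contains it.
    """
    totals = defaultdict(int)
    for key, n in item_counts.items():
        # Items that are not chapters of a listed book count toward themselves.
        totals[chapter_to_book.get(key, key)] += n
    return dict(totals)


counts = {"smith_ch1": 3, "jones_ch4": 2, "doe2002": 1}
parents = {"smith_ch1": "handbook2015", "jones_ch4": "handbook2015"}
book_totals = rollup_by_book(counts, parents)
```

The table would then list `counts` for the individual chapters and `book_totals` for the book as a whole.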
I've never heard of this as a requirement when submitting a manuscript for publication. It is, however, a not-uncommon requirement for students' reports and theses. I first encountered this when I was an undergraduate student at a small liberal arts college in the 1960s. This was back in the days of printed indices, note cards, and hand-typed manuscripts. (We were also required to provide our deck of notecards for cited and examined but not-cited references -- the times cited indicated by a number on each index card.)
After this forum thread reminded me of this aspect of writing long past, I asked my colleagues about why this source listing was required. To my surprise their answers were not obscure and unreasonable. The source citation list is a quick way to identify how strongly each school of thought influenced the student. It is intended to be used by the student during the writing process as a gauge to assess how broadly differing points of view are included. I followed up by pointing out that it appears to me that most students compile the list only after the document is complete. I asked if they told their students the purpose of the requirement. To a person they expressed an Ah-Ha moment. No, they had not explained the reason for the requirement. Each person said that, though it was obvious to them, it was now obvious that their students didn't know why. I said that until now, I didn't have a clue that the citation count requirement was considered useful to the student and not merely some holdover formality from long ago.
Another aspect of manuscript requirement minutiae is including the proportion of words in quotations to the total number of words in the document. That drove me near crazy in the '60s and '70s. I remember being docked a full grade letter by one professor because I had under-counted the number of non-quote words and had calculated the proportion of quote-words to total words instead of her unique requirement that the proportion be the number of quote-words to non-quote-words. She or an assistant actually counted each word.
As I wrote this I also remembered the days when a template transparency was placed over each typewritten page to test if text strayed into the margins. If the text entered a margin, the page (and usually the full document) had to be retyped. The margins had to be wide enough for holding comments. Needless to say, my fellow students and I had to pay a typist to do the work. Retyping due to margin errors was the typist's responsibility. Changes because of modifications, edits, deletions, or small revisions required paying again.
Again, feedback welcome.
The bibliographies are intended to guide the reader to a detailed reading of the chapters in which they are interested, and the reader should turn to the reference for a quick look. Therefore, multiple citations of the same bibliography do not help the reader survey the background knowledge of this article. If a bibliography is cited in the discussion section, a duplicate citation of it in the introduction section should be indispensable; otherwise such a citation should not be adopted.
On the contrary, the bibliographies of a good paper should be relatively comprehensive to the previous research, which limits the cited times of a certain bibliography. If a reference is cited to prove a point of view, another appropriate reference will expand the scope of the study's contribution to this research scope when discussing another point of view.
@segarra I put forward a solution.
First, download the references from Zotero through the plug-in of @Rintze. Then convert them into an RIS file and import them into Endnote X8. After that, re-insert these references in the full text. Finally, check the number of times cited and the location distribution of each reference through "Edit & Manage Citations" in Endnote X8.
As noted by Rintze, for the original question, the number of times cited is now included in the export from his tool.
My source is R Markdown, so I don't think I can use Reference Extractor. The only approach I can currently think of is hacking apa.csl.
Read the whole doc in as a string, extract all citation markers using a regex, and then count the frequency of each marker.
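As a rough illustration of that regex approach, here is a minimal Python sketch. The pattern is a simplified approximation of pandoc's citekey syntax, not a full parser, and the names `CITEKEY` and `count_citekeys` are made up for this example. It will also count anything that merely looks like a citation, for instance inside code chunks.

```python
import re
from collections import Counter

# Simplified approximation of a pandoc citekey: "@" followed by letters,
# digits or underscores, with punctuation allowed only between such runs.
# The lookbehind skips e-mail addresses (a@b.org), R slot access (obj@slot)
# and bookdown cross-references (\@ref(...)).
CITEKEY = re.compile(
    r"(?<![\w\\])@([A-Za-z0-9_]+(?:[:.#$%&+?<>~/-][A-Za-z0-9_]+)*)"
)


def count_citekeys(text):
    """Count how often each citekey occurs in a Markdown/R Markdown string."""
    return Counter(CITEKEY.findall(text))


sample = (
    "See [@doe2002; @roe2010, p. 3]. As @doe2002 argues, "
    "mail me at a@b.org. [-@smith_2015]."
)
for key, n in count_citekeys(sample).most_common():
    print(key, n)
```

To run it on a real file, read the file into a string first (for example with `pathlib.Path("main.Rmd").read_text()`, where the filename is a placeholder) and pass that to `count_citekeys`.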
Modifying the citation style won't work, I believe -- CSL doesn't have a citation-frequency count variable.
@adamsmith: I was hoping to avoid that approach (or at least find some existing code to do it, because I'm lazy). A benefit of using CSL is that the results would be a human-readable reference list, rather than a list of IDs. Couldn't you modify the CSL to count citations and sort on that?
You can export your document from pandoc to Word with live Zotero citations. Emiliano has written a pandoc filter for that. From there, you can use Reference Extractor, which has the option to add the number of times cited to Extra. You can write a CSL file that includes the 'note' variable to show the Extra field, or, if you want to be sure that only the times cited is shown, you could add 'annote: ' before that number and then add the 'annote' field in your CSL.
It might even be easier to write a small pandoc filter to count the unique occurrences of the citekeys in your markdown document and then do the same as above.
citation_count = {}

count = {
  Cite = function(el)
    for _, item in pairs(el.citations) do
      if citation_count[item.id] == nil then
        citation_count[item.id] = 0
      end
      citation_count[item.id] = citation_count[item.id] + 1
    end
  end,
}

function Pandoc(el)
  pandoc.walk_block(pandoc.Div(el.blocks), count)
  for id, n in pairs(citation_count) do
    print(id, n)
  end
  os.exit(0)
end
ran as
pandoc --lua-filter count-cite.lua main.md
pandoc --lua-filter count-cite.lua *.Rmd | sort -r -k 2
Is there a way to inject something like this into a pandoc pipeline so that it generates a sorted, human-readable reference list as a PDF?
citation_count = {}

count = {
  Cite = function(el)
    for _, item in pairs(el.citations) do
      if citation_count[item.id] == nil then
        citation_count[item.id] = 0
      end
      citation_count[item.id] = citation_count[item.id] + 1
    end
  end,
}

function Pandoc(el)
  pandoc.walk_block(pandoc.Div(el.blocks), count)
  for id, n in pairs(citation_count) do
    print('[@' .. id .. ', pp.' .. n .. ']')
  end
  os.exit(0)
end
and run
pandoc --lua-filter count-cite.lua main.md | pandoc --bibliography=biblio.bib --citeproc
you'll get a document that first lists all citekeys with their counts, and then a (presumably adequately sorted) bibliography. Since the in-doc citations are recognizable in the output (<span class="citation" data-cites="<citekey>">(Clark 2013, 1)</span>) and the lines in the bibliography carry the ID (<div id="<citekey>" class="csl-entry" role="doc-biblioentry">), you could use a further Lua script, XSLT, or Python to move the former to the end of the latter, et voilà. That's however more Lua code than I can put together in a few minutes, so I'll leave this as an exercise for the reader.
pandoc --lua-filter count-cite.lua 01-introduction.Rmd |
  pandoc --bibliography=references.bib --citeproc > foo.html
but the references are still in ascending alphabetical order.
pandoc -s --bibliography=biblio.bib --citeproc main.md | ./count-and-sort.py
#!/usr/bin/env python3

import xml.etree.ElementTree as ET
import sys

class CiteCount:
    def __init__(self, f):
        tree = ET.parse(f)
        root = tree.getroot()
        self.cited = {}
        self.collect(root)
        self.sort(root)
        tree.write(sys.stdout.buffer)

    def collect(self, root):
        for node in root.findall('.//{http://www.w3.org/1999/xhtml}span'):
            if node.attrib.get('class') == 'citation':
                for key in node.attrib['data-cites'].split(' '):
                    if key not in self.cited:
                        self.cited[key] = 0
                    self.cited[key] += 1

    def key(self, key):
        if key.startswith('ref-'):
            key = key[4:]
        return key

    def sort(self, root):
        body = root.find('.//{http://www.w3.org/1999/xhtml}body')
        body[:] = [node for node in body if node.tag == '{http://www.w3.org/1999/xhtml}div' and node.attrib.get('role') == 'doc-bibliography']
        bib = [node for node in root.findall('.//{http://www.w3.org/1999/xhtml}div') if node.attrib.get('role') == 'doc-bibliography'][0]
        bib[:] = sorted(bib[:], reverse=True, key=lambda node: self.cited[self.key(node.attrib['id'])])
        for node in root.findall('.//{http://www.w3.org/1999/xhtml}div'):
            if node.attrib.get('role') == 'doc-biblioentry':
                count = ET.SubElement(node, '{http://www.w3.org/1999/xhtml}span')
                count.text = f': {self.cited[self.key(node.attrib["id"])]}'

CiteCount(sys.stdin)
A pandoc Lua script should be able to do the same, but I'm less familiar with Lua.
Do you know why all HTML tags are prefixed with html:? E.g.
<html:html xmlns:html="http://www.w3.org/1999/xhtml" lang="" xml:lang="">
<html:head>
<html:meta charset="utf-8" />
<html:meta content="pandoc" name="generator" />
<html:meta content="width=device-width, initial-scale=1.0, user-scalable=yes" name="viewport" />
<html:title>01-introduction</html:title>
<html:style>
...
I need to remove the prefix to view the output in a browser.
#!/usr/bin/env python3

import xml.etree.ElementTree as ET
import sys

def citekey(key):
    if key.startswith('ref-'):
        key = key[4:]
    return key

tree = ET.parse(sys.stdin)
root = tree.getroot()

for node in root.findall('.//*'):
    node.tag = node.tag.split('}')[-1]

cited = {}
for node in root.findall('.//span'):
    if node.attrib.get('class') == 'citation':
        for key in node.attrib['data-cites'].split(' '):
            if key not in cited:
                cited[key] = 0
            cited[key] += 1

body = root.find('.//body')
body[:] = [node for node in body if node.tag == 'div' and node.attrib.get('role') == 'doc-bibliography']
bib = [node for node in root.findall('.//div') if node.attrib.get('role') == 'doc-bibliography'][0]
bib[:] = sorted(bib[:], reverse=True, key=lambda node: cited[citekey(node.attrib['id'])])

for node in root.findall('.//div'):
    if node.attrib.get('role') == 'doc-biblioentry':
        count = ET.SubElement(node, 'span')
        count.text = f': {cited[citekey(node.attrib["id"])]}'

tree.write(sys.stdout.buffer)
In a versatile text editor such as Notepad++, run a regex on the .Rmd file (e.g. '\[@[a-zA-Z0-9]{1,}?\]' for citations with a single citekey, '@[a-zA-Z0-9]+(?!(,|;|\]))' for in-text citations with multiple citekeys), depending on the citekey format.
Export the list (in Notepad++, using marks), paste it into a spreadsheet, and sort.