Get a count of authors in your library

peter.mattingly · June 15, 2019

I've made some software to help with my own research, and I wanted to get some feedback from the community about it. I want to eventually present a poster about it, but only after I'm reasonable sure it's actually useful.

Here's a link, let me know what you think: https://colab.research.google.com/drive/10yDfHxMO6Se0PFSQMJk7gcEePOYbGVpu

The motivation for this software is research. The way I go about research is attempt to find authors to read and then read most of their research within the past 5 years; this helps me to develop a literature review. One way of finding authors to read is to look for multiple mentions of that authors work while reading. This process is more qualitative than I would like, so instead I would like to look at the frequency of which authors appear in my library as authoring works I find to be interesting. Zotero certainly makes this simpler than looking at the raw papers, but it can be difficult to match authors between papers and can end up being a terribly manual and time consuming process. In addition, not all author names from papers are exactly the same. For example:

Kristen E. DiCerbo', 'Kristen DiCerbo', 'Kristen E. Dicerbo
Robert Mislevy', 'Robert J. Mislevy
Ryan Baker', 'Ryan S Baker', 'Ryan S. Baker
Tiffany Barnes', 'Tiffany M. Barnes
Drew Hicks', 'Andrew G. Hicks
etc.

All of these similar names can be understood by a human to be referring to the same author, but attempting to search for exact matches of these names would necessarily lead to missing some of the author's works'.

The proposed solution is to attempt to compare author names to each other such that the names are approximately equal, then track how many works have been authored by that person. Thus my software, which does precisely that.

Given that the process looks for approximately equal names, there is some failure rate, but I believe that similarity testing between the names is robust. In addition the program can handle a functionally infinite list of authors for comparison. It will also return the results of its inspection in a standard format for subsequent processing; e.g. it returns a CSV file of the author names and the times they are seen in data submitted.

Already this tool has helped my own research, as I was aware of many authors that came up commonly in my research; but several others I had not seen as they are commonly second (or later) authors, rather than the eye-catching first author position. For example, here is a truncated list of authors from my own library:

Author | Count
---|---
Steven L. Wise | 11
Kristen E. DiCerbo', 'Kristen DiCerbo', 'Kristen E. Dicerbo | 7
Robert Mislevy', 'Robert J. Mislevy | 5
Ryan Baker', 'Ryan S Baker', 'Ryan S. Baker | 5
Michael Eagle | 5
V. Elizabeth Owen | 5
Alina A. von Davier', 'Alina von Davier | 4
Tiffany Barnes', 'Tiffany M. Barnes | 4
Sebastiaan de Klerk | 3
Zhan Shu | 3
Elizabeth Rowe | 3
Drew Hicks', 'Andrew G. Hicks | 3

While I was aware that Dr. Wise was important to my research, as he has many first-authorships; I had not taken a close look at the work of Dr. Tiffany Barnes, Dr. Zhan Shu, or Dr. Drew Hicks.

The software, in its current form, does have some drawbacks. It only accepts input in the form of Zotero reports (with plans for input in the form of: Bibtex, RIS, and CSV). It also is currently running as a Jupyter notebook on Google Colab; to expose it to a wider audience I would want to host the software on something more professional and responsive, but Colab serves as a decent test-bed for this initial alpha release.

All that being said, I hope the software will be useful to others; and I welcome your feedback. Thank you.

dstillman · June 16, 2019

For what it's worth, you can get a quick count right in Zotero by enabling the Run JavaScript window and running the following script:

var rows = await Zotero.DB.queryAsync("SELECT TRIM(firstName || ' ' || lastName) AS author, COUNT(*) AS num FROM items JOIN itemCreators USING (itemID) JOIN creators USING (creatorID) WHERE libraryID=1 GROUP BY author ORDER BY COUNT(*) DESC");
var lines = [];
for (let row of rows) {
    lines.push(row.author + '\t' + row.num);
}
return lines.join('\n');

This will count author occurrences in your personal library (libraryID 1).

The above code won't handle name variations, but Zotero.Utilities.levenshtein(a, b) is available and could be used for that purpose with a little bit more work.

tim820 · July 3, 2022

I've found the above code from @dstillman to be very useful to get simple counts of the number of items by each author in one's entire library. A modification that addresses the duplicate-but-not-identical author names problem in a very basic way is to use ORDER BY lastName ASC instead of ORDER BY COUNT(*) DESC . Then at least you get the similar names' counts next to each other in the list. You can also show the last name first by changing TRIM(firstName || ' ' || lastName) to TRIM(lastName || ' ' || firstName).

How would it be re-coded to apply to just (1) the current collection or (2) selected items (in a collection or library) ?

tim820 · July 6, 2022

No takers so far ? ;) @dstillman or anyone else, in simple terms is there an equivalent for "WHERE libraryID=1" in the above SQL code that would limit the author search to the current collection instead of the whole library ?

tim820 · July 9, 2022

OK, I managed to cobble something together that works for me, from some other code snippets floating around here, and by looking at Zotoro's tables in a copy of my zotero.sqlite (using the free DB Browser for SQLite). Despite knowing no javascript nor SQL ! (but having a little knowledge of relational databases) The solution to wanting only the authors in the current collection to be included is more complex than the above code's SQL query. Because the collection name in the SQL query is not known until runtime. Plus some extra tables must be included. If there's a simpler way, I missed it. Anyway here's the code:

var CurrentCollection = Zotero.getActiveZoteroPane().getSelectedCollection();
var sql = "SELECT TRIM(lastName || ', ' || firstName) AS author, COUNT(*) AS num FROM items JOIN itemCreators USING (itemID) JOIN creators USING (creatorID) JOIN collectionItems USING (itemID) JOIN collections USING (collectionID) WHERE collectionName=? GROUP BY author ORDER BY COUNT(*) DESC";
var rows = await Zotero.DB.queryAsync(sql,CurrentCollection.name);
var lines = [];
for (let row of rows) {
lines.push(row.author + '\t' + row.num);
}
return lines.join('\n');

As previously noted, to get the item counts for each author listed alphabetically by lastName instead of numerically by count, replace ORDER BY COUNT(*) DESC with ORDER BY lastName ASC.

sdvr · August 11, 2022

I have come across this thread while looking for a way to count how many articles I have per journal. I assume the SQL above could be adapted for this, but before I try to work this out, has anyone done something like this already and is willing to share?

If not, could anyone give me some pointers to the SQL schema so I can work out what fields I need to SELECT and how to adapt the query?

(More broadly, these kind of simple 'metrics' about one's library would be useful functionality to have within Zotero itself - I vaguely recall that Papers used to have this in v2 back in 2012 - but I understand that there are limited resources available)

tim820 · August 11, 2022

Some code for that here:
https://forums.zotero.org/discussion/comment/412566/#Comment_412566

sdvr · August 13, 2022

Wow, thank you so much @timwr820 - very helpful!

I've taken a look at how to combine the code in that link with the code above, so it runs on whichever collection is currently selected, but this appears to be more complex than I thought at first glance. From what I can see, the code above joins multiple tables (items, itemCreators, creators, collectionItems, collections) to match collectionName using WHERE with the value passed through for the selected collection. By contrast, the code in the other thread by @AbeJellinek uses WHERE fieldID=${Zotero.ItemFields.getID('publicationTitle')}

I've tried combining these approaches, but quickly got myself tied up in knots. I'll try again when I have a clear head, but I suspect this is a lot simpler to people who know the Zotero schema and are more adept with SQL than me!