Parsing citations in full text

bobfutrelle · August 30, 2007

A comment on this was buried in the site translators thread, until I realized that it probably needed a new thread.

Zotero's focus on "sites" does not address an outstanding problem that we face in mining citation information from the literature. There are millions of *documents* in the world that contain, in total, hundreds of millions of citations. E.g., Medline stores about 17M abstracts of papers; if each paper contains, say, 20 references, that's 340M references. Libraries rarely have this citation information; they hold only citations of documents in their collections, typically books. I'm sure that there are some databases of articles which may be relevant here; I'm sorry I'm not as familiar with them as I should be. But in any event, what scholars are most often (!) interested in citing are papers, not books.

Take as an example one citation from one paper in my collection of a few thousand PDFs stored on my machine (MacBook Pro :-):

"Howard, J. H., Mutter, S. A., & Howard, D. V. (1992). Serial pattern learning by event observation. Journal of Experimental Psychology: Learning, Memory, and Cognition, 18, 1029–1039."

One might think that the free form of such citations renders the project a non-starter. Not the case, when we realize that many (most?) of the citations are generated from bibliographic engines used by authors, EndNote being the leader. EndNote has a *finite* number of Styles, 2300 at my last count. Furthermore, journals try to enforce standards for the form of citations, e.g., Science and Nature.

The problem then is the "inversion" of published citations from the various published formats back to the citation that was used to create them.
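As a rough illustration of this "inversion" for a single style, here is a minimal Python sketch that maps the APA-like journal citation quoted above back to fields. The pattern, the function name parse_apa_journal, and the field names are assumptions made for this example; it handles only this one layout and is not how EndNote or Zotero work.

```python
import re

# Hypothetical pattern for ONE layout only:
# Authors (Year). Title. Journal, Volume, First-Last.
APA_JOURNAL = re.compile(
    r"^(?P<authors>.+?)\s*"
    r"\((?P<year>\d{4})\)\.\s*"
    r"(?P<title>.+?)\.\s+"
    r"(?P<journal>.+?),\s*"
    r"(?P<volume>\d+),\s*"
    r"(?P<fpage>\d+)\s*[-\u2013]\s*(?P<lpage>\d+)\.?\s*$"
)

# "Surname, I. I." pairs inside the author string.
AUTHOR = re.compile(r"([^,&]+),\s*((?:[A-Z]\.\s?)+)")


def parse_apa_journal(text):
    """Return a dict of fields, or None if the text does not fit the layout."""
    match = APA_JOURNAL.match(text.strip())
    if match is None:
        return None
    record = match.groupdict()
    record["authors"] = [
        (surname.strip(), initials.replace(".", "").replace(" ", ""))
        for surname, initials in AUTHOR.findall(record["authors"])
    ]
    return record


CITATION = (
    "Howard, J. H., Mutter, S. A., & Howard, D. V. (1992). "
    "Serial pattern learning by event observation. "
    "Journal of Experimental Psychology: Learning, Memory, and Cognition, "
    "18, 1029 –1039."
)

record = parse_apa_journal(CITATION)
for field, value in record.items():
    print(field, "=", value)
```

A real tool would need one such pattern per style, plus a way to choose among them, and that is where the difficulty lies.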

As with my ignorance of databases of article citations, I may well be ignorant of a Zotero discussion thread on this topic.

If you google on "parsing citations" you get about 30 hits. The efforts discussed there probably pale in comparison to what the Zotero people could do if they threw themselves into this important task.

Still another important direction is the availability of XML versions of papers that include structured citations, so they're already parsed, courtesy of the journal! The journal obviously did the required parsing, from the authors' manuscripts. Some manual labor may have been involved; I don't know. Why can't Zotero match that? Here's an example from a PLoS Biology paper:

<ref id="pbio-0050229-b088">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Leveau</surname>
<given-names>JH</given-names>
</name>
<name>
<surname>Lindow</surname>
<given-names>SE</given-names>
</name>
</person-group>
<year>2001</year>
<article-title>Predictive and interpretive simulation of green fluorescent protein expression in reporter bacteria</article-title>
<source>J Bacteriol</source>
<volume>183</volume>
<fpage>6752</fpage>
<lpage>6762</lpage>
</citation>
</ref>
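
When the citation arrives already structured like this, reading it takes only a few lines. The following is a minimal sketch using Python's standard xml.etree.ElementTree; the function name citation_from_ref and the output keys are assumptions for illustration, and it covers only the elements that appear in this one example.

```python
import xml.etree.ElementTree as ET

REF_XML = """<ref id="pbio-0050229-b088">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Leveau</surname>
<given-names>JH</given-names>
</name>
<name>
<surname>Lindow</surname>
<given-names>SE</given-names>
</name>
</person-group>
<year>2001</year>
<article-title>Predictive and interpretive simulation of green fluorescent protein expression in reporter bacteria</article-title>
<source>J Bacteriol</source>
<volume>183</volume>
<fpage>6752</fpage>
<lpage>6762</lpage>
</citation>
</ref>"""


def citation_from_ref(ref):
    """Flatten one <ref> element into a plain dict (hypothetical key names)."""
    cit = ref.find("citation")
    authors = [
        (name.findtext("surname"), name.findtext("given-names"))
        for name in cit.iter("name")
    ]
    return {
        "id": ref.get("id"),
        "type": cit.get("citation-type"),
        "authors": authors,
        "year": cit.findtext("year"),
        "title": cit.findtext("article-title"),
        "journal": cit.findtext("source"),
        "volume": cit.findtext("volume"),
        "fpage": cit.findtext("fpage"),
        "lpage": cit.findtext("lpage"),
    }


record = citation_from_ref(ET.fromstring(REF_XML))
for field, value in record.items():
    print(field, "=", value)
```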

Bob

Robert P. Futrelle
Associate Professor
Biological Knowledge Laboratory
College of Computer and Information Science
Northeastern University MS WVH202
360 Huntington Ave.
Boston, MA 02115

Office: 617-373-4239
Fax: 617-373-5121
http://www.ccs.neu.edu/home/futrelle
http://www.bionlp.org
http://www.diagrams.org

noksagt · August 30, 2007

Thanks for posting--it is obvious that the issue is important to you.

I think people saw your first post, and I don't know if there was a reason to repost. You even got a reply.

As with my ignorance of databases of article citations, I may well be ignorant of a Zotero discussion thread on this topic.

The issue has come up a few times on these forums. See the links posted for you in the previous thread & search for cb2bib to see a few other brief discussions.

One might think that the free form of such citations renders the project a non-starter. Not the case, when we realize that many (most?) of the citations are generated from bibliographic engines used by authors, EndNote being the leader. EndNote has a *finite* number of Styles, 2300 at my last count. Furthermore, journals try to enforce standards for the form of citations, e.g., Science and Nature.

What interface do you propose to deduce what style is being used? Should users have to select from 2300+ styles? And what should be done when there is too little formatting information to clearly differentiate between fields? Just because the problem is "finite" doesn't mean that a solution will be elegant.

I think it will be hard to do much better than cb2bib. Yes, it might be nice if Zotero eventually had some of this functionality integrated (but, as a working tool exists, I personally think there are stronger priorities).

bobfutrelle · August 30, 2007

My comments lost some of their force when I recalled that Zotero can pull citations out of Google Scholar. That satisfies most of my desires. But I'm not sure what the coverage of Scholar is; does it in fact have every reference that I might find cited in an arbitrary paper? Of course, if I got my article via Scholar, then Scholar may have already scanned it, closing the loop. Scholar also has an "import into EndNote" link on each result, which also satisfies my needs to a good extent.

So I've learned a lot just by posing my questions and then pursuing them.

Now all I have to do is to study the literature, create my own new research results, and keep moving ahead. That's what scholarship is really all about.

Cheers, all.

- Bob

bobfutrelle · August 30, 2007

Yes, I figured that this was old ground that had been covered before. I was happy that noksagt pointed out c2bib to me - I was unaware of it. I will certainly try it out. My hard core work on NLP these days is focused on XML-based corpora that contain the metadata I need. But I'm always interested in ways of dealing with text and citations. Thanks again for helping bring me (and other readers) up to speed.

That said, the power and reach that Scholar brings to all this is good too.

- Bob

bobfutrelle · August 30, 2007

Scuse. That's cb2Bib. My typo.

cz · August 31, 2007

well well, there already are some projects for Reference Metadata Extraction. see e.g.
http://www.iis.sinica.edu.tw/~myday/slides/Slide2005_IEEE-IRI2005_A_Knowledge-based_Approach_to_Citation_Extraction.ppt
http://www.csie.cyut.edu.tw/~shwu/publication/DSS2007_Reference_Metadata_Extraction_Using_a_Hierarchical_Knowledge_Representation_Framework.pdf
http://ieeexplore.ieee.org/iel5/10065/32280/01506448.pdf?arnumber=1506448
http://wing.comp.nus.edu.sg/publications/theses/yongKiatNgThesis.pdf
http://wing.comp.nus.edu.sg/parsCit/
http://paracite.eprints.org/developers/downloads.html
also, of course, citeseer does a good job in this. needless to say, as almost anyone, i would very much appreciate it if zotero would apply such technology in the midterm future ;)

noksagt · August 31, 2007

Actually, CiteSeer is an excellent example (http://dx.doi.org/10.1109/2.769447). Their heuristics take advantage of looking at ALL citations within a paper & also a large database of author names, journal names, journal abbreviations, etc. Expecting a comparable amount of data to be on a client is not very reasonable. As has been said, perhaps Zotero can play the regex games that cb2bib (and ParaTools and other desktop programs) play. But perhaps the central Zotero server will be even better for this, as the contributions from others could improve the heuristics.
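
To give a feel for why a name database helps, here is a small Python sketch that locates the journal field in a free-form citation by looking it up in a table of known titles and abbreviations, instead of guessing from punctuation. This is not CiteSeer's actual method; the two-entry KNOWN_JOURNALS table, the function find_journal, and the second test citation are all made up for the example, and a real table would hold many thousands of entries.

```python
# Hypothetical, tiny stand-in for a large authority list:
# full journal title -> abbreviated form.
KNOWN_JOURNALS = {
    "Journal of Experimental Psychology: Learning, Memory, and Cognition":
        "J Exp Psychol Learn Mem Cogn",
    "Journal of Bacteriology": "J Bacteriol",
}


def find_journal(citation):
    """Return (canonical_title, start, end) for the longest known journal
    name or abbreviation found in the citation, or None if none is found."""
    lowered = citation.lower()
    candidates = []
    for full, abbrev in KNOWN_JOURNALS.items():
        for variant in (full, abbrev):
            pos = lowered.find(variant.lower())
            if pos != -1:
                candidates.append((len(variant), pos, full))
    if not candidates:
        return None
    # Prefer the longest match so a short abbreviation cannot shadow
    # a longer title that contains it.
    length, pos, full = max(candidates)
    return full, pos, pos + length


apa_style = (
    "Howard, J. H., Mutter, S. A., & Howard, D. V. (1992). "
    "Serial pattern learning by event observation. "
    "Journal of Experimental Psychology: Learning, Memory, and Cognition, "
    "18, 1029-1039."
)
# Made-up rendering of the PLoS reference in a different style.
abbrev_style = "Leveau JH, Lindow SE (2001) J Bacteriol 183: 6752-6762."

print(find_journal(apa_style))
print(find_journal(abbrev_style))
```

Once the journal span is known, the text before it holds authors and title and the text after it holds volume and pages, which makes the remaining regex work much less ambiguous.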

noksagt · August 31, 2007

And here's the recent code4lib discussion:
http://www.mail-archive.com/code4lib@listserv.nd.edu/msg01762.html