Parsing citations in full text
A comment on this was buried in the site translators thread, until I realized that it probably needed a new thread.
Zotero's focus on "sites" does not address an outstanding problem that we face in mining citation information from the literature. There are millions of *documents* in the world that contain, in total, hundreds of millions of citations. E.g., Medline stores about 17M abstracts of papers; if each of those papers contains, say, 20 references, that's 340M references. Libraries rarely have this citation information; they have only citations of the documents in their collections, typically books. I'm sure that there are some databases of articles which may be relevant here; I'm sorry I'm not as familiar with them as I should be. But in any event, what scholars are most often (!) interested in citing are papers, not books.
Take as an example one citation from one paper in my collection of a few thousand PDFs stored on my machine (MacBook Pro :-):
"Howard, J. H., Mutter, S. A., & Howard, D. V. (1992). Serial pattern learning by event observation. Journal of Experimental Psychology: Learning, Memory, and Cognition, 18, 1029 –1039."
One might think that the free form of such citations renders the project a non-starter. Not the case, when we realize that many (most?) of the citations are generated from bibliographic engines used by authors, EndNote being the leader. EndNote has a *finite* number of Styles, 2300 at my last count. Furthermore, journals try to enforce standards for the form of citations, e.g., Science and Nature.
The problem then is the "inversion" of published citations from the various published formats back to the citation that was used to create them.
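To make the "inversion" idea concrete, here is a minimal Python sketch for one APA-like journal style only, applied to the Howard et al. citation quoted above. The names `APA_JOURNAL`, `AUTHOR` and `parse_apa_journal` are hypothetical, chosen for illustration; they are not part of Zotero or EndNote, and a real system would presumably need one such pattern (or something more robust) per style.

```python
import re

# Hypothetical pattern for ONE APA-like journal style (an assumption for
# illustration); other styles would need their own patterns.
APA_JOURNAL = re.compile(
    r"^(?P<authors>.+?)\s*\((?P<year>\d{4})\)\.\s*"     # authors, then "(1992)."
    r"(?P<title>.+?)\.\s*"                              # shortest title that lets the rest match
    r"(?P<journal>.+?),\s*(?P<volume>\d+),\s*"          # journal name, then volume
    r"(?P<fpage>\d+)\s*[-\u2013]\s*(?P<lpage>\d+)\.?$"  # page range with hyphen or en dash
)

# One "Surname, I. I." pair inside the author list.
AUTHOR = re.compile(r"([^,&]+?),\s*((?:[A-Z]\.\s?)+)")


def parse_apa_journal(citation):
    """Return a dict of fields, or None if the citation does not fit the pattern."""
    match = APA_JOURNAL.match(citation.strip())
    if match is None:
        return None
    fields = match.groupdict()
    fields["authors"] = [
        (surname.strip(), initials.strip())
        for surname, initials in AUTHOR.findall(fields["authors"])
    ]
    return fields


raw = (
    "Howard, J. H., Mutter, S. A., & Howard, D. V. (1992). "
    "Serial pattern learning by event observation. "
    "Journal of Experimental Psychology: Learning, Memory, and Cognition, "
    "18, 1029 \u20131039."
)
parsed = parse_apa_journal(raw)
for key, value in parsed.items():
    print(key, "=", value)
```

Note that the pattern tolerates the stray space before the en dash in the page range, which is the kind of noise that text extracted from a PDF tends to contain. A citation in any other style simply returns `None`.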
As with my ignorance of databases of article citations, I may well be ignorant of a Zotero discussion thread on this topic.
If you google "parsing citations" you get about 30 hits. The efforts discussed there probably pale in comparison to what the Zotero people could do if they threw themselves into this important task.
Still another important direction is the availability of XML versions of papers that include structured citations, so they're already parsed, courtesy of the journal! The journal obviously did the required parsing from the authors' manuscripts. Some manual labor may have been involved; I don't know. Why can't Zotero match that? Here's an example from a PLoS Biology paper:
<ref id="pbio-0050229-b088">
  <citation citation-type="journal">
    <person-group person-group-type="author">
      <name>
        <surname>Leveau</surname>
        <given-names>JH</given-names>
      </name>
      <name>
        <surname>Lindow</surname>
        <given-names>SE</given-names>
      </name>
    </person-group>
    <year>2001</year>
    <article-title>Predictive and interpretive simulation of green fluorescent protein expression in reporter bacteria</article-title>
    <source>J Bacteriol</source>
    <volume>183</volume>
    <fpage>6752</fpage>
    <lpage>6762</lpage>
  </citation>
</ref>
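When the journal supplies markup like this, no guessing is needed. As a minimal sketch, the following Python reads the example above with the standard library's `xml.etree.ElementTree`. The function name `citation_from_ref` and the keys of the resulting dict are hypothetical choices for illustration, not a Zotero format.

```python
import xml.etree.ElementTree as ET

# The PLoS Biology reference quoted above, embedded so the example is self-contained.
REF_XML = """
<ref id="pbio-0050229-b088">
  <citation citation-type="journal">
    <person-group person-group-type="author">
      <name><surname>Leveau</surname><given-names>JH</given-names></name>
      <name><surname>Lindow</surname><given-names>SE</given-names></name>
    </person-group>
    <year>2001</year>
    <article-title>Predictive and interpretive simulation of green fluorescent protein expression in reporter bacteria</article-title>
    <source>J Bacteriol</source>
    <volume>183</volume>
    <fpage>6752</fpage>
    <lpage>6762</lpage>
  </citation>
</ref>
"""


def citation_from_ref(ref):
    """Flatten one <ref> element into a plain dict (key names are illustrative)."""
    cit = ref.find("citation")
    return {
        "id": ref.get("id"),
        "type": cit.get("citation-type"),
        # Every <name> under the citation becomes a (surname, given-names) pair.
        "authors": [
            (name.findtext("surname"), name.findtext("given-names"))
            for name in cit.iter("name")
        ],
        "year": cit.findtext("year"),
        "title": cit.findtext("article-title"),
        "journal": cit.findtext("source"),
        "volume": cit.findtext("volume"),
        "pages": (cit.findtext("fpage"), cit.findtext("lpage")),
    }


record = citation_from_ref(ET.fromstring(REF_XML.strip()))
for key, value in record.items():
    print(key, "=", value)
```

The sketch assumes every field is present; `findtext` returns `None` for a missing element, so a real importer would need to handle incomplete references.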
Bob
Robert P. Futrelle
Associate Professor
Biological Knowledge Laboratory
College of Computer and Information Science
Northeastern University MS WVH202
360 Huntington Ave.
Boston, MA 02115
Office: 617-373-4239
Fax: 617-373-5121
http://www.ccs.neu.edu/home/futrelle
http://www.bionlp.org
http://www.diagrams.org
I think people saw your first post, and I don't know if there was a reason to repost. You even got a reply. The issue has come up a few times on these forums. See the links posted for you in the previous thread and search for cb2bib to see a few other brief discussions. What interface do you propose to deduce which style is being used? Should users have to select from 2300+ styles? And what should happen when there is too little formatting information to clearly differentiate between fields? Just because the problem is "finite" doesn't mean that a solution will be elegant.
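To illustrate the style-selection question, here is a rough Python sketch of one possible approach: instead of asking the user to pick a style, try every known pattern and report which ones match. Everything here is an assumption for illustration. The two patterns, the style names, and the function `candidate_styles` are hypothetical, and the second test string is a hand-formatted rendering of the Leveau and Lindow reference, not text taken from a published paper.

```python
import re

# Hypothetical registry: one regular expression per citation style.
# A real registry would need far more entries and far more careful patterns.
STYLE_PATTERNS = {
    # Authors (Year). Title. Journal, volume, first-last.
    "apa_like": re.compile(
        r"^.+?\(\d{4}\)\.\s*.+?\.\s*.+?,\s*\d+,\s*\d+\s*[-\u2013]\s*\d+\.?$"
    ),
    # Authors. Title. Journal. Year;volume:first-last.
    "vancouver_like": re.compile(
        r"^.+?\.\s+.+?\.\s+.+?\.\s+\d{4};\d+:\d+\s*[-\u2013]\s*\d+\.?$"
    ),
}


def candidate_styles(citation):
    """Return the names of all registered styles whose pattern matches."""
    text = citation.strip()
    return [name for name, pattern in STYLE_PATTERNS.items() if pattern.match(text)]


apa_example = (
    "Howard, J. H., Mutter, S. A., & Howard, D. V. (1992). "
    "Serial pattern learning by event observation. "
    "Journal of Experimental Psychology: Learning, Memory, and Cognition, "
    "18, 1029 \u20131039."
)
# Hand-formatted for this example only.
vancouver_example = (
    "Leveau JH, Lindow SE. Predictive and interpretive simulation of green "
    "fluorescent protein expression in reporter bacteria. "
    "J Bacteriol. 2001;183:6752-6762."
)

print(candidate_styles(apa_example))
print(candidate_styles(vancouver_example))
print(candidate_styles("not a citation"))
```

With only two styles each example matches exactly one pattern, but the function returns a list precisely because, with thousands of styles, several patterns may match the same string or none may match at all. That ambiguity is the hard part of the problem.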
I think it will be hard to do much better than cb2bib. Yes, it might be nice if Zotero eventually had some of this functionality integrated (but, as a working tool exists, I personally think there are stronger priorities).
So I've learned a lot just by posing my questions and then pursuing them.
Now all I have to do is to study the literature, create my own new research results, and keep moving ahead. That's what scholarship is really all about.
Cheers, all.
- Bob
That said, the power and reach that Scholar brings to all this is good too.
- Bob
http://www.iis.sinica.edu.tw/~myday/slides/Slide2005_IEEE-IRI2005_A_Knowledge-based_Approach_to_Citation_Extraction.ppt
http://www.csie.cyut.edu.tw/~shwu/publication/DSS2007_Reference_Metadata_Extraction_Using_a_Hierarchical_Knowledge_Representation_Framework.pdf
http://ieeexplore.ieee.org/iel5/10065/32280/01506448.pdf?arnumber=1506448
http://wing.comp.nus.edu.sg/publications/theses/yongKiatNgThesis.pdf
http://wing.comp.nus.edu.sg/parsCit/
http://paracite.eprints.org/developers/downloads.html
Also, of course, CiteSeer does a good job at this. Needless to say, like almost anyone, I would very much appreciate it if Zotero applied such technology in the medium-term future ;)
http://www.mail-archive.com/code4lib@listserv.nd.edu/msg01762.html