SPIRES, BibTeX, and the arXiv -- problems/missing features

johnnyboy624 · November 30, 2009

Problem 1: the Spires site translator imports a paper's information from its bibtex record, but does not keep all fields; the preprint (arxiv) info is lost. A search of these forums shows that other site translators also drop extra fields, even when the data comes from a bibtex import. These extra fields are quite important, and are now pretty standard in the physics community. Even if this info is not displayed in Zotero by default, it should at least be used when exporting (e.g. to create a bibtex entry for a latex document, which should have exactly the same info as one grabbed from Spires directly).

Problem 2: the Spires site translator never downloads any associated PDFs. Since every journal link I've come across through Spires imports just fine with Zotero, could we have the option of using those translators directly from a Spires search page? In other words, rather than making the user follow links manually to journal websites (tedious when saving many papers from one search result, for instance), Zotero should offer to attempt this automatically. At the very least, it should be able to save the pdf from the arxiv, as Spires search results link directly to the pdf on the arxiv (when available, which is now just about a given except for old papers).

To summarize: please do not drop any extra bibtex fields on import (in general, not just for Spires), and add support for getting pdfs for papers through the journal link or directly from the arxiv pdf link in the search results.

These small changes would make Zotero absolutely perfect for these essential physics paper websites (which are basically all anyone needs in high energy physics, for instance).

johnnyboy624 · December 9, 2009

Nobody has any comments? Anyone else support these changes? Developers...?

fbennett · December 9, 2009

Saving out arbitrary data for which Zotero has no corresponding field, which I think you are suggesting, sounds like a complicated proposition. Some items would have this hidden information and some would not, depending on whether they were saved from Spires (say), and in a simple design, there would be no way of adding the information by hand to an item from which it is missing (nor indeed of checking whether it is there, without exporting to bibtex and checking the field content by hand). The simpler thing would be to just have fields that you feel are necessary added to Zotero's UI; then everything just works. What does the preprint (arxiv) info look like?

On fetching metadata from a separate site using another translator, I have never been able to get cross-site scripting to work in a translator. I think you run up against a security barrier that is built into Firefox on this one. There might be a way around it, but again it would be complicated, and probably fragile.

johnnyboy624 · December 9, 2009

I agree with the potential problems you mention about arbitrary extra fields that aren't visible in the UI. This would require more work, like just having the raw Bibtex entry or something like that (and I guess would be a bit cumbersome and add clutter).

Another idea might be, for sites that already give an entry in a fixed form like Bibtex, to store this information as an attachment. Then there could be an export option (for a whole library or just selected entries) controlling what to do when entries have an attachment in the same format you are exporting to. Again, this would also require some work to add.

As for the actual info that is relevant for Spires/arxiv, it looks like this:
eprint = "0911.0687",
archivePrefix = "arXiv",
primaryClass = "hep-ph"
while older papers don't include the "primaryClass" since that appears in the "eprint" field as "hep-ph/" before the number.
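To make the field layout concrete, here is a minimal Python sketch that pulls these three fields out of a Bibtex entry. The helper name extract_eprint_fields and the regular expression are made up for illustration; this is not how any Zotero translator actually works, and it ignores nested braces and other Bibtex subtleties.

```python
import re

# Hypothetical helper, not part of Zotero. Matches simple
# `name = "value"` or `name = {value}` pairs (no nested braces).
FIELD_RE = re.compile(r'(\w+)\s*=\s*(?:"([^"]*)"|\{([^{}]*)\})')

WANTED = {"eprint", "archiveprefix", "primaryclass"}


def extract_eprint_fields(bibtex_entry):
    """Return the arXiv-related fields found in a Bibtex entry string.

    Keys are lower-cased field names; fields that are absent
    (e.g. primaryClass on older papers) are simply missing.
    """
    found = {}
    for name, quoted, braced in FIELD_RE.findall(bibtex_entry):
        if name.lower() in WANTED:
            found[name.lower()] = quoted or braced
    return found


new_style = '''
    eprint = "0911.0687",
    archivePrefix = "arXiv",
    primaryClass = "hep-ph"
'''
print(extract_eprint_fields(new_style))
```

An older entry with only eprint = "hep-th/0703196" would come back without a primaryclass key, which matches the description above.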

Maybe it makes sense to have these extra fields collapsible or easily hidden somehow, to keep things simple looking? In fact, I think that would be worthwhile for other entries as well, or some option like "collapse all blank fields".

Ok, I guess cross-site scripting is out for now, but grabbing a pdf (if available) from the direct arxiv link on a Spires entry would be very useful.

fbennett · December 9, 2009

A URL for the target reference can be derived from that metadata: http://arxiv.org/abs/0911.0687

(The URL does not include the primaryClass, but the target page identifies it implicitly, judging from the breadcrumbs across the top of the page.)

Perhaps the translator for Spires could add this URL to the Zotero item, plus add the primaryClass as a Zotero tag (since it seems meant to work as a tag in the target system also)?

johnnyboy624 · December 9, 2009

Yes, the eprint info can be used to construct the url to the arxiv page, and even the pdf location itself (as the arxiv site translator does I believe). And you are correct that the primaryClass is not needed as newer entries (with a period in the number) don't need it and older ones have it in the number itself. However, the primaryClass is used for bibliography styles in latex that incorporate pre-print info (very common, at least in arxiv versions).
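A minimal Python sketch of that URL construction follows. The function name arxiv_urls is hypothetical, and the /pdf/ path is an assumption that mirrors the /abs/ pattern quoted earlier in the thread; this is not code from the arxiv or Spires translators.

```python
def arxiv_urls(eprint):
    """Build abstract and PDF URLs from a Bibtex eprint value.

    Works the same way for new-style ids ("0911.0687") and
    old-style ids that carry the class ("hep-th/0703196").
    """
    eprint = eprint.strip()
    return {
        "abs": "http://arxiv.org/abs/" + eprint,
        # Assumption: the PDF is served under /pdf/<id>.
        "pdf": "http://arxiv.org/pdf/" + eprint,
    }


print(arxiv_urls("0911.0687")["abs"])
print(arxiv_urls("hep-th/0703196")["pdf"])
```

Note that primaryClass plays no part here, which is consistent with it not being needed to locate the paper.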

Also, as you point out, I think Zotero does use that identifier (primaryClass) in the arxiv site translator to generate a tag. It would be good for the Spires import to be consistent with that (presumably just copy the code of the arxiv translator).

I think the easiest solution right now that would incorporate everything would be to add those 3 Bibtex fields to Zotero, and grab the pdf from the arxiv link (constructed directly from the info as you say). The tag info should also be done as the arxiv importer does.

The arxiv site translator also has a bug, I think: it seems to swap the actual journal and preprint info. I haven't looked at that in detail yet, and the best fix would rely on the above changes anyway (to use the real Bibtex fields above).

fbennett · December 9, 2009

"However, the primaryClass is used for bibliography styles in latex that incorporate pre-print info (very common, at least in arxiv versions)."

Sounds like this is the root of the fields issue. Can you provide a sample citation that uses this info? If the citation style differs between the old-style and new-style entries that you mention, examples for both would be very helpful.

johnnyboy624 · December 9, 2009

Any recent physics paper on the arxiv will likely have both old and new style pre-print citations, and how these appear depends on the bibliography style in latex (e.g. see http://golem.ph.utexas.edu/~distler/TeXstuff/utphys.bst for a style, and the arxiv hyperref help page for other info: http://arxiv.org/hypertex/). For instance, see the references in http://arxiv.org/abs/0911.0687 or http://arxiv.org/abs/0909.1615 but I will just show you below.

An older pre-print might appear as: "[arXiv:hep-th/0703196]" at the end of a bibliography entry, while a newer one would be "arXiv:0906.1273 [hep-th]" (again, at the very end of a reference).
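As a rough illustration, the two forms above could be produced from the Bibtex fields like this. The function format_eprint is a hypothetical name, and the rule it uses (a slash in the eprint means old style) is an assumption drawn from the two examples; real output depends on the bibliography style in use.

```python
def format_eprint(eprint, primary_class=None, prefix="arXiv"):
    """Format the pre-print part of a reference.

    Old-style ids already contain the class before the slash,
    so primary_class is ignored for them.
    """
    if "/" in eprint:
        # Old style, e.g. hep-th/0703196
        return "[%s:%s]" % (prefix, eprint)
    if primary_class:
        # New style with a separate class, e.g. 0906.1273 + hep-th
        return "%s:%s [%s]" % (prefix, eprint, primary_class)
    return "%s:%s" % (prefix, eprint)


print(format_eprint("hep-th/0703196"))       # [arXiv:hep-th/0703196]
print(format_eprint("0906.1273", "hep-th"))  # arXiv:0906.1273 [hep-th]
```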

So older papers do not have the primaryClass appear in the Bibtex entry from Spires, and it can be considered blank (this info is included in the eprint field). The newer ones do have this info, and while it is not needed directly for finding the paper on the arxiv, it is useful to quickly see what field (theory, gravity, etc.) the paper was listed as. I would put primary class on the same footing as eprint, as the "new" numbering is over 10 years old.

fbennett · December 9, 2009

The thing to do is set up the translator to store the eprint and primary class information in the Loc. in Archive field (which maps to archive_location in CSL-ese, and to archiveLocation in the Zotero database schema):

hep-th/0703196
0906.1273 [hep-th]

(with the archive field set to "arxiv")
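A minimal Python sketch of that mapping, assuming the translator already has the eprint and primaryClass values in hand. The function name to_archive_fields and the plain dictionary it returns are illustrative only; a real translator would set the fields on a Zotero item instead.

```python
def to_archive_fields(eprint, primary_class=None):
    """Map Bibtex eprint info onto the archive / Loc. in Archive fields.

    Old-style ids ("hep-th/0703196") are stored unchanged; new-style
    ids get the primary class appended in square brackets.
    """
    if "/" in eprint or not primary_class:
        location = eprint
    else:
        location = "%s [%s]" % (eprint, primary_class)
    return {"archive": "arxiv", "archiveLocation": location}


print(to_archive_fields("hep-th/0703196"))
print(to_archive_fields("0906.1273", "hep-th"))
```

This reproduces the two sample field values shown above.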

For CSL-driven output, styles for target journals may need to be extended to do the right thing with the info, but that's easy to do with everything in place. BibTeX export is out of my league, but deriving BibTeX entries for export from the content of the Zotero database is definitely the right thing to do. If it's broken it needs to be fixed, but the Z infrastructure is reliable and robust and it will be simpler in the long run to rely on it.
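For the BibTeX direction, here is a hedged Python sketch of how an exporter might turn that stored content back into the three Spires-style fields. The names archive_to_bibtex and render_fields are hypothetical, and this is not how the actual Zotero BibTeX export translator is written.

```python
import re

# "<eprint>" optionally followed by " [<primary class>]"
LOC_RE = re.compile(r'^(?P<eprint>\S+?)(?:\s+\[(?P<cls>[^\]]+)\])?$')


def archive_to_bibtex(archive, archive_location):
    """Recover eprint/archivePrefix/primaryClass from the stored fields."""
    if archive.strip().lower() != "arxiv":
        return {}
    match = LOC_RE.match(archive_location.strip())
    if match is None:
        return {}
    fields = {"eprint": match.group("eprint"), "archivePrefix": "arXiv"}
    if match.group("cls"):
        fields["primaryClass"] = match.group("cls")
    return fields


def render_fields(fields):
    """Render the fields as lines suitable for a BibTeX entry body."""
    return ",\n".join('%s = "%s"' % (k, v) for k, v in fields.items())


print(render_fields(archive_to_bibtex("arxiv", "0906.1273 [hep-th]")))
print(render_fields(archive_to_bibtex("arxiv", "hep-th/0703196")))
```

Because both formats are recoverable from one visible field, nothing has to be stored as hidden data.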

(EDIT: amended sample field content to remove arxiv: prefix.)

johnnyboy624 · December 9, 2009

Thanks for helping me to narrow this down and make it quite explicit. Should I perhaps start a new topic asking for these specific changes, which I can spell out in some detail (and again summarizing the problems/goals)? Would it be more effective to send that to a developer list (not sure if that's an appropriate usage)?

fbennett · December 9, 2009

Might let this thread sit for a day or two and see if anyone picks it up. There are regular bibtex users on the forums who know how the bibtex export translator works, and might be able to help with that end.

For the translator and CSL style stuff, I'm kind of loaded up with work at the moment, but can take a look when things slow down a bit. Is there a style or a set of styles that are commonly used in the field -- or the name of a leading journal or two (so I can pick it out of the Zotero CSL styles repository)?

(EDIT: Sorry, missed the link to the BST file above. Is IEEE Transactions different from IEEE?)

johnnyboy624 · December 9, 2009

Some styles might be AIP or APS. Specifically, Physical Review D (published by APS) is a common journal in high energy physics. This page on arxiv might be helpful as well: http://arxiv.org/hypertex/bibstyles/. I'm not sure how eprint info is handled in actual print versions of a paper, but almost everything is electronic these days anyway.

However, either Bibtex or a direct Latex (\bibitem) form is what I think everybody uses. Since Zotero does Bibtex export, getting that to match the Spires entry would be the most useful. A journal-style export would probably not be used in writing a paper (papers are submitted as a latex document), but maybe for some other uses...?

fbennett · December 9, 2009

Thanks. In the longer term, the simplest solution will be to build a wrapper for the new CSL processor that can operate as a drop-in replacement for BibTeX. Then you would be able to avoid the pain and uncertainty of running the item data across an extra data exchange boundary, and just produce the *.bbl file (or whatever LaTeX needs) direct from any CSL-compliant data storage engine. But that's a ways down the road (and will depend on the effort of people other than myself).

shashiprabhakar · December 11, 2013

Arxiv articles can be added very easily with the Zotero addon. However, I would like to be able to add articles in Zotero Standalone using the arXiv id, since the arXiv id is also a unique identifier for a particular article. As far as I know, arXiv provides some APIs that may help in pulling data from the arXiv database. Even entering the identifier in a form like "arXiv:1312.2515" would be fine. This feature could help many Zotero Standalone users who wish to add an arXiv article by its identifier.
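As a rough sketch of what such a lookup could involve, the Python below normalizes an identifier like "arXiv:1312.2515", builds a query URL for the arXiv API (the export.arxiv.org/api/query endpoint with its id_list parameter), and parses the title and authors out of the Atom feed that comes back. The function names are hypothetical, fetching the URL is left to the caller, and nothing here reflects how Zotero itself would implement the feature.

```python
import re
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

ATOM = "{http://www.w3.org/2005/Atom}"


def normalize_arxiv_id(text):
    """Strip an optional leading "arXiv:" prefix and whitespace."""
    return re.sub(r'^arxiv:\s*', '', text.strip(), flags=re.IGNORECASE)


def query_url(arxiv_id):
    """Build an arXiv API query URL for a single identifier."""
    params = urlencode({"id_list": normalize_arxiv_id(arxiv_id)})
    return "http://export.arxiv.org/api/query?" + params


def parse_entry(atom_xml):
    """Extract title and author names from the first Atom entry, if any."""
    root = ET.fromstring(atom_xml)
    entry = root.find(ATOM + "entry")
    if entry is None:
        return None
    return {
        # Titles may be wrapped across lines; collapse the whitespace.
        "title": " ".join(entry.findtext(ATOM + "title", "").split()),
        "authors": [a.findtext(ATOM + "name", "")
                    for a in entry.findall(ATOM + "author")],
    }


print(query_url("arXiv:1312.2515"))
```

The response could then be fetched with something like urllib.request.urlopen(query_url(...)).read() and passed to parse_entry.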