meta request: easier tools to write translators with
The most frustrating thing about Zotero for me is that it is so very hard to write translators for. I'm not wholly clueless about small programming tasks: I can write useful macros in OpenOffice Basic and in Python, and I have even written one Zotero translator, for the Guardian, which mostly works. I don't think it works quite well enough to make public: it doesn't yet discriminate between the Guardian and the Observer, nor between pieces published in the paper and those published only online. But trying to write it on Windows was an extremely painful process. There are tools for writing and evaluating XPath expressions, but that's it. The various tutorials on this site are all old and obsolete.
What's really helpful to a beginner is the kind of interactive and incremental programming you can do in a Python IDE: type a line, evaluate it, see what's wrong, and try again. When it's working, save it. If there's a way to work like that in "raw" JavaScript with Zotero, I can't find any reference to it. There used to be Scaffold, which was wonderful. But it doesn't work with 2.0 and won't, so far as I can see, be upgraded.
It seems to me that until it is much easier to write translators, they are always going to be a huge bottleneck in Zotero. It will never be really useful outside the context of academic libraries while it's so difficult to extend it to newspapers and news websites generally.
To get a look at what's in a page (including changes made to the DOM by JavaScript that are masked in the raw HTML source), I run Firebug. Its "Inspect element" menu item is handy for identifying node names and attributes.
Start by copying one of the 2.0 translators. Gear menu -> Preferences -> Advanced -> Show Data Directory will take you to the Zotero data directory. Open ./translators and there they are. Just copy a file in that directory, then open the copy and alter the translator ID to prevent it from being overwritten by an update. Give it your own title and you're ready to start.
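For reference, the part you're editing is the JSON block at the top of the file; it looks roughly like this in the 2.0-era translators (every value below, including the ID, is made up and should be replaced with your own):

{
    "translatorID": "aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee",
    "label": "My Newspaper",
    "creator": "Your Name",
    "target": "^https?://www\\.example\\.com/",
    "minVersion": "2.0",
    "maxVersion": "",
    "priority": 100,
    "inRepository": false,
    "translatorType": 4,
    "lastUpdated": "2010-05-01 00:00:00"
}

translatorType 4 marks it as a web translator, and the target regexp is what decides when your icon appears.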
The first step is to strip out the contents of the two API functions, detectWeb() and doWeb(). The doWeb() function can be empty at the start. For detectWeb(), give it a simple return value, like this:
function detectWeb(doc, url) {
    return "book";
}
Save the file. Your translator will now show an icon when a page's URL matches the target regexp in the JSON header. The header data is reloaded only when you restart Firefox, so a little care in crafting the URL regexp is worthwhile. To get started on the code, you might want to set a broad-ish regexp to get an icon up, and tighten the conditions later.
Once you're getting an icon, you can reload the JS code of the translator by hitting the refresh button in the browser. Start coding, refresh, see what you get. If you mess up the JS, you'll get an error that you can see through the Firefox error console under Tools.
For internal fetches of pages and PDF files, use the new Zotero.Utilities.retrieveDocument() and Zotero.Utilities.retrieveSource() functions. They run synchronously (as far as your translator is concerned) and don't require the Zotero.wait() / Zotero.done() functions. This simplifies the code a great deal; it's pretty much just straightforward scripting, as far as programming logic is concerned.
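To make that concrete, here is a minimal sketch of the sort of thing I mean -- only retrieveDocument(), retrieveSource() and the other Zotero calls are real; the "?print=true" suffix and the XPath are invented for illustration:

function doWeb(doc, url) {
    var item = new Zotero.Item("newspaperArticle");
    item.url = url;
    // retrieveDocument() fetches another page and hands back a parsed DOM document
    var printDoc = Zotero.Utilities.retrieveDocument(url + "?print=true");
    var headline = printDoc.evaluate('//h1', printDoc, null, XPathResult.ANY_TYPE, null).iterateNext();
    if (headline) item.title = Zotero.Utilities.trimInternal(headline.textContent);
    // retrieveSource() fetches the raw text of a URL instead
    // var html = Zotero.Utilities.retrieveSource(url);
    item.complete();
}

No Zotero.wait() / Zotero.done() anywhere, and no callbacks.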
I agree that documentation would be welcome -- so there's a start on it (!).
I'll be taking a look-- reading and writing JS files in the translators directory can't be _that_ different from working with the SQLite database. The key source can be browsed at https://www.zotero.org/trac/browser/scaffold/trunk/chrome/content/scaffold .
Another useful new tool would be a test framework for translators-- I would like to see specific URLs and the expected results, to detect translator breakage. This could even be part of a new, updated Scaffold. The test URLs and expected results could be specified in each translator, then run on demand by translator developers from within Scaffold.
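Just to sketch the idea -- nothing like this exists yet, and the property name and structure here are invented:

var testCases = [
    {
        type: "web",
        url: "http://www.example.com/news/2010/may/01/some-article",
        expected: {
            itemType: "newspaperArticle",
            title: "Some article",
            publicationTitle: "The Example Times",
            date: "2010-05-01"
        }
    }
];

Scaffold (or a command-line runner) could then load each test URL, run the translator, and diff the result against the expected values.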
Other than working with individual files rather than the database, the biggest difference between 1.0 and 2.0 versions would be that in 2.0 detectWeb() and doWeb() aren't separated, so having separate panes would be tricky. The alternative would be to have a single code frame with a button to run each function separately instead of having a single button that depended on the active frame. Yup.
You could also use Zotero's built-in debug output logging with the View window open, hitting Refresh manually.
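A Zotero.debug() call wherever you're unsure what you're getting is enough for that; the output turns up in the viewer after you hit Refresh. For example:

Zotero.debug("detectWeb hit for " + url);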
Now that I've just finished what I think is my 100th style (at least if minor adjustments count), I'm ready to do something different from time to time ;-).
My personal goal in the first instance is to build a really rich annotated bibliography browser, but I'm guessing that it wouldn't take much effort for a moderately skilled programmer to put together a GUI translator helper (maybe using a frame containing the page of interest, and some kind of custom DOM inspector, perhaps based on Firebug Lite: http://getfirebug.com/firebuglite).
You were a part of the related thread on the dev list (where this discussion should probably continue), but I'll just link to Frank's response to Bruce on the subject. Summary: the web is a messy place, and even seemingly simple translators tend to get complicated before the end. There are really only a handful of exceptions to this in the existing translator library.
In terms of architecture, translators are basically as simple as assigning values to a set of predefined properties. But the rest of the code in existing translators isn't just for fun—it's there to deal with issues specific to those sites. (The exception is the async stuff, which we've been able to work around via retrieveDocument() and retrieveSource() thanks to new functionality in Fx3.0 and higher.)
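That core really is just this -- a sketch of the inside of a doWeb(), with invented field values:

var item = new Zotero.Item("newspaperArticle");
item.title = "Example headline";
item.publicationTitle = "The Example Times";
item.date = "1 May 2010";
item.url = url;    // url is doWeb()'s second argument
item.creators.push(Zotero.Utilities.cleanAuthor("Jane Smith", "author"));
item.attachments.push({url: url, title: "Example Snapshot", mimeType: "text/html"});
item.complete();

Everything beyond that is getting the right values out of the page, which is where the site-specific mess comes in.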
Also, the translators that a more automated tool would be able to generate would almost invariably be screen scrapers, which are the most fragile kind of translator and produce the worst data. Would the increased ease of authorship make up for the increased likelihood that the translators would break or produce bad data? Maybe.
I think there are a few translator conventions—such as the messy XPath evaluate() lines—that could be simplified and a few additional things a tool such as Scaffold could offer—such as providing access to the list of available item fields—but I think that getting quality data is going to require scripting on most sites until embedded metadata becomes more common.
There are plenty of people who have the skills to help, though. I think the most effective thing we could offer is a better system for managing translator development, and that's something that's in the works.
First, there is the structure of a Zotero entry itself. I don't know where -- outside the source -- there is a listing of the fields for the main types of entries: say blogs, newspaper articles, journal articles, and books. That would be a great help when thinking about what a translator needs to do and breaking down the task of writing one into manageable pieces.
Secondly, there is the business of writing XPath manipulations in JavaScript. This is partially documented in various places, but it would be helpful to have a clear and simple explanation, within the Zotero docs, of the iterateNext() and textContent idioms.
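The idiom I'm referring to is, roughly (the XPath here is just an example):

var result = doc.evaluate('//div[@id="article"]//p', doc, null, XPathResult.ANY_TYPE, null);
var node;
while ((node = result.iterateNext())) {
    // iterateNext() returns the next matching node, or null when there are no more
    Zotero.debug(Zotero.Utilities.trimInternal(node.textContent));
}

Something that short, with an explanation of what each argument does, would go a long way.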
Thirdly, there remains the Windows problem. Console debugging there is, as Dan says, ugly, hard to interact with, and incredibly slow. That does make it difficult for a very large proportion of users to do anything without Scaffold, which is none of those things. What makes Scaffold worthwhile is not its graphical nature so much as the little lightning icon which lets you see the results of your mistakes almost instantly. Is there any way to reproduce this behaviour using, say, the Console2 addon (http://bit.ly/chttp2)?
I don't want to be unremittingly negative. I will try to write an annotated example of newspaper scrapers if I have time. But I think that the business of writing translators could be made a lot less forbidding even if it will never become particularly easy.
OK. I have spent much of today writing a translator for the Daily Mail -- not a paper I like, but one which I am obliged for professional reasons to clip. I think it is just about usable by other people now. I could not have done even a tiny fraction of it without Scaffold, but it is reasonably easy to keep a second Firefox profile just for running Zotero 1.0 and Scaffold. With the help of the DOM Inspector and XPather, this seems to supply what I need. Then I can develop incrementally in Scaffold and, when it works, paste the whole lot into a .js file in my "real" profile.
So where should I put this thing, assuming anyone else would find it useful or informative?
I imagine these functions would return an array of nodes or text strings respectively. They would cut down a lot of repetitive typing and make the writing of scrapers seem less forbidding.
Another helpful widget on those lines would be Zotero.getMetaTags(doc).
I was going to write a simple version for my own use, and then realised that some of the translators in the repository (e.g. the Time magazine one) have a complicated dance around namespace resolvers. Since I don't know what a namespace resolver is, nor why I would want one, I just put "null" for that parameter, which hasn't broken anything yet. But I'm sure there is a more grown-up way to do it.
In any case, if those functions were in the Zotero toolkit, so to speak, then writing a scraper would almost be reduced to cycling through the meta tags for information and, where it is missing, looking at the page with the DOM Inspector and XPather until you find the XPaths that you want.
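The dance in question looks roughly like this, copied in spirit from those translators -- I'm showing it without claiming to understand it fully, and the '//x:h1' path is invented:

// build a resolver for the page's default namespace (XHTML pages need one);
// on plain HTML pages the namespace is null and so is the resolver
var namespace = doc.documentElement.namespaceURI;
var nsResolver = namespace ? function(prefix) {
    return (prefix == "x") ? namespace : null;
} : null;
// element names get the "x:" prefix when a resolver is in play
var headline = doc.evaluate(namespace ? '//x:h1' : '//h1', doc, nsResolver, XPathResult.ANY_TYPE, null).iterateNext();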
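To be clear, nothing called Zotero.getMetaTags() exists; this is just a throwaway sketch of the kind of helper I mean:

function getMetaTags(doc) {
    var tags = {};
    var metas = doc.getElementsByTagName("meta");
    for (var i = 0; i < metas.length; i++) {
        var name = metas[i].getAttribute("name") || metas[i].getAttribute("property");
        if (name) tags[name] = metas[i].getAttribute("content");
    }
    return tags;
}

With something like that, half the time a scraper could just do item.title = tags["title"] and fall back to XPath only when the meta tags let you down.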
I'm not a programmer at all, but I've found it pretty accessible to write a basic translator with Scaffold (and FF 3.5 installed) and the help of this great tutorial.
Other than that, it looks like Scaffold 2.0 works well enough to update http://www.zotero.org/support/dev/creating_translators_for_sites and http://www.zotero.org/support/dev/scaffold, and to upload the new version to Zotero.org. Can that be done?
@ajlyon: As far as I can remember, it was already the case with Scaffold 1.0.