meta request: easier tools to write translators with

seatrout · April 3, 2010

The most frustrating thing about Z for me is that it is so very hard to write translators in. I'm not wholly clueless about small programming tasks. I can write useful macros in openoffice basic and in python; I have even written one zotero translator, for the Guardian, which mostly works. I don't think it works quite well enough to make public: it doesn't yet discriminate between the Guardian and the Observer, nor between pieces published in the paper and those published only online. But trying to write it on Windows was an extremely painful process. There are tools for writing and evaluating xpath expressions, but that's it. The various tutorials on this site are all old and obsolete.

What's really helpful to a beginner is the kind of interactive and incremental programming you can do in a python IDE: type a line, evaluate, see what's wrong, and try again. When it's working, save it. If there's a way to work like that in "raw" javascript with zotero, I can't find any reference to it. There used to be Scaffold, which was wonderful. But it doesn't work with 2.0 and won't, so far as I can see, be upgraded.

It seems to me that until it is much easier to write translators, they are always going to be a huge bottleneck in Zotero. It will never be really useful outside the context of academic libraries while it's so difficult to extend it to newspapers and news websites generally.

fbennett · April 3, 2010

I also shied away from translator work after Scaffold went away, for the same reasons. But I've recently returned to it, and discovered that you don't need Scaffold. Zotero itself provides everything you need for incremental development of a translator. The workflow is actually simpler than in Scaffold.

To get a look at what's in a page (including internal refactoring by Javascript that is masked in the HTML), I run Firebug. Its "Inspect element" menu item is handy for identifying node names and attributes.

Start by copying one of the 2.0 translators. Gear menu -> Preferences -> Advanced -> Show Data Directory will take you to its parent directory. Click on ./translators and there they are. Just copy a file in that directory, then open it and alter the ID to prevent it from being affected by an update. Give it your title and you're ready to start.

First step is to strip out the content of the two API functions detectWeb() and doWeb(). The doWeb() function can be empty at the start. For detectWeb(), give it a simple return value, like this:

function detectWeb () {
    return "book";
}

Save the file. Your translator will now show an icon when the URL in the JSON header matches a site. The JSON header data is reloaded only when you restart Firefox, so a little care with crafting the URL regexp is worthwhile. To get started on the code, you might want to set a broad-ish regexp to get an icon up, and tighten up the conditions later.
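
As a rough illustration of the broad-then-tight approach, the sketch below tries two candidate target patterns against a few sample URLs in plain JavaScript, outside Zotero. The example.com URLs, both patterns, and the matchesTarget() helper are made up for illustration; only the idea of a target regexp in the header comes from the description above.

```javascript
// Hypothetical target patterns; example.com stands in for the real site.
const broadTarget = "^https?://www\\.example\\.com/";
const tightTarget = "^https?://www\\.example\\.com/news/\\d{4}/";

// Returns true when the URL matches the target pattern string.
function matchesTarget(target, url) {
    return new RegExp(target).test(url);
}

const sampleUrls = [
    "http://www.example.com/news/2010/apr/03/some-story",
    "http://www.example.com/about",
    "http://other.example.org/news/2010/"
];

// Print each URL with the broad and the tight verdict side by side.
for (const url of sampleUrls) {
    console.log(url, matchesTarget(broadTarget, url), matchesTarget(tightTarget, url));
}
```

Checking patterns this way before restarting Firefox may save a few restarts, since the header is only reread at startup.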

Once you're getting an icon, you can reload the JS code of the translator by hitting the refresh button in the browser. Start coding, refresh, see what you get. If you mess up the JS, you'll get an error that you can see through the Firefox error console under Tools.

For internal fetches of pages and PDF files, use the new Zotero.Utilities.retrieveDocument() and Zotero.Utilities.retrieveSource() functions. They run synchronously (as far as your translator is concerned) and don't require the Zotero.wait() / Zotero.done() functions. This simplifies the code a great deal; it's pretty much just straightforward scripting, as far as programming logic is concerned.
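
A minimal sketch of what that straight-line style might look like. The stub object, the two wrapper functions, and the example.com URL are assumptions added so the snippet runs outside Zotero; the return values (a document object and a string of source text) are also assumed rather than taken from documentation.

```javascript
// Stand-in for the translator sandbox so the sketch runs outside Zotero.
// Inside a real translator, Zotero is already provided and the stub is unused.
const Z = (typeof Zotero !== "undefined") ? Zotero : {
    Utilities: {
        retrieveSource: function (url) {
            return "<html><title>Stub page</title></html>";
        },
        retrieveDocument: function (url) {
            return { title: "Stub page", location: { href: url } };
        }
    }
};

// Fetch a linked page and use the result on the very next line:
// no callback, no Zotero.wait() / Zotero.done().
function fetchLinkedTitle(url) {
    const linkedDoc = Z.Utilities.retrieveDocument(url);
    return linkedDoc.title;
}

// Fetch raw text (for example an export file) in the same synchronous style.
function fetchLinkedSource(url) {
    return Z.Utilities.retrieveSource(url);
}

console.log(fetchLinkedTitle("http://www.example.com/article/1"));
```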

I agree that documentation would be welcome -- so there's a start on it (!).

ajlyon · April 3, 2010

Having never used Scaffold, I have only worked in the manner that Frank describes, and it works decently well. That said, Scaffold looks like a great idea, and I wonder if some of us in the interested, relatively code-savvy Zotero community couldn't make it Zotero 2.0 compatible.

I'll be taking a look-- reading and writing JS files in the translators directory can't be _that_ different from working with the SQLite database. The key source can be browsed at https://www.zotero.org/trac/browser/scaffold/trunk/chrome/content/scaffold .

Another useful new tool would be a test framework for translators-- I would like to see specific URLs and the expected results, to detect translator breakage. This could even be part of a new, updated Scaffold. The test URLs and expected results could be specified in each translator, then run on demand by translator developers from within Scaffold.
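
To make the idea concrete, here is one way such test cases and a runner could be sketched. Everything in it is hypothetical: the test-case format, the compareItem() and runTests() names, and the fake runner that stands in for actually running a translator on a URL.

```javascript
// Hypothetical test-case format: a URL plus the fields we expect back.
const testCases = [
    { url: "http://www.example.com/news/1",
      expected: { itemType: "newspaperArticle", title: "First story" } },
    { url: "http://www.example.com/news/2",
      expected: { itemType: "newspaperArticle", title: "Second story" } }
];

// Compare expected fields against what the translator produced.
// Returns a list of readable mismatches (an empty list means the case passed).
function compareItem(expected, actual) {
    const problems = [];
    for (const field of Object.keys(expected)) {
        if (actual[field] !== expected[field]) {
            problems.push(field + ": expected " + JSON.stringify(expected[field]) +
                ", got " + JSON.stringify(actual[field]));
        }
    }
    return problems;
}

// runTranslator is a placeholder for whatever really runs a translator on a URL.
function runTests(cases, runTranslator) {
    return cases.map(function (testCase) {
        const problems = compareItem(testCase.expected, runTranslator(testCase.url));
        return { url: testCase.url, passed: problems.length === 0, problems: problems };
    });
}

// Fake runner for demonstration: it gets the second title wrong on purpose.
function fakeRunner(url) {
    return {
        itemType: "newspaperArticle",
        title: url.endsWith("/1") ? "First story" : "Oops"
    };
}

console.log(runTests(testCases, fakeRunner));
```

Keeping the cases as plain data would let them live alongside each translator and be rerun whenever a site changes its markup.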

dstillman · April 3, 2010

I agree with Frank that Scaffold isn't particularly necessary in 2.0, but we'd be happy to have others help get it working again. (Rintze was looking into it at one point, but I'm not sure if anything came of that.) The main advantages of Scaffold that come to mind are 1) automatic generation of a UUID (without going to a website and generating one) and metadata (without copying and pasting from another translator), 2) the ability to test changes to the target regexp, which is cached at Firefox startup, and 3) the ability to test saving without actually creating items in the database.

Other than working with individual files rather than the database, the biggest difference between 1.0 and 2.0 versions would be that in 2.0 detectWeb() and doWeb() aren't separated, so having separate panes would be tricky. The alternative would be to have a single code frame with a button to run each function separately instead of having a single button that depended on the active frame.

Another useful new tool would be a test framework for translators-- I would like to see specific URLs and the expected results, to detect translator breakage.

Yup.

seatrout · April 3, 2010

Thanks for all these suggestions. I think that part of the problem is that I work on Windows, and Dan, in an earlier thread, says essentially that console debugging doesn't work on Windows. I just cobbled together a Daily Mail translator that seems to work, using Scaffold and then copying and pasting the results into a javascript file. It's the ability to change a line of code, run it, and see the output at once which I find so useful.

dstillman · April 3, 2010

Console output works on Windows—it's just incredibly slow, ugly, and hard to interact with.

You could also use Zotero's built-in debug output logging with the View window open, hitting Refresh manually.

Rintze · April 3, 2010

Rintze was looking into it at one point, but I'm not sure if anything came of that

Not much, I'm afraid. My stranded attempt: http://bitbucket.org/rmzelle/scaffold/

adamsmith · April 3, 2010

I'd like to second the request to lower the barrier to entry on creating translators - I'm perfectly happy if the way this is done is to develop documentation as started by Frank above, maybe in addition to a testing framework. My guess would be that a couple of hours of work put into this by someone who knows what s/he is doing would have amazing returns on investment - after I have just finished what I think is my 100th style (at least if minor adjustments count) I'm ready to do something different from time to time ;-).

kieren · April 3, 2010

It strikes me that someone could pretty easily rig together a translator builder using the toolkit I've just lashed together at: http://github.com/singingfish/zotero-browser which provides an easy way into the zotero API through the web browser.

My personal goal in the first instance is to build a really rich annotated bibliography browser, but I'm guessing that it wouldn't take much effort for a moderately skilled programmer to put together a GUI translator helper (maybe using a frame containing the page of interest, and some kind of custom DOM inspector perhaps based on Firebug Lite (http://getfirebug.com/firebuglite)).

dstillman · April 3, 2010

I don't see any particular reason to abandon the XUL interface that already exists—it just needs a few small fixes to work in 2.0.

kieren · April 3, 2010

From when I used scaffold a few years ago, I remember that it still required considerable programming skill. Really I was just thinking aloud that a GUI translator builder would be reasonably easy to achieve using firebug lite, even if the corresponding xpath or DOM expressions needed to be cleaned up manually later on. The webserver / server side javascript stuff I've just discovered really means that the end user doesn't have to go to the inconvenience of setting up a Firefox development environment in order to delve into the zotero internals, although there's nothing stopping the same kind of functionality from being integrated into scaffold itself.

dstillman · April 4, 2010

From when I used scaffold a few years ago, I remember that it still required considerable programming skill.

I think the new retrieveDocument() and retrieveSource() functions reduce the level of skill needed dramatically, but, while I'd love to be proven wrong, I'm pretty skeptical that the basic requirements are going to change.

You were a part of the related thread on the dev list (where this discussion should probably continue), but I'll just link to Frank's response to Bruce on the subject. Summary: the web is a messy place, and even seemingly simple translators tend to get complicated before the end. There are really only a handful of exceptions to this in the existing translator library.

In terms of architecture, translators are basically as simple as assigning values to a set of predefined properties. But the rest of the code in existing translators isn't just for fun—it's there to deal with issues specific to those sites. (The exception is the async stuff, which we've been able to work around via retrieveDocument() and retrieveSource() thanks to new functionality in Fx3.0 and higher.)
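
A minimal sketch of that core, assuming the values have already been pulled out of the page. StubItem is a stand-in added here so the snippet runs outside Zotero, and saveArticle() and the sample values are hypothetical; the site-specific scraping that makes real translators long is deliberately left out.

```javascript
// Stand-in for the sandbox's Zotero.Item so the sketch runs outside Zotero.
// It only records completed items; the real object saves them to the library.
const completedItems = [];
function StubItem(itemType) {
    this.itemType = itemType;
    this.creators = [];
    this.complete = function () { completedItems.push(this); };
}
const ItemClass = (typeof Zotero !== "undefined") ? Zotero.Item : StubItem;

// The core of a translator: fill in predefined fields, then call complete().
// The "scraped" argument stands for values already extracted from the page.
function saveArticle(scraped) {
    const item = new ItemClass("newspaperArticle");
    item.title = scraped.title;
    item.publicationTitle = scraped.paper;
    item.date = scraped.date;
    item.url = scraped.url;
    item.creators.push({
        firstName: scraped.authorFirst,
        lastName: scraped.authorLast,
        creatorType: "author"
    });
    item.complete();
    return item;
}

saveArticle({
    title: "Example headline",
    paper: "Example Daily",
    date: "2010-04-03",
    url: "http://www.example.com/news/1",
    authorFirst: "Ann",
    authorLast: "Author"
});
console.log(completedItems.length);
```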

Also, translators that a more automated tool would be able to generate would almost invariably be screen scrapers, which are the most fragile kind of translators producing the worst data. Would the increased ease of authorship make up for the increased likelihood that the translators would break or produce bad data? Maybe.

I think there are a few translator conventions—such as the messy XPath evaluate() lines—that could be simplified and a few additional things a tool such as Scaffold could offer—such as providing access to the list of available item fields—but I think that getting quality data is going to require scripting on most sites until embedded metadata becomes more common.

There are plenty of people who have the skills to help, though. I think the most effective thing we could offer is a better system for managing translator development, and that's something that's in the works.

seatrout · April 4, 2010

Part of the problem, I think, is that the newcomer is faced with two unknown and orthogonal things to learn. Both have their documentation fairly well hidden.

First there is the structure of a Zotero entry itself; I don't know where -- outside the source -- there is a listing of the fields for the main types of entries: shall we say blogs, newspaper articles, journal articles, and books. That would be a great help when thinking about what a translator needs to do and breaking down the task of writing one into manageable pieces.

Secondly there is the business of writing xpath manipulations in javascript. This is partially documented in various places. But it would be helpful to have a clear and simple explanation, with the zotero docs, of the iterateNext() and textContent idioms.
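
For what it is worth, the idiom in question boils down to something like the sketch below. The xpathText() helper name is made up, and passing null as the namespace resolver is an assumption that appears to work on ordinary HTML pages rather than a documented recommendation.

```javascript
// Same numeric value as XPathResult.ANY_TYPE in the browser.
const XPATH_ANY_TYPE = 0;

// Collect the trimmed text of every node matched by an XPath expression.
function xpathText(doc, xpath) {
    // Arguments: expression, context node, namespace resolver (null here),
    // result type, and an existing result object to reuse (null here).
    const results = doc.evaluate(xpath, doc, null, XPATH_ANY_TYPE, null);
    const texts = [];
    let node;
    // iterateNext() hands back one node per call, then null when exhausted.
    while ((node = results.iterateNext())) {
        // textContent is the text inside the node, with the tags stripped out.
        texts.push(node.textContent.trim());
    }
    return texts;
}
```

Inside doWeb() this might be called as xpathText(doc, '//h1') to get the headline text, where the '//h1' expression is only an example and the right one depends on the page.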

Thirdly, there remains the Windows problem. Console debugging there is, as Dan says, ugly, hard to interact with, and incredibly slow. That does make it difficult for a very large proportion of users to do anything without scaffold, which is none of those things. What makes Scaffold worthwhile is not its graphical nature so much as the little lightning icon which lets you see the results of your mistakes almost instantly. Is there any way to reproduce this behaviour using, say, the Console2 addon? (http://bit.ly/chttp2)

I don't want to be unremittingly negative. I will try to write an annotated example of newspaper scrapers if I have time. But I think that the business of writing translators could be made a lot less forbidding even if it will never become particularly easy.

dstillman · April 4, 2010

I don't want to be unremittingly negative.

You're not—I've agreed with you on all of these points.

Is there any way to reproduce this behaviour using, say, the Console2 addon?

No. Console² is great, and I highly recommend it for Firefox-based development, but debug output doesn't go to the error console. But again, if you don't mind hitting Ctrl-R periodically you can use debug output logging from the Advanced pane of the Zotero prefs. That's what I do now when I need to debug an issue on Windows.

seatrout · April 4, 2010

Ah I finally understand what you mean by the debug output logging

OK. I have spent much of today writing a translator for the Daily Mail - not a paper I like, but one which I am obliged for professional reasons to clip. I think it is just about usable by other people now. I could not have done even a tiny fraction of it without Scaffold, but it is reasonably easy to keep a second firefox profile just for running Zotero 1.0 and Scaffold in it. With the help of the dom inspector and xpather, this seems to supply what I need. Then I can develop incrementally in scaffold and when it works just paste the whole lot into a js file in my "real" profile.

So where should I put this thing, assuming anyone else would find it useful or informative?

ajlyon · April 4, 2010

You should post the translator to the files section of the zotero-dev group and post a message there. The translator will then be reviewed and added to Zotero.

seatrout · April 5, 2010

Another constructive suggestion: could we have the standard doc.evaluate(xpath,blah,blah,blah) constructions abstracted into helper functions like "Zotero.getXpath(xpathexpression)" and "Zotero.getXpath.text(xpathexpression)"?

I imagine these functions would return an array of nodes or text strings respectively. They would cut down a lot of repetitive typing and make the writing of scrapers seem less forbidding.

Another helpful widget on those lines would be Zotero.getMetaTags(doc).

I was going to write a simple version for my own use, and then realised that some of the translators in the repository (eg the Time magazine one) have a complicated dance around namespace resolvers. Since I don't know what a namespace resolver is, nor why I would want one, I just put "null" for that parameter, which hasn't broken yet. But I'm sure there is a more grown-up way to do it.

In any case, if those functions were in the Z toolkit, so to say, then writing a scraper would almost be reduced to cycling through the metatags for information, and if it is not there, looking at the page with dom inspector and xpather till you find the xpaths that you want.

kieren · April 5, 2010

I'm inclined to think that there is a need for a simplified zotero API for common operations, but documenting the existing API properly should be a priority first. Hopefully the documentation[1] and tools[2] I've been working on over the past couple of weeks might be sufficiently easy to use to prompt other contributors to step in.

[1] http://www.zotero.org/support/dev/api_user_docs

[2] http://github.com/singingfish/zotero-browser

dstillman · April 5, 2010

Another constructive suggestion: could we have the standard doc.evaluate(xpath,blah,blah,blah) constructions abstracted into helper functions like "Zotero.getXpath(xpathexpression)" and "Zotero.getXpath.text(xpathexpression)"?

I already gave this as an example of something we can simplify above. I've created a ticket.

Another helpful widget on those lines would be Zotero.getMetaTags(doc).

I don't think that's any easier than doc.getElementsByTagName('meta'), which is very standard.
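
As a rough sketch of how the standard call could be used to cycle through the meta tags, the snippet below builds a simple name-to-content lookup. The collectMetaTags() name and the choice to lowercase the names are illustrative assumptions, not an existing Zotero utility.

```javascript
// Build a name -> content lookup from a page's <meta> tags.
function collectMetaTags(doc) {
    const metaNodes = doc.getElementsByTagName("meta");
    const tags = {};
    for (let i = 0; i < metaNodes.length; i++) {
        const name = metaNodes[i].getAttribute("name");
        const content = metaNodes[i].getAttribute("content");
        // Skip tags without a name (e.g. http-equiv or charset declarations).
        if (name && content !== null) {
            tags[name.toLowerCase()] = content;
        }
    }
    return tags;
}
```

A translator could then read, say, collectMetaTags(doc)["author"] and fall back to XPath only when the page does not provide the tag.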

Since I don't know what a namespace resolver is, nor why I would want one, I just put "null" for that parameter, which hasn't broken yet.

This is the main other convention I was referring to above. The namespace resolver is a convention inherited from MIT SIMILE's Piggy Bank, on which the translator architecture was originally based. (It has since diverged a good bit.) We've never come across a page where that was actually necessary. It doesn't really matter, though, since this will be abstracted by the XPath helper functions. For now, passing null should almost always be fine.

dstillman · April 5, 2010

I'm inclined to think that there is a need for a simplified zotero API for common operations, but documenting the existing API properly should be a priority first. Hopefully the documentation[1] and tools[2] I've been working on over the past couple of weeks might be sufficiently easy to use to prompt other contributors to step in.

Kieren, the stuff you've been working on is great, but translators are sandboxed. Other than a handful of utility functions, they don't have anything to do with the "Zotero API" you're referring to or even to other Firefox/Zotero-based development.

Gracile · April 6, 2010

I could not have done even a tiny fraction of it without Scaffold, but it is reasonably easy to keep a second firefox profile just for running Zotero 1.0 and Scaffold in it. With the help of the dom inspector and xpather, this seems to supply what I need. Then I can develop incrementally in scaffold and when it works just paste the whole lot into a js file in my "real" profile.

+1
I'm not a programmer at all. But I've found it pretty accessible to write a basic translator with Scaffold (and FF 3.5 installed) and the help of this great tutorial.

fbennett · April 12, 2010

With a few small fixes, Rintze's Scaffold update for Zotero 2.0 now seems to be functioning. It has some rough edges in the display (due to some rearrangement of the XUL thrown in by yours truly without reading the XUL docs), but it seems to work pretty much as it did under 1.0. No promises at this point, but if anyone wants to zip up the scaffold.xpi from the sources and take it out for a spin, let us know how you get on.

ajlyon · April 12, 2010

It works! Now I see why people were complaining so much that Scaffold didn't work-- it certainly makes things nicer.

fbennett · April 12, 2010

I just posted a few further fixes that tidy up the layout a little. When Rintze wakes up in a couple of hours he can take a bow to our applause; my own contribution to this has been pretty small.

Rintze · April 12, 2010

@fbennett: Awesome! Thanks for waking Scaffold from hibernation. I guess it all came down to the right finishing touch.

seatrout · April 12, 2010

@fbennett -- thanks very much for this. I hope to try it out properly tomorrow

ajlyon · April 18, 2010

This works nicely. One issue, however, is that the JSON header is now written as a single line, making it a lot less legible. Could this be fixed?

Other than that, it looks like Scaffold 2.0 works well enough to update http://www.zotero.org/support/dev/creating_translators_for_sites and http://www.zotero.org/support/dev/scaffold, and upload the new version to Zotero.org. Can that be done?

Gracile · April 18, 2010

@Rintze, fbennett: that's awesome! Thank you so much. I've tried Scaffold 2.0 and it works well with FF 3.6.3.

@ajlyon:

One issue, however, is that the JSON header is now written as a single line, making it a lot less legible. Could this be fixed?

As far as I can remember, it was already the case with Scaffold 1.0.

ajlyon · April 18, 2010

@Gracile: Well, the translators in the repository have generally had line breaks in the JSON header. Opening and modifying one of these translators with Scaffold 2.0 will result in less human-friendly formatting without line breaks.

fbennett · April 18, 2010

We're finding that it's not reporting JS errors in the translator code, which will be a little daunting when writing a translator from scratch. Hasn't been resolved yet, but we're looking at it. If anyone with js skills wants to take a look, by all means please have a go.

dstillman · April 18, 2010

As far as I can remember, it was already the case with Scaffold 1.0.

There were no JSON blocks in 1.0 translators. This should indeed be fixed.