Trouble with XPath

stimms · February 5, 2011

Hi,
I'm hoping one of you might be able to help me out with some fixes I wanted to make to an outdated translator. It is for the Economist site (http://economist.com). I have found the XPath of the element I want to use for the title, but when I run the translator it fails and reports that the doc.evaluate call is returning null. I find this odd, since I can copy these lines into Firebug and they find the text perfectly; the translator also works from within Scaffold. These are the lines:

var flyTitle = doc.evaluate('/html/body/div[2]/div[3]/div/div/div/h2', doc, nsResolver, XPathResult.ANY_TYPE, null).iterateNext().textContent;
var headline = doc.evaluate('/html/body/div[2]/div[3]/div/div/div/div', doc, nsResolver, XPathResult.ANY_TYPE, null).iterateNext().textContent;
newItem.title = flyTitle + ': ' + headline;

Is there something different about the context in which the translator runs that would cause its behaviour to vary from what I see in Firebug and Scaffold? I have properly set up namespace and nsResolver in my function. The test page I've been using is http://www.economist.com/node/18065683.

ajlyon · February 5, 2011

First -- note that you don't want to use XPath like that. Try to identify a good @id or @name you can search from; counting divs and whatnot from the root is very fragile.
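
As a rough illustration of that point, an expression anchored on an attribute keeps working when the layout around it shifts, while a path counted from the root does not. The helper names below and the "article-body" id are made up for the example; they are not part of Zotero or of the Economist markup:

```javascript
// Hypothetical helpers: build XPath expressions anchored on an attribute
// instead of counting divs from /html/body.

// Match an element by its id, e.g. //div[@id="article-body"]
function byId(tag, id) {
    return '//' + tag + '[@id="' + id + '"]';
}

// Match an element that carries a given class token, even when the
// class attribute lists several classes (e.g. class="headline large").
function byClass(tag, className) {
    return '//' + tag +
        '[contains(concat(" ", normalize-space(@class), " "), " ' + className + ' ")]';
}

// "article-body" is an assumed id, used only for illustration.
var titleXPath = byId('div', 'article-body') + '//h2';
```

The concat/normalize-space form in byClass is the usual XPath 1.0 idiom for matching one class among several; a plain @class="..." test only matches when the attribute is exactly that string.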

stimms · February 5, 2011

A reasonable point. This site makes heavy use of class names, which I also tried using.

For instance, I used
doc.evaluate('//h2[@class="fly-title"]', doc, nsResolver, XPathResult.ANY_TYPE, null).iterateNext().textContent;

Which, again, works nicely in Firebug and Scaffold but fails miserably in Zotero proper:
doc.evaluate("//h2[@class=\"fly-title\"]", doc, nsResolver, XPathResult.ANY_TYPE, null).iterateNext() is null

So, the issue of which XPath I use aside, I am still not getting the consistent results I would like.
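
For reference, a guard around iterateNext() would at least turn the failure into something that can be logged rather than an exception. This is only a minimal sketch: xpathText is a hypothetical helper, not a Zotero function, and the literal 0 stands in for XPathResult.ANY_TYPE:

```javascript
// Hypothetical helper: return the trimmed text of the first node matching
// the XPath, or null when nothing matches (instead of throwing when
// iterateNext() returns null).
function xpathText(doc, xpath, nsResolver) {
    var ANY_TYPE = 0; // numeric value of XPathResult.ANY_TYPE
    var result = doc.evaluate(xpath, doc, nsResolver || null, ANY_TYPE, null);
    var node = result ? result.iterateNext() : null;
    return node ? node.textContent.trim() : null;
}
```

With that in place, xpathText(doc, '//h2[@class="fly-title"]', nsResolver) returns null on a document that lacks the element, which makes it easier to check which document the translator was actually handed.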

ajlyon · February 5, 2011

This sounds suspiciously like a case of scraping the wrong doc -- if your target regex is too loose, sometimes you will end up scraping an iframe or something else. This frequently happens with things like "AddThis" boxes.
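
To make the "too loose" part concrete, here is a small sketch. Both patterns and the widget URL are invented for illustration; they are not the actual translator's target or a real AddThis address:

```javascript
// Illustrative target patterns (assumptions, not the real translator's).
var looseTarget = /economist\.com/;
var strictTarget = /^https?:\/\/(www\.)?economist\.com\/node\/\d+/;

var articleUrl = 'http://www.economist.com/node/18065683';
// Made-up example of a widget frame whose URL merely mentions the site.
var widgetUrl = 'http://widget.example.com/share?url=http://www.economist.com/node/18065683';

looseTarget.test(articleUrl);  // true
looseTarget.test(widgetUrl);   // true: the frame would match as well
strictTarget.test(articleUrl); // true
strictTarget.test(widgetUrl);  // false
```

Anchoring the pattern at the start of the URL and requiring the article path is what keeps the embedded frame out.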

And this discussion should probably take place on zotero-dev, not the forum.

And I'm calling it a night; more feedback in the morning if you still need help.

stimms · February 5, 2011

Thanks for your help, I'll move the discussion over there.

mysheepb · February 22, 2011

@ajlyon
I am looking into the Aleph translator, and I suspect that some of the problems are related to scraping an iframe with an "AddThis" box. Can you shed some light on how to avoid this? Thanks! (A pointer to a related discussion is also welcome!)

ajlyon · February 22, 2011

The discussion above ended up pointing to a very nasty issue with the way The Economist works, but in general you want to be very selective in your target regex in the JSON header and check for clear indicators of content (not just the URL) in the detectWeb function.
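
A minimal sketch of the detectWeb side of that advice, assuming a plain HTML page (so no namespace resolver is needed) and borrowing the fly-title element from earlier in this thread as the content indicator. The XPath and the returned item type are placeholders to adapt to your own site:

```javascript
// Sketch of a detectWeb that checks the page content, not just the URL.
// The XPath and the item type are assumptions for illustration.
function detectWeb(doc, url) {
    var ANY_TYPE = 0; // numeric value of XPathResult.ANY_TYPE
    var result = doc.evaluate('//h2[@class="fly-title"]', doc, null, ANY_TYPE, null);
    // Only claim the page if the expected headline element is present;
    // a frame that matched the target by accident will not contain it.
    if (result && result.iterateNext()) {
        return "magazineArticle";
    }
    return false;
}
```

The idea is that the target regex narrows down which URLs are considered at all, and the content check then rejects any document that slipped through without the element you intend to scrape.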

As before, please direct further questions to zotero-dev.