FW.Scraper() major issue

salarmehr · August 12, 2012

How can I use document.evaluate in the FW.Scraper() function?

var tVar = "title";

FW.Scraper({
title: FW.Xpath('...').text().trim(), // works
title: FW.Xpath('.../text()').trim(), // Why does this not work?
title: FW.Xpath('.../text()').text().trim(), // Why does this not work?
title: "The title", // works
title: tVar, // Why does this not work?
title: doc.evaluate(...), // Why does this not work?
});

fbennett · August 12, 2012

You're trying to use tVar and doc via closure in that construct, but they would not be available inside the FW.Scraper() function if it was cast in a context where they were not defined.

Technical questions should probably go to the zotero-dev list.

aurimas · August 12, 2012

FW.Scraper({
title: FW.Xpath('...').text().trim(), // works
title: FW.Xpath('.../text()').trim(), // Why does this not work?
title: FW.Xpath('.../text()').text().trim(), // Why does this not work?
title: "The title", // works
title: tVar, // Why does this not work?
title: doc.evaluate(...), // Why does this not work?
});

title: FW.Xpath('.../text()').trim(), // Why does this not work?

missing .text() (see your other post)

title: FW.Xpath('.../text()').text().trim(), // Why does this not work?

works for me

title: tVar, // Why does this not work?

works for me

title: doc.evaluate(...), // Why does this not work?

doc is not available in the global context where you are trying to execute it. It is only passed to detectWeb and doWeb.

By using Framework, you are creating a set of rules that will be executed later. You cannot use variables that only become available later (such as doc) to define those rules now.
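To make the "rules now, document later" idea concrete, here is a minimal sketch. It is not the Framework code: rule, runRules, and fakeDoc are names made up for illustration, and the real implementation differs. It only shows why a rule can describe how to read the document, but cannot read it at definition time.

```javascript
// Hypothetical stand-ins, NOT the real Framework API:
// a rule is a recipe that is stored now and run later.
function rule(extract) {
  return { evaluate: (doc) => extract(doc) };
}

// Defining the rules: no document exists yet, so nothing at this
// level may touch `doc` directly. Only the stored functions may.
const rules = {
  title: rule((doc) => doc.title.trim()),
  source: rule(() => "fixed string"), // constants are fine at any time
};

// Later, the document is handed in (the way doWeb receives it)
// and each stored rule is applied to it.
function runRules(ruleSet, doc) {
  const item = {};
  for (const [field, r] of Object.entries(ruleSet)) {
    item[field] = r.evaluate(doc);
  }
  return item;
}

const fakeDoc = { title: "  Example page  " }; // placeholder for a real DOM document
const item = runRules(rules, fakeDoc);
console.log(item);
```

Writing doc.evaluate(...) directly as a value in the rules object would run immediately, before any document has been passed in, which is the failure described above.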

EDIT: If you care to understand the technical details behind Framework, you can take a look at the code at http://e6h.org/~egh/hg/zotero-transfw/raw-file/tip/framework.js I'm not sure if it's up to date with the framework code used in Zotero currently, but the general idea is the same. And if you feel comfortable reading that code, it would probably be easier for you to write a translator without using framework at all, as that gives you much more flexibility.

salarmehr · August 12, 2012

@aurimas.
date: FW.Xpath('/html/body//table[2]/tbody/tr[1]/td/text()').text(),
This returns an empty string for me. Do you get a non-empty result?

aurimas · August 12, 2012

I'm not sure what document you are applying that xpath to, but I think it's probably similar to your example in the other post. I was actually testing with a slightly different set up, but here's why you get an empty string:

Assuming you are targeting a similar td element

<td>
<span>A</span>
B
</td>

This contains two text nodes. The first is the whitespace-only (effectively empty) text node between <td> and <span>. The second contains "B".

FW.Xpath.text() only selects the first node matched by the xpath (technical note: I would have expected this behavior to match Zotero.Utilities.xpathText, which concatenates the nodes. I don't use Framework that much). In your case, you want to select the second node, so
date: FW.Xpath('/html/body//table[2]/tbody/tr[1]/td/text()[2]').text()

Keep in mind that there is a text node (empty or not) before and after each tag
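Here is a rough sketch of the counting, using a plain array as a toy model of the td above rather than a real DOM. The variable names are made up, and the "joined" line is only a simplified picture of concatenating behavior, not the actual ZU.xpathText code.

```javascript
// Toy model of:  <td>\n<span>A</span>\nB\n</td>
// Not a real DOM, just enough structure to count text nodes.
const td = [
  { type: "text", value: "\n" },                 // whitespace between <td> and <span>
  { type: "element", name: "span", text: "A" },
  { type: "text", value: "\nB\n" },              // the text we actually want
];

// Rough equivalent of the XPath step text(): direct text-node children only.
const textNodes = td.filter((n) => n.type === "text").map((n) => n.value);

// First-match behavior, as described for FW.Xpath(...).text() in this thread.
const firstOnly = textNodes[0].trim();

// XPath positions are 1-based, so text()[2] corresponds to index 1 here.
const second = textNodes[1].trim();

// Simplified picture of concatenating all matched nodes instead.
const joined = textNodes.join("").trim();

console.log(JSON.stringify({ firstOnly, second, joined }));
```

In this model the first text node trims down to an empty string, which matches the empty result reported above.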

ajlyon · August 12, 2012

The .text() in Framework predates the ZU.xpathText() implementation, and it hasn't been revised to match it (and I'm not sure if it should).

Also note that Framework is not bundled with Zotero per se; a minified version of it is embedded in each translator. The code is hidden by Scaffold. The version is therefore generally whatever is in the version of Scaffold used to develop the translator, which is usually up to date.

aurimas · August 12, 2012

The .text() in Framework predates the ZU.xpathText() implementation, and it hasn't been revised to match it (and I'm not sure if it should).

I agree that the behavior probably makes most sense in the Framework context, though it's not immediately obvious. Perhaps this should be noted in the documentation.

salarmehr · August 14, 2012

I am working on this page:
http://sid.ir/en/ViewPaper.asp?ID=247419&varStr=1;TABIBIAN%20MANOUCHEHR,SHOLEH%20MAHSA;ARMANSHAHR;SPRING-SUMMER%202010;3;4;1;16

I want to select the "SPRING-SUMMER 2010; 3(4):1-16. " text using the translator framework. Using the FirePath plugin, you can see that the pure XPath expression "/html/body/div/div[3]/table[2]/tbody/tr/td/text()" yields the desired result, but this XPath expression does not work in Framework. I have tested the following without success.

articleTitle=doc.evaluate("/html/body/div/div[3]/table[2]/tbody/tr/td/font",doc,null, XPathResult.STRING_TYPE ,null).stringValue;

FW.Scraper({
...
date : FW.Xpath('/html/body//table[2]/tbody/tr[1]/td').text().remove(/\n/g).remove(/.*?\)/).remove(/;.*/),
date : FW.Xpath('/html/body//table[2]/tbody/tr[1]/td').text(),
date : FW.Xpath('/html/body//table[2]/tbody/tr[1]/td').text().remove(RegExp(articleTitle)).text(),
date : FW.Xpath('/html/body//table[2]/tbody/tr[1]/td').text().remove(RegExp(articleTitle)),
date: FW.Xpath("/html/body//table[2]/tbody/tr[1]/td/text()[2]").text(),
date: FW.Xpath("/html/body//table[2]/tbody/tr[1]/td/text()").text(),
..});

and a lot more.

Can anybody write code that selects just "SPRING-SUMMER 2010; 3(4):1-16. " on that page?

dstillman · August 14, 2012

Please post technical questions to zotero-dev.

aurimas · August 14, 2012

Sorry, Dan. I'll answer this quickly here. For future information on this, please post at https://groups.google.com/forum/?fromgroups#!forum/zotero-dev as suggested by fbennett in the second post.

date: FW.Xpath("/html/body//table[2]/tbody/tr[1]/td/text()[2]").text()

You were close with this one, but if you read my post above, you will notice

Keep in mind that there is a text node (empty or not) before and after each tag

The node in question is

<td>[empty text node]
  <input ... />
  [empty text node]
  <b>...</b>
  [empty text node]
  <font color="#118811">ARMANSHAHR</font>
  SPRING-SUMMER 2010; 3(4):1-16.
</td>

I added the [empty text node] parts to illustrate a point. As you can see, the text node you are trying to select is 4th on that list.

This works:
date: FW.Xpath("/html/body//table[2]/tbody/tr[1]/td/text()[4]").text()

Sometimes it is much more convenient to count from the end though. The text node is the last node in that block, so you can use "last()" instead of the "4". For the 3rd node, you could use "last()-1", etc.
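As a small illustration of how the positional predicates line up, here is a sketch that resolves "4", "last()" and "last()-1" against an array standing in for the four text nodes of that td. pickByPosition is a made-up helper and the array is hand-written to mirror the markup above; a real XPath engine does this for you.

```javascript
// Toy model of the text-node children of the <td>, in document order.
// Hand-written to mirror the markup above; not fetched from the page.
const tdTextNodes = [
  "\n  ",                                   // after <td>
  "\n  ",                                   // after <input>
  "\n  ",                                   // after <b>
  "\n  SPRING-SUMMER 2010; 3(4):1-16.\n",   // after <font>
];

// Hypothetical helper: resolve an XPath-style positional predicate
// against an array. Supports "N", "last()" and "last()-K" only.
function pickByPosition(nodes, predicate) {
  const m = /^last\(\)(?:-(\d+))?$/.exec(predicate);
  const position = m ? nodes.length - Number(m[1] || 0) : Number(predicate);
  return nodes[position - 1]; // XPath positions are 1-based
}

const byIndex = pickByPosition(tdTextNodes, "4").trim();
const byLast = pickByPosition(tdTextNodes, "last()").trim();
const beforeLast = pickByPosition(tdTextNodes, "last()-1").trim();

console.log(JSON.stringify({ byIndex, byLast, beforeLast }));
```

Counting from the end keeps working if the site adds another tag near the start of the cell, whereas a fixed index like 4 would then point at the wrong node.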

But I would encourage you to use XPaths that depend less on the document structure. Instead of traversing down the entire HTML tree, I would suggest
FW.Xpath("//table[@id='Table2']/tbody/tr[1]/td/text()[last()]").text()

Also, pay attention to the number of nodes your xpaths select. In FirePath, you can see the number of nodes selected on the bottom bar. The xpath you posted that "works in FirePath" selects 12 nodes on the entire page.