Stuck on a scraper

fbennett · January 8, 2009

I'm setting out to build a scraper for our University OPAC. As luck would have it, the actual site, when I get to it, is a nightmarish maze of nested frames. But I'm having trouble getting even the most basic returns from a single flat page. Here's the code I'm trying in Scaffold:

function detectWeb(doc, url) {
	var v = doc.evaluate('//b',doc,null,XPathResult.ANY_TYPE,null).iterateNext();
	Zotero.debug(v);
	return url;
}

The target page contains several plain vanilla, unadulterated bold tags containing plain text and nothing else. There are no proxies or VPNs or firewalls between us. Many other scrapers seem to use this same syntax. And the url reported by the return is indeed the site I'm targeting. And it matches the URL pattern written into the metadata area in Scaffold (if that makes a difference). But the "v" variable consistently returns undefined.

I must be doing something wrong, but what on earth is it?

Frank Bennett

fbennett · January 8, 2009

Aha. This was the magic:

var everything = Zotero.Utilities.cleanString(doc.evaluate('//div/b', doc, null, XPathResult.ANY_TYPE, null).iterateNext().textContent);
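
One caveat with the one-liner above: iterateNext() returns null when the XPath matches nothing, so reading .textContent directly will throw on a page without a matching node. A minimal sketch of a guarded version follows. The helper name firstText is made up for illustration, and it collapses whitespace with a plain regex instead of Zotero.Utilities.cleanString so that it can be tried outside Zotero:

```javascript
// Hypothetical helper, not part of the Zotero API.
// Returns the whitespace-normalised text of the first node matching
// the XPath, or null if nothing matches.
function firstText(doc, xpath) {
	// 0 is the numeric value of XPathResult.ANY_TYPE
	var node = doc.evaluate(xpath, doc, null, 0, null).iterateNext();
	if (!node) {
		return null; // no match: avoid reading .textContent of null
	}
	// Stand-in for Zotero.Utilities.cleanString: collapse runs of
	// whitespace and trim both ends.
	return node.textContent.replace(/\s+/g, " ").trim();
}
```

Inside a translator this would be called as firstText(doc, '//div/b'), with a check for null before the result is used.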

Helps to read the documentation sometimes. Next problem: frames.

fbennett · January 8, 2009

The target site (http://opac.nul.nagoya-u.ac.jp/) uses frames. In single-reference pages, the source of the page reported in the browser address bar looks like this:

<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta name=GENERATOR" content="iLiswave_V1.0">
<title>資料問い合わせ(japanese)</title>
</head>

<frameset frameborder="no" border="0" cols="135,*">
	<frame name="menu" src="/cgi-bin/exec_cgi/imenu.cgi?CGILANG=japanese&U_CHARSET=utf-8&HTMLFILE=imenu.html&LANGUAGE=0" scrolling="NO">
	<frame name="main" src="/tmp/w24251/bibbr_1.html" scrolling="YES">
</frameset>
<noframes>
</noframes>
</html>

But the address Zotero passes as the url variable in detectWeb, and so the only obvious thing available to scrape, is the address of the first frame ("menu"). That page is just a wrapper with buttons; the content is in "main".

Is there a way to deal with this, or should I just throw in the towel at this point?

fbennett · January 8, 2009

D'oh. I see the selection list above the test area has an item for each frame, "menu" and "main". So ... if this is what detectWeb sees when it's running, how do I tell an installed scraper which of these two frames to use?

Ah, I think I get it. It must use the URL in the frame that has cursor focus. All's well.

dstillman · January 9, 2009

As far as I know, detectWeb() runs on all frames in the page with URLs that match the target regex, and the icon and saving process will correspond to the first frame for which detectWeb() returns a value.
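
If that is how it works, one way to steer Zotero toward the content frame would be to have detectWeb() return a value only for URLs that look like the "main" frame. A rough sketch, assuming that record pages always follow the bibbr_N.html pattern of the one frame URL quoted earlier and that "book" is a suitable item type; neither assumption has been checked against this OPAC:

```javascript
// Hypothetical pattern, guessed from the single frame URL quoted
// earlier (/tmp/w24251/bibbr_1.html); other record pages may differ.
var CONTENT_FRAME = /\/tmp\/w\d+\/bibbr_\d+\.html/;

function detectWeb(doc, url) {
	// The "menu" frame (imenu.cgi) fails this test, so nothing is
	// returned for it.
	if (CONTENT_FRAME.test(url)) {
		return "book"; // assumed item type
	}
	// Falling through returns undefined, i.e. no match for this frame.
}
```

With this, the "menu" frame never produces a value, so only the "main" frame can be the one that yields a result.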