Stuck on a scraper
Frank Bennett
I'm setting out to build a scraper for our University OPAC. As luck would have it, the actual site, when I get to it, is a nightmarish maze of nested frames. But I'm having trouble getting even the most basic returns from a single flat page. Here's the code I'm trying in Scaffold:
function detectWeb(doc, url) {
    var v = doc.evaluate('//b', doc, null, XPathResult.ANY_TYPE, null).iterateNext();
    Zotero.debug(v);
    return url;
}
The target page contains several plain vanilla, unadulterated bold tags containing plain text and nothing else. There are no proxies or VPNs or firewalls between us. Many other scrapers seem to use this same syntax. And the url reported by the return is indeed the site I'm targeting. And it matches the URL pattern written into the metadata area in Scaffold (if that makes a difference). But the "v" variable consistently returns undefined. I must be doing something wrong, but what on earth is it?
Frank Bennett
everything = Zotero.Utilities.cleanString(doc.evaluate('//div/b', doc, null, XPathResult.ANY_TYPE, null).iterateNext().textContent);
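The line above differs from the first attempt in that it reads .textContent from the node that iterateNext() returns, instead of handing the node itself to Zotero.debug. A minimal sketch of the same idea with a guard for the case where nothing matches. The helper name firstText and the stand-in document are made up for illustration, and the whitespace collapsing is only a rough substitute for whatever Zotero.Utilities.cleanString does:

```javascript
// XPathResult is a browser global; 0 is the standard value of XPathResult.ANY_TYPE,
// used here as a fallback so the sketch also runs outside a browser.
const ANY_TYPE = typeof XPathResult !== "undefined" ? XPathResult.ANY_TYPE : 0;

// Hypothetical helper (not part of Zotero): text of the first node matching
// the XPath expression, or null when nothing matches.
function firstText(doc, xpath) {
  const node = doc.evaluate(xpath, doc, null, ANY_TYPE, null).iterateNext();
  if (!node) return null; // iterateNext() returns null when there is no match
  // Collapse runs of whitespace and trim the ends.
  return node.textContent.replace(/\s+/g, " ").trim();
}

// Stand-in for a real document, only so the sketch is runnable on its own.
const fakeDoc = {
  evaluate: (xpath) => ({
    iterateNext: () => (xpath === "//div/b" ? { textContent: "  Some \n title " } : null),
  }),
};

console.log(firstText(fakeDoc, "//div/b")); // "Some title"
console.log(firstText(fakeDoc, "//i"));     // null
```

Inside a translator the doc argument passed to detectWeb would take the place of fakeDoc.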
Helps to read the documentation sometimes. Next problem: frames.
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta name=GENERATOR" content="iLiswave_V1.0">
<title>資料問い合わせ(japanese)</title>
</head>
<frameset frameborder="no" border="0" cols="135,*">
<frame name="menu" src="/cgi-bin/exec_cgi/imenu.cgi?CGILANG=japanese&U_CHARSET=utf-8&HTMLFILE=imenu.html&LANGUAGE=0" scrolling="NO">
<frame name="main" src="/tmp/w24251/bibbr_1.html" scrolling="YES">
</frameset>
<noframes>
</noframes>
</html>
But the address returned by Zotero for the url variable in detectWeb, and so the only obvious thing available to scrape, is the address shown for the first frame, and that page is just a wrapper with buttons; the content is in "main". Is there a way to deal with this, or should I just throw in the towel at this point?
Ah, I think I get it. It must use the URL in the frame that has cursor focus. All's well.
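If that reading is right, the URL pattern has to match the address of the content frame and not the wrapper page. A rough sketch of working out that address from the frame's src attribute. The helper name resolveFrameUrl and the host opac.example.edu are invented for illustration; only the "main" src path comes from the page source quoted in this thread, and the "menu" src is shortened:

```javascript
// Hypothetical helper: find a frame by name and resolve its src
// against the URL of the wrapper page that contains the frameset.
function resolveFrameUrl(frames, name, baseUrl) {
  const frame = frames.find((f) => f.name === name);
  // new URL(relative, base) performs standard relative-URL resolution.
  return frame ? new URL(frame.src, baseUrl).href : null;
}

// Frame names and the "main" src path as they appear in the frameset;
// the "menu" src is abbreviated here.
const frames = [
  { name: "menu", src: "/cgi-bin/exec_cgi/imenu.cgi?CGILANG=japanese" },
  { name: "main", src: "/tmp/w24251/bibbr_1.html" },
];

// The wrapper URL below is an assumption, not the real OPAC address.
const mainUrl = resolveFrameUrl(frames, "main", "https://opac.example.edu/cgi-bin/top.cgi");
console.log(mainUrl); // https://opac.example.edu/tmp/w24251/bibbr_1.html
```

A target pattern written against the resolved address (here something under /tmp/) would then be the one to test in Scaffold.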