Stuck on a scraper
I'm setting out to build a scraper for our University OPAC. As luck would have it, the actual site, when I get to it, is a nightmarish maze of nested frames. But I'm having trouble getting even the most basic returns from a single flat page. Here's the code I'm trying in Scaffold:
function detectWeb(doc, url) {
    var v = doc.evaluate('//b', doc, null, XPathResult.ANY_TYPE, null).iterateNext();
    Zotero.debug(v);
    return url;
}
The target page contains several plain vanilla, unadulterated bold tags containing plain text and nothing else. There are no proxies or VPNs or firewalls between us. Many other scrapers seem to use this same syntax. And the url reported by the return is indeed the site I'm targeting. And it matches the URL pattern written into the metadata area in Scaffold (if that makes a difference). But the "v" variable consistently returns undefined. I must be doing something wrong, but what on earth is it?
Frank Bennett
everything = Zotero.Utilities.cleanString(doc.evaluate('//div/b', doc, null, XPathResult.ANY_TYPE, null).iterateNext().textContent);
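The line above reads the text of the first match only, and it would throw if the XPath matched nothing. As a rough sketch of a more defensive variant, the helper below walks every match and collects its text. The name collectText is hypothetical and not part of Zotero; the sketch assumes only that doc offers the standard DOM evaluate() call already used above.

```javascript
// Same value as XPathResult.ORDERED_NODE_ITERATOR_TYPE in browsers; spelled
// out as a number so the sketch does not rely on a global XPathResult.
var ORDERED_NODE_ITERATOR_TYPE = 5;

// Hypothetical helper (not part of Zotero): returns the trimmed text of every
// node matched by `xpath` in `doc`, or an empty array when nothing matches.
function collectText(doc, xpath) {
  var result = doc.evaluate(xpath, doc, null, ORDERED_NODE_ITERATOR_TYPE, null);
  var texts = [];
  var node = result.iterateNext();
  while (node) { // iterateNext() returns null once the matches are exhausted
    texts.push(node.textContent.trim());
    node = result.iterateNext();
  }
  return texts;
}
```

Inside detectWeb this could be used as Zotero.debug(collectText(doc, '//div/b').join(' | ')), which logs strings rather than a DOM node and shows at a glance whether the XPath matched anything at all.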
Helps to read the documentation sometimes.
Next problem: frames.
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta name=GENERATOR" content="iLiswave_V1.0">
<title>資料問い合わせ(japanese)</title>
</head>
<frameset frameborder="no" border="0" cols="135,*">
<frame name="menu" src="/cgi-bin/exec_cgi/imenu.cgi?CGILANG=japanese&U_CHARSET=utf-8&HTMLFILE=imenu.html&LANGUAGE=0" scrolling="NO">
<frame name="main" src="/tmp/w24251/bibbr_1.html" scrolling="YES">
</frameset>
<noframes>
</noframes>
</html>
But the address returned by Zotero for the url variable in detectWeb, and so the only obvious thing available to scrape, is the address shown for the first frame, and that page is just a wrapper with buttons; the content is in "main".
Is there a way to deal with this, or should I just throw in the towel at this point?
Ah, I think I get it. It must use the URL in the frame that has cursor focus. All's well.
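For anyone who lands on a frameset page where the focused frame is not the one holding the content, one possible workaround is to reach the "main" frame's document from the wrapper. This is only a sketch: findFrameDocument is a hypothetical helper, not a Zotero API. It relies on the standard DOM querySelector() and the frame element's contentDocument property, and nothing in this thread confirms that a translator is actually handed the outer frameset document.

```javascript
// Hypothetical helper (not a Zotero API): given the frameset document, return
// the document loaded in the frame called `frameName`, or null if unavailable.
function findFrameDocument(doc, frameName) {
  var frame = doc.querySelector('frame[name="' + frameName + '"]');
  if (!frame) {
    return null; // no frame with that name on this page
  }
  // contentDocument is null when the frame is cross-origin or not yet loaded.
  return frame.contentDocument || null;
}
```

With the page above, findFrameDocument(doc, 'main') would be expected to return the document for bibbr_1.html, which could then be passed to the same XPath calls used earlier in the thread.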