Translator for meetinglibrary.asco.org

grahampositive · June 5, 2013

I would really love a site translator for the American Society of Clinical Oncology (ASCO) abstracts repository (meetinglibrary.asco.org). I am willing to try my hand at writing one (I successfully installed Scaffold), but I have no XML/Java coding experience, so I would need a lot of hand-holding. The issue I can foresee is that the abstracts seem to be formatted as plain text (HTML?) and don't seem to have any metadata associated with them. Furthermore, JCO (the journal that publishes ASCO abstracts) doesn't appear to assign a unique DOI to each abstract. I'm not sure what the implications of this are for Scaffold. Bottom line: I'd love a direct import feature that will recognize content like this: http://meetinglibrary.asco.org/content/97508-114
as a journal article and correctly assign the citation information for a direct Zotero import. Can anyone with more Scaffold/translator experience help me out, or would someone be willing to help create one? If I can learn to do this one, there are other sites like this one (repositories for clinical meeting abstracts) that I'd be willing to help develop translators for.

Cheers!

adamsmith · June 5, 2013

This wouldn't be a bad site to learn translator writing on. Essentially you'd scrape all the information from the page using xpaths. You'd only need limited JavaScript (no Java or XML at all).

A good place to start would be here:
http://www.zotero.org/support/dev/how_to_write_a_zotero_translator_plusplus
This isn't perfect, but it should give you a good idea of how to work with xpaths. I'm happy to answer specific questions and do some hand-holding along the way. If this gets more technical, Dan may eventually ask us to move this over to the development listserv. If you want to quote larger snippets of code for questions, use a public gist at gist.github.com or pastebin.com

Graham_MTM · June 5, 2013

Thanks so much for this. The guide is just what I was looking for. I'll spend some time on this today and post again when I have success or hit a wall.

adamsmith · June 5, 2013

for examples of similar translators, consider
https://github.com/zotero/translators/blob/master/SlideShare.js
and/or
https://github.com/zotero/translators/blob/master/Columbia University Press.js

Graham_MTM · June 5, 2013

In Firebug, if I select an object like the title, I have the option to copy some HTML strings. There is an option to copy the "unique selector", but this doesn't look like an xpath. Is there a way to select objects and copy the xpath to that object directly?

adamsmith · June 5, 2013

I'd use this bookmarklet if you want to copy xpaths:
http://dl.dropboxusercontent.com/u/848981/it/xp/xp.html
IIRC Firebug does give you xpaths, but they're not very useful.
You'll have to manually adjust all automatically generated xpaths to be reliable and usable across sites.

Graham_MTM · June 5, 2013

What a great tool! Thanks

Graham_MTM · June 7, 2013

OK, so I am working my way through the HWZT guide with some success. I've been trying out my translator on one of the ASCO abstract pages, and I was pretty satisfied when I was able to scrape the title successfully. Then I tried to iterate over the authors to scrape them individually, and I'm getting a weird error. I defined the variable "Authors" and gave it an xpath, but when I ask it to iterate using "Authors.iterateNext()" I get the error "string => TypeError: Authors.iterateNext is not a function".

The code looks like this:
var AuthorXpath = '//div[contains(@class, "field") and contains(@class, "field-name-field-authors") and contains(@class, "field-type-text-long") and contains(@class, "field-label-above")]/div[@class="field-items"]/div[contains(@class, "field-item") and contains(@class, "even")]/p'
var Authors = doc.evaluate(AuthorXpath, doc, nsResolver, XPathResult.ANY_TYPE, null).iterateNext().textContent;

var items = new Array();
var headers;
while (headers = Authors.iterateNext()) {
items.push(headers.textContent);
}
Zotero.debug(items);
}

Any thoughts? As far as I can tell the syntax is the same as what I used for the title (i.e. [defined variable name].iterateNext()), so why am I getting this error?

Thanks so much for your help so far. The guide has been great.

adamsmith · June 7, 2013

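You call iterateNext twice: once when you define var Authors, and then again in headers = Authors.iterateNext(). Because the first call is followed by .textContent, Authors ends up holding a string rather than the XPath result, and a string has no iterateNext method.

You'll likely want to delete .iterateNext().textContent (but keep the final semicolon)
from the line starting with var Authors =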
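
To see why the error appears, here is a minimal sketch that runs outside Zotero. makeFakeResult is a hypothetical stand-in for the iterator that doc.evaluate returns (it is not a Zotero or browser API), and the author names are made up; only the iterateNext()/textContent pattern mirrors the code above.

```javascript
// Hypothetical stand-in for an XPathResult iterator (assumption, illustration only).
function makeFakeResult(texts) {
  let i = 0;
  return {
    iterateNext() {
      // Hand back one fake node per call, then null when exhausted.
      return i < texts.length ? { textContent: texts[i++] } : null;
    }
  };
}

// Buggy pattern: the trailing .iterateNext().textContent turns the value into a string.
const buggy = makeFakeResult(["First Author", "Second Author"]).iterateNext().textContent;
const buggyType = typeof buggy;                        // a string, not an iterator
const buggyIterateNextType = typeof buggy.iterateNext; // strings have no such method

// Fixed pattern: keep the iterator itself and call iterateNext() only in the loop.
const Authors = makeFakeResult(["First Author", "Second Author"]);
const items = [];
let node;
while ((node = Authors.iterateNext())) {
  items.push(node.textContent);
}
```

In the buggy version the loop never gets a chance to run, because Authors.iterateNext is undefined on a string; in the fixed version items collects the text of every matched node.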

Graham_MTM · June 7, 2013

whoops- nevermind. I failed to remove the .iterateNext() from the end of the Authors variable. I fixed that and now it seems OK. moving on...

adamsmith · June 7, 2013

FWIW - there is a newer, simpler syntax for evaluating xpaths:

ZU.xpath and ZU.xpathText

they don't work exactly the same way as doc.evaluate does, but save you a lot of messy code, see the translators above for examples.
The main difference is that ZU.xpath returns an array of nodes rather than an iterator, so you can't step through it with iterateNext() in a while loop; you have to use a for loop.
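
A rough sketch of what that loop looks like. ZU is only available inside Zotero, so here nodes is a plain array standing in for what ZU.xpath(doc, AuthorXpath) would hand back; the objects and names in it are made up for illustration.

```javascript
// Stand-in for the array returned by ZU.xpath(doc, AuthorXpath) inside Zotero
// (assumption: plain objects with a textContent property, made-up names).
const nodes = [{ textContent: "First Author" }, { textContent: "Second Author" }];

// An array has no iterateNext(), so walk it by index with a for loop.
const items = [];
for (let i = 0; i < nodes.length; i++) {
  items.push(nodes[i].textContent);
}
```

Inside a translator the only change would be to assign the result of ZU.xpath to nodes instead of the hand-built array.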

Rintze · June 7, 2013

(the HWZT guide is rather outdated)

Graham_MTM · June 11, 2013

After a few days of working on this in my free time, I feel I've hit a wall. I can use doc.evaluate to return single items correctly. I have tested multiple XPaths for all of the citation info I need, so that if I comment out existing variables I can use Zotero.debug to test each XPath and see that it works. However, I have two issues:
1) I am not able to use ZU.xpath. All of my attempts to use it have returned errors
2) I can't seem to figure out how to link together multiple pieces of bibliographic info into a working translator. I've looked at the examples above and they seem awfully different from the example ones in HWZT. Since these are for books and presentations, are there any relevant translators that are for journal articles? If I could find a translator that works and just adapt it by pointing it at the right URL and changing the XPaths for each piece of bibliographic information... well, that might be cheating, but it seems more in the realm of something I might be able to accomplish.

adamsmith · June 11, 2013

post what you have to gist.github.com as a public gist

Graham_MTM · June 11, 2013

https://gist.github.com/tEXkYzqK/5759947

aurimas · June 11, 2013

Use this http://www.zotero.org/support/dev/translators/framework

It should be sufficient for what you need and will make things a lot simpler

Edit: also, if you're not already, you should use Scaffold http://www.zotero.org/support/dev/translators/scaffold

adamsmith · June 11, 2013

look at https://github.com/zotero/translators/blob/master/SciELO.js#L20 as an example for a framework journal translator

Graham_MTM · June 11, 2013

@aurimas
This is a huge step forward. I now have a working translator that just needs a little tweaking. Thanks everyone for your support. I will post again when I have it up and running completely.
@adamsmith
would the Zotero project be interested in using this translator? A search of the forums seemed to indicate that only 1 other person was interested in using a translator for ASCO abstracts, and that was many years ago- still, if anyone could benefit from its use...
How would I go about submitting this for general use?

adamsmith · June 11, 2013

ideally, issue a pull request via github as described under A here
https://github.com/citation-style-language/styles/blob/master/CONTRIBUTING.md

substituting
https://github.com/zotero/translators
for https://github.com/citation-style-language/

if you can't make it work, putting it up as a gist will do as well, but the pull request is much preferable

Graham_MTM · June 13, 2013

hey guys, things are going really well and I'm just trying to tweak the way Zotero imports the author list. I was hoping I could have some help putting together the appropriate function.

Right now I have my scraper set up like this:

creators : FW.Xpath ('//div[@class = "author-list"]/p').text().remove(/"^;", -g/).split(/\,/).replace(/\s/," ").cleanAuthor("author"),

The .remove() was supposed to get rid of everything after the first instance of ";" and then run the cleanAuthor function. Instead, I don't see that it actually removes anything, although it might be removing a single instance of ";" and I just can't find it. How do I write a regex that says "everything after the first instance of ';'"?

Thanks again for all your help

Graham_MTM · June 13, 2013

I'll also add that I've tried matching anything before the first ";" instead with .match( /\;$/) to no avail.

Graham_MTM · June 13, 2013

.remove(/\;.*/gm )
seems to work

aurimas · June 13, 2013

yes, that would be the correct regex to use. Are you dealing with multi-line strings (the answer is probably yes, even if the string appears as a single line on a webpage because of the way HTML is written)? If so, do you want to remove everything after ; including any additional lines? The "gm" flags are a bit confusing in regards to what you want to accomplish.

I think what you want is probably .remove(/;[\s\S]*/)
This will remove all characters (including newlines) after (and including) the ;
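
A quick way to compare the two patterns in plain JavaScript. This sketch uses String.prototype.replace with an empty string as a stand-in for the framework's .remove(), and the sample author string is made up.

```javascript
// Made-up multi-line text, as it might come out of the author paragraph.
const raw = "A. Smith, B. Jones; Univ X\nDept Y; City Z";

// "." does not match newlines, so with the g flag each ";..." run is cut
// only up to the end of its own line; text on a later line that precedes
// a ";" survives. The m flag only changes ^ and $, so it has no effect here.
const perLine = raw.replace(/\;.*/gm, "");

// [\s\S] matches any character, newlines included: everything from the
// first ";" to the end of the string is removed.
const toEnd = raw.replace(/;[\s\S]*/, "");

// Alternative: keep what comes before the first ";" instead of removing the rest.
const before = raw.match(/^[^;]*/)[0];
```

With this input, perLine still contains "Dept Y" on its second line, while toEnd and before both hold only the author names.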

aurimas · June 13, 2013

Also .replace(/\s/," ") will not really do anything (not what you expect anyway).

You probably want .replace(/[\s\r\n]+/g, " ")
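
The difference is easy to check in plain JavaScript. The sample string below is made up, and the sketch assumes the framework's .replace() behaves like String.prototype.replace.

```javascript
// Made-up author text with a line break and doubled spaces.
const raw = "A. Smith,\n  B.  Jones";

// Without the g flag only the first whitespace character is replaced,
// and here that is already a space, so the string comes back unchanged.
const firstOnly = raw.replace(/\s/, " ");

// With "+" and the g flag, every run of whitespace collapses to one space.
// (\s already covers \r and \n, so /\s+/g behaves the same.)
const collapsed = raw.replace(/[\s\r\n]+/g, " ");

// Splitting on commas then yields one trimmed name per author.
const names = collapsed.split(",").map(s => s.trim());
```

Collapsing the whitespace before splitting means each name reaches cleanAuthor without stray line breaks in it.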

Edit: great place to learn regex http://www.regular-expressions.info/

Graham_MTM · June 14, 2013

@aurimas
Thanks. I originally put the gm flags in there because I was sort of trying a shotgun approach. I thought perhaps the reason I was unable to remove any characters at first was because the match was stopping at the first ";", so I put in the global flag. Then I thought maybe it was a multi-line issue, so I put in the m flag. When I finally discovered the problem was a combination of syntax issues with /\;*/ (I needed to backslash out the ";" and add "*" to continue matching), I decided "if it ain't broke, don't fix it" and stopped messing with the code. In the interest of increasing my understanding, I'll play around with some of the expressions you recommended. In the end, I was able to scrape all of the bibliographic info I needed for my purposes.

When I have a spare minute, I'll clean it up a bit and add a few more fields and put it on Github. Thanks again for your help.

adamsmith · July 29, 2013

what's the status of this? Anything we can help you with?

Graham_MTM · July 29, 2013

The translator works OK: I used it to complete the project I was working on at the time, and I haven't had an opportunity to return to it recently. I didn't post it to Github because I noticed a few things wrong with it. When searching abstracts over several years on the ASCO meeting library website, certain years fail to return formatted results. I suspect that this is because the xpath to the information is different in those years (someone re-formatted the page when they entered the data or something). I know that a small tweak in the xpath I identified for each bibliographic entry would patch this problem, but it might not work in every case.
Instead, I thought it would be good to use a multi scraper and set it up to read the section headers directly to input the info. I will try again to work on this when I have some spare time. In the meantime, I could post the translator I have just as an FYI to anyone looking for this.

I'd also like to try developing one for clinicaltrials.gov, which is a website that collects data on clinical trials in the US. These entries wouldn't be journals, books, or anything that seems to fit neatly into an existing category, so I don't really know where to start when it comes to creating a translator for something like that.

Anyway, thanks for your help. I really learned a lot and in the future if I need a site translator, I can try to jump right in.

adamsmith · July 29, 2013

yeah, why not put it up.
As for clinical trials: you're right, there's no great category. I think I'd suggest going with report and maybe switching it to dataset (which in structure is quite similar) once we introduce that item type, likely in Zotero 4.2

Graham_MTM · July 29, 2013

added pull request here

https://github.com/zotero/translators/pull/601

adamsmith · September 13, 2013

the translator is now up. Thanks to Graham for the initial work on this.

asinclair · January 23, 2014

@Graham_MTM

I came across this thread while searching for a way to get ClinicalTrials.gov data into Zotero, other than by creating a web page entry and then editing manually. You'd mentioned wanting to work on a ClinicalTrials.gov filter - is that still on your wishlist? I'd aid and abet, although I'd be starting from the bottom of the learning curve.