Translator for meetinglibrary.asco.org

I would really love a site translator for the American Society of Clinical Oncology (ASCO) abstracts repository (meetinglibrary.asco.org). I am willing to try my hand at writing one (I successfully installed Scaffold), but I have no XML/Java coding experience, so I would need a lot of hand-holding. The issue I can foresee is that the abstracts seem to be formatted as plain text (HTML?) and don't seem to have any metadata associated with them. Furthermore, JCO (the journal that publishes ASCO abstracts) doesn't appear to assign a unique DOI to each abstract. I'm not sure what the implications of this are for Scaffold. Bottom line: I'd love a direct import feature that will recognize content like this: http://meetinglibrary.asco.org/content/97508-114
as a journal article and correctly assign the citation information for a direct Zotero import. Can anyone with more Scaffold/translator experience help me out, or would someone be willing to help create one? If I can learn to do this one, there are other sites like this one (repositories for clinical meeting abstracts) that I'd be willing to help develop translators for.

Cheers!
  • This wouldn't be a bad site to learn translator writing on. Essentially you'd scrape all the information from the page using XPaths. You'd only need limited JavaScript (no Java or XML at all).

    A good place to start would be here
    http://www.zotero.org/support/dev/how_to_write_a_zotero_translator_plusplus
    this isn't perfect, but should give you a good idea about how to work with xpaths. I'm happy to answer specific questions and do some hand-holding on the way. If this gets more technical, Dan may eventually ask us to move this over to the development listserv. If you want to quote larger snippets of code for questions, use a public gist at gist.github.com or pastebin.com
  • Thanks so much for this. The guide is just what I was looking for. I'll spend some time on this today and post again when I have success or hit a wall.
  • In Firebug, if I select an object like the title, I have the option to copy some HTML strings. There is an option to copy the "unique selector", but this doesn't look like an XPath. Is there a way to select objects and copy the XPath to that object directly?
  • I'd use this bookmarklet if you want to copy xpaths:
    http://dl.dropboxusercontent.com/u/848981/it/xp/xp.html
    IIRC Firebug does give you xpaths, but they're not very useful.
    You'll have to manually adjust all automatically generated xpaths to be reliable and usable across sites.
  • What a great tool! Thanks
  • OK, so I am working my way through the HWZT guide with some success. I've been trying out my translator on one of the ASCO abstract pages, and I was pretty satisfied when I was able to scrape the title successfully. Then I tried to iterate to scrape the authors individually, and I'm getting a weird error. I defined the variable "Authors" and gave it an XPath, but when I ask it to iterate using "Authors.iterateNext()" I get the error "string => TypeError: Authors.iterateNext is not a function".

    the code looks like this
    var AuthorXpath = '//div[contains(@class, "field") and contains(@class, "field-name-field-authors") and contains(@class, "field-type-text-long") and contains(@class, "field-label-above")]/div[@class="field-items"]/div[contains(@class, "field-item") and contains(@class, "even")]/p'
    var Authors = doc.evaluate(AuthorXpath, doc, nsResolver, XPathResult.ANY_TYPE, null).iterateNext().textContent;

    var items = new Array();
    var headers;
    while (headers = Authors.iterateNext()) {
    items.push(headers.textContent);
    }
    Zotero.debug(items);
    }

    Any thoughts? As far as I can tell, the syntax is the same as what I used for the title (i.e. [defined variable name].iterateNext()), so why am I getting this error?

    Thanks so much for your help so far. The guide has been great.
  • you do iterateNext twice - once when you define var Authors and then again in headers = Authors.iterateNext

    you'll likely want to delete .iterateNext().textContent;
    from the line starting with var Authors =
  • whoops- nevermind. I failed to remove the .iterateNext() from the end of the Authors variable. I fixed that and now it seems OK. moving on...
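
For reference, the fix described above can be sketched roughly as follows: keep the XPath result itself in the variable and call iterateNext() only inside the loop. This is a sketch under assumptions, not the poster's final code. The helper name collectText is hypothetical, the namespace resolver is passed as null on the assumption that the page is plain HTML, and the literal 0 stands in for XPathResult.ANY_TYPE so that the sketch does not depend on a browser global.

```javascript
// Hypothetical helper: returns the text of every node matched by an XPath.
// `doc` is assumed to be a DOM document, as passed to a translator's doWeb().
function collectText(doc, xpath) {
  var ANY_TYPE = 0; // numeric value of XPathResult.ANY_TYPE
  // Keep the XPathResult itself; do NOT call .iterateNext().textContent here,
  // or `result` becomes a plain string with no iterateNext method.
  var result = doc.evaluate(xpath, doc, null, ANY_TYPE, null);

  var items = [];
  var node;
  // iterateNext() returns null once the matches are exhausted, ending the loop.
  while ((node = result.iterateNext())) {
    items.push(node.textContent);
  }
  return items;
}
```

In the translator this would be called as collectText(doc, AuthorXpath), with the returned array passed to Zotero.debug().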
  • FWIW - there is a newer, simpler syntax for evaluating xpaths:

    ZU.xpath and ZU.xpathText

    they don't work exactly the same way as doc.evaluate does, but save you a lot of messy code, see the translators above for examples.
    The main difference is that you can't use ZU.xpath in a while loop, you have to use a for loop
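
As a rough illustration of that difference: ZU.xpath returns an ordinary array of nodes, so it is walked with a for loop, while ZU.xpathText returns the matched text as a single string. The function names and both XPaths below are placeholders, and ZU is assumed to be the global that Zotero makes available inside a translator.

```javascript
// Sketch only: ZU (Zotero.Utilities) is assumed to be available as a global,
// as it is inside a Zotero translator. The XPaths are placeholders.
function getAuthorNames(doc) {
  // ZU.xpath returns an array of matching nodes (empty if nothing matches).
  var nodes = ZU.xpath(doc, '//div[@class="author-list"]/p');
  var names = [];
  for (var i = 0; i < nodes.length; i++) {
    names.push(nodes[i].textContent);
  }
  return names;
}

function getTitle(doc) {
  // ZU.xpathText returns the text content of the match(es) as one string,
  // or null when the XPath matches nothing.
  return ZU.xpathText(doc, '//h1');
}
```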
  • (the HWZT guide is rather outdated)
  • After a few days of working on this in my free time, I feel I've hit a wall. I can use doc.evaluate to return single items correctly. I have tested multiple XPaths for all of the citation info I need, so that if I comment out existing variables I can use Zotero.debug to test each XPath and see that it works. However, I have 2 issues.
    1) I am not able to use ZU.xpath. All of my attempts to use it have returned errors.
    2) I can't seem to figure out how to link together multiple pieces of bibliographic info into a working translator. I've looked at the examples above and they seem awfully different from the example ones in HWZT. Since these are for books and presentations, are there any relevant translators that are for journal articles? If I could find a translator that works and just adapt it by pointing it at the right URL and changing the XPaths for each piece of bibliographic information... well, that might be cheating, but it seems more in the realm of something I might be able to accomplish.
  • post what you have to gist.github.com as a public gist
  • https://gist.github.com/tEXkYzqK/5759947
  • edited June 11, 2013
    Use this http://www.zotero.org/support/dev/translators/framework

    It should be sufficient for what you need and will make things a lot simpler

    Edit: also, if you're not already, you should use Scaffold http://www.zotero.org/support/dev/translators/scaffold
  • look at https://github.com/zotero/translators/blob/master/SciELO.js#L20 as an example for a framework journal translator
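
To give a feel for what a framework-based journal translator looks like, here is a minimal sketch, not the translator that was eventually submitted. All XPaths are placeholders that would need to be checked against the actual pages, and the function name defineAbstractScraper is hypothetical. FW is taken as a parameter only so that the sketch can be loaded outside Zotero.

```javascript
// Sketch only. In a real translator the framework code is pasted in above
// and FW is a global; it is a parameter here purely for illustration.
function defineAbstractScraper(FW) {
  return FW.Scraper({
    itemType: "journalArticle",
    // Placeholder: treat a page as an abstract if it has an author list.
    detect: FW.Xpath('//div[@class="author-list"]'),
    // Placeholder XPath for the abstract title.
    title: FW.Xpath('//h1').text().trim(),
    // Placeholder XPath; cleanAuthor turns the text into a Zotero creator.
    creators: FW.Xpath('//div[@class="author-list"]/p').text().cleanAuthor("author"),
    // Fixed value: JCO is the journal that publishes the ASCO abstracts.
    publicationTitle: "Journal of Clinical Oncology"
  });
}
```

Inside an actual translator, the FW.Scraper({...}) call would sit at the top level of the file rather than inside a function.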
  • @aurimas
    This is a huge step forward. I now have a working translator that just needs a little tweaking. Thanks everyone for your support. I will post again when I have it up and running completely.
    @adamsmith
    would the Zotero project be interested in using this translator? A search of the forums seemed to indicate that only 1 other person was interested in using a translator for ASCO abstracts, and that was many years ago- still, if anyone could benefit from its use...
    How would I go about submitting this for general use?
  • ideally, issue a pull request via github as described under A here
    https://github.com/citation-style-language/styles/blob/master/CONTRIBUTING.md

    substituting
    https://github.com/zotero/translators
    for https://github.com/citation-style-language/

    if you can't make it work, putting it up as a gist will do as well, but the pull request is much preferable
  • hey guys, things are going really well and I'm just trying to tweak the way Zotero imports the author list. I was hoping I could have some help putting together the appropriate function.

    Right now I have my scraper set up like this:

    creators : FW.Xpath ('//div[@class = "author-list"]/p').text().remove(/"^;", -g/).split(/\,/).replace(/\s/," ").cleanAuthor("author"),

    the .remove() was supposed to get rid of everything after the first instance of ";" and then run the cleanauthor function. instead, I don't see that it actually removes anything, although it might be removing a single instance of ";" and I just can't find it. How do I write a regex that says "everything after the first instance of ";"?

    Thanks again for all your help
  • I'll also add that I've tried matching anything before the first ";" instead with .match( /\;$/) to no avail.
  • .remove(/\;.*/gm )
    seems to work
  • yes, that would be the correct regex to use. Are you dealing with multi-line strings (the answer is probably yes, even if the string appears as a single line on a webpage because of the way HTML is written)? If so, do you want to remove everything after ; including any additional lines? The "gm" flags are a bit confusing in regards to what you want to accomplish.

    I think what you want is probably .remove(/;[\s\S]*/)
    This will remove all characters (including newlines) after (and including) the ;
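
The difference between the two patterns can be seen with plain JavaScript. The sketch below assumes that the framework's .remove(regex) behaves like String.prototype.replace(regex, ""), and the author string is made up for illustration.

```javascript
// Made-up author line spanning two lines, as HTML text often does.
var raw = "A. Smith, B. Jones; Institution One,\nCity; Institution Two";

// "." does not match a newline, so each match stops at the end of its line;
// with "g", every ";..." run is removed but other text on later lines survives.
var perLine = raw.replace(/\;.*/gm, "");

// [\s\S] matches any character including newlines, so everything from the
// first ";" to the end of the string is removed.
var toEnd = raw.replace(/;[\s\S]*/, "");

console.log(JSON.stringify(perLine)); // "A. Smith, B. Jones\nCity"
console.log(JSON.stringify(toEnd));   // "A. Smith, B. Jones"
```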
  • edited June 13, 2013
    Also .replace(/\s/," ") will not really do anything (not what you expect anyway).

    You probably want .replace(/[\s\r\n]+/g, " ")

    Edit: great place to learn regex http://www.regular-expressions.info/
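
A small plain-JavaScript sketch of why the first form does little, and of how the cleaned string could then be split into names. The input string is made up, and the framework's .replace() and .split() are assumed to mirror the standard string methods.

```javascript
// Made-up author string with a line break and repeated spaces.
var text = "A. Smith,\n   B.  Jones";

// Without "g", only the first whitespace character is replaced. Here that is
// already a space, so the string comes back unchanged.
var firstOnly = text.replace(/\s/, " ");

// With "+" and "g", every run of whitespace collapses to a single space.
// (\s already covers \r and \n, so /\s+/g would behave the same.)
var collapsed = text.replace(/[\s\r\n]+/g, " ");

// Split on commas and trim each piece to get one name per entry.
var names = collapsed.split(",").map(function (s) { return s.trim(); });

console.log(firstOnly === text);        // true
console.log(JSON.stringify(collapsed)); // "A. Smith, B. Jones"
console.log(JSON.stringify(names));     // ["A. Smith","B. Jones"]
```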
  • @aurimas
    Thanks. I originally put the gm flags in there because I was sort of trying a shotgun approach. I thought perhaps the reason I was unable to remove any characters at first was because the match was stopping at the first ";", so I put in the global flag. Then I thought maybe it was a multi-line issue, so I put in the m flag. When I finally discovered the problem was a combination of syntax issues with /\;*/ (I needed to backslash out the ";" and add "*" to continue matching), I decided "if it ain't broke, don't fix it" and stopped messing with the code. In the interest of increasing my understanding, I'll play around with some of the expressions you recommended. In the end, I was able to scrape all of the bibliographic info I needed for my purposes.

    When I have a spare minute, I'll clean it up a bit and add a few more fields and put it on Github. Thanks again for your help.
  • what's the status of this? Anything we can help you with?
  • edited July 29, 2013
    The translator works OK: I used it to complete the project I was working on at the time, and I haven't had an opportunity to return to it recently. I didn't post it to Github because I noticed a few things wrong with it. When searching abstracts over several years on the ASCO meeting library website, certain years fail to return formatted results. I suspect that this is because the xpath to the information is different in those years (someone re-formatted the page when they entered the data or something). I know that a small tweak in the xpath I identified for each bibliographic entry would patch this problem, but it might not work in every case.
    Instead, I thought it would be good to use a multi scraper and set it up to read the section headers directly to input the info. I will try again to work on this when I have some spare time. In the meantime, I could post the translator I have just as an FYI to anyone looking for this.

    I'd also like to try developing one for clinicaltrials.gov, which is a website that collects data on clinical trials in the US. These entries wouldn't be journals, books, or anything that seems to fit neatly into an existing category, so I don't really know where to start when it comes to creating a translator for something like that.

    Anyway, thanks for your help. I really learned a lot and in the future if I need a site translator, I can try to jump right in.
  • yeah, why not put it up.
    As for clinical trials - you're right, there's no great category, I think I'd suggest going with report and maybe switching it to dataset (which in structure is quite similar) once we introduce that item type, likely in Zotero 4.2
  • added pull request here

    https://github.com/zotero/translators/pull/601
  • the translator is now up. Thanks to Graham for the initial work on this.
  • @Graham_MTM

    I came across this thread while searching for a way to get ClinicalTrials.gov data into Zotero, other than by creating a web page entry and then editing manually. You'd mentioned wanting to work on a ClinicalTrials.gov filter - is that still on your wishlist? I'd aid and abet, although I'd be starting from the bottom of the learning curve.