New York Times (NYTimes, NYT) captures blank snapshot

alphapapa · December 2, 2011

The NYTimes translator is capturing blank article snapshots. The saved HTML ends up with meta tags but an empty body ("<body></body>"). It captures all the linked images, JS, and CSS files, but the page itself is empty. In contrast, the ScrapBook Firefox extension works fine. Here are two URLs that aren't working:

http://www.nytimes.com/2011/09/05/business/in-internet-age-postal-service-struggles-to-stay-solvent-and-relevant.html?_r=1
http://www.nytimes.com/2011/12/11/books/10-best-books-of-2011.html?src=me&ref=general

Also, other parts of the NYT site don't work, because they don't even show a translator icon, e.g. http://well.blogs.nytimes.com/2011/11/30/how-exercise-benefits-the-brain/?src=me&ref=general

ajlyon · December 2, 2011

There is a known issue where Adblock prevents Zotero from saving NYT snapshots correctly. We've never worked out what precisely is causing this.

NYT blogs are simply not yet supported.

alphapapa · December 2, 2011

I see. It'd be nice if that were documented on the known issues pages. :)

I'm not an extension developer, but I suspect that if Zotero captured the page by copying the HTML the browser already has, like ScrapBook does, instead of downloading a new copy from the server, it would solve the problem. It would also fix the case where the server generates a page once in response to a form submission and won't serve it again when Zotero tries to redownload it. An example would be an order confirmation page from online shopping (not Zotero's mission, perhaps, but I'm sure there are other cases that Zotero users would encounter).

adamsmith · December 2, 2011

I have added this to the known issues here:
http://www.zotero.org/support/known_translator_issues?&#translators_with_minor_issues
There is also an open ticket/issue on this:
https://github.com/zotero/zotero/issues/10

Sending a new request makes sense because frequently you don't want exactly the page displayed - e.g. for newspapers and magazines, which commonly split articles into several pages, we always try to get the single page or print view. Having only half the article as a snapshot is almost as annoying as having an empty page.
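
For illustration only, here is a minimal sketch of what "getting the single-page view" could look like as a URL rewrite before the new request is sent. It is not the actual NYTimes translator code: the function name `single_page_url` is hypothetical, and the `pagewanted=all` query parameter is an assumption about how the site selected its single-page view at the time.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def single_page_url(url: str, param: str = "pagewanted", value: str = "all") -> str:
    """Return `url` with `param=value` set in its query string.

    `param` and `value` are assumptions for illustration; a real
    translator would use whatever the target site actually expects.
    """
    parts = urlsplit(url)
    # Keep every existing query pair except a previous value of `param`.
    query = [
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k != param
    ]
    query.append((param, value))
    return urlunsplit(parts._replace(query=urlencode(query)))


if __name__ == "__main__":
    print(single_page_url(
        "http://www.nytimes.com/2011/12/11/books/10-best-books-of-2011.html"
        "?src=me&ref=general"
    ))
```

The snapshot is then taken from the response to that rewritten URL rather than from the page currently shown in the browser, which is why anything that interferes with the second request can leave the snapshot empty.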

ajlyon · December 2, 2011

Zotero often does save the page as currently displayed, dumping the current DOM. The NYTimes translator doesn't do that because it tries to get the single-page version, but we do try to use the DOM-based approach when possible, since it is faster and somewhat more reliable.
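
The trade-off described here can be sketched as a small decision function. This is a hypothetical illustration, not Zotero's implementation: `choose_snapshot_source` and `single_page_rule` are made-up names, and the "dom"/"refetch" labels are only there to show the two paths.

```python
from typing import Callable, Optional, Tuple


def choose_snapshot_source(
    page_url: str,
    current_dom_html: str,
    single_page_rule: Optional[Callable[[str], str]] = None,
) -> Tuple[str, str]:
    """Decide where a snapshot should come from.

    Returns ("refetch", url) when a site-specific rule points at a
    different single-page/print URL, otherwise ("dom", html) to reuse
    the markup the browser already holds.
    """
    if single_page_rule is not None:
        target = single_page_rule(page_url)
        if target != page_url:
            # A better version of the article exists elsewhere: request it.
            return ("refetch", target)
    # No rule, or the rule left the URL unchanged: dump the current DOM.
    return ("dom", current_dom_html)


if __name__ == "__main__":
    # No rule: the page as displayed is saved.
    print(choose_snapshot_source("http://example.com/a", "<html></html>"))
    # Rule present: a new request is made for the rewritten URL.
    print(choose_snapshot_source(
        "http://example.com/a", "<html></html>",
        lambda u: u + "?pagewanted=all",
    ))
```

The "dom" path needs no second request, so it also works for one-off pages that the server will not serve again; the "refetch" path is what makes a complete multi-page article possible.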