indexing web-pages without snapshots

kazzy · October 10, 2020

Hi,

New Zotero user here. Have been using it for a few months, and have found it unbelievably useful.

Am looking for a way to save the textual content of a web-page (so that it gets indexed and is available for full-text search), without actually saving the snapshot (so as to conserve disk space and clicks).

The content I am expecting is raw, unstructured text - like the result of "document.body.innerText" (nothing advanced like a data-scraper output).

I could implement a bookmarklet to get this data and send it to system clipboard, then paste it into a note attached to the saved web-page.

For example, this worked for me:

javascript: function copy (text) { var dummy = document.createElement("textarea"); document.body.appendChild(dummy); dummy.value = text; dummy.select(); document.execCommand("copy"); document.body.removeChild(dummy); } copy(document.body.innerText);
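The bookmarklet above relies on document.execCommand("copy"), which browsers now mark as deprecated. Below is a minimal sketch of a variant that tries the asynchronous Clipboard API (navigator.clipboard.writeText) first and falls back to the textarea approach. The function name copyPageText and its two parameters are assumptions introduced here so the logic can be exercised outside a browser; they are not part of Zotero or of the original bookmarklet. The Clipboard API is only available on pages served over HTTPS and can be refused when the page is not focused, which is why the fallback is kept.

```javascript
// Sketch only: copy a page's visible text to the clipboard.
// `doc` and `clipboard` are passed in (a hypothetical design choice) instead
// of reading the globals directly, so the function can be tested with fakes.
async function copyPageText(doc, clipboard) {
  const text = doc.body.innerText;

  // Preferred path: the async Clipboard API, when the browser exposes it.
  if (clipboard && typeof clipboard.writeText === "function") {
    try {
      await clipboard.writeText(text);
      return "clipboard-api";
    } catch (err) {
      // Permission denied or page not focused: fall through to the legacy path.
    }
  }

  // Legacy path: same technique as the original bookmarklet.
  const dummy = doc.createElement("textarea");
  doc.body.appendChild(dummy);
  dummy.value = text;
  dummy.select();
  const ok = doc.execCommand("copy");
  doc.body.removeChild(dummy);
  return ok ? "execCommand" : "failed";
}

// In a browser: copyPageText(document, navigator.clipboard);
```

To use it as a bookmarklet, the function and the final call would need to be collapsed onto one line after the javascript: prefix, as in the original.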

However, am wondering if this could be automated, and if I missed any plugin that's already doing this.

thanks a lot.

dstillman · October 12, 2020

Nothing I'm aware of, but I'm not sure what you had in mind — there's not that much to automate beyond a click on a bookmarklet, right-click → Add Note, and paste.

Note that the updated snapshot functionality in Zotero 5.0.91 will save the page as rendered, so if you used a readability-style bookmarklet or extension to strip the visible page down to clean text (while keeping the DOM in place for metadata extraction), the snapshot should now reflect that. But it'd still be HTML, not raw text, and the hidden elements might still be saved depending on the technique, so probably not what you're looking for.

kazzy · October 13, 2020

I agree about it being trivial.

But I am in the process of moving all my Chrome bookmarks to Zotero [well over 1000 of them] - to keep everything centrally manageable - so those few clicks are getting tedious.

I have noticed that when I drag-n-drop a bookmark into Zotero, it queries the respective server for meta-data [if it does not get a proper response, a blank webpage item gets created].

This led me to think that it might be possible to fetch raw-text as part of this request - and index it (just like the contents of the snapshot), and maybe a plugin/configuration for that might already be around.
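As a rough illustration of what "fetching raw text" for a bookmarked URL could involve outside Zotero, here is a minimal sketch. Both function names (htmlToText, fetchPageText) are hypothetical, and nothing here is a Zotero API: the text would still have to reach Zotero some other way, for example pasted into a note as described earlier. The regex-based stripping only approximates document.body.innerText; it does not run the page's scripts, so pages that render their content with JavaScript would come back mostly empty, and a fetch from inside a browser page is subject to cross-origin restrictions.

```javascript
// Sketch only: crude HTML-to-text conversion with regular expressions.
// It approximates innerText; it is not a real HTML parser.
function htmlToText(html) {
  return html
    // Drop elements whose content is never visible text.
    .replace(/<(script|style|noscript|template)\b[^>]*>[\s\S]*?<\/\1\s*>/gi, " ")
    // Drop HTML comments.
    .replace(/<!--[\s\S]*?-->/g, " ")
    // Replace every remaining tag with a space so words do not run together.
    .replace(/<[^>]+>/g, " ")
    // Decode a handful of common entities (far from complete).
    .replace(/&nbsp;/g, " ")
    .replace(/&lt;/g, "<")
    .replace(/&gt;/g, ">")
    .replace(/&quot;/g, '"')
    .replace(/&#39;/g, "'")
    .replace(/&amp;/g, "&")
    // Collapse runs of whitespace into single spaces.
    .replace(/\s+/g, " ")
    .trim();
}

// Hypothetical helper: fetch a bookmarked URL and return its plain text.
// Uses the global fetch available in browsers and in Node 18+.
async function fetchPageText(url) {
  const response = await fetch(url);
  if (!response.ok) {
    throw new Error(`Request failed: ${response.status} ${url}`);
  }
  return htmlToText(await response.text());
}
```

Because the whole document is processed, text from the head (such as the page title) ends up in the output too, which is harmless for search purposes.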

The reason I was pitching for just raw-text:

That way, if a page contains info I am looking for, it should turn up in the search results, even if the meta-data does not capture that info.

For example, if I have saved the embedded data for a web-page about a book, and the page also contains a table of contents as well as information about other related books, I would like to have these indexed as well.

Why not just save snapshots then? Because if I am not selective about them, I might soon end up with a very large stack of HTML files - running into GBs perhaps.

About loading a page in readability-mode before saving the snapshot - I did try it before the current version, but it did not work well.

Will try that again now - thanks for the suggestion.

to conclude: great work, great software - now my primary tool for information management.