literal search inside HTML files

KeBugCheck · September 7, 2022

Sometimes (many times for me) the Chrome Zotero Connector doesn't save a snapshot of a webpage properly. That's probably on account of the many settings that exist for saving a webpage, as seen here: chrome/content/zotero/xpcom/singlefile.js

If that happens, we can save the page using other (better, imo) page archivers like "SAVE PAGE WE" and then attach it manually to Zotero.

This works, but Zotero doesn't search inside that HTML file properly. I've had experience with websites that are captured properly as HTML but that Zotero refuses to search inside.

If I "attach" those same websites as .txt, then Zotero searches properly.

What I need is the ability to search literally inside HTML. I don't care if you match html reserved keywords.
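As a stopgap outside Zotero, a literal search over saved HTML files can be sketched in a few lines of Python. This is only an illustration of what "search literally" means here, not a Zotero feature; the function name `literal_search`, its parameters, and the UTF-8 assumption are all hypothetical choices made for the example.

```python
from pathlib import Path


def literal_search(root, needle, suffixes=(".html", ".htm")):
    """Yield (path, line_number, line) for every raw line that contains needle.

    The file is treated as plain text, so markup, attributes, and scripts
    match just like visible words do.
    """
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix.lower() not in suffixes:
            continue
        # Assumption: files are UTF-8; undecodable bytes are replaced, not fatal.
        text = path.read_text(encoding="utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if needle in line:
                yield path, number, line.strip()
```

Calling `list(literal_search("snapshots", 'class="note"'))` on a hypothetical `snapshots` folder would list every raw line containing that attribute. Note the flip side: a phrase that is split across tags in the source, such as `Account <b>summary</b>`, will not match the literal string `Account summary`.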

dstillman · September 7, 2022

You should get a snapshot on any webpage that isn't handled by a specific translator. If you're not getting one on a given URL, you should report that. (Also, to be clear, Zotero uses the SingleFile extension for saving, which is widely regarded as one of the best webpage archivers.)

In any case, SingleFile saves HTML files as well, so I'm not sure why that part is relevant — we're not going to debug problems with some other extension. If you can reproduce the problem with a page saved via Zotero, can you give a specific example of exactly what you're doing and what's not working?

KeBugCheck · September 8, 2022

> SingleFile extension for saving, which is widely regarded as one of the best webpage archivers

This is wrong. E.g., https://github.com/gildas-lormeau/SingleFile/issues/1023 or https://github.com/gildas-lormeau/SingleFile/issues/843 and many more.

imo singlefile is not really good among the page archivers. I've had more success with Save Page WE or webscrapbook.

> not getting one on a given URL, you should report that

Different webpages require different archiver settings to be captured faithfully. It's thus obvious that you cannot support every webpage unless you expose those capture settings to the user in a friendly manner.

> can you give a specific example of exactly what you're doing and what's not working

I have to capture a webpage from a bank website. As usual, SingleFile doesn't work. So I use "Save Page WE" and it does the job. But when I attach the saved page as .html, Zotero doesn't search inside it properly. When I attach the saved page as .txt, Zotero searches inside it properly.

This is probably because of some tags that you guys are choosing to ignore while searching. That's ok. But I think there should be an option to search as-is even inside html. I think there'll be users for this.

I have an HTML file that Zotero attaches, indexes, and opens in the browser properly. The file displays a line of text prominently when opened in the browser. Zotero just doesn't find that prominent text unless I change the attachment's extension to .txt.

Definitely a bug.

dstillman · September 12, 2022

> This is wrong.

Don't be obnoxious. If you prefer another tool (or, from the sound of it, another tool's default settings), that's fine, but it's just demonstrably true that SingleFile is very well regarded — it's clear from usage numbers, reviews, recommendations, and years of comments across the web.

And I'm not sure what you think those two tickets show.

The point of the first is just that SingleFile (and Zotero) removes JavaScript by default, for the reason explained in the FAQ. SingleFile can be configured to save scripts, but it makes for a terrible experience on many pages. Our goal in Zotero is to save the rendered content in a predictable, fully-local, searchable manner.

The second sounds like 1) a bug that was fixed and 2) a disagreement over how annoying/misleading to make the user interface.

Anyway, for your actual issue, we'd have to see the HTML file in question. You can post it somewhere and link to it or email it to support@zotero.org with a link to this thread. Zotero indexes HTML in general.

KeBugCheck · September 12, 2022

> Don't be obnoxious

I'm not.

> just demonstrably true

That's literally not what "demonstrably" means. If I can find webpages that SingleFile can't capture faithfully and others can, then that's what "demonstrably" means. "Demonstrably" doesn't have anything to do with reviews, comments, and numbers, which could be inflated and inorganic and imo usually are. The word you're looking for is "subjectively".

> Our goal in Zotero is to save the rendered content in a predictable, fully-local, searchable manner.

Searchable? You're literally not doing that when you choose not to search verbatim/literally inside HTML files. You don't have an HTML viewer as such. You just open the HTML in a browser. The point is you don't have tight integration with the attached HTML. A browser can be excused for not searching inside HTML verbatim, as it renders the file. But not you. Why not allow the user to search inside it verbatim? No reason.

> we'd have to see the HTML file in question

Clearly you've proven to be not interested in actually listening. I gleaned that from your comments to other users. So I'll pass.
What I know is I'm a person who has a 15GB+ Firefox ScrapBook db that I've been using for over 5 years, and when I considered switching to Zotero, it neither captured the webpage faithfully nor searched inside it properly. It's a dud for research purposes imo.

dstillman · September 12, 2022

> If I can find webpages that singlefile can't capture faithfully and others can then that's what demonstrably means.

But that wasn't the question? What I said was that it's widely regarded as one of the best archivers. So the demonstrable thing — for "widely" and "best" — would be usage numbers, reviews, recommendations, and comments. Why would you possibly be arguing about this?

Based on the tickets you linked to, it doesn't seem like you understand the technical issues here or the trade-offs involved in different settings. Finding a webpage that doesn't work properly with scripts removed is trivial, but that doesn't make it the wrong setting for most users. We (and the SingleFile developer) are happy to look at cases where static content isn't properly saved.

> Searchable? You're literally not doing that when you choose to not search verbatim/literally inside html files.

Again, Zotero indexes HTML and makes the rendered content searchable. Save this webpage from any browser, add the HTML file to Zotero, change the search mode to "Everything", and search for a phrase. It will show up. We're not going to index raw HTML markup and make "class" match every single snapshot in someone's library. That would be a bug.
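To make the distinction concrete, here is a minimal sketch of what "index the rendered content, not the markup" can look like. It is not Zotero's actual indexer; it only uses Python's standard `html.parser`, and the names `VisibleText` and `visible_text` are made up for this example.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text nodes while ignoring tags, attributes, scripts, and styles."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only text outside <script>/<style> counts as content to index.
        if not self._skip_depth:
            self.parts.append(data)


def visible_text(html):
    """Return the document's text content with whitespace collapsed."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    # Joining with spaces keeps adjacent blocks apart; it can also split a
    # word that is broken up by inline tags, which a real indexer would handle.
    return " ".join(" ".join(parser.parts).split())


page = (
    "<html><head><style>.class{color:red}</style></head>"
    '<body><p class="lead">Rendered <b>content</b> only</p></body></html>'
)
text = visible_text(page)
print("Rendered content only" in text)  # True: the visible phrase is searchable
print("class" in text)                  # False: markup never reaches the index
```

With an index built this way, a search for "class" matches only pages whose visible text contains the word, rather than every snapshot in the library.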

I have no idea why you came here with this attitude, but it's not welcome here. I explicitly offered to look at the file in question, which is the only way for us to debug the issue you're reporting. If all you want to do is argue and be unpleasant, then go away and stop wasting our time.