URL handling in a rewriting proxy environment with multiple proxies

aander07 · August 15, 2020

When using a rewriting proxy server, I would like to have the Zotero application save URLs in native vendor representation, e.g.

https://www.vendor.com/path/to/article

rather than store what is in the browser's location bar, which might take one of several different formats:

EZProxy (Host): https://www-vendor-com.proxy.example.edu/path/to/article
EZProxy (Port): http://proxy.example.edu:2049/path/to/article

Muse Knowledge Proxy (Host): https://sessionid-proxyid-y-https-www-vendor-com.proxy.example.edu/path/to/article
Muse Knowledge Proxy (Path): https://proxy.example.edu/MuseProxyID=proxyid/MuseSessionID=sessionid/MuseProtocol=https/MuseHost=www.vendor.com/MusePath/path/to/article

Juniper: https://proxy.example.edu/,DanaInfo=www.vendor.com,SSL,CT=html+article

As a user accumulates citations over time, there could currently be 5 different URL formats in their citations, most incompatible with one another, and unable to be easily used if the user's access to the original proxy server is no longer available. (This does not count WAM or OpenAthens proxy formats, either.)

A cleaner way to handle this would be to communicate the concept of what is known as a "proxy prefix" to the Zotero client, and save the URLs in the native format (i.e. proxyless, like the output of the Zotero.Proxy.prototype.toProper() method).

There are a few other threads in this forum, where different university proxy servers are being used, so this is a real issue that is being encountered in the wild today.

Also, some resources have special authentication hooks used by the proxy servers that are only activated when specific entry points are used. Loading the deep link URL directly can bypass these authentication hooks in some cases, leaving the user unable to retrieve the content. Using the proxy prefix method ensures that these authentication hooks are executed.

What I propose for discussion and possible implementation:

In the Zotero plugin, extend the Zotero.Proxy data set to include a proxy prefix, e.g.:

EZProxy: https://proxy.example.edu/login?url=
Muse Knowledge Proxy: https://proxy.example.edu/AppName?qurl=

and a boolean flag like dotsToHyphens (perhaps called encodeURI) to indicate whether the destination URL needs to be passed through encodeURI(%p) when constructing the URL to send to the browser to load.

In this scenario, what would be stored in the Zotero URL field in the client for the citation record would be the native URL, e.g.:

URL: https://www.vendor.com/path/to/article?arg1=value1&arg2=value2

And what would be sent to the browser to load when the article was retrieved would be in the prefix/qurl pattern for RFC compliance, e.g.

EZProxy: https://proxy.example.edu/login?qurl=https://www.vendor.com/path/to/article?arg1=value1&arg2=value2
Muse Knowledge Proxy: https://proxy.example.edu/AppName?qurl=https://www.vendor.com/path/to/article?arg1=value1&arg2=value2

NOTE: those URLs have the hex-encoded versions of "?" (% 3F) and "&" (% 26) in them as an example, but the forum software appears to be converting them back to ASCII representation when presenting them as links, instead of preserving the link text as-is.
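To illustrate the encoded form that the forum software is hiding, here is a minimal Python sketch of the prefix-plus-encoding step. The function name build_prefixed_url and its encode parameter are made up for illustration and are not part of Zotero. One caveat worth stating: in JavaScript, encodeURI() leaves "?" and "&" untouched, so producing %3F and %26 would take encodeURIComponent() (the equivalent of quote(..., safe="") below).

```python
from urllib.parse import quote


def build_prefixed_url(prefix: str, native_url: str, encode: bool) -> str:
    """Prepend a proxy prefix to a native vendor URL (hypothetical helper).

    When `encode` is true, the whole native URL is percent-encoded so that
    its own "?" and "&" cannot be confused with the proxy's query delimiters.
    """
    target = quote(native_url, safe="") if encode else native_url
    return prefix + target


native = "https://www.vendor.com/path/to/article?arg1=value1&arg2=value2"

# qurl-style link: the target is fully percent-encoded
qurl_link = build_prefixed_url("https://proxy.example.edu/login?qurl=", native, True)

# url-style link: the target is appended as-is
url_link = build_prefixed_url("https://proxy.example.edu/login?url=", native, False)

print(qurl_link)
print(url_link)
```

The stored citation URL stays equal to native in both cases; only the link handed to the browser differs.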

For users that have access to multiple proxy systems, a "Load via..." action could be added to the context menu in the Zotero client for the citation with a list of known proxy servers to simplify switching between systems (similar to how Add Attachment has a sub-menu today).

Switching to using a proxy prefix would also address a "Reload via Proxy" issue where Zotero.Proxy.scheme patterns that use "%a" (e.g. "https://%a-y-https-%h.proxy.example.edu/%p") cannot be successfully used with the "Reload via Proxy" option in the browser plugin, because "%a" is left empty, yielding "https://-y-https-www-vendor-com.proxy.example.edu/path/to/article", which does not work.

This would also make overall citation management cleaner, as the references in papers using Zotero citations would be using the vendor's native URL format, and a reader looking for more information could more easily retrieve that article URL using their own access methods (native/transparent proxy/rewriting proxy), without having to strip out another institution's rewriting proxy information from the citation prior to using it.

If this sounds reasonable, I can help with the plugin work, if someone else can handle the client UI side, as that is not my forte.

adamsmith · August 15, 2020

When using a rewriting proxy server, I would like to have the Zotero application save URLs in native vendor representation, e.g.

Zotero should already do this and does for me. Could you provide specific examples where it doesn't?

dstillman · August 15, 2020

Yeah, I'm confused by this post. @aander07, you seem aware of a lot of Zotero's proxy functionality without seeming to recognize that the behavior you're requesting is just basic parts of that functionality, which has been in place nearly as long as Zotero has existed.

When a proxy server entry is configured correctly, unproxied URLs will be saved in the app. When you open an unproxied URL from the app, the app doesn't need to know anything about the proxy prefix — that happens in the Connector, based on the configured proxies, as part of the basic proxy redirection functionality.

Reload via Proxy is just a recent convenience feature for when you end up on a domain that you haven't previously accessed via your configured proxy (which generally wouldn't be the case for sites you've previously saved from via the Connector). Once you do, it will be registered as an associated domain and Reload via Proxy will no longer be necessary for that site.

None of this needs to be or should be a plugin.

adamsmith · August 15, 2020

The prefix vs. proxy schema part is different from the current behavior, though, isn't it? And while I don't know anything about this, to the extent that it is an issue, the prefix method may have advantages:

Also, some resources have special authentication hooks used by the proxy servers that are only activated when specific entry points are used. Loading the deep link URL directly can bypass these authentication hooks in some cases, leaving the user unable to retrieve the content. Using the proxy prefix method ensures that these authentication hooks are executed.

Using a prefix would also seem to be more stable than the proxy schema that includes a port.

dstillman · August 15, 2020

I'm not sure what those "special authentication hooks" would be. As far as I know, any properly configured EZproxy (which constitutes the vast majority of proxy usage) will redirect to authentication when using the standard URL scheme. If it doesn't, that sounds like a badly configured proxy that's fundamentally breaking stable URLs, gated as they may be, and people should complain to their IT department. (That's also why port-based EZproxy is a bad mechanism that's fortunately mostly been phased out.) While Zotero doesn't actually store proxied URLs, you should be able to store a proxied URL and retrieve it later — or, say, restore a previously loaded tab — without going to a different, special login URL.

aander07 · August 15, 2020

| | When using a rewriting proxy server, I would like to have the Zotero application
| | save URLs in native vendor representation, e.g.

| Zotero should already do this and does for me. Could you provide specific
| examples where it doesn't?

I am not seeing unproxied URLs saved in the app:

https://123456789-p1-y-https-go-gale-com.proxy.example.edu/ps/i.do?p=AONE&u=12345&v=2.1&it=r&id=GALE|A624182319&inPS=true&linkSource=interlink&sid=AONE

If you can point me in the right direction for how to ensure that these are cleaned up between the browser plugin and the app, I will happily work on a patch for this.

aander07 · August 15, 2020

| Yeah, I'm confused by this post. @aander07, you seem aware of a lot of Zotero's
| proxy functionality without seeming to recognize that the behavior you're
| requesting is just basic parts of that functionality, which has been in place nearly as
| long as Zotero has existed.

| When a proxy server entry is configured correctly, unproxied URLs will be saved in
| the app. When you open an unproxied URL from the app, the app doesn't need to
| know anything about the proxy prefix — that happens in the Connector, based on
| the configured proxies, as part of the basic proxy redirection functionality.

OK, before I started work on the patch to add Muse Knowledge Proxy support, I was seeing this saved in the application:

https://123456789-p1-y-https-go-gale-com.proxy.example.edu/ps/i.do?p=AONE&u=12345&v=2.1&it=r&id=GALE|A624182319&inPS=true&linkSource=interlink&sid=AONE

With the patch, which uses this pattern:

"https://%a-y-https-%h.proxy.example.edu/%p"

I am still seeing the same URL formats saved in the app, rather than what I want to see which would be:

https://go.gale.com/ps/i.do?p=AONE&u=12345&v=2.1&it=r&id=GALE|A624182319&inPS=true&linkSource=interlink&sid=AONE
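As a rough illustration of the mapping I am after (not the code in my patch, and not how the Connector does it), here is a Python sketch that treats the pattern above as a regular expression and rebuilds the native URL. It assumes every hyphen in the %h part stands for a dot, which would be wrong for vendor hostnames that contain real hyphens; the names MUSE_HOST_PATTERN and to_native are made up for this example.

```python
import re
from typing import Optional

# Regex equivalent of "https://%a-y-https-%h.proxy.example.edu/%p":
#   a = session/proxy identifier, h = hyphenated vendor host, p = path and query
MUSE_HOST_PATTERN = re.compile(
    r"^https://(?P<a>[^/.]+?)-y-https-(?P<h>[^/.]+)\.proxy\.example\.edu/(?P<p>.*)$"
)


def to_native(proxied_url: str) -> Optional[str]:
    """Return the vendor-native URL, or None if the URL does not match the pattern."""
    match = MUSE_HOST_PATTERN.match(proxied_url)
    if match is None:
        return None
    # Assumption: hyphens in the rewritten host stand for dots in the real host.
    host = match.group("h").replace("-", ".")
    return f"https://{host}/{match.group('p')}"


proxied = (
    "https://123456789-p1-y-https-go-gale-com.proxy.example.edu/ps/i.do"
    "?p=AONE&u=12345&v=2.1&it=r&id=GALE|A624182319&inPS=true"
    "&linkSource=interlink&sid=AONE"
)
print(to_native(proxied))
```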

aander07 · August 15, 2020

| Reload via Proxy is just a recent convenience feature for when you end up on a
| domain that you haven't previously accessed via your configured proxy (which
| generally wouldn't be the case for sites you've previously saved from via the
| Connector). Once you do, it will be registered as an associated domain and Reload
| via Proxy will no longer be necessary for that site.

Reload via proxy has an issue when "%a" is used in the proxy scheme.

With "https://%a-y-https-%h.proxy.example.edu/%p" as the pattern, Reload via proxy yields this URL:

"https://-y-https-www-vendor-com.proxy.example.edu/path/to/article"

Since in this case %a is intended to cover a session ID and server ID in the schema for the proxied URL, the resulting URL is completely meaningless. Even if it were not, this style of URL does not work to start a new session on Muse Knowledge Proxy, as it is a multi-tenant application, and behaves differently than EZProxy does (see one of my next comments for more information on this).
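The failure is easy to reproduce with a naive placeholder substitution. The Python sketch below is only an illustration of the effect, assuming %h is filled with the hyphenated host and %a defaults to an empty string; fill_scheme is a made-up name, not the Connector's actual code.

```python
def fill_scheme(scheme: str, native_host: str, path: str, a: str = "") -> str:
    """Fill a proxy scheme's placeholders (illustrative only).

    %a -> opaque session/proxy identifier (empty when nothing is known yet)
    %h -> vendor host with dots turned into hyphens
    %p -> path of the native URL
    """
    return (
        scheme.replace("%a", a)
        .replace("%h", native_host.replace(".", "-"))
        .replace("%p", path)
    )


scheme = "https://%a-y-https-%h.proxy.example.edu/%p"

# No session exists yet on a fresh reload, so %a has nothing to hold.
reloaded = fill_scheme(scheme, "www.vendor.com", "path/to/article")
print(reloaded)
```

With a prefix-style entry point there is no %a to fill in, which is why that route avoids the problem.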

| None of this needs to be or should be a plugin.

I am not proposing a new plugin, just a tweak to the current behavior to improve usability for those who use multiple proxy servers (faculty at institution A & student at institution B; student at institution C taking classes at an extension campus of institution D, etc).

aander07 · August 15, 2020

| The prefix vs. proxy schema part is different from the current behavior, though,
| isn't it? And while I don't know anything about this, to the extent that it is an issue,
| the prefix method may have advantages:

| | Also, some resources have special authentication hooks used by the proxy servers
| | that are only activated when specific entry points are used. Loading the deep link
| | URL directly can bypass these authentication hooks in some cases, leaving the
| | user unable to retrieve the content. Using the proxy prefix method ensures that
| | these authentication hooks are executed.

| Using a prefix would also seem to be more stable than the proxy schema that
| includes a port.

That is what I'm trying to sort out -- how the browser plugin communicates the URL to the application so that I can make sure that the Muse Knowledge Proxy case is handled correctly. The patch that I created and pointed to in my other post mimics a lot of what the EZProxy path does, but the raw proxy URLs are still sent to the app instead of the native URLs (see previous example).

Prefixing is considered best practice, rather than using the embedded hostname style links, so I would strongly encourage the use of prefixes, as well as the use of qurl-encoded links rather than the RFC-violating "url" style links.

From my use of Zotero in classes a few years ago, I have references saved in this format:

https://ezproxy.example.edu/login?url=http://search.ebscohost.com/login.aspx?direct=true&AuthType=ip,uid&db=mth&AN=655947&site=ehost-live&scope=site

In order for me to use this saved citation today, I need to strip out the prefix that was saved in the URL field, and either rely on the "reload via proxy" feature to work (which it is not currently for my use case), or manually add my new proxy prefix to every URL record that I want to retrieve.
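For reference, the manual clean-up I am describing amounts to something like the following Python sketch. KNOWN_PREFIXES and strip_proxy_prefix are hypothetical names for this example, and the second prefix entry is just the Muse-style prefix from my first post paired with an assumed "target is percent-encoded" flag.

```python
from urllib.parse import unquote

# Hypothetical (prefix, target_is_percent_encoded) pairs a user may have accumulated.
KNOWN_PREFIXES = [
    ("https://ezproxy.example.edu/login?url=", False),
    ("https://proxy.example.edu/AppName?qurl=", True),
]


def strip_proxy_prefix(saved_url: str) -> str:
    """Return the native URL if saved_url starts with a known prefix, else return it unchanged."""
    for prefix, encoded in KNOWN_PREFIXES:
        if saved_url.startswith(prefix):
            target = saved_url[len(prefix):]
            # qurl-style targets are percent-encoded and need decoding.
            return unquote(target) if encoded else target
    return saved_url


saved = (
    "https://ezproxy.example.edu/login?url=http://search.ebscohost.com/login.aspx"
    "?direct=true&AuthType=ip,uid&db=mth&AN=655947&site=ehost-live&scope=site"
)
native = strip_proxy_prefix(saved)
print(native)
```

Doing this by hand for every saved record is the part I would like to avoid.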

I am proposing that if the stored URL does not contain the proxy prefix, and the prefix is handled separately, then using different rewriting proxy servers becomes simpler, whether across different institutions or different proxy platforms.

aander07 · August 15, 2020

| I'm not sure what those "special authentication hooks" would be.

I am speaking specifically about the interactions between the proxy server and the vendor platform, primarily on ebook platforms, where proprietary authentication API hooks are used to communicate the concept of an anonymized user between the proxy and the vendor.

| EZproxy (which constitutes the vast majority of proxy usage)

My interest in this is to make Zotero a viable option for the 300 institutions using the Muse Knowledge Proxy platform that I support.

| While Zotero doesn't actually store proxied
| URLs, you should be able to store a proxied URL and retrieve it later — or, say,
| restore a previously loaded tab — without going to a different, special login URL.

That is not the behavior that I am currently seeing (see the comment above for the specific URL and the proxy schema that was used). Also, I have citations from several years ago that do have a proxy prefix in the record URL, so this may be a legacy behavior that has since changed (see previous comment for an example).

dstillman · August 15, 2020

I am still seeing the same URL formats saved in the app, rather than what I want to see

That just means that something about the proxy detection in your patch isn't working properly. Again, this works fine for domain-based EZproxy installs when configured correctly, which it generally should be without user intervention. @adomasven, who's most familiar with the current proxy-handling code, will need to review your PR and may be able to tell you more based on the debug output you're seeing, though we don't have access to a Muse proxy. If you can get us a temporary login to one, we might be more help.

I am not proposing a new plugin

(Oh, OK. Calling it the Zotero Connector or the browser extension would be clearer. "Plugin" implies something different.)

just a tweak to the current behavior to improve usability for those who use multiple proxy servers

I don't think this really has much to do with multiple proxy servers. Again, Zotero stores unproxied URLs for supported proxy servers when things are working properly, and for redirection, there's no particular reason for a given host to be associated with more than one proxy. Reload via Proxy already gives you a choice if you have multiple proxies set up.

Prefixing is considered best practice, rather than using the embedded hostname style links

I don't know what "best practice" you're referring to. A best practice would be using stable URLs without embedded session keys.

> While Zotero doesn't actually store proxied URLs, you should be able to store a proxied URL and retrieve it later — or, say, restore a previously loaded tab — without going to a different, special login URL.

That is not the behavior that I am currently seeing

I'm talking about the current, built-in behavior with supported proxies. I can't speak to your patch. When Zotero understands the proxy, it will store unproxied URLs.

But that wasn't really my point in the sentence you quoted. The point was that URLs should be stable, and servers should handle authentication automatically. If you copy a URL from the address bar and try to reload it later, it should work properly, even if it's after an auth redirect. The client shouldn't have to know about some other magic login URL to get back to a functional webpage. It's the server's job to redirect to the login page and back again if authentication is required.

Also, I have citations from several years ago that do have a proxy prefix in the record URL, so this may be a legacy behavior that has since changed

No, Zotero has always aimed to store unproxied URLs. If proxy detection wasn't working properly for you, or there was some temporary bug, you could've ended up with the proxied URL, but saving the unproxied URL has always been the goal.

I'll let Adomas comment on the feasibility of fully supporting the Muse proxy. I recognize the problem you're pointing out — that the Muse proxy requires different schemes for login and for detection, and Zotero only has a single scheme field — but I'd argue that's a bit of a design flaw in the proxy, since it means that the proxy is serving users on unstable URLs.

aander07 · August 16, 2020

I understand what you are saying about persistent URLs, but this software has a long history going back decades now, and was originally part of a much larger software stack, so some of this is carried over from design decisions that were made long ago for a completely different purpose. Just as I wish EZProxy had never supported the non-RFC-compliant "url=" behavior, which leads to amazing and complex-to-unwind failures, if I had a time machine, I would go back and ask the developers to choose a different design for the behavior of this one piece of the platform. Sometimes we have to work within the limits of what we are given and push for changes as we can.

The leading value is part session identifier, part internal application identifier, so technically, yes, it is possible to get back to a login flow for the application using one of those URLs in the short term (and it is actually being used to do so in my patch, similar to the code that probes for the EZproxy hostname). For long-term use, though, it would require us to guarantee that internal implementation details will not change, and I just do not feel comfortable guaranteeing that, because other parts of the proxy architecture may require changes that cannot be made backwards compatible as we scale and grow.

Therefore, it is much cleaner and more long-term sustainable to use a prefixed entry point with vendor native URLs instead of relying on internal implementation details that are exposed in the rewritten hostname.

This is not any worse than platforms like EBSCO where users cannot take browser bar URLs and paste them into documents due to SID values and other transient data in the URLs, so it is not without precedent. I have worked hard to ensure that when users use the platform's "Cite" functionality on the platform's web pages, those URLs are pristine and not rewritten, so this is limited in scope to just the browser location URL handling, not the actual citation data provided by the platform.

Regarding best practice, sites using a rewriting proxy have been encouraged to use the prefixed method of access going back decades now. The embedded hostname style access (www.vendor.com.proxy.example.edu) just happens to work for EZProxy, but is not guaranteed to work by OCLC; it is just a happy accident that it does work most of the time. OCLC reserves the right to change that at any time without any guarantees of backwards compatibility, just as their internal "ezp.2" link format could change at any time as well.

If you look at Innovative Interfaces' WAM proxy links, their format is similar since it was a fork of EZProxy from long ago (if I remember correctly, their links are formatted as "0-www.vendor.com.proxy.example.edu", where the first number is the port number, with 0 representing the default port 80).

I have not yet seen OpenAthens proxy behavior first hand, but I do know that sites converting to OpenAthens must first "Athenize" their URLs into a prefix style link so that users will either be redirected straight to vendor platform's SAML entry point and prompted for SAML authentication if they do not already have a SAML session OR use the OpenAthens rewriting proxy platform as a fallback for resources that are not SAML authenticated.

So that's what I mean when I talk about best practices in a rewriting proxy environment -- taking the native vendor URL and prefixing it, as that is the de facto standard way that all of these platforms work today, and why I think it would be a great thing for Zotero to accommodate the prefix style access in a way that makes it as painless for the user as possible.

I disagree with your statement that "there's no particular reason for a given host to be associated with more than one proxy", however, because it is absolutely common for institutions to have different subscriptions on aggregator and journal platforms that share a common hostname. Say that a user is a doctoral student at University A, but an adjunct professor at College B. University A has certain resources for their doctoral students at ProQuest, but College B has a completely different set of resources for their baccalaureate programs, yet both use "search.proquest.com" as the host. I do not see how Zotero has enough context, based solely on the hostname, to differentiate between the access methods for those two institutions through which the same user has access to content. Either choice could be wrong -- the doctoral content is not available through College B's proxy, and the baccalaureate resources are not available through University A's proxy. This is the scenario that I have in mind when I suggest that there are use cases that would seem to have room for improvement for users who fall into this category.
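To make the ambiguity concrete, here is a small Python sketch of a lookup keyed only on hostname. The ProxyEntry structure, the candidates function, and the two proxy prefixes are all hypothetical; this is not how Zotero stores its proxy list.

```python
from dataclasses import dataclass
from typing import FrozenSet, List
from urllib.parse import urlsplit


@dataclass(frozen=True)
class ProxyEntry:
    name: str
    prefix: str              # hypothetical prefix-style entry point
    hosts: FrozenSet[str]    # vendor hosts reachable through this proxy


def candidates(native_url: str, proxies: List[ProxyEntry]) -> List[ProxyEntry]:
    """Return every configured proxy that claims the URL's host."""
    host = urlsplit(native_url).hostname
    return [proxy for proxy in proxies if host in proxy.hosts]


proxies = [
    ProxyEntry("University A", "https://proxy.univ-a.example.edu/login?qurl=",
               frozenset({"search.proquest.com"})),
    ProxyEntry("College B", "https://proxy.college-b.example.edu/login?qurl=",
               frozenset({"search.proquest.com", "go.gale.com"})),
]

matches = candidates("https://search.proquest.com/docview/12345", proxies)
print([proxy.name for proxy in matches])
```

Both entries match the ProQuest host, so the hostname alone cannot pick the right one; something like the "Load via..." choice would have to come from the user.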

Anyway, I will get access set up to the proxy platform next week and work with Adomas to get the patch up to speed for inclusion. I feel that the patch is fairly close now, and just needs a few tweaks to have everything fall into place given what you have said about the intent being to keep URLs clean and free from access details.

That is the first step in all of this, and a few months ago, I would have said that the rest sounds like a great conversation to hash out around a table at a conference, but unfortunately we may have to settle for something a bit less personal these days.

dstillman · August 16, 2020

This is not any worse than platforms like EBSCO where users cannot take browser bar URLs and paste them into documents due to SID values and other transient data in the URLs, so it is not without precedent.

Right, but EBSCO is awful and antiquated in this regard, which makes it far harder for us to help people with issues on EBSCO than nearly any other site we regularly have to deal with.

Anyway, the Muse proxy works how it works, so we'll see what we can do.