Feature request: Fully support cloud storage (dropbox/onedrive/google drive)

andychase · January 31, 2021

Currently Zotero supports syncing through its official cloud service or by self-hosting the open-source server code (though this is not well supported). Cloud storage is supported for files, not references.

The vision would be: Sharing your Zotero library by simply setting a directory path on a cloud service, then sharing that directory with your collaborators using that cloud service.

Why this is useful: Some organizations don't (easily) allow the use of cloud services. For example, most government agencies have a complicated process for authorizing one. Self-hosting in those organizations is possible, but would require setup and maintenance.

Technical architecture:

A new backend system would have to be made, one that conforms to the data model that cloud storage allows. The key limitation to be addressed is that cloud storage services struggle when two disconnected clients make changes to the same file and then re-connect.

This can be handled by designing the backend so that it only ever creates new files in the storage directory and never modifies existing ones.

I can see it looking like this:

data/references
- referenceid_machine_id.bib
- referenceid_machine_id_timestamp_edit1.bib
- referenceid_machine_id_timestamp_delete.bib
- referenceid_machine_id.bib
data/files
- file_attachment.pdf

BibTeX could be used, or JSON, or any other appropriate human- and machine-readable format.
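To make the append-only idea concrete, here is a rough Python sketch of how a client might record a change. Everything in it is an assumption for illustration: the `write_event` helper, the JSON payload fields, and the exact file-name pattern are hypothetical and are not part of Zotero.

```python
import json
import time
from pathlib import Path
from typing import Optional


def write_event(root: Path, reference_id: str, machine_id: str,
                action: str, data: Optional[dict] = None) -> Path:
    """Record one change as a brand-new file; existing files are never touched."""
    refs_dir = root / "data" / "references"
    refs_dir.mkdir(parents=True, exist_ok=True)

    timestamp = time.time_ns()
    # Hypothetical naming scheme, loosely following the layout above.
    path = refs_dir / f"{reference_id}_{machine_id}_{timestamp}_{action}.json"

    payload = {
        "reference_id": reference_id,
        "machine_id": machine_id,
        "timestamp": timestamp,
        "action": action,          # e.g. "create", "edit", "delete"
        "data": data or {},        # only the fields that changed
    }
    # Mode "x" raises FileExistsError instead of overwriting an existing file.
    with path.open("x", encoding="utf-8") as fh:
        json.dump(payload, fh, sort_keys=True)
    return path
```

Because each client only writes files whose names include its own machine ID, two disconnected clients should never write to the same path, which is the situation cloud storage services handle badly.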

On application startup, Zotero would have to scan all files and then monitor for file system events. While this may take time for some very large libraries, it isn't unreasonable on most machines and there are several optimizations that could be done to speed up this scan.
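The startup scan could look roughly like the sketch below: read every event file, sort the events, and replay them to rebuild the current state. The `load_library` function and the event fields (`reference_id`, `machine_id`, `timestamp`, `action`, `data`) are assumptions made for this example, and ordering by timestamp assumes the clients' clocks are reasonably in sync, which a real design could not take for granted.

```python
import json
from pathlib import Path


def load_library(root: Path) -> dict:
    """Rebuild the current library state by replaying all event files."""
    events = []
    for path in (root / "data" / "references").glob("*.json"):
        with path.open(encoding="utf-8") as fh:
            events.append(json.load(fh))

    # Replay in timestamp order; machine_id breaks ties deterministically.
    events.sort(key=lambda e: (e["timestamp"], e["machine_id"]))

    library = {}
    for event in events:
        ref_id = event["reference_id"]
        if event["action"] == "delete":
            library.pop(ref_id, None)
        else:
            # "create" and "edit" both merge their fields into the record.
            library.setdefault(ref_id, {}).update(event["data"])
    return library
```

After the initial scan, new files reported by file system events could be fed through the same replay logic one at a time rather than rescanning the whole directory.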

Of course, there is a fundamental consideration here, which is whether to keep the data denormalized or normalized. If denormalized, renaming a tag means making perhaps thousands of edits across the reference database or introducing data inconsistency. The alternative would be to make one file per row, and one directory per table. This would have some trade-offs. The relational layout would produce and manage many more files but maintain higher data integrity. The object (denormalized) layout may produce inconsistencies, but might allow users to read the data files directly.
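The tag-rename trade-off can be illustrated with a small in-memory sketch. The two data shapes and both function names are hypothetical and only meant to show how many records each layout has to rewrite.

```python
def rename_tag_denormalized(items: dict, old: str, new: str) -> int:
    """Each item embeds its tag names, so every item using the tag is rewritten."""
    touched = 0
    for item in items.values():
        if old in item["tags"]:
            item["tags"] = [new if tag == old else tag for tag in item["tags"]]
            touched += 1
    return touched


def rename_tag_normalized(tags: dict, tag_id: int, new: str) -> int:
    """Items refer to tags by ID, so only the single tag record is rewritten."""
    tags[tag_id]["name"] = new
    return 1


# Denormalized: tag names live inside every item.
items = {
    "A": {"title": "Paper A", "tags": ["ml", "stats"]},
    "B": {"title": "Paper B", "tags": ["ml"]},
    "C": {"title": "Paper C", "tags": ["history"]},
}
denormalized_writes = rename_tag_denormalized(items, "ml", "machine learning")

# Normalized: items would hold tag IDs, and names live in one place.
tags = {1: {"name": "ml"}, 2: {"name": "stats"}}
normalized_writes = rename_tag_normalized(tags, 1, "machine learning")
```

In the append-only file layout, each rewritten record would mean one more new file, so the denormalized rename produces a file per affected item while the normalized rename produces one.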

Another option would be to just store something that looks more like the syncCache table and keep the sqlite database as is.

Hope this is thought-provoking.

Thanks.

bwiernik · January 31, 2021

What you are describing is a fundamentally different program than what Zotero is (which is an SQLite database and interface). A complete rewrite of Zotero to use a completely different database design isn’t going to happen.

Zotero can be used entirely locally without syncing the database at all. This can be integrated with cloud storage for attachment files using the ZotFile plugin.

There are other reference managers designed to use Dropbox (e.g., Papers did in the past, though I am unsure how active development and support for Papers is currently).

andychase · January 31, 2021

That's a fair point about the amount of work that would be needed to implement a new data model.

Do you have feedback on the suggested design: "Another option would be to just store something that looks more like the syncCache table and keep the sqlite database as is." ?

adamsmith · January 31, 2021

I think the larger point here is that you're asking whether Zotero would implement something that's a significant amount of work, provides inferior functionality, requires additional troubleshooting because it involves 3rd party services, and also has the potential to affect Zotero's ability to generate revenue in the process.

I just don't think this makes sense. I think private sync instances for gov't and other users are a very real use case, but one that's more appropriately satisfied by facilitating server deployment (e.g. via a Docker or other container image together with easy deploy instructions). That would also have the added benefit of properly handling groups within an organization, which I don't see how a 3rd party-synced model reasonably could.

Zotero devs have indicated on various occasions that they're hoping to provide something like this in the future -- in principle, nothing would prevent an agency from developing and maintaining an unofficial Docker image themselves, or they could look at whether Zotero would be willing to do that (more quickly) as a paid collaborative effort.

andychase · January 31, 2021

Yeah, I agree it’s an open question how well it would work, and it would undoubtedly introduce more complexity and maintenance, probably beyond what the maintainers would want to deal with. Though it would save all those instances of people trying to do this currently with the SQLite file and corrupting it.

Title says feature request but in my mind I just assumed if I wanted to make this happen I would have to develop it myself, but wanted to get the idea out there anyway. Maybe someone else in the future will have a similar need and find my design ideas helpful.

andychase · January 31, 2021

One comment on the Docker image. You are right that it’s doable in the organization I’m familiar with, but it requires a big project to get the server authorized and deployed. With some of the approval groups, there’s also a chicken-and-egg problem where the data center admins don’t want to get all the approvals because there’s no usage, and there’s no usage because the server isn’t set up.

Another case is a university setting, collaborating with a corporation or agency that doesn’t allow cloud services. Sharing something like a OneDrive folder doesn’t require any approvals, and desktop freeware usage is usually not a problem. So it makes it extremely convenient to just set up and use, no problem.

This is all people problems though, not technical ones.

emilianoeheyns · February 1, 2021

As an aside, BibTeX might be the worst possible format to use for something like this, and I say that as an avid bib(la)tex user. Converting Zotero data to and from biblatex takes a metric ton of partially lossy structure-conversion compromises, which would likely get worse with every conversion, plus text transformations that are unnecessary (and potentially lossy) for this use case. JSON would be a much better choice.

If you're actually thinking of doing this, you might want to look at something like a git repo backend. That has desirable properties of its own, and will give you the basics of syncing and conflict detection. But I wouldn't plan much else in your life for the next year or so.