[New Plugin] ZotSeek: AI-Powered Semantic Search for Zotero 8
Hi everyone,
I've developed ZotSeek, a plugin that adds AI-powered semantic search to Zotero 8.
WHAT IT DOES
• Find Similar Documents - Right-click any paper → discover semantically related papers
• Natural Language Search - Search with queries like "machine learning in healthcare"
• Hybrid Search - Combines AI embeddings with Zotero's keyword search
• Section-Aware Results - See which section matched (Abstract, Methods, Results)
• 100% Local - Uses bundled AI model, no data leaves your machine
• Fast - Searches complete in ~70ms
LINKS
GitHub: https://github.com/introfini/ZotSeek
Download: https://github.com/introfini/ZotSeek/releases/latest
REQUIREMENTS
Zotero 8.0 or later (including beta)
TECHNICAL DETAILS
• Uses nomic-embed-text-v1.5 model (768 dimensions, 8K context)
• Transformers.js running in a ChromeWorker for non-blocking inference (see the sketch below)
• Embeddings stored in SQLite alongside Zotero's database
• ~131MB download due to bundled AI model (ensures offline functionality)
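For the curious, the embedding step looks roughly like this (a minimal sketch, not the plugin's actual worker code; the task prefixes follow the nomic-embed model card, and the sample text is illustrative):

```typescript
// Minimal sketch of embedding a text chunk with Transformers.js and
// nomic-embed-text-v1.5. Not ZotSeek's actual worker code.
import { pipeline } from "@xenova/transformers";

// Load the feature-extraction pipeline once; model weights are cached
// locally, so nothing is fetched after the first load.
const embed = await pipeline(
  "feature-extraction",
  "nomic-ai/nomic-embed-text-v1.5"
);

// nomic-embed expects a task prefix: "search_document: " for indexed
// text and "search_query: " for user queries.
const chunkText = "Deep learning methods for flood risk assessment ...";
const output = await embed("search_document: " + chunkText, {
  pooling: "mean",
  normalize: true,
});
const vector = Array.from(output.data as Float32Array); // 768 dimensions
```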
Feedback and suggestions welcome!
1. Books (special issues, edited books) are important; it would be good to have a way to include them.
2. Putting things into the main database seems risky. The potential loss of work that took many people many years to compile is significant, individual plugin development tends to be precarious over time, and the core Zotero team's first recommendation for most problems is to remove plugins. I would therefore prefer this functionality to live in its own bounded space, guaranteed not to interfere with core reliability.
I appreciate the thorough documentation!
Thanks for the thoughtful feedback!
Re: Books - There's actually a setting in the preferences (Settings → ZotSeek → "Exclude books from indexing and search") that you can uncheck to include books. They're excluded by default because they lack the typical paper structure (Abstract, Methods, Results) and are often too long to index effectively, but the option is there if you need it.
Re: Database isolation - Great timing on this concern! I just released v1.1.0 which addresses exactly this. ZotSeek now uses a separate SQLite database file (zotseek.sqlite) that attaches to Zotero's connection using the ATTACH DATABASE pattern (same approach Better BibTeX uses). This means:
- Zotero's main database stays completely untouched
- Clean uninstall - the plugin automatically removes its database file when uninstalled
- Complete data isolation
The database path is now shown in the settings panel so you can verify it's separate.
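For anyone curious, the attach pattern looks roughly like this (a sketch; the table schema is illustrative, not ZotSeek's exact code, and PathUtils usage is an assumption about the privileged Zotero environment):

```typescript
// Sketch of the ATTACH DATABASE pattern on Zotero's existing connection.
// Zotero.DB.queryAsync and Zotero.DataDirectory.dir are Zotero APIs;
// the schema below is illustrative.
const dbPath = PathUtils.join(Zotero.DataDirectory.dir, "zotseek.sqlite");
await Zotero.DB.queryAsync(`ATTACH DATABASE '${dbPath}' AS zotseek`);

// All plugin tables live in the attached file; zotero.sqlite is untouched.
await Zotero.DB.queryAsync(`
  CREATE TABLE IF NOT EXISTS zotseek.embeddings (
    itemID  INTEGER NOT NULL,
    chunk   INTEGER NOT NULL,
    section TEXT,
    vector  BLOB NOT NULL,
    PRIMARY KEY (itemID, chunk)
  )
`);
```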
@damnation:
Thanks! Glad the documentation is helpful.
On books. Thanks for the explanation. In our case, books make up a large proportion of the literature. They can be systematic monographs with chapters covering all aspects of a biological genus (morphology, population, human use, and so on) or more loosely structured works. Having ways to meaningfully integrate them would be very useful. If relevant: many universities provide access to high-performance computing that could be used for an initial analysis.
On real-world use. Over time, research across projects, teaching across multiple subjects, and preparing various papers and presentations, while collaborating with novices or people using different in-house practices, all result in many groups, duplications, and folders of documents not yet added to Zotero. In principle, this is the context where an automated search and sense-making system could be most useful... For your consideration.
I also commend the transparent documentation!
Otherwise: very cool project!
Correct - ZotSeek searches only your local Zotero library. It doesn't connect to external databases or query papers you don't already have.
Thanks for pointing out that this should be clearer! I'll update the README to state this explicitly upfront. The "100% Local" framing emphasizes privacy but doesn't make the scope obvious enough.
The focus is on helping you rediscover and make connections within your existing research, rather than external discovery. Think of it as a smarter way to navigate what you've already collected.
Thanks for the detailed context about your use case!
Re: Books - You're right that books present unique challenges. The current system can handle them if you enable that setting, but there are some practical limitations:
1. Length - A 300-page monograph might generate 100+ chunks at ~3 seconds each, so five minutes or more per book, which quickly becomes prohibitive across a large collection
2. Structure variation - The section detection (which powers the "Source" column in results) assumes paper structure. Books with chapter-based organization won't benefit from it.
3. Context window - Even with 8K tokens, most books need to be split heavily, which can fragment semantic meaning
That said, for shorter edited volumes or structured monographs, it could work reasonably well.
An idea for the future: Your use case has me thinking about making the chunk splitting configurable. Currently, the plugin looks for standard paper sections (Introduction, Methods, Results, etc.), but allowing users to define custom section patterns could make it work much better for books or discipline-specific formats. This could be a preferences setting where you define your own categories (e.g., "Morphology", "Population", "Human Use" for biological monographs). Would that help with your workflow?
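To make that concrete, here's a rough sketch of what user-defined patterns could look like (a hypothetical setting, not something that exists today; all names are illustrative):

```typescript
// Illustrative sketch of configurable section patterns. Each label maps
// to a heading regex tested against a chunk's text.
const paperSections: Record<string, RegExp> = {
  Abstract: /^\s*abstract\b/im,
  Methods: /^\s*(materials\s+and\s+)?methods\b/im,
  Results: /^\s*results\b/im,
};

// A hypothetical user-defined set for biological monographs.
const monographSections: Record<string, RegExp> = {
  Morphology: /^\s*morphology\b/im,
  Population: /^\s*population\b/im,
  "Human Use": /^\s*human\s+use\b/im,
};

// Return the label of the first pattern matching in a chunk,
// falling back to a generic "Body" label.
function labelChunk(chunk: string, patterns: Record<string, RegExp>): string {
  for (const [label, re] of Object.entries(patterns)) {
    if (re.test(chunk)) return label;
  }
  return "Body";
}
```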
Regarding HPC resources - I appreciate the offer! However, the current architecture is deliberately designed to run 100% locally on your machine (nothing leaves your computer), which means the AI model processes chunks sequentially rather than using external compute. This ensures complete privacy but also means that even with access to university computing clusters, indexing would still be sequential and take the same amount of time per chunk. The tradeoff is privacy and offline capability vs. speed.
Would love to hear how it performs if you try it with your book collection!
Re: Real-world messiness - This is actually where semantic search excels! The AI doesn't care about clean folder structures or perfect metadata. As long as papers are indexed, you can find them by meaning regardless of how chaotically they're organized. Duplicates might show up as highly similar papers, which could even help you identify them.
The main limitation right now is that papers need to be in Zotero to be indexed. "Documents not yet added to Zotero" won't be searchable until imported. Auto-indexing on import is planned to make this smoother.
On books and processing time. Say, we have some 20k articles x 3sec ≈ 16 hours? Say, 2k of books with 10 chapters each ≈ also 16 hours? Will subsequent updates only index new additions?
Not sure I understand the implications of the customisation you suggest, but anything manual here will be too time-consuming to attempt.
Still not clear: does the search work across Zotero groups?
Basically, if any chunk of content is not included in the search, the utility of searching is in question, since you would have to verify by other means that nothing is missing anyway.
On HPC. The user could create a powerful virtual machine, index there, and then use the index on a laptop, etc., if it can be synced, or so I imagined.
Good luck with all this; it is a complex challenge.
The keyword search portion appears buggy: it's not finding results that Zotero's stand-alone keyword search finds.
Search history would be super helpful.
On processing time and incremental updates:
Your math is correct - 20k articles at ~3 sec each would be roughly 16-17 hours of initial indexing. The good news: subsequent updates are incremental. "Update Library Index" only processes items that haven't been indexed yet or have been modified since last indexing. So after that initial one-time investment, adding 50 new papers would only take ~2.5 minutes.
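For the technically curious, the incremental check boils down to something like this (a sketch; the indexed-state table and its columns are illustrative names, not the plugin's actual schema):

```typescript
// Sketch of incremental indexing: select items never indexed, or
// modified since they were last indexed. Zotero.DB.columnQueryAsync
// is Zotero's API; zotseek.indexed is an illustrative table.
const staleItemIDs: number[] = await Zotero.DB.columnQueryAsync(`
  SELECT i.itemID
  FROM items i
  LEFT JOIN zotseek.indexed z ON z.itemID = i.itemID
  WHERE z.itemID IS NULL
     OR i.dateModified > z.lastIndexed
`);
```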
On customization:
To clarify - I was thinking of a one-time preference setting where you define custom section patterns once (e.g., "Morphology chapters should be labeled 'Morphology'"), not manual configuration per document. The plugin would then automatically apply those patterns during indexing. But you're right that even that setup might not be worth it if the core challenge is indexing time.
On Zotero groups:
Good question - I need to verify this. The plugin works on collections, so in theory group libraries should work, but I haven't explicitly tested cross-group search. Let me confirm and get back to you.
On completeness:
This is the fundamental tension you've identified. If your workflow requires searching everything including lengthy monographs, and the indexing time makes that impractical, then ZotSeek might not be the right fit - or at least not as a complete replacement for other search methods. It's better suited for literature that fits the abstract/paper model, or as a complement to traditional search rather than a replacement.
On HPC/VM syncing:
Clever idea! The embeddings are stored in a single SQLite file (zotseek.sqlite) in your Zotero data directory. In theory, you could index on a powerful VM and then copy that database file to your laptop. However, this hasn't been tested and there could be path/compatibility issues. If you're willing to experiment, I'd be very interested to hear if it works!
Bottom line: For a library heavily weighted toward long-form books rather than papers, the current architecture might not be ideal. The plugin excels at paper-focused libraries where most items have abstracts. I appreciate you thinking through these edge cases - it helps clarify where the tool fits (and doesn't fit) in different research workflows.
Thanks for the kind words and the detailed feedback — both points are really helpful!
On the keyword search bug:
I'd love to dig into this. Could you share a specific query and an item that Zotero's search finds but ZotSeek misses? ZotSeek's keyword component queries Zotero's QuickSearch API under the hood, so in theory it should return the same results - but there might be edge cases I haven't caught, and concrete examples would help me reproduce and fix it.
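For reference, the keyword side is roughly equivalent to this use of Zotero's search API (a sketch; ZotSeek's exact conditions may differ):

```typescript
// Roughly what a QuickSearch-backed keyword query looks like via
// Zotero's search API.
const s = new Zotero.Search();
s.libraryID = Zotero.Libraries.userLibraryID;
// "quicksearch-everything" covers titles, creators, tags, notes, and
// indexed full-text content, like the search bar's "Everything" mode.
s.addCondition("quicksearch-everything", "contains", "floods");
const itemIDs: number[] = await s.search();
```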
On chunk-level search with previews:
Great feature idea, and you've hit on something that's partially there but could go further.
What ZotSeek already does: In Full Document mode, papers are chunked by section (Abstract, Methods, Results, etc.), and the search uses MaxSim — meaning if any chunk matches well, the paper ranks highly. The "Source" column in results shows which section matched (e.g., "Methods" vs "Results").
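Concretely, MaxSim just means scoring a paper by its best-matching chunk; with normalized embeddings it's a few lines (a sketch, not the plugin's exact code):

```typescript
// MaxSim over section chunks: a paper's score is the similarity of its
// best-matching chunk. Vectors are assumed L2-normalized, so the dot
// product equals cosine similarity.
function dot(a: Float32Array, b: Float32Array): number {
  let s = 0;
  for (let i = 0; i < a.length; i++) s += a[i] * b[i];
  return s;
}

function maxSim(query: Float32Array, chunkVectors: Float32Array[]): number {
  return Math.max(...chunkVectors.map((c) => dot(query, c)));
}
```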
What's missing (and what you're asking for):
The architecture supports this — chunk embeddings are already stored separately — but I'd need to store chunk text alongside embeddings and add preview UI. Definitely on the radar.
Quick question: For your use case with long documents, are you mostly working with monographs/books, or lengthy reports and review papers? That would help me prioritize how granular the chunking needs to be.
On search history: Also a good idea — I'll add it to the list. Would you want just recent queries, or more of a "saved searches" feature?
Thanks again for the thoughtful suggestions!
As an example I'll use this document: https://www.dni.gov/files/ODNI/documents/assessments/GlobalTrends_2040.pdf which I ingested into the index (full content) and tested by searching for the word "flood".
For reference, Zotero's plain search identifies this document as a match, and its PDF reader finds six occurrences of the word (1x flood, 4x floods, 1x flooding).
I searched for the word "floods" using ZotSeek's keyword search and it finds nothing. Oddly, the semantic and hybrid searches yield many hits from my library, but there's no scrollbar to scroll down and view the results beyond the first page - perhaps this specific document is lower down in the list, but I cannot tell, even with the results stretched down across both monitors. The "sort by column" feature also doesn't appear functional yet.
Does this help?
To your other question about document size: Currently I'm working on a research paper sourced by journal articles typically 10-40 pages long.
As a regular *nix user (though currently on Windows here, admittedly), I'm used to Bash-style terminal history, so that's what I had in mind, but I'm not tied to it.
Once I'm done with school (in a month or two) I hope I can help out on your GitHub!
On that note: would giving users more control of 1) the chunk size and 2) the headings you search for via your long regex (I've done similar work in the past - my regexes were nearly identical!) provide more flexibility towards the recommendations offered above?
New: Result Granularity Toggle
In Full Document mode, you can now switch between two views:
- By Section (default) — 1 result per paper, showing the best matching section
- By Location — All matching paragraphs with exact page & paragraph numbers
In By Location mode, clicking a result opens the PDF directly to the matching page - so you can finally answer "where in this 300-page paper did I read about autonomous AI agents?"
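(Implementation note, for the curious: opening at the matching page is essentially a reader call like the sketch below. Treat the exact Zotero.Reader.open signature as my assumption and verify against the Zotero source; the IDs are hypothetical.)

```typescript
// Sketch of jumping to the matching page of an item's PDF attachment.
// Assumes Zotero.Reader.open accepts a location with a 0-based pageIndex.
const itemID = 1234;  // the matched paper (hypothetical ID)
const matchPage = 57; // 1-based page number from the result row

const item = Zotero.Items.get(itemID);
const attachment = await item.getBestAttachment();
if (attachment) {
  await Zotero.Reader.open(attachment.id, { pageIndex: matchPage - 1 });
}
```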
Also new:
- References filtering - Bibliography sections are now automatically excluded from indexing, keeping results focused on actual content
- Privacy documentation — Added a dedicated section confirming everything runs 100% locally
Upgrade notes:
- Download zotseek-1.2.0.xpi from the GitHub releases page
- If using Full Document mode, go to Settings → ZotSeek and increase "Max Chunks per Paper" (recommended: 50-100 for longer documents)
- Rebuild your index to capture the new location metadata and apply references filtering
Would love to hear if this helps with your workflow!