[New Plugin] ZotSeek: AI-Powered Semantic Search for Zotero 8
Hi everyone,
I've developed ZotSeek, a plugin that adds AI-powered semantic search to Zotero 8.
WHAT IT DOES
• Find Similar Documents - Right-click any paper → discover semantically related papers
• Natural Language Search - Search with queries like "machine learning in healthcare"
• Hybrid Search - Combines AI embeddings with Zotero's keyword search
• Section-Aware Results - See which section matched (Abstract, Methods, Results)
• 100% Local - Uses bundled AI model, no data leaves your machine
• Fast - Searches complete in ~70ms
LINKS
GitHub: https://github.com/introfini/ZotSeek
Download: https://github.com/introfini/ZotSeek/releases/latest
REQUIREMENTS
Zotero 8.0 or later (including beta)
TECHNICAL DETAILS
• Uses nomic-embed-text-v1.5 model (768 dimensions, 8K context)
• Transformers.js running in a ChromeWorker for non-blocking inference (see the sketch below)
• Embeddings stored in SQLite alongside Zotero's database
• ~131MB download due to bundled AI model (ensures offline functionality)
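For the curious, the embedding step looks roughly like this (a minimal sketch, not the plugin's actual worker code; the task prefixes follow the nomic-embed model card, and the sample text is illustrative):

```typescript
// Minimal sketch of embedding a text chunk with Transformers.js and
// nomic-embed-text-v1.5. Not ZotSeek's actual worker code.
import { pipeline } from "@xenova/transformers";

// Load the feature-extraction pipeline once; model weights are cached
// locally, so nothing is fetched after the first load.
const embed = await pipeline(
  "feature-extraction",
  "nomic-ai/nomic-embed-text-v1.5"
);

// nomic-embed expects a task prefix: "search_document: " for indexed
// text and "search_query: " for user queries.
const chunkText = "Deep learning methods for flood risk assessment ...";
const output = await embed("search_document: " + chunkText, {
  pooling: "mean",
  normalize: true,
});
const vector = Array.from(output.data as Float32Array); // 768 dimensions
```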
Feedback and suggestions welcome!
1. Books (special issues, edited books) are important; it would be good to have a way to include them.
2. Putting things into the main database seems risky. The potential loss of work that took many people many years to compile is significant, individual plugin development tends to be precarious over time, and the core Zotero team's first recommendation for most problems is to remove plugins. I would therefore prefer this functionality to live in its own bounded space, guaranteed not to interfere with core reliability.
I appreciate the thorough documentation!
Thanks for the thoughtful feedback!
Re: Books - There's actually a setting in the preferences (Settings → ZotSeek → "Exclude books from indexing and search") that you can uncheck to include books. They're excluded by default because they lack the typical paper structure (Abstract, Methods, Results) and are often too long to index effectively, but the option is there if you need it.
Re: Database isolation - Great timing on this concern! I just released v1.1.0 which addresses exactly this. ZotSeek now uses a separate SQLite database file (zotseek.sqlite) that attaches to Zotero's connection using the ATTACH DATABASE pattern (same approach Better BibTeX uses). This means:
- Zotero's main database stays completely untouched
- Clean uninstall - the plugin automatically removes its database file when uninstalled
- Complete data isolation
The database path is now shown in the settings panel so you can verify it's separate.
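For anyone curious, the attach pattern looks roughly like this (a sketch; the table schema is illustrative, not ZotSeek's exact code, and PathUtils usage is an assumption about the privileged Zotero environment):

```typescript
// Sketch of the ATTACH DATABASE pattern on Zotero's existing connection.
// Zotero.DB.queryAsync and Zotero.DataDirectory.dir are Zotero APIs;
// the schema below is illustrative.
const dbPath = PathUtils.join(Zotero.DataDirectory.dir, "zotseek.sqlite");
await Zotero.DB.queryAsync(`ATTACH DATABASE '${dbPath}' AS zotseek`);

// All plugin tables live in the attached file; zotero.sqlite is untouched.
await Zotero.DB.queryAsync(`
  CREATE TABLE IF NOT EXISTS zotseek.embeddings (
    itemID  INTEGER NOT NULL,
    chunk   INTEGER NOT NULL,
    section TEXT,
    vector  BLOB NOT NULL,
    PRIMARY KEY (itemID, chunk)
  )
`);
```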
@damnation:
Thanks! Glad the documentation is helpful.
On books. Thanks for the explanation. In our case, books make up a large proportion of the literature. They can be systematic monographs with chapters covering all aspects of a biological genus (morphology, population, human use, and so on) or more loosely structured works. Having ways to meaningfully integrate them would be very useful. If relevant: many universities provide access to high-performance computing that could be used for an initial analysis.
On real-world use. Over time, research across projects, teaching across multiple subjects, and preparing various papers and presentations, while collaborating with novices or people using different in-house practices, all result in many groups, duplications, and folders of documents not yet added to Zotero. In principle, this is the context where an automated search and sense-making system could be most useful... For your consideration.
I also commend the transparent documentation!
Otherwise: very cool project!
Correct - ZotSeek searches only your local Zotero library. It doesn't connect to external databases or query papers you don't already have.
Thanks for pointing out that this should be clearer! I'll update the README to state this explicitly upfront. The "100% Local" framing emphasizes privacy but doesn't make the scope obvious enough.
The focus is on helping you rediscover and make connections within your existing research, rather than external discovery. Think of it as a smarter way to navigate what you've already collected.
Thanks for the detailed context about your use case!
Re: Books - You're right that books present unique challenges. The current system can handle them if you enable that setting, but there are some practical limitations:
1. Length - A 300-page monograph might generate 100+ chunks at ~3 seconds each, so five minutes or more per book, which quickly becomes prohibitive across a large collection
2. Structure variation - The section detection (which powers the "Source" column in results) assumes paper structure. Books with chapter-based organization won't benefit from it.
3. Context window - Even with 8K tokens, most books need to be split heavily, which can fragment semantic meaning
That said, for shorter edited volumes or structured monographs, it could work reasonably well.
An idea for the future: Your use case has me thinking about making the chunk splitting configurable. Currently, the plugin looks for standard paper sections (Introduction, Methods, Results, etc.), but allowing users to define custom section patterns could make it work much better for books or discipline-specific formats. This could be a preferences setting where you define your own categories (e.g., "Morphology", "Population", "Human Use" for biological monographs). Would that help with your workflow?
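To make that concrete, here's a rough sketch of what user-defined patterns could look like (a hypothetical setting, not something that exists today; all names are illustrative):

```typescript
// Illustrative sketch of configurable section patterns. Each label maps
// to a heading regex tested against a chunk's text.
const paperSections: Record<string, RegExp> = {
  Abstract: /^\s*abstract\b/im,
  Methods: /^\s*(materials\s+and\s+)?methods\b/im,
  Results: /^\s*results\b/im,
};

// A hypothetical user-defined set for biological monographs.
const monographSections: Record<string, RegExp> = {
  Morphology: /^\s*morphology\b/im,
  Population: /^\s*population\b/im,
  "Human Use": /^\s*human\s+use\b/im,
};

// Return the label of the first pattern matching in a chunk,
// falling back to a generic "Body" label.
function labelChunk(chunk: string, patterns: Record<string, RegExp>): string {
  for (const [label, re] of Object.entries(patterns)) {
    if (re.test(chunk)) return label;
  }
  return "Body";
}
```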
Regarding HPC resources - I appreciate the offer! However, the current architecture is deliberately designed to run 100% locally on your machine (nothing leaves your computer), which means the AI model processes chunks sequentially rather than using external compute. This ensures complete privacy but also means that even with access to university computing clusters, indexing would still be sequential and take the same amount of time per chunk. The tradeoff is privacy and offline capability vs. speed.
Would love to hear how it performs if you try it with your book collection!
Re: Real-world messiness - This is actually where semantic search excels! The AI doesn't care about clean folder structures or perfect metadata. As long as papers are indexed, you can find them by meaning regardless of how chaotically they're organized. Duplicates might show up as highly similar papers, which could even help you identify them.
The main limitation right now is that papers need to be in Zotero to be indexed. "Documents not yet added to Zotero" won't be searchable until imported. Auto-indexing on import is planned to make this smoother.
On books and processing time. Say, we have some 20k articles x 3sec ≈ 16 hours? Say, 2k of books with 10 chapters each ≈ also 16 hours? Will subsequent updates only index new additions?
Not sure I understand the implications of the customisation you suggest, but anything manual here will be too time-consuming to attempt.
Still not clear: does the search work across Zotero groups?
Basically, if any chunk of content is not included in the search, the utility of searching is in question, since you would have to verify by other means that nothing is missing anyway.
On HPC. The user could create a powerful virtual machine, index there, and then use the index on a laptop, etc., if it can be synced, or so I imagined.
Good luck with all this; it is a complex challenge.
The keyword search portion appears buggy: it's not finding results that Zotero's stand-alone keyword search finds.
Search history would be super helpful.
On processing time and incremental updates:
Your math is correct - 20k articles at ~3 sec each would be roughly 16-17 hours of initial indexing. The good news: subsequent updates are incremental. "Update Library Index" only processes items that haven't been indexed yet or have been modified since last indexing. So after that initial one-time investment, adding 50 new papers would only take ~2.5 minutes.
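For the technically curious, the incremental check boils down to something like this (a sketch; the indexed-state table and its columns are illustrative names, not the plugin's actual schema):

```typescript
// Sketch of incremental indexing: select items never indexed, or
// modified since they were last indexed. Zotero.DB.columnQueryAsync
// is Zotero's API; zotseek.indexed is an illustrative table.
const staleItemIDs: number[] = await Zotero.DB.columnQueryAsync(`
  SELECT i.itemID
  FROM items i
  LEFT JOIN zotseek.indexed z ON z.itemID = i.itemID
  WHERE z.itemID IS NULL
     OR i.dateModified > z.lastIndexed
`);
```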
On customization:
To clarify - I was thinking of a one-time preference setting where you define custom section patterns once (e.g., "Morphology chapters should be labeled 'Morphology'"), not manual configuration per document. The plugin would then automatically apply those patterns during indexing. But you're right that even that setup might not be worth it if the core challenge is indexing time.
On Zotero groups:
Good question - I need to verify this. The plugin works on collections, so in theory group libraries should work, but I haven't explicitly tested cross-group search. Let me confirm and get back to you.
On completeness:
This is the fundamental tension you've identified. If your workflow requires searching everything including lengthy monographs, and the indexing time makes that impractical, then ZotSeek might not be the right fit - or at least not as a complete replacement for other search methods. It's better suited for literature that fits the abstract/paper model, or as a complement to traditional search rather than a replacement.
On HPC/VM syncing:
Clever idea! The embeddings are stored in a single SQLite file (zotseek.sqlite) in your Zotero data directory. In theory, you could index on a powerful VM and then copy that database file to your laptop. However, this hasn't been tested and there could be path/compatibility issues. If you're willing to experiment, I'd be very interested to hear if it works!
Bottom line: For a library heavily weighted toward long-form books rather than papers, the current architecture might not be ideal. The plugin excels at paper-focused libraries where most items have abstracts. I appreciate you thinking through these edge cases - it helps clarify where the tool fits (and doesn't fit) in different research workflows.
Thanks for the kind words and the detailed feedback — both points are really helpful!
On the keyword search bug:
I'd love to dig into this. Could you share a specific query and an item that Zotero's search finds but ZotSeek misses? ZotSeek's keyword component queries Zotero's QuickSearch API under the hood, so in theory it should return the same results - but there might be edge cases I haven't caught, and concrete examples would help me reproduce and fix it.
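For reference, the keyword side is roughly equivalent to this use of Zotero's search API (a sketch; ZotSeek's exact conditions may differ):

```typescript
// Roughly what a QuickSearch-backed keyword query looks like via
// Zotero's search API.
const s = new Zotero.Search();
s.libraryID = Zotero.Libraries.userLibraryID;
// "quicksearch-everything" covers titles, creators, tags, notes, and
// indexed full-text content, like the search bar's "Everything" mode.
s.addCondition("quicksearch-everything", "contains", "floods");
const itemIDs: number[] = await s.search();
```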
On chunk-level search with previews:
Great feature idea, and you've hit on something that's partially there but could go further.
What ZotSeek already does: In Full Document mode, papers are chunked by section (Abstract, Methods, Results, etc.), and the search uses MaxSim — meaning if any chunk matches well, the paper ranks highly. The "Source" column in results shows which section matched (e.g., "Methods" vs "Results").
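Concretely, MaxSim just means scoring a paper by its best-matching chunk; with normalized embeddings it's a few lines (a sketch, not the plugin's exact code):

```typescript
// MaxSim over section chunks: a paper's score is the similarity of its
// best-matching chunk. Vectors are assumed L2-normalized, so the dot
// product equals cosine similarity.
function dot(a: Float32Array, b: Float32Array): number {
  let s = 0;
  for (let i = 0; i < a.length; i++) s += a[i] * b[i];
  return s;
}

function maxSim(query: Float32Array, chunkVectors: Float32Array[]): number {
  return Math.max(...chunkVectors.map((c) => dot(query, c)));
}
```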
What's missing (and what you're asking for):
The architecture supports this — chunk embeddings are already stored separately — but I'd need to store chunk text alongside embeddings and add preview UI. Definitely on the radar.
Quick question: For your use case with long documents, are you mostly working with monographs/books, or lengthy reports and review papers? That would help me prioritize how granular the chunking needs to be.
On search history: Also a good idea — I'll add it to the list. Would you want just recent queries, or more of a "saved searches" feature?
Thanks again for the thoughtful suggestions!
As an example I'll use this document: https://www.dni.gov/files/ODNI/documents/assessments/GlobalTrends_2040.pdf which I ingested into the index (full content) and tested by searching for the word "flood".
For reference, Zotero's plain search identifies this document as a match, and its PDF reader finds six occurrences of the word (1x flood, 4x floods, 1x flooding).
I searched for the word "floods" using ZotSeek's keyword search and it finds nothing. Oddly, the semantic and hybrid searches yield many hits from my library, but there's no scrollbar to scroll down and view the results beyond the first page - perhaps this specific document is lower down in the list, but I cannot tell, even with the results stretched down across both monitors. The "sort by column" feature also doesn't appear functional yet.
Does this help?
To your other question about document size: Currently I'm working on a research paper sourced by journal articles typically 10-40 pages long.
As a regular *nix user (though currently on Windows here, admittedly), I'm used to Bash-style terminal history, so that's what I had in mind, but I'm not tied to it.
Once I'm done with school (in a month or two) I hope I can help out on your GitHub!
On that note: would giving users more control of 1) the chunk size and 2) the headings you search for via your long regex (I've done similar work in the past - my regexes were nearly identical!) provide more flexibility towards the recommendations offered above?
New: Result Granularity Toggle
In Full Document mode, you can now switch between two views:
- By Section (default) — 1 result per paper, showing the best matching section
- By Location — All matching paragraphs with exact page & paragraph numbers
In By Location mode, clicking a result opens the PDF directly to the matching page - so you can finally answer "where in this 300-page paper did I read about autonomous AI agents?"
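(Implementation note, for the curious: opening at the matching page is essentially a reader call like the sketch below. Treat the exact Zotero.Reader.open signature as my assumption and verify against the Zotero source; the IDs are hypothetical.)

```typescript
// Sketch of jumping to the matching page of an item's PDF attachment.
// Assumes Zotero.Reader.open accepts a location with a 0-based pageIndex.
const itemID = 1234;  // the matched paper (hypothetical ID)
const matchPage = 57; // 1-based page number from the result row

const item = Zotero.Items.get(itemID);
const attachment = await item.getBestAttachment();
if (attachment) {
  await Zotero.Reader.open(attachment.id, { pageIndex: matchPage - 1 });
}
```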
Also new:
- References filtering - Bibliography sections are now automatically excluded from indexing, keeping results focused on actual content
- Privacy documentation — Added a dedicated section confirming everything runs 100% locally
Upgrade notes:
- Download zotseek-1.2.0.xpi from the GitHub releases page
- If using Full Document mode, go to Settings → ZotSeek and increase "Max Chunks per Paper" (recommended: 50-100 for longer documents)
- Rebuild your index to capture the new location metadata and apply references filtering
Would love to hear if this helps with your workflow!