Zotero and RAG?
Has anybody looked into ways of using Zotero libraries in 'retrieval-augmented generation' (RAG) pipelines? Think of it as running a chatbot on your own library/ies, WITH the ability to reference your Zotero items ... I quite like this setup, which I've used on texts I downloaded into a folder, but I'm not sure how to hook it up to the sqlite db. So if anybody has any ideas/experience - I'd love to hear them!
Longing for a RAG setup to find references fast.
I'm now looking into a setup that exports a selected group library to a json(l) file with the key fields I need (including the cached full text from .zotero-ft-cache and the path to the pdf) and then a) uses that cached text in the RAG; AND b) (maybe) feeds the pdfs through GROBID for 'cleaner' text (it does a pretty good job of stripping out just the text, as well as all sorts of other elements). The nice thing is that this will then allow us to ...
My main stumbling block right now is that I am not sure where Zotero stores the groupID. I looked for a particular item (using the 'key' from the 'items' table) in all tables, but I can't find the groupID there. I do see things like libraryID, but I'm not sure that's the right field. And when I look for the groupID of that item, I also can't find it. The point is of course to do a query on that groupID so as to work with JUST those items (I have a quite gigantic Zotero collection).
So I'd appreciate it if somebody could confirm that the libraryID is indeed the group identifier. And then I'll take it from there.
Update - I wrote some code to show me the items from that libraryID, and it is indeed the right one. It is '200', which I thought was a suspiciously 'round' number, but I guess it IS the right number! :)
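For what it's worth, the check was something along these lines (a minimal sketch, not the exact snippet; the db path is a placeholder, and the 'groups' table lookup is my reading of the schema):

```python
import sqlite3

conn = sqlite3.connect("zotero.sqlite")  # placeholder path to the Zotero db

# If I read the schema right, the 'groups' table maps group IDs to local library IDs.
for group_id, library_id, name in conn.execute(
    "SELECT groupID, libraryID, name FROM groups"
):
    print(group_id, library_id, name)

# Eyeball a few items from the candidate library to confirm it's the right one.
for item_id, key in conn.execute(
    "SELECT itemID, key FROM items WHERE libraryID = ? LIMIT 10", (200,)
):
    print(item_id, key)

conn.close()
```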
Update - I've got the jsonl-export bit ready. My first real GitHub coding foray - so be kind! Now trying to fix that GROBID/TEI bit...
Nobody?
Here is my current understanding (but please correct me if I'm wrong!) of how keys, library IDs, and collection IDs relate within Zotero's sqlite database structure, and how they correlate to the physical storage paths of items and their associated files.
What I THINK I was able to figure out

- 'key': The unique identifier for each item within Zotero. I THOUGHT this key was the one used to name folders within the Zotero storage directory, but that appears NOT to be the case.
- 'libraryID': Identifies a specific library within Zotero. A single Zotero account can have multiple libraries (e.g., personal library, group libraries).
- 'collectionID': Identifies collections within a library. Collections are a way to organize items and can be nested.

What I STILL haven't been able to figure out

- How these identifiers map to the physical storage folder names of items and their attached files (since the item 'key' apparently is not the folder name).
I fully realize this is an uncommon request, but I would still be very grateful if somebody could provide some more clarity on this...
Thanks!
Thanks Sebastian! That's useful info. So I'll try to focus on the itemAttachments table and get the path from there (see the Zotero ERD here for the schema)...
Fingers crossed!
Still no cigar :( I tried looking for the path of the item with this code:
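(A sketch of that lookup rather than the verbatim snippet; the db path is a placeholder.)

```python
import sqlite3

conn = sqlite3.connect("zotero.sqlite")  # placeholder path
for item_id, path in conn.execute(
    "SELECT itemID, path FROM itemAttachments WHERE path IS NOT NULL LIMIT 10"
):
    print(item_id, path)  # e.g. "storage:SomeDissertation.pdf"
conn.close()
```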
And that gives me this. These 'path' names are also the file names I see when I click on that item in Zotero itself (see this screenshot). So all of that makes sense. But the actual file is in a storage folder whose name differs from the key of that item...
A reverse search on my entire Zotero sqlite db for a known folder name in my Zotero storage folder (from the collection I'm working with) tells me that the 'items' table has a 'key' column with that very content. I'm now trying to find a way to write code that can use this to add the .zotero-ft-cache to a 'full_text' column in my df_zotero... So something like: look in my db for all itemIDs in this collectionID, then use those to find the right 'key', and then find that sub-folder in my Zotero storage folder and add the text to my df... Hope springs eternal!
Again just updating [on the 'off' chance that somebody might be interested in exporting items from Zotero collections to get them into a pandas dataframe that can then be used in one of those new (and quite amazing!) LLM/RAG setups - and again: I'm 100% convinced that there are FAR smarter ways to do this that would hook up the RAG directly to the sqlite db, but those would take me a LOT more time to figure out. So I'm trying to stick to something I SHOULD be able to do. But if somebody were able to do this 'properly' - I'd be thrilled!!!]
So since I never managed to find the right 'key' for the attachments folder from the sqlite db, I decided to use the 'path' column that my previous ipynb had already added to the 'df_zotero' dataframe I exported (in flattened form) from my sqlite db. This code iterates through my Zotero storage folder in search of folders containing pdf files whose file names match the (stripped) 'path' column. It then takes those folder names and first adds them as a 'folder_name' column to my df_zotero, and then also adds the content of the '.zotero-ft-cache' file as a 'zotero-ft-cache' column in a new 'df_zotero_with_cache' dataframe.
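In rough outline, that matching step looks something like this (a sketch, assuming the 'df_zotero' dataframe from the earlier export and the default Zotero storage location):

```python
from pathlib import Path

STORAGE = Path.home() / "Zotero" / "storage"  # assumption: default data directory

# Strip the "storage:" prefix so the column holds bare pdf file names.
df_zotero["pdf_name"] = df_zotero["path"].str.replace("storage:", "", regex=False)

# Walk the storage folder once, mapping each pdf file name to its folder name.
pdf_to_folder = {}
for folder in STORAGE.iterdir():
    if folder.is_dir():
        for f in folder.glob("*.pdf"):
            pdf_to_folder[f.name] = folder.name

df_zotero["folder_name"] = df_zotero["pdf_name"].map(pdf_to_folder)

def read_ft_cache(folder_name):
    """Return the content of .zotero-ft-cache for a storage folder, if present."""
    if not isinstance(folder_name, str):
        return None
    cache = STORAGE / folder_name / ".zotero-ft-cache"
    return cache.read_text(errors="ignore") if cache.exists() else None

df_zotero_with_cache = df_zotero.copy()
df_zotero_with_cache["zotero-ft-cache"] = df_zotero["folder_name"].map(read_ft_cache)
```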
Since I now also have a reliable 'folder_name' defined for each item in my collection, I should be able to feed the full pdfs to a GROBID server (the docker image works great!) to (hopefully) a) get a 'cleaner' full text than what Zotero generates automatically in the .zotero-ft-cache file; and b) get things like references, bibliographies, etc. parsed out (which can then be used in bibliometric analyses etc.). And once I have the clean(er?) text, I can then translate all non-English dissertations into English with Facebook's mBart-large-50-many-to-many-mmt into a new column. And THAT's what I(/we) can then build a RAG on.
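For later reference, the GROBID step boils down to this call pattern against the local server (a sketch; the port is the docker default, the image name may vary, and the storage path is a placeholder):

```python
from pathlib import Path
import requests

# Docker default: e.g. docker run -p 8070:8070 lfoppiano/grobid
GROBID_URL = "http://localhost:8070/api/processFulltextDocument"

def grobid_tei(pdf_path: Path) -> str:
    """Send one pdf to GROBID and return the TEI XML it produces."""
    with open(pdf_path, "rb") as f:
        resp = requests.post(GROBID_URL, files={"input": f}, timeout=300)
    resp.raise_for_status()
    return resp.text

results = {}
storage = Path.home() / "Zotero" / "storage"  # placeholder
for folder_name in df_zotero_with_cache["folder_name"].dropna().unique():
    for pdf in (storage / folder_name).glob("*.pdf"):
        try:
            results[folder_name] = grobid_tei(pdf)
        except requests.HTTPError:
            pass  # GROBID occasionally chokes on scanned/odd pdfs; skip those
```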
So I'm not quite there yet, but I AM making progress! But I'm still hoping for some more proficient 'reinforcements'!!!
The resulting 'df_zotero_with_cache' dataframe looks like this:

```text
RangeIndex: 1241 entries, 0 to 1240
Data columns (total 24 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 itemID 1241 non-null int64
1 author_last_name 1241 non-null object
2 author_first_name 1241 non-null object
3 abstractNote 1239 non-null object
4 accessDate 1241 non-null object
5 date 1241 non-null object
6 extra 1241 non-null object
7 language 1241 non-null object
8 libraryCatalog 1241 non-null object
9 numPages 1241 non-null int64
10 place 1241 non-null object
11 rights 1241 non-null object
12 shortTitle 889 non-null object
13 thesisType 1241 non-null object
14 title 1241 non-null object
15 university 1241 non-null object
16 url 1241 non-null object
17 path 1187 non-null object
18 degreeType 1241 non-null object
19 thesisTypeFull 1241 non-null object
20 key 1241 non-null object
21 folder_name 1187 non-null object
22 pdf_name 1187 non-null object
23 zotero-ft-cache 1187 non-null object
dtypes: int64(2), object(22)
```
I then made a second notebook that first of all feeds the pdfs from that collection, from their storage path, through GROBID to 'clean them up'. GROBID is a free, open source machine learning library for extracting, parsing, and restructuring raw documents such as academic publications into structured and semantically rich XML/TEI encoded documents. The library does a(n IMO) reasonable job with Header Extraction (extracting metadata from the headers of academic papers, like titles, authors, abstracts, keywords, etc.); Reference Parsing (parsing out and structuring the bibliographic references); but especially also Full-text Structuring (GROBID not only works on headers and references but can also structure the full text of articles, breaking it down into sections like introduction, methodology, results, etc.). The result of all of this (in my Jupyter Notebook) is a 'results' dictionary containing all of that information, each entry essentially being an XML file. Since the XML/TEI schema is VERY complex, this is what threw me for a loop until I found the very nice 'tei2json' script from s2orc (the Semantic Scholar corpus project by the guys and gals at AI2). The second part of this notebook therefore uses that script to create a simple dataframe that looks like this:
```text
RangeIndex: 1106 entries, 0 to 1105
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 paper_id 1106 non-null object
1 pdf_hash 1106 non-null object
2 metadata 1106 non-null object
3 abstract 1106 non-null object
4 body_text 1106 non-null object
5 back_matter 1106 non-null object
6 bib_entries 1106 non-null object
7 ref_entries 1106 non-null object
8 raw_abstract_text 1106 non-null object
9 raw_body_text 1106 non-null object
dtypes: object(10)
```
This gives us a nice 'raw_body_text' column to work with. The rest of that notebook then uses fasttext to identify the language used in the 'raw_body_text' column and stores it in a 'lang' column, which we then use with Facebook's 'facebook/mbart-large-50-many-to-many-mmt' model to translate all of the dissertations from their original language into English, in a new 'raw_body_text_en' column that looks like this:
```text
Language Truncated Original Body Text \
0 en Energy has recently become a very important it...
1 es En los últimos tres años, se ha observado una ...
2 fr Je remercie tout d'abord mes directeurs de rec...
3 tr Muammer Kaddafi'nin elde ettiği petrol gelirle...
4 ru 1 Зaпaдный бepeг pеки Иоpдaн -peгион нa Ближнe...
Truncated Translated Body Text
0 Energy has recently become a very important it...
1 In the last three years, there has been a tran...
2 I thank my research directors Michel Gagnon an...
3 Let's review Libya and what it does for the pe...
4 One of the main thrusts of this project is to ...
```
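For reference, the language-ID + translation step boils down to roughly this (a sketch; the language map only covers the codes in my sample, and long texts get truncated at the model's input limit, so in practice you'd translate chunk by chunk):

```python
import fasttext
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

# fasttext's language-ID model; download lid.176.bin separately.
lid = fasttext.load_model("lid.176.bin")

model_name = "facebook/mbart-large-50-many-to-many-mmt"
tokenizer = MBart50TokenizerFast.from_pretrained(model_name)
model = MBartForConditionalGeneration.from_pretrained(model_name)

# Map fasttext ISO codes to mBART-50 codes (only the languages in my sample).
MBART_CODES = {"en": "en_XX", "es": "es_XX", "fr": "fr_XX", "tr": "tr_TR", "ru": "ru_RU"}

def translate_to_english(text: str) -> str:
    # fasttext predicts one line at a time, so strip newlines first.
    labels, _ = lid.predict(text.replace("\n", " ")[:1000])
    lang = labels[0].replace("__label__", "")
    if lang == "en" or lang not in MBART_CODES:
        return text  # already English (or unmapped): keep as-is
    tokenizer.src_lang = MBART_CODES[lang]
    encoded = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
    generated = model.generate(
        **encoded, forced_bos_token_id=tokenizer.lang_code_to_id["en_XX"]
    )
    return tokenizer.batch_decode(generated, skip_special_tokens=True)[0]
```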
I then moved on to spaCy's 'en_core_web_lg' language model in order to sentencize all dissertations, creating a new dataframe that keeps the index number of the source dissertation from the original df and holds one sentence per row (the first sentence of a dissertation in one row, its second sentence in the next, etc.).
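The sentencizing step is essentially this (a sketch, assuming the translated dataframe from the previous step, here called df_translated):

```python
import pandas as pd
import spacy

nlp = spacy.load("en_core_web_lg")
nlp.max_length = 2_000_000  # dissertations can exceed spaCy's default limit

rows = []
for diss_idx, text in df_translated["raw_body_text_en"].dropna().items():
    # nlp.pipe() would be faster over many documents; this keeps it simple.
    for sent in nlp(text).sents:
        rows.append({"diss_idx": diss_idx, "sentence": sent.text.strip()})

df_sentences = pd.DataFrame(rows)
```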
I then generated embeddings for these sentences with 'BAAI/bge-large-en-v1.5' and started loading all of this into a weaviate vector database. But I then figured we'd miss too much of the context and would be better off with paragraphs. So I played around with paragraphs as well (splitting the texts on '\n'). That gave me the following stats:
Average number of paragraphs per dissertation: 444.19
Minimum number of paragraphs in a dissertation: 0
Maximum number of paragraphs in a dissertation: 2333
Dissertations with fewer than 50 paragraphs: 42
So that's not half bad.
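The paragraph-splitting + embedding step is roughly this (a sketch with a toy string; bge-large produces 1024-dimensional vectors):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-large-en-v1.5")

sample = "First paragraph.\n\nSecond paragraph.\nThird paragraph."
paragraphs = [p.strip() for p in sample.split("\n") if p.strip()]

# Normalized embeddings so cosine similarity is just a dot product.
embeddings = model.encode(paragraphs, normalize_embeddings=True)
print(embeddings.shape)  # (3, 1024)
```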
But then I started reading up on semantic chunking (like here) and these overlapping (and semantically enhanced) chunking methods, which sound ideal to me. So next time I find some time, that's where I plan to take this.
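As a placeholder until then, even a dumb fixed-size overlapping chunker (nothing 'semantic' about it; both numbers are arbitrary defaults) would look like this:

```python
def chunk_words(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Fixed-size word chunks with overlap between consecutive chunks."""
    words = text.split()
    step = size - overlap
    return [
        " ".join(words[i:i + size])
        for i in range(0, max(len(words) - overlap, 1), step)
    ]
```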
As always - I put my notebooks on my GitHub. And so if somebody with more experience with RAGs is willing to pick it up from there (and/or to clean up my code) - please be my guest!
I encountered the same issue you found with storage keys and found a fix. What confused me was that the itemIDs referred to in the collectionItems table don't identify the correct storage location key; it is the parentItemID in the itemAttachments table which creates the relationship. In this case, the following SQL statement pulls a list of item titles in a collection (ID of 215) along with each item's associated storage location (key):
```sql
SELECT
    c.collectionID,
    i.itemID AS 'i.itemID',
    i.key,
    idv.value AS 'title'
FROM collections AS c
JOIN collectionItems AS ci ON c.collectionID = ci.collectionID
JOIN itemAttachments AS ia ON ia.parentItemID = ci.itemID
JOIN items AS i ON i.itemID = ia.itemID
JOIN itemData AS id ON id.itemID = i.itemID
JOIN itemDataValues AS idv ON idv.valueID = id.valueID
JOIN fieldsCombined AS fc ON id.fieldID = fc.fieldID
WHERE c.collectionID = 215
  AND fc.fieldName = 'title'
```
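Wiring that query up from Python to pull the full-text caches might look like this (a sketch; the data directory and collection ID are placeholders):

```python
import sqlite3
from pathlib import Path
import pandas as pd

ZOTERO_DIR = Path.home() / "Zotero"  # placeholder: default data directory

SQL = """
SELECT c.collectionID, i.itemID, i.key, idv.value AS title
FROM collections AS c
JOIN collectionItems AS ci ON c.collectionID = ci.collectionID
JOIN itemAttachments AS ia ON ia.parentItemID = ci.itemID
JOIN items AS i ON i.itemID = ia.itemID
JOIN itemData AS id ON id.itemID = i.itemID
JOIN itemDataValues AS idv ON idv.valueID = id.valueID
JOIN fieldsCombined AS fc ON id.fieldID = fc.fieldID
WHERE c.collectionID = 215 AND fc.fieldName = 'title'
"""

with sqlite3.connect(ZOTERO_DIR / "zotero.sqlite") as conn:
    df = pd.read_sql_query(SQL, conn)

# 'key' is now the attachment's storage sub-folder, so the cache is reachable.
def read_cache(key: str):
    cache = ZOTERO_DIR / "storage" / key / ".zotero-ft-cache"
    return cache.read_text(errors="ignore") if cache.exists() else None

df["full_text"] = df["key"].map(read_cache)
```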
It sounds amazing...
And here: https://github.com/whitead/paper-qa
Also, I'm looking at abstracting out the Zotero document loading + metadata part and adding it as a standalone module to LlamaHub for integration into different RAG pipelines.
I'd be interested in having testers and people who can write up issues. The loader will primarily use the Zotero API to build a local db for the purposes of getting the files and adding metadata, so it can run on a server/cloud.
Experience with Python is a plus.
If this works well, eventually we could do a TypeScript version so it could be used as a Zotero add-on.
I will add another message there and here when we get a repository set up and there is something to test.
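In the meantime, the kind of calls such a loader might build on, via the pyzotero wrapper around the Zotero Web API, would be something like this (library ID, library type, and API key are placeholders):

```python
from pyzotero import zotero

# Placeholders: your numeric library ID, 'group' or 'user', and an API key
# generated at zotero.org/settings/keys.
zot = zotero.Zotero("200", "group", "YOUR_API_KEY")

# Fetch all top-level items with their metadata, following pagination.
items = zot.everything(zot.top())

for item in items:
    data = item["data"]
    print(data.get("title"), "-", data.get("itemType"))

# Save an attachment's file to disk (item key and paths are placeholders):
# zot.dump("ABCD1234", "paper.pdf", "./pdfs")
```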