Zotero and RAG?

edited December 28, 2023

Has anybody looked into ways of using Zotero libraries in 'retrieval-augmented generation' pipelines? Think of it as running a chatbot on your own library/ies, WITH the ability to reference your Zotero items... I quite like this setup, which I've used on texts I downloaded into a folder, but I'm not sure how to hook this up to the sqlite db. So if anybody has any ideas/experience - I'd love to hear them!

  • I have worked with SQL-bots, GPT-4 and LangChain. Super easy to get started. I do not know how the Zotero backend works - why would you need SQL?

    Longing for a RAG setup to find references fast.
  • SQLite -- that's how Zotero's database is stored, and it's the fastest way to get local read-only access to it.
  • edited January 12, 2024
    "why would you need SQL?" - to get the metadata, which are stored in a sqlite db.

    I'm now looking into a setup that exports a selected group library to a json(l) file with the key fields I need (including the cached .ft text and the path to the pdf) and then a) uses the .ft text in the RAG; AND b) (maybe) feeds the pdfs through GROBID for 'cleaner' text (it does a pretty good job of stripping out just the text, as well as all sorts of other elements). The nice thing is that this will then allow us to ...
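
    For concreteness, the export step I have in mind would be something like this (just a sketch - the db path and the title-only query are placeholders, but the itemData/itemDataValues/fieldsCombined joins are how Zotero stores field values):

    import json
    import sqlite3

    # Work on a copy of the db (or close Zotero first) to avoid lock issues
    conn = sqlite3.connect('/path/to/zotero.sqlite')

    # Minimal example query: one row per item with its title
    query = """
    SELECT i.itemID, i.key, idv.value AS title
    FROM items AS i
    JOIN itemData AS id ON id.itemID = i.itemID
    JOIN itemDataValues AS idv ON idv.valueID = id.valueID
    JOIN fieldsCombined AS fc ON fc.fieldID = id.fieldID
    WHERE fc.fieldName = 'title'
    """

    # Write one JSON object per line (jsonl)
    with open('zotero_export.jsonl', 'w', encoding='utf-8') as out:
        for item_id, key, title in conn.execute(query):
            out.write(json.dumps({'itemID': item_id, 'key': key, 'title': title}) + '\n')
    conn.close()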

    My main stumbling block right now is that I am not sure where Zotero stores the groupID. I looked for a particular itemID (which I got from the 'key' column in the 'items' table) in all tables, but I can't find the groupID there. I do see things like libraryID, but I'm not sure that's the right field. And when I look for the groupID of that item directly, I also can't find it. The point is of course to do a query on that groupID so as to work with JUST those items (I have a quite gigantic Zotero collection).

    So I'd appreciate it if somebody could confirm that that libraryID is indeed the group identifier. And then I'll take it from there.

    UPD - I wrote some code to show me the items from that libraryID, and it is indeed the right one. It is '200', which I thought was a suspiciously 'round' number, but I guess it IS the right number! :)
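
    (In case it helps anyone else: I believe the 'groups' table is what maps a groupID to its libraryID, which would explain why the groupID itself never shows up on items. A minimal version of such a sanity check:)

    import sqlite3

    conn = sqlite3.connect('/mnt/e/Zotero/zotero.sqlite')
    # Count the items in the group library with libraryID 200
    count, = conn.execute("SELECT COUNT(*) FROM items WHERE libraryID = ?", (200,)).fetchone()
    print(f"Items in libraryID 200: {count}")
    conn.close()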
  • Update - I've got the jsonl-export bit ready. My first real GitHub coding foray - so be kind! Now trying to fix that GROBID/TEI bit...

  • Nobody?

    Here is my current understanding (but please correct me if I'm wrong!) of how keys, library IDs, and collection IDs relate within Zotero's sqlite database structure, and how they map to the physical storage paths of items and their associated files.

    What I THINK I was able to figure out


    • Item Key (key): Unique identifier for each item within Zotero. I THOUGHT this key was the one used to name folders within the Zotero storage directory, but that appears NOT to be the case.

    • Library ID (libraryID): Identifies a specific library within Zotero. A single Zotero account can have multiple libraries (e.g., personal library, group libraries).

    • Collection ID (collectionID): Identifies collections within a library. Collections are a way to organize items and can be nested.

    What I STILL haven't been able to figure out


    • in which table in the db I can find the key that

      • corresponds to the folder name in my Zotero storage

      • will allow me to add the automatically extracted content of the '.zotero-ft-cache' file in that folder to the corresponding row in the dataframe that I WAS able to extract from the sqlite db


    • how (once I can 'hook up' the rows in my df to the correct Zotero storage folder) I can then 'feed' all the pdfs into GROBID and get (hopefully) 'cleaner' text - again in another column in my df...

    • how I might be able to bypass pandas dataframes altogether and just 'hook up' a RAG the 'right way' to the sqlite db (see the sketch after this list)

    I fully realize this is an uncommon request, but I would still be very grateful if somebody could provide some more clarity on this...
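
    On that last bullet, the shape I have in mind is roughly the following - purely a sketch, with a hypothetical 'my_export' table standing in for whatever the export ends up producing:

    import sqlite3

    def iter_chunks(db_path, chunk_size=1000):
        """Stream (itemID, text chunk) pairs straight from the db - no pandas involved."""
        conn = sqlite3.connect(db_path)
        # 'my_export' and its 'fulltext' column are hypothetical placeholders
        for item_id, text in conn.execute("SELECT itemID, fulltext FROM my_export"):
            for start in range(0, len(text), chunk_size):
                yield item_id, text[start:start + chunk_size]
        conn.close()

    # Each (itemID, chunk) pair could then be embedded and indexed by the RAG stack directly.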

    Thanks!

     

  • I haven't looked at the sqlite, but in terms of the data structure, the item has attachments and their item key is the folder name (since an item can have any number of attachments, each with their own folder, it couldn't be the top-level item's key).
  • Thanks Sebastian! That's useful info. So I'll try to focus on the itemAttachments table and get the path from there (see the Zotero ERD here for the schema)...

    Fingers crossed!
  • edited January 24, 2024

    Still no cigar :( I tried looking for the path of the item with this code:

    import sqlite3
    import pandas as pd

    # Path to Zotero SQLite database in WSL format
    zotero_db_path = '/mnt/e/Zotero/zotero.sqlite'

    # Assuming collection_id has been defined previously in the code
    # collection_id = <your_collection_id_here>

    print("Connecting to the Zotero SQLite database...")
    conn = sqlite3.connect(zotero_db_path)

    # Fetching itemIDs in the specified collectionID
    item_ids_query = """
    SELECT itemID
    FROM collectionItems
    WHERE collectionID = ?
    """
    item_ids = pd.read_sql_query(item_ids_query, conn, params=(collection_id,))

    # Check if any items were found
    if not item_ids.empty:
        print(f"Found {len(item_ids)} items in collectionID {collection_id}. Fetching attachment paths...")

        # Converting itemIDs to a comma-separated list for the IN clause
        item_ids_str = ', '.join(item_ids['itemID'].astype(str).tolist())

        # Fetching 'path' values from the itemAttachments table
        paths_query = f"SELECT parentItemID, path FROM itemAttachments WHERE parentItemID IN ({item_ids_str})"
        paths = pd.read_sql_query(paths_query, conn)
        print("Paths of Attachments:")
        print(paths)
    else:
        print(f"No items found in collectionID {collection_id}.")

    # Close the connection
    conn.close()

    And that gives me this. These 'path' names are also the file names I see when I click on that item in Zotero itself (see this screenshot). So all of that makes sense. But the actual file sits in a storage folder whose name differs from that item's key...

  • Just an update -
    A reverse search on my entire Zotero sqlite db for a known folder name in my Zotero
    storage folder (from the collection I'm working with) tells me that the 'items' table has a 'key' column with that very content. I'm now trying to find a way to write code that can use this to add the .zotero-ft-cache to a 'full_text' column in my df_zotero... So something like: look in my db for all itemIDs in this collectionID, then use that to find the right 'key', and then find that sub-folder in my Zotero storage folder and add the text to my df... Hope springs eternal!
  • edited January 26, 2024

    Again just updating [on the 'off' chance that somebody might be interested in exporting items from Zotero collections to get them into a pandas dataframe that can then be used in one of those new (and quite amazing!) LLM/RAG setups - and again: I'm 100% convinced that there are FAR smarter ways to do this that would hook up the RAG directly to the sqlite db, but that would take me a LOT more time to figure out. So I'm trying to stick to something I SHOULD be able to do. But if somebody were able to do this 'properly' - I'd be thrilled!!!]

    So since I never managed to find the right 'key' for the attachments folder from the sqlite db, I decided to use the 'path' column that my previous ipynb had already added to the 'df_zotero' dataframe I exported (in flattened form) from my sqlite db. This code iterates through my Zotero storage folder looking for folders that contain a pdf whose file name matches the (stripped) 'path' column. It then first adds those folder names as a 'folder_name' column to my df_zotero, and then also adds the content of the '.zotero-ft-cache' file to a 'zotero-ft-cache' column in a new 'df_zotero_with_cache' dataframe.
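
    The core of that folder-scan, simplified (the storage path and column names are the ones from my setup, so adjust as needed):

    import os
    import pandas as pd

    storage_root = '/mnt/e/Zotero/storage'

    def find_folder_and_cache(pdf_name):
        """Return (folder_name, cached full text) for the storage folder containing pdf_name."""
        for folder in os.listdir(storage_root):
            folder_path = os.path.join(storage_root, folder)
            if os.path.isdir(folder_path) and pdf_name in os.listdir(folder_path):
                cache_path = os.path.join(folder_path, '.zotero-ft-cache')
                text = None
                if os.path.exists(cache_path):
                    with open(cache_path, encoding='utf-8', errors='ignore') as f:
                        text = f.read()
                return folder, text
        return None, None

    # 'pdf_name' holds the file name stripped from the 'path' column
    df_zotero_with_cache = df_zotero.copy()
    df_zotero_with_cache[['folder_name', 'zotero-ft-cache']] = df_zotero['pdf_name'].apply(
        lambda n: pd.Series(find_folder_and_cache(n)))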

    Since I now also have a reliable 'folder_name' for each item in my collection, I should be able to feed the full pdfs to a GROBID server (the docker image works great!) to (hopefully) get a) 'cleaner' full text than what Zotero generates automatically in the .zotero-ft-cache file; and b) parsed-out elements like references, bibliographies, etc. (which can then be used in bibliometric analyses etc.). And once I have the 'clean(er?)' text, I can then translate all non-English dissertations into English with Facebook's mBart-large-50-many-to-many-mmt into a new column. And THAT's what I(/we) can then build a RAG on.
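
    For the GROBID step, posting a pdf to the dockerized server is basically this (sketch; port 8070 is the docker default, and processFulltextDocument is the GROBID service that returns the full TEI):

    import requests

    def grobid_fulltext(pdf_path, server='http://localhost:8070'):
        """Send one pdf to GROBID and return the TEI XML as a string."""
        with open(pdf_path, 'rb') as f:
            resp = requests.post(f'{server}/api/processFulltextDocument', files={'input': f})
        resp.raise_for_status()
        return resp.text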

    So I'm not quite there yet, but I AM making progress! And I'm still hoping for some more proficient 'reinforcements'!!!

     

  • Haven't had much time this weekend, but as I said last time, I had already managed to get working code to get all items of the selected Zotero collection into a dataframe that looks like this:

    RangeIndex: 1241 entries, 0 to 1240
    Data columns (total 24 columns):
     #   Column             Non-Null Count  Dtype
    ---  ------             --------------  -----
     0   itemID             1241 non-null   int64
     1   author_last_name   1241 non-null   object
     2   author_first_name  1241 non-null   object
     3   abstractNote       1239 non-null   object
     4   accessDate         1241 non-null   object
     5   date               1241 non-null   object
     6   extra              1241 non-null   object
     7   language           1241 non-null   object
     8   libraryCatalog     1241 non-null   object
     9   numPages           1241 non-null   int64
     10  place              1241 non-null   object
     11  rights             1241 non-null   object
     12  shortTitle         889 non-null    object
     13  thesisType         1241 non-null   object
     14  title              1241 non-null   object
     15  university         1241 non-null   object
     16  url                1241 non-null   object
     17  path               1187 non-null   object
     18  degreeType         1241 non-null   object
     19  thesisTypeFull     1241 non-null   object
     20  key                1241 non-null   object
     21  folder_name        1187 non-null   object
     22  pdf_name           1187 non-null   object
     23  zotero-ft-cache    1187 non-null   object
    dtypes: int64(2), object(22)

    I then made a second notebook that first of all feeds the pdfs from that collection from their storage path through GROBID to 'clean them up'. GROBID is a free, open source machine learning library for extracting, parsing, and restructuring raw documents such as academic publications into structured and semantically rich XML/TEI encoded documents. The library does a(n IMO) reasonable job with Header Extraction (extracting metadata from the headers of academic papers, like titles, authors, abstracts, keywords, etc.); Reference Parsing (parsing out and structuring the bibliographic references); but especially also Full-text Structuring (GROBID not only works on headers and references but can also structure the full text of articles, breaking it down into sections like introduction, methodology, results, etc.).

    The result of all of this (in my Jupyter Notebook) is a 'results' dictionary containing all of that information - each entry essentially being an XML file. Since the XML/TEI schema is VERY complex, this is what threw me for a loop until I found the very nice 'tei2json' script from s2orc (by AI2, the guys and gals behind Semantic Scholar). The second part of this notebook therefore uses that script to create a simple dataframe that looks like this:

    RangeIndex: 1106 entries, 0 to 1105
    Data columns (total 10 columns):
     #  Column             Non-Null Count  Dtype
    --  ------             --------------  -----
     0  paper_id           1106 non-null   object
     1  pdf_hash           1106 non-null   object
     2  metadata           1106 non-null   object
     3  abstract           1106 non-null   object
     4  body_text          1106 non-null   object
     5  back_matter        1106 non-null   object
     6  bib_entries        1106 non-null   object
     7  ref_entries        1106 non-null   object
     8  raw_abstract_text  1106 non-null   object
     9  raw_body_text      1106 non-null   object
    dtypes: object(10)
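
    (For anyone curious what the tei2json step boils down to: the heart of it is walking GROBID's TEI output. This is NOT the s2orc script itself, just a minimal illustration of pulling the raw body text out of the TEI - the namespace is the standard TEI one:)

    from lxml import etree

    TEI_NS = {'tei': 'http://www.tei-c.org/ns/1.0'}

    def raw_body_text(tei_xml):
        """Concatenate all paragraph text from the <body> of a GROBID TEI document."""
        root = etree.fromstring(tei_xml.encode('utf-8'))
        paragraphs = root.findall('.//tei:body//tei:p', TEI_NS)
        return '\n'.join(''.join(p.itertext()) for p in paragraphs)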

    This gives us a nice 'raw_body_text' column to work with. So the rest of that notebook then uses fasttext to identify the language used in the 'raw_body_text' column and store that in a 'lang' column, which we then use to have Facebook's 'facebook/mbart-large-50-many-to-many-mmt' language model translate all of the dissertations from their original language into English in a new 'raw_body_text_en' column that looks like this:

      Language                      Truncated Original Body Text
    0       en  Energy has recently become a very important it...
    1       es  En los últimos tres años, se ha observado una ...
    2       fr  Je remercie tout d'abord mes directeurs de rec...
    3       tr  Muammer Kaddafi'nin elde ettiği petrol gelirle...
    4       ru  1 Зaпaдный бepeг pеки Иоpдaн -peгион нa Ближнe...

                        Truncated Translated Body Text
    0  Energy has recently become a very important it...
    1  In the last three years, there has been a tran...
    2  I thank my research directors Michel Gagnon an...
    3  Let's review Libya and what it does for the pe...
    4  One of the main thrusts of this project is to ...
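
    Those two steps in a nutshell (sketch; the fasttext-to-mBart language-code map below is partial and my own assumption, and since mBart takes at most ~1024 tokens per call, long texts need chunking or truncating):

    import fasttext
    from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

    lid = fasttext.load_model('lid.176.bin')  # fasttext language-ID model
    mbart = MBartForConditionalGeneration.from_pretrained('facebook/mbart-large-50-many-to-many-mmt')
    tok = MBart50TokenizerFast.from_pretrained('facebook/mbart-large-50-many-to-many-mmt')

    # Partial map from fasttext ISO codes to mBart-50 codes (extend as needed)
    LANG_MAP = {'es': 'es_XX', 'fr': 'fr_XX', 'tr': 'tr_TR', 'ru': 'ru_RU'}

    def detect_lang(text):
        """Return the fasttext language label (e.g. 'en') for a text sample."""
        labels, _ = lid.predict(text.replace('\n', ' ')[:2000])
        return labels[0].replace('__label__', '')

    def to_english(text, lang):
        """Translate one (truncated) text into English with mBart-50."""
        if lang == 'en':
            return text
        tok.src_lang = LANG_MAP[lang]
        enc = tok(text, return_tensors='pt', truncation=True, max_length=1024)
        out = mbart.generate(**enc, forced_bos_token_id=tok.lang_code_to_id['en_XX'])
        return tok.batch_decode(out, skip_special_tokens=True)[0]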

    I then moved on to spaCy's 'en_core_web_lg' language model in order to sentencize all dissertations, creating a new dataframe that pairs each sentence with the index of the dissertation it came from: the first sentence of the first diss in the first row, its second sentence in the second row, etc.
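
    The sentencizing itself (sketch; df/column names as in my notebooks - note that spaCy's default max_length needs raising for dissertation-length texts):

    import pandas as pd
    import spacy

    nlp = spacy.load('en_core_web_lg')
    nlp.max_length = 2_000_000  # dissertations easily exceed the 1,000,000-char default

    rows = []
    for diss_idx, text in df['raw_body_text_en'].items():
        for sent in nlp(text).sents:
            rows.append({'diss_idx': diss_idx, 'sentence': sent.text.strip()})
    df_sentences = pd.DataFrame(rows)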

    I then generated the embeddings of these sentences with 'BAAI/bge-large-en-v1.5' and started to get all of this into a Weaviate vector database. But I then figured we'd miss too much of the context and we'd be better off with paragraphs. So I played around with paragraphs as well (splitting the texts on '\n's). That gave me the following stats:

    Average number of paragraphs per dissertation: 444.1907775768535
    Minimum number of paragraphs in a dissertation: 0
    Maximum number of paragraphs in a dissertation: 2333
    Dissertations with fewer than 50 paragraphs: 42
    So that's not half bad.
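
    The paragraph-splitting and embedding step, in sketch form (model as named above; the bge models are meant to be used with normalized embeddings):

    from sentence_transformers import SentenceTransformer

    embedder = SentenceTransformer('BAAI/bge-large-en-v1.5')

    def embed_paragraphs(body_text):
        """Split a body text on newlines and embed each non-empty paragraph."""
        paragraphs = [p for p in body_text.split('\n') if p.strip()]
        return paragraphs, embedder.encode(paragraphs, normalize_embeddings=True)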

    But then I started reading up on semantic chunking (like here) and these overlapping (and semantically enhanced) chunking methods, which sound ideal to me. So next time I find some time, that's where I plan to take this.
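
    In the meantime, even a naive overlapping chunker gets at the basic idea (sketch; the sizes are arbitrary assumptions of mine, not taken from that post):

    def chunk_with_overlap(paragraphs, chunk_size=5, overlap=1):
        """Group consecutive paragraphs into chunks of chunk_size, overlapping by overlap."""
        step = chunk_size - overlap
        chunks = []
        for i in range(0, len(paragraphs), step):
            chunks.append('\n'.join(paragraphs[i:i + chunk_size]))
            if i + chunk_size >= len(paragraphs):
                break
        return chunks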

    As always - I put my notebooks on my GitHub. And so if somebody with more experience with RAGs is willing to pick it up from there (and/or to clean up my code) - please be my guest!
  • Hi sdspieg,
    I encountered the same issue you found with storage keys and found a fix. What confused me was that the itemIDs referred to in the collectionItems table don't identify the correct storage-location key - it is the parentItemID in the itemAttachments table which creates the relationship. In this case, the following SQL statement pulls a list of item titles in a collection (ID of 215) along with each one's storage location (key):


    SELECT
        c.collectionID,
        i.itemID AS 'i.itemID',
        i.key,
        idv.value AS 'title'
    FROM collections AS c
    JOIN collectionItems AS ci ON c.collectionID = ci.collectionID
    JOIN itemAttachments AS ia ON ia.parentItemID = ci.itemID
    JOIN items AS i ON i.itemID = ia.itemID
    JOIN itemData AS id ON id.itemID = i.itemID
    JOIN itemDataValues AS idv ON idv.valueID = id.valueID
    JOIN fieldsCombined AS fc ON id.fieldID = fc.fieldID
    WHERE c.collectionID = 215
      AND fc.fieldName = 'title'
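
    From Python, that query plus the key is then enough to pull in the cached full text - e.g. (sketch; the storage root is an assumed path, and SQL_ABOVE stands for the statement above as a string):

    import os
    import sqlite3
    import pandas as pd

    conn = sqlite3.connect('/mnt/e/Zotero/zotero.sqlite')
    df_keys = pd.read_sql_query(SQL_ABOVE, conn)  # SQL_ABOVE = the query above, with your collectionID
    conn.close()

    def read_cache(key, storage_root='/mnt/e/Zotero/storage'):
        """Read the .zotero-ft-cache file from the storage folder named after the key."""
        cache_path = os.path.join(storage_root, key, '.zotero-ft-cache')
        if os.path.exists(cache_path):
            with open(cache_path, encoding='utf-8', errors='ignore') as f:
                return f.read()
        return None

    df_keys['zotero-ft-cache'] = df_keys['key'].apply(read_cache)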
  • Something like Reor (https://github.com/reorproject/reor) does?
    It sounds amazing...
  • Have you looked here: https://forums.zotero.org/discussion/comment/429863
    And here: https://github.com/whitead/paper-qa

    Also, I'm looking at abstracting out the Zotero document loading + metadata part and adding it as a standalone module to LlamaHub for integration into different RAG pipelines.

    I'd be interested in having testers and people who can write up issues. But the loader will primarily use the Zotero API to build a local db for the purposes of getting the files and adding metadata, so it can run on a server/cloud.

    Experience with python is a plus.

    If this works well, eventually we could do a Typescript version so it could be used as a Zotero add-on.
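
    In the meantime, for anyone who wants to experiment with the API route: pyzotero already wraps the Zotero Web API, so a loader prototype can be as small as this (sketch; the group ID and API key are placeholders):

    from pyzotero import zotero

    zot = zotero.Zotero('GROUP_ID', 'group', 'YOUR_API_KEY')  # placeholders

    # Walk top-level items and download their pdf attachments
    for item in zot.top(limit=10):
        print(item['data'].get('title'))
        for child in zot.children(item['key']):
            if child['data'].get('contentType') == 'application/pdf':
                zot.dump(child['key'], path='downloads')  # saves the attachment file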
  • Great @sdspieg. See the issue here: https://github.com/langchain-ai/langchain/discussions/20288

    I will add another message there and here when we get a repository setup and there is something to test.