This is the second in a two-part series on using SQLite for Machine Learning. In my last article, I dove into how SQLite is rapidly becoming a production-ready database for web applications. In this article, I’ll discuss how to perform retrieval-augmented generation using SQLite.
If you’d like a custom web application with generative AI integration, visit losangelesaiapps.com
The code referenced in this article can be found here.
When I first learned how to perform retrieval-augmented generation (RAG) as a budding data scientist, I followed the conventional path. This usually looks something like:
- Google retrieval-augmented generation and look for tutorials
- Find the most popular framework, usually LangChain or LlamaIndex
- Find the most popular cloud vector database, usually Pinecone or Weaviate
- Read a bunch of docs, put all the pieces together, and success!
In fact, I wrote an article about my experience building a RAG system in LangChain with Pinecone.
There’s nothing terribly wrong with using a RAG framework with a cloud vector database. However, I’d argue that for first-time learners it overcomplicates the situation. Do we really need an entire framework to learn how to do RAG? Is it necessary to perform API calls to cloud vector databases? These databases act as black boxes, which is never good for learners (or frankly for anyone).
In this article, I’ll walk you through how to perform RAG on the simplest stack possible. In fact, this ‘stack’ is just SQLite with the sqlite-vec extension and the OpenAI API for use of their embedding and chat models. I recommend you read part 1 of this series to get a deep dive on SQLite and how it is rapidly becoming production ready for web applications. For our purposes here, it is enough to understand that SQLite is the simplest kind of database possible: a single file in your repository.
So ditch your cloud vector databases and your bloated frameworks, and let’s do some RAG.
SQLite-Vec
One of the powers of the SQLite database is the use of extensions. For those of us familiar with Python, extensions are a lot like libraries. They are modular pieces of code written in C to extend the functionality of SQLite, making things that were once impossible possible. One popular example of a SQLite extension is the Full-Text Search (FTS) extension. This extension allows SQLite to perform efficient searches across large volumes of textual data. Because the extension is written purely in C, we can run it anywhere a SQLite database can be run, including Raspberry Pis and browsers.
In this article I will be going over the extension known as sqlite-vec. This gives SQLite the power of performing vector search. Vector search is similar to full-text search in that it allows for efficient search across textual data. However, rather than searching for an exact word or phrase in the text, vector search matches on meaning. In other words, searching for “horses” will find matches of “equestrian”, “pony”, “Clydesdale”, etc. Full-text search is incapable of this.
sqlite-vec makes use of virtual tables, as do most extensions in SQLite. A virtual table is similar to a regular table, but with additional powers:
- Custom Data Sources: The data for a standard table in SQLite is housed in a single db file. For a virtual table, the data can be housed in external sources, for example a CSV file or an API call.
- Flexible Functionality: Virtual tables can add specialized indexing or querying capabilities and support complex data types like JSON or XML.
- Integration with SQLite Query Engine: Virtual tables integrate seamlessly with SQLite’s standard query syntax, e.g. SELECT, INSERT, UPDATE, and DELETE operations. Ultimately it is up to the writers of the extensions to support these operations.
- Use of Modules: The backend logic for how the virtual table will work is implemented by a module (written in C or another language).
The typical syntax for creating a virtual table looks like the following:
CREATE VIRTUAL TABLE my_table USING my_extension_module();
The important part of this statement is my_extension_module(). This specifies the module that will be powering the backend of the my_table virtual table. In sqlite-vec we will use the vec0 module.
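To make the virtual table idea concrete without installing anything, here is a minimal sketch that uses the built-in FTS5 module instead of vec0. It assumes your Python’s SQLite was compiled with FTS5 (most builds are), and the notes table and its rows are made up for illustration.

```python
import sqlite3

# In-memory database so the sketch leaves no file behind
db = sqlite3.connect(":memory:")

# fts5 is the module powering this virtual table
# (assumption: the underlying SQLite build has FTS5 enabled)
db.execute("CREATE VIRTUAL TABLE notes USING fts5(body)")

# Virtual tables accept ordinary INSERT statements
db.executemany(
    "INSERT INTO notes (body) VALUES (?)",
    [
        ("Horses need daily exercise",),
        ("A pony is smaller than most riding animals",),
        ("Sourdough bread needs a long proof",),
    ],
)

# ...and ordinary SELECT statements, plus the module's own MATCH operator
rows = db.execute(
    "SELECT body FROM notes WHERE notes MATCH ?", ("horses",)
).fetchall()
matches = [row[0] for row in rows]
print(matches)
```

Only the row containing the literal word is returned; the “pony” row is not. That is the keyword-matching limitation that vector search is meant to address.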
Code Walkthrough
The code for this article can be found here. It is a simple directory with the majority of files being .txt files that we will be using as our dummy data. Because I am a physics nerd, the majority of the files pertain to physics, with just a few files relating to other random fields. I will not present the full code in this walkthrough, but instead will highlight the important pieces. Clone my repo and play around with it to investigate the full code. Below is a tree view of the repo. Note that my_docs.db is the single-file database used by SQLite to manage all of our data.
.
├── data
│   ├── cooking.txt
│   ├── gardening.txt
│   ├── general_relativity.txt
│   ├── newton.txt
│   ├── personal_finance.txt
│   ├── quantum.txt
│   ├── thermodynamics.txt
│   └── travel.txt
├── my_docs.db
├── requirements.txt
└── sqlite_rag_tutorial.py
Step 1 is to install the necessary libraries. Below is our requirements.txt file. As you can see it has only three libraries. I recommend creating a virtual environment with the latest Python version (3.13.1 was used for this article) and then running pip install -r requirements.txt to install the libraries.
# requirements.txt
sqlite-vec==0.1.6
openai==1.63.0
python-dotenv==1.0.1
Step 2 is to create an OpenAI API key if you don’t already have one. We will be using OpenAI to generate embeddings for the text files so that we can perform our vector search. The script reads the key from the OPENAI_API_KEY environment variable, which python-dotenv can load from a .env file.
# sqlite_rag_tutorial.py
import sqlite3
from sqlite_vec import serialize_float32
import sqlite_vec
import os
from openai import OpenAI
from dotenv import load_dotenv

# Load environment variables (including OPENAI_API_KEY) from a .env file
load_dotenv()

# Set up OpenAI client
client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))
Step 3 is to load the sqlite-vec extension into SQLite. We will be using Python and SQL for our examples in this article. Disabling the ability to load extensions immediately after loading your extension is a good security practice.
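# Path to the database file
db_path = "my_docs.db"

# Delete the database file if it exists so the script starts from a clean slate
if os.path.exists(db_path):
    os.remove(db_path)

# Connect, load sqlite-vec, then disable extension loading again
db = sqlite3.connect(db_path)
db.enable_load_extension(True)
sqlite_vec.load(db)
db.enable_load_extension(False)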
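If you want to confirm the extension really loaded, one option is to wrap the steps above in a small helper. This is a sketch rather than part of the tutorial script: connect_with_vec is a hypothetical name, and it assumes sqlite-vec exposes the vec_version() SQL function once loaded.

```python
import sqlite3

def connect_with_vec(db_path=":memory:"):
    """Open a connection and try to load sqlite-vec.

    Returns (connection, version); version is None if the extension
    could not be loaded. Hypothetical helper for illustration only.
    """
    db = sqlite3.connect(db_path)
    version = None
    try:
        import sqlite_vec  # only present if `pip install sqlite-vec` was run
        db.enable_load_extension(True)
        try:
            sqlite_vec.load(db)
        finally:
            # Always switch extension loading back off, even if load() fails
            db.enable_load_extension(False)
        # Assumption: vec_version() is registered by the loaded extension
        (version,) = db.execute("SELECT vec_version()").fetchone()
    except (ImportError, AttributeError, sqlite3.Error) as exc:
        print(f"sqlite-vec not loaded: {exc}")
    return db, version

db, vec_version = connect_with_vec()
print("sqlite-vec version:", vec_version)
```

The try/finally means extension loading is disabled again no matter what happens during the load, which keeps the security practice described above intact on the failure path too.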
Next we will go ahead and create our virtual table:
db.execute('''
    CREATE VIRTUAL TABLE documents USING vec0(
        embedding float[1536],
        +file_name TEXT,
        +content TEXT
    )
''')
documents is a virtual table with three columns:
- embedding: A 1536-dimension float vector that will store the embeddings of our sample documents.
- file_name: Text that will house the name of each file we store in the database. Note that this column and the following one have a + symbol in front of them. This indicates that they are auxiliary columns. Previously in sqlite-vec only embedding data could be stored in the virtual table. However, recently an update was pushed that allows us to add fields to our table that we don’t really want embedded. In this case we are adding the content and name of the file in the same table as our embeddings. This will allow us to easily see which embeddings correspond to which content while sparing us the need for extra tables and JOIN statements.
- content: Text that will store the content of each file.
Now that we have our virtual table set up in our SQLite database, we can begin converting our text files into embeddings and storing them in our table:
# Function to get embeddings using the OpenAI API
def get_openai_embedding(text):
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding

# Iterate over .txt files in the /data directory
for file_name in os.listdir("data"):
    if not file_name.endswith(".txt"):
        continue  # skip anything that is not a .txt file
    file_path = os.path.join("data", file_name)
    with open(file_path, 'r', encoding='utf-8') as file:
        content = file.read()
    # Generate embedding for the content
    embedding = get_openai_embedding(content)
    if embedding:
        # Insert file content and embedding into the vec0 table
        db.execute(
            'INSERT INTO documents (embedding, file_name, content) VALUES (?, ?, ?)',
            (serialize_float32(embedding), file_name, content)
        )

# Commit changes
db.commit()
We essentially loop through each of our .txt files, embedding the content from each file, and then using an INSERT INTO statement to insert the embedding, file_name, and content into the documents virtual table. A commit statement at the end ensures the changes are persisted. Note that we are using serialize_float32 here from the sqlite-vec library. SQLite itself does not have a built-in vector type, so it stores vectors as binary large objects (BLOBs) to save space and allow fast operations. Internally, serialize_float32 uses Python’s struct.pack() function, which converts Python data into C-style binary representations.
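To see what that packing step amounts to, here is a minimal sketch built directly on struct.pack. It illustrates the idea and is not the library’s source; pack_float32 and unpack_float32 are hypothetical names.

```python
import struct

def pack_float32(vector):
    """Pack a list of Python floats into raw float32 bytes (native byte order)."""
    return struct.pack(f"{len(vector)}f", *vector)

def unpack_float32(blob):
    """Reverse of pack_float32: raw bytes back into a list of floats."""
    count = len(blob) // 4  # each float32 occupies 4 bytes
    return list(struct.unpack(f"{count}f", blob))

vector = [0.5, -1.25, 3.0]  # values chosen to be exactly representable in float32
blob = pack_float32(vector)

print(len(blob))             # 12
print(unpack_float32(blob))  # [0.5, -1.25, 3.0]
```

At four bytes per dimension, a 1536-dimension embedding becomes a 6144-byte BLOB, which is far more compact than storing the same numbers as text.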
Finally, to perform RAG, you then use the following code to do a K-Nearest-Neighbors (KNN-style) operation. This is the heart of vector search.
# Perform a sample KNN query
query_text = "What is general relativity?"
query_embedding = get_openai_embedding(query_text)
if query_embedding:
    rows = db.execute(
        """
        SELECT
            file_name,
            content,
            distance
        FROM documents
        WHERE embedding MATCH ?
        ORDER BY distance
        LIMIT 3
        """,
        [serialize_float32(query_embedding)]
    ).fetchall()

    print("Top 3 most similar documents:")
    top_contexts = []
    for row in rows:
        print(row)
        top_contexts.append(row[1])  # Append the 'content' column
We begin by taking in a query from the user, in this case “What is general relativity?”, and embedding that query using the same embedding model as before. We then perform a SQL operation. Let’s break this down:
- The SELECT statement means the retrieved data will have three columns: file_name, content, and distance. The first two we have already discussed. distance will be calculated during the SQL operation, more on this in a moment.
- The FROM statement ensures you are pulling data from the documents table.
- The WHERE embedding MATCH ? statement performs a similarity search between all of the vectors in your database and the query vector. The returned data will include a distance column. This distance is just a floating point number measuring how far apart the query and database vectors are. The lower the number, the closer the vectors are. sqlite-vec provides a few options for how to calculate this distance.
- The ORDER BY distance clause sorts the retrieved vectors by ascending distance, so the most similar documents come first.
- LIMIT 3 ensures we only get the top three documents that are nearest to our query embedding vector. You can tweak this number to see how retrieving more or fewer vectors affects your results.
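Conceptually, the query above behaves like a brute-force nearest-neighbor scan: compute a distance to every stored vector, sort ascending, keep the first k. The sketch below spells that out in plain Python using Euclidean (L2) distance. The two-dimensional vectors are made up for illustration, and l2_distance and knn are hypothetical helpers, not sqlite-vec functions.

```python
import math

def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn(query, rows, k=3):
    """Return the k (name, distance) pairs closest to the query, nearest first."""
    scored = [(name, l2_distance(query, vec)) for name, vec in rows]
    scored.sort(key=lambda pair: pair[1])  # ascending: smallest distance first
    return scored[:k]

# Toy 2-D "embeddings" (real ones have 1536 dimensions)
toy_rows = [
    ("cooking.txt", [0.0, 1.0]),
    ("general_relativity.txt", [1.0, 0.0]),
    ("newton.txt", [0.8, 0.6]),
    ("quantum.txt", [0.6, 0.8]),
]
query = [1.0, 0.0]

results = knn(query, toy_rows)
for name, distance in results:
    print(name, round(distance, 3))
```

The farthest document (cooking.txt here) falls off the end of the list, which is the same effect LIMIT 3 has in the SQL version.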
Given our query of “What is general relativity?”, the following documents were pulled. It did a pretty good job!
Top 3 most similar documents:
(‘general_relativity.txt’, ‘Einstein’s theory of general relativity redefined our understanding of gravity. Instead of viewing gravity as a force acting at a distance, it interprets it as the curvature of spacetime around massive objects. Light passing near a massive star bends slightly, galaxies deflect beams traveling millions of light-years, and clocks tick at different rates depending on their gravitational potential. This groundbreaking theory led to predictions like gravitational lensing and black holes, phenomena later confirmed by observational evidence, and it continues to guide our understanding of the cosmos.’, 0.8316285610198975)
(‘newton.txt’, ‘In classical mechanics, Newton’s laws of motion form the foundation of how we understand the movement of objects. Newton’s first law, often called the law of inertia, states that an object at rest remains at rest and an object in motion continues in motion unless acted upon by an external force. This concept extends into more complex physics problems, where analyzing net forces on objects allows us to predict their future trajectories and behaviors. Over time, applying Newton’s laws has enabled engineers and scientists to design safer vehicles, more efficient machines, and even guide spacecraft through intricate gravitational fields.’, 1.2036118507385254)
(‘quantum.txt’, ‘Quantum mechanics revolutionized our understanding of the microscopic world. Unlike classical particles, quantum entities such as electrons can exhibit both wave-like and particle-like behaviors. Phenomena like quantum superposition suggest that particles can exist in multiple states at once, and the act of measurement often “collapses” these states into one observed outcome. This strange and counterintuitive theory underpins modern technologies like semiconductors and lasers, and it provides a conceptual framework for emerging fields like quantum computing and cryptography.’, 1.251380205154419)
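If you are more used to thinking in cosine similarity, the distances above can be converted under two assumptions: that the table uses sqlite-vec’s default L2 (Euclidean) metric, and that the OpenAI embeddings are unit length. For unit vectors, d² = 2 − 2·cos, so cos = 1 − d²/2. The sketch below applies that to the three distances from the output; cosine_from_l2 is a hypothetical helper.

```python
def cosine_from_l2(distance):
    """Cosine similarity implied by an L2 distance between two unit-length vectors."""
    return 1.0 - (distance ** 2) / 2.0

# Distances copied from the query output above
distances = {
    "general_relativity.txt": 0.8316285610198975,
    "newton.txt": 1.2036118507385254,
    "quantum.txt": 1.251380205154419,
}

similarities = {name: cosine_from_l2(d) for name, d in distances.items()}
for name, similarity in similarities.items():
    print(name, round(similarity, 3))
```

Read this way, the general relativity file is clearly the closest match, while the other two physics files are only loosely related to the query.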
We can then stuff the context of the model with these three documents and have it attempt to answer our question.
# Prepare the context for the query
context = "\n\n".join(top_contexts)
system_message = "You are a helpful assistant. Use the following context to answer the query."

# Send query and context to OpenAI
try:
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system_message},
            {"role": "user", "content": f"Context: {context}\n\nQuery: {query_text}"}
        ]
    )
    print("Response:")
    print(completion.choices[0].message.content)
except Exception as e:
    print(f"Error generating response: {e}")
Here is what the model said:
General relativity is a theory developed by Albert Einstein that redefines our understanding of gravity. Instead of viewing gravity as a force acting at a distance, general relativity interprets it as the curvature of spacetime caused by the presence of mass. According to this theory, massive objects like stars and planets create a distortion in the fabric of spacetime, and this curvature affects the motion of other objects, including light, which can bend when it passes near a massive body.
This is faithfully sticking to the documents we gave the model. Great job 4o-mini!
Conclusion
sqlite-vec is a project sponsored by the Mozilla Builders Accelerator program, so it has some significant backing behind it. I have to give a big thanks to Alex Garcia, the creator of sqlite-vec, for helping to push the SQLite ecosystem and making ML possible with this simple database. This is a well maintained library, with updates coming down the pipeline on a regular basis. As of November 20th, they even added filtering by metadata! Perhaps I should re-do my aforementioned RAG article using SQLite 🤔.
The extension also offers bindings for several popular programming languages, including Ruby, Go, Rust, and more.
The fact that we are able to radically simplify our RAG pipeline to the bare essentials is remarkable. To recap, there is no need for a database service to be spun up and spun down, like Postgres, MySQL, etc. There is no need for API calls to cloud vector database vendors. If you deploy to a server directly via Digital Ocean or Hetzner, you can even avoid the expensive and unnecessary complexity associated with managed cloud services like AWS, Azure, or Vercel.
I believe this simple architecture can work for a variety of applications. It is cheaper to use, easier to maintain, and faster to iterate on. Once you reach a certain scale it will likely make sense to migrate to a more robust database such as Postgres with the pgvector extension for RAG capabilities. For more advanced capabilities such as chunking and document cleaning, a framework may be the right choice. But for startups and smaller players, it’s SQLite to the moon.
Have fun trying out sqlite-vec for yourself!