What is Chroma DB: A Step-By-Step Guide

February 19, 2024

Meet the Author : Mr. Bharani Kumar

Bharani Kumar Depuru is a well-known IT personality from Hyderabad. He is the Founder and Director of Innodatatics Pvt Ltd and 360DigiTMG. An IIT and ISB alumnus with more than 18 years of experience, he has held prominent positions at IT firms such as HSBC, ITC Infotech, Infosys, and Deloitte. He is an IT consultant specializing in Industrial Revolution 4.0 implementation, Data Analytics practice setup, Artificial Intelligence, Big Data Analytics, Industrial IoT, Business Intelligence and Business Management. Bharani Kumar is also the chief trainer at 360DigiTMG, with more than ten years of experience, and has been making the IT transition journey easy for his students. 360DigiTMG is at the forefront of delivering quality education, thereby bridging the gap between academia and industry.

Ready to dive in? Let's get started

What is a Chroma vector database?

Chroma isn't just a database, it's a translator for the AI world. Imagine complex data like text, images, and audio – Chroma converts it into numerical patterns called "embeddings" that computers can understand. These embeddings are like secret maps, revealing hidden connections and meaning within the data.

Enter the Chroma vector database

Vector Database: A specialized database designed to store and manage numerical vectors, also known as embeddings. These vectors capture the essence of information, like text, images, and audio, in a machine-readable format.

Embedding: A mathematical representation that maps complex data into a dense vector of numbers, preserving relationships and semantic meaning. Think of it as a unique fingerprint for each piece of information.

Chroma's Inner Workings

Data Transformation: Chroma seamlessly converts various data types into their corresponding embeddings, acting as a universal translator for the AI world.
Efficient Storage and Indexing: It stores these embeddings in an optimized manner, enabling lightning-fast retrieval and comparison.
Semantic Search: Chroma unlocks the ability to search for information based on meaning and context, not just keywords. This means you can ask natural language questions and receive relevant results that truly align with your intent.
AI Application Enabler: Chroma serves as a foundation for building intelligent applications that understand and process information in a human-like manner.


Learning Objectives

Generating embeddings with ChromaDB and Embedding Models
Creating collections within the Chroma Vector Store
Storing documents, images, and embeddings within the collections
Performing Collection Operations like deleting and updating data, renaming of Collections
Finally, querying the collections to extract relevant information

Short Introduction to Embeddings

Embeddings, or vector embeddings, are a way of representing data (text, images, audio, video, etc.) in numerical form: each item becomes a numerical vector, that is, a point in an n-dimensional space. Because similar items end up close together in that space, embeddings allow us to cluster similar data. Embedding models take these inputs and convert them into vectors. One example is Word2Vec, a popular embedding model developed by Google that converts words to vectors. Large Language Model providers typically offer embedding models alongside their LLMs.

What are Embeddings Used for?

The good thing about converting words to vectors is that we can compare them. A computer cannot compare two words as they are, but if we give them to it as numerical inputs, i.e. vector embeddings, it can. We can create a cluster of words having similar embeddings. The words King, Queen, Prince, and Princess will appear in one cluster because they are related to each other.

This way, embeddings allow us to find words similar to a given word. We can extend this to sentences, where we input a sentence and obtain the related sentences from the provided data. This is the basis for Semantic Search, Sentence Similarity, Anomaly Detection, chatbots, and many more use cases. The chatbots we build to answer questions from a given PDF or Doc leverage this very concept of embeddings. Generative Large Language Model applications use this approach to retrieve content related to the queries provided to them.
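As a rough illustration of how such a comparison works, the sketch below measures cosine similarity between hand-made 3-dimensional vectors. The numbers are invented for this example and do not come from any real embedding model; real models produce vectors with hundreds of dimensions.

```python
import math

# Toy 3-dimensional "embeddings", invented for illustration only.
vectors = {
    "king": [0.90, 0.80, 0.10],
    "queen": [0.88, 0.82, 0.12],
    "banana": [0.10, 0.05, 0.95],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


royal = cosine_similarity(vectors["king"], vectors["queen"])
fruit = cosine_similarity(vectors["king"], vectors["banana"])
print(f"king vs queen:  {royal:.3f}")
print(f"king vs banana: {fruit:.3f}")
```

The related pair scores much closer to 1 than the unrelated pair, which is the property that clustering and semantic search rely on.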

Vector Store and the Need for Them

As discussed, embeddings are numerical representations of any kind of data, usually unstructured data, in an n-dimensional space. Now where do we store them? Traditional RDBMSs (Relational Database Management Systems) are not designed to store and search these vector embeddings efficiently. This is where Vector Stores / Vector Databases come into play. There are many Vector Stores out there, which differ in the embedding models they support and the kind of search algorithm they use to find similar vectors.

Why do we need them? We need them because they provide fast access to the data we need. Let’s consider a chatbot based on a PDF. When a user enters a query, the first step is to fetch the content of the PDF related to that query and feed it to the chatbot, so that the chatbot can use this information to provide a relevant answer to the user. Now how do we get the content of the PDF that is relevant to the user query? The answer is a simple similarity search.

When data is represented as vector embeddings, we can find similarities between different parts of the data and extract the data similar to a particular embedding. The query is first converted to an embedding by an embedding model. The Vector Store then takes this embedding, performs a similarity search (through search algorithms) against the other embeddings stored in its database, and fetches the most relevant entries. The content associated with these embeddings is then passed to the Large Language Model behind the chatbot, which uses this information to generate a final answer for the user.
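The retrieval step can be sketched as a brute-force nearest-neighbor search over a small list. This is a toy under stated assumptions: the text chunks, the 2-dimensional vectors, and the top_k helper are all made up for illustration, and real vector stores use specialized indexes rather than sorting every entry.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def top_k(query_vector, store, k=2):
    """Return the k stored texts whose vectors are most similar to the query."""
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vector, item["vector"]),
        reverse=True,
    )
    return [item["text"] for item in ranked[:k]]


# Hypothetical chunks of a PDF with made-up 2-dimensional embeddings.
store = [
    {"text": "Refund policy: 30 days", "vector": [0.9, 0.1]},
    {"text": "Shipping takes 5 days", "vector": [0.2, 0.9]},
    {"text": "Returns need a receipt", "vector": [0.8, 0.3]},
]

query_vector = [1.0, 0.2]  # pretend embedding of "How do I get a refund?"
context = top_k(query_vector, store, k=2)
print(context)  # these chunks would be placed in the prompt sent to the LLM
```

Only the text of the best matches is handed to the language model, which keeps the prompt short and focused on the query.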

What is Chroma DB?

Chroma is a Vector Store / Vector DB by the company Chroma. Like many other Vector Stores, Chroma DB is for storing and retrieving vector embeddings. The good part is that Chroma is a free and open-source project. This gives skilled developers around the world the opportunity to suggest and contribute improvements to the database, and issues are visible to the whole open-source community, which can help them get resolved quickly.

At present, Chroma does not provide a hosting service, so applications built around Chroma store their data in the local file system, though Chroma is planning to build a hosting service in the near future. Chroma DB offers different ways to store vector embeddings: you can keep them in memory, persist them to disk and load them back, or run Chroma as a client that talks to a backend server. The core Chroma DB API has only four functions, which makes it short, simple, and easy to get started with.

Let’s Start with Chroma DB

In this section, we will install Chroma and see the functionality it provides. First, we install the library with pip: pip install chromadb

Chroma Vector Store API

This will download the Chroma Vector Store API for Python. With this package, we can perform all tasks like storing the vector embeddings, retrieving them, and performing a semantic search for a given vector embedding.

Memory Database

We will start off by creating a persistent in-memory database. To create a client, we use the Client() object from chromadb. To make the in-memory database persistent, we configure the client with the following parameters:

chroma_db_impl = "duckdb+parquet"
persist_directory = "/content/"

This creates an in-memory DuckDB database that is persisted in the Parquet file format, and persist_directory tells Chroma where this data is to be stored. Here we are saving the database in the /content/ folder. Whenever we connect to a Chroma DB client with this configuration, Chroma DB looks for an existing database in the given directory and loads it; if none is present, it creates one. When we close the connection, the data is saved to this directory. Note that these two settings belong to Chroma releases before 0.4; from 0.4 onward they were replaced by chromadb.PersistentClient(path=...).
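A minimal sketch of client creation, assuming Chroma 0.4 or later, where chromadb.PersistentClient(path=...) takes the place of the chroma_db_impl / persist_directory settings. The helper name make_client is hypothetical, and chromadb is imported inside the function so the module can be loaded even where the package is not installed.

```python
def make_client(persist_directory=None):
    """Return a Chroma client: persistent if a directory is given, else in-memory."""
    import chromadb  # lazy import; install with `pip install chromadb`

    if persist_directory is None:
        # Ephemeral client: data lives only as long as this process
        return chromadb.Client()
    # Chroma >= 0.4: data is written under persist_directory and reloaded
    # the next time a client is created for the same path
    return chromadb.PersistentClient(path=persist_directory)


# Example (requires chromadb to be installed):
#   client = make_client("/content/")
```

Calling make_client() without arguments gives a throwaway database that is convenient for experiments, while passing a path keeps the collections between runs.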

Now, we will create a collection. A collection in a Vector Store is where we save a set of vector embeddings, documents, and any metadata. A collection in a vector database can be thought of as a table in a relational database.

Create Collection and Add Documents

We will now create a collection and add documents to it.

We start by creating a collection, which we name "my_information".

To this collection, we add three documents; in our case, three sentences serve as the three documents. The first document is about cars, the second is about dogs, and the third is about four-wheelers.

We also add metadata, one entry for each of the three documents.

Every document needs a unique ID, so we give them id1, id2, and id3.

All of these are passed as arguments to the add() function of the collection.

After running the code, these documents are stored in our collection "my_information".
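The steps above can be sketched as follows. The collection name and the ids come from the text; the exact sentences and metadata values are assumptions made for this example, and build_collection is a hypothetical helper that accepts any Chroma client.

```python
def build_collection(client):
    """Create the "my_information" collection and add three short documents."""
    collection = client.create_collection(name="my_information")
    collection.add(
        # Illustrative sentences: one about cars, one about dogs, one about four-wheelers
        documents=[
            "This is a document containing car information",
            "This is a document containing information about dogs",
            "This document contains a four-wheeler catalogue",
        ],
        # One metadata dict per document (values are made up)
        metadatas=[
            {"source": "Car Book"},
            {"source": "Dog Book"},
            {"source": "Vehicle Info"},
        ],
        # Every document needs a unique id
        ids=["id1", "id2", "id3"],
    )
    return collection


# Example (requires `pip install chromadb`):
#   import chromadb
#   collection = build_collection(chromadb.Client())
```

The three lists are matched by position, so the first id, the first document, and the first metadata entry describe the same item.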

Vector Databases

We learned that the information stored in Vector Databases is in the form of vector embeddings. But here, we provided text, i.e. documents. So how does it store them? By default, Chroma DB uses the all-MiniLM-L6-v2 embedding model to create the embeddings for us. This model takes our documents and converts them into vector embeddings. If we want to work with a specific embedding function, such as another sentence-transformer model from HuggingFace or an OpenAI embedding model, we can pass it through the embedding_function parameter of the create_collection() method.

We can also provide embeddings directly to the Vector Store instead of passing documents to it. Just like the documents parameter of add(), there is an embeddings parameter, to which we pass the embeddings that we want to store in the Vector Database.
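Both options might look like the sketch below. The helper names, ids, and 3-dimensional vectors are hypothetical. The second helper assumes the sentence-transformers package is installed next to chromadb, and it imports Chroma lazily so the first helper can be used on its own.

```python
def add_with_embeddings(collection):
    """Store precomputed vectors so Chroma does not have to embed the text itself."""
    collection.add(
        ids=["vec1", "vec2"],
        embeddings=[[0.1, 0.2, 0.3], [0.9, 0.8, 0.7]],  # made-up 3-d vectors
        documents=["first text", "second text"],  # kept alongside the vectors
    )


def create_collection_with_model(client, model_name="all-MiniLM-L6-v2"):
    """Create a collection that embeds documents with a chosen sentence-transformer."""
    from chromadb.utils import embedding_functions  # lazy import

    embedder = embedding_functions.SentenceTransformerEmbeddingFunction(
        model_name=model_name
    )
    return client.create_collection(
        name="my_information", embedding_function=embedder
    )
```

All vectors in one collection must have the same number of dimensions, so it is safest to pick one embedding approach per collection and use it for both adding and querying.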

So now the model has successfully stored our three documents in the form of vector embeddings in the vector store. Now, we will look at retrieving relevant documents from them. We will pass a query and will fetch the documents that are relevant to it.

Query a Vector Store

To query a vector store, the collection provides a query() function, which lets us search the vector database for relevant documents. In this function, we provide two parameters:

query_texts – a list of queries for which we need to extract the relevant documents.
n_results – how many top results the database should return. In our case, we want the collection to return the 2 most relevant documents for the query.

When we run the query with "car" and print the results, the vector store returns the two documents associated with id1 and id3. The id1 document is about cars, and the id3 document is about four-wheelers, which is again related to cars. When we give a query, Chroma DB converts it into a vector embedding using the collection's embedding model (the default one, unless we specified another). It then performs a semantic search (a nearest-neighbor search) between this embedding and the embeddings of all the stored documents. The query "car" is most relevant to the id1 and id3 documents, hence this result.

This is very helpful when we are building a chat application that draws on multiple documents. Through a vector store, we can fetch the documents relevant to a query by performing a semantic search and feed only these documents to the final Generative AI model, which then uses them to generate a response to the query.
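A short sketch of the query step, assuming a collection that already holds the documents. The find_relevant helper is hypothetical; it relies on query() returning one list of ids and one list of documents per query text.

```python
def find_relevant(collection, question, n_results=2):
    """Return (id, document) pairs for the entries closest to the question."""
    results = collection.query(query_texts=[question], n_results=n_results)
    # query() returns one inner list per query text; we sent one, so take index 0
    return list(zip(results["ids"][0], results["documents"][0]))


# Example with a Chroma collection:
#   for doc_id, text in find_relevant(collection, "car"):
#       print(doc_id, text)
```

The result dictionary can also carry metadata and distances, which are useful for filtering out weak matches before they reach the language model.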

Updating and Deleting Data

We do not always add all the information to the Vector Store at once. In most cases, we have only limited data/documents at the start, which we add as is. Later, when we get more data, it becomes necessary to update the existing data/vector embeddings in the Vector Store. To update data in Chroma DB, we use the update() function of the collection.

Previously, the information in the document associated with id2 was about Dogs. Now we are changing it to Cats. For this information to be updated within the Vector Store, we pass the id of the document, the updated document, and the updated metadata of the document to the update() function of the collections. This will now update the id2 to Cats which was previously about Dogs.
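The update described here could be written as below. The id comes from the text, while the replacement sentence and metadata value are assumptions for illustration, as is the helper name.

```python
def change_dogs_to_cats(collection):
    """Overwrite the document stored under id2 with a sentence about cats."""
    collection.update(
        ids=["id2"],
        documents=["This is a document containing information about cats"],
        metadatas=[{"source": "Cat Book"}],  # made-up metadata value
    )
```

Because the document text changes, Chroma recomputes the embedding for id2, so later queries are matched against the new content.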

Query the Database

Now we pass in "Felines" as the query to the Vector Store. Cats belong to the family of mammals known as felines, so the collection should return the cat document as the most relevant one. In the output, we see exactly that: the vector store performed a semantic search between the query and the contents of the documents and returned the matching document.

Upsert Function

There is a function similar to update() called upsert(). The only difference between the two is what happens when the given document ID does not exist in the collection. With update(), Chroma reports an error for that ID and no document is added. With upsert(), the document is added to the collection, just as add() would do.
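A small sketch of upsert(), wrapped in a hypothetical save_document helper; the argument names of the helper are assumptions for this example.

```python
def save_document(collection, doc_id, text, source):
    """Insert the document if doc_id is new, otherwise overwrite the existing one."""
    collection.upsert(
        ids=[doc_id],
        documents=[text],
        metadatas=[{"source": source}],
    )


# Example with a Chroma collection:
#   save_document(collection, "id4", "A document about bicycles", "Bike Book")
```

This is convenient in ingestion scripts that may be run more than once, since a repeated run simply rewrites the same ids instead of failing.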

Sometimes, to reduce the space or remove unnecessary/ unwanted information, we might want to delete some documents from the collection in the Vector Store.

Delete Function

To delete an item from a collection, we have the delete() function. Here we delete the first document, associated with id1, which was about cars. To check, we query the collection with "car" and look at the results. Only the documents id3 and id2 appear: id3 is the document about four-wheelers, which is the closest to cars, and id2 is the document about cats, which is the least close to cars, but since we specified n_results = 2 we get id2 as well. If we call delete() without any arguments, all the items in the collection are deleted; this is the behavior of the Chroma version used here, and newer releases may instead require ids or a filter.
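A cautious sketch of deletion by id. The remove_documents helper and its guard are additions for this example, not part of Chroma: the guard refuses an empty id list so that a collection is never emptied by accident.

```python
def remove_documents(collection, ids):
    """Delete the given ids, refusing to run with an empty list."""
    ids = list(ids)
    if not ids:
        # Guard added for this sketch: avoid an accidental "delete everything"
        raise ValueError("pass at least one id to delete")
    collection.delete(ids=ids)


# Example with a Chroma collection:
#   remove_documents(collection, ["id1"])
```

delete() also accepts a where filter on metadata, which is handy when the ids are not known in advance.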

Collection Functions

We have seen how to create a new collection and then add documents, and embeddings to it. We have even seen how to extract relevant information to a query from the collection i.e. from the documents stored in the Vector Store. The collections object from Chroma DB is also associated with many other useful functions.

Let us look at some other functionalities provided by Chroma DB.

Count Function

The count() function of the collection returns the number of items present in it. In our case, we have 3 documents stored in our collection, hence the output will be 3. The get() function returns the items present in our collection along with their ids and metadata; embeddings are included only when requested through the include parameter. In the output, we see that all the items we added to our collection are returned by get(). Let’s now look at modifying the collection name.
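Both calls can be combined in a small inspection helper. The summarize name and the shape of the returned dictionary are assumptions for this sketch.

```python
def summarize(collection):
    """Return the item count together with the stored ids and documents."""
    # ids are always returned; documents and metadatas are requested explicitly
    items = collection.get(include=["documents", "metadatas"])
    return {
        "count": collection.count(),
        "ids": items["ids"],
        "documents": items["documents"],
    }


# Example with a Chroma collection:
#   print(summarize(collection))
```

Adding "embeddings" to the include list would return the stored vectors as well, which is usually only needed for debugging.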

Modify Function

We use the modify() function of the collection to change the name that was given when the collection was created. When run, it changes the collection name from the old name to the new name passed to modify() through the name parameter. Now suppose we have multiple collections in our Vector Store. How do we get a specific collection from the Vector Store, and how do we delete one? Let’s see.
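Renaming might be sketched as follows; rename_collection is a hypothetical helper, and the names in the usage comment follow the ones used in this guide.

```python
def rename_collection(client, old_name, new_name):
    """Look up a collection by its current name and give it a new one."""
    collection = client.get_collection(name=old_name)
    collection.modify(name=new_name)
    return collection


# Example with a Chroma client:
#   rename_collection(client, "my_information", "my_information_2")
```

The stored documents and embeddings are untouched by the rename; only the name used to look the collection up changes.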

Get Collection Function

The get_collection() function fetches an existing collection from the Vector Store, given its name. If the collection does not exist, the function raises an error. Here, get_collection() tries to get the my_information_2 collection and assign it to the variable my_collection. To delete an existing collection, we have the delete_collection() function, which takes the collection name as its parameter (my_information in this case) and deletes the collection if it exists.
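A sketch of both operations with hypothetical helper names. The broad except clause is a deliberate simplification, on the assumption that the exact exception class raised for a missing collection differs between Chroma versions.

```python
def get_or_none(client, name):
    """Return the named collection, or None if it cannot be fetched."""
    try:
        return client.get_collection(name=name)
    except Exception:  # exception type varies by Chroma version (assumption)
        return None


def drop_collection(client, name):
    """Delete the named collection together with everything stored in it."""
    client.delete_collection(name=name)


# Example with a Chroma client:
#   my_collection = get_or_none(client, "my_information_2")
#   drop_collection(client, "my_information")
```

Chroma clients also offer get_or_create_collection(), which avoids the lookup-then-create pattern when a missing collection should simply be created.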

Conclusion

In this guide, we have seen how to get started with Chroma, one of the open-source Vector Databases. We started by learning what vector embeddings are, why they are necessary for Generative AI models, and how Vector Stores help Generative Large Language Models. Then we took a closer look at Chroma and saw how to create collections in it. We looked at how to add data such as documents to Chroma and how Chroma DB creates vector embeddings out of them. Finally, we saw how to retrieve the information relevant to a given query from a particular collection in the Vector Store.

Some of the key takeaways from this guide include:

Vector Embeddings are numerical representations (numerical vectors) of non-numerical data like text, images, audio, etc
Vector Stores are the databases that are used to store the vector embeddings in the form of collections
They provide efficient storage and retrieval of information from the embeddings data
Chroma DB can work as both an in-memory database and as a backend
Chroma DB has the functionality to store the data upon quitting and load the data to memory upon initiating a connection, thus persisting the data
With Vector Stores, extracting information from documents, generating recommendations, and building chatbot applications will become much simpler
