Qdrant DB
Qdrant is an open-source vector database designed to handle large volumes of high-dimensional vector data with precision and speed. Its primary function is to store, search, and manage embeddings, which are essential for various machine learning applications such as image and text retrieval, recommendation systems, and similarity searches. By indexing vector data efficiently, Qdrant optimizes both storage and retrieval operations, enabling rapid querying and robust handling of complex datasets.
Qdrant’s architecture is specifically tailored to facilitate fast and scalable vector operations. It employs advanced indexing techniques like HNSW (Hierarchical Navigable Small World graphs), which significantly reduce the search time for nearest neighbors in high-dimensional spaces. This makes Qdrant an excellent choice for applications requiring real-time vector similarity searching. Read more about Qdrant DB here.
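For intuition about what an index such as HNSW speeds up, the sketch below runs an exact, brute-force nearest-neighbour search in plain Python. It is not how Qdrant works internally; the function name `nearest_neighbors` and the toy vectors are illustrative assumptions. Every query is compared against every stored vector, which is the linear cost that graph-based indexes are designed to avoid.

```python
import heapq
import math


def nearest_neighbors(query, vectors, k=2):
    """Return the ids of the k stored vectors closest to `query` (Euclidean distance).

    Exact brute-force search: every stored vector is compared with the query.
    """
    return heapq.nsmallest(k, vectors, key=lambda vid: math.dist(query, vectors[vid]))


# Hypothetical toy data: id -> 3-dimensional vector
vectors = {
    "a": [0.0, 0.0, 0.0],
    "b": [1.0, 1.0, 1.0],
    "c": [0.9, 1.1, 1.0],
    "d": [5.0, 5.0, 5.0],
}

closest = nearest_neighbors([1.0, 1.0, 0.9], vectors, k=2)
print(closest)
```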
Prerequisites for Integrating Qdrant
System Requirements
Provision 2 Virtual Machines (VMs) on Ori Global Cloud by following the steps here.
Operating System Specifications:
- Recommended operating systems are Ubuntu 20.04 or later. These systems provide the best support for the necessary tools and libraries.
- For the server: a single CPU-only VM featuring next-generation Intel processors with a minimum of 8 cores (up to 64 cores) to ensure optimal performance.
- The other VM can be an NVIDIA GPU instance with CUDA Compute Capability to leverage GPU acceleration. Ensure that the NVIDIA drivers are up to date and that CUDA Toolkit 11.8 or newer is installed to support all necessary GPU operations.
Prior Installations:
- On one of the VMs, Docker must be installed to facilitate the installation of Qdrant. Refer here to install Docker on your provisioned VM.
- Python 3.10 or newer should be installed, as it works well with the required libraries and tools.
Ubuntu 22.04 or later ships with Python 3.10 or newer. Use the following command to check the installed Python version.
python3
Output
Python 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>
If you need to upgrade Python, follow the link here.
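The same check can be scripted, which is handy when provisioning several VMs. The following is a minimal sketch; the helper name `python_is_supported` and the constant `MIN_PYTHON` are illustrative assumptions, not part of any Qdrant tooling.

```python
import sys

MIN_PYTHON = (3, 10)  # minimum version recommended in this guide


def python_is_supported(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True if the (major, minor) version meets the minimum."""
    return tuple(version_info[:2]) >= minimum


if python_is_supported():
    print("Python", sys.version.split()[0], "meets the requirement")
else:
    print("Python", sys.version.split()[0], "is older than 3.10; consider upgrading")
```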
Networking specifications
Each Qdrant instance needs the following ports to be open:
- Port 6333: The default port for the HTTP (REST) API and the web dashboard. It also serves the monitoring, health, and metrics endpoints, which are essential for managing and observing the operational status of Qdrant.
- Port 6334: Dedicated to the gRPC API, enabling robust, high-performance communication between clients and the Qdrant server. This API is crucial for handling vector database operations with efficiency.
- Port 6335: Utilized for distributed deployments. This port is critical for inter-node communication within a Qdrant cluster, ensuring data consistency and availability across different instances.
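To confirm from the client VM that these ports are reachable, a plain TCP connection test is usually enough. The sketch below uses only the standard library; the helper names `port_is_open` and `check_qdrant_ports` are assumptions made for illustration, and the host placeholder must be replaced with your own address.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Ports named in this guide; the labels are only for display.
QDRANT_PORTS = {6333: "HTTP API", 6334: "gRPC API", 6335: "cluster"}


def check_qdrant_ports(host):
    """Map each Qdrant port to whether it accepted a TCP connection."""
    return {port: port_is_open(host, port) for port in QDRANT_PORTS}


# Example (replace the placeholder with your server VM's address):
# print(check_qdrant_ports("<your-qdrant-server-VM-IP>"))
```

A port reports as closed both when a firewall blocks it and when nothing is listening on it, so a closed 6335 on a single-node setup is expected rather than an error.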
Installing Qdrant
Pull Qdrant's open-source Docker image to proceed with the installation:
docker pull qdrant/qdrant
Start the Qdrant Docker container, using its default port 6333:
docker run -p 6333:6333 \
-v $(pwd)/qdrant_storage:/qdrant/storage \
qdrant/qdrant
It should produce the following output
           _                 _
  __ _  __| |_ __ __ _ _ __ | |_
 / _` |/ _` | '__/ _` | '_ \| __|
| (_| | (_| | | | (_| | | | | |_
 \__, |\__,_|_|  \__,_|_| |_|\__|
    |_|
Version: 1.9.4, build: 671cf97b
Access web UI at http://localhost:6333/dashboard
2024-05-30T10:44:26.814877Z INFO storage::content_manager::consensus::persistent: Initializing new raft state at ./storage/raft_state.json
2024-05-30T10:44:26.827607Z INFO qdrant: Distributed mode disabled
2024-05-30T10:44:26.827654Z INFO qdrant: Telemetry reporting enabled, id: 89959819-671f-4340-9322-563200352350
2024-05-30T10:44:26.831880Z INFO qdrant::actix: TLS disabled for REST API
2024-05-30T10:44:26.832123Z INFO qdrant::actix: Qdrant HTTP listening on 6333
2024-05-30T10:44:26.832248Z INFO actix_server::builder: Starting 7 workers
2024-05-30T10:44:26.832349Z INFO actix_server::server: Actix runtime found; starting in Actix runtime
2024-05-30T10:44:26.840279Z INFO qdrant::tonic: Qdrant gRPC listening on 6334
2024-05-30T10:44:26.840312Z INFO qdrant::tonic: TLS disabled for gRPC API
When setting up Qdrant as described above, a local directory named qdrant_storage was created. This directory serves as the storage location for all your collections and their associated metadata.
You may also run Qdrant using Kubernetes, Docker Compose, or build from source. Refer to Qdrant's installation guide.
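Once the container is running, you can confirm that the HTTP API responds by requesting its root path, which returns a small JSON document that includes the server version. The sketch below uses only the standard library; the helper name `get_qdrant_info` is an assumption for illustration, and the example URL assumes Qdrant is reachable on the default port.

```python
import json
import urllib.request


def get_qdrant_info(base_url, timeout=5.0):
    """Fetch the JSON document served at the root of Qdrant's HTTP API."""
    with urllib.request.urlopen(base_url.rstrip("/") + "/", timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Example, assuming Qdrant is listening locally on the default port:
# info = get_qdrant_info("http://localhost:6333")
# print(info.get("version"))
```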
Connecting with Qdrant Client
After setting up your second VM, we'll connect the Qdrant server with its client.
1. Set Up a Virtual Environment
Once Qdrant is up and running, let's set up a virtual environment on another VM with some packages.
# Install the Python virtual environment package
sudo apt install python3-venv
# Give your virtual environment a name
python3 -m venv <virtual-env-name>
# Activate your virtual environment
source <virtual-env-name>/bin/activate
2. Install qdrant-client
We'll now install Qdrant's client library by running the following command.
pip install qdrant-client pandas numpy faker
3. Instantiate the client
We'll now instantiate the client using the QdrantClient class.
from qdrant_client import QdrantClient
# Point the client at the API root, not at the /dashboard web UI path
client = QdrantClient(url="http://<your-qdrant-server-VM-IP>:6333")
4. Start creating Collections
Let's begin by creating a collection within the database. A collection is a named set of data in which each entry is referred to as a point: a vector plus an optional payload. When creating the collection, we define the dimensionality of the vectors it will hold; here, each vector has 2048 dimensions.
from qdrant_client.http import models
from qdrant_client.models import CollectionStatus, Distance, VectorParams
collection_name = "new_collection"
# recreate_collection deletes any existing collection with this name before creating it
client.recreate_collection(
    collection_name=collection_name,
    vectors_config=models.VectorParams(size=2048, distance=models.Distance.COSINE),
)
This should return True.
Below is the screenshot showing the new collection added to the Qdrant Dashboard:
The Distance attribute determines the metric used to compare vectors. COSINE distance is based on the cosine of the angle between two vectors, so it compares their direction rather than their magnitude, which makes it well suited to ranking vectors by similarity.
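To make the cosine metric concrete, here is a small standalone sketch in plain Python. It illustrates the formula only and is not Qdrant's implementation; the function name `cosine_similarity` and the sample vectors are assumptions chosen for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


v = [1.0, 2.0, 3.0]
print(cosine_similarity(v, [2.0, 4.0, 6.0]))      # same direction, different length
print(cosine_similarity(v, [-1.0, -2.0, -3.0]))   # opposite direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
```

Scaling a vector does not change the result, which is why cosine is a common choice for embeddings whose length carries no meaning.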
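Note that recreate_collection() deletes any existing collection with the same name before creating the new one. To create a collection without overwriting an existing one, use the client.create_collection() method, which fails if the name is already taken.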
We can gather information about the collection by retrieving it through our client. This is useful for testing during development, as it helps confirm the collection is functioning as expected.
collection_info = client.get_collection(collection_name=collection_name)
list(collection_info)
assert collection_info.status == CollectionStatus.GREEN
assert collection_info.vectors_count == 0
We can see the collection status on the Dashboard UI
You can now start adding some data and vectors to the collection.
To learn more and try some examples, refer to Qdrant’s GitHub repo.