- How to use vector databases to represent complex data with vectors
- How they can help you harness the power of Large Language Models (LLMs)
- How to quickly apply AI to your proprietary data while protecting data ownership and privacy
Behind every technological breakthrough of the past few decades, you’ll find a database that was created to meet the needs of applications that traditional products could no longer serve.
The introduction of ChatGPT brought generative AI to the public’s attention, and numerous AI-driven applications have followed. For large organizations, the use of AI has become critical to positioning themselves for the future.
ChatGPT and similar solutions were developed using Large Language Models (LLMs). The popularity of LLMs has increased recently, triggering a parallel increase in interest in vector databases. These databases offer a tantalizing prospect: organizations can tailor the behavior of LLMs to their proprietary data without requiring extensive tuning through model retraining.
With the right approach, you can leverage your own data in LLMs while retaining flexibility and maintaining ownership, security, and quality of that data. Vector databases are ideal for generative AI applications because they allow organizations to take pre-trained LLMs and extend them to their business use cases with proprietary data.
In this article, we will first provide an overview of LLMs and how they can be used in your business, before looking at vector databases and their usefulness for LLMs.
Large Language Models: Generating Human-like Text
LLMs are advanced language models that are trained using deep learning techniques on large amounts of text data. These models can generate human-like text and perform various natural language tasks. They are trained on massive amounts of data to learn patterns and how entities relate to each other. Through this training process, the models can generate coherent and contextually relevant responses to queries by analyzing the statistical relationships between words, phrases, and sentences. OpenAI’s GPT-4 is probably the most well-known and popular LLM currently available.
LLMs are typically built using a transformer architecture. Transformers are a type of neural network well suited to the task of natural language processing. They can learn long-range relationships between words, which is essential for understanding the subtleties of human language. They are typically trained using a cluster of computers. Depending on the size of the dataset and which model is used, this process can take weeks or even months. In the training phase, they learn general language patterns, word relationships, and additional foundational knowledge.
Leveraging existing LLMs and providing them with local context is more effective than attempting to develop one in-house.
Preventing AI Hallucinations with Non-parametric Models
Parametric memory is a feature of pre-trained language models. It refers to the ability of the model to store knowledge in the weights of its neural network.
LLMs with parametric memory are like colossal knowledge repositories: they encapsulate a world of information in the weights of their neural networks. These stand-alone models have no external memory. Their drawback is that once you move outside general knowledge into domain-specific areas, where the relevant facts may not have been part of the training data, they tend to hallucinate (a phenomenon where LLMs generate factually incorrect answers). Updating such a model with current events and facts requires full re-training, which is expensive and can take from a few days to a few months.
Non-parametric LLMs can use external memory resources, which greatly expands their capabilities by freeing them from the constraints of their internal memory. This approach allows the model to access domain-specific and up-to-date information without continual retraining, which helps reduce the risk of hallucination and provides more reliable results. The downside is added architectural complexity, since it requires retrieval from an external source of knowledge, such as a vector database.
Vectors: Capturing the essence of data
The core of vector databases is representing data as numeric vectors. For example, an image or text may be represented as a vector such as: [0.50, 0.32, 0.76, …, 0.87, 0.12, 0.84]
Vector databases are designed to allow data to be managed and stored in a high-dimensional space. This is especially useful for applications like machine learning, where data such as images or text are represented as vectors. Vectors serve as numerical representations that capture the essence of data. They can capture the semantic meaning of the original data object.
An example of vectors in a 3-dimensional space, where the distance between them is a measure of their similarity:
Vectors used in AI have a large number of dimensions that cannot be represented visually in a way that is easily understood by humans:
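To make the idea concrete, here is a small sketch using NumPy (an assumption of this example; the three vectors are made-up toy values, not taken from any real embedding model):

```python
import numpy as np

# Toy 3-dimensional vectors (hypothetical values for illustration only).
cat = np.array([0.9, 0.8, 0.1])
dog = np.array([0.8, 0.9, 0.2])
car = np.array([0.1, 0.2, 0.9])

# Euclidean distance: a smaller distance means the concepts are more similar.
print(np.linalg.norm(cat - dog))  # small -> "cat" and "dog" are similar
print(np.linalg.norm(cat - car))  # large -> "cat" and "car" are dissimilar
```

Real embeddings work the same way, just with hundreds or thousands of dimensions instead of three.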
What are Vector Embeddings?
Vector embeddings capture semantic information by representing text as dense vectors in a multidimensional mathematical space. These vectors are designed to encode the contextual and semantic relationships between text elements, allowing for more nuanced understanding and analysis. Understanding the semantic subtleties of natural language is critical to the success of LLMs. By using vector embeddings, LLMs can leverage the rich semantic information embedded in textual data to generate more sophisticated and context-aware responses.
Embeddings are created by mapping each element of the input data to a dense vector in a high-dimensional space. Because semantically similar elements end up close together in this vector space, embeddings can be used to find similar items or to understand the context and intent of the data.
How are Vector Embeddings Created?
Vector embeddings are created through a machine-learning process that trains a model to convert any piece of data into numerical vectors. Most embedding models rely on transformers. The central idea behind transformers is “attention”, which weighs the relevance of different contextual inputs and allows the model to focus on the more important parts of a text when predicting the output.
Overview of the process:
- Collect a large dataset that is representative of the type of data for which you want to create embeddings, such as text or images.
- Preprocess the data to remove noise, normalize text, etc., depending on the type of data you are working with.
- Feed the pre-processed data into a neural network model.
- By adjusting its internal parameters during training, the model learns patterns and relationships within the data.
- The model will produce numeric vector embeddings that represent the meaning of each data point.
The diagram below illustrates this process:
In a similar manner, a single multi-modal embedding model may also be used to generate embeddings which support cross-modal search:
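As a hedged example of this process in practice, assuming the open-source sentence-transformers library and its all-MiniLM-L6-v2 model (neither is prescribed by this article), text embeddings can be generated in a few lines:

```python
from sentence_transformers import SentenceTransformer

# Load a pre-trained embedding model (all-MiniLM-L6-v2 produces 384-dimensional vectors).
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "Vector databases store embeddings for fast similarity search.",
    "LLMs generate human-like text from a prompt.",
]

# encode() returns one dense numeric vector per input sentence.
embeddings = model.encode(sentences)
print(embeddings.shape)  # (2, 384)
```

The resulting vectors can then be stored in a vector database and compared against each other.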
Retrieval-Augmented Generation: Augmenting LLM Knowledge with Your Data
Retrieval Augmented Generation (RAG) combines the capabilities of LLMs and non-parametric retrieval mechanisms into a hybrid architectural approach, and is essential for leveraging LLMs for business-specific use cases. RAG has become increasingly popular in recent years, using large neural networks together with external knowledge sources.
These pre-trained models have access to explicit non-parametric memory and can thus overcome the aforementioned hallucination problem when querying domain-specific or current information that was not in the training set. They combine that with the parametric knowledge from their training, which is typically general-purpose and not domain-specific.
RAG takes an input and retrieves a set of relevant supporting documents, thus combining an information retrieval component with a text generation model. The documents are contextualized with the original query and fed to the text generator, which produces the final output.
Example Applications of Retrieval-Augmented Generation:
- Chatbots: RAG can improve chatbot responses by pulling relevant information from a knowledge base before generating an answer.
- Question Answering: RAG can improve question-answering systems by combining retrieval with a generative model to generate accurate and informative responses.
- Customer Support: RAG can enhance customer support systems by retrieving relevant information from product manuals to generate accurate and helpful responses to customer inquiries.
How are users’ queries processed?
Create and store vector embeddings of domain-specific or up-to-date data:
- Generate vector embeddings from an existing set of data (the data you would like to use to add additional context to the LLM's responses).
- Store the generated embeddings in a vector database.
When a user submits a query, rather than passing it directly to the LLM, additional context is appended to it first:
- The user query is passed to the embedding model, which returns a vector embedding representation of the query.
- This embedding is used to query the vector database, which compares vectors to identify relevant documents and returns the most similar content. That content provides additional context to the LLM, allowing it to give a much more precise response based on domain-specific knowledge it was not trained on.
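Putting the two phases together, here is a minimal sketch of this flow. It uses sentence-transformers and NumPy, with an in-memory list standing in for a real vector database; the function and variable names are hypothetical:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Phase 1: embed the domain-specific documents and keep them as our "vector store".
documents = [
    "Our premium plan includes 24/7 phone support.",
    "Refunds are processed within 14 business days.",
    "The mobile app supports offline mode since version 3.2.",
]
doc_embeddings = model.encode(documents, normalize_embeddings=True)

def retrieve(query: str, k: int = 2) -> list[str]:
    """Embed the query and return the k most similar documents (cosine similarity)."""
    query_embedding = model.encode([query], normalize_embeddings=True)[0]
    scores = doc_embeddings @ query_embedding  # cosine similarity on normalized vectors
    top_k = np.argsort(scores)[::-1][:k]
    return [documents[i] for i in top_k]

# Phase 2: append the retrieved context to the user's query before calling the LLM.
user_query = "How long do refunds take?"
context = "\n".join(retrieve(user_query))
augmented_prompt = f"Answer using only this context:\n{context}\n\nQuestion: {user_query}"
print(augmented_prompt)  # this prompt would then be sent to the LLM
```

In production, the in-memory list would be replaced by a vector database and the augmented prompt sent to the LLM of your choice.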
Implementing RAG with Function Calls
Function calls enable AI models to interact with external APIs and systems, expanding their capabilities. When the model requires a function’s result to respond, it returns the function name and parameters. The agent handling the call then executes the function.
For instance, a function can perform a vector search, the results of which are passed along in a subsequent call to the model along with the original prompt. This is a common approach for implementing RAG using function calls. This paradigm also enables the orchestration of complex chains of observation, reasoning, and action, but that is beyond this blog post’s scope.
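A minimal sketch of the dispatch loop described above. The message format and the vector_search function are hypothetical stand-ins, not a specific vendor's API:

```python
import json

def vector_search(query: str, top_k: int = 3) -> list[str]:
    """Hypothetical function the model can request: runs a similarity search."""
    # In a real system this would query a vector database, as sketched earlier.
    return ["Refunds are processed within 14 business days."]

# Functions the agent exposes to the model.
AVAILABLE_FUNCTIONS = {"vector_search": vector_search}

def handle_model_response(response: dict) -> dict:
    """If the model asked for a function, execute it and return the result as a new message."""
    if response.get("type") == "function_call":
        func = AVAILABLE_FUNCTIONS[response["name"]]
        result = func(**response["arguments"])
        # The result is sent back to the model together with the original prompt.
        return {"role": "function", "name": response["name"], "content": json.dumps(result)}
    return {"role": "assistant", "content": response.get("content", "")}

# Example of a (mocked) model response requesting the vector search.
mock_response = {"type": "function_call", "name": "vector_search",
                 "arguments": {"query": "How long do refunds take?"}}
print(handle_model_response(mock_response))
```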
Improving LLMs with Vector Databases
Vector databases store and manage data, such as images or text, alongside their vector embeddings in a high-dimensional space. This is especially useful for applications such as machine learning, where data points are represented as vectors in that space. The similarity of two data objects can be calculated from the distance between their vector embeddings, because similar data points lie close together in vector space.
Vector databases allow for rapid similarity searches amongst vectors, enabling quick retrieval of the most similar items from a vast dataset. To enable this, they build vector indexes ahead of time so that the nearest neighbors of a query can be found quickly. Many traditional databases can also store vector embeddings and offer vector search, but they are not as performant for this use case. Vector searches can be performed on or across any modality, including images, video, audio, or combinations thereof.
Flexible and More Accurate Results
Vector databases existed long before LLMs. They have become an integral part of Generative AI technology since they can address key limitations like hallucinations and lack of state. Storing additional data as vector representations makes LLMs practical because it would be prohibitively expensive and time consuming to continuously retrain the LLM on new information.
Traditional relational databases are designed primarily for tabular data with fixed columns, and are not well suited to handle vector data efficiently. Vector databases, on the other hand, are designed to support high-dimensional and variable-length vectors, allowing for flexible data storage and retrieval.
LLMs are stateless and can’t retain the output produced by a new query or learn new information on which they have not been trained. Vector databases act as external memory, allowing new information to be stored as vectors over time and made available to AI models for more accurate results. Vector databases are an effective way to provide state to AI models because of the ease with which you can add to and update the information in a vector database.
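To illustrate how easily such external memory can be updated, here is a sketch using Chroma, an open-source vector database (chosen here only as an example; the article does not prescribe a particular product):

```python
import chromadb

client = chromadb.Client()  # in-memory instance for illustration
collection = client.create_collection(name="company_knowledge")

# New information can be added at any time, without retraining any model.
collection.add(
    ids=["doc-1", "doc-2"],
    documents=[
        "Our office in Cologne moved to a new address in 2024.",
        "The support hotline is reachable on weekdays from 9 to 17.",
    ],
)

# Later queries immediately see the new information.
results = collection.query(query_texts=["Where is the Cologne office?"], n_results=1)
print(results["documents"])
```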
Searching Data with Distance Metrics
A distance metric determines how the vector search engine will evaluate the similarity between vectors by measuring the distance between them in vector space. Points closer together in vector space represent concepts that are more similar in meaning. For example, “cat” is more similar to “animal” and less similar to “plant”.
Generative AI models often require flexibility when measuring the similarity of data points. Vector databases address this need by allowing for a variety of distance metrics. Common distance metrics include Euclidean Distance, Dot Product Similarity, and Cosine Similarity, each of which accommodates different types of vector distributions.
The general principle for the selection of the similarity metric is to use the same metric that was used in the training of your embedding model.
01 Euclidean Distance
Euclidean distance is the standard straight-line notion of distance, familiar from 2- and 3-dimensional space and generalized to any number of dimensions.
Euclidean distance is useful when embeddings contain information related to counting or measuring things.
02 Dot Product Similarity
The dot product similarity is the sum of the products of the corresponding components of two vectors. It measures the extent to which two vectors are aligned in the same direction.
For vectors of the same length, a larger dot product indicates that they point in a similar direction, while a smaller or negative dot product indicates that they point in different or opposite directions. Many embedding models are trained using dot-product similarity, making it the recommended metric when querying them.
03 Cosine Similarity
Cosine similarity is measured by the cosine of the angle between two vectors. It is calculated by taking the dot product of the vectors and dividing it by the product of the magnitudes of the vectors.
Cosine similarity is not affected by the magnitudes of the vectors, only by the angle between them. It is often used in semantic search and document classification tasks.
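The three metrics side by side, as a small NumPy sketch (the example vectors are arbitrary):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 3.0, 4.0])

# Euclidean distance: straight-line distance between the two points (smaller = more similar).
euclidean = np.linalg.norm(a - b)

# Dot product: sum of element-wise products (larger = more aligned, sensitive to magnitude).
dot = np.dot(a, b)

# Cosine similarity: dot product divided by the product of magnitudes (ignores magnitude).
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(euclidean, dot, cosine)
```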
Indexing as a key feature
Just as an index in a traditional database speeds up searching a table, an index in a vector database speeds up similarity search. The simplest form is a "flat" index, which stores the vector embeddings directly and compares the query against every one of them.
One of the key features of vector databases is advanced indexing strategies. They use methods such as Approximate Nearest Neighbor (ANN) to speed up searching in high-dimensional spaces. This is particularly useful in generative AI models, which often need to retrieve similar data points to generate a response.
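As an illustration of the difference, here is a sketch with FAISS, a widely used open-source vector-index library (an assumption of this example, not something the article mandates): a flat index performs exact search, while an HNSW index answers approximately but much faster on large collections.

```python
import numpy as np
import faiss

dim = 128
vectors = np.random.random((10_000, dim)).astype("float32")
query = np.random.random((1, dim)).astype("float32")

# Flat index: exact search, the query is compared against every stored vector.
flat_index = faiss.IndexFlatL2(dim)
flat_index.add(vectors)
distances, ids = flat_index.search(query, 5)

# HNSW index: approximate nearest neighbor (ANN) search, much faster on large collections.
hnsw_index = faiss.IndexHNSWFlat(dim, 32)  # 32 = number of graph neighbors per node
hnsw_index.add(vectors)
ann_distances, ann_ids = hnsw_index.search(query, 5)

print(ids[0], ann_ids[0])  # the approximate result usually agrees closely with the exact one
```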
Uses of Vector Databases in Generative AI:
Vector databases play a crucial role in generative AI by enabling advanced techniques such as:
- Few-Shot Learning: Few-shot learning is the problem of making predictions from limited samples. When a model is exposed to only a handful of vectors, it can quickly infer the broader concept by identifying similarities and relationships with other vectors. Vector databases improve on this approach by maintaining a broad index of vectors.
- Contextual Search Engines: Traditional searching relies on matching exact keywords. With vector databases, however, systems can understand and retrieve content based on semantic similarity. As a result, the search becomes more intuitive by focusing on the underlying meaning of the query rather than just on word matches.
- Multimodal Search: A new technique that integrates data from multiple sources, including text, pictures, audio, and video. Vector databases are the backbone of this approach, allowing combined searching of vectors from different modalities. This enables holistic searches that pull information from multiple sources in one query, providing deeper insights and more comprehensive results.
- Recommender Systems: Recommender systems use vector databases to suggest content that is closely related to a user’s preferences. The customer’s interests are represented as vectors. The recommendation system searches a vector database to find vectors representing content that matches a customer’s interests to ensure accurate recommendations.
Generative AI in the Lakehouse
Technology leaders are quickly discovering that businesses are keen to leverage AI and are racing to deploy the most comprehensive generative AI capabilities on their customers’ data sets.
Our partners Databricks and Microsoft have newly launched solutions in public preview that address this new market:
- Azure AI Studio provides everything you need to build an AI application in one place. You can load data from different sources, prepare it for training, train a model, and use the model in production. It includes a variety of pretrained models and tools.
- Azure AI Search provides indexing and querying capabilities with Azure Cloud infrastructure and security.
- Databricks Vector Search allows you to provide a source Delta table containing data in text form. Embeddings are then generated using a model of your choice and are indexed in sync with the Delta table.
- Databricks Curated models are optimized for high performance, and are designed to be easily integrated into the Lakehouse environment for a variety of applications, from text analysis and generation to image processing.
- Databricks Model Serving provides a single interface for provisioning, managing, and querying AI models. Each model served is available as a REST API for integration into a web or client application.
- Databricks AI functions are built-in SQL functions which allow users to:
- Use Databricks Foundation models
- Access external models such as GPT-4
- Query models hosted by Model Serving
Conclusion
The world of AI is changing rapidly. It’s touching many industries, bringing with it both new capabilities and challenges. The rapid progress in the field of generative AI underscores the vital role of vector databases in the management and analysis of multi-dimensional data.
These specialized storage systems are an integral part of the effective functioning of modern AI applications, especially in the area of similarity search.
MobiLab is uniquely positioned to help you apply AI to your proprietary data and to ensure that your AI initiatives continue to evolve in line with technological advancements and industry shifts. We have deep expertise in Data Integration, Data Platforms and Cloud native application development, coupled with our knowledge of AI.
Contact us today to discover how MobiLab can transform your business with generative AI.
Oleg Malomed
Oleg is a Solution Architect at MobiLab, focusing on the overall technical vision for specific business solutions.