Search

Full Text Search

Full-text search refers to matching some or all of a text query against documents stored in a database. Compared to traditional database queries, full-text search returns results even for partial matches. This allows more flexible search interfaces, helping users find accurate results more quickly. Typical features include:
Prefix and infix searching: This allows you to search for parts of words, like finding "apple" by searching "app" or finding "highlight" by searching "light."
Morphology processing: This includes stemming and lemmatization. Stemming strips affixes to reduce a word to its root, so "running" and "runs" both reduce to "run." Lemmatization maps a word to its dictionary base form, so the irregular form "ran" also becomes "run."
Fuzzy searching: This helps find results even when the query contains typos.
Exact result count: Full-text search provides the total number of documents that match the search criteria.
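The prefix, infix, and fuzzy matching features above can be illustrated with a minimal sketch using only the Python standard library. The sample documents, function names, and the 0.8 similarity threshold are assumptions made for illustration; a real full-text engine uses an inverted index rather than scanning every document.

```python
from difflib import SequenceMatcher

# Hypothetical sample documents
DOCS = ["apple pie recipe", "highlight reel", "running shoes"]

def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()

def prefix_match(query, docs):
    """Return docs containing a word that starts with the query."""
    q = query.lower()
    return [d for d in docs if any(w.startswith(q) for w in tokenize(d))]

def infix_match(query, docs):
    """Return docs containing a word that includes the query anywhere."""
    q = query.lower()
    return [d for d in docs if any(q in w for w in tokenize(d))]

def fuzzy_match(query, docs, threshold=0.8):
    """Return docs containing a word similar enough to the query (typo-tolerant)."""
    q = query.lower()
    return [
        d for d in docs
        if any(SequenceMatcher(None, q, w).ratio() >= threshold for w in tokenize(d))
    ]

print(prefix_match("app", DOCS))    # prefix search
print(infix_match("light", DOCS))   # infix search
print(fuzzy_match("aple", DOCS))    # query with a typo
```

The linear scan keeps the sketch short; it is not how production engines such as PostgreSQL or Elasticsearch evaluate these queries.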

Vector Search

Unlike traditional keyword-based search, vector search retrieves results by analyzing the similarity between vectors. It relies on three concepts:
Vectorization: Machine learning (ML) models, such as sentence transformers or OpenAI embeddings, convert the search query text and the documents into numerical representations. These representations are called vectors or embeddings.
Embedding space: These vectors are plotted in a multi-dimensional space, where the distance between vectors reflects the semantic similarity between the original pieces of text. Documents with similar meanings have vectors that are closer together in this space.
Nearest neighbors: The search engine uses algorithms like k-nearest neighbors (KNN) to find the vectors in the embedding space that are closest to the query vector. These closest vectors represent the documents that are most semantically similar to the search query.
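The nearest-neighbor step can be sketched as a brute-force search over cosine similarity. The 3-dimensional vectors below are hypothetical stand-ins for real embeddings, which typically have hundreds of dimensions and come from an embedding model; the function names are also assumptions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def knn(query_vec, doc_vecs, k=2):
    """Return the k (document, similarity) pairs closest to the query."""
    scored = [(doc, cosine_similarity(query_vec, vec)) for doc, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical embeddings, hand-written for illustration
DOC_VECS = {
    "calcium builds strong bones": [0.9, 0.1, 0.0],
    "milk is rich in calcium": [0.8, 0.3, 0.1],
    "how to install a bookshelf": [0.0, 0.2, 0.9],
}

query_vec = [1.0, 0.2, 0.0]  # pretend embedding of a nutrition question
top = knn(query_vec, DOC_VECS, k=2)
for doc, score in top:
    print(f"{score:.3f}  {doc}")
```

Brute-force search compares the query against every document; vector databases usually replace it with an approximate nearest-neighbor index to stay fast at scale.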

Comparison

| Feature | Full-Text Search | Vector Search |
| --- | --- | --- |
| Data Type | Structured or semi-structured text | Unstructured or high-dimensional data |
| Query Type | Keyword or phrase matching | Similarity matching |
| Primary Use Case | Exact matches, metadata filtering | Semantic understanding, recommendations |
| Technology Examples | PostgreSQL full-text search, Elasticsearch | pgvectorscale, FAISS |

Full-text search cannot capture semantic relationships between words.
Vector search cannot reliably match exact keywords, so some of the precise meaning of the text may be missed.

Hybrid Search

Hybrid search combines the strengths of full-text search and vector search. It builds upon the accessible, search-as-you-type experience of full-text search and integrates the enhanced discovery capabilities that AI search enables. The following example uses LangChain to combine a BM25 keyword retriever with a LanceDB vector retriever.

from langchain.vectorstores import LanceDB
import lancedb
from langchain.retrievers import BM25Retriever, EnsembleRetriever
from langchain.schema import Document
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.document_loaders import PyPDFLoader

# Initialize embeddings
embedding = OpenAIEmbeddings()

# Load a single PDF and split it into documents
loader = PyPDFLoader("/content/Food_and_Nutrition.pdf")
pages = loader.load_and_split()

# Initialize the BM25 retriever
bm25_retriever = BM25Retriever.from_documents(pages)
bm25_retriever.k = 2  # Retrieve top 2 results

# Connect to a local LanceDB database and create a table seeded with a placeholder row
db = lancedb.connect('/tmp/lancedb')
table = db.create_table("pandas_docs", data=[
    {"vector": embedding.embed_query("Hello World"), "text": "Hello World", "id": "1"}
], mode="overwrite")


# Initialize LanceDB retriever
docsearch = LanceDB.from_documents(pages, embedding, connection=table)
retriever_lancedb = docsearch.as_retriever(search_kwargs={"k": 2})

# Initialize the ensemble retriever
ensemble_retriever = EnsembleRetriever(retrievers=[bm25_retriever, retriever_lancedb],
                                       weights=[0.4, 0.6])

# Example customer query
query = ("which food is needed for building strong bones and teeth? "
         "which vitamins and minerals are important for this?")


# Retrieve relevant documents/products
docs = ensemble_retriever.get_relevant_documents(query)

from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI

# Replace the placeholder with your own key, or set the OPENAI_API_KEY environment variable
llm = ChatOpenAI(openai_api_key="sk-yourapikey")

# To use open-source models such as Llama or Mistral, see:
# https://github.com/lancedb/vectordb-recipes/blob/main/tutorials/chatbot_using_Llama2_&_lanceDB

qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=ensemble_retriever)

query = "what nutrition is needed for pregnant women?"
qa.run(query)
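The ensemble step above merges two ranked lists using the weights 0.4 and 0.6. One common way to fuse rankings is weighted Reciprocal Rank Fusion (RRF), where each document earns `weight / (rank + c)` from every list it appears in. The sketch below is a standalone illustration, not LangChain's own code; the document IDs, the function name `weighted_rrf`, and the constant `c=60` are assumptions.

```python
from collections import defaultdict

def weighted_rrf(ranked_lists, weights, c=60):
    """Fuse several rankings; each list contributes weight / (rank + c) per document."""
    scores = defaultdict(float)
    for ranking, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (rank + c)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical rankings from a keyword retriever and a vector retriever
bm25_ranking = ["doc_a", "doc_b", "doc_c"]
vector_ranking = ["doc_b", "doc_d", "doc_a"]

fused = weighted_rrf([bm25_ranking, vector_ranking], weights=[0.4, 0.6])
print(fused)
```

Because RRF works on ranks rather than raw scores, it avoids having to normalize BM25 scores and vector distances onto a common scale.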

Maximal Marginal Relevance (MMR)

Suppose your final key phrases are ranked as follows: Good Product, Great Product, Nice Product, Excellent Product, Easy Install, Nice UI, Light Weight, and so on. The problem with this ranking is that phrases like "good product," "nice product," and "excellent product" are near-synonyms that describe the same property of the product, yet they all rank near the top. If there is space to show only 5 key phrases, we don't want to fill it with these similar phrases.
Traditional semantic search ranks purely by similarity: the higher the similarity, the higher the rank. As a result, the top results can be near-duplicates of one another.
MMR addresses this by reducing redundancy and increasing diversity in the results; it is also used in text summarization. MMR selects each phrase for the final key phrase list according to a combined criterion of query relevance and novelty of information.
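The combined criterion can be sketched as a greedy loop: at each step, pick the candidate that maximizes `lambda_mult * relevance - (1 - lambda_mult) * redundancy`, where redundancy is the highest similarity to anything already selected. This is a minimal sketch under stated assumptions: the 2-D embeddings are toy vectors defined by an angle, and the names `unit`, `mmr`, and `lambda_mult` are chosen for illustration.

```python
import math

def unit(degrees):
    """Toy 2-D embedding: a unit vector at the given angle."""
    r = math.radians(degrees)
    return [math.cos(r), math.sin(r)]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def mmr(query_vec, candidates, k=3, lambda_mult=0.5):
    """Greedily select k items, trading off relevance against redundancy."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def score(name):
            relevance = cosine(query_vec, candidates[name])
            # Similarity to the closest already-selected item (0 if none yet)
            redundancy = max(
                (cosine(candidates[name], candidates[s]) for s in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical embeddings: the three "product" phrases point almost the same way
candidates = {
    "good product": unit(10),
    "great product": unit(12),
    "excellent product": unit(8),
    "easy install": unit(70),
}
query_vec = unit(20)

print(mmr(query_vec, candidates, k=2, lambda_mult=1.0))  # relevance only
print(mmr(query_vec, candidates, k=2, lambda_mult=0.5))  # relevance and diversity
```

With `lambda_mult=1.0` the loop reduces to plain similarity ranking; lowering it penalizes candidates that resemble phrases already chosen, which is how the near-synonyms get pushed down.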


