Developer Note
Search


Last updated 3 months ago


Full Text Search

  • Full-text search refers to matching some or all of a text query against documents stored in a database. Unlike traditional database queries, it returns results even for partial matches. This allows more flexible search interfaces, so users can find accurate results more quickly.

  • Prefix and infix searching: This allows you to search for parts of words, like finding "apple" by searching "app" or finding "highlight" by searching "light."

  • Morphology processing: This includes stemming and lemmatization. Stemming strips affixes so that different forms of a word, like "running" and "runs," reduce to the common stem "run." Lemmatization maps a word to its dictionary base form, so "ran" becomes "run."

  • Fuzzy searching: This helps find results even when the query contains typos.

  • Exact result count: Full-text search provides the total number of documents that match the search criteria.
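
The prefix, infix, and fuzzy matching described above can be sketched with a tiny in-memory inverted index. This is a minimal illustration using only the Python standard library, not how a production engine works; the sample documents, the function names, and the 0.8 similarity cutoff are all assumptions made for the example.

```python
import difflib

# Hypothetical sample documents, keyed by document id.
DOCS = {
    1: "apple pie recipe",
    2: "highlight the running text",
    3: "pineapple juice",
}

def build_index(docs):
    """Build an inverted index: term -> set of ids of documents containing it."""
    index = {}
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index.setdefault(term, set()).add(doc_id)
    return index

def prefix_search(index, prefix):
    """Return ids of documents with a term that starts with `prefix`."""
    return sorted({d for term, ids in index.items() if term.startswith(prefix) for d in ids})

def infix_search(index, part):
    """Return ids of documents with a term that contains `part` anywhere."""
    return sorted({d for term, ids in index.items() if part in term for d in ids})

def fuzzy_search(index, word, cutoff=0.8):
    """Return ids of documents with a term similar to `word`, tolerating typos."""
    matches = difflib.get_close_matches(word, list(index), n=3, cutoff=cutoff)
    return sorted({d for term in matches for d in index[term]})

INDEX = build_index(DOCS)

print(prefix_search(INDEX, "app"))        # documents with a term starting with "app"
print(infix_search(INDEX, "light"))       # documents with a term containing "light"
print(fuzzy_search(INDEX, "aple"))        # typo-tolerant lookup for "apple"
print(len(infix_search(INDEX, "apple")))  # exact count of matching documents
```

Because every lookup returns the full set of matching ids, the exact result count is simply the length of that list.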

Vector Search

  • Vectorization: Machine learning (ML) models, such as sentence transformers or OpenAI embeddings, convert the search query text and the documents into numerical representations. These representations are called vectors or embeddings.

  • Embedding space: These vectors are plotted in a multi-dimensional space, where the distance between vectors reflects the semantic similarity between the original pieces of text. Documents with similar meanings have vectors that are closer together in this space.

  • Nearest neighbors: The search engine uses algorithms like k-nearest neighbors (KNN) to find the vectors in the embedding space that are closest to the query vector. These closest vectors represent the documents that are most semantically similar to the search query.
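
The three steps above can be sketched as a brute-force nearest-neighbor lookup. This is a minimal illustration under stated assumptions: the 3-dimensional vectors are hand-written stand-ins for real embeddings (which usually have hundreds of dimensions and come from a model), and cosine similarity is used as the closeness measure.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; closer to 1.0 means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def knn(query_vec, doc_vecs, k=2):
    """Return the k (document, score) pairs most similar to the query vector."""
    scored = [(doc, cosine_similarity(query_vec, vec)) for doc, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical embeddings; a real system would get these from an embedding model.
DOC_VECS = {
    "cats are pets": [0.9, 0.1, 0.0],
    "dogs are pets": [0.8, 0.2, 0.1],
    "stock market news": [0.0, 0.1, 0.9],
}
QUERY_VEC = [0.85, 0.15, 0.05]  # pretend embedding of "household animals"

top = knn(QUERY_VEC, DOC_VECS, k=2)
for doc, score in top:
    print(f"{score:.3f}  {doc}")
```

The query shares no keywords with the two pet documents, yet they rank highest because their vectors point in nearly the same direction. Real vector databases typically replace this exhaustive scan with approximate nearest-neighbor indexes to stay fast at scale.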

Comparison

| Feature | Full-Text Search | Vector Search |
| --- | --- | --- |
| Data Type | Structured or semi-structured text | Unstructured or high-dimensional data |
| Query Type | Keyword or phrase matching | Similarity matching |
| Primary Use Case | Exact matches, metadata filtering | Semantic understanding, recommendations |
| Technology Examples | PostgreSQL full-text search, Elasticsearch | pgvectorscale, FAISS |

  • Full-text search cannot capture relationships between words or their semantic meaning.

  • Vector search cannot match exact keywords precisely, so some of the precise meaning of the text may be missed.

Hybrid Search

  • Hybrid search combines the strengths of full-text search and vector search. It builds upon the accessible, search-as-you-type experience of full-text search and integrates the enhanced discovery capabilities that AI search enables.

from langchain.vectorstores import LanceDB
import lancedb
from langchain.retrievers import BM25Retriever, EnsembleRetriever
from langchain.schema import Document
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.document_loaders import PyPDFLoader

# Initialize embeddings
embedding = OpenAIEmbeddings()

# load single pdf

loader = PyPDFLoader("/content/Food_and_Nutrition.pdf")
pages = loader.load_and_split()

# Initialize the BM25 retriever
bm25_retriever = BM25Retriever.from_documents(pages)
bm25_retriever.k =  2  # Retrieve top 2 results

# Create a LanceDB table to hold the vectors (seeded with one placeholder row)
db = lancedb.connect('/tmp/lancedb')
table = db.create_table("pandas_docs", data=[
    {"vector": embedding.embed_query("Hello World"), "text": "Hello World", "id": "1"}
], mode="overwrite")


# Initialize LanceDB retriever
docsearch = LanceDB.from_documents(pages, embedding, connection=table)
retriever_lancedb = docsearch.as_retriever(search_kwargs={"k": 2})

# Initialize the ensemble retriever
ensemble_retriever = EnsembleRetriever(retrievers=[bm25_retriever, retriever_lancedb],
                                       weights=[0.4, 0.6])

# Example customer query
query = ("Which foods are needed for building strong bones and teeth? "
         "Which vitamins and minerals are important for this?")


# Retrieve relevant documents/products
docs = ensemble_retriever.get_relevant_documents(query)

from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI(openai_api_key="sk-yourapikey")

# If you want to use open-source models such as Llama or Mistral, see:
# https://github.com/lancedb/vectordb-recipes/blob/main/tutorials/chatbot_using_Llama2_&_lanceDB

qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=ensemble_retriever)

query = "What nutrition is needed for pregnant women?"
qa.run(query)
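
The ensemble step in the example above merges two ranked lists using per-retriever weights. One common way to do that kind of merge is weighted Reciprocal Rank Fusion (RRF), sketched below in plain Python. Treat it as an illustration of the idea rather than a description of any library's internals: the function name, the document ids, and the smoothing constant `c=60` are assumptions made for this example.

```python
def weighted_rrf(rankings, weights, c=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each document earns weight / (c + rank) from every list it appears in,
    so documents ranked well by several retrievers rise to the top.
    """
    scores = {}
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (c + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

# Hypothetical results from a keyword retriever and a vector retriever.
bm25_ranking = ["doc_a", "doc_b", "doc_c"]
vector_ranking = ["doc_c", "doc_a", "doc_d"]

fused = weighted_rrf([bm25_ranking, vector_ranking], weights=[0.4, 0.6])
print(fused)
```

Fusing on ranks rather than raw scores avoids having to normalize BM25 scores and vector similarities onto a common scale, which is one reason rank-based fusion is popular for hybrid search.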

Maximal Marginal Relevance (MMR)

  • Suppose your final key phrases are ranked as Good Product, Great Product, Nice Product, Excellent Product, Easy Install, Nice UI, Light Weight, and so on. The problem with this ranking is that phrases such as Good Product, Nice Product, and Excellent Product are near-synonyms that describe the same property of the product, yet they all rank near the top. If there is room to show only 5 key phrases, we do not want to fill it with these similar phrases.

  • Traditional semantic search ranks results purely by similarity to the query: the higher the similarity, the higher the rank. As a result, the top results can be near-duplicates of one another.

  • The idea behind MMR is to reduce redundancy and increase diversity in the results; it is also used in text summarization. MMR selects each phrase for the final key phrase list according to a combined criterion of query relevance and novelty of information.
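
The selection rule described above can be sketched as a greedy loop: at each step, pick the candidate that maximizes `lambda_mult * relevance - (1 - lambda_mult) * redundancy`, where redundancy is the highest similarity to anything already selected. The 2-dimensional vectors, the phrase names, and the default `lambda_mult=0.5` below are assumptions chosen for illustration, not values from a real embedding model.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mmr(query_vec, candidates, k=2, lambda_mult=0.5):
    """Greedily select k candidates, balancing query relevance against redundancy."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        best_name, best_score = None, -math.inf
        for name in remaining:
            relevance = cosine(query_vec, candidates[name])
            # Redundancy: similarity to the closest already-selected item.
            redundancy = max(
                (cosine(candidates[name], candidates[s]) for s in selected),
                default=0.0,
            )
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_name, best_score = name, score
        selected.append(best_name)
        remaining.remove(best_name)
    return selected

# Hypothetical embeddings: three near-duplicate phrases and one distinct phrase.
PHRASES = {
    "good product": [1.0, 0.1],
    "great product": [1.0, 0.15],
    "nice product": [1.0, 0.05],
    "easy install": [0.6, 0.8],
}
QUERY = [1.0, 0.3]

by_similarity = sorted(PHRASES, key=lambda p: cosine(QUERY, PHRASES[p]), reverse=True)[:2]
by_mmr = mmr(QUERY, PHRASES, k=2)
print(by_similarity)  # similarity-only ranking
print(by_mmr)         # MMR ranking
```

With these toy vectors, ranking by similarity alone returns two near-duplicate phrases, while MMR keeps the most relevant phrase and then prefers the distinct one. Setting `lambda_mult` closer to 1.0 favors relevance; closer to 0.0 favors diversity.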

[Figure: vector search. Unlike traditional keyword-based search, vector search retrieves results by analyzing the similarity between vectors.]

[Figure: full-text search.]