Natural Language Processing (NLP)

Introduction

NLP is a branch of artificial intelligence (AI) that enables computers to understand, interpret, and generate human language.
It uses techniques from computational linguistics, machine learning, and deep learning to process and analyze text and speech, allowing machines to perform tasks such as understanding the meaning of text, translating between languages, and generating human-like conversations.

The goal is to filter out non-essential data, such as adjectives, a, an, the, ...
Common practices:
Lowercasing, removing stop words, regular expressions, lemmatization (ate -> eat, eats -> eat), N-grams (groups of N consecutive words)
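The practices listed above can be sketched in a few lines. This is a minimal illustration, not a production pipeline: the stop-word set and the lemma lookup table are hypothetical and deliberately tiny, and a real project would normally take both from an NLP library.

```python
import re

# Hypothetical, deliberately tiny resources for illustration only.
STOP_WORDS = {"a", "an", "the", "is", "and", "of", "to"}
LEMMAS = {"ate": "eat", "eats": "eat", "eating": "eat"}


def preprocess(text):
    """Lowercase, tokenize with a regular expression, drop stop words, lemmatize."""
    text = text.lower()
    tokens = re.findall(r"[a-z]+", text)  # keep alphabetic runs only
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [LEMMAS.get(t, t) for t in tokens]  # dictionary lookup as a stand-in lemmatizer


def ngrams(tokens, n):
    """Return every run of n consecutive tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = preprocess("The cat ate a fish, and the dog eats the bone!")
bigrams = ngrams(tokens, 2)
print(tokens)
print(bigrams)
```

The lookup table only covers the words it lists; a dictionary-backed lemmatizer from a library would handle the general case.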

Text vectorization is the broad process of converting words, sentences, or entire documents into numbers that machine learning models can work with. It’s like creating a translation dictionary between human language and computer language.
One of the simplest ways to vectorize text is the Bag-of-Words (BoW) model. The idea is to use a vector to represent the frequency or presence of each word in a document. Imagine taking all the unique words in your dataset
TF-IDF (Term Frequency-Inverse Document Frequency) improves on this by weighting words based on how important they are in a document compared to a collection of documents.
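To make the two representations concrete, here is a small sketch that builds Bag-of-Words counts and TF-IDF weights by hand. The three documents are made up, and the unsmoothed formula idf = log(N / df) is an assumption chosen for clarity; libraries often use smoothed variants that give slightly different numbers.

```python
import math
from collections import Counter

# Toy corpus (hypothetical).
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the bird ate the seed",
]
tokenized = [d.split() for d in docs]

# The vocabulary is every unique word in the dataset, in a fixed order.
vocab = sorted({w for doc in tokenized for w in doc})


def bow_vector(tokens):
    """Bag-of-Words: how often each vocabulary word occurs in the document."""
    counts = Counter(tokens)
    return [counts[w] for w in vocab]


def tfidf_vector(tokens):
    """TF-IDF with tf = count / document length and idf = log(N / df)."""
    counts = Counter(tokens)
    n_docs = len(tokenized)
    vector = []
    for w in vocab:
        tf = counts[w] / len(tokens)
        df = sum(1 for doc in tokenized if w in doc)  # documents containing w
        vector.append(tf * math.log(n_docs / df))
    return vector


bow = [bow_vector(t) for t in tokenized]
tfidf = [tfidf_vector(t) for t in tokenized]

the, cat, sat = vocab.index("the"), vocab.index("cat"), vocab.index("sat")
print(bow[0][the], tfidf[0][the])  # most frequent word, yet zero TF-IDF weight
print(tfidf[0][cat] > tfidf[0][sat])
```

In the first document "the" has the highest count, but because it appears in every document its TF-IDF weight is zero, while "cat", which appears in only one document, is weighted highest.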

Topic modeling is an unsupervised machine learning technique used in NLP to discover abstract "topics" or semantic themes within a large collection of documents, such as articles or social media posts.
It uses a model to learn the underlying meaning of words, e.g. Latent Semantic Analysis (LSA) or Latent Dirichlet Allocation (LDA).
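As a rough illustration of LSA, the sketch below projects documents into a two-dimensional latent space using a truncated singular value decomposition of the term-document matrix. To stay dependency-free it finds the top singular vectors with power iteration on the Gram matrix; that choice, the toy corpus, and all function names are assumptions made for this example, and in practice a linear algebra library would do the decomposition.

```python
import math
from collections import Counter

# Toy corpus (hypothetical): two documents about pets, two about finance.
docs = [
    "cat dog pet",
    "cat dog pet food vet",
    "stock market price trade",
    "stock market bank",
]

# Term-document count matrix A: one row per term, one column per document.
vocab = sorted({w for d in docs for w in d.split()})
counts = [Counter(d.split()) for d in docs]
A = [[c[w] for c in counts] for w in vocab]
n_docs = len(docs)

# Gram matrix G = A^T A; its eigenvectors are the right singular vectors of A.
G = [[sum(row[i] * row[j] for row in A) for j in range(n_docs)]
     for i in range(n_docs)]


def top_eigenpairs(matrix, k, iters=500):
    """Top-k eigenpairs of a symmetric PSD matrix: power iteration with deflation."""
    n = len(matrix)
    work = [row[:] for row in matrix]
    pairs = []
    for _component in range(k):
        v = [1.0 + 0.1 * i for i in range(n)]  # fixed start vector
        for _step in range(iters):
            w = [sum(work[r][c] * v[c] for c in range(n)) for r in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        Wv = [sum(work[r][c] * v[c] for c in range(n)) for r in range(n)]
        eigval = sum(v[r] * Wv[r] for r in range(n))  # Rayleigh quotient
        pairs.append((eigval, v))
        for r in range(n):  # deflate: remove the component just found
            for c in range(n):
                work[r][c] -= eigval * v[r] * v[c]
    return pairs


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


pairs = top_eigenpairs(G, 2)
# Document coordinates in the latent space: singular value times singular vector entry.
doc_coords = [[math.sqrt(max(lam, 0.0)) * vec[d] for lam, vec in pairs]
              for d in range(n_docs)]

same_topic = cosine(doc_coords[0], doc_coords[1])
cross_topic = cosine(doc_coords[0], doc_coords[2])
print(same_topic, cross_topic)
```

On this corpus the two pet documents end up pointing in nearly the same direction while a pet document and a finance document are nearly orthogonal, so each latent dimension behaves like a "topic". The corpus is artificial in that the two groups share no words; real text overlaps far more, and LDA takes a probabilistic approach that this sketch does not cover.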

Acts more like a specialist, focused on handling a specific task
A developer has to write a script that maps the NLP module's output to a customized response, the so-called hard-coded rules.
Fails to answer questions that are out of scope
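The hard-coded rules described above might look like the following sketch. The shape of the NLP module's output (an intent label plus a confidence score), the intent names, and the threshold are all hypothetical; an actual module defines its own format.

```python
# Hypothetical mapping from detected intent to a canned response.
RESPONSES = {
    "opening_hours": "We are open 9:00-18:00, Monday to Friday.",
    "reset_password": "Use the 'Forgot password' link on the login page.",
}
FALLBACK = "Sorry, I cannot help with that question."


def respond(nlp_output, threshold=0.6):
    """Return the scripted answer for a known, confident intent; otherwise fall back."""
    intent = nlp_output.get("intent")
    confidence = nlp_output.get("confidence", 0.0)
    if confidence >= threshold and intent in RESPONSES:
        return RESPONSES[intent]
    return FALLBACK


print(respond({"intent": "opening_hours", "confidence": 0.92}))
print(respond({"intent": "weather_forecast", "confidence": 0.88}))  # out of scope
```

Every supported intent needs its own entry, which is why anything outside the scripted scope only ever gets the fallback response.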

Acts more like a generalist and can answer many different kinds of questions
Incurs token costs and needs more computing time
Hard to make "explainable": because a third-party tool is used, it is a black box when it answers incorrectly
