Natural Language Processing (NLP)

Introduction

  • A branch of artificial intelligence (AI) that enables computers to understand, interpret, and generate human language.

  • Uses techniques from computational linguistics, machine learning, and deep learning to process and analyze text and speech, allowing machines to perform tasks such as understanding the meaning of text, translating languages, and generating human-like conversations.

Text Preprocessing

  • Filters out non-essential data, such as stop words (a, an, the, ...), so that later steps work only on the words that carry meaning.

  • Common practices:

  • Lowercasing, stop-word removal, regular expressions (to match or strip patterns), lemmatization (ate -> eat, eats -> eat), and N-grams (grouping every N consecutive words)
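The practices above can be sketched with the Python standard library alone. This is a minimal illustration, not a production pipeline: the stop-word list and the lemma lookup table are tiny hypothetical stand-ins, and the names `preprocess` and `ngrams` are assumptions made for this example. A real project would more likely rely on an NLP library for stop words and lemmatization.

```python
import re

# Small illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "the", "is", "on", "and", "of", "to"}

# Toy lemma lookup (hypothetical); real lemmatizers use a dictionary
# and part-of-speech information instead of a hand-written table.
LEMMAS = {"ate": "eat", "eats": "eat", "eating": "eat"}


def preprocess(text):
    """Lowercase, tokenize with a regex, drop stop words, then lemmatize."""
    text = text.lower()
    tokens = re.findall(r"[a-z]+", text)  # keep alphabetic runs only
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [LEMMAS.get(t, t) for t in tokens]


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = preprocess("The cat ate the fish, and the dog eats an apple!")
print(tokens)             # ['cat', 'eat', 'fish', 'dog', 'eat', 'apple']
print(ngrams(tokens, 2))  # bigrams such as ('cat', 'eat'), ('eat', 'fish'), ...
```

Note that stop words are removed before the N-grams are built here, so a bigram may join two words that were not adjacent in the original sentence. Whether that is desirable depends on the task.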

Vectorizing Text

  • Text vectorization is the broad process of converting words, sentences, or entire documents into numbers that machine learning models can work with. It’s like creating a translation dictionary between human language and computer language.

  • One of the simplest ways to vectorize text is the Bag-of-Words (BoW) model. The idea is to use a vector to represent the frequency or presence of each word in a document. Imagine taking all the unique words in your dataset as a vocabulary: each document then becomes a vector with one entry per vocabulary word.

  • TF-IDF (Term Frequency-Inverse Document Frequency) improves on this by weighting words based on how important they are in a document compared to a collection of documents.
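As a rough sketch of both ideas, the snippet below builds Bag-of-Words count vectors and then TF-IDF weights using only the standard library. The function names are assumptions made for this example, documents are assumed to be non-empty and already preprocessed, and the formula used is the plain textbook one (term frequency times log(N / document frequency)). Libraries typically apply smoothed or normalized variants, so their numbers will differ.

```python
import math
from collections import Counter


def build_vocabulary(docs):
    """Sorted list of every unique word across all documents."""
    return sorted({word for doc in docs for word in doc.split()})


def bag_of_words(doc, vocabulary):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]


def tf_idf(doc, docs, vocabulary):
    """Weight each word by its frequency in doc and its rarity across docs."""
    tokens = doc.split()
    counts = Counter(tokens)
    vector = []
    for word in vocabulary:
        tf = counts[word] / len(tokens)                  # term frequency
        df = sum(1 for d in docs if word in d.split())   # document frequency
        vector.append(tf * math.log(len(docs) / df))     # unsmoothed IDF
    return vector


docs = ["the cat sat", "the dog barked"]
vocab = build_vocabulary(docs)
print(vocab)                         # ['barked', 'cat', 'dog', 'sat', 'the']
print(bag_of_words(docs[0], vocab))  # [0, 1, 0, 1, 1]
print(tf_idf(docs[0], docs, vocab))  # 'the' gets weight 0.0: it is in every document
```

The word "the" counts once in the Bag-of-Words vector but drops to zero under TF-IDF, because a word that appears in every document does not help tell documents apart.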

Topic Modelling

  • An unsupervised machine learning technique used in natural language processing (NLP) to discover abstract "topics" or semantic themes within a large collection of documents, such as articles or social media posts

  • Uses a model to uncover the underlying meaning of words from how they co-occur across documents, e.g. Latent Semantic Analysis (LSA) or Latent Dirichlet Allocation (LDA)
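For LSA, the core step can be written as a truncated singular value decomposition of a term-document matrix. The symbols below are assumptions chosen for this note: X is an m x n matrix (m terms, n documents, filled with counts or TF-IDF weights) and k is the number of topics to keep.

```latex
X \approx U_k \Sigma_k V_k^{\top},
\qquad
U_k \in \mathbb{R}^{m \times k},\quad
\Sigma_k \in \mathbb{R}^{k \times k},\quad
V_k \in \mathbb{R}^{n \times k}
```

Each of the k columns of U_k can be read as a topic expressed as weights over terms, and each row of V_k places one document in that k-dimensional topic space. LDA reaches a similar goal differently: it is a probabilistic model in which each document is a mixture of topics and each topic is a distribution over words.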

Comparison (vs LLM)

NLP

  • Acts more like a specialist, focused on handling one specific task

  • A developer needs to write a script that maps the NLP module's output to a customized response, i.e. hard-coded rules

  • Fails to answer questions that are out of scope
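The points above can be illustrated with a toy rule-based responder. Everything here is hypothetical: the intents, keywords, responses, and the `classify_intent` function (a keyword matcher standing in for a real NLP module) are invented for this sketch.

```python
# Hypothetical intents with hand-written responses (the hard-coded rules).
RESPONSES = {
    "opening_hours": "We are open 9am to 6pm, Monday to Friday.",
    "refund": "Refunds are processed within 7 working days.",
}

# Hypothetical keywords used by the stand-in classifier below.
KEYWORDS = {
    "opening_hours": {"open", "hours", "close"},
    "refund": {"refund", "return"},
}

FALLBACK = "Sorry, I can't help with that."


def classify_intent(text):
    """Stand-in for an NLP module: return an intent label, or None."""
    words = set(text.lower().replace("?", " ").split())
    for intent, keywords in KEYWORDS.items():
        if words & keywords:
            return intent
    return None


def respond(text):
    """Developer-written logic mapping the module's output to a reply."""
    intent = classify_intent(text)
    return RESPONSES.get(intent, FALLBACK)


print(respond("When do you open?"))           # opening-hours response
print(respond("I want a refund"))             # refund response
print(respond("What is the weather today?"))  # out of scope, so the fallback
```

Any question outside the listed intents falls through to the fallback, which is the out-of-scope limitation described above.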

LLM

  • Acts more like a generalist and can answer many different kinds of questions

  • Incurs token costs and needs more computing time

  • Hard to make "explainable": when a third-party tool is used, the model is a black box, so it is difficult to trace why an answer was incorrect
