Index

Introduction

If there are lots of row on database table, it is time-wasting to get the value from the row, and scan the value of row one by one
Index is set on primary key by default
When user needs to search the result frequency by entering some parameters or fields, we need to create the index that we will searching on

Primary(Clustered) Index

Introduction

A clustered index is collocated with the data in the same table space or same disk file. It will be created when primary key is created on the column
Data blocks are all the time moved around here & there by the OS whenever it’s necessary. A database system does not have any absolute control over how physical data space is managed, but inside a data block, records can be stored or managed in the logical order of the index key.
Don’t need to care about actually organizing the physical record in a certain order, rather a small index section is maintained in that order & fetching or maintaining records becomes very easy.

Structure

An index is usually maintained as a B+ Tree on disk & in-memory, and any index is stored in blocks on disk.
The leaf index block of the index contains a row locator (A pointer to the block). For the primary index, the row locator refers to virtual address of the corresponding physical location of the data blocks on disk where rows reside being sorted as per the index key.

Secondary Index

Introduction

Any index other than a clustered index is called a secondary index. Secondary indices does not impact physical storage locations unlike primary indices.
Secondary indices are needed on these columns if the frequency of such queries is very high.

Structure

Secondary index is also maintained in the B+ Tree and it’s sorted as per the key on which the index was created. The leaf nodes contain a copy of the key of the corresponding data in the primary index.
Retrieving data through the secondary index means you have to traverse two B+ trees — one is the secondary index B+ tree itself, and the other is the primary index B+ tree. The secondary index points to primary index
if you delete a primary index, all secondary indices have to be updated to contain a copy of the new primary index key

Unique Key Index

CREATE UNIQUE INDEX unique_idx_1 ON index_demo (pan_no);

Like primary keys, unique keys can also identify records uniquely with one difference — the unique key column can contain null values.

Composite Index

CREATE INDEX composite_index_1 ON index_demo (phone_no, name, age);

CREATE INDEX composite_index_2 ON index_demo (pan_no, name, age);

Let’s say we have an index defined on 4 columns — col1, col2, col3, col4. With a composite index, we have search capability on col1, (col1, col2) , (col1, col2, col3) , (col1, col2, col3, col4)
we can’t omit a column from the middle & use that like — (col1, col3) or (col1, col2, col4) or col3 or col4 etc
While deciding the columns for a composite index, you can analyze different use cases of your system & try to come up with the order of columns that will benefit most of your use cases.

Partial Index

CREATE INDEX secondary_index_1 ON index_demo (name(4));

The column name can contain large values of any length. Also in the index, the row locators’ or row pointers’ metadata have their own size.
it’s possible to create an index on the first few bytes of data as well. Example: the following command creates an index on the first 4 bytes of name. Though this method reduces memory overhead by a certain amount

Cardinality

It refers that the uniqueness of the result get by searching by the index. Fewer results get by searching the index and the bigger difference between each result means higher cardinality.
Setting low cardinality index, such as gender will decrease the performance of the query, as from the concept of b-tree geometry, index containing numbers of value will damage the performance, and the time complexity will be near to O(n)
At the opposite side, setting high cardinality index, the time complexity will be near to O(1), as the index is nearly unique.
The query optimizer will decide whether using index or not on searching based on cardinality
Only one index per table per query will be used by optimizer based on cardinality

Advantages

Help us to increase the speed of searching data

With index Scan

Index Scan:
Index Block → Specific Data Block → Row
[Only reads relevant blocks]

Without index scan

Sequential Scan:
Block 1 → Block 2 → Block 3 → Block 4 → Block 5
[Check every row in each block]

Drawbacks

With DML operations, indices are updated, so write operations are quite costly with indexes. The more indices you have, the greater the cost. Indexes are used to make read operations faster.
indices consume extra memory

Index algo

B-tree

CREATE TABLE users (
    id SERIAL PRIMARY KEY,
    email TEXT
);

CREATE INDEX idx_email ON users(email);

INSERT INTO users (email) VALUES 
    ('alice@mail.com'),   -- id=1
    ('bob@mail.com'),     -- id=2
    ('charlie@mail.com'), -- id=3
    ('david@mail.com');   -- id=4

Select * from "users" where "email" = 'bob@mail.com';

Initial structure (after first two inserts):
        'bob@mail.com'
        /
'alice@mail.com'

After third insert:
        'bob@mail.com'
        /            \
'alice@mail.com'  'charlie@mail.com'

After fourth insert (rebalancing):
        'charlie@mail.com'    -- Becomes root for better balance
        /                \
'bob@mail.com'     'david@mail.com'
    /
'alice@mail.com'

The B-tree traversal:
Start at root ('charlie@mail.com')
'bob' < 'charlie' → go left
Find 'bob@mail.com' → get row data

B-tree is better for finding exact values or ranges (The equality)

Gin

CREATE TABLE products (
    id SERIAL PRIMARY KEY,
    tags TEXT[]
);

CREATE INDEX idx_tags ON products USING GIN (tags);

INSERT INTO products (tags) VALUES 
    (ARRAY['electronics', 'laptop']),
    (ARRAY['electronics', 'phone']),
    (ARRAY['clothing', 'shirt']);

-- Find products with 'electronics' tag
SELECT * FROM products WHERE tags @> ARRAY['electronics']::TEXT[];

Entry Tree:
'clothing'   → [Row3]
'electronics'→ [Row1, Row2]
'laptop'     → [Row1]
'phone'      → [Row2]
'shirt'      → [Row3]

Each entry points to multiple rows:
'electronics' points to:
  - Row(id=1, tags=['electronics', 'laptop'])
  - Row(id=2, tags=['electronics', 'phone'])

The GIN traversal:
Look up 'electronics' in entry tree
Get list of matching rows [Row1, Row2]
Return those rows

Good for array column and find for containment

References

An in-depth look at Database IndexingfreeCodeCamp.org

PreviousGroup By vs Distinct NextN + 1 problem

Last updated 8 months ago

hashtagIntroduction

hashtagPrimary(Clustered) Index

hashtagIntroduction

hashtagStructure

hashtagSecondary Index

hashtagIntroduction

hashtagStructure

hashtagUnique Key Index

hashtagComposite Index

hashtagPartial Index

hashtagCardinality

hashtagAdvantages

hashtagDrawbacks

hashtagIndex algo

hashtagB-tree

hashtagGin

hashtagReferences

Introduction

Primary(Clustered) Index

Introduction

Structure

Secondary Index

Introduction

Structure

Unique Key Index

Composite Index

Partial Index

Cardinality

Advantages

Drawbacks

Index algo

B-tree

Gin

References