Group By vs Distinct
Distinct
DISTINCT eliminates duplicate rows from a result set. It is applied to the selected columns of the result set and returns only unique rows.
It deduplicates by identifying rows whose selected columns are all identical and discarding the repeats. It can be faster for simple uniqueness filtering, especially on smaller datasets or queries with few columns, although many optimizers often produce the same plan for DISTINCT and for an equivalent GROUP BY without aggregates.
Use it when the goal is to eliminate duplicates and return unique rows without any aggregation. It suits simple queries where only the unique combinations of the selected columns are needed.
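As a small illustration of that use case, the sketch below runs SELECT DISTINCT against an in-memory SQLite database through Python's standard `sqlite3` module. The `orders` table, its columns, and the sample rows are hypothetical and exist only for this example; other engines may return rows in a different order unless ORDER BY is given.

```python
import sqlite3

# Hypothetical table and data, used only to illustrate DISTINCT.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [
        ("alice", "Paris"),
        ("bob", "Rome"),
        ("alice", "Paris"),  # exact duplicate of the first row
        ("alice", "Rome"),   # same customer, different city: not a duplicate
    ],
)

# DISTINCT compares the full list of selected columns, not just the first one.
distinct_rows = conn.execute(
    "SELECT DISTINCT customer, city FROM orders ORDER BY customer, city"
).fetchall()
print(distinct_rows)
```

Only the repeated ("alice", "Paris") row is dropped, because uniqueness is judged on the whole selected row.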
Scan the Table or Index: The database engine first scans the table or index based on the query. It collects all the rows with the specified columns.
Sort the Rows: To identify duplicates efficiently, the database typically sorts the result set on the selected columns, although some engines build a hash table instead. Sorting places identical rows next to each other.
Remove Duplicate Rows: After sorting, the database engine compares adjacent rows. If two or more consecutive rows are identical, all but one are removed from the result set.
Return Unique Rows: The remaining rows, which are now unique, are returned as the final result.
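The four steps above can be sketched in plain Python. This is a conceptual model of sort-based deduplication, not how any particular database implements it; the function name `distinct_by_sort` is made up for this example, and it assumes every row is a tuple of mutually comparable values (NULL handling is ignored).

```python
def distinct_by_sort(rows):
    """Return unique rows using the sort-then-compare-neighbours approach."""
    # Step 2: sort so that identical rows become adjacent.
    ordered = sorted(rows)

    # Step 3: keep a row only if it differs from the last row we kept.
    unique = []
    for row in ordered:
        if not unique or unique[-1] != row:
            unique.append(row)

    # Step 4: the remaining rows are unique.
    return unique


# Step 1 stand-in: rows already "scanned" from a table.
scanned = [("b", 2), ("a", 1), ("b", 2), ("a", 1), ("a", 3)]
print(distinct_by_sort(scanned))
```

Because duplicates are adjacent after sorting, a single pass comparing each row with the last one kept is enough to remove them.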
Group By
GROUP BY is used to aggregate data by one or more columns. It groups rows that have the same values in the specified columns.
GROUP BY groups rows on identical column values, but it is not a deduplication operator as such: it collapses rows into groups and returns one row per group, usually alongside aggregates such as COUNT or SUM computed over each group.
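To make the difference concrete, the sketch below uses Python's standard `sqlite3` module with a hypothetical `orders` table (the table name, columns, and rows are assumptions for this example). It runs GROUP BY once with aggregates and once without, and compares the latter to DISTINCT.

```python
import sqlite3

# Hypothetical table and data, used only to illustrate GROUP BY.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("alice", 10), ("bob", 5), ("alice", 7), ("bob", 5)],
)

# With aggregates: one row per customer, plus values computed over each group.
totals = conn.execute(
    "SELECT customer, COUNT(*), SUM(amount) FROM orders "
    "GROUP BY customer ORDER BY customer"
).fetchall()

# Without aggregates: one row per unique customer, same rows as DISTINCT.
grouped = conn.execute(
    "SELECT customer FROM orders GROUP BY customer ORDER BY customer"
).fetchall()
distinct = conn.execute(
    "SELECT DISTINCT customer FROM orders ORDER BY customer"
).fetchall()

print(totals)
print(grouped == distinct)
```

The per-group count and sum are what DISTINCT cannot express, which is the practical reason to reach for GROUP BY.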
Scan the Table or Index: The database engine scans the table or index to collect all rows with the columns specified in the GROUP BY clause.
Sort or Hash the Rows (Depends on Database): The database either sorts the rows by the GROUP BY columns or builds a hash table keyed on them. Which one is used depends on the engine and its optimizer: sorting tends to be chosen when an index already supplies the order, while hashing is often preferred for large, unsorted inputs. Either way, rows end up organized into groups based on the values of the GROUP BY columns.
Group Rows: After sorting (or hashing), the engine groups rows that have the same values in the specified columns. Even without an aggregation function, GROUP BY will collapse rows into unique sets based on the grouping columns.
Return One Row Per Group: The database engine returns one representative row for each unique group. If there are no aggregation functions, it simply returns the unique combination of the grouping columns.
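The hashing variant of these steps can be sketched in plain Python with a dictionary standing in for the hash table. This is a conceptual model only; the function name `group_by_hash` and the choice of COUNT and SUM as aggregates are assumptions for this example, and real engines add details such as spilling to disk that are omitted here.

```python
def group_by_hash(rows):
    """Group (key, amount) rows by key, keeping a running COUNT and SUM."""
    groups = {}  # hash table: grouping key -> (count, total)
    for key, amount in rows:
        # Look up the group for this key, creating it on first sight.
        count, total = groups.get(key, (0, 0))
        groups[key] = (count + 1, total + amount)
    # One entry per unique key, i.e. one output row per group.
    return groups


scanned = [("alice", 10), ("bob", 5), ("alice", 7)]
print(group_by_hash(scanned))
```

No sorting is needed in this variant: each row is routed to its group in a single pass, and the keys of the dictionary alone are the unique combinations that GROUP BY would return without aggregates.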