Avoid Grouping (DISTINCT / GROUP BY)

Previous Next

Avoid grouping (DISTINCT / GROUP BY).

Background

If you do not need group information such as SUM, then avoid grouping.

In a way comparable to GROUP BY... HAVING, the DISTINCT keyword generally causes large sets of records to be cached for subsequent sorting. Avoid DISTINCT where possible. In the example, the RDBMS can move to the next set of records of the schedtour table as soon as the subquery has been satisfied once.

 

Example

Rewrite an SQL statement like the following:

SELECT    DISTINCT st.schedtour_id
FROM      schedtour st
,         reservation r
WHERE     s.schedtour_id = r.schedtour_id

to:

SELECT    st.schedtour_id
FROM      schedtour st
WHERE EXISTS
(
          SELECT    ''
          FROM      reservation r
          WHERE     s.schedtour_id = r.schedtour_id
)

 

The first statement with DISTINCT retrieves all IDs of scheduled tours for which at least one corresponding reservation exists. It does this by first performing a join, and then grouping by the primary key value.

Instead of the DISTINCT keyword, the same query could be formulated as:

GROUP BY st.schedtour_id

Both the DISTINCT and the GROUP BY version are slow, because grouping must be executed before the result can be delivered.

With the EXISTS construct, the RDBMS only needs to scan the SCHEDTOUR table once. For each scheduled tour record, it can decide to add it to the result list as soon as a corresponding child record is found, and then move to the next scheduled tour record. Scheduled tour records do not need to be buffered, since no grouping needs to take place.