DISTINCT (SQL)
Synopsis
SELECT DISTINCT BY (item {,item2}) select-item {,select-item2}
SELECT DISTINCT [ALL] select-item {,select-item2}
Arguments
Argument | Description |
---|---|
DISTINCT | Optional — Returns rows for which the combined select-item value(s) are unique. |
DISTINCT BY (item {,item2}) | Optional — Returns select-item values for rows for which the BY (item) value(s) are unique. |
ALL | Optional — Returns all rows in the result set. The default. |
Description
The optional DISTINCT clause appears after the SELECT keyword and before the optional TOP clause and the first select-item.
The DISTINCT clause is applied to the result set of the SELECT statement. It limits the rows returned to one arbitrary row for each distinct (unique) value. If no DISTINCT clause is specified, the default is to display all the rows that fulfill the SELECT criteria. The ALL clause is the same as specifying no DEFAULT clause; if you specify ALL, SELECT returns all the rows in the table that fulfill the SELECT criteria.
The DISTINCT clause has two forms:
-
SELECT DISTINCT: Returns one row for each unique combination of select-item values. You can specify one or more than one select-items. For example, the following query returns a row with Home_State and Age values for each unique combination of Home_State and Age values:
SELECT DISTINCT Home_State,Age FROM Sample.Person
-
SELECT DISTINCT BY (item): Returns one row for each unique combination of item values. You can specify a single item or a comma-separated list of items. The specified item or item list must be enclosed in parentheses. Spaces may be specified or omitted between the BY keyword and the parentheses. The select-item list may, but does not have to, include the specified item(s). For example, the following query returns a row with Name and Age values for each unique combination of Home_State and Age values:
SELECT DISTINCT BY (Home_State,Age) Name,Age FROM Sample.Person
The item field(s) must be specified by column name. Valid values include the following: a column name (DISTINCT BY (City)); an %ID (which returns all rows); a scalar function specifying a column name (DISTINCT BY (ROUND(Age,-1))); a collation function specifying a column name (DISTINCT BY (%EXACT(City))). You cannot specify a field by column alias; attempting to do so generates an SQLCODE -29 error. You cannot specify a field by column number; this is interpreted as a literal and returns one row. Specifying a literal as the item value in a DISTINCT clause returns 1 row; which row is returned is indeterminate. Thus, specifying 7, 'Chicago', '', 0, or NULL all return 1 row. However, if you specify a literal as an item value in a comma-separated list, the literal is ignored and DISTINCT selects one arbitrary row for each unique combination of the specified field names.
The DISTINCT clause is applied before the TOP clause. If both are specified, the SELECT returns only rows with unique values, the number of unique value rows specified in the TOP clause.
If the column specified in the DISTINCT clause has rows that are NULL (contain no value), DISTINCT returns one row with NULL as a distinct (unique) value, as shown in the following examples:
SELECT DISTINCT FavoriteColors FROM Sample.Person
SELECT DISTINCT BY (FavoriteColors) Name,FavoriteColors FROM Sample.Person
ORDER BY FavoriteColors
A DISTINCT clause is not meaningful in an Embedded SQL simple query, because in this type of Embedded SQL a SELECT always returns only one row of data. However, an Embedded SQL cursor–based query can return multiple rows of data; in a cursor-based query, a DISTINCT clause returns only unique value rows.
DISTINCT and ORDER BY
The DISTINCT clause is applied before the ORDER BY clause. Therefore, the combination of DISTINCT and ORDER BY will first select an arbitrary row that satisfies the DISTINCT clause, then order those rows based on the ORDER BY clause.
DISTINCT and GROUP BY
DISTINCT and GROUP BY both group records by a specified field (or fields) and return one record for each unique value of that field. One significant difference between them is that DISTINCT calculates aggregate functions before grouping. GROUP BY calculates aggregate functions after grouping. This difference is shown in the following examples:
SELECT DISTINCT BY (ROUND(Age,-1)) Age,AVG(Age) AS AvgAge FROM Sample.Person
/* AVG(Age) returns average of all ages in table */
SELECT Age,AVG(Age) AS AvgAge FROM Sample.Person GROUP BY ROUND(Age,-1)
/* AVG(Age) returns an average age for each age group */
A DISTINCT clause can be specified with one or more aggregate function fields, though this is rarely meaningful because an aggregate function returns a single value. Thus the following example returns a single row:
SELECT DISTINCT BY (AVG(Age)) Name,Age,AVG(Age) FROM Sample.Person
If a DISTINCT clause contains aggregate functions as the only item or select-item is used with a GROUP BY clause, the DISTINCT clause is ignored. The intended combination of DISTINCT, aggregate function, and GROUP BY can be achieved using a subquery. For further details and program examples, refer to the GROUP BY clause reference page.
Letter Case and DISTINCT Optimization
DISTINCT groups together string values based on the collation type defined for the field. By default, string data type fields are defined with SQLUPPER collation, which is not case-sensitive. You can define the string collation default for the current namespace and specify a non-default field collation type when defining a field/property.
If the field/property collation type is SQLUPPER, grouped field values are returned in all uppercase letters. To group values by original letter case, or to display the returned values for a grouped field in their original letter case, use the %EXACT collation function. This is shown in the following examples, which assume that the Home_City field is defined with collation type SQLUPPER and contains the values ‘New York’ and ‘new york’:
SELECT DISTINCT BY (Home_City) Name,Home_City FROM Sample.Person
/* groups together Home_City values by their uppercase letter values
returns the name of each grouped city in uppercase letters.
Thus, 'NEW YORK' is returned. */
SELECT DISTINCT BY (Home_City) Name,%EXACT(Home_City) FROM Sample.Person
/* groups together Home_City values by their uppercase letter values
returns the name of each grouped city in original letter case.
Thus, 'New York' or 'new york' may be returned, but not both. */
SELECT DISTINCT BY (%EXACT(Home_City)) Name,Home_City FROM Sample.Person
/* groups together Home_City values by their original letter case
returns the name of each grouped city in original letter case.
Thus, both 'New York' and 'new york' are returned.
Optimization is not used. */
You can optimize query performance for queries that contain a DISTINCT clause by using the Management Portal. Select System Administration, Configuration, SQL and Object Settings, SQL. View and edit the GROUP BY and DISTINCT queries must produce original values option. (This optimization also works for the GROUP BY clause.) The default is “No”.
This default groups alphabetic values by their uppercase letter collation. This optimization takes advantage of indexes for the selected field(s). It is therefore only meaningful if an index exists for one or more of the selected fields. It collates field values as they are stored in the index; alphabetic strings are returned in all uppercase letters. You can set this system-wide option, then override it for specific queries by using the %EXACT collation function to preserve letter case.
For further details, refer to the SQL and Object Settings Pages listed in System Administration Guide.
You can also set this option system-wide using the $SYSTEM.SQL.Util.SetOption()Opens in a new tab method FastDistinct option. To determine the current setting, call $SYSTEM.SQL.CurrentSettings()Opens in a new tab, which displays the DISTINCT optimization turned on setting; the default is 1.
Other Uses of DISTINCT
-
Stream Field: DISTINCT operates on the OID of a stream field, not its actual data. Because all stream field OIDs are unique values, DISTINCT has no effect on actual stream field duplicate data values. DISTINCT BY (StreamField) reduces the number records where the stream field is NULL to one NULL record. For further details, see Storing and Using Stream Data (BLOBs and CLOBs).
-
Asterisk Syntax: The syntax DISTINCT * is legal, but not meaningful, because all rows, by definition, contain some distinct unique identifier. The syntax DISTINCT BY (*) is not legal.
-
Subquery: The use of a DISTINCT clause in a subquery is legal, but not meaningful, because a subquery returns a single value.
-
No Row Data Selected: The DISTINCT clause can be used with a SELECT that does not access any table data. If the SELECT contains a FROM clause, specifying DISTINCT results in one row contain these non-table values; if you do not specify DISTINCT (or TOP) the SELECT results in as many rows with identical values as the number of rows in the FROM clause table. If the SELECT does not contain a FROM clause, DISTINCT is legal but not meaningful. See FROM clause for more details.
-
Aggregate Function: A DISTINCT clause can be used within an aggregate function to select only distinct (unique) field values for inclusion in the aggregate. Unlike the SELECT DISTINCT clause, DISTINCT within an aggregate function does not include NULL as a distinct (unique) value. Note that the MAX and MIN aggregate functions parse DISTINCT clause syntax without error, but this syntax performs no operation.
DISTINCT and %ROWID
Specifying the DISTINCT keyword causes a cursor-based Embedded SQL query to not set the %ROWID variable. %ROWID is not set even when DISTINCT does not limit the rows returned. This is shown in the following example:
SET %ROWID=999
&sql(DECLARE EmpCursor CURSOR FOR
SELECT DISTINCT Name, Home_State
INTO :name,:state FROM Sample.Person
WHERE Home_State %STARTSWITH 'M')
&sql(OPEN EmpCursor)
QUIT:(SQLCODE'=0)
FOR { &sql(FETCH EmpCursor)
QUIT:SQLCODE
WRITE !,"RowID: ",%ROWID," row count: ",%ROWCOUNT
WRITE " Name=",name," State=",state
}
&sql(CLOSE EmpCursor)
This change of query behavior only applies to cursor-based Embedded SQL SELECT queries. Dynamic SQL SELECT queries and non-cursor Embedded SQL SELECT queries never set %ROWID.
DISTINCT and Transaction Processing
Specifying the DISTINCT keyword causes a query to retrieve all current data, including data that has not yet been committed by the current transaction. The transaction’s READ COMMITTED isolation mode parameter (if set) is ignored; all data is retrieved in READ UNCOMMITTED mode. For further details, refer to Transaction Processing.
Examples
The following query returns one row for each distinct Home_State value:
SELECT DISTINCT Home_State FROM Sample.Person
ORDER BY Home_State
The following query returns one row for each distinct Home_State value, but returns additional fields for that row. The row that is retrieved is not predictable:
SELECT DISTINCT BY (Home_State) %ID,Name,Home_State,Office_State FROM Sample.Person
ORDER BY Home_State
The following query returns one row for each distinct combination of Home_State and Office_State values. Depending on the data, it will either return more rows or the same number of rows as the previous example:
SELECT DISTINCT BY (Home_State,Office_State) %ID,Name,Home_State,Office_State FROM Sample.Person
ORDER BY Home_State,Office_State
The following query uses DISTINCT BY to return one row for each distinct Name length:
SELECT DISTINCT BY ($LENGTH(Name)) Name,$LENGTH(Name) AS lname
FROM Sample.Person
ORDER BY lname
The following query uses DISTINCT BY to return one row for each distinct first element of FavoriteColors %List values. It lists one distinct row with FavoriteColors NULL:
SELECT DISTINCT BY ($LIST(FavoriteColors,1)) Name,FavoriteColors,$LIST(FavoriteColors,1) AS FirstColor
FROM Sample.Person
The following query returns the first 20 distinct Home_State values retrieved from Sample.Person in ascending collation sequence order. The “top” rows reflect the ORDER BY clause sequencing of all of the rows in Sample.Person.
SELECT DISTINCT TOP 20 Home_State FROM Sample.Person ORDER BY Home_State
The following query uses DISTINCT in both the main query and in a WHERE clause subquery. It returns the first 20 distinct Home_State values in Sample.Person that are also in Sample.Employee. If the subquery DISTINCT was not provided, it would retrieve the distinct Home_State values in Sample.Person that match a random selection of Home_State values in Sample.Employee:
SELECT DISTINCT TOP 20 Home_State FROM Sample.Person
WHERE Home_State IN(SELECT DISTINCT TOP 20 Home_State FROM Sample.Employee)
ORDER BY Home_State
The following query returns the first 20 distinct FavoriteColor values. This reflects the ORDER BY clause sequencing of all of the rows in Sample.Person. The FavoriteColors field is known to have NULLs, so one distinct row with FavoriteColors NULL appears at the top of the collation sequence.
SELECT DISTINCT BY (FavoriteColors) TOP 20 FavoriteColors,Name FROM Sample.Person
ORDER BY FavoriteColors
Also note in the preceding example that because FavoriteColors is a list field, the collation sequence includes the element length byte. Thus distinct list values beginning with a three-letter element (RED) are listed before list values beginning with a four-letter element (BLUE).