Main Page Content
Top Ten Sql
What is the "top ten" problem? Using SQL, how do you select only those rows that have one of the top ten values for a certain column?
Suppose we have a SAMPLETABLE
of people and scores.
The "top ten" problem here is how to select the people who have the top ten scores.
Here's an example:
PERSON | SCORE |
---|---|
Tom | 2 |
Dick | 15 |
Harry | 4 |
Winken | 3 |
Blinken | 11 |
Nod | 7 |
Dopey | 7 |
Sneezy | 21 |
Bashful | 17 |
Doc | 2 |
Dasher | 7 |
Dancer | 9 |
Comet | 7 |
Cupid | 11 |
Donner | 11 |
Blitzen | 13 |
Can't I just sort the data?
One way to find the top ten values in a column would be to select all rows and bring them back sorted in descending order by that column:
SELECT PERSON , SCORE FROM SAMPLETABLE ORDER BY SCORE DESC
PERSON | SCORE |
---|---|
Sneezy | 21 |
Bashful | 17 |
Dick | 15 |
Blitzen | 13 |
Blinken | 11 |
Donner | 11 |
Cupid | 11 |
Dancer | 9 |
Nod | 7 |
Dopey | 7 |
Dasher | 7 |
Comet | 7 |
Harry | 4 |
Winken | 3 |
Tom | 2 |
Doc | 2 |
Since the result set comes back sorted, all you need to do is use the top ten records.
Some front-end programs, like Sybase's Infomaker and Microsoft's Access, allow you to specify a parameter to limit the number of rows in the result set.
In Infomaker it's called maximum number of rows retrieved, but for this discussion, we'll use the Access term, TopValues.
Uh oh, what about ties?
Take another look at the sorted scores, this time with the results numbered like they do in races.
PERSON | SCORE | PLACE |
---|---|---|
Sneezy | 21 | 1st |
Bashful | 17 | 2nd |
Dick | 15 | 3rd |
Blitzen | 13 | 4th |
Blinken | 11 | 5th |
Donner | 11 | 5th |
Cupid | 11 | 5th |
Dancer | 9 | 8th |
Nod | 7 | 9th |
Dopey | 7 | 9th |
Dasher | 7 | 9th |
Comet | 7 | 9th |
Harry | 4 | 13th |
Winken | 3 | 14th |
Tom | 2 | 15th |
Doc | 2 | 16th |
Did you notice that there are actually several people tied for 9th place?
How would you interpret the "top ten" problem in light of the possibility of ties across 10th place?
Most people would agree that it makes sense to include ties.
Note: No, the SAMPLETABLE
does not contain a PLACE
column (if it did, you could just select all rows where PLACE < 10
, and then where would this tutorial be?). The place values are for illustrative purposes only, and no, it is not easy to generate PLACE
values into a table -- that's another SQL tutorial for another day.
Okay, I understand the problem -- what's the solution?
If you don't have a TopValues feature.....
If your front end program doesn't have a TopValues feature, you can just select the entire table with an ORDER BY
, then inspect the results (watching for ties), and discard all records past the place you're looking for -- or past the last of the ties.
This is a manual procedure but it's simple.
Once you have determined what the cutoff point is, you would then re-run the query with an absolute value like this:
SELECT PERSON , SCORE FROM SAMPLETABLE WHERE SCORE >= 7 ORDER BY SCORE DESC
Why TopValues doesn't work well
If your front end program does have a TopValues feature, go ahead and use it -- but be sure to set it high enough to get at least one record beyond any ties for the last place you're looking for.
This, too, is a manual procedure, and here's the problem -- you may have to repeat it a few times, or turn off the TopValues feature at least once.
Why is this?
In the last example above, you would have to set TopValues to 13 in order to get past the ties for 9th place. The problem here, of course, is that you don't know in advance how high to go, whether 13 is high enough to snag all the ties across 10th place.
Imagine the case where every person has the same score, i.e. everybody is tied for first place. You would have to set TopValues to a number that is higher than the number of records in the table. With luck, you might happen to notice that the first 10 people had the same score, and realize that something was up.
In the worst case scenario, though, the top few scores are all different, but there's a tie that starts at 10th place and goes on from there. You probably wouldn't realize that you needed to increase TopValues. Therefore it's always a good idea to set it substantially higher than you need to.
This, in turn, means that even when you do use the TopValues feature, you will still have to inspect the results -- you have to make sure you got past any ties, and discard any records past the place you're looking for or past the last of the ties.
But with or without the TopValues feature, sorting the table is simple and efficient -- although it does require manual intervention.
Isn't there an automated way?
Yes, with the "top ten" SQL --
SELECT PERSON , SCORE FROM SAMPLETABLE XXX WHERE 10 > ( SELECT COUNT(*) FROM SAMPLETABLE WHERE SCORE > XXX.SCORE )
Let's try to understand this in stages.
First, the query uses a correlated subquery. Basically, a correlated subquery is an inner query that must be evaluated for each row of the outer query. The XXX
identifies the table in the outer query, so that in the subquery, the XXX.SCORE
identifies that column as belonging to the outer query's table.
So for each row of the outer query, the subquery counts the number of rows with a larger score than that of the outer row under consideration.
If there are fewer than 10 rows with a larger score, then the outer row satisfies the outer WHERE
clause, i.e. it's in the top ten.
Are ties across 10th place included by the top ten SQL?
Yes.
PERSON | SCORE |
---|---|
Sneezy | 21 |
Bashful | 17 |
Dick | 15 |
Blitzen | 13 |
Blinken | 11 |
Donner | 11 |
Cupid | 11 |
Dancer | 9 |
Nod | 7 |
Dopey | 7 |
Dasher | 7 |
Comet | 7 |
To see why this is so, let's take Dasher as a test case.
Count the number of rows that have a score greater than 7, which is Dasher's score.
The answer is eight rows, so Dasher must be in the top ten.
The top ten SQL automatically includes all ties across 10th place.
What about efficiency?
Clearly, a correlated subquery could cause a performance problem -- the database has to look at all rows of a table for every row.
On the face of it, this looks like it will have "n-squared" efficiency, which means processing requirements go up by the square of the number of rows.
Thankfully, the difference is not too noticeable for tables with a modest number of rows. If your table is large and the query seems to perform poorly, try declaring a descending index on the SCORE
column.
For very large tables, you should be on an industrial-strength database server (all of which have a built-in optimizer for situations like this).
For further information...
Please feel free to send your SQL questions to
Rudy Limeback (r937@interlog.com)