Top Ten Sql

Posted on 03 Jun 1999

in Backend

by rudy limeback (r937)

Rated 4.22 (Ratings: 11)

Want more?

More articles in Backend

rudy limeback

Member info

User since: 14 Dec 1998

Articles written: 12

What is the "top ten" problem? Using SQL, how do you select only those rows that have one of the top ten values for a certain column?

Suppose we have a SAMPLETABLE of people and scores.

The "top ten" problem here is how to select the people who have the top ten scores.

Here's an example:

PERSON	SCORE
Tom	2
Dick	15
Harry	4
Winken	3
Blinken	11
Nod	7
Dopey	7
Sneezy	21
Bashful	17
Doc	2
Dasher	7
Dancer	9
Comet	7
Cupid	11
Donner	11
Blitzen	13

Can't I just sort the data?

One way to find the top ten values in a column would be to select all rows and bring them back sorted in descending order by that column:


SELECT PERSON
     , SCORE
  FROM SAMPLETABLE
 ORDER BY SCORE DESC

PERSON	SCORE
Sneezy	21
Bashful	17
Dick	15
Blitzen	13
Blinken	11
Donner	11
Cupid	11
Dancer	9
Nod	7
Dopey	7
Dasher	7
Comet	7
Harry	4
Winken	3
Tom	2
Doc	2

Since the result set comes back sorted, all you need to do is use the top ten records.

Some front-end programs, like Sybase's Infomaker and Microsoft's Access, allow you to specify a parameter to limit the number of rows in the result set.

In Infomaker it's called maximum number of rows retrieved, but for this discussion, we'll use the Access term, TopValues.

Uh oh, what about ties?

Take another look at the sorted scores, this time with the results numbered like they do in races.

PERSON	SCORE	PLACE
Sneezy	21	1st
Bashful	17	2nd
Dick	15	3rd
Blitzen	13	4th
Blinken	11	5th
Donner	11	5th
Cupid	11	5th
Dancer	9	8th
Nod	7	9th
Dopey	7	9th
Dasher	7	9th
Comet	7	9th
Harry	4	13th
Winken	3	14th
Tom	2	15th
Doc	2	16th

Did you notice that there are actually several people tied for 9th place?

How would you interpret the "top ten" problem in light of the possibility of ties across 10th place?

Most people would agree that it makes sense to include ties.

Note: No, the SAMPLETABLE does not contain a PLACE column (if it did, you could just select all rows where PLACE < 10, and then where would this tutorial be?). The place values are for illustrative purposes only, and no, it is not easy to generate PLACE values into a table -- that's another SQL tutorial for another day.

Okay, I understand the problem -- what's the solution?

If you don't have a TopValues feature.....

If your front end program doesn't have a TopValues feature, you can just select the entire table with an ORDER BY, then inspect the results (watching for ties), and discard all records past the place you're looking for -- or past the last of the ties.

This is a manual procedure but it's simple.

Once you have determined what the cutoff point is, you would then re-run the query with an absolute value like this:


SELECT PERSON
     , SCORE
  FROM SAMPLETABLE
 WHERE SCORE >= 7
 ORDER BY SCORE DESC

Why TopValues doesn't work well

If your front end program does have a TopValues feature, go ahead and use it -- but be sure to set it high enough to get at least one record beyond any ties for the last place you're looking for.

This, too, is a manual procedure, and here's the problem -- you may have to repeat it a few times, or turn off the TopValues feature at least once.

Why is this?

In the last example above, you would have to set TopValues to 13 in order to get past the ties for 9th place. The problem here, of course, is that you don't know in advance how high to go, whether 13 is high enough to snag all the ties across 10th place.

Imagine the case where every person has the same score, i.e. everybody is tied for first place. You would have to set TopValues to a number that is higher than the number of records in the table. With luck, you might happen to notice that the first 10 people had the same score, and realize that something was up.

In the worst case scenario, though, the top few scores are all different, but there's a tie that starts at 10th place and goes on from there. You probably wouldn't realize that you needed to increase TopValues. Therefore it's always a good idea to set it substantially higher than you need to.

This, in turn, means that even when you do use the TopValues feature, you will still have to inspect the results -- you have to make sure you got past any ties, and discard any records past the place you're looking for or past the last of the ties.

But with or without the TopValues feature, sorting the table is simple and efficient -- although it does require manual intervention.

Isn't there an automated way?

Yes, with the "top ten" SQL --


SELECT PERSON
     , SCORE
  FROM SAMPLETABLE XXX
 WHERE 10 > (
   SELECT COUNT(*)
     FROM SAMPLETABLE
     WHERE SCORE > XXX.SCORE )

Let's try to understand this in stages.

First, the query uses a correlated subquery. Basically, a correlated subquery is an inner query that must be evaluated for each row of the outer query. The XXX identifies the table in the outer query, so that in the subquery, the XXX.SCORE identifies that column as belonging to the outer query's table.

So for each row of the outer query, the subquery counts the number of rows with a larger score than that of the outer row under consideration.

If there are fewer than 10 rows with a larger score, then the outer row satisfies the outer WHERE clause, i.e. it's in the top ten.

Are ties across 10th place included by the top ten SQL?

Yes.

PERSON	SCORE
Sneezy	21
Bashful	17
Dick	15
Blitzen	13
Blinken	11
Donner	11
Cupid	11
Dancer	9
Nod	7
Dopey	7
Dasher	7
Comet	7

To see why this is so, let's take Dasher as a test case.

Count the number of rows that have a score greater than 7, which is Dasher's score.

The answer is eight rows, so Dasher must be in the top ten.

The top ten SQL automatically includes all ties across 10th place.

What about efficiency?

Clearly, a correlated subquery could cause a performance problem -- the database has to look at all rows of a table for every row.

On the face of it, this looks like it will have "n-squared" efficiency, which means processing requirements go up by the square of the number of rows.

Thankfully, the difference is not too noticeable for tables with a modest number of rows. If your table is large and the query seems to perform poorly, try declaring a descending index on the SCORE column.

For very large tables, you should be on an industrial-strength database server (all of which have a built-in optimizer for situations like this).

For further information...

Please feel free to send your SQL questions to

Rudy Limeback (r937@interlog.com)

My involvement with evolt.org goes back to 1998, as one of the original founders. I'm an SQL consultant who dabbled in web development for several years, in the "golden age" between HTML 2 and XHTML 1.1. My web sites are r937.com and rudy.ca. I live in Toronto, Canada, and you are all cordially invited to come and visit and play a round of frisbee golf with me.

Start of page header

Other Fine Evolt.org Sites

Navigation Starts

Submit

Article Categories

Highest rated articles

Help Support evolt.org

Main Page Content