GEOG 868
Spatial Database Management

Subqueries

PrintPrint

Subqueries

There comes a time when every SQL developer has a problem that is too difficult to solve using only the methods we've discussed so far. In our baseball stats database, difficulty arises when you consider the fact that players who switch teams mid-season have a separate row for each team in the STATS table. For example, Mark McGwire started the 1997 season with OAK, before being traded to and spending the rest of that season with STL. The STATS table contains two McGwire-1997 rows; one for his stats with OAK and one for his stats with STL. 

If you wanted to identify each player's best season (in terms of batting average, home runs or runs batted in), you wouldn't be able to do a straight GROUP BY on the player's name or ID because that would not account for the mid-season team switchers. What's needed in this situation is a multistep approach that first computes each player's yearly aggregated stats, then identifies the maximum value in that result set.

A novice's approach to this problem would be to output the results from the first query to a new table, then build a second query on top of the table created by the first. The trouble with this approach is that it requires re-running the first query each time the STATS table changes.

A more ideal solution to the problem can be found through the use of a subquery. Let's have a look at how to build a subquery in Access.

A. Building a query on top of another saved query

  1. Create a new query and add the STATS table to the design canvas.
  2. Click on the Totals button to add the Total row to the design grid.
  3. Bring the following fields down into the design grid: PLAYER_ID, YEAR, AB, HITS, HR, RBI, and TEAM.
  4. Set the value in the Total cell to Group By for the PLAYER_ID and YEAR fields.
  5. Set the value in the Total cell to Sum for the AB, HITS, HR, and RBI fields.
  6. Set the value in the Total cell to Count for the TEAM field. These settings specify that we want to group by the player ID and year, sum the player's stats across each player/year grouping, and include a count of the number of teams the player played for each year.
  7. Run the query and confirm that it produces 227 records. (Compare that against the number of records in the STATS table. Do you understand why there is a difference?)
  8. Save the query as STATS_AGGR.
  9. Now, create a new query. In the Show Table dialog, double-click on PLAYERS to add it to the design canvas.
  10. Next, in the Show Table dialog, switch from the Tables tab to the Queries tab and double-click on STATS_AGGR to add it to the design canvas. Let's identify each player's best season in terms of home runs.
  11. Again, click on the Totals button.
  12. Bring down to the design grid the FIRST_NAME and LAST_NAME fields from the PLAYERS table and SumOfHR from the STATS_AGGR query.
  13. Confirm that the value in the Total cell is set to Group By for the FIRST_NAME and LAST_NAME fields.
  14. Set the value in the Total cell for the SumOfHR field to Max.
  15. Save this query as QueryBasedOnAQuery.

B. Building a query within another query.

  1. Switch to the SQL View of the QueryBasedOnAQuery query (sorry about that). Note that an inner join is conducted between PLAYERS and STATS_AGGR. When Access executes this query, it needs to first evaluate the SQL stored in the STATS_AGGR query before it can join that result set to PLAYERS. In your mind, you can imagine replacing the STATS_AGGR part of the PLAYERS INNER JOIN STATS_AGGR expression with the SQL stored in STATS_AGGR. In fact, our next step will be to literally replace STATS_AGGR with its SQL code.
  2. Re-open the STATS_AGGR query and switch to SQL View.
  3. Highlight and copy the SQL (leave behind the trailing semi-colon).
  4. Return to the SQL View of the QueryBasedOnAQuery query.
  5. Remove the STATS_AGGR from the PLAYERS INNER JOIN STATS_AGGR and paste in the code on your clipboard.
  6. Insert parentheses around the code you just pasted and add AS STATS_AGG after the closing parenthesis. Note that we're giving the subquery an alias of STATS_AGG to avoid confusion with the saved query with the name STATS_AGGR.
  7. Replace the query's references to STATS_AGGR with STATS_AGG, (there is one reference to STATS_AGGR in the SELECT clause and one in the ON clause). Your query should look like this:
    SELECT PLAYERS.FIRST_NAME, PLAYERS.LAST_NAME, Max(STATS_AGG.SumOfHR) AS MaxOfSumOfHR
        FROM PLAYERS INNER JOIN (SELECT STATS.PLAYER_ID, STATS.YEAR, Sum(STATS.AB) AS SumOfAB, Sum(STATS.HITS) AS SumOfHITS, Sum(STATS.HR) AS SumOfHR, Sum(STATS.RBI) AS SumOfRBI, Count(STATS.TEAM) AS CountOfTEAM
        FROM STATS
        GROUP BY STATS.PLAYER_ID, STATS.YEAR) AS STATS_AGG
        ON PLAYERS.PLAYER_ID = STATS_AGG.PLAYER_ID
        GROUP BY PLAYERS.FIRST_NAME, PLAYERS.LAST_NAME;
  8. Test the query. It should report the highest single-season home run total for each player.

This approach is a bit more complicated than the first, since it cannot be carried out with the graphical Query Builder. However, the advantage is that all of the code involved can be found in one place, and you don't have a second intermediate query cluttering the database's object list.

This completes the material on retrieving data from an RDBMS using SELECT queries. In the next section, you'll be posed a number of questions that will test your ability to write SELECT queries on your own.