I consider myself fairly competent in understanding and manipulating C-ish languages; it's not a problem for me to come up with an algorithm and implement it in any C-ish language.
I have tremendous difficulty writing SQL (in my specific case, MySQL) queries. For very simple queries, it isn't a problem, but for complex queries, I become frustrated not knowing where to start. Reading the MySQL documentation is difficult, mainly because the syntax description and explanation isn't organized very well.
For example, the SELECT
documentation is all over the map: it starts out with what looks like psuedo-BNF, but then (since the text for aggregate descriptions aren't clickable... like select_expr
) it quickly devolves into this frustrating exercise of trying to piece the syntax together yourself by having a number of browser windows open.
Enough whining.
I'd like to know how people, step by step, begin constructing a complex MySQL query. Here is a specific example. I have three tables below. I want to SELECT
a set of rows with the following characteristics:
From the userInfo
and userProgram
tables, I want to select the userName
, isApproved
, and modifiedTimestamp
fields and UNION
them into one set. From this set I want to ORDER
by modifiedTimestamp
taking the MAX(modifiedTimestamp)
for every user (i.e. there should be only one row with a unique userName
and the timestamp associated with that username should be as high as possible).
From the user
table, I want to match the firstName
and lastName
that is associated with the userName
so that it looks something like this:
+-----------+----------+----------+-------------------+
| firstName | lastName | userName | modifiedTimestamp |
+-----------+----------+----------+-------------------+
| JJ | Prof | jjprofUs | 1289914725 |
| User | 2 | user2 | 1289914722 |
| User | 1 | user1 | 1289914716 |
| User | 3 | user3 | 1289914713 |
| User | 4 | user4 | 1289914712 |
| User | 5 | user5 | 1289914711 |
+-----------+----------+----------+-------------------+
The closest I've got is a query that looks like this:
(SELECT firstName, lastName, user.userName, modifiedTimestamp
FROM user, userInfo
WHERE user.userName=userInfo.userName)
UNION
(SELECT firstName, lastName, user.userName, modifiedTimestamp
FROM user, userProgram
WHERE user.userName=userProgram.userName)
ORDER BY modifiedTimestamp DESC;
I feel like I'm pretty close but I don't know where to go from here or even if I'm thinking about this in the right way.
> user
+--------------------+--------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+--------------------+--------------+------+-----+---------+-------+
| userName | char(8) | NO | PRI | NULL | |
| firstName | varchar(255) | NO | | NULL | |
| lastName | varchar(255) | NO | | NULL | |
| email | varchar(255) | NO | UNI | NULL | |
| avatar | varchar(255) | YES | | '' | |
| password | varchar(255) | NO | | NULL | |
| passwordHint | text | YES | | NULL | |
| access | int(11) | NO | | 1 | |
| lastLoginTimestamp | int(11) | NO | | -1 | |
| isActive | tinyint(4) | NO | | 1 | |
+--------------------+--------------+------+-----+---------+-------+
> userInfo
+-------------------+------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------------------+------------+------+-----+---------+-------+
| userName | char(8) | NO | MUL | NULL | |
| isApproved | tinyint(4) | NO | | 0 | |
| modifiedTimestamp | int(11) | NO | | NULL | |
| field | char(255) | YES | | NULL | |
| value | text | YES | | NULL | |
+-------------------+------------+------+-----+---------+-------+
> userProgram
+-------------------+--------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------------------+--------------+------+-----+---------+-------+
| userName | char(8) | NO | PRI | NULL | |
| isApproved | tinyint(4) | NO | PRI | 0 | |
| modifiedTimestamp | int(11) | NO | | NULL | |
| name | varchar(255) | YES | | NULL | |
| address1 | varchar(255) | YES | | NULL | |
| address2 | varchar(255) | YES | | NULL | |
| city | varchar(50) | YES | | NULL | |
| state | char(2) | YES | MUL | NULL | |
| zip | char(10) | YES | | NULL | |
| phone | varchar(25) | YES | | NULL | |
| fax | varchar(25) | YES | | NULL | |
| ehsChildren | int(11) | YES | | NULL | |
| hsChildren | int(11) | YES 开发者_如何学编程| | NULL | |
| siteCount | int(11) | YES | | NULL | |
| staffCount | int(11) | YES | | NULL | |
| grantee | varchar(255) | YES | | NULL | |
| programType | varchar(255) | YES | | NULL | |
| additional | text | YES | | NULL | |
+-------------------+--------------+------+-----+---------+-------+
For what I understand from your question, you seem to need a correlated query, which would look like this:
(SELECT firstName, lastName, user.userName, modifiedTimestamp
FROM user, userInfo ui1
WHERE user.userName=userInfo.userName
AND modifiedtimestamp=(select max(modifiedtimestamp) from userInfo ui2 where ui1.userName=ui2.userName))
UNION
(SELECT firstName, lastName, user.userName, modifiedTimestamp
FROM user, userProgram up1
WHERE user.userName=userProgram.userName
AND modifiedtimestamp=(select max(modifiedtimestamp) from userProgram up2 where up1.userName=up2.userName))
ORDER BY modifiedTimestamp DESC;
So, do I proceed to get to this result? Key is: express clearly the information you want to retrieve, without taking mental shortcuts.
Step 1: Choose the fields I need in the different tables of my database. That's what is between SELECT and FROM. Seems obvious, but it becomes less obvious when it comes to aggregation function like sums or counts. In that case, you have to say, for example "I need the count of lines in userInfo for each firstName". See below in GROUP BY.
Step 2: Knowing the field you need, write the joins between the different corresponding tables. That's an easy one...
Step 3: Express your conditions. It can be easy, like if you want data from user for userName="RZEZDFGBH", or more complicated, like in your case: the way to formulate it so you can get the thing done, if you want only the most recent modifiedtimestamp, is "so that the modifiedtimestamp is equal to the most recent modifiedtimestamp" (that's where you can easily take a mental shortcut and miss the point)
Step 4: If you have aggregates, it's time to set the GROUP BY statement. For example, if you count all line in userInfo for each firstName, you would write "GROUP BY firstName":
SELECT firstName,count(*) FROM userInfo GROUP BY firstName
This gives you the number of entries in the table for each different firstName.
Step 5: HAVING conditions. These are conditions on the aggregates. In the previous example, if you wanted only the data for the firstName having more than 5 lines in the table, you could write SELECT firstName,count(*) FROM userInfo GROUP BY firstName HAVING count(*)>5
Step 6: Sort with ORDER BY. Pretty easy...
That's only a short summary. There is much, much more to discover, but it would be too long to write an entire SQL course here... Hope it helps, though!
As f00 says, it's simple(r) if you think of the data in terms of sets.
One of the issues with the question as it stands is that the expected output doesn't match the stated requirements - the description mentions the isApproved column, but this doesn't appear anywhere in either the query or the expected output.
What this illustrates is that the first step in writing a query is to have a clear idea of what you want to achieve. The bigger issue with the question as it stands is that this is not clearly described - instead, it moves from a sample table of expected output (which would be more helpful if we had corresponding samples of expected input data) straight into a description of how you intend to achieve it.
As I understand it, what you want to see is a list of users (by username, with their associated first and last names), together with the last time any associated record was modified on either the userInfo or userProgram tables.
(It isn't clear whether you want to see users who have no associated activity on either of these other tables - your supplied query implies not, otherwise the joins would be outer joins.)
So, you want a list of users (by username, with their associated first and last names):
SELECT firstName, lastName, userName
FROM user
together with a list of times that records were last modified:
SELECT userName, MAX(modifiedTimestamp)
...
on either the userInfo or userProgram tables:
...
FROM
(SELECT userName, modifiedTimestamp FROM userInfo
UNION ALL
SELECT userName, modifiedTimestamp FROM userProgram
) subquery -- <- this is an alias
...
by userName:
...
group by userName
These two sets of data need to be linked by their userName - so the final query becomes:
SELECT user.firstName, user.lastName, user.userName,
MAX(subquery.modifiedTimestamp) last_modifiedTimestamp
FROM user
JOIN
(SELECT userName, modifiedTimestamp FROM userInfo
UNION ALL
SELECT userName, modifiedTimestamp FROM userProgram
) subquery
ON user.userName = subquery.userName
GROUP BY user.userName
In most versions of SQL, this query would return an error as user.firstName
and user.lastName
are not included in the GROUP BY
clause, nor are they summarised.
MySQL allows this syntax - in other SQLs, since those fields are functionally dependant on userName, adding a MAX in front of each field or adding them to the grouping would achieve the same result.
A couple of additional points:
- UNION and UNION ALL are not identical - the former removes duplicates while the latter does not; this makes the former more processor-intensive. Since duplicates will be removed by the grouping, it is better to use UNION ALL.
- Many people will write this query as user joined to userInfo UNIONed ALL with user joined to userProgram - this is because many SQL engines can optimise this type of query more effectively. At this point, this represents premature optimisation.
There's a lot of good stuff here. Thanks to everyone who contributed. This is a quick summary of the things I found helpful as well as some additional thoughts in connecting building functions to building queries. I wish I could give everyone SO merit badges/points but I think that there can only be one (answer) so I'm picking Traroth based upon point total and personal helpfulness.
A function can be understood as three parts: input, process, output. A query can be understood similarly. Most queries look something like this:
SELECT stuff FROM data WHERE data is like something
The
SELECT
portion is the output. There are some capabilities for formatting the output here (i.e. usingAS
)The
FROM
portion is the input. The input should be seen as a pool of data; you will want to make this as specific as possible, using a variety of joins and subqueries that are appropriate.The
WHERE
portion is like the process, but there's a lot of overlap with the FROM portion. Both theFROM
andWHERE
portions can reduce the pool of data appropriately using a variety of conditions to filter out unwanted data (or to only included desired data). TheWHERE
portion can also help format the output.
Here's how I broke down the steps:
Start with thinking about what your output looks like. This stuff goes into the
SELECT
portion.Next, you want to define the set of data that you wish to work on. Traroth notes: "Knowing the field you need, write the joins between the different corresponding tables. That's an easy one..." It depends on what you mean by 'easy'. If you are new to writing queries, you will probably just default to writing inner joins (like I did). This is not always the best way to go. http://en.wikipedia.org/wiki/Join_(SQL) is a great resource to understanding the different kinds of joins possible.
As a part of the previous step think about smaller parts of that data set and build up to the complete data set you are interested in. In writing a function, you can write subfunctions to help express your process in a clearer manner. Similar to that, you can write subqueries. A huge tip from Mark Bannister in creating a subquery AND USING AN ALIAS. You will have to reconfigure your output to use this alias, but this is pretty key.
Last, you can use various methods to pare down your data set, removing data you're not interested in
One way to think about the data you are operating on is a giant 2-D matrix: JOIN
s make larger the horizontal aspect, UNION
s make larger the vertical aspect. All the other filters are designed to make this matrix smaller to be appropriate for your output. I don't know if there is a "functional" analogy to JOIN
, but UNION
is just adding the output of two functions together.
I realize, though, there are lots of ways that building query IS NOT like writing a function. For example, you can build and pare down your data set in both the FROM
and WHERE
areas. What was key for me was understanding joins and finding out how to create subqueries using aliases.
just learn to think in terms of sets - then it's simple :P
http://www.codinghorror.com/blog/2007/10/a-visual-explanation-of-sql-joins.html
You can't construct sql without understanding the data in the tables and the logical result required. There's no background given for what data the tables might look like and mean and the description of the results you're trying to gather doesn't make sense to me so I'm not going to venture a guess.
On the latter point... it's rare that you'd want a union of timestamp values multiple sources. Generally speaking when results like that are gathered it's generally for some sort of auditing/tracing. However, when you're discarding all information about the source of the timestamp and just computing a maximum you have... well what exactly?
Anyways, one or more examples of data and desired output and maybe something about the application and the whys is a must to make yourself clear.
To the extent I'll make any prediction about the shape of your eventual statement, (assuming your task will still be to get a single maximum timestamp per user) it's that it will look something like this:
select u.firstname, u.lastname, user_max_time.userName, user_max_time.max_time
from users u,
( select (sometable).userName, max((sometable).(timestamp column))
from (data of interest)
group by (sometable).userName) user_max_time
where u.userName = user_max_time.userName
order by max_time desc;
Your task here would then be to replace the ()s inside the the user_max_time subselect with something that makes sense and maps to your requirements. In terms of a general approach to complex sql, the major suggestion is to build the query from the innermost subselects back out (testing along the way to make sure performance is ok and you don't need intermediate tables).
Anyways, if you're having trouble, and can come back with examples, would be happy to help.
Cheers, Ben
精彩评论