I'm trying to accomplish a query that requires a calculated column using a subquery that passes the date reference via a variable. I'm not sure if I'm not "doing it right" but essentially the query never finishes and spins for minutes on end. This is my query:
select @groupdate:=date_format(order_date,'%Y-%m'), count(distinct customer_email) as num_cust,
(
select count(distinct cev.customer_email) as num_prev
from _pj_cust_email_view cev
inner join _pj_cust_email_view as prev_purch on (prev_purch.order_date < @groupdate) and (cev.customer_email=prev_purch.customer_email)
where cev.order_date > @groupdate
) as prev_cust_count
from _pj_cust_email_view
group by @groupdate;
Subquery has an inner join
accomplishes the self-join that only gives me the count of people that have purchased prior to the date in @groupdate
. The EXPLAIN
is below:
+----+----------------------+---------------------+------+---------------+-----------+---------+---------------------------+--------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-开发者_开发知识库---------------------+---------------------+------+---------------+-----------+---------+---------------------------+--------+---------------------------------+
| 1 | PRIMARY | _pj_cust_email_view | ALL | NULL | NULL | NULL | NULL | 140147 | Using temporary; Using filesort |
| 2 | UNCACHEABLE SUBQUERY | cev | ALL | IDX_EMAIL | NULL | NULL | NULL | 140147 | Using where |
| 2 | UNCACHEABLE SUBQUERY | prev_purch | ref | IDX_EMAIL | IDX_EMAIL | 768 | cart_A.cev.customer_email | 1 | Using where |
+----+----------------------+---------------------+------+---------------+-----------+---------+---------------------------+--------+---------------------------------+
And the structure of the table _pj_cust_email_view
is as such:
'_pj_cust_email_view', 'CREATE TABLE `_pj_cust_email_view` (
`order_date` varchar(10) CHARACTER SET utf8 DEFAULT NULL,
`customer_email` varchar(255) CHARACTER SET utf8 DEFAULT NULL,
KEY `IDX_EMAIL` (`customer_email`),
KEY `IDX_ORDERDATE` (`order_date`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1'
Again, as I said earlier, I'm not really sure that this is the best way to accomplish this. Any criticism, direction is appreciated!
Update
I've made a little progress, and I'm now doing the above procedurally by iterating through all known months instead of months in the database and setting the vars ahead of time. I don't like this still. This is what I've got now:
Sets the user defined vars
set @startdate:='2010-08', @enddate:='2010-09';
Gets total distinct emails in the given range
select count(distinct customer_email) as num_cust
from _pj_cust_email_view
where order_date between @startdate and @enddate;
Gets the total count of customers who had purchased prior to the given range
select count(distinct cev.customer_email) as num_prev
from _pj_cust_email_view cev
inner join _pj_cust_email_view as prev_purch on (prev_purch.order_date < @startdate) and (cev.customer_email=prev_purch.customer_email)
where cev.order_date between @startdate and @enddate;
Where @startdate
is set to the start of the month and @enddate
signifies the end of that month's range.
I really feel like this still can be done in one full query.
I don't think you need to use subqueries at all, nor do you need to iterate over months.
Instead, I recommend you create a table to store all months. Even if you prepopulate it with 100 years of months, it would only have 1200 rows in it, which is trivial.
CREATE TABLE Months (
start_date DATE,
end_date DATE,
PRIMARY KEY (start_date, end_date)
);
INSERT INTO Months (start_date, end_date)
VALUES ('2011-03-01', '2011-03-31');
Store the actual start and end dates, so you can use the DATE data type and index the two columns properly.
edit: I think I understand your requirement a bit better, and I've cleaned up this answer. The following query may be right for you:
SELECT DATE_FORMAT(m.start_date, '%Y-%m') AS month,
COUNT(DISTINCT cev.customer_email) AS current,
GROUP_CONCAT(DISTINCT cev.customer_email) AS current_email,
COUNT(DISTINCT prev.customer_email) AS earlier,
GROUP_CONCAT(DISTINCT prev.customer_email) AS earlier_email
FROM Months AS m
LEFT OUTER JOIN _pj_cust_email_view AS cev
ON cev.order_date BETWEEN m.start_date AND m.end_date
INNER JOIN Months AS mprev
ON mprev.start_date <= m.start_date
LEFT OUTER JOIN _pj_cust_email_view AS prev
ON prev.order_date BETWEEN mprev.start_date AND mprev.end_date
GROUP BY month;
If you create the following compound index in your table:
CREATE INDEX order_email on _pj_cust_email_view (order_date, customer_email);
Then the query has the best chance of being an index-only query, and will run a lot faster.
Below is the EXPLAIN optimization report from this query. Note type: index
for each table.
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: m
type: index
possible_keys: PRIMARY
key: PRIMARY
key_len: 6
ref: NULL
rows: 4
Extra: Using index; Using temporary; Using filesort
*************************** 2. row ***************************
id: 1
select_type: SIMPLE
table: mprev
type: index
possible_keys: PRIMARY
key: PRIMARY
key_len: 6
ref: NULL
rows: 4
Extra: Using where; Using index; Using join buffer
*************************** 3. row ***************************
id: 1
select_type: SIMPLE
table: cev
type: index
possible_keys: order_email
key: order_email
key_len: 17
ref: NULL
rows: 10
Extra: Using index
*************************** 4. row ***************************
id: 1
select_type: SIMPLE
table: prev
type: index
possible_keys: order_email
key: order_email
key_len: 17
ref: NULL
rows: 10
Extra: Using index
Here's some test data:
INSERT INTO Months (start_date, end_date) VALUES
('2011-03-01', '2011-03-31'),
('2011-02-01', '2011-02-28'),
('2011-01-01', '2011-01-31'),
('2010-12-01', '2010-12-31');
INSERT INTO _pj_cust_email_view VALUES
('ron', '2011-03-10'),
('hermione', '2011-03-15'),
('hermione', '2011-02-15'),
('hermione', '2011-01-15'),
('hermione', '2010-12-15'),
('neville', '2011-01-10'),
('harry', '2011-03-19'),
('harry', '2011-02-10'),
('molly', '2011-03-25'),
('molly', '2011-01-10');
Here's the result given that data, including the concatenated list of emails to make it easier to see.
+---------+---------+--------------------------+---------+----------------------------------+
| month | current | current_email | earlier | earlier_email |
+---------+---------+--------------------------+---------+----------------------------------+
| 2010-12 | 1 | hermione | 1 | hermione |
| 2011-01 | 3 | neville,hermione,molly | 3 | hermione,molly,neville |
| 2011-02 | 2 | hermione,harry | 4 | harry,hermione,molly,neville |
| 2011-03 | 4 | molly,ron,harry,hermione | 5 | molly,ron,hermione,neville,harry |
+---------+---------+--------------------------+---------+----------------------------------+
Although Bill has a nice query using multiple tables, this one does it with the SQL variables too, so no extra table. The inner query joins to your _pj_cust_email_view table and does a limit 10 to signify only going back 10 months from the current month. So, no hard-coding of dates, it is computed on the fly... if you want more or less months, just change the LIMIT clause.
By setting the @dt := as the LAST field in the internal query, only THEN does the date basis get assigned for the next record cycle to create your qualifying dates...
select justDates.FirstOfMonth,
count( distinct EMCurr.customer_Email ) UniqThisMonth,
count( distinct EMLast.customer_Email ) RepeatCustomers
from
( SELECT
@dt FirstOfMonth,
last_day( @dt ) EndOfMonth,
@dt:= date_sub(@dt, interval 1 month) nextCycle
FROM
(select @dt := date_sub( current_date(), interval dayofmonth( current_date())-1 day )) vars,
_pj_cust_email_view limit 10
) JustDates
join _pj_cust_email_view EMCurr
on EMCurr.order_Date between JustDates.FirstOfMonth and JustDates.EndOfMonth
left join _pj_cust_email_view EMLast
on EMLast.order_Date < JustDates.FirstOfMonth
and EMCurr.customer_Email = EMLast.customer_Email
group by
1
精彩评论