Database:
Transacation# Items List
T1 butter
T1 jam
T2 butter
T3 bread
T3 ice cream
T4 butter
T4 jam
In the above table, Is it possible to run apriori association rule in mysql statement?
For example, the support of buys(T开发者_JAVA百科, butter) --> buys(T, jam) = 50%
because there are 4 transactions and T1, T4 satisfy "support" rule.
Can i just use a sql statement to find out such result?
Yes, you can use SQL to find the support of a single item. But if you want to find itemsets containing more than one item, it would be difficult.
For example, if you had transactions containing several items and you want to find the support of "jam" with "milk" and "bread" appearing together, then it is better to use an algorithm like Apriori, or a faster algorithm like FPGrowth.
I arrive at 66% for the sample data you gave? There are 3 transactions for "butter", and out of those only 2 included "jam".
I used the following test table.
create table transactions(
trans_no varchar(5) not null
,item varchar(20) not null
,primary key(trans_no, item)
);
insert into transactions(trans_no, item)
values ('T1', 'butter')
,('T1', 'jam')
,('T2', 'butter')
,('T3', 'bread')
,('T3', 'ice cream')
,('T4', 'butter')
,('T4', 'jam');
The following is my attempt at an answer. The inner select find all transactions that included "butter". For each such transaction, it also sets a flag (bought_jam) saying whether that transaction also included "jam". (The having clause excludes transactions that includes "jam" but not "butter").
In the outer select, I basically count all rows (the count corresponds to the number of transactions including butter), and sums the jam flag, which corresponds to the number of transactions including both butter and jam.
select sum(bought_jam) as jams_bought
,count(*) as num_trans
,100 * sum(bought_jam) / count(*) as correlation_pct
from (select trans_no
,max(case when item = 'jam' then 1 else 0 end) as bought_jam
from transactions
where item in('butter', 'jam')
group
by trans_no
having min(case when item = 'butter' then item end) = 'butter'
) butter_trans;
The query above gives the following result:
+-------------+-----------+-----------------+
| jams_bought | num_trans | correlation_pct |
+-------------+-----------+-----------------+
| 2 | 3 | 66.6667 |
+-------------+-----------+-----------------+
1 row in set (0.00 sec)
Let me know how this works out for you.
Edit:
The following query would give the same results, but is much easier to read. However, if the transactions table is very large, and the item = x
isn't very selective (returns lots of rows), this query would almost certainly be slower.
select count(t2.trans_no) as jams_bought
,count(*) as num_trans
,count(t2.trans_no) / count(*) as correlation_pct
from transactions t1
left join transactions t2 on(t2.trans_no = t1.trans_no and t2.item = 'jam')
where t1.item = 'butter';
精彩评论