开发者

Data Mining Operation using SQL Query (Fuzzy Apriori Algorithm) - Coding it using SQL

开发者 https://www.devze.com 2023-01-28 01:54 出处:网络
So I have this Table: Trans_IDNameFuzzy_ValueTotal_Item 100I10.333333333 100I20.333333333 100I50.333333333

So I have this Table:

Trans_ID    Name    Fuzzy_Value    Total_Item  
100          I1  0.33333333        3  
100          I2  0.33333333        3  
100          I5  0.33333333        3  
200          I2  0.5               2  
200          I5  0.5               2  
300          I2  0.5               2  
300          I3  0.5               2  
400          I1  0.33333333        3  
400          I2  0.33333333        3  
400          I4  0.33333333        3  
500          I1  0.5               2  
500          I3  0.5               2  
600          I2  0.5               2  
600          I3  0.5               2  
700          I1  0.5               2  
700          I3  0.5               2  
800          I1  0.25              4  
800          I2  0.25              4  
800          I3  0.25              4  
800          I5  0.25              4  
900          I1  0.33333333        3  
900          I2  0.33333333        3  
900          I3  0.33333333        3  
1000         I1  0.2               5  
1000         I2  0.2               5  
1000         I4  0.2               5  
1000         I6  0.2               5  
1000         I8  0.2               5  

And two blank Tables:

Table  ITEMSET
  
"ITEM_SET" "Support" 



Table Confidence
  
"ANTECEDENT" "CONSEQUENT" 

I need to find FUZZY value for each item that occurs in each transaction:

I1 = Sum of (Fuzzy_Value from item I1 in trans 100 until 1000 which is trans: 100,400,500,700,800,900,1000)/Total Trans  
-> (.33333333+0.33333333+0.5+0.5+0.25+0.33333333+0.2)/10 = 0.244999999

  
I2 = Sum of (Fuzzy_Value from item I2 in trans 100 - 1000 which is trans:100,200,300,400,600,800,900,1000)/Total Trans  
-> (0.33333333+0.5+0.5+0.33333333+0.5+0.25+0.33333333)/10 = 0.274999999


I3 -> 0.258333333  
I4 -> 0.103333333  
I5 -> 0.058333333    
I6 -> 0.02    
I8 -> 0.02    

E.g., I use minimum Support 10% -> 0.1

I need to remove I5,I6,I8 since it's value < 0.1 => prune step

then store:

I1=0.244999999, I2=0.274999999, I3=0.258333333,I4=0.103333333  on new table 'ITEMSET' 

2 COMBINATIONS

NOTE: This is the basic 1st step after this most likely need to use repeat or recursive, since the process will keep going on until no other 开发者_开发知识库item combination is possible

then from what's left I need to find K+1 itemset (which is 2 combination itemset) => join step

{I1,I2} =Sum of (Fuzzy_Value from item I1 + I2 in trans 100 - 1000 which is trans:100,400,800,900,1000)/Total Trans 
->(0.666666667+0.666666667+0.5+0.666666667+0.4)/9 = 0.29

*do the same for the rest*
{I1,I3} =(1+1+0.5+0.666666667)/9 = 0.316666667
{I1,I4} =(0.666666667+0.4)/9 = 0.106666667
{I2,I3} =(1+1+0.5+0.666666667)/9 = 0.316666667
{I2,I4} =(1+0.666666667+0.4)/9 =0.206666667
{I3,I4} =0  

Then Do another Prune Step removing less than 0.1 value which is {I3,I4}

Store {I1,I2} = 0.29, {I1,I3} = 0.316666667, {I1,I4} =0.106666667, {I2,I3} = 0.316666667, {I2,I4} = 0.206666667  AT "ITEMSET" TABLE 

3 COMBINATION

After that Do another JOIN STEP combining itemset that pass pruning

{I1,I2,I3} = Sum of (Fuzzy_Value from item I1 + I2 +I3 in trans 100 - 1000 which is trans:800,900)/Total Trans  
-> 0.75+1 = 0.175  
**Same for the rest**  
{I1,I2,I4} = 1+0.6 = 0.16  
{I2,I3,I4} = 0  

Do another Prune Step removing less than 0.1 value which is {I1,I3,I4}

Store {I1,I2,I3} = 0.176 AND {I1,I2,I4} = 0,16 AT "ITEMSET" TABLE  

4 COMBINATION

Combine itemset that pass pruning K+4 (4 combination)

{I1,I2,I3,I4} = 0

**since no transaction containing this item

after process stop since there's no possible combination left


At this point, ITEMSET database have:

ITEM_SET           Support  
{I1}               0.244999999
{I2}               0.274999999     
{I3}               0.258333333    
{I4}               0.103333333  
{I1,I2}            0.29    
{I1,I3}            0.316666667  
{I1,I4}            0.106666667  
{I2,I3}            0.316666667  
{I2,I4}            0.206666667  
{I1,I2,I3}         0.176  
{I1,I2,I4}         0,16  

How do I code that in sql? Thank you very much!

Note: You can add another table as needed.


Step 1:

CREATE TABLE ITEMSET
SELECT Name, SUM(Fuzzy_Value)/COUNT(*) Fuzzy_Value
FROM trans
GROUP BY ID
HAVING ROUND(SUM(Fuzzy_Value), 1) >= 0.1

Note the ROUND() function - it's important, because you have values like .33333 that don't sum in a happy way.

Step 2:

ALTER TABLE ITEMSET ADD INDEX (Name)

SELECT a.Name Name1, b.Name Name2, SUM(Fuzzy_Value)/COUNT(*) Fuzzy_Value
FROM ITEMSET a JOIN ITEMSET b ON (a.Name != b.Name)
GROUP BY a.Name, b.Name
HAVING ROUND(SUM(Fuzzy_Value), 1) >= 0.1

Opps: I just noticed that you asked this half a year ago, so I guess there is no point in continuing. If you still need this answer leave a comment.


Try the following link use roll-up combinations oracle doc


Here is a bit different approach to this question. It differs because there is no writing to tables but giving final result that then can be written according to conditions into one or more tables. The first cte named "tbl" is just used for sample data creation. The second one, "name_ids" is later used to generate different combinations of 2, 3 or more T_NAMEs (innermost query of step2, 3, 4) The third one, "id_names" serves as a filtering dataset for different combinations that are still active and that have at least one common T_ID All of the cte-s data are shown in the code along with the final result at the end.
Tested with Oracle 11g

WITH
    tbl AS
        (
            Select  100 "T_ID", 'I1' "T_NAME",      0.33333333 "FUZZY_VALUE",       3 "TOTAL_ITEM"      From DUAL  UNION ALL
            Select  100,        'I2',               0.33333333,                     3                   From Dual  UNION ALL
            Select  100,        'I5',               0.33333333,                     3                   From Dual  UNION ALL
            Select  200,        'I2',               0.5,                            2                   From Dual  UNION ALL
            Select  200,        'I5',               0.5,                            2                   From Dual  UNION ALL
            Select  300,        'I2',               0.5,                            2                   From Dual  UNION ALL
            Select  300,        'I3',               0.5,                            2                   From Dual  UNION ALL
            Select  400,        'I1',               0.33333333,                     3                   From Dual  UNION ALL
            Select  400,        'I2',               0.33333333,                     3                   From Dual  UNION ALL
            Select  400,        'I4',               0.33333333,                     3                   From Dual  UNION ALL
            Select  500,        'I1',               0.5,                            2                   From Dual  UNION ALL
            Select  500,        'I3',               0.5,                            2                   From Dual  UNION ALL
            Select  600,        'I2',               0.5,                            2                   From Dual  UNION ALL
            Select  600,        'I3',               0.5,                            2                   From Dual  UNION ALL
            Select  700,        'I1',               0.5,                            2                   From Dual  UNION ALL
            Select  700,        'I3',               0.5,                            2                   From Dual  UNION ALL
            Select  800,        'I1',               0.25,                           4                   From Dual  UNION ALL
            Select  800,        'I2',               0.25,                           4                   From Dual  UNION ALL
            Select  800,        'I3',               0.25,                           4                   From Dual  UNION ALL
            Select  800,        'I5',               0.25,                           4                   From Dual  UNION ALL
            Select  900,        'I1',               0.33333333,                     3                   From Dual  UNION ALL
            Select  900,        'I2',               0.33333333,                     3                   From Dual  UNION ALL
            Select  900,        'I3',               0.33333333,                     3                   From Dual  UNION ALL
            Select  1000,       'I1',               0.2,                            5                   From Dual  UNION ALL
            Select  1000,       'I2',               0.2,                            5                   From Dual  UNION ALL
            Select  1000,       'I4',               0.2,                            5                   From Dual  UNION ALL
            Select  1000,       'I6',               0.2,                            5                   From Dual  UNION ALL
            Select  1000,       'I8',               0.2,                            5                   From Dual 
        ),
--  **************************************************************************************************************************
    name_ids AS
        (   Select T_NAME, Count(T_NAME) OVER(Order By T_NAME ROWS BETWEEN UNBOUNDED PRECEDING And CURRENT ROW) "T_NAME_ID", LISTAGG(T_ID, ', ') WITHIN GROUP (Order By T_ID) "NAME_IDS"
            From   tbl
            Group By T_NAME ),
--  -----------------------------------------------------------
--  R e s u l t   f o r   name_ids
--
--  T_NAME  T_NAME_ID NAME_IDS                                   
--  ------ ---------- -----------------------------------------
--  I1              1 100, 400, 500, 700, 800, 900, 1000       
--  I2              2 100, 200, 300, 400, 600, 800, 900, 1000  
--  I3              3 300, 500, 600, 700, 800, 900             
--  I4              4 400, 1000                                
--  I5              5 100, 200, 800                           
--  I6              6 1000                                   
--  I8              7 1000                                      
--
--  -----------------------------------------------------------
    id_names AS
        (   Select T_ID, LISTAGG(T_NAME, ', ') WITHIN GROUP (Order By T_NAME) "ID_NAMES", COUNT(DISTINCT T_ID) OVER() "TOT_NUM_TRANS"
            From     tbl
            Group By T_ID   ),
--  -----------------------------------------------------------
--  R e s u l t   f o r   id_names
--
--        T_ID ID_NAMES                           TOT_NUM_TRANS
--  ---------- --------------------------------   ------------
--         100 I1, I2, I5                                   10
--         200 I2, I5                                       10     
--         300 I2, I3                                       10
--         400 I1, I2, I4                                   10
--         500 I1, I3                                       10
--         600 I2, I3                                       10
--         700 I1, I3                                       10
--         800 I1, I2, I3, I5                               10
--         900 I1, I2, I3                                   10
--        1000 I1, I2, I4, I6, I8                           10
--  ---------------------------------------------------------------------------------------------
    step1 AS
        (   Select 
                1 "STEP", T_NAME "T_NAME", i.TOT_NUM_TRANS "TOT_NUM_TRANS", 
                Sum(FUZZY_VALUE) "FUZZ_SUM", Round(Sum(FUZZY_VALUE / i.TOT_NUM_TRANS), 8) "SUPPORT", 
                CASE WHEN Sum(FUZZY_VALUE / i.TOT_NUM_TRANS) < 0.1 THEN 'OUT' ELSE 'IN' END "STATUS"
            From    tbl
            Inner  Join (Select Max(TOT_NUM_TRANS) "TOT_NUM_TRANS" From id_names) i ON(1=1)
            Group By T_NAME, i.TOT_NUM_TRANS
        ),
--  -----------------------------------------------------------
--  R e s u l t   f o r   step1
--
--        STEP T_NAME TOT_NUM_TRANS   FUZZ_SUM    SUPPORT STATUS
--  ---------- ------ ------------- ---------- ---------- ------
--           1 I1                10 2.44999999       .245 IN     
--           1 I2                10 2.94999999       .295 IN     
--           1 I3                10 2.58333333  .25833333 IN     
--           1 I4                10  .53333333  .05333333 OUT    
--           1 I5                10 1.08333333  .10833333 IN     
--           1 I6                10         .2        .02 OUT    
--           1 I8                10         .2        .02 OUT 
--  -------------------------------------------------------------------------------------------------------------------------
    step2 AS
        (   Select STEP, T_NAME, TOT_NUM_TRANS, FUZZ_SUM, SUPPORT, STATUS
            From
                (
                    Select  STEP, S_NAME "T_NAME", TOT_NUM_TRANS, Sum(FUZZY_VALUE) "FUZZ_SUM", Sum(SUPPORT) "SUPPORT", CASE WHEN Sum(SUPPORT) < 0.1 THEN 'OUT' ELSE 'IN' END "STATUS"
                    From    (   Select  RWN, STEP, T_NAME "S_NAME", FUZZY_VALUE, TOT_NUM_TRANS, SUPPORT   
                                From    (   Select  ROW_NUMBER() OVER (PARTITION BY s.T_ID, t.T_NAME Order By s.T_ID, t.T_NAME) "RWN", s.STEP, s.T_NAME, i.TOT_NUM_TRANS "TOT_NUM_TRANS",   s.T_ID, t.FUZZY_VALUE,  Round(t.FUZZY_VALUE / i.TOT_NUM_TRANS, 8) "SUPPORT"
                                                From(
                                                        Select
                                                            n.STEP, n.T_NAME, i.T_ID
                                                        From
                                                            id_names i
                                                        Inner Join
                                                            (
                                                                Select 2 "STEP", n1.T_NAME || ', ' || n2.T_NAME "T_NAME" 
                                                                From name_ids n1
                                                                Inner Join
                                                                    name_ids n2 ON(n1.T_NAME <>  n2.T_NAME  And n1.T_NAME_ID < n2.T_NAME_ID)
                                                            ) n ON(1=1)
                                                    ) s
                                            Inner Join 
                                                tbl t ON(t.T_ID = s. T_ID And InStr(s.T_NAME, t.T_NAME) > 0)
                                            Inner Join
                                                id_names i ON(i.T_ID = t.T_ID And InStr(i.ID_NAMES, t.T_NAME) > 0 And 
                                                                InStr(i.ID_NAMES, SubStr(s.T_NAME, 1, 2)) > 0 And InStr(i.ID_NAMES, SubStr(s.T_NAME, 5, 2)) > 0)
                                            Inner Join 
                                                step1 s1 ON(INSTR((Select LISTAGG(T_NAME, ', ') WITHIN GROUP (Order By STATUS) From step1 Where STATUS = 'IN' Group By STATUS), SubStr(s.T_NAME, 1, 2)) > 0 And 
                                                            INSTR((Select LISTAGG(T_NAME, ', ') WITHIN GROUP (Order By STATUS) From step1 Where STATUS = 'IN' Group By STATUS), SubStr(s.T_NAME, InStr(s.T_NAME, ', ', 1) + 2, 2)) > 0)
                                        )
                                Where RWN = 1
                            )
                    Group By STEP, S_NAME, TOT_NUM_TRANS
                )
        ),
--  -----------------------------------------------------------
--  R e s u l t   f o r   step2
--
--        STEP T_NAME TOT_NUM_TRANS   FUZZ_SUM    SUPPORT STATUS
--  ---------- ------ ------------- ---------- ---------- ------
--           2 I1, I2            10 2.89999998  .28999998 IN     
--           2 I1, I3            10 2.58333333  .25833333 IN     
--           2 I1, I5            10  .58333333  .05833333 OUT    
--           2 I2, I3            10          2         .2 IN     
--           2 I2, I5            10          1         .1 IN
--  -------------------------------------------------------------------------------------------------------------------------
    step3 AS
        (   Select STEP, T_NAME, TOT_NUM_TRANS, FUZZ_SUM, SUPPORT, STATUS
            From
                (
                    Select  STEP, S_NAME "T_NAME", TOT_NUM_TRANS, Sum(FUZZY_VALUE) "FUZZ_SUM", Sum(SUPPORT) "SUPPORT", CASE WHEN Sum(SUPPORT) < 0.1 THEN 'OUT' ELSE 'IN' END "STATUS"
                    From    
                        (   Select  RWN, STEP, T_NAME, T_NAME "S_NAME", FUZZY_VALUE, TOT_NUM_TRANS, SUPPORT   
                            From    
                                (   
                                    Select  
                                        ROW_NUMBER() OVER (PARTITION BY s.T_ID, t.T_NAME Order By s.T_ID, t.T_NAME) "RWN", s.STEP, s.T_NAME, i.TOT_NUM_TRANS "TOT_NUM_TRANS",   s.T_ID, t.FUZZY_VALUE,  Round(t.FUZZY_VALUE / i.TOT_NUM_TRANS, 8) "SUPPORT"
                                    From
                                        (
                                            Select
                                                n.STEP, n.T_NAME, i.T_ID
                                            From
                                                id_names i
                                            Inner Join
                                                (
                                                    Select 3 "STEP", n1.T_NAME || ', ' || n2.T_NAME || ', ' || n3.T_NAME "T_NAME" 
                                                    From name_ids n1
                                                    Inner Join
                                                        name_ids n2 ON(n1.T_NAME <>  n2.T_NAME  And n1.T_NAME_ID < n2.T_NAME_ID)
                                                    Inner Join
                                                        name_ids n3 ON(n2.T_NAME <>  n3.T_NAME  And n2.T_NAME_ID < n3.T_NAME_ID)
                                                ) n ON(1=1)
                                        ) s
                                    Inner Join 
                                        tbl t ON(t.T_ID = s. T_ID And InStr(s.T_NAME, t.T_NAME) > 0)
                                    Inner Join
                                        id_names i ON(i.T_ID = t.T_ID And InStr(i.ID_NAMES, t.T_NAME) > 0 And 
                                                        InStr(i.ID_NAMES, SubStr(s.T_NAME, 1, 2)) > 0 And InStr(i.ID_NAMES, SubStr(s.T_NAME, 5, 2)) > 0)
                                    Inner Join
                                        step2 s2 ON(INSTR((Select LISTAGG(T_NAME, ', ') WITHIN GROUP (Order By STATUS) From step2 Where STATUS = 'IN' Group By STATUS), SubStr(s.T_NAME, 1, 6)) > 0 And 
                                                    INSTR((Select LISTAGG(T_NAME, ', ') WITHIN GROUP (Order By STATUS) From step2 Where STATUS = 'IN' Group By STATUS), SubStr(s.T_NAME, InStr(s.T_NAME, ', ', 1) + 2, 6)) > 0 And
                                                    INSTR((Select LISTAGG(T_NAME, ', ') WITHIN GROUP (Order By STATUS) From step2 Where STATUS = 'IN' Group By STATUS), SubStr(s.T_NAME, InStr(s.T_NAME, ', ', 2) + 2, 6)) > 0)
                                )
                            Where RWN = 1
                        )
                    Group By STEP, S_NAME, TOT_NUM_TRANS
                )
        ),
--  -----------------------------------------------------------
--  R e s u l t   f o r   step3
--
--        STEP T_NAME     TOT_NUM_TRANS   FUZZ_SUM    SUPPORT STATUS
--  ---------- ---------- ------------- ---------- ---------- ------
--           3 I1, I2, I3            10 3.48333331  .34833331 IN     
--           3 I1, I2, I5            10  .58333333  .05833333 OUT
--  -------------------------------------------------------------------------------------------------------------------------
    step4 AS
        (Select STEP, T_NAME, TOT_NUM_TRANS, FUZZ_SUM, SUPPORT, STATUS
            From
                (   Select  STEP, S_NAME "T_NAME", TOT_NUM_TRANS, Sum(FUZZY_VALUE) "FUZZ_SUM", Sum(SUPPORT) "SUPPORT", CASE WHEN Sum(SUPPORT) < 0.1 THEN 'OUT' ELSE 'IN' END "STATUS"
                    From    
                        (   Select  RWN, STEP, T_NAME, T_NAME "S_NAME", FUZZY_VALUE, TOT_NUM_TRANS, SUPPORT   
                                From    
                            (   
                                Select  
                                    ROW_NUMBER() OVER (PARTITION BY s.T_ID, t.T_NAME Order By s.T_ID, t.T_NAME) "RWN", s.STEP, s.T_NAME, i.TOT_NUM_TRANS "TOT_NUM_TRANS",   s.T_ID, t.FUZZY_VALUE,  Round(t.FUZZY_VALUE / i.TOT_NUM_TRANS, 8) "SUPPORT"
                                From
                                    (
                                        Select
                                            n.STEP, n.T_NAME, i.T_ID
                                        From
                                            id_names i
                                        Inner Join
                                            (
                                                Select 4 "STEP", n1.T_NAME || ', ' || n2.T_NAME || ', ' || n3.T_NAME || ', ' || n4.T_NAME "T_NAME" 
                                                From name_ids n1
                                                Inner Join
                                                    name_ids n2 ON(n1.T_NAME <>  n2.T_NAME  And n1.T_NAME_ID < n2.T_NAME_ID)
                                                Inner Join
                                                    name_ids n3 ON(n2.T_NAME <>  n3.T_NAME  And n2.T_NAME_ID < n3.T_NAME_ID)
                                                Inner Join
                                                    name_ids n4 ON(n3.T_NAME <>  n4.T_NAME  And n3.T_NAME_ID < n4.T_NAME_ID)
                                            ) n ON(1=1) 
                                    ) s
                                Inner Join 
                                    tbl t ON(t.T_ID = s. T_ID And InStr(s.T_NAME, t.T_NAME) > 0)
                                Inner Join
                                    id_names i ON(i.T_ID = t.T_ID And InStr(i.ID_NAMES, t.T_NAME) > 0 And 
                                                    InStr(i.ID_NAMES, SubStr(s.T_NAME, 1, 2)) > 0 And InStr(i.ID_NAMES, SubStr(s.T_NAME, 5, 2)) > 0 And InStr(i.ID_NAMES, SubStr(s.T_NAME, 9, 2)) > 0)
                                Inner Join
                                    step3 s3 ON(INSTR((Select LISTAGG(T_NAME, ', ') WITHIN GROUP (Order By STATUS) From step3 Where STATUS = 'IN' Group By STATUS), SubStr(s.T_NAME, 1, 10)) > 0 And 
                                                INSTR((Select LISTAGG(T_NAME, ', ') WITHIN GROUP (Order By STATUS) From step3 Where STATUS = 'IN' Group By STATUS), SubStr(s.T_NAME, InStr(s.T_NAME, ', ', 1) + 2, 10)) > 0 And
                                                INSTR((Select LISTAGG(T_NAME, ', ') WITHIN GROUP (Order By STATUS) From step3 Where STATUS = 'IN' Group By STATUS), SubStr(s.T_NAME, InStr(s.T_NAME, ', ', 2) + 2, 10)) > 0 And
                                                INSTR((Select LISTAGG(T_NAME, ', ') WITHIN GROUP (Order By STATUS) From step3 Where STATUS = 'IN' Group By STATUS), SubStr(s.T_NAME, InStr(s.T_NAME, ', ', 3) + 2, 10)) > 0)
                            )
                            Where RWN = 1
                        )
                    Group By STEP, S_NAME, TOT_NUM_TRANS
                )
        )
--  -----------------------------------------------------------
--  R e s u l t   f o r   step4
--
--  no rows selected
--  -----------------------------------------------------------
--  *******************************************************
Select * From
    (
        Select * From step1 UNION ALL
        Select * From step2 UNION ALL
        Select * From step3 UNION ALL
        Select * From step4
    )
Order By 
    STEP, T_NAME
--  --------------------------------------------------------------------
--  R e s u l t   all together
--
--        STEP T_NAME           TOT_NUM_TRANS   FUZZ_SUM    SUPPORT STATUS
--  ---------- -------------- ------------- ---------- ---------- ------
--           1 I1                        10 2.44999999 .244999999 IN     
--           1 I2                        10 2.94999999 .294999999 IN     
--           1 I3                        10 2.58333333 .258333333 IN     
--           1 I4                        10  .53333333 .053333333 OUT    
--           1 I5                        10 1.08333333 .108333333 IN     
--           1 I6                        10         .2        .02 OUT    
--           1 I8                        10         .2        .02 OUT    
--           2 I1, I2                    10 2.89999998  .28999998 IN     
--           2 I1, I3                    10 2.58333333  .25833333 IN     
--           2 I1, I5                    10  .58333333  .05833333 OUT    
--           2 I2, I3                    10          2         .2 IN     
--           2 I2, I5                    10          1         .1 IN     
--           3 I1, I2, I3                10 3.48333331  .34833331 IN     
--           3 I1, I2, I5                10  .58333333  .05833333 OUT
--  -------------------------------------------------------------------

Limiting factor for this answer is that the code works ok with T_NAME column that has length of 2 characters as in sample data. For different lengths there should be some changes in ON() clauses of joins with step2,3,... No time to deal with it now but that also can be solved in a way that it doesn't matter what the length of the column is. Regards...

0

精彩评论

暂无评论...
验证码 换一张
取 消