I'm trying to get the top 3 values of col2 for each letter in col1, based on the count of col2.
My table looks like this:
col1 | col2 |
---|---|
A | 1 |
B | 2 |
A | 2 |
B | 2 |
B | 1 |
B | 1 |
B | 1 |
A | 3 |
A | 2 |
B | 4 |
A | 2 |
B | 2 |
A | 3 |
A | 4 |
So, for example, A=1 appears one time, A=2 three times, A=3 two times, and A=4 one time (similarly for B).
I would like my output to look like this:
col1 | col2 |
---|---|
A | 2 |
A | 3 |
A | 1 |
B | 1 |
B | 2 |
B | 4 |
That is, it shows the top 3 values of col2 (based on their count) for A, and the same for B.
I tried doing:
SELECT col1, col2
FROM data
WHERE col2 = (SELECT COUNT(col2)
              FROM data
              WHERE col1 = data.col1
              ORDER BY COUNT(col2) desc
              LIMIT(3)
             )
GROUP BY col1, col2
but I get an empty table.
I also tried:
SELECT col1, col2, rn
FROM (SELECT col1, col2,
             ROW_NUMBER() OVER (PARTITION BY col1 ORDER BY COUNT(col2) DESC) AS rn
      FROM data) tmp
WHERE rn <= 3
ORDER BY col1, rn
but that doesn't work either (I think the problem is where I tell it to count).
Any ideas? I'm pretty new to SQL and can't figure it out. I could also do it with DataFrames (PySpark), but I'm not sure what the equivalent of OVER (PARTITION BY ...) would be in the DataFrame API.
UPDATE: I figured it out. This is the correct query:
SELECT col1, col2, x
FROM (SELECT col1, col2, COUNT(col2) AS x,
             ROW_NUMBER() OVER (PARTITION BY col1 ORDER BY COUNT(col2) DESC) AS rn
      FROM data
      GROUP BY col1, col2) tmp
WHERE rn <= 3
ORDER BY col1
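For the DataFrame API side of the question, this seems to be a rough PySpark equivalent of the same query (just a sketch, assuming the table is already loaded into a DataFrame called df):

from pyspark.sql import Window
import pyspark.sql.functions as F

# count each (col1, col2) pair, then rank the pairs within each col1
# by that count (highest first) and keep the top 3 per group
w = Window.partitionBy('col1').orderBy(F.col('x').desc())

(df.groupBy('col1', 'col2')
   .agg(F.count('col2').alias('x'))
   .withColumn('rn', F.row_number().over(w))
   .filter(F.col('rn') <= 3)
   .orderBy('col1', 'rn')
   .select('col1', 'col2', 'x')
   .show())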
Try this:
SELECT col1, col2
FROM (
    SELECT col1, col2,
           ROW_NUMBER() OVER (PARTITION BY col1 ORDER BY cnt DESC) AS rn
    FROM (
        SELECT col1, col2, COUNT(1) AS cnt
        FROM `table`
        GROUP BY col1, col2
    ) t0
) t1
WHERE rn <= 3
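To see what that does: the innermost query (t0) just counts how many times each (col1, col2) pair appears, so with the sample data above it should produce something like:

col1 | col2 | cnt |
---|---|---|
A | 1 | 1 |
A | 2 | 3 |
A | 3 | 2 |
A | 4 | 1 |
B | 1 | 3 |
B | 2 | 3 |
B | 4 | 1 |

ROW_NUMBER() then ranks those rows within each col1 by cnt (descending), and the outer WHERE rn <= 3 keeps the three most frequent col2 values per group.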
in PySpark:
df = spark.createDataFrame(
    [
        ('A', '1'),
        ('B', '2'),
        ('A', '2'),
        ('B', '2'),
        ('B', '1'),
        ('B', '1'),
        ('B', '1'),
        ('A', '3'),
        ('A', '2'),
        ('B', '4'),
        ('A', '2'),
        ('B', '2'),
        ('A', '3'),
        ('A', '4'),
    ],
    ['col1', 'col2']
)
from pyspark.sql import Window
import pyspark.sql.functions as F

# rank the distinct col2 values within each col1, ordered by col2 itself
w = Window.partitionBy('col1').orderBy('col2')

(df
 .withColumn('rank', F.dense_rank().over(w))  # duplicates of the same col2 share a rank
 .filter(F.col('rank') <= 3)                  # keep the three lowest-ranked values per col1
 .groupby('col1', 'col2')                     # collapse duplicate (col1, col2) rows
 .agg(F.first('col2').alias('dummy'))
 .drop('rank', 'dummy')                       # drop the helper columns ('rank' is already gone after the aggregation)
 .show())
# +----+----+
# |col1|col2|
# +----+----+
# | A| 1|
# | A| 2|
# | A| 3|
# | B| 1|
# | B| 2|
# | B| 4|
# +----+----+
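Alternatively, you can register df as a temp view and run the SQL version above directly with spark.sql (a sketch; the view name data and the variable top3 are just for illustration):

# register the DataFrame so Spark SQL can query it; the view name is arbitrary
df.createOrReplaceTempView('data')

top3 = spark.sql("""
    SELECT col1, col2
    FROM (
        SELECT col1, col2,
               ROW_NUMBER() OVER (PARTITION BY col1 ORDER BY cnt DESC) AS rn
        FROM (
            SELECT col1, col2, COUNT(1) AS cnt
            FROM data
            GROUP BY col1, col2
        ) t0
    ) t1
    WHERE rn <= 3
""")
top3.show()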