python csv: getting subset

Developer https://www.devze.com 2023-01-08 18:53 Source: Internet

Here is a snapshot of my csv:

alex    123f    1
harry   fwef    2
alex    sef     3
alex    gsdf    4
alex    wf35    6
harry   sdfsdf  3

I would like to get the subset of this data where each value in the first column (harry, alex) occurs at least 4 times. So I want the resulting data set to be:

alex    123f    1
alex    sef     3
alex    gsdf    4
alex    wf35    6

Clearly, you cannot decide which rows are interesting until you've seen all rows (since the very last row might be the one turning some count from three to four and thereby making some previously seen rows interesting, for example;-). So, unless your CSV file is horribly huge, suck it all into memory, first, as a list...:

import csv

# Python 2's csv module expects the file to be opened in binary mode.
with open('thefile.csv', 'rb') as f:
    data = list(csv.reader(f))

then, do the counting -- Python 2.7 has a better way, but assuming you're still on 2.6 like most of us...:

import collections
counter = collections.defaultdict(int)
for row in data:
    counter[row[0]] += 1
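
The "better way" in Python 2.7 is presumably collections.Counter, which was added in that release. A minimal sketch, assuming a small hard-coded `data` list (made up here from the question's sample) in place of the rows read from the file:

```python
import collections

# Sample rows standing in for the `data` list read from the file (assumed values).
data = [
    ['alex', '123f', '1'],
    ['harry', 'fwef', '2'],
    ['alex', 'sef', '3'],
    ['alex', 'gsdf', '4'],
    ['alex', 'wf35', '6'],
    ['harry', 'sdfsdf', '3'],
]

# Counter tallies the first-column values in a single pass.
counter = collections.Counter(row[0] for row in data)
```

The resulting `counter` supports the same `counter[row[0]]` lookups as the defaultdict version, so the rest of the approach is unchanged.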

and finally do the selection loop...:

for row in data:
    if counter[row[0]] >= 4:
        print row

Of course, this prints each interesting row as a rough-hewn list (with square brackets and quotes around the items), but it is easy to format it any way you prefer.
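
As one possible way to do that formatting, the selected rows can be handed back to the csv module. This is only a sketch and assumes Python 3 (unlike the 2.6 code above); `format_rows` and the `selected` sample are hypothetical names introduced for illustration:

```python
import csv
import io

def format_rows(rows, delimiter=','):
    """Render rows as delimited text instead of Python list reprs."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter=delimiter, lineterminator='\n')
    writer.writerows(rows)
    return buf.getvalue()

# Two of the selected rows from the question, used as sample input.
selected = [['alex', '123f', '1'], ['alex', 'sef', '3']]
print(format_rows(selected), end='')
```

Passing `delimiter='\t'` would give tab-separated output instead, which is closer to how the snapshot in the question looks.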

If Python is not a must:

$ gawk '{b[$1]++;c[++d,$1]=$0}END{for(i in b){if(b[i]>=4){for(j=1;j<=d;j++){if((j,i) in c)print c[j,i]}}}}' file

And yes, a 70MB file is fine.
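
If the file ever were too big to hold in memory (not the case at 70MB), one option would be to read it twice: once to count, once to select. A rough sketch, assuming Python 3 and a comma-delimited file; `frequent_rows` is a hypothetical helper, not something from the answers above:

```python
import collections
import csv

def frequent_rows(path, min_count=4):
    """Yield rows whose first field occurs at least min_count times in the file."""
    # Pass 1: count first-column values without keeping the rows themselves.
    with open(path, newline='') as f:
        counter = collections.Counter(row[0] for row in csv.reader(f) if row)
    # Pass 2: re-read the file and yield only the qualifying rows, in file order.
    with open(path, newline='') as f:
        for row in csv.reader(f):
            if row and counter[row[0]] >= min_count:
                yield row
```

Only the counts stay in memory, at the cost of reading the file a second time.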
