I have been working on the problems presented in Python Challenge. One of the problems asks you to sift through a mess of characters and pick out the rarest character(s).
My approach was to read the characters from a text file and store each character and its number of occurrences as a key/value pair in a dictionary. I then sort the dictionary by value and invert it, so that the occurrence count is the key and the characters with that count are the value. Assuming that the rarest character occurs only once, I return the value whose key in this inverted dictionary equals one.
The input (funkymess.txt) looks like this:
%%$@$^_#)^)&!_+]!*@&^}@@%%+$&[(_@%+%$*^@$^!+]!&#)*}{}}!}]$[%}@[{@#_^{*......
The code is as follows:
from operator import itemgetter

characterDict = dict()

#put the characters in a dictionary
def putEncounteredCharactersInDictionary(lineStr):
    for character in lineStr:
        if character in characterDict:
            characterDict[character] = characterDict[character]+1
        else:
            characterDict[character] = 1

#Sort the character dictionary
def sortCharacterDictionary(characterDict):
    sortCharDict = dict()
    sortedDictionaryItems = sorted(characterDict.iteritems(),key = itemgetter(1))
    for key, value in sortedDictionaryItems:
        sortCharDict[key] = value
    return sortCharDict

#invert the sorted character dictionary
def inverseSortedCharacterDictionary(sortedCharDict):
    inv_map = dict()
    for k, v in sortedCharDict.iteritems():
        inv_map[v] = inv_map.get(v, [])
        inv_map[v].append(k)
    return inv_map
f = open('/Users/Developer/funkymess.txt','r')
for line in f:
    #print line
    processline = line.rstrip('\n')
    putEncounteredCharactersInDictionary(processline)
f.close()

sortedCharacterDictionary = sortCharacterDictionary(characterDict)
#print sortedCharacterDictionary
inversedSortedCharacterDictionary = inverseSortedCharacterDictionary(sortedCharacterDictionary)
print inversedSortedCharacterDictionary[1]
Can somebody take a look and give me some pointers on whether I am on the right track here, and if possible some feedback on possible optimizations, best practices, and potential refactorings, from both a language and an algorithmic standpoint?
Thanks
Refactoring: A Walkthrough
I want to walk you through the process of refactoring. Learning to program is not just about knowing the end result, which is what you usually get when you ask a question on Stack Overflow. It's about how to get to that answer yourself. When people post short, dense answers to a question like this it's not always obvious how they arrived at their solutions.
So let's do some refactoring and see what we can do to simplify your code. We'll rewrite, delete, rename, and rearrange code until no more improvements can be made.
Simplify your algorithms
Python need not be so verbose. It is usually a code smell when you have explicit loops operating over lists and dicts in Python, rather than using list comprehensions and functions that operate on containers as a whole.
Use defaultdict to store character counts
A defaultdict(int) will generate entries when they are accessed if they do not exist. This lets us eliminate the if/else branch when counting characters.
from collections import defaultdict

characterDict = defaultdict(int)

def putEncounteredCharactersInDictionary(lineStr):
    for character in lineStr:
        characterDict[character] += 1
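To see the default-creation behaviour in isolation, here is a minimal sketch. The sample string and the names `sample`, `counts`, and `missing` are made up for illustration, and `print(...)` is written in call form so the snippet also runs on Python 3.

```python
from collections import defaultdict

# Hypothetical sample input, not the real funkymess.txt contents.
sample = "%%$@$"

counts = defaultdict(int)
for ch in sample:
    # A missing key is created with int() == 0 before += 1 is applied.
    counts[ch] += 1

print(dict(counts))

# Merely reading a missing key also inserts it with the default value.
missing = counts["?"]
```

Note the side effect in the last line: looking up a key that was never counted adds it to the dict with a count of zero, which matters if you later search for the minimum count.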
Sorting dicts
In the Python 2 that this code targets, dictionaries don't guarantee any ordering on their keys (insertion order is only guaranteed from Python 3.7 on). You cannot assume that the items are stored in the same order that you insert them. So sorting the dict entries and then putting them right back into another dict just scrambles them right back up.
This means that your function is basically a no-op. After you sort the items you will need to keep them as a list of tuples to retain their sorting order. Removing that code we can then reduce this method down to a single line.
def sortCharacterDictionary(characterDict):
    return sorted(characterDict.iteritems(), key=itemgetter(1))
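As a rough illustration of what the sorted result looks like, here is a small sketch. The counts are invented, and `.items()` is used in place of `.iteritems()` on the assumption that you may be running Python 3.

```python
from operator import itemgetter

# Hypothetical counts, chosen only for illustration.
counts = {"%": 5, "@": 6, "(": 1, "#": 2}

# sorted() returns a new list of (key, value) tuples ordered by value.
by_count = sorted(counts.items(), key=itemgetter(1))

# The first tuple holds a character with the lowest count.
rarest_char, rarest_count = by_count[0]
print(by_count)
```

Because the result is a list rather than a dict, the ordering survives and the rarest entry is simply the first element.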
Inverting dicts
Given the previous comment you won't actually have a dict any more after sorting. But assuming you did, this function is one of those cases where explicit looping is discouraged. In Python, always be thinking how you can operate over collections all at once rather than one item at a time.
def inverseSortedCharacterDictionary(sortedCharDict):
    return dict((v, k) for k, v in sortedCharDict.iteritems())
All in one line we (1) iterate over the key/value pairs in the dict; (2) switch them and create inverted value/key tuples; (3) create a dict out of these inverted tuples.
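One caveat: a dict holds only one entry per key, so inverting with dict((v, k) ...) keeps a single character for each count. If you need every character that shares a count, as the version in the question collected, one possible sketch is below. The helper name `invert_grouped` and the sample counts are hypothetical, and it falls back on a short explicit loop.

```python
from collections import defaultdict

def invert_grouped(d):
    """Map each value to the list of keys that share it (hypothetical helper)."""
    inverted = defaultdict(list)
    for key, value in d.items():
        inverted[value].append(key)
    return dict(inverted)

# Hypothetical counts in which two characters are tied at one occurrence.
counts = {"(": 1, "[": 1, "#": 2}
groups = invert_grouped(counts)
print(sorted(groups[1]))
```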
Comment and name wisely
Your method names are long and descriptive. There's no need to repeat the same information in comments. Use comments only when your code isn't self-descriptive, such as when you have a complex algorithm or an unusual construct that isn't immediately obvious.
On the naming front, your names are unnecessarily long. I would stick with shorter names, and also make them more generic. Instead of inverseSortedCharacterDictionary, try just invertedDict. That's all that method does: it inverts a dict. It doesn't actually matter if it's passed a sorted character dict or any other type of dict.
As a rule of thumb, try to use the most generic names possible so that your methods and variables can be as generic as possible. More generic means more reusable.
characters = defaultdict(int)

def countCharacters(string):
    for ch in string:
        characters[ch] += 1

def sortedCharacters(characters):
    return sorted(characters.iteritems(), key=itemgetter(1))

def invertedDict(d):
    return dict((v, k) for k, v in d.iteritems())
Reduce volume
Using temporary variables and helper methods is a good programming practice, and I applaud you for doing so in your program. However, now that we have them simple enough that each one is only one or two lines we probably don't even need them any more.
Here's your program body after changing the functions as above:
f = open('funkymess.txt', 'r')
for line in f:
    countCharacters(line.rstrip('\n'))
f.close()

print sortedCharacters(characters)[0]
And then let's just go ahead and inline those helper methods since they're so simple. Here's the final program after all the refactoring:
Final program
#!/usr/bin/env python

from operator import itemgetter
from collections import defaultdict

characters = defaultdict(int)

f = open('funkymess.txt','r')
for line in f:
    for ch in line.rstrip('\n'):
        characters[ch] += 1
f.close()

print sorted(characters.iteritems(), key=itemgetter(1))[0]
You don't even need as much code as that, because Python already has a class that counts elements in an iterable for you! The following does all of what you asked for.
from collections import Counter
counter = Counter(open(<...>).read())
print min(counter, key=counter.get)
Explanation:
collections is a standard module in Python containing some commonly-used data structures. In particular, it contains Counter, which is a subclass of dict designed to count the frequency of stuff. It takes an iterable and counts all the elements in it.
Now as you may know, in Python strings are iterables and their elements are the single characters. So we can open the file, read all its contents at once, and feed that large string into a Counter. This makes a dict-like object which maps characters to their frequencies.
Finally, we want to find the least frequent character, given this dictionary of frequencies. In other words, we want the minimum element of counter, ordered by its value in the dictionary. Python has a built-in function for taking the minimum of things, naturally called min. If you want to order the data by something, you can pass it an optional key argument, and it will compare the elements by the result of calling key on each one. In this case, we ask min to find the minimum element as ordered by counter.get; in other words, we order by frequency!
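min returns just one character, even when several are tied for the lowest count. If you want all of them, a sketch along these lines might do. The sample text and the names `one_rarest`, `lowest`, and `all_rarest` are made up for illustration.

```python
from collections import Counter

# Hypothetical sample text standing in for the file contents.
text = "aabbbc(d"
counter = Counter(text)

# min() with key=counter.get returns one character with the lowest count.
one_rarest = min(counter, key=counter.get)

# To collect every character tied for the lowest count:
lowest = min(counter.values())
all_rarest = sorted(ch for ch, n in counter.items() if n == lowest)
print(all_rarest)
```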
That's way too much code.
import operator
[k for k, v in characterdict.iteritems()
 if v == min(characterdict.items(), key=operator.itemgetter(1))[1]]
Optimize as desired (e.g. store the minimum in another variable first).
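For instance, storing the minimum first could look roughly like this. The `characterdict` contents here are invented stand-ins, and `.items()` is used on the assumption of Python 3.

```python
# Hypothetical counts standing in for the question's character dictionary.
characterdict = {"(": 1, "[": 1, "#": 2, "@": 6}

# Compute the lowest count once instead of once per item.
lowest = min(characterdict.values())
rarest = [k for k, v in characterdict.items() if v == lowest]
print(sorted(rarest))
```

This way the dictionary is scanned twice in total, rather than once for every key inside the comprehension.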
Here's the code that I used to solve this puzzle:
comment = open('comment.txt').read()
for c in sorted(set(comment)):
    print ' %-3s %6d' % (repr(c)[1:-1], comment.count(c))
It sorts characters alphabetically rather than by frequency, but the rarest characters are very easy to pick up from the output.
If I wanted frequency sorting, I'd use collections.Counter like katrielalex suggested (if I remembered about its existence), or
from collections import defaultdict

comment = open('comment.txt').read()
counts = defaultdict(int)
for c in comment:
    counts[c] += 1
for c in sorted(counts, key=counts.get):
    print ' %-3s %6d' % (repr(c)[1:-1], counts[c])
Another way (not very compact) to accomplish your task:
text = """%$@$^_#)^)&!_+]!*@&^}@@%%+$&[(_@%+%$*^@$^!+]!&#)*}{}}!}"""
chars = set(text)
# Build (character, count) tuples so the result matches the output shown below.
L = [(c, text.count(c)) for c in chars]
L.sort(key=lambda pair: pair[1])
>>> L
[('(', 1),
('[', 1),
('{', 1),
('#', 2),
(']', 2),
(')', 3),
('*', 3),
('_', 3),
('&', 4),
('+', 4),
('!', 5),
('%', 5),
('$', 5),
('}', 5),
('^', 5),
('@', 6)]
>>>