
Python - how to check if a char is in a dictionary, and how to deal with it if not

Developer https://www.devze.com 2022-12-19 21:58 Source: Internet
I am transliterating text from a source language (input file) to a target language (target file), so I look up each character's equivalent in a dictionary in my source code. Certain characters in the source file, such as the comma (,) and other special symbols, don't have an equivalent mapping. How do I check whether a character is in the dictionary, and how do I make sure that the special symbols without an equivalent mapping are still printed to the target file? Thank you :).


My recommendation, given that rules is a mapping of the characters to their transliterated equivalents:

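def transliterate(source_text, rules):
    results = []
    for char in source_text:
        # fall back to the character itself when there is no mapping
        results.append(rules.get(char, char))
    return ''.join(results)    # turns the list back into a string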

A dict's get method returns the value for a key, or a default value if the key does not exist. Normally the default is None, but here we pass the character itself as the default (the second argument), so a character with no mapping is returned unchanged.

A more compact way to write this, using a generator expression, would be:

''.join(rules.get(char, char) for char in source_text)
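Since the question mentions an input file and a target file, here is a minimal sketch of the same idea applied line by line to files. It assumes Python 3 and UTF-8 text; the function name transliterate_file and its parameters are made up for illustration and are not from the question.

```python
def transliterate_file(source_path, target_path, rules, encoding="utf-8"):
    """Copy source_path to target_path, replacing every mapped character.

    Characters with no entry in `rules` (commas, other symbols, newlines)
    are written to the target file unchanged.
    """
    with open(source_path, encoding=encoding) as src, \
         open(target_path, "w", encoding=encoding) as dst:
        for line in src:
            dst.write("".join(rules.get(char, char) for char in line))
```

Processing one line at a time keeps memory use small for large inputs; for small files, reading the whole text at once works just as well.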


If you use the translate method of Unicode objects, as I recommended in my answer to another question of yours, everything is done for you automatically: each Unicode character c whose code point (ord(c)) is not in the transliteration dictionary is simply passed unchanged from input to output, just as you want. Why reinvent the wheel?
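A small sketch of the translate approach, assuming Python 3 (where every str is Unicode). The mapping below is a made-up sample, and translate_text is a hypothetical helper name.

```python
# Hypothetical mapping for illustration; keys must be single characters.
rules = {"a": "α", "b": "β", "g": "γ"}

# str.maketrans turns the one-character keys into the code-point keys
# that str.translate expects.
table = str.maketrans(rules)

def translate_text(text):
    """Replace mapped characters; unmapped ones pass through unchanged."""
    return text.translate(table)

print(translate_text("a, b; g!"))
```

The comma, semicolon, spaces and exclamation mark have no entry in the table, so they come out exactly as they went in.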


I think you want something like this:

tokenMapping = {"&&" : "and"}

for token in source_tokens:  # source_tokens: an iterable of tokens read from the source file
    translatedToken = tokenMapping[token] if token in tokenMapping else "transliteration unknown"

If there's a translation in the dictionary (e.g. "&&" -> "and"), it will use that. Otherwise the token is translated to "transliteration unknown".

Hope that helped.

EDIT: As LeafStorm suggested, a dictionary's get method can be used to simplify the above code. The line inside the loop would become

    translatedToken = tokenMapping.get(token, "transliteration unknown")


dictx = {}
for itm in my_source:
    dictx[itm] = dictx.get(itm, 0) + 1

I didn't completely understand the details of your question, but here's the simplest example I could think of that illustrates the pattern I think you are after.

The 'get' method, I believe, is what you want. It lets you look up a key in a dictionary and supply a default value to use if the key is not there, i.e., "I want dictx[itm] (the value assigned to the key 'itm'), but if 'itm' is not in the dictionary, give me 0 instead."

This snippet loops through your source document ('my_source') and counts the frequency of the various items in it, adding to the counts stored under the keys already in your dictionary. When it reaches an item for which no key exists, no exception is thrown: get returns 0, and the assignment then adds the key with a count of 1.
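To make the behaviour of get concrete, here is a short sketch. It shows that get on its own never adds a key, and that the counting loop matches collections.Counter from the standard library. The function name count_items and the sample string are only for illustration.

```python
from collections import Counter

dictx = {}
print(dictx.get("a", 0))   # the default is returned...
print("a" in dictx)        # ...but get has not added the key

def count_items(my_source):
    """Count how often each item occurs, using the get-with-default pattern."""
    counts = {}
    for itm in my_source:
        counts[itm] = counts.get(itm, 0) + 1   # the assignment creates the key
    return counts

sample = "banana"
print(count_items(sample))
print(count_items(sample) == dict(Counter(sample)))
```

If counting is all that is needed, Counter does the same job in one call.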


This seems pretty straightforward. If your dictionary is char to char, then you would do something like

outstr = ''
for ch in instr:
    if ch in mydict:
        outstr += mydict[ch]
    else:
        outstr += ch

Here, instr is your input string and mydict contains your mapping of chars to chars.

If you want to check parts of words, I would recommend using two dictionaries: one that contains the characters that are contained in any word, and one that contains the words. You could use it like this:

outstr = ''
word = ''
for ch in instr:
    if ch in chardict:
        word += ch
    else:
        if len(word):
            if word in worddict:
                outstr += worddict[word]
            else:
                outstr += word
            word = ''
        outstr += ch
if len(word):    # flush the last word if the input ended inside one
    outstr += worddict.get(word, word)    # unknown words are copied unchanged

chardict might contain all of the alphabet for instance. Of course, you might want to do some parts a little bit differently (like use something other than chardict to check if a char is to be considered part of a valid word - perhaps something with a binary search), but hopefully you get the idea.
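As one possible alternative to the hand-written loop, a regular expression can find the words and a callback can look each one up. This is a different technique from the chardict approach above, not a restatement of it. It assumes a "word" is a run of ASCII letters; the pattern, the sample worddict, and the name replace_words are all assumptions for illustration.

```python
import re

# Hypothetical word table for illustration.
worddict = {"hello": "bonjour", "world": "monde"}

# Assumption: a word is a run of ASCII letters. Adjust the pattern to
# whatever set of characters counts as part of a word in your source text.
WORD = re.compile(r"[A-Za-z]+")

def replace_words(instr, worddict):
    """Replace known words; unknown words and all other characters are kept."""
    return WORD.sub(lambda m: worddict.get(m.group(0), m.group(0)), instr)

print(replace_words("hello, big world!", worddict))
```

Here the pattern plays the role of chardict, and get supplies the same "leave it unchanged" fallback used for single characters.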
