
How do you represent data records containing non-numerical features as vectors (mathematical, NOT c++ vector)?

Many data mining algorithms/strategies use vector representation of data records in order to simulate a spatial representation of the data (like support vector machines).

My trouble comes from how to represent non-numerical features within the dataset. My first thought was to 'alias' each possible value for a feature with a number from 1 to n (where n is the number of possible values for that feature).

While doing some research I came across a suggestion that, for features with a small number of possible values, you should use a bit string of length n, where each bit represents a different value and only the bit corresponding to the stored value is set. I can see how this could theoretically save memory for features with fewer possible values than the number of bits in an integer on the target system, but the dataset I'm working with has many distinct values for several features, so I don't think that approach will help me much.
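To make the two options concrete, here is a rough sketch of what I mean (the feature and its possible values are invented for illustration):

# illustrative only -- the feature and its possible values are made up
channel_values = ['direct_typein', 'organic_search', 'affiliate']

# option 1: alias each possible value with an integer 1..n
int_alias = {v: i + 1 for i, v in enumerate(channel_values)}
# {'direct_typein': 1, 'organic_search': 2, 'affiliate': 3}

# option 2: a bit string of length n, one bit per possible value
def to_bit_string(value, values=channel_values):
    return [1 if v == value else 0 for v in values]

to_bit_string('organic_search')   # -> [0, 1, 0]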

What are some of the accepted methods of representing these values in vectors, and when is each strategy the best choice?


So there's a convention to do this. It's much easier to show by example than to explain.

Suppose you have collected, from your web analytics app, four sets of metrics describing each visitor to a web site:

  1. sex/gender

  2. acquisition channel

  3. forum participation level

  4. account type

Each of these is a categorical variable (aka factor) rather than a continuous variable (e.g., total session time, or account age).

# column headers of raw data--all fields are categorical ('factors')
col_headers = ['sex', 'acquisition_channel', 'forum_participation_level', 'account_type']

# a single data row represents one user
row1 = ['M', 'organic_search', 'moderator', 'premium_subscriber']

# expand data matrix width-wise by adding new fields (columns) for each factor level:
input_fields = [ 'male', 'female', 'new', 'trusted', 'active_participant', 'moderator', 
                 'direct_typein', 'organic_search', 'affiliate', 'premium_subscriber',   
                 'regular_subscriber',  'unregistered_user' ]

# now, original 'row1' above, becomes (for input to ML algorithm, etc.)
warehoused_row1 = [1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0]
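
One way to derive that expanded row programmatically rather than writing it out by hand (a minimal sketch; the code_to_field mapping is my own addition, since the raw code 'M' and the expanded column name 'male' differ):

# map raw codes to the expanded column names where they differ
code_to_field = {'M': 'male', 'F': 'female'}
normalized_row1 = [code_to_field.get(v, v) for v in row1]

# one column per factor level: 1 if this user has that level, else 0
warehoused_row1 = [1 if field in normalized_row1 else 0 for field in input_fields]
# -> [1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0]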

This transformation technique seems more sensible to me than keeping each variable as a single column. For instance, if you do the latter, then you have to reconcile the three types of acquisition channels with their numerical representation--i.e., if organic_search is a 1, should affiliate be a 2 and direct_typein a 3, or vice versa?

Another significant advantage of this representation is that it is, despite the width expansion, a compact representation of the data. (In instances where the column expansion is substantial--i.e., one field is user state, which might mean 1 column becomes 50, a sparse matrix representation is obviously a good idea.)
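
As a rough sketch of that last point (the row and column counts below are invented), SciPy's sparse matrix types will hold an expanded matrix like this without materializing all of the zero entries:

import numpy as NP
from scipy import sparse

# e.g. 10,000 users x 62 expanded columns (the 12 fields above plus 50 state columns)
dense = NP.zeros((10000, 62), dtype=bool)
dense[0, [0, 5, 7, 9, 20]] = True          # one user's set bits

sparse_rows = sparse.csr_matrix(dense)     # stores only the nonzero entries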

For this type of work I use the numerical computation libraries NumPy and SciPy.

from the Python interactive prompt:

>>> import numpy as NP

>>> # create two data rows, representing two unique visitors to a Web site:

>>> row1 = NP.array([0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0])

>>> row2 = NP.array([1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0])

>>> row1.dtype
  dtype('int64')
>>> row1.itemsize
  8

>>> # these two data arrays can be converted from int/float to boolean, substantially 
>>> # reducing their size w/ concomitant performance improvement
>>> row1 = NP.array(row1, dtype=bool)
>>> row2 = NP.array(row2, dtype=bool)

>>> row1.dtype
  dtype('bool')
>>> row1.itemsize    # compare with row1.itemsize = 8, above
  1

>>> # element-wise comparison of two data vectors (two users) is straightforward:
>>> row1 == row2  # element-wise comparison
  array([False, False,  True,  True, False, False, False, False,  True, False, False,  True], dtype=bool)
>>> NP.sum(row1==row2)
  4

For similarity-based computation (e.g., k-Nearest Neighbors), there is a particular metric used for expanded data vectors composed of categorical variables, called the Tanimoto Coefficient. For the particular representation I have used here, the function would look like this:

def tanimoto_bool(A, B):
    # AuB: number of positions at which the two boolean vectors agree
    AuB = NP.sum(A == B)
    numer = AuB
    denom = len(A) + len(B) - AuB
    return numer / float(denom)

>>> tanimoto_bool(row1, row2)
  0.2


There is no "widely accepted answer" that I know of; it depends entirely on what you want.

The main idea behind your post is that the trivial memory representation of a state may be too memory intensive. For example, to store a value that can have at most four states, you would typically use an int (32 bits), but you could manage with only 2 bits, so 16 times less.

However, the cleverer (i.e., more compact) your representation of a vector, the longer it will take to encode/decode it to/from the trivial representation.

I did a project where I represented the state of a Connect-4 board with two doubles (64 bits each), one coding the discs owned by each player. It was a vast improvement over storing the state as 42 integers! With the smaller memory footprint I could search much deeper. This is typically what you want.

With a clever enough understanding of Connect-4, it is even possible to encode the board with only one double! I tried it, and the program became so slow that I reverted to using two doubles instead of one: it spent most of its time in the encode/decode functions. This is typically what you do not want.
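
For illustration, a minimal sketch of that style of packing, using two Python integers as 42-bit masks rather than doubles (the bit layout is my own choice):

# one bit per cell per player: bit index = col * 6 + row  (7 columns x 6 rows = 42 bits)
def set_disc(bitboard, col, row):
    return bitboard | (1 << (col * 6 + row))

player1_discs = 0
player2_discs = 0
player1_discs = set_disc(player1_discs, col=3, row=0)   # player 1 plays the centre column
player2_discs = set_disc(player2_discs, col=3, row=1)   # player 2 stacks on top of it

occupied = player1_discs | player2_discs   # decoding: which cells are filled at all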

Now, because you want an answer, here are some guidelines:

  • If you can store booleans with one byte only, then keep them as booleans (language/compiler dependent).

  • Pack all small features (between 3 and 256 possible values) together into primitive types such as int, double, long double, or whatever your language uses. Then write encode/decode functions, using bit-shift operators for speed where possible (a sketch follows below).

  • Keep features that can have "lots of" (more than 256) possible values as-is.

Of course, these are not absolutes. If you have a feature that can take exactly 2^15 values and another that can take 2^17 values, then pack them together into a 32-bit primitive type.
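
For example (a sketch only, using the 2^15 / 2^17 case above; the feature names are placeholders), the encode/decode pair can be plain shifts and masks:

# pack feature_a (15 bits, 0..32767) and feature_b (17 bits, 0..131071) into one 32-bit word
def encode(feature_a, feature_b):
    return (feature_a << 17) | feature_b

def decode(word):
    return word >> 17, word & ((1 << 17) - 1)

word = encode(12345, 99999)
assert decode(word) == (12345, 99999)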

TL;DR: There's a trade-off between memory consumption and speed; you need to adjust according to your problem.
