Conditional Data Manipulation in Mathematica

Developer https://www.devze.com 2023-03-08 04:55 Source: web

I am trying to assemble the best tools for efficient data analysis in Mathematica. My data has approximately 300 columns and 100,000 rows.

What would be the best tricks to "remove", "extract", or simply "consider" parts of the data structure, e.g. for plotting?

One of the trickiest examples I could think of is :

Given a data structure,

Extract columns 1 to 3, 6 to 9, and the last one, for every row where the value in column 2 is equal to x and the value in column 8 is different from y.

I also welcome any general advice on data manipulation.


For generic manipulation of data in a table with named columns, I refer you to this solution of mine for a similar question. For any particular case, it might be easier to write a function for Select manually. However, with many columns and many different queries, the chances of messing up indices are high. Here is a modified version of the solution from the mentioned post, which provides a friendlier syntax:

Clear[getIds];
getIds[table : {colNames_List, rows__List}] := {rows}[[All, 1]];

ClearAll[select, where];
SetAttributes[where, HoldAll];
select[cnames_List, from[table : {colNames_List, rows__List}], where[condition_]] :=
With[{colRules =  Dispatch[ Thread[colNames -> Thread[Slot[Range[Length[colNames]]]]]],
    indexRules  =  Dispatch[Thread[colNames -> Range[Length[colNames]]]]},
     With[{selF = Apply[Function, Hold[condition] /. colRules]},
       Select[{rows}, selF @@ # &][[All, cnames /. indexRules]]]];

What happens here is that the function used in Select gets generated automatically from your specifications. For example (using @Yoda's example):

rows = Array[#1 #2 &, {5, 15}];

We need to define the column names (must be strings or symbols without values):

In[425]:= 
colnames = "c" <> ToString[#] & /@ Range[15]

Out[425]= {"c1", "c2", "c3", "c4", "c5", "c6", "c7", "c8", "c9", "c10", "c11", "c12", 
"c13", "c14", "c15"}

(In practice, names are usually more descriptive, of course.) Here is the table:

table = Prepend[rows, colnames];

Here is the select statement you need (I picked x = 4 and y=2):

select[{"c1", "c2", "c3", "c6", "c7", "c8", "c9", "c15"}, from[table],
    where["c2" == 4 && "c8" != 2]]

{{2, 4, 6, 12, 14, 16, 18, 30}}

Now, for a single query, this may look like a complicated way to do this. But you can do many different queries, such as

In[468]:= select[{"c1", "c2", "c3"}, from[table], where[EvenQ["c2"] && "c10" > 10]]

Out[468]= {{2, 4, 6}, {3, 6, 9}, {4, 8, 12}, {5, 10, 15}}

and similar.
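
To see what select actually builds, the rule-replacement step can be run on its own. The sketch below repeats the colRules construction from the definition above for the first query; it is only meant to illustrate the mechanism and assumes the 15 column names "c1" to "c15":

colnames = "c" <> ToString[#] & /@ Range[15];
colRules = Dispatch[Thread[colnames -> Thread[Slot[Range[Length[colnames]]]]]];

(* replace column names by slots inside Hold, then turn Hold into Function *)
Apply[Function, Hold["c2" == 4 && "c8" != 2] /. colRules]

(* gives the pure function  #2 == 4 && #8 != 2 &  *)

Select then applies this function to each row with @@, so #2 and #8 pick up the second and eighth entries of the row.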

Of course, if there are specific correlations in your data, you might find a particular special-purpose algorithm which will be faster. The function above can be extended in many ways, to simplify common queries (include "all", etc), or to auto-compile the generated pure function (if possible).

EDIT

On a philosophical note, I am sure that many Mathematica users (myself included) have found themselves, from time to time, writing similar code again and again. Mathematica's concise syntax often makes such code very easy to write for any particular case. However, as long as one works in some specific domain (for example, data manipulation in a table), the cost of repeating yourself will be high for many operations. What my example illustrates, in a very simple setting, is one possible way out: create a domain-specific language (DSL). For that, one generally needs to define a syntax/grammar for it, and write a compiler from it to Mathematica (to generate Mathematica code automatically). The example above is a very primitive realization of this idea, but my point is that Mathematica is generally very well suited for DSL creation, which I think is a very powerful technique.


data = RandomInteger[{1, 20}, {40, 20}]

x = 5;
y = 8;
Select[data, (#[[2]] == x && #[[8]] != y &)][[All, {1, 2, 3, 6, 7, 8, 9, -1}]]

==> {{5, 5, 1, 4, 18, 6, 3, 5}, {10, 5, 15, 3, 15, 14, 2, 5}, {18, 5, 6, 7, 7, 19, 14, 6}}

Some useful commands for getting pieces of matrices and lists are Span (;;), Drop, Take, Select, Cases and more. See tutorial/GettingAndSettingPiecesOfMatrices and guide/PartsOfMatrices.

Part ([[...]]) in combination with ;; can be quite powerful. a[[All, 1;;-1;;2]], for instance, means take all rows and all odd columns (-1 having the usual meaning of counting from the end).
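
A small sketch of Part with Span on a toy matrix (the 3 x 6 array is only an assumption for illustration, built the same way as the example data elsewhere on this page):

a = Array[#1 #2 &, {3, 6}];   (* a[[i, j]] == i j *)

a[[All, 1 ;; -1 ;; 2]]        (* all rows, columns 1, 3, 5 *)
(* {{1, 3, 5}, {2, 6, 10}, {3, 9, 15}} *)

a[[2 ;; 3, -2 ;;]]            (* rows 2 to 3, last two columns *)
(* {{10, 12}, {15, 18}} *)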

Select can be used to pick elements from a list (and remember, a matrix is a list of lists) based on a logical function. Its twin brother is Cases, which does selection based on a pattern. The function I used here is a 'pure' function, where # refers to the argument to which the function is applied (the elements of the list in this case). Since the elements are lists themselves (the rows of the matrix), I can refer to the columns by using the Part ([[..]]) function.
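
For comparison, here is a hedged sketch of the same row filter written with Cases instead of Select. The deterministic test matrix and the values x = 4, y = 2 are assumptions chosen so the result is easy to check by hand:

data = Array[#1 #2 &, {5, 15}];
x = 4; y = 2;

(* pattern with a condition: keep rows whose 2nd entry is x and 8th is not y *)
Cases[data, row_ /; row[[2]] == x && row[[8]] != y][[All, {1, 2, 3, 6, 7, 8, 9, -1}]]
(* {{2, 4, 6, 12, 14, 16, 18, 30}} *)

(* purely structural pattern; requires rows with at least 8 entries *)
Cases[data, {_, x, _, _, _, _, _, Except[y], ___}][[All, {1, 2, 3, 6, 7, 8, 9, -1}]]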


To pull out columns (or rows), you can use Part indexing:

data = Array[#1 #2 &, {5, 15}];
data[[All, Flatten@{Range@3, Range @@ {6, 9}, -1}]]

MatrixForm@%

The last line just displays the result in a readable matrix layout.

As Sjoerd mentioned in his comment (and in the explanation in his answer), indexing a single range can be easily done with the Span (;;) command. If you are joining multiple disjoint ranges, using Flatten to combine the separate ranges created with Range is easier than entering them by hand.


I read:

Extract Column 1 to 3, 6 to 9 as well as the last One for every lines where the value in Column 2 is equal to x and the value in column 8 is different than y

as meaning that we want:

  • elements 1-3 and 6-9 from each row

AND

  • the last element from rows wherein [[2]] == x && [[8]] != y.

This is what I hacked together:

a = RandomInteger[5, {20, 10}];          (*define the array*)
x = 4; y = 0;                            (*define the test values*)

Join @@ Range @@@ {1 ;; 3, 6 ;; 9};      (*define the column ranges*)

#2 == x && #8 != y & @@@ a;              (*test the rows*)

Append[%%, #] & /@ % /. {True -> -1, False :> Sequence[]};  (*complete the ranges according to the test*)

MapThread[Part, {a, %}] // TableForm     (*extract and display*)
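
The same reading can also be sketched without the % history, which makes it easier to rerun as one piece. This is only an alternative formulation under the interpretation above; the deterministic array and the names cols and test are assumptions for illustration:

a = Array[#1 #2 &, {5, 15}];             (*deterministic test array*)
x = 4; y = 2;                            (*test values*)
cols = Join[Range[1, 3], Range[6, 9]];   (*columns 1-3 and 6-9*)
test = #[[2]] == x && #[[8]] != y &;     (*row condition*)

(*append the last column only for rows that pass the test*)
If[test[#], #[[Append[cols, -1]]], #[[cols]]] & /@ a // TableForm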