I am trying to prepare the best tools for efficient Data Analysis in Mathematica. I have a approximately 300 Columns & 100 000 Rows.
What would be the best tricks to :
"Remove", "Extract" or simply "Consider" parts of the data structure, for plotting for e.g.
One of the trickiest examples I could think of is :
Given a data structure,
Extract Column 1 to 3, 6 to 9 as well as the last One for every lines where the value in Column 2 is equal to x and the val开发者_运维知识库ue in column 8 is different than y
I also welcome any general advice on data manipulation.
For a generic manipulation of data in a table with named columns, I refer you to this solution of mine, for a similar question. For any particular case, it might be easier to write a function for Select
manually. However, for many columns, and many different queries, chances to mess up indexes are high. Here is the modified solution from the mentioned post, which provides a more friendly syntax:
Clear[getIds];
getIds[table : {colNames_List, rows__List}] := {rows}[[All, 1]];
ClearAll[select, where];
SetAttributes[where, HoldAll];
select[cnames_List, from[table : {colNames_List, rows__List}], where[condition_]] :=
With[{colRules = Dispatch[ Thread[colNames -> Thread[Slot[Range[Length[colNames]]]]]],
indexRules = Dispatch[Thread[colNames -> Range[Length[colNames]]]]},
With[{selF = Apply[Function, Hold[condition] /. colRules]},
Select[{rows}, selF @@ # &][[All, cnames /. indexRules]]]];
What happens here is that the function used in Select
gets generated automatically from your specifications. For example (using @Yoda's example):
rows = Array[#1 #2 &, {5, 15}];
We need to define the column names (must be strings or symbols without values):
In[425]:=
colnames = "c" <> ToString[#] & /@ Range[15]
Out[425]= {"c1", "c2", "c3", "c4", "c5", "c6", "c7", "c8", "c9", "c10", "c11", "c12",
"c13", "c14", "c15"}
(in practice, usually names are more descriptive, of course). Here is the table then:
table = Prepend[rows, colnames];
Here is the select statement you need (I picked x = 4
and y=2
):
select[{"c1", "c2", "c3", "c6", "c7", "c8", "c9", "c15"}, from[table],
where["c2" == 4 && "c8" != 2]]
{{2, 4, 6, 12, 14, 16, 18, 30}}
Now, for a single query, this may look like a complicated way to do this. But you can do many different queries, such as
In[468]:= select[{"c1", "c2", "c3"}, from[table], where[EvenQ["c2"] && "c10" > 10]]
Out[468]= {{2, 4, 6}, {3, 6, 9}, {4, 8, 12}, {5, 10, 15}}
and similar.
Of course, if there are specific correlations in your data, you might find a particular special-purpose algorithm which will be faster. The function above can be extended in many ways, to simplify common queries (include "all", etc), or to auto-compile the generated pure function (if possible).
EDIT
On a philosophical note, I am sure that many Mathematica users (myself included) found themselves from time to time writing similar code again and again. The fact that Mathematica has a concise syntax makes it often very easy to write for any particular case. However, as long as one works in some specific domain (like, for example, data manipulations in a table), the cost of repeating yourself will be high for many operations. What my example illustrates in a very simple setting is a one possible way out - create a Domain-Specific Language (DSL). For that, one generally needs to define a syntax/grammar for it, and write a compiler from it to Mathematica (to generate Mathematica code automatically). Now, the example above is a very primitive realization of this idea, but my point is that Mathematica is generally very well suited for DSL creation, which I think is a very powerful technique.
data = RandomInteger[{1, 20}, {40, 20}]
x = 5;
y = 8;
Select[data, (#[[2]] == x && #[[8]] != y &)][[All, {1, 2, 3, 6, 7, 8, 9, -1}]]
==> {{5, 5, 1, 4, 18, 6, 3, 5}, {10, 5, 15, 3, 15, 14, 2, 5}, {18, 5, 6, 7, 7, 19, 14, 6}}
Some useful commands to get pieces of matrices and list are Span
(;;), Drop
, Take
, Select
, Cases
and more. See tutorial/GettingAndSettingPiecesOfMatrices and guide/PartsOfMatrices,
Part
([[...]]) in combination with ;;
can be quite powerful. a[[All, 1;;-1;;2]], for instance, means take all rows and all odd columns (-1 having the usual meaning of counting from the end).
Select
can be used to pick elements from a list (and remember a matrix is a list of lists), based on a logical function. It's twin brother is Cases
which does selection based on a pattern. The function I used here is a 'pure' function, where # refers to the argument on which this function is applied (the elements of the list in this case). Since the elements are lists themselves (the rows of the matrix) I can refer to the columns by using the Part
([[..]]) function.
To pull out columns (or rows) you can do it by part indexing
data = Array[#1 #2 &, {5, 15}];
data[[All, Flatten@{Range@3, Range @@ {6, 9}, -1}]]
MatrixForm@%
The last line is just to view it pretty.
As Sjoerd mentioned in his comment (and in the explanation in his answer), indexing a single range can be easily done with the Span
(;;
) command. If you are joining multiple disjoint ranges, using Flatten
to combine the separate ranges created with Range
is easier than entering them by hand.
I read:
Extract Column 1 to 3, 6 to 9 as well as the last One for every lines where the value in Column 2 is equal to x and the value in column 8 is different than y
as meaning that we want:
- elements 1-3 and 6-9 from each row
AND
- the last element from rows wherein
[[2]] == x && [[8]] != y
.
This is what I hacked together:
a = RandomInteger[5, {20, 10}]; (*define the array*)
x = 4; y = 0; (*define the test values*)
Join @@ Range @@@ {1 ;; 3, 6 ;; 9}; (*define the column ranges*)
#2 == x && #8 != y & @@@ a; (*test the rows*)
Append[%%, #] & /@ % /. {True -> -1, False :> Sequence[]}; (*complete the ranges according to the test*)
MapThread[Part, {a, %}] // TableForm (*extract and display*)
精彩评论