I am producing flat lists with 10^6开发者_如何学Python to 10^7 Real numbers, and some of them are repeating.
I need to delete the repeating instances, keeping the first occurrence only, and without modifying the list order.
The key here is efficiency, as I have a lot of lists to process.
Example (fake):
Input:
{.8, .3 , .8, .5, .3, .6}
Desired Output
{.8, .3, .5, .6}
Aside note
Deleting repeating elements with Union (without preserving order) gives in my poor man's laptop:
DiscretePlot[a = RandomReal[10, i]; First@Timing@Union@a, {i, 10^6 Range@10}]
You want DeleteDuplicates
, which preserves list order:
In[13]:= DeleteDuplicates[{.8, .3, .8, .5, .3, .6}]
Out[13]= {0.8, 0.3, 0.5, 0.6}
It was added in Mathematica 7.0.
Not to compete with other answers, but I just could not help sharing a Compile
- based solution. The solution is based on building a binary search tree, and then checking for every number in the list, whether its index in the list is the one used in building the b-tree. If yes, it is the original number, if no - it is a duplicate. What makes this solution interesting for me is that it shows a way to emulate "pass-by-reference" with Compile
. The point is that, if we inline compiled functions into other Compiled functions (and that can be achieved with an "InlineCompiledFunctions" option), we can refer in inner functions to the variables defined in outer function scope (because of the way inlining works). This is not a true pass-by-reference, but it still allows to combine functions from smaller blocks, without efficiency penalty (this is more in the spirit of macro-expnsion). I don't think this is documented, and have no idea whether this will stay in future versions. Anyways, here is the code:
(* A function to build a binary tree *)
Block[{leftchildren , rightchildren},
makeBSearchTree =
Compile[{{lst, _Real, 1}},
Module[{len = Length[lst], ctr = 1, currentRoot = 1},
leftchildren = rightchildren = Table[0, {Length[lst]}];
For[ctr = 1, ctr <= len, ctr++,
For[currentRoot = 1, lst[[ctr]] != lst[[currentRoot]],(*
nothing *),
If[lst[[ctr]] < lst[[currentRoot]],
If[leftchildren[[currentRoot]] == 0,
leftchildren[[currentRoot]] = ctr;
Break[],
(* else *)
currentRoot = leftchildren[[currentRoot]] ],
(* else *)
If[rightchildren[[currentRoot]] == 0,
rightchildren[[currentRoot]] = ctr;
Break[],
(* else *)
currentRoot = rightchildren[[currentRoot]]]]]];
], {{leftchildren, _Integer, 1}, {rightchildren, _Integer, 1}},
CompilationTarget -> "C", "RuntimeOptions" -> "Speed",
CompilationOptions -> {"ExpressionOptimization" -> True}]];
(* A function to query the binary tree and check for a duplicate *)
Block[{leftchildren , rightchildren, lst},
isDuplicate =
Compile[{{index, _Integer}},
Module[{currentRoot = 1, result = True},
While[True,
Which[
lst[[index]] == lst[[currentRoot]],
result = index != currentRoot;
Break[],
lst[[index]] < lst[[currentRoot]],
currentRoot = leftchildren[[currentRoot]],
True,
currentRoot = rightchildren[[currentRoot]]
]];
result
],
{{leftchildren, _Integer, 1}, {rightchildren, _Integer,
1}, {lst, _Real, 1}},
CompilationTarget -> "C", "RuntimeOptions" -> "Speed",
CompilationOptions -> {"ExpressionOptimization" -> True}
]];
(* The main function *)
Clear[deldup];
deldup =
Compile[{{lst, _Real, 1}},
Module[{len = Length[lst], leftchildren , rightchildren ,
nodup = Table[0., {Length[lst]}], ndctr = 0, ctr = 1},
makeBSearchTree[lst];
For[ctr = 1, ctr <= len, ctr++,
If[! isDuplicate [ctr],
++ndctr;
nodup[[ndctr]] = lst[[ctr]]
]];
Take[nodup, ndctr]], CompilationTarget -> "C",
"RuntimeOptions" -> "Speed",
CompilationOptions -> {"ExpressionOptimization" -> True,
"InlineCompiledFunctions" -> True,
"InlineExternalDefinitions" -> True}];
Here are some tests:
In[61]:= intTst = N@RandomInteger[{0,500000},1000000];
In[62]:= (res1 = deldup[intTst ])//Short//Timing
Out[62]= {1.141,{260172.,421188.,487754.,259397.,<<432546>>,154340.,295707.,197588.,119996.}}
In[63]:= (res2 = Tally[intTst,Equal][[All,1]])//Short//Timing
Out[63]= {0.64,{260172.,421188.,487754.,259397.,<<432546>>,154340.,295707.,197588.,119996.}}
In[64]:= res1==res2
Out[64]= True
Not as fast as the Tally
version, but also Equal
- based, and as I said, my point was to illustrate an interesting (IMO) technique.
For versions of Mathematica before 7, and for general interest, here are several ways of implementing the UnsortedUnion (i.e. DeleteDuplicates) function. These are collected from the help docs and MathGroup. They have been adjusted to accept multiple lists which are then joined, in analogy to Union.
For Mathematica 4 or earlier
UnsortedUnion = Module[{f}, f[y_] := (f[y] = Sequence[]; y); f /@ Join@##] &
For Mathematica 5
UnsortedUnion[x__List] := Reap[Sow[1, Join@x], _, # &][[2]]
For Mathematica 6
UnsortedUnion[x__List] := Tally[Join@x][[All, 1]]
From Leonid Shifrin for Mathematica 3+ (?)
unsortedUnion[x_List] := Extract[x, Sort[Union[x] /. Dispatch[MapIndexed[Rule, x]]]]
精彩评论