This is a generic question that applies to (probably) any high-level programming language. Here is the situation:
Suppose I have an array of strings. Say I've managed to load 500,000 words from a short story into an array (assume you have no choice about the input format). Naturally, a fair number of those items will be duplicates.
I want to take this array of strings and create another array containing only the unique elements of the first (i.e. no duplicates). In this scenario, both the input and the output must be arrays, which may rule out some options.
Performance-wise, what's the fastest way to accomplish this? I'm currently using a linear search to check whether each word already exists, but since that's a linear search per word, I suspect there are faster approaches, especially if I have unreasonable amounts of strings to work with. Like a bigger novel!
Using a hash set is probably the most sensible thing to do: membership checks are O(1) on average, so the whole pass is expected O(N).
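As a sketch, here is the hash-set approach in Python (the language is just for illustration; any language with a hash-based set works the same way):

```python
def unique(words):
    """Return the unique words in first-seen order.

    The set gives O(1) average-time membership checks, so the
    whole function runs in expected O(n) for n input words.
    """
    seen = set()
    result = []
    for word in words:
        if word not in seen:
            seen.add(word)
            result.append(word)
    return result

print(unique(["to", "be", "or", "not", "to", "be"]))
# ['to', 'be', 'or', 'not']
```

The output list preserves the order in which words first appeared, which a plain set by itself would not guarantee.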
Note: most high-level programming languages ship a built-in function that removes duplicates from an array, e.g. PHP's array_unique().
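In Python, for instance, the idiomatic one-liner leans on the fact that dicts preserve insertion order (guaranteed since Python 3.7):

```python
words = ["the", "quick", "the", "brown", "quick", "fox"]

# dict.fromkeys keeps only the first occurrence of each key,
# and iterating the dict yields keys in insertion order
unique_words = list(dict.fromkeys(words))
print(unique_words)  # ['the', 'quick', 'brown', 'fox']
```

This is the same hash-based idea as above, just packaged by the standard library.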
If you are going to be putting gazillions of words into it, a directed acyclic word graph is the most efficient data structure I know of.
And yet it is conceptually a very simple data structure.
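To give the flavor of it, here is a simplified Python sketch using a trie, which is the starting point for a DAWG (a real DAWG additionally merges identical suffix subtrees to save space; that merging step is omitted here):

```python
class TrieNode:
    """One node in a character trie; a DAWG would share suffixes."""
    __slots__ = ("children", "terminal")

    def __init__(self):
        self.children = {}   # char -> TrieNode
        self.terminal = False  # True if a word ends here

def add_word(root, word):
    """Insert word into the trie; return True if it was new."""
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    was_new = not node.terminal
    node.terminal = True
    return was_new

root = TrieNode()
words = ["banana", "band", "banana", "bandana"]
deduped = [w for w in words if add_word(root, w)]
print(deduped)  # ['banana', 'band', 'bandana']
```

Each insert costs O(length of the word), and common prefixes are stored only once, which is where the space savings over a plain hash set come from on large word lists.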