This is a generic question that applies to (probably) any high-level programming language. Here is the situation:
Suppose I have an array of strings. Say I've managed to load 500,000 words from a short story into an array (assume you have no choice about the input format). Naturally, a fair number of those items will be duplicates.
I want to take this array of strings and create another array containing only the unique elements of the first (i.e. no duplicates). In this scenario, both the input and the output must be arrays, which may rule out some options.
Performance-wise, what's the fastest way to accomplish this? I'm currently using a linear search to check whether each word already exists, but since that's a linear search per word, I suspect there are faster approaches, especially if I have unreasonable amounts of strings to work with. Like a bigger novel!
Using a hash set is probably the most sensible thing to do: membership checks are O(1) on average, so the whole pass is expected O(N).
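As a sketch, here is the hash-set approach in Python (the language is just for illustration; any language with a hash-based set works the same way):

```python
def unique(words):
    """Return the unique words in first-seen order.

    The set gives O(1) average-time membership checks, so the
    whole function runs in expected O(n) for n input words.
    """
    seen = set()
    result = []
    for word in words:
        if word not in seen:
            seen.add(word)
            result.append(word)
    return result

print(unique(["to", "be", "or", "not", "to", "be"]))
# ['to', 'be', 'or', 'not']
```

The output list preserves the order in which words first appeared, which a plain set by itself would not guarantee.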
Note: most high-level programming languages ship a built-in function that removes duplicates from an array, e.g. PHP's array_unique().
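In Python, for instance, the idiomatic one-liner leans on the fact that dicts preserve insertion order (guaranteed since Python 3.7):

```python
words = ["the", "quick", "the", "brown", "quick", "fox"]

# dict.fromkeys keeps only the first occurrence of each key,
# and iterating the dict yields keys in insertion order
unique_words = list(dict.fromkeys(words))
print(unique_words)  # ['the', 'quick', 'brown', 'fox']
```

This is the same hash-based idea as above, just packaged by the standard library.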
If you are going to be putting gazillions of words into it, a directed acyclic word graph is the most efficient data structure I know of.
And yet it is conceptually a very simple data structure.
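To give the flavor of it, here is a simplified Python sketch using a trie, which is the starting point for a DAWG (a real DAWG additionally merges identical suffix subtrees to save space; that merging step is omitted here):

```python
class TrieNode:
    """One node in a character trie; a DAWG would share suffixes."""
    __slots__ = ("children", "terminal")

    def __init__(self):
        self.children = {}   # char -> TrieNode
        self.terminal = False  # True if a word ends here

def add_word(root, word):
    """Insert word into the trie; return True if it was new."""
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    was_new = not node.terminal
    node.terminal = True
    return was_new

root = TrieNode()
words = ["banana", "band", "banana", "bandana"]
deduped = [w for w in words if add_word(root, w)]
print(deduped)  # ['banana', 'band', 'bandana']
```

Each insert costs O(length of the word), and common prefixes are stored only once, which is where the space savings over a plain hash set come from on large word lists.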