Where would a suffix array be preferable to a suffix tree?_问答_开发者

Where would a suffix array be preferable to a suffix tree?

开发者 https://www.devze.com 2023-03-29 08:35 出处：网络

Two closely-related data structures are the suffix tree and suffix array.From what I\'ve read, the suffix tree is faster, more powerful, more flexible, and more memory-efficient than a suffix array.Ho

Two closely-related data structures are the suffix tree and suffix array. From what I've read, the suffix tree is faster, more powerful, more flexible, and more memory-efficient than a suffix array. However, in this earlier question, one of the top answers mentioned that suffix arrays are more widely used in practice. I don't have any experience using either of these structures, but right now it seems like I would always prefer a suffix tree over a suffix array for problems that needed the functionality they provide (fast substring checking, for example).

In what circumstances would a suffix array be preferable开发者_StackOverflow中文版 to a suffix tree?

(By the way, while this question is related to the one I've linked, I don't think it's an exact duplicate as I'm interested solely in a comparison of suffix arrays and suffix trees, leaving tries completely out of the picture. If you disagree, though, I would understand if this question were to be closed.)

Citing from http://www.youtube.com/watch?v=1DGZxd-PP7U

Suffix Arrays and Suffix Trees used to be different. But nowadays Suffix Arrays are just a way of implementing a Suffix Tree (or vice versa). See: Kim, Kim, and Park. Linearized suffix tree: an efficient index data structure with the capabilities of suffix trees and suffix arrays. Algorithmica, 2007.

The Kim et al paper is well written, accessible and has references to other important papers, such as the one by Abouelhoda et al.

A suffix array is nearly always preferable, except:

If you are going to index little ammounts of data.
If you are doing research on protein matching or dna mutations and have access to extremely expensive computers.
If you must at all cost use the error search with wildcards.

A suffix array can be used to implement the suffix tree. Meaning a suffix tree can be a suffix array and a few additional data structures to simulate the suffix tree functionality.

Therefore:

Suffix arrays use less space (a lot less)
Suffix trees are slower to build
Suffix trees are faster doing pattern matching operations
Suffix trees can do more operations, the best is error pattern matching with wildcards (suffix array also does pattern matching but not with wildcards)

If you want to index a lot of data, like more than 50 megabytes. The suffix tree uses so much space that your computer does not have enough ram to keep it in central memory. Therefore it starts to use secondary memory and you will see a huge degradation in speed. (for example the human dna uses 700 megabytes, a suffix tree of that data "can" use 40 gigabytes -> * "can" depending on the implementation * )

Because of this the suffix tree is nearly never used in practice. In practice the suffix array is used and small additional data structures give it some extra functionality (never a complete suffix tree).

However they are different. There are many cases where a pure suffix array is preferable for pattern matching due to efficient speed, fast construction speed and low space use.