How to go about indexing 300,000 text files for search?

Developer https://www.devze.com 2023-03-10 02:00 Source: Internet

I have a static collection of over 300,000 text and html files. I want to be able to search them for words, exact phrases, and ideally regex patterns. I want the searches to be fast.

I think searching for words and phrases can be done by looking up a dictionary of unique words, each pointing to the files that contain it, but is there a way to have reasonably fast regex matching?
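The dictionary idea above is usually called an inverted index. Here is a minimal Python sketch of it, not a definitive implementation. It assumes the files fit in memory as a `docs` mapping of file name to text, that HTML has already been stripped to plain text, and that words are lowercased runs of `\w` characters. All function names are made up for illustration. Storing word positions makes exact phrases checkable. The regex function only matches a pattern against individual words in the dictionary, not against arbitrary text spanning several words, so it covers only part of the regex requirement.

```python
import re
from collections import defaultdict

TOKEN_RE = re.compile(r"\w+")


def tokenize(text):
    """Lowercase the text and split it into word tokens (an assumed, simple scheme)."""
    return TOKEN_RE.findall(text.lower())


def build_index(docs):
    """Build a positional inverted index: word -> {doc_id: [positions]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, word in enumerate(tokenize(text)):
            index[word][doc_id].append(pos)
    return index


def search_word(index, word):
    """Return the set of documents that contain the word."""
    return set(index.get(word.lower(), {}))


def search_phrase(index, phrase):
    """Return documents in which the words of the phrase appear consecutively."""
    words = tokenize(phrase)
    if not words:
        return set()
    # A matching document must contain every word of the phrase.
    candidates = search_word(index, words[0])
    for word in words[1:]:
        candidates &= search_word(index, word)
    hits = set()
    for doc_id in candidates:
        later = [set(index[word][doc_id]) for word in words[1:]]
        # Check whether some occurrence of the first word is followed,
        # position by position, by the remaining words.
        for start in index[words[0]][doc_id]:
            if all(start + i + 1 in positions for i, positions in enumerate(later)):
                hits.add(doc_id)
                break
    return hits


def search_word_regex(index, pattern):
    """Match a regex against the dictionary of words, then union their documents."""
    rx = re.compile(pattern)
    hits = set()
    for word, postings in index.items():
        if rx.fullmatch(word):
            hits.update(postings)
    return hits
```

For example, after `index = build_index(docs)`, a call like `search_phrase(index, "quick brown")` returns the names of the files containing that exact phrase. Running the regex over the dictionary scans each unique word once instead of every file, which is why it stays reasonably fast on a static collection. A pattern that has to span several words would still need a scan of the candidate files.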

I don't mind using existing software if such exists.

Consider Lucene: http://lucene.apache.org/java/docs/index.html

Quite a few products on the market will do what you want. Some are open source and some are commercial:

Open source:

Elasticsearch - based on Lucene

Constellio - based on Lucene

Sphinx - written in C++

Solr - built on top of Lucene

You can have a look at Microsoft Search Server Express 2010: http://www.microsoft.com/enterprisesearch/searchserverexpress/en/us/technical-resources.aspx

A roundup of open-source search engines: http://blog.webdistortion.com/2011/05/29/open-source-search-engines/