开发者

How to search keyword in 100 billions of posts?

开发者 https://www.devze.com 2023-01-14 16:13 出处:网络
This is a college project: I have a database ( mysql or postgresql doesn\'t matter ) with 100 billion of posts and I 开发者_StackOverflow中文版need to search ( as fast as possible ) a generic keyword

This is a college project:

I have a database ( mysql or postgresql doesn't matter ) with 100 billion of posts and I 开发者_StackOverflow中文版need to search ( as fast as possible ) a generic keyword.

Every post is 500-1000 keywords.

This isn't only a database issue but also a software ( for indexing or other ) issue.

How can I do that ?

I could use some advanced search engine technology but I don't know which.


Have you considered using Apache Lucene ?

That does not directly work directly with your SQL database, you'll have to write some code to feed the documents into it in order to build and index which you then can query.

I don't know however how much additional space that would require and how much time it will take.


Sell the "100 billion" post data to Google. :)

They will index it for you free and you will make money.


Well, there are about 6.8 billion people on the planet, who can read about 1 post per minute (on average).

If everyone contributes, 100 billion divided by 6.8 billion is 14.7 minutes to have all posts read.

So:

1) Conquer Earth.
2) Make everyone your slave.
3) Have posts read.
4) ???
5) Profit!


You might want to check out Sphinx. It's a full text search engine that handles distributed indexes. You can have parts of the data spread over many computers. And querying a single server can send the query to the other servers, and gather the results from each. It has pretty good speed, but you probably can't do 100 billion posts on a single machine.

You probably won't be able to do something like this in MySQL or Postgresql. While you could store all the data, MySQL and Postgres lack the full text indexing and search speed that a real full text index will bring you.

MySQL has a way to compile in support for the Sphinx storage engine, which, although the data will still be stored in Sphinx separate from MySQL, you can still query the Sphinx search engine using anything that connects to MySQL, and also perform joins to other tables that are in your MySQL database. However, if you just want to perform simple searches on the documents, and don't require joining to other data, you can just use the native PHP interface.


Try Sphinx http://sphinxsearch.com/


Have you tried the built-in full-text indexing functions of your database? You should try it and prove it doesn't work before you decide it's not suitable and look for something else.


Use Google Custom Search. Plus you will earn a little bit and save a lot of hosting resources.


First of all, are we talking keywords in separate fields or in post contents?

If separate fields, it is sort of OK. Just build a table with keyword-post relationships and do simple search on SELECT post_id 7 ... WHERE keyword = 'X'.

If we are talking about full-text indexing, you'd be best off using some custom indexing software, such as suggested in some other posts.

0

精彩评论

暂无评论...
验证码 换一张
取 消