Writing a search engine


The title might be a bit misleading, but I couldn't figure out a better one. I'm writing a simple search engine which will search several sites for a specific domain. To be concrete: I'm writing a search engine for hardstyle livesets/aftermovies/tracks. To do that, I will search the sites that provide livesets, tracks, and such. The problem here is speed: I need to pass the search query to 5-7 sites, get the results, and then use my own algorithm to display the results in a sorted order. I could just "multithread" it, but that's easier said than done, so I have a few questions.

  1. What would be the best solution to this problem? Should I just multithread/multiprocess this application so that I get a bit of a speed-up?

  2. Are there any other solutions, or am I doing something really wrong?

Thanks,

William van Doorn


Unless you're trying to learn multithreading, avoid writing the infrastructure for this yourself. Synchronizing lots of tasks that can take different amounts of time, handling failures, and so on quickly becomes a mess.

For largely parallelizable tasks (such as querying multiple sites, combining results, etc.), you may want to look at existing infrastructures.

Map/reduce frameworks (such as Hadoop for Java) can handle some of this for you, letting you focus on the logic of your application.
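As a rough illustration, this is the shape of what you would write in Hadoop; everything else (distribution, synchronization, retries) is the framework's job. The key/value layout here — one "trackTitle<TAB>score" line per result, with the reducer keeping the best score per track — is purely hypothetical, but the Mapper/Reducer API is standard Hadoop:

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class ResultMerge {

  public static class ScoreMapper
      extends Mapper<LongWritable, Text, Text, DoubleWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      String[] parts = line.toString().split("\t");   // "trackTitle<TAB>score"
      context.write(new Text(parts[0]),
                    new DoubleWritable(Double.parseDouble(parts[1])));
    }
  }

  public static class BestScoreReducer
      extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    @Override
    protected void reduce(Text title, Iterable<DoubleWritable> scores, Context context)
        throws IOException, InterruptedException {
      double best = Double.NEGATIVE_INFINITY;
      for (DoubleWritable s : scores) best = Math.max(best, s.get());
      context.write(title, new DoubleWritable(best)); // one merged entry per track
    }
  }
}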


In the specific case of a search engine I recommend you check out Solr or Lucene. For 5-7 sites Hadoop will probably be overkill. Incremental indexing is possible, and you can also add specific metadata to each of the searchable items.
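For example, a minimal Lucene sketch (assuming Lucene 8.x/9.x; the field names and the sample document are made up) that indexes one entry with a metadata field and then searches it:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.ByteBuffersDirectory;

public class LivesetIndexDemo {
  public static void main(String[] args) throws Exception {
    StandardAnalyzer analyzer = new StandardAnalyzer();
    ByteBuffersDirectory dir = new ByteBuffersDirectory(); // in-memory; use FSDirectory on disk

    // index one (made-up) liveset entry with extra metadata
    try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
      Document doc = new Document();
      doc.add(new TextField("title", "Some hardstyle liveset", Field.Store.YES));
      doc.add(new StringField("type", "liveset", Field.Store.YES)); // searchable metadata
      writer.addDocument(doc); // incremental: call again later for new entries
    }

    // query the index
    try (DirectoryReader reader = DirectoryReader.open(dir)) {
      IndexSearcher searcher = new IndexSearcher(reader);
      Query query = new QueryParser("title", analyzer).parse("hardstyle");
      for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
        System.out.println(searcher.doc(hit.doc).get("title"));
      }
    }
  }
}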

I can imagine these sites also publish a lot of their content in RSS feeds, which you can use to keep your indexes up to date faster than you would by continuously crawling them.
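A hedged sketch of reading such a feed with the ROME library (the feed URL is a placeholder, and whether a given site actually offers a feed is an assumption):

import java.net.URL;
import java.util.Date;
import com.rometools.rome.feed.synd.SyndEntry;
import com.rometools.rome.feed.synd.SyndFeed;
import com.rometools.rome.io.SyndFeedInput;
import com.rometools.rome.io.XmlReader;

public class FeedPoller {
  // Poll one site's RSS feed and report entries newer than the last run,
  // instead of re-crawling the whole site.
  public static void pollFeed(String feedUrl, Date lastRun) throws Exception {
    SyndFeed feed = new SyndFeedInput().build(new XmlReader(new URL(feedUrl)));
    for (SyndEntry entry : feed.getEntries()) {
      Date published = entry.getPublishedDate();
      if (published != null && published.after(lastRun)) {
        System.out.println(entry.getTitle() + " -> " + entry.getLink());
        // hand the new entry to the indexer here
      }
    }
  }
}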

The search engine itself allows all kinds of interesting ways to get to your results blindingly fast, whether for post-processing or for displaying them to your users immediately.

For parallelization there is excellent support in the java.util.concurrent packages (JSR-166), which let you run tasks in parallel without headaches if you stick to one of the proposed patterns. They work really well.
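For instance, a sketch using an ExecutorService (the SiteResult type and the querySite method are hypothetical placeholders for your own per-site query code; the 5-second deadline is just an illustrative value):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CancellationException;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class ParallelSearch {

  // Fan the query out to every site in parallel and gather whatever
  // answered within the deadline; slow or failing sites are skipped.
  public List<SiteResult> search(String query, List<String> sites) throws InterruptedException {
    ExecutorService pool = Executors.newFixedThreadPool(sites.size());
    try {
      List<Callable<SiteResult>> tasks = new ArrayList<>();
      for (String site : sites) {
        tasks.add(() -> querySite(site, query)); // one task per site
      }
      List<Future<SiteResult>> futures = pool.invokeAll(tasks, 5, TimeUnit.SECONDS);
      List<SiteResult> results = new ArrayList<>();
      for (Future<SiteResult> f : futures) {
        try {
          results.add(f.get()); // throws if the site query failed or was cancelled
        } catch (ExecutionException | CancellationException e) {
          // one bad site shouldn't sink the whole search
        }
      }
      return results;
    } finally {
      pool.shutdownNow();
    }
  }

  private SiteResult querySite(String site, String query) { /* query the site, parse its results */ return null; }

  static class SiteResult { /* title, url, score, ... */ }
}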

Just some thoughts.


Use Google? ;)

The bottleneck will be downloading the information; multithreading will help.

Otherwise, download only the HTML.
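A minimal sketch of fetching only the HTML of a result page (the timeout values are illustrative, and the page URL is whatever site you are querying):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class HtmlFetcher {

  // Download just the HTML of one result page: no images, stylesheets or scripts.
  public static String fetchHtml(String pageUrl) throws Exception {
    HttpURLConnection conn = (HttpURLConnection) new URL(pageUrl).openConnection();
    conn.setConnectTimeout(3000); // don't let one slow site stall the whole search
    conn.setReadTimeout(5000);
    conn.setRequestProperty("Accept", "text/html");
    StringBuilder html = new StringBuilder();
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
      String line;
      while ((line = in.readLine()) != null) {
        html.append(line).append('\n');
      }
    }
    return html.toString();
  }
}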


I'll sketch it here in Java-like code, using a CountDownLatch as the barrier:

// main thread

CountDownLatch barrier = new CountDownLatch(numberOfQueries); // initialize the latch
                                                              // with the number of worker threads

Worker[] workers = new Worker[numberOfQueries];
for (int i = 0; i < numberOfQueries; i++) {
    workers[i] = new Worker(i, barrier); // create a worker, passing the latch
    workers[i].start();                  // start the worker
}

barrier.await(); // block until the count reaches zero

Result[] results = new Result[numberOfQueries];
for (int i = 0; i < numberOfQueries; i++) {
    results[i] = workers[i].getResult(); // collect the results
}

display(results); // display the results


// worker thread (Worker extends Thread)

public void run() {
    doTheJob();           // do the long job of querying a site
    barrier.countDown();  // once the job is finished, count the latch down
}


You can use Map/Reduce for this kind of task. Hadoop is an implementation in Java.
