Assistance with building an inverted-index

Developer https://www.devze.com 2022-12-25 06:43 Source: Internet

It's part of an information retrieval project I'm doing for school. The plan is to create a hashmap of words, using the first two letters of each word as the key and all words that start with those two letters saved as a string value. So,

hashmap["ba"] = "bad barley base"

Once I'm done tokenizing a line I take that hashmap, serialize it, and append it to the text file named after the key.
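To make the scheme above concrete, here is a minimal sketch of the grouping step only, assuming words are lowercased and split on non-letter characters. The function name `groupByPrefix` and the sample line are hypothetical, and the file-writing step is left out.

```php
<?php
// Hypothetical helper: group the words of one line by their first two letters.
function groupByPrefix(string $line): array
{
    $sets = [];
    // Lowercase the line, then split on anything that is not a letter a-z.
    $words = preg_split('/[^a-z]+/', strtolower($line), -1, PREG_SPLIT_NO_EMPTY);
    foreach ($words as $word) {
        if (strlen($word) < 2) {
            continue; // too short to have a two-letter key
        }
        $key = substr($word, 0, 2);
        $sets[$key][$word] = true; // words stored as array keys, so duplicates collapse
    }

    // Turn each set of words back into a space-separated string value.
    $hashmap = [];
    foreach ($sets as $key => $set) {
        $hashmap[$key] = implode(' ', array_keys($set));
    }
    return $hashmap;
}

// Made-up sample line.
$hashmap = groupByPrefix('Bad barley, base; bad apple');
// $hashmap['ba'] is "bad barley base" and $hashmap['ap'] is "apple"
```

Keeping the grouping separate from the file writes makes it easy to check the map in memory before anything touches the disk.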

The idea is that spreading my data over hundreds of files will reduce the time it takes to fulfill a search, because each file holds less data. The problem I'm running into is that when a run creates 100+ files, it chokes on creating a few of them for whatever reason, so those entries end up empty. Is there any way to make this more efficient? Is it worth continuing with this, or should I abandon it?

I'd like to mention I'm using PHP. The two languages I know relatively intimately are PHP and Java. I chose PHP because the front end will be very simple to do and I will be able to add features like autocompletion/suggested search without a problem. I also see no benefit in using Java. Any help is appreciated, thanks.

I would use a single file to store and read back the serialized data, and I would use JSON as the serialization format.

Put the data

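// Split the space-separated words into an array and store it under the two-letter key.
$hashmap = [];
$string = "bad barley base";
$data = explode(" ", $string);
$hashmap["ba"] = $data;

// Encode the whole map as JSON and write it to a single file.
$jsonContent = json_encode($hashmap);
file_put_contents("a-z.txt", $jsonContent);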

Get the data

// Read the file back; passing true makes json_decode() return associative arrays.
$jsonContent = file_get_contents("a-z.txt");
$hashmap = json_decode($jsonContent, true);

// Look the key up directly instead of looping over every entry.
$wordCount = isset($hashmap['ba']) ? count($hashmap['ba']) : 0;

You didn't explain the problem you are trying to solve. I'm guessing you are trying to build a full-text search engine, but you don't have document IDs in your hashmap, so I'm not sure how you are using the hashmap to find matching documents.
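To illustrate the point about document IDs, here is a minimal sketch of an inverted index that maps each word to the IDs of the documents containing it. The function name `buildInvertedIndex`, the sample documents, and their IDs are all made up for illustration.

```php
<?php
// Hypothetical helper: build a map of word => list of document IDs.
function buildInvertedIndex(array $documents): array
{
    $index = [];
    foreach ($documents as $docId => $text) {
        // Lowercase the text, then split on anything that is not a letter a-z.
        $words = preg_split('/[^a-z]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
        // array_unique() so each document is listed at most once per word.
        foreach (array_unique($words) as $word) {
            $index[$word][] = $docId;
        }
    }
    return $index;
}

// Made-up sample documents, keyed by document ID.
$documents = [
    1 => 'Bad barley makes bad beer',
    2 => 'A base of barley and hops',
];
$index = buildInvertedIndex($documents);

// A search is then a single key lookup.
$matches = $index['barley'] ?? [];
// $matches is [1, 2]; $index['bad'] is [1]
```

With the IDs stored alongside each word, a query returns the matching documents directly rather than just confirming that the word exists somewhere.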

Assuming you want a full-text search engine, I would look into using a trie for the data structure. You should be able to fit everything in it without it growing too large. Nodes that match a word you want to index would contain the IDs of the documents containing that word.
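A rough sketch of the trie idea, assuming nested PHP arrays as nodes and lowercase ASCII words. The node layout and the function names `trieInsert` and `trieLookup` are hypothetical, not taken from any library.

```php
<?php
// Hypothetical node layout: ['children' => [char => node], 'docs' => [docId => true]].
function trieInsert(array &$root, string $word, int $docId): void
{
    if ($word === '') {
        return; // nothing to index
    }
    $current = &$root;
    foreach (str_split($word) as $char) {
        if (!isset($current['children'][$char])) {
            $current['children'][$char] = ['children' => [], 'docs' => []];
        }
        $current = &$current['children'][$char]; // descend by reference
    }
    // The node where the word ends records which documents contain it.
    $current['docs'][$docId] = true;
}

function trieLookup(array $root, string $word): array
{
    $node = $root;
    foreach (str_split($word) as $char) {
        if (!isset($node['children'][$char])) {
            return []; // no indexed word follows this path
        }
        $node = $node['children'][$char];
    }
    return array_keys($node['docs']);
}

// Made-up words and document IDs.
$trie = ['children' => [], 'docs' => []];
trieInsert($trie, 'bad', 1);
trieInsert($trie, 'barley', 1);
trieInsert($trie, 'barley', 2);
trieInsert($trie, 'base', 2);

$barleyDocs = trieLookup($trie, 'barley'); // [1, 2]
$prefixOnly = trieLookup($trie, 'bar');    // [] because "bar" itself was never indexed
```

Words that share a prefix share nodes, so "bad", "barley", and "base" all reuse the "b" and "a" nodes. That shared structure is what would also make prefix features such as autocompletion straightforward to add.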