I'm looking for a text search engine for a non-traditional sort of text search and I want advice on which tool (Lucene, Sphinx, Xapian, or something else) is most appropriate for me, plus pointers on where to get started.
I have molecules represented as graphs (atoms and bond). I have a way to enumerate all subgraphs of up to size k. Being technical, the inputs are SMILES and the output is canonical SMARTS and the number of times each subgraph/SMARTS occurs.
For example, if the input molecule is "CCO" then the canonical results are {"C": 2, "O": 1, "CC": 1, "OC": 1, "CCO": 1}, and if the molecule is "SCO" then the canonical results are {"C": 1, "S": 1, "O": 1, "CS": 1, "OC": 1, "SCO": 1}. These are tiny examples. For a real molecule I get around 500 "words", which look like "CC(C)O", "CCCOCC", "cn" and "cccc(c)O".
Looking at molecules as collections of characteristic strings plus counts means I should be able to use a text search tool to do comparisons at the text level, with the hope that they are meaningful at the chemistry level.
For example, I can use cosine similarity, perhaps with tf-idf weighting, and find similar molecules by looking for similar subpatterns. With the "CCO" and "SCO" examples above, the cosine similarity is (2*1+1*1+1*1)/(sqrt(2*2+1*1+1*1+1*1+1*1)*sqrt(6*(1*1))) = 4/sqrt(8*6) ≈ 0.58.
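To make the arithmetic concrete, here is a minimal sketch of that cosine similarity over the fragment-count dicts from the example above (the `cosine_similarity` helper is mine, not from any library):

```python
from math import sqrt

def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors stored as dicts."""
    dot = sum(count * b.get(word, 0) for word, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b)

cco = {"C": 2, "O": 1, "CC": 1, "OC": 1, "CCO": 1}
sco = {"C": 1, "S": 1, "O": 1, "CS": 1, "OC": 1, "SCO": 1}
print(round(cosine_similarity(cco, sco), 2))  # 0.58
```

Only "C", "O", and "OC" are shared, so the dot product is 2+1+1 = 4, matching the hand calculation.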
For another example, if I want to find molecules which contain a "CCS" substructure then I can do a fast inverted index search based on the counts (the molecules must have at least 2 "C"s, at least 1 "CS", and so on) before tackling the NP-complete subgraph isomorphism problem. That is, text-based methods can act as a filter to reject obvious mismatches.
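The count-based prefilter amounts to checking that the candidate's fragment counts dominate the query's. A sketch, with hypothetical fragment counts made up for illustration:

```python
def may_contain(query_counts, molecule_counts):
    """A molecule can possibly contain the query substructure only if it has
    at least as many occurrences of every fragment the query produces."""
    return all(molecule_counts.get(frag, 0) >= n
               for frag, n in query_counts.items())

# hypothetical counts for a "CCS" query and two candidate molecules
ccs = {"C": 2, "S": 1, "CC": 1, "CS": 1, "CCS": 1}
mol_a = {"C": 3, "S": 1, "CC": 2, "CS": 1, "CCS": 1}   # survives the filter
mol_b = {"C": 1, "S": 1, "O": 1, "CS": 1, "OC": 1, "SCO": 1}  # rejected: only 1 "C"

print(may_contain(ccs, mol_a), may_contain(ccs, mol_b))  # True False
```

Only the candidates that pass this cheap test need the expensive graph-isomorphism check.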
I'm trying to figure out which text solutions exist, but it's a bit daunting. I don't need stop words, I don't need stemming, I don't care about word order; I don't need quite a number of the features which exist. I do need the ability to keep word vectors, since it's important to know whether "C" appears 2 times or 3.
Which text search engine is most appropriate for me? It looks like Lucene, especially with the work in Mahout. Can you recommend which parts of the documentation to look at or relevant tutorials? The ones I've found are meant for full-text searches, with stemming and the other features I don't need.
EDIT: I may have understood this better now.
You want to compare graphs, represented as strings. The strings have "words" which may repeat.
You may use Lucene, in which case I second the suggestion to use Solr.
Basically, each Solr document will consist of a single field; the field will contain the string, which I suggest you unroll: write "C C" instead of "C:2".
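The unrolling step is trivial to script. A sketch (helper name is mine) that turns a fragment-count dict into the space-separated token stream a whitespace analyzer expects:

```python
def unroll(counts):
    """Turn {"C": 2, "O": 1} into "C C O" so a plain whitespace tokenizer
    sees each fragment repeated once per occurrence, preserving term frequency."""
    return " ".join(frag
                    for frag, n in sorted(counts.items())
                    for _ in range(n))

print(unroll({"C": 2, "O": 1, "CC": 1, "OC": 1, "CCO": 1}))  # C C CC CCO O OC
```

Sorting is only for deterministic output; Lucene does not care about token order here.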
If you use a space to separate the words, you can use a WhiteSpaceAnalyzer. If you use another separator, you may need to write a custom analyzer, which is not so hard to do.
Is this a good idea? I am not sure. Here's why:
- Lucene (and Solr) do not use cosine similarity as such, but rather the Lucene Similarity, which mixes cosine, TF/IDF and boolean scoring, with some specific modifications. This works well for most textual use cases, but may differ from what you need.
- Do you need to compare hits from different searches? If you do, it is hard to do using Solr, as it normalizes every search to a maximal score of 1.
I suggest you try Solr on a small sample of your database. If Solr works for you, fine. If not, shingling and min-hashes are probably the way to go. Mining of Massive Datasets by Rajaraman and Ullman is a recent free book about these subjects; I suggest you read it. It covers searching for similar strings in mountains of data. I guess the differentiator is: do you need a relatively large intersection? If so, use shingling and min-hashes. If not, maybe Solr is enough.
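For a sense of what the min-hash route looks like, here is a minimal sketch (all helper names are mine; a real system would use LSH banding on top of these signatures rather than comparing them pairwise):

```python
import hashlib

def minhash_signature(words, num_hashes=64):
    """MinHash signature of a word set: for each of num_hashes seeded hash
    functions, keep the minimum hash value seen over the set."""
    return [min(int(hashlib.md5(f"{seed}:{w}".encode()).hexdigest(), 16)
                for w in words)
            for seed in range(num_hashes)]

def estimated_jaccard(sig_a, sig_b):
    """The fraction of positions where two signatures agree is an unbiased
    estimate of the Jaccard similarity of the underlying sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

cco = {"C", "O", "CC", "OC", "CCO"}
sco = {"C", "S", "O", "CS", "OC", "SCO"}
print(estimated_jaccard(minhash_signature(cco), minhash_signature(sco)))
# true Jaccard is 3/8 = 0.375; the estimate converges as num_hashes grows
```

Note this treats the fragments as a set, discarding counts; to keep counts you would hash the unrolled tokens ("C#1", "C#2", ...) instead.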
Hmm... I don't really know what SMARTS are, or how chemical similarity actually works. If you want to use Lucene, first consider using Solr. Since your data is in graphs, you could take a look at neo4j with the Solr component. Also, would this problem be more closely related to document near-duplicate detection? There are a number of algorithms for that: LSH, SpotSigs, shingling, and simhash. Wish I could be of more help.
Don't use Lucene. Or Solr. The internal models are antiquated and cobbled together, although they do a good job. Find an engine that meets your minimal criteria (if you want to map this onto a text engine): BM25F fully supported. If I were after scalability, performance, and a low-cost support community, frankly I'd go with SQL Server and cubes. Licensing with SQL Server could be a complete blocker, though. Good luck.