So here is my problem. I have two paragraphs of text and I need to see if they are similar. Not in the sense of string metrics but in meaning. The following two paragraphs are related but I need to find out if they cover the 'same' topic. Any help or direction to solving this problem would be greatly appreciated.
Fossil fuels are fuels formed by natural processes such as anaerobic decomposition of buried dead organisms. The age of the organisms and their resulting fossil fuels is typically millions of years, and sometimes exceeds 650 million years. The fossil fuels, which contain high percentages of carbon, include coal, petroleum, and natural gas. Fossil fuels range from volatile materials with low carbon:hydrogen ratios like methane, to liquid petroleum to nonvolatile materials composed of almost pure carbon, like anthracite coal. Methane can be found in hydrocarbon fields, alone, associated with oil, or in the form of methane clathrates. It is generally accepted that they formed from the fossilized remains of dead plants by exposure to heat and pressure in the Earth's crust over millions of years. This biogenic theory was first introduced by Georg Agricola in 1556 and later by Mikhail Lomonosov in the 18th century.
Second:
Fossil fuel reforming is a method of producing hydrogen or other useful products from fossil fuels such as natural gas. This is achieved in a processing device called a reformer which reacts steam at high temperature with the fossil fuel. The steam methane reformer is widely used in industry to make hydrogen. There is also interest in the development of much smaller units based on 开发者_Go百科similar technology to produce hydrogen as a feedstock for fuel cells. Small-scale steam reforming units to supply fuel cells are currently the subject of research and development, typically involving the reforming of methanol or natural gas but other fuels are also being considered such as propane, gasoline, autogas, diesel fuel, and ethanol.
That's a tall order. If I were you, I'd start reading up on Natural Language Processing. NLP is a fairly large field -- I would recommend looking specifically at the things mentioned in the Wikipedia Text Analytics article's "Processes" section.
I think if you make use of information retrieval, named entity recognition, and sentiment analysis, you should be well on your way.
In general, I believe that this is still an open problem. Natural language processing is still a nascent field and while we can do a few things really well, it's still extremely difficult to do this sort of classification and categorization.
I'm not an expert in NLP, but you might want to check out these lecture slides that discuss sentiment analysis and authorship detection. The techniques you might use to do the sort of text comparison you've suggested are related to the techniques you would use for the aforementioned analyses, and you might find this to be a good starting point.
Hope this helps!
You can also have a look on Latent Dirichlet Allocation (LDA) model in machine learning. The idea there is to find a low-dimensional representation of each document (or paragraph), simply as a distribution over some 'topics'. The model is trained in an unsupervised fashion using a collection of documents/paragraphs.
If you run LDA on your collection of paragraphs, then by looking into the similarity of the hidden topics vector, you can find whether a given two paragraphs are related or not.
Of course, the baseline is to not use the LDA, and instead use the term frequencies (augmented with tf/idf) to measure similarities (vector space model).
精彩评论