Discovering synonyms from set of documents using LSA transform in Ruby

Developer https://www.devze.com 2023-02-20 05:07 Source: Internet
After applying the LSA transform to a document array, how can this be used to generate synonyms? For instance, I have the following sample documents:

D1 = Mobilization

D2 = Reflective Pavement

D3 = Maintenance of Traffic

D4 = Special Detour

D5 = Commercial Materials for Driveway

            D1    D2    D3    D4    D5    
commerci[ +0.00 +0.00 +0.00 +0.00 +1.00 ]  
  materi[ +0.00 +0.00 +0.00 +0.00 +1.00 ]  
drivewai[ +0.00 +0.00 +0.00 +0.00 +1.00 ]  
 special[ +0.00 +0.00 +0.00 +1.00 +0.00 ]  
  detour[ +0.00 +0.00 +0.00 +1.00 +0.00 ]  
 mainten[ +0.00 +0.00 +1.00 +0.00 +0.00 ]  
 traffic[ +0.00 +0.00 +1.00 +0.00 +0.00 ]  
 reflect[ +0.00 +1.00 +0.00 +0.00 +0.00 ]  
pavement[ +0.00 +1.00 +0.00 +0.00 +0.00 ]  
  mobil [ +1.00 +0.00 +0.00 +0.00 +0.00 ]  

Applying TFIDF transform

            D1    D2    D3    D4    D5  
commerci[ +0.00 +0.00 +0.00 +0.00 +0.54 ]  
  materi[ +0.00 +0.00 +0.00 +0.00 +0.54 ]  
drivewai[ +0.00 +0.00 +0.00 +0.00 +0.54 ]  
 special[ +0.00 +0.00 +0.00 +0.80 +0.00 ]  
  detour[ +0.00 +0.00 +0.00 +0.80 +0.00 ]  
 mainten[ +0.00 +0.00 +0.80 +0.00 +0.00 ]  
 traffic[ +0.00 +0.00 +0.80 +0.00 +0.00 ]  
 reflect[ +0.00 +0.80 +0.00 +0.00 +0.00 ]  
pavement[ +0.00 +0.80 +0.00 +0.00 +0.00 ]  
  mobil [ +1.61 +0.00 +0.00 +0.00 +0.00 ]  

Applying LSA transform

            D1    D2    D3    D4    D5  
commerci[ +0.00 +0.00 +0.00 +0.00 +0.00 ]  
  materi[ +0.00 +0.00 +0.00 +0.00 +0.00 ]  
drivewai[ +0.00 +0.00 +0.00 +0.00 +0.00 ]  
 special[ +0.00 +0.00 +0.00 +0.80 +0.00 ]  
  detour[ +0.00 +0.00 +0.00 +0.80 +0.00 ]  
 mainten[ +0.00 +0.00 +0.80 +0.00 +0.00 ]  
 traffic[ +0.00 +0.00 +0.80 +0.00 +0.00 ]  
 reflect[ +0.00 +0.80 +0.00 +0.00 +0.00 ]  
pavement[ +0.00 +0.80 +0.00 +0.00 +0.00 ]  
  mobil [ +1.61 +0.00 +0.00 +0.00 +0.00 ]  


Firstly, this example won't work. The principle behind LSA is that the more frequently words occur in similar contexts, the more related they are in meaning, so there needs to be some overlap between the input documents. In your matrices no term appears in more than one document, so there is nothing for the transform to merge: the reduction just zeroes out the weakest document column (D5). Paragraph-length documents are ideal, since they have a reasonable number of words and there tends to be a single topic per paragraph.

To understand how LSA is useful for synonym recognition, you first need to understand how a vector space representation of word occurrences (the first matrix you've got there) is useful for synonym recognition. Each word is a row vector, and the distance between two rows in this high-dimensional space is a measure of their similarity, because it reflects how often the words occur together. The magic of LSA is that it reshuffles the dimensions of the vector space so that items that don't occur together, but do occur in similar contexts, are brought together by collapsing similar dimensions into each other.
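As a rough illustration of the vector space idea, here is a minimal Ruby sketch that builds a term-by-document count matrix and compares two word rows. The four documents, the `term_document_matrix` helper and the use of Euclidean distance are all assumptions made for the example (there is no stemming or stop-word removal); unlike D1 to D5, these documents deliberately share terms.

```ruby
# Hypothetical toy corpus in which documents overlap in vocabulary.
DOCS = [
  'traffic detour on main road',
  'road maintenance and traffic control',
  'pavement repair on main street',
  'street pavement maintenance'
].freeze

# Build a term-by-document count matrix as a Hash:
#   term => [count in doc 0, count in doc 1, ...]
def term_document_matrix(docs)
  matrix = Hash.new { |hash, term| hash[term] = Array.new(docs.size, 0) }
  docs.each_with_index do |doc, j|
    doc.downcase.scan(/[a-z]+/).each { |term| matrix[term][j] += 1 }
  end
  matrix
end

# Straight-line distance between two row vectors of equal length.
def euclidean_distance(a, b)
  Math.sqrt(a.zip(b).sum { |x, y| (x - y)**2 })
end

TDM = term_document_matrix(DOCS)

p TDM['traffic']                                        # => [1, 1, 0, 0]
p euclidean_distance(TDM['traffic'], TDM['road'])       # => 0.0
p euclidean_distance(TDM['traffic'], TDM['pavement'])   # => 2.0
```

Here "traffic" and "road" appear in exactly the same documents, so their rows coincide, while "traffic" and "pavement" never share a document and end up far apart.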

The idea of the TFIDF weighting function is to highlight the differences between documents by giving higher weights to words that are concentrated in a small subset of the corpus, and lower weights to words that are used everywhere.
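To make the weighting concrete, here is a small Ruby sketch of one common TFIDF variant: term frequency as count divided by document length, and inverse document frequency as the natural log of (number of documents / number of documents containing the term). This particular formula is an assumption, since the question does not say which library or variant produced its matrix, but it does reproduce the figures shown there. The `tfidf` helper is a name invented for the example.

```ruby
# Stemmed terms per document, copied from the rows of the question's matrix.
STEMMED_DOCS = [
  %w[mobil],                      # D1
  %w[reflect pavement],           # D2
  %w[mainten traffic],            # D3
  %w[special detour],             # D4
  %w[commerci materi drivewai]    # D5
].freeze

# Returns one Hash per document: term => tf * idf
#   tf  = occurrences of the term / number of terms in the document
#   idf = ln(total documents / documents containing the term)
def tfidf(docs)
  doc_freq = Hash.new(0)
  docs.each { |doc| doc.uniq.each { |term| doc_freq[term] += 1 } }

  docs.map do |doc|
    doc.uniq.map do |term|
      tf = doc.count(term).to_f / doc.size
      idf = Math.log(docs.size.to_f / doc_freq[term])
      [term, tf * idf]
    end.to_h
  end
end

WEIGHTS = tfidf(STEMMED_DOCS)

WEIGHTS.each_with_index do |weights, j|
  cells = weights.map { |term, value| format('%s=%.2f', term, value) }
  puts "D#{j + 1}: #{cells.join(' ')}"
end
```

With five documents and every term confined to a single document, the idf is ln(5), roughly 1.61, for all terms, so the only thing that varies is document length: 1.61 for the one-word D1, 0.80 for the two-word documents and 0.54 for the three-word D5. That is why the weighting cannot separate anything in this corpus.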

The "LSA" transformation is actually a singular value decomposition (SVD). Conventionally, Latent Semantic Analysis (or Latent Semantic Indexing) refers to the combination of TFIDF with SVD. The SVD serves to reduce the dimensionality of the vector space; in other words, it condenses the columns into a smaller, more concise description (as described above).
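The following Ruby sketch shows the effect of the rank reduction on a tiny made-up corpus in which "car" and "automobile" never share a document but both occur alongside "engine". It is not how any particular LSA library works internally: to stay dependency-free it finds the leading singular directions by power iteration with deflation on A^T A, and the corpus, the helper names and the choice of rank 2 are all assumptions for illustration. It also assumes the requested rank does not exceed the rank of the matrix.

```ruby
# Term-by-document counts for three hypothetical documents:
#   d1 = "car engine", d2 = "automobile engine", d3 = "flower petal"
TERMS = %w[car automobile engine flower petal].freeze
COUNTS = [
  [1.0, 0.0, 0.0], # car
  [0.0, 1.0, 0.0], # automobile
  [1.0, 1.0, 0.0], # engine
  [0.0, 0.0, 1.0], # flower
  [0.0, 0.0, 1.0]  # petal
].freeze

def dot(a, b)
  a.zip(b).sum { |x, y| x * y }
end

def normalize(v)
  norm = Math.sqrt(dot(v, v))
  v.map { |x| x / norm }
end

# Top-k eigenvectors of the symmetric matrix A^T A (the right singular
# vectors of A), found by power iteration with deflation.
def top_right_singular_vectors(a, k, iterations: 200)
  cols = a.transpose
  gram = cols.map { |ci| cols.map { |cj| dot(ci, cj) } } # A^T A
  Array.new(k) do
    v = normalize((1..gram.size).map(&:to_f)) # fixed, non-symmetric start
    iterations.times { v = normalize(gram.map { |row| dot(row, v) }) }
    eigenvalue = dot(v, gram.map { |row| dot(row, v) })
    # Deflate: subtract the component just found so the next pass
    # converges to the next-largest direction.
    gram = gram.each_with_index.map do |row, i|
      row.each_with_index.map { |x, j| x - eigenvalue * v[i] * v[j] }
    end
    v
  end
end

# Rank-k approximation: project every term row onto the top-k directions.
def reduce_rank(a, k)
  vs = top_right_singular_vectors(a, k)
  a.map do |row|
    coords = vs.map { |v| dot(row, v) }
    (0...row.size).map { |j| coords.zip(vs).sum { |c, v| c * v[j] } }
  end
end

REDUCED = reduce_rank(COUNTS, 2)

TERMS.zip(REDUCED) do |term, row|
  puts format('%-10s %s', term, row.map { |x| format('%.2f', x.abs) }.join(' '))
end
```

In the original counts the rows for "car" and "automobile" are orthogonal. After reduction to rank 2 both rows come out as roughly (0.5, 0.5, 0), so the two words have been pulled together purely because they share the context "engine", while "flower" and "petal" stay in their own dimension.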

So, to get to the nub of your question: you can tell how similar two words are by applying a distance function to the two corresponding vectors (rows). There are several distance functions to choose from, but the most commonly used is the cosine distance, which is based on the angle between the two vectors.
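A minimal Ruby sketch of that last step might look like the following. It computes cosine similarity (the cosine of the angle; cosine distance is one minus this value) and ranks the other terms against a query word. The vectors are invented numbers standing in for rows of a reduced matrix, and `nearest_terms` is a hypothetical helper, not part of any LSA library.

```ruby
# Cosine of the angle between two vectors: 1.0 means same direction,
# 0.0 means orthogonal (no shared context).
def cosine_similarity(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  norm_a = Math.sqrt(a.sum { |x| x * x })
  norm_b = Math.sqrt(b.sum { |x| x * x })
  return 0.0 if norm_a.zero? || norm_b.zero?

  dot / (norm_a * norm_b)
end

# Hypothetical rows of a reduced term-by-document matrix.
VECTORS = {
  'car'        => [0.6, 0.5, 0.0],
  'automobile' => [0.5, 0.6, 0.0],
  'engine'     => [1.0, 1.0, 0.2],
  'flower'     => [0.0, 0.1, 1.0]
}.freeze

# Rank every other term by similarity to the query term, best first.
def nearest_terms(vectors, query, limit: 3)
  vectors.reject { |term, _| term == query }
         .map { |term, vec| [term, cosine_similarity(vectors[query], vec)] }
         .sort_by { |_, similarity| -similarity }
         .first(limit)
end

nearest_terms(VECTORS, 'car').each do |term, similarity|
  puts format('%-10s %.2f', term, similarity)
end
```

The terms at the top of the ranking are the synonym candidates. In practice you would probably want a similarity threshold as well, since the nearest neighbour of a word is not necessarily a synonym, only a word used in similar contexts.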

Hope this makes things clearer.
