开发者

Calculate TF-IDF using Sql

开发者 https://www.devze.com 2023-01-10 04:09 出处:网络
I have a table in my DB containning a free text field column. I would like to know the frequency each word appears over all the rows, or maybe even calc a TF-IDF for all words, where my documents are

I have a table in my DB containning a free text field column.

I would like to know the frequency each word appears over all the rows, or maybe even calc a TF-IDF for all words, where my documents are that field's values per row.

Is it possible to calculate this using an Sql Query? if not or there's a simpler way could you please direct m开发者_如何学Goe to it?

Many Thanks,

Jon


In SQL Server 2008 depending on your needs you could apply full text indexing to the column then query the sys.dm_fts_index_keywords and sys.dm_fts_index_keywords_by_document table valued functions to get the occurrence count.

Edit: Actually even without creating a persistent full text index you can still leverage the parser

WITH testTable AS
(
SELECT 1 AS Id, N'how now brown cow' AS txt UNION ALL
SELECT 2, N'she sells sea shells upon the sea shore' UNION ALL
SELECT 3, N'red lorry yellow lorry' UNION ALL
SELECT 4, N'the quick brown fox jumped over the lazy dog'
)

SELECT display_term, COUNT(*) As Cnt
FROM testTable
CROSS APPLY sys.dm_fts_parser('"' + REPLACE(txt,'"','""') + '"', 1033, 0,0)
WHERE TXT IS NOT NULL
GROUP BY display_term
HAVING COUNT(*) > 1
ORDER BY Cnt DESC

Returns

display_term                   Cnt
------------------------------ -----------
the                            3
brown                          2
lorry                          2
sea                            2


Solution for SQL Server 2008:

here is the table:

CREATE TABLE MyTable (id INT, txt VARCHAR(MAX));

here is SQL query:

SELECT sum(case when TSplitted.txt_word = 'searched' then 1 else 0 end) as cnt_searched
     , count(*) as cnt_all
FROM MyTable MYT 
INNER JOIN Fn_Split(MYT.id,' ',MYT.txt) TSplitted on MYT.id=TSplitted.id

here is table valued function Fn_Split(@id int, @separator VARCHAR(32), @string VARCHAR(MAX)) (taken from here):

CREATE FUNCTION Fn_Split (@id int, @separator VARCHAR(32), @string VARCHAR(MAX))

RETURNS @t TABLE
    (
        ret_id INT
       ,txt_word VARCHAR(MAX)
    )   
AS
    BEGIN
        DECLARE @xml XML
        SET @XML = N'<root><r>' + REPLACE(@s, @separator, '</r><r>') + '</r></root>'

        INSERT INTO @t(ret_id, val)
        SELECT @id, r.value('.','VARCHAR(5)') as Item
        FROM @xml.nodes('//root/r') AS RECORDS(r)

        RETURN
    END
0

精彩评论

暂无评论...
验证码 换一张
取 消