开发者

Suggestion for a tag cloud algorithm

开发者 https://www.devze.com 2023-01-11 14:26 出处:网络
I have a MSSQL 2005 table: [Companies]( [CompanyID] [int] IDENTITY(1,1) NOT NULL, [Title] [nvarchar](128),

I have a MSSQL 2005 table:

[Companies](
    [CompanyID] [int] IDENTITY(1,1) NOT NULL,
    [Title] [nvarchar](128),
    [开发者_高级运维Description] [nvarchar](256),
    [Keywords] [nvarchar](256)
)

I want to generate a tag cloud for this companies. But I've saved all keywords in one column separated by commas. Any suggestions for how to generate tag cloud by most used keywords. There could be millions of companies approx ten keywords per company.

Thank you.


Step 1: separate the keywords into a proper relation (table).

CREATE TABLE Keywords (KeywordID int IDENTITY(1,1) NOT NULL
  , Keyword NVARCHAR(256)
  , constraint KeywordsPK primary key (KeywordID)
  , constraint KeywordsUnique unique (Keyword));

Step 2: Map the many-to-many relation between companies and tags into a separate table, like all many-to-many relations:

CREATE TABLE CompanyKeywords (
   CompanyID int not null
   , KeywordID int not null
   , constraint CompanyKeywords primary key (KeywordID, CompanyID)
   , constraint CompanyKeyword_FK_Companies
      foreign key (CompanyID)
      references Companies(CompanyID)
   , constraint CompanyKeyword_FK_Keywords
      foreign key (KeywordID)
      references Keywords (KeywordID));

Step 3: Use a simple GROUP BY query to generate the 'cloud' (by example taking the 'cloud' to mean the most common 100 tags):

with cte as (
SELECT TOP 100 KeywordID, count(*) as Count
FROM CompanyKeywords
group by KeywordID
order by count(*) desc)
select k.Keyword, c.Count
from cte c
join Keyword k on c.KeywordID = k.KeywordID;

Step 4: cache the result as it changes seldom and it computes expensively.


I'd much rather see your design normalized as suggested by Remus, but if you're at a point where you can't change your design...

You can use a parsing function (the example I'll use is taken from here), to parse your keywords and count them.

CREATE FUNCTION [dbo].[fnParseStringTSQL] (@string NVARCHAR(MAX),@separator NCHAR(1))
RETURNS @parsedString TABLE (string NVARCHAR(MAX))
AS 
BEGIN
   DECLARE @position int
   SET @position = 1
   SET @string = @string + @separator
   WHILE charindex(@separator,@string,@position) <> 0
      BEGIN
         INSERT into @parsedString
         SELECT substring(@string, @position, charindex(@separator,@string,@position) - @position)
         SET @position = charindex(@separator,@string,@position) + 1
      END
     RETURN
END
go

create table MyTest (
    id int identity,
    keywords nvarchar(256)
)

insert into MyTest
    (keywords)
    select 'sql server,oracle,db2'
    union
    select 'sql server,oracle'
    union
    select 'sql server'

select k.string, COUNT(*) as count
    from MyTest mt
        cross apply dbo.fnParseStringTSQL(mt.keywords,',') k
    group by k.string
    order by count desc

drop function dbo.fnParseStringTSQL
drop table MyTest


Both Remus and Joe are correct but yes as what Joe said if you dont have a choice then you have to live with it. I think I can offer you an easy solution by using an XML Data Type. You can already easily view the parsed column by doing this query

WITH myCommonTblExp AS (
    SELECT CompanyID,
    CAST('<I>' + REPLACE(Keywords, ',', '</I><I>') + '</I>' AS XML) AS Keywords
    FROM Companies
)
SELECT CompanyID, RTRIM(LTRIM(ExtractedCompanyCode.X.value('.', 'VARCHAR(256)'))) AS Keywords
FROM myCommonTblExp
CROSS APPLY Keywords.nodes('//I') ExtractedCompanyCode(X)

now knowing that you can do that, all you have to do is to group them and count, but you cannot group XML methods so my suggestion is create a view of the query above

CREATE VIEW [dbo].[DissectedKeywords]
AS
WITH myCommonTblExp AS (
    SELECT 
    CAST('<I>' + REPLACE(Keywords, ',', '</I><I>') + '</I>' AS XML) AS Keywords
    FROM Companies
)
SELECT RTRIM(LTRIM(ExtractedCompanyCode.X.value('.', 'VARCHAR(256)'))) AS Keywords
FROM myCommonTblExp
CROSS APPLY Keywords.nodes('//I') ExtractedCompanyCode(X)
GO

and perform your count on that view

SELECT Keywords, COUNT(*) AS KeyWordCount FROM DissectedKeywords
GROUP BY Keywords
ORDER BY Keywords

Anyways here is the full article -->http://anyrest.wordpress.com/2010/08/13/converting-parsing-delimited-string-column-in-sql-to-rows/

0

精彩评论

暂无评论...
验证码 换一张
取 消

关注公众号