开发者

Sql Server 2008 - Line Break and fulltext queries

开发者 https://www.devze.com 2023-03-28 09:13 出处:网络
I have a database and a fulltext indexed table. Lets call this table test. This table has one field called testfield. Now, lets insert only one record as follow:

I have a database and a fulltext indexed table. Lets call this table test. This table has one field called testfield. Now, lets insert only one record as follow:

insert into test values ('word' + Char(13) + 'test')

T开发者_JS百科his query inserts a word with a LINE_BREAK.

Now, lets query this table using fulltext:

select * from test where contains(testfield, '"word test"')

In that case, this query returns nothing at all.

Also:

select * from test where contains(testfield, '"wordtest"')

Returns nothing (that was expected do be this way)

Now, lets query the table again modifying the search word:

select * from test where contains(testfield, '"word' + Char(13) + 'test"')

In that case the query returns the correct line.

The question is: Why does this happen? I mean, SQL should ignore line breaks when searching for a word and it does not. I do not believe that this is the default behavior of a fulltext engine. At least it is not acceptable. What happens if my users insert lines with line breaks on my table (and this is usual, since they can write anything they want because the field is an memo field)?

Is there any way of correcting this?

EDIT

It only happens when I choose brazilian as the language for FT. If I choose english, none of the problems I mentioned happens.

EDIT

On Sql Server Denali CTP3 neither english nor portuguese works. Maybe this is bug in the english steamer.

EDIT

Screenshot for usage of fts_parser


Try:

select * from test where contains(testfield, 'word NEAR test')

Looking at the CONTAINS documentation, it seems that you'd need the additional proximity term to find the match past the Char(13).

Edit: If using a proximity term isn't enough, try adding the character to the "noise words" list so that it gets ignored in indexing. You can find the noise word files in $SQL_Server_Install_Path\Microsoft SQL Server\MSSQL.1\MSSQL\FTData. The English file is named noiseENG.txt. You can find more about this detail in Robert Sheldon's article about full-text indexing.

Edit: I am also doing some research on the difference between CHAR(13) and NCHAR(13) as there might be a unicode vs. non-unicode problem you're seeing in your original testing.

Edit: Additionally, in 2008 Portugese got new word breakers. Word breakers are language specific and likely why you may see this difference. It also matters what language was chosen for the indexed column.

Edit: To check how words are being broken, try running this (if you have access to master):

declare @english nvarchar(20), @portugese nvarchar(20)
set @english = N'"hello' + NCHAR(13) + N'world"'
set @portugese = N'"Olá' + NCHAR(13) + N'mundo"'

select * from sys.dm_fts_parser (@english, 1033, 0, 0)
select * from sys.dm_fts_parser (@portugese, 2070, 0, 0)

Edit: This breaks the phrase "hello[13]world" in English and in Portugese. Below is the screenshot of the results, which are as I would think they would be. The CHAR(13) is being treated slightly differently.

Sql Server 2008 - Line Break and fulltext queries


I have tried to reproduce your so called "bug" on my SQL Server 2008 (please see code below). As a result it works exactly as you expected. I am surprised you did not provide with the complete code in the first place.

UPDATE: Changed language of full index to Portuguese and received the same result. It shows "bug-like behaviour". UPDATE:

Why does this happen?

MS introduced new word breakers in SQL Server 2008 and Portuguese is one of them.

I mean, SQL should ignore line breaks when searching for a word and it does not. I do not believe that this is the default behavior of a fulltext engine. At least it is not acceptable.

UPDATE: It looks like this behaviour can be seen in all new breakers.

 -- Create auxiliary table to test languages
IF OBJECTPROPERTY(object_id('test_languages'), 'IsUserTable') IS NOT NULL
DROP TABLE test_languages;
GO

CREATE TABLE test_languages
(
componenttype varchar(30),
componentname int,
clsid uniqueidentifier,
fullpath varchar(2000),
version_no varchar(50),
manufacturer varchar(50)
);

-- Populate Auxiliary table 
INSERT INTO test_languages
EXEC ('exec sp_help_fulltext_system_components ''wordbreaker''');


-- Create Cursor and check how languages work with sys.dm_fts_parser 
  DECLARE MY_CURSOR CURSOR FOR
  select 
  componentname
  from test_languages 
  INNER JOIN 
  sys.fulltext_languages
  on sys.fulltext_languages.lcid=test_languages.componentname
  ORDER BY name

  DECLARE @RESULT varchar(max)
  DECLARE @test_var NVARCHAR(20)
  SET @test_var='"word' + CHAR(13) + 'test"'   
  DECLARE @componentname int

  OPEN MY_CURSOR

  FETCH NEXT FROM MY_CURSOR
    INTO @componentname

WHILE (@@FETCH_STATUS = 0)
 BEGIN

    SELECT name+ ' - '+ 
    CASE test_languages.version_no 
  WHEN '6.0.6001.18000' 
  THEN 'Updated Language 6.0.6001.18000'
  WHEN '12.0.9735.0'
  THEN 'Old Language 12.0.9735.0'
  WHEN '12.0.6828.0'
  THEN 'Old Language 12.0.6828.0'
  END
  AS [Language_Type]

 FROM test_languages 
  INNER JOIN 
  sys.fulltext_languages
  ON sys.fulltext_languages.lcid=test_languages.componentname
  WHERE lcid=@componentname
    SELECT display_term
    FROM sys.dm_fts_parser (@test_var, @componentname, 0, 0)



    FETCH NEXT FROM MY_CURSOR
    INTO @componentname

 END;

-- clean up the mess 
CLOSE MY_CURSOR
DEALLOCATE MY_CURSOR; 
DROP TABLE test_languages;

It is up to MS to decide if this is acceptable. You can submit a ticket with support services though.

What happens if my users insert lines with line breaks on my table (and this is usual, since they can write anything they want because the field is an memo field)?

WYSIWYG

Is there any way of correcting this?

Maybe you will need to upload your own word breaker or downgrade to SQL Server 2005. Goodluck anyway!

        USE
        master
        GO

        CREATE
        DATABASE Test_Brazil -- Portuguese (Brazil)
        COLLATE
        Latin1_General_100_CI_AI

        USE
        Test_Brazil
        GO

        CREATE
        TABLE [dbo].[test](
        [test] [varchar] (100) NOT NULL
        )
        ON [PRIMARY]

        CREATE UNIQUE NONCLUSTERED INDEX [test] ON [dbo].[test] 
        (
        [test]
        ASC
        )
        WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 100) ON [PRIMARY]
        GO

        CREATE
        FULLTEXT CATALOG ft AS DEFAULT;

        CREATE
        FULLTEXT INDEX ON test(test)
        KEY INDEX test
        WITH STOPLIST = SYSTEM;

        sp_fulltext_database 'enable'


        insert into test values ('word' + Char(9) + 'test') --Tab    
        insert into test values ('word' + Char(10) + 'test') -- Line feed     
        insert into test values ('word' + Char(13) + 'test') -- Carriage return  
        insert into test values ('word test')
        -- wait 3 sec

        select test from test where contains(test, '"word test"')

        -- 4 rows returned

        DROP FULLTEXT INDEX ON dbo.test
        CREATE
        FULLTEXT INDEX ON dbo.test
        (
         test Language 2070 -- Portuguese
        )
        KEY INDEX test;


        -- 2 rows returned
        select * from test where contains(test, '"word test"')
0

精彩评论

暂无评论...
验证码 换一张
取 消

关注公众号