I have a database and a fulltext indexed table. Lets call this table test
. This table has one field called testfield
. Now, lets insert only one record as follow:
insert into test values ('word' + Char(13) + 'test')
T开发者_JS百科his query inserts a word with a LINE_BREAK
.
Now, lets query this table using fulltext:
select * from test where contains(testfield, '"word test"')
In that case, this query returns nothing at all.
Also:
select * from test where contains(testfield, '"wordtest"')
Returns nothing (that was expected do be this way)
Now, lets query the table again modifying the search word:
select * from test where contains(testfield, '"word' + Char(13) + 'test"')
In that case the query returns the correct line.
The question is: Why does this happen? I mean, SQL should ignore line breaks when searching for a word and it does not. I do not believe that this is the default behavior of a fulltext engine. At least it is not acceptable. What happens if my users insert lines with line breaks on my table (and this is usual, since they can write anything they want because the field is an memo field)?
Is there any way of correcting this?
EDIT
It only happens when I choose brazilian as the language for FT. If I choose english, none of the problems I mentioned happens.
EDIT
On Sql Server Denali CTP3 neither english nor portuguese works. Maybe this is bug in the english steamer.
EDIT
Screenshot for usage of fts_parser
Try:
select * from test where contains(testfield, 'word NEAR test')
Looking at the CONTAINS
documentation, it seems that you'd need the additional proximity term to find the match past the Char(13)
.
Edit: If using a proximity term isn't enough, try adding the character to the "noise words" list so that it gets ignored in indexing. You can find the noise word files in $SQL_Server_Install_Path\Microsoft SQL Server\MSSQL.1\MSSQL\FTData
. The English file is named noiseENG.txt
. You can find more about this detail in Robert Sheldon's article about full-text indexing.
Edit: I am also doing some research on the difference between CHAR(13) and NCHAR(13) as there might be a unicode vs. non-unicode problem you're seeing in your original testing.
Edit: Additionally, in 2008 Portugese got new word breakers. Word breakers are language specific and likely why you may see this difference. It also matters what language was chosen for the indexed column.
Edit: To check how words are being broken, try running this (if you have access to master):
declare @english nvarchar(20), @portugese nvarchar(20)
set @english = N'"hello' + NCHAR(13) + N'world"'
set @portugese = N'"Olá' + NCHAR(13) + N'mundo"'
select * from sys.dm_fts_parser (@english, 1033, 0, 0)
select * from sys.dm_fts_parser (@portugese, 2070, 0, 0)
Edit: This breaks the phrase "hello[13]world" in English and in Portugese. Below is the screenshot of the results, which are as I would think they would be. The CHAR(13)
is being treated slightly differently.
I have tried to reproduce your so called "bug" on my SQL Server 2008 (please see code below). As a result it works exactly as you expected. I am surprised you did not provide with the complete code in the first place.
UPDATE: Changed language of full index to Portuguese and received the same result. It shows "bug-like behaviour". UPDATE:
Why does this happen?
MS introduced new word breakers in SQL Server 2008 and Portuguese is one of them.
I mean, SQL should ignore line breaks when searching for a word and it does not. I do not believe that this is the default behavior of a fulltext engine. At least it is not acceptable.
UPDATE: It looks like this behaviour can be seen in all new breakers.
-- Create auxiliary table to test languages
IF OBJECTPROPERTY(object_id('test_languages'), 'IsUserTable') IS NOT NULL
DROP TABLE test_languages;
GO
CREATE TABLE test_languages
(
componenttype varchar(30),
componentname int,
clsid uniqueidentifier,
fullpath varchar(2000),
version_no varchar(50),
manufacturer varchar(50)
);
-- Populate Auxiliary table
INSERT INTO test_languages
EXEC ('exec sp_help_fulltext_system_components ''wordbreaker''');
-- Create Cursor and check how languages work with sys.dm_fts_parser
DECLARE MY_CURSOR CURSOR FOR
select
componentname
from test_languages
INNER JOIN
sys.fulltext_languages
on sys.fulltext_languages.lcid=test_languages.componentname
ORDER BY name
DECLARE @RESULT varchar(max)
DECLARE @test_var NVARCHAR(20)
SET @test_var='"word' + CHAR(13) + 'test"'
DECLARE @componentname int
OPEN MY_CURSOR
FETCH NEXT FROM MY_CURSOR
INTO @componentname
WHILE (@@FETCH_STATUS = 0)
BEGIN
SELECT name+ ' - '+
CASE test_languages.version_no
WHEN '6.0.6001.18000'
THEN 'Updated Language 6.0.6001.18000'
WHEN '12.0.9735.0'
THEN 'Old Language 12.0.9735.0'
WHEN '12.0.6828.0'
THEN 'Old Language 12.0.6828.0'
END
AS [Language_Type]
FROM test_languages
INNER JOIN
sys.fulltext_languages
ON sys.fulltext_languages.lcid=test_languages.componentname
WHERE lcid=@componentname
SELECT display_term
FROM sys.dm_fts_parser (@test_var, @componentname, 0, 0)
FETCH NEXT FROM MY_CURSOR
INTO @componentname
END;
-- clean up the mess
CLOSE MY_CURSOR
DEALLOCATE MY_CURSOR;
DROP TABLE test_languages;
It is up to MS to decide if this is acceptable. You can submit a ticket with support services though.
What happens if my users insert lines with line breaks on my table (and this is usual, since they can write anything they want because the field is an memo field)?
WYSIWYG
Is there any way of correcting this?
Maybe you will need to upload your own word breaker or downgrade to SQL Server 2005. Goodluck anyway!
USE
master
GO
CREATE
DATABASE Test_Brazil -- Portuguese (Brazil)
COLLATE
Latin1_General_100_CI_AI
USE
Test_Brazil
GO
CREATE
TABLE [dbo].[test](
[test] [varchar] (100) NOT NULL
)
ON [PRIMARY]
CREATE UNIQUE NONCLUSTERED INDEX [test] ON [dbo].[test]
(
[test]
ASC
)
WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 100) ON [PRIMARY]
GO
CREATE
FULLTEXT CATALOG ft AS DEFAULT;
CREATE
FULLTEXT INDEX ON test(test)
KEY INDEX test
WITH STOPLIST = SYSTEM;
sp_fulltext_database 'enable'
insert into test values ('word' + Char(9) + 'test') --Tab
insert into test values ('word' + Char(10) + 'test') -- Line feed
insert into test values ('word' + Char(13) + 'test') -- Carriage return
insert into test values ('word test')
-- wait 3 sec
select test from test where contains(test, '"word test"')
-- 4 rows returned
DROP FULLTEXT INDEX ON dbo.test
CREATE
FULLTEXT INDEX ON dbo.test
(
test Language 2070 -- Portuguese
)
KEY INDEX test;
-- 2 rows returned
select * from test where contains(test, '"word test"')
精彩评论