T-SQL and wildcard string matching performance on large table_问答_开发者

I've got a bit of a problem with T-SQL, I'm basically in a position where I have to run a whole bunch of queries like:

SELECT TOP 30 * FROM [table]
    WHERE firstfield IS NOT NULL
    AND secondfield IS NOT NULL
    AND (firstfield LIKE '%substring%' OR secondfield LIKE '%substring%')

This is being run from a C#.NET program with the timeout on the SqlCommand set to 90 seconds and for most queries (that search for terms that are rare or non-existent) it hits the timeout.

The problem with this is of course that it's extremely slow but the data has to be fetched. Once fetched all the returned firstfields and secondfields are just concatenated together (this larger string is stored and analyzed by the program).

I'm currently running this on my work laptop as the production server that has this data slows down too much when running thousands of these queries. When running on the production server I get results back about 30% of the time, on my laptop (with only 4 GiB of RAM and a regular 2.5" spinning rust hard drive) it's closer to 10%.

So, is there any way for me to either speed this up or is my best bet to just up the timeout and leave it running for weeks until it is finished? I've considered just building a data fetcher program that just dumps the results into a separate table along the lines of:

CREATE TABLE(
    id INT NOT NULL PRIMARY KEY IDENTITY,
    keyword VARCHAR(开发者_运维百科255) NOT NULL,
    returneddata VARCHAR(MAX), -- Or possibly a TEXT field.
)

That way I could leave the fetcher program running and just have my main program query this table. But honestly, I'd rather just figure out a way of making the queries run faster (or another way of making the queries). I just really hope I'm being a dummy and that I've missed out on some obvious way to make this run faster...

How quick does it need to be? There's a couple of options: diddle around with indexes to improve disk i/o (which won't give you the best performance as normal indexes aren't optimised for sub-string searched)... or use full-text search (a bit more work to set-up, but will give you a huge performance boost as they are optimised for subn-string searches). If this is an operation that that's going to happen regularly, then you'll seriously want to use full-text search as anything else is just storing trouble up further down the line. If it's just for a one-off job, then you might get away with basic indexes (it depends what other constraints you've got), but that's by no means guranteed... however I'll cover both here just for completeness.

First option... have you got any indexes against these columns table? If you haven't got appropriate indexes then it will be a lot slower than it needs to be as it has to fetch the data straight out of the data page. If not, then there's two options: create a composite index on (firstfield, secondfield) then see if that does anything.

Alternatively create two indexes, one on (firstfield), the other on (secondfield) and split your query into two queries, one that queries firstfield only, the other that queries secondfield only.

However neither of these will be blindingly quick as normal indexes are not optimised for substring searches: any speed-up will simply be due to improved disk i/o as SQL Servfer won't need to read as much data from disk to find matching rows (it'll just read index pages rather than the full data page ... however it still needs to do the same amount of work once it's got the data off disk to then find the rows).

If you do need it to be lightning quick, then you're better off investigating Full Text Search, which will then allow you to use the CONTAINS keyword instead. This will allow you to do very quick sub-string searches.

You can consider enabling full text and then use freetext in your query instead of wildcard http://msdn.microsoft.com/en-us/library/ms176078.aspx

If you have a wildcard on both sides of the string (e.g. %x% not simply x%) then SQL Server cannot use any statistics to optimise the query.

Full text search is a likely solution, but I notice the LIKE does not run afoul of the documented limitations on creating indexed views. You could create a view and index it for fairly fast results.

I would strongly advise using the second table approach.

You can run the "fetcher" program as a background task/overnight etc.

Although you may be able to optimise the above query, it will never be as performant as the key/value approach you mention, providing of course you are able to index the 2nd table appropriately.

Use of a 2nd table, is very similar to denormalising which is a standard practice for improving responses to user requests or reporting, so what you suggest is not in any way a hack.

At the end of the day, your users will appreciate the faster response times.

Since both fields are varchar(800), you can do the following:

SELECT TOP 30 * FROM [table]
    WHERE (
        coalesce(firstfield, '') LIKE '%substring%' OR
        coalesce(secondfield, '') LIKE '%substring%')

This may not help much regarding performance, but I would try both and have a look at the execution plan of the queries. That should tell you what gets used in terms of indices and the like, I've always found that very helpful.