I have a database table with HTML snippets (not whole documents) in a column and I need to do some basic HTML validation of the contents. My initial need is just to run a one-time query and validation report, nothing more complicated than that.
I would suggest using Regex -
http://msdn.microsoft.com/en-us/magazine/cc163473.aspx
Example -
select dbo.RegexMatch( N'123-45-6789', N'^\d{3}-\d{2}-\d{4}$' )
Or strictly T-SQL -
http://blogs.msdn.com/b/khen1234/archive/2005/05/11/416392.aspx
However, CLR User-Defined Functions are probably the way to go.
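SQL Server does have some XML validation capabilities built in for a field of type XML. HTML in general is not XML, but well-formed XHTML is, so if your snippets are close to XHTML you might be able to twist that functionality to make SQL Server do the work for you. Expect false failures on ordinary HTML such as unclosed `<br>` tags or entities like `&nbsp;`.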
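As a rough illustration of what an XML-style well-formedness check accepts and rejects, here is a minimal sketch in Python rather than in SQL Server. It uses the standard library's `xml.etree.ElementTree`; the function name `is_well_formed_fragment` and the dummy `<root>` wrapper are my own assumptions, added only because a snippet may have several top-level elements.

```python
import xml.etree.ElementTree as ET


def is_well_formed_fragment(snippet: str) -> bool:
    """Return True if the snippet parses as XML once wrapped in a dummy root.

    The <root> wrapper is an illustrative assumption: XML requires a single
    top-level element, and a stored snippet may contain several.
    """
    try:
        ET.fromstring(f"<root>{snippet}</root>")
        return True
    except ET.ParseError:
        # Raised for mismatched tags, unclosed elements, undefined entities, etc.
        return False


if __name__ == "__main__":
    for sample in ["<p>ok</p><p>two</p>", "<p>line<br>break</p>", "<p>a&nbsp;b</p>"]:
        print(repr(sample), is_well_formed_fragment(sample))
```

The second and third samples are perfectly ordinary HTML, yet an XML parser rejects them, which is the main limitation of this approach.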
I read Jeff's post here http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html and realized that I need to use a real parser after all.
It looks like http://tidy.sourceforge.net/ will get me what I need; I'll just have to write an ugly script that goes row by row and shells out to Tidy.
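A minimal sketch of such a row-by-row script, in Python, might look like the following. It assumes the `tidy` executable is on the PATH and that the rows have already been fetched as `(id, html)` pairs by whatever database driver you use; the names `run_tidy`, `validate_rows`, and `min_status` are hypothetical. It relies on Tidy's exit status (0 = clean, 1 = warnings, 2 = errors) and on diagnostics being written to stderr.

```python
import subprocess
from typing import Callable, Iterable, Iterator, Tuple


def run_tidy(html: str) -> Tuple[int, str]:
    """Pipe one snippet through the tidy CLI; return (exit status, diagnostics)."""
    proc = subprocess.run(
        ["tidy", "-quiet", "-errors"],  # report problems only, no cleaned markup
        input=html,
        capture_output=True,
        text=True,
    )
    return proc.returncode, proc.stderr


def validate_rows(
    rows: Iterable[Tuple[int, str]],
    run: Callable[[str], Tuple[int, str]] = run_tidy,
    min_status: int = 2,
) -> Iterator[Tuple[int, int, str]]:
    """Yield (row id, status, diagnostics) for rows at or above min_status."""
    for row_id, html in rows:
        status, diagnostics = run(html)
        if status >= min_status:
            yield row_id, status, diagnostics


if __name__ == "__main__":
    # Stand-in for the result of a SELECT id, html FROM ... query.
    sample_rows = [(1, "<p>fine</p>"), (2, "<p>unclosed <b>bold</p>")]
    for row_id, status, diagnostics in validate_rows(sample_rows, min_status=1):
        print(row_id, status, diagnostics.strip())
```

Because the column holds snippets rather than whole documents, Tidy will likely warn about the missing doctype and title on every row, so the default here only reports status 2 (errors); lower `min_status` to 1 if you want the warnings as well.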