So I've been given the task of going through our mailing list and "cleaning" the emails which are slightly incorrect (trailing symbols, 开发者_如何学运维random symbols accidentally in the middle of the email, syntactically invalid, etc.) and to delete the entries which are totally bogus. The problem is I'm finding some of the emails in the database have already been syntactically corrected but the original entries still exist. There are over 2000 emails which my query is returning as incorrect (it seems past cleanings encountered upwards of 10,000 bad emails!) and going through them one at a time is no good. Does anyone have any suggestions for how to go about this?
Here's the query I used to return the bad emails:
SELECT id,email
FROM table
WHERE NOT
(
CHARINDEX(' ',LTRIM(RTRIM([email]))) = 0
AND LEFT(LTRIM([email]),1) <> '@'
AND RIGHT(RTRIM([email]),1) <> '.'
AND CHARINDEX('.',[Email],CHARINDEX('@',[email])) - CHARINDEX('@',[email]) > 1
AND LEN(LTRIM(RTRIM([email]))) - LEN(REPLACE(LTRIM(RTRIM([email])),'@','')) = 1
AND CHARINDEX('.',REVERSE(LTRIM(RTRIM([email])))) >= 3
AND (CHARINDEX('.@',[email]) = 0 AND CHARINDEX('..',[email]) = 0)
)
or id in (select id from table where email like '%[+;(,!]%')
This question has already been asked (and somewhat answered) on Stack Overflow - T-SQL: checking for email format
I suggest using https://www.rfc-editor.org/rfc/rfc3696 to come up with a solution.
Also, don't assume the second level domain is always a max of four characters - In New Zealand, our longest second level domain is ".parliament.nz". I use a ".geek.nz" second level domain and are constantly having websites tell me it's an invalid email address.
Wikipedias article, http://en.wikipedia.org/wiki/Email_address#Valid_email_addresses is also informative - The following are all valid email addresses
- niceandsimple@example.com
- a.little.unusual@example.com
- much."more\ unusual"@example.com
- very.unusual."@".unusual.com@example.com
- very."(),:;<>[]".VERY."very\\ @"very".unusual@cool.example.com
精彩评论