I have a block list of names/words with 500,000+ entries. The data is used to prevent people from entering these words as their username or name. The table structure is simple: word_id, word, create_date.
When the user clicks submit, I want the system to look up whether the entered name is an exact match or a word% match.
Is this the only way to implement a block or is there a better way? I don't like the idea of doing lookups of this many rows on a submit as it slows down the submit process.
Consider a few points:
Keep your blacklist (business logic) checking in your application, and perform the comparison in your application. That's where it most belongs, and you'll likely have richer programming languages to implement that logic.
Load your half million records into your application, and store it in a cache of some kind. On each signup, perform your check against the cache. This will avoid hitting your table on each signup. It'll be all in-memory in your application, and will be much more performant.
Ensure myEnteredUserName doesn't have a blacklisted word at the beginning, end, or anywhere in between. Your question specifically had a begins-with check, but ensure that you don't miss cases like 123_BadWord999.
Caching brings its own set of new challenges; consider reloading from the database every n minutes, or at a certain time or event. This will allow new blacklisted words to be loaded, and old ones to be thrown out.
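The points above could be sketched roughly as follows in Python. This is only an illustration under assumptions: BlocklistCache, load_words, and ttl_seconds are hypothetical names, and load_words stands in for whatever query reads the word column from your table. Rather than scanning all half million words per signup, it looks up each substring of the entered name in an in-memory set, which also catches a blocked word in the middle of the name.

```python
import time


class BlocklistCache:
    """In-memory blocklist that is reloaded once it is older than ttl_seconds."""

    def __init__(self, load_words, ttl_seconds=600.0, clock=time.monotonic):
        # load_words is a hypothetical callable returning an iterable of words,
        # e.g. a function that runs "SELECT word FROM ..." against your table.
        self._load_words = load_words
        self._ttl = ttl_seconds
        self._clock = clock
        self._words = frozenset()
        self._min_len = 0
        self._max_len = 0
        self._loaded_at = None

    def _refresh_if_stale(self):
        now = self._clock()
        if self._loaded_at is not None and now - self._loaded_at < self._ttl:
            return
        words = frozenset(w.lower() for w in self._load_words() if w)
        self._words = words
        self._min_len = min(map(len, words), default=0)
        self._max_len = max(map(len, words), default=0)
        self._loaded_at = now

    def is_blocked(self, name):
        """True if any blocked word occurs anywhere in name (case-insensitive)."""
        self._refresh_if_stale()
        if not self._words:
            return False
        name = name.lower()
        for start in range(len(name)):
            longest = min(self._max_len, len(name) - start)
            # Only substrings as long as some blocked word can possibly match.
            for length in range(self._min_len, longest + 1):
                if name[start:start + length] in self._words:
                    return True
        return False
```

The cost per signup depends on the length of the name and the spread of blocked-word lengths, not on the number of blocked words. For a begins-with check only, the outer loop would run for start = 0 alone.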
You can't do where 'loginName' = word%. The % wildcard only has meaning inside a LIKE pattern; with = it is not a wildcard, and it can't simply be appended to a column name.
You would need to say where 'logi' = word or 'login' = word or ... where you compare substrings of the login name with the bad words. You'll need to test each substring whose length is between the shortest and longest bad word, inclusive.
Make sure you have an index on the word column of your table, and see what performance is like.
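As a rough sketch of the begins-with comparison, here is one way it might look in Python with the standard sqlite3 module. The table name blocked_words and the function find_blocked_prefix are assumptions for illustration, as is the convention that words are stored in lower case; min_len and max_len would come from something like SELECT MIN(LENGTH(word)), MAX(LENGTH(word)).

```python
import sqlite3


def find_blocked_prefix(conn, login_name, min_len, max_len):
    """Return a blocked word that login_name starts with, or None.

    Assumes a hypothetical table blocked_words(word_id, word, create_date)
    whose word values are stored in lower case.
    """
    login_name = login_name.lower()
    upper = min(max_len, len(login_name))
    # Every prefix whose length could equal the length of some blocked word.
    prefixes = [login_name[:n] for n in range(min_len, upper + 1)]
    if not prefixes:
        return None
    placeholders = ", ".join("?" for _ in prefixes)
    # Equality tests on word, so an index on that column can be used.
    row = conn.execute(
        f"SELECT word FROM blocked_words WHERE word IN ({placeholders}) LIMIT 1",
        prefixes,
    ).fetchone()
    return row[0] if row else None
```

Because the query only compares word for equality against a handful of candidate strings, the database can answer it from the index instead of scanning all the rows. Checking for a bad word anywhere in the name would mean passing every substring in that length range, not just the prefixes.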
Other ways to do this would be:
- Use Lucene, it's good at quickly searching text, especially if you just need to know whether or not your substring exists. Of course Lucene might not fit technically in your environment -- it's a Java library.
- Take a hash of each bad word, and record them in a bitset in memory -- this will be small and fast to look up, and you'll only need to go to the database to make sure that a positive isn't false.
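The hash-and-bitset idea could look something like the Python sketch below. It is a single-hash simplification of a Bloom filter, not a tuned implementation: HashBitset, is_blocked, and confirm_in_database are hypothetical names, and the default of 2^23 bits (1 MiB) is an arbitrary choice.

```python
import hashlib


class HashBitset:
    """One bit per hash bucket; a set bit means 'possibly a blocked word'."""

    def __init__(self, num_bits=1 << 23):
        self._num_bits = num_bits
        self._bits = bytearray((num_bits + 7) // 8)

    def _index(self, word):
        digest = hashlib.sha256(word.lower().encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self._num_bits

    def add(self, word):
        i = self._index(word)
        self._bits[i >> 3] |= 1 << (i & 7)

    def might_contain(self, word):
        i = self._index(word)
        return bool(self._bits[i >> 3] & (1 << (i & 7)))


def is_blocked(word, bitset, confirm_in_database):
    """Consult the database only when the bitset reports a possible hit."""
    if not bitset.might_contain(word):
        return False  # a clear bit is definitive: the word was never added
    # A set bit may be a hash collision, so confirm against the real table.
    return confirm_in_database(word)
```

This covers exact matches only; for prefix or substring checks, each candidate substring of the entered name would be passed through is_blocked in turn.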