Spam detection in (Objective-)C

Developer https://www.devze.com 2022-12-13 10:38 Source: Internet

I'm currently writing an iPhone application which gets some data from the user and uploads it to a server. The uploaded data will be displayed to other users of the same program (there's more to it than that, but to keep the idea simple...). The uploaded data is basically just three strings: a name (max. 50 characters), a title (max. 50 characters) and some text (virtually unlimited length). What I need is a function, service or algorithm which can detect how valid the input is. It would have to detect series of repeated characters, certain 'illegal' words, abnormal whitespace, etc. So my question is: is there a C or Objective-C library (built-in or open source) for this sort of data validation, or else, how would I go about doing this kind of check?

Here are two examples of good and bad data:

GOOD:

Name: "John Aaron Smith"  
Title: "Why am I still here?"  
Text: "Can anybody please help me? I'm feeling lonely!"

BAD:

Name: "f**k you, kldsanfklds"   
Title: "Only $99. Buy Now. Only $99"  
Text: "ndsaklgnvds lakævndsaklæfhadsæhdsjka fhdskjafhdskj lafhsdkhf. €#&/ #&()(/&%& ># €%€#% €#& hidosæahviædshvidshfiodsa. adsifjDSILFJIDSH \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n"

I know taking precautions for so many cases will be difficult, but this algorithm/library would just have to filter the worst spam. I will also be looking through the data before the final database submission, but of course the less spam, the easier I'll have it.

Yours, BEN.

EDIT: My most 'fluent' language is Objective-C, but I'm also doing pretty well with C, and I have knowledge of PHP and Java. Libraries/examples in other languages might be difficult for me to understand, and 'translate' into a valid iPhone language.

EDIT-EDIT: I'm not looking for something overly sophisticated. Just a simple way for me to do the rough cut.

This is a very difficult problem to solve. I would not attempt to create my own spam detection; I would use a solution which already exists and has a good reputation, such as SpamAssassin.

Have you seen Mollom? It has a bunch of developer libraries (PHP, Ruby, Perl, etc.) that communicate with the Mollom servers to determine the spamminess of an entry. It wouldn't be hard to translate one of those to Objective-C.

I've made something similar to what you want, but it's in PHP. All the text I deal with is entered with a captcha, so what I'm blocking is useless comment spam similar to your bad example. Here's what I've got so far, which has been blocking a good 80% of the junk. It may block some valid text from people with bad spelling habits, but I prefer that over manually editing text.

Check that the text is not empty and verify that it's not all spaces.
Check the length. I use a minimum of 3 characters.
Check for series of matching characters, e.g. !!!!!! I allow no more than 3.
Check for words longer than 15 characters, e.g. lakævndsaklæfhadsæhdsjka
Convert a copy of the text to lowercase and run it through a dictionary of bad words.
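
Since the question asks for C, the checks above could be sketched roughly like this. This is only an illustration under a few assumptions: the function names and the two-entry BAD_WORDS list are hypothetical placeholders, lengths and runs are counted in bytes rather than characters (so multi-byte UTF-8 letters such as æ count more than once and runs of them are not detected), and the bad-word check compares case-insensitively instead of lowercasing a copy.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MIN_LENGTH 3   /* minimum text length, as in the list above */
#define MAX_RUN    3   /* longest allowed run of one identical byte */
#define MAX_WORD   15  /* longest allowed word, counted in bytes */

/* Hypothetical placeholder list; a real one would be much longer.
   Entries are assumed to be lowercase. */
static const char *const BAD_WORDS[] = { "viagra", "casino" };

/* True if the text is empty or contains only whitespace. */
static bool is_blank(const char *s) {
    for (; *s; s++)
        if (!isspace((unsigned char)*s)) return false;
    return true;
}

/* True if any byte repeats more than MAX_RUN times in a row. */
static bool has_long_run(const char *s) {
    size_t run = 1;
    if (!*s) return false;
    for (s++; *s; s++) {
        run = (*s == s[-1]) ? run + 1 : 1;
        if (run > MAX_RUN) return true;
    }
    return false;
}

/* True if any whitespace-delimited word is longer than MAX_WORD bytes. */
static bool has_long_word(const char *s) {
    size_t len = 0;
    for (; *s; s++) {
        len = isspace((unsigned char)*s) ? 0 : len + 1;
        if (len > MAX_WORD) return true;
    }
    return false;
}

/* True if word occurs anywhere in text, ignoring ASCII case. */
static bool contains_ci(const char *text, const char *word) {
    size_t n = strlen(word);
    for (; *text; text++) {
        size_t i = 0;
        while (i < n &&
               tolower((unsigned char)text[i]) == (unsigned char)word[i])
            i++;
        if (i == n) return true;
    }
    return false;
}

/* Runs all five checks; returns true if the text should be rejected. */
bool looks_like_spam(const char *text) {
    if (is_blank(text) || strlen(text) < MIN_LENGTH) return true;
    if (has_long_run(text) || has_long_word(text)) return true;
    for (size_t i = 0; i < sizeof BAD_WORDS / sizeof BAD_WORDS[0]; i++)
        if (contains_ci(text, BAD_WORDS[i])) return true;
    return false;
}
```

Calling looks_like_spam() on each of the three strings before uploading would give the rough cut. Note that plain substring matching also flags innocent words that happen to contain a listed word, and that a title like "Only $99. Buy Now. Only $99" passes all five checks, so this does not replace the manual review.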

You could add to this by blocking text with suspicious characters, e.g. %^[]. Additionally, you could compile a list of letter sequences that should never be used next to each other, e.g. fd, gf, kp, yt, vnd. Beyond that point you need to automate by extending the algorithm, which means it has to understand some grammar, and the overall process quickly grows in complexity. Anything else is beyond my comprehension at this point.
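
A minimal C sketch of the suspicious-character and letter-sequence idea, using only the examples given above. The names junk_score and count_ci are hypothetical, and returning a score instead of a yes/no answer is an assumption of this sketch, chosen because sequences like "yt" do occur in ordinary words such as "anything".

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Characters and letter sequences taken from the examples above. */
static const char SUSPICIOUS_CHARS[] = "%^[]";
static const char *const UNLIKELY_SEQS[] = { "fd", "gf", "kp", "yt", "vnd" };

/* Counts occurrences of needle in haystack, ignoring ASCII case.
   needle is assumed to be lowercase. */
static size_t count_ci(const char *haystack, const char *needle) {
    size_t n = strlen(needle), hits = 0;
    for (; *haystack; haystack++) {
        size_t i = 0;
        while (i < n &&
               tolower((unsigned char)haystack[i]) == (unsigned char)needle[i])
            i++;
        if (i == n) hits++;
    }
    return hits;
}

/* Returns a rough junk score: one point per suspicious character and
   one per unlikely letter sequence. The caller picks the threshold. */
size_t junk_score(const char *text) {
    size_t score = 0;
    for (const char *p = text; *p; p++)
        if (strchr(SUSPICIOUS_CHARS, *p)) score++;
    for (size_t i = 0; i < sizeof UNLIKELY_SEQS / sizeof UNLIKELY_SEQS[0]; i++)
        score += count_ci(text, UNLIKELY_SEQS[i]);
    return score;
}
```

A caller might reject text only when the score is high relative to its length, so that a single "anything" or a legitimate percent sign does not block a post.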