
Java Spam Filter

Developer https://www.devze.com 2022-12-28 13:28 Source: web
I'm trying to create a spam filter in Java using the Bayesian algorithm.

I use a text file that contains email messages and split the tokens using regex, storing these values into a hashmap.

My problem is that the regex splits email addresses, so instead of: johnsmith@example.com

the regex produces the tokens: john smith example

The same holds true for IP addresses, so for example, instead of: 192.55.34.322

the regex produces the tokens: 192 55 34 322

So does anybody know of a way that I could read the email messages and store their contents as is?

AMENDMENT: I am using a regex that does not keep IP addresses or email addresses intact. It splits them up.

I was wondering whether regex is the wrong tool for this, and whether anyone could suggest alternatives that would let me filter emails and pick out the characteristics I want.


Find a way to separate the body of the email from the header information before you tokenize.
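
A minimal sketch of that idea, under some assumptions: the header ends at the first blank line (the usual mail convention), and tokens are extracted by matching rather than by splitting on delimiters, so that email addresses and dotted-quad IP addresses are tried before the generic word rule. The class name MailTokenizer, the method names, and the exact patterns are hypothetical and intentionally loose; they are not a complete address grammar.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MailTokenizer {

    // Alternatives are tried left to right at each position, so whole
    // email addresses and dotted quads win over the plain word rule.
    // Assumption: these loose patterns are good enough for feature extraction.
    private static final Pattern TOKEN = Pattern.compile(
            "[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\\.[A-Za-z0-9-]+)+" // email address
          + "|\\d{1,3}(?:\\.\\d{1,3}){3}"                          // IPv4-like dotted quad
          + "|[A-Za-z0-9']+");                                      // ordinary word

    /**
     * Returns the text after the first blank line (the message body).
     * Assumption: if there is no blank line, the whole text is treated as body.
     */
    public static String bodyOf(String rawMessage) {
        String normalized = rawMessage.replace("\r\n", "\n");
        int split = normalized.indexOf("\n\n");
        return split < 0 ? normalized : normalized.substring(split + 2);
    }

    /** Extracts lower-cased tokens, keeping addresses in one piece. */
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group().toLowerCase(Locale.ROOT));
        }
        return tokens;
    }

    /** Counts token occurrences in a HashMap, as the question describes. */
    public static Map<String, Integer> countTokens(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : tokenize(text)) {
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }
}
```

With this sketch, MailTokenizer.tokenize(MailTokenizer.bodyOf(raw)) on a body such as "Mail johnsmith@example.com from 192.55.34.322 now." keeps johnsmith@example.com and 192.55.34.322 as single tokens. Matching tokens with Matcher.find() instead of calling String.split() is the design choice that matters here: split() has to name every delimiter, while find() lets you name the shapes you want to keep.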
