I want to extract the body part of each message from the following chain mail using a regex in PHP. The chain mail is saved in txt format. While extracting, any HTML tags present in the body should be left untouched.
$content = <<<HEREDOC
From: Matrimony <matrimony@mangalsutrabandhan.in>
Sent: Fri, 12 Aug 2011 16:17:40
To: "matrimony@mangalsutrabandhan.com" <matrimony@mangalsutrabandhan.in>
Subject: Re: bride search
From: brides <sales@mangalsutrabandhan.com>
Sent: Fri, 12 Aug 2011 15:49:52
To: "Matrimony " <matrimony@mangalsutrabandhan.in>
Cc: "groom" <brides@mangalsutrabandhan.com>
Subject: Re: bride search
PFA
Regds.,
sales
From: shaadi <kundaali@mangalsutrabandhan.in>
Sent: Tue, 22 Feb 2011 16:40:24
To: <vivaah@mangalsutrabandhan.com>, <bandhan@mangalsutrabandhan.com>
Cc: "'lagna '" <lagna@mangalsutrabandhan.in>, <movies@mangalsutrabandhan.in>, <manishv@mangalsutrabandhan.com>, "'beta data'" <channel@mangalsutrabandhan.com>, "'test S'" <city@mangalsutrabandhan.com>
Subject: Re:data transfer would be made live for 145 test
This is to inform you that we are going to test today.
Activity Timing: 9:00 PM onwards
Thanks and Regards,
free matrimony
shaadi Operations
P Please do not print this e-mail unless it is absolutely necessary
From: shaadi [nikaah:kundaali@mangalsutrabandhan.in]
Sent: 21 February 2011 23:09
To: vivaah@mangalsutrabandhan.com; bandhan@mangalsutrabandhan.com
Cc: 'lagna '; movies@mangalsutrabandhan.in; manishv@mangalsutrabandhan.com;
Subject: data transfer would be made live for 145 test
Hi,
gtsdhsdbh
anbdsmbsa
sda the data test .
Would request you to send in your feedback.
Thanks and Regards,
beta data
assa xyz
P Please do not print this e-mail unless it is absolutely necessary
HEREDOC;
Output:
Array
(
[0] => Array
(
[0] => Re: bride search
[1] => Re: bride search
PFA
Regds.,
sales
[2] => Re:data transfer would be made live for 145 test
This is to inform you that we are going to test today.
Activity Timing: 9:00 PM onwards
Thanks and Regards,
free matrimony
shaadi Operations
P Please do not print this e-mail unless it is absolutely necessary
)
[1] => Array
(
[0] => Re: bride search
[1] => Re: bride search
PFA
Regds.,
sales
[2] => Re:data transfer would be made live for 145 test
This is to inform you that we are going to test today.
Activity Timing: 9:00 PM onwards
Thanks and Regards,
free matrimony
shaadi Operations
P Please do not print this e-mail unless it is absolutely necessary
)
)
The regex I used to get the above output:
preg_match_all('/(?<=Subject: )(.*?[\n][\s]*?)(?=From:)/is',$content,$rest);
But it does not return the last message, because there is no following 'From:' to mark the end of that body. I hope that is clear. Please let me know if there is any other method for this.
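One possible fix, as a minimal sketch: let the lookahead also accept the end of the string (`\z`), so the last message no longer needs a following `From:`. The `$sample` text is a hypothetical shortened stand-in for the chain mail, and anchoring `Subject:` and `From:` to line starts with the `m` flag is an assumption about the file layout.

```php
<?php
// Hypothetical shortened sample standing in for the chain mail.
$sample = "From: a <a@example.com>\n"
        . "Sent: Fri, 12 Aug 2011 16:17:40\n"
        . "To: b@example.com\n"
        . "Subject: first\n"
        . "body one\n"
        . "From: c <c@example.com>\n"
        . "Sent: Fri, 12 Aug 2011 15:49:52\n"
        . "To: d@example.com\n"
        . "Subject: second\n"
        . "body <b>two</b>\n";

// m: ^ matches at every line start; s: . also matches newlines.
// The match stops before the next line starting with "From:" or,
// for the last message, at the very end of the string (\z).
preg_match_all('/(?<=^Subject: )(.*?)(?=^From:|\z)/ms', $sample, $rest);

print_r($rest[1]);
```

As in the original pattern, each match still starts with the subject text, and HTML tags in the body are passed through unchanged.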
preg_match_all('/(?m:^From:\x20(?<From>[^\n]*)\n^Sent:\x20(?<Sent>[^\n]*)\n^To:\x20(?<To>[^\n]*)\n(?:^Cc:\x20(?<Cc>[^\n]*)\n)?^Subject:\x20(?<Subject>[^\n]*)\n)(?<Body>.*?(?=(?:\nFrom:)|$))/s',$content,$matches);
echo "<pre>".print_r($matches,true);
It gives nearly the correct output. Should I provide the text file on http://www.mangalsutrabandhan.com?
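A guess at why the named-group pattern is only nearly correct: when a message has an empty body, the newline after `Subject:` has already been consumed, so the `\nFrom:` lookahead cannot match there and the body swallows the next message. A hedged variant that anchors on `^From:` with the `m` flag and uses `\z` for the end is sketched below. The sample text is hypothetical, and normalising line endings first is an assumption about how the txt file was saved.

```php
<?php
// Hypothetical two-message sample; the first message has an empty body.
$sample = "From: A <a@example.com>\n"
        . "Sent: Fri, 12 Aug 2011 16:17:40\n"
        . "To: b@example.com\n"
        . "Subject: Re: first\n"
        . "From: C <c@example.com>\n"
        . "Sent: Fri, 12 Aug 2011 15:49:52\n"
        . "To: a@example.com\n"
        . "Cc: d@example.com\n"
        . "Subject: second\n"
        . "Hello <b>world</b>\n"
        . "Regards";

// Assumption: the file may use Windows line endings, so normalise first.
$sample = str_replace(["\r\n", "\r"], "\n", $sample);

$pattern = '/^From:\x20(?<From>[^\n]*)\n'
         . 'Sent:\x20(?<Sent>[^\n]*)\n'
         . 'To:\x20(?<To>[^\n]*)\n'
         . '(?:Cc:\x20(?<Cc>[^\n]*)\n)?'          // Cc line is optional
         . 'Subject:\x20(?<Subject>[^\n]*)\n'
         . '(?<Body>.*?)(?=^From:\x20|\z)/ms';    // stop at next header block or end

// PREG_SET_ORDER yields one array per message instead of one per group.
preg_match_all($pattern, $sample, $messages, PREG_SET_ORDER);

foreach ($messages as $m) {
    echo $m['Subject'], ' => ', var_export($m['Body'], true), "\n";
}
```

This still treats any body line that begins with `From: ` as the start of a new message.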
You're going to need much smarter parsing to make sense of this. Whatever produced this file is changing the structure of the emails:
Subject: Re: bride search
PFA
There should be at least one blank line between what appears to be part of an email header and its body.
Then you've got the problems of top-posting (and you can't rely on the timestamps in the headers without knowing the timezones), incomplete headers, and no quoting.
So even if you built a heuristic to parse this, there are too many scenarios that it would not cope with.
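To make the limits concrete, here is a rough sketch of such a heuristic without one large regex: split the text before every line starting with `From: `, then peel the known header lines off the top of each chunk. The function name `splitChain` and the list of header names are hypothetical, chosen only to fit the sample in the question.

```php
<?php
// Hypothetical helper: splits a flattened chain into messages and peels off
// the leading header lines. A body line that itself starts with "From: "
// is misread as a new message, which is one of the failure cases.
function splitChain(string $text): array
{
    $text   = str_replace(["\r\n", "\r"], "\n", $text);
    // Zero-width split before each line that begins with "From: ".
    $chunks = preg_split('/^(?=From:\x20)/m', $text, -1, PREG_SPLIT_NO_EMPTY);
    $known  = ['From', 'Sent', 'To', 'Cc', 'Subject'];
    $result = [];

    foreach ($chunks as $chunk) {
        $lines   = explode("\n", rtrim($chunk, "\n"));
        $headers = [];
        // Consume header lines only while they sit at the top of the chunk
        // and each known header name has not been seen yet.
        while ($lines
            && preg_match('/^(\w+):\s*(.*)$/', $lines[0], $m)
            && in_array($m[1], $known, true)
            && !isset($headers[$m[1]])
        ) {
            $headers[$m[1]] = $m[2];
            array_shift($lines);
        }
        $result[] = ['headers' => $headers, 'body' => implode("\n", $lines)];
    }
    return $result;
}

// Hypothetical sample; the first message has no body at all.
$sample = "From: A <a@example.com>\nSent: Fri, 12 Aug 2011 16:17:40\n"
        . "To: b@example.com\nSubject: Re: first\n"
        . "From: C <c@example.com>\nSent: Fri, 12 Aug 2011 15:49:52\n"
        . "To: a@example.com\nSubject: second\nHello <b>world</b>\nRegards\n";

$parsed = splitChain($sample);
print_r($parsed);
```

It copes with the missing blank line and with empty bodies, but not with quoted headers inside a body, reordered headers, or header lines wrapped across several lines.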