Extract all the email headers, including the body part, from mail in PHP

Developer https://www.devze.com 2023-03-28 16:08 Source: Internet
I want to extract the body part from the following chain mail using a regex in PHP. The chain mail is saved in txt format. While extracting, any HTML tags present in the body should be left untouched.

 $content = <<<HEREDOC

    From: Matrimony <matrimony@mangalsutrabandhan.in>
    Sent: Fri, 12 Aug 2011 16:17:40
    To: "matrimony@mangalsutrabandhan.com" <matrimony@mangalsutrabandhan.in>
    Subject: Re: bride search


    From: brides <sales@mangalsutrabandhan.com>
    Sent: Fri, 12 Aug 2011 15:49:52
    To: "Matrimony " <matrimony@mangalsutrabandhan.in>
    Cc: "groom" <brides@mangalsutrabandhan.com>
    Subject: Re: bride search
    PFA

    Regds.,
    sales


    From: shaadi <kundaali@mangalsutrabandhan.in>
    Sent: Tue, 22 Feb 2011 16:40:24
    To: <vivaah@mangalsutrabandhan.com>, <bandhan@mangalsutrabandhan.com>
    Cc: "'lagna '" <lagna@mangalsutrabandhan.in>, <movies@mangalsutrabandhan.in>, <manishv@mangalsutrabandhan.com>, "'beta data'" <channel@mangalsutrabandhan.com>, "'test S'" <city@mangalsutrabandhan.com>
    Subject: Re:data transfer would be made live for 145 test

    This is to inform you that we are going to test today.



    Activity Timing: 9:00 PM onwards



    Thanks and Regards,

    free matrimony

    shaadi Operations


     P  Please do not print this e-mail unless it is absolutely necessary

    From: shaadi [nikaah:kundaali@mangalsutrabandhan.in]
    Sent: 21 February 2011 23:09
    To: vivaah@mangalsutrabandhan.com; bandhan@mangalsutrabandhan.com
    Cc: 'lagna '; movies@mangalsutrabandhan.in; manishv@mangalsutrabandhan.com; 
    Subject: data transfer would be made live for 145 test



    Hi,

    gtsdhsdbh
    anbdsmbsa
    sda the data test .

    Would request you to send in your feedback.



    Thanks and Regards,



    beta data

    assa xyz


     P  Please do not print this e-mail unless it is absolutely necessary



    HEREDOC;

Output:

Array
(
    [0] => Array
        (
            [0] => Re: bride search



            [1] => Re: bride search
PFA

Regds.,
sales



            [2] => Re:data transfer would be made live for 145 test

This is to inform you that we are going to test today.



Activity Timing: 9:00 PM onwards



Thanks and Regards,

free matrimony

shaadi Operations


 P  Please do not print this e-mail unless it is absolutely necessary


        )
    [1] => Array
        (
            [0] => Re: bride search



            [1] => Re: bride search
PFA

Regds.,
sales



            [2] => Re:data transfer would be made live for 145 test

This is to inform you that we are going to test today.



Activity Timing: 9:00 PM onwards



Thanks and Regards,

free matrimony

shaadi Operations


 P  Please do not print this e-mail unless it is absolutely necessary


        )

)

The regex I used to get the above output:

preg_match_all('/(?<=Subject: )(.*?[\n][\s]*?)(?=From:)/is',$content,$rest);

But it does not return the last message, because there is no following 'From:' for the lookahead to stop at. I hope that is clear. Please let me know if there is any other method for this, too.
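One possible way around the missing trailing 'From:' is to let the lookahead also accept the end of the input. The following is only a minimal sketch under that assumption: the shortened sample string and the variable name $bodies are made up for illustration, and, like the pattern above, it keeps the subject text at the start of each captured chunk.

```php
<?php
// Hypothetical, shortened sample; the real input would be the chain mail above.
$content = "From: a <a@example.com>\nSubject: Re: first\nHello\n\n"
         . "From: b <b@example.com>\nSubject: second\nLast body\n";

// m: ^ matches at the start of every line; s: . also matches newlines.
// The lookahead stops at the next line that begins with "From:" (optionally
// indented) or at the very end of the input (\z), so the final message no
// longer needs a following "From:" line.
preg_match_all('/(?<=Subject: )(.*?)(?=^[ \t]*From:|\z)/ms', $content, $rest);

// Strip surrounding whitespace from each captured chunk.
$bodies = array_map('trim', $rest[1]);

print_r($bodies);
```

Anchoring "From:" to the start of a line is a small change from the original pattern; it avoids stopping at the word "from" in the middle of a sentence.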

preg_match_all('/(?m:^From:\x20(?<From>[^\n]*)\n^Sent:\x20(?<Sent>[^\n]*)\n^To:\x20(?<To>[^\n]*)\n(?:^Cc:\x20(?<Cc>[^\n]*)\n)?^Subject:\x20(?<Subject>[^\n]*)\n)(?<Body>.*?(?=(?:\nFrom:)|$))/s',$content,$matches);
echo "<pre>".print_r($matches,true);

It gives nearly the correct output. Should I provide the text file at http://www.mangalsutrabandhan.com?


You're going to need some much smarter parsing to make sense of this - whatever produced this file is changing the structure of the emails:

Subject: Re: bride search
PFA

There should be at least one blank line between what appears to be part of an email header and its body.

Then you've got the problem of top-posting (and you can't rely on the timestamps in the headers without knowing the timezones), incomplete headers and no quoting.

So even if you built a heuristic to parse this, there are too many scenarios that it will not cope with.
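
For dumps that do follow the layout shown in the question, a line-by-line split is one heuristic that could be tried, with all the caveats above still applying. This is a rough sketch, not a robust parser: the function name splitChain, the sample string and the fixed list of header names are assumptions for illustration, and a body line that happens to start with "From:" will be misread as the start of a new message.

```php
<?php
/**
 * Split a flattened mail chain into messages (heuristic sketch).
 * A line starting with "From:" opens a message; header lines are read until
 * "Subject:" is seen, and everything up to the next "From:" line is the body.
 */
function splitChain(string $text): array
{
    $messages = [];
    $current = null;
    $inHeaders = false;

    foreach (preg_split('/\R/', $text) as $rawLine) {
        $line = trim($rawLine);

        if (strpos($line, 'From:') === 0) {
            if ($current !== null) {
                $messages[] = $current;      // close the previous message
            }
            $current = ['headers' => [], 'body' => []];
            $inHeaders = true;
        }
        if ($current === null) {
            continue;                        // skip text before the first "From:"
        }
        if ($inHeaders && preg_match('/^(From|Sent|To|Cc|Subject):\s*(.*)$/', $line, $m)) {
            $current['headers'][$m[1]] = $m[2];
            if ($m[1] === 'Subject') {
                $inHeaders = false;          // the body may follow with no blank line
            }
            continue;
        }
        $inHeaders = false;
        $current['body'][] = $rawLine;       // body lines are kept as-is, HTML included
    }
    if ($current !== null) {
        $messages[] = $current;
    }

    // Join the collected body lines and trim surrounding blank lines.
    foreach ($messages as &$msg) {
        $msg['body'] = trim(implode("\n", $msg['body']));
    }
    unset($msg);

    return $messages;
}

// Hypothetical sample in the same shape as the chain mail in the question.
$chain = "From: a <a@example.com>\nSent: Fri, 12 Aug 2011 16:17:40\n"
       . "Subject: Re: first\nPFA <b>bold</b>\n\n"
       . "From: b [mailto:b@example.com]\nSubject: second\n\nHi,\nlast body\n";
$messages = splitChain($chain);
print_r($messages);
```

Because the body lines are copied without modification, any HTML tags in them are left untouched.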
