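Extracting all email headers together with the body part in PHP

Problem description:

I want to use a regular expression in PHP to extract the body part of each message from the chained email below. The chained email is saved as a txt file. If the body contains HTML tags, they should be left unchanged in the extracted text.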

$content = <<<HEREDOC 

    From: Matrimony <[email protected]> 
    Sent: Fri, 12 Aug 2011 16:17:40 
    To: "[email protected]" <[email protected]> 
    Subject: Re: bride search 


    From: brides <[email protected]> 
    Sent: Fri, 12 Aug 2011 15:49:52 
    To: "Matrimony " <[email protected]> 
    Cc: "groom" <[email protected]> 
    Subject: Re: bride search 
    PFA 

    Regds., 
    sales 


    From: shaadi <[email protected]> 
    Sent: Tue, 22 Feb 2011 16:40:24 
    To: <[email protected]>, <[email protected]> 
    Cc: "'lagna '" <[email protected]>, <[email protected]>, <[email protected]>, "'beta data'" <[email protected]>, "'test S'" <[email protected]> 
    Subject: Re:data transfer would be made live for 145 test 

    This is to inform you that we are going to test today. 



    Activity Timing: 9:00 PM onwards 



    Thanks and Regards, 

    free matrimony 

    shaadi Operations 


    P Please do not print this e-mail unless it is absolutely necessary 

    From: shaadi [nikaah:[email protected]] 
    Sent: 21 February 2011 23:09 
    To: [email protected]; [email protected] 
    Cc: 'lagna '; [email protected]; [email protected]; 
    Subject: data transfer would be made live for 145 test 



    Hi, 

    gtsdhsdbh 
    anbdsmbsa 
    sda the data test . 

    Would request you to send in your feedback. 



    Thanks and Regards, 



    beta data 

    assa xyz 


    P Please do not print this e-mail unless it is absolutely necessary 



    HEREDOC; 

Output:

Array 
(
    [0] => Array 
     (
      [0] => Re: bride search 



      [1] => Re: bride search 
PFA 

Regds., 
sales 



      [2] => Re:data transfer would be made live for 145 test 

This is to inform you that we are going to test today. 



Activity Timing: 9:00 PM onwards 



Thanks and Regards, 

free matrimony 

shaadi Operations 


P Please do not print this e-mail unless it is absolutely necessary 


     ) 

    [1] => Array 
     (
      [0] => Re: bride search 



      [1] => Re: bride search 
PFA 

Regds., 
sales 



      [2] => Re:data transfer would be made live for 145 test 

This is to inform you that we are going to test today. 



Activity Timing: 9:00 PM onwards 



Thanks and Regards, 

free matrimony 

shaadi Operations 


P Please do not print this e-mail unless it is absolutely necessary 


     ) 

) 

This is the regular expression I used to get the output above:

preg_match_all('/(?<=Subject:)(.*?[\n][\s]*?)(?=From:)/is',$content,$rest); 

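but it does not return the last message, because no 'From:' follows it to end the match. I hope that is clear. Please let me know if there is another way to do this.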
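One possible adjustment, sketched here on a small hypothetical sample rather than the full `$content`, is to let the lookahead also accept the end of the input (`\z`), so the final message no longer needs a following `From:`. The `\h*` allows for indented header lines; that allowance, like the sample data, is an assumption.

```php
<?php
// Hypothetical two-message sample; the real input would be $content from the question.
$sample = "From: a <a@example.com>\nSubject: first\nbody one\n\n"
        . "From: b <b@example.com>\nSubject: second\nbody two\n";

// Lazily capture everything after "Subject:" up to the next line that starts
// with "From:" (optionally indented) or up to the end of the input (\z).
preg_match_all('/(?<=Subject:)(.*?)(?=^\h*From:|\z)/ms', $sample, $rest);

// Trim surrounding whitespace from each captured subject-plus-body chunk.
$bodies = array_map('trim', $rest[1]);
print_r($bodies);
```

As in the original pattern, each captured chunk still starts with the subject text, because the match begins right after `Subject:`.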

preg_match_all('/(?m:^From:\x20(?<From>[^\n]*)\n^Sent:\x20(?<Sent>[^\n]*)\n^To:\x20(?<To>[^\n]*)\n(?:^Cc:\x20(?<Cc>[^\n]*)\n)?^Subject:\x20(?<Subject>[^\n]*)\n)(?<Body>.*?(?=(?:\nFrom:)|$))/s',$content,$matches); 
echo "<pre>".print_r($matches,true); 

It gives nearly the correct output. Should I provide http://www.mangalsutrabandhan.com


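I'm not sure a regular expression is the best option. You would be better off splitting the document on the "clusters" of From/Sent/To/Subject lines. Anything that follows a cluster should be treated as content.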
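A minimal sketch of that cluster-based idea follows. The function name `splitChainedMail` and the fixed list of header names are assumptions based on the sample in the question, not an established API, and the sketch would misread a body whose first line happens to start with one of those names.

```php
<?php
/**
 * Split a chained-mail text into messages with separate headers and body.
 * Hypothetical helper; the header names are taken from the question's sample.
 */
function splitChainedMail(string $text): array
{
    // Cut the text in front of every line that starts with "From:".
    $blocks = preg_split('/^(?=\h*From:)/m', $text, -1, PREG_SPLIT_NO_EMPTY);
    $messages = [];

    foreach ($blocks as $block) {
        $headers = [];
        $lines = preg_split('/\R/', $block);
        $i = 0;
        // Consume the leading lines that look like one of the known headers.
        while ($i < count($lines)
            && preg_match('/^\h*(From|Sent|To|Cc|Subject):\h*(.*)$/', $lines[$i], $m)) {
            $headers[$m[1]] = trim($m[2]);
            $i++;
        }
        if (!isset($headers['From'])) {
            continue; // text before the first "From:" line is not a message
        }
        // Everything after the header cluster is body; HTML tags stay untouched.
        $body = trim(implode("\n", array_slice($lines, $i)));
        $messages[] = ['headers' => $headers, 'body' => $body];
    }
    return $messages;
}

// Usage on a small hypothetical sample.
$sample = "From: a <a@example.com>\nSent: Fri, 12 Aug 2011 16:17:40\nSubject: Re: one\n\n\n"
        . "From: b <b@example.com>\nSubject: Re: two\nPFA\n\n<b>Regds.</b>\n";
$messages = splitChainedMail($sample);
print_r($messages);
```

Because the body is simply "whatever is left after the headers", the last message is picked up without needing a terminating `From:` line.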


Could you edit your question to clarify the desired output? – paulmelnikow

You will need much smarter parsing to make sense of this text file. Whatever generated the file has changed the structure of the emails:

Subject: Re: bride search 
PFA 

There should be at least one blank line between what appears to be part of an email header and the message body.
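If the file cannot be regenerated, one rough workaround is to re-insert the missing blank line before parsing. The sketch below assumes that `Subject:` is always the last header of a message, which holds for the sample in the question but is not guaranteed in general; the `$raw` fragment is hypothetical.

```php
<?php
// Hypothetical fragment with the body glued directly to the Subject line.
$raw = "From: b <b@example.com>\nSubject: Re: bride search\nPFA\n";

// Insert an empty line after a Subject line whenever the next line is not
// already blank. Assumes Subject is the last header, as in the sample.
$normalized = preg_replace('/^(\h*Subject:.*\n)(?=\h*\S)/m', '$1' . "\n", $raw);

echo $normalized;
```

Messages that already have a blank line after `Subject:` are left unchanged, because the lookahead requires a non-blank next line.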

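Then you have the problems of top-posting (you cannot rely on the timestamps in the headers without knowing the time zone), incomplete headers, and no quoting.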

So even if you build a heuristic to parse this, there are too many scenarios it will not cope with.