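Extracting every message (headers and body) from a chained email in PHP
Question:
I want to use a regular expression in PHP to extract the body part of each message in the chained mail below. The chained mail is saved as a .txt file. If the body contains HTML tags, they should be left unchanged during extraction.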
$content = <<<HEREDOC
From: Matrimony <[email protected]>
Sent: Fri, 12 Aug 2011 16:17:40
To: "[email protected]" <[email protected]>
Subject: Re: bride search
From: brides <[email protected]>
Sent: Fri, 12 Aug 2011 15:49:52
To: "Matrimony " <[email protected]>
Cc: "groom" <[email protected]>
Subject: Re: bride search
PFA
Regds.,
sales
From: shaadi <[email protected]>
Sent: Tue, 22 Feb 2011 16:40:24
To: <[email protected]>, <[email protected]>
Cc: "'lagna '" <[email protected]>, <[email protected]>, <[email protected]>, "'beta data'" <[email protected]>, "'test S'" <[email protected]>
Subject: Re:data transfer would be made live for 145 test
This is to inform you that we are going to test today.
Activity Timing: 9:00 PM onwards
Thanks and Regards,
free matrimony
shaadi Operations
P Please do not print this e-mail unless it is absolutely necessary
From: shaadi [nikaah:[email protected]]
Sent: 21 February 2011 23:09
To: [email protected]; [email protected]
Cc: 'lagna '; [email protected]; [email protected];
Subject: data transfer would be made live for 145 test
Hi,
gtsdhsdbh
anbdsmbsa
sda the data test .
Would request you to send in your feedback.
Thanks and Regards,
beta data
assa xyz
P Please do not print this e-mail unless it is absolutely necessary
HEREDOC;
Output:
Array
(
[0] => Array
(
[0] => Re: bride search
[1] => Re: bride search
PFA
Regds.,
sales
[2] => Re:data transfer would be made live for 145 test
This is to inform you that we are going to test today.
Activity Timing: 9:00 PM onwards
Thanks and Regards,
free matrimony
shaadi Operations
P Please do not print this e-mail unless it is absolutely necessary
)
[1] => Array
(
[0] => Re: bride search
[1] => Re: bride search
PFA
Regds.,
sales
[2] => Re:data transfer would be made live for 145 test
This is to inform you that we are going to test today.
Activity Timing: 9:00 PM onwards
Thanks and Regards,
free matrimony
shaadi Operations
P Please do not print this e-mail unless it is absolutely necessary
)
)
This is the regex I used to get the output above:
preg_match_all('/(?<=Subject:)(.*?[\n][\s]*?)(?=From:)/is',$content,$rest);
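However, it does not return the last message, because no 'From:' follows it for the lookahead to match. I hope that is clear. Please let me know if there is another way to do this.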
preg_match_all('/(?m:^From:\x20(?<From>[^\n]*)\n^Sent:\x20(?<Sent>[^\n]*)\n^To:\x20(?<To>[^\n]*)\n(?:^Cc:\x20(?<Cc>[^\n]*)\n)?^Subject:\x20(?<Subject>[^\n]*)\n)(?<Body>.*?(?=(?:\nFrom:)|$))/s',$content,$matches);
echo "<pre>".print_r($matches,true);
It gives nearly the correct output. Should I provide http://www.mangalsutrabandhan.com
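One likely reason the second pattern is only nearly correct: the header group consumes the newline after the Subject line, so when a message has no body (the first one in the sample) the lookahead `\nFrom:` cannot match at the `From:` that follows immediately, and the lazy body runs on into the next message. The following is a minimal sketch, not a definitive fix. It assumes every message carries From/Sent/To/(Cc)/Subject in that order and anchors the lookahead at a line start instead. `$sample` and its example.com addresses are a shortened, hypothetical stand-in for the data in the question.

```php
<?php
// Shortened, hypothetical stand-in for the chained mail in the question.
$sample = "From: A <a@example.com>\n"
    . "Sent: Fri, 12 Aug 2011 16:17:40\n"
    . "To: B <b@example.com>\n"
    . "Subject: Re: bride search\n"
    . "From: B <b@example.com>\n"
    . "Sent: Fri, 12 Aug 2011 15:49:52\n"
    . "To: A <a@example.com>\n"
    . "Cc: C <c@example.com>\n"
    . "Subject: Re: bride search\n"
    . "PFA\n"
    . "Regds.,\n"
    . "<b>sales</b>\n";

// m: ^ matches at every line start; s: . also matches newlines.
// The body stops right before the next line that starts with "From: ",
// or at the end of the string for the last message.
$pattern = '/^From:\x20(?<From>[^\n]*)\n'
    . 'Sent:\x20(?<Sent>[^\n]*)\n'
    . 'To:\x20(?<To>[^\n]*)\n'
    . '(?:Cc:\x20(?<Cc>[^\n]*)\n)?'
    . 'Subject:\x20(?<Subject>[^\n]*)\n?'
    . '(?<Body>.*?)(?=^From:\x20|\z)/ms';

preg_match_all($pattern, $sample, $matches, PREG_SET_ORDER);

// Collect the bodies; HTML tags inside a body are left untouched.
$bodies = [];
foreach ($matches as $match) {
    $bodies[] = trim($match['Body']);
}

print_r($bodies);
```

A body line that itself begins with `From: ` would still be mistaken for the start of a new message, so this only holds for input shaped like the sample.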
Answer:
You will need some fairly clever parsing to make sense of this text file. Whatever generated it has changed the structure of the emails:
Subject: Re: bride search
PFA
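There should be at least one blank line between what appears to be part of an email header and its body.
Then you have the problems of top-posting (you cannot rely on the timestamps in the headers without knowing the time zone), incomplete headers, and no quoting.
So even if you build a heuristic to parse this, there are too many scenarios it will not cope with.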
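I am not sure a regex is the best option. You would be better off splitting the document on the "clusters" of From/To/Subject header data. Anything between one cluster and the next should then be treated as content. –
Would you edit your question to clarify the desired output? – paulmelnikow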
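The cluster idea from the comment above can be sketched as a simple line scanner. This is a rough illustration under stated assumptions, not a robust mail parser: every message is assumed to start with a `From:` line, its headers are assumed to be limited to Sent/To/Cc/Subject (one per line), and everything up to the next `From:` line is treated as body. The function name `splitChainedMail` and the sample input are hypothetical.

```php
<?php
/**
 * Hypothetical helper: split a chained mail into messages.
 * Assumes each message starts with a "From:" line followed by a cluster
 * of Sent/To/Cc/Subject header lines.
 */
function splitChainedMail(string $text): array
{
    $messages = [];
    $current = null;     // message currently being built
    $inHeaders = false;  // true while still inside the header cluster

    foreach (preg_split('/\R/', $text) as $line) {
        if (preg_match('/^From:\s*(.*)$/', $line, $m)) {
            // A "From:" line closes the previous message and opens a new one.
            if ($current !== null) {
                $messages[] = $current;
            }
            $current = ['headers' => ['From' => $m[1]], 'body' => []];
            $inHeaders = true;
            continue;
        }
        if ($current === null) {
            continue; // ignore anything before the first "From:" line
        }
        if ($inHeaders && preg_match('/^(Sent|To|Cc|Subject):\s*(.*)$/', $line, $m)) {
            $current['headers'][$m[1]] = $m[2];
            continue;
        }
        // First non-header line: the rest is body, HTML tags included.
        $inHeaders = false;
        $current['body'][] = $line;
    }
    if ($current !== null) {
        $messages[] = $current;
    }

    // Join the body lines and trim surrounding whitespace.
    foreach ($messages as &$msg) {
        $msg['body'] = trim(implode("\n", $msg['body']));
    }
    unset($msg);

    return $messages;
}

// Usage with a shortened, made-up chain.
$messages = splitChainedMail(
    "From: A\nSent: x\nTo: B\nSubject: S1\n"
    . "From: B\nSent: y\nTo: A\nCc: C\nSubject: S2\nPFA\n<b>sales</b>\n"
);
print_r($messages);
```

As the answer warns, a heuristic like this breaks as soon as a body line starts with `From:` or a header is missing or wrapped, so it is only suitable for files shaped like the sample.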