How can I extract substrings from a string in Perl?

Problem description:

Consider the following strings:

1) Scheme ID: abc-456-hu5t10 (High priority) *

2) Scheme ID: frt-78f-hj542w (Balanced)

3) Scheme ID: 23f-f974-nm54w (super formula run) *

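and so on in the above format. The ID, the text in parentheses, and the trailing * are the parts that vary from string to string.

Imagine I have many strings in the format shown above. I want to pick three substrings from each of them:

First substring: the alphanumeric value (in the example above, "abc-456-hu5t10")
Second substring: the word or words in parentheses (in the example above, "High priority")
Third substring: the * (if a * is present at the end of the string; otherwise leave it out)

How do I pick these three substrings from each string shown above? I know it can be done with regular expressions in Perl... Can you help?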

Can the string in parentheses itself contain nested parentheses? – 2009-09-18 12:02:21

Answer

You could do something like this:

my $data = <<END;
1) Scheme ID: abc-456-hu5t10 (High priority) *
2) Scheme ID: frt-78f-hj542w (Balanced)
3) Scheme ID: 23f-f974-nm54w (super formula run) *
END

foreach (split(/\n/, $data)) {
    # skip lines that do not match
    $_ =~ /Scheme ID: ([a-z0-9-]+)\s+\(([^)]+)\)\s*(\*)?/ || next;
    # $3 is undef when there is no trailing star
    my ($id, $word, $star) = ($1, $2, defined $3 ? $3 : '');
    print "$id $word $star\n";
}

The key thing is the regular expression:

Scheme ID: ([a-z0-9-]+)\s+\(([^)]+)\)\s*(\*)?

It breaks down as follows.

The fixed string "Scheme ID: ":

Scheme ID:

Followed by one or more of the characters a-z, 0-9, or -. We use parentheses to capture it as $1:

([a-z0-9-]+)

Followed by one or more whitespace characters:

\s+

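Followed by an opening parenthesis (which we escape), then one or more characters that aren't a closing parenthesis, then a closing parenthesis (also escaped). We use unescaped parentheses to capture the words as $2: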

\(([^)]+)\)

Followed by optional whitespace and maybe a *, captured as $3:

\s*(\*)?
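Putting the pieces above back together, the same pattern can be written with the /x flag so that each part carries its own comment. This is only a sketch: the helper name parse_scheme is made up for illustration, and it assumes lowercase IDs as in the sample data.

```perl
use strict;
use warnings;

# parse_scheme is a hypothetical helper name used only for this sketch.
# Returns (id, words, star) on a match, or an empty list otherwise.
sub parse_scheme {
    my ($line) = @_;
    my ($id, $words, $star) = $line =~ m{
        Scheme [ ] ID: [ ]   # the fixed string; [ ] is a literal space under /x
        ([a-z0-9-]+)         # first capture: the ID
        \s+                  # one or more whitespace characters
        \( ([^)]+) \)        # second capture: text between literal parentheses
        \s*                  # optional whitespace
        (\*)?                # third capture: the star, if present
    }x or return;
    return ($id, $words, defined $star ? $star : '');
}

for my $line ('1) Scheme ID: abc-456-hu5t10 (High priority) *',
              '2) Scheme ID: frt-78f-hj542w (Balanced)') {
    my ($id, $words, $star) = parse_scheme($line) or next;
    print "$id|$words|$star\n";
}
```

Returning an empty string for a missing star avoids "uninitialized value" warnings when the result is printed under use warnings.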

Answer

(\S*)\s*\((.*?)\)\s*(\*?) 


(\S*) picks up anything which is NOT whitespace 
\s*  0 or more whitespace characters 
\(  a literal open parenthesis 
(.*?) anything, non-greedy so stops on first occurrence of... 
\)  a literal close parenthesis 
\s*  0 or more whitespace characters 
(\*?) 0 or 1 occurrences of literal *

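\(([^)]*)\) would be better than \((.*?)\), because it is guaranteed to stop at the first closing parenthesis. Non-greedy quantifiers can cause serious backtracking, which kills performance. (Admittedly not in this case, but avoiding them when they aren't needed is still a good habit to build.) The negated character class also expresses your intent more clearly: you're looking for "some number of characters that aren't )", rather than "the smallest number of any characters, followed by a ), that lets the expression as a whole match". – 2009-09-19 10:19:04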
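To illustrate the difference the comment describes, here is a small sketch. The sample string is invented, and the pattern deliberately requires the trailing * (unlike the answer's pattern, where it is optional) so that the non-greedy version is forced to keep looking.

```perl
use strict;
use warnings;

# Invented input with a stray ")" between the real closing parenthesis and the star.
my $text = 'id (first) tail) *';

# Non-greedy: .*? may grow past the first ")" if that is the only way to reach the "*".
my ($lazy) = $text =~ /\((.*?)\)\s*\*/;

# Negated class: [^)]* can never cross a ")", so this pattern fails instead of over-matching.
my ($strict) = $text =~ /\(([^)]*)\)\s*\*/;

print "lazy:   ", (defined $lazy   ? $lazy   : '(no match)'), "\n";
print "strict: ", (defined $strict ? $strict : '(no match)'), "\n";
```

The first pattern captures "first) tail", while the second reports no match, which is usually the safer outcome for malformed input.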

Answer

You could use a regular expression like the following:

/([-a-z0-9]+)\s*\((.*?)\)\s*(\*)?/

So, for example:

$s = "abc-456-hu5t10 (High priority) *"; 
$s =~ /([-a-z0-9]+)\s*\((.*?)\)\s*(\*)?/; 
print "$1\n$2\n$3\n";

prints

abc-456-hu5t10 
High priority 
*

Answer

Long time no Perl

while(<STDIN>) { 
    next unless /:\s*(\S+)\s+\(([^\)]+)\)\s*(\*?)/; 
    print "|$1|$2|$3|\n"; 
}

Answer

String 1:

$input =~ /^\S+/; 
$s1 = $&;

String 2:

$input =~ /\(.*\)/; 
$s2 = $&;

String 3:

$input =~ /\*?$/; 
$s3 = $&;
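A variation on the three matches above that uses capture groups in list context instead of $&, which keeps the parentheses out of the second value. It is a sketch under one assumption: $input holds only the ID, the parenthesized text, and an optional star, as in the sample value below.

```perl
use strict;
use warnings;

# Assumed input shape: ID, parenthesized text, optional star.
my $input = 'abc-456-hu5t10 (High priority) *';

# In list context a match returns its capture groups, so $& is not needed.
my ($s1) = $input =~ /^(\S+)/;        # leading run of non-whitespace
my ($s2) = $input =~ /\(([^)]*)\)/;   # text inside the parentheses, without them
my ($s3) = $input =~ /(\*?)\s*$/;     # "*" at the end, or an empty string if absent

print "$s1|$s2|$s3\n";
```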

Answer

Well, here is a one-liner:

perl -lne 'm|Scheme ID:\s+(.*?)\s+\((.*?)\)\s?(\*)?|g&&print "$1:$2:$3"' file.txt

Expanded into a simple script to explain things better:

#!/usr/bin/perl -ln

#-w : warnings (not enabled in the shebang line above)
#-l : print newline after every print
#-n : apply script body to stdin or files listed at commandline, don't print $_

use strict; #always do this.  

my $regex = qr{ # precompile regex         
    Scheme\ ID:  # to match beginning of line.      
    \s+    # 1 or more whitespace        
    (.*?)   # Non greedy match of all characters up to   
    \s+    # 1 or more whitespace        
    \(    # parenthesis literal        
    (.*?)   # non-greedy match to the next      
    \)    # closing literal parenthesis      
    \s*    # 0 or more whitespace (trailing * is optional)  
    (\*)?   # 0 or 1 literal *s         
}x; #x switch allows whitespace in regex to allow documentation. 

#values trapped in $1 $2 $3, so do whatever you need to:    
#Perl lets you use any characters as delimiters, I like pipes because
#they reduce the amount of escaping when using file paths   
m|$regex| && print "$1 : $2 : $3"; 

#alternatively if(m|$regex|) {doOne($1); doTwo($2) ... }

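That said, if this were for anything other than formatting, I would implement a main loop to handle the files and flesh out the body of the script, rather than relying on command-line switches for the looping.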
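A minimal sketch of what such an explicit main loop could look like, without the -n and -l switches. The helper name extract_all and the in-memory sample are assumptions made for illustration; a real script would more likely open a file named on the command line.

```perl
use strict;
use warnings;

my $regex = qr{
    Scheme [ ] ID: \s+
    (\S+) \s+          # the ID
    \( ([^)]*) \)      # text in parentheses
    \s* (\*)?          # optional star
}x;

# extract_all is a hypothetical helper: it reads lines from an open filehandle
# and returns one [id, words, star] triple per matching line.
sub extract_all {
    my ($fh) = @_;
    my @rows;
    while (my $line = <$fh>) {
        chomp $line;
        next unless $line =~ $regex;
        push @rows, [ $1, $2, defined $3 ? $3 : '' ];
    }
    return @rows;
}

# Demo on an in-memory filehandle so the sketch runs on its own.
my $sample = "1) Scheme ID: abc-456-hu5t10 (High priority) *\n"
           . "2) Scheme ID: frt-78f-hj542w (Balanced)\n";
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
print join(' : ', @$_), "\n" for extract_all($fh);
close $fh;
```

Keeping the loop in a sub that takes a filehandle means the same code can be fed from STDIN, a file, or a test string.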

Answer

This just requires a small change to my last answer:

my ($guid, $scheme, $star) = $line =~ m{ 
    The [ ] Scheme [ ] GUID: [ ] 
    ([a-zA-Z0-9-]+)   #capture the guid 
    [ ] 
    \( (.+) \)    #capture the scheme 
    (?: 
     [ ] 
     ([*])    #capture the star 
    )?      #if it exists 
}x;
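Note that the pattern as written expects the prefix "The Scheme GUID: ", so it will not match the "Scheme ID: " lines from this question unchanged. A sketch of the same structure adapted to that prefix follows; the variable names and the use of [^)]+ in place of .+ are my own choices, not part of the answer.

```perl
use strict;
use warnings;

my $line = '3) Scheme ID: 23f-f974-nm54w (super formula run) *';

my ($id, $scheme, $star) = $line =~ m{
    Scheme [ ] ID: [ ]
    ([a-zA-Z0-9-]+)    # capture the id
    [ ]
    \( ([^)]+) \)      # capture the text in parentheses
    (?:
        [ ]
        ([*])          # capture the star
    )?                 # if it exists
}x;

# The star group is undef when the line has no trailing star.
$star = '' unless defined $star;
print "$id|$scheme|$star\n";
```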
