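Compare FILE1 values against FILE2 ranges and print matches
I'm new to Perl and am working on a bioinformatics project at university. I have FILE1, which contains a list of positions in this format: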
99269
550
100
126477
1700
and FILE2 in this format:
517 1878 forward
700 2500 forward
2156 3289 forward
99000 100000 forward
22000 23000 backward
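I want to compare each position in FILE1 against every range in FILE2, and if a position falls within one of the ranges, print the position, the range, and the direction.
So my expected output would be: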
99269 99000 100000 forward
550 517 1878 forward
1700 517 1878 forward
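At the moment it runs without errors but outputs nothing, so I'm not sure where I'm going wrong! When I split up the final 'if' condition it runs, but it only works when the position is on exactly the same line as the range.
My code is as follows: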
#!/usr/bin/perl
use strict;
use warnings;

my $outputfile = "/Users/edwardtickle/Documents/CC22CDS.txt";
open FILE1, "/Users/edwardtickle/Documents/CC22positions.txt"
    or die "cannot open > CC22: $!";
open FILE2, "/Users/edwardtickle/Documents/CDSpositions.txt"
    or die "cannot open > CDS: $!";
open(OUTPUTFILE, ">$outputfile") or die "Could not open output file: $! \n";

while (<FILE1>) {
    if (/^(\d+)/) {
        my $CC22 = $1;
        while (<FILE2>) {
            if (/^(\d+)\s+(\d+)\s+(\S+)/) {
                my $CDS1 = $1;
                my $CDS2 = $2;
                my $CDS3 = $3;
                if ($CC22 > $CDS1 && $CC22 < $CDS2) {
                    print OUTPUTFILE "$CC22 $CDS1 $CDS2 $CDS3\n";
                }
            }
        }
    }
}
close(FILE1);
close(FILE2);
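I have posted the same question on Perlmonks.
Because FILE2 is read only once, it is compared only with the first line of FILE1. For every later line of FILE1, the inner loop reads from a filehandle that is already at end-of-file, so nothing is compared.
Stash the lines from FILE1 in an array, then compare each line of FILE2 against each array entry, as follows: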
#!/usr/bin/perl
use strict;
use warnings;

my $outputfile = "out.txt";
open FILE1, "file1.txt"
    or die "cannot open > CC22: $!";
open FILE2, "file2.txt"
    or die "cannot open > CDS: $!";
open(OUTPUTFILE, ">$outputfile") or die "Could not open output file: $! \n";

my @file1list = ();
while (<FILE1>) {
    if (/^(\d+)/) {
        push @file1list, $1;
    }
}

while (<FILE2>) {
    if (/^(\d+)\s+(\d+)\s+(\S+)/) {
        my $CDS1 = $1;
        my $CDS2 = $2;
        my $CDS3 = $3;
        for my $CC22 (@file1list) {
            if ($CC22 > $CDS1 && $CC22 < $CDS2) {
                print OUTPUTFILE "$CC22 $CDS1 $CDS2 $CDS3\n";
            }
        }
    }
}
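(There are also some style issues with the program, such as the uppercase variable names, but I've ignored those. This is a pretty good program for a beginner.)
Works perfectly, thank you for the prompt reply! I don't have enough reputation to upvote your answer, but I'll come back and do so once I have enough. Out of interest, is it recommended to avoid capital letters in variable names to prevent case-sensitivity errors? – 2014-10-17 11:37:54
From a quick look at http://perldoc.perl.org/perlstyle.html: all-uppercase variable names are usually constants or internal variables used by perl itself. – Vorsprung 2014-10-17 13:01:41
I thought I could simplify things a bit by using split instead of regular expressions, but I think my code has actually ended up longer and harder to read! In any case, keep in mind that split works very well for this kind of problem: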
use strict;
use warnings;

# User config area
my $positions_file = 'input_positions.txt';
my $ranges_file    = 'input_ranges.txt';
my $output_file    = 'output_data.txt';

# Reading data
open my $positions_fh, "<", $positions_file or die "Cannot open $positions_file: $!";
open my $ranges_fh, "<", $ranges_file or die "Cannot open $ranges_file: $!";
chomp(my @positions = <$positions_fh>);

# Store the range data in an array of hash references,
# to be used like $range_data[0] = {start => $start, end => $end, dir => $dir}
my @range_data;
while (<$ranges_fh>) {
    chomp;
    my ($start, $end, $dir) = split;    # splits $_ on whitespace
    push @range_data, { start => $start, end => $end, dir => $dir };
    #print "start: $start, end: $end, direction: $dir\n";
}
close $positions_fh;
close $ranges_fh;

# Data processing:
open my $output_fh, ">", $output_file or die "Cannot open $output_file: $!";
# It feels like it should be more efficient to process one range at a time for all data points
foreach my $range (@range_data) {          # one range at a time
    foreach my $position (@positions) {    # check all positions
        # Bounds are inclusive here, unlike the > and < used in the question
        if (($range->{start} <= $position) and ($position <= $range->{end})) {
            print $output_fh "$position $range->{start} $range->{end} $range->{dir}\n";
        }
    }
}
close $output_fh;
The code would probably run faster if the data processing were done inside the while loop, while the range data is being read.
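A rough sketch of that idea, not benchmarked: each range is compared against all positions as soon as it is read, so the ranges never need to be stored. The helper name `match_positions` is made up for this illustration, and the sample data from the question is read through an in-memory filehandle instead of the real files.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns one output line per (position, range) match.
# $ranges_fh is any readable filehandle with "start end direction" lines.
sub match_positions {
    my ($positions, $ranges_fh) = @_;
    my @out;
    while (my $line = <$ranges_fh>) {
        my ($start, $end, $dir) = split ' ', $line;
        next unless defined $dir;    # skip blank or malformed lines
        for my $position (@$positions) {
            # Inclusive bounds, as in the split-based code above
            push @out, "$position $start $end $dir\n"
                if $start <= $position && $position <= $end;
        }
    }
    return @out;
}

# Sample data from the question, standing in for the real ranges file
my $ranges = "517 1878 forward\n700 2500 forward\n2156 3289 forward\n"
           . "99000 100000 forward\n22000 23000 backward\n";
open my $ranges_fh, '<', \$ranges or die "Cannot open in-memory ranges: $!";

my @lines = match_positions([99269, 550, 100, 126477, 1700], $ranges_fh);
print @lines;
```

Note that the output comes out grouped by range rather than in the order of the positions file, and 1700 is reported for both ranges that contain it.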
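Your bug arises because you are nesting the file processing: the inner loop can only go through the contents of its file once, and after that the filehandle is stuck at eof.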
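A small demonstration of that behavior, using an in-memory filehandle with two of the sample lines in place of the real file. A second read from the same handle returns nothing until the handle is rewound with `seek`. Rewinding inside the outer loop would also make the original nested loops work, but it rereads the whole file for every position.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $data = "517 1878 forward\n700 2500 forward\n";
open my $fh, '<', \$data or die "Cannot open in-memory data: $!";

my @first_pass  = <$fh>;    # reads both lines and leaves the handle at eof
my @second_pass = <$fh>;    # reads nothing: the handle is still at eof

seek $fh, 0, 0 or die "Cannot rewind: $!";    # move back to the start
my @third_pass = <$fh>;     # reads both lines again

printf "first: %d, second: %d, after seek: %d\n",
    scalar @first_pass, scalar @second_pass, scalar @third_pass;
```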
The simplest solution is to load the inner loop's file into memory first.
The following demonstrates this using more Modern Perl techniques:
#!/usr/bin/perl
use strict;
use warnings;
use autodie;

my $cc22file = "/Users/edwardtickle/Documents/CC22positions.txt";
my $cdsfile  = "/Users/edwardtickle/Documents/CDSpositions.txt";
my $outfile  = "/Users/edwardtickle/Documents/CC22CDS.txt";

my @ranges = do {
    # open my $fh, '<', $cdsfile; # Using Fake Data instead below
    open my $fh, '<', \ "517 1878 forward\n700 2500 forward\n2156 3289 forward\n99000 100000 forward\n22000 23000 backward\n";
    map {[split]} <$fh>;
};

# open my $infh, '<', $cc22file; # Using Fake Data instead below
open my $infh, '<', \ "99269\n550\n100\n126477\n1700\n";

# open my $outfh, '>', $outfile; # Using STDOUT instead below
my $outfh = \*STDOUT;

CC22:
while (my $cc22 = <$infh>) {
    chomp $cc22;
    for my $cds (@ranges) {
        if ($cc22 > $cds->[0] && $cc22 < $cds->[1]) {
            print $outfh "$cc22 @$cds\n";
            next CC22;
        }
    }
    # warn "$cc22 No match found\n";
}
Output:
99269 99000 100000 forward
550 517 1878 forward
1700 517 1878 forward
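Because of `next CC22`, the code above stops at the first matching range, which is why 1700 is printed only for `517 1878` even though it also falls inside `700 2500`. If every containing range were wanted instead, a variant along these lines might do. It is only a sketch: `ranges_containing` is a name invented for this example, and the sample data is hard-coded.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: every range that strictly contains $position
# (same > and < comparison as the code above).
sub ranges_containing {
    my ($position, @ranges) = @_;
    return grep { $position > $_->[0] && $position < $_->[1] } @ranges;
}

# Sample ranges from the question, each split into [start, end, direction]
my @ranges = map { [ split ' ', $_ ] } (
    "517 1878 forward", "700 2500 forward", "2156 3289 forward",
    "99000 100000 forward", "22000 23000 backward",
);

for my $cc22 (99269, 550, 100, 126477, 1700) {
    print "$cc22 @$_\n" for ranges_containing($cc22, @ranges);
}
```

With the sample data this prints the three lines shown above plus a fourth, `1700 700 2500 forward`.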
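[Crossposted at PerlMonks](http://www.perlmonks.org/?node_id=1104164). – choroba 2014-10-17 10:22:02
1700 fits two ranges ('517 1878' and '700 2500'), but you only want one of them. What is your criterion for choosing that one? – TLP 2014-10-17 11:28:00
The data here is genomic data, and the ranges are actually segments of a genome, so it's fine if a position matches twice, as long as it finds one range! Thanks for pointing it out, though. – 2014-10-17 11:58:37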