How can I find the location of a regex match in Perl?

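I need to write a function that receives a string and a regex. I need to check whether there is a match and return the start and end positions of the match. (The regex has already been compiled with qr//.)

The function might also receive a "global" flag, in which case I need to return the (start, end) pairs of all the matches.

I cannot change the regex, not even add () around it, because the user might use () and \1. Maybe I can use (?:).

Example: given "ababab" and the regex qr/ab/, in the global case I need to get back 3 (start, end) pairs.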

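The built-in variables @- and @+ hold the start and end positions, respectively, of the last successful match. $-[0] and $+[0] correspond to the entire pattern, while $-[N] and $+[N] correspond to the $N ($1, $2, etc.) submatches.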
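As a small illustration of those two arrays, here is a sketch with one capture group. The sample string and variable names are made up for the example.

```perl
use strict;
use warnings;

my $string = "xxabcabyy";
my ($start, $end, $g_start, $g_end, $whole);

# One capture group, so $-[1] and $+[1] describe $1.
if ($string =~ /ab(c)/) {
    ($start, $end)     = ($-[0], $+[0]);   # offsets of the whole match
    ($g_start, $g_end) = ($-[1], $+[1]);   # offsets of group 1
    # The end offset is exclusive, so the length is $end - $start.
    $whole = substr($string, $start, $end - $start);
    print "match '$whole' at [$start, $end)\n";
    print "group 1 at [$g_start, $g_end)\n";
}
```

The end offset points just past the last matched character, which is why substr with a length of $end - $start gives back the matched text.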

Forget my previous post, I've got a better idea.

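# Return (start, end) of the first match, or an empty list if there is none.
sub match_positions {
    my ($regex, $string) = @_;
    return if not $string =~ /$regex/;
    return ($-[0], $+[0]);
}

# Return a list of [start, end] pairs, one per match.
sub match_all_positions {
    my ($regex, $string) = @_;
    my @ret;
    while ($string =~ /$regex/g) {
        push @ret, [ $-[0], $+[0] ];
    }
    return @ret;
}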

This technique doesn't change the regex in any way.
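Because the pattern is used exactly as given, a user's own groups and backreferences such as \1 keep their numbering. The following is a minimal sketch that folds both cases into a single function with a global flag, as the question asks. The name find_matches and its third parameter are hypothetical, not part of the answer above.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a list of [start, end] pairs.
# With a false $global it stops after the first match.
sub find_matches {
    my ($regex, $string, $global) = @_;
    my @pairs;
    while ($string =~ /$regex/g) {
        push @pairs, [ $-[0], $+[0] ];
        last unless $global;
    }
    return @pairs;
}

my @all   = find_matches(qr/ab/, "ababab", 1);
my @first = find_matches(qr/ab/, "ababab", 0);
my @back  = find_matches(qr/(.)\1/, "aabcc", 1);   # user pattern with \1, untouched

print join(" ", map { "($_->[0],$_->[1])" } @all), "\n";   # (0,2) (2,4) (4,6)
```

Returning array references keeps the pairs grouped, so the caller can tell "no match" (empty list) apart from a match at offset 0.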

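Edited to add: to quote perlvar on $1..$9, "These variables are all read-only and dynamically scoped to the current BLOCK." In other words, if you want to use $1..$9, you cannot use a subroutine to do the matching.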
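A short sketch of that scoping rule, and of one possible workaround: copy whatever the caller needs out of the match variables before the subroutine returns. The helper name match_with_captures and the hash keys are assumptions made for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: matches inside a sub and hands back copies of the
# values, because $1, @- and @+ set here are not visible to the caller.
sub match_with_captures {
    my ($regex, $string) = @_;
    return unless $string =~ $regex;
    return +{ start => $-[0], end => $+[0], first => $1 };
}

"outer text" =~ /(outer)/;      # sets $1 in the caller's scope
my $m = match_with_captures(qr/(\d+)/, "abc 123");

my $callers_capture = $1;       # still the caller's own capture
print "caller's \$1 is '$callers_capture'\n";
print "sub captured '$m->{first}' at [$m->{start}, $m->{end})\n";
```

The match inside the subroutine does not overwrite the caller's $1, so anything the caller should see has to be returned explicitly.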

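The pos function gives you the offset at which the last m//g match ended. If you put your regex in parentheses, you can use length $1 to get the length of the match (and from that its start). Like this:

sub match_positions {
    my ($regex, $string) = @_;
    # /g in scalar context is needed so that pos($string) gets set.
    return unless $string =~ /($regex)/g;
    return (pos($string) - length($1), pos($string));
}

sub all_match_positions {
    my ($regex, $string) = @_;
    my @ret;
    while ($string =~ /($regex)/g) {
        # pos() is the end of the match; subtract the length to get the start.
        push @ret, [ pos($string) - length($1), pos($string) ];
    }
    return @ret;
}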
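As a quick check of what pos reports, the sketch below records pos next to $+[0] for every match. The variable names are illustrative only.

```perl
use strict;
use warnings;

my $string = "ababab";
my @seen;
while ($string =~ /ab/g) {
    # pos() and $+[0] both hold the offset just past the current match.
    push @seen, [ pos($string), $+[0] ];
}

# The final, failed m//g attempt resets pos() to undef.
my $after_loop = pos($string);

print join(" ", map { "$_->[0]=$_->[1]" } @seen), "\n";   # 2=2 4=4 6=6
```

So pos marks where a match ended, and the start has to be derived from the match length (or read from $-[0]).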

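#!/usr/bin/perl
use strict;
use warnings;

# Search for the positions of the CpGs in the human genome.

# Return (start, end) of the first match, with an inclusive end offset.
sub match_positions {
    my ($regex, $string) = @_;
    # /g in scalar context is needed so that pos($string) gets set.
    return unless $string =~ /($regex)/g;
    return (pos($string) - length($1), pos($string) - 1);
}

# Return a list of [start, end] pairs, with an inclusive end offset.
sub all_match_positions {
    my ($regex, $string) = @_;
    my @ret;
    while ($string =~ /($regex)/g) {
        push @ret, [ pos($string) - length($1), pos($string) - 1 ];
    }
    return @ret;
}

my $regex  = 'CG';
my $string = "ACGACGCGCGCG";
my $cgap   = 3;
my @pos    = all_match_positions($regex, $string);

# Collect the end offset of every CpG.
my @hgcg;
foreach my $pos (@pos) {
    push @hgcg, $pos->[1];
}

# Print the length of each window that spans $cgap consecutive CpGs.
foreach my $i (0 .. ($#hgcg - $cgap + 1)) {
    my $len = $hgcg[$i + $cgap - 1] - $hgcg[$i] + 2;
    print "$len\n";
}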

You can also use the deprecated $` variable, if you're willing to have all the REs in your program execute slower. From perlvar:

$` The string preceding whatever was matched by the last successful pattern match (not
counting any matches hidden within a BLOCK or eval enclosed by the current BLOCK).
(Mnemonic: "`" often precedes a quoted string.) This variable is read-only.

The use of this variable anywhere in a program imposes a considerable performance penalty
on all regular expression matches. See "BUGS".
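For completeness, a rough sketch of how $` could be turned into offsets: the length of the prematch is the start, and adding the length of $& (the matched text) gives the end. The variable names are made up, and the performance caveat quoted above applies to both $` and $&.

```perl
use strict;
use warnings;

my $string = "xxabyy";
my ($pre_start, $pre_end, $start, $end);

if ($string =~ /ab/) {
    $pre_start = length($`);                # text before the match
    $pre_end   = $pre_start + length($&);   # plus the matched text itself
    ($start, $end) = ($-[0], $+[0]);        # same offsets via @- and @+
    print "[$pre_start, $pre_end) vs [$start, $end)\n";
}
```

Both routes give the same pair here, so @- and @+ are the simpler choice when only offsets are needed.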