With qr, you can break a regular expression into small pieces and then combine them. For example:
my $howells = qr/Thurston|Mrs/;
my $tagalongs = qr/Ginger|Mary Ann/;
my $passengers = qr/$howells|$tagalongs/;
my $crew = qr/Gilligan|Skipper/;
my $everyone = qr/$crew|$passengers/;
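As a minimal sketch of how the combined pattern might be used, the snippet below rebuilds the same pieces and tests a few sample names. The @names list and the \A...\z anchoring are illustrative assumptions, not part of the original example.

```perl
use strict;
use warnings;

# Same building blocks as above
my $howells    = qr/Thurston|Mrs/;
my $tagalongs  = qr/Ginger|Mary Ann/;
my $passengers = qr/$howells|$tagalongs/;
my $crew       = qr/Gilligan|Skipper/;
my $everyone   = qr/$crew|$passengers/;

# Hypothetical sample input
my @names = ('Gilligan', 'Mary Ann', 'Professor');

# Anchor the match so the whole string must be one of the names.
# Each qr// object keeps its own grouping when interpolated, so the
# alternations inside $crew and $passengers do not leak into each other.
my @matched = grep { /\A(?:$everyone)\z/ } @names;

print "$_\n" for @matched;
```

Without the anchors, any string that merely contains one of the names would match as well.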
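This mirrors the way RFC 1738 dissects the parts of a URL. Translated into Perl regexes, it looks roughly like this (a general understanding is enough):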
# Reusable basic character classes
my $alpha = qr/[a-z]/;
my $digit = qr/\d/;
my $alphadigit = qr/(?i:$alpha|$digit)/;
my $safe = qr/[\$_.+-]/;
my $extra = qr/[!*'\(\),]/;
my $national = qr/[{}|\\^~\[\]`]/;
my $reserved = qr|[;/?:@&=]|;
my $hex = qr/(?i:$digit|[A-F])/;
my $escape = qr/%$hex$hex/;
my $unreserved = qr/$alpha|$digit|$safe|$extra/;
my $uchar = qr/$unreserved|$escape/;
my $xchar = qr/$unreserved|$reserved|$escape/;
my $ucharplus = qr/(?:$uchar|[;?&=])*/;
my $digits = qr/(?:$digit){1,}/;
# Reusable URL components
my $hsegment = $ucharplus;
my $hpath = qr|$hsegment(?:/$hsegment)*|;
my $search = $ucharplus;
my $scheme = qr|(?i:https?://)|;
my $port = qr/$digits/;
my $password = $ucharplus;
my $user = $ucharplus;
my $toplevel = qr/$alpha|$alpha(?:$alphadigit|-)*$alphadigit/;
my $domainlabel = qr/$alphadigit|$alphadigit(?:$alphadigit|-)*$alphadigit/x;
my $hostname = qr/(?:$domainlabel\.)*$toplevel/;
my $hostnumber = qr/$digits\.$digits\.$digits\.$digits/;
my $host = qr/$hostname|$hostnumber/;
my $hostport = qr/$host(?::$port)?/;
my $login = qr/(?:$user(?::$password)?\@)?/;
my $urlpath = qr/(?:(?:$xchar)*)/;
We can then use the seemingly very complicated regular expressions above to check each input line for a well-formed HTTP URL:
use v5.10.1;
my $httpurl = qr|$scheme$hostport(?:/$hpath(?:\?$search)?)?|;
while( <> ) {
    say if /$httpurl/;
}
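Note that a pattern used this way is unanchored, so it reports any line that contains a URL somewhere. The sketch below shows the difference between that and validating an entire string. To stay short it uses a much simplified stand-in pattern ($url) instead of the full $httpurl; both the pattern and the sample lines are assumptions for illustration.

```perl
use strict;
use warnings;

# Simplified stand-in for $httpurl (illustration only, not RFC 1738 complete)
my $url = qr{https?://[a-z0-9.-]+(?::\d+)?(?:/\S*)?}i;

# Hypothetical sample input
my @lines = (
    'http://www.example.com/index.html',
    'see http://www.example.com for details',
    'no url here',
);

# Unanchored: true if the line contains a URL anywhere
my @contains = grep { /$url/ } @lines;

# Anchored: true only if the entire line is a URL
my @exact = grep { /\A$url\z/ } @lines;

printf "contains: %d, exact: %d\n", scalar @contains, scalar @exact;
```

The same \A...\z wrapping can be applied to $httpurl when whole-string validation is what you want.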
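Regular expression modules
The regex built above is too complicated. Many commonly used regular expressions have already been written by others, so we can simply reuse them. For example, the Regexp::Common module provides many prebuilt regular expressions.
First, install the module: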
sudo cpan -i Regexp::Common
Below are the prebuilt patterns that Regexp::Common provides on CPAN, for reference: https://metacpan.org/release/Regexp-Common
Regexp::Common - Provide commonly requested regular expressions
Regexp::Common::CC - provide patterns for credit card numbers.
Regexp::Common::SEN - provide regexes for Social-Economical Numbers.
Regexp::Common::URI - provide patterns for URIs.
Regexp::Common::URI::RFC1035 - Definitions from RFC1035;
Regexp::Common::URI::RFC1738 - Definitions from RFC1738;
Regexp::Common::URI::RFC1808 - Definitions from RFC1808;
Regexp::Common::URI::RFC2384 - Definitions from RFC2384;
Regexp::Common::URI::RFC2396 - Definitions from RFC2396;
Regexp::Common::URI::RFC2806 - Definitions from RFC2806;
Regexp::Common::URI::fax - Returns a pattern for fax URIs.
Regexp::Common::URI::file - Returns a pattern for file URIs.
Regexp::Common::URI::ftp - Returns a pattern for FTP URIs.
Regexp::Common::URI::gopher - Returns a pattern for gopher URIs.
Regexp::Common::URI::http - Returns a pattern for HTTP URIs.
Regexp::Common::URI::news - Returns a pattern for file URIs.
Regexp::Common::URI::pop - Returns a pattern for POP URIs.
Regexp::Common::URI::prospero - Returns a pattern for prospero URIs.
Regexp::Common::URI::tel - Returns a pattern for telephone URIs.
Regexp::Common::URI::telnet - Returns a pattern for telnet URIs.
Regexp::Common::URI::tv - Returns a pattern for tv URIs.
Regexp::Common::URI::wais - Returns a pattern for WAIS URIs.
Regexp::Common::_support - Support functions for Regexp::Common.
Regexp::Common::balanced - provide regexes for strings with balanced parenthesized delimiters or arbitrary delimiters.
Regexp::Common::comment - provide regexes for comments.
Regexp::Common::delimited - provides a regex for delimited strings
Regexp::Common::lingua - provide regexes for language related stuff.
Regexp::Common::list - provide regexes for lists
Regexp::Common::net - provide regexes for IPv4, IPv6, and MAC addresses.
Regexp::Common::number - provide regexes for numbers
Regexp::Common::profanity - provide regexes for profanity
Regexp::Common::whitespace - provides a regex for leading or trailing whitescape
Regexp::Common::zip - provide regexes for postal codes.
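These regular expressions are exposed through nested hashes, under a hash named %RE. Take the module Regexp::Common::URI::http, which provides the regex for HTTP URIs: it is nested two levels deep. The first-level key is URI, whose value is the second-level hash, and the second-level key is HTTP. The regex is therefore available as $RE{URI}{HTTP}.
For example, to check input lines for a well-formed HTTP URL: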
use Regexp::Common qw(URI);
while( <> ) {
    print if /$RE{URI}{HTTP}/;
}
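The two-level lookup can be pictured with an ordinary nested hash of qr objects. The sketch below is only an analogy: %MY_RE, its keys, and its simplified patterns are hypothetical, and this is not how Regexp::Common is implemented internally. It just shows why a chained subscript such as $RE{URI}{HTTP} can be interpolated directly into a match.

```perl
use strict;
use warnings;

# Hypothetical two-level hash of precompiled patterns. It mimics the
# $RE{...}{...} lookup style only; the patterns are deliberately simple.
my %MY_RE = (
    URI => {
        HTTP => qr{http://[a-z0-9.-]+(?:/\S*)?}i,
        FTP  => qr{ftp://[a-z0-9.-]+(?:/\S*)?}i,
    },
    num => {
        int => qr/[+-]?\d+/,
    },
);

# Hypothetical sample input
my $line = 'mirror at ftp://ftp.example.org/pub/';

# First key selects the category, second key selects the pattern
my $is_http = $line =~ /$MY_RE{URI}{HTTP}/ ? 1 : 0;
my $is_ftp  = $line =~ /$MY_RE{URI}{FTP}/  ? 1 : 0;

print "http: $is_http, ftp: $is_ftp\n";
```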