正则表达式匹配可以更快更简单 (but is slow in Java, Perl, PHP, Python, Ruby, ...) (6)

可数次重复:很多正则表达式提供一种可数次重复的操作符{n},可以让字符串重复n次。{n,m}可以重复n到m次,{n,}可以重复n次及以上。递归回溯算法可以通过一个循环来实现这个操作符,而NFA或基于DFA的算法就需要显示地扩展表达式后,才能进行下一次处理,比如e{3}需要先扩展为eee,e{3,5}扩展为eeee?e?,e{3,}扩展为eee+。

提取子串:正则表达式可以用来划分字符串,比如使用([0-9]+-[0-9]+-[0-9]+) ([0-9]+:[0-9]+)来匹配日期时间的字符串(date time),很多正则表达式引擎可以提取匹配的子串,例如Perl:

if(/([0-9]+-[0-9]+-[0-9]+) ([0-9]+:[0-9]+)/){ print "date: $1, time: $2\n"; } 如何区分子串的边界这个问题,被大多数理论学家所忽视。区分子串边界是最适合使用递归回溯算法来解决的问题。然而,Thompson算法也能在不放弃性能的情况下解决问题。Unix regexp(3)库的第八个版本使用算法就是Thompson算法,然而知道它的人很少。

中间匹配(Unanchored matches):我们在文章中所谈的正则表达式都从最左边开始向右进行匹配,而我们有时需要在字符串中间进行匹配,这可以理解为是提取子串的特殊情况。比如我们要中间匹配e,那么就相当于匹配.*(e).*,即字符串中间只要出现e的样式,就算匹配成功。(hellovello可以匹配.*(love).*)

非贪婪匹配:传统的正则表达式匹配的都是最长字符串,采用的都是贪婪匹配的方式。Perl语言引入了新的操作符?? *? +?来进行最短字符串匹配。当用(.+?)(.+?)匹配字符串abcd时,第一个(.+?)仅匹配a,第二个匹配bcd(而传统的贪婪匹配策略第一个匹配abc,第二个匹配d)。由定义可以看出,非贪婪匹配并不影响整个字符串的匹配,而是影响匹配的过程。回溯算法可以先匹配短的,再匹配长的,来实现最短匹配。而使用Thompson算法也可以解决这个问题。

总结 正则表达式表达式本可以更简单,更快的完成匹配,而基于有限自动机的算法就能够胜任这项工作,这项技术已然存在几十年,然而即使到了现在,诸如Perl, PCRE, Python, Ruby, Java等编程语言仍然使用的是基于递归的回溯算法,这种算法尽管便于理解、容易编写,但是运行速度异常缓慢。除了backreferences这样的操作符,其他可以使用回溯算法解决的问题,都可以使用基于有限自动机的算法解决。 History and References

Michael Rabin and Dana Scott introduced non-deterministic finite automata and the concept of non-determinism in 1959 [7], showing that NFAs can be simulated by (potentially much larger) DFAs in which each DFA state corresponds to a set of NFA states. (They won the Turing Award in 1976 for the introduction of the concept of non-determinism in that paper.)

R. McNaughton and H. Yamada [4] and Ken Thompson [9] are commonly credited with giving the first constructions to convert regular expressions into NFAs, even though neither paper mentions the then-nascent concept of an NFA. McNaughton and Yamada's construction creates a DFA, and Thompson's construction creates IBM 7094 machine code, but reading between the lines one can see latent NFA constructions underlying both. Regular expression to NFA constructions differ only in how they encode the choices that the NFA must make. The approach used above, mimicking Thompson, encodes the choices with explicit choice nodes (the Split nodes above) and unlabeled arrows. An alternative approach, the one most commonly credited to McNaughton and Yamada, is to avoid unlabeled arrows, instead allowing NFA states to have multiple outgoing arrows with the same label. McIlroy [3] gives a particularly elegant implementation of this approach in Haskell.

Thompson's regular expression implementation was for his QED editor running on the CTSS [10] operating system on the IBM 7094. A copy of the editor can be found in archived CTSS sources [5]. L. Peter Deutsch and Butler Lampson [1] developed the first QED, but Thompson's reimplementation was the first to use regular expressions. Dennis Ritchie, author of yet another QED implementation, has documented the early history of the QED editor [8] (Thompson, Ritchie, and Lampson later won Turing awards for work unrelated to QED or finite automata.)

Thompson's paper marked the beginning of a long line of regular expression implementations. Thompson chose not to use his algorithm when implementing the text editor ed, which appeared in First Edition Unix (1971), or in its descendant grep, which first appeared in the Fourth Edition (1973). Instead, these venerable Unix tools used recursive backtracking! Backtracking was justifiable because the regular expression syntax was quite limited: it omitted grouping parentheses and the |, ?, and + operators. Al Aho's egrep, which first appeared in the Seventh Edition (1979), was the first Unix tool to provide the full regular expression syntax, using a precomputed DFA. By the Eighth Edition (1985), egrep computed the DFA on the fly, like the implementation given above.

内容版权声明:除非注明,否则皆为本站原创文章。

转载注明出处:https://www.heiqu.com/wsszjg.html