Submitted by: Jordan Sissel
This revision by: Jordan Sissel
Date: Sun Sep 16 18:54:20 -0400 2007
Regular expressions are very flexible, but Perl went a step further by allowing you to inject code into the execution of a regular expression. This is extremely powerful in that it lets you extend the functionality of the regexp system without having to write your own.
In regex, the problem of doing ‘and’, submatches on groups, and general assertions is difficult. For example:
matching a URL with foo in it. It is easier to specify “match a URL with foo in it” by crafting a regex that matches any URL and specifying that it also matches ‘foo’ than by crafting a regex that matches any URL and injecting the necessary regexes which will match foo anywhere inside. Sometimes, look-ahead/behind assertions can be used, but not always.
A way to specify “and” in regex is useful, but absent in all implementations other than perl. That is, =~ /(regex_matching_a_url)(?some_assertion_that_tests_$1)/
This change affects the regular expression syntax by adding a new pattern:
(?{ code })
where ‘code’ is a ruby code block (alternatively, simply a function to be called). ‘perldoc perlre’ explains exactly what (?{ code }) does:
"(?{ code })"
This zero-width assertion evaluates any embedded Perl code. It
always succeeds, and its "code" is not interpolated.
I would like to modify the constraints here so that the assertion can optionally fail: if the return value of ‘code’ is false, the match fails at that point. This gives you great control over pattern matching.
No other changes to Ruby are necessary for this change other than adding this into the regular expression engine.
—-
An example of this might be an assertion that matches a number that is > 10.
Match a number: /[0-9]+/
Assert it is greater than 10: x > 10
And what if your assertion, “> 10”, comes from user input (a function argument, for example)? Do you provide a translation function which turns “> 10” into a regex that matches a number only if it is > N, where N is the input, or would it be easier to match a number and call a function that verifies the assertion and affects the outcome of the regexp?
s = "1 3 5 7 9 11"
I vote for the latter, since it’s more readable and simpler to write.
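For comparison, the "match a number, then call a function" idea can be approximated in today's Ruby without the proposed syntax. This is only a sketch: `first_match_where` and `parse_assertion` are hypothetical helper names, not part of Ruby or of the patch, and the helper retries at later positions in the string rather than backtracking inside the engine, so it is weaker than what this RCR proposes.

```ruby
# Hypothetical helper (not part of Ruby): return the first MatchData for
# which the block returns a truthy value, retrying further along the string.
def first_match_where(str, re)
  pos = 0
  while (md = re.match(str, pos))
    return md if yield(md)
    # Resume after this match; step one character past an empty match.
    pos = md.end(0) > md.begin(0) ? md.end(0) : md.end(0) + 1
  end
  nil
end

# Hypothetical helper: turn user input such as "> 10" into a predicate
# instead of translating it into a regex. Only a few operators are allowed.
def parse_assertion(text)
  allowed = %w[> >= < <= ==]
  op, n = text.split
  raise ArgumentError, "unsupported operator: #{op.inspect}" unless allowed.include?(op)
  limit = Integer(n)
  ->(value) { value.public_send(op, limit) }
end

s = "1 3 5 7 9 11"
check = parse_assertion("> 10")
md = first_match_where(s, /[0-9]+/) { |m| check.call(m[0].to_i) }
puts md[0]  # prints 11
```

The predicate stays ordinary Ruby, so changing the condition does not require generating a new regex.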
This change is only truly beneficial when implemented in the regex engine itself, so you can take advantage of the natural backtracking that the regex execution does when it hits a failure. It makes for more readable and more powerful regular expressions when you need to do some more advanced matching.
I use this specific kind of advanced regular expressions in grok, a pattern matching tool I wrote in perl. It is extremely useful to have.
I have patches that mostly put this change into ruby1.8.6 with oniguruma. Here’s a sample ruby invocation of (?{ code })
def check_private_network(ip)
  # Dots are escaped so that "172.16." is matched literally
  priv_re = /^(192\.168\.|10\.|172\.16\.)/
  ret = false
  puts "Checking #{ip}"
  result = (ip =~ priv_re)
  if result
    ret = true
  end
  return ret
end
# Single quotes keep the backslash, so \. reaches the regexp engine
ip_re = '((?:[0-9]{1,3}\.){3}(?:[0-9]{1,3}))'
fun_re = Regexp.new("(#{ip_re})(?{ check_private_network($g[0])})");
ips.each { |x|
  y = fun_re.match(x)
  if y
    puts "#{y} is a private net"
  end
}
Another example:
mystr = "1.2.3.4 192.168.0.1"
ip = fun_re.match(mystr)
Sorry for being late to the discussion about this extension. About a year ago we started a discussion in the German Ruby forum about this kind of extension. This is my personal opinion about this construct in Ruby. Unfortunately (due to personal constraints) I can't spend much effort in 2007 on prototype implementations or the like. So here are my remarks:

1) A more general remark about this extension: Oniguruma is a product independent of Ruby which is used by Ruby. It has an independent release strategy and is used by other products too (see http://www.geocities.jp/kosako3/oniguruma/). This RCR has at least the same effects on Oniguruma as on Ruby, so should be coordinated with the development of Oniguruma to avoid unnecessary parallel developments.

2) If the implementation of this RCR is carried out in cooperation with the Oniguruma development, then it must be taken into account that Oniguruma is also used in other languages. This means, that Oniguruma itself must be able to parse the body of "(?{...})" construct for each language. A consequence is, that the code inside the construct must be syntactically as simple as possible, and should not contain features which are special to the programming paradigm (i.e. OO-specific things in the case of Ruby). Pure functions (Proc objects in the case of Ruby) may be a good compromise.

3) I support the planned change (Perl's original construct's return code will not be used by the pattern matching engine), that the return code (success/failure expressed by true/false in case of Ruby) will be evaluated by Oniguruma. This allows simple expression of complex matching, which is related to several side effects.

A simple example from our discussion one year ago is:

def checkdiv(intstring, md) # 'md' is not used in this example
  if ((intstring.to_i)%2)==0
    '>>' + intstring + '<<'
  else
    false
  end
end

if (md = 'we are 5, 3, or 2 persons'.match(/(?{checkdiv(\d+)})/))
  puts "'we are 5, 3, or 2 persons' is successful for '#{md[0]}'"
end
unless (md = 'we are 7, 5, or 3 persons'.match(/(?{checkdiv(\d+)})/))
  puts "no successful match for 'we are 7, 5, or 3 persons'"
end

(Remark) There was a different syntax used at this time, but I hope it is understandable.

More complex examples are when a match of a text will succeed only if a second match on different data succeeds too, e.g. a participant of a trip will be accepted in the list only if he is also in the list of persons who have already paid for it.

One must be aware, that the use of this construct may consume a lot of time, but one may not use it, if this will be a problem.

----------

Because I often use Ruby via "irb" to analyse textual data, this may be very useful, since it allows compact formulation. I will strongly support it, but I think the implementation must be done together with the Oniguruma development.

Best regards,
Wolfgang Nádasi-Donner alias WoNáDo
> I am not sure what we will gain by this. As of now we can execute code
> per match already (block form of String#gsub and String#scan). For
> example, the "URL with 'foo' issue" can be done cleanly in any of
> these ways:
>
> # two passes
> urls = text.scan(%r{http://\S+}).select {|m| /foo/ =~ m}
>

String#scan doesn't return MatchData objects (only captures or the match string), so we lose a great deal of context even on simple regexps:

re = /(?<one>\w+) \d+ (?<two>\w+)/
str = "test 1234 test foo 456 bar"
str.scan(re).each { |x| puts x.join(" ") }

Output is:
test test
foo bar

We lose the MatchData info, which would tell me what part of the string was matched. Am I making sense?

Further testing reveals that $~ is set in String#scan, but only is set to the last match found, so we cannot rely on it for each match iteration.

str.scan(re).each { |x| puts $~["one"] }

output:
foo
foo

I expected the output to be 'test' and 'foo', not 'foo' and 'foo'. I'm guessing that String#scan loops over the whole string for matches, so the last match ($~) would indeed be 'foo'. I need the MatchData for each iteration for this to be really feasible. If String#scan gave MatchData instances, we might be closer. Beyond that, we lose backtracking that is built into the regexp engine. There are probably cases where String#scan will behave differently than if we were causing backtracking by inducing failures at runtime (aka, named capture 'two' doesn't match 'bar', so we'll "fail" and backtrack).

One such case is this one:

re = /(?<one>\w+) .*? (?<two>\w+)/
str = "test 1234 test foo 456 bar"
str.scan(re).each { |x| puts x.join(" ") }

Output is:
test test
foo bar

The '.*?' makes me think the following are possible matches:
test 1234 test (which might yield 'test test' from this code)
test 1234 test foo (which might yield 'test foo' from this code)
test 1234 test foo 456 bar (which might yield 'test bar' from this code)
... etc ...

I would expect, if I wanted to match /\w+ .*? \w+/ where the 2nd \w+ also matched 'bar', that the resulting match would be the entire string, since 'test 1234 test foo 456 bar' matches both /\w+ .*? \w+/ and /\w+ .* bar/. I might be causing confusion. My perl implementation of this behaves in the way I would expect:

% echo "test 2345 foo 234 bar" | perl foo.pl
test bar

> Is there maybe a better example that demonstrates what can be done
> with the new feature and cannot be done with current state of affairs?
>

Let me know if my above examples don't adequately explain why String#scan is not sufficient.

>
> My questions would be:
>
> - What do we gain?

It's hard for me to put into words: My perl text analyzer tool (grok) gains much from this particular feature. I can do crazy things like say match an IP and also require that IP match another pattern; or match a number and also require that number meet an additional condition such as "num > 10" and only numbers > 10 will be matched. Additionally, we can use this to call functions or callbacks when portions of a regexp are reached (useful for debugging your regexes, etc).

> - Does the feature influence matching speed of RX that do not make use of it?

Nope, and hurray for that! Basically, your regex gets compiled into a series of opcodes when you do Regexp.new(). If you don't use a given regex feature, then that feature isn't compiled into your executable opcode string and doesn't get used at match-time. My implementation adds some small code bits to Oniguruma and does not modify its fundamental design.

>
> - Also, I'd check the rationale why the Perl people choose to not let
> the code influence matching. Maybe there is a good reason to do that
> which applies to Ruby as well.

In fact, perl lets you go beyond influencing success and failure. There are two 'execute this code in my regex' features in perl:

(?{ code })
(??{ code })

In all cases, 'code' is executed any time it is reached in the regexp. The first one simply executes the perl code and operates otherwise as a zero-width always-positive assertion. The second one, (??{ code }), uses the return value of 'code' as an additional regular expression, which gets injected at runtime. Code, in all cases, is executed each time that particular portion of the regexp is visited. So each time (??{ code }) is executed, it has an additional opportunity to return a different regexp.

Pretty cool feature.

Looking at ruby's Regexp docs, it seems like /#{stuff}/ is evaluated each time a match is attempted, but I'm not sure that's the case:

% ruby -e '"abc".scan(/.#{puts "hi"}/) { |x| puts x }'
hi
a
b
c

'hi' is only output once. I could be doing something wrong, though, or my interpretation of the ruby pickaxe is wrong.
http://whytheluckystiff.net/ruby/pickaxe/html/language.html
Under 'Substitutions'.

>
> Kind regards
>
> robert
>

Thanks for your feedback :)

>
> PS: Why is this RCR twice in the DB?
>

I'm bad at clicking. Accidentally submitted twice, and I wasn't able to find a 'delete' feature to remove the dupe.
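On the $~ observation in the message above: the difference between calling String#scan without a block and with one can be checked directly. This is a minimal sketch, assuming a Ruby with named captures (1.9 or later); the variable names `after` and `during` are only for illustration.

```ruby
re  = /(?<one>\w+) \d+ (?<two>\w+)/
str = "test 1234 test foo 456 bar"

# Without a block, scan finishes before the array is iterated,
# so $~ describes the last match only.
after = str.scan(re).map { $~[:one] }

# With a block, $~ (Regexp.last_match) is updated before each call.
during = []
str.scan(re) { during << Regexp.last_match[:one] }

p after   # ["foo", "foo"]
p during  # ["test", "foo"]
```

This gives a MatchData per match, but it does not address the backtracking argument: the block still runs only after the engine has committed to each match.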
I have patches which implement this feature. They are slightly buggy, and I only consider it a prototype, but it does work. The current implementation I have is about 120 lines of additions, so it's pretty simple.

> This RCR has at least the same
> effects on Oniguruma as on Ruby, so should be coordinated with the development
> of Oniguruma to avoid unnecessary parallel developments.

Agreed. A general solution would better serve everyone.

> used in other languages. This means, that Oniguruma itself must be able to parse
> the body of "(?{...})" construct for each language. A consequence is, that
> the code inside the construct must be syntactically as simple as possible, and

I'm on the fence about this one. I would prefer that oniguruma had this feature and had a way for a consumer to register a 'code parser' callback and a 'code executer' callback which oniguruma would use at compiletime and execution time, respectively. If no 'code parser' was registered, then any attempt to use (?{ code }) could simply throw a compilation error.

> 3) I support the planned change (Perl's original construct's return code will
> not be used by the pattern matching engine), that the return code
> (success/failure expressed by true/false in case of Ruby) will be evaluated by
> Oniguruma.

> One must be aware, that the use of this construct may consume a lot of time, but
> one may not use it, if this will be a problem.

Probably, but the implementation itself should be no slower than if you were running the code outside of the regexp. This should be trivial to ensure.

> Due to the fact that I often use Ruby via "irb" to analyse textual data, this
> may be a very useful because it allows compact formulation. I will strongly
> support it, but I think the implementation must be done together with the
> Oniguruma development.

This does indeed allow you to write very powerful code in a very concise way.

I am happy to work with the oniguruma folks to make this generally applicable to any tool wanting to use the oniguruma library.
Hi,

In message "Re: [RCR] Add perl-like regexp (?{ code }) to ruby."
on Sun, 23 Sep 2007 05:07:31 -0400, psionic@csh.rit.edu writes:

|I would prefer that oniguruma had this feature and had a way for a
|consumer to register a 'code parser' callback and a 'code executer'
|callback which oniguruma would use at compiletime and execution time,
|respectively. If no 'code parser' was registered, then any attempt to
|use (?{ code }) could simply throw a compilation error.

Interesting idea, but since Oniguruma is not under our control, I hardly see it's going to happen.

matz.
matz@ruby-lang.org wrote:

> Interesting idea, but since Oniguruma is not under our control, I
> hardly see it's going to happen.

Last year I sent some mails to K. Kosako about the Perl-like extensions. I will ask him by mail to have a look at the RCR and to tell us his opinion on it. I believe this is the easiest way to see if there is any interest in it.

Wolfgang Nádasi-Donner
Hi,

At Sun, 23 Sep 2007 11:39:15 -0400,
matz@ruby-lang.org wrote:

> |I would prefer that oniguruma had this feature and had a way for a
> |consumer to register a 'code parser' callback and a 'code executer'
> |callback which oniguruma would use at compiletime and execution time,
> |respectively. If no 'code parser' was registered, then any attempt to
> |use (?{ code }) could simply throw a compilation error.
>
> Interesting idea, but since Oniguruma is not under our control, I
> hardly see it's going to happen.

I don't think a 'code parser' callback is necessary. For instance, the parser could replace each code block with a particular argument, i.e., an index number, and the regexp engine would call back an 'executer' with those.
Hi,

In message "Re: [RCR] Add perl-like regexp (?{ code }) to ruby."
on Sun, 23 Sep 2007 23:23:49 -0400, nobu@ruby-lang.org writes:

|I don't think a 'code parser' callback is necessary. For
|instance, the parser could replace each code block with a
|particular argument, i.e., an index number, and the regexp engine
|would call back an 'executer' with those.

I'm afraid that your approach makes the engine less versatile, since we need to "preprocess" the regular expression as we did for string interpolation. It is relatively easy for us (we already have a similar mechanism), but would be quite difficult for other languages, say PHP. Any decision would be up to the maintainer anyway.

matz.
On 9/23/07, nobu@ruby-lang.org <nobu@ruby-lang.org> wrote:
> At Sun, 23 Sep 2007 11:39:15 -0400,
> matz@ruby-lang.org wrote:
> > |I would prefer that oniguruma had this feature and had a way for a
> > |consumer to register a 'code parser' callback and a 'code executer'
> > |callback which oniguruma would use at compiletime and execution time,
> > |respectively. If no 'code parser' was registered, then any attempt to
> > |use (?{ code }) could simply throw a compilation error.
> > Interesting idea, but since Oniguruma is not under our control, I
> > hardly see it's going to happen.
> I don't think a 'code parser' callback is necessary. For
> instance, the parser could replace each code block with a
> particular argument, i.e., an index number, and the regexp engine
> would call back an 'executer' with those.

It needs to know something about the supported language's syntax so that it doesn't close the code block too soon:

%r{(?{ abc(def {|g| h(g)})})}

-austin
Hi,

At Mon, 24 Sep 2007 07:46:43 -0400,
halostatue@gmail.com wrote:

> > I don't think a 'code parser' callback is necessary. For
> > instance, the parser could replace each code block with a
> > particular argument, i.e., an index number, and the regexp engine
> > would call back an 'executer' with those.
>
> It needs to know something about the supported language's syntax so that it
> doesn't close the code block too soon:
>
> %r{(?{ abc(def {|g| h(g)})})}

I meant that the parser splits it into

%r{(?{#1})}

and

proc {abc(def {|g| h(g)})}

The proc could be supplied to Oniguruma directly, or just an index into another common callback argument.
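The split described above can be illustrated with a small preprocessor. This is a sketch only, not how Ruby's parser or Oniguruma works: `split_code_blocks` is a hypothetical name, and it balances braces by naive counting, so a brace inside a string literal or comment in the embedded code would confuse it. That limitation is exactly the "needs to know the language's syntax" concern raised earlier in the thread.

```ruby
# Hypothetical preprocessor: replace each (?{ code }) with (?{#N}) and
# return the extracted code strings alongside the rewritten pattern.
def split_code_blocks(source)
  out = +""
  codes = []
  i = 0
  while i < source.length
    if source[i, 3] == "(?{"
      depth = 1
      j = i + 3
      # Walk forward until the opening brace of (?{ is balanced.
      while j < source.length && depth > 0
        depth += 1 if source[j] == "{"
        depth -= 1 if source[j] == "}"
        j += 1
      end
      unless depth.zero? && source[j] == ")"
        raise ArgumentError, "unterminated (?{ ... }) block"
      end
      codes << source[(i + 3)...(j - 1)].strip
      out << "(?{#" << codes.length.to_s << "})"
      i = j + 1
    else
      out << source[i]
      i += 1
    end
  end
  [out, codes]
end

pattern, codes = split_code_blocks('(\d+)(?{ abc(def {|g| h(g)})})')
puts pattern      # (\d+)(?{#1})
puts codes.first  # abc(def {|g| h(g)})
```

Each extracted string could then be turned into a proc by the host language and looked up by index when the engine reaches the placeholder.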
2007/9/23, psionic@csh.rit.edu <psionic@csh.rit.edu>:
> We lose the MatchData info, which would tell me what part of the string
> was matched. Am I making sense?

Yes. But you are not completely right - although the block does not receive a MatchData instance, you can nevertheless obtain the information:

irb(main):005:0> s.scan(/\d+/) {|m| print $`, "<", m, ">", $', "\n"}
foo <123> bar 456 baz
foo 123 bar <456> baz
=> "foo 123 bar 456 baz"

(see also further below)

> Further testing reveals that $~ is set in String#scan, but only is set
> to the last match found, so we cannot rely on it for each match
> iteration.

That's wrong:

irb(main):008:0> s.scan(/\d+/) {|m| print $`, "<", $~, ">", $', "\n"}
foo <123> bar 456 baz
foo 123 bar <456> baz
=> "foo 123 bar 456 baz"
irb(main):010:0> s.scan(/\d+/) {|m| print $`, "<", $~.class, ">", $', "\n"}
foo <MatchData> bar 456 baz
foo 123 bar <MatchData> baz
=> "foo 123 bar 456 baz"

> If String#scan gave MatchData instances, we might be closer.

You can access a MatchData instance via $~ (see above). Note also that $~ and the like are thread safe so there is no problem using these globals. You need only be careful when doing nested matches.

> Beyond
> that, we lose backtracking that is built into the regexp engine. There
> are probably cases where String#scan will behave differently than if we
> were causing backtracking by inducing failures at runtime (aka, named
> capture 'two' doesn't match 'bar', so we'll "fail" and backtrack).

This particular example can be better solved with negative lookahead or selection afterwards (see my first comment).

> > Is there maybe a better example that demonstrates what can be done
> > with the new feature and cannot be done with current state of affairs?
>
> Let me know if my above examples don't adequately explain why
> String#scan is not sufficient.

Frankly, I still feel that I haven't seen something that cannot be done with current state of the Ruby art (including 1.9 lookarounds). The fact that your tool has benefited from this feature does not mean that there are no other ways or even simpler ways to implement it. :-)

> > My questions would be:
> >
> > - What do we gain?
>
> It's hard for me to put into words: My perl text analyzer tool (grok)
> gains much from this particular feature. I can do crazy things like say
> match an IP and also require that IP match another pattern;

That can be done differently - even with 1.8.x (see my first comment).

> or match a
> number and also require that number meet an additional condition such as
> "num > 10" and only numbers > 10 will be matched.

Same here.

irb(main):023:0> s.to_enum(:scan, /\d+/).each {|m| p m if m.to_i > 200}
"456"
=> "foo 123 bar 456 baz"

> Additionally, we can
> use this to call functions or callbacks when portions of a regexp are
> reached (useful for debugging your regexes, etc).

That sounds interesting. Can you show a more concrete example? Still I am not sure whether better debugging alone is justification enough to introduce such a complex new feature - especially given all the dependencies and problems with using a third party lib (see other comments).

> > - Does the feature influence matching speed of RX that do not make use of it?
>
> Nope, and hurray for that! Basically, your regex gets compiled into a
> series of opcodes when you do Regexp.new(). If you don't use a given
> regex feature, then that feature isn't compiled into your executable
> opcode string and doesn't get used at match-time.

Good.

> > - Also, I'd check the rationale why the Perl people choose to not let
> > the code influence matching. Maybe there is a good reason to do that
> > which applies to Ruby as well.
>
> In fact, perl lets you go beyond influencing success and failure.

I thought you said that in Perl the result of the expression would not influence matching. But I do see now this is irrelevant. Thanks for the update!

> Pretty cool feature.

Indeed. But I'd prefer usefulness, usability and efficiency over coolness. :-)

> Looking at ruby's Regexp docs, it seems like /#{stuff}/ is evaluated
> each time a match is attempted, but I'm not sure that's the case:

It is evaluated once per invocation. Since #scan is a method invocation it is evaluated only once in your example:

> % ruby -e '"abc".scan(/.#{puts "hi"}/) { |x| puts x }'
> hi
> a
> b
> c
>
> 'hi' is only output once. I could be doing something wrong, though, or
> my interpretation of the ruby pickaxe is wrong.

The latter.

> Thanks for your feedback :)

You're welcome.

Cheers

robert
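The "once per invocation" behaviour of interpolated regex literals can be checked with a counter. A minimal sketch; the `calls` counter and the result variable names are only for illustration, and the `/o` case is included for contrast.

```ruby
# One scan call with three matches: the interpolation runs once,
# when the literal is evaluated to build the argument.
calls = 0
"abc".scan(/.#{calls += 1; ""}/) { |x| puts x }
per_scan = calls

# The literal is rebuilt every time the expression itself is evaluated.
calls = 0
3.times { "abc" =~ /.#{calls += 1; ""}/ }
per_evaluation = calls

# With the /o flag the interpolation happens only the first time.
calls = 0
3.times { "abc" =~ /.#{calls += 1; ""}/o }
with_o_flag = calls

p [per_scan, per_evaluation, with_o_flag]  # [1, 3, 1]
```

So interpolation is a way to build a pattern before matching starts, not a hook that runs during matching, which is what (?{ code }) and (??{ code }) provide in Perl.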
> Frankly, I still feel that I haven't seen something that cannot be
> done with current state of the Ruby art (including 1.9 lookarounds).

The current look-behind feature has the disadvantage that the pattern must have a fixed length.

I agree that everything is possible without any change - we have access to a complete MatchData object inside a scan block, and there are no problems accessing the complete String object too - but it is much more compact and natural with a callback inside the pattern (including access to the partial MatchData and use of the result as success/failure inside the regexp engine).

This may not be of any interest in larger programs which have to be maintained, but it is helpful for one-liners, few-liners and interactive text processing (maybe using 'irb'), which are usually "use and throw away" programs. These kinds of programs are needed very often. Ruby programs replace Perl programs more and more for short textual processing tasks.

Wolfgang Nádasi-Donner
Copyright © 2006, Ruby Power and Light, LLC
Comments
from Robert Klemme, Tue Sep 18 09:09:31 -0400 2007