Add perl-like regexp (?{ code }) to ruby. (#16)

Submitted by: Jordan Sissel

This revision by: Jordan Sissel

Date: Sun Sep 16 18:54:20 -0400 2007


ABSTRACT

Regular expressions are very flexible, but perl went a step further allowing you to inject code into the execution of a regular expression. This is extremely powerful in that it lets you extend functionality of the regexp system without having to write your own.

PROBLEM

In regex, the problem of doing ‘and’, submatches on groups, and general assertions is difficult. For example:

Matching a url with foo in it. It is easier to specify “match a url with foo in it” by crafting a regex that matches any url and specifying that it also matches ‘foo’ than by crafting a regex that matches any url and injecting into it the necessary regexes which will match foo anywhere inside. Sometimes, look-ahead/behind assertions can be used, but not always.

A way to specify “and” in regex is useful, but absent in all implementations other than perl. That is, =~ /(regex_matching_a_url)(?some_assertion_that_tests_$1)/

PROPOSAL

This change affects the regular expression syntax by adding a new pattern:

(?{ code })

where ‘code’ is a ruby code block (alternatively, simply a function to be called). ‘perldoc perlre’ explains exactly what (?{ code }) does in perl:

"(?{ code })" 
          This zero-width assertion evaluates any embedded Perl code.  It
          always succeeds, and its "code" is not interpolated.

I would like to modify the constraints here and add that ‘code’ can optionally fail if the return value from said code is false. This gives you great control over pattern matching.

No other changes to Ruby are necessary for this change other than adding this into the regular expression engine.

—-

An example of this might be an assertion that matches a number that is greater than 10.

Match a number: /[0-9]+/
Assert it is greater than 10: x > 10

And what if your assertion, “> 10”, comes from user input (a function argument, for example)? Do you provide a translation function which turns “> 10” into a regex that matches only numbers greater than N, where N is the input, or would it be easier to match a number and call a function to verify the assertion and affect the outcome of the regexp?

s = "1 3 5 7 9 11"

I vote for the latter, since it’s more readable and simpler to write.
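
For comparison, here is a minimal sketch of how the same check has to be written in Ruby today, with the assertion applied outside the pattern. The helper name parse_assertion and the small set of supported operators are assumptions made for illustration; they are not part of the proposal.

```ruby
s = "1 3 5 7 9 11"

# Hypothetical helper: turn user input such as "> 10" into a predicate.
def parse_assertion(text)
  op, n = text.split
  limit = Integer(n)
  case op
  when ">" then ->(x) { x > limit }
  when "<" then ->(x) { x < limit }
  else raise ArgumentError, "unsupported operator: #{op}"
  end
end

check = parse_assertion("> 10")

# Match every number first, then filter with the predicate afterwards.
big = s.scan(/[0-9]+/).select { |num| check.call(num.to_i) }
p big
```

The predicate runs only after the regex engine has finished each match, so a failed check cannot make the engine backtrack and try a different match, which is the part the proposal aims to add.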

ANALYSIS

This change is only truly beneficial when implemented in the regex engine itself so you can take advantage of the natural backtracking that the regex execution does when it hits a failure. It makes for more readable and more powerful regular expressions when you need to do some more advanced matching.

I use this specific kind of advanced regular expressions in grok, a pattern matching tool I wrote in perl. It is extremely useful to have.

IMPLEMENTATION

I have patches that mostly put this change into ruby1.8.6 with oniguruma. Here’s a sample ruby invocation of (?{ code })

def check_private_network(ip)
  priv_re = /^(192\.168\.|10\.|172\.16\.)/
  ret = false
  puts "Checking #{ip}" 
  result = (ip =~ priv_re)
  if result
    ret = true
  end
  return ret
end
# single quotes keep the backslash in \. so the dot stays literal
ip_re = '((?:[0-9]{1,3}\.){3}(?:[0-9]{1,3}))'
fun_re = Regexp.new("(#{ip_re})(?{ check_private_network($g[0])})")
# this set matches correctly, identifies ips[0] and ips[2] as private
ips = ["192.168.0.1", "1.2.3.4", "10.8.3.44", "73.55.244.2"]
ips.each { |x|
  y = fun_re.match(x)
  if y
    puts "#{y} is a private net" 
  end
}
# output should be:
# Checking 192.168.0.1
# 192.168.0.1 is a private net
# Checking 1.2.3.4
# Checking 10.8.3.44
# 10.8.3.44 is a private net
# Checking 73.55.244.2
Another example:
mystr = "1.2.3.4 192.168.0.1"
ip = fun_re.match(mystr)
# ip should be '192.168.0.1' because the assertion will fail to match
# 1.2.3.4 as a 'private' ip address, causing the regexp engine to skip that ip.
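
The (?{ code }) syntax above only works with the patch applied. As a rough sketch of the intended result in unpatched Ruby, the check can be run after each match instead of inside the engine. The names PRIVATE_RE, IP_RE, private_network? and first_private_ip are made up for this illustration.

```ruby
# Prefixes treated as private here, mirroring the example above.
PRIVATE_RE = /\A(?:192\.168\.|10\.|172\.16\.)/
# Dotted-quad shape only; octet ranges are not validated.
IP_RE = /(?:[0-9]{1,3}\.){3}[0-9]{1,3}/

def private_network?(ip)
  PRIVATE_RE.match?(ip)
end

# Return the first IP-looking substring that passes the check, or nil.
def first_private_ip(text)
  text.scan(IP_RE).find { |ip| private_network?(ip) }
end

p first_private_ip("1.2.3.4 192.168.0.1")
p first_private_ip("1.2.3.4 73.55.244.2")
```

This is not equivalent to an in-engine assertion: scan only offers non-overlapping matches, so the check never causes the engine to retry the same position with a different match.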

Comments

from Robert Klemme, Tue Sep 18 09:09:31 -0400 2007

I am not sure what we will gain by this. As of now we can execute code
per match already (block form of String#gsub and String#scan). For
example, the "URL with 'foo' issue" can be done cleanly in any of
these ways:

# two passes
urls = text.scan(%r{http://\S+}).select {|m| /foo/ =~ m}

# one pass
urls = text.to_enum(:scan, %r{http://\S+}).select {|m| /foo/ =~ m}

Is there maybe a better example that demonstrates what can be done
with the new feature and cannot be done with current state of affairs?


My questions would be:

- What do we gain?

- Does the feature influence matching speed of RX that do not make use of it?

- If yes, how big is the impact?

- Also, I'd check the rationale why the Perl people choose to not let
the code influence matching.  Maybe there is a good reason to do that
which applies to Ruby as well.

Kind regards

robert


PS: Why is this RCR twice in the DB?

from Wolfgang Nádasi-Donner, Sat Sep 22 08:42:41 -0400 2007

Sorry for being late in the discussion about this extension. We started ~1 year 
ago a discussion in the German Ruby-Forum about these kinds of extensions.

This is my personal opinion about this construction in Ruby. Unfortunately (due 
to personal constraints) I can't spend much effort in 2007 on prototype 
implementations or the like. So here are my remarks:

1) A more general remark about this extension: Oniguruma is a product 
independent of Ruby which is used by Ruby. It has an independent release 
strategy and is used by other products too (see 
http://www.geocities.jp/kosako3/oniguruma/). This RCR has at least the same 
effects on Oniguruma as on Ruby, so should be coordinated with the development 
of Oniguruma to avoid unnecessary parallel developments.

2) If the implementation of this RCR is carried out in cooperation with the 
Oniguruma development, then it must be taken into account that Oniguruma is also 
used in other languages. This means, that Oniguruma itself must be able to parse 
  the body of "(?{...})" construct for each language.  A consequence is, that 
the code inside the construct must be syntactically as simple as possible, and 
should not contain features which are special for the programming paradigm (i.e. 
OO specific things in case of Ruby). Pure functions (Proc objects in case of 
Ruby) may be a good compromise (in case of Ruby).

3) I support the planned change (Perl's original construct's return code will 
not be used by the pattern matching engine), that the return code 
(success/failure expressed by true/false in case of Ruby) will be evaluated by 
Oniguruma. This allows simple expression of complex matching, which is related 
to several side effects. A simple example from our discussion one year ago is:

def checkdiv(intstring, md)
   # 'md' is not used in this example
   if ((intstring.to_i)%2)==0
     '>>' + intstring + '<<'
   else
     false
   end
end
if (md = 'we are 5, 3, or 2 persons'.match(/(?{checkdiv(\d+)})/))
   puts "'we are 5, 3, or 2 persons' is successful for '#{md[0]}'"
end
unless (md = 'we are 7, 5, or 3 persons'.match(/(?{checkdiv(\d+)})/))
   puts "no successful match for 'we are 7, 5, or 3 persons'"
end

(Remark) A different syntax was used at that time, but I hope it is 
understandable.

More complex examples are when a match of a text will succeed only, if a second 
match on different data succeeds too - e.g. a participant of a trip will be 
accepted in the list only, if he is in the list of persons who paid it already too.

One must be aware, that the use of this construct may consume a lot of time, but 
one may not use it, if this will be a problem.

----------

Due to the fact that I often use Ruby via "irb" to analyse textual data, this 
may be very useful because it allows compact formulation. I will strongly 
support it, but I think the implementation must be done together with the 
Oniguruma development.

Best regards, Wolfgang Nádasi-Donner alias WoNáDo

from Jordan Sissel, Sun Sep 23 04:36:36 -0400 2007


> I am not sure what we will gain by this. As of now we can execute code
> per match already (block form of String#gsub and String#scan). For
> example, the "URL with 'foo' issue" can be done cleanly in any of
> these ways:
> 
> # two passes
> urls = text.scan(%r{http://\S+}).select {|m| /foo/ =~ m}
> 

String#scan doesn't return MatchData objects (only captures or the
match string), so we lose a great deal of context even on simple
regexps:

  re = /(?<one>\w+) \d+ (?<two>\w+)/
  str = "test 1234 test foo 456 bar"
  str.scan(re).each { |x|
    puts x.join(" ")
  }

Output is:
  test test
  foo bar

We lose the MatchData info, which would tell me what part of the string
was matched. Am I making sense?

Further testing reveals that $~ is set in String#scan, but only is set
to the last match found, so we cannot rely on it for each match
iteration.

  str.scan(re).each { |x|
    puts $~["one"]
  } 

output:
  foo
  foo

I expected the output to be 'test' and 'foo', not 'foo' and 'foo'. I'm guessing
that String#scan loops over the whole string for matches, so the last match
($~) would indeed be 'foo'. I need the MatchData for each iteration for this to
be really feasible.
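
A small sketch that may explain the 'foo' and 'foo' output: without a block, scan runs to completion before each starts, so $~ already holds the final match. The block form of scan runs the block once per match instead. The array names below are only for illustration.

```ruby
re  = /(?<one>\w+) \d+ (?<two>\w+)/
str = "test 1234 test foo 456 bar"

# Non-block form: all matching is done before `each` runs,
# so $~ refers to the last match in every iteration.
after_scan = []
str.scan(re).each { |_captures| after_scan << $~["one"] }

# Block form: the block is called per match, with $~ set to that match.
during_scan = []
str.scan(re) { |_captures| during_scan << $~["one"] }

p after_scan   # ["foo", "foo"]
p during_scan  # ["test", "foo"]
```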

If String#scan gave MatchData instances, we might be closer. Beyond
that, we lose backtracking that is built into the regexp engine. There
are probably cases where String#scan will behave differently than if we
were causing backtracking by inducing failures at runtime (aka, named
capture 'two' doesn't match 'bar', so we'll "fail" and backtrack).

One such case is this:

  re = /(?<one>\w+) .*? (?<two>\w+)/
  str = "test 1234 test foo 456 bar"
  str.scan(re).each { |x|
    puts x.join(" ")
  }

Output is:
  test test
  foo bar

The '.*?' makes me think the following are possible matches:
  test 1234 test (which might yield 'test test' from this code)
  test 1234 test foo (which might yield 'test foo' from this code)
  test 1234 test foo 456 bar (which might yield 'test bar' from this code)
  ... etc ...

I would expect, if I wanted to match /\w+ .*? \w+/ where the 2nd \w+
also matched 'bar' that the resulting match would be the entire string,
since 'test 1234 test foo 456 bar' matches both /\w+ .*? \w+/ and /\w+ .*
bar/. 
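
When the extra condition is a fixed word, the backtracking described here can already be forced with a lookahead inside the second group. The sketch below uses that as a stand-in for an in-pattern assertion; it does not cover conditions that need arbitrary code, which is what the proposal targets.

```ruby
str = "test 1234 test foo 456 bar"

# The lookahead fails for 'test', 'foo' and '456', so the engine keeps
# extending the lazy '.*?' until the second word is 'bar'.
re = /(?<one>\w+) .*? (?<two>(?=bar\b)\w+)/
m = re.match(str)

puts m[0]     # test 1234 test foo 456 bar
puts m[:one]  # test
puts m[:two]  # bar
```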

I might be causing confusion. My perl implementation of this behaves in
the way I would expect:

% echo "test 2345 foo 234  bar" | perl foo.pl
test bar

> Is there maybe a better example that demonstrates what can be done
> with the new feature and cannot be done with current state of affairs?
> 

Let me know if my above examples don't adequately explain why
String#scan is not sufficient.

> 
> My questions would be:
> 
> - What do we gain?

It's hard for me to put into words: My perl text analyzer tool (grok)
gains much from this particular feature. I can do crazy things like say
match an IP and also require that IP match another pattern; or match a
number and also require that number meet an additional condition such as
"num > 10" and only numbers > 10 will be matched. Additionally, we can
use this to call functions or callbacks when portions of a regexp are
reached (useful for debugging your regexes, etc).

> - Does the feature influence matching speed of RX that do not make use of it?

Nope, and hurray for that! Basically, your regex gets compiled into a
series of opcodes when you do Regexp.new(). If you don't use a given
regex feature, then that feature isn't compiled into your executable
opcode string and doesn't get used at match-time.

My implementation adds some small code bits to Oniguruma and does not
modify its fundamental design.

> 
> - Also, I'd check the rationale why the Perl people choose to not let
> the code influence matching.  Maybe there is a good reason to do that
> which applies to Ruby as well.

In fact, perl lets you go beyond influencing success and failure. There
are two 'execute this code in my regex' features in perl:
  (?{ code })
  (??{ code })

In all cases, 'code' is executed any time it is reached in the regexp.

The first one simply executes the perl code and operates otherwise as a
zero-width always-positive assertion. The second one, (??{ code }), uses
the return value of 'code' as an additional regular expression, which
gets injected, at runtime. Code, in all cases, is executed each time
that particular portion of the regexp is visited. So each time (??{ code
}) is executed, it has an additional opportunity to return a different
regexp.

Pretty cool feature.

Looking at ruby's Regexp docs, it seems like /#{stuff}/ is evaluated each time a match is attempted, but I'm not sure that's the case:

 % ruby -e '"abc".scan(/.#{puts "hi"}/) { |x| puts x }'
 hi
 a
 b
 c

'hi' is only output once. I could be doing something wrong, though, or
my interpretation of the ruby pickaxe is wrong.

http://whytheluckystiff.net/ruby/pickaxe/html/language.html
Under 'Substitutions'.

> 
> Kind regards
> 
> robert
> 

Thanks for your feedback :)

> 
> PS: Why is this RCR twice in the DB?
> 

I'm bad at clicking. Accidentally submitted twice, and I wasn't able to
find a 'delete' feature to remove the dupe.

from Jordan Sissel, Sun Sep 23 05:07:30 -0400 2007


I have patches which implement this feature. They are slightly buggy,
and I only consider it a prototype, but it does work. The current
implementation I have is about 120 lines of additions, so it's pretty
simple.

> This RCR has at least the same 
> effects on Oniguruma as on Ruby, so should be coordinated with the development 
> of Oniguruma to avoid unnecessary parallel developments.

Agreed. A general solution would better serve everyone.

> used in other languages. This means, that Oniguruma itself must be able to parse 
>   the body of "(?{...})" construct for each language.  A consequence is, that 
> the code inside the construct must be syntactically as simple as possible, and 

I'm on the fence about this one.

I would prefer that oniguruma had this feature and had a way for a
consumer to register a 'code parser' callback and a 'code executer'
callback which oniguruma would use at compiletime and execution time,
respectively. If no 'code parser' was registered, then any attempt to
use (?{ code }) could simply throw a compilation error.
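
To make the idea concrete, here is a purely hypothetical model of such a registration scheme, written in Ruby. None of these class or method names exist in Oniguruma; the sketch only shows the two callbacks and the compile error when no parser is registered.

```ruby
# Hypothetical model only; this is not Oniguruma's API.
class CallbackRegistry
  class CompileError < StandardError; end

  def initialize
    @parser = nil
    @executer = nil
  end

  def register_code_parser(&block)
    @parser = block
  end

  def register_code_executer(&block)
    @executer = block
  end

  # "Compile time": hand each code body to the parser,
  # or fail if code is present and no parser was registered.
  def compile(code_bodies)
    if @parser.nil? && !code_bodies.empty?
      raise CompileError, "(?{ code }) used but no code parser registered"
    end
    code_bodies.map { |body| @parser.call(body) }
  end

  # "Match time": run one compiled unit; false or nil fails the assertion.
  def execute(compiled, matched_text)
    @executer.call(compiled, matched_text) ? true : false
  end
end

# Example consumer: code bodies are names looked up in a table of checks.
CHECKS = { "gt10" => ->(text) { text.to_i > 10 } }

registry = CallbackRegistry.new
registry.register_code_parser { |body| CHECKS.fetch(body.strip) }
registry.register_code_executer { |check, text| check.call(text) }

units = registry.compile([" gt10 "])
p registry.execute(units.first, "11")  # true
p registry.execute(units.first, "9")   # false
```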

> 3) I support the planned change (Perl's original construct's return code will 
> not be used by the pattern matching engine), that the return code 
> (success/failure expressed by true/false in case of Ruby) will be evaluated by 
> Oniguruma. 

> One must be aware, that the use of this construct may consume a lot of time, but 
> one may not use it, if this will be a problem.

Probably, but the implementation itself should be no slower than if you
were running the code outside of the regexp. This should be trivial to
ensure.

> Due to the fact that I often use Ruby via "irb" to analyse textual data, this 
> may be very useful because it allows compact formulation. I will strongly 
> support it, but I think the implementation must be done together with the 
> Oniguruma development.

This does indeed allow you to write very powerful code in a very concise
way. I am happy to work with the oniguruma folks to make this generally
applicable to any tool wanting to use the oniguruma library.

from Yukihiro Matsumoto, Sun Sep 23 11:39:09 -0400 2007

Hi,

In message "Re: [RCR] Add perl-like regexp (?{ code }) to ruby."
    on Sun, 23 Sep 2007 05:07:31 -0400, psionic@csh.rit.edu writes:

|I would prefer that oniguruma had this feature and had a way for a
|consumer to register a 'code parser' callback and a 'code executer'
|callback which oniguruma would use at compiletime and execution time,
|respectively. If no 'code parser' was registered, then any attempt to
|use (?{ code }) could simply throw a compilation error.

Interesting idea, but since Oniguruma is not under our control, I
hardly see it's going to happen.

                                                        matz.

from Wolfgang Nádasi-Donner, Sun Sep 23 15:24:43 -0400 2007

matz@ruby-lang.org schrieb:
> Interesting idea, but since Oniguruma is not under our control, I
> hardly see it's going to happen.

Last year I sent some mails to K. Kosako about the Perl like extensions. I will 
ask him via mail to have a look at the RCR, and to tell us his opinion on it.

I believe this is the easiest way to see, if there is any interest in it.

Wolfgang Nádasi-Donner

from Nobuyoshi Nakada, Sun Sep 23 23:23:36 -0400 2007

Hi,

At Sun, 23 Sep 2007 11:39:15 -0400,
matz@ruby-lang.org wrote:
> |I would prefer that oniguruma had this feature and had a way for a
> |consumer to register a 'code parser' callback and a 'code executer'
> |callback which oniguruma would use at compiletime and execution time,
> |respectively. If no 'code parser' was registered, then any attempt to
> |use (?{ code }) could simply throw a compilation error.
> 
> Interesting idea, but since Oniguruma is not under our control, I
> hardly see it's going to happen.

I don't think a 'code parser' callback is necessary. For
instance, the parser could replace the code blocks with each
particular argument, i.e., index numbers, and the regexp engine
would call back an 'executer' with those.


from Yukihiro Matsumoto, Mon Sep 24 02:56:06 -0400 2007

Hi,

In message "Re: [RCR] Add perl-like regexp (?{ code }) to ruby."
    on Sun, 23 Sep 2007 23:23:49 -0400, nobu@ruby-lang.org writes:

|I don't think a 'code parser' callback is necessary. For
|instance, the parser could replace the code blocks with each
|particular argument, i.e., index numbers, and the regexp engine
|would call back an 'executer' with those.

I'm afraid that your approach makes the engine less versatile, since
we need to "preprocess" the regular expression as we did for string
interpolation.  It is relatively easy for us (we already have similar
mechanism), but would be quite difficult for other languages, say PHP.
All decision would be up to the maintainer anyway.

                                                        matz.

from Austin Ziegler, Mon Sep 24 07:46:33 -0400 2007

On 9/23/07, nobu@ruby-lang.org <nobu@ruby-lang.org> wrote:
> At Sun, 23 Sep 2007 11:39:15 -0400,
> matz@ruby-lang.org wrote:
> > |I would prefer that oniguruma had this feature and had a way for a
> > |consumer to register a 'code parser' callback and a 'code executer'
> > |callback which oniguruma would use at compiletime and execution time,
> > |respectively. If no 'code parser' was registered, then any attempt to
> > |use (?{ code }) could simply throw a compilation error.
> > Interesting idea, but since Oniguruma is not under our control, I
> > hardly see it's going to happen.
> I don't think a 'code parser' callback is necessary. For
> instance, the parser could replace the code blocks with each
> particular argument, i.e., index numbers, and the regexp engine
> would call back an 'executer' with those.

It needs to know something about supported languages syntax so that it
doesn't close the code block too soon:

%r{(?{ abc(def {|g| h(g)})})}
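
A quick sketch of the problem, treating the pattern as a plain string: a language-agnostic scanner that stops at the first "})" cuts the code body short. The variable names are illustrative only.

```ruby
source = '(?{ abc(def {|g| h(g)})})'

# Naive rule: the body ends at the first "})" after "(?{".
naive = source[/\(\?\{(.*?)\}\)/, 1]

# What the author meant: everything up to the final "})".
# (Taking the last "})" only works here because there is a single block.)
intended = source[/\A\(\?\{(.*)\}\)\z/, 1]

p naive     # " abc(def {|g| h(g)"
p intended  # " abc(def {|g| h(g)})"
```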

-austin

from Nobuyoshi Nakada, Mon Sep 24 09:20:55 -0400 2007

Hi,

At Mon, 24 Sep 2007 07:46:43 -0400,
halostatue@gmail.com wrote:
> > I don't think a 'code parser' callback is necessary. For
> > instance, the parser could replace the code blocks with each
> > particular argument, i.e., index numbers, and the regexp engine
> would call back an 'executer' with those.
> 
> It needs to know something about supported languages syntax so that it
> doesn't close the code block too soon:
> 
> %r{(?{ abc(def {|g| h(g)})})}

I meant splitting it as
  %r{(?{#1})}
and
  proc {abc(def {|g| h(g)})}
by the parser.

The proc could be supplied to Oniguruma directly or just index
to another common callback argument.
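
A hypothetical sketch of what the result of that split could look like from the Ruby side. The table layout, the executer lambda and the example proc are all assumptions for illustration; nothing here is an existing Oniguruma interface.

```ruby
# After the split, the pattern only carries a placeholder index ...
pattern_source = '(?{#1})'

# ... and the code lives in a table of procs built by Ruby's own parser.
code_table = { 1 => proc { |text| text.to_i > 10 } }

# One common "executer" callback: the engine passes the index it found
# plus whatever context it has (here just the matched text).
executer = lambda do |index, text|
  code_table.fetch(index).call(text) ? true : false
end

# The engine side only needs to read the index out of the placeholder.
index = pattern_source[/\(\?\{#(\d+)\}\)/, 1].to_i

p index                       # 1
p executer.call(index, "11")  # true
p executer.call(index, "9")   # false
```

With this layout the engine never has to find the end of a code body itself, since it only ever sees the numeric placeholder.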


from Robert Klemme, Mon Oct 01 08:14:06 -0400 2007

2007/9/23, psionic@csh.rit.edu <psionic@csh.rit.edu>:
> We lose the MatchData info, which would tell me what part of the string
> was matched. Am I making sense?

Yes. But you are not completely right - although the block does not
receive a MatchData instance, you can nevertheless obtain the
information:

irb(main):005:0> s.scan(/\d+/) {|m| print $`, "<", m, ">", $', "\n"}
foo <123> bar 456 baz
foo 123 bar <456> baz
=> "foo 123 bar 456 baz"

(see also further below)

> Further testing reveals that $~ is set in String#scan, but only is set
> to the last match found, so we cannot rely on it for each match
> iteration.

That's wrong:

irb(main):008:0> s.scan(/\d+/) {|m| print $`, "<", $~, ">", $', "\n"}
foo <123> bar 456 baz
foo 123 bar <456> baz
=> "foo 123 bar 456 baz"

irb(main):010:0> s.scan(/\d+/) {|m| print $`, "<", $~.class, ">", $', "\n"}
foo <MatchData> bar 456 baz
foo 123 bar <MatchData> baz
=> "foo 123 bar 456 baz"

> If String#scan gave MatchData instances, we might be closer.

You can access a MatchData instance via $~ (see above).  Note also
that $~ and the like are thread safe so there is no problem using
these globals.  You need only be careful when doing nested matches.
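
A short sketch of the nested-match caveat, assuming s holds the string shown in the irb output above: grab the MatchData with Regexp.last_match before running another match inside the block, because the inner match replaces $~.

```ruby
s = "foo 123 bar 456 baz"

results = []
s.scan(/\d+/) do |digits|
  outer = Regexp.last_match   # MatchData for the current scan match
  digits =~ /\d\z/            # nested match: $~ now refers to this one
  results << [outer[0], outer.begin(0), $~[0]]
end

p results  # [["123", 4, "3"], ["456", 12, "6"]]
```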

> Beyond
> that, we lose backtracking that is built into the regexp engine. There
> are probably cases where String#scan will behave differently than if we
> were causing backtracking by inducing failures at runtime (aka, named
> capture 'two' doesn't match 'bar', so we'll "fail" and backtrack).

This particular example can be better solved with negative lookahead
or selection afterwards (see my first comment).

> > Is there maybe a better example that demonstrates what can be done
> > with the new feature and cannot be done with current state of affairs?
>
> Let me know if my above examples don't adequately explain why
> String#scan is not sufficient.

Frankly, I still feel that I haven't seen something that cannot be
done with current state of the Ruby art (including 1.9 lookarounds).
The fact that your tool has benefited from this feature does not mean
that there are no other ways or even simpler ways to implement it. :-)

> > My questions would be:
> >
> > - What do we gain?
>
> It's hard for me to put into words: My perl text analyzer tool (grok)
> gains much from this particular feature. I can do crazy things like say
> match an IP and also require that IP match another pattern;

That can be done differently - even with 1.8.x (see my first comment).

> or match a
> number and also require that number meet an additional condition such as
> "num > 10" and only numbers > 10 will be matched.

Same here.

irb(main):023:0> s.to_enum(:scan, /\d+/).each {|m| p m if m.to_i > 200}
"456"
=> "foo 123 bar 456 baz"

> Additionally, we can
> use this to call functions or callbacks when portions of a regexp are
> reached (useful for debugging your regexes, etc).

That sounds interesting. Can you show a more concrete example?  Still
I am not sure whether better debugging alone is justification enough
to introduce such a complex new feature - especially given all the
dependencies and problems with using a third party lib (see other
comments).

> > - Does the feature influence matching speed of RX that do not make use of it?
>
> Nope, and hurray for that! Basically, your regex gets compiled into a
> series of opcodes when you do Regexp.new(). If you don't use a given
> regex feature, then that feature isn't compiled into your executable
> opcode string and doesn't get used at match-time.

Good.

> > - Also, I'd check the rationale why the Perl people choose to not let
> > the code influence matching.  Maybe there is a good reason to do that
> > which applies to Ruby as well.
>
> In fact, perl lets you go beyond influencing success and failure.

I thought you said that in Perl result of the expression would not
influence matching. But I do see now this is irrelevant. Thanks for
the update!

> Pretty cool feature.

Indeed.  But I'd prefer usefulness, usability and efficiency over coolness. :-)

> Looking at ruby's Regexp docs, it seems like /#{stuff}/ is evaluated each time a match is attempted, but I'm not sure that's the case:

It is evaluated once per invocation. Since #scan is a method
invocation it is evaluated only once in your example:

>  % ruby -e '"abc".scan(/.#{puts "hi"}/) { |x| puts x }'
>  hi
>  a
>  b
>  c
>
> 'hi' is only output once. I could be doing something wrong, though, or
> my interpretation of the ruby pickaxe is wrong.

The latter.
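
A small sketch that makes the evaluation count visible: the interpolation runs each time the regexp literal itself is evaluated, not each time the engine attempts a match. The counter names are only for illustration.

```ruby
# One evaluation of the literal, three matches found by scan.
calls = 0
matches = "abc".scan(/.#{calls += 1; ""}/)
p matches  # ["a", "b", "c"]
p calls    # 1

# Three evaluations of the literal, because the expression runs three times.
evaluations = 0
3.times { "abc" =~ /b#{evaluations += 1; ""}/ }
p evaluations  # 3
```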

> Thanks for your feedback :)

You're welcome.

Cheers

robert

from Wolfgang Nádasi-Donner, Mon Oct 01 10:58:35 -0400 2007

> Frankly, I still feel that I haven't seen something that cannot be
> done with current state of the Ruby art (including 1.9 lookarounds).
The current look-behind feature has the disadvantage that the pattern must have 
a fixed length.

I agree that everything is possible without any change - we have access to a 
complete MatchData-Object inside a scan block, and there are no problems to 
access the complete String object too - but it is much more compact and natural 
with callback inside the pattern (including access to the partial MatchData and 
use result as success/failure inside the RegEx engine).

This may not be of any interest in larger programs which have to be maintained, 
but it is helpful for "one to some" liners and interactive text processing (may 
be using 'irb'), which are usually "use and throw away" programs. These kind of 
programs are very often needed. Ruby programs replace Perl programs more and 
more for short textual processing programs.

Wolfgang Nádasi-Donner



Copyright © 2006, Ruby Power and Light, LLC