Submitted by austin (Sun Aug 22 13:08:14 UTC 2004)
String#scan, #gsub, and #sub to provide the MatchData to attached code blocks.
String#scan, #gsub, and #sub yield the string value of the matched regular expression to a provided block, which is of very limited value. Currently, we must rely upon either ugly numeric match variables ($1 - $9, etc.) or a class method (Regexp.last_match):
str = '<span id="1"> <span> ...</span> </span> '
re = /(<(\/?)span> )/i

str.scan(re)
# => [["<span> ", ""], ["</span> ", "/"], ["</span> ", "/"]]

matches = []
str.scan(re) do
  matches << Regexp.last_match
end

matches.each do |match|
  match.captures.each_with_index do |capture, ii|
    soff, eoff = match.offset(ii + 1)
    puts %Q("#{capture}" #{soff} .. #{eoff})
  end
end
String#scan, #sub, and #gsub should yield MatchData objects instead of Strings. I think that this could be achieved while breaking the least amount of code by adding a #to_str implementation to MatchData.
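As a rough illustration of why #to_str matters here, the sketch below assumes MatchData is reopened with a #to_str that simply delegates to #to_s. Ruby's implicit string conversion then lets a MatchData stand in for a String in calls such as String#+ and String#include?. The variable names are only for illustration, and reopening a core class like this is a demonstration device, not a recommendation.

```ruby
# Assumption: MatchData gains #to_str, as the proposal suggests.
class MatchData
  def to_str
    to_s
  end
end

md = /b+/.match("aabbcc")

# String#+ and String#include? call #to_str on a non-String argument,
# so the MatchData is accepted where a String is expected.
joined = "[" + md + "]"
found  = "aabbcc".include?(md)

puts joined
puts found
```

Code that only concatenates or interpolates the yielded value would keep working under this assumption; code that indexes into it would not, because MatchData#[] and String#[] mean different things.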
--- re.c.old 2004-08-22 00:24:09 Eastern Daylight Time
+++ re.c 2004-08-22 00:18:50 Eastern Daylight Time
@@ -2320,6 +2320,7 @@
     rb_define_method(rb_cMatch, "pre_match", rb_reg_match_pre, 0);
     rb_define_method(rb_cMatch, "post_match", rb_reg_match_post, 0);
     rb_define_method(rb_cMatch, "to_s", match_to_s, 0);
+    rb_define_method(rb_cMatch, "to_str", match_to_s, 0);
     rb_define_method(rb_cMatch, "inspect", rb_any_to_s, 0); /* in object.c */
     rb_define_method(rb_cMatch, "string", match_string, 0);
 }
--- string.c.old 2004-08-22 00:24:10 Eastern Daylight Time
+++ string.c 2004-08-22 00:20:35 Eastern Daylight Time
@@ -1928,7 +1928,7 @@
     if (iter) {
         rb_match_busy(match);
-        repl = rb_obj_as_string(rb_yield(rb_reg_nth_match(0, match)));
+        repl = rb_obj_as_string(rb_yield(match));
         rb_backref_set(match);
     }
     else {
@@ -2043,7 +2043,7 @@
     regs = RMATCH(match)->regs;
     if (iter) {
         rb_match_busy(match);
-        val = rb_obj_as_string(rb_yield(rb_reg_nth_match(0, match)));
+        val = rb_obj_as_string(rb_yield(match));
         rb_backref_set(match);
     }
     else {
@@ -4164,15 +4164,7 @@
     else {
         *start = END(0);
     }
-    if (regs->num_regs == 1) {
-        return rb_reg_nth_match(0, match);
-    }
-    result = rb_ary_new2(regs->num_regs);
-    for (i=1; i < regs->num_regs; i++) {
-        rb_ary_push(result, rb_reg_nth_match(i, match));
-    }
-
-    return result;
+    return match;
     }
     return Qnil;
 }
I'm not 100% sure that this is right, and I haven't tested it. The equivalent Ruby code would be (note: this code appears to work, but it does cause problems with irb):
class MatchData
  def to_str
    self.to_s
  end
end

class String
  alias_method :old_scan, :scan
  alias_method :old_gsub!, :gsub!
  alias_method :old_sub!, :sub!

  def scan(pattern)
    if block_given?
      old_scan(pattern) { yield Regexp.last_match }
    else
      old_scan(pattern)
    end
  end

  def gsub(pattern, repl = nil, &block)
    s = self.dup
    s.gsub!(pattern, repl, &block)
    s
  end

  def gsub!(pattern, repl = nil)
    if block_given? and repl.nil?
      old_gsub!(pattern) { yield Regexp.last_match }
    elsif repl.nil?
      old_gsub!(pattern)
    else
      old_gsub!(pattern, repl)
    end
  end

  def sub(pattern, repl = nil, &block)
    s = self.dup
    s.sub!(pattern, repl, &block)
    s
  end

  def sub!(pattern, repl = nil)
    if block_given? and repl.nil?
      old_sub!(pattern) { yield Regexp.last_match }
    elsif repl.nil?
      old_sub!(pattern)
    else
      old_sub!(pattern, repl)
    end
  end
end
RCRchive copyright © David Alan Black, 2003-2005.
Robert Klemme has suggested that instead of modifying String#scan, String#gsub, and String#sub, new methods Regexp#scan, Regexp#sub, and Regexp#gsub be created that yield MatchData objects. They would be called as re.scan(string) { |md| block }, etc. Obviously, I prefer the change noted above (there would be no way to have #sub! and #gsub! versions with this change), but I would not be opposed to this, either. -- Austin Ziegler
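A minimal sketch of the Regexp-side idea, assuming it can be layered on the existing String methods. The module name MatchYield and its method names are hypothetical; the suggestion itself would define Regexp#scan, Regexp#sub, and Regexp#gsub directly. Each helper takes the regexp first and yields the MatchData for every match.

```ruby
# Hypothetical helper module standing in for Regexp#scan / #sub / #gsub.
module MatchYield
  # Yields one MatchData per match and returns the original string.
  def self.scan(re, string)
    string.scan(re) { yield Regexp.last_match }
  end

  # Returns a copy with the first match replaced by the block's result.
  def self.sub(re, string)
    string.sub(re) { yield Regexp.last_match }
  end

  # Returns a copy with every match replaced by the block's result.
  def self.gsub(re, string)
    string.gsub(re) { yield Regexp.last_match }
  end
end

seen = []
MatchYield.scan(/(\d)(\w)/, "1a 2b") { |md| seen << [md[1], md[2], md.begin(0)] }

swapped = MatchYield.gsub(/(\w)(\d)/, "a1 b2") { |md| md[2] + md[1] }
first_only = MatchYield.sub(/(\w)(\d)/, "a1 b2") { |md| md[2] + md[1] }
```

Because these helpers return new strings, there is no natural place for in-place #sub! and #gsub! variants, which is the limitation noted above.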
If accepted, this RCR would do three good things, IMHO... it would:
As for the Regexp#gsub idea, I'm not convinced that it makes sense to have these methods, which are basically identical in function, attached to two different objects. It would make about as much sense to me as offering String#join(array), with it being a near parallel to Array#join(string). Then there's the lack of possible bang methods, which I would miss. --Mark Hubbart
Third try: Nobu Nokada suggested: "#to_str doesn't solve everything. MatchData#[] returns a matched portion for sub-patterns, whereas String#[] returns a byte at the position." I responded:
Agreed. It is also 100% incompatible with #scan when the regexp has groups (e.g., "foobar".scan(/(..)(.)/) will yield [["fo", "o"], ["ba", "r"]]). This is the argument for Regexp#scan instead of modifying String#scan. However, this is something that I believe should be changed. An alternative is to yield both the normal values and the match, but that itself will be incompatible with #scan and most current uses of #gsub and #sub that use the match value.
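The incompatibility with grouped patterns can be seen with current behaviour alone. The sketch below uses only existing String#scan and Regexp.last_match; the variable names and the probe proc are illustrative.

```ruby
re = /(..)(.)/

# Without a block, scan returns one array of captures per match.
returned = "foobar".scan(re)

# With a block, the same capture arrays are yielded, so |a, b| destructures them.
yielded = []
"foobar".scan(re) { |a, b| yielded << "#{a}-#{b}" }

# Collect the MatchData objects that the proposal would yield instead.
match_objects = []
"foobar".scan(re) { match_objects << Regexp.last_match }

# A MatchData is a single object without #to_ary, so a block written as
# |a, b| would get the MatchData in a and nil in b.
probe = proc { |a, b| [a.class, b] }
probed = probe.call(match_objects.first)
```

Existing blocks that destructure captures would therefore need to be rewritten to read md[1], md[2], or md.captures.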
Yet another alternative is to add an optional parameter in all cases. String#gsub currently expects a regexp and a replace pattern OR a regexp and a block. #gsub could be modified such that when it gets a regexp, a "boolean", and a block, it yields something different. This could be, for example:
I would actually rather see the opposite form, if we do this:
This would encourage the use of the new form. By doing it this way, a transition period can be introduced for this (e.g., in 1.8.3 it may warn that the current replace will be changed to yield a match_data instead of a string; in 1.9 it yields a match_data instead of a string).
I have *not* analysed code out there that uses #gsub/#scan/#sub, but I think that this is an ideal change.
Robert Klemme suggested that this may not be ideal:
Adding a flag to change method behavior is usually regarded bad OO practice. The usual solution is to have two different methods - one for each behavior. That increases modularity, simplifies the implementation and improves performance.
I'd rather have String#gsub_md, String#gsub_md! and String#scan_md than the flag although I have to admit that those method names are ugly.
I agree that the presence of the flag may not be ideal, but note that #gsub already uses such conditional work, and not having the flag would actually cause unnecessary code duplication (per the C code above). Right now, it accepts #gsub(patt, repl) or #gsub(patt) { repl-block }; this would extend the capability to include #gsub(patt, repl-type) { repl-block }, the repl-type determining what is yielded. Again: this is entirely about the yielded value. I believe that all forms are correct for how they work when they don't use a block.
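To make the flag form concrete, here is a small sketch under stated assumptions: FlaggedGsub and its yield_match parameter are hypothetical names, and the flag's default (false, meaning the block still receives a String) is only one of the two orderings discussed. It wraps the existing String#gsub instead of changing it.

```ruby
# Hypothetical wrapper: the second argument selects what the block receives.
module FlaggedGsub
  def self.gsub(str, pattern, yield_match = false)
    str.gsub(pattern) do |matched|
      # Regexp.last_match is the MatchData for the match being replaced.
      yield(yield_match ? Regexp.last_match : matched)
    end
  end
end

# Flag omitted: the block gets the matched String, as today.
as_strings = FlaggedGsub.gsub("a1b2", /\d/) { |s| "<#{s}>" }

# Flag set: the block gets a MatchData and can use offsets or captures.
as_matches = FlaggedGsub.gsub("a1b2", /(\d)/, true) { |md| "<#{md.begin(0)}>" }
```

Both behaviours share a single code path here, which is the duplication argument made above.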
The idea of having #gsub with the flag (as opposed to the #gsubm form [the name I had chosen instead of #gsub_md], which I rejected in writing my response to Nobu) is that IMO we should encourage the use of the new form with MatchData objects, not the other form. By using #gsubm, we discourage the use of the new form in favour of the old form. The only way that I think that this would really work is to have #gsub yield MatchData and #gsubs yield Strings, if we take that approach. --AustinZiegler
Instead of adding a flag, why not check the arity of the block?
-- Paul Brannan
This might work and perhaps be better; it will also be compatible with the form:
This, when I am not using the MatchData, is my common case for String#gsub.
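One possible reading of the arity idea, sketched under assumptions: the name ArityGsub is hypothetical, and the rule that a two-parameter block receives both the matched String and the MatchData is a guess at the intent, not something stated in the comments. Blocks with zero or one parameter behave as they do today.

```ruby
# Hypothetical wrapper that inspects the block's arity to decide what to pass.
module ArityGsub
  def self.gsub(str, pattern, &block)
    str.gsub(pattern) do |matched|
      md = Regexp.last_match
      # Two declared parameters: pass the String and the MatchData.
      # Anything else: pass only the String, preserving current behaviour.
      block.arity == 2 ? block.call(matched, md) : block.call(matched)
    end
  end
end

no_params  = ArityGsub.gsub("a1", /\d/) { "X" }
one_param  = ArityGsub.gsub("a1", /\d/) { |s| s * 2 }
two_params = ArityGsub.gsub("a1b2", /(\d)/) { |s, md| "#{s}@#{md.begin(0)}" }
```

A block that ignores its argument keeps working unchanged, which matches the common case described above.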