RCR 276: Make String#scan, #gsub, and #sub yield MatchData objects


Robert Klemme has suggested that instead of modifying String#scan, String#gsub, and String#sub, new methods Regexp#scan, Regexp#sub, and Regexp#gsub be created that yield MatchData objects. They would be called as re.scan(string) { |md| block }, etc. Obviously, I prefer the change noted above (there would be no way to have #sub! and #gsub! versions with this change), but I would not be opposed to this, either. -- Austin Ziegler
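For illustration only, the suggested Regexp#scan could behave roughly like the sketch below. The helper name regexp_scan is hypothetical (nothing like it exists in core Ruby), and the sketch assumes Ruby 1.9 or later, where Regexp#match accepts a start position.

```ruby
# Hypothetical stand-in for the suggested Regexp#scan; not a core method.
# Yields one MatchData per match and returns the scanned string.
def regexp_scan(re, string)
  return enum_for(:regexp_scan, re, string) unless block_given?

  pos = 0
  while pos <= string.length && (md = re.match(string, pos))
    yield md
    # Step past an empty match so the loop always advances.
    pos = md.end(0) == md.begin(0) ? md.end(0) + 1 : md.end(0)
  end
  string
end

names = []
regexp_scan(/(\w)(\w+)/, "foo bar") { |md| names << md[1].upcase + md[2] }
# names now holds the capitalized words, built from the groups of each match
```

Because the block receives the MatchData itself, groups and offsets are available without touching $~ or $1.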

If accepted, this RCR would do three good things, IMHO... it would:

Allow avoidance of $globals
Encourage usage of MatchData (so much more OO), even in trivial cases
Make the modified methods safer to use in threads
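To make the first two points concrete, here is how a block form of #sub has to reach the groups today. This is only a sketch of current behaviour; the sample text and variable names are made up for illustration.

```ruby
text = "John Smith"

# Today the block receives the matched String, so the groups
# have to come from the $-variables.
swapped_globals = text.sub(/(\w+) (\w+)/) { "#{$2}, #{$1}" }

# Avoiding the $-variables still means fetching the MatchData separately.
swapped_md = text.sub(/(\w+) (\w+)/) do
  md = Regexp.last_match # the same object as $~
  "#{md[2]}, #{md[1]}"
end
```

Under the RCR, the second form would shrink to a block taking |md| directly.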

IANYM, but this seems to me like a natural progression as Ruby continues to define its own style, and throws out the trappings of perlism :)

As for the Regexp#gsub idea, I'm not convinced that it makes sense to have these methods, which are basically identical in function, attached to two different objects. It would make about as much sense to me as offering String#join(array), with it being a near parallel to Array#join(string). Then there's the lack of possible bang methods, which I would miss. --Mark Hubbart

Third try: Nobu Nokada suggested in : "#to_str doesn't solve everything. MatchData#[] returns a matched portion for sub-patterns, whereas String#[] returns a byte at the position." I responded ():

Agreed. It also is 100% incompatible with #scan when the regexp contains groups (e.g., "foobar".scan(/(..)(.)/) yields [["fo", "o"], ["ba", "r"]]). This is the argument for Regexp#scan instead of modifying String#scan. However, this is something that I believe should be changed. An alternative is to yield both the normal values and the match -- but that itself will be incompatible with #scan and most current uses of #gsub and #sub that use the match value.
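The incompatibility is easy to see in current Ruby. The snippet below only shows existing behaviour; the variable names are illustrative.

```ruby
# With groups, #scan returns (or yields) arrays of captures.
pairs = "foobar".scan(/(..)(.)/)

# Without groups, it returns (or yields) the whole matches as Strings.
whole = "foobar".scan(/.../)

# MatchData#[] indexes by group number ...
md = /(..)(.)/.match("foobar")
full_match  = md[0]
first_group = md[1]

# ... whereas String#[] indexes into the text itself. On Ruby 1.9 and later
# this is a one-character String; on 1.8 it was the byte value.
second_char = md.to_s[1]
```

Code written against either of the first two shapes would break if the block suddenly received a MatchData.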

Yet another alternative is to add an optional parameter in all cases. String#gsub currently expects a regexp and a replace pattern OR a regexp and a block. #gsub could be modified such that when it gets a regexp, a "boolean", and a block, it yields something different. This could be, for example:

 String#gsub(pattern, true) { |match_data| ... }
 String#gsub(pattern) { |string| ... }

I would actually rather see the opposite form, if we do this:

 String#gsub(pattern, true) { |string| ... }
 String#gsub(pattern) { |match_data| ... }

This would encourage the use of the new form. By doing it this way, a transition period can be introduced for this (e.g., in 1.8.3 it may warn that the current replace will be changed to yield a match_data instead of a string; in 1.9 it yields a match_data instead of a string).
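A rough sketch of the preferred form (MatchData by default, String when the flag is passed) might look like this. gsub_flagged is a hypothetical free-standing helper used so that String#gsub itself is left alone; its name and positional flag are assumptions for illustration.

```ruby
# Hypothetical stand-in for the flagged #gsub described above.
# yield_string = false -> the block receives a MatchData (the "new" form)
# yield_string = true  -> the block receives the matched String (the "old" form)
def gsub_flagged(string, pattern, yield_string = false)
  string.gsub(pattern) do |matched|
    yield(yield_string ? matched : Regexp.last_match)
  end
end

by_md  = gsub_flagged("a1 b2", /([a-z])(\d)/) { |md| md[2] + md[1] }
by_str = gsub_flagged("a1 b2", /([a-z])(\d)/, true) { |s| s.upcase }
```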

I have *not* analysed code out there that uses #gsub/#scan/#sub, but I think that this is an ideal change.

In , Robert Klemme suggested that this may not be ideal:

Adding a flag to change method behavior is usually regarded as bad OO practice. The usual solution is to have two different methods - one for each behavior. That increases modularity, simplifies the implementation and improves performance.

I'd rather have String#gsub_md, String#gsub_md! and String#scan_md than the flag although I have to admit that those method names are ugly.
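The separate-method approach could be sketched in plain Ruby as below. Only the method names come from the comment above; the bodies are an assumption about how they might behave, built on the existing methods and Regexp.last_match.

```ruby
# Illustration only: reopens String to add the suggested method names.
class String
  # Like #gsub with a block, but the block receives the MatchData.
  def gsub_md(pattern)
    gsub(pattern) { yield Regexp.last_match }
  end

  # In-place variant; returns nil when nothing matched, as #gsub! does.
  def gsub_md!(pattern)
    gsub!(pattern) { yield Regexp.last_match }
  end

  # Like #scan with a block, but always yields the MatchData.
  def scan_md(pattern)
    scan(pattern) { yield Regexp.last_match }
  end
end

date = "2004-10-05".gsub_md(/(\d+)-(\d+)-(\d+)/) { |md| "#{md[3]}/#{md[2]}/#{md[1]}" }

starts = []
"ab cde".scan_md(/\w+/) { |md| starts << md.begin(0) }
```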

I agree that the presence of the flag may not be ideal, but note that #gsub already uses such conditional work -- and not having the flag would actually cause unnecessary code duplication (per the C code above). Right now, it accepts #gsub(patt, repl) or #gsub(patt) { repl-block }; this would extend the capability to include #gsub(patt, repl-type) { repl-block }, the repl-type determining what is yielded. Again: this is entirely about the yielded value. I believe that all forms are correct for how they work when they don't use a block.

The idea of having #gsub with the flag (as opposed to the #gsubm form [the name I had chosen instead of #gsub_md], which I rejected in writing my response to Nobu) is that IMO we should encourage the use of the new form with MatchData objects, not the other form. By using #gsubm, we discourage the use of the new form in favour of the old form. The only way that I think that this would really work is to have #gsub yield MatchData and #gsubs yield Strings, if we take that approach. --AustinZiegler

Instead of adding a flag, why not check the arity of the block?

 String#gsub(pattern) { |string| ... }
 String#gsub(pattern) { |string, match_data| ... }

-- Paul Brannan

This might work and perhaps be better; it will also be compatible with the form:

  String#gsub(pattern) { ... }

This, when I am not using the MatchData, is my common case for String#gsub.
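The arity idea could be tried out with something like the following. gsub_by_arity is a hypothetical helper, not an existing method, and the sketch assumes Ruby 1.9 or later, where a block without parameters reports an arity of 0.

```ruby
# Hypothetical helper: decides what to pass based on the block's arity.
def gsub_by_arity(string, pattern, &block)
  string.gsub(pattern) do |matched|
    if block.arity == 2
      # Two block parameters: pass the String and the MatchData.
      block.call(matched, Regexp.last_match)
    else
      # Zero or one parameter: behave like the current #gsub.
      block.call(matched)
    end
  end
end

one_arg = gsub_by_arity("a1 b2", /([a-z])(\d)/) { |s| s.upcase }
two_arg = gsub_by_arity("a1 b2", /([a-z])(\d)/) { |s, md| "#{s}=#{md[2]}" }
no_arg  = gsub_by_arity("a1 b2", /([a-z])(\d)/) { "x" }
```

Existing one-parameter and parameterless blocks keep working unchanged, which is the compatibility point made above.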

