RCR 279: User defined % literals

Abstract


Herein, it is proposed to allow for the creation of custom % literals. % literals are especially useful in reducing code clutter for commonly recreated data structures. Presently, the built-in % literals (namely %q, %Q, %r, %s, %w and %x) are handled opaquely by the Ruby interpreter. User-defined % literals can be provided via % methods in the same way as the present `` (backquote) construct, which in itself calls %x. Programmers would then be able to swiftly create data structures particular to their needs. For example, %y could be used for YAML::load.

Problem

Ruby literals provide a clean, concise way of specifying commonly used data structures like strings, regular expressions, word lists, etc. without the need for excessive escaping of quotes. However, requests for new literals arise from time to time, indicating that people appreciate the convenience of literals and want more than the presently provided set. Ruby currently offers no way to add a literal without changing the interpreter.

Proposal

A user-defined literal constructor may be any single alphabetical character (upper or lower case) preceded by the % character. The syntax of these custom % literals conforms to the same rules as the current literals, i.e., matching braces, etc. Additionally, the literal's closing delimiter may be followed by any number of letters, serving as a limited form of parameter, congruent with the present behavior of %r.

When a lowercase % literal is evaluated (i.e. %m, where m is any lowercase letter), the literal and its parameters are passed as strings to a method of like name, e.g. def %m(string, options). This method then interprets the string according to any optional parameters and returns a representative object. In the case of an uppercase % literal (%M, where M is an uppercase letter) the lowercase method is also called, but only after the interpreter applies the additional substitutions for double-quoted strings.
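As a point of comparison, current Ruby already dispatches the backquote construct to an ordinary, overridable method, and %x{...} goes through that same method. The following is a minimal sketch of that precedent; the class name DryRunShell is hypothetical, and its method only returns a description instead of running a command:

```ruby
# Hypothetical class that overrides the backquote method for its own
# instances only, so nothing is actually executed.
class DryRunShell
  # Both `...` and %x{...} inside this class resolve to this method.
  def `(command)
    "would run: #{command}"
  end

  def demo
    [`ls -l`, %x{ls -l}]
  end
end

puts DryRunShell.new.demo
```

The proposed % methods would extend this one existing hook to every letter.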

To demonstrate the definition of a % method, we will first give the trivial case of %q:

  <pre>
  module Kernel
    def %q(string)
      string
    end
  end
  </pre>

That was easy :-). Let's add an option to convert the string to uppercase and do error checking on the parameters. We will also move the method to another class, making it our own private version:

  <pre>
  class OurClass
    private
    def %q(string, params)
      params.split(//).each do |p|
        case p
          when 'u'
            string.upcase!
          else
            raise "unknown string option: #{p}"
        end
      end
      string
    end
  end
  </pre>

This new definition of the %q literal can then be used in any place where the usual method resolution would find OurClass#%q as the method to call. An example would be:

  class OurClass
    def test
      %q{Hello #{world}!}u
    end
  end

Note how the option u is passed in the same way as to the %r literal in Ruby now. When calling OurClass#test, the evaluation of the literal will result in a call to OurClass#%q with 'Hello #{world}!' and 'u' as parameters. This method will then return 'HELLO #{WORLD}!'

In a likewise manner we can call the uppercase variant:

  class OurClass
    def test
      world = 'ruby-talk'
      %Q{Hello #{world}!}u
    end
  end

Calling OurClass#test will again result in a call to OurClass#%q, but this time with 'Hello ruby-talk!' and 'u' as parameters. This is because the uppercase variant does do string interpolation. The result of it all would be 'HELLO RUBY-TALK!'.
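Since def %q is not valid syntax today, the two calls above can only be approximated. The sketch below assumes a stand-in method named percent_q (a hypothetical name) and spells out by hand the arguments the interpreter would pass in each case. It returns a new string rather than calling upcase!, so it also works when the argument is frozen:

```ruby
class OurClass
  def lowercase_test
    # %q{Hello #{world}!}u would pass the raw, uninterpolated text.
    percent_q('Hello #{world}!', 'u')
  end

  def uppercase_test
    world = 'ruby-talk'
    # %Q{Hello #{world}!}u would interpolate before the call.
    percent_q("Hello #{world}!", 'u')
  end

  private

  # Hypothetical stand-in for the proposed OurClass#%q.
  def percent_q(string, params)
    params.each_char.inject(string) do |result, p|
      case p
      when 'u' then result.upcase
      else raise ArgumentError, "unknown string option: #{p}"
      end
    end
  end
end

puts OurClass.new.lowercase_test
puts OurClass.new.uppercase_test
```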

Of course, a % literal method can return an object other than a string. For instance, this is how the aforementioned YAML case is defined:

  <pre>
  module Kernel
    def %y(string, params)
      YAML::load(string)
    end
  end
  </pre>
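For comparison, here is roughly what the same thing looks like in present-day Ruby without the proposed syntax. This is a sketch only: percent_y is a hypothetical method name standing in for %y, and the YAML text is made up for illustration.

```ruby
require 'yaml'

# Hypothetical stand-in for the proposed Kernel#%y. The params argument is
# accepted for symmetry with the other % methods but is not used here.
def percent_y(string, params)
  YAML.load(string)
end

# What %y{...} would save the programmer from writing out:
config = percent_y("name: ruby-talk\nmembers: 3", "")
p config
```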

One last (almost illegible) example for a definition of %r:

  <pre>
  module Kernel
    def %r(string, options)
      Regexp.new(string, options.split(//).inject(0) { |v, c|
        v | Hash.new { |h, k|
              raise "unknown regexp option - #{k}"
            }.update({"i" => Regexp::IGNORECASE,
                      "m" => Regexp::MULTILINE,
                      "x" => Regexp::EXTENDED})[c]
      })
    end
  end
  </pre>
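A more legible variant of the same idea, runnable in current Ruby, might look like the sketch below. The names percent_r and REGEXP_OPTION_FLAGS are hypothetical; only Regexp.new and the three Regexp constants come from Ruby itself.

```ruby
# Maps each supported option letter to its Regexp flag.
REGEXP_OPTION_FLAGS = {
  "i" => Regexp::IGNORECASE,
  "m" => Regexp::MULTILINE,
  "x" => Regexp::EXTENDED
}.freeze

# Hypothetical stand-in for the proposed Kernel#%r.
def percent_r(string, options)
  flags = options.each_char.inject(0) do |acc, letter|
    # fetch runs the block, and therefore raises, for unknown letters.
    acc | REGEXP_OPTION_FLAGS.fetch(letter) do
      raise ArgumentError, "unknown regexp option - #{letter}"
    end
  end
  Regexp.new(string, flags)
end

p percent_r("h.llo", "im")
```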

Possible extensions

Ruby does not allow an alphanumeric character as the delimiter of a % literal, so it would also be possible to allow more than one letter after the %, e.g., %yaml, which is less cryptic than %y (although longer). While not a necessity, this would widen the possibilities.

Analysis

Pros

Allows you to define your own % literals without changing the Ruby interpreter. Requests for new % literals have come up a few times on ruby-talk, and lots of people would know what to do with this feature: YAML literals, XML literals, syntax literals (on-the-fly parser generation), ...
They have the same advantages that the existing % literals have: convenience (less typing for commonly occurring data structures), conciseness (less code clutter), and less escaping of quotes in literals.
This proposal moves part of the Ruby core to the core library. Keeping the kernel small is generally a Good Thing.

Cons

The disadvantages are effectively the same as those attributed to the current % literals and the `` notation.
Overriding the global definition of the built-in, default % literals must be handled with care. When using other people's libraries, it may cause backward-compatibility issues. Then again, this comes with the territory of having open classes. While overriding the built-in % literals could be prohibited, a warning would probably suffice.
%r and %x, though both lowercase, are presently treated as double-quoted, thus differing from the general convention set forth in this proposal. Changing this can break backward compatibility, in which case a phased implementation is recommended, issuing a strong warning for a number of release cycles. The (less elegant) alternative is to make asymmetric exceptions for %r and %x regardless of whether they are overridden or not. These two will then always be parsed as if double-quoted, rendering the uppercase variants useless.

Implementation

We don't have an implementation yet, since we are no Ruby core wizards and it requires:

extending Ruby's method definition syntax. The new syntax strictly extends existing syntax, thus it does not clash with anything in Ruby's syntax as we know it.
changing Ruby's % literal syntax to allow any letter and allow any parameters to each literal. Again, this strictly extends Ruby's syntax and does not clash with it.
changing Ruby's evaluation of % literals. During the parsing phase, the literal and its flags are stored as strings, the method that will be called at evaluation is stored in some form, and the literal is flagged as being single- or double-quoted. At evaluation time all required substitutions are performed before passing the literal and its parameters as strings on to the corresponding method. The result of that method is the result of the evaluation of the literal.
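The parse and evaluation phases described in the last item can be modelled in plain Ruby. The following is a rough sketch under heavy assumptions, not a design for the interpreter: every name in it (ParsedLiteral, parse_percent_literal, evaluate_percent_literal, percent_q, Receiver, sample_scope) is hypothetical, only {} delimiters are recognised, and interpolation is reduced to bare local variable names.

```ruby
# What the parser would store for one % literal.
ParsedLiteral = Struct.new(:method_name, :body, :flags, :interpolate)

# Parse phase: split "%m{body}flags" into its parts.
def parse_percent_literal(source)
  match = /\A%([A-Za-z])\{(.*)\}([A-Za-z]*)\z/m.match(source)
  raise ArgumentError, "not a % literal: #{source}" unless match

  letter = match[1]
  # An uppercase letter marks the literal as double-quoted.
  ParsedLiteral.new("percent_#{letter.downcase}", match[2], match[3],
                    letter == letter.upcase)
end

# Evaluation phase: substitute if flagged, then call the method.
def evaluate_percent_literal(literal, receiver, scope)
  body = literal.body
  if literal.interpolate
    # Simplification: only #{local_variable} is substituted.
    body = body.gsub(/#\{(\w+)\}/) do
      scope.local_variable_get(Regexp.last_match(1)).to_s
    end
  end
  receiver.send(literal.method_name, body, literal.flags)
end

# A receiver with a stand-in for the proposed %q method.
class Receiver
  def percent_q(string, flags)
    flags.include?('u') ? string.upcase : string
  end
end

# Provides a binding in which the local variable world is defined.
def sample_scope
  world = 'ruby-talk'
  binding
end

lower = parse_percent_literal('%q{Hello #{world}!}u')
upper = parse_percent_literal('%Q{Hello #{world}!}u')
puts evaluate_percent_literal(lower, Receiver.new, sample_scope)
puts evaluate_percent_literal(upper, Receiver.new, sample_scope)
```

Both literals end up in the same lowercase method; the only difference is whether the substitution step runs first.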

If we manage to cook up an implementation, the patch will probably appear first at RcrFoundry.

Comments

Custom literals will make it more difficult to do syntax coloring. Ruby syntax coloring is already complicated... such a feature will make it even more complex. --Simon Strandgaard

Can't you simply color a %x literal as a single-quoted string and a %X literal as a double-quoted string?

-- Peter

Peter, first let me tell you that this is one of the better-written RCRs I've seen. IMO, we need more RCRs like this (not that I agree with the proposal, but I think it's really well-written).

Could you please provide some examples of how calling %q with parameters would look?

Also, do the new %-methods follow the normal method lookup rules? E.g. if I have a derived class, an included module, and a base class, and the derived class defines a new %x, and the module uses %x, does it get the derived class's version or the one in Kernel? If it gets the derived class's version, is it then good advice to avoid using %-literals from inside mixins?

Will multi-character %-literal names be allowed, e.g.:

  %yaml{...}

If so (and even if not -- the same applies to single-character %-literals), then if I write this:

  foo %bar{baz}

should this be interpreted as:

  foo(%bar{baz})

(that is, a method foo being called with the result from the %bar literal that was passed the string "baz"), or should it be interpreted as:

  foo % (bar{baz})

(that is, the result of mod'ing foo with the result of calling bar with {baz} as a block)?

Lastly, could you elaborate on the advantages that creating user-defined %-literals has over simply passing strings into methods? Since there are only 26 letters of the alphabet, and I think most people will tend toward defining single-letter literals, is this really a good idea (it seems to favor libraries that establish their %-literals over newer libraries)?

Simon, I think syntax highlighting is a solvable problem (just highlight all unknown literals as you would a string), so long as the rules for opening and closing the literal are well-defined and not redefinable at run-time (since the parser is pretty much static at the moment, I think this is a reasonable requirement).

-- Paul Brannan

Paul, thanks for the compliment about the quality of the RCR, but the credit isn't all mine. T. Onoma helped me put it together at which was started exactly because we also think we need better quality RCRs. Now we know it was not an illusion :-)

First the easy part: the syntax ambiguity. It is already resolved in Ruby now. This:

  foo %q{bar}

is interpreted as

  foo(%q{bar})

and this:

  foo % q{bar}

is interpreted as

  foo % (q{bar})

The standard method lookup rules will indeed be used. This does pose some dangers when using literals in modules. The easy answer is that the same danger exists with ``, so the danger level must be acceptable. But really, the danger of name clashes when including modules exists anyhow, yet poses very few problems in practice. I'm not sure if these literals will make it worse, even if there are only 26 letters in the alphabet. I think you would define many more methods than % literals, and in practice not that many different method names are actually used, so the danger is IMO not that much higher than in everyday Ruby coding nowadays. Besides, if Matz implements namespaces, the issue becomes resolvable without having to resort to ugly notations like Kernel::%q{Hi}.

Lastly, the battle for the literals. Defining the literals in Kernel or Object is bad, just like any definition in Kernel or Object. But even if multiple libraries you use define the same literals in separate modules or so, you still can't use them both in the same place. The libraries should always provide an alternative way to do the same thing, even if it is more verbose. That means you can't use the literals, but that's tough luck then. And again, if Matz implements namespaces, this will be a problem no more.

I think there are two good reasons to have these literals. One is that these % literals use less escaping, although you can still use m(%q{}) for that instead of %m{}. The second is that, IMO, if your code uses m(%q{blah blah}) a lot, it will become more readable if you can just type %m{blah blah} all the time. It's not just less typing, it's one indirection less to interpret when reading the code. It's also a matter of taste: if you don't like %q, %r and the like now, you won't like them in any form. If you do like these literals now, it's a small step to wanting some of your own.

-- Peter

In my Ruby lexer I attempt to color escapes and interpolated code inside literals.

I want to color %r{} regexp literals as regexps, and if I detect illegal regexp constructions I want to color them red. %r literals can have trailing options too. I want to color %w{} as an array literal, so that it's easier to see where the separators are. Uppercase literals, such as %W and %R, are resolved by Ruby.

I think allowing for custom single letters won't break too much. But allowing for arbitrary strings %tag content tag will cause problems. I don't know exactly what I want to say.

-- Simon Strandgaard

Simon, it would still be possible to color escape sequences and interpolations as well as the trailing options, because the rules for this will be the same for all of the literals. Detecting errors in the literals will have to be generic, unless it would be acceptable to you to color all %r literals as regexps and %w literals as word lists even if they are redefined. I don't remember the details, but I distinctly remember an editor doing something like that, and I also remember I hated that.

And for all clarity, I do not intend to allow arbitrary strings as delimiters, that's a bad idea. I meant arbitrary identifiers, like this:

  %tag( content )

-- Peter

I would like to point out that the main advantage here is that no one would need to petition Ruby's developers any longer, nor would the developers have to worry about it, whenever a new % literal is desired. For instance, _why would like a %y for YAML. Sean Russell might like a %xml, etc.

A couple of things happen. 1) The Ruby interpreter can actually be simplified. That's good. And 2) a generally acceptable syntax for creating constructor shortcuts is promoted. These two facts lead me to believe that Ruby should either do away with % literals altogether (if their perlism-ness is undesired) or embrace this RCR.

Allowing programmers to change things like this doesn't seem to hurt the language at all and gives a large benefit of flexibility, which is what I see as a major benefit of Ruby (with things like open classes: new methods can be added, old methods can be changed, &c).

-- Olathe

Current voting

Strongly opposed: 3
Opposed: 2
Neutral: 1
In favor: 6
Strongly advocate: 6
