Specified Endianness for signed 2- and 4-byte integers from String.unpack (#12)

Submitted by: Gavin Kistner

This revision by: Gavin Kistner

Date: Thu Jun 21 23:32:58 -0400 2007

ABSTRACT

There is currently no way to unpack a 2-byte or 4-byte integer using a specified endianness with String.unpack. This would add four more characters to String.unpack (without changing any existing options) to support this function.

PROBLEM

Here is a summary of the options available in String.unpack for unpacking 1-, 2-, and 4-byte integers, signed and unsigned. (ASCII table follows.)
      |         signed          |        unsigned         |  
bytes |  big  | little | native |  big  | little | native |
------+-------+--------+--------+-------+--------+--------+ 
    1 |           c             |           C             | 
    2 |   ?   |    ?   |    s   |   n   |    v   |    S   |
    4 |   ?   |    ?   |    l   |   N   |    V   |    L   |
 

The four question marks represent options not currently possible using String.unpack directly. (As noted on ruby-talk, it is possible to achieve the same functionality by unpacking as unsigned with the desired endianness, re-packing as unsigned in native order, and then unpacking as signed in native order.)
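
The workaround described on ruby-talk can be sketched as follows. This is a minimal illustration, not a recommended API: the helper names signed_be16 and signed_le32 are hypothetical, and each assumes the string holds at least the required number of bytes.

```ruby
# Unpack as unsigned with the desired byte order, re-pack as unsigned in
# native order, then unpack as signed in native order.
# (Helper names are hypothetical, chosen only for this illustration.)

# Signed 2-byte big-endian integer.
def signed_be16(str)
  str.unpack("n").pack("S").unpack("s").first
end

# Signed 4-byte little-endian integer.
def signed_le32(str)
  str.unpack("V").pack("L").unpack("l").first
end

signed_be16([0xFF, 0xFE].pack("C*"))              # bytes FF FE, big-endian
signed_le32([0xFE, 0xFF, 0xFF, 0xFF].pack("C*"))  # bytes FE FF FF FF, little-endian
```

Both calls yield -2 regardless of the host's native byte order, because the intermediate pack and unpack steps use the same native order and cancel out.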

This prevents one from writing a simple binary parser that is portable across platforms when signed integers are involved.

PROPOSAL

According to the String.unpack docs, the following characters are unused from the ranges of ‘a’..‘z’ and ‘A’..‘Z’: ‘j’, ‘J’, ‘k’, ‘K’, ‘o’, ‘O’, ‘r’, ‘R’, ‘t’, ‘T’, ‘W’, ‘y’, ‘Y’, ‘z’. Choosing somewhat arbitrarily from these (but matching the lettering system that already exists for the unsigned integers), extend String.unpack to support the following four codes:

  j    signed big endian 2 bytes
  k    signed little endian 2 bytes
  J    signed big endian 4 bytes
  K    signed little endian 4 bytes

These would complete the table as follows:
      |         signed          |        unsigned         |  
bytes |  big  | little | native |  big  | little | native |
------+-------+--------+--------+-------+--------+--------+ 
    1 |           c             |           C             | 
    2 |   j   |    k   |    s   |   n   |    v   |    S   |
    4 |   J   |    K   |    l   |   N   |    V   |    L   |
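
To make the intended behaviour concrete, here is a rough sketch that emulates the four proposed codes with the existing unsigned codes plus two's-complement arithmetic. The constant PROPOSED_CODES and the helper unpack_proposed are hypothetical names used only for illustration. The sketch reads a single integer from the start of the string and assumes the string is long enough.

```ruby
# Hypothetical emulation of the four proposed codes. Each maps to the
# existing unsigned code with the same byte order, plus a bit width.
PROPOSED_CODES = {
  "j" => ["n", 16],  # signed big-endian,    2 bytes
  "k" => ["v", 16],  # signed little-endian, 2 bytes
  "J" => ["N", 32],  # signed big-endian,    4 bytes
  "K" => ["V", 32],  # signed little-endian, 4 bytes
}.freeze

# Reads one integer from the start of str as the proposed code would.
def unpack_proposed(str, code)
  unsigned_code, bits = PROPOSED_CODES.fetch(code)
  value = str.unpack(unsigned_code).first
  # Values with the top bit set represent negative numbers.
  value >= 2**(bits - 1) ? value - 2**bits : value
end

unpack_proposed([0x80, 0x00].pack("C*"), "j")              # smallest 16-bit value
unpack_proposed([0xFF, 0xFF, 0xFF, 0x7F].pack("C*"), "K")  # largest 32-bit value
```

The first call gives -32768 and the second gives 2147483647. The helper deliberately never passes the proposed letters to unpack itself, so it does not depend on what an interpreter does with them.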

ANALYSIS

The missing functionality seems like an obvious hole in the String.unpack function. That it can be worked around various ways could only be an argument against the other existing ‘convenience’ methods.

I don’t personally care what letters are chosen, but the parity with the unsigned side of things seems appropriate to make the 2-byte choices lowercase versions of the 4-byte choices. (It seems very confusing to me that the native versions follow a different system, but that’s a separate issue.)

IMPLEMENTATION

(Sorry, I don’t understand the internals of Ruby or C well enough to supply an implementation.)

Comments

from David Black, Fri Jun 22 06:12:17 -0400 2007

Hi --

View at:

https://rcrchive.net/rcrs/12

not /19.  (An artefact of the switch from using id to using the
sequential number field for the links :-)


David



from Nobuyoshi Nakada, Fri Jun 22 17:02:15 -0400 2007

Hi,

> The missing functionality seems like an obvious hole in the
> String.unpack function. That it can be worked around various
> ways could only be an argument against the other existing
> 'convenience' methods.

Agreed, but it doesn't seem a good idea to add vague characters
more.

What about adding signed flags to integers, like:

  n+    signed big endian 2 bytes
  v+    signed little endian 2 bytes
  N+    signed big endian 4 bytes
  V+    signed little endian 4 bytes
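
One possible reading of this flag, sketched on top of the existing unsigned codes: the helper unpack_with_sign_flag and the constant SIGN_FLAG_BITS are hypothetical names, and the sketch handles only n, v, N and V with an optional trailing "+", with no counts or other codes.

```ruby
# Hypothetical reading of the suggested "+" flag: a trailing "+" on n, v,
# N, or V means "interpret the unsigned result as two's complement".
SIGN_FLAG_BITS = { "n" => 16, "v" => 16, "N" => 32, "V" => 32 }.freeze

def unpack_with_sign_flag(str, format)
  # Split the format into [code, flag] pairs; flag is "+" or "".
  tokens = format.scan(/([nvNV])(\+?)/)
  # Unpack everything as unsigned first, using the existing codes.
  values = str.unpack(tokens.map(&:first).join)
  tokens.zip(values).map do |(code, flag), value|
    bits = SIGN_FLAG_BITS.fetch(code)
    negative = flag == "+" && value >= 2**(bits - 1)
    negative ? value - 2**bits : value
  end
end

# The same two bytes read once as signed and once as unsigned.
unpack_with_sign_flag([0xFF, 0xFE, 0xFF, 0xFE].pack("C*"), "n+n")
```

The call returns [-2, 65534]: the flagged code is sign-adjusted while the unflagged one keeps its existing unsigned meaning.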

> I don't personally care what letters are chosen, but the
> parity with the unsigned side of things seems appropriate to
> make the 2-byte choices lowercase versions of the 4-byte
> choices. (It seems very confusing to me that the native
> versions follow a different system, but that's a separate
> issue.)

They have the same semantics as "C" and "c".

IMHO, it would be better to let those integral characters take an
optional byte size, so that bignums can also be packed/unpacked, but
that is a separate issue indeed.


from Gavin Kistner, Wed Aug 29 16:05:02 -0400 2007

On Fri Jun 22 17:02:15 EDT 2007 Nobuyoshi Nakada wrote:
> > The missing functionality seems like an obvious hole in the
> > String.unpack function. That it can be worked around various
> > ways could only be an argument against the other existing
> > 'convenience' methods.
> 
> Agreed, but it doesn't seem a good idea to add vague characters
> more.

Respectfully, I disagree. There are already 20 letters chosen (seemingly) arbitrarily. The method does nothing *but* map arbitrary codes to specified functionality. There are already 4 letters in this domain alone (5 if you count i/I). Picking a couple more characters at random seems appropriately in line with the design decisions so far.


> What about adding signed flags to integers, like:
> 
>   n+       signed big endian 2 bytes
>   v+       signed little endian 2 bytes
>   N+       signed big endian 4 bytes
>   V+       signed little endian 4 bytes

IF we were going to overhaul existing codes to make more sense, I would be fine with using modifiers for new functionality.
For example, the above would make sense to me iff we changed the full table to be something like:

      |         signed          |        unsigned         |  
bytes |  big  | little | native |  big  | little | native |
------+-------+--------+--------+-------+--------+--------+ 
    1 |           C+            |           C             | 
    2 |   n+  |    v+  |    s+  |   n   |    v   |    s   |
    4 |   N+  |    V+  |    S+  |   N   |    V   |    S   |

Or, if we were to change existing values, I'd also be fine with:

      |         signed          |        unsigned         |  
bytes |  big  | little | native |  big  | little | native |
------+-------+--------+--------+-------+--------+--------+ 
    1 |           c             |           C             | 
    2 |   n   |    v   |    s   |   N   |    V   |    S   |
    4 |   j   |    k   |    l   |   J   |    K   |    L   |

Note, however, that I am definitely *not* suggesting we make changes to #unpack that break old code.


Nobu, your suggestion seems incorrect when we leave the existing functionality:

      |         signed          |        unsigned         |  
bytes |  big  | little | native |  big  | little | native |
------+-------+--------+--------+-------+--------+--------+ 
    1 |           c             |           C             | 
    2 |   n+  |    v+  |    s   |   n   |    v   |    S   |
    4 |   N+  |    V+  |    l   |   N   |    V   |    L   |

With that, we have:
  * lowercase means signed if bytes is 1,
  * or lowercase means 2 bytes
  * or lowercase means signed
  * + means signed, if it's not 1 byte or native

You're adding another partially-supported axis, and using one-character versus two-characters at the same time!


My other concern is that '+' looks like the regexp "one or more" modifier, which is particularly confusing when the '*' modifier serves a similar function (albeit not the same as in regexp). If we have to go with a modifier and not additional characters (though I still think characters are the more consistent-with-the-inconsistent-situation choice) then I would suggest using something like "-n". (Because unsigned numbers are always 0 or positive, the - helps to indicate that the number may be negative.)


> > I don't personally care what letters are chosen, but the
> > parity with the unsigned side of things seems appropriate to
> > make the 2-byte choices lowercase versions of the 4-byte
> > choices. (It seems very confusing to me that the native
> > versions follow a different system, but that's a separate
> > issue.)
> 
> They are same semantics as "C" and "c".

Right. My point is that the semantics of S/s and L/l are different from the semantics of N/n and V/v; C/c seems like a special case, since it has no endian issues.


> IMHO, it's better to those integral characters take an optional
> byte size, so that bignum also can be pack/unpacked, but a
> separate issue indeed.

That certainly would be nice, but definitely outside the scope of what I'm proposing here. Unless, of course, people think that it's OK to overhaul *all* the number-related String#unpack codes in a non-backwards compatible way. If so, then I would much rather do that (make things clean and logical) instead of hacking on this functionality.

from Gavin Kistner, Wed Aug 29 16:08:53 -0400 2007

On Fri Jun 22 17:02:15 EDT 2007 Nobuyoshi Nakada wrote:
> > The missing functionality seems like an obvious hole in the
> > String.unpack function. That it can be worked around various
> > ways could only be an argument against the other existing
> > 'convenience' methods.
> 
> Agreed, but it doesn't seem a good idea to add vague characters
> more.

Respectfully, I disagree. There are already 20 letters chosen (seemingly) arbitrarily. The method does nothing *but* map arbitrary codes to specified functionality. There are already 4 letters in this domain alone (5 if you count i/I). Picking two more characters at random seems appropriately in line with the existing design decisions.

(I don't mean that to sound critical of the way things are. If someone put a lot of thought and effort into hand-choosing each and every character used by unpack, I'm sorry, but I can't figure out what the rationale was.)


> What about adding signed flags to integers, like:
> 
>   n+       signed big endian 2 bytes
>   v+       signed little endian 2 bytes
>   N+       signed big endian 4 bytes
>   V+       signed little endian 4 bytes

IF we were going to overhaul existing codes to make more sense, I would be fine with using modifiers for new functionality.
For example, the above would make sense to me iff we changed the full table like:

      |         signed          |        unsigned         |  
bytes |  big  | little | native |  big  | little | native |
------+-------+--------+--------+-------+--------+--------+ 
    1 |           C+            |           C             | 
    2 |   n+  |    v+  |    s+  |   n   |    v   |    s   |
    4 |   N+  |    V+  |    S+  |   N   |    V   |    S   |

Or, if we were to change existing values, I'd also be fine with:

      |         signed          |        unsigned         |  
bytes |  big  | little | native |  big  | little | native |
------+-------+--------+--------+-------+--------+--------+ 
    1 |           c             |           C             | 
    2 |   n   |    v   |    s   |   N   |    V   |    S   |
    4 |   j   |    k   |    l   |   J   |    K   |    L   |

Note, however, that I am definitely *not* suggesting we make changes to unpack that break old code.

Nobu, your suggestion seems incorrect when we leave the existing functionality:

      |         signed          |        unsigned         |  
bytes |  big  | little | native |  big  | little | native |
------+-------+--------+--------+-------+--------+--------+ 
    1 |           c             |           C             | 
    2 |   n+  |    v+  |    s   |   n   |    v   |    S   |
    4 |   N+  |    V+  |    l   |   N   |    V   |    L   |

With that, we have:
  * lowercase means signed if bytes is 1,
  * or lowercase means 2 bytes if it's not native,
  * or lowercase means signed
  * + means signed, if it's not 1 byte or native

You're adding another partially-supported axis, and using one-character versus two-characters at the same time!


My other concern is that '+' looks like the regexp "one or more" modifier, which is particularly confusing when the '*' modifier serves a similar function (albeit not the same as in regexp). If we have to go with a modifier and not unique characters (though I still think characters are the more consistent-with-the-inconsistent-situation choice) then I would suggest using something like "-n". (Because unsigned numbers are always 0 or positive, the - helps to indicate that the number may be negative.)


> > I don't personally care what letters are chosen, but the
> > parity with the unsigned side of things seems appropriate to
> > make the 2-byte choices lowercase versions of the 4-byte
> > choices. (It seems very confusing to me that the native
> > versions follow a different system, but that's a separate
> > issue.)
> 
> They are same semantics as "C" and "c".

Right. My point is that the semantics of S/s and L/l are different from the semantics of N/n and V/v; C/c seems like a special case, since it has no endian issues.


> IMHO, it's better to those integral characters take an optional
> byte size, so that bignum also can be pack/unpacked, but a
> separate issue indeed.

That certainly would be nice, but definitely outside the scope of what I'm proposing here. Unless, of course, people think that it's OK to overhaul *all* the number-related String#unpack codes in a non-backwards compatible way. If so, then I would much rather do that (make things clean and logical) instead of hacking on this functionality.


Copyright © 2006, Ruby Power and Light, LLC