RCR 332: mmap'd version of IO.scan( file_name, regexp)

Submitted by cyent (Tue Apr 18 05:51:42 UTC 2006)

Abstract

Ruby currently has two very useful methods.

IO.read(file_name) reads the entire file into a string.

string.scan(regexp) {|match| } scans the entire string for regexp, yielding each match.

The limit on doing...

IO.read(file_name).scan( regexp)

is the size of your machine's unused physical memory.

Unix has a very handy facility called mmap that allows one to memory-map an entire file, so that the contents of that file appear mapped into your virtual address space.

The operating system handles all the fuss and bother of reading (and forgetting) pages of that file into memory.

Thus it would be very easy to create an mmap'd version, semantically the same as the following function...

 def IO.scan( file_name, regexp, &block)
   IO.read(file_name).scan( regexp, &block)
 end

But, being mmap'd, it could handle files (almost) up to 4GB in size, the limit of a 32-bit address space.

Problem

IO.read(file_name).scan(regexp) is limited to the available physical memory on your system.
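
For comparison, a stopgap that works today without mmap is to scan the file one line at a time. The following is only a minimal sketch: `scan_lines` is a hypothetical helper name, not part of Ruby or of this RCR, and it assumes that no match ever spans a line boundary, which a true mmap'd scan would not require.

```ruby
require "tempfile"

# Hypothetical helper (not part of Ruby or of this RCR): scan a file one
# line at a time, so only the current line is held in memory.
# Assumption: no match ever spans a line boundary.
def scan_lines(file_name, regexp)
  # Without a block, hand back an Enumerator over the matches.
  return enum_for(:scan_lines, file_name, regexp) unless block_given?

  File.foreach(file_name) do |line|
    line.scan(regexp) { |match| yield match }
  end
end

# Usage with a throwaway file.
Tempfile.create("rcr332") do |f|
  f.write("error: disk\nok\nerror: net\n")
  f.flush

  scan_lines(f.path, /error: \w+/) { |m| puts m }
  p scan_lines(f.path, /error: \w+/).to_a
end
```

Memory use here is bounded by the longest line rather than by the file size, so a file with very long lines, or a regexp meant to match across newlines, still needs something like the mmap'd approach proposed here.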

Proposal

Reimplement...

 def IO.scan( file_name, regexp, &block)
   IO.read(file_name).scan( regexp, &block)
 end

to use Unix mmap.
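
The reference semantics can be exercised directly, which also pins down what an mmap'd version would be expected to return. This is a minimal sketch using the read-everything definition from the proposal; the temp file and its contents are assumptions made up for illustration.

```ruby
require "tempfile"

# Reference semantics from the RCR: slurp the file, then scan the string.
# An mmap'd implementation would be expected to give the same results.
def IO.scan(file_name, regexp, &block)
  IO.read(file_name).scan(regexp, &block)
end

Tempfile.create("rcr332") do |f|
  f.write("alpha 12\nbeta 345\n")
  f.flush

  # Without a block, the matches come back as an array.
  p IO.scan(f.path, /\d+/)

  # With a block, each match is yielded in turn; with capture groups,
  # each match is an array of the captured substrings.
  IO.scan(f.path, /(\w+) (\d+)/) { |name, num| puts "#{name}=#{num}" }
end
```

Because the method simply delegates to String#scan, it inherits its behaviour for capture groups and for the block and non-block forms, and that is the contract an mmap-backed version would need to keep.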

Analysis

No language-level change, merely an extension to the existing io.c.

Implementation

Here is some example code.

Where they do the second mmap and the memcpy, we would do the regexp scan.

So that would have to be mashed together with io_read in io.c and rb_str_scan in string.c.

Comments
Hmm. Just thinking. Before the STL existed I wrote my own template library in C++. One of its most useful features was that I could mmap a string to a file, and thereafter the entire file behaved as an ordinary string.

The alternative to this RCR would be something that hacked the internal representation of a Ruby string so that the data pointed to was mmap'd.

Now I can think of _many_ uses for that.

However, that would be a far more invasive change to the String class and the GC system.


Thinking on that a bit more.

One of the Grand Unifying Principles of Unix is...

"Everything (graphics card, directories, sockets, network cards, ....) is a file, and a File is just a stream of Bytes."

Repeat that until it's firmly stuck in your head.

Now take one small step further.

A stream of bytes is just a (possibly mmap'd) String.

Doesn't that make life really really simple?


Existing implementations!

Similar idea discussed here...

Implementation for Unix here...

Implementation for Win32 here...


Current voting

Strongly opposed 0
Opposed 0
Neutral 0
In favor 0
Strongly advocate 1