Submitted by cyent (Tue Apr 18 05:51:42 UTC 2006)
IO.read(file_name) reads the entire file into a string.
string.scan(regexp) {|match| } scans the entire string for regexp, yielding each match.
The limit on doing...
IO.read(file_name).scan( regexp)
is the size of your machine's unused physical memory.
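As a small illustration of the two calls above, here is a minimal sketch. The temp file, its contents and the variable names are made up for the example; only IO.read and String#scan come from the text.

```ruby
require "tempfile"

# Hypothetical sample data, written to a temp file so the sketch is self-contained.
log = Tempfile.new("scan_demo")
log.write("alpha=1\nbeta=22\ngamma=333\n")
log.close

# Without a block, scan returns an array of all matches
# (an array of capture arrays when the regexp has groups).
pairs = IO.read(log.path).scan(/(\w+)=(\d+)/)

# With a block, each match is yielded in turn.
total = 0
IO.read(log.path).scan(/\d+/) { |digits| total += digits.to_i }

puts pairs.inspect
puts total
log.unlink
```

Both calls pull the whole file into one Ruby string before any matching starts, which is where the memory limit comes from.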
Unix has a very handy facility called mmap that allows one to memory-map an entire file, so that the contents of that file appear in your virtual address space.
The operating system handles all the fuss and bother of reading (and forgetting) pages of that file into memory.
Thus it would be very easy to create an mmap'd version, semantically the same as the following function...
def IO.scan( file_name, regexp, &block) IO.read(file_name).scan( regexp, &block) end
But being mmap'd, it could handle files (almost) up to 4GB in size, the limit of a 32-bit address space.
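To make the intended semantics concrete, here is a hedged sketch of how the read-based reference definition behaves from the caller's side. The sample file and the names sample, codes and seen are assumptions for illustration. This is not an mmap implementation, only the behaviour an mmap-backed IO.scan would be expected to preserve.

```ruby
require "tempfile"

# Reference definition from the proposal: read everything, then scan.
# An mmap-backed version would keep this signature and behaviour.
def IO.scan(file_name, regexp, &block)
  IO.read(file_name).scan(regexp, &block)
end

# Hypothetical sample input.
sample = Tempfile.new("io_scan_demo")
sample.write("error 17\nok\nerror 42\n")
sample.close

# No block: an array of capture arrays comes back, flattened here.
codes = IO.scan(sample.path, /error (\d+)/).flatten

# With a block: each match's captures are yielded as an array.
seen = []
IO.scan(sample.path, /error (\d+)/) { |captures| seen << captures.first.to_i }

puts codes.inspect
puts seen.inspect
sample.unlink
```

Callers would not need to change if the body were later swapped for an mmap-based one. Only the memory behaviour would differ.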
The proposal is to reimplement
def IO.scan( file_name, regexp, &block) IO.read(file_name).scan( regexp, &block) end
to use Unix mmap.
Where they do the second mmap and the memcpy, we would do the regexp scan.
So that would have to be mashed together with io_read in io.c and rb_str_scan in string.c.
RCRchive copyright © David Alan Black, 2003-2005.
The alternative to this RCR would be something that hacked the internal representation of a Ruby string so that the data pointed to was mmap'd.
Now I can think of _many_ uses for that.
However, that would be a far harsher change to the String class and the GC system.
Thinking on that a bit more.
One of the Grand Unifying Principles of Unix is...
"Everything (graphics card, directories, sockets, network cards, ....) is a file, and a File is just a stream of Bytes."
Repeat that until it's firmly stuck in your head.
Now take one small step further.
A stream of bytes is just a (possibly mmap'd) String.
Doesn't that make life really really simple?
Existing implementations!
Similar idea discussed here...
Implementation for Unix here...
Implementation for Win32 here...