The Awesome Factor: files

Saturday, September 04, 2010

Filling a file with zeros

In this blog post I'll demonstrate a few ways to fill a file with zeros in Factor. The goal is to write some number of bytes to a file in the least amount of time while using only a small amount of RAM; writing a large file should not fail.

Filling a file with zeros by seeking

The best way of writing a file full of zeros is to seek to one byte from the end of the file, write a zero, and close the file. Here's the code:

: (zero-file) ( n path -- )
    binary 
    [ 1 - seek-absolute seek-output 0 write1 ] with-file-writer ;

ERROR: invalid-file-size n path ;

: zero-file ( n path -- )   
    {
        { [ over 0 < ] [ invalid-file-size ] }
        { [ over 0 = ] [ nip touch-file ] }
        [ (zero-file) ]
    } cond ;

The first thing you'll notice about the zero-file word is that we special-case negative and zero file sizes. Special-casing a zero file length is necessary to avoid seeking to -1, which does everything correctly but throws an error in the process instead of returning normally. Special-casing negative file sizes is important because a negative size is always an error, and although the operation fails overall either way, without the check the file-system can become littered with zero-length files that are created before the exception is thrown.

To call the new word:

123,456,789 "/Users/erg/zeros.bin" zero-file
"/Users/erg/zeros.bin" file-info size>> .
123456789
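
The two special cases can be exercised from the listener. This is only a sketch: it assumes the zero-file and invalid-file-size definitions above are already loaded, and the file names are made up for illustration.

```factor
USING: accessors continuations io.files.info kernel prettyprint ;

! Zero length: no seek happens; the file is created by touch-file.
0 "zeros-empty.bin" zero-file
"zeros-empty.bin" file-info size>> .

! Negative length: invalid-file-size is thrown before any file
! is opened, so nothing is left behind. Print the error tuple.
[ -1 "zeros-bad.bin" zero-file ] [ . ] recover
```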

Copying a zero-stream

With Factor's stream protocol, you can write new kinds of streams that, when read from or written to, do whatever you want. I wrote a read-only zero-stream below that returns zeros whenever you read from it. Wrapping a limit-stream around it, you can give the inexhaustible zero-stream an artificial length, so that copying it reaches an end and terminates.

TUPLE: zero-stream ;

C: <zero-stream> zero-stream

M: zero-stream stream-read drop <byte-array> ;
M: zero-stream stream-read1 drop 0 ;
M: zero-stream stream-read-partial stream-read ;
M: zero-stream dispose drop ;

:: zero-file2 ( n path -- )
    <zero-stream> n limit-stream 
    path binary <file-writer> stream-copy ;

The drawback to this approach is that it creates 8 KB byte-arrays in memory that it immediately writes to disk and then discards.
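
One way around the repeated allocation is to allocate a single zeroed buffer and write it over and over. The following is a minimal sketch, not code from the original post: the word name zero-file-chunked and the 64 KB chunk size are assumptions chosen for illustration.

```factor
USING: byte-arrays io io.encodings.binary io.files kernel
locals math ;
IN: scratchpad

! Arbitrary chunk size; an assumption, not a tuned value.
CONSTANT: chunk-size 65536

! Hypothetical word: writes n zero bytes to path, reusing one
! preallocated buffer for every full chunk.
:: zero-file-chunked ( n path -- )
    chunk-size <byte-array> :> buffer
    n chunk-size /mod :> remainder :> full-chunks
    path binary [
        full-chunks [ buffer write ] times
        ! The tail is smaller than one chunk (possibly empty).
        remainder <byte-array> write
    ] with-file-writer ;
```

Unlike the seeking version, this always writes every byte, so it never produces a sparse file, and its memory use stays bounded by the chunk size.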

Setting the contents of a file directly

Using the set-file-contents word, you can just assign a file's contents to be a sequence. However, this sequence has to fit into memory, so this solution is not as good for our use case.

:: zero-file3 ( n path -- )
    n <byte-array> path binary set-file-contents ;

Bonus: writing random data to a file

The canonical way of copying random data to a file on Unix systems is to use the dd tool to read from /dev/urandom and write to a file. But what about on Windows, where there is no /dev/urandom? We can come up with a cross-platform solution that uses method number two from above, but instead of a zero-stream, we have a random-stream. But then what about efficiency? Well, it turns out that Factor's Mersenne Twister implementation generates random numbers faster than /dev/urandom on my MacBook -- writing a 100MB file from /dev/urandom is about twice as slow as a Factor-only solution. So not only is the Factor solution cross-platform, it's also more efficient.

TUPLE: random-stream ;

C: <random-stream> random-stream

M: random-stream stream-read drop random-bytes ;
M: random-stream stream-read1 drop 256 random ;
M: random-stream stream-read-partial stream-read ;
M: random-stream dispose drop ;

:: stream-copy-n ( from to n -- )
    from n limit-stream to stream-copy ;

:: random-file ( n path -- )
    <random-stream>
    path binary <file-writer> n stream-copy-n ;

! Read from /dev/urandom
:: random-file-urandom ( n path -- )
    [
        <random-stream> path
        binary <file-writer> n stream-copy-n
    ] with-system-random ;

Here are the results:

$ dd if=/dev/urandom of=here.bin bs=100000000 count=1
1+0 records in
1+0 records out
100000000 bytes transferred in 17.384370 secs (5752294 bytes/sec)

100,000,000 "there.bin" random-file
Running time: 5.623136439 seconds

Conclusion

Since Factor has high-level libraries that wrap the low-level libc and system calls used for nonblocking i/o, we don't have to deal with platform-specific quirks at this level of abstraction, like handling EINTR, error codes, or resource cleanup at the operating system level. When a call is interrupted and errno is set to EINTR after it returns, the i/o operation is simply tried again behind the scenes, and only serious i/o errors get thrown. There are many options for correct resource cleanup should an error occur, but the error handling code we used here is incorporated into the stream-copy and with-file-writer words--resources are cleaned up regardless of what happens. We also demonstrated that a Factor word is preferable to a shell script or the dd command for making files full of random data because it's more portable and faster, and that custom streams are easy to define.

Finally, there's actually a faster way to create huge files full of zeros, and that's by using sparse files. Sparse files can start off using virtually no file-system blocks, but can appear to be as large as you wish, and only start to consume more blocks as parts of the file are written. However, support for this is file-system dependent and, overall, sparse files are of questionable use. On Unix file-systems that support sparse files, the first method above should automatically create them with no extra work. Note that on MacOSX, sparse file-systems are supported but not enabled by default. On Windows, however, you have to make a call to DeviceIoControl. If someone wants to make a small contribution to the Factor project, they are welcome to implement creation of sparse files for Windows.

Edit: Thanks to one of the commenters, I rediscovered that there's a Unix syscall, truncate, that creates zero-filled files of any length in constant time on my Mac. This is indeed the best solution for making files full of zeros, and although it is unportable, a Factor library would have no problem using a hook on the OS variable to call truncate on Unix and another method on Windows.
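
As a rough sketch of what the truncate approach could look like: later Factor releases appear to ship a truncate-file word in io.directories with stack effect ( path n -- ). That is an assumption worth checking against your version's documentation, and zero-file4 is a made-up name that is not part of the original post.

```factor
USING: io.directories kernel ;

! Hypothetical word. Assumes io.directories provides
! truncate-file ( path n -- ); verify before relying on it.
! The file is touched first because truncate needs an
! existing file to operate on.
: zero-file4 ( n path -- )
    [ touch-file ] [ swap truncate-file ] bi ;
```

This only yields a file full of zeros when the file is new or shorter than n bytes of non-zero data is acceptable; growing an existing file keeps its old contents and zero-fills just the added tail.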

Tuesday, January 13, 2009

Files and file-systems in Factor, part 2

In my previous post I wrote about the file-info and file-system-info words. Using those words, it's possible to write a portable program for directory listing (dir or ls) and one for file-systems listing (df). Uses for directory listing include file-system utility programs and FTP servers.

File listing tool

The file-listing tool is in the Factor git repository as basis/tools/files/files.factor. Which slots you see depends on how it is configured and on which platform it's running. Directory listings can be sorted by slot, and the default sort is by name.

On Unix platforms, it's configured like this:

    <listing-tool>
       { permissions nlinks user group file-size file-date file-name } >>specs
       { { directory-entry>> name>> <=> } } >>sort

Listing a directory on MacOSX:

-rw-r--r-- 1  erg staff 27185 Nov 28  2008 #factor.el#
drwxr-xr-x 6  erg staff 204   Nov 17  2008 Factor.tmbundle
-rw-r--r-- 1  erg staff 30282 Nov 30  2008 factor.el
-rw-r--r-- 1  erg staff 18037 Nov 17  2008 factor.vim
-rw-r--r-- 1  erg staff 12496 Nov 17  2008 factor.vim.fgen
drwxr-xr-x 26 erg staff 884   Jan 13 11:17 fuel
drwxr-xr-x 8  erg staff 272   Nov 17  2008 icons

Windows platforms take the look of the "dir" command by default:

2009-01-14 00:00:53 <DIR>                Factor.tmbundle
2009-01-14 00:00:53                30282 factor.el
2009-01-14 00:00:53                18037 factor.vim
2009-01-14 00:00:53                12496 factor.vim.fgen
2009-01-14 00:00:53 <DIR>                fuel
2009-01-14 00:00:53 <DIR>                icons

File-systems tool

A tool for disk usage and mounted devices was easy to write once file-system tuples worked everywhere. Once again, it's configurable for whatever you want to see. The default file-system word is:

: file-systems. ( -- )
    {
        device-name available-space free-space used-space
        total-space percent-used mount-point
    } print-file-systems ;

File-systems on a Mac:

device-name   available-space free-space   used-space   total-space  percent-used mount-point
/dev/disk0s2  118599725056    118861869056 200867090432 319728959488 62           /
fdesc         0               0            1024         1024         100          /dev
fdesc         0               0            1024         1024         100          /dev
map -hosts    0               0            0            0            0            /net
map auto_home 0               0            0            0            0            /home

On Windows:

device-name      available-space free-space used-space total-space percent-used mount-point
                 3814506496      3814506496 6911225856 10725732352 64           C:\
VBOXADDITIONS_2. 0               0          27983872   27983872    100          D:\
                 0               0          0          0           0            A:\

Friday, January 09, 2009

Files and file-systems in Factor, part 1

Factor now has an easy, high-level way to get information about files and file-systems across all the platforms that it supports. The API is really simple -- pass a pathname and get information back about the file or file-system as a tuple. The second part of this post will demonstrate a clone of the Unix tools ls, for listing files, and df, for listing file-systems.

File-info

There are now words to get information about files and symlinks, file-info and link-info, which map to the C system calls stat and lstat. Some slots are shared across all platforms while others are only present on a particular platform. There are symbols representing all of the file types, like +regular-file+, +directory+, and +symbolic-link+. Here are some examples.

MacOSX

( scratchpad ) "resource:license.txt" file-info .
T{ bsd-file-info
    { type +regular-file+ }
    { size 1252 }
    { permissions 33188 }
    { created
        T{ timestamp
            { year 2008 }
            { month 11 }
            { day 17 }
            { hour 23 }
            { minute 34 }
            { second 5 }
            { gmt-offset T{ duration { hour -6 } } }
        }
    }
    { modified
        T{ timestamp
            { year 2008 }
            { month 11 }
            { day 17 }
            { hour 23 }
            { minute 34 }
            { second 5 }
            { gmt-offset T{ duration { hour -6 } } }
        }
    }
    { accessed
        T{ timestamp
            { year 2008 }
            { month 12 }
            { day 9 }
            { hour 12 }
            { minute 34 }
            { second 8 }
            { gmt-offset T{ duration { hour -6 } } }
        }
    }
    { uid 501 }
    { gid 20 }
    { dev 234881026 }
    { ino 992362 }
    { nlink 1 }
    { rdev 0 }
    { blocks 8 }
    { blocksize 4096 }
    { birth-time
        T{ timestamp
            { year 2008 }
            { month 11 }
            { day 17 }
            { hour 23 }
            { minute 34 }
            { second 5 }
            { gmt-offset T{ duration { hour -6 } } }
        }
    }
    { flags 0 }
    { gen 0 }
}

Windows XP

( scratchpad ) "resource:license.txt" file-info .
T{ windows-file-info
    { type +regular-file+ }
    { size 1252 }
    { permissions 32 }
    { created
        T{ timestamp
            { year 2008 }
            { month 3 }
            { day 23 }
            { hour 23 }
            { minute 28 }
            { second 12 }
        }
    }
    { modified
        T{ timestamp
            { year 2008 }
            { month 3 }
            { day 27 }
            { hour 23 }
            { minute 24 }
            { second 12 }
        }
    }
    { accessed
        T{ timestamp
            { year 2008 }
            { month 9 }
            { day 19 }
            { hour 23 }
            { minute 8 }
            { second 41 }
        }
    }
    { attributes { +archive+ } }
}

FreeBSD

( scratchpad ) "resource:license.txt" file-info .
T{ bsd-file-info
    { type +regular-file+ }
    { size 1252 }
    { permissions 33188 }
    { created
        T{ timestamp
            { year 2008 }
            { month 4 }
            { day 6 }
            { hour 12 }
            { minute 6 }
            { second 53 }
            { gmt-offset T{ duration { hour -6 } } }
        }
    }
    { modified
        T{ timestamp
            { year 2008 }
            { month 4 }
            { day 6 }
            { hour 12 }
            { minute 6 }
            { second 53 }
            { gmt-offset T{ duration { hour -6 } } }
        }
    }
    { accessed
        T{ timestamp
            { year 2008 }
            { month 4 }
            { day 6 }
            { hour 12 }
            { minute 6 }
            { second 59 }
            { gmt-offset T{ duration { hour -6 } } }
        }
    }
    { uid 1002 }
    { gid 1002 }
    { dev 89 }
    { ino 343452 }
    { nlink 1 }
    { rdev 1348575 }
    { blocks 4 }
    { blocksize 4096 }
    { birth-time
        T{ timestamp
            { year 2008 }
            { month 4 }
            { day 6 }
            { hour 12 }
            { minute 6 }
            { second 53 }
            { gmt-offset T{ duration { hour -6 } } }
        }
    }
    { flags 0 }
    { gen 0 }
}
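
Because the result is an ordinary tuple, the shared slots can be read with the usual accessors on any platform. Here is a small sketch; the word regular-file-size is a hypothetical helper, not part of Factor.

```factor
USING: accessors io.files.info io.files.types kernel
prettyprint ;

! Hypothetical helper: the size in bytes if path names a
! regular file, otherwise f. It uses only the type and size
! slots, which appear in every listing above.
: regular-file-size ( path -- size/f )
    file-info dup type>> +regular-file+ =
    [ size>> ] [ drop f ] if ;

"resource:license.txt" regular-file-size .
```

Swapping file-info for link-info would describe a symlink itself rather than the file it points to.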

File Systems

The file-system utility word above works on file-system tuples that contain cross-platform information like the device name, the mount point, and the number of free, used, and total bytes. A file-system tuple on Unix has all of the file-system information found in both statfs and statvfs, while a Windows file-system object has the device-id, volume name, and byte usage slots. There is not a single win32 API call that gives as much information as on Unix systems -- instead I call a combination of GetDiskFreeSpaceEx, FindFirstVolume, GetVolumePathNamesForVolumeName, and GetVolumeInformation.

On every Unix besides Linux, there is a member of the statfs or statvfs structure that gives you the file-system that contains the file. Linux has no such member, so I had to roll my own for it to work the same way across all platforms. The algorithm is pretty simple: the follow-links word follows links up to 10 times (configurable) and, once it stops, finds the parent directory and follows the links again until the directory is one of the mount points listed in the /etc/mtab file. If there is circularity or a broken link, it throws an error.

FreeBSD

( scratchpad ) "/" file-system-info .
T{ freebsd-file-system-info
    { device-name "/dev/da0s1a" }
    { mount-point "/" }
    { type "ufs" }
    { available-space 368117760 }
    { free-space 409702400 }
    { used-space 110110720 }
    { total-space 519813120 }
    { block-size 2048 }
    { preferred-block-size 2048 }
    { blocks 253815 }
    { blocks-free 200050 }
    { blocks-available 179745 }
    { files 65790 }
    { files-free 61501 }
    { files-available 61501 }
    { name-max 255 }
    { flags 20480 }
    { id { 0 0 } }
    { version 537068824 }
    { io-size 16384 }
    { owner 0 }
    { syncreads 0 }
    { syncwrites 0 }
    { asyncreads 0 }
    { asyncwrites 0 }
}

Windows XP 64

( scratchpad ) "k:\\" file-system-info .
T{ win32-file-system-info
    { device-name "" }
    { mount-point "k:\\" }
    { type "NTFS" }
    { available-space 174530142208 }
    { free-space 174530142208 }
    { used-space 225547444224 }
    { total-space 400077586432 }
    { max-component 255 }
    { flags 459007 }
    { device-serial 3695676537 }
}

The Factor build farm will soon start using file-system-info to report when a drive fills up.
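
A check of that kind could be sketched as follows. The words percent, percent-full, and nearly-full? are hypothetical names invented for this example, and the real build farm code may look nothing like it; only file-system-info and its used-space and total-space slots come from the listings above.

```factor
USING: accessors io.files.info kernel math ;

! Integer percentage of total that is used; 0 for empty
! pseudo file-systems that report a total of zero bytes.
: percent ( used total -- n )
    dup zero? [ 2drop 0 ] [ [ 100 * ] dip /i ] if ;

! Percentage in use on the file-system that contains path.
: percent-full ( path -- n )
    file-system-info [ used-space>> ] [ total-space>> ] bi
    percent ;

! True when usage has reached the given threshold percentage.
: nearly-full? ( path threshold -- ? )
    [ percent-full ] dip >= ;
```

For example, "/" 90 nearly-full? would report whether the root file-system is at least 90 percent full.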
