Memory mapped file I/O

Syntax:

var = mm(path|fd,ofs,type[,value])

num = mind(path|fd,[-]ofs,[-]len,str)

str = fold(str[,script])

var = bigi(num|str)

Synopsis:

The mm() function implements memory mapped I/O for regular files. Not only is this significantly faster than conventional I/O using get() and put(), it also gives JavaScript programs direct access to native system data types (such as binary integers). If the path in the first argument does not exist, it is created as a one-byte file initialized to binary zero, because a zero-length file cannot be mapped. The file is mapped into the virtual memory of the current process at a unique address, but it might also be shared with other processes at different addresses. This is why mm() requires the ofs (offset) into the file as its second argument: all sharing processes then refer to the same data elements. An offset of -1 refers to the current position in the file mapping, which is normally the byte immediately following the last byte read or written. As with other files, this position can be inspected or set using the pos() function.

The third argument specifies the type of a data value to be read from or written into the mapping. Most commonly, this is a positive integer giving the maximum length of a line of text, usually terminated with a \n (new line) and containing no embedded null characters. A negative integer represents a byte array, which may contain embedded null characters and a terminating \n. On input, if there is no terminating new line, exactly abs(type) bytes are read and returned as a JavaScript string. Any early terminating new line character is returned at the end of the string. If ofs points to a series of one or more successive new line characters, mm() returns only the last of these (as an empty line) and the file pointer is set to the first byte of the next line.

The optional value argument specifies a new value to be stored at the specified offset, after conversion to the specified type. For an output string, up to abs(type) bytes are written and any excess bytes are truncated. If fewer bytes are written, the string is padded to the maximum specified with new line characters, each representing an empty text line in the file. On success, mm() returns the number of bytes actually written. If any error is detected, a null is returned, with errno set appropriately.
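The truncation and padding rule for output strings can be modeled in plain JavaScript. This is only an illustrative sketch, not the mm() implementation: padField() is a hypothetical name, and the sketch assumes one byte per character (ASCII text).

```javascript
// Hypothetical model of how an output string fills a field of abs(type) bytes.
// Assumes ASCII text, so one character corresponds to one byte.
function padField(value, type) {
  const width = Math.abs(type);
  // Excess characters beyond the field width are truncated.
  const text = String(value).slice(0, width);
  // A short value is padded to the full width with new line characters.
  return text + '\n'.repeat(width - text.length);
}

const field = padField('abc', 6); // 'abc' followed by three new lines
```

Under this model, each padding new line would later read back as an empty text line, which matches the description above.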

Less commonly, other data types may be specified by a type string with one of the following keywords: char, short, int, long, float, or double. In most cases, only the initial letter is significant (and needs to be included in the type string). However, the integer value keywords may all be prefixed with the letter "u" (meaning unsigned) and, in these cases, at least the first two letters must be included (e.g. uint or ui). In all cases, the specified type is equivalent to what would be expected on a 64-bit computer (i.e. a long and a double are both 64 bits). For performance, these values are always stored in machine byte order and are passed to and from the program as normal JavaScript numbers (which can hold integer values up to 2**53).
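The abbreviation rules for type strings can be sketched as a small parser. This is a hypothetical model rather than the actual mm() code: parseType() is an invented name, and the sizes for char, short, int and float (1, 2, 4 and 4 bytes) are assumptions based on common 64-bit conventions, since the text above only confirms that long and double are 64 bits.

```javascript
// Hypothetical lookup table: initial letter -> [keyword, size in bytes].
// Only the 8-byte sizes of long and double are stated in the documentation;
// the other sizes are assumed.
const TYPE_TABLE = {
  c: ['char', 1],
  s: ['short', 2],
  i: ['int', 4],
  l: ['long', 8],
  f: ['float', 4],
  d: ['double', 8],
};

// Returns { name, unsigned, bytes } or null if the type string is not recognized.
function parseType(type) {
  let text = String(type).toLowerCase();
  let unsigned = false;
  if (text[0] === 'u') {
    // A "u" prefix must be followed by at least the initial of an integer keyword.
    unsigned = true;
    text = text.slice(1);
  }
  const entry = TYPE_TABLE[text[0]];
  if (!entry) return null;
  const [name, bytes] = entry;
  // Only the integer keywords may carry the unsigned prefix.
  if (unsigned && (name === 'float' || name === 'double')) return null;
  return { name, unsigned, bytes };
}
```

With this model, parseType('ui') and parseType('uint') describe the same unsigned 4-byte integer, while a bare 'u' is rejected.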

The mind() function searches the mapping for a particular str (string), starting at a given ofs (offset), and proceeding for a maximum of abs(len) bytes. A negative ofs value initiates a reverse search, from abs(ofs) toward the beginning of the file, for the specified len. A len of zero searches to the end (or beginning) of the file mapping. A negative len causes the str to be treated as a byte array, as above. On success, mind() returns the offset of the beginning of the string. If the string is not found, a null is returned but errno is not set (since this is not an error).

As a special case, given a string with just one character between \uFF00 and \uFFFF, mind() searches for the first byte which does not match the given character, modulo 256. It then returns the number of characters which were matched. This can be used to count a series of identical characters, for example, mind(path,-ofs,0,'\uFF0A'). This counts the number of successive empty lines (\n), searching back from ofs and proceeding toward the beginning of the file.
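The run-counting behavior can be illustrated with a small model over an in-memory byte array. This is a sketch only: countRun() is a hypothetical name, the step argument stands in for the sign of ofs, and the sketch assumes that the byte at the starting offset is itself included in the count, which the description above does not specify.

```javascript
// Hypothetical model of the special case: count successive bytes equal to
// (character code modulo 256), scanning forward (step = 1) or backward (step = -1).
// Assumption: the byte at `start` is the first one examined.
function countRun(bytes, start, step, pattern) {
  const target = pattern.charCodeAt(0) % 256; // '\uFF0A' -> 0x0A (new line)
  let count = 0;
  for (let i = start; i >= 0 && i < bytes.length && bytes[i] === target; i += step) {
    count++;
  }
  return count;
}

// Sample data: "ab", three empty lines, then "cd".
const sample = Uint8Array.from('ab\n\n\ncd', (ch) => ch.charCodeAt(0));
const emptyLines = countRun(sample, 4, -1, '\uFF0A'); // scans back from offset 4
```

Here emptyLines is 3, because offsets 4, 3 and 2 hold new line bytes and offset 1 holds the first non-matching byte.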

The mind() function never modifies the file's current position, even if the search is successful. As with mm(), a one-byte file is created automatically if the specified path does not already exist.

Helper functions:

The fold() function is intended to ease the problem of sorting and comparing strings containing Unicode characters. There are over a million possible code points in the UCS and, unlike ASCII, they are not arranged in any binary alphabetical sequence, so they cannot be compared directly for sorting purposes. This function takes a Unicode string in UTF-16 and "folds" it down into our own 8-bit Trans-Unicode Grapheme Set (TUNGS), which ignores case, accents, punctuation, language rules, etc. The resulting 8-bit byte codes are then packed two by two into the 16-bit characters of the returned JavaScript string. TUNGS strings can be compared directly for sorting purposes, and they require only half the memory space. At present, only the ISO-8859-1 character set is supported, which covers most Western European and American scripts. Other scripts (Greek, Hebrew, etc.) will be added in the future, and these can be specified by the optional script code. If omitted, this defaults to zero, identifying ISO-8859-1. Further discussion is, unfortunately, beyond the scope of this documentation. If errors are detected, fold() returns an empty string.
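The general idea of folding and packing can be sketched with standard JavaScript. This is not the TUNGS mapping, which is not documented here. As a stand-in, the hypothetical foldSketch() below strips accents with Unicode normalization, lowercases, drops everything except a-z and 0-9, and packs two 8-bit codes into each 16-bit character, high byte first.

```javascript
// Stand-in for the folding idea only; the real TUNGS code assignments are unknown.
function foldSketch(str) {
  const folded = str
    .normalize('NFD')                 // split accented letters into base + mark
    .replace(/[\u0300-\u036f]/g, '')  // drop the combining marks (accents)
    .toLowerCase()                    // ignore case
    .replace(/[^a-z0-9]/g, '');       // ignore punctuation and everything else
  let out = '';
  for (let i = 0; i < folded.length; i += 2) {
    const high = folded.charCodeAt(i);
    // An odd-length string is padded with a zero byte in the last character.
    const low = i + 1 < folded.length ? folded.charCodeAt(i + 1) : 0;
    out += String.fromCharCode(high * 256 + low);
  }
  return out;
}

// Case and accents no longer affect comparison.
const same = foldSketch('\u00c9t\u00e9') === foldSketch('ETE'); // true
```

Because the earlier byte of each pair occupies the high half of the character, ordinary string comparison of the packed results follows the order of the underlying bytes.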

In its present form, the fold() function is deprecated and should not be used to produce strings for long-term storage. The original intent was simply to create strings for the efficient collation of keys used in Scroll database files. Based on further research, we have decided upon a single, expanded Latin character set which applies directly to about five-sixths of the literate global human population. In the future, we will require other users to transliterate their own character sets into the Latin standard, and we will provide additional functionality to assist with this.

The bigi() function addresses the need to process large binary integers (big ints) without regard to native byte ordering or the limitations of JavaScript numbers. If passed a number, bigi(num) returns a 4-character (8 byte) JavaScript string containing a 53-bit integer in network byte order (big-endian). The string.slice() method can be used to extract 16-, 32- or 48-bit integers from the string. If passed a string of 1 to 4 characters, bigi(str) returns a JavaScript number holding the integer representation of the bits found in the string. Big int strings (of the same size) can be compared directly as strings, even when concatenated with other strings. This makes them highly appropriate for use as key fields and offsets in index systems. However, they cannot be compared directly after conversion to UTF-8, and there are no space savings when stored externally in that encoding. Therefore, large integers are best stored in external files as decimal character strings.
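The big-endian string layout can be modeled in plain JavaScript to show why such strings compare correctly. This is a sketch under assumptions, not the bigi() implementation: toBigi() and fromBigi() are hypothetical names, and negative numbers are rejected because the description above does not say how they are handled.

```javascript
// Hypothetical model: number -> 4-character string, most significant 16 bits first.
function toBigi(num) {
  if (!Number.isSafeInteger(num) || num < 0) {
    throw new RangeError('expected a non-negative safe integer');
  }
  let out = '';
  for (let i = 0; i < 4; i++) {
    // Arithmetic is used instead of bitwise operators, which truncate to 32 bits.
    out = String.fromCharCode(num % 65536) + out;
    num = Math.floor(num / 65536);
  }
  return out;
}

// Hypothetical model: string of 1 to 4 characters -> number.
function fromBigi(str) {
  let num = 0;
  for (let i = 0; i < str.length; i++) {
    num = num * 65536 + str.charCodeAt(i);
  }
  return num;
}

const key = toBigi(0x12345678);
const low32 = fromBigi(key.slice(2)); // the last two characters hold the low 32 bits
```

Since JavaScript compares strings by 16-bit code unit, toBigi(255) < toBigi(256) holds in this model, just as 255 < 256 does numerically.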
