I/O with regular files

 

Syntax:

num = add[b](path|fd,str)

str = ask([str[,locale],][max])

t/f = close(path|fd)

num = cry(str[,locale])

t/f = flush([path|fd[,auto]])

str = get[b](path|fd[,max])

str = grab(path)

num = plop(path,str)

num = pos(path|fd[,posn])

num = print(str[,locale])

num = put[b](path|fd,str)

num = say(str[,locale])

t/f = sync([path|fd])

 

Synopsis:

It is often said that, in UNIX, everything is a file. Throughout this manual, the word "file" means any stream of data, anywhere, from which a series of characters may be extracted with the get() function and/or into which a series of characters may be inserted with put(). These two are the most important functions for input and output to and from any type of UNIX file. On this page we describe the use of so-called "regular" files which are stored, typically, on a spinning or solid state disk and which are managed by one of the many UNIX file systems. We also describe use of the three "standard" files which are normally connected to a virtual terminal console. The syntax list, above, contains all of the JSUS functions required to access these files. They are all either a variation of get() and put() or designed to provide additional support to them.

However, with different semantics, the get() and put() functions are also used to process non-regular files (pipes and sockets). These are described on their own manual pages and there is also a separate page for JSUS' own database management system, the Scroll library. Yet other pages describe common file manipulation functions such as move() or chown(), directory searching, hard and symbolic links and file meta information. Finally, the lock() and free() functions, used to co-operatively protect files from other programs, also have their own page.

In principle, regular files can contain any and all of the data types supported by the system hardware, including various binary types native to the C and C++ programming languages. But, the JavaScript language has only three fundamental data types, of which only character strings can be reliably and portably stored in files. For practical reasons, then, JSUS normally operates only on character stream files encoded in UTF-8 (which is converted automatically to and from UCS for use within the JavaScript program). However, since binary files must occasionally be accessed from JavaScript, JSUS provides "binary" versions of the primary I/O functions (i.e. the getb(), putb() and addb() functions). These read and write byte streams instead of strings. Each byte is stored in the low 8 bits of one 16-bit character of a JavaScript string. It is the user's responsibility to handle byte ordering, as needed. This is not discussed any further in the rest of this manual, which assumes UTF-8 for all external data.
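The byte-per-character convention can be illustrated in plain JavaScript. The following is a minimal sketch, not part of JSUS: the helper names bytesToString(), stringToBytes() and readUint32BE() are hypothetical, and the code assumes only what is stated above, namely that each byte occupies the low 8 bits of one string character. The last helper shows one way a program might deal with byte ordering, here for a big-endian 32-bit integer.

```javascript
// Hypothetical helpers (not JSUS functions) illustrating the
// byte-per-character convention described for getb(), putb() and addb().

// Pack an array of byte values (0..255) into a string, one byte per character.
function bytesToString(bytes) {
  let s = "";
  for (const b of bytes) {
    s += String.fromCharCode(b & 0xff); // keep only the low 8 bits
  }
  return s;
}

// Unpack a binary string back into an array of byte values.
function stringToBytes(str) {
  const bytes = [];
  for (let i = 0; i < str.length; i++) {
    bytes.push(str.charCodeAt(i) & 0xff); // ignore the high 8 bits
  }
  return bytes;
}

// Read an unsigned 32-bit big-endian integer starting at character offset `at`.
function readUint32BE(str, at) {
  const b = stringToBytes(str.slice(at, at + 4));
  // The final ">>> 0" converts the signed 32-bit result to unsigned.
  return ((b[0] << 24) | (b[1] << 16) | (b[2] << 8) | b[3]) >>> 0;
}
```

A little-endian value would be assembled the same way with the four shifts reversed.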

In UNIX-like operating systems every regular file has at least one name (additional names may be added using the link() function). Typically, file names may consist of up to 255 characters, including UCS characters, and are case-sensitive. File names may not contain either a null or a slash character (the slash is used exclusively as a directory separator, below). Technically, any other character can be used within a file name (including spaces). However, some characters have special meaning for the command shell and those in this list should generally be avoided: \ [ ] { } ( ) ' " % $ * ? ~ < > (as well as spaces and all non-printable characters). File names are stored in special files called directories (sometimes called folders, because that is the way they appear to behave). Within any directory all file names must be unique; however, the same name may be used for a different file in another directory. Since directories are themselves files they must also have names, exactly like those of all other regular and special files. And, they must also be stored in other, higher level directories (or folders).

All of these directories are organized into a hierarchical tree, beginning with the single "root" directory which, uniquely, does not have a name. Each and every file, regular or special, can always be uniquely identified by a "path" through this tree consisting of the names of all directories to be searched and, finally, the name of the file itself. When passed as an argument to a JSUS function, a file's path is represented as a single character string, with each directory name terminated by a slash character ("/", aka a solidus). Because a full path begins at the root, which has no name, the path string normally begins with the terminating slash for the root directory, for example:

/var/my.app/file.name

A path like this, which begins at the root, is called an "absolute" path. However, the string representing this path can easily become long, complex and subject to error. So, UNIX supports a shorter form of path, "relative" to a "current working directory" (cwd, see cwd() and cd() functions). Once a working directory has been set, it and all higher level directories are automatically prepended to the given path string together with their trailing slashes. Thus, if "/var/my.app/" is set as the cwd, the relative path to a file in this directory is nothing more than the file name itself. Similarly, if the cwd contains a sub-directory, holding a program's input file, the relative path to that file might be (without any leading slash):

my.files/input.file
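The prepending rule described above can be sketched in a few lines of plain JavaScript. This is an illustration only: toAbsolute() is a hypothetical helper, not a JSUS function, and it deliberately ignores symbolic links and any special directory entries.

```javascript
// Hypothetical helper (not a JSUS function): join a relative path to a cwd.
function toAbsolute(cwd, path) {
  if (path.startsWith("/")) {
    return path; // already absolute: it begins at the root
  }
  // Make sure the working directory carries its trailing slash.
  const base = cwd.endsWith("/") ? cwd : cwd + "/";
  return base + path;
}

console.log(toAbsolute("/var/my.app/", "my.files/input.file"));
```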

A path string passed to a JSUS function may be either absolute or relative. But, internally, all relative paths are converted to their absolute equivalent (with symbolic links substituted, if necessary). In turn, each absolute path always points to an inode number on a particular storage device — which is the unique, internal reference to a specific file. If a file has more than one name, there will be multiple absolute paths to the same file, but only one, unique inode number. Within any JSUS program, all paths which resolve to the same inode effectively refer to one and the same file. Internally, the state of that file is maintained in a single instance of a C++ object called a "File IN General" or a "File thING" (Fing). All JSUS functions operate through that single Fing, depending on its current state.

This differs markedly from traditional programming in C/C++. The UNIX runtime library contains an open() function which accepts a path string; prepares the file for reading, writing or both; and then returns an integer "File Descriptor" (FD) which serves as a reference to the state information for that file. However, the same file can be "open()ed" several times, in each case returning a different FD. And, confusingly, file status information is sometimes stored at the global level (in the containing directory) and sometimes at the level of the individual open FD. For example, one FD may be opened read-only, starting from a particular offset in the file. A second FD, for the same file, might be opened write-only, starting at a completely different offset.

The internal JSUS Fing object is designed to simplify all this. Within a given program, there is never more than one instance of a file, and its state is passed consistently from one function to another. And, the file is opened automatically the first time it is referenced in a get(), put() or similar call. In fact, there is no open() function in JSUS and the JSUS close() function is entirely optional (the file will be properly closed when the program ends). In addition, all files are always opened with the maximum read-write permissions available to the user, group or everyone (see the umask() function). Thus, a get() may always be followed by a put(), as long as the program's user has write permission for the file.

With few exceptions, a path string argument may always be replaced by the numeric FD for an already "open" file. This is rarely needed but may be essential when another program passes the file by FD (in which case the file path is not available). If the first reference to a file is by FD (rather than path) then all other references must also be by FD. This also applies to pipe and socket files (but is not specifically documented on their manual pages).

 

Basic output operations:

num = put(path|fd,str)

num = add(path|fd,str)

Before data can be read from a file it must obviously first be written. In JSUS, this is done with the put() function, above, or a variation thereof such as add(). Its arguments are two strings, one containing the path to the file, and the other the UCS characters to be written. The specified file need not necessarily exist. Provided the user has write permission to the last directory in the path, the file will be created, empty, in that directory. If the file already exists, and access to the file begins with a put(), the file is first truncated to zero length, before output. Truncation can be avoided by preceding the first put() with a get(), add() or pos() function to the same file.

If the output string is not empty, its contents are converted to UTF-8 and written into the file at its current "position". If the string is empty, nothing is written. A new file is still created, if non-existent, and any existing file is still truncated. Thus, a put() with an empty string can be used to "open" an empty file, without writing to it.

The pos() function, below, may also be used before an initial put() to position the file to any offset within or even beyond its current size. This will automatically "open" (and if necessary create) the file but will not truncate it. Effectively, this gives the first put() the same effect as an add() call.

The add() function is used to append new data after the end of the file. It combines a pos() to the end of the file followed immediately by a put() of the output string.

On success, both put() and add() return the number of UTF-8 bytes actually written (which is often more than the number of characters in the output string). In addition, the file's position is incremented by this value, ready for the next put(). Failure to write the entire output string is treated as a serious error, because the file may be left in an inconsistent state. Fortunately, this is extremely rare.
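The difference between character count and UTF-8 byte count can be checked with standard JavaScript alone. The sketch below assumes an environment that provides the standard TextEncoder class (for example Node.js or a browser); whether JSUS itself exposes it is not stated in this manual. The helper name utf8Length() is hypothetical.

```javascript
// Count the UTF-8 bytes a string occupies once encoded.
// TextEncoder is a standard API that always encodes to UTF-8.
function utf8Length(str) {
  return new TextEncoder().encode(str).length;
}

// "caf\u00e9" ends in e-acute (2 bytes); "\u20ac" is the euro sign (3 bytes).
const samples = ["hello\n", "caf\u00e9", "\u20ac5"];
for (const s of samples) {
  console.log(JSON.stringify(s), "characters:", s.length, "bytes:", utf8Length(s));
}
```

For pure ASCII text the two counts are equal; they diverge as soon as any character outside the ASCII range appears.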

 

Basic input operations:

str = get(path|fd[,max])

Once data is written into a file, any portion of it can be retrieved with the get() function. In most cases, the second argument is omitted and only the file path is passed. In this case, get() returns a single line from the file, starting at the current position. It reads up to and including the next line break ("\n") or up to the end of the file. All data read is converted from UTF-8 and returned to the program as a single UCS character string. The position in the file is also incremented by the number of UTF-8 bytes actually read.

The optional second argument (max) is used to read "blocks" of data, rather than individual lines. It is a number which specifies the size of a block of UTF-8 bytes to be read from the current file position. (At the end of the file, only, a "short" block may be read if there are not enough bytes remaining). Again, the data is converted from UTF-8 and returned to the program as a single UCS string (and the file position is updated). However, block reading should normally be restricted to files whose contents are known to be pure ASCII. This is because, beyond the ASCII range, UTF-8 encodes UCS characters as multi-byte sequences, with up to 6 bytes per character. As a result, it is difficult to compute the size of a block, or its offset within the file.

If the max block size is given as zero, it is interpreted to mean there is no maximum block size and, with a regular file, the entire remaining file is returned as a single string. This is effective, but not quite the same as the grab() function, below, which provides additional useful processing. See also the ask() function, below.

It is treated as an error if a get() is issued with the current file position pointing to other than the first byte of a UTF-8 sequence (basic ASCII characters are always the "first" and only byte). It is not an error if there is no data to be read, for example, if the file is positioned at or beyond the end of file. In this case, null is returned to the program.

 

Current file position:

num = pos(path|fd[,posn])

For each regular file JSUS maintains a current file position. This is an integer value representing the offset within the file at which put() will write new data or get() will read existing data. The pos() function either obtains or sets the current position of a particular file, identified by the path string passed as its first argument. If the optional second argument is omitted, pos() simply returns the current position of the file (and this call never fails). If necessary, an implied seek to this position is performed before every get() and put(), to allow these functions to be safely intermixed, without any chance of get() receiving stale data from earlier put()s to the file.

If the optional second argument is given, it must be an integer and the current position of the file is set to that offset from the beginning of the file (which may be greater than its current size). Then, the new position is returned to the calling program. A special "posn" argument of -1 is used to position the file to the next byte, after the end of the file. Conveniently, the position then returned from the call just happens to be the size of the file, in UTF-8 bytes.
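The interaction of put(), add(), get() and pos() is easier to follow with a small model. The following is a toy, in-memory sketch of the rules described on this page, written in plain JavaScript; it is not JSUS code and makes no claim about how JSUS is implemented. The class name ToyFiles and its internals are hypothetical, zero-filling a gap when writing past the end is an assumption, and the error raised for a misaligned UTF-8 position is not modelled.

```javascript
// Toy model of the documented put()/add()/pos()/get() rules (NOT JSUS itself).
class ToyFiles {
  constructor() {
    this.files = new Map(); // path -> { bytes: number[], posn: number, open: boolean }
  }

  // "Open" a file on first reference; only an initial put() truncates it.
  _open(path, truncate) {
    let f = this.files.get(path);
    if (!f) {
      f = { bytes: [], posn: 0, open: false };
      this.files.set(path, f);
    }
    if (!f.open) {
      f.open = true;
      f.posn = 0;
      if (truncate) f.bytes = [];
    }
    return f;
  }

  put(path, str) {
    const f = this._open(path, true);
    const data = Array.from(new TextEncoder().encode(str));
    if (data.length > 0) {
      while (f.bytes.length < f.posn) f.bytes.push(0); // assumption: zero-fill any gap
      f.bytes.splice(f.posn, data.length, ...data);    // overwrite at the current position
      f.posn += data.length;
    }
    return data.length; // number of UTF-8 bytes written
  }

  add(path, str) {
    const f = this._open(path, false);
    f.posn = f.bytes.length;    // pos() to the end of the file...
    return this.put(path, str); // ...followed immediately by a put()
  }

  pos(path, posn) {
    const f = this._open(path, false);
    if (posn === -1) f.posn = f.bytes.length; // just past the last byte
    else if (posn !== undefined) f.posn = posn;
    return f.posn;
  }

  get(path, max) {
    const f = this._open(path, false);
    if (f.posn >= f.bytes.length) return null; // nothing left to read
    let end;
    if (max === undefined) {
      const nl = f.bytes.indexOf(10, f.posn);  // next "\n", returned with the line
      end = nl === -1 ? f.bytes.length : nl + 1;
    } else if (max === 0) {
      end = f.bytes.length;                    // no maximum: rest of the file
    } else {
      end = Math.min(f.posn + max, f.bytes.length);
    }
    const chunk = f.bytes.slice(f.posn, end);
    f.posn = end;
    return new TextDecoder().decode(new Uint8Array(chunk));
  }

  close(path) {
    const f = this.files.get(path);
    if (f) f.open = false; // the next reference "opens" the file again
    return true;           // placeholder return value for the model
  }
}

const model = new ToyFiles();
model.put("log", "one\n");   // first access is a put(): file created (or truncated)
model.add("log", "two\n");   // appended after the existing data
model.close("log");
model.pos("log", -1);        // reopens without truncating, positions at the end
model.put("log", "three\n"); // so this put() behaves like an add()
model.pos("log", 0);
console.log(model.get("log"), model.pos("log"));
```

In this model, closing the file and then starting again with a put() would discard the existing contents, which is why the example calls pos() before the second put().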

It is essential to note that file sizes, positions and block lengths all exist as integer counts of UTF-8 bytes, which do not directly correlate to UCS characters. In UTF-8, all ASCII characters, each with a code point of 0 through 127, occupy exactly one byte. However, all other characters in the Universal Character Set (UCS) are represented by a series of two to six bytes (the current UTF-8 standard, RFC 3629, limits a sequence to four bytes). It is the programmer's responsibility to ensure the current file position points to the very first byte of a UTF-8 sequence (which, of course, includes any single ASCII character). Otherwise, the following JSUS function against that file will terminate with an error! This potential problem can be largely avoided by using line I/O instead of byte-delimited blocks.
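Whether a given offset falls on the first byte of a UTF-8 sequence can be decided from the byte value alone. The sketch below is plain JavaScript, not a JSUS facility: isSequenceStart() and alignToSequenceStart() are hypothetical helper names, and the bytes are assumed to be available as an array of numbers (how a program obtains them is left open).

```javascript
// Hypothetical helpers (not JSUS functions) for checking UTF-8 boundaries.

// A continuation byte has the bit pattern 10xxxxxx; every other byte
// (an ASCII byte or a multi-byte lead byte) starts a new sequence.
function isSequenceStart(byte) {
  return (byte & 0xc0) !== 0x80;
}

// Move `posn` back to the nearest offset that begins a UTF-8 sequence.
// An offset at or beyond the end of the data is returned as the data length.
function alignToSequenceStart(bytes, posn) {
  let p = Math.min(posn, bytes.length);
  while (p > 0 && p < bytes.length && !isSequenceStart(bytes[p])) {
    p--;
  }
  return p;
}

// "a", the euro sign (3 bytes) and "b": offsets 2 and 3 are mid-sequence.
const data = new TextEncoder().encode("a\u20acb");
console.log(alignToSequenceStart(data, 3));
```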

 

Optionally closing a file:

t/f = close(path|fd)

While many programming languages require a file be explicitly "opened" and "closed", JSUS JavaScript does not. Files are automatically "opened" the first time they are referenced in any JSUS function call. Similarly, files are automatically and safely "closed" when the process terminates (but not when a jump() is made to another program in the same process). However, as long as a file is "open", it uses resources which are then unavailable for other purposes. If a program is going to continue operating for a significant period of time, without accessing a particular file, it makes sense to release the resources held by that file for use by other programs. The close() function does precisely that.

 

Flushing file I/O buffers:

t/f = flush([path|fd[,auto]])

Internally, JSUS performs I/O operations on regular files using the buffered stream facility of the underlying operating system (for better performance). Normally, the stream's buffers are handled automatically by the underlying system. However, it is sometimes preferable to ensure that data already written to the buffers is committed to the file itself. A flush() operation force writes any buffered output data to the file (and also discards any input buffers).

If no argument is given, all open file streams are flushed. If an optional path is given, by itself, only that file is flushed. The optional auto argument is a boolean (true or false) which turns auto-flushing on or off either for a single file or for all files (if an empty path is given). With auto-flushing on, JSUS automatically flushes the file at the end of each put(). Global auto-flush is turned off when a JSUS program is started. This gives the best performance, at the possible cost of potentially losing data on an unreliable system. If auto is changed for a single file, that file is no longer governed by the global auto setting.

It is important to note that flushing does not guarantee that data is actually written out to the external storage medium. For that, the sync() function must be used.

 

Forcing data to external storage:

t/f = sync([path|fd])

Given a path to a regular file, sync() first performs a flush() and then forces its in-memory data (and critical meta data) out to the external storage medium, using the fdatasync() function of the system run-time library. If the path is omitted a global flush() is first performed (above). This is followed by a system sync() call, which force writes all in-memory data, for all files, in all file systems, and all running processes (not just the calling program). This does no harm to the other programs, but sync() will take a relatively long time to complete.

The sync() function is the best way to secure critical data, but it still does not guarantee all data is written to external storage. It blocks until all drives have reported successful completion, but some (or all) of the data may still be stored in the drive's own hardware buffers.

 

Operating on entire files:

str = grab(path)

num = plop(path,str)

There are many situations where it is desirable to read an entire file before processing it, or to write an entire file after all processing is complete. For example, reading a configuration file at the beginning of program execution. For smaller files, particularly, this is much more efficient than reading or writing line by line.

The astute reader has probably noticed this is already possible with the get() and put() functions, above. Indeed, the grab() and plop() functions are nothing more than convenience wrappers around get() and put(), each followed by a close(), releasing all system resources acquired in order to process the file. Like get(), grab() returns the entire file as a single string. Like put(), plop() returns the total number of UTF-8 bytes written to the file. And all other semantics, including error detection, are exactly the same as the wrapped functions.

 

The three "standard" files:

num = print(str[,locale])

num = say(str[,locale])

num = cry(str[,locale])

str = ask([str[,locale],][max])

In UNIX, every program is provided with three "standard" files called stdin, stdout and stderr. These files are always created, before the program even begins to execute. Normally, they are "special" files, connected to a terminal or virtual console device. However, they can also be redirected to regular files elsewhere in the file system, using command line options. (Expert programmers can also do this by modifying symbolic links in the /proc filesystem). Since the paths and FDs of these files are fixed, they do not need to be provided to JSUS. Respectively, the ask(), say() and cry() functions assume paths of /dev/stdin, /dev/stdout and /dev/stderr. It is also perfectly legal to pass these path names to get(), put() and other functions.

The say() function (and its alias, print()) is by far the most commonly used of these. By design, it displays its single string argument on the console device (so don't forget to include a terminating "\n", if needed). However, in a web server environment, stdout is frequently redirected to another file, which is then transmitted to a remote web browser. Thus, in a web based application, the say() function is the primary means of sending HTML pages from a server program to the client. (The print() alias is exactly the same as say(), and is provided for compatibility with other JavaScript environments, which often provide a similar function with this name).

The cry() function is identical to say(), except its string argument is written to /dev/stderr. By default, this is directed to the console device but it may be redirected to an error log file if that is more appropriate. Within a run() function command, it is usually redirected to /dev/stdout, so its output can be captured together with the normal output of the command.

The ask() function first displays an optional text string on /dev/stdout, then reads either a complete line or a single key stroke from /dev/stdin. By default both of these files are connected to the console device but they can, of course, be redirected. While directed to the console, all input is normally echoed back to the terminal, so it can be verified by the user. Special terminal control keys other than NL, TAB, STOP and START are echoed as ^X (where X is the ASCII character 64 code points higher than the control character itself). A negative max size may be given to turn off echoing, allowing sensitive information such as passwords to be entered invisibly.
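The ^X echo rule is simple arithmetic on the character code. The sketch below shows the mapping in plain JavaScript; caretNotation() is a hypothetical helper, not a JSUS function, and it does not apply the exceptions listed above (NL, TAB, STOP and START).

```javascript
// Hypothetical helper (not a JSUS function): caret notation for a control code.
function caretNotation(code) {
  // ASCII control characters occupy code points 0..31; adding 64 maps them
  // onto the printable range "@" (64) through "_" (95).
  if (code < 0 || code > 31) {
    throw new RangeError("not an ASCII control character: " + code);
  }
  return "^" + String.fromCharCode(code + 64);
}

console.log(caretNotation(3), caretNotation(27));
```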

Input from a single key stroke may be read with ask(1) or, invisibly, with ask(-1). This allows a program to read console input key by key, but requires the programmer to provide custom editing procedures, which can become quite complex. In particular, ask(1), or ask(-1), might produce a string with more than one character if the key produces an ESCape sequence. And, on a console device, a "\n" is always printed after an ask("question (y/n)",1) call, as expected. But an ask(-1) call returns only the string for a single key stroke, with no console output at all. This is the form that should be used by programs that need to do all of their own keyboard editing and echoing.

On a console device basic editing can be performed using a max value of anything other than 1, or -1 (including 0). Input and editing is performed in an internal 4K buffer and is usually terminated by pressing the Enter or Return key. Then, the entire contents of the buffer are returned to the program as a single string (which might be longer than the specified max value). During editing, the Backspace key rubs out the previous key stroke and this action is echoed to the display (if max is positive). A ^W key rubs out the previous word and ^U rubs out the entire buffer. The terminating "\n" is not returned at the end of the string and if only the Enter key is pressed an empty string is returned. If the terminating character is required, use get() instead of ask().

Otherwise, the max value is treated exactly the same as in a get() call. If omitted, a single line is returned, up to and including the terminating "\n" (or up to the end of available data). A zero value returns all available data, including any embedded new line characters. If input is not hidden and a text string was provided to ask(), the input string is copied to /dev/stdout, followed by a new line. Unless both of these conditions are met, the input string is only echoed if /dev/stdin is directed to the console. This is a convenience feature which addresses most practical application needs. However, in some situations, it may be better to avoid passing ask() output strings altogether, and to provide all output via separate say() or cry() function calls.

All of these functions accept a string that is output to the user, and which should therefore be understandable to that user. In particular, it should be presented in the user's own preferred language. So, JSUS passes all of these strings through the getText() function before they are passed to the underlying code. This happens only if these functions are used directly, not if the standard files are accessed with get() or put(). This means it is unnecessary to enclose the (default) output text within a getText() call or its alias _(). Obviously, this does not work if no message server is running, or if the message code is invalid or not present in the database. In these cases, only the default text can be displayed to the user. The optional second argument to all of these functions provides a locale name, to be used in place of the program's current locale.

 

See also:

Pipes, named and anonymous

Sockets, Internet and UNIX domains

File status information

File manipulation

Directory management

Hard and symbolic links

Inter-process file locking

The Scroll database library

The text translation database