The JSUS "Scroll" library

(A Network Database Manager)

Syntax:

scr = new Scroll(path);

rec = scr.rec();

rec = scr.loc(key [ ,rec [,time] ]);

var = scr.ins(rec);

var = scr.del(rec);

var = scr.rep(rec);

t/f = scr.free( [ rec ] )

t/f = scr.close();

t/f = scr.serve( [path] );

jsdb [ -h ] [ -c ] [ -k ] [ [scroll-path][@socket-path] ] . . .

Synopsis:

Consider a simple table containing a list of business contacts. Each horizontal row in the table contains all the information for one person, and the table is vertically divided into columns, which hold similar items for all of the rows (name, email address, etc). This tabular structure has been used for written business records for more than a thousand years (account ledgers, for example). Today, in the information age, the same paradigm is still applied in most database systems and in all spreadsheet programs. In practice, like a hand-written ledger, a spreadsheet really is a type of database and manual entry of data into a spreadsheet is often the easiest way to manage information for general business purposes. Unfortunately, spreadsheet files use proprietary data formats which make them impractical for custom business applications (especially in JavaScript).

Nevertheless, all spreadsheet programs do provide various means of exporting their data in more usable formats. At the very simplest, all of these programs allow a spreadsheet to be printed, often over several pages, which can then be used in traditional manual systems (without any further need for a computer). Sometimes, because of the amount of data, it helps to tape the pages together, forming a single, large sheet of paper. Ironically, for compact storage, this single sheet is sometimes rolled up into a scroll — a truly ancient form of business record predated only by clay tablets. This is the paradigm JSUS has chosen for its basic database technology, the Scroll library.

Essentially, a scroll is a simple text file in Comma Separated Value format (csv), containing a single table of rows and columns. It can be created with any standard text editor or it can be created directly in the JavaScript program itself. It can also be exported from a spreadsheet with the .csv file extension (which is supported by all spreadsheet programs). In this format, each row of the table is written as a character string, terminated by a new line character (\n). Conceptually, the column (or cell) values in each row are separated from each other by a comma, but this can cause confusion because commas are often also part of the value itself. Because of this, the csv specification allows any character to be used as a separator. The Scroll library requires that a tab character (\t) be used as the separator and this is then often called a Tab Separated Value file (tsv). As allowed by the specification, the .csv file extension is not required (and should probably be avoided). We also prefer to use traditional IT terminology, where a row or line is called a "record", and a column or cell is called a "field". In principle, each field may contain any UCS character, except an embedded tab or newline character. (Embedded null characters may also create problems for some spreadsheet programs but not for the Scroll library itself).

This choice of the tsv file format allows spreadsheet programs to handle some of the work related to simple database design, management and even reporting. In some cases, it is possible to maintain a master copy of the database as a spreadsheet file, exporting the actual scroll file each time the content is updated. This can greatly reduce the amount of custom programming required to support static datasets (e.g. a product look-up file). On the other hand, scroll files can also be imported into a spreadsheet for further processing. For example, an online sales database might be imported into a spreadsheet to create monthly sales reports. This approach can greatly reduce the total amount, time and cost of custom programming required for smaller business applications.

Spreadsheet programs are not recommended for large and/or complex databases, however. The Scroll library includes advanced features which are difficult or impossible to properly manage within the spreadsheet paradigm. In addition, spreadsheet programs often waste storage space in exported .csv files by writing unnecessary tab characters at the end of short lines (rows with empty cells at the end). For professional quality database applications, it is probably best to include all scroll file processing in the JavaScript programs themselves, even though the following discussions frequently refer to the use of spreadsheets.

The library manages scroll files using JSUS' built-in memory mapped I/O functions. Smaller files, up to a few thousand records, tend to remain resident in RAM memory, where they can be accessed almost instantaneously. Larger files may be occasionally swapped between RAM and external storage, using the system paging mechanism, but I/O activity is typically minimal and high-performance. In addition, the library automatically maintains multiple, user-specified, in-memory indices — avoiding lengthy searches through the file itself. These are implemented using the separate Index library (which can also be used to provide simple, indexed-sequential, key+data collection objects). Performance can be further enhanced, in a multi-user environment, by running the library as a scroll file server program (below). For many applications, the Scroll library is an order of magnitude faster than SQL database managers. Yet it provides all of the key features needed to create highly complex applications using a modernized form of the Network Database Model (using keys instead of pointers).

The DataBase Definition:

Before a scroll file can be accessed by the library it must physically exist and must conform to certain limitations (including the tsv specification). In particular, it must begin with a set of header records which, together, comprise the DataBase Definition (DBD). These must be terminated either by end-of-file, or by an empty or "comment" record, which separates the DBD from the actual data records in the file. Empty records contain nothing but a single new line character (\n). A series of these are typically used to fill data space deleted from the file and appear in a spreadsheet program as empty "rows". Comment records begin with an asterisk as the first character of the first (or only) field. Within a spreadsheet program, comment records may be used to include formulae, etc., without affecting the data records processed by the JavaScript program. Both record types are completely ignored by the Scroll library and cannot be read by user programs. Both may appear anywhere in the file, after the DBD.

The first four DBD records must always be present and always in the same order. The first two contain heading display information for use by external programs (such as a spreadsheet or a database administration program). The very first record (dbd1) contains an overall title for the dataset and usually consists of a single text string field. The remainder of this record may contain any number of fields, with any content. The second record (dbd2) contains display headings for the fields (or columns). It must contain one text field (cell) for each possible data field (column), again with any desired content. Neither of these records are used by the library itself, but their content is passed to the JavaScript program — as public properties of the Scroll object — in the form of a text string (.dbd1) and an array of text strings (.dbd2).

The third record (dbd3) contains a tab-separated list of the names used to refer to data record fields within the JavaScript program. There must be exactly one name for each possible data field (column); they must exactly match the corresponding field in the dbd2 record; and each must be a unique, valid, JavaScript identifier (without embedded escape sequences). Data is passed between the library and the program as record objects, with properties identified by these field names. An error is thrown on any attempt to read a data record containing more fields than there are names in the dbd3 record. Conversely, any missing fields are simply treated as being empty and are passed to the program with minimal property values (false, zero or an empty string).

The fourth record (dbd4) contains a list of descriptors, one for each record field, which specifies how content is to be handled on input and output. Each begins with a single letter representing one of the elementary JavaScript data types — b (boolean), n (number) or s (string). The letter u (unfolded) may also be used to specify a string which is not "folded" down when used as part of an index key (below). Optionally, the type letter may be immediately followed by the maximum number of UCS characters allowed in the field when written to the file (for numbers and strings only). This is not necessarily the same as the number of UTF-8 bytes physically stored in the file (which may be greater). A zero (or omitted) length defaults to the maximum: 1 for a boolean, 32 for a number or 1024 for a string.

Finally, the field length may be followed by a dot and one or two digits specifying the decimal precision of a number, or, the numeric code of the "script" type contained in the field's strings. If omitted, for a number, the field may be stored in the file in any legal format, including exponential notation. Otherwise, within the file, the number must be stored in fixed point notation, rounded to "precision" decimal places. If omitted, for a string, this defaults to zero, the code for the ISO-8859-1 script (other script codes will be added in future releases). For any field descriptor that is completely empty, or missing, the library assumes s1024.0 (a maximum string in ISO8859-1 script).

The Scroll library also allows any field descriptor to define a list or array of numbers or strings, simply by upper-casing the data type (N, S or U). In the file, a semi-colon (;) is used to separate the members of the list — which means such strings cannot ever contain this character as data. Within the program, the field's property value is presented as an array of numbers or strings, with the semi-colons removed. An empty array value is represented in the file by two semi-colons. A leading or trailing semi-colon means the first or last array value is zero or an empty string. Boolean fields may not be arrays, neither may the very first field of the record (because this is used as the primary index key, described next).

Defining index keys:

Scroll files are intended to be fairly small, so they remain resident not only in virtual memory but, for the most part, in real memory as well. However, their records are never guaranteed to be ordered in any particular sequence. A sequential search, even in a memory mapped file, cannot be expected to quickly locate a given record, or the next in an ordered sequence. To allow full, indexed-sequential searching the Scroll library supports multiple Index objects, which are dynamically created and maintained opaquely within each Scroll object. The first of these is defined automatically and is called the primary index. Effectively, it sequences the scroll records in order of the value of their first field (the primary key). Primary keys must be unique and should be either integers or short character strings.

Alternate, secondary indices can be defined within the DBD by creating index-key records (dbdx). Zero or more of these may be present, immediately after dbd4, and each contains exactly two, tab-delimited fields. The first holds the name of the index-key, followed by a colon (:), while the second contains a list of key field names, separated by semi-colons (;). The key name must be a valid identifier, and must be different from all other key names, but it may be the same as a field name. The primary key is automatically named identically to the first record field (and may also be identified by an empty string in calls to Scroll methods, below). The key's field name list provides the sort order for the index, which may be searched randomly or sequentially. Alternate keys do not need to be unique, because they are pointed at the primary key (which then makes them unique). Primary keys are pointed directly at the offset of the record in the file (which is why they must be unique).

Only strings and numbers may be defined as key fields and, when numbers are used, they must be integers (any sign is ignored). When used as key fields, strings may normally contain only those characters from the first 256 code points of Unicode (ISO-8859-1). This includes ASCII and most Roman scripts. The library orders these fields by folding characters down into an internal character set — ignoring case, accents, ligatures, punctuation, etc. Character codes above 255 are folded down to 255. This may offend cultural sensitivities but it does provide a single sorting standard which is excellent for English and reasonable for most western European languages. (Additional scripts will be added in future — Greek, Cyrillic, Hebrew, etc.). In those cases where this is not appropriate, character folding can be suppressed by defining the field with the letter u (unfolded), instead of s (string). Such fields are compared purely on the basis of the binary values of their characters and, therefore, may contain any UCS code points. In all other respects they are treated exactly the same as other strings.

Except for the primary key, any one field in the list may have been defined with an array type descriptor. In this case, when the record is indexed, a separate key is created for each member of the array. Similarly, records may contain fields with arrays of keys pointing to multiple records in other scroll files. These two powerful features allow the design of complex databases with efficient many-to-many relationships, both within and between multiple scroll files. They are in addition to the more conventional use of secondary keys, which support only one-to-many and many-to-one relationships. For example, an "invoice" file might be designed with invoice number as the primary key and customer number as a secondary key (pointed at the invoice number). This creates a one-to-many relationship between one customer and many invoices. And there is a many-to-one relationship back from many invoices to one customer. By design, all these techniques are helpful for implementing the Network Database Model, which many computer scientists consider superior to the Relational Model of SQL — particularly in the areas of performance and storage space.

All Index objects are fully populated at the time their containing Scroll object is constructed. A complete pass is made through all records in the file, and all the keys they contain are inserted into all indices simultaneously. At this time, an internal "free space" index is also built, with keys pointing to areas in the file where data has been deleted. (This free space, indicated by a series of new-lines, is allocated to new data inserted or replaced in the scroll, in preference to adding data at the end of the file). For smaller files, all indices are typically built in just a fraction of a second. Very large files might take longer to index but, if scroll file servers are used, this only occurs once (usually during system boot). After the indices have been populated, individual keys are inserted and deleted dynamically (which requires very little processing overhead). Given enough RAM memory, actual reading and writing of records is virtually instantaneous.

Initial file creation:

The DBD (and possibly data) records are often exported from a spreadsheet program. In this case, in the spreadsheet itself, the DBD records are entered as the first few rows, with dbd1 and db2 set up as display headings. Normally, after they have been entered, the remaining DBD records are "hidden" — to avoid cluttering the display screen and the printed pages. Comment rows are completely ignored by the library, to allow a spreadsheet to contain rows with information other than the data itself (for example, formulae). It is also a good idea to adjust the width and formatting of the columns, to make maintaining the spreadsheet easier (and visually more appealing). After all DBD and data rows have been entered, the entire sheet is exported as a .csv file, for use by the scroll library. How all this is accomplished can be gleaned from the spreadsheet program's help system.

Alternatively, the file and its header records can be created within the JavaScript program, using standard JSUS function calls, prior to creating the Scroll object. Typically, this would be made conditional on whether the file already exists and, at this time, some or all of the data records could also be written. For example:

if (!fish(path)) { put (path,'Customer Database\n'); put (path,'Cust. No\tCustomer\tPhone No\tContacts\n'); put (path,'custNmbr\tcustName\tcustFone\tcontacts\n'); put (path,'n5.0\ts30\tN10\tS\n'); put (path,'custName:\tcustName'); put (path,'custFone:\tcustFone'); put (path,'\n'); put ('5000' +'\tAcme Information Services' +'\t1235559876;8006663333' +'\tJohn Smith' ); }

Scroll file connections:

The Scroll library is written in JavaScript and is loaded into the global context with either a load('jsus.Scroll') function call or, before the program even starts, with the -j jsus.Scroll option (in a four-stroke line at the top of the program). In either case the library code is compiled, executed and installs constructor and method functions for the user's own Scroll connection objects. A connection to an existing scroll file can then be created with:

scr = new Scroll(path);

If there is no valid scroll file at the end of the path this will fail (errno is set appropriately before the error is thrown). If the file structure is valid, it is mapped into the program's virtual memory and, from this point forward, it should only be accessed using the JSUS memory mapped I/O functions (around which the library is wrapped). Prior to making a connection, however, regular I/O function calls may be used against the same path to physically create or otherwise modify the file (above).

For a number of reasons related to security, performance and reliability, only one connection to a scroll is permitted at any time. For this reason, the Scroll library prevents multiple connections being created for the same file, in the same program. It also issues an exclusive lock() against the entire file, to protect it from access by other programs. Unfortunately, in Unix, there is presently no reliable method of making these locks mandatory for all programs. They work well with co-operating programs, including all other JSUS programs, but it is always possible for rogue programs to corrupt any file — if they have write permission. Perhaps worse, if a new process is forked after a Scroll object is instantiated, the new process will not inherit the exclusive file lock of the parent. So, additional protection should always be provided through appropriate file access permissions and forking should be avoided, if possible.

As a special case, the scroll's path may point to a UNIX socket, created by a separate scroll file server program. Library calls in the client program are then redirected, transparently, to the scroll server. This important feature allows large numbers of programs to safely and efficiently share the same scroll file — which is particularly useful in a web server environment. Sharing is described more fully below but, since it is transparent, it has no impact on the following discussions of normal library operations.

Normal database operations are performed by calling method functions of the Scroll object (next). Technically, it is still possible to call other JSUS functions against the same file path but this is generally discouraged (because the results can be unpredictable). To close a scroll connection, and to release its allocated resources, simply call the close() method. (If omitted, this will happen automatically, when the program terminates).

Scroll file operations:

All of the normal operations against a scroll file require record objects to be passed to and from the library. Even the most common operation, reading a data record using a random key, should be preceded by creating a new, empty record using the rec() method. This returns a new record object with one named property for each field defined in the DBD, with content preset to a minimum value — false, zero or an empty string or array. To read a data record, appropriate key field values are set in the record object, which is then passed to the loc() method:

rec = scr.rec(); rec.keyField = someValue; . . . rec = scr.loc(key [ ,rec [,time] ]);

In the loc() method call, the key name identifies the key Index object to be searched. The rec object, which is optional, contains values for the key fields to be used in the search. It may be substituted by a tab-delimitted key string, but this is somewhat more error prone than entering values into record fields. The string option is provided mainly to support special key values — '\u0000' for the very first record, '\uFFFF' for the very last, or an empty string for the next record in key sequence. If both record and string are omitted (or replaced with a zero), an empty string is assumed. Either complete or partial key field values may be set in the rec object, or the string, and loc() always returns the first record with an equal or higher key. However, key evaluation ends at the first empty key field. Thus, leaving the first key field empty (or zero) has the same effect as substituting the rec object with a '\u0000' string (i.e. the first record is returned). If any key field is an array, only the first element is used in the search key. Note, also, that sequential reads from different key indices may be freely intermixed, without breaking their sequences. And, similar to get(), a null is returned if any attempt is made to read beyond the last record (which is not considered an error).

The optional time argument specifies a maximum (fractional) number of seconds for which the record is to be "leased" by the program and remain locked from reads or writes by other processes. The lease is released when the record is replaced, deleted or passed to the free() method. All records leased in the same scroll, by the same process, can also be released by a single call to free() at the end of a transaction. If none of these occur, the lease is automatically released after the time has expired. Currently, leasing applies only to records read via scroll servers (otherwise the time value is ignored). Nevertheless, for future-proofing, all applications should be designed and programmed for multi-user scroll servers — where a lease is required before a record may be replaced or deleted. If this lease cannot be acquired, loc() returns a null (again, this is not considered an error).

On success, the loc() method returns a new object containing the desired record. A named property is provided for each like-named field in the DBD. If a field is empty or truncated in the file itself, that field is given a minimal value in the record object — false, zero or an empty string or array. An error is thrown if any value in the file cannot be coerced into the format specified in the field's descriptor. Stored fields longer than specified are truncated. For output, the program can pass this (or any other) record object to the following methods:

var = scr.ins(rec); var = scr.del(rec); var = scr.rep(rec);

On success, all of these methods return the primary key of the record, either a number or a string. The ins() method will "fail" if the primary key of the record already exists in the Index. However, the library does not throw an error but, instead, returns null, and the record is not inserted. This approach allows the test for an existing record to be combined with the actual insertion, eliminating an additional search of the Index and avoiding possible race conditions. It will also fail if the primary key of the record is an empty string. But, if the primary key is the number zero, the library automatically replaces this with a key one higher than the highest primary key in the file. (Keys above or below this can always be inserted by providing a specific number). The del() and rep() methods will fail if a record with the same primary key does not exist or if a scroll server is in use and the record is not currently leased. If these or any other errors are detected, a message is thrown, with errno set appropriately.

Client/server programs:

The scroll library's Index system is extremely fast but it has one noteworthy side effect. Scroll files can never be safely opened by more than one process at a time because the Index objects are created separately in each process, within the Scroll object. If one process wrote a record into the file it would also update its own Index objects. However, no changes would be made to the Index objects in any of the other processes and they would no longer match the current content of the scroll file. (To tell the truth, all database managers share similar problems, which is one reason they are so complex).

The scroll library offers a simple solution. Only one process is allowed to "own" the file but that process can operate as a server for other processes wishing to access the same file. Any process can be placed into scroll server mode by calling the serve() method:

t/f = scr.serve( [path] );

The path argument points to a location where a (UNIX socket) file will be created by the library. If the path is omitted, the suffix .sok is appended to the absolute path of the scroll file itself. This socket file does not occupy any external storage space, it is used only as a link to a buffer in the system kernel and to provide file access protection. When a scroll file client is connected to this path it re-directs all method requests into the socket where they are received, in order, by the server. Responses from the scroll server (including errors) are passed back over the socket to the originating client. All of this is fast, reliable, and completely transparent to the client.

Except for calling the serve() method, it is also transparent to the server. When this method is invoked, the library simply enters server mode and does not return until a SIGNAL event is received (for example, SIGTERM). No additional processing is required in the user program. The server program can, of course, provide its own custom processing both before and after the serve() method call, but this is optional. However, one process cannot serve more than one scroll file at a time and it cannot do anything else while serving.

In many cases, it is more desirable to create a custom "transaction server" program which manipulates several different scroll files. These can, of course, all be "owned" by a single transaction server but, then, they cannot be shared by other programs which might need them. Fortunately, it is perfectly acceptable for a transaction server to also be a client of one or more single-file scroll servers, which can share their files with other processes. Using this technique, large and complex client-server applications can be readily built entirely in JavaScript.

Scroll file protections:

In a client-server environment, it can be challenging to ensure a match between the scroll file's permissions and the user-id's of all client programs at the same time. In addition, many modern applications require finer-grained control of access to databases. Every scroll file server enforces the permissions of the file itself on each client, based solely on the user-id of the client process (the group-id is ignored). Group and world permissions behave as expected. If the file has any world permissions, any client user can read (and possibly write) it, through the server. If there are no world permissions, but there are group permissions, the client user must be a member of the file's group in order to access the database (with those group permissions).

At the file user (owner) level the scroll library provides extended functionality. The file owner's user name is treated as the name of a group, whose members all have the same user access rights as the owner. This is compatible with modern Unix systems, which assign all users their own private group with exactly the same name as their user-id. Scroll file access can then be systematically restricted to the members of two distinct groups — one group which may only read (file group) and another which may also write (owner group). Administratively, this is achieved by setting the file's permissions to 0640 and creating separate reader and writer groups for permitted users.

For this to work, the scroll server program must be run with the user-id of the primary member of the writer group. To simplify this task, we provide the jsdb utility which starts (SUID) as the root user and, for each listed file, forks a new server process in which its privileges are immediately dropped to those of the file's owner and group. Each new process then functions as a standard scroll server until terminated with SIGQUIT or SIGTERM. This program is easily incorporated into application start-up scripts and avoids any need to reinvent the wheel. The syntax is as follows:

jsdb [ -h ] [ -c ] [ -k ] [ [scroll-path][@socket-path] ] . . .

Without any arguments at all, or with the -h option, jsdb prints simple help information. If no other options are given, a new scroll server is started for each scroll file in the command line list. The socket path for each server may be given explicitly or, if omitted, it defaults to the absolute path of the file appended with .sok, as above. The -c option checks that the server is still running and, if so, displays its process id. The -k option is used to terminate the servers in the list, if they are still running.

The JSUS "Scroll" library

(A Network Database Manager)

Syntax:

Synopsis:

The DataBase Definition:

Defining index keys:

Initial file creation:

Scroll file connections:

Scroll file operations:

Client/server programs:

Scroll file protections:

See also: