Copyright ©1997-1998 by Axel T. Schreiner.  All Rights Reserved.



6
Limbo

Overview

Inferno programs are written in Limbo or in Java. Both languages are compiled into Dis code which is interpreted or compiled just-in-time for execution.

Limbo combines concepts from C and Modula with threads (spawn) and channels for synchronous communication from Hoare's CSP. There are dynamic strings and vectors (array) with a slicing operation as well as lists. Characters are manipulated as string using Unicode; conversions from and to array of byte with UTF are possible. Abstract datatypes (adt) are structures with procedure components that can be called C++ style. Finally, there are tuples -- value lists much like anonymous structures -- frequently used as function results. Inferno 2.0 adds an Exception adt.

Limbo is not object oriented; there is no inheritance or dynamic binding. There are no pointers but some types have ref semantics for passing parameters and adt values can be manipulated by reference (ref). Dis manages memory dynamically, normally be reference counting and adt values marked cyclic with garbage collection, i.e., ref can be used to construct self referential, dynamic data structures.



echo
{06/echo.m}
Echo: module {
init: fn (ctxt: ref Context, argv: list of string);
};
{06/echo.b}
implement Echo;   # a trivial first program
# that outputs its arguments
include "sys.m";  sys: Sys;
include "draw.m"; Context: import Draw;
include "echo.m";

init (nil: ref Context, argv: list of string) {
sys = load Sys Sys->PATH;

if (argv != nil) # hd argv is the command name
while ((argv = tl argv) != nil) {
arg := hd argv;

if (len argv > 1)
sys->print("%s ", arg);
else
sys->print("%s\n", arg);
}
}
{}
A program consists of modules which are dynamically instantiated (load). A module contains global constants (PATH), adt definitions (Context), global variables (sys) and procedures with or without a result (init).

A source file starts with implement and selects the module interface to be implemented (Echo). The module interface contains the exported names; constants (PATH) and adt types (Context) can be reached through module names, variables and functions (print) only through module references obtained with load (sys). import is used to avoid explicit references with ->.

include replaces a declaration and includes a file that is searched for in various places. By convention the file contains a module interface with the name of the file, usually for import clauses. implement introduces a module interface that is completely imported; it's contents can then be used without references (init).

init() follows the convention for command line programs (see sh.b). nil makes a parameter inaccessible. list has ref semantics; therefore, argv can have nil as a value. hd produces the first element, tl the rest, and len the length of a list. When defining a variable, := can be used to obtain the type from the initializer (arg); basically, the type between : and = is omitted. As usual, names are visible locally.



Modules

The following example demonstrates that modules can be instantiated more than once and their global variables can be accessed quite intentionally:
{06/mymod.m}
MyMod: module {  # a module with a string
name: string;
};
{06/mymod.b}
implement MyMod; # to implement it...
include "mymod.m"; # ...the module interface is sufficient
{06/mods.b}
implement Mods;

include "sys.m";   sys: Sys;
include "draw.m";
include "mymod.m";

Mods: module {
init: fn (c: ref Draw->Context, argv: list of string);
};
init (nil: ref Draw->Context, argv: list of string) {
sys = load Sys Sys->PATH;
a := load MyMod "mymod.dis"; # create 2 instances of MyMod
b := load MyMod "mymod.dis";

if (a == nil || b == nil) # did it work?
sys->print("cannot load a or b: %r\n");
else {    # deposit two names...
a->name = "a"; b->name = "b";
print(a); print(b); # ...and display by module reference
}
}
print (m: MyMod) {
sys->print("%s\n", m->name);
}
{}
Modules have ref semantics. MyMod contains a string. The module is instantiated twice, then the two strings are initialized and output with a function that uses a module reference.

%r as a format element produces the last system error message.



Statements

Statements mostly follow C:

expressionopt ;
{ statement-or-declaration... }
if ( condition ) statement
if ( condition ) statement else statement
labelopt while ( conditionopt ) statement
labelopt do statement while ( conditionopt ) ;
labelopt for ( expressionopt ; conditionopt ; expressionopt ) statement
labelopt case selector { qual => statement-or-declaration... ... }
labelopt alt { test => statement-or-declaration... ... }
break identifieropt ;
continue identifieropt ;
return expressionopt ;
spawn term ( expression ,... opt ) ;
exit ;

Statements, definitions, and declarations may appear in any order. Conditions must use int; missing conditions produce infinite loops. break exits from loops, case and alt statements, continue continues loops; labels can be used to specify the range.

spawn creates a new thread to execute the function call term(...). Threads begin by sharing everything; they must be synchronized using channels.

exit terminates it's thread and releases exclusive resources.



cnt

cnt, like wc, counts lines, words, characters, bytes, and characters within a range in files. The implementation is designed to exhibit as many features of Limbo as possible.

Data structure
{06/cnt.b}
implement Cnt;   # a baroque version of wc

include "sys.m";    sys: Sys; FD, OREAD: import Sys;
stdin, stderr: ref FD;
include "draw.m";   Context: import Draw;
include "string.m"; str: String;

Cnt: module {
init: fn (ctxt: ref Context, argv: list of string);
};

usage () {
sys->fprint(stderr, "Usage: cnt [-lwcb] [-C class] file...\n");
exit;
}

Count: adt {
count: big;
report: fn (c: self Count);
};

lines, words, chars, bytes, spec, all: con iota;
class: string;    # class for spec
{}
The String module contains functions to deal with strings and characters. In particular, it can handle character ranges such as a-z or ^0-9.

Count is an adt that the Cnt module does not export. The component count is used for counting. Limbo has the integer types byte (8 bits, positive), int (32 bits) and big (64 bits) as well as real for floating point calculations.

report() is a method; because of self it is called without an argument for a Count object that is then referred to by the parameter c in the procedure.

adt objects have value semantics; ref can be used to create references and pass them.

con defines constants. iota takes the values int 0, 1... for the names in turn. Here, these constants specify in which order the counters will later be output if Count objects exist.



Command line: options
{06/cnt.b}
init (nil: ref Context, argv: list of string) {
sys = load Sys Sys->PATH;
stdin = sys->fildes(0);
stderr = sys->fildes(2);

any := 0;
total := array[all] of ref Count;
if (argv != nil)
while ((argv = tl argv) != nil) { # first word is "cnt"
arg := hd argv;
if (arg[0] != '-' || arg == "-")
break;   # -: stdin
if (arg == "--") {  # --: end of options
argv = tl argv;
break;
}
opt: for (i := 1; i < len arg; i ++)
case arg[i] {
'b' =>   # -b: bytes
any++; total[bytes] = ref Count(big 0);
'c' =>   # -c: chars
any++; total[chars] = ref Count(big 0);
'C' =>   # -C class
class = arg[i+1:]; # rest of arg
if (class == nil) {
argv = tl argv; # next arg
if (argv == nil)
usage();
class = hd argv;
}
any++; total[spec] = ref Count(big 0);
break opt;
'l' =>   # -l: lines
any++; total[lines] = ref Count(big 0);
'w' =>   # -w: words
any++; total[words] = ref Count(big 0);
* =>
usage();
}
}
{}



A process normally has standard input and output and diagnostic output, identified by the numbers 0, 1, and 2. fildes() produces ref FD which can, for example, be passed to fprint().

Limbo uses list and array for aggregates. Both include the element type but not the number of elements as part of the aggregate type. Both have ref semantics. Elements may be aggregates. len produces the number of elements in a list, in a vector, or in a string.

To create a vector, the number of elements or the elements themselves must be specified.

x := array[10] of int; 10 (local) undefined elements
y := array of { 1, 2, 3 }; 3 defined elements
z := array[7] of { 1, 2, 5 => 5, * => 0 }; here, four elements are 0
w := list of { 1.5, 2.8, 3.4 };

Global definitions are initialized with zero, objects with ref semantics are always initialized with nil, the rest remains uninitialized. To initialize a vector, positions can be specified; * initializes all others in undefined order.

Integer constants are specified in decimal or with an explicit base as in 16rabc; if the value is large enough, it is big. Additionally, unicode characters and escape sequences such as \n or \uabcd in single quotes may be used as int constants. string constants use double quotes. A string element is a unicode character as an int value. Strings may be compared.

There are arithmetic conversions such as big 10, real 3, string 10, real "1.23". Additionally, an array of byte can be converted to string and back, see below. Finally, a tuple, i.e., a sequence of values in parentheses, can be converted into an adt or ref adt #; this creates a new objects with it's value componentes filled with the tuple elements one after another. Therefore, if -b is specified to cnt, total[bytes] refers to a Count object initialized with big 0.

cnt has a typical command line with switches like -b and values for options like -C class. As usual, one or two minus signs terminate the options; one additionally refers to standard input. Switches can be specified singly or together. The value of an option is the rest of an argument or the next argument; 'C' demonstrates how this is decoded.

Vectors and strings may be sliced by specifying the first index and the one past the end; if the latter is missing a slice like arg[i+1:] extends to the end. A vector slice has ref semantics while a string slice is a copy. One can assign an int value to the end of a string to increase it's length. Elements in a vector slice can be replaced. Empty strings "" are equivalent to nil.



Command line: files
{06/cnt.b}
if (! any)
total = array[all] of { spec => nil, * => ref Count(big 0) };
if (class != nil) {
str = load String String->PATH;
if (str == nil) {
sys->fprint(stderr,
"cnt: cannot load String module: %r\n");
exit;
}
}

if (argv == nil)
cnt(total, "-");
else {
nfile := len argv;
do
cnt(total, hd argv);
while ((argv = tl argv) != nil);
if (nfile > 1)
report("total", total);
}
}
{}
With no options given, most counters are shown; therefore, Count objects are created for most elements in total.

If a range of characters is to be counted the String module must be loaded. If this fails an error message is sent with fprint() to diagnostic output and the program is terminated.

If there are no arguments left, cnt() should count standard input. Otherwise, a file name is specified to be counted. If more than one file is specified on the command line report() must produce a sum.

Functions like cnt() or report() can be used before they are defined.



Counting bytes
{06/cnt.b}
cnt (total: array of ref Count, fnm: string) {
fd: ref FD;

if (fnm == "-")
fd = stdin;
else
fd = sys->open(fnm, OREAD);
if (fd == nil) {
sys->fprint(stderr, "cnt: cannot open %s: %r\n", fnm);
return;
}

n := array[all] of ref Count;
for (i := 0; i < all; ++ i)
if (total[i] != nil) n[i] = ref Count(big 0);

unicode := total[chars] != nil || total[spec] != nil
|| total[words] != nil;
content := unicode || total[lines] != nil;

buf := array[1024] of byte;
while ((nb := sys->read(fd, buf, len buf)) > 0) {
if (n[bytes] != nil) n[bytes].count += big nb;
if (content) {
for ((beg, i) := (0, 0); i < nb; ++ i)
if (int buf[i] == '\n') {
if (n[lines] != nil) ++ n[lines].count;
if (unicode) {
wc(n, buf[beg:i+1]);
beg = i+1;
}
}
if (unicode) wc(n, buf[beg:nb]); # rest of buf, if any
}
}
if (nb < 0)
sys->fprint(stderr, "cnt: read error %s: %r\n", fnm);
if (unicode) wc(n, nil); # rest of file, if any

report(fnm, n);
for (i = 0; i < all; ++ i)
if (total[i] != nil) total[i].count += n[i].count;
}
{}



A minus sign represents standard input; otherwise, open() is used to access a file for reading. The definition fd: ref FD is not necessary because the first assignment to fd is not in a block of it's own and could, therefore, be written as a definition, but the definition clarifies what the code sequence tries to do.

Each file is counted in a copy of total[] containing Count objects initialized with big 0. If one wants to count characters or words one has to deal with unicode, see below. Another gain in efficiency is possible if the number of lines need not be counted.

Following these preparations, read() produces a buffer with bytes and the length is counted if required. If the contents need to be analyzed the buffer is split at line separators and the lines are counted if required. buf[] contains byte values which have to be converted before they can be compared with int values.

The for loop illustrates a definition and initialization with a tuple value.

++ and -- exist not only as postfix operators: as in C, they deliver as prefixes the new and as postfixes the previous value.

Taking care of unicode is not quite trivial. Therefore, a function wc() is used which receives the Count vector n[] and a buffer slice. In principle, wc() is called once per line, but at the end of the buffer and at the end of a file some bytes may be left over. Therefore, beg tracks the beginning of a line and the buffer slice for wc() can be empty or nil.

At the end, file name and counters are output with report() and the sums are added up in total[].



Output
{06/cnt.b}
report (item: string, c: array of ref Count) {
for (i := 0; i < all; i ++)
if (c[i] != nil) (*c[i]).report();
if (item != "-")
sys->print(" %s", item);
sys->print("\n");
}
{}
report() produces the desired counters and possibly a filename.

Starting with an adt object, ref produces a reference to a copy. * is used to get from a reference to an object, but there, too, only a copy is reached.

It would have been better to define Count.report() so that a reference suffices as an argument, but here it serves as an example for passing an adt value.
{06/cnt.b}
Count.report (c: self Count) {
sys->print("%7bd", c.count);
}
{}
report() is a global function in the module, Count.report() is a method for Count. The method has no particular privileges.

To output big values the format element must include b.



Counting characters

Files contain bytes but Limbo processes unicode characters with string. By convention, a file should contain UTF where ASCII represents itself and all bytes with value >= 16r80 represent other unicode characters using two or three bytes.

Given a byte vector with one line it is relatively simple to analyze characters:
{06/cnt.b}
line: array of byte;

wc (c: array of ref Count, buf: array of byte) {
if (buf != nil && len buf > 0) {
l := 0;
if (line != nil)
l = len line;
new := array[l + len buf] of byte; # line + buf
new[0:] = line;    # copy line
new[l:] = buf;    # append buf
line = new;    # yields new line
if (int line[len line-1] != '\n') # ...complete?
return;
}
if (line != nil) {
s := string line; l := len s;
if (c[chars] != nil) c[chars].count += big l;
if (c[words] != nil) {
(n, nil) := sys->tokenize(s, " \t\n");
c[words].count += big n;
}
if (class != nil)
for (n := 0; n < l; ++ n)
c[spec].count += big str->in(s[n], class);
line = nil;
}
}
{}



line[] contains as yet unused bytes. A new buffer slice is appended to line[] where the current memory must be replaced. This can be done without pointers. When assigning to a vector, affected elements are replaced.

The line separator \n is an ASCII character, i.e., it represents itself in a byte and can be used to recognize lines as complete UTF sequences. If line[] does not contain a complete line wc() waits for it's next activation. line[] is global and thus persists between calls.

At the latest at end of file wc() is called with nil and an incomplete line does get processed.

When converting between array of byte and string UTF is converted into unicode and back. However, the result is undefined if the bytes do not follow UTF conventions. The official implementation of wc uses a finite automaton rather than string to discover defective sequences. An exact solution could also be based on sys->byte2char().

len produces the length of a string, i.e., the number of unicode characters. Words are either counted with a finite automaton that reacts appropriately for white space or one uses sys->tokenize() to split at particular characters. If nil is used in the target tuple of an assignment the corresponding value is ignored. Unicode contains more white space than mentioned here.

str->in() is used to classify characters.

13/May/1998