Samuel Williams Sunday, 12 December 2010

How do you delimit statements and expressions in modern programming languages? While this question specifically depends on the parser for a given language, at a more general level we can see some interesting patterns.

Many popular languages (e.g. Java, C/C++) use commas "," and semi-colons ";" to delimit argument lists and statements respectively (Python accepts both as well, although it normally ends a statement at the newline). Other languages (e.g. Smalltalk, Scheme, io, OOC) use various forms of whitespace as a natural separator of statements and arguments.

Language Design

Programs typically (but not always) consist of a list of expressions. Expressions can often contain nested expressions, and there are usually specific rules for how they may be nested (e.g. namespaces can contain classes, which can contain functions, which can contain procedural logic). Because expressions are written sequentially, they often need specific syntax (such as end of statement markers) to ensure that they are parsed separately.

Implicit Delimitation

Using whitespace to delimit statements reduces visual clutter. There are several different forms of whitespace delimitation, ranging from the very simple S-expression to the more complex Python indentation model.

S-Expressions

The S-expression is a bracketed sequence of items. Because each one has a distinct start and end symbol, S-expressions can be parsed unambiguously, both in sequence and when nested.

(fn x (- y 10) z)

For this to hold true, all terminal expressions must also be unambiguous. In many cases, this means that prefix, infix and postfix operators cannot be supported in general, because they introduce ambiguities.

; If the minus operator were defined to be both a prefix and an infix operator,
; what would the following mean?
(fn x - y)

; Do we read this as
(fn x (- y)) ; Two arguments
; .. or ..
(fn (- x y)) ; One argument

Such ambiguities can only be resolved by explicit delimitation, or limitations on the syntax model (for example, only supporting infix operators).
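The unambiguous parsing of bracketed sequences described above can be sketched in a few lines. This is a minimal illustration in Python rather than a real Lisp reader: the names tokenize and parse are hypothetical, atoms are kept as plain strings, and comments and string literals are not handled.

```python
def tokenize(source):
    """Split S-expression text into '(' tokens, ')' tokens and atoms."""
    return source.replace("(", " ( ").replace(")", " ) ").split()


def parse(tokens):
    """Consume one expression from the front of the token list."""
    if not tokens:
        raise SyntaxError("unexpected end of input")
    token = tokens.pop(0)
    if token == "(":
        items = []
        # Everything up to the matching ')' belongs to this list.
        while tokens and tokens[0] != ")":
            items.append(parse(tokens))
        if not tokens:
            raise SyntaxError("missing ')'")
        tokens.pop(0)  # discard the matching ')'
        return items
    if token == ")":
        raise SyntaxError("unexpected ')'")
    return token  # an atom, kept as a string


print(parse(tokenize("(fn x (- y 10) z)")))
print(parse(tokenize("(fn x - y)")))
```

The second call returns a flat list in which "-" is just another item. The brackets alone give the parser no basis for choosing between the prefix and infix readings, which is the ambiguity described above.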

Block Markers

A more generalised form of the S-Expression can be seen in many modern languages, through the use of white-space separated blocks.

class Alice {
	
	public void f() {}
	public void g() {}
	
}

In this case, the Java functions have no explicit separator between them - there is an implicit understanding about the structure of a class block and its direct children when they are also blocks. However, non-block declarations inside a class body (such as fields) must still be explicitly separated.

Another interesting case involves constructs that can be followed by either a single statement or a block of code, such as the if statement: when followed by a block, no delimitation is required, but when followed by a single statement, an end of statement marker is required.

// End of line delimited
if (x) y(); // <- there is a ; here

// Whitespace delimited
if (x) { y(); } // <- there is no ; here

The primary reason for this is that like S-Expressions, blocks can be unambiguously parsed, but general expressions can't be. Because there is no ambiguity, explicit delimitation is not required.

The io syntax model incorporates both implicit delimitation of chained expressions and infix operators.


# io delimits method invocations using spaces:
Account := Object clone
Account balance := 0

# io uses commas for argument lists:
Account deposit := method(amount,
	balance = balance + amount
)

account := Account clone
account deposit(10.00)
account balance println

Explicit Delimitation

Using explicit characters to delimit statements and expressions can improve the expressiveness of the language, at the cost of additional syntactic complexity. Typically, high level statements are divided using a semi-colon ";", and argument lists using a comma ",".

Separation Markers

This form is traditionally seen in function argument lists, or in chained method invocations. Each item has a mark between it and the next argument, but there is no mark at the end.

fn(x, y, z)

Arguments can easily be added into the list simply by adding additional commas and expressions. However, it can sometimes be difficult to copy arguments from one location to another; to counter this some languages accept the following equivalent syntax:

fn(x, y, z,)

The trailing comma means that each sub-argument is a complete element, and can be shifted around:

fn(z, x, y,)

This is primarily of use when specifying arrays of data that span multiple lines, e.g. in Ruby:

items = [
	10,
	20,
	30,
	40,
]

Because each line includes its own comma, whole lines can be moved around without fear of introducing a delimiter syntax error.
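Python is another language that accepts a trailing comma in both argument lists and list literals. The following is a small sketch of the same idea; fn here is a hypothetical function used only for illustration.

```python
def fn(x, y, z):
    """Hypothetical function that just returns its arguments."""
    return (x, y, z)


# A trailing comma in the argument list does not change the call.
print(fn(1, 2, 3,) == fn(1, 2, 3))

items = [
    10,
    20,
    30,
    40,
]

# The last line has been moved to the top as a whole unit,
# with no other line needing to be edited.
reordered = [
    40,
    10,
    20,
    30,
]

print(sorted(reordered) == items)
```

One caveat in Python specifically: a trailing comma after a single parenthesised value, such as (1,), creates a tuple, so the trailing comma is not always purely cosmetic.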

Statement Markers (i.e. end of line)

This form is traditionally seen when separating statements.

fn(x); fn(y); fn(z);

This is again similar to the above case where statements can be moved around as whole units (i.e. each statement includes the termination marker).

The C++ syntax model includes mostly explicitly delimited expressions.


int bob (Baz * baz) {
	// We can see "->" and "," separation markers 
	int result = baz->calculateFoo(x, y, z)->bar();
	
	if (result) {
		return baz->apples();
	}
	
	// Whitespace delimited expression (there is no previous statement terminator):
	return baz->oranges(); // This statement terminator is extraneous but required for correctness.
}

Ease of Use

Humans are capable of dealing with complex patterns, and as programmers we are generally happy when we have tools for expressing our ideas concisely. This has led to many programming languages with complex syntax models, including the ability to use prefix, infix and postfix operators, since they allow humans to concisely express their ideas with minimal overhead. However, because these operators introduce ambiguity, we require explicit delimitation.

On top of this, some languages don't require explicit delimitation where omitting it is not ambiguous (e.g. JavaScript statement markers). This creates very interesting situations where some expressions look okay, but actually mean something completely different:

function add(x, y) {
	return
		x + y;
}

This function returns undefined, even though the programmer intended to return x + y. This is because a semi-colon is inserted automatically after the return keyword (a form of implicit delimitation), and the result is not a syntax error.

// This is how the code is actually parsed.
function add(x, y) {
	return;
		x + y;
}

In order to get the correct behaviour, the programmer can use a nested expression block such as the following:

// This is what the programmer actually meant.
function add(x, y) {
	return (
		x + y
	);
}
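A similar trap can be reproduced in Python, although the mechanism differs: Python does not insert semi-colons, it simply treats a newline outside brackets as the end of the statement. The sketch below uses two hypothetical functions, add_broken and add_fixed, to show both the problem and the bracketed fix.

```python
def add_broken(x, y):
    return
    x + y  # never reached: the newline already ended the return statement


def add_fixed(x, y):
    # Inside brackets the newline no longer ends the statement.
    return (
        x + y
    )


print(add_broken(1, 2))  # None
print(add_fixed(1, 2))   # 3
```

In both languages the bracketed form works for the same reason: an open bracket is an unambiguous signal that the expression is not yet complete.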

Languages with different symbols for separation present better opportunities for error detection and correction. If every sub-expression uses the same set of delimiters, an erroneous delimiter could potentially belong to any previously opened expression. As an example, if argument lists in C used a semi-colon rather than a comma, it would be much harder to detect errors relating to unbalanced brackets.

fn(a, b, c);  // This is fine.
fn(a, b, c;   // This argument list is not terminated (1)

// Suppose that we used the same terminator for separating arguments and statements
fn(a; b; c);  // This is fine.
fn(a; b; c;   // This argument list is not terminated (2)

In the case of (1), we can clearly identify the error on this line. However, in the case of (2) the specific location of the error cannot be detected easily: we might need to process the entire input before finding out that the argument list is not terminated. The same applies to blocks delimited by curly braces. If there is an invalid statement within a block, the error does not propagate beyond the end of that block; that is, unless there is a problem with the nested structure of the blocks, any error can be isolated to a particular block.
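The difference between cases (1) and (2) can be illustrated with a toy bracket checker. This is only a sketch under simplifying assumptions: first_error is a hypothetical helper, it looks at nothing but round brackets and semi-colons, and it ignores strings and comments.

```python
def first_error(source, shared_delimiter):
    """Return the index where an unbalanced bracket is first noticed, or None.

    shared_delimiter=False models C, where ';' only ends statements.
    shared_delimiter=True models the hypothetical syntax in which ';'
    also separates arguments.
    """
    depth = 0
    for index, char in enumerate(source):
        if char == "(":
            depth += 1
        elif char == ")":
            if depth == 0:
                return index  # closing bracket with nothing to close
            depth -= 1
        elif char == ";" and depth > 0 and not shared_delimiter:
            return index  # a statement marker cannot appear inside an argument list
    # Anything still open is only discovered once the input runs out.
    return len(source) if depth > 0 else None


distinct = "fn(a, b, c; fn(d, e, f);"
shared = "fn(a; b; c; fn(d; e; f);"

print(first_error(distinct, shared_delimiter=False))  # index of the first ';'
print(first_error(shared, shared_delimiter=True))     # len(shared), the end of the input
```

With distinct delimiters the checker stops at the first semi-colon, right next to the mistake. With a shared delimiter every semi-colon is legal inside the open argument list, so the problem only surfaces when the input ends.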

In contrast, languages with implicit delimitation do not have this expressive richness. LISP code uses the same characters for almost every possible syntactic structure. Because of this, some kinds of errors in a nested statement cannot be detected until the entire program is processed.

Another benefit of explicit delimitation is the reduction in typing required to amend or add to an expression.

// Original statement
if (foo) bar();

// Amend this statement with a second expression:
// Typing required: move back 1, insert 7 characters
if (foo) bar(), baz();

// Amend this statement with a second expression in a block:
// Typing required: move back 6, insert 2 characters, move forward 6, insert 9 characters
if (foo) { bar(); baz(); }

// This is similar in any language that delimits statements using blocks (i.e. implicit whitespace)

When expressions require specific characters at the front and back (as in LISP), typing code can be a bit tedious, because when embedding previously written expressions the cursor needs to work both before and after the embedded expression.

Despite the potential issues with implicit delimitation, there are several benefits too. Implicit delimitation reduces visual clutter. And although operators may help people express their ideas, the gain in expressiveness may not be worth the ambiguity they introduce.

Combinations

There is no reason why a syntax model can't contain elements of both implicit and explicit separation of expressions. Ruby provides several constructs which involve whitespace delimitation. As an example, if you have an implicitly delimited set of string tokens, you can write it as follows:

%w{foo bar baz}

This is equivalent to:

["foo", "bar", "baz"]

Neat huh?
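For comparison, a language without a dedicated literal can get a similar effect at run time. The following Python line is only a rough analogue of the Ruby %w form, since it splits an ordinary string on whitespace rather than using special syntax.

```python
# Whitespace-delimited tokens inside a string, split into a list at run time.
tokens = "foo bar baz".split()
print(tokens)  # ['foo', 'bar', 'baz']
```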
