A Virtual Machine with Lexical Addressing
revised on June 4, 2022
In a nutshell
This post describes two implementations of machine and compiler. To play with them, click the links:
Motivation
The last post evolved the machine given in the post A Virtual-machine-based Implementation in Stages by making the operand stack explicit to get operand stacks ready for a realistic memory management in the virtual machine. In this post, I similarly prepare environments for realistic memory management. For this, I simplify the representation of environment frames such that they become just list of values accessed with frame and position numbers instead of lookup tables accessed with the symbols of names. This removes the need for linearly searching environment frames to find the value of a name using a technique that is explained in detail in Section 5.5.6 of SICP JS. In this post I apply this technique to the virtual machine of the previous post. I first show how the machine makes use of this information for environment lookup and assignment, by implementing environment frames as lists of values. Then, I show how the compiler predict the address in which the values of names will be found in the environment at runtime. As a final step, I change the respresentation of frames from lists to arrays to ensure constant time access. But first let me review the way environments are represented in the previous machines.
Baseline: Symbolic addressing
Environments in the previous machines were lists of frames.
function enclosing_environment(env) { return tail(env); }
function first_frame(env) { return head(env); }
const the_empty_environment = null;
A frame was a pair of lists. The head list contained symbols (strings) and the tail list contains the values of those symbols.
function make_frame(symbols, values) { return pair(symbols, values); }
function frame_symbols(frame) { return head(frame); }
function frame_values(frame) { return tail(frame); }
function extend_environment(symbols, vals, base_env) {
return length(symbols) === length(vals)
? pair(make_frame(symbols, vals), base_env)
: length(symbols) < length(vals)
? error("too many arguments supplied: " +
stringify(symbols) + ", " +
stringify(vals))
: error("too few arguments supplied: " +
stringify(symbols) + ", " +
stringify(vals));
}
Therefore, environment lookup proceeded as follows.
function lookup_symbol_value(symbol, env) {
function env_loop(env) {
function scan(symbols, vals) {
return is_null(symbols)
? env_loop(enclosing_environment(env))
: symbol === head(symbols)
? head(vals)
: scan(tail(symbols), tail(vals));
}
if (env === the_empty_environment) {
error(symbol, "unbound name");
} else {
const frame = first_frame(env);
return scan(frame_symbols(frame), frame_values(frame));
}
}
return env_loop(env);
}
The compiler passed the symbol of a name to the constructor of the load instruction.
function compile_component(comp) {
if (is_literal(comp)) {
instrs[wc] = load_constant(literal_value(comp));
wc = wc + 1;
} ... { ...
} else if (is_name(comp)) {
instrs[wc] = load(symbol_of_name(comp));
wc = wc + 1;
} else ...
}
The constructor of the load instruction represented the symbol explicitly in the machine instruction.
function load(symbol) {
return list("load", symbol);
}
function is_load_instruction(instr) {
return is_tagged_list(instr, "load");
}
function load_symbol(instr) {
return head(tail(instr));
}
The machine used the instruction’s symbol to perform environment lookup.
// clause in virtual machine for load instruction
} else if (is_load_instruction(instr)) {
pc = pc + 1;
push_on_operand_stack(lookup_symbol_value(load_symbol(instr),
environment),
operands);
} ...
Similarly, the entering of a block and lambda expressions were implemented by storing the local names and parameters in machine instructions so that the machine could perform environment extension when executing enter scope instructions and function call instructions.
Lexical addressing
The lookup_symbol_value
function above scans the frames of the given
environment successively to find the given name and its corresponding value.
SICP Section 5.5.6 describes
a technique to optimize name lookup such that the compiler predicts the
position where the value can be found. This is possible because in
lexical scoping, the structure of the environment exactly matches the
structure of nested scopes. Similar to the previous post, I proceed in two
steps: First I present
an implementation that retains the list structure
of environments, and then I show how arrays can be used to further speed
up environment lookup and prepare our machine for heap allocation.
The following example, taken from SICP 5.5.6, illustrates the idea of lexical addressing.
((x, y) =>
(a, b, c, d, e) =>
((y, z) => x * y * z)(a * b * x, c + d + x))(3, 4)
Consider the problem of looking up the value of x
while evaluating
the expression x * y * z
in an application of the function that is
returned by the expression above. The structure of the environment
at that point is as follows.
list(pair(list("y", "z"), list(v_1, v_2)),
pair(list("a", "b", "c", "d", "e"), list(v_3, v_4, v_5, v_6, v_7)),
pair(list("x", "y"), list(v_8, v_9)),
...)
To find the value for the symbol "x"
, lookup_symbol_value
examines
the symbol list list("y", "z")
of the first frame in search for "x"
.
It doesn’t find "x"
there, and therefore continues with the symbol
list list("a", "b", "c", "d", "e")
of the second frame. This search
is also unsuccessful, and thus it proceeds to the the third symbol
list list("x", "y")
where it finds "x"
in position 0, and thus
returns the corresponding value, here denoted by v_8
. This behavior
can be predicted by examining the expression in which x
occurs.
When looking for a declaration of x
starting from x * y * z
, you
need to skip two surrounding lambda expression, before you find the
declaration at position 0 in the parameters of (x, y) => ...
.
Using lexical addresses in the machine
For each name, the new compiler predicts in which frame of the environment
at runtime and in what position of that frame
the corresponding value is located.
Instead of a load
instruction that contains the name, the compiler
generates a load_lexical
instruction that contains the frame index
and the position index.
function load_lexical(frame, position) {
return list("load_lexical", frame, position);
}
function is_load_lexical_instruction(instr) {
return is_tagged_list(instr, "load_lexical");
}
function load_lexical_frame(instr) {
return head(tail(instr));
}
function load_lexical_position(instr) {
return head(tail(tail(instr)));
}
Similarly, the assign
instruction is replaced by an
assign_lexical
instruction.
function assign_lexical(frame, position) {
return list("assign_lexical", frame, position);
}
function is_assign_lexical_instruction(instr) {
return is_tagged_list(instr, "assign_lexical");
}
function assign_lexical_frame(instr) {
return head(tail(instr));
}
function assign_lexical_position(instr) {
return head(tail(tail(instr)));
}
New machine instructions for lexical addressing
The enter_scope
instruction
is replaced by an enter_scope_lexical
instruction that only carries
the number of values to be allocated in the extended environment.
function enter_scope_lexical(number_of_declarations) {
return list("enter_scope_lexical", number_of_declarations);
}
function is_enter_scope_lexical_instruction(instr) {
return is_tagged_list(instr, "enter_scope_lexical");
}
function enter_scope_lexical_declarations(instr) {
return head(tail(instr));
}
The load_function
instruction is replaced by a
load_function_lexical
instruction that carries
the number of parameters (arity) instead of their symbols.
function load_function_lexical(arity, address, stack_limit) {
return list("load_function_lexical", arity, address, stack_limit);
}
function is_load_function_lexical_instruction(instr) {
return is_tagged_list(instr, "load_function_lexical");
}
function load_function_lexical_instruction_arity(instr) {
return list_ref(instr, 1);
}
function load_function_lexical_instruction_address(instr) {
return list_ref(instr, 2);
}
function load_function_lexical_instruction_max_stack_size(instr) {
return list_ref(instr, 3);
}
Some cleanup
A consequence of the resolution of names at compile time is that names are no longer needed in environments. Each frame consists of just the list of values that are associated with their corresponding names in the program.
function extend_environment(vals, base_env) {
return pair(vals, base_env);
}
Adapting the machine to lexical addressing
Here is the implementation of the enter_scope_lexical
instruction.
// machine clause for enter_scope_lexical instruction
} else if (is_enter_scope_lexical_instruction(instr)) {
pc = pc + 1;
runtime_stack = pair(make_runtime_stack_block_frame(
environment),
runtime_stack);
environment = extend_environment(
list_of_unassigned(
enter_scope_lexical_declarations(instr)),
environment);
} else ...
The extend_environment
function no longer gets the list
of declared symbols as argument; the initial values (here the special value
"*unassigned*"
) suffice. Similarly, the instructions call
and
tail_call
no longer pass the list
of parameter symbols to the extend_environment
function.
The instruction load_function_lexical
works like this.
// machine clause for load_function_lexical instruction
} else if (is_load_function_lexical_instruction(instr)) {
pc = pc + 1;
push_on_operand_stack(
make_function(
load_function_lexical_instruction_arity(instr),
load_function_lexical_instruction_address(instr),
environment,
load_function_lexical_instruction_max_stack_size(instr)),
operands);
Function values now just carry the arity of the function and no longer the list of parameter symbols.
Finally, the lookup_lexical
instruction makes use of the frame number
and position computed at compile time and stored in the instruction.
// machine clause for load_lexical instruction
} else if (is_load_lexical_instruction(instr)) {
pc = pc + 1;
push_on_operand_stack(lexical_address_lookup(
load_lexical_frame(instr),
load_lexical_position(instr),
environment),
operands);
} else ...
It uses the function lexical_address_lookup
, which operates on the
simplified environment data structures.
function lexical_address_lookup(frame, position, env) {
function find_position(vals, position) {
return position === 0
? head(vals)
: find_position(tail(vals), position - 1);
}
return find_position(find_frame(env, frame), position);
}
function find_frame(env, frame) {
return frame === 0
? first_frame(env)
: find_frame(enclosing_environment(env), frame - 1);
}
Similarly, the assign_lexical
instruction uses a lexical_address_assign
function, which operates on the simplified environments.
Generating lexical addresses in the compiler
As described in SICP JS Section 5.5.6, the compiler generates lexical addresses by adding a compile-time environment to the recursive compilation process.
function compile(program) {
// wc: write counter
let wc = 0;
// instrs: instruction array
const instrs = [];
function compile_component(comp, env) {
if (is_literal(comp)) {
instrs[wc] = load_constant(literal_value(comp));
wc = wc + 1;
} else ...
}
...
const the_compile_time_environment =
setup_compile_time_environment();
compile_component(program, the_compile_time_environment);
instrs[wc] = done();
return instrs;
}
The frames in compile-time environments only contain symbols, not values. The initial environment is the list of predeclared symbols.
function setup_compile_time_environment() {
return extend_compile_time_environment(
append(primitive_function_symbols,
primitive_constant_symbols),
the_empty_compile_time_environment);
}
Like runtime environments, compile-time environments can be extended to accommodate new names in a local scope.
function enclosing_compile_time_environment(env) { return tail(env); }
function first_compile_time_frame(env) { return head(env); }
const the_empty_compile_time_environment = null;
function extend_compile_time_environment(symbols, base_env) {
return pair(symbols, base_env);
}
The compiler makes use of the compile-time environment when it compiles occurrences of names.
// compiler clause for names
} else if (is_name(comp)) {
const address = find_symbol(symbol_of_name(comp), env);
instrs[wc] = load_lexical(head(address), tail(address));
wc = wc + 1;
} else ...
Similarly, the compiler uses the compile-time environment for turning declarations into assignments.
} else if (is_declaration(comp)) {
compile_component(declaration_value_expression(comp), env);
const address = find_symbol(declaration_symbol(comp), env);
instrs[wc] = assign_lexical(head(address), tail(address));
wc = wc + 1;
The function find_symbol
computes the address of a given symbol
in the compile-time environment. An address is a pair whose head
is the frame number and whose tail is the position.
function find_symbol(symbol, env) {
function find(symbol, env, frame) {
if (is_null(env)) {
return error("not found");
} else {
const current_frame = first_compile_time_frame(env);
const symbols = member(symbol, current_frame);
if (is_null(symbols)) {
return find(symbol,
enclosing_compile_time_environment(env),
frame + 1);
} else {
const position = length(current_frame) - length(symbols);
return pair(frame, position);
}
}
}
return find(symbol, env, 0);
}
This works because the compile-time environment exactly matches the runtime environment. To achieve this, the compiler extends the compile-time environment to include the symbols of the new names each time the compiler enters a local scope.
// compiler clause for blocks
} else if (is_block(comp)) {
const local_symbols = scan_out_declarations(block_body(comp));
instrs[wc] = enter_scope_lexical(length(local_symbols));
wc = wc + 1;
compile_component(block_body(comp),
extend_compile_time_environment(
local_symbols,
env));
instrs[wc] = exit_scope();
wc = wc + 1;
} else ...
// compiler clause for lambda expressions
} else if (is_lambda_expression(comp)) {
const body = lambda_body(comp);
const parameters = lambda_parameter_symbols(comp);
instrs[wc] = load_function_lexical(
length(parameters),
wc + 2,
max_stack_size(body));
wc = wc + 1;
// jump over the body of the lambda expression
const goto_instruction = goto();
instrs[wc] = goto_instruction;
wc = wc + 1;
compile_component(body,
extend_compile_time_environment(
parameters,
env));
instrs[wc] = load_constant(undefined);
wc = wc + 1;
instrs[wc] = return_instr();
wc = wc + 1;
set_jump_address(goto_instruction, wc);
} else ...
Frames as Arrays
The representation of frames as lists of values means that access is still linear even when the index is know at compile time. The final version of the machine uses arrays instead of lists to fix this. Here is the new implementation of environments that provides constant time access to frames.
function make_frame(frame_size) {
return list("frame", [], frame_size);
}
function frame_array(frame) {
return head(tail(frame));
}
function extend_environment(vals, base_env) {
const len = length(vals);
const frame = make_frame(len);
const a = frame_array(frame);
for (let i = 0; i < len; i = i + 1) {
a[i] = head(vals);
vals = tail(vals);
}
return pair(frame, base_env);
}
function find_frame(env, frame) {
return frame === 0
? first_frame(env)
: find_frame(enclosing_environment(env), frame - 1);
}
function lexical_address_lookup(frame, position, env) {
return frame_array(find_frame(env, frame))[position];
}
function lexical_address_assign(frame, position, val, env) {
frame_array(find_frame(env, frame))[position] = val;
return env;
}
The road ahead
With local operand stacks and lexical addressing in place, I can allocate environment frames, runtime stack frames, and function values in an explicit heap data structure, which can be a single array that only holds primitive values. If the machine runs out of memory, garbage collection can free unused memory in this heap. Stay tuned.