In a nutshell

This post describes two implementations of machine and compiler. To play with them, click the links:

Motivation

The last two posts evolved the machine given in the post A Virtual-machine-based Implementation in Stages by preparing it for explicit memory management: making operand stacks explicit and moving from symbolic to lexical addressing. So far, the machine relied on the memory management of the underlying implementation language (here JavaScript) for representing runtime entities. These included environments, operand stacks, runtime stack frames (call and block frames), and (compound and primitive) function objects. In this post, I deliver on explicit memory management by using an array of fixed size called heap to represent these runtime entities. I’ll first show how segments of the heap called nodes can represent runtime entities. Migrating them to the heap results in a machine that works on the usual examples, but can be “fooled” into unsafe behavior. In a last step, I show how to fix this issue.

Structure of heap

The heap is an array of fixed size. Runtime entities are represented by sequences of subsequent heap slots called nodes.

// node layout
// 0: kind: -1001, -1002, -1003, for easy spotting
// 1: size, including header
// 2: current number of children: 0 for no children
// ---- end of header ---
// 3: first child
// 4: second child
// ...
// followed by non-children slots

A node can refer to other nodes, using their heap address. These nodes are called the node’s children. I declare:

const kind_offset = 0;
const size_offset = 1;
const number_of_children_offset = 2;
const first_child_offset = 3;

Each node has a header consisting of three values: its kind, represented by an integer from -1008 to -1001 (chosen for easy visual spotting), its size, and its current number of children. If a node has any children, their heap addresses are found in subsequent slots starting with an offset of 3.

All memory allocation is done with the function make_node:

const heap = [];
const heap_size = 1000;
let free = 0;
function make_node(kind, size) {
    if (free + size < heap_size) {
        const node_address = free;
        heap[node_address + kind_offset] = kind;
        heap[node_address + size_offset] = size;
        heap[node_address + number_of_children_offset] = 0;
        free = free + size;
        return node_address;
    } else {
        error("out of memory");
    }
}

Heap allocation of runtime entities

Let’s start with representing environments.

// node layout
// 0: kind: -1001
// 1: size, including header: 3 + number of values + 1 (for encl env)
// 2: current number of children:  number of values + 1 (for encl env)
// 3: first value
// 4: second value
// ...
// .: last value
// .: enclosing environment

const environment_kind = -1001;

function enclosing_environment(env) {
    // enclosing environment is in last slot of env node
    return heap[env + heap[env + size_offset] - 1]; 
}

// null is arbitrarily chosen: will never be used
const the_empty_environment = null;

Due to lexical scoping, a program that compiles without error will never access the empty environment at then end of the chain of environment nodes. Environment extension uses the heap allocation function make_node to allocate a node of sufficient size. It then copies the given values into the slots reserved for the children nodes.

function extend_environment(vals, base_env) {
    const len = length(vals);
    const node = make_node(environment_kind, first_child_offset + len + 1);
    heap[node + number_of_children_offset] = len + 1;
    for (let i = 0; i < len; i = i + 1) {
        heap[node + i + first_child_offset] = head(vals);
        vals = tail(vals);
    }
    heap[node + first_child_offset + len] = base_env;
    return node;
}

The function find_environment_node uses the new enclosing_environment function above.

function find_environment_node(env, frame) {
    return frame === 0
           ? env
           : find_environment_node(enclosing_environment(env), frame - 1);
}

The functions lexical_address_lookup and lexical_address_assign operate on heap nodes.

function lexical_address_lookup(frame, position, env) {
    return heap[find_environment_node(env, frame) + 
                first_child_offset +
                position];
}
function lexical_address_assign(frame, position, val, env) {
    heap[find_environment_node(env, frame) + 
                first_child_offset +
                position] = val;
    return env;
}

Compound function nodes have the following structure.

// node layout
// 0: kind: -1002
// 1: size, including header: 7
// 2: current number of children:  1
// 3: environment
// 4: parameters
// 5: address
// 6: limit

Their implementation follows.

const compound_function_kind = -1002;
const parameters_offset = 4;
const address_offset = 5;
const limit_offset = 6;

function make_function(parameters, address, env, limit) {
    const node = make_node(compound_function_kind, 4 + first_child_offset);
    heap[node + number_of_children_offset] = 1;
    heap[node + first_child_offset] = env;
    heap[node + parameters_offset] = parameters;
    heap[node + address_offset] = address;
    heap[node + limit_offset] = limit;
    return node;
}

function is_compound_function(f) {
    return heap[f + kind_offset] === compound_function_kind;
}
function function_parameters(f) {
    return heap[f + parameters_offset];
}
function function_address(f) {
    return heap[f + address_offset];
}
function function_environment(f) {
    return heap[f + first_child_offset];
}
function function_max_stack_size(f) {
    return heap[f + limit_offset];
}

Primitive functions carry an ID that can be used to access their implementation in an array primitive_implementations.

/ node layout
// 0: kind: -1003
// 1: size, including header: 4
// 2: current number of children: 0
// 3: id

const primitive_function_kind = -1003;
const primitive_id_offset = 3;

function make_primitive_function(id) {
    const node = make_node(primitive_function_kind, first_child_offset + 1);
    heap[node + primitive_id_offset] = id;
    return node;
}
function is_primitive_function(fun) {
    return heap[fun + kind_offset] === primitive_function_kind;
}

function primitive_implementation(fun) {
    return primitive_implementations[heap[fun + primitive_id_offset]];
}

The runtime stack contains call frames and block frames. Call frames have the following structure and implementation.

// node layout
// 0: kind: -1004
// 1: size, including header: 5
// 2: current number of children: 1
// 3: environment
// 3: pc
// 4: operands

const call_frame_kind = -1004;
const pc_offset = 4;
const operand_stack_offset = 5;

function make_runtime_stack_call_frame(pc, env, opnds) {
    const node = make_node(call_frame_kind, first_child_offset + 3);
    heap[node + number_of_children_offset] = 1;
    heap[node + first_child_offset] = env;
    heap[node + pc_offset] = pc;
    heap[node + operand_stack_offset] = opnds;
    return node;
}
function is_runtime_stack_call_frame(sf) {
    return heap[sf + kind_offset] === call_frame_kind;
}
function runtime_stack_call_frame_pc(sf) {
    return heap[sf + pc_offset];
}
function runtime_stack_call_frame_environment(sf) {
    return heap[sf + first_child_offset];
}
function runtime_stack_call_frame_operands(sf) {
    return heap[sf + operand_stack_offset];
}

The structure and implementation of block frames is similar.

// node layout
// 0: kind: -1005
// 1: size, including header: 3
// 2: current number of children: 1
// 3: environment

const block_frame_kind = -1005;

function make_runtime_stack_block_frame(env) {
    const node = make_node(block_frame_kind, first_child_offset + 1);
    heap[node + number_of_children_offset] = 1;
    heap[node + first_child_offset] = env;
    return node;
}
function is_runtime_stack_block_frame(sf) {
    return heap[sf + kind_offset] === block_frame_kind;
}
function runtime_stack_block_frame_environment(sf) {
    return heap[sf + first_child_offset];
}

For operand stacks, I use the current_number_of_children slot to keep track of the top of stack.

// node layout
// 0: kind: -1006
// 1: size, including header: 3 + max 
// 2: current number of children: initially 0
// 3: first child
// ...
// .: last child

const operand_stack_kind = -1006;

function make_operand_stack(max_stack_size) {
    const node = make_node(operand_stack_kind, 
                           first_child_offset + max_stack_size);
    heap[node + number_of_children_offset] = 0;
    return node;
}
function push_on_operand_stack(v, operand_stack) {
    heap[operand_stack + first_child_offset +
         heap[operand_stack + number_of_children_offset]] = v;
    heap[operand_stack + number_of_children_offset] = 
        heap[operand_stack + number_of_children_offset] + 1;
}
function pop_from_operand_stack(operand_stack) {
    const val = heap[operand_stack + first_child_offset - 1 +
                     heap[operand_stack + number_of_children_offset]];
    heap[operand_stack + number_of_children_offset] = 
        heap[operand_stack + number_of_children_offset] - 1;
    return val;
}
function top_of_operand_stack(operand_stack) {
    return heap[operand_stack + first_child_offset - 1 + 
                heap[operand_stack + number_of_children_offset]];
}

A nasty problem

The machine with explicit heap works as expected on the usual example programs. Note, however, that some values such as numbers are represented directly in environments and operand stacks, while others such as functions are presented by the address of their nodes in the heap. This means that I can trick the machine into treating a number as the address of a function. For example, let’s assume that the primitive function for the operator + is located on the heap at position 32. Then the following erroneous program will execute without error and produce the result 5.

32(2, 3);

A solution

Consistently distinguishing nodes from primitive values is a common challenge in the design of virtual machines. A straightforward approach will suffice here, without overly complicating the machine design. The idea is to “box” every primitive value (number, boolean, string) in a heap node.

// node layout
// 0: kind: -1007
// 1: size, including header: 4
// 2: current number of children:  0
// 3: the value

const primitive_value_kind = -1007;
const actual_value_offset = 3;

function make_primitive_value(the_value) {
    const node = make_node(primitive_value_kind, 
                           first_child_offset + 1);
    heap[node + actual_value_offset] = the_value;
    return node;
}

function is_primitive_value(f) {
    return heap[f + kind_offset] === primitive_value_kind;
}
function primitive_value_actual_value(f) {
    return heap[f + actual_value_offset];
}

function unbox(x) {
    return is_primitive_value(x) 
           ? primitive_value_actual_value(x)
           : error(heap[x], "unexpected value of kind:");
}

The load constant instructions now need to make boxed values.

        // machine clause for load constant instructions
        } else if (is_load_constant_instruction(instr)) {
            pc = pc + 1;
            push_on_operand_stack(
                make_primitive_value(load_constant_value(instr)), 
                operands);
        }

The primitive functions in call and tail-call instructions need to operate on boxed values.

        // machine clause for call instructions
        } else if (is_call_instruction(instr)) {
            const arity = call_instruction_arity(instr);
            const args = pop_arguments_from_operand_stack(operands, arity);
            const callee = pop_from_operand_stack(operands);
            if (is_primitive_function(callee)) {
                pc = pc + 1;
                push_on_operand_stack(apply_primitive_function(
                                          callee,
                                          args),
                                      operands);
            } else {
                ...
	    }
	} ...

where the function apply_primitive_function works as follows.

function apply_primitive_function(fun, arglist) {
    return make_primitive_value(
               apply_in_underlying_javascript(
                   primitive_implementation(fun), 
                   map(address => heap[address + actual_value_offset],
                       arglist)));
}

Finally, the machine clause for jump-on-false instructions needs to unbox the boolean value before it can be used for the branching.

        // machine clause for jump-on-false
        } else if (is_jump_on_false_instruction(instr)) {
            pc = is_truthy(unbox(pop_from_operand_stack(operands)))
                 ? pc + 1
                 : jump_address(instr);

Outlook

Above, you have seen a machine that doesn’t rely on JavaScript to construct runtime entities such as environment frames. All runtime allocation is done in an array that contains only primitive values: numbers, boolean values, and strings. (I admit strings of arbitary length as “primitive” values. The reader is invited to find a way to fix this. Hint: Section 5.5.6 of SICP JS describes the required data structure called constant pool.)

What happens when memory runs out? Currently, the machine just gives up with an error. There could be many nodes in the heap that are not used any longer. The final post in this series will describe garbage collection: A way to reclaim unused heap memory. Stay tuned.


A Virtual Machine with Lexical Addressing     A Virtual Machine with Garbage Collection