An Intro to Compilers in JavaScript

Profile picture

Kartik Nair / Sun Apr 30 2023

As programmers languages are tools that we use on a daily basis, usually many of them interacting in different ways each used for a different purpose. For many of us, however, a language (or technically its compiler/interpreter) is a black box; our code goes in and out comes a running output or executable. In this guide, I’m gonna try and step you through building a programming language so we get to build out that black box with all of its components. The language we’ll be building is gonna be a very simple JS-like dynamically typed interpreted language. The language we decide to use to build this interpreter makes a difference since it decides how much we get for free from the “host” (implementing) language.

Since this is an introduction to compilers we’ll be picking JavaScript to implement this language, which gives us a lot of things for free since our language is also JavaScript-like. The full code for the final result can be found on GitHub.

The Pipeline

The compiler pipeline can mostly be split into a few parts. The first step is to convert the source code (a string of text) into a nested tree structure (known as an abstract syntax tree). This step is done to make stuff like whitespace irrelevant while also ensuring that the code is available in an easy-to-work-with format. This first step is usually split into two: tokenizing (converting the string of text into many small tokens) and then parsing (taking this flat list of tokens and converting it into an abstract syntax tree). Then once we have this tree we can do a few things, the most direct thing to do is to directly evaluate this tree and “run” the program and this is what many interpreted languages do. Instead, we could compile this to some lower-level language (like assembly or bytecode for a virtual machine). There are also many steps you could add in between the tree and interpreting/compiling; for example, if you’re language is typed this is where you would do type checking or you could optimize and “lower” (simplify) expressions in the tree.

The Language

The language we’re implementing will be really simple. It should have the basics of JavaScript and include things like control flow (conditionals, looping), functions/recursion, and some way to get output (like printing to the terminal). Here’s an example program that we’ll be able to run by the end of this guide:

let add1 = (n) -> {
	return add(n, 1);
}

let num = 0;
while(lt(num, 5), () -> {
	println(num);
})

One thing you might notice is that we use lt() instead of <, this is because operators can introduce unnecessary complexity while implementing a parser. It’s something that I recommend the reader work on themselves after the guide (highly recommend reading the Wikipedia pages for operator precedence and Pratt parsing).

Guide

We’re going to work in JavaScript, so set up a JavaScript project however you like. Personally, I’m just gonna be working in single file and using node to run the file directly.

Tokenizing

As mentioned earlier tokenizing converts the code into a flat list of tokens. It’s in this step that we’ll remove whitespace as a factor (but note that very few programmer errors are caught in this step). Below is what we’re trying to accomplish:

// Given code like:
let foo = 23;

// We want the output to be something like:
[
  {type: 'keyword', keyword: 'let'},
  {type: 'identifier', name: 'foo'},
  {type: 'symbol', symbol: '='},
  {type: 'numberLiteral', value: 23},
  {type: 'symbol', symbol: ';'},
]

Let’s start with the function signature for our tokenize function:

function tokenize(code) {
	// ...
}

Within this function, we’ll have sections for each part of the code. All of these will be wrapped into a token() function within this function. This allows us to have shared state between tokenizing. We then call this function in a loop to collect these tokens and return the list back.

function tokenize(code) {
  let cursor = 0;

	// The list of reserved identifiers and symbols
  let keywords = ["let", "return"];
  let symbols = [";", ",", "=", "(", ")", "{", "}"];

	function token() {
	}

	let tokens = [];
  while (cursor < code.length) {
    let t = token();
    if (t !== null) {
      tokens.push(t);
    }
  }
  return tokens;
}

The first thing we can start with is by skipping any whitespace. First let’s make a top-level function to check if a character is whitespace:

// Little regex test
function isWhitespace(str) {
  return /\\s/.test(str);
}

And then back in our token function:

function token(code) {
	if (isWhitespace(code[cursor])) {
		cursor++;
		return null; // no token returned
	}
}

Next, we’ll be adding 4 sections to our token function to support 4 different things:

  1. Symbols

  2. Numbers

  3. Strings

  4. Identifiers (names)

Starting first with symbols (which were listed in the shared state of our tokenize function):

function token(code) {
	// ...
	if (symbols.includes(code[cursor])) {
		cursor++; // skip the character for the next time token() is called
		return {
			type: "token",
			tokenType: "symbol",
			symbol: code[cursor - 1], // the character we skipped is the symbol
		};
	}
}

After this, we add tokenizing for numbers. We’re currently only adding support for integers (so no need to worry about .), which can be tokenized by looking for a digit and then keep going until there’s no more digits:

function token(code) {
	// ...
	if (isDigit(code[cursor])) {
		let numberBeginIdx = cursor;
		while (isDigit(code[cursor])) {
			cursor++;
		}

		let numberEndIdx = cursor;
		let numberString = code.substring(numberBeginIdx, numberEndIdx);
		let numberValue = parseInt(numberString); // convert the number to an actual int

		return {
			type: "token",
			tokenType: "number",
			value: numberValue,
		};
	}
}

Then we add support for strings. Strings work by looking for a quote (either single or double) and then considering everything until the other quote as part of the string:

function token(code) {
	// ...
	if (code[cursor] === `'` || code[cursor] === `"`) {
		let quote = code[cursor]; // save which quote we started with
		cursor++;

		let stringBeginIdx = cursor;
		while (code[cursor] !== quote) cursor++;

		let stringEndIdx = cursor;
		let stringValue = code.substring(stringBeginIdx, stringEndIdx);
		cursor++; // skip the closing quote

		return {
			typ: "token",
			tokenType: "string",
			value: stringValue,
		};
	}
}

And then finally we’ll look for names (like variable names, function names, keywords). Names must start with an alphabet and then can have alphabets or digits in their names. So when we see an alphabet we keep going until the characters are alphanumeric:

function token(code) {
	// ...
	if (isAlpha(code[cursor])) {
		let identBeginIdx = cursor;
		while (isAlpha(code[cursor]) || isDigit(code[cursor])) cursor++;

		let identEndIdx = cursor;
		let name = code.substring(identBeginIdx, identEndIdx); // "let", "bar"
		if (keywords.includes(name)) {
			return {
				type: "token",
				tokenType: "keyword",
				keyword: name,
			};
		} else {
			return {
				type: "token",
				tokenType: "identifier",
				name: name,
			};
		}
	}

	// Last case if we get a character we don't know what to do with
	throw `Lexer error: unknown character '${code[cursor]}'`;
}

And that’s all for tokenizing! Let’s test out this function with some examples:

let code = `let foo = 24;`;
let tokens = tokenize(code);
console.log(tokens);

Run as you would (either in the browser or preferably through Node) and you should get:

[
  { type: 'token', tokenType: 'keyword', keyword: 'let' },
  { type: 'token', tokenType: 'identifier', name: 'foo' },
  { type: 'token', tokenType: 'symbol', symbol: '=' },
  { type: 'token', tokenType: 'number', value: 24 },
  { type: 'token', tokenType: 'symbol', symbol: ';' }
]

As I mentioned earlier, there are many valid list of tokens that are not valid programs, for example, this will be tokenized correctly:

let 24 = ;

Parsing

With tokenizing out of the way, we can now take these tokens and convert them into a tree. To do this we’ll start similarly to how we started tokenizing. A parent function with a simple signature:

function parse(tokens) {
	// ...
}

This function will take in a list of tokens and return back a tree. As we did in tokenize we will have shared state at the top of the function and then have multiple functions within to handle different scenarios. Just like in tokenize we’ll have a cursor but this time it’ll represent the cursor into our list of tokens:

function parse(tokens) {
  let cursor = 0;
	// ...
}

Let’s start with a function to parse an expression. Expressions are the end nodes in our tree, they can have children but the most nested node will always be an expression. In languages expressions are things that evaluate to a value (think numbers, strings, lists, etc.). For now, we only have 2 kinds of expressions: literals (numbers/strings), identifiers (variables):

function parseExpr() {
  if (
    tokens[cursor].tokenType === "number" ||
    tokens[cursor].tokenType === "string"
  ) {
    cursor++;
    return {
      type: "expr",
      exprType: "literal",
      token: tokens[cursor - 1],
    };
  } else if (tokens[cursor].tokenType === "identifier") {
    cursor++;
    return {
      type: "expr",
      exprType: "identifier",
      name: tokens[cursor - 1],
    };
  } else {
    parseErrorAtCursor("Expected expression.");
  }
}

As you can see return the token within the object we return allowing us to extract information from that later on while interpreting. We also used a function we haven’t made called parseErrorAtCursor, as you can guess by the name, this allows us to report an error wherever we are currently in the middle of parsing. We have the actual index into our source code through the current token so you could implement nice error reporting, but for now I’ll go with something very simple:

function parse(tokens) {
  let cursor = 0;

	function parseErrorAtCursor(msg) {
		throw `Parse error at token: ${cursor}: ${msg}`;
	}
	
	function parseExpr() {
		// ...
	}
}

With expressions out of the way we can move to statements. We’re gonna start very simple with only one kind of statement: a variable declaration. Before we make our statement parsing function, there’s something that pops up commonly while parsing. Sometimes we expect the next token to be a specific type and if it isn’t that type we want to throw an error. We have to do this often enough where it makes sense to wrap this logic into a function:

function expect(expectedType, errorMsg) {
  if (tokens[cursor].tokenType === expectedType) {
    cursor++;
		// return the token so caller can use it
    return tokens[cursor - 1];
  } else {
    parseErrorAtCursor(errorMsg);
  }
}

Ok now let’s start with our function to parse statements. Since we have only one kind of statement we can quickly check if the start is valid by checking for the let keyword. Once we have that we go step-by-step checking for what we need. A variable declaration has these 4 parts:

  1. let keyword

  2. identifier

  3. = symbol

  4. expression

So in our parsing logic we’ll check for these 4 things as well:

function parseStmt() {
	// 1.
  if (
    tokens[cursor].tokenType === "keyword" &&
    tokens[cursor].keyword === "let"
  ) {
    cursor++;
		// 2.
    let name = expect("identifier", "Expected variable name.");
		// 3.
		// bit longer since expect() won't directly work (as specific type of symbol)
    if (
      tokens[cursor].tokenType === "symbol" &&
      tokens[cursor].symbol === "="
    ) {
      cursor++;
    } else {
      parseErrorAtCursor("Expected `=` after variable name");
    }
		// 4.
    let initializer = parseExpr();
    return {
      type: "stmt",
      stmtType: "binding",
      name: name,
      initializer: initializer,
    };
  } else {
    parseErrorAtCursor("Only let statements allowed.");
  }
}

Now that we have the parse statement function we can add some actual logic to our top-level parse function. We’ll do a similar thing to what we did in tokenize and constantly call parseStmt (as we expect a program to be a list of statements); however, we’ll also check for semicolons in between them:

function parse(tokens) {
  let cursor = 0;

	function parseErrorAtCursor(msg) { /* ... */ }
	function parseExpr() { /* ... */ }
	function expect() { /* ... */ }
	function parseStmt() { /* ... */ }

	let stmts = [];

  while (cursor < tokens.length) {
    stmts.push(parseStmt());

    // if not a symbol or if symbol but not semi
    if (
      tokens[cursor].tokenType !== "symbol" ||
      (tokens[cursor].symbol && tokens[cursor].symbol !== ";")
    ) {
      parseErrorAtCursor("Expected ';'.");
    }
    cursor++; // skip the ;
	}

  return stmts;
}

And now we have parsing! Try running the previous examples (and also try some invalid examples to see if the errors work!):

let code = `let foo = 34;`;

let tokens = tokenize(code);
let ast = parse(tokens);
console.log(ast);

Should give you:

[
  {
    type: 'stmt',
    stmtType: 'binding',
    name: { type: 'token', tokenType: 'identifier', name: 'foo' },
    initializer: { type: 'expr', exprType: 'literal', token: [Object] }
  }
]

And if you try the broken example from before:

let 24 = ;

We get an error:

Parse error at token 1: Expected variable name.

Definitely not the best error, but that’s a fun thing to work on by itself.

Complicating Things a Bit

Before we get into executing this tree, at it’s current stage this language can’t really do much. There’s one really important thing missing: functions! I thought we would save this for later, since now you should be more familiar with the other two phases and you can get some real experience of adding something horizontally (through each phase).

If you remember from previously, functions are expressions (just like in JavaScript) and syntactically look like:

(param1, param2) -> {
	// body
	return add(param1, param2)
}

We can almost full tokenize this except for the one -> symbol. Our other symbols were all one character so we could just check if the current character was in the symbols list, but for this we’re gonna have to do two checks (adding an else if after our initial symbols check):

// ...
if (symbols.includes(code[cursor])) {
  cursor++;
  return {
    type: "token",
    tokenType: "symbol",
    symbol: code[cursor - 1],
  };
} else if (code[cursor] === "-") {
  cursor++;
  if (code[cursor] === ">") {
		cursor++;
    return {
      type: "token",
      tokenType: "symbol",
      symbol: "->",
    };
  } else {
    throw `Lexer error: unexpected symbol '-'`;
  }
}
// ...

With that done, just make sure your keywords list includes the return keyword, and we now have functions that can be tokenized. Let’s move on to parsing!

There’s a few new things we need to parse:

  1. Function literals (expressions)

  2. Function calls (expression)

  3. Return statements

Let’s start with the easiest: return statements. A return statement is just 2 components: the return keyword and an expression:

function parseStmt() {
  if (
    tokens[cursor].tokenType === "keyword" &&
    tokens[cursor].keyword === "let"
  ) {
		// ...
	} else if (
		tokens[cursor].tokenType === "keyword" &&
		tokens[cursor].keyword === "return"
	) {
		cursor++;
		let value = parseExpr();
		return {
			type: "stmt",
			stmtType: "return",
			value: value,
		};
  } else {
		// ...
  }
}

Also a quick note on the final else that reports an error. Previously it used to say that only let statements are allowed, now it could say only let or return statements are allowed but we now have useful expressions that could exist as a statement. For example, a void function call on it’s own line. So let’s add expression statements (which as the name might suggest are statements that just have an expression):

function parseStmt() {
  if (
    tokens[cursor].tokenType === "keyword" &&
    tokens[cursor].keyword === "let"
  ) {
		// ...
	} else if (
		tokens[cursor].tokenType === "keyword" &&
		tokens[cursor].keyword === "return"
	) {
		// ...
  } else {
		let expr = parseExpr();
		return {
			type: "stmt",
			stmtType: "exprStmt",
			expr: expr,
		};
  }
}

Now let’s try adding function literals. These are similar to number and string literals so we’ll add it close to there. A function literal is parsed when we see a ( where an expression is expected. We can then parse as we parse long things by splitting it into many steps:

  1. ( opening parenthesis

  2. comma separated parameter list

  3. ) closing parenthesis

  4. -> arrow

  5. { opening brace

  6. semicolon separated statement list

  7. } closing brace

Before we start there’s a lot of symbols we’re expecting, so let’s make a couple convenience functions to check for those (make sure these are within the top-level parse function):

function isSymbol(token, symbol) {
  return (
    token.tokenType === "symbol" && token.symbol === symbol
  );
}

function expectSymbol(symbolType, errorMsg) {
  if (isSymbol(symbolType)) {
    cursor++;
    return tokens[cursor - 1];
  } else {
    parseErrorAtCursor(errorMsg);
  }
}

I won’t show it but you can use that expectSymbol function in old code as well!

With those out of the way, let’s see how we can parse function literals in code (this is a long piece of code but keep checking the list and the numbers in the comments to see how things line up):

function parseExpr() {
  if (
    tokens[cursor].tokenType === "number" ||
    tokens[cursor].tokenType === "string"
  ) {
		// ...
  } else if (tokens[cursor].tokenType === "identifier") {
		// ...
  } else if (isSymbol(tokens[cursor], "(")) {
		// 1.
    cursor++;

		// 2.
    let params = [];
    if (!isSymbol(tokens[cursor], ")")) {
      while (true) {
        let param = expect(
          "identifier",
          "Expected parameter name in function literal."
        );
        params.push(param);

        if (isSymbol(tokens[cursor], ",")) {
          cursor++;
        } else {
          break; // if we get anything except for a comma we stop looping
        }
      }
    }

		// 3.
    expectSymbol(
      ")",
      "Expected closing `)` after parameters in function literal."
    );
		// 4.
    expectSymbol(
      "->",
      "Expected '->' after parameter list in function literal."
    );
		// 5.
    expectSymbol(
      "{",
      "Expected opening '{' after arrow in function literal."
    );

		// 6.
    let body = [];
    while (cursor < tokens.length && !isSymbol(tokens[cursor], "}")) {
      body.push(parseStmt());
      if (!isSymbol(tokens[cursor], ";")) {
        parseErrorAtCursor("Expected ';'.");
      }
      cursor++; // skip the ;
    }

		// 7.
    expectSymbol("}", "Expected closing '{' after body in function literal.");

    return {
      type: "expr",
      exprType: "funLit",
      params: params,
      body: body,
    };
  } else {
    // ...
  }
}

That was a lot of code, but hopefully still readable. That’s the reason we make convenience functions, to make parsing look very human (like a list of instructions just saying what you need in order). Now finally let’s add parsing for function calls. Calls are weird since any expression can be a call if you add () to the end of it.

This creates a bit of a problem since we need to keep track of the current expression being parsed (i.e. we can’t just return it once we’re done parsing), so we’ll declare a shared variable called expr and then assign to it every time instead of returning as we did before:

function parseExpr() {
    let expr;
		// ...
		if (
      tokens[cursor].tokenType === "number" ||
      tokens[cursor].tokenType === "string"
    ) {
	    cursor++;
			// like this instead of returning
	    expr = {
	      type: "expr",
	      exprType: "literal",
	      token: tokens[cursor - 1],
	    };
	  }
}

Once you’ve done that across the function we can add a trailing check at the end to see if this expression is being called:

function parse(tokens) {
	// ...
  while (isSymbol(tokens[cursor], "(")) {
    cursor++;

    let args = [];
    if (!isSymbol(tokens[cursor], ")")) {
      while (true) {
        args.push(parseExpr());

        if (isSymbol(tokens[cursor], ",")) {
          cursor++;
        } else {
          break; // if we get anything except for a comma we stop looping
        }
      }
    }

    expectSymbol(")", "Expect closing `)` in function call.");

    expr = {
      type: "expr",
      exprType: "call",
      callee: expr,
      args: args,
    };
  }

  return expr;
}

Notice how since we used while in the initial check instead of if we get multiple calls (like: foo()()()) for free! With this done we can finally move on to execution.

Execution

Now we’re gonna take these statements and actually run it! We’ll also add a little print() function so we can see things in the console output. Execution starts like every other step, a simple top-level function:

function execute(stmts) {
	// ...
}

Let’s first start by creating a function that can take an expression node and evaluate it to an actual JavaScript value. To do this we’ll create a nested function within our execute function and match on all the different types of expressions we have.

function execute(stmts) {
  function evaluateExpr(expr) {
    switch (expr.exprType) {
      case "literal":
        break;
      case "identifier":
        break;
      case "funLit":
        break;
      case "call":
        break;
    }
  }
}

Literals are simple since we already convert them to a JavaScript string or number while parsing:

switch (expr.exprType) {
  case "literal":
    return expr.token.value;
	// ...
}

Then we get to identifiers. So we need to store any names declared by the programmer during execution, so when a name is used we can fetch its value for them. So let’s make a top-level values object:

function execute(stmts) {
	let values = {};
	// ...
}

And now we can just assume that a name used will be referencing something in that values object. If it isn’t we can throw a runtime error:

switch (expr.exprType) {
	// ...
  case "identifier":
		// the first name is the token the second is the string version
    if (expr.name.name in values) {
      return values[expr.name.name];
    } else {
      throw `Runtime Error: Undefined variable '${expr.name}'.`;
    }
	// ...
}

We’ll stop here and add functions later. Let’s move on to executing statements! Like we did with expressions we’ll make a nested function to do this and use a switch for every statement type.

function evaluateStmt(stmt) {
  switch (stmt.stmtType) {
    case "binding":
      break;
    case "exprStmt":
      break;
    case "return":
      break;
  }
}

Since we’re doing functions later let’s ignore return statements, which leaves us with only variable declarations and expression statements. Variable declarations will use the values object from before and store the provided initial expression in there. We can use our new evaluateExpr function to actually evaluate the initial expression before storing it:

switch (stmt.stmtType) {
  case "binding":
    let name = stmt.name.name;
    values[name] = evaluateExpr(stmt.initializer);
    break;
	// ...
}

Same thing for our expression statements:

switch (stmt.stmtType) {
	// ...
  case "exprStmt":
    evaluateExpr(stmt.expr);
    break;
	// ...
}

And hey we’re getting somewhere now! But again, until we have functions we don’t have much we can do yet. So let’s get into the slightly more complicated stuff, it’s time to revisit the evaluateExpr function.

So what exactly should a function literal evaluate to? How are functions represented at runtime in JavaScript? What we need is an object that holds the body of the function and executes it with the provided arguments. To execute “with” arguments what we actually would do is rewrite those parameter names in our values object until the function is done executing:

// ...
case "funLit":
  return {
    call: (args) => {
      if (args.length !== expr.params.length) {
        throw `Runtime Error: Function called with invalid number of arguments.`;
      }

      let shadowed = {};
      expr.params.forEach((param, i) => {
        if (param.name in values) {
          shadowed[param.name] = values[param.name];
        }
        values[param.name] = args[i];
      });

      expr.body.forEach((stmt) => {
        evaluateStmt(stmt);
      });

      Object.entries(shadowed).forEach(([name, value]) => {
        values[name] = value;
      });
    },
  };
// ...

The reason this works is because the call member function is capturing the expression and storing it for whenever it’s called (we get this convenience from having JavaScript as our host language). you might also notice the whole shadowed thing. That’s there to prevent overwriting global variables that have the same name as function parameters, so we just capture the old values and then restore them post execution.

You might notice something problematic however, we have no way to handle returns. This is a pretty hard problem but we can exploit the fact that we’re using JavaScript and make return statements actually throw a custom error. This error behaves like a return in that it short circuits execution and bubbles up through recursive calls! Let’s declare this custom error type:

class Return {
  constructor(value) {
    this.value = value;
  }
}

And now we’ll change up the way we looped through and executed statements. We’ll switch from the forEach to a flat loop so we can return from the call function directly and we’ll catch (looking specifically for the Return error);

// ...
for (let stmt of expr.body) {
  try {
    evaluateStmt(stmt);
  } catch (e) {
    if (e instanceof Return) {
      return e.value;
    } else {
      throw e;
    }
  }
}
// ...

We’re missing out resetting our shadowed variables if we return early however. So let’s add that back:

return {
  call: (args) => {
		// ...
    function resetShadowed() {
      Object.entries(shadowed).forEach(([name, value]) => {
        values[name] = value;
      });
    }

    for (let stmt of expr.body) {
      try {
        evaluateStmt(stmt);
      } catch (e) {
        if (e instanceof Return) {
          resetShadowed();
          return e.value;
        } else {
          throw e;
        }
      }
    }

    resetShadowed();
  },
};

With this in place we can also quickly implement return statements!

switch (stmt.stmtType) {
	// ...
  case "return":
    throw new Return(evaluateExpr(stmt.value));
}

Now finally we can implement calling these functions. Calling is simple now that our function object just has a call member function, all we have to do is evaluate the callee to get the function object and then evaluate all the arguments and then hand it over to call:

// ...
case "call":
  let callee = evaluateExpr(expr.callee);
  let args = expr.args.map((arg) => evaluateExpr(arg));
  return callee.call(args);

Finally since everything is done let’s actually call our evaluateStmt function on all the statements we got:

function execute(stmts) {
  let values = {};

  function evaluateExpr(expr) { /* ... */ }
  function evaluateStmt(stmt) { /* ... */ }

  for (let stmt of stmts) {
    evaluateStmt(stmt);
  }
}

Since we have functions now we can do a lot of fun stuff! Let’s start real simple and add a built-in function that let’s us print a value to the console. To add a built-in function we can just add something to the values object that we declared. Note that it has to be a function object (since it has to look like a user-declared function):

let values = {
  println: {
    call: (args) => {
      console.log(args[0]);
    },
  },
};

And with so much code written we can finally give “hello world” a shot! Try it out at the bottom where you tried out tokenizing and parsing:

let code = `println("hello world");`;

let tokens = tokenize(code);
// console.log(tokens);

let ast = parse(tokens);
// console.log(ast);

execute(ast);

If everything works you should see that glorious “hello world” in your console. If not try going through to see where your code is different (or how it might behave differently in a different language if you’re using that), if you’re really struggling feel free to open an issue on GitHub!

Filling in Holes Using Functions

There’s technically a lot of problems with this language (which you’ll probably discover if you try using it for long) however, there’s some simple stuff we should add to make it more usable and then the complex problems are challenging fun for later!

I mentioned this while introducing the language but to keep parsing simple this language doesn’t have any operators, so if you want to include addition for example, you’d add a new add built-in function (like println). We’ll add a couple of these later and that should let you figure the rest out yourself, but before we go there is one operator we need to add that can’t be a function (in this type of language at least) and that’s assignment!

Until now we haven’t been able to re-assign to a variable (technically you could redeclare it but that wouldn’t work if you did it within a function because of shadowing. So let’s add this to parsing and execution (tokenizing is fine since = is already used by the variable declaration). We’re going to do something controversial for simplicity and make assignment a statement not an expression, this doesn’t allow you to chain them like: a = b = c, however, that’s a pretty rare case. We try to parse an assignment if we see an identifier followed by an =, if we get both of those then we look for the new expression as well:

function parseStmt() {
  if (
    tokens[cursor].tokenType === "keyword" &&
    tokens[cursor].keyword === "let"
  ) {
		// ...
  } else if (
    tokens[cursor].tokenType === "keyword" &&
    tokens[cursor].keyword === "return"
  ) {
		// ...
  } else if (
    tokens[cursor].tokenType === "identifier" &&
    isSymbol(tokens[cursor + 1], "=")
  ) {
    let name = tokens[cursor];
    cursor += 2;
    let value = parseExpr();
    return {
      type: "stmt",
      stmtType: "assign",
      name,
      value,
    };
  } else {
		// ...
  }
}

There is a weird edge case here, where if the identifier is the last thing in the file (with no semicolon) then our code would fail. So let’s add a quick check for that:

// ...
} else if (
  tokens[cursor].tokenType === "identifier" &&
  cursor + 1 < tokens.length &&
  isSymbol(tokens[cursor + 1], "=")
) {
// ...

With parsing done, evaluation for the assign statement is the exact same as the variable declaration:

function evaluateStmt(stmt) {
  switch (stmt.stmtType) {
    case "binding":
			// ...
    case "assign":
      values[stmt.name.name] = evaluateExpr(stmt.value);
      break;
		// ...
  }
}

Now we can run simple programs like:

let foo = 32;
println(foo);
foo = 45;
println(foo);

Ok and finally I’ll show you how to add convenience functions, we’ll add add (for +), lt (for <), and while. You can work on other operators or other control flow using this as inspiration. Let’s start with add. Add is a built-in function that takes two arguments and then uses the JavaScript + operator on them and then returns them back:

let values = {
  println: { /* ... */ },
  add: {
    call: (args) => {
      return args[0] + args[1];
    },
  },
};

For conditionals and comparisons we need to add boolean values to our language. For this we need to add the true/false keywords and treat them the same as string/number literals:

// ...
let keywords = ["let", "return", "true", "false"];
// ...

While parsing we look for these keywords now:

function parseExpr() {
  let expr;

  if (
    tokens[cursor].tokenType === "number" ||
    tokens[cursor].tokenType === "string"
  ) {
		// ...
  } else if (
    tokens[cursor].tokenType === "keyword" &&
    (tokens[cursor].keyword === "true" || tokens[cursor].keyword === "false")
  ) {
    cursor++;
    expr = {
      type: "expr",
      exprType: "literal",
      token: {
        ...tokens[cursor - 1],
        // add in the JavaScript value
        value: tokens[cursor - 1].keyword === "true",
      },
    };
  }
	// ...

With that slight side-tour done we can come back to adding lt to our values object:

let values = {
  println: { /* ... */ },
  add: { /* ... */ },
  lt: {
    call: (args) => {
      return args[0] < args[1];
    },
  },
};

Let’s add some control-flow now! We’ll add the concept of a while loop, but as a function. The first argument will be the condition and the second argument will be a function that holds the body. The problem with this is that we need to re-evaluate the condition every iteration. But we can’t do that since call get’s pre-evaluated arguments. The easy way to get out of this is to let call evaluate the arguments and just pass it raw expressions. Let’s start by changing this at the call expression:

case "call":
  let callee = evaluateExpr(expr.callee);
  return callee.call(expr.args);

And then we need to handle this while making function objects from literals and in our own built-in functions. Starting with function literals:

case "funLit":
  return {
    call: (args) => {
			// ...
      expr.params.forEach((param, i) => {
				// ...
        values[param.name] = evaluateExpr(args[i]); // the difference
      });
			// ...
    },
  };

In our built-in functions:

let values = {
  println: {
    call: (args) => {
      console.log(evaluateExpr(args[0]));
    },
  },
  add: {
    call: (args) => {
      return evaluateExpr(args[0]) + evaluateExpr(args[1]);
    },
  },
  lt: {
    call: (args) => {
      return evaluateExpr(args[0]) < evaluateExpr(args[1]);
    },
  },
};

Now that we have access to the raw expressions let’s try implementing a while loop!

let values = {
  println: { /* ... */ },
  add: { /* ... */ },
  lt: { /* ... */ },
  while: {
    call: (args) => {
      while (evaluateExpr(args[0])) {
        let func = evaluateExpr(args[1]);
        func.call([]);
      }
    },
  },
};

And with that in place we can finally run our goal program that we’d decided all the way at the beginning!

Conclusion

The language we’ve implemented isn’t a great language (it has many problems in terms of design and lacks a lot of usability); however, the point of this guide was to get you comfortable enough where you can add features (or even start from scratch!) to make something that’s your own. If you do end up doing that, let me know!