Writing a self compiling compiler

April 15, 2020

During the past year I started working on a compiler called Rumi, where I implemented small things that I found useful. The process involved writing a C++ code and using LLVM to optimize and generate the code. It quickly reached to a point where I could run and compile complex programs in it, however, it was a bother to have such a good language and not be able to use it to further develop the compiler.

So I decided to rewrite rumi in itself, and keeping a log of what happened and how I handled the problem. Just to keep things interesting, I'm going to rewrite it entirely from scratch, so that I can explore more options in bootstrapping. But let's start by showing off what rumi can do:

Rumi's Introduction

Rumi is a compiled, typed, language. The main feature is that rumi's meta language is itself. The goal for Rumi is to be a language that provides compiler's interface in compile time, so that the developer can decide how certain things should be have, such as how a certain struct is initializing its values or how a function should be linked. Here is a quick sample of rumi:

rem := (a: u64, b: u64) -> u64{
  return a % b;
}

@compile
test_rem_1 := () -> int {
  if (rem(8, 4) != 0 ){
    printf("rem(8,4) != 0\n");
    return 2;
  }
  
  if (rem(6,4) != 2){
    printf("rem(6,4) != 2\n");
    return 2;
  }
  
  return 0;
}

You can see two functions, a remainder function and a test of that, which is ran in compile time. This method prevents compilation if there is anything wrong with this function, since it is returning a number higher than 1. Okay, so maybe that's not what I promised, but this works and you can run complex patterns either for testing or configuring the compiler. But that's enough for now, let's see what bootstrapping a compiler is.

Bootstrapping

Bootstrapping, in term of compilers, is a process done to provide a basic version of compiler powerful enough to compile itself. It is usually a recursive pattern where each version adds new features to make creating the next version easier. There are a couple of different ways to do it:

Cross-compiling

An initial version of a compiler could be compiled on another machine architecture, this is usually how c compilers are born. However, this is not a valid option for us since rumi is not available on any architecture at this point.

Handwritten

Let's not even joke about this one. The complexity of modern compilers and the assembly language put together is so hard that might make it take months, and I want to do this in a less than a weekend.

Interpretting

Another option is writing an interpreter, where we don't actually compile anything, but just interpret it, kind of like python or JavaScript. This might be useful for us, but...

Using Another Language

We can start creating a smaller subset of our language in another language, let's say c++, and then start from there. This is a valid option for us, since I already have a working compiler and can borrow code from it a lot. Plus, this makes it easier to port to another architecture.

Choosing a subset

So we have to choose a subset that is both simple enough to create, while being powerful enough to compile itself. Here is what I think should suffice:

  • Function Define: We can't really have a program without them
  • Function Call: Again, we need this.
  • While Loop: We need something that allows repetition, and I'm not going to rely on goto (we don't even have them in rumi!).
  • If Else: Branching is a must, otherwise things won't make sense.
  • Break: We might be able to get away without having a continue, but we can't really do so without a break command, ommiting it requires more complex structures
  • Pointers: Because we need to use LLVM, so it's not even a question
  • Pointer Refrencing: Again, how am I going to see what's inside?
  • CStrings: We need them for certain small behavior, but we need them.
  • Structs: Truth be told, we might be able to get away without them, but they aren't that hard to implement and they give us tremendous power
  • Integers and Floats: Do I even need to say these? We need them!
  • Simplified Expressions: As in only one operator per command, and we only support + - * / and %
  • Forward Decleration: Granted, they aren't really useful in rumi, but we need them here because otherwise we need to write a double passing compiler at the start, which I'm not a fan of.
  • Comments: Yup.

With all that said simple rumi, or srumi if you will, needs to parse a single file and output object files. We can take care of the rest.

In the next post, I'll talk about how it went and see what else we can do!