AArch64 Bare Metal Boot Code

The basic bare metal code required to boot an AArch64 system is not terribly complicated; however, basic code will not do much. The code below handles the basic setup for a Raspberry Pi 3 or 4 and has a few advanced features not found in other boot code examples. The code assumes RPi 3 or 4 hardware, as elements of the BCM283x or BCM2711 peripherals are initialized in the code. It is commented, and the BCM specific code is generally abstracted out, so hopefully it is transparent enough that it can be adapted to different AArch64 platforms and different use cases. Features include:

  • Ability to handle entry in EL2 or EL3
  • Auto-detects Raspberry Pi version
  • Sets up the RPi Physical Timer
  • Sets up the Generic Interrupt Controller (GIC) for the RPi4
  • Sets up the Stacks for EL1 and EL0 Exception Processing
  • Initializes Environment for C Runtime
  • Initializes Environment for C++ Runtime

There are a few limitations at this point:

  • Single Core Only
  • No Virtual Memory Management
  • Semi-specific to RPi 3 & 4, though compatibles *should* work

This is a lengthy post but splitting it into multiple separate posts would probably be distracting. At the 100,000ft level, the idea is that the processor enters the top of this code in either EL3 or EL2, initializes the features listed above and then drops to EL1, exiting into the C++ code which performs the ‘kernel initialization’ and then runs the kernel itself. All the code discussed in this post for the Raspberry Pi Bare Metal OS project can be found in my Github repository. Not all code in the kernel boot sequence is contained below, particularly the handful of subroutines which initialize different parts of the RPi hardware; consulting Github for this code will be helpful.

RPi Boot Process Overview

Raspberry Pis have a somewhat unique boot process which works well to prevent bricking of the device. When powered on, it is actually the GPU in the BCM peripheral chip’s VideoCore which starts to run boot code from an internal ROM and eventually starts the ARM processor. The ‘config.txt‘ and ‘cmdline.txt‘ files are loaded by the video core and parsed, and a variety of internal attributes are configured.

Once the two files are parsed and the video core is configured, the GPU loads the ‘armstub‘ file into the right spot in physical memory for the ARM processor to start executing it. On entry to the armstub the ARM core will be running in EL3. It is the armstub file which will eventually jump to the start of the kernel code.

The Raspberry Pi OS ships with ‘armstub8.bin‘ which is the default, but the armstub loaded by the video core can be changed in the ‘config.txt‘ file (consult Github for an example). The default armstub does some initialization before shifting the Exception Level down to EL2 prior to jumping into the kernel.

This project includes a custom ‘armstub’, named ‘armstub_minimal.bin‘. This minimal armstub does nothing more than jump into the kernel code – still at the EL3 Exception Level. This permits the startup code to handle any EL3 initialization that might be required for different use cases. There are a number of elements of the HW that need to be configured in EL3; those appear in the boot code below. The boot code below can also be used with the default armstub file shipped with the Raspberry Pi OS, as it will detect on entry whether the core is running in EL3 or EL2 – and will skip all the EL3 initialization if the core is already running in EL2.

Exception Levels

There are four Exception Levels built into ARMv8 cores. I suspect the term ‘Exception Level’ comes from ARMv8 interrupt processing (interrupts are a subset of more general ‘exceptions’) where different hardware or software ‘exceptions’ (not to be confused with C++ or Java exceptions) are tied to different Exception Levels. Additionally, there are sets of instructions that are restricted to specific Exception Levels.

  • EL3 – Highest Exception Level and the only level in which the processor can switch between the ‘secure’ and ‘non-secure’ states. Code running at EL3 is typically called a ‘Secure Monitor’. EL3 is optional in ARM processors.
  • EL2 – Hypervisor Exception Level; virtualization code runs at this level, and page fault exceptions generated by the memory manager when using 2-stage address translation are handled at this level. EL2 is also optional in ARM processors.
  • EL1 – What used to be called ‘Ring 0’ in OS development. This is the level the kernel and most interrupt handlers should execute within.
  • EL0 – What used to be called ‘Ring 3’ in OS development. This is the level within which application code will execute.

Exception Levels can change as a result of either (1) an exception which is handled at a specific (usually higher) exception level or (2) execution of the ‘eret‘ (exception return) instruction, which permits the core (PE or ‘Processing Element’ in ARM documentation) to potentially drop to a lower exception level. Taking an exception can leave the PE at the same EL or move it to a higher level; conversely, an exception return can leave the PE at the same EL or move it down.

The boot code in this post only supports execution in EL1 and EL0. Maybe in the future I will dabble in a lightweight hypervisor which would pull in EL2, but I doubt I will write a Secure Monitor for EL3. It should be noted that EL2 is only available in the non-secure state on these cores: if running in Trusted or Secure Mode, EL2 is disabled (Secure EL2 did not arrive until ARMv8.4).

Boot Code

The code below is a pretty complete Raspberry Pi AArch64 boot up example which has been tested on RPi 3 and 4 but which should be generalizable to any AArch64 system with EL3; processors without EL3 would require more modifications to initialize subsystems correctly in EL1.

Linker Directives

Below the #defines but just before the assembly code, there are 3 linker directives. The first:
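
    // sketch – see the repository's boot code for the exact directives
    .section ".text.boot"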

instructs the linker to place the code in the file into the ‘.text.boot‘ section of the memory map. The ‘.text‘ section is referenced in the linker script described in the previous post. The next directive simply tells the linker to expose the ‘_start‘ symbol as global:
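
    .global _start              // make _start visible to the linker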

Determining Current Exception Level

AArch64 has a dedicated register for holding the current exception level, unsurprisingly named ‘CurrentEL‘. Bits 2 and 3 of this register hold the exception level – binary 0 through 3 for exception levels 0 through 3 respectively.
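
A minimal sketch of the check (labels are illustrative):

    mrs   x0, CurrentEL         // read the current exception level
    lsr   x0, x0, #2            // the level lives in bits [3:2]
    cmp   x0, #3
    b.eq  entered_in_el3        // EL3 entry: perform the secure world setup
    cmp   x0, #2
    b.eq  entered_in_el2        // EL2 entry: skip the EL3-only initialization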

Configuration in EL3

There are a number of probably non-obvious initialization steps in the boot code. I found these in the RPi Armstub, though I believe they are also described in the ARM Documentation.

First, the L2 Cache for EL1 is configured with a latency of 3 cycles. This register needs to be configured early in the boot process, before memory access occurs. Next, the floating point and SIMD instruction sets are enabled.

The Secure Configuration Register for EL3 (SCR_EL3) is initialized next. The bits set in the SCR_EL3 register appear in the code, and their meaning can be found in the ARM documentation. After the SCR, the Auxiliary Control Register for EL3 (ACTLR_EL3) is initialized. The ACTLR_EL3 register controls implementation defined features – so the reference for them will be the actual processor documentation. After the ACTLR_EL3 register, the CPU Extended Control Register for EL1 (CPUECTLR_EL1) is initialized. Again, the documentation is the best place for more detail.
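
Something like the following – the constants shown are the typical RPi armstub values, so check the repository and the ARM documentation before reusing them:

    mrs   x0, s3_1_c11_c0_2     // L2CTLR_EL1 (IMPLEMENTATION DEFINED encoding)
    mov   x1, #0x22             // illustrative: 3 cycle L2 read/write latency
    orr   x0, x0, x1
    msr   s3_1_c11_c0_2, x0

    msr   cptr_el3, xzr         // do not trap FP/SIMD accesses to EL3

    // illustrative SCR_EL3: NS=1, HCE=1, SMD=1, RW=1 (lower ELs are AArch64)
    mov   x0, #0x5b1
    msr   scr_el3, x0

    // ACTLR_EL3 and CPUECTLR_EL1 values are implementation defined – see the repo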

Identifying the RPI Type and Setting Up the Physical Timer

Next, the boot code jumps to a subroutine which identifies the Raspberry Pi Board Type, currently RPi3 or RPi4. This needs to be done to correctly configure the Physical Timer and to determine how to configure the interrupt controller.
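
The repository holds the actual subroutine; one common approach, sketched here, is to read the core's part number from MIDR_EL1 (a Cortex-A53 indicates an RPi3, a Cortex-A72 an RPi4):

    mrs   x0, midr_el1          // main ID register
    ubfx  x0, x0, #4, #12       // extract the part number field, bits [15:4]
    mov   x1, #0xd03            // Cortex-A53 -> RPi3
    cmp   x0, x1
    b.eq  board_is_rpi3
    mov   x1, #0xd08            // Cortex-A72 -> RPi4
    cmp   x0, x1
    b.eq  board_is_rpi4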

The Physical Timer must be configured in EL3 and the configuration differs between the BCM283x and BCM2711 peripherals. The code for identifying the board type and setting up the timer can be found in my Github repo. After initializing the Physical Timer, the code will then initialize the Generic Interrupt Controller (a GIC-400 in this case) if the board is an RPi4; the RPi3 does not contain a GIC, so for the RPi3 family the GIC-400 initialization is skipped.

Further down the boot code, IdentifyBoardType is called a second time, which may seem odd. This is a bit inefficient, but fortunately there is not a lot of code in the identification subroutine. The second call to IdentifyBoardType is needed because it occurs in EL1, and the result is stored in a global variable which is then accessible to the kernel code. The value cannot be stored after the first call to IdentifyBoardType, as that call is made in EL3 and EL3 runs in the secure state with a view of memory that is not shared with EL1.

Switching to EL2

Just after configuring the Physical Timer and conditionally configuring the GIC, the boot code initializes the System Control Register for EL2 (SCTLR_EL2). This register and the initialization values are in the ARM Documentation. After SCTLR_EL2 is initialized, we jump to the EL2 Exception Level – if we entered in EL3.
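
A sketch of the hand-off (the SPSR constant shown is the typical one):

    mov   x0, #0x3c9            // DAIF masked, M[3:0] = 0b1001 -> EL2h
    msr   spsr_el3, x0
    adr   x0, running_in_el2    // where execution resumes after the eret
    msr   elr_el3, x0
    eret                        // 'return' from EL3 into EL2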

In the code snippet above, the Saved Program Status Register for EL3 (SPSR_EL3) is initialized and the address of the ‘running_in_el2‘ symbol is loaded into the Exception Link Register for EL3 (ELR_EL3). The meanings of the bits set in the SPSR_EL3 register are in the ARM documentation, but the field worth noting here is the last 4 bits, which are initialized with the value 9; this tells the PE to shift to EL2h mode on return from the exception routine. When the ‘eret‘ instruction is executed, the Exception Level changes to EL2 and the program counter picks up at the address in ELR_EL3, which is the ‘running_in_el2‘ symbol.

Single-Core Execution

At present, the boot code exits running in single-core mode. PE 0 is used for execution and PEs 1, 2 and 3 are parked in an infinite loop. This is temporary and will be relaxed for SMP execution later – right now, single threaded execution is all that is needed. There are a number of other examples of booting with multiple cores available.
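
A sketch of the check and the parking loop (labels are illustrative):

    mrs   x1, mpidr_el1         // multiprocessor affinity register
    and   x1, x1, #3            // PE number is in the bottom two bits
    cbz   x1, pe0_continues     // PE 0 carries on with the boot
park_pe:
    wfe                         // low power wait
    b     park_pe               // ...forever
pe0_continues: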

In the code above, the bottom 2 bits of the Multiprocessor Affinity Register (MPIDR_EL1) are checked for the value 0. The MPIDR_EL1 register holds information on the multi-processing state of the hardware and the different PEs; a value of zero in the bottom 2 bits indicates PE 0 on an RPi quad PE CPU. There is not much documentation on this register, so for other CPU configurations you will likely have to do some research to find the magic values to check.

If the current PE is not PE 0, then the PE is simply put into an infinite loop around the ‘wfe‘ instruction, which lets the CPU know the PE can drop into a low power state.

Setting the Stack Pointer for EL1

In the code snippet that follows, the stack pointer for EL1 is set to values associated with symbols defined in the linker script. The stack grows down from the indicated location toward the heap which grows up from the end of the program code. The EL1 stack pointer can only be set in EL2 or EL3.
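
    ldr   x0, =__el1_stack_top  // stack symbol from the linker script (name illustrative)
    msr   sp_el1, x0            // SP_EL1 may only be written from EL2 or EL3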

Processor Configuration

After setting the stack pointer for EL1, there are a handful of configurations for the counter/timer register, disabling EL2 traps for a variety of architectural features, enabling AArch64 in EL1 and finally configuring the CPU for execution in EL1 and EL0. A number of these settings are rather cryptic, particularly for CPTR_EL2, HSTR_EL2 and CPACR_EL1, so consult the ARM documentation before modifying values. In general, all traps from EL1 or EL0 to EL2 are disabled (as we are not implementing a hypervisor – at least yet) and traps from EL0 to EL1 for various instructions are also disabled. If you change the architectural features enabled, you should double-check the instruction traps.

I am no expert on these settings; they appear to be ‘standard’ for RPi bare-metal code. For other AArch64 implementations with special execution requirements (like Streaming SVE Mode) the configuration will be different.
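
A sketch of the typical settings – the values shown are the common RPi bare-metal ones, not necessarily the repository's exact constants:

    // let EL1/EL0 read the physical counter and timer without trapping to EL2
    mrs   x0, cnthctl_el2
    orr   x0, x0, #3            // EL1PCEN | EL1PCTEN
    msr   cnthctl_el2, x0
    msr   cntvoff_el2, xzr      // no virtual counter offset

    mov   x0, #(1 << 31)        // HCR_EL2.RW = 1: EL1 executes in AArch64
    msr   hcr_el2, x0

    mov   x0, #0x33ff           // CPTR_EL2: do not trap FP/SIMD to EL2
    msr   cptr_el2, x0
    msr   hstr_el2, xzr         // no system register traps to EL2

    mov   x0, #(3 << 20)        // CPACR_EL1.FPEN = 0b11: no FP/SIMD traps at EL1/EL0
    msr   cpacr_el1, x0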

Switching to EL1

Much like moving from EL3 to EL2, to move from EL2 to EL1 it is necessary to set up the Saved Program Status and Exception Link Registers for EL2 and execute the ‘eret‘ instruction.
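
The pattern mirrors the EL3 to EL2 transition; the constant shown is the typical EL1h value:

    mov   x0, #0x3c5            // DAIF masked, M[3:0] = 0b0101 -> EL1h
    msr   spsr_el2, x0
    adr   x0, running_in_el1    // label illustrative
    msr   elr_el2, x0
    eret                        // drop from EL2 into EL1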

After switching to EL1, the code gets the board identity again and stores it in a global variable for use from kernel code.

Setting Up Exception Vectors

AArch64 exception vector tables are set up in memory and are used to identify the correct handler for different exceptions. Recall, exceptions are a super-set of interrupts. The code required for setting up the vectors can be found in my Github repository in the isr_kernel_entry.S file. The key elements in that file are the kernel entry and exit code, which saves the registers on entry and restores them on exit, and the exception table itself.
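
Installing the table is then a single register write (the symbol name here is illustrative; the table itself must be 2KB aligned):

    adr   x0, exception_vector_table
    msr   vbar_el1, x0          // point EL1 exception handling at the table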

Setting the Stack Pointer for EL0 Exceptions

Code required to set up the stack pointer for EL0 exceptions appears below. The comments in the snippet describe the interaction of exception processing and stack pointers. In short, if SPSel == 1, then the ‘h‘ suffixed vectors are used and each exception level will have its own stack pointer. The CPU could be configured to share a stack pointer between EL1 and EL0, and that could be fine for bare metal code executing only in EL1, but for an OS with processes running in EL0 we should have different stacks.

For clarity, each process in EL0 will have a different stack of its own. The EL0 stack here is for exception processing where the exception code is run in EL0. Looking through the documentation, it appears as if SPSel settings are partly driven by support for the Linux exception processing model.

As is the case for EL1, the symbol used for the EL0 top of stack is found in the linker script.
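
A sketch of the setup (the stack symbol name is illustrative):

    msr   SPSel, #1             // exceptions taken to ELx use SP_ELx
    ldr   x0, =__el0_stack_top  // stack symbol from the linker script
    msr   sp_el0, x0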

Clearing the BSS Segment for C Code

The C language model prescribes that the BSS segment, which holds uninitialized data, be zeroed before execution jumps to the ‘main()‘ function. The code below relies on symbols from the linker script to zero out the BSS. This is done in 8 byte chunks, so the alignment needs to be correct.
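
    // sketch – labels are illustrative; the symbols come from the linker script
    ldr   x0, =__bss_start
    ldr   x1, =__bss_end
zero_bss_loop:
    cmp   x0, x1
    b.hs  zero_bss_done         // stop when we reach the end of the bss
    str   xzr, [x0], #8         // zero eight bytes and advance
    b     zero_bss_loop
zero_bss_done: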

There is some discussion of the bss segment in my post on Linker Scripts.

Initializing C++ Static Globals

In the C++ Language Model, global static variables must be initialized prior to execution of the ‘main()‘ function. For static class instances, this will require invoking the class constructor and passing the correct memory location for the class instance.

Fortunately, C++ compilers do the heavy lifting for us. The compiler generates an array of pointers to void functions which, when called, initialize each static variable; the array can be walked just prior to jumping to the ‘main()‘ function. All the initialization code must do is walk the array and call the functions. The code below does just that.
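
    // sketch – labels are illustrative; both symbols come from the linker script
    ldr   x19, =__init_array_start
    ldr   x20, =__init_array_end
init_array_loop:
    cmp   x19, x20
    b.hs  init_array_done
    ldr   x0, [x19], #8         // fetch the next initializer's address
    blr   x0                    // ...and call it
    b     init_array_loop
init_array_done: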

This is the last step performed in the boot code before jumping to the kernel main.

Jumping to Kernel Main

Finally, we have the branch to kernel_main(). One detail here is that I chose to use the symbol name kernel_main() instead of main() specifically to avoid any risk of ‘special handling of main()’ applied by the compiler or linker.

If execution returns from kernel_main(), the PE is just parked.
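
In sketch form:

    bl    kernel_main           // enter the C++ kernel
hang:                           // should never return, but park the PE if it does
    wfe
    b     hang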

Conclusion

The post above provides *mostly complete* bare metal boot code for RPi3 or 4 platforms running in AArch64. Code referenced above can be found in the associated Github Repository.

Basics of GCC Linker Scripts

Even if you have been using the GCC suite of compilers heavily for years, it is unlikely you have had to create a Linker Script. Linker Scripts are needed when fine-grained control over the memory layout of an executable is required. Most C/C++ code is compiled to an application or service for a specific OS platform, so memory layout is both pre-defined and generally pretty relaxed.

This is not the case for Bare Metal code. When developing for bare metal, the location of the entry point for the code and the locations of global statics, the stack and possibly the heap must be specified. There is no OS or OS Memory Model to be used. The Linker Script defines the foundation of a bare metal code memory model to be used in the output image.

What the Linker Does

The role of the Linker in creating an executable image is to take a collection of input files that contain ‘segments’ and ‘symbols’ and combine those ‘input segments’ into ‘output sections’ of the final image – while also determining the correct value for various unresolved symbols in the ‘input segments’. Additionally, symbols can be defined in a Linker Script and those symbols will be available to source code. Examples of that kind of symbol resolution appear in the example script.

‘Input segments’ are generated by the compilers or assemblers producing the object files linked together to form the output image. As you read below, keep in mind that ‘segments’ live in the input object files while ‘sections’ are created by the linker. Since the code for my bare metal AArch64 OS is C/C++ (with a little assembly), the C/C++ Memory Model must be used.

Example Linker Script

Linker scripts use the LD Command Language. It is *mostly* pure specification – there are no if/then style conditionals, though it is possible to test whether a symbol is defined. The only required element of a linker script is at least one SECTIONS command. Sections describe the memory layout of the output binary. LD documentation may be found here.

Below is a linker script from a bare metal OS I have been tinkering with for the Aarch64 on the Raspberry Pi. It has a bit of complexity but is still simple enough to understand and either edit or extend for your needs.
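
What appears here is a condensed sketch of the script's overall shape – each piece is expanded and explained in the sections that follow, and the full script is in the Github repository:

/* Condensed sketch – this template is run through the C Preprocessor before linking */
#include "os_memory_config.h"

MEMORY
{
    OS_RAM (rwx) : ORIGIN = 0x00080000, LENGTH = 32M
}

SECTIONS
{
    .start        : { /* entry point code */ }          > OS_RAM
    .text         : { /* program code */ }              > OS_RAM
    .rodata       : { /* read only data */ }            > OS_RAM
    .data         : { /* initialized data */ }          > OS_RAM
    .init_array   : { /* C++ static initializers */ }   > OS_RAM
    .bss (NOLOAD) : { /* zeroed at startup */ }         > OS_RAM
    .static_heap (NOLOAD) : { /* empty heap block */ }  > OS_RAM
    /DISCARD/     : { /* segments dropped from the image */ }
}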

As explained at the very top of the script, this is actually a template which is run through the C Preprocessor which then expands the preprocessor directives and generates the final script. The ‘os_memory_config.h‘ file contains the following:

#pragma once

#define STATIC_HEAP_SIZE_IN_BYTES 65536
#define DYNAMIC_HEAP_SIZE_IN_BYTES 65536

The advantage of running this template through the preprocessor is that the symbols STATIC_HEAP_SIZE_IN_BYTES and DYNAMIC_HEAP_SIZE_IN_BYTES are now shared in both the linker script and the C/C++ code base – at C/C++ compile time. It is possible to adjust the size of the heaps from a single file, instead of having to remember that there are two places that must be changed.

Inside the Linker Script

Basic LD Command Syntax

Numeric values in a linker script are all integers and C integer operations are permitted. Symbols may be defined in a linker script. Unquoted symbols follow the same rules as C symbols, but symbols may also be quoted – which permits the inclusion of spaces or perhaps reserved words in the symbol.

Probably the most important symbol is the dot ‘.‘ global symbol. The dot symbol represents the current memory location counter maintained by the linker as it is assembling the output image. It may be both read and set.

Semicolons are required after assignment statements and are permitted in other locations but are not required. If you deep dive and end up using ELF Program Headers, semicolons are required there as well.

Standard C block comments are permitted with /* */ delimiters.

Defining a Memory Block
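
A minimal sketch of the declaration described below:

MEMORY
{
    OS_RAM (rwx) : ORIGIN = 0x00080000, LENGTH = 32M
}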

The snippet above defines a memory block named OS_RAM, 32 megabytes in length, starting at the physical location 0x00080000, which may be read, written to and executed. This location is not an accident – it is the place where the Raspberry Pi boot loader loads the OS image. There are additional attributes that can be specified for a memory block; they are described in the GCC LD documentation for Memory Layout.

LD permits only a single MEMORY declaration, but multiple blocks may be defined within it. The declaration is optional; if it does not exist, the linker assumes there is sufficient memory for the image.

Defining a Simple Memory Section
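
A sketch of the section, with its contents elided – it sits at the top of the script's SECTIONS command:

.start :
{
    . = ALIGN(4);
    __start = .;
    /* boot entry code */
} > OS_RAM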

Above is the start of the script's section specifications and a simple memory section called ‘start’ which is required to begin on a 4 byte aligned memory address. The ‘.’ location counter is set to the next four byte aligned location with . = ALIGN(4) and is then read and assigned to the global symbol __start with the __start = . statement.

At the end of the section specification, the > OS_RAM directive tells the linker to assign this section to the OS_RAM memory block defined previously in the script. As successive sections are assigned to this block, it will fill. If the size of the sections assigned to the block exceeds the 32M size of the block, the linker will exit with an error.

Defining a Section as a Group of Compiler Defined Segments

The C and C++ compilers define code and/or data segments. A section in a linker script usually defines a collection of input segments that are to be grouped into a single section of memory in the output image. An example follows:
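
/* sketch – the repository's input segment lists are more complete */
.text :
{
    KEEP(*(.text.boot))     /* boot code must survive dead code elimination */
    *(.text)
} > OS_RAM

.rodata :
{
    *(.rodata)
} > OS_RAM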

The .text section specification contains the KEEP statement in addition to a regular segment inclusion specification. KEEP is not documented in the GCC LD man pages (I have no idea why) but what it does is include those segments in the linker section and mark them as ‘used’ even if they are not referenced anywhere else in the input object files. Unreferenced input segments may be eliminated as dead code by the linker unless they are identified to be kept. In this case, we need to be sure the .text.boot segment is retained.

The .rodata section includes read only segments defined in the input object files.

The text segment of a C or C++ program is typically the object code to be executed. The C memory model is illustrated below and the different segments can be found in section declaration statements in the Linker Script.

Borrowed from GeeksforGeeks – https://media.geeksforgeeks.org/wp-content/uploads/memoryLayoutC.jpg

The .rodata and .data linker sections are generally the ‘initialized data’ segment of the map.

Wildcards in Section Specifications

The example above contains a couple of different uses of the asterisk ‘*’ as a wildcard. When specifying that a segment from an input file be assigned to a section, the full syntax is ‘foo.o (.input1)‘ where ‘foo.o‘ is the name of an input object file and .input1 is a segment in ‘foo.o‘. If you know that you will have multiple input files with a .input1 segment, then the file specification can be replaced with a wildcard: ‘*(.input1)‘ – which specifies that any .input1 segment in any input file should be included in the output image.

Segment names may also be wildcarded. For example, ‘*(.input*)‘ specifies that any segment whose name starts with ‘.input‘ in any input file should be included in the output; segments .input1, .input2 and .input345 would all be assigned to the linker section. The linker handles wildcards much like the Unix shell, with ‘?’ for single characters, ‘[chars]’ for membership and ‘-‘ for ranges (e.g. ‘[a-z]’).

C++ Static Variable Initialization

Given the object oriented nature of C++, there needs to be a mechanism to initialize global class instances *before* program execution. In the Linker Script we see this in the .init_array section.
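
A sketch reconstructed from the description that follows:

. = ALIGN(16);
.init_array :
{
    . = ALIGN(4);
    __init_array_start = .;
    KEEP(*(.init_array*))
    __init_array_end = .;
} > OS_RAM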

In the snippet above, memory is aligned to a 16 byte boundary before the .init_array section is declared. Then we ensure we have 4 byte alignment (which should already be the case after the prior alignment) and define the __init_array_start symbol as the current memory location. The .init_array segments from the input object code are then assigned to the section, and a second symbol __init_array_end is set to the new value of the location counter.

The __init_array_start and __init_array_end symbols may be referenced in object code and will be resolved when the output image is linked. In the AArch64 startup code, C++ globals are initialized as the last step before jumping to the start of the ‘kernel main’ function. The init array is just a list of void functions that each initialize a global static when called. Therefore, all the assembly language does is start at __init_array_start, fetch an 8 byte function address, call it, and move to the next sequential address until __init_array_end is reached.

In the C memory map, the C++ initialization array is assigned to the ‘initialized data’ portion of the map.

The BSS Section

In the C memory map there is the ‘uninitialized data segment’ called ‘bss’ which is also referenced in the Linker Script. The bss segment is not initialized as the C++ globals are initialized above but the whole section must be set to zero. The relevant section of the Linker Script is below:
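
/* sketch – see the repository for the exact input segment list */
.bss (NOLOAD) :
{
    . = ALIGN(4);
    __bss_start = .;
    KEEP(*(.bss))
    KEEP(*(.bss.*))
    . = ALIGN(8);
    __bss_end = .;
} > OS_RAM

__bss_size_in_double_words = (__bss_end - __bss_start) / 8;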

There is a similar pattern here: align the location counter to a 4 byte boundary, set the __bss_start symbol to the current memory location, keep a couple of other input segments labelled ‘bss’, align the location counter to an 8 byte boundary so we can set double words in memory to zero, and finally create the __bss_end symbol with the 8 byte aligned location. The __bss_size_in_double_words symbol is also computed in the linker script and can be referenced in code (example below).

The section is decorated with NOLOAD, which instructs the linker that there is no code or data to be placed in the output file for this part of the memory map. This makes sense for the .bss section – as it will all be explicitly set to zero in startup code. Another type of section that might be decorated with NOLOAD would be ROM which exists on the HW platform and can be referenced but does not need to be present in the image generated by the linker.

AArch64 startup assembly language to zero out the .bss section:
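
    // sketch – the repository's loop may differ in detail
    ldr   x0, =__bss_start
    ldr   x1, =__bss_size_in_double_words   // absolute symbol from the linker script
    cbz   x1, bss_cleared
clear_bss:
    str   xzr, [x0], #8                     // zero one double word and advance
    sub   x1, x1, #1
    cbnz  x1, clear_bss
bss_cleared: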

Defining an Empty Section

Sometimes it is helpful to define an empty block of memory in the output memory map. The .static_heap section below does just that.
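
/* sketch – STATIC_HEAP_SIZE_IN_BYTES comes from os_memory_config.h */
.static_heap :
{
    . = ALIGN(4);
    __static_heap_start = .;
    . = . + STATIC_HEAP_SIZE_IN_BYTES;
    __static_heap_end = .;
} > OS_RAM

__static_heap_size_in_bytes = __static_heap_end - __static_heap_start;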

This section is aligned to a 4 byte boundary, the __static_heap_start symbol is set to the current memory location, and then the value of the STATIC_HEAP_SIZE_IN_BYTES symbol included from the .h file is added to the current location. After the location is advanced, __static_heap_end is set to the current location. No input segments from the input files are assigned to the section and nothing is kept – this is just a chunk of memory. I guess it could be decorated with NOLOAD, but since there are no input segments specified, there will be nothing to load anyway. Finally, the symbol __static_heap_size_in_bytes is computed for potential use in the code. Based on its location in the linker script, this heap will appear just after the bss section of the C memory map.

Final Interesting Bits

The Linker Script contains a couple more semi-duplicative sections which carve out memory for heaps and stacks. The need for two different stacks will be discussed in my post on Aarch64 bootstrapping code. The last part of the script that is worth mentioning is the DISCARD section.
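
Reconstructed from the description below:

/DISCARD/ :
{
    *(.comment)
    *(.gnu*)
    *(.note*)
    *(.eh_frame*)
}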

The DISCARD section is a ‘reserved’ section which can be assigned input segments from the object code files and which will explicitly remove those segments from the output image. In the example above, anything in any .comment segment will be discarded by the linker and anything in any segment starting with .gnu or .note or .eh_frame will be dropped as well.

Adding the Linker Script to the Link Statement

The code snippet below shows the Makefile specification to process the Linker Script Template with the C Preprocessor, write that file to a new file and then use that new file when linking the output image.
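
# condensed from the project makefile – see the repository for the full version
LINKER_SCRIPT_TEMPLATE = link.template.ld
LINKER_SCRIPT = $(BUILD_ROOT)/link.ld

$(LINKER_SCRIPT):
	$(CPREPROCESSOR) -Iinclude $(LINKER_SCRIPT_TEMPLATE) -o $(LINKER_SCRIPT)

$(ELF): $(OBJ) $(LINKER_SCRIPT)
	$(LD) $(LDFLAGS) $(OBJ) $(LDLIBS) -T $(LINKER_SCRIPT) -o $(ELF)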

There are a bunch of Makefile symbols above – but the key elements should be apparent.

Where to Find the Code

The Linker Script, Makefiles and source code can be found in my Github repository. I have a prior post on the Makefile design which may also be helpful.

Bare Metal Build System

Normally, I prefer using CMake for C/C++ projects but for bare metal development, there are a number of requirements that are difficult to meet with CMake. Put simply, I could probably find a way to manage a bare metal build with CMake but at present, using good old Make seems preferable. One example of a bare metal challenge for CMake is specifying a Linker Script. There are ways to add a linker script as a dependency using CMake but they are generally pretty cryptic.

This post describes features of the GNU Make tool.

Primary Requirements for Bare Metal Build System

There are a handful of requirements I have for the build system:

  1. DRY – define compile and link settings once in one location
  2. Build subdirectories without a separate makefile in each subdirectory
  3. Support linker scripts naturally
  4. Simple debugging capability

Make is really just the combination of a rule execution engine with a text substitution and expansion engine. This combination is powerful, but it is sometimes difficult to wrap one’s head around what is happening when make processes a makefile. In short, make starts with the top level goal, derives all the dependencies required to complete that goal and then satisfies those dependencies. For an executable, the dependencies are compiling the source files and then linking the executable. Along the way, there is the potential for a lot of text expansion to generate the dependency names.

Make ‘include’

To satisfy the DRY (Don’t Repeat Yourself) requirement, make has the ability to include other makefiles using the ‘include‘ command. I have extracted the parameters needed across all the different projects in my bare metal system and put them into a ‘Makefile.mk‘ file which is included at the top of each project’s makefile. For example, the shared symbols for the AArch64 bare metal cross compilation build appear below:

TOOLS := ${HOME}/dev_tools
GCC_CROSS_DIRECTORY := ${HOME}/dev/gcc-cross

GCC_VERSION := 12.3.1

GCC_CROSS_TOOLS_PATH := $(TOOLS)/arm-gnu-toolchain-12.3.rel1-x86_64-aarch64-none-elf/bin/
GCC_CROSS_INCLUDE := $(GCC_CROSS_DIRECTORY)/aarch64-none-elf

CC := $(GCC_CROSS_TOOLS_PATH)aarch64-none-elf-gcc
LD := $(GCC_CROSS_TOOLS_PATH)aarch64-none-elf-ld
AR := $(GCC_CROSS_TOOLS_PATH)aarch64-none-elf-ar
OBJCOPY := $(GCC_CROSS_TOOLS_PATH)aarch64-none-elf-objcopy
CPREPROCESSOR := $(GCC_CROSS_TOOLS_PATH)aarch64-none-elf-cpp

ASM_FLAGS := -Wall -O2 -ffreestanding -mcpu=cortex-a53 -mstrict-align
C_FLAGS := -Wall -O2 -ffreestanding -fno-stack-protector -nostdinc -nostdlib -nostartfiles -fno-exceptions -fno-unwind-tables -mcpu=cortex-a53 -mstrict-align
CPP_FLAGS := $(C_FLAGS) -std=c++20 -fno-rtti
LD_FLAGS := -nostartfiles -nodefaultlibs -nostdlib -static

INCLUDE_DIRS := -I$(GCC_CROSS_INCLUDE)/lib/gcc/aarch64-none-elf/$(GCC_VERSION)/include -I$(GCC_CROSS_INCLUDE)/lib/gcc/aarch64-none-elf/$(GCC_VERSION)/include-fixed -I$(GCC_CROSS_INCLUDE)/aarch64-none-elf/include

CATCH2_PATH := $(TOOLS)/Catch2

TEST_CC := gcc
TEST_LD := g++
TEST_CFLAGS := -Wall -O2
TEST_CPP_FLAGS := $(TEST_CFLAGS) -std=c++20

COVERAGE_CC := gcc
COVERAGE_LD := g++
COVERAGE_CFLAGS := -Wall -O0 -fprofile-arcs -ftest-coverage
COVERAGE_CPP_FLAGS := $(COVERAGE_CFLAGS) -std=c++20

This file is included in subsequent makefiles with:

include ../Makefile.mk

Build Subdirectories with Make Functions

Make has a powerful set of functions which operate on symbols and lists specified in the makefile. I have used six of these functions to allow my makefile to process subdirectories in a project. The functions, with snippets of the GNU Make documentation for each, appear below:

Key Functions

  • $(addprefix prefix,names…) – prepends each ‘name‘ with the specified ‘prefix‘
  • $(patsubst pattern,replacement,text) – Finds whitespace-separated words in ‘text’ that match ‘pattern’ and replaces them with ‘replacement‘. Here ‘pattern’ may contain a ‘%’ which acts as a wildcard, matching any number of any characters within a word. If ‘replacement’ also contains a ‘%’, the ‘%’ is replaced by the text that matched the ‘%’ in ‘pattern‘. Words that do not match the ‘pattern’ are kept without change in the output. Only the first ‘%’ in the ‘pattern’ and ‘replacement’ is treated this way; any subsequent ‘%’ is unchanged.
  • $(wildcard pattern…) – used anywhere in a makefile, is replaced by a space-separated list of names of existing files that match one of the given file name ‘patterns‘. If no existing file name matches a ‘pattern‘, then that ‘pattern’ is omitted from the output of the wildcard function. Note that this is different from how unmatched wildcards behave in rules, where they are used verbatim rather than ignored (see Pitfalls of Using Wildcards).
  • $(foreach var,list,text) – The first two arguments, ‘var’ and ‘list‘, are expanded before anything else is done; note that the last argument, ‘text‘, is not expanded at the same time. Then for each word of the expanded value of ‘list‘, the variable named by the expanded value of ‘var’ is set to that word, and ‘text’ is expanded. Presumably ‘text’ contains references to that variable, so its expansion will be different each time. The result is that ‘text’ is expanded as many times as there are whitespace-separated words in ‘list‘. The multiple expansions of ‘text’ are concatenated, with spaces between them, to make the result of ‘foreach‘.
  • $(call variable,param,param,…) – The ‘call’ function is unique in that it can be used to create new parameterized functions. You can write a complex expression as the value of a ‘variable‘, then use call to expand it with different values. When make expands this function, it assigns each ‘param’ to temporary variables $(1), $(2), etc. The variable $(0) will contain ‘variable‘. There is no maximum number of parameter arguments. There is no minimum, either, but it doesn’t make sense to use ‘call’ with no parameters.
  • $(eval param) – The ‘eval’ function is very special: it allows you to define new makefile constructs that are not constant; which are the result of evaluating other variables and functions. The argument to the ‘eval’ function is expanded, then the results of that expansion are parsed as makefile syntax. The expanded results can define new make variables, targets, implicit or explicit rules, etc. The result of the ‘eval’ function is always the empty string; thus, it can be placed virtually anywhere in a makefile without causing syntax errors. It’s important to realize that the ‘eval’ argument is expanded twice; first by the ‘eval’ function, then the results of that expansion are expanded again when they are parsed as makefile syntax. This means you may need to provide extra levels of escaping for “$” characters when using ‘eval’. The ‘value’ function (see The value Function) can sometimes be useful in these situations, to circumvent unwanted expansions.

With the exception of ‘addprefix‘ the other functions can be a bit complex. GNU Make documentation has more descriptive detail and examples. The key to the makefile operation is the use of these functions together with the text expansion engine to automatically generate dependencies from subdirectories.

Example Makefile

include ../Makefile.mk

SRC_ROOT := src
BUILD_ROOT := build
IMAGE_DIR := image

BUILD_DIRS := $(IMAGE_DIR) $(BUILD_ROOT) $(BUILD_ROOT)/asm $(BUILD_ROOT)/c $(BUILD_ROOT)/c/utility $(BUILD_ROOT)/c/platform $(BUILD_ROOT)/c/platform/rpi3 $(BUILD_ROOT)/c/platform/rpi4 $(BUILD_ROOT)/c/devices $(BUILD_ROOT)/c/devices/rpi3 $(BUILD_ROOT)/c/devices/rpi4 $(BUILD_ROOT)/c/isr $(BUILD_ROOT)/c/filesystem $(BUILD_ROOT)/c/services

ASM_DIRS := asm
C_DIRS := c c/utility c/platform c/platform/rpi3 c/platform/rpi4 c/devices c/devices/rpi3 c/devices/rpi4 c/isr c/filesystem c/services
CPP_DIRS := c c/utility c/platform c/platform/rpi3 c/platform/rpi4 c/devices c/devices/rpi3 c/devices/rpi4 c/isr c/filesystem c/services

ASM_SRC_DIRS := $(addprefix $(SRC_ROOT)/,$(ASM_DIRS))
C_SRC_DIRS := $(addprefix $(SRC_ROOT)/,$(C_DIRS))
CPP_SRC_DIRS := $(addprefix $(SRC_ROOT)/,$(CPP_DIRS))

ELF := $(BUILD_ROOT)/kernel8.elf
IMG := $(IMAGE_DIR)/kernel8.img

ASM_SRC := $(foreach sdir,$(ASM_SRC_DIRS),$(wildcard $(sdir)/*.S))
C_SRC := $(foreach sdir,$(C_SRC_DIRS),$(wildcard $(sdir)/*.c))
CPP_SRC := $(foreach sdir,$(CPP_SRC_DIRS),$(wildcard $(sdir)/*.cpp))

OBJ := $(patsubst src/asm/%.S,build/asm/%.o,$(ASM_SRC)) $(patsubst src/c/%.c,build/c/%.o,$(C_SRC)) $(patsubst src/c/%.cpp,build/c/%.o,$(CPP_SRC))

INCLUDE_DIRS += -Iinclude -I../minimalstdio/include -I../minimalclib/include -I../minimalstdlib/include
LDFLAGS += -L../minimalstdio/lib -L../minimalclib/lib
LDLIBS = -lminimalstdio -lminimalclib

LINKER_SCRIPT_TEMPLATE=link.template.ld
LINKER_SCRIPT=$(BUILD_ROOT)/link.ld


all: clean checkdirs $(IMG)


$(IMG): $(ELF)
	$(OBJCOPY) -O binary $(ELF) $(IMG)
	/bin/cp redistrib/*.* image/.
	/bin/cp armstub/image/armstub_minimal.bin image/.
	/bin/cp resources/*.txt image/.
	/bin/cp resources/sd.img image/.

$(ELF): $(OBJ) $(LINKER_SCRIPT)
	$(LD) $(LDFLAGS) $(OBJ) $(LDLIBS) -T $(LINKER_SCRIPT) -o $(ELF)


$(LINKER_SCRIPT):
	$(CPREPROCESSOR) -Iinclude $(LINKER_SCRIPT_TEMPLATE) -o $(LINKER_SCRIPT)

define make-asm-goal
$(BUILD_ROOT)/$1/%.o: $(SRC_ROOT)/$1/%.S
	$(CC) $(INCLUDE_DIRS) $(ASM_FLAGS) -c $$< -o $$@
endef

define make-c-goal
$(BUILD_ROOT)/$1/%.o: $(SRC_ROOT)/$1/%.c
	$(CC) $(INCLUDE_DIRS) $(C_FLAGS) -c $$< -o $$@
endef

define make-cpp-goal
$(BUILD_ROOT)/$1/%.o: $(SRC_ROOT)/$1/%.cpp
	$(CC) $(INCLUDE_DIRS) $(CPP_FLAGS) -c $$< -o $$@
endef


$(foreach bdir,$(ASM_DIRS), $(eval $(call make-asm-goal,$(bdir))))
$(foreach bdir,$(C_DIRS), $(eval $(call make-c-goal,$(bdir))))
$(foreach bdir,$(CPP_DIRS), $(eval $(call make-cpp-goal,$(bdir))))

The BUILD_DIRS symbol contains the list of all the directories into which source code will be compiled. It is critical (though unsurprising) that the subdirectory structure of the build directory match the subdirectory structure of the source code.

ASM_DIRS, C_DIRS and CPP_DIRS specify the lists of directories containing assembly, C and C++ source code. These directories must agree with those listed in BUILD_DIRS.

In the first bit of Make function magic, the ASM_SRC_DIRS, C_SRC_DIRS and CPP_SRC_DIRS are generated by adding the SRC_ROOT prefix to ASM_DIRS, C_DIRS and CPP_DIRS respectively.

ASM_SRC, C_SRC and CPP_SRC are then generated by looping over each entry in ASM_SRC_DIRS, C_SRC_DIRS and CPP_SRC_DIRS with ‘foreach‘ and then using the ‘wildcard‘ function to find all assembly, C and C++ source code in each of the directories in each of the source directories list respectively.

After all the source files are assembled above, all of the object files are generated in the OBJ list by using the ‘patsubst‘ function on each of the three source lists to replace the source code extensions ‘.S’, ‘.c’ and ‘.cpp’ with ‘.o’.

In the top section of the makefile, we have used a set of Make functions to transform a list of directories into a fully enumerated list of source files and object files. If a new subdirectory is added to the project, it simply needs to be added to the list of directories in the BUILD_DIRS list and the correct ASM_DIRS, C_DIRS and/or CPP_DIRS list depending on the compilation requirements.

Skipping ahead a bit, the variables make-asm-goal, make-c-goal and make-cpp-goal are defined using the ‘define‘ syntax which permits the variable to contain newlines. This is helpful for makefile snippets which will be passed to ‘eval‘.

Below those definitions, there is another set of ‘foreach‘ calls which use ‘eval‘ and ‘call‘ to take the goals just defined and generate a fully enumerated set of rules for the ASM_DIRS, C_DIRS and CPP_DIRS directory lists.

What finally ties everything together is the $(ELF) target which has a dependency on all of the object files in the $(OBJ) list. This also pulls in the linker script as a separate target which has been fed through the C preprocessor. I had a handful of symbols that needed to be shared across the C/C++ source code and the linker script, so I chose to use the C preprocessor to include those symbols and expand them in the linker script. This is a pretty clean, elegant way to ensure the symbols can be defined outside of the linker script and shared with other source code.

Double-Checking the Build

The echo target lists all of the directories, source files and object files to be built, and demonstrates the expansion engine and the different functions in action:
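
A sketch of such a target (the repository's version may differ):

echo:
	@echo "Build Directories: $(BUILD_DIRS)"
	@echo "ASM Source Directories: $(ASM_SRC_DIRS)"
	@echo "C Source Directories: $(C_SRC_DIRS)"
	@echo "CPP Source Directories: $(CPP_SRC_DIRS)"
	@echo "ASM Files: $(ASM_SRC)"
	@echo "C Files: $(C_SRC)"
	@echo "CPP Files: $(CPP_SRC)"
	@echo "Object Files: $(OBJ)"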

$ make echo

Build Directories: image build build/asm build/c build/c/utility build/c/platform build/c/platform/rpi3 build/c/platform/rpi4 build/c/devices build/c/devices/rpi3 build/c/devices/rpi4 build/c/isr build/c/filesystem build/c/services

ASM Source Directories: src/asm

C Source Directories: src/c src/c/utility src/c/platform src/c/platform/rpi3 src/c/platform/rpi4 src/c/devices src/c/devices/rpi3 src/c/devices/rpi4 src/c/isr src/c/filesystem src/c/services

CPP Source Directories: src/c src/c/utility src/c/platform src/c/platform/rpi3 src/c/platform/rpi4 src/c/devices src/c/devices/rpi3 src/c/devices/rpi4 src/c/isr src/c/filesystem src/c/services

ASM Files: src/asm/configure_gic.S src/asm/get_exception_level.S src/asm/global_variables.S src/asm/identify_board_type.S src/asm/isr_kernel_entry.S src/asm/park_core.S src/asm/setup_physical_timer.S src/asm/start.S src/asm/utility.S

C Files:

CPP Files: src/c/main.cpp src/c/utility/cppsupport.cpp src/c/utility/dump_diagnostics.cpp src/c/utility/memory.cpp src/c/utility/regex.cpp src/c/platform/exception_manager.cpp src/c/platform/kernel_command_line.cpp src/c/platform/platform.cpp src/c/platform/platform_info.cpp src/c/platform/rpi3/rpi3_platform_info.cpp src/c/platform/rpi4/rpi4_platform_info.cpp src/c/devices/block_io.cpp src/c/devices/emmc.cpp src/c/devices/log.cpp src/c/devices/mailbox.cpp src/c/devices/physical_timer.cpp src/c/devices/power_manager.cpp src/c/devices/sd_card.cpp src/c/devices/std_streams.cpp src/c/devices/system_timer.cpp src/c/devices/uart0.cpp src/c/devices/uart1.cpp src/c/devices/rpi3/rpi3_hw_rng.cpp src/c/devices/rpi4/rpi4_hw_rng.cpp src/c/isr/system_timer_reschedule_isr.cpp src/c/isr/task_switch_isr.cpp src/c/filesystem/fat32_blockio_adapter.cpp src/c/filesystem/fat32_directory_cluster.cpp src/c/filesystem/fat32_filesystem.cpp src/c/filesystem/filesystem_errors.cpp src/c/filesystem/filesystems.cpp src/c/filesystem/master_boot_record.cpp src/c/services/murmur_hash.cpp src/c/services/os_entity_registry.cpp src/c/services/random_number_generator.cpp src/c/services/uuid.cpp src/c/services/xoroshiro128plusplus.cpp

Object Files: build/asm/configure_gic.o build/asm/get_exception_level.o build/asm/global_variables.o build/asm/identify_board_type.o build/asm/isr_kernel_entry.o build/asm/park_core.o build/asm/setup_physical_timer.o build/asm/start.o build/asm/utility.o build/c/main.o build/c/utility/cppsupport.o build/c/utility/dump_diagnostics.o build/c/utility/memory.o build/c/utility/regex.o build/c/platform/exception_manager.o build/c/platform/kernel_command_line.o build/c/platform/platform.o build/c/platform/platform_info.o build/c/platform/rpi3/rpi3_platform_info.o build/c/platform/rpi4/rpi4_platform_info.o build/c/devices/block_io.o build/c/devices/emmc.o build/c/devices/log.o build/c/devices/mailbox.o build/c/devices/physical_timer.o build/c/devices/power_manager.o build/c/devices/sd_card.o build/c/devices/std_streams.o build/c/devices/system_timer.o build/c/devices/uart0.o build/c/devices/uart1.o build/c/devices/rpi3/rpi3_hw_rng.o build/c/devices/rpi4/rpi4_hw_rng.o build/c/isr/system_timer_reschedule_isr.o build/c/isr/task_switch_isr.o build/c/filesystem/fat32_blockio_adapter.o build/c/filesystem/fat32_directory_cluster.o build/c/filesystem/fat32_filesystem.o build/c/filesystem/filesystem_errors.o build/c/filesystem/filesystems.o build/c/filesystem/master_boot_record.o build/c/services/murmur_hash.o build/c/services/os_entity_registry.o build/c/services/random_number_generator.o build/c/services/uuid.o build/c/services/xoroshiro128plusplus.o

This is where one can see all of the pieces of the makefile come together and meets my requirement for a simple debugging capability. 

Extending the build

As mentioned above, adding a new subdirectory is straightforward – all one must do is add the subdirectory to the BUILD_DIRS list and to the correct ASM_DIRS, C_DIRS and/or CPP_DIRS list depending on the compilation requirements. Yes, it has to be put in two places – but that is because the BUILD_DIRS list specifies where the object files will be written, whereas the ASM_DIRS, C_DIRS and CPP_DIRS lists specify which tool is used to assemble or compile the source code.

Beyond that, you really don’t need to understand the mechanics of the makefile. If you have a C++ only project, you can strip out the C and assembly code processing, though for C++ only, CMake is probably a better choice. If you don’t want to use CMake, then the makefile skeleton above *should* meet just about any need.

RPI Bare Metal Project

The makefile above is part of the Raspberry Pi Bare Metal OS project I have been pursuing. This project can be found in Github at: https://github.com/stephanfr/RPIBareMetalOS.git

Using Packer To Build Development VMs

One fundamental development practice is to have a bullet-proof, repeatable process for building, upgrading and maintaining development environments. My current development practice relies on developing inside a VM using Visual Studio Code’s Remote Development plugin. I treat the development VMs as ‘disposable’, i.e. at any moment in time I ought to be able to commit my work in progress to Github, destroy the VM, build a new one and pick up right where I left off. I should also be able to move seamlessly from one virtualized environment to another – for example, I should be able to develop in the Proxmox VM at home and a VirtualBox VM on my laptop when I travel without any friction between the two.

I use Hashicorp’s Packer tool to automate the VM build and configuration process. I usually maintain two target platforms: 1) a Proxmox server with an NFS backend I maintain at home and 2) VirtualBox which is installed on my laptops. Packer is declarative and supports multiple ‘builders’ for different backends. Both Proxmox and VirtualBox are supported and there is minimal difference in the builder specifications between the two.

Practical Packer

Packer and its builders and provisioners do most of the heavy lifting for you in terms of getting a ‘vanilla’ VM built. Perhaps the most complicated bit is figuring out the correct ‘boot command prefix’ to get past the bootloader. Honestly, getting a prefix that works involves a bit of trial and error and is somewhat cryptic. For the projects I have in Github, the prefix ‘works for me’ but if you are running on either a very fast or very slow machine, then your mileage may vary.

With a ‘vanilla’ VM in hand, the next step is to tailor it to your development needs. Packer is not inherently modular but I have managed to introduce some modularity by providing a collection of scripts that will be run inside the VM after it boots to customize the environment. The Packer specification will invoke these scripts which will either execute or return immediately based on environment variables set from Packer variables which can be set in a HCL file or set on the Packer command line.

The main challenge when executing the scripts is determining if you want the scripted commands to run as root, which is how the provisioner executes shell scripts, or as the ‘development user’ created early in the provisioning process. Essentially, the ‘development user’ is the username you will want to use when logging into the VM for development. The scripts will automatically create this user and assign a single-use password that will have to be changed on first login. The ‘change on login’ feature was not straightforward – so if you want a similar capability, just lift it from my code.

Github Projects

In the https://github.com/stephanfr/Packer repository you will find the Packer specifications for building Ubuntu development VMs in either a Proxmox or VirtualBox environment, as well as a project which will allow you to build a VM which can then be used to build bootable, customized RPi images in QEMU. This is a very nice capability and is all due to the work of Mateusz Kaczanowski in his PackerBuilderArm project.

One VM build option sets up an AArch64 bare-metal build environment inside the VM with a directory structure for my RPi Bare Metal OS project. In general though, if you are looking for an easy ARM toolchain setup – you can lift that code as well and modify it for your purposes.

Maintenance

Since I use these specs myself, I will track Ubuntu releases and tooling updates – but the timing is likely to be a bit erratic. I do not generally update tools immediately; I tend to value a stable environment over ‘latest and greatest’, but once a year or so I will update. Mostly, updates *should* be limited to tweaking config values but sometimes stuff just breaks. For example, I have not had success automating deployment of the very most recent Ubuntu VMs.

RPI Bare Metal Cross Compiling Toolchain

Bare metal builds require a specific toolchain and compile options to ensure that code normally generated for processes running in a standard OS environment is not emitted and that the various standard libraries are not linked.

Cross Compiling

I chose to use a cross-compiling approach for development. It should absolutely be possible to build the OS on an RPi itself with the correct toolchain, but there are a lot of advantages that accrue from building on a larger x86_64 system with NFS, etc.

A variety of toolchains are available on the ARM developer website; at present they are all GCC based. Clang builds ought to work as well, though the makefiles will have to be modified. The toolchain I have been using is: AArch64 bare-metal target (aarch64-none-elf).

Libraries and Header Files

I have avoided using any libraries, and all but the absolute bare minimum of header files, to build the OS. This includes creating very minimal replacements for the C standard library, the standard IO library and the C++ standard library. There are a number of platform specific libraries that are required.

Catch2 for Unit Testing

Unit Tests for the project are built using Catch2. At present, unit tests using Catch2 are compiled for the host machine – not cross-compiled for the target platform. Running Catch2 also requires access to some headers and libraries for standard compilation on the host. This is messy – no doubt about it – and this approach will miss issues with data structure alignment, which matters a lot on AArch64 platforms without memory management enabled. Eventually, it will be possible to host the tests on the target platform but for now my focus is to get the code tested in the most straightforward fashion.

Source code coverage metrics are also available for the unit tests using a coverage target in the makefile.

Directory Structure

The make files (yep good old fashioned make) expect the following structure:

~/dev
|--- RPIBareMetalOS
|    |--- project cloned from github
|--- gcc-cross
|    |--- aarch64-none-elf
|         |--- cross compiling files
~/dev_tools
|--- arm-gcc-toolchain
|--- Catch2

There is a script in the root directory of the project named setup_dev_env.sh which will setup the development environment. If you have a vanilla Ubuntu (or probably any reasonable Debian based instance) you can simply execute that script and it will install the cross compiling toolchain, Catch2, QEMU for AArch64 and then create the correct directories and copy files to the right places.

Quick Build

There will be another post with greater detail on the build process; however, the steps to build are straightforward. To get a running OS, simply do the following after cloning the project from Github:

cd RPIBareMetalOS/
cd minimalclib/
make all
cd ../minimalstdio/
make all
cd ../rpibaremetalos/
cd resources
unzip sd_compressed.zip
cd ..
make armstub_all
make all

Then, to run the OS in QEMU:

cd image/
qemu-system-aarch64 -M raspi3b -kernel kernel8.img -serial stdio -sd sd.img

Repeatable Scripted Setup

My development pattern is to create an Ubuntu VM customized for various projects and develop within it using the Remote Development Extension in Visual Studio Code. I run a Proxmox server at home but also use VirtualBox.

I have a separate project with Packer scripts to create Ubuntu development VMs in either of the two hypervisors listed above. Modifying the script for other hypervisors should be straightforward. In addition to the build process, there are a number of install scripts that are available to customize the VM after creation, one of which sets up an AArch64 bare metal toolchain and the directory structure I use for development. After building the VM with the right options, one should be able to simply clone the bare metal OS repository, build and run on QEMU.

The github repository for the Packer scripts is: https://github.com/stephanfr/Packer

An example command line to build a development VM template in Proxmox is:

packer build -var "dev_username=????" -var "dev_password=password" -var "proxmox_host=????" -var "proxmox_node_name=????" -var "proxmox_api_user=packer@pve" -var "proxmox_api_password=ubuntu" -var "ssh_username=packer" -var "ssh_password=password" -var "vmid=????" -var "http_interface=????" -var "install_aarch64_cross=true" -var-file="./22.04/ubuntu-22-04-version.pkrvars.hcl" -var-file="./proxmox/proxmox-config.pkrvars.hcl" -var-file="vm_personalization.pkrvars.hcl" ./proxmox/ubuntu-proxmox.pkr.hcl

After the template is created, clone it and then you will be good to go.

An example command line to build a development VM in VirtualBox appears below:

packer build -var "dev_username=????" -var "dev_password=password" -var "ssh_username=packer" -var "ssh_password=ubuntu" -var "install_aarch64_cross=true" -var-file="./22.04/ubuntu-22-04-version.pkrvars.hcl" -var-file="./virtualbox/virtualbox-config.pkrvars.hcl" -var-file="vm_personalization.pkrvars.hcl" ./virtualbox/ubuntu-virtualbox.pkr.hcl

Replace the ‘????’ sequences with appropriate values. At first login as the development user, you will be forced to change the user’s password.

More documentation can be found in the README file in the Packer script repository.

Building a Raspberry Pi 64 Bit Operating System with C++

I have undertaken many different projects through the years, but one area I have not really explored is Operating System development. When I started developing software on 8 bit computers, the closest you came to an OS was a ‘monitor’ – or perhaps a ‘Master Control Program’ for those old enough to remember Tron.

I have started tinkering with a 64 bit operating system for Raspberry Pi based computers. Given how powerful those small single board computers have become, they make a great platform for OS experimentation.

My goals for this project are four-fold:

  1. Get back to ‘bare metal basics’ for a while
  2. Provide a platform for experimentation with different approaches to OS architecture
  3. Explore the advantages and disadvantages of C++ for OS development
  4. Provide a collection of tutorials and working code for others to learn from

C++ for OS Development

There is a definite bias against C++ for bare metal programming, though increasingly there are bare metal projects utilizing C++. In the Raspberry Pi ecosystem, the Circle – C++ Bare Metal Environment for Raspberry Pi is perhaps the best and most useful example. It is a remarkable system.

Prior to C++11 I probably would not have considered this, but at C++20 and beyond the language is both rich enough and flexible enough to span from bare metal up to the highest application layer development. At the time of writing, this project is built with C++20.

Part of my goal is to create a single image which runs across multiple RPi versions and makes obvious the points at which board specific code is required. My approach is to create interfaces using classes and abstract virtual functions which, yes, adds a bit of overhead – but these days that overhead is minimal. The optimizations available in modern compilers and the increased clarity of C++ code may help close the performance gap between C and C++.

I am not particularly concerned about size at present. On systems with gigabytes of RAM, the difference between a 64k kernel and a 128k kernel is negligible. Honestly the kernel size is going to be much more tightly correlated to the OS architecture than the implementation language or optimizations. A monolithic kernel containing lots of services will be big whereas a microkernel with most services running in user space will be much smaller. These days though, I tend to favor speed over size.

New Revision of Cork Computational Geometry Library – runs on Linux !

I have had enough free time lately to return to Cork and have made a couple of key improvements to the build:

  • Builds and runs on Linux!
  • Generally 10% faster!
  • Moved to CMake build system
  • Script available to build 3rd Party dependencies
  • Updated to C++20
  • Updated to most recent Boost, TBB and MPIR libraries
  • Started vectorization with AVX2 SIMD instruction set
  • A few improvements to the regression test app
  • Added a few more unit tests
  • Faster OFF file output

Combined, these changes make for a much smoother ‘getting started’ experience. I will publish a Packer script that can be used to create an Ubuntu Mate 20.04 VM in VirtualBox or Proxmox for development.

The Github repository is here : https://github.com/stephanfr/Cork.git. At present I am working in the v0.9.0 branch.

I plan to move forward and bring the 3rd party dependencies up to date and build out more unit tests while working on performance improvements. I believe there are a number of places in the code that will benefit from AVX2 vectorization.

Serial and SIMD implementation of the Xoshiro256+ random number generator – Part 1 Implementation and Usage

The Xoshiro256PlusSIMD project provides a C++ implementation of the Xoshiro256+ random number generator that matches the performance of the reference C implementation of David Blackman and Sebastiano Vigna (https://prng.di.unimi.it/). Xoshiro256+ combines high speed, small memory requirements for stored state and excellent statistical quality. For cryptographic use cases, or use cases where the absolute best statistical quality is required, maybe consider a different RNG like the Mersenne Twister. For any other conventional simulation or testing use case, Xoshiro256+ should be perfectly fine statistically and better than a whole lot of slower alternatives.
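
For context, the scalar generation step from the authors’ public domain reference implementation is only a handful of ALU operations, which is where the speed comes from:

#include <cstdint>

//  Scalar xoshiro256+ step, adapted from the public domain reference code
//  at https://prng.di.unimi.it/ ; s holds the 256 bits of generator state.
static inline uint64_t rotl(const uint64_t x, int k) { return (x << k) | (x >> (64 - k)); }

uint64_t next(uint64_t s[4])
{
    const uint64_t result = s[0] + s[3];
    const uint64_t t = s[1] << 17;

    s[2] ^= s[0];
    s[3] ^= s[1];
    s[1] ^= s[2];
    s[0] ^= s[3];
    s[2] ^= t;
    s[3] = rotl(s[3], 45);

    return result;
}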

This implementation is a header-only library and provides the following capabilities:

  • Single 64 bit unsigned random value
  • Single 64 bit unsigned random value reduced to a [lower, upper) range
  • Four 64 bit unsigned random values
  • Four 64 bit unsigned random values reduced to a [lower, upper) range
  • Single double length real random value in a range of (0,1)
  • Single double length real random value in a (lower, upper) range
  • Four double length real random values in a range of (0,1)
  • Four double length real random values in a (lower, upper) range

Implementation Details

For platforms supporting the AVX2 instruction set, the RNG can be configured to use AVX2 instructions or not on an instance-by-instance basis. AVX2 instructions are only used for the four-wide operations; there is no advantage to using them for single value generation.

The four-wide operations use a different random seed per value, and the seed for single value generation is distinct as well. The same stream of values will be returned by the serial and AVX2 implementations. It might be faster for the serial implementation to use only a single seed across all four values – each increasing index being the next value in a single series, instead of each of the four values having its own unique series. The downside of that approach is that the serial implementation would return different four-wide values than the AVX2 implementation, which must use distinct seeds for each of the four values.

The random series for each of the four-wide values are separated by 2^192 values – i.e. a Xoshiro256+ ‘long jump’ separates the seed for each of the four values. For perspective, Xoshiro256+ has a state space of 2^256 values.

The reduction of the uint64 values to an integer range takes uint32 bounds. This is a significant reduction in the size of the random values, but it permits range reduction while avoiding a modulus operation; the modulus approach is slower than the approach in the code, which uses shifts and a multiply. If you need random integer values beyond uint32 sizes, I’d suggest taking the full 64 bit values and applying your own reduction algorithm.
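
A minimal sketch of that shift-and-multiply reduction – a technique commonly attributed to Daniel Lemire, and not necessarily the library’s exact code – appears below:

#include <cstdint>

//  Maps a random uint64 into [lower, upper) with one multiply and one
//  shift - no modulus. Only the high 32 bits of the random value are used,
//  as the low bits of xoshiro256+ output are the statistically weakest.
inline uint32_t reduce_to_range(uint64_t random_value, uint32_t lower, uint32_t upper)
{
    const uint32_t range = upper - lower;
    const auto high_bits = static_cast<uint32_t>(random_value >> 32);

    return lower + static_cast<uint32_t>((static_cast<uint64_t>(high_bits) * range) >> 32);
}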

Finally, the AVX versions are coded explicitly with AVX intrinsics; there is no reliance on the vagaries of compiler vectorization. The SIMD version could be written such that gcc should unroll loops and vectorize, but others have reported that it is necessary to tweak optimization flags to get the unrolling to work. For these implementations, all that is needed is the -mavx2 compiler option and the __AVX2_AVAILABLE__ symbol defined.

Usage

The class Xoshiro256Plus is a template class and takes an SIMDInstructionSet enumerated value as its only template parameter. SIMDInstructionSet may be ‘NONE’, ‘AVX’ or ‘AVX2’. The SIMD acceleration requires the AVX2 instruction set and uses ‘if constexpr’ to control code generation at compile time. There is also a preprocessor symbol __AVX2_AVAILABLE__ which must be defined to permit AVX2 instances of the RNG to be created. It is completely reasonable to have the AVX2 instruction set available but still use an RNG instance with no SIMD acceleration.

#define __AVX2_AVAILABLE__

#include "Xoshiro256Plus.h"

constexpr size_t NUM_SAMPLES = 1000;
constexpr uint64_t SEED = 1;

typedef SEFUtility::RNG::Xoshiro256Plus<SEFUtility::RNG::SIMDInstructionSet::NONE> Xoshiro256PlusSerial;
typedef SEFUtility::RNG::Xoshiro256Plus<SEFUtility::RNG::SIMDInstructionSet::AVX2> Xoshiro256PlusAVX2;

bool InsureFourWideRandomStreamsMatch()
{
    Xoshiro256PlusSerial serial_rng(SEED);
    Xoshiro256PlusAVX2 avx_rng(SEED);

    for (size_t i = 0; i < NUM_SAMPLES; i++)
    {
        auto next_four_serial = serial_rng.next4( 200, 300 );
        auto next_four_avx = avx_rng.next4( 200, 300 );

        if(( next_four_serial[0] != next_four_avx[0] ) ||
           ( next_four_serial[1] != next_four_avx[1] ) ||
           ( next_four_serial[2] != next_four_avx[2] ) ||
           ( next_four_serial[3] != next_four_avx[3] ))
        { return false; }
    }

    return true;
}

Struct Timespec Utilities

There are a number of different representations for time in C and C++. ‘struct timespec’ comes from POSIX and was standardized in C11 to provide a representation of times finer-grained than a simple integer count of seconds. A ‘struct timespec’ contains two fields, one for seconds and another for nanoseconds. Unlike the std::chrono classes, there are no literal operators or other supporting functions for timespec. C++17 added std::timespec, but it still lacks literals and other operators.
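
For reference, the definition is simply:

//  From <time.h>
struct timespec
{
    time_t tv_sec;   //  whole seconds
    long   tv_nsec;  //  nanoseconds, in the range [0, 999999999]
};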

Examples

The utilities can be found in my github repository:

https://github.com/stephanfr/TimespecUtilities

Including the ‘TimespecUtilities.hpp’ header file is all that is required. Literal suffix operators ‘_s’ for seconds and ‘_ms’ for milliseconds are defined as well as addition, subtraction and scalar multiplication operators. Examples follow:

const struct timespec five_seconds = 5_s;
const struct timespec one_and_one_half_seconds = 1.5_s;
const struct timespec five_hundred_milliseconds = 500_ms;

const struct timespec six_and_one_half_seconds = five_seconds + one_and_one_half_seconds;
const struct timespec three_and_one_half_seconds = five_seconds - one_and_one_half_seconds;

const struct timespec ten_seconds = five_seconds * 2;
const struct timespec fifty_milliseconds = five_hundred_milliseconds * 0.1;
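
Under the hood, a literal suffix operator takes only a few lines. The sketch below shows how ‘_s’ might be implemented; it is illustrative, not necessarily the library’s exact code:

#include <ctime>

//  The unsigned long long overload handles integral literals like 5_s,
//  the long double overload handles floating point literals like 1.5_s.
constexpr struct timespec operator""_s(unsigned long long seconds)
{
    return timespec{static_cast<time_t>(seconds), 0};
}

constexpr struct timespec operator""_s(long double seconds)
{
    const auto whole_seconds = static_cast<time_t>(seconds);
    const auto nanoseconds = static_cast<long>((seconds - whole_seconds) * 1000000000.0L);

    return timespec{whole_seconds, nanoseconds};
}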

HeapWatcher : Memory Leak Detector for Automated Testing


This project provides a simple tool for tracking heap allocations between start/finish points in C++ code. It is intended for use in unit tests and perhaps some feature tests. It is not a replacement for Valgrind or other memory debugging tools – the primary intent is to provide an easy-to-use tool that can be added to unit tests built with GoogleTest or Catch2 to find leaks and provide partial or full stack dumps of leaked allocations.

The project can be found in github at: https://github.com/stephanfr/HeapWatcher

Design

The C standard library functions malloc(), calloc(), realloc() and free() are ‘weak symbols’ in glibc and can be replaced by user-supplied functions with the same signatures, supplied in a user static library or shared object. This tool wraps the C standard library calls and tracks all allocations and frees in a map. The ‘book-keeping’ is performed in a separate thread to (1) limit the need for mutexes or critical sections to protect shared state and (2) limit the run-time performance impact on the code under test. The functions in HeapWatcher are not intrusive: they simply delegate to the glibc functions and then track allocations in a separate data structure. Allocation tracking can be paused in any thread being tracked, and there is a facility to capture stack traces for ‘intentional leaks’ and then ignore those for tracking purposes.
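
A minimal sketch of the wrapping technique – not HeapWatcher’s actual code – appears below; __libc_malloc() is glibc’s internal entry point for the real allocator:

#include <cstddef>

//  glibc's real allocator remains reachable under its internal name even
//  after the weak symbol malloc() is overridden by this strong definition.
extern "C" void* __libc_malloc(size_t size);

extern "C" void* malloc(size_t size)
{
    void* block = __libc_malloc(size);

    //  ... enqueue { block, size, stack trace } for the tracker thread ...

    return block;
}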

There exists a single global static instance of HeapWatcher which can be accessed with the SEFUtility::HeapWatcher::get_heap_watcher() function.

Additionally, there are a pair of multi-threaded test fixtures provided in the project. One fixture launches workload threads and requires the user to manage the heap watcher. The second test fixture integrates the heap watcher and tracks all allocations made while the instance of the fixture itself is in scope.

For memory intensive applications running on many cores, a single tracker thread may struggle to keep up. All allocation records go into a queue, so none will be lost and all will eventually be processed. Problems can arise only if the application allocates faster than the single thread can drain the queue, in which case the queue may grow to the point that it exhausts system memory. When HeapWatcher stops, the memory snapshot it returns is the result of processing all allocation records – so it should be correct.
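
Purely for illustration, the record pushed onto that queue might look something like the structure below; the field names are assumptions, not HeapWatcher’s actual types:

#include <array>
#include <cstddef>

//  Hypothetical allocation record; the wrapped allocator fills one of
//  these and hands it to the tracker thread, which updates the map.
struct AllocationRecord
{
    enum class Operation { MALLOC, REALLOC, FREE };

    Operation             operation;
    void*                 address;
    std::size_t           size;
    std::array<void*, 32> raw_stack;  //  raw return addresses, symbolized only if leaked
};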

Including into a Project

Probably the easiest way to use HeapWatcher is to include it through the FetchContent mechanism provided by CMake:

FetchContent_Declare(
    heapwatcher
    GIT_REPOSITORY "https://github.com/stephanfr/HeapWatcher.git" )

FetchContent_MakeAvailable(heapwatcher)

include_directories(
    ${heapwatcher_SOURCE_DIR}/include
    ${heapwatcher_BINARY_DIR}
)

The CMake specification for HeapWatcher will build the library, which must be linked into your project. In addition, for the call stack decoding to work properly, the following linker option must be included in your project as well:

SET(CMAKE_EXE_LINKER_FLAGS "${CMAKE_EXE_LINKER_FLAGS} -rdynamic")


HeapWatcher is not a header-only project; the linker must have concrete instances of malloc(), calloc(), realloc() and free() to link into the rest of the code under test. Given the ease of including the library with CMake, this doesn’t present much of a problem overall.

Using HeapWatcher


Only a single header file, HeapWatcher.hpp, must be included in any file wishing to use the tool; it contains all the data structures and classes needed. The HeapWatcher class itself is fairly simple and the call to retrieve the global instance is trivial :

namespace SEFUtility::HeapWatcher
{
    class HeapWatcher
    {
        public:
            virtual void start_watching() = 0;
            virtual HeapSnapshot stop_watching() = 0;

            [[nodiscard]] virtual PauseThreadWatchGuard pause_watching_this_thread() = 0;
            
            virtual uint64_t capture_known_leak(std::list<std::string>& leaking_symbols, std::function<void()> function_which_leaks) = 0;
            [[nodiscard]] virtual const KnownLeaks known_leaks() const = 0;

            [[nodiscard]] virtual const HeapSnapshot snapshot() = 0;
            [[nodiscard]] virtual const HighLevelStatistics high_level_stats() = 0;
    };

    HeapWatcher& get_heap_watcher();
}


Note the namespace declaration. There are a number of other classes declared in the HeapWatcher.hpp header for the HeapSnapshot and to provide the pause watching capability. A simple example of using HeapWatcher in a Catch2 test appears below:

void OneLeak() { int* new_int = static_cast<int*>(malloc(sizeof(int))); }

void OneLeakNested() { OneLeak(); }
   
TEST_CASE("Basic HeapWatcher Tests", "[basic]")
{
    SECTION("One Leak Nested", "[basic]")
    {
        SEFUtility::HeapWatcher::get_heap_watcher().start_watching();

        OneLeakNested();

        auto leaks(SEFUtility::HeapWatcher::get_heap_watcher().stop_watching());

        REQUIRE(leaks.open_allocations().size() == 1);

        REQUIRE_THAT(leaks.open_allocations()[0].stack_trace()[0].function(), Catch::Matchers::Equals("OneLeak()"));
        REQUIRE_THAT(leaks.open_allocations()[0].stack_trace()[1].function(),
                    Catch::Matchers::Equals("OneLeakNested()"));

        REQUIRE(leaks.high_level_statistics().number_of_mallocs() == 1);
        REQUIRE(leaks.high_level_statistics().number_of_frees() == 0);
        REQUIRE(leaks.high_level_statistics().number_of_reallocs() == 0);
        REQUIRE(leaks.high_level_statistics().bytes_allocated() == sizeof(int));
        REQUIRE(leaks.high_level_statistics().bytes_freed() == 0);
    }
}

Capturing Known Leaks

In various third party libraries there exist intentional leaks. A good example is the leak of a pointer for thread local storage for each thread created by the pthread library. There is a leak from the symbol ‘_dl_allocate_tls’ that appears to remain even after std::thread::join() is called; it appears not infrequently in Valgrind reports as well. Given the desire to make this a library for automated testing, there is the capability to capture and then ignore allocations from certain functions or methods. An example appears below:

SECTION("Known Leak", "[basic]")
{
    std::list<std::string> leaking_symbol({"KnownLeak()"});

    REQUIRE( SEFUtility::HeapWatcher::get_heap_watcher().capture_known_leak(leaking_symbol, []() { KnownLeak(); }) == 1 );

    REQUIRE(SEFUtility::HeapWatcher::get_heap_watcher().known_leaks().addresses().size() == 2);
    REQUIRE_THAT(SEFUtility::HeapWatcher::get_heap_watcher().known_leaks().symbols()[0].function(),
                 Catch::Matchers::Equals("_dl_allocate_tls"));
    REQUIRE_THAT(SEFUtility::HeapWatcher::get_heap_watcher().known_leaks().symbols()[1].function(),
                 Catch::Matchers::Equals("KnownLeak()"));

    SEFUtility::HeapWatcher::get_heap_watcher().start_watching();

    OneLeakNested();
    KnownLeak();
    OneLeak();

    auto leaks(SEFUtility::HeapWatcher::get_heap_watcher().stop_watching());

    REQUIRE(leaks.open_allocations().size() == 2);
}

The capture_known_leak() method takes two arguments: 1) a std::list<std::string> containing one or more symbols which, if located in a stack trace, will cause the allocation associated with the trace to be ignored and 2) a function (or lambda) which will trigger one or more leaks associated with the symbols passed in the first argument. The leaking function need not be immediately adjacent to the malloc; it may be further up the call stack, but the allocation will only be ignored if the symbol appears the same number of frames above the memory allocation as it did when the leak was captured.

This approach of actively capturing the leak at runtime is effective for dealing with ASLR (Address Space Layout Randomization) and does not require loading of shared libraries or other linking or loading gymnastics.

Pausing Allocation Tracking


The PauseThreadWatchGuard instance returned by a call to HeapWatcher::pause_watching_this_thread() is a scope-based mechanism for suspending heap activity tracking in a thread. For example, the above snippet can be modified to not log the leak in OneLeakNested() by obtaining a guard and putting the leaking call into the same scope as the guard:

    SEFUtility::HeapWatcher::get_heap_watcher().start_watching();

    {
      auto pause_watching = SEFUtility::HeapWatcher::get_heap_watcher().pause_watching_this_thread();

      OneLeakNested();
    }

    auto leaks(SEFUtility::HeapWatcher::get_heap_watcher().stop_watching());

    REQUIRE(leaks.open_allocations().size() == 0);

Once the guard instance goes out of scope, HeapWatcher will again start tracking allocations in the thread.

Test Fixtures

Two test fixtures are included with HeapWatcher; both are intended to ease the creation of multi-threaded unit test cases, which are useful for detecting race conditions or deadlocks. The test fixtures allow the user to add functions or lambdas as ‘workload functions’ and then start all of those workload functions simultaneously. Alternatively, workload functions may be given a random start delay in seconds (as a double, so fractions of a second work as well). This permits stress testing with a lot of load started at one time, or allows load to ramp up over time.

The SEFUtility::HeapWatcher::ScopedMultithreadedTestFixture class starts watching the heap on creation and takes a function or lambda which will be called with a HeapSnapshot when all threads have completed, permitting testing of the final heap state. This test fixture effectively hides the HeapWatcher calls, whereas the SEFUtility::HeapWatcher::MultithreadedTestFixture class requires the user to wrap the test fixture with the HeapWatcher start and stop calls.

Examples of both test fixtures appear below. First is an example of MultithreadedTestFixture :

    SECTION("Torture Test, One Leak", "[basic]")
    {
        constexpr int64_t num_operations = 2000000;
        constexpr int NUM_WORKERS = 20;

        SEFUtility::HeapWatcher::MultithreadedTestFixture test_fixture;

        SEFUtility::HeapWatcher::get_heap_watcher().start_watching();

        test_fixture.add_workload(NUM_WORKERS,
                                  std::bind(&RandomHeapOperations, num_operations));  //  NOLINT(modernize-avoid-bind)
        test_fixture.add_workload(1, &OneLeak);

        std::this_thread::sleep_for(10s);

        test_fixture.start_workload();
        test_fixture.wait_for_completion();

        auto leaks = SEFUtility::HeapWatcher::get_heap_watcher().stop_watching();

        REQUIRE(leaks.open_allocations().size() == 1);
    }

An example of ScopedMultithreadedTestFixture follows :

    SECTION("Two Workloads, Few Threads, one Leak", "[basic]")
    {
        constexpr int NUM_WORKERS = 5;

        SEFUtility::HeapWatcher::ScopedMultithreadedTestFixture test_fixture(
            [](const SEFUtility::HeapWatcher::HeapSnapshot& snapshot) { REQUIRE(snapshot.number_of_leaks() == 5); });

        test_fixture.add_workload(NUM_WORKERS, &BuildBigMap);
        test_fixture.add_workload(NUM_WORKERS, &OneLeak);

        std::this_thread::sleep_for(1s);

        test_fixture.start_workload();
    }

Conclusion

HeapWatcher and the multithreaded test fixture classes are intended to help developers create tests which check for memory leaks either in simple procedural test cases written with GoogleTest or Catch2 or in more complex multi-threaded tests in those same base frameworks.

https://github.com/stephanfr/HeapWatcher