Emerald OS

This is an operating system that I'm building in Rust for fun and learning.

Please check it out, and if you have any questions, feel free to ask, open an issue, fix a bug, etc...

Running

If you don't want to build the project, you can download the latest artifacts from:

You get an ISO file containing the kernel, and a compressed filesystem directory containing the userspace programs.

The command below is what we normally use to run the OS, but it can be run on any VM with some setup.

qemu-system-x86_64 -cdrom <kernel.iso> -serial mon:stdio -m 512 -boot d -drive format=raw,file=fat:rw:<filesystem>

where <kernel.iso> is the path to the ISO file, and <filesystem> is the path to the decompressed filesystem directory.

Some extra info:

  • -serial mon:stdio is used to redirect the serial output to the terminal.
  • -m 512 is the amount of memory to allocate for the VM, 512MB.
  • -boot d is to boot from the CD-ROM we just loaded.
  • -drive format=raw,file=fat:rw:<filesystem> is to pass the filesystem directory to the kernel as a disk.

Here we use a QEMU feature, virtual FAT, where it treats the directory as a FAT filesystem and passes it to the kernel as a disk.

Building

The whole build and packaging process is done by xtask.

The ISO file can be used to run on other VMs/hardware (not tested).

For building the ISO image, you can use make, but you need to have these other dependencies installed to build and run the ISO:

xorriso mtools grub-pc-bin qemu-system-x86

Build kernel iso:

cargo xtask build-iso

Building userspace programs

This builds the userspace programs into the filesystem directory (used by QEMU).

The userspace programs are built using a custom Rust toolchain (see more info here).

Anyway, there are two options to build our userspace programs, and in general any other program.

Using the prebuilt toolchain

We distribute a prebuilt toolchain in:

bash tools/install_toolchain_and_link.sh <path_to_toolchain.zip>

This will install the toolchain into extern/toolchain and link it to rustup as emerald.

Then, xtask will use the installed toolchain to build userspace programs; if it's not installed, it will give an error.

cargo xtask userspace build

Building the toolchain

We don't build the toolchain automatically; if you don't have it, and you don't want to install the prebuilt one, you can build the toolchain yourself from source:

cargo xtask toolchain

If you want to build and install from source into the extern/toolchain directory:

cargo xtask toolchain --install

You can then use rustup toolchain link ... to link to this folder.

Building and running

To build and run kernel and userspace programs:

cargo xtask run

You need to have qemu-system-x86_64 installed.

Debugging

You can use gdb or lldb to debug this.

But I have included vscode configs to enable easy debugging with the CodeLLDB extension.

And to boot QEMU in debug mode you can use the following (it will wait for a debugger on port :1234):

cargo xtask run --gdb

Kernel

Link to the Kernel rust documentation

Here we explain the kernel and its components in detail.

From Boot, to drivers, filesystem, processes, memory management, and more...

Boot

What happens during boot?

This OS doesn't include its own bootloader; by default the kernel is loaded by GRUB using multiboot2.

The kernel crate will be compiled as an ELF file implementing the multiboot2 specification which can be loaded by any bootloader that supports it.

For example, in grub we use the multiboot2 command to load the kernel.

menuentry "Kernel" {
    insmod all_video
    multiboot2 /boot/kernel # the kernel ELF file
    boot
}

The kernel ELF file is in elf64-x86-64 format, but we start in x86 protected mode (32-bit) and then switch to long mode (64-bit) using assembly code in boot.S.

Initial boot, protected mode.

We start in assembly: since we are using Rust and target x86_64 for compilation, we have to write the 32-bit code manually in assembly.

The boot.S file is included in main.rs using the global_asm! macro, and we use the .code32 directive to generate 32-bit code.

The initial boot here in the assembly performs the following:

  • Setup initial page tables (required for 64-bit mode).
  • Setup basic GDT (Global Descriptor Table) (required for 64-bit mode).
  • Switch to 64-bit long mode.
  • Jump to kernel_main entry point in main.rs.

The kernel ELF file is loaded at 0x100000 physically, and 0xFFFFFFFF80100000 virtually. The virtual address 0xFFFFFFFF80000000 will map to the physical address 0x0 initially.

Note: you may notice in the code that we use - 0xFFFFFFFF80000000 a lot, such as lgdt [gdtr64 - 0xFFFFFFFF80000000]. And this is because we don't have virtual memory yet, so we are operating with physical memory currently, and the linker will use 0xFFFFFFFF80000000 as the base address. But since we know it maps to 0x0 physically, we can subtract to convert to physical addresses.

Multiboot2 header

We specify some information to the bootloader that we want in the multiboot2 header:

  • Address: Here we specify load address information, which includes:
    • start and end of the kernel image.
    • bss_end which is the end of the .bss section.
  • Entry point: The entry point of the kernel, which is entry function in boot.S,
    i.e. where execution will start. This is executed in 32-bit mode.
  • Module alignment: Modules will be page aligned. (Maybe not needed, as I thought it affected the kernel alignment, but it looks like it's for modules.)

After that, execution will start at entry, where we check that the value in EAX is equal to the special multiboot2 magic value just to make sure we are correct.

Switching to long mode

Then, we check that long mode is supported in the machine by making sure that PAE feature is supported in CPUID:0000_0001 and LM feature is supported in CPUID:8000_0001.

If it's not supported, we display an error and loop forever.

# check for PAE
mov eax, 0x00000001
cpuid
test edx, CPUID_FEAT_EDX_PAE # (1 << 6)
jz not_64bit
# check for long mode
mov eax, 0x80000001
cpuid
test edx, CPUID_FEAT_EX_EDX_LM # (1 << 29)
jz not_64bit

If all is good, we set up some basic page tables, as follows.

Initial page tables

We map the first 128MB of physical memory into two ranges.

  • [0x0000000000000000..0x0000000007FFFFFF], a 1:1 mapping, which gives us easy access and is required when we switch to 64-bit mode.
  • [0xFFFFFFFF80000000..0xFFFFFFFF87FFFFFF], where the Rust kernel is loaded according to the ELF file; all its referenced addresses are in this range.

The structure of the page table is as follows:

- [0x0000000000000000..0x0000000007FFFFFF] - 1:1 mapping with the physical pages
    - This will be:
        - PML4[0]
        - PDPT[0]
        - PDT[0..63] # 2MB each
- [0xFFFFFFFF80000000..0xFFFFFFFF87FFFFFF] - the kernel ELF file virtual address space
    - This will be:
        - PML4[511]
        - PDPT[510]
        - PDT[0..63] # 2MB each (shared with the above)

The initial page tables are stored at boot_page_tables, a region of memory in the .boot_page_tables section (inside the .data section) that fits 4 page tables, each 4KB in size.

The usage of boot_page_tables is as follows:

  • PML4 (boot_page_tables[0])
  • PML4[0] -> PDPT-A (boot_page_tables[1])
  • PML4[511] -> PDPT-B (boot_page_tables[2])
  • PDPT-A[0] -> PDT (boot_page_tables[3])
  • PDPT-B[510] -> PDT (boot_page_tables[3]) // same PDT as above

And then PDT[0..63] is shared between the two and maps the first 128MB of physical memory.
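As a side note, here is a small Rust sketch (not kernel code) showing how the PML4/PDPT/PDT indices above fall out of a virtual address; each level consumes 9 bits:

fn pt_indices(vaddr: u64) -> (usize, usize, usize) {
    let pml4 = ((vaddr >> 39) & 0x1FF) as usize; // bits 39..47
    let pdpt = ((vaddr >> 30) & 0x1FF) as usize; // bits 30..38
    let pdt = ((vaddr >> 21) & 0x1FF) as usize;  // bits 21..29
    (pml4, pdpt, pdt)
}

// pt_indices(0x0000_0000_0000_0000) == (0, 0, 0)
// pt_indices(0xFFFF_FFFF_8000_0000) == (511, 510, 0)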

Then, CR3 is set to the address of PML4 and CR4 is set to enable PAE.

GDT (Global Descriptor Table)

We set up a very basic GDT that contains kernel code and data segments (even though the data segment is probably not needed).

.align 16
gdt64:
    .quad 0x0000000000000000    # null descriptor
    # Code segment (0x8)
    .long 0x00000000                                       # Limit & Base (low, bits 0-15)
    .byte 0                                                # Base (mid, bits 16-23)
    .byte GDT_CODE | GDT_NOT_SYS | GDT_PRESENT | GDT_WRITE # Access
    .byte GDT_LONG_MODE                                    # Flags & Limit (high, bits 16-19)
    .byte 0x00                                             # Base (high, bits 24-31)
    # Data segment (0x10)
    .long 0x00000000                                       # Limit & Base (low, bits 0-15)
    .byte 0                                                # Base (mid, bits 16-23)
    .byte GDT_NOT_SYS | GDT_PRESENT | GDT_WRITE            # Access
    .byte 0x00                                             # Flags & Limit (high, bits 16-19)
    .byte 0x00                                             # Base (high, bits 24-31)

I prefer it this way, but others just use static quad values. This won't change at all, so either is fine.

Then, we load the GDT using lgdt instruction.

lgdt [gdtr64 - 0xFFFFFFFF80000000]

Switch to long mode

We need to do the following:

  • Setup page tables.
  • Enable PAE in CR4.
  • Enable PG (Paging) and PE (Protection Enable) in CR0.
  • Enable long mode in EFER MSR.
    mov ecx, EFER_REG # (0xC0000080)
    rdmsr
    or  eax, EFER_LME # (1 << 8)
    wrmsr
    
  • Setup GDT.
  • Jump to kernel_main in main.rs.

But before we go to kernel_main directly, we jump to a location in boot.S (kernel_main_low), where we set up the data segment and the stack, and forward the multiboot2 information to kernel_main.

The stack used is 512 pages, i.e. 2MB in size, and is located in the .stack section which is inside the .data section.

The stack is a bit large; we don't need that much stack most of the time, but it's needed like this due to our recursive AML (ACPI) parser, which we should improve.

We also setup a guard page for the stack that is unmapped later using the more advanced virtual memory setup.

Then, we jump to kernel_main in main.rs, and it's all Rust from here.

CPU state in rust

When we jump into rust, here is the state of the CPU (do note, that the kernel may change the state later on):

  • 64-bit mode
  • Basic GDT setup with only 2 segments (kernel code and data) (see above GDT)
  • Empty IDT setup (i.e. exceptions will trigger triple faults)
  • interrupts are disabled
  • cr3 is set to the .boot_page_tables (which are temporary page tables)
  • cr0 = CR0_PG | CR0_PE | CR0_MP
  • cr4 = CR4_PAE | CR4_OSFXSR | CR4_OSXMMEXCPT
  • EFER = EFER_LME | (EFER_LMA would be set by the CPU, indicating that long mode is active)
  • The stack is setup at the end of the .stack section
  • The multiboot info is passed in rdi (which is the same as ebx since we haven't touched it)
  • rax is set to the kernel_main and then jumped to
  • the rest of the registers are arbitrary, so make sure kernel_main only takes one argument (rdi).

Memory

Explaining the memory management in the kernel.

We have several tools to manage memory in the kernel, each with its own purpose and goal.

First, let's look at the memory layout of the kernel, to know where everything is located.

Then, we can look at the physical allocator, which provides us with raw memory from the hardware; we can then use the virtual mapper to map it into virtual memory in order to access it.

The heap in the kernel is implemented by the heap allocator, and the PageAllocator allocates whole pages using the virtual mapper.

Lastly, we have the virtual space, which is a very useful tool for getting a virtual address for components that are not part of the kernel directly but whose physical location we know. This includes memory-mapped IO, ACPI structures, Multiboot structures, etc.

Memory pointer types

Another thing you might notice in the code is that we use u64 sometimes and usize other times. Here is what they mean:

  • u64: This is a 64-bit unsigned integer, and it will be 64 bits no matter the platform. It is used to represent physical addresses. With PAE enabled, even a 32-bit x86 CPU can address physical memory wider than 32 bits (mapped into a 32-bit virtual address space), so physical addresses can exceed the pointer size.
  • usize: This is a pointer-sized unsigned integer; it will be 32 or 64 bits depending on the platform. It is used to represent virtual addresses, since it is the same size as a pointer on the platform.
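A tiny sketch of this convention (the alias names are illustrative, not from the kernel):

type PhysicalAddress = u64;  // may be wider than the pointer size
type VirtualAddress = usize; // always pointer-sized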

On a tangent: for filesystem operations we only use u64, as drives can easily have more than 4GB of space without the CPU needing to be 64-bit.

Memory layout

The memory layout constants and code are defined in memory_layout.

This is the structure of the memory for any process running in the system.

Layout

0000_0000_0000_0000 .. FFFF_FF7F_FFFF_FFFF - User   (15.99~ EB)
FFFF_FF80_0000_0000 .. FFFF_FFFF_7FFF_FFFF - Process specific kernel (510 GB)
FFFF_FFFF_8000_0000 .. FFFF_FFFF_FFFF_FFFF - Kernel (2 GB)

Kernel layout

FFFF_FFFF_8000_0000..FFFF_FFFF_8010_0000          nothing
FFFF_FFFF_8010_0000..FFFF_FFFF_8XXX_XXXX          kernel elf (text, rodata, data, bss)
FFFF_FFFF_8XXX_XXXX..FFFF_FFFF_8800_0000          (kernel end) physical allocator low (until 128MB mark pre-mapped in `boot`)
FFFF_FFFF_8800_0000..FFFF_FFFF_8900_0000          kernel heap (16MB)
FFFF_FFFF_8900_0000..FFFF_FFFF_890E_7000          interrupt stacks
    FFFF_FFFF_8900_0000..FFFF_FFFF_8900_1000      interrupt stack 0 guard page (4KB) *not mapped by purpose*
    FFFF_FFFF_8900_1000..FFFF_FFFF_8902_1000      interrupt stack 0 (32 * 4KB = 128KB)
    ... *repeat for 6 more stacks*
FFFF_FFFF_890E_7000..FFFF_FFFF_FFFF_F000          Kernel extra (virtual space, free virtual space to use)

The kernel is loaded by the bootloader at physical address 0x10_0000, and then it will perform virtual mapping for physical 0x0 into 0xFFFF_FFFF_8000_0000 for 128MB i.e. until the end of the initial physical page allocator. See more details in the boot chapter.

Look at virtual space for more info on it.

Virtual space

Virtual space is a term I'm using (not sure what other OSes call it) for a mechanism that solves the issue of "I have a physical address of an object, but I don't have a virtual address to map it to". This is useful for reading structures that live at specific locations in physical memory, such as ACPI tables, PCI configuration space, memory-mapped IO, etc.

It's very simple: it takes memory from the kernel extra space and maps it to the physical address.

Process specific kernel layout

FFFF_FF80_0000_0000..FFFF_FF80_0000_1000          process kernel stack guard page (4KB) *not mapped by purpose*
FFFF_FF80_0000_1000..FFFF_FF80_0004_1000          process kernel stack (64 * 4KB = 256KB)

This is a space specific to each process, but it resides in kernel space.

The idea is to keep process-specific structures here, which other processes won't have access to, reducing the need to set up locks around them.

We currently use it for the kernel stack, which is the stack the kernel uses when an interrupt happens while we are in user space.

It solves the issue that a single kernel stack for all processes can't work: if a second process gets interrupted while the first one is still in the kernel, the second one would overwrite the first one's stack.

As you might expect, the previous paragraph describes the source of a crazy bug that led to this kernel stack. Fixed in 0dc04f8.

User layout

Not much to talk about here, as this depends on the process itself and where the ELF file is loaded; currently we load at the address specified in the ELF file. This is of course not safe, as we don't do extra checks on that value.

But anyway, the other parts of the userspace are as follows:

XXXX_XXXX_XXXX_XXXX .. YYYY_YYYY_YYYY_YYYY - ELF file
YYYY_YYYY_YYYY_YYYY .. ZZZZ_ZZZZ_ZZZZ_ZZZZ - Heap. From the end of the ELF and grows up
ZZZZ_ZZZZ_ZZZZ_ZZZZ .. FFFF_FF7F_FFFF_D000 - Stack. From the top and grows down
FFFF_FF7F_FFFF_D000 .. FFFF_FF7F_FFFF_E000 - Stack guard page. *not mapped, just for reference*
FFFF_FF7F_FFFF_E000 .. FFFF_FF7F_FFFF_F000 - Process Metadata structure

A lot of symbols XD. But in general: the stack is at the top of the user space, the ELF file is at the bottom, and the heap is in the middle, starting after the ELF file.

Physical Allocator

This is implemented in physical_page_allocator

This provides physical memory allocation and deallocation.

Currently it is very basic: it allocates in 4KB pages, and only 1 page at a time. This is due to our design.

Current design

The current design is built around a linked list: each free page points to the next free page. The reason we can't allocate more pages at once is how the list is created: we create it by freeing all the pages, and freeing means adding the page to the list.

The issue is that we add them one by one, start to finish, and when we allocate, we take the last one from the list. Each page is disconnected from the rest, so it doesn't know whether the next free page is immediately after it in memory or not.
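Here is a minimal sketch (not the kernel's actual code) of this free-list design, where each free 4KB page stores a pointer to the next free page:

struct FreePage {
    next: *mut FreePage,
}

pub struct PhysicalPageAllocator {
    head: *mut FreePage,
}

impl PhysicalPageAllocator {
    /// Push a 4KB page onto the free list ("freeing" it).
    pub unsafe fn free(&mut self, page: *mut u8) {
        let page = page as *mut FreePage;
        (*page).next = self.head;
        self.head = page;
    }

    /// Pop one page off the list. We can't tell whether the next free
    /// page is physically contiguous, hence only 1 page at a time.
    pub unsafe fn alloc(&mut self) -> Option<*mut u8> {
        if self.head.is_null() {
            return None;
        }
        let page = self.head;
        self.head = (*page).next;
        Some(page as *mut u8)
    }
}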

This could be solved by having a more complex allocator, like what we have in the heap allocator, but I want to use another design that is fast, since that one is slow.

Another issue is that we only have 128MB of memory to allocate from, and we can't allocate more than that.

This is not a design issue; rather, the physical page allocator relies on the mapping we set up during boot, where the first 128MB of memory is mapped directly into kernel space. See boot and memory layout for more details.

Design issues to fix

  • Can only allocate 1 page at a time
  • Only has 128MB of memory to allocate from

Ideas

Linux uses a tree-like structure, where each node in the tree is a page and the children are sub-pages; this allows easy allocation of power-of-two multiples of 4KB pages, and also easy coalescing of pages.

I'm thinking of doing something like that.

For the 128MB limit, I'm thinking of having another allocator that won't have direct access to the memory (i.e. it is not mapped); we will store metadata about those pages in the heap, which we already have access to. I.e. it will be a "late" physical page allocator.

Virtual Mapper

This is implemented in virtual_memory_mapper

This is where we map physical memory to virtual memory, and where virtual memory pages are managed after boot.

The main features are mapping and unmapping physical memory to virtual memory, and allocating and freeing virtual memory pages.

The map function takes 1 argument, VirtualMemoryMapEntry, which is a struct containing information about the mapping:

pub struct VirtualMemoryMapEntry {
    pub virtual_address: usize,
    pub physical_address: Option<u64>,
    pub size: usize,
    pub flags: u64,
}
  • virtual_address is the virtual address to map to, and must be known of course.
  • physical_address is the physical address to map from; if it's None, we allocate new memory from the physical allocator.
  • size is the size of the mapping, and must be 4K aligned.
  • flags are the flags of the mapping, such as Writable or UserAccessible. For now, these are just constants mapping directly to x86 page table flags.
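As an illustration only (the flag constant and the mapper handle here are made up, not the kernel's real names), mapping a freshly allocated page might look like:

let entry = VirtualMemoryMapEntry {
    virtual_address: 0xFFFF_FFFF_8900_0000, // must be known
    physical_address: None,                 // None => allocate from the physical allocator
    size: 4096,                             // must be 4K aligned
    flags: flags::PTE_WRITABLE,             // hypothetical constant name
};
mapper.map(&entry);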

The unmap function takes the same struct, but physical_address must be None and virtual_address must be known. It also takes is_allocated, a boolean indicating whether the memory was allocated by the virtual memory mapper; if it was, it will be freed.

This API can be improved, but currently we don't keep track of the mappings, so we rely on the caller to do that for us :D.

Besides that, we have other functionality used by processes, like:

  • switch_to_this: switches to this VirtualMemoryMapper; each process has its own VirtualMemoryMapper.
  • get_current_vm: gets the current VirtualMemoryMapper of the current process.
  • clone_current_vm_as_user: clones the kernel mappings of the current vm and marks the clone as a user vm, so it doesn't allow kernel mappings anymore.


Virtual space

This is implemented in virtual_space

Virtual space is a term I'm using (not sure what other OSes call it) for a mechanism that solves the issue of "I have a physical address of an object, but I don't have a virtual address to map it to". This is useful for reading structures that live at specific locations in physical memory, such as ACPI tables, PCI configuration space, memory-mapped IO, etc.

It's very simple: it takes memory from the kernel extra space and maps it to the physical address.

It can be used via VirtualSpace, which is similar to Box, i.e. it's a wrapper for a pointer, and it will automatically unmap the memory when it goes out of scope.

let mut vs = unsafe { VirtualSpace::<u32>::new(0x1000).unwrap() };
*vs = 0x1234;
assert_eq!(*vs, 0x1234);

Drivers

This will cover all drivers that interact with hardware.

We may call them drivers or devices; for now at least, we treat them as the same thing.

Some of these devices and virtual devices may be available as devices in the filesystem.

IDE

This is implemented in ide_device

IDE devices are the only type of PCI device we support for now. The PCI device type is MassStorageController(0x01, ..).

We support both ATA and ATAPI devices. ATA devices are hard drives and ATAPI devices are CD/DVD drives.

It's basic support, without DMA or async.

We perform reads with read_sync.

Keyboard

This is implemented in keyboard

The keyboard driver is simple and uses the legacy PS/2 interface at ports 0x60 and 0x64; it's implemented alongside the mouse driver in the same file.

The driver broadcasts events to all listeners using blinkcast. These listeners are mostly processes reading from the /devices/keyboard file (see keyboard reader).

pub struct Key {
    pub pressed: bool,
    // the state of the modifiers at the time of the fetch
    pub modifiers: u8,
    pub key_type: KeyType,
}

Where KeyType is an enum containing all keys from a US mapping.

The keyboard user can then use this as the origin, and map it to any other key depending on the layout they want.

Currently, we use the US layout to get the character of a key using the function Key::virtual_key (used in the kernel and userspace).

The modifiers field is a bitflags from modifier, so use these constants to check if a specific modifier is on.

There are 2 types of modifiers:

  • Held modifiers: SHIFT, CTRL, ALT
  • Toggled modifiers: CAPSLOCK, NUMLOCK, SCROLLLOCK
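For illustration (the bit value here is hypothetical; use the real constants from the modifier module), checking a held modifier on a Key looks like:

const SHIFT: u8 = 1 << 0; // hypothetical bit position

fn shift_held(key: &Key) -> bool {
    key.modifiers & SHIFT != 0
}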

Keyboard reader

The keyboard driver provides a way to get a blinkcast reader using get_keyboard_reader, where the user can read keyboard events without blocking, anytime they want.

The console and userspace processes use this reader to read keyboard events.

For userspace processes, they can read the keyboard events through the virtual device at /devices/keyboard.

A process can open a file descriptor to this device and read from it to get keyboard events.

The file descriptor will hold a blinkcast reader to the keyboard driver, then each process can read events without blocking.

The user can open the file and read its content, but since we perform some encoding, it's better to use the emerald_runtime library, which provides an easy way to read the events.

Example:

use emerald_runtime::keyboard::Keyboard;

let mut keyboard = Keyboard::new();

if let Some(key) = keyboard.get_key_event() {
    println!("Key: {:?}", key);
}
// or
for key in keyboard.iter_keys() {
    println!("Key: {:?}", key);
}

Mouse

This is implemented in mouse

The mouse driver is simple and uses the legacy PS/2 interface at ports 0x60 and 0x64; it's implemented alongside the keyboard driver in the same file.

The driver broadcasts events to all listeners using blinkcast. These listeners are mostly processes reading from the /devices/mouse file (see mouse reader).

pub enum ScrollType {
    None = 0,
    VerticalUp = 1,
    VerticalDown = 2,
    HorizontalRight = 3,
    HorizontalNegative = 4,
}

pub struct MouseEvent {
    pub x: i16,
    pub y: i16,
    pub scroll_type: ScrollType,
    pub buttons: u8,
}

The buttons field is a bitflags from buttons, so use these constants to check whether a button is pressed.

Note that this is the current state of the mouse, so you must keep the old state to know if a button was just pressed or released.

The buttons are:

  • LEFT: 0b0000_0001
  • RIGHT: 0b0000_0010
  • MIDDLE: 0b0000_0100
  • FORTH: 0b0000_1000
  • FIFTH: 0b0001_0000
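A small sketch of the edge detection the note above implies, comparing the previous and current buttons state (LEFT comes from the list above):

const LEFT: u8 = 0b0000_0001;

fn left_just_pressed(prev: u8, cur: u8) -> bool {
    (prev & LEFT == 0) && (cur & LEFT != 0)
}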

Mouse reader

The mouse driver provides a way to get a blinkcast reader using get_mouse_reader, where the user can read mouse events without blocking, anytime they want.

Userspace processes can read the mouse events through the virtual device at /devices/mouse.

A process can open a file descriptor to this device and read from it to get mouse events.

The file descriptor will hold a blinkcast reader to the mouse driver, then each process can read events without blocking.

The user can open the file and read its content, but since we perform some encoding, it's better to use the emerald_runtime library, which provides an easy way to read the events.

Example:

use emerald_runtime::mouse::Mouse;

let mut mouse = Mouse::new();

if let Some(event) = mouse.get_event() {
    println!("Event: {:?}", event);
}
// or
for event in mouse.iter_events() {
    println!("Event: {:?}", event);
}

UART

This is implemented in uart

A very basic UART driver; it connects to port 0x3F8 (COM1) and can be read from and written to.

It is used by the console to print and read characters.

Virtual devices

These are "devices" that we interact with from other parts of the kernel, but may not be actually available in hardware. They provide a sort of abstraction layer above other drivers/devices and maybe other components.

One of the examples is the console, which is a virtual device that we use to print and read characters from the screen; it uses the keyboard and uart drivers to do so.

Some of these devices may be available as devices in the filesystem.

Console

This is implemented in console

The console is a virtual device that we use to print and read characters from the screen, it will use the keyboard and uart drivers to do so.

This is called by print! and println! macros.

We have 2 consoles for now. I don't like the current design and would like to change it in the future.

EarlyConsole

This is a console object that is statically initialized, can only write, and doesn't have access to the keyboard.

LateConsole

This is the main console that is initialized later; it can read and write and has access to the keyboard.

The main purpose of this is to add it to the /devices directory, where it acts as a kernel device, so we can use it from userspace.

The design can be improved. The issue is that LateConsole is inside an Arc<Mutex<>> (so it can be used as a device) while EarlyConsole is static; there are several differences between them, so there is a lot of code duplication, and I would like to improve it somehow.

Pipe

This is implemented in pipe

A pipe is a virtual device that allows two processes to communicate with each other. It is a unidirectional, first-in-first-out (FIFO) communication channel used to pass data from one process to another. It is similar to a file, but it is stored in memory rather than on disk.

It acts as a special file, i.e. the process just writes to it as a normal file.

It is created with create_pipe_pair, which returns 2 File objects, one for reading and one for writing. The kernel then assigns those to the process.
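A hedged sketch of what that looks like from the kernel side (the exact signature may differ from the real create_pipe_pair):

// returns (read_end, write_end); both wrap the same in-memory FIFO
let (read_file, write_file) = create_pipe_pair();
// the kernel then places these File objects in the process's
// open-file table, e.g. when handling the create_pipe syscall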

Internally, the Pipe is a dyn Device, so it's stored in the INode as a device. See filesystem for more details on INode.

Filesystem

This is implemented in filesystem

In this kernel, the filesystem is implemented by several layers, and components.

When you try to open a file path, the filesystem will look for the best entity that contains this path, and this is achieved by the mapping system.

Check FAT for more information about the FAT filesystem specifically.

Then, when you open a file, you can specify several flags (implemented in OpenOptions):

  • read - Open the file for reading.
  • write - Open the file for writing.
  • create - Create the file if it doesn't exist.
  • create_new - Fail if the file exists.
  • truncate - Truncate the file if it exists.
  • append (implicit write) - Append to the file if it exists.

With these, you can create new files and choose which mode to open them with. Of course the filesystem may refuse to create the file if the operation is not supported, such as with /devices directory mappings.

Mapping

The mapping is a dictionary that maps a path prefix to a Filesystem manager. For example, we currently have the following mappings:

'/' -> FAT (filesystem backed by disk)
'/devices' -> Devices (a virtual filesystem)

When you open a path, it will find the best mapping, i.e. the longest prefix that matches the path, and then use the resulting Filesystem manager to open the file.

For example, if you open the path /devices/console, it will use the Devices filesystem manager to open the file /console.
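A minimal sketch of that longest-prefix lookup (the map type and the string manager names are simplified stand-ins, not the kernel's types):

use std::collections::BTreeMap;

/// Returns the matching manager and the remaining path to pass to it.
fn best_mapping<'a>(
    mappings: &'a BTreeMap<&'a str, &'a str>, // prefix -> manager
    path: &str,
) -> Option<(&'a str, String)> {
    mappings
        .iter()
        .filter(|(prefix, _)| path.starts_with(*prefix))
        .max_by_key(|(prefix, _)| prefix.len())
        .map(|(prefix, manager)| {
            // strip the matched prefix; the manager sees the rest
            let rest = path[prefix.len()..].trim_start_matches('/');
            (*manager, format!("/{rest}"))
        })
}

// For "/devices/console" the longest matching prefix is "/devices",
// yielding ("Devices", "/console"); plain "/file" falls back to "/".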

Filesystem trait

The "manager" here is an implementor of the Filesystem trait, which is a simple interface that controls all the filesystem operations.

Operations supported are:

  • open_root - Open the root directory; this is the entry point when traversing the filesystem.
  • read_dir - Read the directory entries from a DirectoryNode.
  • create_node - Create a new file or directory inside a DirectoryNode.
  • read_file - Read the file contents from a FileNode.
  • write_file - Write the file contents to a FileNode.
  • close_file - Send a message that we are closing the file. If you notice, we don't have open_file; instead, the user can traverse the filesystem with open_root and read_dir until the file node is found, then use it directly. This function alerts the filesystem to clean up any resources it might have allocated for this file.
  • set_file_size - Set the file size to a custom value; this is similar to truncate in Unix systems. write_file will increase the file size if needed.
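A rough sketch of the trait shape these operations imply (the kernel's real signatures differ; the types here are simplified stand-ins):

struct FileNode;
struct DirectoryNode;
enum Node { File(FileNode), Dir(DirectoryNode) }
type FsResult<T> = Result<T, ()>; // simplified error type

trait Filesystem {
    fn open_root(&self) -> FsResult<DirectoryNode>;
    fn read_dir(&self, dir: &DirectoryNode) -> FsResult<Vec<Node>>;
    fn create_node(&self, parent: &DirectoryNode, name: &str, dir: bool) -> FsResult<Node>;
    fn read_file(&self, file: &FileNode, pos: u64, buf: &mut [u8]) -> FsResult<usize>;
    fn write_file(&self, file: &FileNode, pos: u64, buf: &[u8]) -> FsResult<usize>;
    fn close_file(&self, file: &FileNode) -> FsResult<()>;
    fn set_file_size(&self, file: &FileNode, size: u64) -> FsResult<()>;
}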

Node

See Node

The Filesystem will give us a Node when we open a directory or a file, and this Node can be either a FileNode or a DirectoryNode.

I'm calling it Node even though FAT doesn't have this concept; I'm just using it to represent the file information.

Partition tables

Currently we only support the MBR partition table, and we can only read the first partition; we don't check the partition type, we just forward it to the FAT filesystem.

Devices

See Devices

This is a basic dictionary that maps a device name to an Arc<dyn Device>. When a device is opened, a clone of it is returned in a special FileNode, so we can act on it as a file.

FAT (File Allocation Table) filesystem

This is implemented in fat

The FAT filesystem is a simple filesystem that is widely used in many devices, such as USB drives, SD cards, and floppy disks.

In this kernel, we have support for:

  • FAT12
  • FAT16
  • FAT32
  • Long file names (LFN).
  • reading and writing files, changing file size, and creating files and directories.

Processor

This is implemented in cpu.

Here we talk about processor related structures and functions. Including:

  • Interrupts and exceptions
  • Global Descriptor Table (GDT)
  • Advanced Programmable Interrupt Controller (APIC) and IO APIC.

Let's first talk about other processor stuff that isn't covered by the above.

Saved CPU state

Each CPU (currently only 1) has a structure that contains the state related to that CPU.

Among other things, it contains:

  • the id and apic_id of the cpu, for identification.
  • the n_cli and old_interrupt_enable, which are used by the implementation of locks.
  • the context, which is a process context used when switching between processes; also process_id for the current process, and other scheduling-related fields.

CPU initialization

Currently, we don't perform any additional initialization after boot, and it's causing some issues, as UEFI results in a different CPU state than BIOS and we need to handle that; there is an issue for it (#34).

Interrupts and exceptions

This is implemented in interrupts.

This is where interrupts are managed. We have 2 types of interrupts.

  • Exceptions: These are errors that occur in the CPU, like division by zero, invalid opcode, etc, and these are defined in specific places in the IDT.
  • Interrupts: These are events that are triggered by devices, like the keyboard, timer, etc, these do not have specific placement in the IDT, and are instead handled mostly in the user_interrupts range of the IDT, which is from 32 to 255.

The last 16 entries of the user interrupts are used specially by the kernel as such:

  • SPECIAL_SCHEDULER_INTERRUPT=0xFF: Used by scheduler to switch between contexts.
  • SPECIAL_SYSCALL_INTERRUPT=0xFE: Used by the syscall instruction to switch to kernel mode, this is defined in kernel user link.

Generally we use functions like allocate_user_interrupt, which allocates an interrupt entry and puts our function there (see Interrupts handlers later for the types of handlers); it then tells us where the entry was allocated, so we can use it later, for example with the APIC.

The InterruptDescriptorTableEntry gives us some extra functionalities that are not provided by default. Such as:

  • set_stack_index which sets the stack index to use when handling the interrupt.
  • set_privilege_level which sets the privilege level of the interrupt.
  • set_disable_interrupts which sets whether the interrupts flag should be disabled when handling this interrupt.
  • override_code_segment which sets the code segment to use when handling the interrupt.

Interrupts handlers

There are 2 types of interrupts handlers based on what arguments they take:

  • The minimal, which takes InterruptStackFrame64, i.e. the data provided by the CPU automatically.
    pub struct InterruptStackFrame64 {
        pub rip: u64,
        pub cs: u8,
        pub rflags: u64,
        pub rsp: u64,
        pub ss: u8,
    }
  • The full, which takes all registers (except fxsave), this is used when we expect to switch between processes or need those extra registers.
    pub struct InterruptAllSavedState {
        pub rest: RestSavedRegisters,   // contains all the rest of the registers
        pub number: u64,
        pub error: u64,
        pub frame: InterruptStackFrame64,
    }

Generally, APIC is the only user of allocate_user_interrupt, as it uses it to allocate interrupts for hardware devices. Look at the APIC for more details on what interrupts we have now.

Global Descriptor Table (GDT) and others

This is implemented in gdt and idt.

The Global Descriptor Table (GDT) is a data structure used by the x86 architecture to define the characteristics of the various memory and privileges of the segments used by the CPU.

The Interrupt Descriptor Table (IDT) is a data structure used by the x86 architecture to define the characteristics of the various interrupts and exceptions.

Interrupt Descriptor Table (IDT)

I'll start with this just to get it out of the way.

The setup for the IDT is very simple: we just have static memory of "default" empty handlers, and we use the lidt instruction to load the IDT.

Later, when we add an interrupt, we just modify the IDT entry with the new handler, and it will be used from now on.

For more information about specific usage of interrupts, see Interrupts and exceptions and APIC.

Global Descriptor Table (GDT)

This kernel is x86_64, and segments are not used as much as they were in the past, so we have a very basic implementation of the GDT.

Currently, we have 4 segments excluding the NULL segment:

  • KERNEL_CODE: This is the code segment for the kernel.
    • flags: flags::PRESENT | flags::CODE | flags::USER | flags::dpl(KERNEL_RING)
  • USER_CODE: This is the code segment for the userspace.
    • flags: flags::PRESENT | flags::CODE | flags::USER | flags::dpl(USER_RING)
  • KERNEL_DATA: This is the data segment for the kernel.
    • flags: flags::PRESENT | flags::USER | flags::WRITE | flags::dpl(KERNEL_RING)
  • USER_DATA: This is the data segment for the userspace.
    • flags: flags::PRESENT | flags::USER | flags::WRITE | flags::dpl(USER_RING)

The code segments will have the LONG flag set. Technically, we don't need the KERNEL_DATA segment either, but it's included to be more consistent.

I won't go into details of the flags, you can check the documentation of GDT or the CPU manual.

Also an interesting note: flags::WRITE seems to be required; at least with QEMU it would crash when switching to a data segment that doesn't have it, even though the AMD64 manual says the CPU ignores those bits in 64-bit mode.

From above:

  • KERNEL_RING is 0
  • USER_RING is 3

As part of the GDT setup, we also set up the TSS (Task State Segment), which is generally used by the CPU to switch between tasks. Since we don't use hardware tasks, we need it at minimum to configure interrupt stacks.

Task State Segment (TSS)

The TSS is a structure used by the CPU to switch between tasks; it also contains the IST (Interrupt Stack Table), which provides separate stacks for interrupts, as well as the stack to use when going from user to kernel mode.

For us, we set up 7 stacks, usable by any interrupt. Look at the memory layout for where those are located. The interrupts manager can then choose the stack to use for each interrupt with set_stack_index (see Interrupts and exceptions). A value of None means to use the default stack.

The default stack will be the current stack if the privilege level is the same as the current privilege level, otherwise it will change to the stack specified in the TSS based on the target privilege level.

Currently, we only have 1 stack for KERNEL_RING, which is at the process kernel stack in the memory layout. I.e. this is a stack specific to each process, as it will only be used when transitioning from user to kernel mode, and in user mode we are always inside a process.

Advanced Programmable Interrupt Controller (APIC) and IO APIC

This is implemented in apic.

The APIC is part of the CPU, and it is used to manage interrupts and exceptions: both those triggered by hardware devices and those triggered by the CPU itself.

Initialization

The initialization of the APIC is done using the ACPI tables, which contain data about the APIC setup, such as:

  • Number of CPUs.
  • The APIC ID of each CPU.
  • The address of the IO APIC.

The APIC is memory-mapped at an address provided by the CPU; we can fetch it by reading MSR 0x1B (the APIC_BASE register).

Then we can map the APIC and IOAPIC (also memory-mapped) addresses to the virtual address space using virtual space.

Interrupts

The APIC can be used to allocate and assign interrupts to hardware devices. We can use the functions:

  • assign_io_irq to assign an interrupt to a hardware device.
  • assign_io_irq_custom which is similar to the above but provide extra changes to the interrupt, such as:
    • with_interrupt_polarity_low: Set the interrupt polarity to low/high (boolean).
    • with_trigger_mode_level: Set the trigger mode to level/edge (boolean).
    • with_mask: Set the interrupt to be masked or not (boolean).

It will set up the interrupt with the correct IO APIC based on the irq_num argument provided.

Currently, we have these interrupts configured:

  • 1: Used by the keyboard driver.
  • 14 & 15: Used by the IDE driver.
  • HPET timer, with interrupt number specified dynamically based on its configuration, but looks to be 2 on the VM.

We also have interrupts from the APIC itself, such as:

  • Timer interrupt: this is used by the scheduler to switch between processes.
  • Error interrupt: This is mapped, but we haven't seen it triggered yet.
  • Spurious interrupt: This is mapped, but we haven't seen it triggered yet.

Advanced Configuration and Power Interface (ACPI)

This is implemented in acpi.

ACPI is a standard for operating systems to discover and configure computer hardware components and to perform power management. It is a successor to Advanced Power Management (APM), and its specification is now managed by the UEFI Forum.

We have support for some of the components of the ACPI standard.

ACPI comes in tables, and these tables include information about the system's hardware.

ACPI Tables Support

We have parsing support for these tables now; having support doesn't mean we use all the features inside them, it just means that we can read the table. Some of them are used, but not all.

  • RSDP: This is either provided by multiboot2 during boot, or we search for it in memory, and this just points to the RSDT or XSDT.
  • RSDT: This is the Root System Description Table, and it contains pointers to other tables.
  • MADT/APIC: This is the Multiple APIC Description Table, and it contains information about the APIC, see APIC for more details.
  • FACP: This is the Fixed ACPI Description Table, and it contains information about the power management features of the system, also contain some info about the RTC.
  • HPET: This is the High Precision Event Timer, and it contains information about the HPET, see HPET for more details.
  • DSDT (not used*): This is the Differentiated System Description Table, and it contains AML code, which is used to configure the system.
  • SSDT (not used*): This is the Secondary System Description Table, and it also contains AML code, which is used to configure the system.
  • BGRT (not used): This is the Boot Graphics Resource Table, and it contains information about the boot logo, and it is used by the UEFI firmware.
  • WAET (not used): This is the Windows ACPI Emulated Devices Table, and it contains information about the emulated devices, and it is used by the UEFI firmware.
  • SRAT (not used): This is the System Resource Affinity Table, and it contains information about the system's memory and processors locality.

*: We don't use DSDT and SSDT data for now, but we do parse it as AML code, see AML for more details.

We use virtual space to map the ACPI tables, and then we copy them to the heap; this makes them easier to use, and we can reclaim ACPI memory later.

ACPI Machine Language (AML)

This is implemented in aml.

AML is a language used to describe the hardware of the system, and it is used by the ACPI to configure the system, check spec here (this is what's used to implement this parser).

This is an AML parser that is able to parse the code inside the DSDT and SSDT tables.

Currently, we do not use the parsed code, as we need to build an interpreter for it, so that we can emulate executing it and get the data we need.

Why is this needed?

There are some details of the hardware that are hidden in AML, such as:

  • Finding out the sleep commands for the system, i.e. which registers to write to, and what values to write.
  • Finding out interrupts configuration, for devices that share the same interrupt line.

Generally, most OSes will use acpica which is a tool that can parse and execute AML code.

Some annoying parts

Just to share here, since this is a documentation and all.

Generally, parsing AML is not that hard; it's a binary format, we can read the opcode and know what term type we need.

The issue is one thing, method calls.

Method calls are encoded as such, NameString followed by N arguments as TermArg (which is a term type). The issue is:

  • The expression itself doesn't provide the number of arguments.
  • The method call can happen before the method is defined (method definition provide number of arguments).
  • The method can be external (I think).

So, we have one choice (acpica does it too): "guess" the number of arguments. And this is very error prone, as a Term can also be considered an expression; for example, a NameString can be considered a variable if it is not a method call. Very messy :'(

The current implementation is, I think, quite good, and I've fixed all the bugs I found, but of course there could be more.

This seems kinda "complaining" XD, but I just wanted to share the experience.

Processes

This is implemented in process

This module implements the process management of the kernel, including creating, destroying, and scheduling processes, as well as managing the process's memory and resources.

Currently, when we say process, we also mean thread, as we don't have multi-threading support yet.

Process Data

The process structure Process contains all the information relating to the process; some of the important fields:

  • id: The process id.
  • parent_id: The id of the parent process.
  • vm: The process's virtual memory, an instance of VirtualMemoryMapper, see virtual mapper.
  • context: A saved state of the CPU from before the process was scheduled. I.e. while the process is running, this is considered invalid and doesn't represent the current state of the process.
  • open_filesystem_nodes: A map of open file nodes (see filesystem; a node can be a file or a directory). The mapping here is from usize; we use a map instead of a list since we can remove a file from the middle, and we don't want to shift all the elements after it.
  • argv: A string list of the arguments passed to the process.
  • stack_ptr_end: The end of the stack. The stack grows down, so this is the highest address of the stack, and where the stack starts when the process is created.
  • stack_size: The current size of the stack; currently it's constant, until we get growing-stack support.
  • heap_start: The start address of the heap; this is padded by around 1MB from the end of the ELF file loaded into memory.
  • heap_size: The current size of the heap. The user process can request more heap space with the inc_dec_heap system call.
  • heap_max: The maximum possible size of the heap; this doesn't change, and is currently set to 1GB.
  • priority: The priority of the process, used by the scheduler (see PriorityLevel).
  • exit_code: The exit code of the process; if the process has exited, this will be set to the exit code.
  • children_exits: A list of the children processes that have exited, with their exit codes (see Process Exit later for more information).

Process Creation

Process creation (structure creation) is as follows:

  • Load the ELF file; this doesn't load the whole thing, just the header, to make sure it's valid.
  • Map the stack region.
  • Load argv into the stack (check the argv structure for more information).
  • Load the ELF regions into memory.
  • Load the Process Metadata structure (check Process Metadata for more information).
  • Add process-specific kernel memory regions, like the kernel stack (this must be done after loading the ELF, and must be the last manual modification to the VM, because we can't switch to this VM after this point unless it's by the scheduler; see the comments in process/mod.rs::allocate_process for more details).
  • Add data about the heap, with size 0 and max size 1GB, i.e. no memory allocated yet.
  • A default context is created; everything is 0, except for:
    • rip: The entry point of the ELF file.
    • rsp: The end of the stack.
    • rflags: enable the interrupts.
    • cs: The code segment for the USER_RING (see GDT).
    • ss/ds: The data segment for the USER_RING (see GDT).
    • rdi: The value of argc, since we use the SYSV calling convention from the userspace.
    • rsi: The address of the argv array, since we use the SYSV calling convention from the userspace.

Argv Structure

The argv structure in the stack is as follows:

+-------------------+   <- stack_ptr_end
| arg0              |
+-------------------+
| arg1              |
+-------------------+
| ....              |
+-------------------+
| null              |
+-------------------+
| argv              |   <- pointer to `arg0`
+-------------------+
| argc              |   <- number of arguments
+-------------------+
| ....              |   <- this is where `rsp` will be set to, when the process is created
+-------------------+
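For illustration, given the registers above (rdi = argc, rsi = argv, per the SYSV convention), a userspace entry point would see this layout roughly like the following sketch (not the actual runtime code):

#[no_mangle]
extern "C" fn _start(argc: usize, argv: *const *const u8) -> ! {
    for i in 0..argc {
        // each argv entry points to a NUL-terminated argument string
        // stored higher up on the stack, as in the diagram above
        let _arg = unsafe { *argv.add(i) };
    }
    // ... run main, then call the exit syscall
    loop {}
}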

Process Metadata Structure

Structure definition at ProcessMetadata.

This structure is placed in a static location in userspace memory (see user memory layout), and is used to store information about the process, such as:

  • process id.
  • image base address.
  • image size.
  • program header offset (even though it can probably be obtained by reading the image header).
  • eh_frame address and size, this is used to implement unwinding.
  • text address and size, this is useful for debugging and getting backtrace from usermode.

Process Exit

When the exit syscall is called, the process is moved to the exited list, and the exit code is set.

The exited process will be removed from the scheduler's list at the next schedule call; see scheduler for more information.

When the process exits, it does the following as well:

  • It will notify all processes that are in the WaitingForPid state with the process's id, give them the exit_code, and resume those processes.
  • It will add itself to the parent's children_exits list with the exit_code, so that parents can know when their children have exited without blocking on WaitingForPid (i.e. they can call waitpid without blocking, only because they are parents).

Scheduler

This is implemented in scheduler

The scheduler is responsible for scheduling the processes, and managing the CPU time between them.

Scheduling Algorithm

We are using a priority-queue-based approach for scheduling processes.

The queue order is determined by a value, priority_counter, that starts at u64::MAX and is decremented by an amount derived from the process's priority level; a higher priority level decreases the value less, thus staying on top more often.
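A sketch of that idea (the decrement formula and scale here are hypothetical; only the shape matches the description):

const MAX_PRIORITY: u64 = 10; // hypothetical scale

fn reschedule_counter(counter: u64, priority_level: u64) -> u64 {
    // higher priority => smaller decrement => stays near the top longer
    let decrement = MAX_PRIORITY + 1 - priority_level.min(MAX_PRIORITY);
    counter.saturating_sub(decrement)
}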

Each time we schedule a process we perform the following:

  • Check all waiting processes, and wake them if it's time. Currently, we have the ProcessState::WaitingForTime and ProcessState::WaitingForPid states that support waiting.
  • After waking them (moving them to the scheduled list), pick the top scheduled process and run it, moving it to the running_and_waiting list.
  • If we have exited processes, handle notifying waiters and parents, and remove the process. It's important that we remove the process here, since we can't do it while the process is running (still handling the exit syscall): at that point we still hold its virtual memory, and deleting the process would free it up and cause a page fault.

Running the process is simple:

  • Copy the context of the process to the saved context of the CPU (see processor saved state), which will be used by the scheduler interrupt to jump to it.
  • Set the pid of the process as the process_id of the CPU.
  • Mark the process as ProcessState::Running, and move it to the running_and_waiting list as mentioned.

Yielding

When a process is running, it can yield to the scheduler in 2 ways right now:

  • Timer: The APIC timer, see APIC, will interrupt the CPU every once in a while, and this gives us preemptive multitasking.
  • System Call: When a syscall is executed, after the syscall, we perform yield as well, see syscalls.

When yielding, we perform the following:

  • Save the all_state of the CPU to the context of the process. This all_state comes from the interrupt, i.e. we can only yield when an interrupt occurs in that process, since we have to save the exact CPU state from before the interrupt.
  • Reschedule the process, by putting it in the scheduled list and fixing up its priority_counter to be similar to the top process's. This is important: if a process was sleeping for some time, we don't want it to hog the execution when it wakes up, which would happen because its priority_counter would be much higher than any other process's.

Sleeping

When a process is running, it can sleep, and this is done through the syscall sleep, see syscalls.

When sleeping, we perform the following:

  • Mark the process as ProcessState::WaitingForTime(deadline), where deadline is the expected time to finish the sleep from the current time. See Clocks.
  • The process will already be in the running_and_waiting list, so no movement is done here.

And then, in the scheduler, we handle sleeping processes (see scheduling algorithm).

Scheduler Interrupt

This is interrupt 0xFF, See interrupts for more information.

This interrupt is used to easily change the execution context of the current cpu.

When the interrupt is triggered, we do the following:

  • The cpu must contain a context, which we will move to.
  • We must be in the kernel, of course; this is a private interrupt for the kernel, and the scheduler alone.
  • Swap the all_state coming from the interrupt (which will be a point in the schedule function in the kernel, where we triggered this interrupt) with the context from the cpu, which will be the state of the process currently.

And with the last step, we achieve the context switch between two execution states, the kernel and the process.

Why?

I found that this solution works quite well for switching between contexts. Is it the best? I don't know, but it works quite well now and is very stable.

Syscalls

This is implemented in syscalls

The syscall numbers and some metadata like SyscallResult and SyscallError are defined in the kernel user link crate.

The syscall spec is as follows:

  • Syscalls can have up to 7 arguments, passed in RCX, RDX, RSI, RDI, R8, R9, R10, in that order.
  • The syscall number is passed in RAX.
  • The syscall return value is passed in RAX; the syscall may write to pointers passed in the other registers, but the result code will still be in RAX.
  • The returned RAX value is an encoded SyscallResult. So, it's never intended to be read as a u64 directly.
  • The arguments are not modified by the syscall, but the syscall may read from them, or write to memory pointed to by them.
  • All pointers passed to the syscall have to be valid and point to user-space memory only. The kernel will check that the memory is mapped before writing to it, but it doesn't guarantee the validity of the memory (i.e. if a pointer pointed to a random part of the heap, it could corrupt the heap, for example).
  • The syscall may block execution depending on the syscall itself, like wait_pid or a read from a blocking file with no data.
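As a hedged illustration of this convention (a sketch, not the real userspace stubs; it assumes the kernel is entered via int 0xFE, the SPECIAL_SYSCALL_INTERRUPT mentioned earlier):

/// A one-argument syscall stub following the register convention above.
unsafe fn syscall1(num: u64, arg1: u64) -> u64 {
    let result: u64;
    core::arch::asm!(
        "int 0xFE",            // SPECIAL_SYSCALL_INTERRUPT (assumed entry path)
        in("rax") num,         // syscall number
        in("rcx") arg1,        // first argument
        lateout("rax") result, // encoded SyscallResult
    );
    result
}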

Syscalls list

This shows the list of syscalls and their arguments and return values. Some notes on this:

  • a: *const/*mut T, b: usize pairs are used in the kernel as a slice, for example write(file_index: usize, buf: &[u8]). But I didn't want to remove an argument and confuse the reader, so I kept them split into two, as they are sent from userspace.
  • Types like &Path, BlockingMode, SpawnFileMapping, etc. are passed as u64, as is everything, but the kernel checks validity and casts/parses these pointers/values into the correct types.
  • All the return types are SyscallResult and will report any error during execution; for simplicity, I just didn't repeat that in the table.
  • When the return type is (), it means the kernel will return SyscallResult::Ok(0), and the userspace will check that it's 0.
Each entry below is name(arguments) -> return value, followed by a description:

  • open(path: &Path, access_mode: u64, mode: u64) -> file_index: usize. Opens a file.
  • write(file_index: usize, buf: *const u8, size: usize) -> bytes_written: usize. Writes to a file.
  • read(file_index: usize, buf: *mut u8, size: usize) -> bytes_read: usize. Reads from a file.
  • close(file_index: usize) -> (). Closes a file.
  • blocking_mode(file_index: usize, blocking_mode: BlockingMode) -> (). Sets the blocking mode of a file. This is DEPRECATED, and should be replaced with set_file_meta with FileMeta::BlockingMode.
  • exit(exit_code: i32) -> !. Exits the current process.
  • spawn(path: &Path, argv: *const *const u8, file_mappings: *const SpawnFileMapping, file_mappings_size: usize) -> pid: u64. Spawns a new process.
  • inc_heap(increment: i64) -> old_heap_end: usize. Increases/decreases the heap of the current process (similar to sbrk).
  • create_pipe(read_fd: *mut usize, write_fd: *mut usize) -> (). Creates a pipe.
  • wait_pid(pid: u64, block: bool) -> exit_code: i32. Waits for a process to exit.
  • stat(path: &Path, stat: *mut FileStat) -> (). Gets the file stat of a file.
  • open_dir(path: &Path) -> dir_index: usize. Opens a directory.
  • read_dir(dir_index: usize, buf: *mut DirEntry, len: usize) -> entries_read: usize. Reads from a directory.
  • get_cwd(buf: *mut u8, len: usize) -> needed_bytes: usize. Gets the current working directory; returns BufferTooSmall if the buffer is too small.
  • chdir(path: &Path) -> (). Changes the current working directory.
  • set_file_meta(file_index: usize, meta_id: u64, meta_data: u64) -> (). Sets the file meta.
  • get_file_meta(file_index: usize, meta_id: u64, meta_data: *mut u64) -> (). Gets the file meta.
  • sleep(seconds: u64, nanos: u64) -> (). Sleeps for a duration.
  • get_time(clock_type: ClockType, time: *mut ClockTime) -> (). Gets the time based on the clock_type, see Clocks.
  • graphics(command: GraphicsCommand, extra: *mut ()) -> (). Graphics operations, see Graphics:VGA.
  • seek(file_index: usize, whence: SeekWhence, offset: i64) -> new_offset: u64. Seeks a file.
  • priority(pid: u64, priority: Option<PriorityLevel>) -> PriorityLevel. Sets and gets the priority of a process.

Executables

This is implemented in executable

Currently, we only support ELF, and it will probably stay like this for a while; I don't plan to support other formats soon.

ELF

The ELF file is the executable format used by most of the Unix-like systems, and it is the format we will be using for our executables.

This is implemented in elf

We load the ELF on process creation; see process creation for more information.

For now, we support very basic loading: no dynamic linking, shared libraries, or relocation; just loading segments.

Clocks

This is implemented in clocks.

Here we implement the clock functionality of the system.

This includes functionalities used in implementing:

  • system time.
  • timers.
  • sleep (see scheduler).

We have several clock sources; these are devices that provide us with a "time". This "time" is not bound to a specific starting point, but it does guarantee that:

  • The time is always increasing.
    • TODO: We still haven't handled wrap-arounds.
  • Querying the time at 2 points gives you a difference that advances at real-time speed.
    • This depends on the granularity of the source; for example, if we implemented it for the RTC, the granularity would be 1 second, i.e. querying it twice within 1 second would mostly give the same time.

And thus with these, we can keep track of the time.

We have 2 times in the system:

  • boot time (uptime).
    • This is the time since the system booted.
    • This is calculated by periodically updating the time based on the best clock source we have.
  • real time (unix time).
    • On boot, we get the current time from the RTC.
    • When we need this value, we compute it from the boot time (uptime) plus the start time we got from the RTC at boot.

These times can be fetched with the get_time syscall.
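
To make the relationship concrete, here is a minimal sketch; the ClockTime fields match the description in the kernel user link section, but the helper itself is illustrative:

const NANOS_PER_SEC: u64 = 1_000_000_000;

/// Seconds plus nanos, mirroring `ClockTime`.
#[derive(Clone, Copy)]
struct ClockTime {
    seconds: u64,
    nanos: u64,
}

/// Illustrative helper: real (unix) time = RTC time sampled at boot + uptime.
fn real_time(rtc_at_boot: ClockTime, uptime: ClockTime) -> ClockTime {
    let mut seconds = rtc_at_boot.seconds + uptime.seconds;
    let mut nanos = rtc_at_boot.nanos + uptime.nanos;
    if nanos >= NANOS_PER_SEC {
        seconds += 1;
        nanos -= NANOS_PER_SEC;
    }
    ClockTime { seconds, nanos }
}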

RTC

This is implemented in rtc.

The RTC (Real Time Clock) is a hardware clock that is used to provide the current time and date for the system.

The RTC is used as the base time to determine the unix time of the system, and provide this time to the user space.

Besides that, the kernel generally doesn't care about unix time; everything else is based on "system boot" time.

The RTC could technically be used as a clock source, like the HPET and TSC, but its accuracy is very low, so for now it's not used as such.

High Precision Event Timer (HPET)

This is implemented in hpet.

See Intel's High Precision Event Timer (HPET) Specification for more details.

The HPET is a hardware timer that is used to provide a high-precision time base for the system.

Other clocks, such as the TSC, are calibrated using the HPET; the TSC is then used to provide the time for the system, as reading it is faster than reading the HPET.

Currently, we only use 1 HPET timer for the clock, and we don't use its interrupts. But we could use them in the future to provide more accurate time-based events.

Time Stamp Counter (TSC)

This is implemented in tsc.

The TSC is a 64-bit register present in all x86 processors since the Pentium. It counts the number of cycles since the processor was reset. It is used to provide a high-precision time base for the system.

But it counts cycles, not human time, so we have to calibrate it against a clock with a known rate, like the HPET or the RTC. The RTC is very coarse (1 second granularity), so for now we calibrate against the HPET.

We may also need to recalibrate the TSC once in a while, as it may drift.
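
As an illustration, calibration boils down to counting TSC cycles over a known reference interval. Here is a rough sketch; read_reference_ns is a hypothetical stand-in for reading the HPET counter, and real code would keep more precision than whole cycles-per-nanosecond:

use core::arch::x86_64::_rdtsc;

/// Rough calibration sketch: count TSC cycles across a fixed window measured
/// by a reference timer. `read_reference_ns` is a hypothetical stand-in for
/// reading the HPET main counter (converted to nanoseconds).
unsafe fn tsc_cycles_per_ns(read_reference_ns: impl Fn() -> u64) -> u64 {
    const SAMPLE_NS: u64 = 10_000_000; // 10 ms sample window

    let start_ns = read_reference_ns();
    let start_tsc = _rdtsc();
    while read_reference_ns() - start_ns < SAMPLE_NS {
        core::hint::spin_loop();
    }
    let end_tsc = _rdtsc();

    (end_tsc - start_tsc) / SAMPLE_NS
}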

Graphics

Currently, we only have a VGA driver implemented, in VGA.

It only does CPU-based rendering, copying images to the VGA framebuffer provided to us by the hardware.

There is no hardware acceleration yet.

VGA

This is implemented in vga.

This is a basic graphics driver implementation.

We take the framebuffer info from the multiboot2 structure at boot, and use it to write to that memory, which is then displayed on the screen.

This framebuffer is controlled by the kernel, which can use it internally to display stuff; for example, the console uses it to display characters on the screen.

But then, a userspace process can take ownership of the framebuffer. What this means is that the kernel stops rendering to it, but still owns the memory region. At this stage, the kernel just waits for rendering commands coming from the owner process.

The rendering commands here are just Blit, which is an operation that copies a region from one framebuffer (user allocated) to another (the vga framebuffer, that the kernel owns).

This means that all the rendering is done by the userspace processes, and the kernel just copies the images to the screen.

Userspace processes can be more efficient by telling the kernel which regions of the framebuffer have changed and only sending those regions to the kernel, so the kernel can copy only the changed regions to the screen.

These operations are accessible through the graphics syscall.

Graphics Command

There are 4 commands supported:

  • TakeOwnership: This is used to take ownership of the graphics device.
  • ReleaseOwnership: This is used to release ownership of the graphics device, and is executed automatically when the process exits if it was not released manually.
  • GetFrameBufferInfo(&mut info_out): This is used to get information about the framebuffer, see FrameBufferInfo.
  • Blit(&BlitCommand): This is used to blit a region from userspace memory into the graphics framebuffer. It can control (see BlitCommand for more info; a sketch of the copy loop follows this list):
    • src_framebuffer: memory reference to the source framebuffer (only read by the kernel)
    • src_framebuffer_info: The framebuffer info of the source framebuffer, i.e. its shape in the memory, so that we can copy correctly from it.
    • src_x, src_y: The top-left corner of the source region (user memory)
    • dest_x, dest_y: The top-left corner of the destination region (kernel)
    • width, height: The width and height of the region to copy, applies to both
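
Here is a sketch of the row-by-row copy a Blit performs. The FrameBufferInfo fields (pitch, bytes_per_pixel) are assumptions about its shape, and the real kernel additionally bounds-checks the regions against both framebuffers:

/// Assumed shape of the framebuffer description: a row stride ("pitch")
/// and the size of one pixel in bytes.
struct FrameBufferInfo {
    pitch: usize,
    bytes_per_pixel: usize,
}

/// Illustrative row-by-row blit: copy a `width x height` pixel region from
/// `src` at (src_x, src_y) into `dst` at (dest_x, dest_y).
fn blit(
    src: &[u8], src_info: &FrameBufferInfo, src_x: usize, src_y: usize,
    dst: &mut [u8], dst_info: &FrameBufferInfo, dest_x: usize, dest_y: usize,
    width: usize, height: usize,
) {
    let bpp = src_info.bytes_per_pixel; // assume both sides share a pixel format
    for row in 0..height {
        let src_off = (src_y + row) * src_info.pitch + src_x * bpp;
        let dst_off = (dest_y + row) * dst_info.pitch + dest_x * bpp;
        dst[dst_off..dst_off + width * bpp]
            .copy_from_slice(&src[src_off..src_off + width * bpp]);
    }
}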

Testing

We have a testing framework in the kernel that tries to replicate the testing framework in rust-std.

We can't just use #[test], so instead we created our own macro for tests, testing::test!.

Example:

testing::test! {
    fn test_free_realloc() {
        let page = unsafe { alloc() };
        let addr = page as usize;

        unsafe { free(page) };

        let page2 = unsafe { alloc() };

        assert_eq!(page as usize, addr);

        unsafe { free(page2) };
    }

    #[should_panic]
    fn test_unaligned_free() {
        let page = unsafe { alloc() };

        let addr_inside_page = unsafe { page.add(1) };

        unsafe { free(addr_inside_page) };
    }
}

When you add a new feature, be sure to add tests for it as much as possible.

Userspace

This section will cover the userspace part of the operating system. It will cover the programs, libraries and other userspace related topics.

Building custom userspace programs

Since we are using the Rust STD, you can (hopefully) build a lot of rust projects that don't depend on libc or on Linux/Windows-specific libraries.

Getting the toolchain

First, you need to get the toolchain, and you can do that by either:

Using the prebuilt toolchain

We distribute a prebuilt toolchain in:

sh tools/install_toolchain_and_link.sh <path_to_toolchain.zip>

This will install the toolchain into extern/toolchain and link it to rustup as emerald.

Then, xtask will use the installed toolchain when building our programs:

cargo xtask userspace [build/check/fmt/clippy]

Building the toolchain

You can build our toolchain with the command

cargo xtask toolchain --install

in the root of the project.

This will install the toolchain in ./extern/toolchain, which you can then link to rustup with rustup toolchain link emerald ./extern/toolchain.

Building the userspace programs

Then, you can build your project with the toolchain:

cargo +emerald build --target x86_64-unknown-emerald
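
For example, a plain, dependency-free program should build out of the box (nothing emerald-specific is needed, since std itself is implemented for the target):

fn main() {
    println!("Hello from Emerald!");
}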

Please open an issue if you have any problems building your project, or if you want to add a new feature to the toolchain.

Programs

These can be found in userspace

Here is the list of programs found in the userspace of the operating system by default.

init

The first program that is run when the operating system boots. For now, the operating system requires this program and expects to find it at /init.

Currently, init performs the following:

  • Sets its stdin to blocking mode (we will see why in a bit).
  • Creates a new /shell process, with stdin: Piped, while stdout and stderr are inherited normally.
  • Stays in the following loop:
    • Checks if /shell has exited (non-blocking).
    • Reads from stdin, buffering until a newline is found, then sends the line to the pipe backing /shell's stdin, effectively giving us behavior similar to a normal terminal in Linux.
  • If /shell exits, it spawns a new /shell process and goes back to the loop.

This is temporary behavior (maybe?), but we still need to improve file operations, as init currently loops a lot.
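
For illustration, here is a behavioral sketch of that loop written against std (the real init uses the raw spawn/wait_pid syscalls with SpawnFileMapping; the std API here is just a stand-in to show the control flow):

use std::io::{self, BufRead, Write};
use std::process::{Command, Stdio};

fn main() {
    loop {
        // spawn /shell with a piped stdin; stdout/stderr are inherited
        let mut shell = Command::new("/shell")
            .stdin(Stdio::piped())
            .spawn()
            .expect("failed to spawn /shell");
        let mut shell_stdin = shell.stdin.take().unwrap();

        loop {
            // non-blocking check: has /shell exited? if so, respawn it
            if let Ok(Some(_status)) = shell.try_wait() {
                break;
            }
            // buffer one line from our stdin and forward it to /shell
            let mut line = String::new();
            if io::stdin().lock().read_line(&mut line).is_ok() {
                let _ = shell_stdin.write_all(line.as_bytes());
            }
        }
    }
}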

shell

This is a basic shell, that can change directories, and execute programs.

It also supports output redirection (no piping between processes yet), so you can do something like:

ls > file.txt

or even append to a file:

ls >> file.txt

List of commands/programs

| Name | Description |
| --- | --- |
| cd (internal) | Change directory |
| pwd (internal) | Print working directory |
| exit (internal) | Exit the shell, which will just cause another one to come back up |
| touch (internal) | Create a file, if not present |
| ls | List directory contents |
| tree | List directory contents recursively |
| echo | Write arguments to the standard output |
| cat | Print 1 file on the standard output (no concat yet XD) |
| xxd | Hexdump utility |
| keyboard | Keyboard test program |
| mouse | Mouse test program |

graphics

Here we have simple graphics programs that take control of the graphics device from the kernel; this looks like exiting the shell, and upon exiting the program, the shell comes back up.

List of commands/programs

| Name | Description |
| --- | --- |
| graphics | Simple graphics demo program; it displays a red ball bouncing around the screen |
| video | Video player; it takes a video in "image zip" format, i.e. a zip file with jpg images inside it. Check tools/video_to_zip.sh for how to convert normal videos to this format. You can specify the fps upon creation (default is 30), and again when running the program. |

Libraries

In this chapter we will talk about the libraries that we will use in our userspace programs.

Libraries are code shared between multiple programs, sometimes by "all" programs, as is the case for the Rust Standard Library.

Rust Standard Library

Currently, everything is built in Rust; we don't have a libc yet. Instead, we have a custom rust target that we use to build our userspace programs.

This is implemented in my fork of rust-lang/rust.

We have the target x86_64-unknown-emerald. The changes needed are in rust/library/std/src/sys/emerald and rust/library/std/src/os/emerald.

These changes are provided by another crate, emerald_std, where we have all the basic implementations of userspace functionality, including:

  • allocator: See Heap Allocator for more details.
  • files and io: For file operations and input/output.
  • process: For process management.

This crate is very basic: it essentially just performs the syscalls, nothing much else; the rest of the bindings are implemented inside rust itself.

Using rust like this gives us a lot of benefits: we only need to implement the basic unsafe functionality, and the standard library handles synchronization, memory management, etc. on top of it; of course, we need to make sure our implementation is correct.

Extra

Here are extra components that are shared between the kernel and userspace.

You might find references to these components in both the kernel and userspace chapters.

Heap allocator

In the kernel, and also in userspace, we need a heap.

A heap is simply a memory region from which we can dynamically allocate and deallocate memory when needed.

Anyway, we need a heap implementation for both the kernel and userspace.

This is achieved in increasing_heap_allocator, a crate implementing a heap allocator based on an "increasing" memory block.

What does that mean?

It means that this crate requires a page provider, i.e. an entity that can provide pages of memory that reside one after another; the rest of the heap-memory handling is done by the crate.

This example should make it clear; this is how it's implemented in userspace:

use std::alloc::Layout;

use increasing_heap_allocator::{PageAllocatorProvider, HeapAllocator};

const PAGE_4K: usize = 4096;

struct PageAllocator {
    heap_start: usize,
    mapped_pages: usize,
}

impl PageAllocator {
    fn new() -> Self {
        Self {
            heap_start: unsafe { syscall::inc_dec_heap(0).unwrap() as usize },
            mapped_pages: 0,
        }
    }
}

impl PageAllocatorProvider<PAGE_4K> for PageAllocator {
    fn allocate_pages(&mut self, pages: usize) -> Option<*mut u8> {
        assert!(pages > 0);

        // calculate the last heap base, using the `heap_start` and `mapped_pages`
        // the next heap base will be `last_heap_base + (pages * PAGE_4K)`
        let last_heap_base = self.heap_start + self.mapped_pages * PAGE_4K;
        let new_addr = unsafe { syscall::inc_dec_heap((pages * PAGE_4K) as isize) };

        let Ok(new_addr) = new_addr else {
            return None;
        };
        // `inc_dec_heap` returns the old heap end (the start of the newly
        // mapped block), so it must match where we think the heap ends
        assert!(new_addr as usize == last_heap_base);

        // update the mapped pages
        self.mapped_pages += pages;

        Some(new_addr)
    }
}

// just a simple usage
fn main() {
    let allocator = HeapAllocator::new(PageAllocator::new());

    // example layout: 1 KiB, 8-byte aligned
    let layout = Layout::from_size_align(1024, 8).unwrap();
    // allocate and deallocate
    let ptr = allocator.alloc(layout);
    allocator.dealloc(ptr, layout);
}

The only thing users of this crate need to implement is the PageAllocatorProvider trait, which is a simple trait that requires a method to allocate pages.

syscall::inc_dec_heap is very similar to sbrk in Unix, it will increase or decrease the heap size by the given size. An argument of 0 will return the current heap address without changing it, which we use in the beginning to get the initial heap address.

Allocator implementation

The internal implementation of the allocator is as follows:

  • The HeapAllocator keeps a free linked-list of free heap blocks. We exploit the fact that the blocks are free and store their metadata in the blocks themselves.

    struct HeapFreeBlock {
        prev: *mut HeapFreeBlock,
        next: *mut HeapFreeBlock,
        // size of this free block, including this header
        size: usize,
    }

    We have pointers to the previous and next free block, and the size of the block. As you might have guessed, HeapAllocator is a !Sync type, because we use raw pointers.

  • Whenever we need to allocate a block, we check the free list, pick the smallest block that fits the requested size, and split it if necessary. If we can't find a block, we allocate new pages from PageAllocatorProvider and add the new block to the free list.

  • Whenever we add a free block to the list, i.e. on dealloc or when getting new pages, we try to merge it with the free blocks around it to avoid fragmentation. The implementation of this is in free_block, and it is explained later in the free_block algorithm section.

  • The allocated block will contain metadata residing before it in memory; this is used to know the size of the block and any padding it has, which the dealloc function can later use to correctly deallocate it without leaving any memory behind. The structure is:

    struct AllocatedHeapBlockInfo {
        magic: u32,
        size: usize,
        pre_padding: usize,
    }

    The magic is a constant value 0xF0B0CAFE that we use to check whether some bug has happened; it could probably be removed once we have a stable implementation, but it has helped me a lot in debugging. size is the size of the block, and pre_padding is the padding before the block, which is used to align the block to the correct alignment; see the alloc algorithm for more details.

alloc algorithm

First, we need to know how much memory we need for this allocation. We get the Layout argument, which contains the size, but we also need to account for:

  • Including the AllocatedHeapBlockInfo metadata.
  • Making sure the block is aligned to the correct alignment, Layout::align.

We get a rough estimate of the size by doing:

let block_info_layout = Layout::new::<AllocatedHeapBlockInfo>();
// `whole_layout` here is the layout of the info header + the requested block
// `block_offset_from_header` is the offset of the block after the info header
let (whole_layout, block_offset_from_header) = block_info_layout
    .extend(layout.align_to(block_info_layout.align()).unwrap())
    .unwrap();

let allocation_size = whole_layout.pad_to_align().size();

This will give us the raw size of the block along with any padding we may need to align it.

The alignment of this whole_layout is equal to the larger of the two alignments. Example:

use std::alloc::Layout;

#[repr(align(32))]
struct A(i32);

#[repr(align(128))]
struct C(i32);

let layout_a = Layout::new::<A>();
let layout_c = Layout::new::<C>();

let (whole_layout, _) = layout_a.extend(layout_c).unwrap(); // whole_layout.align() == 128
let (whole_layout, _) = layout_c.extend(layout_a).unwrap(); // whole_layout.align() == 128

This means we could just use allocation_size and be done with it. But the issue is that we don't know the address of the free_block we will get; for this to work, the address of the block we return must be aligned to whole_layout.align().

The alignment of the returned free_block actually caused a bug before, which was fixed in 471eff0.

We know that free_block will always be aligned to AllocatedHeapBlockInfo, since all allocations are aligned to it.

But if the block we need requires a higher alignment, we will probably need to increase allocation_size to account for the extra padding.

The algorithm is as follows:

  • Get a free block from the free list that fits the allocation_size.

  • Get the allocated_block_offset, which is the offset from the free_block pointer to the start of the returned block.

    • The pointer at this offset is aligned for the block that the user will get.
    • This is computed as such:
      // `layout` is the input `block` layout
      // `base` is the address of the candidate free block
      // `align_up` aligns `base` upwards to the given `align`
      let possible_next_offset = align_up(base, layout.align()) - base;
      let allocated_block_offset = if possible_next_offset < mem::size_of::<AllocatedHeapBlockInfo>() {
          // if the info header doesn't fit before the aligned address, bump the offset
          possible_next_offset + mem::size_of::<AllocatedHeapBlockInfo>().max(layout.align())
      } else {
          possible_next_offset
      };
  • Then, the resulting ptr will be free_block + allocated_block_offset, and the metadata will be stored at ptr - mem::size_of::<AllocatedHeapBlockInfo>(), i.e. directly before ptr.

    It may seem that this is unsafe, since ptr may not be aligned to AllocatedHeapBlockInfo, and thus we shouldn't just use ptr - mem::size_of::<AllocatedHeapBlockInfo>(). But if layout has a lower alignment than AllocatedHeapBlockInfo, the previous if statement will be true, and we will get possible_next_offset + mem::size_of::<AllocatedHeapBlockInfo>().max(layout.align()).

    Maybe this is not the best way, or at least it should be documented better.

  • We need to check that allocated_block_offset doesn't exceed block_offset_from_header; if it does, allocation_size is no longer enough, since it was computed from block_offset_from_header while we are actually using the larger allocated_block_offset.

    To fix this, we just increase allocation_size. Currently, we don't check that free_block is big enough for the new allocation_size, but we should (see TODO).

  • As a last step, we need to shrink/split the free_block. The idea is that if the free block is larger than allocation_size, we split it into two blocks: one for the allocation and one for the remaining free space. The remaining free block is kept in the free list.

    But here we must be careful: we need to make sure that the remaining free block is large enough to hold the HeapFreeBlock metadata, i.e. mem::size_of::<HeapFreeBlock>(). Otherwise, when we write the metadata of the new split block, we would overwrite the next block's metadata.

    This was a bug before, and was fixed in cf53cf9.

    The split is done as such:

    let required_safe_size = allocation_size + mem::size_of::<HeapFreeBlock>();
    // do we have empty space left over?
    if free_block_size > required_safe_size {
        // split: the remainder becomes a new free block
        let new_free_block = free_block + allocation_size;
        // linked-list logic....
    } else {
        // not enough space for a valid free block; take the whole block
        // and note the size we actually took
        allocation_size = free_block_size;
        // linked-list logic....
    }

    add_free_block will add the new free block to the free list and try to merge it with the free blocks around it.

  • Finally, pre_padding is set to the allocated_block_offset value, and size to the final allocation_size after all these adjustments.

dealloc algorithm

This is a lot simpler than alloc, as the needed data is already provided by the AllocatedHeapBlockInfo metadata.

The algorithm is as follows (a compact sketch follows this list):

  • We get the pointer to the AllocatedHeapBlockInfo metadata, and we make sure the magic is correct.
  • We get the size and pre_padding from the metadata; the address of the block being freed is ptr - pre_padding, and its size is size.
  • We call free_block here, which is the same function used when getting new pages.
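
A compact sketch of these steps (free_block stands in for the free-list insertion described in the next section; this is not the crate's exact code):

use core::mem;

#[repr(C)]
struct AllocatedHeapBlockInfo {
    magic: u32,
    size: usize,
    pre_padding: usize,
}

const BLOCK_MAGIC: u32 = 0xF0B0CAFE;

/// Illustrative dealloc path: recover the metadata stored directly before
/// the user pointer, sanity-check it, and hand the whole block back.
unsafe fn dealloc(ptr: *mut u8, free_block: impl FnOnce(*mut u8, usize)) {
    let info = &*(ptr.sub(mem::size_of::<AllocatedHeapBlockInfo>()) as *const AllocatedHeapBlockInfo);
    assert_eq!(info.magic, BLOCK_MAGIC, "heap metadata corrupted");
    // the whole block started `pre_padding` bytes before `ptr`
    free_block(ptr.sub(info.pre_padding), info.size);
}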

free_block algorithm

This is a function that will take a pointer to a block and its size, and add it to the free list.

In the beginning we look at all the free blocks (maybe slow?) and get the following information:

  • previous block if present (this is the block that ends at ptr - 1)
  • next block if present (this is the block that starts at ptr + size)
  • "closest previous block", this is an optimization, where we find the closest block that ends before ptr, the new block (this) will be put after this in the linked list if its not merged with anything before it. As we need to keep the list sorted.

    And as you might expect this was discovered after a bug in da23ee9 :'(.

The merge logic is very simple, as follows (a self-contained model follows this list):

  • If we have both a previous AND a next block, we merge with both. The next block is removed from the list as well.
  • If we have only a previous block, we merge with it; very easy, prev.size += size.
  • If we have only a next block, we merge with it; the next block is removed and this block replaces it in the list.
  • If we have none, we just add the block to the list, after the closest previous block.
    • If we don't have a closest previous block, then we are the first block in the list, and we set this block as the head of the list.
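
Here is the self-contained model mentioned above. It replaces the intrusive linked list with a sorted Vec of (address, size) pairs to show just the merge decisions; the real implementation does the same on HeapFreeBlock pointers:

/// Model of `free_block`'s merge logic over a sorted free list of
/// (address, size) pairs (illustrative, not the crate's code).
fn insert_and_merge(free_list: &mut Vec<(usize, usize)>, addr: usize, size: usize) {
    // index of the first block starting at or after `addr`;
    // `idx - 1` is then the "closest previous block"
    let idx = free_list.partition_point(|&(a, _)| a < addr);
    let merges_prev = idx > 0 && {
        let (prev_addr, prev_size) = free_list[idx - 1];
        prev_addr + prev_size == addr
    };
    let merges_next = idx < free_list.len() && addr + size == free_list[idx].0;
    match (merges_prev, merges_next) {
        // bridges the gap: previous absorbs this block and the next one
        (true, true) => {
            free_list[idx - 1].1 += size + free_list[idx].1;
            free_list.remove(idx);
        }
        // very easy: prev.size += size
        (true, false) => free_list[idx - 1].1 += size,
        // this block absorbs the next one, replacing it in the list
        (false, true) => free_list[idx] = (addr, size + free_list[idx].1),
        // no neighbors: insert after the closest previous block (keeps order)
        (false, false) => free_list.insert(idx, (addr, size)),
    }
}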

TODO

  • Add a special implementation for realloc, which could be smarter than a plain alloc + copy + dealloc.
  • I think there is a better way to implement alloc; currently, we waste a lot of space. Imagine a layout with align=512 while AllocatedHeapBlockInfo has align=8, size=32: whole_layout will have align=512 and the allocated size will be 1024, even though we clearly don't need that much. We can also improve how we handle unaligned free_blocks.
  • Handle cases where we can't increase allocation_size because the free_block isn't big enough.

Kernel user link

This is a crate that contains common definitions for the kernel and the user space. It is implemented in the emerald_kernel_user_link crate.

It contains common definitions such as:

  • syscall numbers.
  • SyscallResult, and SyscallError types.
  • some process related arguments:
    • SpawnFileMapping: For mapping files from current process to new process.
  • file related structures and arguments:
    • DirEntry: For directory entries.
    • FileStat: For file stats, such as size, type, etc.
    • BlockingMode: For blocking modes of the operations on files.
    • FileType: For file types.
    • FileMeta: For assigning and getting metadata of files.
  • clock related structures and arguments:
    • ClockType: For specifying the type of the clock to get the time from.
      • RealTime: For getting the real time, which is based on the unix time.
      • SystemTime: For getting the time since the system booted.
    • ClockTime: Structure holding the time, seconds and nanos.
  • STDIN, STDOUT, STDERR file descriptor numbers.