RISCVBusiness Overview

RISCVBusiness is intended to be a configurable RV32 implementation targeting embedded applications. For information on the RISC-V Instruction Set Architecture, see the RISC-V Ratified Specifications.

RISCVBusiness 3-Stage Pipeline

Currently, the core supports the following features:

Base ISA:
- RV32I
- RV32E (Reduced register count)
ISA Extensions:
- M
- A
- C
- B
- Zicsr
- Zifencei
- Zicond
Privileged Architecture
- Machine mode with 16 PMP regions
- Supervisor with SV32 virtual addressing
- User mode
Microarchitectural Features
- Dual core
- 3-stage pipeline
- Parameterized I/D Caches (coherent)
- Parameterized I/D TLB
- Branch prediction
- Selectable multiplier implementation

Most of the features listed above can be tuned with a YAML file, see example.yml for an example of core configuration.

For planned features, see the GitHub issues

RISC-V Supervisor Extension Overview

Introduction

This project aims to integrate the RISC-V “S” Supervisor Extension in to SoCET’s RISC-V processor. This project implements the base extension as outlined in the privilege specification, along with a translation lookaside buffer (TLB), hardware page walker (PW), and a virtually indexed, physically tagged L1 cache (VIPT L1$) to with the goal of running an operating system on the processor.

Background

Supervisors, also known as operating systems (OS), are privileged programs that interface between user-programs and the hardware it runs on. These programs introduce important functionalities, such as paging/page tables, virtualized address spaces, and address translation. Paging splits memory up into manageable chunks. In RISC-V’s case, pages are 4KB in size. During the start up of an OS, page tables are defined in memory. These tables store the translations for virtual addresses to physical addresses. In order to easily support this, a register is designated to storing the pointer to the page table of a current process. A virtualized address spaces, or virtual memory, is assigned to each running process by the operating system. These address spaces are private for a given process, so no other process will be able to directly edit another process’ address space. However, in order for virtual memory to be possible, the pages of a virtual address space need to be mapped to physical pages. These physical page mappings are what’s stored in a processes page table. Through the process of address translation, a virtual address can be converted to a physical address. This physical address is then what addresses main memory. The amount of physical pages depends on how much physical memory a processor has.

The 32-bit RISC-V “S” Supervisor Extension adds 11 new Control and Status Registers (CSRs), 1 instruction, a 32-bit virtualized address space (Sv32), and two-level address translation. Some of the CSRs are subsets of machine mode CSRs.

Architecture

Below is a high level overview of our proposed pipeline architecture.

Supervisor pipeline diagram

It is based on the 3-stage pipeline from the multi-core branch, but with a split (only visually, not actually) decode and execute stage. This was done out of consideration for the vector extension, which is close to being integrated in to the pipeline. The functionality of our design should not wildly differ between a 3- or 4-stage pipeline.

Two-level address translation will operate as shown in the below image.

Two-level address translation diagram

The L1 cache and TLB will be accessed in parallel. This means we opted for a virtually indexed, physically tagged (VIPT) cache design. The cache size must not be larger than it’s associativity multiplied by the page size (4KB) for a VIPT design to work without aliasing.

TLB and L1 Cache diagram

The TLB will be store the physical page number for a virtual page along with its permissions. TLB entries are matched against the stored address space identifier (ASID), a unique value for each program, along with the tag for the cache access.

TLB diagram

A hardware page walker is used to walk pages without the extra bits software page walks would add. Below is the general flow that our page walker uses to determine if two-level address translation is completed.

Page walker diagram

Status

As of December 2024, the TLB, PW, VIPT L1$, and CSRs + interrupt/exception handling have all been implemented into the processor. We feel the core at its current state needs to undergo extensive verification before we can consider proceeding with adding the hypervisor extension, which is why we are deciding not to move forward with implementing the hypervisor for Spring 2025.

Future Work

Instead, we will verify that the supervisor extension works in it entirety as a single-core using SystemVerilog test benches and S-Mode unit tests. Then, get it working as a multi-core. Once both single-core and multi-core function is perfected, we will get an operating system running on the processor and flash the design to an FPGA to have a “physical” deliverable. We also intend to perform a synthesis run of the design using the SkyWater 130nm PDK to determine the max operating frequency, power consumption, and area for the design. After this, the Spring semester will be completed, and we should have a solid supervisor implemented for this RISC-V core. The next steps would be implementing the Hypervisor extension for those brave enough.

Contibutors

William Cunningham (wrcunnin@purdue.edu)
William Milne (milnew@purdue.edu)
Kathy Niu (niu59@purdue.edu)

RISC-V Supervisor Extension Hardware and Software Tests

Hardware Tests

1. TLB (Direct Mapped)

Address translation off (M-mode and S-mode Bare Addressing)
Compulsory TLB Miss
TLB Hit
TLB Eviction
Mismatch ASID miss
Invalidate TLB entries, by VA & ASID
Invalidate TLB entries, by VA
Invalidate TLB entries, by ASID
Invalidate TLB entries, all entries
Address translation off (again)

2. TLB (2-Way Set Associative, only used in this test)

Address translation off (M-mode and S-mode Bare Addressing)
Compulsory TLB Miss (empty set)
Half-full set, TLB Hit
Half-full set, TLB miss
Full set, TLB hit
Full set, TLB eviction (conflict miss)
Mismatch ASID miss
Invalidate TLB entries, by VA & ASID
Invalidate TLB entries, by VA
Invalidate TLB entries, by ASID
Invalidate TLB entries, all entries
Address translation off (again)

3. Page Permission Checking

All:

Fault -> Invalid page
Fault -> Not readable, but writeable
Fault -> Reserved bits are set
Fault -> Level == 0 and no RWX bits are set
Fault -> U = 1, in S-Mode, and mstatus.sum = 0
Fault -> U = 0, in U-Mode
No Fault -> Level != 0, sv32 = 0, should NEVER EVER happen in RV32
Fault -> Level != 0, sv32 = 1, pte_sv32.ppn[9:0] != 0
No Fault -> U = 1, in S-Mode, and mstatus.sum = 1
No Fault -> U = 1, in U-Mode
No Fault -> Level != 0, sv32 = 1, pte_sv32.ppn[9:0] == 0 Loads:
Fault -> R = 0, W = 0, X = 0, mstatus.mxr = 0
Fault -> R = 0, W = 0, X = 0, mstatus.mxr = 1
Fault -> R = 0, W = 0, X = 1, mstatus.mxr = 0
No Fault -> R = 0, W = 0, X = 1, mstatus.mxr = 1
Fault -> R = 0, W = 1, X = 0, mstatus.mxr = 0
Fault -> R = 0, W = 1, X = 0, mstatus.mxr = 1
Fault -> R = 0, W = 1, X = 1, mstatus.mxr = 0
Fault -> R = 0, W = 1, X = 1, mstatus.mxr = 1
No Fault -> R = 1, W = 0, X = 0
No Fault -> R = 1, W = 0, X = 1
No Fault -> R = 1, W = 1, X = 0
No Fault -> R = 1, W = 1, X = 1 Stores:
Fault -> R = 0, W = 0, X = 0
Fault -> R = 0, W = 0, X = 1
Fault -> R = 0, W = 1, X = 0
Fault -> R = 0, W = 1, X = 1
Fault -> R = 1, W = 0, X = 0
Fault -> R = 1, W = 0, X = 1
No Fault -> R = 1, W = 1, X = 0
No Fault -> R = 1, W = 1, X = 1 Instructions:
Fault -> R = 0, W = 0, X = 0
No Fault -> R = 0, W = 0, X = 1
Fault -> R = 0, W = 1, X = 0
Fault -> R = 0, W = 1, X = 1
Fault -> R = 1, W = 0, X = 0
No Fault -> R = 1, W = 0, X = 1
Fault -> R = 1, W = 1, X = 0
No Fault -> R = 1, W = 1, X = 1

4. Page Walker

Walk to get page
Walk to get megapage
Walk to get gigapage (RV64 only)
Walk to get terapage (RV64 only)
Walk to get petapage (RV64 only)
Page walk fault, mid-level page not allocated
Page walk fault, final-lebel page not allocated
Page walk fault, improper read permissions
Page walk fault, improper write permissions
Page walk fault, improper execute permissions
Page walk fault, improper dirty permissions
Page walk fault, improper access permissions
Page walk fault, improper valid permissions

5. TLB + Page Walker

TLB miss, walk to get page
TLB miss, walk to get megapage
TLB miss, walk to get gigapage (RV64 only)
TLB miss, walk to get terapage (RV64 only)
TLB miss, walk to get petapage (RV64 only)
TLB hit on all page types, no walk
TLB miss, page walk fault (page not allocated)
TLB miss, page walk fault (bad permissions)

6. TLB + VIPT L1$ + Page Walker

Cache miss, TLB miss, walk to get page, handle cache miss
Cache miss, TLB hit, handle cache miss
Cache hit, TLB hit, no walk
Invalidate TLB entry, Cache hit, TLB miss, walk to get page, cache hit
Invalidate TLB entry, change page mapping, Cache hit, TLB miss, walk to get page, cache miss

7. I$/ITLB + D$/DTLB + Page Walker

Compulsory misses, Data walk, instruction walk, data memory access, instruction memory access
I/D $ misses, no page walk, data memory access, instruction memory access
Full hits, no page walk, no memory access

Multicore

A multicore design is a type of multiprocessing in which multiple cores are used to run programs in parallel. This is sometimes referred to as chip multiprocessing (CMP). Multicore designs generally offer greater performance at the cost of power and area.

Heterogenous Multicore

A heterogenous multicore system uses disparate core designs and features to provide multiprocessing. This allows for greater task level parallelism without the power and area overhead that a homogenous design requires. Currently, a dual-core big/little design is planned.

Planned heterogenous core features

3 stage pipeline RV32IMACV big core
2 stage pipeline RV32EAC little core

Current architecture plan for the Dual-core processor with MESI Coherence

Plan

Cache coherence

Incoherency can arise when programs acting on shared memory execute on a system using private caches which do not synchronize with each other. An example of this is shown below. Total program order in the example below is used to sequentialize operations occuring in parallel. Both processors are running the same program which loads a global variable, adds 100 to it, and then stores it back to memory. Initially, processor 2 has the value at 0(sp) in its cache. Processor 1 loads in 0(sp) into its cache and the a0 register. It then adds 100 to a0 which brings its value to 200. Finally, it stores the value back to the same address of 0(sp). However, this causes an incoherence in the caches of processors 1 and 2 because the value in processor 2’s cache is still 100. Processor 2 then loads the value of 100 from its cache into its a0 register. It then adds 100 to it, and again stores it back to memory.

Cache Incoherence

An example of coherent system running the same program is shown below. The runtime state is mostly the same until total program order 3 in which processor 1 stores 200 to 0(sp). When this happens, this also updates the value in processor 2’s cache to 200. Then, when it loads the value from memory in total program order 4, it loads the updated value of 200 to its a0 register. Then, when it adds 100 to it and stores it, it stores the correct value of 300.

Cache Coherence

Cache Modifications

To enable coherency and atomics, a reservation set register, a duplicate SRAM tag array, and a coherency interface were added to each L1 cache. Each cache is connected to an atomicity unit that handles bus transactions. This is intended to reduce the complexity of the cache and allows other cache designs to be used with few coherency-related modifications.

The coherency interface grants read-only access to the duplicate SRAM array to the bus. This allows the bus to compare tags when snooping without interrupting the cache. The bus only gains access to the cache’s SRAM data array if a snoop hit occurs, causing the cache FSM to enter a snooping state. The signals provided by the coherency interface, bus_ctrl_if, are available in the source_code/include/bus_ctrl_if.

Currently, the duplicate SRAM array contains both the data and the tags. This duplicate data is not used, so the extra SRAM cells used to store them will need to be removed. Additionally, the seperate_caches.sv file in the source_code/caches/ directory can be modified to use a cache without coherency for the Icache. At present, both the Dcache and Icache use the same L1 module.

Atomics

Atomic instructions can be used to provide memory ordering and ensure that a read-modify-write operation occurs as a single instruction. Memory ordering restrictions can be used to ensure whether or not certain instructions can be observed to have taken place by the time the atomic instruction executes. The table below briefly covers the semantics behind each ordering. Furthermore, a store-acquire or load-release operation would be semantically meaningless due to how data dependencies could be observed. It does not matter if a store is done with acquire semantics because stores in atomic contexts are used for ordering with respect to a previous load-acquire. Similarly, a load-release is nonsensical because it is not useful to have execution continue based off of future loads/stores.

A simple mental model which can be used to understand these orderings is how they would be used in a mutex. For example, when acquiring a mutex, we want all previous state to have completed so that the critical path relies on the most up to date information. When releasing the mutex, we want anyone who depends on the data behind the mutex to have the most up to date data from the execution of the critical path.

Memory Ordering	Description
Relaxed	No ordering with respect to any other loads occurs.
Acquire (acq)	Ensures that no following memory operations can be observed to have happened before this operation.
Release (rel)	Ensures that all previous memory operations must be observed to have happened before this operation.
Acquire-Release (acqrel)	Ensures that all previous memory operations must be observed, and no following memory operations can be observed to have happened before this operation (sequentially consistent).

It should be noted that these orderings are mostly relevant for out-of-order cores or for cores with out of order load-store queues. The current core is an in-order pipeline with no load-store queue therefore there is no support needed for these orderings.

Hardware atomics

Hardware atomics are planned to be implemented through a LRX, SC_AND_OP microop execution sequence. The LRX microinstruction will be used to perform a load-reserved with an exclusive bit set which would block any other transactions on that cache line. The SC_AND_OP microinstruction would use the results from the free ALU to be used as the store value the memory stage. As with all SC’s, this clears the reservation set on that cache line.

Another interesting architecture is to attach an atomic unit to the bus, however, this would likely be unnecessary for our small scale dual core system. Information on this architecture can be found here.

Software emulated atomics

Because LR/SC are the basic building blocks of all atomic instructions, a tight loop consisting of an LR, operation, and SC can be used to emulate any AMO. An LR can be followed by a sequence of instructions to emulate the effects of an AMO, which is then followed by an SC and a conditional jump back to the LR can be used to retry the execution of an AMO until the SC succeeds.

The core currently has support for the entire atomic instruction extension through the use of the C routines in verification/asm-env/selfasm/c/amo_emu.c. This file sets up an exception handler for emulating atomic extension instructions in the case of illegal instruction exceptions. The trap handler pushes all the registers to the stack, decodes the instruction, executes it and updates the saved register on the stack, then pops all the registers from the stack and returns. Support macros for assembly usage are in verification/asm-env/selfasm/amo_emu.h. The WANT_AMO_EMU_SUPPORT macro can be used to set up some support for the C runtime including setting up the stack and exception handler. This allows for trapping any encountered atomic instructions and emulating them using LR/SC loops. This can be used as a stopgap until the hardware implementation of atomics is completed.

Extensions

RISC-V defines many ISA extensions. Some add extra instructions, such as the “M” extension that adds multiply and divide instructions. Some add extra state, such as the “F” extension that adds a new floating point register file. Some assert properties about the microarchitecture, such as the “Ztso” extension that indicates the implementation follows the total store order memory consistency model.

To allow configurability, extensions must be implemented in a way that makes them easy to enable or disable as desired.

Most extensions implemented in RISCVBusiness will add an extra functional unit, and require decoding new instruction types. Extension implementation can be broken down into the following steps:

Implement the new functional unit
Implement a decoder that understands only the new instruction types.
Create a wrapper file that conditionally either instantiates the new functional unit, or assign all outputs to ‘0’ (for disabled)
Add your new decoder to source_code/standard_core/control_unit.sv. Your new decoder must assert a ‘claim’ signal to indicate that an instruction belongs to that extension. Your decoder must generate all the needed control signals for your functional unit, and generate a ‘start’ signal to tell it to start working on the computation.
Instantiate your functional unit wrapper in the execute stage.

Walkthrough: Adding a Custom Extension

This walkthrough demonstrates how to add a custom extension by implementing a simple DELAY instruction that stalls the CPU for 100 cycles. This example uses the CUSTOM-0 opcode space (0x0B) reserved by RISC-V for custom extensions. (Note: This is a custom instruction for documentation purposes, and is not actually implemented.)

File Structure

For an extension named “delay”, you would create:

source_code/
├── packages/
│   └── rv32delay_pkg.sv       // Type definitions
└── rv32delay/
    ├── rv32delay_decode.sv    // Decoder
    ├── rv32delay_enabled.sv   // Functional unit
    ├── rv32delay_disabled.sv  // Disabled stub
    └── rv32delay_wrapper.sv   // Wrapper

Step 1: Create the Package

File: source_code/packages/rv32delay_pkg.sv

package rv32delay_pkg;

    // CUSTOM-0 opcode (0x0B)
    localparam logic [6:0] RV32DELAY_OPCODE = 7'b0001011;

    // R-type instruction format
    typedef struct packed {
        logic [6:0] funct7;
        logic [4:0] rs2;
        logic [4:0] rs1;
        logic [2:0] funct3;
        logic [4:0] rd;
        logic [6:0] opcode;
    } rv32delay_insn_t;

    // Operation type
    typedef enum logic [2:0] {
        DELAY = 3'b000
    } rv32delay_op_t;

    // Decoder output
    typedef struct packed {
        logic select;
        rv32delay_op_t op;
    } rv32delay_decode_t;

endpackage

Step 2: Implement the Decoder

File: source_code/rv32delay/rv32delay_decode.sv

module rv32delay_decode (
    input [31:0] insn,
    output logic claim,
    output rv32delay_pkg::rv32delay_decode_t rv32delay_control
);
    import rv32delay_pkg::*;

    rv32delay_insn_t insn_split;

    assign insn_split = rv32delay_insn_t'(insn);

    // Claim any instruction with CUSTOM-0 opcode and funct3=000
    assign claim = (insn_split.opcode == RV32DELAY_OPCODE)
                && (insn_split.funct3 == 3'b000);

    assign rv32delay_control.select = claim;
    assign rv32delay_control.op = DELAY;

endmodule

Key point: The decoder “claims” instructions to prevent them from being treated as illegal. This claim signal is ORed with other extension claim signals in the control unit (see control_unit.sv).

Step 3: Implement the Functional Unit

File: source_code/rv32delay/rv32delay_enabled.sv

module rv32delay_enabled (
    input CLK,
    input nRST,
    input rv32delay_start,
    input rv32delay_pkg::rv32delay_op_t operation,
    input [31:0] rv32delay_a,
    input [31:0] rv32delay_b,
    output logic rv32delay_done,
    output logic [31:0] rv32delay_out
);
    import rv32delay_pkg::*;

    localparam DELAY_CYCLES = 100;

    logic [7:0] counter;
    logic counting;

    // Start counting when start is pulsed
    always_ff @(posedge CLK, negedge nRST) begin
        if (!nRST) begin
            counter <= '0;
            counting <= 1'b0;
        end else begin
            if (rv32delay_start && !counting) begin
                // Start new delay
                counter <= DELAY_CYCLES;
                counting <= 1'b1;
            end else if (counting && counter > 0) begin
                // Count down
                counter <= counter - 1;
            end else if (counting && counter == 0) begin
                // Done counting
                counting <= 1'b0;
            end
        end
    end

    // Assert done when not counting (or on last cycle)
    assign rv32delay_done = !counting || (counter == 1);
    assign rv32delay_out = 32'b0;  // No result value

endmodule

Key points:

Multi-cycle operations deassert done while busy
The execute stage will stall until done is asserted (via hazard_if.ex_busy in stage3_execute_stage.sv)
Single-cycle operations can keep done high always

Step 4: Create the Disabled Stub

File: source_code/rv32delay/rv32delay_disabled.sv

module rv32delay_disabled (
    input CLK,
    input nRST,
    input rv32delay_start,
    input rv32delay_pkg::rv32delay_op_t operation,
    input [31:0] rv32delay_a,
    input [31:0] rv32delay_b,
    output rv32delay_done,
    output logic [31:0] rv32delay_out
);
    // Always ready, return zero
    assign rv32delay_done = 1'b1;
    assign rv32delay_out = 32'b0;
endmodule

Key point: The disabled module must have identical ports but just tie outputs to safe values. This is never actually used because the decoder is also disabled, but prevents compilation errors.

Step 5: Create the Wrapper

File: source_code/rv32delay/rv32delay_wrapper.sv

`include "component_selection_defines.vh"

module rv32delay_wrapper (
    input CLK,
    input nRST,
    input rv32delay_start,
    input rv32delay_pkg::rv32delay_op_t operation,
    input [31:0] rv32delay_a,
    input [31:0] rv32delay_b,
    output rv32delay_done,
    output logic [31:0] rv32delay_out
);
    import rv32delay_pkg::*;

    `ifdef RV32DELAY_SUPPORTED
    rv32delay_enabled RV32DELAY(.*);
    `else
    rv32delay_disabled RV32DELAY(.*);
    `endif

endmodule

Key point: The wrapper uses preprocessor conditionals to select between enabled/disabled modules based on build configuration. These are found in component_selection_defines.vh, which is in turn generated by the configuration YAML file with the config_core.py script.

Step 6: Integrate into Control Unit

File: source_code/standard_core/control_unit.sv

Add these changes:

Import the package:

import rv32delay_pkg::*;

Declare claim signal:

logic rv32delay_claim;

Instantiate decoder:

`ifdef RV32DELAY_SUPPORTED
rv32delay_decode RV32DELAY_DECODE(
    .insn(cu_if.instr),
    .claim(rv32delay_claim),
    .rv32delay_control(cu_if.rv32delay_control)
);
`else
assign cu_if.rv32delay_control = {1'b0, rv32delay_op_t'(0)};
assign rv32delay_claim = 1'b0;
`endif // RV32DELAY_SUPPORTED

Update claimed signal:

assign claimed = rv32m_claim || rv32a_claim || rv32b_claim || rv32zc_claim || rv32delay_claim;

Key point: Adding your claim signal to the claimed logic prevents instructions from being marked as illegal when your extension is enabled. The main decoder generates a maybe_illegal signal that is surpressed by any claim being asserted; maybe_illegal && !claimed implies the insturction is illegal.

Step 7: Integrate into Execute Stage

File: source_code/pipelines/stage3/source/stage3_execute_stage.sv

Declare signals:

logic rv32delay_done;
word_t rv32delay_out;

Instantiate wrapper:

rv32delay_wrapper RV32DELAY_FU (
    .CLK(CLK),
    .nRST(nRST),
    .rv32delay_start(cu_if.rv32delay_control.select && !hazard_if.mem_use_stall),
    .operation(cu_if.rv32delay_control.op),
    .rv32delay_a(rs1_post_fwd),
    .rv32delay_b(rs2_post_fwd),
    .rv32delay_done(rv32delay_done),
    .rv32delay_out(rv32delay_out)
);

Add to output mux:

always_comb begin
    if(cu_if.rv32m_control.select) begin
        ex_out = rv32m_out;
    end else if(cu_if.rv32b_control.select) begin
        ex_out = rv32b_out;
    end else if(cu_if.rv32delay_control.select) begin
        ex_out = rv32delay_out;
    end else begin
        ex_out = alu_if.port_out;
    end
end

Add to hazard detection:

assign hazard_if.ex_busy = (!rv32m_done && cu_if.rv32m_control.select)
                        || (!rv32delay_done && cu_if.rv32delay_control.select);

Key points:

The start signal is gated with !hazard_if.mem_use_stall to prevent spurious starts during pipeline stalls
Use rs1_post_fwd/rs2_post_fwd instead of raw register file outputs to respect data forwarding (see stage3_execute_stage.sv)
Multi-cycle functional units must contribute to ex_busy or the pipeline won’t stall

Step 8: Add Interface Signals

File: source_code/include/control_unit_if.vh

Add your control signal bundle:

rv32delay_pkg::rv32delay_decode_t rv32delay_control;

Step 9: Add a `.core` file

**File: source_code/rv32delay/rv32delay.core

CAPI=2:
name: socet:riscv:rv32m:0.1.0
description: RV32M Extension for RISCVBusiness

filesets:
    rtl:
        depend:
            - "socet:riscv:riscv_include"
            - "socet:riscv:packages"
        files:
            - rv32delay_decode.sv
            - rv32delay_enabled.sv
            - rv32delay_disabled.sv
            - rv32delay_wrapper.sv
        file_type: systemVerilogSource

targets:
    default: &default
        filesets:
            - rtl
        toplevel: rv32delay_wrapper

At a minimum, your core file needs to have a default target listing all the files needed. You may have additional filesets and simulation targets for individual testing as well.

Step 10: Update core configuration

File: scripts/config_core.py Finally, you need to add your extension to config_core.py so that it can be parsed from the YAML file and generate the correct preprocessor definitions to include in component_selection_defines.vh. If you are using a standard extension, it should suffice to add your extension to the list in the RISCV_ISA variable. Custom extensions will need to add a new key to the YAML file, check for it in config_core.py, and generate the appropriate preprocessor defines.

Summary Checklist

Create package with opcodes and type definitions
Implement decoder that claims your instructions
Implement enabled module with functional unit logic
Create disabled stub with same interface
Create wrapper with conditional instantiation
Integrate decoder into control unit
Add claim signal to claimed logic
Integrate functional unit into execute stage (signals, instantiation, output mux)
Add busy signal to hazard detection for multi-cycle ops
Add control signals to control_unit_if.vh
Update config_core.py to generate the RV32X_SUPPORTED define

Reference: Existing Extensions

The codebase includes several extension implementations you can reference:

RV32M (Multiply/Divide): source_code/rv32m/ - Multi-cycle operations with variable latency
RV32B (Bit Manipulation): source_code/rv32b/ - Single-cycle combinational logic
RV32Zicond: source_code/rv32zc/ - Simple conditional operations

History

Earlier versions of RISCVBusiness (including the original fork) used a different system to manage extensions, RISC-MGMT. RISC-MGMT treated extensions as plugins that hooked into the pipeline in the decode, execute, and memory stages and could control some chosen signals. The philosophy for RISC-MGMT was that extensions acted as their own execution pipelines, and the main pipeline just passed a “token” to track where the instruction was (as the execution was still single-issue, in-order).

RISC-MGMT had some distinct advantages:

Extensions all saw the same interface to the processor, and could be written in a standardized format
Custom instructions were simple to implement, as they worked identically to the standard extensions

However, it also had the disadvantage of being inflexible. The model would have worked well for extensions like “M” or “B” where isolated functional units were added, but extensions like “F” that add new architectural state, “C” that changed the way instruction fetch worked and re-used “I” decoding, and “A” that is tightly coupled to the cache implementation require tighter integration with the rest of the core.

The current configuration implementation is essentially “inlining” the RISC-MGMT idea into the pipeline so that we can have more flexibility at the cost of more difficult integration.

RV32A

See Multicore for information on the RV32A implementation, as it is tightly coupled with the cache coherence implementation.

Currenly, we support only the ‘Zalrsc’ subset of the A extension.

RV32C

The RV32C extension allows some common instructions to be expressed as a 16b instruction. These 16b instructions may be freely mixed with 32b instructions.

Implementation Status

Currently, the implementation covers the “C” extension, including the “Zfa” and “Zfd” subsets (decode-only). “Zcb” will be implemented in the near future.

The implementation is split into 2 components: the fetch buffer and the decompressor.

Fetch Buffer

The fetch buffer sits in front of the I$. It has 2 functions:

Determine the next fetch address. The I$ supports only aligned 32b accesses, so the fetch buffer translates misaligned PC to the aligned address of the relevant data.
Buffer unused data. If an I$ fetch yields a compressed instruction, the remaining 16b are buffered as they belong to either another compressed instruction, or are part of a full-size instruction.

The fetch buffer forms an FSM (Mealy) from the PC alignment (pc[1]) and the fetch buffer valid bit (fb.valid). The state transition diagram is as follows: Fetch Buffer state diagram

In this diagram, “read compressed” indicates the next instruction is a compressed instruction read from the I$. “FB compressed” means the next instruction is a compressed instruction that was held in the fetch buffer.

Note that the 4th combination, a valid fetch buffer and aligned PC, is not possible.

Decompressor

As all implemented instructions are 1:1 with full-sized instructions, the decompressor simply sits in front of the normal decoder. The decompressor takes in 16b of an instruction and outputs the corresponding 32b instruction.

Decompressor

Keyboard shortcuts

RISCVBusiness Documentation