SoCET Atalla AIHW

Documentation is a work in progress!

Please help with regular documentation where you can. When you create diagrams or set something in stone for the team, please document it here. This GitHub Page will serve as our source of truth for the Tensor Core.

Guidelines / How-To

Setup Guidelines

This document lists all the steps required to get set up to run Atalla. Specific setup scripts (for example, for the PyTorch or PPCI infrastructure) are defined in the sub-team homepages.

Basics

SSH into asicfab

The AI-HW team's preferred IDE is VSCode. Please follow the instructions in DigitalOcean’s tutorial on connecting to a remote server.

You will need to download Cisco AnyConnect VPN if you plan to SSH from an off-campus location.

Setup asicfab

Run source /package/asicfab/AccountSetup/init.bash
Add the following into ~/.bashrc

[[ $- == *i* ]] || return
HOSTNAME=$(hostname)

if [ ${HOSTNAME} == "asicfab.ecn.purdue.edu" ]; then
        source /package/asicfab/AccountSetup/init.bash

        alias ls="ls --color"
        alias ll="ls -la"


        export COPYBUFFER=/package/asicfab/CopyBuffer
        export MODULEPATH=/package/asicfab/AccountSetup/modulefiles:$MODULEPATH
        export PATH=$HOME/.local/bin:$PATH # for python packages
        unset PYTHONPATH
        # For fusesoc + Questa usage
        export MODEL_TECH="$(dirname $(which vsim))"

        ###### CUSTOM CHANGES BELOW THIS LINE #######
        module load gcc/11.2.0 python3/3.11
        module load riscv-gcc verilator/5.036 gtkwave
        module load cadence/xcelium/23.03 siemens/questa/2021.4 intel/quartus-std
        module load lcov
elif [ ${HOSTNAME} == "asicfabu.ecn.purdue.edu" ]; then
        module load verilator gtkwave surfer lcov
else
        echo "Unknown host ${HOSTNAME}; not loading modules"
fi

module load cadence/genus       # used for synthesis
module load cadence/innovus     # used for physical implementation 
module load cadence/virtuoso    # used for manually inspecting and manipulating the design
module load cadence/ssv         # good general module
module load cadence/ddi         # good general module

export LD_PRELOAD=/lib64/libz.so.1

Setup Github SSH

Follow the steps in the GitHub SSH-ing document. You should now be able to clone the Atalla repository locally using git clone git@github.com:Purdue-SoCET/atalla.git.

To test out a basic file, run

git checkout scratchpad_main
make run FILE=./scripts/common/xbar/clos/test.tcl

Synthesis

Overview

If you cloned the repository properly and ran the setup scripts, you should see “Flowkit” as a submodule. This repository, put together by the Design Flow team, contains scripts and flows that help us use Genus and Innovus. It originated as Cadence software and has been adapted for our use.

Goals

Use the following steps to synthesize each of your modules and obtain area and clock information. Afterwards, see the Reports section for how to format and store the generated reports.

Steps

The following steps outline the process. Search for the @AIHW tag in each of the files below to find where to add or edit content.

Ensure you are on the atallax01 branch.

git checkout atallax01
git submodule update --init --recursive

CREATE - /designs/cache/filelist.tcl

set listofdirs {}
lappend listofdirs "/home/asicfab/a/araviki/tensor-core/src/include"
set_db init_hdl_search_path $listofdirs

read_hdl -sv -define {NOIP SYNTHESIS} /home/asicfab/a/araviki/tensor-core/src/modules/cache_bank.sv
read_hdl -sv -define {NOIP SYNTHESIS} /home/asicfab/a/araviki/tensor-core/src/modules/cache_mshr_buffer.sv
read_hdl -sv -define {NOIP SYNTHESIS} /home/asicfab/a/araviki/tensor-core/src/modules/lockup_free_cache.sv

EDIT - /scripts/config/design_config.tcl: search for “# Read Verilog and elaborate.”, and add this below

create_flow_step -name read_cache_hdl -owner design {
  source ./designs/cache/filelist.tcl
}

create_flow_step -name elaborate_cache -owner design {
  # replace lockup_free_cache with the name of your actual module
  elaborate lockup_free_cache
}

EDIT - scripts/flow.yaml search for “flow: synthesis:”, near line 129

  synthesis:
    args: -tool genus -owner cadence -skip_metric -tool_options -disable_user_startup
    features:
    steps:
      - syn_generic:
          args: -owner cadence
          features:
          steps:
            - block_start:
            - init_elaborate:
            - init_design:
                args: -owner cadence
                features:
                steps:
                  - read_mmmc:
                  - read_physical:
                  # ########################
                  - read_cache_hdl: # replace this
                  - elaborate_cache: # replace this
                  # ########################
                  - read_power_intent:
                  - run_init_design:
                  - read_def:
                      enabled: "synth_ispatial || synth_hybrid"

CREATE - /scripts/constraints/cache_bank.sdc. Always have interfaces for your modules. If you say it doesn’t need an interface, then that unit isn’t worth synthesizing alone.

set sdc_version 2.0

set_units -capacitance 1.0fF
set_units -time 1.0ps

# Set the current design
current_design cache_bank

# -period is in ps: 1000 -> 1 GHz, 5000 -> 200 MHz
# -waveform {rise fall}: use {0 period/2} for a 50% duty cycle
create_clock -name "clock1" -period 1000.0 -waveform {0.0 500.0} [get_ports <interface_instance_name>_clk]

set_clock_transition -rise 1 [get_clocks "clock1"]
set_clock_transition -fall 1 [get_clocks "clock1"]
set_clock_uncertainty 0.1 [get_clocks "clock1"]

set_clock_gating_check -setup 0.0

set_driving_cell -lib_cell inv_8x -pin X [ all_inputs ] -min -max
# Model a 1 ps delay on every non-clock input (Tcl does not allow a trailing # comment here)
set_input_delay -add_delay 1.0 -clock [get_clocks clock1] [all_inputs -no_clocks]
set_output_delay -add_delay 1.0 -clock [get_clocks clock1] [all_outputs]

EDIT - scripts/setup.yaml search for “constraint_modes” near line 101

constraint_modes:
  func:
    sdc_files:
      - scripts/constraints/cache_bank.sdc

Run the following commands in the Flowkit/ folder.

# Takes a long time
flowtool -reset -to synthesis 

# Starts flowtool interactively. It takes a while for the required files to load;
# after that you can run specific flow steps.
flowtool -flow run_syn_opt -interactive_run -isolate step

# Run this at the interactive tool prompt (not in the shell) to dump the ten worst paths
report_timing -max_paths 10 -path_type full > critical_paths.txt

Reports

Check within Flowkit/reports. reports/syn_opt/ holds the results of the optimized synthesis run, and reports/syn_opt/qor.rpt has the most important content. Create a folder for each submodule within tensor-core/reports and store the relevant information there. We will not gitignore it.

To get the clock speed, take the -period value you set in the /scripts/constraints/*.sdc file and subtract the slack from it; the result is the shortest period the design can meet. If the clock period is 1000 ps and the total slack is -555 ps, the achievable period is 1000 ps - (-555 ps) = 1555 ps, so the clock speed is 1/(1555 ps) ≈ 643.1 MHz. If the clock period is 3000 ps, the total slack is 0, and the critical path slack is 1580.8 ps, then your frequency is 1/(3000 ps - 1580.8 ps) = 1/(1419.2 ps) ≈ 704.6 MHz.
To get the area, look under the Area section. Values are in um^2; 1 mm^2 = 1e6 um^2 (equivalently, 1 um^2 = 1e-6 mm^2).
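
The arithmetic above can be scripted when many reports need to be processed. The following is a minimal sketch in C++; the function names are illustrative assumptions and are not part of Flowkit or the Atalla repository. It assumes the period and slack are both given in picoseconds, as in the SDC file.

```cpp
#include <cassert>

// Shortest period the design can meet, in ps: the constrained period minus
// the worst slack. Negative slack therefore lengthens the period.
double min_period_ps(double period_ps, double slack_ps) {
    const double achieved = period_ps - slack_ps;
    assert(achieved > 0.0 && "slack cannot be as large as the clock period");
    return achieved;
}

// Maximum clock frequency in MHz. A period of P ps corresponds to 1e6 / P MHz.
double fmax_mhz(double period_ps, double slack_ps) {
    return 1.0e6 / min_period_ps(period_ps, slack_ps);
}

// Area conversion: 1 mm^2 = 1e6 um^2, so divide um^2 by 1e6.
double um2_to_mm2(double area_um2) {
    return area_um2 * 1.0e-6;
}
```

For the two cases described above, fmax_mhz(1000.0, -555.0) evaluates to roughly 643.1 and fmax_mhz(3000.0, 1580.8) to roughly 704.6.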

PCACTI

Ramulator2

Verification

Overview

Verification happens in three steps:

Smoke Tests for all modules in SystemVerilog w/ QuestaSim.
Unit Tests for individual modules in SystemVerilog w/ QuestaSim. Ensure you apply assertions and performance counters! The goal
Top Level C++ tests w/ Verilator. This must be a more complete testbench, simulating real-workload situations.

All your code must go in ./tb/.

./tb/formal must contain all the mathematical assertions w/ covergroups.
./tb/unit must contain all the unit tests in the same hierarchy. Use make sv_test folder= tb_file= GUI= to run QuestaSim. Check the Makefile for options.
./tb/uvm is a maybe for now.

SoCET AI Hardware SystemVerilog Coding Guide

Written by: Malcolm McClymont

Last updated: 1/15/2026

Introduction

This guide is designed for engineers developing and testing hardware within the AI Hardware team. It provides a set of guidelines meant to facilitate large scale collaboration by making RTL code easy to read, modify, and test. These rules are split into three categories based on severity:

SHALL/SHALL NOT rules; these must be followed at (almost) all times. Any SHALL RULE VIOLATIONS must include comments that thoroughly explain how the rule is being violated and why. Anyone who deviates from these rules should not be surprised if asked to rewrite their code.
SHOULD/SHOULD NOT rules; these practices are strongly encouraged, but not strictly mandatory. Violating these rules does not require an explicit comment but may warrant rewriting code.
MAY/MAY NOT rules; these are practices that could improve code legibility or testability but are situational and should be applied at the engineer’s discretion.

Perceptive readers will notice that there are no absolute rules in this guide; engineers should always evaluate if a rule should be followed and provide documentation if they decide not to follow it.

Formatting

Engineers SHALL use spaces as tabs, with 4 spaces per tab.

VSCode can be configured to do this. For Vim, the following .vimrc commands enable it:
```
autocmd BufNewFile,BufRead *.sv,*.v set tabstop=4
autocmd BufNewFile,BufRead *.sv,*.v set shiftwidth=4
autocmd BufNewFile,BufRead *.sv,*.v set expandtab
```
Engineers SHALL minimize dead (commented out) code. If large sections of code are commented out (3 or more lines) then a pointer comment SHALL be used or the comment should be deleted.

Engineers SHALL consolidate code. Dead code and active code should be in separate groups, not intermixed.

An example of these two rules:

Bad! Notice how the dead and alive lines are mixed.

...

always_comb begin : input_buses 
    // cu.input_type = 1'b0; 
    cu.input_row = '0;
    cu.input_load = 1'b0;
    // cu.weight_row = '0;
    // cu.weight_load = 1'b0;
    cu.partials_row = '0;
    cu.partials_load = 1'b0;
    if (cu.input_en) begin
        cu.input_row = cu.row_in_en;
        cu.input_load = 1'b1;
    end else if (cu.weight_en) begin
        // cu.input_type = 1'b1;
        // cu.weight_row = cu.row_in_en;
        // cu.weight_load = 1'b1;
    end
    if (cu.partial_en) begin
        cu.partials_row = cu.row_ps_en;
        cu.partials_load = 1'b1;
    end
end

...

Better, dead lines are grouped together

...
always_comb begin : input_buses 
    cu.input_row = '0;
    cu.input_load = 1'b0;
    cu.partials_row = '0;
    cu.partials_load = 1'b0;

    // cu.weight_row = '0;
    // cu.weight_load = 1'b0;
    // cu.input_type = 1'b0;

    if (cu.input_en) begin
        cu.input_row = cu.row_in_en;
        cu.input_load = 1'b1;
    end else if (cu.weight_en) begin
        // cu.input_type = 1'b1;
        // cu.weight_row = cu.row_in_en;
        // cu.weight_load = 1'b1;
    end
    if (cu.partial_en) begin
        cu.partials_row = cu.row_ps_en;
        cu.partials_load = 1'b1;
    end
end
...

Best, pointer comments are used to move commented blocks away from the active code. Ideally, dead lines should be at the end of the file or deleted entirely.

...
always_comb begin : input_buses 
    cu.input_row = '0;
    cu.input_load = 1'b0;
    cu.partials_row = '0;
    cu.partials_load = 1'b0;

    //[1]

    if (cu.input_en) begin
        cu.input_row = cu.row_in_en;
        cu.input_load = 1'b1;
    end else if (cu.weight_en) begin
        //[2]
    end
    if (cu.partial_en) begin
        cu.partials_row = cu.row_ps_en;
        cu.partials_load = 1'b1;
    end
end
...

endmodule

//[1]
// cu.weight_row = '0;
// cu.weight_load = 1'b0;
// cu.input_type = 1'b0;

//[2]
// cu.input_type = 1'b1;
// cu.weight_row = cu.row_in_en;
// cu.weight_load = 1'b1;

Signal names SHALL NOT exceed 30 characters. Additionally, their names SHOULD be intuitive.
Module names SHALL be intuitive.
All always_comb and always_ff blocks SHALL have names.
Blocks that share many signals or interact closely SHOULD be adjacent in the code. In other words, achieve spatial locality.
- However, always_comb blocks SHOULD be grouped with other always_comb blocks. The same applies to always_ff blocks.

Unless a block’s name makes its function obvious, every block SHOULD come with a comment describing what it does.

For example:

//Detailed comment describing function of block
always_comb begin : <block name> 
...
end

...

//Detailed comment describing function of block
always_ff @(posedge clk, negedge n_rst) begin : <block name>
...
end

Any outstanding fixes/modifications to code SHOULD be documented using a TODO comment.
When applicable, RTL and code SHALL use “manager” and “subordinate” as opposed to “master” and “slave”.

Verilator Linter

Engineers SHALL NOT have any Verilator linter warnings in code within the main branch.

Engineers SHALL NOT use lint_off to disable warnings.

All of the Verilator warnings and their meanings can be found here: https://verilator.org/guide/latest/warnings.html

For example:

always_comb begin : <block name>
    ...

    //Bad! Why are they getting a truncate warning?
    //Engineer should explain this with a comment or fix it.

    /* verilator lint_off WIDTHTRUNC */
    curr_input_row = iteration[l];
    /* verilator lint_on WIDTHTRUNC */

    ...
end

Testbenches

Engineers SHALL only print messages for failing test cases, plus a single “test complete” message at the end. Messages for failing tests SHALL include a timestamp.

For example:

//This will set %t to print in nanoseconds. Replace -9 with -12 for ps.
$timeformat(-9, 2, " ns"); 

...

if(tb_out != golden_out) begin
    $display("Output mismatch for test %0d at %0t", i, $time);
    failed_cases += 1;
end

...
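
For the top-level C++ tests run with Verilator, the same reporting rules can be followed with a small helper. This is only a sketch: TestReport and its members are hypothetical names, and the time value is assumed to be tracked by the testbench in picoseconds.

```cpp
#include <cassert>
#include <cstdint>
#include <iostream>

// Collects check results, prints only failures, and prints one completion line.
struct TestReport {
    int total = 0;
    int failed = 0;

    // Record one comparison. Nothing is printed when the values match.
    void check(int test_id, uint64_t time_ps, uint32_t got, uint32_t expected) {
        ++total;
        if (got != expected) {
            ++failed;
            std::cout << "Output mismatch for test " << test_id
                      << " at " << time_ps << " ps: got " << got
                      << ", expected " << expected << "\n";
        }
    }

    // Print the "test complete" line and return a process exit code.
    int finish() const {
        assert(failed <= total);
        std::cout << "Test complete: " << failed << " of " << total
                  << " checks failed\n";
        return failed == 0 ? 0 : 1;
    }
};
```

A testbench would call check() after each comparison and return finish() from main(), so a passing run prints only the completion line.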

Testbenches SHALL use `timescale 1ps/1ps.
TODO: Expand this section

Interfaces

Engineers SHALL wrap each interface file in an `ifndef include guard so that it is not included more than once.
Modports SHALL follow the format x_y, where x and y represent the two modules the interface connects.

For example:

//Good

module backend #(parameter logic [SCPAD_ID_WIDTH-1:0] IDX = '0) (
    scpad_if.backend_sched bshif, //Connects backend to scheduler
    scpad_if.backend_body bbif,  //Connects backend to scratchpad body
    scpad_if.backend_dram bdrif //Connects backend to DRAM
);

...

endmodule

TODO: Expand this section

Synthesizable Logic

Plus and minus SHALL be the only arithmetic operators directly used in synthesizable logic.

For example:

//Good
assign c = a + b;
//Bad!
assign c = a * b;

//Good. Create an instance of an operational unit from a written Verilog module.
mult_module M0(.a(a), .b(b), .c(c));

When using a for loop in synthesizable code, its intended function SHALL be thoroughly commented. This also applies to generate for loops.
Any use of always_latch SHALL come with extensive documentation on its intended function.
Functions MAY be used in synthesized logic, but SHALL only contain combinational logic.
- SystemVerilog functions execute in zero simulation time and cannot contain timing controls such as @(posedge clk), so they cannot describe clocked logic on their own.

For example:

//Good, only contains combinational logic.
function [7:0] addition (input [7:0] in_a, input [7:0] in_b);
    addition = in_a + in_b;
endfunction

...

always_comb begin : add_operands
    c = addition(a, b);
end

Combinational Logic

Engineers SHALL NOT use nested ternary logic. Use an if/else or case block instead for this.

For example:

    //Bad! Nested ternary logic, hard to read.
    result = a ? b : c ? d : e;

    //Good. Use a different control structure.
    if(a) begin
        result = b;
    end else if (c) begin
        result = d;
    end else begin
        result = e;
    end

    //Also acceptable
    if(a) begin
        result = b;
    end else begin
        result = c ? d : e;
    end

Any modules that are fully combinational MAY have the clk and n_rst input signals removed.

Sequential Logic

Sequential elements SHALL only use posedge clk and negedge n_rst in their sensitivity lists.
Engineers SHALL only use packed arrays in synthesizable modules. Testbenches may use either packed or unpacked arrays.

For example:

logic [7:0] data; //Packed array
logic data [7:0]; //Unpacked array

FSMs SHALL always use enums as state names.
Engineers SHOULD always use explicit next_state logic.
Any sequential elements that do not strictly require an n_rst signal SHOULD NOT have one. However, you SHALL comment which elements do not have an n_rst signal.
- Determining which elements do not need an n_rst signal SHOULD be done after the design is fully verified.
- Signals used for write enable by any modules SHALL always have an n_rst signal.

Forbidden Structures

The following SystemVerilog structures SHALL NOT be used:

Programs
Fork/join
always @ blocks (use always_comb, always_ff, or always_latch instead)
Z-state logic/tri-state buffers
Classes and polymorphism (use typedef statements within packages instead)
Any datatypes indicating the strength of a signal (supply1, strong1, weak1, etc.)
Force statements
Case equality operators (===)

Direct Programming Interface‑C (DPI‑C)

Overview

The SystemVerilog Direct Programming Interface (DPI) provides a standard way to call C functions from SystemVerilog and to call SystemVerilog functions from C. The DPI forms a bridge between the two languages and allows re‑use of existing C/C++ models or libraries.

This guide walks through two complementary flows that combine DPI‑C and a parameterizable asynchronous FIFO. The first flow uses Verilator to convert a SystemVerilog FIFO into C++ and writes a C++ testbench. The second flow implements the FIFO in C++ and imports it into a SystemVerilog testbench via DPI‑C, targeting Siemens Questa SIM.

Sources

Flow 1 – Verilating the SystemVerilog FIFO and Creating a C++ Testbench

a. Convert SystemVerilog to C++ with Verilator

Copy over async_fifo.sv into your local directory.

Invoke Verilator:

verilator --cc async_fifo.sv --exe fifo_tb.cpp -CFLAGS "-std=c++17" --build

--cc tells Verilator to generate C++ output.
--exe fifo_tb.cpp tells Verilator to compile the provided C++ testbench and link it with the generated model.
--build invokes make to build the executable.

Run the compiled simulation:

./obj_dir/Vasync_fifo

Verilator generates a directory (obj_dir by default) containing the C++ files for the design. The included fifo_tb.cpp drives the async_fifo model by toggling independent write and read clocks, writing a sequence of values, and reading them back. The testbench checks for ordering errors and prints “PASS” when successful. See the file fifo_tb.cpp for details.

b. Writing a C++ Testbench

The C++ testbench uses the Verilated model interface. After including Vasync_fifo.h and initializing reset signals, it toggles wclk and rclk at different rates to emulate independent clock domains. It drives w_en and r_en, monitors the w_full/r_empty flags and verifies correct FIFO operation.

Refer to fifo_tb.cpp.

Verilator’s generated wrapper class provides the public signals as C++ members. Remember to call top->eval() after changing inputs; this schedules and evaluates the model for the current timestep.

Flow 2 – Importing a C++ FIFO into a SystemVerilog Testbench via DPI‑C

a. Implement the FIFO in C++ and Expose it via DPI

Implementing a FIFO in C++ gives flexibility and allows using existing algorithmic models as golden references. The DPI‑C interface maps SystemVerilog types to C types; for example, byte maps to char and int maps to int. More complex types (arrays, 4‑state logic) require DPI‑defined types in svdpi.h. Arguments can be passed by value or reference depending on the argument direction.
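
The basic type mapping can be checked at compile time on the machine that builds the DPI library. The sketch below is an assumption-labeled illustration rather than part of dpi_fifo.cpp: it asserts the sizes of the C types that correspond to byte, shortint, int, longint, and real, and adds a hypothetical helper (matches_simulator) that compares the host pointer width, which is what a chandle carries, against the simulator's word size.

```cpp
#include <cassert>
#include <climits>

// Size checks for the basic DPI-C type mapping on this host.
static_assert(CHAR_BIT == 8, "DPI-C assumes 8-bit bytes");
static_assert(sizeof(char) == 1, "SV byte     <-> C char");
static_assert(sizeof(short) == 2, "SV shortint <-> C short int");
static_assert(sizeof(int) == 4, "SV int      <-> C int");
static_assert(sizeof(long long) == 8, "SV longint  <-> C long long");
static_assert(sizeof(double) == 8, "SV real     <-> C double");

// Width of a C pointer on this host; a chandle is passed as void*.
int host_pointer_bits() {
    return static_cast<int>(sizeof(void*) * CHAR_BIT);
}

// Hypothetical helper: true when this build's word size matches the
// simulator's (pass 32 or 64).
bool matches_simulator(int simulator_bits) {
    assert(simulator_bits == 32 || simulator_bits == 64);
    return host_pointer_bits() == simulator_bits;
}
```

If any static_assert fails, the library would interpret SystemVerilog data with the wrong width, so the build stops before the mismatch can reach simulation.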

The provided file dpi_fifo.cpp implements a simple power‑of‑two FIFO in C++. It defines a FifoU32 structure with a depth, write and read pointers, and a std::vector for storage. The exported functions have extern “C” linkage so their names are not mangled. The key functions, declared in dpi_hdr.h, are:

// Allocate a FIFO; depth must be a power of two. Returns a null pointer on failure.
void* fifo_create(int depth);
// Free the FIFO
void fifo_destroy(void* h);
// Reset the read and write pointers
void fifo_reset(void* h);
// Return non‑zero if FIFO is full
int fifo_full(void* h);
// Return non‑zero if FIFO is empty
int fifo_empty(void* h);
// Push/pop a 32‑bit value; return 1 on success, 0 on full/empty
int fifo_push_u32(void* h, unsigned int v);
int fifo_pop_u32(void* h, unsigned int* out_v);

b. Import the C++ Functions into SystemVerilog

In the SystemVerilog testbench (sv_tb_dpi.sv) we declare the C functions using import “DPI-C”. The chandle type represents an opaque C pointer. The testbench allocates the FIFO, pushes a sequence of integers, pops them back and checks the order:

module tb_dpi_fifo;
    import "DPI-C" function chandle fifo_create(input int depth);
    import "DPI-C" function void    fifo_destroy(input chandle h);
    import "DPI-C" function void    fifo_reset(input chandle h);
    import "DPI-C" function int     fifo_push_u32(input chandle h, input int unsigned v);
    import "DPI-C" function int     fifo_pop_u32(input chandle h, output int unsigned v);
    import "DPI-C" function int     fifo_full(input chandle h);
    import "DPI-C" function int     fifo_empty(input chandle h);
    // … allocate and use FIFO …
endmodule

c. Compile and Run with Questa SIM

Siemens’ Questa SIM can compile and link DPI‑C code automatically when the C/C++ sources are passed to vlog alongside the SystemVerilog files. The Elektroda article summarises the steps:

Write your C function. If you are not letting vlog compile it, build it into a position‑independent shared object (.so on Linux).

Declare the function in SystemVerilog using import “DPI-C” ….

Either compile the C and SystemVerilog files together with vlog and run vsim, or compile only the SystemVerilog with vlog and load the prebuilt shared object with vsim -sv_lib.

vlib work
vlog -dpiheader dpi_hdr.h dpi_fifo.cpp sv_tb_dpi.sv
vsim -c work.tb_dpi_fifo -do "run -all; quit"

The -dpiheader switch instructs vlog to generate a header (dpi_hdr.h) that contains the prototypes of the imported functions. Questa automatically calls GCC to compile the C file and link the resulting shared library. Alternatively you can pre‑compile the C library manually:

g++ -fPIC -shared -I$MTI_HOME/include -o libdpi.so dpi_fifo.cpp
vlog sv_tb_dpi.sv
vsim work.tb_dpi_fifo -sv_lib libdpi

Be sure that the C library is compiled for the same word size as your simulator (32‑bit vs. 64‑bit); otherwise you will see linker errors. Also remember to include svdpi.h in your C file; without it the DPI data types are undefined.

Appendix

FIFO RTL

Adapted from https://www.verilogpro.com/asynchronous-fifo-design/.

// async_fifo.sv
module async_fifo #(
    parameter int DATA_WIDTH = 8,
    parameter int ADDR_WIDTH = 4
) (
    input  logic               wclk,
    input  logic               wrst_n,
    input  logic               w_en,
    input  logic [DATA_WIDTH-1:0] wdata,
    output logic               w_full,
    input  logic               rclk,
    input  logic               rrst_n,
    input  logic               r_en,
    output logic [DATA_WIDTH-1:0] rdata,
    output logic               r_empty
);
    localparam int FIFO_DEPTH = 1 << ADDR_WIDTH;
    logic [DATA_WIDTH-1:0] mem [0:FIFO_DEPTH-1];
    logic [ADDR_WIDTH:0] wptr_bin, rptr_bin;
    logic [ADDR_WIDTH:0] wptr_gray, rptr_gray;
    logic [ADDR_WIDTH:0] wptr_gray_sync1, wptr_gray_sync2;
    logic [ADDR_WIDTH:0] rptr_gray_sync1, rptr_gray_sync2;

    // Write domain
    always_ff @(posedge wclk or negedge wrst_n) begin
        if (!wrst_n) begin
            wptr_bin  <= '0;
            wptr_gray <= '0;
        end else if (w_en && !w_full) begin
            mem[wptr_bin[ADDR_WIDTH-1:0]] <= wdata;
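            wptr_bin  <= wptr_bin + 1'b1;
            // binary→Gray of the next pointer; the sized 1'b1 keeps the sum at
            // ADDR_WIDTH+1 bits so it wraps to zero correctly
            wptr_gray <= ((wptr_bin + 1'b1) >> 1) ^ (wptr_bin + 1'b1);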
        end
    end

    // Read domain
    always_ff @(posedge rclk or negedge rrst_n) begin
        if (!rrst_n) begin
            rptr_bin  <= '0;
            rptr_gray <= '0;
        end else if (r_en && !r_empty) begin
            rptr_bin  <= rptr_bin + 1'b1;
            // binary→Gray of the next pointer, sized to ADDR_WIDTH+1 bits
            rptr_gray <= ((rptr_bin + 1'b1) >> 1) ^ (rptr_bin + 1'b1);
        end
    end

    // Data output: synchronous read, registered in the rclk domain
    always_ff @(posedge rclk) begin
        rdata <= mem[rptr_bin[ADDR_WIDTH-1:0]];
    end

    // Synchronize Gray pointers across domains
    always_ff @(posedge wclk or negedge wrst_n) begin
        if (!wrst_n) begin
            rptr_gray_sync1 <= '0;
            rptr_gray_sync2 <= '0;
        end else begin
            rptr_gray_sync1 <= rptr_gray;
            rptr_gray_sync2 <= rptr_gray_sync1;
        end
    end
    always_ff @(posedge rclk or negedge rrst_n) begin
        if (!rrst_n) begin
            wptr_gray_sync1 <= '0;
            wptr_gray_sync2 <= '0;
        end else begin
            wptr_gray_sync1 <= wptr_gray;
            wptr_gray_sync2 <= wptr_gray_sync1;
        end
    end

    // Full and empty detection
    always_comb begin
        // Empty when synchronized write pointer equals local read pointer
        r_empty = (wptr_gray_sync2 == rptr_gray);
        // Full when the two MSBs differ and the remaining bits match
        // (Gray-coded pointers; requires ADDR_WIDTH >= 2)
        w_full  = (wptr_gray == {~rptr_gray_sync2[ADDR_WIDTH:ADDR_WIDTH-1],
                                 rptr_gray_sync2[ADDR_WIDTH-2:0]});
    end
endmodule
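
The pointer logic in async_fifo.sv can be cross-checked with a small software model. The sketch below, in C++ to match the testbenches in this guide, shows the binary-to-Gray conversion used for the pointers and a full/empty test on Gray-coded pointers. The function names and the ptr_bits parameter (ADDR_WIDTH + 1) are illustrative assumptions, not part of the repository. The full test uses the common formulation for Gray-coded pointers, in which the two most significant bits differ and the remaining bits match.

```cpp
#include <cassert>
#include <cstdint>

// Binary -> Gray: each Gray bit is the XOR of two adjacent binary bits.
uint32_t bin_to_gray(uint32_t bin) {
    return (bin >> 1) ^ bin;
}

// Gray -> binary: fold every higher bit down with XOR.
uint32_t gray_to_bin(uint32_t gray) {
    uint32_t bin = gray;
    for (uint32_t shift = 1; shift < 32; shift <<= 1) {
        bin ^= bin >> shift;
    }
    return bin;
}

// Empty: the two Gray pointers are identical.
bool gray_empty(uint32_t wgray, uint32_t rgray) {
    return wgray == rgray;
}

// Full: the top two bits differ and all lower bits match.
// ptr_bits is the pointer width including the wrap bit (ADDR_WIDTH + 1).
bool gray_full(uint32_t wgray, uint32_t rgray, unsigned ptr_bits) {
    assert(ptr_bits >= 3 && ptr_bits <= 32);
    const uint32_t top_two = 3u << (ptr_bits - 2);
    return wgray == (rgray ^ top_two);
}
```

With ADDR_WIDTH = 4 (ptr_bits = 5), a write pointer of 16 and a read pointer of 0 are 16 entries apart, and gray_full(bin_to_gray(16), bin_to_gray(0), 5) returns true.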

FIFO TB in C++

// fifo_tb.cpp
// -----------------------------------------------------------------------------
// C++ testbench for async_fifo.sv when Verilated into C++.
//
// Build idea (typical):
//   verilator -Wall --cc async_fifo.sv --exe fifo_tb.cpp --top-module async_fifo
//   make -C obj_dir -f Vasync_fifo.mk
//   ./obj_dir/Vasync_fifo
//
// Verilator's guide describes translating SV to C++ with --cc and building an
// executable with --binary/--exe.
// -----------------------------------------------------------------------------

#include <cstdint>
#include <deque>
#include <iostream>
#include <random>

#include "Vasync_fifo.h"
#include "verilated.h"

static uint64_t main_time = 0;

static void tick(Vasync_fifo* top) {
    top->eval();
    main_time++;
}

int main(int argc, char** argv) {
    Verilated::commandArgs(argc, argv);

    auto* top = new Vasync_fifo;

    // Simple async clocks: write clock toggles every cycle, read clock every 2
    top->wclk = 0;
    top->rclk = 0;
    top->wrst_n = 0;
    top->rrst_n = 0;
    top->w_en = 0;
    top->r_en = 0;
    top->wdata = 0;

    // apply reset
    for (int i = 0; i < 10; i++) {
        top->wclk = !top->wclk;
        if ((i % 2) == 0) top->rclk = !top->rclk;
        tick(top);
    }
    top->wrst_n = 1;
    top->rrst_n = 1;

    std::deque<uint32_t> scoreboard;
    std::mt19937 rng(1);
    std::uniform_int_distribution<int> coin(0, 1);

    const int N = 2000;
    for (int t = 0; t < N; t++) {
        // wclk toggles every step, rclk every other step
        const bool w_rise = !top->wclk;
        const bool r_rise = ((t % 2) == 0) && !top->rclk;

        // Drive each domain's inputs just before its own rising edge
        if (w_rise) {
            top->w_en = (coin(rng) == 1);
            top->wdata = static_cast<uint8_t>(t);  // DATA_WIDTH defaults to 8
        }
        if (r_rise) {
            top->r_en = (coin(rng) == 1);
        }

        // Sample the handshakes before the edge: the DUT acts on pre-edge values
        const bool will_write = w_rise && top->w_en && !top->w_full;
        const bool will_read  = r_rise && top->r_en && !top->r_empty;

        // toggle clocks
        top->wclk = !top->wclk;
        if ((t % 2) == 0) top->rclk = !top->rclk;

        tick(top);

        // Scoreboard updates only for transfers accepted on this edge
        if (will_write) {
            scoreboard.push_back(top->wdata);
        }
        if (will_read) {
            if (scoreboard.empty()) {
                std::cerr << "ERROR: DUT popped but scoreboard empty\n";
                return 2;
            }
            uint32_t exp = scoreboard.front();
            uint32_t got = top->rdata;
            scoreboard.pop_front();
            if (got != exp) {
                std::cerr << "ERROR: mismatch exp=" << exp << " got=" << got << "\n";
                return 3;
            }
        }
    }

    std::cout << "PASS\n";
    delete top;
    return 0;
}

FIFO C++

// dpi_fifo.cpp
// -----------------------------------------------------------------------------
// C++ async FIFO model exposed through a C ABI for SystemVerilog DPI-C.
//
// This is the "C++ golden model -> SV testbench" flow.
// - SV holds a handle (chandle) to an allocated C++ object.
// - SV calls fifo_push/fifo_pop each cycle or transaction.
//
// DPI-C uses svdpi.h types and API; keeping types consistent matters because
// SV<->C data must be interpreted identically.
// -----------------------------------------------------------------------------

#include <cstdint>
#include <cstdlib>
#include <new>
#include <vector>

extern "C" {
#include "svdpi.h"
}

struct FifoU32 {
    explicit FifoU32(int depth_pow2)
        : depth(depth_pow2), mem((size_t)depth_pow2, 0) {
        reset();
    }

    void reset() {
        wptr = 0;
        rptr = 0;
    }

    bool empty() const { return wptr == rptr; }

    bool full() const {
        int mask = depth - 1;
        return ((wptr & mask) == (rptr & mask)) && (((wptr ^ rptr) & depth) != 0);
    }

    bool push(uint32_t v) {
        if (full()) return false;
        mem[(size_t)(wptr & (depth - 1))] = v;
        wptr = (wptr + 1) & ((2 * depth) - 1);
        return true;
    }

    bool pop(uint32_t* out) {
        if (empty()) return false;
        *out = mem[(size_t)(rptr & (depth - 1))];
        rptr = (rptr + 1) & ((2 * depth) - 1);
        return true;
    }

    int depth;
    std::vector<uint32_t> mem;
    int wptr{0};
    int rptr{0};
};

extern "C" {

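// Create a FIFO. depth must be a positive power of 2; returns nullptr otherwise.
void* fifo_create(int depth) {
    if (depth <= 0 || (depth & (depth - 1)) != 0) return nullptr;
    try {
        return new FifoU32(depth);
    } catch (...) {
        return nullptr;
    }
}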

void fifo_destroy(void* h) {
    delete static_cast<FifoU32*>(h);
}

void fifo_reset(void* h) {
    if (!h) return;
    static_cast<FifoU32*>(h)->reset();
}

// Returns 1 on success, 0 on full
int fifo_push_u32(void* h, unsigned int v) {
    if (!h) return 0;
    return static_cast<FifoU32*>(h)->push((uint32_t)v) ? 1 : 0;
}

// Returns 1 on success, 0 on empty
int fifo_pop_u32(void* h, unsigned int* out_v) {
    if (!h || !out_v) return 0;
    uint32_t tmp = 0;
    if (!static_cast<FifoU32*>(h)->pop(&tmp)) return 0;
    *out_v = (unsigned int)tmp;
    return 1;
}

int fifo_empty(void* h) {
    if (!h) return 1;
    return static_cast<FifoU32*>(h)->empty() ? 1 : 0;
}

int fifo_full(void* h) {
    if (!h) return 0;
    return static_cast<FifoU32*>(h)->full() ? 1 : 0;
}

} // extern "C"
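
dpi_fifo.cpp relies on the depth being a power of two so that masking with depth - 1 wraps the memory index. A minimal sketch of helpers a caller could use to validate or round up a requested depth, and to compute occupancy from pointers that carry one extra wrap bit, follows. The helper names are hypothetical and are not part of dpi_fifo.cpp.

```cpp
#include <cassert>
#include <cstdint>

// True when n is a non-zero power of two (exactly one bit set).
bool is_pow2(uint32_t n) {
    return n != 0 && (n & (n - 1)) == 0;
}

// Smallest power of two that is >= n, for 1 <= n <= 2^31.
uint32_t round_up_pow2(uint32_t n) {
    assert(n >= 1 && n <= (1u << 31));
    uint32_t p = 1;
    while (p < n) {
        p <<= 1;
    }
    return p;
}

// Number of stored entries when both pointers count modulo 2 * depth,
// i.e. they carry one wrap bit above the address bits.
uint32_t occupancy(uint32_t wptr, uint32_t rptr, uint32_t depth) {
    assert(is_pow2(depth));
    return (wptr - rptr) & (2 * depth - 1);
}
```

With this scheme the FIFO is empty when occupancy is 0 and full when it equals depth, which matches the empty() and full() tests in FifoU32.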

FIFO C++ Header

// cpp_async_fifo.h
// -----------------------------------------------------------------------------
// Simple parameterizable FIFO model in C++ (power-of-2 depth)
// Used as a pure C++ golden model *or* inside DPI wrappers.
// -----------------------------------------------------------------------------
#pragma once

#include <array>
#include <cstddef>
#include <cstdint>

// Depth must be power-of-2.
template <typename T, std::size_t DEPTH>
class AsyncFifo {
    static_assert((DEPTH & (DEPTH - 1)) == 0, "DEPTH must be power-of-2");

public:
    AsyncFifo() { reset(); }

    void reset() {
        wptr_ = 0;
        rptr_ = 0;
    }

    bool empty() const { return wptr_ == rptr_; }

    bool full() const {
        // full when lower bits match but MSB differs
        const std::size_t mask = DEPTH - 1;
        return ((wptr_ & mask) == (rptr_ & mask)) && ((wptr_ ^ rptr_) & DEPTH);
    }

    bool push(const T& v) {
        if (full()) return false;
        mem_[wptr_ & (DEPTH - 1)] = v;
        wptr_ = (wptr_ + 1) & ((DEPTH * 2) - 1);
        return true;
    }

    bool pop(T& out) {
        if (empty()) return false;
        out = mem_[rptr_ & (DEPTH - 1)];
        rptr_ = (rptr_ + 1) & ((DEPTH * 2) - 1);
        return true;
    }

private:
    std::array<T, DEPTH> mem_{};
    // pointers carry one extra wrap bit -> range [0, 2*DEPTH)
    std::size_t wptr_{0};
    std::size_t rptr_{0};
};

FIFO TB in SV

// sv_tb_dpi.sv
// -----------------------------------------------------------------------------
// SystemVerilog testbench that imports a C++ FIFO model via DPI-C.
//
// Questa compile notes (one common flow):
// - Compile SV + C together with vlog and autolink
// - Or compile C into a shared library and load with vsim -sv_lib
// Example commands appear in many guides; one common pattern is:
//   vlog -dpiheader dpi_hdr.h dpi_func.c tb_top.sv
//   vsim -c work.tb_top -do "run -all; quit"
// -----------------------------------------------------------------------------

module tb_dpi_fifo;

  import "DPI-C" function chandle fifo_create(input int depth);
  import "DPI-C" function void    fifo_destroy(input chandle h);
  import "DPI-C" function void    fifo_reset(input chandle h);
  import "DPI-C" function int     fifo_push_u32(input chandle h, input int unsigned v);
  import "DPI-C" function int     fifo_pop_u32 (input chandle h, output int unsigned v);
  import "DPI-C" function int     fifo_empty(input chandle h);
  import "DPI-C" function int     fifo_full (input chandle h);

  chandle h;
  int unsigned got;

  localparam int DEPTH = 16;

  initial begin
    h = fifo_create(DEPTH);
    if (h == null) $fatal(1, "fifo_create failed");

    fifo_reset(h);

    // push 0..DEPTH-1
    for (int i = 0; i < DEPTH; i++) begin
      if (!fifo_push_u32(h, i)) $fatal(1, "unexpected full at i=%0d", i);
    end
    if (!fifo_full(h)) $display("NOTE: fifo_full() did not assert; check full policy");

    // pop and check
    for (int i = 0; i < DEPTH; i++) begin
      if (!fifo_pop_u32(h, got)) $fatal(1, "unexpected empty at i=%0d", i);
      if (got !== i[31:0]) $fatal(1, "mismatch exp=%0d got=%0d", i, got);
    end
    if (!fifo_empty(h)) $display("NOTE: fifo_empty() did not assert; check empty policy");

    fifo_destroy(h);
    $display("PASS");
    $finish;
  end

endmodule

HW Architecture

Caches

DRAM Subsystem Homepage

Introduction

The DRAM subsystem exists to provide the Atalla accelerator with access to off-chip memory required by AI workloads that exceed on-chip SRAM limits. Because DRAM access involves strict command sequencing and long latencies, a controller and memory bus are required to manage these accesses. The subsystem’s current goal is a non-blocking architecture that improves bandwidth by overlapping memory requests.

This page serves as the central home for the DRAM Subsystem. It consolidates RTL diagrams, active projects, reports, presentations, and ramp-up material. Use the links below to navigate based on what you are looking for.

New to the DRAM subsystem?

Follow the ramp-up guide for background, architecture context, and recommended resources -> Ramp-Up Guide

Working on the DRAM subsystem?

View active projects, current contributors, development branches, and documentation -> Active Projects

Looking for past reports or presentations?

View past reports, presentations, abstracts, and posters made by the DRAM subsystem -> Past Reports and Presentations

Looking for completed projects?

View completed projects by the DRAM subsystem -> Completed Projects

Ramp-Up Guide

This section is for new students joining the DRAM subsystem and serves as a starting point for getting up to speed. It includes background material and resources to help you understand the design and begin contributing.

Introductory DRAM Overview (Recommended Starting Point)

A high-level video explaining the basic structure and operation of DRAM. This is an excellent first exposure and helps build intuition before diving into more technical material.

https://www.youtube.com/watch?v=7J7X7aZvMXQ&t=47s
Memory Systems: Cache, DRAM, Disk – Jacob, Ng, and Wang

Chapters 10-13 are required reading, as they cover DRAM organization, timing, and the memory system in depth.

https://purdue.primo.exlibrisgroup.com/discovery/fulldisplay?docid=alma99169138574101081
Understanding DDR4 Timing Parameters

A short reference page summarizing DDR4 timing parameters and constraints.

https://www.systemverilog.io/design/understanding-ddr4-timing-parameters/
JEDEC DDR4 Standard (JESD79-4C)

The official DDR4 specification defining all command sequences, timing requirements, and constraints.

https://raw.githubusercontent.com/RAMGuide/TheRamGuide-WIP-/main/DDR4%20Spec%20JESD79-4C.pdf
ETH Zurich Lecture: DRAM Controllers (Prof. Onur Mutlu)

An in-depth lecture covering DRAM controller design, performance challenges, and architectural tradeoffs.

https://www.youtube.com/watch?v=TeG773OgiMQ

Active Projects

This section documents the currently active DRAM subsystem projects, including their purpose, implementation status, code locations, and points of contact.

Non-Blocking DRAM Controller

Description: The goal of this project is to design a non-blocking DRAM controller that allows multiple memory requests to be in flight simultaneously to improve bandwidth utilization. The design uses an open-row policy and per-bank request queues to hide memory latency and enable memory-level parallelism.
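
One way to picture per-bank request queues combined with a policy that leaves rows open is the small host-side model below. It is a sketch only: the address split (`kColBits`, `kBankBits`), the `BankState` and `BankQueues` types, and the hit/miss/conflict classification are assumptions chosen for illustration and are not taken from the controller RTL.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <optional>

// Hypothetical address layout for illustration: | row | bank | column |
constexpr unsigned kColBits  = 10;
constexpr unsigned kBankBits = 2;
constexpr unsigned kNumBanks = 1u << kBankBits;

enum class RowResult { Hit, Miss, Conflict };

struct BankState {
    std::optional<std::uint32_t> open_row;  // empty while the bank is precharged
    std::deque<std::uint64_t> queue;        // pending request addresses for this bank
};

inline std::uint64_t make_addr(std::uint64_t row, std::uint64_t bank, std::uint64_t col) {
    return (row << (kColBits + kBankBits)) | (bank << kColBits) | col;
}
inline unsigned bank_of(std::uint64_t addr) {
    return static_cast<unsigned>((addr >> kColBits) & (kNumBanks - 1));
}
inline std::uint32_t row_of(std::uint64_t addr) {
    return static_cast<std::uint32_t>(addr >> (kColBits + kBankBits));
}

class BankQueues {
public:
    // Route a request to the queue of the bank it targets.
    void enqueue(std::uint64_t addr) { banks_[bank_of(addr)].queue.push_back(addr); }

    // Service the oldest request of one bank.
    // Hit: row already open. Miss: bank idle, ACT needed. Conflict: PRE then ACT needed.
    RowResult service(unsigned bank) {
        BankState& b = banks_[bank];
        assert(!b.queue.empty());
        const std::uint32_t row = row_of(b.queue.front());
        b.queue.pop_front();
        RowResult res = RowResult::Miss;
        if (b.open_row) res = (*b.open_row == row) ? RowResult::Hit : RowResult::Conflict;
        b.open_row = row;  // leave the row open for later requests to the same row
        return res;
    }

    std::size_t pending(unsigned bank) const { return banks_[bank].queue.size(); }

private:
    std::array<BankState, kNumBanks> banks_{};
};
```

Because each bank keeps its own queue and open-row state, a conflict in one bank does not stop another bank from being serviced, which is where the memory-level parallelism comes from.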

Contributors

Jason Lyst (jlyst@purdue.edu)
Adrian Buczkowski (abuczko@purdue.edu)
Eddie Hu (hu927@purdue.edu)
Shams Hoque (hoques@purdue.edu)

RTL Diagrams

This section links the location of all block/RTL diagrams that were made for this design: https://app.diagrams.net/#G18bqekF9I8oZJpSTm-BcsDvPkOPy_cdul#%7B%22pageId%22%3A%22fpKTT8HEuwSpTkvlEaWT%22%7D

Active Branches

This section links the location of active branches that are being used for the design:

Main DRAM Branch: https://github.com/Purdue-SoCET/atalla/tree/memory_subsystem_dram

Verification

This section links the location of verification related documents like verification plans:

Design Documentation/Resources

This section links any documentation or resources that were used specifically for this design. This includes meeting notes, design logs, research papers, etc.

https://ieeexplore.ieee.org/document/7108455
https://cdn.discordapp.com/attachments/1412834335983272129/1421965427600392203/dram_controller_non_block_idea.pdf?ex=69792800&is=6977d680&hm=67a75be2ec3b113caa3017cc4007acdce61c7919d64f179cc3b22d7cfcef2005&
DDR4 MICRON Model: https://drive.google.com/file/d/1CKYhZJe7rzhp_2ATkkAfWrufMl-Lt6jW/view?usp=sharing

Split-Transaction Interconnect

Description: The goal of this project is to design a split-transaction memory bus that can manage simultaneous in-flight requests from caches/scratchpad and simultaneous in-flight responses from the DRAM controller.
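
A common way to keep several requests in flight on a split-transaction bus is to tag each request and match responses by tag, so responses may return in any order. The sketch below models that bookkeeping on the host; the `Request`, `Response`, and `TransactionTable` names, the 16-bit tag, and the back-pressure behavior are assumptions for illustration, not the project's interface.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <unordered_map>

// Hypothetical request/response records; field names are illustrative only.
struct Request  { std::uint8_t requester; std::uint64_t addr; };
struct Response { std::uint16_t tag; std::uint64_t data; };

class TransactionTable {
public:
    explicit TransactionTable(std::size_t max_in_flight) : capacity_(max_in_flight) {
        assert(max_in_flight <= 65536);  // a free 16-bit tag must always exist
    }

    // Request phase: returns a tag, or nullopt when the table is full (back-pressure).
    std::optional<std::uint16_t> issue(const Request& req) {
        if (in_flight_.size() >= capacity_) return std::nullopt;
        while (in_flight_.count(next_tag_) != 0) ++next_tag_;  // skip tags still in use
        const std::uint16_t tag = next_tag_++;
        in_flight_.emplace(tag, req);
        return tag;
    }

    // Response phase: look up the tag, free the entry, and return the original request.
    std::optional<Request> complete(const Response& rsp) {
        auto it = in_flight_.find(rsp.tag);
        if (it == in_flight_.end()) return std::nullopt;  // unknown or duplicate tag
        const Request req = it->second;
        in_flight_.erase(it);
        return req;
    }

    std::size_t in_flight() const { return in_flight_.size(); }

private:
    std::size_t capacity_;
    std::uint16_t next_tag_ = 0;
    std::unordered_map<std::uint16_t, Request> in_flight_;
};
```

The bus is free between `issue` and `complete`, so a slow DRAM access does not block requests from other caches or the scratchpad.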

Contributors

Aryan Kadakia (kadakia0@purdue.edu)
Xinyu Liu (liu3680@purdue.edu)

RTL Diagrams

Active Branches

This section links the location of active branches that are being used for the design:

Aryan Kadakia’s Branch: https://github.com/Purdue-SoCET/atalla/tree/memory_subsystem_aryan#
Main DRAM Branch: https://github.com/Purdue-SoCET/atalla/tree/memory_subsystem_dram

Verification

This section links the location of verification related documents like verification plans:

Design Documentation/Resources

This section links any documentation or resources that were used specifically for this design. This includes meeting notes, design logs, research papers, etc.

https://developer.arm.com/documentation/102202/0300/AXI-protocol-overview
https://www.cis.upenn.edu/~cis5710/spring2024/slides/13_axi.pdf

Ramulator Simulator

Description: The goal of this project is to understand the Ramulator simulator and design an interface that connects the split-transaction bus to the simulator.

Contributors

Heng-I (Ivor) Chu (chu244@purdue.edu)
Yichen Tian (tian182@purdue.edu)

RTL Diagrams

This section links the location of all block/RTL diagrams that were made for this design:

Active Branches

This section links the location of active branches that are being used for the design:

Main DRAM Branch: https://github.com/Purdue-SoCET/atalla/tree/memory_subsystem_dram

Verification

This section links the location of verification related documents like verification plans:

Design Documentation/Resources

This section links any documentation or resources that were used specifically for this design. This includes meeting notes, design logs, research papers, etc.

https://github.com/CMU-SAFARI/ramulator2

Past Reports and Presentations

This section lists all past final reports, presentations, abstracts, and any other resource that was made by the DRAM subsystem.

Final Reports

Fall 2025 Final Report: https://docs.google.com/document/d/1fIBgyiB3g3OImUYkugq2czNFDmUUhIS_DO6sGcxqxDY/edit?usp=sharing
Spring 2025 Final Report: https://docs.google.com/document/d/1J7sHHt2H2yTATN91Cda57GuU_zQYz0v8/edit?usp=sharing&ouid=112766930685277737014&rtpof=true&sd=true

Design Review Presentations

Fall 2025 Design Review 1: https://purdue0-my.sharepoint.com/:p:/g/personal/khatri12_purdue_edu/IQDGPSESbApgR6gN1-VNOP-jAaiWMO0WYFC02s4PZF21BJo?e=WBqsit
Fall 2025 Design Review 2: https://purdue0-my.sharepoint.com/:p:/g/personal/khatri12_purdue_edu/IQBl4rWfHvqdT4ZHBKWouYAqAawxNiz32_OpBjLJoTGnRlo?e=8q3a8d
Spring 2025 Design Review: https://purdue0-my.sharepoint.com/:p:/g/personal/khatri12_purdue_edu/IQCThlgQtvIaQbjwHBnV-sbAAddkki-xSmsUfsxGS6OnpZE?e=lCDwWu

Abstracts

Fall 2025 Abstract: https://purdue0-my.sharepoint.com/:w:/g/personal/khatri12_purdue_edu/IQANqWleEbkvT5E5I8LGPdeWAcWw7mlhg-Q2tpLF6bX1JFc?e=gO85O0

Poster Presentation

Fall 2025 Poster Presentation: https://purdue0-my.sharepoint.com/:p:/g/personal/khatri12_purdue_edu/IQC-xhWrYXVmSrW5sevR7zQ1AegyEuAUWBxusv5jGvgEdPo?e=OTz2Ys

Completed Projects

Blocking DRAM Controller

Description: The goal of this project was to design a fully functional DRAM controller that interfaces with a DDR4 model.

Contributors

Tri Than (than0@purdue.edu)
Dhruv Khatri (khatri12@purdue.edu)

RTL Diagrams

Active Branches

This section links the location of active branches that are being used for the design:

Tri’s Branch: https://github.com/Purdue-SoCET/atalla/tree/memory_subsystem_tri
Dhruv’s Branch: https://github.com/Purdue-SoCET/atalla/tree/memory_subsystem_dhruv
Main DRAM Branch: https://github.com/Purdue-SoCET/atalla/tree/memory_subsystem_dram

Verification

This section links the location of verification related documents like verification plans:

Design Documentation/Resources

This section links any documentation or resources that were used specifically for this design. This includes meeting notes, design logs, research papers, etc.

DDR4 MICRON Model: https://drive.google.com/file/d/1CKYhZJe7rzhp_2ATkkAfWrufMl-Lt6jW/view?usp=sharing

Scheduler

Scratchpad

Systolic Array

Vector Core

SW Systems

Compiler

Kernels

Introduction to the Atallax01 Programming Model

The Atallax01 Programming Model allows users to map C/C++ algorithms to a throughput-focused deep-learning accelerator, architected around a VLIW vector datapath, a SW-managed Scratchpad, and a 32x32 BF16 Systolic Array.

Unlike GPUs, Atallax01 does not expose a wide SIMT/SIMD programming interface. Instead, it provides a tile-centric compute model where kernels explicitly optimize and orchestrate data movement, vector lane utilization, and Systolic Array computations through a single unified instruction stream. The workloads that will be run on Atalla are highly regular and need pipelines that are both wide and deep, specifically for two-dimensional matrices.

Atallax01 is not a general-purpose processor. It is a core that will be placed alongside a high-performance CPU to form a heterogeneous compute platform. Programmability is offered through C/C++ and a custom compiler toolchain. We do not plan to support interpreted languages like Python.

This document defines:

the hardware execution model
the memory hierarchy
the programming constructs available to the user
tile-based pseudocode conventions

Hardware Differences

CPUs are computing machines that were fine-tuned over decades to minimize the latency of single-threaded instruction streams. They rely heavily on advanced techniques like prediction, speculation, dynamic scheduling, etc. SMT, which partitions register sets to time-multiplex independent software threads, was added to the CPU core as an afterthought to exploit even more instruction-level parallelism.

However, the industry saw the need for different machines to exploit the data-level parallelism seen in scientific and graphics workloads. GPUs are massively parallel computing machines, primarily programmed using the CUDA or HIP paradigms, which expose an implicit-SIMD perspective to the user. This enables the user to write scalar-threaded code in C++ that is compiled into SIMT binaries to utilize the SIMD execution units. CUDA's initial innovation was to combine the hardware efficiency of SIMD with the programmability of SMT.

In recent years, GPUs have been adapted to cater to the demands of the deep learning ecosystem with the addition of Tensor Cores for matrix multiplications. TPUs grew in parallel, but were targeted purely at deep-learning workloads that were dominated by GEMMs/CONVs. Atallax01 targets these primitives directly and disregards the SIMT/SIMD abstractions. Users will write single-threaded code in C/C++ that directly defines tile-based descriptors for memory movement and vector-based kernels for compute datapaths. Thus, we say Atallax01 behaves more like a TPU than a GPU.

Heterogeneous Programming

Atallax01 is programmed using a heterogeneous host-device model, similar in spirit to CUDA/HIP but fundamentally simpler.

Host responsibilities include:

- Allocate DRAM tensors.
- Launch device kernels.
- Pass tile descriptors and kernel metadata.

Device responsibilities include:

- Move data between DRAM and on-chip SRAM.
- Swizzle data within the Scratchpad to enable row/column-major addressing.
- Load slices of the tiles into vector registers.
- Execute blocking vector load/store/compute operations to prime the Systolic Array, or utilize the execution lanes.

The compiler issues VLIW bundles into a mapped-space within the DRAM partition exposed to Atallax01. The on-chip scheduling unit enforces a tainted-VLIW scheme by checking dependencies through scoreboarding.

Memory Model

The Atallax01 memory system is software-managed, and does not enforce any hardware-managed ordering mechanisms. The datapath is in-order, with the SDMA instructions making SCPAD locations valid before later accesses take place.

Global Memory (DRAM):

- Large, high latency.
- Only accessible via SDMA instructions.
- Ideal for storing large tensors. Assume 8GB+ space.

Scratchpad Memory (SCPAD):

- 1MB on-chip SRAM, low latency.
- Only accessible via SDMA instructions.
- Two separate partitions indexed as SCPAD0 and SCPAD1.

Vector Register File (VEGGIE):

- [X-Size] SRAM vector register file.
- Only accessible via VM instructions.
- Intermediate tile-slice storage to send to Lanes/Systolic Array.

Scalar Register File:

- [X-Size] SRAM lockup-free D-Cache.
- Implemented as a hardware-managed L1 Cache.

Systolic Array Accumulation Buffers:

- Not programmable. Hardware-controlled.
- Strided/staggered collection and transfer of vectors into VEGGIE.

Execution Model

VLIW-based execution. Each cycle, the scheduler may issue one of [X] packet types. The compiler ensures intra-bundle independence, with inter-bundle dependencies handled by the Scoreboards in the Scheduler Unit.

In the following sections, we will focus on explaining the different “concepts” to keep in mind before developing code for Atallax01. Following this, we will discuss abstracted kernels which utilize these concepts.

Abstract Entities:

TileDesc       - 2D block of memory in Global/Scpad (described by shape + strides)

GlobalRegion   - Where in Global Memory
GlobalTile     - N-D tensor in off-chip DRAM, has-a GlobalRegion

ScpadRegion    - Which Scratchpad and where inside the Scratchpad
ScpadTile      - 2D tensor in on-chip SRAM, has-a TileDesc and ScpadRegion

VectorReg[v]   - vector register(s) in the vector core
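
As a rough illustration of what "described by shape + strides" can mean, the sketch below computes element offsets from a descriptor. The field names (`rows`, `cols`, `row_stride`, `col_stride`) and the helper functions are hypothetical; the `TileDesc` in the stub library may be laid out differently.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical field names; the real TileDesc may differ.
struct TileDesc {
    std::size_t rows, cols;              // shape, in elements
    std::size_t row_stride, col_stride;  // distance between neighbors, in elements
};

// Element offset of (r, c) relative to the tile's base address.
inline std::size_t offset_of(const TileDesc& t, std::size_t r, std::size_t c) {
    assert(r < t.rows && c < t.cols);
    return r * t.row_stride + c * t.col_stride;
}

// Describe a rows x cols window of a row-major matrix with leading dimension ld.
inline TileDesc row_major_tile(std::size_t rows, std::size_t cols, std::size_t ld) {
    return TileDesc{rows, cols, ld, 1};
}

// Swapping shape and strides yields a transposed view of the same memory.
inline TileDesc transposed(const TileDesc& t) {
    return TileDesc{t.cols, t.rows, t.col_stride, t.row_stride};
}
```

With this representation, a 32x32 tile cut out of a 128-column matrix has `row_stride = 128`, and switching between row-major and column-major addressing only changes the descriptor, not the data.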

Abstract Intrinsics:

SDMA_LD_* ScpadTile, GlobalTile
SDMA_ST_* ScpadTile, GlobalTile
VM_LD VectorReg[v], ScpadTile
VM_ST VectorReg[v], ScpadTile
VV_* VectorReg[v], VectorReg[v]
VV_* VectorReg[v], Imm
VS_* VectorReg[v], ScalarReg[v]

GEMMV ScpadTile C, ScpadTile A, ScpadTile B
CONV ScpadTile C, ScpadTile A, ScpadTile B

Kernels

General Matrix-Multiply (GEMM)

Atallax01 does not expose the 32x32 Systolic Array directly. Instead, we provide fixed-shape sub-kernels that operate on tiles no larger than 32x32.

Let’s define

TM  – rows of the output tile   (TM ≤ 32)
TN  – cols of the output tile   (TN ≤ 32)
TK  – reduction dimension slice (TK ≤ 32)

A single GEMMV intrinsic consumes:

A_tile  : TM × TK  (activations)
B_tile  : TK × TN  (weights)
C_tile  : TM × TN  (partial sums / output)

and computes:

C_tile = A_tile · B_tile + C_tile

entirely inside the vector-core + systolic array microcode, blocking until SCPAD_C is updated.
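
The arithmetic performed by one such call can be written as a plain host-side reference, which is handy for checking results. This is a sketch under assumptions: `gemmv_ref` is a hypothetical helper name, the buffers are row-major, and `float` stands in for BF16.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference semantics of one tile update: C_tile = A_tile * B_tile + C_tile.
// Row-major host buffers; float stands in for BF16. gemmv_ref is a hypothetical name.
inline void gemmv_ref(std::vector<float>& C, const std::vector<float>& A,
                      const std::vector<float>& B,
                      std::size_t TM, std::size_t TN, std::size_t TK) {
    assert(TM <= 32 && TN <= 32 && TK <= 32);
    assert(A.size() == TM * TK && B.size() == TK * TN && C.size() == TM * TN);
    for (std::size_t i = 0; i < TM; ++i) {
        for (std::size_t j = 0; j < TN; ++j) {
            float acc = C[i * TN + j];  // start from the existing partial sum
            for (std::size_t k = 0; k < TK; ++k) {
                acc += A[i * TK + k] * B[k * TN + j];
            }
            C[i * TN + j] = acc;
        }
    }
}
```

Because the existing contents of the C tile are accumulated into, repeated calls over successive slices of the reduction dimension build up the full dot products.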

Tiling/Grouping

Given a standard GEMM of general dimensions

C[M × N] = A[M × K] · B[K × N]

we can decompose it into the following number of tiles:

MT = ceil(M / TM)   
NT = ceil(N / TN)  
KT = ceil(K / TK)

Each output tile C_tile(i,j) (for 0 ≤ i < MT, 0 ≤ j < NT), together with its input tiles over 0 ≤ k < KT, is defined as:

C_tile(i,j) = C[ i*TM : (i+1)*TM,   j*TN : (j+1)*TN ]
A_tile(i,k) = A[ i*TM : (i+1)*TM,   k*TK : (k+1)*TK ]
B_tile(k,j) = B[ k*TK : (k+1)*TK,   j*TN : (j+1)*TN ]

All three of these tiles are loaded into on-chip SRAM as ScpadTiles before a GEMMV call.
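
When M, N, or K is not a multiple of the tile size, the last tile in that dimension is partial. A minimal sketch of the tile-count and tile-range arithmetic follows; `ceil_div`, `Extent`, and `tile_extent` are illustrative names, and clamping the upper bound to the matrix size is one possible way to handle edge tiles (padding is another).

```cpp
#include <algorithm>
#include <cassert>

// Integer ceiling division, e.g. MT = ceil_div(M, TM).
inline int ceil_div(int a, int b) {
    assert(a >= 0 && b > 0);
    return (a + b - 1) / b;
}

struct Extent { int begin, end; };  // half-open range [begin, end)

// Range covered by tile `idx` along a dimension of size `dim` with tile size `t`.
// The upper bound is clamped so the last tile never runs past the matrix edge.
inline Extent tile_extent(int idx, int t, int dim) {
    assert(idx >= 0 && idx < ceil_div(dim, t));
    return Extent{idx * t, std::min((idx + 1) * t, dim)};
}
```

For example, M = 70 with TM = 32 gives three tile rows covering [0, 32), [32, 64), and [64, 70).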

Below, we define the tiling logic:

#include <vector>
using std::vector;

struct TileGroupDesc {
    GlobalTile A_g;   // TM x TK slice of A in DRAM
    GlobalTile B_g;   // TK x TN slice of B in DRAM
    GlobalTile C_g;   // TM x TN slice of C in DRAM
    int i, j, k;      // element offsets of the tile origin (row, col, k reduction)
};

vector<TileGroupDesc> plan_gemmv(
    GlobalTile A, GlobalTile B, GlobalTile C,
    int M, int N, int K,
    int TM, int TN, int TK
) {
    vector<TileGroupDesc> groups;

    for (int i = 0; i < M; i += TM) {
      for (int j = 0; j < N; j += TN) {

        GlobalTile C_g = make_tile(C, i, j, TM, TN);

        for (int k = 0; k < K; k += TK) {

          GlobalTile A_g = make_tile(A, i, k, TM, TK);
          GlobalTile B_g = make_tile(B, k, j, TK, TN);

          groups.push_back(TileGroupDesc{.A_g = A_g, .B_g = B_g, .C_g = C_g, .i = i, .j = j, .k = k});
        }
      }
    }

    return groups;
}

Execution Loop

Note: _alloc_scpad0, _alloc_scpad1 and _gemmv are functions defined in the stub library we provide with the E2E stack. They work at the vector-register level.

bool execute_gemmv(
    GlobalTile A, GlobalTile B, GlobalTile C,
    int M, int N, int K
) {
    const int TM = ...; // ≤ 32
    const int TN = ...; // ≤ 32
    const int TK = ...; // ≤ 32

    vector<TileGroupDesc> groups = plan_gemmv(A, B, C, M, N, K, TM, TN, TK);

    for each distinct (i, j) over output tiles {
        GlobalTile C_g = pop_group(i,j, groups).C_g;

        ScpadTile sc_C  = _alloc_scpad1(TM, TN);  
        SDMA_LD_1(sc_C, C_g);
        
        for each g in groups where (g.i == i and g.j == j) { 
            in order of g.k {

                // Placing A and B in different Scratchpad partitions lets the compiler
                // packetize the loads so that they overlap with compute.
                ScpadTile sc_A  = _alloc_scpad0(TM, TK);  
                ScpadTile sc_B  = _alloc_scpad1(TK, TN);  
                SDMA_LD_0(sc_A, g.A_g);
                SDMA_LD_1(sc_B, g.B_g);

                _gemmv(sc_C, sc_A, sc_B);
            }
        }

        SDMA_ST_1(sc_C, C_g);
    }

    return true; // all output tiles have been written back
}
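
To sanity-check the loop structure without the hardware or the stub library, the same nest can be modeled on plain host arrays and compared against a naive GEMM. The sketch below is an assumption-laden stand-in: `Matrix`, `idx`, `gemm_naive`, and `gemm_tiled` are hypothetical names, `float` replaces BF16, the inner triple loop plays the role of one `_gemmv` call, and edge tiles are clamped to the matrix bounds.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<float>;  // row-major; float stands in for BF16

inline std::size_t idx(int r, int c, int ld) {
    return static_cast<std::size_t>(r) * static_cast<std::size_t>(ld) + static_cast<std::size_t>(c);
}

// Naive reference: C[M x N] = A[M x K] * B[K x N].
inline Matrix gemm_naive(const Matrix& A, const Matrix& B, int M, int N, int K) {
    Matrix C(idx(M, 0, N), 0.0f);
    for (int i = 0; i < M; ++i)
        for (int j = 0; j < N; ++j)
            for (int k = 0; k < K; ++k)
                C[idx(i, j, N)] += A[idx(i, k, K)] * B[idx(k, j, N)];
    return C;
}

// Host model of the tiled loop nest: one output tile at a time, accumulating over k-slices.
inline Matrix gemm_tiled(const Matrix& A, const Matrix& B, int M, int N, int K,
                         int TM, int TN, int TK) {
    assert(TM > 0 && TN > 0 && TK > 0 && TM <= 32 && TN <= 32 && TK <= 32);
    Matrix C(idx(M, 0, N), 0.0f);
    for (int i0 = 0; i0 < M; i0 += TM) {
        for (int j0 = 0; j0 < N; j0 += TN) {
            const int i1 = std::min(i0 + TM, M);  // clamp edge tiles
            const int j1 = std::min(j0 + TN, N);
            for (int k0 = 0; k0 < K; k0 += TK) {  // one tile update per k-slice
                const int k1 = std::min(k0 + TK, K);
                for (int i = i0; i < i1; ++i)
                    for (int j = j0; j < j1; ++j) {
                        float acc = C[idx(i, j, N)];
                        for (int k = k0; k < k1; ++k)
                            acc += A[idx(i, k, K)] * B[idx(k, j, N)];
                        C[idx(i, j, N)] = acc;
                    }
            }
        }
    }
    return C;
}
```

The output tile stays resident across the whole k loop and is written once at the end, mirroring how sc_C is loaded before the inner loop and stored after it.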


Atalla AIHW Documentation