SoCET Atalla AIHW
Documentation is a work in progress!
Please help with regular documentation where you can. When you create diagrams or set something in stone for the team, please document it here. This GitHub Page will serve as our source of truth for the Tensor Core.
Guidelines / How-To
Setup Guidelines
This document should list every step required to get set up to run Atalla. Specific setup scripts (such as those for the PyTorch or PPCI infrastructure) will be defined in sub-team homepages.
Basics
SSH into asicfab
The AI-HW team's preferred IDE is VSCode. Please follow the instructions in DigitalOcean’s tutorial on connecting to a remote server.
You will need to download Cisco AnyConnect VPN if you plan to SSH from an off-campus location.
Setup asicfab
- Run
source /package/asicfab/AccountSetup/init.bash
- Add the following into
~/.bashrc
[[ $- == *i* ]] || return
HOSTNAME=$(hostname)
if [ ${HOSTNAME} == "asicfab.ecn.purdue.edu" ]; then
source /package/asicfab/AccountSetup/init.bash
alias ls="ls --color"
alias ll="ls -la"
export COPYBUFFER=/package/asicfab/CopyBuffer
export MODULEPATH=/package/asicfab/AccountSetup/modulefiles:$MODULEPATH
export PATH=$HOME/.local/bin:$PATH # for python packages
unset PYTHONPATH
# For fusesoc + Questa usage
export MODEL_TECH="$(dirname $(which vsim))"
###### CUSTOM CHANGES BELOW THIS LINE #######
module load gcc/11.2.0 python3/3.11
module load riscv-gcc verilator/5.036 gtkwave
module load cadence/xcelium/23.03 siemens/questa/2021.4 intel/quartus-std
module load lcov
elif [ ${HOSTNAME} == "asicfabu.ecn.purdue.edu" ]; then
module load verilator gtkwave surfer lcov
else
echo "Unknown host ${HOSTNAME}; not loading modules"
fi
module load cadence/genus # used for synthesis
module load cadence/innovus # used for physical implementation
module load cadence/virtuoso # used for manually inspecting and manipulating the design
module load cadence/ssv # good general module
module load cadence/ddi # good general module
export LD_PRELOAD=/lib64/libz.so.1
Setup Github SSH
Follow the steps in the Github SSH-ing document. You should be able to clone the Atalla repository locally using git clone git@github.com:Purdue-SoCET/atalla.git now.
To test out a basic file, run
git checkout scratchpad_main
make run FILE=./scripts/common/xbar/clos/test.tcl
Synthesis
Overview
If you cloned the repository properly and ran the setup scripts, you should see “Flowkit” as a submodule. This repository, compiled by the Design Flow team, holds scripts and flows that help us use Genus and Innovus. It originated as Cadence software and has since been adapted.
Goals
You will use the following steps to compile each of your modules to get area and clock information. After this, look at the Reports step on how to format and store the reports that are generated.
Steps
The following steps outline what to do. Search for the
@AIHW tag in each of the respective files to know where to add or edit content.
- Ensure you are in the
atallax01 branch.
git checkout atallax01
git submodule update --init --recursive
- CREATE - /designs/cache/filelist.tcl
set listofdirs {}
lappend listofdirs "/home/asicfab/a/araviki/tensor-core/src/include"
set_db init_hdl_search_path $listofdirs
read_hdl -sv -define {NOIP SYNTHESIS} /home/asicfab/a/araviki/tensor-core/src/modules/cache_bank.sv
read_hdl -sv -define {NOIP SYNTHESIS} /home/asicfab/a/araviki/tensor-core/src/modules/cache_mshr_buffer.sv
read_hdl -sv -define {NOIP SYNTHESIS} /home/asicfab/a/araviki/tensor-core/src/modules/lockup_free_cache.sv
- EDIT - /scripts/config/design_config.tcl: search for “# Read Verilog and elaborate.”, and add this below
create_flow_step -name read_cache_hdl -owner design {
source ./designs/cache/filelist.tcl
}
create_flow_step -name elaborate_cache -owner design {
elaborate lockup_free_cache ;# replace with the name of your actual module
}
- EDIT - scripts/flow.yaml search for “flow: synthesis:”, near line 129
synthesis:
args: -tool genus -owner cadence -skip_metric -tool_options -disable_user_startup
features:
steps:
- syn_generic:
args: -owner cadence
features:
steps:
- block_start:
- init_elaborate:
- init_design:
args: -owner cadence
features:
steps:
- read_mmmc:
- read_physical:
# ########################
- read_cache_hdl: # replace this
- elaborate_cache: # replace this
# ########################
- read_power_intent:
- run_init_design:
- read_def:
enabled: "synth_ispatial || synth_hybrid"
- CREATE - /scripts/constraints/cache_bank.sdc. Always have interfaces for your modules. If you say it doesn’t need an interface, then that unit isn’t worth synthesizing alone.
set sdc_version 2.0
set_units -capacitance 1.0fF
set_units -time 1.0ps
# Set the current design
current_design cache_bank
# -period sets time in ps, 1000 -> 1 GHz, 5000 -> 200 MHz
# -waveform lists the rise and fall edges; the fall edge is {-period}/2 for a 50% duty cycle
create_clock -name "clock1" -period 1000.0 -waveform {0.0 500.0} [get_ports <interface_instance_name>_clk]
set_clock_transition -rise 1 [get_clocks "clock1"]
set_clock_transition -fall 1 [get_clocks "clock1"]
set_clock_uncertainty 0.1 [get_clocks "clock1"]
set_clock_gating_check -setup 0.0
set_driving_cell -lib_cell inv_8x -pin X [ all_inputs ] -min -max
set_input_delay -add_delay 1.0 -clock [get_clocks clock1] [all_inputs -no_clocks] ;# model a 1 ps input delay
set_output_delay -add_delay 1.0 -clock [get_clocks clock1] [all_outputs]
- EDIT - scripts/setup.yaml search for “constraint_modes” near line 101
constraint_modes:
func:
sdc_files:
- scripts/constraints/cache_bank.sdc
- RUN
Run the following commands in the Flowkit/ folder.
# Takes a long time
flowtool -reset -to synthesis
# Wakes up flowtool. It'll take a bit for all the required files to compile. You can run specific flow steps after this.
flowtool -flow run_syn_opt -interactive_run -isolate step
report_timing -max_paths 10 -path_type full > critical_paths.txt
Reports
Check within Flowkit/reports. reports/syn_opt/ has the results from the optimized synthesis step. reports/syn_opt/qor.rpt has the important content.
Create a folder for each submodule within tensor-core/reports, and store the relevant information in there. We will not gitignore it.
- To get the clock speed, take the {-period} value you set in the /scripts/constraints/*.sdc file and subtract the slack value (a negative slack lengthens the period). If the clock period is 1000 ps and the total slack is -555 ps, the achievable period is 1555 ps, so the clock speed is 1/(1555 ps) ≈ 643.1 MHz. If the clock period is 3000 ps, the total slack is 0, and the Critical Path Slack is 1580.8 ps, then your frequency is 1/(3000 ps - 1580.8 ps) = 1/(1419.2 ps) ≈ 704.6 MHz.
- To get the area, look under the Area section. Values are in um^2. 1 mm^2 = 1e6 um^2 (equivalently, 1 um^2 = 1e-6 mm^2).
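The report arithmetic can be sketched as a few C++ helpers. This is only an illustration: the function names are hypothetical and are not part of Flowkit or the Atalla repository. It assumes the period and slack are both in picoseconds, as in the .sdc file.

```cpp
#include <cassert>

// Achievable clock period in ps: the constrained period minus the slack.
// A negative slack (timing violation) therefore lengthens the period.
double achieved_period_ps(double period_ps, double slack_ps) {
    assert(period_ps > slack_ps);  // otherwise the result is not a valid period
    return period_ps - slack_ps;
}

// Convert a period in ps to a frequency in MHz (1 / 1 ps = 1e6 MHz).
double period_ps_to_mhz(double period_ps) {
    assert(period_ps > 0.0);
    return 1.0e6 / period_ps;
}

// Convert an area from um^2 to mm^2 (1 mm^2 = 1e6 um^2).
double um2_to_mm2(double area_um2) {
    return area_um2 * 1.0e-6;
}
```

For instance, period_ps_to_mhz(achieved_period_ps(1000.0, -555.0)) evaluates 1/(1555 ps) expressed in MHz.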
PCACTI
Ramulator2
Verification
Overview
Verification happens in three steps:
- Smoke Tests for all modules in SystemVerilog w/ QuestaSim.
- Unit Tests for individual modules in SystemVerilog w/ QuestaSim. Ensure you apply assertions and performance counters! The goal
- Top Level C++ tests w/ Verilator. This must be a more complete testbench, simulating real-workload situations.
All your code must go in ./tb/.
- ./tb/formal must contain all the mathematical assertions w/ covergroups.
- ./tb/unit must contain all the unit tests in the same hierarchy. Use
make sv_test folder= tb_file= GUI= to run QuestaSim. Check the Makefile for options.
- ./tb/uvm is a maybe for now.
SoCET AI Hardware SystemVerilog Coding Guide
Written by: Malcolm McClymont
Last updated: 1/15/2026
Introduction
This guide is designed for engineers developing and testing hardware within the AI Hardware team. It provides a set of guidelines meant to facilitate large scale collaboration by making RTL code easy to read, modify, and test. These rules are split into three categories based on severity:
-
SHALL/SHALL NOT rules; these must be followed at (almost) all times. Any SHALL RULE VIOLATIONS must include comments that thoroughly explain how the rule is being violated and why. Anyone who deviates from these rules should not be surprised if asked to rewrite their code.
-
SHOULD/SHOULD NOT rules; these practices are strongly encouraged, but not strictly mandatory. Violating these rules does not require an explicit comment but may warrant rewriting code.
-
MAY/MAY NOT rules; these are practices that could improve code legibility or testability but are situational and should be applied at the engineer’s discretion.
Perceptive readers will notice that there are no absolute rules in this guide; engineers should always evaluate if a rule should be followed and provide documentation if they decide not to follow it.
Formatting
-
Engineers SHALL use spaces as tabs, with 4 spaces per tab.
VSCode can be configured to do this, but here are some commands for a .vimrc file that enables this:
autocmd BufNewFile,BufRead *.sv,*.v set tabstop=4
autocmd BufNewFile,BufRead *.sv,*.v set shiftwidth=4
autocmd BufNewFile,BufRead *.sv,*.v set expandtab
-
Engineers SHALL minimize dead (commented out) code. If large sections of code are commented out (3 or more lines) then a pointer comment SHALL be used or the comment should be deleted.
-
Engineers SHALL consolidate code. Dead code and active code should be in separate groups, not intermixed.
An example of these two rules:
Bad! Notice how the dead and alive lines are mixed.
...
always_comb begin : input_buses
    // cu.input_type = 1'b0;
    cu.input_row = '0;
    cu.input_load = 1'b0;
    // cu.weight_row = '0;
    // cu.weight_load = 1'b0;
    cu.partials_row = '0;
    cu.partials_load = 1'b0;
    if (cu.input_en) begin
        cu.input_row = cu.row_in_en;
        cu.input_load = 1'b1;
    end else if (cu.weight_en) begin
        // cu.input_type = 1'b1;
        // cu.weight_row = cu.row_in_en;
        // cu.weight_load = 1'b1;
    end
    if (cu.partial_en) begin
        cu.partials_row = cu.row_ps_en;
        cu.partials_load = 1'b1;
    end
end
...
Better, dead lines are grouped together
...
always_comb begin : input_buses
    cu.input_row = '0;
    cu.input_load = 1'b0;
    cu.partials_row = '0;
    cu.partials_load = 1'b0;
    // cu.weight_row = '0;
    // cu.weight_load = 1'b0;
    // cu.input_type = 1'b0;
    if (cu.input_en) begin
        cu.input_row = cu.row_in_en;
        cu.input_load = 1'b1;
    end else if (cu.weight_en) begin
        // cu.input_type = 1'b1;
        // cu.weight_row = cu.row_in_en;
        // cu.weight_load = 1'b1;
    end
    if (cu.partial_en) begin
        cu.partials_row = cu.row_ps_en;
        cu.partials_load = 1'b1;
    end
end
...
Best, pointer comments are used to move commented blocks away from the active code. Ideally, dead lines should be at end of file or deleted entirely.
...
always_comb begin : input_buses
    cu.input_row = '0;
    cu.input_load = 1'b0;
    cu.partials_row = '0;
    cu.partials_load = 1'b0;
    //[1]
    if (cu.input_en) begin
        cu.input_row = cu.row_in_en;
        cu.input_load = 1'b1;
    end else if (cu.weight_en) begin
        //[2]
    end
    if (cu.partial_en) begin
        cu.partials_row = cu.row_ps_en;
        cu.partials_load = 1'b1;
    end
end
...
endmodule
//[1]
// cu.weight_row = '0;
// cu.weight_load = 1'b0;
// cu.input_type = 1'b0;
//[2]
// cu.input_type = 1'b1;
// cu.weight_row = cu.row_in_en;
// cu.weight_load = 1'b1;
-
Signal names SHALL NOT exceed 30 characters. Additionally, their names SHOULD be intuitive.
-
Module names SHALL be intuitive.
-
All always_comb and always_ff blocks SHALL have names.
-
Blocks that share many signals or interact closely SHOULD be adjacent in the code. In other words, achieve spatial locality.
- However, always_comb blocks SHOULD be grouped with other always_comb blocks. The same applies to always_ff blocks.
-
Unless a block’s name makes its function obvious, every block SHOULD come with a comment describing what it does
For example:
//Detailed comment describing function of block
always_comb begin : <block name>
    ...
end
...
//Detailed comment describing function of block
always_ff @(posedge clk, negedge n_rst) begin : <block name>
    ...
end
-
Any outstanding fixes/modifications to code SHOULD be documented using a TODO comment
-
When applicable, RTL and code SHALL use “manager” and “subordinate” as opposed to “master” and “slave”.
Verilator Linter
-
Engineers SHALL NOT have any Verilator linter warnings in code within main branch
-
Engineers SHALL NOT use lint_off to disable warnings
- All of the Verilator warnings and their meanings can be found here: https://verilator.org/guide/latest/warnings.html
For example:
always_comb begin : <block name>
    ...
    //Bad! Why are they getting a truncate warning?
    //Engineer should explain this with a comment or fix it.
    /* verilator lint_off WIDTHTRUNC */
    curr_input_row = iteration[l];
    /* verilator lint_on WIDTHTRUNC */
    ...
end
Testbenches
- Engineers SHALL only print messages for failing test cases, but include a “test complete” message too. These failing tests SHALL include a timestamp.
For example:
//This will set %t to print in nanoseconds. Replace -9 with -12 for ps.
$timeformat(-9, 2, " ns");
...
if(tb_out != golden_out) begin
$display("Output mismatch for test %0d at %0t", i, $time);
failed_cases += 1;
end
...
-
Testbenches SHALL use `timescale 1ps/1ps.
-
TODO: Expand this section
Interfaces
- Engineers SHALL include an ifndef to avoid repeatedly including an interface
- Modports SHALL follow the format x_y, where x and y represent the two modules the interface connects
For example:
//Good
module backend #(parameter logic [SCPAD_ID_WIDTH-1:0] IDX = '0) (
scpad_if.backend_sched bshif, //Connects backend to scheduler
scpad_if.backend_body bbif, //Connects backend to scratchpad body
scpad_if.backend_dram bdrif //Connects backend to DRAM
);
...
endmodule
- TODO: Expand this section
Synthesizable Logic
- Plus and minus SHALL be the only arithmetic operators directly used in synthesizable logic.
For example:
//Good
assign c = a + b;
//Bad!
assign c = a * b;
//Good. Create an instance of an operational unit from a written Verilog module.
mult_module M0(.a(a), .b(b), .c(c));
-
When using a for loop in synthesizable code, its intended function SHALL be thoroughly commented. This also applies to generate for loops.
-
Any use of always_latch SHALL come with extensive documentation on its intended function.
-
Functions MAY be used in synthesized logic, but SHALL only contain combinational logic.
- SystemVerilog doesn’t actually support sequential logic in functions.
For example:
//Good, only contains combinational logic.
function [7:0] addition (input [7:0] in_a, input [7:0] in_b);
addition = in_a + in_b;
endfunction
...
always_comb begin
c = addition(a, b);
end
Combinational Logic
- Engineers SHALL NOT use nested ternary logic. Use an if/else or case block instead for this.
For example:
//Bad! Nested ternary logic, hard to read.
result = a ? b : c ? d : e;
//Good. Use a different control structure.
if (a) begin
    result = b;
end else if (c) begin
    result = d;
end else begin
    result = e;
end
//Also acceptable
if (a) begin
    result = b;
end else begin
    result = c ? d : e;
end
- Any modules that are fully combinational MAY have the clk and n_rst input signals removed.
Sequential Logic
-
Sequential elements SHALL only use posedge clk and negedge n_rst in their sensitivity lists.
-
Engineers SHALL only use packed arrays in synthesizable modules. Testbenches may use either packed or unpacked arrays.
For example:
logic [7:0] data; //Packed array
logic data [7:0]; //Unpacked array
-
FSMs SHALL always use enums as state names.
-
Engineers SHOULD always use explicit next_state logic.
-
Any sequential elements that do not strictly require an n_rst signal SHOULD NOT have one. However, you SHALL comment which elements do not have an n_rst signal
- Determining which elements do not need an n_rst signal SHOULD be done after the design is fully verified.
- Signals used for write enable by any modules SHALL always have an n_rst signal
Forbidden Structures
The following SystemVerilog structures SHALL NOT be used:
- Programs
- Fork/join
- Always@
- Z-state logic/tri-state buffers
- Classes and polymorphism SHALL be avoided. Typedef statements within packages SHALL be used instead.
- Any datatypes indicating the strength of a signal (supply1, strong1, weak1, etc.)
- Force statements
- Triple equal signs equality operators (===)
Direct Programming Interface‑C (DPI‑C)
Overview
The SystemVerilog Direct Programming Interface (DPI) provides a standard way to call C functions from SystemVerilog and to call SystemVerilog functions from C. The DPI forms a bridge between the two languages and allows re‑use of existing C/C++ models or libraries.
This guide walks through two complementary flows that combine DPI‑C and a parameterizable asynchronous FIFO. The first flow uses Verilator to convert a SystemVerilog FIFO into C++ and writes a C++ testbench. The second flow implements the FIFO in C++ and imports it into a SystemVerilog testbench via DPI‑C, targeting Siemens Questa SIM.
Sources
- SV DPI Tutorial
- Async FIFO RTL
- Running Verilator
- How to call C-functions from SV using DPI-C
- How to call SV-modules from C++ using DPI-C
Flow 1 – Verilating the SystemVerilog FIFO and Creating a C++ Testbench
a. Convert SystemVerilog to C++ with Verilator
Copy over async_fifo.sv into your local directory.
- Invoke Verilator:
verilator --cc async_fifo.sv --exe fifo_tb.cpp -CFLAGS "-std=c++17" --build
- --cc tells Verilator to generate C++ output.
- --exe fifo_tb.cpp requests Verilator to compile the provided C++ testbench and link it with the generated model.
- --build invokes make to build the executable.
- Run the compiled simulation:
./obj_dir/Vasync_fifo
Verilator will generate a directory (default obj_dir) containing C++ files for the design. The included fifo_tb.cpp drives the async_fifo model by toggling independent write and read clocks, writing a sequence of values, and reading them back. The testbench checks for ordering errors and prints “PASS” when successful. See the file fifo_tb.cpp for details.
b. Writing a C++ Testbench
The C++ testbench uses the Verilated model interface. After including Vasync_fifo.h and initializing reset signals, it toggles wclk and rclk at different rates to emulate independent clock domains. It drives w_en and r_en, monitors the w_full/r_empty flags and verifies correct FIFO operation.
Refer to fifo_tb.cpp.
Verilator’s generated wrapper class provides the public signals as C++ members. Remember to call top->eval() after changing inputs; this schedules and evaluates the model for the current timestep.
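One way to produce two clocks at different rates is to give each clock its own half-period and toggle it whenever simulation time reaches a multiple of that value. The sketch below is a standalone illustration with hypothetical names (ClockGen, step_clock, run_two_clocks); it does not include the Verilated model, and the comment marks where the levels would be copied onto the model inputs before calling eval().

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: one free-running clock described by its half-period.
struct ClockGen {
    explicit ClockGen(uint64_t hp) : half_period(hp) {}
    uint64_t half_period;       // time units between toggles (must be > 0)
    bool level = false;         // current clock level
    unsigned rising_edges = 0;  // number of rising edges produced so far
};

// Advance the clock to time `now`. Returns true only when this call
// produced a rising edge.
bool step_clock(ClockGen& clk, uint64_t now) {
    assert(clk.half_period > 0);
    if (now % clk.half_period != 0) return false;  // no toggle at this time
    clk.level = !clk.level;
    if (clk.level) ++clk.rising_edges;
    return clk.level;
}

// Step both clocks from time 1 through `end_time`.
void run_two_clocks(ClockGen& wclk, ClockGen& rclk, uint64_t end_time) {
    for (uint64_t now = 1; now <= end_time; ++now) {
        step_clock(wclk, now);
        step_clock(rclk, now);
        // In a Verilated testbench, this is where one would write e.g.
        //   top->wclk = wclk.level; top->rclk = rclk.level; top->eval();
    }
}
```

Choosing half-periods with no common factor (2 and 3, for example) keeps the two domains from always toggling on the same time step.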
Flow 2 – Importing a C++ FIFO into a SystemVerilog Testbench via DPI‑C
a. Implement the FIFO in C++ and Expose it via DPI
Implementing a FIFO in C++ gives flexibility and allows using existing algorithmic models as golden references. The DPI‑C interface maps SystemVerilog types to C types; for example, byte maps to char and int maps to int. More complex types (arrays, 4‑state logic) require DPI‑defined types in svdpi.h. Arguments can be passed by value or reference depending on the argument direction.
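The provided file dpi_fifo.cpp implements a simple power‑of‑two FIFO in C++. It defines a FIFO structure with a depth, write and read pointers, and a dynamically allocated array. The exported functions have extern "C" linkage so the names are not mangled. The key functions are declared in dpi_hdr.h. The prototypes below use simplified names; the appendix listing names the structure FifoU32, spells the functions fifo_create, fifo_destroy, fifo_push_u32, and fifo_pop_u32, and passes the handle as void*.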
// Allocate a FIFO with at least the requested depth (rounded up to a power of two)
DpiFifo* fifo_init(int depth);
// Free the FIFO
void fifo_free(DpiFifo* fifo);
// Return non‑zero if FIFO is full
int fifo_full(const DpiFifo* fifo);
// Return non‑zero if FIFO is empty
int fifo_empty(const DpiFifo* fifo);
// Push/pop an integer; return 1 on success, 0 on full/empty
int fifo_push(DpiFifo* fifo, int data);
int fifo_pop(DpiFifo* fifo, int* data);
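The power-of-two depth matters because the pointers carry one extra wrap bit, which lets full and empty be told apart with simple masks. A minimal sketch of that bookkeeping follows, using the same mask logic as the appendix listing. The helper names (round_up_pow2, is_empty, is_full, advance) are hypothetical and are not part of dpi_fifo.cpp.

```cpp
#include <cassert>

// Round a requested depth up to the next power of two.
int round_up_pow2(int depth) {
    assert(depth > 0 && depth <= (1 << 30));  // keep the shift in range
    int p = 1;
    while (p < depth) p <<= 1;
    return p;
}

// Pointers live in [0, 2*depth): the low bits index memory, the top bit
// records how many times the pointer has wrapped (mod 2).
bool is_empty(int wptr, int rptr) {
    return wptr == rptr;
}

// Full when the index bits match but the wrap bits differ.
bool is_full(int wptr, int rptr, int depth_pow2) {
    const int mask = depth_pow2 - 1;
    return ((wptr & mask) == (rptr & mask)) &&
           (((wptr ^ rptr) & depth_pow2) != 0);
}

// Increment a pointer, wrapping at 2*depth.
int advance(int ptr, int depth_pow2) {
    return (ptr + 1) & ((2 * depth_pow2) - 1);
}
```

With a depth of 8, a write pointer of 8 and a read pointer of 0 share index bits 000 but differ in the wrap bit, so the FIFO reports full after exactly 8 pushes.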
b. Import the C++ Functions into SystemVerilog
In the SystemVerilog testbench (sv_tb_dpi.sv) we declare the C functions using import "DPI-C". The chandle type represents an opaque C pointer. The testbench allocates the FIFO, pushes a sequence of integers, pops them back and checks the order:
module tb;
import "DPI-C" function chandle fifo_init(input int depth);
import "DPI-C" function void fifo_free(input chandle handle);
import "DPI-C" function int fifo_push(input chandle handle, input int data);
import "DPI-C" function int fifo_pop(input chandle handle, output int data);
import "DPI-C" function int fifo_full(input chandle handle);
import "DPI-C" function int fifo_empty(input chandle handle);
// … allocate and use FIFO …
endmodule
c. Compile and Run with Questa SIM
Siemens’ Questa SIM can compile and link DPI‑C code automatically when the C/C++ sources are passed to vlog together with the SystemVerilog files. The Elektroda article summarises the steps:
- Write your C function and compile it into a position‑independent shared object (.so on Linux).
- Declare the function in SystemVerilog using import "DPI-C" ….
- Compile both C and SystemVerilog files together with vlog, or run the simulation and load a pre-built shared object with vsim -sv_lib.
vlib work
vlog -dpiheader dpi_hdr.h dpi_fifo.cpp sv_tb_dpi.sv
vsim -c work.tb -do "run -all; quit"
The -dpiheader switch instructs vlog to generate a header (dpi_hdr.h) that contains the prototypes of the imported functions. Questa automatically calls GCC to compile the C file and link the resulting shared library. Alternatively you can pre‑compile the C library manually:
g++ -fPIC -shared -I$MTI_HOME/include -o libdpi.so dpi_fifo.cpp
vlog sv_tb_dpi.sv
vsim work.tb -sv_lib libdpi
Be sure that the C library is compiled for the same word size as your simulator (32‑bit vs. 64‑bit); otherwise you will see linker errors. Also remember to include svdpi.h in your C file; without it the DPI data types are undefined.
Appendix
FIFO RTL
Adapted from https://www.verilogpro.com/asynchronous-fifo-design/.
// async_fifo.sv
module async_fifo #(
parameter int DATA_WIDTH = 8,
parameter int ADDR_WIDTH = 4
) (
input logic wclk,
input logic wrst_n,
input logic w_en,
input logic [DATA_WIDTH-1:0] wdata,
output logic w_full,
input logic rclk,
input logic rrst_n,
input logic r_en,
output logic [DATA_WIDTH-1:0] rdata,
output logic r_empty
);
localparam int FIFO_DEPTH = 1 << ADDR_WIDTH;
logic [DATA_WIDTH-1:0] mem [0:FIFO_DEPTH-1];
logic [ADDR_WIDTH:0] wptr_bin, rptr_bin;
logic [ADDR_WIDTH:0] wptr_gray, rptr_gray;
logic [ADDR_WIDTH:0] wptr_gray_sync1, wptr_gray_sync2;
logic [ADDR_WIDTH:0] rptr_gray_sync1, rptr_gray_sync2;
// Write domain
always_ff @(posedge wclk or negedge wrst_n) begin
if (!wrst_n) begin
wptr_bin <= '0;
wptr_gray <= '0;
end else if (w_en && !w_full) begin
mem[wptr_bin[ADDR_WIDTH-1:0]] <= wdata;
wptr_bin <= wptr_bin + 1'b1;
// binary→Gray; 1'b1 keeps the sum at pointer width so it wraps correctly
wptr_gray <= ((wptr_bin + 1'b1) >> 1) ^ (wptr_bin + 1'b1);
end
end
// Read domain
always_ff @(posedge rclk or negedge rrst_n) begin
if (!rrst_n) begin
rptr_bin <= '0;
rptr_gray <= '0;
end else if (r_en && !r_empty) begin
rptr_bin <= rptr_bin + 1'b1;
// binary→Gray; 1'b1 keeps the sum at pointer width so it wraps correctly
rptr_gray <= ((rptr_bin + 1'b1) >> 1) ^ (rptr_bin + 1'b1);
end
end
// Data output; synchronous read with registered output
always_ff @(posedge rclk) begin
rdata <= mem[rptr_bin[ADDR_WIDTH-1:0]];
end
// Synchronize Gray pointers across domains
always_ff @(posedge wclk or negedge wrst_n) begin
if (!wrst_n) begin
rptr_gray_sync1 <= '0;
rptr_gray_sync2 <= '0;
end else begin
rptr_gray_sync1 <= rptr_gray;
rptr_gray_sync2 <= rptr_gray_sync1;
end
end
always_ff @(posedge rclk or negedge rrst_n) begin
if (!rrst_n) begin
wptr_gray_sync1 <= '0;
wptr_gray_sync2 <= '0;
end else begin
wptr_gray_sync1 <= wptr_gray;
wptr_gray_sync2 <= wptr_gray_sync1;
end
end
// Full and empty detection
always_comb begin
// Empty when synchronized write pointer equals local read pointer
r_empty = (wptr_gray_sync2 == rptr_gray);
// Full when the two MSBs differ and the remaining bits match (Gray-code
// comparison; requires ADDR_WIDTH >= 2)
w_full = (wptr_gray == {~rptr_gray_sync2[ADDR_WIDTH:ADDR_WIDTH-1],
rptr_gray_sync2[ADDR_WIDTH-2:0]});
end
endmodule
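The Gray-code pointer comparison is easy to get wrong, so a small software model can help when checking it. The sketch below is an illustration with hypothetical function names, not code from the repository. It relies on the standard property that adding the FIFO depth to a binary pointer flips the top two bits of its Gray encoding and leaves the rest unchanged.

```cpp
#include <cassert>
#include <cstdint>

// Binary to Gray conversion: adjacent values differ in exactly one bit.
uint32_t bin_to_gray(uint32_t b) {
    return b ^ (b >> 1);
}

// Empty: the Gray pointers are identical.
bool gray_empty(uint32_t wptr_gray, uint32_t rptr_gray) {
    return wptr_gray == rptr_gray;
}

// Full: pointers are (addr_width + 1) bits wide; the write pointer equals
// the read pointer with its two most significant bits inverted.
bool gray_full(uint32_t wptr_gray, uint32_t rptr_gray, unsigned addr_width) {
    assert(addr_width >= 2 && addr_width < 31);
    const uint32_t top_two = 3u << (addr_width - 1);
    return wptr_gray == (rptr_gray ^ top_two);
}
```

For ADDR_WIDTH = 4, a read pointer of 8 (Gray 01100) and a write pointer of 24 (Gray 10100) are exactly one FIFO depth apart, and the two encodings differ only in their top two bits.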
FIFO TB in C++
// fifo_tb.cpp
// -----------------------------------------------------------------------------
// C++ testbench for async_fifo.sv when Verilated into C++.
//
// Build idea (typical):
// verilator -Wall --cc async_fifo.sv --exe fifo_tb.cpp --top-module async_fifo
// make -C obj_dir -f Vasync_fifo.mk
// ./obj_dir/Vasync_fifo
//
// Verilator's guide describes translating SV to C++ with --cc and building an
// executable with --binary/--exe.
// -----------------------------------------------------------------------------
#include <cstdint>
#include <deque>
#include <iostream>
#include <random>
#include "Vasync_fifo.h"
#include "verilated.h"
static vluint64_t main_time = 0;
static void tick(Vasync_fifo* top) {
top->eval();
main_time++;
}
int main(int argc, char** argv) {
Verilated::commandArgs(argc, argv);
auto* top = new Vasync_fifo;
// Simple async clocks: write clock toggles every cycle, read clock every 2
top->wclk = 0;
top->rclk = 0;
top->wrst_n = 0;
top->rrst_n = 0;
top->w_en = 0;
top->r_en = 0;
top->wdata = 0;
// apply reset
for (int i = 0; i < 10; i++) {
top->wclk = !top->wclk;
if ((i % 2) == 0) top->rclk = !top->rclk;
tick(top);
}
top->wrst_n = 1;
top->rrst_n = 1;
std::deque<uint32_t> scoreboard;
std::mt19937 rng(1);
std::uniform_int_distribution<int> coin(0, 1);
const int N = 2000;
for (int t = 0; t < N; t++) {
// drive enables in their own domains
bool do_write = (coin(rng) == 1);
bool do_read = (coin(rng) == 1);
// Write domain on rising edge of wclk
if (!top->wclk) {
top->w_en = do_write;
if (do_write) {
uint32_t v = (uint32_t)t;
top->wdata = v;
}
}
// Read domain on rising edge of rclk
if (!top->rclk) {
top->r_en = do_read;
}
// toggle clocks
top->wclk = !top->wclk;
if ((t % 2) == 0) top->rclk = !top->rclk;
tick(top);
// Scoreboard updates after edges
if (top->w_en && !top->w_full) {
scoreboard.push_back(top->wdata);
}
if (top->r_en && !top->r_empty) {
if (scoreboard.empty()) {
std::cerr << "ERROR: DUT popped but scoreboard empty\n";
return 2;
}
uint32_t exp = scoreboard.front();
uint32_t got = top->rdata;
scoreboard.pop_front();
if (got != exp) {
std::cerr << "ERROR: mismatch exp=" << exp << " got=" << got << "\n";
return 3;
}
}
}
std::cout << "PASS\n";
delete top;
return 0;
}
FIFO C++
// dpi_fifo.cpp
// -----------------------------------------------------------------------------
// C++ async FIFO model exposed through a C ABI for SystemVerilog DPI-C.
//
// This is the "C++ golden model -> SV testbench" flow.
// - SV holds a handle (chandle) to an allocated C++ object.
// - SV calls fifo_push/fifo_pop each cycle or transaction.
//
// DPI-C uses svdpi.h types and API; keeping types consistent matters because
// SV<->C data must be interpreted identically.
// -----------------------------------------------------------------------------
#include <cstdint>
#include <cstdlib>
#include <new>
#include <vector>
extern "C" {
#include "svdpi.h"
}
struct FifoU32 {
explicit FifoU32(int depth_pow2)
: depth(depth_pow2), mem((size_t)depth_pow2, 0) {
reset();
}
void reset() {
wptr = 0;
rptr = 0;
}
bool empty() const { return wptr == rptr; }
bool full() const {
int mask = depth - 1;
return ((wptr & mask) == (rptr & mask)) && (((wptr ^ rptr) & depth) != 0);
}
bool push(uint32_t v) {
if (full()) return false;
mem[(size_t)(wptr & (depth - 1))] = v;
wptr = (wptr + 1) & ((2 * depth) - 1);
return true;
}
bool pop(uint32_t* out) {
if (empty()) return false;
*out = mem[(size_t)(rptr & (depth - 1))];
rptr = (rptr + 1) & ((2 * depth) - 1);
return true;
}
int depth;
std::vector<uint32_t> mem;
int wptr{0};
int rptr{0};
};
extern "C" {
// Create a FIFO. depth must be power-of-2.
void* fifo_create(int depth) {
try {
return new FifoU32(depth);
} catch (...) {
return nullptr;
}
}
void fifo_destroy(void* h) {
delete static_cast<FifoU32*>(h);
}
void fifo_reset(void* h) {
if (!h) return;
static_cast<FifoU32*>(h)->reset();
}
// Returns 1 on success, 0 on full
int fifo_push_u32(void* h, unsigned int v) {
if (!h) return 0;
return static_cast<FifoU32*>(h)->push((uint32_t)v) ? 1 : 0;
}
// Returns 1 on success, 0 on empty
int fifo_pop_u32(void* h, unsigned int* out_v) {
if (!h || !out_v) return 0;
uint32_t tmp = 0;
if (!static_cast<FifoU32*>(h)->pop(&tmp)) return 0;
*out_v = (unsigned int)tmp;
return 1;
}
int fifo_empty(void* h) {
if (!h) return 1;
return static_cast<FifoU32*>(h)->empty() ? 1 : 0;
}
int fifo_full(void* h) {
if (!h) return 0;
return static_cast<FifoU32*>(h)->full() ? 1 : 0;
}
} // extern "C"
FIFO C++ Header
// cpp_async_fifo.h
// -----------------------------------------------------------------------------
// Simple parameterizable FIFO model in C++ (power-of-2 depth)
// Used as a pure C++ golden model *or* inside DPI wrappers.
// -----------------------------------------------------------------------------
#pragma once
#include <array>
#include <cstddef>
#include <cstdint>
// Depth must be power-of-2.
template <typename T, std::size_t DEPTH>
class AsyncFifo {
static_assert((DEPTH & (DEPTH - 1)) == 0, "DEPTH must be power-of-2");
public:
AsyncFifo() { reset(); }
void reset() {
wptr_ = 0;
rptr_ = 0;
}
bool empty() const { return wptr_ == rptr_; }
bool full() const {
// full when lower bits match but MSB differs
const std::size_t mask = DEPTH - 1;
return ((wptr_ & mask) == (rptr_ & mask)) && ((wptr_ ^ rptr_) & DEPTH);
}
bool push(const T& v) {
if (full()) return false;
mem_[wptr_ & (DEPTH - 1)] = v;
wptr_ = (wptr_ + 1) & ((DEPTH * 2) - 1);
return true;
}
bool pop(T& out) {
if (empty()) return false;
out = mem_[rptr_ & (DEPTH - 1)];
rptr_ = (rptr_ + 1) & ((DEPTH * 2) - 1);
return true;
}
private:
std::array<T, DEPTH> mem_{};
// pointers carry one extra wrap bit -> range [0, 2*DEPTH)
std::size_t wptr_{0};
std::size_t rptr_{0};
};
FIFO TB in SV
// sv_tb_dpi.sv
// -----------------------------------------------------------------------------
// SystemVerilog testbench that imports a C++ FIFO model via DPI-C.
//
// Questa compile notes (one common flow):
// - Compile SV + C together with vlog and autolink
// - Or compile C into a shared library and load with vsim -sv_lib
// Example commands appear in many guides; one common pattern is:
// vlog -dpiheader dpi_hdr.h dpi_func.c tb_top.sv
// vsim -c work.tb_top -do "run -all; quit"
// -----------------------------------------------------------------------------
module tb_dpi_fifo;
import "DPI-C" function chandle fifo_create(input int depth);
import "DPI-C" function void fifo_destroy(input chandle h);
import "DPI-C" function void fifo_reset(input chandle h);
import "DPI-C" function int fifo_push_u32(input chandle h, input int unsigned v);
import "DPI-C" function int fifo_pop_u32 (input chandle h, output int unsigned v);
import "DPI-C" function int fifo_empty(input chandle h);
import "DPI-C" function int fifo_full (input chandle h);
chandle h;
int unsigned got;
localparam int DEPTH = 16;
initial begin
h = fifo_create(DEPTH);
if (h == null) $fatal(1, "fifo_create failed");
fifo_reset(h);
// push 0..DEPTH-1
for (int i = 0; i < DEPTH; i++) begin
if (!fifo_push_u32(h, i)) $fatal(1, "unexpected full at i=%0d", i);
end
if (!fifo_full(h)) $display("NOTE: fifo_full() did not assert; check full policy");
// pop and check
for (int i = 0; i < DEPTH; i++) begin
if (!fifo_pop_u32(h, got)) $fatal(1, "unexpected empty at i=%0d", i);
if (got !== i[31:0]) $fatal(1, "mismatch exp=%0d got=%0d", i, got);
end
if (!fifo_empty(h)) $display("NOTE: fifo_empty() did not assert; check empty policy");
fifo_destroy(h);
$display("PASS");
$finish;
end
endmodule
HW Architecture
Caches
DRAM Subsystem Homepage
Introduction
The DRAM subsystem exists to provide the Atalla accelerator with access to off-chip memory required by AI workloads that exceed on-chip SRAM limits. Because DRAM access involves strict command sequencing and long latencies, a controller and memory bus are required to manage these accesses. The subsystem’s current goal is a non-blocking architecture that improves bandwidth by overlapping memory requests.
This page serves as the central home for the DRAM Subsystem. It consolidates RTL diagrams, active projects, reports, presentations, and ramp-up material. Use the links below to navigate based on what you are looking for.
- New to the DRAM subsystem?
  Follow the ramp-up guide for background, architecture context, and recommended resources -> Ramp-Up Guide
- Working on the DRAM subsystem?
  View active projects, current contributors, development branches, and documentation -> Active Projects
- Looking for past reports/presentations?
  View past reports, presentations, abstracts, and posters made by the DRAM subsystem -> Past Reports and Presentations
- Looking for completed projects?
  View completed projects by the DRAM subsystem -> Completed Projects
Ramp-Up Guide
This section is for new students joining the DRAM subsystem and serves as a starting point for getting up to speed. It includes background material and resources to help you understand the design and begin contributing.
- Introductory DRAM Overview (Recommended Starting Point)
  A high-level video explaining the basic structure and operation of DRAM. This is an excellent first exposure and helps build intuition before diving into more technical material.
  https://www.youtube.com/watch?v=7J7X7aZvMXQ&t=47s
- Memory Systems: Cache, DRAM, Disk – Jacob, Ng, and Wang
  Chapters 10-13 are required reading; they cover DRAM organization, timing, and the memory system in depth.
  https://purdue.primo.exlibrisgroup.com/discovery/fulldisplay?docid=alma99169138574101081
- Understanding DDR4 Timing Parameters
  A short reference page summarizing DDR4 timing parameters and constraints.
  https://www.systemverilog.io/design/understanding-ddr4-timing-parameters/
- JEDEC DDR4 Standard (JESD79-4C)
  The official DDR4 specification defining all command sequences, timing requirements, and constraints.
  https://raw.githubusercontent.com/RAMGuide/TheRamGuide-WIP-/main/DDR4%20Spec%20JESD79-4C.pdf
- ETH Zurich Lecture: DRAM Controllers (Prof. Onur Mutlu)
  An in-depth lecture covering DRAM controller design, performance challenges, and architectural tradeoffs.
  https://www.youtube.com/watch?v=TeG773OgiMQ
Active Projects
This section documents the currently active DRAM subsystem projects, including their purpose, implementation status, code locations, and points of contact.
Non-Blocking DRAM Controller
Description: The goal of this project is to design a non-blocking DRAM controller that allows multiple memory requests to be in flight simultaneously to improve bandwidth utilization. The design uses a row-open policy and bank-specific request queues to hide memory latency and enable memory-level parallelism.
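The row-open policy and bank-specific queues described above can be illustrated with a small software model. This is only a sketch under stated assumptions: the `DramAddr` field split, the `BankQueues` class, and the Hit/Miss/Conflict naming are hypothetical and are not taken from the controller RTL.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <optional>
#include <vector>

// Hypothetical decoded address; the real row/bank/column mapping is a design choice.
struct DramAddr { uint32_t row; uint32_t bank; uint32_t col; };

// Hit: target row already open. Miss: bank precharged, needs ACT.
// Conflict: a different row is open, needs PRE followed by ACT.
enum class RowState { Hit, Miss, Conflict };

struct Bank {
    std::optional<uint32_t> open_row;  // empty while the bank is precharged
    std::deque<DramAddr> queue;        // pending requests for this bank only
};

class BankQueues {
public:
    explicit BankQueues(std::size_t num_banks) : banks_(num_banks) {}

    // Requests are steered to the queue of the bank they target.
    void enqueue(const DramAddr& a) {
        assert(a.bank < banks_.size());
        banks_[a.bank].queue.push_back(a);
    }

    // Compare the head request of one bank against that bank's open row.
    RowState classify_head(uint32_t bank) const {
        const Bank& b = banks_.at(bank);
        assert(!b.queue.empty());
        if (!b.open_row) return RowState::Miss;
        if (*b.open_row == b.queue.front().row) return RowState::Hit;
        return RowState::Conflict;
    }

    // Service the head request. Under a row-open policy the accessed row
    // is left open so that later requests to it become hits.
    RowState service_head(uint32_t bank) {
        RowState s = classify_head(bank);
        Bank& b = banks_.at(bank);
        b.open_row = b.queue.front().row;
        b.queue.pop_front();
        return s;
    }

private:
    std::vector<Bank> banks_;
};
```

Because each bank has its own queue, a conflict in one bank does not prevent a hit in another bank from being serviced, which is where the memory-level parallelism comes from.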
Contributors
- Jason Lyst (jlyst@purdue.edu)
- Adrian Buczkowski (abuczko@purdue.edu)
- Eddie Hu (hu927@purdue.edu)
- Shams Hoque (hoques@purdue.edu)
RTL Diagrams
This section links the location of all Block-/RTL-diagrams that were made for this design: https://app.diagrams.net/#G18bqekF9I8oZJpSTm-BcsDvPkOPy_cdul#%7B%22pageId%22%3A%22fpKTT8HEuwSpTkvlEaWT%22%7D
Active Branches
This section links the location of active branches that are being used for the design:
- Main DRAM Branch: https://github.com/Purdue-SoCET/atalla/tree/memory_subsystem_dram
Verification
This section links the location of verification related documents like verification plans:
Design Documentation/Resources
This section links any documentation or resources that were used specifically for this design. This includes meeting notes, design logs, research papers, etc.
- https://ieeexplore.ieee.org/document/7108455
- https://cdn.discordapp.com/attachments/1412834335983272129/1421965427600392203/dram_controller_non_block_idea.pdf?ex=69792800&is=6977d680&hm=67a75be2ec3b113caa3017cc4007acdce61c7919d64f179cc3b22d7cfcef2005&
- DDR4 MICRON Model: https://drive.google.com/file/d/1CKYhZJe7rzhp_2ATkkAfWrufMl-Lt6jW/view?usp=sharing
Split-Transaction Interconnect
Description: The goal of this project is to design a split-transaction memory bus that can manage simultaneous in-flight requests from caches/scratchpad and simultaneous in-flight responses from the DRAM controller.
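One common way to keep several requests in flight on a split-transaction bus is to tag each request and match responses by tag. The sketch below illustrates that idea only; `TagTable`, `Pending`, and the tag-allocation policy are hypothetical names and choices, not the interconnect's actual design.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// What the bus must remember about a request while it is in flight.
struct Pending { uint8_t requester; uint64_t addr; };

class TagTable {
public:
    explicit TagTable(std::size_t num_tags) : slots_(num_tags) {}

    // Request phase: claim a free tag, or return nullopt when every tag is
    // in flight (the requester would then have to stall).
    std::optional<std::size_t> issue(uint8_t requester, uint64_t addr) {
        for (std::size_t t = 0; t < slots_.size(); ++t) {
            if (!slots_[t]) {
                slots_[t] = Pending{requester, addr};
                return t;
            }
        }
        return std::nullopt;
    }

    // Response phase: responses may return in any order; the tag identifies
    // the original request and is freed for reuse.
    Pending complete(std::size_t tag) {
        assert(tag < slots_.size() && slots_[tag].has_value());
        Pending p = *slots_[tag];
        slots_[tag].reset();
        return p;
    }

private:
    std::vector<std::optional<Pending>> slots_;
};
```

The number of tags bounds the number of outstanding transactions, so it is a natural tuning knob when sizing the bus against DRAM latency.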
Contributors
- Aryan Kadakia (kadakia0@purdue.edu)
- Xinyu Liu (liu3680@purdue.edu)
RTL Diagrams
This section links the location of all Block-/RTL-diagrams that were made for this design: https://app.diagrams.net/#G18bqekF9I8oZJpSTm-BcsDvPkOPy_cdul#%7B%22pageId%22%3A%22fpKTT8HEuwSpTkvlEaWT%22%7D
Active Branches
This section links the location of active branches that are being used for the design:
- Aryan Kadakia’s Branch: https://github.com/Purdue-SoCET/atalla/tree/memory_subsystem_aryan#
- Main DRAM Branch: https://github.com/Purdue-SoCET/atalla/tree/memory_subsystem_dram
Verification
This section links the location of verification related documents like verification plans:
Design Documentation/Resources
This section links any documentation or resources that were used specifically for this design. This includes meeting notes, design logs, research papers, etc.
- https://developer.arm.com/documentation/102202/0300/AXI-protocol-overview
- https://www.cis.upenn.edu/~cis5710/spring2024/slides/13_axi.pdf
Ramulator Simulator
Description: The goal of this project is to understand the Ramulator simulator and design an interface that connects the split-transaction bus to the simulator.
Contributors
- Heng-I (Ivor) Chu (chu244@purdue.edu)
- Yichen Tian (tian182@purdue.edu)
RTL Diagrams
This section links the location of all Block-/RTL-diagrams that were made for this design:
Active Branches
This section links the location of active branches that are being used for the design:
- Main DRAM Branch: https://github.com/Purdue-SoCET/atalla/tree/memory_subsystem_dram
Verification
This section links the location of verification related documents like verification plans:
Design Documentation/Resources
This section links any documentation or resources that were used specifically for this design. This includes meeting notes, design logs, research papers, etc.
- https://github.com/CMU-SAFARI/ramulator2
Past Reports and Presentations
This section lists all past final reports, presentations, abstracts, and any other resource that was made by the DRAM subsystem.
Final Reports
- Fall 2025 Final Report: https://docs.google.com/document/d/1fIBgyiB3g3OImUYkugq2czNFDmUUhIS_DO6sGcxqxDY/edit?usp=sharing
- Spring 2025 Final Report: https://docs.google.com/document/d/1J7sHHt2H2yTATN91Cda57GuU_zQYz0v8/edit?usp=sharing&ouid=112766930685277737014&rtpof=true&sd=true
Design Review Presentations
- Fall 2025 Design Review 1: https://purdue0-my.sharepoint.com/:p:/g/personal/khatri12_purdue_edu/IQDGPSESbApgR6gN1-VNOP-jAaiWMO0WYFC02s4PZF21BJo?e=WBqsit
- Fall 2025 Design Review 2: https://purdue0-my.sharepoint.com/:p:/g/personal/khatri12_purdue_edu/IQBl4rWfHvqdT4ZHBKWouYAqAawxNiz32_OpBjLJoTGnRlo?e=8q3a8d
- Spring 2025 Design Review: https://purdue0-my.sharepoint.com/:p:/g/personal/khatri12_purdue_edu/IQCThlgQtvIaQbjwHBnV-sbAAddkki-xSmsUfsxGS6OnpZE?e=lCDwWu
Abstracts
- Fall 2025 Abstract: https://purdue0-my.sharepoint.com/:w:/g/personal/khatri12_purdue_edu/IQANqWleEbkvT5E5I8LGPdeWAcWw7mlhg-Q2tpLF6bX1JFc?e=gO85O0
Poster Presentation
- Fall 2025 Poster Presentation: https://purdue0-my.sharepoint.com/:p:/g/personal/khatri12_purdue_edu/IQC-xhWrYXVmSrW5sevR7zQ1AegyEuAUWBxusv5jGvgEdPo?e=OTz2Ys
Completed Projects
Blocking DRAM Controller
Description: The goal of this project was to design a fully functional DRAM controller that interfaces with a DDR4 model.
Contributors
- Tri Than (than0@purdue.edu)
- Dhruv Khatri (khatri12@purdue.edu)
RTL Diagrams
This section links the location of all Block-/RTL-diagrams that were made for this design: https://app.diagrams.net/#G18bqekF9I8oZJpSTm-BcsDvPkOPy_cdul#%7B%22pageId%22%3A%22fpKTT8HEuwSpTkvlEaWT%22%7D
Active Branches
This section links the location of active branches that are being used for the design:
- Tri’s Branch: https://github.com/Purdue-SoCET/atalla/tree/memory_subsystem_tri
- Dhruv’s Branch: https://github.com/Purdue-SoCET/atalla/tree/memory_subsystem_dhruv
- Main DRAM Branch: https://github.com/Purdue-SoCET/atalla/tree/memory_subsystem_dram
Verification
This section links the location of verification related documents like verification plans:
Design Documentation/Resources
This section links any documentation or resources that were used specifically for this design. This includes meeting notes, design logs, research papers, etc.
- DDR4 MICRON Model: https://drive.google.com/file/d/1CKYhZJe7rzhp_2ATkkAfWrufMl-Lt6jW/view?usp=sharing
Scheduler
Scratchpad
Systolic Array
Vector Core
SW Systems
Compiler
Kernels
Introduction to the Atallax01 Programming Model
The Atallax01 Programming Model allows users to map C/C++ algorithms to a throughput-focused deep-learning accelerator architected around a VLIW vector datapath, a SW-managed Scratchpad, and a 32x32 BF16 Systolic Array.
Unlike GPUs, Atallax01 does not expose a wide SIMT/SIMD programming interface. Instead, it provides a tile-centric compute model where kernels explicitly optimize and orchestrate data movement, vector lane utilization, and Systolic Array computations through a single unified instruction stream. The workloads that will run on Atalla are highly regular and need pipelines that are both wide and deep, specifically for two-dimensional matrices.
Atallax01 is not a general-purpose processor. It is a core that will be placed alongside a high-performance CPU to form a heterogeneous compute platform. Programmability is offered through C/C++ and a custom compiler toolchain. We do not plan to support interpreted languages like Python.
This document defines:
- the hardware execution model
- the memory hierarchy
- the programming constructs available to the user
- tile-based pseudocode conventions
Hardware Differences
CPUs are computing machines that were fine-tuned over decades to minimize the latency of single-threaded instruction streams. They rely heavily on advanced techniques like prediction, speculation, and dynamic scheduling. SMT, which partitions register sets to time-multiplex independent software threads, was later added to a single CPU core to exploit even more instruction-level parallelism.
However, the industry saw the need for different machines to exploit the data-level parallelism seen in scientific and graphics workloads. GPUs are massively parallel computing machines, primarily programmed using the CUDA or HIP paradigms, which expose an implicit-SIMD perspective to the user. This enables the user to write scalar-threaded code in C++ that is compiled into SIMT binaries to utilize the SIMD execution units. CUDA's initial innovation was combining the hardware efficiency of SIMD with the programmability of SMT.
In recent years, GPUs have been adapted to cater to the demands of the deep learning ecosystem with the addition of Tensor Cores for matrix multiplications. TPUs grew in parallel, but were targeted purely at deep-learning workloads dominated by GEMMs/CONVs. Atallax01 targets these primitives directly and disregards the SIMT/SIMD abstractions. Users will write single-threaded code in C/C++ that directly defines tile-based descriptors for memory movement and vector-based kernels for compute datapaths. Thus, we say Atallax01 behaves more like a TPU than a GPU.
Heterogeneous Programming
Atallax01 is programmed using a heterogeneous host-device model, similar in spirit to CUDA/HIP but fundamentally simpler.
Host responsibilities include:
- Allocate DRAM tensors.
- Launch device kernels.
- Pass tile descriptors and kernel metadata.
Device responsibilities include:
- Move data between DRAM and on-chip SRAM.
- Swizzle data within the Scratchpad to enable row/column-major addressing.
- Load slices of the tiles into vector registers.
- Execute blocking vector load/store/compute operations to prime the Systolic Array, or utilize the execution lanes.
The compiler issues VLIW bundles into a mapped-space within the DRAM partition exposed to Atallax01. The on-chip scheduling unit enforces a tainted-VLIW scheme by checking dependencies through scoreboarding.
Memory Model
The Atallax01 memory system is software-managed, and does not enforce any hardware-managed ordering mechanisms. The datapath is in-order, with the SDMA instructions making SCPAD locations valid before later accesses take place.
Global Memory (DRAM):
- Large, high latency.
- Only accessible via SDMA instructions.
- Ideal for storing large tensors. Assume 8GB+ space.
Scratchpad Memory (SCPAD):
- 1MB on-chip SRAM, low latency.
- Only accessible via SDMA instructions.
- Two separate partitions indexed as SCPAD0 and SCPAD1.
Vector Register File (VEGGIE):
- [X-Size] SRAM vector-register-file.
- Only accessible via VM instructions.
- Intermediate tile-slice storage to send to Lanes/Systolic-Array.
Scalar Register File:
- [X-Size] SRAM lockup-free D-Cache.
- Implemented as a hardware-managed L1 Cache.
Systolic Array Accumulation Buffers:
- Not programmable. Hardware-controlled.
- Strided/Staggered collection and transfer of vectors into VEGGIE.
Execution Model
VLIW-based execution. Each cycle, the scheduler may issue one of [X] Packet types. The compiler ensures intra-bundle independence, with inter-bundle dependencies handled by the Scoreboards in the Scheduler Unit.
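To make the inter-bundle dependency check concrete, here is a minimal scoreboard sketch. It is an illustration under assumptions, not the Scheduler Unit's design: the register count `kNumRegs`, the `Op` record, and the one-bit-per-register policy are all hypothetical.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kNumRegs = 32;  // hypothetical register count

// One operation inside a bundle: a destination and its source registers.
struct Op {
    std::size_t dst;
    std::vector<std::size_t> srcs;
};

class Scoreboard {
public:
    // A bundle may issue only if no register it reads or writes still has
    // a write pending from an earlier bundle (RAW and WAW hazards).
    bool can_issue(const std::vector<Op>& bundle) const {
        for (const Op& op : bundle) {
            if (busy_.test(op.dst)) return false;
            for (std::size_t s : op.srcs) {
                if (busy_.test(s)) return false;
            }
        }
        return true;
    }

    // Mark every destination of the bundle as pending.
    void issue(const std::vector<Op>& bundle) {
        assert(can_issue(bundle));
        for (const Op& op : bundle) busy_.set(op.dst);
    }

    // Clear the pending bit once the result has been written back.
    void writeback(std::size_t reg) { busy_.reset(reg); }

private:
    std::bitset<kNumRegs> busy_;
};
```

Intra-bundle independence is not checked here because, as stated above, the compiler is responsible for it.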
In the following sections, we will focus on explaining the different “concepts” to keep in mind before developing code for Atallax01. Following this, we will discuss abstracted kernels which utilize these concepts.
Abstract Entities:
TileDesc - 2D block of memory Global/Scpad (described by shape + strides)
GlobalRegion - Where in Global Memory
GlobalTile – N-D tensor in off-chip DRAM, has-a Global Region
ScpadRegion - Which Scratchpad and where inside the Scratchpad
ScpadTile – 2D tensor in on-chip SRAM, has-a TileDesc and ScpadRegion
VectorReg[v] – vector register(s) in the vector core
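As a rough illustration, the entities above could be modeled with plain structs. The field names, widths, and the `element_offset` helper are assumptions made for this sketch and may differ from the stub library; in particular, `GlobalTile` is simplified to 2D here even though it is described as N-D.

```cpp
#include <cassert>
#include <cstdint>

// 2D block described by shape + strides (strides counted in elements).
struct TileDesc {
    uint32_t rows, cols;
    uint32_t row_stride, col_stride;
};

struct GlobalRegion { uint64_t base; };                       // where in DRAM
struct ScpadRegion  { uint8_t scpad_id; uint32_t offset; };   // which SCPAD, and where inside it

struct GlobalTile { TileDesc desc; GlobalRegion region; };    // 2D simplification of the N-D tensor
struct ScpadTile  { TileDesc desc; ScpadRegion region; };     // has-a TileDesc and ScpadRegion

// Element offset of (r, c) relative to the start of the tile's region.
inline uint64_t element_offset(const TileDesc& d, uint32_t r, uint32_t c) {
    assert(r < d.rows && c < d.cols);
    return static_cast<uint64_t>(r) * d.row_stride +
           static_cast<uint64_t>(c) * d.col_stride;
}
```

Describing a tile by strides rather than a fixed layout means the same descriptor type covers both row-major (`row_stride = cols, col_stride = 1`) and column-major (`row_stride = 1, col_stride = rows`) views.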
Abstract Intrinsics:
SDMA_LD_* ScpadTile, GlobalTile
SDMA_ST_* ScpadTile, GlobalTile
VM_LD VectorReg[v], ScpadTile
VM_ST VectorReg[v], ScpadTile
VV_* VectorReg[v], VectorReg[v]
VV_* VectorReg[v], Imm
VS_* VectorReg[v], ScalarReg[v]
GEMMV ScpadTile C, ScpadTile A, ScpadTile B
CONV ScpadTile C, ScpadTile A, ScpadTile B
Kernels
General Matrix-Multiply (GEMM)
Atallax01 does not expose the 32x32 Systolic Array directly. Instead, we provide fixed-shape sub-kernels that operate on tiles no larger than 32x32.
Let’s define
TM – rows of the output tile (TM ≤ 32)
TN – cols of the output tile (TN ≤ 32)
TK – reduction dimension slice (TK ≤ 32)
A single GEMMV intrinsic consumes:
A_tile : TM × TK (activations)
B_tile : TK × TN (weights)
C_tile : TM × TN (partial sums / output)
and computes:
C_tile = A_tile · B_tile + C_tile
entirely inside the vector-core + systolic array microcode, blocking until the C tile in the Scratchpad (SCPAD) is updated.
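The arithmetic of a single GEMMV call can be written as a scalar reference model, which is useful for checking results on the host. This is a sketch: `gemmv_ref` is a hypothetical name, the tiles are assumed to be dense row-major arrays, and `float` stands in for BF16.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference semantics of one GEMMV call: C_tile = A_tile * B_tile + C_tile.
// A is TM x TK, B is TK x TN, C is TM x TN, all row-major.
void gemmv_ref(std::vector<float>& C,
               const std::vector<float>& A,
               const std::vector<float>& B,
               std::size_t TM, std::size_t TN, std::size_t TK) {
    assert(TM <= 32 && TN <= 32 && TK <= 32);  // tile limits from the text
    assert(A.size() == TM * TK && B.size() == TK * TN && C.size() == TM * TN);
    for (std::size_t m = 0; m < TM; ++m) {
        for (std::size_t n = 0; n < TN; ++n) {
            float acc = C[m * TN + n];          // start from the existing partial sum
            for (std::size_t k = 0; k < TK; ++k) {
                acc += A[m * TK + k] * B[k * TN + n];
            }
            C[m * TN + n] = acc;
        }
    }
}
```

Because C is both an input and an output, repeated calls over successive K slices accumulate into the same output tile.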
Tiling/Grouping
Given a standard GEMM of general dimensions
C[M × N] = A[M × K] · B[K × N]
we can decompose it into the following number of tiles:
MT = ceil(M / TM)
NT = ceil(N / TN)
KT = ceil(K / TK)
Each output tile C[i,j] (for 0 ≤ i < MT, 0 ≤ j < NT) is defined as:
C_tile(i,j) = C[ i*TM : (i+1)*TM, j*TN : (j+1)*TN ]
A_tile(i,k) = A[ i*TM : (i+1)*TM, k*TK : (k+1)*TK ]
B_tile(k,j) = B[ k*TK : (k+1)*TK, j*TN : (j+1)*TN ]
All three of these tiles are loaded into on-chip SRAM as ScpadTiles before a GEMMV call.
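The tile counts and slice bounds above reduce to a little integer arithmetic. The helpers below are a sketch with hypothetical names (`ceil_div`, `tile_range`). They assume edge tiles are clamped to the matrix bounds when a dimension is not a multiple of the tile size; the text does not say whether the hardware clamps or zero-pads.

```cpp
#include <algorithm>
#include <cassert>

// Number of tiles needed to cover `dim` elements with tiles of size `tile`.
constexpr int ceil_div(int dim, int tile) { return (dim + tile - 1) / tile; }

struct Range { int begin; int end; };  // half-open interval [begin, end)

// Element range covered by tile number `idx` along one dimension.
// The last tile is clamped (assumption) when dim % tile != 0.
inline Range tile_range(int idx, int tile, int dim) {
    assert(tile > 0 && idx >= 0 && idx < ceil_div(dim, tile));
    return Range{idx * tile, std::min((idx + 1) * tile, dim)};
}
```

For example, with M = 100 and TM = 32 there are `ceil_div(100, 32) = 4` row tiles, and the last one covers rows 96 to 99 only.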
Below, we define the tiling logic:
#include <vector>
using std::vector;

// GlobalTile and make_tile(T, row, col, rows, cols) are assumed to be provided
// elsewhere; make_tile takes the element offset of the tile origin.
struct TileGroupDesc {
    GlobalTile A_g;  // TM x TK slice of A in DRAM
    GlobalTile B_g;  // TK x TN slice of B in DRAM
    GlobalTile C_g;  // TM x TN slice of C in DRAM
    int i, j, k;     // element offsets of the tile origin (row, col, k reduction)
};

vector<TileGroupDesc> plan_gemmv(
    GlobalTile A, GlobalTile B, GlobalTile C,
    int M, int N, int K,
    int TM, int TN, int TK
) {
    vector<TileGroupDesc> groups;
    for (int i = 0; i < M; i += TM) {
        for (int j = 0; j < N; j += TN) {
            GlobalTile C_g = make_tile(C, i, j, TM, TN);
            // k is the innermost loop, so all groups for one output tile are contiguous.
            for (int k = 0; k < K; k += TK) {
                GlobalTile A_g = make_tile(A, i, k, TM, TK);
                GlobalTile B_g = make_tile(B, k, j, TK, TN);
                groups.push_back(TileGroupDesc{.A_g = A_g, .B_g = B_g, .C_g = C_g, .i = i, .j = j, .k = k});
            }
        }
    }
    return groups;
}
Execution Loop
Note: _alloc_scpad0, _alloc_scpad1 and _gemmv are functions defined in the stub library we provide with the E2E stack. They work at the vector-register level.
bool execute_gemmv(
    GlobalTile A, GlobalTile B, GlobalTile C,
    int M, int N, int K
) {
    const int TM = ...; // ≤ 32
    const int TN = ...; // ≤ 32
    const int TK = ...; // ≤ 32
    vector<TileGroupDesc> groups = plan_gemmv(A, B, C, M, N, K, TM, TN, TK);
    // plan_gemmv emits groups with k innermost, so the groups that share an
    // output tile (i, j) are contiguous and already ordered by k.
    std::size_t g = 0;
    while (g < groups.size()) {
        const int i = groups[g].i;
        const int j = groups[g].j;
        GlobalTile C_g = groups[g].C_g;
        ScpadTile sc_C = _alloc_scpad1(TM, TN);
        SDMA_LD_1(sc_C, C_g);
        for (; g < groups.size() && groups[g].i == i && groups[g].j == j; ++g) {
            // Placing A and B in different Scratchpad partitions lets the compiler
            // packetize their loads so that they overlap with compute.
            ScpadTile sc_A = _alloc_scpad0(TM, TK);
            ScpadTile sc_B = _alloc_scpad1(TK, TN);
            SDMA_LD_0(sc_A, groups[g].A_g);
            SDMA_LD_1(sc_B, groups[g].B_g);
            _gemmv(sc_C, sc_A, sc_B);  // sc_C += sc_A * sc_B
        }
        SDMA_ST_1(sc_C, C_g);
    }
    return true;  // no failure path is modeled in this sketch
}
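For checking a kernel on the host, the same (i, j, k) loop order can be mirrored in ordinary C++ and compared against a naive GEMM. This is a software-only sketch: `tiled_gemm` and `naive_gemm` are hypothetical names, matrices are assumed to be dense row-major `float` arrays, C is assumed to start at zero, and edge tiles are clamped rather than padded.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<float>;  // dense, row-major

// Straightforward reference: C[M x N] = A[M x K] * B[K x N].
Matrix naive_gemm(const Matrix& A, const Matrix& B,
                  std::size_t M, std::size_t N, std::size_t K) {
    Matrix C(M * N, 0.0f);
    for (std::size_t i = 0; i < M; ++i)
        for (std::size_t j = 0; j < N; ++j)
            for (std::size_t k = 0; k < K; ++k)
                C[i * N + j] += A[i * K + k] * B[k * N + j];
    return C;
}

// Same result, computed tile by tile in the order used by the execution loop:
// for each output tile (i0, j0), accumulate over the K slices k0.
Matrix tiled_gemm(const Matrix& A, const Matrix& B,
                  std::size_t M, std::size_t N, std::size_t K,
                  std::size_t TM, std::size_t TN, std::size_t TK) {
    assert(TM > 0 && TN > 0 && TK > 0);
    assert(A.size() == M * K && B.size() == K * N);
    Matrix C(M * N, 0.0f);
    for (std::size_t i0 = 0; i0 < M; i0 += TM) {
        const std::size_t i1 = std::min(i0 + TM, M);  // clamp edge tile
        for (std::size_t j0 = 0; j0 < N; j0 += TN) {
            const std::size_t j1 = std::min(j0 + TN, N);
            for (std::size_t k0 = 0; k0 < K; k0 += TK) {
                const std::size_t k1 = std::min(k0 + TK, K);
                // This inner block plays the role of one GEMMV call.
                for (std::size_t i = i0; i < i1; ++i)
                    for (std::size_t j = j0; j < j1; ++j)
                        for (std::size_t k = k0; k < k1; ++k)
                            C[i * N + j] += A[i * K + k] * B[k * N + j];
            }
        }
    }
    return C;
}
```

With real floating-point data the two versions can differ in the last bits if the summation order changes, so comparisons against hardware output usually need a tolerance; BF16 accumulation would widen that tolerance further.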