Start
This tutorial shows how to make a project for the Zynq chip that implements DMA (Direct Memory Access). We will use the RealDigital 4x2 RFSoC board running PYNQ version 3.0.1, and Vivado 2024.2. You should probably look at the tutorial on GPIO first to be familiar with Vivado and how to set things up.
Create a new project, call it something like "dma", selecting the 4x2 in the "Default Parts" menu (see above tutorial). Click on "Create Block Design" in Vivado under "IP Integrator", call the top level something like "top", and in the empty diagram instantiate a Zynq block and click "Run Block Automation" with the defaults. That should get you a Diagram window with a Zynq ready to go.
Next let's add the constraints file as in the last tutorial, so that the bit file is compressed and there are temperature protections for the Zynq chip.
AXI Bus
AXI stands for "Advanced eXtensible Interface". This is a high-performance bus protocol designed for connecting the different parts of a "system-on-chip" (SoC), so it's used to connect the ARM processor cores to the FPGA and the RF Converter circuits on the Zynq RFSoC. It has a high bandwidth for burst transfers, separate read and write channels, handshaking, and a few other niceties.
There are 3 ways to use AXI: AXI4 (high-speed memory-mapping), AXI4-Lite (register-level control), and AXI4-Stream (data flow, especially ADC/DAC streaming). AXI4 is the one with full functionality. In it there are 5 independent "channels":
AW Master $\to$ slave, address for write transactions
W Master $\to$ slave, write data
B Slave $\to$ master, write response
AR Master $\to$ slave, address for read transactions
R Slave $\to$ master, read data
Each channel has a "handshaking" using the valid (sender asserts when data is ready to be latched) and ready (receiver asserts when it's ready for new data) lines. A transfer is completed when both are high in the same clock cycle. That means in the transmitter, there's a state machine that puts the data on the bus and asserts valid without waiting for ready, holds both steady until it sees ready high on a clock edge, and then either presents the next data or deasserts valid.
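To make the handshake concrete, here is a minimal cycle-by-cycle sketch in Python. It is only a software model for illustration, not anything that runs on the board, and the function name simulate_handshake and its arguments are made up for this example. It assumes the sender keeps valid high whenever it has a word pending.

```python
def simulate_handshake(data, ready_pattern):
    """Model one AXI channel, one loop iteration per clock cycle.

    data          : words the sender wants to transfer, in order
    ready_pattern : the receiver's ready line, one value per cycle
    Returns (received_words, cycles_at_which_each_transfer_happened).
    """
    received = []
    cycles = []
    idx = 0  # index of the word currently held on the bus
    for cycle, ready in enumerate(ready_pattern):
        # The sender asserts valid whenever it has data; it does not
        # look at ready before doing so.
        valid = idx < len(data)
        if valid and ready:
            # Both lines high in the same cycle: the word is transferred.
            received.append(data[idx])
            cycles.append(cycle)
            idx += 1
        # Otherwise the sender keeps the same word and valid unchanged.
    return received, cycles


words, when = simulate_handshake([0xA, 0xB, 0xC], [0, 1, 1, 0, 0, 1, 1])
print(words, when)
```

With this ready pattern the receiver stalls in cycles 0, 3 and 4, so the third word simply waits on the bus until ready comes back.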
Add AXI to the project
Now in the Diagram, click on "Add IP" and add "AXI Direct Memory Access". Let's rename this block from the default, "axi_dma_0" to something like "AXI_DMA". Remember, this name will be in the python PYNQ dictionary so we should make it easy to access.
What we want to do is stream data between the 4GBytes of DDR memory on the board and the AXI bus. In the DMA block, read means reading from the DDR memory, and write means writing to the memory. On the block, you will see a port labeled "M_AXI_MM2S", and that is the memory-mapped AXI4 bus master port for reading from memory. The "MM2S" is short for "memory mapped to stream". There is another bus master called "M_AXI_S2MM", and that is the memory-mapped AXI4 port for writing to memory, where "S2MM" means "stream to memory mapped". If you mouse over the $+$ sign next to the "M_AXI_MM2S" port on the AXI block, you will see the cursor change to a double chevron symbol:
If you click on that $+$ symbol it will open up all of the lines on that "MM2S" port, and you will see all of the channels there. For instance you will see a "m_axi_mm2s_arready" port, the ready line for the AR channel. And so on.
The DMA block also has an AXI-Lite control port, which is used to configure the DMA, read back its status, and start and stop transfers.
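As a rough sketch of what "configure, start, and read back status" looks like over that control port, the Python below walks through a simple (non scatter-gather) MM2S transfer. The register offsets and bit positions are quoted from memory of the AXI DMA product guide and should be checked against it. FakeRegs is a made-up stand-in so the example runs without hardware, and start_mm2s is a hypothetical helper name.

```python
# Offsets in bytes, as recalled from the AXI DMA product guide (assumption).
MM2S_DMACR = 0x00   # control register
MM2S_DMASR = 0x04   # status register
MM2S_SA = 0x18      # source address (lower 32 bits)
MM2S_LENGTH = 0x28  # transfer length in bytes

DMACR_RS = 1 << 0      # run/stop bit
DMASR_HALTED = 1 << 0  # channel is halted
DMASR_IDLE = 1 << 1    # channel has finished and is idle


class FakeRegs:
    """Hypothetical stand-in for the AXI-Lite register window."""

    def __init__(self):
        self.mem = {MM2S_DMASR: DMASR_HALTED}
        self.log = []  # every write, in order, for inspection

    def write(self, offset, value):
        self.log.append((offset, value))
        self.mem[offset] = value
        if offset == MM2S_DMACR and value & DMACR_RS:
            self.mem[MM2S_DMASR] &= ~DMASR_HALTED  # channel starts running
        if offset == MM2S_LENGTH:
            self.mem[MM2S_DMASR] |= DMASR_IDLE     # pretend it finished at once

    def read(self, offset):
        return self.mem.get(offset, 0)


def start_mm2s(regs, src_addr, nbytes):
    """Start one memory-to-stream transfer and wait until the channel is idle."""
    regs.write(MM2S_DMACR, regs.read(MM2S_DMACR) | DMACR_RS)  # set run
    regs.write(MM2S_SA, src_addr & 0xFFFFFFFF)                # where to read
    regs.write(MM2S_LENGTH, nbytes)                           # this write kicks it off
    while not regs.read(MM2S_DMASR) & DMASR_IDLE:
        pass                                                  # poll status


regs = FakeRegs()
start_mm2s(regs, 0x10000000, 4096)
print([(hex(o), hex(v)) for o, v in regs.log])
```

On the board, any object with the same read(offset) and write(offset, value) methods could take the place of FakeRegs. In practice the PYNQ DMA driver does this sequence for you.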
Scatter-Gather
Scatter-gather (SG) is where data can be transferred from fragmented or disjoint memory locations onto the AXI bus. PYNQ doesn't directly support SG, but it does work and can be implemented. Without SG, DMA transfers usually require memory buffers to be contiguous, which is fine unless you want to transfer huge waveforms to the RF Converter, which is what we want to do eventually! If you type the following command in Linux:
cat /proc/meminfo
it will show you what's in the /proc/meminfo file, which is a snapshot of memory statistics. On our 4x2 board, which has a 32G SD card, it tells us that we have 4G of memory total, with 2.8G free and 3.2G available. The lines labeled "CmaTotal" and "CmaFree" tell you how much contiguous memory there is and how much of it is free to use. On our board, CmaTotal is 131MBytes and CmaFree is 120MBytes. Waveforms sent to the RFSoC RF Converter have 4 bytes per word (2 bytes for I and 2 for Q), so if you have 120MBytes of free contiguous memory, you can send a waveform that has no more than 30M samples.
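The same arithmetic can be done in Python. The sketch below pulls the free contiguous memory figure out of the meminfo text and divides by 4 bytes per I/Q sample. The function names are made up for this example, and the sample text is illustrative rather than copied from our board. It assumes the values are reported in kB of 1024 bytes.

```python
import re

BYTES_PER_SAMPLE = 4  # 2 bytes for I plus 2 bytes for Q


def cma_free_bytes(meminfo_text):
    """Return the free contiguous memory in bytes, or None if not listed."""
    match = re.search(r"^CmaFree:\s+(\d+)\s+kB", meminfo_text,
                      re.MULTILINE | re.IGNORECASE)
    return int(match.group(1)) * 1024 if match else None


def max_samples(meminfo_text):
    """Largest waveform, in samples, that fits in free contiguous memory."""
    free = cma_free_bytes(meminfo_text)
    return None if free is None else free // BYTES_PER_SAMPLE


# Illustrative text only; these numbers are invented for the example.
sample = """MemTotal:        4000000 kB
CmaTotal:         131072 kB
CmaFree:          122880 kB
"""
print(max_samples(sample))
```

On the board you would pass in the real file contents, for example max_samples(open("/proc/meminfo").read()).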
With SG, you have to preload a descriptor table. SG can allow transfers of up to 8,388,608 bytes. We use this on the "cyclic" project in