Start

This tutorial shows how to make a project for the Zynq chip that implements DMA (Direct Memory Access). We will use the RealDigital 4x2 RFSoC board running PYNQ version 3.0.1, and Vivado 2024.2. You should probably look at the tutorial on GPIO first to be familiar with Vivado and how to set things up.

Create a new project, call it something like "dma", selecting the 4x2 in the "Default Parts" menu (see above tutorial). Click on "Create Block Design" in Vivado under "IP Integrator", call the top level something like "top", and in the empty diagram instantiate a Zynq block and click "Run Block Automation" with the defaults. That should get you a Diagram window with a Zynq ready to go. Click on the ZYNQ decal and change the name to "ZYNQ" in the "Block Properties" window that opens up to the left. If all went well you should see this:

Next let's add the constraints file as in the last tutorial, so that the bit file is compressed and there are temperature protections for the Zynq chip. To do this, click the $+$ symbol in the "Sources" tab, select "Add or create constraints" in the "Add Sources" window that pops up, hit Next, and in the next window click "Create File". Call it something (I used "top" so it's the same as the project) and click Finish. Then open "Constraints" in the "Sources" window and navigate to the file, it should be called "top.xdc". Double click and add these lines:

# enable the over-temperature shutdown feature
set_property BITSTREAM.CONFIG.OVERTEMPSHUTDOWN ENABLE [current_design]
# compress the bitstream to make it smaller
set_property BITSTREAM.GENERAL.COMPRESS TRUE [current_design]
Then save the file and you should be good to go with these constraints.

Next add the AXI DMA (explained below) by clicking on the $+$ icon in the Diagram menu bar (or right click in the diagram window and select "Add IP") and in the search window type "AXI DMA". That will bring up a list, so double click on "AXI Direct Memory Access". It will show an "AXI Direct Memory Access" decal. Click once on it and rename it to "AXI_DMA" in the "Block Properties" window. You should see something like this in the diagram window:

Before going further, we should get a bit into the AXI bus, what it does, etc.

AXI Bus

AXI stands for "Advanced eXtensible Interface". This is a high-performance bus protocol designed to connect the different parts of a "system-on-chip" (SoC) efficiently. The AMD (Xilinx) product guide for the AXI DMA IP, which documents how they implement it, is at https://docs.amd.com/r/en-US/pg021_axi_dma/Introduction.

One of AXI's primary uses is to move data between the processor's external DRAM (the 4x2 has 4 GBytes of 2400MHz DDR4 memory) and wherever you like on the Zynq RFSoC (e.g. the RF Converter circuits). It has a high bandwidth for burst transfers, separate read and write channels, handshaking, and a few other niceties.

There are 3 ways to use AXI: AXI4 (high-speed memory-mapping), AXI4-Lite (register-level control), and AXI4-Stream (data flow, especially ADC/DAC streaming). AXI4 is the one with full functionality. In it there are 5 independent "channels":

AW: Master $\to$ slave, write transactions, address
W: Master $\to$ slave, write transactions, data
B: Slave $\to$ master, write response
AR: Master $\to$ slave, read transactions, address
R: Slave $\to$ master, read transactions, data

Each channel does its own handshake using that channel's VALID and READY lines, and which side drives which line depends on the direction of the channel: the side sending the information asserts VALID and the side receiving it asserts READY. So if the master wants to write to the slave, the master asserts VALID, which tells the slave that the address and/or data values are available. The slave asserts READY when the data is latched; the master then releases VALID and the slave releases READY, which completes the handshake.

As an example, the figure below shows a transaction where the master is writing to a control register somewhere. You can see that AWVALID and WVALID both go high at the same time indicating that the address (here 0x0) and data (here 0x1) are ready. Then the slave asserts AWREADY indicating that it is latching the address, and WREADY indicating it is latching the data, on that clock edge when both valid and ready are high. But there is more to do - the slave has to determine if the transfer is complete, and if the data it received is legitimate, especially the address. So to do this, the master must also at some point (and this is usually early in the write operation) assert the ready line BREADY in the B channel when it is ready to receive the acknowledgement from the slave that the data is processed, in which case the slave asserts BVALID indicating that the operation is complete. Thus there are several handshakes going on, one in each of the channels AW, W, and B. All handshaking follows the rule that the handshake happens on the positive edge of the clock when both valid and ready are high. Note that for the B channel, when the slave asserts BVALID it also asserts BRESP, which is a 2-bit word that tells if the operation is successful: 00=okay, 01=exclusive access ok, 10=slave error (bad address, permission denied, etc), and 11=decode error (address doesn't exist). In the figure below, BRESP is indeed 00, which is the normal response code.
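To make the handshake rule and the BRESP codes a bit more concrete, here is a small pure-Python sketch. It is only a toy illustration and not a model of real hardware; the names `BRESP_CODES` and `handshake_cycles` are made up for this example, and the waveform values are invented.

```python
# Toy illustration of the AXI VALID/READY rule and the BRESP codes.
# All names and waveform values here are hypothetical.

BRESP_CODES = {
    0b00: "OKAY",
    0b01: "EXOKAY (exclusive access okay)",
    0b10: "SLVERR (slave error)",
    0b11: "DECERR (decode error)",
}

def handshake_cycles(valid, ready):
    """Return the clock-cycle indices where a transfer happens.

    valid and ready are lists of 0/1 values sampled on each rising
    clock edge; a transfer occurs only when both are 1 on the same edge.
    """
    return [i for i, (v, r) in enumerate(zip(valid, ready)) if v and r]

# The master raises AWVALID on cycle 1 and holds it until the slave
# answers with AWREADY on cycle 3, which is when the address is latched.
awvalid = [0, 1, 1, 1, 0]
awready = [0, 0, 0, 1, 0]

print(handshake_cycles(awvalid, awready))
print(BRESP_CODES[0b00])
```

The same function applies to any of the five channels, since they all use the same rule; only the side that drives VALID changes.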

This Project

In this project, what we want to do is to be able to DMA data into and out of the 4GBytes of DDR memory on the board that is connected to the processor, and from there stream via the AXI Stream into someplace like a fifo, or maybe directly into the RF Converter. So it will be important to understand the difference between the AXI ports that connect the FPGA to the processor (which is connected to the external DDR), and the AXI ports that stream that data into and out of fifos, RF converters, etc.

The basic idea is that we will use all 3 AXI interface types. AXI4-Lite will be connected between the processor (PS) and all AXI slaves that need register-level control (the DMA engine, and any GPIO blocks you might have). The AXI4 interface will be between the PS and the programmable logic (PL, the FPGA), to be able to read data from the PS memory and send it into the FPGA. The AXI4-Stream interface will be where the DMA engine streams data into the AXI FIFO. The DMA engine does a lot of heavy lifting as it is the interface between the PS and the PL!

The figure below is a schematic of the DMA ports that connect the ARM/DDR and stream (here a FIFO). We use a FIFO here that understands AXI protocol, but when you build a project on the RFSoC, what you will usually be doing is streaming data into the I and Q ports of the RF Converter for the modulation, which we won't go into here.

Configure DMA

Since we've already configured our project to use block design and instantiated the PS (Zynq chip) and the DMA engine, now we want to configure the DMA and connect it to the PS. So double click the DMA IP and open up the "Re-customize IP" settings window. You should see this:

Uncheck "Enable Scatter Gather Engine" for this tutorial, and set the "Width of Buffer Length Register (8-26)" to 26 if it's not there already. This sets the maximum size of a single DMA transfer: with 26 bits the length can be up to $2^{26}-1$ bytes, which is about 67 Mbytes (you can set it to a smaller value, but that apparently saves only a small amount of memory in the FPGA). Also set the "Address Width (32-64)" to 32. You could also use 64, since the 4x2 uses the Zynq Ultrascale+, which has a 64 bit processor, but most of the time we will only be doing DMA transfers of 32 bit words, so let's leave it at 32 here. Then set the DMA read and write channels to be enabled, and leave the write channel set to "Auto". The "Stream Data Width" should be 32 for this tutorial. And make sure "Allow Unaligned Transfers" is not enabled so that all transfers align on the 1st byte of each 32 bit word.
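The arithmetic behind the buffer length setting can be checked with a few lines of Python. This is just a sketch, assuming the length register counts bytes so that an n-bit register holds values up to $2^n-1$; the function name `max_transfer_bytes` is made up for this example.

```python
# Largest single transfer for a given "Width of Buffer Length Register".
# Assumption: the length register counts bytes, so an n-bit register
# can hold values up to 2**n - 1.

def max_transfer_bytes(width_bits):
    if not 8 <= width_bits <= 26:
        raise ValueError("width must be in the range 8-26")
    return 2**width_bits - 1

for width in (14, 23, 26):
    nbytes = max_transfer_bytes(width)
    # Each 32-bit word is 4 bytes, so this is the number of whole words.
    print(width, nbytes, nbytes // 4)
```

With the width set to 26, the limit is 67108863 bytes, which is where the "about 67 Mbytes" figure comes from.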

Click OK to accept changes, and then click "Run Connection Automation" (it's towards the top of the "diagram" window, highlighted in blue). That will open a dialog, and make sure everything is selected. Click OK. Sometimes you have to click "Run Connection Automation" twice! Then click the "Regenerate Layout" icon in the Diagram window (it's a clockwise circular arrow). You should see this:

Note that what's connected to the AXI_DMA here is only the S_AXI_LITE port, which is for controls. None of the other AXI ports on AXI_DMA (M_AXI_MM2S, M_AXI_S2MM, M_AXIS_MM2S, S_AXIS_S2MM) are connected.

Before we connect all the AXI_DMA ports, let's first take a look at the AXI_DMA block that you created. At the top right you will see "M_AXI_MM2S", "M_AXI_S2MM", and "M_AXIS_MM2S", and at the top left you will see "S_AXI_LITE" and "S_AXIS_S2MM". The first letter "M" means "master" (the DMA engine initiates transfers and produces data) and the first letter "S" means "slave" (the DMA engine accepts data). These ports are:

Back in the diagram, if you mouse over the $+$ sign next to the "M_AXI_MM2S" port on the AXI block, you will see the cursor change to a double chevron symbol:

If you click on that $+$ symbol that the chevron points to it will open up all of the lines on that "MM2S" port, and you will see all of the channels there. For instance you will see a "m_axi_mm2s_arready" port, which is the ready line for the AR channel. And so on. These MM2S and S2MM ports will have to be connected to the ZYNQ chip.

AXI DMA port connections

Next we want to connect the AXI_DMA DMA ports. The control port, "S_AXI_LITE", is already connected (it was connected when you hit the "Run Connection Automation" button after setting up the AXI_DMA block).

The master ports "M_AXI_MM2S" and "M_AXI_S2MM" go through the processor. The ports on the ZYNQ processor are not connected by default, so we have to connect them by hand. To do this, double click the Zynq block, which should bring up the settings menu:

Click on the "PS-PL Configuration" tab on the left (PS is the processor and PL is the FPGA logic). Then expand "PS-PL Interfaces", then "Slave Interface", and then "AXI HP". You will see 4 ports labeled "AXI HPn FPD" where n goes from 0 to 3. Internally, there are 2 connections to the processor external memory, with 2 ports per connection. HP0 and HP1 share a switch to one port, HP2 and HP3 share the other. We will connect HP0 and HP1, but if there is a lot of throughput required, it is more efficient to connect HP0 and HP2 or HP1 and HP3.

Enable "AXI HP0 FPD" and "AXI HP1 FPD", and under each set the "AXI HPn FPD Data Width" to 128 (which might be the default). The width is actually not too important for now. Then click OK. This should make the slave ports show up. So the block decal should change from this:

to

Click "Run Connection Automation" again, enable "All Automation" in the window that pops up, and click OK. If you get a warning message, click OK to accept it and then click "Run Connection Automation" again and that should fix it. Since the S_AXI_HP0_FPD and S_AXI_HP1_FPD ports are for DMA, they will be automatically routed to the AXI_DMA ports.

Click "Regenerate Layout" again, and you should see this:

Notice now that the M_AXI_MM2S and M_AXI_S2MM ports are routed through an "AXI SmartConnect" to the processor slave ports S_AXI_HP0_FPD and S_AXI_HP1_FPD, which are how the FPGA logic reaches the external DRAM through the processor. The SmartConnects are there to do housekeeping things like matching different data widths, handling different clock domains, and ensuring correct protocol translation, and they otherwise act as a central AXI bus router. The system also added a "Processor System Reset" block to generate properly synchronized reset signals for the various clocks and logic, and this is especially important with designs that have various clocks.

Next we want to connect the AXI stream ports, which means where we want to send the data that comes out of the external DRAM. In this tutorial, we will add a FIFO to the data path (just to see how it's done) and then loop it back. To do this, click the $+$ sign (or right click and select "Add IP") and in the window that pops up, search for "fifo". You should see 5 or 6 variations. The one you want is "AXI4-Stream Data FIFO" (although you could also use "FIFO Generator"), so double click on that and you should see a new block that has a title "axis_data_fifo_0" (or something like that) above it and "AXI4-Stream Data FIFO" below it. Click on the block and in the "Block Properties" window that shows up (just to the left of the diagram) change the "Name:" field to "AXI_FIFO", just to make things easier when using the python hooks.

To configure the FIFO, double click on the "AXI_FIFO" block, and a "Re-customize IP" window should pop up. Under the "General" tab, set the "FIFO depth" to 512, and make sure that "TDATA width (bytes)" is set to 4 (32 bits). Leave everything else as default. Then click on the "Flags" tab. Under "Write flags" change "Enable write data count" to Yes, and do the same under "Read flags". This way we can find out in python how much data is in the fifo. Click "OK", and it will then show a FIFO block that has 2 more ports under "M_AXIS": "axis_wr_data_count[31:0]" and "axis_rd_data_count[31:0]". Note that the write data count will count the number of words written, and the read data count will count the number of words that are available to be read. So the number read will be the difference between these two counters (not sure exactly why they do it this way instead of counting the number actually read out).

Next, add 2 "AXI GPIO" blocks. Rename them to something like "FIFO_WR" and "FIFO_RD", and mouse over the + sign next to the GPIO port and open up the port. You will see a port called "gpio_io[31:0]" with an arrow indicating it's an input. Now you have to make some connections by hand. This is easy in Vivado: mouse over the port you want to connect, and the cursor will change to a slanted pen. Then click and drag to the destination port and release. The system will route the line so that it looks as uncomplicated as possible. So with the Vivado pen, connect the FIFO_WR gpio_io input port to the axis_wr_data_count port on the FIFO, and connect the FIFO_RD gpio_io input port to the axis_rd_data_count port in the same way. Then click on "Run Connection Automation" to get everything hooked up.

Next make the following connections to connect the AXI_FIFO ports to the DMA, clock, and reset lines:

Then click on "Run Connection Automation", and then regenerate layout. That should take care of connecting the AXI GPIO blocks FIFO_WR and FIFO_RD AXI bus ports, clocks, and resets. If all goes well you should see something like this:

Press F6 to run design validation (checks for errors) and make sure there are no errors, which there shouldn't be. Then go to the "Sources" tab (upper window to the left of the diagram), open "Design Sources", right click on "top (top.bd)", select "Create HDL Wrapper", choose "Let Vivado manage wrapper and auto-update", and hit OK. Assuming all goes well, generate the bitstream and hwh file by clicking "Generate Bitstream" in the Project Manager window on the left.

Using PYNQ

Next, make sure the board is on and connected to a PC, and open up a Jupyter lab connection in a browser (Chrome works fine). You will need to copy the .bit and .hwh files from the above project onto the board first. Let's do that and name them "test_dma.bit" and "test_dma.hwh". The .bit file is in the <name>.runs/impl_1 folder and .hwh is in <name>.gen/sources_1/bd/<top>/hw_handoff/ folder where <name> is the name of your project folder, and <top> is the name of the block design file when you made it.

Open up a new notebook and in the first cell type:

    from pynq import Overlay, allocate
    import numpy as np
and in the next cell type:
    base = Overlay("test_dma.bit")
    print("overlay completed")
The call to the Overlay() function actually loads the bit file into the FPGA, but it also parses the hwh file, which is a rather large XML file with all of the information that PYNQ needs to interface the processor with the logic (PS to PL).

You should look at the dictionary to be sure you have everything under control, so in the next cell type:

    base.ip_dict
You should see 2 entries: "ZYNQ" for the processor, and "AXI_DMA" for the AXI DMA block. It should look something like this:
    ↓ip_dict:
      →AXI_DMA:
      →ZYNQ:

Expand AXI_DMA and you can see what's in there. The physical address assigned to the IP block starts at "phys_addr", and this is where the block is mapped into the processor's address space. "addr_range" tells you how many bytes of address space the IP block occupies. All of the addresses from phys_addr up to phys_addr+addr_range are accessible via the MMIO class in PYNQ. For more on these addresses, see the documentation, which you can get by searching the web for "AXI DMA IP product guide", or going directly to:

https://docs.amd.com/v/u/5.00a-English/pg021_axi_dma

on page 26 for "Simple DMA" (that is, not Scatter-Gather). This document is hard to read, so here's a synopsis.

AXI DMA Register Address Mapping

The AXI DMA has 2 memory mapping channels: MM2S from external DDR through the PS into a stream; and S2MM from stream to external DDR memory. Each has a control register at address offset 0x0 for MM2S and 0x30 for S2MM, and a status register at address offset 0x4 for MM2S and 0x34 for S2MM. The control register bits look like this:

Bits are described here:

The status register bits look like this:

Bits are described here:

The control register is written to (controlled by) the PS via the PYNQ functions you call, and the status register tells you what state things are in. For what we are doing, which is very simple DMA, the main bit for the control register is bit 0, the RS bit, where RS=0 means the DMA is stopped and RS=1 means it is running. The status register bits that we most care about are bits 0 and 1. Bit 0 is the HALT bit, so HALT=0 means the DMA channel is "RUNNING", and HALT=1 means it is "HALTED". Bit 1 is the IDLE bit, and IDLE=0 means not idle and IDLE=1 means the channel is IDLE. There are also some error bits in the status register. PYNQ takes care of these bits.
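The register offsets and the few bits described above can be summarized in a short sketch. This is only an illustration using the offsets and bit positions given in the text; the names `DMA_REGS`, `decode_status`, and `describe` are made up for this example and are not part of PYNQ.

```python
# Offsets and bit positions as described in the text above.
# The names in this block are hypothetical, not a PYNQ API.

DMA_REGS = {
    "MM2S_DMACR": 0x00,  # MM2S control register
    "MM2S_DMASR": 0x04,  # MM2S status register
    "S2MM_DMACR": 0x30,  # S2MM control register
    "S2MM_DMASR": 0x34,  # S2MM status register
}

def decode_control(value):
    """Pull the RS (run/stop) bit, bit 0, out of a control register value."""
    return {"RS": value & 0x1}

def decode_status(value):
    """Pull the Halted (bit 0) and Idle (bit 1) bits out of a status value."""
    return {"Halted": value & 0x1, "Idle": (value >> 1) & 0x1}

def describe(status_value):
    """Turn a status register value into a short human-readable state."""
    bits = decode_status(status_value)
    if bits["Halted"]:
        return "HALTED"
    return "RUNNING, IDLE" if bits["Idle"] else "RUNNING, not idle"

for raw in (0x0, 0x1, 0x2):
    print(hex(raw), describe(raw))
```

On the board you normally don't need any of this, since PYNQ decodes the registers for you, but it can help when reading raw register values.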

Back to PYNQ. In the notebook, you have to set up the class that allows you to do DMA, which is done by the following command:

    dma = base.AXI_DMA
where AXI_DMA is the name you gave the DMA engine in the block diagram (and which should be what the base.ip_dict command reports). When you issue this command, PYNQ will set up the DMA engine, writing to the MM2S and S2MM control registers. To see the state of those registers, the easiest thing to do is to write after the above command the following:
    dma.register_map
This will print out a list of registers that looks like this:
RegisterMap {
  MM2S_DMACR = Register(RS=1, Reset=0, Keyhole=0, Cyclic_BD_Enable=0, IOC_IrqEn=0, Dly_IrqEn=0, Err_IrqEn=0, IRQThreshold=1, IRQDelay=0),
  MM2S_DMASR = Register(Halted=0, Idle=0, SGIncld=0, DMAIntErr=0, DMASlvErr=0, DMADecErr=0, SGIntErr=0, SGSlvErr=0, SGDecErr=0, IOC_Irq=0, Dly_Irq=0, Err_Irq=0, IRQThresholdSts=0, IRQDelaySts=0),
  MM2S_CURDESC = Register(Current_Descriptor_Pointer=0),
  MM2S_CURDESC_MSB = Register(Current_Descriptor_Pointer=0),
  MM2S_TAILDESC = Register(Tail_Descriptor_Pointer=0),
  MM2S_TAILDESC_MSB = Register(Tail_Descriptor_Pointer=0),
  MM2S_SA = Register(Source_Address=0),
  MM2S_SA_MSB = Register(Source_Address=0),
  MM2S_LENGTH = Register(Length=0),
  SG_CTL = Register(SG_CACHE=0, SG_USER=0),
  S2MM_DMACR = Register(RS=1, Reset=0, Keyhole=0, Cyclic_BD_Enable=0, IOC_IrqEn=0, Dly_IrqEn=0, Err_IrqEn=0, IRQThreshold=1, IRQDelay=0),
  S2MM_DMASR = Register(Halted=0, Idle=0, SGIncld=0, DMAIntErr=0, DMASlvErr=0, DMADecErr=0, SGIntErr=0, SGSlvErr=0, SGDecErr=0, IOC_Irq=0, Dly_Irq=0, Err_Irq=0, IRQThresholdSts=0, IRQDelaySts=0),
  S2MM_CURDESC = Register(Current_Descriptor_Pointer=0),
  S2MM_CURDESC_MSB = Register(Current_Descriptor_Pointer=0),
  S2MM_TAILDESC = Register(Tail_Descriptor_Pointer=0),
  S2MM_TAILDESC_MSB = Register(Tail_Descriptor_Pointer=0),
  S2MM_DA = Register(Destination_Address=0),
  S2MM_DA_MSB = Register(Destination_Address=0),
  S2MM_LENGTH = Register(Length=0)
}
In the above, "DMACR" means DMA control register, "DMASR" means status register, and there are a bunch more registers that have to do with DMA pointers that we don't have to care about because PYNQ knows what it is doing. Focusing on the control and status registers, you will see this:
RegisterMap {
  MM2S_DMACR = Register(RS=1, Reset=0, Keyhole=0, Cyclic_BD_Enable=0, IOC_IrqEn=0, Dly_IrqEn=0, Err_IrqEn=0, IRQThreshold=1, IRQDelay=0),
  MM2S_DMASR = Register(Halted=0, Idle=0, SGIncld=0, DMAIntErr=0, DMASlvErr=0, DMADecErr=0, SGIntErr=0, SGSlvErr=0, SGDecErr=0, IOC_Irq=0, Dly_Irq=0, Err_Irq=0, IRQThresholdSts=0, IRQDelaySts=0),

  S2MM_DMACR = Register(RS=1, Reset=0, Keyhole=0, Cyclic_BD_Enable=0, IOC_IrqEn=0, Dly_IrqEn=0, Err_IrqEn=0, IRQThreshold=1, IRQDelay=0),
  S2MM_DMASR = Register(Halted=0, Idle=0, SGIncld=0, DMAIntErr=0, DMASlvErr=0, DMADecErr=0, SGIntErr=0, SGSlvErr=0, SGDecErr=0, IOC_Irq=0, Dly_Irq=0, Err_Irq=0, IRQThresholdSts=0, IRQDelaySts=0),
}
So you see that the MM2S and S2MM control registers both have RS=1 set, so it is in the "RUN" state. That's because it is waiting for you to initiate the DMA transfer. The status registers both have HALT=0 and IDLE=0 which means the channel is running and not IDLE, consistent with what the control register is telling it.

A little bit on how the RealDigital 4x2 memory works. There is 4GByte for the processor system (PS) and another 4GByte for the programmable logic (PL). It is all 2400MHz DDR. The 4G PS memory is used by the Linux kernel, PYNQ applications, general system use, and buffers allocated in notebooks (see below). Once you boot up the board, you can check how the system is using DDR memory and especially how much contiguous memory is free by looking in the Linux file /proc/meminfo. To do this in a notebook, try this:

    !cat /proc/meminfo | grep -E "MemTotal|MemFree|Cached|Active|Inactive|Cma"
On the board I'm using, here's what I see:
MemTotal:        4025568 kB
MemFree:         2800144 kB
Cached:           464072 kB
SwapCached:            0 kB
Active:           209556 kB
Inactive:         892064 kB
Active(anon):       1764 kB
Inactive(anon):   593332 kB
Active(file):     207792 kB
Inactive(file):   298732 kB
CmaTotal:         131072 kB
CmaFree:          119428 kB

MemTotal is around 4G, and this is the PS external DDR memory that you have access to via PYNQ. The 4G PL memory is not visible to Linux by default, but can be manually configured and accessed in the Xilinx project (using the AXI MIG interface, which is not covered here).
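If you would rather have these numbers in Python than read them by eye, the /proc/meminfo text is easy to parse. The sketch below works on a copy of a few of the lines shown above, so it runs anywhere; the function name `parse_meminfo` is made up for this example. The "Cma" lines are the ones to watch, assuming (as I understand it) that contiguous buffers come out of the Linux contiguous memory allocator (CMA) pool.

```python
# Parse /proc/meminfo-style text into a dict of integer kB values.
# SAMPLE is copied from the output shown above; parse_meminfo is a
# hypothetical helper name.

SAMPLE = """\
MemTotal:        4025568 kB
MemFree:         2800144 kB
CmaTotal:         131072 kB
CmaFree:          119428 kB
"""

def parse_meminfo(text):
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            info[key.strip()] = int(fields[0])  # value in kB
    return info

info = parse_meminfo(SAMPLE)
cma_free_mib = info["CmaFree"] / 1024
cma_total_mib = info["CmaTotal"] / 1024
print(f"CMA free: {cma_free_mib:.1f} MiB of {cma_total_mib:.1f} MiB")
```

On the board, you could pass in the real file contents with `parse_meminfo(open("/proc/meminfo").read())`.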

Next, allocate a buffer in the external DDR. We will use the PYNQ allocate() function (imported in the first cell), which will allocate in the external DRAM DDR memory, not in the processor internal memory. We will use numpy to specify the data type of the buffer (uint32 here). Note that allocate() requires memory that is physically contiguous (no gaps), is shared between the processor and FPGA, and will be reserved so that Linux won't touch it. Below we allocate 2 buffers: into_fifo is the buffer we will allocate and fill with data to write into the fifo, and from_fifo is the buffer that will receive data from the fifo to compare. Note that we make these buffers the same size, mostly for convenience but also because that makes the DMA go smoothly without any need for us to do anything. Both buffers are 32 bit unsigned integers. After allocating the buffer, we will write a counter into the input buffer that will be sent to the fifo, but let's also set the upper 16 bits of the counter to something we can easily recognize.

    into_fifo_size = 100
    into_fifo = allocate(shape=(into_fifo_size,), dtype=np.uint32)
    for i in range(into_fifo_size):
        into_fifo[i] = 0xcafe0000 + i
        print(hex(into_fifo[i]), end=" ")
    from_fifo_size = 100
    from_fifo = allocate(shape=(from_fifo_size,), dtype=np.uint32)    
This should print out all 100 elements of the into_fifo buffer so you can see if it's ok.
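The point of putting a recognizable tag in the upper 16 bits is that each word can later be split back into the tag and the counter. Here is a small sketch of that using plain Python lists in place of the PYNQ buffer, so it runs without the board; the names `TAG`, `make_pattern`, and `split_word` are made up for this example.

```python
# Build the same test pattern as above and split a word back into its
# 16-bit tag and 16-bit counter. Names here are hypothetical helpers.

TAG = 0xcafe0000

def make_pattern(n):
    """Return n words: tag in the upper 16 bits, counter in the lower 16."""
    return [TAG + i for i in range(n)]

def split_word(word):
    """Return (upper 16 bits, lower 16 bits) of a 32-bit word."""
    return word >> 16, word & 0xFFFF

words = make_pattern(100)
tag, count = split_word(words[99])
print(hex(tag), count)
```

Note that the counter only has 16 bits to itself, so this pattern is good for buffers of up to 65536 words before the count spills into the tag.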

Now we have to set up the pointers for the dma channels for both sending and receiving:

    dma_send = dma.sendchannel
    dma_recv = dma.recvchannel
Next we initiate the DMA transfer from the "into_fifo" buffer by using the transfer method of the send channel:
    dma_send.transfer(into_fifo)
    dma_send.wait()
The .wait method just waits for the DMA to finish, however the transfer will be fast. The default clock here is 100MHz, and we are sending 100 words so it should take around $1\mu s$. Also, if you leave off the arguments to the transfer method, it will assume you want to start at the 1st word in the buffer and send everything. That would be equivalent to doing the following:
    dma_send.transfer(into_fifo, 0, 4*into_fifo_size)
    dma_send.wait()
Note the 4 multiplying the size, this is because the 3rd argument is the number of bytes not the number of words.
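The byte count and the rough timing estimate above are both simple arithmetic. The sketch below assumes the ideal case of one 32-bit word per clock cycle with no setup overhead, so the real transfer will take somewhat longer; the function names are made up for this example.

```python
# Byte count and idealized transfer time for a DMA of n 32-bit words.
# Assumption: one word moves per clock cycle and setup time is ignored.

def transfer_nbytes(n_words, bytes_per_word=4):
    """Number of bytes to pass as the 3rd argument of transfer()."""
    return n_words * bytes_per_word

def ideal_transfer_time_us(n_words, clock_hz=100e6):
    """Idealized transfer time in microseconds."""
    return n_words * 1e6 / clock_hz

print(transfer_nbytes(100))         # bytes for 100 words
print(ideal_transfer_time_us(100))  # microseconds at 100 MHz
```

For 100 words this gives 400 bytes and about 1 microsecond, matching the numbers in the text.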

If you were to read the control and status registers now by putting a "dma.register_map" after the "dma_send.wait()" command, you should see the MM2S is still in "RUN" state (RS=1), and the status register will report it is "RUNNING" (HALT=0) and "IDLE" (IDLE=1) because the transfer is complete. For the S2MM channel, it should not have changed since all we did was an MM2S transfer.

At this point the data should be in the fifo ready to be read out, which is done by the simple command

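    dma_recv.transfer(from_fifo, 0, 4*from_fifo_size)
    dma_recv.wait()
(or you can leave off the 2nd and 3rd arguments to transfer altogether). This will transfer the data from the fifo into the receive buffer, and the .wait call makes sure the transfer has finished before you look at the buffer. Printing out from_fifo should show the same values that were written into into_fifo.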
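Rather than comparing 100 values by eye, you can check the loopback with a small helper. The sketch below uses plain Python lists so it runs without the board, and the name `first_mismatch` is made up for this example.

```python
# Compare what was sent with what came back and report the first
# difference. first_mismatch is a hypothetical helper name.

def first_mismatch(sent, received):
    """Return the index of the first differing word, or None if identical."""
    for i, (a, b) in enumerate(zip(sent, received)):
        if a != b:
            return i
    if len(sent) != len(received):
        # One buffer is a prefix of the other; they differ where it ends.
        return min(len(sent), len(received))
    return None

sent = [0xcafe0000 + i for i in range(100)]
good = list(sent)
bad = list(sent)
bad[42] = 0  # simulate one corrupted word

print(first_mismatch(sent, good))
print(first_mismatch(sent, bad))
```

On the board, the same call should work directly on the two buffers, `first_mismatch(into_fifo, from_fifo)`, since they can be iterated element by element like numpy arrays.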
Last updated 8/11/2025 drew@umd.edu