A couple of days ago, I presented a small VHDL project that exchanges data with an FPGA over a fast USB link. The goal of that post was to present some difficult-to-find information about the FT2232H chip, and a simple project for benchmarking the resulting USB connection (results show that 14 MB/s is possible using that chip).
In this post, I go a bit further and present a USB-to-AXI master bridge that I developed to test various AXI slaves on the Digilent Nexys Video board. This project can be compared to the Xilinx JTAG AXI master IP, but has the advantage of being simpler to use (using a small and portable C tool instead of Vivado commands).
Background
The last post introduced the Nexys Video board and its features, and presented the FT2232H USB-CDC chip that allows a microcontroller or FPGA to easily communicate with a computer over USB. In this section, I give more details about the AXI bus, used in the industry to connect various devices on a SoC, and the Xilinx Vivado IDE used in IP Integrator mode.
AXI
AXI is a bus protocol developed by ARM that allows devices to communicate using addresses and data. At the most basic level, an AXI bus is a link between one master and one slave. The master initiates transactions (read/write data at a specific address), the slave responds to them and provides data. A typical AXI slave is a memory controller, which takes addresses and data to read/write to memory and performs these operations. The most common AXI master is a microprocessor, which initiates memory transactions according to the program it executes.
On paper, the AXI bus seems limited, as it only describes one master and one slave. However, the devices themselves can be much more interesting than a simple master or slave. A device may have several master ports (instruction and data ports, or two independent ports that bridge two AXI networks), several slave ports (a memory controller taking instructions from different masters), or a mix of master and slave ports (a powerful AXI router that redirects master requests to the proper slaves depending on their addresses, with arbitration if different masters want to access the same slave at the same time). Other AXI devices include clock domain crossing devices (a fast master is connected to a slow slave, for instance), bus width adapters (a master reads/writes 32-bit chunks of data while the slave uses a 128-bit bus, which requires several master transactions to be merged before being sent to the slave), etc.
The following figure shows an example of a set of AXI devices that have been connected in order to provide an interesting system: a fast Microblaze processor is connected to a memory cache, connected to a memory controller; a slow Microblaze processor (running at half the speed for instance) is connected to clock adapters then to the memory cache.
Xilinx Vivado IP
Logic circuit development can be split into two parts: implementing the interesting stuff (and its testbenches), and wiring everything together. This last operation requires much copy/pasting, is tedious, and can introduce subtle errors in the design, for instance mismatched clock periods, difficulty knowing whether "data_to_send" goes from or to the device currently being wired, etc.
Most FPGA or ASIC IDE providers have solutions to this problem, and they nearly all consist of allowing the user to graphically wire boxes. Each piece of the design is a box with input and output ports, and lines can be drawn between the ports. This produces a very nice schematic and makes it possible to quickly see the entire design. Once this wiring is done, a VHDL or Verilog file is generated for use with standard synthesis and simulation tools.
In Xilinx Vivado, schematic entry requires the use of the IP Integrator perspective. Xilinx heavily focuses on this IP Integrator and provides plenty of components ready to be used. In fact, most of their example designs now consist of integrating a couple of components with a Microblaze processor, so that people don't have to enter VHDL code anymore.
Once you have created your project (a standard RTL project), the IP Integrator can be opened by clicking on IP Integrator » Create Block Design in the left panel of Vivado:
This opens a blank sheet on which you can add components (called IP, for Intellectual Property I think) and wire everything together. The tool is very powerful and really tries to prevent you from making errors. For instance, each port has a type and additional information, so that Vivado can detect when you connect ports from different clock domains together. It also checks that your AXI parameters are correct (bus widths match, for instance).
If you right-click on the blank sheet, a menu allows you to add some IP provided by Xilinx (or by you; you can create your own IP, even though I will not describe that in this blog post), or to add any of your VHDL or Verilog modules (Add Module..., your file will appear as a box with input and output ports). Some Xilinx IPs offer nice automations. For instance, adding a Microblaze microprocessor, then clicking on "Run Block Automation" in the green bar that appears, will add plenty of other components so that you get a complete and functional Microblaze SoC, with memory, debug and some AXI infrastructure.
Because you can add your own RTL components in IP Integrator, you can wire up your design in this tool without having to create IP packages or use any Xilinx IP. However, I recommend that you look at some Xilinx IPs (more details about them can be found in IP Catalog, in the sidebar). There is complex stuff, like video, image encoders and network stacks, but also very handy basic things like memory caches, multipliers and dividers, AXI-to-nearly-anything bridges, etc.
Protocol
The goal of this project is to design, in the simplest way possible, an AXI master that issues transactions driven by a USB connection. The AXI protocol is quite complex and supports many features, like burst transfers (sending the address once, then several data packets), partial writes, cache coherency signals, memory protection (the master says whether it is in privileged or unprivileged mode), etc. However, I don't implement any of these features, only the strict minimum for testing. Burst transfers are not really needed as USB is much slower than even the slowest memory device.
The USB protocol is as follows:
- The computer sends an address encoded as 4 bytes, little endian. The MSB (bit 31) is set to 1 when writing, to 0 when reading.
- When writing, the computer sends a data packet of N bytes, with N depending on the width of the AXI bus (4 for a 32-bit bus, 16 for a 128-bit bus, etc.)
- When reading, the computer reads N bytes, which arrive once the memory has been read
No error management is used: the protocol is the absolute minimum required for testing AXI slaves and memories.
Because the FTDI chip has read and write buffers, the protocol can be slightly optimized for free. When writing, the computer can send a bunch of addresses and data packets at once (every data packet must follow its address). When reading, the computer can send a bunch of addresses, then read a bunch of data packets. Beware that the addresses sent to the FTDI chip must fit in its read buffer (sending 32 addresses at once works, 64 is unreliable). If too many addresses are sent, the buffer may become full, which may block the computer and prevent it from reading data: the computer waits for the address buffer to drain, the FPGA waits for the computer to read data, and we have a deadlock.
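The framing described above can be sketched in C as follows. This is a minimal illustration, not the code of the download tool: the names encode_header and encode_write are hypothetical, and the sketch assumes a 32-bit address whose bit 31 is reserved for the write flag.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define WRITE_FLAG 0x80000000u  /* bit 31: 1 = write, 0 = read */

/* Encode the 4-byte, little-endian address header of one request.
 * (Hypothetical helper, not part of the original tool.) */
void encode_header(uint32_t address, int is_write, uint8_t out[4])
{
    uint32_t word = (address & ~WRITE_FLAG) | (is_write ? WRITE_FLAG : 0u);
    out[0] = (uint8_t)(word & 0xFFu);          /* least significant byte first */
    out[1] = (uint8_t)((word >> 8) & 0xFFu);
    out[2] = (uint8_t)((word >> 16) & 0xFFu);
    out[3] = (uint8_t)((word >> 24) & 0xFFu);  /* carries the write flag */
}

/* Encode a complete write request: header, then data_bytes bytes of data.
 * 'out' must hold at least 4 + data_bytes bytes. Returns the encoded length. */
size_t encode_write(uint32_t address, const uint8_t *data,
                    size_t data_bytes, uint8_t *out)
{
    assert(data != NULL && out != NULL);
    encode_header(address, 1, out);
    memcpy(out + 4, data, data_bytes);
    return 4 + data_bytes;
}
```

A read request is just the header with is_write set to 0; the N data bytes then come back from the FPGA.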
VHDL implementation
I have implemented the USB-to-AXI master component as an IP that can be imported and used in Vivado. You can download the IP here. The hdl/ft2232h_mem_v0__1.vhd file contains the actual component and can be used under the GNU LGPLv3 license. Other files are generated by Vivado and I don't really know what their licenses are. I advise you to extract the VHDL file and re-build an IP if you need to use it for something other than experiments.
USB-to-AXI IP
The AXI bus contains many signals and five independent communication channels: read addresses (from master to slave), read data (from slave to master), write addresses (from master to slave), write data (from master to slave too) and write responses (from slave to master; this component ignores them by keeping its BREADY output high). An AXI master can send as many read/write addresses on the address channels as it wants, as long as the slave is ready. When the master wants to read data, it sets its ready flag and reads from the read data channel. The master can also write data on the write channel as long as the slave is ready. This is a bit weird, but as long as transactions are kept in order, you can send a bunch of data, then a bunch of write addresses (at which point the slave will perform the write operations). The only thing you must keep in mind is to send read addresses before reading data (seems legit).
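To make the "as long as the slave is ready" rule concrete: on every AXI channel, a transfer happens in a clock cycle where the sender's VALID and the receiver's READY signals are both high. The following C model is only a sketch with a hypothetical function name; it works on per-cycle signal traces rather than on real hardware.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Return the index of the first clock cycle in which VALID and READY are
 * both high (the cycle where the transfer happens), or -1 if the handshake
 * does not complete within 'cycles' cycles. */
int first_handshake(const bool *valid, const bool *ready, size_t cycles)
{
    assert(valid != NULL && ready != NULL);
    for (size_t t = 0; t < cycles; ++t) {
        if (valid[t] && ready[t]) {
            return (int)t;
        }
    }
    return -1;
}
```

In the VHDL component, the usb_wait_* states play the role of this loop: VALID stays asserted until READY is seen, and only then is VALID cleared.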
The VHDL component is a large state machine that covers two state chains:
- Get (write) address from USB, send write address to AXI slave, wait for AXI slave to accept the address, get data from USB, send data to the slave, wait for data to be accepted, return to idle
- Get (read) address from USB, send read address to slave, wait for slave, accept data, wait for data to arrive, send data to USB, return to idle
There are many states and nearly no parallelism, but this is to keep things simple (for instance, the component could have read data from USB while the write address is sent to the slave instead of waiting for the slave to accept the address before reading data from the USB bus).
type usb_state_t is (
    usb_get_address, usb_address, usb_send_waddr, usb_send_raddr, usb_wait_waddr, usb_wait_raddr,
    usb_get_data, usb_data, usb_send_data, usb_wait_data,
    usb_accept_data, usb_wait_for_data, usb_read_data, usb_read_data_wait, usb_after_read
);
Another thing to pay attention to is that the bus to the FTDI chip is much narrower (8-bit) than the AXI bus (up to 512-bit). This requires some additional state variables that are used to read/write data packets byte-by-byte before sending them on the AXI bus:
-- State of the USB transmission
constant addr_bytes : integer := (C_M00_AXI_ADDR_WIDTH/8);
constant data_bytes : integer := (C_M00_AXI_DATA_WIDTH/8);
signal usb_state : usb_state_t; -- Address/data state
signal address_byte : integer range 0 to addr_bytes-1; -- Index of the address byte being read
signal data_read_byte : integer range 0 to data_bytes-1; -- Index of the read-data byte being sent to USB
signal data_write_byte : integer range 0 to data_bytes-1; -- Index of the write-data byte being received from USB
signal address_buffer : std_logic_vector(m00_axi_awaddr'range);
signal data_read_buffer : std_logic_vector(m00_axi_rdata'range); -- Data read from the AXI slave
signal data_write_buffer : std_logic_vector(m00_axi_wdata'range); -- Data to write to the AXI slave
signal enable_write : std_logic;
Before presenting the complete state machine, I also want to show that most of the AXI output ports can be set to a fixed value. For instance, the component always generates bursts of length 1, does not bother with cache coherency or security, ignores many things (like error signals), etc:
m00_axi_awid <= (others => '0');
m00_axi_awlen <= (others => '0');
m00_axi_awsize <= burst_size(data_bytes);
m00_axi_awburst <= (others => '0');
m00_axi_awlock <= '0';
m00_axi_awcache <= (others => '0');
m00_axi_awprot <= (others => '0');
m00_axi_awqos <= (others => '0');
m00_axi_awuser <= (others => '0');
m00_axi_wstrb <= (others => '1');
m00_axi_wlast <= '1';
m00_axi_wuser <= (others => '0');
m00_axi_bready <= '1';
m00_axi_arid <= (others => '0');
m00_axi_arlen <= (others => '0');
m00_axi_arsize <= burst_size(data_bytes);
m00_axi_arburst <= (others => '0');
m00_axi_arlock <= '0';
m00_axi_arcache <= (others => '0');
m00_axi_arprot <= (others => '0');
m00_axi_arqos <= (others => '0');
m00_axi_aruser <= (others => '0');
Here, burst_size is an array that maps widths (in bytes) to burst size values:
type burst_size_t is array(Natural range <>) of std_logic_vector(m00_axi_awsize'range);
constant burst_size : burst_size_t(1 to 128) := (
    1 => "000",
    2 => "001",
    4 => "010",
    8 => "011",
    16 => "100",
    32 => "101",
    64 => "110",
    128 => "111",
    others => "000"
);
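The table above is simply the base-2 logarithm of the number of bytes per transfer, which is how the AXI AxSIZE field is encoded. As a cross-check, the same mapping can be computed in C; the function names below are hypothetical, and unlike the VHDL table (which maps every other width to "000"), this sketch reports invalid widths as an error.

```c
#include <assert.h>

/* Return the AxSIZE encoding (log2 of the number of bytes per transfer),
 * or -1 if 'bytes' is not a power of two between 1 and 128. */
int axi_size_code(unsigned bytes)
{
    if (bytes == 0 || bytes > 128 || (bytes & (bytes - 1)) != 0) {
        return -1;
    }
    int code = 0;
    while (bytes > 1) {
        bytes >>= 1;
        ++code;
    }
    return code;
}

/* Variant for widths that are known to be valid, e.g. build-time constants. */
unsigned axi_size_code_checked(unsigned bytes)
{
    int code = axi_size_code(bytes);
    assert(code >= 0);
    return (unsigned)code;
}
```

For the 128-bit bus used later in this post (16 bytes), the result is 4, i.e. "100".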
Finally, here is the complete state machine. I refer you to the VHDL file linked at the beginning of this section for a list of input/output ports and the complete implementation:
prog_rdn <= '1';
prog_wrn <= '1';
prog_oen <= '0';
prog_d <= (others => 'Z');
case usb_state is
    when usb_get_address =>
        -- Wait for address bytes
        if prog_rxen = '0' then
            address_buffer(address_byte*8 + 7 downto address_byte*8) <= prog_d;
            enable_write <= prog_d(7);
            prog_rdn <= '0';
            usb_state <= usb_address;
        end if;
    when usb_address =>
        -- Read an address byte, and wait for next address bytes
        -- or data bytes
        if address_byte = addr_bytes-1 then
            -- Last byte has been read, send address
            address_byte <= 0;
            if enable_write = '1' then
                -- Write operation, on the write address bus
                usb_state <= usb_send_waddr;
            else
                -- Read operation, on the read bus
                usb_state <= usb_send_raddr;
            end if;
        else
            -- Wait for next byte
            address_byte <= address_byte + 1;
            usb_state <= usb_get_address;
        end if;
    when usb_send_waddr =>
        -- Send a write request
        m00_axi_awaddr <= address_buffer;
        m00_axi_awvalid <= '1';
        usb_state <= usb_wait_waddr;
        -- Clear the top bit, that is always at one otherwise
        m00_axi_awaddr(m00_axi_awaddr'high) <= '0';
    when usb_wait_waddr =>
        -- Wait for write request to be accepted
        if m00_axi_awready = '1' then
            m00_axi_awvalid <= '0';
            usb_state <= usb_get_data;
        end if;
    when usb_send_raddr =>
        -- Send a read request
        m00_axi_araddr <= address_buffer;
        m00_axi_arvalid <= '1';
        usb_state <= usb_wait_raddr;
    when usb_wait_raddr =>
        -- Wait for the read request to be accepted
        if m00_axi_arready = '1' then
            m00_axi_arvalid <= '0';
            usb_state <= usb_accept_data;
        end if;
    when usb_get_data =>
        -- Wait for data bytes
        if prog_rxen = '0' then
            data_write_buffer(data_write_byte*8 + 7 downto data_write_byte*8) <= prog_d;
            prog_rdn <= '0';
            usb_state <= usb_data;
        end if;
    when usb_data =>
        -- Read a data byte
        if data_write_byte = data_bytes-1 then
            -- Last byte has been obtained, send data
            data_write_byte <= 0;
            usb_state <= usb_send_data;
        else
            -- Wait for next byte
            data_write_byte <= data_write_byte + 1;
            usb_state <= usb_get_data;
        end if;
    when usb_send_data =>
        -- Send data to be written
        m00_axi_wdata <= data_write_buffer;
        m00_axi_wvalid <= '1';
        usb_state <= usb_wait_data;
    when usb_wait_data =>
        -- Wait for data to be accepted
        if m00_axi_wready = '1' then
            m00_axi_wvalid <= '0';
            usb_state <= usb_get_address;
        end if;
    when usb_accept_data =>
        -- Tell the AXI bus that we are ready to read
        m00_axi_rready <= '1';
        usb_state <= usb_wait_for_data;
    when usb_wait_for_data =>
        -- Wait for data on the AXI bus
        if m00_axi_rvalid = '1' then
            m00_axi_rready <= '0';
            data_read_buffer <= m00_axi_rdata;
            prog_oen <= '1'; -- Tell FTDI chip that we will write to it
            usb_state <= usb_read_data;
        end if;
    when usb_read_data =>
        -- Transmit data on the USB connection
        prog_oen <= '1'; -- Tell FTDI chip that we will write to it
        if prog_txen = '0' then
            prog_d <= data_read_buffer(data_read_byte*8 + 7 downto data_read_byte*8);
            prog_wrn <= '0';
            usb_state <= usb_read_data_wait;
        end if;
    when usb_read_data_wait =>
        -- Give prog_txen time to be updated
        prog_oen <= '1';
        if data_read_byte = data_bytes-1 then
            -- Everything has been sent, start a new request
            prog_oen <= '0';
            data_read_byte <= 0;
            usb_state <= usb_after_read;
        else
            -- Send next byte
            data_read_byte <= data_read_byte + 1;
            usb_state <= usb_read_data;
        end if;
    when usb_after_read =>
        -- Do nothing just after writing to USB, to give OEN time
        -- to propagate to the FTDI chip
        usb_state <= usb_get_address;
end case;
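As a rough sanity check of the two state chains, one can count the minimum number of clock cycles each transaction spends in the state machine. The C sketch below is an estimate under stated assumptions, not a measurement: it assumes one state per clock cycle, an FTDI FIFO that never stalls, and an AXI slave that answers immediately. The function names are hypothetical.

```c
#include <assert.h>

/* Minimum cycles for one write: two states per USB byte, plus the
 * send/wait pairs on the write address and write data channels. */
unsigned write_cycles(unsigned addr_bytes, unsigned data_bytes)
{
    return 2 * addr_bytes   /* usb_get_address + usb_address, per byte */
         + 2                /* usb_send_waddr + usb_wait_waddr */
         + 2 * data_bytes   /* usb_get_data + usb_data, per byte */
         + 2;               /* usb_send_data + usb_wait_data */
}

/* Minimum cycles for one read, following the second state chain. */
unsigned read_cycles(unsigned addr_bytes, unsigned data_bytes)
{
    return 2 * addr_bytes   /* usb_get_address + usb_address, per byte */
         + 2                /* usb_send_raddr + usb_wait_raddr */
         + 2                /* usb_accept_data + usb_wait_for_data */
         + 2 * data_bytes   /* usb_read_data + usb_read_data_wait, per byte */
         + 1;               /* usb_after_read */
}

/* Upper bound on the payload rate, in bytes per second. */
double peak_rate(double clock_hz, unsigned data_bytes, unsigned cycles)
{
    assert(cycles > 0);
    return clock_hz * (double)data_bytes / (double)cycles;
}
```

With 4 address bytes and a 128-bit bus (16 data bytes), this gives 44 cycles per write and 45 per read, so the state machine alone would cap at roughly 21 MB/s with a 60 MHz clock under these assumptions.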
Test Design
The USB-to-AXI IP can be used in any design to test AXI slaves. For my experiments, I have used it with a Memory Interface Generator (MIG), which allows the FPGA on the Nexys Video to communicate with the onboard 512 MB DDR3 RAM. Before showing the IP diagram that I used, here are some considerations:
- The FTDI chip produces a 60 MHz clock, which is used to clock the USB-to-AXI IP. This means that the AXI bus for which the IP is master runs at 60 MHz.
- The Nexys Video board provides a very stable 100 MHz reference clock to the FPGA.
- The Memory Interface Generator needs a 200 MHz input clock, which must be produced. I decided to use the 100 MHz reference clock as base clock, not the 60 MHz one, because memory interfaces are quite sensitive to clock signals (jitter, duty cycle) and I don't think that the primary goal of the FTDI chip is to provide a rock-solid clock.
- The MIG is very easy to configure if you properly configure your project for your board. Digilent provides board files for all its boards, and these files describe the precise timings and parameters of the onboard DDR RAMs, if any. They also provide the complete pinout of the FPGA, which allows you to add a memory in a single click (running a board automation).
- The MIG produces ui_clk, a 100 MHz clock that it uses to drive its AXI slave port. This means that the USB-to-AXI IP (60 MHz) and the MIG (100 MHz) run at different speeds. An AXI clock converter is therefore needed.
- In simple cases, reset is easy to handle: just use a switch (beware that most resets are active low, so the switch must be set to 1 for the device to work). However, clock generators are a bit more difficult to handle, as we have to wait for their clocks to become stable before releasing reset. In this design, there are two clock generators: the one that produces the 200 MHz clock and one generator internal to the MIG. We therefore need two reset circuits, one that releases its reset when the 200 MHz clock is stable, and one that releases it once the MIG can be used (when it has been calibrated).
All in all, quite a number of components are required. Here is a large image (I recommend that you right-click on it and see it outside this page) that shows all the components I used and their connections. For my experiments, I've set the AXI bus widths to 128 bits, so that more data can be sent/received for each address.
In this image, I send a number of output signals to the LEDs available on the Nexys Video board for debugging purposes. If you do the same, you will be able to see that the MIG needs nearly half a second to complete its calibration, which may contribute significantly to the startup time of a microprocessor-based system.
Software tool
An upload/download tool can be downloaded here. Compile this file with:
gcc -O2 -I/usr/include/libftdi1 -o ftdi-axi-master main.c -lftdi1
The program can be used to write to memory locations (ftdi-axi-master w filename.dat address) or to read from memory locations (ftdi-axi-master r filename.dat address size). Addresses and sizes can be expressed in decimal, octal or hexadecimal form (with the 0x prefix).
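Accepting decimal, octal and hexadecimal numbers is what the standard strtoul function does when its base argument is 0. The sketch below shows one way to parse such an argument with error checking; it is an assumption about how this could be done, not necessarily what the tool does, and parse_u32 is a hypothetical name.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Parse a decimal, octal (leading 0) or hexadecimal (leading 0x) number.
 * Returns 0 and stores the value in *out on success, -1 on error
 * (empty string, trailing garbage, or value that does not fit in 32 bits). */
int parse_u32(const char *text, uint32_t *out)
{
    assert(text != NULL && out != NULL);

    char *end = NULL;
    errno = 0;
    unsigned long value = strtoul(text, &end, 0);  /* base 0: detect the prefix */

    if (errno != 0 || end == text || *end != '\0' || value > 0xFFFFFFFFul) {
        return -1;
    }
    *out = (uint32_t)value;
    return 0;
}
```

With this helper, "4096", "010000" and "0x1000" all designate the same address.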
Upload
When uploading or downloading, the program starts by initializing the FTDI chip in Synchronous FIFO mode, the mode expected by the FPGA and that provides the best performance. My previous blog post explains the process and tells you what to change if you own a board for which the FTDI chip has a different major-minor USB number.
The actual uploading process consists of sending a list of addresses and data to the FPGA. The data width is configured at the beginning of the C file and must be adapted to your AXI bus width. Because sending packets over USB is slow, large packets should be sent. The program packs 1024 addresses and 1024 data words, then sends the whole over the USB bus. This allows data to be sent at over 14 MB/s (when addresses are counted, we should be over 17 MB/s, or 136 Mbps).
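The step from 14 MB/s of data to over 17 MB/s on the wire comes from the 4 address bytes that accompany each data word. Here is the arithmetic as a small C sketch, assuming the 128-bit bus of the test design (16 data bytes per address); wire_rate is a hypothetical helper name.

```c
#include <assert.h>

/* Rate seen on the USB link when every data word of 'data_bytes' bytes
 * travels together with 'addr_bytes' bytes of address. The result is in
 * the same unit as 'payload_rate'. */
double wire_rate(double payload_rate, unsigned addr_bytes, unsigned data_bytes)
{
    assert(data_bytes > 0);
    return payload_rate * (double)(addr_bytes + data_bytes) / (double)data_bytes;
}
```

For exactly 14 MB/s of payload, wire_rate(14.0, 4, 16) gives 14 × 20 / 16 = 17.5 MB/s, which is 140 Mbps.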
struct element {
    unsigned int address;
    char data[WIDTH/8];
};

// Open file (binary mode, so that no byte is translated on the way)
FILE *f = fopen(filename, "rb");

// Send packets to the FPGA
struct element elems[BURST_LEN];
unsigned int i = 0;

for (;;) {
    // Produce element; a short final read leaves the end of the word zeroed.
    // Testing the result of fread() instead of feof() avoids sending one
    // extra word at the end of the file. memset() needs <string.h>.
    memset(elems[i].data, 0, WIDTH/8);
    if (fread(elems[i].data, 1, WIDTH/8, f) == 0) {
        break;
    }
    elems[i].address = address | 0x80000000; // Set "write" flag

    // If the list of elements is full, send them
    if (i == BURST_LEN - 1) {
        ftdi_write_data(ftdi, (unsigned char *)&elems, sizeof(elems));
        i = 0;
    } else {
        ++i;
    }

    // Prepare for next address
    address += WIDTH / 8;
}

// Send last elements
if (i != 0) {
    ftdi_write_data(ftdi, (unsigned char *)&elems, i * sizeof(struct element));
}
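Sending the array of structures as raw bytes only matches the protocol if the structure contains no padding and if the computer stores integers in little-endian order. Both assumptions hold on a typical x86 PC with a 128-bit bus, but they can be checked explicitly. The sketch below uses a fixed-width uint32_t for the address and a hypothetical helper name; it illustrates the check and is not part of the original tool.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WIDTH 128  /* AXI data width in bits (assumption: same as the test design) */

struct element {
    uint32_t address;
    char data[WIDTH / 8];
};

/* The wire format is 4 address bytes immediately followed by the data. */
static_assert(offsetof(struct element, data) == 4,
              "data must directly follow the address");
static_assert(sizeof(struct element) == 4 + WIDTH / 8,
              "struct element must not contain padding");

/* Return 1 if the host stores integers least significant byte first,
 * which is the byte order expected by the FPGA. */
int host_is_little_endian(void)
{
    const uint32_t probe = 1;
    return *(const unsigned char *)&probe == 1;
}
```

With narrow buses the padding check matters: for an 8-bit WIDTH, the compiler would typically pad the structure to 8 bytes and the second static_assert would fail at compile time.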
Download
The download process is comparable to the upload process, with the exception that the program only sends addresses (no data), and receives data. We need to be careful while doing that, as the FPGA answers read addresses one at a time, and will not read the following addresses until the answer has been transmitted. We must therefore ensure that the FTDI send buffer is always able to receive data, in order to prevent deadlocks from occurring (the computer sends too many addresses and waits for them to be consumed, while the FPGA is waiting for its send buffer to become empty and does not read any address).
Fortunately, in my experiments, it appears that using asynchronous USB transfers works and is not too slow. The program fills an address buffer, sends all the addresses asynchronously, reads all the data asynchronously, and libftdi ensures that information flows without deadlocks. Doing so allows libftdi to make the most out of the buffers available on the FTDI chip, which reduces USB transfers and allows data to be read at a bit under 1 MB/s. This is much slower than writes but still much faster than RS-232. A better protocol (send one read address and a size, then read everything) would make things much faster, though.
Here is the code I use:
struct ftdi_transfer_control *wr_transfer, *rd_transfer;

// Open file (binary mode, so that no byte is translated on the way)
FILE *f = fopen(filename, "wb");

// Send packets to the FPGA
unsigned int addresses[BURST_LEN];
char buf[BURST_LEN * WIDTH/8];
unsigned int i = 0;

for (int j=0; j<size; j += WIDTH/8) {
    // Produce address
    addresses[i] = address;

    if (i == BURST_LEN - 1) {
        // Asynchronously send addresses and read data
        wr_transfer = ftdi_write_data_submit(
            ftdi, (unsigned char *)&addresses, sizeof(addresses));
        rd_transfer = ftdi_read_data_submit(
            ftdi, (unsigned char *)&buf, sizeof(buf));

        // Wait for transfers to complete and write data to file
        if (wr_transfer && rd_transfer) {
            ftdi_transfer_data_done(wr_transfer);
            ftdi_transfer_data_done(rd_transfer);
            fwrite(&buf, sizeof(buf), 1, f);
        }

        i = 0;
    } else {
        ++i;
    }

    // Prepare for next address
    address += WIDTH / 8;
}

// Process last elements
if (i != 0) {
    wr_transfer = ftdi_write_data_submit(
        ftdi, (unsigned char *)&addresses, i * sizeof(unsigned int));
    rd_transfer = ftdi_read_data_submit(
        ftdi, (unsigned char *)&buf, i * WIDTH/8);

    if (wr_transfer && rd_transfer) {
        ftdi_transfer_data_done(wr_transfer);
        ftdi_transfer_data_done(rd_transfer);
        fwrite(&buf, i * WIDTH/8, 1, f);
    }
}
There is a bit of copy/paste in this code, which I don't like, but it works for its purpose.
Note: Even though I'm able to transfer large files using this program (a 70 MB video file, archives, etc.), it sometimes happens that the file read from memory starts with 16 random bytes. If that happens, resetting the board, re-writing and re-reading solves the problem. I don't know where it comes from, probably stray data in the FTDI buffer when writing or reading files.