PCIe DMA Design and Throughput Test based on Xilinx PCIe Core


Depending on which side initiates the transaction, there are two ways a PCIe card and a PC can communicate with each other: the PC initiates read/write transactions to the PCIe card, or the PCIe card initiates read/write transactions to the PC bus.

 

In the first approach, the PCIe card memory is mapped into PC memory space through BARs, and the PC accesses the card with simple PCIe commands. Below is an example of how a Realtek PCIe card is mapped into PC space, with BAR0 for its I/O space and BAR2 and BAR4 for its memory. In our system, the PCIe card carries a Xilinx FPGA that implements the PCIe endpoint (EP) core; in the Xilinx PCIe EP core, the BAR starting addresses and sizes can be freely adjusted.

 

In our test, the PC is a single-core Celeron at 2.93 GHz on an i945GC motherboard with DDR2-667 memory, and the test software suite is DriverStudio. The test shows that it takes about 0.5 µs for the PC CPU to write one location in the PCIe card BAR space and about 1 µs to read one. The maximum access speed is therefore less than 10 MBps when the PC CPU initiates accesses to the card's BAR space, and the load on the PC CPU is heavy in this case.

[Figure: xpciedma1]
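
For illustration, here is a minimal user-space sketch of this kind of CPU-initiated BAR access and how its latency can be timed. It is written for Linux using the sysfs resource file, whereas our actual test used DriverStudio on Windows; the PCI device path is a placeholder.

```c
/* Sketch: map a device BAR into user space and time CPU-initiated accesses.
 * Linux/sysfs illustration only; the original test used DriverStudio on
 * Windows. The PCI device path below is a placeholder. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

#define LOOPS 100000

static double elapsed_us(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) * 1e6 + (b.tv_nsec - a.tv_nsec) * 1e-3;
}

int main(void)
{
    /* BAR0 of a hypothetical endpoint at 0000:01:00.0 */
    int fd = open("/sys/bus/pci/devices/0000:01:00.0/resource0",
                  O_RDWR | O_SYNC);
    if (fd < 0) { perror("open"); return 1; }

    volatile uint32_t *bar0 = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                   MAP_SHARED, fd, 0);
    if (bar0 == MAP_FAILED) { perror("mmap"); return 1; }

    struct timespec t0, t1;
    uint32_t sink = 0;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < LOOPS; i++)
        bar0[0] = i;                      /* CPU-initiated writes (posted) */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("avg BAR write: %.2f us\n", elapsed_us(t0, t1) / LOOPS);

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < LOOPS; i++)
        sink += bar0[0];                  /* CPU-initiated reads (non-posted) */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("avg BAR read:  %.2f us (sink=%u)\n",
           elapsed_us(t0, t1) / LOOPS, sink);

    munmap((void *)bar0, 4096);
    close(fd);
    return 0;
}
```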

 

In the second approach, the PCIe card memory is not mapped into PC memory space. Instead, a DMA engine is implemented in the card's Xilinx FPGA, and the DMA engine collects data from the card's memory space as instructed by the CPU.

 

Here is the process by which data is moved from the PCIe card to PC memory.

1. The PC allocates a contiguous block in its local memory and locks it so other programs cannot disturb it.

2. When the FPGA on the PCIe card detects that enough data is ready to be sent to the PC, it sends an interrupt to notify the PC to fetch the data.

3. The PC receives the interrupt and writes the DMA control registers, which sit in BAR0 space, with the starting address and size of the locally allocated memory block from step 1. Then the PC writes the BAR0 DMA command register to kick off the DMA (a register-level sketch of this sequence follows the list).

4. The DMA engine receives the kick-off command, reads data from its local memory, assembles PCIe memory write TLPs, and sends them to the PC. When the transfer is done, the FPGA sends an interrupt to the PC. Note that a transfer normally consists of many TLPs.

5. The PC receives the interrupt and reads the card's BAR0 status register to check the transfer status. The PC can copy the data to another buffer so the user program can access it, or simply unlock this block and allocate and lock another one for the next PCIe transfer.

6. Loop back to step 2.
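
As a concrete illustration of step 3, below is a minimal sketch of the kick-off sequence on the PC side. The BAR0 register offsets and the start bit are hypothetical placeholders; the real register map is whatever the DMA engine in the FPGA defines. Note that the sequence amounts to four BAR0 writes, in line with the per-kick-off write count in the test steps later.

```c
/* Sketch of the DMA kick-off in step 3. The BAR0 register offsets and bit
 * definitions below are hypothetical placeholders for whatever register
 * map the FPGA DMA engine implements. */
#include <stdint.h>

#define DMA_ADDR_LO     0x00   /* low 32 bits of host buffer physical address  */
#define DMA_ADDR_HI     0x04   /* high 32 bits of host buffer physical address */
#define DMA_LEN         0x08   /* transfer length in bytes                     */
#define DMA_CTRL        0x0C   /* control/command register                     */
#define DMA_CTRL_START  (1u << 0)

static inline void reg_write(volatile uint8_t *bar0, uint32_t off, uint32_t val)
{
    *(volatile uint32_t *)(bar0 + off) = val;
}

/* Called after the host buffer has been allocated and locked (step 1):
 * program the buffer address and length, then start the engine. */
void dma_kickoff(volatile uint8_t *bar0, uint64_t buf_phys, uint32_t len)
{
    reg_write(bar0, DMA_ADDR_LO, (uint32_t)buf_phys);
    reg_write(bar0, DMA_ADDR_HI, (uint32_t)(buf_phys >> 32));
    reg_write(bar0, DMA_LEN,     len);
    reg_write(bar0, DMA_CTRL,    DMA_CTRL_START);   /* step 3: kick off DMA */
}
```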

 
Here is the process by which data is moved from PC memory to the PCIe card.

1. The PC allocates a contiguous block in its local memory and locks it so other programs cannot access it.

2. When the FPGA detects enough empty memory space on the PCIe card, it sends an interrupt to the PC and waits for the PC to send data.

3. The PC receives the interrupt and writes the DMA control registers in the card's BAR0 space with the starting address and size of the locally allocated memory block from step 1. Then the PC writes the BAR0 DMA command register to kick off the DMA.

4. The DMA engine receives the kick-off command, assembles PCIe memory read TLPs, and sends them to the PC. The PC receives the read TLPs, fetches the requested data from local memory, and returns PCIe memory read completion TLPs to the card. The FPGA receives the completion TLPs, unpacks them, and puts the data into its local memory in order. When the transfer is done, the FPGA sends an interrupt to the PC.

5. The PC receives the interrupt and reads the card's BAR0 status register to check the transfer status (sketched after this list). The PC can then unlock this block and allocate and lock another one for the next PCIe transfer.

6. Loop back to step 2.
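
A minimal sketch of the status check in step 5 is shown below; the DMA_STATUS offset and its bit definitions are hypothetical placeholders for the FPGA's actual register map.

```c
/* Sketch of step 5: handling the transfer-done interrupt. The DMA_STATUS
 * offset and its bit definitions are hypothetical placeholders. */
#include <stdbool.h>
#include <stdint.h>

#define DMA_STATUS       0x10           /* hypothetical BAR0 status register */
#define DMA_STATUS_DONE  (1u << 0)
#define DMA_STATUS_ERR   (1u << 1)

static inline uint32_t reg_read(volatile uint8_t *bar0, uint32_t off)
{
    return *(volatile uint32_t *)(bar0 + off);
}

/* Returns true if the transfer completed cleanly. The caller then either
 * copies the buffer out for the user program, or unlocks it and locks a
 * fresh buffer before re-arming the DMA for the next transfer. */
bool dma_transfer_done(volatile uint8_t *bar0)
{
    uint32_t status = reg_read(bar0, DMA_STATUS);
    return (status & DMA_STATUS_DONE) && !(status & DMA_STATUS_ERR);
}
```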

 

The diagram below shows the PCIe EP DMA engine architecture.

[Figure: xpciedma2]

 

In our test, we generate an increment-by-1 data stream and hook it up to the Egress FIFO input shown above. On the Ingress FIFO output port, we check that each received word is the previous word plus 1, and an error counter tallies any mismatches. In our tests, we did not see a single error.
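
A software model of that checker might look like the sketch below; in the FPGA it is a small piece of logic on the Ingress FIFO output, and the 32-bit word width here is an assumption.

```c
/* Software model of the Ingress-side data checker: the Egress side feeds an
 * increment-by-1 pattern, so each received word must be the previous word
 * plus 1. The 32-bit word width and reset seed are assumptions. */
#include <stdint.h>

static uint32_t expected;      /* next value expected from the Ingress FIFO */
static uint32_t error_count;   /* number of mismatches seen so far          */

void checker_reset(uint32_t seed)
{
    expected = seed;
    error_count = 0;
}

void checker_push(uint32_t word)   /* call once per word popped from the FIFO */
{
    if (word != expected)
        error_count++;             /* in our runs this counter stayed at 0 */
    expected = word + 1;           /* resynchronize and expect the next value */
}

uint32_t checker_errors(void) { return error_count; }
```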

 

Below is our test setup.

[Figure: xpciedma3]

 
Here are our test steps:

1. The FPGA checks whether there is enough data to be sent to the PC. If so, it sends an interrupt to the PC.
2. The PC receives and processes the interrupt and kicks off the DMA. This involves some CPU software processing and four writes to FPGA BAR0.
3. DMA transfer. When the DMA transfer size is larger than 256 KB, the DMA start and end overhead can be ignored.
4. The CPU processes the DMA transfer-done interrupt. This again involves some CPU software processing and four writes to FPGA BAR0.
5. (Optional) The CPU moves data from the allocated buffer to user space.
6. The CPU re-enables the interrupt and goes back to step 1.

 

Some analysis:

Let each DMA transfer be N (MB) and let the DMA transfer efficiency be f, where f is the fraction of time the system spends moving real user data. Further, when the PC receives an interrupt, let the interrupt response time be m (seconds).
The transfer speed (in MBps) is then
\frac{N}{\frac{N}{250 \times 8 \times f} + m}
where 250 MHz is the clock and 8 bytes is the width of the data path between the PC and the card, i.e. a raw rate of 250 × 8 = 2000 MB/s.

For the transfer efficiency f, assume each TLP carries 128 bytes of user data and the TLP header is (always) 16 bytes, so one TLP occupies 144 bytes, or 18 cycles on the 8-byte bus. In addition, there is a gap between TLPs of at least one cycle for every 18-cycle transfer. So f is less than 128/(128+16) × 18/(18+1) ≈ 84%.

The interrupt service time is about 20 µs. During this time, the CPU performs about eight BAR0 write accesses; each write takes about 1 µs and each read about 2 µs, so roughly 10 µs goes to CPU read/write processing, and the remaining 10 µs is CPU wait time.
Below are test results following the steps above, except that step 5, moving data to user space, is skipped. Each test lasts about 10 seconds. When the CPU receives an interrupt, it clears the interrupt and re-enables the DMA transfer.

1. When the DMA transfer size is 1 MB, the speed is 983 MBps.

[Figure: xpciedma_r1]

2. When the DMA transfer size is 2 MB, the speed is 992.8 MBps.

[Figure: xpciedma_r2]

3. When the DMA transfer size is 4 MB, the speed is 996.8 MBps.

[Figure: xpciedma_r3]

4. When the DMA transfer size is 8 MB, the speed is 999.2 MBps.

[Figure: xpciedma_r4]

5. When the DMA transfer size is 16 MB, the speed is 1000 MBps.

[Figure: xpciedma_r5]

6. When the DMA transfer size is 32 MB, the speed is 1001.6 MBps.

[Figure: xpciedma_r6]

Based on the 1 MB and 2 MB speeds, we can back-calculate f = 0.501398684 and m = 0.00002008356 s (about 20 µs). Plugging these into the speed equation, we get:

DMA Transfer Size    By Equation       Test
4 MB                 997.773 MBps      996.8 MBps
8 MB                 1000.279 MBps     999.2 MBps
16 MB                1001.53 MBps      1000 MBps
32 MB                1002.16 MBps      1001.6 MBps

 

Following the same equation, if the DMA transfer size is small, we get:

DMA Transfer Size    In Theory
128 Bytes            6.33 MBps
256 Bytes            12.58 MBps
512 Bytes            24.86 MBps
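
For reference, both tables can be reproduced with a short program like the sketch below, which solves the speed equation for f and m from the measured 1 MB and 2 MB points and then predicts the other sizes (sizes are in decimal MB, matching the equation's units).

```c
/* Reproduce the f/m back-calculation and the predicted speeds.
 * speed(N) = N / (N / (250*8*f) + m), with N in MB, speed in MBps, m in s. */
#include <stdio.h>

#define RAW_MBPS (250.0 * 8.0)   /* 250 MHz x 8-byte data path = 2000 MB/s */

static double speed(double n_mb, double f, double m)
{
    return n_mb / (n_mb / (RAW_MBPS * f) + m);
}

int main(void)
{
    /* Two measured points: 1 MB -> 983 MBps, 2 MB -> 992.8 MBps. */
    double t1 = 1.0 / 983.0;         /* seconds per 1 MB transfer */
    double t2 = 2.0 / 992.8;         /* seconds per 2 MB transfer */

    /* t(N) = N/(RAW_MBPS*f) + m is linear in N, so two points determine it. */
    double slope = t2 - t1;          /* = 1/(RAW_MBPS*f)               */
    double f = 1.0 / (RAW_MBPS * slope);
    double m = t1 - slope;           /* intercept = interrupt overhead */

    printf("f = %.9f, m = %.11f s\n", f, m);

    const double sizes[] = { 4, 8, 16, 32, 128e-6, 256e-6, 512e-6 };
    for (unsigned i = 0; i < sizeof sizes / sizeof sizes[0]; i++)
        printf("N = %g MB -> %.3f MBps\n", sizes[i], speed(sizes[i], f, m));
    return 0;
}
```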

So the DMA transfer size is critical to the overall transfer speed.
As mentioned, moving data to user space is not performed in the test above. If this step is performed, performance depends heavily on the software and driver design on the PC side. Our PCIe driver first moves data from the allocated memory into an internal buffer and then from the internal buffer to user space, and the user-space side also performs a CRC check on the data. In our test, with this step enabled, the speed is about 250 MBps with large DMA transfer sizes.

 

 