Porting Drivers to the STM32 F7

Problem Statement

I recently completed a port to the STMicro STM32F746G Discovery board. That MCU is clearly a derivative of the STM32 F3/F4, and many of its peripherals are, in fact, essentially identical to those of the STM32F429. The biggest difference is that the STM32F746 sports a Cortex-M7, which includes several improvements over the Cortex-M4 including, most relevant to this discussion, a fully integrated data cache (D-Cache).

Because of this one difference, I chose to provide the STM32 F7 code its own directories separate from the STM32 F1, F2, F3, and F4.

Porting Simple Drivers

Some of the STM32 F4 drivers can be ported to the STM32 F7 very simply; many ports are just a matter of copying files and doing some search-and-replace. For example:

  • Compare the two register definition files; make sure that the STM32 F4 peripheral is identical (or nearly identical) to the F7 peripheral. If so, then
  • Copy the register definition file from the stm32/hardware directory to the stm32f7/hardware directory, making name changes as appropriate and updating any minor register differences.
  • Copy the corresponding C file (and possibly a .h file) from the stm32/ directory to the stm32f7/ directory, again making any naming changes and modifications for any register differences.
  • Update the Make.defs file to include the new C file in the build.

Porting Complex Drivers

The Cortex-M7 D-Cache, however, does raise compatibility issues for most complex STM32 F4 drivers. Even though the peripheral registers may be essentially the same between the STM32F429 and the STM32F746, many drivers for the STM32F429 will not be directly compatible with the STM32F746, particularly drivers that use DMA. And that includes most complex STM32 drivers!

Cache Coherency

With DMA, the contents of physical RAM are accessed directly by peripheral hardware without intervention from the CPU. The CPU itself deals only indirectly with RAM through the D-Cache: When you read data from RAM, it is first loaded into the D-Cache and then accessed by the CPU. If the RAM contents are already in the D-Cache, then physical RAM is not accessed at all! Similarly, when you write data into RAM (with write buffering enabled), it may not actually be written to physical RAM but may just remain in the D-Cache in a dirty cache line until that cache line is flushed to memory. Thus, DMA can leave the contents of the D-Cache inconsistent with the contents of physical RAM. Such issues are referred to as Cache Coherency problems.
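The coherency problem can be reproduced on a host PC with a toy model. The sketch below is only an illustration, not NuttX code: a single 32-byte cache line shadows a single line of "physical RAM", and all of the toy_* names are hypothetical. CPU accesses go through the cached copy, DMA accesses go straight to RAM, and the clean and invalidate operations reconcile the two.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define TOY_LINESIZE 32  /* Cortex-M7 D-Cache line size in bytes */

/* Toy model (hypothetical): one line of RAM shadowed by one cache line */

struct toy_cache_s
{
  uint8_t ram[TOY_LINESIZE];   /* What the DMA hardware sees */
  uint8_t line[TOY_LINESIZE];  /* What the CPU sees while 'valid' */
  bool valid;                  /* The line holds a copy of RAM */
  bool dirty;                  /* The line is newer than RAM */
};

/* Load the line from RAM on a cache miss */

static void toy_fill(struct toy_cache_s *c)
{
  if (!c->valid)
    {
      memcpy(c->line, c->ram, TOY_LINESIZE);
      c->valid = true;
      c->dirty = false;
    }
}

/* CPU accesses always go through the cache line (write-back policy) */

uint8_t toy_cpu_read(struct toy_cache_s *c, unsigned int offset)
{
  toy_fill(c);
  return c->line[offset];
}

void toy_cpu_write(struct toy_cache_s *c, unsigned int offset, uint8_t value)
{
  toy_fill(c);
  c->line[offset] = value;
  c->dirty = true;
}

/* DMA accesses bypass the cache entirely */

uint8_t toy_dma_read(const struct toy_cache_s *c, unsigned int offset)
{
  return c->ram[offset];
}

void toy_dma_write(struct toy_cache_s *c, unsigned int offset, uint8_t value)
{
  c->ram[offset] = value;
}

/* Clean: write a dirty line back to RAM */

void toy_clean(struct toy_cache_s *c)
{
  if (c->valid && c->dirty)
    {
      memcpy(c->ram, c->line, TOY_LINESIZE);
      c->dirty = false;
    }
}

/* Invalidate: discard the line so that the next CPU read refetches RAM */

void toy_invalidate(struct toy_cache_s *c)
{
  c->valid = false;
  c->dirty = false;
}
```

In this model, a toy_cpu_write() followed by toy_dma_read() returns the old RAM contents until toy_clean() is called, and a toy_dma_write() is invisible to toy_cpu_read() until toy_invalidate() is called.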

DMA

DMA Read Accesses

A DMA read access occurs when we program DMA hardware to read data from a peripheral and store that data into RAM. This happens, for example, when we read a packet from the network, when we read a serial byte of data from a UART, when we read a block from an MMC/SD card, and so on.

...

What if the D-Cache line is also dirty? What if we have writes to the DMA buffer that were never flushed to physical RAM? Those writes will then never make it to physical memory if the D-Cache is invalidated. Rule 2: Never write to read DMA buffer memory! Rule 3: Make sure that all DMA read buffers are aligned to the D-Cache line size so that there are no spill-over cache effects at the borders of the invalidated cache line.
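To make the spill-over effect concrete, here is a small host-side sketch. It assumes the 32-byte Cortex-M7 D-Cache line size; the name demo_spill_bytes() and the DEMO_* macros are hypothetical and exist only for this illustration. Cache operations work on whole lines, so the function counts how many bytes outside of a buffer share a cache line with it and would therefore be caught up in an invalidate of that buffer.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_DCACHE_LINESIZE 32u  /* Assumed Cortex-M7 D-Cache line size */
#define DEMO_DCACHE_MASK     ((uintptr_t)DEMO_DCACHE_LINESIZE - 1)

/* Return the number of bytes that lie outside of the buffer [start, end)
 * but inside the cache lines that a cache operation on that buffer must
 * touch.  Zero means that the buffer is cache-line aligned at both ends.
 */

size_t demo_spill_bytes(uintptr_t start, uintptr_t end)
{
  uintptr_t first = start & ~DEMO_DCACHE_MASK;                   /* Round down */
  uintptr_t last  = (end + DEMO_DCACHE_MASK) & ~DEMO_DCACHE_MASK; /* Round up */

  return (size_t)((start - first) + (last - end));
}
```

For a 64-byte buffer at address 0x20000000 the result is zero. For a 32-byte buffer at 0x20000010, the invalidate touches two cache lines and 32 bytes of neighboring data are affected as well.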

DMA Write Accesses

A DMA write access occurs when we program DMA hardware to write data from RAM into a peripheral. This happens, for example, when we send a packet on a network or when we write a block of data to an MMC/SD card. In this case, the hardware expects the correct data to be in physical RAM when the write DMA is performed. If it is not, then the wrong data will be sent.

...

What if you had two adjacent DMA buffers side-by-side? Couldn't the cleaning of the write buffer force writing into the adjacent read buffer? Yes! Rule 5: Make sure that all DMA write buffers are aligned to the D-Cache line size so that there are no spill-over cache effects at the borders of the cleaned cache line.
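One way to satisfy the alignment rules for statically allocated buffers is sketched below. This is an illustration only: it assumes a 32-byte D-Cache line and GCC-style alignment attributes, and every demo_*/DEMO_* name is hypothetical. Each buffer starts on a cache line boundary and its size is rounded up to a whole number of lines, so two adjacent buffers can never share a cache line.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_DCACHE_LINESIZE 32  /* Assumed Cortex-M7 D-Cache line size */
#define DEMO_ALIGN_UP(n) \
  (((n) + DEMO_DCACHE_LINESIZE - 1) & ~(DEMO_DCACHE_LINESIZE - 1))

#define DEMO_RXLEN 100  /* Hypothetical payload sizes */
#define DEMO_TXLEN 50

/* Sizes are padded to whole cache lines; starts are line aligned */

static uint8_t g_demo_rxbuf[DEMO_ALIGN_UP(DEMO_RXLEN)]
  __attribute__((aligned(DEMO_DCACHE_LINESIZE)));
static uint8_t g_demo_txbuf[DEMO_ALIGN_UP(DEMO_TXLEN)]
  __attribute__((aligned(DEMO_DCACHE_LINESIZE)));

/* Return true if cleaning or invalidating [buf, buf + len) cannot touch
 * any byte outside of the buffer.
 */

bool demo_buffer_is_cache_safe(const void *buf, size_t len)
{
  return len > 0 &&
         ((uintptr_t)buf % DEMO_DCACHE_LINESIZE) == 0 &&
         (len % DEMO_DCACHE_LINESIZE) == 0;
}
```

A driver could use a check like demo_buffer_is_cache_safe() in a debug assertion before starting DMA on a buffer that it did not allocate itself.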

Write-back vs. Write-through D-Cache

The Cortex-M7 supports both write-back and write-through data cache configurations. The write-back D-Cache works just as described above: dirty cache lines are not written to physical memory until the cache line is flushed. With a write-through D-Cache, however, writes behave just as they would without the D-Cache: they always go directly to physical RAM.
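The difference between the two write policies can be shown with another toy model. This is a hypothetical host-side sketch (all wp_* names are invented for the illustration) in which a single word of RAM has a cached copy. Note what the model suggests: write-through removes the need to clean before a write DMA, but a read DMA still changes RAM behind the cache's back, so an invalidate is still needed under either policy.

```c
#include <stdbool.h>
#include <stdint.h>

enum wp_policy_e
{
  WP_WRITE_BACK,
  WP_WRITE_THROUGH
};

/* Toy model (hypothetical): one word of RAM with a cached copy */

struct wp_word_s
{
  uint32_t ram;             /* Physical RAM, as seen by DMA hardware */
  uint32_t cached;          /* Cached copy, as seen by the CPU */
  bool dirty;               /* Cached copy is newer than RAM */
  enum wp_policy_e policy;  /* Write policy of the toy cache */
};

/* CPU store: write-through updates RAM at once, write-back defers it */

void wp_store(struct wp_word_s *w, uint32_t value)
{
  w->cached = value;
  if (w->policy == WP_WRITE_THROUGH)
    {
      w->ram = value;
    }
  else
    {
      w->dirty = true;
    }
}

/* CPU load: always satisfied from the cached copy */

uint32_t wp_load(const struct wp_word_s *w)
{
  return w->cached;
}

/* Clean: write a dirty cached copy back to RAM */

void wp_clean(struct wp_word_s *w)
{
  if (w->dirty)
    {
      w->ram = w->cached;
      w->dirty = false;
    }
}

/* Invalidate, modeled here as an immediate refetch from RAM */

void wp_invalidate(struct wp_word_s *w)
{
  w->cached = w->ram;
  w->dirty = false;
}

/* Read DMA: hardware stores into RAM without updating the cache */

void wp_dma_receive(struct wp_word_s *w, uint32_t value)
{
  w->ram = value;
}
```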

...

And what if the driver receives arbitrarily aligned buffers from the application? Then what? Should write buffering be disabled in that case too? And what is the performance cost for disabling the write buffer?

DMA Module

Some STM32 F7 peripherals have built-in DMA. The STM32 F7 Ethernet driver discussed below is a good example of such a peripheral with built-in DMA capability. Most STM32 F7 peripherals, however, have no built-in DMA capability and, instead, must use a common STM32 F7 DMA module to perform DMA data transfers. The interfaces to that common DMA module are described in arch/arm/src/stm32f7/stm32_dma.h.

...

  • TX DMA transfers. Before calling stm32_dmastart() to start a TX transfer, the DMA client must clean the DMA buffer so that the content to be DMA'ed is present in physical memory.
  • RX DMA transfers. At the completion of all DMAs, the DMA client will receive a callback providing the final status of the DMA transfer. For the case of RX DMA completion callbacks, logic in the callback handler should invalidate the RX buffer before any attempt is made to access new RX buffer content.
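The ordering described in the two items above can be sketched as follows. This is a host-side illustration, not the real interface: demo_clean_dcache(), demo_invalidate_dcache(), and demo_dmastart() are hypothetical stand-ins with simplified signatures that only record the order in which they are called, and the stand-in DMA "completes" immediately by invoking the callback.

```c
#include <stddef.h>
#include <stdint.h>

/* Events recorded by the hypothetical stand-ins below */

enum demo_event_e
{
  DEMO_CLEAN = 1,
  DEMO_START,
  DEMO_INVALIDATE,
  DEMO_CONSUME
};

static enum demo_event_e g_demo_log[8];
static size_t g_demo_nevents;

static void demo_record(enum demo_event_e event)
{
  if (g_demo_nevents < 8)
    {
      g_demo_log[g_demo_nevents++] = event;
    }
}

/* Stand-ins (assumptions): they only log, they do not touch any cache */

static void demo_clean_dcache(uintptr_t start, uintptr_t end)
{
  (void)start;
  (void)end;
  demo_record(DEMO_CLEAN);
}

static void demo_invalidate_dcache(uintptr_t start, uintptr_t end)
{
  (void)start;
  (void)end;
  demo_record(DEMO_INVALIDATE);
}

static void demo_dmastart(void (*callback)(void *arg), void *arg)
{
  demo_record(DEMO_START);
  callback(arg);  /* Simulate immediate completion of the transfer */
}

struct demo_xfer_s
{
  uint8_t *buf;
  size_t len;
};

static void demo_tx_done(void *arg)
{
  (void)arg;      /* Nothing cache-related to do after a TX transfer */
}

/* TX: clean the buffer first so that DMA reads up-to-date RAM */

void demo_tx_start(struct demo_xfer_s *xfer)
{
  demo_clean_dcache((uintptr_t)xfer->buf, (uintptr_t)xfer->buf + xfer->len);
  demo_dmastart(demo_tx_done, xfer);
}

/* RX: invalidate in the completion callback, before reading the data */

static void demo_rx_done(void *arg)
{
  struct demo_xfer_s *xfer = arg;

  demo_invalidate_dcache((uintptr_t)xfer->buf,
                         (uintptr_t)xfer->buf + xfer->len);
  demo_record(DEMO_CONSUME);  /* Only now is it safe to read xfer->buf */
}

void demo_rx_start(struct demo_xfer_s *xfer)
{
  demo_dmastart(demo_rx_done, xfer);
}
```

In a real driver the completion callback also receives the transfer status, and the buffer passed in must obey the alignment rules given earlier.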

Converting an STM32F429 Driver for the STM32F746

Since the STM32 F7 is so similar to the STM32 F4, we have a wealth of working drivers to port from. Only a little effort is required. Below is a summary of the kinds of things that you would have to do to convert an STM32F429 driver to the STM32F746.

An Example

There is a good example in the STM32 Ethernet driver. The STM32 F7 Ethernet driver (arch/arm/src/stm32f7/stm32_ethernet.c) derives directly from the STM32 F4 Ethernet driver (arch/arm/src/stm32/stm32_eth.c). These two Ethernet MAC peripherals are nearly identical. Only changes that are a direct consequence of the STM32 F7 D-Cache were required to make the driver work on the STM32 F7. Those changes are summarized below.

Reorganize DMA Data Structure

The STM32 Ethernet driver has four different kinds of DMA buffers:

...

Code Block
  struct stm32_ethmac_s
  {
    ...
    /* Descriptor allocations */

    struct eth_rxdesc_s rxtable[CONFIG_STM32_ETH_NRXDESC];
    struct eth_txdesc_s txtable[CONFIG_STM32_ETH_NTXDESC];

    /* Buffer allocations */

    uint8_t rxbuffer[CONFIG_STM32_ETH_NRXDESC*CONFIG_STM32_ETH_BUFSIZE];
    uint8_t alloc[STM32_ETH_NFREEBUFFERS*CONFIG_STM32_ETH_BUFSIZE];
  };

...

The following definitions were added to support aligning the sizes of the buffers to the Cortex-M7 D-Cache line size:

Code Block
  /* Buffers used for DMA access must begin on an address aligned with the
   * D-Cache line and must be an even multiple of the D-Cache line size.
   * These size/alignment requirements are necessary so that D-Cache flush
   * and invalidate operations will not have any additional effects.
   *
   * The TX and RX descriptors are normally 16 bytes in size but could be
   * 32 bytes in size if the enhanced descriptor format is used (it is not).
   */

  #define DMA_BUFFER_MASK    (ARMV7M_DCACHE_LINESIZE - 1)
  #define DMA_ALIGN_UP(n)    (((n) + DMA_BUFFER_MASK) & ~DMA_BUFFER_MASK)
  #define DMA_ALIGN_DOWN(n)  ((n) & ~DMA_BUFFER_MASK)

  #ifndef CONFIG_STM32F7_ETH_ENHANCEDDESC
  #  define RXDESC_SIZE       16
  #  define TXDESC_SIZE       16
  #else
  #  define RXDESC_SIZE       32
  #  define TXDESC_SIZE       32
  #endif

  #define RXDESC_PADSIZE      DMA_ALIGN_UP(RXDESC_SIZE)
  #define TXDESC_PADSIZE      DMA_ALIGN_UP(TXDESC_SIZE)
  #define ALIGNED_BUFSIZE     DMA_ALIGN_UP(ETH_BUFSIZE)

  #define RXTABLE_SIZE        (STM32F7_NETHERNET * CONFIG_STM32F7_ETH_NRXDESC)
  #define TXTABLE_SIZE        (STM32F7_NETHERNET * CONFIG_STM32F7_ETH_NTXDESC)

  #define RXBUFFER_SIZE       (CONFIG_STM32F7_ETH_NRXDESC * ALIGNED_BUFSIZE)
  #define RXBUFFER_ALLOC      (STM32F7_NETHERNET * RXBUFFER_SIZE)

  #define TXBUFFER_SIZE       (STM32_ETH_NFREEBUFFERS * ALIGNED_BUFSIZE)
  #define TXBUFFER_ALLOC      (STM32F7_NETHERNET * TXBUFFER_SIZE)
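As a worked example of the rounding arithmetic, the sketch below evaluates the alignment macros on a host PC. It assumes the 32-byte Cortex-M7 D-Cache line size, defined here as the hypothetical DEMO_DCACHE_LINESIZE in place of ARMV7M_DCACHE_LINESIZE; the 1524-byte buffer size is also just an illustrative value.

```c
#include <stdint.h>

#define DEMO_DCACHE_LINESIZE 32  /* Stand-in for ARMV7M_DCACHE_LINESIZE */

#define DMA_BUFFER_MASK    (DEMO_DCACHE_LINESIZE - 1)
#define DMA_ALIGN_UP(n)    (((n) + DMA_BUFFER_MASK) & ~DMA_BUFFER_MASK)
#define DMA_ALIGN_DOWN(n)  ((n) & ~DMA_BUFFER_MASK)

enum
{
  /* A 16-byte descriptor is padded to one full cache line */

  DEMO_DESC_PADSIZE    = DMA_ALIGN_UP(16),    /* 32 */

  /* A 32-byte enhanced descriptor already fills one cache line */

  DEMO_ENHDESC_PADSIZE = DMA_ALIGN_UP(32),    /* 32 */

  /* A hypothetical 1524-byte packet buffer grows to 48 cache lines */

  DEMO_BUF_PADSIZE     = DMA_ALIGN_UP(1524),  /* 1536 */

  /* Rounding the same value down instead yields 47 cache lines */

  DEMO_BUF_DOWNSIZE    = DMA_ALIGN_DOWN(1524) /* 1504 */
};
```

The macros depend on the line size being a power of two, which is what makes the mask-and-complement trick valid.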

...

The RX and TX descriptor types are replaced with union types that assure that the allocations will be aligned in size:

Code Block
  /* These union types force the allocated size of TX and RX descriptors to
   * be padded to an exact multiple of the Cortex-M7 D-Cache line size.
   */

  union stm32_txdesc_u
  {
    uint8_t             pad[TXDESC_PADSIZE];
    struct eth_txdesc_s txdesc;
  };

  union stm32_rxdesc_u
  {
    uint8_t             pad[RXDESC_PADSIZE];
    struct eth_rxdesc_s rxdesc;
  };
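The padding trick can be verified at compile time. The following is a minimal sketch, assuming a 32-byte D-Cache line, a C11 compiler, and a hypothetical four-word descriptor layout; none of the demo_* names come from the driver. The union takes the size of its largest member, so the pad array forces every element of a descriptor table to occupy whole cache lines.

```c
#include <stdint.h>

#define DEMO_DCACHE_LINESIZE 32  /* Assumed Cortex-M7 D-Cache line size */
#define DEMO_ALIGN_UP(n) \
  (((n) + DEMO_DCACHE_LINESIZE - 1) & ~(DEMO_DCACHE_LINESIZE - 1))

/* Hypothetical 16-byte descriptor made up of four 32-bit words */

struct demo_rxdesc_s
{
  volatile uint32_t rdes0;
  volatile uint32_t rdes1;
  volatile uint32_t rdes2;
  volatile uint32_t rdes3;
};

/* The pad member rounds the union size up to a whole cache line */

union demo_rxdesc_u
{
  uint8_t              pad[DEMO_ALIGN_UP(sizeof(struct demo_rxdesc_s))];
  struct demo_rxdesc_s rxdesc;
};

/* Fail the build if the padding assumption is ever violated */

_Static_assert(sizeof(union demo_rxdesc_u) % DEMO_DCACHE_LINESIZE == 0,
               "descriptor union must be a multiple of the D-Cache line");
```

With this layout, each array element is 32 bytes even though the descriptor itself needs only 16, so cleaning or invalidating one descriptor can never disturb its neighbor, provided that the table itself starts on a cache line boundary.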


Then, finally, the new buffers are defined by the following globals:

Code Block
  /* DMA buffers.  DMA buffers must:
   *
   * 1. Be a multiple of the D-Cache line size.  This requirement is assured
   *    by the definition of RXDMA buffer size above.
   * 2. Be aligned to D-Cache line boundaries, and
   * 3. Be positioned in DMA-able memory (*NOT* DTCM memory).  This must
   *    be managed by logic in the linker script file.
   *
   * These DMA buffers are defined sequentially here to best assure optimal
   * packing of the buffers.
   */

  /* Descriptor allocations */

  static union stm32_rxdesc_u g_rxtable[RXTABLE_SIZE]
    __attribute__((aligned(ARMV7M_DCACHE_LINESIZE)));
  static union stm32_txdesc_u g_txtable[TXTABLE_SIZE]
    __attribute__((aligned(ARMV7M_DCACHE_LINESIZE)));

  /* Buffer allocations */

  static uint8_t g_rxbuffer[RXBUFFER_ALLOC]
    __attribute__((aligned(ARMV7M_DCACHE_LINESIZE)));
  static uint8_t g_txbuffer[TXBUFFER_ALLOC]
    __attribute__((aligned(ARMV7M_DCACHE_LINESIZE)));

...

This does, of course, force additional changes to the functions that initialize the buffer chains, but I will leave that to the interested reader to discover.
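For readers who want a starting point, here is a rough host-side sketch of what initializing an RX descriptor chain over padded tables might look like. It is not the driver's code: the descriptor layout, the circular chain, the buffer size, and all demo_* names are assumptions made for the illustration, and real pointers are used in place of the 32-bit rdes2/rdes3 words so that the sketch also runs on a 64-bit host.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_DCACHE_LINESIZE 32    /* Assumed Cortex-M7 D-Cache line size */
#define DEMO_NRXDESC         4     /* Hypothetical number of descriptors */
#define DEMO_ALIGNED_BUFSIZE 1536  /* Hypothetical line-aligned buffer size */

/* Simplified descriptor (assumption): pointers replace rdes2 and rdes3 */

struct demo_rxdesc_s
{
  uint32_t status;
  uint32_t control;
  uint8_t *buffer;             /* Plays the role of rdes2 */
  struct demo_rxdesc_s *next;  /* Plays the role of rdes3 */
};

union demo_rxdesc_u
{
  uint8_t              pad[DEMO_DCACHE_LINESIZE];
  struct demo_rxdesc_s rxdesc;
};

_Static_assert(sizeof(struct demo_rxdesc_s) <= DEMO_DCACHE_LINESIZE,
               "descriptor must fit within one padded element");

static union demo_rxdesc_u g_demo_rxtable[DEMO_NRXDESC]
  __attribute__((aligned(DEMO_DCACHE_LINESIZE)));
static uint8_t g_demo_rxbuffer[DEMO_NRXDESC * DEMO_ALIGNED_BUFSIZE]
  __attribute__((aligned(DEMO_DCACHE_LINESIZE)));

/* Link the descriptors into a ring.  Descriptors are reached through the
 * union member and buffers are stepped by the aligned buffer size, so
 * every descriptor and every buffer starts on a cache line boundary.
 */

struct demo_rxdesc_s *demo_rxdesc_chain_init(void)
{
  size_t i;

  for (i = 0; i < DEMO_NRXDESC; i++)
    {
      struct demo_rxdesc_s *rxdesc = &g_demo_rxtable[i].rxdesc;

      rxdesc->status  = 0;
      rxdesc->control = 0;
      rxdesc->buffer  = &g_demo_rxbuffer[i * DEMO_ALIGNED_BUFSIZE];
      rxdesc->next    = &g_demo_rxtable[(i + 1) % DEMO_NRXDESC].rxdesc;
    }

  return &g_demo_rxtable[0].rxdesc;
}
```

The key change relative to an unpadded table is that the code indexes the array of unions and then selects the descriptor member, rather than indexing an array of descriptors directly.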

Add Cache Operations

The Cortex-M7 cache operations are available when the following file is included:

...

Here is an example where the RX descriptors are invalidated:

Code Block
  static int stm32_recvframe(struct stm32_ethmac_s *priv)
  {
  ...
    /* Scan descriptors owned by the CPU.  */

    rxdesc = priv->rxhead;

    /* Forces the first RX descriptor to be re-read from physical memory */

    arch_invalidate_dcache((uintptr_t)rxdesc,
                           (uintptr_t)rxdesc + sizeof(struct eth_rxdesc_s));

    for (i = 0;
         (rxdesc->rdes0 & ETH_RDES0_OWN) == 0 &&
          i < CONFIG_STM32F7_ETH_NRXDESC &&
          priv->inflight < CONFIG_STM32F7_ETH_NTXDESC;
         i++)
      {
      ...
        /* Try the next descriptor */

        rxdesc = (struct eth_rxdesc_s *)rxdesc->rdes3;

        /* Force the next RX descriptor to be re-read from physical memory */

        arch_invalidate_dcache((uintptr_t)rxdesc,
                               (uintptr_t)rxdesc + sizeof(struct eth_rxdesc_s));
      }
  ...
  }


Here is an example where a TX descriptor is cleaned:

Code Block
  static int stm32_transmit(struct stm32_ethmac_s *priv)
  {
  ...
            /* Give the descriptor to DMA */

            txdesc->tdes0 |= ETH_TDES0_OWN;

            /* Flush the contents of the modified TX descriptor into physical
             * memory.
             */

            arch_clean_dcache((uintptr_t)txdesc,
                              (uintptr_t)txdesc + sizeof(struct eth_txdesc_s));
  ...
  }


Here is where the read buffer is invalidated just after completing a read DMA:

Code Block
  static int stm32_recvframe(struct stm32_ethmac_s *priv)
  {
  ...
                ...
                /* Force the completed RX DMA buffer to be re-read from
                 * physical memory.
                 */

                arch_invalidate_dcache((uintptr_t)dev->d_buf,
                                       (uintptr_t)dev->d_buf + dev->d_len);

                nllvdbg("rxhead: %p d_buf: %p d_len: %d\n",
                        priv->rxhead, dev->d_buf, dev->d_len);

                /* Return success */

                return OK;
  ...
  }


Here is where the write buffer is cleaned prior to starting a write DMA:

Code Block
  static int stm32_transmit(struct stm32_ethmac_s *priv)
  {
  ...
    /* Flush the contents of the TX buffer into physical memory */

    arch_clean_dcache((uintptr_t)priv->dev.d_buf,
                      (uintptr_t)priv->dev.d_buf + priv->dev.d_len);
  ...
  }