I recently completed a port to the STMicro STM32F746G Discovery board. That MCU is clearly a derivative of the STM32 F3/F4 and many peripherals are, in fact, essentially identical to those of the STM32F429. The biggest difference is that the STM32F746 sports a Cortex-M7, which includes several improvements over the Cortex-M4 including, most relevant to this discussion, a fully integrated data cache (D-Cache).
Because of this one difference, I chose to give the STM32 F7 code its own directories, separate from the STM32 F1, F2, F3, and F4.
Porting Simple Drivers
Some of the STM32 F4 drivers can be ported to the STM32 F7 very simply; many ports would just be a matter of copying files and some search-and-replace. For example:
- Compare the two register definitions files; make sure that the STM32 F4 peripheral is identical (or nearly identical) to the F7 peripheral. If so then,
- Copy the register definition file to the stm32f7/hardware directory, making name changes as appropriate and updating any minor register differences.
- Copy the corresponding C file (and possibly a .h file) from the stm32/ directory to the stm32f7/ directory, again making any naming changes and modifications for any register differences.
- Update the Make.defs file to include the new C file in the build.
Porting Complex Drivers
The Cortex-M7 D-Cache, however, does raise compatibility issues for most complex STM32 F4 and F7 drivers. Even though the peripheral registers may be essentially the same between the STM32F429 and the STM32F746, many drivers for the STM32F429 will not be directly compatible with the STM32F746, particularly drivers that use DMA. And that includes most complex STM32 drivers!
With DMA, the contents of physical RAM are accessed directly by peripheral hardware without intervention from the CPU. The CPU itself deals with RAM only indirectly, through the D-Cache: When you read data from RAM, it is first loaded into the D-Cache and then accessed by the CPU. If the RAM contents are already in the D-Cache, then physical RAM is not accessed at all! Similarly, when you write data into RAM (with write buffering enabled), it may not actually be written to physical RAM but may just remain in the D-Cache in a dirty cache line until that cache line is flushed to memory. Thus, DMA can cause inconsistencies between the contents of the D-Cache and the contents of physical RAM. Such issues are referred to as Cache Coherency problems.
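Both failure modes can be reproduced with a toy model. The sketch below is only an illustration under strong assumptions: it models a single 32-byte write-back cache line in front of a 32-byte RAM, it is not a model of the real Cortex-M7 cache, and every name in it is hypothetical. "Clean" writes a dirty line back to RAM; "invalidate" discards the cached copy.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define LINE_SIZE 32u  /* Assumed cache line size in bytes */

static uint8_t g_ram[LINE_SIZE];   /* "Physical RAM": one line's worth */
static uint8_t g_line[LINE_SIZE];  /* The single cache line in front of it */
static bool g_valid;               /* Line holds a copy of RAM */
static bool g_dirty;               /* Line holds writes not yet in RAM */

/* Load the line from RAM on a cache miss */
static void line_fill(void)
{
  if (!g_valid)
    {
      memcpy(g_line, g_ram, LINE_SIZE);
      g_valid = true;
      g_dirty = false;
    }
}

/* CPU accesses always go through the cache line */
uint8_t cpu_read(unsigned int offset)
{
  line_fill();
  return g_line[offset];
}

void cpu_write(unsigned int offset, uint8_t value)
{
  line_fill();
  g_line[offset] = value;
  g_dirty = true;
}

/* Clean: write a dirty line back to RAM */
void dcache_clean(void)
{
  if (g_valid && g_dirty)
    {
      memcpy(g_ram, g_line, LINE_SIZE);
      g_dirty = false;
    }
}

/* Invalidate: discard the cached copy (and any unwritten data in it) */
void dcache_invalidate(void)
{
  g_valid = false;
  g_dirty = false;
}

/* DMA accesses bypass the cache and touch RAM only */
uint8_t dma_read_ram(unsigned int offset)
{
  return g_ram[offset];
}

void dma_write_ram(unsigned int offset, uint8_t value)
{
  g_ram[offset] = value;
}
```

In this model, after cpu_write(0, 0x11) the DMA side still reads the old value from RAM until dcache_clean() runs, and after dma_write_ram(1, 0x22) the CPU still reads the old cached value until dcache_invalidate() runs.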
DMA Read Accesses
A DMA read access occurs when we program DMA hardware to read data from a peripheral and store that data into RAM. This happens, for example, when we read a packet from the network, when we read a serial byte of data from a UART, when we read a block from an MMC/SD card, and so on.
In this case, the DMA hardware will change the contents of physical RAM without the knowledge of the CPU. So if the memory that was modified by the DMA read operation is also in the D-Cache, then the contents of the D-Cache will no longer be valid; they will no longer match the physical contents of the memory. In order to fix this, the Cortex-M7 supports a special cache operation that can be used to invalidate the D-Cache contents associated with the read DMA buffer address range. Invalidation simply means discarding the currently cached D-Cache lines so that they will be refetched from physical RAM.
Rule 1a: Always invalidate RX DMA buffers sometime before or after starting the read DMA, but certainly before accessing the read buffer data.
Rule 1b: Never read from the read DMA buffer before the read DMA completes; otherwise you will re-cache the DMA buffer content.
What if the D-Cache line is also dirty? What if we have writes to the DMA buffer that were never flushed to physical RAM? Those writes will never make it to physical memory if the D-Cache is invalidated.
Rule 2: Never write to read DMA buffer memory!
Rule 3: Make sure that all DMA read buffers are aligned to the D-Cache line size so that there are no spill-over cache effects at the borders of the invalidated cache line.
DMA Write Accesses
A DMA write access occurs when we program DMA hardware to write data from RAM into a peripheral. This happens, for example, when we send a packet on a network or when we write a block of data to an MMC/SD card. In this case, the hardware expects the correct data to be in physical RAM when the write DMA is performed. If it is not, then the wrong data will be sent.
We ensure that there are no pending writes in a dirty cache line by cleaning (or flushing) the dirty cache lines, i.e., by forcing any pending writes in the D-Cache lines to be written to physical RAM.
Rule 4: Always clean (or flush) the D-Cache before starting a write DMA to force all data to be written from the D-Cache into physical RAM.
What if you had two adjacent DMA buffers side-by-side? Couldn't the cleaning of the write buffer force writing into the adjacent read buffer? Yes!
Rule 5: Make sure that all DMA write buffers are aligned to the D-Cache line size so that there are no spill-over cache effects at the borders of the cleaned cache line.
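The spill-over behind the alignment rules is simple arithmetic, because cache maintenance always operates on whole lines. The following sketch computes how many bytes outside a buffer are affected when the buffer is cleaned or invalidated. It assumes a 32-byte line size, and the helper names are hypothetical.

```c
#include <stdint.h>

#define DCACHE_LINESIZE 32u  /* Assumed D-Cache line size in bytes */

/* Address of the first byte of the cache line that contains addr */
uintptr_t line_floor(uintptr_t addr)
{
  return addr & ~(uintptr_t)(DCACHE_LINESIZE - 1u);
}

/* Smallest line-aligned address that is greater than or equal to addr */
uintptr_t line_ceil(uintptr_t addr)
{
  return (addr + DCACHE_LINESIZE - 1u) & ~(uintptr_t)(DCACHE_LINESIZE - 1u);
}

/* Bytes outside [addr, addr + len) that share cache lines with the buffer
 * and are therefore also cleaned or invalidated along with it.
 */
uintptr_t spill_bytes(uintptr_t addr, uintptr_t len)
{
  uintptr_t start = line_floor(addr);
  uintptr_t end   = line_ceil(addr + len);

  return (end - start) - len;
}
```

For example, a 20-byte buffer at address 40 shares its cache line with 12 bytes of neighboring data, while a 64-byte buffer at address 64 shares none. A 16-byte object at a line-aligned address still shares its line with 16 other bytes, which is why both the alignment and the size of a DMA buffer matter.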
Write-back vs. Write-through D-Cache
The Cortex-M7 supports both write-back and write-through data cache configurations. The write-back D-Cache works just as described above: dirty cache lines are not written to physical memory until the cache line is flushed. With a write-through D-Cache, however, writes behave just as they would without the D-Cache: they always go directly to physical RAM.
If I am using a write-through D-Cache, can't I just forget about cleaning the D-Cache? No, because you don't know how a user is going to configure the D-Cache.
Rule 6: Always assume that write-back caching is being performed; otherwise, your driver will not be portable.
You may notice in
NOTE: I have experienced other cases (on the SAMV7) where write buffering must be disabled. In one case, a certain peripheral used 16-byte DMA descriptors in an array. Clearly it is impossible to manage the caching of individual 16-byte DMA descriptors with a 32-byte cache line. I think that the only option in this case is to disable the write buffer.
And what if the driver receives arbitrarily aligned buffers from the application? Then what? Should write buffering be disabled in that case too? And what is the performance cost for disabling the write buffer?
Some STM32 F7 peripherals have built-in DMA. The Ethernet peripheral, whose STM32 F7 driver is discussed below, is a good example of a peripheral with built-in DMA capability. Most STM32 F7 peripherals, however, have no built-in DMA capability and, instead, must use a common STM32 F7 DMA module to perform DMA data transfers. The interfaces to that common DMA module are described in
The DMA module does not perform any cache operations. Rather, the client of the DMA module must perform the cache operations. Here are the basic rules:
- TX DMA transfers. Before calling stm32_dmastart() to start a TX transfer, the DMA client must clean the DMA buffer so that the content to be DMA'ed is present in physical memory.
- RX DMA transfers. At the completion of all DMAs, the DMA client will receive a callback providing the final status of the DMA transfer. For the case of RX DMA completion callbacks, logic in the callback handler should invalidate the RX buffer before any attempt is made to access new RX buffer content.
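The ordering in these two rules can be sketched as follows. This is a host-runnable illustration, not the real DMA interface: sketch_dmastart(), sketch_clean_dcache(), and sketch_invalidate_dcache() are hypothetical stand-ins that only log the order in which they are called, the simulated transfer completes immediately, and a 32-byte cache line is assumed.

```c
#include <stddef.h>
#include <stdint.h>

#define DCACHE_LINESIZE 32u  /* Assumed D-Cache line size in bytes */

typedef void (*sketch_dmacallback_t)(int status, void *arg);

/* Log of the operations performed, in call order */
enum sketch_op_e { OP_CLEAN = 1, OP_INVALIDATE, OP_DMASTART };
enum sketch_op_e g_log[8];
unsigned int g_nlog;

static void log_op(enum sketch_op_e op)
{
  if (g_nlog < 8)
    {
      g_log[g_nlog++] = op;
    }
}

/* Hypothetical stand-ins for the platform cache operations */
void sketch_clean_dcache(uintptr_t start, uintptr_t end)
{
  (void)start;
  (void)end;
  log_op(OP_CLEAN);
}

void sketch_invalidate_dcache(uintptr_t start, uintptr_t end)
{
  (void)start;
  (void)end;
  log_op(OP_INVALIDATE);
}

/* Hypothetical stand-in for starting a DMA: "completes" at once with status 0 */
void sketch_dmastart(sketch_dmacallback_t callback, void *arg)
{
  log_op(OP_DMASTART);
  callback(0, arg);
}

static uint8_t g_txbuf[64] __attribute__((aligned(DCACHE_LINESIZE)));
static uint8_t g_rxbuf[64] __attribute__((aligned(DCACHE_LINESIZE)));

static void tx_done(int status, void *arg)
{
  (void)status;
  (void)arg;
}

static void rx_done(int status, void *arg)
{
  uint8_t *first = arg;

  /* Invalidate before touching the newly received data */
  sketch_invalidate_dcache((uintptr_t)g_rxbuf,
                           (uintptr_t)g_rxbuf + sizeof(g_rxbuf));
  if (status == 0)
    {
      *first = g_rxbuf[0];
    }
}

void sketch_start_tx(void)
{
  g_txbuf[0] = 0xa5;  /* This CPU write may sit in a dirty cache line */

  /* Clean before starting the DMA so the data is in physical memory */
  sketch_clean_dcache((uintptr_t)g_txbuf,
                      (uintptr_t)g_txbuf + sizeof(g_txbuf));
  sketch_dmastart(tx_done, NULL);
}

void sketch_start_rx(uint8_t *first)
{
  sketch_dmastart(rx_done, first);
}
```

The point of the sketch is the ordering only: clean, then start, for TX; start, then invalidate in the completion callback, then read, for RX.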
Converting an STM32F429 Driver for the STM32F746
Since the STM32 F7 is so similar to the STM32 F4, we have a wealth of working drivers to port from. Only a little effort is required. Below is a summary of the kinds of things that you would have to do to convert an STM32F429 driver to the STM32F746.
There is a good example in the STM32 Ethernet driver. The STM32 F7 Ethernet driver (arch/arm/src/stm32f7/stm32_ethernet.c) derives directly from the STM32 F4 Ethernet driver (arch/arm/src/stm32/stm32_eth.c). These two Ethernet MAC peripherals are nearly identical. Only changes that are a direct consequence of the STM32 F7 D-Cache were required to make the driver work on the STM32 F7. Those changes are summarized below.
Reorganize DMA Data Structure
The STM32 Ethernet driver has four different kinds of DMA buffers:
- RX DMA descriptors,
- TX DMA descriptors,
- RX packet buffers, and
- TX packet buffers.
In the STM32F429 driver, these are simply implemented as part of the driver data structure:
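A simplified sketch of such a layout is shown here. The type names, counts, and sizes are hypothetical, chosen only for illustration, and are not copied from the driver source.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_NRXDESC 8     /* Hypothetical number of RX descriptors */
#define SKETCH_NTXDESC 4     /* Hypothetical number of TX descriptors */
#define SKETCH_BUFSIZE 1518  /* Hypothetical packet buffer size in bytes */

/* Hypothetical 16-byte DMA descriptor: four 32-bit words */
struct sketch_dmadesc_s
{
  volatile uint32_t status;
  volatile uint32_t control;
  volatile uint32_t buffer1;
  volatile uint32_t buffer2;
};

/* Driver state with all of the DMA memory embedded in it */
struct sketch_ethmac_s
{
  uint8_t ifup;  /* Ordinary driver state shares the structure */

  /* Descriptor allocations */

  struct sketch_dmadesc_s rxtable[SKETCH_NRXDESC];
  struct sketch_dmadesc_s txtable[SKETCH_NTXDESC];

  /* Buffer allocations */

  uint8_t rxbuffer[SKETCH_NRXDESC * SKETCH_BUFSIZE];
  uint8_t txbuffer[SKETCH_NTXDESC * SKETCH_BUFSIZE];
};
```

In this sketch the descriptor tables start wherever the compiler places them within the structure, and neither the 16-byte descriptors nor the 1518-byte buffers are a multiple of a 32-byte cache line.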
There are potentially three problems with this: (1) We don't know what kind of memory the data structure will be defined in. What if it is DTCM memory? Then the DMAs will fail. (2) We don't know the alignment of the DMA buffers. They must be aligned on D-Cache line boundaries. (3a) The size of an RX or TX descriptor is either 16 or 32 bytes. In order to clean or invalidate descriptors individually, they must be sized in multiples of the cache line size, and (3b) the same applies to the DMA buffers.
To fix this, several things were done:
- The buffer allocations were moved from the device structure into separate declarations that can have attributes.
- One attribute that could be added would be a section name to assure that the structures are linked into DMA-able memory (via definitions in the linker script).
- Another attribute is that we can force the alignment of the structure to the D-Cache line size.
The following definitions were added to support aligning the sizes of the buffers to the Cortex-M7 D-Cache line size:
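Definitions of this kind might look like the sketch below. The macro names are hypothetical and the 32-byte line size is an assumption; the helpers are intended for sizes, not for addresses.

```c
#include <stddef.h>

#define SKETCH_DCACHE_LINESIZE 32u  /* Assumed D-Cache line size in bytes */
#define SKETCH_DMA_MASK        (SKETCH_DCACHE_LINESIZE - 1u)

/* Round a size up or down to a multiple of the cache line size */
#define SKETCH_DMA_ALIGN_UP(n) \
  (((size_t)(n) + SKETCH_DMA_MASK) & ~(size_t)SKETCH_DMA_MASK)
#define SKETCH_DMA_ALIGN_DOWN(n) \
  ((size_t)(n) & ~(size_t)SKETCH_DMA_MASK)

/* Hypothetical packet buffer size, padded to whole cache lines */
#define SKETCH_ETH_BUFSIZE     1518u
#define SKETCH_ALIGNED_BUFSIZE SKETCH_DMA_ALIGN_UP(SKETCH_ETH_BUFSIZE)
```

With these assumptions, a 16-byte descriptor rounds up to 32 bytes and a 1518-byte packet buffer rounds up to 1536 bytes.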
The RX and TX descriptor types are replaced with a union type that assures that the allocations will be aligned in size:
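One way to express such a union is sketched here: a padding array sized to a whole number of cache lines is overlaid on the descriptor. The names and the 16-byte descriptor layout are hypothetical, and a 32-byte line size is assumed.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_DCACHE_LINESIZE 32u  /* Assumed D-Cache line size in bytes */

/* Round a size up to a multiple of the cache line size */
#define SKETCH_DMA_ALIGN_UP(n) \
  (((size_t)(n) + SKETCH_DCACHE_LINESIZE - 1u) & \
   ~(size_t)(SKETCH_DCACHE_LINESIZE - 1u))

/* Hypothetical 16-byte DMA descriptor: four 32-bit words */
struct sketch_dmadesc_s
{
  volatile uint32_t status;
  volatile uint32_t control;
  volatile uint32_t buffer1;
  volatile uint32_t buffer2;
};

/* The pad member forces sizeof(union) up to whole cache lines, so each
 * element of an array of these unions occupies its own line(s), provided
 * that the array itself starts on a cache line boundary.
 */
union sketch_desc_u
{
  uint8_t pad[SKETCH_DMA_ALIGN_UP(sizeof(struct sketch_dmadesc_s))];
  struct sketch_dmadesc_s desc;
};
```

The union only fixes the size. The alignment of the first element still has to be forced where the array is defined.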
Then, finally, the new buffers are defined by the following globals:
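Global definitions along these lines might look like the following sketch. All names, counts, and sizes are hypothetical, a 32-byte line size is assumed, and the alignment attribute is the GCC/Clang form. A section attribute placing the data in DMA-able memory could be added to the same macro, but it is omitted here because it depends on the linker script.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_DCACHE_LINESIZE 32u  /* Assumed D-Cache line size in bytes */
#define SKETCH_DMA_ALIGN_UP(n) \
  (((size_t)(n) + SKETCH_DCACHE_LINESIZE - 1u) & \
   ~(size_t)(SKETCH_DCACHE_LINESIZE - 1u))

/* Force the start address onto a cache line boundary (GCC/Clang) */
#define SKETCH_DMA_ALIGNED __attribute__((aligned(SKETCH_DCACHE_LINESIZE)))

#define SKETCH_NRXDESC         8     /* Hypothetical descriptor counts */
#define SKETCH_NTXDESC         4
#define SKETCH_ALIGNED_BUFSIZE SKETCH_DMA_ALIGN_UP(1518)

/* Hypothetical 16-byte descriptor, padded to a whole cache line */
struct sketch_dmadesc_s
{
  volatile uint32_t word[4];
};

union sketch_desc_u
{
  uint8_t pad[SKETCH_DMA_ALIGN_UP(sizeof(struct sketch_dmadesc_s))];
  struct sketch_dmadesc_s desc;
};

/* Descriptor allocations */

union sketch_desc_u g_rxtable[SKETCH_NRXDESC] SKETCH_DMA_ALIGNED;
union sketch_desc_u g_txtable[SKETCH_NTXDESC] SKETCH_DMA_ALIGNED;

/* Buffer allocations */

uint8_t g_rxbuffer[SKETCH_NRXDESC * SKETCH_ALIGNED_BUFSIZE] SKETCH_DMA_ALIGNED;
uint8_t g_txbuffer[SKETCH_NTXDESC * SKETCH_ALIGNED_BUFSIZE] SKETCH_DMA_ALIGNED;
```

Because each array starts on a line boundary and every element is a whole number of lines, each descriptor and each packet buffer can be cleaned or invalidated without touching its neighbors.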
This does, of course, force additional changes to the functions that initialize the buffer chains, but I will leave that to the interested reader to discover.
Add Cache Operations
The Cortex-M7 cache operations are available when the following file is included:
Here is an example where the RX descriptors are invalidated:
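As a hedged illustration of the pattern, the sketch below invalidates one RX descriptor before testing its ownership bit. The descriptor layout, the bit name, and sketch_invalidate_dcache() are hypothetical; the stand-in only counts calls so that the sketch can run on a host, where the platform's real invalidate-by-address-range operation would be used instead.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_RXDESC_OWN (1u << 31)  /* Hypothetical "owned by DMA" bit */

/* Hypothetical RX descriptor, padded to one assumed 32-byte cache line */
struct sketch_rxdesc_s
{
  volatile uint32_t status;
  uint32_t other[7];
};

/* Hypothetical stand-in: a real implementation discards the cached lines
 * covering [start, end). Here it only counts how often it is called.
 */
unsigned int g_ninvalidate;

void sketch_invalidate_dcache(uintptr_t start, uintptr_t end)
{
  (void)start;
  (void)end;
  g_ninvalidate++;
}

/* Returns true if the DMA has handed the descriptor back to the CPU */
bool sketch_rxdesc_done(struct sketch_rxdesc_s *desc)
{
  /* Discard any cached copy so that the status word is read from RAM */
  sketch_invalidate_dcache((uintptr_t)desc, (uintptr_t)desc + sizeof(*desc));
  return (desc->status & SKETCH_RXDESC_OWN) == 0;
}
```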
Here is an example where a TX descriptor is cleaned:
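The TX descriptor case might be sketched like this: the CPU fills in the descriptor, hands ownership to the DMA, and then cleans the descriptor so that the hardware sees the update. The descriptor layout, the bit name, and sketch_clean_dcache() are hypothetical; the stand-in only records the range it was given.

```c
#include <stdint.h>

#define SKETCH_TXDESC_OWN (1u << 31)  /* Hypothetical "owned by DMA" bit */

/* Hypothetical TX descriptor, padded to one assumed 32-byte cache line */
struct sketch_txdesc_s
{
  volatile uint32_t status;
  volatile uint32_t length;
  volatile uint32_t buffer;
  uint32_t pad[5];
};

/* Hypothetical stand-in: a real implementation writes the dirty lines
 * covering [start, end) back to RAM. Here it only records the range.
 */
uintptr_t g_clean_start;
uintptr_t g_clean_end;

void sketch_clean_dcache(uintptr_t start, uintptr_t end)
{
  g_clean_start = start;
  g_clean_end   = end;
}

/* Hand a filled-in descriptor over to the DMA hardware */
void sketch_txdesc_give(struct sketch_txdesc_s *desc, uint32_t buffer,
                        uint32_t length)
{
  desc->buffer  = buffer;
  desc->length  = length;
  desc->status |= SKETCH_TXDESC_OWN;

  /* Push the CPU's writes out of the D-Cache into physical RAM */
  sketch_clean_dcache((uintptr_t)desc, (uintptr_t)desc + sizeof(*desc));
}
```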
Here is where the read buffer is invalidated just after completing a read DMA:
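A minimal sketch of that step, under the assumption of a 32-byte cache line and a receive buffer that is line-aligned and padded to whole lines, is shown below. The names are hypothetical, and sketch_invalidate_dcache() is a stand-in that only records the range it was given.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_DCACHE_LINESIZE 32u    /* Assumed D-Cache line size */
#define SKETCH_RXBUFSIZE       1536u  /* Hypothetical size: whole lines */

uint8_t g_rxbuffer[SKETCH_RXBUFSIZE]
  __attribute__((aligned(SKETCH_DCACHE_LINESIZE)));

/* Hypothetical stand-in for the platform's invalidate-by-range operation */
uintptr_t g_inv_start;
uintptr_t g_inv_end;

void sketch_invalidate_dcache(uintptr_t start, uintptr_t end)
{
  g_inv_start = start;
  g_inv_end   = end;
}

/* Called after the DMA reports a received frame of len bytes. Returns a
 * pointer to frame data that is now safe for the CPU to read.
 */
const uint8_t *sketch_rx_frame(size_t len)
{
  /* Round the length up to whole lines. The buffer is aligned and padded,
   * so the rounded range never covers anyone else's data.
   */
  size_t span = (len + SKETCH_DCACHE_LINESIZE - 1u) &
                ~(size_t)(SKETCH_DCACHE_LINESIZE - 1u);

  sketch_invalidate_dcache((uintptr_t)g_rxbuffer,
                           (uintptr_t)g_rxbuffer + span);
  return g_rxbuffer;
}
```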
Here is where the write buffer is cleaned prior to starting a write DMA:
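The corresponding transmit step could be sketched as follows. The names are hypothetical: sketch_clean_dcache() and sketch_start_txdma() are stand-ins that only record whether the clean happened before the transfer was started, and a 32-byte cache line is assumed.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_DCACHE_LINESIZE 32u    /* Assumed D-Cache line size */
#define SKETCH_TXBUFSIZE       1536u  /* Hypothetical size: whole lines */

uint8_t g_txbuffer[SKETCH_TXBUFSIZE]
  __attribute__((aligned(SKETCH_DCACHE_LINESIZE)));

bool g_cleaned;        /* Set once the buffer has been cleaned */
bool g_dma_saw_clean;  /* True if the clean preceded the DMA start */

/* Hypothetical stand-in for the platform's clean-by-range operation */
void sketch_clean_dcache(uintptr_t start, uintptr_t end)
{
  (void)start;
  (void)end;
  g_cleaned = true;
}

/* Hypothetical stand-in for handing the buffer to the DMA hardware */
void sketch_start_txdma(const uint8_t *buffer, size_t len)
{
  (void)buffer;
  (void)len;
  g_dma_saw_clean = g_cleaned;
}

/* Copy a packet into the TX buffer and send it. Returns false if the
 * packet does not fit.
 */
bool sketch_send(const uint8_t *data, size_t len)
{
  if (len > sizeof(g_txbuffer))
    {
      return false;
    }

  memcpy(g_txbuffer, data, len);  /* These writes may stay in the D-Cache */

  /* Force the packet out to physical RAM before the DMA reads it. Whole
   * lines are cleaned; the buffer is aligned and padded to allow that.
   */
  sketch_clean_dcache((uintptr_t)g_txbuffer, (uintptr_t)g_txbuffer + len);
  sketch_start_txdma(g_txbuffer, len);
  return true;
}
```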