Introduction
Before digging into the embedded software, we will discuss the hardware in a generic embedded system. I will introduce a number of keywords and concepts. We will dig deeper into most of these along the way, but in order to understand some concepts, we need an idea about the others. Thus, it all becomes a bit iterative, and we use things that we later get more insight into.
Back in the eighties I was part of a project developing medical equipment for measuring Evoked Potentials. These are small electric signals in muscles and nerves - provoked by injecting currents through needles - and yes - it was a bit painful. The Evomatic 8000 used the Intel 8086 - immortalized by the first IBM PC - for general control. However, we needed something much faster for Digital Signal Processing. For this we used Bit-Slice - based on the 29xx-series from AMD. The core of this was a 4-bit wide ALU - Arithmetic Logical Unit - that could input two 4-bit operands, and add/subtract, and/or/xor, and rotate/shift these to produce a 4-bit result. Several ALUs could be chained together to support the arithmetic operations as well as shifting or rotating the full chain. Using Carry-Look-Ahead and AND'ing of Zero flags etc., it was possible to put the 4-bit ALUs together like LEGO bricks. Hence, the "Bit-Slice" name. We accumulated data in 20 bits and thus needed five 4-bit ALUs. There was also a Sequencer which would decide the next instruction address from the current address, the chosen operation - e.g., Branch to address XX if Less Than - and the flags from the ALUs. We created our own instruction set where we dedicated e.g., two times four bits to select two register operands among 16 registers. Another field in the instruction-set spelled out the operation for the ALU.
More fields were dedicated to addresses in RAM-memory and so on. Basically, we allocated fixed parts of the instruction-word to specific purposes - like e.g., selecting registers. We had a very long instruction word from all these fields, and we created our own assembly language for this.
Very importantly, we had a hardware multiplier that could multiply two 8-bit operands and produce a 16-bit result, which was accumulated in the 20-bit wide accumulator - giving headroom for adding a lot of data before dividing with e.g., the number of samples to obtain an average. This Multiply-Accumulate is a very common DSP-pattern that we will recognize again and again.
We did not have the advanced pipeline concepts that modern CPUs have, but I clearly remember how important it was to keep the multiplier busy all the time when I wrote a Radix-4 FFT, while in parallel moving data in and out of the registers. Using the fields of the wide instruction word felt very much like opening doors to let the bits run back and forth like mice in an organized maze. The DSP-parts of the system filled two Double Eurocards (6U rack-size - approx. 233 mm * 160 mm).
I learned a lot about how a CPU is created, as we basically built our own huge CPU - or rather DSP. Modern CPUs and MCUs are built around similar concepts, but are much faster, use much less energy, and particularly much less space. On top of this, the MCUs come stacked with peripherals. Nevertheless, many of the underlying principles are the same.
A Generic Embedded System
Figure 3.1 shows a generic programmer's block-diagram of an MCU - Microcontroller. You will recognize several blocks from the Introduction. Please note that only selected data- and control-paths are shown. In the figure we have the following components:
The clock-system is extremely important - and often very complex. With everything built from basic flip-flops, we need clocks to drive signals through the gates. In this sense, clocks are connected everywhere. The MCU will typically have several clocks - often derived from the same "Clock-Base". This base can be accurately set with an external crystal. If the crystal is faulty or not present, the clock-base is normally derived from an RC-circuit. You may programmatically decide which clock goes to which peripheral - and even if at times you want a given clock to pause. In modern electronics the power consumption is almost a linear function of the clock. This means that if you can programmatically conclude that you do not need the service of a given peripheral or memory, and you can pause its clock, then there is power to save. This again will lead to a longer "run-time" when running on batteries.

Figure 3.1: Generic Microcontroller.
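On an STM32 like the one used later in this book, such clock-gating boils down to setting or clearing an enable bit in an RCC register. Below is a minimal sketch, assuming the CMSIS device header and the STM32F3 register names - adjust to your actual part:

    #include "stm32f3xx.h"   /* CMSIS device header - an assumption */

    /* Clock-gating GPIO port A: enable the clock before use... */
    void gpioa_clock_on(void)
    {
        RCC->AHBENR |= RCC_AHBENR_GPIOAEN;
    }

    /* ...and pause it again when the port is idle, to save power */
    void gpioa_clock_off(void)
    {
        RCC->AHBENR &= ~RCC_AHBENR_GPIOAEN;
    }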
Modern embedded systems support several Clock Domains. Within a given clock domain there can be several clocks, but they are all based on a common clock. In other words, you can get from one to the other by a simple multiplication and/or division. Different clock domains typically originate from the need to communicate with other devices over interfaces like USB, SPI, CAN, S/PDIF etc. Here data may arrive at a clock-rate driven by the other side. Clocks in different domains do not relate simply to each other, and they will eventually drift away from each other over time - simply because they have no common origin. You will need advanced algorithms when data with a constant sample-rate is transferred between clock-domains. Figure 3.2 shows a connection from the clock to the interrupt-controller. This is relevant as you typically can set up different timers - and some may generate interrupts. Such a clock-interrupt may be used by an RTOS - Real-Time Operating System - kernel.
The Arithmetic Logic Unit is the heart of the MCU. This is where normal integer calculations like addition and subtraction are done, but also logic operations - OR, AND, XOR etc. Integer multiplications and divisions are typically also performed here. The ALU creates the flags used for branching. This is discussed in Chapter 7. E.g., the Zero-flag is set when the result of an operation is 0.
This contains the "sequencer" from the introduction, that handles branching in loops and function-calls. It is also a lot more. Modern cores have a lot of Microcode. This is proprietary, secret, low-level code. It handles internal protocols and executes the many built-in state-machines that decide what can and what cannot be done - and in which sequence.
CPUs and MCUs always have several registers. In the old 8086 used in the original PC, each register had a specific relation to specific assembly instructions. As an example, the CX register had to be used as loop-counter, to facilitate a Decrement-And-Branch-Until-Zero instruction. Today, instructions are typically more orthogonal. This means that many instructions will work with any register. Still, some registers have special meaning - like PC - Program Counter - and SP - Stack Pointer. As we will see when we look at assembly code, the Arm architectures - like many others - are of the type Load-And-Store. This means that all ALU operations - like add - work on registers. To work on data from memory or peripherals, such data must first be loaded into a register. The result of an ALU operation also ends in a register. A new instruction is then needed to store the register into memory. This means that a simple increment of a value in memory requires a read instruction, then an add instruction, and finally a write instruction. On the general level we call this a Read-Modify-Write operation. This is important in multithreaded programs, as we will discuss later. When doing function calls with a limited number of operands, most CPUs and MCUs will use registers for these, as this is faster than using the stack. All in all, this means that registers are incredibly central to Arm and many other CPUs and MCUs. The Cortex-M Registers are discussed in Chapter 7.
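To make the load-and-store point concrete, here is a one-line C increment and - in the comment - the kind of Thumb sequence a compiler typically emits for it. The exact registers and instructions will vary; this is illustrative:

    #include <stdint.h>

    volatile uint32_t counter;    /* a value living in RAM */

    void bump(void)
    {
        /* Typical (illustrative) Thumb sequence for the line below:
         *    ldr  r1, [r0]      ; Read   - load the value into a register
         *    adds r1, r1, #1    ; Modify - the ALU only works on registers
         *    str  r1, [r0]      ; Write  - store the register back to RAM
         */
        counter++;
    }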
I just stated that in a load-and-store architecture all data needs to go through registers. There is however one exception - DMA - Direct Memory Access. In modern systems many data-flows may routinely run all the time. One example starts with an A/D-converter - aka ADC. This device will deliver samples at a constant rate - e.g., 48 kHz. In very primitive systems each sample generates an interrupt, that triggers an exception-handler which reads the sample into a buffer in RAM. We don't want our advanced microcontroller interrupted for such mundane tasks all the time. Instead, we set up a DMA-transfer to silently move every sample into a RAM-buffer - until the RAM buffer is e.g., half full (if we use an ADC). This will finally generate an interrupt, and we can use this to move the buffered data further along in our processing chain - while the other half of the buffer is filled by a new DMA-operation. There may also be interrupts from D/A-converters, as well as from UARTs, CAN-buses etc. In these scenarios, the peripheral end of the DMA will typically work on a single address, while the memory address is incremented by a fixed constant between each sample. DMA can also be from one memory location to another memory location (not shown in the figure). In this case the two ends of the DMA will use address-increments that are not necessarily the same. Note that even though DMA might save CPU-time, it may still occasionally block relevant buses. This will temporarily pause the CPU.
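With the STM32 HAL used later in the book, such a half/full double-buffering scheme can be sketched as below - assuming a DMA channel configured in circular mode, an ADC handle set up elsewhere (e.g., by CubeMX), and a hypothetical process_samples() helper:

    #include "stm32f3xx_hal.h"           /* assumes the STM32Cube HAL */

    #define BUF_LEN 256u
    static uint16_t adc_buf[BUF_LEN];    /* filled by DMA in circular mode */
    extern ADC_HandleTypeDef hadc1;      /* assumed configured elsewhere */

    void process_samples(uint16_t *buf, uint32_t n);   /* hypothetical helper */

    void start_sampling(void)
    {
        /* DMA silently moves every sample; the CPU only sees two
           interrupts per buffer - at half-full and at full. */
        HAL_ADC_Start_DMA(&hadc1, (uint32_t *)adc_buf, BUF_LEN);
    }

    /* First half is ready - DMA keeps filling the second half meanwhile */
    void HAL_ADC_ConvHalfCpltCallback(ADC_HandleTypeDef *hadc)
    {
        (void)hadc;
        process_samples(&adc_buf[0], BUF_LEN / 2);
    }

    /* Second half is ready - DMA wraps around to the first half */
    void HAL_ADC_ConvCpltCallback(ADC_HandleTypeDef *hadc)
    {
        (void)hadc;
        process_samples(&adc_buf[BUF_LEN / 2], BUF_LEN / 2);
    }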
An MCU can run without external memory and therefore needs internal memory that stays programmed when powered off. This is called Non-Volatile Memory - often abbreviated to NVM. This memory is usually Flash. It could, however, also be ROM - Read Only Memory. ROM is memory that is programmed - Masked - at the silicon factory and cannot be deleted. As Figure 3.2 shows, the flash may also contain constant data like e.g., tables.
Again - an MCU is defined by having internal memory - also for data. RAM is volatile - it needs power to retain its content. Some MCUs allow us to copy performance-critical code to RAM, to speed it up. It is, however, important that this does not introduce a RAM-bottleneck when both program and data are fetched from the same place.
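With gcc, placing a single function in RAM is usually done with a section attribute. The sketch below assumes the ".RamFunc" input section that we will meet in the linker script in Listing 4.5, and that the startup code copies that section from Flash to RAM:

    #include <stdint.h>

    /* Placed in the ".RamFunc" input section, which the linker script
       locates in RAM (loaded from Flash and copied over by the startup). */
    __attribute__((section(".RamFunc")))
    uint32_t fast_checksum(const uint8_t *p, uint32_t n)
    {
        uint32_t sum = 0;
        while (n--) {
            sum += *p++;   /* executes from RAM once startup has copied it */
        }
        return sum;
    }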
Even the simplest MCU will have circuitry allowing external events to interrupt the current program-flow and take care of urgent business. Some devices will allow Nested Interrupts while others may not. We have nested interrupts when the normal flow of execution has been interrupted by an external event, and then another event occurs, which has higher priority, so we interrupt the interrupt. This may continue for several levels. As a software-developer you need to take this into account. Without nesting, you may experience a high Interrupt-Latency for even the most important interrupt. Typically, we define interrupt-latency as the time it takes from when the external event happens, until the MCU is chewing on the first instruction in the ISR - Interrupt Service Routine. You also need to be aware that each nested interrupt will cause the stack to grow further. Re-entrant interrupts would make the above even worse and are normally not enabled.
This can be anything. Typical peripherals are Ethernet and USB, UART - Universal Asynchronous Receive/Transmit - for serial ports, and GPIO for individual I/O bits (possibly grouped in bytes). Many MCUs also contain one or more ADCs - Analog/Digital Converters - and DACs - Digital/Analog Converters. Another popular component is PWM - Pulse Width Modulation - which is a simple way to use a digital output to provide a semi-analog voltage. By changing the duty-cycle of a square-wave, you can generate an average voltage between the rails of the digital output - e.g., a 25% duty-cycle on a 3.3 V output averages 0.825 V. PWM is popular for motor-control on small motors. Typically, we think about an MCU with added peripherals, but there are also cases where the peripheral is the key component. An example is Nordic Semiconductor, which markets BLE - Bluetooth Low Energy - ICs. These contain an MCU - typically a Cortex-M0 or Cortex-M4. You may use such a device as a BLE Gateway in a system where the built-in MCU may - or may not - be the only MCU/CPU in the system.
CPU and MCU
CPU stands for Central Processing Unit. A Micro-Controller Unit - MCU or µC - always contains a CPU-core. An MCU, however, additionally contains enough on-board memory to run meaningful programs without external memory. You also expect an MCU to contain some relevant peripherals. All the above means that MCUs are typically used for very specific purposes in a confined space. This is basically the definition of an embedded system.
The reason we still have CPUs is that they typically are more flexible. They often support modern operating systems by having an MMU - Memory-Management-Unit - that allows the CPU to run a full-blown operating system with virtual memory, paging etc. - like Linux or Windows. The cost is that we need to supply the CPU with external memory. A CPU might, however, have some on-board peripherals and maybe a bit of memory as well, so the border between the definitions becomes blurred. For me, the MMU is the main difference between a modern CPU and a modern MCU. Onwards I will use the term "Core" for the inner part of a CPU or MCU - basically the stuff to the right of the memories in Figure 3.2.
Table 3.1 shows the typical differences between a CPU and an MCU. As stated, these things overlap. A good example is that most CPUs will boot up and load the code from an SSD1 into DRAM where it will execute, while MCUs typically execute directly from Flash. However, you will find examples of the opposite. For now, we will assume that our embedded system executes directly from Flash - like a "true" MCU. We will discuss memory in more detail in Section 3.5.

Table 3.1: CPU versus MCU

                        CPU        MCU
Operating System        Full OS    RTOS or none
On-Board Peripherals    Some       Tons
Code Executes from      DRAM       Flash
Space usage             Bigger     Smaller

1 SSD is Solid-State Disk. It is Flash-based and has no moving parts.

Harvard and Von Neumann
In the previous section we touched upon the subject of buses - the address- and data-lines that connect the memory to the CPU. This is relevant for classic CPUs as described above, as well as for the Core inside the MCU. Essentially, the Core will apply an address to the address bus. As soon as the zeros and ones on the address-wires are stable enough, the Core will either write or read data on the data bus. There is help from signals that control Read/Write as well as Chip Select. Chip Select uses the higher bits of the address to enable only the right memory chip(s), while the rest of the address lines go to the selected chip(s). This concept allows us to use memory chips that only cover a fraction of the total address-space. Many buses have more signals, but they are not relevant here.
A classic CPU like the 8086 only had one set of data-lines (16-bit) and address-lines (20-bit). In fact, the cheaper - but software compatible - 8088 would multiplex the lower-byte address and data-lines, so that the same wires first would be used for addressing, and then for data. This was slower, but cheaper.

However, both 8086 and 8088 would need a lot of clock cycles to first fetch an instruction, and then later one or even two operands from the memory - all in sequence. The concept of sharing memory and buses for instructions as well as for data is called Von Neumann.

Figure 3.2: Von Neumann and Harvard.
The Von Neumann concept is especially annoying when doing Digital Signal Processing, where you often need to keep the central Multiply-Accumulate as busy as possible - typically with each multiplication taking in a pre-defined constant and a data-value. This led to the Harvard architecture, where you have one set of address and data buses for instructions, and another set for data. See Figure 3.2. This allows for parallel access to data and code.

DSPs favor the Harvard concept. If the DSP is running the code of a filter, it will typically have the filter-constants stored together with the instructions - sometimes even as part of the instructions. The first DSP I experienced - Texas Instruments TMS320C10 - was Harvard based and had a 16-bit instruction word. A Multiply-Accumulate instruction dedicated 13 bits of the 16 to a filter-constant. Thus, a FIR-filter of order 21 would contain 21 consecutive Multiply-Accumulate instructions in the assembly source, with each 16-bit instruction containing a 13-bit constant. Data was fetched in parallel with the help of an auto-incrementing data-pointer. I used this DSP in my university graduation project in 1983. Later I worked with the famous Intel 8051 microcontroller, which also uses the Harvard architecture.
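In C, the Multiply-Accumulate pattern of such a FIR-filter is just a loop around one MAC per tap. A minimal sketch, with illustrative types and sizes:

    #include <stdint.h>

    #define ORDER 21                      /* a FIR-filter of order 21, as above */

    int32_t fir(const int16_t coeff[ORDER], const int16_t sample[ORDER])
    {
        int32_t acc = 0;                  /* a wide accumulator gives headroom */
        for (int i = 0; i < ORDER; i++) {
            acc += (int32_t)coeff[i] * sample[i];   /* one Multiply-Accumulate per tap */
        }
        return acc;
    }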
Apart from the added board-space and cost, Harvard has a serious drawback: If data and instructions are completely separate, then how do you update the firmware? When a new program is loaded it is seen as data by the old program - so how do we get it into the instruction storage? A solution often requires additional external hardware.

In general, most designers prefer the flexibility of Von Neumann and the speed of Harvard. The most common solution in microcontrollers today is therefore to have one unified address space - with at least two (sets of) buses. This is often called "Modified Harvard", but it might be more correct to call it Von Neumann on the architectural level and Harvard at the implementation level.
The early Intel CPUs had special "In" and "Out" instructions for manipulating external hardware - peripherals. This is basically abandoned, and peripherals are now mapped into the addressing space - still often called the memory space. This means that when you map the 4 GB of address space for a modern 32-bit MCU, you will find blocks for respectively RAM, Flash and peripherals. We will look at this in Section 4.1, where we will see that the overall unified memory map in the various Cortex-Ms is more or less a standard.
Physical Memory Types

Table 3.2 shows some of many types of memory. The UV-erasable EPROM is kind of outdated. I specifically remember a product where we had a PROM-board. It had 24 EPROMs. When we wanted to test new firmware, we had to take all 24 EPROMs out of their sockets. Then we placed them in a UV-eraser, where they cooked for like 20 minutes. Now we burned them in the PROM-burner - which had space for 8 EPROMs as far as I remember. Finally, we could put them back in the circuit.

At last came the event where we were to show the product - Counterpoint - at a medical convention in Italy. I flew two days later than the other guys, bringing a brand-new set of 24 EPROMs with me. We mounted them all, switched on the power - and nothing happened. Huge sigh from the assembled colleagues. "Time to try the spare" I said, and whipped out 24 more EPROMs from my bag. They worked. I was a hero that day - although I was probably the one who had messed up the first set. Incidentally, the paperwork for the hardware was not in place, and on the way home I got picked in customs. I began to accept the fact that I would miss my flight. But then - just next to me - Diego Maradona got into a fistfight with a press-photographer. I silently gathered my stuff and walked on.
Back to Table 3.2. Of all the memory-types in the table, only the RAM variants are volatile. Static RAM is simple to design into a system, as it does not need frequent refresh cycles like Dynamic RAM - DRAM. The upside of DRAM is the much higher density - and therefore MB per Dollar as well as per square-inch. SDRAM is essentially DRAM running at a speed that is synchronized to the CPU frequency. You can get Double Data Rate RAM that can be accessed on both edges of the clock - DDR RAM. These come in faster and faster generations - like DDR5.
When you buy a PC, you can basically buy the DRAM-type that you can afford, but you never see PCs advertised with SRAM. PCs need the high density for the many Gigabytes of RAM, and the DRAM-controller - normally built into the CPU - handles the refresh.
So why do we then see SRAM in MCUs? The thing is, that if you leave SRAM with the voltage applied, but don't use it, it draws almost no current. Since P = U * I, the power consumption of an idle SRAM is close to zero. If the MCU sleeps, the RAM power also goes down - without any fancy control.
Many CPUs and MCUs come with some ROM from the factory. This is typically some kind of bootloader - e.g., BOOTP - allowing you to boot from an Ethernet connection. It can also be code related to security, as well as guaranteed unique MAC-addresses and serial numbers. OTP - One-Time Programmable - is also used for serial numbers etc. - but this is not the chip-vendor's serial number but that of the board.

Even though Flash can act much as EEPROM, you still see EEPROM being used. It can (still) survive more rewrites than Flash. EEPROM is also easy to write to and read from.

Flash is a complex subject. We have two kinds:

NOR Flash: This is relatively fast to read, but slow to erase and write, and you rarely see read-errors. This makes it ideal for reading with direct addressing - and thus for in-place execution of code - directly from boot. The downside is the low density - it costs more per MB. The upside is a fast boot, a less complex system and less need for RAM. If you think the above sounds like a typical microcontroller, you are right.

NAND Flash: This is relatively slow when reading, and read errors do occur regularly. The good thing is that it has a high density and thus can deliver many MB of program-space. If you use it as an SSD disk, the read-errors can be handled by marking bad sectors. If the program is copied to RAM during boot - aka Code-Shadowing - you have a flexible system. The complex boot - reading a lot of Flash - slows down these types of systems. Another problem is that you need a lot of DRAM - which is not cheap and takes up board-space. The above description fits with our expectations for a large CPU-system running e.g., Linux.

Table 3.2: Physical memory types

SRAM     Static RAM                     RAM that only needs power
DRAM     Dynamic RAM                    RAM that needs refreshing
SDRAM    Synchronous DRAM               DRAM in synch with CPU clock
DDRn     Double Data Rate               SDRAM accessed on both clock edges
ROM      Read Only Memory               "Masked" at IC-factory
PROM     Programmable ROM               Programmed before mounting
EPROM    Erasable Programmable ROM      Erase with UV-light in R&D
EEPROM   Electrically Erasable PROM     1 Million+ erasures
Flash    Modern large erasable memory   10-100 k erasures
As in many other cases, we see new products that tend to lessen the gap between the technologies. For this reason, you should probably focus on the basic features like read/write/erase time and density for the given Flash, and less on the technology used. Still, the above can explain why a given system executes in-place or is using code-shadowing.
The Cortex Families
After having discussed several generic concepts, we will look at the Arm Cortex profiles.
There are three main Cortex profiles - together spelling A-R-M:
The A stands for Application. These are more classic CPUs than the other Cortex'es. Cortex-A sports an MMU - Memory Management Unit. This opens up for fully separated processes with paged virtual memory. In a project we used an NXP i.MX7 with two Cortex-A7 cores. We took advantage of Linux' SMP - Symmetric Multiprocessing. We did plan to decide which processes would go onto which core - but we ended up leaving it to Linux. It was quite a feat to convince the organization that we indeed could use Linux - a non-real-time OS - for a job which required very accurate time-stamping of samples. We ended up building some faith on the fact that the device also contained a Cortex-M4, which could run a real-time OS if need be - a bit like having money in the bank. In the end we never used the Cortex-M4. The combination of HW-buffering in the device, together with DMA and a fast Linux, meant that the non-real-time Linux had no problems.
The R stands for Real-Time. These devices offer high performance and the lowest latency. However, it's not only real-time but also the safety features that define Cortex-R. A good example is the Hercules family from Texas Instruments. The high-end Hercules devices come with Cortex-R dual cores that can run in Lockstep2 - meaning that they run the same program, and an exception is raised if the cores disagree. An important application area for Hercules and the likes is automotive and other transport where people's lives may depend on correct execution. Arm-R also requires that the built-in MPU - Memory Protection Unit - is used. Where FreeRTOS would be a good RTOS choice on Cortex-M, you would typically choose its sibling SafeRTOS on a Cortex-R.
The M stands for Microcontroller, and this family is the one which we will talk most about in this book. One thing that separates Cortex-M from Cortex-A is that Cortex-M never has an MMU. Of the three profiles, Cortex-M is the best when it comes to low-power and keeping costs down. The NVIC - Nested Vector Interrupt Controller - is common to all Cortex-M MCUs. Figure 3.3 shows Cortex-M as built into our sample - the STM32F334R83. Cortex-M devices cannot run Linux (at least not without cutting some serious corners). They can, however, run an RTOS like Arm Mbed or FreeRTOS. There are also many designs running "bare metal" on Cortex-M devices.

Figure 3.3: Cortex-M4 in STM32F334R8. Courtesy of ST.

2 Some high-end Cortex-Ms are now also available with dual-core lockstep.
3 Except that STM32F334R8 has no MPU.
The Cortex profiles are created to be extended by silicon vendors - offering many ways for these vendors to stand out - but still assuring that the programming experience does not change too much when you move from one device to another. The first many Cortex devices were all 32-bit cores, and a major "common ground" for these is the 4 GB Memory Map which we will see in Section 4.1. We will dig further into the Cortex-M architectures in Chapter 7.
STM32F334R8 Architecture
Buses
Figure 3.4 shows several buses. We will focus on the buses out of the Cortex-M4 core, as they relate to the instructions. Instead of two Harvard buses to two separate spaces, we see three buses to the shared space:
The Ibus is used to fetch instructions (code). This bus can read from the range 0x0000 0000 - 0x2000 0000. Figure 4.1 shows the memory map of the STM32F334R8. We see that the Ibus address range fits with the area noted as "CODE". Using a pin on the chip - "Boot0" - and a configuration bit - "nBoot1" - the designer can choose to "alias" either onboard Flash, SRAM or "System Memory" into this space. "Alias" means that it will be accessible at both the original address range and starting from 0.

- When Boot0 is 0, we will see the 64k of Flash from address 0x0800 0000 aliased to address 0 and onwards. This is the normal scenario.
- When Boot0 is 1 and nBoot1 is 0, we will see SRAM from address 0x2000 0000 aliased to address 0 and onwards.
- When both bits are 1, we will see "System Memory" from address 0x1FFF D800 aliased to address 0 and onwards.
Figure 3.4: STM32F334 Architecture. Courtesy of ST.
System Memory is a Flash area that is programmed with a bootloader at the factory. This bootloader does not take up much space, but is capable of loading firmware through many interfaces, like SPI, USART, CAN and so on. Please note that although this on-chip code can assist in downloading new firmware - like a debugger can - it is not our debugger. As a designer you may choose to embrace the bootloader and allow it to be used in-field by end-users or service-technicians - typically after a special button-combination. You will however normally try to disable the use of the debugger in the field, as it can easily be used to extract your IP - the code. The clever reader will realize that a bad guy might upload some code through the bootloader that basically "prints out" the existing binary code - or alters the functionality. For this reason you might want to disable the use of the standard bootloader in the field - or reconfigure it so that it only works on signed code.
If you want to play around with the Boot-bits, you can check UM1724, which is the User Manual for the Nucleo-64 boards. Here you will find that Boot0 is normally pin 7 on the Morpho connector (leftmost in top view). There is a dot for pin 1, and you have uneven numbers along the edge of the board. On pin 5 - just above pin 7 - you have Vdd (supply voltage). Connect pin 5 and pin 7 with a jumper, and Boot0 changes from 0 to 1. There is an unused jumper at CN12 (top left corner). Since nBoot1 is a programmable option byte, the easiest way to change that is to download the STM32CubeProgrammer - if you haven't done that already. This is a very neat and simple tool that is great for inspecting SFRs - Special Function Registers - and updating firmware, but it also has access to the option bytes where nBoot1 resides. By default, nBoot1 is checked.
The Dbus is basically a way to handle the limitations of the very compressed instruction-set in Cortex-M4. Many assembly instructions need to load a 32-bit constant - aka a literal - into a register. To avoid having very wide fields in the instruction-set5, the Arm designers have decided that instead of allowing for 32-bit constants in the instruction-set, such an instruction will instead use an 8-bit Program-Counter-relative address to refer to the constant - still in code-space. These constants are bundled in so-called literal pools. We will see this concept in action in Chapter 5. You can say that the Dbus is an Ibus assistant. In some documentation it is called the DCode-bus, which is probably a better name. The Dbus is also used for debug access.

5 Like the Bit-Slice fields described in Section 3.1.
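As a small preview of Chapter 5, here is what a literal load typically looks like - a C function returning a 32-bit constant, with an illustrative (not exact) disassembly in the comment:

    #include <stdint.h>

    uint32_t get_magic(void)
    {
        /* A compiler will typically emit a PC-relative load from a literal
         * pool placed right after the function (illustrative disassembly):
         *
         *     ldr  r0, [pc, #0]   ; fetch the constant via the Dbus
         *     bx   lr
         *     .word 0x12345678    ; the literal - still in code-space
         */
        return 0x12345678u;
    }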
The System-Bus can access peripherals and the SRAM area. Bigger Cortex-Ms like the M7 also use the System-bus for external memory - code as well as data.
To sum it up, we can say that in the Harvard concept, the I-bus together with the D-bus is our Code-bus, whereas the System-bus is our Data-bus. This is however only completely correct as long as we retrieve code from the Code region (we will get back to regions shortly).
The three buses go directly into a Bus Matrix. This is also called an Interconnect Matrix. Through this matrix, our Cortex-M4 buses are connected to numerous buses with various performance and widths. The names of these buses are shown on the right side of Figure 4.1, from 0x4000 0000 and onwards.
Note that the three main buses are in no form connected to the outside world. This means that this particular MCU cannot connect to any external memory - RAM or Flash. We have a "true" MCU.
Also note that the software developer does not need to think about using one bus for something, and another bus for something else. As long as the HW-designers respect the Arm guidelines for memory layout - and the system is correctly initialized - the software developer can write standard C-code. Typically, the code will be fetched faster from the Code region, but it can also be fetched from RAM.
Analog Interfaces
Our STM32F334R8 has surprisingly many analog interfaces for a digital component - making it a Mixed-Signal MCU:
Very typical for MCUs, the STM32F334R8 has two 12-bit Analog-To-Digital converters - ADCs. Each of these can multiplex between 14+ channels - of which some are internal - like junction temperature. It is possible to run dual synchronized conversions - stereo.
There is a built-in Operational Amplifier with multiplexed inputs. It can be used as a programmable gain-amplifier for some ADC inputs.
Note the gray "Temperature Sensor" area behind the ADCs as well as the opamp. This is the area where the built-in temperature sensor is mounted. During production a calibration-curve is measured and stored in the chip. Application programs may read the temperature and look up these tables to get more exact conversions.
We have two 12-bit Digital-To-Analog Converters - DACs. These can also be run in stereo.
These can sense when an analog voltage passes a preprogrammed threshold and generate an interrupt. With the help of a Bandgap reference voltage - shared with the ADCs - the comparators are very accurate.
Both ADCs and DACs can be interfaced via DMA - Direct Memory Access. This means that the CPU will not need to read/write each sample, which saves a lot of CPU-time. It also means that we have looser real-time requirements, as we don't need to service an interrupt for every sample. Instead, we have hardware buffers for the A/D and D/A that are serviced by DMA - giving us only a fraction of the interrupts we otherwise would have had.
Digital Interfaces
We have a nice arsenal of digital interfaces:
The Serial Peripheral Interface is a classic digital interface for relatively fast on-board communication with external peripherals. Typically, the Core is master and there can be several slaves, although it is very common to use one SPI-bus per peer-to-peer communication. Almost any modern MCU will have SPI-support. In this case it supports many modes, speeds and data-widths as well as DMA. An SPI-bus could e.g., be used for a Wi-Fi module.
The Inter-Integrated Circuit bus is, like SPI, used for on-board communication. It is not as fast as SPI, but it is more common to have multiple slaves. Typical peripherals are: Serial EEPROM (for configuration data), keyboard, small text LCD-display etc.
Controller Area Network is known from cars and factory floors. CAN is a well-known standard for cheap communication with sensors and actuators. These are mounted as multi-drop on a simple twisted pair of wires. The CAN-protocol has a built-in priority scheme and is relatively robust. This implementation offers small receive FIFOs (hardware queues) with interrupts, and the ability to set filters for message types.
The General-Purpose Input/Output is a swarm (the STM32F334R8 has 51) of 1-bit ports that can be configured for input or output, with or without pull-up/down. About half of these can handle 5V, which is more than the Core-Voltage of the CPU-core. As is also quite common, most of these ports can alternatively be configured as something else. This means that you cannot have all 51 GPIOs while also having ADC or DAC or Timers. This is the pin-multiplexing mentioned earlier. A small GPIO code sketch follows after the last interface below.
Up to 12 timers of varying resolution. Some can be used to count external pulses or generate PWM - Pulse Width Modulation - and others can be used as Watchdogs. The SysTick timer is meant to be used with an Operating System and is integrated with the NVIC.
Three Universal Synchronous/Asynchronous Receiver/Transmitters - of which one may serve as a classic RS-232 serial port with CTS and RTS hardware-handshake signals. With speeds up to 9 Mb/s, these ports can be used as a general interface to other CPUs.
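As promised under GPIO above, here is a minimal sketch - assuming the CMSIS device header and that PA5 (the user-LED on many Nucleo-64 boards) is wired as a plain output:

    #include "stm32f3xx.h"     /* CMSIS device header - an assumption */

    void led_init(void)
    {
        RCC->AHBENR |= RCC_AHBENR_GPIOAEN;           /* clock the GPIOA port */
        GPIOA->MODER = (GPIOA->MODER & ~(3u << 10))  /* clear mode bits for pin 5 */
                     | (1u << 10);                   /* 01 = general-purpose output */
    }

    void led_toggle(void)
    {
        GPIOA->ODR ^= (1u << 5);                     /* read-modify-write toggle */
    }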
First a bit of terminology. If you come from the Microsoft world you may have experienced that 16 bits are termed "WORDS" and 32 bits are called "DWORDS". This is NOT how it is in the Cortex world. The Cortex terms are summarized in Table 4.1.

Table 4.1: Cortex terminology

8-bits     Byte
16-bits    Half-Word
32-bits    Word
64-bits    Double Word
Memory Map
The Memory Map (more correctly: Address Map) is one of the most important characteristics of an embedded system. If you understand what the memory map tells you, and why it is structured as it is, you already know a lot about the system. Thus, the memory map is often a good place to start when you want to understand what an MCU offers.
Figure 4.1 shows a view of the memory map for the STM32F334 - with a zoom-in on the "Code" region. There is another view in Figure 4.2, showing region-sizes in GB, which can be helpful. This is how ST structures memory. As we shall see, it closely follows the way Arm structures memory. The ability to switch in Flash, System Memory or SRAM builds on Arm Cortex features - although the content of the System Memory - the ROM bootloader - is an ST invention. The buses from Chapter 3 - IBus, DBus and System-bus - come directly with the Cortex-M part from Arm, while the peripheral-buses named APB1, APB2, AHB1, AHB2 and AHB3 are chosen by ST designers from the toolbox supplied by Arm. You can read much of this out of Figure 3.4. On larger MCUs like the Cortex-M7 there is yet a bus, called the AXI-bus.

Figure 4.1: STM32F334 Memory Map. Courtesy of ST.
In the daily work a software developer does not need to worry about whether it was Arm or the silicon vendor (here ST) who designed this or that. It does however make a difference when you are looking for the relevant documentation, or if you want to port your code to a device from another silicon vendor. The good thing is that the Cortex-M derivates share the same generic memory map, and generally map the same peripheral register to the same address (although it may offer different choices). In Table 4.2 you can see what was designed by Arm in Cortex M0/M0+/M3/M4/M7 devices, and what was designed by ST in this particular STM32 device - and basically also in other STM32 devices (with varying sizes of available memory). Table 4.2 has the following columns:
To save space in the table there is no end-address per range, but as there are no gaps, you can count on each range ending one byte before the next start (except the "end-stop" at the end of the table).
This column is taken from Figure 4.1 and relates directly to the overall architecture of the STM32F334R8, shown in Figure 3.4. There are many details that are not mentioned here - option-bytes, peripheral registers, and bit-banding - see Section 4.2. However, the overall scheme is correct.
This is Arm's scheme for the Cortex M0/M0+/M3/M4/M7 devices. We see that while our specific device only has 12 kB of SRAM from 0x2000 0000 to 0x2000 3000, Arm has reserved the range from 0x2000 0000 to 0x4000 0000 - corresponding to 0.5 GB - for SRAM. Similarly, we can see that the entire range from 0x6000 0000 to 0xa000 0000 - corresponding to 1 GB - is reserved for external RAM. In our STM32F334R8 this range is "Reserved" - not to be used. We noted earlier that no buses are available on the pins of the chip - meaning that this device is not meant to have external memory. Note that Arm calls the range "RAM" - not "SRAM". It could be SRAM, but it could also be SDRAM. The Cortex-M7 has a built-in DRAM controller as part of the FMC - Flexible Memory Controller. Note also that the "RAM" range does not have the "XN" marker (see next bullet), and therefore can contain code. The FMC in Cortex-M7 also has a built-in NAND-Flash controller. This is a good example of how you could design a system based on a Cortex-M7 - executing code from external DRAM - like a classic CPU. In such a case you might use the built-in NOR-Flash to host a bootloader, that loads the main program from external NAND-Flash - with high density - into external SDRAM, and then executes from this. Much like a classic CPU.
When a row has the XN assigned, it means Execute Never. This marking is used for the ranges reserved for internal as well as external peripherals. Should a program attempt to execute code from here, an exception is normally generated (depending on the MPU setup).
The hardware is allowed to make optimizations when it comes to executing your C-code - maybe doing things in a slightly different order. This might sound scary, but it is one of the many trade-offs we accept to get better performance1. You do, however, not want peripheral register accesses affected by hardware optimizations. A component might require the low byte of a 16-bit value written to an address, followed by the high byte - to the same address. If the order of these writes is swapped, strange things will happen. Even if the bytes are not written to the same address, it can still cause problems if e.g., writing the high byte is assumed to come last and triggers an action. The chip-designers have helped us here with the Access-Type field.
1 Please note that this is not the same as compiler optimization.
- When a block is marked "N" for "Normal" it means: The processor can re-order transactions for efficiency or perform speculative reads.
- When a block is marked "D" for "Device" it means: The processor preserves transaction order relative to other transactions to Device or Strongly-ordered memory. This covers the example above.
- When a block is marked "S" for "Strongly-Ordered" it means: The processor preserves transaction order relative to all other transactions. Strongly-Ordered is adamant when you e.g., change the setup of the MPU - the Memory Protection Unit - or select another stack-pointer (more on this later).

Table 4.2: 4 GB Memory Space Usage (XN = Execute Never; T = Type: Normal, Device or Strong)

Start Address   STM32F334R8            Cortex         XN   T
...             ...                    ...            ..   ..
0xa000 0000     Reserved               Ext Device     XN   D
0xe000 0000     M4 + Int Peripherals   Private Per    XN   S
0xe010 0000     M4 + Int Peripherals   Vendor Spec    XN   D
0xffff ffff     End of 32-bit range    NA             -    -
I have quoted Arm directly in the above three definitions in italic. A speculative read is e.g., reading instructions and/or data ahead to maintain a pipeline. Even if e.g., Cortex-M4 does not have an instruction-cache in the usual sense, the Flash-interface nevertheless has a prefetch-buffer of two times 8 bytes. The hardware may engage in branch-prediction - reading data and instructions that end up unused. This is not something Arm is very open about, as it is a parameter in the competition. To avoid any such speculative reads that could trigger strange things, the safe thing is to mark the region "XN" and e.g., "Device-Ordered" in the MPU configuration.
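Note that volatile in C solves the compiler side of the same problem - it keeps the compiler from reordering or merging the accesses - while the Device attribute governs the hardware. A minimal sketch of the low-byte/high-byte example above, with a made-up register address:

    #include <stdint.h>

    /* Hypothetical 8-bit device data register at an illustrative address */
    #define DEV_DATA (*(volatile uint8_t *)0x40001000u)

    void dev_write16(uint16_t value)
    {
        DEV_DATA = (uint8_t)(value & 0xFFu);  /* low byte must go first...    */
        DEV_DATA = (uint8_t)(value >> 8);     /* ...high byte triggers action */
    }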
Bit-Banding
In many cases we have things going on in parallel that access the same memory location or peripheral register. It can be DMA while executing normally, it can be multithreading, multiple cores, or another CPU on the board. Even in a simple single-core, bare-metal system we can have an interrupt routine that works on a resource that is shared with the main program. Especially when we set or clear bits that control hardware or signal software stages, we need to take care.
Normally, you cannot simply set a single bit in a register or in memory. You need to perform a Read-Modify-Write. If the interrupt happens after the main program's read, but before the write, the interrupt routine might do its own read-modify-write, which the main program then wrongly overwrites when it gets access to the CPU again. We will look closer at this problem in Chapters 11 and 12.
Since we are now discussing the Cortex-M memory map, I will focus on one specific solution to the problem, introduced by Arm: Bit-Banding. The bit-banding concept involves two special regions of memory - one is mapped to an important part of the peripherals, whereas the other is mapped to a memory area. See Figure 4.2.
The mapping is not a "normal" alias - like the one we have for the code starting in address 0, where memory areas are mapped one-to-one. In the bit-banding concept Arm has mapped a full 32-bit word in the alias memory region to a single bit in a lower-addressed region. It is only bit 0 in the alias that is used - the remaining bits are ignored2. We can write 0xffff ffff or 0x0000 0001 - it doesn't matter which - to address 0x2200 0010, and it will set bit 4 in 0x2000 0000. The design assures that this is done with a hardware-based - atomic - read-modify-write. The concept is also shown in Table 4.3.
2 This of course means that you cannot use the "alias" regions as normal RAM.
Listing 4.1 is a C-macro from Texas Instruments - implementing the bit-band address formula given by Arm. The first line assumes that the offset between the alias and its target is the same for memory and peripheral bit-banding. Line 2 furthermore assumes that the target base-address ends with 7 hex zeros in both cases - the same as saying that the target base-addresses are aligned on 0x1000 0000 boundaries.

Figure 4.2: Memory Layout with Bit-banding. Courtesy of ST.

Table 4.3: Bit-band regions and aliases

0x2000 0000 - 0x200f ffff   1 MB SRAM Bit-Band               Normal direct access - and via alias
0x2200 0000 - 0x23ff ffff   32 MB SRAM Bit-Band Alias        Bit 0 of each word mapped to a bit in the above
0x4000 0000 - 0x400f ffff   1 MB Peripheral Bit-Band         Normal direct access - and via alias
0x4200 0000 - 0x43ff ffff   32 MB Peripheral Bit-Band Alias  Bit 0 of each word mapped to a bit in the above
Listing 4.1: Bit-band address macro (after Texas Instruments)

1  #define BB_OFFSET 0x02000000
2  #define BITBAND_ADDR(addr, bit) (((addr) & 0xF0000000) + BB_OFFSET + (((addr) & 0xFFFFF) << 5) + ((bit) << 2))
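A small usage sketch of the macro above - setting bit 4 of the SRAM word at 0x2000 0000 through its alias (0x2200 0000 + (0 << 5) + (4 << 2) = 0x2200 0010, matching the example earlier):

    #include <stdint.h>

    #define BB_OFFSET 0x02000000
    #define BITBAND_ADDR(addr, bit) (((addr) & 0xF0000000) + BB_OFFSET + \
                                     (((addr) & 0xFFFFF) << 5) + ((bit) << 2))

    void set_bit4_atomic(void)
    {
        volatile uint32_t *alias =
            (volatile uint32_t *)BITBAND_ADDR(0x20000000u, 4);
        *alias = 1;   /* the hardware performs an atomic read-modify-write */
    }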
Listing 4.4: Linker Script - rodata

/* Constant data into "FLASH" Rom type memory */
.rodata :
{
  . = ALIGN(4);
  *(.rodata)         /* .rodata sections (constants, strings, etc.) */
  *(.rodata*)        /* .rodata* sections (constants, strings, etc.) */
  . = ALIGN(4);
} >FLASH

The rodata output section in Listing 4.4 receives all the rodata. This is const-defined variables - often tables. It makes sense to store these variables in Flash - saving valuable RAM and also improving security, as these data cannot simply be overwritten.
Listing 4.5 shows how all data (and RamFunc) is put into the next output section, named data. We start by assigning the load-address of the data segment to the symbol _sidata with the LOADADDR statement. The interesting thing is that this is not yet known. The linker, however, supports lazy evaluation, and as _sidata is not needed now, it is OK that it gets its value later. Next we see how symbols inside the data output segment are defined for the start and end of the data section.

The >RAM AT> FLASH at the end is a nice trick. As you may remember from earlier, data contains all the data from the C-programs that have initial values. With this single line we assure that all initial values from the data segments are stored in Flash - but also allocated in RAM with correct symbols for the addresses of all variables. The linker cannot assure that our data actually ends up in RAM, but it certainly has paved the way for the "startup" program that has this responsibility. The use of "RAM" here also means that the location pointer "knows" that the segment should start with the start address for RAM. However, the actual load-address is in Flash and decides the value of _sidata. Listing 4.5 finally shows how the Linker Script handles the ccmram in the same way it handled data.
In Listing 4.6 we see how the Linker Script handles the bss segment. Again we see how symbols for start and end are created, to be used in the startup. We also see that there is no use of Flash - only RAM. This matches the fact that bss contains uninitialized data. There is no need to store anything in Flash. We also see how heap and stack are placed at the high end of the RAM - just as shown earlier in Figure 4.4. They are placed in an output segment named for the occasion: _user_heap_stack.

When the linker has read all input files, it has a large table of all symbols - and how they refer to each other. Typically, we use the option --gc-sections to tell the linker to garbage-collect all input sections that are not referenced. This can save a lot of memory space. KEEP is used to override this garbage collection, as we saw with the isr_vector.

Listing 4.6 shows a special /DISCARD/ section where developers can write the names of sections that must be subjected to garbage collection - no matter whether garbage collection was requested elsewhere or not. The whole idea of a library is to only include code for what is used.
Listing 4.5: Linker Script - RAM and Flash

/* Used by the startup to initialize data */
_sidata = LOADADDR(.data);

/* Initialized data sections into "RAM" Ram type memory */
.data :
{
  . = ALIGN(4);
  _sdata = .;        /* create a global symbol at data start */
  *(.data)           /* .data sections */
  *(.data*)          /* .data* sections */
  *(.RamFunc)        /* .RamFunc sections */
  *(.RamFunc*)       /* .RamFunc* sections */
  . = ALIGN(4);
  _edata = .;        /* define a global symbol at data end */
} >RAM AT> FLASH

_siccmram = LOADADDR(.ccmram);

/* CCM-RAM section
 * If initialized variables will be placed in this section,
 * the startup code needs to be modified to copy the init-values.
 */
.ccmram :
{
  . = ALIGN(4);
  _sccmram = .;      /* create a global symbol at ccmram start */
  *(.ccmram)
  *(.ccmram*)
  . = ALIGN(4);
  _eccmram = .;      /* create a global symbol at ccmram end */
} >CCMRAM AT> FLASH
Listing 4.6: Linker Script - bss, heap/stack and discard

/* Uninitialized data section into "RAM" Ram type memory */
.bss :
{
  /* This is used by the startup in order to initialize the .bss section */
  _sbss = .;         /* define a global symbol at bss start */
  *(.bss)
  *(.bss*)
  *(COMMON)
  . = ALIGN(4);
  _ebss = .;         /* define a global symbol at bss end */
} >RAM

/* _user_heap_stack section, used to check that there is enough "RAM" left */
._user_heap_stack :
{
  . = ALIGN(8);
  PROVIDE ( end = . );
  PROVIDE ( _end = . );
  . = . + _Min_Heap_Size;
  . = . + _Min_Stack_Size;
  . = ALIGN(8);
} >RAM

/* Remove information from the compiler libraries */
/DISCARD/ :
{
  libc.a ( * )
  libm.a ( * )
  libgcc.a ( * )
}
Linker Map
When the linker is run with the linker script and all the source-files, it generates our binary output file - typically an ELF-file, which we will see in a moment. The linker can also generate a linker-map. A few selected pieces of our linker-map are shown in Listing 4.7. When all goes well, you can recognize all the variables from the linker-script, as well as all variables and functions from the source, in the linker-map. However, in the map the addresses of variables and functions are resolved - meaning that they now have absolute addresses assigned by the linker. We will see several of these in action in Chapter 5. A linker-map can be very practical when you are debugging - especially if your debugger does not translate everything to symbols. It is also a good place to check that your changes to the linker script are working as expected. In the STM32CubeIDE you can find the linker-map in the "Project Explorer" tree under "Debug".
Listing 4.7: STM32F334R8 Map-file (shortened)

0x0000000020003000    _estack = (ORIGIN (RAM) + LENGTH (RAM))
0x0000000000000200    _Min_Heap_Size = 0x200
0x0000000000000400    _Min_Stack_Size = 0x400
Skipping
.isr_vector     0x0000000008000000    0x188  ./Core/Startup/startup_stm32f334r8tx.o
                0x0000000008000000           g_pfnVectors
                0x0000000008000188           . = ALIGN (0x4)
Skipping
.text.main      0x00000000080001c8    0x28   ./Core/Src/main.o
Skipping
.text.Reset_Handler
                0x00000000080004f0    0x50   ./Core/Startup/startup_stm32f334r8tx.o
                0x00000000080004f0           Reset_Handler
Skipping
                0x00000000200000b8           PROVIDE (_end = .)
                0x00000000200002b8           . = (. + _Min_Heap_Size)
ELF
Listing 4.8 is the result of the command readelf FirstRun.elf - or rather arm-none-eabi-readelf.exe -a FirstRun.elf - where "FirstRun" is my name for the Blinky project.
A few comments to the parsed ELF:
• Line 11: Program Entry is at 0x800 04f1. From the linker-script we remember that the entry is our Reset_Handler. We know that our Flash starts at 0x0800 0000 (I added a 0 in front, to avoid confusion). Thus, our executable program starts 0x4f1 into the Flash. Definitely within our 64 kB. If you are thinking that a function address cannot be uneven, you are right. We will get back to this in Chapter 5.
• Line 14: Somehow the use of the hardware floating-point accelerator has been enabled. Great - but where? The linker command-line also includes some options. In the IDE these are found via Properties on the project, then "Settings". Both the compiler and the linker options contain the following two options: -mfpu=fpv4-sp-d16 -mfloat-abi=hard. The first option tells gcc that we have an FPU hardware unit following v4 of the spec. It supports single-precision (sp), and it has 16 double-registers - corresponding to 32 single-registers. The second option tells gcc to use the hard FPU calling convention. This just goes to show that directives and options from many places end up in the linker, and it can sometimes be difficult to find out what originates where. We will see more compiler and linker options in Chapter 6.
• Line 17: Section Headers. We recognize the various output sections - like text and data etc. The Addr column is the physical address in memory that the section goes to, while the Off column tells the offset in the ELF-file where this is found. The Size is hexadecimal in Bytes, and the Flags are W for Write, X for eXecute and A for Allocate. "Write" and "Execute" make sense - but "Allocate"? The "allocate" flag is set when the given memory must be allocated in the target system. This sounds obvious, but don't forget that the ELF-file may also contain debug information. We don't want that to take up our memory space.

• Lines 20-27: The sections that go into Flash are PROGBITS, while e.g., bss and _user_heap_stack are NOBITS. Here we see the difference between rodata and text. They are both readable, but only text is executable. data and bss are both readable and writable as expected. It is a bit misleading that ccmram is only writable. However, this describes the ELF-content - it is not an MPU-configuration.
• Line 30: We see our well-known sections mapped into Segments of the ELF-file. Here "segment" is the right term. It designates physical ranges in memory. It is clear that the parts that go to Flash all go into Segment 0. This fits with the statement "Erasing memory corresponding to segment 0:" that we saw way back in Chapter 2 in Listing 2.3, line 44, when the Blinky program was flashed.
Listing 4.8: STM32F334R8 Parsed ELF (shortened)

ELF Header:
Skipping
  Entry point address:               0x80004f1
  Start of section headers:          1085864 (bytes into file)
  Flags:                             0x5000400, Version5 EABI, hard-float
Skipping
Symbol table '.symtab':
   Num:    Value  Size Type    Bind   Vis      Ndx Name
Skipping
   206: 08000825   740 FUNC    GLOBAL DEFAULT    2 HAL_GPIO_Init
   207: 08000541     2 FUNC    WEAK   DEFAULT    2 EXTI0_IRQHandler
Skipping
The ELF ends with a nice summary of the overall configuration attributes - see Listing 4.9. You can also output this part alone, by using readelf with the "-A" option.
Listing 4.9: Parsed ELF - Attributes (shortened)

Tag_CPU_arch_profile: Microcontroller
Tag_THUMB_ISA_use: Thumb-2
Tag_ABI_FP_denormal: Needed
Tag_ABI_FP_exceptions: Needed
Tag_ABI_FP_number_model: IEEE 754
Tag_ABI_align_needed: 8-byte
Tag_ABI_enum_size: small
Tag_ABI_HardFP_use: SP only
Tag_ABI_VFP_args: VFP registers
Stack & Heap with monitors
While on the subject of heap and stack, I want to show how to set up a software-based stack "guard". We know that the stack grows downwards. It is not uncommon that we underestimate how much data is stacked. If the stack is defined too small in the linker-script, it will grow downwards into the heap area and create havoc.
A stack guard consists of the following:
• At the end of the linker-script a variable is declared at the border between heap and stack. See Listing 4.10, where the symbol stack_chk_guard is set to point to the end of the heap, and the next four bytes are allocated.
• Somewhere in the C-Source, this variable is declared as extern, as seen in Listing 4.11. The variable is initialized with a recognizable pattern - e.g., 0xdeadbeef or 0xdeadface.
• In a function that is executed often - e.g., in an idle-loop, a scheduler core or similar - the variable is tested for changes. This is also shown in Listing 4.11.
Listing 4.10: Stack Guard in Linker Script

. = . + _Min_Heap_Size;    /* Allocate heap */
stack_chk_guard = .;       /* Define symbol at end of heap */
. = . + 4;                 /* Allocate four bytes for the guard */

Listing 4.11: Stack Guard in source-code

// Declare the guard extern - from linker script
extern uint32_t stack_chk_guard;

// Set before the program seriously runs
stack_chk_guard = 0xdeadface;

// Somewhere in idle loop or scheduler - we monitor the guard
assert(stack_chk_guard == 0xdeadface);

In the above case it was possible to define a symbol in the linker script and allocate space for it. The C-source only needs to refer to it as extern and can easily use it. Sometimes it works the other way around - you have something in C that you need to manipulate in the linker script. The compiler allows you to assign an attribute to the data-structure that declares a specific input section to hold a given variable. In gcc we could e.g., write __attribute__((section("._user_heap_stack"))) before the actual declaration. The section may be an existing input section or a new one. We use this in Chapter 9.
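A minimal sketch of that attribute in use - the variable name is hypothetical, the section name is the one from the linker script above:

    #include <stdint.h>

    /* Ask the linker to place this variable in the "._user_heap_stack"
       input section, so the linker script controls exactly where it lands. */
    __attribute__((section("._user_heap_stack")))
    static uint32_t placed_by_linker_script;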
The use of the MPU - Memory Protection Unit - to generate an exception when the guard is violated is more effective than the above software-based solution. This is discussed in Chapter 10.
Reset Vector
Now is the time to see many of the things we have discussed so far in action. We will do this by studying a reset. One might expect that we should set a breakpoint at the reset vector located at address 0. In older CPU systems the reset vector was executable code, with a "Jump" instruction to the handler address. In that scenario we could set a breakpoint here. The same goes for the other vectors that follow the reset vector. However, in the Arm Cortex-M MCUs this is different. Each vector is simply the address of the relevant handler - except the reset-vector, which first contains the stack-pointer - and then the handler-address. The vectors are in reality data - not code - read by the Cortex-M microcode. This utilizes the memory better, as we save the space for all the jump-instructions. The concept also works around the fact that the instruction-set cannot directly hold 32-bit addresses. The first instruction to be executed "normally" will therefore be the first instruction in the reset-handler - the entry-point as defined in the linker script.
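Expressed in C, the start of such a vector table could look like the sketch below - purely illustrative; the real table in this project is defined in assembly in the startup file:

    #include <stdint.h>

    extern uint32_t _estack;      /* top-of-stack symbol from the linker script */
    void Reset_Handler(void);     /* the entry-point */

    /* The ".isr_vector" section is placed first in Flash by the linker script.
       Note: this is data, not code - the microcode reads it at reset. */
    __attribute__((section(".isr_vector")))
    const uint32_t vector_table[] = {
        (uint32_t)&_estack,       /* word 0: initial stack-pointer value */
        (uint32_t)Reset_Handler,  /* word 1: address of the reset-handler */
        /* ... the remaining exception- and interrupt-vectors follow ... */
    };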
Debuggers are typically by default set to "break" (reach a preset breakpoint) at the first line in main. You can reconfigure this, but we don't need to. The chip-designers are aware that debugging straight from reset is often interesting - and in other scenarios really important. The debugger therefore normally does not reset just because the rest of the CPU does. When the system is running, you can set a breakpoint in the Reset_Handler, and then press the Reset-button on the board. This will reset everything - except the debugger - and you will be at the breakpoint in the Reset_Handler. Nice.
Be aware that if you open a memory window in Eclipse, you might not see what you expect. When you choose to see memory as 32-bit words, you will see them as big-endian - and our CPU is little-endian. The trick is to choose the so-called "Traditional Rendering".

Figure 5.1: Breakpoint at Reset.
This time, I am however, using VS Code. Following the prescription above, I have set VS Code to break at the first real instruction - the Reset_Handler in "startup_stm32f334r8tx.s". Note that some newer boards have a C-based startup file. However, using this assembly-language version of startup gives us a chance to see some interesting ARM concepts. The memory window allows us to see memory in 1, 4 or 8 byte groups - and we can also select between big- and little-endian. Figure 5.1 shows a very small part of the screen - with VS Code in a more paper-friendly mode than the usual black.
We know that the stack-pointer must be set to point to the end of the RAM section. The stack-type is known as Full Descending. We already know that it grows downwards - the same as descending. "Full" means that when not in the midst of an operation, the stack-pointer - SP - points to a full memory cell - something stacked earlier. Thus, at the next Push operation, the stack-pointer will count down first and then store data. This means that we can safely let the stack-pointer point at the first address after the RAM. Our RAM starts at 0x2000 0000. This MCU only has 12 kB of RAM - corresponding to 0x3000. From all the above, we will expect the SP to be loaded with 0x2000 3000. This also fits with the linker-map back in Listing 4.7. In Figure 5.1 in the memory area at the bottom, we recognize the expected SP - 0x2000 3000 at address 0.
The next word in the memory view - at address 4 - is 0x0800 04F1. This should be the address of our Reset_Handler, and it fits with the ELF-file - Listing 4.8. However, the Map-file puts the Reset_Handler at 0x0800 04F0 - see Listing 4.7. It turns out that they are all correct.
Since all instructions are 16 or 32 bits wide, even start-addresses ensure that all instruction-fetches will be from even addresses, which means fewer accesses to Flash. Thus, even start-addresses are required by the architecture. However, since addresses are guaranteed to be even, Arm has used the LSB - Least Significant Bit - in the address to flag that we should use the Thumb instruction-set. This is not really necessary on a Cortex-M, as Cortex-M MCUs do not support the alternative - the older ARM instructions. However, the ARM instructions are supported by Cortex-R and -A, so setting this flag allows for more reuse across the Cortex'es and tools. This is why the linker gave an odd address to the ELF-file - and this is what is loaded into the reset-vector. The address of the first instruction is indeed 0x0800 04F0 as stated in the Map-file. This is also what we see in the PC - Program Counter - in the register list on the left side, now that we are standing at the breakpoint - the start of Reset_Handler. We see this address again in Listing 5.1 - showing the disassembly view, which we will get to in a moment.
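In C terms - using the values from above - the vector value is simply the handler address with bit 0 set:

    #include <stdint.h>

    /* 0x080004F0 is the first instruction; setting bit 0 marks Thumb
       state and gives the 0x080004F1 we saw in the ELF-file. */
    uint32_t reset_vector = 0x080004F0u | 1u;   /* = 0x080004F1 */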
Another register - the Stack-Pointer - already has the value 0x2000 3000. Our program has not executed a single instruction, which means that the MCU's microcode has read this value from the reset vector and written it into the stack-pointer.

Below the registers we see the text "CALL STACK" with the value Reset_Handler@0x0800 04F0. We know that the stack-pointer is still pointing at the first byte after the RAM, so it cannot be the real stack. This is the debugger helping us.
Figure 5.2 shows the top of the real - still unused - stack. The memory that the stack-pointer points to - 0x2000 3000 - does not exist. We just see wavy lines. But just below we do have RAM - ready for the first push-instruction.
Literal Pool
Listing 5.1 shows source code with interleaved assembly code - copied from a debugger window. This is a bit odd - as the source is also assembly code. We will walk through the "normal" assembly-source in a moment, but there is more to learn - still standing at the very first instruction. The first line in Listing 5.1 is line 62 in the source file. It instructs the core to load the 32-bit stack-pointer register with _estack - end-of-stack. Just a moment ago we realized that the reset microcode has already initialized the stack-pointer to the value at address 0. So why load it again? The answer is that we could have had some fancy swap of memory-blocks - placing our stack somewhere else. Until such a swap is in place, the microcode has assured that we have a normal stack - in case of an early interrupt. Such a swap is however not needed, and we happily load the stack-pointer with the same value again.
The instruction-set in Cortex-M is very condensed. As hinted earlier, there is no room for a 32-bit constant like an address. We have to make a small excursion to get a 32-bit constant. Thus, the real assembly instruction is what we have in the next line, which starts with the address of the instruction - still 0x0800 04f0. This instruction tells the core to load the stack-pointer register with what the content of the "[]" points to. The square-brackets give us a level of indirection. We are to take the value of the program-counter - after this instruction - and add 52 decimal. The PC after this instruction is 0x0800 04f4. 52 decimal is 0x34. The sum is 0x0800 0528. We don't need to calculate this in the future - a comment is generated with this info. In the memory window - see Figure 5.3 - the first line after the header starts at address 0x0800 0520 (the first column is the address of the following data). As 32-bit words are made of 4 bytes, we add 4 per word - also seen in the header. So, the word at 0x0800 0528 is 0x2000 3000. This is indeed the first word after the stack - _estack.

The next assembly line is a call to SystemInit, and after this we load _sdata, _edata and _sidata in the same manner as the stack-pointer was loaded. From Chapter 4 we remember that these signify respectively the start of the data output section, the end of it, and the start of the initialization values for the data section in Flash - _sidata. When we discussed the linker script, we saw in Listing 4.5 how the linker defined and assigned addresses to all these symbols. In the memory window - Figure 5.3 - we see that the values are respectively 0x2000 0000, 0x2000 000c, and 0x0800 2838. The first address fits with the start of our RAM. The second address tells us that there are only 12 bytes to initialize, and the third address lies in Flash - where the initialization values reside.
Listing 5.1: The first instruction - source with disassembly

    62           ldr sp, =_estack      /* Atollic update: set stack pointer */
    0x080004f0:  ldr.w sp, [pc, #52]   ; load from 0x08000528 <_estack>
Figure 5.3: Memory referenced by instructions
The addresses in a disassembly of the startup object-file - here called "DisOfStart.lst" - are however not absolute, but offsets from the start of Reset_Handler. It is up to the linker to change these to absolute addresses - referring to Figure 4.3. If you use the command "arm-none-eabi-objdump -t startup_stm32f334r8tx.o" you will get a list of symbols. Here there will be lines like *UND* 00000000 _sidata - meaning that the symbol is undefined in the code, and that it is up to the linker to resolve this and the other symbols. That is what we saw in the linker script in Listing 4.5.
Startup Code
Listing 5.2 shows the startup code in the Reset_Handler again. This time as "normal" assembly source. The Listing shows us:
• Line 1: A text section is defined - here with the function name appended to .text. This corresponds to what the gcc compiler does if you use the -ffunction-sections option - which we discuss in the Compiler and Linker Options section.
• Line 2: The symbol Reset_Handler is defined as weak - meaning that if defined elsewhere without weak, then the new definition will be used instead. Thus, you can begin with this Reset_Handler, then later define your own version and keep this source as "fallback" if you decide to remove yours again. Much like a library. If you make your own Reset_Handler, you still want to keep the startup-file, as it also defines the vectors - see Listing 5.3. We use this concept in Chapter 13.
• Line 3: The .type directive is meta-information for e.g. debuggers.
• Line 4: A label - in this case our function name Reset_Handler.
• Line 5: We reload the stack-pointer as already described.
• Line 8: Call SystemInit. We will get back to the bl instruction as a call in Chapter 7.
• Lines 11-13: Load the start and end (both RAM) of data, and the start of initialize-data (Flash), into registers r0, r1 and r2. This is done by using the literal pool as we saw in Section 5.2.
• Line 14: movs r3,#0. Move immediate value: r3 = 0 - and set flags. We will use r3 as byte-counter.
• Line 15: b LoopCopyDataInit. Branch unconditionally to the bottom of the loop in the next lines. It is a classic approach in assembly language to keep the looping instructions after the body of the loop - and to start by skipping the body.
• Line 23: adds r4,r0,r3. r4 = r0 + r3. r4 contains _sdata plus the growing byte-counter - a pointer starting at _sdata. The "s" in adds means that the ALU flags will be set according to the result of the addition. Since there are no square brackets there is no dereferencing - the sum of r0 and r3 is stored in r4.
• Line 24: cmp r4,r1. Set flags on r4-r1. Like subs - without storing the result. The first runs will cause a "borrow", as r1 points to _edata, which is at a higher address than _sdata - which is where r4 starts. To my personal horror, a "borrow" in Arm does not lead to a Carry (more about this in Chapter 7). This is just the opposite of many other assemblers, and just goes to show that when everything else fails - read the manual.

• Line 25: bcc CopyDataInit. Branch if "carry clear" to the start of the loop where the actual copying takes place. The assembler could instead have written the alias blo - "Branch Lower" - meaning that we branch when the first operand is lower than the second. I suggest that this is what humans should do. Then we don't need to speculate on borrows.
• Line 18: ldr r4,[r2,r3]. Load r4 with what the sum of r2 and r3 points to. Again, the dereferencing comes from the square brackets. r2 will keep pointing at _sidata throughout the loop, while r3 is our byte-counter - 0 on first pass. We load a full 32-bit word - 4 bytes.

• Line 19: str r4,[r0,r3]. Store the contents of r4 where the sum of r0 and r3 points to. r0 keeps pointing to _sdata, and r3 is the same byte-counter as above. r4 holds the data we just fetched from Flash.

• Line 20: adds r3,r3,#4. Add the constant 4 to r3 and store in r3 - and set the flags. Thus, we increment our byte-counter by 4 because we handle 32-bit words at a time.
• Lines 23-25: This is the loop control as we saw earlier. We exit when we reach _edata by continuing instead of branching in line 25.

• Lines 28-31: The pattern from above repeats. Here we initialize the registers. There is no source-pointer - only a destination-pointer - as we will write zeros to the whole bss section. The zero we have in r3, while r2 is our destination-pointer. As before, this initialization ends by skipping the copying part of the loop and jumping down to the loop control.

• Lines 38-39: Again, we use a cmp on the start-pointer and end-pointer - setting the flags - followed by a bcc to the copying part of the loop.
• Line 34: str r3,[r2]. Store the contents of r3 - always 0 - in what r2 points to.

• Line 35: adds r2,r2,#4. Where we before had a growing byte-counter which we added to both of the unchanging bases, we now directly modify the single destination pointer.

• Lines 38-39: Again, the loop control, until we end by falling out at the bottom.

• Line 42: The libc-init is called.

• Line 44: Finally - we call main.

• Line 47: If main does not loop infinitely - we will do so here.
Listing 5.2: Startup Assembly - the Reset_Handler

     1    .section .text.Reset_Handler
     2    .weak Reset_Handler
     3    .type Reset_Handler, %function
     4  Reset_Handler:
     5    ldr sp, =_estack      /* Atollic update: set stack pointer */
     6
     7  /* Call the clock system initialization function. */
     8    bl SystemInit
     9
    10  /* Copy the data segment initializers from flash to SRAM */
    11    ldr r0, =_sdata
    12    ldr r1, =_edata
    13    ldr r2, =_sidata
    14    movs r3, #0
    15    b LoopCopyDataInit
    16
    17  CopyDataInit:
    18    ldr r4, [r2, r3]
    19    str r4, [r0, r3]
    20    adds r3, r3, #4
    21
    22  LoopCopyDataInit:
    23    adds r4, r0, r3
    24    cmp r4, r1
    25    bcc CopyDataInit
    26
    27  /* Zero fill the bss segment */
    28    ldr r2, =_sbss
    29    ldr r4, =_ebss
    30    movs r3, #0
    31    b LoopFillZerobss
    32
    33  FillZerobss:
    34    str r3, [r2]
    35    adds r2, r2, #4
    36
    37  LoopFillZerobss:
    38    cmp r2, r4
    39    bcc FillZerobss
    40
    41  /* Call static constructors */
    42    bl __libc_init_array
    43  /* Call the application's entry point. */
    44    bl main
    45
    46  LoopForever:
    47    b LoopForever
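For C-minded readers, the two loops correspond roughly to this sketch - illustrative only, since the real work must happen before the C runtime is ready, and the function name is made up:

    #include <stdint.h>

    extern uint32_t _sdata, _edata, _sidata, _sbss, _ebss;

    /* Hypothetical C equivalent of the Reset_Handler loops. */
    static void init_memory(void)
    {
        uint32_t *src = &_sidata;           /* init values in Flash */
        uint32_t *dst = &_sdata;            /* .data in RAM         */
        while (dst < &_edata)               /* copy word by word    */
            *dst++ = *src++;

        for (dst = &_sbss; dst < &_ebss; )  /* zero the bss section */
            *dst++ = 0;
    }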
Listing 5.2 does not include the full startup_stm32f334r8tx.s file. To focus, I only included the Reset_Handler. To complete the picture, Listing 5.3 contains extracts of the rest. Let's do a very quick summarization:
• Line 1: .syntax unified. The syntax is the unified version that can handle both ARM and Thumb instructions.
• Line 3: .fpu softvfp. This means that we are not using HW floating-point - which we actually plan to. However, the startup-file does not use the FPU, and this setting will prevent HW floating-point until this is correctly configured. The configuration takes place in SystemInit, which is called early in the Reset_Handler.
• Line 4: .thumb. As stated earlier: Cortex-M can only use the Thumb instruction set.
• Lines 6-7: g_pfnVectors and Default_Handler are declared global.
• Lines 11+19: Sections declared for the above.
• Lines 13-14: Default_Handler is an infinite loop.
• Lines 23-30: The first 6 handlers from the start of the isr_vector section. As can be seen in the linker-map - Listing 4.7 - this section is the first in the Flash. In the linker-map this is at address 0x0800 0000, but as we know, this region is aliased into address 0. Note that, as expected, Reset_Handler is the only one with an extra word - the top-of-stack address.

• Lines 33-46: .weak and .thumb_set directives for the above handlers - except the Reset_Handler, which is seen in Listing 5.2 - also with a .weak directive.
Listing 5.3: Startup Assembly - the Rest
     1    .syntax unified
     2    .cpu cortex-m4
     3    .fpu softvfp
     4    .thumb
     5
     6    .global g_pfnVectors
     7    .global Default_Handler
     8
     9  /* Skipping previous listing with Reset_Handler */
    10
    11    .section .text.Default_Handler,"ax",%progbits
    12  Default_Handler:
    13  Infinite_Loop:
    14    b Infinite_Loop
    15    .size Default_Handler, .-Default_Handler
    16
    19    .section .isr_vector,"a",%progbits
    20    .type g_pfnVectors, %object
    21    .size g_pfnVectors, .-g_pfnVectors
    22
    23  g_pfnVectors:
    24    .word _estack
    25    .word Reset_Handler
    26    .word NMI_Handler
    27    .word HardFault_Handler
    28    .word MemManage_Handler
    29    .word BusFault_Handler
    30    .word UsageFault_Handler
        /* ... remaining vectors skipped ... */

    33    .weak NMI_Handler
    34    .thumb_set NMI_Handler,Default_Handler
    36    .weak HardFault_Handler
    37    .thumb_set HardFault_Handler,Default_Handler
    39    .weak MemManage_Handler
    40    .thumb_set MemManage_Handler,Default_Handler
    42    .weak BusFault_Handler
    43    .thumb_set BusFault_Handler,Default_Handler
    45    .weak UsageFault_Handler
    46    .thumb_set UsageFault_Handler,Default_Handler
Arm Compilers
We use the term "toolchain" mainly to refer to the C or C++ compiler and the libraries. This is a huge topic - especially in the embedded world. Compatibility is a keyword. Contrary to your desktop compiler, embedded compilers are cross-compilers1 - they run on a desktop host platform (Windows, Linux or Mac), and they generate output for an embedded target platform. In this case a "platform" is the combination of hardware and operating system. This means that we face a multitude of combinations of host and target platforms. Not all of these combinations exist - e.g. I cannot find a compiler that runs on Mac and builds code for a Linux target. You can however get far with virtualization. I used to run Linux as a "guest" OS on a Windows system. This made my Linux system a guest in the Virtual-Machine world and a host in the embedded development world. I must admit that I got a bit tired of this in the end and installed a dedicated Linux machine.
1: I have an embedded compiler running on a Raspberry Pi - building code for the Pi itself, but this concept seems to be discontinued.
When it comes to cross-compilers there is a naming concept with four (sometimes five) fields separated by dashes: a-[b-]c-d-e. Note that the host platform is not included in the naming concept. This is mainly because there are fewer misunderstandings.

a: Target Architecture - arm in our case.

b: Vendor - This field is optional. Typically none - also from vendors like Keil.

c: Operating System - This might e.g. be linux. In our "bare-metal" case it is none. It is also "none" when we use a small embedded RTOS like FreeRTOS, because this is linked together with our code in a monolith - like a bare-metal system.

d: Format - Typically always eabi. This stands for "Embedded Application Binary Interface".

e: Tool name - We use tools like gcc, size, readelf. On Windows the tool has extension "EXE" as usual. On Linux there is no extension.
In the "bare metal" scenario, the gcc compiler on a Windows machine is called arm-none-eabi-gcc.exe. On a Linux host it will be arm-none-eabi-gcc. On both hosts you get the C++ compiler by exchanging "gcc" with "g++". In these cases, the none relates to no target operating system, and we have skipped the vendor field.
The many compilers can be grouped into three groups - based on their business model:

1. Open-Source compilers - notably gcc and Clang. Clang was created to be a "drop-in" replacement for gcc - built on more modern technology. It is very popular at Apple due to its Objective-C capabilities, whereas most embedded developers stick to gcc.
2. Silicon vendor compilers - like Texas Instruments' (TI) "ARM CGT" (derived from Clang).
3. Independent compilers - like IAR and Greenhills. The very popular Keil compiler used to be independent, but is now owned by Arm. However, as Arm does not produce silicon, and Keil targets many silicon vendors' products, we can keep Keil in this group.
So how to choose which compiler to use? If you prefer open source, it will be group 1. If you want more performance - be it compile-time or the running program's use of time and memory - you might look to EEMBC's "CoreMark" for an answer. EEMBC is an independent organization that benchmarks CPUs and CPU-boards - but they also note the compiler used in the test. If you search for "CoreMark Scores" you will find their tests - with the latest on top. Note that among the main parameters found on the overview page we have "Execution Memory". This relates to Section 3.5. Some tests may be performed on the same CPU - but in different CPU clock-speed variants. For this reason the CoreMark/MHz is calculated for easier comparison. This is however not completely fair, as a fast CPU may need to insert wait-states when reading from memory - and thus performs relatively worse than the slower version in CoreMark/MHz. Not so fair if you want to compare compilers. Still - you can try to filter on a relevant CPU, and if you are lucky you may be able to compare compilers. But then you need to assure that the compilers compared are using the same optimization. Finally, some measurements are certified by EEMBC, while others are uploaded by members - but not certified.
If you want to compare compilers, the safest solution is to download a relevant CoreMark test, and try out the compilers on your own hardware. If you do not yet have your own hardware, you can go with a relevant development board.
In general, it seems that on a small MCU system, gcc is a good choice - in execution speed as well as in benchmarks of the running code. This is partly confirmed by the fact that so many of the MCU silicon-vendors' IDEs today are based on gcc. It also seems that Clang may perform slightly better than gcc on systems with many cores, running vectorized code. Clang may be worth a tryout - if not for general compiling, then for static analysis. In this book we stick to gcc.
If you have downloaded the developer tools on Linux, you have also received make - the basic GNU tool for running builds. If on Windows you have downloaded an IDE, there will also be a make somewhere in there. However, if you want to try to build on the command-line on Windows, you also need to get make. As it is not a cross-compilation tool, it does not come with e.g., the arm-none-eabi tools. Just search for GNU make, and you will easily find it. If you are tired of adding tools to your "PATH", no harm is done if you copy make to the folder with the arm tools. Alternatively, you may want to go for CMake, which is a more modern tool. Be aware that it is easy to find classic "makefiles" for various samples, but not always the CMake equivalent. On the other hand, the classic make is very complex, with its many built-in rules and strange patterns - and it needs a better shell than the "DOS"-shell.
Libraries
Imagine that a company releases a library for cryptography in a binary-only version - thus protecting their IP, the source. Another company releases a binary-only DSP-library for electrical-engine control. If these two libraries are compiled with incompatible compilers you will not be able to use them both from your application. You would only be able to use one library - and that would require you to compile your application with a compatible compiler. Alternatively, the library vendors would have to compile - and test - their libraries with all relevant compilers.

Incompatibilities would mainly be related to calling conventions: who clears the stack - caller or callee? Do we even use a stack for function parameters, or do we use registers? If so - which registers in which order, etc.
This unbearable situation could happen years ago - until the various vendors agreed on a standard proposed by Arm - the "Embedded Application Binary Interface" - the one we see abbreviated in names like "arm-linux-eabi-gcc". Thanks to this standard, a program compiled with one compiler can utilize a library that is compiled with another. A part of the problem persists however, as the choice of instruction-set, operating system, and the use of an FPU will affect the binary output - and therefore can trigger a library variant. See Section 6.3. This is the reason why we still have several versions of libraries, but at least the makers can use their favorite compiler.
With the original C-compiler we got libc - the Standard C library. This has some problems - mainly in the relatively large amount of Flash and RAM it consumes. A newlib was created to avoid these problems, but it too was bloated. This led to the introduction of newlib-nano, which consumes much less space. If we turn our eyes, once again, to the Nucleo-64 board from Chapter 2, this only has 12 kB RAM. To avoid using it all up in the library, the newlib-nano is used in the sample-code. This is selected in the MCU-Settings in the IDE and is reflected in the linker-options as -specs=nano.specs. The specs-option refers to actual files with this extension - like nano.specs. A specs-file can tell gcc to replace one name with another. These files are kept in the same folders as the libraries. We will not go into the syntax of specs-files, as they are typically not edited - only included from the right folder.
So how to find the libraries? In the Eclipse-based tools there is a plethora of folders under the "plugins" folder. At first glance it can be quite discouraging. However, there is a system in the way these long paths are built. When we look at the sample, it uses a library-path built top-down from the following:
• A root folder that was chosen at installation time - typically with the version of the IDE included.
• plugins. This is the central folder that contains all tools and libraries.
• com.st.stm32cube.ide.mcu.externaltools.gnu-tools-for-stm32
• tools. We have both tools and libraries below this one.
• arm-none-eabi. Our well-known pattern.
• lib. For libraries - that was easy.
• thumb. The instruction set. There is also an "arm" folder for the older Arm CPUs.
• v7e-m+fp. Our Cortex-M4 architecture with the Floating-Point Unit.
• hard. The version that uses the hard (floating-point) ABI.
You can check the above by visiting the linker-map file. It shows the libraries pulled in by the respective object-files.
Building for FPU

Back in Section 4.6, where we discussed the Linker, we briefly touched on the compiler and linker options for the FPU - Floating-Point Unit. Since we use gcc for both functions, we use the same options in both places. That is where the easy part ends. It becomes quite complex due to history - but also because we have two things to configure.
The first is the "architecture" of the hardware FPU - if there is one. We must assure that we declare what our specific MCU supports. In the linker section for STM32F334R8 this was "-mfpu=fpv4-sp-d16". Tokenizing the value for the -mfpu option by the "-" gives us:
fpv4: This is basically the version of the FPU architecture. V4 is the one we see in Cortex-M4 - when there is an FPU. Cortex-M7 supports fpv5.

sp: This means Single Precision - what C-programmers know as "float". Where Cortex-M4 offers the choice between nothing and single precision, Cortex-M7 offers the choice between single and double precision. Cortex-M0, M0+ and M3 do not offer an FPU at all.

d16: This means 16 64-bit (double precision) FPU-registers. This makes little sense when you only have single precision, but they also serve as 32 32-bit (single precision) registers.
That was the hardware. Now we need to specify the software. What we need to specify is the calling convention. This is part of the ABI - Application Binary Interface. We need to assure that caller and callee agree on the use of registers - also when calling libraries. In the Blinky sample we had "-mfloat-abi=hard". The -mfloat-abi option can have one of three values. Here I am quoting Segger (vendor of debuggers etc.):
soft: Full software floating-point. The compiler will not generate any FPU instructions and the -mfpu= option is ignored. Function calls are generated by passing floating-point arguments in integer registers.

softfp: Hardware floating-point using the soft floating-point ABI. The compiler will generate FPU instructions according to the -mfpu= option. Function calls are generated by passing floating-point arguments in integer registers. This means soft and softfp may be intermixed.

hard: Full hardware floating-point. The compiler will generate FPU instructions according to the -mfpu= option. Function calls are generated by passing floating-point arguments in FPU registers. This means hard and softfp cannot be intermixed; neither can hard and soft.
If you select "hard" you need to assure that all files and libraries are compiled with this option. Many libraries come in a hard and a soft version. The "softfp" sounds alluring. However, these calls will clear the pipeline, and thus you lose some momentum here.
Finally, if you want to take advantage of the single-precision FPU, you need to take care. All normal math functions work with double-precision. Instead of sin() and sqrt() you will need to use sinf() and sqrtf() in your code. You append "f" for "float" to all the usual names. Also remember that constants like "8.0" are double-precision. You will need to write "8.0f" or "8.0F" in this case.
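A small illustration in plain C - nothing here is ST-specific, and the function names are invented:

    #include <math.h>

    /* Stays in single precision - a job for the Cortex-M4 FPU. */
    float scale_sp(float x)
    {
        return sqrtf(x) * 8.0f;
    }

    /* sqrt() and 8.0 force double precision - which a single-precision
       FPU cannot accelerate, so this ends in software emulation. */
    float scale_dp(float x)
    {
        return (float)(sqrt(x) * 8.0);
    }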
In the STM32CubeIDE, the FPU settings are found in the Project properties under "Settings" and the "MCU-settings" - together with the chosen board, CPU and runtime libraries - see Section 6.1.
Make

Having looked at the plugins-tree in the Eclipse-based solutions, the command-line based make concept may look simpler, but you will soon realize that this can be really hard to follow. Like C, make supports include files. Here the include-files have extension "mk". Good practice says that sources.mk will list files to compile, while config.mk will define variables that are used by the makefile - like cpu and board. You may also meet common.mk, which could e.g., describe a whole family of boards. And then - just when you think you got the picture, you realize that on several lines, the makefile changes directory and invokes a new make. So which value of which variable is used where?

In this case I recommend the use of the -p option in the make call - like make -p all. Not in general - but to get a grip. The -p causes the make-tool to print out the variables, directories etc. that it has picked up. Together with the linker-map this gives a good overview - although the printout can also be overwhelming. There are other options - like -d for debugging - for even more output. Be aware that recursive use of make should not call make directly, but call "$(MAKE)" in order to carry the options through the system. As already stated, the modern equivalent of the classic make is CMake.
Compiler and Linker Options
As noted earlier, gcc is a "driver program" that - when triggered - calls the compiler, the assembler, and the linker2.

2: Unless -c or another option is used to ask gcc only to compile.
We cannot go through all options here - there are too many, and some are really exotic, while others are only for specific targets. Instead, I have picked some that we often see - and several of these are used in the sample programs. I have put all options in one common table - Table 6.1.

Table 6.1: Compiler and linker options

  Basic CPU Type                                   -mcpu=cortex-m4
  FPU HW - e.g., version 4 single-precision        -mfpu=fpv4-sp-d16
  FPU Calling convention - e.g., hard,
    using FP-registers                             -mfloat-abi=hard
  Each function in own section                     -ffunction-sections
  Data in own sections                             -fdata-sections
  Include-files folder to search -
    can be used many times                         -I
  Definition of string - can be used many times    -D
  C Standard to use                                -std=gnu11
  Use std C-library - libc.a                       -lc
  Use Math-library - libm.a                        -lm
  C-lib override - nano                            -specs=nano.specs
  Implement error-stubs for system calls           -specs=nosys.specs
  Do not use shared libraries                      -static
  Generate Map-file - name                         -Map="$BuildArtifactFile-
  Garbage Collection of unused sections            -gc-sections
  Start of circular ref - typically libraries      -start-group
  End of circular ref - typically libraries        -end-group
  No unaligned access (Section 7.6)                -mno-unaligned-access
Please note that the linker-specific options must be preceded by -Wl, (note the comma) when used with gcc. You can have several of these together with a single instance of -Wl, as long as you avoid spaces and use comma as a separator. When gcc hands these options to the linker, they become space-separated. It is not always easy to decide when an option falls in the linker option category. An example is -specs=nano.specs, which tells the linker to use the small newlib-nano library. This option is, however, considered a gcc-option. When in doubt consult the manual - or find some working examples. Also note that in most cases the order of options does not matter - but sometimes it does. In these cases, the options are applied from left to right. With much already stated, the following summarizes many of the options in Table 6.1.
• Some options are relevant for both compiler and linker. This is e.g., the FPU calling convention, which is needed by the compiler for the actual calls, and by the linker when selecting the right version of a given library.
• Historically there has always been a need to link to libraries, so using -l for library needed no explanation in the early days. Likewise, all libraries started with "lib" and the extension was given by the platform. We basically only needed the first letter after "lib". Thus, the C standard library - "libc.a" on Linux - could be included by -lc, and the math library could be included by -lm. Today, among so many other options, these options seem a bit confusing.
• Many options come in a short and a long version, where the short option expects the value of the option to follow immediately, while the long version expects an equal-sign. In other words, -lc and --library=c both mean "link with libc.a". The long version works with both one and two dashes.
• The library-setup got even more strange when the original library options were kept unchanged, and the -specs=nano.specs was introduced to use the newlib-nano instead of the old standard C-library. Thus, instead of replacing -lc with e.g. -lnew-nano, we keep -lc and introduce -specs=nano.specs, which silently does a replacement for us. The specs-files are found together with the relevant library-versions.
• The path to the library typically includes folders referring to instruction-set, architecture, and FPU-calling convention - see Section 6.3.
• The option -static means that we will not allow dynamically linked libraries. This means that we create a single binary - often called a monolith. This approach is normal in "bare-metal" MCUs, while with Linux-targets you will not see the -static option.
• GC in -gc-sections means "Garbage Collection". Not a popular term in small real-time MCUs. It does however, simply mean that if the linker identifies sections that are not used by the code, it should delete them. This works hand-in-hand with the compiler options -ffunction-sections and -fdata-sections. These two options will respectively assure that each function gets its own section - starting with .text - and that each data-declaration gets its own section - starting with .data or .bss. In this way our code and data are still placed in the right places, but the linker will - if the -gc-sections option is present - assure that only code and data that is actually referenced will make it into the binary image.
• The -start-group and -end-group are linker-options used around options that may be read several times to handle circular references. This is typically used with libraries that may be re-scanned for symbols that are referenced by other libraries after the first scan.
• I earlier stated that ST has chosen that the Cortex-M4 in the sample-MCU is hard-wired to Little-Endian. The reason why this is not noted in the compiler options is that Little-Endian is the default. If your chosen system uses Big-Endian you will need -mbig-endian. Still, it makes sense to use the Little-Endian version when relevant, to document the choices taken.
• Some options include others. You may find the -march option that should be set to armv7e-m in our case. This is however included by the -mcpu=cortex-m4 option.
• When going from debug- to release-code, you will often need to fiddle with the compiler option for optimization: -O. The numbers 0-3 signify ascending degrees of optimization, starting with zero optimization. It is common to use -O0 to assure that debugging is doable. I will however recommend the use of -Og instead, which is made for this. It will still work with debugging, but it will save you some memory. Unfortunately, many people have experienced strange problems that go away when optimizations are turned off. Compiler optimization is a field of science in itself.
• Many options are only mentioned in references with the non-default version. You can often find "counter" options like -no-gc-sections that allow you to be explicit about your choices.

IDEs and alternatives
We saw in Chapter 2 how easy it is to install e.g., STM32CubeIDE and the accompanying STM32CubeMX configurator. All the major silicon vendors have their own version of an Eclipse-based system - which by design works on Windows, Linux, and Mac hosts. This system is, as stated, relatively easy to download and install. It will help you in the development cycle: Edit-Compile-Link-Flash-Debug. I taught Java at the university at the time when Eclipse got momentum. It spread from Java to native C/C++ development and eventually to VHDL and embedded development. It was fantastic - the first free environment that could do almost everything for embedded developers that the expensive Visual Studio could do for Windows developers. Later I developed a driver-stack for a TCP-Offload Engine - aka "TOE" - running on Linux3, and I spent most of the day coding in Eclipse. The only issue was that the graphics never really measured up to Visual Studio. The screen was never as calm to look at as the Windows Visual Studio, and the darker color-patterns did not work because colors blended, and you could not separate things.
3: When "Connected" it all happened in an FPGA; the Linux-stack only handled the other states and transitions.
These days the free Visual Studio Code is conquering the world. Again, with a calmer and nicer screen than Eclipse - but now also on Linux and Mac - and free. To many developers this is alluring. You can start using VS Code as the simpler alternative; however, it is easy to install so many Extensions that you end up with your own IDE that you must maintain. There is a risk that it becomes so popular that people abandon the safe Eclipse-based environment before they are ready to code all the necessary makefiles, scripts etc.
In many cases I think that the discussion on the Eclipse-based environments versus e.g., VS Code based, can be a bit dominated by feelings more than facts. So - what are the facts? It can be a bit problematic to do everything via Eclipse when:
• You develop in Eclipse, but you want a headless build-server to build your projects.
• You want to reuse a lot of code across CPU-types - e.g., Cortex-M4 and Cortex-M7.
• You are missing some specialized debug-views.
So what can you do? I see the following options:

1. Stick to the Eclipse-based environment. You can be productive from day one. You can share source across project-files via git.
2. Use the already existing makefiles and toolchain from an existing Eclipse-based project. As stated earlier, Eclipse creates a plugins folder in its installation with a swarm of folders with very long names. They contain the gnu-tools, libraries, include files for CMSIS with HAL, stlink, gdb-server and much more. It is possible to build from the command-line based on these. You can even refer to them in your VS Code workspace-settings and c_cpp_properties.json. I managed to make and debug from within VS Code - using all the existing makefiles and source. The nice thing about this solution is that you can keep working in Eclipse. You can build from the command-line in your build-server, and in the daily work, developers can use VS Code or Eclipse as they prefer. The downside is that these environments may start out being the same, but there is a risk that build actions within VS Code begin to differ from the Eclipse-based. You now have two environments to maintain. Still, it can be an OK way to get started on the command-line way of doing things.
3. Start from a clean sheet on command-line development. Add any editor - although VS Code is the popular choice. I also managed to do this on a Linux machine. In my case I used samples from GitHub. They were well organized with CMSIS, HAL and templates for various HW-platforms. It was all easy to clone as suggested from git. The next step was to install the arm-none-eabi toolchain. Arm has abandoned the standard package-manager, so you need to get this from their website. I fetched the latest one and un-tar'ed it. Now I got several warnings that all could be traced back to the fact that the toolchain was newer than the source. That does not happen if you start with an Eclipse-package and along the way grow into another process.
4. Wait for Theia. This is an upcoming version of Eclipse. You may not realize this, but in many ways VS Code was based on Eclipse technology. This makes it easier to port back. The Theia version of Eclipse supports VS Code Extensions. In this way you might get the best of both worlds.
The "programmers model" deals with processor modes, exceptions, registers and instructions. These subjects have an impact when it comes to how you write your software - and to which degree it can be reused on other MCUs. We have already seen several related architectural concepts and patterns in operation, and there are a few more to come. When reading this chapter, you might want to go back to recap a subject, or jump ahead to learn more about the FPU or MPU. The following is a quick list of these related subjects - there is also the book's Contents and Index.
• Chapter 3: Overall Embedded System Architecture - Von Neumann and Harvard buses, physical memory types and interfaces. First glance at the Cortex A-R-M profiles. Load and Store architecture.

• Chapter 4: Memory Model - Regions and access rights. Bit-Banding and relations to the linker.
• Chapter 5: Reset Vector and Startup concept - with a bit of assembly code and registers.
• Chapter 6: Compiler and linker flags.
History of Cortex-M
For many years Arm had a growing - but backwards compatible - instruction set. This was also called Arm. All instructions in the Arm instruction set were 32-bit wide. Especially for the cost-sensitive microcontrollers, there was a wish to use less memory on code. This led to the birth of the very condensed, 16-bit Thumb instruction set. Note that this 16-bit instruction set is used in the 32-bit architectures, which may seem a bit odd. There is however no law of nature stating that a 32-bit processor must have a 32-bit instruction word. When fetching code from the Code region, the hardware will in any case perform prefetches from Flash, and some Cortex-Ms - like Cortex-M7 - can have real caches. This means that the core will keep a buffer of instructions that are likely to be used in the near future. Smaller instruction-width simply means more instructions in the buffer - or less space needed for the buffer.
The Thumb-instructions were not always flexible and powerful enough, and backward compatibility was still required. CPUs and MCUs were created with the ability to switch between the condensed - and simple - Thumb Mode and the more advanced Arm Mode. Compilers were able to go back and forth between these modes, while programmers typically preferred the Arm instruction set when coding in assembly-language. Cortex A-R-M profiles were introduced in the generation called Armv7. Arm Cortex-R and -A still have the ability to go back and forth between the Arm and Thumb instruction sets. However, to keep the complexity and price down on Cortex-M, Arm created the Thumb2 instruction set as the only one for Cortex-M. Maybe Arm also has information stating that Cortex-A will almost always be programmed in C/C++ and Cortex-R tends to be Model-Driven with auto-generated code, while Cortex-M will still to some degree be hand-coded in assembly-language. Assembly-programmers probably do not enjoy shifting back and forth between instruction-sets. Thumb2 is in a way the "middle-of-the-road" between the Arm and original Thumb instruction sets. With Thumb2, Cortex-M can have a mix of 16 and 32 bit wide instructions - without the above mode switches. Instructions are thus not necessarily aligned on a 32-bit boundary (but always on a 16-bit boundary). We saw features of Thumb2 in Chapter 5. Having instructions of different length is far from a new invention. Classic CISCs - Complex Instruction Set Computers - have always had instructions of one, two or maybe even three word-widths intermixed.
Although all Cortex-M MCUs support Thumb2, few support the full set of instructions. The bigger variants support the same instructions as the smaller ones - and more. E.g., multiplying two 32-bit numbers to obtain a 64-bit number is not supported in hardware on all variants. The optional DSP and FPU components also bring many new instructions into the game. Finally, we have the TrustZone instructions that are only possible with the double-digit Cortex-M versions (Armv8 based). Note that instructions are binary upwards compatible from Armv6-M to Armv7-M and again to Armv7E-M.
An overview of Cortex-M cores is given in Tables 7.1 and 7.2. Note that the first Cortex-M was Cortex-M3 - based on the Armv7-M architecture. The next was Cortex-M1 - based on Armv6-M. Then Cortex-M4 on the Armv7E-M architecture. And so it continued. It seems that the Cortex brand became so popular that it was extended both backwards and forwards - signaling more performance as the number after Cortex-M grows. When it comes to the double-digit Cortex-Ms there is a more systematic growth over time in the model numbers. The "E" in Armv7E-M comes from the optional addition of the DSP. The legend in the two tables is as follows:
• Floating-Point Unit: None or Single/Double/Half Precision.
• Digital Signal Processing Unit and TrustZone: Yes/No/Optional.
• Memory Protection Unit: Max number of Regions; 0 for no MPU.
It may seem counter-intuitive that the newer - and bigger - Cortex-Ms support half-precision on the FPU. This however fits with the vectorized SIMD instructions that apply the same instructions to parallel streams of data - e.g. a stereo signal. In these cases half-precision is sometimes enough, and you can use "both halves" at the same time.
Cortex-M Modes
The programmers model is simpler for Cortex-M than it is for e.g. Cortex-A. We do not need to worry about switching back and forth between Thumb and Arm mode - we already know that on Cortex-M we only have Thumb mode. There are however still some mode-changes - and it's all related to when a program can do this and access that. On reset a program always runs Privileged. This means that the program can access all hardware and all memory - like we saw in Chapter 5. In many bare-metal programs this simply continues, and all operation is privileged. It is however possible to switch to non-privileged operation. This can be done in a bare-metal system, but is more common when an operating system - OS - is used.
A program always runs in either Thread-Mode or Handler-Mode. When executing an Exception the program is in Handler-Mode. Handler-Mode always runs privileged. Normal execution is Thread-Mode. Thread-Mode can be both privileged and unprivileged. See Figure 7.1. When an operating system is utilized, it will typically execute privileged, while the application runs unprivileged. A program can choose to go from Privileged Thread Mode to Unprivileged Thread Mode. This is what happens when an OS starts a normal application thread (in other systems this is termed a task).
Many core-peripherals are only accessible from privileged mode. If an MPU is present, it can be set up to only allow access to peripherals and certain memory from privileged mode. An application running in unprivileged Thread-Mode cannot switch to privileged mode. Instead, the application relies on the services of OS drivers. The application can execute the assembly instruction svc. This is a Supervisor Call that generates an SVC-exception. This will change execution to handler-mode - running privileged. The SVC can now do the things that the program could not. This is what in older Arm - and many other - CPUs is called a Software Interrupt. In other systems a supervisor call may be called a Software Trap. I clearly remember one of the first systems I programmed professionally - a Digital PDP11 - having Trap instructions, so the concept is old and proven. The term "Trap" is commonly used for fault-exceptions (see below).

Figure 7.1: Cortex-M Modes

As stated, the SVC handles the change from unprivileged to privileged mode. The SVC however, also assures a very nice Loose Coupling to the OS. OS and application do not necessarily need to be linked together. The SVC instruction takes an integer-parameter. Thus, the bare minimum shared knowledge between OS and application is an enum declaration for the SVC parameter. The SVC handler can choose to read this integer-parameter from the stack - and thus perform different services. The simplest service is as described above - and hinted in Figure 7.1 - where a driver does the privileged work and returns to where it came from. Basically a function call with improved privileges. It is however also possible to do more advanced stuff. We discuss this in Chapter 11.
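A hedged sketch of the application side - the SVC numbers and names are invented for illustration, and the inline-assembly syntax is gcc's:

    /* Shared between OS and application - the minimal coupling. */
    enum svc_id { SVC_LED_ON, SVC_LED_OFF };

    /* Request a privileged service. The immediate is encoded in the
       svc instruction itself; the handler can dig it out again via
       the stacked PC if it needs to dispatch on it. */
    static inline void os_led_on(void)
    {
        __asm volatile ("svc %0" : : "i" (SVC_LED_ON) : "memory");
    }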
Since the SVC-exception is triggered by a call, we say that it's a Synchronous Exception. It happens in the thread of execution where we want it to happen. There are other exceptions that, like the SVC, are synchronous - although not always as welcome as the SVC. An example is the MemManage Fault, which e.g., may occur if the program attempts to execute out of an address-space marked as "XN" - "Execute Never". If you divide by zero you will run into a Usage Fault. There are more examples of synchronous exceptions. We will discuss SVCs more in Chapter 11.
Contrary to the above, an Interrupt is an asynchronous exception generated by a peripheral - internal or external. It is asynchronous because it can occur at any time - related to the software. We do have some control - we might have started a timer that will interrupt when it counts down to 0 - but we don't know which instruction we will be executing when the interrupt occurs. In the same way we may have started a DMA that transfers data from an A/D-converter to a buffer in the core, but we do not know when we will get an interrupt, due to this buffer being e.g. half full.

When any exception occurs, the core will change to Handler-Mode and execute the relevant handler. The address of this handler is found in the vector-table starting at address 0. See Figure 7.2. As you may remember from Chapter 5, the first entry is the Stack pointer for the reset. After this - from address 4 - we have only handler addresses. On address 4 we have the reset handler, and on address 0x2C we have the SVCall.
The vector table contains addresses for a lot of exception handlers. Those that handle interrupts are also known as ISRs - Interrupt Service Routines. Sometimes we forget to talk about "exceptions" and "exception handlers", and simply talk about "interrupts" and "interrupt-handlers" - even though interrupts strictly speaking are a subset of exceptions. To get from the exception number to the handler address you only have to multiply by 4. In some scenarios Arm uses IRQ-numbers - they are offset by 16 from the Exception numbers, as can also be seen in Figure 7.2. Some exceptions have given priorities, while others can be configured. The negative IRQ numbers are defined by Arm for all Cortex-M4s, while the rest are defined by the silicon-vendor.
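In C terms, the two relations are trivial - a small hedged illustration:

    #include <stdint.h>

    /* Each vector entry is 4 bytes; the table starts at address 0. */
    static inline uint32_t vector_address(uint32_t exception_number)
    {
        return exception_number * 4u;
    }

    /* Arm convention: IRQ number = exception number - 16. */
    static inline int irq_number(uint32_t exception_number)
    {
        return (int)exception_number - 16;
    }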
So what are the privileges of privileged mode? We have seen the Stack-Pointer in action in Chapter 5. In fact - this was the Main Stack-pointer - MSP - which is always used in privileged mode. There is also a Process Stack-pointer - PSP. An OS will use the MSP for itself, but give its tasks access to the PSP - after letting it point to the stack-area designated by the OS for the given task. To the code it is always SP or R13, but a privileged program can choose to alias SP to the MSP or the PSP.
In privileged mode the program has access to a number of system registers that are not available in unprivileged mode. This is e.g. the registers controlling the interrupt controller, the MPU and the Program Status Register.

There are also a few instructions that are limited or non-accessible in unprivileged mode. Finally, the MPU can set up different access rights for instructions running privileged versus unprivileged. A bare-metal program will stay in privileged mode and thus have full access to all memory and all registers, and only use the Main Stack.
Figure 7.2: Vector Table in STM32F334. Courtesy of ST.
Table 7.3: Handler and Thread Modes

  Mode      Used For            Stack             Privilege
  Handler   Interrupt or SVC    Main              Privileged
  Thread    Application         Main or Process   [Un]privileged
Registers and Call Stack
We know that Cortex-Ms are RISC-processors of the load-and-store architecture. This means that the ALU cannot work directly on data in RAM - it needs data via registers. Since all data pass through registers (except for DMA), the registers need to be very fast, and they are central in the architecture - and thereby the instruction set. Even if you program in C, it will be good to understand the basics of registers, as it will improve your debugging skills.
For now, we will look at the Armv7-M core - known from Cortex-M3, -M4 and -M7 - but the other Cortex-Ms are pretty similar. We have 16 basic registers - R0 to R15 - and some Special Registers. Figure 7.3 shows the 16 registers along with some important special registers.
In assembly-instructions, the basic registers can be referenced by their register number - e.g. "R0" or "r0" - but some also have alternative names and may be referenced by these. It makes the code more understandable if you refer to "SP" or "sp" rather than "r13". Some smaller 16-bit instructions can only use the "low registers" - R0-R7 - at the cost of three bits each, while others - together with the 32-bit instructions - also have access to the "high registers" - R8-R15. Table 7.4 shows the 16 registers and their use. Before we dig into this, we need to take a look at the stack.
A stack is in principle a simple concept. You push values on top of the stack that you later pop to use. Just like a pile of books, the first book you can take from the top - pop - is the last one placed at the top - pushed. The pop-action reveals the next book - now at the top. You often see the term TOS - Top Of Stack. A stack is also sometimes called a LIFO - for Last-In-First-Out1. It doesn't really matter that the stack typically grows towards descending addresses - although it can be confusing. The Arm stack is "Full Descending" because when you push, the stack pointer is initially decremented - then used as the address for what is pushed. The memory cell that the SP points to after this action is thus "full". In a load-and-store architecture the values on the stack come from registers and go to registers. A stack/LIFO is very handy when subroutines or functions are called (we will use the term "function"). The stack allows function to call function to call function etc. - and then unwind back to the start.
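In C-like terms, push and pop on a Full Descending stack behave like this small sketch - a simulation, not real stack manipulation:

    #include <stdint.h>

    static uint32_t stack[64];
    static uint32_t *sp = &stack[64];   /* points just past the memory */

    static void push(uint32_t v) { *--sp = v; }    /* decrement, then store */
    static uint32_t pop(void)    { return *sp++; } /* read, then increment  */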
In the old days you could see code where all parameters were stacked "manually" first, then - by the call - the Program-counter, so that this could be popped when returning. You might also see code where all registers were pushed to save their state and later popped - as the called function might also use the registers. There would always be this discussion: what is the responsibility of the calling function - and what is the responsibility of the called function (aka callee)? The only sure thing was that after a function had returned, the stack-pointer should be back where it was before the function was called. Some languages define all these things - C does not. This allows vendors to optimize - and keeps compiler designers busy. Since a stack normally resides in main memory, and since access to main memory can be slow, it is an obvious choice to use registers to transfer function-parameters instead of the stack. It is also obvious to let the compiler only stack registers that actually are overwritten - if the compiler has this knowledge. We often hear that it is a good idea to declare variables in the narrowest scope possible. This makes programs easier to read. It does however, also make life easier for the compiler.
1: As opposed to a queue, which is a FIFO.
Figure 7.3: Cortex-M Registers. Courtesy of ST.
Table 7.4: Registers and their use

  Register  Alt Name  Used For          Description
  R0        -         Arg 1, Result     Caller-Saved, Scratchpad
  R1        -         Arg 2, Result     Caller-Saved, Scratchpad
  R2        -         Arg 3, Result     Caller-Saved, Scratchpad
  R3        -         Arg 4, Result     Caller-Saved, Scratchpad
  R4-R6     -         -                 Callee-Saved
  R7        (BP)      "Base Pointer"    Callee-Saved
  R8-R11    -         -                 Callee-Saved
  R12       IP        Intra-Procedure   Caller-Saved, Scratchpad
  R13       SP        Stack Pointer     MSP or PSP
  R14       LR        Link Register     Caller-Saved, Return Addr
  R15       PC        Program Counter   Exception-Saved
In the Arm Architecture Procedure Call Standard (AAPCS), Arm has defined Calling Conventions for the various generations. It is stated here that the caller for our Cortex-Ms must put up to the first four arguments in registers R0 to R3 - using more than one register for 64-bit variables etc. If there are more arguments than R0-R3 can hold, the extra ones must be pushed onto the stack. It also states that the callee will put a function result in R0 - if there is a result. If the result is 64-bit, R1 is also used, etc. We will skip the details here - after all we are not writing a compiler.
If the caller needs the unchanged content of the registers R0 to R3 after the call has returned, it is up to the caller to push these on the stack - before any parameters. Thus, the callee can freely use R0-R3 (after using any parameters here). If the callee starts using registers R4 to R11, then the callee must first push these on the stack, and after using - but before returning - the callee must pop these registers in the reverse order. In "the old days" the return address was put on the stack, but in the Arm-world the return address is kept in LR - the Link Register. This means that when a function calls a new function, it needs to push the existing LR first. "So what's the gain of having LR if it ends on the stack anyway?" you might ask. Apart from the faster process when calling several functions after each other, only one level down, we get a much nicer benefit when it comes to exceptions. We will see this shortly.
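A small illustration of the convention in plain C - the function is invented:

    /* Per the AAPCS described above: a..d arrive in R0-R3, e is
       passed on the stack, and the result is returned in R0. */
    int sum5(int a, int b, int c, int d, int e)
    {
        return a + b + c + d + e;
    }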
In Table 7.4 I have nicknamed R7 as "Base Pointer". This is not an official Arm name, but a name I picked up from my Intel 80x86 days. 80x86 has a Base Pointer Register. When a function is called, you will often see that the stack-pointer is decremented by a number and then copied to R7. Since the stack grows downwards, this inserts a temporary empty space in the stack that survives further function calls, but is "abandoned" when the function itself returns. This empty space is used for local variables - in C often called "auto-vars". This concept helps functions become reentrant. A function can call itself - or call another function that calls the original function - and there will be independent instances of the local variables. R7 is typically used to point to the "base" of the instance of local variables - the one with the lowest address on the stack. This means that these local variables are easily referenced by R7 plus a specific offset per variable. Naturally the concept requires R7 to be pushed on the stack before it is reused.
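A tiny illustration of reentrancy through stack-allocated locals:

    /* Each invocation gets its own 'n' on the stack - addressed via a
       base pointer such as R7 - so the recursion works. */
    unsigned int factorial(unsigned int n)
    {
        if (n <= 1)
            return 1;
        return n * factorial(n - 1);
    }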
Note that the assembler allows you to push or pop multiple registers in one instruction using curly braces: push {r6, r0-r3, r5} will push r6, r5, r3, r2, r1, r0 - in that order. The concept is that registers are pushed and popped with the highest numbered register at the highest stack address. This means that you can actually copy and paste the push instruction to the bottom of the function and replace push with pop. This makes life easier and avoids stupid mistakes like popping registers in the wrong order². The concept also saves code-space, and the fewer instruction-reads might gain a bit of performance. You can still push and pop “manually” with one instruction per line if you prefer.

² Don't forget to change from push to pop!
Exceptions

Table 7.5: Exception Types
Exception   Sync    Examples
Interrupt   Asynch  Timer, A/D, Reset
Fault/Trap  Synch   Mem, Usage, Bus
Table 7.5 recaps the exception types. Historically, when an exception happens, the stack is used to save not only the old PC, but also the Program Status Register - PSR - containing the ALU-flags (as discussed in Chapter 3). This is necessary as a flag might just have been set for the next instruction to test. The flags are likely to be affected by the code in the exception handler, so pushing the flags onto the stack on entry to the exception, and popping them on exit, is a good solution. When exiting the exception it is also necessary to adjust which priority is allowed for new exceptions.

The above used to mean that returning from a function call required one assembly instruction - often ret - and returning from an exception required another instruction - popping the status register and adjusting the interrupt logic. The latter instruction was often called reti - Return From Interrupt - or iret. This in turn meant that exception handlers had to be written in assembly code. You could wrap a C-function in an assembly prolog and epilog - and many have done so - but it introduces an extra overhead where you really don't want it.
Arm decided that programmers should be allowed to write exception handlers in C. This meant doing some clever things in hardware - affecting the Stack-Frame, as the stack for an exception is called.

An interrupt (and fault exceptions) is basically an unexpected function call without parameters - and a bit of wrapping. In order to abide by the calling convention for normal C-functions, the hardware stores the relevant registers for us - as usual with the highest numbered register first - at the highest address. Note that Figure 7.3 shows the PSR below the 16 normal registers, thus hinting at it as a higher numbered register. This means that the PSR is the first register pushed. It actually is the xPSR, which sort of integrates three shorter, individually addressable, registers as one 32-bit register. The next register to be pushed is the PC, because in exceptions we are going to use the Link Register for something else. We then push the old LR before it is overwritten. Now follows R12, which is sometimes used by the linker as a stepping-stone or Veneer. Finally, the hardware pushes
R3, R2, R1 and R0 - in that order. Each push requires one instruction-cycle.

The exception-handler can now be normal C. An exception-handler should not expect any parameters, but the code can mess around with R0-R3 - as usual. If more registers are to be used, they must be pushed first and popped before returning - as usual. The really clever thing is that even the return can be executed as usual. When the exception happened, the hardware put a special code in the LR - see Table 7.6. We see this value in the debugger's view of the call-stack - see Figure 7.5 where it is 0xffff fff9.

Table 7.6: EXC_RETURN Codes
EXC_RETURN   Mode     Stack  FPU-Ext
0xffff fff1  Handler  MSP    No
0xffff fff9  Thread   MSP    No
0xffff fffd  Thread   PSP    No
0xffff ffe1  Handler  MSP    Yes
0xffff ffe9  Thread   MSP    Yes
0xffff ffed  Thread   PSP    Yes

When the normal return statement is executed, the hardware recognizes this code because it is an invalid address - as can be seen in the memory map way back in Figure 4.1. From the code in LR, the hardware can see that this is a return from an Exception. The hardware will then unwind the stack frame, picking out the PC to return to. The code in LR also tells the hardware whether to move on with the PSP or the MSP, and whether we go back to Thread-mode or stay in Handler-mode (if we were in a nested exception). The stack-frame is shown on the right side of Figure 7.4.

The more complicated left side is what happens when the FPU and its registers are in play. Here Arm has introduced “Lazy Stacking”. In the Floating-Point Context Control Register - FPCCR - you can configure whether you want to enable the use of the FPU in interrupts, by setting the ASPEN flag. This is set by default. When ASPEN is set, the following happens: When - in thread-mode - an FPU-operation is started, the Core sets another flag in the same register. At the time of the exception, if the FPU was used in the application, the Core makes room on the stack for all the FPU-registers - but does not yet copy them onto the stack. It just decrements the SP (much) more than normally. If - now in handler-mode - the Core “meets” an FPU-operation, it checks whether the interrupted application was using the FPU. If so, it now copies the FPU-registers from the application to the allocated stack positions - just in time before using them in the handler.
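As a minimal sketch of how far this goes: with the hardware stacking described above, a handler is just an ordinary C function whose name matches the vector table - here a hypothetical SysTick handler:

#include <stdint.h>

static volatile uint32_t tick_count;

void SysTick_Handler(void)   // the name must match the vector table entry
{
    tick_count++;            // free to use R0-R3 - the hardware saved them
}                            // a normal C return - the EXC_RETURN code in LR does the rest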
Figure 7.4: Cortex-M Stack-Frame - the exception frame with floating-point storage (left) and without (right), from the pre-IRQ top of stack down to R0, including the optional “aligner” word. Courtesy of ST.
Figure 7.4 also shows a dashed “aligner” - an inserted extra unused word. This comes into play if the stack at the time of the exception was not 8-byte aligned, and if the “STKALIGN” bit (bit 9) in the Configuration and Control Register - CCR - is set, specifying that we will abide strictly by the convention. The hardware also handles this for us - at entry as well as on exit from the exception handler.
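If you want to force this behavior from code, the CCR is reachable through CMSIS-Core - a sketch, assuming the device header is included:

SCB->CCR |= SCB_CCR_STKALIGN_Msk;   // insist on 8-byte aligned stack-frames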
In the small Blinky program I have added a few lines of code - see Listing 7.1. The first line enables the exception for User Faults - which Divide-By-Zero falls under. Then I simply perform a divide by zero in line 4 - which is line 112 in the full main.c. The program is compiled without any optimization - to avoid the infamous division by 0 being optimized away. Note that the commented line was necessary for the exception to get through when debugging with VS Code. In STM32CubeIDE there is a “Debug Configurations” dialog, with a “Startup” tab. This dialog has checkmarks that offer exception handling choices, and here Division-By-Zero is one of them.
1 SCB->SHCSR |= SCB_SHCSR_USGFAULTENA_Msk; //Enable User-Fault
2 //SCB->CCR |= SCB_CCR_DIV_0_TRP_Msk; // Set by debugger
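The division itself is not shown above; a hedged sketch of what line 4 could look like - the volatile keeps the compiler from folding the division away:

volatile int divisor = 0;
volatile int result;
result = 100 / divisor;   // triggers the Usage Fault when the trap is enabled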
Figure 7.5: The debugger's view of the call-stack - UsageFault_Handler at the top, with <signal handler called> at 0xfffffff9 below it.
13           SCB->SHCSR |= SCB_SHCSR_USGFAULTENA_Msk; // Enable Fault
14 080001de: ldr r3, [pc, #44]  ; (0x800020c) (SCB)
21 080001ec: str r3, [r7, #4]   ; Local myvar is on stack
24 080001f0: movs r2, #0        ; Prepare Divide By Zero
26 080001f6: str r3, [r7, #4]   ; We never get to this

In the following two lines we see how R7 is used as Base Pointer to local variables - making room for two 32-bit variables. In line 25 the User Fault is triggered by dividing by 0. Note by the way that from the hex-addresses in the left column we can see that almost all the instructions are only 16-bit wide - very efficient. The only 32-bit wide instructions are the branches, the sdiv, and the orr.w. Listing 7.3 shows the exception-handler, which is basically an endless loop.
080004b2: add r7, sp, #0 ; No local vars

Table 7.7 shows the registers when we enter the exception handler. As noted in the “Comment” column, LR shows us that we have an exception from privileged Thread-Mode, the PC is now at the entry of the handler, and the SP points to the stack in Table 7.8. The privileged mode is confirmed by the fact that the SP is aliased to the MSP, while the PSP is 0. We can also see that the value in R7 - our base pointer - is 32 (0x20) more than the SP. This means that eight 32-bit values have been stacked in the stack-frame. This confirms Figure 7.4 with the datasheet stack-frame.
Instructions
Program Status Register
In Section 7.8 we discussed how the PSR - Program Status Register - is a union of three status registers. We now focus on the Application Program Status Register - APSR. The APSR contains the ALU flags that can be affected by arithmetic operations. As discussed in the previous subsection, the flags are set by specific instructions like cmp - while other instructions require a postfix “s” to set the flags. Table 7.9 summarizes the flags. The table deserves a bit more explanation:
• The Negative-flag is set if the result of an operation was negative.
• The Zero-flag is set if the result of an operation was zero.
• The Carry-flag is set if:
  - An addition resulted in a “wrap”. In other words, the result does not fit in 32 bits.
  - The result of a subtraction is non-negative (0 or positive). As stated earlier, this is against my experiences with other cores.
  - The last bit shifted or rotated out (left or right) is 1.
• The Overflow-flag is set if:
  - Adding two positive words results in a negative number.
  - Adding two negative numbers results in a positive number.
  - Subtracting a positive number from a negative number results in a positive result.
  - Subtracting a negative number from a positive number results in a negative result.
• The Q-flag is set by DSP arithmetic instructions. It is “sticky” - it requires a special instruction to be reset. The idea is that you may send data through a chain of operations, and at the end you can see if an overflow occurred along the way. If saturation was selected, you technically do not have an overflow - instead there was a saturation.
Table 7.9: Status Flags in APSR
Flag  Bit  Name       Description
N     31   Negative   Set if result is negative
Z     30   Zero       Set if result is zero
C     29   Carry      Set if Carry or Borrow
V     28   Overflow   Set if signed overflow
Q     27   DSP / Sat  Set if DSP Saturates or Overflows

Although Table 7.9 states the bit-positions of the flags in the APSR, you rarely need these. Normally branch instructions use the flags as arguments to the branch - in the shape of Condition Codes. It can be something like beq - “Branch if EQual” - which branches if Z=1. It can, however, also be seemingly simple conditions that behind the scenes use more than one flag, like e.g. bgt - “Branch Greater Than” - which branches if Z=0 and N=V. This allows you to think more like a human when you program, and you don't have to worry about how the carry works. A short example:

subs Rd, Rn, op2
Here “Op2” can be a register or an immediate value (aka constant) between 0 and 4095. Due to the postfix “s” this sets the flags - but what is subtracted from what? Rd is the “destination” register. It helps me to read the code like a C statement:

Rd = Rn - Op2;

Thus, a following bgt will branch if the content of Rn was greater than the content of Op2 (if Op2 is a register, or Op2 itself if it is a constant).
Branches
Even if you do not write assembly code but only debug with C and assembly mixed, it is beneficial to recognize the different changes in flow - branching - as well as conditional instructions. In my experience, most assembly languages have one name for calling a subroutine/function and a completely different name for the simpler jumps used in loops. As an example, 80x86 uses call for functions - and jmp for unconditional jumps, with e.g. jnz for a conditional “jump-not-zero”. I have also seen languages using the term “branch” for one of the above changes in the program flow, but until I started working with Arm I had not seen “branch” used for both. As explained in Chapter 7, Arm even uses “branch” for function return. Every change of flow is a branch in the Arm-world. When you are used to C-programs, where there is a very clear distinction between calls and loops, the first meeting with Arm's assembly languages is confusing. You cannot see the difference between a function call, a loop and a function return. Table 7.10 shows the types of branches and their main use. Note that “Rm” here is any register R0-R15.
Table 7.10: Branches
Instruction    Summary                     Usage             Dist
b label        Branch immediate            Jump to label     +/- 16 MB
bl label       Branch immediate with link  Call function     +/- 16 MB
bx Rm          Branch indirect             Return from func  All 4 GB
blx Rm         Branch indirect with link   Call by pointer   All 4 GB
cbz Rm, label  Compare, branch if zero     Loop exit         Forward only
The branch opcode can be followed by a Condition Code. These are described in the previous subsection. Especially the “Less or Equal” condition-code on an immediate branch can be confusing, as it becomes ble - not using LR (Link-Register) like the bl. Good thing that there is no “L” condition code. The simple branch to a label - which is normally a jump - can only jump +/- 16 MB. If there is a condition on the jump, it can only jump +/- 1 MB. Surely enough for any loop I have seen.
The table only shows the most common usage. bx Rm becomes a return if the register is a correct LR - like bx lr - or contains the original value of LR. If the function prolog saved LR like e.g. push {r3,lr}, the return could also be done with pop {r3, pc}. I actually see this pattern a lot in gcc-compiled code. Note that this works because lr and pc are r14 and r15, and using one instead of the other does not change its push/pop order relative to e.g. r3. blx Rm is convenient with function pointers - e.g. in function-tables or when initializing callbacks.
The conditional branches use the condition codes. This sounds like a given thing, but actually is only correct for the branch instructions starting with “b”. The cbz - “Compare and Branch on Zero” - and cbnz are immediate instructions. They test the content of the given register and immediately branch (or not) - without affecting the flags. This is faster than first setting the flags and then branching - and the flags survive these immediate branch instructions.
Listing 7.4: An IT-block
1 CMP R1, R2    ; Set the flags for R1-R2
3 ITTTE LT      ; Three instr for IF Less-Than and one for ELSE
4 MOVLT R3,#0   ; lt must match 1'st T
5 MOVLT R4,#1   ; lt must match 2'nd T
6 ADDLT R5,R4   ; lt must match 3'rd T
7 MOVGE R3,#1   ; ge must match the E
You may come across instructions like TBB [Rn, Rm]. This is the Table Branch Byte instruction, which is great for switch-statements. In this case Rn is a pointer to a table and Rm is an index into the table. Rn can be the PC, in which case the value used is the PC right after the instruction. I suggest that PC-based tables are left to compilers. As can be expected, TBB works on a table of bytes. There is a similar TBH that works with half-word tables for longer distances. Since Thumb-addresses are always even, the value in the table is multiplied by two. Thus, the table-values should be half the real offset.
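You typically meet TBB through the compiler rather than writing it yourself: a small, dense switch like the sketch below (my example) is a natural jump-table candidate - whether TBB/TBH is actually emitted depends on the compiler and optimization level:

#include <stdint.h>

uint32_t decode(uint32_t opcode)
{
    switch (opcode & 0x7u) {   // small, dense range - jump-table friendly
    case 0:  return 10;
    case 1:  return 20;
    case 2:  return 30;
    case 3:  return 40;
    case 4:  return 50;
    default: return 0;
    }
}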
Conditional Instructions
For a RISC, Cortex-M cores certainly have many instructions. So far, we have seen many ways of controlling the program flow, but there is more. A very compact way of doing conditional actions is by not jumping at all - thus not disrupting the pipeline. Arm has boiled down the “if-then-else” to a concept where you can have one to four instructions after each other - each being an “If Then” or “Else” of a common initial condition. Listing 7.4 shows the concept with the maximum number of statements in the IT-block. In line 1 the condition is evaluated - setting the flags. In line 3 we have the defining instruction that tells us that there will be three “THEN” instructions, followed by an “ELSE” instruction - using the “Less-Than” condition. Note that the three “THEN” instructions all postfix the “LT”, while the “ELSE” postfixes a “GE” - where “Greater-or-Equal” is the opposite of “Less-Than”. If an instruction in the IT-block changes the flags, the new flags are used in subsequent instructions.
Exclusive Instructions
In Chapter 4 we saw Bit-Banding, which can be used to access single bits in selected parts of memory and peripherals. I also mentioned that Bit-Banding can be used for Mutexes. A Mutex is used to protect one or more variables so that one Task can perform a Read-Modify-Write without another task interfering. As we know, Cortex-M is of the Load-And-Store type - meaning that if we e.g. want to increment a counter in memory, we need to load it into a register, increment it, and store it back where it came from. In a preemptive system, where tasks run the risk of being taken off the CPU to make room for other tasks - as well as interrupts in general - there is always the risk that task A does the read part of the Read-Modify-Write and then gets interrupted by task B or an interrupt, which does its own Read-Modify-Write. When task A gets the CPU again, it might do its modify and then overwrite the counter (or whatever variable). This problem is as old as embedded systems. The “safe” solution has always been to disable interrupts (if they are enabled) and after the Read-Modify-Write re-enable interrupts again. With unprivileged mode this does not work. Some CPUs - like the Motorola 6805 - solved the problem by allowing memory-based variables to be incremented or decremented directly. Many CPUs have a Test-And-Set instruction for creating a Mutex - Cortex-M goes a bit further with Exclusive Load and Store instructions. The Exclusive instructions are used in Subsection 12.9.2 to implement a “Lock”.
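A minimal sketch of the pattern, using the CMSIS intrinsics we meet in Listing 8.2 - __STREXW returns 0 when the store succeeded (the function name is mine):

static void atomic_increment(volatile uint32_t *counter)
{
    uint32_t value;
    do {
        value = __LDREXW(counter) + 1;        // exclusive read, modify in a register
    } while (__STREXW(value, counter) != 0U); // retry if exclusivity was lost
}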
Barriers
Modern cores have sequential pipelines, and things going on in parallel, in order to give us the high speed we are used to. This unfortunately means that we as programmers sometimes need to help the core. There are instructions that need to be completed before the next instruction is read or executed. This could e.g. be related to changing the setup of the MPU. Table 7.11 is a short version of the barriers, with columns for the assembly version of the instruction, the C intrinsic, and a brief description. This is a good place to consult the datasheet for your specific Cortex-M.

Table 7.11: Barriers
DMB  __DMB()  All explicit memory accesses before the DMB instruction are observed before any explicit memory accesses after the DMB instruction
DSB  __DSB()  As DMB - but also synchronizes instructions
ISB  __ISB()  Flushes instruction pipeline and re-reads cache
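As a sketch of the MPU case mentioned above - assuming an MPU-equipped device with the CMSIS header included - the recommended pattern is a DSB/ISB pair after the configuration write:

MPU->CTRL = MPU_CTRL_ENABLE_Msk;   // enable the MPU (setup simplified)
__DSB();                           // wait for the write to complete
__ISB();                           // flush the pipeline - fetch with the new settings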
Alignment
Single Words and Half-words
We say that a read or write access is aligned if it is performed on an address that is a multiple of the size of the access - in other words, the address modulo the access size is zero.

Some cores - like Cortex-M0 and Cortex-M1 - can only handle aligned accesses, and a “Trap” occurs on unaligned accesses. This gives the core a chance to handle the access in software. A compiler for such systems should only generate aligned accesses.

The Arm Cortex-M3 and upwards, however, do support unaligned accesses. Each access is split up into multiple aligned accesses, and bits are shifted to fit the bill. This is not as such very fast - but not having to always handle corner-cases can be beneficial. Another issue is that the resulting unaligned access is not atomic. We know that a read-modify-write needs a bit of handholding, but with an unaligned access there is the risk that a word - or even a half-word - is read (or written) with intermediate access from another thread or interrupt. The compiler and linker assure that e.g. a word-array is word-aligned. I recently experienced the following scenario on a Cortex-M7:
1. In the C-source there was a memcpy to a char array. The memcpy would copy a fixed byte-sequence - always 7 bytes long - into this byte array.
2. When executed, this code provoked a Hard Fault.
3. When I examined the generated assembly code, there was no call to memcpy. The gcc compiler had optimized the memcpy call away at optimize level “-Og” - which is “optimize for debug”. Instead of a memcpy copying 7 bytes, the compiler had in-line inserted a word-copy, a half-word-copy and a byte-copy. The first two were unaligned, and thus the exception occurred on the first.
4. The Armv7 core can be configured to cause an exception - Unaligned Trap - on unaligned access. This generates a Usage Fault - but only if this exception is enabled. If not, it is promoted to a Hard Fault.
Note that the code that provoked this scenario was completely compiler-generated. To avoid getting into this situation you have three alternatives:

1. Use the compiler option “-mno-unaligned-access” (in gcc) to assure that there will be no unaligned accesses.
2. Use e.g. SCB->CCR &= ~SCB_CCR_UNALIGN_TRP_Msk to disable the unaligned trap. The use of the SCB struct is explained in Chapter 8.
3. Enable the user-fault exception, catch the unaligned trap and fix it in software. I don't consider this a very viable option though.
I did test the first two alternatives, and they both worked nicely. In the end I realized that the trap for unaligned access is normally disabled, but had been enabled by a simple checkmark in the “Startup” tab of the “Debug Configurations” dialog. You may want to enable this trap if you generate code that is also used on e.g. Cortex-M0. The listing below shows how to enable/disable the unaligned trap - and a test that generates an exception when the trap is enabled.
//SCB->CCR &= ~SCB_CCR_UNALIGN_TRP_Msk; // Disable unaligned trap
SCB->CCR |= SCB_CCR_UNALIGN_TRP_Msk;    // Enable unaligned trap

uint32_t myArray[2] = {0};
uint32_t *myPtr = myArray;
myPtr = (uint32_t *)((char *)myPtr + 2); // myPtr is now unaligned
*myPtr = 42;                             // word access at odd offset - traps when enabled
Note that even though I stated that Cortex-M3 and upwards do handle unaligned accesses, this is only correct for the simple cases of single word or half-word accesses. You will still get an exception for multi-word accesses or load-and-store-exclusive on unaligned addresses. Similarly, accesses to strongly ordered peripherals must be properly aligned. The compiler will not generate any such things - with or without the option stated above.
Structs
C-structs are used everywhere and deserve some attention. Take the following simple struct:

struct MyStruct
{
    char enable;
    int offset;
    char mode;
    short vers;
    char flag;
};

On a 32-bit system like a Cortex-M, this struct seems to occupy 9 bytes (1+4+1+2+1). There is, however, a good deal of padding involved. The struct in reality looks like this - with address offsets included:

struct MyRealStruct
{
    char enable;            // offset 0
    char pad1, pad2, pad3;  // offsets 1-3
    int offset;             // offset 4
    char mode;              // offset 8
    char pad4;              // offset 9
    short vers;             // offset 10
    char flag;              // offset 12
    char pad5, pad6, pad7;  // offsets 13-15
};

The first three paddings are needed to assure that the int “offset” lands on a proper 4-byte boundary. Pad4 is needed to assure that the short “vers” lands on a 2-byte boundary. Finally, the last three paddings are there to assure that if the struct is used in an array, the next element will land on a decent 4-byte boundary - as the first element does. Thus, in reality the simple struct does not occupy 9, but 16 bytes.
We can arrange for data to be packed - like in the following:

#pragma pack(push,1)
struct MyPackedStruct
{
    char enable;
    int offset;
    char mode;
    short vers;
    char flag;
};
#pragma pack(pop)

The packing assures that there is no padding. Unfortunately, we now have alignment problems, and performance goes down. Packing can be a good choice if we have very little space and don't need to worry about alignment - or if we need the struct to map to a given message-format.
A really good option is to manually declare the struct with decreasing sizes. This brings the used bytes down to 12 - saving 25 % - and keeps alignment. Again with the padding shown:

struct MySortedStruct
{
    int offset;             // offset 0
    short vers;             // offset 4
    char enable;            // offset 6
    char mode;              // offset 7
    char flag;              // offset 8
    char pad1, pad2, pad3;  // offsets 9-11
};

The optimal solution in this case may be to look at the contents of the struct. We have an enable field as well as a flag field. It may be possible to use bits in a single char (or uint8_t). We could use bit-fields or manually mask - as we have seen many times with the registers in CMSIS-Core. This would do away with one char and three byte-paddings - bringing the resulting size down to 8 bytes (not shown). Thus, we have shrunk our structure to half its original size.

You can test the above at runtime by printing sizes with sizeof. It is, however, even simpler to check the Map-file.
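A hypothetical runtime check could look like this (struct names as above):

printf("MyStruct: %u bytes\n",       (unsigned)sizeof(struct MyStruct));        // expect 16
printf("MyPackedStruct: %u bytes\n", (unsigned)sizeof(struct MyPackedStruct));  // expect 9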
Overview of CMSIS
CMSIS stands for “Common Microcontroller Software Interface Standard”. It is an Arm concept and therefore only common for Arm products. Based on origin and usage, CMSIS is grouped into sub-categories. The structure can sometimes be a bit unclear - partly because it targets end users as well as silicon vendors - partly because the term HAL - Hardware Abstraction Layer - is overused. “HAL” originates in the domain of bigger CPUs with paged virtual memory. In these CPUs there is no way - ever - that a user program can access hardware directly. All hardware is accessed via OS drivers. In our microcontroller world the HAL is vaguely related to the above. I see the following abstraction levels in the hardware support:
• At the very bottom - just above the hardware - we have h-files defining the addresses of hardware-registers, as well as macros with masks for setting and clearing various bit-positions. Performance- and architecture-wise, using this is no different from using absolute addresses and hex-values when accessing registers. It is, however, much easier, less error-prone and more maintainable. The h-files typically contain C-structs for accessing register-groups that belong to specific internal peripherals. The C-structs may include bit-fields (which can affect performance) and enums when this makes sense. The fact that modern editors have intellisense is a big help here. Clearly, these basic files are structured one-to-one as the hardware is structured. We saw a few samples of register accesses based on C-structs in Listing 7.1 in Chapter 7.
• Using a mixture of c- and h-files, a small layer of access-functions is defined. These are often defined as macros - and therefore found in h-files - because they may be used from interrupts and other critical places where a function call with lower performance is unacceptable. These functions are typically very short, and they too follow the structure of the hardware. There is no RTOS-awareness at this level. These files are often prefixed with “LL” for “Low-Level”.
• On top of the access layer we see something most vendors call “Drivers”. These drivers are much more generic than the lower layers. Where the lower layers are structured according to the peripherals and their registers, these drivers are structured according to the functionality an application developer is expected to need. By using the lower-level access-functions, it is possible for a device-family driver to work on specific devices. These drivers can cooperate with an RTOS - e.g., using callbacks - and you can choose whether you want to poll, use interrupts or even DMA. The downside of using generic drivers is that they may use more codespace and CPU-cycles than a simpler implementation - with less flexibility. In many cases you can save a lot of development time using these drivers in less critical places, and use your energy where it matters most. You might e.g., write your own interrupt services, but subscribe to a vendor's initialization functions.
• On top of the drivers we see hardware-independent middleware like a USB- or TCP/IP-stack, a zero-copy message system or an SQLite database. In larger CPU-systems, middleware tends to be swallowed by the OS.
• At the top we have our software. It can work through drivers - with or without an RTOS - or it may be a “bare-metal” implementation that only uses the two bottom layers - or maybe even only the low-level access-functions. You can also use the vendor-supplied drivers as inspiration.

In different contexts I have seen everything below the middleware termed “HAL”. Armed with a bit more terminology, we can now take a look at CMSIS.
• CMSIS-Core
We have talked about the “Core” as the part of a Cortex-M device that is completely defined by Arm - the “inner CPU” - like we saw back in Figure 3.3. It makes a lot of sense that Arm delivers files for their part of the hardware. This is CMSIS-Core. Thus, the term “Core” here is not just something central and important - it literally targets the MCU-Core. CMSIS-Core thus contains the register-abstractions and access-functions for the core and its peripherals.
For many years, MCUs have also come with intrinsic functions. These are basically small assembly language functions for special use from C. A programmer can see them as C language-extensions - hence the term “intrinsic”. This covers mode-changes like enabling or disabling interrupts, atomic access constructs, and low-level things like rotate (C has shift, but not rotate). Since intrinsic functions invariably involve assembly language, there may be variations for different compilers. Thus, I think it is safe to say that the difference between an intrinsic function and an access-function is that the intrinsic function is focused on what you simply cannot do in normal C. An access-function may use an intrinsic function, plus more - like masking values, OR'ing with other values, and writing to a register. In some Cortex-Ms we also have the SIMD-instructions - leaving full DSP algorithms to CMSIS-DSP. Finally, the “Startup” file is included. We saw this in Chapter 5. The Startup calls some initialization functions that vendors must supply.
• CMSIS-Driver
Silicon vendors integrate the Core in their design and add a bunch of internal peripherals. We saw an example of this in Figure 3.4. These peripherals also need register-definitions with access-functions, in files similar to the CMSIS-Core ones. On top of these access-functions, the vendors, however, also supply their own larger drivers. The CMSIS-Driver files typically contain the name of the board in question in their file-path. Sometimes there are files for a suite of products and other files with additions for board-families or even specific boards. A family-file may define a suite of registers with structs for access, while the specific board-file defines the base addresses of the registers. If you make your own board, you can start with the “nearest” vendor board using the same MCU.
• CMSIS-DSP
We have seen that Cortex-M4 and upwards include DSP-capabilities. To help C-programmers use all the new facilities and instructions, new h-files were created. These may support “Neon” or “Helium” or SIMD-instructions. More importantly, CMSIS-DSP also includes math libraries that utilize floating-point and basic integer algorithms. We will meet these in Chapter 9.
• CMSIS-NN
This is support for Neural Networks, which is out of the scope of this book.
• CMSIS-RTOS v1 and CMSIS-RTOS v2
This is the interface to third-party real-time operating systems. As can be expected, v2 is simply a newer version of CMSIS-RTOS v1. We deal with this in Chapter 13.
• CMSIS-SVD
This category contains the System View Descriptions that are used by debuggers to present information. As a developer you typically need not bother with this - as long as the names and values presented in the debugger UI match the documentation.
• CMSIS-DAP
This relates to the Coresight Debug Access Port. It controls JTAG and the Serial-Wire Viewer. We see a subset of this in Chapter 13.
• CMSIS-Zone
This is definitions related to the Zone-concept introduced with the Cortex-M33. This is out of the scope of this book.
• CMSIS-Toolbox
This is about building projects from the command-line. The toolbox can be found on GitHub.
• CMSIS-Pack
This is vendor tools for building the packs that we use as developers.
Another way to show the various parts of CMSIS is to map them to the hardware block-diagram - see Figure 8.1. This mapping may not be 100 % precise, but it may help you get an overview. Table 8.1 shows a few named files, the CMSIS-category they belong to, and their content. In the following I will present snippets of code from some of the files named in Table 8.1. This will give a better understanding of what goes where. It might also serve as general inspiration.
CMSIS-Driver (From Silicon Vendor)
- LL: Low-Level Register-Level Access Functions
- Intrinsic Functions (Interrupts, Barriers etc)
- Core Register Access Functions & Masks
- API for Semaphore/Mutex/Lock etc
- Wrappers for selected RTOSes
Figure 8.1: Loosely Mapping CMSIS to Hardware
Table 8.1: Selected CMSIS Files
Category  File                  Content
Core      core_cm4.h            NVIC, SCB, SysTick, Debug, MPU, FPU, xPSR for Cortex-M4
Core      core_cm7.h            As above - but for Cortex-M7
Core      cmsis_gcc.h           gcc-specific intrinsic functions
Core      cmsis_armclang.h      As above - but for Clang
Driver    stm32f3xx_hal.h       Atomic Functions etc.
Driver    stm32f334x8.h         Huge! Datastructures, bit definitions and macros
Driver    stm32f3xx.h           Atomic Functions for the device family
Driver    stm32f3xx_ll_gpio.h   Low-Level GPIO access functions for STM32F3xx boards
Driver    stm32f3xx_hal_gpio.c  High-level GPIO functions
CMSIS-Core
Listing 8.1 shows a small part of core_cm4.h, with condensed comments and a small font. The first part is the struct for the SCB - System Control Block - registers. When you read the documentation, these registers are described “on their own”. In other words - you may not realize that a register is part of the SCB. In the same way you may not realize that the “ETM” - Embedded Trace Macrocell - registers are a part of the TPIU - Trace Port Interface Unit. It makes sense when you know it, but it can be difficult to catch when you are browsing tons of documentation.

Below the SCB we find a lot of “defines” for positions and masks - I only show the ones for the PendSV bit. Next, we see the base address definitions with pointer declarations. Further down I have included an access function for enabling a specific core-interrupt. At the very bottom we see ITM_SendChar - used in Chapter 13.
As stated earlier, CMSIS-Core also contains what we might call language-extensions. Listing 8.2 shows a bit of cmsis_gcc.h, which provides a lot of intrinsic functions. First, we see the generic macro for inline functions. Arm uses __STATIC_FORCEINLINE everywhere and has defined compiler-specific implementations of it. This allows most of the code to be the same. We see the intrinsic function for enabling interrupts from C and for setting the Base Priority. The latter allows you to disable interrupts with a lower priority than a given threshold. You can use this to stop task-switching, while still allowing interrupts.

Next we have the two exclusive load and store instructions that are discussed in Chapter 7 and used in Section 12.9 to create a lock. They also come in byte and half-word versions.
Listing 8.1: core_cm4.h
typedef struct
{
  __IM  uint32_t CPUID;     // Offset: 0x000 (R/ ) CPUID Base Register
  __IOM uint32_t ICSR;      // Offset: 0x004 (R/W) Interrupt Control and State Register
  __IOM uint32_t VTOR;      // Offset: 0x008 (R/W) Vector Table Offset Register
  __IOM uint32_t AIRCR;     // Offset: 0x00C (R/W) Appl Interrupt and Reset Ctrl Reg
  __IOM uint32_t SCR;       // Offset: 0x010 (R/W) System Control Register
  __IOM uint32_t CCR;       // Offset: 0x014 (R/W) Configuration Control Register
  __IOM uint8_t  SHP[12U];  // Offset: 0x018 (R/W) Sys Hndlrs Prio Regs 4-7,8-11,12-15
  __IOM uint32_t SHCSR;     // Offset: 0x024 (R/W) Sys Hndlr Control and State Register
  __IOM uint32_t CFSR;      // Offset: 0x028 (R/W) Configurable Fault Status Register
  __IOM uint32_t HFSR;      // Offset: 0x02C (R/W) HardFault Status Register
  __IOM uint32_t DFSR;      // Offset: 0x030 (R/W) Debug Fault Status Register
  __IOM uint32_t MMFAR;     // Offset: 0x034 (R/W) MemManage Fault Address Register
  __IOM uint32_t BFAR;      // Offset: 0x038 (R/W) BusFault Address Register
  __IOM uint32_t AFSR;      // Offset: 0x03C (R/W) Auxiliary Fault Status Register
  __IM  uint32_t PFR[2U];   // Offset: 0x040 (R/ ) Processor Feature Register
  __IM  uint32_t DFR;       // Offset: 0x048 (R/ ) Debug Feature Register
  __IM  uint32_t ADR;       // Offset: 0x04C (R/ ) Auxiliary Feature Register
  __IM  uint32_t MMFR[4U];  // Offset: 0x050 (R/ ) Memory Model Feature Register
  __IM  uint32_t ISAR[5U];  // Offset: 0x060 (R/ ) Instruction Set Attributes Register
        uint32_t RESERVED0[5U];
  __IOM uint32_t CPACR;     // Offset: 0x088 (R/W) Coprocessor Access Control Register
} SCB_Type;                 // System Control Block Registers
#define SCB_ICSR_PENDSVSET_Pos 28U
#define SCB_ICSR_PENDSVSET_Msk (1UL << SCB_ICSR_PENDSVSET_Pos)

#define SCB ((SCB_Type *) SCB_BASE) // SCB configuration struct

__STATIC_INLINE void __NVIC_EnableIRQ(IRQn_Type IRQn)
{
  NVIC->ISER[(((uint32_t)IRQn) >> 5UL)] = (uint32_t)(1UL << (((uint32_t)IRQn) & 0x1FUL));
}

__STATIC_INLINE uint32_t ITM_SendChar (uint32_t ch)
{
  if (((ITM->TCR & ITM_TCR_ITMENA_Msk) != 0UL) &&  /* ITM enabled */
      ((ITM->TER & 1UL) != 0UL))                   /* ITM Port #0 enabled */
  ...
}
Listing 8.2: cmsis_gcc.h
#define __STATIC_FORCEINLINE __attribute__((always_inline)) static inline

__STATIC_FORCEINLINE void __enable_irq(void)
{
  __ASM volatile ("cpsie i" : : : "memory");
}

__STATIC_FORCEINLINE void __set_BASEPRI_MAX(uint32_t basePri)
{
  __ASM volatile ("MSR basepri_max, %0" : : "r" (basePri) : "memory");
}

__STATIC_FORCEINLINE uint32_t __LDREXW(volatile uint32_t *addr)
{
  uint32_t result;
  __ASM volatile ("ldrex %0, %1" : "=r" (result) : "Q" (*addr) );
  return(result);
}

__STATIC_FORCEINLINE uint32_t __STREXW(uint32_t value, volatile uint32_t *addr)
{
  uint32_t result;
  __ASM volatile ("strex %0, %2, %1" : "=&r" (result), "=Q" (*addr) : "r" (value) );
  return(result);
}
CMSIS-Driver - Low-Level
Now we turn to the vendor-supplied things in CMSIS-Driver. Let us take a look at how GPIO is defined at the low level in the sample-board - the STM32F334R8. GPIO is short for General Purpose Input/Output. GPIO is one of the most basic peripherals in an MCU. This will benefit us in Chapter 13, where we use GPIO for an interrupt. A selected part of the LL - Low-Level - GPIO functions is shown in Listing 8.4. This listing is also shortened a bit - e.g., by moving curly braces after the final “;” in functions. While we look at the low-level functions, we need to keep an eye on the base address definitions and the typedef for GPIO. These are found in the file stm32f334x8.h - from which selected lines are shown in Listing 8.3. It is also in this file that we see the base-addresses for Flash, RAM and Peripherals - as well as the Bit-Band addresses for the latter two. These should be recognizable from the Memory-Map back in Figure 4.2. Stepping through the registers in the typedef in Listing 8.3, we have:
• MODER: GPIO is typically used as single-pin I/O, but is often grouped in Ports - here with 16 bits. This allows us to use e.g. one port as a 16-bit wide data-port. In modern MCUs the limiting factor is not so much the internal functionality as the pins available - we will therefore often strive to use serial I/O instead of parallel. We do what we can to utilize each and every pin to the utmost. This is the pin-multiplexing described in Chapter 3. That is why we have the Mode-register - allowing us to configure ports as Input/Output/Alternate/Analog. These choices are found in stm32f3xx_ll_gpio.h in Listing 8.4. With four choices we need two bits per pin - fitting nicely with 32 bits to configure 16 pins. LL_GPIO_SetPinMode does the job here.
• OTYPER: The output-type gives us two choices per pin (when it is used as output). We see in Listing 8.4 that the choices are Push-Pull and Open-Drain. “Push-Pull” simply means that the pin will drive (push and pull) the outgoing lane on the PCB low or high for respectively 0 or 1. “Open-Drain” means that the output can be connected to other outputs - something that is normally a bad idea with push-pull outputs. With open-drain and pull-up, a 0-bit will connect the output low. A 1-bit will, however, keep the output floating. In the system there is a pull-up resistor assuring the high level - when all outputs are set to binary 1. The resistance is low enough to tie the lane high when all outputs are floating, but also high enough to assure that when a single output is set low, this output can pull down the lane without pulling too much current through the pull-up resistor. We basically now have an AND function of the outputs. You can also have pull-down, in which case 0 and low is the passive state, and a single 1 or high will cause the lane to go high - like an OR-function. This concept can e.g. be used when several slaves are sharing an SPI-bus: the active slave drives the line, if all slaves are configured to open-drain and only one output is active. LL_GPIO_SetPinOutputType does the job.
• PUPDR: With the above circuitry, we can select between Pull-Up, Pull-Down and nothing. LL_GPIO_SetPinPull is helpful here.
• OSPEEDR: You can set the output speed - the slew-rate of the digital transitions on the output. You normally want this low, as fast flanks generate EMC noise in the system. LL_GPIO_GetPinSpeed shows how to read the speed configuration. All the other settings also have “get” functions.
• IDR: LL_GPIO_ReadInputPort reads the entire 16 bits from the port into the low half-word.
• ODR: LL_GPIO_WriteOutputPort writes all 16 bits from the low half-word.
• BSRR: To set a single output bit, you might expect that we would need to read the current output, OR with a bit at the right position, and write it back. This can, however, cause problems, as we know that Read-Modify-Writes can be interrupted. Read-Modify-Writes also take time. Here the BSRR comes to assistance. BSRR stands for Bit-Set/Reset-Register. A bit set (1) in one or more of the lower 16 bits will set the corresponding pin. A bit set (1) in one or more of the upper 16 bits will reset (clear) the corresponding pin (subtract 16 from the bit-number to get the pin). Bits that are not set (0) will change nothing. LL_GPIO_SetOutputPin is used here - see the sketch after this list.
• LCKR: By sending a sequence of words, a program can lock the configuration of a given port - providing extra security.
• AFR: These registers are used when a pin can be configured for other purposes. The possible values depend on the specific pin.
• BRR: This register can clear from 0 to 16 bits. It is redundant when we have the BSRR.
Only very few access-functions from stm32f3xx_ll_gpio.h are shown in Listing 8.4. There are functions for reading and writing almost all registers. Except for LL_GPIO_TogglePin, they all appear to either read or write in a single line. The MODIFY_REG macro does, however, also perform Read-Modify-Writes. We meet this in the next section.
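Putting a few of the functions together - a sketch that configures pin 5 of port A as a push-pull output, and then sets and clears it via the BSRR (identifiers as in the listings below):

LL_GPIO_SetPinMode(GPIOA, LL_GPIO_PIN_5, LL_GPIO_MODE_OUTPUT);
LL_GPIO_SetPinOutputType(GPIOA, LL_GPIO_PIN_5, LL_GPIO_OUTPUT_PUSHPULL);

GPIOA->BSRR = GPIO_BSRR_BS_5;   // set pin 5 - a single write, no Read-Modify-Write
GPIOA->BSRR = GPIO_BSRR_BR_5;   // clear pin 5 via the upper half-word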
Listing 8.5 shows selected parts of stm32f3xx.h. This file is used for a whole family of boards and contains the MODIFY_REG used in the previous section. The interesting part here is, however, the atomic macros that are used for Read-Modify-Write. We have seen that e.g. GPIO has built-in ways to change single bits, but that is not always possible. Here we see - again - the exclusive load and store in action, this time to create atomic bit-set and bit-clear. Also note the use of do { ... } while (0) in the macros. This is the only way to create a macro that can be used in any C construct.
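Since the full listing is long, here is a sketch of the shape of such a macro, modeled on ST's ATOMIC_SET_BIT - the macro name is mine and the exact vendor code may differ:

#define MY_ATOMIC_SET_BIT(REG, BIT)                              \
  do {                                                           \
    uint32_t val;                                                \
    do {                                                         \
      val = __LDREXW((volatile uint32_t *)&(REG)) | (BIT);       \
    } while (__STREXW(val, (volatile uint32_t *)&(REG)) != 0U);  \
  } while (0)   /* swallows the trailing ';' in any C construct */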
Listing 8.3: stm32f334x8.h
 1 typedef struct
 2 {
 3   __IO uint32_t MODER;    // GPIO port mode reg,
 4   __IO uint32_t OTYPER;   // GPIO port output type reg,
 5   __IO uint32_t OSPEEDR;  // GPIO port output speed reg,
 6   __IO uint32_t PUPDR;    // GPIO port pull-up/pull-down reg,
 7   __IO uint32_t IDR;      // GPIO port input data reg,
 8   __IO uint32_t ODR;      // GPIO port output data reg,
 9   __IO uint32_t BSRR;     // GPIO port bit set/reset reg,
10   __IO uint32_t LCKR;     // GPIO port configuration lock reg,
11   __IO uint32_t AFR[2];   // GPIO alternate function regs,
12   __IO uint32_t BRR;      // GPIO bit reset reg,
13 } GPIO_TypeDef;
19 #define SRAM_BB_BASE 0x22000000UL
20 #define PERIPH_BB_BASE 0x42000000UL
24 #define APB1PERIPH_BASE PERIPH_BASE
25 #define APB2PERIPH_BASE (PERIPH_BASE + 0x00010000UL)
26 #define AHB1PERIPH_BASE (PERIPH_BASE + 0x00020000UL)
27 #define AHB2PERIPH_BASE (PERIPH_BASE + 0x08000000UL)
28 #define AHB3PERIPH_BASE (PERIPH_BASE + 0x10000000UL)
30 #define GPIOA_BASE (AHB2PERIPH_BASE + 0x00000000UL)
31 #define GPIOB_BASE (AHB2PERIPH_BASE + 0x00000400UL)
32 #define GPIOC_BASE (AHB2PERIPH_BASE + 0x00000800UL)
33 #define GPIOD_BASE (AHB2PERIPH_BASE + 0x00000C00UL)
34 #define GPIOF_BASE (AHB2PERIPH_BASE + 0x00001400UL)
Listing 8.4: stm32f3xx_ll_gpio.h
#define LL_GPIO_PIN_15 GPIO_BSRR_BS_15             // Select pin 15
#define LL_GPIO_MODE_INPUT (0x00000000U)           // Select input mode
#define LL_GPIO_MODE_OUTPUT GPIO_MODER_MODER0_0    // Output mode
#define LL_GPIO_MODE_ALTERNATE GPIO_MODER_MODER0_1 // Alt fct mode
#define LL_GPIO_MODE_ANALOG GPIO_MODER_MODER0      // Analog mode
#define LL_GPIO_OUTPUT_PUSHPULL (0x00000000U)      // Select push-pull
#define LL_GPIO_OUTPUT_OPENDRAIN GPIO_OTYPER_OT_0  // open-drain
#define LL_GPIO_SPEED_FREQ_LOW (0x00000000U)       // I/O low out speed
#define LL_GPIO_SPEED_FREQ_MEDIUM GPIO_OSPEEDER_OSPEEDR0_0
#define LL_GPIO_SPEED_FREQ_HIGH GPIO_OSPEEDER_OSPEEDR0
#define LL_GPIO_PULL_NO (0x00000000U)              // I/O no pull
#define LL_GPIO_PULL_UP GPIO_PUPDR_PUPDR0_0        // I/O pull up
#define LL_GPIO_PULL_DOWN GPIO_PUPDR_PUPDR0_1      // I/O pull down
__STATIC_INLINE void LL_GPIO_SetPinMode(GPIO_TypeDef *GPIOx, uint32_t Pin, uint32_t Mode)
{ MODIFY_REG(GPIOx->MODER, (GPIO_MODER_MODER0 << (POSITION_VAL(Pin) * 2U)), (Mode << (POSITION_VAL(Pin) * 2U)));}

__STATIC_INLINE void LL_GPIO_WriteOutputPort(GPIO_TypeDef *GPIOx, uint32_t PortValue)
{ WRITE_REG(GPIOx->ODR, PortValue);}

__STATIC_INLINE void LL_GPIO_SetPinOutputType(GPIO_TypeDef *GPIOx, uint32_t PinMask, uint32_t OutputType)
{ MODIFY_REG(GPIOx->OTYPER, PinMask, (PinMask * OutputType));}

__STATIC_INLINE uint32_t LL_GPIO_ReadInputPort(GPIO_TypeDef *GPIOx)
{ return (uint32_t)(READ_REG(GPIOx->IDR));}

__STATIC_INLINE void LL_GPIO_TogglePin(GPIO_TypeDef *GPIOx, uint32_t PinMask)
{ uint32_t odr = READ_REG(GPIOx->ODR);
  WRITE_REG(GPIOx->BSRR, ((odr & PinMask) << 16u) | (~odr & PinMask));}

If we want e.g. pin 13 to generate an interrupt, the reference manual tells us to:

• Set syscfg->exticr[3] to the bit pattern 010 to activate the right multiplexer.
• Set bit 13 in the Interrupt Mode register.
• Not set anything in exti->emr - the Event Mask Register - because an interrupt is not the same as an event.
• Configure interrupts to happen on the falling flank by setting bit 13 in exti->ftsr1.
• Alternatively configure interrupts to happen on the rising flank by setting bit 13 in exti->rtsr1.
All the above can be read in “RM0364” - the reference manual for STM32F334xx boards. On the other hand - you can simply use the function partly shown in Listing 8.6.
Listing 8.6: stm32f3xx_hal_gpio.c
void HAL_GPIO_Init(GPIO_TypeDef *GPIOx, GPIO_InitTypeDef *GPIO_Init)
{
  uint32_t position = 0x00u;
  uint32_t iocurrent;
  uint32_t temp;

  /* Check the parameters */
  assert_param(IS_GPIO_ALL_INSTANCE(GPIOx));
  assert_param(IS_GPIO_PIN(GPIO_Init->Pin));
  assert_param(IS_GPIO_MODE(GPIO_Init->Mode));

  /* Configure the port pins */
  while (((GPIO_Init->Pin) >> position) != 0x00u)
  {
    /* Get current io position */
    iocurrent = (GPIO_Init->Pin) & (1uL << position);
    ...
    /* EXTI mode configuration */
    if ((GPIO_Init->Mode & EXTI_MODE) != 0x00u)
    {
      /* Enable SYSCFG Clock */
      __HAL_RCC_SYSCFG_CLK_ENABLE();
      temp = SYSCFG->EXTICR[position >> 2u];
      temp &= ~(0x0FuL << (4u * (position & 0x03u)));
      temp |= (GPIO_GET_INDEX(GPIOx) << (4u * (position & 0x03u)));
      SYSCFG->EXTICR[position >> 2u] = temp;

      /* Rising edge configuration */
      temp = EXTI->RTSR;
      temp &= ~(iocurrent);
      if ((GPIO_Init->Mode & TRIGGER_RISING) != 0x00u)
        temp |= iocurrent;
      EXTI->RTSR = temp;

      /* Falling edge configuration */
      temp = EXTI->FTSR;
      temp &= ~(iocurrent);
      if ((GPIO_Init->Mode & TRIGGER_FALLING) != 0x00u)
        temp |= iocurrent;
      EXTI->FTSR = temp;

      /* Event mask configuration */
      temp = EXTI->EMR;
      temp &= ~(iocurrent);
      if ((GPIO_Init->Mode & EXTI_EVT) != 0x00u)
        temp |= iocurrent;
      EXTI->EMR = temp;

      /* Clear EXTI line configuration */
      temp = EXTI->IMR;
      temp &= ~(iocurrent);
      if ((GPIO_Init->Mode & EXTI_IT) != 0x00u)
        temp |= iocurrent;
      EXTI->IMR = temp;
    }
    position++;
  }
}
Working In Floating-Point
In Section 6.3 we discussed how to build for the FPU - Floating-Point Unit. Floating-point is an integrated part of the C-language, with float - aka Single Precision - and double - aka Double Precision - in the floating-point standard (IEEE-754). As long as I can remember, there was always the question, whenever a CPU was considered: “Does it have floating-point?” The meaning of this question is: “Does it have hardware support for floating-point?” The first PCs used the Intel 8086 CPU (or the cheaper 8088 incarnation). If you had money enough, you could also buy a PC with an 8087 “Coprocessor”. Several generations of Intel CPUs had a coprocessor, until the 80486, where it was integrated in the main IC.

Cortex-M4 is the first Cortex-M to offer support for floating-point - sometimes. The silicon vendor can choose between no support and support for single precision (sign, 23-bit mantissa, 8-bit exponent). You often see a Cortex-M4 with FPU called a Cortex-M4F. With Cortex-M7 you can choose between single- and double-precision. The newer MCUs also support half-precision, which is practical when you have programs with SIMD - Single Instruction - Multiple Data. See Tables 7.1 and 7.2 for support over all Cortex-Ms.
So how does it work? As discussed in Chapter 6, it is important to compile the code with the right switches and call the right libraries. However, when it comes to writing C-programs, you typically don't have to worry about how to write the program. You do not necessarily need CMSIS-DSP. This may sound obvious - one might think that CMSIS-DSP is all about using the DSP-instructions built into Cortex-M4 and others for e.g. SIMD, or multiplying two 32-bit integers and accumulating results in 64 bits.
Listing 9.1: Simple Program Using Floating-Point
#define SAMPLES 1024
const float A = 0.912345f;
const int PERIODS = 4;
float mysignal[SAMPLES];

float stepsize = PERIODS * (2 * M_PI) / SAMPLES;
for (int i = 0; i < SAMPLES; i++)
    mysignal[i] = A * sinf(i * stepsize);

float power = 0;
float *psignal = mysignal;
for (int i = 0; i < SAMPLES; i++)
{
    power += (*psignal) * (*psignal);
    psignal++;
}
power /= SAMPLES; // Normalize
However, that is only half the story. It turns out that CMSIS-DSP contains many algorithms that are similar - but in different versions - working in f64 (double), f32 (float), bytes, 16-bit half-words and 32-bit words. In other words, if you want to do Digital Signal Processing in your Cortex-M, you may benefit from CMSIS-DSP - also if you consider using “plain” float or double.

My first little sample-program was, however, easily done without libraries. Listing 9.1 contains a small program that first generates a sine - then calculates the power of the sine. I chose this example because it is easy to verify the result without even opening Excel. The power of a sine with amplitude A is A²/2. Please note how the program stays within the confines of single-precision by putting an “f” on the amplitude constant and using sinf - the float version of the normal sin function, which works in doubles. This program was created as an empty project for the sample STM32F334R8 board, into which I wrote the code in Listing 9.1. In order for us to analyze the effects with and without the hardware acceleration, we need more:
• I use the Serial-Wire Output and the ITM built into the Cortex-M4 to get standard printf. This is described in Chapter 13.
• Project Setup: The project must be set up in “Properties - C/C++ Build - Settings - MCU Settings” to use the Single-Precision FPU, available with the Hardware implementation of the Floating-Point ABI. See Figure 9.1. It is here we later change from “FPv4-SP-D16” to “None” - and to the Software ABI. Note the small checkmark, lending us support for float in printf. We also need to optimize for speed. This is done in “Properties - C/C++ Build - Settings - Optimization” - also visible in the tree in Figure 9.1.
• We can measure time in microseconds - or in cycles. I chose cycles, because this is independent of the core-frequency - and thus more relevant in comparisons. Measuring cycles only requires register declarations from core_cm4.h - described in Chapter 8.
• To be able to compare the calculation, we also calculate the theoretical value of the power as stated earlier.
Apart from trying various ways to do calculations, I also wanted to use the CCMRAM that the small Nucleo-64 board has. CCMRAM may be used for data - or for code. It was clear that setting the CCMRAM up to hold data would be the simplest task, so I decided to start with that. My question was: “How does performance improve if CCMRAM is used to hold the sine?” On one hand I expected access to this RAM to be faster than the normal RAM - while on the other hand I knew that the CCMRAM is placed in the code region. This might create a bottleneck when fetching data and code from the same region. To allocate mysignal in the input section that ends up in CCMRAM, gcc requires an attribute - this may vary with the compiler used. Line 9 in Listing 9.2 shows the declaration. Looking at the “Build Analyzer” under “Memory Details”, I found mysignal in both CCMRAM and Flash. I remembered the linker script that by default expects CCMRAM to be used for code, and in the script I changed “ccmram AT> flash” to just “ccmram”.

It turned out that using CCMRAM instead of normal RAM changed absolutely nothing in this scenario. I kept this setting as it perfectly consumed my entire CCMRAM area - leaving more RAM for the rest. When I ran the program in Listing 9.2, I got the following output:
2 Measured Power (f32): 0.416187, True Power:0.416187 Took 9239 Cycles
Listing 9.2: Measuring on Floating-Point
#define SAMPLES 1024
const float A = 0.912345f;
const int PERIODS = 4;
float mysignal[SAMPLES] __attribute__((section(".ccmram")));

// Skipping _write, prototypes and comments
int main(void)
{
  /* Reset of all peripherals, initializes the Flash interface and the Systick */
  HAL_Init();
  CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk; // Enable trace block
  DWT->CTRL |= DWT_CTRL_CYCCNTENA_Msk;            // Enable cycle counter
  uint32_t start;
  uint32_t stop;
  float stepsize = PERIODS * (2 * M_PI) / SAMPLES;
  start = DWT->CYCCNT;
  for (int i = 0; i < SAMPLES; i++)
    mysignal[i] = A * sinf(i * stepsize);
  stop = DWT->CYCCNT;
  // Needs "C/C++ Build Settings - Tool Settings" to enable floats in printf
  printf("Sine Generation took %lu Cycles\n", stop - start);
  float power = 0;
  float *psignal = mysignal;
  start = DWT->CYCCNT;
  for (int i = 0; i < SAMPLES; i++)
  {
    power += (*psignal) * (*psignal);
    psignal++;
  }
  stop = DWT->CYCCNT;
  power /= SAMPLES; // Normalize
  printf("Measured Power (f32): %f, True Power:%f Took %lu Cycles\n",
         power, A * A / 2.0f, stop - start);
}
The sine generation is using the FPU in a gcc library sinf function, while the power calculation is done in main.c. The disassembled code for the relevant part of main.c is shown in Listing 9.3. This listing is not easy to read - partly because of the speed optimization. Table 9.1 is an attempt to explain the main loop. Note that in Unified Assembly Language, floating-point instructions start with “v”.

vldmia is Load Multiple - Increment After. The “multiple” tells us that it can load several registers. Due to the exclamation mark, the register used for addressing will be incremented by 4 bytes per register loaded. This instruction costs 1 cycle plus 1 more cycle per register - with only one register we end at 2 cycles.

vfma is Fused Multiply-Add - which is basically the well-known MAC - Multiply-Accumulate - where intermediate results are not rounded. The ALU-flags are set in the cmp statement and used by bne.n - which is Branch Not Equal with a narrow address. If an address designated as “narrow” cannot fit into a 16-bit instruction, the assembler will flag an error. In other words - the address we jump to must be near - as is normally the case in a loop. This instruction takes one cycle - plus a pipeline refill. According to my measurements, the branch takes 3 cycles in my Cortex-M4. In other words, of the 9 cycles per sample, the loop-maintenance takes 4 cycles, and the actual floating-point arithmetic takes 5 cycles.

vmul and vadd take 1 cycle each, and it seems that we could use these instead and save an instruction. There is, however, a problem when they are used back-to-back - in which case they also end up costing 3 cycles. On top of this we would have rounding after every multiplication.
It is interesting that the final division by the number of samples is done by a multiplication (line 18). This proves that gcc is clever enough to divide by a constant by multiplying with the reciprocal value. Another interesting thing is the call to __extendsfdf2 (line 23). This library function converts a float to a double and is provoked by printf, as printf only deals with doubles. Since the M4 Core only supports single-precision floating-point, it throws in the towel here and calls the software library.
Listing 9.4 shows the loop “unrolled” so that it only loops once per four Multiply-Accumulates. This brings down the total cycles for the loop to 6463 - corresponding to 6.3 cycles per sample. Note that this gives a compiler warning, because the order of dereferencing and incrementing the pointer is not completely clear - but it works, for now. If instead each line becomes power += (*psignal) * (*psignal); psignal++; we get no warnings, and the cycle-count is the same.
Listing 9.3: Disassembly with Floating-Point Unit
146 printf("MeasuredJPowe^J (f32)^J%f,JTrueJPower:%fJTookJ%luLJ Cycles\n", power, A * A/2.0f, stop-start);
146 printf("MeasuredJPowe^J (f32)^J%f,JTrueJPower:%fJTookJ%luLJ Cycles\n", power, A * A/2.0f, stop-start);
Table 9.1: The Main Loop
Instruction             Explanation                                          Cycles
vldmia r3!, {s15}       Load memory referenced by R3 into S15, increment R3  2
cmp r3, r2              Set flags for R3-R2                                  1
vfma.f32 s14, s15, s15  Fused Multiply-Accumulate                            3
bne.n 0x8000e4a         Branch Not Equal (narrow) to loop start              3
Listing 9.4: Unrolled Loop
for (int i = 0; i < SAMPLES/4; i++)
{
    power += (*psignal) * (*psignal++);
    power += (*psignal) * (*psignal++);
    power += (*psignal) * (*psignal++);
    power += (*psignal) * (*psignal++);
}
// Run the up to three last samples
for (int i = 0; i < SAMPLES % 4; i++)
    power += (*psignal) * (*psignal++);
Floating-Point Without FPU
We can test the software-library alternative to the FPU very easily. We only need to go into “Properties - C/C++ Build - Settings - MCU Settings” - as in Figure 9.1 - and set “Floating-Point unit” to “None” and “Floating-Point ABI” to “Software implementation”. After a complete rebuild and restart, the code in Listing 9.2 gives the following printout:
146 printf("MeasureduPoweru (f32):u%f,uTrue,_lPower:%fUJTook,_l%lu,_l Cycles\n", power, A * A/2.0f, stop-start);
Measured Power (f32): 0.416187, True Power:0.416187 Took 160611 Cycles
The above is a factor 8 on the sine-generation and 17 on the power calculation! The FPU certainly makes a huge difference. Listing 9.5 shows a disassembly of exactly the same source code as Listing 9.3, but now with floating-point via the software-library. We can see that the “v” instructions are gone - and with them all the “s” registers that come with the FPU. The floating-point multiplications are done by calling __mulsf3, while additions are performed by calling __aeabi_fadd - which is also sometimes called __addsf3.
Fixed-Point
In the beginning of this chapter I talked about CMSIS-DSP having algorithms based on integers as well as floating-point. People working within signal processing are used to getting more for their money if they work with integers. These are the main reasons:

• A DSP without floating-point support is cheaper than one with. We just saw how slow the software version of floating-point is. A DSP cannot live with that. Integers are the alternative.
• Circuits for integer math are simpler than those for floating-point. This means less power - and longer battery lifetime if batteries are used.
• A 32-bit integer allows higher precision than a float, which also takes up 32 bits. A float has a larger dynamic range due to the 8-bit exponent - yes - but that is 8 bits less for the precision.
Floating-point is "emulated" in integers by having an implicit decimal point - or rather - binary point. We just have to agree where the binary point is. Typically, it is just to the right of the sign bit - we basically always use signed numbers in DSP. A decimal number like xy.z is defined as x*10^1 + y*10^0 + z*10^-1 in our Base-10 system. In the same way, the binary signed (positive) integer 0101 is 1*2^2 + 0*2^1 + 1*2^0 in the Base-2 system.

If we choose to have an implicit point after the sign bit, we will instead interpret 0101 as 1*2^-1 + 0*2^-2 + 1*2^-3. In other words, the byte 0.1010000 - with the implicit point shown explicitly after the sign bit - is the same as 1/2 + 0/4 + 1/8 + 0/16 + 0/32 = 0.625. It is often easier to convert to integer first, and then divide by 2^7 in the case of a byte (and 2^31 for a word). In this case (16 + 64)/128 = 0.625.

Since the leftmost bit is the sign bit, we can say that there are 0 integer bits and 7 fraction bits. Texas Instruments invented a notation where this is called q7. Similarly, we have q15 in 16 bits (half-word) and q31 in 32 bits (word). Alternatively, we also include the integer-part - including the sign - in the notation. This notation is often used by Arm, and I think it makes more sense. Here the above would be called 1.7 (byte), 1.15 (half-word) and 1.31 (word).
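CMSIS-DSP has helpers for these conversions - we meet arm_float_to_q31 shortly. As a minimal sketch - assuming the usual convention of saturating at the edges of [-1,1[ - a hand-rolled q15 conversion could look like this:

#include <stdint.h>

typedef int16_t q15_t;

// Sketch: map a float in [-1.0, 1.0[ to 1.15 format - and back.
static q15_t float_to_q15(float x)
{
    float scaled = x * 32768.0f;                 // multiply by 2^15
    if (scaled >  32767.0f) scaled =  32767.0f;  // saturate at the top
    if (scaled < -32768.0f) scaled = -32768.0f;  // and at the bottom
    return (q15_t)scaled;
}

static float q15_to_float(q15_t q)
{
    return ((float)q) / 32768.0f;                // divide by 2^15
}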
In any case, we are still working in two's-complement, and we can add and subtract completely as usual - but multiplications are a bit different.

If we multiply two bytes of type q7 - aka 1.7 format - we will get a number in 2.14 format. However, the leftmost two bits are now both sign bits, and we can safely shift one bit left to get 1.15 - or q15. This shift is often done silently in DSPs.
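In C, the multiplication and the renormalizing shift could look like this sketch (the single overflow case, -1 times -1, is ignored here):

#include <stdint.h>

// Sketch: q7 * q7 gives a 2.14 product; one left shift removes the
// duplicated sign bit and leaves a 1.15 aka q15 result.
static int16_t mul_q7_q7(int8_t a, int8_t b)
{
    int16_t prod_2_14 = (int16_t)a * (int16_t)b;  // 2.14 format
    return (int16_t)(prod_2_14 << 1);             // 1.15 format
}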
In some cases we want to keep the binary point at a different place to represent integers and fractions in the same number. We could e.g., have a 16.48 word. Here we have a 48-bit fraction part, and a 16-bit integer part containing the sign.

The concept of moving the point is often used with long series of Multiply-Accumulates to give headroom for the additions - avoiding an overflow. In our case where we calculate power this is especially important.
Since the multiplications here are squares, we always accumulate positive numbers, and we need headroom. If you somehow - in other algorithms - know that the end-result is within the legal range, two's complement wraps in a way that allows temporary overflow along the way in an accumulator. Listing 9.6 shows our power-calculation once again. Now the barrier instruction __ISB() is introduced after the cycle-counter is started, and before it is stopped, to assure that these instructions do not move around. Flushing the instruction pipeline does take a bit of time, and thus results in a slightly higher number of cycles, but it is necessary to give consistent and correct results. The previous assembly listings show that the cycle counts were read at the right time before, but now that function calls are introduced this has changed.
Initially a CMSIS-DSP library function is called that converts the already existing float array into a q31 array. The next DSP-function is where we calculate the dot product. The name refers to the dot-product of two vectors, which is a single number. It is also called the scalar product. By using the sine as both vectors, we calculate its power. The function accumulates in 64 bits and therefore claims to return a q63 number. This is confusing because the documentation states that it is actually a 16.48 number. Looking at the source-code, the latter is confirmed. This is what happens - in the library function and our program - also shown in Figure 9.2:
• Our sample is in 1.31 format. This means that the value lies in the interval [-1,1[. This number is multiplied with itself in the library dot-product function. The multiplication results in a 2.62 number.

• Accumulating a lot of these numbers in 64 bits could easily result in an overflow inside the function. To avoid such an overflow, each 64-bit multiplication-result is right-shifted 14 bits before being accumulated in 64 bits. So now we accumulate 16.48 bit numbers - 14 bits more in the integer part and 14 less in the fraction. The library function returns this to our program.

• After having done all the intermediate accumulations in 64 bits, it is OK to round the result into 32 bits. In C we simply right-shift 32 bits and cast the result to a 32-bit value. Right-shifting can be tricky. A "logical" right shift always inserts zeros into the most significant bits. What we want is an "arithmetic" shift, which always copies the sign when shifting bits in from the left. In this special case we work with squares which are always positive, but still - we want it nice and tidy. The good thing is that gcc for Arm does use arithmetic shifts for signed data. We basically take our 16.48 bit number and throw away the bottom 32 bits - and then we have a 16.16 number.

• The above number is now cast - converted - to a float.

• The cast works on integers - it does not "know" about our invisible binary point. We need to manually "shift" the float 16 bits. This is done by dividing by 0x10000. Had we done this operation on the integer, we would have thrown away 16 bits and created an underflow. Performing the division on the float simply adjusts the exponent, and we keep the float's 23-bit accuracy.

• Finally, we divide the result by the number of samples - just as before. This is part of the equation for calculating power - not a number format trick.
In the end we get exactly the same result as in the float (f32) version. Unfortunately, we end up spending more than twice as many cycles on the integer dot-product as we did on the simple for-loop with floating-point. This can be seen from the printout that includes the previous exercise - with unrolled power-loop and library-version of sine-generation:
Measured Power (f32): 0.416187, True Power:0.416187 Took 6957 Cycles
Measured Power (q31): 0.416187, True Power:0.416187 Took 22632 Cycles
Figure 9.2: From 1.31 samples to a float result - the 2.62 product is shifted to 16.48 for accumulation, the upper 32 bits are kept as a 16.16 number, which is cast to float and divided by 0x10000.
Listing 9.6: Fixed-Point

arm_float_to_q31(mysignal, myqsignal, SAMPLES);
q63_t resultq63;
start = DWT->CYCCNT;
__ISB();
arm_dot_prod_q31(myqsignal, myqsignal, SAMPLES, &resultq63); // result in 16.48 format
__ISB();
stop = DWT->CYCCNT;
// GCC does arithmetic right shift - and the number is positive anyway
q31_t resultq31 = resultq63 >> 32; // result in 16.16 format - not really q31_t
// Change to float and scale
float result = ((float) resultq31) / (float) 0x10000;
result /= SAMPLES; // Normalize as before
printf("Measured Power (q31): %f, True Power:%f Took %lu Cycles\n", result, A * A/2.0f, stop-start);
Executing in CCMRAM
Earlier I tried to keep mysignal in CCMRAM - and it changed nothing. The CCMRAM is, however, placed in the code-region and is mainly there to speed up code.
I decided to test the unrolled for-loop running in CCMRAM, because it was already the fastest. As expected, it is a lot more work to get this running than simply keeping the signal in CCMRAM. We have discussed how the linker and Startup work, but nevertheless it is nice that the process is described in ST's Application note AN4296 - for several compilers. For gcc these are the steps:
• The linker script needs a ccmram output section, very similar to the data section. This section must be placed before the text section - but can come after the vectors. Please note that this differs from the default setup from STM32CubeMX.
• The unrolled for-loop is moved to a function called CalcPower() - returning the power.
• The new function must have an attribute before the function declaration, placing it in the new section - see the sketch after this list.
• The Startup needs assembly-code that copies the above new section from Flash to CCMRAM. This is very much like the data-section - but you can copy directly from the application note.
• The above did not work at first. Checking, I did not find any ccmram section in the MAP-file. I realized that the function CalcPower() is so short that it is inlined by gcc - and thereby still executed in the text section. The function therefore also needs an attribute that prevents inlining - also shown in the sketch below.
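A minimal sketch of the two attributes - assuming the gcc section and noinline attributes described in AN4296 and the gcc manual, and that the output section is called ccmram as above:

// Assumed: the section name must match the ccmram output section
// in the linker script.
__attribute__((section(".ccmram")))  // place the function in CCMRAM
__attribute__((noinline))            // keep gcc from inlining it into .text
float CalcPower(void);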
This was the new output:
Measured Power (f32): 0.416187, True Power:0.416187 Took 5687 Cycles
Measured Power (q31): 0.416187, True Power:0.416187 Took 22385 Cycles
Now we count 5687 cycles for the function call. This corresponds to 5.55 cycles/sample - bringing us very close to the "datasheet version" of 5 cycles/sample - 2 for vldmia and 3 for vfma - as can be seen in Table 9.1.
Summing it all up
Please note that moving the code to CCMRAM is not the only change. To make room for the code, the signal is no longer stored in CCMRAM. This did not change anything earlier, when the standard sinf function was used. It did not change anything for the for-loop with floating-point multiply-accumulates either. It does, however, mean something for the specific CMSIS version - arm_sin_f32. This has gone from 94252 to app. 110662 cycles. Repeating the experiment gives the same results. This result is also the same with and without the barrier.
With the help of the small sample we have scratched the surface of floating- and fixed-point arithmetic. CMSIS-DSP offers algorithms for both - although in this case the floating-point version was so simple that we did the calculations directly in main. You can use the floating-point CMSIS-DSP algorithms independently of whether your Cortex-M has a built-in FPU or not. If not, it will use the software library. In any case, the algorithms can be a great help.
Table 9.2 is a summary of the measurements done in this chapter. We saw that although fixed-point arithmetic has a reputation for being faster than the corresponding floating-point version, this is by no means a law of nature.
Floating-point is basically always faster and simpler to implement, mainly because you rarely need to worry about overflow or underflow. When doing integer-arithmetic you need to understand and control the implicit binary point. Floating-point also executed faster than fixed-point in the sample - even though the library is using the Cortex-M's built-in DSP functions. We did, however, see that without the FPU, floating-point math becomes painfully slow. Don't forget that the FPU is of absolutely no use to us if we need to work with doubles - at least on MCUs like Cortex-M4 and Cortex-M33.
Note also that even though the "naive" implementation of the floating-point algorithm outperformed the fixed-point library version, you can get even better performance by unrolling the loop a bit - removing most of the overhead related to branches in the loops.
In this chapter we also spent time on testing the CCMRAM. It is relevant because this is just one of the many RAM-types that can be used to speed up important inner-loops - like those typical of signal processing. We saw that the slow sine-calculation did not change when the signal was stored in CCMRAM, while the fast sine improved slightly.
Table 9.2: Summary of the measurements in this chapter

Algorithm             CCM    Method   Cycles     Per MAC   Index
Sine gen arm_sin      Data   HW f32   94245      92        1.00
Sine gen arm_sin      None   HW f32   110662     108       1.17
Sine gen gcc sin      Data   HW f32   323158     315       3.43
Sine gen gcc sin      Data   SW f32   2626041    2564      27.86
Power unrolled for    Code   HW f32   5687       5.55      1.00
Power unrolled for    Data   HW f32   6463       6.31      1.14
Power simple for      Data   HW f32   9239       9.02      1.62
Power simple for      Data   SW f32   160611     156.85    28.24
When CCMRAM was used to hold the code for the floating-point unrolled for-loop, this improved less than 15 %.
The effect of the barriers is not insignificant. In the cases where barriers were used, they were absolutely necessary in order to get results that were not completely wrong. I did revisit the cases from the start of the chapter where barriers were not needed. When barriers were introduced, I measured higher cycle-counts. This should not be a surprise - after all, we are waiting for pipelines to be flushed. It was, however, a surprise to me that the values measured grew more than 10 %. The good thing is that this was the same for all the cases compared. This probably means that the last - and most optimal - cases would perform 10 % better without the barriers. This would bring us very close to the datasheet value of 5 cycles/MAC. A way to test this could be to create an outer-loop that performs the tests between the barriers 1000 times. This exercise is left to the reader.
An MPU - Memory Protection Unit - is not the same as an MMU - Memory Management Unit. A Memory Management Unit is part of CPUs with e.g., a Cortex-A profile. It allows processes under e.g., Linux to access the common physical memory using a virtual address space for each process. The MMU ensures that one process cannot access the memory space of other processes - unless a shared memory object is specifically set up. In such a system the linker will link functions and libraries together into a program, but it is not until this program is loaded by the operating system that it is assigned a physical address space. Even at this time the physical addresses are not "known" by the program, as addresses are translated on-the-fly by the MMU. The operating system can load many such programs that work in parallel. They can work independently of each other - although they will be competing for resources like disc-access, network etc. Just like your desktop PC.
Systems like our Cortex-M MCUs run a single monolithic program that already at link time is assigned physical addresses in the common physical memory. We saw in Chapter 4 how the memory input-sections from the compiler - or assembler - are bundled together in output-sections and assigned to memory regions in physical memory by the linker. In many ways this is a simple system. However, a "wild" pointer in one part of the program can easily read or write to variables in a completely different part of the program. Possibly even overwrite program-code. C will allow you to load a constant into a pointer - and then dereference it. This is how we access hardware registers when hardware - as normally is the case - is memory mapped. One small mistake - like missing a 0 in a hex-address - and you are writing in the wrong place. This has been the condition of embedded programming since day one.
Just as the Cortex-A CPU can run Linux, Cortex-M - or other MCUs - can run a small RTOS - Real-Time Operating System. You can use drivers in such a system to shield your application code from the code that is allowed to write directly to hardware. This is absolutely recommended, but it is nevertheless still possible for a programmer to dereference the wrong pointer - in the driver as well as in the application - and havoc will rule, pretty much as in the bare-metal solution. This is why we sometimes consider RTOS-based systems as bare-metal systems.
One solution to this danger is to use a language without raw pointers - like Rust. However, C and C++ are still dominating the embedded world we live in. So - wouldn't it be nice to have some piece of hardware to help us? This is where the MPU comes in. ST has a description in their application note AN4838. Some of the smaller Cortex-Ms do not support an MPU, in the mid-range it is optional, while the bigger Cortex-Ms always come with MPUs. Please see Tables 7.1 and 7.2.
With an MPU you can define a relatively small number of Regions. Cortex-M4 optionally has an MPU, which allows for eight regions - plus a background region. Cortex-M7 has the same MPU - with 8 or 16 regions. A region defines a memory range and which types of access are legal for the given range. We can define the Memory Type as Normal, Device or Strongly Ordered. We can also define access as RO - Read-Only, RW - Read-Write, or No-Access - and we can set the XN - eXecute Never - flag for a region. These definitions can be set separately for Privileged and Non-Privileged mode - see Chapter 7. We saw these attributes when we discussed the Memory Map in Chapter 4. Finally, we can set the cache policy for MCUs that have cache-support, and we can select whether the MPU is active during a Hard Fault. It might not be practical to trigger an MPU-exception once you are already in a hard fault. The base address of a region must match its length - e.g., a 32 kB region must start on an address that modulo 32k is zero.
Stack Guard
A bare-metal system basically always runs in privileged mode.
A system with the MPU disabled acts like a system without an MPU. A program can configure the MPU statically at initialization, helping you to e.g., avoid executing code from data-space and assure that code writing to registers in the Device regions obeys the rules for accessing devices.
If you have an RTOS, it may dynamically set up the MPU so that any given Task only has access to its own stack. This is e.g., supported by FreeRTOS in a specific MPU-version. The FreeRTOS homepage suggests that you use the MPU to create a pseudo process-and-thread model - where groups of threads are allowed to share a memory space.
Note that you can also use the MPU during debugging. If you see unstable behavior, you can debug a suspected memory overwrite by creating an access function for the data that is overwritten. The access function will program the MPU to allow access to the memory before accessing it, and disallow access afterwards. Only code that uses the access function will avoid triggering an exception. When the exception occurs, the stack will show the offending code.
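A sketch of the pattern - the two MPU helper functions are hypothetical and stand in for reprogramming the RASR of the region covering the data:

extern volatile uint32_t suspect_data;   // the variable that gets overwritten

// Only this accessor opens the region - any other write faults.
void write_suspect_data(uint32_t v)
{
    mpu_region_allow_rw();     // hypothetical: set the region to Read-Write
    suspect_data = v;
    __DSB();                   // let the write complete
    mpu_region_deny_access();  // hypothetical: back to No-Access
}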
Using the MPU can seem complex, and I am sure that many MPUs are never used. Sometimes it is best to start with a simple success. This can prove the value of investing more energy in a given concept, and thereby motivate us developers. In Chapter 4 we saw an example of a stack-guard that was checked at regular intervals. It is, however, much better to catch the villain red-handed. As complex as the MPU may seem, it is not difficult to set up a stack-guard that leads to an exception at the moment it is violated.
Listing 10.1 shows the last part of the usual linker-script. Here the symbol MPUstack_chk_guard is created at the border between the heap and the stack. In this case we add 32 to the location pointer before the stack part. Before this is done, we align to a 32-byte border. This is dictated by the MPU. The smallest area the MPU can protect is 32 bytes - and as stated - any protected area must be aligned corresponding to its own size. So now we have a "no-mans land" of 32 bytes where there should never be any access.
Listing 10.2 shows a very simple setup of the MPU. First we read the TYPE register to make sure that the current device has an MPU. The STM32F334R8TX does NOT have an MPU and will fail already here. This sample is run on a Cortex-M7.

Listing 10.1: MPU Stack Guard in Linker Script

  . = . + _Min_Heap_Size;   /* Allocate heap after bss */
  . = ALIGN(32);            /* MPU areas must align to their own size */
  MPUstack_chk_guard = .;   /* MPU guard symbol declared */
  . = . + 32;               /* 32-byte no-mans land */
  . = . + _Min_Stack_Size;  /* Allocate Stack */
} >RAM
Before we start writing to the MPU we call the Data-Memory-Barrier to assure that data in the pipeline is delivered. Then the MPU is disabled to avoid any exceptions during configuration. The number of regions is read and compared to our expectations. On a Cortex-M4 we expect 8 if an MPU is present, while on a Cortex-M7 we expect 8 (F7 version) or 16 (H7 version). Now, the privileged default layer is enabled. This means that privileged code has access as defined by the memory layout. All code in this program is privileged. Next, the region number is defined (to 0) and the base address of the protected region is set. For this purpose we use the symbol from the linker - introduced as an extern definition in line 1. Now we have the most complex register - the RASR - Region Attribute and Size Register - where the size is declared to be 32 bytes, with no access allowed and eXecute Never. The important enable bit that enables this region is also set here.
To be on the safe side, the remaining regions are now cleared. The memory-fault exception is enabled, and finally, the MPU itself is enabled and we are in business.
In the exception handler, you can do a printout, or go into an infinite loop - or you can set a breakpoint via an assembly instruction. With the help of the debugger, the stack can then be examined. We will be able to see which line in which function did the bad deed - and also the variables around it.
Listing 10.2: Stack Guard

extern uint8_t MPUstack_chk_guard[32];

static int SetupMPU()
{
    // Do we even have an MPU?
    if (MPU->TYPE == 0)
    {
        // Printout an error-statement
        return false;
    }

    // Disable MPU while fiddling with it
    __DMB(); // Wait until transfers are done
    MPU->CTRL = 0;

    uint32_t regions = (MPU->TYPE & MPU_TYPE_DREGION_Msk) >> MPU_TYPE_DREGION_Pos;
    if (regions != EXPECTED_REGIONS)
    {
        // Printout an error statement
        return false;
    }

    // Set up the basic privileged default background page
    MPU->CTRL |= MPU_CTRL_PRIVDEFENA_Msk;

    MPU->RNR  = 0;                              // Region number 0 is the guard
    MPU->RBAR = (uint32_t) MPUstack_chk_guard;  // Region Base Address of guard

    // We select a region size, we disable Execute of code, we disallow
    // any access and we enable the region
    MPU->RASR = (ARM_MPU_REGION_SIZE_32B << MPU_RASR_SIZE_Pos)
                | MPU_RASR_XN_Msk | ARM_MPU_AP_NONE | MPU_RASR_ENABLE_Msk;

    // Assure that other regions are disabled
    for (uint32_t i = 1; i < regions; i++)
    {
        MPU->RNR  = i;
        MPU->RASR = 0;
    }

    SCB->SHCSR |= SCB_SHCSR_MEMFAULTENA_Msk; // Enable Exception
    MPU->CTRL  |= MPU_CTRL_ENABLE_Msk;
    return true;
}
A program can execute in one of the following ways:

Sequential execution. A classic C or assembly program on a single core. You can follow the Thread of Execution through simple code, loops and explicit function calls. Even a C++ program with its implicit instantiations of classes and operator-overloads is a sequential program - although it is not always so easy to follow the thread of execution. The same thing can be said about an event-driven program without Worker-Threads.

Concurrent execution. Still a single core. The simplest concurrent system is a single main-program with one or more interrupts. However, we normally think of a concurrent system as one with multiple Tasks or Threads that seem to run in parallel. I will use the term thread here, as task is more "overloaded" and therefore vague. Although threads seem to be parallel, there is always maximally one thread executing in a concurrent system. Concurrency is typically implemented by a Scheduler in a kernel, which might be part of an Operating System. All the smaller Cortex-Ms are tailored to run an RTOS - Real-Time Operating System - on a single core. Such an RTOS will often offer more than just the scheduling. Examples are signals, queues or message-boxes, and semaphores and/or mutexes. We will see this in Chapter 13. To qualify as a full-blown operating system, there is typically also support for GUI, network, file-system etc. Such systems are typically found on Cortex-A CPUs. As stated earlier, a concurrent program can have issues with e.g., Read-Modify-Write cycles that are interrupted.

Parallel execution. Some years ago, Moore's law stated that performance of cores doubled every 18 months. This was mainly achieved by constantly shrinking the width of the lanes in the ICs - meaning that less current was needed to change state, and thus clock-speeds could be raised without using more power. Unfortunately, we hit some physical boundaries here, and clock-speeds stopped growing. The solution was multiple cores. We will undoubtedly see multiple cores in more systems in the future. There are several Cortex-A CPUs that support multiple cores. Since these CPUs include an MMU and can run Linux, developers can leave the complex control of such systems to Linux with its SMP - Symmetric Multi-Processing.
Multiple Thread Solutions
Blocked. The thread is waiting for e.g., an event or a resource. It could also be suspended on purpose. Another name for this state is Waiting.

Zombie. The thread was killed, but its resources are not released yet.
Figure 11.1 shows a concurrent system with multiple threads. At the left side of the figure we see some of the memory constructs we have dealt with: heap, global memory, static memory (module-only access) and of course the code. These are all shared among the threads.
Whenever there is a Context-Switch from one thread to another, the - currently invisible - OS saves the Context of the thread in what is often called a Task-Control-Block - TCB. The OS will keep an array of these TCBs - indexed by the thread-number. Into the TCB goes the content of all registers. When a thread is initialized, a memory area is set aside for a stack for the specific thread. When all the registers for the next thread are switched in by the OS, the SP will point to the stack of the new thread.

Figure 11.1 also shows Thread-Local Storage - TLS. Imagine that several threads run the same code. They each have their stack, but there is an issue with static variables. Some of these may be const initialization values and can be shared, but others should be specific to the thread. So just like a thread is initialized with a stack, it can also be initialized with TLS. This concept was introduced with C11 and is supported by gcc. You can e.g., declare a variable static __thread int myMachineID and you will have an instance of this static variable per thread.
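A small sketch - using gcc's __thread spelling (standard C11 spells it _Thread_local):

// Each thread sees its own instance of this static variable.
static __thread int myMachineID;

int next_id(void)
{
    return ++myMachineID;   // no locking needed - the counter is per-thread
}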
Figure 11.2 shows the same system as Figure 11.1 - but this time with heavy use of the MPU. The idea is that the context-switch now includes a reconfiguration of the MPU. Threads 1 and 2 each have their special configuration, while Threads 3 and 4 share a configuration. Most importantly, the thread-stacks are protected. This protects the non-active stacks from being overwritten by the active stack. Secondly, the code is protected in the same way, which helps us against e.g., bad function-pointers. We also see that global and static variables are protected. This is probably a bit optimistic, as this can be hard to configure - and the MPU only offers 8 regions on a Cortex-M4 (8 or 16 on Cortex-M7). Finally, the peripherals have their own pattern, signifying that they can only be accessed from privileged mode. This could be via a Supervisor Call - SVC - as we saw in Figure 7.1.
Figure 11.2: Multi-Threading with MPU
Figure 11.3: Multiple Processes and Threads with MMU
Figure 11.3 shows the same system again. This time our threads are part of Process 1, which is now in a Virtual Address Space. The physical memory is at the bottom of the figure - together with the peripherals. In-between we have the MMU that translates between virtual and physical addresses. If we still want to protect our threads from each other, we should consider moving them to separate processes. This will provide the optimal protection - at the cost of added overhead. Switching between threads in the same process (or in a system without processes) is faster than switching between processes. A Cortex-M does not have an MMU and cannot support this system. It could be a Linux system on a Cortex-A or Windows/Linux on a PC. From the short walkthrough, it is clear that our Cortex-M OS-based system is to be found somewhere between Figure 11.1 and Figure 11.2.
Context Switching Between Threads
Context switching is the concept of switching from executing one thread on the core to executing another. In Section 11.1 I briefly introduced the TCB - Task Control Block. We saw in Figures 11.1 and 11.2 that many of the memory-structures are non-thread related - and therefore remain unchanged over a thread-based context-switch. We (almost) only need to switch the registers. One of these is the Stack-Pointer - SP. If the system is initialized correctly, the SP we switch to will point to the correct stack, and the PC to where we should execute code. If we use Thread-Local Storage, the compiler might allocate R9 for this. If not, the OS will keep an index. There are three main triggers behind a Context-Switch:
1. The thread blocks. This typically happens when we perform an OS-function call (SVC) that cannot be completed right now. We might be reading from a TCP-socket or mailbox where there are currently no data, or we are waiting for a mutex or semaphore. The OS can see that the resource cannot deliver, and our thread changes state from Running to Blocked. Another thread that was Ready now becomes Running.

2. The time-slice runs out. Many OS'es support Preemption, as this is called. Typically, the OS maintains a SysTick timer. When the SysTick counts down to 0 it generates an interrupt. The timer either automatically reloads (it does in Cortex-M) and restarts its countdown, or it is restarted in the ISR - Interrupt Service Routine. If there is another Ready thread at the same priority as the current, it now becomes Running, and the old thread becomes Ready.

3. A higher priority thread becomes Ready. An interrupt may have told the OS that at least one byte is now waiting in the TCP-socket, or the OS may have received an SVC from another thread that released a mutex or put a mail in a mailbox.
As we can see from the above, all relevant state-changes that should end with a new Running thread occur via interrupts or SVC-calls - and are thus handled in an exception handler. A thread is therefore always in Handler-Mode when it changes state. The relevant handlers will call specific OS-code. The OS stores the state of the thread that was Running. The thread to become Running was earlier paused as Ready or Blocked - also in a Handler. The OS now restores the state of the thread to be Running, and effectively returns from the Handler belonging to the now Running thread. Figure 11.4 is an extension of Figure 7.1 that we saw earlier. As the figure shows, we can view the context-switch scenario as the CPU chewing merrily on one thread. Then an exception occurs, the CPU goes into a handler, and falls through a wormhole - only to appear in the paused exception-handler of another thread, from which it now returns to continue the normal execution. All this works because the Handlers in these specific cases execute OS-code.

Storing and Retrieving State
We saw in Section 7.4 how the Cortex-M stack-frame - the stack content at an exception - is created. From the top of the stack we have R0-R3, R12, old LR (R14), PC (R15) and xPSR. When the OS needs to perform a Context-Switch it can extend this stack-frame. A context-switch is not a function call - we need to assure that the remaining registers are saved on the stack. This means R4-R11 and LR (it is the thread's old version we already have on the stack). The SP is R13 and is stored in the TCB. Before doing the above, the OS however needs to use the right stack - and this knowledge must also be saved in the TCB structure. Note that the stack-frame is stored on the stack used before the exception occurred. We are in Handler mode - thus using the MSP - the Main Stack-Pointer. If the entry was via an SVC, the thread was unprivileged - using the PSP - the Process Stack-Pointer. Before the OS pushes any registers, it needs to switch to this stack. If the entry was via an interrupt, the OS needs to find out which stack was used (the thread could be privileged, or the interrupt could be nested). This can be found via the Link Register - see Table 7.6. From this we can also see whether we need to save the yet unsaved FPU-registers.

When the thread is made active again, the OS reads - from the TCB - which stack to use and builds a consistent LR. Also, the value of the Stack Pointer is read. The OS switches to the correct stack, R4-R11 are popped - and possibly also FPU-registers - and the OS performs an Exception Return.
The upper thread in Figure 11.4 is the one that is activated at reset. It is where the OS itself is initialized. This thread is also likely to create the other threads - basically by assigning a stack to each, building "fake" stack-frames as described above, with extra registers, and creating TCBs pointing to these stacks. This means that the other threads are started by returning from exceptions that never happened. Threads may also be created later on-the-fly if the code spawns a thread. However, the procedure will be the same.

Figure 11.4: Thread Context Switching
As usual, the use of LR is Arm-specific, but the principles discussed are also valid for other cores. If you want to dig further into how all this is done, I suggest that you look at FreeRTOS or other available RTOS code. There is one interesting instruction I would like to show, that is commonly used in context switches on Arm:

stmdb r0!, {r4-r11, r14}

In this code R0 was previously loaded with the contents of PSP - the process SP. The stmdb instruction means "Store Multiple registers - Decrement Before". The exclamation mark means that the pointer-register (here R0) is updated along the way. In other words, we here have R0 doubling as a stack-pointer, pushing multiple registers - the ones missing as described above. It also respects the rule that the highest-numbered register is stored at the highest address. Quite a luxury to have an instruction like that.
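As a sketch - not FreeRTOS or any specific RTOS - the save/restore pair could look like this gcc inline-assembly fragment, assuming no FPU state and that the TCB bookkeeping happens where the comment sits (the handler doing the switch is discussed below):

__attribute__((naked)) void PendSV_Handler(void)
{
    __asm volatile(
        "mrs   r0, psp            \n"  // the old thread's process SP
        "stmdb r0!, {r4-r11, r14} \n"  // push what hardware did not stack
        // ... save r0 in the old TCB, fetch the new thread's SP into r0 ...
        "ldmia r0!, {r4-r11, r14} \n"  // pop the same registers for the new thread
        "msr   psp, r0            \n"
        "bx    lr                 \n"  // exception return resumes the new thread
    );
}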
Executing the Context Switch
From the previous section we know that the OS-code doing the context switch can be called within SVC, SysTick - and possibly also other interrupts. This is what happens on some architectures. This concept can cause challenges if e.g., the SysTick-interrupt interrupts an ongoing interrupt that we actually would like to complete first - to keep latency down. The Arm remedy for this is PendSV - Pendable Service. This is an interrupt that you give the lowest priority. When - in a SysTick-Handler or an SVC-Handler - the OS needs a context switch, the OS sets the PENDSVSET bit in the Interrupt Control and State Register - ICSR. At some point the handler exits. If the SysTick handler had interrupted another interrupt - this completes. Should other interrupts happen during this interrupt, they are also handled. Finally, when there are no more interrupts and normal code is about to resume - then the PendSV interrupts. We can call it a deferred handler. Inside the PendSV-Handler the context switch happens as previously described.
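With the CMSIS register names, pending the PendSV is a one-liner - a sketch of what the SysTick- or SVC-Handler does when it wants a switch:

SCB->ICSR = SCB_ICSR_PENDSVSET_Msk;   // pend the lowest-priority PendSV
__DSB();                              // make sure the write takes effect
__ISB();                              // before we leave the handler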
Chapter 11 described the absolute basics of an Operating System - or rather a Real-Time Operating System. In this chapter we will look into the various concepts and constructs - patterns - that we need in order to deal with concurrent or parallel work. If an OS or RTOS is available, it will help us in many ways - if not, we need to do more work. Several patterns are not specific to parallel systems, but I have chosen to include them because they become even more important here.
Layering
There is an old rule or guideline that I would like to mention when it comes to writing software: Strong Cohesion - Low Coupling. This rule can be applied to functions as well as classes and modules. According to Oxford Languages, "cohesion" means "the action or fact of forming a united whole". I interpret "coupling" to mean "dependency" in the above context. The idea is that a function should basically only do one thing - without side effects - like e.g., sin(). A module or a class should contain functions and data that belong together - it could be trigonometry. Thus, you will find cos() in another function - but in the same module or class, and you will not find led_on() in the trigonometry module or class. It's all very clear and pure with math; in real life it's not so easy. But it is definitely worth trying.
In real life we have dependencies - but we can try to be aware of them, and only have them one way. Whenever I am introduced to a new embedded system, I always look for the layers. This is an important pattern - not only in concurrent or parallel code, but always. The simple rule of a layered system is that each layer only "knows" about layers below it. An upper layer knows the APIs of the lower layer and can call these - not the other way around. In a truly layered system any layer only knows about the layer just below. It only calls one level down. The advantage of such a system is that it is easy to modularize. You can substitute a given layer with a better implementation - or even an implementation with a different functionality. Figure 12.1 shows the generic 5-layer internet protocol stack. Under the TCP/IP (layers 4 and 3) you may have an Ethernet (layers 2 and 1) implementation. The Ethernet layers may however be substituted with e.g., Wi-Fi. The word "stack" in Protocol Stack simply means a number of layers on top of each other - together implementing one or more protocols. Note that today we often talk about the 7-layer OSI stack, which unfortunately wasn't in fashion when the internet was designed. You can read more about this in my book "Embedded Software for the IoT".

Figure 12.1: Generic 5-layer stack and Example
The cost of layering is sometimes performance. You will have a hard time finding a separate TCP implementation that you can hook up with a separate IP implementation. TCP and IP are implemented together because they are used together, and the benefits from optimizations can be huge and important. Still, it is important to at least have conceptual layers.
I sometimes hear developers complain that with several layers, a little addition to the bottom layer causes a tidal wave of changes in all the layers above as the new functionality is embraced. This can happen. You should not add layers just to add layers. Each layer must be justified. However, it is also possible to have flexible data-formats - like JSON - where you can extend a given message without changing intermediate layers.
We saw in Chapter 11 how an OS can provide an API that requires the application to know and use the OS - while the OS does not need to know the application.

Figure 12.2: Concept of Layers with RTOS
Figure 12.2 shows an application running on top of an RTOS - and the RTOS running on top of the Drivers layer. The figure also includes Middleware. This is not true layering, as the application can go through middleware to reach the OS - but it can also skip the middleware. The middleware might be a TCP/IP-stack or a message-mailbox system or another concept that allows threads to communicate. Often the middleware is more or less integrated in the OS - just think of Android, which is a lot of middleware around a Linux-like core. Figure 12.2 contains down-pointing arrows to stress the idea about "who knows who". In both Figures 12.1 and 12.2, data will go both up and down - arrows or not.
Things are, however, seldom as simple and layered as Figure 12.2 indicates. Some drivers may come with the RTOS, while others may be written by the developers of the application. The API offered by any driver is in reality dictated by the RTOS. This is bending the rules of layers. The layers above and below RTOS and middleware must to a large degree adapt to these. This is how the hardware plus the RTOS (and optionally middleware) ends up being a platform. It can be almost impossible to replace one RTOS with another, since the layers both above and below have adapted to the original RTOS. This is probably why Arm came up with CMSIS-RTOS. The idea is to have a generic interface to the OS - from applications as well as from drivers. This allows code to be ported between systems with different RTOS'es.

Figure 12.3: Arm view of CMSIS with RTOS
Figure 12.3 shows a view from the Arm CMSIS-RTOS documentation. Here the Middleware is arranged side-by-side with the application. Drivers are not shown.
So, what is the moral of the layering discussion? Basically, layers are important, but modern systems are often too complex to be satisfied by a single set of layers. Subsystems emerge. Apart from the TCP/IP-stack we might have a BLE (Bluetooth Low Energy) stack and an embedded database system. They each have their own layered subsystem.
If the system has no concept of layers - just a lot of code - you will run into circular references. It is a nuisance if you must compile one bunch of files before another bunch - and then you need to compile the first bunch again. There will be this nagging doubt that you will be testing something that behaves differently the next time it is built.
The rules of layering are challenged - the RTOS (and middleware) can dominate an embedded system. Arm has introduced CMSIS-RTOS to make it easier to substitute one OS with another.
We still need to discuss how to call “upwards” in a layered system This is discussed in Section 12.3.
State Machines
It is easy to get lost in deep hierarchies of if-then-else statements. Often a solution is to use switch-case statements, but when you have many events, and different states or modes in the system, you need a state-machine. This is also called a Finite-State Machine aka an FSM. Hardware developers use state-machines all the time. Many embedded developers use them a lot. PC-programmers think they don't use state-machines, but in reality they are hidden in the framework. I have created an example based on the day of a developer - see Figure 12.4.

Figure 12.4: A day in the life of a developer

A state-machine is centered around one or more tables. Personally, I like to have as much as possible in one table. In Listing 12.1 we have events as rows and states as columns. In the cells we have structs containing an action-function and the next state. You sometimes see tables with gates, and other versions with a table per state. I like that I can have the full state-machine on a screen - or a single sheet of paper. It gives a great overview - you can e.g., see which events you are not using - and when. This may provoke you to think that maybe you actually should use that event.
A state-machine can be very local to a task, but it can also be global to the system. Behind the table we have the action-functions and the logic that runs the table. In a system-wide state-machine it is preferred to separate the logic from the execution-context of the event-sources. In such a case you can have one task that is responsible for running the state-machine. This task may read events from an OS queue - aka mailbox.
Listing 12.1: Source for State-machine

struct action fsm[EVENTS][STATES] =
{
    // EDIT             BUILD            DOWNL           RUN              DEBUG             EVENTS
    { {doCode,NA},      {NA,NA},         {NA,NA},        {NA,NA},         {doEdit,EDIT} },  // Idea
    { {NA,NA},          {doCoff,NA},     {NA,NA},        {NA,NA},         {doStep,NA}   },  // Tick
    { {NA,NA},          {NA,NA},         {NA,NA},        {doRead,NA},     {NA,NA}       },  // GoodPr
    { {NA,NA},          {NA,NA},         {NA,NA},        {doEdit,EDIT},   {NA,NA}       },  // BadPr
    { {doBuild,BUILD},  {NA,NA},         {NA,NA},        {NA,NA},         {doRun,RUN}   },  // Retry
    { {NA,NA},          {doEdit,EDIT},   {NA,NA},        {NA,NA},         {NA,NA}       },  // WarnErr
    { {NA,NA},          {doDownl,DOWNL}, {NA,NA},        {NA,NA},         {NA,NA}       },  // NoWarnErr
    { {NA,NA},          {NA,NA},         {doEdit,EDIT},  {NA,NA},         {NA,NA}       },  // NoStart
    { {NA,NA},          {NA,NA},         {doRun,RUN},    {NA,NA},         {NA,NA}       },  // Start
    { {NA,NA},          {NA,NA},         {NA,NA},        {doDebug,DEBUG}, {NA,NA}       },  // BreakPt
};
Other tasks as well as ISRs may generate messages for the queue. In a real system, events can be button-presses, alarms triggered by the system, data coming in on a port and much more.
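The logic that runs the table can be very small. A minimal sketch - assuming the NA in the function cell is a null pointer, and the NA in the state cell is a sentinel outside the valid states:

struct action { void (*func)(void); int next; };
extern struct action fsm[EVENTS][STATES];
static int state = EDIT;

void handle_event(int event)
{
    struct action a = fsm[event][state];
    if (a.func != NULL) a.func();        // run the action, if any
    if (a.next != NA)   state = a.next;  // move on, if a next state is given
}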
I always enjoy good documentation. Unfortunately, we often see that the code lives on while the documentation dies. It may help to keep the documentation with the code in the form of e.g., Doxygen. Doxygen supports PlantUML. Listing 12.2 shows the PlantUML for the drawing in Figure 12.4. On plantuml.com you can insert source, experiment, and get PNG and SVG drawings. You can do the same with many other diagram types. You can get plug-ins for VS Code that show PlantUML graphically when it meets code like this.
Listing 12.2: PlantUML for State-machine

BUILD --> EDIT: Warning or Error
BUILD --> DOWNLOAD: No warnings or Errors
DOWNLOAD --> EDIT: No start, check scripts
DOWNLOAD --> RUN: Start of Main

Blocking Threads vs Callbacks
In Chapter 11 we saw how a thread - or a process - can be Blocked. If you have ever tried programming a TCP-server using Berkeley Sockets, you know how powerful this can be. In a server you can create a thread that performs blocking Accept calls on a Listening socket. At the incoming Connect request from a client, a new socket is created - while the original thread is unblocked and normally loops back to Accept, where it blocks again. The new thread will block in Read or Write requests if there are respectively no data in the input buffer or no room in the transmit-buffer. The nice thing about this design is that the code ends up looking like the way we humans tend to think - and even describe what happens. This makes it more maintainable.
If there is no OS to handle all the blocking/unblocking and creation of threads behind the scenes, the above concept does not work. The same applies when using some lightweight RTOS'es. In these cases, you use Asynchronous calls. The concept is that the application performs the Accept - and later Read and/or Write - without blocking. Instead, the call returns immediately - not having a result with it. With the asynchronous call (or in a separate Init call) you hand the lower layer a pointer to a function that it can call when there is a change - e.g., when there is data for the Read call. We generally call this function a Callback. In FreeRTOS it is called a Hook. This is shown in Figure 12.5.

Figure 12.5: Callback
A simpler scenario could be a DMA transfer. We register a callback for the exception that happens when our DMA-buffer is half-full. When we get the callback, we act on the data we received. The advantage of the callback-scenario is that it is very flexible. The downsides are:
• The layering suffers. If we want to stick to the idea that the upper layer knows about the lower layer, it is important that the lower layer defines the signature for the callback in an h-file. The upper layer includes this h-file when compiling and may implement several callbacks that are registered as needed.
• It can be complex to handle e.g., several connections via the same callback. It is strongly advised to think in terms of state-machines.
• The callback source-code is in a normal application module, but it is executed in the context of the exception - as shown in Figure 12.5. This could block other interrupts. A developer might not even realize that this function in an application module is running as an exception, with other rules applying. You may not be allowed to use floating-point or printf. You should think carefully about making function calls. You need to finish your business here asap and maybe continue somewhere else - at a lower priority.

• If you use blocking threads, the source code reflects the order of execution. It is simpler to follow when reading, and simpler to debug. With callbacks, you cannot see directly which callback is called when - and whether a function is even a callback. Even tools like Doxygen1 cannot help you. It is recommended to use names that reflect the context.

1 Doxygen is a recommended way to tag the source and later generate documentation.

Function pointers are not always popular - but they are built into the Callback pattern. To make the code safer you may consider registering the callback as const at compile-time - assuring that it ends up in Flash with the code.
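A sketch of the h-file idea - the names are illustrative, not from a specific driver:

#include <stddef.h>
#include <stdint.h>

// The lower layer owns this signature and declares the registration:
typedef void (*rx_callback_t)(const uint8_t *data, size_t len);
void driver_register_rx_callback(rx_callback_t cb);

// The upper layer implements and registers the callback. Registering a
// const pointer at compile-time keeps it in Flash with the code:
void on_rx(const uint8_t *data, size_t len);
static rx_callback_t const myCallback = on_rx;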
Reentrant Code

Recursive functions are sometimes popular in data-processing on e.g., PCs, but embedded software-developers rarely use the concept, since each recursion adds a new function call to the stack. We typically want better control of the stack, and therefore tend to write code using loops instead. Nevertheless, even a single-threaded, non-recursive system may experience reentrancy if e.g., function A calls function B, which calls function C - which calls function A. In this example function A will not have completed the first call when the second one starts.
In a multithreaded system it is very likely that two threads call the same function. A classic example is the old C-library function strtok - see Listing 12.3. You call strtok several times to parse a string. The first call delivers a string-pointer to the string to parse as the first parameter. For this, and each following call, strtok finds a delimiter and replaces it with the char '\0' - in the original string. This allows strtok to return a zero-terminated (sub)string for each call, while the original string in the example ends up as "Once\0upon\0a\0time\0". It is all very fast and neat. The output from Listing 12.3 is:

Token = Once
Token = upon
Token = a
Token = time

Listing 12.3: Using the old strtok function

// Skipping prototypes etc.
int main(void)
{
    // Skipping init
    char mystring[] = "Once upon a time";
    const char delim[] = " ";
    char *token;

    token = strtok(mystring, delim);
    while (token != NULL)
    {
        printf("Token = %s\n", token);
        token = strtok(NULL, delim);
    }
}

The problem is, of course, that not only during the call to strtok, but even between calls, does strtok hold on to the original string - which it continues to modify on subsequent calls. If one thread starts parsing something, and then another thread starts its own parsing, the first thread will work on the second thread's string.

The modern version of strtok is called strtok_r (for reentrant). strtok_r lets the caller be the keeper of the original string. The caller now delivers a pointer to the string in every call to strtok_r. This removes the problem. You generally want to make your code reentrant. The rules are relatively simple:

• Synchronize access to shared resources by critical section, mutex, lock, semaphore etc. This is not a solution in the recursive scenario, as it most likely leads to a deadlock.

• No use of static vars or globals without synchronization. Instead, use the stack. Alternatively use TLS - Thread-Local Storage. TLS is also not a solution in the recursive scenario where a common thread is used.

• Use caller-provided storage to avoid the need for malloc in functions (like strtok_r).

• Use compiler and libraries that support the above.
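strtok_r is POSIX, and a newlib-based embedded toolchain will typically offer it too. A sketch of the reentrant version of Listing 12.3:

#include <stdio.h>
#include <string.h>

int main(void)
{
    char mystring[] = "Once upon a time";
    char *saveptr;   // per-caller state - this is what makes it reentrant
    char *token = strtok_r(mystring, " ", &saveptr);
    while (token != NULL)
    {
        printf("Token = %s\n", token);
        token = strtok_r(NULL, " ", &saveptr);
    }
    return 0;
}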
Divide and Conquer
Heterogeneous Tasks
The first kind of parallelization is the one we normally think of - creating a number of different tasks. I think that Ian Foster, in his algorithm for Task Parallelization, describes a neat process. Foster's Algorithm intuitively makes sense and can be used to define parallel heterogeneous tasks. It is shown in Figure 12.6. Quoting Foster:
1. Partitioning. The computation that is to be performed, and the data operated on by this computation, are decomposed into small tasks. Practical issues such as the number of processors in the target computer are ignored, and attention is focused on recognizing opportunities for parallel execution.

2. Communication. The communication required to coordinate task execution is determined, and appropriate communication structures and algorithms are defined.

3. Agglomeration. The task and communication structures defined in the first two stages of a design are evaluated with respect to performance requirements and implementation costs. If necessary, tasks are combined into larger tasks to improve performance or to reduce development costs.

4. Mapping. Each task is assigned to a processor in a manner that attempts to satisfy the competing goals of maximizing processor utilization and minimizing communication costs. Mapping can be specified statically or determined at runtime by load-balancing algorithms.
Assigning Priorities
There are many complex scheduling algorithms for advanced operating systems, where the scheduler dynamically adjusts priorities. In embedded systems, however, we often have periodic tasks that we want to assign static priorities to. Assume that we have independent, periodic tasks with known attributes - like the example in Table 12.1. Assume also that the overhead is negligible and that we want to assign static priorities to the tasks. In this case the preferred algorithm is "Deadline-Monotonic Priority Assignment". The concept is simply to assign priorities to tasks according to their deadlines - the time we have from the task is triggered until it must be completed. The columns in Table 12.1 are:
Task. This is the priority as well, because I wrote them in that order. Lowest number is highest priority.

Period. This is the period of the repeated task. All intervals are given in time-slices.

Deadline. This is the max time we can allow from the event that triggers the task until it is completed.

Exec Time. This is the time the task uses to execute its periodic job - once running.

Util. Here we calculate the fraction of the total time used by each task. The sum of these must be less than one.

Deadl Util. This is the execution time divided by the deadline time per task. If the sum of these is less than one, we can guarantee that we can reach all deadlines. If not, we need to examine more closely - like here.
Table 12.1: Task | Period | Deadline | Exec Time | Util | Deadl Util
Figure 12.7 shows how I used Excel to simply map the execution of the three tasks in Table 12.1. The bottom line is a time axis where I count time-slices. The concept is now as follows:
• In the line under each task I map the Period start with a "P". All tasks start in slice 0.

• In the same line I map the Deadlines with "D" and a running number. These are bolded.

• I now start with the top-priority task and put an X where it executes - also with a running number. This starts at the beginning of the period.

• Next, I put similar X'es for the second task. Again, starting at the Period start when possible - but only in columns that do not contain an X above.

• The same thing is done with the third task, with the longest deadlines. Here we see something interesting - the X'es are broken up when there are X'es above. This is preemption by the higher-priority tasks.

• Finally, I put X'es in the "Idle" row when the columns above are empty. We cannot see this in the figure, but I continued to more than 40 slices, and in the following 11-slice periods I got one or two X'es in the idle task. This matches the utilization of approximately 85 %.

• When the longest period has passed - as in the figure - without any X'es crossing their deadlines, we are good. This is because we started with a period start for all three tasks. In the following 11-slice periods, one of the faster tasks already has a jump-start on its job, and we have more time.
Figure 12.7: Verifying that we can make all tasks
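The two utilization sums are simple enough to sanity-check in a few lines. A sketch - the task values below are hypothetical placeholders, not the numbers from Table 12.1:

#include <stdio.h>

struct task { float period, deadline, exec; };  // all in time-slices

int main(void)
{
    struct task tasks[] = {       // hypothetical example values
        {  5.0f,  4.0f, 1.0f },
        {  7.0f,  5.0f, 2.0f },
        { 11.0f, 10.0f, 3.0f },
    };
    float util = 0.0f, dutil = 0.0f;
    for (int i = 0; i < 3; i++) {
        util  += tasks[i].exec / tasks[i].period;    // CPU utilization
        dutil += tasks[i].exec / tasks[i].deadline;  // deadline utilization
    }
    printf("Util: %.2f  Deadline-Util: %.2f\n", util, dutil);
    // dutil <= 1.0 guarantees all deadlines; if util <= 1.0 < dutil, map it
    // out slice-by-slice as in Figure 12.7.
    return 0;
}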
SIMD
The second type of work that also intuitively lends itself to parallelization is when the same algorithm is performed on similar datasets. This is called SIMD - Single Instruction, Multiple Data. This could be processing stereo sound or lines of an image. Such processing is parallel, but if we can apply the SIMD-instructions, we don't even need to think in terms of multiple tasks. The important thing is to make sure that data is lined up in a way that allows the instructions to run smoothly. A sketch using the Cortex-M SIMD intrinsics follows the list. The SIMD term is one of four types in Flynn's Taxonomy:
• SISD = Single Instruction, Single Data.

• SIMD = Single Instruction, Multiple Data. Stereo, graphics processing or multichannel measurements and generators.

• MISD = Multiple Instruction, Single Data. Could be multiple statistics on the same dataset.

• MIMD = Multiple Instruction, Multiple Data.
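On a Cortex-M4/M7 with the DSP extension, CMSIS exposes SIMD intrinsics such as __SMLAD. A sketch - assuming q15 samples packed two per 32-bit word:

#include <stdint.h>
// __SMLAD comes from the CMSIS core header, e.g., core_cm4.h

// Each word holds two q15 samples; __SMLAD multiplies both 16-bit
// lanes and adds both products to the accumulator - two MACs in one
// instruction.
int32_t dot_q15_packed(const uint32_t *a, const uint32_t *b, int pairs)
{
    int32_t acc = 0;
    for (int i = 0; i < pairs; i++)
        acc = __SMLAD(a[i], b[i], acc);
    return acc;
}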
Pipelining
The third type of parallelism is Pipelining. This is well known in CPU-cores as well as in Digital Signal Processing from e.g., codecs. The signal may be streamed through different filters, chopped up, frequency-analyzed, subjected to statistical procedures and transmitted. At the other end, the original signal is recreated as well as possible. I think that a pipelining split often presents itself when you see the algorithm. It may, however, not be the best split. The trick is to create an algorithm that can be efficiently pipelined. In signal processing, pipelining is popular, but introduces latency. In a one-directional transmission - like flow-TV or streaming Netflix - latency is typically not a problem. However, in a two-way conversation - like on a phone - latency can be an issue.
Dedicated Subsystems
Finally, I want to mention a concept which is sometimes forgotten: dedicated hardware with dedicated software for subsystems. Just think about a car with hundreds of small MCUs, distributed via CAN-networks. Once such a part is implemented, tested and compliance-approved, you don't need to repeat this process. Reuse becomes a realistic possibility.
Bluetooth Low Energy - BLE - is a good example in the general embedded area. Here we have the "stack" very clearly defining two specific roles - Host and Controller. The Host sits at the top of the stack - close to your application - and is responsible for the various "Profiles". The Controller is responsible for the Link and Physical layers.
As a designer you can choose to do it all yourself. Compliance with international standards is a huge thing when doing wireless. Both Host and Controller need to be validated with the "Bluetooth SIG" in tests that you can run yourself. There are, however, also many rules for the Radio part - like emitted power, shape of frequency spectra, duty-cycles and much more. These need to be respected in the firmware and tested by certified labs - for all relevant regions. To save costs, you may consider applying a stand-alone, already proven controller. Now you can "piggy-back" on the certification of the controller when you do your own certification - at least for the link-layer part. Your device will still need to pass measurements on the physical layer - your packaging might e.g., focus all radiated power in one direction. Alternatively, you can buy standalone devices with both host and controller. Here you often need to modify the host to adapt to your application. If, however, you can use a "dongle" or "module" containing the whole thing - including antenna - and you place this where a normal customer can replace it, you do not need to do any certification at all3.
³This is my understanding. Legally, of course, you need to seek counsel.
I recently (2023) saw an interview with a CEO (I believe it was) of a large car-maker explaining how being based on all these third-party MCUs, each with their own firmware, made it very difficult to turn the "super tanker" gasoline/diesel-car production around and compete with Tesla, who built it all up from the ground. This is an interesting statement. For years the car-industry has shown the world the power of Kaizen - constant small improvements of products and procedures. This process has steadily led to lower costs and higher quality. But there is a downside; you may lose the ability to make innovation-jumps if you focus too much on Kaizen. Secondly, this proves that the grass is always greener on the other side. Starting up Tesla from the bottom must have been incredibly tough - exactly because Tesla did not have the many years of Kaizen-experience and products on the shelves. Obviously, there is still a lot of sense in using sealed subsystems in many scenarios.
Message-Based System
In a multitask-system it is typically necessary to communicate more than simple flags between tasks. It is possible to communicate through shared memory, but message-based communication is becoming more and more common. One task creates a message and puts it in a queue belonging to another task.
In a virtual-memory system with real processes, this is complex to implement, but easy to use. A buffer is allocated by the OS on request from the sending task. The sending task fills the buffer and hands it back to the OS as a message to be sent to another task. The OS may now copy the content to memory belonging to the other task. This assures secure walls between the processes. Each can only read data belonging to itself - even the original empty buffer cannot contain traces of earlier communication from another task. The OS is also responsible for unblocking the receiving process, if it is blocked, waiting for instructions at the mailbox. Copying data is historically known as a performance-threat. However, as multiple cores become more and more common, designers have realized that shared memory can be a real bottleneck, as one core may lock out other cores when accessing shared memory. The result is that high-end CPUs are getting faster and faster when it comes to copying data. As usual, the small MCUs will copy concepts and ideas from the bigger CPUs and adapt them to the embedded world. From a software point-of-view, message-based communication has many advantages over shared memory:
• Simpler code with less low-level synchronization.
When one task puts a message in a queue, it is also a statement that the task is done messing with the data. The task retrieving the data can use it without locks, mutexes etc. This is normally even true if the message contains pointers to shared data.
• Tests can mix desktop and target environments.
You may send test-data from a PC to an embedded task that interacts with hardware and sends messages back to the PC. This allows you to e.g., develop algorithms on the PC, while the real hardware supplies data. Or the PC runs a regression-test by sending exactly the same data to the hardware as usual - testing that it behaves as usual.
• Debug can take place at a higher level.
Debug software can e.g., dump specific fields from each message with a timestamp - allowing you to follow which task communicates when, to which task - about what, and in which queue.
• It is easier to port software to a new platform.
When task-communication is message-based, you may choose to move one task from your Cortex-M7 to a Cortex-M4, which is part of another SoC - communicating with the M7 over SPI or Ethernet.
My first experiences with a message-based concept were based on a system containing two 8051-derivatives from Siemens - SAB80535 - many years ago. This system had a small fraction of the MIPS offered by any system today, but still we found that the debug abilities far outweighed the small performance-hit. By keeping the number of available buffers in a pool down, we implemented back-pressure, where a source of data is throttled. We could see which type of messages were sent from one task to another, and how filled the message-boxes and pools⁴ were. The two cores were on discrete ICs, but all debug communication came out on one of them. We arranged that one core wrote in red text and the other in white - it was fantastic.
⁴We will get to this in Section 12.8.
FreeRTOS is a good example of an RTOS for MCUs without virtual memory and processes. FreeRTOS allows a task to block when reading from an empty queue - and when writing to a full queue. Data - C-structs - are copied to the queue. We will see this in Chapter 13. The argument is that for small structures this is efficient, as the sending task is immediately free to overwrite or discard its own copy. This makes the control simpler. A task may prefer not to include large data-sets - only a pointer to shared memory. With the sender writing to such a structure and the receiver reading, the receiver "knows" that when it gets a pointer, it can read safely from it - the sender is done writing. The queues can even be used from interrupts and support both streamed and message-based communication. Streamed communication can be data from a hardware interrupt, while messages can be used for various thread-to-thread C-structs, where you want to keep boundaries between messages as they are sent. This is much like TCP (streamed with no boundaries) and UDP (message-based - keeping boundaries). The FreeRTOS queue concept is very elegant, but unfortunately it is basically made for one writer and one reader of every queue. As with "normal" mailboxes, an embedded system is not terribly limited by having only one reader of a mailbox, but only allowing one writer can be a problem. There are ways around this - but they are not so elegant.
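A minimal sketch of the copy-by-value usage described above - the struct and task names are my own illustration:

#include <stdint.h>
#include "FreeRTOS.h"
#include "queue.h"
#include "task.h"

typedef struct {        /* small C-struct - copied into the queue */
    uint32_t id;
    int16_t  value;
} msg_t;

static QueueHandle_t xMsgQueue;  /* created once: xQueueCreate(8, sizeof(msg_t)) */

void vSenderTask(void *pvParameters)
{
    (void)pvParameters;
    msg_t msg = { .id = 1, .value = 0 };
    for (;;) {
        /* Blocks if the queue is full. The struct is copied, so
           'msg' may be reused or overwritten right afterwards. */
        xQueueSend(xMsgQueue, &msg, portMAX_DELAY);
        msg.value++;
        vTaskDelay(pdMS_TO_TICKS(100));
    }
}

void vReceiverTask(void *pvParameters)
{
    (void)pvParameters;
    msg_t msg;
    for (;;) {
        /* Blocks on an empty queue until a message arrives. */
        if (xQueueReceive(xMsgQueue, &msg, portMAX_DELAY) == pdPASS) {
            /* ... use msg.id and msg.value ... */
        }
    }
}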
Ring Buffers
If you have a system of tasks that together realize the Pipeline-concept outlined in Section 12.5, you will have a data-flow from task to task. You may queue data as messages between tasks, but you might instead prefer a shared-memory Ring-Buffer. Ring-buffers are often used with simple data streams with constant flow - like e.g., samples from an A/D-Converter, S/PDIF or similar. A ring buffer is shown in Figure 12.8.
The concept is simple: We simulate a circular memory buffer. One task always writes and the other always reads. The reader may e.g., be woken by a timer-tick, or an event from an interrupt, and reads until it reaches the write-pointer and then blocks. If the reader overtakes the writer, it will re-read old samples. The writer may also monitor whether it passes the read-pointer. In this case it's not about blocking, but rather logging an error, because this means loss of data. There are a few things to observe (a small code sketch follows the list):
• As memory is normally not circular, we need to "wrap" the pointers. This is easy, but the above test for pointers passing each other becomes more complex.
• If the writer is DMA-based, we cannot have the wrap during a DMA-transfer. We need to assure that one transfer stops at the max address, and the next begins at the min address - effectively wrapping between transfers. This is a simple matter of assuring that the buffer-size fits with an integer number of transfers.
• Though the concept is often used for simple sample-based transfers, it can also be used for more complex data. By using pointers or counting bytes instead of counting items, we can handle varying message-sizes.
• The concept basically assures that writer and reader "know" when they have access, and we therefore do not need to worry about reading a mix of old and new data. The only place where we need to take care is with the pointers. If 32-bit pointers are 4-byte aligned in a Cortex-M system, there is no risk of reading a pointer in the midst of an update - e.g., reading the most significant 16 bits before an increment and the least significant 16 bits after⁵. If the writer writes data first (possibly a full transfer) and afterwards updates the pointer, all is well.
⁵This could happen on some platforms, and a Mutex may be needed.
• Although ring-buffer administration is not rocket-science, it is still nice that many libraries include ring-buffers. Some DSPs even support ring-buffers in hardware, so that the memory indeed appears to be circular.
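Here is the promised sketch: a minimal single-writer/single-reader ring buffer, assuming that aligned 32-bit index updates are atomic (as on Cortex-M) and using a power-of-two size to make the wrap cheap - all names are illustrative:

#include <stdint.h>
#include <stdbool.h>

#define RB_SIZE 256u          /* power of two makes wrapping cheap */

typedef struct {
    int16_t  data[RB_SIZE];
    volatile uint32_t wr;     /* only the writer updates this */
    volatile uint32_t rd;     /* only the reader updates this */
} ringbuf_t;

/* Writer: store the sample first, then publish the new index. */
bool rb_put(ringbuf_t *rb, int16_t sample)
{
    uint32_t next = (rb->wr + 1u) & (RB_SIZE - 1u);
    if (next == rb->rd)
        return false;         /* full - log an error, data would be lost */
    rb->data[rb->wr] = sample;
    rb->wr = next;            /* single aligned write - safe to observe */
    return true;
}

/* Reader: consume until it catches up with the write index. */
bool rb_get(ringbuf_t *rb, int16_t *sample)
{
    if (rb->rd == rb->wr)
        return false;         /* empty - reader blocks or yields */
    *sample = rb->data[rb->rd];
    rb->rd = (rb->rd + 1u) & (RB_SIZE - 1u);
    return true;
}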
Dynamic Memory
If you can base your embedded system solely on pre-allocated - aka static - memory, as discussed in Chapter 4, then you have a great start on a very deterministic system. In the embedded world we love the word "deterministic", which we interpret as predictable. There are many scenarios where devices are transferring data, and all you may need at the lower level is two buffers taking turns on being filled and emptied. This is called Buffer Switching and is normally based on statically allocated buffers. Extending this concept from 2 to N buffers gives us the Ring-Buffer, which we saw in Section 12.7. In Section 12.6 we met the message-based system. In FreeRTOS, messages are copied into the pre-allocated queue memory. This is great for small messages, and by allowing data to be pointers to shared memory, it also supports the use of more dynamic memory. Other message-based systems only support pointers. You may also be using libraries that are written for systems with dynamic memory - expecting to be able to allocate or release data. It may not be possible to statically allocate memory for these cases.
When you are writing C-programs for a desktop PC, you often need to allocate memory "on-the-fly". In C we use malloc, and in C++ we use new. These calls allocate the memory on the Heap. Most embedded developers, however, shy away from using the heap. A heap can become fragmented. This creates very non-deterministic behavior - it is really hard to predict when malloc returns 0 - and thus has failed to retrieve more memory.
The classic embedded solution is to use Pooled Allocation. A set of pools is created. Each pool is a memory area that holds a number of buffers of a given size. There could e.g., be 13 buffers of size 8, 8 buffers of size 16, 5 buffers of size 32, etc. There is a bit of administration also - typically each buffer has a small admin part. In the admin area the system registers which task or mailbox currently owns the buffer - or the admin area is used e.g., to link the free buffers in the pool together.
Synchronization Patterns
Critical Section
A Critical Section is a piece of code that you only want one thread to run at a time. Once a thread starts on this piece of code, it should continue to the end of the critical section. We do not want a context switch to another thread. We might not even want a hardware interrupt. The critical section may be used to assure that changes to data are coherent before another thread starts using the data, or the critical section may perform something that needs to be completed in real-time.
A classic way to implement a critical section is to start by disabling interrupts and then enable interrupts when done (only if they were enabled before). This assures against both context-switches and interrupts. In FreeRTOS, taskENTER_CRITICAL() and taskEXIT_CRITICAL() disable and enable all interrupts or - if configured so - interrupts below a certain priority.
The downside is a higher interrupt latency. It is therefore crucial that the critical section executes quickly. The concept can be good for small systems where you understand the dynamics of the system - e.g., you can say that in the given scenario you can live with a longer interrupt latency. The alternative is to acquire a lock or a mutex before executing the critical section.
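A minimal FreeRTOS sketch - the shared counter is my own illustration:

#include <stdint.h>
#include "FreeRTOS.h"
#include "task.h"

static volatile uint32_t shared_count;

void increment_shared(void)
{
    taskENTER_CRITICAL();   /* masks interrupts - or only those below
                               the configured priority */
    shared_count++;         /* the read-modify-write is now atomic */
    taskEXIT_CRITICAL();    /* keep the section short - latency! */
}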
Lock
A lock is a well-known concept. If you visit a room where you want to be alone, you lock the door from the inside with the only key. When you (finally) leave, someone else can get in. They too will lock the door to get a bit of privacy.
For this purpose we need a way to obtain the key without another thread also believing that it got the key. For years the simple solution was a Test-And-Set operation. When you want to obtain the key to a specific lock, you test-and-set the variable to one. The instruction returns the value of the cell before you did the write. Thus, if you read binary one, you have to try again, because someone else beat you to it. When you are done with the job, you clear the memory cell with a plain write.
If things are very fast, and the wait will be very short, you may test in a simple while-loop. This is a Spin-Lock and may even be used within an interrupt function.
Listing 12.4: Load and Store Exclusively to get a Lock

        MOV     R0, #1          ; The "locked" value
TestLock
        LDREX   R3, [R2]        ; Read the lock - starts the exclusive monitor
        CMP     R3, #0          ; Set flags on the value
        STREXEQ R3, R0, [R2]    ; If clear - then write
        CMPEQ   R3, #0          ; If we tried - did the store succeed?
        BNE     TestLock        ; Try again if failed
Older Arm MCUs had a Swap instruction that could be used for Test-And-Set. As we have seen in Chapter 4, Bit-Banding in memory can be used for this purpose. However, not all Cortex-Ms have Bit-Banding, so this is not so portable.
Cortex-M3 and upwards have the Load-Exclusive and Store-Exclusive instructions. The ldrex instruction starts a HW-monitor that is ended by the strex (or by a clrex). There are also Byte and Half-word variants - ending with respectively "B" or "H". When the store is done, it returns a zero if no other thread or process has written anything in the meantime. The concept even works on multicore CPUs due to the hardware-based state-machine.
These instructions can be used to perform Read-Modify-Write operations. A lock can be realized by doing a busy-wait - implementing a spin-lock as shown in Listing 12.4.
Mutex
The term Mutex comes from "Mutual Exclusion". Like a lock, it is about only allowing access to one of several parties. If the waiting time gets so long that we don't want to do a busy-wait, we could build a task-yield into the loop in Listing 12.4 - if we have an RTOS. Taking it a step further, we could ask the RTOS to let us wait for a specific time in the loop - and then try again. Now that we have involved the RTOS - why not let it hold the key and simply wake us up when we are good to go? This is essentially a Mutex.
Most operating systems - real-time or not - implement Priority Inheritance for Mutexes. This means that if a high-priority task is waiting for a low-priority task, then the low-priority task will inherit the priority of the high-priority task. All tasks using a Mutex are using it in the same way, and they may get blocked. Thus, you cannot use a Mutex from an interrupt.
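A minimal FreeRTOS sketch - the resource and names are my own illustration; FreeRTOS mutexes implement the Priority Inheritance just described:

#include "FreeRTOS.h"
#include "semphr.h"

static SemaphoreHandle_t xMutex;  /* created once: xSemaphoreCreateMutex() */

void use_shared_resource(void)
{
    /* Block up to 10 ms waiting for the key. */
    if (xSemaphoreTake(xMutex, pdMS_TO_TICKS(10)) == pdTRUE) {
        /* ... access the shared resource ... */
        xSemaphoreGive(xMutex);   /* hand back the key */
    } else {
        /* timed out - handle gracefully */
    }
}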
Semaphore
Semaphore is the marine concept of signalling by using two flags in different positions. In the programming world, semaphores are often a service from the OS. There are similarities with Mutexes - but also differences. Where a Mutex always only allows one task to access the shared resource - it is said to be binary - semaphores may also be binary, but alternatively they can be counting. This means that a semaphore can be configured to allow e.g., five active tasks - using the same semaphore - but will block the sixth that asks for a pass. Where the Mutex concept is completely symmetric, the Semaphore concept is not. One party always posts - or signals - while the other party listens. Signalling a semaphore never causes a thread to block, and is therefore allowed from an interrupt. Since both "ends" are not blocked, priority inversion does not come into play with semaphores. In CMSIS-RTOS a different terminology is used; one task or interrupt Releases a Semaphore and another task Acquires it.
We will see an example of this in Chapter 13.
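Ahead of that, a minimal CMSIS-RTOS v2 sketch of the Release/Acquire pattern - the interrupt and task names are my own illustration:

#include "cmsis_os2.h"

static osSemaphoreId_t dataReady;  /* created once: osSemaphoreNew(1U, 0U, NULL) */

void ADC_IRQHandler(void)          /* illustrative interrupt handler */
{
    /* Releasing never blocks - allowed from an interrupt. */
    osSemaphoreRelease(dataReady);
}

void WorkerTask(void *argument)
{
    (void)argument;
    for (;;) {
        /* Blocks until the interrupt signals new data. */
        osSemaphoreAcquire(dataReady, osWaitForever);
        /* ... process the data ... */
    }
}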
Initial FreeRTOS Setup
In this chapter we test FreeRTOS on the small Nucleo-64 board from earlier. This also gives us a chance to take a closer look at the STM32CubeMX configuration tool. As you may remember from Chapters 2 and 4, the Nucleo-64 STM32F334R8 board has very little RAM (12+4 kB). However, FreeRTOS does not need much space. I started a fresh STM32 project in the STM32CubeIDE. This opened up STM32CubeMX. STM32CubeMX is not a "once-in-a-lifetime-of-a-project" experience. You can open the project's "ioc"-file and redo choices at any time. Note that there are comments in e.g., main.c where you can add "user-code". If you respect these boundaries, your code will not be overwritten by STM32CubeMX updates. Still, better safe than sorry - I will always prefer to commit my git branch before starting STM32CubeMX on an existing project in real use.
Under “Middleware” in STM32CubeMX, “FreeRTOS” was selected. I added an extra task on top of the default suggested by STM32CubeMX. I renamed the two tasks to Task1 and Task2¹, giving both the priority osPriorityNormal - to see timeslicing in action - and selected CMSIS-RTOS v2 as the interface. Under “advanced settings” the tool suggested Use_NEWLIB_REENTRANT - which I also did. With an RTOS involved, you generally like things being reentrant. You can see the generated setup in Listing 13.2, and the running system in Figure 13.5 (after adding printout in the next section). When I built the system and started debugging, I could set breakpoints in the two tasks, and they worked.
¹In the code they become StartTask1 and StartTask2.
__attribute__((weak)) int _write(int file, char *ptr, int len)
{
  int DataIdx;
  for (DataIdx = 0; DataIdx < len; DataIdx++)