Carl Kadie, Author at Towards Data Science — the world’s leading publication for data science, AI, and ML professionals.

How to Optimize your Python Program for Slowness https://towardsdatascience.com/how-to-optimize-your-python-program-for-slowness/ Tue, 08 Apr 2025

Write a short program that finishes after the universe dies

The post How to Optimize your Python Program for Slowness appeared first on Towards Data Science.


Also available: A Rust version of this article.

Everyone talks about making Python programs faster [1, 2, 3], but what if we pursue the opposite goal? Let’s explore how to make them slower — absurdly slower. Along the way, we’ll examine the nature of computation, the role of memory, and the scale of unimaginably large numbers.

Our guiding challenge: write short Python programs that run for an extraordinarily long time.

To do this, we’ll explore a sequence of rule sets — each one defining what kind of programs we’re allowed to write, by placing constraints on halting, memory, and program state. This sequence isn’t a progression, but a series of shifts in perspective. Each rule set helps reveal something different about how simple code can stretch time.

Here are the rule sets we’ll investigate:

  1. Anything Goes — Infinite Loop
  2. Must Halt, Finite Memory — Nested, Fixed-Range Loops
  3. Infinite, Zero-Initialized Memory — 5-State Turing Machine
  4. Infinite, Zero-Initialized Memory — 6-State Turing Machine (>10↑↑15 steps)
  5. Infinite, Zero-Initialized Memory — Plain Python (compute 10↑↑15 without Turing machine emulation)

Aside: 10↑↑15 is not a typo or a double exponent. It’s a number so large that “exponential” and “astronomical” don’t describe it. We’ll define it in Rule Set 4.

We start with the most permissive rule set. From there, we’ll change the rules step by step to see how different constraints shape what long-running programs look like — and what they can teach us.

Rule Set 1: Anything Goes — Infinite Loop

We begin with the most permissive rules: the program doesn’t need to halt, can use unlimited memory, and can contain arbitrary code.

If our only goal is to run forever, the solution is immediate:

while True:
  pass

This program is short, uses negligible memory, and never finishes. It satisfies the challenge in the most literal way — by doing nothing forever.

Of course, it’s not interesting — it does nothing. But it gives us a baseline: if we remove all constraints, infinite runtime is trivial. In the next rule set, we’ll introduce our first constraint: the program must eventually halt. Let’s see how far we can stretch the running time under that new requirement — using only finite memory.

Rule Set 2: Must Halt, Finite Memory — Nested, Fixed-Range Loops

If we want a program that runs longer than the universe will survive and then halts, it’s easy. Just write two nested loops, each counting over a fixed range from 0 to 10¹⁰⁰−1:

for a in range(10**100):
  for b in range(10**100):
      if b % 10_000_000 == 0:
          print(f"{a:,}, {b:,}")

You can see that this program halts after 10¹⁰⁰ × 10¹⁰⁰ steps. That’s 10²⁰⁰. And — ignoring the print — this program uses only a small amount of memory to hold its two integer loop variables: just 144 bytes.

My desktop computer runs this program at about 14 million steps per second. But suppose it could run at Planck speed (one step per unit of Planck time, the smallest meaningful interval of time in physics). That would be about 10⁵⁰ steps per year — so 10¹⁵⁰ years to complete.
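A quick sanity check of this arithmetic, assuming a Planck time of about 5.39 × 10⁻⁴⁴ seconds (a value I am supplying here, not one stated in the article):

```python
planck_time_s = 5.39e-44                 # seconds per step (assumed Planck time)
seconds_per_year = 365.25 * 24 * 3600
steps_per_year = seconds_per_year / planck_time_s
assert 1e50 < steps_per_year < 1e51      # about 10**50 steps per year, as claimed

years = 10.0**200 / steps_per_year       # total steps divided by steps per year
assert 1e149 < years < 1e150             # about 10**150 years, as claimed
```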

Current cosmological models estimate the heat death of the universe in 10¹⁰⁰ years, so our program will run about 100,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000 times longer than the projected lifetime of the universe.

Aside: Practical concerns about running a program beyond the end of the universe are outside the scope of this article.

For an added margin, we can use more memory. Instead of 144 bytes for variables, let’s use 64 gigabytes — about what you’d find in a well-equipped personal computer. That’s about 500 million times more memory, which gives us about one billion loop variables instead of two. If each variable iterates over the full 10¹⁰⁰ range, the total number of steps becomes roughly (10¹⁰⁰)^(10⁹), or about 10^(100 billion) steps. At Planck speed — roughly 10⁵⁰ steps per year — that corresponds to 10^(100 billion − 50) years of computation.
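One way to picture the billion-variable version is an odometer: a list of counters where each rollover carries into the next. This sketch (the function name and structure are mine, not the article’s) visits every state, bound**num_vars in total:

```python
def odometer_steps(num_vars, bound):
    """Count every state of num_vars counters, each ranging over [0, bound)."""
    counters = [0] * num_vars
    steps = 0
    while True:
        steps += 1                      # visit the current state
        i = 0
        while i < num_vars and counters[i] == bound - 1:
            counters[i] = 0             # roll this digit over and carry
            i += 1
        if i == num_vars:               # every digit rolled over: we're done
            return steps
        counters[i] += 1

assert odometer_steps(3, 10) == 1_000   # 10**3 states, like a 3-digit odometer
```

With num_vars = 10⁹ and bound = 10¹⁰⁰, this loop would take the 10^(100 billion) steps described above.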


Can we do better? Well, if we allow an unrealistic but interesting rule change, we can do much, much better.

Rule Set 3: Infinite, Zero-Initialized Memory — 5-State Turing Machine

What if we allow infinite memory — so long as it starts out entirely zero?

Aside: Why don’t we allow infinite, arbitrarily initialized memory? Because it trivializes the challenge. For example, you could mark a single byte far out in memory with a 0x01—say, at position 10¹²⁰—and write a tiny program that just scans until it finds it. That program would take an absurdly long time to run — but it wouldn’t be interesting. The slowness is baked into the data, not the code. We’re after something deeper: small programs that generate their own long runtimes from simple, uniform starting conditions.

My first idea was to use the memory to count upward in binary:

0
1
10
11
100
101
110
111
...

We can do that — but how do we know when to stop? If we don’t stop, we’re violating the “must halt” rule. So, what else can we try?

Let’s take inspiration from the father of Computer Science, Alan Turing. We’ll program a simple abstract machine — now known as a Turing machine — under the following constraints:

  • The machine has infinite memory, laid out as a tape that extends endlessly in both directions. Each cell on the tape holds a single bit: 0 or 1.
  • A read/write head moves across the tape. On each step, it reads the current bit, writes a new bit (0 or 1), and moves one cell left or right.
A read/write head positioned on an infinite tape.
  • The machine also has an internal variable called state, which can hold one of n values. For example, with 5 states, we might name the possible values A, B, C, D, and E—plus a special halting state H, which we don’t count among the five. The machine always starts in the first state, A.

We can express a full Turing machine program as a transition table. Here’s an example we’ll walk through step by step.

A 5-state Turing machine transition table.
  • Each row corresponds to the current tape value (0 or 1).
  • Each column corresponds to the current state (A through E).
  • Each entry in the table tells the machine what to do next:
    • The first character is the bit to write (0 or 1)
    • The second is the direction to move (L for left, R for right)
    • The third is the next state to enter (A, B, C, D, E, or H, where H is the special halting state).
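The transition-table rules above are simple enough to interpret in a few lines of Python. This is a minimal sketch, not the article’s busy-beaver-blaze interpreter, and the well-known 2-state busy beaver used to exercise it is my own choice of test machine:

```python
from collections import defaultdict

def run(table, max_steps):
    """Run a Turing machine on an all-zero tape.

    `table` maps (state, bit) -> (bit_to_write, 'L' or 'R', next_state);
    'H' is the halting state. Returns (steps_taken, final_state)."""
    tape = defaultdict(int)            # infinite tape, zero-initialized
    head, state, steps = 0, "A", 0
    while state != "H" and steps < max_steps:
        write, move, state = table[(state, tape[head])]
        tape[head] = write             # write the new bit
        head += 1 if move == "R" else -1
        steps += 1
    return steps, state

# The 2-state busy beaver champion: small enough to run instantly.
bb2 = {
    ("A", 0): (1, "R", "B"), ("A", 1): (1, "L", "B"),
    ("B", 0): (1, "L", "A"), ("B", 1): (1, "R", "H"),
}
assert run(bb2, 1_000) == (6, "H")     # halts after exactly 6 steps
```

The 5-state champion runs the same way, though its 47-million-step execution takes minutes in pure Python.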

Now that we’ve defined the machine, let’s see how it behaves over time.

We’ll refer to each moment in time — the full configuration of the machine and tape — as a step. This includes the current tape contents, the head position, and the machine’s internal state (like A, B, or H).

Below is Step 0. The head is pointing to a 0 on the tape, and the machine is in state A.

Looking at row 0, column A in the program table, we find the instruction 1RB. That means:

  • Write 1 to the current tape cell.
  • Move the head Right.
  • Enter state B.

Step 0:

This puts us in Step 1:

The machine is now in state B, pointing at the next tape cell (again 0).

What will happen if we let this Turing machine keep running? It will run for exactly 47,176,870 steps — and then halt. 

Aside: With a Google sign in, you can run this yourself via a Python notebook on Google Colab. Alternatively, you can copy and run the notebook locally on your own computer by downloading it from GitHub.

That number 47,176,870 is astonishing on its own, but seeing the full run makes it more tangible. We can visualize the execution using a space-time diagram, where each row shows the tape at a single step, from top (earliest) to bottom (latest). In the image:

  • The first row is blank — it shows the all-zero tape before the machine takes its first step.
  • 1s are shown in orange.
  • 0s are shown in white.
  • Light orange appears where 0s and 1s are so close together they blend.
Space-time diagram for the champion 5-state Turing machine. It runs for 47,176,870 steps before halting. Each row shows the tape at a single step, starting from the top. Orange represents 1, white represents 0.

In 2023, an online group of amateur researchers organized through bbchallenge.org proved that this is the longest-running 5-state Turing machine that eventually halts.


Want to see this Turing machine in motion? You can watch the full 47-million-step execution unfold in this pixel-perfect video:

Or interact with it directly using the Busy Beaver Blaze web app.

The video generator and web app are part of busy-beaver-blaze, the open-source Python & Rust project that accompanies this article.


It’s hard to believe that such a small machine can run 47 million steps and still halt. But it gets even more astonishing: the team at bbchallenge.org found a 6-state machine with a runtime so long it can’t even be written with ordinary exponents.

Rule Set 4: Infinite, Zero-Initialized Memory — 6-State Turing Machine (>10↑↑15 steps)

As of this writing, the longest running (but still halting) 6-state Turing machine known to humankind is:

      A    B    C    D    E    F
0   1RB  1RC  1LC  0LE  1LF  0RC
1   0LD  0RF  1LA  1RH  0RB  0RE

Here is a video showing its first 10 trillion steps:

And here you can run it interactively via a web app.

So, if we are patient — comically patient — how long will this Turing machine run? More than 10↑↑15, where 10↑↑15 means a power tower of fifteen 10s: 10^(10^(10^(⋯^10))), stacked 15 levels high.

This is not the same as 10¹⁵ (which is just a regular exponent). Instead:

  • 10¹ = 10
  • 10¹⁰ = 10,000,000,000
  • 10^(10^10) is 10¹⁰⁰⁰⁰⁰⁰⁰⁰⁰⁰⁰ — a 1 followed by ten billion zeros — already unimaginably large.
  • 10↑↑4 is so large that it vastly exceeds the number of atoms in the observable universe.
  • 10↑↑15 is so large that writing it in exponent notation becomes annoying.
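For intuition only, tetration can be written with Python’s built-in exponentiation. The helper name tet is mine, and it is usable only for the tiniest inputs, since even tet(10, 3) has ten billion digits:

```python
def tet(a, n):
    """a↑↑n: evaluate a tower of n copies of a, from the top down."""
    result = 1
    for _ in range(n):
        result = a ** result
    return result

assert tet(10, 1) == 10
assert tet(10, 2) == 10**10      # ten billion
assert tet(2, 4) == 65536        # 2^(2^(2^2))
```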

Pavel Kropitz announced this 6-state machine on May 30, 2022. Shawn Ligocki has a great write up explaining both his and Pavel’s discoveries. To prove that these machines run so long and then halt, researchers used a mix of analysis and automated tools. Rather than simulating every step, they identified repeating structures and patterns that could be proven — using formal, machine-verified proofs — to eventually lead to halting.


Up to this point, we’ve been talking about Turing machines — specifically, the longest-known 5- and 6-state machines that eventually halt. We ran the 5-state champion to completion and watched visualizations to explore its behavior. But the discovery that it’s the longest halting machine with 5 states — and the identification of the 6-state contender — came from extensive research and formal proofs, not from running them step-by-step.

That said, the Turing machine interpreter I built in Python can run for millions of steps, and the visualizer written in Rust can handle trillions (see GitHub). But even 10 trillion steps isn’t an atom in a drop of water in the ocean compared to the full runtime of the 6-state machine. And running it that far doesn’t get us any closer to understanding why it runs so long.

Aside: Python and Rust “interpreted” the Turing machines up to some point — reading their transition tables and applying the rules step by step. You could also say they “emulated” them, in that they reproduced their behavior exactly. I avoid the word “simulated”: a simulated elephant isn’t an elephant, but a simulated computer is a computer.

Returning to our central challenge: we want to understand what makes a short program run for a long time. Instead of analyzing these Turing machines, let’s construct a Python program whose 10↑↑15 runtime is clear by design.

Rule Set 5: Infinite, Zero-Initialized Memory — Plain Python (compute 10↑↑15 without Turing machine emulation)

Our challenge is to write a small Python program that runs for at least 10↑↑15 steps, using any amount of zero-initialized memory.

To achieve this, we’ll compute the value of 10↑↑15 in a way that guarantees the program takes at least that many steps. The ↑↑ operator is called tetration—recall from Rule Set 4 that ↑↑ stacks exponents: for example, 10↑↑3 means 10^(10^10). It’s an extremely fast-growing function. We will program it from the ground up.

Rather than rely on built-in operators, we’ll define tetration from first principles:

  • Tetration, implemented by the function tetrate, as repeated exponentiation
  • Exponentiation, via exponentiate, as repeated multiplication
  • Multiplication, via multiply, as repeated addition
  • Addition, via add, as repeated increment

Each layer builds on the one below it, using only zero-initialized memory and in-place updates.

We’ll begin at the foundation — with the simplest operation of all: increment.

Increment

Here’s our definition of increment and an example of its use:

from gmpy2 import xmpz

def increment(increment_acc):
  assert is_valid_accumulator(increment_acc), "not a valid accumulator"
  increment_acc += 1

def is_valid_accumulator(acc):
  return isinstance(acc, xmpz) and acc >= 0  

b = xmpz(4)
print(f"++{b} = ", end="")
increment(b)
print(b)
assert b == 5

Output:

++4 = 5

We’re using xmpz, a mutable arbitrary-precision integer type provided by the gmpy2 library. It behaves like Python’s built-in int in terms of numeric range—limited only by memory—but unlike int, it supports in-place updates.

To stay true to the spirit of a Turing machine and to keep the logic minimal and observable, we restrict ourselves to just a few operations:

  • Creating an integer with value 0 (xmpz(0))
  • In-place increment (+= 1) and decrement (-= 1)
  • Comparing with zero

All arithmetic is done in-place, with no copies and no temporary values. Each function in our computation chain modifies an accumulator directly. Most functions also take an input value a, but increment—being the most basic—does not. We use descriptive names like increment_acc, add_acc, and so on to make the operation clear and to support later functions where multiple accumulators will appear together.

Aside: Why not use Python’s built-in int type? It supports arbitrary precision and can grow as large as your memory allows. But it’s also immutable, meaning any update like += 1 creates a new integer object. Even if you think you’re modifying a large number in place, Python is actually copying all of its internal memory—no matter how big it is.
For example:

x = 10**100
y = x
x += 1
assert x == 10**100 + 1 and y == 10**100

Even though x and y start out identical, x += 1 creates a new object—leaving y unchanged. This behavior is fine for small numbers, but it violates our rules about memory use and in-place updates. That’s why we use gmpy2.xmpz, a mutable arbitrary-precision integer that truly supports efficient, in-place changes.

Addition

With increment defined, we next define addition as repeated incrementing.

def add(a, add_acc):
  assert is_valid_other(a), "not a valid other"
  assert is_valid_accumulator(add_acc), "not a valid accumulator"
  for _ in range(a):
      add_acc += 1

def is_valid_other(a):
  return isinstance(a, int) and a >= 0      

a = 2
b = xmpz(4)
print(f"Before: id(b) = {id(b)}")
print(f"{a} + {b} = ", end="")
add(a, b)
print(b)
print(f"After:  id(b) = {id(b)}")  # ← compare object IDs
assert b == 6

Output:

Before: id(b) = 2082778466064
2 + 4 = 6
After:  id(b) = 2082778466064

The function adds a to add_acc by incrementing add_acc one step at a time, a times. The before and after ids are the same, showing that no new object was created—add_acc was truly updated in place.

Aside: You might wonder why add doesn’t just call our increment function. We could write it that way—but we’re deliberately inlining each level by hand. This keeps all loops visible, makes control flow explicit, and helps us reason precisely about how much work each function performs.

Even though gmpy2.xmpz supports direct addition, we don’t use it. We’re working at the most primitive level possible—incrementing by 1—to keep the logic simple, intentionally slow, and to make the amount of work explicit.

As with increment_acc, we update add_acc in place, with no copying or temporary values. The only operation we use is += 1, repeated a times.

Next, we define multiplication.

Multiplication

With addition in place, we can now define multiplication as repeated addition. Here’s the function and example usage. Unlike add and increment, this one builds up a new xmpz value from zero and returns it.

def multiply(a, multiply_acc):
  assert is_valid_other(a), "not a valid other"
  assert is_valid_accumulator(multiply_acc), "not a valid accumulator"

  add_acc = xmpz(0)
  for _ in count_down(multiply_acc):
      for _ in range(a):
          add_acc += 1
  return add_acc

def count_down(acc):
  assert is_valid_accumulator(acc), "not a valid accumulator"
  while acc > 0:
      acc -= 1
      yield

a = 2
b = xmpz(4)
print(f"{a} * {b} = ", end="")
c = multiply(a, b)
print(c)
assert c == 8
assert b == 0

Output:

2 * 4 = 8

This multiplies a by the value of multiply_acc by adding a to add_acc once for every time multiply_acc can be decremented. The result is returned and then assigned to c. The original multiply_acc is decremented to zero and consumed in the process.

You might wonder what this line does:

for _ in count_down(multiply_acc):

While xmpz technically works with range(), doing so converts it to a standard Python int, which is immutable. That triggers a full copy of its internal memory—an expensive operation for large values. Worse, each decrement step would involve allocating a new integer and copying all previous bits, so what should be a linear loop ends up doing quadratic total work. Our custom count_down() avoids all that by decrementing in place, yielding control without copying, and maintaining predictable memory use.
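To see why this matters, here is a toy cost model (my own simplification, not the article’s code) that charges each n -= 1 on an immutable int the full size of the number, mirroring the copy Python would make:

```python
def decrement_copy_bits(n):
    """Total bits copied while counting an immutable int down to zero,
    assuming every n -= 1 copies all of n's bits (a simplified model)."""
    total = 0
    while n > 0:
        total += n.bit_length()   # the whole number is copied on each step
        n -= 1
    return total

# A loop of n decrements copies far more than n bits in total.
assert decrement_copy_bits(4) == 3 + 2 + 2 + 1   # 100 -> 011 -> 010 -> 001 -> 000
```

In-place decrements on an xmpz, by contrast, touch only the bits that actually change.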

We’ve built multiplication from repeated addition. Now it’s time to go a layer further: exponentiation.

Exponentiation

We define exponentiation as repeated multiplication. As before, we perform all work using only incrementing, decrementing, and in-place memory. As with multiply, the final result is returned while the input accumulator is consumed.

Here’s the function and example usage:

def exponentiate(a, exponentiate_acc):
  assert is_valid_other(a), "not a valid other"
  assert is_valid_accumulator(exponentiate_acc), "not a valid accumulator"
  assert a > 0 or exponentiate_acc != 0, "0^0 is undefined"

  multiply_acc = xmpz(0)
  multiply_acc += 1
  for _ in count_down(exponentiate_acc):
      add_acc = xmpz(0)
      for _ in count_down(multiply_acc):
          for _ in range(a):
              add_acc += 1
      multiply_acc = add_acc
  return multiply_acc


a = 2
b = xmpz(4)
print(f"{a}^{b} = ", end="")
c = exponentiate(a, b)
print(c)
assert c == 16
assert b == 0

Output:

2^4 = 16

This raises a to the power of exponentiate_acc, using only incrementing, decrementing, and loop control. We initialize multiply_acc to 1 with a single increment—because repeatedly multiplying from zero would get us nowhere. Then, for each time exponentiate_acc can be decremented, we multiply the current result (multiply_acc) by a. As with the earlier layers, we inline the multiply logic directly instead of calling the multiply function—so the control flow and step count stay fully visible.

Aside: And how many times is += 1 called? Obviously at least 2⁴ times—because our result is 2⁴, and we reach it by incrementing from zero. More precisely, the number of increments is:

• 1 increment — initializing multiply_acc to one

Then we loop four times, and in each loop, we multiply the current value of multiply_acc by a = 2, using repeated addition:

• 2 increments — for multiply_acc = 1, add 2 once
• 4 increments — for multiply_acc = 2, add 2 twice
• 8 increments — for multiply_acc = 4, add 2 four times
• 16 increments — for multiply_acc = 8, add 2 eight times

That’s a total of 1 + 2 + 4 + 8 + 16 = 31 increments, which is 2⁵-1. In general, the number of calls to increment will be exponential, but the number is not the same exponential that we are computing.
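This counting argument can be checked with a small model of the inlined loops (the helper name is mine):

```python
def exponentiate_increments(a, e):
    """Number of += 1 calls exponentiate performs while computing a**e."""
    total, value = 1, 1           # one increment initializes multiply_acc to 1
    for _ in range(e):
        value *= a                # one multiply pass...
        total += value            # ...costs `value` increments in total
    return total

assert exponentiate_increments(2, 4) == 31    # 1 + 2 + 4 + 8 + 16
```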

With exponentiation defined, we’re ready for the top of our tower: tetration.

Tetration

Here’s the function and example usage:

def tetrate(a, tetrate_acc):
  assert is_valid_other(a), "not a valid other"
  assert is_valid_accumulator(tetrate_acc), "not a valid accumulator"
  assert a > 0, "we don't define 0↑↑b"

  exponentiate_acc = xmpz(0)
  exponentiate_acc += 1
  for _ in count_down(tetrate_acc):
      multiply_acc = xmpz(0)
      multiply_acc += 1
      for _ in count_down(exponentiate_acc):
          add_acc = xmpz(0)
          for _ in count_down(multiply_acc):
              for _ in range(a):
                  add_acc += 1
          multiply_acc = add_acc
      exponentiate_acc = multiply_acc
  return exponentiate_acc


a = 2
b = xmpz(3)
print(f"{a}↑↑{b} = ", end="")
c = tetrate(a, b)
print(c)
assert c == 16  # 2^(2^2)
assert b == 0   # Confirm tetrate_acc is consumed

Output:

2↑↑3 = 16

This computes a ↑↑ tetrate_acc, meaning it exponentiates a by itself repeatedly, tetrate_acc times.

For each decrement of tetrate_acc, we exponentiate the current value. We in-line the entire exponentiate and multiply logic again, all the way down to repeated increments.
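Extending the counting aside from the exponentiation section, a small model (helper name mine) tallies the += 1 operations tetrate performs:

```python
def tetrate_increments(a, n):
    """Model the number of += 1 calls tetrate(a, n) performs."""
    total, e = 1, 1                   # one increment sets exponentiate_acc to 1
    for _ in range(n):
        round_total, value = 1, 1     # inlined exponentiate: multiply_acc = 1
        for _ in range(e):
            value *= a
            round_total += value      # one multiply pass costs `value` increments
        total += round_total
        e = value                     # exponentiate_acc becomes a**e
    return total

assert tetrate_increments(2, 3) == 42   # 1 + (3 + 7 + 31); the result itself is 16
```

For a = 10, n = 15 this model is itself uncomputable in practice, but its structure shows the increment count exceeds the final value — which is 10↑↑15.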

As expected, this computes 2^(2^2) = 16. With a Google sign-in, you can run this yourself via a Python notebook on Google Colab. Alternatively, you can copy the notebook from GitHub and then run it on your own computer.

We can also run tetrate on 10↑↑15. It will start running, but it won’t stop during our lifetimes — or even the lifetime of the universe:

a = 10
b = xmpz(15)
print(f"{a}↑↑{b} = ", end="")
c = tetrate(a, b)
print(c)

Let’s compare this tetrate function to what we found in the previous Rule Sets.

Rule Set 1: Anything Goes — Infinite Loop

Recall our first function:

while True:
  pass

Unlike this infinite loop, our tetrate function eventually halts — though not anytime soon.

Rule Set 2: Must Halt, Finite Memory — Nested, Fixed-Range Loops

Recall our second function:

for a in range(10**100):
  for b in range(10**100):
      if b % 10_000_000 == 0:
          print(f"{a:,}, {b:,}")

Both this function and our tetrate function contain a fixed number of nested loops. But tetrate differs in an important way: the number of loop iterations grows with the input value. In this function, each loop runs from 0 to 10¹⁰⁰−1, a hardcoded bound. Tetrate’s loop bounds, in contrast, are dynamic — they grow explosively with each layer of computation.

Rule Sets 3 & 4: Infinite, Zero-Initialized Memory — 5- and 6-State Turing Machines

Compared to the Turing machines, our tetrate function has a clear advantage: we can directly see that it will call += 1 more than 10↑↑15 times. Even better, we can also see — by construction — that it halts.

What the Turing machines offer instead is a simpler, more universal model of computation — and perhaps a more principled definition of what counts as a “small program.”

Conclusion

So, there you have it — a journey through writing absurdly slow programs. Along the way, we explored the outer edges of computation, memory, and performance, using everything from deeply nested loops to Turing machines to a hand-inlined tetration function.

Here’s what surprised me:

  • Nested loops are enough.
    If you just want a short program that halts after outliving the universe, two nested loops with 144 bytes of memory will do the job. I hadn’t realized it was that simple.
  • Turing machines escalate fast.
    The jump from 5 to 6 states unleashes a dramatic leap in complexity and runtime. Also, the importance of starting with zero-initialized memory is obvious in retrospect — but it wasn’t something I’d considered before.
  • Python’s int type can kill performance.
    Yes, Python integers are arbitrary precision, which is great. But they’re also immutable. That means every time you do something like x += 1, Python silently allocates a brand-new integer object—copying all the memory of x, no matter how big it is. It feels in-place, but it’s not. This behavior turns efficient-looking code into a performance trap when working with large values. To get around this, we use the gmpy2.xmpz type—a mutable, arbitrary-precision integer that allows true in-place updates.
  • There’s something beyond exponentiation — and it’s called tetration.
    I didn’t know this. I wasn’t familiar with the ↑↑ notation or the idea that exponentiation could itself be iterated to form something even faster-growing. It was surprising to learn how compactly it can express numbers that are otherwise unthinkably large.
    And because I know you’re asking — yes, there’s something beyond tetration too. It’s called pentation, then hexation, and so on. These are part of a whole hierarchy known as hyperoperations. There’s even a metageneralization: systems like the Ackermann function and fast-growing hierarchies capture entire families of these functions and more.
  • Writing tetration with explicit loops was eye-opening.
    I already knew that exponentiation is repeated multiplication, and so on. I also knew this could be written recursively. What I hadn’t seen was how cleanly it could be written as nested loops, without copying values and with strict in-place updates.

Thank you for joining me on this journey. I hope you now have a clearer understanding of how small Python programs can run for an astonishingly long time — and what that reveals about computation, memory, and minimal systems. We’ve seen programs that halt only after the universe dies, and others that run even longer.

Please follow Carl on Towards Data Science and on @carlkadie.bsky.social. I write on scientific programming in Python and Rust, machine learning, and statistics. I tend to write about one article per month.

Nine Pico PIO Wats with Rust (Part 2) https://towardsdatascience.com/nine-pico-pio-wats-with-rust-part-2/ Fri, 14 Mar 2025 18:45:53 +0000 https://towardsdatascience.com/?p=599597 Raspberry Pi programmable IO pitfalls illustrated with a musical example

The post Nine Pico PIO Wats with Rust (Part 2) appeared first on Towards Data Science.

This is Part 2 of an exploration into the unexpected quirks of programming the Raspberry Pi Pico PIO with Rust. If you missed Part 1, we uncovered four Wats that challenge assumptions about register count, instruction slots, the behavior of pull noblock, and smart yet cheap hardware.

Now, we continue our journey toward crafting a theremin-like musical instrument — a project that reveals some of the quirks and perplexities of PIO programming. Prepare to challenge your understanding of constants in a way that brings to mind a Shakespearean tragedy.

Wat 5: Inconstant constants

In the world of PIO programming, constants should be reliable, steadfast, and, well, constant. But what if they’re not? This brings us to a puzzling Wat about how the set instruction in PIO works—or doesn’t—when handling larger constants.

Much like Juliet doubting Romeo’s constancy, you might find yourself wondering if PIO constants will, as she says, “prove likewise variable.”

The problem: Constants are not as big as they seem

Imagine you’re programming an ultrasonic range finder and need to count down from 500 while waiting for the Echo signal to drop from high to low. To set up this wait time in PIO, you might naïvely try to load the constant value directly using set:

set y, 500                      ; Load the max echo wait into y (naïve attempt)
measure_echo_loop:
   jmp pin echo_active          ; If echo voltage is high, continue counting down
   jmp measurement_complete     ; If echo voltage is low, measurement is complete
echo_active:
   jmp y-- measure_echo_loop    ; Continue counting down unless timeout

Aside: Don’t try to understand the crazy jmp operations here. We’ll discuss those next in Wat 6.

But here’s the tragic twist: the set instruction in PIO is limited to constants between 0 and 31. Moreover, the star-crossed set instruction doesn’t report an error. Instead, it silently corrupts the entire PIO instruction. This produces a nonsense result.

Workarounds for inconstant constants

To address this limitation, consider the following approaches:

  • Read Values and Store Them in a Register: We saw this approach in Wat 1. You can load your constant into the osr register, then transfer it to y. For example:
# Read the max echo wait into OSR.
pull                    ; same as pull block
mov y, osr              ; Load max echo wait into Y
  • Shift and Combine Smaller Values: Using the isr register and the in instruction, you can build up a constant of any size. This, however, consumes time and operations from your 32-operation budget (see Part 1, Wat 2).
; In Rust, be sure 'config.shift_in.direction = ShiftDirection::Left;'

set y, 15       ; Load upper 5 bits (0b01111)
mov isr, y      ; Transfer to ISR (clears ISR)
set y, 20       ; Load lower 5 bits (0b10100)
in y, 5         ; Shift in lower bits to form 500 in ISR
mov y, isr      ; Transfer back to y
  • Slow Down the Timing: Reduce the frequency of the state machine to stretch delays over more system clock cycles. For example, lowering the state machine speed from 125 MHz to 343 kHz reduces the timeout constant from 182,216 to 500.
  • Use Extra Delays and (Nested) Loops: All instructions support an optional delay, allowing you to add up to 31 extra cycles. (To generate even longer delays, use loops — or even nested loops.)
; Generate 10μs trigger pulse (4 cycles at 343_000Hz)
set pins, 1 [3]       ; Set trigger pin to high, add delay of 3
set pins, 0           ; Set trigger pin to low voltage
  • Use the “Subtraction Trick” to Generate the Maximum 32-bit Integer: In Wat 7, we’ll explore a way to generate 4,294,967,295 (the maximum unsigned 32-bit integer) via subtraction.

Much like Juliet cautioning against swearing by the inconstant moon, we’ve discovered that PIO constants are not always as steadfast as they seem. Yet, just as their story takes unexpected turns, so too does ours, moving from the inconstancy of constants to the uneven nature of conditionals. In the next Wat, we’ll explore how PIO’s handling of conditional jumps can leave you questioning its loyalty to logic.

Wat 6: Conditionals through the looking-glass

In most programming environments, logical conditionals feel balanced: you can test if a pin is high or low, or check registers for equality or inequality. In PIO, this symmetry breaks down. You can jump on pin high, but not pin low, and on x!=y, but not x==y. The rules are whimsical — like Humpty Dumpty in Through the Looking-Glass: “When I define a conditional, it means just what I choose it to mean — neither more nor less.”

These quirks force us to rewrite our code to fit the lopsided logic, creating a gulf between how we wish the code could be written and how we must write it.

The problem: Lopsided conditionals in action

Consider a simple scenario: using a range finder, you want to count down from a maximum wait time (y) until the ultrasonic echo pin goes low. Intuitively, you might write the logic like this:

measure_echo_loop:
 jmp !pin measurement_complete   ; If echo voltage is low, measurement is complete
 jmp y-- measure_echo_loop       ; Continue counting down unless timeout

And when processing the measurement, if we only wish to output values that differ from the previous value, we would write:

measurement_complete:
 jmp x==y cooldown             ; If measurement is the same, skip to cool down
 mov isr, y                    ; Store measurement in ISR
 push                          ; Output ISR
 mov x, y                      ; Save the measurement in X

Unfortunately, PIO doesn’t let you test !pin or x==y directly. You must restructure your logic to accommodate the available conditionals, such as pin and x!=y.

The solution: The way it must be

Given PIO’s limitations, we adapt our logic with a two-step approach that ensures the desired behavior despite the missing conditionals:

  • Jump on the opposite conditional to skip two instructions forward.
  • Next, use an unconditional jump to reach the desired target.

This workaround adds one extra jump (affecting the instruction limit), but the additional label is cost-free.

Here is the rewritten code for counting down until the pin goes low:

measure_echo_loop:
   jmp pin echo_active     ; if echo voltage is high continue count down
   jmp measurement_complete ; if echo voltage is low, measurement is complete
echo_active:
   jmp y-- measure_echo_loop ; Continue counting down unless timeout

And here is the code for processing the measurement such that it will only output differing values:

measurement_complete:
   jmp x!=y send_result    ; if measurement is different, then send it.
   jmp cooldown            ; If measurement is the same, don't send.

send_result:
   mov isr, y              ; Store measurement in ISR
   push                    ; Output ISR
   mov x, y               ; Save the measurement in X

Lessons from Humpty Dumpty’s conditionals

In Through the Looking-Glass, Alice learns to navigate Humpty Dumpty’s peculiar world — just as you’ll learn to navigate PIO’s Wonderland of lopsided conditions.

But as soon as you master one quirk, another reveals itself. In the next Wat, we’ll uncover a surprising behavior of jmp that, if it were an athlete, would shatter world records.

Wat 7: Overshooting jumps

In Part 1’s Wat 1 and Wat 3, we saw how jmp x-- or jmp y-- is often used to loop a fixed number of times by decrementing a register until it reaches 0. Straightforward enough, right? But what happens when y is 0 and we run the following instruction?

jmp y-- measure_echo_loop

If you guessed that it does not jump to measure_echo_loop and instead falls through to the next instruction, you’re absolutely correct. But for full credit, answer this: What value does y have after the instruction?

The answer: 4,294,967,295. Why? Because y is decremented after it is tested for zero. Wat!?

Aside: If this doesn’t surprise you, you likely have experience with C or C++ which distinguish between pre-increment (e.g., ++x) and post-increment (e.g., x++) operations. The behavior of jmp y-- is equivalent to a post-decrement, where the value is tested before being decremented.

This value, 4,294,967,295, is the maximum for a 32-bit unsigned integer. It’s as if a track-and-field long jumper launches off the takeoff board but, instead of landing in the sandpit, overshoots and ends up on another continent.

Aside: As foreshadowed in Wat 5, we can use this behavior intentionally to set a register to the value 4,294,967,295.

Now that we’ve learned how to stick the landing with jmp, let’s see if we can avoid getting stuck by the pins that PIO reads and sets.

Wat 8: Too many “pins”

In Dr. Seuss’s Too Many Daves, Mrs. McCave had 23 sons, all named Dave, leading to endless confusion whenever she called out their name. In PIO programming, pin and pins can refer to completely different ranges of pins depending on the context. It’s hard to know which Dave or Daves you’re talking to.

The problem: Pin ranges and subranges

In PIO, instructions that use pin and pins depend on pin ranges defined in Rust, outside of PIO. However, individual instructions often operate on a subrange of those pin ranges. The behavior varies depending on the command: the subrange could be the first n pins of the range, all the pins, or just a specific pin given by an index. To clarify PIO’s behavior, I created the following table:

This table shows how PIO interprets the terms pin and pins in different instructions, along with their associated contexts and configurations.

Example: Distance program for the range finder

Here’s a PIO program for measuring the distance to an object using Trigger and Echo pins. The key features of this program are:

  • Continuous Operation: The range finder runs in a loop as fast as possible.
  • Maximum Range Limit: Measurements are capped at a given distance, with a return value of 4,294,967,295 if no object is detected.
  • Filtered Outputs: Only measurements that differ from their immediate predecessor are sent, reducing the output rate.

Glance over the program and notice that although it is working with two pins — Trigger and Echo — throughout the program we only see pin and pins.

.program distance

; X is the last value sent. Initialize it to
; u32::MAX which means 'echo timeout'
; (Set X to u32::MAX by subtracting 1 from 0)
   set x, 0
subtraction_trick:
   jmp x-- subtraction_trick

; Read the max echo wait into OSR
   pull                         ; same as pull block

; Main loop
.wrap_target
   ; Generate 10μs trigger pulse (4 cycles at 343_000Hz)
   set pins, 0b1 [3]       ; Set trigger pin to high, add delay of 3
   set pins, 0b0           ; Set trigger pin to low voltage

   ; When the trigger goes high, start counting down until it goes low
   wait 1 pin 0            ; Wait for echo pin to be high voltage
   mov y, osr              ; Load max echo wait into Y

measure_echo_loop:
   jmp pin echo_active     ; if echo voltage is high continue count down
   jmp measurement_complete ; if echo voltage is low, measurement is complete
echo_active:
   jmp y-- measure_echo_loop ; Continue counting down unless timeout

; Y tells where the echo countdown stopped. It
; will be u32::MAX if the echo timed out.
measurement_complete:
   jmp x!=y send_result    ; if measurement is different, then send it.
   jmp cooldown            ; If measurement is the same, don't send.

send_result:
   mov isr, y              ; Store measurement in ISR
   push                    ; Output ISR
   mov x, y               ; Save the measurement in X

; Cool down period before next measurement
cooldown:
   wait 0 pin 0           ; Wait for echo pin to be low
.wrap                      ; Restart the measurement loop

Configuring Pins

To ensure the PIO program behaves as intended:

  • set pins, 0b1 should control the Trigger pin.
  • wait 1 pin 0 should monitor the Echo pin.
  • jmp pin echo_active should also monitor the Echo pin.

Here’s how you can configure this in Rust (followed by an explanation):

let mut distance_state_machine = pio1.sm0;
let trigger_pio = pio1.common.make_pio_pin(hardware.trigger);
let echo_pio = pio1.common.make_pio_pin(hardware.echo);
distance_state_machine.set_pin_dirs(Direction::Out, &[&trigger_pio]);
distance_state_machine.set_pin_dirs(Direction::In, &[&echo_pio]);
distance_state_machine.set_config(&{
   let mut config = Config::default();
   config.set_set_pins(&[&trigger_pio]); // For set instruction
   config.set_in_pins(&[&echo_pio]); // For wait instruction
   config.set_jmp_pin(&echo_pio); // For jmp instruction
   let program_with_defines = pio_file!("examples/distance.pio");
   let program = pio1.common.load_program(&program_with_defines.program);
   config.use_program(&program, &[]); // No side-set pins
   config
});

The keys here are the set_set_pins, set_in_pins, and set_jmp_pin methods on the Config struct.

  • set_in_pins: Specifies the pins for input operations, such as wait(1, pin, …). The “in” pins must be consecutive.
  • set_set_pins: Configures the pin for set operations, like set(pins, 1). The “set” pins must also be consecutive.
  • set_jmp_pin: Defines the single pin used in conditional jumps, such as jmp(pin, ...).

As described in the table, other optional inputs include:

  • set_out_pins: Sets the consecutive pins for output operations, such as out(pins, …).
  • use_program: Sets a) the loaded program and b) consecutive pins for sideset operations. Sideset operations allow simultaneous pin toggling during other instructions.

Configuring Multiple Pins

Although not required for this program, you can configure a range of pins in PIO by providing a slice of consecutive pins. For example, suppose we had two ultrasonic range finders:

let trigger_a_pio = pio1.common.make_pio_pin(hardware.trigger_a);
let trigger_b_pio = pio1.common.make_pio_pin(hardware.trigger_b);
config.set_set_pins(&[&trigger_a_pio, &trigger_b_pio]);

A single instruction can then control both pins:

set pins, 0b11 [3]  ; Sets both trigger pins (17, 18) high, adds delay
set pins, 0b00      ; Sets both trigger pins low

This approach lets you efficiently apply bit patterns to multiple pins simultaneously, streamlining control for applications involving multiple outputs.

Aside: The Word “Set” in Programming

In programming, the word “set” is notoriously overloaded with multiple meanings. In the context of PIO, “set” refers to something to which you can assign a value — such as a pin’s state. It does not mean a collection of things, as it often does in other programming contexts. When PIO refers to a collection, it usually uses the term “range” instead. This distinction is crucial for avoiding confusion as you work with PIO.

Lessons from Mrs. McCave

In Too Many Daves, Mrs. McCave lamented not giving her 23 Daves more distinct names. You can avoid her mistake by clearly documenting your pins with meaningful names — like Trigger and Echo — in your comments.

But if you think handling these pin ranges is tricky, debugging a PIO program adds an entirely new layer of challenge. In the next Wat, we’ll dive into the kludgy debugging methods available. Let’s see just how far we can push them.

Wat 9: Kludgy debugging

I like to debug with interactive breakpoints in VS Code. I also do print debugging, where you insert temporary info statements to see what the code is doing and the values of variables. Using the Raspberry Pi Debug Probe and probe-rs, I can do both of these with regular Rust code on the Pico.

With PIO programming, however, I can do neither.

The fallback is push-to-print debugging. In PIO, you temporarily output integer values of interest. Then, in Rust, you use info! to print those values for inspection.

For example, in the following PIO program, we temporarily add instructions to push the value of x for debugging. We also include set and out to push a constant value, such as 7, which must be between 0 and 31 inclusive.

.program distance

; X is the last value sent. Initialize it to
; u32::MAX which means 'echo timeout'
; (Set X to u32::MAX by subtracting 1 from 0)
   set x, 0
subtraction_trick:
   jmp x-- subtraction_trick

; DEBUG: See the value of x
   mov isr, x
   push

; Read the max echo wait into OSR
   pull                         ; same as pull block

; DEBUG: Send constant value
   set y, 7           ; Push '7' so that we know we've reached this point
   mov isr, y
   push
; ...

Back in Rust, you can read and print these values to help understand what’s happening in the PIO code (full code and project):

  // ...
   distance_state_machine.set_enable(true);
   distance_state_machine.tx().wait_push(MAX_LOOPS).await;
   loop {
       let end_loops = distance_state_machine.rx().wait_pull().await;
       info!("end_loops: {}", end_loops);
   }
  // ...

Outputs:

INFO  Hello, debug!
└─ distance_debug::inner_main::{async_fn#0} @ examples\distance_debug.rs:27
INFO  end_loops: 4294967295
└─ distance_debug::inner_main::{async_fn#0} @ examples\distance_debug.rs:57
INFO  end_loops: 7
└─ distance_debug::inner_main::{async_fn#0} @ examples\distance_debug.rs:57

When push-to-print debugging isn’t enough, you can turn to hardware tools. I bought my first oscilloscope (a FNIRSI DSO152, for $37). With it, I was able to confirm the Echo signal was working. The Trigger signal, however, was too fast for this inexpensive oscilloscope to capture clearly.

Using these methods — especially push-to-print debugging — you can trace the flow of your PIO program, even without a traditional debugger.

Aside: In C/C++ (and potentially Rust), you can get closer to a full debugging experience for PIO, for example, by using the piodebug project.

That concludes the nine Wats, but let’s bring everything together in a bonus Wat.

Bonus Wat 10: Putting it all together

Now that all the components are ready, it’s time to combine them into a working theremin-like musical instrument. We need a Rust monitor program. This program starts both PIO state machines — one for measuring distance and the other for generating tones. It then waits for a new distance measurement, maps that distance to a tone, and sends the corresponding tone frequency to the tone-playing state machine. If the distance is out of range, it stops the tone.

Rust’s Place: At the heart of this system is a function that maps distances (from 0 to 50 cm) to tones (approximately B2 to F5). This function is simple to write in Rust, leveraging Rust’s floating-point math and exponential operations. Implementing this in PIO would be virtually impossible due to its limited instruction set and lack of floating-point support.

Here’s the core monitor program to run the theremin (full file and project):

sound_state_machine.set_enable(true);
distance_state_machine.set_enable(true);
distance_state_machine.tx().wait_push(MAX_LOOPS).await;
loop {
   let end_loops = distance_state_machine.rx().wait_pull().await;
   match loop_difference_to_distance_cm(end_loops) {
       None => {
           info!("Distance: out of range");
           sound_state_machine.tx().wait_push(0).await;
       }
       Some(distance_cm) => {
           let tone_frequency = distance_to_tone_frequency(distance_cm);
           let half_period = sound_state_machine_frequency / tone_frequency as u32 / 2;
           info!("Distance: {} cm, tone: {} Hz", distance_cm, tone_frequency);
           sound_state_machine.tx().push(half_period); // non-blocking push
           Timer::after(Duration::from_millis(50)).await;
       }
   }
}

Using two PIO state machines alongside a Rust monitor program lets you literally run three programs at once. This setup is convenient on its own and is essential when strict timing or very high-frequency I/O operations are required.

Aside: Alternatively, Rust Embassy’s async tasks let you implement cooperative multitasking directly on a single main processor. You code in Rust rather than a mixture of Rust and PIO. Although Embassy tasks don’t literally run in parallel, they switch quickly enough to handle applications like a theremin. Here’s a snippet from theremin_no_pio.rs showing a similar core loop:

loop {
       match distance.measure().await {
           None => {
               info!("Distance: out of range");
               sound.rest().await;
           }
           Some(distance_cm) => {
               let tone_frequency = distance_to_tone_frequency(distance_cm);
               info!("Distance: {} cm, tone: {} Hz", distance_cm, tone_frequency);
               sound.play(tone_frequency).await;
               Timer::after(Duration::from_millis(50)).await;
           }
       }
   }

See our recent article on Rust Embassy programming for more details.

Now that we’ve assembled all the components, let’s watch the video again of me “playing” the musical instrument. On the monitor screen, you can see the debugging prints displaying the distance measurements and the corresponding tones. This visual connection highlights how the system responds in real time.

Conclusion

PIO programming on the Raspberry Pi Pico is a captivating blend of simplicity and complexity, offering unparalleled hardware control while demanding a shift in mindset for developers accustomed to higher-level programming. Through the nine Wats we’ve explored, PIO has both surprised us with its limitations and impressed us with its raw efficiency.

While we’ve covered significant ground — managing state machines, pin assignments, timing intricacies, and debugging — there’s still much more you can learn as needed: DMA, IRQ, side-set pins, differences between PIO on the Pico 1 and Pico 2, autopush and autopull, FIFO join, and more.

Recommended Resources

At its core, PIO’s quirks reflect a design philosophy that prioritizes low-level hardware control with minimal overhead. By embracing these characteristics, PIO will not only meet your project’s demands but also open doors to new possibilities in embedded systems programming.

Please follow Carl on Towards Data Science and on @carlkadie.bsky.social. I write on scientific programming in Rust and Python, machine learning, and statistics. I tend to write about one article per month.

The post Nine Pico PIO Wats with Rust (Part 2) appeared first on Towards Data Science.

Nine Rules for SIMD Acceleration of Your Rust Code (Part 1) https://towardsdatascience.com/nine-rules-for-simd-acceleration-of-your-rust-code-part-1-c16fe639ce21/ Thu, 27 Feb 2025 02:03:02 +0000 https://towardsdatascience.com/?p=598485 General Lessons from Boosting Data Ingestion in the range-set-blaze Crate by 7x

The post Nine Rules for SIMD Acceleration of Your Rust Code (Part 1) appeared first on Towards Data Science.


Thanks to Ben Lichtman (B3NNY) at the Seattle Rust Meetup for pointing me in the right direction on SIMD.

SIMD (Single Instruction, Multiple Data) operations have been a feature of Intel/AMD and ARM CPUs since the early 2000s. These operations enable you to, for example, add an array of eight i32 to another array of eight i32 with just one CPU operation on a single core. Using SIMD operations greatly speeds up certain tasks. If you’re not using SIMD, you may not be fully using your CPU’s capabilities.

Is this “Yet Another Rust and SIMD” article? Yes and no. Yes, I did apply SIMD to a programming problem and then feel compelled to write an article about it. No, I hope that this article also goes into enough depth that it can guide you through your project. It explains the newly available SIMD capabilities and settings in Rust nightly. It includes a Rust SIMD cheatsheet. It shows how to make your SIMD code generic without leaving safe Rust. It gets you started with tools such as Godbolt and Criterion. Finally, it introduces new cargo commands that make the process easier.


The range-set-blaze crate uses its RangeSetBlaze::from_iter method to ingest potentially long sequences of integers. When the integers are “clumpy”, it can do this 30 times faster than Rust’s standard HashSet::from_iter. Can we do even better if we use SIMD operations? Yes!

See this documentation for the definition of “clumpy”. Also, what happens if the integers are not clumpy? RangeSetBlaze is 2 to 3 times slower than HashSet.

On clumpy integers, RangeSetBlaze::from_slice — a new method based on SIMD operations — is 7 times faster than RangeSetBlaze::from_iter. That makes it more than 200 times faster than HashSet::from_iter. (When the integers are not clumpy, it is still 2 to 3 times slower than HashSet.)

Over the course of implementing this speed up, I learned nine rules that can help you accelerate your projects with SIMD operations.

The rules are:

  1. Use nightly Rust and core::simd, Rust’s experimental standard SIMD module.
  2. CCC: Check, Control, and Choose your computer’s SIMD capabilities.
  3. Learn core::simd, but selectively.
  4. Brainstorm candidate algorithms.
  5. Use Godbolt and AI to understand your code’s assembly, even if you don’t know assembly language.
  6. Generalize to all types and LANES with in-lined generics, (and when that doesn’t work) macros, and (when that doesn’t work) traits.

See Part 2 for these rules:

7. Use Criterion benchmarking to pick an algorithm and to discover that LANES should (almost) always be 32 or 64.

8. Integrate your best SIMD algorithm into your project with as_simd, special code for i128/u128, and additional in-context benchmarking.

9. Extricate your best SIMD algorithm from your project (for now) with an optional cargo feature.

Aside: To avoid wishy-washiness, I call these “rules”, but they are, of course, just suggestions.

Rule 1: Use nightly Rust and core::simd, Rust’s experimental standard SIMD module.

Rust can access SIMD operations either via the stable core::arch module or via nightly’s core::simd module. Let’s compare them:

core::arch

core::simd

  • Nightly
  • Delightfully easy and portable.
  • Limits downstream users to nightly.

I decided to go with “easy”. If you decide to take the harder road, starting first with the easier path may still be worthwhile.


In either case, before we try to use SIMD operations in a larger project, let’s make sure we can get them working at all. Here are the steps:

First, create a project called simd_hello:

cargo new simd_hello
cd simd_hello

Edit src/main.rs to contain (Rust playground):

// Tell nightly Rust to enable 'portable_simd'
#![feature(portable_simd)]
use core::simd::prelude::*;

// constant Simd structs
const LANES: usize = 32;
const THIRTEENS: Simd<u8, LANES> = Simd::<u8, LANES>::from_array([13; LANES]);
const TWENTYSIXS: Simd<u8, LANES> = Simd::<u8, LANES>::from_array([26; LANES]);
const ZEES: Simd<u8, LANES> = Simd::<u8, LANES>::from_array([b'Z'; LANES]);

fn main() {
    // create a Simd struct from a slice of LANES bytes
    let mut data = Simd::<u8, LANES>::from_slice(b"URYYBJBEYQVQBUBCRVGFNYYTBVATJRYY");

    data += THIRTEENS; // add 13 to each byte

    // compare each byte to 'Z', where the byte is greater than 'Z', subtract 26
    let mask = data.simd_gt(ZEES); // compare each byte to 'Z'
    data = mask.select(data - TWENTYSIXS, data);

    let output = String::from_utf8_lossy(data.as_array());
    assert_eq!(output, "HELLOWORLDIDOHOPEITSALLGOINGWELL");
    println!("{}", output);
}

Next — full SIMD capabilities require the nightly version of Rust. Assuming you have Rust installed, install nightly (rustup install nightly). Make sure you have the latest nightly version (rustup update nightly). Finally, set this project to use nightly (rustup override set nightly).

You can now run the program with cargo run. The program applies ROT13 decryption to 32 bytes of upper-case letters. With SIMD, the program can decrypt all 32 bytes simultaneously.

Let’s look at each section of the program to see how it works. It starts with:

#![feature(portable_simd)]
use core::simd::prelude::*;

Rust nightly offers its extra capabilities (or “features”) only on request. The #![feature(portable_simd)] statement requests that Rust nightly make available the new experimental core::simd module. The use statement then imports the module’s most important types and traits.

In the code’s next section, we define useful constants:

const LANES: usize = 32;
const THIRTEENS: Simd<u8, LANES> = Simd::<u8, LANES>::from_array([13; LANES]);
const TWENTYSIXS: Simd<u8, LANES> = Simd::<u8, LANES>::from_array([26; LANES]);
const ZEES: Simd<u8, LANES> = Simd::<u8, LANES>::from_array([b'Z'; LANES]);

The Simd struct is a special kind of Rust array. (It is, for example, always memory aligned.) The constant LANES tells the length of the Simd array. The from_array constructor copies a regular Rust array to create a Simd. In this case, because we want const Simd’s, the arrays we construct from must also be const.

The next two lines copy our encrypted text into data and then adds 13 to each letter.

let mut data = Simd::<u8, LANES>::from_slice(b"URYYBJBEYQVQBUBCRVGFNYYTBVATJRYY");
data += THIRTEENS;

What if you make an error and your encrypted text isn’t exactly length LANES (32)? Sadly, the compiler won’t tell you. Instead, when you run the program, from_slice will panic. What if the encrypted text contains non-upper-case letters? In this example program, we’ll ignore that possibility.

The += operator does element-wise addition between the Simd data and Simd THIRTEENS. It puts the result in data. Recall that debug builds of regular Rust addition check for overflows. Not so with SIMD. Rust defines SIMD arithmetic operators to always wrap. Values of type u8 wrap after 255.

Coincidentally, ROT13 decryption also requires wrapping, but after ‘Z’ rather than after 255. Here is one approach to coding the needed ROT13 wrapping. It subtracts 26 from any values beyond ‘Z’.

let mask = data.simd_gt(ZEES);
data = mask.select(data - TWENTYSIXS, data);

This says to find the element-wise places beyond ‘Z’. Then, subtract 26 from all values. At the places of interest, use the subtracted values. At the other places, use the original values. Does subtracting from all values and then using only some seem wasteful? With SIMD, this takes no extra computer time and avoids jumps. This strategy is, thus, efficient and common.

The program ends like so:

let output = String::from_utf8_lossy(data.as_array());
assert_eq!(output, "HELLOWORLDIDOHOPEITSALLGOINGWELL");
println!("{}", output);

Notice the .as_array() method. It safely transmutes a Simd struct into a regular Rust array without copying.

Surprisingly to me, this program runs fine on computers without SIMD extensions. Rust nightly compiles the code to regular (non-SIMD) instructions. But we don’t just want to run “fine”, we want to run faster. That requires us to turn on our computer’s SIMD power.

Rule 2: CCC: Check, Control, and Choose your computer’s SIMD capabilities.

To make SIMD programs run faster on your machine, you must first discover which SIMD extensions your machine supports. If you have an Intel/AMD machine, you can use my simd-detect cargo command.

Run with:

rustup override set nightly
cargo install cargo-simd-detect --force
cargo simd-detect

On my machine, it outputs:

extension       width                   available       enabled
sse2            128-bit/16-bytes        true            true
avx2            256-bit/32-bytes        true            false
avx512f         512-bit/64-bytes        true            false

This says that my machine supports the sse2, avx2, and avx512f SIMD extensions. Of those, by default, Rust enables the ubiquitous twenty-year-old sse2 extension.

The SIMD extensions form a hierarchy with avx512f above avx2 above sse2. Enabling a higher-level extension also enables the lower-level extensions.

Most Intel/AMD computers also support the ten-year-old avx2 extension. You enable it by setting an environment variable:

# For Windows Command Prompt
set RUSTFLAGS=-C target-feature=+avx2

# For Unix-like shells (like Bash)
export RUSTFLAGS="-C target-feature=+avx2"

“Force install” and run simd-detect again and you should see that avx2 is enabled.

# Force install every time to see changes to 'enabled'
cargo install cargo-simd-detect --force
cargo simd-detect
extension       width                   available       enabled
sse2            128-bit/16-bytes        true            true
avx2            256-bit/32-bytes        true            true
avx512f         512-bit/64-bytes        true            false

Alternatively, you can turn on every SIMD extension that your machine supports:

# For Windows Command Prompt
set RUSTFLAGS=-C target-cpu=native

# For Unix-like shells (like Bash)
export RUSTFLAGS="-C target-cpu=native"

On my machine this enables avx512f, a newer SIMD extension supported by some Intel computers and a few AMD computers.

You can set SIMD extensions back to their default (sse2 on Intel/AMD) with:

# For Windows Command Prompt
set RUSTFLAGS=

# For Unix-like shells (like Bash)
unset RUSTFLAGS

You may wonder why target-cpu=native isn’t Rust’s default. The problem is that binaries created using avx2 or avx512f won’t run on computers missing those SIMD extensions. So, if you are compiling only for your own use, use target-cpu=native. If, however, you are compiling for others, choose your SIMD extensions thoughtfully and let people know which SIMD extension level you are assuming.

Happily, whatever level of SIMD extension you pick, Rust’s SIMD support is so flexible you can easily change your decision later. Let’s next learn details of programming with SIMD in Rust.

Rule 3: Learn core::simd, but selectively.

To build with Rust’s new core::simd module you should learn selected building blocks. Here is a cheatsheet with the structs, methods, etc., that I’ve found most useful. Each item includes a link to its documentation.

Structs

  • Simd – a special, aligned, fixed-length array of SimdElement. We refer to a position in the array and the element stored at that position as a “lane”. By default, we copy Simd structs rather than reference them.
  • Mask – a special Boolean array showing inclusion/exclusion on a per-lane basis.

SimdElements

  • Floating-Point Types: f32, f64
  • Integer Types: i8, u8, i16, u16, i32, u32, i64, u64, isize, usize
  • — but not i128, u128

Simd constructors

  • Simd::from_array – creates a Simd struct by copying a fixed-length array.
  • Simd::from_slice – creates a Simd<T, LANES> struct by copying the first LANES elements of a slice.
  • Simd::splat – replicates a single value across all lanes of a Simd struct.
  • slice::as_simd – without copying, safely transmutes a regular slice into an aligned slice of Simd (plus unaligned leftovers).

Simd conversion

  • Simd::as_array – without copying, safely transmutes a Simd struct into a regular array reference.

Simd methods and operators

  • simd[i] – extract a value from a lane of a Simd.
  • simd + simd – performs element-wise addition of two Simd structs. Also supported: -, *, /, % (remainder), bitwise-and, -or, -xor, -not, and shifts.
  • simd += simd – adds another Simd struct to the current one, in place. Other operators supported, too.
  • Simd::simd_gt – compares two Simd structs, returning a Mask indicating which elements of the first are greater than those of the second. Also supported: simd_lt, simd_le, simd_ge, simd_eq, simd_ne.
  • Simd::rotate_elements_left – rotates the elements of a Simd struct to the left by a specified amount. Also, rotate_elements_right.
  • simd_swizzle!(simd, indexes) – rearranges the elements of a Simd struct based on the specified const indexes.
  • simd == simd – checks for equality between two Simd structs, returning a regular bool result.
  • Simd::reduce_and – performs a bitwise AND reduction across all lanes of a Simd struct. Also supported: reduce_or, reduce_xor, reduce_max, reduce_min, reduce_sum (but no reduce_eq).

Mask methods and operators

  • Mask::select – selects elements from two Simd struct based on a mask.
  • Mask::all – tells if the mask is all true.
  • Mask::any – tells if the mask contains any true.

All about lanes

  • Simd::LANES – a constant indicating the number of elements (lanes) in a Simd struct.
  • SupportedLaneCount – tells the allowed values of LANES. Used by generics.
  • simd.lanes() – const method that tells a Simd struct’s number of lanes.

Low-level alignment, offsets, etc.

When possible, use as_simd instead.

More, perhaps of interest

With these building blocks at hand, it’s time to build something.

Rule 4: Brainstorm candidate algorithms.

What do you want to speed up? You won’t know ahead of time which SIMD approach (if any) will work best. You should, therefore, create many algorithms that you can then analyze (Rule 5) and benchmark (Rule 7).

I wanted to speed up range-set-blaze, a crate for manipulating sets of “clumpy” integers. I hoped that creating is_consecutive, a function to detect blocks of consecutive integers, would be useful.

Background: Crate range-set-blaze works on “clumpy” integers. “Clumpy”, here, means that the number of ranges needed to represent the data is small compared to the number of input integers. For example, these 1002 input integers

100, 101, …, 489, 499, 501, 502, …, 998, 999, 999, 100, 0

Ultimately become three Rust ranges:

0..=0, 100..=499, 501..=999.

(Internally, the RangeSetBlaze struct represents a set of integers as a sorted list of disjoint ranges stored in a cache efficient BTreeMap.)

Although the input integers are allowed to be unsorted and redundant, we expect them to often be “nice”. RangeSetBlaze’s from_iter constructor already exploits this expectation by grouping up adjacent integers. For example, from_iter first turns the 1002 input integers into four ranges

100..=499, 501..=999, 100..=100, 0..=0.

with minimal, constant memory usage, independent of input size. It then sorts and merges these reduced ranges.
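That grouping step can be sketched in plain Rust. This is a simplified model of the idea, not range-set-blaze’s actual code, and it assumes duplicate values extend the current run (which matches the four-range result above):

```rust
// Simplified sketch of the adjacent-grouping idea behind from_iter:
// stream the integers, extending the current run while each value is
// equal to, or exactly one more than, the last, and emitting a closed
// range whenever the run breaks.
fn group_adjacent(input: &[u32]) -> Vec<(u32, u32)> {
    let mut ranges = Vec::new();
    let mut iter = input.iter().copied();
    let Some(first) = iter.next() else {
        return ranges;
    };
    let (mut start, mut end) = (first, first);
    for value in iter {
        if value == end || end.checked_add(1) == Some(value) {
            end = value; // extend the current run
        } else {
            ranges.push((start, end)); // run broken: emit and restart
            start = value;
            end = value;
        }
    }
    ranges.push((start, end));
    ranges
}
```

Only the current range is held in memory, so the pass is constant-space regardless of input length.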

I wondered if a new from_slice method could speed construction from array-like inputs by quickly finding (some) consecutive integers. For example, could it – with minimal, constant memory – turn the 1002 input integers into five Rust ranges:

100..=499, 501..=999, 999..=999, 100..=100, 0..=0.

If so, from_iter could then quickly finish the processing.

Let’s start by writing is_consecutive with regular Rust:

pub const LANES: usize = 16;
pub fn is_consecutive_regular(chunk: &[u32; LANES]) -> bool {
    for i in 1..LANES {
        if chunk[i - 1].checked_add(1) != Some(chunk[i]) {
            return false;
        }
    }
    true
}

The algorithm just loops through the array sequentially, checking that each value is one more than its predecessor. It also avoids overflow.
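As a preview of how a from_slice scan might use it, here is a chunk-by-chunk driver. This is a sketch: count_consecutive_chunks is my illustrative helper, not a range-set-blaze API.

```rust
pub const LANES: usize = 16;

pub fn is_consecutive_regular(chunk: &[u32; LANES]) -> bool {
    for i in 1..LANES {
        if chunk[i - 1].checked_add(1) != Some(chunk[i]) {
            return false;
        }
    }
    true
}

// Scan a slice in LANES-sized chunks, counting how many are fully
// consecutive runs -- the shape of work a from_slice method would do.
pub fn count_consecutive_chunks(values: &[u32]) -> usize {
    values
        .chunks_exact(LANES)
        .filter(|chunk| {
            let chunk: &[u32; LANES] =
                (*chunk).try_into().expect("chunks_exact guarantees length");
            is_consecutive_regular(chunk)
        })
        .count()
}
```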

Looping over the items seemed so easy, I wasn’t sure if SIMD could do any better. Here was my first attempt:

Splat0

use std::simd::prelude::*;

const COMPARISON_VALUE_SPLAT0: Simd<u32, LANES> =
    Simd::from_array([15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0]);

pub fn is_consecutive_splat0(chunk: Simd<u32, LANES>) -> bool {
    if chunk[0].overflowing_add(LANES as u32 - 1) != (chunk[LANES - 1], false) {
        return false;
    }
    let added = chunk + COMPARISON_VALUE_SPLAT0;
    Simd::splat(added[0]) == added
}

Here is an outline of its calculations:

Source: This and all following images by author.

It first (needlessly) checks that the first and last items are 15 apart. It then creates added by adding 15 to the 0th item, 14 to the next, etc. Finally, to see if all items in added are the same, it creates a new Simd based on added’s 0th item and then compares. Recall that splat creates a Simd struct from one value.
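The same arithmetic can be modeled with a plain array — a scalar stand-in for the 16 SIMD lanes. Simd’s + operator wraps, so wrapping_add plays that role here; this model is for illustration, not the SIMD code itself.

```rust
const LANES: usize = 16;
const COMPARISON_VALUE_SPLAT0: [u32; LANES] =
    [15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0];

// Scalar model of Splat0: early-exit check, add the descending ramp
// lane by lane, then test whether every element equals the 0th.
fn is_consecutive_splat0_model(chunk: [u32; LANES]) -> bool {
    if chunk[0].overflowing_add(LANES as u32 - 1) != (chunk[LANES - 1], false) {
        return false;
    }
    let mut added = [0u32; LANES];
    for i in 0..LANES {
        added[i] = chunk[i].wrapping_add(COMPARISON_VALUE_SPLAT0[i]);
    }
    added.iter().all(|&v| v == added[0])
}
```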

Splat1 & Splat2

When I mentioned the is_consecutive problem to Ben Lichtman, he independently came up with this, Splat1:

const COMPARISON_VALUE_SPLAT1: Simd<u32, LANES> =
    Simd::from_array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]);

pub fn is_consecutive_splat1(chunk: Simd<u32, LANES>) -> bool {
    let subtracted = chunk - COMPARISON_VALUE_SPLAT1;
    Simd::splat(chunk[0]) == subtracted
}

Splat1 subtracts the comparison value from chunk and checks if the result is the same as the first element of chunk, splatted.

He also came up with a variation called Splat2 that splats the first element of subtracted rather than chunk. That would seemingly avoid one memory access.

I’m sure you are wondering which of these is best, but before we discuss that let’s look at two more candidates.

Swizzle

Swizzle is like Splat2 but uses simd_swizzle! instead of splat. Macro simd_swizzle! creates a new Simd by rearranging the lanes of an old Simd according to an array of indexes.

pub fn is_consecutive_swizzle(chunk: Simd<u32, LANES>) -> bool {
    let subtracted = chunk - COMPARISON_VALUE_SPLAT1;
    simd_swizzle!(subtracted, [0; LANES]) == subtracted
}

Rotate

This one is different. I had high hopes for it.

const COMPARISON_VALUE_ROTATE: Simd<u32, LANES> =
    Simd::from_array([4294967281, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]);

pub fn is_consecutive_rotate(chunk: Simd<u32, LANES>) -> bool {
    let rotated = chunk.rotate_elements_right::<1>();
    chunk - rotated == COMPARISON_VALUE_ROTATE
}

The idea is to rotate all the elements one to the right. We then subtract the original chunk from rotated. If the input is consecutive, the result should be “-15” followed by all 1’s. (Using wrapped subtraction, -15 is 4294967281u32.)
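A scalar model shows the wrapped arithmetic: rotate_elements_right::<1> becomes an index rotation, and wrapping_sub plays the role of Simd subtraction. Again, this is an illustrative stand-in, not the SIMD code.

```rust
const LANES: usize = 16;
const COMPARISON_VALUE_ROTATE: [u32; LANES] =
    [4294967281, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1];

// Scalar model of Rotate: rotate right by one, subtract with wrapping,
// and compare against "-15 followed by all 1's".
fn is_consecutive_rotate_model(chunk: [u32; LANES]) -> bool {
    let mut rotated = [0u32; LANES];
    for i in 0..LANES {
        rotated[(i + 1) % LANES] = chunk[i];
    }
    let mut diff = [0u32; LANES];
    for i in 0..LANES {
        diff[i] = chunk[i].wrapping_sub(rotated[i]);
    }
    diff == COMPARISON_VALUE_ROTATE
}
```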

Now that we have candidates, let’s start to evaluate them.

Rule 5: Use Godbolt and AI to understand your code’s assembly, even if you don’t know assembly language.

We’ll evaluate the candidates in two ways. First, in this rule, we’ll look at the assembly language generated from our code. Second, in Rule 7, we’ll benchmark the code’s speed.

Don’t worry if you don’t know assembly language, you can still get something out of looking at it.

The easiest way to see the generated assembly language is with the Compiler Explorer, AKA Godbolt. It works best on short bits of code that don’t use outside crates. It looks like this:

Referring to the numbers in the figure above, follow these steps to use Godbolt:

  1. Open godbolt.org with your web browser.
  2. Add a new source editor.
  3. Select Rust as your language.
  4. Paste in the code of interest. Make the functions of interest public (pub fn). Do not include a main or unneeded functions. The tool doesn’t support external crates.
  5. Add a new compiler.
  6. Set the compiler version to nightly.
  7. Set options (for now) to -C opt-level=3 -C target-feature=+avx512f.
  8. If there are errors, look at the output.
  9. If you want to share or save the state of the tool, click “Share”

From the image above, you can see that Splat2 and Swizzle are exactly the same, so we can remove Swizzle from consideration. If you open up a copy of my Godbolt session, you’ll also see that most of the functions compile to about the same number of assembly operations. The exceptions are Regular — which is much longer — and Splat0 — which includes the early check.

In the assembly, 512-bit registers start with ZMM. 256-bit registers start with YMM. 128-bit registers start with XMM. If you want to better understand the generated assembly, use AI tools to generate annotations. For example, here I ask Bing Chat about Splat2:

Try different compiler settings, including -C target-feature=+avx2 and then leaving target-feature completely off.

Fewer assembly operations don’t necessarily mean faster speed. Looking at the assembly does, however, give us a sanity check that the compiler is at least trying to use SIMD operations, inlining const references, etc. Also, as with Splat2 and Swizzle, it can sometimes let us know when two candidates are the same.

You may need disassembly features beyond what Godbolt offers, for example, the ability to work with code that uses external crates. B3NNY recommended the cargo tool cargo-show-asm to me. I tried it and found it reasonably easy to use.

The range-set-blaze crate must handle integer types beyond u32. Moreover, we must pick a number of LANES, but we have no reason to think that 16 LANES is always best. To address these needs, in the next rule we’ll generalize the code.

Rule 6: Generalize to all types and LANES with inlined generics, (and when that doesn’t work) macros, and (when that doesn’t work) traits.

Let’s first generalize Splat1 with generics.

#[inline]
pub fn is_consecutive_splat1_gen<T, const N: usize>(
    chunk: Simd<T, N>,
    comparison_value: Simd<T, N>,
) -> bool
where
    T: SimdElement + PartialEq,
    Simd<T, N>: Sub<Simd<T, N>, Output = Simd<T, N>>,
    LaneCount<N>: SupportedLaneCount,
{
    let subtracted = chunk - comparison_value;
    Simd::splat(chunk[0]) == subtracted
}

First, note the #[inline] attribute. It’s important for efficiency and we’ll use it on pretty much every one of these small functions.

The function defined above, is_consecutive_splat1_gen, looks great except that it needs a second input, called comparison_value, that we have yet to define.

If you don’t need a generic const comparison_value, I envy you. You can skip to the next rule if you like. Likewise, if you are reading this in the future and creating a generic const comparison_value is as effortless as having your personal robot do your household chores, then I doubly envy you.

We can try to create a comparison_value_splat_gen that is generic and const. Sadly, neither From<usize> nor the alternative T::One is const, so this doesn’t work:

// DOESN'T WORK BECAUSE From<usize> is not const
pub const fn comparison_value_splat_gen<T, const N: usize>() -> Simd<T, N>
where
    T: SimdElement + Default + From<usize> + AddAssign,
    LaneCount<N>: SupportedLaneCount,
{
    let mut arr: [T; N] = [T::from(0usize); N];
    let mut i_usize = 0;
    while i_usize < N {
        arr[i_usize] = T::from(i_usize);
        i_usize += 1;
    }
    Simd::from_array(arr)
}

Macros are the last refuge of scoundrels. So, let’s use macros:

#[macro_export]
macro_rules! define_is_consecutive_splat1 {
    ($function:ident, $type:ty) => {
        #[inline]
        pub fn $function<const N: usize>(chunk: Simd<$type, N>) -> bool
        where
            LaneCount<N>: SupportedLaneCount,
        {
            define_comparison_value_splat!(comparison_value_splat, $type);

            let subtracted = chunk - comparison_value_splat();
            Simd::splat(chunk[0]) == subtracted
        }
    };
}
#[macro_export]
macro_rules! define_comparison_value_splat {
    ($function:ident, $type:ty) => {
        pub const fn $function<const N: usize>() -> Simd<$type, N>
        where
            LaneCount<N>: SupportedLaneCount,
        {
            let mut arr: [$type; N] = [0; N];
            let mut i = 0;
            while i < N {
                arr[i] = i as $type;
                i += 1;
            }
            Simd::from_array(arr)
        }
    };
}

This lets us run on any particular element type and any number of LANES (Rust Playground):

define_is_consecutive_splat1!(is_consecutive_splat1_i32, i32);

let a: Simd<i32, 16> = black_box(Simd::from_array(array::from_fn(|i| 100 + i as i32)));
let ninety_nines: Simd<i32, 16> = black_box(Simd::from_array([99; 16]));
assert!(is_consecutive_splat1_i32(a));
assert!(!is_consecutive_splat1_i32(ninety_nines));

Sadly, this still isn’t enough for range-set-blaze. It needs to run on all element types (not just one) and (ideally) all LANES (not just one).

Happily, there’s a workaround that again depends on macros. It also exploits the fact that we only need to support a finite list of types, namely: i8, i16, i32, i64, isize, u8, u16, u32, u64, and usize. If you need to also (or instead) support f32 and f64, that’s fine.

If, on the other hand, you need to support i128 and u128, you may be out of luck. The core::simd module doesn’t support them. We’ll see in Rule 8 how range-set-blaze gets around that at a performance cost.

The workaround defines a new trait, here called IsConsecutive. We then use a macro (that calls a macro, that calls a macro) to implement the trait on the 10 types of interest.

pub trait IsConsecutive {
    fn is_consecutive<const N: usize>(chunk: Simd<Self, N>) -> bool
    where
        Self: SimdElement,
        Simd<Self, N>: Sub<Simd<Self, N>, Output = Simd<Self, N>>,
        LaneCount<N>: SupportedLaneCount;
}

macro_rules! impl_is_consecutive {
    ($type:ty) => {
        impl IsConsecutive for $type {
            #[inline] // very important
            fn is_consecutive<const N: usize>(chunk: Simd<Self, N>) -> bool
            where
                Self: SimdElement,
                Simd<Self, N>: Sub<Simd<Self, N>, Output = Simd<Self, N>>,
                LaneCount<N>: SupportedLaneCount,
            {
                define_is_consecutive_splat1!(is_consecutive_splat1, $type);
                is_consecutive_splat1(chunk)
            }
        }
    };
}

impl_is_consecutive!(i8);
impl_is_consecutive!(i16);
impl_is_consecutive!(i32);
impl_is_consecutive!(i64);
impl_is_consecutive!(isize);
impl_is_consecutive!(u8);
impl_is_consecutive!(u16);
impl_is_consecutive!(u32);
impl_is_consecutive!(u64);
impl_is_consecutive!(usize);

We can now call fully generic code (Rust Playground):

// Works on i32 and 16 lanes
let a: Simd<i32, 16> = black_box(Simd::from_array(array::from_fn(|i| 100 + i as i32)));
let ninety_nines: Simd<i32, 16> = black_box(Simd::from_array([99; 16]));

assert!(IsConsecutive::is_consecutive(a));
assert!(!IsConsecutive::is_consecutive(ninety_nines));

// Works on i8 and 64 lanes
let a: Simd<i8, 64> = black_box(Simd::from_array(array::from_fn(|i| 10 + i as i8)));
let ninety_nines: Simd<i8, 64> = black_box(Simd::from_array([99; 64]));

assert!(IsConsecutive::is_consecutive(a));
assert!(!IsConsecutive::is_consecutive(ninety_nines));

With this technique, we can create multiple candidate algorithms that are fully generic over type and LANES. Next, it is time to benchmark and see which algorithms are fastest.


Those are the first six rules for adding SIMD code to Rust. In Part 2, we look at rules 7 to 9. These rules will cover how to pick an algorithm and set LANES. Also, how to integrate SIMD operations into your existing code and (importantly) how to make it optional. Part 2 concludes with a discussion of when/if you should use SIMD and ideas for improving Rust’s SIMD experience. I hope to see you there.

Please follow Carl on Medium. I write on scientific programming in Rust and Python, machine learning, and statistics. I tend to write about one article per month.


The post Nine Rules for SIMD Acceleration of Your Rust Code (Part 1) appeared first on Towards Data Science.

]]>
Nine Pico PIO Wats with Rust (Part 1) https://towardsdatascience.com/nine-pico-pio-wats-with-rust-part-1-9d062067dc25/ Thu, 30 Jan 2025 16:32:01 +0000 https://towardsdatascience.com/nine-pico-pio-wats-with-rust-part-1-9d062067dc25/ Raspberry Pi programmable IO pitfalls illustrated with a musical example

The post Nine Pico PIO Wats with Rust (Part 1) appeared first on Towards Data Science.

]]>
Pico PIO Surprises - Source: https://openai.com/dall-e-2/. All other figures from the author.

Also available: A MicroPython version of this article

In JavaScript and other languages, we call a surprising or inconsistent behavior a “Wat!” [that is, a “What!?”]. For example, in JavaScript, an empty array plus an empty array produces an empty string, [] + [] === "". Wat!

Rust, by comparison, is consistent and predictable. However, one corner of Rust on the Raspberry Pi Pico microcontroller offers similar surprises. Specifically, the Pico’s Programmable Input/Output (PIO) subsystem, while incredibly powerful and versatile, comes with peculiarities.

PIO Programming matters because it provides an ingenious solution to the challenge of precise, low-level hardware control. It is incredibly fast and flexible: rather than relying on special-purpose hardware for the countless peripherals you might want to control, PIO allows you to define custom behaviors in software, seamlessly adapting to your needs without adding hardware complexity.

Consider this simple example: a $15 theremin-like musical instrument. By waving their hand in the air, the musician changes the pitch of (admittedly annoying) tones. Using PIO provides a simple way to program this device that ensures it reacts instantly to movement.

So, all is wonderful, except – to paraphrase Spider-Man:

With great power comes… nine Wats!?

We’ll explore and illustrate those nine PIO Wats through the creation of this theremin.

Who Is This Article For?

  • All Programmers: Microcontrollers like the Pico cost under $7 and support high-level languages like Python, Rust, and C/C++. This article will show how microcontrollers let your programs interact with the physical world and introduce you to programming the Pico’s low-level, high-performance PIO hardware.
  • Rust Pico Programmers: Curious about the Pico’s hidden potential? Beyond its two main cores, it has eight tiny “state machines” dedicated to PIO programming. These state machines take over time-critical tasks, freeing up the main processors for other work and enabling surprising parallelism.
  • C/C++ Pico Programmers: While this article uses Rust, PIO programming is – for good and bad – nearly identical across all languages. If you understand it here, you’ll be well-equipped to apply it in C/C++.
  • MicroPython Pico Programmers: You may wish to read the MicroPython version of this article.
  • PIO Programmers: The journey through nine Wats may not be as entertaining as JavaScript’s quirks (thankfully), but it will shed light on the peculiarities of PIO programming. If you’ve ever found PIO programming confusing, this article should reassure you that the problem isn’t (necessarily) you – it’s partly PIO itself. Most importantly, understanding these Wats will make writing PIO code simpler and more effective.

Finally, this article isn’t about “fixing” PIO programming. PIO excels at its primary purpose: efficiently and flexibly handling custom peripheral interfaces. Its design is purposeful and well-suited to its goals. Instead, this article focuses on understanding PIO programming and its quirks – starting with a bonus Wat.

Bonus Wat 0: “State Machines” Are Not State Machines – Except When They Are

Despite their name, the eight “PIO state machines” in the Raspberry Pi Pico are not state machines in the strict, formal sense of computer science. A classical finite state machine (FSM) consists of a finite set of states and transitions, driven purely by input symbols. In contrast, the Pico’s PIO state machines are tiny programmable processors with their own instruction set. Unlike a formal FSM, they have a program counter, registers, and the ability to modify those registers through shifting and decrementing, allowing them to store intermediate values and influence execution flow beyond fixed state transitions.

That said, calling them programmable state machines makes sense: rather than having a fixed set of predefined states, their behavior depends on the program they execute. And in most cases, that program will define a state machine or something similar – especially since PIO is often used to implement precise, state-driven I/O control.

Each PIO processes one instruction per clock cycle. The $4 Pico 1 runs at 125 million cycles per second, while the $5 Pico 2 offers a faster 150 million cycles per second. Each instruction performs a simple operation, such as “move a value” or “jump to a label”.

With that bonus Wat out of the way, let’s move to our first main Wat.

Wat 1: The Register Hunger Games

In PIO programming, a register is a small, fast storage location that acts like a variable for the state machine. You might dream of an abundance of variables to hold your counters, delays, and temporary values, but the reality is brutal: you only get two general-purpose registers, x and y. It’s like The Hunger Games, where no matter how many tributes enter the arena, only Katniss and Peeta emerge as victors. You’re forced to winnow down your needs to fit within these two registers, ruthlessly deciding what to prioritize and what to sacrifice. Also, like the Hunger Games, we can sometimes bend the rules.

Let’s start with a challenge: create a backup beeper – 1000 Hz for ½ second, silence for ½ second, repeat. The result? “Beep Beep Beep…”

We would like five variables:

  • half_period: The number of clock cycles to hold the voltage high and then low to create a 1000 Hz tone. This is 125,000,000 / 1000 / 2 = 62,500 cycles high and 62,500 cycles low.
Voltage and timing (millisecond and clock cycles) to generate a 1000 Hz square tone.
  • y: Loop counter from 0 to half_period to create a delay.
  • period_count: The number of repeated periods needed to fill ½ second of time. 125,000,000 × 0.5 / (62,500 × 2) = 500.
  • x: Loop counter from 0 to period_count to fill ½ second of time.
  • silence_cycles: The number of clock cycles for ½ second of silence. 125,000,000 × 0.5 = 62,500,000.
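These values can be double-checked with ordinary integer arithmetic (assuming the Pico 1’s 125 MHz clock):

```rust
// Recompute the beeper's constants from the Pico 1 clock frequency.
const CLOCK_HZ: u32 = 125_000_000;
const TONE_HZ: u32 = 1_000;

fn beeper_constants() -> (u32, u32, u32) {
    let half_period = CLOCK_HZ / TONE_HZ / 2; // cycles high (and low) per period
    let period_count = CLOCK_HZ / 2 / (half_period * 2); // periods in 1/2 second
    let silence_cycles = half_period * period_count * 2; // 1/2 second of silence
    (half_period, period_count, silence_cycles)
}
```

The last line is why silence_cycles never needs its own register: it is exactly half_period × period_count × 2.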

We want five registers but can only have two, so let the games begin! May the odds be ever in your favor.

First, we can eliminate silence_cycles because it can be derived as half_period × period_count × 2. While PIO doesn’t support multiplication, it does support loops. By nesting two loops—where the inner loop delays for 2 clock cycles—we can create a delay of 62,500,000 clock cycles.

One variable down, but how can we eliminate two more? Fortunately, we don’t have to. While PIO only provides two general-purpose registers, x and y, it also includes two special-purpose registers: osr (output shift register) and isr (input shift register).

The PIO code that we’ll see in a moment implements the backup beeper. Here’s how it works:

Initialization:

  • The pull block instruction reads the half period of the tone (62,500 clock cycles) from a buffer and places the value into osr.
  • The value is then copied to isr for later use.
  • The second pull block reads the period count (500 repeats) from the buffer and places the value in osr, where we leave it.

Beep Loops:

  • The mov x, osr instruction copies the period count into the x register, which serves as the outer loop counter.
  • For the inner loops, mov y, isr repeatedly copies the half period into y to create delays for the high and low states of the tone.

Silence Loops:

  • The silence loops mirror the structure of the beep loops but don’t set any pins, so they act solely as a delay.

Wrap and Continuous Execution:

  • The .wrap_target and .wrap directives define the main loop of the state machine.
  • After finishing both the beep and silence loops, the state machine jumps back near the start of the program, repeating the sequence indefinitely.

With this outline in mind, here’s the PIO assembly code for generating the backup beeper signal.

.program backup

; Read initial configuration
    pull block         ; Read the half period of the beep sound
    mov isr, osr      ; Store the half period in ISR
    pull block        ; Read the period_count

.wrap_target          ; Start of the main loop

; Generate the beep sound
    mov x, osr        ; Load period_count into X
beep_loop:
    set pins, 1       ; Set the buzzer to high voltage (start the tone)
    mov y, isr        ; Load the half period into Y
beep_high_delay:
    jmp y--, beep_high_delay    ; Delay for the half period

    set pins, 0       ; Set the buzzer to low voltage (end the tone)
    mov y, isr        ; Load the half period into Y
beep_low_delay:
    jmp y--, beep_low_delay     ; Delay for the low duration

    jmp x--, beep_loop          ; Repeat the beep loop

; Silence between beeps
    mov x, osr        ; Load the period count into X for outer loop
silence_loop:
    mov y, isr        ; Load the half period into Y for inner loop
silence_delay:
    jmp y--, silence_delay [1]  ; Delay for two clock cycles (jmp + 1 extra)

    jmp x--, silence_loop       ; Repeat the silence loop

.wrap                 ; End of the main loop, jumps back to wrap_target

Here’s the core Rust code to configure and run the PIO program for the backup beeper. It uses the Embassy framework for embedded applications. The function initializes the state machine, calculates the timing values (half_period and period_count), and sends them to the PIO. It then plays the beeping sequence for 5 seconds before entering an endless loop. The full source file and project are available on GitHub.

async fn inner_main(_spawner: Spawner) -> Result<Never> {
    info!("Hello, back_up!");
    let hardware: Hardware<'_> = Hardware::default();
    let mut pio0 = hardware.pio0;
    let state_machine_frequency = embassy_rp::clocks::clk_sys_freq();
    let mut back_up_state_machine = pio0.sm0;
    let buzzer_pio = pio0.common.make_pio_pin(hardware.buzzer);
    back_up_state_machine.set_pin_dirs(Direction::Out, &[&buzzer_pio]);
    back_up_state_machine.set_config(&{
        let mut config = Config::default();
        config.set_set_pins(&[&buzzer_pio]); // For set instruction
        let program_with_defines = pio_file!("examples/backup.pio");
        let program = pio0.common.load_program(&program_with_defines.program);
        config.use_program(&program, &[]);
        config
    });
    back_up_state_machine.set_enable(true);
    let half_period = state_machine_frequency / 1000 / 2;
    let period_count = state_machine_frequency / (half_period * 2) / 2;
    info!(
        "Half period: {}, Period count: {}",
        half_period, period_count
    );
    back_up_state_machine.tx().wait_push(half_period).await;
    back_up_state_machine.tx().wait_push(period_count).await;
    Timer::after(Duration::from_millis(5000)).await;
    info!("Disabling back_up_state_machine");
    back_up_state_machine.set_enable(false);
    // run forever
    loop {
        Timer::after(Duration::from_secs(3_153_600_000)).await; // 100 years
    }
}

Here’s what happens when you run the program:

Aside 1: Running this yourself. The simplest – but often frustrating – way to run Rust code on the Pico is to cross-compile it on your desktop and manually copy over the resulting files. A much better approach is to invest in a $12 Raspberry Pi Debug Probe and set up probe-rs (https://probe.rs/docs/getting-started/installation/) on your desktop. With this setup, you can use cargo run to automatically compile on your desktop, copy to your Pico, and then start your code running. Even better, your Pico code can use info! statements to send messages back to your desktop, and you can perform interactive breakpoint debugging. For setup instructions, visit the probe-rs website.

To hear sound, I connected a passive buzzer, a resistor, and a transistor to the Pico. For detailed wiring diagrams and a parts list, check out the passive buzzer instructions in the SunFounder’s Kepler Kit.

Aside 2: If your only goal is to generate tones with the Pico, PIO isn’t necessary. MicroPython is fast enough to toggle pins directly, or you can use the Pico’s built-in pulse width modulation (PWM) feature.

Alternative Endings to the Register Hunger Games

We used four registers – two general and two special – to resolve the challenge. If this solution feels less than satisfying, here are alternative approaches to consider:

Use Constants: Why make half_period, period_count, and silence_cycles variables at all? Hardcoding the constants “62,500,” “500,” and “62,500,000” could simplify the design. However, PIO constants have limitations, which we’ll explore in Wat 5.

Pack Bits: Registers hold 32 bits. Do we really need two registers (2×32=64 bits) to store half_period and period_count? No. Storing 62,500 only requires 16 bits, and 500 requires 9 bits. We could pack these into a single register and use the out instruction to shift values into x and y. This approach would free up either osr or isr for other tasks, but only one at a time—the other register must hold the packed value.
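For instance, here is a sketch of the bit-packing arithmetic. The field widths and shift amounts are illustrative; on-device, the PIO out instruction would do the unpacking from osr.

```rust
// Pack half_period (needs 16 bits) and period_count (needs 9 bits)
// into a single 32-bit value: period_count in the high bits,
// half_period in the low 16 bits.
fn pack(half_period: u32, period_count: u32) -> u32 {
    assert!(half_period < (1 << 16) && period_count < (1 << 9));
    (period_count << 16) | half_period
}

// Unpack by shifting and masking -- the scalar equivalent of two
// `out` instructions shifting 16 bits each.
fn unpack(packed: u32) -> (u32, u32) {
    (packed & 0xFFFF, packed >> 16)
}
```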

Slow Motion: In Rust with the Embassy framework, you can configure a PIO state machine to run at a slower frequency by setting its clock_divider. This allows the state machine to run as slow as ~1907 Hz. Running the state machine at a slower speed means that values like half_period can be smaller, potentially as small as 2. Small values are easier to hardcode as constants and more compactly bit-packed into registers.

A Happy Ending to the Register Hunger Games

The Register Hunger Games demanded strategic sacrifices and creative workarounds, but we emerged victorious by leveraging PIO’s special registers and clever looping structures. If the stakes had been higher, alternative techniques could have helped us adapt and survive.

But victory in one arena doesn’t mean the challenges are over. In the next Wat, we face a new trial: PIO’s strict 32-instruction limit.

Wat 2: The 32-Instruction Carry-On Suitcase

Congratulations! You’ve purchased a trip around the world for just $4. The catch? All your belongings must fit into a tiny carry-on suitcase. Likewise, PIO programs allow you to create incredible functionality, but every PIO program is limited to just 32 instructions.

Wat! Only 32 instructions? That’s not much space to pack everything you need! But with clever planning, you can usually make it work.

The Rules

  • No PIO program can be longer than 32 instructions.
  • The wrap_target and wrap directives do not count.
  • Labels do not count.
  • A Pico 1 includes eight state machines, organized into two blocks of four. A Pico 2 includes twelve state machines, organized into three blocks of four. Each block shares 32 instruction slots. So, because all four state machines in a block draw from the same 32-instruction pool, if one machine’s program uses all 32 slots, there’s no space left for the other three.

When Your Suitcase Won’t Close

If your idea doesn’t fit in the PIO instruction slots, these packing tricks may help. (Disclaimer: I haven’t tried all of these myself.)

  • Execute instructions from your main program: The host can run individual PIO instructions on a state machine directly, moving one-time setup out of the 32-instruction budget. For example, this Rust snippet pushes a value and then has the state machine execute a pull block instruction:
let half_period = state_machine_frequency / 1000 / 2;
back_up_state_machine.tx().push(half_period); // Using non-blocking push since FIFO is empty
let pull_block = pio_asm!("pull block").program.code[0];
unsafe {
    back_up_state_machine.exec_instr(pull_block);
}
  • Use PIO’s exec commands: Within your state machine, you can dynamically execute instructions using PIO’s exec mechanism. For example, you can execute an instruction value stored in osr with out exec. Alternatively, you can use mov exec, x or mov exec, y to execute instructions directly from those registers.


With your bags now packed, let’s join Dr. Dolittle’s search for a fabled creature.

Bonus Wat 2.5: Dr. Dolittle’s PIO Pushmi-Pullyu

Two readers pointed out an important PIO Wat that I missed – so here’s a bonus! When programming PIO, you’ll notice something peculiar:

  • The PIO pull instruction receives values from TX FIFO (transmit buffer) and inputs them into the output shift register (osr). So, it inputs into output and transmits from receive.
  • Likewise, the PIO push instruction outputs values from the input shift register (isr) and transmits them to the RX FIFO (receive buffer). So, it outputs from input and receives from transmit.

Wat!? Like the two-headed Pushmi-Pullyu from the Dr. Dolittle stories, something seems backwards. But it starts to make sense when you realize PIO names most things from the host’s perspective (MicroPython, Rust, C/C++), not the point of view of the PIO program.

This table summarizes the instructions, registers, and buffer names. (“FIFO” stands for first-in-first-out.)

With the Pushmi-Pullyu in hand, we next move to the scene of a mystery.

Wat 3: The pull noblock Mystery

In Wat 1, we programmed our audio hardware as a backup beeper. But that’s not what we need for our musical instrument. Instead, we want a PIO program that plays a given tone indefinitely – until it’s told to play a new one. The program should also wait silently when given a special “rest” tone.

Resting until a new tone is provided is easy to program with pull block—we’ll explore the details below. Playing a tone at a specific frequency is also straightforward, building on the work we did in Wat 1.

But how can we check for a new tone while continuing to play the current one? The answer lies in using “noblock” instead of “block” in pull noblock. Now, if there’s a new value, it will be loaded into osr, allowing the program to update seamlessly.

Here’s where the mystery begins: what happens to osr if pull noblock is called and there’s no new value?

I assumed it would keep its previous value. Wrong! Maybe it gets reset to 0? Wrong again! The surprising truth: it gets the value of x. Why? (No, not y.) Because the Pico SDK says so. Specifically, section 3.4.9.2 explains:

A nonblocking PULL on an empty FIFO has the same effect as MOV OSR, X.
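That rule can be modeled in plain Rust — a sketch, with a VecDeque standing in for the TX FIFO:

```rust
use std::collections::VecDeque;

// Minimal model of the registers involved, plus the TX FIFO.
struct SmModel {
    x: u32,
    osr: u32,
    tx_fifo: VecDeque<u32>,
}

impl SmModel {
    // `pull noblock`: take a waiting value if there is one; on an empty
    // FIFO, behave like `mov osr, x` (SDK section 3.4.9.2).
    fn pull_noblock(&mut self) {
        self.osr = self.tx_fifo.pop_front().unwrap_or(self.x);
    }
}
```

This is also why the sound program below executes mov x, osr immediately before pull noblock: if no new value has arrived, osr is deliberately reloaded with its own old value.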

Knowing how pull noblock works is important, but there’s a bigger lesson here. Treat the Pico SDK documentation like the back of a mystery novel. Don’t try to solve everything on your own—cheat! Skip to the “who done it” section, and in section 3.4, read the fine details for each command you use. Reading just a few paragraphs can save you hours of confusion.

Aside: When even the SDK documentation feels unclear, turn to the RP2040 (Pico 1) and RP2350 (Pico 2) datasheets. These encyclopedias – 600 and 1,300 pages respectively – are like omnipotent narrators: they provide the ground truth.

With this in mind, let’s look at a practical example. Below is the PIO program for playing tones and rests continuously. It uses pull block to wait for input during a rest and pull noblock to check for updates while playing a tone.

.program sound

;  Rest until a new tone is received.
resting:
    pull block           ; Wait for a new delay value
    mov x, osr           ; Copy delay into X
    jmp !x resting       ; If delay is zero, keep resting

; Play the tone until a new delay is received.
.wrap_target             ; Start of the main loop
    set pins, 1          ; Set the buzzer high voltage.
high_voltage_loop:
    jmp x-- high_voltage_loop   ; Delay

    set pins, 0          ; Set the buzzer low voltage.
    mov x, osr           ; Load the half period into X.
low_voltage_loop:
    jmp x-- low_voltage_loop    ; Delay

; Read any new delay value. If none, keep the current delay.
    mov x, osr          ; set x, the default value for "pull(noblock)"
    pull noblock        ; Read a new delay value or use the default.

; If the new delay is zero, rest. Otherwise, continue playing the tone.    
    mov x, osr          ; Copy the delay into X.
    jmp !x resting      ; If X is zero, rest.
.wrap ; Continue playing the sound.

We’ll eventually use this PIO program in our theremin-like musical instrument. For now, let’s see the PIO program in action by playing a familiar melody. This demo uses “Twinkle, Twinkle, Little Star” to show how you can control a melody by feeding frequencies and durations to the state machine. With just this code (full file and project), you can make the Pico sing!

const TWINKLE_TWINKLE: [(u32, u64, &str); 16] = [
    // Bar 1
    (262, 400, "Twin-"), // C
    (262, 400, "-kle"),  // C
    (392, 400, "twin-"), // G
    (392, 400, "-kle"),  // G
    (440, 400, "lit-"),  // A
    (440, 400, "-tle"),  // A
    (392, 800, "star"),  // G
    (0, 400, ""),        // rest
    // Bar 2
    (349, 400, "How"),  // F
    (349, 400, "I"),    // F
    (330, 400, "won-"), // E
    (330, 400, "-der"), // E
    (294, 400, "what"), // D
    (294, 400, "you"),  // D
    (262, 800, "are"),  // C
    (0, 400, ""),       // rest
];
async fn inner_main(_spawner: Spawner) -> Result<Never> {
    info!("Hello, sound!");
    let hardware: Hardware<'_> = Hardware::default();
    let mut pio0 = hardware.pio0;
    let state_machine_frequency = embassy_rp::clocks::clk_sys_freq();
    let mut sound_state_machine = pio0.sm0;
    let buzzer_pio = pio0.common.make_pio_pin(hardware.buzzer);
    sound_state_machine.set_pin_dirs(Direction::Out, &[&buzzer_pio]);
    sound_state_machine.set_config(&{
        let mut config = Config::default();
        config.set_set_pins(&[&buzzer_pio]); // For set instruction
        let program_with_defines = pio_file!("examples/sound.pio");
        let program = pio0.common.load_program(&program_with_defines.program);
        config.use_program(&program, &[]);
        config
    });
    sound_state_machine.set_enable(true);
    for (frequency, ms, lyrics) in TWINKLE_TWINKLE.iter() {
        if *frequency > 0 {
            let half_period = state_machine_frequency / frequency / 2;
            info!("{} -- Frequency: {}", lyrics, frequency);
            // Send the half period to the PIO state machine
            sound_state_machine.tx().wait_push(half_period).await;
            Timer::after(Duration::from_millis(*ms)).await; // Wait as the tone plays
            sound_state_machine.tx().wait_push(0).await; // Stop the tone
            Timer::after(Duration::from_millis(50)).await; // Give a short pause between notes
        } else {
            sound_state_machine.tx().wait_push(0).await; // Play a silent rest
            Timer::after(Duration::from_millis(*ms + 50)).await; // Wait for the rest duration + a short pause
        }
    }
    info!("Disabling sound_state_machine");
    sound_state_machine.set_enable(false);
    // run forever
    loop {
        Timer::after(Duration::from_secs(3_153_600_000)).await; // 100 years
    }
}
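As a quick sanity check on the half-period arithmetic in the loop above (sketched in Python for brevity, and assuming a hypothetical 150 MHz system clock — the real value comes from clk_sys_freq() at run time):

```python
STATE_MACHINE_FREQUENCY = 150_000_000  # assumed 150 MHz clock

def half_period(frequency_hz):
    # Each delay loop iteration burns one state-machine cycle, and a
    # full square wave needs one high half and one low half.
    return STATE_MACHINE_FREQUENCY // frequency_hz // 2

# Middle C (262 Hz) at 150 MHz: 150_000_000 // 262 // 2 == 286_259
assert half_period(262) == 286_259
```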

Here’s what happens when you run the program:

We’ve solved one mystery, but there’s always another challenge lurking around the corner. In Wat 4, we’ll explore what happens when your smart hardware comes with a catch – it’s also very cheap.

Wat 4: Smart, Cheap Hardware: An Emotional Roller Coaster

With sound working, we turn next to measuring the distance to the musician’s hand using the HC-SR04+ ultrasonic range finder. This small but powerful device is available for less than two dollars.

HC-SR04+ Range Finder (Pen added for scale.)

This little peripheral took me on an emotional roller coaster of “Wats!?”:

  • Up: Amazingly, this $2 range finder includes its own microcontroller, making it smarter and easier to use.
  • Down: Frustratingly, that same “smart” behavior is unintuitive.
  • Up: Conveniently, the Pico can supply peripherals with either 3.3V or 5V power.
  • Down: Unpredictably, many range finders are unreliable – or fail outright – at 3.3V, and they can damage your Pico at 5V.
  • Up: Thankfully, both damaged range finders and Picos are inexpensive to replace, and a dual-voltage version of the range finder solved my problems.

Details

I initially assumed the range finder would set the Echo pin high when the echo returned. I was wrong.

Instead, the range finder emits a pattern of 8 ultrasonic pulses at 40 kHz (think of it as a backup beeper for dogs). Immediately after, it sets Echo high. The Pico should then start measuring the time until Echo goes low, which signals that the sensor detected the pattern – or that it timed out.
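Once you have timed how long Echo stayed high, converting that duration to a distance is simple arithmetic. A sketch in Python, assuming the usual speed of sound at room temperature (the constant is an approximation, not from the datasheet):

```python
SPEED_OF_SOUND_CM_PER_US = 0.0343  # approx., at ~20 °C

def echo_to_cm(echo_high_us):
    # The ultrasonic pulse travels to the object and back,
    # so halve the round-trip time.
    return echo_high_us * SPEED_OF_SOUND_CM_PER_US / 2

# 1 ms of Echo high time corresponds to roughly 17 cm.
assert abs(echo_to_cm(1000) - 17.15) < 1e-6
```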

As for voltage, the documentation specifies the range finder operates at 5V. It seemed to work at 3.3V – until it didn’t. Around the same time, while my Pico kept working with Rust (via the Debug Probe and probe-rs), it stopped working with any of the MicroPython IDEs, which rely on a special USB protocol.

So, at this point both the Pico and the range finder were damaged.

After experimenting with various cables, USB drivers, programming languages, and even an older 5V-only range finder, I finally resolved the issue by:

Wat 4: Lessons Learned

As the roller coaster returned to the station, I learned two key lessons. First, thanks to microcontrollers, even simple hardware can behave in non-intuitive ways that require careful reading of the documentation. Second, while this hardware is clever, it’s also inexpensive – and that means it is prone to failure. When it fails, take a deep breath, remember it’s only a few dollars, and replace it.

Hardware quirks, however, are only part of the story. In Wat 5, in Part 2, we’ll shift our focus back to software: the PIO programming language itself. We’ll uncover a behavior so unexpected, it might leave you questioning everything you thought you knew about constants.


Those are the first four Wats from programming the Pico PIO with MicroPython. You can find the code for the project on GitHub.

In Part 2, we’ll explore Wats 5 through 9. These will cover inconstant constants, conditions through the looking glass, overshooting jumps, too many pins, and kludgy debugging. We’ll also unveil the code for the finished musical instrument.

Follow me on Medium to get notified about this and future articles. I write on scientific programming in Rust and Python, machine learning, and statistics. I typically post one article a month.

The post Nine Pico PIO Wats with Rust (Part 1) appeared first on Towards Data Science.

]]>
Nine Pico PIO Wats with MicroPython (Part 2) https://towardsdatascience.com/nine-pico-pio-wats-with-micropython-part-2-984a642f25a4/ Tue, 28 Jan 2025 12:02:00 +0000 https://towardsdatascience.com/nine-pico-pio-wats-with-micropython-part-2-984a642f25a4/ Raspberry Pi programmable IO pitfalls illustrated with a musical example

The post Nine Pico PIO Wats with MicroPython (Part 2) appeared first on Towards Data Science.

]]>
Pico PIO Surprises, Part 2 - Source: https://openai.com/dall-e-2/. All other figures from the author.

This is Part 2 of an exploration into the unexpected quirks of programming the Raspberry Pi Pico PIO with MicroPython. If you missed Part 1, we uncovered four Wats that challenge assumptions about register count, instruction slots, the behavior of pull(noblock), and smart yet cheap hardware.

Now, we continue our journey toward crafting a theremin-like musical instrument – a project that reveals some of the quirks and perplexities of PIO programming. Prepare to challenge your understanding of constants in a way that brings to mind a Shakespearean tragedy.

Wat 5: Inconstant Constants

In the world of PIO programming, constants should be reliable, steadfast, and, well, constant. But what if they’re not? This brings us to a puzzling Wat about how the set instruction in PIO works—or doesn’t—when handling larger constants.

Much like Juliet doubting Romeo’s constancy, you might find yourself wondering if PIO constants will, as she says, "prove likewise variable."

The Problem: Constants Are Not as Big as They Seem

Imagine you’re programming an ultrasonic range finder and need to count down from 500 while waiting for the Echo signal to drop from high to low. To set up this wait time in PIO, you might naively try to load the constant value directly using set:

set(y, 500)                   # Load max echo wait into Y
label("measure_echo_loop")
jmp(pin, "echo_active")         # if echo voltage is high continue count down
jmp("measurement_complete")     # if echo voltage is low, measurement is complete
label("echo_active")
jmp(y_dec, "measure_echo_loop") # Continue counting down unless timeout

Aside: Don’t try to understand the crazy jmp operations here. We’ll discuss those next in Wat 6.

But here’s the tragic twist: the set instruction in PIO is limited to constants between 0 and 31. Moreover, MicroPython’s star-crossed set instruction doesn’t report an error. Instead, it silently corrupts the entire PIO instruction. (PIO from Rust shows a similar problem.) This produces a nonsense result.

Workarounds for Inconstant Constants

To address this limitation, consider the following approaches:

  • Read Values and Store Them in a Register: We saw this approach in Wat 1. You can load your constant in the osr register, then transfer it to y. For example:
# Read the max echo wait into OSR.
pull()                          # same as pull(block)
mov(y, osr)                     # Load max echo wait into Y
  • Shift and Combine Smaller Values: Using the isr register and the in_ instruction, you can build up a constant of any size. This, however, consumes time and operations from your 32-operation budget (see Part 1, Wat 2).
# Initialize Y to 500
set(y, 15)         # Load upper 5 bits (0b01111)
mov(isr, y)        # Transfer to ISR (clears ISR)
set(y, 20)         # Load lower 5 bits (0b10100)
in_(y, 5)          # Shift in lower bits to form 500 in ISR
mov(y, isr)        # Move final value (500) from ISR to y
  • Slow Down the Timing: Reduce the frequency of the state machine to stretch delays over more system clock cycles. For example, lowering the state machine speed from 150 MHz to 343 kHz reduces the timeout constant from 218,659 to 500.
  • Use Extra Delays and (Nested) Loops: All instructions support an optional delay, allowing you to add up to 31 extra cycles. (To generate even longer delays, use loops – or even nested loops.)
# Generate 10μs trigger pulse (4 cycles at 343_000Hz)
set(pins, 1)[3]                # Set trigger pin to high, add delay of 3
set(pins, 0)                   # Set trigger pin to low voltage
  • Use the "Subtraction Trick" to Generate the Maximum 32-bit Integer: In Wat 7, we’ll explore a way to generate 4,294,967,295 (the maximum unsigned 32-bit integer) via subtraction.
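You can verify the arithmetic behind the shift-and-combine and slow-down workarounds with a couple of lines of ordinary Python:

```python
# Shift-and-combine: 5 high bits, then 5 low bits, rebuild 500.
high, low = 15, 20          # 0b01111 and 0b10100
assert (high << 5) | low == 500

# Slowing the state machine shrinks the needed constant proportionally:
# 218,659 cycles at 150 MHz lasts about as long as 500 cycles at 343 kHz.
assert round(218_659 * 343_000 / 150_000_000) == 500
```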

Much like Juliet cautioning against swearing by the inconstant moon, we’ve discovered that PIO constants are not always as steadfast as they seem. Yet, just as their story takes unexpected turns, so too does ours, moving from the inconstancy of constants to the uneven nature of conditions. In the next Wat, we’ll explore how PIO’s handling of conditional jumps can leave you questioning its loyalty to logic.

Wat 6: Conditions Through the Looking-Glass

In most programming environments, logical conditions feel balanced: you can test if a pin is high or low, or check registers for equality or inequality. In PIO, this symmetry breaks down. You can jump on pin high, but not pin low, and on x_not_y, but not x_eq_y. The rules are whimsical – like Humpty Dumpty in Through the Looking-Glass: "When I offer a condition, it means just what I choose it to mean – neither more nor less."

These quirks force us to rewrite our code to fit the lopsided logic, creating a gulf between how we wish the code could be written and how we must write it.

The Problem: Lopsided Conditions in Action

Consider a simple scenario: using a range finder, you want to count down from a maximum wait time (y) until the ultrasonic echo pin goes low. Intuitively, you might write the logic like this:

label("measure_echo_loop")
jmp(not_pin, "measurement_complete") # If echo voltage is low, measurement is complete
jmp(y_dec, "measure_echo_loop")      # Continue counting down unless timeout

And when processing the measurement, if we only wish to output values that differ from the previous value, we would write:

label("measurement_complete")
jmp(x_eq_y, "cooldown")         # If measurement is the same, skip to cool down
mov(isr, y)                     # Store measurement in ISR
push()                          # Output ISR
mov(x, y)                       # Save the measurement in X

Unfortunately, PIO doesn’t let you test not_pin or x_eq_y directly. You must restructure your logic to accommodate the available conditions, such as pin and x_not_y.

The Solution: The Way It Must Be

Given PIO’s limitations, we adapt our logic with a two-step approach that ensures the desired behavior despite the missing conditions:

  • Jump on the opposite condition to skip two instructions forward.
  • Next use an unconditional jump to reach the desired target.

This workaround adds one extra jump (affecting the instruction limit), but the additional label is cost-free.

Here is the rewritten code for counting down until the pin goes low:

label("measure_echo_loop")
jmp(pin, "echo_active")         # If echo voltage is high, continue countdown
jmp("measurement_complete")     # If echo voltage is low, measurement is complete
label("echo_active")
jmp(y_dec, "measure_echo_loop") # Continue counting down unless timeout

And here is the code for processing the measurement such that it will only output differing values:

label("measurement_complete")
jmp(x_not_y, "send_result")     # If measurement is different, send it
jmp("cooldown")                 # If measurement is the same, skip sending
label("send_result")
mov(isr, y)                     # Store measurement in ISR
push()                          # Output ISR
mov(x, y)                       # Save the measurement in X

Lessons from Humpty Dumpty’s Conditions

In Through the Looking-Glass, Alice learns to navigate Humpty Dumpty’s peculiar world – just as you’ll learn to navigate PIO’s Wonderland of lopsided conditions.

But as soon as you master one quirk, another reveals itself. In the next Wat, we’ll uncover a surprising behavior of jmp that, if it were an athlete, would shatter world records.

Wat 7: Overshooting Jumps

In Part 1’s Wat 1 and Wat 3, we saw how jmp x_dec or jmp y_dec is often used to loop a fixed number of times by decrementing a register until it reaches 0. Straightforward enough, right? But what happens when y is 0 and we run the following instruction?

jmp(y_dec, "measure_echo_loop")

If you guessed that it does not jump to measure_echo_loop and instead falls through to the next instruction, you’re absolutely correct. But for full credit, answer this: What value does y have after the instruction?

The answer: 4,294,967,295. Why? Because y is decremented after it is tested for zero. Wat!?

This value, 4,294,967,295, is the maximum for a 32-bit unsigned integer. It’s as if a track-and-field long jumper launches off the takeoff board but, instead of landing in the sandpit, overshoots and ends up on another continent.

Aside: As foreshadowed in Wat 5, we can use this behavior intentionally to set a register to the value 4,294,967,295.
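The test-then-decrement behavior is easy to model in Python — a sketch of the semantics, with the 32-bit wraparound made explicit:

```python
MASK_32 = 0xFFFF_FFFF

def jmp_y_dec(y):
    """Model of PIO's 'jmp y_dec': the jump is taken only if Y was
    nonzero, but Y is decremented (with 32-bit wraparound) either way.
    Returns (jump_taken, new_y)."""
    taken = y != 0
    new_y = (y - 1) & MASK_32
    return taken, new_y

assert jmp_y_dec(3) == (True, 2)
# Y == 0: the jump falls through, yet Y wraps to the 32-bit maximum.
assert jmp_y_dec(0) == (False, 4_294_967_295)
```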

Now that we’ve learned how to stick the landing with jmp, let’s see if we can avoid getting stuck by the pins that PIO reads and sets.

Wat 8: Too Many "Pins"

In Dr. Seuss’s Too Many Daves, Mrs. McCave had 23 sons, all named Dave, leading to endless confusion whenever she called out their name. In PIO programming, pin and pins can refer to completely different ranges of pins depending on the context. It’s hard to know which Dave or Daves you’re talking to.

The Problem: Pin Ranges and Bases

In PIO, both pin and pins depend on base pins defined outside the program. Each instruction interacts with a specific base pin, and some instructions also operate on a range of pins starting from that base. To clarify PIO’s behavior, I created this table:

Table showing how PIO interprets ‘pin’ and ‘pins’ in different instructions, with their associated contexts and configurations.

Example: Distance Program for the Range Finder

Here’s a PIO program for measuring the distance to an object using Trigger and Echo pins. The key features of this program are:

  • Continuous Operation: The range finder runs in a loop as fast as possible.
  • Maximum Range Limit: Measurements are capped at a given distance, with a return value of 4,294,967,295 if no object is detected.
  • Filtered Outputs: Only measurements that differ from their immediate predecessor are sent, reducing the output rate.

Glance over the program and notice that although it is working with two pins – Trigger and Echo – throughout the program we only see pin and pins.

import rp2

@rp2.asm_pio(set_init=rp2.PIO.OUT_LOW)
def distance():
    # X is the last value sent. Initialize it to 
    # u32::MAX which means 'echo timeout'
    # (Set X to u32::MAX by subtracting 1 from 0)
    set(x, 0)
    label("subtraction_trick")
    jmp(x_dec, "subtraction_trick")

    # Read the max echo wait into OSR.
    pull()                           # same as pull(block)

    # Main loop
    wrap_target()

    # Generate 10μs trigger pulse (4 cycles at 343_000Hz)
    set(pins, 0b1)[3]                # Set trigger pin to high, add delay of 3
    set(pins, 0b0)                   # Set trigger pin to low voltage

    # When the trigger goes high, start counting down until it goes low
    wait(1, pin, 0)                 # Wait for echo pin to be high voltage
    mov(y, osr)                     # Load max echo wait into Y

    label("measure_echo_loop")
    jmp(pin, "echo_active")         # if echo voltage is high continue count down
    jmp("measurement_complete")     # if echo voltage is low, measurement is complete
    label("echo_active")
    jmp(y_dec, "measure_echo_loop") # Continue counting down unless timeout

    # Y tells where the echo countdown stopped. It
    # will be u32::MAX if the echo timed out.
    label("measurement_complete")
    jmp(x_not_y, "send_result")     # If measurement is different, then send it.
    jmp("cooldown")                 # If measurement is the same, don't send.
    # Send the measurement
    label("send_result")
    mov(isr, y)                    # Store measurement in ISR
    push()                         # Output ISR
    mov(x, y)                      # Save the measurement in X

    # Cool down period before next measurement
    label("cooldown")
    wait(0, pin, 0)                 # Wait for echo pin to be low
    wrap()                          # Restart the measurement loop

Configuring Pins

To ensure the PIO program behaves as intended:

  • set(pins, 0b1) should control the Trigger pin.
  • wait(1, pin, 0) should monitor the Echo pin.
  • jmp(pin, "echo_active") should also monitor the Echo pin.

Here’s how you can configure this in MicroPython:

ECHO_PIN = 16
TRIGGER_PIN = 17

echo = Pin(ECHO_PIN, Pin.IN)
distance_state_machine = rp2.StateMachine(
    4,  # PIO Block 1, State machine 4
    distance,  # PIO program
    freq=state_machine_frequency,
    in_base=echo,
    set_base=Pin(TRIGGER_PIN, Pin.OUT),
    jmp_pin=echo,
)

The key here is the optional in_base, set_base, and jmp_pin inputs to the StateMachine constructor:

  • in_base: Specifies the starting pin for input operations, such as wait(1, pin, ...).
  • set_base: Configures the starting pin for set operations, like set(pins, 1).
  • jmp_pin: Defines the pin used in conditional jumps, such as jmp(pin, ...).

As described in the table, other optional inputs include:

  • out_base: Sets the starting pin for output operations, such as out(pins, ...).
  • sideset_base: Configures the starting pin for sideset operations, which allow simultaneous pin toggling during other instructions.

Configuring Multiple Pins

Although not required for this program, you can configure a range of pins in PIO using a tuple that specifies the initial states for each pin. Unlike what you might expect, the range is not defined by specifying a base pin and a count (or end). Instead, the tuple determines the pins’ initial values and implicitly sets the range, starting from the set_base pin.

For example, the following PIO decorator configures two pins with initial states of OUT_LOW:

@rp2.asm_pio(set_init=(rp2.PIO.OUT_LOW, rp2.PIO.OUT_LOW))

If set_base is set to pin 17, this tuple designates pin 17 and the next consecutive pin (pin 18) as "set pins." A single instruction can then control both pins:

set(pins, 0b11)[3]  # Sets both trigger pins (17, 18) high, adds delay
set(pins, 0b00)     # Sets both trigger pins low

This approach lets you efficiently apply bit patterns to multiple pins simultaneously, streamlining control for applications involving multiple outputs.

Aside: The Word "Set" in Programming. In programming, the word "set" is notoriously overloaded with multiple meanings. In the context of PIO, "set" refers to something to which you can assign a value – such as a pin’s state. It does not mean a collection of things, as it often does in other programming contexts. When PIO refers to a collection, it usually uses the term "range" instead. This distinction is crucial for avoiding confusion as you work with PIO.

Lessons from Mrs. McCave

In Too Many Daves, Mrs. McCave lamented not giving her 23 Daves more distinct names. You can avoid her mistake by clearly documenting your pins with meaningful names – like Trigger and Echo – in your comments.

But if you think handling these pin ranges is tricky, debugging a PIO program adds an entirely new layer of challenge. In the next Wat, we’ll dive into the kludgy debugging methods available. Let’s see just how far we can push them.

Wat 9: Kludgy Debugging

I like to debug with interactive breakpoints in VS Code. MicroPython does not support that.

The fallback is print debugging, where you insert temporary print statements to see what the code is doing and the values of variables. MicroPython supports this, but PIO does not.

The fallback to the fallback is push-to-print debugging. In PIO, you temporarily output integer values of interest. Then, in MicroPython, you print those values for inspection.

For example, in the following PIO program, we temporarily add instructions to push the value of x for debugging. We also use set, mov, and push to output a constant marker value, such as 7, so we know the program reached that point. (Recall that set constants must be between 0 and 31 inclusive.)

@rp2.asm_pio(set_init=rp2.PIO.OUT_LOW)
def distance():
    # X is the last value sent. Initialize it to
    # u32::MAX which means 'echo timeout'
    # (Set X to u32::MAX by subtracting 1 from 0)
    set(x, 0)
    label("subtraction_trick")
    jmp(x_dec, "subtraction_trick")

    # DEBUG: See the value of X
    mov(isr, x)
    push()

    # Read the max echo wait into OSR.
    pull()  # same as pull(block)

    # DEBUG: Send constant value
    set(y, 7)  # Push '7' so we know we've reached this point
    mov(isr, y)
    push()
    # ...

Back in MicroPython, you can read and print these values to help understand what’s happening in the PIO code:

import rp2
from machine import Pin

from distance_debug_pio import distance

def demo_debug():
    print("Hello, debug!")
    pio1 = rp2.PIO(1)
    pio1.remove_program()
    echo = Pin(16, Pin.IN)
    distance_state_machine = rp2.StateMachine(
        4, distance, freq=343_000, in_base=echo, set_base=Pin(17, Pin.OUT), jmp_pin=echo
    )
    try:
        distance_state_machine.active(1)  # Start the PIO state machine
        distance_state_machine.put(500)
        while True:
            end_loops = distance_state_machine.get()
            print(end_loops)
    except KeyboardInterrupt:
        print("distance demo stopped.")
    finally:
        distance_state_machine.active(0)

demo_debug()

Outputs:

Hello, debug!
4294967295
7

When push-to-print debugging isn’t enough, you can turn to hardware tools. I bought my first oscilloscope (a FNIRSI DSO152, for $37). With it, I was able to confirm the Echo signal was working. The Trigger signal, however, was too fast for this inexpensive oscilloscope to capture clearly.

Using these methods – especially push-to-print debugging – you can trace the flow of your PIO program, even without a traditional debugger.

Aside: In C/C++ (and potentially Rust), you can get closer to a full debugging experience for PIO, for example, by using the piodebug project.

That concludes the nine Wats, but let’s bring everything together in a bonus Wat.

Bonus Wat 10: Putting It All Together

Now that all the components are ready, it’s time to combine them into a working theremin-like musical instrument. We need a MicroPython monitor program. This program starts both PIO state machines – one for measuring distance and the other for generating tones. It then waits for a new distance measurement, maps that distance to a tone, and sends the corresponding tone frequency to the tone-playing state machine. If the distance is out of range, it stops the tone.

MicroPython’s Place: At the heart of this system is a function that maps distances (from 0 to 50 cm) to tones (approximately B2 to F5). This function is simple to write in MicroPython, leveraging Python’s floating-point math and exponential operations. Implementing this in PIO would be virtually impossible due to its limited instruction set and lack of floating-point support.

Here’s the monitor program to run the theremin:

import math

import machine
import rp2
from machine import Pin

from distance_pio import distance
from sound_pio import sound

BUZZER_PIN = 15
ECHO_PIN = 16
TRIGGER_PIN = 17

CM_MAX = 50
CM_PRECISION = 0.1
LOWEST_TONE_FREQUENCY = 123.47  # B2
OCTAVE_COUNT = 2.5  #  to F5

def theremin():
    print("Hello, theremin!")

    pio0 = rp2.PIO(0)
    pio0.remove_program()
    sound_state_machine_frequency = machine.freq()
    sound_state_machine = rp2.StateMachine(0, sound, set_base=Pin(BUZZER_PIN))

    pio1 = rp2.PIO(1)
    pio1.remove_program()
    echo = Pin(ECHO_PIN, Pin.IN)
    distance_state_machine_frequency = int(2 * 34300.0 / CM_PRECISION / 2.0)
    distance_state_machine = rp2.StateMachine(
        4,
        distance,
        freq=distance_state_machine_frequency,
        set_base=Pin(TRIGGER_PIN, Pin.OUT),
        in_base=echo,
        jmp_pin=echo,
    )
    max_loops = int(CM_MAX / CM_PRECISION)

    try:
        sound_state_machine.active(1)
        distance_state_machine.active(1)
        distance_state_machine.put(max_loops)

        while True:
            end_loops = distance_state_machine.get()
            distance_cm = loop_difference_to_distance_cm(max_loops, end_loops)
            if distance_cm is None:
                sound_state_machine.put(0)
            else:
                tone_frequency = distance_to_tone_frequency(distance_cm)
                print(f"Distance: {distance_cm} cm, tone: {tone_frequency} Hz")
                half_period = int(sound_state_machine_frequency / (2 * tone_frequency))
                sound_state_machine.put(half_period)
    except KeyboardInterrupt:
        print("theremin stopped.")
    finally:
        sound_state_machine.active(0)
        distance_state_machine.active(0)

def loop_difference_to_distance_cm(max_loops, end_loops):
    if end_loops == 0xFFFFFFFF:
        return None
    distance_cm = (max_loops - end_loops) * CM_PRECISION
    return distance_cm

def distance_to_tone_frequency(distance):
    return LOWEST_TONE_FREQUENCY * 2.0 ** ((distance / CM_MAX) * OCTAVE_COUNT)

theremin()
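As a quick check of the distance-to-tone mapping (re-declaring the constants so the snippet stands alone), the function sweeps from B2 at 0 cm up through 2.5 octaves to roughly F5 at 50 cm:

```python
LOWEST_TONE_FREQUENCY = 123.47  # B2
OCTAVE_COUNT = 2.5              # up to roughly F5
CM_MAX = 50

def distance_to_tone_frequency(distance):
    # Exponential mapping: each CM_MAX/OCTAVE_COUNT cm doubles the pitch.
    return LOWEST_TONE_FREQUENCY * 2.0 ** ((distance / CM_MAX) * OCTAVE_COUNT)

# At 0 cm the tone is B2; at the far end it lands near F5 (698.46 Hz).
assert abs(distance_to_tone_frequency(0) - 123.47) < 1e-9
assert 698 < distance_to_tone_frequency(50) < 699
```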

Notice how using two PIO state machines and a MicroPython monitor program lets us run three programs at once. This approach combines simplicity with responsiveness, achieving a level of performance that would otherwise be difficult to realize in MicroPython alone.

Now that we’ve assembled all the components, let’s watch the video again of me "playing" the musical instrument. On the monitor screen, you can see the debugging prints displaying the distance measurements and the corresponding tones. This visual connection highlights how the system responds in real time.

Conclusion

PIO programming on the Raspberry Pi Pico is a captivating blend of simplicity and complexity, offering unparalleled hardware control while demanding a shift in mindset for developers accustomed to higher-level programming. Through the nine Wats we’ve explored, PIO has both surprised us with its limitations and impressed us with its raw efficiency.

While we’ve covered significant ground – managing state machines, pin assignments, timing intricacies, and debugging – there’s still much more you can learn as needed: DMA, IRQ, side-set pins, differences between PIO on the Pico 1 and Pico 2, autopush and autopull, FIFO join, and more.

Recommended Resources

At its core, PIO’s quirks reflect a design philosophy that prioritizes low-level hardware control with minimal overhead. By embracing these characteristics, PIO will not only meet your project’s demands but also open doors to new possibilities in Embedded Systems programming.

Please follow Carl on Medium. I write on scientific programming in Rust and Python, machine learning, and statistics. I tend to write about one article per month.

The post Nine Pico PIO Wats with MicroPython (Part 2) appeared first on Towards Data Science.

]]>
Nine Pico PIO Wats with MicroPython (Part 1) https://towardsdatascience.com/nine-pico-pio-wats-with-micropython-part-1-82b80fb84473/ Thu, 23 Jan 2025 12:32:07 +0000 https://towardsdatascience.com/nine-pico-pio-wats-with-micropython-part-1-82b80fb84473/ Raspberry Pi programmable IO pitfalls illustrated with a musical example

The post Nine Pico PIO Wats with MicroPython (Part 1) appeared first on Towards Data Science.

]]>
Pico PIO Surprises - Source: https://openai.com/dall-e-2/. All other figures from the author.

Also available: A Rust version of this article

In JavaScript and other languages, we call a surprising or inconsistent behavior a "Wat!" [that is, a "What!?"]. For example, in JavaScript, an empty array plus an empty array produces an empty string, [] + [] === "". Wat!

Python, by comparison, is more consistent and predictable. However, one corner of MicroPython on the Raspberry Pi Pico microcontroller offers similar surprises. Specifically, the Pico’s Programmable Input/Output (PIO) subsystem, while incredibly powerful and versatile, comes with peculiarities.

PIO Programming matters because it provides an ingenious solution to the challenge of precise, low-level hardware control. It is incredibly fast – even when programmed from MicroPython, PIO performs just as efficiently as it does from Rust or C/C++. Its flexibility is unparalleled: rather than relying on special-purpose hardware for the countless peripherals you might want to control, PIO allows you to define custom behaviors in software, seamlessly adapting to your needs without adding hardware complexity.

Consider this simple example: a $15 theremin-like musical instrument. By waving their hand in the air, the musician changes the pitch of (admittedly annoying) tones. Using PIO provides a simple way to program this device that ensures it reacts instantly to movement.

So, all is wonderful, except – to paraphrase Spider-Man:

With great power comes… nine Wats!?

We’ll explore and illustrate those nine PIO Wats through the creation of this theremin.

Who Is This Article For?

  • All Programmers: Microcontrollers like the Pico cost under $7 and support high-level languages like Python, Rust, and C/C++. This article will show how microcontrollers let your programs interact with the physical world and introduce you to programming the Pico’s low-level, high-performance PIO hardware.
  • MicroPython Pico Programmers: Curious about the Pico’s hidden potential? Beyond its two main cores, it has eight tiny "state machines" dedicated to PIO programming. These state machines take over time-critical tasks, freeing up the main processors for other work and enabling surprising parallelism.
  • C/C++ Pico Programmers: While this article uses MicroPython, PIO programming is – for good and bad – nearly identical across all languages. If you understand it here, you’ll be well-equipped to apply it in C/C++.
  • Rust Pico Programmers: You may wish to read the Rust version of this article.
  • PIO Programmers: The journey through nine Wats may not be as entertaining as JavaScript’s quirks (thankfully), but it will shed light on the peculiarities of PIO programming. If you’ve ever found PIO programming confusing, this article should reassure you that the problem isn’t (necessarily) you – it’s partly PIO itself. Most importantly, understanding these Wats will make writing PIO code simpler and more effective.

Finally, this article isn’t about "fixing" PIO programming. PIO excels at its primary purpose: efficiently and flexibly handling custom peripheral interfaces. Its design is purposeful and well-suited to its goals. Instead, this article focuses on understanding PIO programming and its quirks – starting with a bonus Wat.

Bonus Wat 0: "State Machines" Are Not State Machines – Except When They Are

Despite their name, the eight "PIO state machines" in the Raspberry Pi Pico are not state machines in the strict, formal sense of computer science. A classical finite state machine (FSM) consists of a finite set of states and transitions, driven purely by input symbols. In contrast, the Pico’s PIO state machines are tiny programmable processors with their own instruction set. Unlike a formal FSM, they have a program counter, registers, and the ability to modify those registers through shifting and decrementing, allowing them to store intermediate values and influence execution flow beyond fixed state transitions.

That said, calling them programmable state machines makes sense: rather than having a fixed set of predefined states, their behavior depends on the program they execute. And in most cases, that program will define a state machine or something similar – especially since PIO is often used to implement precise, state-driven I/O control.

Each PIO processes one instruction per clock cycle. The $4 Pico 1 runs at 125 million cycles per second, while the $5 Pico 2 offers a faster 150 million cycles per second. Each instruction performs a simple operation, such as "move a value" or "jump to a label".
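As a quick back-of-the-envelope check of those clock speeds (plain Python, values from the paragraph above):

```python
# One PIO instruction executes per clock cycle, so instruction time = 1 / clock.
PICO1_HZ = 125_000_000  # Pico 1 clock
PICO2_HZ = 150_000_000  # Pico 2 clock

def ns_per_instruction(clock_hz):
    """Duration of a single PIO instruction, in nanoseconds."""
    return 1e9 / clock_hz

assert ns_per_instruction(PICO1_HZ) == 8.0              # 8 ns per instruction on a Pico 1
assert abs(ns_per_instruction(PICO2_HZ) - 20 / 3) < 1e-9  # ~6.7 ns on a Pico 2
```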

With that bonus Wat out of the way, let’s move to our first main Wat.

Wat 1: The Register Hunger Games

In PIO programming, a register is a small, fast storage location that acts like a variable for the state machine. You might dream of an abundance of variables to hold your counters, delays, and temporary values, but the reality is brutal: you only get two general-purpose registers, x and y. It’s like The Hunger Games, where no matter how many tributes enter the arena, only Katniss and Peeta emerge as victors. You’re forced to winnow down your needs to fit within these two registers, ruthlessly deciding what to prioritize and what to sacrifice. Also, like the Hunger Games, we can sometimes bend the rules.

Let’s start with a challenge: create a backup beeper – 1000 Hz for ½ second, silence for ½ second, repeat. The result? "Beep Beep Beep…"

We would like five variables:

  • half_period: The number of clock cycles to hold the voltage high and then low to create a 1000 Hz tone. This is 150,000,000 / 1000 / 2 = 75,000 cycles high and 75,000 cycles low.
Voltage and timing (millisecond and clock cycles) to generate a 1000 Hz square tone.
  • y: Loop counter from 0 to half_period to create a delay.
  • period_count: The number of repeated periods needed to fill ½ second of time. 150,000,000 × 0.5 / (75,000 × 2) = 500.
  • x: Loop counter from 0 to period_count to fill ½ second of time.
  • silence_cycles: The number of clock cycles for ½ second of silence. 150,000,000 × 0.5 = 75,000,000.

We want five registers but can only have two, so let the games begin! May the odds be ever in your favor.

First, we can eliminate silence_cycles because it can be derived as half_period × period_count × 2. While PIO doesn’t support multiplication, it does support loops. By nesting two loops—where the inner loop delays for 2 clock cycles—we can create a delay of 75,000,000 clock cycles.
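We can sanity-check that arithmetic in plain Python (assuming the Pico 2's 150 MHz clock used above):

```python
CLOCK_HZ = 150_000_000                              # Pico 2 clock from the example above
half_period = CLOCK_HZ // 1000 // 2                 # 75,000 cycles per half period at 1000 Hz
period_count = CLOCK_HZ // 2 // (half_period * 2)   # 500 periods fill half a second

# silence_cycles is derivable, so it never needs its own register:
silence_cycles = half_period * period_count * 2
assert (half_period, period_count) == (75_000, 500)
assert silence_cycles == 75_000_000
assert silence_cycles / CLOCK_HZ == 0.5             # exactly half a second of silence
```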

One variable down, but how can we eliminate two more? Fortunately, we don’t have to. While PIO only provides two general-purpose registers, x and y, it also includes two special-purpose registers: osr (output shift register) and isr (input shift register).

The PIO code that we’ll see in a moment implements the backup beeper. Here’s how it works:

Initialization:

  • The pull(block) instruction reads the half period of the tone (75,000 clock cycles) from a buffer and places the value into osr.
  • The value is then copied to isr for later use.
  • The second pull(block) reads the period count (500 repeats) from a buffer and places the value in osr, where we leave it.

Beep Loops:

  • The mov(x, osr) instruction copies the period count into the x register, which serves as the outer loop counter.
  • For the inner loops, mov(y, isr) repeatedly copies the half period into y to create delays for the high and low states of the tone.

Silence Loops:

  • The silence loops mirror the structure of the beep loops but don’t set any pins, so they act solely as a delay.

Wrap and Continuous Execution:

  • The wrap_target() and wrap() instructions define the main loop of the state machine.
  • After finishing both the beep and silence loops, the state machine jumps back near the start of the program, repeating the sequence indefinitely.

With this outline in mind, here’s the PIO assembly code for generating the backup beeper signal.

import rp2

@rp2.asm_pio(set_init=rp2.PIO.OUT_LOW)
def back_up():
    pull(block)                     # Read the half period of the beep sound.
    mov(isr, osr)                   # Store the half period in ISR.
    pull(block)                     # Read the period_count.

    wrap_target()                   # Start of the main loop.

    # Generate the beep sound.
    mov(x, osr)                     # Load period_count into X.
    label("beep_loop")

    set(pins, 1)                    # Set the buzzer to high voltage (start the tone).
    mov(y, isr)                     # Load the half period into Y.
    label("beep_high_delay")
    jmp(y_dec, "beep_high_delay")   # Delay for the half period.
    set(pins, 0)                    # Set the buzzer to low voltage (end the tone).
    mov(y, isr)                     # Load the half period into Y.
    label("beep_low_delay")
    jmp(y_dec, "beep_low_delay")    # Delay for the low duration.

    jmp(x_dec, "beep_loop")         # Repeat the beep loop.

    # Silence between beeps.
    mov(x, osr)                     # Load the period count into X for outer loop.
    label("silence_loop")

    mov(y, isr)                     # Load the half period into Y for inner loop.
    label("silence_delay")
    jmp(y_dec, "silence_delay")[1]  # Delay for two clock cycles (jmp + 1 extra)

    jmp(x_dec, "silence_loop")      # Repeat the silence loop.

    wrap()                          # End of the main loop, jumps back to wrap_target for continuous execution.

And here’s the MicroPython code to configure and run the PIO program for the backup beeper. This script initializes the state machine, calculates timing values (half_period and period_count), and sends them to the PIO. It plays the beeping sequence for 5 seconds and stops. If connected to a desktop machine via USB, you can stop it early with Ctrl-C.

import machine
from machine import Pin
import time

BUZZER_PIN = 15

def demo_back_up():
    print("Hello, back_up!")
    pio0 = rp2.PIO(0)
    pio0.remove_program()
    state_machine_frequency = machine.freq()
    back_up_state_machine = rp2.StateMachine(0, back_up, set_base=Pin(BUZZER_PIN))

    try:
        back_up_state_machine.active(1)
        half_period = int(state_machine_frequency / 1000 / 2)
        period_count = int(state_machine_frequency * 0.5 / (half_period * 2))
        print(f"half_period: {half_period}, period_count: {period_count}")
        back_up_state_machine.put(half_period)
        back_up_state_machine.put(period_count)
        time.sleep_ms(5_000)
    except KeyboardInterrupt:
        print("back_up demo stopped.")
    finally:
        back_up_state_machine.active(0)

demo_back_up()

Here’s what happens when you run the program:

Aside 1: Running this yourself The most popular Integrated Development Environment (IDE) for programming the Raspberry Pi Pico with MicroPython is Thonny. Personally, I use the PyMakr extension for VS Code, though the MicroPico extension is another popular choice. To hear sound, I connected a passive buzzer, a resistor, and a transistor to the Pico. For detailed wiring diagrams and a parts list, check out the passive buzzer instructions in the SunFounder’s Kepler Kit.

Aside 2: If your only goal is to generate tones with the Pico, PIO isn’t necessary. MicroPython is fast enough to toggle pins directly, or you can use the Pico’s built-in pulse width modulation (PWM) feature.

Alternative Endings to the Register Hunger Games

We used four registers – two general and two special – to resolve the challenge. If this solution feels less than satisfying, here are alternative approaches to consider:

Use Constants: Why make half_period, period_count, and silence_cycles variables at all? Hardcoding the constants "75,000," "500," and "75,000,000" could simplify the design. However, PIO constants have limitations, which we’ll explore in Wat 5.

Pack Bits: Registers hold 32 bits. Do we really need two registers (2×32=64 bits) to store half_period and period_count? No. Storing 75,000 only requires 17 bits, and 500 requires 9 bits. We could pack these into a single register and use the out instruction to shift values into x and y. This approach would free up either osr or isr for other tasks, but only one at a time—the other register must hold the packed value.
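Here's a host-side sketch of that packing idea (plain Python; on the PIO side, out() would shift the fields back out, and the exact field order depends on the configured shift direction):

```python
half_period = 75_000     # fits in 17 bits
period_count = 500       # fits in 9 bits
assert half_period < (1 << 17) and period_count < (1 << 9)

# Pack both fields into a single 32-bit word before sending it with put().
packed = (period_count << 17) | half_period
assert packed < (1 << 32)

# Mirror of the PIO-side unpacking, done in plain Python:
unpacked_half_period = packed & ((1 << 17) - 1)
unpacked_period_count = (packed >> 17) & ((1 << 9) - 1)
assert (unpacked_half_period, unpacked_period_count) == (75_000, 500)
```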

Slow Motion: In MicroPython, you can configure a PIO state machine to run at a slower frequency by simply specifying your desired clock speed. MicroPython takes care of the details for you, allowing the state machine to run as slow as ~2290 Hz. Running the state machine at a slower speed means that values like half_period can be smaller, potentially as small as 2. Small values are easier to hardcode as constants and more compactly bit-packed into registers.
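For instance, here's a sketch of how a slower state-machine clock shrinks the constants we must store (the 4,000 Hz figure is my own illustration; in MicroPython you would pass it via the freq= argument of rp2.StateMachine):

```python
def half_period_cycles(clock_hz, tone_hz):
    """Cycles the state machine must count for each half period of the tone."""
    return clock_hz // tone_hz // 2

assert half_period_cycles(150_000_000, 1000) == 75_000  # full-speed Pico 2: needs 17 bits
assert half_period_cycles(4_000, 1000) == 2             # slowed clock: a tiny constant
```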

A Happy Ending to the Register Hunger Games

The Register Hunger Games demanded strategic sacrifices and creative workarounds, but we emerged victorious by leveraging PIO’s special registers and clever looping structures. If the stakes had been higher, alternative techniques could have helped us adapt and survive.

But victory in one arena doesn’t mean the challenges are over. In the next Wat, we face a new trial: PIO’s strict 32-instruction limit.

Wat 2: The 32-Instruction Carry-On Suitcase

Congratulations! You’ve purchased a trip around the world for just $4. The catch? All your belongings must fit into a tiny carry-on suitcase. Likewise, PIO programs allow you to create incredible functionality, but every PIO program is limited to just 32 instructions.

Wat! Only 32 instructions? That’s not much space to pack everything you need! But with clever planning, you can usually make it work.

The Rules

  • No PIO program can be longer than 32 instructions.
  • The wrap_target, label, and wrap instructions do not count.
  • A Pico 1 includes eight state machines, organized into two blocks of four. A Pico 2 includes twelve state machines, organized into three blocks of four. Each block shares 32 instruction slots. So, because all four state machines in a block draw from the same 32-instruction pool, if one machine’s program uses all 32 slots, there’s no space left for the other three.

Avoiding Turbulence

For a smooth flight, use MicroPython code to clean out any previous programs from the block before loading new ones. Here’s how to do it:

pio0 = rp2.PIO(0) # Block 0, containing state machines 0,1,2,3
pio0.remove_program() # Remove all programs from block 0
back_up_state_machine = rp2.StateMachine(0, back_up, set_base=Pin(BUZZER_PIN))

This ensures your instruction memory is fresh and ready for takeoff. Clearing the blocks is especially important when using the PyMakr extension’s "development mode."

When Your Suitcase Won’t Close

If your idea doesn’t fit in the PIO instruction slots, these packing tricks may help. (Disclaimer: I haven’t tried all of these myself.)

  • Swap PIO Programs on the Fly: Instead of trying to cram everything into one program, consider swapping out programs mid-flight. Load only what you need, when you need it.

  • Share Programs Across State Machines: Multiple state machines can run the same program at the same time. Each state machine can make the shared program behave differently based on an input value.

  • Use MicroPython’s exec Command: Save space by offloading instructions to MicroPython. For example, you can execute initialization steps directly from a string:

back_up_state_machine.exec("pull(block)")
  • Use PIO’s exec Commands: Inside your state machine, you can execute instruction values stored in osr with out(exec) or use mov(exec, x) or mov(exec, y) for registers.
  • Offload to the Main Processors: If all else fails, move more of your program to the Pico’s larger dual processors – think of this as shipping your extra baggage to your destination separately. The Pico SDK (section 3.1.4) calls this "bit banging".

With your bags now packed, let’s join Dr. Dolittle’s search for a fabled creature.

Bonus Wat 2.5: Dr. Dolittle’s PIO Pushmi-Pullyu

Two readers pointed out an important PIO Wat that I missed – so here’s a bonus! When programming PIO, you’ll notice something peculiar:

  • The PIO pull instruction receives values from TX FIFO (transmit buffer) and inputs them into the output shift register (osr). So, it inputs into output and transmits from receive.
  • Likewise, the PIO push instruction outputs values from the input shift register (isr) and transmits them to the RX FIFO (receive buffer). So, it outputs from input and receives from transmit.

Wat!? Like the two-headed Pushmi-Pullyu from the Dr. Dolittle stories, something seems backwards. But it starts to make sense when you realize PIO names most things from the host’s perspective (MicroPython, Rust, C/C++), not the point of view of the PIO program.

This table summarizes the instructions, registers, and buffer names. ("FIFO" stands for first-in-first-out.)

With the Pushmi-Pullyu in hand, we next move to the scene of a mystery.

Wat 3: The pull(noblock) Mystery

In Wat 1, we programmed our audio hardware as a backup beeper. But that’s not what we need for our musical instrument. Instead, we want a PIO program that plays a given tone indefinitely – until it’s told to play a new one. The program should also wait silently when given a special "rest" tone.

Resting until a new tone is provided is easy to program with pull(block)—we’ll explore the details below. Playing a tone at a specific frequency is also straightforward, building on the work we did in Wat 1.

But how can we check for a new tone while continuing to play the current one? The answer lies in using "noblock" instead of "block" in pull(noblock). Now, if there’s a new value, it will be loaded into osr, allowing the program to update seamlessly.

Here’s where the mystery begins: what happens to osr if pull(noblock) is called and there’s no new value?

I assumed it would keep its previous value. Wrong! Maybe it gets reset to 0? Wrong again! The surprising truth: it gets the value of x. Why? (No, not y.) Because the Pico SDK says so. Specifically, section 3.4.9.2 explains:

A nonblocking PULL on an empty FIFO has the same effect as MOV OSR, X.
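A toy Python model of that rule (my own sketch, not PIO code) makes the surprise concrete:

```python
def pull_noblock(tx_fifo, osr, x):
    """Model of pull(noblock): return the new osr value.

    With a value waiting in the TX FIFO, osr receives it. With an empty
    FIFO, osr receives x - not its old value, and not zero.
    """
    if tx_fifo:
        return tx_fifo.pop(0)
    return x

assert pull_noblock([42], osr=7, x=99) == 42  # new value available: osr gets it
assert pull_noblock([], osr=7, x=99) == 99    # empty FIFO: osr becomes x, not 7
```

This is also why a PIO program can copy osr into x immediately before pull(noblock): it makes "no new value" behave like "keep the current value."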

Knowing how pull(noblock) works is important, but there’s a bigger lesson here. Treat the Pico SDK documentation like the back of a mystery novel. Don’t try to solve everything on your own—cheat! Skip to the "who done it" section, and in section 3.4, read the fine details for each command you use. Reading just a few paragraphs can save you hours of confusion.

Aside: When even the SDK documentation feels unclear, turn to the RP2040 (Pico 1) and RP2350 (Pico 2) datasheets. These encyclopedias – 600 and 1,300 pages respectively – are like omnipotent narrators: they provide the ground truth.

With this in mind, let’s look at a practical example. Below is the PIO program for playing tones and rests continuously. It uses pull(block) to wait for input during a rest and pull(noblock) to check for updates while playing a tone.

import rp2

@rp2.asm_pio(set_init=rp2.PIO.OUT_LOW)
def sound():
    # Rest until a new tone is received.
    label("resting")
    pull(block)                     # Wait for a new delay value, keep it in osr.
    mov(x, osr)                     # Copy the delay into X.
    jmp(not_x, "resting")           # If new delay is zero, keep resting.

    # Play the tone until a new delay is received.
    wrap_target()                   # Start of the main loop.

    set(pins, 1)                    # Set the buzzer to high voltage.
    label("high_voltage_loop")
    jmp(x_dec, "high_voltage_loop") # Delay

    set(pins, 0)                    # Set the buzzer to low voltage.
    mov(x, osr)                     # Load the half period into X.
    label("low_voltage_loop")
    jmp(x_dec, "low_voltage_loop")  # Delay

    # Read any new delay value. If none, keep the current delay.
    mov(x, osr)                     # set x, the default value for "pull(noblock)"
    pull(noblock)                   # Read a new delay value or use the default.

    # If the new delay is zero, rest. Otherwise, continue playing the tone.
    mov(x, osr)                     # Copy the delay into X.
    jmp(not_x, "resting")           # If X is zero, rest.
    wrap()                          # Continue playing the sound.

We’ll eventually use this PIO program in our theremin-like musical instrument. For now, let’s see the PIO program in action by playing a familiar melody. This demo uses "Twinkle, Twinkle, Little Star" to show how you can control a melody by feeding frequencies and durations to the state machine. With just a few lines of code, you can make the Pico sing!

import rp2
import machine
from machine import Pin
import time

from sound_pio import sound

BUZZER_PIN = 15

twinkle_twinkle = [
    # Bar 1
    (262, 400, "Twin-"),  # C
    (262, 400, "-kle"),  # C
    (392, 400, "twin-"),  # G
    (392, 400, "-kle"),  # G
    (440, 400, "lit-"),  # A
    (440, 400, "-tle"),  # A
    (392, 800, "star"),  # G
    (0, 400, ""),  # rest
    # Bar 2
    (349, 400, "How"),  # F
    (349, 400, "I"),  # F
    (330, 400, "won-"),  # E
    (330, 400, "-der"),  # E
    (294, 400, "what"),  # D
    (294, 400, "you"),  # D
    (262, 800, "are"),  # C
    (0, 400, ""),  # rest
]

def demo_sound():
    print("Hello, sound!")
    pio0 = rp2.PIO(0)
    pio0.remove_program()
    state_machine_frequency = machine.freq()
    sound_state_machine = rp2.StateMachine(0, sound, set_base=Pin(BUZZER_PIN))

    try:
        sound_state_machine.active(1)
        for frequency, ms, lyrics in twinkle_twinkle:
            if frequency > 0:
                half_period = int(state_machine_frequency / frequency / 2)
                print(f"'{lyrics}' -- Frequency: {frequency}")
                # Send the half period to the PIO state machine
                sound_state_machine.put(half_period)
                time.sleep_ms(ms)  # Wait as the tone plays
                sound_state_machine.put(0)  # Stop the tone
                time.sleep_ms(50)  # Give a short pause between notes
            else:
                sound_state_machine.put(0)  # Play a silent rest
                time.sleep_ms(ms + 50)  # Wait for the rest duration + a short pause
    except KeyboardInterrupt:
        print("Sound demo stopped.")
    finally:
        sound_state_machine.active(0)

demo_sound()

Here’s what happens when you run the program:

We’ve solved one mystery, but there’s always another challenge lurking around the corner. In Wat 4, we’ll explore what happens when your smart hardware comes with a catch – it’s also very cheap.

Wat 4: Smart, Cheap Hardware: An Emotional Roller Coaster

With sound working, we turn next to measuring the distance to the musician’s hand using the HC-SR04+ ultrasonic range finder. This small but powerful device is available for less than two dollars.

HC-SR04+ Range Finder (Pen added for scale.)

This little peripheral took me on an emotional roller coaster of "Wats!?":

  • Up: Amazingly, this $2 range finder includes its own microcontroller, making it smarter and easier to use.
  • Down: Frustratingly, that same "smart" behavior is unintuitive.
  • Up: Conveniently, the Pico can supply peripherals with either 3.3V or 5V power.
  • Down: Unpredictably, many range finders are unreliable – or fail outright – at 3.3V, and they can damage your Pico at 5V.
  • Up: Thankfully, both damaged range finders and Picos are inexpensive to replace, and a dual-voltage version of the range finder solved my problems.

Details

I initially assumed the range finder would set the Echo pin high when the echo returned. I was wrong.

Instead, the range finder emits a pattern of 8 ultrasonic pulses at 40 kHz (think of it as a backup beeper for dogs). Immediately after, it sets Echo high. The Pico should then start measuring the time until Echo goes low, which signals that the sensor detected the pattern – or that it timed out.
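Once the Echo pulse has been timed (for example, with MicroPython's machine.time_pulse_us), converting its width to a distance is simple arithmetic. A sketch (the 343 m/s speed of sound assumes room temperature):

```python
SPEED_OF_SOUND_CM_PER_US = 0.0343  # ~343 m/s at room temperature

def echo_us_to_cm(duration_us):
    # The Echo pulse spans the round trip to the target and back, so halve it.
    return duration_us * SPEED_OF_SOUND_CM_PER_US / 2

assert round(echo_us_to_cm(583), 1) == 10.0  # ~583 us of Echo is roughly 10 cm
```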

As for voltage, the documentation specifies the range finder operates at 5V. It seemed to work at 3.3V – until it didn’t. Around the same time, my Pico stopped connecting to MicroPython IDEs, which rely on a special USB protocol.

So, at this point both the Pico and the range finder were damaged.

After experimenting with various cables, USB drivers, programming languages, and even an older 5V-only range finder, I finally resolved the issue with:

  • A new Pico microcontroller that I already had on hand. (It was a Pico 2, but I don’t think the model matters.)
  • A new dual-voltage 3.3/5V range finder, still just $2 per piece.

Wat 4: Lessons Learned

As the roller coaster returned to the station, I learned two key lessons. First, thanks to microcontrollers, even simple hardware can behave in non-intuitive ways that require careful reading of the documentation. Second, while this hardware is clever, it’s also inexpensive – and that means it is prone to failure. When it fails, take a deep breath, remember it’s only a few dollars, and replace it.

Hardware quirks, however, are only part of the story. In Wat 5, in Part 2, we’ll shift our focus back to software: the PIO programming language itself. We’ll uncover a behavior so unexpected, it might leave you questioning everything you thought you knew about constants.


Those are the first four Wats from programming the Pico PIO with MicroPython. You can find the code for the project on GitHub.

In Part 2, we explore Wats 5 through 9. These cover inconstant constants, conditions through the looking glass, overshooting jumps, too many pins, and kludgy debugging. We’ll also unveil the code for the finished musical instrument.

Please follow Carl on Medium. I write on scientific programming in Rust and Python, machine learning, and statistics. I tend to write about one article per month.

The post Nine Pico PIO Wats with MicroPython (Part 1) appeared first on Towards Data Science.

]]>
Explore Solvable and Unsolvable Equations with Python https://towardsdatascience.com/explore-solvable-and-unsolvable-equations-with-python-661ac11f4f20/ Tue, 29 Oct 2024 18:20:56 +0000 https://towardsdatascience.com/explore-solvable-and-unsolvable-equations-with-python-661ac11f4f20/ Find Closed-Form Solutions When Possible - Use Numerical Methods When Necessary

The post Explore Solvable and Unsolvable Equations with Python appeared first on Towards Data Science.

]]>
Find closed-form solutions when possible – use numerical methods when necessary
A Python Refereeing an Italian Renaissance Mathematics Duel – Source: https://openai.com/dall-e-2/. All other figures from the author.

Why can we solve some equations easily, while others seem impossible? And another thing: why is this knowledge hidden from us?

As data scientists, applied scientists, and engineers, we often create mathematical models. For example, consider the model: y = x². Given a value for x, we can apply it forward to compute y. For instance, if x = 3, then y = 9.

We can also apply this model backward. Starting with y = x², we rearrange to solve for x: x = ±√y. If y = 9, then x = ±3. The expression x = ±√y is an example of a closed-form solution – an expression that uses a finite combination of standard operations and functions.
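SymPy, the computer algebra system we'll lean on throughout this article, confirms that closed form:

```python
import sympy as sym
from sympy.abc import x, y

# Solve y = x**2 for x; SymPy returns the closed form x = ±sqrt(y).
solutions = sym.solve(sym.Eq(y, x**2), x)
assert set(solutions) == {sym.sqrt(y), -sym.sqrt(y)}
```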

However, not all models are so straightforward. Sometimes, we encounter equations where we can’t simply "solve for x" and get a closed-form expression. In such cases, we might hear, "That’s not solvable – you need numerical methods." Numerical methods are powerful. They can provide precise approximations. Still, it frustrates me (and perhaps you) that no one ever seems to explain when closed-form solutions are possible and when they aren’t.

The great Johannes Kepler shared our frustration. When studying planetary motion, he created this model:

  • y = x − c sin(x)

This equation converts a body’s position along its orbit (x) into its time along the orbit (y). Kepler sought a closed-form solution for x to turn time into a position. However, even 400 years later, the best we have are numerical methods.

In this article, we’ll build intuition about when to expect a closed-form solution. The only way to determine this rigorously is by using advanced Mathematics – such as Galois theory, transcendental number theory, and algebraic geometry. These topics go far beyond what we, as applied scientists and engineers, typically learn in our training.

Instead of diving into these advanced fields, we’ll cheat. Using SymPy, a Python-based computer algebra system, we’ll explore different classes of equations to see which it can solve with a closed-form expression. For completeness, we’ll also apply numerical methods.

We’ll explore equations that combine polynomials, exponentials, logarithms, and trigonometric functions. Along the way, we’ll discover specific combinations that often resist closed-form solutions. We’ll see that if you want to create an equation with (or without) a closed-form solution, you should avoid (or try) the following:

  • Fifth degree and higher polynomials
  • Mixing x with exp(x) or log(x) – if Lambert’s W function is off-limits
  • Combining exp(x) and log(x) within the same equation
  • Some pairs of trigonometric functions with commensurate frequencies
  • Many pairs of trigonometric functions with non-commensurate frequencies
  • Mixing trigonometric functions with x, exp(x), or log(x)

Aside 1: I’m not a mathematician, and my SymPy scripts are not higher mathematics. If you find any mistakes or overlooked resources, forgive my oversight. Please share them with me, and I’ll gladly add a note.

Aside 2: Welch Lab’s recent video, Kepler’s Impossible Equation, reminded me of my frustration about not knowing when an equation can be solved in a closed form. The video sparked the investigation that follows and provides our first example.

Kepler’s Equation

Imagine you are Johannes Kepler’s research programmer. He has created the following model of orbital motion:

y = x − c sin(x)

where:

  • x is the body’s position along its orbit. We measure this position as an angle (in radians). The angle starts at 0 radians when the body is closest to the Sun. When the body has covered ¼ of its orbit’s distance, the angle is π/2 radians (90°). When it has covered half of its orbit’s distance, the angle is π (180°), and so on. Recall that radians measure angles from 0 to 2π rather than from 0 to 360°.
  • c is the orbit’s eccentricity, ranging from 0 (a perfect circle) to just under 1 (a highly elongated ellipse). Suppose Kepler has observed a comet with c = 0.967.
  • y is the body’s time along its orbit. We measure this time as an angle (in radians). For instance, if the comet has an orbital period of 76 Earth years, then π/2 (90°) corresponds to ¼ of 76 years, or 19 years. A time of π (180°) corresponds to ½ of 76 years, or 38 years. A time of 2π (360°) is the full 76-year orbital period.

This diagram shows the comet’s position at π/2 radians (90°), which is ¼ of the way along its orbit:

Kepler asks for the time when the comet reaches position π/2 radians (90°). You create and run this Python code:

import numpy as np

def kepler_equation(x):
    return x - c * np.sin(x)

c = 0.967
position_radians = np.pi / 2 # aka 90 degrees
time_radians = kepler_equation(position_radians)
orbital_period_earth_years = 76

t_earth_years = (time_radians / (2 * np.pi)) * orbital_period_earth_years
print(f"It takes approximately {t_earth_years:.2f} Earth years for the comet to move from 0 to π/2 radians.")

You report back to Kepler:

It takes approximately 7.30 Earth years for the comet to move from 0 to π/2 radians.

Aside: The comet covers 25% of its orbit distance in under 10% of its orbital period because it speeds up when closer to the Sun.

No good deed goes unpunished. Kepler, fascinated by the result, assigns you a new task: "Can you tell me how far along its orbit the comet is after 20 Earth years? I want to know the position in radians."

"No problem," you think. "I’ll just use a bit of high school algebra."

First, you convert 20 Earth years into radians:

  • time_radians = (20 / 76) × 2π = (10 / 19)π

Next, you rearrange Kepler’s equation, setting it equal to 0.

  • x − 0.967 sin(x) − (10 / 19)π = 0

Now you want to find the value of x that makes this equation true. You decide to graph the equation to see where it crosses zero:

import numpy as np
import matplotlib.pyplot as plt

c = 0.967
time_earth_years = 20
orbital_period_earth_years = 76
time_radians = (time_earth_years / orbital_period_earth_years) * 2 * np.pi

def function_to_plot(x):
    return x - c * np.sin(x) - time_radians

x_vals = np.linspace(0, 2 * np.pi, 1000)
function_values = function_to_plot(x_vals)
plt.figure(figsize=(10, 6))
plt.axhline(0, color='black', linestyle='--') # dashed horizontal line at y=0
plt.xlabel("Position (radians)")
plt.ylabel("Function Value")
plt.title("Graph of x - c sin(x) - y to Find the Root")
plt.grid(True)

plt.plot(x_vals, function_values)
plt.show()

So far, so good. The graph shows that a solution for x exists. But when you try to rearrange the equation to solve for x using algebra, you hit a wall. How do you isolate x when you have a combination of x and sin(x)?

"That’s okay," you think. "We’ve got Python, and Python has the SymPy package," a powerful and free computer algebra system.

You pose the problem to SymPy:

# Warning: This code will fail.
import sympy as sym
from sympy import pi, sin
from sympy.abc import x

c = 0.967
time_earth_years = 20
orbital_period_earth_years = 76

time_radians = (time_earth_years / orbital_period_earth_years) * 2 * pi
equation = x - c * sin(x) - time_radians

solution = sym.solve(equation, x)
#^^^^^^^^^^^^^error^^^^^^^^^^^^^^
print(solution)

Unfortunately, it replies with an error:

NotImplementedError: multiple generators [x, sin(x)]
No algorithms are implemented to solve equation x - 967*sin(x)/1000 - 10*pi/19

SymPy is quite good at solving equations, but not all equations can be solved in what’s called closed form – a solution expressed as a finite combination of elementary operations and functions: addition, multiplication, roots, exponentials, logarithms, and trigonometric functions. When we combine a term such as x with a trigonometric term like sin(x), isolating x can become fundamentally impossible. In other words, such mixed equations often lack a closed-form solution.

That’s okay. From the graph, we know a solution exists. SymPy can get us arbitrarily close to that solution using numerical methods. We use SymPy’s nsolve():

import sympy as sym
from sympy import pi, sin
from sympy.abc import x

c = 0.967
time_earth_years = 20
orbital_period_earth_years = 76
time_radians = (time_earth_years / orbital_period_earth_years) * 2 * pi
equation = x - c * sin(x) - time_radians

initial_guess = 1.0   # Initial guess for the numerical solver
position_radians = sym.nsolve(equation, x, initial_guess)
print(f"After {time_earth_years} Earth years, the comet will travel {position_radians:.4f} radians ({position_radians * 180 / pi:.2f}°) along its orbit.")

Which reports:

After 20 Earth years, the comet will travel 2.3449 radians (134.35°) along its orbit.

We can summarize the results in a table:

Are we sure there is not a closed-form solution? We add a question mark to our "No" answer. This reminds us that SymPy’s failure is not a mathematical proof that no closed-form solution exists. We label the last column "A Numeric" to remind ourselves that it represents one numerical solution. There could be more.


In this section, we explored Kepler’s equation and discovered the challenge of solving it in closed form. Python’s SymPy package confirmed our struggle, and in the end, we had to rely on a numerical solution.

This gives us one example of an equation with no apparent closed-form solution. But is this typical? Are there classes of equations where we can always – or never – find a closed-form solution? Let’s dig deeper by exploring another kind of equation: polynomials.

Polynomials

Polynomial equations such as _x_² − x − 1 = 0 are the reliable hammer of mathematical modeling – straightforward but powerful. We all learn how to solve degree-two polynomials (those with _x_², "quadratic") in school.

500 years ago, during the Renaissance in Italy, solving polynomials of higher degrees became a form of public entertainment. Mathematicians like Tartaglia and Cardano competed for glory and recognition in public math duels. These contests led to solutions for degree-three (cubic) and degree-four (quartic) polynomials. But what about degree five?

Let’s use SymPy to investigate a sample of polynomials:

For polynomials up to degree four, we can always find closed-form elementary solutions. Specifically, these solutions require only a finite expression of basic arithmetic operations and roots (such as square roots or cube roots).
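
As a quick sanity check (a sketch I added; the golden-ratio polynomial is my example, not one from the original tables), SymPy returns radical solutions for a degree-two polynomial:

```python
import sympy as sym
from sympy.abc import x

# Degree-two example: x**2 - x - 1 = 0 solves in radicals
# (the golden ratio and its conjugate).
solution = sym.solve(x**2 - x - 1, x)
print(solution)  # [1/2 - sqrt(5)/2, 1/2 + sqrt(5)/2]
```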

The number of solutions will never exceed the degree of the polynomial. However, some solutions may involve i, the square root of −1, which represents complex numbers. More on that in a moment.

And what about degree-five polynomials and beyond? Can we always find closed-form solutions? The answer is mixed. Sometimes, we can. When a closed-form solution exists – for example, for _x_⁵+1=0 above – SymPy typically finds it.

However, in other cases, such as with _x_⁵-x-1=0, SymPy cannot find a closed-form, elementary solution. Évariste Galois famously demonstrated the impossibility of closed-form solutions for general higher-degree polynomials. Still, SymPy’s failure on a specific equation is not a proof that no closed-form solution exists. So, for this example, we add a question mark and answer "No?".

To explore further, let’s see exactly what SymPy does when given _x_⁵-x-1=0:

import sympy as sym
from sympy.abc import x

equation = x**5 - x - 1
solution = sym.solve(equation, x)
print(solution)

The output is:

[CRootOf(x**5 - x - 1, 0), CRootOf(x**5 - x - 1, 1), CRootOf(x**5 - x - 1, 2), CRootOf(x**5 - x - 1, 3), CRootOf(x**5 - x - 1, 4)]

Yikes! SymPy is clearly cheating here. It’s saying, "Oh, you want a closed form? No problem! I’ll just define a new, one-off function called CRootOf(x**5 - x - 1, 0) and call that the answer."

This is cheating because it doesn’t answer the question of interest. SymPy is essentially giving a new name to an unsolved problem and claiming success.

SymPy, of course, has good reasons for producing its answer this way. For one thing, we can now easily find a numerical solution:

from sympy import N, CRootOf
from sympy.abc import x

print(N(CRootOf(x**5 - x - 1, 0)))

Prints 1.16730397826142.

Solutions Even When No Real Solutions Exist: One surprising thing about polynomial equations is that you can always find solutions – at least numerically – even when no real solutions exist!

Consider this simple equation of degree two:

  • _x_² + 1 = 0

If we plot this equation, it never crosses the x-axis, indicating no real solutions.

However, using SymPy, we can find numerical solutions for any polynomial. For example:

from sympy import solve, Eq, CRootOf, N, degree
from sympy.abc import x

equation = Eq(x**2 + 1, 0)
numerical_solution = [N(CRootOf(equation, d)) for d in range(degree(equation))]
print(numerical_solution)

Which prints: [-1.0*I, 1.0*I].

Notice that the solutions use i (the imaginary unit), meaning they are complex numbers. This is an illustration of the Fundamental Theorem of Algebra, which states that every (non-constant) polynomial equation has at least one complex solution, even when no real solutions exist.

The takeaway: unless complex numbers are meaningful in your domain, you should ignore complex solutions.


To summarize polynomials:

  • Degree four and below: There is always a closed-form solution involving basic arithmetic operations and roots.
  • Degree five and above: Generally, no closed-form solution exists using elementary operations, though SymPy occasionally finds one.
  • Solutions: Polynomials will always have solutions – at least numerically – but these solutions may not be real (both mathematically and practically). You should typically ignore them unless complex numbers are meaningful in your domain.

Next, we’ll add exponentials and logarithms to our equations. In the solutions, we discover the Lambert W function. Is it a CRootOf-like cheat?

Exp, Log and x

When we model data mathematically, we often use exponentials and logarithms. Below is a sample of what happens when we try to reverse such models by solving their equations with SymPy:

Observations:

  • Sometimes you get lucky: The first equation _x_eˣ=0 has an elementary solution x=0. While this isn’t always the case, simple closed-form solutions can sometimes be found, even in equations involving exponentials or logarithms.
  • Every equation in this "family" appears to be solvable, with two caveats: First, I can’t precisely define this family and am unsure if a clear definition is possible. Second, solving these equations requires the Lambert W function, such as W(1) and W₋₁(1/10). This function arises when x appears both inside and outside of an exponential (or logarithmic) expression.
  • If you don’t accept W, you can’t solve these functions in closed form: Equations in this "family" generally have no closed-form elementary solutions without the Lambert W function.
  • We should accept W: The Lambert W function is a well-defined, easily computable function with applications across math and science. Its late adoption relative to exp, log, sin, and cos is simply historical.
  • A single W can generate multiple solutions: Similar to how the square root function can produce two solutions, a W expression can yield zero, one, or two real solutions. When two real solutions exist, SymPy lists them separately, representing one as W (the principal branch) and the other as W₋₁ (the secondary branch). Beyond the real solutions, any W expression also generates an infinite number of complex solutions.
  • Complex solutions will arise: Some equations, such as x log(x)+1=0, lead to only complex solutions. As with polynomials, you should ignore complex numbers unless they are meaningful in your domain.
  • Degree-five and higher polynomials mixed with exp (or log) remain unsolvable: Even with special functions like the Lambert W function, degree-five and higher polynomials cannot be solved in closed form using elementary functions.

What happens if we use both an exponential and a logarithm in the same equation? Generally, we won’t find a closed-form solution – not even with the Lambert W function:


To summarize, combining exponentials or logarithms with polynomials typically makes the equation unsolvable by traditional closed-form methods. However, if we allow the Lambert W function, equations with exponentials or logarithms (but not both) become solvable. We should embrace W as a valid tool for handling such cases.
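
To illustrate the exp-plus-log case (a sketch with an equation of my own choosing, and an initial guess I picked by eyeballing a plot), no closed form is expected, but nsolve still delivers a numeric root:

```python
import sympy as sym
from sympy.abc import x

# exp(x) + log(x) = 0 mixes an exponential and a logarithm;
# no closed-form solution is expected, but a numerical root
# near x ≈ 0.27 is easy to find.
equation = sym.exp(x) + sym.log(x)
root = sym.nsolve(equation, x, 0.5)  # 0.5 is an initial guess
print(root)
```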

Next, let’s generalize Kepler’s problem and see what happens when we introduce trigonometric functions into our equations.

Trigonometric Equations

Simple Trigonometric Equations: Here is our first batch of trigonometric samples:

SymPy successfully finds closed-form elementary solutions for each equation. The solutions involve trigonometric functions, and in some cases, complex numbers appear. (Again, we typically ignore the complex solutions unless they are meaningful for the problem at hand.)

Keep in mind that sine and cosine are periodic, which leads to infinitely many solutions. The closed-form solutions that SymPy provides typically represent a single cycle.
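
For instance (a minimal sketch of my own), SymPy returns the two solutions within one cycle for sin(x) = 1/2:

```python
import sympy as sym
from sympy.abc import x

# sin(x) = 1/2 has infinitely many solutions;
# SymPy reports the two within one cycle.
solution = sym.solve(sym.sin(x) - sym.Rational(1, 2), x)
print(solution)  # [pi/6, 5*pi/6]
```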

Commensurate Frequency Equations: In the preceding equations, we limited the trigonometric function’s input to x+b, where b is a constant. What happens if we allow inputs like _a_₁x+_b_₁ and _a_₂x+_b_₂ where _a_₁ is rational and _a_₂ is rational? This means the two periodic functions may have different frequencies, but those frequencies can synchronize. (The a’s are the frequencies.) We say our trigonometric functions have "commensurate frequencies."

Observations:

  • We occasionally get a closed-form elementary solution.
  • On sin(x) + sin(3x)+1=0, SymPy returns zero solutions. Plots and numerical methods, however, suggest solutions exist. Moreover, when I input sin(x) + sin(3x)+1=0 into WolframAlpha, an on-line computer algebra system, it produces hybrid solutions. (The WolframAlpha solutions combine elementary functions with CRootOf expressions of degree six. As we discussed in the polynomial section, such expressions generally lack a closed-form solution.)
  • SymPy sometimes times out looking for a closed-form solution when numerical methods can still provide solutions.
  • In other cases, it times out, and both numerical methods and plots confirm there are no real solutions. Earlier, such equations produced a complex-number solution rather than no numerical solution at all. [WolframAlpha does give a complex numerical solution.]
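
To double-check SymPy’s "no solutions" response (a sketch; the initial guess of 3.4 is my choice, based on where a plot suggests a root), nsolve does find a numerical root of sin(x) + sin(3x) + 1 = 0:

```python
import sympy as sym
from sympy.abc import x

# SymPy's solve() returned no closed-form solutions for this
# equation, yet a numerical root exists near x ≈ 3.4.
equation = sym.sin(x) + sym.sin(3 * x) + 1
root = sym.nsolve(equation, x, 3.4)  # 3.4 is an initial guess
print(root)
```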

Let’s plot the equation that returned zero closed-form solutions. Let’s also plot the one that numerically returned ValueError:

Additional Observations:

  • From the blue plot, SymPy’s response of "no solutions" appears to be a bug. There are clearly solutions in the plot, and SymPy should either have found them or thrown an exception.
  • On the other hand, in the red plot, the numerical result of ValueError is accurate. There are no solutions.

For all the trigonometric equations we’ve encountered so far, SymPy seems to find real-valued closed-form solutions when they exist. When they don’t exist, it times out or gives unpredictable errors.

Non-Commensurate Frequency Equations: In the preceding equations, we allowed trigonometric functions with inputs of the form ax+b​, where a​ is a rational constant. What happens if we allow inputs like _a_₁x+_b_₁ and _a_₂x+_b_₂ where _a_₁ is rational and _a_₂ is irrational? This means the two periodic functions will never synchronize. We say they have "non-commensurate frequencies."

Observations:

  • Equations with two trigonometric functions having non-commensurate frequencies generally seem unsolvable in closed form. When no elementary solution is available, SymPy returns NotImplementedError.
  • We can still get lucky and occasionally find an equation with an elementary solution. In the case above, in which SymPy returned PolynomialDivisionFailed, WolframAlpha found a closed-form solution.
  • When an equation has no solutions, SymPy produces a ValueError, which we can confirm through plots (see below). We did not see complex-number results in these cases.
The equations do not quite touch zero, so no solutions

Our conclusion regarding trigonometric equations is that we can often find elementary closed-form solutions. The main exception seems to be when the frequencies are non-commensurate – for example, in an equation containing sin(x) and sin⁡(√3 x).

The final question we’ll explore is what happens when we mix trigonometric functions with exponentials and logarithms.

Trigonometric and x, Exp, Log

Our final set of samples will require only a short discussion. What if we run a sample of equations through SymPy, each equation containing one trigonometric function combined with either x, exp(x), or log(x)?

The results are unanimous: SymPy is unable to produce closed-form solutions for any of these combinations. However, it seems that SymPy should have produced x=0 as the closed-form solution for the first equation, as indeed WolframAlpha does.

Conclusion

So, there you have it – an exploration of which equations tend to lack closed-form solutions. If you’re interested in experimenting with the examples in this article, you can find my Python code on GitHub.

As I worked through these sample equations, here is what surprised me:

  • Kepler’s Equation is wonderfully simple. I didn’t know one could model an ellipse – a geometric shape I find complicated – with such elegance.
  • Lambert’s W function proved to be invaluable for handling equations that mix terms like x and exp⁡(x). We should consider it an elementary function.
  • SymPy is an excellent, free tool that handles symbolic algebra and trigonometric equations far better than many of us could handle manually. While it may not match WolframAlpha in some cases, it’s incredibly versatile and accessible.
  • Mixing trigonometric functions with other terms frequently prevents closed-form solutions, especially when frequencies are non-commensurate.
  • When closed-form solutions remain out of reach, plotting and numerical methods step in, delivering practical results.

Thank you for joining me on this journey. I hope you now have a clearer understanding of when you can use equation-solving techniques to reverse models and how much SymPy can assist. Also, when an equation resists a closed-form solution, you can now understand why and when to rely on numerical methods.

If you enjoyed exploring mathematics with Python and SymPy, you may also enjoy using them to explore Newtonian physics. Please see this Towards Data Science article and the related, popular PyData conference talk.


Interested in future articles? Please follow me on Medium. I write about Rust and Python, scientific programming, machine learning, and statistics. I tend to write about one article per month.

The post Explore Solvable and Unsolvable Equations with Python appeared first on Towards Data Science.

]]>
Nine Rules for Running Rust on Embedded Systems https://towardsdatascience.com/nine-rules-for-running-rust-on-embedded-systems-b0c247ee877e/ Sun, 13 Oct 2024 15:23:17 +0000 https://towardsdatascience.com/nine-rules-for-running-rust-on-embedded-systems-b0c247ee877e/ Practical Lessons from Porting range-set-blaze to no_std

The post Nine Rules for Running Rust on Embedded Systems appeared first on Towards Data Science.

]]>
Practical Lessons from Porting range-set-blaze to no_std
Rust Running on Embedded - Source: https://openai.com/dall-e-2/. All other figures from the author.

Do you want your Rust code to run everywhere – from large servers to web pages, robots, and even watches? In this final article of a three-part series [1, 2, 3], we’ll see how to use Rust to run on embedded devices using no_std.

Porting your Rust project to a no_std environment allows you to target microcontrollers and deeply Embedded Systems, creating highly efficient software for constrained environments. For example, I used the upcoming version of range-set-blaze to create an LED animation sequencer and compositor that runs on a Raspberry Pi Pico:

Running Rust without the standard library presents unique challenges. Without operating system support, features like file I/O, networking, and sometimes even dynamic memory allocation are unavailable. In this article, we’ll look at practical strategies to overcome these limitations.

Porting Rust to no_std requires careful steps and choices, and missing any step can lead to failure. We’ll simplify the process by following these nine rules, which we will examine in detail:

  1. Confirm that your project works with WASM WASI and WASM in the Browser.
  2. Use target thumbv7m-none-eabi and cargo tree to identify and fix dependencies incompatible with no_std.
  3. Mark main (non-test) code no_std and alloc. Replace std:: with core:: and alloc::.
  4. Use Cargo features to let your main code use std optionally for file-related (etc.) functions.
  5. Understand why test code always uses the standard library.
  6. Create a simple embedded test project. Run it with QEMU.
  7. In Cargo.toml, add keywords and categories for WASM and no_std.
  8. [Optional] Use preallocated data types to avoid alloc.
  9. Add thumbv7m-none-eabi and QEMU to your CI (continuous integration) tests.

Aside: These articles are based on a three-hour workshop that I presented at RustConf24 in Montreal. Thanks to the participants of that workshop. A special thanks, also, to the volunteers from the Seattle Rust Meetup who helped test this material. These articles replace an article I wrote last year with updated information.

As with the first and second articles in this series, before we look at the rules one by one, let’s define our terms.

  • Native: Your home OS (Linux, Windows, macOS)
  • Standard library (std): Provides Rust’s core functionality – Vec, String, file input/output, networking, time.
  • WASM: WebAssembly (WASM) is a binary instruction format that runs in most browsers (and beyond).
  • WASI: WebAssembly System Interface (WASI) allows outside-the-browser WASM to access file I/O, networking (not yet), and time handling.
  • no_std: Instructs a Rust program not to use the full standard library, making it suitable for small, embedded devices or highly resource-constrained environments.
  • alloc: Provides heap memory allocation capabilities (Vec, String, etc.) in no_std environments, essential for dynamically managing memory.

Based on my experience with [range-set-blaze](https://github.com/CarlKCarlK/range-set-blaze), a data structure project, here are the decisions I recommend, described one at a time. To avoid wishy-washiness, I’ll express them as rules.

Rule 1: Confirm that your project works with WASM WASI and WASM in the Browser.

Before porting your Rust code to an embedded environment, ensure it runs successfully in WASM WASI and WASM in the Browser. These environments expose issues related to moving away from the standard library and impose constraints like those of embedded systems. By addressing these challenges early, you’ll be closer to running your project on embedded devices.

Aside: If you don’t need your project to also run on native and/or WASM, you can skip this step. You may, however, find some steps from the previous articles useful – for example, running in a 32-bit environment and understanding conditional compilation.

Environments in which we wish to run our code as a Venn diagram of progressively tighter constraints.
Environments in which we wish to run our code as a Venn diagram of progressively tighter constraints.

Run the following commands to confirm that your code works in both WASM WASI and WASM in the Browser:

cargo test --target wasm32-wasip1
cargo test --target wasm32-unknown-unknown

If the tests fail or don’t run, revisit the steps from the earlier articles in this series: WASM WASI and WASM in the Browser.

The WASM WASI article also provides crucial background on understanding Rust targets (Rule 2), conditional compilation (Rule 4), and Cargo features (Rule 6).

Once you’ve fulfilled these prerequisites, the next step is to see how (and if) we can get our dependencies working on embedded systems.

Rule 2: Use target thumbv7m-none-eabi and cargo tree to identify and fix dependencies incompatible with no_std.

To check if your dependencies are compatible with an embedded environment, compile your project for an embedded target. I recommend using the thumbv7m-none-eabi target:

  • thumbv7m – Represents the ARM Cortex-M3 microcontroller, a popular family of embedded processors.
  • none – Indicates that there is no operating system (OS) available. In Rust, this typically means we can’t rely on the standard library (std), so we use no_std. Recall that the standard library provides core functionality like Vec, String, file input/output, networking, and time.
  • eabi – Embedded Application Binary Interface, a standard defining calling conventions, data types, and binary layout for embedded executables.

Since most embedded processors share the no_std constraint, ensuring compatibility with this target helps ensure compatibility with other embedded targets.

Install the target and check your project:

rustup target add thumbv7m-none-eabi
cargo check --target thumbv7m-none-eabi

When I did this on range-set-blaze, I encountered errors complaining about dependencies, such as:

This shows that my project depends on num-traits, which depends on either, ultimately depending on std.

The error messages can be confusing. To better understand the situation, run this cargo tree command:

cargo tree --edges no-dev --format "{p} {f}"

It displays a recursive list of your project’s dependencies and their active Cargo features. For example:

range-set-blaze v0.1.6 (C:\deldir\branches\rustconf24.nostd) 
├── gen_ops v0.3.0
├── itertools v0.13.0 default,use_alloc,use_std
│   └── either v1.12.0 use_std
├── num-integer v0.1.46 default,std
│   └── num-traits v0.2.19 default,i128,std
│       [build-dependencies]
│       └── autocfg v1.3.0
└── num-traits v0.2.19 default,i128,std (*)

We see multiple occurrences of Cargo features named use_std and std, strongly suggesting that:

  • These Cargo features require the standard library.
  • We can turn these Cargo features off.

Using the techniques explained in the first article, Rule 6, we disable the use_std and std Cargo features. Recall that Cargo features are additive and have defaults. To turn off the default features, we use default-features = false. We then enable the Cargo features we want to keep by specifying, for example, features = ["use_alloc"]. The Cargo.toml now reads:

[dependencies]
gen_ops = "0.3.0"
itertools = { version = "0.13.0", features=["use_alloc"], default-features = false }
num-integer = { version = "0.1.46", default-features = false }
num-traits = { version = "0.2.19", features=["i128"], default-features = false }

Turning off Cargo features will not always be enough to make your dependencies no_std-compatible.

For example, the popular thiserror crate introduces std into your code and offers no Cargo feature to disable it. However, the community has created no_std alternatives. You can find these alternatives by searching, for example, https://crates.io/search?q=thiserror+no_std.

In the case of range-set-blaze, a problem remained related to crate [gen_ops](https://crates.io/crates/gen_ops) – a wonderful crate for conveniently defining operators such as + and &. The crate used std but didn’t need to. I identified the required one-line change (using the methods we’ll cover in Rule 3) and submitted a pull request. The maintainer accepted it, and they released an updated version: 0.4.0.

Sometimes, our project can’t disable std because we need capabilities like file access when running on a full operating system. On embedded systems, however, we’re willing—and indeed must—give up such capabilities. In Rule 4, we’ll see how to make std usage optional by introducing our own Cargo features.

Using these methods fixed all the dependency errors in range-set-blaze. However, resolving those errors revealed 281 errors in the main code. Progress!

Rule 3: Mark main (non-test) code no_std and alloc. Replace std:: with core:: and alloc::.

At the top of your project’s lib.rs (or main.rs) add:

#![no_std]
extern crate alloc;

This means we won’t use the standard library, but we will still allocate memory. For range-set-blaze, this change reduced the error count from 281 to 52.

Many of the remaining errors are due to using items in std that are available in core or alloc. Since much of std is just a re-export of core and alloc, we can resolve many errors by switching std references to core or alloc. This allows us to keep the essential functionality without relying on the standard library.

For example, we get an error for each of these lines:

use std::cmp::max;
use std::cmp::Ordering;
use std::collections::BTreeMap;

Changing std:: to either core:: or (if memory related) alloc:: fixes the errors:

use core::cmp::max;
use core::cmp::Ordering;
use alloc::collections::BTreeMap;

Some capabilities, such as file access, are std-only—that is, they are defined outside of core and alloc. Fortunately, for range-set-blaze, switching to core and alloc resolved all 52 errors in the main code. However, this fix revealed 89 errors in its test code. Again, progress!

Aside: You can also find places where std could be alloc or core via Clippy rules.

We’ll address errors in the test code in Rule 5, but first, let’s figure out what to do if we need capabilities like file access when running on a full operating system.

Rule 4: Use Cargo features to let your main code use std optionally for file-related (etc.) functions.

If we need two versions of our code – one for running on a full operating system and one for embedded systems – we can use Cargo features (see Rule 6 in the first article). For example, let’s define a feature called foo, which will be the default. We’ll include the function demo_read_ranges_from_file only when foo is enabled.

In Cargo.toml (preliminary):

[features]
default = ["foo"]
foo = []

In lib.rs (preliminary):

#![no_std]
extern crate alloc;

// ...

#[cfg(feature = "foo")]
pub fn demo_read_ranges_from_file<P, T>(path: P) -> std::io::Result<RangeSetBlaze<T>>
where
    P: AsRef<std::path::Path>,
    T: FromStr + Integer,
{
    todo!("This function is not yet implemented.");
}

This says to define function demo_read_ranges_from_file only when Cargo feature foo is enabled. We can now check various versions of our code:

cargo check # enables "foo", the default Cargo features
cargo check --features foo # also enables "foo"
cargo check --no-default-features # enables nothing

Now let’s give our Cargo feature a more meaningful name by renaming foo to std. Our Cargo.toml (intermediate) now looks like:

[features]
default = ["std"]
std = []

In our lib.rs, we add these lines near the top to bring in the std library when the std Cargo feature is enabled:

#[cfg(feature = "std")]
extern crate std;

So, lib.rs (final) looks like this:

#![no_std]
extern crate alloc;

#[cfg(feature = "std")]
extern crate std;

// ...

#[cfg(feature = "std")]
pub fn demo_read_ranges_from_file<P, T>(path: P) -> std::io::Result<RangeSetBlaze<T>>
where
    P: AsRef<std::path::Path>,
    T: FromStr + Integer,
{
    todo!("This function is not yet implemented.");
}

We’d like to make one more change to our Cargo.toml. We want our new Cargo feature to control dependencies and their features. Here is the resulting Cargo.toml (final):

[features]
default = ["std"]
std = ["itertools/use_std", "num-traits/std", "num-integer/std"]

[dependencies]
itertools = { version = "0.13.0", features = ["use_alloc"], default-features = false }
num-integer = { version = "0.1.46", default-features = false }
num-traits = { version = "0.2.19", features = ["i128"], default-features = false }
gen_ops = "0.4.0"

Aside: If you’re confused by the Cargo.toml format for specifying dependencies and features, see my recent article: Nine Rust Cargo.toml Wats and Wat Nots: Master Cargo.toml formatting rules and avoid frustration in Towards Data Science.

To check that your project compiles both with the standard library (std) and without, use the following commands:

cargo check # std
cargo check --no-default-features # no_std

With cargo check working, you’d think that cargo test would be straightforward. Unfortunately, it’s not. We’ll look at that next.

Rule 5: Understand why test code always uses the standard library.

When we compile our project with --no-default-features, it operates in a no_std environment. However, Rust’s testing framework always includes the standard library, even in a no_std project. This is because cargo test requires std; for example, the #[test] attribute and the test harness itself are defined in the standard library.

As a result, running:

# DOES NOT TEST `no_std`
cargo test --no-default-features

does not actually test the no_std version of your code. Functions from std that are unavailable in a true no_std environment will still be accessible during testing. For instance, the following test will compile and run successfully with --no-default-features, even though it uses std::fs:

#[test]
fn test_read_file_metadata() {
    let metadata = std::fs::metadata("./").unwrap();
    assert!(metadata.is_dir());
}

Additionally, when testing in std mode, you may need to add explicit imports for features from the standard library. This is because, even though std is available during testing, your project is still compiled as #![no_std], meaning the standard prelude is not automatically in scope. For example, you’ll often need the following imports in your test code:

#![cfg(test)]
use std::prelude::v1::*;
use std::{format, print, println, vec};

These imports bring in the necessary utilities from the standard library so that they are available during testing.

To genuinely test your code without the standard library, you’ll need to use alternative methods that do not rely on cargo test. We’ll explore how to run no_std tests in the next rule.

Rule 6: Create a simple embedded test project. Run it with QEMU.

You can’t run your regular tests in an embedded environment. However, you can – and should – run at least one embedded test. My philosophy is that even a single test is infinitely better than none. Since "if it compiles, it works" is generally true for no_std projects, one (or a few) well-chosen test can be quite effective.

Aside: There is hope for running embedded tests in a more normal fashion [1][2]. As far as I know, nothing yet works easily with normal, native tests. If this changes, please let me know and I’ll update this section.

To run this test, we use QEMU (Quick Emulator, pronounced "cue-em-you"), which allows us to emulate thumbv7m-none-eabi code on our main operating system (Linux, Windows, or macOS).

Install QEMU.

See the QEMU download page for full information:

Linux/WSL

  • Ubuntu: sudo apt-get install qemu-system
  • Arch: sudo pacman -S qemu-system-arm
  • Fedora: sudo dnf install qemu-system-arm

Windows

  • Method 1: https://qemu.weilnetz.de/w64. Run the installer (tell Windows that it is OK). Add "C:\Program Files\qemu" to your path.
  • Method 2: Install MSYS2 from https://www.msys2.org/. Open MSYS2 UCRT64 terminal. pacman -S mingw-w64-x86_64-qemu. Add C:\msys64\mingw64\bin to your path.

Mac

  • brew install qemu or sudo port install qemu

Test installation with:

qemu-system-arm --version

Create an embedded subproject.

Create a subproject for the embedded tests:

cargo new tests/embedded

This command generates a new subproject, including the configuration file at tests/embedded/Cargo.toml.

Aside: This command also modifies your top-level Cargo.toml to add the subproject to your workspace. In Rust, a workspace is a collection of related packages defined in the [workspace] section of the top-level Cargo.toml. All packages in the workspace share a single Cargo.lock file, ensuring consistent dependency versions across the entire workspace.
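For reference, after the command runs, your top-level Cargo.toml may contain a section roughly like this (a sketch — the member path matches the subproject we just created):

```toml
# Top-level Cargo.toml (sketch): the new subproject listed as a workspace member
[workspace]
members = ["tests/embedded"]
```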

Edit tests/embedded/Cargo.toml to look like this, but replace "range-set-blaze" with the name of your top-level project:

[package]
name = "embedded"
version = "0.1.0"
edition = "2021"

[dependencies]
alloc-cortex-m = "0.4.4"
cortex-m = "0.7.7"
cortex-m-rt = "0.7.3"
cortex-m-semihosting = "0.5.0"
panic-halt = "0.2.0"
# Change to refer to your top-level project
range-set-blaze = { path = "../..", default-features = false }

Update the test code.

Replace the contents of tests/embedded/src/main.rs with:

// Based on https://github.com/rust-embedded/cortex-m-quickstart/blob/master/examples/allocator.rs
// and https://github.com/rust-lang/rust/issues/51540
#![feature(alloc_error_handler)]
#![no_main]
#![no_std]
extern crate alloc;
use alloc::string::ToString;
use alloc_cortex_m::CortexMHeap;
use core::{alloc::Layout, iter::FromIterator};
use cortex_m::asm;
use cortex_m_rt::entry;
use cortex_m_semihosting::{debug, hprintln};
use panic_halt as _;
#[global_allocator]
static ALLOCATOR: CortexMHeap = CortexMHeap::empty();
const HEAP_SIZE: usize = 1024; // in bytes
#[alloc_error_handler]
fn alloc_error(_layout: Layout) -> ! {
    asm::bkpt();
    loop {}
}

#[entry]
fn main() -> ! {
    unsafe { ALLOCATOR.init(cortex_m_rt::heap_start() as usize, HEAP_SIZE) }

    // Test(s) goes here. Run only under emulation
    use range_set_blaze::RangeSetBlaze;
    let range_set_blaze = RangeSetBlaze::from_iter([100, 103, 101, 102, -3, -4]);
    hprintln!("{:?}", range_set_blaze.to_string());
    if range_set_blaze.to_string() != "-4..=-3, 100..=103" {
        debug::exit(debug::EXIT_FAILURE);
    }

    debug::exit(debug::EXIT_SUCCESS);
    loop {}
}

Most of this main.rs code is embedded system boilerplate. The actual test code is:

use range_set_blaze::RangeSetBlaze;
let range_set_blaze = RangeSetBlaze::from_iter([100, 103, 101, 102, -3, -4]);
hprintln!("{:?}", range_set_blaze.to_string());
if range_set_blaze.to_string() != "-4..=-3, 100..=103" {
    debug::exit(debug::EXIT_FAILURE);
}

If the test fails, it returns EXIT_FAILURE; otherwise, it returns EXIT_SUCCESS. We use the hprintln! macro to print messages to the console during emulation. Since this is an embedded system, the code ends in an infinite loop to run continuously.

Add supporting files.

Before you can run the test, you must add two files to the subproject: build.rs and memory.x from the Cortex-M quickstart repository:

Linux/WSL/macOS

cd tests/embedded
wget https://raw.githubusercontent.com/rust-embedded/cortex-m-quickstart/master/build.rs
wget https://raw.githubusercontent.com/rust-embedded/cortex-m-quickstart/master/memory.x

Windows (Powershell)

cd tests/embedded
Invoke-WebRequest -Uri 'https://raw.githubusercontent.com/rust-embedded/cortex-m-quickstart/master/build.rs' -OutFile 'build.rs'
Invoke-WebRequest -Uri 'https://raw.githubusercontent.com/rust-embedded/cortex-m-quickstart/master/memory.x' -OutFile 'memory.x'

Also, create a tests/embedded/.cargo/config.toml with the following content:

[target.thumbv7m-none-eabi]
runner = "qemu-system-arm -cpu cortex-m3 -machine lm3s6965evb -nographic -semihosting-config enable=on,target=native -kernel"

[build]
target = "thumbv7m-none-eabi"

This configuration instructs Cargo to use QEMU to run the embedded code and sets thumbv7m-none-eabi as the default target for the subproject.

Run the test.

Run the test with cargo run (not cargo test):

# Setup
# Make this subproject 'nightly' to support #![feature(alloc_error_handler)]
rustup override set nightly
rustup target add thumbv7m-none-eabi

# If needed, cd tests/embedded
cargo run

You should see log messages, and the process should exit without error. In my case, I see: "-4..=-3, 100..=103".

These steps may seem like a significant amount of work just to run one (or a few) tests. However, it’s primarily a one-time effort involving mostly copy and paste. Additionally, it enables running tests in a CI environment (see Rule 9). The alternative – claiming that the code works in a no_std environment without ever actually running it in no_std – risks overlooking critical issues.

The next rule is much simpler.

Rule 7: In Cargo.toml, add keywords and categories for WASM and no_std.

Once your package compiles and passes the additional embedded test, you may want to publish it to crates.io, Rust’s package registry. To let others know that it is compatible with WASM and no_std, add the following keywords and categories to your Cargo.toml file:

[package]
# ... 
categories = ["no-std", "wasm", "embedded"] # + others specific to your package
keywords = ["no_std", "wasm"] # + others specific to your package

Note that for categories, we use a hyphen in no-std. For keywords, no_std (with an underscore) is more popular than no-std. Your package can have a maximum of five keywords and five categories.

Here is a list of categories and keywords of possible interest, along with the number of crates using each term:

Good categories and keywords will help people find your package, but the system is informal. There’s no mechanism to check whether your categories and keywords are accurate, nor are you required to provide them.

Next, we’ll explore one of the most restricted environments you’re likely to encounter.

Rule 8: [Optional] Use preallocated data types to avoid alloc.

My project, range-set-blaze, implements a dynamic data structure that requires memory allocation from the heap (via alloc). But what if your project doesn’t need dynamic memory allocation? In that case, it can run in even more restricted embedded environments—specifically those where all memory is preallocated when the program is loaded.

The reasons to avoid alloc if you can:

  • Completely deterministic memory usage
  • Reduced risk of runtime failures (often caused by memory fragmentation)
  • Lower power consumption

There are crates available that can sometimes help you replace dynamic data structures like Vec, String, and HashMap. These alternatives generally require you to specify a maximum size. The table below shows some popular crates for this purpose:

I recommend the heapless crate because it provides a collection of data structures that work well together.

Here is an example of code – using heapless – related to an LED display. This code creates a mapping from a byte to a list of integers. We limit the number of items in the map and the length of the integer list to DIGIT_COUNT (in this case, 4).

use heapless::{LinearMap, Vec};
// ...
let mut map: LinearMap<u8, Vec<usize, DIGIT_COUNT>, DIGIT_COUNT> = LinearMap::new();
// ...
let mut vec = Vec::default();
vec.push(index).unwrap();
map.insert(*byte, vec).unwrap(); // actually copies

Full details about creating a no_alloc project are beyond my experience. However, the first step is to remove this line (added in Rule 3) from your lib.rs or main.rs:

extern crate alloc; // remove this

Rule 9: Add thumbv7m-none-eabi and QEMU to your CI (continuous integration) tests.

Your project is now compiling to no_std and passing at least one embedded-specific test. Are you done? Not quite. As I said in the previous two articles:

If it’s not in CI, it doesn’t exist.

Recall that continuous integration (CI) is a system that can automatically run tests every time you update your code. I use GitHub Actions as my CI platform. Here’s the configuration I added to .github/workflows/ci.yml to test my project on embedded platforms:

test_thumbv7m_none_eabi:
    name: Setup and Check Embedded
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v4
      - name: Set up Rust
        uses: dtolnay/rust-toolchain@master
        with:
          toolchain: stable
          target: thumbv7m-none-eabi
      - name: Install check stable and nightly
        run: |
          cargo check --target thumbv7m-none-eabi --no-default-features
          rustup override set nightly
          rustup target add thumbv7m-none-eabi
          cargo check --target thumbv7m-none-eabi --no-default-features
          sudo apt-get update && sudo apt-get install qemu qemu-system-arm
      - name: Test Embedded (in nightly)
        timeout-minutes: 1
        run: |
          cd tests/embedded
          cargo run

By testing embedded and no_std with CI, I can be sure that my code will continue to support embedded platforms in the future.


So, there you have it – nine rules for porting your Rust code to embedded. To see a snapshot of the whole range-set-blaze project after applying all nine rules, see this branch on Github.

Here is what surprised me about porting to embedded:

The Bad:

  • We cannot run our existing tests on embedded systems. Instead, we must create a new subproject and write (a few) new tests.
  • Many popular libraries rely on std, so finding or adapting dependencies that work with no_std can be challenging.

The Good:

  • The Rust saying that "if it compiles, it works" holds true for embedded development. This gives us confidence in our code’s correctness without requiring extensive new tests.
  • Although no_std removes our immediate access to the standard library, many items continue to be available via core and alloc.
  • Thanks to emulation, you can develop for embedded systems without hardware.

Thank you for joining me on this journey from WASI to WebAssembly in the browser and, finally, to embedded development. Rust has continued to impress me with its ability to run efficiently and safely across environments. As you explore these different domains, I hope you find Rust’s flexibility and power as compelling as I do. Whether you’re working on cloud servers, browsers, or microcontrollers, the tools we’ve discussed will help you tackle the challenges ahead with confidence.


Interested in future articles? Please follow me on Medium. I write about Rust and Python, scientific programming, machine learning, and statistics. I tend to write about one article per month.

The post Nine Rules for Running Rust on Embedded Systems appeared first on Towards Data Science.

]]>
Nine Rules for Running Rust in the Browser https://towardsdatascience.com/nine-rules-for-running-rust-in-the-browser-8228353649d1/ Tue, 08 Oct 2024 07:25:04 +0000 https://towardsdatascience.com/nine-rules-for-running-rust-in-the-browser-8228353649d1/ Practical lessons from porting range-set-blaze to WASM

The post Nine Rules for Running Rust in the Browser appeared first on Towards Data Science.

]]>
Do you want your Rust code to run everywhere – from large servers to web pages, robots, and even watches? In this second of three articles [1, 2, 3], I’ll show you how to use WebAssembly (WASM) to run your Rust code directly in the user’s browser.

With this technique, you can provide CPU-intensive, dynamic web pages from a – perhaps free – static web server. As a bonus, a user’s data never leaves their machine, avoiding privacy issues. For example, I offer a tool to search race results for friends, running club members, and teammates. To see the tool, go to its web page, and click "match".

Aside: To learn more about matching names, see Use Bayes’ Theorem to Find Distinctive Names in a List in Towards Data Science.

Running Rust in the browser presents challenges. Your code doesn’t have access to a full operating system like Linux, Windows, or macOS. You have no direct access to files or networks. You have only limited access to time and random numbers. We’ll explore workarounds and solutions.

Porting code to WASM in the browser requires several steps and choices, and navigating these can be time-consuming. Missing a step can lead to failure. We’ll reduce this complication by offering nine rules, which we’ll explore in detail:

  1. Confirm that your existing app works with WASM WASI and create a simple JavaScript web page.
  2. Install the wasm32-unknown-unknown target, wasm-pack, wasm-bindgen-cli, and Chrome for Testing & Chromedriver.
  3. Make your project cdylib (and rlib), add wasm-bindgen dependencies, and test.
  4. Learn what types wasm-bindgen supports.
  5. Change functions to use supported types. Change files to generic BufRead.
  6. Adapt tests, skipping those that don’t apply.
  7. Change to JavaScript-friendly dependencies, if necessary. Run tests.
  8. Connect your web page to your functions.
  9. Add wasm-pack to your CI (continuous integration) tests.

Aside: These articles are based on a three-hour workshop that I presented at RustConf24 in Montreal. Thanks to the participants of that workshop. A special thanks, also, to the volunteers from the Seattle Rust Meetup who helped test this material. These articles replace an article I wrote last year with updated information.

As with the first article in this series, before we look at the rules one by one, let’s define our terms.

  • Native: Your home OS (Linux, Windows, macOS)
  • Standard library (std): Provides Rust’s core functionality – Vec, String, file input/output, networking, time.
  • WASM: WebAssembly (WASM) is a binary instruction format that runs in most browsers (and beyond).
  • WASI: WebAssembly System Interface (WASI) allows outside-the-browser WASM to access file I/O, networking (not yet), and time handling.
  • no_std: Instructs a Rust program not to use the full standard library, making it suitable for small, embedded devices or highly resource-constrained environments.
  • alloc: Provides heap memory allocation capabilities (Vec, String, etc.) in no_std environments, essential for dynamically managing memory.

Based on my experience with range-set-blaze (https://github.com/CarlKCarlK/range-set-blaze), a data structure project, here are the decisions I recommend, described one at a time. To avoid wishy-washiness, I’ll express them as rules.

Rule 1: Confirm that your existing app works with WASM WASI and create a simple JavaScript web page.

Getting your Rust code to run in the browser will be easier if you meet two prerequisites:

  • Get your Rust code running in WASM WASI.
  • Get some JavaScript to run in the browser.

For the first prerequisite, see Nine Rules for Running Rust on WASM WASI in Towards Data Science. That article – the first article in this series – details how to move your code from your native operating system to WASM WASI. With that move, you will be halfway to running on WASM in the Browser.

Environments in which we wish to run our code as a Venn diagram of progressively tighter constraints.

Confirm your code runs on WASM WASI via your tests:

rustup target add wasm32-wasip1
cargo install wasmtime-cli
cargo test --target wasm32-wasip1

For the second prerequisite, show that you can create some JavaScript code and run it in a browser. I suggest adding this index.html file to the top level of your project:

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Line Counter</title>
</head>
<body>
    <h1>Line Counter</h1>
    <input type="file" id="fileInput" />
    <p id="lineCount">Lines in file: </p>
    <script>
        const output = document.getElementById('lineCount');
        document.getElementById('fileInput').addEventListener('change', (event) => {
            const file = event.target.files[0];
            if (!file) { output.innerHTML = ''; return } // No file selected
            const reader = new FileReader();
            // When the file is fully read
            reader.onload = async (e) => {                
                const content = e.target.result;
                const lines = content.split(/\r\n|\n/).length;
                output.textContent = `Lines in file: ${lines}`;
            };
            // Now start to read the file as text
            reader.readAsText(file);
        });
    </script>
</body>
</html>

Now, serve this page to your browser. You can serve web pages via an editor extension. I use Live Preview for VS Code. Alternatively, you can install and use a standalone web server, such as simple-http-server:

cargo install simple-http-server
simple-http-server --ip 127.0.0.1 --port 3000 --index
# then open browser to http://127.0.0.1:3000

You should now see a web page on which you can select a file. The JavaScript on the page counts the lines in the file.

Let’s go over the key parts of the JavaScript because later we will change it to call Rust.

Aside: Must you learn JavaScript to use Rust in the browser? Yes and no. Yes, you’ll need to create at least some simple JavaScript code. No, you may not need to "learn" JavaScript. I’ve found ChatGPT good enough to generate the simple JavaScript that I need.

  • See what file the user chose. If none, just return:
const file = event.target.files[0];
if (!file) { output.innerHTML = ''; return } // No file selected
  • Create a new FileReader object, do some setup, and then read the file as text:
const reader = new FileReader();
// ... some setup ...
// Now start to read the file as text
reader.readAsText(file);
  • Here is the setup. It says: wait until the file is fully read, read its contents as a string, split the string into lines, and display the number of lines.
// When the file is fully read
reader.onload = async (e) => {                
    const content = e.target.result;
    const lines = content.split(/\r\n|\n/).length;
    output.textContent = `Lines in file: ${lines}`;
    };

With the prerequisites fulfilled, we turn next to installing the needed WASM-in-the-Browser tools.

Rule 2: Install the wasm32-unknown-unknown target, wasm-pack, wasm-bindgen-cli, and Chrome for Testing & Chromedriver.

We start with something easy, installing these three tools:

rustup target add wasm32-unknown-unknown
cargo install wasm-pack --force
cargo install wasm-bindgen-cli --force

The first line installs a new target, wasm32-unknown-unknown. This target compiles Rust to WebAssembly without any assumptions about the environment the code will run in. The lack of assumptions makes it suitable to run in browsers. (For more on targets, see the previous article‘s Rule #2.)

The next two lines install wasm-pack and wasm-bindgen-cli, two command-line utilities. The first builds and packages your Rust code into a form suitable for use by a web page. The second makes testing easier. We use --force to ensure the utilities are up-to-date and mutually compatible.

Now, we get to the annoying part, installing Chrome for Testing & Chromedriver. Chrome for Testing is an automatable version of the Chrome browser. Chromedriver is a separate program that can take your Rust test cases and run them inside Chrome for Testing.

Why is installing them annoying? First, the process is somewhat complex. Second, the version of Chrome for Testing must match the version of Chromedriver. Third, installing Chrome for Testing will conflict with your current installation of regular Chrome.

With that background, here are my suggestions. Start by installing the two programs into a dedicated subfolder of your home directory.

  • Linux and WSL (Windows Subsystem for Linux):
cd ~
mkdir -p ~/.chrome-for-testing
cd .chrome-for-testing/
wget https://storage.googleapis.com/chrome-for-testing-public/129.0.6668.70/linux64/chrome-linux64.zip
wget https://storage.googleapis.com/chrome-for-testing-public/129.0.6668.70/linux64/chromedriver-linux64.zip
unzip chrome-linux64.zip
unzip chromedriver-linux64.zip
  • Windows (PowerShell):
New-Item -Path $HOME -Name ".chrome-for-testing" -ItemType "Directory"
Set-Location -Path $HOME\.chrome-for-testing
bitsadmin /transfer "ChromeDownload" https://storage.googleapis.com/chrome-for-testing-public/129.0.6668.70/win64/chrome-win64.zip $HOME\.chrome-for-testing\chrome-win64.zip
bitsadmin /transfer "ChromeDriverDownload" https://storage.googleapis.com/chrome-for-testing-public/129.0.6668.70/win64/chromedriver-win64.zip $HOME\.chrome-for-testing\chromedriver-win64.zip
Expand-Archive -Path "$HOME\.chrome-for-testing\chrome-win64.zip" -DestinationPath "$HOME\.chrome-for-testing"
Expand-Archive -Path "$HOME\.chrome-for-testing\chromedriver-win64.zip" -DestinationPath "$HOME\.chrome-for-testing"

Aside: I’m sorry but I haven’t tested any Mac instructions. Please see the Chrome for Testing web page and then try to adapt the Linux method. If you let me know what works, I’ll update this section.

This installs version 129.0.6668.70, the stable version as of 9/30/2024. If you wish, check the Chrome for Testing Availability page for newer stable versions.

Next, we need to add these programs to our PATH. We can add them temporarily, meaning only for the current terminal session:

  • Linux and WSL (just for this session):
export PATH=~/.chrome-for-testing/chrome-linux64:~/.chrome-for-testing/chromedriver-linux64:$PATH
  • Windows (just for this session):
# PowerShell
$env:PATH = "$HOME\.chrome-for-testing\chrome-win64;$HOME\.chrome-for-testing\chromedriver-win64;$env:PATH"
# or, CMD
set PATH=%USERPROFILE%\.chrome-for-testing\chrome-win64;%USERPROFILE%\.chrome-for-testing\chromedriver-win64;%PATH%

Alternatively, we can add them to our PATH permanently for all future terminal sessions. Understand that this may interfere with access to your regular version of Chrome.

Linux and WSL (then restart your terminal):

echo 'export PATH=~/.chrome-for-testing/chrome-linux64:~/.chrome-for-testing/chromedriver-linux64:$PATH' >> ~/.bashrc

Windows (PowerShell, then restart your terminal):

[System.Environment]::SetEnvironmentVariable("Path", "$HOME\.chrome-for-testing\chrome-win64;$HOME\.chrome-for-testing\chromedriver-win64;" + $env:PATH, [System.EnvironmentVariableTarget]::User)

Once installed, you can verify the installation with:

chromedriver --version

Aside: Can you skip installing and using Chrome for Testing and Chromedriver? Yes and no. If you skip them, you’ll still be able to create WASM from your Rust. Moreover, you’ll be able to call that WASM from JavaScript in a web page.

However, your project – like all good code – should already contain tests. If you skip Chrome for Testing, you will not be able to run WASM-in-the-Browser test cases. Moreover, WASM in the Browser violates Rust’s "If it compiles, it works" principle. Specifically, if you use an unsupported feature, like file access, compiling to WASM won’t catch the error. Only test cases can catch such errors. This makes running test cases critically important.

Now that we have the tools to run tests in the browser, let’s try (and almost certainly fail) to run those tests.

Rule 3: Make your project cdylib (and rlib), add wasm-bindgen dependencies, and test.

The wasm-bindgen package is a set of automatically generated bindings between Rust and JavaScript. It lets JavaScript call Rust.

To prepare your code for WASM in the Browser, you’ll make your project a library project. Additionally, you’ll add and use wasm-bindgen dependencies. Follow these steps:

  • If your project is executable, change it to a library project by renaming src/main.rs to src/lib.rs. Also, comment out your main function.
  • Make your project create both a static library (the default) and a dynamic library (needed by WASM). Specifically, edit Cargo.toml to include:
[lib]
crate-type = ["cdylib", "rlib"]
  • Add wasm-bindgen dependencies:
cargo add wasm-bindgen
cargo add wasm-bindgen-test --dev
  • Create or update .cargo/config.toml (not to be confused with Cargo.toml) to include:
[target.wasm32-unknown-unknown]
runner = "wasm-bindgen-test-runner"

Next, what functions do you wish to be visible to JavaScript? Mark those functions with #[wasm_bindgen] and make them pub (public). At the top of the functions’ files, add use wasm_bindgen::prelude::*;.

Aside: For now, your functions may fail to compile. We’ll address this issue in subsequent rules.

What about tests? Everywhere you have a #[test] add a #[wasm_bindgen_test]. Where needed for tests, add this use statement and a configuration statement:

use wasm_bindgen_test::wasm_bindgen_test;
wasm_bindgen_test::wasm_bindgen_test_configure!(run_in_browser);

If you like, you can try the preceding steps on a small, sample project. Install the sample project from GitHub:

# cd to the top of a work directory
git clone --branch native_version --single-branch https://github.com/CarlKCarlK/rustconf24-good-turing.git good-turing
cd good-turing
cargo test
cargo run pg100.txt

Here we see all these changes on the small, sample project’s lib.rs:

// --- May fail to compile for now. ---
use wasm_bindgen::prelude::*;
// ...
#[wasm_bindgen]
pub fn good_turing(file_name: &str) -> Result<(u32, u32), io::Error> {
    let reader = BufReader::new(File::open(file_name)?);
    // ...
}
// fn main() {
// ...
// }
#[cfg(test)]
mod tests {
    use wasm_bindgen_test::wasm_bindgen_test;
    wasm_bindgen_test::wasm_bindgen_test_configure!(run_in_browser);
    // ...
    #[test]
    #[wasm_bindgen_test]
    fn test_process_file() {
      let (prediction, actual) = good_turing("./pg100.txt").unwrap();
      // ...
    }
}

With these changes made, we’re ready to test (and likely fail):

cargo test --target wasm32-unknown-unknown

On this sample, the compiler complains that WASM in the Browser doesn’t like to return tuple types, here, (u32, u32). It also complains that it doesn’t like to return a Result with io::Error. To fix these problems, we’ll need to understand which types WASM in the Browser supports. That’s the topic of Rule 4.

What will happen after we fix the type problems and can run the test? The test will still fail, but now with a runtime error. WASM in the Browser doesn’t support reading from files. The sample test, however, tries to read from a file. In Rule 5, we’ll discuss workarounds for both type limitations and file-access restrictions.

Rule 4: Learn what types wasm-bindgen supports.

Rust functions that JavaScript can see must have input and output types that wasm-bindgen supports. Use of unsupported types causes compiler errors. For example, passing in a u32 is fine. Passing in a tuple of (u32, u32) is not.

More generally, we can sort Rust types into three categories: "Yep!", "Nope!", and "Avoid".

Yep!

This is the category for Rust types that JavaScript (via wasm-bindgen) understands well.

We’ll start with Rust’s simple copy types:

Two items surprised me here. First, 64-bit integers require extra work on the JavaScript side. Specifically, they require the use of JavaScript’s BigInt class (https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/BigInt). Second, JavaScript does not support 128-bit integers. The 128-bit integers are "Nopes".

Turning now to String-related and vector-related types:

These super useful types use heap-allocated memory. Because Rust and JavaScript manage memory differently, each language makes its own copy of the data. I thought I might avoid this allocation by passing a &mut [u8] (mutable slice of bytes) from JavaScript to Rust. That didn’t work. Instead of zero copies or one, it copied twice.

Aside: For String and &str, wasm-bindgen also converts between JavaScript’s UTF-16 Unicode encoding and Rust’s UTF-8 Unicode encoding.

Next, in Rust we love our Option and Result types. I’m happy to report that they are "Yeps".

A Rust Some(3) becomes a JavaScript 3, and a Rust None becomes a JavaScript null. In other words, wasm-bindgen converts Rust’s type-safe null handling to JavaScript’s old-fashioned approach. In both cases, null/None is handled idiomatically within each language.

Rust Result behaves similarly to Option. A Rust Ok(3) becomes a JavaScript 3, and a Rust Err("Some error message") becomes a JavaScript exception that can be caught with try/catch. Note that the value inside the Rust Err is restricted to types that implement the Into<JsValue> trait. Using String generally works well.
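As a sketch of that behavior (hypothetical checked_div function, not from the article; in a real project it would be marked #[wasm_bindgen], omitted here so the example compiles with plain std):

```rust
// Sketch (hypothetical checked_div function; #[wasm_bindgen] omitted so this
// compiles with plain std): Ok becomes a plain JavaScript number, while Err
// becomes a thrown exception carrying the error string.
pub fn checked_div(numerator: u32, denominator: u32) -> Result<u32, String> {
    if denominator == 0 {
        return Err(String::from("division by zero"));
    }
    Ok(numerator / denominator)
}

fn main() {
    println!("{:?}", checked_div(10, 2)); // Ok(5)
    println!("{:?}", checked_div(1, 0)); // Err("division by zero")
}
```

On the JavaScript side, checked_div(10, 2) would evaluate to 5, while checked_div(1, 0) would throw an exception that can be caught with try/catch.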

Finally, let’s look at struct, enum, and JSValue, our last set of "Yeps":

Excitingly, JavaScript can construct and call methods on your Rust structs. To enable this, you need to mark the struct and any JavaScript-accessible methods with #[wasm_bindgen].

For example, suppose you want to avoid passing a giant string from JavaScript to Rust. You could define a Rust struct that processes a series of strings incrementally. JavaScript could construct the struct, feed it chunks from a file, and then ask for the result.
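Here is a minimal std-only sketch of that incremental idea (hypothetical LineTally type; the #[wasm_bindgen] attributes on the struct and its methods are omitted so the sketch compiles without the crate):

```rust
// Sketch of the incremental pattern (hypothetical LineTally type; the
// #[wasm_bindgen] attributes are omitted so this compiles with plain std).
pub struct LineTally {
    newlines: u32,
}

impl LineTally {
    pub fn new() -> Self {
        LineTally { newlines: 0 }
    }
    /// JavaScript would call this once per chunk of the file.
    pub fn feed(&mut self, chunk: &str) {
        self.newlines += chunk.matches('\n').count() as u32;
    }
    /// Called once, after the last chunk.
    pub fn total(&self) -> u32 {
        self.newlines
    }
}

fn main() {
    let mut tally = LineTally::new();
    for chunk in ["line 1\nline", " 2\nline 3\n"] {
        tally.feed(chunk);
    }
    println!("{}", tally.total()); // 3 newlines seen across the chunks
}
```

JavaScript would construct a LineTally, call feed once per chunk read with FileReader, and finally call total — no giant string ever crosses the boundary at once.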

JavaScript’s handling of Rust enums is less exciting. It can only handle enums without associated data (C-like enums) and treats their values as integers.

In the middle of the excitement spectrum, you can pass opaque JavaScript values to Rust as JsValue. Rust can then dynamically inspect the value to determine its subtype or—if applicable—call its methods.

That ends the "Yeps". Time to look at the "Nopes".

Nope!

This is the category for Rust types that JavaScript (via wasm-bindgen) doesn’t handle.

Not being able to pass, for example, &u8 by reference is fine because you can just use u8, which is likely more efficient anyway.

Not being able to return a string slice (&str) or a regular slice (&[u8]) is somewhat annoying. To avoid lifetime issues, you must instead return an owned type like String or Vec<u8>.

You can’t accept a mutable String reference (&mut String). However, you can accept a String by value, mutate it, and then return the modified String.
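A short sketch of that take-by-value workaround (hypothetical shout function; it would be marked #[wasm_bindgen] in a real project):

```rust
// Sketch (hypothetical shout function; #[wasm_bindgen] omitted): take the
// String by value, mutate it, and return it, since &mut String is a "Nope".
pub fn shout(mut s: String) -> String {
    s.make_ascii_uppercase();
    s
}

fn main() {
    println!("{}", shout(String::from("hello"))); // HELLO
}
```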

How do we work around the "Nopes"? In place of fixed-length arrays, tuples, and 128-bit integers, use vectors (Vec<T>) or structs.

Rust has sets and maps. JavaScript has sets and maps. The wasm-bindgen library, however, will not automatically convert between them. So, how can you pass, for example, a HashSet from Rust to JavaScript? Wrap it in your own Rust struct and define the needed methods. Then, mark the struct and those methods with #[wasm_bindgen].
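Here is a sketch of such a wrapper (hypothetical IdSet type; the #[wasm_bindgen] attributes on the struct and its methods are omitted so the sketch compiles with plain std):

```rust
use std::collections::HashSet;

// Sketch (hypothetical IdSet type): wrap a HashSet so JavaScript can use it.
// In a real project, the struct and each method would be marked #[wasm_bindgen].
pub struct IdSet {
    inner: HashSet<u32>,
}

impl IdSet {
    pub fn new() -> Self {
        IdSet { inner: HashSet::new() }
    }
    /// Returns true if the value was newly inserted.
    pub fn insert(&mut self, id: u32) -> bool {
        self.inner.insert(id)
    }
    pub fn contains(&self, id: u32) -> bool {
        self.inner.contains(&id)
    }
    pub fn len(&self) -> u32 {
        self.inner.len() as u32
    }
}

fn main() {
    let mut ids = IdSet::new();
    ids.insert(7);
    ids.insert(7); // duplicate, ignored by the set
    println!("{} {}", ids.len(), ids.contains(7));
}
```

JavaScript would call new IdSet(), then insert, contains, and len as ordinary methods on the wrapped object.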

And now our third category.

Avoid

This is the category for Rust types that JavaScript (via wasm-bindgen) allows but that you shouldn’t use.

Avoid using usize and isize because most people will assume they are 64-bit integers, but in WebAssembly (WASM), they are 32-bit integers. Instead, use u32, i32, u64, or i64.
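A quick way to see the difference is to print the type widths. This sketch prints 8 bytes for usize on a typical 64-bit native target but 4 when compiled for wasm32:

```rust
use std::mem::size_of;

fn main() {
    // usize width depends on the compilation target:
    // 8 bytes on a typical 64-bit native target, 4 bytes on wasm32.
    println!("usize: {} bytes", size_of::<usize>());
    // Fixed-width types mean the same thing on every target.
    println!("u32: {} bytes, i64: {} bytes", size_of::<u32>(), size_of::<i64>());
}
```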

In Rust, char is a special u32 that can contain only valid Unicode scalar values. JavaScript, in contrast, treats a char as a string. It checks for Unicode validity but does not enforce that the string has a length of one. If you need to pass a char from JavaScript into Rust, it’s better to use the String type and then check the length on the Rust side.
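A sketch of that length check on the Rust side (hypothetical first_char_strict helper, not from the article):

```rust
// Sketch (hypothetical first_char_strict helper): accept a String-like value
// from JavaScript and enforce the one-character rule on the Rust side.
fn first_char_strict(s: &str) -> Result<char, String> {
    let mut chars = s.chars();
    match (chars.next(), chars.next()) {
        // Exactly one Unicode scalar value: accept it as a char.
        (Some(c), None) => Ok(c),
        // Empty or longer than one character: reject.
        _ => Err(format!("expected exactly one character, got {s:?}")),
    }
}

fn main() {
    println!("{:?}", first_char_strict("é")); // Ok('é')
    println!("{:?}", first_char_strict("ab")); // an Err
}
```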

Rule 5: Change functions to use supported types. Change files to generic BufRead.

With our knowledge of wasm-bindgen supported types, we can fix up the functions we wish to make available to JavaScript. We left Rule 3’s example with a function like this:

#[wasm_bindgen]
pub fn good_turing(file_name: &str) -> Result<(u32, u32), io::Error> {
    let reader = BufReader::new(File::open(file_name)?);
    // ...
}

We now change the function by removing #[wasm_bindgen] pub. We also change it to read from a generic reader rather than a file name. Using BufRead allows for more flexibility, enabling the function to accept different types of input streams, such as in-memory data or files.

fn good_turing<R: BufRead>(reader: R) -> Result<(u32, u32), io::Error> {
  // delete: let reader = BufReader::new(File::open(file_name)?);
  // ...
}

JavaScript can’t see this function, so we create a wrapper function that calls it. For example:

#[wasm_bindgen]
pub fn good_turing_byte_slice(data: &[u8]) -> Result<Vec<u32>, String> {
    let reader = BufReader::new(data);
    match good_turing(reader) {
        Ok((prediction, actual)) => Ok(vec![prediction, actual]),
        Err(e) => Err(format!("Error processing data: {e}")),
    }
}

This wrapper function takes as input a byte slice (&[u8]), something JavaScript can pass. The function turns the byte slice into a reader and calls the inner good_turing. The inner function returns a Result<(u32, u32), io::Error>. The wrapper function translates this result into Result<Vec<u32>, String>, a type that JavaScript will accept.

In general, I’m only willing to make minor changes to functions that will run both natively and in WASM in the Browser. For example, here I’m willing to change the function to work on a generic reader rather than a file name. When JavaScript compatibility requires major, non-idiomatic changes, I create a wrapper function.
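The inner-plus-wrapper pattern can be sketched end to end. This runnable analog substitutes a hypothetical count_lines_and_bytes analysis for good_turing, and omits the #[wasm_bindgen] attribute on the wrapper so the sketch compiles without the crate:

```rust
use std::io::{self, BufRead, BufReader};

// Inner function: generic over any buffered reader, so it works with
// files natively and in-memory byte slices under WASM. (Hypothetical
// stand-in for good_turing's analysis.)
fn count_lines_and_bytes<R: BufRead>(reader: R) -> Result<(u32, u32), io::Error> {
    let mut lines = 0u32;
    let mut bytes = 0u32;
    for line in reader.lines() {
        let line = line?; // lines() strips the trailing newline
        lines += 1;
        bytes += line.len() as u32;
    }
    Ok((lines, bytes))
}

// Wrapper: takes a byte slice (something JavaScript can pass) and
// translates the result into types wasm-bindgen accepts. In a real
// project this would be marked #[wasm_bindgen].
pub fn count_lines_and_bytes_byte_slice(data: &[u8]) -> Result<Vec<u32>, String> {
    let reader = BufReader::new(data);
    match count_lines_and_bytes(reader) {
        Ok((lines, bytes)) => Ok(vec![lines, bytes]),
        Err(e) => Err(format!("Error processing data: {e}")),
    }
}

fn main() {
    println!("{:?}", count_lines_and_bytes_byte_slice(b"hello\nworld\n"));
}
```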

In the example, after making these changes, the main code now compiles. The original test, however, does not yet compile. Fixing tests is the topic of Rule 6.

Rule 6: Adapt tests, skipping those that don’t apply.

Rule 3 advocated marking every regular test (#[test]) to also be a WASM-in-the-Browser test (#[wasm_bindgen_test]). However, not all tests from native Rust can be run in a WebAssembly environment, due to WASM’s limitations in accessing system resources like files.

In our example, Rule 3 gives us test code that does not compile:

#[cfg(test)]
mod tests {
    use super::*;
    use wasm_bindgen_test::wasm_bindgen_test;
    wasm_bindgen_test::wasm_bindgen_test_configure!(run_in_browser);

    #[test]
    #[wasm_bindgen_test]
    fn test_process_file() {
        let (prediction, actual) = good_turing("./pg100.txt").unwrap();
        assert_eq!(prediction, 10223);
        assert_eq!(actual, 7967);
    }
}

This test code fails because our updated good_turing function expects a generic reader rather than a file name. We can fix the test by creating a reader from the sample file:

    use std::fs::File;

    #[test]
    fn test_process_file() {
        let reader = BufReader::new(File::open("pg100.txt").unwrap());
        let (prediction, actual) = good_turing(reader).unwrap();
        assert_eq!(prediction, 10223);
        assert_eq!(actual, 7967);
    }

This is a fine native test. Unfortunately, we can’t run it as a WASM-in-the-Browser test because it uses a file reader – something WASM doesn’t support.

The solution is to create an additional test:

    #[test]
    #[wasm_bindgen_test]
    fn test_good_turing_byte_slice() {
        let data = include_bytes!("../pg100.txt");
        let result = good_turing_byte_slice(data).unwrap();
        assert_eq!(result, vec![10223, 7967]);
    }

At compile time, this test uses the macro include_bytes! to turn a file into a WASM-compatible byte slice. The good_turing_byte_slice function turns the byte slice into a reader and calls good_turing. (The include_bytes macro is part of the Rust standard library and, therefore, available to tests.)

Note that the additional test is both a regular test and a WASM-in-the-Browser test. As much as possible, we want our tests to be both.

In my range-set-blaze project, I was able to mark almost all tests as both regular and WASM in the Browser. One exception: a test used a Criterion benchmarking function. Criterion doesn’t run in WASM in the Browser, so I marked that test regular only (#[test]).

With both our main code (Rule 5) and our test code (Rule 6) fixed, can we actually run our tests? Not necessarily – we may first need to find JavaScript-friendly dependencies.

Aside: If you are on Windows and run WASM-in-the-Browser tests, you may see "ERROR tiny_http] Error accepting new client: A blocking operation was interrupted by a call to WSACancelBlockingCall. (os error 10004)" This is not related to your tests. You may ignore it.

Rule 7: Change to JavaScript-friendly dependencies, if necessary. Run tests.

Dependencies

The sample project will now compile. With my range-set-blaze project, however, fixing my code and tests was not enough. I also needed to fix several dependencies. Specifically, I needed to add this to my Cargo.toml:

[target.'cfg(all(target_arch = "wasm32", target_os = "unknown"))'.dev-dependencies]
getrandom = { version = "0.2", features = ["js"] }
web-time = "1.1.0"

These two dependencies enable random numbers and provide an alternative time library. By default, WASM in the Browser has no access to random numbers or time. Both dependencies wrap JavaScript functions, making them accessible to and idiomatic for Rust.

Aside: For more information on using cfg expressions in Cargo.toml, see my article: Nine Rust Cargo.toml Wats and Wat Nots: Master Cargo.toml formatting rules and avoid frustration | Towards Data Science (medium.com).

Look for other such JavaScript-wrapping libraries in WebAssembly – Categories – crates.io. Popular crates that I haven’t tried but look interesting include:

  • reqwest (features = ["wasm"]) – HTTP network access
  • plotters – Plotting – includes a demo that controls the HTML canvas object from Rust
  • gloo – Toolkit of JavaScript wrappers

Also see Rule 7 in the previous article – about WASM WASI – for more about fixing dependency issues. In the next article in this series – about no_std and embedded – we’ll go deeper into more strategies for fixing dependencies.

Run Tests

With our dependencies fixed, we can finally run our tests, both regular and WASM in the Browser:

cargo test
cargo test --target wasm32-unknown-unknown

Recall that behind the scenes, our call to cargo test --target wasm32-unknown-unknown:

  • Looks in .cargo/config.toml and sees wasm-bindgen-test-runner (Rule 3).
  • Calls wasm-bindgen-test-runner.
  • Uses Chromedriver to run our tests in Chrome for Testing. (Rule 2, be sure Chrome for Testing and Chromedriver are on your path).

With our tests working, we’re now ready to call our Rust code from a web page.

Rule 8: Connect your web page to your functions.

To call your Rust functions from a web page you must first package your Rust library for the web. We installed wasm-pack in Rule 2. Now, we run it:

wasm-pack build --target web

This compiles your project and creates a pkg output directory that JavaScript understands.

Example

In Rule 1, we created an index.html file that didn’t call Rust. Let’s change it now so that it does call Rust. Here is an example of such an index.html followed by a description of the changes of interest.

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Good-Turing Estimation</title>
</head>
<body>
    <h1>Good-Turing Estimation</h1>
    <input type="file" id="fileInput" />
    <p id="lineCount"></p>

    <script type="module">
        import init, { good_turing_byte_slice } from './pkg/good_turing.js'; // These files are generated by `wasm-pack build --target web`
        const output = document.getElementById('lineCount');
        document.getElementById('fileInput').addEventListener('change', (event) => {
            const file = event.target.files[0];
            if (!file) { output.innerHTML = ''; return } // No file selected
            const reader = new FileReader();
            // When the file is fully read
            reader.onload = async (e) => {
                await init(); // Ensure 'good_turing_byte_slice' is ready
                // View the memory buffer as a Uint8Array
                const u8array = new Uint8Array(e.target.result);
                try { // Actually run the WASM
                    const [prediction, actual] = good_turing_byte_slice(u8array);
                    output.innerHTML =
                        `Prediction (words that appear exactly once on even lines): ${prediction.toLocaleString()}<br>` +
                        `Actual distinct words that appear only on odd lines: ${actual.toLocaleString()}`;
                } catch (err) { // Or output an error
                    output.innerHTML = `Error: ${err}`;
                }
            };
            // Now start to read the file as memory buffer
            reader.readAsArrayBuffer(file);
        });
    </script>
</body>
</html>

Let’s go through the changes of interest.

  • The line below imports two functions into JavaScript from the module file pkg/good_turing.js, which we created using wasm-pack. The default function, init, initializes our Rust-generated WebAssembly (WASM) module. The second function, good_turing_byte_slice, is explicitly imported by including its name in curly brackets.
import init, { good_turing_byte_slice } from './pkg/good_turing.js';
  • Create a new FileReader object, do some setup, and then read the file as an array of bytes.
const reader = new FileReader();
// ... some setup code ...
// Now start to read the file as bytes.
reader.readAsArrayBuffer(file);
  • Here is how we set up code that will run after the file is fully read:
reader.onload = async (e) => {
//...
};
  • This line ensures the WASM module is initialized. The first time it’s called, the module is initialized. On subsequent calls, it does nothing because the module is already ready.
await init(); // Ensure 'good_turing_byte_slice' is ready
  • Extract the byte array from the read file.
// View the memory buffer as a Uint8Array
const u8array = new Uint8Array(e.target.result);
  • Call the Rust-generated WASM function.
const [prediction, actual] = good_turing_byte_slice(u8array);

Aside: Here good_turing_byte_slice is a regular (synchronous) function. If you want, however, you can mark it async on the Rust side and then call it with await on the JavaScript side. If your Rust processing is slow, this can keep your web page more lively.

  • Display the result.
output.innerHTML =
    `Prediction (words that appear exactly once on even lines): ${prediction.toLocaleString()}<br>` +
    `Actual distinct words that appear only on odd lines: ${actual.toLocaleString()}`;
  • If there is an error, display the error message.
try { // Actually run the WASM
    // ...
} catch (err) { // Or output an error
    output.innerHTML = `Error: ${err}`;
}

The final code of the sample project is on GitHub, including a README.md that explains what it is doing. Click this link for a live demo.

range-set-blaze

I ported [range-set-blaze](https://github.com/CarlKCarlK/range-set-blaze) to WASM at a user’s request so that they could use it inside their own project. The range-set-blaze project is typically used as a library in other projects. In other words, you normally wouldn’t expect range-set-blaze to be the centerpiece of a web page. Nevertheless, I did make a small demo page. You can browse it or inspect its index.html. The page shows how range-set-blaze can turn a list of integers into a sorted list of disjoint ranges.

Aside: Host Your WASM-in-the-Browser Project on GitHub for Free

  1. In your project, create a docs folder.
  2. Do wasm-pack build --target web.
  3. Copy (don’t just move) index.html and pkg into docs.
  4. Delete the .gitignore file in docs/pkg.
  5. Check the project into GitHub.
  6. Go to the project on GitHub. Then go to "Settings", "Pages".
  7. Set the branch (in my case main) and the folder to docs. Save.
  8. The URL will be based on your account and project names, for example, https://carlkcarlk.github.io/rustconf24-good-turing/
  9. To update, repeat steps 2 through 5 (inclusive).

Rule 9: Add wasm-pack to your CI (continuous integration) tests.

Your project is now compiling to WASM in the Browser, passing tests, and showcased on a web page. Are you done? Not quite. Because, as I said in the first article:

If it’s not in CI, it doesn’t exist.

Recall that continuous integration (CI) is a system that can automatically run your tests every time you update your code, ensuring that your code continues to work as expected. In my case, GitHub hosts my project. Here’s the configuration I added to .github/workflows/ci.yml to test my project on WASM in the browser:

  test_wasm_unknown_unknown:
    name: Test WASM unknown unknown
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v4
      - name: Set up Rust
        uses: dtolnay/rust-toolchain@master
        with:
          toolchain: stable
          target: wasm32-unknown-unknown
      - name: Install wasm-pack
        run: |
          curl https://rustwasm.github.io/wasm-pack/installer/init.sh -sSf | sh
      - name: Run WASM tests with Chrome
        run: |
          rustup target add wasm32-unknown-unknown
          wasm-pack test --chrome --headless

By integrating WASM in the Browser into CI, I can confidently add new code to my project. CI will automatically test that all my code continues to support WASM in the browser in the future.


So, there you have it – nine rules for porting your Rust code to WASM in the Browser. Here is what surprised me:

The Bad:

  • It’s hard to set up testing for WASM in the Browser. Specifically, Chrome for Testing and Chromedriver are hard to install and manage.
  • WASM in the Browser violates Rust’s saying "If it compiles, it works". If you use an unsupported feature – for example, direct file access – the compiler won’t catch the error. Instead, you will fail at runtime.
  • Passing strings and byte vectors creates two copies of your data, one on the JavaScript side and one on the Rust side.

The Good:

  • WASM in the Browser is useful and fun.
  • You can mark your regular tests to also run in WASM in the Browser. Just mark your tests with both attributes:
#[test]
#[wasm_bindgen_test]
  • You can run on WASM in the Browser without needing to port to no_std. Nevertheless, WASM in the Browser is useful as a steppingstone toward running on embedded/no_std.

Stay tuned! In the next article, we’ll see how to port your Rust code to run in an embedded environment via no_std. This allows your code to run in small devices which I find very cool.


Interested in future articles? Please follow me on Medium. I write about Rust and Python, scientific programming, machine learning, and statistics. I tend to write about one article per month.

The post Nine Rules for Running Rust in the Browser appeared first on Towards Data Science.

Nine Rules for Running Rust on WASM WASI https://towardsdatascience.com/nine-rules-for-running-rust-on-wasm-wasi-550cd14c252a/ Sat, 28 Sep 2024 04:55:12 +0000 Practical Lessons from Porting range-set-blaze to this Container-Like Environment

Rust Running on a Container-Like Environment - Source: https://openai.com/dall-e-2/. All other figures from the author.

Do you want your Rust code to run everywhere – from large servers to web pages, robots, and even watches? In this first of three articles [1, 2, 3], I’ll detail the steps to make that happen.

Running Rust in constrained environments presents challenges. Your code may not have access to a complete operating system such as Linux, Windows, or macOS. You may have limited (or no) access to files, networks, time, random numbers, and even memory. We’ll explore workarounds and solutions.

This first article focuses on running code on "WASM WASI", a container-like environment. We’ll see that WASM WASI may (or may not) be useful in its own right. However, it is valuable as a first step toward running Rust in browsers or Embedded Systems.

Porting code to run on WASM WASI requires many steps and choices. Navigating these choices can be time-consuming, and missing a step can lead to failure. We’ll reduce this complication by offering nine rules, which we’ll explore in detail:

  1. Prepare for disappointment: WASM WASI is easy, but – for now – mostly useless – except as a steppingstone.
  2. Understand Rust targets.
  3. Install the wasm32-wasip1 target and WASMTIME, then create "Hello, WebAssembly!".
  4. Understand conditional compilation.
  5. Run regular tests but with the WASM WASI target.
  6. Understand Cargo features.
  7. Change the things you can: dependency issues by choosing Cargo features, 64-bit/32-bit issues.
  8. Accept that you cannot change everything: Networking, Tokio, Rayon, etc.
  9. Add WASM WASI to your CI (continuous integration) tests.

Aside: These articles are based on a three-hour workshop that I presented at RustConf24 in Montreal. Thanks to the participants of that workshop. A special thanks, also, to the volunteers from the Seattle Rust Meetup who helped test this material. These articles replace an article I wrote last year with updated information.

Before we look at the rules one by one, let’s define our terms.

  • Native: Your home OS (Linux, Windows, macOS)
  • Standard library (std): Provides Rust’s core functionality – Vec, String, file input/output, networking, time.
  • WASM: WebAssembly (WASM) is a binary instruction format that runs in most browsers (and beyond).
  • WASI: WebAssembly System Interface (WASI) allows outside-the-browser WASM to access file I/O, networking (not yet), and time handling.
  • no_std: Instructs a Rust program not to use the full standard library, making it suitable for small, embedded devices or highly resource-constrained environments.
  • alloc: Provides heap memory allocation capabilities (Vec, String, etc.) in no_std environments, essential for dynamically managing memory.

With these terms in mind, we can visualize the environments we want our code to run in as a Venn diagram of progressively tighter constraints. This article details how to move from native to WASM WASI. The second article tells how to then move to WASM in the Browser. The final article covers running Rust in no_std environments, both with and without alloc, ideal for embedded systems.


Based on my experience with [range-set-blaze](https://github.com/CarlKCarlK/range-set-blaze), a data structure project, here are the decisions I recommend, described one at a time. To avoid wishy-washiness, I’ll express them as rules.

Rule 1: Prepare for disappointment: WASM WASI is easy, but – for now – mostly useless – except as a steppingstone.

In 2019, Docker co-creator Solomon Hykes tweeted:

If WASM+WASI existed in 2008, we wouldn’t have needed to created Docker. That’s how important it is. Webassembly on the server is the future of computing. A standardized system interface was the missing link. Let’s hope WASI is up to the task.

Today, if you follow technology news, you’ll see optimistic headlines proclaiming that WASM WASI is ready.

If WASM WASI were truly ready and useful, everyone would already be using it. The fact that we keep seeing these headlines suggests it’s not yet ready. In other words, they wouldn’t need to keep insisting that WASM WASI is ready if it really were.

As of WASI Preview 1, here is how things stand: you can access some file operations and environment variables, as well as time and random number generation. However, there is no support for networking.

WASM WASI might be useful for certain AWS Lambda-style web services, but even that’s uncertain. Because wouldn’t you prefer to compile your Rust code natively and run twice as fast at half the cost compared to WASM WASI?

Maybe WASM WASI is useful for plug-ins and extensions. In genomics, I have a Rust extension for Python, which I compile for 25 different combinations (5 versions of Python across 5 OS targets). Even with that, I don’t cover every possible OS and chip family. Could I replace those OS targets with WASM WASI? No, it would be too slow. Could I add WASM WASI as a sixth "catch-all" target? Maybe, but if I really need portability, I’m already required to support Python and should just use Python.

So, what is WASM WASI good for? Right now, its main value lies in being a step toward running code in the browser or on embedded systems.

Rule 2: Understand Rust targets.

In Rule 1, I mentioned "OS targets" in passing. Let’s look deeper into Rust targets – essential information not just for WASM WASI, but also for general Rust development.

On my Windows machine, I can compile a Rust project to run on Linux or macOS. Similarly, from a Linux machine, I can compile a Rust project to target Windows or macOS. Here are the commands I use to add and check the Linux target to a Windows machine:

rustup target add x86_64-unknown-linux-gnu
cargo check --target x86_64-unknown-linux-gnu

Aside: While cargo check verifies that the code compiles, building a fully functional executable requires additional tools. To cross-compile from Windows to Linux (GNU), you’ll also need to install the Linux GNU C/C++ compiler and the corresponding toolchain. That can be tricky. Fortunately, for the WASM targets we care about, the required toolchain is easy to install.

To see all the targets that Rust supports, use the command:

rustc --print target-list

It will list over 200 targets including x86_64-unknown-linux-gnu, wasm32-wasip1, and wasm32-unknown-unknown.

Target names contain up to four parts: CPU family, vendor, OS, and environment (for example, GNU vs. MSVC):

Target Name parts - figure from author
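The four-part naming can be seen by splitting a triple on hyphens. A small sketch (triple_parts is a made-up helper; note that wasm32-wasip1 has only two parts, illustrating "up to four"):

```rust
// Split a target triple into its (up to four) parts:
// CPU family, vendor, OS, environment.
fn triple_parts(target: &str) -> Vec<String> {
    target.splitn(4, '-').map(str::to_string).collect()
}

fn main() {
    println!("{:?}", triple_parts("x86_64-unknown-linux-gnu"));
    println!("{:?}", triple_parts("wasm32-wasip1"));
}
```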

Now that we understand something of targets, let’s go ahead and install the one we need for WASM WASI.

Rule 3: Install the wasm32-wasip1 target and WASMTIME, then create "Hello, WebAssembly!".

To run our Rust code on WASM outside of a browser, we need to target wasm32-wasip1 (32-bit WebAssembly with WASI Preview 1). We’ll also install WASMTIME, a runtime that allows us to run WebAssembly modules outside of the browser, using WASI.

rustup target add wasm32-wasip1
cargo install wasmtime-cli

To test our setup, let’s create a new "Hello, WebAssembly!" Rust project using cargo new. This initializes a new Rust package:

cargo new hello_wasi
cd hello_wasi

Edit src/main.rs to read:

fn main() {
    #[cfg(not(target_arch = "wasm32"))]
    println!("Hello, world!");
    #[cfg(target_arch = "wasm32")]
    println!("Hello, WebAssembly!");
}

Aside: We’ll look deeper into the #[cfg(...)] attribute, which enables conditional compilation, in Rule 4.

Now, run the project with cargo run, and you should see Hello, world! printed to the console.

Next, create a .cargo/config.toml file, which specifies how Rust should run and test the project when targeting WASM WASI.

[target.wasm32-wasip1]
runner = "wasmtime run --dir ."

Aside: This .cargo/config.toml file is different from the main Cargo.toml file, which defines your project’s dependencies and metadata.

Now, if you say:

cargo run --target wasm32-wasip1

You should see Hello, WebAssembly!. Congratulations! You’ve just successfully run some Rust code in the container-like WASM WASI environment.

Rule 4: Understand conditional compilation.

Now, let’s investigate #[cfg(...)]—an essential tool for conditionally compiling code in Rust. In Rule 3, we saw:

fn main() {
    #[cfg(not(target_arch = "wasm32"))]
    println!("Hello, world!");
    #[cfg(target_arch = "wasm32")]
    println!("Hello, WebAssembly!");
}

The #[cfg(...)] lines tell the Rust compiler to include or exclude certain code items based on specific conditions. A "code item" refers to a unit of code such as a function, statement, or expression.

With #[cfg(...)] lines, you can conditionally compile your code. In other words, you can create different versions of your code for different situations. For example, when compiling for the wasm32 target, the compiler ignores the #[cfg(not(target_arch = "wasm32"))] block and only includes the following:

fn main() {
    println!("Hello, WebAssembly!");
}

You specify conditions via expressions, for example, target_arch = "wasm32". Supported keys include target_os and target_arch. See the Rust Reference for the full list of supported keys. You can also create expressions with Cargo features, which we will learn about in Rule 6.

You may combine expressions with the logical operators not, any, and all. Rust’s conditional compilation doesn’t use traditional if...then...else statements. Instead, you must use #[cfg(...)] and its negation to handle different cases:

#[cfg(not(target_arch = "wasm32"))]
...
#[cfg(target_arch = "wasm32")]
...

To conditionally compile an entire file, place #![cfg(...)] at the top of the file. (Notice the "!"). This is useful when a file is only relevant for a specific target or configuration.
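The ideas above can be combined in a minimal runnable sketch. It shows both the #[cfg(...)] attribute on items and the related cfg! macro, which evaluates to a compile-time boolean in expressions (the platform function is made up for illustration):

```rust
// Two versions of the same function; the compiler keeps exactly one,
// selected by the target architecture.
#[cfg(target_arch = "wasm32")]
fn platform() -> &'static str {
    "wasm"
}

#[cfg(not(target_arch = "wasm32"))]
fn platform() -> &'static str {
    "native"
}

fn main() {
    // cfg! is the expression form: it evaluates to a compile-time
    // boolean rather than including or excluding code items.
    let on_wasm = cfg!(target_arch = "wasm32");
    println!("platform = {}, on_wasm = {}", platform(), on_wasm);
}
```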

You can also use cfg expressions in Cargo.toml to conditionally include dependencies. This allows you to tailor dependencies to different targets. For example, this says "depend on Criterion with Rayon when not targeting wasm32".

[target.'cfg(not(target_arch = "wasm32"))'.dev-dependencies]
criterion = { version = "0.5.1", features = ["rayon"] }

Aside: For more information on using cfg expressions in Cargo.toml, see my article: Nine Rust Cargo.toml Wats and Wat Nots: Master Cargo.toml formatting rules and avoid frustration | Towards Data Science (medium.com).

Rule 5: Run regular tests but with the WASM WASI target.

It’s time to try to run your project on WASM WASI. As described in Rule 3, create a .cargo/config.toml file for your project. It tells Cargo how to run and test your project on WASM WASI.

[target.wasm32-wasip1]
runner = "wasmtime run --dir ."

Next, your project – like all good code – should already contain tests. My range-set-blaze project includes, for example, this test:

#[test]
fn insert_255u8() {
    let range_set_blaze = RangeSetBlaze::<u8>::from_iter([255]);
    assert!(range_set_blaze.to_string() == "255..=255");
}

Let’s now attempt to run your project’s tests on WASM WASI. Use the following command:

cargo test --target wasm32-wasip1

If this works, you may be done – but it probably won’t work. When I try this on range-set-blaze, I get this error message that complains about using Rayon on WASM.

 error: Rayon cannot be used when targeting wasi32. Try disabling default features.
  --> C:\Users\carlk\.cargo\registry\src\index.crates.io-6f17d22bba15001f\criterion-0.5.1\src\lib.rs:31:1
   |

To fix this error, we must first understand Cargo features.

Rule 6: Understand Cargo features.

To resolve issues like the Rayon error in Rule 5, it’s important to understand how Cargo features work.

In Cargo.toml, an optional [features] section allows you to define different configurations, or versions, of your project depending on which features are enabled or disabled. For example, here is a simplified part of the Cargo.toml file from the Criterion benchmarking project:

[features]
default = ["rayon", "plotters", "cargo_bench_support"]
rayon = ["dep:rayon"]
plotters = ["dep:plotters"]
html_reports = []
cargo_bench_support = []

[dependencies]
#...
# Optional dependencies
rayon = { version = "1.3", optional = true }
plotters = { version = "^0.3.1", optional = true, default-features = false, features = [
    "svg_backend",
    "area_series",
    "line_series",
] }

This defines four Cargo features: rayon, plotters, html_reports, and cargo_bench_support. Since each feature can be included or excluded, these four features create 16 possible configurations of the project. Note also the special default Cargo feature.

A Cargo feature can include other Cargo features. In the example, the special default Cargo feature includes three other Cargo features – rayon, plotters, and cargo_bench_support.

A Cargo feature can include a dependency. The rayon Cargo feature above includes the rayon crate as a dependent package.

Moreover, dependent packages may have their own Cargo features. For example, the plotters Cargo feature above includes the plotters dependent package with the following Cargo features enabled: svg_backend, area_series, and line_series.

You can specify which Cargo features to enable or disable when running cargo check, cargo build, cargo run, or cargo test. For instance, if you’re working on the Criterion project and want to check only the html_reports feature without any defaults, you can run:

cargo check --no-default-features --features html_reports

This command tells Cargo not to include any Cargo features by default but to specifically enable the html_reports Cargo feature.

Within your Rust code, you can include/exclude code items based on enabled Cargo features. The syntax uses #[cfg(...)], as per Rule 4:

#[cfg(feature = "html_reports")]
SOME_CODE_ITEM
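For example, here is a compilable sketch with a hypothetical report function. When the html_reports Cargo feature is not enabled in Cargo.toml, the not branch is the one compiled in:

```rust
// Hypothetical sketch: two versions of `report`, selected by whether the
// `html_reports` Cargo feature is enabled in Cargo.toml.
#[cfg(feature = "html_reports")]
fn report() -> &'static str {
    "generating HTML report"
}

#[cfg(not(feature = "html_reports"))]
fn report() -> &'static str {
    "HTML reports disabled"
}

fn main() {
    println!("{}", report());
}
```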

With this understanding of Cargo features, we can now attempt to fix the Rayon error we encountered when running tests on WASM WASI.

Rule 7: Change the things you can: dependency issues by choosing Cargo features, 64-bit/32-bit issues.

When we tried running cargo test --target wasm32-wasip1, part of the error message stated: Criterion ... Rayon cannot be used when targeting wasi32. Try disabling default features. This suggests we should disable Criterion’s rayon Cargo feature when targeting WASM WASI.

To do this, we need to make two changes in our Cargo.toml. First, we need to disable the rayon feature from Criterion in the [dev-dependencies] section. So, this starting configuration:

[dev-dependencies]
criterion = { version = "0.5.1", features = ["html_reports"] }

becomes this, where we explicitly turn off the default features for Criterion and then enable all the Cargo features except rayon.

[dev-dependencies]
criterion = { version = "0.5.1", features = [
        "html_reports",
        "plotters",
        "cargo_bench_support"],
      default-features = false }

Next, to ensure rayon is still used for non-WASM targets, we add it back in with a conditional dependency in Cargo.toml as follows:

[target.'cfg(not(target_arch = "wasm32"))'.dev-dependencies]
criterion = { version = "0.5.1", features = ["rayon"] }

In general, when targeting WASM WASI, you may need to modify your dependencies and their Cargo features to ensure compatibility. Sometimes this process is straightforward, but other times it can be challenging – or even impossible, as we’ll discuss in Rule 8.

Aside: In the third article in this series – about no_std and embedded – we go deeper into strategies for fixing dependencies.

After running the tests again, we move past the previous error, only to encounter a new one, which is progress!

#[test]
fn test_demo_i32_len() {
    assert_eq!(demo_i32_len(i32::MIN..=i32::MAX), u32::MAX as usize + 1);
}

error: attempt to compute `usize::MAX + 1_usize`, which would overflow

The compiler complains that u32::MAX as usize + 1 overflows. On 64-bit Windows the expression doesn’t overflow because usize is the same as u64 and can hold u32::MAX as usize + 1. WASM, however, is a 32-bit environment so usize is the same as u32 and the expression is one too big.

The fix here is to replace usize with u64, ensuring that the expression doesn’t overflow. More generally, the compiler won’t always catch these issues, so it’s important to review your use of usize and isize. If you’re referring to the size or index of a Rust data structure, usize is correct. However, if you’re dealing with values that exceed 32-bit limits, you should use u64 or i64.
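To make the fix concrete, here is a hypothetical sketch (the article doesn't show `demo_i32_len`'s real body, so this implementation is my assumption): the function returns `u64`, so the count of all `i32` values fits even when `usize` is only 32 bits.

```rust
// Hypothetical sketch: demo_i32_len's real body isn't shown in the article.
// Returning u64 means the count of all i32 values fits even on 32-bit targets.
fn demo_i32_len(range: std::ops::RangeInclusive<i32>) -> u64 {
    // Compute the length arithmetically; iterating 2^32 items would be slow.
    (*range.end() as i64 - *range.start() as i64 + 1) as u64
}

fn main() {
    // The assertion now compares u64 values, so it cannot overflow on WASM.
    assert_eq!(demo_i32_len(i32::MIN..=i32::MAX), u32::MAX as u64 + 1);
    println!("ok");
}
```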

Aside: In a 32-bit environment, a Rust array, Vec, BTreeSet, etc., can theoretically hold up to 2³²−1 = 4,294,967,295 elements. However, this is only a theoretical limit based on addressable memory.

Aside Aside: The actual maximum number of elements is even more limited. Rust limits our allocations to an isize, so 2³¹−1 (about 2 billion) bytes. If each element is, for example, 2 bytes, we can have at most about 1 billion elements.
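The arithmetic in the aside can be sketched like this (illustrative only; it models the 32-bit allocation cap rather than measuring a real target):

```rust
// Sketch of the aside's arithmetic: on a 32-bit target, Rust caps a single
// allocation at isize::MAX = 2^31 - 1 bytes.
fn max_elements(elem_size_bytes: u64) -> u64 {
    let max_alloc_bytes = (1u64 << 31) - 1; // 2,147,483,647 bytes
    max_alloc_bytes / elem_size_bytes
}

fn main() {
    // 2-byte elements, as in the aside: about 1 billion elements.
    println!("{}", max_elements(2)); // prints 1073741823
}
```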

So, we’ve fixed the dependency issue and addressed a usize overflow. But can we fix everything? Unfortunately, the answer is no.

Rule 8: Accept that you cannot change everything: Networking, Tokio, Rayon, etc.

WASM WASI Preview 1 (the current version) supports file access (within a specified directory), reading environment variables, and working with time and random numbers. However, its capabilities are limited compared to what you might expect from a full operating system.

If your project requires networking, asynchronous tasks with Tokio, or multithreading with Rayon, you're out of luck: these features aren't supported in Preview 1.
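If you need one of these features on native targets, one workaround (my sketch, not from the article) is to gate it at compile time, so the WASM build never reaches the unsupported code path:

```rust
// Sketch: gate an unsupported capability at compile time so callers can
// check availability instead of discovering a failure at runtime.
#[cfg(not(target_arch = "wasm32"))]
fn networking_supported() -> bool {
    true
}

#[cfg(target_arch = "wasm32")]
fn networking_supported() -> bool {
    false // WASM WASI Preview 1 offers no socket support in std
}

fn main() {
    if networking_supported() {
        println!("native target: networking code paths may run");
    } else {
        println!("WASM WASI: skipping networking code paths");
    }
}
```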

Fortunately, WASM WASI Preview 2 is expected to improve upon these limitations, offering more features, including better support for networking and possibly asynchronous tasks.

Rule 9: Add WASM WASI to your CI (continuous integration) tests.

So, your tests pass on WASM WASI, and your project runs successfully. Are you done? Not quite. Because, as I like to say:

If it’s not in CI, it doesn’t exist.

Continuous integration (CI) is a system that can automatically run your tests every time you update your code, ensuring that your code continues to work as expected. By adding WASM WASI to your CI, you can guarantee that future changes won’t break your project’s compatibility with the WASM WASI target.

In my case, my project is hosted on GitHub, and I use GitHub Actions as my CI system. Here’s the configuration I added to .github/workflows/ci.yml to test my project on WASM WASI:

test_wasip1:
  name: Test WASI P1
  runs-on: ubuntu-latest
  steps:
    - name: Checkout
      uses: actions/checkout@v4
    - name: Set up Rust
      uses: dtolnay/rust-toolchain@master
      with:
        toolchain: stable
        targets: wasm32-wasip1
    - name: Install Wasmtime
      run: |
        curl https://wasmtime.dev/install.sh -sSf | bash
        echo "${HOME}/.wasmtime/bin" >> $GITHUB_PATH
    - name: Run WASI tests
      run: cargo test --verbose --target wasm32-wasip1

By integrating WASM WASI into CI, I can confidently add new code to my project. CI will automatically test that all my code continues to support WASM WASI in the future.


So, there you have it – nine rules for porting your Rust code to WASM WASI. Here is what surprised me about porting to WASM WASI:

The Bad:

  • Running on WASM WASI offers little utility today. However, it holds the potential to be useful tomorrow.
  • In Rust, there’s a common saying: "If it compiles, it works." Unfortunately, this doesn’t always hold true for WASM WASI. If you use an unsupported feature, like networking, the compiler won’t catch the error. Instead, it will fail at runtime. For example, this code compiles and runs on WASM WASI but always returns an error because networking isn’t supported.
use std::net::TcpStream;

fn main() {
    match TcpStream::connect("crates.io:80") {
        Ok(_) => println!("Successfully connected."),
        Err(e) => println!("Failed to connect: {e}"),
    }
}

The Good:

  • Running on WASM WASI is a good first step toward running your code in the browser and on embedded systems.
  • You can run Rust code on WASM WASI without needing to port to no_std. (Porting to no_std is the topic of the third article of this series.)
  • You can run standard Rust tests on WASM WASI, making it easy to verify your code.
  • The .cargo/config.toml file and Rust’s --target option make it incredibly straightforward to configure and run your code on different targets—including WASM WASI.
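For reference, here is a minimal sketch of what that .cargo/config.toml can look like (the runner line is my assumption, not quoted from the article, and it presumes Wasmtime is installed):

```toml
# Run binaries and tests for the WASM WASI target under Wasmtime,
# granting the guest access to the current directory.
[target.wasm32-wasip1]
runner = "wasmtime --dir=."
```

With this in place, a plain cargo test --target wasm32-wasip1 executes the compiled tests under Wasmtime automatically.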

Stay tuned! In the next article, you’ll see how to port your Rust code to run on WASM in the browser – an ability I find super useful. After that, the final article will explain porting code to embedded systems, which I find incredibly cool.

Aside: Interested in future articles? Please follow me on Medium. I write about Rust and Python, scientific programming, machine learning, and statistics. I tend to write about one article per month.

The post Nine Rules for Running Rust on WASM WASI appeared first on Towards Data Science.
