The article is a bit light on technical details, but the following is noteworthy:
> Flight software for rockets at SpaceX is structured around the concept of a control cycle. “You read all of your inputs: sensors that we read in through an ADC, packets from the network, data from an IMU, updates from a star tracker or guidance sensor, commands from the ground,” explains Gerding. “You do some processing of those to determine your state, like where you are in the world or the status of the life support system. That determines your outputs – you write those, wait until the next tick of the clock, and then do the whole thing over again.”
Wow, sounds an awful lot like a typical event loop from a game engine.
Of course the main difference from a game engine would be the reliability requirements. The level of verification effort that goes into flight control software isn't remotely comparable to the effort that goes into a game engine (assuming the latter is greater than zero :))
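The quoted cycle is the classic read-compute-write loop. A minimal, self-contained sketch of it (every type and helper name here is a hypothetical stand-in, not SpaceX's actual code):

```c
/* Minimal sketch of the quoted control cycle; all names are hypothetical. */
#include <stdbool.h>

typedef struct { double accel; double gps_alt; int ground_cmd; } inputs_t;
typedef struct { double altitude; double velocity; } state_t;
typedef struct { double valve_cmd; } outputs_t;

static void read_inputs(inputs_t *in) { in->accel = 0.0; in->gps_alt = 0.0; in->ground_cmd = 0; }
static void estimate_state(const inputs_t *in, state_t *st) { st->altitude = in->gps_alt; st->velocity = 0.0; }
static void compute_outputs(const state_t *st, outputs_t *out) { out->valve_cmd = (st->altitude < 1000.0) ? 1.0 : 0.0; }
static void write_outputs(const outputs_t *out) { (void)out; /* drive valves/actuators, send telemetry */ }
static void wait_for_next_tick(void) { /* block until the next fixed tick, e.g. 20 ms */ }

int main(void)
{
    while (true) {
        inputs_t in; state_t st; outputs_t out;
        read_inputs(&in);            /* ADC samples, network packets, IMU data, ground commands */
        estimate_state(&in, &st);    /* "where you are in the world", system status */
        compute_outputs(&st, &out);  /* control law turns state into actuator commands */
        write_outputs(&out);         /* write the outputs */
        wait_for_next_tick();        /* wait for the next tick of the clock, then repeat */
    }
}
```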
> Wow, sounds an awful lot like a typical event loop from a game engine.
I coulda sworn I recall John Carmack making a similar comparison when he was working on Armadillo Aerospace, but implying that rockets were actually a bit simpler.
Embedded high-integrity systems have traditionally used this kind of cycle.
It's simple to reason about and analyse. You can make assertions about worst-case execution time and memory usage, and you know for a fact what the order of execution is. The concurrency on large systems like this often comes in the form of a lot of small computers talking to each other over some network/bus rather than having many systems all running on the same hardware.
Building concurrent systems with predictable real-time characteristics is hard. When you have a bunch of things that really need to happen every Nth of a second in order to fly straight, a simple approach like this is definitely preferable. In situations like this, predictability tends to be more important than raw performance (assuming you can reliably hit minimums).
That doesn't mean people don't use multi-threading in domains like this, but it's one of those things you want to keep as simple as you can and avoid where possible.
Seems like a really clean and testable system, too. Can make a harness with a set of inputs, run the cycle and test outputs consistently. Performance is also easily checked: each cycle needs to run within 100ms for the 10 Hz systems, I guess, including garbage collection.
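A sketch of that kind of harness, assuming the cycle body is factored into a pure step function (the names, types and trivial control law are made up for illustration):

```c
/* Hypothetical harness: drive one tick of the cycle with canned inputs
 * and check the outputs deterministically. Illustrative only. */
#include <assert.h>

typedef struct { double sensor; } inputs_t;
typedef struct { double actuator; } outputs_t;

/* The function under test: one tick of the control cycle. */
static void control_step(const inputs_t *in, outputs_t *out)
{
    out->actuator = (in->sensor > 0.5) ? 1.0 : 0.0;
}

int main(void)
{
    static const inputs_t  test_in[]  = { {0.2}, {0.7}, {0.9} };  /* test vector */
    static const outputs_t expected[] = { {0.0}, {1.0}, {1.0} };  /* response vector */

    for (int i = 0; i < 3; i++) {
        outputs_t out;
        control_step(&test_in[i], &out);
        assert(out.actuator == expected[i].actuator);
    }
    return 0;
}
```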
Not only does flight control software typically never use garbage collection, it is also preferable to never allocate memory at runtime -- all memory ever needed is allocated at program startup, and memory is not acquired or released until landing or some other major end event. That's because memory issues are often the most significant causes of crashes, which you don't want to translate into, well... a different kind of crash.
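A minimal sketch of that allocate-everything-up-front pattern (sizes and names are hypothetical):

```c
/* Illustrative only: all buffers are sized statically, so the flight
 * loop never calls malloc/free after startup. */
#include <stdint.h>

#define MAX_TELEMETRY_RECORDS 4096   /* hypothetical worst-case sizing */

typedef struct { uint32_t timestamp_ms; float value; } telemetry_record_t;

/* Statically allocated pool; filled during flight, never freed or resized. */
static telemetry_record_t telemetry_pool[MAX_TELEMETRY_RECORDS];
static uint32_t telemetry_count = 0;

static int log_telemetry(uint32_t t_ms, float value)
{
    if (telemetry_count >= MAX_TELEMETRY_RECORDS)
        return -1;                   /* pool exhausted: handled explicitly, not a heap failure */
    telemetry_pool[telemetry_count].timestamp_ms = t_ms;
    telemetry_pool[telemetry_count].value = value;
    telemetry_count++;
    return 0;
}

int main(void) { return log_telemetry(0, 1.0f); }
```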
The creator of C++, Bjarne, wrote a nice paper on flight control software if you are interested in more.
> This sparked an interesting memory for me. I was once working with a customer who was producing on-board software for a missile. In my analysis of the code, I pointed out that they had a number of problems with storage leaks. Imagine my surprise when the customer's chief software engineer said "Of course it leaks". He went on to point out that they had calculated the amount of memory the application would leak in the total possible flight time for the missile and then doubled that number. They added this much additional memory to the hardware to "support" the leaks. Since the missile will explode when it hits its target or at the end of its flight, the ultimate in garbage collection is performed without programmer intervention.
Very funny. But leaking memory until you have to restart the process seems to be a very common strategy in practice. The programs explode even if the computer doesn't.
Many moons ago I remember talking with someone who works in HFT software and they said they'd just load the system with tons of RAM because allocation/deallocation and even manual memory management was too slow.
Allocating/deallocating memory has a cost; if you can afford to just add more memory, that's faster than any clever memory management solution you have.
Ah, thanks for the link; yes this is the one I was referring to. I had thought it was by Bjarne since it was on his site, but either way, it's an interesting read.
Interesting. I wonder how compiler-dependent rule 10 is. Like, if I'm writing for a compiler that gives really bad and usually unhelpful warnings that make my code worse... but I suppose these are very strict guidelines rather than hard rules.
That section explicitly addresses that question and says you should rewrite it in a way that avoids the warning, since "usually unhelpful" means "sometimes critical". It's certainly an uncompromising view but that's what you get when failure is disastrous.
Pretty sure they use a language without a garbage collector for the control software. Probably C or Ada. Look at the FAA specifications for more guidelines. For airplanes it's DO-178C I think, not sure about rockets.
EDIT: One of the coolest projects I've seen recently is CoPilot, a Haskell dialect that compiles down to C control loops which are statically guaranteed to run in constant time and memory. There is also arduino-copilot if you want to play with ultra-hard real-time software but can't afford an entire rocket.
But it is worth noting that these control loops can often have some sort of memory, so you normally need to test over multiple cycles - you usually have a test vector to load in, and watch for the response vector.
Trivial example would be if you had a PID controller implemented into your main loop. Your main loop would be storing the integral and "previous" error term.
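A small sketch of such a PID step, with the integral and previous-error terms as the state carried across cycles (the gains, dt, and the toy plant in main are made-up values):

```c
/* Sketch of a PID update embedded in the main loop: the integral and
 * previous-error terms are the "memory" carried from tick to tick. */
typedef struct {
    double kp, ki, kd;     /* gains (placeholder values below) */
    double integral;       /* accumulated error: state kept between ticks */
    double prev_error;     /* error from the previous tick: state */
} pid_state_t;

static double pid_step(pid_state_t *pid, double setpoint, double measured, double dt)
{
    double error = setpoint - measured;
    pid->integral += error * dt;
    double derivative = (error - pid->prev_error) / dt;
    pid->prev_error = error;
    return pid->kp * error + pid->ki * pid->integral + pid->kd * derivative;
}

int main(void)
{
    pid_state_t pid = { .kp = 1.0, .ki = 0.1, .kd = 0.05, .integral = 0.0, .prev_error = 0.0 };
    double measured = 0.0;

    for (int tick = 0; tick < 100; tick++) {          /* 100 cycles at dt = 0.02 s (50 Hz) */
        double cmd = pid_step(&pid, 1.0, measured, 0.02);
        measured += cmd * 0.02;                       /* toy plant: output follows the command */
    }
    return 0;
}
```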
Yes, you could in principle convert any memory within a given loop into a separate unit. You just gotta weigh the tradeoff between having more I/O and more software/hardware units, versus having more memory within specific software/hardware units.
You just gotta find a balance that works for your requirements and constraints. In my application (it was medical devices), we found that stateful loops fit nicely.
That's the irony of closed loop control systems. They appear to be simple but the emergent behavior, particularly when there is a hierarchy of control loops, is incredibly complex.
I had once been thinking about this type of system, and later found that SNMP (1988) as well as the NIST/Army 4D/RCS project (1980s) had this train of thought before I was even born. Now I'm wondering why this type of distributed, synchronized, hard real-time decision-making framework doesn't seem to exist as some Apache Foundation project or something. It just sounds right and massively useful, but I can't find an implementation, at least on the public Internet -- why?
If by distributed you mean "over multiple computers", then yeah... this is a really specialized field. Your ability to synchronize and maintain hard real-timeness between computers is completely dependent on I/O, hardware features, and topology. This makes it much harder to make a generalized framework.
If we ignore the distributed portion, then you're describing a real-time operating system - in which case something like https://www.freertos.org/ might fit the bill?
I mean real-time inter-node message passing in a tree topology network, with message frequency associated with edges, like shown in [1]. Each node might run on its own CPU clock, but message exchanges occur in real time, progressively faster at each depth, and the system as a whole (which is probably what "Autonomous System" is supposed to mean) runs as a slow but coherent real-time cluster.
Perhaps the idea that message passing has to happen in real time is something obvious or elementary to military/defense/aerospace people from technical requirements or experience fighting the fog of war etc., but to me it was dumbfoundingly new, and I guess it might also be for the general public?
See BEAM/OTP, part of Erlang and now Elixir, Gleam, etc.
Erlang came out of Ericsson when they were building giant telephone switches.
Nowadays things like RabbitMQ and WhatsApp run on the BEAM VM.
The BEAM handles both local and remote nodes, mostly transparently.
Then there are things like Nomad or k8s, which try to tackle the problem more like *nix (Linux/BSD/etc.) does, but across multiple nodes. These are obviously not meant for hard real-time systems like launching rockets, not even for soft real-time.
You can achieve pretty good latency with Erlang, though, because of the lack of shared global memory (most if not all allocation is local to an Erlang process).
The article's indeed light on details, but what it describes sounds very similar to autonomous driving / driver assistance software I have experience with. That's not really surprising, as the overall system for an autonomous car and a spaceship is very similar - keep processing inputs from sensors, calculate some values and manage state machines, use those results to adjust outputs, some of which may operate actuators, and repeat.
The similarity with game event loops is IMO superficial. A game typically processes events as fast as it can, and the time between any two iterations of the loop is completely unpredictable. One frame could render in 20ms and the next in 25ms, which is completely fine. For a car or spaceship though, there are hard real-time requirements. Like the article describes, some Dragon tasks run every 100ms, others run every 20ms. A lot of effort has definitely gone into making sure the tasks can complete quickly enough, with some margin, and there are definitely systems that monitor these deadlines and treat even one timing miss as a major failure.
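A sketch of what per-cycle deadline monitoring might look like, using POSIX timing calls as stand-ins for whatever primitives the real RTOS provides (the fault policy and task body are placeholders):

```c
/* Sketch: a 100 ms (10 Hz) task that checks its own deadline every cycle.
 * POSIX clock_gettime/clock_nanosleep are stand-ins for RTOS primitives. */
#include <time.h>
#include <stdlib.h>

#define PERIOD_NS 100000000L             /* 100 ms = 10 Hz task */

static void do_cycle_work(void) { /* read inputs, compute, write outputs */ }
static void declare_timing_fault(void) { abort(); }   /* placeholder policy */

int main(void)
{
    struct timespec next;
    clock_gettime(CLOCK_MONOTONIC, &next);

    for (;;) {
        do_cycle_work();

        struct timespec now;
        clock_gettime(CLOCK_MONOTONIC, &now);

        /* advance the absolute deadline by one period */
        next.tv_nsec += PERIOD_NS;
        if (next.tv_nsec >= 1000000000L) { next.tv_nsec -= 1000000000L; next.tv_sec++; }

        /* even a single overrun is treated as a fault */
        if (now.tv_sec > next.tv_sec ||
            (now.tv_sec == next.tv_sec && now.tv_nsec > next.tv_nsec))
            declare_timing_fault();

        /* sleep until the absolute deadline (not a relative delay),
         * so jitter does not accumulate across cycles */
        clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
    }
}
```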
"One frame could render in 20ms and the next in 25ms, which is completely fine."
No, it is not fine if you want really smooth gameplay. So in game loops, too, you can adjust for that by checking the time difference.
But sure, the difference is that if you miss a frame in a game loop, no real rockets crash, so there is probably (and hopefully) not as much effort put into that as there is at SpaceX.
Framerates are averaged over much longer periods than two frames. Rendering 45 frames in 20ms each and then having a couple take 25ms just doesn't matter mostly, and addressing that is the kind of extra "nice to have" (and even then not for every kind of game). Yeah it's nice to have consistent frame times but less important than avoiding individual frames getting very slow, and a framerate drop certainly won't be treated by the engine as a critical failure that causes e.g. a crash or restart.
On a hard real-time system like the Dragon, failing the deadline once is a critical error just like a game engine being unable to talk to the GPU.
Makes sense from a financial standpoint too since game devs are known to be passionate enough to accept lower salaries if they can work on cool stuff. Not necessarily knocking it (not my preference), but it matches exactly what I hear from people that work there. The recruiters even say the pay is low, but you'll have "SpaceX" on your resume so it's worth it.
The part that's way harder though is that at least in a game engine, you know the absolute world state and when you make a change to your world state, it happens exactly as you wanted.
In dynamic systems, such as a rocket, robot, etc., reality is fuzzy. Your loop is mostly about "what is my best guess at my present state" and "what is my desired next state". You make changes to try and get from A to B but you may not exactly achieve B. You may have not even been in state A. The error propagates back into your next guess of state, and this repeats forever.
Sensors like GPS give you an absolute position but they're imprecise. Inertial navigation is extremely accurate in telling you the change from a sample to the next, but as your position is the second integral of your acceleration, any error compounds quickly (admittedly the quality of the INS available to people willing to spend 10s of millions of dollars on a vehicle FAR exceeds what you'd be able to put in a car). Some rockets have even used ground based radar measurements uplinked to them to further improve position estimates. They probably still do this, I just don't have any hard data on current launchers.
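A toy 1D illustration of that tradeoff: double-integrating a biased accelerometer drifts without bound, while slow corrections toward a noisy absolute fix keep the error bounded. A real vehicle would use a proper Kalman filter; every number here is invented:

```c
/* Toy dead-reckoning + absolute-fix blend. Purely illustrative. */
#include <stdio.h>

int main(void)
{
    const double dt    = 0.1;    /* 10 Hz cycle */
    const double alpha = 0.02;   /* position correction gain toward the fix */
    const double beta  = 0.002;  /* velocity correction gain toward the fix */

    double pos_est = 0.0, vel_est = 0.0;

    for (int k = 0; k < 6000; k++) {                    /* ten simulated minutes */
        double accel_meas = 0.0 + 0.01;                 /* truth is 0; constant bias */
        double gps_meas   = 0.0 + ((k % 7) - 3) * 0.5;  /* truth is 0; noisy fix */
        double residual;

        /* dead-reckoning predict: integrating the bias twice compounds quickly */
        vel_est += accel_meas * dt;
        pos_est += vel_est * dt;

        /* correct: nudge the estimate toward the absolute (noisy) position */
        residual = gps_meas - pos_est;
        pos_est += alpha * residual;
        vel_est += beta  * residual;
    }
    /* stays small; drop the two correction lines and it drifts off without bound */
    printf("position estimate after 10 min: %.2f m\n", pos_est);
    return 0;
}
```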
You could abstractly structure the computation as a graph (there are many concrete approaches to that) and only recompute the parts that change due to changed input or changed intermediate results.
If you have multiple outputs you can also have a scheduling/prioritization system for the subtasks.
And yes, use interrupts or multiple timers to detect only changed parts without having to compare current input to previous input.
It's basically the same problem as updating a browser DOM in response to application state changes.
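In its simplest form that's just dirty-flagging. A bare-bones sketch (purely illustrative) where a derived value is recomputed only when one of its inputs actually changed:

```c
/* "Recompute only what changed": each derived value carries a dirty flag
 * set when any of its inputs change. Everything here is illustrative. */
#include <stdbool.h>
#include <stdio.h>

static double accel_input;           /* sensor node */
static bool   accel_dirty = false;

static double velocity_est;          /* derived node, depends on accel */
static double attitude_est;          /* derived node, does NOT depend on accel */

static void set_accel(double v)
{
    if (v != accel_input) {          /* only mark dirty on a real change */
        accel_input = v;
        accel_dirty = true;
    }
}

static void update_graph(double dt)
{
    if (accel_dirty) {               /* recompute only the affected subgraph */
        velocity_est += accel_input * dt;
        accel_dirty = false;
    }
    /* attitude_est untouched: none of its inputs changed this cycle */
}

int main(void)
{
    set_accel(9.81);
    update_graph(0.02);
    printf("v = %.3f, att = %.3f\n", velocity_est, attitude_est);
    return 0;
}
```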
Right, but since you typically want timing guarantees (for example, "respond to change in measured acceleration within 20ms"), you'd often end up structuring and resourcing everything for the worst case (everything needs to be computed) anyways... which means that your graph optimization didn't end up saving any resources.
Also, there sometimes ARE interrupts meshed in with these main-loop based systems (for whatever reason). They just aren't predominant. If you're willing to throw more hardware at things, you can typically always transform anything into a buffer.
That pretty well describes the board in front of me ATM. Most of the system runs on a 20ms (I think) tick to manage the sensor read/evaluate/control loops, but there's a 10kHz timer interrupt always going that handles commutation of a stepper motor if it needs to run. 99.9% of the time, the interrupt fires, sees there's nothing to do and just returns.
For battery-electric use cases it saves energy: you do just the necessary computations and you sleep the CPU until the next cycle. If little has changed you can sleep a lot. If a lot of rumpus is happening, you use the full cycle.
Remind me not to drive a vehicle where you implemented the control systems.
This is a great example of a little knowledge being dangerous. There's a reason control systems are not implemented the same way as a browser DOM. There's a reason control systems are designed to avoid subtasks, conditional execution, interrupts (as much as possible) and complex data models. The reasons are pretty damn solid, IMO.
Conceptually, you can talk about the two domains as being somehow equivalent. That says nothing about the appropriate implementations.
No, as in the graph starts from nodes corresponding to sensor inputs, with edges going to nodes that represent computations based on those, to further computations, to the outputs that drive actuators.
If only one sensor input changes then intermediate computations that don't depend on it don't need to be updated.
Right. I can see how that could be useful in some situations but presumably it wouldn't work very well for a very dynamic situation like a rocket in flight (changing height, orientation, fuel, thrust every millisecond)
And you still need a control loop to look for sensor changes? It's just a way of caching most of the computation?
It can work well for dynamic event-driven problems, with no need for a central busy loop[1]. Any sensor changes are responsible for triggering an update in any nodes that depend on it, who are responsible for triggering nodes that depend on those, and so on. One way to do it: abstract class Node, which has a list of child nodes that it loops through and alerts whenever there's been a state change in the parent node.
[1] There is a busy loop but it's only there to trigger tasks on a timer.
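One possible C rendering of that Node idea, where a changed node notifies its dependents and the update propagates down the graph (entirely illustrative):

```c
/* A node alerts its children whenever its own value changes;
 * children update themselves and may propagate further. */
#include <stdio.h>

#define MAX_CHILDREN 4

typedef struct node {
    double value;
    void (*on_parent_changed)(struct node *self, double parent_value);
    struct node *children[MAX_CHILDREN];
    int n_children;
} node_t;

static void node_set(node_t *n, double v)
{
    if (v == n->value)
        return;                              /* no change, nothing propagates */
    n->value = v;
    for (int i = 0; i < n->n_children; i++) {
        node_t *c = n->children[i];
        c->on_parent_changed(c, v);          /* child updates and may propagate */
    }
}

static void double_it(node_t *self, double parent_value)
{
    node_set(self, parent_value * 2.0);      /* derived node: twice its input */
}

int main(void)
{
    node_t derived = { .value = 0.0, .on_parent_changed = double_it };
    node_t sensor  = { .value = 0.0, .children = { &derived }, .n_children = 1 };

    node_set(&sensor, 3.0);                  /* change propagates to 'derived' */
    printf("derived = %.1f\n", derived.value);
    return 0;
}
```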
I’ve never seen this implemented well in a real-time setup. I’ve come to associate it with everything falling apart and otherwise being taped together in weird ways to function.
It sounds elegant, but I’m not convinced there’s a benefit to a model of abstractions of abstractions like this unless it’s done at the language level. In practice, I’ve only seen this produce a web of race conditions, performance issues, and system complexity. My experience is anecdotal.
I've seen it work well once, but others struggled to contribute to it because it was so complicated and it added little benefit to the usual busy loop approach. So I'm in the same boat as you.
One reason why this is dangerous is that you can see extremely large changes in the CPU load. For a bunch of cycles, there's apparently nothing to do, then there's a whole bunch of stuff to do, then there's almost nothing etc.
This makes it hard to reason about the overall capacity of the system (in terms of numbers of sensors and actuators). You will always end up uncertain whether you've ever really evaluated the "full load" condition, even if you formally test it in ways that say you have.
The flight software is actually real-time (a.k.a. deterministic) embedded systems software. And the control loop is the typical control loop found in embedded systems: 1 - read data from sensors through ADC (Analog-to-Digital Converters), I2C, SPI, CAN (Controller Area Network) and so on; 2 - compute the output to actuators, such as motors, hydraulic cylinders, valves, lights and so on, using some control law and the current state; 3 - repeat the cycle. The algorithm that drives the output may be based on algorithms from control theory, namely a state-space model, PID control or a Kalman filter. The computers they may be using might be single-board computers based on ARM-core, MIPS-core, PowerPC or even x86 variants for embedded systems, and/or lots of microcontrollers. Before the advent of digital computers, control theory algorithms were implemented using "analog computers", which are specially designed analog circuits for computing differential equations.
Control theory algorithms can be developed using tools such as Matlab, Matlab-Simulink, Modelica or Scilab-Scicos (open-source, from Inria). Besides C or C++, another language used in embedded systems like this is Ada, which is much safer and more reliable than both C and C++.
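For reference, the standard discrete-time state-space form mentioned above, evaluated once per control tick, is

$$
x_{k+1} = A\,x_k + B\,u_k, \qquad y_k = C\,x_k + D\,u_k
$$

where $x_k$ is the state, $u_k$ the input (sensor readings/setpoints) and $y_k$ the output driving the actuators.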
This is an extremely common pattern for robotics processing... so I wonder if they just consider the rocket a robot where you're just using a finite state machine based on inputs.
It had manually encoded ROM in the form of "core rope memory", which is pretty wacky, but it was a digital computer. In fact, it was the first IC computer.
Unlikely; they talk about 50 Hz and 10 Hz tasks. Real-time OSes often allow you to create threads that run at a given rate and must complete in that time. Like the 50 Hz task would have 20ms to complete.
`wait_for_next_tick();` in any microcontroller-based embedded system will put the CPU to sleep, often lowering your overall power consumption from the mA range it uses while actively running, to the uA range.
Some systems, like those based on coin cells, will completely shut down the chip, leaving only a single timer or input active, and when triggered will turn back on and initialise the entire CPU again to handle the input/timer. Some microcontrollers give you a small allocation of memory you can preserve under these conditions. That's how you get year+ runtimes on tiny batteries.
If they are rolling their own OS or are using microcontrollers, sure, but if they are using a safety-critical RTOS like VxWorks or INTEGRITY then it has rate-monotonic scheduling built in using OS threads.
No, he is correct. A lot of embedded software uses a more complex form of the exact technique that he has shown in the code.
You technically can have a dedicated timer interrupt for every task, but there are usually a lot more tasks than HW timers, so instead they use a dedicated HW timer for timekeeping and use that as a reference for all other tasks.
Exactly. If an operating system is used (it's not always needed), programmers implement separate OS tasks. For simpler systems (like just reading some sensors) a real-time OS is not needed; then such loops are implemented as a template (with much more error checking and restarting). Typically that "wait for tick" function just waits until a hardware timer sets its overflow flag (not even a full interrupt), and this is done to have a simple, easy-to-reason-about system.
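A bare-metal sketch of such a wait-for-tick, polling a hypothetical timer overflow flag (the register address and bit name below are invented, not a real device's):

```c
/* Spin (or sleep) until the hardware timer's overflow flag is set,
 * then clear it and run the next cycle. Register names are hypothetical. */
#include <stdint.h>

#define TIMER0_FLAGS (*(volatile uint32_t *)0x40001010u)  /* hypothetical address */
#define TIMER0_OVF   (1u << 0)                            /* hypothetical overflow bit */

static void wait_for_next_tick(void)
{
    while ((TIMER0_FLAGS & TIMER0_OVF) == 0) {
        /* optionally enter a low-power sleep here instead of spinning */
    }
    TIMER0_FLAGS = TIMER0_OVF;     /* write-1-to-clear, common on many MCUs */
}

int main(void)
{
    /* timer assumed configured elsewhere to overflow every 20 ms (50 Hz) */
    for (;;) {
        /* read sensors, compute, write outputs ... */
        wait_for_next_tick();
    }
}
```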