Software patching at a billion miles

Could someone here shed some light about how NASA goes about designing their spacecraft architecture to ensure that they are able to patch bugs in the deployed code?

I have never built any “real-time” systems, and this question came to mind after reading this article:

http://pluto.jhuapl.edu/overview/piPerspective.php?page=piPerspective_05_21_2010

“One of the first major things we’ll do when we wake the spacecraft up next week will be uploading almost 20 minor bug fixes and other code enhancements to our fault protection (or “autopilot response”) software.”


I've been a developer on public telephone switching system software, which has pretty severe constraints on reliability, availability, survivability, and performance that approach what spacecraft systems need. I haven't worked on spacecraft (although I did work with many former shuttle programmers while at IBM), and I'm not familiar with VxWorks, the operating system used on many spacecraft (including the Mars rovers, which have a phenomenal operating record).

One of the core requirements for patchability is that a system should be designed from the ground up for patching. This includes module structure, so that new variables can be added, and methods replaced, without disrupting current operations. This often means that both old and new code for a changed method will be resident, and the patching operation simply updates the dispatching vector for the class or module.
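To make the dispatch-vector idea concrete, here is a minimal sketch in C. It is my own illustration, not NASA's actual design: callers reach the module's functions only through a table of function pointers, so both the old and new code stay resident, and applying (or backing out) a patch is just a pointer swap.

    #include <stdio.h>

    typedef void (*fault_handler_t)(int fault_code);

    static void handle_fault_v1(int code) { printf("v1: safing for fault %d\n", code); }
    static void handle_fault_v2(int code) { printf("v2: refined response for fault %d\n", code); }

    /* The dispatch vector: the only path other modules use to call into this one. */
    static fault_handler_t dispatch[] = { handle_fault_v1 };

    /* "Patching" is a single pointer update; restoring the old entry un-patches. */
    static void apply_patch(int slot, fault_handler_t replacement) {
        dispatch[slot] = replacement;
    }

    int main(void) {
        dispatch[0](42);                  /* original code runs            */
        apply_patch(0, handle_fault_v2);  /* upload and activate the fix   */
        dispatch[0](42);                  /* same call site, patched code  */
        return 0;
    }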

It is just about mandatory that the patching (and un-patching) software is integrated into the operating system.

When I worked on telephone systems, we generally used patching and module-replacement functions in the system to load and test our new features as well as bug fixes, long before these changes were submitted for builds. Every developer needs to be comfortable with patching and replacing modules as part of their daily work. It builds a level of trust in these components, and makes sure that the patching and replacement code is exercised routinely.

Testing is far more stringent on these systems than anything you've ever encountered on any other project. Complete and partial mock-ups of the deployment system will be readily available. There will likely be virtual machine environments as well, where the complete load can be run and tested. Test plans at all levels above unit test will be written and formally reviewed, just like formal code inspections (and those will be routine as well).

Fault tolerant system design, including software design, is essential. I don't know about spacecraft systems specifically, but something like high-availability clusters is probably standard, with the added capability to run both synchronized and unsynchronized, and with the ability to transfer information between sides during a failover. An added benefit of this system structure is that you can split the system (if necessary), reload the inactive side with a new software load, and test it in the production system without being connected to the system network or bus. When you're satisfied that the new software is running properly, you can simply failover to it.
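As a rough sketch of that split/reload/failover sequence (my own simplification, not any specific spacecraft or switch design), the essential operations look something like this:

    #include <stdio.h>
    #include <string.h>

    struct side {
        char load[16];   /* software load installed on this side */
        int  active;     /* 1 = carrying the mission, 0 = standby */
    };

    static struct side side_a = { "R1.0", 1 };
    static struct side side_b = { "R1.0", 0 };

    /* Reload the inactive side with a new load and test it in isolation. */
    static void reload_standby(struct side *standby, const char *new_load) {
        strncpy(standby->load, new_load, sizeof standby->load - 1);
        printf("standby reloaded with %s; running offline tests\n", standby->load);
    }

    /* Controlled failover: transfer state, then swap which side is active. */
    static void failover(struct side *from, struct side *to) {
        /* state transfer between the sides would happen here */
        from->active = 0;
        to->active   = 1;
        printf("failover complete; %s is now in service\n", to->load);
    }

    int main(void) {
        reload_standby(&side_b, "R1.1");
        failover(&side_a, &side_b);
        return 0;
    }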

As with patching, every developer should know how to do failovers, and should do them both during development and testing. In addition, developers should know every software update issue that can force a failover, and should know how to write patches and module replacements that avoid forcing a failover whenever possible.

In general, these systems are designed from the ground up (hardware, operating system, compilers, and possibly programming language) for these environments. I would not consider Windows, Mac OS X, Linux, or any Unix variant to be sufficiently robust. Part of that is the real-time requirements, but the whole issue of reliability and survivability is just as critical.

UPDATE: As another point of interest, here's a blog by one of the Mars rover drivers. This will give you a perspective on the daily life of maintaining an operating spacecraft. Neat stuff!


I've never built a real-time system either, but I suspect those systems would not have a memory protection mechanism. They don't need one, since the team wrote all of the software themselves. Without memory protection, it is trivial for one program to write into another program's memory, and this can be used to hot-patch a running program (self-modifying code was a popular technique in the past; without memory protection, the same techniques used for self-modifying code can be used to modify another program's code).
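As a small illustration of that point (my own example, assuming a flat address space with no MMU, as on many small real-time targets): any code that knows the address of a variable or jump-table entry in the running image can simply write to it, which is what makes in-place hot patching easy. On a protected OS the same write would fault unless permissions were changed first.

    #include <stdio.h>
    #include <stdint.h>

    /* Pretend this lives in the flight software image at an address
       published in the build's symbol map (hypothetical name). */
    static int32_t safe_mode_threshold = 100;

    int main(void) {
        /* The "patcher" only needs the raw address from the symbol map. */
        int32_t *patch_target = &safe_mode_threshold;

        printf("before patch: threshold = %d\n", (int)safe_mode_threshold);
        *patch_target = 80;   /* direct write into the running image */
        printf("after patch:  threshold = %d\n", (int)safe_mode_threshold);
        return 0;
    }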

Linux has been able to do minor kernel patching without rebooting for some time with Ksplice. This is necessary in situations where any downtime can be catastrophic. I've never used it myself, but I think the technique they use is basically this:

Ksplice can apply patches to the Linux kernel without rebooting the computer. Ksplice takes as input a unified diff and the original kernel source code, and it updates the running kernel in memory. Using Ksplice does not require any preparation before the system is originally booted (the running kernel does not need to have been specially compiled, for example). In order to generate an update, Ksplice must determine what code within the kernel has been changed by the source code patch. Ksplice performs this analysis at the ELF object code layer, rather than at the C source code layer.

To apply a patch, Ksplice first freezes execution of a computer so it is the only program running. The system verifies that no processors were in the middle of executing functions that will be modified by the patch. Ksplice modifies the beginning of changed functions so that they instead point to new, updated versions of those functions, and modifies data and structures in memory that need to be changed. Finally, Ksplice resumes each processor running where it left off.

(from Wikipedia)
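A rough user-space analogue of the mechanism described in the quote, assuming x86-64 Linux (this is my own sketch, not Ksplice's code): overwrite the first instructions of the old function with an absolute jump to the new version, after making its code page writable. A kernel patching itself can relax its text protections directly; compile this without optimization so the call isn't inlined.

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/mman.h>

    int old_response(void) { return 1; }   /* buggy original            */
    int new_response(void) { return 2; }   /* fixed replacement routine */

    int main(void) {
        /* movabs rax, <addr>; jmp rax -- a 12-byte absolute jump. */
        unsigned char jump[12] = { 0x48, 0xB8, 0,0,0,0,0,0,0,0, 0xFF, 0xE0 };
        uint64_t target = (uint64_t)(uintptr_t)new_response;
        memcpy(&jump[2], &target, sizeof target);

        /* Make the page holding old_response writable, then splice in the jump. */
        long pagesz = sysconf(_SC_PAGESIZE);
        void *page = (void *)((uintptr_t)old_response & ~(uintptr_t)(pagesz - 1));
        if (mprotect(page, (size_t)pagesz, PROT_READ | PROT_WRITE | PROT_EXEC) != 0) {
            perror("mprotect");
            return 1;
        }
        memcpy((void *)(uintptr_t)old_response, jump, sizeof jump);

        printf("old_response() now returns %d\n", old_response());   /* prints 2 */
        return 0;
    }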


Well I'm sure they have simulators to test with and mechanisms for hot-patching. Take a look at the linked article below - there's a pretty good overview of the spacecraft design. Section 5 discusses the computation machinery.

http://www.boulder.swri.edu/pkb/ssr/ssr-fountain.pdf

Of note:

  • Redundant processors
  • Command switching by the uplink card that does not require processor help
  • Time-lagged rules


I haven't worked on spacecraft, but the machines I've worked on have all been built to have a stable idle state where it's possible to shut the machine down briefly to patch the firmware. The systems that have accommodated 'live' updates are those broken into interacting components, where you can bring down one segment of the system long enough to update it while the other components continue operating as normal, since they can tolerate the temporary downtime of the serviced component.

One way you can do this is to have parallel (redundant) capabilities, such as parallel machines that all perform the same task, so that the process can be routed around the machine under service. The benefit of this approach is that you can bring it down for longer periods for more significant service, such as regular hardware preventative maintenance. Once you have this capability, supporting downtime for a firmware patch is fairly easy.
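A toy sketch of that routing idea (hypothetical, not the poster's actual system): a dispatcher that simply skips any unit flagged out of service, so patching a unit means flagging it out, updating it, and flagging it back in.

    #include <stdio.h>

    #define NUM_UNITS 3

    static int in_service[NUM_UNITS] = { 1, 1, 1 };

    /* Round-robin selection that skips units taken down for maintenance. */
    static int next_unit(void) {
        static int last = 0;
        for (int i = 0; i < NUM_UNITS; i++) {
            int candidate = (last + 1 + i) % NUM_UNITS;
            if (in_service[candidate]) {
                last = candidate;
                return candidate;
            }
        }
        return -1;   /* nothing left in service */
    }

    int main(void) {
        in_service[1] = 0;   /* take unit 1 down for a firmware patch */
        for (int task = 0; task < 5; task++)
            printf("task %d -> unit %d\n", task, next_unit());
        in_service[1] = 1;   /* back in rotation once patched */
        return 0;
    }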


One of the approaches that's been used in the past is to use LISP.
