Field Programmable Gate Arrays (FPGAs) are a tremendously exciting implementation platform. The chips are getting big enough (millions of gates) to implement impressively large systems. Celoxica's Handel-C provides a high-level language for programming the devices -- no longer are we restricted to the hideous VHDL. So we can now relatively easily program massively parallel simulations onto these chips. This leads to two sets of questions: how to use FPGAs to implement a simulation platform; and how to approach FPGAs as novel computational devices in their own right.
Investigating biologically inspired models of computation needs massive computational capacity. Biological "computations" involve trillions of components -- until now, simulations have been vastly more modest (usually of the order of 100 nodes), which perhaps explains why some of the qualitative differences have yet to be appreciated in full. And the classical computational resources required to simulate quantum computers increase exponentially with the size of the problem.
Simulation platforms can be implemented on conventional massively powerful computers. What simulation packages would be useful to implement? What special-purpose programming languages suit the different simulation paradigms? How do we visualise, in many dimensions, the large number of components as they vary in time?
Simulation: Simulation is one particular use of parallel architectures -- from simple physics simulations using CAs or solving differential equations on meshes, up to complex simulations of intelligent agents roving around weird landscapes.
We need a new conceptual model of program execution. A program is compiled down into FPGA hardware, not into the assembly language of some processor. There is no von Neumann architecture of a CPU with its fetch/execute cycle. Rather, different parts of the program can execute in parallel. There is no separation of memory locations and program instructions -- the state is given by the values held in on-chip RAM and registers (or in conventional off-chip RAM); the program is (mostly) the hardware configuration.
So, how do we think about FPGA programs? What are the conceptual building blocks? How do we specify and design them? Is animation a good way to build intuition about parallel algorithms? What are the interesting constraints? Space and time -- we have lots of intuitions about von Neumann programs that we need to extend to FPGAs (and even more so to quantum computers!).
We already have procedural, functional, and logic languages -- now we need a new kind of "massively parallel procedural" language. Handel-C superficially looks like a conventional programming language. But look closer. The parallel construct allows parts to execute simultaneously. Each assignment takes only one clock cycle, no matter how complicated the expression (but your clock might have to run quite slowly -- it might actually be faster to split an assignment in two); channels allow "instantaneous" communication; and so on. Ideas from CSP (for the parallelism) and from data flow architectures are needed to design these programs.
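Handel-C compiles to a hardware configuration, so no ordinary program fragment can capture the one-cycle timing; but the programming model being described -- independent branches running in parallel and synchronising over channels -- can at least be sketched as a software analogy. The following is such a sketch in plain C (POSIX threads, with a pipe standing in, loosely, for a channel); it is an analogy only, not Handel-C, and all the names in it are invented for illustration.

    /* A rough software analogy only, not Handel-C: two branches run in
       parallel and communicate over a channel (here a POSIX pipe). */
    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    static int chan[2];                /* pipe descriptors, used as a crude channel */

    static void *sender(void *arg)     /* one "par" branch: compute, then send */
    {
        int a = 3 * 4 + 5;             /* in Handel-C this assignment, however
                                          complex, would take a single clock cycle */
        (void)arg;
        write(chan[1], &a, sizeof a);  /* send on the channel */
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        int b;

        pipe(chan);
        pthread_create(&t, NULL, sender, NULL);  /* like "par": branches run simultaneously */
        read(chan[0], &b, sizeof b);             /* receive: blocks until the sender writes */
        pthread_join(t, NULL);
        printf("received %d\n", b);
        return 0;
    }

In Handel-C the same shape would be a par block whose branches communicate with the ! (send) and ? (receive) channel operators, with no operating system or scheduler underneath -- just gates and wires.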
Reconfigurability: FPGAs are "soft" hardware -- a chip is configured by setting up connections between its array of gates. It uses ordinary volatile SRAM technology -- so it is fast, and the reconfiguration can happen an unlimited number of times. Currently reconfiguration is a whole-chip operation: the configuration data is serially loaded into the chip, in about 10 microseconds. Next-generation hybrid FPGAs have separately configurable regions, allowing one part of the chip to be reconfigured while the rest continues executing (as well as some on-chip conventional CPU, etc). This would also allow an entire chip to be reconfigured faster, if the separate areas could be configured in parallel. Ultimately one would have fine-grained reconfigurability, on the fly.
Soft processors: One of the programs that can be written, compiled, and loaded onto an FPGA is, of course, an emulation of a CPU. Such "soft processors" are useful for a variety of reasons:
- running code for old processors that are hard to get -- like 8086s
- emulating virtual machines, like the JVM, or p-code
- prototyping and experimenting with new instruction sets, new pipelining architectures, etc, for as-yet non-existent chips
- portability testing
- platforms for application-specific "little language" processors
Soft versions of special-purpose chips could also be useful, for example application-specific devices (FFTs, filters, DES, ...).
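As a minimal illustration of the soft processor idea, here is the fetch/decode/execute loop of a made-up four-instruction accumulator machine, written as ordinary C. The instruction set is invented purely for illustration; a real soft processor would be described in Handel-C (or VHDL) and synthesised onto the FPGA fabric, rather than run as compiled C.

    /* An illustrative fetch/decode/execute loop for an invented 4-instruction
       accumulator machine. On an FPGA the same loop, written in a hardware
       description language, would become the soft processor itself. */
    #include <stdio.h>

    enum { OP_LOAD, OP_ADD, OP_STORE, OP_HALT };   /* invented instruction set */

    static void run(const unsigned char *prog, unsigned char *mem)
    {
        unsigned pc = 0;
        unsigned char acc = 0;

        for (;;) {
            unsigned char op  = prog[pc++];        /* fetch */
            unsigned char arg = prog[pc++];
            switch (op) {                          /* decode and execute */
            case OP_LOAD:  acc = mem[arg];                          break;
            case OP_ADD:   acc = (unsigned char)(acc + mem[arg]);   break;
            case OP_STORE: mem[arg] = acc;                          break;
            case OP_HALT:  return;
            }
        }
    }

    int main(void)
    {
        unsigned char mem[4] = { 2, 3, 0, 0 };
        const unsigned char prog[] = {
            OP_LOAD, 0, OP_ADD, 1, OP_STORE, 2, OP_HALT, 0
        };
        run(prog, mem);
        printf("2 + 3 = %d\n", mem[2]);            /* prints 5 */
        return 0;
    }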
Customisation: Develop many different FPGA implementations of the same software, by using "meaning preserving netlist mutation operators", and use this to digitally watermark the software (maybe this idea should be patented!), to provide the basis for the software licence, or to provide some security properties. (Again, this technique transcends FPGAs -- one could mutate the source code, or the object code, of any application. Could these operators also be developed for space/time optimisations, as well as for watermarking? This could be a route into parallelisation technology.)
FPGAs can provide flexible "glue logic" components for gluing together other special-purpose components, particularly CPUs. They can also be used for combination architectures: spatially heterogeneous, by having multiple (small) applications on a single chip, and temporally heterogeneous, by reconfiguring the chip, for example, to implement a complex pipelining process. (Combining the two may need to wait for dynamically reconfigurable FPGAs.) Any ASICs or FPGAs that are today being designed using VHDL (for example, MP3 chips) should move to using a high-level language -- currently something like Handel-C, eventually something much higher level still (OO, parallel, dynamic configuration, ...).
Parallelism: Transputers etc are good for coarse-grained heterogeneous parallelism -- separate processes can easily run on separate chips, but on a single chip one wants to reduce the number of relatively expensive context switches between time-sliced "parallel" processes. Things like vector processors (Crays) and array processors (DAPs) are good for fine-grained homogeneous parallelism -- doing the same thing to each element of a big array. FPGAs can do both of these, but they also make it feasible to do fine-grained heterogeneous parallelism.
Currently, quite a bit of knowledge of how the program maps to hardware is needed. Consider x := (a*b)+(c*d). Compile that naively, and you get two silicon multipliers. Or compile it as a "function call" -- there is now only one silicon multiplier, used by both terms, so they cannot execute in parallel. Also, there might be routing problems getting from all the pieces of the design that want to use the function to the piece of silicon implementing it. It is a real space/time trade-off. There is lots of experience, from transputers to Crays, to be called on here. And there is Moore's law -- FPGAs are already up to millions of gates. What happens when we design for parallelism as if silicon area is essentially "free", and time is the only constraint? (modulo combinatorial explosion ...)
Parallel architectures: The most obvious application of the parallelism is a cheap entry-level parallel processor, in particular, CA machines and NN architectures. Then things start to get more interesting:
- more intelligence in a node (beyond a simplistic CA state machine or neural thresholding function)
- different intelligences in different nodes (heterogeneous)
- intelligences moving between nodes (agents)
- non-euclidean and irregular connection topologies
- changing topologies (initially rather clumsily via soft routing, eventually by on-the-fly reconfiguration)
Applications that gobble up large amounts of processor power, that could benefit from massive parallelism, include signal processing, image processing, multi-media, virtual reality, cryptography, and cryptanalysis. Standard CS algorithms should be examined to determine how they could be parallelised; maybe less popular algorithms suddenly become better appreciated because they are more amenable to parallelisation. For example, how good is a parallel bubble sort? (Again, there's been a lot of work already done on parallel algorithms -- but mainly only for particular algorithms or for highly specific architectures like data flow and array processors -- is there much general-purpose CS stuff?) CS complexity theory will need to pay much more attention to space complexity order, as well as time complexity order.
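On the parallel bubble sort question: the standard answer is odd-even transposition sort, which does bubble sort's compare-and-swap work in n phases whose compare-exchanges touch disjoint pairs, so each phase could be a bank of comparators all firing at once -- roughly O(n) time on O(n) comparators. Here is a plain C sketch (written sequentially; the point is that every iteration of the inner loop is independent):

    /* Odd-even transposition sort: the parallel form of bubble sort.
       Each phase's compare-exchanges touch disjoint pairs, so on an FPGA
       they could all fire simultaneously; n phases then sort n elements. */
    #include <stdio.h>

    static void odd_even_sort(int a[], int n)
    {
        for (int phase = 0; phase < n; phase++) {
            /* pairs (start, start+1), (start+2, start+3), ... are disjoint,
               so every iteration of this inner loop is independent */
            for (int i = phase % 2; i + 1 < n; i += 2) {
                if (a[i] > a[i + 1]) {
                    int tmp = a[i]; a[i] = a[i + 1]; a[i + 1] = tmp;
                }
            }
        }
    }

    int main(void)
    {
        int a[] = { 5, 1, 4, 2, 3 };
        odd_even_sort(a, 5);
        for (int i = 0; i < 5; i++) printf("%d ", a[i]);   /* 1 2 3 4 5 */
        printf("\n");
        return 0;
    }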
So far this has been "naïve parallelism", where the topology of the silicon simulation maps straightforwardly onto the topology of the system being simulated. Because the mapping is so simple, there is little overhead, and so this is an efficient approach, provided that the activity across the landscape is relatively uniform, and so the activity across the silicon is relatively uniform. Simulations of "lumpy" activity on regular grids are inefficient -- effort should be focussed on where the action is, not everywhere uniformly. Relevant existing techniques include:
- "adaptive meshes", used in conventional simulation -- the resolution of the mesh is adaptively altered to give high resolution in places of interest, and lower resolution elsewhere
- the "processor farm" arrangement, used with multiple transputers to perform load balancing
- the large body of work on parallel simulation, in particular asynchronous simulation, where parts are allowed to run ahead at full speed, provided they can "roll back" if their assumptions are shown not to hold
- Bill Gosper's cacheing technique for simulating unbounded Life
Once you have got the idea that all your silicon should be doing something useful all the time, load balancing will become important. Applications become more complex as the silicon topology does not map simply onto the real world topology. (An alternative view: as silicon becomes essentially "free", who cares if it's not all being used fully all the time, if that means your application is simpler to design, develop, debug and run?)
Search: Many search algorithms are (more or less intelligent) generate and test. Candidates can be generated and tested in parallel. If the testing takes candidate-dependent amounts of time, load balancing may be used. In particular, a hybrid breadth-first/depth-first search algorithm with dynamic pruning may be appropriate. A similar approach could be used to build a quantum computer emulation: generate all the "parallel worlds" actually in parallel, then choose the best answer. (Again, combinatorial explosion will probably mean some "processor farm with load balancing" technique is needed for larger problems.) When exploring a parameter space, models with different parameter values can be explored in parallel (and see partial evaluation, below).
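A skeletal version of parallel generate-and-test, written in C with OpenMP as a stand-in for true hardware parallelism: every candidate is scored independently, and only the reduction to "best so far" needs any coordination. The candidate encoding and the scoring function below are placeholders invented for the sketch.

    /* Skeletal parallel generate-and-test: all candidates are scored
       independently; only the "best so far" bookkeeping needs coordination.
       The candidate encoding and cost function are placeholders.
       Compile with e.g. cc -fopenmp. */
    #include <stdio.h>

    #define N_CANDIDATES 10000

    /* placeholder cost function: pretend we are minimising |c*c - 2000| */
    static long score(long candidate)
    {
        long d = candidate * candidate - 2000;
        return d < 0 ? -d : d;
    }

    int main(void)
    {
        long best = -1, best_score = -1;

        #pragma omp parallel for
        for (long c = 0; c < N_CANDIDATES; c++) {   /* generate */
            long s = score(c);                      /* test, fully in parallel */
            #pragma omp critical
            if (best_score < 0 || s < best_score) { /* keep the best answer */
                best_score = s;
                best = c;
            }
        }
        printf("best candidate %ld (score %ld)\n", best, best_score);
        return 0;
    }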
Optimisation: Biologically inspired optimisation algorithms involve calculating cost functions over large populations, repeatedly. Members of these populations can be evaluated in parallel. Lots of these algorithms (Holland's "bucket brigade", Hofstadter's "parallel terraced scan") are described in an intrinsically parallel way. Programs can be evolved as FPGA netlists -- so evaluating the cost function simply involves running the program loaded onto an FPGA. (Care must be taken. Some original experiments ended up with very neat solutions -- that would run only on that FPGA, even only in that corner of that FPGA, only at that temperature, and so on -- non-computational properties of the particular physical device were being exploited.)
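A skeleton of the evolve-netlists idea, again in ordinary C: the "netlists" are just byte strings, and evaluate() is a stub standing for "load the configuration onto the device, run it, measure how well it behaves". The fitness function and parameters here are invented for illustration; the point is that every individual in a generation can be evaluated independently, each potentially on its own piece of FPGA.

    /* Skeleton of evolving FPGA configurations. The "netlist" is just a byte
       string, and evaluate() is a stub standing for: load the bitstream onto
       the device, run it, measure its behaviour. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define POP  8
    #define LEN  16
    #define GENS 100

    static unsigned char pop[POP][LEN];

    /* stub fitness: count of set bits -- a real system would run the design */
    static int evaluate(const unsigned char *netlist)
    {
        int f = 0;
        for (int i = 0; i < LEN; i++)
            for (int b = 0; b < 8; b++)
                f += (netlist[i] >> b) & 1;
        return f;
    }

    int main(void)
    {
        srand(1);
        for (int i = 0; i < POP; i++)
            for (int j = 0; j < LEN; j++)
                pop[i][j] = (unsigned char)rand();

        for (int g = 0; g < GENS; g++) {
            int fit[POP], best = 0;
            for (int i = 0; i < POP; i++) {      /* independent: parallelisable */
                fit[i] = evaluate(pop[i]);
                if (fit[i] > fit[best]) best = i;
            }
            for (int i = 0; i < POP; i++) {      /* replace losers with mutated copies */
                if (i == best) continue;
                memcpy(pop[i], pop[best], LEN);
                pop[i][rand() % LEN] ^= (unsigned char)(1 << (rand() % 8));
            }
            if (g % 20 == 0) printf("gen %d best fitness %d\n", g, fit[best]);
        }
        return 0;
    }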
Complexity, ALife, etc: Fractal image decompression by "the chaos game" rendering algorithm (randomly iterating the affine transformations describing the compression) can be performed in parallel, starting with different seeds, so filling in different parts of the picture by overlaying their results. Tom Ray's famous Tierra system, an experiment in ALife evolution, uses a virtual machine to simulate the evolution of instruction strings. Getting away from the von Neumann architecture firmly embedded in his model (although there is now an Internet version of it, so organisms can execute in parallel on different sites, the idea that each is a sequentially executing program is still there), what would a parallel FPGA evolutionary experiment look like? There are lots of other complexity theory and ALife experiments, emergent property investigations, that require a lot of essentially parallelisable computation.
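Returning to the chaos game: here is a minimal C sketch using the standard three-map Sierpinski triangle IFS as a stand-in for the affine maps that a fractal compressor would actually produce. Runs started from different seeds visit different point sequences, so several copies could run in parallel and their plots simply be overlaid into one image (the sketch below runs them one after another).

    /* The "chaos game": repeatedly apply a randomly chosen affine map and plot
       the visited points. The three maps below are the standard Sierpinski
       triangle IFS, used purely as an illustration. */
    #include <stdio.h>
    #include <stdlib.h>

    #define SIZE 64
    #define ITER 20000

    static char image[SIZE][SIZE];

    static void chaos_game(unsigned seed)
    {
        double x = 0.0, y = 0.0;
        srand(seed);
        for (int i = 0; i < ITER; i++) {
            int t = rand() % 3;                       /* pick one affine map */
            x = 0.5 * x + (t == 1 ? 0.5 : t == 2 ? 0.25 : 0.0);
            y = 0.5 * y + (t == 2 ? 0.5 : 0.0);
            if (i > 10)                               /* skip the initial transient */
                image[(int)(y * (SIZE - 1))][(int)(x * (SIZE - 1))] = 1;
        }
    }

    int main(void)
    {
        chaos_game(1);          /* independent runs: with a per-run RNG these */
        chaos_game(2);          /* could execute in parallel, their results   */
        chaos_game(3);          /* overlaid in the same image                 */
        for (int r = SIZE - 1; r >= 0; r--) {
            for (int c = 0; c < SIZE; c++) putchar(image[r][c] ? '#' : ' ');
            putchar('\n');
        }
        return 0;
    }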
Application monitoring: With the availability of true parallelism, monitoring, profiling and run-time debugging support can be built into the application and have no time overhead (there is a space overhead, naturally). This should provide plenty of information for optimisation and performance tuning (especially needed in the early days before parallel intuition has developed). And it should be invaluable for monitoring and debugging hard real time applications.
Application optimisation: Many programs can be viewed as a function from inputs to outputs. Some of these inputs may vary more slowly than others (for example, your Preferences in a text editor vary less than the text you type into it; your crypto key varies less often than the text you encrypt with it). Consider a curried version of the function, with the slowest-varying inputs first in the argument application order. Then evaluate the function with respect to the current values of those slowly varying inputs. This gives another function with those values "hardwired" into it. It should be more efficient to evaluate this "partially evaluated" function against the remaining parameters than to evaluate the full function against all its parameters. If the values vary slowly enough, the overhead of doing the partial evaluation should be paid for by the several evaluations of the more efficient resulting function.
Originally, partial evaluation was developed for functional programs, but partial evaluators have also been built for Pascal and C. So an application can be optimised by partial evaluation. When a slowly varying argument does change, the function can be partially evaluated again, and the FPGA reconfigured with the new version. (There is nothing about FPGAs per se that makes them more suitable for this technique. Partial evaluation could be used more widely, for most applications.)
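As a hand-worked illustration of what a partial evaluator would do mechanically, consider a toy repeating-key XOR "cipher" (chosen only to keep the example short): encrypt() is the general function of (key, text), and encrypt_k3() is the residual function specialised to one particular key, with the key loop unrolled and the key bytes hardwired -- exactly the kind of specialised code one might then compile down onto an FPGA.

    /* Hand-worked partial evaluation. encrypt() is the general function;
       encrypt_k3() is the residual function a partial evaluator could produce
       once the slowly varying input -- the key -- is known. */
    #include <stdio.h>
    #include <string.h>

    /* general function: both arguments unknown until run time */
    static void encrypt(const unsigned char *key, int keylen,
                        unsigned char *text, int len)
    {
        for (int i = 0; i < len; i++)
            text[i] ^= key[i % keylen];
    }

    /* residual function, specialised to the key {0x4b, 0x65, 0x79}:
       no key lookup or modulo in the main loop, which is unrolled */
    static void encrypt_k3(unsigned char *text, int len)
    {
        static const unsigned char k[3] = { 0x4b, 0x65, 0x79 };
        int i = 0;
        for (; i + 2 < len; i += 3) {
            text[i]     ^= 0x4b;
            text[i + 1] ^= 0x65;
            text[i + 2] ^= 0x79;
        }
        for (int j = 0; i < len; i++, j++)     /* at most two leftover bytes */
            text[i] ^= k[j];
    }

    int main(void)
    {
        unsigned char a[] = "attack at dawn", b[] = "attack at dawn";
        const unsigned char key[] = { 0x4b, 0x65, 0x79 };
        int len = (int)(sizeof a - 1);

        encrypt(key, 3, a, len);
        encrypt_k3(b, len);
        printf("%s\n", memcmp(a, b, (size_t)len) == 0
                       ? "general and specialised versions agree" : "mismatch");
        return 0;
    }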
We need new ways of thinking about complex systems. And we need a lot of computational power to implement or simulate such large systems.
In partnership with Celoxica and Xilinx
I would like to thank John Clark for several stimulating discussions in this area, and for commenting on these drafts.