[ASCII timing diagram, garbled in transit: 4-bit variables (b, c, d, e, z) streaming through add, multiply, and divide stages, with the product p spanning roughly twice the word width.]
Note that, independent of the word width (4 in this case), the availability of variables for further calculation occurs at 3 cycles. The full result of the product p = c*x completes in word width + 2 cycles. However, the result is available for use as soon as the first bit becomes available.
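To make the idea concrete, here is a rough sketch in Python (the variable names, operand values, and 4-bit width are illustrative assumptions, not the exact program behind the diagram above):

    # Bit-serial, LSB-first addition: each sum bit is emitted the same cycle
    # its operand bits arrive, so a dependent add can start on the next cycle
    # instead of waiting for the whole word.
    def serial_add(a_bits, b_bits):
        carry = 0
        for a, b in zip(a_bits, b_bits):
            total = a + b + carry
            yield total & 1        # this sum bit is usable immediately
            carry = total >> 1     # carry is held for the next bit time

    def to_bits(value, width=4):   # LSB-first, 4-bit words as in the example
        return [(value >> i) & 1 for i in range(width)]

    # d = b + e, then z = d + c; z can begin as soon as d's first bit appears.
    d_stream = serial_add(to_bits(5), to_bits(6))
    z_stream = serial_add(d_stream, to_bits(3))
    print(list(z_stream))          # LSB-first bits of (5 + 6 + 3) mod 16

The only point of the sketch is the timing: the consumer of d never waits for d's full word before it starts.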
Viewing this in terms of a RISC processor where an add takes 1 cycle (and let's say the multiply takes 1 cycle as well), the sample program takes 5 cycles (one for each statement). In the serial approach, assuming 32-bit words and a bit array clocked at 32x the RISC, the 32-bit RISC instructions take the equivalent of 32x5 clocks (160), whereas the availability of variables for computation begins at 3 cycles. That is in excess of 50x the RISC processor.
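As a quick check of that arithmetic (same assumptions: 32-bit words, 5 single-cycle RISC instructions, bit array clocked at 32x):

    risc_cycles     = 5                  # one RISC cycle per statement
    bit_clocks      = 32 * risc_cycles   # RISC time in bit-array clocks = 160
    first_result_at = 3                  # bit-array clocks before variables start appearing
    print(bit_clocks / first_result_at)  # ~53, i.e. in excess of 50x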
This simple example illustrates to some extent the
power attainable
using multiple stream serial processing.
Conceptualization of this is one thing. Putting it into practice is another. To put this into practice, the program "compiler" must determine an optimal configuration for routing the data and then "wire" the processor to perform the task. PLDs illustrate that a "processor" can be rewired; however, the popular PLDs are designed for bussing data for parallel use, e.g. the result of an n-bit adder is available only after the complete result is ready, not as the result propagates across the width of the adder.
An entirely new design of PLD would be required: one where the data flows on route-programmable serial busses, and where the computational (logic) elements are fast but relatively few in number. For example, if the problem to solve and the available compiler were suited to using only 128 adders, then you view the problem as one of routing the variables and partial results.
The routing problem is non-trivial. The output of each of the 128 adders could go to any one or more of the adders, including itself, as well as to any number of other destinations. As the computation proceeds, the routing changes as required. This is somewhat like a massive patch panel. The duration of a connection might be permanent, or it could be as fleeting as 1 bit time. The performance of the system becomes dependent not only on the speed of the few components (e.g. the adders) but equally on the time it takes to reroute the connections.
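As a minimal sketch of that patch panel (the cell count comes from the example above; the routing-table shape and the function names are my own assumptions): each adder's most recent output bit can be fanned out to any set of adder input ports, including its own, and the table can be replaced between bit times, so a connection may be long-lived or last a single bit.

    NUM_ADDERS = 128

    class SerialAdder:
        def __init__(self):
            self.carry = 0
            self.out = 0                       # last output bit, visible to the fabric
        def step(self, a, b):
            total = a + b + self.carry
            self.out, self.carry = total & 1, total >> 1

    adders = [SerialAdder() for _ in range(NUM_ADDERS)]

    def run_bit_time(route, external=None):
        # route: {source_adder: [(dest_adder, port), ...]} -- the "patch panel"
        # external: {adder: (a_bit, b_bit)} for bits fed from outside the fabric
        fed = {i: list((external or {}).get(i, (0, 0))) for i in range(NUM_ADDERS)}
        for src, edges in route.items():
            for dest, port in edges:
                fed[dest][port] = adders[src].out   # src's previous bit, one bit-time late
        for i, (a, b) in fed.items():
            adders[i].step(a, b)

    # The scheduler (the "compiler") would emit a possibly different route
    # table for every bit time; swapping tables is the rerouting cost noted above.
    run_bit_time(route={0: [(1, 0), (0, 1)]}, external={0: (1, 0)})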
Although this would seem a logical extension for an optical circuit (OPLD), I see no sense in waiting for those devices. If 50x performance is attainable on a simple problem (as in the example above), it would seem advantageous to offer this capability using current technology.
A manufacturer, such as Altera or Xilinx, could first test-market this design in a new component. Then, based on the experience gained, it could make a device that integrates the two technologies into one die. A second problem to solve is producing the compiler-like program that generates the routing and scheduling information. Obtaining a 50x performance gain over your competition should provide enough incentive to pursue R&D in this area. This is something I would be interested in pursuing. Although it is something the major players (Altera, Xilinx) should invest in, it is a project that a startup could do. The startup could use existing devices (or even a software emulator) to emulate potential designs. After the necessary IP is protected with a patent (application), you pursue raising capital to make the devices or license the technology to one of the bigger players.
Jim Dempsey
----- Original Message -----
Sent: Monday, February 10, 2003 6:50 AM
Subject: Re: [oc] Beyond Transmeta...
>
> There are quite a lot of bit streaming techniques in use; look at
> delta-sigma, or the ancient MILDAP, which was a 1-bit array processor
> using SIMD. I think bit streaming operations are most useful for SIMD
> (IMHO), since it's the multiply/divide algorithms that are really hard
> to do.