[ASCII timing diagram, garbled in transit: 4-bit variables (b, c, d, e, z) streaming through add, multiply, and divide stages, with the product p spanning roughly twice the word width.]
Note that, independent of the word width (4 in this case), the availability of variables for further calculation occurs at 3 cycles. The full result of the product p = c*x completes in word width + 2 cycles. However, the result is available for use as soon as the first bit becomes available.
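To make the idea concrete, here is a rough sketch in Python (the variable names, operand values, and 4-bit width are illustrative assumptions, not the exact program behind the diagram above):

    # Bit-serial, LSB-first addition: each sum bit is emitted the same cycle
    # its operand bits arrive, so a dependent add can start on the next cycle
    # instead of waiting for the whole word.
    def serial_add(a_bits, b_bits):
        carry = 0
        for a, b in zip(a_bits, b_bits):
            total = a + b + carry
            yield total & 1        # this sum bit is usable immediately
            carry = total >> 1     # carry is held for the next bit time

    def to_bits(value, width=4):   # LSB-first, 4-bit words as in the example
        return [(value >> i) & 1 for i in range(width)]

    # d = b + e, then z = d + c; z can begin as soon as d's first bit appears.
    d_stream = serial_add(to_bits(5), to_bits(6))
    z_stream = serial_add(d_stream, to_bits(3))
    print(list(z_stream))          # LSB-first bits of (5 + 6 + 3) mod 16

The only point of the sketch is the timing: the consumer of d never waits for d's full word before it starts.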
Viewing this in terms of a RISC processor where an add takes 1 cycle (and let's say the multiply takes 1 cycle as well), the sample program takes 5 cycles (one for each statement). In the serial approach, assuming 32-bit words and a bit array clocked at 32x the RISC, the 32-bit RISC instructions take the equivalent of 32x5 clocks (160), whereas the availability of variables for computation begins at 3 cycles. That is in excess of 50x the RISC processor.
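As a quick check of that arithmetic (same assumptions: 32-bit words, 5 single-cycle RISC instructions, bit array clocked at 32x):

    risc_cycles     = 5                  # one RISC cycle per statement
    bit_clocks      = 32 * risc_cycles   # RISC time in bit-array clocks = 160
    first_result_at = 3                  # bit-array clocks before variables start appearing
    print(bit_clocks / first_result_at)  # ~53, i.e. in excess of 50x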
This simple example illustrates to some extent the
power attainable
using multiple stream serial processing.
Conceptualization of this is one thing. Putting it into practice is another. To put this into practice, the program "compiler" must determine an optimal configuration for routing the data and then "wire" the processor to perform the task. PLDs illustrate that a "processor" can be rewired; however, the popular PLDs are designed for bussing data for parallel use, e.g. the result of an n-bit adder is available only after the complete result is ready, not as the result propagates across the width of the adder.
An entirely new design of PLD would be required: one where the data flows on route-programmable serial busses, and where the computational (logic) elements are fast but relatively few in number. For example, if the problem to solve and the available compiler were suited to using only 128 adders, then you view the problem as one of routing the variables and partial results.
The routing problem is non-trivial. The output of each of the 128 adders could go to any one or more of the adders, including itself, as well as to any number of other destinations. As the computation proceeds, the routing changes as required. This is somewhat like a massive patch panel. The duration of a connection might be permanent, or it could be as fleeting as 1 bit time. The performance of the system becomes dependent not only on the speed of the few components (e.g. the adders) but equally on the time it takes to reroute the connections.
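As a minimal sketch of that patch panel (the cell count comes from the example above; the routing-table shape and the function names are my own assumptions): each adder's most recent output bit can be fanned out to any set of adder input ports, including its own, and the table can be replaced between bit times, so a connection may be long-lived or last a single bit.

    NUM_ADDERS = 128

    class SerialAdder:
        def __init__(self):
            self.carry = 0
            self.out = 0                       # last output bit, visible to the fabric
        def step(self, a, b):
            total = a + b + self.carry
            self.out, self.carry = total & 1, total >> 1

    adders = [SerialAdder() for _ in range(NUM_ADDERS)]

    def run_bit_time(route, external=None):
        # route: {source_adder: [(dest_adder, port), ...]} -- the "patch panel"
        # external: {adder: (a_bit, b_bit)} for bits fed from outside the fabric
        fed = {i: list((external or {}).get(i, (0, 0))) for i in range(NUM_ADDERS)}
        for src, edges in route.items():
            for dest, port in edges:
                fed[dest][port] = adders[src].out   # src's previous bit, one bit-time late
        for i, (a, b) in fed.items():
            adders[i].step(a, b)

    # The scheduler (the "compiler") would emit a possibly different route
    # table for every bit time; swapping tables is the rerouting cost noted above.
    run_bit_time(route={0: [(1, 0), (0, 1)]}, external={0: (1, 0)})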
Although this would seem a logical extension for an optical circuit (OPLD), I see no sense in waiting for those devices. If 50x performance is attainable on a simple problem (as in the example above), it would seem advantageous to offer this capability using current technology.
A manufacturer, such as Altera or Xilinx, could first test-market this design in a new component. Then, based on the experience gained, it could make a device that integrates the two technologies into one die. A second problem to solve is producing the compiler-like program that generates the routing and scheduling information. Obtaining a 50x performance gain over your competition should provide enough incentive to pursue R&D in this area. This is something I would be interested in pursuing. Although it is something the major players (Altera, Xilinx) should invest in, it is a project that a startup could do. The startup could use existing devices (or even a software emulator) to emulate potential designs. After the necessary IP is protected with a patent (application), you pursue raising capital to make the devices or license the technology to one of the bigger players.
Jim Dempsey
----- Original Message -----
Sent: Monday, February 10, 2003 6:50 AM
Subject: Re: [oc] Beyond Transmeta...
>
> There are quite a lot of bit streaming techniques in use; look at
> delta-sigma, or the ancient MILDAP, which was a 1-bit array processor
> using SIMD. I think bit streaming operations are most useful for SIMD
> (IMHO), since it's the multiply/divide algorithms that are really hard
> to do.