
Re: [oc] MISCs and partially desinchronized networks





Hi,
I don't know if you have considered all of the requirements.
I want to raise my doubts.

1. Multi-tasking environment? How will you handle address translation
    and context switches?
2. Debugging environment?
3. Each MISC has its own register file, but where does the register file
    of the whole chip reside? Or is it simply the collection of the
    registers in every MISC?
4. How do you map each MISC's memory space into the global memory space?
5. How do you handle the CPU's special registers and global control
    registers?
6. What is the plan for the bus interface connecting the chip to the
    external world? All through FU0? If so, when the FU located farthest
    from FU0 wants to send messages off-chip, the cost would be great.
Hope it helps...
Best regards, Jimmy





"Marko Mlinar" <Marko.Mlinar@campus.fri.uni-lj.si> 於 2000/06/18 04:07:46 PM

請回應 給 cores@opencores.org

收件人:  cores@opencores.org
副本抄送: (副本密送: jimmy87/Sunplus)

主旨:    Re: [oc] MISCs and partially desinchronized networks



> I agree. I prefer simpler designs like MISC and MOVE (TTA) processors
> as well. For those not familiar with these terms:
>
>    http://www.ultratechnology.com/mup21.html
>
>    http://cardit.et.tudelft.nl/MOVE/
I like both ideas. S. Roos (and others) had a surprisingly similar idea
to our or2k:
http://cardit.et.tudelft.nl/MOVE/papers/Roos99a.ps.gz
(I hope they still have this article online; otherwise I can post it
somewhere.)
I think there is a future for TTAs, but currently you have to find a
balance between a data-driven and a control-driven processor.
But we lost contact with them after we passed our results to them - I think
they stopped the research too, since they also wanted to build a GPC.

I like the MISC idea too. However, the dual-stack idea has some
disadvantages, since the dual-stack concept is very useful for dataflow
tree extraction but not for DAGs, e.g.:
t=x;x=y;y=t; would produce unnecessary reg-mem movements.
I have thought a lot about single/dual-stack computers, and all I can say
is that they are useless for general-purpose applications. However, I think
the mup21 can bring a speedup in some simple applications, and the smaller
code size is also very useful.
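To make the DAG point concrete, here is a tiny C sketch (only an
illustration, not mup21 or or2k code): both x and y are live across the
swap, so a pure stack machine has to spill one of them to memory, while a
register machine only shuffles registers.

void swap(int *x, int *y)
{
    int t = *x;   /* register machine: a few register-to-register moves */
    *x = *y;      /* stack machine: one of the live values has to       */
    *y = t;       /* round-trip through memory -> extra reg-mem traffic */
}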

Let me humiliate myself a bit:
I have been developing my own MISC recently:
a 32-bit, two-operand MISC with 4 32b registers, 16 instructions, and 128B
for data and program. And partially desynchronized networks.
I suppose anyone who knows a bit about computers would call me crazy :)

If anyone has enough patience and time, here is a more detailed briefing:

INTRODUCTION
OK, the idea is to build a computer for computation-intensive applications.
This processor is not intended for strictly sequential programs: programs
compiled in Self, for example, would be great; standard C programs could
yield better results with a really good compiler, but not as good as with
parallel languages.
The goal was to achieve maximum computation power with a minimum of
transistors.
The cluster is a 16x16 matrix of FUs. An FU can be a MISC or a memory unit.
Each MISC has its own memory (128B) and 4 32b registers. Each FU is
connected to its four neighbouring FUs and has its own routing circuits.
The network has a supplied clock, 2 times faster than the one in the FUs.
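In C terms, the state of one FU could be sketched roughly like this (the
field names are only illustrative, not part of the design):

#include <stdint.h>

struct fu_state {
    uint32_t reg[4];        /* 4 32b registers (MISC only)             */
    uint8_t  mem[128];      /* 128B of local data and program memory   */
    uint8_t  neighbour[4];  /* addresses of the four neighbouring FUs  */
    uint8_t  clk_phase;     /* FU clock phase; network clock is 2x it  */
};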

NETWORK
The network has the following properties:
- each MISC has exactly one neighbour which is a memory unit
- each memory unit has MISCs as all four of its neighbours
- for every FU, all four of its neighbours have different clock phases,
e.g.:
11221122 ...
34433443 ...
22112211 ...
...
where each digit is a clock phase. It is simple to prove that such a
network exists.
- FUs are connected with a 32+8+5+2 = 47b bus
- each FU has its own unique 8b address (row, column)

FUs communicate with each other through messages. Each message consists of:
(data, FU address, mem address, control)
The FU address is the unique address above, the mem address selects one of
the 32b-aligned words in the 128B memory, and the control bits identify the
type of message.
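In C terms, such a 47b message could be sketched like this (field names
are only illustrative, the widths are as listed above):

struct misc_msg {
    unsigned data     : 32;  /* payload word                             */
    unsigned fu_addr  :  8;  /* destination FU address (row, column)     */
    unsigned mem_addr :  5;  /* one of the 32 32b-aligned words in 128B  */
    unsigned control  :  2;  /* type of message                          */
};                           /* 32+8+5+2 = 47 bits on the bus            */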

ISA
Each MISC has the following ISA:
add, sub, shl, shr, and, xor, snd, rcv, move : type 1
load, store, fork, end : type 2
(this is a preliminary ISA; it is subject to change)
Each instruction has a 4b opcode:
type 1, 1B:
[opcode 4b][reg1 2b][reg2 2b]
type 2, 5B:
[opcode 4b][reg1 2b][reg2 2b][data 32b]
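For illustration, packing and unpacking the 1B type-1 format could look
like this in C (the helper names are made up):

#include <stdint.h>

static uint8_t enc_type1(uint8_t op, uint8_t r1, uint8_t r2)
{
    /* [opcode 4b][reg1 2b][reg2 2b] */
    return (uint8_t)(((op & 0xF) << 4) | ((r1 & 0x3) << 2) | (r2 & 0x3));
}

static void dec_type1(uint8_t insn, uint8_t *op, uint8_t *r1, uint8_t *r2)
{
    *op = insn >> 4;           /* 4b opcode          */
    *r1 = (insn >> 2) & 0x3;   /* 2b register number */
    *r2 = insn & 0x3;          /* 2b register number */
}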

So this ISA reduces code length to a minimum.
Maybe I should explain the special instructions:
- snd: sends a message; its data is stored in the destination's memory at
the given address
- rcv: waits for a (constant) number of messages
- load: loads an immediate, or memory at (reg2+disp), into reg1
- store: stores reg1 in memory at (reg2+disp)
- fork: allocates a free FU
- end: ends the current program, sends a termination msg and halts the
processor

Maybe you have figured out that it is not intended to place whole programs
in the MISCs, but rather separate functions, or basic blocks if there is
not enough memory.
When a function call occurs, fork is executed, which allocates an empty FU
(one of the MISCs, (0,0), has this special function and is intended only
for program loading).
Function parameters are passed through snd, and the results are received
when the callee sends them back; when we need them we must execute e.g.
rcv 1. Programs are loaded via snd too.
If a MISC doesn't have enough memory, it can use its neighbouring memory
unit via snd and rcv.
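In C-like terms, the caller side of such a call could look roughly like
this (the fork_fu/snd/rcv wrappers and the memory slots are only
illustrative, they just mirror the fork/snd/rcv instructions):

#include <stdint.h>

typedef uint8_t fu_id;                  /* 8b (row, column) FU address    */

fu_id    fork_fu(void);                 /* fork: allocate a free FU       */
void     snd(fu_id dst, uint8_t addr, uint32_t data);   /* send a message */
void     rcv(uint32_t count);           /* block until count msgs arrive  */
extern uint32_t local_mem[32];          /* this FU's 128B of local memory */

uint32_t call_remote(uint32_t arg)
{
    fu_id callee = fork_fu();           /* allocate an empty FU           */
    /* loading the callee's program via snd is omitted for brevity */
    snd(callee, 0, arg);                /* pass the parameter             */
    /* ... the caller can keep computing here ... */
    rcv(1);                             /* wait until the callee snd's    */
    return local_mem[1];                /* the result back, then read it  */
}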

ROUTING
One would expect routing to be a problem, but because all neighbours have
different clock phases, collisions and queues can easily be avoided.
Each FU knows its own address; if the msg address matches its own, it
executes a store to memory. If not, it sends the msg to the right
neighbour. Right neighbour = a neighbour that is closer to the destination
than we are and has its free flag on. The free flag means it doesn't
contain any msg. Generally we have two possible next hops, and there are
cases where one msg arrives faster than another, but with some principles
this is not considered a problem.
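A rough C model of that next-hop choice, assuming the 8b address packs a
4b row and a 4b column (the names are mine):

#include <stdint.h>

/* Fills steps[] with the (row, column) directions that bring the msg
 * closer to dst; generally there are two, and the router picks whichever
 * neighbour has its free flag set. Returns 0 if the msg has arrived. */
static int next_hops(uint8_t self, uint8_t dst, int steps[2][2])
{
    int sr = self >> 4, sc = self & 0xF;    /* own row, column         */
    int dr = dst  >> 4, dc = dst  & 0xF;    /* destination row, column */
    int n = 0;
    if (dr != sr) { steps[n][0] = (dr > sr) ? 1 : -1; steps[n][1] = 0; n++; }
    if (dc != sc) { steps[n][0] = 0; steps[n][1] = (dc > sc) ? 1 : -1; n++; }
    return n;
}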

SIMPLICITY
It is very important to put as many processors on the die as possible,
even if the clock speed falls:
- add can be the simplest possible: 32 full adders (see the sketch after
this list)
- shl, shr shift only by one
- every instruction executes in two clocks - one clock for fetch, one for
memory. If there is no memory access, a msg can be stored instead (if the
FU is not free).
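For illustration, a toy C model of the "simplest possible" add, i.e. 32
chained full adders (ripple carry):

#include <stdint.h>

static uint32_t ripple_add(uint32_t a, uint32_t b)
{
    uint32_t sum = 0, carry = 0;
    for (int i = 0; i < 32; i++) {              /* one full adder per bit */
        uint32_t ai = (a >> i) & 1, bi = (b >> i) & 1;
        sum   |= (ai ^ bi ^ carry) << i;        /* sum bit                */
        carry  = (ai & bi) | (carry & (ai ^ bi)); /* carry to next stage  */
    }
    return sum;
}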

I/O AND EXTERNAL MEMORY
The special FU (0,0) must have access to the external buses. It can supply
the other FUs in the cluster with I/O and memory data and with programs.
It is also responsible for swapping stalled programs in local memory.

LARGER NETS
Using the fractal idea, this network can be enlarged indefinitely: e.g.
instead of 16x16 FUs we have a 16x16 grid of clusters, each consisting of
16x16 MISCs.

ADVANTAGES/DISADVANTAGES
So as we can see, there is really no need to hurry :) The advantages of
this model are achieved with large processor counts.
The disadvantages are slower external memory operations and slower
sequential programs.

Comments are welcome. If anything is unclear or my English is bad, please
ask.
NOTE: I have studied this model in greater detail, but I have limited
space and time here :(

Marko