[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [oc] Binary Divides for uP core

To: cores@opencores.org
Subject: Re: [oc] Binary Divides for uP core
From: Richard Herveille <richard@asics.ws>
Date: Thu, 17 Jan 2002 05:18:23 +0100
In-Reply-To: <017001c19ea5$f0323820$77a35442@mshome.net>
References: <200201131342.OAA17584@iza.mr2.si><5.0.2.1.2.20020115090603.00ad5610@asics.ws><5.0.2.1.2.20020116100206.00a87650@asics.ws>
Reply-To: cores@opencores.org
Sender: owner-cores@opencores.org


Aah, fooled by the datasheet.

What you are saying is kinda correct. The LUT cell-structure supports all
these connections (similar structures are available in all FPGAs), but the
cell can only be used in certain modes (again true for other FPGAs). These
modes enable some of the connections you are talking about. It is NOT
possible to use them all at the same time. Check the datasheet "LE Operating
Modes" for a list of supported cell modes.

If you set the cell for normal-mode, then you enable the 4-input lut
structure. In this mode the lut has four inputs and outputs one result bit.
It is not possible to produce a second lut output using this mode.
If you set the cell for arithmetic-mode, then you enable the carry structure.
In this mode the 4-input lut, which is actually a 16x1bit ram, is divided
into two pieces, each 8x1bit large. Because each lut is 8x1bit it supports
3 inputs log2(8), two external and a carry-in. One lut produces the result-bit
the second produces the carry-out bit. Well, ok you say, let's use this mode
and produce two bits per cycle. However, it is not possible to directly feed
the carry-output into a register or a different lcell then the next lcell in
the same row (called LAB in Altera devices), you need to feed it into a lut
and then into the accompanying register. So instead of saving an cell, you
sacrifice one with dummy logic (it only passes the carry signal). But the
worst part is, you increase the cycle time by the added delay through the
second lut. However if you use the carry chain, you connect all cells used
in the arithmetic block you describe directly by dedicated high speed
connections. The carry chain is the fastest interconnection available in
Altera devices (and probably in Xilinx as well). Carry-in to Carry-out
generation is 3 to 4 times faster than normal lut operation. This means you
need to generate at least 4bits at a time to be faster than the carry chain.
And since this is impossible with the cell structure, it's a NONO.
Interesting point to think about is what happens when your carry chain delay
is larger than the extra lut+routing delay. But I guess at this point your
additional cell resources start adding up. Nevertheless it deserves some
more thinking than I am doing while I am typing this.

Anyway the basic conclusion is that, although your proposal sounds good, it
is not possible with the available cell structures in FPGAS.

Richard



>Correct me if I am wrong but don't FPGA use a LUT (Look Up Table)
>to emulate the low level logic elements (AND, OR, ...). A device like the
>Altera APEX 20K has 6 inputs (data1-data4, Carry-In, Cascade-In)
>and arguably a 7th input (one of the clears). For output it has conceivably
>4 with register packing: Register, Register Bypass, Carry-Out, Cascade-Out.
>With careful investigation and crafty handiwork you might be able to
>construct the LE to perform two bits at a time. I haven't looked
>at this close enough to see if this can be done.
>
>Placing that aside and working 1 bit at a time for performance you
>would not look at the problem from an abstract pseudo C programmers
>point of view (restricting yourself to operations performed in C) but
>look at the problem from the perspective of what the Logic Element
>does well. For the purposes of division I would try to construct
>a barrel shifter comparitor subtractor. I don't think you could squeeze
>in subtraction. That lacking then create a barrel shifter comparitor while
>using a different set of LEs to do the conditional subtraction.
>
>The point is to pack two (or three) abstract operations (pseudo C)
>into one LE and perform these operations in one cycle. This
>accomplishes several things. Reduction in LE's, reduction in
>clock counts to propagate the solution. This produces a faster
>core while using less resources (and power).
>
>Jim Dempsey
>Newbee to FPGA programming
>Old hand at optimization techniques
>
>----- Original Message -----
>From: "Richard Herveille" <richard@asics.ws>
>To: <cores@opencores.org>
>Sent: Wednesday, January 16, 2002 3:05 AM
>Subject: Re: [oc] Binary Divides for uP core
>
>
> >
> > Correct, these are called high-radix dividers (comparable to high-radix
> > multipliers).
> > If you are targeting ASICS and have time to/want to generate the tables
> > this works great and is faster than a radix-2 (1-bit at a time) divider.
> > If you are (also) targeting FPGAS, this is a NONO. Because it will make
> > the divider slower. This is caused by the fact that the ripple-carry
> > structure in FPGAS is much faster than the other structures. So your
> > critical path is not the adder/subtractor structure, but the
>lookup-tables.
> >
> > Richard
> >
> > >You can also multiply using a combination of look-up table and adder.
> > >Instead of doing a shift and add one bit at a time you can process two,
> > >three, four, ... bits at a time. Some of the CRC32 generators used to
> > >do this (way back).
> > >
> > >Also (way back), some implimentations of division generated 1/n then
> > >performed a multiply.
> > >
> > >     m / n = m * (1 / n)
> > >
> > >The accuracy of the remainder is not the same.
> > >
> > >Jim Dempsey
> > >
> > >
> > >----- Original Message -----
> > >From: "Richard Herveille" <richard@asics.ws>
> > >To: <cores@opencores.org>
> > >Sent: Tuesday, January 15, 2002 2:30 AM
> > >Subject: Re: [oc] Binary Divides for uP core
> > >
> > >
> > > >
> > > > As everybody knows (or should know), a hardware multiplier is simply a
> > > > bunch of adders.
> > > > The proces is called "multiplication by repeated addition".
> > > > Each adder adds a shifted-and-anded part to the final result, see
>example.
> > > >
> > > > A=10, B=2 A*B=
> > > >
> > > >   1010
> > > >     10
> > > > -----
> > > >   0000    ((1010 && 0) << 0)
> > > > 10100    ((1010 && 1) <<1)
> > > > ----- +
> > > > 10100
> > > >
> > > > The final result always fits into Abits + Bbits (4+2=6 bits in the
> > > > example), i.e.
> > > > the result of a k*k bit multiplication always fits into 2k bits.
> > > > Also there is no (real) difference between unsigned and signed
> > >multiplication.
> > > >
> > > >
> > > > Now for the difficult part division:
> > > >
> > > > Differences from the multiplication
> > > > 1) The most common implementation is a bunch of subtractors.
> > > >     This process is called "Division by repeated subtraction".
> > > >     Other implementations use BSD numbers and/or adder-subtractor
> > >structures.
> > > > 2) In a divider bits become available one at a time, starting at the
>MSB.
> > > >     intuitive: Try to divide 1189/4 by hand, you also start at the
>MSD.
> > > > 3) Because of point-2 dividers are inherently slower than multipliers
> > > >     Also single clock division is (almost) impossible.
> > > >     note: Single cycle IS possible (pipeline your design).
> > > > 4) A 2K by K division does NOT always fit into K bits
> > > >     example 0100/01
> > > > 5) There's a big difference between signed and unsigned division.
> > > >
> > > > Here are some common implementations you should try to find, because
>these
> > > > are the simplest to implement:
> > > > "Restoring Unsigned Divider"
> > > > "Non-Restoring Unsigned Divider"
> > > > "Non-Restoring Signed Divider"
> > > > "Division by repeated multiplication"
> > > > The last method works great, if a hardware multiplier is available.
> > > > There are also some structures which combine multipliers and dividers.
> > > >
> > > >
> > > > Richard
> > > >
> > > >
> > > > >Hi,
> > > > >
> > > > >Does anybody know the routine for doing a binary division in HDL?
> > > > >I'm developing a uP core and would like to do divisions in one
> > > > >clock cycle if possible? The multiple instruction was easy to
> > > > >figure out but I'm still stuck on the division one.
> > > > >
> > > > >Paul
> > > > >
> > > > >
> > > > >--
> > > > >To unsubscribe from cores mailing list please visit
> > > > >http://www.opencores.org/mailinglists.shtml
> > > >
> > > > --
> > > > To unsubscribe from cores mailing list please visit
> > >http://www.opencores.org/mailinglists.shtml
> > > >
> > >
> > >--
> > >To unsubscribe from cores mailing list please visit
> > >http://www.opencores.org/mailinglists.shtml
> >
> > --
> > To unsubscribe from cores mailing list please visit
>http://www.opencores.org/mailinglists.shtml
> >
>
>--
>To unsubscribe from cores mailing list please visit 
>http://www.opencores.org/mailinglists.shtml

--
To unsubscribe from cores mailing list please visit http://www.opencores.org/mailinglists.shtml

References:
- Re: [oc] LVDS core
  - From: rudi@opencores.org
- Re: [oc] Binary Divides for uP core
  - From: Richard Herveille <richard@asics.ws>
- Re: [oc] Binary Divides for uP core
  - From: Richard Herveille <richard@asics.ws>
- Re: [oc] Binary Divides for uP core
  - From: "Jim Dempsey" <dempsey@northnet.net>

Prev by Date: Re: [oc] Binary Divides for uP core
Next by Date: [oc] Wishlist and CVS Links
Prev by thread: Re: [oc] Binary Divides for uP core
Next by thread: [oc] discussion on public forums
Index(es):
- Date
- Thread