Wizard

Wizard @0FC7.ADF - Wizard Attached Processor
I0FC7.ADF - Init file for @0FC7.ADF
ADF Sections (all three of them..)
189-186 Wizard Adapter and Wizard Memory Expansion Option
GRIDNET: Natural Gas Operations Optimizing System

Wizard Adapter PN 34F3062

U30 IBM I0120006
U32 IBM I0142101
U33 IBM I0122188

U59 32.000 MHz osc
U62 D27C512 EPROM (BIOS)

RAM, right bank, Toshiba, 511002AZ-80
left bank Mitsubishi, M5M41002AL-80

Wizard Memory Expansion Option PN 34F3061

Each bank of 24 DRAM modules equals 3MB, so the adapter adds 6MB to the Wizard.

Original images from Dan Snyder

       The i860 microprocessor (announced by Intel on February 27, 1989) combines a RISC integer core with an advanced floating point processor, a graphics unit, and internal instruction and data caches in a single one-million transistor chip.
       The Wizard Adapter contains 2MB of DRAM (85ns). The application or subroutine running on the Wizard resides in this on-card memory. Users with application requirements greater than 2MB need the Wizard Memory Expansion Option adapter. This adapter attaches to the Wizard Adapter and provides an additional 6MB of memory, for a maximum of 8MB.
       The Wizard operates under OS/2 Standard Edition Version 1.1 and OS/2 Extended Edition Version 1.1, and later versions, through the use of application device drivers shipped with this product. These drivers control the use of the i860 processor and provide interface functions to the 80386/80486.
       In order to take advantage of the Wizard Adapter, applications must be recompiled to run with the i860 processor, using the Intel i860 Microprocessor OS/2 Software Development Tools. These tools include a Simulator, Linker, Assembler, Debugger, C Compilers and Libraries. This toolkit is available through IBM (refer to Programming Announcement 289-638, dated November 14, 1989). Intel intends to provide a FORTRAN Toolkit (including FORTRAN Vectorizer) for OS/2 FORTRAN applications.

The IBM PS/2 Model 80-111 and 80-311 require planar EC C00835 in order to operate with the IBM PS/2 Wizard Adapter. Customers with PS/2 Model 80-111 with serial numbers 72-6000000 through 72-6039999 or PS/2 Model 80-311 with serial numbers 72-6500000 through 72-6509999 should contact their IBM representative or authorized dealer for information on obtaining this modification. (Ed. I think these are the non-busmaster compatible planars)

Initially supported products:
- PS/2 Model 70 or Model 80. NOTE: The Model P70 is not supported.

Power Requirements
Due to power requirements, the user must dedicate two full expansion slots to accommodate the Wizard Adapter. If the user has installed both the Wizard Adapter and the Wizard Memory Expansion Option, the two allocated slots will be sufficient.

Dan sent this as well-

i860 Overview
Jan Gray uunet!microsoft!jangr Microsoft Corp., Redmond Wash. 206-882-8080. Any typos/misinterpretations are my own. I speak only for myself.

(what I consider interesting features of the part), taken from the "i860(tm) 64-bit Microprocessor Programmer's Reference Manual", Order Number 240329-001, (C) Intel Corp. 1989.

Overview
* 64 bit external data/instruction bus
* 128 bit on-chip data bus
* 64 bit on-chip instruction bus
* 8K data cache, virtual addressed, write-back, two-way "set associative", 2x128 lines of 32 bytes
* 4K instruction cache, virtual addressed
* 64 entry TLB
* core integer RISC unit
* floating-point unit with pipelined multiply and add units (can also be used "unpipelined")
* some multiply-accumulate type floating point instructions
* dual instruction mode can simultaneously dispatch a 32-bit core instruction and a 32-bit floating-point instruction
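
The data cache geometry listed above can be checked with a little arithmetic. The sketch below is only an illustration: the way a virtual address is split into tag, set index and byte offset is an assumption (low 5 bits for the byte, next 7 bits for the set), not something stated in the overview, and the function name is made up for this example.

```python
# Data cache geometry as stated in the overview: 2 ways x 128 lines x 32 bytes.
WAYS = 2
LINES_PER_WAY = 128
LINE_BYTES = 32

CACHE_BYTES = WAYS * LINES_PER_WAY * LINE_BYTES  # 8192 bytes, i.e. 8K


def split_cache_address(vaddr):
    """Split a 32-bit virtual address into (tag, set_index, byte_offset).

    Assumption: the low 5 bits select the byte within a 32-byte line and the
    next 7 bits select one of the 128 sets. The real part may index differently.
    """
    offset = vaddr & (LINE_BYTES - 1)
    set_index = (vaddr >> 5) & (LINES_PER_WAY - 1)
    tag = vaddr >> 12
    return tag, set_index, offset
```

With this split, two addresses that differ only above bit 11 compete for the same set, and the two ways let both stay cached at once.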

Data Types
* BE bit in epsr (extended processor status register) selects big/little endian format in memory, instructions always little-endian
* 32 bit signed/unsigned integers
* IEEE 754 format single (32-bit) and double (64-bit) precision floating point numbers
* pixels:
* stored as 8, 16, or 32 bits (always operates on 64 bits of pixels at a time)
* colour intensity shading instructions divide pixels into fields:
  pixel size   colour 1 bits   colour 2 bits   colour 3 bits   other bits
       8       ....................N........................     8 - N
      16             6               6               4             0
      32             8               8               8             8
[These particular field assignments are a result of the pixel add instructions described below.]
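
To make the field table concrete, here is a minimal sketch that splits a 16 or 32 bit pixel into its fields. The widths come from the table; the bit ordering (colour 1 in the most significant bits) and the names `FIELD_WIDTHS` and `split_pixel` are assumptions made for illustration only.

```python
# Field widths (colour 1, colour 2, colour 3, other) per pixel size, from the table.
FIELD_WIDTHS = {16: (6, 6, 4, 0), 32: (8, 8, 8, 8)}


def split_pixel(pixel, size):
    """Return the (colour1, colour2, colour3, other) fields of one pixel.

    Assumption: colour 1 sits in the most significant bits and the fields
    follow in table order. The manual defines the actual layout.
    """
    widths = FIELD_WIDTHS[size]
    fields = []
    shift = size
    for width in widths:
        shift -= width
        # A zero-width field has an empty mask and therefore reads as 0.
        fields.append((pixel >> shift) & ((1 << width) - 1))
    return tuple(fields)
```

For example, `split_pixel(0x11223344, 32)` yields the four bytes of the pixel in order.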

Memory Management
* NO SEGMENTS!
* 32 bit virtual addresses (translation can be disabled)
* translated identically to a 386 virtual address: two level address translation, with bits 31..12 of the address selecting the physical page:
* dirbase register specifies page directory
* 1st level: addr[31..22] specifies page directory entry, yielding permissions and address of the second level page table
* 2nd level: addr[21..12] specifies page table entry, yielding additional permissions and address of the physical page
* addr[11..0] specifies byte offset within physical page (4K pages)
* page table bits:
* P - page is present
* CD - cache disable: page is not cacheable
* WT - page is write-through. disables internal caching. Either CD or WT can be passed through to the external PTB pin, depending upon PBM bit in epsr.
* U - user: if 0, page is inaccessible in user mode.
* W - writable: if 0, page is not writable in user mode, and may be writable in supervisor mode depending upon WP bit in epsr.
* A - accessed: automatically set first time page is accessed
* D - dirty: traps when D=0 and page is written
* two bits reserved, three bits user-definable
* page directory PTE bits and second level PTE bits are combined in the most restrictive fashion
* 64 entry TLB
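
The two level translation described above can be sketched as follows. The bit ranges are the ones given in the list; the nested-dictionary page tables and the function names are purely hypothetical stand-ins for the in-memory structures, and permission bits are left out.

```python
PAGE_SIZE = 4096  # 4K pages, so the low 12 bits are the byte offset


def split_virtual_address(addr):
    """Return (directory index, table index, byte offset) for a 32-bit address."""
    dir_index = (addr >> 22) & 0x3FF    # addr[31..22]
    table_index = (addr >> 12) & 0x3FF  # addr[21..12]
    offset = addr & (PAGE_SIZE - 1)     # addr[11..0]
    return dir_index, table_index, offset


def translate(addr, page_directory):
    """Walk a toy two-level table held in nested dicts (hypothetical layout).

    page_directory maps a directory index to a page table; each page table
    maps a table index to the physical base address of a 4K page.
    A missing entry raises KeyError where real hardware would take a trap.
    """
    dir_index, table_index, offset = split_virtual_address(addr)
    page_table = page_directory[dir_index]
    return page_table[table_index] + offset
```

A single mapping such as `{1: {3: 0x0009A000}}` is enough to translate `0x00403004` to `0x0009A004`.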

Caches
* Flush instruction forces a dirty data cache line (32 bytes) back to memory. Intel supplies suggested code to flush entire data cache.
* Storing to dirbase register with ITI bit set invalidates TLB and instruction caches; must flush data cache first! [Remember, the data cache is virtually addressed.]

Core Unit
* Standard 32 bit RISC architecture:
* 32 32-bit integer registers
* fault instruction, psr, epsr, dirbase, data breakpoint registers
* r0 always reads as 0
* 8, 16, 32 bit integer load/store insns, operands must be appropriately aligned; byte or word values are sign extended on load. [I hope you don't use "unsigned char" too much...]
* 2 source, 1 destination add/subtract/logical (and, andnot, or, xor)
* No integer multiply/divide instructions. To multiply, you move the operands to floating point registers, use multiply (four insns plus five free delay slots). To divide, you move the dividend to a floating point register and multiply by the reciprocal. This can be very slow (59 clocks) if the divisor is a variable (hopefully infrequent).
* 32 bit shift left/right/right-arithmetic, plus 64 bit funnel shift ("shift right double"). They ran out of bits to specify two 32 bit sources plus destination plus shift count, so the shift count of the last 32 bit shift right (automatically stored in the 5 bit SC field of the psr) is used.
* Similar to MIPS Rx000 architecture in some ways:
* load/store addressing mode is src1(src2), src1 is a register or 16 bit immediate constant.
* form 32 bit constants using andh/andnoth/orh/xorh on upper 16 bits of a register
* Only one condition code bit (CC), set in various ways by signed/unsigned add/subtract/logical operations, unaffected by shift ops
* Delayed and non-delayed branches on CC set/not set (bc[.t], bnc[.t])
* Non-delayed branch on src1 ==/!= src2 (bte, btne)
* Strange delayed branch "bla" instruction, for one instruction looping. Useful for aoblss/dsz/isg type looping. Uses its own special LCC condition code bit. "Programs should avoid calling subroutines while within a bla loop, because a subroutine may use bla also and change LCC". [Ug.]
* Trap, trap on integer overflow instructions
* Call/call indirect, stores return address in r1.
* Unconditional branch, branch indirect, latter also used for return and return from trap.
* Core unit loads and stores floating point operands of 32, 64, and 128 bits
* Pipelined floating load instruction (32/64 bits) queues an address of an operand not expected to be in cache, and stores the result of the third previous pipelined floating load into the destination floating register. [This is the data-loading component of the i860 "vector" support.]
* Bus lock/unlock instructions for flexible indivisible read-modify-write sequences. Interrupts are disabled while the bus is locked. "If ... the processor does not encounter a load or store following an unlock instruction by the time it has executed 32 instructions, it triggers an instruction fault...".
For example: locked test and set is:
// r22 <- semaphore, semaphore <- r23
lock                // next cache miss load/store locks bus
ld.b    semaphore, r22
unlock             // next load/store unlocks bus
st.b    r23, semaphore

* Pixel store instructions for selectively updating particular masked pixels in a 64-bit memory location, used for Z-buffer hidden surface elimination. Pixel mask is set by fzchk instructions (in floating point/graphics unit)
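
The 64 bit funnel shift mentioned in the list above is easy to model. This is a sketch under assumptions: it treats src1 as the high half and src2 as the low half of the 64 bit value, and passes the shift count as an ordinary argument, whereas the real instruction takes it from the SC field of the psr. The function names are made up for this example.

```python
MASK32 = 0xFFFFFFFF


def shift_right_double(src1, src2, sc):
    """Low 32 bits of the 64-bit value src1:src2 shifted right by sc.

    Assumption: src1 is the high word and src2 the low word. sc is limited
    to 5 bits, matching the 5 bit SC field described in the text.
    """
    combined = ((src1 & MASK32) << 32) | (src2 & MASK32)
    return (combined >> (sc & 0x1F)) & MASK32


def rotate_right(value, sc):
    """A 32-bit rotate falls out of the funnel shift when both halves match."""
    return shift_right_double(value, value, sc)
```

Feeding the same register in as both halves is the usual way to get a rotate out of a funnel shifter.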

Floating Point Unit
* 32 32 bit single precision floating point registers, can also be treated as 16 64 bit double precision registers.
* graphics operands also stored in the fp registers
* f0/f1 read as 0
* pipelined multiply and add units
* floating point instructions can be non-pipelined, or pipelined
* Similar to the pipelined load above, in a pipelined multiply or add instruction, the source operands go into the pipeline, and the result of the 3rd (or so) previous pipelined multiply or add is stored in the destination register(s).
* Pipeline lengths
* adder:     3 stages
* multiplier: 2 or 3 stages (2 double precision, 3 single(!))
* graphics: 1
* load:      3 (loads issued from core unit above)
* IEEE status bits percolate through the fp pipelines, and can be reloaded, along with the pipeline contents, after traps
* Divide? Ha! If Seymour can do it with reciprocals, so can the i860. The frcp and frsqr insns return approximate reciprocal and 1/square root "with absolute significand error < 2^-7". Intel supplies routines for Newton-Raphson approximations that take 22 clocks (*almost* single precision) or 38 clocks (*almost* double precision), and the Intel i860 library provides true IEEE divide. [RISC design principles at work: divides are infrequent enough not to slow down/drop some other feature to provide divide hardware.]
* Dual operation instructions (not "dual mode"): Some pipelined instructions cause both a pipelined add and a multiply operation to take place. Since the instruction can only encode two source operands, the others are taken from temporary holding registers and busses connecting the two units in various topologies, depending upon the data path control field of the instruction opcode. [Many real world computations e.g. dot product can make use of these instructions.]
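
The reciprocal-based divide described above rests on Newton-Raphson refinement, where each step roughly doubles the number of correct bits. The sketch below is the textbook iteration, not Intel's supplied routine; `crude_reciprocal` is a hypothetical stand-in for frcp that simply injects a relative error of 2^-8.

```python
def crude_reciprocal(d):
    """Stand-in for frcp (assumed model): 1/d with a relative error of 2^-8."""
    return (1.0 / d) * (1.0 + 2.0 ** -8)


def refine(d, x):
    """One Newton-Raphson step for 1/d: a relative error e becomes about e*e."""
    return x * (2.0 - d * x)


def reciprocal(d, iterations=3):
    """Refine the crude estimate; 2^-8 -> 2^-16 -> 2^-32 -> 2^-64."""
    x = crude_reciprocal(d)
    for _ in range(iterations):
        x = refine(d, x)
    return x


def divide(n, d):
    """n / d computed as n * (1/d); close to, but not, a true IEEE divide."""
    return n * reciprocal(d)
```

Each step costs two multiplies and a subtract, which is why a seed of about seven bits needs only a couple of steps for single precision and one more for double.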

Dual Instruction Mode
* DIM allows the i860 to run both a core and a floating/graphics unit insn on each cycle. The resulting 64 bit "wide instruction" must be 64 bit aligned.
* There is a two cycle latency: two cycles after a floating instruction with the D bit set, both a core and a floating insn will be issued. Similarly, if the D bit is clear, there will be no DIM two cycles (two instruction pairs) later.
* There are various sensible rules for determining the result of insn pairs which set/use common registers, control registers, etc.

Graphics Unit
* Pipelined and non pipelined 64 bit integer add and subtract.
* 16/32 bit non/pipelined Z buffer check instructions:
"fzchks src1, src2, rdest (16 bit Z-Buffer Check)
   Consider src1, src2, and rdest as arrays of four 16 bit fields
   src1(0..3), src2(0..3), rdest(0..3), where zero denotes the
   least-significant field.

   PM <- PM >> 4
   FOR i = 0 to 3
   DO
     PM[i+4] <- src2(i) <= src1(i) (unsigned)
     rdest(i) <- smaller of src2(i) and src1(i)
   OD
   MERGE <- 0"
This particular instruction merges four (arbitrary sized) pixels whose 16 bit Z-buffer values are in one of the (64 bit) sources, and the current Z-buffer value in the other source, setting pixel mask bits (controlling the pixel store insn described above), and updating the Z-buffer depth values. [Neat! Just what my (personal) graphics package ordered!]
* Pixel add instructions, which add fixed point values, the results accumulating in a special MERGE register. You can use these to interpolate between (for instance) two colours as you scan convert a polygon.
* Z-buffer add instructions, for the analogous case of distance interpolation.
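
The quoted fzchks pseudocode translates almost line for line into a small software model. This is a sketch for following the data flow only: the pixel mask is passed in and returned as a plain 8 bit integer, the MERGE register is ignored, and the function name is chosen for this example.

```python
def fzchks(src1, src2, pm):
    """Model of the quoted 16-bit Z-buffer check.

    src1 and src2 are 64-bit integers holding four 16-bit fields, field 0
    being the least significant. pm is the 8-bit pixel mask.
    Returns (rdest, new_pm). The MERGE <- 0 step is not modelled.
    """
    pm = (pm & 0xFF) >> 4                  # PM <- PM >> 4
    rdest = 0
    for i in range(4):
        a = (src1 >> (16 * i)) & 0xFFFF    # src1(i)
        b = (src2 >> (16 * i)) & 0xFFFF    # src2(i)
        if b <= a:                         # unsigned compare
            pm |= 1 << (i + 4)             # PM[i+4] <- src2(i) <= src1(i)
        rdest |= min(a, b) << (16 * i)     # keep the smaller depth value
    return rdest, pm
```

Because the mask is shifted right by four before the new bits go in at the top, two consecutive checks fill all eight mask bits, one bit per pixel.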

Traps
Briefly, there are instruction, floating point, instruction access, data access, interrupt, and reset traps. On a trap, the i860 enters supervisor mode, saves/modifies various psr bits, saves the faulting instruction address, and jumps to the trap handler which must be at 0xFFFFFF00. There are various complications for dual instruction mode, bus lock mode, and for saving/restoring the various pipeline states.

Interlocks
The i860 is fully interlocked, so no need to insert nops. You can, of course, increase performance by reordering insns with dependencies. For instance, in the current implementation, referencing the result of a ld in the next instruction can cause a one clock delay.

Other interesting timings:
* TLB miss: five clocks plus the number of clocks to finish two reads plus the number of clocks to set A (accessed) bit, if necessary. [I guess Intel found Mips' and others' software TLB lookup unworthy...]
* ld/fld following st/fst hit: one clock.
* delayed branch not taken: one clock [to skip/annul the delay slot instruction]
* nondelayed branch taken: bc, bnc: one clock; bte, btne: two clocks
* st.c (store to a control register): two clocks.

Comments
Well, that about does it. Quite a neat part; I think Intel has done themselves proud with a very clean and well-balanced design; I guess they've been reading comp.arch... :-) I had read rumours that this was to be a floating point coprocessor for the x86, and had feared that it would be burdened with lots of slave-processor crap, but that is not the case.
If I could change one thing, it would be to add Mips' on-chip external cache control hardware. Why hasn't anyone else picked up on this idea? I'm afraid that for some code (not *mine*, of course) the 4K on-chip insn cache will be too small; a cache controller would allow you to add big external caches with a minimum of heartache. "I guess there's no pleasing some people!"

AdapterID 0FC7 "Wizard Attached Processor"

Adapter I/O Location
Base Address must be 8AA0
<"Base Address 8AA0">

DMA Arbitration Level
DMA Arbitration level used to transfer data.
<"Level D">, E, 8, 9, A, B, C

Interrupt Level
Interrupt line used to signal the host
<"Level 11">, 10, 15, Level 5
