### ECE/CS 552: Midterm Review

Instructor: Mikko H Lipasti

Fall 2010 University of Wisconsin-Madison

Lecture notes based on notes by Mark Hill and John P. Shen Updated by Mikko Lipasti

### **Computer Architecture**

- Exercise in engineering tradeoff analysis - Find the fastest/cheapest/power-efficient/etc. solution
- Optimization problem with 100s of variables
- All the variables are changing
  - At non-uniform rates
  - With inflection points
  - Only one guarantee: Today's right answer will be wrong tomorrow
- Two high-level effects:
  - Technology push
  - Application Pull

### Abstraction

- Difference between interface and implementation
  - Interface: WHAT something does
  - Implementation: HOW it does so

### What's the Big Deal?

- Tower of abstraction
- Complex interfaces implemented by layers below
- Abstraction hides detail
- Hundreds of engineers build one product
- Complexity unmanageable otherwise



### Performance vs. Design Time

- Time to market is critically important
- E.g., a new design may take 3 years
  - It will be 3 times faster
  - But if technology improves 50%/year
  - In 3 years  $1.5^3 = 3.38$
  - So the new design is worse!
     (unless it also employs new technology)

### **Bottom Line**

- Designers must know BOTH software and hardware
- Both contribute to layers of abstraction
- IC costs and performance
- Compilers and Operating Systems





### Ch 2 Summary

- Basics
- Registers and ALU ops
- Memory and load/store
- Branches and jumps
- Addressing Modes

## Summary: Instruction Formats

| R: opcode                     | rs      | rt      | rd     | shamt  | function |
|-------------------------------|---------|---------|--------|--------|----------|
| 6                             | 5       | 5       | 5      | 5      | 6        |
| I: opcode                     | rs      | rt      | addres | s/imme | diate    |
| 6                             | 5       | 5       | 16     |        |          |
| J: opcode                     | addr    |         |        |        |          |
| 6                             | 26      |         |        |        |          |
| <ul> <li>Instructi</li> </ul> | on de   | code:   |        |        |          |
| – Read i                      | nstruct | ion bit | S      |        |          |
|                               |         |         | 1      |        |          |

- Activate control signals

### Conclusions

- Simple and regular
- Constant length instructions, fields in same placeSmall and fast
- Small number of operands in registers
- Compromises inevitable
- Pipelining should not be hindered
- Make common case fast!
- Backwards compatibility!

### Basic Arithmetic and the ALU

- Number representations: 2's complement, unsigned
- $\bullet \ Addition/Subtraction$
- Add/Sub ALU
  - Full adder, ripple carry, subtraction
- Carry-lookahead addition
- Logical operations
  - and, or, xor, nor, shifts
- Overflow



- $f(b31..b0) = b31 \ge 2^{31} + \ldots + b1 \ge 2^1 + b0 \ge 2^0$
- Treat as normal binary number
   E.g. 0...01101010101
   = 1 x 2<sup>7</sup> + 1 x 2<sup>6</sup> + 0 x 2<sup>5</sup> + 1 x 2<sup>4</sup> + 1 x 2<sup>3</sup> + 0 x 2<sup>1</sup> + 1 x 2<sup>0</sup>
   = 128 + 64 + 16 + 4 + 1 = 213
- Max  $f(111...11) = 2^{32} 1 = 4,294,967,295$
- Min f(000...00) = 0
- Range  $[0,2^{32}-1] => \#$  values  $(2^{32}-1) 0 + 1 = 2^{32}$

### **Signed Integers**

- 2's complement f(b31 ... b1 b0) = -b31 x 2<sup>31</sup> + ... b1 x 2<sup>1</sup> + b0 x 2<sup>0</sup>
- Max  $f(0111...11) = 2^{31} 1 = 2147483647$
- Min f(100...00) = -2<sup>31</sup> = -2147483648 (asymmetric)
- Range[ $-2^{31}, 2^{31}-1$ ] => # values( $2^{31}-1 2^{31} 1$ ) =  $2^{32}$
- E.g. –6
  - $-000...0110 \Rightarrow 111...1001 + 1 \Rightarrow 111...1010$















### Addition Overflow

- 2 + 3 = 5 > 4: 010 + 011 = 101 =? −3 < 0 − X is f(2)
- -1 + -4: 111 + 100 = 011 > 0
- Y is  $\sim f(2)$ Overflow = f(2) \*  $\sim (a2)$ \* $\sim (b2)$  +  $\sim f(2)$  \* a(2) \* b(2)

### Subtraction Overflow

- No overflow on a-b if signs are the same
- Neg pos => neg ;; overflow otherwise
- Pos neg => pos ;; overflow otherwise

### What to do on Overflow?

- Ignore ! (C language semantics) - What about Java? (try/catch?)
- Flag condition code
- Sticky flag e.g. for floating point
  - Otherwise gets in the way of fast hardware
- Trap possibly maskable
   MIPS has e.g. add that traps, addu that does not

### Ch. 3 Summary

- Binary representations, signed/unsigned
- Arithmetic
  - Full adder, ripple-carry, carry lookahead
  - Carry-select, Carry-save
  - Overflow, negative
  - More (multiply/divide/FP) later
- Logical
  - Shift, and, or

### Ch. 4 Processor Implementation

- Heart of 552 key to project
  - Sequential logic design review (brief)
  - Clock methodology (FSD)
  - Datapath 1 CPI
    - Single instruction, 2's complement, unsigned
  - Control
  - Multiple cycle implementation (information only)
  - Microprogramming (information only)
  - Exceptions

### **Clocking Methology Our Methodology** • Motivation • Only flip-flops - Design data and control without considering clock • All on the same edge (e.g. falling) - Use Fully Synchronous Design (FSD) • All with same clock · Just a convention to simplify design process - No need to draw clock signals Restricts design freedom • All logic finishes in one cycle · Eliminates complexity, can guarantee timing correctness · Not really feasible in real designs • Even in 554 you will violate FSD FFs FFs Logic Logic

















| <b>C</b> . | <b>D</b>    | iteps                             |  |
|------------|-------------|-----------------------------------|--|
| Step       | Description | Sample Actions                    |  |
| IF         | Fetch       | IR=MEM[PC]                        |  |
|            |             | PC=PC+4                           |  |
| ID         | Decode      | A=RF(IR[25:21])                   |  |
|            |             | B=RF(IR[20:16])                   |  |
|            |             | Target=PC+SE(IR[15:0] << 2)       |  |
| EX         | Execute     | ALUout = A + SE(IR[15:0]) # lw/sw |  |
|            |             | ALUout = A op B # rrr             |  |
|            |             | if (A==B) PC = target # beq       |  |
| Mem        | Memory      | MEM[ALUout] = B # sw              |  |
|            |             | MDR = MEM[ALUout] #lw             |  |
|            |             | RF(IR[15:11]) = ALUout # rrr      |  |
| WB         | Writeback   | Reg(IR[20:16]) = MDR # lw         |  |





### Microprogramming

- Alternative way of specifying control
- FSM
  - State bubble
  - Control signals in bubble
  - Next state given by signals on arc
  - Not a great language for specifying complex events
- Instead, treat as a programming problem

### Microprogramming

- Datapath remains the same
- Control is specified differently but does the same
- Each cycle a microprogram field specifies required control signals

| Label | Alu | Src1 | Src2    | Reg       | Memory   | Pcwrite | Next?      |
|-------|-----|------|---------|-----------|----------|---------|------------|
| Fetch | Add | Pc   | 4       | Read pc   | Alu      | Alu     | +1         |
|       | Add | Pc   | Extshft | Read      |          |         | Dispatch 1 |
| Mem1  | Add | А    | Extend  |           |          |         | Dispatch 2 |
| Lw2   |     |      |         |           | Read alu |         | +1         |
|       |     |      |         | Write mdr |          |         | fetch      |

## **Exceptions: Big Picture**

- Two types:
  - Interrupt (asynchronous) or
  - Trap (synchronous)
- Hardware handles initial reaction
- Then invokes a software exception handler
  - By convention, at e.g. 0xC00
  - O/S kernel provides code at the handler address

### **Exceptions: Hardware**

- Sets state that identifies cause of exception – MIPS: in exception\_code field of Cause register
- Changes to kernel mode for dangerous work ahead
- Disables interrupts
   MIPS: recorded in status register
- Saves current PC (MIPS: exception PC)
- Jumps to specific address (MIPS: 0x80000080)
   Like a surprise JAL so can't clobber \$31

### **Exceptions: Software**

- Exception handler: - MIPS: .ktext at 0x80000080
- Set flag to detect incorrect entry - Nested exception while in handler
- Save some registers
- Find exception type
- E.g. I/O interrupt or syscall
- Jump to specific exception handler

### Exceptions: Software, cont'd

- Handle specific exception
- Jump to clean-up to resume user program
- Restore registers
- Reset flag that detects incorrect entry
- Atomically
- Restore previous mode
- Enable interrupts
- Jump back to program (using EPC)

### Implementing Exceptions

- We worry only about hardware, not s/w
- IntCause
  - 0 undefined instruction
  - 1 arithmetic overflow
- Changes to the datapath
- New states in control FSM



| Туре             | Control              | Datapath | Time (CPI, cycle time)                        |
|------------------|----------------------|----------|-----------------------------------------------|
| Single-<br>cycle | Comb + end<br>update | No reuse | 1 cycle, (imem + reg + ALU + dmem)            |
| Multi-<br>cycle  | Comb + FSM<br>update | Reuse    | [3,5] cycles,<br>Max(imem, reg, ALU,<br>dmem) |
| We<br>want?      | ?                    | ?        | ~1 cycle, Max(imem, reg, ALU, dmem            |

### Pipelining (4.5-4.9)

- Summary
  - Big Picture
  - Datapath
  - Control
  - Data Hazards
    - Stalls
  - Forwarding
     Control Hazards
  - Exceptions



### **Ideal Pipelining**

| Cycle:<br>Instr: | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 1 | 1 | 1 | 1 |
|------------------|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Instr:           |   |   |   |   |   |   |   |   |   | 0 | 1 | 2 | 3 |
| i                | F | D | Х | М | W |   |   |   |   |   |   |   |   |
| i+1              |   | F | D | Х | М | W |   |   |   |   |   |   |   |
| i+2              |   |   | F | D | Х | М | W |   |   |   |   |   |   |
| i+3              |   |   |   | F | D | Х | М | W |   |   |   |   |   |
| i+4              |   |   |   |   | F | D | Х | М | W |   |   |   |   |

### **Pipelining Idealisms**

- Uniform subcomputations
- Can pipeline into stages with equal delayIdentical computations
- Can fill pipeline with identical work
- Independent computations
  - No relationships between work units
- Are these practical?
   No, but can get close enough to get significant speedup

### Complications

- Datapath
  - Five (or more) instructions in flight
- Control
  - Must correspond to multiple instructions
- Instructions may have
  - data and control flow dependences
  - I.e. units of work are not independent
    - One may have to stall and wait for another

## Program Data Dependences• True dependence (RAW) $D(i) \cap R(j) \neq \phi$ - j cannot execute until i<br/>produces its result $D(i) \cap R(j) \neq \phi$ • Anti-dependence (WAR) $R(i) \cap D(j) \neq \phi$ - j cannot write its result until i<br/>has read its sources $D(i) \cap D(j) \neq \phi$ • Output dependence (WAW)-j cannot write its result until i<br/>has written its result

### **Control Dependences**

- Conditional branches
  - Branch must execute to determine which instruction to fetch next
  - Instructions following a conditional branch are control dependent on the branch instruction

### **Resolution of Pipeline Hazards**

- Pipeline hazards
  - Potential violations of program dependences
  - Must ensure program dependences are not violated
- Hazard resolution
  - Static: compiler/programmer guarantees correctness
     Dynamic: hardware performs checks at runtime
- Pipeline interlock
  - Hardware mechanism for dynamic hazard resolution
  - Must detect and enforce dependences at runtime

### **Pipeline Hazards**

- Necessary conditions:
  - WAR: write stage earlier than read stage
     Is this possible in IF-RD-EX-MEM-WB ?
  - WAW: write stage earlier than write stage
     Is this possible in IF-RD-EX-MEM-WB ?
  - RAW: read stage earlier than write stage
     Is this possible in IF-RD-EX-MEM-WB?
- If conditions not met, no need to resolve
- Check for both register and memory





### **Pipelined Control**

- Controlled by different instructions
- Decode instructions and pass the signals down the pipe
- Control sequencing is embedded in the pipeline

### **Data Hazards**

### • Must first detect hazards

ID/EX.WriteRegister = IF/ID.ReadRegister1 ID/EX.WriteRegister = IF/ID.ReadRegister2 EX/MEM.WriteRegister = IF/ID.ReadRegister1 EX/MEM.WriteRegister = IF/ID.ReadRegister2 MEM/WB.WriteRegister = IF/ID.ReadRegister1 MEM/WB.WriteRegister = IF/ID.ReadRegister2





## **Control Flow Hazards**

- What to do?
  - Always stall
  - Easy to implement
  - Performs poorly
  - 1/6<sup>th</sup> instructions are branches, each branch takes 3 cycles
  - CPI = 1 + 3 x 1/6 = 1.5 (lower bound)

### **Control Flow Hazards**

- Predict branch not taken
- Send sequential instructions down pipeline
- Kill instructions later if incorrect
- Must stop memory accesses and RF writes Including loads (why?)
- Late flush of instructions on misprediction - Complex
  - Global signal (wire delay)

### Exceptions

- Even worse: in one cycle - I/O interrupt
  - User trap to OS (EX)
  - Illegal instruction (ID)
  - Arithmetic overflow
  - Hardware error
  - Etc.
- Interrupt priorities must be supported















| _imits on Instr           | uction Level                |
|---------------------------|-----------------------------|
| Parallelism (IL           | _P)                         |
| Weiss and Smith [1984]    | 1.58                        |
| Sohi and Vajapeyam [1987] | 1.81                        |
| Tjaden and Flynn [1970]   | 1.86 (Flynn's bottleneck)   |
| Tjaden and Flynn [1973]   | 1.96                        |
| Uht [1986]                | 2.00                        |
| Smith et al. [1989]       | 2.00                        |
| Jouppi and Wall [1988]    | 2.40                        |
| Johnson [1991]            | 2.50                        |
| Acosta et al. [1986]      | 2.79                        |
| Wedig [1982]              | 3.00                        |
| Butler et al. [1991]      | 5.8                         |
| Melvin and Patt [1991]    | 6                           |
| Wall [1991]               | 7 (Jouppi disagreed)        |
| Kuck et al. [1972]        | 8                           |
| Riseman and Foster [1972] | 51 (no control dependences) |
| Nicolau and Fisher [1984] | 90 (Fisher's optimism)      |

# Superscalar Proposal Go beyond single instruction pipeline, achieve IPC > 1 Dispatch multiple instructions per cycle Provide more generally applicable form of concurrency (not just vectors) Geared for sequential code that is hard to parallelize otherwise Exploit fine-grained or instruction-level parallelism (ILP)

### **Classifying ILP Machines** [Jouppi, DECWRL 1991]

- Scalar pipelined
- Superpipelined
- Superscalar
- VLIW
- Superpipelined superscalar

### **Review Summary**

- Ch. 1: Intro & performanceCh. 2: Instruction Sets
- Ch. 3: Arithmetic I • Ch. 4: Data path, control, pipelining
- Details
  - Fri. 10/29 2:25-3:30 (1 hour) in EH2317
  - Closed books/notes/homeworks
  - One page handwritten cheatsheet for quick reference
  - A mix of short answer, design, analysis problems