



## Silicon Debug

ECE Department, University of Texas at Austin

- Test the first chips back from fabrication
  - If you are lucky, they work the first time
  - If not . . .
- Logic bugs vs. electrical failures
  - Most chip failures are logic bugs from inadequate simulation or verification
  - Some are electrical failures
    - Crosstalk
    - Dynamic nodes: leakage, charge sharing
    - Ratio failures
  - A few are tool or methodology failures (e.g., DRC)
- Fix the bugs and fabricate a corrected chip
- Silicon debug (or "bringup") is primarily a Non-Recurring Engineering (NRE) cost (like design)
- Contrast this with manufacturing test which has to be applied to every part shipped



Jacob Abraham, November 3, 2020

## Shmoo Plots

- How to diagnose failures?
- Difficult to access chips
  - Picoprobes
  - Electron beam
  - Laser voltage probing
- Built-in self-test
- Shmoo plots
  - Vary voltage, frequency
  - Look for cause of electrical failures

[Figure: two example shmoo plots. Clock period in ns on the vertical axis (1.0 to 1.5, frequency increases going up); voltage on the horizontal axis (1.0 to 1.5, increasing left to right); * indicates a failure.]

- "Wall": fails at a certain voltage (coupling, charge share, races)
- "Reverse speedpath": increase in voltage reduces frequency (speedpath, leakage)
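As a rough illustration of how a shmoo plot is produced, the sketch below sweeps voltage and clock period and marks each failing point with `*`. The `passes` function is a hypothetical stand-in for running a test on real silicon. Its delay model is an assumption chosen for illustration and yields an ordinary speedpath shape, not the "wall" or "reverse speedpath" patterns.

```python
def passes(vdd_mv, period_ps):
    """Hypothetical pass/fail model (an assumption, not measured data).

    The critical-path delay is modeled as 1_500_000 / vdd_mv picoseconds,
    so the chip passes when the clock period covers that delay.
    """
    return period_ps * vdd_mv >= 1_500_000


def shmoo(voltages_mv, periods_ps, test):
    """Return one text row per clock period; '*' marks a failure, '.' a pass."""
    rows = []
    for period in sorted(periods_ps):  # shortest period (highest frequency) on top
        cells = " ".join("." if test(v, period) else "*" for v in voltages_mv)
        rows.append(f"{period / 1000:.1f}  {cells}")
    return rows


VOLTAGES_MV = [1000, 1100, 1200, 1300, 1400, 1500]
PERIODS_PS = [1000, 1100, 1200, 1300, 1400, 1500]

if __name__ == "__main__":
    for row in shmoo(VOLTAGES_MV, PERIODS_PS, passes):
        print(row)
```

On a tester, `passes` would be replaced by applying the test program at each voltage and frequency setting and recording the result.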

Lecture 19. Manufacturing Test


# Silicon Debug is a Growing Barrier to Market Entry






Source: N. Hakim, Intel


# Validating a Design

### Difficulties

- Lack of visibility to internal blocks and interconnect buses
- High latency between an internal error caused by a fault and its observation at the pins

### Solutions

- **Software-based**: software monitor routines and processor-specific hardware allow some visibility
- **Test feature-based**: reuse design-for-test (DFT) structures for functional debug
- **In-circuit emulation**: a special "bond-out" version of the device is created that mirrors key internal signals on external device pins
- **On-chip emulation**: dedicated debug logic runs in parallel with the normal device logic

## Platform Validation Infrastructure




### Classes of Tests

### Functional bugs in micro-architecture

- Weighted random instructions
- Architectural simulation patterns
- Random power state transitions
- Directed tests for corner cases
- Multi-core/multi-processor tests
- Tests of virtualization system

### Memory subsystem

- Memory channel activation
- Tests for multiple cores/processors
- Directed tests for memory paging, cache coherence

## Circuit Bugs


Circuit bugs affect DPM (defects per million): not all die behave the same way

### Timing convergence bugs

- Speed path: circuit operates too slowly
- Min-delay: circuit operates too fast (hold-time violations)
- Race: circuit fails due to timing of multiple converging signals
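The first two bug classes can be illustrated with the standard setup and hold checks on a register-to-register path. The sketch below is a simplification under stated assumptions: an ideal clock with no skew, clock-to-Q delay folded into the path delays, and all parameter names hypothetical.

```python
def check_path(t_clk, t_prop_max, t_cont_min, t_setup, t_hold):
    """Classify timing bugs on one register-to-register path (times in ps).

    Assumes an ideal clock (no skew); t_prop_max and t_cont_min are the
    longest and shortest delays from the launching edge to the capture input.
    """
    bugs = []
    if t_prop_max + t_setup > t_clk:
        bugs.append("speed path")  # data arrives too late for the capturing edge
    if t_cont_min < t_hold:
        bugs.append("min-delay")   # data changes too soon after the capturing edge
    return bugs


if __name__ == "__main__":
    print(check_path(1000, 900, 50, 80, 30))   # meets both checks
    print(check_path(1000, 950, 20, 80, 30))   # violates both checks
    print(check_path(2000, 950, 20, 80, 30))   # slower clock fixes only setup
```

The last call shows why the two classes behave differently during debug: lowering the clock frequency removes a speed-path failure, but a min-delay failure is independent of the clock period.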

### Analog bugs


- Primarily occur in I/O buffers, PLLs, and thermal sensors
- Silicon does not operate in accordance with predicted (simulated) circuit behavior
- Fundamentals for circuit bug hunting
  - Need sufficiently large population of devices
  - Need to vary environmental conditions
  - Need to stimulate stressful system behavior
  - Stimulus is generally functional, so circuit failures look just like functional failures




# Problem: Very Long Error Detection Latencies in Practice





# Much Lower Error Detection Latency with EDDI-V-based QED (Quick Error Detection) Test




# **EDDI-V** Transformations




EDDI-V Transformations: (a) with half of all general-purpose registers reserved, (b) with no registers reserved and register values stored in memory






## Manufacturing Test

- A speck of dust on a wafer is sufficient to kill a chip
- Yield of any chip is < 100%
  - Must test chips after manufacturing, before delivery to customers, so that only good parts are shipped
- Manufacturing testers are very expensive
  - Minimize time on tester
  - Careful selection of test vectors




### Fault Models

- Numerous possible physical failures (what we are testing for)
- Can reduce the number of failure types by considering the effects of physical failures on the logic functional blocks: called a FAULT MODEL
- Most widely used fault model: "stuck-at" faults at the gate level
  - Assume that defects will cause the circuit to behave as if lines were "stuck" at logic 0 or 1
- Most commercial tools for test are based on the "stuck-at" model
- Other fault models
  - "Stuck open" model for charge retained on a CMOS node
  - Recent use of the "transition" fault model in an attempt to deal with delays
  - "Path delay" fault model would be better for small delay defects, but the large number of possible paths is an impediment to the use of this fault model

## Generating Tests for a Fault

- Start with the set of faults of interest and reduce the number of faults (use "equivalence" and "dominance")
- Find a vector, or a sequence of vectors (for sequential circuits and delay tests), which will cause the faulty circuit to produce an incorrect output
- Initial state assumption for a sequential circuit: the most general assumption is that memory elements start in the "X" (unknown) state

### Steps in Test Generation


- Activate fault (produce error at fault site)
- "Sensitize" path from fault to output (propagate error to output)
- "Justify" internal signals to primary inputs
- Choices may exist during sensitization and justification: if conflicts arise, need to backtrack
- If no test exists, fault is redundant
- Problem is NP-Complete
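The steps above can be made concrete on a tiny circuit. The sketch below is not an ATPG algorithm with sensitization and backtracking; it swaps in an exhaustive search that tries every input vector and keeps the first one on which the faulty and fault-free outputs differ. The circuit `y = a OR (a AND b)` and the line names are assumptions picked so that one fault turns out to be redundant.

```python
from itertools import product

INPUTS = ("a", "b")


def simulate(vec, fault=None):
    """Evaluate y = a OR (a AND b); fault is (line, stuck_value) or None."""
    def val(line, computed):
        # A stuck-at fault overrides whatever value the line would carry.
        return fault[1] if fault and fault[0] == line else computed

    a = val("a", vec[0])
    b = val("b", vec[1])
    n1 = val("n1", a & b)     # internal line: output of the AND gate
    return val("y", a | n1)   # primary output


def find_test(fault):
    """Return the first vector that detects the fault, or None if redundant."""
    for vec in product((0, 1), repeat=len(INPUTS)):
        if simulate(vec) != simulate(vec, fault):
            return vec
    return None


if __name__ == "__main__":
    print(find_test(("a", 0)))    # a stuck-at-0
    print(find_test(("n1", 1)))   # AND output stuck-at-1
    print(find_test(("n1", 0)))   # no test exists: redundant fault
```

Because `a OR (a AND b)` simplifies to `a`, no vector can expose `n1` stuck-at-0, which is what "if no test exists, fault is redundant" means. The exhaustive loop grows as 2^n in the number of inputs, which is why practical tools search the circuit structure instead.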

## **Observability and Controllability**

- **Observability**: ease of observing a value on a node by monitoring external output pins of the chip
- **Controllability**: ease of forcing a node to 0 or 1 by driving input pins of the chip
- Combinational logic is usually easier to observe and control
  - Still, an NP-complete problem
- Finite state machines can be very difficult, requiring many cycles to enter desired state
  - Especially if state transition diagram is not known to the test engineer, or is too large










## Fault Simulation


- Identify faults detected by a sequence of tests
- Provide a numerical value of coverage (ratio of detected faults to total faults)
- Correlation between high fault coverage and low defect level
- Faults considered
  - Generally, gate level "stuck-at" faults
  - Can also evaluate coverage of switch level faults
  - Can include timing and dynamic effects of failures
- Although fault simulation takes polynomial time in the number of gates, it can still be prohibitive for large designs
- Recent research: techniques for accurate estimation of the fault coverage
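A minimal sketch of serial fault simulation follows, assuming a small example circuit `y = (a AND b) OR c` with single stuck-at faults on each of its five lines (ten faults in total). The circuit and line names are hypothetical; the point is only how coverage is computed as detected faults over total faults.

```python
LINES = ("a", "b", "c", "n1", "y")


def simulate(vec, fault=None):
    """Evaluate y = (a AND b) OR c; fault is (line, stuck_value) or None."""
    def val(line, computed):
        return fault[1] if fault and fault[0] == line else computed

    a = val("a", vec[0])
    b = val("b", vec[1])
    c = val("c", vec[2])
    n1 = val("n1", a & b)
    return val("y", n1 | c)


def fault_coverage(tests):
    """Simulate every fault against every test; return (coverage, undetected)."""
    faults = [(line, stuck) for line in LINES for stuck in (0, 1)]
    detected = {
        f for f in faults
        if any(simulate(t) != simulate(t, f) for t in tests)
    }
    undetected = [f for f in faults if f not in detected]
    return len(detected) / len(faults), undetected


if __name__ == "__main__":
    print(fault_coverage([(1, 1, 0)]))
    print(fault_coverage([(1, 1, 0), (0, 1, 0), (1, 0, 0)]))
    print(fault_coverage([(1, 1, 0), (0, 1, 0), (1, 0, 0), (0, 0, 1)]))
```

Each fault is simulated separately against each vector, so the work grows with (number of faults) x (number of vectors) x (circuit size). This is the cost that becomes prohibitive on large designs.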






# Functional Test Generation






# Memory Fault Model, Cont'd

### Decoders

- The decoder will not access the addressed cell, and in addition may access non-addressed cells
- The decoder will access multiple cells, including the addressed cell
- Assumption that the combinational logic of the decoder will not be transformed into sequential logic
- Decoder faults look like memory cell array faults
- Fault model can be validated by simulating effects of faults at the transistor level

# O(n) "March" Test for Memories



R: Read cell and verify; C: Complement cell

Complexity of Test: 14N (N cells)


How would you test for parametric faults, data retention?

All memory tests are based on this algorithm
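For a concrete feel of a march test, the sketch below implements March C-, a well-known 10N march algorithm. It is not the 14N test from the slide, whose exact sequence is not reproduced here. The `Memory` class and its `stuck` and `alias` fault hooks are hypothetical and exist only to model a stuck cell and a decoder that accesses the wrong cell.

```python
class Memory:
    """One-bit-per-cell memory model with optional injected faults (hypothetical)."""

    def __init__(self, n, stuck=None, alias=None):
        self.cells = [0] * n
        self.stuck = stuck or {}   # physical address -> value always read back
        self.alias = alias or {}   # decoder fault: address -> cell actually accessed
        self.ops = 0               # number of read/write operations performed

    def write(self, addr, bit):
        self.ops += 1
        self.cells[self.alias.get(addr, addr)] = bit

    def read(self, addr):
        self.ops += 1
        phys = self.alias.get(addr, addr)
        return self.stuck.get(phys, self.cells[phys])


def march_c_minus(mem, n):
    """Run March C- on n cells; return True if every read matches."""
    up, down = range(n), range(n - 1, -1, -1)
    elements = [
        (up, [("w", 0)]),
        (up, [("r", 0), ("w", 1)]),
        (up, [("r", 1), ("w", 0)]),
        (down, [("r", 0), ("w", 1)]),
        (down, [("r", 1), ("w", 0)]),
        (up, [("r", 0)]),
    ]
    for order, ops in elements:
        for addr in order:
            for op, bit in ops:
                if op == "w":
                    mem.write(addr, bit)
                elif mem.read(addr) != bit:
                    return False  # read-and-verify failed: fault detected
    return True


if __name__ == "__main__":
    good = Memory(8)
    print(march_c_minus(good, 8), good.ops)
    print(march_c_minus(Memory(8, stuck={3: 1}), 8))
    print(march_c_minus(Memory(8, alias={2: 6}), 8))
```

The operation count is linear in the number of cells (10 operations per cell on a fault-free memory), which is the O(n) property in the heading. The aliased-address case illustrates how a decoder fault shows up as a cell-array fault.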

## **Functional Testing of Microprocessors**

- Developed in the 1980s for generating tests based on information about the instruction set
  - Tests for vendor parts without knowing details
- Tests based on "functional fault models" derived from analysis of the effects of low level faults on the behavior of modules
  - Based on functional tests for memories
- Fault models at control sequencing level
- Tests based on high-level information can also be used for validating design correctness


### Microprocessor Functional Tests

#### Microprocessor represented as a graph

- Node: register or set of equivalent registers
- Edge: data or information transfer between nodes
- Instructions: sequence of microinstructions (set of micro-operations)
- Tests based on behavior level fault models



Example: data flow graph model of Motorola 68000


## Microprocessor Fault Models


### Register decoding function – like a decoder

- Decoder will not access the addressed register (or storage cells)
- Decoder may access non-addressed registers or multiple registers, including addressed location (decoder remains combinational)

### Instruction sequencing function

- One or more micro-ops are inactive, therefore the instruction is not executed completely
- Micro-ops which are normally inactive become active
- A set of micro-instructions is active in addition to, or instead of, the normal microinstructions



## Testing Instruction Decoding, Control

- Use sequence of instructions Read($R_i$), which transfers data in register $i$ to a location in memory without changing the internal state of the microprocessor
  - Check core instructions (Load, Compare, Branch)
  - Check that every register can be loaded and read (without disturbing other registers)
  - Test Load Register instruction for all registers, all addressing modes
- Check all other instructions
- In self-test mode, compare with stored data and branch to an error location if incorrect

Example Code Words for 68000 Registers

| Register | Code Pattern                            |
|----------|-----------------------------------------|
| D0       | 111111011111111111111111111111111111111 |
| D1       | 1111111011111111111111111111111111111   |
| D2       | 1111111101111111111111111111111111111   |
| D3       | 1111111110111111111111111111111111111   |
| D4       | 1111111111011111111111111111111111111   |
| D5       | 1111111111101111111111111111111111111   |
| D6       | 111111111111011111111111111111111111    |
| D7       | 111111111111101111111111111111111111    |
| A0       | 11111111111111011111111111111111111     |
| A1       | 1111111111111101111111111111111111      |
| A2       | 1111111111111111011111111111111111111   |
| A3       | 1111111111111111101111111111111111      |
| A4       | 111111111111111110111111111111111       |
| A5       | 111111111111111111101111111111111       |
| A6       | 111111111111111111110111111111111       |
| USP      | 11111111111111111111101111111111        |
| SSP      | 111111111111111111111111111111111111111 |

The code words form an m-out-of-n code: the AND or OR of any two code words will produce a non-code word.
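The m-out-of-n property can be checked mechanically. The sketch below assumes 32-bit code words with exactly one 0 bit (a 31-out-of-32 code) and assigns the zero position by register index. Both the width and the bit positions are illustrative assumptions, not the exact patterns from the slide.

```python
from itertools import combinations

WIDTH = 32  # assumed word width
REGISTERS = [f"D{i}" for i in range(8)] + [f"A{i}" for i in range(7)] + ["USP", "SSP"]


def code_word(zero_pos):
    """All ones except a single 0 at bit position zero_pos (hypothetical layout)."""
    return ((1 << WIDTH) - 1) & ~(1 << zero_pos)


def is_code_word(word):
    """A valid code word has exactly WIDTH - 1 ones."""
    return bin(word).count("1") == WIDTH - 1


CODE_WORDS = {reg: code_word(i) for i, reg in enumerate(REGISTERS)}


def pairwise_property_holds():
    """AND or OR of any two distinct code words must not be a code word."""
    return all(
        not is_code_word(a & b) and not is_code_word(a | b)
        for a, b in combinations(CODE_WORDS.values(), 2)
    )


if __name__ == "__main__":
    print(pairwise_property_holds())
```

The AND of two such words has two 0 bits and the OR has none, so a decoder fault that merges two registers always yields a detectable non-code word.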

# Procedure to Test All Instructions

Load the registers with unique "code words", $cw_i$, using the instruction sequence Write($R_i$).

```
for every instruction I {
  for every register Ri {
    Write(Ri) with cwi;
  }
  Execute I;
  for every register Ri {
    Read(Ri);
  }
}
```


Complexity of test patterns: $O(n_i n_r + n_r^4)$, where $n_i$ is the number of instructions and $n_r$ is the number of registers.

- Test length is logarithmic in the number of bits (divide and conquer)
- Tests will detect any "stuck-at", short, or open in the data path





Why doesn't the cost of testing a transistor scale like the cost of manufacturing the transistor?

## **Conventional Test**


Test problem simplified by structural, fault-based tests

### The stuck-at fault model

- The model allows structural test generation, with a number of faults that is linear in the size of the circuit

### Partitioning the circuit

- Partitioning the circuit (with scan latches, for example) alleviates the test problem, so that test generation does not have to deal with the entire circuit

Will this approach work for Deep SubMicron (DSM) circuits?


## Failures May Not Be Hard: Example Resistive Opens

Experiments on real chips

- Some tests for logic-level "stuck-at" faults do not detect defects unless they are applied at speed

Interconnect opens are resistive (not complete breaks)

- Example: Cu interconnect with barrier materials
- Effect: delay faults

Increasing possibility of shorts and crosstalk



Breaks in copper interconnect result in resistive opens because the barrier materials preventing interaction of Cu and $SiO_2$ will still be conductive


## Effects on Chip?


### Changes in delays of paths

Effects could be distributed across paths



## **Reducing Test Complexity**

- Generate tests at a higher level of abstraction?
  - Fault models at the Register Transfer Level
  - Tests may not detect DSM defects (e.g., delays)
- Exploit the design hierarchy
  - Target one module at a time for test at the structural level (can deal with opens, shorts, paths in module)
  - Problems: accessing the module from the design boundary (complexity of the rest of the design)
  - Can add logic to facilitate access to embedded modules ("design for testability"), use "Slicing"
- Experiment to determine extent of problem
  - Generate test for module by itself, then while embedded

|        | Gates | FFs  | PIs | POs | Faults |
|--------|-------|------|-----|-----|--------|
| ARM-2  | 16029 | 1270 | 63  | 67  | 99198  |
| ARM-DP | 8893  | 295  | 199 | 161 | 51824  |

## Test Generation


### Module by itself


- Sequential ATPG can easily deal with a module by itself, for example the ARM-DP
- Results using commercial ATPG tool (on HPUX-715, 125 MHz processor)
  - Fault Coverage: 99.70%
  - ATPG Efficiency: 99.93%
  - Test generation time: 33.1 seconds
  - Test length: 822 cycles

### Test generation for embedded module

- Sequential ATPG cannot deal with a module when it is embedded in even a moderately complex design
- Results on ARM-DP when it is embedded in ARM-2
  - Fault Coverage: 17.66%
  - ATPG Efficiency: 17.66%
  - Test generation time: 316,199 seconds



# Automatic Generation of Instruction Sequences for Small Delay Defects




Results on the OR1200 processor (www.opencores.org), synthesized for a 0.18 µm TSMC process

Results for Phase 1 (paths > 80% of clock)

| No. of Paths | Drop | Functionally Testable | Functionally Redundant | Time out |
|--------------|------|-----------------------|------------------------|----------|
| 27424        | 12   | 15118                 | 12106                  | 200      |

Results for Phase 2 (N: % nodes with test for longest path through them)

| Module       | Functionally Testable | Functionally Redundant | Rejected Sub-paths | N (%) |
|--------------|-----------------------|------------------------|--------------------|-------|
| or1200_ctrl  | 1826                  | 29191                  | 68087              | 90.6  |
| or1200_alu   | 1427                  | 16985                  | 2716               | 100   |
| or1200_lsu   | 970                   | 4077                   | 3744               | 100   |
| or1200_wbmux | 1146                  | 2285                   | 2118               | 100   |

# Test of SoC Cores using Embedded Processor

- Wishbone and 128-bit AES designs from opencores.org
- Validation vectors: random values encrypted/decrypted




| AES Core                 |       |
|--------------------------|-------|
| Inputs                   | 69    |
| Outputs                  | 33    |
| Combinational primitives | 9225  |
| Sequential primitives    | 1119  |
| Stuck-at faults          | 64070 |

Result of Mapping AES tests to ARM instructions (one case)

|      | Size (bytes) | Fault coverage (%) | Original coverage (%) | No. of cycles | Original cycles |
|------|--------------|--------------------|-----------------------|---------------|-----------------|
| Test | 9128         | 90.15              | 90.35                 | 7816          | 7435            |

