3D-SoftChip: A novel 3D vertically integrated adaptive computing system [thesis]

Chul Kim

*Edith Cowan University*

This Thesis is posted at Research Online.
Edith Cowan University

USE OF THESIS

The Use of Thesis statement is not included in this version of the thesis.
3D-SoftChip:
A Novel 3D Vertically Integrated Adaptive Computing System

A Dissertation
Presented to the School of Engineering and Mathematics
Edith Cowan University
Western Australia

In partial fulfillment of the requirements for the degree of
Master of Engineering Science

by
Chul Kim

Supervisor: Dr. Alexander Rassau
Submission Date: June 2005
Dedication

To my fiancée Sang-Mi Hyun,
my father Nam-Gil Kim,
my mother Sung-Sun Park,
my brother-in-law Sun-Shin Lee,
and my sisters Hee-Joung, Su-Joung, Youn-Joung Kim.
Table of Contents

USE OF THESIS
DECLARATION
ACKNOWLEDGMENTS
ABSTRACT
PUBLICATIONS
LIST OF FIGURES
LIST OF TABLES

1. INTRODUCTION
  1.1 3D VERTICALLY INTEGRATED SYSTEMS OVERVIEW
  1.2 ADAPTIVE COMPUTING SYSTEMS OVERVIEW
    1.2.1 Adaptive Computing Systems
      1.2.1.1 The Need for Adaptive Computing Systems
      1.2.1.2 The Concept of Adaptive Computing Systems
    1.2.2 Classification of Adaptive Computing Systems
      1.2.2.1 Previous Works
      1.2.2.2 MorphoSys Vs 3D-SoftChip
  1.3 MOTIVATION FOR THESIS
  1.4 SCOPE OF THESIS
    1.4.1 Scope of Each Chapters
  1.5 CONCLUSIONS

2. SYSTEM ARCHITECTURE OF 3D-SOFTCHIP
  2.1 CORE TECHNOLOGY FOR 3D-SOFTCHIP
  2.2 OVERALL ARCHITECTURE OF 3D-SOFTCHIP
  2.3 FEATURES OF 3D-SOFTCHIP
  2.4 SYSTEM COMPONENTS
    2.4.1 Configurable Array Processor (CAP) Chip
      2.4.1.1 Heterogeneous Types of PEi
    2.4.2 Intelligent Configurable Switch (ICS) Chip
      2.4.2.1 Switch Block
      2.4.2.2 ICS_RISC
      2.4.2.3 Data Frame Buffer
      2.4.2.4 Program Memory
      2.4.2.5 Data Memory
      2.4.2.6 DMA Controller
      2.4.2.7 3D Interconnection Technology
  2.5 DESIGN GUIDELINES
  2.6 DESIGN METHODOLOGY
    2.6.1 Suggested HW/SW Co-design and Verification Methodology
  2.7 CONCLUSIONS

3. ARCHITECTURE OF CAP CHIP
  3.1 OVERALL ARCHITECTURE OF CAP CHIP
  3.2 TWO TYPES OF PROCESSING ELEMENT (PE)S

4. ARCHITECTURE OF ICS CHIP
  4.1 SWITCH BLOCK
  4.2 ICS_RISC
    4.2.1 Features of ICS_RISC
    4.2.2 System Components of ICS_RISC
    4.2.3 Types of Instruction Set
    4.2.4 ICS_RISC Instruction Set Architecture-Version1.0
  4.3 HIGH BANDWIDTH DATA INTERFACE UNIT
  4.4 CONCLUSIONS

5. ARCHITECTURE OF UNITCHIP
  5.1 UNITCHIP ARCHITECTURE
  5.2 PIPELINED OPERATION MECHANISM OF UNITCHIP
  5.3 AREA ESTIMATIONS AND CONSTRAINTS
  5.4 CONCLUSIONS

6. INTERCONNECTION NETWORK
  6.1 HIERARCHICAL INTERCONNECTION ARCHITECTURE
    6.1.1 PE and Switch Block Array Interconnection
      6.1.1.1 Programmable Nature of PE Array Interconnection
    6.1.2 Indium Bump Interconnection
  6.2 CONCLUSIONS

7. HIGH-LEVEL MODELING OF 3D-SOFTCHIP USING SYSTEMC
  7.1 SYSTEMC OVERVIEW
    7.1.1 CAD Environment for SystemC
  7.2 SYSTEM-LEVEL MODELING OF 3D-SOFTCHIP
    7.2.1 Standard-PE
    7.2.2 Processing Accelerator-PE
    7.2.3 ICS_RISC
    7.2.4 UnitChip
  7.3 CONCLUSIONS

8. APPLICATION MAPPING FOR 3D-SOFTCHIP
  8.1 FULL SEARCH BLOCK MATCHING ALGORITHM (FBMA)
  8.2 FBMA MAPPING METHOD FOR 3D-SOFTCHIP
  8.3 PERFORMANCE ANALYSIS
  8.4 CONCLUSIONS

9. CONCLUSIONS
  9.1 CONTRIBUTIONS
  9.2 FUTURE WORK

Declaration

I certify that this thesis does not incorporate without acknowledgement any material previously submitted for a degree or diploma in any institution of higher education; and to the best of my knowledge and belief it does not contain any material previously published or written by another person except where due reference is made in the text.

Signature

Date: 12/03/05
Acknowledgements

I would like to express my gratitude to the following people, who helped me reach this position.

Prof. Kamran Eshraghian, as my principal supervisor, initiated the research program and gave me the opportunity to commence my master's course at Edith Cowan University, providing financial support and great inspiration for my research. Unfortunately, he left the university towards the end of my research; however, he left a lasting impression of great leadership that I want to follow.

Prof. Mike Myung-Ok Lee inspired me to study overseas and gave me opportunity and warm supervision during my course. I learned strong drive and passion through his supervision.

Prof. Byung-Lok Cho was my supervisor during my undergraduate study. I started out under his great supervision and learned the life and beliefs of an electronic engineer. I will not forget his guidance, which has changed my whole life.

Dr. Alexander Rassau, my principal supervisor: I am really lucky to have had him as a principal supervisor. At times he has been my ears, mouth, hands and legs. I cannot forget his infinite interest in and capacity for supervising me. I could not have finished my course without his great supervision, which will never be forgotten and is very much appreciated. Thanks, Dr. Alexander Rassau.

My family is the most precious thing in my life. They motivated and encouraged me unfailingly, so my deepest gratitude goes to them, and I dedicate my dissertation to my family: my father, my mother, my brother-in-law and my sisters, and to my future new family: my father-in-law, mother-in-law and my new brother-in-law.

Lastly, my fiancée Sang-Mi: she pushed me to study hard but, ironically, she also gave me many interruptions. I love even these interruptions. I promise that I will be a good husband and that I will love you forever.

ABSTRACT

At present, as we enter the nano- and giga-scale integrated-circuit era, there are many system design challenges that must be overcome to resolve problems in current systems. Sharply increased nonrecurring engineering (NRE) costs, an abruptly shortened time-to-market (TTM) period and an ever-widening design productivity gap are good examples of these problems. To cope with them, the concept of an Adaptive Computing System is becoming a critical technology for next generation computing systems. The other major problem is an explosion in the interconnection wire requirements of standard planar technology, resulting from the very high data bandwidth demanded by real-time communications and multimedia signal processing. The concept of 3D vertical integration of 2D planar chips is an attractive solution to the ever-increasing interconnect wire requirements. As a result, this research proposes the concept of a novel 3D integrated adaptive computing system, which we term 3D-ACSoC. The architecture and advanced system design methodology of the proposed 3D-SoftChip as a forthcoming giga-scale integrated circuit computing system are introduced, along with high-level system modeling and functional verification in the early design stage using SystemC.

A major challenge in this research is to explore the proposed 3D-SoftChip platform to investigate the effectiveness of the first novel 3D vertically integrated Adaptive Computing System-on-Chip (ACSoC) as a next generation computing system. The suggested 3D-SoftChip has been modeled at the system level using SystemC, and the functionality of the modeled system has been verified. Hand-crafted assembler code implementing the MPEG-4 motion estimation algorithm achieved a performance improvement of more than 3.8 times over conventional systems. This demonstrates that the architecture is highly suitable for next generation computing systems. Finally, further work to realize the full implementation of the novel concept of a 3D-ACSoC is suggested.

Publications

The following is a list of papers published during the course of this research.

**International Journals**


**International Conferences**


List of Figures

FIGURE 1.1: 3D-SoftChip Physical Architecture
FIGURE 1.3: Computing Systems
FIGURE 1.4: An Example of "Do-It-All" Device
FIGURE 2.1: Core Technology for 3D-SoftChip
FIGURE 2.2: Overall Architecture of 3D-SoftChip
FIGURE 2.3: Computation Algorithm: 3 Types of SIMD Computation Models (A) Massively Parallel SIMD Computational Model, (B) Multithreaded SIMD Computational Model, (C) Pipelined SIMD Computational Model
FIGURE 2.4: Word-Length Configuration Algorithm (A) 8bit Configuration, (B) 16bit Configuration, (C) 32bit Configuration
FIGURE 2.5: Suggested HW/SW Co-design and Verification Methodology
FIGURE 3.1: Types of PEs (A) Homogeneous Type, (B) Heterogeneous Type, (C) Heterogeneous Type with Dedicated Functions for Special Purpose
FIGURE 3.2: Two Types of PE (A) Standard-PE, (B) Processing Accelerator-PE
FIGURE 3.3: PE Instruction Formats (A) Standard-PE Instruction Format, (B) Processing-Accelerator-PE Instruction Format
FIGURE 3.4: PE Array Operation Modes (A) Horizontal Mode, (B) Vertical Mode, (C) Circular Mode
FIGURE 3.5: A Generic 1x1-bit Multiplier Cell for N=1
FIGURE 3.6: 8x8 Multiplier Using 4-bit Generic Cells
FIGURE 3.7: Quad-PE
FIGURE 3.8: UniCap Chip Architecture
FIGURE 4.1: Architecture of Switch Block: A 6-sided Switch Block, 7-sided Switch Block and 8-sided Switch Block
FIGURE 4.2: Architecture of ICS_RISC 32-bit Dedicated Control Processor
FIGURE 4.3: A Detailed Architecture of ICS_RISC
FIGURE 4.4: DMA Controller Architecture and Instructions for DMA Controller
FIGURE 5.1: Overall Architecture of UnitChip
FIGURE 6.1: Three Hierarchical Interconnection Networks: (A) PE Array Interconnection Network: 2D-Mesh Interconnection for Local Interconnection, (B) Switch Block Array Interconnection Network: 2D-Mesh Interconnection for Long Interconnection, (C) Indium Bump Interconnection: Single Indium Bump after Reflow
FIGURE 6.2: Quad-PE and Programmable Interconnect Architecture

List of Tables

TABLE 1.1: 3D FABRICATION TECHNOLOGIES
TABLE 1.2: RECONFIGURABLE COMPUTING VS ADAPTIVE COMPUTING
TABLE 1.3: RECONFIGURABLE AND ADAPTIVE COMPUTING SYSTEMS
TABLE 1.4: COMPARISON OF MORPHOSYS WITH 3D-SOFTCHIP
TABLE 3.1: CHARACTERISTICS OF EACH PE TYPES
TABLE 3.2: CHARACTERISTICS OF THE TWO TYPES OF PE
TABLE 3.3: STANDARD-PE FUNCTIONS
TABLE 3.4: PROCESSING ACCELERATOR-PE FUNCTIONS
TABLE 4.1: TYPES OF INSTRUCTION SET
TABLE 4.2: INSTRUCTION SET SUMMARY (ICS_RISC ISA VERSION 1.0)
TABLE 5.1: PIPELINED UNIT CHIP OPERATION MECHANISM
TABLE 5.2: AREA ESTIMATION AND CONSTRAINT OF UNIT CHIP (TARGET TECHNOLOGY: 0.13 µm PROCESS)
TABLE 6.1: INTER-PE BUS (IPB) INTERCONNECTION CONNECTIVITY

Chapter 1

Introduction

System design is becoming increasingly challenging as the complexity of integrated circuits and time-to-market pressures relentlessly increase. Adaptive computing is a critical technology to develop for future computing systems because its potential for wide applicability can resolve most of the problems that system designers now face. Until now, however, this concept has not been fully realized because of constraints such as chip real-estate limitations and software complexity. With advances in semiconductor processing technology and software technology, adaptive computing is now at a turning point. For instance, the concept of reconfigurable computing has recently started to receive considerable research attention [2, 3, 7], and this concept is now starting to expand into the realm of adaptive computing. Software defined virtual hardware [9] and “Do-it-all” devices [12] are good examples that demonstrate this development direction for computing systems.

Another growing problem in advanced computation systems, particularly for real-time communication or video processing applications, is the data bandwidth necessary to satisfy the processing requirements. A 3D integrated system such as 3D-SoC [24] or 3D-SoftChip [14,15], which can satisfy the severe demand for computation throughput by effectively manipulating the functionality of hardware primitives through the vertical integration of two 2D chips, is another concept proposed for next generation computing systems. This research explores the proposed 3D-SoftChip platform to investigate the effectiveness of the first novel 3D vertically integrated Adaptive Computing System-on-Chip (3D-ACSoC) as a next generation computing system. This thesis outlines research into the system level design and functional verification of 3D-SoftChip in the initial stage of development of the novel 3D vertically integrated ACSoC.

Figure 1.1: 3D-SoftChip Physical Architecture

Figure 1.1 illustrates the physical architecture of the 3D-SoftChip, which comprises the vertical integration of two 2D chips. The upper chip is the Intelligent Configurable Switch (ICS). The lower chip is the Configurable Array Processor (CAP). Interconnection between the two 2D chips is achieved via indium bump interconnections. As the starting point for our 3D mapping, the 2D plane architecture of the 3D-SoftChip is also illustrated in Figure 1.2 in order to demonstrate the principle.

1.1 3D Vertically Integrated Systems Overview

During the past few years, there has been significant research interest in 3D vertically integrated systems because of ever-growing wiring requirements, which are fast becoming the major bottleneck for future gigascale integrated systems [23,24]. In very deep submicron silicon geometries, standard planar technology suffers from drawbacks in areas such as performance and reliability, caused by limitations in the wiring. Moreover, the data bandwidth requirements of next generation computing systems are becoming ever larger. To overcome these problems, concepts such as 3D-SoC and 3D-SoftChip have been developed, which exploit the vertical integration of two or more 2D planar chips to increase computation throughput. Previous work has shown that 3D integration of systems can significantly reduce interconnection requirements [25]. As described by Joyner et al. [25], 3D system integration offers a 3.9 times increase in wire-limited clock frequency, an 84% decrease in wire-limited area or a 25% decrease in the number of metal levels required per stratum. There are three feasible 3D integration methods: stacking of packages, stacking of ICs and Vertical System Integration, as introduced by IMEC [23]. There are four main enabling technologies for the fabrication of 3D integrated circuits: Beam Recrystallization, Silicon Epitaxial Growth, Solid Phase Crystallization and Processed Wafer Bonding [26]. Table 1.1 shows the main characteristics of each of these 3D fabrication technologies. In this research, the focus is on processed wafer bonding using an indium bump interconnection array (IBIA). Wafer bonding is adopted for this work because the process has particular benefits for applications where each chip carries out independent processing. A characteristic of the 3D-SoftChip is that each of the two planar chips should be effectively manipulated to maximize computation throughput through parallelism. In addition, indium has good adhesion and a low contact resistance, and can readily be used to achieve an interconnect array with a pitch as low as 10 µm. The development of 3D integrated systems should bring improvements in packaging cost, performance and reliability, together with a reduction in chip size.
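
To give a feel for what a 10 µm pitch means, the following minimal Python sketch estimates how many bump sites an idealized square grid provides. It assumes a uniform grid with no keep-out regions or alignment margins, and the 2 mm × 2 mm bonded area in the usage lines is a hypothetical figure chosen only for illustration, not a dimension taken from this work.

```python
def bumps_per_mm2(pitch_um: float) -> float:
    """Upper bound on bump sites per square millimetre for a square grid."""
    sites_per_mm = 1000.0 / pitch_um  # 1 mm = 1000 um
    return sites_per_mm ** 2


def bump_sites(width_um: int, height_um: int, pitch_um: int) -> int:
    """Whole bump sites that fit in a width x height area (idealized grid)."""
    return (width_um // pitch_um) * (height_um // pitch_um)


PITCH_UM = 10  # pitch quoted in the text

density = bumps_per_mm2(PITCH_UM)
# Hypothetical 2 mm x 2 mm bonded area, for illustration only.
sites = bump_sites(2000, 2000, PITCH_UM)
print(f"{density:.0f} sites per mm^2, {sites} sites in the example area")
```

Under these assumptions a 10 µm pitch corresponds to 100 sites per millimetre in each direction, that is, 10,000 sites per square millimetre. A practical array would provide fewer once alignment tolerances and reserved regions are taken into account.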

<table>
<thead>
<tr>
<th>3D Fabrication Technologies</th>
<th>Characteristics</th>
</tr>
</thead>
<tbody>
<tr>
<td>Beam Recrystallization</td>
<td>Deposit poly-silicon and fabricate thin-film transistors (TFTs). High performance of TFTs. High temperature of melting poly-silicon (not a practical fabrication technology). Suffers from low carrier mobility.</td>
</tr>
<tr>
<td>Silicon Epitaxial (SE) Growth</td>
<td>Epitaxially grow a single crystal Si. High temperature causes degradation in quality of devices. Process not yet manufacturable</td>
</tr>
<tr>
<td>Solid Phase Crystallization</td>
<td>Low temperature alternative to SE. Flexibility of creating multiple layers. Compatible with current processing environments. Useful for stacked SRAM and EEPROM cells</td>
</tr>
<tr>
<td>Processed Wafer Bonding</td>
<td>Bond two fully processed wafers together. Similar electrical properties on all devices. Independent of temperature since all chips are fabricated then bonded / Good for applications where chips do independent processing. Lack of precision (alignment) restricts inter-chip communication to global metal line</td>
</tr>
</tbody>
</table>

Table 1.1: 3D Fabrication Technologies

1.2 Adaptive Computing Systems Overview

There are three types of computing systems currently in existence: a general-purpose computing system, a reconfigurable/adaptive computing system, and an application-specific computing system. The general-purpose computing system is based on using a general-purpose processor for broad applications. Discrete application-specific ICs are used for application-specific computing systems for dedicated and limited applications. These computing systems have certain drawbacks such as low performance in the case of the general-purpose computing system, or extremely limited applicability for the application-specific computing system. The reconfigurable/adaptive computing system, however, allows for an optimum trade-off between flexibility and performance. Because of this fact, reconfigurable/adaptive computing systems are attracting attention as a new alternative for the next generation of computing systems. Figure 1.3 illustrates how the reconfigurable/adaptive computing system provides an optimum trade-off between flexibility and performance.

Figure 1.3: Computing Systems

1.2.1 Adaptive Computing Systems

1.2.1.1 The Need for Adaptive Computing Systems

The nonrecurring engineering (NRE) costs associated with the design and testing of complex chips are one of the greatest threats to current system design approaches. According to the International Technology Roadmap for Semiconductors (ITRS), the manufacturing engineering costs of complex chips have reached almost one million dollars, and the associated design NRE costs almost reached tens of millions of dollars in 2003 [21]. Moreover, product life cycles are getting ever shorter due to rapid changes in technology, and as a result the time-to-market (TTM) period is sharply shortened. On the other hand, design and verification cycle times are stretching into months or even years. As a consequence, a reconfigurable/adaptive computing system that can be metamorphosed across multiple standards and applications becomes very attractive for the next generation of computing systems.

1.2.1.2 The Concept of Adaptive Computing Systems

A reconfigurable system is one that has reconfigurable hardware resources that can be adapted to the application currently under execution, making it possible to customize the system across multiple standards and applications. In most previous research the concepts of reconfigurable and adaptive computing have been used interchangeably. In this document, however, the two concepts are described more specifically and differentiated. Adaptive computing is treated as an extended and more advanced form of reconfigurable computing: it includes more advanced software technology to effectively manage the mapping and scheduling of context memory over a wide range of applications, along with more advanced reconfigurable hardware resources to support fast and seamless execution across these applications. Table 1.2 shows the differences between reconfigurable computing and adaptive computing. The benefits of adaptive computing are silicon reuse, post-shipping bug fixing, in-market updates that allow for standards evolution, faster TTM and lower costs. The reconfiguration capacity allows for significant reuse of silicon. If bugs are found post-shipping or standards evolve, the adaptive computing system is easy to fix and update simply by changing the contexts in the reconfigurable hardware resource. The forthcoming impact of the deployment of adaptive computing is the “Do-it-all” device. A small handheld PDA-sized device can assume the functionality of about 10 standard devices, depending simply on the context programs included, such as a cellular phone, a GPS receiver, an MP3 player, an e-book reader, a digital camera, a portable television, a satellite radio, a hand-held gaming platform etc. Figure 1.4 shows the futuristic concept of “Do-it-all” devices.

Figure 1.4: An Example of “Do-it-all” Device (*Source: www.chosun.com)

Table 1.2: Reconfigurable Computing Vs Adaptive Computing.

<table>
<thead>
<tr>
<th></th>
<th>Reconfigurable Computing</th>
<th>Adaptive Computing</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Hardware Resources</strong></td>
<td>Linear array of homogeneous elements (Logic gates, look-up tables)</td>
<td>Heterogeneous algorithmic elements (Complete function units such as ALU, Multiplier)</td>
</tr>
<tr>
<td><strong>Configuration</strong></td>
<td>Static, dynamic configuration. Slow reconfiguration time</td>
<td>Dynamic, partial run-time reconfiguration.</td>
</tr>
<tr>
<td><strong>Mapping methods</strong></td>
<td>Manual routing, conventional ASIC design tools (HDL)</td>
<td>High-level language (SystemC, C)</td>
</tr>
<tr>
<td><strong>Characteristics</strong></td>
<td>Large silicon area, Low speed (high capacitance), high power consumption, high cost</td>
<td>Smaller silicon size, high speed, high performance, low power consumption, low cost</td>
</tr>
</tbody>
</table>
1.2.2. Classification of Adaptive Computing Systems

Adaptive computing systems are mainly classified in terms of granularity, programmability, reconfigurability, computational methods, hardware mapping methods and target applications. The granularity is the basic data size of the reconfigurable hardware resources. In fine-grained systems, the primitive reconfigurable hardware resources are typically logic gates, flip-flops and look-up tables, which operate using bit-level computations. Field Programmable Gate Arrays (FPGAs) and Complex Programmable Logic Devices (CPLDs) are good examples of fine-grained systems. In contrast, coarse-grained systems have complete function units such as ALUs, multipliers and dedicated functional units, and operate using word-level computations. The combination of fine-grained and coarse-grained systems creates a mixed-grained system.

The programmability relates to the capacity of the configuration. Single-programmability allows only one customization, while multiple-programmability allows for customization on-the-fly. Reconfiguration is executed by changing the context memory, and falls into two categories: static (execution is interrupted) and dynamic (reconfiguration proceeds in parallel with execution). Common computational methods used in adaptive computing systems are Single-Instruction stream Multiple-Data stream (SIMD), Multiple-Instruction stream Multiple-Data stream (MIMD) and Very-Long Instruction Word (VLIW). The hardware mapping methods vary from system to system, ranging from manual routing to high-level language compilation. Most of the target applications for adaptive computing are in the areas of wired and wireless communications and multimedia digital signal processing.

1.2.2.1 Previous Works

Research and commercial development of reconfigurable/adaptive computing systems has been pursued vigorously since the early 1990s. Following the classification of adaptive computing described above, Table 1.3 [3, 22] classifies the best-known existing coarse-grain reconfigurable systems. Fine-grain reconfigurable systems have been excluded because they belong to a different category from this research.

MATRIX [1], REMARC [5] and MorphoSys [3] belong to the category of mesh-based reconfigurable systems, which combine an array of word-level processing elements with a control processor: a multi-granular array of Basic Functional Units (BFUs) in the case of MATRIX, an 8 by 8 array of 16-bit nanoprocessors with a MIPS-II RISC processor in REMARC, and an 8 by 8 array of reconfigurable cells with a MIPS-like processor in MorphoSys. These are dynamically reconfigurable architectures with mesh-based hierarchical interconnection fabrics. Their application is restricted to DSP-type tasks, and they have certain disadvantages in terms of power consumption because of frequent data movement between the control processor and the processing elements, as well as the need to access external memory resources.

Another category is the linear array-based reconfigurable system, such as RaPiD [6] or PipeRench [2]. These are linear arrays of processing elements with row-wide interconnection fabrics. Each combination of a processing element array and a row-wide interconnection can form a pipeline stage. The target applications of these systems are regular, computation-intensive applications that can be pipelined. Other categories, such as crossbar-based systems [35, 36] and reconfigurable processors [37], have been excluded from this table.

The Triscend A7 [10] is considerably similar to other mesh-based reconfigurable systems; the difference is in the granularity of the processing elements. The A7 has a fine-grain reconfigurable fabric, in contrast to the word-level processing elements of the mesh-based reconfigurable systems.

The MRC6011 [11], Adapt2400 [16], DFA1000 [9] and PC102 [17] are recent commercially developed adaptive computing systems, which offer dynamic configurability and, except for the DFA1000, mostly heterogeneous arrays of reconfigurable hardware. The main target applications are computation-intensive multimedia DSP and communication signal processing. These systems have more advanced adaptive computing characteristics than the systems introduced earlier.

As indicated, early research and development focused on single linear array type reconfigurable systems with single and static configuration [8,1,6,5,4,2], but this has evolved into large adaptive SoCs with heterogeneous types of reconfigurable hardware resources and multiple and dynamic configurability. The MRC6011, Adapt2400, DFA1000, PC102 and 3D-SoftChip are good examples of the current research and commercial development directions. The ultimate goal for adaptive computing systems is currently the "Do-it-all" device, as explained above.

Table 1.3: Reconfigurable and Adaptive Computing Systems

<table>
<thead>
<tr>
<th>System</th>
<th>Granularity</th>
<th>Programmability</th>
<th>Reconfiguration</th>
<th>Computation Method</th>
<th>Mapping</th>
<th>Target Application</th>
</tr>
</thead>
<tbody>
<tr>
<td>PADDI [8]</td>
<td>Coarse(16bit)</td>
<td>Multiple</td>
<td>Static</td>
<td>VLIW, SIMD</td>
<td>Routing</td>
<td>DSP applications</td>
</tr>
<tr>
<td>REMARC [5]</td>
<td>Coarse(16bit)</td>
<td>Multiple</td>
<td>Static</td>
<td>SIMD</td>
<td>N/A</td>
<td>Data-parallel application</td>
</tr>
<tr>
<td>QuickSilver</td>
<td>Coarse(8,16,24, 32bit)</td>
<td>Multiple</td>
<td>Dynamic</td>
<td>Heterogeneous Nodes array</td>
<td>SilverC</td>
<td>Comm., Multimedia DSP</td>
</tr>
<tr>
<td>picoChip PC102 [17]</td>
<td>Coarse(16bit)</td>
<td>Multiple</td>
<td>Dynamic</td>
<td>3way-LIW</td>
<td>Assembler</td>
<td>Wireless Communications</td>
</tr>
</tbody>
</table>
1.2.2.2 MorphoSys Vs 3D-SoftChip

One of the most successful reconfigurable systems to date is the MorphoSys system, so it is meaningful to compare the proposed 3D-SoftChip architecture with it. Table 1.4 shows the comparison between MorphoSys and the 3D-SoftChip. It can be seen that the 3D-SoftChip is better matched to the requirements of the most up-to-date adaptive computing systems.

Table 1.4: Comparison of MorphoSys with 3D-SoftChip

<table>
<thead>
<tr>
<th></th>
<th>MorphoSys</th>
<th>3D-SoftChip</th>
</tr>
</thead>
<tbody>
<tr>
<td>Integrated Model</td>
<td>System-on-Chip except main memory</td>
<td>Vertically integrated complete System-on-Chip with abundant memory capacity</td>
</tr>
<tr>
<td>Memory Interface</td>
<td>Employs a two-set data buffer that enables overlap of computation with data transfers.</td>
<td>Vertical data communication using Indium bump technology. Variable memory word-length for adaptive computing.</td>
</tr>
<tr>
<td>Reconfiguration</td>
<td>Multiple contexts on-chip (32 planes) with dynamic and single-cycle reconfiguration.</td>
<td>Multiple contexts on-chip with dynamic and single-cycle reconfiguration.</td>
</tr>
<tr>
<td>Controller</td>
<td>On-chip general-purpose processor.</td>
<td>Every unit of the 3D-SoftChip has an ICS_RISC, which plays the role of control processor.</td>
</tr>
<tr>
<td>Examples of Application</td>
<td>MPEG-2 Video Compression, Encoder; Automatic Target Recognition; Data Encryption</td>
<td>Real time communication and multimedia signal processing</td>
</tr>
<tr>
<td>Characteristics</td>
<td>SIMD nature. Fixed word length. Comprehensive tool sets (mView, mLoad, mSched, mC, mLat, MorphoSim).</td>
<td>Various types of computational models (SISD, SIMD, MISD, MIMD) and 3 types of SIMD computation models (massively parallel, multithreaded, pipelined). Configurable word length and variable memory word length for adaptive computing. 3D vertically integrated system: high speed data interface. Optimum system architecture for communication and multimedia signal processing.</td>
</tr>
</tbody>
</table>
1.3 Motivation of Thesis

As the microelectronics industry enters the nano and giga-scaled integrated circuit era, many problems, as described before, have begun to occur. To cope with these problems, especially system-on-chip complexity and the interconnection crisis, innovative new computing systems with novel interconnection methods will be required. A very promising candidate to overcome these problems is the concept of a 3D vertically integrated adaptive computing system-on-chip (3D-ACSoC). This concept may well be a critical technology for the next generation of computing systems because of its wide applicability/adaptability and because of the significant benefits gained from 3D systems, such as reductions in interconnect delays and densities, and reductions in chip area due to the possibility of more efficient layouts.

Conventional SoC design methodologies include many error-prone and tedious iteration processes which can result in a lack of system reliability and extend the design time. Moreover, the portion taken up by verification processes in the total design time is exponentially increasing. By adopting the suggested SoC design methodology using SystemC, the design time can be significantly reduced and more reliable systems can be realised. To satisfy these needs, the concept of the 3D-ACSoC and advanced HW/SW co-design and verification methodology has been suggested.

1.4 Scope of Thesis

In this thesis, the novel 3D-SoftChip architecture for real-time communication and multimedia signal processing is introduced, and its high-level system modelling and functional verification using SystemC are described. The 3D-SoftChip has been fully modelled at a high level using SystemC, and the MPEG4 full search block matching motion estimation algorithm has been mapped to the modelled 3D-SoftChip. Finally, the performance analysis is detailed in the last chapter. The thesis is composed of nine chapters including this one. The scope of each of the following chapters is given below.

1.4.1. Scope of Each Chapter

Chapter 2 is an introduction to the overall 3D-SoftChip architecture. The novel architecture and several salient features for next generation computing systems will be introduced along with the suggested HW/SW co-design and verification methodology. The detailed architecture of the CAP chip will be described in Chapter 3, where the architecture and functions of the heterogeneous types of Processing Elements will be presented. Chapter 4 covers the ICS chip, its components and the ICS_RISC Instruction Set Architecture with an instruction set summary. Chapter 5 presents the architecture of the UnitChip, its pipeline operation mechanism and area constraint. A three-level hierarchical interconnection architecture and the configurable nature of the inter-PE bus will be introduced in Chapter 6. Chapter 7 presents the high-level modelling of the 3D-SoftChip using SystemC. The simulation results for each component of the 3D-SoftChip are also provided to verify the functionality of each component. Chapter 8 introduces application mapping for the high-level modelled 3D-SoftChip with the MPEG4 full search block matching algorithm. The performance analysis will be performed in comparison with conventional systems. Finally, the last chapter outlines the contributions of this thesis and suggests future work.

1.5 Conclusions

In this chapter, the motivation for the emergence of the novel 3D vertically integrated system-on-chip and the benefits which can be acquired through its use have been described. This concept will be a promising candidate for the next generation of computing system.
Chapter 2
System Architecture of 3D-SoftChip

In this chapter, the core technology for the 3D-SoftChip along with its detailed and overall architecture will be described. Finally a design guideline and the suggested design methodology will be introduced.

2.1 Core Technology for 3D-SoftChip

The core technology for the 3D-SoftChip can be classified into 3 main fields: a Very Deep Submicron (VDSM) silicon process technology, a 3D interconnection technology and an advanced software technology. The target silicon process technology for the 3D-SoftChip is less than 0.13um, to maximize the effect of large scale integration and fit as much as possible into the Processing Elements (PEs). This large scale integration into the PEs can be leveraged to amplify the computation capacity of the 3D-SoftChip because of its SIMD computation nature. The 3D interconnection technology using the Indium Bump Interconnection Array (IBIA) is another state-of-the-art technology for the 3D-SoftChip. The IBIA can cope with the severe demand for data bandwidth from real time communication and multimedia signal processing applications. The last technology is the advanced software technology, which is able to effectively execute context mapping and scheduling within the context memory in the 3D-SoftChip.
2.2 Overall Architecture of 3D-SoftChip

Figure 2.2 shows the overall architecture of the 3D-SoftChip.

As can be seen, it comprises 4 UnitChips. By including four separate UnitChips in the architecture, sufficient flexibility is provided to allow multiple optimized task threads to be processed simultaneously. Given the primary target applications of communication and multimedia processing, four UnitChips should be sufficient for all such requirements. Each UnitChip has a PE array, a dedicated control processor and a high bandwidth data interface unit. According to a given application program, the PE array processes a large amount of data in parallel, while the ICS controls the overall system and directs the PE array execution and the data and address transfers within the system.

2.3 Features of 3D-SoftChip

The 3D-SoftChip has 4 distinctive features: various types of computation models, adaptive word-length configuration [14], an optimized system architecture for real-time communication and multimedia signal processing, and dynamic reconfigurability for adaptive computing.

- **Computation Algorithm : Various Computation Models**

  As described above, one 32-bit RISC controller can supply control, data and instruction addresses to 16 sets of PEs through the completely freely controllable switch block, so various computation models such as SISD, SIMD, MISD and MIMD can be achieved as required. Enough flexibility is thus achieved for an adaptive computing (AC) system. In the SIMD case in particular, 3 different types of SIMD computational model can be realized: massively parallel, multithreaded and pipelined [13]. In the massively parallel SIMD computation model, each UnitChip operates with the same global program memory. Every computation is processed in parallel, maximizing computational throughput. In the multithreaded SIMD computation model, the program instructions executed in each UnitChip can differ from those in the others, so multithreaded programs can be executed. The final one is the pipelined SIMD computation model, in which each UnitChip executes a different pipeline stage. These three computational models are illustrated in Figure 2.3.
(a) Massively Parallel SIMD Computational Model

(b) Multithreaded SIMD Computational Model
- **Word-length Configuration**

  This is a key characteristic that qualifies the 3D-SoftChip as an adaptive computing system. Each PE's basic processing word-length is 4-bit. This can, however, be configured up to 32-bit according to the application in the program memory. Figure 2.4 illustrates the proposed word-length configuration algorithm. When 2 PEs are configured together, an 8-bit word-length system is created. If 4 PEs are configured together, this extends to 16-bit. Finally, when 8 PEs are configured together, a full 32-bit word length is achieved. This flexibility is possible due to the configurable nature of the arithmetic primitives in the PEs [18] (see Section 3.5) and the completely freely controllable switch block architecture in the ICS chip.

- **Optimized System Architecture for Communication and Multimedia Signal Processing**

  There are many similarities between communications and multimedia signal processing, such as data parallelism, low precision data and high computation rates. Communication signal processing differs mainly in requiring more data reorganization, such as matrix transposition, and potentially more bit-level computation. To fulfill these signal processing demands, each UnitChip contains two types of PE. One is a Standard-PE for generic ALU functions, which is optimized for bit-level computation. The other is a Processing Accelerator-PE for Digital Signal Processing (DSP). In addition, special addressing modes that leverage the localized memory, along with 16 sets of loop buffers in the ICS_RISC to generate iterative addresses, add to the specialized characteristics for optimized communication and multimedia signal processing.

- **Dynamic Reconfigurability for Adaptive Computing**
  Every PE contains a small quantity of local embedded SRAM memory and additionally the ICS chip has an abundant memory capacity directly addressable from the PEs. With multiple sets of program memory and the abundant memory capacity, it is possible to switch programs easily and seamlessly, even at run-time.
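
The three SIMD computational models listed under the first feature above can be illustrated with a purely functional sketch. This is only a behavioural illustration under stated assumptions, not the thesis's SystemC model: the function names, the representation of a "program" as a Python callable and the use of one data block per UnitChip are all hypothetical, and no timing or overlap between pipeline stages is modelled.

```python
from typing import Callable, List, Sequence

NUM_UNITCHIPS = 4  # the 3D-SoftChip comprises four UnitChips

# Hypothetical stand-in for a program: a function applied to one data item.
Program = Callable[[int], int]


def massively_parallel(program: Program,
                       blocks: Sequence[Sequence[int]]) -> List[List[int]]:
    """Every UnitChip runs the same program on its own data block."""
    assert len(blocks) == NUM_UNITCHIPS
    return [[program(x) for x in block] for block in blocks]


def multithreaded(programs: Sequence[Program],
                  blocks: Sequence[Sequence[int]]) -> List[List[int]]:
    """Each UnitChip runs its own program on its own data block."""
    assert len(programs) == len(blocks) == NUM_UNITCHIPS
    return [[p(x) for x in block] for p, block in zip(programs, blocks)]


def pipelined(stages: Sequence[Program], stream: Sequence[int]) -> List[int]:
    """Each UnitChip acts as one pipeline stage; items visit all stages in order."""
    assert len(stages) == NUM_UNITCHIPS
    results = []
    for item in stream:
        for stage in stages:
            item = stage(item)
        results.append(item)
    return results


if __name__ == "__main__":
    blocks = [[1, 2], [3, 4], [5, 6], [7, 8]]
    inc = lambda x: x + 1
    dbl = lambda x: x * 2
    print(massively_parallel(dbl, blocks))
    print(multithreaded([inc, dbl, inc, dbl], blocks))
    print(pipelined([inc, dbl, inc, dbl], [1, 2]))
```

The only difference between the three functions is how programs are bound to UnitChips (one shared program, one program per UnitChip, or one stage per UnitChip), which mirrors the distinction drawn in the text.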

**2.4 System Components**

As introduced above, the 3D-SoftChip consists of a linear array of heterogeneous PEs with an associated array of Indium bump 3D Interconnects, dedicated Switch Blocks, the ICS_RISC and a high bandwidth data interface unit.

**2.4.1 Configurable Array Processor (CAP) Chip**

**2.4.1.1 Heterogeneous Types of PEs**

The CAP chip comprises a linear array of two types of PE, a Standard-PE and a Processing Accelerator-PE. The advantage of heterogeneous PEs with dedicated functions for special purpose DSP is greater suitability for specific applications, with only a moderate loss of flexibility compared with homogeneous PEs. In this case, two Standard-PEs and two Processing Accelerator-PEs form one Quad-PE. These will be described in detail in a later section.

**2.4.2 Intelligent Configurable Switch (ICS) Chip**

**2.4.2.1 Switch Block**

Each group of 4 PEs (called a Quad-PE) is controlled by one Switch Block through the IBIA. The Switch Block transfers data from/to each PE and also provides instruction data for the PEs. It can completely freely configure each PE group, and makes it possible to achieve efficient variable word-length configuration.

2.4.2.2 ICS_RISC

A 32-bit dedicated RISC processor is used to control each set of 4 Quad-PEs (called UnitCAP). It controls the execution of the PE array and provides control and address signals to the Switch Block and the high bandwidth data interface unit in the UnitChip.

2.4.2.3 Data Frame Buffer

Two sets of Data Frame Buffers are included to support the transfer of large volumes of data from/to data/program memory and the ICS.

2.4.2.4 Program Memory

This is separated into two areas. One is a program memory for the ICS_RISC and the other is the program memory for the PE array. This memory supports adaptively configured word-lengths to increase the computation efficiency dependent on the application. Additionally, multiple sets of program memory are included to allow dynamic program switching.

2.4.2.5 Data Memory

Abundant memory capacity is one of the characteristics of the 3D-SoftChip, with each PE containing its own embedded local memory along with a high bandwidth connection to the memory store on the ICS.

2.4.2.6 DMA Controller

A dedicated controller is included to facilitate the transfer of large volumes of data from/to program memory, data memory and the ICS. This provides a high efficiency data interface between any of these units.
2.4.2.7 3D Interconnection Technology

The CAP chip carries out all data manipulation operations in the system. There is rarely the need for data transfer within the CAP beyond basic nearest neighbor interconnects, except for computation with word-lengths configured to > 4-bit. All the manipulated data is, therefore, transferred through the Indium Bump Interconnection Array (IBIA) and processed by the ICS allowing for very high speed computation because the IBIA provides very high bandwidth and very low inductance/capacitance [15].

2.5 Design Guidelines

The design guidelines and constraints to satisfy the design goals are as follows.

- The 3D-SoftChip is the first novel 3D vertically integrated Adaptive Computing System-on-Chip (3D-ACSoC)
- Using Indium bump technology, data can be manipulated at very high speed with wide bandwidth.
- The variable memory word-length and configurable word-length are unique features for an adaptive computing system.
- Various computation models (SISD, SIMD, MISD, MIMD) are possible for adaptability/flexibility in accordance with the current application and 3 types of SIMD computational models (massively parallel, multithreaded, pipelined) allow for maximized computational throughput.
- The heterogeneous types of PE architectures are optimized for communication and multimedia signal processing.
- Dynamic run-time reconfigurability for adaptive computing is supported by multiple sets of program memory and abundant memory capacity.
- The PE area should be minimized as much as possible (less than 60um x 60um in 0.13um technology) for a 4-bit word size.
2.6 Design Methodology

2.6.1. Suggested HW/SW Co-design and Verification Methodology

HW/SW co-design is a development methodology that supports the concurrent and co-operative development of hardware and software (co-specification, co-development, co-verification). It helps to evaluate the effect of design decisions and to explore the design space at an early stage to obtain the optimal architecture. As a result, design cost and design cycle time can be reduced, and a more reliable system can be realised because verification takes place at the high level of the system. Figure 2.5 shows the suggested HW/SW co-design methodology for the 3D-SoftChip. Once the system specification is firmly decided, HW/SW partitioning is executed to determine which functions should be implemented in hardware and which in software. The HW can then be modelled using SystemC [19] and the SW modelled in C. After that, a co-simulation and verification process is carried out to verify the 3D-SoftChip operation and performance and to decide on an optimal HW/SW architecture.

More specifically, the SW is modelled using a modified GNU C Compiler and Assembler. After the compiler and assembler for ICS_RISC has been finalised, a program for the implementation of the MPEG4 motion estimation algorithm will be developed and compiled using it. After that, object code can be produced, which can be directly used as the input stimulus for an instruction set simulator and system level simulation. The HW/SW verification process can be achieved through the comparison between the results from instruction level simulation and system level simulation. From this point on, the rest of the procedure can be processed using any conventional HW design methodology, such as full and semi-custom design. N.B. SystemC is a system design language which supports concurrent HW/SW co-design methodologies and offers a simulation kernel that supports hardware modeling concepts at the system level, behavioral level and register transfer level [20].
2.7 Conclusions

The core technology and the overall and detailed architecture of the 3D-SoftChip have been presented. The four salient features described in Section 2.3 differentiate the 3D-SoftChip from conventional reconfigurable/adaptive computing systems. The design time and reliability of the system will be significantly improved by adopting the suggested HW/SW co-design and verification methodology using SystemC.
Figure 2.5: Suggested HW/SW Co-design and Verification Methodology
Chapter 3

Architecture of CAP Chip

In this chapter, the overall architecture of the Configurable Array Processor (CAP) chip will be described along with the PE architecture for communication and multimedia signal processing. The integration of 4 heterogeneous PEs forms one Quad-PE, and four Quad-PEs make up the UnitCAP chip.
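
The hierarchy just described (PE, Quad-PE, UnitCAP), combined with the four UnitChips and the two-plus-two PE mix stated in Chapter 2, can be summarised as a small data-structure sketch. The class and field names below are hypothetical, the ordering of PEs inside a Quad-PE is an assumption, and only the CAP side of each UnitChip is represented.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List


@dataclass
class PE:
    kind: str            # "S-PE" or "PA-PE"
    word_bits: int = 4   # basic PE word-length


def _quad_pes() -> List[PE]:
    # Two Standard-PEs and two Processing Accelerator-PEs per Quad-PE.
    # The ordering within the Quad-PE is assumed for illustration.
    return [PE("S-PE"), PE("S-PE"), PE("PA-PE"), PE("PA-PE")]


@dataclass
class QuadPE:
    pes: List[PE] = field(default_factory=_quad_pes)


@dataclass
class UnitCAP:
    # Four Quad-PEs (16 PEs) per UnitCAP, controlled by one ICS_RISC.
    quads: List[QuadPE] = field(
        default_factory=lambda: [QuadPE() for _ in range(4)])


@dataclass
class SoftChip3D:
    # One UnitCAP per UnitChip, four UnitChips in total.
    units: List[UnitCAP] = field(
        default_factory=lambda: [UnitCAP() for _ in range(4)])


def count_pes(chip: SoftChip3D) -> Counter:
    """Count PEs of each kind across the whole chip."""
    return Counter(pe.kind
                   for unit in chip.units
                   for quad in unit.quads
                   for pe in quad.pes)


if __name__ == "__main__":
    print(count_pes(SoftChip3D()))
```

Under these assumptions each UnitCAP holds 16 PEs, matching the "16 sets of PEs" served by one ICS_RISC, and the full chip holds 64 PEs split evenly between the two types.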

3.1 Overall Architecture of CAP Chip

The basic architecture of the CAP chip is a linear array of heterogeneous PEs. Figure 3.1 shows three possible architecture choices for the PEs. The architecture in Figure 3.1(b) is suggested as the most feasible architecture for the PE in the 3D-SoftChip because it has the optimum trade-off between application specific performance and flexibility. Examples of type A can be seen in [1,2,3], type B in [16] and type C in [17]. The CAP chip has the basic role of the processing engine for the 3D-SoftChip. It manipulates large amounts of data at a high computational rate using any of the three different SIMD computation models previously described.
Figure 3.1: Types of PEs (a) homogeneous type, (b) heterogeneous type, (c) heterogeneous type with dedicated functions for special purpose.

Table 3.1: Characteristics of each PE type

<table>
<thead>
<tr>
<th>PE Type</th>
<th>PE Architecture</th>
<th>Flexibility</th>
<th>Performance</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Type A</strong></td>
<td>Homogeneous type PEs with embedded memory, ALU, MAC, address decoder etc.</td>
<td>Suitable for general purpose. High flexibility</td>
<td>Relatively low performance for specific applications</td>
</tr>
<tr>
<td><strong>Type B</strong></td>
<td>Each PE is optimized for special functions. Example: PE1: Multiple MAC, ALU array; PE2: Bit-oriented operations; PE3: General purpose RISC or control logic; PE4: Memory</td>
<td>Suitable for specific applications. Medium flexibility</td>
<td>Relatively medium performance for specific applications</td>
</tr>
<tr>
<td><strong>Type C</strong></td>
<td>Combination of the Type B architecture with dedicated functions for special purpose. Example: PE1: Multiple MAC, ALU array; PE2: Memory and control; PE3, PE4: A co-processor optimized for dedicated signal processing functions (FEC, preamble detect etc.)</td>
<td>Suitable for dedicated applications. Low flexibility</td>
<td>Relatively high performance for specific applications</td>
</tr>
</tbody>
</table>

3.2 Two Types of Processing Element (PE)s

Figure 3.2 illustrates the two types of PE architecture chosen to optimize communication and multimedia signal processing applications. Table 3.2 shows the characteristics of the two types of PE.

Table 3.2: Characteristics of the two types of PE

<table>
<thead>
<tr>
<th>Components</th>
<th>Standard-PE</th>
<th>Processing Accelerator-PE</th>
</tr>
</thead>
<tbody>
<tr>
<td>Purpose</td>
<td>Bit-wise manipulation, Standard ALU functions</td>
<td>Dedicated for MAC, MAS functions (for DSP application)</td>
</tr>
<tr>
<td>Characteristics</td>
<td>Standard ALU functions. Comparison operation.</td>
<td>Single clock cycle MAC, MAS and absolute value computation operations. 8-bit barrel shifter (logical and arithmetic shift).</td>
</tr>
</tbody>
</table>

3.2.1. Standard-PE (S-PE)

The S-PE is for standard ALU functions and is also optimized for bit-level operation for communication signal processing. It comprises 4 sets of 19-bit registers for S-PE instruction decoding, two multiplexers to select input operands from the data bus, adjacent PEs or internal registers, a standard ALU with bit-serial multiplier, adder, subtractor and comparator, embedded local SRAM and 4 sets of registers. The arithmetic primitives are scalable so as to make it possible to reconfigure the word-length for specific tasks. The scalable arithmetic primitive architecture is presented in [18].

3.2.2. Processing Accelerator-PE (PA-PE)

The PA-PE is dedicated specifically to Digital Signal Processing (DSP) operations. It consists of 4 sets of 19-bit registers for PA-PE instruction decoding, two multiplexers to select input operands from the data bus, adjacent PEs or internal registers, a signed 4-bit scalable parallel/parallel multiplier, an accumulator/subtractor modified to enable Multiply-and-Accumulate (MAC) and Multiply-and-Subtract (MAS) operations within one clock cycle, an 8-bit configurable barrel shifter, embedded local SRAM and 4 sets of registers. Two shifters in the Quad-PE can also be configured to produce a 16-bit barrel shifter. Its distinctive features are the single clock cycle MAC and MAS operations and the parallel/parallel multiplier, which accelerate DSP applications. Moreover, it can execute absolute value computation in a single clock cycle.

3.3 PE Functions

PE functions are mainly divided into S-PE or PA-PE functions.

3.3.1. Standard-PE Functions

Table 3.3 shows the functions of the S-PE, which is useful for bit-wise manipulation and generic ALU functions.

Table 3.3: Standard-PE Functions

<table>
<thead>
<tr>
<th>Function</th>
<th>Mnemonics</th>
</tr>
</thead>
<tbody>
<tr>
<td>A and B</td>
<td>AND</td>
</tr>
<tr>
<td>A or B</td>
<td>OR</td>
</tr>
<tr>
<td>not A</td>
<td>NOT</td>
</tr>
<tr>
<td>A xor B</td>
<td>XOR</td>
</tr>
<tr>
<td>A + B</td>
<td>ADD</td>
</tr>
<tr>
<td>A - B</td>
<td>SUB</td>
</tr>
<tr>
<td>A x B</td>
<td>SPMUL</td>
</tr>
<tr>
<td>A comp B</td>
<td>COMP</td>
</tr>
</tbody>
</table>
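
As a minimal behavioural sketch of the S-PE functions and mnemonics listed in the table above, the following function models a single 4-bit S-PE. Several details are assumptions made only for illustration, since the source does not specify them: results of ADD and SUB wrap to 4 bits, SPMUL returns the full 8-bit unsigned product, and COMP returns -1, 0 or 1.

```python
MASK4 = 0xF  # basic 4-bit PE word


def spe_execute(op: str, a: int, b: int = 0) -> int:
    """Evaluate one S-PE function on 4-bit unsigned operands (sketch only)."""
    a &= MASK4
    b &= MASK4
    if op == "AND":
        return a & b
    if op == "OR":
        return a | b
    if op == "NOT":
        return ~a & MASK4
    if op == "XOR":
        return a ^ b
    if op == "ADD":
        return (a + b) & MASK4          # assumed: carry-out discarded
    if op == "SUB":
        return (a - b) & MASK4          # assumed: wraps modulo 16
    if op == "SPMUL":
        return a * b                    # assumed: full 8-bit product
    if op == "COMP":
        return (a > b) - (a < b)        # assumed encoding: -1, 0 or 1
    raise ValueError(f"unknown S-PE mnemonic: {op}")


if __name__ == "__main__":
    for op in ("AND", "OR", "NOT", "XOR", "ADD", "SUB", "SPMUL", "COMP"):
        print(op, spe_execute(op, 0b1010, 0b0110))
```

The eight mnemonics fit in the 3-bit operation field of the PE instruction word, although the actual opcode assignment is not given in the text and is therefore not modelled here.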
3.3.2. Processing Accelerator-PE Functions

Table 3.4 describes the PA-PE functions. The PA-PE is specialized for DSP functions such as MAC, MAS, logical shift, arithmetic shift, rotate and absolute value computation.

![Table 3.4: Processing Accelerator-PE Functions](image)
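
The PA-PE operations named above can be sketched behaviourally as follows. This is an illustration under assumptions rather than a description of the actual datapath: the function names and shift-mode labels are hypothetical (the mnemonics of Table 3.4 are not reproduced in the text), the accumulator is modelled as an unbounded Python integer, and the barrel shifter is modelled at its stated 8-bit width.

```python
def mac(acc: int, a: int, b: int) -> int:
    """Multiply-and-Accumulate: acc + a * b."""
    return acc + a * b


def mas(acc: int, a: int, b: int) -> int:
    """Multiply-and-Subtract: acc - a * b."""
    return acc - a * b


def absolute(a: int) -> int:
    """Absolute value computation."""
    return -a if a < 0 else a


def barrel_shift(value: int, amount: int, mode: str, width: int = 8) -> int:
    """Shift or rotate a `width`-bit pattern; mode labels are hypothetical."""
    mask = (1 << width) - 1
    value &= mask
    amount %= width
    if mode == "LSL":                      # logical shift left
        return (value << amount) & mask
    if mode == "LSR":                      # logical shift right
        return value >> amount
    if mode == "ASR":                      # arithmetic shift right
        shifted = value >> amount
        if value >> (width - 1):           # replicate the sign bit
            shifted |= (mask << (width - amount)) & mask
        return shifted
    if mode == "ROR":                      # rotate right
        return ((value >> amount) | (value << (width - amount))) & mask
    raise ValueError(f"unknown shift mode: {mode}")


if __name__ == "__main__":
    print(mac(10, 3, -2), mas(10, 3, -2), absolute(-7))
    print(hex(barrel_shift(0x90, 2, "ASR")), hex(barrel_shift(0x81, 1, "ROR")))
```

Passing `width=16` models the case mentioned in Section 3.2.2 where two shifters in a Quad-PE are combined into a 16-bit barrel shifter.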

3.3.3. PE Instruction Formats and Operation Modes

The PE instruction format consists of a 19-bit instruction word. The most significant 2-bits, 18 and 17 in the instruction word (WS_en/RS_en, WR_en/RR_en) are used for the Read/Write enable bit of the embedded SRAM and registers. Bits 16 to 10 are used for SRAM and register selection (addressing). Bit 9 is used for data output register enable signal and bits 8 to 6 are used to specify the PE operation. Finally, bits 5 to 0 are used to control the input multiplexers for input operand selection. This format is illustrated in Figure 3.3 below.

![Figure 3.3: PE Instruction formats](image)
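
The field layout of the 19-bit instruction word can be illustrated with a minimal Python sketch. The field names (`sram_en`, `reg_en`, `addr`, `out_en`, `opcode`, `mux_sel`) are hypothetical, and the assignment of bit 18 to the SRAM enable and bit 17 to the register enable is an assumption based on the order in which the signals are listed above. Only the bit positions are taken from the text.

```python
from typing import NamedTuple


class PEInstruction(NamedTuple):
    """Fields of the 19-bit PE instruction word (field names are illustrative)."""
    sram_en: int   # bit 18: SRAM read/write enable (assumed to be WS_en/RS_en)
    reg_en: int    # bit 17: register read/write enable (assumed to be WR_en/RR_en)
    addr: int      # bits 16..10: SRAM and register selection
    out_en: int    # bit 9: data output register enable
    opcode: int    # bits 8..6: PE operation
    mux_sel: int   # bits 5..0: input multiplexer control


def decode(word: int) -> PEInstruction:
    """Split a 19-bit instruction word into its fields."""
    if not 0 <= word < (1 << 19):
        raise ValueError("instruction word must fit in 19 bits")
    return PEInstruction(
        sram_en=(word >> 18) & 0x1,
        reg_en=(word >> 17) & 0x1,
        addr=(word >> 10) & 0x7F,
        out_en=(word >> 9) & 0x1,
        opcode=(word >> 6) & 0x7,
        mux_sel=word & 0x3F,
    )


def encode(instr: PEInstruction) -> int:
    """Pack the fields back into a 19-bit instruction word."""
    return (
        (instr.sram_en << 18)
        | (instr.reg_en << 17)
        | (instr.addr << 10)
        | (instr.out_en << 9)
        | (instr.opcode << 6)
        | instr.mux_sel
    )
```

A 3-bit operation field allows 8 operations per PE type, which matches the 8 S-PE functions listed in Table 3.3.
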
Figure 3.4 illustrates the 3 types of PE operation mode that can be realized on the PE array: Horizontal mode, Vertical mode and Circular mode. In the horizontal and vertical modes, the PEs in each row or column, respectively, can be connected together. These two operation modes are optimized for the SIMD computational method. In the circular mode, the PEs within one Quad-PE are connected together and each Quad-PE can work independently. These modes allow for even greater flexibility and help to maximize computational throughput according to the target application.
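
The grouping implied by the three operation modes can be sketched in Python as follows. The sketch assumes a 4 × 4 array of 16 PEs numbered PE0 to PE15 in row-major order, with each Quad-PE occupying a 2 × 2 block. The numbering, the clockwise ordering inside a Quad-PE and the function names are assumptions made for illustration and are not taken from Figure 3.4.

```python
N = 4  # PEs per side of the array (assumed 4 x 4, i.e. 16 PEs)


def horizontal_groups():
    """Horizontal mode: the PEs of each row form one connected group."""
    return [[r * N + c for c in range(N)] for r in range(N)]


def vertical_groups():
    """Vertical mode: the PEs of each column form one connected group."""
    return [[r * N + c for r in range(N)] for c in range(N)]


def circular_groups():
    """Circular mode: the four PEs of each 2 x 2 Quad-PE form a ring."""
    groups = []
    for qr in range(0, N, 2):
        for qc in range(0, N, 2):
            # Clockwise order: top-left, top-right, bottom-right, bottom-left.
            groups.append([
                qr * N + qc,
                qr * N + qc + 1,
                (qr + 1) * N + qc + 1,
                (qr + 1) * N + qc,
            ])
    return groups
```

In every mode the 16 PEs are partitioned into four groups of four, so the same PE array can be viewed as four rows, four columns or four independent Quad-PEs.
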

3.4 Embedded Local SRAM

Each PE has a small quantity of local embedded SRAM. The effective memory bandwidth therefore increases in proportion to the number of PEs, which results in an increase in effective processing speed in many applications. Bus traffic can also be reduced because many data transmission operations can be contained within a PE. Consequently, a lowering of power dissipation is also achieved. Effectively, the local SRAM can act as a cache that can be continuously refreshed.

3.5 Configurable Nature of Arithmetic Primitives

As described in Section 2.3, one of the distinguishing features of the 3D-SoftChip as an adaptive computing system is its word-length configuration. The basic word-length of each PE is 4 bits. It can be configured to 8, 16 or 32 bits according to the target application. The configurable nature of the arithmetic primitives in the PE allows this configuration [18]. The most complex component in the PE is the multiplier, so the configurable parallel multiplier is introduced here as an example of a configurable arithmetic primitive.

3.5.1. Scalable Parallel Multiplier Cell

Figure 3.5 shows a generic 1×1-bit multiplier cell. It includes a full adder, an AND gate and three multiplexers to select the input operand through the control signals CTRLH and CTRLL. In this figure, A represents the multiplicand and B is the multiplier. Sin is the SUM signal from the adjacent cell above, Cout is the propagated carry output and Cin is the carry input from the adjacent multiplier cell. Mout represents the multiplication result.

A 2×2-bit multiplier can be implemented using the generic 1-bit cell, and an 8×8-bit multiplier can in turn be realised by arranging the basic 4×4 primitive in a 2×2 array, as shown in Figure 3.6 [18]. Because of this configurable characteristic, the word-length can be extended up to 32 bits in the 3D-SoftChip.

![Figure 3.5: A generic 1×1-bit Multiplier Cell for n=1](image-url)
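
The arithmetic behind arranging 4×4 primitives in a 2×2 array can be sketched in Python. This is a behavioural illustration only: it assumes unsigned operands for simplicity (the PA-PE multiplier is signed), and the function names `mul4` and `mul_scaled` are hypothetical. It does not model the cell-level structure of Figure 3.5 or the control signals CTRLH and CTRLL.

```python
def mul4(a: int, b: int) -> int:
    """Behavioural model of the basic unsigned 4x4-bit multiplier primitive."""
    if not (0 <= a < 16 and 0 <= b < 16):
        raise ValueError("operands must fit in 4 bits")
    return a * b


def mul_scaled(a: int, b: int, width: int) -> int:
    """Multiply two unsigned `width`-bit operands using only 4x4 primitives.

    Each operand is split into a high and a low half, the four partial
    products are formed by half-width multipliers (a 2x2 arrangement),
    and the results are shifted and summed.
    """
    if width not in (4, 8, 16, 32):
        raise ValueError("width must be 4, 8, 16 or 32")
    if not (0 <= a < (1 << width) and 0 <= b < (1 << width)):
        raise ValueError("operands must fit in the configured width")
    if width == 4:
        return mul4(a, b)
    half = width // 2
    mask = (1 << half) - 1
    a_hi, a_lo = a >> half, a & mask
    b_hi, b_lo = b >> half, b & mask
    high = mul_scaled(a_hi, b_hi, half)
    cross = mul_scaled(a_hi, b_lo, half) + mul_scaled(a_lo, b_hi, half)
    low = mul_scaled(a_lo, b_lo, half)
    return (high << width) + (cross << half) + low
```

Under this decomposition an 8×8-bit product needs four 4×4 primitives, a 16×16-bit product needs 16 and a 32×32-bit product needs 64.
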
3.6 Quad-PE

As previously described, one Quad-PE consists of two pairs of PEs (two S-PEs and two PA-PEs). The Quad-PE is controlled and configured by the Switch Block according to the control and address data transmitted from the ICS_RISC through the IBIA. Figure 3.7 shows the architecture of a single Quad-PE.
3.7 UnitCAP Chip Architecture

The CAP chip consists of 4 sets of UnitCAP. Each UnitCAP has an array of 16 heterogeneous S-PEs and PA-PEs. Figure 3.8 shows the UnitCAP chip architecture. The configurable interconnectivity is realised through the input multiplexer in each PE. A detailed description of the interconnection between the PEs is given in Chapter 6.

3.8 Conclusions

The heterogeneous PE architecture for communication and multimedia signal processing has been described. The adoption of this PE architecture can accelerate performance where intensive bit-level computation and digital signal processing are required, and achieves more flexibility compared with a homogeneous PE array. The suggested PE architecture has been fully modelled and its functionality verified at a high level using SystemC. The details of the system-level modelling of the PE are introduced in Chapter 7.

Figure 3.8: UnitCAP Chip Architecture
Chapter 4
Architecture of ICS Chip

The ICS chip comprises the Switch Blocks, the ICS_RISC, program memory, data memory, data frame buffers and a DMA controller. The ICS chip is a control processor which controls the CAP chip via the IBIA as well as the overall system. The ICS_RISC provides control and address signals and data to the system as a whole. The Switch Blocks configure each PE based on the current program instruction. The high bandwidth data interface unit enables efficient transmission of data and instructions within the system. In this chapter, the detailed architecture of the ICS chip is described.

4.1 Switch Block

The Switch Block transfers data to and from each PE and also provides instruction data to each PE. Three types of Switch Block (6-sided, 7-sided and 8-sided) provide optimized interconnection within the ICS chip. Figure 4.1 shows the Switch Block architecture, which connects the PEs and other Switch Blocks. The architecture of the Switch Block is similar to that of conventional Switch Blocks in Field Programmable Gate Arrays (FPGAs) [32]. The lines in the figure represent switches that route data and instruction data among the PEs, the Switch Blocks and the ICS chip. A pass transistor design is used to optimize performance and minimise area, allowing a completely free configuration for each PE.
4.2 ICS_RISC

The ICS_RISC is a dedicated 32-bit RISC control processor. The ICS_RISC controls the execution of the PE array and provides control and address signals to the program/data memory, the data frame buffers and the DMA controller. It has a 3-stage pipelined architecture: Fetch (F), Decode (D) and Execute (E). To cope with the iterative nature of DSP arithmetic, it has 16 sets of loop buffers that supply instructions directly to instruction decoding instead of fetching them from program memory each time. This significantly reduces bus utilization, allowing for improved performance and lower power dissipation. Moreover, 32 general purpose registers and specialized addressing modes are provided for optimized communication and multimedia signal processing. For detailed architecture descriptions refer to Appendix B.
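
The reduction in program memory traffic offered by a loop buffer can be illustrated with a small Python sketch. The replay policy modelled here is an assumption: a loop body of at most 16 instructions is fetched once and then replayed from the buffer on every later iteration, while a longer body is fetched from program memory every time. The function name and this policy are illustrative and are not taken from the ICS_RISC design.

```python
LOOP_BUFFER_ENTRIES = 16  # the ICS_RISC loop buffer holds 16 entries


def program_memory_fetches(body_len: int, iterations: int,
                           use_loop_buffer: bool) -> int:
    """Count instruction fetches from program memory for one loop.

    Assumed policy: if the loop body fits in the loop buffer, it is
    fetched once and replayed; otherwise every iteration is fetched.
    """
    if body_len < 1 or iterations < 1:
        raise ValueError("body_len and iterations must be positive")
    if use_loop_buffer and body_len <= LOOP_BUFFER_ENTRIES:
        return body_len
    return body_len * iterations
```

Under these assumptions, an 8-instruction loop body executed 100 times needs 800 program memory fetches without the loop buffer and 8 with it.
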

![Architecture of Switch Block](image)

Figure 4.1: Architecture of Switch Block: A 6-sided Switch Block, 7-sided Switch Block and 8-sided Switch Block
4.2.1 Features of ICS_RISC

The ICS_RISC has a simple and efficient architecture: a Harvard architecture with a simple 3-stage pipeline. Memory access during the execution stage is carried out using load/store instructions only, and all operations except load/store, PE and DMA operations are register-to-register within the ICS_RISC. This improves both performance and power dissipation.

4.2.2 System Components of ICS_RISC

The ICS_RISC consists of a 32 × 32-bit general purpose register file, a program counter, which is the 32nd general purpose register, a 16 × 32-bit loop buffer to generate instruction addresses for iterative sets of instructions, a status register (N: Negative/Less than, Z: Zero, C: Carry/Borrow, V: Overflow), an instruction register for instruction decoding, an ALU, a shifter, a multiplier and 32-bit data input/output registers [30,31]. Figure 4.3 shows the detailed architecture of the ICS_RISC.
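
As an illustration of the four status flags, the following Python sketch computes N, Z, C and V for a 32-bit addition. It assumes the conventional two's complement definitions of the flags. The exact flag behaviour of each ICS_RISC instruction is not specified here, so the function `add32` should be read as a generic model rather than as the ICS_RISC ALU.

```python
MASK32 = 0xFFFFFFFF


def add32(a: int, b: int):
    """Add two 32-bit values and return (result, flags).

    Flags follow the usual two's complement conventions (an assumption):
      N: bit 31 of the result
      Z: result is zero
      C: carry out of bit 31
      V: signed overflow (operands share a sign that the result does not)
    """
    a &= MASK32
    b &= MASK32
    full = a + b
    result = full & MASK32
    flags = {
        "N": (result >> 31) & 1,
        "Z": int(result == 0),
        "C": int(full > MASK32),
        "V": (((a ^ result) & (b ^ result)) >> 31) & 1,
    }
    return result, flags
```

For example, adding 1 to 0x7FFFFFFF sets N and V because the signed result wraps to a negative value, whereas adding 1 to 0xFFFFFFFF sets Z and C.
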
4.2.3. Types of Instruction Set

Table 4.1 describes the instruction set and instruction processing components of the 3D-SoftChip. All control instructions are executed in the ICS chip, while computation instructions, such as arithmetic and logical operations for PEs are executed in the CAP chip using various computation methods (SISD, SIMD, MISD, MIMD). The detailed instruction set is described in Appendix A.

<table>
<thead>
<tr>
<th>Function</th>
<th>Processing Component</th>
</tr>
</thead>
<tbody>
<tr>
<td>Move</td>
<td>ICS</td>
</tr>
<tr>
<td>Arithmetic (S-PE, PA-PE)</td>
<td>CAP</td>
</tr>
<tr>
<td>Logical (S-PE)</td>
<td>CAP</td>
</tr>
<tr>
<td>Arithmetic</td>
<td>ICS</td>
</tr>
<tr>
<td>Logical</td>
<td>ICS</td>
</tr>
<tr>
<td>Branch</td>
<td>ICS</td>
</tr>
<tr>
<td>Load</td>
<td>ICS</td>
</tr>
<tr>
<td>Store</td>
<td>ICS</td>
</tr>
<tr>
<td>Addressing Mode/Loop Buffer</td>
<td>ICS</td>
</tr>
<tr>
<td>Addressing</td>
<td>ICS</td>
</tr>
<tr>
<td>PE Control</td>
<td>ICS</td>
</tr>
</tbody>
</table>
4.2.4. ICS_RISC Instruction Set Architecture – Version 1.0

Table 4.2 shows the instruction set architecture (ISA) for the ICS_RISC. This is the first version of the ISA; more efficient and dedicated instructions can be added if needed. It has 50 instructions, broadly divided into arithmetic and logic, branch, data transfer, bit and bit-test, PE control, DMA control and, lastly, a loop buffer instruction.

<table>
<thead>
<tr>
<th>Mnemonic</th>
<th>Operation</th>
<th>Operands</th>
<th>Flags</th>
</tr>
</thead>
<tbody>
<tr>
<td>ADD</td>
<td>Add Two Registers</td>
<td>Rd, Rs1, Rs2</td>
<td>N,Z,C,V</td>
</tr>
<tr>
<td>ADDI</td>
<td>Add Register and Constant</td>
<td>Rd, Rs1, #I</td>
<td>N,Z,C,V</td>
</tr>
<tr>
<td>SUB</td>
<td>Subtract Two Registers</td>
<td>Rd, Rs1, Rs2</td>
<td>N,Z,C,V</td>
</tr>
<tr>
<td>SUBI</td>
<td>Subtract Register and Constant</td>
<td>Rd, Rs1, #I</td>
<td>N,Z,C,V</td>
</tr>
<tr>
<td>MUL</td>
<td>Multiply Two Registers</td>
<td>Rd, Rs1, Rs2</td>
<td>N,Z,C,V</td>
</tr>
<tr>
<td>MULI</td>
<td>Multiply Register and Constant</td>
<td>Rd, Rs1, #I</td>
<td>N,Z,C,V</td>
</tr>
<tr>
<td>AND</td>
<td>Logical AND Registers</td>
<td>Rd, Rs1, Rs2</td>
<td>N,Z,C,V</td>
</tr>
<tr>
<td>ANDI</td>
<td>Logical AND Register and Constant</td>
<td>Rd, Rs1, #I</td>
<td>N,Z,C,V</td>
</tr>
<tr>
<td>OR</td>
<td>Logical OR Registers</td>
<td>Rd, Rs1, Rs2</td>
<td>N,Z,C,V</td>
</tr>
<tr>
<td>ORI</td>
<td>Logical OR Register and Constant</td>
<td>Rd, Rs1, #I</td>
<td>N,Z,C,V</td>
</tr>
<tr>
<td>XOR</td>
<td>Logical XOR Registers</td>
<td>Rd, Rs1, Rs2</td>
<td>N,Z,C,V</td>
</tr>
<tr>
<td>XORI</td>
<td>Logical XOR Register and Constant</td>
<td>Rd, Rs1, #I</td>
<td>N,Z,C,V</td>
</tr>
<tr>
<td>NOT</td>
<td>Logical NOT Registers</td>
<td>Rd, Rs1, Rs2</td>
<td>N,Z,C,V</td>
</tr>
<tr>
<td>NOTI</td>
<td>Logical NOT Register and Constant</td>
<td>Rd, Rs1, #I</td>
<td>N,Z,C,V</td>
</tr>
</tbody>
</table>

**BRANCH INSTRUCTIONS**

<table>
<thead>
<tr>
<th>Mnemonic</th>
<th>Operation</th>
<th>Operands</th>
<th>Flags</th>
</tr>
</thead>
<tbody>
<tr>
<td>BREQ</td>
<td>Branch if Equal (Z=1)</td>
<td>PC, Offset</td>
<td>None</td>
</tr>
<tr>
<td>BRNE</td>
<td>Branch if NOT Equal (Z=0)</td>
<td>PC, Offset</td>
<td>None</td>
</tr>
<tr>
<td>JMP</td>
<td>Unconditional Branch (PC=PC+Offset)</td>
<td>PC, Offset</td>
<td>None</td>
</tr>
<tr>
<td>CMP</td>
<td>Compare Registers</td>
<td>Rs1, Rs2</td>
<td>N,Z,C,V</td>
</tr>
<tr>
<td>CMPI</td>
<td>Compare Register and Constant</td>
<td>Rd, #I</td>
<td>N,Z,C,V</td>
</tr>
</tbody>
</table>

**DATA TRANSFER INSTRUCTIONS**

<table>
<thead>
<tr>
<th>Mnemonic</th>
<th>Operation</th>
<th>Operands</th>
<th>Flags</th>
</tr>
</thead>
<tbody>
<tr>
<td>MOVA</td>
<td>Move between Registers (Rd=Rs1)</td>
<td>Rd, Rs1</td>
<td>None</td>
</tr>
<tr>
<td>MOVAI</td>
<td>Move between Reg & Const. (Rd=Const)</td>
<td>Rd, #I</td>
<td>None</td>
</tr>
<tr>
<td>MOVB</td>
<td>Move between Registers (Rd=Rs2)</td>
<td>Rd, Rs2</td>
<td>None</td>
</tr>
<tr>
<td>MOVBI</td>
<td>Move between Reg & Const. (Rd=Const)</td>
<td>Rd, #I</td>
<td>None</td>
</tr>
<tr>
<td>MSR</td>
<td>Move Register to Status Register (SR=Rs1)</td>
<td>SR, Rs1</td>
<td>None</td>
</tr>
<tr>
<td>MSRI</td>
<td>Move Imm value to Status Register (SR=#I)</td>
<td>SR, #I</td>
<td>None</td>
</tr>
<tr>
<td>MRS</td>
<td>Move Status Register to Register (Rs1=SR)</td>
<td>Rs1, SR</td>
<td>None</td>
</tr>
<tr>
<td>LD</td>
<td>Load indirect with Register (Rd=Mem[Rb])</td>
<td>Rd, Rb</td>
<td>None</td>
</tr>
<tr>
<td>ST</td>
<td>Store indirect with Register (Mem[Rb]=Rd)</td>
<td>Rd, Rb</td>
<td>None</td>
</tr>
</tbody>
</table>

**BIT AND BIT-TEST INSTRUCTIONS**

<table>
<thead>
<tr>
<th>Mnemonic</th>
<th>Operation</th>
<th>Operands</th>
<th>Flags</th>
</tr>
</thead>
<tbody>
<tr>
<td>LSL</td>
<td>Logical Shift Left</td>
<td>Rd, Rs1</td>
<td>N,Z,C,V</td>
</tr>
<tr>
<td>LSR</td>
<td>Logical Shift Right</td>
<td>Rd, Rs1</td>
<td>N,Z,C,V</td>
</tr>
<tr>
<td>ASR</td>
<td>Arithmetic Shift Right</td>
<td>Rd, Rs1</td>
<td>N,Z,C,V</td>
</tr>
<tr>
<td>ROT</td>
<td>Rotate</td>
<td>Rd, Rs1</td>
<td>N,Z,C,V</td>
</tr>
</tbody>
</table>

**PE CONTROL INSTRUCTIONS**

<table>
<thead>
<tr>
<th>Mnemonic</th>
<th>Operation</th>
<th>Operands</th>
<th>Flags</th>
</tr>
</thead>
<tbody>
<tr>
<td>PECON4</td>
<td>PE Word-Length Configuration (4-bit)</td>
<td>None</td>
<td>None</td>
</tr>
<tr>
<td>PECON8</td>
<td>PE Word-Length Configuration (8-bit)</td>
<td>None</td>
<td>None</td>
</tr>
<tr>
<td>PECON16</td>
<td>PE Word-Length Configuration (16-bit)</td>
<td>None</td>
<td>None</td>
</tr>
<tr>
<td>PECON32</td>
<td>PE Word-Length Configuration (32-bit)</td>
<td>None</td>
<td>None</td>
</tr>
<tr>
<td>PESEL</td>
<td>Select certain PE (PE0-PE15)</td>
<td>None</td>
<td>None</td>
</tr>
<tr>
<td>PEMODH</td>
<td>PE Operation mode (Horizontal mode)</td>
<td>None</td>
<td>None</td>
</tr>
<tr>
<td>PEMODV</td>
<td>PE Operation mode (Vertical mode)</td>
<td>None</td>
<td>None</td>
</tr>
<tr>
<td>PEMODC</td>
<td>PE Operation mode (Circular mode)</td>
<td>None</td>
<td>None</td>
</tr>
<tr>
<td>PEEXEH</td>
<td>Execute specific program to each PEs in the same Horizontal line</td>
<td>None</td>
<td>None</td>
</tr>
<tr>
<td>PEEXEV</td>
<td>Execute specific program to each PEs in the same Vertical line</td>
<td>None</td>
<td>None</td>
</tr>
<tr>
<td>PEEXEC</td>
<td>Execute specific program to each PEs in the same Circular line</td>
<td>None</td>
<td>None</td>
</tr>
</tbody>
</table>

**DMA INSTRUCTIONS**

<table>
<thead>
<tr>
<th>Mnemonic</th>
<th>Operation</th>
<th>Operands</th>
<th>Flags</th>
</tr>
</thead>
<tbody>
<tr>
<td>LDPEPRG</td>
<td>Load program data from program memory to the instruction decoder in a PE</td>
<td>addrMem</td>
<td>None</td>
</tr>
<tr>
<td>LDDFB</td>
<td>Load a large amount of processing data for PEs from Data Memory to the Data Frame Buffer</td>
<td>addrMem, addrDFB</td>
<td>None</td>
</tr>
<tr>
<td>LDPEDATA</td>
<td>Load a large amount of processing data for PEs from the DFB to the embedded SRAM in a PE</td>
<td>addrDFB, addrSRAM</td>
<td>None</td>
</tr>
<tr>
<td>WBREG</td>
<td>Write back processed data in the embedded SRAM to the registers in the ICS_RISC</td>
<td>addrSRAM</td>
<td>None</td>
</tr>
<tr>
<td>WBDFB</td>
<td>Write back processed data in the embedded SRAM to the DFB</td>
<td>addrSRAM, addrDFB</td>
<td>None</td>
</tr>
<tr>
<td>STDFB</td>
<td>Store a large amount of processed data from the Data Frame Buffer to Data Memory</td>
<td>addrDFB, addrMem</td>
<td>None</td>
</tr>
</tbody>
</table>

**LOOP BUFFER INSTRUCTION**

<table>
<thead>
<tr>
<th>Mnemonic</th>
<th>Operation</th>
<th>Operands</th>
<th>Flags</th>
</tr>
</thead>
<tbody>
<tr>
<td>LBEN</td>
<td>Generate an Iterative Set of Instruction Addresses (16 sets of Loop Buffer)</td>
<td>PC</td>
<td>None</td>
</tr>
</tbody>
</table>
4.3 High Bandwidth Data Interface Unit

The high bandwidth data interface unit allows the efficient transfer of data within the 3D-SoftChip. Two sets of data frame buffers and the DMA controller make it easy to transfer large amounts of data. Multiple sets of program memory support run-time program switching and, because of this dynamically reconfigurable feature, adaptive computing is possible. The data memory has a variable word width, so it can easily be combined to build wider or deeper memories and thus increase flexibility for different application programs. The DMA instructions and the data flow for the DMA controller can be seen in Figure 4.4. A detailed description of the operation of the DMA instructions can be found in Appendix A.

Figure 4.4: DMA Controller Architecture and Instructions for DMA Controller

4.4 Conclusions

The ICS chip architecture has been described in this chapter. The system components in the ICS chip allow it to efficiently supply data and instructions to the PEs through the IBIA. The PE array can be freely configured due to the highly controllable characteristic of the switch block. This allows more than sufficient adaptability and flexibility for adaptive computing systems. Moreover, the DMA controller enables bulk data to be transferred quickly and effectively through the 3D-SoftChip.
Chapter 5

Architecture of UnitChip

The 3D-SoftChip consists of 4 sets of UnitChip. Each UnitChip has one UnitCAP and one UnitICS. As described in Chapter 3, the UnitCAP comprises an array of 16 heterogeneous S-PEs and PA-PEs, and the UnitICS consists of a switch block, a dedicated 32-bit RISC control processor, a high bandwidth data interface unit, 2 sets of data frame buffers and program/data memory for both the ICS and the PE array. In this chapter, the UnitChip architecture and its pipelined operation mechanism, which can maximize the computational throughput [3], are described.

5.1 UnitChip Architecture

As mentioned above, the UnitChip is a combination of the UnitCAP and the UnitICS chip, and four UnitChips form the complete 3D-SoftChip. Figure 5.1 illustrates the overall architecture of the UnitChip. Control signals, data and instructions are transferred through the IBIA to the UnitCAP, and the processed data from the UnitCAP can be rapidly transferred back to the ICS_RISC to be manipulated and stored in the data memory.
5.2 Pipelined Operation Mechanism of UnitChip

Table 5.1: Pipelined UnitChip Operation Mechanism

<table>
<thead>
<tr>
<th>Stage (1)</th>
<th>Stage (2)</th>
<th>Stage (3)</th>
<th>Stage (4)</th>
<th>Stage (5)</th>
<th>Stage (6)</th>
</tr>
</thead>
<tbody>
<tr>
<td>ICS_RISC Instructions</td>
<td>ICS_RISC Instructions</td>
<td>ICS_RISC Instructions</td>
<td>ICS_RISC Instructions</td>
<td>ICS_RISC Instructions</td>
<td>ICS_RISC Instructions</td>
</tr>
<tr>
<td>LDPEPRG, PESEL, PEMODH,V,C</td>
<td>LDPEPRG, PESEL, PEMODH,V,C</td>
<td>LDPEPRG, PESEL, PEMODH,V,C</td>
<td>PECON4,8,16,32</td>
<td>PECON4,8,16,32</td>
<td>PECON4,8,16,32</td>
</tr>
<tr>
<td>PE SEL</td>
<td>Execute (1-1)</td>
<td>Execute (1-2)</td>
<td>Execute (2-1)</td>
<td>PEEXEH,V,C</td>
<td>PEEXEH,V,C</td>
</tr>
<tr>
<td>LDPEPRG, PESEL, PEMODH,V,C</td>
<td>PECON4,8,16,32</td>
<td>PECON4,8,16,32</td>
<td>PECON4,8,16,32</td>
<td>PECON4,8,16,32</td>
<td>PECON4,8,16,32</td>
</tr>
<tr>
<td>PROGRAM for PEs (in Local memory)</td>
<td>PROGRAM for PEs (in Local memory)</td>
<td>PROGRAM for PEs (in Local memory)</td>
<td>PROGRAM for PEs (in Local memory)</td>
<td>PROGRAM for PEs (in Local memory)</td>
<td>PROGRAM for PEs (in Local memory)</td>
</tr>
<tr>
<td>Load PRGM for PEs (1)</td>
<td>PRGM for PEs (1-1)</td>
<td>Load PRGM for PEs (2)</td>
<td>PRGM for PEs (2-1)</td>
<td>PRGM for PEs (2-2)</td>
<td></td>
</tr>
<tr>
<td>Memory</td>
<td>Memory</td>
<td>Memory</td>
<td>Memory</td>
<td>Memory</td>
<td>Memory</td>
</tr>
<tr>
<td>Load Data for PEs (1)</td>
<td>Data for PEs (1-1)</td>
<td>Data for PEs (1-2)</td>
<td>Write back Execution (1-1) results</td>
<td>Data for PEs (2-1)</td>
<td>Data for PEs (2-2)</td>
</tr>
<tr>
<td>Data Frame Buffer 0</td>
<td>Data Frame Buffer 0</td>
<td>Data Frame Buffer 0</td>
<td>Data Frame Buffer 0</td>
<td>Data Frame Buffer 0</td>
<td>Data Frame Buffer 0</td>
</tr>
<tr>
<td>Data Frame Buffer 1</td>
<td>Data Frame Buffer 1</td>
<td>Data Frame Buffer 1</td>
<td>Data Frame Buffer 1</td>
<td>Data Frame Buffer 1</td>
<td>Data Frame Buffer 1</td>
</tr>
<tr>
<td>Data for PEs (2)</td>
<td>Data for PEs (2-1)</td>
<td>Data for PEs (2-2)</td>
<td>Write back Execution (1-1) results</td>
<td>Write back Execution (1-1) results</td>
<td></td>
</tr>
</tbody>
</table>

* Dark-sided boxes: DMA Control Instructions
Table 5.1 illustrates the pipelined operation mechanism of the UnitChip, which improves its performance. A detailed explanation follows.

- **STEP 1 - LOAD PROGRAM FOR PEs:** The first operation is to load 16 instruction words for the PEs from program memory to the instruction decoder in each PE. The row and column decoder in the UnitCAP can specify a certain PE to load the programs into, depending on the desired computational mode (e.g., SIMD, MIMD).

- **STEP 2 - LOAD PROCESSING DATA FOR PEs (1):** Load a large amount of processing data for the PEs from data memory to the data frame buffer. The start address in memory and the amount of data to transfer are indicated by the DMA instructions.

- **STEP 3 - LOAD PROCESSING DATA FOR PEs (2):** Load the processing data for the PEs from the data frame buffer to the embedded SRAM in each PE. The row and column decoder in the UnitCAP can specify a certain PE to load the processing data into.

- **STEP 4 - EXECUTE PEs:** Execute the PE array.

- **STEP 5 - RELOAD PROGRAM FOR PEs (1):** Re-load 16 instruction words from program memory to the instruction decoder in each PE. The row and column decoder in the UnitCAP can again specify a certain PE to load the programs into.

- **STEP 6 - RELOAD PROCESSING DATA FOR PEs (2):** Re-load the processing data for the PEs from the data frame buffer to each PE. The row and column decoder in the UnitCAP can specify a certain PE to load the data into.

- **STEP 7 - WRITE BACK PROCESSED DATA TO DFB:** Write back the processed data from the embedded SRAM in each PE to the data frame buffer.

- **STEP 8 - TRANSFER PROCESSED DATA TO Memory:** Transfer a large amount of processed data from the data frame buffer to memory.
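
The data movement in the steps above can be summarised with a minimal Python sketch. It is a purely functional model under several assumptions: every PE runs the same program (SIMD), data words are distributed to the PEs round-robin, and the reload steps 5 and 6 are omitted because they repeat steps 1 and 3 for the next program. The function `run_unitchip` and its parameters are hypothetical names, and no timing or pipelining is modelled.

```python
def run_unitchip(data_memory, start, count, pe_program, num_pes=16):
    """Move a block of data through memory -> DFB -> PE SRAM and back."""
    if start < 0 or count < 0 or start + count > len(data_memory):
        raise ValueError("transfer range lies outside data memory")

    # STEP 1: load the program into every PE (same program: SIMD assumed).
    programs = [pe_program] * num_pes

    # STEP 2: DMA transfer from data memory to the data frame buffer.
    dfb = list(data_memory[start:start + count])

    # STEP 3: distribute DFB words to the embedded SRAM of each PE
    # (round-robin distribution is an assumption of this sketch).
    sram = [dfb[i::num_pes] for i in range(num_pes)]

    # STEP 4: execute the PE array on the local SRAM contents.
    sram = [[programs[i](x) for x in sram[i]] for i in range(num_pes)]

    # STEP 7: write processed data back from each PE SRAM to the DFB.
    dfb = [None] * count
    for i in range(num_pes):
        dfb[i::num_pes] = sram[i]

    # STEP 8: DMA transfer from the data frame buffer back to data memory.
    result = list(data_memory)
    result[start:start + count] = dfb
    return result
```

For example, `run_unitchip(list(range(40)), 4, 32, lambda x: x * 2)` doubles the 32 words starting at address 4 and leaves the rest of the memory unchanged.
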
5.3 Area Estimations and Constraints

Table 5.2 shows the estimated feasible area of the 3D-SoftChip components. The performance of integrated circuits depends largely on integration density. Tight area constraints can be met through higher integration density, which maximizes the benefit of large-scale integration. The area constraints should therefore be kept tight in order to achieve the best performance.

<table>
<thead>
<tr>
<th>Component</th>
<th>Estimated Area</th>
</tr>
</thead>
<tbody>
<tr>
<td>S-PE</td>
<td>60 um x 60 um</td>
</tr>
<tr>
<td>PA-PE</td>
<td>60 um x 60 um</td>
</tr>
<tr>
<td>IBIA</td>
<td>15 um x 15 um</td>
</tr>
<tr>
<td>One Quad-PE</td>
<td>130 um x 130 um</td>
</tr>
<tr>
<td>UnitCAP</td>
<td>500 um x 500 um</td>
</tr>
<tr>
<td>CAP(4x4 UnitCAP)</td>
<td>1100 um x 1100 um</td>
</tr>
<tr>
<td>CAP(16x16 UnitCAP)</td>
<td>2200 um x 2200 um</td>
</tr>
<tr>
<td>ICS_RISC</td>
<td>300 um x 300 um</td>
</tr>
</tbody>
</table>

5.4 Conclusions

As explained above, the pipelined operation mechanism is a 6-stage pipelined architecture, so the throughput of the UnitChip can ideally be improved by up to a factor of 6. This pipelined operation is another distinguishing characteristic that accelerates computational throughput, since as many computations can be in progress simultaneously as there are pipeline stages.
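
The factor of 6 is an upper bound that is approached only for long sequences of operations. Under the textbook assumptions of equal stage durations and no stalls, n operations take n·k stage times without pipelining and k + (n − 1) stage times on a k-stage pipeline. The short Python sketch below evaluates the resulting ideal speedup; the function name is illustrative and the model is not derived from measurements of the UnitChip.

```python
def pipeline_speedup(num_tasks: int, num_stages: int = 6) -> float:
    """Ideal speedup of a pipeline with equal-length stages and no stalls."""
    if num_tasks < 1 or num_stages < 1:
        raise ValueError("num_tasks and num_stages must be positive")
    sequential_time = num_tasks * num_stages      # one task at a time
    pipelined_time = num_stages + (num_tasks - 1) # fill once, then one per stage time
    return sequential_time / pipelined_time
```

With 6 stages, a single operation gains nothing, 6 operations give a speedup of 36/11 (about 3.3), and the speedup approaches 6 as the number of operations grows.
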
Chapter 6
Interconnection Network

In this chapter, the three hierarchical interconnection architectures: Inter-PE bus, Switch Block Array interconnection and IBIA, will be introduced along with the configurable nature of the Inter-PE bus using the input operand multiplexer in each of the PEs.

6.1 Hierarchical Interconnection Architecture

The interconnection network of the 3D-SoftChip can be broken into three hierarchical levels. The Inter-PE bus between PEs in the CAP chip is the first level. This local interconnection network has a 2D-mesh architecture providing nearest-neighbor interconnection between the PEs. The second level of the interconnection network is the switch block array interconnection. This supports longer interconnections on the ICS chip but also has a basic 2D-mesh architecture. The last hierarchical level of interconnection is the IBIA. As technology progresses to ever smaller semiconductor geometries, the predictability of interconnection delay and its share of the total system delay become crucial factors, and interconnection delay becomes a major limitation on overall system performance. To overcome these problems, 3D interconnection technology using indium bumps is very attractive because it supports a very high bandwidth coupled with a very low inductance/capacitance (and thus low power dissipation) and can be readily utilized to achieve an interconnect array with a pitch as low as 10µm. The development of 3D integrated systems will allow improvements in packaging costs, performance and reliability and a reduction in the size of the chips [15]. However, any other equivalent 3D interconnection technology could also be applied to realize this interconnection level within the 3D-SoftChip architecture. Figure 6.1 shows the three hierarchical interconnection networks.

(a) PE Array Interconnection Network: 2D-mesh interconnection for local interconnection

(b) Switch Block Array Interconnection Network: 2D-mesh interconnection for long interconnection
6.1.1. PE and Switch Block Array Interconnection

6.1.1.1 Programmable Nature of PE Array Interconnection

Figure 6.2: Quad-PE and Programmable Interconnect Architecture
Table 6.1: Inter-PE Bus (IPB) interconnection connectivity

<table>
<thead>
<tr>
<th>IPB Signal Name</th>
<th>Source (Output)</th>
<th>Destination (Input)</th>
</tr>
</thead>
<tbody>
<tr>
<td>IPB1</td>
<td>SPE1(dOutadjPE)</td>
<td>PAPE1(dLeft)</td>
</tr>
<tr>
<td>IPB2</td>
<td>SPE1(dOutadjPE)</td>
<td>PAPE2(dUp)</td>
</tr>
<tr>
<td>IPB3</td>
<td>PAPE1(dOutadjPE)</td>
<td>SPE1(dRight)</td>
</tr>
<tr>
<td>IPB4</td>
<td>PAPE1(dOutadjPE)</td>
<td>SPE2(dUp)</td>
</tr>
<tr>
<td>IPB5</td>
<td>PAPE1(dOutadjPE)</td>
<td>Next Quad-PE(SPE1(dLeft))</td>
</tr>
<tr>
<td>IPB6</td>
<td>PAPE2(dOutadjPE)</td>
<td>SPE1(dDown)</td>
</tr>
<tr>
<td>IPB7</td>
<td>PAPE2(dOutadjPE)</td>
<td>SPE2(dLeft)</td>
</tr>
<tr>
<td>IPB8</td>
<td>PAPE2(dOutadjPE)</td>
<td>Downside Quad-PE(SPE1(dUp))</td>
</tr>
<tr>
<td>IPB9</td>
<td>SPE2(dOutadjPE)</td>
<td>PAPE1(dDown)</td>
</tr>
<tr>
<td>IPB10</td>
<td>SPE2(dOutadjPE)</td>
<td>PAPE2(dRight)</td>
</tr>
<tr>
<td>IPB11</td>
<td>SPE2(dOutadjPE)</td>
<td>Next Quad-PE(PAPE2(dLeft))</td>
</tr>
<tr>
<td>IPB12</td>
<td>SPE2(dOutadjPE)</td>
<td>Downside Quad-PE(PAPE1(dUp))</td>
</tr>
</tbody>
</table>

Figure 6.2 shows the Quad-PE architecture and the Inter-PE interconnection architecture [3]. Because of the input multiplexer in each PE, the connectivity can be readily configured. The input multiplexer can choose input operands from among the 6 different inputs, including the data input and the data from the left, right, upper and lower neighbouring PEs (dIn, dLeft, dRight, dUp, dDown), and each PE’s output (dOutadjPE) becomes an input operand to the neighbouring PEs. Table 6.1 describes the connectivity within one Quad-PE, which can be configured by the PE programming according to the target application.
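
Table 6.1 can be read as a small routing table. The Python sketch below encodes only the eight connections that stay inside one Quad-PE (IPB1 to IPB4, IPB6, IPB7, IPB9 and IPB10) and models the input multiplexer as a lookup of the neighbour that drives a given port. The dictionary layout and the function names are assumptions made for illustration; connections to neighbouring Quad-PEs are left out.

```python
# Intra-Quad-PE rows of Table 6.1:
# signal name -> (source PE, destination PE, destination input port)
IPB = {
    "IPB1": ("SPE1", "PAPE1", "dLeft"),
    "IPB2": ("SPE1", "PAPE2", "dUp"),
    "IPB3": ("PAPE1", "SPE1", "dRight"),
    "IPB4": ("PAPE1", "SPE2", "dUp"),
    "IPB6": ("PAPE2", "SPE1", "dDown"),
    "IPB7": ("PAPE2", "SPE2", "dLeft"),
    "IPB9": ("SPE2", "PAPE1", "dDown"),
    "IPB10": ("SPE2", "PAPE2", "dRight"),
}


def inputs_of(pe: str) -> dict:
    """Return {input port: source PE} for the given PE inside one Quad-PE."""
    return {port: src for (src, dst, port) in IPB.values() if dst == pe}


def select_operand(pe: str, port: str, outputs: dict) -> int:
    """Model the input multiplexer: pick the dOutadjPE value seen on `port`.

    `outputs` maps each PE name to its current dOutadjPE value.
    """
    return outputs[inputs_of(pe)[port]]
```

For instance, `inputs_of("SPE1")` shows that, within its own Quad-PE, SPE1 receives the output of PAPE1 on dRight and the output of PAPE2 on dDown.
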

6.1.2. Indium Bump Interconnection

Indium is an excellent material to use as an interconnect material due to its excellent adhesion to most metals, including aluminum, which is the metallization for the pads used in most VLSI technologies. Indium has a low melting point, which implies a low work hardening coefficient, allowing for direct bonding on processed VLSI wafers. Additionally, it provides excellent mechanical as well as electrical connectivity (contact resistance < 1 mΩ per bump). Reflow techniques can be used for flexibility and to increase the bump height to width ratio as needed. Such techniques can also be used to incorporate self-alignment features to the bonding process. Figure 6.3 illustrates 3D flip-chip wafer bonding technology using indium bump interconnection arrays.
6.2 Conclusions

The three hierarchical interconnection network architectures have been described. With the exception of the 3D interconnection, they are similar to conventional interconnection architectures in reconfigurable systems. The Inter-PE bus provides configurable connectivity with a 2D-mesh architecture and the switch block interconnection offers longer interconnection in the ICS chip. Lastly, the IBIA provides vertical interconnection between the two separate chips, giving a high bandwidth, high speed, low power memory bus and reducing or eliminating the need for external memory resources.
Chapter 7

In this chapter, the high-level modelling of the 3D-SoftChip using SystemC is introduced. First, an overview of SystemC and the Computer Aided Design (CAD) environment for SystemC is briefly given, followed by a presentation and analysis of the high-level simulation output waveforms for each of the 3D-SoftChip components. Finally, some conclusions are provided.

7.1 SystemC Overview

SystemC is a C++ class library and design methodology that can be used to design software algorithms, hardware architectures, SoC interfaces and system-level designs. System-level modelling, quick simulation to validate and optimize the design, and exploration of hardware architectures and software algorithms can all be achieved using conventional C++ development environments. In the current system design methodology, the system engineer writes high-level language (C, C++, Matlab etc.) programs to verify the concepts and algorithms at system level. After the concepts and algorithms are validated, the high-level modelled designs are manually converted to a Hardware Description Language (VHDL, Verilog-HDL) in order to implement the hardware. However, this approach gives rise to a number of problems, such as errors arising from the manual conversion from C to HDL, a disconnection between the system-level model and the HDL model, and conversion limitations as designs get ever bigger and more complex. As a result, new C language based system design languages are starting to emerge as a new design methodology. Figure 7.1 shows the conventional system design methodology in contrast to a SystemC based design methodology.

![Diagram of System Design Methodology](image)

**Figure 7.1: System Design Methodology:**
(a) Conventional Design Methodology, (b) SystemC Design Methodology (*Source: www.systemc.org*)

The system design methodology using SystemC has many advantages over the conventional system design methodology, including increased productivity and reliability from the progressive refinement process and the use of a single language. In the design methodology using SystemC, the time consuming manual conversion process is no longer necessary, because the high-level modelled code is progressively refined into a more reliable, higher performance hardware model as hardware concepts and timing constructs are added. Greater productivity can be achieved by using a single design language: the high-level modelled SystemC code can be smaller and easier to write and can simulate relatively faster. Moreover, the testbench code for functional verification at high level can be reused at any level or design stage [19,20].

7.1.1. CAD Environment for SystemC

As described above, SystemC is a C++ class library, which means any conventional C++ compiler can serve as a CAD development environment for SystemC. Any Unix, Linux or PC based C++ compiler can be used; in this research, however, the PC based CAD environment (Microsoft Visual C++ Version 6.0) was used to compile the high-level modelled SystemC code because of its easy accessibility. Once the SystemC code is compiled and run, the results can be stored in various types of file. The most common file type for the results is the Value Change Dump (VCD) format, and the GTKWave waveform viewer is used to validate the VCD results. Figure 7.2 shows Visual C++ and the GTKWave waveform viewer.
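
To give an impression of what a VCD file contains, the following Python sketch writes a minimal VCD description of a single signal. It is a hand-written illustration of the text format and does not use or reproduce the SystemC tracing facilities. The module name `top`, the identifier character `!` and the function name `to_vcd` are arbitrary choices for this example.

```python
def to_vcd(signal_name: str, width: int, samples, timescale: str = "1 ns") -> str:
    """Return a minimal VCD text for one signal.

    `samples` is a list of (time, value) pairs in increasing time order.
    """
    ident = "!"  # short identifier code that stands for the signal
    lines = [
        f"$timescale {timescale} $end",
        "$scope module top $end",
        f"$var wire {width} {ident} {signal_name} $end",
        "$upscope $end",
        "$enddefinitions $end",
    ]
    for time, value in samples:
        lines.append(f"#{time}")                   # simulation time stamp
        if width == 1:
            lines.append(f"{value}{ident}")        # scalar change, e.g. 1!
        else:
            lines.append(f"b{value:b} {ident}")    # vector change, e.g. b101 !
    return "\n".join(lines) + "\n"
```

A file written from this text, for example for a 4-bit signal named dOut, declares the signal in its header and then lists a time stamp followed by the new value for every change, which is the information a waveform viewer needs to draw the trace.
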

![CAD Environment for SystemC](image1)

Figure 7.2: The CAD Environment for SystemC; Visual C++ Version 6.0, GTKWave Waveform Viewer.
7.2 System-level Modeling of 3D-SoftChip

In this section, the high-level models of a single Standard-PE, the Processing-Accelerator-PE, the ICS_RISC and the UnitChip are introduced along with their output simulation waveforms. The functionality of these components has been fully verified. For a more detailed description of the system-level modelling of the 3D-SoftChip see Appendix B.

7.2.1. Standard-PE

The detailed architecture of the S-PE was introduced in Chapter 3. Based on the architecture, it has been high-level modelled using SystemC. Figure 7.3 shows the block diagram and SystemC file structure of the S-PE.

Figure 7.4 shows the output waveform of the S-PE execution results after ALU instructions between data from internal registers and embedded SRAM. The input signals (dIn, dLeft, dRight, dUp, dDown) have been selected by the input multiplexer. The ALU output signals can be seen in the dOut, and dOutadjPE signals. The functionality of the S-PE was confirmed by checking the output result.
7.2.2. Processing Accelerator-PE

The PA-PE architecture has been described in Chapter 3. The high-level modelling was executed from this description. Figure 7.5 shows the PA-PE block diagram and file structure for the SystemC modelling.

![Figure 7.4: The Output Waveform of S-PE](image)

![Figure 7.5: High-level modeling of PA-PE: (a) PA-PE block diagram, (b) file structure of PA-PE](image)
The figure above shows the output waveform of the high-level modelled PA-PE. The selected input operand through the input multiplexer executes the ALU instruction (MAC, MAS, Shift, etc) and the results are then stored to the embedded SRAM. The output signal shows the operation executed as required.

7.2.3. ICS_RISC

The ICS_RISC and its instruction set architecture were introduced in Chapter 4. The ICS_RISC can broadly be divided into control and datapath units. The datapath consists of the 32 × 32-bit general purpose register file, a program counter, a 16 × 32-bit loop buffer, a status register, an instruction register, an ALU, a shifter, a multiplier and 32-bit data input/output registers. The fetch, decode and execution units make up the control unit. Additionally, a bus control unit controls the 32-bit operand A, operand B, data write, input and output buses to avoid data collisions. Figure 7.7 shows the top-level block diagram of the ICS_RISC and its SystemC file structure.
The output waveform shows the results after execution of simple loop and ALU instructions. Figure 7.8 shows the pseudo code for these instructions. In Figure 7.9, the circle labelled as a loop instruction marks the internal general purpose register address, which increases as programmed, and the other circle marks the output results of the ALU operations.

Figure 7.8: The Pseudo Code for ICS_RISC

```plaintext
//Simple Loop & ALU Instruction
MOV R0, #0;  //Simple Loop Inst
MOV R1, #1;
MOV R2, #2;
MOV R3, #3;
MOV R4, #4;
MOV R5, #5;
MOV R6, #6;
MOV R7, #7;
MOV R8, R0;
MOV R9, R1;
MOV R10, R2;
MOV R11, R3;
MOV R12, R4;
MOV R13, R5;
MOV R14, R6;
MOV R15, R7;  //End of Loop Inst.
AND R16, R8, R9;  //ALU Inst
OR R17, R10, R11;
XOR R18, R12, R13;
ADD R19, R14, R15;
SUB R20, R14, R15;  //End
```
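
As a cross-check on the waveform, the register values that the pseudo code of Figure 7.8 should leave behind can be worked out in a few lines of plain C++. This is only a sketch and not part of the SystemC model: the helper name `run_pseudo_code` is hypothetical, registers are assumed to start at zero, and `SUB` is assumed to wrap in 32-bit two's-complement arithmetic.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: returns the 32 general purpose registers after the
// pseudo code has run. Registers are assumed to be zero beforehand.
std::array<std::uint32_t, 32> run_pseudo_code() {
    std::array<std::uint32_t, 32> r{};

    // MOV R0..R7, #0..#7
    for (std::uint32_t i = 0; i < 8; ++i) r[i] = i;
    // MOV R8..R15, R0..R7
    for (std::size_t i = 0; i < 8; ++i) r[8 + i] = r[i];

    r[16] = r[8] & r[9];    // AND R16, R8, R9
    r[17] = r[10] | r[11];  // OR  R17, R10, R11
    r[18] = r[12] ^ r[13];  // XOR R18, R12, R13
    r[19] = r[14] + r[15];  // ADD R19, R14, R15
    r[20] = r[14] - r[15];  // SUB R20, R14, R15 (assumed to wrap modulo 2^32)
    return r;
}
```

Under these assumptions R16 to R20 hold 0, 3, 1, 13 and 0xFFFFFFFF respectively, which are the values to look for on the ALU output signals.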
Figure 7.9: The Output Waveform of ICS_RISC

Figure 7.10: The Instruction Index

Figure 7.10 shows the instruction index recorded during ICS_RISC instruction execution for debugging purposes. The instruction index matched the instructions of the pseudo code exactly.
7.2.4. UnitChip

The UnitChip is composed of the UnitCAP and the UnitICS. Its model can broadly be divided into four kinds of SystemC sub-files: ICS_RISC, Memory, DMA and UnitCAP. The architecture and the pipelined operation mechanism described in Chapter 5 can be identified in the high-level system simulation results.

Figure 7.11 illustrates the block diagram and SystemC file structure of the UnitChip. The functionality of each SystemC sub-block has been described above; the UnitChip is a simple combination (port-mapping) of these sub-blocks at the top module. A simple ALU instruction has been mapped onto this high-level modelled UnitChip, and the simulation result shows its functionality. In Figure 7.12, the upper circle indicates the ICS_RISC operation introduced above, and the lower circle shows the PE operations, namely the parallel execution of simple ALU functions by the PEs. The signal named PE1.dOut is the output signal from PE1. The functionality can be verified by checking these signals (PE1 to PE16) and is as expected.
7.3 Conclusions

In this chapter, an overview of SystemC and its CAD development environment has been given. The high-level system modelling and functional verification of the 3D-SoftChip using SystemC has been described and some simulation results provided. The waveforms show the correct functionality for each of the sub-blocks and for the top module of the UnitChip.
Chapter 8
Application Mapping for 3D-SoftChip

The MPEG4 Full Search Block Matching Motion Estimation Algorithm (FBMA) has been applied to the high-level system modeled 3D-SoftChip to verify its functionality and demonstrate its architectural superiority. The hand-crafted assembler code for implementation of the algorithm becomes the input stimulus of the system-level modeled 3D-SoftChip. The performance will be analyzed in comparison with a conventional DSP processor, Application Specific ICs (ASICs) and MorphoSys.

8.1 Full Search Block Matching Algorithm (FBMA)

Motion estimation (ME) is introduced to exploit the temporal redundancy of video sequences and is an indispensable part of video compression standards such as ISO/IEC MPEG-1, MPEG-2 and MPEG-4, CCITT H.261 and ITU-T H.263. Since ME is computationally the most demanding portion of the video encoder, taking up to 80% of the total computation time, it can be a major limiting factor for performance. Among the many different ME algorithms, FBMA is one of the most widely used in hardware despite its high computational cost, because it has the optimal performance and the lowest control overhead. The block matching motion estimation algorithm compares a specific sized block of pixels in the current frame with a range of equally sized pixel blocks in the previous frame to find the best match (minimum difference) between two of the blocks. The position of the best matched block can then be encoded as a motion
• **STEP 1 - LOAD REF. BLOCK DATA INTO PE ARRAY SRAM:** The first operation is to load the reference block data \((I_k(m,n))\) into the embedded SRAM in each PE in the array.

• **STEP 2 - EACH PE MOVES THIS DATA TO INTERNAL REGISTER:** Each PE moves the reference data from the embedded SRAM into an internal register so it is available to be used for calculation of SAD values for the entire search window.

• **STEP 3 - LOAD FIRST SEARCH POSITION BLOCK DATA INTO PE ARRAY SRAM:** The block data for the first search position \((I_{k-1}(m+dx,n+dy))\) is then loaded into the embedded SRAM in each PE in the array ready for calculation of the SAD value between the reference block and this first search position.

• **STEP 4 - EACH PE EXECUTES SUBTRACTION AND ABSOLUTE VALUE COMPUTATION:** In this step, each PE carries out a subtraction operation between the reference block data and the current search position data in SRAM. The absolute value of the resulting difference is stored as the absolute difference value for that block position.

• **STEP 5 - PARTIAL SUMMATION (1):** In this step every odd columned PE performs a partial sum operation of its absolute difference value with the value from the PE to its immediate right in the array. The result is stored as a double-word value across both PEs.

• **STEP 6 - PARTIAL SUMMATION (2):** In this step the two partial sums computed in the previous step are summed in the same way: every odd columned PE pair sums its result with the result from the PE pair to its right. This result is stored as a quad-word value across all four PEs in each row.

• **STEP 7 - PARTIAL SUMMATION (3):** In this step the column wise operation carried out in step 5 is repeated row wise to accumulate another set of partial sums. In this case, however, the second row of PEs accumulates its result with the result from the row above, while the third row of PEs accumulates its result with the result from the row below.

- **STEP 8 – PARTIAL SUMMATION (4):** In this final partial sum accumulation, the second row of PEs sums its result with the result from the third row, producing the total SAD value for that search position.

- **STEP 9 – WRITE BACK RESULT DATA TO THE ICS_RISC:** Finally the resultant SAD value calculated in STEP 8 is written back to the internal register in the ICS_RISC for comparison with the previous minimum and updating of the motion vector if applicable.

- **STEP 10 – REPEAT STEPS 4 TO 9:** The next search position data block can be loaded into the SRAM in the PE array while the SAD calculation is being carried out for the current search position, so once the result has been written back, the calculation of the SAD for the next search position can begin immediately.
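
The partial summation of steps 4 to 8 can be sketched in plain C++ for a single 4×4 PE array. This is an illustrative model only, not the assembler code used in the thesis: the names `Grid`, `sad_by_partial_sums` and `sad_direct` are hypothetical, each PE is represented by one `int`, and the double-word and quad-word storage across neighbouring PEs is simplified to a single accumulator in the left-most PE of each group.

```cpp
#include <array>
#include <cstdlib>

// One value per PE, indexed [row][column].
using Grid = std::array<std::array<int, 4>, 4>;

// Steps 4 to 8 for one search position on a 4x4 PE array.
int sad_by_partial_sums(const Grid& ref, const Grid& cand) {
    Grid v{};
    // Step 4: every PE forms |reference - candidate|.
    for (int r = 0; r < 4; ++r)
        for (int c = 0; c < 4; ++c)
            v[r][c] = std::abs(ref[r][c] - cand[r][c]);
    // Step 5: the first and third columns add the value from the PE to their right.
    for (int r = 0; r < 4; ++r) {
        v[r][0] += v[r][1];
        v[r][2] += v[r][3];
    }
    // Step 6: the first PE pair adds the result of the pair to its right.
    for (int r = 0; r < 4; ++r) v[r][0] += v[r][2];
    // Step 7: the second row accumulates the row above, the third row the row below.
    v[1][0] += v[0][0];
    v[2][0] += v[3][0];
    // Step 8: the second row adds the third row, giving the total SAD.
    return v[1][0] + v[2][0];
}

// Reference result: a plain sum of absolute differences over all 16 PEs.
int sad_direct(const Grid& ref, const Grid& cand) {
    int sum = 0;
    for (int r = 0; r < 4; ++r)
        for (int c = 0; c < 4; ++c)
            sum += std::abs(ref[r][c] - cand[r][c]);
    return sum;
}
```

Both functions return the same value for any input, which is the property the four partial summation steps rely on: the tree of pairwise additions only reorders the sum.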

### 8.3 Performance Analysis

Figure 8.3 compares the performance of the 3D-SoftChip with a DSP processor, several ASICs and MorphoSys for matching an 8×8 reference block against a search area of 8 pixels displacement. There are 81 candidate blocks (27 iterations) in each search area [33]. In the 3D-SoftChip, as described above, processing one candidate block takes just 7 clock cycles (each UnitChip computes one quarter block, so with 4 UnitChips one complete block is computed every 7 cycles), so the total number of processing cycles for the 3D-SoftChip is 567 (81 candidate blocks at 7 cycles each).
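
The cycle count can be reproduced with a short calculation. The sketch below uses only the figures quoted in the text (81 candidate blocks, 7 cycles per block, 4 UnitChips); the function names are hypothetical, and `scaled_cycles` assumes ideal linear scaling with the number of UnitChips, which is an assumption made here rather than a measured result.

```cpp
// Figures quoted in the text.
constexpr int kCandidateBlocks = 81;  // candidate blocks per search area
constexpr int kCyclesPerBlock = 7;    // cycles per candidate block with 4 UnitChips
constexpr int kBaseUnitChips = 4;     // UnitChips in the evaluated configuration

// Total cycles to match one reference block with 4 UnitChips.
constexpr int total_cycles() { return kCandidateBlocks * kCyclesPerBlock; }

// Effective cycles per reference block with more UnitChips, assuming ideal
// linear scaling and rounding to the nearest whole cycle.
constexpr int scaled_cycles(int unit_chips) {
    return (total_cycles() * kBaseUnitChips + unit_chips / 2) / unit_chips;
}
```

With these definitions `total_cycles()` evaluates to 567, and `scaled_cycles(16)` evaluates to 142 for a 4×4 UnitChip array (567 / 4 = 141.75, rounded).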

The number of clock cycles required with just 4 UnitChips is very close to that reported for MorphoSys. This can, however, readily be improved simply by increasing the number of UnitChips on a scaled up 3D-SoftChip. A 4×4 UnitChip array, for example, has four times as many UnitChips and would therefore have an effective throughput of one reference block every 142 cycles (567/4). In addition to this, considering the characteristics of the 3D system, there are other significant advantages.
Data dependency is largely eliminated, so after the initial set-up there is 100% PE utilisation. The reference and candidate block data can be moved into the embedded SRAM in the PE concurrently with array execution, so the PEs can operate continuously. Low power consumption can also be achieved by minimising the number of data accesses, because most data manipulation is executed within the PE array. Most importantly, however, because all memory is directly accessible within the 3D-SoftChip via the IBIA, there are effectively zero external data reads, and thus power consumption should be greatly improved over all the other approaches.

![Clock Cycles Graph](image)

**Figure 8.3: Performance comparison for Motion Estimation**

Compared with the DSP processor and the dedicated ASICs, the suggested 4×4 UnitChip 3D-SoftChip shows a remarkable advance, with a theoretical capability of more than 3.8 times their performance. Given its wide applicability and adaptability to any number of other applications, achieving this performance against dedicated processors is potentially an enormous advancement. This demonstrates the architectural superiority of the suggested novel 3D-SoftChip.
8.4 Conclusions

In this chapter, the MPEG4 full search block matching algorithm has been mapped onto the system-level modelled 3D-SoftChip in order to demonstrate its architectural superiority. According to the described results, the proposed 3D-SoftChip architecture has the potential for a more than 3.8 times performance improvement over conventional systems. The suggested 3D-ACSoC is clearly a highly suitable system for the coming giga-scaled integrated computing age.
Chapter 9
Conclusions

In this chapter, the contribution of this thesis will be summarised and future research work will be suggested.

9.1 Contributions

In this thesis, a novel 3D vertically integrated adaptive system-on-chip architecture as a next generation computing system along with its functional verification and the mapping of an MPEG4 motion estimation algorithm has been presented. The suggested architecture has a number of advantages compared with conventional current generation reconfigurable/adaptive computing systems, such as wide applicability, various and powerful computation methods, adaptive word-length configuration and benefits from the architecture including 3D interconnect performance, reliability and a reduction in the size of the chips (and thus the cost), as described before. As outlined in chapter 5.3, the size of total chip as described is relatively small at around 1.1 mm$^2$ for an array of 2x2 UnitChips or 2.2 mm$^2$ for an array of 4x4 UnitChips. This is based on a 4-bit word-length for the PEs so there is also ready potential to extend to a wider word-length (8-bit word-length) and more integration of the PEs to maximise the computational throughput and benefits from large integration. Moreover, the ICS_RISC can also be readily extended on the upper chip layer by adopting advanced computation algorithms and dedicated instructions for specific applications to allow more efficient controllability and
performance over the current relatively simple ICS_RISC design. As minimum feature sizes continue to decrease in more advanced chip fabrication processes, the inherent scalability of the UnitChip design means that the array size can simply be increased, within the constraints of the maximum die size, to realise ever more powerful adaptive computing systems.

The performance of the MPEG4 full search block matching motion estimation algorithm has been shown to be more than 3.8 times better than that of current generation processors. Due to these significant performance, power and cost advantages, the suggested 3D-ACSoC can be regarded as one of the most suitable architectures for the next generation of computing systems.

Moreover, the suggested advanced HW/SW co-design and verification methodology can improve reliability and significantly reduce the design time, especially the time and effort required for verification. This thesis indicates a highly promising research direction for future adaptive computing systems and an advanced and efficient HW/SW development methodology for ever more complicated SoCs.

9.2 Future Work

As introduced in the suggested design methodology, the high-level modelling and functional verification have been carried out; the next task is architectural exploration to obtain an optimized HW specification. Various architecture options can be explored by modelling the memory, data frame buffer and DMA controller in parameterized form using SystemC and then simulating various HW configurations to find the best HW specification. Parameterized modelling makes the architecture exploration considerably easier, since the parameter values can simply be changed in the SystemC code. Figure 9.1 shows the SystemC modelling of the parameterized memory. Once the optimum HW specification is decided, the rest of the procedure can be executed with any conventional hardware design method, such as full or semi-custom design, and the SW design should be performed concurrently so that the novel concept of an adaptive system-on-chip computing system can be realised.
// Parameterized RAM

#ifndef RAMT_H
#define RAMT_H
#include "systemc.h"

template <class T, int size = 100>
SC_MODULE(ram) {
  sc_in<bool> clock;
  sc_in<bool> nRW;   // Read/Write select: high = read, low = write
  sc_in<int> addr;   // Address
  sc_inout<T> data;  // Parameterized Word-length

  void ram_proc();
  SC_HAS_PROCESS(ram);

  // The member initialiser list follows the constructor signature
  ram(sc_module_name name_, bool debug_ = false)
    : sc_module(name_), debug(debug_)
  {
    SC_THREAD(ram_proc);
    sensitive << clock.pos();  // Trigger on the rising clock edge
    buffer = new T[size];

    if (debug)
      cout << "Running constructor of " << name() << endl;
    cout << "Number of locations is " << size << endl;
  }

  ~ram() { delete[] buffer; }  // Release the storage array

private:
  T* buffer;
  const bool debug;
};

template <class T, int size>
void ram<T, size>::ram_proc()
{
  while (true)
  {
    wait();  // Wait for the next rising clock edge
    if (nRW.read())
      data.write(buffer[addr.read()]);    // Read cycle: drive the data port
    else
      buffer[addr.read()] = data.read();  // Write cycle: store the data port
  }
}
#endif

Figure 9.1: The parameterized Memory modeling example using SystemC
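
The exploration idea can be illustrated without a SystemC installation by a plain C++ analogue of the parameterized memory. This is a sketch under stated assumptions: `RamModel`, `SmallRam` and `WideRam` are hypothetical names, and clocking, ports and the `nRW` handshake of Figure 9.1 are deliberately left out so that only the effect of the two template parameters is shown.

```cpp
#include <array>
#include <cstddef>

// Plain C++ analogue of the parameterized memory: T sets the word type and
// Size the number of locations. No clock or ports are modelled.
template <class T, std::size_t Size = 100>
class RamModel {
public:
    // at() throws std::out_of_range for an address outside the memory.
    void write(std::size_t addr, T value) { buffer_.at(addr) = value; }
    T read(std::size_t addr) const { return buffer_.at(addr); }
    static constexpr std::size_t size() { return Size; }

private:
    std::array<T, Size> buffer_{};  // zero-initialised storage
};

// Two hypothetical configurations that could be compared during exploration.
using SmallRam = RamModel<unsigned char, 16>;
using WideRam = RamModel<unsigned int, 256>;
```

Switching between configurations only requires changing the template arguments, which is the property the parameterized SystemC model is intended to provide.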
Bibliography


# ICS_RISC Instruction Set Architecture Version 1.0

<table>
<thead>
<tr>
<th>Instruction format</th>
<th>Instruction ID bits</th>
<th>Fields (left to right)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Immediate (1 word)</td>
<td>0 0 0 0 0 0 0 0</td>
<td>Opcode, Rd, Immediate (4, 8, 16-bit)</td>
</tr>
<tr>
<td>Immediate (2 word)</td>
<td>0 0 0 0 0 0 0 1</td>
<td>Opcode, Rd, Unused</td>
</tr>
<tr>
<td>Register</td>
<td>0 0 0 0 0 0 1 0</td>
<td>Opcode, Rd, Rs2, Rs1, Unused</td>
</tr>
<tr>
<td>LB Addressing</td>
<td>0 0 0 0 0 0 1 1</td>
<td>Opcode, Rd, Rs2, Rs1, IMEm, Unused</td>
</tr>
<tr>
<td>Shift / Rotate</td>
<td>0 0 1 0 1 0 0</td>
<td>Shift, x, Rs2, Rs1, Unused</td>
</tr>
<tr>
<td>Load</td>
<td>0 0 1 0 1 0 1 0</td>
<td>Opcode, Rd, X, ShiftAmt, Rs1, Unused</td>
</tr>
<tr>
<td>Store</td>
<td>0 0 1 0 1 0 1 0</td>
<td>Opcode, X, Rd, X, Rb, Unused</td>
</tr>
<tr>
<td>Branch</td>
<td>0 0 1 0 1 1 1</td>
<td>Cond, x, Offset</td>
</tr>
<tr>
<td>PE Control</td>
<td>0 1 0 1 0 0 0</td>
<td>PE Op, x, Opcode, Config, PE Sel, Unused</td>
</tr>
<tr>
<td>DMA Control</td>
<td>1</td>
<td>DMA Op, DB, Amount of Data to Transfer, Start address of DFB (Src/Dst), SRAM Reg, Start address of SRAM/ICS Reg (S/D), Mem Sel, Start address of Program/Data Memory (Src/Dst)</td>
</tr>
<tr>
<td>Multiply</td>
<td>0 1 1 1 1</td>
<td>Opcode, Rd, Rs2, Rs1, Unused</td>
</tr>
<tr>
<td>Dedicated Instructions</td>
<td></td>
<td>Not yet decided</td>
</tr>
</tbody>
</table>
### 3D-SoftChip

**A Novel 3D Vertically Integrated Adaptive Computing System**

Appendix A – RISC ISA Version 1.0

<table>
<thead>
<tr>
<th>Opcodes</th>
<th>Mnemonics</th>
<th>Description (Immediate)</th>
<th>Description (Register)</th>
</tr>
</thead>
<tbody>
<tr>
<td>0 0 0 0</td>
<td>MOVA</td>
<td>Rd = Immediate</td>
<td>Rd = Rs1</td>
</tr>
<tr>
<td>0 0 0 1</td>
<td>MOVB</td>
<td>Rd = Immediate</td>
<td>Rd = Rs2</td>
</tr>
<tr>
<td>0 0 1 0</td>
<td>AND</td>
<td>Rd = Rd &amp; Immediate</td>
<td>Rd = Rs1 &amp; Rs2</td>
</tr>
<tr>
<td>0 0 1 1</td>
<td>OR</td>
<td>Rd = Rd | Immediate</td>
<td>Rd = Rs1 | Rs2</td>
</tr>
<tr>
<td>0 1 0 0</td>
<td>XOR</td>
<td>Rd = Rd ^ Immediate</td>
<td>Rd = Rs1 ^ Rs2</td>
</tr>
<tr>
<td>0 1 0 1</td>
<td>NOT</td>
<td>Rd = ~ Immediate</td>
<td>Rd = ~Rs1</td>
</tr>
<tr>
<td>0 1 1 0</td>
<td>ADD</td>
<td>Rd = Rs1 + Immediate</td>
<td>Rd = Rs1 + Rs2</td>
</tr>
<tr>
<td>0 1 1 1</td>
<td>SUB</td>
<td>Rd = Rs1 - Immediate</td>
<td>Rd = Rs1 - Rs2</td>
</tr>
<tr>
<td>1 0 0 0</td>
<td>CMP</td>
<td>Compare Rs1 and Immediate</td>
<td>Compare Rs1 and Rs2</td>
</tr>
<tr>
<td>1 0 0 1</td>
<td>MSR</td>
<td>Status Register = Immediate</td>
<td>Status Register = Rs1</td>
</tr>
<tr>
<td>1 0 1 0</td>
<td>MRS</td>
<td>N/A</td>
<td>Rs1 = Status Register</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Shift</th>
<th>Mnemonics</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>0 0 0</td>
<td>LSL</td>
<td>Shift Left</td>
</tr>
<tr>
<td>0 0 1</td>
<td>LSR</td>
<td>Shift Right</td>
</tr>
<tr>
<td>0 1 0</td>
<td>ASR</td>
<td>Arithmetic Shift Right</td>
</tr>
<tr>
<td>1 0 0</td>
<td>ROT</td>
<td>Rotate</td>
</tr>
</tbody>
</table>
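
The four shifter operations can be sketched in plain C++ on a 32-bit operand. This is an illustration rather than the SystemC shifter model: the names `ShiftOp` and `shift32` are hypothetical, the shift amount is assumed to be taken modulo 32, and `ROT` is assumed to rotate right because the table does not state a direction.

```cpp
#include <cstdint>

enum class ShiftOp { LSL, LSR, ASR, ROT };

// Applies one shifter operation to a 32-bit value.
std::uint32_t shift32(ShiftOp op, std::uint32_t value, unsigned amount) {
    amount &= 31u;  // assumption: only the low five bits of the amount are used
    switch (op) {
    case ShiftOp::LSL:
        return value << amount;
    case ShiftOp::LSR:
        return value >> amount;
    case ShiftOp::ASR: {
        // Replicate the sign bit into the vacated upper bits.
        const bool negative = (value & 0x80000000u) != 0;
        std::uint32_t result = value >> amount;
        if (negative && amount != 0) result |= 0xFFFFFFFFu << (32u - amount);
        return result;
    }
    case ShiftOp::ROT:
        // Assumption: rotate right; bits shifted out re-enter at the top.
        return amount ? (value >> amount) | (value << (32u - amount)) : value;
    }
    return value;
}
```

For example, an arithmetic shift right of 0x80000000 by 4 gives 0xF8000000, whereas a logical shift right of the same operand gives 0x08000000.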
### PE Operations

<table>
<thead>
<tr>
<th>Cond</th>
<th>Mnemonics</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>0 0 0</td>
<td>EQ</td>
<td>Equal</td>
</tr>
<tr>
<td>0 0 1</td>
<td>NE</td>
<td>Not Equal</td>
</tr>
<tr>
<td>0 1 0</td>
<td>AL</td>
<td>Always (Unconditional)</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>PE Operations</th>
<th>Mnemonics</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>0 0 0</td>
<td>PECNF</td>
<td>Configuration of each PEs (4,8,16,32 bits)</td>
</tr>
<tr>
<td>0 0 1</td>
<td>PESEL</td>
<td>To select certain PE (PE0 ~ PE15)</td>
</tr>
<tr>
<td>0 1 0</td>
<td>PEMODE</td>
<td>To select PE operation modes (Horizontal/Vertical/Circular modes)</td>
</tr>
<tr>
<td>0 1 1</td>
<td>PEVEXE</td>
<td>To execute specific program to each PEs in the same vertical line</td>
</tr>
<tr>
<td>1 0 0</td>
<td>PEHEXE</td>
<td>To execute specific program to each PEs in the same horizontal line</td>
</tr>
<tr>
<td>1 0 1</td>
<td>PECXE</td>
<td>To execute specific program to each PEs in the same circular line</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>DMA Operations</th>
<th>Mnemonics</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>0 0 0</td>
<td>LDPEPRG</td>
<td>Load maximum 16 program data from Program memory to Embedded SRAM in PEs</td>
</tr>
<tr>
<td>0 0 1</td>
<td>LDDFB</td>
<td>Load large amount of processing data for PEs from Memory to Data Frame Buffer</td>
</tr>
<tr>
<td>0 1 0</td>
<td>LDPEDATA</td>
<td>Load large amount of processing data for PEs from DFB to Embedded SRAM in PEs</td>
</tr>
<tr>
<td>0 1 1</td>
<td>WBREG</td>
<td>Write back processed data in Embedded SRAM to the Registers in the ICS_RISC</td>
</tr>
<tr>
<td>1 0 0</td>
<td>WBDFB</td>
<td>Write back processed data in Embedded SRAM to DFB</td>
</tr>
<tr>
<td>1 0 1</td>
<td>WBMEM</td>
<td>Write back processed data in DFB to Data Memory</td>
</tr>
</tbody>
</table>
I. Instruction descriptions

- Immediate addressing: Short immediate values: 4, 8, 16 bit (1 instruction word). Long immediate value: 32 bits (2 instruction words)
  \[ \text{Rd} = \text{Rd} \oplus \text{Immediate (4,8,16,32 bit)} \]

<table>
<thead>
<tr>
<th>1 Instruction word</th>
<th>Opcode</th>
<th>Rd</th>
<th>4, 8, 16 bit Constant</th>
</tr>
</thead>
<tbody>
<tr>
<td>0 0 0 0 0 0 0 0</td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

Bit positions are numbered from 31 (leftmost) down to 0 (rightmost).

- Register addressing:
  \[ \text{Rd} = \text{Rs1} \oplus \text{Rs2} \]
  Description: Rs1 and Rs2 give the addresses of the source operands in the internal regFile (32 entries of 32-bit data, 32 x 32 bit). The opcode identifies the operation, and the result of the operation on Rs1 and Rs2 is stored in the register indicated by Rd.

- LB Addressing:
  \[ \text{Rd} = \text{Rs1} \oplus \text{Rs2} \]
  Description: When LB Addressing is active, the operand addresses are taken from the Loop Buffer, which has a looping capacity of 16 entries.

- Shift / Rotate:
  \[ \text{Rd} = \text{Rs1} \text{ Shift by Amount} \]
  Description: The shifter shifts the input operand as specified by shiftCtl and shiftAmt.

- Load:
  \[ \text{Rd} = \text{Mem}[\text{Rb}] \]
Description: Rd in the regFile is loaded with the data at the data memory address indicated by Rb.

- **Store:**
  
  \[ \text{Mem}[Rb] = Rd \]
  
  Description: The data held in Rd in the regFile is stored to the data memory address indicated by Rb.

- **Branch:**
  
  \[ \text{If (Cond) } PC = PC + \text{Offset} \]
  
  Description: If the condition given by the Cond signals is met, the Program Counter is increased by the offset value.

- **Multiply:**
  
  \[ Rd = Rs1 \ast Rs2 \]
  
  Description: The operands are multiplied and the result is stored in Rd.

- **PE Control:**
  
  PECNF, PESEL, PEMODE (Horizontal/Vertical/Circular modes), PEEXE

- **DMA Control:**
  
  LDPEPRG: Load maximum 16 program data from Program memory to Instruction Decoder in PEs
  
  LDDFB: Load large amount of processing data for PEs from Data Memory to Data Frame Buffer
  
  LDPEDATA: Load large amount of processing data for PEs from DFB to Embedded SRAM in PE
  
  WBREG: Write back processed data in Embedded SRAM in PE to the registers in the ICS_RISC
  
  WBDFB: Write back processed data in Embedded SRAM in PE to DFB
  
  WBMEM: Write back processed data in DFB to Data Memory

- **Dedicated Instructions**
  
  Not yet decided
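
The Branch entry above, `If (Cond) PC = PC + Offset`, can be sketched in plain C++ together with the three condition codes EQ, NE and AL. The sketch is illustrative only: the names `Cond`, `condition_met` and `apply_branch` are hypothetical, EQ and NE are assumed to test the Zero flag of the status register, the offset is assumed to be signed, and the normal sequential increment of the PC for a branch that is not taken is left out.

```cpp
#include <cstdint>

enum class Cond { EQ, NE, AL };

// Assumption: EQ and NE test the Zero flag left by an earlier compare.
bool condition_met(Cond cond, bool zero_flag) {
    switch (cond) {
    case Cond::EQ: return zero_flag;
    case Cond::NE: return !zero_flag;
    case Cond::AL: return true;
    }
    return false;
}

// Returns PC + Offset when the condition is met, otherwise the unchanged PC.
// The addition wraps modulo 2^32, so a negative offset branches backwards.
std::uint32_t apply_branch(std::uint32_t pc, Cond cond, bool zero_flag,
                           std::int32_t offset) {
    if (!condition_met(cond, zero_flag)) return pc;
    return pc + static_cast<std::uint32_t>(offset);
}
```

With a PC of 100, an unconditional branch with offset 8 gives 108, while an EQ branch with offset -4 gives 96 only when the Zero flag is set.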
1.2 Addressing modes.

- Immediate
- Register
- Loop Buffer(LB) Addressing
High-level Modeling of 3D-SoftChip using SystemC

1 Configurable Array Processor (CAP) Chip
   1.1 Processing Element: Standard-PE

1.2 System Components
   • MUX A, MUX B: input operand selection
   • Instruction Decoder: 4 sets of ID, each ID has 4 sets of 19-bit registers for S-PE instruction decoding
   • ALU: 4-bit ALU with bit-serial multiplier, adder, subtractor, comparator
   • Registers: 4 sets of registers
   • DoutReg: data out register to send data to adjacent PEs (Up/Down/Left/Right)
   • Embedded SRAM: embedded SRAM (word-length: 4-bit, address: 0-15)
1.3 S-PE functions

<table>
<thead>
<tr>
<th>Function</th>
<th>Mnemonics</th>
</tr>
</thead>
<tbody>
<tr>
<td>A and B</td>
<td>AND</td>
</tr>
<tr>
<td>A or B</td>
<td>OR</td>
</tr>
<tr>
<td>not A</td>
<td>NOT</td>
</tr>
<tr>
<td>A xor B</td>
<td>XOR</td>
</tr>
<tr>
<td>A + B</td>
<td>ADD</td>
</tr>
<tr>
<td>A - B</td>
<td>SUB</td>
</tr>
<tr>
<td>A × B</td>
<td>SPMUL</td>
</tr>
<tr>
<td>A comp B</td>
<td>COMP</td>
</tr>
</tbody>
</table>

Table 1.1: S-PE functions

1.4 S-PE Instruction Format

Figure 1.2: S-PE instruction format

1.5 S-PE Block Diagram (In/Output Pin Description)

Figure 1.3: S-PE block diagram (Input/Output Pin Description)
1.6 Data-path Architecture of S-PE

![Diagram of data-path architecture of S-PE]

Figure 1.4: Data-path architecture of S-PE

1.7 S-PE Operation Flow

![Flowchart of S-PE operation flow]

Figure 1.5: S-PE operation flow
1.8 SystemC File Structure

Figure 1.6: SystemC file structure of S-PE

1.9 SystemC Codes for S-PE

See Appendix C

1.10 Output Waveform

Figure 1.7: Top level simulation result of S-PE
1.11 Processing Element: Processing Accelerator-PE

![Figure 1.8: Processing Accelerator-PE architecture](image)

1.12 System Components

- MUX A, MUX B: input operand selection
- Instruction Decoder: 4 sets of ID, each ID has 4 sets of 19-bit registers for PA-PE instruction decoding
- Multiplier: a signed 4-bit scalable parallel/parallel multiplier
- Accumulator/Subtractor: to enable MAC, MAS operations within one clock cycle.
- 8-bit Barrel shifter
- Registers: 4 sets of registers.
- Embedded SRAM: embedded SRAM (word-length: 4-bit, address: 0-15)

1.13 PA-PE Functions

<table>
<thead>
<tr>
<th>Function</th>
<th>Mnemonics</th>
</tr>
</thead>
<tbody>
<tr>
<td>A × B</td>
<td>PAMUL</td>
</tr>
<tr>
<td>A × B + out(t)</td>
<td>MAC</td>
</tr>
</tbody>
</table>
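
The two PA-PE functions listed above can be sketched in plain C++ for the signed 4-bit operands of the multiplier. This is a behavioural illustration, not the SystemC model: the names `sign_extend4`, `pamul` and `mac` are hypothetical, operands are assumed to be two's-complement nibbles, and the accumulator is held in an `int` because its width is not stated here.

```cpp
// Interprets the low four bits as a two's-complement value in [-8, 7].
int sign_extend4(unsigned nibble) {
    nibble &= 0xFu;
    return (nibble & 0x8u) ? static_cast<int>(nibble) - 16
                           : static_cast<int>(nibble);
}

// PAMUL: A x B on signed 4-bit operands.
int pamul(unsigned a, unsigned b) {
    return sign_extend4(a) * sign_extend4(b);
}

// MAC: A x B + out(t), where out_prev is the previous output value.
// The accumulator width is an assumption; a plain int is used here.
int mac(unsigned a, unsigned b, int out_prev) {
    return pamul(a, b) + out_prev;
}
```

For example, `mac(0x3, 0xE, 10)` multiplies 3 by -2 and adds the previous output 10, giving 4.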
### 1.14 PA-PE Instruction Format

<table>
<thead>
<tr>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td>18</td>
<td>17</td>
<td>16</td>
<td>15</td>
<td>12</td>
<td>11</td>
<td>10</td>
<td>9</td>
<td>8</td>
<td>6</td>
<td>5</td>
<td>3</td>
<td>2</td>
</tr>
<tr>
<td>WS_en</td>
<td>WR_en</td>
<td>SRAM</td>
<td>SRAM Selection</td>
<td>Register</td>
<td>DonR</td>
<td>PA-PE_OP</td>
<td>MUX_B</td>
<td>MUX_A</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>RS_en</td>
<td>RR_en</td>
<td>en</td>
<td>Selection</td>
<td>Ctl</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

Figure 1.9: PA-PE instruction format

### 1.15 PA-PE Block Diagram (In/Output Pin Description)

![PA-PE Block Diagram](image-url)

Figure 1.10: PA-PE block diagram (Input/Output Pin Description)
1.16 Data-path Architecture of PA-PE

Figure 1.11: Data-path architecture of PA-PE

1.17 PA-PE Operation Flow

Figure 1.12: PA-PE operation flow
1.18 SystemC File Structure

![SystemC File Structure Diagram]

Figure 1.13: SystemC file structure of PA-PE

1.19 SystemC Codes for PA-PE

See Appendix C.

1.20 Output Waveform

![Output Waveform Diagram]

Figure 1.14: Top level simulation of PA-PE
2 ICS(Intelligent Configurable Switch) Chip

2.1 ICS_RISC (32-bit Dedicated RISC Control Processor)

Figure 2.1: Overall architecture of ICS_RISC

2.2 ICS_RISC-Detailed Architecture

Figure 2.2: Detailed ICS_RISC Architecture
2.3 Special Features (ICS_RISC)

- Harvard architecture, 3 Stage Pipelined architecture (Fetch, Decode, Execute)
- Memory access, during the execution stage, is done by load/store instructions only
- All operations except load/store, PE and DMA operations, are register-to-register within the ICS RISC
- Single-cycle instruction execution

2.4 System Components (ICS_RISC)

- Program Counter: the 32nd GPR is the program counter
- Loop Buffer: 16 x 32-bit buffer to generate instruction address for iterative characteristic instructions
- Register file (General Purpose Register): 32 x 32-bit general purpose register
- Status Register: 4 kinds of flags (N: Negative / Less Than, Z: Zero, C: Carry / Borrow, V: Overflow)
- Instruction Register: Instruction decoder for ALU and Control Unit
- ALU & Control Unit: consists of the ALU, Shifter and Multiplier
- I/O Unit: 32-bit Data input/output register (dInReg, dOutReg)
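
How the four status flags might be derived for a 32-bit addition can be sketched in plain C++. The flag definitions below follow common processor conventions and are assumptions, since only the flag names are given above; the names `Flags` and `add_flags` are hypothetical.

```cpp
#include <cstdint>

struct Flags {
    bool n;  // Negative: bit 31 of the result is set
    bool z;  // Zero: the result is zero
    bool c;  // Carry: carry out of bit 31
    bool v;  // Overflow: signed result does not fit in 32 bits
};

// Adds a and b, stores the 32-bit sum in result and returns the flags.
Flags add_flags(std::uint32_t a, std::uint32_t b, std::uint32_t& result) {
    const std::uint64_t wide = static_cast<std::uint64_t>(a) + b;
    result = static_cast<std::uint32_t>(wide);
    Flags f{};
    f.n = (result & 0x80000000u) != 0;
    f.z = result == 0;
    f.c = (wide >> 32) != 0;
    // Overflow occurs when both operands share a sign that the result lacks.
    f.v = ((~(a ^ b) & (a ^ result)) & 0x80000000u) != 0;
    return f;
}
```

Adding 1 to 0x7FFFFFFF sets N and V but not C, while adding 1 to 0xFFFFFFFF sets Z and C but not V, which shows why carry and overflow are tracked separately.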

2.5 ICS RISC Functions

- See Appendix A

2.6 ICS_RISC Block Diagram (Input/Output Pin Description)

![ICS_RISC Block Diagram](image_url)

Figure 2.3: ICS_RISC Block Diagram (Input/Output Pin Description)
2.7 UnitICS Block Diagram (Input/Output Pin Description)

![UnitICS Block Diagram](image)

Figure 2.4: UnitICS Block Diagram (Input/Output Pin Description)

2.8 Three-stage Pipeline Architecture (ICS_RISC)

![Three-stage Pipeline Architecture](image)

Figure 2.5: Three-stage Pipeline Architecture (ICS_RISC)

2.9 Register Architecture (ICS_RISC)

![Register Architecture](image)

Figure 2.6: Register Architecture (ICS_RISC)
2.10 Data-path Architecture of ICS_RISC

Figure 2.7: Data-path architecture of ICS_RISC

2.11 SystemC File Structure

Figure 2.8: SystemC file structure of ICS_RISC

2.12 SystemC Modeling of Data-path Architecture of ICS_RISC

- System Components: (1) Program Counter, (2) Status Register, (3) Loop Buffer, (4) General Purpose Register, (5) ALU, (6) Barrel Shifter, (7) Multiplier, (8) Data Input Register, (9) Data Output Register

- Program Counter (PC)
Figure 2.9: Output waveform of PC

- Status Register

Figure 2.10: Output waveform of Status Register

- Loop Buffer (LB)
3D-SoftChip
A Novel 3D Vertically Integrated Adaptive Computing System
Appendix B-High-level Modeling of 3D-SoftChip Using SystemC

Figure 2.11: Output waveform of Loop Buffer

- General Purpose Register (GPR)

Figure 2.12: Output waveform of Register File

- ALU

Table 2.1: ALU Functions

<table>
<thead>
<tr>
<th>Opcodes</th>
<th>Mnemonics</th>
<th>Description (Immediate)</th>
<th>Description (Register)</th>
</tr>
</thead>
<tbody>
<tr>
<td>0 0 0 0</td>
<td>MOVA</td>
<td>Rd = Immediate</td>
<td>Rd = Rs1</td>
</tr>
<tr>
<td>0 0 0 1</td>
<td>MOVB</td>
<td>Rd = Immediate</td>
<td>Rd = Rs2</td>
</tr>
<tr>
<td>0 0 1 0</td>
<td>AND</td>
<td>Rd = Rd &amp; Immediate</td>
<td>Rd = Rs1 &amp; Rs2</td>
</tr>
<tr>
<td>0 0 1 1</td>
<td>OR</td>
<td>Rd = Rd | Immediate</td>
<td>Rd = Rs1 | Rs2</td>
</tr>
<tr>
<td>0 1 0 0</td>
<td>XOR</td>
<td>Rd = Rd ^ Immediate</td>
<td>Rd = Rs1 ^ Rs2</td>
</tr>
<tr>
<td>0 1 0 1</td>
<td>NOT</td>
<td>Rd = ~ Immediate</td>
<td>Rd = ~Rs1</td>
</tr>
<tr>
<td>0 1 1 0</td>
<td>ADD</td>
<td>Rd = Rs1 + Immediate</td>
<td>Rd = Rs1 + Rs2</td>
</tr>
<tr>
<td>0 1 1 1</td>
<td>SUB</td>
<td>Rd = Rs1 - Immediate</td>
<td>Rd = Rs1 - Rs2</td>
</tr>
<tr>
<td>1 0 0 0</td>
<td>CMP</td>
<td>Compare Rs1 and Immediate</td>
<td>Compare Rs1 and Rs2</td>
</tr>
<tr>
<td>1 0 0 1</td>
<td>MSR</td>
<td>Status Register = Immediate</td>
<td>Status Register = Rs1</td>
</tr>
<tr>
<td>1 0 1 0</td>
<td>MRS</td>
<td>N/A</td>
<td>Rs1 = Status Register</td>
</tr>
</tbody>
</table>

Figure 2.13: Output waveform of ALU
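
The register-form column of Table 2.1 can be illustrated with a minimal C++ sketch. It is not the thesis's SystemC ALU: the names `AluOp` and `alu_register_form` are hypothetical, only the opcodes that produce a value in Rd are covered (CMP, MSR and MRS act on the status register and are left out), and 32-bit wrap-around arithmetic is assumed.

```cpp
#include <cstdint>

// Opcode values follow Table 2.1; the enum and function names are illustrative only.
enum class AluOp : std::uint8_t {
    MOVA = 0x0, MOVB = 0x1, AND = 0x2, OR = 0x3,
    XOR  = 0x4, NOT  = 0x5, ADD = 0x6, SUB = 0x7
};

// Returns the value written to Rd for the register form of each instruction.
// Unsigned 32-bit arithmetic wraps modulo 2^32 (an assumption of this sketch).
std::uint32_t alu_register_form(AluOp op, std::uint32_t rs1, std::uint32_t rs2) {
    switch (op) {
        case AluOp::MOVA: return rs1;
        case AluOp::MOVB: return rs2;
        case AluOp::AND:  return rs1 & rs2;
        case AluOp::OR:   return rs1 | rs2;
        case AluOp::XOR:  return rs1 ^ rs2;
        case AluOp::NOT:  return ~rs1;
        case AluOp::ADD:  return rs1 + rs2;
        case AluOp::SUB:  return rs1 - rs2;
    }
    return 0;  // not reached for the opcodes listed above
}
```

For example, `alu_register_form(AluOp::ADD, 6, 7)` evaluates to 13.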

- 32-bit Barrel Shifter

<table>
<thead>
<tr>
<th>Shift</th>
<th>Mnemonics</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>0 0 0</td>
<td>LSL</td>
<td>Shift Left</td>
</tr>
<tr>
<td>0 0 1</td>
<td>LSR</td>
<td>Shift Right</td>
</tr>
<tr>
<td>0 1 0</td>
<td>ASR</td>
<td>Arithmetic Shift Right</td>
</tr>
<tr>
<td>0 1 1</td>
<td>ROT</td>
<td>Rotate</td>
</tr>
</tbody>
</table>
Figure 2.14: Output waveform of 32-bit Barrel Shifter
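
The four shift modes listed above can be sketched in plain C++ as follows. This is an illustration under stated assumptions, not the SystemC model: `ShiftOp` and `barrel_shift` are hypothetical names, the shift amount is assumed to be taken modulo 32, and ROT is assumed to rotate to the right because the table does not give a direction.

```cpp
#include <cstdint>

// Shift codes follow the barrel shifter table; names are illustrative only.
enum class ShiftOp : std::uint8_t { LSL = 0, LSR = 1, ASR = 2, ROT = 3 };

std::uint32_t barrel_shift(ShiftOp op, std::uint32_t value, unsigned amount) {
    amount &= 31u;  // assumed: only the low five bits of the amount are used
    switch (op) {
        case ShiftOp::LSL:
            return value << amount;
        case ShiftOp::LSR:
            return value >> amount;
        case ShiftOp::ASR: {
            // Shift logically, then fill the vacated high bits with the sign bit.
            std::uint32_t shifted = value >> amount;
            if ((value & 0x80000000u) != 0u && amount != 0u) {
                shifted |= ~(0xFFFFFFFFu >> amount);
            }
            return shifted;
        }
        case ShiftOp::ROT:
            // Rotate right (assumed direction); amount 0 is handled separately
            // to avoid shifting by the full word width.
            return amount == 0u ? value
                                : (value >> amount) | (value << (32u - amount));
    }
    return value;  // not reached for the codes listed above
}
```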

- 32 × 32 Signed Multiplier

Figure 2.15: Output waveform of Signed 32 × 32 Multiplier

- Data Input Register

- Data Output Register
2.13 Control Architecture of ICS_RISC

- Fetch Unit: Fetch the instructions

- Decoder Unit

Figure 2.16: Output waveform of top module in data-path architecture

Figure 2.17: Instruction Decoding (1)
Table 2.3: Instruction ID for Instruction Decoding

<table>
<thead>
<tr>
<th>Instruction ID</th>
<th>Instruction[31:25]</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>INST_ALUIS</td>
<td>000/0000</td>
<td>ALU Immediate (1 Inst. Word)</td>
</tr>
<tr>
<td>INST_ALUIL</td>
<td>000/0001</td>
<td>ALU Immediate (2 Inst. Word)</td>
</tr>
<tr>
<td>INST_ALUR</td>
<td>000/0010</td>
<td>ALU Register</td>
</tr>
<tr>
<td>INST_ALULB</td>
<td>000/0011</td>
<td>ALU Loop Buffer Addressing</td>
</tr>
<tr>
<td>INST_SHRO</td>
<td>001/0100</td>
<td>Shift / Rotate</td>
</tr>
<tr>
<td>INST_LOAD</td>
<td>001/0101</td>
<td>Load</td>
</tr>
<tr>
<td>INST_STORE</td>
<td>001/0110</td>
<td>Store</td>
</tr>
<tr>
<td>INST_BRANCH</td>
<td>001/0111</td>
<td>Branch</td>
</tr>
<tr>
<td>INST_PECN</td>
<td>010/1000</td>
<td>PE Control</td>
</tr>
<tr>
<td>INST_DMA</td>
<td>1xx/xxxx</td>
<td>DMA Control</td>
</tr>
<tr>
<td>INST_MUL</td>
<td>011/1111</td>
<td>Multiply</td>
</tr>
</tbody>
</table>
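
As an illustration of Table 2.3, the following C++ sketch classifies a 32-bit instruction word by its Instruction[31:25] field. The names `InstId` and `decode_instruction_id` are hypothetical, and returning `UNKNOWN` for unlisted patterns is an assumption of this sketch rather than documented decoder behaviour.

```cpp
#include <cstdint>

// Instruction classes from Table 2.3; UNKNOWN is added for unlisted patterns.
enum class InstId {
    ALUIS, ALUIL, ALUR, ALULB, SHRO, LOAD, STORE, BRANCH, PECN, DMA, MUL, UNKNOWN
};

InstId decode_instruction_id(std::uint32_t instruction) {
    const std::uint32_t field = instruction >> 25;  // Instruction[31:25], 7 bits
    if ((field & 0x40u) != 0u) {
        return InstId::DMA;                         // 1xx/xxxx
    }
    switch (field) {
        case 0x00: return InstId::ALUIS;   // 000/0000
        case 0x01: return InstId::ALUIL;   // 000/0001
        case 0x02: return InstId::ALUR;    // 000/0010
        case 0x03: return InstId::ALULB;   // 000/0011
        case 0x14: return InstId::SHRO;    // 001/0100
        case 0x15: return InstId::LOAD;    // 001/0101
        case 0x16: return InstId::STORE;   // 001/0110
        case 0x17: return InstId::BRANCH;  // 001/0111
        case 0x28: return InstId::PECN;    // 010/1000
        case 0x3F: return InstId::MUL;     // 011/1111
        default:   return InstId::UNKNOWN;
    }
}
```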

Figure 2.19: Output waveform of Instruction Decoding

* Execute Unit

Table 2.4: Control Signal according to the Instruction

<table>
<thead>
<tr>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td>ALU Immediate (1 Inst. Word)</td>
<td>Op Code</td>
<td>Enable</td>
<td>Disable</td>
<td>Disable</td>
<td>Rd</td>
<td>Immediate (4, 8, 16 bit)</td>
</tr>
<tr>
<td>ALU Immediate (2 Inst. Word)</td>
<td>Op Code</td>
<td>Enable</td>
<td>Disable</td>
<td>Disable</td>
<td>Rd</td>
<td>Immediate (32 bit)</td>
</tr>
<tr>
<td>ALU Register</td>
<td>Op Code</td>
<td>Enable</td>
<td>Disable</td>
<td>Disable</td>
<td>Rs1</td>
<td>Rs2</td>
</tr>
<tr>
<td>ALU LB Addr.</td>
<td>Op Code</td>
<td>Enable</td>
<td>Disable</td>
<td>Disable</td>
<td>Rs1</td>
<td>Rs2</td>
</tr>
<tr>
<td>Shift / Rotate</td>
<td>Don’t Care</td>
<td>Disable</td>
<td>Enable</td>
<td>Disable</td>
<td>Rb (Rs1)</td>
<td>ShiftAmt</td>
</tr>
<tr>
<td>Load</td>
<td>MOV</td>
<td>Disable</td>
<td>Disable</td>
<td>Disable</td>
<td>Rb (Rs1)</td>
<td>Don’t Care</td>
</tr>
<tr>
<td>Store</td>
<td>MOV</td>
<td>Disable</td>
<td>Disable</td>
<td>Disable</td>
<td>Rb (Rs1)</td>
<td>Rd</td>
</tr>
<tr>
<td>Branch</td>
<td>ADD</td>
<td>Enable</td>
<td>Disable</td>
<td>Disable</td>
<td>PC</td>
<td>Immediate</td>
</tr>
<tr>
<td>PE Control</td>
<td>Don’t Care</td>
<td>Disable</td>
<td>Disable</td>
<td>Disable</td>
<td>Don’t Care</td>
<td>Don’t Care</td>
</tr>
<tr>
<td>DMA Control</td>
<td>Don’t Care</td>
<td>Disable</td>
<td>Disable</td>
<td>Disable</td>
<td>Don’t Care</td>
<td>Don’t Care</td>
</tr>
<tr>
<td>Multiply</td>
<td>Don’t Care</td>
<td>Disable</td>
<td>Disable</td>
<td>Enable</td>
<td>Rs1</td>
<td>Rs2</td>
</tr>
</tbody>
</table>


Figure 2.20: Output waveform of Instruction Execution

- Pipeline Register

Figure 2.21: Conventional Pipeline Register Architecture

Figure 2.22: Modified Pipeline Register Architecture (High-Speed)
* Pipeline Control (reset, flush and refill)

[Figure: pipeline timing diagram with Fetch, Decode and Execute stages followed by a pipeline refill; instruction addresses N+1 to N+5, then DST to DST+3]

**Figure 2.23: Branch Instruction Execution**
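
The flush-and-refill behaviour can be sketched with a small cycle-level model of a three-stage pipeline. This is an assumption-laden illustration, not the ICS_RISC control logic: it assumes the branch is resolved in the Execute stage, that the two younger instructions in Fetch and Decode are discarded, and that fetching resumes at the destination address in the next cycle. `PipelineTrace` and `run_pipeline` are hypothetical names.

```cpp
#include <vector>

struct PipelineTrace {
    std::vector<int> executed;  // addresses that reached Execute, in order
    int cycles;                 // cycles until the pipeline drained
};

// Runs a program of n instructions at addresses 0..n-1. The instruction at
// branch_at (if any) is a taken branch to target; it is taken only once.
PipelineTrace run_pipeline(int n, int branch_at, int target) {
    const int EMPTY = -1;
    int fetch = EMPTY, decode = EMPTY, execute = EMPTY;
    int pc = 0;
    bool taken = false;
    PipelineTrace trace{{}, 0};
    while (true) {
        // Advance every stage by one step.
        execute = decode;
        decode = fetch;
        fetch = (pc < n) ? pc++ : EMPTY;
        if (fetch == EMPTY && decode == EMPTY && execute == EMPTY) {
            break;  // pipeline drained
        }
        ++trace.cycles;
        if (execute != EMPTY) {
            trace.executed.push_back(execute);
            if (execute == branch_at && !taken) {
                // Flush the two younger instructions and refill from the target.
                taken = true;
                decode = EMPTY;
                fetch = EMPTY;
                pc = target;
            }
        }
    }
    return trace;
}
```

Under these assumptions a taken branch costs two extra cycles: an eight-instruction program with a branch at address 2 to address 6 executes five instructions in nine cycles, whereas five sequential instructions take seven.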

2.14 Top-level Simulation Result of ICS_RISC

* Simple Program for Verification

<table>
<thead>
<tr>
<th>Address</th>
<th>Instruction</th>
<th>Register 1</th>
<th>Register 2</th>
<th>Comment</th>
</tr>
</thead>
<tbody>
<tr>
<td>0000/0000</td>
<td>MOV R0, #0</td>
<td></td>
<td></td>
<td>/Simple Loop Program</td>
</tr>
<tr>
<td>0001/0001</td>
<td>MOV R1, #1</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>0002/0002</td>
<td>MOV R2, #2</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>0003/0003</td>
<td>MOV R3, #3</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>0004/0004</td>
<td>MOV R4, #4</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>0005/0005</td>
<td>MOV R5, #5</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>0006/0006</td>
<td>MOV R6, #6</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>0007/0007</td>
<td>MOV R7, #7</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>0408/0000</td>
<td>MOV R8, R0</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>0409/0020</td>
<td>MOV R9, R1</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>040A/0040</td>
<td>MOV R10, R2</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>040B/0060</td>
<td>MOV R11, R3</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>040C/0080</td>
<td>MOV R12, R4</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>040D/00A0</td>
<td>MOV R13, R5</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>040E/00C0</td>
<td>MOV R14, R6</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>040F/00E0</td>
<td>MOV R15, R7</td>
<td></td>
<td></td>
<td>/End</td>
</tr>
<tr>
<td>0450/4280</td>
<td>AND R16, R8 &amp; R9</td>
<td></td>
<td></td>
<td>/Simple ALU Program</td>
</tr>
<tr>
<td>0471/52C0</td>
<td>OR R17, R10 | R11</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>0492/6340</td>
<td>XOR R18, R12 ^ R13</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>04D3/6BB0</td>
<td>ADD R19, R14 + R15</td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>
Figure 2.24: Top level Simulation Result of ICS_RISC
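
The register contents that the verification program should leave behind can be worked out with a short C++ sketch. It simply replays the listed instructions on a 32-entry register array; `run_verification_program` is a hypothetical name, and the OR and XOR rows are read as `R10 | R11` and `R12 ^ R13`, following the operators in Table 2.1.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Replays the verification program and returns the final register contents.
std::array<std::uint32_t, 32> run_verification_program() {
    std::array<std::uint32_t, 32> r{};  // all registers start at zero
    for (std::size_t i = 0; i < 8; ++i) {
        r[i] = static_cast<std::uint32_t>(i);  // MOV Ri, #i
    }
    for (std::size_t i = 0; i < 8; ++i) {
        r[8 + i] = r[i];                       // MOV R(8+i), Ri
    }
    r[16] = r[8] & r[9];    // AND R16, R8 & R9
    r[17] = r[10] | r[11];  // OR  R17, R10 | R11
    r[18] = r[12] ^ r[13];  // XOR R18, R12 ^ R13
    r[19] = r[14] + r[15];  // ADD R19, R14 + R15
    return r;
}
```

With these readings the program ends with R16 = 0, R17 = 3, R18 = 1 and R19 = 13.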

Figure 2.25: Instruction Index

3 UnitChip

3.1 SystemC File Structure

![Figure 3.1: SystemC file structure of UnitChip](image1)

3.2 Top-level Simulation Result of UnitChip

![Figure 3.2: Top-level Simulation Result of UnitChip](image2)
References for the ICS_RISC


http://www.opencores.org/projects/riscmcu
Appendix C: SystemC Codes

1 Standard-PE.

/*
 * iReg: Instruction Reg for Standard-PE (header file for iReg)
 * Copyright(c) 2005 by Chul KIM, All rights reserved
 * Author: Chul KIM (ckim@student.ecu.edu.au)
 * File name: iReg.h
 * Revision history: Version 1
 * Date: 17/1/2005
 */

#include "systemc.h"

SC_MODULE(iReg) {
    sc_in<sc_uint<19> >  inst;      // instruction input
    sc_out<sc_uint<3> >  muxACtl;   // muxA control
    sc_out<sc_uint<3> >  muxBCtl;   // muxB control
    sc_out<sc_uint<3> >  sopSel;    // S-PE operation select
    sc_out<bool>         doutRCtl;  // data-out register control
    sc_out<sc_uint<2> >  regSel;    // internal register select
    sc_out<sc_uint<4> >  sramSel;   // SRAM select
    sc_out<bool>         sramEn;    // SRAM enable signal
    sc_out<bool>         rwRegEn;   // internal register read/write signal
    sc_out<bool>         rwSEn;     // SRAM read/write enable signal

    void do_iReg();

    SC_CTOR(iReg) {
        SC_METHOD(do_iReg);
        sensitive << inst;

        #ifdef SIM
        muxACtl.initialize(0);
        muxBCtl.initialize(0);
        sopSel.initialize(0);
        doutRCtl.initialize(false);
        regSel.initialize(0);
        sramSel.initialize(0);
        sramEn.initialize(false);
        rwRegEn.initialize(false);
        rwSEn.initialize(false);
        #endif
    }
};

/*
 * iReg: Instruction Reg for Standard-PE (source file for iReg)
 * Copyright(c) 2005 by Chul KIM, All rights reserved
 * Author: Chul KIM (ckim@student.ecu.edu.au)
 * File name: iReg.cpp
 * Revision history: Version 1
 * Date: 17/1/2005
 */

#include "iReg.h"

// Decode the 19-bit instruction word into the control fields.
// Bits 16..0 are assumed to follow the same layout as the
// Processing Accelerator-PE iReg.
void iReg::do_iReg() {
    sc_uint<19> tmp_inst = inst.read();
    rwSEn    = tmp_inst[18];
    rwRegEn  = tmp_inst[17];
    sramEn   = tmp_inst[16];
    sramSel  = tmp_inst.range(15,12);
    regSel   = tmp_inst.range(11,10);
    doutRCtl = tmp_inst[9];
    sopSel   = tmp_inst.range(8,6);
    muxBCtl  = tmp_inst.range(5,3);
    muxACtl  = tmp_inst.range(2,0);
}

/*
 * Mux: Mux for Standard-PE (header file for Mux)
 * Copyright(c) 2005 by Chul KIM, All rights reserved
 * Author: Chul KIM (ckim@student.ecu.edu.au)
 * File name: mux.h
 * Revision history: Version 1
 * Date: 17/1/2005
 */

#include "systemc.h"

SC_MODULE(mux) {
    sc_in<sc_uint<3> >   muxCtl;  // mux control input
    sc_in<sc_uint<4> >   dIn;     // input data
    sc_in<sc_uint<4> >   dReg;    // data from internal register
    sc_in<sc_uint<4> >   dLeft;   // data from adjacent PE (left)
    sc_in<sc_uint<4> >   dRight;  // data from adjacent PE (right)
    sc_in<sc_uint<4> >   dUp;     // data from adjacent PE (up)
    sc_in<sc_uint<4> >   dDown;   // data from adjacent PE (down)
    sc_out<sc_uint<4> >  muxOut;  // selected data
    sc_out<bool>         dReq;    // data request for internal register

    void do_mux();

    SC_CTOR(mux) {
        SC_METHOD(do_mux);
        sensitive << muxCtl << dIn << dReg << dLeft << dRight << dUp << dDown;

        #ifdef SIM
        muxOut.initialize(0);
        dReq.initialize(false);
        #endif
    }
};

// Source file for Mux (mux.cpp)
#include "mux.h"

void mux::do_mux() {
    switch (muxCtl.read()) {
    case 0: muxOut = dIn.read();    dReq = false; break;
    case 1: muxOut = dReg.read();   dReq = true;  break;
    case 2: muxOut = dLeft.read();  dReq = false; break;
    case 3: muxOut = dRight.read(); dReq = false; break;
    case 4: muxOut = dUp.read();    dReq = false; break;
    case 5: muxOut = dDown.read();  dReq = false; break;
    default: break;
    }
}

#include "systemc.h"
#include "iReg.h"
#include "mux.h"

SC_MODULE(spe) {
    sc_in<bool>          clock;
    sc_in<bool>          reset;
    sc_in<sc_uint<19> >  instICS;                         // instruction input from ICS
    sc_in<sc_uint<4> >   dIn, dLeft, dRight, dUp, dDown;  // data inputs
    sc_out<sc_uint<4> >  dOut;                            // data output
    sc_out<sc_uint<4> >  dOutAdjPE;                       // data output for adjacent PEs

    // temp signal for instruction
    sc_signal<sc_uint<19> > s_inst;

    // temp signals from iReg (instruction decoder)
    sc_signal<sc_uint<3> >  s_muxACtl;
    sc_signal<sc_uint<3> >  s_muxBCtl;
    sc_signal<sc_uint<3> >  s_sopSel;
    sc_signal<bool>         s_doutRCtl;
    sc_signal<sc_uint<2> >  s_regSel;
    sc_signal<sc_uint<4> >  s_sramSel;
    sc_signal<bool>         s_sramEn;
    sc_signal<bool>         s_rwRegEn;
    sc_signal<bool>         s_rwSEn;
    sc_signal<bool>         dReqA, dReqB;  // data request for register

    // temp signals for mux input/output and ALU inputs
    sc_signal<sc_uint<4> >  s_dIn, s_dLeft, s_dRight, s_dUp, s_dDown;
    sc_signal<sc_uint<4> >  dRegOutA;  // reg out for muxA input
    sc_signal<sc_uint<4> >  dRegOutB;  // reg out for muxB input
    sc_signal<sc_uint<4> >  muxAOut;
    sc_signal<sc_uint<4> >  muxBOut;

    // temp signal for ALU output
    sc_signal<sc_uint<4> >  aluOut;

    // temp signals for internal register
    sc_signal<sc_uint<4> >  regIn;
    sc_signal<sc_uint<4> >  regOut;
    sc_uint<4>              tmp1, tmp2, tmp3, tmp4;

    // temp signals for SRAM
    sc_signal<sc_lv<4> >    sramData;
    sc_lv<4>                ramData[16];

    // temp signal for output data bus
    sc_signal<sc_lv<4> >    doutBus;

    void do_latch();
    void do_alu();
    void do_reg();
    void do_sram();
    void do_doutReg();

    iReg* iReg1;
    mux*  muxA;
    mux*  muxB;

    SC_CTOR(spe) {
        // instruction decoder
        iReg1 = new iReg("iReg1");
        iReg1->inst(s_inst);
        iReg1->muxACtl(s_muxACtl);
        iReg1->muxBCtl(s_muxBCtl);
        iReg1->sopSel(s_sopSel);
        iReg1->doutRCtl(s_doutRCtl);
        iReg1->regSel(s_regSel);
        iReg1->sramSel(s_sramSel);
        iReg1->sramEn(s_sramEn);
        iReg1->rwRegEn(s_rwRegEn);
        iReg1->rwSEn(s_rwSEn);

        // operand multiplexers
        muxA = new mux("muxA");
        muxA->muxCtl(s_muxACtl);
        muxA->dIn(s_dIn);
        muxA->dReg(dRegOutA);
        muxA->dLeft(s_dLeft);
        muxA->dRight(s_dRight);
        muxA->dUp(s_dUp);
        muxA->dDown(s_dDown);
        muxA->muxOut(muxAOut);
        muxA->dReq(dReqA);

        muxB = new mux("muxB");
        muxB->muxCtl(s_muxBCtl);
        muxB->dIn(s_dIn);
        muxB->dReg(dRegOutB);
        muxB->dLeft(s_dLeft);
        muxB->dRight(s_dRight);
        muxB->dUp(s_dUp);
        muxB->dDown(s_dDown);
        muxB->muxOut(muxBOut);
        muxB->dReq(dReqB);

        // process registration (the sensitivity lists are assumed)
        SC_METHOD(do_latch);   sensitive << clock.pos() << reset;
        SC_METHOD(do_alu);     sensitive << muxAOut << muxBOut << s_sopSel;
        SC_METHOD(do_reg);     sensitive << s_rwRegEn << s_regSel << regIn << dReqA << dReqB;
        SC_METHOD(do_sram);    sensitive << s_sramEn << s_rwSEn << s_sramSel << regOut;
        SC_METHOD(do_doutReg); sensitive << s_doutRCtl << doutBus;

        doutBus = sc_lv<4>(0);
        dOut.initialize(0);
        dOutAdjPE.initialize(0);
        for (int i = 0; i < 16; i++) ramData[i] = "XXXX";
    }
};

/*
 * SPE: Standard PE for CAP (Configurable Array Processor) (source file for SPE)
 * Copyright(c) 2005 by Chul KIM, All rights reserved
 * Author: Chul KIM (ckim@student.ecu.edu.au)
 * File name: spe.cpp
 * Revision history: Version 1
 * Date: 17/1/2005
 */

#include "spe.h"

// Latch
void spe::do_latch() {
    if (reset) {
        s_inst.write(0);
        s_dIn.write(0);
        s_dLeft.write(0);
        s_dRight.write(0);
        s_dUp.write(0);
        s_dDown.write(0);
    } else {
        s_inst.write(instICS.read());
        s_dIn.write(dIn.read());  // input data
        s_dLeft.write(dLeft.read());
        s_dRight.write(dRight.read());
        s_dUp.write(dUp.read());
        s_dDown.write(dDown.read());
    }
}

// ALU
#define comp(a,b) (((a) > (b)) ? 1 : (((a) == (b)) ? 0 : 1))  // comparator

void spe::do_alu() {
    unsigned short result = 0;
    unsigned short src1 = muxAOut.read();
    unsigned short src2 = muxBOut.read();
    switch (s_sopSel.read()) {
    case 0: result = src1 & src2; break;       // and
    case 1: result = src1 | src2; break;       // or
    case 2: result = ~src1; break;             // not
    case 3: result = src1 ^ src2; break;       // xor
    case 4: result = src1 + src2; break;       // add
    case 5: result = src1 - src2; break;       // sub
    case 6: result = src1 * src2; break;       // pmul
    case 7: result = comp(src1,src2); break;   // comp
    default: break;
    }
    aluOut.write(result);
    regIn.write(result);
}

// Internal Register
void spe::do_reg() {
    if (s_rwRegEn) {  // read operation
        switch (s_regSel.read()) {
        case 0: regOut.write(tmp1); break;
        case 1: regOut.write(tmp2); break;
        case 2: regOut.write(tmp3); break;
        case 3: regOut.write(tmp4); break;
        default: break;
        }
        if (dReqA) {
            dRegOutA = regOut.read();
        } else {
            doutBus = sc_lv<4>(regOut.read());
            dOut = sc_uint<4>(doutBus.read());
        }
        if (dReqB) {
            dRegOutB = regOut.read();
        } else {
            doutBus = sc_lv<4>(regOut.read());
            dOut = sc_uint<4>(doutBus.read());
        }
    } else {  // write operation
        switch (s_regSel.read()) {
        case 0: tmp1 = regIn.read(); break;
        case 1: tmp2 = regIn.read(); break;
        case 2: tmp3 = regIn.read(); break;
        case 3: tmp4 = regIn.read(); break;
        default: break;
        }
    }
}

// SRAM
void spe::do_sram() {
    if (s_sramEn) {
        if (s_rwSEn) {  // read operation
            sramData.write(ramData[s_sramSel.read()]);
            doutBus = sramData.read();
            dOut = sc_uint<4>(doutBus.read());  // dOut has a dummy value (0F)
        } else {  // write operation
            sramData = sc_lv<4>(regOut.read());
            ramData[s_sramSel.read()] = sramData.read();
        }
    } else {
        sramData = "ZZZZ";
    }
}

// Data output register
void spe::do_doutReg() {
    if (s_doutRCtl) {
        dOutAdjPE = sc_uint<4>(doutBus.read());
    }
}

2 Processing Accelerator-PE.

/*
 * iReg: Instruction Reg for Processing Accelerator-PE (header file for iReg)
 * Copyright(c) 2005 by Chul KIM, All rights reserved
 * Author: Chul KIM (ckim@student.ecu.edu.au)
 * File name: iReg.h
 * Revision history: Version 1
 * Date: 29/1/2005
 */

#include "systemc.h"

SC_MODULE(iReg) {
    sc_in<sc_uint<19> >  inst;      // instruction input
    sc_out<sc_uint<3> >  muxACtl;   // muxA control
    sc_out<sc_uint<3> >  muxBCtl;   // muxB control
    sc_out<sc_uint<3> >  sopSel;    // PE operation select
    sc_out<bool>         doutRCtl;  // data-out register control
    sc_out<sc_uint<2> >  regSel;    // internal register select
    sc_out<sc_uint<4> >  sramSel;   // SRAM select
    sc_out<bool>         sramEn;    // SRAM enable signal
    sc_out<bool>         rwRegEn;   // internal register read/write signal
    sc_out<bool>         rwSEn;     // SRAM read/write enable signal

    void do_iReg();

    SC_CTOR(iReg) {
        SC_METHOD(do_iReg);
        sensitive << inst;

        #ifdef SIM
        muxACtl.initialize(0);
        muxBCtl.initialize(0);
        sopSel.initialize(0);
        doutRCtl.initialize(false);
        regSel.initialize(0);
        sramSel.initialize(0);
        sramEn.initialize(false);
        rwRegEn.initialize(false);
        rwSEn.initialize(false);
        #endif
    }
};

/*
 * iReg: Instruction Reg for Processing Accelerator-PE (source file for iReg)
 * Copyright(c) 2005 by Chul KIM, All rights reserved
 * Author: Chul KIM (ckim@student.ecu.edu.au)
 * File name: iReg.cpp
 * Revision history: Version 1
 * Date: 29/1/2005
 */

#include "iReg.h"

// Decode the 19-bit instruction word into the control fields.
void iReg::do_iReg() {
    sc_uint<19> tmpInst = inst.read();
    rwSEn    = tmpInst[18];           // SRAM read/write enable
    rwRegEn  = tmpInst[17];           // internal register read/write
    sramEn   = tmpInst[16];           // SRAM enable
    sramSel  = tmpInst.range(15,12);  // SRAM select
    regSel   = tmpInst.range(11,10);  // internal register select
    doutRCtl = tmpInst[9];            // data-out register control
    sopSel   = tmpInst.range(8,6);    // operation select
    muxBCtl  = tmpInst.range(5,3);    // muxB control
    muxACtl  = tmpInst.range(2,0);    // muxA control
}
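
The field layout used by `do_IReg()` can also be written as a plain C++ sketch, which is convenient for checking instruction words without a SystemC simulator. The bit positions are taken from the listing above; the names `PeControl`, `bits` and `decode_pe_instruction` are hypothetical.

```cpp
#include <cstdint>

// Control fields of the 19-bit PE instruction word (bit positions from the listing).
struct PeControl {
    bool     rwSEn;     // bit 18: SRAM read/write enable
    bool     rwRegEn;   // bit 17: internal register read/write
    bool     sramEn;    // bit 16: SRAM enable
    unsigned sramSel;   // bits 15..12
    unsigned regSel;    // bits 11..10
    bool     doutRCtl;  // bit 9
    unsigned sopSel;    // bits 8..6
    unsigned muxBCtl;   // bits 5..3
    unsigned muxACtl;   // bits 2..0
};

// Extracts word[hi:lo]; the field widths used here are at most four bits.
constexpr unsigned bits(std::uint32_t word, unsigned hi, unsigned lo) {
    return (word >> lo) & ((1u << (hi - lo + 1u)) - 1u);
}

PeControl decode_pe_instruction(std::uint32_t inst) {
    PeControl c;
    c.rwSEn    = bits(inst, 18, 18) != 0u;
    c.rwRegEn  = bits(inst, 17, 17) != 0u;
    c.sramEn   = bits(inst, 16, 16) != 0u;
    c.sramSel  = bits(inst, 15, 12);
    c.regSel   = bits(inst, 11, 10);
    c.doutRCtl = bits(inst, 9, 9) != 0u;
    c.sopSel   = bits(inst, 8, 6);
    c.muxBCtl  = bits(inst, 5, 3);
    c.muxACtl  = bits(inst, 2, 0);
    return c;
}
```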


/*
 * Mux: Mux for Processing Accelerator-PE (header file for Mux)
 * Copyright(c) 2005 by Chul KIM, All rights reserved
 * Author: Chul KIM (ckim@student.ecu.edu.au)
 * File name: mux.h
 * Revision history: Version 1
 * Date: 29/1/2005
 */

#include "systemc.h"

SC_MODULE(mux) {
    sc_in<sc_uint<3> >   muxCtl;  // mux control input
    sc_in<sc_uint<4> >   dIn;     // input data
    sc_in<sc_uint<4> >   dReg;    // data from internal register
    sc_in<sc_uint<4> >   dLeft;   // data from adjacent PE (left)
    sc_in<sc_uint<4> >   dRight;  // data from adjacent PE (right)
    sc_in<sc_uint<4> >   dUp;     // data from adjacent PE (up)
    sc_in<sc_uint<4> >   dDown;   // data from adjacent PE (down)
    sc_out<sc_uint<4> >  muxOut;  // selected data
    sc_out<bool>         dReq;    // data request for internal register

    void do_mux();

    SC_CTOR(mux) {
        SC_METHOD(do_mux);
        sensitive << muxCtl << dIn << dReg << dLeft << dRight << dUp << dDown;
        muxOut.initialize(0);
        dReq.initialize(false);
    }
};

/*
 * Mux: Mux for Processing Accelerator-PE (source file for Mux)
 * Copyright(c) 2005 by Chul KIM, All rights reserved
 * Author: Chul KIM (ckim@student.ecu.edu.au)
 * File name: mux.cpp
 * Revision history: Version 1
 * Date: 29/1/2005
 */

#include "mux.h"

void mux::do_mux() {
    switch (muxCtl.read()) {
    case 0: muxOut = dIn.read();    dReq = false; break;
    case 1: muxOut = dReg.read();   dReq = true;  break;
    case 2: muxOut = dLeft.read();  dReq = false; break;
    case 3: muxOut = dRight.read(); dReq = false; break;
    case 4: muxOut = dUp.read();    dReq = false; break;
    case 5: muxOut = dDown.read();  dReq = false; break;
    default:
        dReq = false; break;
    }
}

/*
 * ALU: ALU for Processing Accelerator-PE (header file for ALU)
 * Copyright(c) 2005 by Chul KIM, All rights reserved
 * Author: Chul KIM (ckim@student.ecu.edu.au)
 * File name: alu.h
 * Revision history: Version 1
 * Date: 29/1/2005
 */

#include "systemc.h"

SC_MODULE(alu) {
    sc_in<sc_uint<4> >   aluAIn;
    sc_in<sc_uint<4> >   aluBIn;
    sc_in<sc_uint<3> >   aluCtl;
    sc_out<sc_uint<4> >  aluOut;
    sc_out<sc_uint<4> >  regIn;
    sc_out<sc_uint<4> >  regTmp;  // temp reg for MAC, MAS & output for test

    void do_alu();

    SC_CTOR(alu) {
        SC_METHOD(do_alu);
        sensitive << aluAIn << aluBIn << aluCtl;
        #ifdef SIM
        aluOut.initialize(0);
        regIn.initialize(0);
        regTmp.initialize(0);
        #endif
    }
};

/*
 * ALU: ALU for Processing Accelerator-PE (source file for ALU)
 * Copyright(c) 2005 by Chul KIM, All rights reserved
 * Author: Chul KIM (ckim@student.ecu.edu.au)
 * File name: alu.cpp
 * Revision history: Version 1
 * Date: 29/1/2005
 */

#include "alu.h"

#define MAC(A,B,P) (((A)*(B))+(P))  // mac
#define MAS(A,B,P) (((A)*(B))-(P))  // mas
// The ROR and ABS bodies below are assumed: a 4-bit rotate right by one
// and a 4-bit two's-complement absolute value.
#define ROR(A) ((((A) & 0x0f) >> 1) | (((A) & 0x01) << 3))                     // rotate
#define ABS(A) ((((A) & 0x08) ? (0x10 - ((A) & 0x0f)) : ((A) & 0x0f)) & 0x0f)  // abs

void alu::do_alu() {
    sc_uint<4> result = 0, tmp, mulTmp;  // temp signals
    unsigned int src1 = aluAIn.read();
    unsigned int src2 = aluBIn.read();

    switch (aluCtl.read()) {
    case 0: result = src1 * src2; break;                     // pmul
    case 1: mulTmp = src1 * src2;
            result = mulTmp + sc_uint<4>(regTmp.read()); break;  // mac
    case 2: mulTmp = src1 * src2;
            result = mulTmp - sc_uint<4>(regTmp.read()); break;  // mas
    case 3: result = src1 << 1; break;                       // shift left
    case 4: result = src1 >> 1; break;                       // divide by 2
            // when the data-type is signed, it should be modified (asr)
    case 5: result = ABS(src1); break;                       // abs
            // when the data-type is signed, it can be applied
    case 6: result = ROR(src1); break;                       // rotate
    default: break;
    }
    aluOut.write(result);
    regIn.write(result);
    tmp = aluOut;
    regTmp.write(tmp);  // defined in the header file, signal for test
}

#include "systemc.h"

SC_MODULE(pape) {
    sc_in<bool>          clock;
    sc_in<bool>          reset;
    sc_in<sc_uint<19> >  instICS;                         // instruction input from ICS
    sc_in<sc_uint<4> >   dIn, dLeft, dRight, dUp, dDown;  // data inputs
    sc_out<sc_uint<4> >  dOut;                            // data output
    sc_out<sc_uint<4> >  dOutAdjPE;                       // data output for adjacent PEs

    // temp signal for instruction
    sc_signal<sc_uint<19> > s_inst;

    // temp signals from iReg (instruction decoder)
    sc_signal<sc_uint<3> >  s_muxACtl;
    sc_signal<sc_uint<3> >  s_muxBCtl;
    sc_signal<sc_uint<3> >  s_sopSel;
    sc_signal<bool>         s_doutRCtl;
    sc_signal<sc_uint<2> >  s_regSel;
    sc_signal<sc_uint<4> >  s_sramSel;
    sc_signal<bool>         s_sramEn;
    sc_signal<bool>         s_rwRegEn;
    sc_signal<bool>         s_rwSEn;
    sc_signal<bool>         dReqA, dReqB;  // data request for register

    // temp signals for mux input and ALU inputs
    sc_signal<sc_uint<4> >  s_dIn, s_dLeft, s_dRight, s_dUp, s_dDown;
    sc_signal<sc_uint<4> >  dRegOutA, dRegOutB;

    // temp signals for internal register
    sc_signal<sc_uint<4> >  s_regIn;
    sc_signal<sc_uint<4> >  s_regTmp;
    sc_signal<sc_uint<4> >  regOut;
    sc_uint<4>              tmp1, tmp2, tmp3, tmp4;

    // temp signals for SRAM
    sc_signal<sc_lv<4> >    sramData;
    sc_lv<4>                ramData[16];

    // temp signal for output data bus
    sc_signal<sc_lv<4> >    doutBus;

    void do_latch();
    void do_reg();
    void do_sram();
    void do_doutReg();

    // The constructor (sub-module instantiation, port binding and process
    // registration) does not appear in this listing.
};


/*
 * PAPE: Processing Accelerator-PE for CAP (source file for PAPE)
 * Copyright(c) 2005 by Chul KIM, All rights reserved
 * Author: Chul KIM (ckim@student.ecu.edu.au)
 * File name: pape.cpp
 * Revision history: Version 1
 * Date: 29/1/2005
 */

#include "pape.h"

// Latch
void pape::do_latch()
{
    if (reset) {
        s_inst.write(0);
        s_dIn.write(0);
        s_dLeft.write(0);
        s_dRight.write(0);
        s_dUp.write(0);
        s_dDown.write(0);
    } else {
        s_inst.write(instICS.read());
        s_dIn.write(dIn.read());
        s_dLeft.write(dLeft.read());
        s_dRight.write(dRight.read());
        s_dUp.write(dUp.read());
        s_dDown.write(dDown.read());
    }
}

// Internal Register
void pape::do_reg()
{
    if (s_rwRegEn) {
        // read operation
        switch (s_regSel.read())
        {
            case 0: regOut.write(tmp1); break;
            case 1: regOut.write(tmp2); break;
            case 2: regOut.write(tmp3); break;
            case 3: regOut.write(tmp4); break;
            default: break;
        }

        if (dReqA) {
            dRegOutA = regOut.read();
        } else {
            doutBus = sc_lv<4>(regOut.read());
            dOut = sc_uint<4>(doutBus.read());
        }

        if (dReqB) {
            dRegOutB = regOut.read();
        } else {
            doutBus = sc_lv<4>(regOut.read());
            dOut = sc_uint<4>(doutBus.read());
        }
    } else {
        // write operation
        switch (s_regSel.read())
        {
            case 0: tmp1 = s_regIn.read(); break;
            case 1: tmp2 = s_regIn.read(); break;
            case 2: tmp3 = s_regIn.read(); break;
            case 3: tmp4 = s_regIn.read(); break;
            default: break;
        }
    }
}

// SRAM
void pape::do_sram()
{
    if (s_sramEn) {
        if (s_rwSEn) {
            // read operation
            sramData.write(ramData[s_sramSel.read()]);
            doutBus = sramData.read();
            dOut = sc_uint<4>(sramData.read());
        } else {
            // write operation
            sramData = sc_lv<4>(s_regIn.read());
            ramData[s_sramSel.read()] = sramData.read();
        }
    } else {
        sramData = "ZZZZ";
    }
}

// Data output register
void pape::do_doutReg() {
    if (s_doutRCtl) {
        dOutAdjPE = sc_uint<4>(doutBus.read());
    }
}

3 ICS_RISC
3.1 Datapath Architecture

/*
 * PC: Program Counter
 * for ICS (Intelligent Configurable Switch) RISC Core (header file for pc)
 * Copyright(c) 2005 by Chul KIM, All rights reserved
 * Author: Chul KIM (ckim@student.ecu.edu.au)
 * File name: pc.h
 * Revision history: Version 1
 * Date: 23/3/2005
 */

#include "systemc.h"

SC_MODULE(pc) {
    sc_in<sc_logic>       clock;
    sc_in<sc_logic>       reset;
    sc_in<sc_logic>       iAddrCtl;  // select signal between aluOut/incrOut
    sc_in<sc_logic>       dAddrCtl;
    sc_in<sc_uint<32> >   aluOut;
    sc_out<sc_uint<32> >  iAddr;     // instruction address
    sc_out<sc_uint<32> >  dAddr;     // data address

    void do_pc();
    void do_autoIncr();

    sc_uint<32> iAddrTmp;
    sc_uint<32> incrOut;
    bool        iAddrOut;
    sc_uint<32> iAddrIn;

    SC_CTOR(pc) {
        SC_METHOD(do_pc);
        sensitive << clock.pos() << reset << iAddrCtl << dAddrCtl << aluOut;
        SC_METHOD(do_autoIncr);
        sensitive << clock.pos() << reset;

        #ifdef SIM
        iAddr.initialize(0);
        dAddr.initialize(0);
        #endif
    }
};

/*
 * File name: pc.cpp
 * Revision history: Version 1
 * Date: 22/3/2005
 */

#include "pc.h"

void pc::do_pc() {
    bool iAddrOutTmp;
    if (reset.read() == SC_LOGIC_1) {
        iAddr = 0;
        dAddr = 0;
    } else {
        iAddrTmp = iAddrIn;
        incrOut = iAddrTmp + 1;
    }
    if (iAddrCtl.read() == SC_LOGIC_0) {
        iAddr = aluOut.read();   // address supplied by the ALU
        iAddrOutTmp = 0;
    } else {
        iAddr = incrOut;         // incremented address
        iAddrOutTmp = 1;
    }
    iAddrOut = iAddrOutTmp;
    if (dAddrCtl.read() == SC_LOGIC_0) {
        dAddr = aluOut.read();
    }
}

void pc::do_autoIncr() {
    if (reset.read() == SC_LOGIC_1) {
        iAddrIn = 0;
    } else if (!iAddrOut) {
        if (clock.posedge()) {
            iAddrIn++;
        }
    }
}

/*
 * SR: Status Register
 * for ICS (Intelligent Configurable Switch) RISC Core (header file for sr)
 * Copyright(c) 2005 by Chul KIM, All rights reserved
 * Author: Chul KIM (ckim@student.ecu.edu.au)
 * File name: sr.h
 * Revision history: Version 1
 * Date: 3/2/2005
 */

#include "systemc.h"

SC_MODULE(sr) {
    sc_in<bool>          clock;
    sc_in<bool>          reset;
    sc_in<sc_uint<4> >   condFlag;
    sc_in<sc_uint<4> >   wbData;
    sc_in<bool>          wbSel;
    sc_in<bool>          srOEn;
    sc_in<bool>          srWbEn;
    sc_out<bool>         zFlag;
    sc_out<sc_uint<4> >  rdData;

    void do_sr();
    sc_uint<4> srData;

    SC_CTOR(sr) {
        SC_METHOD(do_sr);
        sensitive << clock << reset << condFlag << wbData
                  << wbSel << srOEn << srWbEn;
        #ifdef SIM
        zFlag.initialize(false);
        rdData.initialize(0);
        #endif
    }
};

/*
 * SR: Status Register
 * for ICS (Intelligent Configurable Switch) RISC Core (source file for sr)
 * Copyright(c) 2005 by Chul KIM, All rights reserved
 * Author: Chul KIM (ckim@student.ecu.edu.au)
 * File name: sr.cpp
 * Revision history: Version 1
 * Date: 3/2/2005
 */

#include "sr.h"

void sr::do_sr() {
    if (reset) {
        srData = 0;
    } else if (srWbEn) {
        if (wbSel) {
            srData = condFlag.read();
        } else {
            srData = wbData.read();
        }
    }
    zFlag = srData[2];
    if (srOEn) {
        rdData = srData;
    }
    // rdData = srOEn ? sc_lv<4>(srData) : "ZZZZ";
}

/*
 * LF: Loop Buffer
 * for ICS(Intelligent Configurable Switch) RISC Core(header file for lf)
 * Copyright(c) 2005 by Chul KIM, All right reserved
 * Author: Chul KIM(chkim@student.ceu.edu.au)
 * File name: lf.h
 * Revision history: Version 1
 * Date: 14/3/2005
 */

#include "systeme.h"

SC_MODULE(lf) {
    sc_in<bool> clock;
    sc_in<bool> reset;
    sc_in<bool> lbEn; // Loop Buffer Enable
    sc_in<bool> lbRWEn; // Loop Buffer Read/Write Enable
    sc_in<int<32>> lAddrIn; // Addr Input for LF
    sc_out<int<32>> lAddrOut; // Addr Output for LF
    sc_out<int<64>> lNerr;
    void do_lf();
}
3D-SoftChip
A Novel 3D Vertically Integrated Adaptive Computing System
Appendix C: System Codes

```c
#include "system.h"

SC_MODULE(n"lf") {
    sc_uint<32> buff[16];
    sc_uint<4> incrTmp;

    SC_CTOR() {
        SC_METHOD(do_lf);
        sensitive << clock.posedge() << reset;
    #ifdef SIM
        clock=0;
        reset=1;
        lbEn=0;
        lbRWE=0;
        lAddrIn.initialize(0);
        incrTmp.initialize(15);
    #endif
    }
}

/*
* LF: Loop Buffer
* for ICS(Intelligent Configurable Switch)RISC Core(source file for lf)
* Copyright(c) 2005 by Chui KIM, All right reserved
* Author: Chui KIM(ckim@student.ccu.edu.au)
* File name: lf.cpp
* Revision history: Version 1
* Date: 17/3/2005
*/

#include "system.h"

void lf::do_lf() {
    if (clock.posedge()) {
        if (lbEn) {
            if (lbRWE) {
                // read operation
                incrTmp++;
                lAddrOut.write(buff[incrTmp]);
            } else {
                // write operation
                incrTmp--;
                buff[incrTmp] = lAddrIn;
            }
        }
        // incr = incrTmp;
    }
}

/*
* RegFile: 32 x 32 Register file
* for ICS(Intelligent Configurable Switch)RISC Core(source file for regFile)
* Copyright(c) 2005 by Chul KIM, All rights reserved
* Author: Chul KIM(ckim@student.ecu.edu.au)
* File name: regFile.h
* Revision history: Version 1
* Date: 3/2/2005
*/

#include "systemc.h"

SC_MODULE(regFile) {
    sc_in<bool> clock;
    sc_in<sc_uint<5> > rdAIdx;      // read index A
    sc_in<sc_uint<5> > rdBIdx;      // read index B
    sc_in<bool> rdAOEn;             // read A output enable
    sc_in<bool> rdBOEn;             // read B output enable
    sc_in<sc_uint<5> > wbIdx;       // writeback index
    sc_in<sc_uint<32> > wbData;     // writeback data
    sc_in<bool> wbEn;               // writeback enable
    sc_in<sc_uint<32> > pc;         // program counter (used below; declaration and type assumed)
    sc_out<sc_uint<32> > rdAData;   // read data A
    sc_out<sc_uint<32> > rdBData;   // read data B

    sc_signal<sc_uint<32> > gpr0, gpr1, gpr2, gpr3, gpr4, gpr5, gpr6, gpr7, gpr8, gpr9,
        gpr10, gpr11, gpr12, gpr13, gpr14, gpr15, gpr16, gpr17, gpr18, gpr19, gpr20,
        gpr21, gpr22, gpr23, gpr24, gpr25, gpr26, gpr27, gpr28, gpr29, gpr30, gpr31;

    void do_regFile();

    SC_CTOR(regFile) {
        SC_METHOD(do_regFile);
        sensitive << clock.pos() << rdAIdx << rdBIdx << rdAOEn << rdBOEn << wbIdx
                  << wbData << wbEn << pc;
    #ifdef SIM
        rdAData.initialize(0);
        rdBData.initialize(0);
    #endif
    }
};
```

/*
 * RegFile: 32 x 32 Register file
 * for ICS(Intelligent Configurable Switch)RISC Core(source file for regFile)
 * Copyright(c) 2005 by Chul KIM, All rights reserved
 * Author: Chul KIM(ckim@student.ecu.edu.au)
 * File name: regFile.cpp
 * Revision history: Version 1
 * Date: 3/2/2005
 */

#include "regFile.h"

void regFile::do_regFile() {
  switch (wbIdx.read()) {
  case 0: gpr0.write(wbData); break;
  case 1: gpr1.write(wbData); break;
  case 2: gpr2.write(wbData); break;
  case 3: gpr3.write(wbData); break;
  case 4: gpr4.write(wbData); break;
  case 5: gpr5.write(wbData); break;
  case 6: gpr6.write(wbData); break;
  case 7: gpr7.write(wbData); break;
  case 8: gpr8.write(wbData); break;
  case 9: gpr9.write(wbData); break;
  case 10: gpr10.write(wbData); break;
  case 11: gpr11.write(wbData); break;
  case 12: gpr12.write(wbData); break;
  case 13: gpr13.write(wbData); break;
  case 14: gpr14.write(wbData); break;
  case 15: gpr15.write(wbData); break;
  case 16: gpr16.write(wbData); break;
  case 17: gpr17.write(wbData); break;
  case 18: gpr18.write(wbData); break;
  case 19: gpr19.write(wbData); break;
  case 20: gpr20.write(wbData); break;
  case 21: gpr21.write(wbData); break;
  case 22: gpr22.write(wbData); break;
  case 23: gpr23.write(wbData); break;
  case 24: gpr24.write(wbData); break;
  case 25: gpr25.write(wbData); break;
  case 26: gpr26.write(wbData); break;
  case 27: gpr27.write(wbData); break;
  case 28: gpr28.write(wbData); break;
  case 29: gpr29.write(wbData); break;
  case 30: gpr30.write(wbData); break;
  case 31: gpr31.write(pc); break; //for PC
  }


```c
\begin{verbatim}
if (rdAOEn) {
  switch (rdAIdx.read()) {
    case 0: rdAData = gpr0; break;
    case 1: rdAData = gpr1; break;
    case 2: rdAData = gpr2; break;
    case 3: rdAData = gpr3; break;
    case 4: rdAData = gpr4; break;
    case 5: rdAData = gpr5; break;
    case 6: rdAData = gpr6; break;
    case 7: rdAData = gpr7; break;
    case 8: rdAData = gpr8; break;
    case 9: rdAData = gpr9; break;
    case 10: rdAData = gpr10; break;
    case 11: rdAData = gpr11; break;
    case 12: rdAData = gpr12; break;
    case 13: rdAData = gpr13; break;
    case 14: rdAData = gpr14; break;
    case 15: rdAData = gpr15; break;
    case 16: rdAData = gpr16; break;
    case 17: rdAData = gpr17; break;
    case 18: rdAData = gpr18; break;
    case 19: rdAData = gpr19; break;
    case 20: rdAData = gpr20; break;
    case 21: rdAData = gpr21; break;
    case 22: rdAData = gpr22; break;
    case 23: rdAData = gpr23; break;
    case 24: rdAData = gpr24; break;
    case 25: rdAData = gpr25; break;
    case 26: rdAData = gpr26; break;
    case 27: rdAData = gpr27; break;
    default: break;
  }
}

if (rdBOEn) {
  switch (rdBIdx.read()) {
    case 0: rdBData = gpr0; break;
    case 1: rdBData = gpr1; break;
    case 2: rdBData = gpr2; break;
    case 3: rdBData = gpr3; break;
    case 4: rdBData = gpr4; break;
    case 5: rdBData = gpr5; break;
    case 6: rdBData = gpr6; break;
    case 7: rdBData = gpr7; break;
    case 8: rdBData = gpr8; break;
    case 9: rdBData = gpr9; break;
    case 10: rdBData = gpr10; break;
    case 11: rdBData = gpr11; break;
    case 12: rdBData = gpr12; break;
    case 13: rdBData = gpr13; break;
    case 14: rdBData = gpr14; break;
    case 15: rdBData = gpr15; break;
    case 16: rdBData = gpr16; break;
    case 17: rdBData = gpr17; break;
    case 18: rdBData = gpr18; break;
    case 19: rdBData = gpr19; break;
    case 20: rdBData = gpr20; break;
    case 21: rdBData = gpr21; break;
    case 22: rdBData = gpr22; break;
    case 23: rdBData = gpr23; break;
    case 24: rdBData = gpr24; break;
    case 25: rdBData = gpr25; break;
    case 26: rdBData = gpr26; break;
    case 27: rdBData = gpr27; break;
    default: break;
  }
}
\end{verbatim}
```
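
The register-file listing spells out every register as a separate signal and selects among them with `switch` statements. As a hedged sketch of the same read and write behaviour, the plain C++ model below uses an array instead; `RegFileModel` and its method names are hypothetical, the signal-update timing of SystemC is not modelled, and reads are allowed for all 32 indices, whereas the read listing above only shows cases 0 to 27.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical array-based model of the 32 x 32 register file.
struct RegFileModel {
    std::array<std::uint32_t, 32> gpr{};  // gpr[0] .. gpr[31], zero-initialised

    // Writeback: index 31 captures the program counter instead of wbData,
    // matching "case 31: gpr31.write(pc)" in the listing.
    void write(unsigned wbIdx, std::uint32_t wbData, std::uint32_t pc) {
        wbIdx &= 31;  // 5-bit index
        gpr[wbIdx] = (wbIdx == 31) ? pc : wbData;
    }

    // Read port (the A and B ports behave identically in this sketch).
    std::uint32_t read(unsigned rdIdx) const { return gpr[rdIdx & 31]; }
};
```

An array keeps the model short; the listing's one-signal-per-register style is closer to how the hardware registers would be traced in a waveform viewer.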
/*
 * aluDef: Definition of the ALU functions
 * Copyright(c) 2005 by Chul KIM, All rights reserved
 * Author: Chul KIM(ckim@student.ecu.edu.au)
 * File name: aluDef.h
 * Revision history: Version1
 * Date: 2/2/2005
 */

#ifndef _ALU_DEFINE_H_
#define _ALU_DEFINE_H_

// ALU Function Definitions
#define CMD_MOVA 0x0
#define CMD_MOVB 0x1
#define CMD_AND 0x2
#define CMD_OR 0x3
#define CMD_XOR 0x4
#define CMD_NOT 0x5
#define CMD_ADD 0x6
#define CMD_SUB 0x7
#define CMD_CMP 0x8

#endif
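
To show how the command codes above could drive an ALU, here is a minimal plain C++ sketch. It assumes 16-bit operands with a 32-bit intermediate so that a carry or borrow out of bit 15 is visible, and it assumes that the compare command returns 0 on equality. The names `AluCmd`, `AluResult` and `aluModel` are hypothetical and only the numeric codes are taken from aluDef.h.

```cpp
#include <cassert>
#include <cstdint>

// Numeric values follow the CMD_* definitions in aluDef.h.
enum AluCmd : unsigned {
    kMovA = 0x0, kMovB = 0x1, kAnd = 0x2, kOr = 0x3, kXor = 0x4,
    kNot = 0x5, kAdd = 0x6, kSub = 0x7, kCmp = 0x8
};

struct AluResult {
    std::uint16_t value;
    bool negative, zero, carry;
};

// Hypothetical 16-bit ALU model with a 32-bit intermediate result.
AluResult aluModel(unsigned cmd, std::uint16_t a, std::uint16_t b) {
    std::uint32_t wide = 0;
    switch (cmd & 0xF) {
    case kMovA: wide = a; break;
    case kMovB: wide = b; break;
    case kAnd:  wide = a & b; break;
    case kOr:   wide = a | b; break;
    case kXor:  wide = a ^ b; break;
    case kNot:  wide = static_cast<std::uint16_t>(~a); break;
    case kAdd:  wide = static_cast<std::uint32_t>(a) + b; break;
    case kSub:  wide = static_cast<std::uint32_t>(a) - b; break;
    case kCmp:  wide = (a == b) ? 0 : 1; break;  // assumed: 0 means equal
    default:    break;
    }
    AluResult r;
    r.carry = (wide & 0xFFFF0000u) != 0;          // carry or borrow out of bit 15
    r.value = static_cast<std::uint16_t>(wide & 0xFFFF);
    r.zero = (r.value == 0);
    r.negative = (r.value & 0x8000) != 0;
    return r;
}
```

With this convention, a compare followed by a test of the zero flag is enough to implement the equal and not-equal conditions.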

/*
 * aluICS: ALU for ICS(Intelligent Configurable Switch)RISC Core(header file for aluICS)
 * Copyright(c) 2005 by Chul KIM, All rights reserved
 * Author: Chul KIM(ckim@student.ecu.edu.au)
 * File name: aluICS.h
 * Revision history: Version1
 * Date: 2/2/2005
 */

#include "systemc.h"
#include "aluDef.h"

SC_MODULE(aluICS) {
    sc_in<sc_uint<32> > aluAIn;
    sc_in<sc_uint<32> > aluBIn;
    sc_in<sc_uint<4> > aluCtl;
    // sc_in<bool> cIn;  // carry input
    sc_out<sc_uint<32> > aluOut;
    sc_out<sc_uint<4> > condFlag;  // conditional flags
    void do_alu();
    sc_signal<bool> cf, vf, nf, zf;

    SC_CTOR(aluICS) {
        SC_METHOD(do_alu);
        sensitive << aluAIn << aluBIn << aluCtl;
        #ifdef SIM
            aluOut.initialize(0);
            condFlag.initialize(0);
        #endif
    }
};
/*
 * aluICS: ALU for ICS(Intelligent Configurable Switch)RISC Core(source file for aluICS)
 * Copyright(c) 2005 by Chul KIM, All rights reserved
 * Author: Chul KIM(ckim@student.ecu.edu.au)
 * File name: aluICS.cpp
 * Revision history: Version1
 * Date: 2/21/2005
 */

#include "aluICS.h"

// Compare helper: 0 when the operands are equal, 1 otherwise, so that the zero
// flag signals equality. The macro body is not legible in the source listing;
// this form is an assumption.
#define comp(a,b) (((a) == (b)) ? 0 : 1)

// ALU
void aluICS::do_alu() {
    signed short result = 0;
    sc_uint<4> cmd = aluCtl.read();
    sc_uint<4> tmpCond;
    signed short src1 = aluAIn.read();
    signed short src2 = aluBIn.read();
    // ALU Flags
    switch (cmd & 0xF)
    {
    case 0: result = src1; break; //CMD_MOVA
    case 1: result = src2; break; //CMD_MOVB
    case 2: result = src1 & src2; break; //CMD_AND
    case 3: result = src1 | src2; break; //CMD_OR
    case 4: result = src1 ^ src2; break; //CMD_XOR
    case 5: result = ~src1; break; //CMD_NOT
    case 6: result = src1 + src2; break; //CMD_ADD
    case 7: result = src1 - src2; break; //CMD_SUB
    case 8: result = comp(src1,src2); break; //CMD_CMP
    default: break;
    }
    // Conditional Flags
    if (result & 0xFFFF0000) cf.write(1); //carry/Borrow flag
    else cf.write(0);
    if (result & 0xFFFF0000) vf.write(1); //overflow flag
    else vf.write(0);
    result &= 0xFFFF;
    if (result == 0) zf.write(1); //zero flag
    else zf.write(0);
    if (result & 0x8000) nf.write(1); //negative flag
    else nf.write(0);
    aluOut.write(result);
    tmpCond[3]=nf;
    tmpCond[2]=zf;
    tmpCond[1]=cf;
    tmpCond[0]=vf;
    condFlag=tmpCond;
}

#include "systemc.h"

SC_MODULE(mul) {
    sc_in<bool> clock;
    sc_in<bool> reset;
    sc_in<sc_uint<32>> mulAIn;
    sc_in<sc_uint<32>> mulBIn;
    sc_out<sc_uint<32>> mulOut;

    void do_mul();

    SC_CTOR(mul) {
        SC_METHOD(do_mul);
        sensitive << clock << reset << mulAIn << mulBIn;
    }
};

/*
 * MUL: multiplier
 * for ICS(Intelligent Configurable Switch)RISC Core(source file for multiplier)
 * Copyright(c) 2005 by Chul KIM, All rights reserved
 * Author: Chul KIM(ckim@student.ecu.edu.au)
 * File name: mul.cpp
 * Revision history: Version 1
 * Date: 14/3/2005
 */

#include "mul.h"

void mul::do_mul() {
    sc_uint<32> src1, src2, result;
    src1 = mulAIn;
    src2 = mulBIn;

    result = src1 * src2;
    mulOut.write(result);
}
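
Because `mulOut` is only 32 bits wide, the multiplier above keeps the low half of the product and discards the rest. The short plain C++ sketch below makes that truncation explicit; `mul32` and `mulHigh32` are hypothetical helper names, and `mulHigh32` exists only to show the bits that the listing drops.

```cpp
#include <cassert>
#include <cstdint>

// Low 32 bits of the product, which is what a 32-bit output port can carry.
std::uint32_t mul32(std::uint32_t a, std::uint32_t b) {
    return static_cast<std::uint32_t>(static_cast<std::uint64_t>(a) * b);
}

// Upper 32 bits of the same product (not produced by the listing).
std::uint32_t mulHigh32(std::uint32_t a, std::uint32_t b) {
    return static_cast<std::uint32_t>((static_cast<std::uint64_t>(a) * b) >> 32);
}
```

For example, multiplying 0x10000 by 0x10000 gives a low word of zero, so software that needs the full product would have to handle the upper word separately.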

/*
 * Shifter: Shifter for ICS(Intelligent Configurable Switch)RISC Core(source file for Shifter)
 * Copyright(c) 2005 by Chul KIM, All rights reserved
 * Author: Chul KIM(ckim@student.ecu.edu.au)
 * File name: shifter.h
 * Revision history: Version 1
 * Date: 2/2/2005
 */

#include "systemc.h"

SC_MODULE(shifter) {
    sc_in<sc_uint<32> > shiftIn;    // shift input
    sc_in<sc_uint<5> > shiftAmt;    // shift amount
    sc_in<sc_uint<3> > shiftCtl;    // control input
    sc_out<sc_uint<32> > shiftOut;  // shift output

    void do_shift();

    SC_CTOR(shifter) {
        SC_METHOD(do_shift);
        sensitive << shiftIn << shiftAmt << shiftCtl;
    }
};

#include "shifter.h"

void shifter::do_shift() {
    sc_uint<32> w_shiftIn, w_shiftOut;
    sc_uint<5> w_shiftAmt;
    sc_uint<3> w_shiftCtl;

    w_shiftIn = shiftIn;
    w_shiftAmt = shiftAmt;
    w_shiftCtl = shiftCtl;

    switch (w_shiftCtl) {
    case 0:
        w_shiftOut = w_shiftIn << w_shiftAmt; break; // logical shift left
    case 1:
        w_shiftOut = w_shiftIn >> w_shiftAmt; break; // logical shift right
    case 2: // arithmetic shift right: (0 - sign bit) is all ones for a negative input
        w_shiftOut = ((0-w_shiftIn[31]) << (32-w_shiftAmt)) | (w_shiftIn >> w_shiftAmt);
        break;
    case 3:
        w_shiftOut = (w_shiftIn >> w_shiftAmt) | (w_shiftIn << (32-w_shiftAmt)); break;
        // Rotate
    default:
        break;
    }
    shiftOut.write(w_shiftOut);
}
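
The four shift modes above can be checked quickly with a plain C++ model. This is a sketch under stated assumptions: `shift32` is a hypothetical function, the control codes 0 to 3 follow the `case` labels in the listing, undefined control codes return 0 here, and a 64-bit intermediate is used so that shifting by `32 - amt` stays well defined when the amount is zero.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of the shifter: ctl 0 = LSL, 1 = LSR, 2 = ASR, 3 = rotate right.
std::uint32_t shift32(std::uint32_t in, unsigned amt, unsigned ctl) {
    amt &= 31;                      // 5-bit shift amount
    const std::uint64_t wide = in;  // 64-bit copy avoids undefined 32-bit shifts
    switch (ctl & 7) {
    case 0:  // logical shift left
        return static_cast<std::uint32_t>(wide << amt);
    case 1:  // logical shift right
        return in >> amt;
    case 2: {  // arithmetic shift right: fill the vacated bits with the sign bit
        const std::uint64_t fill = (in >> 31) ? ~std::uint64_t{0} : std::uint64_t{0};
        return static_cast<std::uint32_t>((fill << (32 - amt)) | (wide >> amt));
    }
    case 3:  // rotate right
        return static_cast<std::uint32_t>((wide >> amt) | (wide << (32 - amt)));
    default:
        return 0;
    }
}
```

The arithmetic case shows why the fill mask has to be all ones for a negative input: a narrower mask would leave some of the vacated bits clear for larger shift amounts.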

#include "systemc.h"
#include "pc.h"
#include "arh"
SC_MODULE(datapath) {
    sc_in<bool> clock;
    sc_in<bool> reset;
    sc_in<sc_uint<32> > imm;         //Immediate Data
    sc_in<bool> cmpFlag;             //Compare Flag
    sc_in<sc_uint<4> > shiftCtl;     //Shifter Control Signal
    sc_in<bool> shiftOEn;            //Shifter Output Enable
    sc_in<bool> mulOEn;              //Multiplier Output Enable
    sc_in<sc_uint<5> > opAIdx;       //Operand A Index
    sc_in<bool> rdAOEn;              //Read A Output Enable
    sc_in<bool> rdBOEn;              //Read B Output Enable
    sc_in<bool> wbEn;                //Writeback Enable
    sc_in<bool> immOEn;              //Immediate Output Enable
    sc_in<bool> IAregCtl;            //Instruction Address Register Control
    sc_in<bool> dAregCtl;            //Data Address Register Control
    sc_in<sc_uint<32> > dIn;         //Data Input
    sc_out<sc_uint<32> > IAddr;      //Instruction Address
    sc_out<sc_uint<32> > dOut;       //Data Output

    // Internal signals. The port and signal widths are inferred from the
    // comments and from the bit ranges used in do_sigDiv; the source listing
    // is not legible here.
    sc_signal<sc_uint<32> > busA, busB, busW;
    sc_signal<sc_uint<32> > s_aluOut;    //ALU Output Signal
    sc_signal<sc_uint<4> > s_condFlag;   //Conditional Flag
    sc_signal<sc_uint<32> > s_shiftOut;  //Shifter Output Signal
    sc_signal<sc_uint<32> > s_mulOut;    //Multiplier Output Signal
    sc_signal<sc_uint<4> > tmpbusA;      //Temp Signals for SR
    sc_signal<sc_uint<4> > tmpbusW;      //Temp Signals for SR
    sc_signal<sc_uint<5> > shiftAmt;     //Signal for Shifter Amount Control
    sc_signal<sc_uint<32> > lIAddr;      //Instruction Address from LF
    sc_signal<sc_uint<32> > pIAddr;      //Instruction Address from PC

    void do_outCtl();    //Function for Output Control
    void do_InOutReg();  //Function for Data In/Out Register
    void do_sigDiv() {   //Function for Gen. Signals for Status Register
        sc_uint<32> busAT, busWT, busBT;
        busAT = busA;
        busWT = busW;
        busBT = busB;
        tmpbusA = busAT.range(31,28);
        tmpbusW = busWT.range(31,28);
        shiftAmt = busBT.range(15,11);  //Signal for Shift Amount Control
    }
};


#include "datapath.h"

void datapath::do_outCtl() {
  if (immOEn) {
    busB = imm;
  }
  if (aluOEn) {
    busW = s_aluOut;
  }
  if (shiftOEn) {
    busW = s_shiftOut;
  }
  if (mulOEn) {
    busW = s_mulOut;
  }
  if (lbEn) { // Loop Buffer Addressing
    IAddr = lIAddr;
    // Instruction Address from LB
  } else {
    IAddr = pIAddr;
    // Instruction Address from PC
  }
}
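
The output-control function above selects which functional unit drives the write bus and where the next instruction address comes from. A hedged plain C++ sketch of that selection follows; `WriteBusInputs`, `selectBusW`, `singleDriver` and `selectIAddr` are hypothetical names, and the "at most one enable" check is an assumption about what the control unit is expected to guarantee rather than something stated in the listing.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical bundle of the three functional-unit results and their enables.
struct WriteBusInputs {
    bool aluOEn, shiftOEn, mulOEn;
    std::uint32_t aluOut, shiftOut, mulOut;
};

// Mirrors the if-sequence in do_outCtl: a later unit overwrites an earlier one,
// and the bus keeps its previous value when no enable is set.
std::uint32_t selectBusW(const WriteBusInputs& in, std::uint32_t previous) {
    std::uint32_t busW = previous;
    if (in.aluOEn)   busW = in.aluOut;
    if (in.shiftOEn) busW = in.shiftOut;
    if (in.mulOEn)   busW = in.mulOut;
    return busW;
}

// True when at most one unit drives the bus.
bool singleDriver(const WriteBusInputs& in) {
    return (int(in.aluOEn) + int(in.shiftOEn) + int(in.mulOEn)) <= 1;
}

// Instruction address source: loop buffer when lbEn is set, otherwise the PC.
std::uint32_t selectIAddr(bool lbEn, std::uint32_t lIAddr, std::uint32_t pIAddr) {
    return lbEn ? lIAddr : pIAddr;
}
```

A check like `singleDriver` could be used as a simulation-time assertion to catch control-signal combinations that would make the result depend on statement order.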


```cpp
void datapath::do_InOutReg()
{
  if (dInCtl) {
    busW = dIn;
  }
  if (dOutCtl) {
    dOut = busB;
  }
}
```

3.2 Control Architecture

/*
 * Def: Macros for ICS_RISC
 */

#define INST_ALUIS 0 //ALU Imm. Short(1 Inst. word)
#define INST_ALUL 1 //ALU Imm. Long(2 Inst. word)
#define INST_ALUR 2 //ALU Register
#define INST_ALULB 3 //ALU Loop Buffer Addressing
#define INST_SHIRO 4 //Shift/Rotate
#define INST_LOAD 5 //Load
#define INST_STORE 6 //Store
#define INST_BRANCH 7 //Branch
#define INST_PECON 8 //PE Control
#define INST_DMA 9 //DMA Control
#define INST_MUL 10 //Multiply

#define COND_EQ 0 //Equal
#define COND_NE 1 //Not Equal
#define COND_AL 2 //Always
#define COND_NV 3 //Never
#define OP_MOVA 0
#define OP_MOVB 1
#define OP_AND 2
#define OP_OR 3
#define OP_XOR 4
#define OP_NOT 5
#define OP_ADD 6
#define OP_SUB 7
#define OP_CMP 8
#define OP_MSR 9
#define OP_MRS 10
#define SII_LSL 0
#define SII_LSR 1
#define SII_ASR 2
#define SII_ROT 3


#include "systemc.h"

SC_MODULE(fetch) {
    sc_in<bool> clock;
    sc_in<sc_uint<32> > din;
    sc_out<sc_uint<32> > flush;
    void do_fetch();

    SC_CTOR(fetch) {
    #ifdef SIM
        // Simulation initialization
    #endif
    }
};

/*
* Fetch: Fetch Unit for ICS_RISC(Source file for fetch)
* Copyright(c) 2005 by Chul KIM, All rights reserved
* Author: Chul KIM(ckim@student.ecu.edu.au)
* File name: fetch.cpp
* Revision history: Version1
* Date: 1/5/2005
*/

#include "fetch.h"

void fetch::do_fetch() {
    ifm = din.read();
}

/**
* Decode: Instruction Decoder Unit for ICS_RISC(Source file for decode)
* Copyright(c) 2005 by Chul KIM, All rights reserved
* Author: Chul KIM(ckim@student.ecu.edu.au)
* File name: decode.h
* Revision history: Version1
* Date: 1/5/2005
*/

#include "systemc.h"
#include "def.h"

SC_MODULE(decode) {
    sc_in<bool> clock;
    sc_in<bool> reset;
    sc_in<sc_uint<32>> flush;
    sc_out<bool> refill;
    sc_out<sc_uint<4>> instId;
    sc_out<sc_uint<3>> cond;
    sc_out<sc_uint<4>> opcode;
    sc_out<sc_uint<3>> shift;
    sc_out<sc_uint<5>> rs1Idx;
    sc_out<sc_uint<5>> rs2Idx;
    sc_out<sc_uint<5>> rdIdx;
    sc_out<sc_uint<32>> imm;
    sc_out<bool> immFlag;
    sc_out<bool> cmpFlag;
    sc_out<bool> branchFlag;
    sc_out<bool> callFlag;
    sc_out<bool> srOEn;
    sc_out<bool> srWbEn;
    //Extended Output
}


/*
 * File name: execute.h
 * Revision history: Version 1
 * Date: 2/5/2005
 */

#include "systemc.h"
#include "def.h"

//Operand A Control
#define ARD 0 //Operand A: Rd
#define ARS1 1 //Operand A: Rs1
#define APC 2 //Operand A: PC

//Operand B Control
#define BIM 0 //Operand B: Immediate
#define BRS2 1 //Operand B: Rs2/ShiftAmt
#define BRD 2 //Operand B: Rd

SC_MODULE(execute) {
    sc_in<bool> clock;
    sc_in<sc_uint<4>> instId; //Instruction ID
    sc_in<sc_uint<3>> cond; //Condition
    sc_in<sc_uint<4>> opcode; //Op code
    sc_in<sc_uint<3>> shift; //Shift Type
    sc_in<sc_uint<5>> rs1Idx; //Rs1 Index
    sc_in<sc_uint<5>> rs2Idx; //Rs2/Shift Index
    sc_in<sc_uint<5>> rdIdx; //Rd Index
    sc_in<sc_uint<32>> imm; //Immediate data
    sc_in<bool> immFlag; //Immediate Operand Flag
    sc_in<bool> cmpFlag; //Compare Flag (update SR, No writeback)
    sc_in<bool> srOEn; //Status Register Output Enable
    sc_in<bool> srWbEn; //Status Register Writeback Enable

    //Output Signals
    sc_out<sc_uint<4>> aluCtl; //ALU Control
    sc_out<bool> aluOEn; //ALU Output Enable
    sc_out<sc_uint<3>> shiftCtl; //Shifter Control
    sc_out<bool> shiftOEn; //Shifter Output Enable
    sc_out<bool> mulOEn; //Multiplier Output Enable
    sc_out<sc_uint<5>> opAIdx; //Operand A Index
    sc_out<sc_uint<5>> opBIdx; //Operand B Index/Shift Amount
    sc_out<bool> rdAOEn; //Read A Output Enable
    sc_out<bool> rdBOEn; //Read B Output Enable
    sc_out<sc_uint<5>> wbIdx; //Writeback Index
    sc_out<bool> wbEn; //Writeback Enable
    sc_out<bool> immOEn; //Immediate Output Enable
    sc_out<bool> IAregCtl; //Instruction Address Register Control
    sc_out<bool> dAregCtl; //Data Address Register Control
    sc_out<bool> dInCtl; //Data Input Control
    sc_out<bool> dOutCtl; //Data Output Control

    // sc_out<sc_uint<5>> shiftAmt; //Shift Amount

    void do_ctlSigGen(); //Function for Control Signal Generate
    void do_opSel(); //Function for Select Input Operand A,B
    void do_aluCtl(); //Function for Arrange AluCtl Signals
    void do_shiftCtl(); //Function for Arrange ShiftCtl Signals

    sc_uint<4> opcodeTmp;
    sc_uint<2> opA, opB;

    SC_CTOR(execute) {
        SC_METHOD(do_ctlSigGen);
        sensitive << clock.pos() << instId;
        SC_METHOD(do_opSel);
        sensitive << clock.pos() << rs1Idx << rs2Idx;
        SC_METHOD(do_aluCtl);
        sensitive << clock.pos() << opcode;
        SC_METHOD(do_shiftCtl);
        sensitive << clock.pos() << shift;
    }
};
#include "execute.h"

void execute::do_ctlSigGen() {
    sc_uint<4> topcodeTmp;
    sc_uint<2> topA, topB;
    bool taluOEn, tshiftOEn, tmulOEn, twbEn, tIAregCtl, tdAregCtl, tdInCtl, tdOutCtl;
    bool wbEnTmp;
    sc_uint<4> instIdTmp;
    instIdTmp = instId.read();

    if (instIdTmp == 0) { //INST_ALUIS
        topcodeTmp = opcode.read();
        taluOEn = 1;
        tshiftOEn = 0;
        tmulOEn = 0;
        topA = ARD;
        topB = BIM;
        twbEn = 1;
        tIAregCtl = 0;
        tdAregCtl = 0;
        tdInCtl = 0;
        tdOutCtl = 0;
    } else if (instIdTmp == 1) { //INST_ALUL
        topcodeTmp = opcode.read();
        taluOEn = 1;
        tshiftOEn = 0;
        tmulOEn = 0;
        topA = ARD;
        topB = BIM;
        twbEn = 1;
        tIAregCtl = 0;
        tdAregCtl = 0;
        tdInCtl = 0;
        tdOutCtl = 0;
    } else if (instIdTmp == 2) { //INST_ALUR
        topcodeTmp = opcode.read();
        taluOEn = 1;
        tshiftOEn = 0;
        tmulOEn = 0;
        topA = ARS1;
        topB = BRS2;
    }

```c
if (instIdTmp == 3) {  //INST_ALULB
    topcodeTmp = opcode.read();
    taluOEn = 1;
    tshiftOEn = 0;
    tmulOEn = 0;
    topA = ARS1;
    topB = BRS2;
    twbEn = 1;
    tIAregCtl = 0;
    tdAregCtl = 0;
    tdInCtl = 0;
    tdOutCtl = 0;
} else if (instIdTmp == 4) {  //INST_SHIRO
    topcodeTmp = 0;
    taluOEn = 0;
    tshiftOEn = 1;
    tmulOEn = 0;
    topA = ARS1;
    topB = BRS2;
    tshiftAmt = BRS2;  //BRS2 = ShiftAmt
    twbEn = 1;
    tIAregCtl = 0;
    tdAregCtl = 0;
    tdInCtl = 0;
    tdOutCtl = 0;
} else if (instIdTmp == 5) {  //INST_LOAD
    topcodeTmp = OP_MOVA;
    taluOEn = 0;
    tshiftOEn = 0;
    tmulOEn = 0;
    topA = ARS1;
    topB = BRD;
    twbEn = 1;
    tIAregCtl = 0;
    tdAregCtl = 0;
    tdInCtl = 1;
    tdOutCtl = 0;
} else if (instIdTmp == 6) {  //INST_STORE
    topcodeTmp = OP_MOVA;
    taluOEn = 0;
    tshiftOEn = 0;
    tmulOEn = 0;
    topA = ARS1;
    topB = BRD;
    twbEn = 1;
    tIAregCtl = 0;
    tdAregCtl = 0;
    tdInCtl = 1;
    tdOutCtl = 1;
} else if (instIdTmp == 7) {  //INST_BRANCH
    topcodeTmp = OP_ADD;
    taluOEn = 1;
    tshiftOEn = 0;
    tmulOEn = 0;
    topA = APC;
    topB = BIM;
    twbEn = 0;
    tIAregCtl = 0;
    tdAregCtl = 0;
    tdInCtl = 1;
    tdOutCtl = 0;
} else if (instIdTmp == 10) {  //INST_MUL
    topcodeTmp = 0;
    taluOEn = 0;
    tshiftOEn = 0;
}
```

```c
void execute::do_opSel() {
    if (opA == 0) {            //ARD
        opAIdx = rdIdx;
    } else if (opA == 1) {     //ARS1
        opAIdx = rs1Idx;
    } else if (opA == 2) {     //APC
        opAIdx = 15;
    }

    if (opB == 0) {            //BIM
        opBIdx = 3;
    } else if (opB == 2) {     //BRD
        opBIdx = rdIdx;
    } else if (opB == 1) {     //BRS2
        opBIdx = rs2Idx;
    }
    wbIdx = rdIdx;
}
```

void execute::do_aluCtl() {
opcodeTmp = opcode.read();
    switch(opcodeTmp) {
        case (OP_MOVA):           aluCtl = 0;  break;
        case (OP_MOVB):           aluCtl = 1;  break;
        case (OP_AND):            aluCtl = 2;  break;
        case (OP_OR):             aluCtl = 3;  break;
        case (OP_XOR):            aluCtl = 4;  break;
        case (OP_NOT):            aluCtl = 5;  break;
        case (OP_ADD):            aluCtl = 6;  break;
        case (OP_SUB):            aluCtl = 7;  break;
        case (OP_CMP):            aluCtl = 8;  break;
        case (OP_MSR):            aluCtl = 9;  break;
        case (OP_MRS):            aluCtl = 10; break;
        default:                  aluCtl = 11; break;
    }
}

void execute::do_shiftCtl() {
    sc_uint<3> shiftTmp;
    shiftTmp = shift.read();

    switch(shiftTmp) {
        case SII_LSL:           shiftCtl = 0;  break;
        case SII_LSR:           shiftCtl = 1;  break;
        case SII_ASR:           shiftCtl = 2;  break;
        case SII_ROT:           shiftCtl = 3;  break;
        default:                shiftCtl = 4;  break;
    }
}

/*
 * Control: Control Arch for ICS_RISC (header file for control)
 * Copyright(c) 2005 by Chul KIM, All rights reserved
 * Author: Chul KIM (ckim@student.ecu.edu.au)
 * File name: control.h
 * Revision history: Version 1
 * Date: 5/5/2005
 */
#include "systemc.h"
#include "def.h"
#include "fetch.h"
#include "decode.h"
#include "execute.h"
#include "debug.h"

SC_MODULE(control) {
    sc_in<bool> clock;  // Clock
    sc_in<bool> reset;  // Reset
    sc_in<sc_uint<32> > dIn;  // Data Input
    sc_in<bool> zFlag;  // Zero Flag
    sc_out<sc_uint<32> > imm;  // Immediate Data
    sc_out<bool> cmpFlag;  // Compare Flag
    sc_out<bool> srOEn;  // SR Output Enable
    sc_out<bool> srWbEn;  // SR Writeback Enable
    sc_out<sc_uint<4> > aluCtl;  // ALU Control

sc_signal< bool > dRdBOEn; // Read B Output Enable
dWbIdx; // Writeback Index
dWbEn; // Writeback Enable
dImmOEn; // Immediate Output Enable
dSrcArgClk; // Source Address Register Control
dDArgClk; // Data Address Register Control
dDInClk; // Data Input Control
dDOOutClk; // Data Output Control
dShiftAmt; // Shift Amount

// Pipeline Registers
instIdText; // Instruction ID Debug Information
aluText; // ALU Debug Information
// Pipeline Registers

sc_uint<6>
sc_uint<5>
bool
branchFlag; // Branch Flag
eExitFlag; // Exit Flag

// ALU Flags
srOEn; // SR Output Enable
srWbEn; // SR Writeback Enable
aluClk; // ALU Control
aluOEn; // ALU Output Enable
shiftClk; // Shift Output Control
shiftOEn; // Shift Output Enable
mulOEn; // Multiplier Output Enable
opAIdx; // Operand A Index
opBIdx; // Operand B Index
rdAOEn; // Read A Output Enable
rdBOEn; // Read B Output Enable
wblf; // Writeback Index

// Instruction ID
instId; // Instruction ID
cond; // Execution Condition
cmpFlag; // Compare Flag
eBranchFlag; // Branch Flag
eExitFlag; // Exit Flag
eBranchFlag; // Branch Flag
eExitFlag; // Exit Flag

// PE Flags
aluCU; // ALU Control
aluOEn; // ALU Output Enable
shinCU; // Shifter Control
shiftOEn; // Shifter Output Enable
mulOEn; // Multiplier Output Enable
opAIdx; // Operand A Index
opBIdx; // Operand B Index
rdAOEn; // Read A Output Enable
rdBOEn; // Read B Output Enable
wblf; // Writeback Index

// Immediate Data
imm; // Immediate Data
immOEn; // Immediate Output Enable
cArgClk; // Instruction Address Register Control
dArgClk; // Data Address Register Control
dInClk; // Data Input Control
dOutClk; // Data Output Control

void do_pipeReg();
void do_condExe();
fetch* ifetch;
decode* idecode;
execute* isexecute;
dbg* idbgs;

SC_CTOR(control) {
ifetch = new fetch("fetch");
ifetch->clock(clock);
ifetch->din(dIn);
ifetch->reset(reset);
// (the remaining port bindings are not legible in the source listing)
    // SC_METHOD(do_pipeReg);
    // sensitive << clock.pos() << reset;
    // SC_METHOD(do_condExe);
    // sensitive << clock.pos();

#include "control.h"

void control::do_pipeReg() {
    if (reset) {
        cond = 0;
        eSrWbEn = 0;
        eWbEn = 0;
        eIAregCtl = 0;
        branchFlag = 0;
        eExitFlag = 0;
    } else {
        instId = dInstId;
        cond = dCond;
        cmpFlag = dCmpFlag;
        branchFlag = dBranchFlag;
        eExitFlag = dExitFlag;
        srOEn = dSrOEn;
        eSrWbEn = dSrWbEn;
        aluCtl = dAluCtl;
        aluOEn = dAluOEn;
        shiftCtl = dShiftCtl;
        shiftOEn = dShiftOEn;
        mulOEn = dMulOEn;
        opAIdx = dOpAIdx;
        opBIdx = dOpBIdx;
        rdAOEn = dRdAOEn;
        rdBOEn = dRdBOEn;
        wbIdx = dWbIdx;
        eWbEn = dWbEn;
        immOEn = dImmOEn;
        imm = dImm;
        eIAregCtl = dIAregCtl;
        dAregCtl = dDAregCtl;
        dInCtl = dDInCtl;
        eDOutCtl = dDOutCtl;
        lbEn = dlbEn;
        lbRWEn = dlbRWEn;
        PEUp = dPEUp;
        PEOpnodece = dPEOpnodece;
        PEConfig = dPEConfig;
        PESel = dPESel;
        DMAOp = dDMAOp;
        DFBSel = dDFBSel;
        dataAmt = dDataAmt;
        startAddrDFB = dStartAddrDFB;
        SRAMRegSel = dSRAMRegSel;
        startAddrSRAMReg = dStartAddrSRAMReg;
        memSel = dMemSel;
        startAddrProgDaMem = dStartAddrProgDaMem;
    }
}

void control::do_condExe() {
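  bool execFlag = 0; //Execute Flag
  bool exitFlag;

  // Comparisons use "==" and each condition is parenthesised explicitly.
  if ((cond==COND_AL) || ((cond==COND_EQ) && (zFlag==1)) ||
      ((cond==COND_NE) && (zFlag==0))) {
    execFlag = 1;
  }

  flush = branchFlag & execFlag;

  // Side effects are committed only when the instruction executes and the
  // pipeline is not refilling (the e-prefixed register names are inferred
  // from do_pipeReg).
  wbEn = (execFlag && !refill) ? eWbEn : 0;
  srWbEn = (execFlag && !refill) ? eSrWbEn : 0;
  IAregCtl = (execFlag && !refill) ? eIAregCtl : 0;
  dOutCtl = (execFlag && !refill) ? eDOutCtl : 0;
  exitFlag = (execFlag && !refill) ? eExitFlag : 0;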
}
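
The condition check and output gating in `do_condExe` can be illustrated outside SystemC. The following is a minimal plain C++ sketch, not the thesis code: the numeric encodings of `COND_AL`, `COND_EQ` and `COND_NE` and the helper names `condPassed` and `gate` are assumptions made for illustration only.

```cpp
#include <cstdint>

// Hypothetical encodings; the actual values are defined in the thesis sources.
enum Cond : std::uint8_t { COND_AL = 0, COND_EQ = 1, COND_NE = 2 };

// True when an instruction with condition code `cond` should execute,
// given the current zero flag.
inline bool condPassed(Cond cond, bool zFlag) {
    return cond == COND_AL ||
           (cond == COND_EQ && zFlag) ||
           (cond == COND_NE && !zFlag);
}

// Pass a control value through only when the instruction executes and the
// pipeline is not refilling; otherwise force it to zero.
inline std::uint32_t gate(std::uint32_t ctl, bool execFlag, bool refill) {
    return (execFlag && !refill) ? ctl : 0u;
}
```

Under these assumptions, a write-back index would be gated as `gate(wrIdx, condPassed(cond, zFlag), refill)`, so a squashed instruction never writes a register.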

```cpp
#include "systemc.h"

SC_MODULE(debug) {
    sc_in<sc_uint<4>> instId;   // instruction ID
    sc_in<sc_uint<4>> aluCtl;   // ALU control code
    sc_uint<4> instIdText;
    sc_uint<4> aluText;

    void do_debug();

    SC_CTOR(debug) {
        SC_METHOD(do_debug);
        sensitive << instId << aluCtl;
    }
};
```

/* Debug: Debug Information for ICS_RISC (source file for debug) *
* Copyright(c) 2005 by Chul KIM, All rights reserved *
* Author: Chul KIM(kkim@student.ecu.edu.au) *
* File name: debug.cpp *
* Revision history: Version1 *
* Date: 5/3/2005 *
*/

#include "debug.h"

#define ALUIS  0
#define ALUL   1
#define ALUR   2
#define ALULB  3
#define SHIRO  4   // value assumed from the gap between ALULB (3) and LOAD (5)
#define LOAD   5
#define STORE  6
#define BRANCH 7
#define MUL    8
#define PECON  9
#define DMA    10

void debug::do_debug() {
    sc_uint<4> instIdTmp;
    sc_uint<4> aluCtlTmp;
    instIdTmp = instId.read();
    aluCtlTmp = aluCtl.read();

    switch (instIdTmp) {
        case INST_ALUIS : instIdText = ALUIS; printf("ALUIS \n"); break;
        case INST_ALUL : instIdText = ALUL; printf("ALUL \n"); break;
        case INST_ALUR : instIdText = ALUR; printf("ALUR \n"); break;
        case INST_ALULB : instIdText = ALULB; printf("ALULB \n"); break;
        case INST_SHIRO : instIdText = SHIRO; printf("SHIRO \n"); break;
        case INST_LOAD : instIdText = LOAD; printf("LOAD \n"); break;
        case INST_STORE : instIdText = STORE; printf("STORE \n"); break;
        case INST_BRANCH: instIdText = BRANCH; printf("BRANCH \n"); break;
        case INST_MUL : instIdText = MUL; printf("MUL \n"); break;
        case INST_PECON : instIdText = PECON; printf("PECON \n"); break;
        case INST_DMA : instIdText = DMA; printf("DMA \n"); break;
        default: printf("Not Defined Instruction \n");
    }
}
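
The `switch` in `do_debug` maps an instruction ID to a mnemonic. A table lookup expresses the same mapping more compactly. The sketch below is plain C++ rather than SystemC; the function name `instName` is hypothetical, and the ordering (ALUIS = 0 through DMA = 10, with SHIRO taken to be 4) is an assumption based on the constants in `debug.cpp`, not a confirmed encoding of the `INST_*` values.

```cpp
#include <cstddef>
#include <string>

// Return the mnemonic for an instruction ID, or the fallback text used by
// the default branch when the ID is out of range.
inline std::string instName(unsigned id) {
    static const char* const kNames[] = {
        "ALUIS", "ALUL", "ALUR", "ALULB", "SHIRO", "LOAD",
        "STORE", "BRANCH", "MUL", "PECON", "DMA"
    };
    const std::size_t count = sizeof(kNames) / sizeof(kNames[0]);
    return id < count ? kNames[id] : "Not Defined Instruction";
}
```

With this ordering, `instName(5)` yields `"LOAD"`, and any ID above 10 yields the fallback text.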

/* BusCtl: I/O Bus Control for ICS_RISC (header file for busCtl) *
* Copyright(c) 2005 by Chul KIM, All rights reserved *
* Author: Chul KIM(kkim@student.ecu.edu.au) *
* File name: busCtl.h */

#include "systemc.h"

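SC_MODULE(busCtl) {
    sc_in<bool> nRW;               // high: drive the bus, low: release it
    sc_in<sc_uint<32>> dataOut;    // value to drive onto the bus
    sc_out<sc_uint<32>> dataIn;    // value sampled from the bus
    sc_inout_rv<32> data;          // resolved tri-state data bus
    void do_busCtl();
    SC_CTOR(busCtl) {
        SC_METHOD(do_busCtl);
        // data is included so that dataIn follows values driven externally
        sensitive << nRW << dataOut << data;
    }
};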

void busCtl::do_busCtl() {
    dataIn = sc_uint<32>(data);      // always sample the bus
    if (nRW) {
        data = sc_lv<32>(dataOut);   // drive the bus
    } else {
        // Release the bus: one 'Z' per bit of the 32-bit bus
        data = "ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ";
    }
}
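
The tri-state behaviour of `do_busCtl` can be modelled without SystemC's resolved logic types. The following plain C++ sketch is illustrative only: the `BusDrive` struct and the functions `busCtlDrive` and `busValue` are hypothetical names, and the model assumes exactly one driver is active on the bus at any time.

```cpp
#include <cstdint>

// State of one driver on a shared bus: either driving a value or released
// (high impedance, the "Z" state).
struct BusDrive {
    bool driving;
    std::uint32_t value;
};

// Drive dataOut onto the bus when nRW is high; otherwise release the bus.
inline BusDrive busCtlDrive(bool nRW, std::uint32_t dataOut) {
    if (nRW) return BusDrive{true, dataOut};
    return BusDrive{false, 0u};
}

// Value seen on the bus when this driver shares it with one external driver.
// Assumes exactly one of the two is driving.
inline std::uint32_t busValue(const BusDrive& self, const BusDrive& external) {
    return self.driving ? self.value : external.value;
}
```

In this model, a released driver never contributes a value, which mirrors the all-`Z` assignment in the `else` branch of `do_busCtl`.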

/*
ICS_RISC: Top module for ICS_RISC (header file for ICS_RISC)
* Copyright(c) 2005 by Chul KIM, All rights reserved
* Author: Chul KIM(ckim@student.ecu.edu.au)
* File name: ICS_RISC.h
* Revision history: Version1
* Date: 5/5/2005
*/

#include "systemc.h"
#include "datapath.h"
#include "control.h"
#include "busCtl.h"

SC_MODULE(ICS_RISC) {  
    sc_in<bool> clock;  
    sc_in<bool> reset;  
    sc_in<sc_uint<32>> iData;  //Instruction Data  
    sc_out<sc_uint<32>> nRW;  
    sc_out<sc_uint<32>> iAddr;  //Instruction Address  
    sc_out<sc_uint<32>> dAddr;  //Data Address  
    sc_out<sc_uint<32>> dData;  //Data Bus  
    sc_out<sc_uint<32>> PEOp;  //PE Execution Operation  
}


    sc_signal<sc_uint<32>> imm;
    sc_signal<bool> cmpFlag;
    sc_signal<bool> srOEn;
    sc_signal<bool> srWEEn;
    sc_signal<sc_uint<4>> aluCtl;
    sc_signal<sc_uint<4>> shiftCtl;
    sc_signal<sc_uint<6>> shiftOEn;
    sc_signal<sc_uint<6>> mulOEn;
    sc_signal<sc_uint<5>> opAIdx;
    sc_signal<sc_uint<5>> opBIdx;
    sc_signal<bool> rdAOEn;
    sc_signal<bool> rdBOEn;
    sc_signal<sc_uint<5>> wbIdx;
    sc_signal<sc_uint<6>> wbEn;
    sc_signal<sc_uint<5>> immOEn;
    sc_signal<sc_uint<6>> lARegCtl;
    sc_signal<sc_uint<6>> dARegCtl;
    sc_signal<sc_uint<6>> dINCtl;
    sc_signal<sc_uint<6>> dOutCtl;
    sc_signal<sc_uint<5>> din;
    sc_signal<sc_uint<5>> dOut;
    sc_signal<bool> zFlag;
    sc_signal<bool> lbEn;
    sc_signal<bool> lbRWEn;
    sc_signal<bool> tnRW;

void do_OutCtl();

busCtl* ibusCtl;
control* icmpCtl;
datapath* idatapath;

SC_CTOR(ICS_RISC) {
    // Bus controller instance and port bindings
    ibusCtl = new busCtl("busCtl");
    ibusCtl->rW(bRW);
    ibusCtl->data(xData);

    // Control unit instance and port bindings
    icmpCtl = new control("control");
    icmpCtl->rW(cmpFlag);
    icmpCtl->sWEn(sWEn);
    icmpCtl->aluCtl(aluCtl);
    icmpCtl->shiftCtl(shiftCtl);
    icmpCtl->aluOEn(aluOEn);
    icmpCtl->shiftOEn(shiftOEn);
    icmpCtl->opAIdx(opAIdx);
    icmpCtl->opBIdx(opBIdx);
    icmpCtl->rdAOEn(rdAOEn);
    icmpCtl->rdBOEn(rdBOEn);
    icmpCtl->wbEn(wbEn);
    icmpCtl->lARegCtl(lARegCtl);
    icmpCtl->dARegCtl(dARegCtl);
    icmpCtl->dINCtl(dINCtl);
    icmpCtl->dOutCtl(dOutCtl);
    icmpCtl->lREn(lREn);
    icmpCtl->lWEn(lWEn);
    icmpCtl->PEOp(PEOp);
    icmpCtl->PEConfig(PEConfig);
    icmpCtl->iWEn(iWEn);
    icmpCtl->iREn(iREn);
    icmpCtl->rWEn(rWEn);
}
// ICS_RISC: Top module for ICS_RISC (source file for ICS_RISC)
// Copyright (c) 2005 by Chul Kim, All rights reserved
// Author: Chul Kim (kim@student.ecu.edu.au)
// File name: ICS_RISC.cpp
// Revision history: Version 1
// Date: 5/5/2005

#include "ICS_RISC.h"

// Forward the data-output control signal to the external nRW pin.
void ICS_RISC::do_OutCtl() {
    tnRW = dOutCtl;
    nRW = tnRW;
}