# Development of Heterogeneous Multi-core Processor "Hy-DiSC" with Dynamic Reconfigurable Processor

Takuro UCHIDA<sup>1</sup>, Yasuhiro NISHINAGA<sup>1</sup>, Tetsuya ZUYAMA<sup>1</sup>, Kazuya TANIGAWA<sup>1</sup>, and Tetsuo HIRONAKA<sup>1</sup>

<sup>1</sup>Graduate School of Information Sciences, Hiroshima City University, 3-4-1, Ozuka-higashi, Asaminami-ku, Hiroshima, 731-3194, Japan

E-mail: reconf08@csys.ce.hiroshima-cu.ac.jp

Abstract: To accelerate multimedia applications by streaming, we have proposed the heterogeneous multicore processor Hy-DiSC, which includes the dynamic reconfigurable processor DS-HIE. The goal of the Hy-DiSC processor is to achieve high performance in a small chip area. In this paper, we evaluate the performance improvement obtained by adopting the DS-HIE processor in the Hy-DiSC processor. The application programs selected for the performance evaluation were a JPEG encoding process and the 2-D DCT included in it; the DS-HIE processor accelerates the 2-D DCT. As a result, compared with the MeP processor, the DS-HIE processor achieved 28 times higher performance on the execution of the 2-D DCT. In addition, the DS-HIE processor requires relatively few transistors to implement.

## 1. INTRODUCTION

In recent years, the use of embedded systems has expanded to various fields, and such systems can now play large movies encoded with complex codecs in real time. However, these embedded systems require a large chip area and a complex hardware design, so the time to market becomes longer and the design cost becomes higher. To mitigate this problem, many researchers and engineers have become interested in dynamic reconfigurable processors. In particular, to address the problem with a small chip area and high performance, we have proposed the heterogeneous multicore processor Hy-DiSC with the dynamic reconfigurable processor DS-HIE [1]. The target of the Hy-DiSC processor is to achieve high performance on applications that can be accelerated by streaming, in the field of embedded systems. Therefore, the Hy-DiSC processor is required to achieve both high performance and a small chip area. To meet these requirements, the DS-HIE processor, which plays the role of an accelerator in the Hy-DiSC processor, adopts bit serial computation and uses a Beneš network as its routing resource. With bit serial computation, the DS-HIE processor achieves a compact chip area with high computation throughput. With the Beneš network, it achieves routing paths with high availability in a compact chip area.

In this paper, we evaluate the performance of the Hy-DiSC processor on JPEG encoding and then report the transistor count of the DS-HIE processor.

## 2. Hy-DiSC PROCESSOR

In this section, we give an overview of the Hy-DiSC processor.



Figure 1. Block diagram of the Hy-DiSC processor

### 2.1. Overview of the Hy-DiSC processor

Fig.1 shows the block diagram of the Hy-DiSC processor. The Hy-DiSC processor consists of a RISC processor and the DS-HIE processor. The RISC processor works as a control processor, and the DS-HIE processor works as an accelerator that executes operations in a streaming manner. To reduce the necessary hardware area while preserving performance, the DS-HIE processor follows two design policies:

- Policy 1: map as many operations as possible into a small reconfigurable area, and
- Policy 2: keep the utilization of the reconfigurable area high.

To satisfy Policy 1, the DS-HIE processor adopts bit serial computation. Bit serial computation takes several cycles to process one data word, but the chip area needed to implement its operation unit is significantly smaller than that of a traditional bit-parallel operation unit. The disadvantage of bit serial computation is its larger latency. To mitigate this disadvantage, the DS-HIE processor executes the mapped operations in a streaming manner.
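The trade-off described above can be illustrated with a small software model. The following is a minimal sketch, not the DS-HIE circuit itself: the function name `bit_serial_add` and the LSB-first bit order are assumptions made for illustration. It shows that a single full adder plus one carry bit is enough to add two 16-bit words when one bit is processed per cycle.

```python
from typing import Tuple


def bit_serial_add(a: int, b: int, width: int = 16) -> Tuple[int, int]:
    """Add two unsigned `width`-bit integers one bit per cycle, LSB first.

    Returns (sum modulo 2**width, number of cycles used).
    Hypothetical model for illustration; not taken from the DS-HIE design.
    """
    carry = 0   # models the single carry flip-flop kept between cycles
    result = 0
    cycles = 0
    for i in range(width):
        bit_a = (a >> i) & 1
        bit_b = (b >> i) & 1
        # One full adder is the only arithmetic logic needed per cycle.
        sum_bit = bit_a ^ bit_b ^ carry
        carry = (bit_a & bit_b) | (carry & (bit_a ^ bit_b))
        result |= sum_bit << i
        cycles += 1
    return result, cycles


if __name__ == "__main__":
    total, cycles = bit_serial_add(1234, 4321)
    print(total, cycles)  # prints "5555 16"
```

A bit-parallel adder would need 16 such full adders to finish in one cycle, which is the area versus latency trade-off that streaming execution is meant to hide.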

To satisfy Policy 2, the DS-HIE processor adopts a Beneš network as its routing resource. The Beneš network is a multistage network that provides multiple paths between the source and destination nodes. Therefore, it can achieve high wiring efficiency.
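To give a feel for the size of such a network, the sketch below computes the dimensions of a textbook Beneš network built from 2x2 switches, which has 2·log2(N) − 1 stages of N/2 switches each. This is an assumption about the general topology only; the DS-HIE switches also contain D-FFs and pipeline registers, which this model ignores. The function name `benes_dimensions` is hypothetical.

```python
def benes_dimensions(n_ports: int) -> dict:
    """Return stage and switch counts of a textbook N x N Benes network.

    Assumes 2x2 switching elements and N a power of two.
    """
    if n_ports < 2 or n_ports & (n_ports - 1):
        raise ValueError("n_ports must be a power of two >= 2")
    log2_n = n_ports.bit_length() - 1
    stages = 2 * log2_n - 1
    switches_per_stage = n_ports // 2
    return {
        "stages": stages,
        "switches_per_stage": switches_per_stage,
        "total_switches": stages * switches_per_stage,
        # Crosspoint count of a full crossbar with the same port count,
        # shown only for comparison.
        "crossbar_crosspoints": n_ports * n_ports,
    }


if __name__ == "__main__":
    print(benes_dimensions(512))
```

For a 512x512 network this gives 17 stages and 4352 switches in total, compared with 262144 crosspoints for a full crossbar of the same size.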

### 2.2. Structure of the DS-HIE processor

The DS-HIE processor consists of the Input/Output Buffer and the Reconfigurable Part. Each part is explained below.



Figure 2. Structure of Input/Output Buffer



Figure 3. Details of routing resources

#### 2.2.1. Structure of Input/Output Buffer

Fig.2 shows the structure of the Input/Output Buffer in the DS-HIE processor. The Input/Output Buffer is organized so that it can function as a double buffer. The Hy-DiSC processor transfers data between the RISC processor and the DS-HIE processor, so synchronization is needed between these processors, which results in performance overhead. The double buffer is adopted in the Input/Output Buffer to reduce this overhead.
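The double buffering idea can be sketched in software as two banks with a swap operation, so that one side fills a bank while the other side reads the bank filled previously. This is only an illustrative model under assumed names (`DoubleBuffer`, `write`, `swap`, `read`); the actual port switching in the DS-HIE hardware is not described at this level in the text.

```python
class DoubleBuffer:
    """Two-bank (ping-pong) buffer: one bank is written while the other is read."""

    def __init__(self, size: int) -> None:
        self._banks = [[0] * size, [0] * size]
        self._write_bank = 0  # bank currently owned by the producer (RISC side)

    def write(self, data: list) -> None:
        """Producer side: fill the bank that the consumer is not reading."""
        bank = self._banks[self._write_bank]
        if len(data) > len(bank):
            raise ValueError("data does not fit into one bank")
        bank[: len(data)] = data

    def swap(self) -> None:
        """Exchange the roles of the two banks (the only synchronization point)."""
        self._write_bank ^= 1

    def read(self) -> list:
        """Consumer side: read the bank that was filled before the last swap."""
        return list(self._banks[self._write_bank ^ 1])


if __name__ == "__main__":
    buf = DoubleBuffer(4)
    buf.write([1, 2, 3, 4])   # producer fills bank 0
    buf.swap()                # bank 0 becomes readable
    buf.write([5, 6, 7, 8])   # producer fills bank 1 in the meantime
    print(buf.read())         # prints "[1, 2, 3, 4]"
```

Because the producer and consumer only meet at `swap`, neither has to wait for the other while a bank is being filled or drained.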

#### 2.2.2. Structure of Reconfigurable Part

The Reconfigurable Part consists of several Operation Stages, and each Operation Stage consists of a Beneš network and Function Units. Fig.3 shows the routing resources in an Operation Stage. There are two types of Function Units: ALU and MSU. The ALU type can perform addition/subtraction and logic operations. The MSU type can perform multiplication, logical/arithmetic shift, and synchronization. An Operation Stage contains the same number of ALUs and MSUs. Each switch in the Beneš network has D-flip-flops (D-FFs) and multiplexers to adjust differences in data arrival timing caused by different operation sequences. In addition, pipeline registers are added to the Beneš network to keep the clock frequency independent of the selected path.

## 3. PROCESSING FLOW

This section describes the details of the processing flow and the execution cycles.

### 3.1. Details of the processing flow

Fig.4 shows the execution of a program on the Hy-DiSC processor in detail. As shown in the figure, the RISC processor accesses the Input/Output Buffer while operations are executed on the Reconfigurable Part. The bus between the RISC processor and the DS-HIE processor is 32 bits wide, and it takes 1 cycle to transfer 32 bits of data. Therefore, two 16-bit data words are transferred per cycle. In the following, we describe the details of execution on the RISC processor and the DS-HIE processor.

#### Phase.1 (Fig.4(a))

The RISC processor sets data in the Input Buffer. The Input Buffer adopts the double buffer method so that it allows access from both the RISC processor and the DS-HIE processor.

The Reconfigurable Part of the DS-HIE processor is idle until the Input Buffer is ready.

#### Phase.2 (Fig.4(b))

The RISC processor sets the next data in the Input Buffer. To supply the data to the Reconfigurable Part, the RISC processor switches the input port of the Input Buffer from one to the other.

The DS-HIE processor fetches data from the Input Buffer and executes operations on the Reconfigurable Part. The DS-HIE processor executes the mapped operations in a streaming manner, which works as a deep execution pipeline. Each stage of the pipeline works as a multi-cycle stage with cycle-by-cycle bit serial processing. Thus, as soon as a Function Unit has calculated the first result bit, the next Function Unit can start its calculation immediately. This deep pipeline execution improves total performance.
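The benefit of starting the next Function Unit on the first result bit can be estimated with a simple latency model. The sketch below is an assumption-laden illustration: it considers a chain of identical bit-serial units, each adding `unit_delay` cycles before its first output bit, and the function name `chain_latency` and the default values are hypothetical rather than taken from the DS-HIE design.

```python
def chain_latency(num_units: int, word_bits: int = 16,
                  unit_delay: int = 1, streamed: bool = True) -> int:
    """Cycles until the last bit leaves a chain of bit-serial units.

    streamed=True : each unit forwards result bits as soon as they are ready,
                    so the per-word time is paid only once.
    streamed=False: each unit waits for the complete word from its predecessor.
    """
    if streamed:
        return num_units * unit_delay + word_bits
    return num_units * (unit_delay + word_bits)


if __name__ == "__main__":
    print(chain_latency(4, streamed=True))   # prints "20"
    print(chain_latency(4, streamed=False))  # prints "68"
```

Under these assumptions a chain of four units finishes in 20 cycles instead of 68, which is the effect the deep pipeline relies on.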

#### Phase.3 (Fig.4(c))

After the DS-HIE processor finishes writing the calculation results to the Output Buffer, the RISC processor reads the results from the Output Buffer. At this point, simultaneous access to the Input Buffer and the Output Buffer is not allowed, so the RISC processor accesses the Output Buffer first and then the Input Buffer. The behavior of the DS-HIE processor in this phase is the same as its streaming behavior in Phase.2.

### 3.2. Details of execution cycles

To calculate the performance of the DS-HIE processor, we need to know in detail how many cycles are needed for operations and for data transfer. The operation cycles and data transfer cycles are described below.

#### 3.2.1. Operation cycles

Operation cycles are the cycles needed for the data in the Input Buffer to be consumed and for the results to be output to the Output Buffer. After the first output data is calculated, the DS-HIE processor can output data every several cycles. The overall latency is the sum of the latencies of the Function Units and of the switches in the Beneš network on the critical path of the mapped data flow graph. Therefore, the output timing of data differs according to the Function Units used. The details of the operation cycles are shown below. The word width of a Function Unit is 16 bits.



Figure 4. The execution in the Hy-DiSC processor



Figure 5. Timing chart of multiplication

##### addition/subtraction:

It takes 17 cycles before a new calculation can start after the previous data is input. The cycles break down as follows.

- clock synchronization: 1 cycle, and
- input/output of data: 16 cycles.

##### multiplication/left shift:

Fig.5 shows the timing chart of multiplication for reference. It takes 36 cycles before a new calculation can start after the previous data is input. The cycles break down as follows.

- clock synchronization: 1 cycle,
- input/output of data: 32 cycles, and
- change state: 3 cycles.

##### right shift:

It takes 36 cycles plus the shift amount before a new calculation can start after the previous data is input. The cycles break down as follows.

- clock synchronization: 1 cycle,
- input/output of data: 32 cycles,
- change state: 3 cycles, and
- delay by shift: cycles equal to shift amount.
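The cycle breakdowns listed above can be collected into one small lookup function. This is a minimal sketch: the cycle values are the ones stated in the text, while the function name `initiation_cycles` and the operation labels are hypothetical.

```python
SYNC_CYCLES = 1          # clock synchronization
STATE_CHANGE_CYCLES = 3  # "change state" for multiplication and shifts


def initiation_cycles(op: str, shift_amount: int = 0) -> int:
    """Cycles before a Function Unit can start a new calculation."""
    if op in ("add", "sub"):
        return SYNC_CYCLES + 16                        # 17 cycles
    if op in ("mul", "lshift"):
        return SYNC_CYCLES + 32 + STATE_CHANGE_CYCLES  # 36 cycles
    if op == "rshift":
        # Same as multiplication plus a delay equal to the shift amount.
        return SYNC_CYCLES + 32 + STATE_CHANGE_CYCLES + shift_amount
    raise ValueError(f"unknown operation: {op}")


if __name__ == "__main__":
    print(initiation_cycles("add"))                     # prints "17"
    print(initiation_cycles("mul"))                     # prints "36"
    print(initiation_cycles("rshift", shift_amount=4))  # prints "40"
```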



Figure 6. The minimum area of structure for Reconfigurable Part

#### 3.2.2. Data transfer cycles

The bus between the RISC processor and the DS-HIE processor is 32 bits wide, and it takes 1 cycle to transfer 32 bits of data. Therefore, two 16-bit data words are transferred per cycle. The number of data transfer cycles depends on the number of Input/Output Buffers used.
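As a rough illustration of the bus arithmetic, the sketch below packs two 16-bit values into one 32-bit bus word and counts the cycles needed to move a given number of 16-bit values. The helper names and the choice of which value occupies the lower half are assumptions, and the model does not account for how many Input/Output Buffers are in use.

```python
from typing import Tuple


def pack16(lo: int, hi: int) -> int:
    """Pack two 16-bit values into one 32-bit bus word (lo in bits 0..15)."""
    return ((hi & 0xFFFF) << 16) | (lo & 0xFFFF)


def unpack16(word: int) -> Tuple[int, int]:
    """Split a 32-bit bus word back into its two 16-bit values."""
    return word & 0xFFFF, (word >> 16) & 0xFFFF


def transfer_cycles(num_values16: int) -> int:
    """Bus cycles needed for `num_values16` 16-bit values, two per cycle."""
    return (num_values16 + 1) // 2


if __name__ == "__main__":
    word = pack16(0x1234, 0xABCD)
    print(hex(word))            # prints "0xabcd1234"
    print(transfer_cycles(64))  # prints "32"
```

For example, the 64 values of one 8x8 block would take 32 bus cycles in each direction under this model.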

## 4. EVALUATION

In this section, we evaluate the performance of the Hy-DiSC processor. To evaluate how much the execution is accelerated by the DS-HIE processor, we also estimated the performance of the RISC processor alone. In this evaluation, we used the Toshiba MeP processor [2], a 32-bit RISC processor for multimedia applications, as the RISC processor in the Hy-DiSC processor. The application programs selected for the performance evaluation were a JPEG encoding process and the 2-D DCT included in it. The DS-HIE processor accelerates the 2-D DCT.

### 4.1. Evaluation environment

The tools used for the evaluation are as follows.

- mep-elf-gcc
- mep-elf-gprof
- sid
- DS-HIE mapping simulator

To evaluate the execution cycles of the MeP processor, we used the development environment tools for MeP [3]. The tools include mep-elf-gcc (gcc compiler), mep-elf-gprof (profiling tool), and sid (simulator) for the MeP processor. To evaluate the execution cycles of the DS-HIE processor, we developed a mapping simulator for the DS-HIE processor. In this simulator, we can specify the structure of the Reconfigurable Part and the program, which is described as a data flow graph. The simulator then maps the program to the specified Reconfigurable Part. In this mapping, the simulator finds the minimum number of Function Units needed in an Operation Stage. After that, the simulator reports the number of Function Units, the total transistor count needed for implementation, and so on. However, we have not yet developed a Hy-DiSC simulator, so the execution cycles for transferring data between the MeP processor and the DS-HIE processor were calculated manually. Fig.6 shows the minimum area of the Reconfigurable Part of the DS-HIE processor for the 2-D DCT, and Table.1 summarizes its specification.

Table 1. Structure information of the Reconfigurable Part

| Parameter                          | Value   |
|------------------------------------|---------|
| Number of Operation Stages         | 2       |
| Function Units per Operation Stage | 250     |
| Size of the Beneš network          | 512x512 |

Table 2. Result of the performance evaluation for Hy-DiSC

| Application            | Execution time, MeP + DS-HIE (msec) | Execution time, MeP (msec) |
|------------------------|-------------------------------------|----------------------------|
| JPEG encode processing | 1012                                | 1199                       |
| 2-D DCT                | 6.93                                | 194                        |

### 4.2. Evaluation method

The conditions for evaluating the execution cycles are as follows.

- The encoded image is 1024x768 pixels.
- The MeP processor uses the c4-E configuration, and its clock frequency is set to 150MHz (evaluated by actually performing logic synthesis in the 180nm CMOS process).
- The DS-HIE processor performs dynamic reconfiguration two times to execute the 2-D DCT.

### 4.3. Evaluation result

We evaluated the performance gain by using the DS-HIE processor. Table.2 shows the evaluation result of the Hy-DiSC processor. From Table.2, compared with the MeP processor, the DS-HIE processor achieved 28 times higher performance on the execution of 2-D DCT. For the whole execution of JPEG processing, the DS-HIE processor achieved 1.18 times higher performance.
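The two speedup figures follow directly from the execution times in Table.2, and they are consistent with each other: replacing the 2-D DCT time on the MeP with the DS-HIE time reproduces the measured total. The short sketch below only redoes this arithmetic; the function names are hypothetical.

```python
# Execution times in msec, taken from Table.2.
T_JPEG_MEP = 1199.0      # whole JPEG encoding on MeP only
T_JPEG_HYDISC = 1012.0   # whole JPEG encoding on MeP + DS-HIE
T_DCT_MEP = 194.0        # 2-D DCT on MeP only
T_DCT_DSHIE = 6.93       # 2-D DCT on DS-HIE


def speedup(t_before: float, t_after: float) -> float:
    """Ratio of execution times (how many times faster)."""
    return t_before / t_after


def predicted_total(t_total: float, t_part: float, t_part_accel: float) -> float:
    """Total time when only one part of the program is accelerated."""
    return t_total - t_part + t_part_accel


if __name__ == "__main__":
    print(round(speedup(T_DCT_MEP, T_DCT_DSHIE), 1))      # prints "28.0"
    print(round(speedup(T_JPEG_MEP, T_JPEG_HYDISC), 2))   # prints "1.18"
    print(round(predicted_total(T_JPEG_MEP, T_DCT_MEP, T_DCT_DSHIE), 2))
```

The overall gain is limited because the 2-D DCT accounts for only about 16 % (194 of 1199 msec) of the JPEG encoding time on the MeP processor.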

While achieving this higher performance, the DS-HIE processor requires only a small transistor count. Table.3 shows the transistor counts of the DS-HIE processor and the Pentium 4 processor [4], which was implemented in about the same process technology as the prototype of the DS-HIE processor. From Table.3, the DS-HIE processor can be implemented with 9.35 % of the transistor count of the Pentium 4 processor. From this result, we can say that the DS-HIE processor achieves high performance while requiring relatively few transistors to implement.

Table 3. Spec of processors

|                                                   | DS-HIE | Pentium 4 |
|---------------------------------------------------|--------|-----------|
| clock frequency (Hz)                              | 500M   | 2.4G      |
| process technology (nm)                           | 180    | 65        |
| transistor count of the input/output buffers (Tr) | 197k   | -         |
| transistor count of function units (Tr)           | 1052k  | -         |
| transistor count of Beneš network (Tr)            | 4208k  | -         |
| total transistor count (Tr)                       | 5260k  | 58200k    |
| relative transistor count (times)                 | 1      | 10.7      |

## 5. CONCLUSION

We evaluated the performance of the Hy-DiSC multi-core processor, which includes the reconfigurable processor DS-HIE. The MeP processor was adopted as the RISC processor, and the application program selected for the performance evaluation was JPEG encoding. The DS-HIE processor accelerates the 2-D DCT included in the JPEG encoding process. As a result, compared with the MeP processor, the DS-HIE processor achieved 28 times higher performance on the execution of the 2-D DCT. In addition, the DS-HIE processor requires fewer transistors to implement than the Pentium 4 processor.

In this evaluation, part of the result was calculated manually, so it is difficult to estimate the performance for larger applications. Therefore, we plan to develop a Hy-DiSC simulator that enables us to evaluate the performance of larger application programs.

## Acknowledgement

This work is supported by VDEC (VLSI Design and Education Center), the University of Tokyo, in collaboration with Synopsys, Inc. and Hitachi Ltd. This research was partially supported by the Ministry of Education, Culture, Sports, Science and Technology (MEXT), Grant-in-Aid for Scientific Research, 18300016, 2006.

## References

- [1] Tetsuya Zuyama, Kazuya Tanigawa, Tetsuo Hironaka, "Development of DS-HIE Architecture," Proceedings of the ITC-CSCC 2007, Vol.1, pp.47-48, 2007.
- [2] "Product Information", http://www.semicon.toshiba.co.jp/eng/product/micro/mep/index.html
- [3] "MeP Documents", http://www.semicon.toshiba.co.jp/eng/product/micro/mep/document/index.html
- [4] "Intel Microprocessor Quick Reference Guide. Product Family", http://www.intel.com/pressroom/kits/quickreffam.htm, 2003.