System-on-chip for real-time applications


















Processes communicate with each other using FIFO queues, and computation and communication within any process are interleaved; for an application such as video conferencing, the number of sub-processes will be even larger. All the processes are assumed to be periodic, and their throughput requirements are expressed in terms of the number of tokens on the queues connected to them. We further assume that the size of a token being communicated on any queue always remains the same and that the size of each queue, i.e. the maximum number of tokens within it, is known a priori. We note that once the size of each queue is specified, writes on any queue also become blocking. We further assume that a dynamic scheduler will take care of the run-time scheduling of the mapped process network.

The component library contains a number of processor types. A processor has a set of attributes: the cost of the processor, its frequency, and the number of cycles taken in a context switch. To take care of interconnection cost, the component library also defines the cost of each link.

The synthesis problem is then the selection of processors and of local and shared memories, the binding of processes to processors and of queues to local or shared memories, and the allocation of communication components such as buses, so as to minimize the total cost.
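
The application and component-library assumptions above can be captured in a handful of record types. The sketch below is purely illustrative: the field names (token_size, depth, throughput, ncy_contxt and so on) are our own shorthand for the attributes mentioned in the text, not identifiers taken from the paper.

    from dataclasses import dataclass
    from typing import Dict

    @dataclass
    class Queue:                        # FIFO channel of the KPN
        token_size: int                 # size of one token (fixed per queue)
        depth: int                      # maximum number of tokens, known a priori
        throughput: float               # required tokens per second (thr_j)

    @dataclass
    class Process:                      # periodic KPN process
        name: str
        cycles_per_second: Dict[str, float]   # computation demand on each processor type

    @dataclass
    class ProcessorType:                # one entry of the component library
        cost: float                     # pr_cost_k
        frequency: float                # cycles offered per second (freq_k)
        ncy_contxt: int                 # cycles taken by one context switch
        local_mem_unit_cost: float      # cost_base_lm_k (local memory cost parameter)

    @dataclass
    class ComponentLibrary:
        processors: Dict[str, ProcessorType]
        link_cost: float                # cost of each interconnection link
        shared_mem_cost: float          # cost of one shared memory module
        shared_mem_bandwidth: float     # bw_SMl of a shared memory module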

Consider the synthesis example of Figure 4. Here the application KPN is composed of 3 processes and 3 queues, and the synthesized architecture consists of 2 processors, 1 local memory and 1 shared memory. Queue 1 is mapped to the local memory of Processor 1 because both the reader and the writer process of Queue 1 are mapped there. On the other hand, Queue 2 and Queue 3 are mapped to the shared memory, as their reader and writer processes sit on different processors. We want to find the equivalent waiting time on the processor; to compute it, the queuing delays are multiplied by the throughput requirements of Queue 2.

The objective function of our synthesis problem is to minimize the cost of the synthesized architecture. In the mapping stage, this cost essentially consists of the cost of the processors used, the local memory modules, the shared memory modules and the interconnection cost.
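
Since the paper's expression for the objective did not survive extraction, the following is only a minimal sketch of that cost, written in notation of our own choosing (the index sets are illustrative):

    \mathrm{Cost} \;=\; \sum_{k} pr\_cost_k \;+\; \sum_{l} \mathrm{cost}(LM_l) \;+\; \sum_{l} \mathrm{cost}(SM_l) \;+\; \sum_{e} \mathrm{cost}(link_e)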

In [16], we proposed an approach to estimate the queuing delays faced by individual processors while accessing some queue, on a per-access basis. We used this method and pre-computed all possible access-conflict delays; we do this for all the processor pairs present in the architecture. Here write-read refers to the case when processor 1 is making a write access to one of the queues while processor 2 is reading; similarly, the other delays can also be interpreted.

Now the synthesis problem needs to be solved under a number of constraints, which are described next. The first are the constraints which define the mapping of the process network onto an architecture instance: a process can be mapped to only one processor, and a queue is mapped onto a local memory only when its reader and writer processes are mapped to the same processor.
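
One direct way to read these two mapping constraints is as a validity check on a candidate solution. The sketch below is illustrative only; process_to_proc, queue_location and queue_endpoints are hypothetical structures describing a candidate mapping, not data structures from the paper.

    def mapping_is_valid(process_to_proc, queue_location, queue_endpoints):
        """Check the mapping constraints on a candidate solution.

        process_to_proc : dict process -> processor instance.  Because each process
                          appears exactly once as a key, "a process is mapped to only
                          one processor" holds by construction.
        queue_location  : dict queue -> ("local", processor) or ("shared", memory)
        queue_endpoints : dict queue -> (writer_process, reader_process)
        """
        for q, loc in queue_location.items():
            writer, reader = queue_endpoints[q]
            if loc[0] == "local":
                # a queue may live in a local memory only if both of its endpoint
                # processes run on the processor that owns that memory
                owner = loc[1]
                if process_to_proc[writer] != owner or process_to_proc[reader] != owner:
                    return False
        return True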

Now, if two queues Qr and Qs are mapped to the same shared memory module, accesses to them may conflict. Similar equations can also be written for the write-write (wwcon) and the remaining contention cases. In this paper, we have considered shared memories with a single port; however, Equation 2 can easily be refined for an arbitrary number of ports.

Every read or write request arriving at a shared memory module SMl is directed at some queue of the process network mapped onto SMl, and the bandwidth bw_SMl of this shared memory module should be larger than the arrival rate of these requests. If Q_SMl is the set of queues mapped to shared memory SMl, thr_j is the throughput constraint for queue Qj and sz_j is the token size for it, then Equation 2 expresses this bandwidth requirement.
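
Equation 2 itself is not legible in this copy; a plausible reconstruction from the definitions above is the following, where the factor of 2, accounting for one write and one read per token, is our assumption:

    bw_{SM_l} \;\geq\; \sum_{Q_j \in Q_{SM_l}} 2 \cdot thr_j \cdot sz_j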

Context Switch Overheads. When multiple processes are mapped onto a single processor, there will be context switch overheads. A process gets blocked while waiting on a queue access: a reader gets blocked when the queue it reads from is empty and, since queue sizes are fixed, we notice that a writer process gets blocked when the queue it writes to is full. Here, we assume that a process can be blocked only when it is accessing a queue.

Performance constraints. At any instant, either the processor is executing some process, or it is switching context from one process to another, or it is waiting due to conflicts while accessing shared memories. As discussed earlier, a processor offers a number of time units equal to its clock frequency in cycles per second, and this must accommodate the computation requirements of the mapped processes, the context switch overheads and the waiting time due to data communication, i.e. the queuing delays. This is captured by Equation 3, in which ncy_contxt_k is the number of cycles taken by processor type PRk in a context switch and the left hand side is composed of the total computation requirements of the mapped processes, the total context switch overheads, and the queuing delays, respectively. The total number of cycles required in these various states of the processor in one second must not exceed what the processor offers; this is what we check while verifying whether a candidate mapping meets the performance constraints.
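
Equation 3 is not recoverable from this copy either; a sketch consistent with the description above, using illustrative symbols (comp_i for the cycles per second required by process P_i on PRk, nswitch_i for its context switches per second, and W_k for the cycles lost to queuing delays), would be:

    \sum_{P_i \mapsto PR_k} comp_i \;+\; \sum_{P_i \mapsto PR_k} nswitch_i \cdot ncy\_contxt_k \;+\; W_k \;\leq\; freq_k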

4. Synthesis Algorithm

We propose a heuristic based solution for our synthesis problem and construct the solution by employing a fast dynamic programming based algorithm. Our approach is fast enough to be used in a design space exploration loop, and it also provides a good initial solution for an iterative refinement phase.

One can make the observation that, of the above considerations, the last two become effective when two adjacent processes of the process network graph are mapped to the same processor; the condition is that these two processes should either be communicating over a number of queues or communicating more frequently. Mapping such adjacent processes onto lower cost processors would result in a cost effective solution, since it reduces the cost of the synthesized architecture and in turn helps to reduce the interconnection network (IN) cost. This is the basis of Algorithm 1.

To find such adjacent processes, the CRG is derived by collapsing all the edges between any two vertices of the original process network graph into a single weighted edge. The 5-tuple next to each edge gives the parameter values for the size of a token being transferred, the maximum number of tokens allowed in the queue, the throughput constraint, and the numbers of tokens produced and consumed per execution; hence, the weight on an edge between two vertices reflects how heavily the corresponding processes communicate. The vertex having the maximum weight on one of its edges is chosen as root, and a weighted topological sort is performed on the CRG starting from this vertex. In each step, a new edge on the path is chosen which has the maximum weight and leads to another vertex which has not yet been visited; in Figure 5, the sort starts from T1. The whole path is then deleted from the CRG, and a new root is chosen if there is no path from the last node on this path to the remaining vertices.
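
The adjacent-process-list construction can be pictured as repeatedly peeling the heaviest path off a weighted graph. The sketch below only illustrates that idea on an adjacency-dictionary representation; it is not the paper's pseudocode, and the choice of root and the tie-breaking are assumptions.

    def adjacent_process_list(crg):
        """crg: dict vertex -> {neighbour: edge weight} for the (undirected) CRG."""
        g = {v: dict(nbrs) for v, nbrs in crg.items()}   # working copy; paths get deleted
        order = []
        while any(g.values()):                           # while edges remain
            # assumed root rule: vertex incident to the heaviest remaining edge
            root = max(g, key=lambda v: max(g[v].values(), default=float("-inf")))
            path, visited, v = [root], {root}, root
            while True:
                # pick the maximum-weight edge leading to an unvisited vertex
                candidates = {u: w for u, w in g[v].items() if u not in visited}
                if not candidates:
                    break
                v = max(candidates, key=candidates.get)
                visited.add(v)
                path.append(v)
            order.extend(path)
            for u in path:                               # delete the whole path from the CRG
                for nbr in list(g.get(u, {})):
                    if nbr in g:
                        g[nbr].pop(u, None)
                g.pop(u, None)
        order.extend(g.keys())                           # any isolated vertices come last
        return order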

Processes of the adjacent process list can be mapped onto different processors in a number of ways; Figure 6 shows two such possibilities for the process network of Figure 1 in the form of an example solution tree. Next, we assume that there are three processors PR1, PR2 and PR3 in the component library. Splitting the list into groups can be done in a number of ways, which correspond to the various sub-trees in Figure 6; hence, this list needs to be broken further. We stop exploring a sub-tree as soon as all the processes in its group are mapped onto some processor under the performance constraints, and the other sub-trees are evaluated similarly.

In Algorithm 1 (synthesize architecture), step 2 creates the adjacent process list by performing the weighted topological sort described above, step 3 builds the partial map matrix, and step 4 refines the architecture using shared memory merging. In the first three steps, we incrementally build partial maps for all the processes. We define the partial map PM for a set of processes: it essentially contains the mapping information for this set, i.e. which processor, memory and link instances its processes and queues are bound to. When the set is mapped onto several processors we term this a multiple processor mapping (MPM); when it is mapped onto a single processor we term it a single processor mapping (SPM). We also note that queues other than the ones shown in Figure 7 are still unassigned to any memory; that is why this mapping information is partial. In the rest of this Section we describe how the adjacent processes (the adjacent process list) are found and how the partial maps are computed; from now onwards, we will simply use the term partial map. We observe that the architecture can be synthesized by building partial maps recursively in a bottom-up manner, which suggests that a dynamic programming based algorithm can be used for synthesis.
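
The bottom-up construction can be phrased as an interval dynamic program over the adjacent process list: an interval [m, n] is either placed entirely on one processor instance (an SPM) or split at a pivot p into two previously solved partial maps (an MPM). The sketch below only illustrates this recurrence; spm_cost and combine_cost are placeholders standing in for Equations 3 to 6, not the paper's code.

    from functools import lru_cache

    def synthesize(order, processor_types, spm_cost, combine_cost):
        """Minimum cost of mapping the adjacent process list order[0..T-1].

        spm_cost(m, n, prk) -> cost of running processes m..n on one instance of prk,
                               or float("inf") if the performance constraints fail.
        combine_cost(left, right, m, p, n) -> cost of merging two partial maps, including
                               the new shared memories and links for queues crossing
                               the split at p.
        """
        T = len(order)

        @lru_cache(maxsize=None)
        def pm(m, n):
            # single processor mappings: try every processor type in the library
            best = min(spm_cost(m, n, prk) for prk in processor_types)
            # multiple processor mappings: split at every pivot p
            for p in range(m, n):
                best = min(best, combine_cost(pm(m, p), pm(p + 1, n), m, p, n))
            return best

        return pm(0, T - 1)

Because each of the roughly T^2 intervals examines at most T pivots, this matches the complexity discussion further below.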

In step 3 of Algorithm 1, the partial map matrix PM for the solution is created using a similar algorithm. Here, partial map PM[m, n] refers to the best partial map found for processes m through n of the adjacent process list. Now, there exist two possible solutions for the partial mapping of these processes. The first maps all of them onto a single instance of some processor type PRk that satisfies the performance constraints; if there is no such processor, then the cost of this solution is infinite. Otherwise, the cost of the single processor partial mapping SPM[m, n, PRk] becomes the quantity given in Equation 4, which is composed of the cost of the processor and local memory instantiated and the cost of the new links introduced.
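
Equation 4 is not legible here; a sketch consistent with that description, with LM(PRk) denoting the local memory attached to the new PRk instance and L_new the set of links introduced, would be:

    SPM[m, n, PR_k].\mathrm{cost} \;=\; pr\_cost_k \;+\; \mathrm{cost}(LM(PR_k)) \;+\; \sum_{e \in L_{new}} \mathrm{cost}(link_e)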

The second possibility is a multiple processor mapping MPM[m, p, n], obtained by splitting the list at a pivot p and combining two smaller partial maps. While combining the partial maps PM[2, 3] [T4, T3] and PM[4, 5] [T5, T0], for instance, we instantiate new shared memories for the queues Q12 and Q13, because the two partial maps communicate via these queues; therefore, new links are also introduced for each such queue. The rest of the queues connected to this partial map will remain unassigned. We choose the lowest cost solution out of all single processor mappings and multiple processor mappings.

The total number of entries in the partial map matrix PM is T^2, the number of pivots considered in Equation 6 is at most T, and the evaluation of a solution at any pivot can be done in O(Q), because in Equation 5 at most Q queues need to be checked. Checking the performance constraints as per Equation 3 is bounded in the same way, as is the time complexity of the shared memory merging given in Equation 6.
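
Reading those bounds together (our arithmetic, not a figure stated in the paper), the overall work of filling the matrix is roughly

    O(T^2)\ \text{entries} \times O(T)\ \text{pivots} \times O(Q)\ \text{per pivot} \;=\; O(T^3 Q)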

We applied the approach to a video decoder KPN. Process Header extraction finds out the header information of the video sequence, and process Slice decoding performs variable length decoding at the level of slices; the remaining four processes work on macroblocks. Table 2 lists the process parameters, which were obtained by using the procedure described earlier, and, in its second part, the processor parameters: each processor type PRk has an associated cost pr_cost_k, a context switch overhead contxt_k, a frequency freq_k and a cost cost_base_lm_k for mapping a unit size of queue onto its local memory.

Table 1 shows the resulting partial map matrix, which is basically an upper triangular matrix. Each valid entry is a 2-tuple, and a -1 means that the corresponding partial map requires multiple processors. Moreover, in the final mapping one shared memory was found to be sufficient.

Instantiating a separate shared memory and link for every queue that crosses processors results in a costlier interconnection network. Hence, in step 4 of Algorithm 1, we create new mappings by pairwise merging of the shared memory instances in the given mapping, under the performance constraints.
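
Step 4 can be read as a greedy pairwise merge: repeatedly try to fuse two shared memory instances and keep the fusion whenever the constraints still hold and the cost drops. The sketch below only illustrates that loop; passes_constraints, cost and the with_merged helper are placeholders, not definitions from the paper.

    def merge_shared_memories(mapping, passes_constraints, cost):
        """Greedy pairwise merging of shared memory instances.

        mapping.shared_mems is assumed to be a list of shared memory instances,
        each holding the queues assigned to it; mapping.with_merged(i, j) is a
        hypothetical helper returning a copy with memories i and j fused.
        Merges are kept only if the bandwidth (Eq. 2) and performance (Eq. 3)
        checks, bundled into passes_constraints, still succeed and the total
        cost decreases.
        """
        improved = True
        while improved:
            improved = False
            mems = mapping.shared_mems
            for i in range(len(mems)):
                for j in range(i + 1, len(mems)):
                    candidate = mapping.with_merged(i, j)
                    if passes_constraints(candidate) and cost(candidate) < cost(mapping):
                        mapping, improved = candidate, True
                        break
                if improved:
                    break
        return mapping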

For example, consider a merger of the two shared memories shown in Figure 7.

Advancements in the field of digital signal processing have enabled the development of low-cost, portable, and computationally efficient systems. In this study, an efficient architecture is designed and realized for high performance ultrasonic signal processing applications.

II. RPUTS

The RPUTS architecture consists of five major modules, including an AD clock management unit and the transducer and step motor drivers visible in the block diagram of Figure 1. The Zynq SoC manages the overall system configuration as well as the execution of the signal processing algorithms, and the Zedboard forms the back-end sub-system.

Figure 1. RPUTS block diagram

Several features, such as the incident beam pattern, the center frequency of operation, and the signal conditioning, have to be carefully tuned for optimum performance depending on the particular application and the target material being inspected. The Zynq SoC within the Zedboard is the main controller of the overall system; it combines ARM Cortex A9 processors with programmable logic (FPGA), connected over an AXI interconnect [8]. Unlike conventional systems, in this study the components such as the noise attenuator, beam controller, and amplifier are selected to allow future upgradability. The LM pulser [3] is used to excite the ultrasonic transducer, and the digitized echoes are processed within the Zynq SoC.

The beamformer supports different configurations for beamforming and beam steering, which helps in realizing several ultrasonic signal processing applications [4] and algorithms such as the discrete wavelet transform (DWT) [10] and split-spectrum processing (SSP) [11]. After configuring the transmitting wavefront profile and the receive channel gain, the pulser excitation signal is fired. The reflected echoes are then captured and assessed for expected quality by determining their signal-to-noise ratio (SNR). After this evaluation, low quality echo signals are rejected and the system is re-configured with modified parameter settings until good quality echo signals are obtained that are acceptable for signal processing applications. The channels can be individually programmed ON or OFF, allowing for low power operation by selectively configuring the desired channels through SPI.
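
The acquire, assess and re-configure cycle described above can be sketched as a small control loop. The helper names (send_config, capture_echo, adjust) are hypothetical stand-ins for the SPI configuration and acquisition steps, and the 20 dB default threshold is arbitrary; none of this is taken from the RPUTS software.

    import numpy as np

    def acquire_good_echo(config, send_config, capture_echo, adjust,
                          snr_threshold_db=20.0, max_attempts=10):
        """Fire, grade the echo by SNR, and re-configure until it is acceptable."""
        for _ in range(max_attempts):
            send_config(config)            # program beamformer/pulser settings over SPI
            echo, noise = capture_echo()   # numpy arrays: echo window and noise-only window
            snr_db = 10 * np.log10(np.mean(echo ** 2) / np.mean(noise ** 2))
            if snr_db >= snr_threshold_db:
                return echo                # good quality: hand off to the DWT/SSP stages
            config = adjust(config, snr_db)  # e.g. raise channel gain or change aperture
        raise RuntimeError("no acceptable echo within the attempt budget")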

The high voltage pulser is programmed in the same way. The GUI is developed in the Eclipse development environment and is written in Java; the program runs in a tabbed environment where the different chips can be configured in their own tabs before beginning transmission of the ultrasound signal. Once the programming is complete, the Zynq SoC sends a transmit signal to the beamformer via the SPI bus to begin the ultrasound signal transmission. The user is then allowed to modify the configuration settings and initiate a new transmission; the flowchart that outlines this process is shown in Figure 4.

Figure 5 shows the ultrasound experimental setup for data acquisition, where the surface of a steel block is scanned by using the transducer with the help of two step motors [13], one for movement in the X-direction and another for movement in the Y-direction.
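
A two-motor surface scan of this kind is essentially a raster loop: step along X and, at each X position, sweep Y while triggering one acquisition per point. The sketch below is purely illustrative; move_x, move_y and acquire_ascan are hypothetical wrappers around the step motor drivers and the acquisition path, and the grid size is arbitrary.

    def raster_scan(move_x, move_y, acquire_ascan, nx=50, ny=50):
        """Scan an nx-by-ny grid over the block and collect one A-scan per point."""
        volume = []                            # nx rows of ny echo records (a C-scan data set)
        for ix in range(nx):
            row = []
            for iy in range(ny):
                row.append(acquire_ascan())    # fire, capture and store the echo here
                if iy < ny - 1:
                    move_y(+1)                 # one Y step between acquisitions
            move_y(-(ny - 1))                  # retrace Y to the start of the row
            if ix < nx - 1:
                move_x(+1)                     # advance one X step for the next row
            volume.append(row)
        return volume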

Figure 5. Ultrasonic experimental setup for the immersion-type testing

In this study, DWT based ultrasonic 3D compression is considered along the three axial directions.


