Monday, April 5, 2010
Memory disambiguation hardware:
1. INTRODUCTION
Because of their high operating frequencies, modern out-of-order processors often need to buffer a very large number of instructions in order to overlap useful processing with the relatively long latencies of accesses to the lower levels of the memory hierarchy. Processor features such as multithreading further increase the demand on instruction buffering capacity. However, increasing the number of in-flight instructions requires scaling up several microarchitectural structures, which has a significant impact on energy consumption, especially when a structure is accessed associatively.
One such structure is the logic that enforces correct memory-based dependences, commonly referred to as the load-store queue (LSQ) and typically implemented as two separate queues: the load queue (LQ) and the store queue (SQ). Conventional implementations of these queues hold complete addresses, and their entries are allocated in program order. To enable early execution of loads without compromising program correctness, memory instructions are tracked by the two queues, and associative searches are used to find the correct producer store or to detect dependence violations. These associative searches are a major concern for the scalability of the queues: as a queue grows, not only does energy consumption increase, but access latency also worsens, which may complicate the logic design.
As a result, a range of implementations that avoid associative searches have been explored recently. The main observation behind these designs is that memory-based dependences are infrequent and hence, through clever filtering or prediction, it is possible to reduce the number of associative accesses. Sections 2 and 3 recap the conventional design of the LSQ and the main alternatives. Section 4 explores our proposals. Finally, Section 5 concludes.
2. CONVENTIONAL DESIGN
Modern out-of-order processors usually employ an array of sophisticated techniques that allow loads to execute early and thereby improve performance. Almost all designs include techniques such as load bypassing and load forwarding. Both schemes allow early execution of loads once all preceding stores have calculated their addresses. More aggressive implementations go a step further and allow a load to execute while the address of a preceding store is still unresolved. Such a speculatively executed load is premature if a store that is earlier in program order writes to the loaded memory location but executes after the load. Clearly, this speculation has to be applied in a way that does not compromise program correctness. Thus, the processor needs to detect, squash and re-execute (or replay) premature loads and their dependents. To simplify the implementation, processors typically replay more instructions than strictly necessary (for example, all instructions following the store [1]). This is acceptable because premature loads are rare in general, and extra logic is sometimes employed to reduce their occurrence further [2].
Dependence enforcement is achieved using an age-ordered load queue and store queue. A memory instruction of one type checks the queue of the opposite type associatively (see Figure 1): a load searches the SQ to forward data from an earlier, in-flight store, and a store searches the LQ to identify loads that have executed prematurely (that is, were wrongly speculated).
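The two load-issue policies and the simple replay rule described above can be sketched as follows. This is only an illustrative software model under stated assumptions: the `Store` record, its `age` and `addr` fields, and both function names are hypothetical and are not taken from any real processor design.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Store:
    """Hypothetical in-flight store (assumed record, for illustration only)."""
    age: int                    # position in program order; smaller means older
    addr: Optional[int] = None  # None until the store address has been computed


def may_issue_load(load_age: int, stores: List[Store], speculate: bool) -> bool:
    """Decide whether a load may execute now.

    Conservative policy: wait until every older store has a known address.
    Speculative policy: issue immediately and rely on later violation checks.
    """
    if speculate:
        return True
    return all(s.addr is not None for s in stores if s.age < load_age)


def instructions_to_replay(store_age: int, in_flight_ages: List[int]) -> List[int]:
    """Simple recovery: replay every instruction younger than the store
    that exposed the premature load, rather than only the load's dependents."""
    return [age for age in in_flight_ages if age > store_age]
```

In this model the conservative policy never needs a replay, while the speculative policy trades occasional replays for earlier load execution.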
[FIGURE 1 OMITTED]
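The two searches can be modeled in software as below. This is a minimal sketch under simplifying assumptions: addresses are compared in full, access sizes and partial overlaps are ignored, and a hardware LSQ would perform the comparisons in parallel rather than in a loop. The entry types and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StoreEntry:
    age: int                    # program order; smaller means older
    addr: Optional[int]         # None while the address is unresolved
    data: Optional[int] = None


@dataclass
class LoadEntry:
    age: int
    addr: int
    executed: bool = False


def search_store_queue(sq: List[StoreEntry], load_age: int,
                       load_addr: int) -> Optional[StoreEntry]:
    """Load side: find the youngest store that is older than the load and
    writes the same address; its data can be forwarded to the load."""
    match = None
    for s in sq:
        if s.age < load_age and s.addr == load_addr:
            if match is None or s.age > match.age:
                match = s
    return match


def search_load_queue(lq: List[LoadEntry], store_age: int,
                      store_addr: int) -> List[LoadEntry]:
    """Store side: find younger loads to the same address that have already
    executed; in this simplified model they are treated as premature."""
    return [l for l in lq
            if l.age > store_age and l.executed and l.addr == store_addr]
```

Both functions inspect every entry of the opposite queue, which mirrors why the cost of the associative search grows with queue size.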
3. LSQ: STATE OF THE ART
The LSQ is a hardware structure that exhibits two main problems: 1) its logic is complex because it involves associative comparison of wide operands, which implies high energy consumption, and 2) scaling the LSQ increases its access latency, which makes it hard to integrate into high-frequency designs. We can identify three different approaches to overcoming these problems. First, based on the observed behavior of memory instructions (dependences and forwardings are infrequent), many researchers have proposed filtering techniques to reduce the number of associative searches. Second, other designs adopt a two-level approach to disambiguation and forwarding. The guiding principle is largely the same: use a first-level structure that is small but still able to perform the large majority of the work, backed up by a much larger second-level structure that corrects or complements it. Finally, other designs try to simplify or remove the associative hardware of the LSQ, looking for simpler and cheaper management of load-store queue operations. In the following sub-sections we summarize the main contributions. Before reviewing them, we start with the most important memory dependence prediction techniques.