Browsing by Subject "Microprocessors"
Now showing 1 - 12 of 12
Item Address Hashing in Intel Processors (2018-09-25) McCalpin, John
To implement a distributed shared last-level cache, addresses must be distributed across the set of cache "slices" in a way that maintains an acceptable degree of uniformity for many common access patterns. This presentation reviews the properties of the address hashes used in Intel Xeon Phi x200 and Intel Xeon Scalable Processors as determined by microbenchmark experimentation. Several cases of conflicts are discussed, along with possible workarounds.

Item Binary adders (1996-05) Lynch, Thomas Walker; Swartzlander, Earl E.
This thesis focuses on the logical design of binary adders. It covers topics extending from cardinal numbers to carry skip optimization. The conventional adder designs are described in detail, including: carry completion, ripple carry, carry select, carry skip, conditional sum, and carry lookahead. We show that the method of parallel prefix analysis can be used to unify the conventional adder designs under one parameterized model. The parallel prefix model also produces other useful configurations, and can be used with carry operator variations that are associative. Parallel prefix adder parameters include group sizes, tree shape, and device sizes. We also introduce a general algorithm for group size optimization. Code for this algorithm is available on the World Wide Web. Finally, the thesis shows the derivation for some carry operator variations, including those originally given by Majerski and Ling.
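As an illustration of the parallel prefix formulation described in the thesis abstract above, here is a minimal C sketch of carry computation with the associative (generate, propagate) operator, combined in a Kogge-Stone-style schedule. The 32-bit width, operator definition, and tree shape are illustrative assumptions, not details taken from the thesis.

```c
#include <stdint.h>
#include <stdio.h>

/* Carry computation via the associative prefix operator on
 * (generate, propagate) pairs:
 *   (g, p) o (g', p') = (g | (p & g'), p & p').
 * A Kogge-Stone-style sweep computes all carries in log2(32) steps. */
static uint32_t prefix_add(uint32_t a, uint32_t b)
{
    uint32_t g = a & b;   /* bit i generates a carry  */
    uint32_t p = a ^ b;   /* bit i propagates a carry */

    /* Combine (g, p) pairs over spans of 1, 2, 4, 8, 16 bits. */
    for (int d = 1; d < 32; d <<= 1) {
        g |= p & (g << d);
        p &= p << d;
    }
    /* g now holds, at bit i, the carry out of bit i; the carry
     * into bit i is therefore g shifted left by one position.  */
    return (a ^ b) ^ (g << 1);
}

int main(void)
{
    printf("%u\n", prefix_add(123456789u, 987654321u)); /* 1111111110 */
    return 0;
}
```

Varying the combining schedule (group sizes, tree shape, device sizes) yields the other prefix configurations the thesis parameterizes.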
Item Disabled Core Patterns and Core Defect Rates in Xeon Phi x200 ("Knights Landing") Processors (2021-10-18) McCalpin, John D.
The Intel Xeon Phi x200 ("Knights Landing", "KNL") processor was Intel's second-generation commercial many-core processor offering and the first offered as a standalone processor. Each processor die has 76 cores arranged in 38 pairs. Unlike Intel's mainstream multicore processors, there were no product offerings with fewer than 84% of the cores enabled, making issues of yield critical. The Texas Advanced Computing Center deployed its 4200 Xeon Phi 7250 (68-core) processors in two phases: 504 nodes in June of 2016 and the remaining 3696 nodes in April 2017. Over 1100 different patterns of disabled cores are observed across the systems, with approximately 75% appearing only once. The most common pattern is seen in over 30% of nodes, with cores disabled at the tiles immediately above and below the two memory controllers. Interpreting these as the "default" cores to be disabled in the absence of defective cores allows disambiguation of cores that are disabled due to defects from those disabled to meet the target enabled core count. Analysis of the statistics of disabled cores in each of these two deployments supports the hypothesis that core defects are random and independent, with a statistically significant reduction in the probability of defects between the first and second deployments.

Item Efficient runahead execution processors (2006) Mutlu, Onur; Patt, Yale N.
High-performance processors tolerate latency using out-of-order execution. Unfortunately, today's processors face memory latencies on the order of hundreds of cycles. To tolerate such long latencies, out-of-order execution requires an instruction window that is unreasonably large in terms of design complexity, hardware cost, and power consumption. Therefore, current processors spend most of their execution time stalled, waiting for long-latency cache misses to return from main memory. The problem is getting worse because memory latencies are increasing in terms of processor cycles. The runahead execution paradigm improves the memory latency tolerance of an out-of-order execution processor by performing potentially useful execution while a long-latency cache miss is in progress. Runahead execution unblocks the instruction window blocked by a long-latency cache miss, allowing the processor to execute far ahead in the program path. As a result, other long-latency cache misses are discovered and their data prefetched into caches long before the data is needed. This dissertation presents the runahead execution paradigm and its implementation on an out-of-order execution processor that employs state-of-the-art hardware prefetching techniques. It is shown that runahead execution on a 128-entry instruction window achieves the performance of a processor with three times the instruction window size for a current, 500-cycle memory latency. For a near-future 1000-cycle memory latency, runahead execution on a 128-entry window achieves the performance of a conventional processor with eight times the instruction window size, without requiring a significant increase in hardware cost and complexity. This dissertation also examines and provides solutions to two major limitations of runahead execution: its energy inefficiency and its inability to parallelize dependent cache misses. Simple and effective techniques are proposed to increase the efficiency of runahead execution by reducing the extra instructions executed without affecting the performance improvement. An efficient runahead execution processor employing these techniques executes only 6.2% more instructions than a conventional out-of-order execution processor but achieves 22.1% higher Instructions Per Cycle (IPC) performance. Finally, this dissertation proposes a new technique, called address-value delta (AVD) prediction, which predicts the values of pointer load instructions encountered in runahead execution in order to enable the parallelization of dependent cache misses. It is shown that a simple 16-entry AVD predictor improves the performance of a baseline runahead execution processor by 14.3% on a set of pointer-intensive applications, while also reducing the executed instructions by 15.5%. An analysis of the high-level programming constructs that result in AVD-predictable load instructions is provided. Based on this analysis, hardware and software optimizations are proposed to increase the benefits of AVD prediction.
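As a rough illustration of the AVD idea described in the abstract above: for some pointer loads, the difference between the load's effective address and the value it returns is stable, so the value can be predicted as address plus a learned delta. The table size matches the 16 entries mentioned in the abstract, but the indexing, tagging, and confidence policy below are illustrative assumptions, not the dissertation's design.

```c
#include <stdint.h>
#include <stdio.h>

/* Sketch of a small address-value delta (AVD) predictor. */
#define AVD_ENTRIES 16

struct avd_entry {
    uint64_t tag;      /* load PC                         */
    int64_t  delta;    /* last observed (value - address) */
    int      conf;     /* saturating confidence counter   */
};

static struct avd_entry avd[AVD_ENTRIES];

/* Train the predictor on a completed load. */
static void avd_train(uint64_t pc, uint64_t addr, uint64_t value)
{
    struct avd_entry *e = &avd[(pc >> 2) % AVD_ENTRIES];
    int64_t d = (int64_t)(value - addr);
    if (e->tag == pc && e->delta == d) {
        if (e->conf < 3) e->conf++;
    } else {
        e->tag = pc; e->delta = d; e->conf = 0;
    }
}

/* Predict the value of a missing load; returns 0 if no prediction. */
static int avd_predict(uint64_t pc, uint64_t addr, uint64_t *value)
{
    struct avd_entry *e = &avd[(pc >> 2) % AVD_ENTRIES];
    if (e->tag == pc && e->conf >= 2) {
        *value = addr + (uint64_t)e->delta;
        return 1;
    }
    return 0;
}

int main(void)
{
    /* Toy example: a load at one PC walking a linked list whose
     * nodes happen to sit at a constant offset from their pointers. */
    uint64_t v;
    avd_train(0x400100, 0x10000, 0x10040);
    avd_train(0x400100, 0x10040, 0x10080);
    avd_train(0x400100, 0x10080, 0x100c0);
    if (avd_predict(0x400100, 0x100c0, &v))
        printf("predicted next value: 0x%llx\n", (unsigned long long)v);
    return 0;
}
```

With a prediction in hand, a dependent load's miss can be issued speculatively during runahead instead of waiting for the pointer value to return from memory, which is how dependent misses become parallelizable.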
Item Generating RTL for microprocessors from architectural and microarchitectural description (2011-05) Bansal, Ankit Sajjan Kumar; Chiou, Derek; Abraham, Jacob
Designing a modern processor is a very complex task. Writing the entire design in a hardware description language (such as Verilog) is time consuming and difficult to verify. There exists a split architecture/microarchitecture description technique in which the description of any hardware is divided into two orthogonal descriptions: (a) an architectural contract between the user and the implementation, and (b) a microarchitecture that describes the implementation of the architecture. The main aim of this thesis is to build realistic processors using this technique. We have designed an in-order and an out-of-order superscalar processor using the split-description compiler. The backend of this compiler is another contribution of this thesis.

Item Mapping Addresses to L3/CHA Slices in Intel Processors (2021-09-10) McCalpin, John D.
The distributed, shared L3 caches in Intel multicore processors are composed of "slices" (typically one "slice" per core), each assigned responsibility for a fraction of the address space. A high degree of interleaving of consecutive cache lines across the slices provides the appearance of a single cache resource shared by all cores. A family of undocumented hash functions is used to distribute addresses to slices, with different hash functions required for different numbers of slices. In all systems studied to date, the hash consists of a relatively short (16 to 16384 elements) "base sequence" of slice numbers, which is repeated with binary permutations for consecutive blocks of memory. The specific binary permutation used is selected by XOR-reductions of different subsets of the higher-order address bits. This report provides the base sequences and permutation select masks for Intel Xeon Scalable Processors (1st and 2nd generation) with 14, 16, 18, 20, 22, 24, 26, or 28 slices, for 3rd Generation Intel Xeon Scalable Processors with 28 slices, and for Xeon Phi x200 processors with 38 slices.
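The structure described in the abstract above can be sketched in C as follows. The base sequence, its length, the number of permutation-select bits, and the bit masks are made-up placeholders for an imaginary 8-slice part, not the measured values from the report; __builtin_parityll is a GCC/Clang builtin used here to perform the XOR-reduction.

```c
#include <stdint.h>
#include <stdio.h>

#define BASE_LEN 16   /* hypothetical base-sequence length */

/* Placeholder base sequence of slice numbers (values 0..7). */
static const uint8_t base_seq[BASE_LEN] = {
    0, 5, 2, 7, 4, 1, 6, 3, 1, 4, 7, 2, 5, 0, 3, 6
};

/* One mask per permutation-select bit (placeholder values). */
static const uint64_t select_mask[3] = {
    0x0000000052ac0000ull, 0x00000000a9540000ull, 0x0000000135680000ull
};

static unsigned slice_of(uint64_t paddr)
{
    uint64_t line = paddr >> 6;        /* cache-line address            */
    unsigned idx  = line % BASE_LEN;   /* position within base sequence */
    unsigned perm = 0;

    /* XOR-reduce (compute parity of) masked high address bits to
     * select the binary permutation for this block of memory.     */
    for (int b = 0; b < 3; b++)
        perm |= (unsigned)__builtin_parityll(paddr & select_mask[b]) << b;

    return base_seq[idx] ^ perm;
}

int main(void)
{
    /* Print the slice for a few consecutive cache lines. */
    for (uint64_t a = 0x100000000ull; a < 0x100000000ull + 8 * 64; a += 64)
        printf("0x%llx -> slice %u\n", (unsigned long long)a, slice_of(a));
    return 0;
}
```

Because the permutation is a XOR applied to the whole base sequence, every permuted block still uses each slice equally often, which is what preserves uniformity across the address space.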
Item Mapping Core and L3 Slice Numbering to Die Location in Intel Xeon Scalable Processors (2021-02-28) McCalpin, John
A methodology for mapping from user-visible core and L3 slice numbers to locations on the processor die is presented, along with results obtained from systems with Intel Xeon Scalable Processors ("Skylake Xeon" and "Cascade Lake Xeon") at the Texas Advanced Computing Center. The current methodology is based on the data traffic counters in the 2-D mesh on-chip network, with the measurements revealing unexpected and counterintuitive transformations of the meanings of "left" and "right" in different regions of the chip. Results show that the numbering of L3 slices is consistent across processor models, while the numbering of cores displays a small number of different patterns, depending on processor model and system vendor.

Item Mapping Core and L3 Slice Numbering to Die Locations in Intel Xeon Scalable Processors (2021-01-13) McCalpin, John D.
A methodology for mapping from user-visible core and L3 slice numbers to locations on the processor die is presented, along with results obtained from systems with Intel Xeon Scalable Processors ("Skylake Xeon" and "Cascade Lake Xeon") at the Texas Advanced Computing Center. The current methodology is based on the data traffic counters in the 2-D mesh on-chip network, with the measurements revealing unexpected and counterintuitive transformations of the meanings of "left" and "right" in different regions of the chip. Results show that the numbering of L3 slices is consistent across processor models, while the numbering of cores displays a small number of different patterns, depending on processor model and system vendor.

Item Mapping Core, CHA, and Memory Controller Numbers to Die Locations in Intel Xeon Phi x200 ("Knights Landing", "KNL") Processors (2021-05-20) McCalpin, John
A methodology for mapping from user-visible core, CHA, and memory controller numbers to locations on the processor die is presented, along with results obtained from systems with Intel Xeon Phi x200 ("Knights Landing", "KNL") processors at the Texas Advanced Computing Center. The current methodology is based on the data traffic counters in the 2-D mesh on-chip network, with the measurements revealing unexpected and counterintuitive transformations of the meanings of "left", "right", "up", and "down" in different regions of the chip. For the systems tested, all CHAs were active and had the same mapping of CHA number to physical location on the die. In contrast to our observations with Xeon Scalable Processors, the x2APIC IDs of the cores in Xeon Phi x200 are not mapped independently of the CHAs: the x2APIC ID of any enabled core contains the CHA number in bits [8:3]. Disabled cores are identified by x2APIC values not seen in any active core. In all cases tested, Logical Processor numbers were assigned to the active physical cores using a simple monotonic mapping.
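A small C sketch of the bit relationship reported above for Xeon Phi x200: the CHA number occupies bits [8:3] of an enabled core's x2APIC ID (the ID itself is readable via CPUID leaf 0xB). The example ID below is invented for illustration and does not come from real hardware.

```c
#include <stdint.h>
#include <stdio.h>

/* Extract the CHA number from a KNL core's x2APIC ID, per the
 * report above: the CHA number is carried in bits [8:3].       */
static unsigned cha_of_x2apic(uint32_t x2apic_id)
{
    return (x2apic_id >> 3) & 0x3f;   /* six bits: [8:3] */
}

int main(void)
{
    uint32_t id = 0xaa;   /* hypothetical x2APIC ID for illustration */
    printf("x2APIC 0x%x -> CHA %u\n", id, cha_of_x2apic(id)); /* CHA 21 */
    return 0;
}
```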
Item Network processor design: benchmarks and architectural alternatives (2005) Lee, Byeong Kil; John, Lizy Kurian
The last decade saw phenomenal growth in information technology and network communication, and network interfaces must keep up in speed, throughput, and capability to support the resulting workloads. Network processors (NPs) have recently been introduced in network interfaces to process complex workloads. This dissertation investigates architectural alternatives for network processors. A network processor should be able to process modern network workloads without slowing down line speed. In order to handle the variety of emerging applications, a good understanding of the target applications from an architectural perspective is essential. While most previous research and commercial products for NPs are dedicated to routing and communication in data-plane applications, control-plane applications, which deal with congestion control and QoS issues, are not well understood. Given the demands of emerging network applications, it is imperative to develop and quantitatively characterize NP control-plane workloads to guide architects in designing future NPs. In this dissertation, a new benchmark suite for network processors, called NpBench, is proposed and its architectural workload characteristics are studied. The NpBench suite includes 5 control-plane applications and 5 data-plane applications; it is implemented in C and is publicly available. A large number of institutions worldwide have licensed NpBench, and several papers and articles cite it. The suite fills a major void in the evaluation and benchmarking of NPs. Another major contribution of this dissertation is a set of architectural enhancements for network processing. First, the parallelism characteristics of network processing applications were investigated to determine whether parallelism can be identified statically. Based on this investigation, it is found that the VLIW approach, successful in the multimedia field, can be applied to the network processor domain as a processing element in a parallel architectural implementation. As alternatives to existing network processor architectures, hardware acceleration techniques are proposed to deal with newly emerging workloads. The feasibility of extracting common ISA extensions across a variety of network workloads is also investigated for accelerating the capability of a processing element within a parallel architecture.

Item Observations on Core Numbering and "Core ID's" in Intel Processors (2020-11-30) McCalpin, John
This report describes and analyzes the patterns of logical processor distribution (across sockets) and the patterns of the "core ID" numbers provided by the hardware in recent and current Intel-processor-based systems at the Texas Advanced Computing Center.

Item Topology and Cache Coherence in Knights Landing and Skylake Xeon Processors (2018-04-12) McCalpin, John
Intel's second-generation Xeon Phi (Knights Landing) and Xeon Scalable Processor ("Skylake Xeon") are both based on a new 2-D mesh architecture with significant changes to the cache coherence protocol. This talk reviews some of the most important new features of the coherence protocol (such as "snoop filters", "memory directories", and non-inclusive L3 caches) from a performance analysis perspective. For both of these processor families, the mapping from user-visible information (such as core numbers) to spatial location on the mesh is both undocumented and obscured by low-level renumbering. A methodology is presented that uses microbenchmarks and performance counters to invert this renumbering, allowing the display of spatially relevant performance counter data (such as mesh traffic) in a topologically accurate two-dimensional view. Applying these visualizations to simple benchmark results provides immediate intuitive insights into the flow of data in these systems, and reveals ways in which the new cache coherence protocols modify these flows.
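As a minimal sketch of the kind of topologically accurate view the last abstract describes: given per-tile counter values and an inverse renumbering from tile number to (row, column), the data can be printed laid out as the physical mesh. The mesh dimensions, counter values, and trivial row-major renumbering below are placeholders; the talk's point is precisely that the real renumbering is undocumented and must be discovered by measurement.

```c
#include <stdio.h>

#define ROWS 4   /* hypothetical mesh height */
#define COLS 6   /* hypothetical mesh width  */

int main(void)
{
    /* Placeholder per-tile traffic counts, indexed by tile number. */
    long traffic[ROWS * COLS];
    for (int t = 0; t < ROWS * COLS; t++)
        traffic[t] = 1000 + 37 * t;

    /* Hypothetical inverse renumbering: tile t at (t / COLS, t % COLS).
     * On real parts this mapping must be inferred from mesh traffic
     * measurements, as the talk describes.                            */
    for (int r = 0; r < ROWS; r++) {
        for (int c = 0; c < COLS; c++)
            printf("%8ld", traffic[r * COLS + c]);
        printf("\n");
    }
    return 0;
}
```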