Browsing by Subject "SQL"

CPU performance in the age of big data : a case study with Hive
(2016-12) Shulyak, Alexander Cole; John, Lizy Kurian
Distributed SQL Query Engines (DSQEs), like Hive, Shark, and Impala, have become the de facto database setup for Decision Support Systems with large database sizes. Unlike single-threaded counterparts such as MySQL, DSQEs experience inefficiencies related to the algorithm, code base, OS, and CPU micro-architecture that limit throughput despite the speedup from distributed execution. In my thesis, I present a detailed performance analysis of a DSQE called Hive, comparing it to MySQL, a single-threaded database application. Hive has difficulty converting queries into a set of MapReduce jobs for distributed execution. Hive also experiences a startup phase that is a significant overhead for short-running queries. Additionally, both Hive and MySQL, like other server applications, experience high L1I miss rates due to a large code footprint. However, because MySQL is algorithmically efficient and traverses the database at a faster rate, it incurs a larger back-end bottleneck from LLC misses, which hides the front-end bottleneck. In contrast, Hive does not hide the high L1I cache miss rate with back-end stalls. Additionally, the higher context switch rates experienced by multi-process Hive setups thrash the first-level caches, further inflating the L1I cache miss rate. To address this micro-architectural inefficiency, I propose an instruction prefetch mechanism called Runahead Prefetch. It is similar to previously proposed branch-prediction-based prefetchers [19], but designed to easily extend modern Intel microarchitectures. Despite newer instruction prefetch mechanisms that discount the potential of branch-prediction-based prefetching [8] [9] [12], I show Runahead Prefetch can eliminate 92% of L1I misses and 96% of icache stalls on average given modern branch misprediction rates and sufficient runahead.
Evaluation of relational database implementation of triple-stores
(2011-05) Funes, Diego Leonardo; Miranker, Daniel P.; Barber, K. Suzanne
The Resource Description Framework (RDF) is the logical data model of the Semantic Web. RDF encodes information as a directed graph using a set of labeled edges known formally as resource-property-value statements or, in common usage, as RDF triples or simply triples. Values recorded in RDF triple form are either Uniform Resource Identifiers (URIs) or literals. The use of URIs allows links between distributed data sources, which enables a logical model of data as a graph spanning the Internet. SPARQL is a standard SQL-like query language on RDF triples. This report describes the translation of SPARQL queries to equivalent SQL queries operating on a relational representation of RDF triples, and the physical optimization of that representation using the IBM DB2 relational database management system. Performance was evaluated using the Berlin SPARQL Benchmark. The results show that the implementation can perform well on certain queries, but more work is required to improve overall performance and scalability.
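The general idea of translating SPARQL to SQL over a relational triple representation can be illustrated with a minimal sketch. It is not the report's implementation: the single three-column `triples` table, the `bgp_to_sql` helper, the sample data, and the use of SQLite in place of IBM DB2 are all assumptions made for illustration. Each triple pattern in a basic graph pattern becomes one alias of the table, constants become equality filters, and a variable shared between patterns becomes a join condition.

```python
import sqlite3


def bgp_to_sql(patterns):
    """Translate (s, p, o) triple patterns into one SQL self-join.

    Hypothetical helper: terms starting with '?' are variables,
    everything else is treated as a constant URI or literal.
    """
    select, where, params, first_seen = [], [], [], {}
    for i, pattern in enumerate(patterns):
        for col, term in zip(("s", "p", "o"), pattern):
            ref = f"t{i}.{col}"
            if term.startswith("?"):
                if term in first_seen:
                    # Repeated variable: join on its first occurrence.
                    where.append(f"{ref} = {first_seen[term]}")
                else:
                    first_seen[term] = ref
                    select.append(f"{ref} AS {term[1:]}")
            else:
                # Constant: bind as a query parameter.
                where.append(f"{ref} = ?")
                params.append(term)
    tables = ", ".join(f"triples AS t{i}" for i in range(len(patterns)))
    sql = f"SELECT {', '.join(select or ['1'])} FROM {tables}"
    if where:
        sql += " WHERE " + " AND ".join(where)
    return sql, params


def run_demo():
    conn = sqlite3.connect(":memory:")
    # Assumed schema: one row per RDF triple.
    conn.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
    conn.executemany(
        "INSERT INTO triples VALUES (?, ?, ?)",
        [
            ("ex:p1", "rdf:type", "ex:Product"),
            ("ex:p1", "rdfs:label", "Widget"),
            ("ex:p2", "rdf:type", "ex:Product"),
            ("ex:p2", "rdfs:label", "Gadget"),
            ("ex:v1", "rdf:type", "ex:Vendor"),
            ("ex:v1", "rdfs:label", "Acme"),
        ],
    )
    # SPARQL: SELECT ?x ?label
    #         WHERE { ?x rdf:type ex:Product . ?x rdfs:label ?label }
    sql, params = bgp_to_sql(
        [("?x", "rdf:type", "ex:Product"), ("?x", "rdfs:label", "?label")]
    )
    rows = sorted(conn.execute(sql, params).fetchall())
    conn.close()
    return sql, rows


if __name__ == "__main__":
    sql, rows = run_demo()
    print(sql)
    print(rows)
```

Because every additional triple pattern adds another self-join on the same table, this naive layout makes the cost of larger queries easy to see, which is one reason the physical design of the relational representation matters.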
SQL database design static analysis
(2010-12) Dooms, Joshua Harold; Krasner, Herb; Perry, Dewayne E.
Static analysis of database design and implementation is not a new idea. Many researchers have covered the topic in detail and defined a number of metrics that are well known within the research community. Unfortunately, unlike the use of metrics in code development, the use of these metrics has not been widely adopted within the development community. It seems that a disconnect exists between the research into database design metrics and the actual use of databases in industry. This paper describes new metrics that can be used in industry to ensure that a database's current implementation supports long-term scalability, to support code that is easy to develop and maintain, or to guide developers towards functions or design elements that can be modified to improve the scalability of their data systems. In addition, this paper describes the production of a tool designed to extract these metrics from SQL Server and includes feedback from professionals regarding the usefulness of the tool and the measures contained within its output.