Show simple item record

dc.contributor.advisorMiranker, Daniel P.en
dc.creatorBarbanson, François Gérarden
dc.date.accessioned2008-08-28T22:03:59Zen
dc.date.available2008-08-28T22:03:59Zen
dc.date.issued2005en
dc.identifierb59808111en
dc.identifier.urihttp://hdl.handle.net/2152/1504en
dc.descriptiontexten
dc.description.abstractAfter nearly 30 years, database integration remains the province of engineers and application developers. In an informal proof, Krishnamurthy, Litwin and Kent [KLK91] demonstrated that only higher order relational languages such as SchemaSQL and SchemaLog [LSS96] are general enough to concisely describe the merging of data from multiple heterogeneous sources. However, those languages are incrementally harder to program than SQL. A modern solution is to provide a GUI language with the same capabilities. However, a Query-by-Example (QBE) [Zloof77] type interface does not allow the unambiguous specification of higher order data integrating queries. We propose an architecture comprising three layers. In the middle layer, the user expresses the desired federated view through a QBE inspired user interface. Further learning proceeds via a sample selection method asking the user to validate examples of federated records. This interaction ends when the system is satisfied it has converged to the exact view definition the user intends. The bottom layer provides the execution mechanism for higher order data manipulations by compiling higher order relational definitions into first-order SQL programs. The top layer component of the architecture assists the learning algorithm by collecting meta-data and catalog statistics. Our primary contributions comprise a taxonomy examining trade-offs between complexity and completeness and identifying various classes of higher order relational data manipulations. The architecture delimits three separate challenges, which must be overcome in order to propose a solution. Our compilation for SchemaSQL proposes novel theoretical complexity guarantees. Type-based vertical partitioning of the meta- data ensures that the result can be properly optimized by existing SQL engines. Sample selection constraints specific to databases require the introduction of a third kind of instance label in the training set of our learner. We derive a new algorithm by modifying Mitchell’s version spaces [Mitchell82] in order to handle this new kind of label. We prove that the modified algorithm preserves the original properties of version spaces and avoids the possibility of deadlock. We introduce a sample selection heuristic that converts catalog statistics into a classic inductive bias. Finally, we develop the Sphinx prototype, carry out experiments and demonstrate the system on an application.
dc.format.mediumelectronicen
dc.language.isoengen
dc.rightsCopyright is held by the author. Presentation of this material on the Libraries' web site by University Libraries, The University of Texas at Austin was made possible under a limited license grant from the author who has retained all copyrights in the works.en
dc.subject.lcshDatabase managementen
dc.subject.lcshDatabase searchingen
dc.subject.lcshSQL (Computer program language)en
dc.titleActive learning and compilation of higher order schema integration queriesen
dc.description.departmentComputer Sciencesen
dc.identifier.oclc61185671en
dc.identifier.proqst3165094en
dc.type.genreThesisen
thesis.degree.departmentComputer Sciencesen
thesis.degree.disciplineComputer Sciencesen
thesis.degree.grantorThe University of Texas at Austinen
thesis.degree.levelDoctoralen
thesis.degree.nameDoctor of Philosophyen


Files in this item

Icon

This item appears in the following Collection(s)

Show simple item record