Automatic data integration with generalized mapping definitions
Data integration systems provide uniform access to a set of heterogeneous structured data sources. An essential component of a data integration system is the mapping between the federated data model and each data source. The scale of interconnection among data sources in the big data era is a new impetus for automating the mapping process. Despite decades of research on data integration, generating mappings still requires extensive manual effort. The thesis of this research is that progress on automatic data integration has been limited by a narrow definition of mapping. The common mapping process is to find correspondences between pairs of entities in the data models and then create logic expressions over those correspondences as executable mappings. This process does not cover all issues that arise in real-world applications. This research aims to overcome this problem in two ways: (1) generalize the common mapping definition for relational databases; (2) address the problem in a more general framework, the Semantic Web. The Semantic Web provides flexible graph-based data models and reasoning capabilities, as in knowledge representation systems. The new graph data model introduces opportunities for new mapping definitions. Mapping definitions and solutions for both relational databases and the Semantic Web are compared. In this dissertation, I propose two generalizations of mapping problems. First, the common schema matching definition for relational databases is generalized from finding correspondences between pairs of attributes to finding correspondences consisting of relations, attributes, and data values. This generalization solves real-world issues that were not previously covered. The same generalization can be applied to ontology matching in the Semantic Web. The second piece of work generalizes the ontology mapping definition from finding correspondences between pairs of entities to finding correspondences between pairs of graph paths (sequences of entities). 
Because a path provides more context than a single entity, mapping between paths can address two challenges in data integration: the missing mapping challenge and the ambiguous mapping challenge. Combining the two proposed generalizations, I demonstrate a complete data integration system built with Semantic Web techniques. The complete system includes components for automatic ontology mapping and query reformulation, and semi-automatically federates the query results from multiple data sources.
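
The idea that a path carries more context than a single entity can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the dissertation: the toy ontology in `SOURCE_EDGES`, the target label `authorName`, and the token-overlap (Jaccard) score, which stands in for whatever matcher the actual system uses.

```python
import re

# Hypothetical source ontology as labeled edges: (domain class, property, range).
SOURCE_EDGES = [
    ("Paper", "author", "Person"),
    ("Paper", "venue", "Venue"),
    ("Person", "name", "Literal"),
    ("Venue", "name", "Literal"),
]


def tokens(label):
    """Split a camelCase label into a set of lowercase tokens."""
    return {t.lower() for t in re.findall(r"[A-Za-z][a-z]*", label)}


def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    return len(a & b) / len(a | b) if a | b else 0.0


def paths_from(start, edges, max_len=2):
    """Enumerate property paths of up to max_len edges starting at a class."""
    frontier = [(start, ())]
    for _ in range(max_len):
        next_frontier = []
        for node, path in frontier:
            for dom, prop, rng in edges:
                if dom == node:
                    new_path = path + (prop,)
                    yield new_path
                    next_frontier.append((rng, new_path))
        frontier = next_frontier


def best_matches(target_label, candidates):
    """Return the candidates tied for the top score, plus that score."""
    target = tokens(target_label)
    scored = [
        (jaccard(target, set().union(*(tokens(p) for p in cand))), cand)
        for cand in candidates
    ]
    top = max(score for score, _ in scored)
    return sorted(cand for score, cand in scored if score == top), top


# Entity-level matching: each candidate is a single property.
entity_candidates = sorted({(prop,) for _, prop, _ in SOURCE_EDGES})
entity_result, entity_score = best_matches("authorName", entity_candidates)

# Path-level matching: each candidate is a sequence of properties from Paper.
path_candidates = list(paths_from("Paper", SOURCE_EDGES))
path_result, path_score = best_matches("authorName", path_candidates)
```

In this toy setting, entity-level matching leaves `author` and `name` tied for the hypothetical target `authorName`, and `name` itself is attached to both `Person` and `Venue`. Path-level matching selects the single path `author` then `name`, which is the kind of disambiguation the abstract attributes to path correspondences.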