Abstract:
This dissertation introduces the problem of query consolidation, which seeks to interpret
a set of disparate queries submitted to independent databases with a single “global” query.
The problem has multiple applications, from improving virtual database design, to aiding
users in information retrieval, to protecting against inference of sensitive data from a seemingly
innocuous set of apparently unrelated queries. The problem exhibits attractive duality
with the much-researched problem of query decomposition, which has been addressed intensively
in the context of multidatabase environments: How to decompose a query submitted
to a virtual database into a set of local queries that are evaluated in individual databases.
The new problem is set in the architecture of a canonical multidatabase system, using it
in the “reverse” direction. The reversal is built on the assumption of conjunctive queries
and source descriptions. A rational and efficient query decomposition strategy is also assumed,
and this decomposition is reversed to arrive at the original query by analyzing the
decomposed components
The process incorporates several steps where a number of solutions must be considered,
due to the fact that query decomposition is not injective.
Initially, the problem of finding the most likely join plan between component queries is
investigated. This is accomplished by leveraging the referential constraints available in the
underlying multidatabase, or by approximating these constraints from the data when not
available. This approximation is done using the information theoretic concept of conditional
entropy. Furthermore, the most likely join plans are enhanced by the expansion of their
projections and adding precision to their selection constraints by estimating the selection
constraints that would be applied to these consolidations offline.
Additionally, the extraction of a set of queries related to the same retrieval task from
an ongoing sequence of incoming queries is investigated. A conditional random field model
is trained to segment and label incoming query sequences. Finally, the candidate consolidations
are re-encapsulated with a genetic programming approach to find simpler intentional
descriptions that are extensionally equivalent to discover the original intent of the query.
The dissertation explains and discusses all of the above operations and validates the
methods developed with experimentation on synthesized and real-world data. The results
are highly encouraging and verify that the accuracy, time performance, and scalability of the
methods would make it possible to exploit query consolidation in production environments.