| Home | Overview | Publications | People | Contact Us | Downloads | Tools |
BioMediator: Information Integration in the Presence of Uncertainty (UII) (Funding provided by NSF #IIS-0513877)
Support for uncertain information integration (UII)

The “Information Integration in the Presence of
Uncertainty” NSF-funded research project (UII) is a concurrent and
complementary effort to the BioMediator project. The primary goal of
the UII project is foundational methodological research around the
“design and implementation of information integration systems that
handle uncertainty about data and results at all levels of the
integration process.” More specifically, the aims of the UII project
are to a) design formalisms for modeling uncertainty about federated
data, b) extend query evaluation methods to handle uncertainty, c)
develop techniques for recording the lineage of query results, and d)
develop methods for resolving uncertainty through user feedback and/or
consultation with additional sources. To create a UII prototype we are
extending the core BioMediator system by including the necessary
mechanisms to handle and process various sources of uncertainty. In
doing so we hope to leverage the working distributed data integration
system, BioMediator, as a test bed for developing methods for tracking
and resolving such uncertainties.

Sources of uncertainty in BioMediator

The BioMediator system enables users to pose queries over a distributed network of data sources. The data found in each source in the network is itself inherently uncertain. Data uncertainties arise from factors such as the amount of additional supporting data, the experimental methods and controls employed, and the degree and type of processing the data has undergone.

Mediated systems, by definition, require mapping source entity and relationship types onto a common mediated schema. For such an approach to facilitate distributed data integration, shared mediated schema elements must align with elements of multiple sources. Because these source elements are often not exactly equivalent, the shared mediated schema element cannot map exactly to all source elements. For example, because the representation of a gene differs slightly between Entrez and Swissprot, any representation of a gene in a mediated schema that includes these two sources must differ from at least one of the source representations. Such imprecise mappings lead to mediation uncertainties.

In the BioMediator system, we broaden a set of query
results by expanding previously unexplored references from existing
results. Such references may be explicit, foreign key type links. They
may also be based on partial matching or alignment algorithms, such as
Qgram or BLAST. These algorithms provide inexact, approximate matches. Partial matching uncertainties are introduced by such approximate methods.

Rationale for tracking uncertainties in BioMediator

A result set in BioMediator can be conceptualized as a graph in which individual results form the nodes (entities) and the interconnections between results in the mediated network form the connecting edges (relationships). This graph also contains a node that represents the original user query. A query path, which begins at the user query node and ends at a given query result node, represents the route through the distributed data sources by which the system derived the result. Because each node represents result data with inherent uncertainty, and because each edge carries uncertainty regarding a reference from one result to another, the belief in results necessarily degrades (or at best remains constant) as one traverses farther along a query path. Such paths provide valuable feedback to users about the level of confidence that they should have in a given result.

Calculating the total path cost from the query node
to each result node allows us to rank result nodes in terms of our
belief in their validity or applicability to the query. Such ranking
can be an invaluable aid in pruning large result sets. It also allows
us to prioritize query expansions to favor the most fruitful paths.

Measures of Uncertainty in the BioMediator System

In the BioMediator/UII prototype, now in development, we have introduced several new measures for quantifying result belief (which is inversely related to uncertainty). Two of these measures, Ps and Pr, represent the level of belief in a result node, independent of its position on a query path. Qs and Qr quantify the belief in the connections between results. Lastly, we associate with each node a UII score, which represents the total path cost, considering all query paths, from the user query to that result.

Fig. 1 shows how these belief measures are
incorporated into BioMediator. The dashed circles indicate sets of
entities of a common type in a given source. The dashed arrows
emanating from these circles point to the mediated schema elements to
which members of the set are mapped. Solid arrows represent
relationships between one source entity and another, in the Source Data
box, or between one mediated schema type and another, in the Mediated
Schema box. Triangles and diamonds represent data elements of two
different types (e.g. genes and proteins).
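As a rough illustration of the result graph and the belief measures introduced above, the following Python sketch attaches Ps and Pr to result nodes and Qs and Qr to edges. The class and function names (`ResultNode`, `Edge`, `path_belief`) and all numeric values are hypothetical and not part of BioMediator. The sketch assumes that each belief value lies between 0.0 and 1.0 and that values along a path are combined by multiplication; this rule is an assumption, chosen only because such a product can never increase as a path grows, which matches the observation that belief degrades (or at best remains constant) along a query path.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResultNode:
    name: str   # hypothetical identifier for a result record
    ps: float   # static belief in this entity type from this source
    pr: float   # per-result belief for this particular record


@dataclass(frozen=True)
class Edge:
    source: str
    target: str
    qs: float   # static belief in this relationship type
    qr: float   # per-instance belief, e.g. from an approximate match


def path_belief(nodes, edges):
    """Multiply all local belief values along one query path.

    Multiplication is an assumed combination rule, not the documented one.
    """
    belief = 1.0
    for node in nodes:
        belief *= node.ps * node.pr
    for edge in edges:
        belief *= edge.qs * edge.qr
    return belief


# Made-up values for illustration only.
gene = ResultNode("gene_record", ps=0.9, pr=0.8)
protein = ResultNode("protein_record", ps=0.95, pr=1.0)
link = Edge("gene_record", "protein_record", qs=0.9, qr=0.5)

one_hop = path_belief([gene], [])
two_hop = path_belief([gene, protein], [link])
```

Under these assumptions `two_hop` can never exceed `one_hop`, because every additional node or edge contributes a factor of at most 1.0.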
Ps is a static measure used to express a user's general belief in entities of a given type found in a given source (e.g. their confidence in information about genes represented in Entrez). Factors such as curation method, data gathering technique, and provenance are included in Ps only if they hold for all entities of the given type and source. Ps also accounts for the user's belief in the relevance of such entities to the application for which the mediated schema was intended, as well as their belief in the extent to which a source-specific type maps onto the corresponding mediated schema type (e.g. a quantitative value for how closely the concept of a protein in one source maps to the concept of a protein in the mediated schema). Ps1, in Fig. 1, corresponds to a user's
general belief in the quality of information, contained in Src1, about
“triangle” entities. Additionally it accounts for the user's belief in
how well Src1's concept of a triangle maps to that of the triangle type
asserted by the mediated schema. Similarly Ps2 represents a
user's confidence in the quality of data found in Src2 about “diamonds”
and in the mapping of diamond from Src2 to the mediated schema. The
value for Ps2 sets the maximum belief that should be assigned to any individual result of type diamond from Src2.

While values for Ps are static and user-group specific, values for Pr are calculated on a per-result basis. If factors such as curation method, data gathering technique, and provenance differ between entities of the same type for a given source, then such factors are included in the Pr measure. Entrez, for example, assigns a “RefSeq status” attribute to gene records, with values such as “inferred”, “predicted”, and “reviewed”. Because these values could influence a user's belief in a particular Entrez gene record, and because they differ between Entrez gene records, they are accounted for in Pr. Pr1 and Pr2 in Fig. 1
illustrate that two “diamond” entities, even if they occur in the same
source, can have differing values for Pr. While Ps2 sets the maximum belief that should be assigned to any diamond entry from Src2, Pr1 and Pr2 may reduce belief in individual diamonds based on entry-specific information.

Qs is a static measure used to express a user's general belief in the quality of intra- or inter-source relationships between entity types (e.g. the relationship of gene in Entrez to gene in OMIM, or of gene in Entrez to protein in Entrez). Qs includes the user's belief in the quality of the mapping of a source-level relationship onto a relationship type found in the mediated schema. Qs sets a maximum belief threshold for any relationship instance of a particular relationship type. In Fig. 1, Qs2 represents a user's
general belief in members of the set of references from “diamond”
entities in Src1 to “diamond” entities in Src2. Additionally it
includes the user's belief regarding how well such links are
represented by the mediated schema diamond-to-diamond relationship
type. Qs1 is similar, but indicates that the relationship
type may be between entities of different types and/or between types
found within the same source.

Factors such as relationship quality are included in Qs if they hold for all instances of a particular relationship type. When such factors differ between instances, however, the Qr measure is used to quantify the differences. Some of the relationships found in a BioMediator result graph are generated using approximate matching techniques such as Qgram, a partial string matching algorithm, or BLAST, a local alignment algorithm for quantifying sequence relations. In these cases the degree of a match (reference) affects the user's uncertainty and is included in the Qr measure. In Fig. 1 both Qr1 and Qr2
refer to relationship instances and account for all instance-specific
information which might affect a user's uncertainty. Both links are of
the mediated type triangle-to-diamond, but Qr1 is an inter-source link while Qr2 is an intra-source link.

BioMediator assigns two local belief values, Ps and
Pr, to each result entity and two local belief values, Qs and Qr, to each
connecting edge. P values are assigned to results independently of other
results, and likewise for Q. Of course, if a result is derived by
following chains of uncertain references through a series of uncertain
result nodes, then the user's belief should degrade along this
query path. UII scores are added, as a global measure, to each result
in order to capture the total cumulative uncertainty. A UII score for
a particular result is calculated from the P and Q belief
values along all paths from the user query to this result.

Incorporating Uncertainties into the BioMediator System

BioMediator belief values range between 0.0 and 1.0, inclusive, where larger values indicate stronger belief. We interpret P and Q values as the probability that a given result node or edge is valid or relevant in the given local context. UII scores have a similar interpretation but in the broader context of a complete result set. Table 1 illustrates the correspondence between the sources of uncertainty described previously and the UII measures.
Table 1: Sources of Uncertainty in BioMediator
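The text does not spell out how a UII score combines the P and Q values over all query paths. One possible reading of the probability interpretation, offered here only as an assumption, is to score a result by the probability that it is valid and connected to the query through at least one path made up entirely of valid nodes and edges, treating every node and edge as valid independently of the others. The Python sketch below computes that quantity by brute-force enumeration, which is practical only for very small graphs. The function name `uii_scores`, the node names, and all belief values are hypothetical.

```python
from itertools import product


def uii_scores(query, node_belief, edge_belief):
    """Score each result by exhaustive enumeration (tiny graphs only).

    node_belief maps result node -> probability that the node is valid.
    edge_belief maps (src, dst)  -> probability that the edge is valid.
    The query node itself is treated as certain and has no entry.
    Assumption: all nodes and edges are valid independently of each other.
    """
    elements = list(node_belief) + list(edge_belief)
    probs = [node_belief[n] for n in node_belief]
    probs += [edge_belief[e] for e in edge_belief]
    scores = {node: 0.0 for node in node_belief}

    # Enumerate every combination of valid (True) / invalid (False) elements.
    for state in product([True, False], repeat=len(elements)):
        weight = 1.0
        for ok, p in zip(state, probs):
            weight *= p if ok else 1.0 - p
        if weight == 0.0:
            continue
        valid = {el for el, ok in zip(elements, state) if ok}

        # Depth-first search from the query over valid nodes and edges only.
        reached, frontier = set(), [query]
        while frontier:
            current = frontier.pop()
            if current in reached:
                continue
            reached.add(current)
            for edge in edge_belief:
                src, dst = edge
                if src == current and edge in valid and dst in valid:
                    frontier.append(dst)

        for node in reached:
            if node != query:
                scores[node] += weight
    return scores


# Made-up beliefs; each node value stands for an already combined Ps/Pr
# belief and each edge value for an already combined Qs/Qr belief.
node_belief = {"a": 0.8, "b": 0.6, "c": 0.9}
edge_belief = {
    ("query", "a"): 1.0,
    ("query", "b"): 1.0,
    ("a", "c"): 0.5,
    ("b", "c"): 0.5,
}
scores = uii_scores("query", node_belief, edge_belief)
ranking = sorted(scores, key=scores.get, reverse=True)
```

In this made-up example, result `c` is reachable along two paths, so its score (about 0.522) is higher than the 0.36 that the better of the two paths would give on its own, while still ranking below `a` and `b`. A ranking of this kind is what would be used to prune large result sets or to prioritize query expansions.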
the Division of Biomedical and Health Informatics [DBHI], the Department of Pediatrics [Peds], the Department of Biological Structure [BIOSTR], and the Department of Computer Science and Engineering [CSE]. Copyright © 2006-2007, University of Washington. All rights reserved.