Home Overview Publications People Contact Us Downloads Tools

BioMediator: Information Integration in the Presence of Uncertainty (UII)
(Funding provided by NSF #IIS-0513877)

 


Support for uncertain information integration (UII)

The “Information Integration in the Presence of Uncertainty” (UII) research project, funded by the NSF, is a concurrent and complementary effort to the BioMediator project. The primary goal of the UII project is foundational methodological research around the “design and implementation of information integration systems that handle uncertainty about data and results at all levels of the integration process.” More specifically, the aims of the UII project are to a) design formalisms for modeling uncertainty about federated data, b) extend query evaluation methods to handle uncertainty, c) develop techniques for recording the lineage of query results, and d) develop methods for resolving uncertainty through user feedback and/or consultation with additional sources. To create a UII prototype, we are extending the core BioMediator system with the mechanisms needed to handle and process various sources of uncertainty. In doing so, we hope to leverage BioMediator, a working distributed data integration system, as a test bed for developing methods for tracking and resolving such uncertainties.

Sources of uncertainty in BioMediator

The BioMediator system enables users to pose queries over a distributed network of data sources. The data found in each source in the network is itself inherently uncertain. Data uncertainties arise from factors such as the amount of additional supporting data, the experimental methods and controls employed, and the degree and type of processing the data has undergone.

Mediated systems, by definition, require mapping source entity and relationship types onto a common mediated schema. For such an approach to facilitate distributed data integration, shared mediated schema elements must align to elements of multiple sources. Because these source elements are often not exactly equivalent, the shared mediated schema element cannot map exactly to all of them. For example, because the representation of a gene differs slightly between Entrez and Swissprot, any representation of a gene in a mediated schema that includes these two sources must differ from at least one of the source representations. Such imprecise mappings lead to mediation uncertainties.

In the BioMediator system, we broaden a set of query results by expanding previously unexplored references from existing results. Such references may be explicit, foreign-key-type links. They may also be based on partial matching or alignment algorithms, such as Qgram or BLAST, which provide inexact, approximate matches. Such approximate methods introduce partial matching uncertainties.

Rationale for tracking uncertainties in BioMediator

A result set in BioMediator can be conceptualized as a graph in which individual results form the nodes (entities) and the interconnections between results in the mediated network form the connecting edges (relationships). The graph also contains a node that represents the original user query. A query path, which begins at the user query node and ends at a given query result node, represents the route through the distributed data sources by which the system derived the result. Because each node represents result data with inherent uncertainty, and because each edge carries uncertainty regarding a reference from one result to another, the belief in results necessarily degrades (or at best remains constant) as one traverses farther along a query path. Such paths provide valuable feedback to users about the level of confidence that they should have in a given result.

Calculating the total path cost from the query node to each result node allows us to rank result nodes in terms of our belief in their validity or applicability to the query. Such ranking can be an invaluable aid in pruning large result sets. It also allows us to prioritize query expansions to favor the most fruitful paths.
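The ranking and prioritization described above can be sketched as a best-first search over the result graph. This is only an illustration under stated assumptions, not the BioMediator implementation: the text does not specify how path cost is computed, so the sketch assumes that the belief along a path is the product of the node and edge beliefs on it. The function name rank_results, the graph layout, and all numeric values are hypothetical.

```python
import heapq

def rank_results(graph, node_belief, query="query"):
    """Return (node, belief) pairs sorted by the best single-path belief.

    graph:       {node: [(neighbor, edge_belief), ...]}
    node_belief: {node: belief in the node itself}, all values in [0, 1]
    Assumption: belief along a path is the product of its node and edge beliefs.
    """
    best = {query: 1.0}
    heap = [(-1.0, query)]  # heapq is a min-heap, so beliefs are negated
    done = set()
    while heap:
        neg_belief, node = heapq.heappop(heap)
        if node in done:
            continue  # stale heap entry; node was already finalized
        done.add(node)
        # Expanding the most-believed node first favors the most fruitful paths.
        for neighbor, edge_belief in graph.get(node, []):
            belief = -neg_belief * edge_belief * node_belief[neighbor]
            if belief > best.get(neighbor, 0.0):
                best[neighbor] = belief
                heapq.heappush(heap, (-belief, neighbor))
    results = [n for n in best if n != query]
    results.sort(key=lambda n: best[n], reverse=True)
    return [(n, best[n]) for n in results]

# Hypothetical result graph: one query node, two genes, one shared protein.
graph = {
    "query": [("geneA", 0.9), ("geneB", 0.6)],
    "geneA": [("protX", 0.8)],
    "geneB": [("protX", 1.0)],
}
node_belief = {"geneA": 0.9, "geneB": 0.9, "protX": 0.95}

ranked = rank_results(graph, node_belief)
```

Because every belief lies in [0, 1], the product can only shrink or stay constant along a path, which is what makes the greedy expansion order valid. A pruning threshold could be applied by discarding entries of ranked whose belief falls below a chosen cutoff.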

Measures of Uncertainty in the BioMediator System

In the BioMediator/UII prototype, now in development, we have introduced several new measures for quantifying belief in a result (belief is inversely related to uncertainty). Two of these measures, Ps and Pr, represent the level of belief in a result node, independent of its position on a query path. Two others, Qs and Qr, quantify the belief in the connections between results. Lastly, we associate with each node a UII score, which represents the total path cost, considering all query paths, from the user query to that result.

Fig. 1 shows how these belief measures are incorporated into BioMediator. The dashed circles indicate sets of entities of a common type in a given source. The dashed arrows emanating from these circles point to the mediated schema elements to which members of the set are mapped. Solid arrows represent relationships between one source entity and another, in the Source Data box, or between one mediated schema type and another, in the Mediated Schema box. Triangles and diamonds represent data elements of two different types (e.g. genes and proteins).

Fig. 1: BioMediator/UII Uncertainty Measures

Ps

Ps is a static measure used to express a user's general belief in entities of a given type found in a given source (e.g. their confidence in information about genes represented in Entrez). Factors such as curation method, data gathering technique, and provenance are included in Ps only if they hold for all entities of the given type and source. Ps accounts for the user's belief in the relevance of such entities to the application for which the mediated schema was intended, as well as their belief in the extent to which a source-specific type maps onto the corresponding mediated schema type (e.g. a quantitative value for how closely the concept of a protein in one source maps to the concept of a protein in the mediated schema).

Ps1, in Fig. 1, corresponds to a user's general belief in the quality of the information contained in Src1 about “triangle” entities. It also accounts for the user's belief in how well Src1's concept of a triangle maps to the triangle type asserted by the mediated schema. Similarly, Ps2 represents a user's confidence in the quality of data found in Src2 about “diamonds” and in the mapping of diamond from Src2 to the mediated schema. The value of Ps2 sets the maximum belief that should be assigned to any individual result from Src2 that is of type diamond.

Pr

While values for Ps are static and user-group specific, values for Pr are calculated on a per-result basis. If factors such as curation method, data gathering technique, and provenance differ between entities of the same type for a given source, then such factors are included in the Pr measure. Entrez, for example, assigns a “RefSeq status” attribute to gene records, with values such as “inferred”, “predicted”, and “reviewed”. As these values could influence a user's belief in a particular Entrez gene record, and because these values differ between Entrez gene records, they are accounted for in Pr.

Pr1 and Pr2 in Fig. 1 illustrate that two “diamond” entities, even if they occur in the same source, can have differing values for Pr. While Ps2 sets the maximum belief that should be assigned to any diamond entry from Src2, Pr1 and Pr2 may reduce belief in individual diamonds based on entry specific information.
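As a minimal sketch of how Ps can cap belief while Pr lowers it per record, the example below multiplies the two values. The multiplication rule is an assumption (the text only says that Ps sets a maximum and that Pr may reduce belief), and the lookup tables, their numeric values, the default for unknown statuses, and the function name node_belief are all hypothetical. Only the RefSeq status labels come from the text above.

```python
# Hypothetical Ps table: one static belief per (source, entity type) pair,
# as it might be set by a particular user group.
PS = {
    ("Entrez", "gene"): 0.9,
}

# Hypothetical mapping from an Entrez "RefSeq status" value to Pr.
PR_BY_REFSEQ_STATUS = {
    "reviewed": 1.0,
    "predicted": 0.7,
    "inferred": 0.5,
}

def node_belief(source, entity_type, refseq_status, default_pr=0.5):
    """Combine Ps and Pr into one local belief for a result node.

    Assumption: belief = Ps * Pr. Since Pr lies in [0, 1], the result can
    never exceed Ps, so Ps acts as the maximum for its source and type.
    """
    ps = PS[(source, entity_type)]
    pr = PR_BY_REFSEQ_STATUS.get(refseq_status, default_pr)
    return ps * pr

reviewed = node_belief("Entrez", "gene", "reviewed")
predicted = node_belief("Entrez", "gene", "predicted")
inferred = node_belief("Entrez", "gene", "inferred")
```

With these made-up numbers, a reviewed record keeps the full Ps value, while predicted and inferred records are discounted below it.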

Qs

Qs is a static measure used to express a user's general belief in the quality of intra- or inter-source relationships between entity types (e.g. relationship of gene in Entrez to gene in OMIM or relationship of gene in Entrez to protein in Entrez). Qs includes the user's belief in the quality of the mapping of a source level relationship onto a relationship type found in the mediated schema. Qs sets a maximum belief threshold for any relationship instance of a particular relationship type.

In Fig. 1, Qs2 represents a user's general belief in members of the set of references from “diamond” entities in Src1 to “diamond” entities in Src2. Additionally it includes the user's belief regarding how well such links are represented by the mediated schema diamond-to-diamond relationship type. Qs1 is similar, but indicates that the relationship type may be between entities of different types and/or between types found within the same source.

Qr

Factors such as relationship quality are included in Qs if they hold for all instances of a particular relationship type. However, when such factors differ between instances the Qr measure is used for quantifying such differences. Some of the relationships found in a BioMediator result graph are generated using approximate matching techniques such as Qgram, a partial string matching algorithm, or BLAST, a local alignment algorithm for quantifying sequence relations. In these cases the degree of a match (reference) affects the user's uncertainty and is included in the Qr measure.

In Fig. 1, both Qr1 and Qr2 refer to relationship instances and account for all instance-specific information that might affect a user's uncertainty. Both links are of the mediated type triangle-to-diamond, but Qr1 is an inter-source link while Qr2 is an intra-source link.
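To illustrate how the degree of a partial string match could feed into Qr, the sketch below computes a q-gram similarity between two strings. The text does not say which q-gram scoring BioMediator uses or how scores map to Qr; the Dice coefficient over padded q-gram multisets is simply one common choice, and the function names and the "#" padding character are assumptions made for this example.

```python
from collections import Counter

def qgrams(text, q=3):
    """Multiset of q-grams of text, padded so that prefixes and suffixes count."""
    padded = "#" * (q - 1) + text.lower() + "#" * (q - 1)
    return Counter(padded[i:i + q] for i in range(len(padded) - q + 1))

def qgram_similarity(a, b, q=3):
    """Dice coefficient over q-gram multisets, in [0, 1]; 1.0 means identical."""
    grams_a, grams_b = qgrams(a, q), qgrams(b, q)
    shared = sum((grams_a & grams_b).values())  # multiset intersection
    total = sum(grams_a.values()) + sum(grams_b.values())
    return 2.0 * shared / total if total else 1.0

# A value like this could be used directly as Qr for a name-based reference.
exact = qgram_similarity("BRCA1", "BRCA1")
close = qgram_similarity("BRCA1", "BRCA2")
unrelated = qgram_similarity("abc", "xyz")
```

An exact match yields 1.0, so an explicit foreign-key-style link would leave the edge belief limited only by Qs, while weaker matches reduce it.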

UII Score

BioMediator assigns two local belief values, Ps and Pr, to each result entity and two local belief values, Qs and Qr, to each connecting edge. P values are assigned to results independently of other results, and likewise for Q values. Of course, if a result is derived by following chains of uncertain references through a series of uncertain result nodes, then the user's belief should degrade along this query path. UII scores are added to each result as a global measure that captures this cumulative uncertainty. The UII score for a particular result is calculated from the P and Q values along all paths from the user query to that result.
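The text does not give the formula used to combine P and Q values over all paths, so the following is only one possible sketch. It assumes an acyclic result graph, treats each node's P as the product Ps * Pr and each edge's Q as Qs * Qr, treats incoming references as independent, and combines them with a noisy-OR: a node is supported if at least one incoming reference from a supported node holds. The function name uii_scores and all numbers are hypothetical.

```python
from functools import lru_cache

def uii_scores(edges, p, query="query"):
    """Global score for every node, combining all paths from the query.

    edges: list of (source_node, target_node, q) with q = Qs * Qr (assumed)
    p:     {node: Ps * Pr} local node belief (assumed combination)
    Assumptions: the graph is acyclic and incoming references are independent.
    """
    incoming = {}
    for src, dst, q in edges:
        incoming.setdefault(dst, []).append((src, q))

    @lru_cache(maxsize=None)
    def score(node):
        if node == query:
            return 1.0  # the query itself is taken as certain
        unsupported = 1.0  # probability that no incoming reference holds
        for src, q in incoming.get(node, []):
            unsupported *= 1.0 - score(src) * q
        return p[node] * (1.0 - unsupported)

    return {node: score(node) for node in p}

# Hypothetical graph: protX is reachable through two different genes.
edges = [
    ("query", "geneA", 0.9),
    ("query", "geneB", 0.6),
    ("geneA", "protX", 0.8),
    ("geneB", "protX", 1.0),
]
p = {"geneA": 0.9, "geneB": 0.9, "protX": 0.95}

scores = uii_scores(edges, p)
```

Under these assumptions protX scores about 0.796, higher than either of its two paths would give alone (about 0.616 through geneA and 0.513 through geneB), which reflects the idea that independent corroborating paths should strengthen belief. A node with no path from the query receives a score of 0.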

Incorporating Uncertainties into the BioMediator System

BioMediator belief values range between 0.0 and 1.0, inclusive, where larger values indicate stronger belief. We interpret P and Q values as the probability that a given result node or edge is valid or relevant in the given local context. UII scores have a similar interpretation but in the broader context of a complete result set. Table 1 illustrates the correspondence between the sources of uncertainty described previously and the UII measures.

Sources of Uncertainty                                      UII Measure

Data uncertainties:
   Quality of entity type for specific source               Ps
   Quality of specific entity instance                      Pr
   Quality of property type for specific source             Qs
   Quality of specific property instance                    Qr

Mediation uncertainties:
   Quality of source entity mapping to mediated schema      Ps
   Quality of source property mapping to mediated schema    Qs

Partial matching uncertainties:
   Quality of query entity match to source entity           Pr
   Quality of query property match to source property       Qr

Table 1: Sources of Uncertainty in BioMediator



BioMediator™ is being developed at the University of Washington [UW] by collaborators in: the Division of Biomedical and Health Informatics [DBHI], the Department of Pediatrics [Peds], the Department of Biological Structure [BIOSTR], and the Department of Computer Science and Engineering [CSE].
Copyright © 2006-2007, University of Washington. All rights reserved.