Thursday, July 24, 2008

WHY UNCERTAINTY IN DATA IS GREAT

Background: Uncertain Data

With the advent and growing popularity of several new applications (such as information extraction on the web, information integration, scientific databases, sensor data management, entity resolution), there is an increasing need to manage uncertain data in a principled fashion. Examples of uncertain data are:
  1. Extraction: You extract structure from an HTML table/text on the Web, but due to the inherent uncertainty in the extraction process, you only have some confidence on the result. You extract the fact that John works for Google, but are not entirely sure, so associated a probability 0.8 with this fact.
  2. Human Readings: Let’s suppose people are viewing birds in California (Christmas Bird Count: http://www.audubon.org/Bird/cbc/). George reports that he saw a bird fly past, but wasn’t sure whether it was a Crow or a Raven. He may also attach confidences with each of them: He is 75% sure it was a Crow, and associates only a 25% chance of it being a Raven.
  3. Sensors: Sensor S1 reported the temperature of a room to be 75±5.
  4. Inherent Data Uncertainty: From weather.com, you extract the fact that there will be rain in Stanford on Monday, but you only have a 75% confidence in this.
  5. Data Integration/Entity Resolution: You are mapping schemas of various tables, and are unsure of whether "mailing-address" in one table corresponds to "home-address" or "office-address" in another table. We are de-duplicating a database of names, and are not sure whether "John Doe" and "J. Doe" refer to the same person.
There are many other examples of uncertainty arising in real-world scenarios.

How Do We Deal With Uncertainty

With large volumes of uncertain data being produced that needs to be subsequently queried and analyzed, there is a dire need to deal with this uncertainty in some way. At a high-level, there are two approaches to dealing with data uncertainty:
  1. CLEAN (Approach-C): "Clean" the data as quickly as possible to get rid of the uncertainty. Once the data has been cleaned, it can be stored in and queried by any traditional DBMS, and life is good thereafter.
  2. MANAGE (Approach-M): In contrast, we could keep the uncertainty in data around, and "manage" it in a principled fashion. This involves building DBMSs that can store such uncertain data, and process them correctly, i.e., handle the probabilities, range of values, dependencies, etc.
Let us compare the two approaches. Both Approach-C and Approach-M entail several technical challenges. Cleaning uncertain data is not a trivial process by any stretch of imagination. There has been work in the database community in cleaning uncertain data in various environments. (For one piece of work on cleaning in sensor/RFID networks, check this IQIS 2006 paper.) Once the data has been cleaned, there is no additional effort involved, as it can be processed by any off-the-shelf DBMS. In contrast, with Approach-M less upfront effort is involved. However, processing uncertain data becomes significantly more challenging. I would like to highlight the Trio project (http://infolab.stanford.edu/trio/) at Stanford that is building a system to manage uncertain data along with data lineage (also known as history or provenance). The lineage feature in Trio allows for an intuitive representation (see our VLDB 2006 paper), and efficient query processing (see our ICDE 2008 paper or a short DBClip). I would also like to note that several other database groups are also studying the problem of managing uncertain data.

Why Managing Is Better Than Cleaning

Without going into technical details, I would like to describe why Approach-M is in general better than Approach-C. While Approach-C gives instant gratification by removing all "dirty data," Approach-M gives better long term results. Intuitively, Approach-C greedily eliminates all uncertainty, but Approach-M could resolve uncertainty more accurately later on because it has more information. Another way to look at it is that in Approach-C, the error involved in cleaning the data keeps compounding as we further query the certain data. But in Approach-M, since the uncertainty is explicitly modeled, we don’t have this problem.

Let us take a very simple example to see how uncertainty can be resolved using Approach-M. Consider the Christmas Bird Count described earlier, where people report bird sightings with a relational schema (Observer, Bird-ID, Bird-Name). (Suppose for the sake of this example birds are tagged with an identifying number, and the main challenge is in associating the number with the bird species.) In this extremely simple example, suppose there is just one sighting in our database by Mary, who saw Bird-1, and identified it as being either a Finch (80%) or a Toucan (20%). At this point we know that Bird-1 is definitely a Finch or a Toucan, but we are not sure which one it is; using Approach-C, we would like to create a certain database, and since Mary feels much more confident that Bird-1 is a Finch, we would enter the tuple (Mary, Bird-1, Finch) into the database and forget the information that it could have possible been a Toucan as well.

In Approach-M, we would store Mary’s sighting as is, which could be represented as:
(Mary, Bird-1, {Finch: 0.8, Toucan: 0.2})
Why is this better? Suppose next day we get another independent observer, Susan, reports the sighting of Bird-1, and she thinks it’s either a Nightingale (70%) or Toucan (30%). The following day we get another independent observer’s sighting who says Bird-1 is either a Hummingbird (65%) or Toucan (35%). Clearly when we reconcile all these sightings, the likelihood of Bird-1 being Toucan is quite high. The method of "reconciling" these readings can be quite complicated, and is an important topic of research, but any reasonable reconciliation should indicate the probability of Bird-1 being Toucan quite high since all other sightings are conflicting. However, using Approach-C, all the three observers’ readings of Toucan would have been weeded out.

Uncertainty In Data Integration

For readers who are not convinced by the synthetic example above, here’s a very real application: data integration, which has been an important area of research for several years.

Data integration systems offer a uniform interface to a set of data sources. As argued in our chapter to appear in a book on uncertain data, data integration systems need to model uncertainty at their core. In addition to the possibility of data being uncertain, semantics mappings and the mediated schema may be approximate. For example, in an application like Google Base that enables anyone to upload structured data, or when mapping millions of sources on the deep web, we cannot imagine specifying exact mappings. The chapter provides the theoretical foundations for modeling uncertainty in a data integration system.

At Google, we built a data integration system that incorporates this probabilistic framework, and completely automatically sets up probabilistic mediated schemas and probabilistic mappings, after which queries can be answered. We applied our system to integrate tables gathered from all over the web in multiple domains, including 50-800 data sources. Details of this work can be found in our SIGMOD 2008 paper. We observed that the system is able to produce high-quality answers with no human intervention. The answers obtained using the probabilistic framework was significantly better than any deterministic technique compared against.

No comments: