Inferno<br />Anish Das Sarma (feed updated 2023-11-15)<br /><br /><span style="font-size:130%;"><span style="font-weight: bold;">WHY UNCERTAINTY IN DATA IS GREAT</span></span><br />Posted 2008-07-24<br /><br /><span style="font-size:130%;">Background: Uncertain Data</span><br /><br />With the advent and growing popularity of several new applications (such as information extraction on the web, information integration, scientific databases, sensor data management, and entity resolution), there is an increasing need to manage <i>uncertain data</i> in a principled fashion. Examples of uncertain data are:<br /><ol><li><span style="font-weight: bold;">Extraction:</span> You extract structure from an HTML table/text on the Web, but due to the inherent uncertainty in the extraction process, you only have some <a href="http://en.wikipedia.org/wiki/Confidence"><i>confidence</i></a> in the result. You extract the fact that John works for Google, but are not entirely sure, so you associate a <a href="http://en.wikipedia.org/wiki/Probability">probability</a> of 0.8 with this fact.</li><li><span style="font-weight: bold;">Human Readings:</span> Let’s suppose people are viewing birds in California (Christmas Bird Count: <a href="http://www.audubon.org/Bird/cbc/">http://www.audubon.org/Bird/cbc/</a>). George reports that he saw a bird fly past, but wasn’t sure whether it was a Crow or a Raven. 
He may also attach a confidence to each of them: he is 75% sure it was a Crow, and assigns only a 25% chance to it being a Raven.</li><li><span style="font-weight: bold;">Sensors:</span> Sensor S1 reported the temperature of a room to be 75±5.</li><li><span style="font-weight: bold;">Inherent Data Uncertainty:</span> From weather.com, you extract the fact that there will be rain in Stanford on Monday, but you only have a 75% confidence in this.</li><li><span style="font-weight: bold;">Data Integration/Entity Resolution:</span> You are mapping schemas of various tables, and are unsure of whether "mailing-address" in one table corresponds to "home-address" or "office-address" in another table. Or you are de-duplicating a database of names, and are not sure whether "John Doe" and "J. Doe" refer to the same person.</li></ol>There are <i>many</i> other examples of uncertainty arising in real-world scenarios.<br /><br /><span style="font-size:130%;">How Do We Deal With Uncertainty?</span><br /><br />With large volumes of uncertain data being produced that must subsequently be queried and analyzed, there is a dire need to deal with this uncertainty in some way. At a high level, there are two approaches to dealing with data uncertainty:<br /><ol><li><span style="font-weight: bold;">CLEAN (Approach-C):</span> "Clean" the data as quickly as possible to get rid of the uncertainty. Once the data has been cleaned, it can be stored in and queried by any traditional DBMS, and life is good thereafter.</li><li><span style="font-weight: bold;">MANAGE (Approach-M):</span> In contrast, we could keep the uncertainty in the data around, and "manage" it in a principled fashion. This involves building DBMSs that can store such uncertain data and process it <i>correctly</i>, i.e., handle the probabilities, ranges of values, dependencies, etc.</li></ol>Let us compare the two approaches. Both Approach-C and Approach-M entail several technical challenges. 
Cleaning uncertain data is not a trivial process by any stretch of the imagination. There has been work in the database community on cleaning uncertain data in various environments. (For one piece of work on cleaning in sensor/RFID networks, check this <a href="http://dbpubs.stanford.edu:8090/pub/2005-37">IQIS 2006 paper</a>.) Once the data has been cleaned, though, there is no additional effort involved, as it can be processed by any off-the-shelf DBMS. In contrast, with Approach-M less upfront effort is involved. However, processing uncertain data becomes significantly more challenging. I would like to highlight the Trio project (<a href="http://infolab.stanford.edu/trio/">http://infolab.stanford.edu/trio/</a>) at Stanford, which is building a system to manage uncertain data along with data <i>lineage</i> (also known as history or provenance). The lineage feature in Trio allows for an intuitive representation (see our <a href="http://dbpubs.stanford.edu:8090/pub/2005-39">VLDB 2006 paper</a>) and efficient query processing (see our <a href="http://dbpubs.stanford.edu:8090/pub/2007-15">ICDE 2008 paper</a> or a short <a href="http://youtube.com/watch?v=C_ogE7U4sHY">DBClip</a>). I would also like to note that several other database groups are studying the problem of managing uncertain data.<br /><br /><span style="font-size:130%;">Why <span style="font-style: italic;">Managing</span> Is Better Than <span style="font-style: italic;">Cleaning</span></span><br /><br />Without going into technical details, I would like to describe why Approach-M is in general better than Approach-C. While Approach-C gives instant gratification by removing all "dirty data," Approach-M gives better long-term results. Intuitively, Approach-C greedily eliminates all uncertainty, but Approach-M could <i>resolve</i> uncertainty more accurately later on because it has more information. 
Another way to look at it is that in Approach-C, the error involved in cleaning the data keeps compounding as we further query the cleaned data. But in Approach-M, since the uncertainty is explicitly modeled, we don’t have this problem.<br /><br />Let us take a very simple example to see how uncertainty can be resolved using Approach-M. Consider the Christmas Bird Count described earlier, where people report bird sightings with a relational schema (Observer, Bird-ID, Bird-Name). (Suppose for the sake of this example that birds are tagged with an identifying number, and the main challenge is in associating the number with the bird species.) In this extremely simple example, suppose there is just one sighting in our database, by Mary, who saw Bird-1 and identified it as being either a Finch (80%) or a Toucan (20%). At this point we know that Bird-1 is definitely a Finch or a Toucan, but we are not sure which one it is. Using Approach-C, we would like to create a certain database, and since Mary feels much more confident that Bird-1 is a Finch, we would enter the tuple <i>(Mary, Bird-1, Finch)</i> into the database and forget the information that it could possibly have been a Toucan as well.<br /><br />In Approach-M, we would store Mary’s sighting as is, which could be represented as:<br /><div style="text-align: center;"><blockquote style="font-style: italic;">(Mary, Bird-1, {Finch: 0.8, Toucan: 0.2})</blockquote></div>Why is this better? Suppose the next day another independent observer, Susan, reports a sighting of Bird-1, and she thinks it’s either a Nightingale (70%) or a Toucan (30%). The following day a third independent observer reports that Bird-1 is either a Hummingbird (65%) or a Toucan (35%). Clearly, when we reconcile all these sightings, the likelihood of Bird-1 being a Toucan is quite high. 
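<br /><br />To make the three-observer example concrete, here is a minimal sketch in Python. It is only an illustration, not how Trio or any particular system reconciles readings: the function names <i>clean</i> and <i>reconcile</i> are hypothetical, and the reconciliation rule (multiply the probabilities the independent observers give each species, treat an unlisted species as probability 0, then renormalize) is just one naive choice made for this sketch.

```python
# Sightings of Bird-1 from the example: species -> reported probability.
SIGHTINGS = [
    {"Finch": 0.8, "Toucan": 0.2},          # Mary
    {"Nightingale": 0.7, "Toucan": 0.3},    # Susan
    {"Hummingbird": 0.65, "Toucan": 0.35},  # third observer
]


def clean(sighting):
    """Approach-C: keep only the single most likely species."""
    return max(sighting, key=sighting.get)


def reconcile(sightings):
    """Approach-M, one naive rule (an assumption of this sketch).

    Multiply each species' probability across the independent observers,
    counting a species an observer did not list as probability 0, then
    renormalize the surviving species so they sum to 1.
    """
    species = set().union(*sightings)
    scores = {}
    for name in species:
        score = 1.0
        for sighting in sightings:
            score *= sighting.get(name, 0.0)
        scores[name] = score
    total = sum(scores.values())
    if total == 0.0:
        return {}  # the observers share no candidate species
    return {name: score / total for name, score in scores.items() if score > 0.0}


cleaned = [clean(s) for s in SIGHTINGS]
reconciled = reconcile(SIGHTINGS)
print(cleaned)
print(reconciled)
```

Cleaning each sighting on arrival keeps Finch, Nightingale, and Hummingbird, so Toucan never reaches the database. Under the rule assumed here, Toucan is the only species all three observers allow, so it ends up with all of the probability mass.<br /><br />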
The method of "reconciling" these readings can be quite complicated, and is an important topic of research, but any reasonable reconciliation should assign a high probability to Bird-1 being a Toucan, since all the other identifications conflict with one another. However, using Approach-C, all three observers’ readings of Toucan would have been weeded out.<br /><br /><span style="font-size:130%;">Uncertainty In Data Integration</span><br /><br />For readers who are not convinced by the synthetic example above, here’s a very real application: <a href="http://en.wikipedia.org/wiki/Data_integration">data integration</a>, which has been an important area of research for several years.<br /><br />Data integration systems offer a uniform interface to a set of data sources. As argued in <a href="http://dbpubs.stanford.edu/pub/2008-29">our chapter</a> to appear in <a href="http://www.springer.com/computer/database+management+&+information+retrieval/book/978-0-387-09689-6">a book on uncertain data</a>, data integration systems need to model uncertainty at their core. In addition to the possibility of data being uncertain, <i>semantic mappings</i> and the <i>mediated schema</i> may be approximate. For example, in an application like <a href="http://www.google.com/base"><i>Google Base</i></a> that enables anyone to upload structured data, or when mapping millions of sources on the <a href="http://en.wikipedia.org/wiki/Deep_web"><i>deep web</i></a>, we cannot imagine specifying exact mappings. The chapter provides the theoretical foundations for modeling uncertainty in a data integration system.<br /><br />At Google, we built a data integration system that incorporates this probabilistic framework, and completely automatically sets up <i>probabilistic mediated schemas</i> and <i>probabilistic mappings</i>, after which queries can be answered. We applied our system to integrate tables gathered from all over the web in multiple domains, including 50-800 data sources. 
Details of this work can be found in our <a href="http://dbpubs.stanford.edu/pub/2008-8">SIGMOD 2008 paper</a>. We observed that the system is able to produce high-quality answers with no human intervention. The answers obtained using the probabilistic framework were significantly better than those of any deterministic technique we compared against.<br /><br /><span style="font-size:130%;"><span style="font-weight: bold;">A Thousand Miles</span></span><br />Posted 2008-01-09<br /><br />A thousand miles, so far away,<br />My heart and soul, there they lie.<br />Every moment, each passing day,<br />Like the birds, I long to fly.<br /><br />How shall I explain to thee,<br />Without you, what becomes of me<br />Dried, shrivelled, impotent, full of sorrow<br />After all, what's a bow without an arrow?<br /><br />I've heard people say, the world is your oyster,<br />I give myself to thee, seeking thy shelter,<br />Well I hope in my oyster, you are the pearl.<br />My heart smiles a rainbow, so long as you are my girl.<br /><br />I gaze at that lonely star,<br />shining bright in the twilight sky,<br />I know I'll be there, not too far,<br />but still, why do I always cry?<br /><br />Let's sail through the vast oceans,<br />beyond anything anyone ever knew,<br />Come with me, my dear,<br />you are the ship, and I am the crew.<br /><br />Never again shall I leave your view,<br />Not for the world, mark my promise.<br />Tell me, thou shalt keep me with you,<br />Please, please, my pretty little miss?<br /><br />No, I'm not a writer, not nearly a poet,<br />Nor am I magician, hidden behind a closet.<br />I am the best donning cupid's bonnet.<br />And so I dream, that you shall see...<br />I'm a fanatic lover, and come embrace me?<br /><br />A thousand miles, so far away,<br />My heart and soul, there they lie.<br />Every moment, each passing day,<br />Like the birds, I long to fly.<br /><br /><span style="font-size:130%;"><span style="font-weight: bold;">Weekly Tops</span></span><br />Posted 2007-10-11<br /><br />Songs:<br /><br />- "One Way Ticket" - Eruption<br />- "From a Distance"<br />- "One Last Breath" - Creed<br />- "Chanda Ke Kiranon Mein" - Kishore Kumar<br />- "Unwritten" - Natasha Bedingfield<br />- "You and Me" - Lifehouse<br />- "Escape" - Enrique<br />- "When You Say Nothing At All" - Ronan Keating<br />- "You took my heart away" - MLTR<br />- "Kuch to log kahenge" - Kishore Kumar<br />- "koyi hota jisko apna" - Kishore Kumar<br />- "But it rained" - Parikrama<br />- "Mamma I'm coming home"<br /><br />Movies:<br /><br />- Chak de India<br />- Bourne Ultimatum<br /><br /><span style="font-size:130%;"><span style="font-weight: bold;">Welcome Back!</span></span><br />Posted 2007-10-11<br /><br />Welcome back, rhapsodic readers; your best-loved blogger is back! Let not my hiatus from blogging, my friends, insinuate my hiatus from fun, work, life, and love. A lot has transpired in the interim and I dare only pen a 20,000-foot executive summary, not doing justice to any of the bullets, lest I get lost in the labyrinth of the glorious summer I've left behind. Not to mention losing my readers bored by the details :-). Hmm... where shall I start? Indeed it has been an amazing summer; the best since I left home, long long ago.<br /><br />- Undoubtedly, the place to start is to update you on my burgeoning love life. Mallika Kumar is her name. I reserve the word "SHE" in the rest of my blog for her, forever. It all started that thunderous full moon night on the 1st of July. And since then, I've never looked back...<br /><br />- And yes, I was interning at Google. I loved it! 
Yeah, that does include work, not just food and pool.<br /><br />- Worth noting is our escapade to Steffi's baby shower. As we walked toward Steffi's apartment, Mallika told me: "And BTW, I didn't tell you the baby shower is technically only for girls" :P. While she was trying to pull a fast one, as we entered, to both our surprise Steffi said: "Uhmmm, Anish, I must warn you, you might be the only guy here". And sure enough, that evening I was with 10 gorgeous women in one room with no reason to complain.<br /><br />- What else, I bought tickets for my India trip this winter! I'll be off for more than a month in Bangalore. Can't wait to get back home after a long year!<br /><br />- I've moved on-campus :). EV 86. Stop by for unprecedented hospitality: dirty apartment and no food or soda to offer. J/K.<br /><br />- Four of us (Atish, Himani, Mallika, and I) made a trip to Orlando, Florida. We saw alligators, parasailed in Daytona, went on the fieriest rides at Disney, and much more...<br /><br />Oh, I almost forgot, I owe you an explanation. Not for the hiatus, but for the return to the dark. SHE said "You write really well! Why have you stopped blogging?" :)<br /><br />Alright, stay tuned to this column for further posts. In the meantime, stay beautiful.<br /><br /><span style="font-size:130%;"><span style="font-weight: bold;">Weekly Tops</span></span><br />Posted 2007-07-23<br /><br />Songs:<br />-------<br />1) "Because of you - Kelly Clarkson" (looking for mp3)<br />2) "How Do I Live" - Trisha Yearwood<br />3) "These Dreams - Heart" (looking for mp3)<br />4) "Kandisa" - Indian Ocean<br />5) "Lay a whisper" - Roxette?<br />6) "Leaving on a Jet Plane" - John Denver<br /><br />Movies:<br />--------<br />1) Silence of the Lambs<br />2) ...censored... 
<br /><br /><span style="font-size:130%;"><span style="font-weight: bold;">Poker Party Minutes</span></span><br />Posted 2007-07-23, updated 2007-10-11<br /><br />Friday, 20th July was a night to remember... a night of fun, frolic, fights, derision, and debauchery.<br /><br /><strong>What:</strong> Poker, Games, and more...<br /><strong>Who:</strong> Kavya, Ranjit, Aditya Jami, Dilys, Sidjon, VK, Mallika, Atish, Anish<br /><br /><u><strong>Standings:</strong></u><br /><strong>1) Winner:</strong> Atish<br /><strong>2) Runner up:</strong> Anish<br /><strong>3) Third spot:</strong> Ranjit<br /><br /><strong><u>Credits:<br /></u>- The thinker:</strong> VK (Thought for 5 minutes and folded :)<br /><strong>- First time drinker:</strong> Sidjon (drank a glass of vodka+cranberry)<br /><strong>- Teetotallers:</strong> Dilys, VK, Mallika<br /><strong>- Most Illustrious player:</strong> Ranjit (back after playing the WSO Poker tourney at Vegas.)<br /><strong>- Surprise performer:</strong> Aditya Jami (First time player performed really well. Neophyte's serendipity.)<br /><strong>- Drunken Driving:</strong> Kavya :P (flunked her driving test next day :( Good luck next time round!)<br /><strong>- The strategist:</strong> Mallika (ask her why!)<br /><strong>- From rags to riches:</strong> Atish (was penniless in between, then ended up winning eventually)<br /><br /><span style="font-size:130%;"><span style="font-weight: bold;">Free Furniture (first-come-first-served)</span></span><br />Posted 2007-07-23, updated 2007-07-26<br /><br />Hello,<br /><br />If you or anyone you know needs furniture, we are giving away furniture for free. The only price you have to pay is transporting it from our place (749 Stanford Avenue, Palo Alto, CA 94306) by 28th July, or 29th July at the latest. 
Call or email ASAP if you are interested in looking at or taking any furniture.<br /><br />We have the following (in addition to several other smaller things. Drop by or let me know if you are interested in the complete list and I can send it over):<br /><br />- couch<br />- 2 convertible beds (each folds into a chair)<br />- coffee table<br />- wooden dining table with 4 chairs<br />- side table<br />- 2 large computer tables<br />- 2 large chests/drawers for clothes<br />- small bed table<br />- modem<br /><br />All the stuff above is available for free (only because I need to clear my apartment by 29th July :( ). You can also buy the stuff below if you wish:<br /><br />- Comfortable leather chair ($20)<br />- Queen size bed + mattress ($40)<br />- Microwave ($20)<br /><br /><span style="font-size:130%;"><span style="font-weight: bold;">Top Songs</span></span><br />Posted 2007-07-08<br /><br />I saw only a couple of movies last week, so here I publish just my list of top songs (in no specific order):<br /><br />"How do I live" - Trisha Yearwood<br />"Top of the world" - Carpenters<br />"Where'd you go" - Fort Minor<br />"Ankhiyon ke jharokhon se, maine dekha jo saanvare" - Hemlata<br />"Right here waiting" - Richard Marx<br />"Ever the same" - Rob Thomas<br /><br /><span style="font-size:130%;"><span style="font-weight: bold;">The Road Not Taken</span></span><br />Posted 2007-07-05<br /><br />Two roads diverged in a wood, and I-<br />I took the one less traveled by,<br />And that has made all the difference.<br /><br /><span style="font-size:130%;"><span style="font-weight: bold;">Tahoe</span></span><br />Posted 2007-06-28, updated 2007-10-11<br /><br />Lake Tahoe it was the past weekend. 
Preceding our trip to Tahoe was amazing food+movie+games at Uttu+Manjeera's place (kudos to Manjeera and Himani for the cooking!). After we lost SidJon, Deva, and Dina Fri night, the gang that embarked on the memorable journey to Tahoe was: Utkarsh, Manjeera, Atish, Himani, Mallika, VK, and me. As the days passed, the numbers dwindled. First night, the horrendous fateful night, was the campfire night (next was the forest fire night). Next morning we went rafting, and then in the evening to the casino. That's the executive summary.<br /><br />I'll spare you all the gory details of every minute of the trip, and instead just give you samples to convince you that it was all happening there. Beautiful, crazy, wonderful things happened: one among us was christened "handsome retro", VK got a speeding ticket, Manjeera had a coupla milkshakes :), awesome cooking by Manjeera+Himani, Atish fell into the water rafting, Uttu accepted a dare and walked on the edge of the raft while in motion, Himani was so drunk and deluded she thought she won a fortune at the casino, videos of VK+Uttu snoring were taken, Mallika was caught "playing" with a teddy bear, we got free drinks at the casino, there was an embarrassing restroom episode following Anish+Mallika getting lost in a 500 m radius circle, resulting in a 911 call, a graphic story was constructed on the drive back from Tahoe, in addition to lampooning Mithun Da, fortune cookies of Atish & Himani caused them to want a "shield" in the car, and not to forget, witnessing the Tahoe forest fire in all its fantastic fury. The rest is history. It'll suffice to say it was one helluva trip!<br /><br /><span style="font-size:130%;"><span style="font-weight: bold;">Weekly Tops</span></span><br />Posted 2007-06-21<br /><br />Weekly posting of top songs I've been listening to, and movies I saw. 
In no specific order...<br /><br />Top Songs:<br />========<br />- "She Hates Me", Puddle of Mudd<br />- "Here I am", Spirit - Stallion of the Cimarron<br />- "Lag Jaa Gale Ke Phir Yeh", Woh Kaun Thi<br />- "Chiquitita", Abba<br />- "Circle of Life", Lion King, Elton John<br /><br />Top Movies:<br />=========<br />- Kill Bill<br />- Life is Beautiful<br />- censored... (kidding :P)<br /><br /><br />If you want your favorite song/movie to appear here next week and/or think I've missed something, lemme know! I'll check em out...<br /><br /><span style="font-size:130%;"><span style="font-weight: bold;">Beijing!</span></span><br />Posted 2007-06-18, updated 2007-10-11<br /><br /><p>Was in Beijing 9-16th June for SIGMOD 2007. Presented a paper titled "Leveraging Aggregate Constraints for Deduplication" (I know you don't care, but for completeness :). What did I do apart from attending talks and schmoozing with people? Here are some highlights:<br /></p><p>(1) Went to the Great Wall, which blew me away. A 3000-mile-long wall built atop endless mountains, winding all around. No wonder it's a wonder of the world. Hiked up to (the most famous and highest) peak of the wall, got a certificate for reaching there :), and took the cable car back down. Was amazing!<br /></p><p>(2) Ate some snake delicacy, lots of roast duck, and several other weird things I don't remember. The snake was better than the cocoon I ate in Korea :D. Oh yes, there was this interesting street in Beijing I went to late at night where you get tons of food on skewers, including scorpions, snakes, starfish, dogs, and more. I didn't partake of any of it as I had filled myself to the brim at a traditional dumplings restaurant earlier.<br /></p><p>(3) Went to the "Pub Street" in Beijing. A street full of pubs. We had Chinese beer in one of them. 
Was additionally exciting as all four of us were out there very late at night, and didn't know the language. Four strangers left at the mercy of cab drivers, and a piece of paper with our hotel name written in Chinese to take us back. The story about our journey from the pub back to the hotel, and the adventures thereupon, stay in Beijing!<br /></p><p>(4) Went to this street mall where young pretty chicks try to sell you clothes. Don't remember what I wore that day but plenty of the chicks came to me and said "handsome guy", etc :-). Then when I didn't end up buying things from them, they'd say "dirty/mean guy". It's true I'm both (both doesn't stand for dirty and mean, but stands for handsome and dirty/mean!) LOL...<br /></p><p>(5) The conference banquet was at the gorgeous Summer Palace. Took a boat to reach this island palace. Saw a traditional Chinese (changing face) opera which was amazing. Also witnessed an acrobatic performance by young kids, a Beijing opera, a magic show and more. In all it was wonderful.<br /></p><p>Pics of the trip coming soon, so stay tuned.</p><br /><br /><span style="font-size:130%;"><span style="font-weight: bold;">Blackwelder Ping-pong Tourney Update</span></span><br />Posted 2007-06-18<br /><br /><p>The Stanford Blackwelder ping-pong tourney concluded a few weeks back. Needless to say I won the men's singles, and VK and I won the men's doubles. 
Too bad the mixed doubles got cancelled, else I'd have been part of the winning team there as well :-)</p><p><br />Anyway, the prize for the men's singles+dubs has given me close to $30 worth of Jamba juice, and $15-20 worth of stuff from the bookstore, none of which I've used yet.</p><br /><br /><span style="font-size:130%;"><span style="font-weight: bold;">Stopping By Woods on a Snowy Evening</span></span><br />Posted 2007-06-05<br /><br />The woods are lovely, dark and deep.<br />But I have promises to keep,<br />And miles to go before I sleep.<br />And miles to go before I sleep.