Professional Activities

CSE department

SUNY Buffalo


Towards Veracity Challenge in Big Data
Jing Gao, Qi Li, Bo Zhao, Wei Fan, Jiawei Han
SIAM International Conference on Data Mining, Miami, FL, May 2016

Big data leads to big challenges, not only in the volume of data but also in its velocity, variety and veracity. In particular, the veracity issue poses great difficulty for many decision-making tasks when the data contains inaccurate or even false information that could mislead decisions and eventually result in substantial losses. Unfortunately, we cannot expect real-world data to be clean and accurate; instead, data inconsistency, ambiguity and uncertainty are widespread. Such ubiquitous veracity problems motivate numerous efforts towards improving information quality, trustworthiness and reliability. These efforts identify reliable information sources and trustworthy claims from different perspectives: 1) a series of approaches have recently been developed to estimate source reliability and detect true claims simultaneously by examining the relationship between sources and claims, and 2) other approaches infer the trustworthiness of claims or the reliability of sources by building analytic models based on the features of claims or sources. Due to their important role in solving the veracity issue, the approaches in both categories have attracted considerable attention, but a combined view of both types of approaches has never been presented. To answer the need for a systematic introduction and comparison of the techniques, we will present an organized picture of the veracity issue in this tutorial. We will discuss various causes of veracity problems, present the state-of-the-art approaches that infer source reliability and information trustworthiness, demonstrate real-world application examples and conclude with future research directions.
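As a rough illustration of the first category (estimating source reliability and true claims together), the sketch below alternates between a reliability-weighted vote and a reliability update. It is not the method of any particular paper covered in the tutorial: the function name `truth_discovery`, the (source, object, value) input format, and the smoothed-accuracy weight update are all assumptions chosen for brevity.

```python
from collections import defaultdict


def truth_discovery(claims, iterations=10):
    """Hypothetical sketch of iterative truth discovery.

    claims: list of (source, object, value) triples.
    Returns (truths, weights): the estimated true value per object and
    an estimated reliability score per source.
    """
    sources = {source for source, _, _ in claims}
    weights = {source: 1.0 for source in sources}  # start with equal trust
    truths = {}

    for _ in range(iterations):
        # Step 1: pick, for each object, the value with the largest
        # total weight among the sources that claim it.
        votes = defaultdict(lambda: defaultdict(float))
        for source, obj, value in claims:
            votes[obj][value] += weights[source]
        truths = {obj: max(vals, key=vals.get) for obj, vals in votes.items()}

        # Step 2: re-estimate each source's reliability as its smoothed
        # agreement rate with the current truths (assumed update rule).
        correct = defaultdict(int)
        total = defaultdict(int)
        for source, obj, value in claims:
            total[source] += 1
            correct[source] += int(truths[obj] == value)
        weights = {s: (correct[s] + 1) / (total[s] + 2) for s in sources}

    return truths, weights


# Toy data: three sources make conflicting claims about three objects.
claims = [
    ("A", "o1", "x"), ("A", "o2", "y"), ("A", "o3", "z"),
    ("B", "o1", "x"), ("B", "o2", "y"), ("B", "o3", "w"),
    ("C", "o1", "q"), ("C", "o2", "y"), ("C", "o3", "w"),
]
truths, weights = truth_discovery(claims)
```

In this toy input, source B agrees with the estimated truth on every object and therefore ends up with the highest reliability score.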

The slides in PDF can be found here.

Truth Discovery and Crowdsourcing Aggregation: A Unified Perspective
Jing Gao, Qi Li, Bo Zhao, Wei Fan, Jiawei Han
International Conference on Very Large Data Bases (VLDB), Kohala Coast, HI, August 2015

In the era of Big Data, data entries, even those describing the same objects or events, can come from a variety of sources, where a data source can be a web page, a database or a person. Consequently, conflicts among sources become inevitable. To resolve the conflicts and achieve high-quality data, truth discovery and crowdsourcing aggregation have been studied intensively. However, although these two topics have a lot in common, they are studied separately and are applied to different domains. To answer the need for a systematic introduction and comparison of the two topics, we present an organized picture of truth discovery and crowdsourcing aggregation in this tutorial. They are compared on both theory and application levels, and their related areas as well as open questions are discussed.

The slides in PDF can be found here, and a summary can be found here.

Outlier Detection for Temporal Data
Manish Gupta, Jing Gao, Charu Aggarwal, Jiawei Han
ACM International Conference on Information and Knowledge Management (CIKM), San Francisco, CA, October 2013
SIAM International Conference on Data Mining (SDM), Austin, TX, May 2013

Outlier (or anomaly) detection is a very broad field which has been studied in the context of a large number of research areas like statistics, data mining, sensor networks, environmental science, distributed systems, spatio-temporal mining, etc. The first few articles in outlier detection focused on time series based outliers (in statistics). Since then, outlier detection has been studied on a large variety of data types including high-dimensional data, uncertain data, stream data, network data, time series data, spatial data, and spatio-temporal data. While there have been many tutorials and surveys for general outlier detection, we focus on outlier detection for temporal data in this tutorial.

A large number of applications generate temporal datasets. For example, in our everyday life, various kinds of records like credit, personnel, financial, judicial, medical, etc. are all temporal. This stresses the need for an organized and detailed study of outliers with respect to such temporal data. In the past decade, there has been a lot of research on various forms of temporal data including consecutive data snapshots, series of data snapshots and data streams. Besides the initial work on time series, researchers have focused on rich forms of data including multiple data streams, spatio-temporal data, network data, community distribution data, etc. Techniques for temporal outlier detection are very different from those for general outlier detection and include AR models, Markov models, evolutionary clustering, etc.

In this tutorial, we will present an organized picture of recent research in temporal outlier detection. We begin by motivating the importance of temporal outlier detection and outlining the challenges beyond usual outlier detection. Then, we lay out a taxonomy of proposed techniques for temporal outlier detection. Such techniques broadly include statistical techniques (like AR models, Markov models, histograms, neural networks), distance and density based approaches, grouping based approaches (clustering, community detection), network based approaches, and spatio-temporal outlier detection approaches. We conclude by presenting a collection of applications where temporal outlier detection techniques have been applied to discover interesting outliers.
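To make the statistical family concrete, here is a minimal sketch of one AR-model approach: fit an AR(1) model by ordinary least squares and flag points whose one-step prediction error is unusually large. The function name `detect_ar1_outliers`, the threshold rule and the toy series are illustrative assumptions, not material from the tutorial.

```python
from statistics import mean, pstdev


def detect_ar1_outliers(series, threshold=2.0):
    """Hypothetical sketch: flag indices whose AR(1) residual is large.

    Fits x[t] = c + phi * x[t-1] by least squares, then reports every
    index t whose absolute residual exceeds threshold * (residual std).
    """
    prev, curr = series[:-1], series[1:]
    mean_prev, mean_curr = mean(prev), mean(curr)

    # Least-squares slope (phi) and intercept (c).
    var_prev = sum((p - mean_prev) ** 2 for p in prev)
    if var_prev == 0:
        return []  # constant series: nothing to model
    cov = sum((p - mean_prev) * (x - mean_curr) for p, x in zip(prev, curr))
    phi = cov / var_prev
    c = mean_curr - phi * mean_prev

    # One-step-ahead prediction errors; residuals[i] belongs to index i + 1.
    residuals = [x - (c + phi * p) for p, x in zip(prev, curr)]
    spread = pstdev(residuals)
    if spread == 0:
        return []
    return [i + 1 for i, r in enumerate(residuals) if abs(r) > threshold * spread]


# Toy series that hovers around 1.0 with a single spike at index 7.
series = [1.0, 1.1, 0.9, 1.0, 1.05, 0.95, 1.0, 5.0, 1.0, 1.1, 0.9, 1.0]
outliers = detect_ar1_outliers(series)
```

Because the spike itself is included when fitting the model, it distorts the estimated coefficients; a more careful version would use a robust fit or a robust spread estimate.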

The slides in PowerPoint can be found here.

Data Stream Mining: Challenges and Techniques
Latifur Khan, Wei Fan, Jiawei Han, Jing Gao, Mohammad M. Masud
Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), Shenzhen, China, May 2011

Data streams are continuous flows of data. Examples of data streams include network traffic, sensor data, call center records and so on. Their sheer volume and speed make them a great challenge for the data mining community. Data streams demonstrate several unique properties: infinite length, concept-drift, concept-evolution, feature-evolution and limited labeled data. Concept-drift occurs in data streams when the underlying concept of the data changes over time. Concept-evolution occurs when new classes evolve in streams. Feature-evolution occurs when the feature set varies with time in data streams. Data streams also suffer from scarcity of labeled data since it is not possible to manually label all the data points in the stream. Each of these properties adds a challenge to data stream mining. This tutorial presents an organized picture of how to apply various data mining techniques to data streams: in particular, how to handle classification and clustering in evolving data streams by addressing these challenges.

On the Power of Ensemble: Supervised and Unsupervised Methods Reconciled
Jing Gao, Wei Fan, Jiawei Han
SIAM International Conference on Data Mining (SDM), Columbus, OH, May 2010

Ensemble methods have emerged as a powerful approach for improving the robustness as well as the accuracy of both supervised and unsupervised solutions. Moreover, as enormous amounts of data are continuously generated from different views, it is important to consolidate different concepts for intelligent decision making. In the past decade, there have been numerous studies on the problem of combining competing models into a committee, and the success of ensemble techniques has been observed in multiple disciplines, including recommendation systems, anomaly detection, stream mining, and web applications.

Ensemble techniques have mostly been studied in the supervised and unsupervised learning communities separately. However, they share the same basic principle, i.e., combining diversified base models strengthens weak models. Also, when both supervised and unsupervised models are available for a single task, merging all of the results leads to better performance. Therefore, there is a need for a systematic introduction and comparison of the ensemble techniques, combining the views of both supervised and unsupervised learning ensembles.

In this tutorial, we will present an organized picture of ensemble methods with a focus on the mechanism to merge the results. We start with the description and applications of ensemble methods. Through reviews of well-known and state-of-the-art ensemble methods, we show that supervised learning ensembles usually "learn" this mechanism based on the available labels in the training data, whereas unsupervised ensembles simply combine multiple clustering solutions based on "consensus". We end the tutorial with a systematic approach to combine both supervised and unsupervised models, and several applications of ensemble methods.
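The contrast between a learned merging mechanism and a consensus-based one can be sketched as follows. This is a deliberate simplification: it combines class-label predictions rather than clustering solutions (real clustering ensembles must also handle label correspondence), and the function names `majority_vote`, `learn_weights` and `weighted_vote` as well as the toy data are assumptions, not methods taken from the tutorial.

```python
from collections import Counter, defaultdict


def majority_vote(predictions):
    """Consensus without labels: each item takes its most common label.

    predictions: one list of predicted labels per base model.
    """
    return [Counter(column).most_common(1)[0][0] for column in zip(*predictions)]


def learn_weights(predictions, labels):
    """Learned mechanism: weight each model by its accuracy on labeled data."""
    return [
        sum(p == y for p, y in zip(model_preds, labels)) / len(labels)
        for model_preds in predictions
    ]


def weighted_vote(predictions, weights):
    """Combine predictions, letting higher-weight models count for more."""
    combined = []
    for column in zip(*predictions):
        score = defaultdict(float)
        for label, weight in zip(column, weights):
            score[label] += weight
        combined.append(max(score, key=score.get))
    return combined


# Toy data: three base models, four items, binary labels.
predictions = [
    [1, 0, 1, 1],  # model 1
    [0, 0, 0, 1],  # model 2
    [0, 1, 0, 0],  # model 3
]
labels = [1, 0, 1, 1]

consensus = majority_vote(predictions)
weights = learn_weights(predictions, labels)
learned = weighted_vote(predictions, weights)
```

For brevity, the weights here are learned and applied on the same four items; in practice they would be estimated on held-out labeled data.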

The slides in PowerPoint can be found here, and more information can be found here.

Selected Recent Invited Talks
  • Truth Discovery from Crowdsourced Data. Tsinghua University, Microsoft Research Asia, June 2016.

  • Mining Truth from Multi-Source Data. Chinese Academy of Sciences, June 2016; Harbin Institute of Technology, May 2016; Yahoo! Labs, August 2015; Baidu Research, August 2015.

  • Inferring Information Trustworthiness from Multiple Sources of Heterogeneous Data. Xerox PARC, May 2015; KDD 2014 Workshop on Big Data, Streams and Heterogeneous Source Mining, August 2014; NEC Labs America, August 2014; SDM 2014 Workshop on Heterogeneous Learning, April 2014; Houghton College, March 2014.

  • Exploring the Power of Heterogeneous Information Sources. EITA-YIC conference, August 2013; Stony Brook University, November 2012.

Selected Conference Presentations