Professional Activities

CSE department

SUNY Buffalo

Research Summary (Under Construction)

My research focuses on exploring the power of multiple heterogeneous information sources. Recent years have witnessed a dramatic increase in our ability to extract and collect data from the physical world. An important feature of the collected data is its wide variety, i.e., data about the same object can be obtained from various sources. For example, customer information can be found in multiple databases within a company; a patient's medical records may be scattered across different hospitals; a news event can be characterized by text, images, and video; and an activity can be captured by multiple surveillance cameras and live video feeds. Many interesting patterns cannot be extracted from a single data collection, but have to be discovered through the integrative analysis of all available heterogeneous data sources. My solution to the problem of learning from multiple sources is to extract trustworthy information from different sources and integrate their complementary perspectives to reach a more accurate and robust decision. My major research topics are summarized as follows.

Truth Discovery from Multi-Source Data
Representative work: [SIGMOD14,VLDB15,KDD15,KDD16]

An ongoing project is to detect true facts from multiple conflicting data sources. A huge amount of information is generated every day, and one of the fundamental difficulties is that freely created information is massive in volume but usually of low quality. Facing the daunting scale of data, it is unrealistic to expect humans to label the data or to tell which data source is more reliable or which piece of information is correct. Our position is to detect truths without supervision, by integrating source reliability estimation and truth finding. We developed a series of innovative techniques that are widely recognized as the state-of-the-art solutions to the truth discovery problem. These methods can successfully resolve conflicts among multiple sources of heterogeneous, dynamic, and long-tail data [SIGMOD14, VLDB15, KDD15, KDD15, KDD16, KDD16, SoCG16, TKDE16]. The proposed approaches have been shown to outperform not only naive majority voting and averaging schemes but also other truth discovery schemes. Our research in this direction can benefit numerous applications where critical decisions have to be made based on correct information extracted from noisy input sources. We have given several tutorials at recent KDD, VLDB and SDM conferences, and published an extensive survey on this topic [KDDEXP16].
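
The idea of jointly estimating source reliability and truths can be illustrated with a minimal sketch. It assumes continuous claims, at least two sources, every source reporting on every object, and a log-ratio weight update; the function name `truth_discovery` and the toy data are hypothetical and are not taken from the cited papers.

```python
import math


def truth_discovery(claims, n_iter=20, eps=1e-9):
    """Alternate between estimating truths and source weights.

    claims[s][o] is the value that source s reports for object o.
    Assumes at least two sources and that all sources cover all objects.
    """
    sources = list(claims)
    objects = list(claims[sources[0]])
    weights = {s: 1.0 for s in sources}  # start with equal trust
    truths = {}
    for _ in range(n_iter):
        # Truth step: weighted average of the claims for each object.
        total_w = sum(weights.values())
        truths = {
            o: sum(weights[s] * claims[s][o] for s in sources) / total_w
            for o in objects
        }
        # Weight step: a source with a smaller squared error gets a larger weight.
        errors = {
            s: sum((claims[s][o] - truths[o]) ** 2 for o in objects) + eps
            for s in sources
        }
        total_err = sum(errors.values())
        weights = {s: -math.log(errors[s] / total_err) for s in sources}
    return truths, weights


# Hypothetical toy input: sources "a" and "b" roughly agree, "c" is far off.
claims = {
    "a": {"x": 10.0, "y": 20.0},
    "b": {"x": 10.2, "y": 19.8},
    "c": {"x": 30.0, "y": 5.0},
}
truths, weights = truth_discovery(claims)
```

In this toy input, plain averaging would be pulled toward the outlying source, whereas the iterative scheme progressively down-weights it.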

Crowdsourcing Data Aggregation
Representative work: [KDD15,WSDM16]

Recent years have witnessed an astonishing growth of crowd-contributed data, which has become a powerful information source that covers almost every aspect of our lives, including traffic conditions, environmental conditions, health, public events, and many others. With the proliferation of mobile devices and social media platforms, any person can now publicize their observations about any activities, events or objects anywhere and at any time. The confluence of these enormous crowdsourced data can contribute to an inexpensive, sustainable and large-scale decision system that has never been possible before. In this research direction, we propose to extract "crowd wisdom" by wisely aggregating massive crowdsourced data. We adapted the truth discovery technique to the task of crowdsourced data aggregation, in which each participating user's weight is estimated based on the user's ability to give correct answers [IAAI16]. Experimental results on aggregating answers for the "Who Wants to Be a Millionaire" game demonstrate the significant improvement in accuracy achieved by the proposed weighted aggregation approach. In [WSDM16], we designed an effective budget allocation strategy that can adjust the allocation policy according to the required aggregation accuracy when only a fixed, tight budget is available during crowdsourcing. We also developed effective ways to model the expertise of crowd workers during crowdsourced data aggregation [KDD15, SDM16]. The proposed research can potentially benefit numerous applications where crowdsourced data are ubiquitous. In particular, we have investigated challenges in crowd sensing systems and developed novel approaches to data aggregation and privacy preservation in these systems [SenSys15,SenSys15].

Multi-source Information Trustworthiness Analysis
Representative work: [ICDM11,ICDM12,KDD13,SDM15,WWW15]

In recent years, information trustworthiness has become a serious issue as user-generated content prevails in our information world. We investigated the important problem of estimating information trustworthiness from the perspective of correlating and comparing multiple data sources. To a certain extent, the degree of consistency is an indicator of information reliability: information unanimously agreed upon by all the sources is more likely to be reliable. Based on this principle, we developed effective computational approaches to identify consistent information from multiple data sources [KDD13,ICDM12]. The idea was applied to network traffic anomaly detection and showed its advantages in detecting meaningful anomalies [SDM15, WWW15]. In a more general setting, we proposed to detect unusual, suspicious and anomalous behavior across multiple heterogeneous sources [ICDM11]. We proposed to link the various sources through the common knowledge hidden in the data and to detect the anomalies that do not follow the multi-source correlation patterns. By exploring the discrepancies across heterogeneous sources, our approaches can detect anomalies that cannot be found by traditional anomaly detection techniques and provide new insights into the application area.

Consensus Maximization among Multiple Sources
Representative work: [NIPS09,TKDE13]

In the task of classification, we need a training set consisting of labeled data to infer the correlations between feature values and class labels for future label prediction. Although unsupervised information sources do not directly generate classification results, they provide useful constraints for the task. To fully utilize all possible types of knowledge to facilitate classification, we proposed to calculate a consolidated solution for a set of objects by maximizing the consensus among both supervised predictions and unsupervised grouping constraints [NIPS09,TKDE13]. This consensus maximization approach crosses the boundary between supervised and unsupervised learning. With this framework, labeled data are no longer a requirement for successful classification; instead, the use of existing labeling efforts is maximized by integrating knowledge from relevant domains and unlabeled information sources. This framework's power has been demonstrated in many real-world problems. In particular, it was used to solve the problems of video categorization [MM09], network traffic anomaly detection [INFOCOM11], classification in sensor networks [SenSys11,RTSS12], and informative gene discovery [BCB12]. Results showed the advantages of the proposed method in combining heterogeneous channels of information to provide a robust and accurate solution. We also held a well-received tutorial at the SDM'10 conference based on these research results and a survey on ensemble approaches in supervised and unsupervised learning [SDM10]. Recently, we began to develop model combination techniques for alleviating overfitting and for conducting multi-label classification [KDD14].
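
To make the notion of consensus between supervised predictions and unsupervised groupings concrete, here is a simplified sketch in the spirit of the approach described above. It assumes that each classifier or clustering is represented as a set of groups of object indices, that classifier groups carry a class label while clustering groups do not, and that object and group class distributions are refined by alternating averaging. The function name, the `alpha` parameter, and the toy data are hypothetical.

```python
def consensus_maximization(groups, group_labels, n_objects, n_classes,
                           alpha=2.0, n_iter=100):
    """Alternately update group and object class distributions.

    groups[j]       : list of object indices placed in group j by some model
    group_labels[j] : class index if group j comes from a classifier,
                      None if it comes from a clustering
    Assumes every group is non-empty and every object is in some group.
    """
    # u[i] is the class distribution of object i, initially uniform.
    u = [[1.0 / n_classes] * n_classes for _ in range(n_objects)]
    member_of = [[] for _ in range(n_objects)]
    for j, members in enumerate(groups):
        for i in members:
            member_of[i].append(j)

    for _ in range(n_iter):
        # Group step: average the member objects, pulled toward the
        # classifier's label (strength alpha) when the group has one.
        q = []
        for members, label in zip(groups, group_labels):
            row = [sum(u[i][c] for i in members) for c in range(n_classes)]
            denom = float(len(members))
            if label is not None:
                row[label] += alpha
                denom += alpha
            q.append([v / denom for v in row])
        # Object step: average the groups the object belongs to.
        u = [
            [sum(q[j][c] for j in member_of[i]) / len(member_of[i])
             for c in range(n_classes)]
            for i in range(n_objects)
        ]
    return u


# Hypothetical toy input: two classifiers disagree on object 2,
# and one clustering places object 2 with objects 0 and 1.
groups = [[0, 1, 2], [3, 4, 5],   # classifier A: class 0, class 1
          [0, 1], [2, 3, 4, 5],   # classifier B: class 0, class 1
          [0, 1, 2], [3, 4, 5]]   # clustering C: unlabeled groups
group_labels = [0, 1, 0, 1, None, None]
u = consensus_maximization(groups, group_labels, n_objects=6, n_classes=2)
predicted = [row.index(max(row)) for row in u]
```

Here the clustering acts as a tie-breaker: it carries no labels of its own, but its grouping constraint decides which classifier the consolidated solution follows for the disputed object.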

Ensemble Learning on Stream Data
Representative work: [SDM07,ICDM07]

Ensemble methods, which combine competing models learnt from labeled data, have been demonstrated to be effective in many disciplines. My contribution is to utilize ensemble methods to improve classification performance on stream data, i.e., continuously arriving data. In fact, many real applications, such as chronic disease monitoring, traffic flow monitoring and electricity meter reading, generate such stream data, and our goal is to correctly classify an incoming data record based on the model learnt from historical labeled data. The challenge is the existence of distribution evolution, or concept drift, because one may never know how or when the distribution changes. We proposed robust model averaging frameworks that combine multiple supervised models, and demonstrated both formally and empirically that they can reduce generalization errors and outperform single models on stream data [SDM07, ICDM07, IEEEIC]. This work drew attention to the inevitable concept drifts in data streams, showed how traditional approaches become inapplicable when data distributions evolve continuously, and, most importantly, demonstrated the power of ensemble methods in stream data classification. A series of follow-up work tackled other challenges in stream data classification [ICDM08, PKDD10, ICDM10, ICDM11, KAIS11, TKDE11, TKDE12, PKDD10, SDM14], such as novel classes, time constraints and dynamic feature spaces.
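
As a rough illustration of model averaging on a stream, the sketch below trains one base model per labeled chunk and averages the class probabilities of the most recent models with equal weights. The nearest-centroid base learner on scalar inputs, the class name `AveragingEnsemble`, and the chunk data are all assumptions made for this example, not the setup used in the cited papers.

```python
import math
from collections import deque


def train_centroid_model(chunk):
    """Toy base learner: chunk is a list of (x, label) pairs with scalar x.

    Returns a function mapping x to a {label: probability} dict.
    """
    sums, counts = {}, {}
    for x, y in chunk:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    centroids = {y: sums[y] / counts[y] for y in sums}

    def predict_proba(x):
        # Closer centroids receive higher scores; normalize to probabilities.
        scores = {y: math.exp(-abs(x - c)) for y, c in centroids.items()}
        total = sum(scores.values())
        return {y: s / total for y, s in scores.items()}

    return predict_proba


class AveragingEnsemble:
    """Keeps the most recent models and averages their outputs equally."""

    def __init__(self, max_models=3):
        self.models = deque(maxlen=max_models)  # oldest model is dropped first

    def update(self, chunk):
        self.models.append(train_centroid_model(chunk))

    def predict_proba(self, x):
        avg = {}
        for model in self.models:
            for y, p in model(x).items():
                avg[y] = avg.get(y, 0.0) + p / len(self.models)
        return avg

    def predict(self, x):
        probs = self.predict_proba(x)
        return max(probs, key=probs.get)


# Hypothetical stream: the class centroids drift upward between chunks.
ensemble = AveragingEnsemble(max_models=3)
ensemble.update([(0.0, 0), (1.0, 0), (4.0, 1), (5.0, 1)])
ensemble.update([(1.0, 0), (2.0, 0), (6.0, 1), (7.0, 1)])
```

Equal weighting avoids committing to any single chunk's distribution, which is the intuition behind averaging when the timing and direction of drift are unknown.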

Multi-view Clustering and Fusion
Representative work: [SDM13,CIKM13]

Many real-world datasets comprise different representations, or views, which often provide information complementary to each other. To integrate information from multiple views in the unsupervised setting, multi-view clustering algorithms have been developed to cluster multiple views simultaneously and derive a solution that uncovers the common latent structure shared by the views. We proposed a novel NMF-based multi-view clustering algorithm that searches for a factorization giving compatible clustering solutions across multiple views [SDM13]. We also proposed a multimodal feature fusion framework to construct meaningful feature sets from image and text views [CIKM13]. This framework is trained as a combination of multi-modal deep networks with two integral components: an ensemble of image descriptors and a recursive bigram encoder with a fixed-length output feature vector. The proposed framework can not only model the unique characteristics of images or texts, but also take into account their correlations at the semantic level.

Multiple Source Transfer Learning
Representative work: [KDD08,SDM13]

In many applications, it is expensive or impossible to collect enough labeled data for accurate classification in the domain of interest (target domain); however, there are abundant labeled data in some relevant domains (source domains). For example, when recognizing gene names in biomedical literature, we may want to recognize gene names related to a new organism (e.g., honey bee), but we only have labeled data for some well-studied organisms (e.g., fly and yeast). The challenge is that the data from the source domains may be in a different feature space or follow a different data distribution compared with the data in the target domain. To solve this problem, we proposed a locally weighted ensemble framework to adapt useful knowledge from multiple source domains to the target domain [KDD08]. We developed and analyzed a new paradigm to effectively transfer knowledge from multiple sources when facing the challenges of imbalanced distributions and discrepancies between source and target label distributions [SDM13]. We also proposed online transfer learning methods that transfer knowledge from multiple domains in an online manner as data continuously arrive [CIKM13]. In addition, we proposed transfer learning methods for the tasks of denoising biological networks [BIBM12] and mobile device based arrhythmia detection [BCB13].

Outlier Detection in Networked Data
Representative work: [KDD10]

Networked data consists of node attribute values and relationships between nodes. For example, we can collect people's profiles and friendship networks from online social networks. In networked data, closely related objects that share the same properties or interests form a community. For example, a community in the blogosphere could consist of users mostly interested in cell phone reviews and news. Outlier detection in networked data can reveal important anomalous and interesting behavior that is not obvious if community information is ignored. An example could be a low-income person being friends with many rich people, even though his income is not anomalously low when considered over the entire population. To automatically detect such outliers, we proposed probabilistic approaches [KDD10] to characterize the normal patterns deeply embedded in networked data and find abnormal behavior. We further proposed to analyze the evolutionary behavior of temporal networks and take the time factor into consideration when identifying outliers [KDD12,ECML/PKDD12]. We later tackled the problem of outlier detection in heterogeneous information networks, in which nodes and links possess a variety of types [ASONAM13,ECML/PKDD13]. The critical points, events or activities detected by the proposed approaches can greatly benefit the security and safety of the cyber and physical world. Some of the research results are integrated into our book and conference tutorials.
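
The income example can be sketched with a much simpler stand-in for the probabilistic approaches mentioned above: a leave-one-out z-score computed within each community. The sketch assumes that community assignments are already given rather than inferred from the links, and the function name, threshold, and data are hypothetical.

```python
from statistics import mean, pstdev


def community_outliers(values, community, threshold=3.0):
    """Flag nodes whose value deviates strongly from the rest of their community.

    values[n]    : numeric attribute of node n (e.g., income)
    community[n] : community identifier of node n (assumed to be given)
    Returns a sorted list of flagged node names.
    """
    flagged = []
    for node, value in values.items():
        peers = [
            values[m] for m in values
            if m != node and community[m] == community[node]
        ]
        if len(peers) < 2:
            continue  # too few peers to estimate a spread
        spread = pstdev(peers)
        if spread == 0:
            continue  # peers are identical; z-score is undefined
        z = abs(value - mean(peers)) / spread
        if z > threshold:
            flagged.append(node)
    return sorted(flagged)


# Hypothetical data: node "r5" has an ordinary income overall,
# but an unusual one for the community it belongs to.
values = {
    "r1": 200, "r2": 210, "r3": 190, "r4": 205, "r5": 60,
    "a1": 50, "a2": 55, "a3": 60, "a4": 65, "a5": 58,
}
community = {n: ("rich" if n.startswith("r") else "avg") for n in values}
outliers = community_outliers(values, community)
```

The flagged node's value is identical to that of an unflagged node in the other community, so it would not stand out if community membership were ignored.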

Information Network Analysis
Representative work: [ICDM09,ICDM13]

Information networks are networks formed by individual components and their interactions. Examples include communication and computer systems, the Internet, biological networks, transportation systems, epidemic networks, criminal rings, and hidden terrorist networks. Due to the prevalence and importance of these networks, information network analysis has received considerable attention recently. My research in this area focuses on the following three major analytical tasks: 1) Topic modeling: identifying a set of popular topics discussed in information networks [ICDM09]; 2) Classification: predicting the role of each node in an information network by learning from nodes whose roles are labeled [ICDM13,ASONAM14]; 3) Evolution analysis: detecting and analyzing the evolution of dynamic networks [BCB12,ICDM13]. These challenging research problems were tackled by conducting integrative analysis of whole networks and analyzing both node and link behavior.

Discriminative Pattern Mining
Representative work: [KDD08]

Frequent patterns offer a way to handle datasets that do not have well-structured feature vectors. Traditional frequent pattern mining for classification is performed in two sequential steps: enumerating a set of frequent patterns, followed by feature selection. We proposed a novel one-step pattern mining approach that outputs highly compact and discriminative patterns [KDD08]. It builds a decision tree that partitions the data into different nodes. Then, at each node, it directly discovers a discriminative pattern to further divide the node's examples into purer subsets. Since the number of examples toward the leaf level is relatively small, the new approach is able to examine patterns with extremely low global support that could not be enumerated on the whole dataset by the two-step method.
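
The divide-and-conquer idea can be sketched as follows. This toy version assumes transactions are item sets with class labels, scores candidate patterns by information gain, and finds the best pattern at each node by brute-force enumeration of itemsets of size at most two instead of running a frequent pattern miner. All function names and the data are hypothetical.

```python
import math
from itertools import combinations


def entropy(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum(
        (labels.count(c) / n) * math.log2(labels.count(c) / n)
        for c in set(labels)
    )


def best_pattern(rows, max_len=2):
    """Return the itemset with the highest information gain on rows, or None."""
    base = entropy([y for _, y in rows])
    items = sorted(set().union(*(t for t, _ in rows)))
    best, best_gain = None, 0.0
    for k in range(1, max_len + 1):
        for combo in combinations(items, k):
            pattern = frozenset(combo)
            inside = [y for t, y in rows if pattern <= t]
            outside = [y for t, y in rows if not pattern <= t]
            if not inside or not outside:
                continue  # pattern does not split this node
            remainder = (
                len(inside) * entropy(inside) + len(outside) * entropy(outside)
            ) / len(rows)
            gain = base - remainder
            if gain > best_gain + 1e-12:
                best, best_gain = pattern, gain
    return best


def mine_patterns(rows, max_depth=3):
    """Recursively split rows on the best pattern and collect the patterns used."""
    if max_depth == 0 or len({y for _, y in rows}) < 2:
        return []  # depth limit reached or node is already pure
    pattern = best_pattern(rows)
    if pattern is None:
        return []
    left = [(t, y) for t, y in rows if pattern <= t]
    right = [(t, y) for t, y in rows if not pattern <= t]
    return ([pattern]
            + mine_patterns(left, max_depth - 1)
            + mine_patterns(right, max_depth - 1))


# Hypothetical data: the class is 1 only when items "a" and "b" co-occur.
rows = [
    (frozenset("ab"), 1), (frozenset("abc"), 1),
    (frozenset("a"), 0), (frozenset("b"), 0), (frozenset("c"), 0),
    (frozenset("ac"), 0), (frozenset("bc"), 0),
]
patterns = mine_patterns(rows)
```

Because each recursive call only sees the examples that reached its node, a pattern that is rare in the whole dataset can still be evaluated where it matters.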