Jing Gao

Code and Data Sets

Truth Discovery on Heterogeneous Data

Relevant Paper: [SIGMOD14]

Conflict Resolution on Heterogeneous Data (CRH) is a method that resolves conflicts among multiple sources with heterogeneous data types. In many applications, descriptions of the same objects or events are available from a variety of sources, which inevitably leads to conflicting data or information. One important problem is to identify the true information (i.e., the truths) among the conflicting sources. It is intuitive to trust reliable sources more when deriving the truths, but which sources are more reliable is usually unknown a priori. Moreover, each source describes a variety of properties with different data types, so an accurate estimate of source reliability has to model multiple properties jointly in a unified model. We formulate the problem as an optimization framework in which the truths and the source reliabilities are two sets of unknown variables. The objective is to minimize the overall weighted deviation between the truths and the multi-source observations, where each source is weighted by its reliability. We solve the optimization problem with a two-step iterative procedure that alternates between computing the truths and computing the source weights. The advantage of this framework is that it can accommodate various loss and regularization functions, so that different data types and weight distributions are characterized effectively.
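The two-step procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the released CRH code: the function name `crh`, the dense `(sources, objects)` array layout without missing values, the normalized squared loss for continuous properties, the 0-1 loss for categorical properties, and the negative-log weight update are all assumptions chosen as one plausible instantiation of the framework.

```python
import numpy as np

def crh(continuous, categorical, n_iter=20, eps=1e-12):
    """Alternate between truth and source-weight updates (illustrative sketch).

    continuous:  (K, N) float array, one row of claims per source.
    categorical: (K, N) int array of label ids, one row per source.
    Returns (truth_cont, truth_cat, weights).
    """
    n_sources, n_objects = continuous.shape
    n_labels = int(categorical.max()) + 1
    weights = np.ones(n_sources)
    cols = np.arange(n_objects)
    # Spread of the claims per object, used to put continuous losses
    # on a comparable scale (an assumed normalization).
    scale = continuous.std(axis=0) + eps

    for _ in range(n_iter):
        # Step 1: update truths with the source weights held fixed.
        # Squared loss -> weighted mean; 0-1 loss -> weighted vote.
        truth_cont = weights @ continuous / weights.sum()
        votes = np.zeros((n_labels, n_objects))
        for k in range(n_sources):
            votes[categorical[k], cols] += weights[k]
        truth_cat = votes.argmax(axis=0)

        # Step 2: update source weights with the truths held fixed.
        loss = ((continuous - truth_cont) ** 2 / scale).sum(axis=1)
        loss = loss + (categorical != truth_cat).sum(axis=1)
        # Sources with a smaller share of the total loss get larger weights.
        weights = -np.log((loss + eps) / (loss.sum() + n_sources * eps))

    return truth_cont, truth_cat, weights
```

Called on a small array in which two sources roughly agree and a third deviates on both property types, the sketch should assign the deviating source the smallest weight and pull the continuous truths toward the two agreeing sources.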

The weather forecast integration data set is a good test bed for the task of integrating multiple sources of heterogeneous data. We integrate weather forecasting data collected from three platforms: Wunderground, HAM weather, and World Weather Online. On each platform, we crawled the forecasts made on three different days and treated them as three different sources, so there are nine sources altogether. For each source, we collected data on three properties: high temperature, low temperature, and weather condition. The first two are continuous and the last is categorical. To obtain ground truths, we crawled the true weather information for each day. We collected the data for twenty US cities over a month.

Consensus Maximization among Multiple Supervised and Unsupervised Models

Relevant Papers: [NIPS09], [TKDE13], [KDD14]

[Code&Data in ZIP]

Bipartite Graph-based Consensus Maximization (BGCM) is a method that consolidates a classification solution by maximizing the consensus among both supervised predictions and unsupervised constraints. Ensemble classifiers such as bagging, boosting, and model averaging are known to improve accuracy and robustness over a single model. Their potential, however, is limited in applications that have access only to meta-level model output and not to the raw data. In this paper, we study ensemble learning with output from multiple supervised and unsupervised models, a topic where little work has been done. Although unsupervised models, such as clustering, do not directly generate a label prediction for each individual object, they provide useful constraints for the joint prediction of a set of related objects. We cast this ensemble task as an optimization problem on a bipartite graph, where the objective function favors smoothness of the prediction over the graph and penalizes deviations from the initial labeling provided by the supervised models. We solve this problem through iterative propagation of probability estimates among neighboring nodes. Our method can also be interpreted as conducting a constrained embedding in a transformed space, or a ranking on the graph.
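The iterative propagation on the bipartite graph can be illustrated with the following Python sketch. It is not the code in the ZIP archive: the helper names `build_graph` and `bgcm`, the penalty parameter `alpha`, and the exact update rule (each group node averages its member objects plus its initial label, each object node averages its groups) are assumptions made for illustration. Every group is assumed to contain at least one object.

```python
import numpy as np

def build_graph(classifier_preds, cluster_assignments, n_classes):
    """Build the object-group affinity matrix from model outputs.

    Each classifier contributes one group per class; each clustering
    contributes one group per cluster. Returns (A, Y, is_supervised).
    """
    blocks, label_rows, supervised = [], [], []
    for pred in classifier_preds:
        blocks.append(np.eye(n_classes)[pred])        # (n, n_classes) one-hot
        label_rows.append(np.eye(n_classes))          # group j starts as class j
        supervised += [True] * n_classes
    for assign in cluster_assignments:
        n_clusters = int(assign.max()) + 1
        blocks.append(np.eye(n_clusters)[assign])     # (n, n_clusters) one-hot
        label_rows.append(np.zeros((n_clusters, n_classes)))  # no initial label
        supervised += [False] * n_clusters
    return np.hstack(blocks), np.vstack(label_rows), np.array(supervised)

def bgcm(A, Y, is_supervised, alpha=2.0, n_iter=50):
    """Propagate class probabilities between object and group nodes."""
    n_objects, _ = A.shape
    n_classes = Y.shape[1]
    U = np.full((n_objects, n_classes), 1.0 / n_classes)  # object estimates
    d_obj = A.sum(axis=1, keepdims=True)                   # groups per object
    d_grp = A.sum(axis=0)[:, None]                         # objects per group
    k = is_supervised.astype(float)[:, None]
    for _ in range(n_iter):
        # Group update: average of member objects, pulled toward the
        # initial label for groups that come from supervised models.
        Q = (A.T @ U + alpha * k * Y) / np.maximum(d_grp + alpha * k, 1e-12)
        # Object update: average of the groups the object belongs to.
        U = (A @ Q) / d_obj
    return U, Q
```

For example, with two classifiers that disagree on one object and a clustering that places that object with the objects of one class, `bgcm` returns a row of `U` for the disputed object that leans toward the class supported by the clustering.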

20 Newsgroups categorization refers to the task of classifying newsgroup messages according to topics. We construct six learning tasks, each of which involves four classes. We used the version of the dataset in which the newsgroup messages are sorted by date and separated into training and test sets. The test sets are our target sets. We learn logistic regression and SVM models from the training sets, and apply these models, as well as K-means and min-cut clustering algorithms, to the target sets.

Cora research paper classification refers to the task of classifying a set of research papers into their areas. From the original dataset, we extract four target sets, each of which includes papers from around four areas. The training sets contain research papers that are different from those in the target sets. Both training and target sets have two views: the paper abstracts and the paper citations. We apply logistic regression classifiers and K-means clustering algorithms to the two views of the target sets.

DBLP author classification refers to the task of predicting authors' research areas. We retrieve around 4,000 authors from DBLP. The training sets are drawn from a different domain, i.e., the conferences in each research field. There are also two views for both training and target sets: the publication network and the textual content of the publications. The number of papers an author published in a conference can be regarded as the link feature, whereas the pool of titles that an author published is the text feature. Logistic regression and K-means clustering algorithms are used to derive the predictions on the target set. We manually label the target set for evaluation.

Multiple Source Transfer Learning

Relevant Papers: [KDD08], [SDM13], [CIKM13]

[Code&Data in TAR.GZ]

Locally Weighted Ensemble (LWE) is a method that combines multiple models for transfer learning, where the weights are dynamically assigned according to each model's predictive power on each test example. It is developed for transfer learning scenarios where we learn from several training domains and make predictions in a different but related test domain. LWE can integrate the advantages of various learning algorithms and the labeled information from multiple training domains into one unified classification model, which can then be applied to the test domain. Importantly, none of the base learning methods needs to be specifically designed for transfer learning. We show the optimality of a locally weighted ensemble framework as a general approach to combining multiple models for domain transfer. We then propose an implementation of the local weight assignments that maps the structures of a model onto the structures of the test domain, and then weights each model locally according to its consistency with the neighborhood structure around the test example.
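The idea of weighting each model per test example can be sketched as follows. This is a simplified illustration rather than the code in the TAR.GZ archive: the function names `local_weights` and `lwe_predict` are hypothetical, and the consistency score (the Jaccard overlap between the examples a model labels the same as the test example and the examples a clustering of the test set groups with it) is an assumed proxy for agreement with the neighborhood structure.

```python
import numpy as np

def local_weights(model_labels, cluster_ids):
    """Per-example model weights from agreement with a test-set clustering.

    model_labels: (M, N) int array, predicted label of each model per example.
    cluster_ids:  (N,) int array, cluster of each test example.
    Returns an (M, N) array whose columns sum to 1.
    """
    n_models, n_examples = model_labels.shape
    # same_cluster[i, j] is True when examples i and j share a cluster.
    same_cluster = cluster_ids[:, None] == cluster_ids[None, :]
    weights = np.zeros((n_models, n_examples))
    for m in range(n_models):
        labels = model_labels[m]
        same_label = labels[:, None] == labels[None, :]
        # Jaccard overlap of the two neighborhoods around each example.
        inter = (same_cluster & same_label).sum(axis=1)
        union = (same_cluster | same_label).sum(axis=1)
        weights[m] = inter / union  # union >= 1 because i is its own neighbor
    return weights / weights.sum(axis=0, keepdims=True)

def lwe_predict(model_probs, weights):
    """Combine (M, N, C) class probabilities with (M, N) local weights."""
    return (weights[:, :, None] * model_probs).sum(axis=0)
```

A model whose predicted labels follow the cluster boundaries of the test set receives a larger weight around each example than a model whose labels cut across clusters, so the combined prediction follows the more consistent model locally.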

The synthetic dataset contains two training sets and a test set that are generated from several Gaussian distributions with the same variance. Each training set has 40 positive and 20 negative examples, and the test set has 20 positive and 40 negative examples.

Email spam filtering contains a training set of publicly available messages and three sets of email messages from individual users as test sets. The dataset was obtained from the ECML/PKDD 2006 Discovery Challenge. The 4000 labeled examples in the training set and the 2500 test examples for each of the three users differ in their word distributions. The aim is to design a server-based spam filter that is learned from public sources and transferred to individual users.

Document classification contains two text classification collections that have a two-level topic hierarchy. The 20 Newsgroups dataset contains approximately 20,000 newsgroup documents, partitioned nearly evenly across 20 different newsgroups. The Reuters-21578 corpus contains Reuters news articles from 1987. From the two text collections, we generate nine cross-domain learning tasks, each of which involves a top-category classification problem in which the training and test data are drawn from different sub-categories. The strategy is to split the sub-categories among the training and test sets so that the distributions of the two sets are similar but not exactly the same. The tasks are generated in the same way as in [W. Dai et al. 07].

Intrusion detection contains three datasets that are generated from the KDD Cup'99 dataset. The original dataset consists of a series of TCP connection records for a local area network. Each example in the data set corresponds to a connection, which is labeled as either normal or an attack, with exactly one specific attack type. Some high-level features are used to distinguish normal connections from attacks, including host, service, and traffic features. Attacks fall into four main categories: DOS (denial-of-service), R2L (unauthorized access from a remote machine), U2R (unauthorized access to local superuser privileges), and Probing (surveillance and other probing). We create three data sets, each containing a set of randomly selected normal examples and a set of attacks from one category. Three cross-domain learning tasks are then generated by training on two types of attacks to detect another type of attack.

CSE department

SUNY Buffalo