This website is based upon work supported by the National Science Foundation under Grant No. IIS-1319973, collaborative with NSF IIS-1320617. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
Truth discovery is an emerging field in the data management and data mining communities. When conflicting information is collected from multiple sources, it is important to find the reliable sources and identify the true facts. The traditional conflict resolution approach, majority voting, often fails because sources differ in reliability. Truth discovery can detect the truth even when it is held by only a few sources, provided those few are reliable. By exploiting the wisdom of the minority, it significantly improves data aggregation performance.
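The contrast between majority voting and truth discovery can be illustrated with a minimal sketch. It assumes a toy iterative weighted-voting scheme (smoothed accuracy turned into log-odds weights) and is not any of the project's published methods. The data, source names, and function names are all hypothetical.

```python
import math
from collections import defaultdict

# Hypothetical toy data: claims[source][object] = claimed value.
# The true answer to every question is "yes".
claims = {
    "s1": {"q1": "yes", "q2": "yes", "q3": "yes", "q4": "yes", "q5": "yes", "q6": "yes"},
    "s2": {"q1": "yes", "q2": "yes", "q3": "yes", "q4": "yes", "q5": "yes", "q6": "yes"},
    "s3": {"q1": "no", "q2": "no", "q3": "yes", "q4": "yes", "q5": "yes", "q6": "no"},
    "s4": {"q1": "yes", "q2": "yes", "q3": "no", "q4": "no", "q5": "yes", "q6": "no"},
    "s5": {"q1": "no", "q2": "yes", "q3": "yes", "q4": "yes", "q5": "no", "q6": "no"},
}


def weighted_vote(claims, weights):
    """Pick, for each object, the value with the largest total source weight."""
    scores = defaultdict(lambda: defaultdict(float))
    for source, answers in claims.items():
        for obj, value in answers.items():
            scores[obj][value] += weights[source]
    # sorted() makes tie-breaking deterministic.
    return {obj: max(sorted(vals), key=vals.get) for obj, vals in scores.items()}


def discover_truths(claims, iterations=10):
    """Alternate between estimating truths and re-estimating source weights."""
    weights = {source: 1.0 for source in claims}
    truths = weighted_vote(claims, weights)  # first pass equals majority voting
    for _ in range(iterations):
        for source, answers in claims.items():
            correct = sum(truths[obj] == value for obj, value in answers.items())
            # Laplace-smoothed accuracy, mapped to a log-odds weight (an assumption).
            accuracy = (correct + 1) / (len(answers) + 2)
            weights[source] = math.log(accuracy / (1 - accuracy))
        new_truths = weighted_vote(claims, weights)
        if new_truths == truths:
            break
        truths = new_truths
    return truths, weights


majority = weighted_vote(claims, {source: 1.0 for source in claims})
truths, weights = discover_truths(claims)
print("majority vote for q6:", majority["q6"])
print("truth discovery for q6:", truths["q6"])
```

In this toy data, three error-prone sources agree on the wrong answer for `q6`, so majority voting picks it. Because those sources also disagree with the consensus elsewhere, their weights drop, and the two consistent sources prevail on `q6` after reweighting.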
This project contributes to the development of this emerging field by developing truth discovery and information integration techniques that tackle unsolved challenges in this task. Specifically, we developed novel truth discovery methods for data of heterogeneous types, data with long-tail distributions, streaming and time series data, distributed data, and textual data. We also modeled correlations among sources and objects, derived fine-grained reliability degrees of sources and confidence degrees of the truths, considered whether true claims exist in the data set at all, and provided geometric interpretations of the truth discovery approach.
The effectiveness of the developed approaches was demonstrated on a variety of datasets drawn from multiple application scenarios, including crowdsourcing question answering, Internet information fusion, weather forecast integration, drug side-effect discovery, air quality monitoring, and indoor floorplan construction. These approaches can potentially benefit any other application in which decisions have to be made based on the reliable information extracted from diverse and heterogeneous sources.
Research results from this project were presented at top conferences in the data science field, including KDD, VLDB, SIGMOD, SDM, ICDM, and CIKM. In this project, we conducted an extensive survey on truth discovery, which was published in SIGKDD Explorations, and we presented an overview of the truth discovery field in several tutorials at the VLDB, KDD, SDM, and CIKM conferences. The PI gave invited talks at workshops, industrial labs, and universities to present the research results of this project. The PI also discussed this research with high school and undergraduate students at various outreach activities that promote "Women in STEM" at UB.
This project trained six PhD students, one master's student, and seven undergraduate students, including three female students and one African American student. Their research skills improved greatly through this project, as demonstrated by their publications in top conferences and journals. In particular, two PhD students received the "Best PhD Dissertation Award" in the Department of Computer Science and Engineering, University at Buffalo, in 2017 and 2018, respectively. The majority of the research results in their dissertations came from this project. Research in this project has been integrated into the PI's data mining courses and seminars at UB via course projects and lectures.
KDD18 |
TruePIE: Discovering Reliable Patterns in Pattern-Based Information Extraction. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, London, UK, August 2018, 1675-1684. Acceptance Rate: 107/983 = 10.89%. |
KDD18 |
TextTruth: An Unsupervised Approach to Discover Trustworthy Information from Multi-Sourced Text Data. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, London, UK, August 2018, 2729-2737. Acceptance Rate: 107/983 = 10.89%. |
SDM18 |
Online Truth Discovery on Time Series Data. SIAM International Conference on Data Mining, San Diego, CA, May 2018, 162-170. Acceptance Rate: 23.2%. |
ICDM17 |
Discovering Truths from Distributed Data. IEEE International Conference on Data Mining, New Orleans, LA, November 2017, 505-514. Acceptance Rate: 72/778 = 9.25%. |
IJCAI17 |
A Correlated Topic Model Using Word Embeddings. International Joint Conference on Artificial Intelligence, Melbourne, Australia, August 2017, 4207-4213. Acceptance Rate: 660/2540 = 26%. |
KDD17 |
Dipole: Diagnosis Prediction in Healthcare via Attention-based Bidirectional Recurrent Neural Networks. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Halifax, Canada, August 2017, 1903-1911. Acceptance Rate: 85/396 = 21.5%. |
KDD17 |
Unsupervised Discovery of Drug Side-Effects From Heterogeneous Data Sources. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Halifax, Canada, August 2017, 967-976. Acceptance Rate: 131/784 = 16.7%. |
KDD17 |
Collaboratively Improving Topic Discovery and Word Embeddings by Coordinating Global and Local Contexts. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Halifax, Canada, August 2017, 535-543. Acceptance Rate: 131/784 = 16.7%. |
SDM17 |
Detecting Malicious Behavior in Computer Networks via Cost-Sensitive and Connectivity Constrained Classification. SIAM International Conference on Data Mining, Houston, TX, April 2017, 117-125. Acceptance Rate: 26%. |
ICDM16 |
Topic Discovery for Short Texts Using Word Embeddings. IEEE International Conference on Data Mining, Barcelona, Spain, December 2016, 1299-1304. Acceptance Rate: 178/904 = 19.6%. |
CIKM16 |
Influence-Aware Truth Discovery. ACM International Conference on Information and Knowledge Management, Indianapolis, IN, October 2016, 851-860. Acceptance Rate: 165/935 = 17.6%. |
KDD16 |
A Truth Discovery Approach with Theoretical Guarantee. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, August 2016, 1925-1934. Acceptance Rate: 142/784 = 18.1%. [Paper in PDF] |
KDD16 |
Towards Confidence in the Truth: A Bootstrapping based Truth Discovery Approach. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, August 2016, 1935-1944. [Paper in PDF] |
KDD16 |
From Truth Discovery to Trustworthy Opinion Discovery: An Uncertainty-Aware Quantitative Modeling Approach. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, August 2016, 1885-1894. |
TKDE |
Conflicts to Harmony: A Framework for Resolving Conflicts in Heterogeneous Data by Truth Discovery. IEEE Transactions on Knowledge and Data Engineering, 28(8): 1986-1999, August 2016. |
SoCG16 |
Finding Global Optimum for Truth Discovery: Entropy Based Geometric Variance. International Symposium on Computational Geometry, Boston, MA, June 2016, 34:1-34:16. |
SenSys15 |
Truth Discovery on Crowd Sensing of Correlated Entities. ACM International Conference on Embedded Networked Sensor Systems, Seoul, South Korea, November 2015, 169-182. Acceptance Rate: 27/132 = 20.5%. |
SenSys15 |
Truth Discovery on Crowd Sensing of Correlated Entities. ACM International Conference on Embedded Networked Sensor Systems, Seoul, South Korea, November 2015, 183-196. Acceptance Rate: 27/132 = 20.5%. |
VLDB15 |
A Confidence-Aware Approach for Truth Discovery on Long-Tail Data. International Conference on Very Large Data Bases, Kohala Coast, HI, August 2015, 8(4): 425-436. |
KDD15 |
On the Discovery of Evolving Truth. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Sydney, Australia, August 2015, 675-684. |
KDD15 |
FaitCrowd: Fine Grained Truth Discovery for Crowdsourced Data Aggregation. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Sydney, Australia, August 2015, 745-754. |
KDD15 |
Modeling Truth Existence in Truth Discovery. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Sydney, Australia, August 2015, 1543-1552. |
SDM15 |
Believe It Today or Tomorrow? Detecting Untrustworthy Information from Dynamic Multi-Source Data. SIAM International Conference on Data Mining, Vancouver, Canada, April 2015, 397-405. |
SDM15 |
OnlineCM: Real-time Consensus Classification with Missing Values. SIAM International Conference on Data Mining, Vancouver, Canada, April 2015, 685-693. |
KDD14 |
Class-Distribution Regularized Consensus Maximization for Alleviating Overfitting in Model Combination. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, August 2014, 303-312. Acceptance Rate: 151/1036 = 14.6%. [Paper in PDF] [BIBTEX] |
IAAI14 |
Crowdsourcing for Multiple-Choice Question Answering. Annual Conference on Innovative Applications of Artificial Intelligence, Quebec City, Canada, July 2014, 2946-2953. [Paper in PDF] |
SIGMOD14 |
Resolving Conflicts in Heterogeneous Data by Truth Discovery and Source Reliability Estimation. ACM SIGMOD International Conference on Management of Data, Snowbird, UT, June 2014, 1187-1198. [Paper in PDF] [Code&Data in ZIP] [More Information] [BIBTEX] |
Survey |
A Survey on Truth Discovery. SIGKDD Explorations, 17(2): 1-16, December 2015. [Paper]
Tutorial |
Truth Discovery for Passive and Active Crowdsourcing. ACM International Conference on Information and Knowledge Management (CIKM'16), Indianapolis, IN, October 2016. [Slides] |
Tutorial |
Enabling the Discovery of Reliable Information from Passively and Actively Crowdsourced Data. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'16), San Francisco, CA, August 2016.
Tutorial |
Towards Veracity Challenge in Big Data. SIAM International Conference on Data Mining (SDM'16), Miami, FL, May 2016. [Slides] |
Tutorial |
Truth Discovery and Crowdsourcing Aggregation: A Unified Perspective. International Conference on Very Large Data Bases (VLDB'15), Kohala Coast, HI, August 2015. [Slides] |