CSE 601: Data Mining and Bioinformatics

Fall 2013

Basic Information
Overview

This course covers the fundamental techniques of data mining, including data warehousing, frequent pattern mining, clustering, classification, anomaly detection, and feature selection. To demonstrate how these techniques are applied across domains, the course focuses on the design of bioinformatics software systems and discusses applications of data warehousing and data mining in biological and biomedical fields. The class will examine a range of software systems to give students a comprehensive understanding of the bioinformatics field. Projects will be designed around these applications.

Textbooks
  • Data Mining: Concepts and Techniques, 3rd ed. Jiawei Han, Micheline Kamber, and Jian Pei, ISBN-13: 978-0-12-381479-1, Morgan Kaufmann Publishers.

  • Introduction to Data Mining. Pang-Ning Tan, Michael Steinbach, and Vipin Kumar, Addison Wesley.

  • Data Warehousing. Paulraj Ponniah. John Wiley & Sons, Inc.
References
  • Bioinformatics: Managing Scientific Data. Zoe Lacroix and Terence Critchlow. 2003. Morgan Kaufmann Publishers.

  • Advanced Analysis of Gene Expression Microarray Data. Aidong Zhang. ISBN 981-256-645-7. World Scientific Publishing Co.
Grading Policy

Grades will be computed based on the following factors (subject to change):
  • Class participation -- 5%

  • Quizzes -- 25%

  • Projects (3) -- 45%

  • Homework (3-4) -- 25%
Course Topics and Schedule

The lecture slides were developed based on materials from several sources. Please see the copyright notice.

Date | Topic | Assignment | Readings
August 27 | Introduction | N/A | N/A
August 29 | Data Warehouse | N/A | Chapters 4&5 (Han), Ponniah
September 3 | Data Warehouse | N/A | Chapters 4&5 (Han), Ponniah
September 10 | Homework 1 Presentation | Homework 1 due | N/A
September 12 | Data Warehouse | N/A | Chapters 4&5 (Han), Ponniah
September 17 | Data Warehouse | N/A | Chapters 4&5 (Han), Ponniah
September 19 | Association Rule | N/A | Chapters 6&7 (Han), Chapters 6&7 (Tan)
September 24 | Clustering Basics | N/A | Chapters 8&9 (Tan), Chapters 10&11 (Han), [Jain10]
September 26 | Clustering Basics; Partitional Clustering | N/A | Chapters 8&9 (Tan), Chapters 10&11 (Han), [Jain10]
October 1 | Partitional Clustering; Hierarchical Clustering | N/A | Chapters 8&9 (Tan), Chapters 10&11 (Han)
October 3 | Hierarchical Clustering; Density-based Clustering | Homework 2 due | Chapters 8&9 (Tan), Chapters 10&11 (Han)
October 8 | Mixture Model; Spectral Clustering | Project 2 out | [DoBa08], [Luxburg07], [ShMa00]
October 10 | Project 1 Demo | Project 1 due | N/A
October 15 | Spectral Clustering; MapReduce | N/A | [Luxburg07], [ShMa00], [Lin10], [SPY10]
October 17 | MapReduce | N/A | [Lin10], [SPY10]
October 22 | Principal Component Analysis | N/A | [Smith02]
October 24 | Clustering: Other Topics | N/A | Chapters 10&11 (Han)
October 29 | Project 2 Demo | Project 2 due | N/A
October 31 | Network Mining | N/A |
November 5 | Network Mining | Homework 3 out |
November 7 | Classification: Basics | N/A | Chapters 4&5 (Tan), Chapters 8&9 (Han)
November 12 | Classification: Basics; Classification: Methods | N/A | Chapters 4&5 (Tan), Chapters 8&9 (Han)
November 14 | Classification: Basics; Classification: Methods | N/A | Chapters 4&5 (Tan), Chapters 8&9 (Han), [Mitchell05], [NgNotes]
November 19 | Classification: Methods | Project 3 out | Chapters 4&5 (Tan), Chapters 8&9 (Han), [SeEl10]
November 21 | Classification: Methods; Classification: Advanced Topics | N/A | [SeEl10], [GFJ10], [Zhu08], [PaYa10]
November 26 | Classification: Advanced Topics | N/A | [GFJ10], [Zhu08], [PaYa10]
December 3 | Anomaly Detection | N/A | Chapter 10 (Tan), Chapter 12 (Han), [CBK09]
December 5 | Anomaly Detection | N/A | [GGA+13]

Homeworks & Projects

Homework 1: Schema Design for Data Warehouse: Due September 10.
Project 1: Data Warehouse/OLAP System: Due October 10.
Homework 2: Mining Association Rules from Gene Expression Data: Due October 3.
Project 2: Clustering Algorithms: Code and report due October 28 and demo on October 29.
Homework 3: Clustering Analysis for Complex Networks: Due November 21 before class.
Project 3: Classification Algorithms: Code and report due December 11 and demo on December 12.

Supplementary Materials

[Jain10] Anil K. Jain. Data clustering: 50 years beyond K-means. Pattern Recognition Letters 31(8): 651-666, 2010. [Paper]
[Luxburg07] Ulrike von Luxburg. A Tutorial on Spectral Clustering. Statistics and Computing 17(4), 2007. [Paper]
[ShMa00] Jianbo Shi and Jitendra Malik. Normalized Cuts and Image Segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 22(8), 2000. [Paper]
[DoBa08] Chuong B. Do and Serafim Batzoglou. What is the expectation maximization algorithm? Nature Biotechnology 26(8): 897-899, 2008. [Paper]
[Lin10] Jimmy Lin. Data-Intensive Information Processing Applications. University of Maryland, 2010. [Link]
[Mitchell05] Tom Mitchell. Machine Learning (sample chapter on Naive Bayes and Logistic Regression), 2005. [Chapter]
[NgNotes] Andrew Ng. CS 229 Lecture Notes on Support Vector Machines. [Notes]
[SeEl10] Giovanni Seni and John F. Elder. Ensemble Methods in Data Mining: Improving Accuracy Through Combining Predictions. Morgan & Claypool Publishers, 2010.
[SPY10] Jimeng Sun, Spiros Papadimitriou, and Rong Yan. Large-scale data mining: MapReduce and beyond. Tutorial at the ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), Washington, DC, 2010. [Video]
[Smith02] Lindsay I. Smith. A Tutorial on Principal Components Analysis, 2002. [Paper]
[GFJ10] Jing Gao, Wei Fan, and Jiawei Han. On the Power of Ensemble: Supervised and Unsupervised Methods Reconciled. Tutorial at the SIAM Data Mining Conference (SDM), Columbus, OH, 2010. [Link]
[Zhu08] Xiaojin Zhu. Semi-supervised Learning Literature Survey. University of Wisconsin Madison, 2008. [Link]
[PaYa10] Sinno Jialin Pan and Qiang Yang. A Survey on Transfer Learning. IEEE Transactions on Knowledge and Data Engineering 22(10): 1345-1359, 2010. [Link]
[CBK09] Varun Chandola, Arindam Banerjee, and Vipin Kumar. Anomaly Detection: A Survey. ACM Computing Surveys 41(3), Article 15, 2009. [Link]
[GGA+13] Manish Gupta, Jing Gao, Charu Aggarwal, and Jiawei Han. Outlier Detection for Temporal Data. Tutorial at the ACM International Conference on Information and Knowledge Management (CIKM), San Francisco, CA, 2013. [Link]