Bioinformatics Data Set
Project 1
Schema Design for Biomedical Data Warehouse
Biomedical data are being generated at an explosive rate,
ranging from clinical test results to microarray gene expression profiles.
The scale and complexity of these data sets give rise to substantial challenges
in data management and analysis. Data warehouse and on-line analytical processing (OLAP)
technologies have been developed for business applications. It is highly desirable
that these technologies be
applied to biomedical data integration and mining. The major difficulty
lies in capturing and modeling diverse biological objects and their complex relationships.
There have been various logical data models proposed to specify
biomedical data in databases, including relational data models,
object-oriented data models, and multidimensional data models.
However, it is not clear yet which approach is the best for modeling
and analyzing data in biomedical data warehouses.
Please refer to hw1.pdf for details.
Project 2
Mining Association Rules from Gene Expression Data
Problem
1. Implement the Apriori algorithm to find all frequent itemsets.
2. Generate association rules based on the templates you specify.
Please refer to hw2.pdf for details.
- Please download Homework2.rar, which includes the instructions and the data set.
- data.txt
Gene expression data ("up" regulated or "down" regulated) for 100 samples and 100 time points.
Each row represents one sample.
Columns 2 through 101 each represent one time point.
Column 102 gives the disease for each sample.
- sample_output.txt
Sample output file of frequent itemset detection with a minimum support of 40.
The 1st column shows the support of the frequent itemset; the itemset itself is listed starting from the 2nd column.
The format of a frequent item is "sample id number" + "up (U) or down (D)".
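As a rough illustration of the first task, the following Python sketch runs a level-wise Apriori search over rows of "up"/"down" values. The function names, the item labels (1-based position in the row plus U or D), and the tiny in-line data are assumptions made for illustration only; hw2.pdf defines the actual item format and the required output.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset(itemset): support count} for all frequent itemsets.

    transactions: list of sets of items; min_support: absolute count.
    """
    # Level 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(current)

    k = 2
    while current:
        prev = set(current)
        # Join step: unite pairs of frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        # Count the support of the surviving candidates.
        current = {}
        for c in candidates:
            support = sum(1 for t in transactions if c <= t)
            if support >= min_support:
                current[c] = support
        frequent.update(current)
        k += 1
    return frequent


def row_to_items(values):
    """Turn one row of 'up'/'down' values into items such as '1U' or '2D'.

    Hypothetical labelling: 1-based position in the row plus U or D.
    """
    return {
        f"{i}{'U' if v.strip().lower() == 'up' else 'D'}"
        for i, v in enumerate(values, start=1)
    }


# Tiny made-up example, not taken from data.txt.
rows = [
    ["up", "down", "up"],
    ["up", "down", "down"],
    ["up", "up", "up"],
    ["down", "down", "up"],
]
transactions = [row_to_items(r) for r in rows]
frequent = apriori(transactions, min_support=2)
```

Association rules for the second task can then be derived from the returned support counts, since the confidence of a rule X -> Y is support(X and Y together) divided by support(X).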
Project 3
Biomedical Data Warehouse/OLAP System
In this project, you are asked to implement a clinical and genomic data warehouse based on your schema design using the Oracle system. A good data warehouse should satisfy the following requirements: 1) support regular and statistical OLAP operations; 2) be robust to potential changes in the future; and 3) support knowledge discovery.
Please refer to project1.pdf for details.
- Please download Project1.rar, which includes the instructions and the data set.
- All the data you need for Project 3 are provided in Project1.rar. All the files are tab delimited. You may open them in Excel for a better view.
Please note that the file structures may differ slightly from what is listed in the project handout.
- For some entities, we removed some attributes which are hard to understand and not essential to this project.
- For some entities, the handout omitted some important attributes. They have been added to the files.
The first row of each file describes the file structure; please read it carefully.
- If a patient has multiple samples, use the average value of those samples as the patient value when you do the OLAP operations, unless otherwise specified in the project handout.
- If a sample was tested by multiple experiments, use the average value of those experiments as the sample value, unless otherwise specified.
- If a gene corresponds to multiple microarray probes, use the average value of those probes as the gene value, unless otherwise specified.
- To save you from looking up the t-statistic table, we make the following assumptions:
- if the t-statistic value of a gene is greater than or equal to 1.0, the gene is regarded as an "informative gene";
- if the t-statistic value on the correlations is greater than or equal to 5.0, the patient is classified as "ALL".
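The averaging rules and the two cut-offs above can be sketched in Python as follows. The record layout, the function names, and the example numbers are hypothetical; how the t-statistic itself is computed is defined in the project handout and is not reproduced here.

```python
from collections import defaultdict
from statistics import mean

INFORMATIVE_T = 1.0  # cut-off for an "informative gene" (from the text above)
ALL_T = 5.0          # cut-off for classifying a patient as "ALL"


def average_by(records):
    """Collapse (key, value) pairs to one mean value per key.

    The same helper covers samples -> patient, experiments -> sample,
    and probes -> gene.
    """
    groups = defaultdict(list)
    for key, value in records:
        groups[key].append(value)
    return {key: mean(values) for key, values in groups.items()}


def is_informative(t_stat):
    """Apply the informative-gene cut-off to an already computed t-statistic."""
    return t_stat >= INFORMATIVE_T


def is_all(t_stat):
    """Apply the ALL cut-off to an already computed t-statistic."""
    return t_stat >= ALL_T


# Made-up values: patient p1 has two samples, p2 has one.
patient_values = average_by([("p1", 2.0), ("p1", 4.0), ("p2", 5.0)])
```

In a warehouse implementation the same averaging would more likely be done in SQL with a grouped average; the Python version is only meant to make the rule concrete.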
You are asked to classify new patients based on the informative genes you find. The microarray data for the new patients are recorded in the file
"test_samples.txt". The first row lists the names of the patients, while the first column lists the UIDs of the genes. Each of the other cells holds the expression
value of the corresponding gene in the corresponding patient. You do not need to load this file into your data warehouse. When you classify the new
patients, you can read the "test_samples.txt" file directly. All other data, however, must be retrieved from your data warehouse.
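Reading this layout directly takes only a few lines of Python. The sketch below assumes the file is tab delimited like the other project files, and it tolerates a header row either with or without a corner label above the gene UID column; check both assumptions against the actual file. The function name and the example content are hypothetical.

```python
import csv
import io


def read_expression_matrix(handle):
    """Return {patient_name: {gene_uid: expression_value}}."""
    rows = [r for r in csv.reader(handle, delimiter="\t") if r]
    header, body = rows[0], rows[1:]
    # If the header is as long as a data row, its first cell is a corner
    # label above the gene UID column rather than a patient name.
    if body and len(header) == len(body[0]):
        header = header[1:]
    patients = {name: {} for name in header}
    for row in body:
        gene_uid = row[0]
        for name, cell in zip(header, row[1:]):
            patients[name][gene_uid] = float(cell)
    return patients


# Made-up content in the described layout, not taken from the real file.
example = "UID\tP1\tP2\n1001\t0.5\t1.5\n1002\t2.0\t3.0\n"
matrix = read_expression_matrix(io.StringIO(example))
```

With the real file, the same function can be called on an open file handle, for example `with open("test_samples.txt", newline="") as f: matrix = read_expression_matrix(f)`.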
Project 4
Microarray Data Analysis
In the past few years, microarray technology has become one of the foremost tools in biological research. The emergence of this technology has empowered researchers in functional genomics to monitor gene expression profiles of thousands of genes (perhaps even an entire genome) at a time. However, mining microarray data also presents great challenges to bioinformatics research. This project will acquaint you with several basic approaches to analyzing microarray data from beginning to end. You will apply the techniques introduced in class to real-world microarray data sets and learn how to discover useful knowledge from them. This project will also help you understand the challenges in microarray data analysis and motivate you to develop novel approaches to addressing those challenges.
Please refer to project2.pdf for details.
- Please download Project2.rar, which includes the instructions and the data set.
- Clustering
For clustering, you can use 'cho.txt' and/or 'iyer.txt', which contain the expression value of each gene at each time point.
The first row gives the number of genes and the number of time points.
Each row from the second onward represents one gene.
The first column holds the gene_id, and the second column holds the ground-truth cluster label, which you can compare with your results (-1 marks an outlier).
Each column from the third onward represents one time point.
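A loader for this layout might look like the Python sketch below. It assumes whitespace-separated fields and integer cluster labels; the function name and the example content are hypothetical.

```python
import io


def read_clustering_file(handle):
    """Parse the clustering layout described above.

    Returns (gene_ids, truth_labels, expression_rows).
    """
    lines = [ln.split() for ln in handle if ln.strip()]
    n_genes, n_points = int(lines[0][0]), int(lines[0][1])
    gene_ids, truth, expression = [], [], []
    for fields in lines[1:]:
        gene_ids.append(fields[0])
        truth.append(int(fields[1]))  # -1 marks an outlier
        expression.append([float(x) for x in fields[2:]])
    # Sanity-check the data against the counts declared in the first row.
    if len(gene_ids) != n_genes:
        raise ValueError("gene count does not match the first row")
    if any(len(row) != n_points for row in expression):
        raise ValueError("time-point count does not match the first row")
    return gene_ids, truth, expression


# Made-up content in the described layout, not taken from cho.txt or iyer.txt.
example = "3 2\ng1 1 0.1 0.2\ng2 2 -0.3 0.4\ng3 -1 0.0 1.5\n"
gene_ids, truth, expression = read_clustering_file(io.StringIO(example))
```

Keeping the ground-truth labels separate from the expression rows makes it harder to feed them to the clustering algorithm by accident; they are only needed afterwards for evaluation.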
- Classification
For class prediction, you can use 'golub_*.txt', which contain the normalized expression value of each gene in each sample.
'golub_training.txt' is the training data. Each row represents one gene, and each column represents one sample. (The training data have 38 samples in total.)
'golub_test.txt' is the test data. Each row represents one gene, and each column represents one sample. (The test data have 34 samples in total.)
'golub_truth.txt' is the ground truth for all 72 samples (38 training samples + 34 test samples).
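Because each column is a sample, a per-sample feature vector is obtained by transposing the gene-by-sample matrix. The minimal Python sketch below shows only that step on made-up numbers; the function name is hypothetical, and it assumes the rows have already been parsed into lists of floats with any identifier columns removed.

```python
def samples_from_gene_rows(gene_rows):
    """Transpose a genes x samples matrix into one feature vector per sample."""
    return [list(column) for column in zip(*gene_rows)]


# Made-up matrix with 3 genes (rows) and 2 samples (columns).
gene_rows = [[0.1, 0.9], [0.2, 0.8], [0.3, 0.7]]
sample_vectors = samples_from_gene_rows(gene_rows)
```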