Genome-wide integration on transcription factors, histone acetylation and gene expression reveals genes co-regulated by histone modification patterns

by Yayoi Natsume-Kitatani, Motoki Shiga and Hiroshi Mamitsuka

This support page provides the MATLAB source code and the data files needed to reproduce the results shown in the paper.

To reproduce the results, follow the instructions below.

1. Load the resources listed below into MATLAB.

Source code

SelectCell.mat

normvector.mat

movmf_m.mat

Input datasets

genelist_GP.xls: gene list for GSE9217 (dataset GP)

genelist_ES.xls: gene list for GSE9840 (dataset ES)

genelist_TFHM.xls: gene list shared by the two ChIP-chip datasets (datasets TR and AH+)

matrix_GP.txt: gene expression profile in GSE9217 (dataset GP)

matrix_ES.txt: gene expression profile in GSE9840 (dataset ES)

matrix_TR.txt: binding t-CDFs for 1756 genes in dataset TR

matrix_AHplus.txt: binding t-CDFs for 1756 genes in dataset AH+
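
As a rough sketch of this loading step, the input files could be read as shown below. The file layout is an assumption (plain numeric tables with one row per gene and no header line, and one gene identifier per row in the first column of each .xls file), and readmatrix and readcell require MATLAB R2019a or later. The gene-list format that SelectCell expects is not stated here, so the cell arrays may need converting.

```matlab
% Assumption: the .txt files are plain numeric tables, one row per gene,
% with no header line. Adjust the import options if your copies differ.
matrix_TR     = readmatrix('matrix_TR.txt');
matrix_AHplus = readmatrix('matrix_AHplus.txt');
matrix_GP     = readmatrix('matrix_GP.txt');

% Assumption: each .xls file holds one gene identifier per row.
genelist_GP   = readcell('genelist_GP.xls');
genelist_TFHM = readcell('genelist_TFHM.xls');

% Sanity check: both ChIP-chip matrices should cover the same genes.
assert(size(matrix_TR, 1) == size(matrix_AHplus, 1), ...
    'matrix_TR and matrix_AHplus should have the same number of rows');
```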

2. Cluster the genes in the ChIP-chip data.

% according to TF-binding

W_h=matrix_TR*matrix_TR';

[normvector_h]=normvector(W_h,clsn);

[bestclust_TR] = movmf_m(W_h,normvector_h,clsn,iter,kappa);

clsn: the number of clusters (e.g., 10)

iter: the number of iterations (e.g., 1000)

kappa: the concentration parameter of the von Mises-Fisher (vMF) distribution (e.g., 10)

% according to histone acetylation

W_k=matrix_AHplus*matrix_AHplus';

[normvector_k]=normvector(W_k,clsn);

[bestclust_AHplus] = movmf_m(W_k,normvector_k,clsn,iter,kappa);
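
The calls above assume that clsn, iter and kappa already exist in the workspace. The sketch below sets them to the example values given above, then uses a small made-up matrix X (not part of the data) to show what the product of a matrix with its own transpose contains: a symmetric gene-by-gene matrix of inner products, assuming rows correspond to genes.

```matlab
% Example parameter values taken from the descriptions above.
clsn  = 10;    % number of clusters
iter  = 1000;  % number of iterations
kappa = 10;    % concentration parameter of the vMF distribution

% Toy stand-in for matrix_TR: 3 genes (rows) x 2 features (columns).
X = [1 0; 0 1; 1 1];

% Same construction as W_h = matrix_TR*matrix_TR': entry (i,j) is the
% inner product of the feature vectors of gene i and gene j.
W = X * X';

disp(W)
```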

For reference, the expected results are provided in the files "bestclust_TR.txt" and "bestclust_AHplus.txt".
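
Because cluster IDs are assigned arbitrarily (see the NOTE further down), an element-wise comparison with the reference files can fail even when the grouping of genes is the same. One possible check, sketched here on made-up label vectors a and b, tests whether two assignments define the same partition up to a relabeling of the IDs. It assumes that the labels are positive integers and that every ID is used at least once.

```matlab
% Toy label vectors: the same grouping of 6 genes, with permuted IDs.
a = [1 1 2 2 3 3]';   % e.g. your own result
b = [2 2 3 3 1 1]';   % e.g. labels read from a reference file

% Contingency table: C(i,j) counts genes with ID i in a and ID j in b.
C = accumarray([a b], 1);

% The partitions match up to relabeling when every row and every column
% of C has exactly one non-zero entry.
samePartition = all(sum(C > 0, 1) == 1) && all(sum(C > 0, 2) == 1);
```

If samePartition is false, the non-zero pattern of C shows which clusters were split or merged.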

3. Cluster the genes in the microarray expression data (GSE9217: matrix_GP).

Run "SelectCell".

[Bestclust, GeneGroup, GeneGroupList]=SelectCell(matrix_GP, genelist_GP, clsn, iter, kappa, genelist_TFHM, bestclust_TR, bestclust_AHplus);

OUTPUT

Bestclust: cluster IDs of genes in microarray data

GeneGroup: cluster IDs of pattern-cells with t-values greater than 0.99

GeneGroupList: gene lists of pattern-cells

Other outputs are (1) the number of genes in each TF-HM cell, (2) the t-CDFs of the genes in each cell, and (3) heatmaps of the gene counts in (1).
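
To illustrate what a gene count per TF-HM cell looks like, the sketch below tabulates made-up cluster labels and draws the counts as a heatmap. It assumes that a cell is the pair (TF-binding cluster, histone-acetylation cluster). It is not the code used inside SelectCell, and the variables tr, ah and k are hypothetical stand-ins for bestclust_TR, bestclust_AHplus and clsn.

```matlab
% Toy cluster assignments for 6 genes (hypothetical values).
tr = [1 1 2 2 2 3]';   % stand-in for bestclust_TR
ah = [1 2 2 2 3 3]';   % stand-in for bestclust_AHplus
k  = 3;                % stand-in for clsn

% counts(i,j) = number of genes in TR cluster i and AH+ cluster j.
counts = accumarray([tr ah], 1, [k k]);

% Heatmap of the gene counts per cell.
imagesc(counts); colorbar;
xlabel('AH+ cluster'); ylabel('TR cluster');
```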

NOTE:

Cluster IDs are assigned randomly, so the resulting IDs may differ from those in the paper. To run this software on your own gene expression data, replace the following arguments:

matrix_GP -> the gene expression profile of your microarray dataset

genelist_GP -> the gene list of that dataset

The above procedure uses datasets previously reported in the following papers:

Harbison, C.T. et al. (2004) Nature, 431, 99-104.

Kurdistani, S.K. et al. (2004) Cell, 117, 721-733.

Bernstein, B.E. et al. (2004) Genome Biol, 5, R62.

Lee, Y.L. and Lee, C.K. (2008) Mol Cells, 26, 299-307.

GSE9217

GSE9840

The mixture model estimation also uses source code that accompanies the following paper:

Banerjee, A. et al. (2005) J Mach Learn Res 6: 1-39.