Model-Based Clustering of Gene Expression Data Using a Mixture of Multivariate t-Distributions

STUDENT

DEGREE

YEAR

Nowadays, most of what we consider information is stored on computers, and the amount of data being collected keeps increasing. At this scale, the human ability to distinguish groups degrades, so methods and techniques are needed to summarize these data and extract information from them. One of these techniques is clustering. The term "clustering" refers to the grouping of data without any a priori knowledge of what groups are present in the data. Recently, longitudinal data, which are collected over a period of time from specific units, have received increasing attention. In this thesis, longitudinal data are clustered using Gaussian and non-Gaussian mixture distributions, with consideration of the appropriate covariance structure of these data. Families of mixture models are said to arise when the component parameters, usually the component covariance matrices, are decomposed and a number of constraints are imposed. Within the family setting, it is necessary to choose both the member of the family and the number of mixture components. Clustering gene expression time course data is an important problem in bioinformatics, since similar behavior among genes can yield important biological information. Statistically, the problem of clustering time course data is a special case of the more general problem of clustering longitudinal data. Similarity in the expression behavior of genes within one cluster can be a sign of similarity in their biological behavior, so clustering methods play a more prominent role in the analysis of time course gene expression data than in other settings. In other words, if we find genes with similar behavior and treat them as one group, we may be able to identify the role of unknown genes in the biological function of cells.
In this thesis, model-based clustering of longitudinal gene expression data is considered, with the aim of discovering the function of unknown genes, using a mixture of multivariate t-distributions with a linear model for the mean and a modified Cholesky-decomposed covariance structure. Constraints are placed upon the covariance structure, leading to a novel family of mixture models. The parameters of the mixture distribution are estimated via the EM algorithm and the data are clustered; the models are then applied to simulated data to illustrate their efficacy. This model-based clustering approach is compared with another model-based clustering technique on three real gene expression time course data sets. The experiments yield interesting results and point to useful applications.

Today, collecting information through computers and the Internet produces large amounts of data. Extracting knowledge from large data sets can be complicated and, in some cases, may seem impossible, so methods for extracting information from such data are needed. One common data mining method is clustering. Clustering methods play an especially prominent role in the analysis of longitudinal gene expression data, more so than in other fields, because similarity in the expression behavior of genes in a given cluster can also indicate similarity in their biological behavior. Therefore, if genes with similar expression behavior can be found among a multitude of other genes and placed into distinct groups, further study of these groups can shed light on their biological behavior in the cell and identify their role in the cell's biological function. In this thesis, model-based clustering of gene expression data is carried out, with the aim of discovering the function of unknown genes, using a mixture of multivariate t-distributions with a modified Cholesky-decomposed covariance structure and a linear model for the mean. Imposing constraints on the covariance structure yields a new family of models. After the parameters of the mixture distribution are estimated by the EM algorithm and the data are clustered, the resulting model-based clustering is compared with another model-based clustering method through simulation studies and three data sets. The results are interesting and noteworthy, and the proposed method can be used to cluster other types of data.
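
As a brief illustration of the modified Cholesky-decomposed covariance structure mentioned in the abstract, the sketch below computes, for a covariance matrix Sigma, a unit lower triangular matrix T and a diagonal matrix D such that T Sigma T' = D, which gives Sigma^{-1} = T' D^{-1} T. This is only a minimal sketch and not the thesis's implementation: the function name `modified_cholesky` and the AR(1)-style example covariance over four time points are assumptions chosen for illustration.

```python
import numpy as np


def modified_cholesky(sigma):
    """Return (T, D) with T unit lower triangular and D diagonal
    such that T @ sigma @ T.T equals D (up to rounding error)."""
    L = np.linalg.cholesky(sigma)   # sigma = L @ L.T, L lower triangular
    d_sqrt = np.diag(L)             # positive diagonal entries of L
    C = L / d_sqrt                  # scale column j by 1 / L[j, j] -> unit diagonal
    T = np.linalg.inv(C)            # inverse of a unit lower triangular matrix
    D = np.diag(d_sqrt ** 2)        # diagonal matrix of squared diagonal entries
    return T, D


# Hypothetical example: AR(1)-style covariance for 4 time points.
idx = np.arange(4)
sigma = 0.6 ** np.abs(idx[:, None] - idx[None, :])

T, D = modified_cholesky(sigma)

# The precision matrix rebuilt from the decomposition.
precision = T.T @ np.linalg.inv(D) @ T
```

In a family of constrained models, restrictions would be placed on T and D (for example, sharing them across mixture components); which constraints the thesis uses is not stated in the abstract, so none are shown here.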