Clustering of Longitudinal Data Using Dirichlet Process Mixtures

STUDENT

DEGREE

YEAR

Cluster analysis is the task of grouping a set of subjects so that, according to some specified criterion, subjects in the same group, called a cluster, are more similar to each other than to those in other clusters. Clustering of longitudinal data, which are obtained by repeatedly measuring subjects over time, is used to recognize patterns in the data. Commonly used similarity criteria are mostly based on distance measures and cannot easily be extended to longitudinal data. In this thesis, we use model-based clustering of longitudinal data based on mixture models, in which subjects are considered similar when they share the same distribution. In this way, as much of the information in the data as possible is carried into the clustering. This viewpoint opens up a wide range of applications in fields such as medicine, public health, education, business, economics, psychology, and biology. To accommodate a wide variety of correlation patterns in longitudinal data and to make use of explanatory variables in clustering, we use dynamic mixed-effects models. These models control the between-subject variation by including random effects and capture the serial correlation among observations by including lagged response variables. We also handle the initial conditions problem that arises in fitting dynamic mixed-effects models, by jointly modeling the start-up and subsequent responses. In this work, subjects rather than observations are clustered: subjects with similar random effects are assumed to belong to a common cluster. Within a Bayesian approach, we propose using the Dirichlet process as a prior for the random-effects distribution. The discreteness of the Dirichlet process is what allows subjects to be clustered.
Because the number of clusters does not need to be specified in advance, this clustering technique has an advantage over other methodologies. This thesis also provides regression-based clustering, in which the effects of nuisance factors are removed from the clustering. Moreover, using Dirichlet processes yields a more accurate analysis and makes it possible to estimate the number of clusters. In longitudinal studies, characterizing the time trends of measurements across subjects is often important. Therefore, this thesis proposes a new approach to clustering longitudinal data in which subjects are clustered according to their distributional behavior over time. It is assumed that the distribution of the responses changes at random change points. These change points correspond to the times at which the skewness of the residual distribution changes or the regression coefficients of the corresponding model change. For clustering, we take the Dirichlet process as the prior for the distribution of the random change points in dynamic mixed-effects models. In both approaches, Gibbs sampling is used to approximate the Bayes estimates of the parameters. The performance of the proposed models is illustrated through simulation studies, and their usefulness is evaluated by applying them to real data sets.
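
As a rough illustration of how the discreteness of the Dirichlet process induces clusters without fixing their number in advance, the following sketch samples a partition of subjects from the Chinese restaurant process, the partition distribution implied by a Dirichlet process prior. It is not the model fitted in this thesis: the function name `crp_assignments`, the concentration parameter `alpha`, and the choice of 50 subjects are assumptions made for illustration only.

```python
import random
from collections import Counter


def crp_assignments(n_subjects, alpha, seed=0):
    """Sample cluster labels for n_subjects from a Chinese restaurant process.

    alpha is the concentration parameter of the Dirichlet process
    (hypothetical name; must be positive).
    """
    rng = random.Random(seed)
    labels = []  # labels[i] is the cluster index of subject i
    sizes = []   # sizes[k] is the number of subjects currently in cluster k
    for _ in range(n_subjects):
        # An existing cluster k is chosen with weight sizes[k];
        # a new cluster is opened with weight alpha.
        weights = sizes + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(sizes):
            sizes.append(1)   # open a new cluster
        else:
            sizes[k] += 1     # join an existing cluster
        labels.append(k)
    return labels


labels = crp_assignments(n_subjects=50, alpha=1.0)
print("number of clusters:", len(set(labels)))
print("cluster sizes:", sorted(Counter(labels).values(), reverse=True))
```

Subjects that share a label play the role of subjects sharing the same random effect or change point. The number of occupied clusters is an outcome of the draw, not an input, and larger values of `alpha` tend to produce more clusters.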

Clustering is regarded as one of the most important methods of data analysis and is now widely applied in many sciences, such as medicine, economics, and the behavioral sciences. In clustering, a large volume of data is placed, on the basis of a similarity criterion, into meaningful groups called clusters, whose members are maximally similar to one another. As the dependence structures present in data become more complex and similarity criteria more varied, model-based clustering methods, which rest on mixtures of distributions, are receiving increasing attention. With this approach, the greatest amount of information about how the data are distributed is used in the clustering. In this thesis, regression models such as the dynamic mixed-effects model are used in order to account for the dependence structures in longitudinal data, which arise from repeated measurements on experimental units over time, and to make use of explanatory variables for better clustering of the observations. In this model, random effects and the lagged response variable are included to control, respectively, the variability between experimental units and the serial dependence of the response variables. In addition, this thesis presents a new approach to clustering longitudinal data in which clustering is based on the time at which the distributional behavior of the response variable changes. In this case, the experimental units are clustered by using the Dirichlet process as the prior for the distribution of the random times at which the distribution changes.