Proposing Distributed Label Propagation Methods in Heterogeneous Networks

STUDENT

DEGREE

YEAR

Many of today's systems consist of components, and relationships between those components, of several different types. Such systems have recently been modeled as heterogeneous networks, that is, networks whose vertices and edges are of multiple types. Heterogeneous networks are a kind of complex network and, compared with homogeneous networks, carry richer structural and semantic information. As a result, acquiring knowledge from these networks and exploring them requires dedicated algorithms whose capabilities differ from those of algorithms designed for homogeneous networks. Moreover, heterogeneous networks usually contain many vertices and edges, and they grow much faster than homogeneous networks. Given this, extracting knowledge and discovering relations in such networks is very complicated, so fast and accurate methods are required. Complex networks have many real-world examples and are widely used today to model complicated processes; biological networks are one kind of complex network. The purpose of this research is to provide fast and scalable methods for gaining knowledge from heterogeneous complex networks. Because it is very important in heterogeneous networks to consider the local and global features of the network together, we have chosen label propagation, a semi-supervised learning method. In addition to introducing label propagation algorithms suited to the needs of heterogeneous complex networks, we improve their speed and scalability by providing a distributed platform for them, and we measure the accuracy of the proposed algorithms. This thesis introduces two distributed label propagation algorithms for heterogeneous networks, namely DHLP-1 and DHLP-2.
First, a heterogeneous network is built from three concepts (drug, disease, and target); then new drug-target, disease-target, and drug-disease associations are predicted by label propagation. Vertex-centric programming and the Apache Giraph platform are used to distribute the introduced algorithms. The experiments show that the runtime of the algorithms is lower in the distributed version than in the non-distributed one. The effectiveness of our algorithms relative to other algorithms is shown through 10-fold cross-validation as well as other experiments.

Keywords: Vertex Centric, Label Propagation, Complex Networks, Heterogeneous Networks, Semi-Supervised Learning, Drug Repositioning
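
To make the vertex-centric idea concrete, the following is a minimal single-process sketch of label propagation organized in supersteps, in the style that platforms such as Apache Giraph distribute across workers. It is not DHLP-1 or DHLP-2 and does not use the Giraph API. The update rule (a seed-restart average with damping factor `ALPHA`), the function names `propagate` and `normalize`, and the toy drug/target/disease graph are all assumptions made for illustration only.

```python
from collections import defaultdict

ALPHA = 0.5  # assumed damping factor; not taken from the thesis


def normalize(undirected):
    """Turn an undirected weighted edge list into a normalized adjacency map.

    The outgoing weights of every vertex are rescaled to sum to 1.
    """
    adj = defaultdict(list)
    for a, b, w in undirected:
        adj[a].append((b, w))
        adj[b].append((a, w))
    return {
        v: [(u, w / sum(x for _, x in nbrs)) for u, w in nbrs]
        for v, nbrs in adj.items()
    }


def propagate(edges, seeds, supersteps=20, alpha=ALPHA):
    """Pregel-style label propagation simulated in one process.

    edges: dict mapping vertex -> list of (neighbor, weight) pairs.
    seeds: dict mapping vertex -> initial label score.
    """
    scores = {v: seeds.get(v, 0.0) for v in edges}
    for _ in range(supersteps):
        # Message phase: every vertex sends its weighted score to neighbors.
        inbox = defaultdict(float)
        for v, nbrs in edges.items():
            for u, w in nbrs:
                inbox[u] += w * scores[v]
        # Compute phase: each vertex mixes its inbox with its own seed label.
        scores = {
            v: (1 - alpha) * seeds.get(v, 0.0) + alpha * inbox[v]
            for v in edges
        }
    return scores


# Toy heterogeneous graph (hypothetical data); the vertex type is encoded
# as a name prefix purely for readability.
toy_edges = normalize([
    ("drug:A", "target:T1", 1.0),
    ("drug:B", "target:T1", 1.0),
    ("drug:B", "disease:D1", 1.0),
    ("target:T1", "disease:D1", 1.0),
    ("drug:C", "disease:D2", 1.0),
])
result = propagate(toy_edges, seeds={"drug:A": 1.0})
```

Seeding `drug:A` and ranking the remaining vertices by score yields candidate associations for that drug: vertices close to the seed score higher, and vertices in a disconnected component receive no score. In a distributed setting each vertex would run the compute phase on its own worker and exchange only messages, which is what makes the scheme scale.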

In today's world we usually face systems whose constituent components, and the relationships between those components, are of many different types. Such systems are modeled as heterogeneous networks. Heterogeneous networks are networks made up of edges and vertices of different types. These networks are a kind of complex network and, compared with homogeneous networks, contain richer structural and semantic information; consequently, acquiring knowledge from them and exploring them requires dedicated algorithms whose capabilities differ from those of algorithms for homogeneous networks. On the other hand, heterogeneous networks usually consist of many vertices and edges, and they grow very fast compared with homogeneous networks. Given the nature of these networks, extracting knowledge from them and discovering relations is very complicated, so fast and accurate methods are needed for this purpose. Complex networks have many examples in the real world and are widely used today to model complicated processes. Biological networks are one kind of complex network. The aim of this research is to present fast and scalable methods for acquiring knowledge from heterogeneous complex networks. Since considering the local and global features of the network together is very important in heterogeneous networks, we have chosen the semi-supervised learning method "label propagation". Besides presenting a label propagation method suited to the needs of heterogeneous complex networks, we seek to improve the speed and scalability of this algorithm by providing a distributed platform for it, and we also evaluate its accuracy. This thesis introduces two distributed label propagation methods for heterogeneous networks, named DHLP-1 and DHLP-2. First, a heterogeneous network consisting of the three concepts drug, disease, and target is built, and then new drug-target, drug-disease, and disease-target relations are predicted by label propagation. Vertex-centric programming and the Apache Giraph platform are used to distribute the introduced methods.
The experiments show that the runtime of the methods is sharply lower in the distributed setting than in the non-distributed one. In addition, 10-fold cross-validation and other practical experiments show the effectiveness of the algorithm relative to similar methods.

Keywords: 1- Vertex-centric programming 2- Label propagation 3- Complex networks 4- Heterogeneous networks 5- Semi-supervised learning 6- Drug repositioning