Data Visualization Using a Genetic Algorithm. Case Study: Electricity Load Consumption in Isfahan

STUDENT

DEGREE

YEAR

With the rapid growth of databases in many modern enterprises, data mining has become an increasingly important approach to data analysis. Data mining activities include both direct and indirect approaches. Direct data mining focuses on one target variable, whereas the goal of indirect data mining is to understand the relationships among all of the variables. Data visualization is a key component of indirect data mining. Visualization of multidimensional data is still a challenging task. The goal is not to display every data dimension, but to help the user comprehend multidimensional data. Data visualization techniques have become important tools for analyzing large multidimensional data sets and providing insight in scientific, economic and engineering applications. The most common methods allocate a representation for each data point in a lower-dimensional space and optimize these representations so that the distances between the projected points stay proportional to the original distances between the corresponding data items. The methods differ in how the distances are weighted and how the representations are optimized. Linear mappings, such as principal component analysis, are effective but cannot truly reflect the data structure. Nonlinear mappings, such as Sammon mapping, multidimensional scaling (MDS) and the self-organizing map (SOM), require more computation but are preferred because they preserve the data structure. We propose a discretization of the data visualization problem that allows us to formulate it as a quadratic assignment problem (QAP). Since no analytic solution exists for this problem, we investigate the use of genetic algorithms (GAs) for the data visualization problem. Genetic algorithms are efficient and robust search and optimization methods that are used in data mining.
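As a rough illustration of the discretization idea, the sketch below assigns four data points to the cells of a 2 x 2 grid and scores an assignment by how closely the grid distances match the original distances. The squared-difference cost, the grid layout and all function names are assumptions made for this example; the abstract does not state the exact QAP objective used in the work.

```python
import itertools
import math


def pairwise_distances(points):
    """Euclidean distance matrix for a list of coordinate tuples."""
    return [[math.dist(a, b) for b in points] for a in points]


def grid_cells(rows, cols):
    """Coordinates of the cells of a rows x cols display grid."""
    return [(r, c) for r in range(rows) for c in range(cols)]


def assignment_cost(data_dist, cell_dist, assignment):
    """Sum of squared mismatches between original and displayed distances.

    assignment[i] is the index of the grid cell that data item i occupies.
    This stress-like cost is an illustrative choice, not the thesis's formula.
    """
    n = len(assignment)
    return sum(
        (data_dist[i][j] - cell_dist[assignment[i]][assignment[j]]) ** 2
        for i in range(n)
        for j in range(i + 1, n)
    )


# Toy data: four 3-D points that happen to lie on a unit square.
data = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, 0.0)]
cells = grid_cells(2, 2)
data_dist = pairwise_distances(data)
cell_dist = pairwise_distances(cells)

# Exhaustive search is only feasible for tiny instances (n! assignments).
best = min(
    itertools.permutations(range(len(cells))),
    key=lambda p: assignment_cost(data_dist, cell_dist, p),
)
```

Enumerating all permutations works only for a handful of points; the factorial growth of the search space is what motivates a heuristic such as a genetic algorithm for realistic data sets.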
Since the volume of data in data mining is large and genetic algorithms search over all of the data points, solving this problem with GAs requires high computational capacity. Therefore, to speed up the search, a Self-Adaptive Island Genetic Algorithm (SAIGA) is developed, in which the crossover rate, mutation rate, survival rate and migration rate of each population are set adaptively. The effects of the communication topology between subpopulations are usually ignored in adaptive genetic algorithms; in this work, different communication topologies are considered. The algorithm concentrates on heuristically high-yielding regions while simultaneously performing a highly explorative search of the other regions of the search space. In other words, it improves the power of exploration and of exploitation independently. To compare the proposed technique (QAP-SAIGA) with self-organizing maps (SOMs), we perform a case study.
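The island scheme described above can be sketched as follows. This is a minimal illustration, not the SAIGA algorithm itself: it adapts only the mutation rate, uses a simple hypothetical rule (lower the rate after an improvement, raise it otherwise), and fixes the communication topology to a ring. The toy `flow` and `dist` matrices and every function name are assumptions made for the example.

```python
import random


def qap_cost(flow, dist, perm):
    """Standard QAP objective: item i is placed at location perm[i]."""
    n = len(perm)
    return sum(flow[i][j] * dist[perm[i]][perm[j]]
               for i in range(n) for j in range(n))


def order_crossover(a, b, rng):
    """Copy a slice from parent a, fill the rest in parent b's order."""
    n = len(a)
    lo, hi = sorted(rng.sample(range(n), 2))
    child = [None] * n
    child[lo:hi + 1] = a[lo:hi + 1]
    kept = set(a[lo:hi + 1])
    fill = [g for g in b if g not in kept]
    for k in range(n):
        if child[k] is None:
            child[k] = fill.pop(0)
    return child


def swap_mutation(perm, rate, rng):
    """With probability `rate`, swap two positions of a copy of perm."""
    child = list(perm)
    if rng.random() < rate:
        i, j = rng.sample(range(len(child)), 2)
        child[i], child[j] = child[j], child[i]
    return child


def evolve_island(island, cost, rng):
    """One generation: keep the better half, refill with mutated offspring."""
    pop = sorted(island["pop"], key=cost)
    elite = pop[:max(2, len(pop) // 2)]
    children = []
    while len(elite) + len(children) < len(pop):
        a, b = rng.sample(elite, 2)
        child = order_crossover(a, b, rng)
        children.append(swap_mutation(child, island["mutation_rate"], rng))
    island["pop"] = elite + children
    improved = min(cost(p) for p in island["pop"]) < cost(pop[0])
    # Hypothetical adaptation rule: exploit after success, explore otherwise.
    factor = 0.9 if improved else 1.1
    island["mutation_rate"] = min(0.9, max(0.05, island["mutation_rate"] * factor))


def migrate_ring(islands, cost):
    """Ring topology: each island's best replaces its neighbor's worst."""
    bests = [min(isl["pop"], key=cost) for isl in islands]
    for k, isl in enumerate(islands):
        worst = max(range(len(isl["pop"])), key=lambda i: cost(isl["pop"][i]))
        isl["pop"][worst] = list(bests[(k - 1) % len(islands)])


def run_island_ga(flow, dist, n_islands=4, pop_size=12,
                  generations=60, migrate_every=10, seed=0):
    rng = random.Random(seed)
    n = len(flow)

    def cost(perm):
        return qap_cost(flow, dist, perm)

    islands = [{"pop": [rng.sample(range(n), n) for _ in range(pop_size)],
                "mutation_rate": 0.1 + 0.2 * k}
               for k in range(n_islands)]
    for gen in range(1, generations + 1):
        for isl in islands:
            evolve_island(isl, cost, rng)
        if gen % migrate_every == 0:
            migrate_ring(islands, cost)
    return min((p for isl in islands for p in isl["pop"]), key=cost)


# Toy instance: five locations on a line and an arbitrary symmetric flow.
dist = [[abs(i - j) for j in range(5)] for i in range(5)]
flow = [[0, 3, 0, 2, 0],
        [3, 0, 1, 0, 0],
        [0, 1, 0, 4, 1],
        [2, 0, 4, 0, 2],
        [0, 0, 1, 2, 0]]
best_perm = run_island_ga(flow, dist)
```

Starting each island with a different mutation rate lets some subpopulations search broadly while others refine good solutions; swapping `migrate_ring` for another exchange pattern is where a different communication topology would plug in.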

With the rapid growth of databases in many areas of today's world, data mining has become an important approach to data analysis, and its importance keeps increasing. Data mining activities can be divided into two categories: the direct approach and the indirect approach. In direct data mining the target is a single variable, whereas indirect data mining examines the relationships among all of the variables. Data visualization is an essential component of indirect data mining. Visualizing multidimensional data is a challenging problem; the goal is not to display the data in all of its dimensions but to improve the user's understanding of the data. Today, data visualization has become a widely used tool for analyzing multidimensional data and building an understanding of data in scientific, economic, engineering and other fields. The common methods used in data visualization create a representation of each data item in a low-dimensional space and optimize these representations so that the distances between them resemble the distances between the corresponding items in the original data set. The methods differ in how the distances are weighted and how the representation is optimized. Linear mappings such as principal component analysis are effective and efficient, but they do not show the structure of the data correctly. Nonlinear mappings such as Sammon mapping, multidimensional scaling and self-organizing maps require heavy computation, but they preserve the structure of the data. In this thesis, the data visualization problem is discretized and formulated as a quadratic assignment problem. Because exact methods for this problem are computationally difficult, the use of a genetic algorithm to solve it is investigated. Genetic algorithms are robust and effective search and optimization methods that are used in data mining.
Given the large volume of data in data mining, the fact that the genetic algorithm searches over all of the data, and the heavy computation this entails, an island genetic algorithm with dynamic parameters is presented in order to improve the quality of the solutions obtained. In this algorithm, the crossover rate, the mutation rate, the migration probability and the elitist selection operator are determined dynamically for each subpopulation. The effect of the communication topology between subpopulations, which is ignored in most adaptive genetic algorithms, is also examined. The algorithm mostly exploits promising regions to improve the solutions while simultaneously searching the other regions. In other words, it increases the power of exploration and the power of exploitation independently of each other and in different regions. The performance of the proposed model is demonstrated by running it on a data set of electricity consumption in Isfahan at different hours of the days of the year 1387 (Solar Hijri calendar). Comparing the results of the island genetic algorithm with dynamic parameters against those of the genetic algorithm and the self-organizing map shows that the island genetic algorithm with dynamic parameters leads to more accurate results.