Developing Classification Models and Rule Extraction Based on Data Type

Classification is one of the most common tasks in data mining and knowledge discovery: it maps each item of the selected data onto one of a given set of classes. It has applications in many fields, including finance, insurance, medicine, and the social and biological sciences, and improving the performance and capabilities of classifiers has long attracted attention. Feature selection is a preprocessing step in pattern recognition and data mining. This thesis uses rough set theory as an effective feature selection method. A tree of the subsets of the original feature set is developed and searched minimally, pruning branches on the basis of a monotonic property. Starting the search from a greedy solution yields an effective and exact rough-set feature selection algorithm for categorical datasets. The capability of the algorithm is compared with full search, and its solutions and computation time are compared with those of a meta-heuristic algorithm. Its strengths and weaknesses are described.
The classification models developed in this thesis treat different types of features (numerical, categorical, and mixed) differently, without transforming them. To this end, distance or similarity measures for a case-based reasoning model are built. These measures take the weight of each feature into account and handle categorical and numerical features differently: they use the Euclidean distance for numerical features and the co-occurrence of values for categorical features. The proportional distribution of the different values of a categorical feature is computed only with respect to the values of the class feature, in two settings: without and with considering the class of the cases. The proposed case-based reasoning models are implemented on categorical and mixed datasets, and their performance is evaluated against well-known classification tools.
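The kind of weighted distance described above can be illustrated with a minimal sketch. It is not the measure defined in the thesis: the categorical part below substitutes a value-difference-style comparison of class distributions, and all names (`class_profiles`, `categorical_distance`, `mixed_distance`) and the toy cases are hypothetical.

```python
from collections import Counter, defaultdict
from math import sqrt

def class_profiles(values, labels):
    """Return {value: {class: P(class | value)}} for one categorical feature."""
    counts = defaultdict(Counter)
    for v, y in zip(values, labels):
        counts[v][y] += 1
    return {v: {y: n / sum(c.values()) for y, n in c.items()}
            for v, c in counts.items()}

def categorical_distance(a, b, profiles, classes):
    """Half the L1 gap between class distributions of a and b (range 0..1).

    Assumption: a value-difference-style stand-in for the co-occurrence
    measure of the thesis, which the abstract does not spell out.
    """
    if a == b:
        return 0.0
    pa, pb = profiles.get(a, {}), profiles.get(b, {})
    return 0.5 * sum(abs(pa.get(c, 0.0) - pb.get(c, 0.0)) for c in classes)

def mixed_distance(x, y, numeric_idx, categorical_idx, profiles, classes, weights):
    """Weighted distance: squared differences for numerical features,
    squared class-distribution gaps for categorical features."""
    total = 0.0
    for j in numeric_idx:
        # Numerical features are assumed to be on comparable scales already.
        total += weights[j] * (x[j] - y[j]) ** 2
    for j in categorical_idx:
        total += weights[j] * categorical_distance(x[j], y[j], profiles[j], classes) ** 2
    return sqrt(total)

# Toy cases: (numerical feature, categorical feature), with class labels.
cases = [(1.0, "red"), (2.0, "red"), (5.0, "blue"), (6.0, "green")]
labels = ["ok", "ok", "defect", "defect"]
classes = sorted(set(labels))
profiles = {1: class_profiles([c[1] for c in cases], labels)}
weights = {0: 1.0, 1: 1.0}

d_far = mixed_distance(cases[0], cases[2], [0], [1], profiles, classes, weights)
d_near = mixed_distance(cases[2], cases[3], [0], [1], profiles, classes, weights)
print(d_far, d_near)
```

In this toy data "blue" and "green" always occur with the same class, so their categorical distance is zero even though the values differ; that is the behaviour a class-conditional measure is meant to give, and it is what separates it from simple value matching.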
To address the practical side of the thesis, the problem of sticker defects on cold-rolled coils at Mobarakeh Steel Complex is investigated as a classification problem. The features that contribute to the defect are determined from prior research and expert opinion, and the available data are collected. After the dataset is refined and an initial analysis is performed, the proposed classifiers and several other well-known methods are evaluated on it. The important features responsible for the sticker defect are thereby identified, and high-accuracy classification rules are extracted for setting the process parameters so as to reduce, or possibly eliminate, the sticker defect.

Abstract
Classification is one of the main goals of data mining and knowledge discovery; it assigns a sample to one of two or more predefined classes or groups. Classification is applied in various fields of study, including finance, biology, and medicine. Improving the performance and capability of classification models has always attracted attention. Feature selection is a preprocessing procedure in data mining and pattern recognition. Using rough set theory and the dependency degree function, this thesis introduces an efficient feature selection algorithm: by developing a tree of the subsets of the original features, searching it minimally while pruning some branches on the basis of the monotonicity property, and starting the search from a greedy solution, it provides an efficient and exact algorithm for categorical datasets. Classification models with good performance are also developed that can handle numerical, categorical, and mixed features and treat each data type distinctly without transforming it. Specifically, distance or similarity measures for a case-based reasoning model are built. These distance measures take the weight of each feature into account and use the Euclidean distance for numerical features and the co-occurrence of different values for categorical features. The different values of the categorical features are evaluated in two settings: with and without regard to the class. The sticker defect on cold-rolled coils at Mobarakeh Steel Company was treated as a classification problem; the parameters that contribute to this defect were identified and then selected according to whether data were available for them. After the dataset was refined, the proposed classification models and some well-known tools were tested on it, the features important for classifying the sheets were identified, and the classification rules with the highest accuracy were extracted for setting the various process parameters in order to reduce, and even eliminate, the defect.
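As a rough illustration of the rough-set ingredients named here, the sketch below computes the standard dependency degree (the fraction of cases whose feature values determine the class unambiguously) and builds a greedy starting subset from it. It covers only the greedy starting point, not the tree search with monotonic pruning; the function names and the toy decision table are hypothetical.

```python
from collections import defaultdict

def dependency_degree(rows, features, decision):
    """Fraction of rows lying in equivalence classes (under `features`)
    that contain a single decision value, i.e. the positive region size
    divided by the number of rows."""
    blocks = defaultdict(list)
    for row in rows:
        blocks[tuple(row[f] for f in features)].append(row[decision])
    consistent = sum(len(d) for d in blocks.values() if len(set(d)) == 1)
    return consistent / len(rows)

def greedy_reduct(rows, features, decision):
    """Add the feature with the largest dependency gain until the subset
    reaches the dependency degree of the full feature set.

    The result preserves the full dependency degree but is not guaranteed
    to be a minimum-size subset; an exact search would be needed for that.
    """
    target = dependency_degree(rows, features, decision)
    selected = []
    while dependency_degree(rows, selected, decision) < target:
        best = max(
            (f for f in features if f not in selected),
            key=lambda f: dependency_degree(rows, selected + [f], decision),
        )
        selected.append(best)
    return selected

# Toy categorical decision table: feature "b" alone determines decision "d".
rows = [
    {"a": 0, "b": 0, "c": 0, "d": "no"},
    {"a": 0, "b": 1, "c": 0, "d": "yes"},
    {"a": 1, "b": 0, "c": 1, "d": "no"},
    {"a": 1, "b": 1, "c": 1, "d": "yes"},
]
subset = greedy_reduct(rows, ["a", "b", "c"], "d")
print(subset)
```

Because the dependency degree never decreases when features are added, a subset that falls short of the target cannot be repaired by removing features. This monotonicity is the kind of property that allows whole branches of a subset tree to be pruned.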