Improving Data Veracity by Managing Conflicts in Spatial Data Fusion

STUDENT

DEGREE

YEAR

Data utilization plays a vital role in the development of human societies. Today, data are stored in many different, heterogeneous sources, and fusing them into an integrated view is essential for using them effectively. Data quality is a central challenge in data fusion, and one of the main factors that degrades it is data inconsistency, which is the subject of this study. Some inconsistencies occur within a single dataset; others occur between datasets. The latter type is more likely, because each dataset may have been generated by different people and updated at different times. People with different purposes, motives, and background knowledge may describe the same concepts or entities differently in different sources, and inconsistency arises as a result. Real inconsistencies between two sources can be handled with common techniques such as voting. In this research, we address apparent inconsistencies, which are caused by differences in data representation. The purpose of this study is to manage this type of inconsistency in order to improve the quality of entity resolution and data fusion. Entity resolution is an essential task in data fusion, and resolving entities intelligently improves both fusion quality and data veracity. Different criteria are defined for different data types in the entity resolution process. One weakness of these criteria is that they ignore the differing levels of abstraction and granularity of the data in different sources. This weakness causes conflicts and inconsistencies that, if not managed properly, make entity identification error-prone and, as a result, leave the fused data insufficiently accurate. In this research, we first introduce a framework for managing inconsistencies and fusing data. Within the framework, data granulation and knowledge bases are used to increase data veracity, one of the more recently recognized characteristics of big data.
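
The distinction between real and apparent inconsistencies can be illustrated with a small sketch. It assumes a containment hierarchy of places is available, for example from a knowledge base. The hierarchy, the place names, and the functions `ancestors`, `compare_locations`, and `fuse_locations` are hypothetical illustrations and are not the algorithms developed in this work.

```python
# Hypothetical containment hierarchy (child -> parent). The place names are
# illustrative and are not taken from the thesis dataset.
PARENT = {
    "Mehrabad Airport": "Tehran",
    "Tehran": "Tehran Province",
    "Tehran Province": "Iran",
    "Isfahan": "Isfahan Province",
    "Isfahan Province": "Iran",
}


def ancestors(place):
    """Return the place followed by all of its coarser-grained ancestors."""
    chain = [place]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def compare_locations(a, b):
    """Classify two location values as 'equal', 'compatible' or 'conflict'.

    'compatible' means one value contains the other, so the two sources
    differ only in granularity (an apparent inconsistency).
    """
    if a == b:
        return "equal"
    if b in ancestors(a) or a in ancestors(b):
        return "compatible"
    return "conflict"


def fuse_locations(a, b):
    """Keep the finer-grained value when the two values are compatible."""
    verdict = compare_locations(a, b)
    if verdict == "equal":
        return a
    if verdict == "compatible":
        return a if b in ancestors(a) else b
    return None  # a real conflict needs another strategy, e.g. voting
```

Under these assumptions, "Mehrabad Airport" and "Iran" are treated as two descriptions of the same location at different granularities, whereas "Tehran" and "Isfahan" remain a real conflict that must be resolved by other means.
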
The proposed algorithms are online algorithms: they produce results as new records are observed, do not require the entire dataset to be re-evaluated from the beginning, and perform only the necessary comparisons. Granules and data granulation are the key elements of this work. Identifying spatial entities, forming spatial granules, identifying the relationships between granules, and clustering the granules help to manage differences in data granularity. Other contributions of this research are a new semantic structure for spatial concepts and a new criterion for measuring data quality based on data granulation. Blocking techniques significantly reduce the number of comparisons, because in the entity matching process only members of the same block are compared with each other in detail. For this purpose, a new blocking method based on geographical features is introduced. Spatial data are an indispensable type of data with diverse applications, so this work focuses on spatial data granulation. The proposed algorithms are based on spatial relationships between entities and have been tested on real air accident data. We developed and used an aviation accident dataset, but the proposed approach is not limited to this field and can be applied to almost any type of data that has a hierarchy of concepts and values. Applying the data granulation approach to several methods in the field of spatial data improved the quality of data fusion as measured by the F-Score.
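
As a rough illustration of how blocking and online processing interact, the following sketch groups records into fixed-size latitude/longitude grid cells and, for each new record, returns only the earlier records in the same or adjacent cells as comparison candidates. The grid scheme, the cell size `CELL_DEG`, and the names `cell_of`, `OnlineGeoBlocker`, and `f_score` are assumptions made for this example. The blocking method introduced in this work is based on geographical features and may differ substantially.

```python
import math
from collections import defaultdict

CELL_DEG = 1.0  # assumed block size in degrees; not a value from the thesis


def cell_of(lat, lon):
    """Map a coordinate to the integer grid cell (block key) containing it."""
    return (math.floor(lat / CELL_DEG), math.floor(lon / CELL_DEG))


class OnlineGeoBlocker:
    """Stores records in grid-cell blocks and yields candidates per new record."""

    def __init__(self):
        self.blocks = defaultdict(list)

    def add(self, record_id, lat, lon):
        """Insert a record and return ids of earlier records worth comparing."""
        row, col = cell_of(lat, lon)
        candidates = []
        # Check the record's own cell and its 8 neighbours so that nearby
        # points separated by a cell border are not missed. Longitude
        # wrap-around at +/-180 degrees is ignored in this sketch.
        for d_row in (-1, 0, 1):
            for d_col in (-1, 0, 1):
                candidates.extend(self.blocks.get((row + d_row, col + d_col), []))
        self.blocks[(row, col)].append(record_id)
        return candidates


def f_score(predicted_pairs, true_pairs, beta=1.0):
    """F-score of predicted matching pairs against ground-truth pairs.

    Each pair must be written the same way in both collections,
    e.g. as a tuple with the smaller id first.
    """
    predicted_pairs, true_pairs = set(predicted_pairs), set(true_pairs)
    true_positives = len(predicted_pairs & true_pairs)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted_pairs)
    recall = true_positives / len(true_pairs)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

Each call to `add` touches at most nine blocks, so a new record is compared only against nearby earlier records rather than against the whole dataset. The `f_score` helper shows how the quality of the resulting matches could be scored against a labeled ground truth.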

Exploiting the large volume of data stored in different, heterogeneous sources plays an effective role in the development of today's societies. One of the important steps in exploiting data is fusing them, which provides an integrated view of the data. If the data used are not of adequate quality, or if they are not fused with adequate quality, the desired result is not achieved. One of the fundamental challenges that undermines quality is the inconsistency of data sources with one another, which is the focus of this research. Some inconsistencies occur within a single dataset and others occur between different datasets. The latter type is more likely, because each dataset may have been produced by different people, updated at different times, and created for different purposes. Some inconsistencies really exist and must be judged with methods such as voting and averaging. Others, however, are only apparent: the data seem inconsistent because they are represented differently. The goal of this research is to manage this type of inconsistency; identifying and managing it improves the understanding of entities, their matching, and ultimately the fusion of information. One of the key operations in data fusion is identifying identical and similar entities and concepts. One weakness of existing methods is that they ignore the different levels of abstraction and granularity of data expressed in different sources. Ignoring differences in granularity gives rise to conflicts and inconsistencies which, if not managed correctly, make the entity identification process error-prone, so that the resulting fusion is not sufficiently accurate. To reach this goal, a framework is first presented, and conflicts are then managed within it so that entity identification and data fusion improve. The algorithms introduced are such that, when new data are presented, they need not re-examine all cases from the beginning and examine only the cases that arise because of the new data. In this research, the concepts of granule and granularity are the key elements.
Identifying spatial entities, forming spatial granules, identifying the relationships between granules, and clustering spatial granules help to manage differences in granularity effectively. To reduce the number of comparisons, which is a major bottleneck in entity matching, a data blocking method based on geographical features is also introduced in this work. Given the importance and wide use of spatial data, this research focuses on this type of data, and the proposed ideas have been tested on a real dataset of aviation accidents. Building this dataset and converting it into a form that can be used to evaluate results is another outcome of the work. Although these ideas have been tested on spatial data, they can be applied to any type of data for which a hierarchy of concepts and values can be obtained. By applying the granularity-based view to several established methods in the field of spatial data, the quality of data fusion has improved according to the F-Score.