Improving Mobile Robot Navigation in Dynamic Environments Using Reinforcement Learning

STUDENT

DEGREE

YEAR

In recent years, the computational complexity and long run times of classical methods have led researchers toward artificial intelligence approaches such as reinforcement learning, which are well suited to complex problems, particularly because they require no prior knowledge of the problem. This research proposes a navigation algorithm based on reinforcement learning. Applying reinforcement learning requires two main design steps: defining states that represent the surrounding world realistically, and defining a reward function from which learning can proceed. In dynamic environments, both must be designed with particular care. Because dynamic environments are complex, finding a mapping from the status of the environment to states is very challenging. The dynamic nature of the environment also affects the reward signal, so the reward the learning agent receives varies from moment to moment. Since all reinforcement learning methods are driven by the reward function, the quality of learning depends strongly on the reward signal received during training. Reinforcement learning has no supervisor: the training data are provided by the environment, and because those data are random, the learning algorithm may converge to an inappropriate policy or diverge altogether, with no guarantee of proper convergence. In addition to presenting a convergence proof for the proposed method and analyzing how an appropriate policy forms during learning, this study introduces a heuristic function that speeds up the learning process.
Reinforcement learning methods owe their popularity to their strong performance, especially in unknown environments, where no prior knowledge of the agent's status and capabilities or of the environment is available. To avoid relying on any prior information, the heuristic function is defined, and the agent is guided, solely on the basis of data acquired from the environment. As a consequence, the heuristic function is affected by those data. Every task has conditions that must hold under all circumstances; violating them counts as a system failure. The weights of a multi-objective reward function are adjusted precisely to control the agent's behavior and to prevent such failures, so the consequences of the heuristic function's dependence on data can be handled in the same way. Finally, the simulation results show that the proposed approach achieves higher performance than previous work in the literature. More importantly, the agent's performance keeps improving throughout learning with no sudden drop, which is consistent with the convergence analysis. As the analysis of the proposed method predicts, an appropriate policy forms in a short time during learning, regardless of how the training data are presented. The proposed approach can therefore be applied without concern about the form or order of the training data.

Keywords: navigation, dynamic environment, obstacle avoidance, reinforcement learning

In recent years, because classical methods are computationally complex and time-consuming, researchers have tended toward artificial intelligence approaches, among which reinforcement learning algorithms are especially popular for their strong performance, particularly in unknown environments. This research proposes a new reinforcement-learning-based approach for navigating an intelligent agent in dynamic environments. The main steps in using a reinforcement learning algorithm are designing suitable states and a suitable reward function, and in dynamic environments both choices are especially sensitive. One challenge in dynamic environments is the difficulty of expressing a suitable mapping from the status of the environment to features that represent the states used by the reinforcement learning algorithm. Another major challenge of learning in a dynamic environment is the fluctuation of the received reward caused by the dynamics of the environment. Because the quality of the results in reinforcement learning depends on the environment's reward function, random changes in the reward in dynamic environments change the results. In reinforcement learning, the training data are provided to the agent by the environment, and because the data are random owing to the dynamics of the environment, the learning agent faces non-convergence or inappropriate convergence of the algorithm, with no guarantee of proper convergence. In addition to analyzing the convergence of the proposed approach and how a desirable policy forms, this research presents a heuristic function that speeds up the reinforcement learning algorithm and guides the agent appropriately. Reinforcement learning algorithms are popular because of their remarkably high performance in unknown environments. Given the lack of prior knowledge about how the agent operates and about the conditions of the environment, the agent is guided indirectly, and the heuristic function is defined solely from information collected by the agent. Such a definition, however, makes the heuristic function dependent on the training data and therefore subject to the type of environment and its dynamics.
Because the heuristic function is based on environmental information, including the received rewards, this problem is overcome, and suitable guidance is obtained at all times, by establishing an appropriate framework for assigning the weights of the multi-objective reward function. This framework counteracts the adverse effects of the environment's dynamics and thereby avoids failure of the navigation task and collision with obstacles. The proposed approach is also applicable to a broad class of agents with different maneuvering capabilities. Finally, the simulation results show that, compared with previous work, this approach achieves a higher goal-reaching rate even in environments with a high density of obstacles, and is therefore more effective than the other existing algorithms. According to the simulation results, with further training of the agent the goal-reaching rate of the proposed method rises steadily or remains constant, with no sudden drop in performance; hence, as predicted in the description of the proposed method, the dynamics of the environment do not adversely affect policy acquisition during training. Regardless of the type or order of the training data, the proposed algorithm always leads to a suitable policy within a reasonable time.

Keywords: 1- navigation 2- obstacle avoidance 3- reinforcement learning 4- dynamic environment
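The idea of weighting a multi-objective reward so that a failure condition is never worth accepting can be illustrated with a minimal sketch. This is not the framework developed in the thesis, which the abstract does not specify; the names `RewardWeights`, `step_reward`, and `collision_dominates`, the two shaping terms, and the dominance condition itself are all assumptions made for illustration. The sketch assumes that a collision ends the episode, that `progress` lies in [-1, 1], and that `clearance` lies in [0, 1].

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    """Hypothetical weights of a multi-objective navigation reward."""
    progress: float   # weight on progress toward the goal
    clearance: float  # weight on keeping distance from obstacles
    collision: float  # penalty magnitude for a collision (failure condition)


def step_reward(progress: float, clearance: float, collided: bool,
                w: RewardWeights) -> float:
    """Weighted sum of the objectives for one time step.

    Assumes progress is normalised to [-1, 1] and clearance to [0, 1].
    A collision overrides the other terms and is assumed to be terminal.
    """
    if collided:
        return -w.collision
    return w.progress * progress + w.clearance * clearance


def collision_dominates(w: RewardWeights, gamma: float) -> bool:
    """Check that a collision is worse than any collision-free continuation.

    With |step reward| <= progress + clearance weights, the discounted return
    of any collision-free continuation is at least -bound, where
    bound = (w.progress + w.clearance) / (1 - gamma).
    If the collision penalty exceeds this bound, colliding can never be
    preferred, however the environment's dynamics perturb the other terms.
    """
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must be in [0, 1)")
    bound = (w.progress + w.clearance) / (1.0 - gamma)
    return w.collision > bound
```

For example, with `RewardWeights(progress=1.0, clearance=0.5, collision=20.0)` and a discount factor of 0.9, the bound is about 15, so the check passes; lowering the collision penalty to 10 makes it fail, which signals that fluctuating shaping rewards could in principle outweigh the failure penalty.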