Reinforcement Learning with Parameterized Actions for Half Field Offense in 3D Simulation of Soccer Robots

STUDENT

DEGREE

YEAR

Problems in the robotics domain are usually characterized by high-dimensional continuous state and action spaces. The true state is not directly accessible; only partial, noisy observations of it are available. Controlling autonomous robots in this domain is consequently challenging. Reinforcement learning addresses this by letting a robot discover an optimal behavior autonomously through interaction with its environment. RoboCup Soccer is a commonly used testbed for reinforcement learning methods: its continuous multi-dimensional state space, noisy actions and perceptions, and high uncertainty make reinforcement learning a natural fit. Many tasks have been learned in this domain with these methods; Keepaway and Half Field Offense are two examples that pose well-suited reinforcement learning problems. Such tasks have mostly been learned in 2D soccer. Learning is much more difficult in 3D soccer because of the additional dimension and the physical constraints of controlling humanoid robots, so applying algorithms developed for the 2D environment in 3D faces new challenges. One effort in this direction is the extension of Keepaway from 2D to 3D soccer. Reinforcement learning problems typically feature either discrete or continuous action spaces. Attaching continuous parameters to each discrete action makes it possible to fine-tune actions to different situations. Learning in such parameterized action spaces is complicated by the need to handle continuous parameters, but it provides the most fine-grained control over the agent's behavior. One way to learn in this setting is to define separate policies for the discrete actions and for the continuous parameters of each action, and then to alternate between learning these policies through interaction with the environment. In this study, a single-agent Half Field Offense task is learned in a parameterized action space in the domain of 3D soccer simulation.
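The alternating scheme described above can be illustrated with a minimal sketch. It is not the implementation used in this study: the one-dimensional toy field, the `step` dynamics, the hidden ideal parameters in `BEST`, and all function names are assumptions invented for illustration. Tabular Q-learning stands in for the value-based part and simple hill climbing stands in for the policy search part.

```python
import random

ACTIONS = ("dribble", "shoot")   # discrete actions, each with one continuous parameter
N_CELLS = 5                      # toy 1-D field: cell 0 is midfield, cell 4 is nearest the goal
BEST = {"dribble": 0.5, "shoot": 0.8}   # hidden ideal parameters (invented for this toy)

def step(pos, action, param, rng):
    """Apply the parameterized action (action, param); return (next_pos, reward, done)."""
    quality = 1.0 - abs(param - BEST[action])   # 1.0 at the ideal parameter, lower elsewhere
    if action == "dribble":
        if rng.random() < quality:
            return min(pos + 1, N_CELLS - 1), 0.0, False
        return pos, 0.0, True                    # possession lost, episode ends
    p_goal = quality * pos / (N_CELLS - 1)       # shots from closer cells score more often
    return pos, (1.0 if rng.random() < p_goal else 0.0), True

def run_episode(q, theta, rng, epsilon=0.0, alpha=None, gamma=0.95, max_steps=20):
    """Play one episode; if alpha is given, apply Q-learning updates along the way."""
    pos, total = 0, 0.0
    for _ in range(max_steps):
        if rng.random() < epsilon:
            action = rng.choice(ACTIONS)                       # explore
        else:
            action = max(ACTIONS, key=lambda a: q[(pos, a)])   # greedy discrete choice
        # The continuous parameter comes from the separate parameter policy theta.
        nxt, reward, done = step(pos, action, theta[action], rng)
        if alpha is not None:
            target = reward if done else reward + gamma * max(q[(nxt, a)] for a in ACTIONS)
            q[(pos, action)] += alpha * (target - q[(pos, action)])
        pos, total = nxt, total + reward
        if done:
            break
    return total

def evaluate(q, theta, rng, episodes=200):
    """Average goals per episode under the greedy discrete policy."""
    return sum(run_episode(q, theta, rng) for _ in range(episodes)) / episodes

def param_search(q, theta, rng, trials=10, sigma=0.1):
    """Hill climbing on the continuous parameters while the Q-table stays fixed."""
    best, best_score = dict(theta), evaluate(q, theta, rng)
    for _ in range(trials):
        cand = {a: min(1.0, max(0.0, best[a] + rng.gauss(0.0, sigma))) for a in ACTIONS}
        score = evaluate(q, cand, rng)
        if score > best_score:
            best, best_score = cand, score
    return best

def train(rounds=10, seed=0):
    """Alternate between learning the discrete-action values and the parameters."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_CELLS) for a in ACTIONS}
    theta = {"dribble": 0.3, "shoot": 0.3}
    for _ in range(rounds):
        for _ in range(300):                      # phase 1: Q-learning, parameters fixed
            run_episode(q, theta, rng, epsilon=0.2, alpha=0.1)
        theta = param_search(q, theta, rng)       # phase 2: parameter search, Q fixed
    return q, theta
```

Calling `train()` returns the Q-table over (cell, action) pairs and one learned parameter per discrete action; `evaluate` then reports the average number of goals per episode, mirroring the goals-scored criterion used in the study. Because candidate parameters are compared on noisy rollouts, a single search step can occasionally accept a worse setting.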
The agent must learn to maintain possession of the ball against a defender while advancing toward the goal, and to shoot at the goal at an appropriate time. The agent's performance is evaluated by the number of goals scored. This study is significant in part because this task has not previously been learned in the 3D environment. Furthermore, using a parameterized action space and learning two separate policies, one for discrete actions and one for continuous parameters, entails combining value-based methods with policy search methods in a highly complex environment. The final results demonstrate that, despite a large state space and noisy, intricate actions, the agent succeeds in learning both policies. Across the different test scenarios, the number of goals scored by the agent maintains an upward trend.

Keywords: Machine Learning, Reinforcement Learning, Parameterized Actions, Robotics, 3D Simulation, RoboCup, Half Field Offense

Problems in robotics usually have high-dimensional continuous state and action spaces. Moreover, in these problems the true state is not available in a fully observable, noise-free form. Such characteristics make controlling autonomous robots difficult. Reinforcement learning enables robots to discover optimal behavior autonomously through interaction with the environment. Robot soccer is a common environment for testing reinforcement learning methods. The continuous multi-dimensional state, noisy perceptions and actions, dynamics, and high uncertainty of this environment make reinforcement learning a suitable choice for it. Many tasks have so far been learned in this environment using these methods. Keepaway and Half Field Offense are two examples of such tasks that define suitable problems for reinforcement learning. Such tasks have mostly been learned in 2D soccer. Learning in 3D soccer is harder than in the 2D environment because of the extra dimension and the physical constraints of controlling humanoid robots. Consequently, applying 2D-environment algorithms in 3D space faces new challenges. Work done so far in this regard includes transferring the Keepaway skill from 2D soccer to 3D soccer. Reinforcement learning problems usually have a discrete or continuous action space. By associating continuous parameters with each discrete action, each action can be adapted to different situations. Learning in a parameterized action space is complicated because it deals with continuous parameters, but it yields the most fine-grained control over the agent's behavior. One method for learning in this space is to define separate policies for the discrete action and for the continuous parameters of each action, and to learn these policies alternately during interaction with the environment. In this research, the single-agent Half Field Offense task is learned in a parameterized action space in 3D robot soccer simulation. The agent must learn to advance in the opponent's half while keeping possession of the ball against one defender, and to shoot the ball toward the goal at the right time. The evaluation criterion for this agent is the number of goals scored.
Learning this task has not previously been done in the 3D environment, and this research is significant in that respect. In addition, using a parameterized action space and learning two separate policies, one for selecting the discrete action and one for the continuous parameters of each action, makes this research more difficult. The results show that, despite the large state space, noise, and high complexity of actions in the 3D environment, the agent succeeds in learning these policies. In the different scenarios tested, an upward trend in the number of goals scored by the agent is maintained.

Keywords: Machine Learning, Reinforcement Learning, Parameterized Actions, Robotics, 3D Simulation, RoboCup, Half Field Offense.