Building a humanoid robot that can behave like a human being is one of the main interests in robotics. Among human mechanical behaviors, the most prominent is walking. Humanoid walking is a research domain that spans several areas of science, including biology, physiology, mechanics, control, artificial intelligence, and robotics. From the perspective of artificial intelligence, reinforcement learning in continuous spaces is a suitable control method for learning behavioral patterns that must be executed smoothly and continuously. In recent years, this machine learning method has received wide attention in the control and AI communities as a trial-and-error approach in which the agent gains experience by interacting with its environment, and it has been applied successfully to a variety of control tasks. The present research proposes a learning algorithm based on the reinforcement learning method known as Policy Gradients, used to teach the robot how to step, the basic action of the walking process. Motion planning is based on dividing a single step into two sub-behaviors: the transition functions optimized between three predefined semi-dynamic frames. The learning algorithm itself is divided into two processes: learning the robot's balance and minimizing the gyroscope error. The gyroscope reward function introduced in this research is a function of the angular speeds about each axis of the humanoid's motion space. The first process is based on the Policy Gradients algorithm; in the second, a hill-climbing search is combined with Policy Gradients to improve the quality of learning. The proposed method uses the stochastic policy typical of Policy Gradient methods. The results show that, while learning balance, the robot minimizes the gyroscope error as a fitness measure for tension minimization, and that, although the environment is stochastic, the final learned behavior is stable enough to overcome the system noise. The humanoid is able to adjust its motor speeds while learning the motion functions; increased motion speed and faster convergence are further consequences of the learning algorithm. Although there is no convergence proof for Policy Gradient methods in reinforcement learning, it is demonstrated that these methods, combined with frame-based motion planning and the gyroscope reward function, not only lead to a stable walking gait based on the learned step behaviors but also converge to a suitable local optimum in the policy space.

Keywords: Humanoid Robots, Reinforcement Learning, Policy Gradients, Gyroscope, Hill Climbing
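
As an illustration only, the two ingredients named in the abstract could be realized as sketched below. The abstract does not give the exact form of the gyroscope reward or of the policy update, so the quadratic penalty on angular speeds, the Gaussian stochastic policy, the REINFORCE-style update, and all function and parameter names here are assumptions made for the sketch, not the paper's implementation.

    import numpy as np

    def gyroscope_reward(omega, weights=None):
        """Hypothetical gyroscope reward: a function of the angular speeds
        about each axis of the robot's motion space. Here it is assumed to be
        the negative weighted sum of squared angular velocities, so that
        minimizing the gyroscope error maximizes this reward."""
        omega = np.asarray(omega, dtype=float)        # e.g. [roll, pitch, yaw] rates
        w = np.ones_like(omega) if weights is None else np.asarray(weights, dtype=float)
        return -float(np.sum(w * omega ** 2))

    def reinforce_update(theta, sigma, episodes, alpha=1e-3):
        """Illustrative REINFORCE-style update for a Gaussian (stochastic)
        policy a ~ N(phi(s) . theta, sigma^2), the kind of stochastic policy
        used by Policy Gradient methods. 'episodes' is a list of
        (features, actions, rewards) trajectories collected by trial and
        error on the robot."""
        grad = np.zeros_like(theta)
        for features, actions, rewards in episodes:
            G = float(np.sum(rewards))                # episode return
            for phi, a in zip(features, actions):
                mean = float(phi @ theta)
                # gradient of log N(a | mean, sigma^2) with respect to theta
                grad += ((a - mean) / sigma ** 2) * phi * G
        grad /= max(len(episodes), 1)
        return theta + alpha * grad                   # gradient ascent step

In such a setup, each rollout of the step behavior would be scored with the gyroscope reward, and the policy parameters of the frame transition functions would be adjusted by the gradient ascent step; the hill-climbing search mentioned in the abstract would then act on top of this update, which is again stated here only as an assumption about how the pieces fit together.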