Building a humanoid robot that can behave like a human being is one of the main interests in robotics. Among human mechanical behaviors, the most prominent is walking. Humanoid walking is a research domain that spans several areas of science, including biology, physiology, mechanics, control, artificial intelligence, and robotics. From the perspective of artificial intelligence, reinforcement learning in continuous spaces is a suitable control method for learning behavioral patterns that must be executed smoothly and continuously. In recent years, this machine learning method has received wide attention in the control and AI communities as a trial-and-error approach in which the agent gains experience by interacting with its environment, and it has been applied successfully to a variety of control tasks. The present research proposes a learning algorithm based on the reinforcement learning method known as Policy Gradients, used to teach the robot how to step, the basic action of the walking process. Motion planning is based on dividing a single step into two sub-behaviors: the transition functions optimized between three predefined semi-dynamic frames. The learning algorithm itself is divided into two processes: learning the robot's balance and minimizing the gyroscope error. The gyroscope reward function introduced in this research is a function of the angular speeds about each axis of the humanoid's motion space. The first process is based on the Policy Gradients algorithm; in the second, a hill-climbing search is combined with Policy Gradients to improve the quality of learning. The proposed method uses the stochastic policy typical of Policy Gradient methods. The results show that, while learning balance, the robot minimizes the gyroscope error as a fitness measure for tension minimization, and that, although the environment is stochastic, the final learned behavior is stable enough to overcome the system noise. The humanoid is able to adjust its motor speeds while learning the motion functions; increased motion speed and faster convergence are further consequences of the learning algorithm. Although there is no convergence proof for Policy Gradient methods in reinforcement learning, it is demonstrated that these methods, combined with frame-based motion planning and the gyroscope reward function, not only lead to a stable walking gait based on the learned step behaviors but also converge to a suitable local optimum in the policy space.

Keywords: Humanoid Robots, Reinforcement Learning, Policy Gradients, Gyroscope, Hill Climbing
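
As an illustration only, the two ingredients named in the abstract could be realized as sketched below. The abstract does not give the exact form of the gyroscope reward or of the policy update, so the quadratic penalty on angular speeds, the Gaussian stochastic policy, the REINFORCE-style update, and all function and parameter names here are assumptions made for the sketch, not the paper's implementation.

    import numpy as np

    def gyroscope_reward(omega, weights=None):
        """Hypothetical gyroscope reward: a function of the angular speeds
        about each axis of the robot's motion space. Here it is assumed to be
        the negative weighted sum of squared angular velocities, so that
        minimizing the gyroscope error maximizes this reward."""
        omega = np.asarray(omega, dtype=float)        # e.g. [roll, pitch, yaw] rates
        w = np.ones_like(omega) if weights is None else np.asarray(weights, dtype=float)
        return -float(np.sum(w * omega ** 2))

    def reinforce_update(theta, sigma, episodes, alpha=1e-3):
        """Illustrative REINFORCE-style update for a Gaussian (stochastic)
        policy a ~ N(phi(s) . theta, sigma^2), the kind of stochastic policy
        used by Policy Gradient methods. 'episodes' is a list of
        (features, actions, rewards) trajectories collected by trial and
        error on the robot."""
        grad = np.zeros_like(theta)
        for features, actions, rewards in episodes:
            G = float(np.sum(rewards))                # episode return
            for phi, a in zip(features, actions):
                mean = float(phi @ theta)
                # gradient of log N(a | mean, sigma^2) with respect to theta
                grad += ((a - mean) / sigma ** 2) * phi * G
        grad /= max(len(episodes), 1)
        return theta + alpha * grad                   # gradient ascent step

In such a setup, each rollout of the step behavior would be scored with the gyroscope reward, and the policy parameters of the frame transition functions would be adjusted by the gradient ascent step; the hill-climbing search mentioned in the abstract would then act on top of this update, which is again stated here only as an assumption about how the pieces fit together.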