高次元系のための確率的状態推定に基づいた強化学習
Reinforcement learning for high-dimensional systems based on stochastic state estimation
S3-4-1-1
線形可解マルコフ決定過程に基づいた新しい制御目的を達成するための学習済み制御器の合成法
Combining learned controllers to achieve new goals based on linearly solvable MDPs

○内部英治1, 金城健1,2, 銅谷賢治1,2
○Eiji Uchibe1, Ken Kinjo1,2, Kenji Doya1,2
沖縄先端科学技術大学院大学(OIST)1, 奈良先端科学技術大学院大学2
Okinawa Inst. of Science and Tech.1, Nara Institute of Science and Technology2

Learning complicated behaviors usually involves intensive manual tuning and expensive computational optimization because we have to solve a nonlinear Bellman equation. If we can create a new controller by combining learned controllers, learning time can be reduced drastically. However, a simple linear combination of controllers is not appropriate and leads to undesirable behaviors. Recently, Todorov proposed a class of problems called Linearly Solvable Markov Decision Processes (LMDPs), which convert the nonlinear Bellman equation into a linear equation in the desirability function, a nonlinear transformation of the value function. Linearity of the simplified Bellman equation allows us to apply superposition to derive a new composite controller from a set of learned component controllers. Todorov also proposed a method for blending multiple controllers within the model-based LMDP framework, in which the mixing weights arise naturally from the linearity of the equation. However, his method requires a model of the dynamics and was not evaluated in a real domain. This study proposes a model-free method similar to Least-Squares Temporal Difference (LSTD) learning, in which the exponentiated cost function can be regarded as the discount factor of LSTD, and a part of the parameters of the desirability function can be shared by the component controllers. The proposed method is applied to learning walking behaviors with a quadruped robot named Spring Dog for evaluation in real robot experiments. The goal of each component task is to reach a specific target position in the environment, and that of the composite task is to approach an arbitrary region represented by the components' target positions. Experimental results confirm that compositionality in LMDPs is promising in real robot environments.
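As a reference for the compositionality exploited above, the following sketch (in our own notation, following Todorov's published formulation rather than the authors' exact equations) shows why superposition applies. With state cost q(x), passive dynamics p(x'|x), value function v(x), and desirability z(x) = \exp(-v(x)), the Bellman equation becomes linear:

z(x) = \exp(-q(x)) \sum_{x'} p(x' \mid x)\, z(x').

If the terminal cost g of a composite task satisfies \exp(-g(x)) = \sum_k w_k \exp(-g_k(x)) for component terminal costs g_k with learned desirabilities z_k, then linearity gives z(x) = \sum_k w_k z_k(x), so the composite optimal controller follows from the component controllers without further optimization.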
S3-4-1-2
軌道モデルベース強化学習:ヒューマノイド運動制御への応用
Trajectory-model-based Reinforcement Learning Approach: Application to Humanoid Motor Control

○森本淳1
○Jun Morimoto1
ATR脳情報研究所ブレインロボットインターフェース研究室1
Dept of Brain Robot Interface, ATR Computational Neuroscience Labs1

We propose a reinforcement learning (RL) framework in which an approximated dynamics model of a humanoid robot is used. Although RL is a useful nonlinear optimal control method, applying it to real robotic systems is usually difficult due to the large number of iterations required to acquire suitable policies. In this study, we approximate the dynamics using data from a real robot with sparse pseudo-input Gaussian processes (SPGPs). By using SPGPs, we estimate the probability distributions of output variables considering both the input densities and the observation noise. Since observations from robot sensors in real environments can include large noise, SPGPs can suitably approximate the stochastic dynamics of a real humanoid robot. We use the approximated dynamics to improve the performance of a movement task in a path integral RL framework, which updates a policy from sampled trajectories of the state and action vectors and their cost. We implemented the proposed method on a real humanoid robot and tested it on a via-point reaching task. With the proposed method, the robot achieved successful performance with fewer interactions with the real environment than with a conventional approach that does not use the approximated dynamics.
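As an illustration of the path integral RL update described above, the following minimal Python sketch applies a PI^2-style cost-weighted parameter update to a toy one-dimensional via-point task; the rollout model, cost terms, and all constants are our assumptions, not the actual humanoid setup.

import numpy as np

lam, K, T = 1.0, 32, 50             # temperature, rollouts, horizon (assumed)
theta = np.zeros(T)                 # open-loop policy parameters, one per step

def rollout(params):
    # Toy surrogate for the (learned) dynamics: integrate a 1-D system and
    # return the trajectory cost for passing a via-point at mid-horizon.
    x, cost = 0.0, 0.0
    for t, u in enumerate(params):
        x += 0.1 * u
        if t == T // 2:
            cost += 100.0 * (x - 1.0) ** 2   # via-point cost
        cost += 0.01 * u * u                 # control effort
    return cost + 100.0 * x ** 2             # terminal cost (return to origin)

for iteration in range(100):
    eps = np.random.randn(K, T)              # exploration noise per rollout
    costs = np.array([rollout(theta + e) for e in eps])
    w = np.exp(-(costs - costs.min()) / lam)
    w /= w.sum()                             # softmax over trajectory costs
    theta += w @ eps                         # cost-weighted parameter update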
S3-4-1-3
Scaled free-energy based reinforcement learning for robust and efficient learning in high-dimensional state space
○内部英治1
○Eiji Uchibe1
沖縄先端科学技術大学院大学(OIST)1
Okinawa Inst. of Science and Tech.1


S3-4-1-4
Exploiting previous experience to constrain reinforcement learning of robot control policies
○Ales Ude1
Jozef Stefan Institute1

In this talk I'm going to present a new methodology that combines ideas from statistical learning and reinforcement learning to efficiently compute new robot control policies. We start by collecting data from several robot executions of the desired task in different configurations of the external world. Statistical learning techniques are then applied to compute a low-dimensional approximation of the optimal manifold of robot trajectories that solve the desired task in these configurations. The dimensionality of the approximating manifold is usually much lower than that of the space of all robot movements. Next we refine the obtained policies by means of reinforcement learning on the approximating manifold, which results in a learning problem constrained to the low-dimensional parameter space. We propose a reinforcement learning algorithm with an extended parameter set, which combines learning in the constrained domain with learning in the full space of parametric movement primitives. In this way the robot can also explore actions outside the initial approximating manifold. The proposed approach was tested for learning various tasks on different robots.
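To make the two-stage idea concrete, here is a minimal Python sketch under strong simplifying assumptions: PCA stands in for the statistical learning step, and a naive stochastic hill climb stands in for the reinforcement learning step (the talk's extended-parameter-set algorithm is not reproduced); the demonstration data and reward are hypothetical.

import numpy as np

D, d, M = 100, 3, 20                 # full dim, manifold dim, demonstrations
demos = np.random.randn(M, D)        # stand-in for recorded task executions

# PCA: the principal subspace of the demonstrations approximates the manifold
# of trajectories that solve the task in different world configurations.
mean = demos.mean(axis=0)
U, S, Vt = np.linalg.svd(demos - mean, full_matrices=False)
basis = Vt[:d]                       # d basis vectors spanning the manifold

def reward(policy_params):
    return -np.sum((policy_params - 1.0) ** 2)   # hypothetical task reward

# Simple stochastic search constrained to the manifold's latent coordinates.
z = np.zeros(d)
for it in range(200):
    z_new = z + 0.1 * np.random.randn(d)
    if reward(mean + z_new @ basis) > reward(mean + z @ basis):
        z = z_new
best_policy = mean + z @ basis       # full-space parameters of learned policy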
S3-4-1-5
反復型経路積分法に基づく非線形確率最適制御
Nonlinear stochastic optimal control based on iterative path integral method

○佐藤訓志1, Hilbert J. Kappen2, 佐伯正美1
○Satoshi Satoh1, Hilbert J. Kappen2, Masami Saeki1
広島大学・工・機械システム1
Faculty Eng, Hiroshima Univ, Hiroshima1, Dept Biophysics, Radboud Univ Nijmegen, the Netherlands2

The optimal feedback controller for a nonlinear stochastic optimal control problem is given by solving a stochastic Hamilton-Jacobi-Bellman (SHJB) equation. Since the SHJB equation is a nonlinear partial differential equation (PDE) of second order, it is very difficult to solve. The aim of this talk is to provide an iterative solution method for the SHJB equation based on probability theory and statistical physics. The proposed method is an extension of a path integral optimal control method originally proposed by one of the authors. The main contribution of our method is to remove a special condition imposed in the original method, which restricts the applicable class of plant systems and cost functions. We introduce an iteration law such that if the solution for each iteration converges, the result coincides with the solution to the SHJB equation to be solved. The iteration procedure is obtained by applying an exponential transformation to the original SHJB equation, so that each iteration forms a linear PDE of Kolmogorov backward type. The explicit solution to this PDE is given by the Feynman-Kac formula in path integral form. Further, since the PDE always satisfies the special condition required by the conventional path integral method, the corresponding suboptimal controller in each iteration is easily obtained. The convergence property of the proposed method is investigated, and a convergence condition is provided. Consequently, the proposed method can iteratively solve the SHJB equation without the special condition required so far, which enables us to solve a wider class of nonlinear stochastic optimal control problems.
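For context, the exponential transformation referred to above can be sketched as follows (our notation, following the standard path integral control formulation; the proposed iteration law itself is not reproduced here). For dynamics dx = f(x)\,dt + G(x)u\,dt + \sigma(x)\,dW, quadratic control cost \frac{1}{2}u^\top R u, and state cost q(x), the transformation \psi = \exp(-V/\lambda) linearizes the SHJB equation, provided the special condition \sigma\sigma^\top = \lambda\, G R^{-1} G^\top holds:

\partial_t \psi = \frac{q}{\lambda}\,\psi - f^\top \nabla\psi - \frac{1}{2}\,\mathrm{tr}\!\left(\sigma\sigma^\top \nabla^2 \psi\right),

whose solution is given by the Feynman-Kac formula as the path integral \psi(x,t) = \mathbb{E}\left[\exp\left(-\tfrac{1}{\lambda}\left(\phi(x_T) + \int_t^T q(x_s)\,ds\right)\right)\right] over uncontrolled trajectories. Removing the need for this condition on the original problem is the contribution claimed above.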
S3-4-1-6
非線形モデル予測制御:概念,アルゴリズムおよび応用
Nonlinear model predictive control: concepts, algorithms and applications

○大塚敏之1
○Toshiyuki Ohtsuka1
京都大学1
Kyoto University1

In this talk, key concepts, numerical algorithms, and applications of nonlinear model predictive control (NMPC) are introduced for non-specialists. NMPC is a general framework of feedback control for nonlinear dynamical systems. In NMPC, the control input to a system is determined at each time by real-time optimization of the system response over a finite future, which is computationally demanding. NMPC is applicable to a wide variety of control problems as long as a mathematical model of the system is available and real-time optimization is feasible. The development of efficient numerical algorithms for NMPC has been an active area of research in control engineering in recent years. In particular, efficient numerical algorithms without any iterative searches have been developed by exploiting special structures in the optimization problem of NMPC. Application examples in this talk include mechatronic systems with sampling periods on the order of milliseconds.
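The receding-horizon idea can be illustrated with a minimal Python sketch; note that it uses a generic iterative optimizer, unlike the non-iterative algorithms highlighted in the talk, and the pendulum model, horizon, and cost weights are all assumptions for illustration.

import numpy as np
from scipy.optimize import minimize

dt, N = 0.05, 20          # sampling period and prediction horizon (assumed)

def f(x, u):
    # Example plant: pendulum-like nonlinear dynamics (hypothetical)
    return np.array([x[1], -np.sin(x[0]) + u])

def rollout_cost(u_seq, x0):
    # Predict the model response over the finite horizon and accumulate cost.
    x, cost = x0.copy(), 0.0
    for u in u_seq:
        cost += x @ x + 0.1 * u * u
        x = x + dt * f(x, u)        # explicit Euler prediction
    return cost + 10.0 * (x @ x)    # terminal penalty

x = np.array([np.pi / 2, 0.0])      # initial state
u_seq = np.zeros(N)                 # warm-started input sequence
for step in range(100):
    # Real-time optimization over a finite future, re-solved at every sample.
    res = minimize(rollout_cost, u_seq, args=(x,), method="L-BFGS-B")
    u_seq = res.x
    x = x + dt * f(x, u_seq[0])     # apply only the first input to the plant
    u_seq = np.roll(u_seq, -1); u_seq[-1] = 0.0   # shift for warm start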