 
General Oral Presentations
報酬と罰の学習
Appetitive and Aversive Learning
Chair: Yu Ohmura (Department of Pharmacology, Faculty of Medicine, Hokkaido University)
July 2, 2022, 9:00-9:15, Okinawa Convention Center, Meeting Room B3/4 (Room 6)
3O06m1-01
内側前頭前野の神経細胞は他者の報酬予測誤差信号を処理しているか?
Do medial prefrontal neurons encode errors in the prediction of others’ reward?

*則武 厚(1,2)、磯田 昌岐(1,2)
1. 生理学研究所認知行動発達研究部門、2. 総合研究大学院大学生命科学研究科生理科学専攻
*Atsushi Noritake(1,2), Masaki Isoda(1,2)
1. Dept Syst Neurosci, Natl Inst Physiol Sci, Okazaki, Japan, 2. Dept Physiol Sci, Grad Univ Adv Stud (Sokendai), Hayama, Japan

Keyword: reward prediction error, others, medial prefrontal cortex, social

Current theoretical frameworks dealing with social interactions, such as observational learning and mentalizing, posit that the brain predicts the rewards of other individuals so as to simulate their unobservable states of mind as well as their upcoming behavior. In these frameworks, errors between the predicted and received rewards of others (“others’ reward prediction errors”, oRPEs) are a critical piece of information for the observer to improve prediction accuracy. Neuroimaging and electrophysiological studies suggest that the medial prefrontal cortex (MPFC) generates signals about the prediction of others’ rewards. However, it remains unknown whether activity in the MPFC also signals oRPEs. To test this possibility, we recorded from single neurons (n = 319) in the MPFC of macaque monkeys during social Pavlovian conditioning. In this procedure, a pair of monkeys (a recorded monkey designated as “self” and a nonrecorded monkey designated as “partner”) were simultaneously conditioned with visual stimuli that were followed by rewards with different probabilities for the two monkeys. Reward outcomes were revealed first to the partner and 1 s later to the self. There was no trial in which both monkeys were rewarded; therefore, the final outcome in each trial was partner-only-rewarded, self-only-rewarded, or neither-rewarded. This procedure generated different conditional oRPEs depending on the partner’s reward outcome (i.e., unrewarded or rewarded) and allowed us to identify MPFC neurons encoding oRPEs rather than one’s own RPEs. We found that, when the partner was not rewarded, the activity of 56 neurons was significantly correlated with oRPEs (positive correlation, n = 15; negative correlation, n = 41). When the partner was rewarded, the activity of 39 neurons was significantly correlated with oRPEs (positive, n = 23; negative, n = 16). These two populations of neurons were largely non-overlapping. A further test revealed that the neuronal signature of oRPEs was attenuated in a nonsocial condition in which the partner was absent. These findings demonstrate that the MPFC contains dedicated neuronal populations for signaling negative oRPEs and positive oRPEs. We suggest that neurons in the MPFC can encode a fundamental computational variable for grasping the current states of others’ minds and predicting their future behavior.
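
A minimal Python sketch of the conditional oRPE computation described above. The cue-specific partner reward probabilities, trial count, and variable names are illustrative assumptions, not the authors' actual task parameters or analysis code.

import random

cue_prob_partner = {"cueA": 0.25, "cueB": 0.50, "cueC": 0.75}  # hypothetical cue-reward probabilities for the partner

neg_orpes, pos_orpes = [], []  # oRPEs on partner-unrewarded vs. partner-rewarded trials
for _ in range(1000):
    cue = random.choice(list(cue_prob_partner))       # cue shown on this trial
    predicted = cue_prob_partner[cue]                 # predicted reward for the partner
    rewarded = random.random() < predicted            # partner's actual outcome
    orpe = (1.0 if rewarded else 0.0) - predicted     # oRPE = outcome - prediction
    (pos_orpes if rewarded else neg_orpes).append(orpe)

# Neurons encoding oRPEs would show trial-by-trial activity correlated with these
# values, analyzed separately for partner-unrewarded and partner-rewarded trials.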
July 2, 2022, 9:15-9:30, Okinawa Convention Center, Meeting Room B3/4 (Room 6)
3O06m1-02
ドーパミンはKCNQチャネルのリン酸化を介して神経細胞の興奮性を高めることで報酬行動を促進させる
Dopamine drives neuronal excitability via KCNQ channel phosphorylation for reward behavior

*坪井 大輔(1)、大塚 岳(4)、下村 拓史(4)、Md.Faruk Omar(2)、山橋 幸恵(1)、天野 睦紀(2)、佐野 裕美(5)、永井 拓(7)、山田 清文(2)、Anastasios V. Tzingounis(6)、南部 篤(5)、久保 義弘(4)、川口 泰雄(3)、貝淵 弘三(1)
1. 藤田医科大学総合医科学研究所、2. 名古屋大学医学系研究科、3. 生理学研究所大脳回路論研究部門、4. 生理学研究所神経機能素子研究部門、5. 生理学研究所生体システム研究部門、6. コネティカット大学神経生理学部門、7. 藤田医科大学神経精神病態解明センター
*Daisuke Tsuboi(1), Takeshi Otsuka(4), Takushi Shimomura(4), Md. Faruk Omar(2), Yukie Yamahashi(1), Mutsuki Amano(2), Hiromi Sano(5), Taku Nagai(7), Kiyofumi Yamada(2), Anastasios V. Tzingounis(6), Atsushi Nambu(5), Yoshihiro Kubo(4), Yasuo Kawaguchi(3), Kozo Kaibuchi(1)
1. Institute for Comprehensive Medical Science, Fujita Health University, Toyoake, Japan, 2. Grad Sch Med, Nagoya Univ, Nagoya, Japan, 3. Division of Cerebral Circuitry, National Institute for Physiological Sciences, Okazaki, Japan, 4. Division of Biophysics and Neurobiology, National Institute for Physiological Sciences, Okazaki, Japan, 5. Division of System Neurophysiology, National Institute for Physiological Sciences, Okazaki, Japan, 6. Department of Physiology and Neurobiology, University of Connecticut, CT, USA, 7. Neuropsychological Research Center, Fujita Health University, Toyoake, Japan

Keyword: KCNQ2, extracellular signal-regulated kinase/ERK, reward learning, membrane excitation

The basal ganglia are innervated by midbrain dopamine neurons. Because dopamine regulates memory, learning, and reward behavior, dysfunctional dopamine signaling has been implicated in various neuropsychiatric disorders, including Parkinson's disease, drug addiction, and schizophrenia. The striatum/nucleus accumbens (NAc) of the basal ganglia is mainly composed of medium spiny neurons (MSNs) that express dopamine receptors 1 and 2 (D1R and D2R). Dopamine increases the excitability of D1R-MSNs and thereby drives reward behavior, a process that involves a phosphorylation cascade including PKA. We have previously shown that dopamine increases D1R-MSN excitability and firing rates in the NAc via the PKA/Rap1/Raf/ERK pathway to promote reward behavior. However, the mechanism by which ERK controls MSN excitability and reward behavior remains largely unknown. Here, we found that the D1R agonist SKF81297 inhibited KCNQ-mediated currents and increased D1R-MSN firing rates in mouse NAc slices, and both effects were abolished by ERK inhibition. ERK phosphorylated KCNQ2 at Ser414 and Ser476 in vitro, and KCNQ2 was phosphorylated downstream of dopamine signaling in NAc slices. ERK inhibited KCNQ channel activity in a phosphorylation-dependent manner. Conditional deletion of Kcnq2 in D1R-MSNs diminished the inhibitory effect of SKF81297 on KCNQ channel activity and enhanced neuronal excitability and cocaine-induced reward behavior. These effects were rescued by wild-type, but not by phospho-deficient, KCNQ2. Together, these findings demonstrate that D1R-ERK signaling controls MSN excitability via KCNQ2 phosphorylation to regulate reward behavior. Post-translational modification of KCNQ2 is a potential therapeutic target for psychiatric disorders involving dysfunction of the reward circuit.
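
The proposed mechanism, in which phosphorylation-dependent inhibition of KCNQ (M-type) channels raises D1R-MSN excitability, can be illustrated with a toy leaky integrate-and-fire model. The sketch below is not the authors' model; all parameter values are assumptions chosen only to show that reducing an M-like K+ conductance increases the firing rate for the same input current.

import numpy as np

def simulate(g_m, I=430.0, T=1000.0, dt=0.1):
    """Return firing rate (Hz) for a given M-like K+ conductance g_m (nS)."""
    C, g_L, E_L, E_K = 200.0, 10.0, -70.0, -90.0   # pF, nS, mV (assumed values)
    V_th, V_reset = -50.0, -65.0                   # spike threshold and reset (mV)
    V, spikes = E_L, 0
    for _ in range(int(T / dt)):
        dV = (-g_L * (V - E_L) - g_m * (V - E_K) + I) / C   # leak + M-like current + input
        V += dt * dV
        if V >= V_th:
            V = V_reset
            spikes += 1
    return spikes / (T / 1000.0)

print("intact KCNQ-like conductance   :", simulate(g_m=5.0), "Hz")
print("suppressed (D1R/ERK-like state):", simulate(g_m=0.5), "Hz")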
July 2, 2022, 9:30-9:45, Okinawa Convention Center, Meeting Room B3/4 (Room 6)
3O06m1-03
二つの皮質基底核経路で異なる状態表現が使用される場合の理論的検討
Possible functional advantage of combining different state representations in the two cortico-basal ganglia pathways

*森田 賢治(1,2)、下村 寛治(1,3)、川口 泰雄(4,5)
1. 東京大学大学院教育学研究科、2. 東京大学ニューロインテリジェンス国際研究機構、3. 国立精神・神経医療研究センター精神保健研究所行動医学研究部、4. 玉川大学脳科学研究所、5. 生理学研究所
*Kenji Morita(1,2), Kanji Shimomura(1,3), Yasuo Kawaguchi(4,5)
1. Graduate School of Education, The University of Tokyo, Tokyo, Japan, 2. International Research Center for Neurointelligence (WPI-IRCN), The University of Tokyo, Tokyo, Japan, 3. Department of Behavioral Medicine, National Institute of Mental Health, National Center of Neurology and Psychiatry, Kodaira, Japan, 4. Brain Science Institute, Tamagawa University, Machida, Japan, 5. National Institute for Physiological Sciences (NIPS), Okazaki, Japan

Keyword: reinforcement learning, successor representation, reward prediction error, corticostriatal

The basal ganglia direct and indirect pathways have been suggested to be primarily involved in positive (appetitive) and negative (aversive) feedback-based learning via D1 and D2 dopamine receptors, respectively. Given that these pathways receive uneven inputs from different cortical neuron types and/or areas, they might use different state representations. Among the possible state representations, recent work suggests that the successor representation (SR) might underlie certain types of model-based/goal-directed behavior, whereas the individual representation (IR) of states can support model-free/habitual behavior. We examined how different combinations of appetitive/aversive learning and SR/IR perform in dynamic reward environments. We simulated reward navigation tasks in a two-dimensional grid space in which the reward location changed over time. We then examined the performance of an agent consisting of two systems, each using SR or IR, while systematically varying the ratio of the learning rates for positive and negative reward prediction errors in each system. We found that the combination of an SR-based appetitive learning system and an IR-based aversive learning system achieves good performance. This is presumably because SR-based generalization of positive feedback is beneficial when learning newly rewarded locations, whereas narrower, negative feedback-based learning is sufficient, and generalization can even be harmful, once the policy has been sharpened after the agent has acquired multiple rewards at nearby locations. Implementation of such a combination in cortico-basal ganglia circuits is potentially in line with several previous findings about corticostriatal neurons and their inputs: (1) the suggested involvement in goal-directed behavior of, as well as the less reciprocal connections among, intratelencephalic (IT) neurons, which have been suggested (though this remains controversial) to preferentially target the direct over the indirect pathway, and (2) the suggested implementation of SR in limbic/sensory cortices, which preferentially project to the direct over the indirect pathway.
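
A minimal Python sketch of the two-system agent described above: an appetitive system operating on SR features and an aversive system operating on an individual (one-hot) representation, each with its own learning rates for positive and negative reward prediction errors. The grid size, reward-relocation schedule, and all parameter values are illustrative assumptions and are simpler than the simulations reported here.

import numpy as np

rng = np.random.default_rng(0)
N, GAMMA = 5 * 5, 0.9            # 5x5 grid, discount factor (assumed)

def neighbors(s):
    r, c = divmod(s, 5)
    cand = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
    return [rr * 5 + cc for rr, cc in cand if 0 <= rr < 5 and 0 <= cc < 5]

M = np.eye(N)                    # SR matrix, learned online
w_app = np.zeros(N)              # appetitive system weights (SR features)
w_ave = np.zeros(N)              # aversive system weights (IR features)
a_pos_app, a_neg_app = 0.2, 0.02 # appetitive system: learns mainly from positive RPEs
a_pos_ave, a_neg_ave = 0.02, 0.2 # aversive system: learns mainly from negative RPEs

def value(s):
    return M[s] @ w_app + w_ave[s]            # combined value of the two systems

s, goal = 0, 24
for step in range(20000):
    if step % 2000 == 0:                      # reward location changes over time
        goal = rng.integers(N)
    nbrs = neighbors(s)                       # epsilon-greedy move on combined value
    s_next = rng.choice(nbrs) if rng.random() < 0.1 else max(nbrs, key=value)
    r = 1.0 if s_next == goal else 0.0
    delta = r + GAMMA * value(s_next) - value(s)          # shared RPE
    w_app += (a_pos_app if delta > 0 else a_neg_app) * delta * M[s]
    w_ave[s] += (a_pos_ave if delta > 0 else a_neg_ave) * delta
    M[s] += 0.1 * (np.eye(N)[s] + GAMMA * M[s_next] - M[s])  # TD update of the SR
    s = s_next if r == 0.0 else rng.integers(N)              # restart after reward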
July 2, 2022, 9:45-10:00, Okinawa Convention Center, Meeting Room B3/4 (Room 6)
3O06m1-04
階層的な行動構築における算術学的な価値の表象
Arithmetic value representation for hierarchical behavior composition

*牧野 浩史(1)
1. 南洋理工大学
*Hiroshi Makino(1)
1. Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore

Keyword: Deep reinforcement learning, Artificial intelligence, Hierarchical learning, Representation learning

Humans and other animals can repurpose pre-acquired behavioral skills for new, unseen tasks. Such an aptitude can expand their behavioral repertoire combinatorially. In deep reinforcement learning (RL), artificial agents can extract reusable skills from past experience and recombine them in a hierarchical manner. It remains largely unknown, however, whether the brain composes a novel behavior in a similar way. Here we trained deep RL agents with the soft actor-critic (SAC) algorithm and studied their representation of RL variables during hierarchical learning. The objective of SAC is to maximize future cumulative rewards and policy entropy, which confers flexibility and robustness to perturbation on the agents. We demonstrate that the agents learned to solve a novel composite task by additively combining representations of previously learned action values from the constituent subtasks. Sample efficiency in the composite task was further augmented by introducing a stochastic policy in the subtask, which endowed the agents with a wide range of action representations. These theoretical predictions were tested empirically in mice trained in the same behavioral paradigm, where mice with prior subtask training rapidly learned the composite task. Cortex-wide two-photon calcium imaging across the subtasks and the composite task revealed neural representations of combined action values analogous to those observed in the deep RL agents. These mixed representations of subtask-related action values were not observed in the agents when a new value function was instead constructed by taking the maximum of the subtask values, highlighting the specificity of the additive operation. As in the deep RL agents, learning efficiency in mice was further enhanced when the subtask policy was made more stochastic. Together, these results suggest that the brain composes a novel behavior through a simple arithmetic operation on pre-acquired action-value representations, aided by a stochastic policy.
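
A minimal Python sketch of the additive composition examined above. The action set and subtask action values are hypothetical; the sketch only contrasts additive and max-based composition of pre-acquired action values under a stochastic (softmax) policy, and does not reproduce the SAC agents or the mouse task.

import numpy as np

actions = ["left", "right", "forward", "backward"]
# Hypothetical action values acquired in the two constituent subtasks.
q_subtask1 = np.array([0.2, 1.0, 0.6, 0.1])
q_subtask2 = np.array([0.9, 0.1, 0.7, 0.2])

q_additive = q_subtask1 + q_subtask2          # additive composition
q_max = np.maximum(q_subtask1, q_subtask2)    # max-based alternative

def softmax_policy(q, temperature=0.5):
    """Stochastic policy over actions; higher temperature -> more exploration."""
    p = np.exp((q - q.max()) / temperature)
    return p / p.sum()

for name, q in [("additive", q_additive), ("max", q_max)]:
    print(name, dict(zip(actions, np.round(softmax_policy(q), 2))))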