Yu Kai: From Deep Learning to Deep Reinforcement Learning

Unity of Knowledge and Action: From Deep Learning to Deep Reinforcement Learning. Founder & CEO, Horizon Robotics

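Deep Learning

From an open online discussion on artificial intelligence in June 2015: “I believe deep learning is the only direction for machine learning.”

“Another exciting application of deep learning is learning to control. I believe robot control will be transformed by DNN reinforcement learning methods.”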

Deep Learning Since 2006

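From Neurons to Deep Learning

Why Is Deep Learning Getting So Much Attention?

• It mimics the behavior of the brain
• It is especially well suited to big data
• End-to-end learning
• It provides a modeling language

Deep Learning and Big Data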

End-to-end Learning

• Most critical for accuracy
• Accounts for most of the computation at test time
• Most time-consuming part of the development cycle
• Often hand-crafted in practice

Deep Learning

Deep Convolutional Neural Networks

ImageNet image classification error rate

Speech recognition

End-to-end learning for speech recognition

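Over the past 10 years, deep learning has led to major breakthroughs in perception

Reinforcement Learning: Learning How to Make Decisions

Viewed as a game between the decision system and the environment: make a sequence of decisions to optimize long-term return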

Reinforcement Learning: How to Act and Decide (Agent and Environment)

- At each step t, the agent:
• receives state s_t
• receives scalar reward r_t
• executes action a_t
- The environment:
• receives action a_t
• emits state s_{t+1}
• emits scalar reward r_{t+1}
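
The interaction loop above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `CountdownEnv`, `RandomAgent`, and the reward scheme are hypothetical stand-ins, not taken from the talk.

```python
import random


class CountdownEnv:
    """Hypothetical toy environment: the state is a counter to be driven to zero."""

    def __init__(self, start=5):
        self.start = start
        self.state = start

    def reset(self):
        self.state = self.start
        return self.state

    def step(self, action):
        # Receive action a_t, then emit state s_{t+1} and scalar reward r_{t+1}.
        if action == 1 and self.state > 0:
            self.state -= 1
        done = self.state == 0
        reward = 1.0 if done else 0.0
        return self.state, reward, done


class RandomAgent:
    """Hypothetical agent that ignores its inputs and acts uniformly at random."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)

    def act(self, state, reward):
        # Receive state s_t and reward r_t, then execute action a_t.
        return self.rng.choice([0, 1])


def run_episode(env, agent, max_steps=100):
    """Run one episode and return (total reward, number of steps taken)."""
    state, reward = env.reset(), 0.0
    total_reward, steps = 0.0, 0
    for _ in range(max_steps):
        action = agent.act(state, reward)
        state, reward, done = env.step(action)
        total_reward += reward
        steps += 1
        if done:
            break
    return total_reward, steps
```

A learning agent would replace `RandomAgent.act` with a rule that uses the observed states and rewards to improve its choices over time.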

- Reinforcement learning (RL) is a general-purpose framework for artificial intelligence
• RL gives an agent the capacity to act
• each action influences the agent’s future state
• success is measured by a scalar reward signal
- RL in a nutshell
• select actions to maximize future reward
- We seek a single agent which can solve any human-level task
• The essence of an intelligent agent
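
"Future reward" is commonly made precise as a discounted sum of the rewards that follow each step. The sketch below assumes that formulation. The discount factor `gamma` and the function name are assumptions for illustration and do not appear in the slides.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for every t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Work backwards so each return reuses the one after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Three steps with reward 1 each, discounted by 0.5 per step.
print(discounted_returns([1.0, 1.0, 1.0], gamma=0.5))  # [1.75, 1.5, 1.0]
```

With `gamma` close to 1 the agent weighs distant rewards almost as much as immediate ones; with `gamma` close to 0 it becomes short-sighted.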

Application Areas of Reinforcement Learning

- Control physical systems: walk, fly, drive, swim
- Interact with users: retain customers, personalize channel, optimize user experience
- Solve logical problems: scheduling, bandwidth allocation
- Play games: chess, checkers, Go, Atari games
- Learn sequential algorithms: attention, memory, conditional computation, activations

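The Reinforcement Learning Framework: Policy Function and Value Function

Unity of Knowledge and Action: AlphaGo and Deep Reinforcement Learning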

- Apply deep learning to RL?
- Use a deep network to represent the value function, policy, or model
- Optimize the value function, policy, or model end-to-end
- Using stochastic gradient descent
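
As a rough illustration of the recipe above, the sketch below fits a small neural network V(s) to target values with minibatch stochastic gradient descent, using only NumPy. The data, network size, learning rate, and the function name `train_value_network` are all assumptions made for the example. In a real system the targets would come from observed returns or bootstrapped estimates rather than a made-up function.

```python
import numpy as np


def train_value_network(steps=2000, batch_size=32, lr=0.02, seed=0):
    """Fit a one-hidden-layer network V(s) to target values with minibatch SGD."""
    rng = np.random.default_rng(seed)

    # Hypothetical data: 1-D states with a made-up target value for each state.
    states = rng.uniform(-1.0, 1.0, size=(256, 1))
    targets = np.sin(3.0 * states)

    hidden = 32
    w1 = rng.normal(0.0, 1.0, size=(1, hidden))
    b1 = np.zeros(hidden)
    w2 = rng.normal(0.0, 0.1, size=(hidden, 1))
    b2 = np.zeros(1)

    def forward(s):
        h = np.tanh(s @ w1 + b1)
        return h @ w2 + b2, h

    def mean_squared_error():
        v, _ = forward(states)
        return float(np.mean((v - targets) ** 2))

    loss_before = mean_squared_error()
    for _ in range(steps):
        idx = rng.integers(0, len(states), size=batch_size)  # random minibatch
        s, y = states[idx], targets[idx]
        v, h = forward(s)
        # Gradient of 0.5 * mean((v - y)^2) with respect to the network output.
        grad_v = (v - y) / batch_size
        grad_w2 = h.T @ grad_v
        grad_b2 = grad_v.sum(axis=0)
        grad_h = (grad_v @ w2.T) * (1.0 - h ** 2)  # backpropagate through tanh
        grad_w1 = s.T @ grad_h
        grad_b1 = grad_h.sum(axis=0)
        # SGD step, applied in place so forward() sees the updated weights.
        w1 -= lr * grad_w1
        b1 -= lr * grad_b1
        w2 -= lr * grad_w2
        b2 -= lr * grad_b2
    return loss_before, mean_squared_error()
```

Calling `train_value_network()` returns the mean squared error before and after training, which gives a quick check that the end-to-end optimization is actually reducing the error.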

The REINFORCE Algorithm

- The simplest possible reinforcement learning algorithm to maximize total reward
- Use the policy gradient over a whole episode to update the parameters
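
A minimal sketch of a whole-episode policy-gradient update in the spirit of REINFORCE, assuming a stateless softmax policy over two actions. The task (action 1 pays reward 1, action 0 pays nothing), the hyperparameters, and the function names are hypothetical, and no baseline is subtracted.

```python
import numpy as np


def softmax(logits):
    z = np.exp(logits - np.max(logits))
    return z / z.sum()


def reinforce_two_actions(episodes=500, steps_per_episode=10, lr=0.01, seed=0):
    """Train a softmax policy with whole-episode policy-gradient updates."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(2)  # policy parameters: one logit per action
    for _ in range(episodes):
        score_sum = np.zeros(2)  # sum over the episode of grad log pi(a_t)
        total_reward = 0.0
        for _ in range(steps_per_episode):
            probs = softmax(theta)
            action = rng.choice(2, p=probs)
            total_reward += 1.0 if action == 1 else 0.0  # hypothetical reward rule
            one_hot = np.eye(2)[action]
            score_sum += one_hot - probs  # gradient of log-softmax w.r.t. the logits
        # One update per episode: total reward times the summed score function.
        theta += lr * total_reward * score_sum
    return softmax(theta)


print(reinforce_two_actions())  # action probabilities after training
```

Because the update scales the summed log-probability gradient by the episode's total reward, actions taken in high-reward episodes become more likely. Subtracting a baseline from the reward is a common way to reduce the variance of this estimate.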

Deep Neural Network Reinforcement Learning Application: Google DeepMind’s AlphaGo

- Training steps
1. Train a policy network using previous matches (SL network)
2. Train a policy network by playing against itself (RL network)
3. Train a value network using matches played by the RL network

Bellman equation
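
For reference, the Bellman optimality equation for the action-value function in its standard textbook form. The notation (optimal action-value function Q^*, discount factor \gamma, next state s', next action a') is the usual convention and is assumed here rather than taken from the slide.

```latex
Q^{*}(s, a) = \mathbb{E}_{s'}\!\left[\, r + \gamma \max_{a'} Q^{*}(s', a') \;\middle|\; s, a \,\right]
```

It says that the value of taking action a in state s equals the immediate reward plus the discounted value of acting optimally from the next state onward.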

Deep Q-learning
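
Deep Q-learning replaces the table of action values with a neural network, but the underlying update is easiest to see in tabular form. The sketch below is therefore plain tabular Q-learning, not a deep variant, on a hypothetical five-state chain. The environment, the random behavior policy, and the hyperparameters are assumptions for illustration.

```python
import numpy as np


def q_learning_chain(n_states=5, episodes=300, alpha=0.5, gamma=0.9, seed=0):
    """Tabular Q-learning on a chain: action 0 moves left, action 1 moves right.

    Reaching the last state ends the episode with reward 1; all other rewards are 0.
    """
    rng = np.random.default_rng(seed)
    q = np.zeros((n_states, 2))
    for _ in range(episodes):
        s = 0
        for _ in range(100):  # cap on episode length
            # Q-learning is off-policy, so a uniformly random behavior policy suffices.
            a = int(rng.integers(2))
            s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
            done = s_next == n_states - 1
            r = 1.0 if done else 0.0
            # Target: reward plus discounted value of the best next action.
            target = r if done else r + gamma * np.max(q[s_next])
            q[s, a] += alpha * (target - q[s, a])
            s = s_next
            if done:
                break
    return q


q_table = q_learning_chain()
print(np.argmax(q_table[:-1], axis=1))  # greedy action in each non-terminal state
```

In a deep variant, `q[s, a]` would become the output of a network and the tabular update would become a gradient step on the squared difference between the network output and `target`.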

What Does This Human-versus-Machine Go Match Mean?

Deep Reinforcement Learning: Taking Autonomous Driving from Perception to Control