Deep learning of semantics for natural language
Machine Learning, AI & No Free Lunch
Bypassing the curse of dimensionality
Progress in Deep Learning Theory
Exponential advantage of distributed representations
Exponential advantage of depth
A Myth is Being Debunked: Local Minima in Neural Nets
Saddle Points
Why N-grams have poor generalization
Neural Language Models: fighting one exponential by another one!
The Next Challenge: Rich Semantic Representations for Word Sequences
Attention Mechanism for Deep Learning
Applying an attention mechanism to End-to-End Machine Translation
2014: The Year of Neural Machine Translation Breakthrough
Encoder-Decoder Framework
Bidirectional RNN for Input Side
Attention: Many Recent Papers
Soft-Attention vs Stochastic Hard-Attention
Attention-Based Neural Machine Translation
Predicted Alignments
En-Fr & En-De Alignments
Improvements over Pure AE Model
End-to-End Machine Translation with Recurrent Nets and Attention Mechanism
IWSLT 2015 – Luong & Manning (2015) TED talk MT, English-German
Image-to-Text: Caption Generation with Attention
Paying Attention to Selected Parts of the Image While Uttering Words
Speaking about what one sees
Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
The Good
And the Bad
Interesting extensions
Multi-Lingual Neural MT with Shared Attention Mechanism
Character-Based Models
Experiments on Character-Based NMT
Attention Mechanisms for Memory Access
Large Memory Networks: Sparse Access Memory for Long-Term Dependencies
Delays & Hierarchies to Reach Farther
Ongoing Project: Knowledge Extraction
The Next Big Challenge: Unsupervised Learning
Conclusions
• Theory for deep learning has progressed substantially on several fronts: why it generalizes better, why local minima are not the issue people thought, and the probabilistic interpretation of deep unsupervised learning.
• Attention mechanisms allow the learner to make a selection, soft or hard (a minimal sketch follows below)
• They have been extremely successful for machine translation and caption generation
• They could be interesting for speech recognition and video, especially if we used them to capture multiple time scales
• They could be used to help deal with long-term dependencies, allowing some states to persist for arbitrarily long
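The soft/hard selection mentioned in the bullets above can be summarized in a few lines of code. Below is a minimal numpy sketch of soft attention, assuming a dot-product scoring function and hypothetical names (`soft_attention`, `encoder_states`); the models discussed in the talk use a learned alignment network (Bahdanau et al., 2015), so this is an illustration of the idea rather than the exact method.

```python
# Minimal sketch of soft attention (illustrative; the dot-product score
# and the names used here are assumptions, not the talk's exact model).
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def soft_attention(decoder_state, encoder_states):
    """Return a context vector: a convex combination of encoder states.

    decoder_state:  (d,)   current decoder hidden state
    encoder_states: (T, d) one annotation per source position
    """
    scores = encoder_states @ decoder_state   # (T,) alignment scores
    weights = softmax(scores)                 # soft selection over positions
    context = weights @ encoder_states        # (d,) weighted average
    return context, weights

# Hard attention would instead sample a single position from `weights`,
# a stochastic, non-differentiable choice trained with e.g. REINFORCE.
rng = np.random.default_rng(0)
enc = rng.standard_normal((5, 8))   # 5 source positions, dimension 8
dec = rng.standard_normal(8)
ctx, w = soft_attention(dec, enc)
print(w.round(3), ctx.shape)
```

The practical distinction is that the soft version stays differentiable end to end, so it trains with plain backpropagation, while the hard (sampled) version requires variance-reduced gradient estimators.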