When programmable computers were first conceived, people wondered whether such machines might become intelligent, over a hundred years before one was built (Lovelace, 1842). Today, artificial intelligence (AI) is a thriving field with many practical applications and active research topics. We look to intelligent software to automate routine labor, understand speech or images, make diagnoses in medicine and support basic scientific research.
In the early days of artificial intelligence, the field rapidly tackled and solved problems that are intellectually difficult for human beings but relatively straightforward for computers—problems that can be described by a list of formal, mathematical rules. The true challenge to artificial intelligence proved to be solving the tasks that are easy for people to perform but hard for people to describe formally—problems that we solve intuitively, that feel automatic, like recognizing spoken words or faces in images.
This book is about a solution to these more intuitive problems. This solution is to allow computers to learn from experience and understand the world in terms of a hierarchy of concepts, with each concept defined in terms of its relation to simpler concepts. By gathering knowledge from experience, this approach avoids the need for human operators to formally specify all of the knowledge that the computer needs. The hierarchy of concepts allows the computer to learn complicated concepts by building them out of simpler ones. If we draw a graph showing how these concepts are built on top of each other, the graph is deep, with many layers. For this reason, we call this approach to AI deep learning.
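To make the layering idea concrete, the short sketch below (a minimal illustration in Python with numpy, assumed here and not code from the book) computes the XOR function, which no single linear layer can represent, by composing a layer of simple rectified features with a second layer that combines them. The hand-chosen weights follow the standard construction used in the book's later XOR example (Section 6.1).

    import numpy as np

    # A two-layer network with hand-chosen weights that computes XOR.
    # The hidden layer extracts two simple features of the input, and the
    # output layer combines them into the more complicated concept
    # "exactly one input is on". Illustrative sketch only.

    def relu(z):
        return np.maximum(0.0, z)

    W = np.array([[1.0, 1.0],
                  [1.0, 1.0]])   # first-layer weights
    c = np.array([0.0, -1.0])    # first-layer biases
    w = np.array([1.0, -2.0])    # second-layer weights

    def xor_net(x):
        h = relu(x @ W + c)      # layer 1: simple features
        return float(h @ w)      # layer 2: combine the simpler concepts

    for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        print(x, xor_net(np.array(x, dtype=float)))
    # prints 0.0, 1.0, 1.0, 0.0

In practice such weights are not chosen by hand; they are learned from data by gradient-based training, the subject of Section 6.2 and Chapter 8.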
Contents
Website
Acknowledgments
Notation
1 Introduction
1.1 Who Should Read This Book?
1.2 Historical Trends in Deep Learning
I Applied Math and Machine Learning Basics
2 Linear Algebra
2.1 Scalars, Vectors, Matrices and Tensors
2.2 Multiplying Matrices and Vectors
2.3 Identity and Inverse Matrices
2.4 Linear Dependence and Span
2.5 Norms
2.6 Special Kinds of Matrices and Vectors
2.7 Eigendecomposition
2.8 Singular Value Decomposition
2.9 The Moore-Penrose Pseudoinverse
2.10 The Trace Operator
2.11 The Determinant
2.12 Example: Principal Components Analysis
3 Probability and Information Theory
3.1 Why Probability?
3.2 Random Variables
3.3 Probability Distributions
3.4 Marginal Probability
3.5 Conditional Probability
3.6 The Chain Rule of Conditional Probabilities
3.7 Independence and Conditional Independence
3.8 Expectation, Variance and Covariance
3.9 Common Probability Distributions
3.10 Useful Properties of Common Functions
3.11 Bayes’ Rule
3.12 Technical Details of Continuous Variables
3.13 Information Theory
3.14 Structured Probabilistic Models
4 Numerical Computation
4.1 Overflow and Underflow
4.2 Poor Conditioning
4.3 Gradient-Based Optimization
4.4 Constrained Optimization
4.5 Example: Linear Least Squares
5 Machine Learning Basics
5.1 Learning Algorithms
5.2 Capacity, Overfitting and Underfitting
5.3 Hyperparameters and Validation Sets
5.4 Estimators, Bias and Variance
5.5 Maximum Likelihood Estimation
5.6 Bayesian Statistics
5.7 Supervised Learning Algorithms
5.8 Unsupervised Learning Algorithms
5.9 Stochastic Gradient Descent
5.10 Building a Machine Learning Algorithm
5.11 Challenges Motivating Deep Learning
II Deep Networks: Modern Practices
6 Deep Feedforward Networks
6.1 Example: Learning XOR
6.2 Gradient-Based Learning
6.3 Hidden Units
6.4 Architecture Design
6.5 Back-Propagation and Other Differentiation Algorithms
6.6 Historical Notes
7 Regularization for Deep Learning
7.1 Parameter Norm Penalties
7.2 Norm Penalties as Constrained Optimization
7.3 Regularization and Under-Constrained Problems
7.4 Dataset Augmentation
7.5 Noise Robustness
7.6 Semi-Supervised Learning
7.7 Multi-Task Learning
7.8 Early Stopping
7.9 Parameter Tying and Parameter Sharing
7.10 Sparse Representations
7.11 Bagging and Other Ensemble Methods
7.12 Dropout
7.13 Adversarial Training
7.14 Tangent Distance, Tangent Prop, and Manifold Tangent Classifier
8 Optimization for Training Deep Models
8.1 How Learning Differs from Pure Optimization
8.2 Challenges in Neural Network Optimization
8.3 Basic Algorithms
8.4 Parameter Initialization Strategies
8.5 Algorithms with Adaptive Learning Rates
8.6 Approximate Second-Order Methods
8.7 Optimization Strategies and Meta-Algorithms
9 Convolutional Networks
9.1 The Convolution Operation
9.2 Motivation
9.3 Pooling
9.4 Convolution and Pooling as an Infinitely Strong Prior
9.5 Variants of the Basic Convolution Function
9.6 Structured Outputs
9.7 Data Types
9.8 Efficient Convolution Algorithms
9.9 Random or Unsupervised Features
9.10 The Neuroscientific Basis for Convolutional Networks
9.11 Convolutional Networks and the History of Deep Learning
10 Sequence Modeling: Recurrent and Recursive Nets
10.1 Unfolding Computational Graphs
10.2 Recurrent Neural Networks
10.3 Bidirectional RNNs
10.4 Encoder-Decoder Sequence-to-Sequence Architectures
10.5 Deep Recurrent Networks
10.6 Recursive Neural Networks
10.7 The Challenge of Long-Term Dependencies
10.8 Echo State Networks
10.9 Leaky Units and Other Strategies for Multiple Time Scales
10.10 The Long Short-Term Memory and Other Gated RNNs
10.11 Optimization for Long-Term Dependencies
10.12 Explicit Memory
11 Practical Methodology
11.1 Performance Metrics
11.2 Default Baseline Models
11.3 Determining Whether to Gather More Data
11.4 Selecting Hyperparameters
11.5 Debugging Strategies
11.6 Example: Multi-Digit Number Recognition
12 Applications
12.1 Large-Scale Deep Learning
12.2 Computer Vision
12.3 Speech Recognition
12.4 Natural Language Processing
12.5 Other Applications
III Deep Learning Research
13 Linear Factor Models
13.1 Probabilistic PCA and Factor Analysis
13.2 Independent Component Analysis (ICA)
13.3 Slow Feature Analysis
13.4 Sparse Coding
13.5 Manifold Interpretation of PCA
14 Autoencoders
14.1 Undercomplete Autoencoders
14.2 Regularized Autoencoders
14.3 Representational Power, Layer Size and Depth
14.4 Stochastic Encoders and Decoders
14.5 Denoising Autoencoders
14.6 Learning Manifolds with Autoencoders
14.7 Contractive Autoencoders
14.8 Predictive Sparse Decomposition
14.9 Applications of Autoencoders
15 Representation Learning
15.1 Greedy Layer-Wise Unsupervised Pretraining
15.2 Transfer Learning and Domain Adaptation
15.3 Semi-Supervised Disentangling of Causal Factors
15.4 Distributed Representation
15.5 Exponential Gains from Depth
15.6 Providing Clues to Discover Underlying Causes
16 Structured Probabilistic Models for Deep Learning
16.1 The Challenge of Unstructured Modeling
16.2 Using Graphs to Describe Model Structure
16.3 Sampling from Graphical Models
16.4 Advantages of Structured Modeling
16.5 Learning about Dependencies
16.6 Inference and Approximate Inference
16.7 The Deep Learning Approach to Structured Probabilistic Models
17 Monte Carlo Methods
17.1 Sampling and Monte Carlo Methods
17.2 Importance Sampling
17.3 Markov Chain Monte Carlo Methods
17.4 Gibbs Sampling
17.5 The Challenge of Mixing between Separated Modes
18 Confronting the Partition Function
18.1 The Log-Likelihood Gradient
18.2 Stochastic Maximum Likelihood and Contrastive Divergence
18.3 Pseudolikelihood
18.4 Score Matching and Ratio Matching
18.5 Denoising Score Matching
18.6 Noise-Contrastive Estimation
18.7 Estimating the Partition Function
19 Approximate Inference
19.1 Inference as Optimization
19.2 Expectation Maximization
19.3 MAP Inference and Sparse Coding
19.4 Variational Inference and Learning
19.5 Learned Approximate Inference
20 Deep Generative Models
20.1 Boltzmann Machines
20.2 Restricted Boltzmann Machines
20.3 Deep Belief Networks
20.4 Deep Boltzmann Machines
20.5 Boltzmann Machines for Real-Valued Data
20.6 Convolutional Boltzmann Machines
20.7 Boltzmann Machines for Structured or Sequential Outputs
20.8 Other Boltzmann Machines
20.9 Back-Propagation through Random Operations
20.10 Directed Generative Nets
20.11 Drawing Samples from Autoencoders
20.12 Generative Stochastic Networks
20.13 Other Generation Schemes
20.14 Evaluating Generative Models
20.15 Conclusion
Bibliography
Index