PRML-cookbook

[toc]

About this series

This series contains my notes and reflections from reading Pattern Recognition and Machine Learning (2006).

Author

Pattern Recognition and Machine Learning is the work of Chris Bishop. Bishop is a distinguished scientist at Microsoft and the Laboratory Director of Microsoft Research Cambridge. He is also a Fellow of Darwin College, Cambridge, and holds a professorship in the Department of Computer Science at the University of Edinburgh. He was elected a Fellow of the Royal Academy of Engineering in 2004 and a Fellow of the Royal Society of Edinburgh in 2007.

Table of Contents

The book's table of contents is shown below; it will later be updated into a table of contents for these reading notes.

1 Introduction
1.1 Example: Polynomial Curve Fitting
1.2 Probability Theory
1.3 Model Selection
1.4 The Curse of Dimensionality
1.5 Decision Theory
1.6 Information Theory

2 Probability Distributions
2.1 Binary Variables
2.2 Multinomial Variables
2.3 The Gaussian Distribution
2.4 The Exponential Family
2.5 Nonparametric Methods

3 Linear Models for Regression
3.1 Linear Basis Function Models
3.2 The Bias-Variance Decomposition
3.3 Bayesian Linear Regression
3.4 Bayesian Model Comparison
3.5 The Evidence Approximation
3.6 Limitations of Fixed Basis Functions

4 Linear Models for Classification
4.1 Discriminant Functions
4.2 Probabilistic Generative Models
4.3 Probabilistic Discriminative Models
4.4 The Laplace Approximation
4.5 Bayesian Logistic Regression

5 Neural Networks
5.1 Feed-forward Network Functions
5.2 Network Training
5.3 Error Backpropagation
5.4 The Hessian Matrix
5.5 Regularization in Neural Networks
5.6 Mixture Density Networks
5.7 Bayesian Neural Networks

6 Kernel Methods
6.1 Dual Representations
6.2 Constructing Kernels
6.3 Radial Basis Function Networks
6.4 Gaussian Processes

7 Sparse Kernel Machines
7.1 Maximum Margin Classifiers
7.2 Relevance Vector Machines

8 Graphical Models
8.1 Bayesian Networks
8.2 Conditional Independence
8.3 Markov Random Fields
8.4 Inference in Graphical Models

9 Mixture Models and EM
9.1 K-means Clustering
9.1.1 Image segmentation and compression
9.2 Mixtures of Gaussians
9.2.1 Maximum likelihood
9.2.2 EM for Gaussian mixtures
9.3 An Alternative View of EM
9.3.1 Gaussian mixtures revisited
9.3.2 Relation to K-means
9.3.3 Mixtures of Bernoulli distributions
9.3.4 EM for Bayesian linear regression
9.4 The EM Algorithm in General
Exercises

10 Approximate Inference
10.1 Variational Inference
10.1.1 Factorized distributions
10.1.2 Properties of factorized approximations
10.1.3 Example: The univariate Gaussian
10.1.4 Model comparison
10.2 Illustration: Variational Mixture of Gaussians
10.2.1 Variational distribution
10.2.2 Variational lower bound
10.2.3 Predictive density
10.2.4 Determining the number of components
10.2.5 Induced factorizations
10.3 Variational Linear Regression
10.3.1 Variational distribution
10.3.2 Predictive distribution
10.3.3 Lower bound
10.4 Exponential Family Distributions
10.4.1 Variational message passing
10.5 Local Variational Methods
10.6 Variational Logistic Regression
10.6.1 Variational posterior distribution
10.6.2 Optimizing the variational parameters
10.6.3 Inference of hyperparameters
10.7 Expectation Propagation
10.7.1 Example: The clutter problem
10.7.2 Expectation propagation on graphs
Exercises

11 Sampling Methods
11.1 Basic Sampling Algorithms
11.1.1 Standard distributions
11.1.2 Rejection sampling
11.1.3 Adaptive rejection sampling
11.1.4 Importance sampling
11.1.5 Sampling-importance-resampling
11.1.6 Sampling and the EM algorithm
11.2 Markov Chain Monte Carlo
11.2.1 Markov chains
11.2.2 The Metropolis-Hastings algorithm
11.3 Gibbs Sampling
11.4 Slice Sampling
11.5 The Hybrid Monte Carlo Algorithm
11.5.1 Dynamical systems
11.5.2 Hybrid Monte Carlo
11.6 Estimating the Partition Function
Exercises

12 Continuous Latent Variables
12.1 Principal Component Analysis
12.1.1 Maximum variance formulation
12.1.2 Minimum-error formulation
12.1.3 Applications of PCA
12.1.4 PCA for high-dimensional data
12.2 Probabilistic PCA
12.2.1 Maximum likelihood PCA
12.2.2 EM algorithm for PCA
12.2.3 Bayesian PCA
12.2.4 Factor analysis
12.3 Kernel PCA
12.4 Nonlinear Latent Variable Models
12.4.1 Independent component analysis
12.4.2 Autoassociative neural networks
12.4.3 Modelling nonlinear manifolds
Exercises

13 Sequential Data
13.1 Markov Models
13.2 Hidden Markov Models
13.2.1 Maximum likelihood for the HMM
13.2.2 The forward-backward algorithm
13.2.3 The sum-product algorithm for the HMM
13.2.4 Scaling factors
13.2.5 The Viterbi algorithm
13.2.6 Extensions of the hidden Markov model
13.3 Linear Dynamical Systems
13.3.1 Inference in LDS
13.3.2 Learning in LDS
13.3.3 Extensions of LDS
13.3.4 Particle filters
Exercises

14 Combining Models
14.1 Bayesian Model Averaging
14.2 Committees
14.3 Boosting
14.3.1 Minimizing exponential error
14.3.2 Error functions for boosting
14.4 Tree-based Models
14.5 Conditional Mixture Models
14.5.1 Mixtures of linear regression models
14.5.2 Mixtures of logistic models
14.5.3 Mixtures of experts
Exercises

Notation

I have tried to keep the mathematical content of the book to the minimum necessary to achieve a proper understanding of the field. However, this minimum level is nonzero, and it should be emphasized that a good grasp of calculus, linear algebra, and probability theory is essential for a clear understanding of modern pattern recognition and machine learning techniques. Nevertheless, the emphasis in this book is on conveying the underlying concepts rather than on mathematical rigour.
I have tried to use a consistent notation throughout the book, although at times this means departing from some of the conventions used in the corresponding research literature. Vectors are denoted by lower case bold Roman letters such as x, and all vectors are assumed to be column vectors. A superscript T denotes the transpose of a matrix or vector, so that x^T will be a row vector. Uppercase bold Roman letters, such as M, denote matrices. The notation (w1, …, wM) denotes a row vector with M elements, while the corresponding column vector is written as w = (w1, …, wM)^T.
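
As a quick illustration of this convention, here is a minimal NumPy sketch (my own example, not code from the book) that represents a column vector as an M × 1 array so that its transpose is the corresponding row vector:

```python
import numpy as np

# Column vector w = (w1, ..., wM)^T represented as an M x 1 array
# (the values are illustrative only).
w = np.array([[1.0], [2.0], [3.0]])   # shape (3, 1): a column vector

w_T = w.T                             # shape (1, 3): the row vector w^T

print(w.shape, w_T.shape)             # -> (3, 1) (1, 3)
```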
The notation [a, b] is used to denote the closed interval from a to b, that is, the interval including the values a and b themselves, while (a, b) denotes the corresponding open interval, that is, the interval excluding a and b. Similarly, [a, b) denotes an interval that includes a but excludes b. For the most part, however, there will be little need to dwell on such refinements as whether the end points of an interval are included or not.
The M × M identity matrix (also known as the unit matrix) is denoted I_M, which will be abbreviated to I where there is no ambiguity about its dimensionality. It has elements I_ij that equal 1 if i = j and 0 if i ≠ j.
A functional is denoted f[y] where y(x) is some function. The concept of a
functional is discussed in Appendix D.
The notation g(x) = O(f(x)) denotes that |f(x)/g(x)| is bounded as x → ∞. For instance, if g(x) = 3x^2 + 2, then g(x) = O(x^2).
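
For this example the bound can be checked directly against the definition above (a one-line verification I have added, not text from the book): taking f(x) = x^2,

$$
\left|\frac{f(x)}{g(x)}\right| = \frac{x^{2}}{3x^{2}+2} = \frac{1}{3 + 2/x^{2}} \longrightarrow \frac{1}{3} \quad (x \to \infty),
$$

which is bounded, so g(x) = O(x^2).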
The expectation of a function f(x, y) with respect to a random variable x is denoted by E_x[f(x, y)]. In situations where there is no ambiguity as to which variable is being averaged over, this will be simplified by omitting the suffix, for instance E[x]. If the distribution of x is conditioned on another variable z, then the corresponding conditional expectation will be written E_x[f(x) | z]. Similarly, the variance is denoted var[f(x)], and for vector variables the covariance is written cov[x, y]. We shall also use cov[x] as a shorthand notation for cov[x, x]. The concepts of expectations and covariances are introduced in Section 1.2.2.
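
As a rough sketch of how these quantities relate to sample-based estimates (my own NumPy illustration with an arbitrary choice of distribution; none of this comes from the book):

```python
import numpy as np

rng = np.random.default_rng(0)

# Samples of a scalar random variable x (the Gaussian here is an arbitrary choice).
x = rng.normal(loc=1.0, scale=2.0, size=100_000)

f = x ** 2                    # a function f(x) of the random variable

E_f = f.mean()                # Monte Carlo estimate of E[f(x)]
var_f = f.var()               # estimate of var[f(x)] = E[(f(x) - E[f(x)])^2]

# For a pair of variables, cov[x, y]; cov[x] abbreviates cov[x, x].
y = 0.5 * x + rng.normal(size=x.shape)
cov_xy = np.cov(x, y)         # 2 x 2 sample covariance matrix of (x, y)

print(E_f, var_f)
print(cov_xy)
```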
If we have N values x_1, …, x_N of a D-dimensional vector x = (x_1, …, x_D)^T, we can combine the observations into a data matrix X in which the n-th row of X corresponds to the row vector x_n^T. Thus the (n, i) element of X corresponds to the i-th element of the n-th observation x_n. For the case of one-dimensional variables we shall denote such a matrix by x, which is a column vector whose n-th element is x_n. Note that x (which has dimensionality N) uses a different typeface to distinguish it from x (which has dimensionality D).
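
This data-matrix convention maps directly onto a NumPy array (a minimal sketch with made-up values; note that NumPy indexing is zero-based, whereas the book counts observations from 1):

```python
import numpy as np

N, D = 5, 3                                   # N observations of a D-dimensional vector x

rng = np.random.default_rng(42)
xs = [rng.normal(size=D) for _ in range(N)]   # each x_n is a D-dimensional vector

# Data matrix X: row n is the row vector x_n^T, so X[n, i] is the
# i-th element of the n-th observation (zero-based indices here).
X = np.stack(xs)                              # shape (N, D)

n, i = 2, 1
assert X[n, i] == xs[n][i]

# One-dimensional case: the N observations form a column vector x of length N
# whose n-th element is the scalar x_n.
x_scalar_obs = rng.normal(size=N)             # shape (N,)
```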

Cite

```bibtex
@book{bishop2013pattern,
  title     = {Pattern Recognition and Machine Learning: All "just the Facts 101" Material},
  author    = {Bishop, C.M.},
  isbn      = {9788132209065},
  series    = {Information science and statistics},
  url       = {https://books.google.com.hk/books?id=HL4HrgEACAAJ},
  year      = {2013},
  publisher = {Springer (India) Private Limited}
}
```