This bibliography traces the origin of Bayesian methods for inference and learning in Bayes networks (BNs). It contains references on the development and use of BNs over the last 30-odd years in Artificial Intelligence. The rise of BNs in Machine Learning is relatively recent; in fact, only in the mid-90's did chapters on the subject appear in Machine Learning texts. In contrast, the origin of Bayesian Statistics, attributed to the Reverend Bayes in the 18th century, actually predates much of conventional statistics. The growth of Bayesian methods in general, and their relationship to the varied fields thus spawned, among which are Bayesian Statistics and Decision Theory, is fascinating, but too wide-ranging to cover here.[1]

Bayes networks can be seen as a unifying approach to reasoning with (i.e. a “logic” or “calculus” of) uncertainty that has been applied to a wide range of problems. Some sub-areas to which BNs have been applied include:

·         Dynamic models: Dynamic Bayes Nets (DBNs), Continuous Time DBNs (CTBNs), Hidden Markov Models (HMMs)

·         Automated probabilistic planning methods: Markov Decision Processes (MDPs) and Partially Observable MDPs (POMDPs),

·         Activity inference methods,

·         Biological sequence analysis (Genomics, Proteomics),

·         Statistical language models, information retrieval, web search,

·         Approximate solution methods borrowing from Statistical Physics,

·         Causal discovery,

·         Game theory.

The references included here do not cover these areas exhaustively, but should offer the background needed to understand them. Each of these areas merits a separate bibliography.



I. Standard Texts


A. Bayesian Networks and Decision Graphs is a comprehensive and up-to-date text, appropriate for advanced undergraduate and graduate courses. It covers propagation methods, inference, model building, and extensions that apply utility theory to optimal decision making. Jensen, of Aalborg University in Denmark, published some of the original work on BN inference using the "join-tree" algorithm.

Jensen, F. V., Bayesian Networks and Decision Graphs (Springer 2001).


B. Cowell’s text is similar to Jensen’s, but with a greater emphasis on learning networks and their parameters, and with sections on continuous variables, structure learning, applications of Markov Chain Monte Carlo (MCMC)[2], undirected (Markov) models, etc. Lauritzen and Spiegelhalter (1988) published the first general-purpose algorithm for solving Bayes networks, which was a watershed in the field.

Cowell, R. G., A. P. Dawid, S. L. Lauritzen, and D. J. Spiegelhalter, Probabilistic Networks and Expert Systems (Springer, Information Science and Statistics, 2003).


C. Pearl’s work is where it all started, as an alternative to the AI methods of the time. It is dated with regard to computational methods, but it is perhaps the best read, and the best introduction to the philosophy that motivated the field. This is the original "belief network" source, still the most articulate and fundamental, and an important contribution to the debate about reasoning with uncertainty that was raging at the time. It predates the development of join-tree solution methods; instead it develops cut-set solution methods.

Pearl, J., Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, (San Mateo, CA: Morgan Kaufmann, 1988).


D. These two volumes are compilations of articles, some republished from elsewhere. Jordan was a leader in introducing the field into Statistics and coined the phrase “Graphical Model.”

M. Jordan and T. Sejnowski, eds., Graphical Models: Foundations of Neural Computation (MIT Press, 2001).

M. Jordan, ed., Learning in Graphical Models (MIT Press, 1998).



II. Two yet-to-be-published textbooks

Drafts of these texts were lent to Intel while the researchers who wrote them were under grant. Gary Bradski keeps copies; ask him for distribution permission.

A. Written at an advanced undergraduate level, for a computational AI computer science curriculum

D. Koller, Bayes Networks and Beyond. Draft.


B. Written as a graduate level theoretical statistics introduction.

M. Jordan, C. Bishop, An Introduction to Graphical Models. Draft.



III. Introductory articles and tutorials


A. This issue of AI Magazine was dedicated to decision-theoretic methods. Several of the papers published in it are worth reading.

Charniak, E., “Bayes networks without tears,” AI Magazine, 12(4) (1991)


B. The entire chapter on Statistical Learning Methods from this AI text is available online. It has a distinctly Bayes networks flavor. The book has become a standard, and the other probability-related chapters are among the most transparent introductions to the field.

Russell, S. and P. Norvig, Artificial Intelligence: A Modern Approach, 2nd Ed. (Englewood Cliffs, New Jersey: Prentice Hall Series in Artificial Intelligence, 2002).


C. Kevin Murphy’s "Bayes Net Toolbox" (BNT) is an open-source Matlab library that comes with a good introduction to a variety of methods, including dynamic Bayes nets. (See the note below about PNL, Intel’s translation of BNT into C++.)


D. A tutorial, in PowerPoint, given at the UAI conference.

J. Breese and D. Koller, Bayesian Networks and Decision-Theoretic Reasoning for Artificial Intelligence, 1997.


E. Andrew Moore, while at CMU, made his tutorial slides available. See specifically the sets "Bayesian Networks" and "Probability for Data Miners"; however, they are all good.


Unfortunately, there doesn’t appear to be a review of advances in BN learning written since the early 2000’s. I’m open to suggestions if someone can point to a recent comprehensive work. Perhaps everyone who would have authored one has gone to Google?


Some early influential work, for a historical perspective: 


F. Dave Heckerman’s Ph.D. thesis is still a good read.  He was a leader in applying BNs to diagnostic reasoning.

Heckerman, D., Probabilistic Similarity Networks (MIT Press, ACM Award Thesis, 1990).


G. A 3-page retrospective note on one of the early papers on "Influence Diagrams," a variation of Bayes nets that includes decision and value variables. Boutilier explains how influence diagrams have had a pervasive effect on related fields, especially probabilistic AI.

Boutilier, C., "The Influence of Influence Diagrams on Artificial Intelligence," Decision Analysis, Vol. 2, No. 4 (December 2005), pp. 229-231.


H. Two articles reprinted in

Shafer, G. and J. Pearl, eds., Readings in Uncertain Reasoning (San Mateo, CA: Morgan Kaufmann, 1990).

The datedness of most of the other articles in this compilation is evidence of how quickly the field has advanced in the last decade and a half.

The first is an alternative solution method to the formal join-tree methods, based on equivalence transforms of networks ("arc reversals").

Shachter, R. D., "Evaluating Influence Diagrams," Operations Research, Vol. 34, No. 6 (1986).

The second is the seminal paper that led the way for BNs to replace rule-based systems such as MYCIN.

Heckerman, D., "Probabilistic Interpretation for MYCIN’s Certainty Factors," Proceedings of the 1st Conference on Uncertainty in Artificial Intelligence (UAI-85), Los Angeles, CA, 1985.


I. The large body of work published on diagnosis has seen few recent additions. This article reviews the field and gives a detailed example.

Agosta, J. M. and T. Gardos, "Bayes Network 'Smart Diagnostics'," Intel Technology Journal, Vol. 8, No. 4 (November 2004).


J. "Value of Information" (VOI) is a notion of sensitivity to probabilities that is borrowed from Decision Theory and has found use in BN applications. The original concept comes from Decision Analysis.[3] There are no recent reviews I know of; however, the textbooks mentioned above all address it.

VOI measures are somewhat analogous to the feature selection measures used in Machine Learning. When VOI is applied with no specific value function (e.g., in a BN), using an entropy-based measure as an approximation, it is properly referred to as "Information Gain"; see A. Moore’s slides for an introduction:
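As a concrete illustration of the entropy-based approximation, here is a minimal information-gain computation; the toy data and helper names are invented for illustration.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(labels, feature_values):
    """Expected reduction in class entropy after observing the feature."""
    n = len(labels)
    cond = 0.0
    for v in set(feature_values):
        subset = [l for l, f in zip(labels, feature_values) if f == v]
        cond += len(subset) / n * entropy(subset)
    return entropy(labels) - cond

# Toy data: the feature perfectly predicts the class,
# so the gain equals the full class entropy of 1 bit.
labels = ['a', 'a', 'b', 'b']
feature = [0, 0, 1, 1]
print(information_gain(labels, feature))  # 1.0
```

An uninformative feature (one independent of the class) would score a gain of zero, which is why the measure serves as a value-free stand-in for VOI.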



IV. Machine Learning with Bayes Nets

A. The original tutorial on Bayesian methods for learning BNs:

Heckerman, D., "A Tutorial on Learning Bayesian Networks," republished in
M. Jordan, ed., Learning in Graphical Models, pp. 301-354.

It can also be found at Heckerman’s website:


B. Naïve Bayes

Networks constrained to simple structures often work better for classification than general Bayesian networks. The most widely known are the linear classifiers known as naïve Bayes.

Moore’s slides cover the topic well.

For a textbook reference see the draft of Tom Mitchell’s new edition of his book Machine Learning:

Much of the theory of "linear discriminants" was described in the original 1973 edition of Duda and Hart’s text, although common use of the term "naïve Bayes" appears to postdate the book.[4] The book has recently been revised and brought up to date:

Duda, R. O., P. E. Hart, and D. G. Stork, Pattern Classification, 2nd Edition (Wiley-Interscience, 2000).
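To make the structural constraint concrete, here is a minimal sketch of a discrete naïve Bayes classifier with Laplace smoothing: every feature is assumed conditionally independent given the class, so the "network" is just class-to-feature arcs. The toy data and function names are invented for illustration.

```python
from collections import Counter, defaultdict

def train(X, y):
    """Count class priors and per-class, per-feature value counts."""
    classes = Counter(y)
    counts = defaultdict(Counter)  # (class, feature index) -> value counts
    for xs, c in zip(X, y):
        for i, v in enumerate(xs):
            counts[(c, i)][v] += 1
    return classes, counts

def predict(classes, counts, xs):
    """Pick the class maximizing P(class) * prod_i P(feature_i | class)."""
    n = sum(classes.values())
    best, best_p = None, -1.0
    for c, nc in classes.items():
        p = nc / n
        for i, v in enumerate(xs):
            # Laplace-smoothed estimate, assuming binary features here.
            p *= (counts[(c, i)][v] + 1) / (nc + 2)
        if p > best_p:
            best, best_p = c, p
    return best

# Invented toy data: two binary features, two classes.
X = [(1, 0), (1, 1), (0, 1), (0, 0)]
y = ['spam', 'spam', 'ham', 'ham']
model = train(X, y)
print(predict(*model, (1, 0)))  # spam
```

Despite the independence assumption being false in most real data, this simple model is often a surprisingly strong baseline, which is the point the section above makes.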



C. Tree-Augmented naïve Bayes (TAN)

By slightly relaxing the BN’s structural constraints, it was possible to improve naïve Bayes enough to make it competitive with the best machine learning classifiers of the time:

Friedman, N., D. Geiger, and M. Goldszmidt, "Bayesian Network Classifiers," Machine Learning, 29, 131-163 (1997).
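The key quantity in the TAN construction is the class-conditional mutual information between pairs of features, which serves as the edge weight for the maximum-weight spanning tree built over the features. Here is a minimal sketch of that quantity on invented data; the function name and sample values are my own.

```python
from collections import Counter
from math import log2

def cmi(xi, xj, c):
    """Conditional mutual information I(Xi; Xj | C), in bits,
    estimated from three parallel lists of observed values."""
    n = len(c)
    pc = Counter(c)
    pij = Counter(zip(xi, xj, c))
    pi = Counter(zip(xi, c))
    pj = Counter(zip(xj, c))
    total = 0.0
    for (a, b, k), nab in pij.items():
        # p(a,b,k) * log2[ p(a,b|k) / (p(a|k) p(b|k)) ], in count form.
        total += (nab / n) * log2((nab * pc[k]) / (pi[(a, k)] * pj[(b, k)]))
    return total

# Invented sample: within each class, xi determines xj,
# so the conditional mutual information is a full 1 bit.
xi = [0, 0, 1, 1, 0, 1]
xj = [0, 0, 1, 1, 1, 0]
c  = [0, 0, 0, 0, 1, 1]
print(round(cmi(xi, xj, c), 3))  # 1.0
```

TAN computes this weight for every feature pair, keeps the maximum spanning tree, and directs its edges, so each feature gets at most one feature parent in addition to the class.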


D. The original Conditional Random Field (CRF) paper initiated a spate of papers comparing generative and discriminative criteria for optimizing classifiers. Generative models fit distributions to the entire set of model variables, whereas discriminative models optimize the fit for only the variables to be predicted. Bayes nets are the archetypal generative approach, but they have recently also been adapted to discriminative methods, as developed in this paper.

Lafferty, J., A. McCallum, and F. Pereira, "Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data," Proceedings of the International Conference on Machine Learning (ICML-2001), 2001.


Here’s another that is widely cited:

Ng, A. and M. Jordan, “On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes,” in Advances in Neural Information Processing Systems 14. Cambridge, MA: MIT Press, 2002.


E. Support Vector Machines (SVMs; "machines" does not mean "hardware" in this context) and their generalization as kernel machines have a well-deserved reputation. Although not Bayesian in origin, finding a Bayesian interpretation of SVM classifiers is the subject of some recent work:

Zhang, Z. and M. I. Jordan, "Bayesian Multi-category Support Vector Machines," Proceedings of the 22nd Conference on Uncertainty in Artificial Intelligence (UAI 2006).


V. Conferences


Started in 1985 as a workshop in response to the exclusion of quantitative methods from the general AI conference, UAI has become the premier Bayesian AI venue. An index of all the papers can be found at the conference website. Unfortunately, the online full-text versions no longer appear to be available.



What began as a Neural Nets / Neuroscience conference has grown into a multidisciplinary Machine Learning conference with a Bayesian flavor. The website contains a link to the contents of the entire series of proceedings, online.


These machine learning conferences also have a large mix of Bayesian approaches:



VI. Free Software

A. BNT / PNL. The Intel Bayes net / Machine Learning project ported large parts of Kevin Murphy’s previously referenced BNT library to C++. Sadly, there hasn’t been continuing support for PNL:


B. GeNIe and SMILE, from U. Pitt’s Decision Systems Lab. SMILE is a mature library for inference, with recent additions for learning. GeNIe is a full-featured graphical interface built on SMILE.


C. SamIam, from the UCLA Automated Reasoning Group. SamIam ("Sam I am") is a graphical tool that can be used to explore a variety of inference and sensitivity algorithms.


D. Elvira, a Java implementation from the Universidad de Granada. (The first page is in Spanish.)

A current discussion of open-source tools, and the introduction of a new tool, CARMEN, also from Spain, can be found in this paper:


Many other sources exist, but those mentioned tend to be the most stable and up to date.


VII. Commercial Software

There are several commercially available BN products, most notably

A. Hugin and

B. Netica, whose web pages contain tutorial information and case studies.


Consulting firms that also provide BN products:

C. AgenaRisk

D. Charles River Analytics




VIII. Research Websites


A. The “DAG” group under D. Koller at Stanford:


B. Nir Friedman is a frequent collaborator with D. Koller


C. Michael Jordan’s joint Computer Science and Statistics group at UC Berkeley


D. The Research Unit of Decision Support Systems at Aalborg University, Denmark


E. Tommi Jaakkola’s Machine Learning course at MIT:


F. Pedro Domingos, U. Washington


G. Promedas: large scale medical diagnosis:,


H. Adnan Darwiche, creator of SAMIAM tool


I. David Poole, UBC


J. L. van der Gaag, Utrecht University


K. Monash University in Australia


L. The Microsoft research group is among the research leaders in this area. See the Microsoft Research Decision Theory and Adaptive Systems group.


M. A list of success stories, on the “Uncertainty in AI” website:


A somewhat out-of-date list can be found in the Google directory



[1] When I can’t resist, I’ve put references to the broader field in footnotes. If you really want to know the larger picture, here is a recent bible on Bayesian reasoning, with a maximum-entropy flavor. Best taken in small doses.

Jaynes, E.T. and G. L. Bretthorst, Probability Theory: The Logic of Science, (Cambridge University Press, 2003).

Many controversies have arisen about the use of Bayesian statistics. See, for example, the discussion of global climate change by Stephen Schneider.


Similarly, there is an ongoing debate about the use of Bayesian statistics in law. See

Jeff Strnad “Should Legal Empiricists Go Bayesian?”





[2] Bayesian Data Analysis is a primer on Markov Chain Monte Carlo (MCMC). MCMC has recently gained widespread popularity in Statistics and in BN research for solving problems that are intractable by conventional methods. The book is not specifically about Bayesian networks (it never uses the term).

A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin, Bayesian Data Analysis (Chapman & Hall, 1995).
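For a flavor of what MCMC does, here is a minimal Metropolis sampler for the toy problem of estimating a coin's bias from 7 heads in 10 flips under a uniform prior. The numbers are invented, and the exact posterior is known (Beta(8, 4), mean 2/3), so the sketch only illustrates the mechanics: propose a move, accept it with probability proportional to the posterior ratio, and average the resulting chain.

```python
import random
random.seed(0)

def unnorm_posterior(theta):
    """Unnormalized posterior for bias theta: likelihood of 7 heads,
    3 tails, times a uniform prior on (0, 1)."""
    if not 0.0 < theta < 1.0:
        return 0.0
    return theta**7 * (1.0 - theta)**3

theta, samples = 0.5, []
for _ in range(50000):
    # Symmetric Gaussian random-walk proposal.
    prop = theta + random.gauss(0.0, 0.1)
    # Metropolis acceptance: always accept uphill, sometimes downhill.
    if random.random() < unnorm_posterior(prop) / unnorm_posterior(theta):
        theta = prop
    samples.append(theta)

# Discard burn-in, then estimate the posterior mean (exact answer: 2/3).
mean = sum(samples[5000:]) / len(samples[5000:])
print(round(mean, 2))
```

The appeal, as the footnote notes, is that the same recipe still works when the posterior has no closed form and exact methods are intractable; only the unnormalized density is needed.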


[3] Several papers on the topic have appeared recently. See

The original reference on VOI is

Howard, R., "Information Value Theory," IEEE Transactions on Systems Science and Cybernetics, Vol. 2, No. 1, pp. 22-26 (1966).

[4] The origin of the term is unclear; it was apparently popularized in Friedman’s articles about TAN. The method itself was well known, and can be found in an early Bayesian text:

Good, I. J., The Estimation of Probabilities: An Essay on Modern Bayesian Methods (MIT Press, 1965).