**Sources:**

- Christopher Bishop's book Pattern recognition and machine learning
- Philipp Hennig's course on Probabilistic ML
- The course on probabilistic graphical models by Stefano Ermon's group
- The python library
`pgmpy`

- Construct a joint distribution $p(X,Y)$ over the data $(X,Y)$.
- Condition on the observed variable $X$ to compute the posterior distribution

- Query the posterior distribution $p(Y|x)$ to predict the distribution of values $Y$ for a new data point $x$.

**Problem:** Naive Bayesian inference is often intractable if the range of $(X,Y)$ is "too large".

$\Rightarrow$ *(Probabilistic) graphical models (PGMs)* leverage conditional independencies to compactly represent joint distributions and efficiently perform inference.

We will look at the following models:

*Bayes(ian)/belief models/networks*(directed graphs)*Markov models/networks / Markov random fields*(undirected graphs)

- Use a directed acyclic graph (DAG) to represent a set of random variables and their conditional dependencies.
- A node $N$ in the graph represents the conditional distribution $p(N | \mathrm{Pa}(N))$, where $\mathrm{Pa}(N)$ is the set of parents of $N$.
- Joint distribution is given by:

$\Rightarrow$ Assuming each variable $N_i$ has $n$ outcomes and $k \ll K$ parents, this reduces the number of values from $n^K$ to $K n^{k+1}$.

Describes the intelligence of a student (average/high), his score in the SAT exam (bad/good), the grade he gets in the class (A,B,C), the difficulty of the class (easy/hard), and the quality of the recommendation letter he gets from the professor after completing the course (bad/good).

Notation:

- Gray nodes are called
*observed variables*, which we expect to obtain for new data points. - White nodes are
*hidden variables / latent variables*, some of which we are interested in estimating.

In [1]:

```
import utils
model = utils.student_model()
utils.render(model)
```

Let us look at the conditional distributions.

In [2]:

```
node = "Grade" # try "Intel"
utils.print_cpd(model.get_cpds(node))
```

The structure leads to a reduced representation of the joint distribution (see chain rule of probability

$$ P(L,G,D,I,S) = \underbrace{P(L|G,D,I,S)}_{P(L|G)} \ \underbrace{P(G|D,I,S)}_{P(G|D,I)} \ \underbrace{P(S|D,I)}_{P(S|I)} \ \underbrace{P(D|I)\ }_{P(D)} P(I), $$which reduces the number of values from $48$ to $26$.

In [3]:

```
from itertools import product
utils.print_cpds(model)
print("---")
print(f"Joint: {len(list(product(*model.states.values())))} values")
print(f"PGM: {sum(cpd.values.size for cpd in model.cpds)} values")
```

The structure helps in marginalization, as we can break the computation into smaller parts (*variable eliminiation*). For instance:

In [4]:

```
from pgmpy.inference import VariableElimination
node = "Grade"
infer = VariableElimination(model)
print(infer.query([node]))
```

This also allows us to compute conditional distributions and *maximum a posteriori* (MAP) estimates.

In [5]:

```
node = "Grade"
evidence = {"SAT": "Good", "Intel": "High"}
map_est = infer.map_query([node], evidence=evidence, show_progress=False)
print(infer.query([node], evidence=evidence))
print(f"MAP estimate: {map_est}")
```

Let us look more closely at the independencies given by our model.

First, we see that

$$N \perp \mathrm{NonDe}(N) | \operatorname{Pa}(N),$$where $\mathrm{NonDe}(N)$ is the set of nodes which are not descendents of the node $N$.

In [6]:

```
nodes = ["Grade"]
utils.render(model)
model.local_independencies(nodes)
```

Out[6]:

(Grade ⟂ SAT | Intel, Difficulty)

**Causal / evidential / cascade / head-to-tail:** $ D \rightarrow G \rightarrow L$ implies $D \perp L | G$, but, in general, $ D \not\perp L$.

In [7]:

```
cascade = model.copy()
cascade.remove_nodes_from(["Intel", "SAT"])
cascade.latents = {"Difficulty", "Letter"}
utils.render(cascade)
cascade.get_independencies(include_latents=True)
```

Out[7]:

(Letter ⟂ Difficulty | Grade) (Difficulty ⟂ Letter | Grade)

**Common cause / fork / fan-out / tail-to-tail:** $ S \leftarrow I \rightarrow G$ implies $S \perp G | I$, but, in general, $ S \not\perp G$.

In [8]:

```
fork = model.copy()
fork.remove_nodes_from(["Letter", "Difficulty"])
fork.latents = {"Grade", "SAT"}
utils.render(fork)
fork.get_independencies(include_latents=True)
```

Out[8]:

(SAT ⟂ Grade | Intel) (Grade ⟂ SAT | Intel)

**Common evidence / collider / V-structure / fan-in / head-to-head:** $ I \rightarrow G \leftarrow D$ implies $I \perp D$ but, in general, $ I \not\perp D | G$.

In [9]:

```
collider = model.copy()
collider.remove_nodes_from(["Letter", "SAT"])
collider.latents = {"Intel", "Difficulty"}
utils.render(collider)
collider.get_independencies(include_latents=True)
```

Out[9]:

(Intel ⟂ Difficulty) (Difficulty ⟂ Intel)

Suppose you're on a game show, and you're given the choice of three doors: Behind one door is a car; behind the others, goats. You pick a door, say No. 0, and the host, who knows what's behind the doors, opens another door, say No. 2, which has a goat. He then says to you, "Do you want to pick door No. 1?" Is it to your advantage to switch your choice?

**Assumption:** The host must always open a door which has not been picked by the contestant and which reveals a goat.

In [10]:

```
node = "Host" # try "Prize"
mh_model = utils.monty_hall_model()
utils.print_cpd(mh_model.get_cpds(node))
```

In [11]:

```
mh_infer = VariableElimination(mh_model)
print(mh_infer.query(["Prize"], evidence={"Choice": 0, "Host": 2}))
utils.render(mh_model)
```

The local independencies motivate the following general theorem on conditional independencies in Bayesian models.

Let A, B, C be disjoint subsets of nodes of a DAG and consider all paths from any node in A to any node in B (treating edges as undirected). We call a path *blocked* if it includes a node $N$ such that the nodes at $N$ meet

- head-to-tail or tail-to-tail, and $N\in C$, or
- head-to-head, and $(\{N\} \cup \mathrm{De}(N)) \cap C = \emptyset $ (neither $N$ nor its descendants are in $C$).

If all paths are blocked, then A is said to be *d-separated* from B by C, and

A path which is not blocked is also called *active*.

In [12]:

```
A = {"Difficulty"}
B = {"SAT"}
C = {"Intel"} # try {"Intel", "Letter"}, {"Letter"}, and set()
model.latents = model.nodes - C
active_trail_nodes = model.active_trail_nodes(list(A), observed=C, include_latents=True)
active_trail_nodes = set.union(*active_trail_nodes.values())
dseparated = not bool(active_trail_nodes.intersection(B))
print(f"d-separated: {dseparated}")
if dseparated:
print(f"{A} ⟂ {B} | {C}")
utils.render(model)
```

d-separated: True {'Difficulty'} ⟂ {'SAT'} | {'Intel'}

We call a subset of nodes $S$ a *Markov blanket* of node $N$, if $N$ is independent of the rest of the graph conditioned on $S$.

In a general Bayesian model, the minimal Markov Blanket (*Markov boundary*) contains all parents $\mathrm{Pa}(N)$, children $\mathrm{Ch}(N)$, and co-parents $\mathrm{CoPa}(N)$ (parents of children other than $N$), i.e.,

In [13]:

```
node = "Intel"
ext_model = model.copy()
for tail, head in zip(["Edu", "Country", "Subject"], ["Intel", "Edu", "Difficulty"]):
ext_model.add_edge(tail, head)
blanket = set(ext_model.get_markov_blanket(node))
print(f"Markov blanket of '{node}': {blanket}")
ext_model.latents = ext_model.nodes - blanket
utils.render(ext_model, grid_unit=3.5)
```

Markov blanket of 'Intel': {'SAT', 'Edu', 'Difficulty', 'Grade'}

Let $C_1,C_2$ describe two independent fair coin tosses and let $B$ describe a bell which is activated if the coins show the same side. We observe that $C_1 \perp C_2$, $C_1 \perp B$, and $C_2 \perp B$. This gives rise to three factorizations of the joint (and thus three Bayesian models), however none represents every (conditional) independence property of the distribution.

In [14]:

```
from pgmpy.models import BayesianNetwork
nodes = {"B", "C1", "C2"}
for head in nodes:
utils.render(BayesianNetwork([(n, head) for n in nodes - {head}], latents=nodes))
```

Markov random fields are graphical models based on undirected graphs. They represent conditional independence more conveniently by graph separation (*strong Markov property*):

For any disjoint subsets $X,Y,S$ of nodes, where $S$ is a *separating set* (every path from $X$ to $Y$ passes through $S$), it holds that $X \perp Y |S$.

$\Rightarrow$ The Markov boundary of a node $N$ is just given by the set of neighbors of $N$.

In [15]:

```
node = "A"
markov_model = utils.voting_model()
blanket = set(markov_model.markov_blanket(node))
print(f"Markov blanket of '{node}': {blanket}")
markov_model.latents = markov_model.nodes - blanket
utils.render(markov_model)
```

Markov blanket of 'A': {'C', 'D'}

Note that no directed graph over four variables implies the set of conditional independence properties given by this undirected graph, i.e.,

$$A\not\perp B, \quad C \perp D | \{A , B\}, \quad A \perp B | \{C,D\}.$$In [16]:

```
markov_model.latents = markov_model.nodes
utils.render(markov_model)
markov_model.latents = {"C", "D"}
utils.render(markov_model)
markov_model.latents = {"A", "B"}
utils.render(markov_model)
```

We cannot specify the conditional distributions as in Bayesian models. Instead we identify dependent variables and define the strength of their interactions.

We can represent such interactions by *potential functions / factors* $\phi_c$ defined on *cliques* $(N_i)_{i\in c}$ (connected subgraphs). The joint distribution is then given by

where the *partition function* (normalization constant) is given by

We use cliques, as our model imposes that nodes not sharing an edge are independent conditioned on the rest of the graph, i.e., they should not appear in the same factor.

In [17]:

```
clique_index = 0
print(markov_model.factors[clique_index])
print(markov_model.get_partition_function())
```

*Belief propagation / sum–product message passing*.

In [18]:

```
from pgmpy.inference import BeliefPropagation
evidence = {"B": 1} # try {"D": 1} and {"D": 1, "C": 0}
markov_model.latents = markov_model.nodes - set(evidence.keys())
voting_infer = BeliefPropagation(markov_model)
print(voting_infer.query(["A"], evidence=evidence))
utils.render(markov_model)
```

Finally, note that also Markov random fields do not always offer perfect maps, where every conditional independence property of a distribution is reflected in the graph. For instance, the distribution given by a collider, $A\perp B$ and $A\not\perp B |C$, has no corresponding undirected graph (over the same variables) that is a perfect map.

In [19]:

```
utils.render(BayesianNetwork([("A", "C"), ("B", "C")], latents={"A", "B"}))
```

**Undirected Graphical Models / Markov Random Fields:**

$+$ can be applied to problems without directionality of the associated variable dependencies.

$+$ directly encode conditional independence structure.

$-$ expressing the joint requires to find cliques and compute the normalization constant $Z$ (often requires approximations).

$-$ interpreting undirected models and generating data is typically harder than for Bayesian models.

**Directed Graphical Models / Bayesian Networks:**

$+$ directly encode a facorization of the joint.

$-$ analyzing conditional independence requires to consider d-separation.