Sources:
pgmpy
Problem: Naive Bayesian inference is often intractable if the range of $(X,Y)$ is "too large".
$\Rightarrow$ (Probabilistic) graphical models (PGMs) leverage conditional independencies to compactly represent joint distributions and efficiently perform inference.
We will look at the following models:
$\Rightarrow$ Assuming each variable $N_i$ has $n$ outcomes and $k \ll K$ parents, this reduces the number of values from $n^K$ to $K n^{k+1}$.
Describes the intelligence of a student (average/high), his score in the SAT exam (bad/good), the grade he gets in the class (A,B,C), the difficulty of the class (easy/hard), and the quality of the recommendation letter he gets from the professor after completing the course (bad/good).
Notation:
import utils
model = utils.student_model()
utils.render(model)
Let us look at the conditional distributions.
node = "Grade" # try "Intel"
utils.print_cpd(model.get_cpds(node))
+------------+------------------+------------------+------------------+------------------+ | Intel | Intel(Average) | Intel(Average) | Intel(High) | Intel(High) | +------------+------------------+------------------+------------------+------------------+ | Difficulty | Difficulty(Easy) | Difficulty(Hard) | Difficulty(Easy) | Difficulty(Hard) | +------------+------------------+------------------+------------------+------------------+ | Grade(A) | 0.3 | 0.05 | 0.9 | 0.5 | +------------+------------------+------------------+------------------+------------------+ | Grade(B) | 0.4 | 0.25 | 0.08 | 0.3 | +------------+------------------+------------------+------------------+------------------+ | Grade(C) | 0.3 | 0.7 | 0.02 | 0.2 | +------------+------------------+------------------+------------------+------------------+
The structure leads to a reduced representation of the joint distribution (see chain rule of probability
$$ P(L,G,D,I,S) = \underbrace{P(L|G,D,I,S)}_{P(L|G)} \ \underbrace{P(G|D,I,S)}_{P(G|D,I)} \ \underbrace{P(S|D,I)}_{P(S|I)} \ \underbrace{P(D|I)\ }_{P(D)} P(I), $$which reduces the number of values from $48$ to $26$.
from itertools import product
utils.print_cpds(model)
print("---")
print(f"Joint: {len(list(product(*model.states.values())))} values")
print(f"PGM: {sum(cpd.values.size for cpd in model.cpds)} values")
P(Difficulty), 2 values P(Intel), 2 values P(Grade|Intel, Difficulty), 12 values P(Letter|Grade), 6 values P(SAT|Intel), 4 values --- Joint: 48 values PGM: 26 values
The structure helps in marginalization, as we can break the computation into smaller parts (variable eliminiation). For instance:
\begin{align*} P(G) = \sum_{L,D,I,S} P(L,G,D,I,S) &= \sum_{L,D,I,S} P(L|G) P(G|D,I) P(S|I) P(D) P(I) \\&= \sum_{D} P(D) \sum_{I} P(G|D,I)P(I) \sum_{S} P(S|I) \sum_{L} P(L|G). \end{align*}from pgmpy.inference import VariableElimination
node = "Grade"
infer = VariableElimination(model)
print(infer.query([node]))
+----------+--------------+ | Grade | phi(Grade) | +==========+==============+ | Grade(A) | 0.3620 | +----------+--------------+ | Grade(B) | 0.2884 | +----------+--------------+ | Grade(C) | 0.3496 | +----------+--------------+
This also allows us to compute conditional distributions and maximum a posteriori (MAP) estimates.
node = "Grade"
evidence = {"SAT": "Good", "Intel": "High"}
map_est = infer.map_query([node], evidence=evidence, show_progress=False)
print(infer.query([node], evidence=evidence))
print(f"MAP estimate: {map_est}")
+----------+--------------+ | Grade | phi(Grade) | +==========+==============+ | Grade(A) | 0.7400 | +----------+--------------+ | Grade(B) | 0.1680 | +----------+--------------+ | Grade(C) | 0.0920 | +----------+--------------+ MAP estimate: {'Grade': 'A'}
Let us look more closely at the independencies given by our model.
First, we see that
$$N \perp \mathrm{NonDe}(N) | \operatorname{Pa}(N),$$where $\mathrm{NonDe}(N)$ is the set of nodes which are not descendents of the node $N$.
nodes = ["Grade"]
utils.render(model)
model.local_independencies(nodes)
(Grade ⟂ SAT | Intel, Difficulty)
Causal / evidential / cascade / head-to-tail: $ D \rightarrow G \rightarrow L$ implies $D \perp L | G$, but, in general, $ D \not\perp L$.
cascade = model.copy()
cascade.remove_nodes_from(["Intel", "SAT"])
cascade.latents = {"Difficulty", "Letter"}
utils.render(cascade)
cascade.get_independencies(include_latents=True)
(Letter ⟂ Difficulty | Grade) (Difficulty ⟂ Letter | Grade)
Common cause / fork / fan-out / tail-to-tail: $ S \leftarrow I \rightarrow G$ implies $S \perp G | I$, but, in general, $ S \not\perp G$.
fork = model.copy()
fork.remove_nodes_from(["Letter", "Difficulty"])
fork.latents = {"Grade", "SAT"}
utils.render(fork)
fork.get_independencies(include_latents=True)
(SAT ⟂ Grade | Intel) (Grade ⟂ SAT | Intel)
Common evidence / collider / V-structure / fan-in / head-to-head: $ I \rightarrow G \leftarrow D$ implies $I \perp D$ but, in general, $ I \not\perp D | G$.
collider = model.copy()
collider.remove_nodes_from(["Letter", "SAT"])
collider.latents = {"Intel", "Difficulty"}
utils.render(collider)
collider.get_independencies(include_latents=True)
(Intel ⟂ Difficulty) (Difficulty ⟂ Intel)
Suppose you're on a game show, and you're given the choice of three doors: Behind one door is a car; behind the others, goats. You pick a door, say No. 0, and the host, who knows what's behind the doors, opens another door, say No. 2, which has a goat. He then says to you, "Do you want to pick door No. 1?" Is it to your advantage to switch your choice?
Assumption: The host must always open a door which has not been picked by the contestant and which reveals a goat.
node = "Host" # try "Prize"
mh_model = utils.monty_hall_model()
utils.print_cpd(mh_model.get_cpds(node))
+---------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+ | Choice | Choice(0) | Choice(0) | Choice(0) | Choice(1) | Choice(1) | Choice(1) | Choice(2) | Choice(2) | Choice(2) | +---------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+ | Prize | Prize(0) | Prize(1) | Prize(2) | Prize(0) | Prize(1) | Prize(2) | Prize(0) | Prize(1) | Prize(2) | +---------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+ | Host(0) | 0.0 | 0.0 | 0.0 | 0.0 | 0.5 | 1.0 | 0.0 | 1.0 | 0.5 | +---------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+ | Host(1) | 0.5 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.5 | +---------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+ | Host(2) | 0.5 | 1.0 | 0.0 | 1.0 | 0.5 | 0.0 | 0.0 | 0.0 | 0.0 | +---------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+
The Bayesian model has a collider structure and, as excepted, it holds that $C \perp P$. However, in general, the door opened by the host can provide information on the location of the car, i.e., $$C \not\perp P | H.$$ We can infer that the posterior probability of the car being behind door 1 is the largest and thus the contestant should switch the door.
mh_infer = VariableElimination(mh_model)
print(mh_infer.query(["Prize"], evidence={"Choice": 0, "Host": 2}))
utils.render(mh_model)
+----------+--------------+ | Prize | phi(Prize) | +==========+==============+ | Prize(0) | 0.3333 | +----------+--------------+ | Prize(1) | 0.6667 | +----------+--------------+ | Prize(2) | 0.0000 | +----------+--------------+
The local independencies motivate the following general theorem on conditional independencies in Bayesian models.
Let A, B, C be disjoint subsets of nodes of a DAG and consider all paths from any node in A to any node in B (treating edges as undirected). We call a path blocked if it includes a node $N$ such that the nodes at $N$ meet
If all paths are blocked, then A is said to be d-separated from B by C, and
$$A \perp B | C.$$A path which is not blocked is also called active.
A = {"Difficulty"}
B = {"SAT"}
C = {"Intel"} # try {"Intel", "Letter"}, {"Letter"}, and set()
model.latents = model.nodes - C
active_trail_nodes = model.active_trail_nodes(list(A), observed=C, include_latents=True)
active_trail_nodes = set.union(*active_trail_nodes.values())
dseparated = not bool(active_trail_nodes.intersection(B))
print(f"d-separated: {dseparated}")
if dseparated:
print(f"{A} ⟂ {B} | {C}")
utils.render(model)
d-separated: True {'Difficulty'} ⟂ {'SAT'} | {'Intel'}
We call a subset of nodes $S$ a Markov blanket of node $N$, if $N$ is independent of the rest of the graph conditioned on $S$.
In a general Bayesian model, the minimal Markov Blanket (Markov boundary) contains all parents $\mathrm{Pa}(N)$, children $\mathrm{Ch}(N)$, and co-parents $\mathrm{CoPa}(N)$ (parents of children other than $N$), i.e.,
$$ \mathrm{Pa}(N) \cup \mathrm{Ch}(N) \cup \mathrm{CoPa}(N).$$node = "Intel"
ext_model = model.copy()
for tail, head in zip(["Edu", "Country", "Subject"], ["Intel", "Edu", "Difficulty"]):
ext_model.add_edge(tail, head)
blanket = set(ext_model.get_markov_blanket(node))
print(f"Markov blanket of '{node}': {blanket}")
ext_model.latents = ext_model.nodes - blanket
utils.render(ext_model, grid_unit=3.5)
Markov blanket of 'Intel': {'SAT', 'Edu', 'Difficulty', 'Grade'}
Let $C_1,C_2$ describe two independent fair coin tosses and let $B$ describe a bell which is activated if the coins show the same side. We observe that $C_1 \perp C_2$, $C_1 \perp B$, and $C_2 \perp B$. This gives rise to three factorizations of the joint (and thus three Bayesian models), however none represents every (conditional) independence property of the distribution.
from pgmpy.models import BayesianNetwork
nodes = {"B", "C1", "C2"}
for head in nodes:
utils.render(BayesianNetwork([(n, head) for n in nodes - {head}], latents=nodes))
Markov random fields are graphical models based on undirected graphs. They represent conditional independence more conveniently by graph separation (strong Markov property):
For any disjoint subsets $X,Y,S$ of nodes, where $S$ is a separating set (every path from $X$ to $Y$ passes through $S$), it holds that $X \perp Y |S$.
$\Rightarrow$ The Markov boundary of a node $N$ is just given by the set of neighbors of $N$.
node = "A"
markov_model = utils.voting_model()
blanket = set(markov_model.markov_blanket(node))
print(f"Markov blanket of '{node}': {blanket}")
markov_model.latents = markov_model.nodes - blanket
utils.render(markov_model)
Markov blanket of 'A': {'C', 'D'}
Note that no directed graph over four variables implies the set of conditional independence properties given by this undirected graph, i.e.,
$$A\not\perp B, \quad C \perp D | \{A , B\}, \quad A \perp B | \{C,D\}.$$markov_model.latents = markov_model.nodes
utils.render(markov_model)
markov_model.latents = {"C", "D"}
utils.render(markov_model)
markov_model.latents = {"A", "B"}
utils.render(markov_model)
We cannot specify the conditional distributions as in Bayesian models. Instead we identify dependent variables and define the strength of their interactions.
We can represent such interactions by potential functions / factors $\phi_c$ defined on cliques $(N_i)_{i\in c}$ (connected subgraphs). The joint distribution is then given by
$$ p(N_1,\dots,N_K) = \frac{1}{Z} \tilde{p}(N_1,\dots,N_K) = \frac{1}{Z}\prod_{c \in \mathrm{cliques}} \phi_c\big((N_i)_{i\in c}\big), $$where the partition function (normalization constant) is given by
$$Z = \sum_{N_1,\dots,N_K} \tilde{p}(N_1,\dots,N_K).$$We use cliques, as our model imposes that nodes not sharing an edge are independent conditioned on the rest of the graph, i.e., they should not appear in the same factor.
For the sake of illustration, let us interpret $A,B,C,D$ in our previous example as persons. Let's assume that $\{A,C\}$, $\{C,B\}$, $\{B,D\}$, and $\{D,A\}$ are pairs of friends, which tend to have similar voting preferences. For simplicity, we assume that the preferences are the same across all pairs and that $\phi_c =1$ for all other cliques.
clique_index = 0
print(markov_model.factors[clique_index])
print(markov_model.get_partition_function())
+------+------+------------+ | A | C | phi(A,C) | +======+======+============+ | A(0) | C(0) | 5.0000 | +------+------+------------+ | A(0) | C(1) | 1.0000 | +------+------+------------+ | A(1) | C(0) | 1.0000 | +------+------+------------+ | A(1) | C(1) | 10.0000 | +------+------+------------+ 11327.0
For illustration, let us perform (approximate) inference on our Markov random field using Belief propagation / sum–product message passing.
from pgmpy.inference import BeliefPropagation
evidence = {"B": 1} # try {"D": 1} and {"D": 1, "C": 0}
markov_model.latents = markov_model.nodes - set(evidence.keys())
voting_infer = BeliefPropagation(markov_model)
print(voting_infer.query(["A"], evidence=evidence))
utils.render(markov_model)
+------+----------+ | A | phi(A) | +======+==========+ | A(0) | 0.0216 | +------+----------+ | A(1) | 0.9784 | +------+----------+
Finally, note that also Markov random fields do not always offer perfect maps, where every conditional independence property of a distribution is reflected in the graph. For instance, the distribution given by a collider, $A\perp B$ and $A\not\perp B |C$, has no corresponding undirected graph (over the same variables) that is a perfect map.
utils.render(BayesianNetwork([("A", "C"), ("B", "C")], latents={"A", "B"}))
Undirected Graphical Models / Markov Random Fields:
$+$ can be applied to problems without directionality of the associated variable dependencies.
$+$ directly encode conditional independence structure.
$-$ expressing the joint requires to find cliques and compute the normalization constant $Z$ (often requires approximations).
$-$ interpreting undirected models and generating data is typically harder than for Bayesian models.
Directed Graphical Models / Bayesian Networks:
$+$ directly encode a facorization of the joint.
$-$ analyzing conditional independence requires to consider d-separation.