Algebraic & Geometric Methods in Statistics

Outline and some illustrative examples in nonlinear statistics

Goals

After the course, you can:

  • list topics in algebraic statistics
  • recognize problems in statistics that are answerable by algebraic methods
  • assess which algebraic methods are suitable for solving a problem
  • apply basic algebraic tools to solve a problem

Tentative course outline:

  1. What is algebraic statistics? An invitation / introduction / overview

  2. Exponential families 2.1. Statistical foundations 2.2. Underlying algebra

  3. Conditional independence and graphical models 3.1. Statistical foundations 3.2. Underlying algebra

  4. Goodness-of-fit testing of models for discrete data 4.1. Overview 4.2. Chromosome clusters in cancer cells 4.3. Network data 4.4. Challenges of large, sparse data sets

  5. Parameter identifiability 5.1. Overview 5.2. Graphical models 5.3. Phylogenetics and evolutionary biology 5.4. Model selection: learning a causal graph

  6. Maximum likelihood estimation 6.1. Introduction 6.2. Deciding existence of ML estimators 6.3. Algorithms for MLE: convex and non-convex optimization

Materials

Books and resources

Main textbook: Seth Sullivant, “Algebraic Statistics”. It is available in the bookstore. (I will check with the library for an e-copy.)

The general course syllabus is here.

Homework and grade

Approximately 6–7 assignments; expect a usual weekly workload.

Project

Option A: Design a minisymposium that has a connection to the topics in this course (other related topics require instructor approval).

Option B: Read a paper, work on a small research project, or apply algebraic methods to a data set, and write a report on it. The timeline will be determined soon; the project will take place during the second half of the semester. Groups of up to 2 students.


Communication

Feel free to discuss things on the Canvas discussion boards!

Saving this information:

I am going to link to this page on Canvas!

Motivating example 1: Discrete Markov chain

Section 1.1 of the textbook. Lecture on the board.

Action item: derive two polynomial equations for the Markov chain model.

Let \(X_1, X_2, X_3\) be a sequence of random variables taking values in \(\Sigma = \{0,1\}\).

  • There are eight joint probabilities \(p_{ijk} = P(X_1 = i,\, X_2 = j,\, X_3 = k)\) for \(i,j,k \in \{0,1\}\).

  • A probability distribution associated to \((X_1, X_2, X_3)\) corresponds to a point in \(\mathbb{R}^8\).

  • The sequence is a Markov chain if \(P(X_3 = x_3 \mid X_1 = x_1, X_2 = x_2) = P(X_3 = x_3 \mid X_2 = x_2)\).

Question: When is a point in \(\mathbb{R}^8\) the probability distribution associated to a Markov chain?


Conditional Probabilities in Terms of Joint Probabilities

For \(i,j,k \in \{0,1\}\), \(P(X_3 = k \mid X_1 = i, X_2 = j) = \dfrac{p_{ijk}}{p_{ij+}}\) where \(p_{ij+} = \sum_{k \in \{0,1\}} p_{ijk}\).

Markov chain condition.

Using the above and comparing the conditional distributions for different values of \(X_1\), we have \(\dfrac{p_{ijk}}{p_{ij+}} = \dfrac{p_{i'jk}}{p_{i'j+}}\) for \(i \neq i'\), which implies \(p_{ijk}\, p_{i'j+} = p_{i'jk}\, p_{ij+}\).

Example

In the binary case (\(\Sigma = \{0,1\}\)), simplifying yields the two polynomial equations: \(p_{000}\, p_{101} - p_{001}\, p_{100} = 0\) and \(p_{010}\, p_{111} - p_{011}\, p_{110} = 0\).
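The simplification from the Markov chain condition to these two binomials can be checked symbolically. Below is a minimal sketch using sympy; the helper name `markov_relation` is ours, not from the textbook.

```python
import itertools
import sympy as sp

# Symbols for the eight joint probabilities p_{ijk}.
p = {(i, j, k): sp.Symbol(f"p{i}{j}{k}")
     for i, j, k in itertools.product([0, 1], repeat=3)}

def markov_relation(j, k):
    # p_{0jk} * p_{1j+} - p_{1jk} * p_{0j+}, the cleared-denominator form
    # of the Markov chain condition for a fixed (j, k).
    p0plus = p[0, j, 0] + p[0, j, 1]
    p1plus = p[1, j, 0] + p[1, j, 1]
    return sp.expand(p[0, j, k] * p1plus - p[1, j, k] * p0plus)

# After expansion, the j = 0 relation collapses to p000*p101 - p001*p100,
# and the j = 1 relation to p010*p111 - p011*p110 (k = 1 gives the same
# binomials up to sign).
eq_j0 = markov_relation(0, 0)
eq_j1 = markov_relation(1, 0)
print(eq_j0)
print(eq_j1)
```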


Characterization of the Model (Semialgebraic Set)

A point \(p \in \mathbb{R}^8\) is the probability distribution associated to a (binary, 3-step) Markov chain iff:

  • \(p_{ijk} \ge 0\) for all \(i,j,k \in \{0,1\}\),
  • \(\sum_{i,j,k \in \{0,1\}} p_{ijk} = 1\),
  • \(p_{000}\, p_{101} - p_{001}\, p_{100} = 0\),
  • \(p_{010}\, p_{111} - p_{011}\, p_{110} = 0\).

Therefore, the Markov chain model forms a semialgebraic set—the solution set of a system of polynomial equations and inequalities.
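The characterization can be sanity-checked numerically: any distribution built from an initial law for \(X_1\) and a transition matrix lies in the semialgebraic set above. A minimal sketch, with an arbitrarily chosen (hypothetical) initial distribution and transition matrix:

```python
import itertools

# A hypothetical binary Markov chain: initial distribution for X1 and one
# transition matrix used for both steps (values chosen arbitrarily).
init = [0.6, 0.4]                     # init[i]     = P(X1 = i)
trans = [[0.7, 0.3], [0.2, 0.8]]      # trans[a][b] = P(X_{t+1} = b | X_t = a)

# Joint probabilities p[(i, j, k)] = P(X1 = i, X2 = j, X3 = k).
p = {(i, j, k): init[i] * trans[i][j] * trans[j][k]
     for i, j, k in itertools.product([0, 1], repeat=3)}

# The point lies in the probability simplex ...
assert all(v >= 0 for v in p.values())
assert abs(sum(p.values()) - 1.0) < 1e-12

# ... and satisfies the two polynomial equations of the model.
eq1 = p[0, 0, 0] * p[1, 0, 1] - p[0, 0, 1] * p[1, 0, 0]
eq2 = p[0, 1, 0] * p[1, 1, 1] - p[0, 1, 1] * p[1, 1, 0]
print(eq1, eq2)  # both numerically zero
```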

Notes and Next Steps:

  • This is an example of a conditional independence model (future lecture)
  • Fitting the model to data: Assume there is a true, unknown distribution \(p\) in the model generating the data. What is \(p\)? (likelihood inference; future lecture)
  • Model assessment: How well does the model fit the data? (Fisher’s exact test; future lecture)

Motivating example 2: Graphical models

What is algebraic statistics?


Probability / statistics

  • Probability distribution
  • Statistical model
  • (Discrete) exponential family
  • Conditional inference
  • Maximum likelihood estimation
  • Model selection
  • Multivariate Gaussian model
  • Phylogenetic model
  • MAP estimates

Algebra/geometry

  • Point
  • (Semi)algebraic set
  • Toric variety / ideal
  • Lattice points in polytopes
  • Polynomial optimization
  • Geometry of singularities
  • Spectrahedral geometry
  • Tensor networks
  • Tropical geometry

Lecture plan

We will now continue with the following topics:

  • Probability Primer (Chapter 2) and
  • Conditional Independence (Chapter 4)

Appendix

The following is a 3-slide “intro” to algebraic geometry; the slides are by S. Sullivant, given at a colloquium long ago. They are meant to give you a glimpse of the vocabulary, not to be digested immediately.

Introduction to algebraic geometry



Example: Hardy-Weinberg Equilibrium
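Since the slides themselves are not reproduced here, the following is a minimal sketch of the classical Hardy-Weinberg model, in the same spirit as the Markov chain example: genotype frequencies \((p_{AA}, p_{Aa}, p_{aa}) = (\theta^2,\, 2\theta(1-\theta),\, (1-\theta)^2)\) lie on the curve \(4\,p_{AA}\,p_{aa} - p_{Aa}^2 = 0\). The allele frequency `theta` below is a hypothetical value chosen for illustration.

```python
# Hardy-Weinberg equilibrium: genotype frequencies from an allele frequency.
theta = 0.3  # hypothetical frequency of allele A

p_AA = theta ** 2
p_Aa = 2 * theta * (1 - theta)
p_aa = (1 - theta) ** 2

# The frequencies form a probability distribution ...
assert abs(p_AA + p_Aa + p_aa - 1.0) < 1e-12

# ... lying on the Hardy-Weinberg curve 4*p_AA*p_aa - p_Aa^2 = 0.
invariant = 4 * p_AA * p_aa - p_Aa ** 2
print(invariant)  # numerically zero
```

Varying `theta` over \([0, 1]\) traces out the whole model, a curve inside the probability simplex: another example of a semialgebraic set.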

License

This document is created for Math/Stat 561, Spring 2023, at Illinois Tech.

While the course materials are generally not to be distributed outside the course without permission of the instructor, all materials posted on this page are licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.