PROJECTS for 561

Algebraic & Geometric Methods in Statistics

Relevant information

Before starting project work, please familiarize yourself with the general project information page on this website, because it:

describes roughly what I look for in course projects,
gives an idea of the type of work expected,
outlines how I typically grade projects.

Please note that is a general informational page for my courses, and your particular project may not undergo the type of rigorous peer review described therein.

Logistics

Disclaimer: This document, while as self-contained as possible, is not 100% of the information you need to learn about the projects. I will present the topics briefly in class. Students who miss that lecture will need to catch up with their peers.

Some of these are reading and discovery projects, some are hands-on explorations, and some are data analyses. Your choice, as per your requests via the feedback form sent out in Week 5 (AhaSlides).

Keep in mind the course objectives; the projects are designed to help you achieve those, particularly the last two.

By design, this assignment is very open-ended. This is a graduate elective, after all. Students are strongly encouraged to compute many examples. Students are also encouraged to formulate, test, and prove their own conjectures.

Teaming

Projects are to be done in groups of 2 to 5 students. The project topic choices will be on a first-come first-served basis; each student will need to claim his/her project topic and team by making a post on Campuswire to claim. In case of conflict, or solo or insufficient teams, I will resolve issues with the relevent teams/groups/students.

Project task

Learn a new theorem, or a model, or showcase a tool we learned in class on a new data set, and communicate it in both written and oral form.

Minimum requirements

Your project may start you on a research career, or it may simply introduce you to a new interesting topic. The goal is to learn something about algebraic statistics that you have to do beyond your homework exercises. In addition to a deeper understanding that the team will acquire, every project needs to answer key high-level questions:

What is the main idea?
When is it useful, and when not?
How does this fit with topics we learned in class?
If concepts are the same or similar but notation is different, make sure to translate.

Learning

Each student is expected to learn a theorem/algorithm/application/usecase beyond the usual lecture material. Expected level of ‘learning’: be able to state the result/method and work through an example.

Writing

Demonstrate what you have learned. The written document must introduce and correctly state the result being studied. If there is an algorithm in place of a theorem, it should be explained properly. If you are studying a new method or model, you should also include at least one interesting example illustrating the result. The article should be as self-contained as possible and understandable to other students in the course. The new document must be typed, be {} pages in length (with one inch margins and a either a 11pt or 12pt font), and be available in PDF format. Templates for both LaTeX and Markdown are provided to you; and I won’t specify things like text having to be single-spaced (obviously?!), but you can infer this from the templates!

Presenting

Every team is expected to give a short presentation on their project before the end of the semester. You are encouraged to bring printed copies of the draft of your report on the day of the presentation, so that everyone can take a look and give you feedback. After the presentation, you will get feedback from both your peers and from the instructor, and then you will be expected to use that feedback to improve your draft before you submit the final written document.

Projects list

Models of association of 3-way tables

Project focus: study what models, other than the complete independence and a few CI models, can be written for 3-dimensional tables. These include both marginal and conditional independence models.

The reading material is lesson 5 here and the main task is to read it, understand it, and run some examples. There are data sets already provided in the lesson module in the link; but none of them use Markov bases. For any example that you go over, you will need to propose what might be a Markov basis. Some of these problems will be known, and some solvable; the minimum you can do is create the matrix \(A\) and describe what it looks like for this model in general, not only for one particular dimension/data set. The point of the last task is to allow students (present or future) to solve a possibly open problem in algebraic statistics!

Graphical models and CI statements

Project focus: Section 13.1 of the book.

To summarize, I quote the book from p.289: “Characterizations of precisely which graphs satisfy the properties that the pairwise Markov property equals the local Markov property and the local Markov property equals the global Markov property can be found in [Lauritzen96]. Remarkably, the failure of the intersection axiom is the only obstruction to the probabilistic equality of the three Markov properties.” In the class, we should already cover Theorem 13.1.4. from the book. So the project will go deeper over these topics and present them to class.

Special note: since this project covers essentially one section of the book that we were already going to cover in class, there is less emphasis on data analysis and more on presenting involved in that the presentation should be like a lecture supplement.

If, on the other hand, the team prefers to instead take the project to the applied side and actually do some data analysis, great!, please just let the instructor know. You may use section 3 of this paper as a guideline to find a sub-topic of interest & learn more.

Markov bases for uncovering proximity of chromosome territories

Project focus: application of Markov bases and challenges that come in practice.

Main reference is this 2014 paper by Serkan Hoten and collaborators. The project task is to read this paper, present what the data looks like, explain how the models can be viewed as log-linear models, and report results, including, but not limited to:

what is the main method used?
why was the method necessary?
why was it effective or not?
how did the authors figure out what is a Markov basis for the model?

Data analysis and trying out a few models

Project focus: get a huge/new contingency table and apply the GoF test.

So many students want to “use alg stat on real data”; for data ideas, see links for data sets in the next section, for example. If you can figure out how to generate a Markov basis, you can input the moves into the GoF test function in this R package: https://github.com/dkahle/algstat. This works fine as long as model is log-linear!

How good are algebraic Markov chains?

Project focus: simulation and study of Markov chains.

The question from this project comes from Luis Garcia’s tutorial on the AlgStat.R package, which can be found here. The author suggests, on the last slide, “Project 3: Analyze the mixing times for the MCMC of your favorite dataset and log-linear model.”

In this project, you will:

Choose a model for a table (2 or 3 or 4 dimensions)
Derive the design matrix \(A\) for the log-linear model
Compute a Markov basis (either by a computer or by looking it up in a published paper!) for that model
Start with either a real data set (cross-classified units categories according to 2 or 3 or 4 criteria) or a simulated one (i.e., you make it up!)
Implement or use already implemented versions of Metropolis-Hastings algorithm to do a random walk on the fiber
Study the mixing times of this chain.

Understanding the intuitive definition of mixing time, and how to track it (for example, the rate at which the algorithm is discovering the fiber), are important!

You are free to locate a discrete data set, which you can covert into a contingency table, on your own but please get approval and/or help from the instructor as necessary.

Phylogenetics, DNA data, and polynomials for tree reconstructions

Project focus: learn how polynomials defining the model are used to analyze data and fit a model in computational biology.

Main reference: J. Fernández-Sánchez, M. Casanellas, Invariant versus classical approach when evolution is heterogeneous across sites and lineages, Systematic Biology(open in new window) (2016), 65 (2), 280-291, doi 10.1093/sysbio/syv086.

This paper Toric ideals of phylogenetic invariants includes a full picture but you would only be focused on understanding Theorem 2 (a), including a worked-out example.

There is also a great book resource: “Algebraic statistics for computational biology” ASCB, where all of the background material can be found. This paper can serve as a general background reference, in particular its first two sections give an excellent overview.

Use model ideals and/or Eucledian distance to test model fit

Project focus: Figure out how to use ideal generators and/or euclidean distance to test if a data set comes from a particular

The minimum deliverable will be a detailed formulation of the nearest-point problem, and a proposal on how to use Euclidean distance and how might one use polynomials vanishing on the model to try to solve it. A worked out example would be ideal. (If the team does more, and solves a problem for a simple model, this may actually become a research paper.)

This is an open problem that would already be very interesting for at least simple contingency table models, small model of independence, or small CI model. Using simulated data is OK for this project, but if you have ideas on real data sets, please go ahead.

Some related work:

An example paper does this for biological models, where they essentially contrast complex geometry vs. real life. Main reference: M. Casanellas, J. Fernández-Sánchez, M. Garrote-López, Distance to the stochastic part of phylogenetic varieties, Journal of Symbolic Computation (2021), vol. 104, 563-682, https://arxiv.org/abs/1912.02138. Reading this paper, one should understand how “distance” is computed and what “stochastic part” means.
Euclidean distance degree defines the notion of Euclidean distance; this paper is an application which computes ED using julia.
Bernd Sturmfels’ ICM22 talk slides 6-7-8 summarize ED degree.

Learning a causal DAG

I do not have a full-fledged description of this project, but if it is a hot topic, we can revisit.

This problem is a bit more open-ended. The goal is to find a way to use code for learning a causal DAG github link I found this sentence “the CausalDAG Python package available at https://github.com/uhlerlab/causaldag, which provides an efficient implementation of Algorithm 4.” in this paper.

Additional resources

papers

In case you are wondering why we ever even need to sample from the fiber, consider this graph-theory asymptotic result on the sizes of fibers of just 2-way tables.
Check this out if you are interested in blending machine learning methods with algebraic: cached sufficient statistics.
Talk on using interventional data for causal discovery. The first half of the talk walks through graphical models, causal discovery, and how machine learning can interact with the solution of the problem effectively.

data

It is best to get your own!

If you are looking for network data to use, you can collect social media data. For graduate students, R packages like twitteR and Rfacebook are a good option.

https://data.cityofchicago.org/ City of Chicago data portal has lots of data you can browse and download.
if you are interested in already visualized data, you can take a look at chichives.com.

Useful sites for network data, which you may reference to see what things look like:

Regarding hwX:

Read section 3 of this paper which is on Algebraic Statistics. The author, Aida Maraj, a postdoc at the University of Michigan, will come and give a talk on March 3rd.
About Markov bases: Felix Almendra Hernandez, a PhD student in UC Davis, will come and give a lecture on his work and qualifying exam in this research area on April 7th.