Abstract

The purpose of this paper is to examine the Chicago public high school system and extract information on demographics and academics. The methods used are common in machine learning and discrete mathematics and aim to provide helpful visualizations and interpretations of educational disparity. The goal is not only to analyze the current state but also to show how these demographic and academic factors have changed over time for certain school groups. We accomplish this using graphical models and visualizations that show key factors and how they change over time. In addition, one can see how the COVID-19 pandemic has affected schools in this system. Lastly, the correlations between these factors are presented along with discussion.

Introduction

Chicago has a history of being one of the most segregated cities in the U.S., and this segregation carries over into its school system. This can be a problem when trying to provide education to all who need it. It can also make it easier to identify when one racial/ethnic group is being undervalued. The Chicago Public School system has a lot on its plate when trying to provide quality education to students from all walks of life. This raises the question: how well is this goal being achieved? This brings us to the goal of the paper: how can one use statistical methods to visualize and test for educational disparity? The main focus will be to provide the methods to measure this disparity. In addition, we will visualize this disparity and provide meaningful interpretations of the statistical methods used.

To begin our study of the Chicago public high school system, we must gather data. The data source used for this paper is the Illinois State Board of Education [1]. Their website contains a report card library dating back to 1996. Since the goal of this paper is to analyze Chicago public high schools in the present day, we will only go back as far as 2015 for our data. These report cards contain data on every registered public school in Illinois. The pieces of data we are interested in are demographic data and academic data. We will use these for our analysis going forward.

Main Idea/Feature Selection

To start, we need to understand what data we are going to use. We will begin by selecting these factors from the report card data:

  1. School Name
  2. % Student Enrollment − White
  3. % Student Enrollment − Black or African American
  4. % Student Enrollment − Hispanic or Latino
  5. % Student Enrollment − Asian
  6. % Student Enrollment − Low Income
  7. Student Attendance Rate
  8. High School Dropout Rate − Total
  9. High School 4−Year Graduation Rate − Total
  10. % Graduates enrolled in a Post secondary Institution within 12 months
  11. # Student Enrollment

Then we will reduce our set of schools to high schools in Chicago. Our goal is to separate schools into bins with similar demographics and then use simple statistics to describe the different bins. We can achieve this with K-means clustering ([2], p. 460), which partitions the schools into separate bins of similar demographics. In particular, we will be clustering schools based on factors 1) through 6). This means that each school in our clustering algorithm can be represented as a point in \(\mathbb{R}^6\). Later we will see how these clusters have an effect on academic “success”. In the context of this paper, we define academic success to be the factors 7) through 10).

K-Means Clustering

Mathematical Setup for K-means Clustering

In the preliminary step of the K-means algorithm we randomly select \(K\) points \[\{p_1^{(0)}, p_2^{(0)},\dots, p_K^{(0)}\}\] to be the starting means, or centroids. Let \(S \subset \mathbb{R}^6\) be the set of schools. With these random starting points, we construct a partition \[ \{ S_1^{(0)}, S_2^{(0)}, \dots, S_K^{(0)} \} \] where \[\bigcup^K_{i = 1} S_i^{(0)} = S\] and \[S_i^{(0)} = \{s\in S : d(s,p_i^{(0)}) < d(s,p_j^{(0)}) \; \forall \; j\neq i\}\] where the metric \(d:\mathbb{R}^6\times\mathbb{R}^6 \to \mathbb{R}\) is the Euclidean distance.

The intuition behind the math here is that we drop \(K\) points in space that each have 6 randomly generated proportions for our school demographics. Then we partition schools into \(K\) subsets where the \(i^{th}\) subset, \(S_i^{(0)}\), is the collection of schools that were ``closest'' to the \(i^{th}\) point, \(p_i^{(0)}\). The superscript \(^{(0)}\) denotes that this is the preliminary step.

Moving forward, we recalculate our \(K\) points, \[\{p_1^{(1)}, p_2^{(1)},\dots,p_K^{(1)}\}\] where \[p_i^{(1)} = \frac{1}{|S_i^{(0)}|}\sum_{s\in S_i^{(0)}}s\] That is, the \(i^{th}\) point, \(p_i^{(1)}\), is now the mean demographics of the schools in the \(i^{th}\) subset, \(S_i^{(0)}\), from the previous step. Then we create another partition given by \[\{S_1^{(1)}, S_2^{(1)},\dots,S_K^{(1)}\}\] using the same strategy as in the preliminary step.

The algorithm repeats in this fashion until the means converge, that is, until the clusters no longer change on the next iteration. Thus we will have our final clusters given by \[\{S_1, S_2,\dots,S_K\}\] Once we have our clusters, we can label each school with the cluster it belongs to. For each cluster, \(S_i\), we can give summary statistics. We present the implementation with \(K=3\).
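The iteration described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the implementation used for the results in this paper. The function name `kmeans` and its parameters are hypothetical. It also makes one assumption that differs from the description: the starting centroids are drawn from the data points themselves rather than generated at random in space, and ties in distance go to the lowest-numbered centroid.

```python
import math
import random


def kmeans(points, k, seed=0, max_iter=100):
    """Illustrative K-means loop; returns (centroids, labels).

    points: list of equal-length numeric tuples (one per school).
    Assumption: initial centroids are k distinct data points.
    """
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid
        # under Euclidean distance (ties go to the lowest index).
        new_labels = [
            min(range(k), key=lambda i: math.dist(p, centroids[i]))
            for p in points
        ]
        # Stop once the partition no longer changes.
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:  # an empty cluster keeps its old centroid
                centroids[i] = [sum(c) / len(members) for c in zip(*members)]
    return centroids, labels
```

The sketch works in any dimension, so a school represented as a point in \(\mathbb{R}^6\) would simply be a 6-tuple. Because the result depends on the starting centroids, different seeds may give different partitions.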

Results from K-means Clustering: 2019-2020

From Figures 1, 2, and 3 we see that cluster 0 was predominantly Black or African American, cluster 1 was predominantly Hispanic, and cluster 2 was a mixed demographic cluster with a high proportion of non-low-income students. These results raise some serious concerns. The important aspect of the construction of these clusters is that the algorithm used only demographic factors to partition the schools, yet it was able to identify a group of schools with significantly higher academic ``success'' in cluster 2. It must be understood that our definition of academic success is restricted to only 4 pieces of data. In reality, basing an individual student’s academic success on these 4 factors would be preposterous. However, making a judgment about an entire school based on these factors makes more sense.
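As a rough illustration of how per-cluster summaries like those in Figures 1 to 3 might be computed, the sketch below averages each factor within each cluster. The function name `summarize_by_cluster`, the record layout, and the factor keys are hypothetical, and the numbers in the usage example are made up for illustration. They are not values from the report card data.

```python
from collections import defaultdict
from statistics import mean


def summarize_by_cluster(records, labels):
    """Return {cluster label: {factor: mean value}}.

    records: list of dicts mapping factor name -> numeric value.
    labels:  cluster label for each record, in the same order.
    """
    grouped = defaultdict(list)
    for rec, lab in zip(records, labels):
        grouped[lab].append(rec)
    return {
        lab: {key: mean(r[key] for r in recs) for key in recs[0]}
        for lab, recs in grouped.items()
    }


# Made-up example values, for illustration only.
example_records = [
    {"attendance": 90.0, "graduation": 80.0},
    {"attendance": 80.0, "graduation": 70.0},
    {"attendance": 95.0, "graduation": 90.0},
]
example_summary = summarize_by_cluster(example_records, [0, 0, 1])
```

Only the mean is shown here; other summary statistics could be added in the same way.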

Figure 1. Summary Stats of Cluster 0

Figure 2. Summary Stats of Cluster 1

Figure 3. Summary Stats of Cluster 2

These clusters allow us to analyze even more about the school system in Chicago. They allow us to see the level of segregation in Chicago and how that is woven into the school system. By plotting these clusters, we can create a visual of this.

Figure 4. Mapping clusters over Chicago neighborhoods

These clusters not only separate schools efficiently on the basis of academic success (without being designed to do so), they also appear to split up the areas of Chicago efficiently.

Traditional Approach

In this section we will use the 2019-2020 school year data to look at these factors from a more straightforward standpoint. What can we see about the correlations between all these factors? Informally speaking, a correlation close to 1 means that the two factors are strongly positively related, a correlation close to -1 means that the two factors are strongly negatively related, and a correlation of 0 means that there is no observable linear relationship between the factors [4].
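To make the interpretation of the correlation values concrete, the sketch below computes a correlation coefficient for two lists of numbers. It assumes the Pearson correlation coefficient is the one intended, which the text does not state explicitly. The function name `pearson` is hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences.

    Assumes neither sequence is constant (otherwise the denominator is 0).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Numerator: sum of products of deviations from the means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Denominator: product of the root sums of squared deviations.
    dev_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (dev_x * dev_y)
```

For example, two factors that rise together exactly linearly, such as `[1, 2, 3]` and `[2, 4, 6]`, give a value of 1 (up to floating-point rounding). Reversing one of them gives -1.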

Figure 7. Proportion of students attending college over time
