Abstract

The purpose of this paper is to examine the Chicago public high school system and extract information on demographics and academics. The methods used are common in machine learning and discrete mathematics, and they aim to provide helpful visualizations and interpretations of educational disparity. The goal is not only to analyze the current state but also to show how these demographic and academic factors have changed over time for certain groups of schools. We accomplish this using graphical models and visualizations that highlight key factors and how they change over time. In addition, one can see how the COVID-19 pandemic has affected schools in this system. Lastly, the correlations between these factors are presented and discussed.

Introduction

Chicago has a history of being one of the most segregated cities in the U.S., and that segregation carries over into its school system. This can be a problem when trying to provide education to all who need it. It also makes it easier to identify when one racial/ethnic group is being undervalued. The Chicago Public School system has a lot on its plate when trying to provide quality education to students from all walks of life. This raises the question: how well is this goal being achieved? That question leads to the goal of this paper: how can one use statistical methods to visualize and test for educational disparity? The main focus will be to provide methods to measure this disparity. In addition, we will visualize this disparity and provide meaningful interpretations of the statistical methods used.

In order to begin our descent into the Chicago public high school system, we must gather data. The data source used for this paper is the Illinois State Board of Education [1]. Their website contains a report card library dating back to 1996. Since the goal of this paper is to analyze Chicago public high schools in the present day, we will only go back as far as 2015 for our data. These report cards contain data on every registered public school in Illinois. The pieces of data we are interested in are demographic data and academic data, which we will use for our analysis going forward.

Main Idea/Feature Selection

To start, we need to understand what data we are going to use. We will begin by selecting these factors from the report card data:

  1. School Name
  2. % Student Enrollment - White
  3. % Student Enrollment - Black or African American
  4. % Student Enrollment - Hispanic or Latino
  5. % Student Enrollment - Asian
  6. % Student Enrollment - Low Income
  7. Student Attendance Rate
  8. High School Dropout Rate - Total
  9. High School 4-Year Graduation Rate - Total
  10. % Graduates enrolled in a Postsecondary Institution within 12 months
  11. # Student Enrollment

Then we will reduce our set of schools to high schools in Chicago. Our goal is to separate schools into bins with similar demographics and then use simple statistics to describe the different bins. We can achieve this using K-means clustering ([2], p. 460), which partitions the schools into separate bins of similar demographics. In particular, we will be clustering schools based on factors 1) through 6). This means that each school in our clustering algorithm can be represented as a point in \(\mathbb{R}^6\). Later we will see how these clusters have an effect on academic “success”. In the context of this paper, we define academic success to be the factors 7) through 10).

K-Means Clustering

Mathematical Setup for K-means Clustering

In the preliminary step of the K-means algorithm, we randomly select \(K\) points \[\{p_1^{(0)}, p_2^{(0)},\dots, p_K^{(0)}\}\] to be the starting means, or centroids. Let \(S \subset \mathbb{R}^6\) be the set of schools. With these random starting points, we construct a partition \[ \{ S_1^{(0)}, S_2^{(0)}, \dots, S_K^{(0)} \} \] where \[\bigcup^K_{i = 1} S_i^{(0)} = S\] and \[S_i^{(0)} = \{s\in S : d(s,p_i^{(0)}) < d(s,p_j^{(0)}) \; \forall \; j\neq i\}\] and the metric \(d:\mathbb{R}^6\times\mathbb{R}^6 \to \mathbb{R}\) is the Euclidean distance.

The intuition behind the math is that we drop \(K\) points in space, each with 6 randomly generated proportions for our school demographics. Then we partition the schools into \(K\) subsets, where the \(i^{th}\) subset, \(S_i^{(0)}\), is the collection of schools that are "closest" to the \(i^{th}\) point, \(p_i^{(0)}\). The superscript \(^{(0)}\) denotes that this is the preliminary step.

Moving forward, we recalculate our \(K\) points, \[\{p_1^{(1)}, p_2^{(1)},\dots,p_K^{(1)}\}\] where \[p_i^{(1)} = \frac{1}{|S_i^{(0)}|}\sum_{s\in S_i^{(0)}}s\] so that the \(i^{th}\) point, \(p_i^{(1)}\), is now the mean demographics of the schools in the \(i^{th}\) subset, \(S_i^{(0)}\), from the previous step. Then we create another partition \[\{S_1^{(1)}, S_2^{(1)},\dots,S_K^{(1)}\}\] using the same strategy as in the preliminary step. The algorithm repeats in this fashion until the means converge, that is, until the clusters no longer change on the next iteration. This gives our final clusters \[\{S_1, S_2,\dots,S_K\}\] Once we have our clusters, we can label each school with the cluster it belongs to. For each cluster, \(S_i\), we can give summary statistics. We present the implementation with \(K=3\).
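
The iteration described above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the code from the project repository. It makes two assumptions that are not in the text: the starting centroids are drawn from the schools themselves rather than from arbitrary random points, and ties in distance go to the lower-indexed centroid. The function name `kmeans` and the toy data are hypothetical.

```python
import math
import random


def kmeans(points, k, seed=0, max_iter=100):
    """Lloyd-style K-means; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Preliminary step (assumption): use k distinct schools as starting centroids.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment: each school joins the cluster of its nearest centroid
        # under Euclidean distance (ties go to the lower index).
        new_labels = [
            min(range(k), key=lambda i: math.dist(p, centroids[i]))
            for p in points
        ]
        if new_labels == labels:
            break  # clusters did not change, so the means have converged
        labels = new_labels
        # Update: each centroid becomes the mean of the schools assigned to it.
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:  # an empty cluster keeps its previous centroid
                centroids[i] = [sum(c) / len(members) for c in zip(*members)]
    return centroids, labels


# Toy data (made up): two well-separated groups of demographic proportions.
schools = [
    (0.90, 0.05), (0.85, 0.10), (0.88, 0.07),
    (0.10, 0.80), (0.05, 0.90), (0.12, 0.85),
]
centroids, labels = kmeans(schools, k=2)
print(labels)
```

The cluster numbering depends on the random start, so only the grouping of schools is meaningful, not which cluster is called 0 or 1. In practice a library routine such as scikit-learn's `KMeans` would typically be used instead.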

Results from K-means Clustering: 2019-2020

From figures 1, 2, and 3 we see that cluster 0 was predominantly Black or African American, cluster 1 was predominantly Hispanic, and cluster 2 was a mixed demographic cluster with a high proportion of non-low-income students. These results raise serious concerns. The important aspect of the construction of these clusters is that the algorithm used only demographic factors to partition the schools, yet it was able to pick out a group of schools with significantly higher academic "success" in cluster 2. It must be understood that our definition of academic success is restricted to only 4 pieces of data. In reality, basing an individual student’s academic success on these 4 factors would be preposterous. However, making a judgment about an entire school based on these factors makes more sense.
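
As a rough sketch of how per-cluster summary statistics of this kind could be computed, the following groups schools by cluster label and averages each factor. The helper name `cluster_means` and all numbers are made up for illustration; they are not values from the report cards or from the figures.

```python
from collections import defaultdict
from statistics import mean


def cluster_means(rows, labels, factors):
    """Average each factor over the schools that share a cluster label."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    return {
        label: {f: mean(r[f] for r in members) for f in factors}
        for label, members in sorted(grouped.items())
    }


# Hypothetical schools with two of the academic factors (values invented).
rows = [
    {"Student Attendance Rate": 90.0, "High School Dropout Rate - Total": 4.0},
    {"Student Attendance Rate": 80.0, "High School Dropout Rate - Total": 8.0},
    {"Student Attendance Rate": 94.0, "High School Dropout Rate - Total": 2.0},
]
labels = [0, 0, 2]  # cluster label of each school, e.g. from K-means
factors = ["Student Attendance Rate", "High School Dropout Rate - Total"]
summary = cluster_means(rows, labels, factors)
for label, stats in summary.items():
    print(label, stats)
```

Note that the academic factors are only summarized after the clustering is done; they play no role in forming the clusters.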

Figure 1. Summary Stats of Cluster 0

Figure 2. Summary Stats of Cluster 1

Figure 3. Summary Stats of Cluster 2

These clusters allow us to analyze even more about the school system in Chicago. They allow us to see the level of segregation in Chicago and how that is woven into the school system. By plotting these clusters, we can create a visual of this.

Figure 4. Mapping clusters over Chicago neighborhoods

It is clear that these clusters not only separate schools effectively on the basis of academic success (without being designed to), but also seem to be an effective tool for dividing up the areas of Chicago.

Traditional Approach

In this section we will use the 2019-2020 school year data to look at these factors from a more straightforward standpoint. What can we see about the correlations between all these factors? Informally speaking, a correlation close to 1 means that the two factors are strongly positively related, a correlation close to -1 means that the two factors are strongly negatively related, and a correlation of 0 means that there is no observable linear relationship between the factors [4].
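
To make the three cases concrete, here is a small sketch of the sample Pearson correlation coefficient written in plain Python. The function names and all data are invented for illustration. The last example shows that a correlation of 0 rules out only a linear relationship: there \(y = x^2\) depends entirely on \(x\), yet the coefficient is 0.

```python
import math


def pearson(x, y):
    """Sample Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def correlation_matrix(columns):
    """Pairwise correlations for a dict mapping factor name -> list of values."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Invented values for four hypothetical schools.
data = {
    "low_income": [0.9, 0.7, 0.5, 0.3],
    "postsecondary": [0.4, 0.5, 0.7, 0.8],
}
matrix = correlation_matrix(data)
print(matrix["low_income"]["postsecondary"])  # negative for this toy data

# A perfect nonlinear relationship can still have zero correlation.
xs = [-2, -1, 0, 1, 2]
zero_corr = pearson(xs, [v ** 2 for v in xs])
```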

Figure 7. Proportion of students attending college over time

Figure 8. Correlation Matrix of Factors

The important part of this matrix is the set of correlations between our demographic factors and our academic factors. This data is presented below in figure 9. The first thing to notice is the disadvantage that schools with a large proportion of low-income students face when it comes to academic factors. Across the board, the proportion of low-income students is the demographic factor most strongly correlated with each academic factor. We can also see the disadvantage faced by schools that are predominantly Black or African American.

Figure 9. Correlations of academic factors against demographic factors

Referring back to figure 9, we notice something striking. The correlation between the proportion of low-income students at a school and the proportion of graduates who enroll in a postsecondary institution is about -0.644, while the correlation between the high school graduation rate and the proportion of graduates who enroll in a postsecondary institution is about 0.771. To see why this is so problematic, consider someone trying to guess the academic profile of a school who is limited to asking questions about demographics. What these numbers tell us is that asking for the proportion of low-income students is almost as good as asking for the high school graduation rate when trying to predict postsecondary enrollment! In short, this person would most likely say, “tell us how wealthy the student body is and we will give a good ballpark estimate for its academic factors”.
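
One way to put the two correlations on a common scale is to square them. Under the assumption of a one-variable linear fit, which is not an analysis carried out in this paper, the squared correlation is the share of the variation in postsecondary enrollment that the predictor accounts for: \[(-0.644)^2 \approx 0.415, \qquad (0.771)^2 \approx 0.594.\] Read this way, the proportion of low-income students alone accounts for roughly 41% of the variation across schools, compared with roughly 59% for the graduation rate, which is itself an academic measure.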

Further Analysis and Conclusions

An underlying goal of this project is to open up new research projects for others in this area. The proposed choice of parameters is not the only choice. The GitHub repository where the analysis was done contains scripts to recreate the results [5]. For example, what if the number of clusters chosen was 4 instead of 3? What if we chose to look at attendance rate over time instead? All of these potential avenues for further analysis can be started from that code.

To summarize our journey in uncovering the disparities present in the Chicago public high school system, we will start from the beginning. We began by analyzing the disparity for a single school year, 2019-2020. We were able to immediately see differences in academic factors even though the clustering algorithm knew nothing about them. After plotting these clusters over the city of Chicago, we were able to understand the connection between the segregation of Chicago and the school system. Next we observed how these clusters changed over time by making use of a graphical model and matching the clusters from year to year. Again, we were able to see not only clear differences year to year but also that these disparities have grown over time. Lastly, we examined the correlation matrix between all these factors and identified key values that illustrate this disparity.

In conclusion, there is much work to be done when it comes to Chicago public high schools. It is unclear whether the school system itself can make up for these inequalities, but proving that was not the goal of this paper. What we have shown is that the disparity exists and that it is strong. Low-income students and students of color are at a disadvantage when it comes to the academic factors selected.

License

The author of this technical report, which was written as a deliverable for a SoReMo project, retains the copyright of the written material herein upon publication of this document in SoReMo Reports.

References

[1] Illinois State Board of Education. Report card library.

[2] Hastie, T., Tibshirani, R. and Friedman, J. H. (2009). The elements of statistical learning: Data mining, inference, and prediction. Springer.

[3] Diestel, R. (2017). Graph theory. Springer Publishing Company, Incorporated.

[4] Casella, G. and Berger, R. L. (2002). Statistical inference. Cengage Learning.

[5] Kralis, M. (2021). Supplementary material and code for SoReMo report. GitHub repository https://github.com/mkralis123/SoReMo.