Musings on Data Science in High School

Data Science Education in High School - Part I

The Past

My journey as an educator of probability and statistics has not been linear. After studying physics at Macalester College and then matriculating to the Ontario Institute for Studies in Education at the University of Toronto for my B.Ed., I was poised to lead students through a calculus-based, physics-laden experience in the math classroom. I was excited to teach students the beauty behind mechanics, quantum theory, and classical physics, and all the ways they could apply these skills to their future studies and investigations. But one rarely chooses what to teach in a first job, and I was assigned ninth grade algebra and basic-level grade 12 math. In my first year or two, I was learning how to teach students, and more importantly how to become their mentor and their support person in the classroom, but I was also concerned that it would be years before I was given the opportunity to teach older students all my cool physics and math skills. Fortunately (or unfortunately), a colleague teaching the physics classes became ill and I needed to swoop in to take over some of them. Here was my chance to have an impact on the young engineers and doctors of the future. My experience teaching physics was fascinating, and I loved it – and I still miss it – but what also came with the second semester was a twelfth-grade math class called Mathematics of Data Management.

Although I encountered statistics in my physics degree, it was generally in the context of scientific measurements and uncertainty. It wasn't until the spring term of my final year that I took Introduction to Statistical Modelling with Daniel Kaplan, one of my favourite professors (there were several favourites) at Macalester. To create models and understand the way data can be analyzed, modelled and applied, we used R, a program for statistical computing. Back in 2007, its interface reminded me of MS-DOS, and it looked overwhelming. I didn't love working with it, but I did some really neat statistical modelling with a few easy keystrokes. If nothing else, that course was my first time really working with data that weren't physics measurements – data about people, or about things it might be interesting to make predictions about. I suppose I enjoyed the course, but I never thought much about it until I had to teach the Mathematics of Data Management class.

Once I started teaching Data Management, with plenty of help and resources from my teaching colleagues, I began to understand how far behind students would be once they tried to perform statistics with a computer. We were still exploring our statistical measures by hand. We had TI-84 calculators, and only for the student "project" did we get to the computer lab to analyze data in Excel or Fathom. I was disheartened, but I also realized that access to software and computing was limited, so I didn't push the envelope.

The Present

Fast forward several years. I'm at a new school, St. Andrew's College, and every student has a laptop. In the last nine years at SAC, the Mathematics of Data Management and AP Statistics courses have been well-subscribed, and we're now using software like Excel to store larger datasets for investigations and do some basic wrangling and tidying to explore our data. We've come a long way, but we need to move beyond the typical curriculum expectations and ensure that students are equipped to use computing to solve problems and ask questions. The world of data science is at our fingertips, and we as teachers in that community, I believe, are called to give our students an opportunity to explore data the way industry has begun to. We need our students to be familiar with different coding languages and to develop a workflow that makes exploring data accessible and interesting to them. It's our duty.

Now, the reason for this is not grounded in the trite saying, "you're going to need to know how to code or use software if you want to advance", or, even worse, in a "to get a good job" sort of way. It's grounded in the belief that being able to investigate, explore, and discover our data-rich world is essential to our learning. Instead of being motivated by building capital, we should recognize that asking and answering questions about the world around us is innately human. Society has reached a point where so much of our past, and our present, is stored as data. That we can discover inequities in cities, explore racial bias, observe the trajectory of a world-wide infection, or even understand the strategy of sports teams by writing a few lines of code is incredible. So often, I hear colleagues discuss the ways in which students become too focused on a single field of study when they choose courses linked to STEM fields, but I firmly believe that my journey in the STEM subjects, and more recently, my journey learning more about #rstats and the tidyverse set of packages in R, has enriched my perspective on the amazing work the data science community is doing to uncover social-scientific findings. If anything, working with data, and seeing your thoughts and analysis come alive, strengthens your arguments. Too often, an appeal to emotion is unsubstantiated and is not borne out in the data – this opens the door to criticism and disagreement rather than support. Shouldn't all our students, in any discipline, be trained to critically analyze any argument they encounter, and shouldn't they be able to support their own with evidence? This should be an essential aspect of a student's formative training.

The manifesto that Nate Silver wrote once his website fivethirtyeight.com became a more standalone news organization offers some clarity here. He writes, "to be clear, our approach at FiveThirtyEight will be quantitative — there will be plenty of numbers at this site. But using numbers is neither necessary nor sufficient to produce good works of journalism". The notion of enhancing one's understanding of the world is foundational to the way schools should work.

So, for our students, and for the world, we should be willing to encourage them to use the tidyverse packages in R to perform the basic calculations and data exploration that we've done by hand, or by calculator, over the last number of years. Even in 2007, Ontario's Ministry of Education had its finger on the pulse of this important movement: embrace the technology. The authors write, "Operations that were an essential part of a procedure-focused curriculum for decades can be accomplished quickly and effectively using technology, so that students can now solve problems that were previously too time-consuming to attempt, and can focus on underlying concepts" (MOE, p. 4). At that time, that technology may have been a TI-84 and a dataset something of size n = 15, but now we're at a point where our students can ask the same questions about datasets with thousands, or more, observations.
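To make that concrete, here's a small sketch of what I mean (using the mpg dataset that ships with ggplot2, purely as an illustration): the one-variable summaries we once computed by hand or punched into a TI-84 reduce to a few lines of tidyverse code.

library(tidyverse) ## loads dplyr, ggplot2, and friends

mpg %>% ## built-in dataset of fuel economy for 234 cars
  group_by(class) %>% ## summarize each class of vehicle separately
  summarise(n = n(), ## how many cars of each class
            mean_cty = mean(cty), ## mean city fuel efficiency
            median_cty = median(cty),
            sd_cty = sd(cty),
            iqr_cty = IQR(cty))

One line per statistic, and the same recipe works whether the dataset has 15 observations or 15,000.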

Of course there will be objections, and students and teachers may have concerns about the implementation of these learning objectives. There will be growing pains. I'll share some of the feedback I received when I tried to incorporate a data-science-in-R unit in my class. Directly from student feedback forms, I chose the following comments because they capture the more prominent arguments against using a software package like RStudio in this course.

  • “I find it interesting but I am quite confused on the reasons why we are learning and how it will benefit me in the future. I personally feel like I won’t be using something like this down the road because I am not interested in tech.”

  • “I would like to get better at using it but at the moment its difficult for me to understand. It is very interesting and will help me in the future”

  • “I don’t mind using R but it feels very unnecessary to put into our data management course as its not as if we could use R in the exam.”

There are many different responses to these concerns. Hopefully, in the paragraphs above, I have addressed the first two comments; they are opposite sides of the same coin. The third bit of feedback, about the exam, is an interesting idea, and one which I've explored. I think the demonstration of learning on the final exam does not need to focus on a student's computing skills. Rather, I am confident that with more experience exploring data and running simulations to examine probabilities using software, a student will have learned a year's worth of concepts with greater stickiness than they would have with only limited time to ask their own questions and apply the skills they've learned to try to answer them.
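As an example of the kind of simulation I have in mind (a hypothetical exercise, not one from any particular lesson), consider the classic birthday problem: rather than memorizing the counting argument, a student can estimate the probability that at least two people in a class of 25 share a birthday by simulating thousands of classes.

set.seed(2021) ## make the simulation reproducible

trials <- 10000 ## number of simulated classes
shared <- replicate(trials, {
  birthdays <- sample(1:365, size = 25, replace = TRUE) ## 25 random birthdays
  any(duplicated(birthdays)) ## TRUE if at least one birthday repeats
})
mean(shared) ## proportion of classes with a shared birthday, roughly 0.57

A student who runs this, changes the class size, and watches the estimate move has learned something about probability that a single formula on an exam rarely delivers.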

Fortunately for me, several other amazing thinkers have already considered these problems and posed thoughtful and lucid counters to them. One of the more inspiring educators of data science and statistics is Mine Cetinkaya-Rundel, who argues on one of her websites:

“One solution for these concerns is to avoid hands-on data analysis completely. If we do not ask our students to start with raw data and instead always provide them with small, tidy rectangles of data then there is never really a need for statistical software beyond spreadsheet or graphing calculator. This is not what we want in a modern statistics course and is a disservice to students.” Data Science in a Box

As for teachers, there are several objections, but they mostly fall into two buckets. First, there are those who may say, "this is hard to learn and I don't want to learn it because it's unnecessary to teach the course", and others may say, "Excel or a TI calculator is easier for students to learn, so we can spend more time on the concepts". I'll tackle the first point myself. This is not a shaming situation, but I believe that if we are truly willing to help students get the most out of our classes, then doing some type of data science and investigative work with large data is essential. Whether or not one agrees with that premise is, I think, the real basis of the disagreement. Yes, it's a lot of work to learn a new coding language, but it's rewarding and fun and will help you develop a more nuanced understanding. We are, of course, life-long learners – and, within the R community, there are several resources and professionals to turn to.

As for the second point, once again I'll turn to Dr. Mine Cetinkaya-Rundel, who addressed it with the following suggestion.

“this ignores the fact that these software tools also have nontrivial learning curves. In fact, teaching specific data analysis tasks using such software often requires lengthy step-by-step instructions, with annotated screenshots, for navigating menus and other interface elements. Also, it is not uncommon that instructions for one task do not easily extend to another. Replacing such instructions with just a few lines of R code actually makes the instructional materials more concise and less intimidating.” Data Science in a Box

In an Excel plot of a histogram, you'd better remember the steps and settings every time and know which column to select.

Although there is an intimidation factor associated with a coding language and an IDE, once a student understands how much they can accomplish in such a space, they are much more inclined to find confidence. Let's also acknowledge that I am essentially holding every student's hand as they try to create a histogram or boxplot in Excel or Fathom. I'd be happy to do the same thing with R, and instead of sitting with each of them one on one to show them the point-and-click methodology, I can send them lines of code, with file names, for them to replicate and then make their own. Ask yourself which one seems more intuitive, and frankly, more versatile and transferable when thinking about a different dataset. In R, one can simply change the object name and accomplish the same analysis; in Excel or Fathom, one must repeat the entire "point and click" procedure over and over again. For instance, the plot below, with comments, shows the intuitive way the tidyverse uses commands.

library(tidyverse) ## load ggplot2 (for the mpg data and plotting) and friends

mpg %>% ## This is the data we're using
  ggplot(aes(cty)) + ## We want to look at the city mpg of cars
  geom_histogram(aes(fill = class), bins = 15) + ## We want to use a histogram
  labs(x = "City (mpg)", ## label various aspects of the plot
       y = "Count",
       fill = "Class of Auto",
       title = "Fuel Efficiency looks related to its class",
       subtitle = "Subcompact cars lead the way")
This plot was generated with the code above.
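And to show what I mean by transferability, the same recipe carries over to a completely different dataset by swapping a few names – here the diamonds data that also ships with ggplot2, purely as an illustration.

diamonds %>% ## A different built-in dataset
  ggplot(aes(price)) + ## Now we're looking at the price of each diamond
  geom_histogram(aes(fill = cut), bins = 15) + ## Same histogram recipe as before
  labs(x = "Price (USD)",
       y = "Count",
       fill = "Cut of Diamond",
       title = "Diamond prices by cut")

Change the data, the variable, and the labels, and everything else stays the same – no menus to re-navigate.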

The Future

So, where do we go from here? I'm not sure, but I think this is the right path: the path that offers our students an opportunity to analyze meaningful data and create plots that communicate their thoughts; the path that allows students to support their own arguments with actual evidence; and the path that allows students to consider arguments they may read or hear in the media, or from others, with a critical lens. By building a toolkit to explore data and statistics, students will develop the tools to investigate and answer their own questions, and the know-how to explore and understand how others think about similar problems. In the meantime, while we develop ways of teaching statistics and Data Management courses in high school, governments should be heeding the advice of the mathematicians who are calling for changes to our curricular objectives. Indeed, "the demands for statistical literacy have never been greater", so let's use the tools at our disposal to equip our students with this power.

Chris Papalia
Head of Memorial House and Teacher of Mathematics and Science at St. Andrew's College. University Counsellor and Consultant.