Topic: Data Abstractions with the CORGIS Visualizer
Author: Austin Cory Bart firstname.lastname@example.org
Duration: 40-50 minutes
In this lesson, students interact with an online platform for creating visualizations of a wide variety of datasets (including health, energy, books, law, and more). In doing so, they must confront the limitations of the abstraction inherent in the models presented.
- Students will select a dataset from the CORGIS Visualizer, suggest possible concrete abstractions, and identify the kinds of questions that could be answered with those abstractions.
- Students will analyze the actual data abstraction that they choose, develop a question related to the dataset, and then create a visualization that answers the question.
- Students will identify limitations of the abstraction with regard to the questions it can answer.
Visualizations are a powerful way to summarize data into a medium accessible to a wider audience. Computational tools like Excel or Python’s Matplotlib can be used to quickly produce these visualizations. The result can then be analyzed and interpreted to draw conclusions. Increasingly, data is collected and used to answer important real-world questions by generating these visualizations. However, the collection of data is a design process with trade-offs and decisions. While a computational tool can make it easy to interact with a rich data abstraction, there are still limitations inherent in the process of formalizing an abstraction. In particular, the details that are removed by the process make it impossible to answer other kinds of questions.
By the end of this lesson, learners will be able to…
- Given a real world entity or idea, identify possible data abstractions and the questions answerable with that abstraction.
- Given a tool that provides access to a large dataset, describe the abstraction represented by the data.
- Given a tool that provides access to a large dataset, explain the kinds of questions such a tool could and could not answer.
You may also cover:
- Given a graph, students will evaluate the graph for its accuracy, validity, and reasonableness.
Relevant Computer Science Principles:
2.2 - Multiple levels of abstraction are used to write programs or create other computational artifacts.
2.3 - Models and simulations use abstraction to generate new understanding and knowledge.
3.1 - People use computer programs to process information to gain insight and knowledge.
3.2 - Computing facilitates exploration and the discovery of connections in information.
7.2 - Computing enables innovation in nearly every field.
7.5 - An investigative process is aided by effective organization and selection of resources. Appropriate technologies and tools facilitate the accessing of information and enable the ability to evaluate the credibility of sources.
To assess/verify the learning objectives:
- Learners will show the instructors their completed graphs.
- Learners will answer instructor questions in the Post-discussion.
Prior to this lesson, learners will already be able to:
- Interpret histograms, line plots, scatter plots, and bar charts
- Define the concept of a data abstraction (e.g., a dataset)
Learners and Contexts
This lesson was designed for:
Age: Grades 9-12
Size: Any reasonably sized class should work; 10-50 learners probably works best.
Instructors: No additional instructors are necessary. More instructors can help guide discussion.
Formality: Suitable for formal and informal settings.
- This activity is plugged-in.
- The website works on both computers and smart phones/tablets.
- A keyboard is not required.
- A working internet connection is required to load the pages.
External Tool: https://think.cs.vt.edu/corgis/visualizer/
Most of this lesson will be:
- Learners listening to me present
- Learners discussing in small groups
- Learners using the CORGIS Visualizer in small groups
Strategy: Present Content
Duration: 5 minutes
- Hook: First slide is a Corgi surfing on a “tidal wave of data”. Here’s a chance to connect real world uses of data - either locally or in the world at large.
- Self-driving cars use data from their environment to figure out what to do next
- Facebook and other social media use user data to target ads
- Scientists collect weather data to make forecasts
- Teachers collect grades to figure out weaknesses in their teaching
- Hook: Show a slide with all the CORGIS datasets’ icons. Then, tell students that we’re going to be looking at a big collection of data today, and ask them to start thinking about what interests them personally. Ideally, students should be attracted to the colorful pictures involving cute corgis and find things that they think sound interesting.
- Inform students that we will be analyzing data abstractions and making visualizations to answer real questions.
- Remind students about the definition of a data abstraction: a representation of reality with details removed.
- Interactive: Ask if anyone has any questions about how the concept of a data abstraction leads to datasets.
- Describe the growing importance of data, and how there are many kinds of data that have been collected out there.
- Highlight the connection between the act of gathering data and the limitations that imposes on the data itself.
- Describe the CORGIS collection, and give examples of some of its datasets.
- Publicly available Google Slides, presented from the front of the room.
2. Show CORGIS Datasets
Strategy: Provide Guidance
Duration: 2 minutes
- Organize students into groups of 2-3. Consider having only one laptop per group.
- Show the URL for the CORGIS website (think.cs.vt.edu/corgis/visualizer)
- Tell students to bring up the webpage, but to NOT open any links on the page yet.
- Tell students to look at the variety of datasets.
- Tell students to agree on a dataset that interests the group.
- Students should NOT advance into any of the datasets’ pages yet; they should just be browsing the collection.
- Students must agree on a single dataset
3. Dataset Discussion
Strategy: Support Practice
Duration: 10 minutes
- Tell students to discuss their chosen dataset as a group.
- Tell them to list specific properties and data they expect to find in the dataset.
- Tell them to describe how they expect the dataset to be structured and accessed.
- Walk around and encourage discussion about the datasets.
- If students are struggling to think of properties/data:
- Ask them to describe the thing being abstracted.
- Ask them what characteristics are relevant to answering questions related to the thing.
- Ask them what they would want to know about the thing.
- For example, for the Cars dataset, they might list things like “make”, “model”, “year created” of a car. Alternatively, they might try to articulate that the dataset is about “individual people’s cars” or “general makes of cars” or “the history of one person’s car”. You can encourage exotic thinking like, “the last year that the car had its windows replaced”.
- If students have no guesses about how the data might be structured:
- Ask them how they usually see data in a computer
- Potential answers might be: spreadsheets, tables, graphs, paragraphs of explanation
- Goal is to have them
- Remind students not to access their dataset
- Have students develop expectations of what a dataset should look like, to eventually contrast with reality.
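The expectations students list in this discussion can be pictured as a record type. The following is a minimal Python sketch, assuming a hypothetical `CarRecord` with the three properties from the Cars example above; the field names, the sample records, and the `can_answer` helper are all invented for illustration and do not describe the actual structure of any CORGIS dataset.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class CarRecord:
    """Hypothetical abstraction of a car: only three details are kept."""
    make: str
    model: str
    year: int


# Made-up records, for illustration only.
cars = [
    CarRecord("Toyota", "Corolla", 2012),
    CarRecord("Honda", "Civic", 2015),
    CarRecord("Toyota", "Camry", 2015),
]


def can_answer(required_fields):
    """A question is answerable only if every field it needs was kept."""
    available = {f.name for f in fields(CarRecord)}
    return set(required_fields) <= available


# Answerable question: "How many cars of each make are there?"
makes = {}
for car in cars:
    makes[car.make] = makes.get(car.make, 0) + 1

print(makes)
print(can_answer(["make", "year"]))              # fields the abstraction kept
print(can_answer(["year_windows_replaced"]))     # detail the abstraction removed
```

The "exotic" property about window replacement is exactly the kind of detail this abstraction drops, so no amount of tooling can recover it from these records.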
3.5. Visualization Misconceptions (Potentially)
Strategy: Present Content
Duration: 10 minutes
- You can review the slides that cover common misconceptions about visualizations.
- Interaction: You can ask students to point out issues with the graphs or ask questions here quite a bit.
- Pirates vs. Global Warming: The goal of this graph is to highlight that it is perfectly possible to make nonsense graphs that look fine. You should be able to see a number of issues when you study the graph.
- Correlation vs. Causation
- X-Axis makes no sense
- How would this data have been collected?
- Why are the points annotated with years?
- Histogram vs. Line Plots: Students are drawn to line plots, and shy away from histograms. But histograms are the bread and butter of data visualization - they are suitable for any numeric dataset and are a great starting point for understanding numbers. Line plots are only suitable for TRENDS - data ordered by time, by geographic position, or some other continuous type.
- Histograms vs. Bar Charts: Students confuse these two ideas, but they are distinct. Bar charts are values separated by categories. Technically, histograms are a type of bar chart where the categories are adjacent number ranges. The fact that they are adjacent is why there is no gap between the bars. When you make a bar chart, you MUST have a second column of data (a category) to pair up with the numeric values. For histograms, no second column is needed because the values themselves lead to the categories (by “binning” the numbers into equally sized groups).
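The histogram versus bar chart distinction can be sketched in a few lines of Python. This is an illustrative sketch using made-up numbers, not the Visualizer's implementation: `histogram_counts` needs only the numeric column and derives its categories by binning, while `bar_chart_averages` needs a second column of categories to pair with the values. Both function names and the sample data are assumptions.

```python
def histogram_counts(values, bins):
    """Bin numbers into `bins` equal-width, adjacent ranges and count each."""
    low, high = min(values), max(values)
    counts = [0] * bins
    if high == low:
        # All values are identical, so they share a single bin.
        counts[0] = len(values)
        return counts
    width = (high - low) / bins
    for v in values:
        # The maximum value belongs in the last bin rather than a new one.
        index = min(int((v - low) / width), bins - 1)
        counts[index] += 1
    return counts


def bar_chart_averages(categories, values):
    """Pair each value with its category and average within each category."""
    grouped = {}
    for category, value in zip(categories, values):
        grouped.setdefault(category, []).append(value)
    return {cat: sum(vals) / len(vals) for cat, vals in grouped.items()}


# Made-up data: one numeric column and one category column.
temperatures = [30, 32, 35, 41, 48, 50, 55, 61, 68, 70]
stations = ["A", "B", "A", "B", "A", "B", "A", "B", "A", "B"]

# Histogram: the categories come from the numbers themselves.
print(histogram_counts(temperatures, bins=4))

# Bar chart: the categories come from the second column.
print(bar_chart_averages(stations, temperatures))
```

Because the histogram's ranges are adjacent (30-40, 40-50, and so on), the bars touch; the bar chart's categories "A" and "B" have no numeric relationship, so gaps between bars are appropriate.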
4. Demo Visualizer
Strategy: Provide Guidance
Duration: 5 minutes
- Have students click the link for their dataset to open the Visualizer page.
- On your computer, click the link for the Weather dataset.
- Give a brief summary of the Overview section of the page. Mention that this dataset is from the National Weather Service, and summarizes weekly weather reports specifically from 2016 for the United States.
- Give an interactive tutorial on how to use the Visualizer tool’s interface
- Show the graph produced at the bottom
- Demonstrate how the graph type can be changed with the Chart Type dropdown. Switch to a Histogram.
- Demonstrate how the data being plotted can be changed with the Values dropdown. Switch to “Temperature, Average”.
- Demonstrate how the data can be filtered with the By dropdowns. Set to “Station, Location” and “Wilmington DE” (or other appropriate location).
- Interpret the produced graph. In this case, we are seeing the distribution of weekly average temperatures for Wilmington, Delaware, for the year 2016.
- Suggest a question that this dataset can answer. For example, someone interested in moving to Wilmington might wonder what the weather is like. Alternatively, city planners trying to weatherproof the city might need to know the range of temperatures that can occur.
- Suggest a question that this dataset cannot answer because of its abstraction’s limitations. For example, we don’t know what the daily or hourly temperature looked like, just the weekly average. Further, we cannot answer anything involving humidity because that data is not available. You can particularly note that although there is no data for other years or for other locations, the abstraction itself does not limit this - a nuance for how abstractions limit data collection vs. other kinds of limitations.
- Some datasets are large, and may take a minute to load on slower connections. Test the internet connection to make sure it is stable. Try to avoid students all loading the datasets at the exact same time.
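The weekly-average limitation from the demo can be made concrete with a short Python sketch. The daily temperatures below are invented for illustration and are not taken from the Weather dataset; `weekly_average` is a hypothetical stand-in for whatever summarization produced the real weekly reports.

```python
from statistics import mean

# Made-up daily temperatures (degrees Fahrenheit) for two different weeks.
steady_week = [50, 50, 50, 50, 50, 50, 50]
volatile_week = [20, 35, 50, 65, 80, 65, 35]


def weekly_average(daily_temps):
    """The abstraction: keep one number per week and drop the daily detail."""
    return mean(daily_temps)


# Both weeks collapse to the same weekly value...
print(weekly_average(steady_week))
print(weekly_average(volatile_week))

# ...so a question about the daily range can only be answered
# from the original daily data, which the abstraction removed.
print(max(steady_week) - min(steady_week))
print(max(volatile_week) - min(volatile_week))
```

Two very different weeks become indistinguishable once only the average is stored, which is why questions about daily or hourly temperatures fall outside what the weekly abstraction can answer.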
5. Dataset Exploration
Strategy: Support Practice
Duration: 10 minutes
- Tell each group to develop at least one question that this dataset can answer.
- Tell each group to develop a visualization that answers their question.
- Tell each group to develop a question that this dataset cannot answer because of its abstraction’s limitations.
- Tell each group to discuss the difference between the abstraction they expected and the actual properties of the abstraction.
- Walk around and support students using the Visualizer.
- Common questions that might occur:
- Most datasets have extra information for each field under the Values dropdown, in the Overview section, or in the link provided. If they cannot find the information even after all that, then you can tell them to use their best judgment.
- If a graph is blank, then most likely they are using a filter that does not have any interesting data available. Ask them to remove the filter.
- If students are unsure what a feature does, have them experiment with the feature and see what happens. If they are still unsure, you might refer to the following:
- Histograms have a Bins input box that adjusts how many boxes the histogram will have. Increasing the number can make a finer-grained graph, but less binning occurs so the relationships can be harder to see.
- Bar Charts have a Group By dropdown that is distinct from the Filter. You can click on values in the left column to have them appear on the graph (and in the column on the right). You can remove values from the graph by clicking on the entries in the column on the right. The Search boxes allow you to find specific values, but remember to clear the search box after using it.
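If students ask what changing the number of histogram bins actually does, the trade-off can be illustrated with a small Python sketch. This is an assumption-laden illustration with made-up values and a hypothetical `bin_counts` helper; it does not reproduce how the Visualizer computes its bins.

```python
def bin_counts(values, bins):
    """Count how many values fall into each of `bins` equal-width ranges."""
    low, high = min(values), max(values)
    width = (high - low) / bins  # assumes the values are not all identical
    counts = [0] * bins
    for v in values:
        # Clamp so the maximum value lands in the last bin.
        index = min(int((v - low) / width), bins - 1)
        counts[index] += 1
    return counts


# Made-up values with a cluster near the low end and one outlier.
values = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]

for bins in (2, 4, 8):
    counts = bin_counts(values, bins)
    print(f"{bins} bins: {counts}")
    for count in counts:
        # Draw each bar as a row of '#' characters.
        print("  " + "#" * count)
```

With 2 bins the cluster is visible but coarse; with 8 bins each value gets nearly its own bar, empty bins appear, and the overall shape is harder to read at a glance.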
- Students should produce a visualization that they can correctly interpret.
- Students should be able to articulate the limitations of an abstraction for developing a question.
- You should see students interacting with the Visualizer and refining their graphs.
- Students can ask each other if they agree with their answer.
- Students can ask instructor(s) if they agree with their answer.
- Students should be discussing their question and not working individually.
Duration: 8 minutes
- Announce to everyone to stop working on their computers and to focus on you.
- If there is time available, ask some groups to share the question they could ask and a question they couldn’t ask.
- Make sure that their negative question is limited by the abstraction, not just the data collected.
- Then, give a brief description of the role of abstraction in data collection. Highlight the value of rich abstractions and existing data, but also the limitations.
- Abstraction is a critical part of using computation to answer questions. Once we have a data abstraction, we can use tools like this Visualizer to answer a range of questions.
- However, abstractions inherently do not capture all possible information - that’s the point of an abstraction, to remove unnecessary details. Therefore, the tools are always limited by what they can do with the abstractions available.
- Soon you will be looking closely at the assumptions involved in dataset collection and analysis. Keep in mind the limitations imposed by the data abstraction itself.
- Long term, you may need to develop data abstractions to answer questions. Consider very carefully the kinds of data you will need to capture in order to support the kind of computation you want to do.
Lack of Devices
- If no learners have devices, then we can collaborate as a class on a single dataset that we choose as a group.
- If only some learners do not have devices, then we can have a one device per group policy. We could also expand the group sizes as needed.
- If the internet goes down, then I can load the datasets from my local copy.
- If the internet is slow, then we will have larger groups and only have students access smaller datasets (TODO: list some).
- If I don’t have a device to project my presentation, then I’ll just read my slides out loud. I can write the URL on a whiteboard, chalkboard, or a piece of paper that I hand around the room.
- If I can’t use my own device to project my presentation but have access to someone else’s, then there will be no issues since the presentation is through a public Google Slides.
- If I have too few instructors for the number of learners, then I will rely more on the students to discuss and share ideas.
- If I have too many instructors for the number of learners, then I can have the other instructors engage more deeply with each group.
- If a learner is not interested in their group’s dataset, I will try to find ways to connect the dataset to their lives, or encourage their group to explain the dataset’s importance to themselves. Otherwise, I will have the uninterested learner move to a different group.
- If a learner is struggling with the Visualizer, I will ask them to watch their peer.
- If many learners are struggling with the Visualizer, I will group them together and walk them through another example (e.g., Earthquakes).
- If a learner is finished with making their two questions, I will have them make more questions.
- If they make at least 3 of each kind of question, I can have them look at another dataset.