Week 6
Cleaning and Data Analysis in R
Soci—269
Make sure you’re … well, here on this slide.
In 1-2 weeks, we’ll all be able to produce plots like this.
But let’s take things one step at a time.
You can use your console as a calculator.
Feel free to play around with the numbers above.
What’s 18% of 905? Use the console on this slide to arrive at an answer.
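A minimal sketch of the console-as-calculator idea, using the numbers from the question above. The object name answer is only there for illustration.

```r
# A percentage is a proportion out of 100, so 18% becomes 18 / 100
answer <- (18 / 100) * 905

# Typing an object's name prints its value in the console
answer
```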
Note: Keep clicking or press the space bar on your keyboard to advance through the slide deck.
Use the combine function—c()—to stitch together multiple elements of the same type, yielding an atomic vector.
Here’s a character vector.
And here’s a numeric vector.
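Here is a small sketch of what c() does. The values are made up for illustration.

```r
# A character vector: every element is a string
c("apple", "banana", "cherry")

# A numeric vector
c(21, 35, 48)

# Atomic vectors hold a single type, so mixing types coerces
# every element to the most flexible one (here, character)
class(c(1, "two", 3))
```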
We can create new objects with the assignment operator <-.
Here’s how we can store the vector c(1, 2, 3) as an object named x.
Can you store the character vector embedded below as a new object named five_colleges?
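The assignment pattern described above can be sketched as follows. The second object, primary_colors, is a hypothetical example and not part of the exercise.

```r
# Store the numeric vector c(1, 2, 3) as an object named x
x <- c(1, 2, 3)

# Print the object by typing its name
x

# The same pattern works for character vectors
primary_colors <- c("red", "yellow", "blue")
primary_colors
```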
Try to run the line below.
As Wickham, Çetinkaya-Rundel and Grolemund (2023) note, “object names must start with a letter and can only contain letters, numbers, _, and ..”
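A short sketch of the naming rule quoted above. The object names are hypothetical, and the invalid line is commented out so the chunk still runs.

```r
# Valid: starts with a letter; contains only letters, numbers, _ and .
my_vector  <- c(1, 2, 3)
my.vector2 <- c(4, 5, 6)

# Invalid: a name cannot start with a number, so R cannot parse this line
# 2nd_vector <- c(7, 8, 9)

# make.names() shows how base R would repair an invalid name
make.names("2nd_vector")
```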
R is particularly useful for working with data frames.
Change the name of this data frame to IRIS.
We can use the $ operator to isolate elements within list or data.frame objects. This comes in handy when we use basic functions in R.
Now, calculate the mean of the Petal.Length variable in iris.
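A minimal sketch of the $ operator, using a different iris column (Sepal.Length) so the exercise above stays open. iris ships with base R, so no packages are needed.

```r
# Isolate one column of the data frame; the result is a numeric vector
head(iris$Sepal.Length)

# Pass the isolated column to a basic function
mean(iris$Sepal.Length)
```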
We can annotate our code using the # sign.
Now, find the median value of iris$Petal.Width while weaving in some crude annotations.
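Here is one way annotations might look in practice, sketched with iris$Sepal.Width rather than the column from the exercise. The object name sepal_widths is hypothetical.

```r
# Everything after a # sign is ignored when the code runs
sepal_widths <- iris$Sepal.Width  # pull out a single column

median(sepal_widths)              # middle value of that column
```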
Store penguins as a new object named penguins_week6.
Calculate the mean of bill_length_mm in the dataset.
How many male and female penguins are in penguins_week6? Use the table function to find out—and add some annotations, too.
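A hedged sketch of the same workflow on different columns, assuming the palmerpenguins package is installed. The object name penguins_copy is hypothetical. Note that columns with missing values need na.rm = TRUE, otherwise mean() returns NA.

```r
library(palmerpenguins)  # assumption: the package is installed

# Store the data frame under a new (hypothetical) name
penguins_copy <- penguins

# Mean body mass; na.rm = TRUE drops missing values first
mean(penguins_copy$body_mass_g, na.rm = TRUE)

# Count observations per species, showing NAs if there are any
table(penguins_copy$species, useNA = "ifany")
```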
dplyr
The pipe operator (|> in base R) will help us chain together a series of dplyr verbs as we manipulate our data frames.
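A minimal sketch of what the pipe does, using base functions only. It assumes R 4.1 or later, where |> is available.

```r
# Without the pipe: nested calls read from the inside out
round(mean(iris$Sepal.Length), 1)

# With the pipe: the left-hand side becomes the first argument
# of the function on the right, so the steps read left to right
iris$Sepal.Length |> mean() |> round(1)
```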
dplyr::filter()
The filter() function allows us to zoom in on our observations of interest based on column values.
Logical vectors are often used to streamline the “filtration” process.
We can use string patterns in character vectors to make life easier, too.
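The three filtering ideas above can be sketched with the built-in iris data. This assumes dplyr is installed; grepl() is a base R stand-in for whatever string helper the original slide used.

```r
library(dplyr)

# Keep rows that satisfy a condition on a column
iris |> filter(Sepal.Length > 7)

# Combine logical conditions with & (and) or | (or)
iris |> filter(Species == "setosa" & Sepal.Width >= 4)

# Match a string pattern: grepl() returns a logical vector,
# here TRUE for species whose name starts with "v"
iris |> filter(grepl("^v", Species))
```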
dplyr::arrange()
The arrange() function allows us to modify, or have fine control over, the order of our observations based on column values.
We can arrange our rows using multiple variables.
We can organize rows in descending order, too.
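A short sketch of the three arrange() patterns above, using iris and assuming dplyr is installed.

```r
library(dplyr)

# Sort rows by one column (ascending by default)
iris |> arrange(Sepal.Length) |> head()

# Sort by several columns: ties in the first are broken by the second
iris |> arrange(Species, Sepal.Length) |> head()

# Wrap a column in desc() to sort in descending order
iris |> arrange(desc(Sepal.Length)) |> head()
```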
dplyr::distinct()
Calling the distinct() function returns all the unique rows or values in a dataset vis-à-vis specific variables or their combinations.
Here are the distinct values of the year variable in palmerpenguins::penguins:
We can isolate unique species and island pairs, too.
distinct() is helpful for removing “redundant” information or extraneous rows.
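To make the behavior of distinct() concrete, here is a sketch on a tiny made-up data frame. The visits object and its columns are hypothetical, and dplyr is assumed to be installed.

```r
library(dplyr)

# A toy data frame with repeated rows (values are made up)
visits <- tibble(
  site = c("A", "A", "B", "B", "B"),
  year = c(2020, 2020, 2020, 2021, 2021)
)

# Unique values of a single variable
visits |> distinct(year)

# Unique combinations of two variables
visits |> distinct(site, year)

# With no arguments, fully duplicated rows are dropped
visits |> distinct()
```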
dplyr
dplyr::mutate()
The mutate() function lets us create new variables or modify existing columns. Generally, this is achieved by manipulating the variables in our data.
Create a new variable for population in 1000s.
Logical vectors are useful for generating new variables.
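A hedged sketch of mutate() on a small made-up data frame. The countries object, its values, and the new column names are all hypothetical; dplyr is assumed to be installed.

```r
library(dplyr)

# Toy data: placeholder countries and made-up populations
countries <- tibble(
  country = c("A", "B", "C"),
  pop     = c(1500000, 32000000, 870000)
)

countries_new <- countries |>
  mutate(
    pop_1000s = pop / 1000,                          # rescale a column
    large     = pop > 1e6,                           # logical vector
    size      = if_else(large, "large", "small")     # label built from it
  )

countries_new
```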
dplyr::select()
The select() function lets us home in on our variables of interest.
We can provide (and modify) column names or positions.
We can use helper functions when variables are legion.
How can we remove the first three columns?
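The select() patterns above can be sketched with iris, assuming dplyr is installed. The example drops a single column by name and leaves the three-column question open.

```r
library(dplyr)

# Select by name, renaming a column along the way (new = old)
iris |> select(species = Species, Sepal.Length) |> head()

# Select by position
iris |> select(1, 5) |> head()

# Helper functions pick columns by pattern
iris |> select(starts_with("Petal")) |> head()

# A minus sign drops a column instead of keeping it
iris |> select(-Species) |> head()
```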
dplyr::rename()
The rename() function allows us to rename variables without adjusting the dimensions of our data frame.
We can rename columns by specifying variable names as follows: new = old.
We can also use column positions to point to the variables we’d like to rename:
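Both renaming styles can be sketched with iris, assuming dplyr is installed. The new name species is just an illustration.

```r
library(dplyr)

# Rename by name: new = old
iris |> rename(species = Species) |> names()

# Rename by position: Species is the fifth column of iris
iris |> rename(species = 5) |> names()
```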
dplyr::relocate()
The relocate() function allows us to move our columns around.
Here’s how we can move a few columns to the “front.”
We can specify where we want columns relocated.
Helper functions can, well, be quite helpful.
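A brief sketch of the relocate() patterns above, using iris and assuming dplyr is installed.

```r
library(dplyr)

# By default, the named columns move to the front
iris |> relocate(Species) |> names()

# .before and .after control where the columns land
iris |> relocate(Species, .after = Sepal.Length) |> names()

# Helper functions select several columns at once
iris |> relocate(starts_with("Petal")) |> names()
```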
For the rest of today’s session, we’ll work with data from the Varieties of Democracy Project (Coppedge et al. 2025). You can access the codebook for V-Dem 15 below:
Use the dplyr verbs we discussed today to heavily modify the dataset, say, by isolating specific columns, years, or countries of interest before generating new variables.
Our Syllabus Will Change
I’ll be updating the syllabus over the mid-semester break. Specifically, I will—in all likelihood—extend the deadline for your first coding assignment.
More information about the first coding assignment will be provided soon.
Mid-Semester Break
The mid-semester break is upon us. Ergo, we do not have class on Monday.
Please, do not show up on Monday.
learnr
With the arrival of ChatGPT (and its analogues), why learn how to code?
dplyr
dplyr::group_by()
The group_by() function allows us to group our data before performing additional operations or transformations.
What’s weird about this code?
Crucially, we can apply group_by() to more than one column.
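A minimal sketch of grouping, using the built-in iris and mtcars data and assuming dplyr is installed. The column name centered is hypothetical.

```r
library(dplyr)

# Grouping does not change the rows; it only records the groups
iris |> group_by(Species) |> n_groups()

# Later verbs then operate within each group:
# here, each species is centered on its own mean
iris |>
  group_by(Species) |>
  mutate(centered = Sepal.Length - mean(Sepal.Length)) |>
  ungroup() |>
  head()

# Grouping by more than one column creates one group per combination
mtcars |> group_by(cyl, am) |> n_groups()
```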
dplyr::summarise()
We can call the summarize() function (or summarise()) to easily compute summary statistics at the group level.
What do we need to change before the code can run smoothly?
We can generate summary statistics by group combinations (or group x group interactions).
As of dplyr 1.1.0, we can use the .by argument for “per-operation” grouping.
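The summarising patterns above can be sketched as follows, assuming dplyr 1.1.0 or later is installed (the .by argument is not available in earlier versions). The object names are hypothetical.

```r
library(dplyr)

# One row of summary statistics per group
by_species <- iris |>
  group_by(Species) |>
  summarise(mean_sepal = mean(Sepal.Length), n = n())

# Group combinations: one row per cyl-by-am pair
by_cyl_am <- mtcars |>
  group_by(cyl, am) |>
  summarise(mean_mpg = mean(mpg), .groups = "drop")

# Per-operation grouping: .by groups for this call only,
# so the result comes back ungrouped
by_species_alt <- iris |>
  summarise(mean_sepal = mean(Sepal.Length), .by = Species)
```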
dplyr::slice()
The slice() function allows us to extract observations of interest.
Here, we isolate the first (or last) two rows for each continent in gapminder.
slice_min() and slice_max() allow us to isolate the minimum or maximum values of column x.
The slice_sample() function can be used to randomly draw from our (grouped or ungrouped) data.
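A sketch of the slice() family on iris instead of gapminder, assuming dplyr is installed. The seed value and object names are arbitrary choices made for this example.

```r
library(dplyr)

# First (or last) two rows within each group
first_two <- iris |> group_by(Species) |> slice_head(n = 2)
last_two  <- iris |> group_by(Species) |> slice_tail(n = 2)

# Rows holding the smallest or largest value of a column
shortest <- iris |> slice_min(Sepal.Length, n = 1)
longest  <- iris |> slice_max(Sepal.Length, n = 1)

# A random draw of rows; set.seed() makes the draw reproducible
set.seed(269)
random_five <- iris |> slice_sample(n = 5)
```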
Note
The last two verbs—i.e., summarize() and slice()—can be applied to ungrouped data as well. What’s more, the other verbs we encountered on Monday—e.g., filter() or mutate()—can also be used to manipulate grouped data.
Manipulate your V-Dem data frame so that it looks like this:
# A tibble: 75 × 8
   country  year electoral electoral_global_avg liberal participatory
   <chr>   <dbl>     <dbl>                <dbl>   <dbl>         <dbl>
 1 Canada   2000     0.843                0.491   0.766         0.594
 2 Canada   2001     0.838                0.494   0.76          0.587
 3 Canada   2002     0.831                0.505   0.753         0.578
 4 Canada   2003     0.831                0.513   0.753         0.574
 5 Canada   2004     0.83                 0.513   0.758         0.571
 6 Canada   2005     0.829                0.518   0.756         0.571
 7 Canada   2006     0.834                0.521   0.765         0.571
 8 Canada   2007     0.834                0.517   0.766         0.572
 9 Canada   2008     0.832                0.521   0.762         0.57
10 Canada   2009     0.833                0.523   0.764         0.57
# ℹ 65 more rows
# ℹ 2 more variables: deliberative <dbl>, egalitarian <dbl>
You can access the can_mex_usa data frame via GitHub or by copying—and executing—the following line:
Note: Need a hint? Skip to the next slide.
Are you done? Congrats! Here’s some more data to explore (cf. Healy 2023)—
Explore the variables here:
