Part 1: Introducing dplyr and tidyr
Overview
Teaching: 40 min
Exercises: 25 minQuestions
How can I select specific rows and/or columns from a data frame?
How can I combine multiple commands into a single command?
How can create new columns or remove existing columns from a data frame?
How can I reformat a dataframe to meet my needs?
Objectives
Describe the purpose of the
dplyr
andtidyr
packages.Select certain columns in a data frame with the
dplyr
functionselect
.Select certain rows in a data frame according to filtering conditions with the
dplyr
functionfilter
.Link the output of one
dplyr
function to the input of another function with the ‘pipe’ operator%>%
.Add new columns to a data frame that are functions of existing columns with
mutate
.Use the split-apply-combine concept for data analysis.
Use
summarize
,group_by
, andcount
to split a data frame into groups of observations, apply a summary statistics for each group, and then combine the results.Describe the concept of a wide and a long table format and for which purpose those formats are useful.
Describe what key-value pairs are.
Reshape a data frame from long to wide format and back with the
spread
andgather
commands from thetidyr
package.Export a data frame to a csv file.
Data Manipulation using dplyr
and tidyr
dplyr
is a package for making tabular data manipulation easier by using a
limited set of functions that can be combined to extract and summarize insights from your data. It pairs nicely with tidyr
which enables you to
swiftly convert between different data formats (long vs. wide) for plotting and analysis.
Similarly to readr
, dplyr
and tidyr
are also part of the tidyverse. These packages were loaded in R’s memory when we called library(tidyverse)
earlier.
What are dplyr
and tidyr
?
The package dplyr
provides easy tools for the most common data
manipulation tasks. It is built to work directly with data frames, with many
common tasks optimized by being written in a compiled language (C++). An
additional feature is the ability to work directly with data stored in an
external database. The benefits of doing this are that the data can be managed
natively in a relational database, queries can be conducted on that database,
and only the results of the query are returned.
This addresses a common problem with R in that all operations are conducted in-memory and thus the amount of data you can work with is limited by available memory. The database connections essentially remove that limitation in that you can connect to a database of many hundreds of GB, conduct queries on it directly, and pull back into R only what you need for analysis.
The package tidyr
addresses the common problem of wanting to reshape your data for plotting and use by different R functions. Sometimes we want data sets where we have one row per measurement. Sometimes we want a data frame where each measurement type has its own column, and rows are instead more aggregated groups. Moving back and forth between these formats is nontrivial, and tidyr
gives you tools for this and more sophisticated data manipulation.
To learn more about dplyr
and tidyr
after the workshop, you may want to check out this
handy data transformation with dplyr
cheatsheet and this one about tidyr
.
To make sure, everyone will use the same dataset for this lesson, we’ll read in one of the datasets that we downloaded earlier.
## load the tidyverse
library(tidyverse)
test_results_1 <- read_csv("data/placement_1.csv") %>%
mutate(country = as.factor(country)) # changes the country variable to a factor
## inspect the data
test_results_1
## preview the data
# View(interviews)
## summarize the data
# summary(test_results_1)
# skimr::skim(test_results_1)
We’re going to learn some of the most common dplyr
functions:
select()
: subset columnsfilter()
: subset rows on conditionsmutate()
: create new columns by using information from other columnsgroup_by()
andsummarize()
: create summary statistics on grouped dataarrange()
: sort resultscount()
: count discrete values
Creating new variables
Mutate
Frequently you’ll want to create new columns based on the values in existing
columns, for example to scale test scores or to create total scores.
For this we’ll use mutate()
.
We currently have a dataframe with information for each test taker about country
of origin and how they responded (correct or incorrect) to each of the items on
the test. We are missing total score columns. We know from inspecting our dataframe
that our items are in columns 5:74
. The .
in front of the column indices is a
placeholder for the dataframe test_results_1
. It is similar to the command
test_results[,5:74]
.
test_results_1 <- test_results_1 %>%
mutate(raw_total = rowSums(.[5:74], na.rm = TRUE))
# print the vector
test_results_1$raw_total
[1] 38 32 32 49 38 36 37 39 37 34 40 26 34 32 38 61 56 51 45 48 54 27 39
[24] 49 44 47 52 32 51 31 42 60 33 40 57 48 38 25 45 47 43 53 33 29 47 48
[47] 54 63 57 48 36 50 46 54 38 44 43 42 20 20 39 51 44 43 49 21 48 51 31
[70] 42 21 55 57 65 60 34 47 52 29 26 38 47 51 58 27 24 38 14
Often, we want to report scores that have been scaled in some way, such as percents.
Now that we have a raw_total
column, we can create a column called percent_total
.
test_results_1 <- test_results_1 %>%
mutate(percent_total = (raw_total / 70) * 100)
# print the vector
test_results_1$percent_total
[1] 54.28571 45.71429 45.71429 70.00000 54.28571 51.42857 52.85714
[8] 55.71429 52.85714 48.57143 57.14286 37.14286 48.57143 45.71429
[15] 54.28571 87.14286 80.00000 72.85714 64.28571 68.57143 77.14286
[22] 38.57143 55.71429 70.00000 62.85714 67.14286 74.28571 45.71429
[29] 72.85714 44.28571 60.00000 85.71429 47.14286 57.14286 81.42857
[36] 68.57143 54.28571 35.71429 64.28571 67.14286 61.42857 75.71429
[43] 47.14286 41.42857 67.14286 68.57143 77.14286 90.00000 81.42857
[50] 68.57143 51.42857 71.42857 65.71429 77.14286 54.28571 62.85714
[57] 61.42857 60.00000 28.57143 28.57143 55.71429 72.85714 62.85714
[64] 61.42857 70.00000 30.00000 68.57143 72.85714 44.28571 60.00000
[71] 30.00000 78.57143 81.42857 92.85714 85.71429 48.57143 67.14286
[78] 74.28571 41.42857 37.14286 54.28571 67.14286 72.85714 82.85714
[85] 38.57143 34.28571 54.28571 20.00000
Selecting columns and filtering rows
To select columns of a
data frame, use select()
. The first argument to this function is the data
frame (test_results_1
), and the subsequent arguments are the columns to keep.
select(test_results_1, country, raw_total)
To choose rows based on a specific criteria, use filter()
:
filter(test_results_1, country == "china")
# A tibble: 36 x 76
ID names country admin_date q1_list_mi q2_list_det
<dbl> <chr> <fct> <dttm> <dbl> <dbl>
1 635 JUDr… china 1960-12-25 22:02:34 1 1
2 636 pan … china 1985-10-07 21:18:00 0 1
3 637 Nadě… china 1987-12-22 17:04:22 1 0
4 638 Mári… china 1983-10-31 10:23:45 0 1
5 639 MUDr… china 1930-12-28 11:00:55 1 1
6 640 Anež… china 1978-08-16 01:04:50 1 0
7 641 Magd… china 2011-01-12 04:11:12 0 1
8 642 Alžb… china 1977-11-19 01:12:51 0 0
9 643 Lubo… china 2012-07-30 10:15:54 0 0
10 644 Sára… china 1963-05-19 22:58:51 0 1
# … with 26 more rows, and 70 more variables: q3_list_det <dbl>,
# q4_list_det <dbl>, q5_list_det <dbl>, q6_list_det <dbl>,
# q7_list_det <dbl>, q8_list_det <dbl>, q9_list_det <dbl>,
# q10_list_mi <dbl>, q11_list_det <dbl>, q12_list_mi <dbl>,
# q13_list_prag <dbl>, q14_list_det <dbl>, q15_list_det <dbl>,
# q16_list_det <dbl>, q17_list_inf <dbl>, q18_list_inf <dbl>,
# q19_list_det <dbl>, q20_list_mi <dbl>, q21_list_det <dbl>,
# q22_list_prag <dbl>, q23_list_mi <dbl>, q24_list_det <dbl>,
# q25_list_prag <dbl>, q26_list_det_an <dbl>, q27_list_det_an <dbl>,
# q28_list_det_an <dbl>, q29_list_prag_an <dbl>, q30_list_mi_an <dbl>,
# q31_list_mi_an <dbl>, q32_list_det_an <dbl>, q33_list_det <dbl>,
# q34_list_prag_an <dbl>, q35_list_inf_an <dbl>, q36_read_mi <dbl>,
# q37_read_det <dbl>, q38_read_det <dbl>, q39_read_mi <dbl>,
# q40_read_voc <dbl>, q41_read_voc <dbl>, q42_read_mi <dbl>,
# q43_read_det <dbl>, q44_read_mi <dbl>, q45_read_det <dbl>,
# q46_read_mi <dbl>, q47_read_torg <dbl>, q48_read_det <dbl>,
# q49_read_det <dbl>, q50_read_voc <dbl>, q51_read_torg <dbl>,
# q52_read_inf <dbl>, q53_read_torg <dbl>, q54_read_det <dbl>,
# q55_read_torg <dbl>, q56_read_purp <dbl>, q57_read_purp <dbl>,
# q58_read_mi <dbl>, q59_read_inf <dbl>, q60_read_det_an <dbl>,
# q61_read_purp_an <dbl>, q62_read_voc_an <dbl>, q63_read_inf_an <dbl>,
# q64_read_mi_an <dbl>, q65_read_torg_an <dbl>, q66_read_torg_an <dbl>,
# q67_read_purp_an <dbl>, q68_read_mi_an <dbl>, q69_read_det_an <dbl>,
# q70_read_det_an <dbl>, raw_total <dbl>, percent_total <dbl>
The !
symbol negates the result; we can use that to filter out all
test takers from China.
filter(test_results_1, country != 'china')
# A tibble: 52 x 76
ID names country admin_date q1_list_mi q2_list_det
<dbl> <chr> <fct> <dttm> <dbl> <dbl>
1 671 Deni… morocco 1950-02-04 19:06:32 1 1
2 672 Bože… morocco 1998-08-20 22:56:09 0 1
3 673 Pavl… morocco 2018-07-15 11:01:39 1 1
4 674 Patr… morocco 1963-05-14 02:37:10 1 1
5 675 Maty… morocco 2006-09-19 11:46:42 1 0
6 676 Vlad… morocco 1929-12-15 00:30:38 1 1
7 677 Milu… morocco 1961-02-22 22:13:36 0 0
8 678 Niko… morocco 1986-01-08 07:12:44 0 1
9 679 Ing.… morocco 2005-04-08 10:43:11 1 0
10 680 Davi… morocco 2002-03-14 20:02:29 1 1
# … with 42 more rows, and 70 more variables: q3_list_det <dbl>,
# q4_list_det <dbl>, q5_list_det <dbl>, q6_list_det <dbl>,
# q7_list_det <dbl>, q8_list_det <dbl>, q9_list_det <dbl>,
# q10_list_mi <dbl>, q11_list_det <dbl>, q12_list_mi <dbl>,
# q13_list_prag <dbl>, q14_list_det <dbl>, q15_list_det <dbl>,
# q16_list_det <dbl>, q17_list_inf <dbl>, q18_list_inf <dbl>,
# q19_list_det <dbl>, q20_list_mi <dbl>, q21_list_det <dbl>,
# q22_list_prag <dbl>, q23_list_mi <dbl>, q24_list_det <dbl>,
# q25_list_prag <dbl>, q26_list_det_an <dbl>, q27_list_det_an <dbl>,
# q28_list_det_an <dbl>, q29_list_prag_an <dbl>, q30_list_mi_an <dbl>,
# q31_list_mi_an <dbl>, q32_list_det_an <dbl>, q33_list_det <dbl>,
# q34_list_prag_an <dbl>, q35_list_inf_an <dbl>, q36_read_mi <dbl>,
# q37_read_det <dbl>, q38_read_det <dbl>, q39_read_mi <dbl>,
# q40_read_voc <dbl>, q41_read_voc <dbl>, q42_read_mi <dbl>,
# q43_read_det <dbl>, q44_read_mi <dbl>, q45_read_det <dbl>,
# q46_read_mi <dbl>, q47_read_torg <dbl>, q48_read_det <dbl>,
# q49_read_det <dbl>, q50_read_voc <dbl>, q51_read_torg <dbl>,
# q52_read_inf <dbl>, q53_read_torg <dbl>, q54_read_det <dbl>,
# q55_read_torg <dbl>, q56_read_purp <dbl>, q57_read_purp <dbl>,
# q58_read_mi <dbl>, q59_read_inf <dbl>, q60_read_det_an <dbl>,
# q61_read_purp_an <dbl>, q62_read_voc_an <dbl>, q63_read_inf_an <dbl>,
# q64_read_mi_an <dbl>, q65_read_torg_an <dbl>, q66_read_torg_an <dbl>,
# q67_read_purp_an <dbl>, q68_read_mi_an <dbl>, q69_read_det_an <dbl>,
# q70_read_det_an <dbl>, raw_total <dbl>, percent_total <dbl>
We can also select columns based on their names or on partial matches to their names. If we wanted only the listening test items:
select(test_results_1, contains("_list_"))
# A tibble: 88 x 35
q1_list_mi q2_list_det q3_list_det q4_list_det q5_list_det q6_list_det
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 1 0 0 1
2 0 1 1 0 0 1
3 1 0 1 0 1 1
4 0 1 1 1 1 1
5 1 1 0 1 0 1
6 1 0 1 1 0 1
7 0 1 0 0 0 1
8 0 0 0 1 0 1
9 0 0 0 1 0 1
10 0 1 0 0 0 1
# … with 78 more rows, and 29 more variables: q7_list_det <dbl>,
# q8_list_det <dbl>, q9_list_det <dbl>, q10_list_mi <dbl>,
# q11_list_det <dbl>, q12_list_mi <dbl>, q13_list_prag <dbl>,
# q14_list_det <dbl>, q15_list_det <dbl>, q16_list_det <dbl>,
# q17_list_inf <dbl>, q18_list_inf <dbl>, q19_list_det <dbl>,
# q20_list_mi <dbl>, q21_list_det <dbl>, q22_list_prag <dbl>,
# q23_list_mi <dbl>, q24_list_det <dbl>, q25_list_prag <dbl>,
# q26_list_det_an <dbl>, q27_list_det_an <dbl>, q28_list_det_an <dbl>,
# q29_list_prag_an <dbl>, q30_list_mi_an <dbl>, q31_list_mi_an <dbl>,
# q32_list_det_an <dbl>, q33_list_det <dbl>, q34_list_prag_an <dbl>,
# q35_list_inf_an <dbl>
Pipes
What if you want to select and filter at the same time? There are three ways to do this: use intermediate steps, nested functions, or pipes.
With intermediate steps, you create a temporary data frame and use that as input to the next function, like this:
test_results_china <- filter(test_results_1, country != 'china')
totals <- select(test_results_china, raw_total)
This is readable, but can clutter up your workspace with lots of objects that you have to name individually. With multiple steps, that can be hard to keep track of.
Pipes are a recent addition to R. Pipes let you take the
output of one function and send it directly to the next, which is useful when
you need to do many things to the same dataset. Pipes in R look like %>%
and
are made available via the magrittr
package, installed automatically with
dplyr
. If you use RStudio, you can type the pipe with Ctrl
- Shift + M if you have a PC or Cmd + Shift + M if you have a Mac.
test_results_1 %>%
filter(country == "china") %>%
select(country, raw_total)
# A tibble: 36 x 2
country raw_total
<fct> <dbl>
1 china 38
2 china 32
3 china 32
4 china 49
5 china 38
6 china 36
7 china 37
8 china 39
9 china 37
10 china 34
# … with 26 more rows
In the above code, we use the pipe to send the test_results_1
dataset first
through filter()
to keep rows where country
is “china”, then through
select()
to keep only the country
and raw_total
columns. Since %>%
takes the object on its left and passes it as the first argument to the function
on its right, we don’t need to explicitly include the data frame as an argument
to the filter()
and select()
functions any more.
Some may find it helpful to read the pipe like the word “then”. For instance,
in the above example, we take the data frame test_results_1
, then we filter
for rows with country == "china"
, then we select
columns country
and raw_total
.
The dplyr
functions by themselves are somewhat simple,
but by combining them into linear workflows with the pipe, we can accomplish
more complex manipulations of data frames.
If we want to create a new object with this smaller version of the data, we can assign it a new name:
test_results_china <- test_results_1 %>%
filter(country == "china") %>%
select(country, raw_total)
test_results_china
# A tibble: 36 x 2
country raw_total
<fct> <dbl>
1 china 38
2 china 32
3 china 32
4 china 49
5 china 38
6 china 36
7 china 37
8 china 39
9 china 37
10 china 34
# … with 26 more rows
Note that the final data frame (test_results_china
) is the leftmost part of this expression.
Exercise
Using pipes, subset the
test_results_1
data to include raw_totals for test takers from morocco. Then select thecountry
andpercent_total
columns.Solution
test_results_morocco <- test_results_1 %>% filter(country == "morocco") %>% select(country, percent_total) test_results_morocco
# A tibble: 20 x 2 country percent_total <fct> <dbl> 1 morocco 54.3 2 morocco 35.7 3 morocco 64.3 4 morocco 67.1 5 morocco 61.4 6 morocco 75.7 7 morocco 47.1 8 morocco 41.4 9 morocco 67.1 10 morocco 68.6 11 morocco 77.1 12 morocco 90 13 morocco 81.4 14 morocco 68.6 15 morocco 51.4 16 morocco 71.4 17 morocco 65.7 18 morocco 77.1 19 morocco 54.3 20 morocco 62.9
You can also filter numerical data using other logical operators (such as >
, <=
).
For example, if you wanted data from test takers who scored above a certain score.
test_results_70 <- test_results_1 %>%
filter(percent_total >= 70) %>%
select(country, percent_total)
test_results_70
# A tibble: 26 x 2
country percent_total
<fct> <dbl>
1 china 70
2 china 87.1
3 china 80
4 china 72.9
5 china 77.1
6 china 70
7 china 74.3
8 china 72.9
9 china 85.7
10 china 81.4
# … with 16 more rows
We can use mutate
to create total scores for the listening and reading tests.
test_results_1 <- test_results_1 %>%
mutate(list_raw_total = rowSums(select(., contains("_list_")), na.rm = TRUE))
test_results_1$list_raw_total
[1] 20 13 20 28 20 19 16 20 20 20 19 9 17 16 21 31 28 28 24 27 26 11 24
[24] 28 15 25 26 15 26 20 23 29 16 20 29 25 16 13 25 22 19 28 15 12 22 21
[47] 25 32 25 24 18 27 22 29 13 26 16 21 10 9 20 22 22 20 21 12 24 28 15
[70] 23 13 24 27 32 29 18 23 29 14 12 17 24 28 31 12 6 18 6
Exercise
Using pipes, create a
read_raw_total
column in our dataframe. Thenselect
all of the columns that contain raw scores.Solution
test_results_1 <- test_results_1 %>% mutate(read_raw_total = rowSums(select(., contains("_read_")), na.rm = TRUE)) test_results_1 %>% select(., contains("raw"))
# A tibble: 88 x 3 raw_total list_raw_total read_raw_total <dbl> <dbl> <dbl> 1 38 20 18 2 32 13 19 3 32 20 12 4 49 28 21 5 38 20 18 6 36 19 17 7 37 16 21 8 39 20 19 9 37 20 17 10 34 20 14 # … with 78 more rows
Split-apply-combine data analysis and the summarize() function
Many data analysis tasks can be approached using the split-apply-combine
paradigm: split the data into groups, apply some analysis to each group, and
then combine the results. dplyr
makes this very easy through the use of
the group_by()
function.
The summarize()
function
group_by()
is often used together with summarize()
, which collapses each
group into a single-row summary of that group. group_by()
takes as arguments
the column names that contain the categorical variables for which you want
to calculate the summary statistics. So to compute the average raw test score by country
of origin:
test_results_1 %>%
group_by(country) %>%
summarize(mean = mean(raw_total))
# A tibble: 3 x 2
country mean
<fct> <dbl>
1 china 41.9
2 morocco 44.9
3 russia 40.2
You may also have noticed that the output from these calls doesn’t run off the
screen anymore. It’s one of the advantages of tbl_df
over data frame.
Once the data are grouped, you can create multiple summary statistics at a time. For example, we may want an indicator of dispersion in addition to central tendancy:
test_results_1 %>%
group_by(country) %>%
summarize(mean = mean(raw_total),
sd = sd(raw_total),
min = min(raw_total),
max = max(raw_total))
# A tibble: 3 x 5
country mean sd min max
<fct> <dbl> <dbl> <dbl> <dbl>
1 china 41.9 9.40 26 61
2 morocco 44.9 9.54 25 63
3 russia 40.2 13.6 14 65
It is sometimes useful to rearrange the result of a query to inspect the values. For instance, we can sort on min
to put the group with the lowest minimum score first.
test_results_1 %>%
group_by(country) %>%
summarize(mean = mean(raw_total),
sd = sd(raw_total),
min = min(raw_total),
max = max(raw_total)) %>%
arrange(min)
# A tibble: 3 x 5
country mean sd min max
<fct> <dbl> <dbl> <dbl> <dbl>
1 russia 40.2 13.6 14 65
2 morocco 44.9 9.54 25 63
3 china 41.9 9.40 26 61
To sort in descending order, we need to add the desc()
function. If we want to sort the results in decreasing order by raw_total
:
test_results_1 %>%
group_by(country) %>%
summarize(mean = mean(raw_total),
sd = sd(raw_total),
min = min(raw_total),
max = max(raw_total)) %>%
arrange(desc(min))
# A tibble: 3 x 5
country mean sd min max
<fct> <dbl> <dbl> <dbl> <dbl>
1 china 41.9 9.40 26 61
2 morocco 44.9 9.54 25 63
3 russia 40.2 13.6 14 65
Counting
When working with data, we often want to know the number of observations found
for each factor or combination of factors. For this task, dplyr
provides
count()
. For example, if we wanted to count the number of rows of data for
each country of origin, we would do:
test_results_1 %>%
count(country)
# A tibble: 3 x 2
country n
<fct> <int>
1 china 36
2 morocco 20
3 russia 32
For convenience, count()
provides the sort
argument to get results from most to least:
test_results_1 %>%
count(country, sort = TRUE)
# A tibble: 3 x 2
country n
<fct> <int>
1 china 36
2 russia 32
3 morocco 20
Within the summarise
function we can use group_by
and n()
:
test_results_1 %>%
group_by(country) %>%
summarise(n = n())
# A tibble: 3 x 2
country n
<fct> <int>
1 china 36
2 morocco 20
3 russia 32
Exercise
- What are the
n
,mean
,median
,sd
, and range (hint: range = (max
-min
) + 1) of thepercent_total
scores?Solution
test_results_1 %>% summarise(n = n(), mean = mean(percent_total), median = median(percent_total), sd = sd(percent_total), range = (max(percent_total) - min(percent_total)) + 1)
# A tibble: 1 x 5 n mean median sd range <int> <dbl> <dbl> <dbl> <dbl> 1 88 60.0 61.4 15.9 73.9
- Use
group_by()
to calculate the same summary statistics for test takers by country of origin.Solution
test_results_1 %>% group_by(country) %>% summarise(n = n(), mean = mean(percent_total), median = median(percent_total), sd = sd(percent_total), range = (max(percent_total) - min(percent_total)) + 1)
# A tibble: 3 x 6 country n mean median sd range <fct> <int> <dbl> <dbl> <dbl> <dbl> 1 china 36 59.9 56.4 13.4 51.0 2 morocco 20 64.1 66.4 13.6 55.3 3 russia 32 57.5 60.7 19.4 73.9
Reshaping with gather and spread
So far, we have been working with “tidy” data (see Hadley Wickhams Tidy Data manuscript for more information https://vita.had.co.nz/papers/tidy-data.pdf). Four characteristics of a tidy dataset include the following:
- Each variable has its own column
- Each observation has its own row
- Each value must have its own cell
- Each type of observational unit forms a table
Here we examine the fourth rule: Each type of observational unit forms a table.
In test_results_1
, each row contains the values of variables associated with each
record (the unit), values such as correct or incorrect responses and total and part scores.
What if instead of comparing records, we wanted to look at differences in test takers grouped by
different countries of origin or across different subtests (i.e., listening and reading)?
We’d need to create a new table where each row (the unit) is comprised
of values of variables associated with each country of origin (e.g. for
country
). In practical terms this means the values of the countries of
origin in country
would become the names of column variables and the cells
would contain TRUE
or FALSE
.
Having created a new table, we can now explore the relationship within and between countries of origin. The key point here is that we are still following a tidy data structure, but we have reshaped the data according to the observations of interest.
The opposite transformation would be to transform column names into values of a variable.
We can do both these of transformations with two tidyr
functions, spread()
and gather()
.
Gathering
In this situation we are gathering the column names and turning them into a pair of new variables. One variable represents the column names as values, and the other variable contains the values previously associated with the column names. We will do this in two steps to make this process a bit clearer.
gather()
takes four principal arguments:
- the data
- the key column variable we wish to create from column names.
- the value column variable we wish to create and fill with values associated with the key.
- the names of the columns we use to fill the key variable (or to drop).
test_results_1_long <- test_results_1 %>%
select(ID, country, raw_total, percent_total, list_raw_total, read_raw_total) %>%
gather(key, value, -ID, -country)
test_results_1_long
# A tibble: 352 x 4
ID country key value
<dbl> <fct> <chr> <dbl>
1 635 china raw_total 38
2 636 china raw_total 32
3 637 china raw_total 32
4 638 china raw_total 49
5 639 china raw_total 38
6 640 china raw_total 36
7 641 china raw_total 37
8 642 china raw_total 39
9 643 china raw_total 37
10 644 china raw_total 34
# … with 342 more rows
Spreading
spread()
takes three principal arguments:
- the data
- the key column variable whose values will become new column names.
- the value column variable whose values will fill the new column variables.
Further arguments include fill
which, if set, fills in missing values with
the value provided.
We can return our long dataframe back into a wide dataframe with spread
.
results_wide <- test_results_1_long %>%
spread(key, value)
Joining data
Sometimes it is necessary to separate pieces of our data, perform operations on them
and then join them back together with the original dataframe. For today, we will work
with full_join
from dplyr
. Our goal is to calculate the raw total scores for the
reading and listening anchor sets in our data. We will leverage the select
and
rowSums
commands again:
list_1_an <- test_results_1 %>%
select(., ID, contains("_list_"))%>%
select(., ID, contains("_an")) %>%
mutate(., list_an_raw = rowSums(.[2:10], na.rm = TRUE)) %>%
select(ID, list_an_raw)
We can use full_join
to add list_an_raw
back to the original dataframe. This command
takes three arguments. x
and y
are the tables to join. by
is character vector:
test_results_1 <- full_join(x = test_results_1, y = list_1_an, by = 'ID')
Exercise
Do the same operations as above to create a
read_an_raw
variable. Add it to the original dataframe. Then create an new dataframe calledtest_1_raw
that contains theID
andcountry
variables and all columns with raw totals. (hint: there are a different number of items in the reading anchor test than in the listening anchor test.)Solution
read_1_an <- test_results_1 %>% select(., ID, contains("_read_"))%>% select(., ID, contains("_an")) %>% mutate(., read_an_raw = rowSums(.[2:12], na.rm = TRUE)) %>% select(ID, read_an_raw) test_results_1 <- full_join(test_results_1, read_1_an, by = 'ID') test_1_raw <- select(test_results_1, ID, country, contains("raw")) test_1_raw
# A tibble: 88 x 7 ID country raw_total list_raw_total read_raw_total list_an_raw <dbl> <fct> <dbl> <dbl> <dbl> <dbl> 1 635 china 38 20 18 2 2 636 china 32 13 19 1 3 637 china 32 20 12 2 4 638 china 49 28 21 8 5 639 china 38 20 18 6 6 640 china 36 19 17 4 7 641 china 37 16 21 5 8 642 china 39 20 19 7 9 643 china 37 20 17 4 10 644 china 34 20 14 5 # … with 78 more rows, and 1 more variable: read_an_raw <dbl>
Now we can save this data frame to our data_output
directory.
write_csv(test_1_raw, path = "data_output/test_1_raw_totals.csv")
Key Points
Use the
dplyr
package to manipulate dataframes.Use
select()
to choose variables from a dataframe.Use
filter()
to choose data based on values.Use
group_by()
andsummarize()
to work with subsets of data.Use
mutate()
to create new variables.Use the
tidyr
package to change the layout of dataframes.Use
gather()
to go from wide to long format.Use
spread()
to go from long to wide format.