Part 1: Introducing dplyr and tidyr

Overview

Teaching: 40 min
Exercises: 25 min
Questions
  • How can I select specific rows and/or columns from a data frame?

  • How can I combine multiple commands into a single command?

  • How can create new columns or remove existing columns from a data frame?

  • How can I reformat a dataframe to meet my needs?

Objectives
  • Describe the purpose of the dplyr and tidyr packages.

  • Select certain columns in a data frame with the dplyr function select.

  • Select certain rows in a data frame according to filtering conditions with the dplyr function filter.

  • Link the output of one dplyr function to the input of another function with the ‘pipe’ operator %>%.

  • Add new columns to a data frame that are functions of existing columns with mutate.

  • Use the split-apply-combine concept for data analysis.

  • Use summarize, group_by, and count to split a data frame into groups of observations, apply a summary statistics for each group, and then combine the results.

  • Describe the concept of a wide and a long table format and for which purpose those formats are useful.

  • Describe what key-value pairs are.

  • Reshape a data frame from long to wide format and back with the spread and gather commands from the tidyr package.

  • Export a data frame to a csv file.

Data Manipulation using dplyr and tidyr

dplyr is a package for making tabular data manipulation easier by using a limited set of functions that can be combined to extract and summarize insights from your data. It pairs nicely with tidyr which enables you to swiftly convert between different data formats (long vs. wide) for plotting and analysis.

Similarly to readr, dplyr and tidyr are also part of the tidyverse. These packages were loaded in R’s memory when we called library(tidyverse) earlier.

What are dplyr and tidyr?

The package dplyr provides easy tools for the most common data manipulation tasks. It is built to work directly with data frames, with many common tasks optimized by being written in a compiled language (C++). An additional feature is the ability to work directly with data stored in an external database. The benefits of doing this are that the data can be managed natively in a relational database, queries can be conducted on that database, and only the results of the query are returned.

This addresses a common problem with R in that all operations are conducted in-memory and thus the amount of data you can work with is limited by available memory. The database connections essentially remove that limitation in that you can connect to a database of many hundreds of GB, conduct queries on it directly, and pull back into R only what you need for analysis.

The package tidyr addresses the common problem of wanting to reshape your data for plotting and use by different R functions. Sometimes we want data sets where we have one row per measurement. Sometimes we want a data frame where each measurement type has its own column, and rows are instead more aggregated groups. Moving back and forth between these formats is nontrivial, and tidyr gives you tools for this and more sophisticated data manipulation.

To learn more about dplyr and tidyr after the workshop, you may want to check out this handy data transformation with dplyr cheatsheet and this one about tidyr.

To make sure, everyone will use the same dataset for this lesson, we’ll read in one of the datasets that we downloaded earlier.

## load the tidyverse
library(tidyverse)

test_results_1 <- read_csv("data/placement_1.csv") %>%
  mutate(country = as.factor(country)) # changes the country variable to a factor

## inspect the data
test_results_1

## preview the data
# View(interviews)

## summarize the data
# summary(test_results_1)
# skimr::skim(test_results_1)

We’re going to learn some of the most common dplyr functions:

Creating new variables

Mutate

Frequently you’ll want to create new columns based on the values in existing columns, for example to scale test scores or to create total scores. For this we’ll use mutate().

We currently have a dataframe with information for each test taker about country of origin and how they responded (correct or incorrect) to each of the items on the test. We are missing total score columns. We know from inspecting our dataframe that our items are in columns 5:74. The . in front of the column indices is a placeholder for the dataframe test_results_1. It is similar to the command test_results[,5:74].

test_results_1 <- test_results_1 %>%
  mutate(raw_total = rowSums(.[5:74], na.rm = TRUE))

# print the vector
test_results_1$raw_total
 [1] 38 32 32 49 38 36 37 39 37 34 40 26 34 32 38 61 56 51 45 48 54 27 39
[24] 49 44 47 52 32 51 31 42 60 33 40 57 48 38 25 45 47 43 53 33 29 47 48
[47] 54 63 57 48 36 50 46 54 38 44 43 42 20 20 39 51 44 43 49 21 48 51 31
[70] 42 21 55 57 65 60 34 47 52 29 26 38 47 51 58 27 24 38 14

Often, we want to report scores that have been scaled in some way, such as percents. Now that we have a raw_total column, we can create a column called percent_total.

test_results_1 <- test_results_1 %>%
  mutate(percent_total = (raw_total / 70) * 100)

# print the vector
test_results_1$percent_total
 [1] 54.28571 45.71429 45.71429 70.00000 54.28571 51.42857 52.85714
 [8] 55.71429 52.85714 48.57143 57.14286 37.14286 48.57143 45.71429
[15] 54.28571 87.14286 80.00000 72.85714 64.28571 68.57143 77.14286
[22] 38.57143 55.71429 70.00000 62.85714 67.14286 74.28571 45.71429
[29] 72.85714 44.28571 60.00000 85.71429 47.14286 57.14286 81.42857
[36] 68.57143 54.28571 35.71429 64.28571 67.14286 61.42857 75.71429
[43] 47.14286 41.42857 67.14286 68.57143 77.14286 90.00000 81.42857
[50] 68.57143 51.42857 71.42857 65.71429 77.14286 54.28571 62.85714
[57] 61.42857 60.00000 28.57143 28.57143 55.71429 72.85714 62.85714
[64] 61.42857 70.00000 30.00000 68.57143 72.85714 44.28571 60.00000
[71] 30.00000 78.57143 81.42857 92.85714 85.71429 48.57143 67.14286
[78] 74.28571 41.42857 37.14286 54.28571 67.14286 72.85714 82.85714
[85] 38.57143 34.28571 54.28571 20.00000

Selecting columns and filtering rows

To select columns of a data frame, use select(). The first argument to this function is the data frame (test_results_1), and the subsequent arguments are the columns to keep.

select(test_results_1, country, raw_total)

To choose rows based on a specific criteria, use filter():

filter(test_results_1, country == "china")
# A tibble: 36 x 76
      ID names country admin_date          q1_list_mi q2_list_det
   <dbl> <chr> <fct>   <dttm>                   <dbl>       <dbl>
 1   635 JUDr… china   1960-12-25 22:02:34          1           1
 2   636 pan … china   1985-10-07 21:18:00          0           1
 3   637 Nadě… china   1987-12-22 17:04:22          1           0
 4   638 Mári… china   1983-10-31 10:23:45          0           1
 5   639 MUDr… china   1930-12-28 11:00:55          1           1
 6   640 Anež… china   1978-08-16 01:04:50          1           0
 7   641 Magd… china   2011-01-12 04:11:12          0           1
 8   642 Alžb… china   1977-11-19 01:12:51          0           0
 9   643 Lubo… china   2012-07-30 10:15:54          0           0
10   644 Sára… china   1963-05-19 22:58:51          0           1
# … with 26 more rows, and 70 more variables: q3_list_det <dbl>,
#   q4_list_det <dbl>, q5_list_det <dbl>, q6_list_det <dbl>,
#   q7_list_det <dbl>, q8_list_det <dbl>, q9_list_det <dbl>,
#   q10_list_mi <dbl>, q11_list_det <dbl>, q12_list_mi <dbl>,
#   q13_list_prag <dbl>, q14_list_det <dbl>, q15_list_det <dbl>,
#   q16_list_det <dbl>, q17_list_inf <dbl>, q18_list_inf <dbl>,
#   q19_list_det <dbl>, q20_list_mi <dbl>, q21_list_det <dbl>,
#   q22_list_prag <dbl>, q23_list_mi <dbl>, q24_list_det <dbl>,
#   q25_list_prag <dbl>, q26_list_det_an <dbl>, q27_list_det_an <dbl>,
#   q28_list_det_an <dbl>, q29_list_prag_an <dbl>, q30_list_mi_an <dbl>,
#   q31_list_mi_an <dbl>, q32_list_det_an <dbl>, q33_list_det <dbl>,
#   q34_list_prag_an <dbl>, q35_list_inf_an <dbl>, q36_read_mi <dbl>,
#   q37_read_det <dbl>, q38_read_det <dbl>, q39_read_mi <dbl>,
#   q40_read_voc <dbl>, q41_read_voc <dbl>, q42_read_mi <dbl>,
#   q43_read_det <dbl>, q44_read_mi <dbl>, q45_read_det <dbl>,
#   q46_read_mi <dbl>, q47_read_torg <dbl>, q48_read_det <dbl>,
#   q49_read_det <dbl>, q50_read_voc <dbl>, q51_read_torg <dbl>,
#   q52_read_inf <dbl>, q53_read_torg <dbl>, q54_read_det <dbl>,
#   q55_read_torg <dbl>, q56_read_purp <dbl>, q57_read_purp <dbl>,
#   q58_read_mi <dbl>, q59_read_inf <dbl>, q60_read_det_an <dbl>,
#   q61_read_purp_an <dbl>, q62_read_voc_an <dbl>, q63_read_inf_an <dbl>,
#   q64_read_mi_an <dbl>, q65_read_torg_an <dbl>, q66_read_torg_an <dbl>,
#   q67_read_purp_an <dbl>, q68_read_mi_an <dbl>, q69_read_det_an <dbl>,
#   q70_read_det_an <dbl>, raw_total <dbl>, percent_total <dbl>

The ! symbol negates the result; we can use that to filter out all test takers from China.

filter(test_results_1, country != 'china')
# A tibble: 52 x 76
      ID names country admin_date          q1_list_mi q2_list_det
   <dbl> <chr> <fct>   <dttm>                   <dbl>       <dbl>
 1   671 Deni… morocco 1950-02-04 19:06:32          1           1
 2   672 Bože… morocco 1998-08-20 22:56:09          0           1
 3   673 Pavl… morocco 2018-07-15 11:01:39          1           1
 4   674 Patr… morocco 1963-05-14 02:37:10          1           1
 5   675 Maty… morocco 2006-09-19 11:46:42          1           0
 6   676 Vlad… morocco 1929-12-15 00:30:38          1           1
 7   677 Milu… morocco 1961-02-22 22:13:36          0           0
 8   678 Niko… morocco 1986-01-08 07:12:44          0           1
 9   679 Ing.… morocco 2005-04-08 10:43:11          1           0
10   680 Davi… morocco 2002-03-14 20:02:29          1           1
# … with 42 more rows, and 70 more variables: q3_list_det <dbl>,
#   q4_list_det <dbl>, q5_list_det <dbl>, q6_list_det <dbl>,
#   q7_list_det <dbl>, q8_list_det <dbl>, q9_list_det <dbl>,
#   q10_list_mi <dbl>, q11_list_det <dbl>, q12_list_mi <dbl>,
#   q13_list_prag <dbl>, q14_list_det <dbl>, q15_list_det <dbl>,
#   q16_list_det <dbl>, q17_list_inf <dbl>, q18_list_inf <dbl>,
#   q19_list_det <dbl>, q20_list_mi <dbl>, q21_list_det <dbl>,
#   q22_list_prag <dbl>, q23_list_mi <dbl>, q24_list_det <dbl>,
#   q25_list_prag <dbl>, q26_list_det_an <dbl>, q27_list_det_an <dbl>,
#   q28_list_det_an <dbl>, q29_list_prag_an <dbl>, q30_list_mi_an <dbl>,
#   q31_list_mi_an <dbl>, q32_list_det_an <dbl>, q33_list_det <dbl>,
#   q34_list_prag_an <dbl>, q35_list_inf_an <dbl>, q36_read_mi <dbl>,
#   q37_read_det <dbl>, q38_read_det <dbl>, q39_read_mi <dbl>,
#   q40_read_voc <dbl>, q41_read_voc <dbl>, q42_read_mi <dbl>,
#   q43_read_det <dbl>, q44_read_mi <dbl>, q45_read_det <dbl>,
#   q46_read_mi <dbl>, q47_read_torg <dbl>, q48_read_det <dbl>,
#   q49_read_det <dbl>, q50_read_voc <dbl>, q51_read_torg <dbl>,
#   q52_read_inf <dbl>, q53_read_torg <dbl>, q54_read_det <dbl>,
#   q55_read_torg <dbl>, q56_read_purp <dbl>, q57_read_purp <dbl>,
#   q58_read_mi <dbl>, q59_read_inf <dbl>, q60_read_det_an <dbl>,
#   q61_read_purp_an <dbl>, q62_read_voc_an <dbl>, q63_read_inf_an <dbl>,
#   q64_read_mi_an <dbl>, q65_read_torg_an <dbl>, q66_read_torg_an <dbl>,
#   q67_read_purp_an <dbl>, q68_read_mi_an <dbl>, q69_read_det_an <dbl>,
#   q70_read_det_an <dbl>, raw_total <dbl>, percent_total <dbl>

We can also select columns based on their names or on partial matches to their names. If we wanted only the listening test items:

select(test_results_1, contains("_list_"))
# A tibble: 88 x 35
   q1_list_mi q2_list_det q3_list_det q4_list_det q5_list_det q6_list_det
        <dbl>       <dbl>       <dbl>       <dbl>       <dbl>       <dbl>
 1          1           1           1           0           0           1
 2          0           1           1           0           0           1
 3          1           0           1           0           1           1
 4          0           1           1           1           1           1
 5          1           1           0           1           0           1
 6          1           0           1           1           0           1
 7          0           1           0           0           0           1
 8          0           0           0           1           0           1
 9          0           0           0           1           0           1
10          0           1           0           0           0           1
# … with 78 more rows, and 29 more variables: q7_list_det <dbl>,
#   q8_list_det <dbl>, q9_list_det <dbl>, q10_list_mi <dbl>,
#   q11_list_det <dbl>, q12_list_mi <dbl>, q13_list_prag <dbl>,
#   q14_list_det <dbl>, q15_list_det <dbl>, q16_list_det <dbl>,
#   q17_list_inf <dbl>, q18_list_inf <dbl>, q19_list_det <dbl>,
#   q20_list_mi <dbl>, q21_list_det <dbl>, q22_list_prag <dbl>,
#   q23_list_mi <dbl>, q24_list_det <dbl>, q25_list_prag <dbl>,
#   q26_list_det_an <dbl>, q27_list_det_an <dbl>, q28_list_det_an <dbl>,
#   q29_list_prag_an <dbl>, q30_list_mi_an <dbl>, q31_list_mi_an <dbl>,
#   q32_list_det_an <dbl>, q33_list_det <dbl>, q34_list_prag_an <dbl>,
#   q35_list_inf_an <dbl>

Pipes

What if you want to select and filter at the same time? There are three ways to do this: use intermediate steps, nested functions, or pipes.

With intermediate steps, you create a temporary data frame and use that as input to the next function, like this:

test_results_china <- filter(test_results_1, country != 'china')
totals <- select(test_results_china, raw_total)

This is readable, but can clutter up your workspace with lots of objects that you have to name individually. With multiple steps, that can be hard to keep track of.

Pipes are a recent addition to R. Pipes let you take the output of one function and send it directly to the next, which is useful when you need to do many things to the same dataset. Pipes in R look like %>% and are made available via the magrittr package, installed automatically with dplyr. If you use RStudio, you can type the pipe with Ctrl

test_results_1 %>%
    filter(country == "china") %>%
    select(country, raw_total)
# A tibble: 36 x 2
   country raw_total
   <fct>       <dbl>
 1 china          38
 2 china          32
 3 china          32
 4 china          49
 5 china          38
 6 china          36
 7 china          37
 8 china          39
 9 china          37
10 china          34
# … with 26 more rows

In the above code, we use the pipe to send the test_results_1 dataset first through filter() to keep rows where country is “china”, then through select() to keep only the country and raw_total columns. Since %>% takes the object on its left and passes it as the first argument to the function on its right, we don’t need to explicitly include the data frame as an argument to the filter() and select() functions any more.

Some may find it helpful to read the pipe like the word “then”. For instance, in the above example, we take the data frame test_results_1, then we filter for rows with country == "china", then we select columns country and raw_total. The dplyr functions by themselves are somewhat simple, but by combining them into linear workflows with the pipe, we can accomplish more complex manipulations of data frames.

If we want to create a new object with this smaller version of the data, we can assign it a new name:

test_results_china <- test_results_1 %>%
    filter(country == "china") %>%
    select(country, raw_total)

test_results_china
# A tibble: 36 x 2
   country raw_total
   <fct>       <dbl>
 1 china          38
 2 china          32
 3 china          32
 4 china          49
 5 china          38
 6 china          36
 7 china          37
 8 china          39
 9 china          37
10 china          34
# … with 26 more rows

Note that the final data frame (test_results_china) is the leftmost part of this expression.

Exercise

Using pipes, subset the test_results_1 data to include raw_totals for test takers from morocco. Then select the country and percent_total columns.

Solution

test_results_morocco <- test_results_1 %>%
  filter(country == "morocco") %>%
  select(country, percent_total)

test_results_morocco
# A tibble: 20 x 2
   country percent_total
   <fct>           <dbl>
 1 morocco          54.3
 2 morocco          35.7
 3 morocco          64.3
 4 morocco          67.1
 5 morocco          61.4
 6 morocco          75.7
 7 morocco          47.1
 8 morocco          41.4
 9 morocco          67.1
10 morocco          68.6
11 morocco          77.1
12 morocco          90  
13 morocco          81.4
14 morocco          68.6
15 morocco          51.4
16 morocco          71.4
17 morocco          65.7
18 morocco          77.1
19 morocco          54.3
20 morocco          62.9

You can also filter numerical data using other logical operators (such as >, <=). For example, if you wanted data from test takers who scored above a certain score.

test_results_70 <- test_results_1 %>%
    filter(percent_total >= 70) %>%
    select(country, percent_total)

test_results_70
# A tibble: 26 x 2
   country percent_total
   <fct>           <dbl>
 1 china            70  
 2 china            87.1
 3 china            80  
 4 china            72.9
 5 china            77.1
 6 china            70  
 7 china            74.3
 8 china            72.9
 9 china            85.7
10 china            81.4
# … with 16 more rows

We can use mutate to create total scores for the listening and reading tests.

test_results_1 <- test_results_1 %>%
  mutate(list_raw_total = rowSums(select(., contains("_list_")), na.rm = TRUE))

test_results_1$list_raw_total
 [1] 20 13 20 28 20 19 16 20 20 20 19  9 17 16 21 31 28 28 24 27 26 11 24
[24] 28 15 25 26 15 26 20 23 29 16 20 29 25 16 13 25 22 19 28 15 12 22 21
[47] 25 32 25 24 18 27 22 29 13 26 16 21 10  9 20 22 22 20 21 12 24 28 15
[70] 23 13 24 27 32 29 18 23 29 14 12 17 24 28 31 12  6 18  6

Exercise

Using pipes, create a read_raw_total column in our dataframe. Then select all of the columns that contain raw scores.

Solution

test_results_1 <- test_results_1 %>%
 mutate(read_raw_total = rowSums(select(., contains("_read_")), na.rm = TRUE))

test_results_1 %>%
 select(., contains("raw"))
# A tibble: 88 x 3
   raw_total list_raw_total read_raw_total
       <dbl>          <dbl>          <dbl>
 1        38             20             18
 2        32             13             19
 3        32             20             12
 4        49             28             21
 5        38             20             18
 6        36             19             17
 7        37             16             21
 8        39             20             19
 9        37             20             17
10        34             20             14
# … with 78 more rows

Split-apply-combine data analysis and the summarize() function

Many data analysis tasks can be approached using the split-apply-combine paradigm: split the data into groups, apply some analysis to each group, and then combine the results. dplyr makes this very easy through the use of the group_by() function.

The summarize() function

group_by() is often used together with summarize(), which collapses each group into a single-row summary of that group. group_by() takes as arguments the column names that contain the categorical variables for which you want to calculate the summary statistics. So to compute the average raw test score by country of origin:

test_results_1 %>%
    group_by(country) %>%
    summarize(mean = mean(raw_total))
# A tibble: 3 x 2
  country  mean
  <fct>   <dbl>
1 china    41.9
2 morocco  44.9
3 russia   40.2

You may also have noticed that the output from these calls doesn’t run off the screen anymore. It’s one of the advantages of tbl_df over data frame.

Once the data are grouped, you can create multiple summary statistics at a time. For example, we may want an indicator of dispersion in addition to central tendancy:

test_results_1 %>%
    group_by(country) %>%
    summarize(mean = mean(raw_total),
              sd = sd(raw_total),
              min = min(raw_total),
              max = max(raw_total))
# A tibble: 3 x 5
  country  mean    sd   min   max
  <fct>   <dbl> <dbl> <dbl> <dbl>
1 china    41.9  9.40    26    61
2 morocco  44.9  9.54    25    63
3 russia   40.2 13.6     14    65

It is sometimes useful to rearrange the result of a query to inspect the values. For instance, we can sort on min to put the group with the lowest minimum score first.

test_results_1 %>%
    group_by(country) %>%
    summarize(mean = mean(raw_total),
              sd = sd(raw_total),
              min = min(raw_total),
              max = max(raw_total)) %>%
  arrange(min)
# A tibble: 3 x 5
  country  mean    sd   min   max
  <fct>   <dbl> <dbl> <dbl> <dbl>
1 russia   40.2 13.6     14    65
2 morocco  44.9  9.54    25    63
3 china    41.9  9.40    26    61

To sort in descending order, we need to add the desc() function. If we want to sort the results in decreasing order by raw_total:

test_results_1 %>%
    group_by(country) %>%
    summarize(mean = mean(raw_total),
              sd = sd(raw_total),
              min = min(raw_total),
              max = max(raw_total)) %>%
  arrange(desc(min))
# A tibble: 3 x 5
  country  mean    sd   min   max
  <fct>   <dbl> <dbl> <dbl> <dbl>
1 china    41.9  9.40    26    61
2 morocco  44.9  9.54    25    63
3 russia   40.2 13.6     14    65

Counting

When working with data, we often want to know the number of observations found for each factor or combination of factors. For this task, dplyr provides count(). For example, if we wanted to count the number of rows of data for each country of origin, we would do:

test_results_1 %>%
    count(country)
# A tibble: 3 x 2
  country     n
  <fct>   <int>
1 china      36
2 morocco    20
3 russia     32

For convenience, count() provides the sort argument to get results from most to least:

test_results_1 %>%
    count(country, sort = TRUE)
# A tibble: 3 x 2
  country     n
  <fct>   <int>
1 china      36
2 russia     32
3 morocco    20

Within the summarise function we can use group_by and n():

test_results_1 %>%
  group_by(country) %>%
  summarise(n = n())
# A tibble: 3 x 2
  country     n
  <fct>   <int>
1 china      36
2 morocco    20
3 russia     32

Exercise

  1. What are the n, mean, median, sd, and range (hint: range = (max - min) + 1) of the percent_total scores?

Solution

test_results_1 %>%
   summarise(n = n(),
             mean = mean(percent_total),
             median = median(percent_total),
             sd = sd(percent_total),
             range = (max(percent_total) - min(percent_total)) + 1)
# A tibble: 1 x 5
      n  mean median    sd range
  <int> <dbl>  <dbl> <dbl> <dbl>
1    88  60.0   61.4  15.9  73.9
  1. Use group_by() to calculate the same summary statistics for test takers by country of origin.

Solution

test_results_1 %>%
   group_by(country) %>%
   summarise(n = n(),
             mean = mean(percent_total),
             median = median(percent_total),
             sd = sd(percent_total),
             range = (max(percent_total) - min(percent_total)) + 1)
# A tibble: 3 x 6
  country     n  mean median    sd range
  <fct>   <int> <dbl>  <dbl> <dbl> <dbl>
1 china      36  59.9   56.4  13.4  51.0
2 morocco    20  64.1   66.4  13.6  55.3
3 russia     32  57.5   60.7  19.4  73.9

Reshaping with gather and spread

So far, we have been working with “tidy” data (see Hadley Wickhams Tidy Data manuscript for more information https://vita.had.co.nz/papers/tidy-data.pdf). Four characteristics of a tidy dataset include the following:

  1. Each variable has its own column
  2. Each observation has its own row
  3. Each value must have its own cell
  4. Each type of observational unit forms a table

Here we examine the fourth rule: Each type of observational unit forms a table.

In test_results_1, each row contains the values of variables associated with each record (the unit), values such as correct or incorrect responses and total and part scores. What if instead of comparing records, we wanted to look at differences in test takers grouped by different countries of origin or across different subtests (i.e., listening and reading)?

We’d need to create a new table where each row (the unit) is comprised of values of variables associated with each country of origin (e.g. for country). In practical terms this means the values of the countries of origin in country would become the names of column variables and the cells would contain TRUE or FALSE.

Having created a new table, we can now explore the relationship within and between countries of origin. The key point here is that we are still following a tidy data structure, but we have reshaped the data according to the observations of interest.

The opposite transformation would be to transform column names into values of a variable.

We can do both these of transformations with two tidyr functions, spread() and gather().

Gathering

In this situation we are gathering the column names and turning them into a pair of new variables. One variable represents the column names as values, and the other variable contains the values previously associated with the column names. We will do this in two steps to make this process a bit clearer.

gather() takes four principal arguments:

  1. the data
  2. the key column variable we wish to create from column names.
  3. the value column variable we wish to create and fill with values associated with the key.
  4. the names of the columns we use to fill the key variable (or to drop).
test_results_1_long <- test_results_1 %>%
  select(ID, country, raw_total, percent_total, list_raw_total, read_raw_total) %>%
  gather(key, value, -ID, -country)

test_results_1_long
# A tibble: 352 x 4
      ID country key       value
   <dbl> <fct>   <chr>     <dbl>
 1   635 china   raw_total    38
 2   636 china   raw_total    32
 3   637 china   raw_total    32
 4   638 china   raw_total    49
 5   639 china   raw_total    38
 6   640 china   raw_total    36
 7   641 china   raw_total    37
 8   642 china   raw_total    39
 9   643 china   raw_total    37
10   644 china   raw_total    34
# … with 342 more rows

Spreading

spread() takes three principal arguments:

  1. the data
  2. the key column variable whose values will become new column names.
  3. the value column variable whose values will fill the new column variables.

Further arguments include fill which, if set, fills in missing values with the value provided.

We can return our long dataframe back into a wide dataframe with spread.

results_wide <- test_results_1_long %>%
  spread(key, value)

Joining data

Sometimes it is necessary to separate pieces of our data, perform operations on them and then join them back together with the original dataframe. For today, we will work with full_join from dplyr. Our goal is to calculate the raw total scores for the reading and listening anchor sets in our data. We will leverage the select and rowSums commands again:

list_1_an <- test_results_1 %>%
  select(., ID, contains("_list_"))%>%
  select(., ID, contains("_an")) %>%
  mutate(., list_an_raw = rowSums(.[2:10], na.rm = TRUE)) %>%
  select(ID, list_an_raw)

We can use full_join to add list_an_raw back to the original dataframe. This command takes three arguments. x and y are the tables to join. by is character vector:

test_results_1 <- full_join(x = test_results_1, y = list_1_an, by = 'ID')

Exercise

Do the same operations as above to create a read_an_raw variable. Add it to the original dataframe. Then create an new dataframe called test_1_raw that contains the ID and country variables and all columns with raw totals. (hint: there are a different number of items in the reading anchor test than in the listening anchor test.)

Solution

read_1_an <- test_results_1 %>%
  select(., ID, contains("_read_"))%>%
  select(., ID, contains("_an")) %>%
  mutate(., read_an_raw = rowSums(.[2:12], na.rm = TRUE)) %>%
  select(ID, read_an_raw)

test_results_1 <- full_join(test_results_1, read_1_an, by = 'ID')

test_1_raw <- select(test_results_1, ID, country, contains("raw"))

test_1_raw
# A tibble: 88 x 7
      ID country raw_total list_raw_total read_raw_total list_an_raw
   <dbl> <fct>       <dbl>          <dbl>          <dbl>       <dbl>
 1   635 china          38             20             18           2
 2   636 china          32             13             19           1
 3   637 china          32             20             12           2
 4   638 china          49             28             21           8
 5   639 china          38             20             18           6
 6   640 china          36             19             17           4
 7   641 china          37             16             21           5
 8   642 china          39             20             19           7
 9   643 china          37             20             17           4
10   644 china          34             20             14           5
# … with 78 more rows, and 1 more variable: read_an_raw <dbl>

Now we can save this data frame to our data_output directory.

write_csv(test_1_raw, path = "data_output/test_1_raw_totals.csv")

Key Points

  • Use the dplyr package to manipulate dataframes.

  • Use select() to choose variables from a dataframe.

  • Use filter() to choose data based on values.

  • Use group_by() and summarize() to work with subsets of data.

  • Use mutate() to create new variables.

  • Use the tidyr package to change the layout of dataframes.

  • Use gather() to go from wide to long format.

  • Use spread() to go from long to wide format.