Data manipulation in R

Antoine Soetewey 2019-12-24 18 minute read

Introduction

Not all datasets are as clean and tidy as you would expect. Therefore, after importing your dataset into RStudio, most of the time you will need to prepare it before performing any statistical analyses. Data manipulation can even take longer than the actual analyses when the quality of the data is poor. Data manipulation includes a broad range of tools and techniques.

We present here in detail the manipulations that you will most likely need for your projects. Do not hesitate to let me know (as a comment at the end of this article, for example) if you find other data manipulations essential so that I can add them.

Dataset

In this article, we use the dataset cars to illustrate the different data manipulation techniques. Note that this dataset is available by default in R (so you do not need to import it), and I use the generic name dat for the dataset throughout the article (see here why I always use a generic name instead of a more specific one). Here is a table of the whole dataset:

dat <- cars # rename the cars dataset with a generic name
dat # display the entire dataset
##    speed dist
## 1      4    2
## 2      4   10
## 3      7    4
## 4      7   22
## 5      8   16
## 6      9   10
## 7     10   18
## 8     10   26
## 9     10   34
## 10    11   17
## 11    11   28
## 12    12   14
## 13    12   20
## 14    12   24
## 15    12   28
## 16    13   26
## 17    13   34
## 18    13   34
## 19    13   46
## 20    14   26
## 21    14   36
## 22    14   60
## 23    14   80
## 24    15   20
## 25    15   26
## 26    15   54
## 27    16   32
## 28    16   40
## 29    17   32
## 30    17   40
## 31    17   50
## 32    18   42
## 33    18   56
## 34    18   76
## 35    18   84
## 36    19   36
## 37    19   46
## 38    19   68
## 39    20   32
## 40    20   48
## 41    20   52
## 42    20   56
## 43    20   64
## 44    22   66
## 45    23   54
## 46    24   70
## 47    24   92
## 48    24   93
## 49    24  120
## 50    25   85

This dataset has 50 observations with 2 variables (speed and dist). You can check the number of observations and variables with nrow(dat) and ncol(dat), or both at once with dim(dat):

nrow(dat) # number of rows/observations
## [1] 50
ncol(dat) # number of columns/variables
## [1] 2
dim(dat) # dimension: number of rows/observations and columns/variables
## [1] 50  2

Subset a dataset

First or last observations

  • To keep only the first 10 observations:
head(dat, n = 10)
##    speed dist
## 1      4    2
## 2      4   10
## 3      7    4
## 4      7   22
## 5      8   16
## 6      9   10
## 7     10   18
## 8     10   26
## 9     10   34
## 10    11   17
  • To keep only the last 5 observations:
tail(dat, n = 5)
##    speed dist
## 46    24   70
## 47    24   92
## 48    24   93
## 49    24  120
## 50    25   85

Random sample of observations

  • To draw a sample of 4 observations without replacement:
library(dplyr)
sample_n(dat, 4, replace = FALSE)
##   speed dist
## 1    24   70
## 2    16   32
## 3    24  120
## 4    15   20
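Note that sample_n() draws the rows at random, so each run returns a different subset (the output above will thus differ from yours). To make the draw reproducible, set the seed first (the value 42 here is arbitrary):

set.seed(42) # fix the random number generator so the sample is reproducible
sample_n(dat, 4, replace = FALSE)

In recent versions of dplyr, sample_n() has been superseded by slice_sample(dat, n = 4), which works in the same way.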

Based on row or column numbers

If you know which observation(s) or column(s) you want to keep, you can use the row or column number(s) to subset your dataset. We illustrate this with several examples:

  • keep all the variables for the \(3^{rd}\) observation:
dat[3, ]
  • keep the \(2^{nd}\) variable for all observations:
dat[, 2]
  • You can mix the two above methods to keep only the \(2^{nd}\) variable of the \(3^{rd}\) observation:
dat[3, 2]
## [1] 4
  • keep several observations; for example, observations \(1\) to \(5\), together with the \(10^{th}\) and the \(15^{th}\) observations, for all variables:
dat[c(1:5, 10, 15), ] # do not forget c()
##    speed dist
## 1      4    2
## 2      4   10
## 3      7    4
## 4      7   22
## 5      8   16
## 10    11   17
## 15    12   28
  • remove observations 5 to 45:
dat[-c(5:45), ]
##    speed dist
## 1      4    2
## 2      4   10
## 3      7    4
## 4      7   22
## 46    24   70
## 47    24   92
## 48    24   93
## 49    24  120
## 50    25   85
  • tip: to keep only the last observation, use nrow() instead of the row number:
dat[nrow(dat), ] # nrow() gives the number of rows
##    speed dist
## 50    25   85

This way, no matter the number of observations, you will always select the last one. This technique of using a piece of code instead of a specific value avoids “hard coding”. Hard coding is generally not recommended (unless you are sure a parameter will never change) because if your dataset changes, you will need to edit your code manually.
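The same idea extends to other positions counted from the end of the dataset. For instance, to keep the last two observations, whatever the size of the dataset:

dat[(nrow(dat) - 1):nrow(dat), ] # last 2 rows, without hard-coded row numbers
##    speed dist
## 49    24  120
## 50    25   85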

As you probably figured out by now, you can select observations and/or variables of a dataset by running dataset_name[row_number, column_number]. When the row or column number is left empty, the entire row/column is selected. Note that all examples presented above also work for matrices:

mat <- matrix(c(-1, 2, 0, 3), ncol = 2, nrow = 2)
mat
##      [,1] [,2]
## [1,]   -1    0
## [2,]    2    3
mat[1, 2]
## [1] 0

Based on variable names

To select one variable of the dataset based on its name rather than on its column number, use dataset_name$variable_name:

dat$speed
##  [1]  4  4  7  7  8  9 10 10 10 11 11 12 12 12 12 13 13 13 13 14 14 14 14 15 15
## [26] 15 16 16 17 17 17 18 18 18 18 19 19 19 20 20 20 20 20 22 23 24 24 24 24 25

Accessing variables by name rather than by column number is strongly recommended if you intend to modify the structure of your dataset. Indeed, if a column is added or removed, the numbering changes, so variables are generally referred to by their names rather than by their positions (column numbers). In addition, code is easier to understand and interpret when the names of the variables are written out (another reason to give variables concise but clear names). There is only one reason why I would still use the column number: if the variable names are expected to change while the structure of the dataset does not.
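Variable names also work inside the square brackets seen earlier, as quoted character strings; this combines the robustness of names with the row-subsetting syntax:

dat[1:3, "speed"] # same as dat$speed[1:3]
## [1] 4 4 7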

To select variables, it is also possible to use the select() command from the powerful dplyr package (for compactness only the first 6 observations are displayed thanks to the head() command):

head(select(dat, speed))
##   speed
## 1     4
## 2     4
## 3     7
## 4     7
## 5     8
## 6     9

This is equivalent to removing the dist variable:

head(select(dat, -dist))
##   speed
## 1     4
## 2     4
## 3     7
## 4     7
## 5     8
## 6     9

Based on one or multiple criteria

Instead of subsetting a dataset based on row/column numbers or variable names, you can also subset it based on one or more criteria:

  • keep only observations with speed larger than 20. The first argument refers to the name of the dataset, while the second argument refers to the subset criteria:
subset(dat, dat$speed > 20)
##    speed dist
## 44    22   66
## 45    23   54
## 46    24   70
## 47    24   92
## 48    24   93
## 49    24  120
## 50    25   85
  • keep only observations with distance smaller than or equal to 50 and speed equal to 10. Note the == (and not =) for the equality criterion:
subset(dat, dat$dist <= 50 & dat$speed == 10)
##   speed dist
## 7    10   18
## 8    10   26
## 9    10   34
  • use | to keep only observations with distance smaller than 20 or speed equal to 10:
subset(dat, dat$dist < 20 | dat$speed == 10)
##    speed dist
## 1      4    2
## 2      4   10
## 3      7    4
## 5      8   16
## 6      9   10
## 7     10   18
## 8     10   26
## 9     10   34
## 10    11   17
## 12    12   14
  • to filter out some observations, use !=. For instance, to keep observations with speed not equal to 24 and distance not equal to 120 (for compactness only the last 6 observations are displayed thanks to the tail() command):
tail(subset(dat, dat$speed != 24 & dat$dist != 120))
##    speed dist
## 41    20   52
## 42    20   56
## 43    20   64
## 44    22   66
## 45    23   54
## 50    25   85

Create a new variable

Often a dataset can be enhanced by creating new variables based on other variables from the initial dataset. In this example, we create two new variables: one being the speed times the distance (which we call speed_dist) and the other being a categorization of the speed (which we call speed_cat). We then display the first 6 observations of this new dataset with the 4 variables:

# create new variable speed_dist
dat$speed_dist <- dat$speed * dat$dist

# create new variable speed_cat
# the function ifelse() below means that if dat$speed > 7, then speed_cat is "high speed", otherwise it is "low speed"
dat$speed_cat <- ifelse(dat$speed > 7,
  "high speed", "low speed"
)

# display first 6 observations
head(dat) # 6 is the default in head()
##   speed dist speed_dist  speed_cat
## 1     4    2          8  low speed
## 2     4   10         40  low speed
## 3     7    4         28  low speed
## 4     7   22        154  low speed
## 5     8   16        128 high speed
## 6     9   10         90 high speed

Note that in programming, a character string is generally surrounded by quotes ("character string").
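These quotes are precisely what distinguishes a character string from a number (or from the name of an object):

class("7") # a character string
## [1] "character"
class(7) # a number
## [1] "numeric"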

Transform a continuous variable into a categorical variable

To transform a continuous variable into a categorical variable (also known as qualitative variable):

dat$speed_quali <- cut(dat$speed,
  breaks = c(0, 12, 15, 19, 26), # cut points
  right = FALSE # closed on the left, open on the right
)
dat[c(1:2, 23:24, 49:50), ] # display some observations
##    speed dist speed_dist  speed_cat speed_quali
## 1      4    2          8  low speed      [0,12)
## 2      4   10         40  low speed      [0,12)
## 23    14   80       1120 high speed     [12,15)
## 24    15   20        300 high speed     [15,19)
## 49    24  120       2880 high speed     [19,26)
## 50    25   85       2125 high speed     [19,26)

This transformation is often done on age, when the age (a continuous variable) is transformed into a qualitative variable representing different age groups.
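As an illustration with made-up values (the ages below are purely hypothetical), age groups can be built in exactly the same way:

age <- c(5, 17, 34, 65, 89) # hypothetical ages
cut(age,
  breaks = c(0, 18, 65, 120), # cut points
  right = FALSE, # closed on the left, open on the right
  labels = c("minor", "adult", "senior") # optional labels, in the order of the intervals
)
## [1] minor  minor  adult  senior senior
## Levels: minor adult senior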

Sum and mean in rows

In surveys with Likert scales (used in psychology, among others), it is often necessary to compute a score for each respondent based on multiple questions. The score is usually the mean or the sum of all the questions of interest. This can be done with rowMeans() and rowSums(). For instance, let’s compute the mean and the sum of the variables speed, dist and speed_dist (variables must be numeric of course, as the sum and mean cannot be computed on qualitative variables!) for each row and store them in the variables mean_score and total_score:

dat$mean_score <- rowMeans(dat[, 1:3]) # variables speed, dist and speed_dist correspond to variables 1 to 3
dat$total_score <- rowSums(dat[, 1:3])
head(dat)
##   speed dist speed_dist  speed_cat speed_quali mean_score total_score
## 1     4    2          8  low speed      [0,12)   4.666667          14
## 2     4   10         40  low speed      [0,12)  18.000000          54
## 3     7    4         28  low speed      [0,12)  13.000000          39
## 4     7   22        154  low speed      [0,12)  61.000000         183
## 5     8   16        128 high speed      [0,12)  50.666667         152
## 6     9   10         90 high speed      [0,12)  36.333333         109

Sum and mean in columns

It is also possible to compute the mean and sum by column with colMeans() and colSums():

colMeans(dat[, 1:3])
##      speed       dist speed_dist 
##      15.40      42.98     769.64
colSums(dat[, 1:3])
##      speed       dist speed_dist 
##        770       2149      38482

Categorical variables and labels management

For categorical variables, it is a good practice to use the factor format and to name the different levels of the variables.

  • for this example, let’s create another new variable called dist_cat based on the distance and then change its format from numeric to factor (while also specifying the labels of the levels):
# create new variable dist_cat
dat$dist_cat <- ifelse(dat$dist < 15,
  1, 2
)

# change from numeric to factor and specify the labels
dat$dist_cat <- factor(dat$dist_cat,
  levels = c(1, 2),
  labels = c("small distance", "big distance") # follow the order of the levels
)

head(dat)
##   speed dist speed_dist  speed_cat speed_quali mean_score total_score
## 1     4    2          8  low speed      [0,12)   4.666667          14
## 2     4   10         40  low speed      [0,12)  18.000000          54
## 3     7    4         28  low speed      [0,12)  13.000000          39
## 4     7   22        154  low speed      [0,12)  61.000000         183
## 5     8   16        128 high speed      [0,12)  50.666667         152
## 6     9   10         90 high speed      [0,12)  36.333333         109
##         dist_cat
## 1 small distance
## 2 small distance
## 3 small distance
## 4   big distance
## 5   big distance
## 6 small distance
  • to check the format of a variable:
class(dat$dist_cat)
## [1] "factor"
# or
str(dat$dist_cat)
##  Factor w/ 2 levels "small distance",..: 1 1 1 2 2 1 2 2 2 2 ...

This will be sufficient if you need to format only a limited number of variables. However, if you need to do it for a large number of categorical variables, it quickly becomes time-consuming to write the same code many times. It is possible to format many variables at once, without writing the code for each variable one by one, by using the within() command:

dat <- within(dat, {
  speed_cat <- factor(speed_cat, labels = c(
    "high speed",
    "low speed"
  ))
  dist_cat <- factor(dist_cat, labels = c(
    "small distance",
    "big distance"
  ))
})
head(dat)
##   speed dist speed_dist  speed_cat speed_quali mean_score total_score
## 1     4    2          8  low speed      [0,12)   4.666667          14
## 2     4   10         40  low speed      [0,12)  18.000000          54
## 3     7    4         28  low speed      [0,12)  13.000000          39
## 4     7   22        154  low speed      [0,12)  61.000000         183
## 5     8   16        128 high speed      [0,12)  50.666667         152
## 6     9   10         90 high speed      [0,12)  36.333333         109
##         dist_cat
## 1 small distance
## 2 small distance
## 3 small distance
## 4   big distance
## 5   big distance
## 6 small distance
str(dat)
## 'data.frame':    50 obs. of  8 variables:
##  $ speed      : num  4 4 7 7 8 9 10 10 10 11 ...
##  $ dist       : num  2 10 4 22 16 10 18 26 34 17 ...
##  $ speed_dist : num  8 40 28 154 128 90 180 260 340 187 ...
##  $ speed_cat  : Factor w/ 2 levels "high speed","low speed": 2 2 2 2 1 1 1 1 1 1 ...
##  $ speed_quali: Factor w/ 4 levels "[0,12)","[12,15)",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ mean_score : num  4.67 18 13 61 50.67 ...
##  $ total_score: num  14 54 39 183 152 109 208 296 384 215 ...
##  $ dist_cat   : Factor w/ 2 levels "small distance",..: 1 1 1 2 2 1 2 2 2 2 ...

Recode categorical variables

It is possible to recode labels of a categorical variable if you are not satisfied with the current labels. In this example, we change the labels as follows:

  • “small distance” \(\rightarrow\) “short distance”
  • “big distance” \(\rightarrow\) “large distance”
dat$dist_cat <- recode(dat$dist_cat,
  "small distance" = "short distance",
  "big distance" = "large distance"
)
head(dat)
##   speed dist speed_dist  speed_cat speed_quali mean_score total_score
## 1     4    2          8  low speed      [0,12)   4.666667          14
## 2     4   10         40  low speed      [0,12)  18.000000          54
## 3     7    4         28  low speed      [0,12)  13.000000          39
## 4     7   22        154  low speed      [0,12)  61.000000         183
## 5     8   16        128 high speed      [0,12)  50.666667         152
## 6     9   10         90 high speed      [0,12)  36.333333         109
##         dist_cat
## 1 short distance
## 2 short distance
## 3 short distance
## 4 large distance
## 5 large distance
## 6 short distance

Change reference level

For some analyses, you might want to change the order of the levels. For example, if you are analyzing data about a control group and a treatment group, you may want to set the control group as the reference. By default, levels are ordered alphabetically or, if the variable was converted from numeric to factor, by their numeric values.

  • to check the current order of the levels (the first level being the reference):
levels(dat$dist_cat)
## [1] "short distance" "large distance"

In this case, “short distance” is the first level, so it is the reference level. It is first because it was initially coded with the value 1 when the variable was created.

  • to change the reference level:
dat$dist_cat <- relevel(dat$dist_cat, ref = "large distance")
levels(dat$dist_cat)
## [1] "large distance" "short distance"

“large distance” is now the first level and thus the reference level.

Rename variable names

To rename variable names, use the rename() command from the dplyr package as follows:

  • dist \(\rightarrow\) distance
  • speed_dist \(\rightarrow\) speed_distance
  • dist_cat \(\rightarrow\) distance_cat
dat <- rename(dat,
  distance = dist,
  speed_distance = speed_dist,
  distance_cat = dist_cat
)
names(dat) # display variable names
## [1] "speed"          "distance"       "speed_distance" "speed_cat"     
## [5] "speed_quali"    "mean_score"     "total_score"    "distance_cat"

Create a dataframe manually

Although most analyses are performed on an imported dataset, it is also possible to create a dataframe directly in R:

# Create the data frame named dat
dat <- data.frame(
  "variable1" = c(6, 12, NA, 3), # presence of 1 missing value
  "variable2" = c(3, 7, 9, 1),
  stringsAsFactors = FALSE
)
# Print the data frame
dat
##   variable1 variable2
## 1         6         3
## 2        12         7
## 3        NA         9
## 4         3         1

Missing values

Missing values (represented by NA in R, for “Not Available”) are often problematic for many analyses. For instance, the mean of a series or variable with at least one NA will return NA (the dataframe created in the previous section is used for this example):

mean(dat$variable1)
## [1] NA

It is however possible to compute most measures for variables including at least one NA thanks to the argument na.rm = TRUE:

mean(dat$variable1, na.rm = TRUE)
## [1] 7

Nonetheless, datasets with NAs are still problematic for some types of analysis. Several alternatives exist to remove or impute missing values.

Remove NAs

A simple solution is to remove all observations (i.e., rows) containing at least one missing value. This is done by keeping observations with complete cases:

dat_complete <- dat[complete.cases(dat), ]
dat_complete
##   variable1 variable2
## 1         6         3
## 2        12         7
## 4         3         1

Be careful before removing observations with missing values, especially if missing values are not “missing at random”. This is, however, beyond the scope of the present article.

Impute NAs

Instead of removing observations with at least one NA, it is possible to impute them, that is, replace them by some values such as the median or the mode of the variable. This can be done easily with the command impute() from the package imputeMissings:

library(imputeMissings)
dat_imputed <- impute(dat) # default method is median/mode
dat_imputed
##   variable1 variable2
## 1         6         3
## 2        12         7
## 3         6         9
## 4         3         1

When the median/mode method is used (the default), character vectors and factors are imputed with the mode, while numeric and integer vectors are imputed with the median. Again, use imputation carefully. Other packages offer more advanced imputation techniques; however, we keep things simple here, as advanced imputation is beyond the scope of an introduction to data manipulation in R.

Scale

Scaling (i.e., standardizing) a variable is often used before a Principal Component Analysis (PCA)¹ when the variables of a dataset have different units. Scaling a variable means computing its mean and standard deviation, and then transforming each value (so each row) of that variable by subtracting the mean and dividing by the standard deviation. Formally:

\[z = \frac{x - \bar{x}}{s}\]

where \(\bar{x}\) and \(s\) are the mean and the standard deviation of the variable, respectively.

To scale one or more variables in R use scale():

dat_scaled <- scale(dat_imputed)
head(dat_scaled)
##       variable1  variable2
## [1,] -0.1986799 -0.5477226
## [2,]  1.3907590  0.5477226
## [3,] -0.1986799  1.0954451
## [4,] -0.9933993 -1.0954451
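As a sanity check, applying the formula above by hand to variable1 reproduces the first column returned by scale():

(dat_imputed$variable1 - mean(dat_imputed$variable1)) / sd(dat_imputed$variable1)
## [1] -0.1986799  1.3907590 -0.1986799 -0.9933993

Note that sd() (like scale()) computes the standard deviation with the \(n - 1\) denominator.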

Thanks for reading. I hope this article helped you manipulate your data in R.

As always, if you have a question or a suggestion related to the topic covered in this article, please add it as a comment so other readers can benefit from the discussion.


  1. Principal Component Analysis (PCA) is a useful technique for exploratory data analysis, allowing a better visualization of the variation present in a dataset with a large number of variables. When there are many variables, the data cannot easily be illustrated in their raw format. To counter this, the PCA takes a dataset with many variables and simplifies it by transforming the original variables into a smaller number of “principal components”. The first dimension contains the most variance in the dataset and so on, and the dimensions are uncorrelated. Note that PCA is done on quantitative variables.↩︎


