Introduction to labelled

The labelled package provides functions to manipulate metadata such as variable labels, value labels and defined missing values, using the labelled class and the label attribute introduced in the haven package.

Variable labels

A variable label can be specified for any vector using var_label.

library(labelled)

var_label(iris$Sepal.Length) <- "Length of sepal"

It’s possible to add a variable label to several columns of a data frame using a named list.

var_label(iris) <- list(Petal.Length = "Length of petal", Petal.Width = "Width of Petal")

To get the variable label, simply call var_label.

var_label(iris$Petal.Width)
## [1] "Width of Petal"
var_label(iris)
## $Sepal.Length
## [1] "Length of sepal"
## 
## $Sepal.Width
## NULL
## 
## $Petal.Length
## [1] "Length of petal"
## 
## $Petal.Width
## [1] "Width of Petal"
## 
## $Species
## NULL

To remove a variable label, use NULL.

var_label(iris$Sepal.Length) <- NULL

In RStudio, variable labels are displayed in the data viewer.

View(iris)
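As noted in the introduction, variable labels rely on the label attribute. The following is a minimal base-R sketch of that storage, assuming var_label simply reads and writes this attribute; the vector x is illustrative and not part of the iris example.

```r
# Illustrative vector (hypothetical, not taken from iris).
x <- c(5.1, 4.9, 4.7)

# Assumption: this mirrors what var_label(x) <- "Length of sepal" stores.
attr(x, "label") <- "Length of sepal"
label_before <- attr(x, "label")

# Assigning NULL drops the attribute, which is how removing a label is
# assumed to work under the hood.
attr(x, "label") <- NULL
label_after <- attr(x, "label")
```

Because the label is an ordinary attribute, the underlying data values are left untouched when a label is added or removed.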

Value labels

The first way to create a labelled vector is to use the labelled function. It’s not mandatory to provide a label for each value observed in your vector. You can also provide a label for values not observed.

v <- labelled(c(1,2,2,2,3,9,1,3,2,NA), c(yes = 1, no = 3, "don't know" = 8, refused = 9))
v
## <Labelled double> 
##  [1]  1  2  2  2  3  9  1  3  2 NA
## 
## Labels:
##  value      label is_na
##      1        yes FALSE
##      3         no FALSE
##      8 don't know FALSE
##      9    refused FALSE

Use val_labels to get all value labels and val_label to get the value label associated with a specific value.

val_labels(v)
##        yes         no don't know    refused 
##          1          3          8          9
val_label(v, 8)
## [1] "don't know"

val_labels can also be used to replace all the value labels attached to a vector, while val_label updates only one specific value label.

val_labels(v) <- c(yes = 1, nno = 3, bug = 5)
v
## <Labelled double> 
##  [1]  1  2  2  2  3  9  1  3  2 NA
## 
## Labels:
##  value label is_na
##      1   yes FALSE
##      3   nno FALSE
##      5   bug FALSE
val_label(v, 3) <- "no"
v
## <Labelled double> 
##  [1]  1  2  2  2  3  9  1  3  2 NA
## 
## Labels:
##  value label is_na
##      1   yes FALSE
##      3    no FALSE
##      5   bug FALSE

With val_label, you can also add or remove specific value labels.

val_label(v, 2) <- "maybe"
val_label(v, 5) <- NULL
v
## <Labelled double> 
##  [1]  1  2  2  2  3  9  1  3  2 NA
## 
## Labels:
##  value label is_na
##      1   yes FALSE
##      3    no FALSE
##      2 maybe FALSE

To remove all value labels, use val_labels and NULL. The labelled class will also be removed.

val_labels(v) <- NULL
v
##  [1]  1  2  2  2  3  9  1  3  2 NA

Adding a value label to a non-labelled vector will apply the labelled class to it.

val_label(v, 1) <- "yes"
v
## <Labelled double> 
##  [1]  1  2  2  2  3  9  1  3  2 NA
## 
## Labels:
##  value label is_na
##      1   yes FALSE

Note that applying val_labels to a factor will have no effect!

f <- factor(1:3)
f
## [1] 1 2 3
## Levels: 1 2 3
val_labels(f) <- c(yes = 1, no = 3)
f
## [1] 1 2 3
## Levels: 1 2 3

You can also apply value labels to several columns of a data frame.

df <- data.frame(v1 = 1:3, v2 = c(2, 3, 1), v3 = 3:1)

val_label(df, 1) <- "yes"
val_label(df[, c("v1", "v3")], 2) <- "maybe"
val_label(df[, c("v2", "v3")], 3) <- "no"
val_labels(df)
## $v1
##   yes maybe 
##     1     2 
## 
## $v2
## yes  no 
##   1   3 
## 
## $v3
##   yes maybe    no 
##     1     2     3
val_labels(df[, c("v1", "v3")]) <- c(YES = 1, MAYBE = 2, NO = 3)
val_labels(df)
## $v1
##   YES MAYBE    NO 
##     1     2     3 
## 
## $v2
## yes  no 
##   1   3 
## 
## $v3
##   YES MAYBE    NO 
##     1     2     3
val_labels(df) <- NULL
val_labels(df)
## $v1
## NULL
## 
## $v2
## NULL
## 
## $v3
## NULL
val_labels(df) <- list(v1 = c(yes = 1, no = 3), v2 = c(a = 1, b = 2, c = 3))
val_labels(df)
## $v1
## yes  no 
##   1   3 
## 
## $v2
## a b c 
## 1 2 3 
## 
## $v3
## NULL

Missing values

It is possible to define some values that should be considered as missing (and would probably be later converted into NA).

The way the labelled class stores missing values requires each missing value to have an associated value label.

With the labelled function, you can specify which value labels should be considered as missing values.

v <- labelled(c(1,2,2,2,3,9,1,3,2,NA), c(yes = 1, no = 3, "don't know" = 8, refused = 9), c(FALSE, FALSE, TRUE, TRUE))
v
## <Labelled double> 
##  [1]  1  2  2  2  3  9  1  3  2 NA
## 
## Labels:
##  value      label is_na
##      1        yes FALSE
##      3         no FALSE
##      8 don't know  TRUE
##      9    refused  TRUE

You can get and modify the list of missing values with missing_val.

missing_val(v)
## don't know    refused 
##          8          9
missing_val(v) <- 9
v
## <Labelled double> 
##  [1]  1  2  2  2  3  9  1  3  2 NA
## 
## Labels:
##  value      label is_na
##      1        yes FALSE
##      3         no FALSE
##      8 don't know FALSE
##      9    refused  TRUE
missing_val(v) <- NULL
v
## <Labelled double> 
##  [1]  1  2  2  2  3  9  1  3  2 NA
## 
## Labels:
##  value      label is_na
##      1        yes FALSE
##      3         no FALSE
##      8 don't know FALSE
##      9    refused FALSE
missing_val(v) <- c(8, 9)
v
## <Labelled double> 
##  [1]  1  2  2  2  3  9  1  3  2 NA
## 
## Labels:
##  value      label is_na
##      1        yes FALSE
##      3         no FALSE
##      8 don't know  TRUE
##      9    refused  TRUE

If you try to define as missing a value that doesn’t have an attached value label, you’ll get an error.

missing_val(v) <- c(7, 8, 9)
## Error: no value label found for 7, please specify `force`

With the force argument, you can specify what should be done. If force = FALSE, only values that already have a value label will be considered as missing. If force = TRUE, a value label will be created automatically.

missing_val(v, force = FALSE) <- c(7, 8, 9)
v
## <Labelled double> 
##  [1]  1  2  2  2  3  9  1  3  2 NA
## 
## Labels:
##  value      label is_na
##      1        yes FALSE
##      3         no FALSE
##      8 don't know  TRUE
##      9    refused  TRUE
missing_val(v, force = TRUE) <- c(7, 8, 9)
v
## <Labelled double> 
##  [1]  1  2  2  2  3  9  1  3  2 NA
## 
## Labels:
##  value      label is_na
##      1        yes FALSE
##      3         no FALSE
##      8 don't know  TRUE
##      9    refused  TRUE
##      7          7  TRUE

You need to be aware that if you remove a value label considered as missing, the attached value will not be considered as missing anymore.

missing_val(v)
## don't know    refused          7 
##          8          9          7
val_label(v, 7) <- NULL
missing_val(v)
## don't know    refused 
##          8          9

Sorting value labels

By default, value labels are kept in the order in which they were created.

v <- c(1,2,2,2,3,9,1,3,2,NA)
val_label(v, 1) <- "yes"
val_label(v, 3) <- "no"
val_label(v, 9) <- "refused"
val_label(v, 2) <- "maybe"
val_label(v, 8) <- "don't know"
v
## <Labelled double> 
##  [1]  1  2  2  2  3  9  1  3  2 NA
## 
## Labels:
##  value      label is_na
##      1        yes FALSE
##      3         no FALSE
##      9    refused FALSE
##      2      maybe FALSE
##      8 don't know FALSE

It could be useful to reorder the value labels according to their attached values.

sort_val_labels(v)
## <Labelled double> 
##  [1]  1  2  2  2  3  9  1  3  2 NA
## 
## Labels:
##  value      label is_na
##      1        yes FALSE
##      2      maybe FALSE
##      3         no FALSE
##      8 don't know FALSE
##      9    refused FALSE
sort_val_labels(v, decreasing = TRUE)
## <Labelled double> 
##  [1]  1  2  2  2  3  9  1  3  2 NA
## 
## Labels:
##  value      label is_na
##      9    refused FALSE
##      8 don't know FALSE
##      3         no FALSE
##      2      maybe FALSE
##      1        yes FALSE

If you prefer, you can also sort them according to the labels.

sort_val_labels(v, according_to = "l")
## <Labelled double> 
##  [1]  1  2  2  2  3  9  1  3  2 NA
## 
## Labels:
##  value      label is_na
##      8 don't know FALSE
##      2      maybe FALSE
##      3         no FALSE
##      9    refused FALSE
##      1        yes FALSE
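The two orderings can be sketched in base R on a plain named vector of value labels. This only illustrates the ordering rules; it is not the implementation of sort_val_labels, and the variable names are illustrative.

```r
# Value labels in creation order, as in the example above.
labels <- c(yes = 1, no = 3, refused = 9, maybe = 2, "don't know" = 8)

# Ordering by attached value (ascending and descending); names are kept.
by_value <- sort(labels)
by_value_desc <- sort(labels, decreasing = TRUE)

# Ordering by label text.
by_label <- labels[order(names(labels))]
```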

Converting to NA

R represents missing values internally as NA. missing_to_na converts all values defined as missing into NA.

v <- labelled(c(1,2,2,2,3,9,1,3,2,NA), c(yes = 1, no = 3, "don't know" = 8, refused = 9), c(FALSE, FALSE, TRUE, TRUE))
v
## <Labelled double> 
##  [1]  1  2  2  2  3  9  1  3  2 NA
## 
## Labels:
##  value      label is_na
##      1        yes FALSE
##      3         no FALSE
##      8 don't know  TRUE
##      9    refused  TRUE
missing_to_na(v)
## <Labelled double> 
##  [1]  1  2  2  2  3 NA  1  3  2 NA
## 
## Labels:
##  value label is_na
##      1   yes FALSE
##      3    no FALSE

In some cases, values that don’t have an attached value label could also be considered as missing. nolabel_to_na will convert them to NA.

nolabel_to_na(v)
## <Labelled double> 
##  [1]  1 NA NA NA  3  9  1  3 NA NA
## 
## Labels:
##  value      label is_na
##      1        yes FALSE
##      3         no FALSE
##      8 don't know  TRUE
##      9    refused  TRUE

Finally, in some cases, value labels are attached only to the specific values that correspond to missing values. For example:

size <- labelled(c(1.88, 1.62, 1.78, 99, 1.91), c("not measured" = 99))
size
## <Labelled double> 
## [1]  1.88  1.62  1.78 99.00  1.91
## 
## Labels:
##  value        label is_na
##     99 not measured FALSE

In such cases, val_labels_to_na could be appropriate.

val_labels_to_na(size)
## [1] 1.88 1.62 1.78   NA 1.91

These three functions can also be applied to a whole data frame. Only labelled vectors will be affected.
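The rules behind nolabel_to_na and val_labels_to_na can be sketched in base R on plain vectors. This is an illustration of the selection logic only, under the assumption that it matches the behaviour shown above; it is not the package's code, and the intermediate names are illustrative.

```r
values <- c(1, 2, 2, 2, 3, 9, 1, 3, 2, NA)
labels <- c(yes = 1, no = 3, "don't know" = 8, refused = 9)

# Rule of nolabel_to_na: values WITHOUT a label become NA.
no_label <- values
no_label[!(no_label %in% labels)] <- NA

# Rule of val_labels_to_na: values WITH a label become NA.
size <- c(1.88, 1.62, 1.78, 99, 1.91)
size_labels <- c("not measured" = 99)
size[size %in% size_labels] <- NA
```

The first result matches the values printed for nolabel_to_na(v), and the second matches those printed for val_labels_to_na(size).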

Converting to factor

A labelled vector can easily be converted to a factor with as_factor.

v <- labelled(c(1,2,2,2,3,9,1,3,2,NA), c(yes = 1, no = 3, "don't know" = 8, refused = 9), c(FALSE, FALSE, TRUE, TRUE))
v
## <Labelled double> 
##  [1]  1  2  2  2  3  9  1  3  2 NA
## 
## Labels:
##  value      label is_na
##      1        yes FALSE
##      3         no FALSE
##      8 don't know  TRUE
##      9    refused  TRUE
as_factor(v)
##  [1] yes     2       2       2       no      refused yes     no     
##  [9] 2       <NA>   
## Levels: yes 2 no don't know refused

The levels argument specifies what should be used as the factor levels: the labels (default), the values, or the labels prefixed with values.

as_factor(v, levels = "v")
##  [1] 1    2    2    2    3    9    1    3    2    <NA>
## Levels: 1 2 3 8 9
as_factor(v, levels = "p")
##  [1] [1] yes     [2] 2       [2] 2       [2] 2       [3] no     
##  [6] [9] refused [1] yes     [3] no      [2] 2       <NA>       
## Levels: [1] yes [2] 2 [3] no [8] don't know [9] refused

The ordered argument will create an ordinal factor.

as_factor(v, ordered = TRUE)
##  [1] yes     2       2       2       no      refused yes     no     
##  [9] 2       <NA>   
## Levels: yes < 2 < no < don't know < refused

The arguments missing_to_na and nolabel_to_na specify whether the corresponding functions should be applied before converting to a factor. The following two commands are therefore equivalent.

as_factor(v, missing_to_na = TRUE)
##  [1] yes  2    2    2    no   <NA> yes  no   2    <NA>
## Levels: yes 2 no
as_factor(missing_to_na(v))
##  [1] yes  2    2    2    no   <NA> yes  no   2    <NA>
## Levels: yes 2 no

sort_levels specifies how the levels should be sorted: "none" keeps the order in which value labels were defined, "values" orders the levels according to the values, and "labels" orders them according to the labels. "auto" (default) is equivalent to "none" unless some values with no attached label are found and not dropped, in which case "values" is used.

as_factor(v, sort_levels = "n")
##  [1] yes     2       2       2       no      refused yes     no     
##  [9] 2       <NA>   
## Levels: yes no don't know refused 2
as_factor(v, sort_levels = "v")
##  [1] yes     2       2       2       no      refused yes     no     
##  [9] 2       <NA>   
## Levels: yes 2 no don't know refused
as_factor(v, sort_levels = "l")
##  [1] yes     2       2       2       no      refused yes     no     
##  [9] 2       <NA>   
## Levels: 2 don't know no refused yes

The function as_labelled can be used to turn a factor into a labelled numeric vector.

f <- factor(1:3, labels = c("a", "b", "c"))
as_labelled(f)
## <Labelled double> 
## [1] 1 2 3
## 
## Labels:
##  value label is_na
##      1     a FALSE
##      2     b FALSE
##      3     c FALSE

Note that as_labelled(as_factor(v)) will not be equal to v due to the way factors are stored internally by R.

v
## <Labelled double> 
##  [1]  1  2  2  2  3  9  1  3  2 NA
## 
## Labels:
##  value      label is_na
##      1        yes FALSE
##      3         no FALSE
##      8 don't know  TRUE
##      9    refused  TRUE
as_labelled(as_factor(v))
## <Labelled double> 
##  [1]  1  2  2  2  3  5  1  3  2 NA
## 
## Labels:
##  value      label is_na
##      1        yes FALSE
##      2          2 FALSE
##      3         no FALSE
##      4 don't know FALSE
##      5    refused FALSE
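The reason is that a factor stores consecutive integer codes (1, 2, 3, ...) pointing to its levels, so the original values are not kept. A short base-R sketch with illustrative values:

```r
# Original values 1, 3 and 9 with their labels as factor levels.
f <- factor(c(1, 3, 9), levels = c(1, 3, 9),
            labels = c("yes", "no", "refused"))

# The internal codes are positions in levels(f), not the original values.
codes <- as.integer(f)
codes
## [1] 1 2 3
levels(f)
## [1] "yes"     "no"      "refused"
```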

Importing labelled data

In the haven package, read_spss, read_stata and read_sas natively import data using the labelled class and the label attribute for variable labels.
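A minimal sketch of an import with haven follows. The path "file.sav" is a placeholder, and the guard only makes the snippet safe to run when the packages or the file are absent.

```r
# "file.sav" is a placeholder path, not a file shipped with the package.
path <- "file.sav"

if (requireNamespace("haven", quietly = TRUE) &&
    requireNamespace("labelled", quietly = TRUE) &&
    file.exists(path)) {
  # read_spss returns a data frame whose columns carry the metadata.
  df <- haven::read_spss(path)
  # Inspect the imported variable labels.
  labelled::var_label(df)
}
```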

Functions from the foreign package can also import some metadata from SPSS and Stata files. to_labelled can convert data imported with foreign into a labelled data frame. However, there are some limitations compared to using haven:

  • For SPSS files, it is better to set use.value.labels = FALSE, to.data.frame = FALSE and use.missings = FALSE when calling read.spss. If use.value.labels = TRUE, variables with value labels will be converted into factors by read.spss (and kept as factors by to_labelled). If to.data.frame = TRUE, metadata describing the missing values will not be imported. If use.missings = TRUE, missing values will already have been converted to NA by read.spss.
  • For Stata files, set convert.factors = FALSE when calling read.dta to avoid conversion of variables with value labels into factors. So far, missing values defined in Stata are always imported as NA by read.dta and cannot be retrieved by to_labelled.

The memisc package provides functions to import variable metadata and store them in specific objects of class data.set. The to_labelled method can convert a data.set into a labelled data frame.

# from foreign
library(foreign)
df <- to_labelled(read.spss(
  "file.sav",
  to.data.frame = FALSE,
  use.value.labels = FALSE,
  use.missings = FALSE
))
df <- to_labelled(read.dta(
  "file.dta",
  convert.factors = FALSE
))

# from memisc
library(memisc)
nes1948.por <- UnZip("anes/NES1948.ZIP", "NES1948.POR", package = "memisc")
nes1948 <- spss.portable.file(nes1948.por)
df <- to_labelled(nes1948)
ds <- as.data.set(nes1948)
df <- to_labelled(ds)