Get Data Matchready

Want to link up your data from different sources? Awesome! Just a heads-up: you’ll probably need to do some cleaning first. Let’s dive in and see how our package makes it easy to get your SGIC (self-generated identification code) data ready.

We’ll start by loading trustmebro:

library(trustmebro)

Data

Our key data set trustmebro::sailor_keys is a longitudinal data set in long format. It is a tibble with 20 rows and 12 columns.

This data should be linked with our survey data trustmebro::sailor_students, a tibble with 12 rows and 6 columns.

Let us take a quick look at the survey data:

print(trustmebro::sailor_students)
#> # A tibble: 12 × 6
#>    sgic             school class   gender  testscore_langauge testscore_calculus
#>    <chr>            <chr>  <chr>   <chr>                <dbl>              <dbl>
#>  1 "MUC__0308"      54321  "3-B "  "Male"                 425                394
#>  2 "HÄT 2701"       22345  "2-A"   "???"                 4596                123
#>  3 "MUK3801"        22345  "  2-B" "Femal…               2456               9485
#>  4 "SAM10"          22345  "3-B"   "Femal…               2345                  3
#>  5 "T0601"          65432  "1-C"   "Femal…               1234                 NA
#>  6 "      UIT3006 " 12345  "3-3"    <NA>                  123                394
#>  7 "@@@@@@"         <NA>   "3_2  " "Femal…                 56               2938
#>  8  <NA>            12345  "3@41"  "   Fe…                986               3948
#>  9 " "              unkown  <NA>   "Femal…                284                205
#> 10 "MOA2210"        12345  " "     "Femal…                105                 21
#> 11 "MUK3801"        22345  "2-B"   "Femal…               9586                934
#> 12 "T0601"          65432  "1-C"   "Femal…                 NA                764

Replace non-alphanumeric characters you don’t want to deal with

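Yep, this data needs cleaning. There’s a lot of unnecessary stuff, like stray whitespace and special characters. You see this all the time in string variables from surveys. We can replace all non-alphanumeric characters in the string variables of our data set trustmebro::sailor_students using trustmebro::purge_string. As the output shows, whitespace is removed, the remaining text comes back in uppercase, and empty or missing values end up as the replacement character too: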

purge_string(sailor_students, replacement = "#")
#> # A tibble: 12 × 6
#>    sgic      school class gender testscore_langauge testscore_calculus
#>    <chr>     <chr>  <chr> <chr>               <dbl>              <dbl>
#>  1 MUC##0308 54321  3#B   MALE                  425                394
#>  2 H#T2701   22345  2#A   ###                  4596                123
#>  3 MUK3801   22345  2#B   FEMALE               2456               9485
#>  4 SAM10     22345  3#B   FEMALE               2345                  3
#>  5 T0601     65432  1#C   FEMALE               1234                 NA
#>  6 UIT3006   12345  3#3   #                     123                394
#>  7 ######    #      3#2   FEMALE                 56               2938
#>  8 #         12345  3#41  FEMALE                986               3948
#>  9 #         UNKOWN #     FEMALE                284                205
#> 10 MOA2210   12345  #     FEMALE                105                 21
#> 11 MUK3801   22345  2#B   FEMALE               9586                934
#> 12 T0601     65432  1#C   FEMALE                 NA                764

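Please note that, because we work with data collected in Germany, this step is meant to leave umlauts unchanged. In the output above, however, the Ä in row 2 ("HÄT 2701") shows up as #, so double-check umlauts in your own results.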
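To get a feel for what this kind of cleaning involves, here is a minimal base R sketch. It is not how trustmebro implements purge_string; clean_strings is a hypothetical helper made up for illustration, and it assumes that only ASCII letters and digits should be kept. It also does not reproduce how the package treats empty or missing values.

```r
# Hypothetical helper, not part of trustmebro: a rough base R approximation
# of the cleaning step shown above (ASCII letters and digits only).
clean_strings <- function(x, replacement = "#") {
  x <- gsub("\\s+", "", x, perl = TRUE)          # remove all whitespace
  x <- toupper(x)                                # convert to uppercase
  gsub("[^A-Z0-9]", replacement, x, perl = TRUE) # replace everything else
}

ids <- c("MUC__0308", "  UIT3006 ", "3_2  ", "   Female")
clean_strings(ids)
#> [1] "MUC##0308" "UIT3006"   "3#2"       "FEMALE"
```

The order matters here: whitespace is removed first so it is not turned into replacement characters, which matches what we see in the package output for values like "3_2  ".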

Recode variables

A few variables need recoding for further analysis. For that, we can provide a recode map:

recode_map <- c(MALE = "M", FEMALE = "F")

The recode_map is a named vector: the names are the categories as they appear in the cleaned data (in this case, “MALE” and “FEMALE”), and the values (“M” and “F”) are the codes to use for those categories. It maps full category labels to shorter, standardized values. We can pass it to trustmebro::recode_valinvec to recode the values accordingly. A new variable containing the recoded values is added; values without an entry in the map, such as “###”, are carried over unchanged:

recode_valinvec(purge_string(sailor_students, replacement = "#"), gender, recode_map, gender_recode)
#> # A tibble: 12 × 7
#>    sgic  school class gender testscore_langauge testscore_calculus gender_recode
#>    <chr> <chr>  <chr> <chr>               <dbl>              <dbl> <chr>        
#>  1 MUC#… 54321  3#B   MALE                  425                394 M            
#>  2 H#T2… 22345  2#A   ###                  4596                123 ###          
#>  3 MUK3… 22345  2#B   FEMALE               2456               9485 F            
#>  4 SAM10 22345  3#B   FEMALE               2345                  3 F            
#>  5 T0601 65432  1#C   FEMALE               1234                 NA F            
#>  6 UIT3… 12345  3#3   #                     123                394 #            
#>  7 ####… #      3#2   FEMALE                 56               2938 F            
#>  8 #     12345  3#41  FEMALE                986               3948 F            
#>  9 #     UNKOWN #     FEMALE                284                205 F            
#> 10 MOA2… 12345  #     FEMALE                105                 21 F            
#> 11 MUK3… 22345  2#B   FEMALE               9586                934 F            
#> 12 T0601 65432  1#C   FEMALE                 NA                764 F
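The lookup behind a recode map can be sketched in a few lines of base R. This is only an illustration of named-vector recoding, not the code inside recode_valinvec; recode_with_map is a hypothetical helper, and leaving unmatched values as they are is an assumption taken from the output above.

```r
# Hypothetical helper, not part of trustmebro: recode by named-vector lookup.
recode_with_map <- function(x, map) {
  hit <- x %in% names(map)       # which values have an entry in the map
  x[hit] <- unname(map[x[hit]])  # look those values up by name
  x                              # everything else stays untouched
}

recode_map <- c(MALE = "M", FEMALE = "F")
gender <- c("MALE", "###", "FEMALE", "#")
recode_with_map(gender, recode_map)
#> [1] "M"   "###" "F"   "#"
```

Because the lookup is by exact name, the map has to use the cleaned, uppercase labels; a map with names like "Male" would not match anything after purge_string has run.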