Uvod


The best thing to do with missing data is to not have any

—Gertrude Mary Cox


Bavićemo se sledećim:

Podaci koji nedostaju su vrednosti koje su trebale biti zabeležene, ali nisu.

NA = Not Available.

Kako proveriti da li imamo nedostajuće podatke?

x <- c(1, NA, 3, NA, NA, 5)

library(naniar)
any_na(x)
[1] TRUE
are_na(x)
[1] FALSE  TRUE FALSE  TRUE  TRUE FALSE
n_miss(x)
[1] 3
prop_miss(x)
[1] 0.5

Podaci koji nedostaju u kalkulacijama?

Generalno pravilo: NA + [anything] = NA

visineDF <- data.frame(Srdjan = c(182), Zoran = c(188), Milan = c(NA))

sum(visineDF)
[1] NA

Neke začkoljice u radu sa podacima koji nedostaju

NaN: Not a Number — R tumači NaN kao podatak koji nedostaje.
NULL — je prazna vrednost, i R je ne tretira kao nedostajuću.
Inf — je beskonačna vrednost, i R je ne tretira kao nedostajuću.

any_na(NaN)
[1] TRUE
any_na(NULL)
[1] FALSE
any_na(Inf)
[1] FALSE

Podaci koji nedostaju u uslovnim izjavama

NA | TRUE
[1] TRUE
NA | FALSE
[1] NA
NA + NaN
[1] NA
NaN + NA
[1] NaN

Korišćenje i pronalaženje nedostajućih podataka

# Create x, a vector, with values NA, NaN, Inf, ".", and "missing"
x <- c(NA, NaN, Inf, ".", "missing")

# Use any_na() and are_na() on to explore the missings
any_na(x)
[1] TRUE
are_na(x)
[1]  TRUE FALSE FALSE FALSE FALSE

Jedna od prvih stvari koje ćete želeti da uradite sa novim datasetom jeste provera da li, i koliko nedostajućih vrednosti poostoji.

library(readr)

dat_hw <- read_csv(here::here("data-raw","data_hw.csv"), 
                    col_types = cols(height = col_double(), weight = col_double()))


# Use n_miss() to count the total number of missing values in dat_hw


dat_hw
# A tibble: 100 x 2
   weight height
    <dbl>  <dbl>
 1  104.  -1.15 
 2   77.1  2.79 
 3  108.   1.81 
 4   85.0  1.71 
 5   87.4  0.918
 6   87.4  0.249
 7   87.2  1.94 
 8  124.   2.24 
 9   82.8  2.58 
10   NA   NA    
# ... with 90 more rows
glimpse(dat_hw)
Observations: 100
Variables: 2
$ weight <dbl> 104.16064, 77.05248, 107.69692, 85.03788, 87.44273, 87.38751...
$ height <dbl> -1.1468301, 2.7899022, 1.8085469, 1.7118845, 0.9179269, 0.24...
n_miss(dat_hw)
[1] 30
# Use n_miss() on dat_hw$weight to count the total number of missing values
n_miss(dat_hw$weight)
[1] 15
# Use n_complete() on dat_hw to count the total number of complete values
n_complete(dat_hw)
[1] 170
# Use n_complete() on dat_hw$weight to count the total number of complete values
n_complete(dat_hw$weight)
[1] 85
# Use prop_miss() and prop_complete on dat_hw to count the total number of missing values in each of the variables
prop_miss(dat_hw)
[1] 0.15
prop_complete(dat_hw)
[1] 0.85
pct_complete(dat_hw)
[1] 85
pct_miss(dat_hw)
[1] 15

Vežba — koji će biti rezultat sledećih izjava?

1 + NA
NA + NA
NA | TRUE
NA | FALSE

Sumarni izveštaji za nedostajuće podatke

Potrebni su nam sumarni izveštaji/pregledi/rezimei za:

pošto sve gore navedeno može uticati nepovoljno na dalju analizu podataka.

Prema strukturi rezimea postoje:

Paket naniar poseduje familiju funkcija čija imena počinju sa miss_. Svaka od ovih funkcija daje različite preglede nedostajućih podataka, i kao izlaz daje dataframe. Ove funkcije takođe rade sa dplyr funkcijom group_by, tako da možete istražiti obrasce nedostajanja po grupama.

head(airquality)
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6
glimpse(airquality)
Observations: 153
Variables: 6
$ Ozone   <int> 41, 36, 12, 18, NA, 28, 23, 19, 8, NA, 7, 16, 11, 14, 18, 1...
$ Solar.R <int> 190, 118, 149, 313, NA, NA, 299, 99, 19, 194, NA, 256, 290,...
$ Wind    <dbl> 7.4, 8.0, 12.6, 11.5, 14.3, 14.9, 8.6, 13.8, 20.1, 8.6, 6.9...
$ Temp    <int> 67, 72, 74, 62, 56, 66, 65, 59, 61, 69, 74, 69, 66, 68, 58,...
$ Month   <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,...
$ Day     <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, ...
miss_var_summary(airquality)
# A tibble: 6 x 3
  variable n_miss pct_miss
  <chr>     <int>    <dbl>
1 Ozone        37    24.2 
2 Solar.R       7     4.58
3 Wind          0     0   
4 Temp          0     0   
5 Month         0     0   
6 Day           0     0   
miss_case_summary(airquality)
# A tibble: 153 x 3
    case n_miss pct_miss
   <int>  <int>    <dbl>
 1     5      2     33.3
 2    27      2     33.3
 3     6      1     16.7
 4    10      1     16.7
 5    11      1     16.7
 6    25      1     16.7
 7    26      1     16.7
 8    32      1     16.7
 9    33      1     16.7
10    34      1     16.7
# ... with 143 more rows

Tabulacija nedostajućih podataka prikazuje broj promenljivih/zapisa sa 0, 1, 2, ..., n nedostajućih vrednosti i udeo u ukupnom broju promenljivih/zapisa koje ove promenljive/zapisi čine.

miss_var_table(airquality)
# A tibble: 3 x 3
  n_miss_in_var n_vars pct_vars
          <int>  <int>    <dbl>
1             0      4     66.7
2             7      1     16.7
3            37      1     16.7
miss_case_table(airquality)
# A tibble: 3 x 3
  n_miss_in_case n_cases pct_cases
           <int>   <int>     <dbl>
1              0     111     72.5 
2              1      40     26.1 
3              2       2      1.31

Takođe možemo istražiti nedostajanje po rasponima podataka. Ovo je veoma korisno za vremenske serije. miss_var_span daje izveštaj o nedostajućim podacima u obliku dataframe-a za svaki span, dok miss_var_run vraća dataframe sa dve nove kolone run_length i is_na koje opisuju complete/missing serije nedostajanja podataka. Ovo je korisno za istraživanje ponavljanja u obrascu nedostajanja.

head(pedestrian)
# A tibble: 6 x 9
  hourly_counts date_time            year month month_day week_day  hour
          <int> <dttm>              <int> <ord>     <int> <ord>    <int>
1           883 2016-01-01 00:00:00  2016 Janu~         1 Friday       0
2           597 2016-01-01 01:00:00  2016 Janu~         1 Friday       1
3           294 2016-01-01 02:00:00  2016 Janu~         1 Friday       2
4           183 2016-01-01 03:00:00  2016 Janu~         1 Friday       3
5           118 2016-01-01 04:00:00  2016 Janu~         1 Friday       4
6            68 2016-01-01 05:00:00  2016 Janu~         1 Friday       5
# ... with 2 more variables: sensor_id <int>, sensor_name <chr>
glimpse(pedestrian)
Observations: 37,700
Variables: 9
$ hourly_counts <int> 883, 597, 294, 183, 118, 68, 47, 52, 120, 333, 761, 1...
$ date_time     <dttm> 2016-01-01 00:00:00, 2016-01-01 01:00:00, 2016-01-01...
$ year          <int> 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016,...
$ month         <ord> January, January, January, January, January, January,...
$ month_day     <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
$ week_day      <ord> Friday, Friday, Friday, Friday, Friday, Friday, Frida...
$ hour          <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,...
$ sensor_id     <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,...
$ sensor_name   <chr> "Bourke Street Mall (South)", "Bourke Street Mall (So...
miss_var_span(pedestrian, var = hourly_counts, span_every = 4000)
# A tibble: 10 x 5
   span_counter n_miss n_complete prop_miss prop_complete
          <int>  <int>      <dbl>     <dbl>         <dbl>
 1            1      0       4000   0               1    
 2            2      1       3999   0.00025         1.00 
 3            3    121       3879   0.0302          0.970
 4            4    503       3497   0.126           0.874
 5            5    745       3255   0.186           0.814
 6            6      0       4000   0               1    
 7            7      1       3999   0.00025         1.00 
 8            8      0       4000   0               1    
 9            9    745       3255   0.186           0.814
10           10    432       3568   0.108           0.892
miss_var_run(pedestrian, hourly_counts)
# A tibble: 35 x 2
   run_length is_na   
        <int> <chr>   
 1       6628 complete
 2          1 missing 
 3       5250 complete
 4        624 missing 
 5       3652 complete
 6          1 missing 
 7       1290 complete
 8        744 missing 
 9       7420 complete
10          1 missing 
# ... with 25 more rows

Nekada nas interesuju nedostajanja u grupama podataka. Svaka od gornjih funkcija može raditi sa group_by funkcijom.

airquality %>%
  group_by(Month) %>%
  miss_var_summary()
# A tibble: 25 x 4
# Groups:   Month [5]
   Month variable n_miss pct_miss
   <int> <chr>     <int>    <dbl>
 1     5 Ozone         5     16.1
 2     5 Solar.R       4     12.9
 3     5 Wind          0      0  
 4     5 Temp          0      0  
 5     5 Day           0      0  
 6     6 Ozone        21     70  
 7     6 Solar.R       0      0  
 8     6 Wind          0      0  
 9     6 Temp          0      0  
10     6 Day           0      0  
# ... with 15 more rows

Vežbanje rezimiranja nedostajućih podataka

# Summarise missingness in each variable of the `airquality` dataset
miss_var_summary(airquality)
# A tibble: 6 x 3
  variable n_miss pct_miss
  <chr>     <int>    <dbl>
1 Ozone        37    24.2 
2 Solar.R       7     4.58
3 Wind          0     0   
4 Temp          0     0   
5 Month         0     0   
6 Day           0     0   
# Summarise missingness in each case of the `airquality` dataset
miss_case_summary(airquality)
# A tibble: 153 x 3
    case n_miss pct_miss
   <int>  <int>    <dbl>
 1     5      2     33.3
 2    27      2     33.3
 3     6      1     16.7
 4    10      1     16.7
 5    11      1     16.7
 6    25      1     16.7
 7    26      1     16.7
 8    32      1     16.7
 9    33      1     16.7
10    34      1     16.7
# ... with 143 more rows
# Return the summary of missingness in each variable, grouped by Month, in the `airquality` dataset
airquality %>% group_by(Month) %>% miss_var_summary()
# A tibble: 25 x 4
# Groups:   Month [5]
   Month variable n_miss pct_miss
   <int> <chr>     <int>    <dbl>
 1     5 Ozone         5     16.1
 2     5 Solar.R       4     12.9
 3     5 Wind          0      0  
 4     5 Temp          0      0  
 5     5 Day           0      0  
 6     6 Ozone        21     70  
 7     6 Solar.R       0      0  
 8     6 Wind          0      0  
 9     6 Temp          0      0  
10     6 Day           0      0  
# ... with 15 more rows
# Return the summary of missingness in each case, grouped by Month, in the `airquality` dataset
airquality %>% group_by(Month) %>% miss_case_summary()
# A tibble: 153 x 4
# Groups:   Month [5]
   Month  case n_miss pct_miss
   <int> <int>  <int>    <dbl>
 1     5     5      2       40
 2     5    27      2       40
 3     5     6      1       20
 4     5    10      1       20
 5     5    11      1       20
 6     5    25      1       20
 7     5    26      1       20
 8     5     1      0        0
 9     5     2      0        0
10     5     3      0        0
# ... with 143 more rows

Vežbanje tabulacije nedostajanja

# Tabulate missingness in each variable and case of the `airquality` dataset
miss_var_table(airquality)
# A tibble: 3 x 3
  n_miss_in_var n_vars pct_vars
          <int>  <int>    <dbl>
1             0      4     66.7
2             7      1     16.7
3            37      1     16.7
miss_case_table(airquality)
# A tibble: 3 x 3
  n_miss_in_case n_cases pct_cases
           <int>   <int>     <dbl>
1              0     111     72.5 
2              1      40     26.1 
3              2       2      1.31
# Tabulate the missingness in each variable, grouped by Month, in the `airquality` dataset
airquality %>% group_by(Month) %>% miss_var_table()
# A tibble: 12 x 4
# Groups:   Month [5]
   Month n_miss_in_var n_vars pct_vars
   <int>         <int>  <int>    <dbl>
 1     5             0      3       60
 2     5             4      1       20
 3     5             5      1       20
 4     6             0      4       80
 5     6            21      1       20
 6     7             0      4       80
 7     7             5      1       20
 8     8             0      3       60
 9     8             3      1       20
10     8             5      1       20
11     9             0      4       80
12     9             1      1       20
# Tabulate of missingness in each case, grouped by Month, in the `airquality` dataset
airquality %>% group_by(Month) %>% miss_case_table()
# A tibble: 11 x 4
# Groups:   Month [5]
   Month n_miss_in_case n_cases pct_cases
   <int>          <int>   <int>     <dbl>
 1     5              0      24     77.4 
 2     5              1       5     16.1 
 3     5              2       2      6.45
 4     6              0       9     30   
 5     6              1      21     70   
 6     7              0      26     83.9 
 7     7              1       5     16.1 
 8     8              0      23     74.2 
 9     8              1       8     25.8 
10     9              0      29     96.7 
11     9              1       1      3.33

Vežbanje ostalih pregleda

# Calculate the summaries for each run of missingness for the variable, hourly_counts
miss_var_run(pedestrian, var = hourly_counts)
# A tibble: 35 x 2
   run_length is_na   
        <int> <chr>   
 1       6628 complete
 2          1 missing 
 3       5250 complete
 4        624 missing 
 5       3652 complete
 6          1 missing 
 7       1290 complete
 8        744 missing 
 9       7420 complete
10          1 missing 
# ... with 25 more rows
# Calculate the summaries for each span of missingness, for a span of 4000, for the variable hourly_counts
miss_var_span(pedestrian, var = hourly_counts, span_every = 4000)
# A tibble: 10 x 5
   span_counter n_miss n_complete prop_miss prop_complete
          <int>  <int>      <dbl>     <dbl>         <dbl>
 1            1      0       4000   0               1    
 2            2      1       3999   0.00025         1.00 
 3            3    121       3879   0.0302          0.970
 4            4    503       3497   0.126           0.874
 5            5    745       3255   0.186           0.814
 6            6      0       4000   0               1    
 7            7      1       3999   0.00025         1.00 
 8            8      0       4000   0               1    
 9            9    745       3255   0.186           0.814
10           10    432       3568   0.108           0.892
# For each `month` variable, calculate the run of missingness for hourly_counts
pedestrian %>% group_by(month) %>% miss_var_run(var = hourly_counts)
# A tibble: 51 x 3
# Groups:   month [12]
   month    run_length is_na   
   <ord>         <int> <chr>   
 1 January        2976 complete
 2 February       2784 complete
 3 March          2976 complete
 4 April           888 complete
 5 April           552 missing 
 6 April          1440 complete
 7 May             744 complete
 8 May              72 missing 
 9 May            2160 complete
10 June           2880 complete
# ... with 41 more rows
# For each `month` variable, calculate the span of missingness of a span of 2000, for the variable hourly_counts
pedestrian %>% group_by(month) %>% miss_var_span(var = hourly_counts, span_every = 2000)
# A tibble: 25 x 6
# Groups:   month [12]
   month    span_counter n_miss n_complete prop_miss prop_complete
   <ord>           <int>  <int>      <dbl>     <dbl>         <dbl>
 1 January             1      0       2000     0             1    
 2 January             2      0       2000     0             1    
 3 February            1      0       2000     0             1    
 4 February            2      0       2000     0             1    
 5 March               1      0       2000     0             1    
 6 March               2      0       2000     0             1    
 7 April               1    552       1448     0.276         0.724
 8 April               2      0       2000     0             1    
 9 May                 1     72       1928     0.036         0.964
10 May                 2      0       2000     0             1    
# ... with 15 more rows

Vizualizacija nedostajućih podataka

vis_miss daje Heatmap nedostajućih podataka. Argument cluster = TRUE specifikuje da želimo pregled na osnovu hijerarhijskog klasterovanja (Mcquitty metoda) da bi sortirali redove prema nedostajanjima.

library(visdat)
vis_miss(airquality)

vis_miss(airquality, cluster = TRUE)

Vizualizacija po promenljivim i zapisima. Kod gg_miss_case svaka linija predstavlja količinu nedostajanja u tom zapisu. Ove vizualizacije su sortirane tako da je najveći broj nedostajanja na vrhu. Takođe je dozvoljen faceting na osnovu nivoa zadate promenljive.

gg_miss_var(airquality)

gg_miss_case(airquality)

gg_miss_case(airquality, order_cases = FALSE)

gg_miss_var(airquality, facet = Month)

Vizualizacija obrazaca nedostajanja. Da bi istražili česte obrasce nedostajanja (koje promenljive i zapisi istovremeno nestaju) koristićemo gg_miss_upset. Ova funkcija pokazuje broj kombinacija nedostajućih vrednosti koje se javljaju zajedno ili istovremeno. Na donjem grafiku možemo videti da nedostaju vrednosti samo za Solar.R i Ozone. 35 vrednosti nedostaje samo za Ozone, 5 vrednosti nedostaje samo za Solar.R, i 2 vrednosti nedostaju istovremeno za Solar.R i Ozone.

gg_miss_upset(airquality)

Da bi istražili kako se nedostajanje podataka za promenljive menja u zavisnosti od vrednosti zadate kategorijske promenljive koristimo gg_miss_fct. Ova funkcija ne podržava faceting.

gg_miss_fct(x = airquality, fct = Month)

gg_miss_span je vizualni analog funkcije miss_var_span. Recimo broj nedostajanja na svakih 3000 redova. Ova funkcija podržava faceting.

gg_miss_span(pedestrian, hourly_counts, span_every = 3000)

Vežba — vizualizacija nedostajanja.

glimpse(riskfactors)
Observations: 245
Variables: 34
$ state            <fct> 26, 40, 72, 42, 32, 19, 45, 56, 18, 8, 55, 48, 26,...
$ sex              <fct> Female, Female, Female, Male, Female, Male, Male, ...
$ age              <int> 49, 48, 55, 42, 66, 66, 37, 62, 38, 42, 41, 64, 58...
$ weight_lbs       <int> 190, 170, 163, 230, 135, 165, 150, 170, 146, 260, ...
$ height_inch      <int> 64, 68, 64, 74, 62, 70, 68, 70, 70, 73, 72, 66, 62...
$ bmi              <dbl> 32.68, 25.90, 28.04, 29.59, 24.74, 23.72, 22.86, 2...
$ marital          <fct> Married, Divorced, Married, Married, Widowed, Marr...
$ pregnant         <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, No, NA, NA, NA...
$ children         <int> 0, 0, 0, 1, 0, 0, 3, 0, 2, 3, 2, 0, 0, 0, 0, 0, 0,...
$ education        <fct> 6, 5, 4, 6, 5, 5, 6, 6, 4, 5, 6, 4, 4, 3, 4, 4, 4,...
$ employment       <fct> 2, 1, 5, 1, 1, 7, 2, 7, 1, 3, 1, 1, 1, 5, 1, 1, 2,...
$ income           <fct> 35-50k, 35-50k, <10k, >75k, 50-75k, 35-50k, >75k, ...
$ veteran          <fct> 5, 5, 5, 5, 5, 3, 5, 5, 5, 5, 5, 3, 5, 5, 5, 5, 5,...
$ hispanic         <fct> No, No, Yes, No, No, No, No, No, No, No, No, No, N...
$ health_general   <fct> Good, Fair, Fair, Excellent, Excellent, Excellent,...
$ health_physical  <int> 3, 4, 0, 0, 0, 0, 0, 30, 30, 0, 0, 0, 5, 0, 0, 3, ...
$ health_mental    <int> 15, 30, 0, 0, 0, 0, 0, 30, 30, 20, 0, 2, 0, 0, 30,...
$ health_poor      <int> 2, 3, NA, NA, NA, NA, NA, 30, 14, 4, NA, 0, 3, NA,...
$ health_cover     <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, No, Yes, Yes, Y...
$ provide_care     <fct> No, No, No, No, No, No, Yes, No, No, No, No, No, N...
$ activity_limited <fct> Yes, No, No, No, No, No, No, No, Yes, NA, No, No, ...
$ drink_any        <fct> No, No, No, Yes, No, No, Yes, No, No, NA, Yes, Yes...
$ drink_days       <int> NA, NA, NA, 15, NA, NA, 2, NA, NA, NA, 4, 3, 1, NA...
$ drink_average    <int> NA, NA, NA, NA, NA, NA, 2, NA, NA, NA, 2, 5, 1, NA...
$ smoke_100        <fct> No, No, No, No, Yes, No, No, Yes, Yes, Yes, No, Ye...
$ smoke_days       <fct> NA, NA, NA, NA, Everyday, NA, NA, Not@All, Everyda...
$ smoke_stop       <fct> NA, NA, NA, NA, Yes, NA, NA, NA, Yes, NA, NA, NA, ...
$ smoke_last       <fct> NA, NA, NA, NA, NA, NA, NA, 7, NA, 5, NA, 7, 7, NA...
$ diet_fruit       <int> 1095, 52, 36, NA, -7, 24, 52, 156, 24, NA, 365, 10...
$ diet_salad       <int> 261, 209, 156, NA, 261, 52, 156, 24, 84, NA, 156, ...
$ diet_potato      <int> 104, 52, 52, NA, 209, 104, 24, 52, 144, NA, 52, 10...
$ diet_carrot      <int> 156, 0, 24, NA, 261, 52, 24, 104, 24, NA, 156, 0, ...
$ diet_vegetable   <int> 521, 52, 24, NA, 365, 365, 730, 365, 0, NA, 1095, ...
$ diet_juice       <int> 12, 0, 24, NA, 104, 365, 104, 0, 0, NA, 261, 12, 1...
skimr::skim(riskfactors)
Data summary
Name riskfactors
Number of rows 245
Number of columns 34
_______________________
Column type frequency:
factor 18
numeric 16
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
state 0 1.00 FALSE 52 6: 12, 34: 12, 25: 11, 12: 10
sex 0 1.00 FALSE 2 Fem: 153, Mal: 92
marital 1 1.00 FALSE 6 Mar: 131, Wid: 41, Div: 39, Nev: 24
pregnant 215 0.12 FALSE 2 No: 28, Yes: 2
education 1 1.00 FALSE 6 4: 85, 6: 75, 5: 59, 3: 14
employment 0 1.00 FALSE 7 1: 92, 7: 80, 2: 21, 8: 18
income 0 1.00 FALSE 10 >75: 46, 35-: 38, 50-: 32, 25-: 28
veteran 3 0.99 FALSE 5 5: 198, 3: 35, 4: 6, 2: 2
hispanic 2 0.99 FALSE 2 No: 221, Yes: 22
health_general 0 1.00 FALSE 6 Goo: 82, Ver: 56, Exc: 45, Fai: 40
health_cover 0 1.00 FALSE 2 Yes: 217, No: 28
provide_care 3 0.99 FALSE 2 No: 181, Yes: 61
activity_limited 3 0.99 FALSE 2 No: 170, Yes: 72
drink_any 2 0.99 FALSE 2 No: 132, Yes: 111
smoke_100 2 0.99 FALSE 2 No: 126, Yes: 117
smoke_days 128 0.48 FALSE 3 Not: 84, Eve: 26, Som: 7
smoke_stop 212 0.13 FALSE 2 Yes: 19, No: 14
smoke_last 161 0.34 FALSE 6 7: 59, 5: 9, 6: 6, 8: 5

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
age 0 1.00 58.11 17.50 7.00 48.00 59.00 70.00 97.00 ▁▃▇▇▃
weight_lbs 10 0.96 174.27 44.70 96.00 144.00 170.00 195.00 410.00 ▆▇▂▁▁
height_inch 2 0.99 66.35 3.80 57.00 64.00 66.00 69.00 75.00 ▂▇▇▇▃
bmi 11 0.96 27.78 6.57 14.21 23.35 26.66 31.32 55.72 ▃▇▃▁▁
children 0 1.00 0.42 0.88 0.00 0.00 0.00 0.00 4.00 ▇▁▁▁▁
health_physical 0 1.00 4.12 9.11 -9.00 0.00 0.00 3.00 30.00 ▁▇▁▁▁
health_mental 0 1.00 3.14 8.19 -9.00 0.00 0.00 1.00 30.00 ▁▇▁▁▁
health_poor 113 0.54 5.46 10.76 -9.00 0.00 0.00 5.00 30.00 ▁▇▁▁▂
drink_days 134 0.45 9.29 10.20 -9.00 2.00 4.00 15.00 30.00 ▁▇▃▃▂
drink_average 135 0.45 1.41 2.57 -9.00 1.00 1.00 2.00 9.00 ▁▁▇▆▁
diet_fruit 8 0.97 324.71 338.11 -9.00 60.00 209.00 365.00 2555.00 ▇▁▁▁▁
diet_salad 8 0.97 175.48 140.00 -9.00 52.00 156.00 261.00 730.00 ▇▇▃▁▁
diet_potato 8 0.97 111.68 132.42 -9.00 24.00 104.00 156.00 1095.00 ▇▁▁▁▁
diet_carrot 8 0.97 91.52 101.32 -9.00 24.00 52.00 156.00 365.00 ▇▂▂▁▁
diet_vegetable 8 0.97 407.15 316.25 -9.00 156.00 365.00 626.00 1825.00 ▇▇▃▁▁
diet_juice 8 0.97 227.85 307.28 -9.00 24.00 104.00 365.00 2555.00 ▇▁▁▁▁
# Visualize all of the missingness in the `riskfactors`  dataset
vis_miss(riskfactors)

# Visualize and cluster all of the missingness in the `riskfactors` dataset
vis_miss(riskfactors, cluster = TRUE)

# Visualize and sort the columns by missingness in the `riskfactors` dataset
vis_miss(riskfactors, sort_miss = TRUE)

Vežba — vizualizacija nedostajanja po zapisima i promenljivima.

# Visualize the number of missings in cases using `gg_miss_case()`
gg_miss_case(riskfactors)

# Explore the number of missings in cases using `gg_miss_case()` and facet by the variable `education`
gg_miss_case(riskfactors, facet = education)

# Visualize the number of missings in variables using `gg_miss_var()`
gg_miss_var(riskfactors)

# Explore the number of missings in variables using `gg_miss_var()` and facet by the variable `education`
gg_miss_var(riskfactors, facet = education)

# Explore the number of missings in variables using `gg_miss_var()` and facet by the variable `education`
gg_miss_var(riskfactors, facet = education)

Vežba — vizualizacija obrazaca nedostajanja.

# Using the airquality dataset, explore the missingness pattern using gg_miss_upset()
gg_miss_upset(airquality)

# With the riskfactors dataset, explore how the missingness changes across the marital variable using gg_miss_fct()
gg_miss_fct(x = riskfactors, fct = marital)

# Using the pedestrian dataset, explore how the missingness of hourly_counts changes over a span of 3000 
gg_miss_span(pedestrian, var = hourly_counts, span_every = 3000)

# Using the pedestrian dataset, explore the impact of month by facetting by month
# and explore how missingness changes for a span of 1000
gg_miss_span(pedestrian, var = hourly_counts, span_every = 1000, facet = month)

Pretraga i zamena nedostajućih vrednosti

Do sada smo pretpostavljati da su nedostajuće vrednosti kodirane kao NA. To često nije slučaj, već one mogu biti kodirane na razne načine, ili uopšte nekodirane. Ovde ćemo se baviti sa sledećim:

Funkcija miss_scan_count pretražuje dataframe prema definisanoj listi vrednosti koje predstavljaju nedostajuće vrednosti.

chaos <- tibble::tribble(
           ~score,    ~grade,    ~place,
                3,     "N/A",     "-99",
              -99,       "E",      "97",
                4, "missing",      "95",
              -99,      "na",      "92",
                7,     "n/a",     "-98",
               10,        NA, "missing",
               12,       ".",      "88",
               16,        NA,       ".",
                9,     "N/a",      "86"
           )

  
chaos %>%
miss_scan_count(search = list("N/A"))
# A tibble: 3 x 2
  Variable     n
  <chr>    <int>
1 score        0
2 grade        1
3 place        0
chaos %>%
miss_scan_count(search = list("N/A",
"N/a"))
# A tibble: 3 x 2
  Variable     n
  <chr>    <int>
1 score        0
2 grade        2
3 place        0

Pomoću funkcije replace_with_na možemo zameniti nedostajuće vrednosti za odgovajuću promenljivu i imenovanu listu vrednosti koje želimo da zamenimo sa NA.

chaos %>%
  replace_with_na(replace = list(grade = c("N/A", "N/a")))
# A tibble: 9 x 3
  score grade   place  
  <dbl> <chr>   <chr>  
1     3 <NA>    -99    
2   -99 E       97     
3     4 missing 95     
4   -99 na      92     
5     7 n/a     -98    
6    10 <NA>    missing
7    12 .       88     
8    16 <NA>    .      
9     9 <NA>    86     

Da bi se izbeglo ponavljanje, ili postavili dodatni uslovi, možemo koristiti druge funkcije slične funkciji replace_with_na:

Ove funkcije koriste specifičnu sintaksu. Argumentu condition koje koriste ove funkcije moramo proslediti specijalnu funkciju koja počinje sa tildom ~ dok se promenljivoj obraćamo sa .x.

chaos %>%
replace_with_na_all(condition = ~.x == -99)
# A tibble: 9 x 3
  score grade   place  
  <dbl> <chr>   <chr>  
1     3 N/A     <NA>   
2    NA E       97     
3     4 missing 95     
4    NA na      92     
5     7 n/a     -98    
6    10 <NA>    missing
7    12 .       88     
8    16 <NA>    .      
9     9 N/a     86     
chaos %>% 
replace_with_na_all(condition = ~.x %in% c("N/A", "missing", "na"))
# A tibble: 9 x 3
  score grade place
  <dbl> <chr> <chr>
1     3 <NA>  -99  
2   -99 E     97   
3     4 <NA>  95   
4   -99 <NA>  92   
5     7 n/a   -98  
6    10 <NA>  <NA> 
7    12 .     88   
8    16 <NA>  .    
9     9 N/a   86   

Vežba — korišćenje miss_scan_count

pacman <- readRDS(here::here("data-raw","pacman_proba.rds"))


glimpse(pacman)
Observations: 2,000
Variables: 6
$ year    <chr> "2007", "1995", "1980", "1982", "na", "2013", "2003", "2016...
$ month   <chr> "10", "8", "2", "5", "na", "11", "12", "9", "3", "5", "9", ...
$ day     <chr> "27", "23", "8", "9", "na", "15", "4", "9", "20", "19", "27...
$ initial <chr> "LEX", "PNY", "MBJ", "QRC", "YPZ", "RVJ", "VKD", "ZIS", "IY...
$ score   <chr> "2065812", "1163465", "175380", "2025632", "925357", "31973...
$ country <chr> "CA", "JP", " ", "ES", "NZ", "AU", "US", "CN", "CN", "AU", ...
pacman
# A tibble: 2,000 x 6
   year  month day   initial score   country
   <chr> <chr> <chr> <chr>   <chr>   <chr>  
 1 2007  10    27    LEX     2065812 "CA"   
 2 1995  8     23    PNY     1163465 "JP"   
 3 1980  2     8     MBJ     175380  " "    
 4 1982  5     9     QRC     2025632 "ES"   
 5 na    na    na    YPZ     925357  "NZ"   
 6 2013  11    15    RVJ     319733  "AU"   
 7 2003  12    4     VKD     3322668 "US"   
 8 2016  9     9     ZIS     2137806 "CN"   
 9 2013  3     20    IYD     3059716 "CN"   
10 1993  5     19    CHQ     231892  "AU"   
# ... with 1,990 more rows
# Explore the strange missing values "N/A"
miss_scan_count(data = pacman, search = list("N/A"))
# A tibble: 6 x 2
  Variable     n
  <chr>    <int>
1 year         0
2 month        0
3 day          0
4 initial      0
5 score      100
6 country      0
# Explore the strange missing values "missing"
miss_scan_count(data = pacman, search = list("missing"))
# A tibble: 6 x 2
  Variable     n
  <chr>    <int>
1 year        93
2 month       93
3 day         93
4 initial      0
5 score        0
6 country      0
# Explore the strange missing values "na"
miss_scan_count(data = pacman, search = list("na"))
# A tibble: 6 x 2
  Variable     n
  <chr>    <int>
1 year       100
2 month      100
3 day        100
4 initial      0
5 score        0
6 country      0
# Explore the strange missing values " " (a single space)
miss_scan_count(data = pacman, search = list(" "))
# A tibble: 6 x 2
  Variable     n
  <chr>    <int>
1 year         0
2 month        0
3 day          0
4 initial      0
5 score        0
6 country    100
# Explore all of the strange missing values, "N/A", "missing", "na", " "
miss_scan_count(data = pacman, search = list("N/A", "missing","na", " "))
# A tibble: 6 x 2
  Variable     n
  <chr>    <int>
1 year       193
2 month      193
3 day        193
4 initial      0
5 score      100
6 country    100

Vežba — korišćenje replace_with_na

# Print the top of the pacman data using `head()`
head(pacman)
# A tibble: 6 x 6
  year  month day   initial score   country
  <chr> <chr> <chr> <chr>   <chr>   <chr>  
1 2007  10    27    LEX     2065812 "CA"   
2 1995  8     23    PNY     1163465 "JP"   
3 1980  2     8     MBJ     175380  " "    
4 1982  5     9     QRC     2025632 "ES"   
5 na    na    na    YPZ     925357  "NZ"   
6 2013  11    15    RVJ     319733  "AU"   
# Replace the strange missing values "N/A", "na", and "missing" with `NA` for the variables, year, and score
pacman_clean <- replace_with_na(pacman, replace = list(year = c("N/A", "na", "missing"),
                                score = c("N/A", "na", "missing")))
                                        
# Test if `pacman_clean` still has these values in it?
miss_scan_count(pacman_clean, search = list("N/A", "na", "missing"))
# A tibble: 6 x 2
  Variable     n
  <chr>    <int>
1 year         0
2 month      193
3 day        193
4 initial      0
5 score        0
6 country      0

Vežba — korišćenje replace_with_na varijanti.

# Use `replace_with_na_at()` to replace with NA
replace_with_na_at(pacman,
                   .vars = c("year", "month", "day"), 
                   ~.x %in% c("N/A", "missing", "na", " "))
# A tibble: 2,000 x 6
   year  month day   initial score   country
   <chr> <chr> <chr> <chr>   <chr>   <chr>  
 1 2007  10    27    LEX     2065812 "CA"   
 2 1995  8     23    PNY     1163465 "JP"   
 3 1980  2     8     MBJ     175380  " "    
 4 1982  5     9     QRC     2025632 "ES"   
 5 <NA>  <NA>  <NA>  YPZ     925357  "NZ"   
 6 2013  11    15    RVJ     319733  "AU"   
 7 2003  12    4     VKD     3322668 "US"   
 8 2016  9     9     ZIS     2137806 "CN"   
 9 2013  3     20    IYD     3059716 "CN"   
10 1993  5     19    CHQ     231892  "AU"   
# ... with 1,990 more rows
# Use `replace_with_na_if()` to replace with NA the character values using `is.character`
replace_with_na_if(pacman,
                   .predicate = is.character, 
                   ~.x %in% c("N/A", "missing", "na", " "))
# A tibble: 2,000 x 6
   year  month day   initial score   country
   <chr> <chr> <chr> <chr>   <chr>   <chr>  
 1 2007  10    27    LEX     2065812 CA     
 2 1995  8     23    PNY     1163465 JP     
 3 1980  2     8     MBJ     175380  <NA>   
 4 1982  5     9     QRC     2025632 ES     
 5 <NA>  <NA>  <NA>  YPZ     925357  NZ     
 6 2013  11    15    RVJ     319733  AU     
 7 2003  12    4     VKD     3322668 US     
 8 2016  9     9     ZIS     2137806 CN     
 9 2013  3     20    IYD     3059716 CN     
10 1993  5     19    CHQ     231892  AU     
# ... with 1,990 more rows
# Use `replace_with_na_all()` to replace with NA
replace_with_na_all(pacman, ~.x %in% c("N/A", "missing", "na", " "))
# A tibble: 2,000 x 6
   year  month day   initial score   country
   <chr> <chr> <chr> <chr>   <chr>   <chr>  
 1 2007  10    27    LEX     2065812 CA     
 2 1995  8     23    PNY     1163465 JP     
 3 1980  2     8     MBJ     175380  <NA>   
 4 1982  5     9     QRC     2025632 ES     
 5 <NA>  <NA>  <NA>  YPZ     925357  NZ     
 6 2013  11    15    RVJ     319733  AU     
 7 2003  12    4     VKD     3322668 US     
 8 2016  9     9     ZIS     2137806 CN     
 9 2013  3     20    IYD     3059716 CN     
10 1993  5     19    CHQ     231892  AU     
# ... with 1,990 more rows

Nedostajući podaci koji nedostaju

Do sada smo radili sa eksplicitno nedostajućim vrednostima. Implicitno nedostajuće vrednosti su vrednosti koje bi trebale da budu u skupu podataka ali nisu.

# Za Sema nedostaje evening score

tetris_skorovi <- tibble::tribble(
    ~name,       ~time, ~score,
  "robin",   "morning",    358,
  "robin", "afternoon",    534,
  "robin",   "evening",    100,
    "sam",   "morning",    139,
    "sam", "afternoon",    177,
  "blair",   "morning",    963,
  "blair", "afternoon",    962,
  "blair",   "evening",    929
  )


tetris_skorovi %>% spread(key=time,value = score)
# A tibble: 3 x 4
  name  afternoon evening morning
  <chr>     <dbl>   <dbl>   <dbl>
1 blair       962     929     963
2 robin       534     100     358
3 sam         177      NA     139

Da bismo implicitne nedostajuće vrednosti pretvorili u eksplicitne koristimo funkciju tidyr::complete. Ona traži unikatne kombinacije za željene promenljive.

tetris_skorovi %>%
  tidyr::complete(name, time)
# A tibble: 9 x 3
  name  time      score
  <chr> <chr>     <dbl>
1 blair afternoon   962
2 blair evening     929
3 blair morning     963
4 robin afternoon   534
5 robin evening     100
6 robin morning     358
7 sam   afternoon   177
8 sam   evening      NA
9 sam   morning     139

U sledećem slučaju bismo želeli da popunimo NA vrednosti sa očekivanim vrednostima metodom popunjavanja nedostajućih vrednosti sa prethodnom ili sledećom prisutnom vrednošću. Ova metoda se još naziva i LOCF (Last observation carried forward). Ova metoda je korisna ukoliko znamo da postoji neka struktura u podacima na osnovu koje možemo lako da znamo koje vrednosti nedostaju. Za ovu svrhu koristimo funkciju tidyr::fill.
fill(data, ..., .direction = c("down", "up", "downup", "updown")).

tetris_skorovi2 <- tibble::tribble(
                       ~name,       ~time, ~value,
                     "robin",   "morning",    936,
                          NA, "afternoon",    635,
                          NA,   "evening",    438,
                       "sam",   "morning",    208,
                          NA, "afternoon",     92,
                          NA,   "evening",     79,
                     "blair",   "morning",    969,
                          NA, "afternoon",    918,
                          NA,   "evening",    954
                     )


tetris_skorovi2 %>%
  tidyr::fill(name)
# A tibble: 9 x 3
  name  time      value
  <chr> <chr>     <dbl>
1 robin morning     936
2 robin afternoon   635
3 robin evening     438
4 sam   morning     208
5 sam   afternoon    92
6 sam   evening      79
7 blair morning     969
8 blair afternoon   918
9 blair evening     954

Vežba — ispravljanje implicitno nedostajućih vrednosti funkcijom complete()

frogger <- readRDS(here::here("data-raw","frogger.rds"))

frogger
    name       time  value
1  jesse    morning   6678
2  jesse  afternoon 800060
3  jesse    evening 475528
4  jesse late_night 143533
5   andy    morning 425115
6   andy  afternoon 587468
7   andy late_night 111000
8    nic  afternoon 588532
9    nic late_night 915533
10   dan    morning 388148
11   dan    evening 180912
12  alex    morning 552670
13  alex  afternoon  98355
14  alex    evening 266055
15  alex late_night 121056
# Use `complete()` on the `time` and `name` variables to make implicit missing values explicit
frogger_tidy <- frogger %>% complete(time, name)

Vežba — ispravljanje eksplicitno nedostajućih vrednosti funkcijom fill().

frogger2 <- readRDS(here::here("data-raw","frogger2.rds"))

frogger2
# A tibble: 15 x 4
   name  time        value  year
   <chr> <chr>       <int> <dbl>
 1 jesse morning      6678  1981
 2 <NA>  afternoon  800060    NA
 3 <NA>  evening    475528    NA
 4 <NA>  late_night 143533    NA
 5 andy  morning    425115    NA
 6 <NA>  afternoon  587468    NA
 7 <NA>  late_night 111000    NA
 8 nic   afternoon  588532    NA
 9 <NA>  late_night 915533    NA
10 dan   morning    388148  1988
11 <NA>  evening    180912    NA
12 alex  morning    552670    NA
13 <NA>  afternoon   98355    NA
14 <NA>  evening    266055    NA
15 <NA>  late_night 121056    NA
# Use `fill()` to fill down the name variable in the frogger dataset
frogger %>% fill(name)
    name       time  value
1  jesse    morning   6678
2  jesse  afternoon 800060
3  jesse    evening 475528
4  jesse late_night 143533
5   andy    morning 425115
6   andy  afternoon 587468
7   andy late_night 111000
8    nic  afternoon 588532
9    nic late_night 915533
10   dan    morning 388148
11   dan    evening 180912
12  alex    morning 552670
13  alex  afternoon  98355
14  alex    evening 266055
15  alex late_night 121056

Vežba — korišćenje funkcija complete() i fill() zajedno.

# Correctly fill() and complete() missing values so that our dataset becomes sensible
frogger2 %>%
  fill(name) %>%
  complete(name, time)
# A tibble: 20 x 4
   name  time        value  year
   <chr> <chr>       <int> <dbl>
 1 alex  afternoon   98355    NA
 2 alex  evening    266055    NA
 3 alex  late_night 121056    NA
 4 alex  morning    552670    NA
 5 andy  afternoon  587468    NA
 6 andy  evening        NA    NA
 7 andy  late_night 111000    NA
 8 andy  morning    425115    NA
 9 dan   afternoon      NA    NA
10 dan   evening    180912    NA
11 dan   late_night     NA    NA
12 dan   morning    388148  1988
13 jesse afternoon  800060    NA
14 jesse evening    475528    NA
15 jesse late_night 143533    NA
16 jesse morning      6678  1981
17 nic   afternoon  588532    NA
18 nic   evening        NA    NA
19 nic   late_night 915533    NA
20 nic   morning        NA    NA

Mehanizmi nedostajanja podataka

Tip mehanizma nedostajanja podataka usmerava traganje za razlozima nedostajanja ukazivanjem na postojanje povezanosti merenih promenljivih i verovatnoće nedostajanja podataka.
Rubin (1976) je definisao tri osnovna mehanizma nedostajanja podataka:

Razlika između MCAR i MAR je u tome što izostajanje podataka u okviru MCAR nije pod uticajem nijedne varijable u datom setu. S druge strane, izostajanje podataka po MAR mehanizmu nije pod uticajem varijable u kojoj ima nedostajućih podataka, ali jeste pod uticajem neke druge varijable u setu podataka.
Ukoliko, pak, distribucija nedostajućih podataka zavisi od samih nedostajućih podatka, kažemo da podaci ne nedostaju po slučajnom rasporedu, odnosno da se distribuiraju po MNAR obrascu (Schafer & Graham, 2002). U ovom slučaju, nedostajući podatak je povezan s razlogom zašto nedostaje.

MCAR:

# Primer MCAR
test_vacation <- tibble::tribble(
      ~test, ~vacation,
         NA,      TRUE,
   11.53334,     FALSE,
  10.126115,      TRUE,
         NA,     FALSE,
         NA,      TRUE,
   8.551881,     FALSE,
         NA,     FALSE,
         NA,      TRUE,
  10.608264,      TRUE,
   8.611877,      TRUE
  )

test_vacation
# A tibble: 10 x 2
    test vacation
   <dbl> <lgl>   
 1 NA    TRUE    
 2 11.5  FALSE   
 3 10.1  TRUE    
 4 NA    FALSE   
 5 NA    TRUE    
 6  8.55 FALSE   
 7 NA    FALSE   
 8 NA    TRUE    
 9 10.6  TRUE    
10  8.61 TRUE    

MAR:

# Primer MAR - veća verovatnoća nedostajanja test skorova za radnike sa visokim nivoom depresije
test_vacation2 <- tibble::tribble(
                        ~test, ~vacation, ~depression,
                           NA,      TRUE,    87.93109,
                     11.53334,     FALSE,    40.02708,
                    10.126115,      TRUE,    48.62883,
                           NA,     FALSE,    88.21743,
                           NA,      TRUE,    90.29282,
                     8.551881,     FALSE,    44.77343,
                           NA,     FALSE,    89.48865,
                           NA,      TRUE,    89.99209,
                    10.608264,      TRUE,    45.56832,
                     8.611877,      TRUE,    42.41686
                    )

test_vacation2
# A tibble: 10 x 3
    test vacation depression
   <dbl> <lgl>         <dbl>
 1 NA    TRUE           87.9
 2 11.5  FALSE          40.0
 3 10.1  TRUE           48.6
 4 NA    FALSE          88.2
 5 NA    TRUE           90.3
 6  8.55 FALSE          44.8
 7 NA    FALSE          89.5
 8 NA    TRUE           90.0
 9 10.6  TRUE           45.6
10  8.61 TRUE           42.4

MNAR:

# Ukoliko bi bilo poznato da postoji veza između test skorova i depresije, a nedostaju i test skorovi i skorovi depresivnosti
test_vacation3 <- tibble::tribble(
                        ~test, ~vacation, ~depression,
                           NA,      TRUE,          NA,
                     11.53334,     FALSE,    11.53334,
                    10.126115,      TRUE,   10.126115,
                           NA,     FALSE,          NA,
                           NA,      TRUE,          NA,
                     8.551881,     FALSE,    8.551881,
                           NA,     FALSE,          NA,
                           NA,      TRUE,          NA,
                    10.608264,      TRUE,   10.608264,
                     8.611877,      TRUE,    8.611877
                    )

test_vacation3
# A tibble: 10 x 3
    test vacation depression
   <dbl> <lgl>         <dbl>
 1 NA    TRUE          NA   
 2 11.5  FALSE         11.5 
 3 10.1  TRUE          10.1 
 4 NA    FALSE         NA   
 5 NA    TRUE          NA   
 6  8.55 FALSE          8.55
 7 NA    FALSE         NA   
 8 NA    TRUE          NA   
 9 10.6  TRUE          10.6 
10  8.61 TRUE           8.61

U sledećem primeru vidimo da postoji dosta šuma u nedostajanju. Nasumični obrasci nedostajanja, ili obrasci nedostajanja za koje nam se čini da imaju punu šuma, sugerišu da nema punu varijanse u našem obrascu nedostajanja. Na osnovu toga možemo reći da se radi o MCAR mehanizmu nedostajanja podataka.

vis_miss(mt_cars, cluster = TRUE)

U sledećem primeru vidimo da definitivno postoji grupisanje nedostajanja. Ovo je često slučaj kod MAR mehanizma nedostajanja podataka.

oceanbuoys
# A tibble: 736 x 8
    year latitude longitude sea_temp_c air_temp_c humidity wind_ew wind_ns
   <dbl>    <dbl>     <dbl>      <dbl>      <dbl>    <dbl>   <dbl>   <dbl>
 1  1997        0      -110       27.6       27.1     79.6   -6.40    5.40
 2  1997        0      -110       27.5       27.0     75.8   -5.30    5.30
 3  1997        0      -110       27.6       27       76.5   -5.10    4.5 
 4  1997        0      -110       27.6       26.9     76.2   -4.90    2.5 
 5  1997        0      -110       27.6       26.8     76.4   -3.5     4.10
 6  1997        0      -110       27.8       26.9     76.7   -4.40    1.60
 7  1997        0      -110       28.0       27.0     76.5   -2       3.5 
 8  1997        0      -110       28.0       27.1     78.3   -3.70    4.5 
 9  1997        0      -110       28.0       27.2     78.6   -4.20    5   
10  1997        0      -110       28.0       27.2     76.9   -3.60    3.5 
# ... with 726 more rows
oceanbuoys %>% arrange(year) %>% vis_miss()

U sledećem primeru demonstrira se MNAR mehanizam nedostajanja. Ocean podaci su izmenjeni tako da su podaci koji nedostaju generisani na osnovu veze sa vrednostima promenljive koja je uklonjena iz dataseta, dakle na osnovu nečega što sada nije posmatrano. U ovom slučaju jasno se vidi struktura grupisanja, međutim ovo često ne mora biti slučaj. Važno je zapamtiti da je veoma teško pouzdano ustanoviti da li je mehanizam nedostajanja MCAR, MAR, ili MNAR. Vizualizacije ne daju definitivan odgovor na to pitanje.

vis_miss(ocean, cluster = TRUE)

Vežba — istraživanje mehanizama nedostajanja podataka. Najpre ćemo pogledati kako sortiranje po različitim promenljivim utiče na pregled nedostajućih podataka.

# Arrange by year
oceanbuoys %>% arrange(year) %>% vis_miss()

# Arrange by latitude
oceanbuoys %>% arrange(latitude) %>% vis_miss()

# Arrange by wind_ew (wind east west)
oceanbuoys %>% arrange(wind_ew) %>% vis_miss()

Na osnovu svega navedenog može se zaključiti da je u pitanju MAR mehanizam. Promenljive year i location su obe važne za objašnjenje obrasca nedostajanja podataka, jer pokazuju grupisanje nedostajanja podataka. Kod MAR-a visoke ili niske vrednosti nekih promenljivih imaju uticaj na nedostajanje podataka.

gg_miss_var(oceanbuoys, facet = year)

gg_miss_case(oceanbuoys, facet = year)

Alati za istraživanje mehanizama nedostajanja podataka. Shadow matrice. Nabular podaci.

Recimo da imamo sledeće podatke.

cenzus1 <- tibble::tribble(
               ~income,    ~education,
              48.69087,            NA,
              40.93218,            NA,
              52.69245, "high_school",
              31.33808,            NA,
              89.35671,  "university",
             103.87278,  "university"
             )

cenzus1
# A tibble: 6 x 2
  income education  
   <dbl> <chr>      
1   48.7 <NA>       
2   40.9 <NA>       
3   52.7 high_school
4   31.3 <NA>       
5   89.4 university 
6  104.  university 

Ukoliko pogledamo raspodelu za income vidimo da je većina vrednosti između 70k-80k. Ali, ako income razložimo prema nedostajanju vrednosti promenljive education možemo videti da postoji gap između te dve raspodele.

cenzus1 %>%
  bind_shadow() %>%
  ggplot(aes(x = income, fill = education_NA)) + 
  geom_density(alpha = 0.5)

Shadow matrica je jasnija reprezentacija nedostajanja podataka. Dva glavna svojstva shadow matrice su:

Kada se ovakvi podaci columnwise dodaju na originalne podatke dobijamo takozvane nabular podatke (NA i Tabular). Ovo radimo pomoću funkcije bind_shadow().

cenzus1 %>%
  bind_shadow()
# A tibble: 6 x 4
  income education   income_NA education_NA
   <dbl> <chr>       <fct>     <fct>       
1   48.7 <NA>        !NA       NA          
2   40.9 <NA>        !NA       NA          
3   52.7 high_school !NA       !NA         
4   31.3 <NA>        !NA       NA          
5   89.4 university  !NA       !NA         
6  104.  university  !NA       !NA         

Primena nabular na airquality dataset.

bind_shadow(airquality)
# A tibble: 153 x 12
   Ozone Solar.R  Wind  Temp Month   Day Ozone_NA Solar.R_NA Wind_NA Temp_NA
   <int>   <int> <dbl> <int> <int> <int> <fct>    <fct>      <fct>   <fct>  
 1    41     190   7.4    67     5     1 !NA      !NA        !NA     !NA    
 2    36     118   8      72     5     2 !NA      !NA        !NA     !NA    
 3    12     149  12.6    74     5     3 !NA      !NA        !NA     !NA    
 4    18     313  11.5    62     5     4 !NA      !NA        !NA     !NA    
 5    NA      NA  14.3    56     5     5 NA       NA         !NA     !NA    
 6    28      NA  14.9    66     5     6 !NA      NA         !NA     !NA    
 7    23     299   8.6    65     5     7 !NA      !NA        !NA     !NA    
 8    19      99  13.8    59     5     8 !NA      !NA        !NA     !NA    
 9     8      19  20.1    61     5     9 !NA      !NA        !NA     !NA    
10    NA     194   8.6    69     5    10 NA       !NA        !NA     !NA    
# ... with 143 more rows, and 2 more variables: Month_NA <fct>, Day_NA <fct>
airquality %>%
  bind_shadow() %>%
  group_by(Ozone_NA) %>%
  summarise(mean = mean(Wind))
# A tibble: 2 x 2
  Ozone_NA  mean
  <fct>    <dbl>
1 !NA       9.86
2 NA       10.3 

Vežba — shadow matrix.

# Create shadow matrix data with `as_shadow()`
as_shadow(oceanbuoys)
# A tibble: 736 x 8
   year_NA latitude_NA longitude_NA sea_temp_c_NA air_temp_c_NA humidity_NA
   <fct>   <fct>       <fct>        <fct>         <fct>         <fct>      
 1 !NA     !NA         !NA          !NA           !NA           !NA        
 2 !NA     !NA         !NA          !NA           !NA           !NA        
 3 !NA     !NA         !NA          !NA           !NA           !NA        
 4 !NA     !NA         !NA          !NA           !NA           !NA        
 5 !NA     !NA         !NA          !NA           !NA           !NA        
 6 !NA     !NA         !NA          !NA           !NA           !NA        
 7 !NA     !NA         !NA          !NA           !NA           !NA        
 8 !NA     !NA         !NA          !NA           !NA           !NA        
 9 !NA     !NA         !NA          !NA           !NA           !NA        
10 !NA     !NA         !NA          !NA           !NA           !NA        
# ... with 726 more rows, and 2 more variables: wind_ew_NA <fct>,
#   wind_ns_NA <fct>
# Create nabular data by binding the shadow to the data with `bind_shadow()`
bind_shadow(oceanbuoys)
# A tibble: 736 x 16
    year latitude longitude sea_temp_c air_temp_c humidity wind_ew wind_ns
   <dbl>    <dbl>     <dbl>      <dbl>      <dbl>    <dbl>   <dbl>   <dbl>
 1  1997        0      -110       27.6       27.1     79.6   -6.40    5.40
 2  1997        0      -110       27.5       27.0     75.8   -5.30    5.30
 3  1997        0      -110       27.6       27       76.5   -5.10    4.5 
 4  1997        0      -110       27.6       26.9     76.2   -4.90    2.5 
 5  1997        0      -110       27.6       26.8     76.4   -3.5     4.10
 6  1997        0      -110       27.8       26.9     76.7   -4.40    1.60
 7  1997        0      -110       28.0       27.0     76.5   -2       3.5 
 8  1997        0      -110       28.0       27.1     78.3   -3.70    4.5 
 9  1997        0      -110       28.0       27.2     78.6   -4.20    5   
10  1997        0      -110       28.0       27.2     76.9   -3.60    3.5 
# ... with 726 more rows, and 8 more variables: year_NA <fct>,
#   latitude_NA <fct>, longitude_NA <fct>, sea_temp_c_NA <fct>,
#   air_temp_c_NA <fct>, humidity_NA <fct>, wind_ew_NA <fct>, wind_ns_NA <fct>
# Bind only the variables with missing values by using bind_shadow(only_miss = TRUE)
bind_shadow(oceanbuoys, only_miss = TRUE)
# A tibble: 736 x 11
    year latitude longitude sea_temp_c air_temp_c humidity wind_ew wind_ns
   <dbl>    <dbl>     <dbl>      <dbl>      <dbl>    <dbl>   <dbl>   <dbl>
 1  1997        0      -110       27.6       27.1     79.6   -6.40    5.40
 2  1997        0      -110       27.5       27.0     75.8   -5.30    5.30
 3  1997        0      -110       27.6       27       76.5   -5.10    4.5 
 4  1997        0      -110       27.6       26.9     76.2   -4.90    2.5 
 5  1997        0      -110       27.6       26.8     76.4   -3.5     4.10
 6  1997        0      -110       27.8       26.9     76.7   -4.40    1.60
 7  1997        0      -110       28.0       27.0     76.5   -2       3.5 
 8  1997        0      -110       28.0       27.1     78.3   -3.70    4.5 
 9  1997        0      -110       28.0       27.2     78.6   -4.20    5   
10  1997        0      -110       28.0       27.2     76.9   -3.60    3.5 
# ... with 726 more rows, and 3 more variables: sea_temp_c_NA <fct>,
#   air_temp_c_NA <fct>, humidity_NA <fct>

Vežba — grupni pregled nedostajanja pomoću nabularnih podataka.

# `bind_shadow()` and `group_by()` humidity missingness (`humidity_NA`)
oceanbuoys %>%
  bind_shadow() %>%
  group_by(humidity_NA) %>%
  summarise(wind_ew_mean = mean(wind_ew), # calculate mean of wind_ew
            wind_ew_sd = sd(wind_ew)) # calculate standard deviation of wind_ew
# A tibble: 2 x 3
  humidity_NA wind_ew_mean wind_ew_sd
  <fct>              <dbl>      <dbl>
1 !NA                -3.78       1.90
2 NA                 -3.30       2.31
# Repeat this, but calculating summaries for wind north south (`wind_ns`)
oceanbuoys %>%
  bind_shadow() %>%
  group_by(humidity_NA) %>%
  summarise(wind_ns_mean = mean(wind_ns),
            wind_ns_sd = sd(wind_ns))
# A tibble: 2 x 3
  humidity_NA wind_ns_mean wind_ns_sd
  <fct>              <dbl>      <dbl>
1 !NA                 2.78       2.06
2 NA                  1.66       2.23

Vežba — dalje istraživanje kombinacija nedostajanja. Dodaćemo broj posmatranih slučajeva, i dodatni nivo grupisanja za kombinacije nedostajanja vlažnosti i temperature vazduha. Brojeći broj opservacija u svakoj grupi dobijamo koristan kontekst za sumarni pregled.

# Summarise wind_ew by the missingness of `air_temp_c_NA`
oceanbuoys %>% 
  bind_shadow() %>%
  group_by(air_temp_c_NA) %>%
  summarise(wind_ew_mean = mean(wind_ew),
            wind_ew_sd = sd(wind_ew),
            n_obs = n())
# A tibble: 2 x 4
  air_temp_c_NA wind_ew_mean wind_ew_sd n_obs
  <fct>                <dbl>      <dbl> <int>
1 !NA                  -3.91       1.85   655
2 NA                   -2.17       2.14    81
# Summarise wind_ew by missingness of `air_temp_c_NA` and `humidity_NA`
oceanbuoys %>% 
  bind_shadow() %>%
  group_by(air_temp_c_NA, humidity_NA) %>%
  summarise(wind_ew_mean = mean(wind_ew),
            wind_ew_sd = sd(wind_ew),
            n_obs = n())
# A tibble: 4 x 5
# Groups:   air_temp_c_NA [2]
  air_temp_c_NA humidity_NA wind_ew_mean wind_ew_sd n_obs
  <fct>         <fct>              <dbl>      <dbl> <int>
1 !NA           !NA                -4.01       1.74   565
2 !NA           NA                 -3.24       2.31    90
3 NA            !NA                -2.06       2.08    78
4 NA            NA                 -4.97       1.74     3

Istraživanje uslovnog nedostajanja podataka za jednu promenljivu

Koristićemo nabularne podatke i sledeće vizualizacije:

Najpre density.

ggplot(airquality,
       aes(x = Temp)) + 
  geom_density()

airquality %>%
  bind_shadow() %>%
  ggplot(aes(x = Temp,
             color = Ozone_NA)) + 
  geom_density()

Takođe možemo koristiti boxplotove. Vidimo da su srednje vrednosti temperatura slične i kada nedostaje, i kada ne nedostaju ozon. Međutim varijansa temperature je manja kada nedostaju vrednosti ozona, takođe primetni su outlieri u temperaturi.

airquality %>%
  bind_shadow() %>%
  ggplot(aes(x = Ozone_NA,
             y = Temp)) + 
  geom_boxplot()

Možemo koristiti i faceting za density.

airquality %>%
  bind_shadow() %>%
  ggplot(aes(x = Temp)) + 
  geom_density() + 
  facet_wrap(~Ozone_NA)

Možemo koristiti i faceting za scatter plot. Ovde se vidi da imamo manji broj vrednosti za Wind i Temp kada nedostaju podaci za Ozone.

airquality %>%
  bind_shadow() %>%
  ggplot(aes(x = Temp,
             y = Wind)) + 
  geom_point() +
  facet_wrap(~Ozone_NA)

Tačke se mogu kodirati bojom prema nedostajanju. Možemo takođe dodati sloj za nedostajanje, prema nedostajanju Solar.R.

airquality %>%
bind_shadow() %>%
ggplot(aes(x = Temp,
y = Wind,
color = Ozone_NA)) +
geom_point()

airquality %>%
  bind_shadow() %>%
  ggplot(aes(x = Temp,
             color = Ozone_NA)) + 
  geom_density()  +
  facet_wrap(~Solar.R_NA)

# First explore the missingness structure of `oceanbuoys` using `vis_miss()`
vis_miss(oceanbuoys)

# Explore the distribution of `wind_ew` for the missingness of `air_temp_c_NA` using  `geom_density()`
bind_shadow(oceanbuoys) %>%
  ggplot(aes(x = wind_ew, 
             color = air_temp_c_NA)) + 
  geom_density()

# Explore the distribution of sea temperature for the missingness of humidity (humidity_NA) using  `geom_density()`
bind_shadow(oceanbuoys) %>%
  ggplot(aes(x = sea_temp_c,
             color = humidity_NA)) + 
  geom_density()

Vežba — nabularni podaci i vizualizacije. Istražićemo varijacije u jednoj promenljivoj u zavisnosti od nedostajanja drugih.

# First explore the missingness structure of `oceanbuoys` using `vis_miss()`
vis_miss(oceanbuoys)

# Explore the distribution of `wind_ew` for the missingness of `air_temp_c_NA` using  `geom_density()`
bind_shadow(oceanbuoys) %>%
  ggplot(aes(x = wind_ew, 
             color = air_temp_c_NA)) + 
  geom_density()

# Explore the distribution of sea temperature for the missingness of humidity (humidity_NA) using  `geom_density()`
bind_shadow(oceanbuoys) %>%
  ggplot(aes(x = sea_temp_c,
             color = humidity_NA)) + 
  geom_density()

Vežba — Istražićemo varijacije u jednoj promenljivoj u zavisnosti od nedostajanja drugih pomoću faceting-a.

# Explore the distribution of wind east west (`wind_ew`) for the missingness of air temperature using  `geom_density()` and facetting by the missingness of air temperature (`air_temp_c_NA`).
oceanbuoys %>%
  bind_shadow() %>%
  ggplot(aes(x = wind_ew)) + 
  geom_density() + 
  facet_wrap(~air_temp_c_NA)

# Build upon this visualisation by filling by the missingness of humidity (`humidity_NA`).
oceanbuoys %>%
  bind_shadow() %>%
  ggplot(aes(x = wind_ew,
             color = humidity_NA)) + 
  geom_density() + 
  facet_wrap(~air_temp_c_NA)

Vežba — Istražićemo varijacije u jednoj promenljivoj u zavisnosti od nedostajanja drugih pomoću box plotova.

# Explore the distribution of wind east west (`wind_ew`) for the missingness of air temperature using  `geom_boxplot()`
oceanbuoys %>%
  bind_shadow() %>%
  ggplot(aes(x = air_temp_c_NA,
             y = wind_ew)) + 
  geom_boxplot()

# Build upon this visualisation by facetting by the missingness of humidity (`humidity_NA`).
oceanbuoys %>%
  bind_shadow() %>%
  ggplot(aes(x = air_temp_c_NA,
             y = wind_ew)) + 
  geom_boxplot() + 
  facet_wrap(~humidity_NA)

Istraživanje uslovnog nedostajanja podataka za dve promenljive

Kod scatterplotova imamo problem da se uklonjaju celi redovi za vrednosti koje nedostaju. ggplot2 daje warning u tom slučaju. Koristićemo geom geom_miss_point. Na donjem primeru dobijamo informaciju da se vrednosti koje nedostaju javljaju za manje vrednosti ozona.

ggplot(airquality,
       aes(x = Ozone,
           y = Solar.R)) + 
  geom_point()

ggplot(airquality,
       aes(x = Ozone,
           y = Solar.R)) + 
  geom_miss_point()

geom_miss_point u pozadini imputira nedostajuće vrednosti. Podaci se imputiraju za 10% ispod minimalne vrednosti.

primerGMP <- tibble::tribble(
               ~Ozone, ~Ozone_shift, ~Ozone_NA,
                   41,           41,     "!NA",
                   36,           36,     "!NA",
                   12,           12,     "!NA",
                   18,           18,     "!NA",
                   NA,    -19.72321,        NA,
                   28,           28,     "!NA"
               )

primerGMP
# A tibble: 6 x 3
  Ozone Ozone_shift Ozone_NA
  <dbl>       <dbl> <chr>   
1    41        41   !NA     
2    36        36   !NA     
3    12        12   !NA     
4    18        18   !NA     
5    NA       -19.7 <NA>    
6    28        28   !NA     

Sa geom_miss_point takođe možemo koristiti faceting i nabularne podatke.

ggplot(airquality,
       aes(x = Wind,
           y = Ozone)) +
  geom_miss_point() +
  facet_wrap( ~ Month)

airquality %>%
  bind_shadow() %>%
  ggplot(aes(x = Wind,
             y = Ozone)) +
  geom_miss_point() +
  facet_wrap( ~ Solar.R_NA)

Vežba — Istraživanje nedostajućih podataka pomoću scatterplotova.

# Explore the missingness in wind and air temperature, and display the missingness using `geom_miss_point()`
ggplot(oceanbuoys,
       aes(x = wind_ew,
           y = air_temp_c)) + 
  geom_miss_point()

# Explore the missingness in humidity and air temperature, and display the missingness using `geom_miss_point()`
ggplot(oceanbuoys,
       aes(x = humidity,
           y = air_temp_c)) + 
  geom_miss_point()

Vežba — Istraživanje nedostajućih podataka pomoću scatterplotova i facetinga.

# Explore the missingness in wind and air temperature, and display the missingness using `geom_miss_point()`. Facet by year to explore this further.
ggplot(oceanbuoys,
       aes(x = wind_ew,
           y = air_temp_c)) + 
  geom_miss_point() + 
  facet_wrap(~year)

# Explore the missingness in humidity and air temperature, and display the missingness using `geom_miss_point()` Facet by year to explore this further.
ggplot(oceanbuoys,
       aes(x = humidity,
           y = air_temp_c)) + 
  geom_miss_point() + 
  facet_wrap(~year)

Vežba — Istraživanje nedostajućih podataka pomoću scatterplotova, facetinga i nabularnih podataka.

# Use geom_miss_point() and facet_wrap to explore how the missingness in wind_ew and air_temp_c is different for missingness of humidity
bind_shadow(oceanbuoys) %>%
  ggplot(aes(x = wind_ew,
           y = air_temp_c)) + 
  geom_miss_point() + 
  facet_wrap(~humidity_NA)

# Use geom_miss_point() and facet_grid to explore how the missingness in wind_ew and air_temp_c is different for missingness of humidity AND by year - by using `facet_grid(humidity_NA ~ year)`
bind_shadow(oceanbuoys) %>%
  ggplot(aes(x = wind_ew,
             y = air_temp_c)) + 
  geom_miss_point() + 
  facet_grid(humidity_NA~year)

Imputacija

Funkcija impute_below() imputira vrednosti ispod minimuma iz postojećih podataka.

Moramo znati koje su vrednosti imputirane.

impute_below(c(5,6,7,NA,9,10))
[1]  5.00000  6.00000  7.00000  4.40271  9.00000 10.00000
df11 <- data.frame(var1 = c(1, 5, 2, 6, 3, 7, 4, NA, 5, 9, 6, 10))
impute_below_all(df11)
         var1
1   1.0000000
2   5.0000000
3   2.0000000
4   6.0000000
5   3.0000000
6   7.0000000
7   4.0000000
8   0.2524351
9   5.0000000
10  9.0000000
11  6.0000000
12 10.0000000
bind_shadow(df11)
# A tibble: 12 x 2
    var1 var1_NA
   <dbl> <fct>  
 1     1 !NA    
 2     5 !NA    
 3     2 !NA    
 4     6 !NA    
 5     3 !NA    
 6     7 !NA    
 7     4 !NA    
 8    NA NA     
 9     5 !NA    
10     9 !NA    
11     6 !NA    
12    10 !NA    
bind_shadow(df11) %>%
impute_below_all()
# A tibble: 12 x 2
     var1 var1_NA
    <dbl> <fct>  
 1  1     !NA    
 2  5     !NA    
 3  2     !NA    
 4  6     !NA    
 5  3     !NA    
 6  7     !NA    
 7  4     !NA    
 8  0.252 NA     
 9  5     !NA    
10  9     !NA    
11  6     !NA    
12 10     !NA    

Vizualizacija imputiranih podataka putem histograma i facetinga.

aq_imp <- airquality %>%
bind_shadow() %>%
impute_below_all()

ggplot(aq_imp,
aes(x = Ozone,
fill = Ozone_NA)) +
geom_histogram()

ggplot(aq_imp,
aes(x = Ozone,
fill = Ozone_NA)) +
geom_histogram() +
facet_wrap(~Month)

Plot možemo podeliti na osnovu nedostajanja podataka u Solar.R.

ggplot(aq_imp,
       aes(x = Ozone,
           fill = Ozone_NA)) + 
  geom_histogram() + 
  facet_wrap(~Solar.R_NA)

add_label_missings dodaje kolonu koja oposuje da li nedostaju podaci za bilo koju promenljivu iz zapisa. add_label_shadow proverava da li su vrednosti imputirane kada je primenjene bind_shadow ili add_shadow.

aq_imp <- airquality %>%
bind_shadow() %>%
add_label_missings() %>%
impute_below_all()

aq_imp
# A tibble: 153 x 13
   Ozone Solar.R  Wind  Temp Month   Day Ozone_NA Solar.R_NA Wind_NA Temp_NA
   <dbl>   <dbl> <dbl> <int> <int> <int> <fct>    <fct>      <fct>   <fct>  
 1  41     190     7.4    67     5     1 !NA      !NA        !NA     !NA    
 2  36     118     8      72     5     2 !NA      !NA        !NA     !NA    
 3  12     149    12.6    74     5     3 !NA      !NA        !NA     !NA    
 4  18     313    11.5    62     5     4 !NA      !NA        !NA     !NA    
 5 -19.7   -33.6  14.3    56     5     5 NA       NA         !NA     !NA    
 6  28     -33.1  14.9    66     5     6 !NA      NA         !NA     !NA    
 7  23     299     8.6    65     5     7 !NA      !NA        !NA     !NA    
 8  19      99    13.8    59     5     8 !NA      !NA        !NA     !NA    
 9   8      19    20.1    61     5     9 !NA      !NA        !NA     !NA    
10 -18.5   194     8.6    69     5    10 NA       !NA        !NA     !NA    
# ... with 143 more rows, and 3 more variables: Month_NA <fct>, Day_NA <fct>,
#   any_missing <chr>
ggplot(aq_imp,
aes(x = Ozone,
y = Solar.R,
colour = any_missing)) +
geom_point()

airquality %>%
  add_shadow(Ozone, Solar.R) %>%
  add_label_shadow()
# A tibble: 153 x 9
   Ozone Solar.R  Wind  Temp Month   Day Ozone_NA Solar.R_NA any_missing
   <int>   <int> <dbl> <int> <int> <int> <fct>    <fct>      <chr>      
 1    41     190   7.4    67     5     1 !NA      !NA        Not Missing
 2    36     118   8      72     5     2 !NA      !NA        Not Missing
 3    12     149  12.6    74     5     3 !NA      !NA        Not Missing
 4    18     313  11.5    62     5     4 !NA      !NA        Not Missing
 5    NA      NA  14.3    56     5     5 NA       NA         Missing    
 6    28      NA  14.9    66     5     6 !NA      NA         Missing    
 7    23     299   8.6    65     5     7 !NA      !NA        Not Missing
 8    19      99  13.8    59     5     8 !NA      !NA        Not Missing
 9     8      19  20.1    61     5     9 !NA      !NA        Not Missing
10    NA     194   8.6    69     5    10 NA       !NA        Missing    
# ... with 143 more rows

Vežba — imputiranje ispod opsega.

# Impute the data below the range using `impute_below`.
ocean_imp <- impute_below_all(oceanbuoys)

# Visualise the new missing values
ggplot(ocean_imp,
       aes(x = wind_ew, y = air_temp_c)) + 
  geom_point()

# Impute and track data with `bind_shadow`, `impute_below`, and `add_label_shadow`
ocean_imp_track <- bind_shadow(oceanbuoys) %>% 
  impute_below_all() %>%
  add_label_shadow()

# Look at the imputed values
ocean_imp_track
# A tibble: 736 x 17
    year latitude longitude sea_temp_c air_temp_c humidity wind_ew wind_ns
   <dbl>    <dbl>     <dbl>      <dbl>      <dbl>    <dbl>   <dbl>   <dbl>
 1  1997        0      -110       27.6       27.1     79.6   -6.40    5.40
 2  1997        0      -110       27.5       27.0     75.8   -5.30    5.30
 3  1997        0      -110       27.6       27       76.5   -5.10    4.5 
 4  1997        0      -110       27.6       26.9     76.2   -4.90    2.5 
 5  1997        0      -110       27.6       26.8     76.4   -3.5     4.10
 6  1997        0      -110       27.8       26.9     76.7   -4.40    1.60
 7  1997        0      -110       28.0       27.0     76.5   -2       3.5 
 8  1997        0      -110       28.0       27.1     78.3   -3.70    4.5 
 9  1997        0      -110       28.0       27.2     78.6   -4.20    5   
10  1997        0      -110       28.0       27.2     76.9   -3.60    3.5 
# ... with 726 more rows, and 9 more variables: year_NA <fct>,
#   latitude_NA <fct>, longitude_NA <fct>, sea_temp_c_NA <fct>,
#   air_temp_c_NA <fct>, humidity_NA <fct>, wind_ew_NA <fct>, wind_ns_NA <fct>,
#   any_missing <chr>

Vežba — vizualizacija imputiranih podataka u scatterplotu.

# Impute and track the missing values
ocean_imp_track <- bind_shadow(oceanbuoys) %>% 
  impute_below_all() %>%
  add_label_shadow()

# Visualise the missingness in wind and air temperature, coloring missing air temp values with air_temp_c_NA
ggplot(ocean_imp_track,
       aes(x = wind_ew, y = air_temp_c, color = air_temp_c_NA)) + 
  geom_point()

# Visualise humidity and air temp, coloring any missing cases using the variable any_missing
ggplot(ocean_imp_track,
       aes(x = humidity, y = air_temp_c, color = any_missing)) + 
  geom_point()

Vežba — kreiranje histograma imputiranih podataka.

# Explore the values of air_temp_c, visualising the amount of missings with `air_temp_c_NA`.
p <- ggplot(ocean_imp_track, aes(x = air_temp_c, fill = air_temp_c_NA)) + geom_histogram()

# Expore the missings in humidity using humidity_NA
p2 <- ggplot(ocean_imp_track, aes(x = humidity, fill = humidity_NA)) + geom_histogram()

# Explore the missings in air_temp_c according to year, using `facet_wrap(~year)`.
p + facet_wrap(~year)

# Explore the missings in humidity according to year, using `facet_wrap(~year)`.
p2 + facet_wrap(~year)

Svojstva dobrih i loših imputacija

Jedna veoma loša imputacija je mean imputacija.

df12 <- data.frame(x = c(1, 4, 9, 16, NA, 36))
mean(df12$x, na.rm = TRUE)
[1] 13.2

Za istraživanje ovakve imputacije koristićemo sledeće funkcije:

  • impute_mean(data$variable).
  • impute_mean_if(data, is.numeric).
  • impute_mean_at(data, vars(variable1, variable2)).
  • impute_mean_all(data).
aq_impute_mean <- airquality %>%
  bind_shadow(only_miss = TRUE) %>%
  impute_mean_all() %>%
  add_label_shadow()

aq_impute_mean
# A tibble: 153 x 9
   Ozone Solar.R  Wind  Temp Month   Day Ozone_NA Solar.R_NA any_missing
   <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl> <fct>    <fct>      <chr>      
 1  41      190    7.4    67     5     1 !NA      !NA        Not Missing
 2  36      118    8      72     5     2 !NA      !NA        Not Missing
 3  12      149   12.6    74     5     3 !NA      !NA        Not Missing
 4  18      313   11.5    62     5     4 !NA      !NA        Not Missing
 5  42.1    186.  14.3    56     5     5 NA       NA         Missing    
 6  28      186.  14.9    66     5     6 !NA      NA         Missing    
 7  23      299    8.6    65     5     7 !NA      !NA        Not Missing
 8  19       99   13.8    59     5     8 !NA      !NA        Not Missing
 9   8       19   20.1    61     5     9 !NA      !NA        Not Missing
10  42.1    194    8.6    69     5    10 NA       !NA        Missing    
# ... with 143 more rows

Kada evaluirate imputacije, istražite promene i sličnosti u:

  • mean/median pomoću boxplota.
  • Spread (scale).
  • location.

Donji plot pokazuje da je median niži za !NA grupu.

ggplot(aq_impute_mean,
       aes(x = Ozone_NA,
           y = Ozone)) +
  geom_boxplot()

Evalucija - spread imputacija pomoću scatterplota. Ova vizualizacija pokazuje da nema varijabilnosti u raširenosti imputiranih tačaka.

ggplot(aq_impute_mean,
       aes(x = Ozone,
           y = Solar.R,
           colour = any_missing)) +
  geom_point()

Evaluacija imputacija za više promenljivih.

aq_imp <- airquality %>%
  bind_shadow() %>%
  impute_mean_all()

aq_imp_long <- shadow_long(aq_imp, 
                           Ozone, 
                           Solar.R)

aq_imp_long
# A tibble: 306 x 4
   variable value variable_NA value_NA
   <chr>    <dbl> <chr>       <chr>   
 1 Ozone     41   Ozone_NA    !NA     
 2 Ozone     36   Ozone_NA    !NA     
 3 Ozone     12   Ozone_NA    !NA     
 4 Ozone     18   Ozone_NA    !NA     
 5 Ozone     42.1 Ozone_NA    NA      
 6 Ozone     28   Ozone_NA    !NA     
 7 Ozone     23   Ozone_NA    !NA     
 8 Ozone     19   Ozone_NA    !NA     
 9 Ozone      8   Ozone_NA    !NA     
10 Ozone     42.1 Ozone_NA    NA      
# ... with 296 more rows
ggplot(aq_imp_long,
       aes(x = value,
           fill = value_NA)) + 
  geom_histogram() + 
  facet_wrap(~variable)

Vežba - evaluacija loših imputacija.

# Impute the mean value and track the imputations 
ocean_imp_mean <- bind_shadow(oceanbuoys) %>% 
  impute_mean_all() %>% 
  add_label_shadow()

# Explore the mean values in humidity in the imputed dataset
ggplot(ocean_imp_mean, 
       aes(x = humidity_NA, y = humidity)) + 
  geom_boxplot()

# Explore the values in air temperature in the imputed dataset
ggplot(ocean_imp_mean, 
       aes(x = air_temp_c_NA, y = air_temp_c)) + 
  geom_boxplot()

Vežba - evaluacija loših imputacija - scale.

# Explore imputations in air temperature and humidity, coloring by the variable, any_missing
ggplot(ocean_imp_mean, 
       aes(x = air_temp_c, y = humidity, color = any_missing)) + 
  geom_point()

# Explore imputations in air temperature and humidity, coloring by the variable, any_missing, and faceting by year
ggplot(ocean_imp_mean, 
       aes(x = air_temp_c, y = humidity, color = any_missing)) +
  geom_point() + 
  facet_wrap(~year)

Vežba - evaluacija loših imputacija - za više promenljivih.

# Gather the imputed data 
ocean_imp_mean_gather <- shadow_long(ocean_imp_mean,
                                     humidity,
                                     air_temp_c)
# Inspect the data
ocean_imp_mean_gather
# A tibble: 1,472 x 4
   variable   value       variable_NA   value_NA
   <chr>      <chr>       <chr>         <chr>   
 1 air_temp_c 27.14999962 air_temp_c_NA !NA     
 2 air_temp_c 27.02000046 air_temp_c_NA !NA     
 3 air_temp_c 27          air_temp_c_NA !NA     
 4 air_temp_c 26.93000031 air_temp_c_NA !NA     
 5 air_temp_c 26.84000015 air_temp_c_NA !NA     
 6 air_temp_c 26.94000053 air_temp_c_NA !NA     
 7 air_temp_c 27.04000092 air_temp_c_NA !NA     
 8 air_temp_c 27.11000061 air_temp_c_NA !NA     
 9 air_temp_c 27.20999908 air_temp_c_NA !NA     
10 air_temp_c 27.25       air_temp_c_NA !NA     
# ... with 1,462 more rows
ocean_imp_mean_gather$value <- as.numeric(ocean_imp_mean_gather$value)


ocean_imp_mean_gather
# A tibble: 1,472 x 4
   variable   value variable_NA   value_NA
   <chr>      <dbl> <chr>         <chr>   
 1 air_temp_c  27.1 air_temp_c_NA !NA     
 2 air_temp_c  27.0 air_temp_c_NA !NA     
 3 air_temp_c  27   air_temp_c_NA !NA     
 4 air_temp_c  26.9 air_temp_c_NA !NA     
 5 air_temp_c  26.8 air_temp_c_NA !NA     
 6 air_temp_c  26.9 air_temp_c_NA !NA     
 7 air_temp_c  27.0 air_temp_c_NA !NA     
 8 air_temp_c  27.1 air_temp_c_NA !NA     
 9 air_temp_c  27.2 air_temp_c_NA !NA     
10 air_temp_c  27.2 air_temp_c_NA !NA     
# ... with 1,462 more rows
# Explore the imputations in a histogram 
ggplot(ocean_imp_mean_gather, 
       aes(x = value, fill = value_NA)) + 
  geom_histogram() + 
  facet_wrap(~variable)

Imputacija sa različitim modelima

Koristićemo simputation paket. Za imputaciju putem linearnog modela koristićemo impute_lm.

library(simputation)

df13 <- tibble::tribble(
            ~y,  ~x1,  ~x2,
          2.67, 2.43, 3.27,
          3.87, 3.55, 1.45,
            NA,  2.9, 1.49,
          5.21, 2.72, 1.84,
            NA, 4.29, 1.15
          )

df13
# A tibble: 5 x 3
      y    x1    x2
  <dbl> <dbl> <dbl>
1  2.67  2.43  3.27
2  3.87  3.55  1.45
3 NA     2.9   1.49
4  5.21  2.72  1.84
5 NA     4.29  1.15
df13 %>% 
   bind_shadow(only_miss = TRUE) %>%
   add_label_shadow() %>%
   impute_lm(y ~ x1 + x2)
# A tibble: 5 x 5
      y    x1    x2 y_NA  any_missing
* <dbl> <dbl> <dbl> <fct> <chr>      
1  2.67  2.43  3.27 !NA   Not Missing
2  3.87  3.55  1.45 !NA   Not Missing
3  5.54  2.9   1.49 NA    Missing    
4  5.21  2.72  1.84 !NA   Not Missing
5  2.56  4.29  1.15 NA    Missing    
aq_imp_lm <- airquality %>%
  bind_shadow() %>%
  add_label_shadow() %>%
  impute_lm(Solar.R ~ Wind + Temp + Month) %>%
  impute_lm(Ozone ~ Wind + Temp + Month)

aq_imp_lm
# A tibble: 153 x 13
   Ozone Solar.R  Wind  Temp Month   Day Ozone_NA Solar.R_NA Wind_NA Temp_NA
 * <dbl>   <dbl> <dbl> <int> <int> <int> <fct>    <fct>      <fct>   <fct>  
 1 41       190    7.4    67     5     1 !NA      !NA        !NA     !NA    
 2 36       118    8      72     5     2 !NA      !NA        !NA     !NA    
 3 12       149   12.6    74     5     3 !NA      !NA        !NA     !NA    
 4 18       313   11.5    62     5     4 !NA      !NA        !NA     !NA    
 5 -9.04    138.  14.3    56     5     5 NA       NA         !NA     !NA    
 6 28       178.  14.9    66     5     6 !NA      NA         !NA     !NA    
 7 23       299    8.6    65     5     7 !NA      !NA        !NA     !NA    
 8 19        99   13.8    59     5     8 !NA      !NA        !NA     !NA    
 9  8        19   20.1    61     5     9 !NA      !NA        !NA     !NA    
10 35.2     194    8.6    69     5    10 NA       !NA        !NA     !NA    
# ... with 143 more rows, and 3 more variables: Month_NA <fct>, Day_NA <fct>,
#   any_missing <chr>

Važan deo imputacije jeste praćenje imputiranih vrednosti putem bind_shadow i add_label_missings funkcija.

aq_imp_lm <-
airquality %>%
  bind_shadow() %>%
  add_label_missings() %>%
  impute_lm(Solar.R ~ Wind + Temp + 
            Month) %>%
  impute_lm(Ozone ~ Wind + Temp + 
            Month)


ggplot(aq_imp_lm,
       aes(x = Solar.R,
           y = Ozone,
           colour = any_missing)) +
  geom_point()

Poređenje imputacionih modela.

aq_imp_small <- airquality %>%
bind_shadow() %>%
impute_lm(Ozone ~ Wind + Temp) %>%
impute_lm(Solar.R ~ Wind + Temp) %>%
add_label_shadow()

aq_imp_small
# A tibble: 153 x 13
   Ozone Solar.R  Wind  Temp Month   Day Ozone_NA Solar.R_NA Wind_NA Temp_NA
   <dbl>   <dbl> <dbl> <int> <int> <int> <fct>    <fct>      <fct>   <fct>  
 1  41      190    7.4    67     5     1 !NA      !NA        !NA     !NA    
 2  36      118    8      72     5     2 !NA      !NA        !NA     !NA    
 3  12      149   12.6    74     5     3 !NA      !NA        !NA     !NA    
 4  18      313   11.5    62     5     4 !NA      !NA        !NA     !NA    
 5 -11.7    127.  14.3    56     5     5 NA       NA         !NA     !NA    
 6  28      160.  14.9    66     5     6 !NA      NA         !NA     !NA    
 7  23      299    8.6    65     5     7 !NA      !NA        !NA     !NA    
 8  19       99   13.8    59     5     8 !NA      !NA        !NA     !NA    
 9   8       19   20.1    61     5     9 !NA      !NA        !NA     !NA    
10  29.7    194    8.6    69     5    10 NA       !NA        !NA     !NA    
# ... with 143 more rows, and 3 more variables: Month_NA <fct>, Day_NA <fct>,
#   any_missing <chr>
aq_imp_large <- airquality %>%
bind_shadow() %>%
impute_lm(Ozone ~ Wind + Temp + Month + Day) %>%
impute_lm(Solar.R ~ Wind + Temp + Month + Day) %>%
add_label_shadow()


aq_imp_large
# A tibble: 153 x 13
   Ozone Solar.R  Wind  Temp Month   Day Ozone_NA Solar.R_NA Wind_NA Temp_NA
   <dbl>   <dbl> <dbl> <int> <int> <int> <fct>    <fct>      <fct>   <fct>  
 1  41      190    7.4    67     5     1 !NA      !NA        !NA     !NA    
 2  36      118    8      72     5     2 !NA      !NA        !NA     !NA    
 3  12      149   12.6    74     5     3 !NA      !NA        !NA     !NA    
 4  18      313   11.5    62     5     4 !NA      !NA        !NA     !NA    
 5 -12.5    152.  14.3    56     5     5 NA       NA         !NA     !NA    
 6  28      189.  14.9    66     5     6 !NA      NA         !NA     !NA    
 7  23      299    8.6    65     5     7 !NA      !NA        !NA     !NA    
 8  19       99   13.8    59     5     8 !NA      !NA        !NA     !NA    
 9   8       19   20.1    61     5     9 !NA      !NA        !NA     !NA    
10  33.6    194    8.6    69     5    10 NA       !NA        !NA     !NA    
# ... with 143 more rows, and 3 more variables: Month_NA <fct>, Day_NA <fct>,
#   any_missing <chr>

Vezivanje modela u isti dataframe i prikaz.

bound_models <- 
bind_rows(small = aq_imp_small,
          large = aq_imp_large,
          .id = "imp_model")

bound_models
# A tibble: 306 x 14
   imp_model Ozone Solar.R  Wind  Temp Month   Day Ozone_NA Solar.R_NA Wind_NA
   <chr>     <dbl>   <dbl> <dbl> <int> <int> <int> <fct>    <fct>      <fct>  
 1 small      41      190    7.4    67     5     1 !NA      !NA        !NA    
 2 small      36      118    8      72     5     2 !NA      !NA        !NA    
 3 small      12      149   12.6    74     5     3 !NA      !NA        !NA    
 4 small      18      313   11.5    62     5     4 !NA      !NA        !NA    
 5 small     -11.7    127.  14.3    56     5     5 NA       NA         !NA    
 6 small      28      160.  14.9    66     5     6 !NA      NA         !NA    
 7 small      23      299    8.6    65     5     7 !NA      !NA        !NA    
 8 small      19       99   13.8    59     5     8 !NA      !NA        !NA    
 9 small       8       19   20.1    61     5     9 !NA      !NA        !NA    
10 small      29.7    194    8.6    69     5    10 NA       !NA        !NA    
# ... with 296 more rows, and 4 more variables: Temp_NA <fct>, Month_NA <fct>,
#   Day_NA <fct>, any_missing <chr>
ggplot(bound_models,
       aes(x = Ozone,
           y = Solar.R,
           colour = any_missing)) +
  geom_point() + 
  facet_wrap(~imp_model)

Poređenje imputacija za više modela.

bound_models_gather <- bound_models %>%
  select(Ozone, Solar.R,
         any_missing, imp_model) %>%
  gather(key = "variable", value = "value",
         -any_missing, -imp_model)

bound_models_gather
# A tibble: 612 x 4
   any_missing imp_model variable value
   <chr>       <chr>     <chr>    <dbl>
 1 Not Missing small     Ozone     41  
 2 Not Missing small     Ozone     36  
 3 Not Missing small     Ozone     12  
 4 Not Missing small     Ozone     18  
 5 Missing     small     Ozone    -11.7
 6 Missing     small     Ozone     28  
 7 Not Missing small     Ozone     23  
 8 Not Missing small     Ozone     19  
 9 Not Missing small     Ozone      8  
10 Missing     small     Ozone     29.7
# ... with 602 more rows
ggplot(bound_models_gather,
       aes(x = imp_model,
           y = value,colour=imp_model)) +
  geom_boxplot() + 
  facet_wrap(~variable)

Poređenje imputacija za više modela, samo za nedostajuće vrednosti.

bound_models_gather %>%
  filter(any_missing == "Missing") %>% 
  ggplot(aes(x = imp_model, y = value, colour=imp_model)) +
  geom_boxplot() +
  facet_wrap(~variable)

Vežba - imputacija korišćenjem simputation paketa.

# Impute humidity and air temperature using wind_ew and wind_ns, and track missing values
ocean_imp_lm_wind <- oceanbuoys %>% 
    bind_shadow() %>%
    impute_lm(air_temp_c ~ wind_ew + wind_ns) %>% 
    impute_lm(humidity ~ wind_ew + wind_ns) %>%
    add_label_shadow()
    
# Plot the imputed values for air_temp_c and humidity, colored by missingness
ggplot(ocean_imp_lm_wind, 
       aes(x = air_temp_c, y = humidity, color = any_missing)) +
  geom_point()

Vežba - evaluacija i poređenje.

# Bind the models together 
bound_models <- bind_rows(mean = ocean_imp_mean,
                          lm_wind = ocean_imp_lm_wind,
                          .id = "imp_model")

# Inspect the values of air_temp and humidity as a scatterplot
ggplot(bound_models,
       aes(x = air_temp_c,
           y = humidity,
           color = any_missing)) +
  geom_point() + 
  facet_wrap(~imp_model)

Vežba - evaluacija i poređenje za više mopdela i promenljivih.

# Build a model adding year to the outcome
ocean_imp_lm_wind_year <- bind_shadow(oceanbuoys) %>%
  impute_lm(air_temp_c ~ wind_ew + wind_ns + year) %>%
  impute_lm(humidity ~ wind_ew + wind_ns + year) %>%
  add_label_shadow()

# Bind the mean, lm_wind, and lm_wind_year models together
bound_models <- bind_rows(mean = ocean_imp_mean,
                          lm_wind = ocean_imp_lm_wind,
                          lm_wind_year = ocean_imp_lm_wind_year,
                          .id = "imp_model")

# Explore air_temp and humidity, coloring by any missings, and faceting by imputation model
ggplot(bound_models, aes(x = air_temp_c, y = humidity, color = any_missing)) +
  geom_point() + facet_wrap(~imp_model)

Statističko zaključivanje na osnovu imputiranih podataka u kontekstu modeliranja.

Evaluacija parametara modela.

lm(Temp ~ Ozone + Solar.R + Wind + Month + Day, data = airquality)

Call:
lm(formula = Temp ~ Ozone + Solar.R + Wind + Month + Day, data = airquality)

Coefficients:
(Intercept)        Ozone      Solar.R         Wind        Month          Day  
   57.25183      0.16528      0.01082     -0.17433      2.04246     -0.08919  

Spajanje datasetova.

#1.  Complete cases
aq_cc <-   airquality %>%
  na.omit() %>%
  bind_shadow() %>%
  add_label_shadow()


aq_cc
# A tibble: 111 x 13
   Ozone Solar.R  Wind  Temp Month   Day Ozone_NA Solar.R_NA Wind_NA Temp_NA
   <int>   <int> <dbl> <int> <int> <int> <fct>    <fct>      <fct>   <fct>  
 1    41     190   7.4    67     5     1 !NA      !NA        !NA     !NA    
 2    36     118   8      72     5     2 !NA      !NA        !NA     !NA    
 3    12     149  12.6    74     5     3 !NA      !NA        !NA     !NA    
 4    18     313  11.5    62     5     4 !NA      !NA        !NA     !NA    
 5    23     299   8.6    65     5     7 !NA      !NA        !NA     !NA    
 6    19      99  13.8    59     5     8 !NA      !NA        !NA     !NA    
 7     8      19  20.1    61     5     9 !NA      !NA        !NA     !NA    
 8    16     256   9.7    69     5    12 !NA      !NA        !NA     !NA    
 9    11     290   9.2    66     5    13 !NA      !NA        !NA     !NA    
10    14     274  10.9    68     5    14 !NA      !NA        !NA     !NA    
# ... with 101 more rows, and 3 more variables: Month_NA <fct>, Day_NA <fct>,
#   any_missing <chr>
#2. Imputation using the imputed data from the last lesson
aq_imp_lm <- bind_shadow(airquality) %>%
  add_label_shadow() %>%
  impute_lm(Ozone ~ Temp + Wind + Month + Day) %>%
  impute_lm(Solar.R ~ Temp + Wind + Month + Day)


aq_imp_lm
# A tibble: 153 x 13
   Ozone Solar.R  Wind  Temp Month   Day Ozone_NA Solar.R_NA Wind_NA Temp_NA
 * <dbl>   <dbl> <dbl> <int> <int> <int> <fct>    <fct>      <fct>   <fct>  
 1  41      190    7.4    67     5     1 !NA      !NA        !NA     !NA    
 2  36      118    8      72     5     2 !NA      !NA        !NA     !NA    
 3  12      149   12.6    74     5     3 !NA      !NA        !NA     !NA    
 4  18      313   11.5    62     5     4 !NA      !NA        !NA     !NA    
 5 -12.5    152.  14.3    56     5     5 NA       NA         !NA     !NA    
 6  28      189.  14.9    66     5     6 !NA      NA         !NA     !NA    
 7  23      299    8.6    65     5     7 !NA      !NA        !NA     !NA    
 8  19       99   13.8    59     5     8 !NA      !NA        !NA     !NA    
 9   8       19   20.1    61     5     9 !NA      !NA        !NA     !NA    
10  33.6    194    8.6    69     5    10 NA       !NA        !NA     !NA    
# ... with 143 more rows, and 3 more variables: Month_NA <fct>, Day_NA <fct>,
#   any_missing <chr>
# 3. Bind the models together
bound_models <- bind_rows(cc = aq_cc,
                          imp_lm = aq_imp_lm,
                          .id = "imp_model")


bound_models
# A tibble: 264 x 14
   imp_model Ozone Solar.R  Wind  Temp Month   Day Ozone_NA Solar.R_NA Wind_NA
   <chr>     <dbl>   <dbl> <dbl> <int> <int> <int> <fct>    <fct>      <fct>  
 1 cc           41     190   7.4    67     5     1 !NA      !NA        !NA    
 2 cc           36     118   8      72     5     2 !NA      !NA        !NA    
 3 cc           12     149  12.6    74     5     3 !NA      !NA        !NA    
 4 cc           18     313  11.5    62     5     4 !NA      !NA        !NA    
 5 cc           23     299   8.6    65     5     7 !NA      !NA        !NA    
 6 cc           19      99  13.8    59     5     8 !NA      !NA        !NA    
 7 cc            8      19  20.1    61     5     9 !NA      !NA        !NA    
 8 cc           16     256   9.7    69     5    12 !NA      !NA        !NA    
 9 cc           11     290   9.2    66     5    13 !NA      !NA        !NA    
10 cc           14     274  10.9    68     5    14 !NA      !NA        !NA    
# ... with 254 more rows, and 4 more variables: Temp_NA <fct>, Month_NA <fct>,
#   Day_NA <fct>, any_missing <chr>

Istraživanje modela

model_summary <- bound_models %>% 
  group_by(imp_model) %>%
  nest() %>%
  mutate(mod = map(data, 
                   ~lm(Temp ~ Ozone + Solar.R + Wind + Day + Month , 
                       data = .)),
         res = map(mod, residuals),
         pred = map(mod, predict),
         tidy = map(mod, broom::tidy))

model_summary
# A tibble: 2 x 6
# Groups:   imp_model [2]
  imp_model            data mod    res         pred        tidy            
  <chr>     <list<df[,13]>> <list> <list>      <list>      <list>          
1 cc             [111 x 13] <lm>   <dbl [111]> <dbl [111]> <tibble [6 x 5]>
2 imp_lm         [153 x 13] <lm>   <dbl [153]> <dbl [153]> <tibble [6 x 5]>

Istraživanje koeficijenata modela.

model_summary %>% 
  select(imp_model,
         tidy) %>%
  unnest()
# A tibble: 12 x 6
# Groups:   imp_model [2]
   imp_model term        estimate std.error statistic  p.value
   <chr>     <chr>          <dbl>     <dbl>     <dbl>    <dbl>
 1 cc        (Intercept) 57.3       4.50      12.7    5.52e-23
 2 cc        Ozone        0.165     0.0239     6.92   3.66e-10
 3 cc        Solar.R      0.0108    0.00699    1.55   1.24e- 1
 4 cc        Wind        -0.174     0.212     -0.821  4.13e- 1
 5 cc        Day         -0.0892    0.0677    -1.32   1.91e- 1
 6 cc        Month        2.04      0.409      4.99   2.42e- 6
 7 imp_lm    (Intercept) 54.7       3.59      15.2    5.21e-32
 8 imp_lm    Ozone        0.196     0.0205     9.53   4.52e-17
 9 imp_lm    Solar.R      0.0102    0.00577    1.76   7.97e- 2
10 imp_lm    Wind        -0.00642   0.172     -0.0374 9.70e- 1
11 imp_lm    Day         -0.112     0.0538    -2.08   3.92e- 2
12 imp_lm    Month        2.11      0.340      6.21   5.09e- 9
model_summary %>% 
  select(imp_model,
         res) %>%
  unnest() %>%
  ggplot(aes(x = res,
             fill = imp_model)) +
  geom_histogram(position = "dodge")

model_summary %>% 
  select(imp_model,
         pred) %>%
  unnest() %>%
  ggplot(aes(x = pred,
             fill = imp_model)) +
  geom_histogram(position = "dodge")

Vežba - spajanje i poređenje više modela.

ocean_cc <- oceanbuoys %>% drop_na()

# Create an imputed dataset using a linear models
ocean_imp_lm_all <- bind_shadow(oceanbuoys) %>%
  add_label_shadow() %>%
  impute_lm(sea_temp_c ~ wind_ew + wind_ns + year + latitude + longitude) %>%
  impute_lm(air_temp_c ~ wind_ew + wind_ns + year + latitude + longitude) %>%
  impute_lm(humidity ~ wind_ew + wind_ns + year + latitude + longitude)

# Bind the datasets
bound_models <- bind_rows(cc = ocean_cc,
                          imp_lm_wind = ocean_imp_lm_wind,
                          imp_lm_all = ocean_imp_lm_all,
                          .id = "imp_model")

# Look at the models
bound_models
# A tibble: 2,037 x 18
   imp_model  year latitude longitude sea_temp_c air_temp_c humidity wind_ew
   <chr>     <dbl>    <dbl>     <dbl>      <dbl>      <dbl>    <dbl>   <dbl>
 1 cc         1997        0      -110       27.6       27.1     79.6   -6.40
 2 cc         1997        0      -110       27.5       27.0     75.8   -5.30
 3 cc         1997        0      -110       27.6       27       76.5   -5.10
 4 cc         1997        0      -110       27.6       26.9     76.2   -4.90
 5 cc         1997        0      -110       27.6       26.8     76.4   -3.5 
 6 cc         1997        0      -110       27.8       26.9     76.7   -4.40
 7 cc         1997        0      -110       28.0       27.0     76.5   -2   
 8 cc         1997        0      -110       28.0       27.1     78.3   -3.70
 9 cc         1997        0      -110       28.0       27.2     78.6   -4.20
10 cc         1997        0      -110       28.0       27.2     76.9   -3.60
# ... with 2,027 more rows, and 10 more variables: wind_ns <dbl>,
#   year_NA <fct>, latitude_NA <fct>, longitude_NA <fct>, sea_temp_c_NA <fct>,
#   air_temp_c_NA <fct>, humidity_NA <fct>, wind_ew_NA <fct>, wind_ns_NA <fct>,
#   any_missing <chr>
# Create the model summary for each dataset
model_summary <- bound_models %>% 
  group_by(imp_model) %>%
  nest() %>%
  mutate(mod = map(data, ~lm(sea_temp_c ~ air_temp_c + humidity + year, data = .)),
         res = map(mod, residuals),
         pred = map(mod, predict),
         tidy = map(mod, broom::tidy))

# Explore the coefficients in the model
model_summary %>% 
    select(imp_model,tidy) %>%
    unnest()
# A tibble: 12 x 6
# Groups:   imp_model [3]
   imp_model   term          estimate std.error statistic   p.value
   <chr>       <chr>            <dbl>     <dbl>     <dbl>     <dbl>
 1 cc          (Intercept)  -735.      45.9        -16.0  8.19e- 48
 2 cc          air_temp_c      0.864    0.0231      37.4  2.64e-154
 3 cc          humidity        0.0341   0.00390      8.74 2.69e- 17
 4 cc          year            0.369    0.0232      15.9  3.46e- 47
 5 imp_lm_wind (Intercept) -1742.      56.1        -31.0  1.83e-135
 6 imp_lm_wind air_temp_c      0.365    0.0279      13.1  2.73e- 35
 7 imp_lm_wind humidity        0.0225   0.00690      3.26 1.17e-  3
 8 imp_lm_wind year            0.880    0.0283      31.1  6.79e-136
 9 imp_lm_all  (Intercept)  -697.      51.8        -13.5  5.04e- 37
10 imp_lm_all  air_temp_c      0.890    0.0255      35.0  2.90e-158
11 imp_lm_all  humidity        0.0127   0.00463      2.75 6.03e-  3
12 imp_lm_all  year            0.351    0.0262      13.4  1.12e- 36
best_model <- "imp_lm_all"

best_model
[1] "imp_lm_all"