The best thing to do with missing data is to not have any
—Gertrude Mary Cox
Bavićemo se sledećim:
Podaci koji nedostaju su vrednosti koje su trebale biti zabeležene, ali nisu.
NA = Not Available.
Kako proveriti da li imamo nedostajuće podatke?
x <- c(1, NA, 3, NA, NA, 5)
library(naniar)
any_na(x)
[1] TRUE
are_na(x)
[1] FALSE TRUE FALSE TRUE TRUE FALSE
n_miss(x)
[1] 3
prop_miss(x)
[1] 0.5
Podaci koji nedostaju u kalkulacijama?
Generalno pravilo: NA + [anything] = NA
visineDF <- data.frame(Srdjan = c(182), Zoran = c(188), Milan = c(NA))
sum(visineDF)
[1] NA
Neke začkoljice u radu sa podacima koji nedostaju
NaN: Not a Number — R tumači NaN kao podatak koji nedostaje.
NULL — je prazna vrednost, i R je ne tretira kao nedostajuću.
Inf — je beskonačna vrednost, i R je ne tretira kao nedostajuću.
any_na(NaN)
[1] TRUE
any_na(NULL)
[1] FALSE
any_na(Inf)
[1] FALSE
Podaci koji nedostaju u uslovnim izjavama
NA | TRUE
[1] TRUE
NA | FALSE
[1] NA
NA + NaN
[1] NA
NaN + NA
[1] NaN
# Create x, a vector, with values NA, NaN, Inf, ".", and "missing"
x <- c(NA, NaN, Inf, ".", "missing")
# Use any_na() and are_na() on to explore the missings
any_na(x)
[1] TRUE
are_na(x)
[1] TRUE FALSE FALSE FALSE FALSE
Jedna od prvih stvari koje ćete želeti da uradite sa novim datasetom jeste provera da li, i koliko nedostajućih vrednosti poostoji.
library(readr)
dat_hw <- read_csv(here::here("data-raw","data_hw.csv"),
col_types = cols(height = col_double(), weight = col_double()))
# Use n_miss() to count the total number of missing values in dat_hw
dat_hw
# A tibble: 100 x 2
weight height
<dbl> <dbl>
1 104. -1.15
2 77.1 2.79
3 108. 1.81
4 85.0 1.71
5 87.4 0.918
6 87.4 0.249
7 87.2 1.94
8 124. 2.24
9 82.8 2.58
10 NA NA
# ... with 90 more rows
glimpse(dat_hw)
Observations: 100
Variables: 2
$ weight <dbl> 104.16064, 77.05248, 107.69692, 85.03788, 87.44273, 87.38751...
$ height <dbl> -1.1468301, 2.7899022, 1.8085469, 1.7118845, 0.9179269, 0.24...
n_miss(dat_hw)
[1] 30
# Use n_miss() on dat_hw$weight to count the total number of missing values
n_miss(dat_hw$weight)
[1] 15
# Use n_complete() on dat_hw to count the total number of complete values
n_complete(dat_hw)
[1] 170
# Use n_complete() on dat_hw$weight to count the total number of complete values
n_complete(dat_hw$weight)
[1] 85
# Use prop_miss() and prop_complete on dat_hw to count the total number of missing values in each of the variables
prop_miss(dat_hw)
[1] 0.15
prop_complete(dat_hw)
[1] 0.85
pct_complete(dat_hw)
[1] 85
pct_miss(dat_hw)
[1] 15
Vežba — koji će biti rezultat sledećih izjava?
1 + NA
NA + NA
NA | TRUE
NA | FALSE
Potrebni su nam sumarni izveštaji/pregledi/rezimei za:
pošto sve gore navedeno može uticati nepovoljno na dalju analizu podataka.
Prema strukturi rezimea postoje:
Paket naniar poseduje familiju funkcija čija imena počinju sa miss_. Svaka od ovih funkcija daje različite preglede nedostajućih podataka, i kao izlaz daje dataframe. Ove funkcije takođe rade sa dplyr funkcijom group_by, tako da možete istražiti obrasce nedostajanja po grupama.
head(airquality)
Ozone Solar.R Wind Temp Month Day
1 41 190 7.4 67 5 1
2 36 118 8.0 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
5 NA NA 14.3 56 5 5
6 28 NA 14.9 66 5 6
glimpse(airquality)
Observations: 153
Variables: 6
$ Ozone <int> 41, 36, 12, 18, NA, 28, 23, 19, 8, NA, 7, 16, 11, 14, 18, 1...
$ Solar.R <int> 190, 118, 149, 313, NA, NA, 299, 99, 19, 194, NA, 256, 290,...
$ Wind <dbl> 7.4, 8.0, 12.6, 11.5, 14.3, 14.9, 8.6, 13.8, 20.1, 8.6, 6.9...
$ Temp <int> 67, 72, 74, 62, 56, 66, 65, 59, 61, 69, 74, 69, 66, 68, 58,...
$ Month <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,...
$ Day <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, ...
miss_var_summary(airquality)
# A tibble: 6 x 3
variable n_miss pct_miss
<chr> <int> <dbl>
1 Ozone 37 24.2
2 Solar.R 7 4.58
3 Wind 0 0
4 Temp 0 0
5 Month 0 0
6 Day 0 0
miss_case_summary(airquality)
# A tibble: 153 x 3
case n_miss pct_miss
<int> <int> <dbl>
1 5 2 33.3
2 27 2 33.3
3 6 1 16.7
4 10 1 16.7
5 11 1 16.7
6 25 1 16.7
7 26 1 16.7
8 32 1 16.7
9 33 1 16.7
10 34 1 16.7
# ... with 143 more rows
Tabulacija nedostajućih podataka prikazuje broj promenljivih/zapisa sa 0, 1, 2, ..., n nedostajućih vrednosti i udeo u ukupnom broju promenljivih/zapisa koje ove promenljive/zapisi čine.
miss_var_table(airquality)
# A tibble: 3 x 3
n_miss_in_var n_vars pct_vars
<int> <int> <dbl>
1 0 4 66.7
2 7 1 16.7
3 37 1 16.7
miss_case_table(airquality)
# A tibble: 3 x 3
n_miss_in_case n_cases pct_cases
<int> <int> <dbl>
1 0 111 72.5
2 1 40 26.1
3 2 2 1.31
Takođe možemo istražiti nedostajanje po rasponima podataka. Ovo je veoma korisno za vremenske serije. miss_var_span daje izveštaj o nedostajućim podacima u obliku dataframe-a za svaki span, dok miss_var_run vraća dataframe sa dve nove kolone run_length i is_na koje opisuju complete/missing serije nedostajanja podataka. Ovo je korisno za istraživanje ponavljanja u obrascu nedostajanja.
head(pedestrian)
# A tibble: 6 x 9
hourly_counts date_time year month month_day week_day hour
<int> <dttm> <int> <ord> <int> <ord> <int>
1 883 2016-01-01 00:00:00 2016 Janu~ 1 Friday 0
2 597 2016-01-01 01:00:00 2016 Janu~ 1 Friday 1
3 294 2016-01-01 02:00:00 2016 Janu~ 1 Friday 2
4 183 2016-01-01 03:00:00 2016 Janu~ 1 Friday 3
5 118 2016-01-01 04:00:00 2016 Janu~ 1 Friday 4
6 68 2016-01-01 05:00:00 2016 Janu~ 1 Friday 5
# ... with 2 more variables: sensor_id <int>, sensor_name <chr>
glimpse(pedestrian)
Observations: 37,700
Variables: 9
$ hourly_counts <int> 883, 597, 294, 183, 118, 68, 47, 52, 120, 333, 761, 1...
$ date_time <dttm> 2016-01-01 00:00:00, 2016-01-01 01:00:00, 2016-01-01...
$ year <int> 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016,...
$ month <ord> January, January, January, January, January, January,...
$ month_day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
$ week_day <ord> Friday, Friday, Friday, Friday, Friday, Friday, Frida...
$ hour <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,...
$ sensor_id <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,...
$ sensor_name <chr> "Bourke Street Mall (South)", "Bourke Street Mall (So...
miss_var_span(pedestrian, var = hourly_counts, span_every = 4000)
# A tibble: 10 x 5
span_counter n_miss n_complete prop_miss prop_complete
<int> <int> <dbl> <dbl> <dbl>
1 1 0 4000 0 1
2 2 1 3999 0.00025 1.00
3 3 121 3879 0.0302 0.970
4 4 503 3497 0.126 0.874
5 5 745 3255 0.186 0.814
6 6 0 4000 0 1
7 7 1 3999 0.00025 1.00
8 8 0 4000 0 1
9 9 745 3255 0.186 0.814
10 10 432 3568 0.108 0.892
miss_var_run(pedestrian, hourly_counts)
# A tibble: 35 x 2
run_length is_na
<int> <chr>
1 6628 complete
2 1 missing
3 5250 complete
4 624 missing
5 3652 complete
6 1 missing
7 1290 complete
8 744 missing
9 7420 complete
10 1 missing
# ... with 25 more rows
Nekada nas interesuju nedostajanja u grupama podataka. Svaka od gornjih funkcija može raditi sa group_by funkcijom.
airquality %>%
group_by(Month) %>%
miss_var_summary()
# A tibble: 25 x 4
# Groups: Month [5]
Month variable n_miss pct_miss
<int> <chr> <int> <dbl>
1 5 Ozone 5 16.1
2 5 Solar.R 4 12.9
3 5 Wind 0 0
4 5 Temp 0 0
5 5 Day 0 0
6 6 Ozone 21 70
7 6 Solar.R 0 0
8 6 Wind 0 0
9 6 Temp 0 0
10 6 Day 0 0
# ... with 15 more rows
Vežbanje rezimiranja nedostajućih podataka
# Summarise missingness in each variable of the `airquality` dataset
miss_var_summary(airquality)
# A tibble: 6 x 3
variable n_miss pct_miss
<chr> <int> <dbl>
1 Ozone 37 24.2
2 Solar.R 7 4.58
3 Wind 0 0
4 Temp 0 0
5 Month 0 0
6 Day 0 0
# Summarise missingness in each case of the `airquality` dataset
miss_case_summary(airquality)
# A tibble: 153 x 3
case n_miss pct_miss
<int> <int> <dbl>
1 5 2 33.3
2 27 2 33.3
3 6 1 16.7
4 10 1 16.7
5 11 1 16.7
6 25 1 16.7
7 26 1 16.7
8 32 1 16.7
9 33 1 16.7
10 34 1 16.7
# ... with 143 more rows
# Return the summary of missingness in each variable, grouped by Month, in the `airquality` dataset
airquality %>% group_by(Month) %>% miss_var_summary()
# A tibble: 25 x 4
# Groups: Month [5]
Month variable n_miss pct_miss
<int> <chr> <int> <dbl>
1 5 Ozone 5 16.1
2 5 Solar.R 4 12.9
3 5 Wind 0 0
4 5 Temp 0 0
5 5 Day 0 0
6 6 Ozone 21 70
7 6 Solar.R 0 0
8 6 Wind 0 0
9 6 Temp 0 0
10 6 Day 0 0
# ... with 15 more rows
# Return the summary of missingness in each case, grouped by Month, in the `airquality` dataset
airquality %>% group_by(Month) %>% miss_case_summary()
# A tibble: 153 x 4
# Groups: Month [5]
Month case n_miss pct_miss
<int> <int> <int> <dbl>
1 5 5 2 40
2 5 27 2 40
3 5 6 1 20
4 5 10 1 20
5 5 11 1 20
6 5 25 1 20
7 5 26 1 20
8 5 1 0 0
9 5 2 0 0
10 5 3 0 0
# ... with 143 more rows
Vežbanje tabulacije nedostajanja
# Tabulate missingness in each variable and case of the `airquality` dataset
miss_var_table(airquality)
# A tibble: 3 x 3
n_miss_in_var n_vars pct_vars
<int> <int> <dbl>
1 0 4 66.7
2 7 1 16.7
3 37 1 16.7
miss_case_table(airquality)
# A tibble: 3 x 3
n_miss_in_case n_cases pct_cases
<int> <int> <dbl>
1 0 111 72.5
2 1 40 26.1
3 2 2 1.31
# Tabulate the missingness in each variable, grouped by Month, in the `airquality` dataset
airquality %>% group_by(Month) %>% miss_var_table()
# A tibble: 12 x 4
# Groups: Month [5]
Month n_miss_in_var n_vars pct_vars
<int> <int> <int> <dbl>
1 5 0 3 60
2 5 4 1 20
3 5 5 1 20
4 6 0 4 80
5 6 21 1 20
6 7 0 4 80
7 7 5 1 20
8 8 0 3 60
9 8 3 1 20
10 8 5 1 20
11 9 0 4 80
12 9 1 1 20
# Tabulate of missingness in each case, grouped by Month, in the `airquality` dataset
airquality %>% group_by(Month) %>% miss_case_table()
# A tibble: 11 x 4
# Groups: Month [5]
Month n_miss_in_case n_cases pct_cases
<int> <int> <int> <dbl>
1 5 0 24 77.4
2 5 1 5 16.1
3 5 2 2 6.45
4 6 0 9 30
5 6 1 21 70
6 7 0 26 83.9
7 7 1 5 16.1
8 8 0 23 74.2
9 8 1 8 25.8
10 9 0 29 96.7
11 9 1 1 3.33
Vežbanje ostalih pregleda
# Calculate the summaries for each run of missingness for the variable, hourly_counts
miss_var_run(pedestrian, var = hourly_counts)
# A tibble: 35 x 2
run_length is_na
<int> <chr>
1 6628 complete
2 1 missing
3 5250 complete
4 624 missing
5 3652 complete
6 1 missing
7 1290 complete
8 744 missing
9 7420 complete
10 1 missing
# ... with 25 more rows
# Calculate the summaries for each span of missingness, for a span of 4000, for the variable hourly_counts
miss_var_span(pedestrian, var = hourly_counts, span_every = 4000)
# A tibble: 10 x 5
span_counter n_miss n_complete prop_miss prop_complete
<int> <int> <dbl> <dbl> <dbl>
1 1 0 4000 0 1
2 2 1 3999 0.00025 1.00
3 3 121 3879 0.0302 0.970
4 4 503 3497 0.126 0.874
5 5 745 3255 0.186 0.814
6 6 0 4000 0 1
7 7 1 3999 0.00025 1.00
8 8 0 4000 0 1
9 9 745 3255 0.186 0.814
10 10 432 3568 0.108 0.892
# For each `month` variable, calculate the run of missingness for hourly_counts
pedestrian %>% group_by(month) %>% miss_var_run(var = hourly_counts)
# A tibble: 51 x 3
# Groups: month [12]
month run_length is_na
<ord> <int> <chr>
1 January 2976 complete
2 February 2784 complete
3 March 2976 complete
4 April 888 complete
5 April 552 missing
6 April 1440 complete
7 May 744 complete
8 May 72 missing
9 May 2160 complete
10 June 2880 complete
# ... with 41 more rows
# For each `month` variable, calculate the span of missingness of a span of 2000, for the variable hourly_counts
pedestrian %>% group_by(month) %>% miss_var_span(var = hourly_counts, span_every = 2000)
# A tibble: 25 x 6
# Groups: month [12]
month span_counter n_miss n_complete prop_miss prop_complete
<ord> <int> <int> <dbl> <dbl> <dbl>
1 January 1 0 2000 0 1
2 January 2 0 2000 0 1
3 February 1 0 2000 0 1
4 February 2 0 2000 0 1
5 March 1 0 2000 0 1
6 March 2 0 2000 0 1
7 April 1 552 1448 0.276 0.724
8 April 2 0 2000 0 1
9 May 1 72 1928 0.036 0.964
10 May 2 0 2000 0 1
# ... with 15 more rows
vis_miss daje Heatmap nedostajućih podataka. Argument cluster = TRUE specifikuje da želimo pregled na osnovu hijerarhijskog klasterovanja (Mcquitty metoda) da bi sortirali redove prema nedostajanjima.
library(visdat)
vis_miss(airquality)
vis_miss(airquality, cluster = TRUE)
Vizualizacija po promenljivim i zapisima. Kod gg_miss_case svaka linija predstavlja količinu nedostajanja u tom zapisu. Ove vizualizacije su sortirane tako da je najveći broj nedostajanja na vrhu. Takođe je dozvoljen faceting na osnovu nivoa zadate promenljive.
gg_miss_var(airquality)
gg_miss_case(airquality)
gg_miss_case(airquality, order_cases = FALSE)
gg_miss_var(airquality, facet = Month)
Vizualizacija obrazaca nedostajanja. Da bi istražili česte obrasce nedostajanja (koje promenljive i zapisi istovremeno nestaju) koristićemo gg_miss_upset. Ova funkcija pokazuje broj kombinacija nedostajućih vrednosti koje se javljaju zajedno ili istovremeno. Na donjem grafiku možemo videti da nedostaju vrednosti samo za Solar.R i Ozone. 35 vrednosti nedostaje samo za Ozone, 5 vrednosti nedostaje samo za Solar.R, i 2 vrednosti nedostaju istovremeno za Solar.R i Ozone.
gg_miss_upset(airquality)
Da bi istražili kako se nedostajanje podataka za promenljive menja u zavisnosti od vrednosti zadate kategorijske promenljive koristimo gg_miss_fct. Ova funkcija ne podržava faceting.
gg_miss_fct(x = airquality, fct = Month)
gg_miss_span je vizualni analog funkcije miss_var_span. Recimo broj nedostajanja na svakih 3000 redova. Ova funkcija podržava faceting.
gg_miss_span(pedestrian, hourly_counts, span_every = 3000)
Vežba — vizualizacija nedostajanja.
glimpse(riskfactors)
Observations: 245
Variables: 34
$ state <fct> 26, 40, 72, 42, 32, 19, 45, 56, 18, 8, 55, 48, 26,...
$ sex <fct> Female, Female, Female, Male, Female, Male, Male, ...
$ age <int> 49, 48, 55, 42, 66, 66, 37, 62, 38, 42, 41, 64, 58...
$ weight_lbs <int> 190, 170, 163, 230, 135, 165, 150, 170, 146, 260, ...
$ height_inch <int> 64, 68, 64, 74, 62, 70, 68, 70, 70, 73, 72, 66, 62...
$ bmi <dbl> 32.68, 25.90, 28.04, 29.59, 24.74, 23.72, 22.86, 2...
$ marital <fct> Married, Divorced, Married, Married, Widowed, Marr...
$ pregnant <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, No, NA, NA, NA...
$ children <int> 0, 0, 0, 1, 0, 0, 3, 0, 2, 3, 2, 0, 0, 0, 0, 0, 0,...
$ education <fct> 6, 5, 4, 6, 5, 5, 6, 6, 4, 5, 6, 4, 4, 3, 4, 4, 4,...
$ employment <fct> 2, 1, 5, 1, 1, 7, 2, 7, 1, 3, 1, 1, 1, 5, 1, 1, 2,...
$ income <fct> 35-50k, 35-50k, <10k, >75k, 50-75k, 35-50k, >75k, ...
$ veteran <fct> 5, 5, 5, 5, 5, 3, 5, 5, 5, 5, 5, 3, 5, 5, 5, 5, 5,...
$ hispanic <fct> No, No, Yes, No, No, No, No, No, No, No, No, No, N...
$ health_general <fct> Good, Fair, Fair, Excellent, Excellent, Excellent,...
$ health_physical <int> 3, 4, 0, 0, 0, 0, 0, 30, 30, 0, 0, 0, 5, 0, 0, 3, ...
$ health_mental <int> 15, 30, 0, 0, 0, 0, 0, 30, 30, 20, 0, 2, 0, 0, 30,...
$ health_poor <int> 2, 3, NA, NA, NA, NA, NA, 30, 14, 4, NA, 0, 3, NA,...
$ health_cover <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, No, Yes, Yes, Y...
$ provide_care <fct> No, No, No, No, No, No, Yes, No, No, No, No, No, N...
$ activity_limited <fct> Yes, No, No, No, No, No, No, No, Yes, NA, No, No, ...
$ drink_any <fct> No, No, No, Yes, No, No, Yes, No, No, NA, Yes, Yes...
$ drink_days <int> NA, NA, NA, 15, NA, NA, 2, NA, NA, NA, 4, 3, 1, NA...
$ drink_average <int> NA, NA, NA, NA, NA, NA, 2, NA, NA, NA, 2, 5, 1, NA...
$ smoke_100 <fct> No, No, No, No, Yes, No, No, Yes, Yes, Yes, No, Ye...
$ smoke_days <fct> NA, NA, NA, NA, Everyday, NA, NA, Not@All, Everyda...
$ smoke_stop <fct> NA, NA, NA, NA, Yes, NA, NA, NA, Yes, NA, NA, NA, ...
$ smoke_last <fct> NA, NA, NA, NA, NA, NA, NA, 7, NA, 5, NA, 7, 7, NA...
$ diet_fruit <int> 1095, 52, 36, NA, -7, 24, 52, 156, 24, NA, 365, 10...
$ diet_salad <int> 261, 209, 156, NA, 261, 52, 156, 24, 84, NA, 156, ...
$ diet_potato <int> 104, 52, 52, NA, 209, 104, 24, 52, 144, NA, 52, 10...
$ diet_carrot <int> 156, 0, 24, NA, 261, 52, 24, 104, 24, NA, 156, 0, ...
$ diet_vegetable <int> 521, 52, 24, NA, 365, 365, 730, 365, 0, NA, 1095, ...
$ diet_juice <int> 12, 0, 24, NA, 104, 365, 104, 0, 0, NA, 261, 12, 1...
skimr::skim(riskfactors)
| Name | riskfactors |
| Number of rows | 245 |
| Number of columns | 34 |
| _______________________ | |
| Column type frequency: | |
| factor | 18 |
| numeric | 16 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| state | 0 | 1.00 | FALSE | 52 | 6: 12, 34: 12, 25: 11, 12: 10 |
| sex | 0 | 1.00 | FALSE | 2 | Fem: 153, Mal: 92 |
| marital | 1 | 1.00 | FALSE | 6 | Mar: 131, Wid: 41, Div: 39, Nev: 24 |
| pregnant | 215 | 0.12 | FALSE | 2 | No: 28, Yes: 2 |
| education | 1 | 1.00 | FALSE | 6 | 4: 85, 6: 75, 5: 59, 3: 14 |
| employment | 0 | 1.00 | FALSE | 7 | 1: 92, 7: 80, 2: 21, 8: 18 |
| income | 0 | 1.00 | FALSE | 10 | >75: 46, 35-: 38, 50-: 32, 25-: 28 |
| veteran | 3 | 0.99 | FALSE | 5 | 5: 198, 3: 35, 4: 6, 2: 2 |
| hispanic | 2 | 0.99 | FALSE | 2 | No: 221, Yes: 22 |
| health_general | 0 | 1.00 | FALSE | 6 | Goo: 82, Ver: 56, Exc: 45, Fai: 40 |
| health_cover | 0 | 1.00 | FALSE | 2 | Yes: 217, No: 28 |
| provide_care | 3 | 0.99 | FALSE | 2 | No: 181, Yes: 61 |
| activity_limited | 3 | 0.99 | FALSE | 2 | No: 170, Yes: 72 |
| drink_any | 2 | 0.99 | FALSE | 2 | No: 132, Yes: 111 |
| smoke_100 | 2 | 0.99 | FALSE | 2 | No: 126, Yes: 117 |
| smoke_days | 128 | 0.48 | FALSE | 3 | Not: 84, Eve: 26, Som: 7 |
| smoke_stop | 212 | 0.13 | FALSE | 2 | Yes: 19, No: 14 |
| smoke_last | 161 | 0.34 | FALSE | 6 | 7: 59, 5: 9, 6: 6, 8: 5 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| age | 0 | 1.00 | 58.11 | 17.50 | 7.00 | 48.00 | 59.00 | 70.00 | 97.00 | ▁▃▇▇▃ |
| weight_lbs | 10 | 0.96 | 174.27 | 44.70 | 96.00 | 144.00 | 170.00 | 195.00 | 410.00 | ▆▇▂▁▁ |
| height_inch | 2 | 0.99 | 66.35 | 3.80 | 57.00 | 64.00 | 66.00 | 69.00 | 75.00 | ▂▇▇▇▃ |
| bmi | 11 | 0.96 | 27.78 | 6.57 | 14.21 | 23.35 | 26.66 | 31.32 | 55.72 | ▃▇▃▁▁ |
| children | 0 | 1.00 | 0.42 | 0.88 | 0.00 | 0.00 | 0.00 | 0.00 | 4.00 | ▇▁▁▁▁ |
| health_physical | 0 | 1.00 | 4.12 | 9.11 | -9.00 | 0.00 | 0.00 | 3.00 | 30.00 | ▁▇▁▁▁ |
| health_mental | 0 | 1.00 | 3.14 | 8.19 | -9.00 | 0.00 | 0.00 | 1.00 | 30.00 | ▁▇▁▁▁ |
| health_poor | 113 | 0.54 | 5.46 | 10.76 | -9.00 | 0.00 | 0.00 | 5.00 | 30.00 | ▁▇▁▁▂ |
| drink_days | 134 | 0.45 | 9.29 | 10.20 | -9.00 | 2.00 | 4.00 | 15.00 | 30.00 | ▁▇▃▃▂ |
| drink_average | 135 | 0.45 | 1.41 | 2.57 | -9.00 | 1.00 | 1.00 | 2.00 | 9.00 | ▁▁▇▆▁ |
| diet_fruit | 8 | 0.97 | 324.71 | 338.11 | -9.00 | 60.00 | 209.00 | 365.00 | 2555.00 | ▇▁▁▁▁ |
| diet_salad | 8 | 0.97 | 175.48 | 140.00 | -9.00 | 52.00 | 156.00 | 261.00 | 730.00 | ▇▇▃▁▁ |
| diet_potato | 8 | 0.97 | 111.68 | 132.42 | -9.00 | 24.00 | 104.00 | 156.00 | 1095.00 | ▇▁▁▁▁ |
| diet_carrot | 8 | 0.97 | 91.52 | 101.32 | -9.00 | 24.00 | 52.00 | 156.00 | 365.00 | ▇▂▂▁▁ |
| diet_vegetable | 8 | 0.97 | 407.15 | 316.25 | -9.00 | 156.00 | 365.00 | 626.00 | 1825.00 | ▇▇▃▁▁ |
| diet_juice | 8 | 0.97 | 227.85 | 307.28 | -9.00 | 24.00 | 104.00 | 365.00 | 2555.00 | ▇▁▁▁▁ |
# Visualize all of the missingness in the `riskfactors` dataset
vis_miss(riskfactors)
# Visualize and cluster all of the missingness in the `riskfactors` dataset
vis_miss(riskfactors, cluster = TRUE)
# Visualize and sort the columns by missingness in the `riskfactors` dataset
vis_miss(riskfactors, sort_miss = TRUE)
Vežba — vizualizacija nedostajanja po zapisima i promenljivima.
# Visualize the number of missings in cases using `gg_miss_case()`
gg_miss_case(riskfactors)
# Explore the number of missings in cases using `gg_miss_case()` and facet by the variable `education`
gg_miss_case(riskfactors, facet = education)
# Visualize the number of missings in variables using `gg_miss_var()`
gg_miss_var(riskfactors)
# Explore the number of missings in variables using `gg_miss_var()` and facet by the variable `education`
gg_miss_var(riskfactors, facet = education)
# Explore the number of missings in variables using `gg_miss_var()` and facet by the variable `education`
gg_miss_var(riskfactors, facet = education)
Vežba — vizualizacija obrazaca nedostajanja.
# Using the airquality dataset, explore the missingness pattern using gg_miss_upset()
gg_miss_upset(airquality)
# With the riskfactors dataset, explore how the missingness changes across the marital variable using gg_miss_fct()
gg_miss_fct(x = riskfactors, fct = marital)
# Using the pedestrian dataset, explore how the missingness of hourly_counts changes over a span of 3000
gg_miss_span(pedestrian, var = hourly_counts, span_every = 3000)
# Using the pedestrian dataset, explore the impact of month by facetting by month
# and explore how missingness changes for a span of 1000
gg_miss_span(pedestrian, var = hourly_counts, span_every = 1000, facet = month)
Do sada smo pretpostavljati da su nedostajuće vrednosti kodirane kao NA. To često nije slučaj, već one mogu biti kodirane na razne načine, ili uopšte nekodirane. Ovde ćemo se baviti sa sledećim:
NA.Funkcija miss_scan_count pretražuje dataframe prema definisanoj listi vrednosti koje predstavljaju nedostajuće vrednosti.
chaos <- tibble::tribble(
~score, ~grade, ~place,
3, "N/A", "-99",
-99, "E", "97",
4, "missing", "95",
-99, "na", "92",
7, "n/a", "-98",
10, NA, "missing",
12, ".", "88",
16, NA, ".",
9, "N/a", "86"
)
chaos %>%
miss_scan_count(search = list("N/A"))
# A tibble: 3 x 2
Variable n
<chr> <int>
1 score 0
2 grade 1
3 place 0
chaos %>%
miss_scan_count(search = list("N/A",
"N/a"))
# A tibble: 3 x 2
Variable n
<chr> <int>
1 score 0
2 grade 2
3 place 0
Pomoću funkcije replace_with_na možemo zameniti nedostajuće vrednosti za odgovajuću promenljivu i imenovanu listu vrednosti koje želimo da zamenimo sa NA.
chaos %>%
replace_with_na(replace = list(grade = c("N/A", "N/a")))
# A tibble: 9 x 3
score grade place
<dbl> <chr> <chr>
1 3 <NA> -99
2 -99 E 97
3 4 missing 95
4 -99 na 92
5 7 n/a -98
6 10 <NA> missing
7 12 . 88
8 16 <NA> .
9 9 <NA> 86
Da bi se izbeglo ponavljanje, ili postavili dodatni uslovi, možemo koristiti druge funkcije slične funkciji replace_with_na:
replace_with_na_all zamena nedostajućih vrednosti za sve promenljive.replace_with_na_at zamena nedostajućih vrednosti za podskup promenljivih.replace_with_na_if zamena nedostajućih vrednosti za podskup promenljivih koje ispunjavaju zadati uslov (numeric, character…).Ove funkcije koriste specifičnu sintaksu. Argumentu condition koje koriste ove funkcije moramo proslediti specijalnu funkciju koja počinje sa tildom ~ dok se promenljivoj obraćamo sa .x.
chaos %>%
replace_with_na_all(condition = ~.x == -99)
# A tibble: 9 x 3
score grade place
<dbl> <chr> <chr>
1 3 N/A <NA>
2 NA E 97
3 4 missing 95
4 NA na 92
5 7 n/a -98
6 10 <NA> missing
7 12 . 88
8 16 <NA> .
9 9 N/a 86
chaos %>%
replace_with_na_all(condition = ~.x %in% c("N/A", "missing", "na"))
# A tibble: 9 x 3
score grade place
<dbl> <chr> <chr>
1 3 <NA> -99
2 -99 E 97
3 4 <NA> 95
4 -99 <NA> 92
5 7 n/a -98
6 10 <NA> <NA>
7 12 . 88
8 16 <NA> .
9 9 N/a 86
Vežba — korišćenje miss_scan_count
pacman <- readRDS(here::here("data-raw","pacman_proba.rds"))
glimpse(pacman)
Observations: 2,000
Variables: 6
$ year <chr> "2007", "1995", "1980", "1982", "na", "2013", "2003", "2016...
$ month <chr> "10", "8", "2", "5", "na", "11", "12", "9", "3", "5", "9", ...
$ day <chr> "27", "23", "8", "9", "na", "15", "4", "9", "20", "19", "27...
$ initial <chr> "LEX", "PNY", "MBJ", "QRC", "YPZ", "RVJ", "VKD", "ZIS", "IY...
$ score <chr> "2065812", "1163465", "175380", "2025632", "925357", "31973...
$ country <chr> "CA", "JP", " ", "ES", "NZ", "AU", "US", "CN", "CN", "AU", ...
pacman
# A tibble: 2,000 x 6
year month day initial score country
<chr> <chr> <chr> <chr> <chr> <chr>
1 2007 10 27 LEX 2065812 "CA"
2 1995 8 23 PNY 1163465 "JP"
3 1980 2 8 MBJ 175380 " "
4 1982 5 9 QRC 2025632 "ES"
5 na na na YPZ 925357 "NZ"
6 2013 11 15 RVJ 319733 "AU"
7 2003 12 4 VKD 3322668 "US"
8 2016 9 9 ZIS 2137806 "CN"
9 2013 3 20 IYD 3059716 "CN"
10 1993 5 19 CHQ 231892 "AU"
# ... with 1,990 more rows
# Explore the strange missing values "N/A"
miss_scan_count(data = pacman, search = list("N/A"))
# A tibble: 6 x 2
Variable n
<chr> <int>
1 year 0
2 month 0
3 day 0
4 initial 0
5 score 100
6 country 0
# Explore the strange missing values "missing"
miss_scan_count(data = pacman, search = list("missing"))
# A tibble: 6 x 2
Variable n
<chr> <int>
1 year 93
2 month 93
3 day 93
4 initial 0
5 score 0
6 country 0
# Explore the strange missing values "na"
miss_scan_count(data = pacman, search = list("na"))
# A tibble: 6 x 2
Variable n
<chr> <int>
1 year 100
2 month 100
3 day 100
4 initial 0
5 score 0
6 country 0
# Explore the strange missing values " " (a single space)
miss_scan_count(data = pacman, search = list(" "))
# A tibble: 6 x 2
Variable n
<chr> <int>
1 year 0
2 month 0
3 day 0
4 initial 0
5 score 0
6 country 100
# Explore all of the strange missing values, "N/A", "missing", "na", " "
miss_scan_count(data = pacman, search = list("N/A", "missing","na", " "))
# A tibble: 6 x 2
Variable n
<chr> <int>
1 year 193
2 month 193
3 day 193
4 initial 0
5 score 100
6 country 100
Vežba — korišćenje replace_with_na
# Print the top of the pacman data using `head()`
head(pacman)
# A tibble: 6 x 6
year month day initial score country
<chr> <chr> <chr> <chr> <chr> <chr>
1 2007 10 27 LEX 2065812 "CA"
2 1995 8 23 PNY 1163465 "JP"
3 1980 2 8 MBJ 175380 " "
4 1982 5 9 QRC 2025632 "ES"
5 na na na YPZ 925357 "NZ"
6 2013 11 15 RVJ 319733 "AU"
# Replace the strange missing values "N/A", "na", and "missing" with `NA` for the variables, year, and score
pacman_clean <- replace_with_na(pacman, replace = list(year = c("N/A", "na", "missing"),
score = c("N/A", "na", "missing")))
# Test if `pacman_clean` still has these values in it?
miss_scan_count(pacman_clean, search = list("N/A", "na", "missing"))
# A tibble: 6 x 2
Variable n
<chr> <int>
1 year 0
2 month 193
3 day 193
4 initial 0
5 score 0
6 country 0
Vežba — korišćenje replace_with_na varijanti.
# Use `replace_with_na_at()` to replace with NA
replace_with_na_at(pacman,
.vars = c("year", "month", "day"),
~.x %in% c("N/A", "missing", "na", " "))
# A tibble: 2,000 x 6
year month day initial score country
<chr> <chr> <chr> <chr> <chr> <chr>
1 2007 10 27 LEX 2065812 "CA"
2 1995 8 23 PNY 1163465 "JP"
3 1980 2 8 MBJ 175380 " "
4 1982 5 9 QRC 2025632 "ES"
5 <NA> <NA> <NA> YPZ 925357 "NZ"
6 2013 11 15 RVJ 319733 "AU"
7 2003 12 4 VKD 3322668 "US"
8 2016 9 9 ZIS 2137806 "CN"
9 2013 3 20 IYD 3059716 "CN"
10 1993 5 19 CHQ 231892 "AU"
# ... with 1,990 more rows
# Use `replace_with_na_if()` to replace with NA the character values using `is.character`
replace_with_na_if(pacman,
.predicate = is.character,
~.x %in% c("N/A", "missing", "na", " "))
# A tibble: 2,000 x 6
year month day initial score country
<chr> <chr> <chr> <chr> <chr> <chr>
1 2007 10 27 LEX 2065812 CA
2 1995 8 23 PNY 1163465 JP
3 1980 2 8 MBJ 175380 <NA>
4 1982 5 9 QRC 2025632 ES
5 <NA> <NA> <NA> YPZ 925357 NZ
6 2013 11 15 RVJ 319733 AU
7 2003 12 4 VKD 3322668 US
8 2016 9 9 ZIS 2137806 CN
9 2013 3 20 IYD 3059716 CN
10 1993 5 19 CHQ 231892 AU
# ... with 1,990 more rows
# Use `replace_with_na_all()` to replace with NA
replace_with_na_all(pacman, ~.x %in% c("N/A", "missing", "na", " "))
# A tibble: 2,000 x 6
year month day initial score country
<chr> <chr> <chr> <chr> <chr> <chr>
1 2007 10 27 LEX 2065812 CA
2 1995 8 23 PNY 1163465 JP
3 1980 2 8 MBJ 175380 <NA>
4 1982 5 9 QRC 2025632 ES
5 <NA> <NA> <NA> YPZ 925357 NZ
6 2013 11 15 RVJ 319733 AU
7 2003 12 4 VKD 3322668 US
8 2016 9 9 ZIS 2137806 CN
9 2013 3 20 IYD 3059716 CN
10 1993 5 19 CHQ 231892 AU
# ... with 1,990 more rows
Do sada smo radili sa eksplicitno nedostajućim vrednostima. Implicitno nedostajuće vrednosti su vrednosti koje bi trebale da budu u skupu podataka ali nisu.
# Za Sema nedostaje evening score
tetris_skorovi <- tibble::tribble(
~name, ~time, ~score,
"robin", "morning", 358,
"robin", "afternoon", 534,
"robin", "evening", 100,
"sam", "morning", 139,
"sam", "afternoon", 177,
"blair", "morning", 963,
"blair", "afternoon", 962,
"blair", "evening", 929
)
tetris_skorovi %>% spread(key=time,value = score)
# A tibble: 3 x 4
name afternoon evening morning
<chr> <dbl> <dbl> <dbl>
1 blair 962 929 963
2 robin 534 100 358
3 sam 177 NA 139
Da bismo implicitne nedostajuće vrednosti pretvorili u eksplicitne koristimo funkciju tidyr::complete. Ona traži unikatne kombinacije za željene promenljive.
tetris_skorovi %>%
tidyr::complete(name, time)
# A tibble: 9 x 3
name time score
<chr> <chr> <dbl>
1 blair afternoon 962
2 blair evening 929
3 blair morning 963
4 robin afternoon 534
5 robin evening 100
6 robin morning 358
7 sam afternoon 177
8 sam evening NA
9 sam morning 139
U sledećem slučaju bismo želeli da popunimo NA vrednosti sa očekivanim vrednostima metodom popunjavanja nedostajućih vrednosti sa prethodnom ili sledećom prisutnom vrednošću. Ova metoda se još naziva i LOCF (Last observation carried forward). Ova metoda je korisna ukoliko znamo da postoji neka struktura u podacima na osnovu koje možemo lako da znamo koje vrednosti nedostaju. Za ovu svrhu koristimo funkciju tidyr::fill.
fill(data, ..., .direction = c("down", "up", "downup", "updown")).
tetris_skorovi2 <- tibble::tribble(
~name, ~time, ~value,
"robin", "morning", 936,
NA, "afternoon", 635,
NA, "evening", 438,
"sam", "morning", 208,
NA, "afternoon", 92,
NA, "evening", 79,
"blair", "morning", 969,
NA, "afternoon", 918,
NA, "evening", 954
)
tetris_skorovi2 %>%
tidyr::fill(name)
# A tibble: 9 x 3
name time value
<chr> <chr> <dbl>
1 robin morning 936
2 robin afternoon 635
3 robin evening 438
4 sam morning 208
5 sam afternoon 92
6 sam evening 79
7 blair morning 969
8 blair afternoon 918
9 blair evening 954
Vežba — ispravljanje implicitno nedostajućih vrednosti funkcijom complete()
frogger <- readRDS(here::here("data-raw","frogger.rds"))
frogger
name time value
1 jesse morning 6678
2 jesse afternoon 800060
3 jesse evening 475528
4 jesse late_night 143533
5 andy morning 425115
6 andy afternoon 587468
7 andy late_night 111000
8 nic afternoon 588532
9 nic late_night 915533
10 dan morning 388148
11 dan evening 180912
12 alex morning 552670
13 alex afternoon 98355
14 alex evening 266055
15 alex late_night 121056
# Use `complete()` on the `time` and `name` variables to make implicit missing values explicit
frogger_tidy <- frogger %>% complete(time, name)
Vežba — ispravljanje eksplicitno nedostajućih vrednosti funkcijom fill().
frogger2 <- readRDS(here::here("data-raw","frogger2.rds"))
frogger2
# A tibble: 15 x 4
name time value year
<chr> <chr> <int> <dbl>
1 jesse morning 6678 1981
2 <NA> afternoon 800060 NA
3 <NA> evening 475528 NA
4 <NA> late_night 143533 NA
5 andy morning 425115 NA
6 <NA> afternoon 587468 NA
7 <NA> late_night 111000 NA
8 nic afternoon 588532 NA
9 <NA> late_night 915533 NA
10 dan morning 388148 1988
11 <NA> evening 180912 NA
12 alex morning 552670 NA
13 <NA> afternoon 98355 NA
14 <NA> evening 266055 NA
15 <NA> late_night 121056 NA
# Use `fill()` to fill down the name variable in the frogger dataset
frogger %>% fill(name)
name time value
1 jesse morning 6678
2 jesse afternoon 800060
3 jesse evening 475528
4 jesse late_night 143533
5 andy morning 425115
6 andy afternoon 587468
7 andy late_night 111000
8 nic afternoon 588532
9 nic late_night 915533
10 dan morning 388148
11 dan evening 180912
12 alex morning 552670
13 alex afternoon 98355
14 alex evening 266055
15 alex late_night 121056
Vežba — korišćenje funkcija complete() i fill() zajedno.
# Correctly fill() and complete() missing values so that our dataset becomes sensible
frogger2 %>%
fill(name) %>%
complete(name, time)
# A tibble: 20 x 4
name time value year
<chr> <chr> <int> <dbl>
1 alex afternoon 98355 NA
2 alex evening 266055 NA
3 alex late_night 121056 NA
4 alex morning 552670 NA
5 andy afternoon 587468 NA
6 andy evening NA NA
7 andy late_night 111000 NA
8 andy morning 425115 NA
9 dan afternoon NA NA
10 dan evening 180912 NA
11 dan late_night NA NA
12 dan morning 388148 1988
13 jesse afternoon 800060 NA
14 jesse evening 475528 NA
15 jesse late_night 143533 NA
16 jesse morning 6678 1981
17 nic afternoon 588532 NA
18 nic evening NA NA
19 nic late_night 915533 NA
20 nic morning NA NA
Tip mehanizma nedostajanja podataka usmerava traganje za razlozima nedostajanja ukazivanjem na postojanje povezanosti merenih promenljivih i verovatnoće nedostajanja podataka.
Rubin (1976) je definisao tri osnovna mehanizma nedostajanja podataka:
Razlika između MCAR i MAR je u tome što izostajanje podataka u okviru MCAR nije pod uticajem nijedne varijable u datom setu. S druge strane, izostajanje podataka po MAR mehanizmu nije pod uticajem varijable u kojoj ima nedostajućih podataka, ali jeste pod uticajem neke druge varijable u setu podataka.
Ukoliko, pak, distribucija nedostajućih podataka zavisi od samih nedostajućih podatka, kažemo da podaci ne nedostaju po slučajnom rasporedu, odnosno da se distribuiraju po MNAR obrascu (Schafer & Graham, 2002). U ovom slučaju, nedostajući podatak je povezan s razlogom zašto nedostaje.
MCAR:
# Primer MCAR
test_vacation <- tibble::tribble(
~test, ~vacation,
NA, TRUE,
11.53334, FALSE,
10.126115, TRUE,
NA, FALSE,
NA, TRUE,
8.551881, FALSE,
NA, FALSE,
NA, TRUE,
10.608264, TRUE,
8.611877, TRUE
)
test_vacation
# A tibble: 10 x 2
test vacation
<dbl> <lgl>
1 NA TRUE
2 11.5 FALSE
3 10.1 TRUE
4 NA FALSE
5 NA TRUE
6 8.55 FALSE
7 NA FALSE
8 NA TRUE
9 10.6 TRUE
10 8.61 TRUE
MAR:
# Primer MAR - veća verovatnoća nedostajanja test skorova za radnike sa visokim nivoom depresije
test_vacation2 <- tibble::tribble(
~test, ~vacation, ~depression,
NA, TRUE, 87.93109,
11.53334, FALSE, 40.02708,
10.126115, TRUE, 48.62883,
NA, FALSE, 88.21743,
NA, TRUE, 90.29282,
8.551881, FALSE, 44.77343,
NA, FALSE, 89.48865,
NA, TRUE, 89.99209,
10.608264, TRUE, 45.56832,
8.611877, TRUE, 42.41686
)
test_vacation2
# A tibble: 10 x 3
test vacation depression
<dbl> <lgl> <dbl>
1 NA TRUE 87.9
2 11.5 FALSE 40.0
3 10.1 TRUE 48.6
4 NA FALSE 88.2
5 NA TRUE 90.3
6 8.55 FALSE 44.8
7 NA FALSE 89.5
8 NA TRUE 90.0
9 10.6 TRUE 45.6
10 8.61 TRUE 42.4
MNAR:
# Ukoliko bi bilo poznato da postoji veza između test skorova i depresije, a nedostaju i test skorovi i skorovi depresivnosti
test_vacation3 <- tibble::tribble(
~test, ~vacation, ~depression,
NA, TRUE, NA,
11.53334, FALSE, 11.53334,
10.126115, TRUE, 10.126115,
NA, FALSE, NA,
NA, TRUE, NA,
8.551881, FALSE, 8.551881,
NA, FALSE, NA,
NA, TRUE, NA,
10.608264, TRUE, 10.608264,
8.611877, TRUE, 8.611877
)
test_vacation3
# A tibble: 10 x 3
test vacation depression
<dbl> <lgl> <dbl>
1 NA TRUE NA
2 11.5 FALSE 11.5
3 10.1 TRUE 10.1
4 NA FALSE NA
5 NA TRUE NA
6 8.55 FALSE 8.55
7 NA FALSE NA
8 NA TRUE NA
9 10.6 TRUE 10.6
10 8.61 TRUE 8.61
U sledećem primeru vidimo da postoji dosta šuma u nedostajanju. Nasumični obrasci nedostajanja, ili obrasci nedostajanja za koje nam se čini da imaju punu šuma, sugerišu da nema punu varijanse u našem obrascu nedostajanja. Na osnovu toga možemo reći da se radi o MCAR mehanizmu nedostajanja podataka.
vis_miss(mt_cars, cluster = TRUE)
U sledećem primeru vidimo da definitivno postoji grupisanje nedostajanja. Ovo je često slučaj kod MAR mehanizma nedostajanja podataka.
oceanbuoys
# A tibble: 736 x 8
year latitude longitude sea_temp_c air_temp_c humidity wind_ew wind_ns
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1997 0 -110 27.6 27.1 79.6 -6.40 5.40
2 1997 0 -110 27.5 27.0 75.8 -5.30 5.30
3 1997 0 -110 27.6 27 76.5 -5.10 4.5
4 1997 0 -110 27.6 26.9 76.2 -4.90 2.5
5 1997 0 -110 27.6 26.8 76.4 -3.5 4.10
6 1997 0 -110 27.8 26.9 76.7 -4.40 1.60
7 1997 0 -110 28.0 27.0 76.5 -2 3.5
8 1997 0 -110 28.0 27.1 78.3 -3.70 4.5
9 1997 0 -110 28.0 27.2 78.6 -4.20 5
10 1997 0 -110 28.0 27.2 76.9 -3.60 3.5
# ... with 726 more rows
oceanbuoys %>% arrange(year) %>% vis_miss()
U sledećem primeru demonstrira se MNAR mehanizam nedostajanja. Ocean podaci su izmenjeni tako da su podaci koji nedostaju generisani na osnovu veze sa vrednostima promenljive koja je uklonjena iz dataseta, dakle na osnovu nečega što sada nije posmatrano. U ovom slučaju jasno se vidi struktura grupisanja, međutim ovo često ne mora biti slučaj. Važno je zapamtiti da je veoma teško pouzdano ustanoviti da li je mehanizam nedostajanja MCAR, MAR, ili MNAR. Vizualizacije ne daju definitivan odgovor na to pitanje.
vis_miss(ocean, cluster = TRUE)
Vežba — istraživanje mehanizama nedostajanja podataka. Najpre ćemo pogledati kako sortiranje po različitim promenljivim utiče na pregled nedostajućih podataka.
# Arrange by year
oceanbuoys %>% arrange(year) %>% vis_miss()
# Arrange by latitude
oceanbuoys %>% arrange(latitude) %>% vis_miss()
# Arrange by wind_ew (wind east west)
oceanbuoys %>% arrange(wind_ew) %>% vis_miss()
Na osnovu svega navedenog može se zaključiti da je u pitanju MAR mehanizam. Promenljive year i location su obe važne za objašnjenje obrasca nedostajanja podataka, jer pokazuju grupisanje nedostajanja podataka. Kod MAR-a visoke ili niske vrednosti nekih promenljivih imaju uticaj na nedostajanje podataka.
gg_miss_var(oceanbuoys, facet = year)
gg_miss_case(oceanbuoys, facet = year)
Recimo da imamo sledeće podatke.
cenzus1 <- tibble::tribble(
~income, ~education,
48.69087, NA,
40.93218, NA,
52.69245, "high_school",
31.33808, NA,
89.35671, "university",
103.87278, "university"
)
cenzus1
# A tibble: 6 x 2
income education
<dbl> <chr>
1 48.7 <NA>
2 40.9 <NA>
3 52.7 high_school
4 31.3 <NA>
5 89.4 university
6 104. university
Ukoliko pogledamo raspodelu za income vidimo da je većina vrednosti između 70k-80k. Ali, ako income razložimo prema nedostajanju vrednosti promenljive education možemo videti da postoji gap između te dve raspodele.
cenzus1 %>%
bind_shadow() %>%
ggplot(aes(x = income, fill = education_NA)) +
geom_density(alpha = 0.5)
Shadow matrica je jasnija reprezentacija nedostajanja podataka. Dva glavna svojstva shadow matrice su:
_NA.NA ili !NA.Kada se ovakvi podaci columnwise dodaju na originalne podatke dobijamo takozvane nabular podatke (NA i Tabular). Ovo radimo pomoću funkcije bind_shadow().
cenzus1 %>%
bind_shadow()
# A tibble: 6 x 4
income education income_NA education_NA
<dbl> <chr> <fct> <fct>
1 48.7 <NA> !NA NA
2 40.9 <NA> !NA NA
3 52.7 high_school !NA !NA
4 31.3 <NA> !NA NA
5 89.4 university !NA !NA
6 104. university !NA !NA
Primena nabular na airquality dataset.
bind_shadow(airquality)
# A tibble: 153 x 12
Ozone Solar.R Wind Temp Month Day Ozone_NA Solar.R_NA Wind_NA Temp_NA
<int> <int> <dbl> <int> <int> <int> <fct> <fct> <fct> <fct>
1 41 190 7.4 67 5 1 !NA !NA !NA !NA
2 36 118 8 72 5 2 !NA !NA !NA !NA
3 12 149 12.6 74 5 3 !NA !NA !NA !NA
4 18 313 11.5 62 5 4 !NA !NA !NA !NA
5 NA NA 14.3 56 5 5 NA NA !NA !NA
6 28 NA 14.9 66 5 6 !NA NA !NA !NA
7 23 299 8.6 65 5 7 !NA !NA !NA !NA
8 19 99 13.8 59 5 8 !NA !NA !NA !NA
9 8 19 20.1 61 5 9 !NA !NA !NA !NA
10 NA 194 8.6 69 5 10 NA !NA !NA !NA
# ... with 143 more rows, and 2 more variables: Month_NA <fct>, Day_NA <fct>
airquality %>%
bind_shadow() %>%
group_by(Ozone_NA) %>%
summarise(mean = mean(Wind))
# A tibble: 2 x 2
Ozone_NA mean
<fct> <dbl>
1 !NA 9.86
2 NA 10.3
Vežba — shadow matrix.
# Create shadow matrix data with `as_shadow()`
as_shadow(oceanbuoys)
# A tibble: 736 x 8
year_NA latitude_NA longitude_NA sea_temp_c_NA air_temp_c_NA humidity_NA
<fct> <fct> <fct> <fct> <fct> <fct>
1 !NA !NA !NA !NA !NA !NA
2 !NA !NA !NA !NA !NA !NA
3 !NA !NA !NA !NA !NA !NA
4 !NA !NA !NA !NA !NA !NA
5 !NA !NA !NA !NA !NA !NA
6 !NA !NA !NA !NA !NA !NA
7 !NA !NA !NA !NA !NA !NA
8 !NA !NA !NA !NA !NA !NA
9 !NA !NA !NA !NA !NA !NA
10 !NA !NA !NA !NA !NA !NA
# ... with 726 more rows, and 2 more variables: wind_ew_NA <fct>,
# wind_ns_NA <fct>
# Create nabular data by binding the shadow to the data with `bind_shadow()`
bind_shadow(oceanbuoys)
# A tibble: 736 x 16
year latitude longitude sea_temp_c air_temp_c humidity wind_ew wind_ns
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1997 0 -110 27.6 27.1 79.6 -6.40 5.40
2 1997 0 -110 27.5 27.0 75.8 -5.30 5.30
3 1997 0 -110 27.6 27 76.5 -5.10 4.5
4 1997 0 -110 27.6 26.9 76.2 -4.90 2.5
5 1997 0 -110 27.6 26.8 76.4 -3.5 4.10
6 1997 0 -110 27.8 26.9 76.7 -4.40 1.60
7 1997 0 -110 28.0 27.0 76.5 -2 3.5
8 1997 0 -110 28.0 27.1 78.3 -3.70 4.5
9 1997 0 -110 28.0 27.2 78.6 -4.20 5
10 1997 0 -110 28.0 27.2 76.9 -3.60 3.5
# ... with 726 more rows, and 8 more variables: year_NA <fct>,
# latitude_NA <fct>, longitude_NA <fct>, sea_temp_c_NA <fct>,
# air_temp_c_NA <fct>, humidity_NA <fct>, wind_ew_NA <fct>, wind_ns_NA <fct>
# Bind only the variables with missing values by using bind_shadow(only_miss = TRUE)
bind_shadow(oceanbuoys, only_miss = TRUE)
# A tibble: 736 x 11
year latitude longitude sea_temp_c air_temp_c humidity wind_ew wind_ns
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1997 0 -110 27.6 27.1 79.6 -6.40 5.40
2 1997 0 -110 27.5 27.0 75.8 -5.30 5.30
3 1997 0 -110 27.6 27 76.5 -5.10 4.5
4 1997 0 -110 27.6 26.9 76.2 -4.90 2.5
5 1997 0 -110 27.6 26.8 76.4 -3.5 4.10
6 1997 0 -110 27.8 26.9 76.7 -4.40 1.60
7 1997 0 -110 28.0 27.0 76.5 -2 3.5
8 1997 0 -110 28.0 27.1 78.3 -3.70 4.5
9 1997 0 -110 28.0 27.2 78.6 -4.20 5
10 1997 0 -110 28.0 27.2 76.9 -3.60 3.5
# ... with 726 more rows, and 3 more variables: sea_temp_c_NA <fct>,
# air_temp_c_NA <fct>, humidity_NA <fct>
Vežba — grupni pregled nedostajanja pomoću nabularnih podataka.
# `bind_shadow()` and `group_by()` humidity missingness (`humidity_NA`)
oceanbuoys %>%
bind_shadow() %>%
group_by(humidity_NA) %>%
summarise(wind_ew_mean = mean(wind_ew), # calculate mean of wind_ew
wind_ew_sd = sd(wind_ew)) # calculate standard deviation of wind_ew
# A tibble: 2 x 3
humidity_NA wind_ew_mean wind_ew_sd
<fct> <dbl> <dbl>
1 !NA -3.78 1.90
2 NA -3.30 2.31
# Repeat this, but calculating summaries for wind north south (`wind_ns`)
oceanbuoys %>%
bind_shadow() %>%
group_by(humidity_NA) %>%
summarise(wind_ns_mean = mean(wind_ns),
wind_ns_sd = sd(wind_ns))
# A tibble: 2 x 3
humidity_NA wind_ns_mean wind_ns_sd
<fct> <dbl> <dbl>
1 !NA 2.78 2.06
2 NA 1.66 2.23
Vežba — dalje istraživanje kombinacija nedostajanja. Dodaćemo broj posmatranih slučajeva, i dodatni nivo grupisanja za kombinacije nedostajanja vlažnosti i temperature vazduha. Brojeći broj opservacija u svakoj grupi dobijamo koristan kontekst za sumarni pregled.
# Summarise wind_ew by the missingness of `air_temp_c_NA`
oceanbuoys %>%
bind_shadow() %>%
group_by(air_temp_c_NA) %>%
summarise(wind_ew_mean = mean(wind_ew),
wind_ew_sd = sd(wind_ew),
n_obs = n())
# A tibble: 2 x 4
air_temp_c_NA wind_ew_mean wind_ew_sd n_obs
<fct> <dbl> <dbl> <int>
1 !NA -3.91 1.85 655
2 NA -2.17 2.14 81
# Summarise wind_ew by missingness of `air_temp_c_NA` and `humidity_NA`
oceanbuoys %>%
bind_shadow() %>%
group_by(air_temp_c_NA, humidity_NA) %>%
summarise(wind_ew_mean = mean(wind_ew),
wind_ew_sd = sd(wind_ew),
n_obs = n())
# A tibble: 4 x 5
# Groups: air_temp_c_NA [2]
air_temp_c_NA humidity_NA wind_ew_mean wind_ew_sd n_obs
<fct> <fct> <dbl> <dbl> <int>
1 !NA !NA -4.01 1.74 565
2 !NA NA -3.24 2.31 90
3 NA !NA -2.06 2.08 78
4 NA NA -4.97 1.74 3
Koristićemo nabularne podatke i sledeće vizualizacije:
Najpre density.
ggplot(airquality,
aes(x = Temp)) +
geom_density()
airquality %>%
bind_shadow() %>%
ggplot(aes(x = Temp,
color = Ozone_NA)) +
geom_density()
Takođe možemo koristiti boxplotove. Vidimo da su srednje vrednosti temperatura slične i kada nedostaje, i kada ne nedostaju ozon. Međutim varijansa temperature je manja kada nedostaju vrednosti ozona, takođe primetni su outlieri u temperaturi.
airquality %>%
bind_shadow() %>%
ggplot(aes(x = Ozone_NA,
y = Temp)) +
geom_boxplot()
Možemo koristiti i faceting za density.
airquality %>%
bind_shadow() %>%
ggplot(aes(x = Temp)) +
geom_density() +
facet_wrap(~Ozone_NA)
Možemo koristiti i faceting za scatter plot. Ovde se vidi da imamo manji broj vrednosti za Wind i Temp kada nedostaju podaci za Ozone.
airquality %>%
bind_shadow() %>%
ggplot(aes(x = Temp,
y = Wind)) +
geom_point() +
facet_wrap(~Ozone_NA)
Tačke se mogu kodirati bojom prema nedostajanju. Možemo takođe dodati sloj za nedostajanje, prema nedostajanju Solar.R.
airquality %>%
bind_shadow() %>%
ggplot(aes(x = Temp,
y = Wind,
color = Ozone_NA)) +
geom_point()
airquality %>%
bind_shadow() %>%
ggplot(aes(x = Temp,
color = Ozone_NA)) +
geom_density() +
facet_wrap(~Solar.R_NA)
# First explore the missingness structure of `oceanbuoys` using `vis_miss()`
vis_miss(oceanbuoys)
# Explore the distribution of `wind_ew` for the missingness of `air_temp_c_NA` using `geom_density()`
bind_shadow(oceanbuoys) %>%
ggplot(aes(x = wind_ew,
color = air_temp_c_NA)) +
geom_density()
# Explore the distribution of sea temperature for the missingness of humidity (humidity_NA) using `geom_density()`
bind_shadow(oceanbuoys) %>%
ggplot(aes(x = sea_temp_c,
color = humidity_NA)) +
geom_density()
Vežba — nabularni podaci i vizualizacije. Istražićemo varijacije u jednoj promenljivoj u zavisnosti od nedostajanja drugih.
# First explore the missingness structure of `oceanbuoys` using `vis_miss()`
vis_miss(oceanbuoys)
# Explore the distribution of `wind_ew` for the missingness of `air_temp_c_NA` using `geom_density()`
bind_shadow(oceanbuoys) %>%
ggplot(aes(x = wind_ew,
color = air_temp_c_NA)) +
geom_density()
# Explore the distribution of sea temperature for the missingness of humidity (humidity_NA) using `geom_density()`
bind_shadow(oceanbuoys) %>%
ggplot(aes(x = sea_temp_c,
color = humidity_NA)) +
geom_density()
Vežba — Istražićemo varijacije u jednoj promenljivoj u zavisnosti od nedostajanja drugih pomoću faceting-a.
# Explore the distribution of wind east west (`wind_ew`) for the missingness of air temperature using `geom_density()` and facetting by the missingness of air temperature (`air_temp_c_NA`).
oceanbuoys %>%
bind_shadow() %>%
ggplot(aes(x = wind_ew)) +
geom_density() +
facet_wrap(~air_temp_c_NA)
# Build upon this visualisation by filling by the missingness of humidity (`humidity_NA`).
oceanbuoys %>%
bind_shadow() %>%
ggplot(aes(x = wind_ew,
color = humidity_NA)) +
geom_density() +
facet_wrap(~air_temp_c_NA)
Vežba — Istražićemo varijacije u jednoj promenljivoj u zavisnosti od nedostajanja drugih pomoću box plotova.
# Explore the distribution of wind east west (`wind_ew`) for the missingness of air temperature using `geom_boxplot()`
oceanbuoys %>%
bind_shadow() %>%
ggplot(aes(x = air_temp_c_NA,
y = wind_ew)) +
geom_boxplot()
# Build upon this visualisation by facetting by the missingness of humidity (`humidity_NA`).
oceanbuoys %>%
bind_shadow() %>%
ggplot(aes(x = air_temp_c_NA,
y = wind_ew)) +
geom_boxplot() +
facet_wrap(~humidity_NA)
Kod scatterplotova imamo problem da se uklonjaju celi redovi za vrednosti koje nedostaju. ggplot2 daje warning u tom slučaju. Koristićemo geom geom_miss_point. Na donjem primeru dobijamo informaciju da se vrednosti koje nedostaju javljaju za manje vrednosti ozona.
ggplot(airquality,
aes(x = Ozone,
y = Solar.R)) +
geom_point()
ggplot(airquality,
aes(x = Ozone,
y = Solar.R)) +
geom_miss_point()
geom_miss_point u pozadini imputira nedostajuće vrednosti. Podaci se imputiraju za 10% ispod minimalne vrednosti.
primerGMP <- tibble::tribble(
~Ozone, ~Ozone_shift, ~Ozone_NA,
41, 41, "!NA",
36, 36, "!NA",
12, 12, "!NA",
18, 18, "!NA",
NA, -19.72321, NA,
28, 28, "!NA"
)
primerGMP
# A tibble: 6 x 3
Ozone Ozone_shift Ozone_NA
<dbl> <dbl> <chr>
1 41 41 !NA
2 36 36 !NA
3 12 12 !NA
4 18 18 !NA
5 NA -19.7 <NA>
6 28 28 !NA
Sa geom_miss_point takođe možemo koristiti faceting i nabularne podatke.
ggplot(airquality,
aes(x = Wind,
y = Ozone)) +
geom_miss_point() +
facet_wrap( ~ Month)
airquality %>%
bind_shadow() %>%
ggplot(aes(x = Wind,
y = Ozone)) +
geom_miss_point() +
facet_wrap( ~ Solar.R_NA)
Vežba — Istraživanje nedostajućih podataka pomoću scatterplotova.
# Explore the missingness in wind and air temperature, and display the missingness using `geom_miss_point()`
ggplot(oceanbuoys,
aes(x = wind_ew,
y = air_temp_c)) +
geom_miss_point()
# Explore the missingness in humidity and air temperature, and display the missingness using `geom_miss_point()`
ggplot(oceanbuoys,
aes(x = humidity,
y = air_temp_c)) +
geom_miss_point()
Vežba — Istraživanje nedostajućih podataka pomoću scatterplotova i facetinga.
# Explore the missingness in wind and air temperature, and display the missingness using `geom_miss_point()`. Facet by year to explore this further.
ggplot(oceanbuoys,
aes(x = wind_ew,
y = air_temp_c)) +
geom_miss_point() +
facet_wrap(~year)
# Explore the missingness in humidity and air temperature, and display the missingness using `geom_miss_point()` Facet by year to explore this further.
ggplot(oceanbuoys,
aes(x = humidity,
y = air_temp_c)) +
geom_miss_point() +
facet_wrap(~year)
Vežba — Istraživanje nedostajućih podataka pomoću scatterplotova, facetinga i nabularnih podataka.
# Use geom_miss_point() and facet_wrap to explore how the missingness in wind_ew and air_temp_c is different for missingness of humidity
bind_shadow(oceanbuoys) %>%
ggplot(aes(x = wind_ew,
y = air_temp_c)) +
geom_miss_point() +
facet_wrap(~humidity_NA)
# Use geom_miss_point() and facet_grid to explore how the missingness in wind_ew and air_temp_c is different for missingness of humidity AND by year - by using `facet_grid(humidity_NA ~ year)`
bind_shadow(oceanbuoys) %>%
ggplot(aes(x = wind_ew,
y = air_temp_c)) +
geom_miss_point() +
facet_grid(humidity_NA~year)
Funkcija impute_below() imputira vrednosti ispod minimuma iz postojećih podataka.
impute_below_if() imputira samo promenljive koje ispunjavaju uslov. impute_below_if(data, is.numeric).impute_below_at() imputira samo specifikovane promenljive. impute_below_at(data, vars(var1,var2)).impute_below_all() imputira sve promenljive. impute_below_all(data).Moramo znati koje su vrednosti imputirane.
impute_below(c(5,6,7,NA,9,10))
[1] 5.00000 6.00000 7.00000 4.40271 9.00000 10.00000
df11 <- data.frame(var1 = c(1, 5, 2, 6, 3, 7, 4, NA, 5, 9, 6, 10))
impute_below_all(df11)
var1
1 1.0000000
2 5.0000000
3 2.0000000
4 6.0000000
5 3.0000000
6 7.0000000
7 4.0000000
8 0.2524351
9 5.0000000
10 9.0000000
11 6.0000000
12 10.0000000
bind_shadow(df11)
# A tibble: 12 x 2
var1 var1_NA
<dbl> <fct>
1 1 !NA
2 5 !NA
3 2 !NA
4 6 !NA
5 3 !NA
6 7 !NA
7 4 !NA
8 NA NA
9 5 !NA
10 9 !NA
11 6 !NA
12 10 !NA
bind_shadow(df11) %>%
impute_below_all()
# A tibble: 12 x 2
var1 var1_NA
<dbl> <fct>
1 1 !NA
2 5 !NA
3 2 !NA
4 6 !NA
5 3 !NA
6 7 !NA
7 4 !NA
8 0.252 NA
9 5 !NA
10 9 !NA
11 6 !NA
12 10 !NA
Vizualizacija imputiranih podataka putem histograma i facetinga.
aq_imp <- airquality %>%
bind_shadow() %>%
impute_below_all()
ggplot(aq_imp,
aes(x = Ozone,
fill = Ozone_NA)) +
geom_histogram()
ggplot(aq_imp,
aes(x = Ozone,
fill = Ozone_NA)) +
geom_histogram() +
facet_wrap(~Month)
Plot možemo podeliti na osnovu nedostajanja podataka u Solar.R.
ggplot(aq_imp,
aes(x = Ozone,
fill = Ozone_NA)) +
geom_histogram() +
facet_wrap(~Solar.R_NA)
add_label_missings dodaje kolonu koja oposuje da li nedostaju podaci za bilo koju promenljivu iz zapisa. add_label_shadow proverava da li su vrednosti imputirane kada je primenjene bind_shadow ili add_shadow.
aq_imp <- airquality %>%
bind_shadow() %>%
add_label_missings() %>%
impute_below_all()
aq_imp
# A tibble: 153 x 13
Ozone Solar.R Wind Temp Month Day Ozone_NA Solar.R_NA Wind_NA Temp_NA
<dbl> <dbl> <dbl> <int> <int> <int> <fct> <fct> <fct> <fct>
1 41 190 7.4 67 5 1 !NA !NA !NA !NA
2 36 118 8 72 5 2 !NA !NA !NA !NA
3 12 149 12.6 74 5 3 !NA !NA !NA !NA
4 18 313 11.5 62 5 4 !NA !NA !NA !NA
5 -19.7 -33.6 14.3 56 5 5 NA NA !NA !NA
6 28 -33.1 14.9 66 5 6 !NA NA !NA !NA
7 23 299 8.6 65 5 7 !NA !NA !NA !NA
8 19 99 13.8 59 5 8 !NA !NA !NA !NA
9 8 19 20.1 61 5 9 !NA !NA !NA !NA
10 -18.5 194 8.6 69 5 10 NA !NA !NA !NA
# ... with 143 more rows, and 3 more variables: Month_NA <fct>, Day_NA <fct>,
# any_missing <chr>
ggplot(aq_imp,
aes(x = Ozone,
y = Solar.R,
colour = any_missing)) +
geom_point()
airquality %>%
add_shadow(Ozone, Solar.R) %>%
add_label_shadow()
# A tibble: 153 x 9
Ozone Solar.R Wind Temp Month Day Ozone_NA Solar.R_NA any_missing
<int> <int> <dbl> <int> <int> <int> <fct> <fct> <chr>
1 41 190 7.4 67 5 1 !NA !NA Not Missing
2 36 118 8 72 5 2 !NA !NA Not Missing
3 12 149 12.6 74 5 3 !NA !NA Not Missing
4 18 313 11.5 62 5 4 !NA !NA Not Missing
5 NA NA 14.3 56 5 5 NA NA Missing
6 28 NA 14.9 66 5 6 !NA NA Missing
7 23 299 8.6 65 5 7 !NA !NA Not Missing
8 19 99 13.8 59 5 8 !NA !NA Not Missing
9 8 19 20.1 61 5 9 !NA !NA Not Missing
10 NA 194 8.6 69 5 10 NA !NA Missing
# ... with 143 more rows
Vežba — imputiranje ispod opsega.
# Impute the data below the range using `impute_below`.
ocean_imp <- impute_below_all(oceanbuoys)
# Visualise the new missing values
ggplot(ocean_imp,
aes(x = wind_ew, y = air_temp_c)) +
geom_point()
# Impute and track data with `bind_shadow`, `impute_below`, and `add_label_shadow`
ocean_imp_track <- bind_shadow(oceanbuoys) %>%
impute_below_all() %>%
add_label_shadow()
# Look at the imputed values
ocean_imp_track
# A tibble: 736 x 17
year latitude longitude sea_temp_c air_temp_c humidity wind_ew wind_ns
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1997 0 -110 27.6 27.1 79.6 -6.40 5.40
2 1997 0 -110 27.5 27.0 75.8 -5.30 5.30
3 1997 0 -110 27.6 27 76.5 -5.10 4.5
4 1997 0 -110 27.6 26.9 76.2 -4.90 2.5
5 1997 0 -110 27.6 26.8 76.4 -3.5 4.10
6 1997 0 -110 27.8 26.9 76.7 -4.40 1.60
7 1997 0 -110 28.0 27.0 76.5 -2 3.5
8 1997 0 -110 28.0 27.1 78.3 -3.70 4.5
9 1997 0 -110 28.0 27.2 78.6 -4.20 5
10 1997 0 -110 28.0 27.2 76.9 -3.60 3.5
# ... with 726 more rows, and 9 more variables: year_NA <fct>,
# latitude_NA <fct>, longitude_NA <fct>, sea_temp_c_NA <fct>,
# air_temp_c_NA <fct>, humidity_NA <fct>, wind_ew_NA <fct>, wind_ns_NA <fct>,
# any_missing <chr>
Vežba — vizualizacija imputiranih podataka u scatterplotu.
# Impute and track the missing values
ocean_imp_track <- bind_shadow(oceanbuoys) %>%
impute_below_all() %>%
add_label_shadow()
# Visualise the missingness in wind and air temperature, coloring missing air temp values with air_temp_c_NA
ggplot(ocean_imp_track,
aes(x = wind_ew, y = air_temp_c, color = air_temp_c_NA)) +
geom_point()
# Visualise humidity and air temp, coloring any missing cases using the variable any_missing
ggplot(ocean_imp_track,
aes(x = humidity, y = air_temp_c, color = any_missing)) +
geom_point()
Vežba — kreiranje histograma imputiranih podataka.
# Explore the values of air_temp_c, visualising the amount of missings with `air_temp_c_NA`.
p <- ggplot(ocean_imp_track, aes(x = air_temp_c, fill = air_temp_c_NA)) + geom_histogram()
# Expore the missings in humidity using humidity_NA
p2 <- ggplot(ocean_imp_track, aes(x = humidity, fill = humidity_NA)) + geom_histogram()
# Explore the missings in air_temp_c according to year, using `facet_wrap(~year)`.
p + facet_wrap(~year)
# Explore the missings in humidity according to year, using `facet_wrap(~year)`.
p2 + facet_wrap(~year)
Jedna veoma loša imputacija je mean imputacija.
df12 <- data.frame(x = c(1, 4, 9, 16, NA, 36))
mean(df12$x, na.rm = TRUE)
[1] 13.2
Za istraživanje ovakve imputacije koristićemo sledeće funkcije:
impute_mean(data$variable).impute_mean_if(data, is.numeric).impute_mean_at(data, vars(variable1, variable2)).impute_mean_all(data).aq_impute_mean <- airquality %>%
bind_shadow(only_miss = TRUE) %>%
impute_mean_all() %>%
add_label_shadow()
aq_impute_mean
# A tibble: 153 x 9
Ozone Solar.R Wind Temp Month Day Ozone_NA Solar.R_NA any_missing
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct> <chr>
1 41 190 7.4 67 5 1 !NA !NA Not Missing
2 36 118 8 72 5 2 !NA !NA Not Missing
3 12 149 12.6 74 5 3 !NA !NA Not Missing
4 18 313 11.5 62 5 4 !NA !NA Not Missing
5 42.1 186. 14.3 56 5 5 NA NA Missing
6 28 186. 14.9 66 5 6 !NA NA Missing
7 23 299 8.6 65 5 7 !NA !NA Not Missing
8 19 99 13.8 59 5 8 !NA !NA Not Missing
9 8 19 20.1 61 5 9 !NA !NA Not Missing
10 42.1 194 8.6 69 5 10 NA !NA Missing
# ... with 143 more rows
Kada evaluirate imputacije, istražite promene i sličnosti u:
Donji plot pokazuje da je median niži za !NA grupu.
ggplot(aq_impute_mean,
aes(x = Ozone_NA,
y = Ozone)) +
geom_boxplot()
Evalucija - spread imputacija pomoću scatterplota. Ova vizualizacija pokazuje da nema varijabilnosti u raširenosti imputiranih tačaka.
ggplot(aq_impute_mean,
aes(x = Ozone,
y = Solar.R,
colour = any_missing)) +
geom_point()
Evaluacija imputacija za više promenljivih.
aq_imp <- airquality %>%
bind_shadow() %>%
impute_mean_all()
aq_imp_long <- shadow_long(aq_imp,
Ozone,
Solar.R)
aq_imp_long
# A tibble: 306 x 4
variable value variable_NA value_NA
<chr> <dbl> <chr> <chr>
1 Ozone 41 Ozone_NA !NA
2 Ozone 36 Ozone_NA !NA
3 Ozone 12 Ozone_NA !NA
4 Ozone 18 Ozone_NA !NA
5 Ozone 42.1 Ozone_NA NA
6 Ozone 28 Ozone_NA !NA
7 Ozone 23 Ozone_NA !NA
8 Ozone 19 Ozone_NA !NA
9 Ozone 8 Ozone_NA !NA
10 Ozone 42.1 Ozone_NA NA
# ... with 296 more rows
ggplot(aq_imp_long,
aes(x = value,
fill = value_NA)) +
geom_histogram() +
facet_wrap(~variable)
Vežba - evaluacija loših imputacija.
# Impute the mean value and track the imputations
ocean_imp_mean <- bind_shadow(oceanbuoys) %>%
impute_mean_all() %>%
add_label_shadow()
# Explore the mean values in humidity in the imputed dataset
ggplot(ocean_imp_mean,
aes(x = humidity_NA, y = humidity)) +
geom_boxplot()
# Explore the values in air temperature in the imputed dataset
ggplot(ocean_imp_mean,
aes(x = air_temp_c_NA, y = air_temp_c)) +
geom_boxplot()
Vežba - evaluacija loših imputacija - scale.
# Explore imputations in air temperature and humidity, coloring by the variable, any_missing
ggplot(ocean_imp_mean,
aes(x = air_temp_c, y = humidity, color = any_missing)) +
geom_point()
# Explore imputations in air temperature and humidity, coloring by the variable, any_missing, and faceting by year
ggplot(ocean_imp_mean,
aes(x = air_temp_c, y = humidity, color = any_missing)) +
geom_point() +
facet_wrap(~year)
Vežba - evaluacija loših imputacija - za više promenljivih.
# Gather the imputed data
ocean_imp_mean_gather <- shadow_long(ocean_imp_mean,
humidity,
air_temp_c)
# Inspect the data
ocean_imp_mean_gather
# A tibble: 1,472 x 4
variable value variable_NA value_NA
<chr> <chr> <chr> <chr>
1 air_temp_c 27.14999962 air_temp_c_NA !NA
2 air_temp_c 27.02000046 air_temp_c_NA !NA
3 air_temp_c 27 air_temp_c_NA !NA
4 air_temp_c 26.93000031 air_temp_c_NA !NA
5 air_temp_c 26.84000015 air_temp_c_NA !NA
6 air_temp_c 26.94000053 air_temp_c_NA !NA
7 air_temp_c 27.04000092 air_temp_c_NA !NA
8 air_temp_c 27.11000061 air_temp_c_NA !NA
9 air_temp_c 27.20999908 air_temp_c_NA !NA
10 air_temp_c 27.25 air_temp_c_NA !NA
# ... with 1,462 more rows
ocean_imp_mean_gather$value <- as.numeric(ocean_imp_mean_gather$value)
ocean_imp_mean_gather
# A tibble: 1,472 x 4
variable value variable_NA value_NA
<chr> <dbl> <chr> <chr>
1 air_temp_c 27.1 air_temp_c_NA !NA
2 air_temp_c 27.0 air_temp_c_NA !NA
3 air_temp_c 27 air_temp_c_NA !NA
4 air_temp_c 26.9 air_temp_c_NA !NA
5 air_temp_c 26.8 air_temp_c_NA !NA
6 air_temp_c 26.9 air_temp_c_NA !NA
7 air_temp_c 27.0 air_temp_c_NA !NA
8 air_temp_c 27.1 air_temp_c_NA !NA
9 air_temp_c 27.2 air_temp_c_NA !NA
10 air_temp_c 27.2 air_temp_c_NA !NA
# ... with 1,462 more rows
# Explore the imputations in a histogram
ggplot(ocean_imp_mean_gather,
aes(x = value, fill = value_NA)) +
geom_histogram() +
facet_wrap(~variable)
Koristićemo simputation paket. Za imputaciju putem linearnog modela koristićemo impute_lm.
library(simputation)
df13 <- tibble::tribble(
~y, ~x1, ~x2,
2.67, 2.43, 3.27,
3.87, 3.55, 1.45,
NA, 2.9, 1.49,
5.21, 2.72, 1.84,
NA, 4.29, 1.15
)
df13
# A tibble: 5 x 3
y x1 x2
<dbl> <dbl> <dbl>
1 2.67 2.43 3.27
2 3.87 3.55 1.45
3 NA 2.9 1.49
4 5.21 2.72 1.84
5 NA 4.29 1.15
df13 %>%
bind_shadow(only_miss = TRUE) %>%
add_label_shadow() %>%
impute_lm(y ~ x1 + x2)
# A tibble: 5 x 5
y x1 x2 y_NA any_missing
* <dbl> <dbl> <dbl> <fct> <chr>
1 2.67 2.43 3.27 !NA Not Missing
2 3.87 3.55 1.45 !NA Not Missing
3 5.54 2.9 1.49 NA Missing
4 5.21 2.72 1.84 !NA Not Missing
5 2.56 4.29 1.15 NA Missing
aq_imp_lm <- airquality %>%
bind_shadow() %>%
add_label_shadow() %>%
impute_lm(Solar.R ~ Wind + Temp + Month) %>%
impute_lm(Ozone ~ Wind + Temp + Month)
aq_imp_lm
# A tibble: 153 x 13
Ozone Solar.R Wind Temp Month Day Ozone_NA Solar.R_NA Wind_NA Temp_NA
* <dbl> <dbl> <dbl> <int> <int> <int> <fct> <fct> <fct> <fct>
1 41 190 7.4 67 5 1 !NA !NA !NA !NA
2 36 118 8 72 5 2 !NA !NA !NA !NA
3 12 149 12.6 74 5 3 !NA !NA !NA !NA
4 18 313 11.5 62 5 4 !NA !NA !NA !NA
5 -9.04 138. 14.3 56 5 5 NA NA !NA !NA
6 28 178. 14.9 66 5 6 !NA NA !NA !NA
7 23 299 8.6 65 5 7 !NA !NA !NA !NA
8 19 99 13.8 59 5 8 !NA !NA !NA !NA
9 8 19 20.1 61 5 9 !NA !NA !NA !NA
10 35.2 194 8.6 69 5 10 NA !NA !NA !NA
# ... with 143 more rows, and 3 more variables: Month_NA <fct>, Day_NA <fct>,
# any_missing <chr>
Važan deo imputacije jeste praćenje imputiranih vrednosti putem bind_shadow i add_label_missings funkcija.
aq_imp_lm <-
airquality %>%
bind_shadow() %>%
add_label_missings() %>%
impute_lm(Solar.R ~ Wind + Temp +
Month) %>%
impute_lm(Ozone ~ Wind + Temp +
Month)
ggplot(aq_imp_lm,
aes(x = Solar.R,
y = Ozone,
colour = any_missing)) +
geom_point()
Poređenje imputacionih modela.
aq_imp_small <- airquality %>%
bind_shadow() %>%
impute_lm(Ozone ~ Wind + Temp) %>%
impute_lm(Solar.R ~ Wind + Temp) %>%
add_label_shadow()
aq_imp_small
# A tibble: 153 x 13
Ozone Solar.R Wind Temp Month Day Ozone_NA Solar.R_NA Wind_NA Temp_NA
<dbl> <dbl> <dbl> <int> <int> <int> <fct> <fct> <fct> <fct>
1 41 190 7.4 67 5 1 !NA !NA !NA !NA
2 36 118 8 72 5 2 !NA !NA !NA !NA
3 12 149 12.6 74 5 3 !NA !NA !NA !NA
4 18 313 11.5 62 5 4 !NA !NA !NA !NA
5 -11.7 127. 14.3 56 5 5 NA NA !NA !NA
6 28 160. 14.9 66 5 6 !NA NA !NA !NA
7 23 299 8.6 65 5 7 !NA !NA !NA !NA
8 19 99 13.8 59 5 8 !NA !NA !NA !NA
9 8 19 20.1 61 5 9 !NA !NA !NA !NA
10 29.7 194 8.6 69 5 10 NA !NA !NA !NA
# ... with 143 more rows, and 3 more variables: Month_NA <fct>, Day_NA <fct>,
# any_missing <chr>
aq_imp_large <- airquality %>%
bind_shadow() %>%
impute_lm(Ozone ~ Wind + Temp + Month + Day) %>%
impute_lm(Solar.R ~ Wind + Temp + Month + Day) %>%
add_label_shadow()
aq_imp_large
# A tibble: 153 x 13
Ozone Solar.R Wind Temp Month Day Ozone_NA Solar.R_NA Wind_NA Temp_NA
<dbl> <dbl> <dbl> <int> <int> <int> <fct> <fct> <fct> <fct>
1 41 190 7.4 67 5 1 !NA !NA !NA !NA
2 36 118 8 72 5 2 !NA !NA !NA !NA
3 12 149 12.6 74 5 3 !NA !NA !NA !NA
4 18 313 11.5 62 5 4 !NA !NA !NA !NA
5 -12.5 152. 14.3 56 5 5 NA NA !NA !NA
6 28 189. 14.9 66 5 6 !NA NA !NA !NA
7 23 299 8.6 65 5 7 !NA !NA !NA !NA
8 19 99 13.8 59 5 8 !NA !NA !NA !NA
9 8 19 20.1 61 5 9 !NA !NA !NA !NA
10 33.6 194 8.6 69 5 10 NA !NA !NA !NA
# ... with 143 more rows, and 3 more variables: Month_NA <fct>, Day_NA <fct>,
# any_missing <chr>
Vezivanje modela u isti dataframe i prikaz.
bound_models <-
bind_rows(small = aq_imp_small,
large = aq_imp_large,
.id = "imp_model")
bound_models
# A tibble: 306 x 14
imp_model Ozone Solar.R Wind Temp Month Day Ozone_NA Solar.R_NA Wind_NA
<chr> <dbl> <dbl> <dbl> <int> <int> <int> <fct> <fct> <fct>
1 small 41 190 7.4 67 5 1 !NA !NA !NA
2 small 36 118 8 72 5 2 !NA !NA !NA
3 small 12 149 12.6 74 5 3 !NA !NA !NA
4 small 18 313 11.5 62 5 4 !NA !NA !NA
5 small -11.7 127. 14.3 56 5 5 NA NA !NA
6 small 28 160. 14.9 66 5 6 !NA NA !NA
7 small 23 299 8.6 65 5 7 !NA !NA !NA
8 small 19 99 13.8 59 5 8 !NA !NA !NA
9 small 8 19 20.1 61 5 9 !NA !NA !NA
10 small 29.7 194 8.6 69 5 10 NA !NA !NA
# ... with 296 more rows, and 4 more variables: Temp_NA <fct>, Month_NA <fct>,
# Day_NA <fct>, any_missing <chr>
ggplot(bound_models,
aes(x = Ozone,
y = Solar.R,
colour = any_missing)) +
geom_point() +
facet_wrap(~imp_model)
Poređenje imputacija za više modela.
bound_models_gather <- bound_models %>%
select(Ozone, Solar.R,
any_missing, imp_model) %>%
gather(key = "variable", value = "value",
-any_missing, -imp_model)
bound_models_gather
# A tibble: 612 x 4
any_missing imp_model variable value
<chr> <chr> <chr> <dbl>
1 Not Missing small Ozone 41
2 Not Missing small Ozone 36
3 Not Missing small Ozone 12
4 Not Missing small Ozone 18
5 Missing small Ozone -11.7
6 Missing small Ozone 28
7 Not Missing small Ozone 23
8 Not Missing small Ozone 19
9 Not Missing small Ozone 8
10 Missing small Ozone 29.7
# ... with 602 more rows
ggplot(bound_models_gather,
aes(x = imp_model,
y = value,colour=imp_model)) +
geom_boxplot() +
facet_wrap(~variable)
Poređenje imputacija za više modela, samo za nedostajuće vrednosti.
bound_models_gather %>%
filter(any_missing == "Missing") %>%
ggplot(aes(x = imp_model, y = value, colour=imp_model)) +
geom_boxplot() +
facet_wrap(~variable)
Vežba - imputacija korišćenjem simputation paketa.
# Impute humidity and air temperature using wind_ew and wind_ns, and track missing values
ocean_imp_lm_wind <- oceanbuoys %>%
bind_shadow() %>%
impute_lm(air_temp_c ~ wind_ew + wind_ns) %>%
impute_lm(humidity ~ wind_ew + wind_ns) %>%
add_label_shadow()
# Plot the imputed values for air_temp_c and humidity, colored by missingness
ggplot(ocean_imp_lm_wind,
aes(x = air_temp_c, y = humidity, color = any_missing)) +
geom_point()
Vežba - evaluacija i poređenje.
# Bind the models together
bound_models <- bind_rows(mean = ocean_imp_mean,
lm_wind = ocean_imp_lm_wind,
.id = "imp_model")
# Inspect the values of air_temp and humidity as a scatterplot
ggplot(bound_models,
aes(x = air_temp_c,
y = humidity,
color = any_missing)) +
geom_point() +
facet_wrap(~imp_model)
Vežba - evaluacija i poređenje za više mopdela i promenljivih.
# Build a model adding year to the outcome
ocean_imp_lm_wind_year <- bind_shadow(oceanbuoys) %>%
impute_lm(air_temp_c ~ wind_ew + wind_ns + year) %>%
impute_lm(humidity ~ wind_ew + wind_ns + year) %>%
add_label_shadow()
# Bind the mean, lm_wind, and lm_wind_year models together
bound_models <- bind_rows(mean = ocean_imp_mean,
lm_wind = ocean_imp_lm_wind,
lm_wind_year = ocean_imp_lm_wind_year,
.id = "imp_model")
# Explore air_temp and humidity, coloring by any missings, and faceting by imputation model
ggplot(bound_models, aes(x = air_temp_c, y = humidity, color = any_missing)) +
geom_point() + facet_wrap(~imp_model)
Evaluacija parametara modela.
lm(Temp ~ Ozone + Solar.R + Wind + Month + Day, data = airquality)
Call:
lm(formula = Temp ~ Ozone + Solar.R + Wind + Month + Day, data = airquality)
Coefficients:
(Intercept) Ozone Solar.R Wind Month Day
57.25183 0.16528 0.01082 -0.17433 2.04246 -0.08919
Spajanje datasetova.
#1. Complete cases
aq_cc <- airquality %>%
na.omit() %>%
bind_shadow() %>%
add_label_shadow()
aq_cc
# A tibble: 111 x 13
Ozone Solar.R Wind Temp Month Day Ozone_NA Solar.R_NA Wind_NA Temp_NA
<int> <int> <dbl> <int> <int> <int> <fct> <fct> <fct> <fct>
1 41 190 7.4 67 5 1 !NA !NA !NA !NA
2 36 118 8 72 5 2 !NA !NA !NA !NA
3 12 149 12.6 74 5 3 !NA !NA !NA !NA
4 18 313 11.5 62 5 4 !NA !NA !NA !NA
5 23 299 8.6 65 5 7 !NA !NA !NA !NA
6 19 99 13.8 59 5 8 !NA !NA !NA !NA
7 8 19 20.1 61 5 9 !NA !NA !NA !NA
8 16 256 9.7 69 5 12 !NA !NA !NA !NA
9 11 290 9.2 66 5 13 !NA !NA !NA !NA
10 14 274 10.9 68 5 14 !NA !NA !NA !NA
# ... with 101 more rows, and 3 more variables: Month_NA <fct>, Day_NA <fct>,
# any_missing <chr>
#2. Imputation using the imputed data from the last lesson
aq_imp_lm <- bind_shadow(airquality) %>%
add_label_shadow() %>%
impute_lm(Ozone ~ Temp + Wind + Month + Day) %>%
impute_lm(Solar.R ~ Temp + Wind + Month + Day)
aq_imp_lm
# A tibble: 153 x 13
Ozone Solar.R Wind Temp Month Day Ozone_NA Solar.R_NA Wind_NA Temp_NA
* <dbl> <dbl> <dbl> <int> <int> <int> <fct> <fct> <fct> <fct>
1 41 190 7.4 67 5 1 !NA !NA !NA !NA
2 36 118 8 72 5 2 !NA !NA !NA !NA
3 12 149 12.6 74 5 3 !NA !NA !NA !NA
4 18 313 11.5 62 5 4 !NA !NA !NA !NA
5 -12.5 152. 14.3 56 5 5 NA NA !NA !NA
6 28 189. 14.9 66 5 6 !NA NA !NA !NA
7 23 299 8.6 65 5 7 !NA !NA !NA !NA
8 19 99 13.8 59 5 8 !NA !NA !NA !NA
9 8 19 20.1 61 5 9 !NA !NA !NA !NA
10 33.6 194 8.6 69 5 10 NA !NA !NA !NA
# ... with 143 more rows, and 3 more variables: Month_NA <fct>, Day_NA <fct>,
# any_missing <chr>
# 3. Bind the models together
bound_models <- bind_rows(cc = aq_cc,
imp_lm = aq_imp_lm,
.id = "imp_model")
bound_models
# A tibble: 264 x 14
imp_model Ozone Solar.R Wind Temp Month Day Ozone_NA Solar.R_NA Wind_NA
<chr> <dbl> <dbl> <dbl> <int> <int> <int> <fct> <fct> <fct>
1 cc 41 190 7.4 67 5 1 !NA !NA !NA
2 cc 36 118 8 72 5 2 !NA !NA !NA
3 cc 12 149 12.6 74 5 3 !NA !NA !NA
4 cc 18 313 11.5 62 5 4 !NA !NA !NA
5 cc 23 299 8.6 65 5 7 !NA !NA !NA
6 cc 19 99 13.8 59 5 8 !NA !NA !NA
7 cc 8 19 20.1 61 5 9 !NA !NA !NA
8 cc 16 256 9.7 69 5 12 !NA !NA !NA
9 cc 11 290 9.2 66 5 13 !NA !NA !NA
10 cc 14 274 10.9 68 5 14 !NA !NA !NA
# ... with 254 more rows, and 4 more variables: Temp_NA <fct>, Month_NA <fct>,
# Day_NA <fct>, any_missing <chr>
Istraživanje modela
model_summary <- bound_models %>%
group_by(imp_model) %>%
nest() %>%
mutate(mod = map(data,
~lm(Temp ~ Ozone + Solar.R + Wind + Day + Month ,
data = .)),
res = map(mod, residuals),
pred = map(mod, predict),
tidy = map(mod, broom::tidy))
model_summary
# A tibble: 2 x 6
# Groups: imp_model [2]
imp_model data mod res pred tidy
<chr> <list<df[,13]>> <list> <list> <list> <list>
1 cc [111 x 13] <lm> <dbl [111]> <dbl [111]> <tibble [6 x 5]>
2 imp_lm [153 x 13] <lm> <dbl [153]> <dbl [153]> <tibble [6 x 5]>
Istraživanje koeficijenata modela.
model_summary %>%
select(imp_model,
tidy) %>%
unnest()
# A tibble: 12 x 6
# Groups: imp_model [2]
imp_model term estimate std.error statistic p.value
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 cc (Intercept) 57.3 4.50 12.7 5.52e-23
2 cc Ozone 0.165 0.0239 6.92 3.66e-10
3 cc Solar.R 0.0108 0.00699 1.55 1.24e- 1
4 cc Wind -0.174 0.212 -0.821 4.13e- 1
5 cc Day -0.0892 0.0677 -1.32 1.91e- 1
6 cc Month 2.04 0.409 4.99 2.42e- 6
7 imp_lm (Intercept) 54.7 3.59 15.2 5.21e-32
8 imp_lm Ozone 0.196 0.0205 9.53 4.52e-17
9 imp_lm Solar.R 0.0102 0.00577 1.76 7.97e- 2
10 imp_lm Wind -0.00642 0.172 -0.0374 9.70e- 1
11 imp_lm Day -0.112 0.0538 -2.08 3.92e- 2
12 imp_lm Month 2.11 0.340 6.21 5.09e- 9
model_summary %>%
select(imp_model,
res) %>%
unnest() %>%
ggplot(aes(x = res,
fill = imp_model)) +
geom_histogram(position = "dodge")
model_summary %>%
select(imp_model,
pred) %>%
unnest() %>%
ggplot(aes(x = pred,
fill = imp_model)) +
geom_histogram(position = "dodge")
Vežba - spajanje i poređenje više modela.
ocean_cc <- oceanbuoys %>% drop_na()
# Create an imputed dataset using a linear models
ocean_imp_lm_all <- bind_shadow(oceanbuoys) %>%
add_label_shadow() %>%
impute_lm(sea_temp_c ~ wind_ew + wind_ns + year + latitude + longitude) %>%
impute_lm(air_temp_c ~ wind_ew + wind_ns + year + latitude + longitude) %>%
impute_lm(humidity ~ wind_ew + wind_ns + year + latitude + longitude)
# Bind the datasets
bound_models <- bind_rows(cc = ocean_cc,
imp_lm_wind = ocean_imp_lm_wind,
imp_lm_all = ocean_imp_lm_all,
.id = "imp_model")
# Look at the models
bound_models
# A tibble: 2,037 x 18
imp_model year latitude longitude sea_temp_c air_temp_c humidity wind_ew
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 cc 1997 0 -110 27.6 27.1 79.6 -6.40
2 cc 1997 0 -110 27.5 27.0 75.8 -5.30
3 cc 1997 0 -110 27.6 27 76.5 -5.10
4 cc 1997 0 -110 27.6 26.9 76.2 -4.90
5 cc 1997 0 -110 27.6 26.8 76.4 -3.5
6 cc 1997 0 -110 27.8 26.9 76.7 -4.40
7 cc 1997 0 -110 28.0 27.0 76.5 -2
8 cc 1997 0 -110 28.0 27.1 78.3 -3.70
9 cc 1997 0 -110 28.0 27.2 78.6 -4.20
10 cc 1997 0 -110 28.0 27.2 76.9 -3.60
# ... with 2,027 more rows, and 10 more variables: wind_ns <dbl>,
# year_NA <fct>, latitude_NA <fct>, longitude_NA <fct>, sea_temp_c_NA <fct>,
# air_temp_c_NA <fct>, humidity_NA <fct>, wind_ew_NA <fct>, wind_ns_NA <fct>,
# any_missing <chr>
# Create the model summary for each dataset
model_summary <- bound_models %>%
group_by(imp_model) %>%
nest() %>%
mutate(mod = map(data, ~lm(sea_temp_c ~ air_temp_c + humidity + year, data = .)),
res = map(mod, residuals),
pred = map(mod, predict),
tidy = map(mod, broom::tidy))
# Explore the coefficients in the model
model_summary %>%
select(imp_model,tidy) %>%
unnest()
# A tibble: 12 x 6
# Groups: imp_model [3]
imp_model term estimate std.error statistic p.value
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 cc (Intercept) -735. 45.9 -16.0 8.19e- 48
2 cc air_temp_c 0.864 0.0231 37.4 2.64e-154
3 cc humidity 0.0341 0.00390 8.74 2.69e- 17
4 cc year 0.369 0.0232 15.9 3.46e- 47
5 imp_lm_wind (Intercept) -1742. 56.1 -31.0 1.83e-135
6 imp_lm_wind air_temp_c 0.365 0.0279 13.1 2.73e- 35
7 imp_lm_wind humidity 0.0225 0.00690 3.26 1.17e- 3
8 imp_lm_wind year 0.880 0.0283 31.1 6.79e-136
9 imp_lm_all (Intercept) -697. 51.8 -13.5 5.04e- 37
10 imp_lm_all air_temp_c 0.890 0.0255 35.0 2.90e-158
11 imp_lm_all humidity 0.0127 0.00463 2.75 6.03e- 3
12 imp_lm_all year 0.351 0.0262 13.4 1.12e- 36
best_model <- "imp_lm_all"
best_model
[1] "imp_lm_all"