Chapter 4 Missing values

The only dataset with missing values was the Trips by Distance data from the U.S. DOT Bureau of Transportation Statistics.

So, working with that set, let us find some NA’s

level na_s
Number of NA’s National: 0
Number of NA’s State: 0
Number of NA’s County: 5779

We can see here that the NA values only occur at the county level. So, let’s find out where those NA’s are hiding.

##       Date              State.FIPS        State         County.FIPS   
##  Min.   :2020-01-22   Min.   : 1.00   TX     :109474   Min.   : 1001  
##  1st Qu.:2020-05-08   1st Qu.:18.00   GA     : 68529   1st Qu.:18177  
##  Median :2020-08-24   Median :29.00   VA     : 57323   Median :29176  
##  Mean   :2020-08-24   Mean   :30.28   KY     : 51720   Mean   :30384  
##  3rd Qu.:2020-12-10   3rd Qu.:45.00   MO     : 49565   3rd Qu.:45081  
##  Max.   :2021-03-27   Max.   :56.00   KS     : 45255   Max.   :56045  
##                                       (Other):972336                  
##                County        Population.Staying.at.Home
##  Washington County:  12930   Min.   :     10           
##  Jefferson County :  10775   1st Qu.:   2185           
##  Franklin County  :  10344   Median :   5221           
##  Jackson County   :   9913   Mean   :  25984           
##  Lincoln County   :   9913   3rd Qu.:  15053           
##  Madison County   :   8189   Max.   :4131225           
##  (Other)          :1292138   NA's   :5779              
##  Population.Not.Staying.at.Home Number.of.Trips    Number.of.Trips..1
##  Min.   :    117                Min.   :     227   Min.   :       0  
##  1st Qu.:   8826                1st Qu.:   37386   1st Qu.:    8051  
##  Median :  20719                Median :   88712   Median :   19530  
##  Mean   :  78593                Mean   :  325210   Mean   :   82383  
##  3rd Qu.:  54017                3rd Qu.:  231308   3rd Qu.:   52890  
##  Max.   :8636354                Max.   :43031242   Max.   :12335889  
##  NA's   :5779                   NA's   :5779       NA's   :5779      
##  Number.of.Trips.1.3 Number.of.Trips.3.5 Number.of.Trips.5.10 Number.of.Trips.10.25
##  Min.   :       0    Min.   :      0     Min.   :      0      Min.   :      0      
##  1st Qu.:    8345    1st Qu.:   3548     1st Qu.:   4909      1st Qu.:   5829      
##  Median :   21520    Median :   9989     Median :  12433      Median :  13901      
##  Mean   :   80587    Mean   :  39327     Mean   :  49127      Mean   :  47670      
##  3rd Qu.:   57947    3rd Qu.:  28213     3rd Qu.:  34078      3rd Qu.:  35061      
##  Max.   :11401040    Max.   :5423126     Max.   :6535454      Max.   :5844180      
##  NA's   :5779        NA's   :5779        NA's   :5779         NA's   :5779         
##  Number.of.Trips.25.50 Number.of.Trips.50.100 Number.of.Trips.100.250
##  Min.   :      0       Min.   :     0         Min.   :     0         
##  1st Qu.:   3013       1st Qu.:  1184         1st Qu.:   424         
##  Median :   6412       Median :  2598         Median :   967         
##  Mean   :  16454       Mean   :  6086         Mean   :  2678         
##  3rd Qu.:  14435       3rd Qu.:  5742         3rd Qu.:  2301         
##  Max.   :1859953       Max.   :555732         Max.   :356594         
##  NA's   :5779          NA's   :5779           NA's   :5779           
##  Number.of.Trips.250.500 Number.of.Trips...500
##  Min.   :    0.0         Min.   :     0.0     
##  1st Qu.:   57.0         1st Qu.:    11.0     
##  Median :  166.0         Median :    41.0     
##  Mean   :  574.5         Mean   :   323.7     
##  3rd Qu.:  442.0         3rd Qu.:   150.0     
##  Max.   :90947.0         Max.   :125691.0     
##  NA's   :5779            NA's   :5779

From the summary we can see that there is an equal 5779 NA’s for each numeric column thus, upon further inspection, 5779 rows are fully incomplete for trips data. So, let’s isolate those rows.

Frist, since this data is a timeseries, let’s check the dates to see if there is a pattern.

Date count
2021-03-26 1228
2020-11-29 113
2020-09-06 101
2020-07-04 99
2020-07-03 93
2020-07-02 91
2020-09-07 88
2020-07-01 87
2020-06-30 85
2020-06-29 79

From this we see the vast majority of NA occur on 2021-03-26, which is one day before the ending day of the dataset. Thus, we can conclude that the rows for second to last day have been created for data input but have not updated by the time of data procurement. Thus, removing the last two days from the dataset will reduce potential error and will not affect analysis of mobility on COVID-19 rates from January 2020 to March 2021. Now to get the total number of days in the dataset.

## Time difference of 428 days

Here we can see that there are 428 total days in the dataset so, lets narrow down the search to counties with > 10% of missing data. (10% is chosen because 90% of available data could still provide accurate insights for trips data) here is a sample:

State County n
AK Aleutians East Borough 62
AK Bristol Bay Borough 353
AK Haines Borough 101
AK Lake and Peninsula Borough 295
AK Skagway Municipality 141
AK Yakutat City and Borough 395

There are 22 counties with high rates of NA values. Thus, these are the counties that could be most problematic when analyzing at a county scale. So, lets save the data frame of these top NA violators for future reference when further micro analysis is desired within these areas.

Now to consider, would the NA values from certain counties affect State level analysis? To find out let’s plot the states with county NA values.

Here we can see the states with the majority of the NA values, lets pick the top 5 (AK, NE, MT, TX, HI) and see how much the NA values might impact the aggregate state values.

We can see by the above stacked bar plot that the NA values account for a negligible amount of total values from each of the top NA states. Thus, the NA values should be of note, but of not much concern when considering total trips by distance for each state.

With this analysis of NA values, the decision to leave the NA rows in the dataset is made for continuity of county timelines.