
1 Missing data

1.1 Importance of missing data

Directly calculated aggregate measures, such as distance traveled, time spent in transit, or number of trips in a day, are biased downward when data are missing during periods of movement. Weighting these measures by the percentage of missingness can still produce biased results when the cause of missingness is related to movement behavior. It is therefore necessary to understand and quantify the missingness mechanisms.

1.2 Sparsity as a measurement of missing data

Movement behavior is an inherently continuous process. By sampling locations over time, we discretize the movement behavior into a series of recorded instants, with the trajectory between sample moments unknown. The smaller the gaps between sample moments, the more accurately the true underlying trajectory can be approximated. Conceptually, sparsity refers to this relationship: the data become more sparse as the sampling interval increases. We can parameterize sparsity as a continuum between 0 and 1, where 0 corresponds to full data and 1 to no data.

Because each sample moment represents a single instant, sparsity calculated over continuous time is not a particularly useful metric, so time must be discretized into intervals. Cite Chen 2019 refer to this interval as the temporal resolution. The choice of temporal resolution depends on the analysis goals, but the interval should be short enough that impactful behavioral changes are not missed. A person’s trajectory can then be assessed for missingness at each time interval. We measure sparsity with respect to our temporal resolution as the proportion of intervals that contain no data, as opposed to at least some data. We parameterize this measurement as q, following Cite Wang 2020. The measurement q is therefore directly influenced by our choice of temporal resolution, as shown in Figure 1.1. In characterizing the missingness within the Tabi study, we use a temporal resolution of five minutes to accommodate distance estimation as a metric of interest.
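The dependence of q on the temporal resolution can be illustrated with a minimal sketch. It assumes q is the share of fixed-width intervals that contain no sample; the function name `sparsity`, the use of seconds as the time unit, and the one-hour trace are hypothetical and are not taken from the Tabi data or code.

```python
import math

def sparsity(sample_times, start, end, resolution):
    """Return q: the share of intervals of width `resolution` in [start, end)
    that contain no sample. All arguments are in seconds (an assumption)."""
    n_intervals = math.ceil((end - start) / resolution)
    if n_intervals <= 0:
        raise ValueError("end must be after start")
    # Index of every interval that holds at least one sample.
    occupied = {
        int((t - start) // resolution)
        for t in sample_times
        if start <= t < end
    }
    return 1 - len(occupied) / n_intervals

# Hypothetical one-hour trace: one fix per minute for the first 20 minutes,
# then a single fix at minute 45.
samples = [60 * m for m in range(20)] + [45 * 60]

q_5min = sparsity(samples, 0, 3600, 5 * 60)    # 5 of 12 intervals occupied
q_30min = sparsity(samples, 0, 3600, 30 * 60)  # both intervals occupied

print(f"q at 5 minutes:  {q_5min:.3f}")
print(f"q at 30 minutes: {q_30min:.3f}")
```

The same trace looks more than half empty at a five-minute resolution and complete at a thirty-minute resolution, which is why q is only meaningful relative to the resolution chosen.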

Figure 1.1: Difference in sparsity measurements

The measurement q can be calculated at different levels of aggregation (over trajectories, days, or observation periods) in order to characterize differing patterns of missingness. Movement behavior is expected to differ over the days of the week in accordance with work-related travel. As shown in Figure 1.2, however, we see little evidence of a difference in mean q per day across the days of the week. We do see evidence of a change in q with respect to time in the study: longer participation is related to increased sparsity, as in Figure 1.3.
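As a rough sketch of the per-day aggregation, the snippet below computes q for each calendar day at a five-minute resolution and then averages it by day of the week. The helper names, the naive (timezone-free) timestamps, and the three-day trace are hypothetical. Days without any sample are omitted here rather than counted as q = 1, which is a simplification.

```python
from collections import defaultdict
from datetime import datetime, timedelta

RESOLUTION = timedelta(minutes=5)
INTERVALS_PER_DAY = timedelta(days=1) // RESOLUTION  # 288 five-minute slots

def q_per_day(timestamps):
    """Map each calendar day that has at least one sample to its q."""
    occupied = defaultdict(set)
    for ts in timestamps:
        midnight = datetime.combine(ts.date(), datetime.min.time())
        occupied[ts.date()].add((ts - midnight) // RESOLUTION)
    return {day: 1 - len(slots) / INTERVALS_PER_DAY
            for day, slots in occupied.items()}

def mean_q_by_weekday(daily_q):
    """Average the daily q values per weekday (0 = Monday, 6 = Sunday)."""
    grouped = defaultdict(list)
    for day, q in daily_q.items():
        grouped[day.weekday()].append(q)
    return {weekday: sum(qs) / len(qs) for weekday, qs in grouped.items()}

def fixes(first, hours):
    """Hypothetical helper: one fix every five minutes for `hours` hours."""
    return [first + timedelta(minutes=5 * i) for i in range(hours * 12)]

# Hypothetical trace: two Mondays with partial coverage, one complete Tuesday.
trace = (fixes(datetime(2024, 1, 1), 12)
         + fixes(datetime(2024, 1, 2), 24)
         + fixes(datetime(2024, 1, 8), 6))

daily = q_per_day(trace)
by_weekday = mean_q_by_weekday(daily)
print(by_weekday)
```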

Figure 1.2: Sparsity over days of the week

Figure 1.3: Sparsity over days of participation

Figure 1.4: Sparsity over time

This pattern may not indicate the obvious causal relationship and may instead reflect one of the device-specific causes of missingness that exist within these data. In fact, it may be that the factor that causes the sparsity in the data is what allows for longer-term participation. Investigation of q per day across devices yields an ICC of 0.58, and Figure 1.5 demonstrates the wide variety of relationships across differing devices.

Figure 1.5: Change in q over time for a random sample of participants

A multilevel model with random slopes for day of participation yields a fixed effect for day of participation of \(0.007\). The correlation of \(-0.46\) between the random intercepts and the random slopes for day of participation is logical considering that our parameter q is capped at 1 and many participants have intercepts in the upper ranges. The correlation of fixed effects is \(-0.5\).

## Linear mixed model fit by REML ['lmerMod']
## Formula: q5pd ~ day_of_participation + (day_of_participation | device_id_f)
##    Data: res8
## Control: lmerControl(optimizer = "bobyqa", optCtrl = list(maxfun = 2e+05))
## REML criterion at convergence: 88.6
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.8902 -0.5027  0.0531  0.4944  4.2044 
## Random effects:
##  Groups      Name                 Variance  Std.Dev. Corr 
##  device_id_f (Intercept)          0.0824900 0.28721       
##              day_of_participation 0.0002324 0.01524  -0.46
##  Residual                         0.0425116 0.20618       
## Number of obs: 5010, groups:  device_id_f, 530
## Fixed effects:
##                      Estimate Std. Error t value
## (Intercept)          0.561370   0.013795  40.693
## day_of_participation 0.007423   0.001138   6.523
## Correlation of Fixed Effects:
##             (Intr)
## dy_f_prtcpt -0.500
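To make the estimates in the summary above more concrete, the following sketch works through two quantities they imply. It assumes that day of participation is coded so that the intercept is the expected q at day 0, and it relies on the model's own assumption that device-level slopes are normally distributed around the fixed slope. The function names are hypothetical; the numbers are copied from the output.

```python
import math

# Estimates copied from the model summary above.
INTERCEPT = 0.561370   # fixed intercept
SLOPE = 0.007423       # fixed effect of day_of_participation
SLOPE_SD = 0.01524     # SD of the device-level random slopes

def expected_q(day):
    """Population-average q on a given day (fixed effects only).
    The linear form is not bounded, so large `day` values can exceed 1."""
    return INTERCEPT + SLOPE * day

def normal_cdf(x, mean=0.0, sd=1.0):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1 + math.erf((x - mean) / (sd * math.sqrt(2))))

# Expected q after ten days: 0.561370 + 10 * 0.007423.
q_day_10 = expected_q(10)

# Implied share of devices whose own slope is zero or negative,
# under the normality assumption for the random slopes.
share_non_positive = normal_cdf(0.0, mean=SLOPE, sd=SLOPE_SD)

print(f"expected q at day 10: {q_day_10:.4f}")
print(f"implied share of flat or negative slopes: {share_non_positive:.2f}")
```

Because the slope standard deviation is about twice the size of the fixed slope, the model implies that a substantial minority of devices have flat or decreasing q over time even though the average trend is positive.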

Question: There seems to be a clear effect that I’m trying to demonstrate, where people who have many days of participation more often have a negative or flat slope with respect to q. I’m not sure how to model that. If I include total days of participation as a second-level predictor, is that totally inappropriate because it’s related to day of participation? My gut says yes and there’s some better way to get at this.

1.3 Patterns