Direct calculation of aggregate measures such as distance traveled, time spent in transit, or number of trips per day is biased downward when data are missing during movement. Weighting these measures by the percentage of missingness can still produce biased results when the cause of missingness is related to movement behavior. It is therefore necessary to understand and quantify the missingness mechanisms.
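
The two sources of bias can be illustrated with a small worked example in Python. This is a sketch with invented numbers, not data from the study: the day of 288 five-minute intervals, the 24 intervals of movement at 1 km each, and the helper `observed_and_reweighted` are all assumptions made for illustration. Reweighting is taken here to mean dividing the observed total by the fraction of intervals observed.

```python
# Hypothetical day: 288 five-minute intervals, 24 of them spent moving
# at 1 km per interval (illustrative numbers, not study data).
N_INTERVALS = 288
distance_km = [1.0 if i < 24 else 0.0 for i in range(N_INTERVALS)]


def observed_and_reweighted(distance_per_interval, missing):
    """Return (naive total, total rescaled by the observed fraction)."""
    observed = [d for i, d in enumerate(distance_per_interval) if i not in missing]
    naive = sum(observed)
    observed_fraction = len(observed) / len(distance_per_interval)
    return naive, naive / observed_fraction


true_total = sum(distance_km)  # 24.0 km

# Case A: 12 intervals are missing, all of them during movement.
naive_a, reweighted_a = observed_and_reweighted(distance_km, set(range(12)))

# Case B: 12 intervals are missing, all of them while stationary.
naive_b, reweighted_b = observed_and_reweighted(distance_km, set(range(100, 112)))

print(f"true: {true_total:.2f} km")
print(f"A (missing while moving): naive {naive_a:.2f}, reweighted {reweighted_a:.2f}")
print(f"B (missing while still):  naive {naive_b:.2f}, reweighted {reweighted_b:.2f}")
```

In case A both the naive and the reweighted totals fall well short of the true distance, while in case B the reweighted total overshoots it. In this toy setting, the same proportion of missing data leads to different errors depending on when the gaps occur.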

Movement behavior is an inherently continuous process. By sampling locations over time, we discretize the movement into a series of recorded instants, with unknown trajectories between the sample moments. The smaller the gaps between sample moments, the more accurately the true underlying trajectory can be approximated. Conceptually, sparsity refers to this relationship: the data become more sparse as the sampling interval increases. We can parameterize sparsity as a continuum between 0 and 1, where 0 corresponds to full data and 1 to no data. Because each sample moment represents a single instant, sparsity calculated over continuous time is not a useful metric, so time must be discretized into intervals. **Cite Chen 2019** refer to this interval as the *temporal resolution*. The choice of temporal resolution depends on the analysis goals, but the interval should be short enough that impactful behavioral changes are not missed. A person's trajectory can then be assessed for missingness at each time interval. To measure sparsity with respect to our temporal resolution, we use the proportion of all intervals that contain no data, as opposed to at least some data. We parameterize this measurement as *q*, following **Cite Wang 2020**. The value of *q* is therefore directly influenced by our choice of temporal resolution, as shown in Figure 1.1. In characterizing the missingness within the Tabi study, we use a temporal resolution of five minutes to accommodate distance estimation as a metric of interest.
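
As a minimal sketch of this definition, the Python function below counts the intervals of a given temporal resolution that contain no observation. The function name `sparsity_q`, its arguments, and the example timestamps are hypothetical and chosen only for illustration; this is not the code used in the analysis.

```python
import math
from datetime import datetime, timedelta


def sparsity_q(timestamps, start, end, resolution=timedelta(minutes=5)):
    """Proportion of intervals of length `resolution` in [start, end)
    that contain no observation (0 = full data, 1 = no data)."""
    n_intervals = math.ceil((end - start) / resolution)
    if n_intervals <= 0:
        raise ValueError("end must be later than start")
    occupied = set()
    for t in timestamps:
        if start <= t < end:
            # Index of the interval that this observation falls into.
            occupied.add((t - start) // resolution)
    return 1 - len(occupied) / n_intervals


# Hypothetical one-hour window with four recorded locations.
start = datetime(2020, 1, 1, 0, 0)
end = start + timedelta(hours=1)
obs = [start + timedelta(minutes=m) for m in (1, 2, 7, 31)]

q_5min = sparsity_q(obs, start, end, timedelta(minutes=5))
q_15min = sparsity_q(obs, start, end, timedelta(minutes=15))
q_30min = sparsity_q(obs, start, end, timedelta(minutes=30))
print(q_5min, q_15min, q_30min)
```

The same four observations give a different *q* at each resolution: coarser intervals are more likely to contain at least one observation, so the measured sparsity drops as the interval lengthens.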

The measure *q* can be calculated at different levels of aggregation (over trajectories, days, or whole observation periods) in order to characterize differing patterns of missingness. Movement behavior is expected to differ across the days of the week in accordance with work-related travel. As shown in Figure 1.2, however, we see little evidence of a difference in mean *q* per day across the days of the week. We do see evidence of a change in *q* with time in the study: longer participation is related to increased sparsity, as in Figure 1.3.
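
The per-day aggregation and the day-of-week comparison could be sketched as follows. The helper names `q_per_day` and `mean_q_by_weekday` and the example timestamps are assumptions made for illustration. Note that in this simplified version a calendar day with no observations at all does not appear in the result, although its *q* would be 1.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def q_per_day(timestamps, resolution_minutes=5):
    """Map each calendar day that has observations to its sparsity q."""
    intervals_per_day = (24 * 60) // resolution_minutes
    occupied = defaultdict(set)
    for t in timestamps:
        slot = (t.hour * 60 + t.minute) // resolution_minutes
        occupied[t.date()].add(slot)
    return {day: 1 - len(slots) / intervals_per_day
            for day, slots in occupied.items()}


def mean_q_by_weekday(daily_q):
    """Average daily q per weekday (Monday = 0, Sunday = 6)."""
    grouped = defaultdict(list)
    for day, q in daily_q.items():
        grouped[day.weekday()].append(q)
    return {weekday: sum(qs) / len(qs) for weekday, qs in grouped.items()}


# Hypothetical data: two Mondays with different coverage.
monday_1 = datetime(2020, 1, 6)
monday_2 = datetime(2020, 1, 13)
timestamps = (
    [monday_1 + timedelta(minutes=5 * k) for k in range(144)]   # half the day covered
    + [monday_2 + timedelta(minutes=5 * k) for k in range(72)]  # a quarter covered
)

daily = q_per_day(timestamps)
by_weekday = mean_q_by_weekday(daily)
print(daily)
print(by_weekday)
```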

This pattern may not reflect the obvious causal relationship and may instead indicate one of the device-specific causes of missingness that exist within these data. In fact, it may be that the factor that causes the sparsity in the data is what allows for longer-term participation. Investigation of *q* per day across devices yields an ICC of 0.58, and Figure 1.3 demonstrates the wide variety of relationships across devices.
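
For reference, the ICC is conventionally defined from an intercept-only multilevel model as the share of total variance that lies between groups. The notation below is introduced here for illustration and assumes that definition; the text does not report the variance components behind the value of 0.58.

\[
\mathrm{ICC} = \frac{\sigma^2_{u}}{\sigma^2_{u} + \sigma^2_{e}}
\]

Here \(\sigma^2_{u}\) is the between-device variance of *q* per day and \(\sigma^2_{e}\) is the within-device residual variance. Under this reading, an ICC of 0.58 would mean that roughly 58% of the variance in daily *q* is attributable to differences between devices.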

A multilevel model with random slopes for day of participation yields a fixed effect for day of participation of \(0.007\). The correlation of \(-0.46\) between the random intercepts and the random slopes for day of participation is plausible, considering that our parameter *q* is capped at 1 and many participants have intercepts in the upper range. The correlation of fixed effects is \(-0.50\) …

```
## Linear mixed model fit by REML ['lmerMod']
## Formula: q5pd ~ day_of_participation + (day_of_participation | device_id_f)
##    Data: res8
## Control: lmerControl(optimizer = "bobyqa", optCtrl = list(maxfun = 2e+05))
##
## REML criterion at convergence: 88.6
##
## Scaled residuals:
##     Min      1Q  Median      3Q     Max
## -3.8902 -0.5027  0.0531  0.4944  4.2044
##
## Random effects:
##  Groups      Name                 Variance  Std.Dev. Corr
##  device_id_f (Intercept)          0.0824900 0.28721
##              day_of_participation 0.0002324 0.01524  -0.46
##  Residual                         0.0425116 0.20618
## Number of obs: 5010, groups:  device_id_f, 530
##
## Fixed effects:
##                      Estimate Std. Error t value
## (Intercept)          0.561370   0.013795  40.693
## day_of_participation 0.007423   0.001138   6.523
##
## Correlation of Fixed Effects:
##             (Intr)
## dy_f_prtcpt -0.500
```
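
Written out, the formula in the model summary corresponds to the following two-level model, with days \(i\) nested in devices \(j\). The symbols are standard multilevel notation introduced here for illustration; they are not notation from the original analysis.

\[
\begin{aligned}
q_{ij} &= \beta_{0j} + \beta_{1j}\,\mathrm{day}_{ij} + e_{ij}, \qquad e_{ij} \sim N(0, \sigma^2_{e}) \\
\beta_{0j} &= \gamma_{00} + u_{0j} \\
\beta_{1j} &= \gamma_{10} + u_{1j} \\
\begin{pmatrix} u_{0j} \\ u_{1j} \end{pmatrix} &\sim N\!\left(\begin{pmatrix} 0 \\ 0 \end{pmatrix},
\begin{pmatrix} \tau_{00} & \tau_{01} \\ \tau_{01} & \tau_{11} \end{pmatrix}\right)
\end{aligned}
\]

Reading the estimates off the summary: \(\gamma_{00} \approx 0.561\), \(\gamma_{10} \approx 0.0074\), \(\tau_{00} \approx 0.0825\), \(\tau_{11} \approx 0.00023\), \(\sigma^2_{e} \approx 0.0425\), and the intercept-slope correlation \(\tau_{01} / \sqrt{\tau_{00}\,\tau_{11}} \approx -0.46\).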

**Question: There seems to be a clear effect that I'm trying to demonstrate where people who have many days of participation more often have a negative or flat slope with respect to q. I'm not sure how to model that. If I include total days of participation as a second-level predictor, is that totally inappropriate because it's related to day of participation? My gut says yes and there's some better way to get at this.**