
1 Missing data

1.1 Importance of missing data

Direct calculation of aggregate measures such as distance traveled, time spent in transit, or number of trips in a day is biased downward when data are missing during movement. Weighting these measures by the percentage of missingness can also produce biased results when the cause of missingness is related to movement behavior. It is therefore necessary to understand and quantify the missingness mechanisms.
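
The downward bias can be illustrated with a toy calculation. The sketch below is not the study's distance estimator: the planar coordinates in metres, the helper `path_length`, and the trajectory itself are all hypothetical. It only shows that dropping fixes recorded during movement can shorten, and never lengthen, a straight-line distance estimate.

```python
import math

def path_length(points):
    """Sum of straight-line distances between consecutive (x, y) points."""
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))

# Hypothetical trajectory in metres: out along x, up along y, back along x.
full = [(0, 0), (100, 0), (100, 100), (0, 100)]

# The same trajectory with the two middle fixes lost during movement.
observed = [full[0], full[-1]]

full_length = path_length(full)          # three legs of 100 m each
observed_length = path_length(observed)  # one straight leg of 100 m

print(full_length, observed_length)
```

By the triangle inequality, the distance computed from a subset of the fixes can never exceed the distance computed from all of them, which is why this kind of missingness biases the aggregate in one direction only.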

1.2 Sparsity as a measurement of missing data

Movement behavior is an inherently continuous process. By sampling locations over time, we discretize the movement behavior into a series of recorded instants, with unknown trajectories between the sample moments. The smaller the gaps between sample moments, the more accurately the true underlying trajectory can be approximated. Conceptually, sparsity refers to this relationship: the data become more sparse as the sampling interval increases. We can parameterize sparsity as a continuum between 0 and 1, where 0 corresponds to full data and 1 to no data. Because each sample moment represents a single instant, sparsity calculated over continuous time is not a useful metric, so time must be discretized into intervals. This interval is referred to as the temporal resolution (Chen 2019). The choice of temporal resolution depends on the analysis goals, but the interval should be short enough that impactful behavioral changes are not missed.

A person’s trajectory can be assessed for missingness at each time interval. We measure sparsity with respect to our temporal resolution as the proportion of intervals that contain no data, as opposed to those that contain at least some data. We parameterize this measurement as q (Wang 2020). The value of q is therefore directly influenced by our choice of temporal resolution, as shown in Figure 1.1. In characterizing the missingness within the Tabi study, we use a temporal resolution of five minutes to accommodate distance estimation as a metric of interest.
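
A minimal sketch of this calculation follows, assuming timestamps expressed in seconds and a half-open observation window `[start, end)`. The function name `sparsity` and the example timestamps are hypothetical and are not taken from the study code.

```python
import math

def sparsity(timestamps, start, end, resolution=300):
    """Proportion of fixed-width intervals in [start, end) with no record.

    timestamps, start and end are in seconds; resolution is the interval
    width in seconds (300 s corresponds to five minutes).
    """
    if end <= start:
        raise ValueError("end must be later than start")
    n_intervals = math.ceil((end - start) / resolution)
    # Index of the interval each in-window record falls into.
    occupied = {int((t - start) // resolution)
                for t in timestamps if start <= t < end}
    return 1 - len(occupied) / n_intervals

# Hypothetical hour of data: two short bursts of records.
records = [0, 10, 20, 1500, 1510]

q_1min = sparsity(records, 0, 3600, resolution=60)     # 58 of 60 intervals empty
q_5min = sparsity(records, 0, 3600, resolution=300)    # 10 of 12 intervals empty
q_30min = sparsity(records, 0, 3600, resolution=1800)  # 1 of 2 intervals empty
```

The same five records give three different values of q, which illustrates the dependence on temporal resolution described above.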

Figure 1.1: Difference in sparsity measurements

q can be calculated at different levels of aggregation (over trajectories, days, or observation periods) in order to characterize differing patterns of missingness. Movement behavior is expected to differ over the days of the week in accordance with work-related travel. However, Figure 1.2 shows little evidence of a difference in mean q per day across the days of the week. We do see evidence of a change in q with respect to time in the study: longer participation is associated with increased sparsity, as shown in Figure 1.3.

Figure 1.2: Sparsity over days of the week

Figure 1.3: Sparsity over days of participation

Figure 1.4: Sparsity over time

This pattern may not indicate the obvious causative relationship and may instead reflect one of the device-specific causes of missingness that exist within these data. In fact, it may be that the factor that causes the sparsity in the data is also what allows for longer-term participation. Investigation of q per day across devices yields an ICC of .58, and Figure 1.5 demonstrates the wide variety of relationships across devices.
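
For reference, the usual definition of the intraclass correlation is given below. It assumes that the ICC comes from an intercept-only multilevel model of q per day with devices as the grouping level; the text does not state how the value was computed, so this is an assumption, as are the symbols.

```latex
\[
\mathrm{ICC} = \frac{\sigma^2_{u_0}}{\sigma^2_{u_0} + \sigma^2_{e}}
\]
```

Here \(\sigma^2_{u_0}\) is the between-device variance of the intercepts and \(\sigma^2_{e}\) is the within-device residual variance. Under this reading, an ICC of .58 would mean that roughly 58% of the variance in daily q lies between devices rather than within them.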

Figure 1.5: Change in q over time for a random sample of participants

A multilevel model with random slopes for day of participation yields a fixed effect for day of participation of \(0.007\). The random-effect correlation between the intercept and the slope for day of participation of \(-0.46\) is logical considering that our parameter q is capped at 1 and many participants have intercepts in the upper ranges. The correlation of fixed effects is \(-.5\) …

## Linear mixed model fit by REML ['lmerMod']
## Formula: q5pd ~ day_of_participation + (day_of_participation | device_id_f)
##    Data: res8
## Control: lmerControl(optimizer = "bobyqa", optCtrl = list(maxfun = 2e+05))
## 
## REML criterion at convergence: 88.6
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.8902 -0.5027  0.0531  0.4944  4.2044 
## 
## Random effects:
##  Groups      Name                 Variance  Std.Dev. Corr 
##  device_id_f (Intercept)          0.0824900 0.28721       
##              day_of_participation 0.0002324 0.01524  -0.46
##  Residual                         0.0425116 0.20618       
## Number of obs: 5010, groups:  device_id_f, 530
## 
## Fixed effects:
##                      Estimate Std. Error t value
## (Intercept)          0.561370   0.013795  40.693
## day_of_participation 0.007423   0.001138   6.523
## 
## Correlation of Fixed Effects:
##             (Intr)
## dy_f_prtcpt -0.500
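
The formula in the output above corresponds to a standard random-intercept, random-slope model. It can be written out as follows, where the symbols are introduced here for illustration: \(i\) indexes days within device \(j\), and \(d_{ij}\) is the day of participation.

```latex
\[
q_{ij} = \beta_0 + \beta_1 d_{ij} + u_{0j} + u_{1j} d_{ij} + e_{ij},
\qquad
\begin{pmatrix} u_{0j} \\ u_{1j} \end{pmatrix}
\sim \mathcal{N}\!\left(
\begin{pmatrix} 0 \\ 0 \end{pmatrix},
\begin{pmatrix}
\tau_0^2 & \rho\,\tau_0\tau_1 \\
\rho\,\tau_0\tau_1 & \tau_1^2
\end{pmatrix}
\right),
\qquad
e_{ij} \sim \mathcal{N}(0, \sigma^2)
\]
```

Read against the output, \(\hat\beta_0 = 0.561\), \(\hat\beta_1 = 0.0074\), \(\hat\tau_0 = 0.287\), \(\hat\tau_1 = 0.0152\), \(\hat\rho = -0.46\), and \(\hat\sigma = 0.206\).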

Question: There seems to be a clear effect that I’m trying to demonstrate where people who have many days of participation more often have a negative or flat slope wrt q. I’m not sure how to model that. If I include total days of participation as a second-level predictor, is that totally inappropriate because it’s related to day of participation? My gut says yes and there’s some better way to get at this.

1.3 Patterns

2 Device-related characteristics

Although requests to the operating-system-specific location provider were made at a rate of once per second while a device was judged to be in motion and once per minute when it was stationary, changes in recent years to the behavior of these system calls have reduced the true frequency of location requests. iOS and Android are known to handle this in different ways, although recent versions of both operating systems limit the total number of requests made when a device is not currently in use.

Android OS versions with the same version number differ across manufacturers, and some aggressively close applications that are judged not to be in active use, requiring the user to reopen the app in order to continue processing. An app that is reopened is again free to make location requests, with some caveats. iOS restricts processing during windows of device inactivity but largely does not close the app. We therefore see a discrepancy in sparsity between operating systems: Android users who record any location data on a day are more likely to record more location data, whereas iOS users are more likely to record small amounts of location data on a regular basis. See Figure 2.1.

## Linear mixed model fit by REML ['lmerMod']
## Formula: 
## q5pd ~ day_of_participation + os + (day_of_participation | device_id_f)
##    Data: res8
## Control: lmerControl(optimizer = "bobyqa", optCtrl = list(maxfun = 2e+05))
## 
## REML criterion at convergence: -10.3
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.8919 -0.5035  0.0616  0.4978  4.1779 
## 
## Random effects:
##  Groups      Name                 Variance  Std.Dev. Corr 
##  device_id_f (Intercept)          0.0781478 0.27955       
##              day_of_participation 0.0002168 0.01472  -0.64
##  Residual                         0.0426503 0.20652       
## Number of obs: 5010, groups:  device_id_f, 530
## 
## Fixed effects:
##                      Estimate Std. Error t value
## (Intercept)          0.459867   0.016110   28.55
## day_of_participation 0.006982   0.001078    6.48
## osiOS                0.243442   0.021717   11.21
## 
## Correlation of Fixed Effects:
##             (Intr) dy_f_p
## dy_f_prtcpt -0.462       
## osiOS       -0.550 -0.071

## Linear mixed model fit by REML ['lmerMod']
## Formula: q5pd ~ day_of_participation + os + day_of_participation * os +  
##     (day_of_participation | device_id_f)
##    Data: res8
## Control: lmerControl(optimizer = "bobyqa", optCtrl = list(maxfun = 2e+05))
## 
## REML criterion at convergence: -26.6
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.9151 -0.5059  0.0514  0.5034  4.1975 
## 
## Random effects:
##  Groups      Name                 Variance  Std.Dev. Corr 
##  device_id_f (Intercept)          0.0772512 0.2779        
##              day_of_participation 0.0001989 0.0141   -0.65
##  Residual                         0.0425730 0.2063        
## Number of obs: 5010, groups:  device_id_f, 530
## 
## Fixed effects:
##                            Estimate Std. Error t value
## (Intercept)                0.500129   0.017797  28.102
## day_of_participation       0.001135   0.001523   0.745
## osiOS                      0.159950   0.027046   5.914
## day_of_participation:osiOS 0.010930   0.002097   5.213
## 
## Correlation of Fixed Effects:
##             (Intr) dy_f_p osiOS 
## dy_f_prtcpt -0.600              
## osiOS       -0.658  0.395       
## dy_f_prt:OS  0.436 -0.726 -0.601
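
The interaction model above implies a separate intercept and slope for each operating system. The sketch below works through that arithmetic using the fixed-effect estimates copied from the output. It assumes that Android is the reference level of `os` (the output only shows the `osiOS` contrast), and the helper `predicted_q` is hypothetical. How `day_of_participation` is coded (starting at 0 or 1) is not stated, so the intercepts are read as values at day 0.

```python
# Fixed-effect estimates copied from the interaction model output.
INTERCEPT = 0.500129
DAY = 0.001135
IOS = 0.159950
DAY_X_IOS = 0.010930

def predicted_q(day, os):
    """Population-level prediction of q, with random effects set to zero.

    Assumes "Android" is the reference level; any other value of `os`
    except "iOS" is also treated as the reference level.
    """
    is_ios = 1 if os == "iOS" else 0
    return INTERCEPT + DAY * day + IOS * is_ios + DAY_X_IOS * day * is_ios

android_slope = DAY            # change in q per day for the reference OS
ios_slope = DAY + DAY_X_IOS    # change in q per day for iOS

for os_name in ("Android", "iOS"):
    print(os_name, round(predicted_q(0, os_name), 3), round(predicted_q(10, os_name), 3))
```

Under these assumptions the slope for the reference operating system is about 0.001 per day, which the output shows is small relative to its standard error, while the iOS slope is about 0.012 per day. Because the model is linear, predictions are not constrained to the interval from 0 to 1.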

Figure 2.1: Difference in sparsity over days of participation by device OS

Both operating systems have restrictions that limit processing power during nighttime hours. However, the actual effect on mean sparsity is relatively small, as shown in Figure 2.2. Figure 2.3 demonstrates that sparsity is a better metric with which to understand data completeness than the number of records.

Figure 2.2: Average sparsity per hour

Figure 2.3: Sparsity and GPS records

2.1 Location patterns over time

See Figure \@ref(fig:records2).

Figure 2.4: Records over days of week

Figure 2.5: Records over time

3 Battery levels

4 The completeness data fig

Also with the person’s first day as first instead of date.

Most complete seven days of data.

High-level data, then zoom in on mechanisms.

Give example of cold start, etc.

Describe the first figure. Look at time and distance from gaps, average distance, average length, etc.

Define what a gap is. If you miss a second, it’s already a conceptual gap because the underlying data is continuous. Basis is battery data. Second sort of gap within 15 minutes, missing a location every 5 minutes.

How long the missing data episodes are

Less than 5 minutes, look at distance and time

ML model variance between persons gap length random effect
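
One possible way to operationalize the gap notes above is sketched here. It assumes timestamps in seconds and treats any pair of consecutive records more than a threshold apart as a missing-data episode. The function name `find_gaps`, the default threshold of five minutes, and the example timestamps are illustrative assumptions rather than definitions from the study.

```python
def find_gaps(timestamps, threshold=300):
    """Return (start, end, duration) for each pair of consecutive records
    that are more than `threshold` seconds apart."""
    ordered = sorted(timestamps)
    gaps = []
    for prev, curr in zip(ordered, ordered[1:]):
        duration = curr - prev
        if duration > threshold:
            gaps.append((prev, curr, duration))
    return gaps

# Hypothetical record times in seconds since the start of observation.
records = [0, 60, 120, 900, 960, 4000]

gaps = find_gaps(records)
durations = [duration for _, _, duration in gaps]
mean_gap = sum(durations) / len(durations) if durations else 0.0

print(gaps, mean_gap)
```

The resulting episode durations could then serve as the outcome in the per-person analysis of gap length mentioned in the notes, and the same list of start and end times could be joined to locations to look at the distance covered across each gap.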

Missing Data

Danielle McCool