Dynamic-time-warping-parameters

DTWBI

Dynamic Time Warping Based Imputation has been proposed in the literature independently at least two different authors.(1, 2) Its primary mechanism is imputation of missing data that occurs within a time series where there are other time series available that can function as candidate donors for the time period which is missing.

Anything that is periodic in nature can make use of the periodicity by aligning the non-missing portion of the data to a suitable portion of a reference data set and then using the contents of the reference data set during the missing point to fill in the gap. Although there are different ways to accomplish this goal, the one illustrated here is meant to align with the general concept of multiple imputation through chained equations, using Predictive Mean Matching as a guiding tool.

We will start with the time series with the least amount of missing data – this becomes our query. Some portion of the non-missing data in the query is selected to be matched against all other time series containing no missing data. The best-fitting subsequence alignment is found between the query and each of these reference sequences. The Euclidean distance between the aligned query and reference is captured for each reference set. This is used to create a probability of selection as an imputation candidate, where the closer matches are more likely to be chosen. Next, a number of new data sets are created in which the missing data in the query has been filled by the data from the chosen reference candidate. We then move to the next time series with the least amount of missing data and start again, now with one additional reference sequence – the initial query.

We proceed through the data until all missing data has been filled, leaving us with a number of complete data sets (depending on how many imputations were chosen).

The parameters

The simulation is parameterized by a number of different features that may impact the way in which candidates time windows are chosen for imputation.

N imputations

The first is the number of imputations. Because we select candidates on the basis of probability within each imputation window, the impact of the number of imputations is in increasing the variation. For the moment, we will delay on this point because the number of imputations can be controlled after the fact. The impact on the imputation procedure itself is to increase it in length. Currently, the imputations don’t run in parallel, although they could, so running 10 imputations takes twice as long as running 5 imputations.

Time window

The time window is what controls how specific to a given time point we must be in order to impute it. If we select a time window of 1 hour, and we have a gap between 14.00 and 16.00, then we can select candidates for imputing the gap for any two-hour time periods starting between 13.00 and 15.00 (and thus ending between 15.00 and 17.00.) If we had a two-hour time window, we could select gap imputation candidate data that started between 12.00 and 16.00.

The tighter the time window, the more we restrict our matches to conform to time-based routines. The looser the time window, the more we allow for matches that favor similarity in behavior without requiring the time element. A time window of 12 hours would allow for unrestricted pattern matching in the time series (12 hours before, and 12 hours afterward).

Time window is the only parameter that also has an impact on TWI – the non-pattern-matching based approach, and one of our comparisons. Instead of selecting probability-based matches from among those with matching time slots, each imputation selects them completely at random.

This simulation applied time windows of 1, 2 and 3 hours.

Matching buffer

In addition to the time window, we also vary the number of elements against which we match. This is the buffer following the gap in the case of missingness in the beginning of a time series, preceding the gap in the case of missingness at the end of a time series, and on both sides in the case of missingness that occurs in the middle of a time series.

A matching buffer is specified in terms of the number of elements. If we specify a matching buffer of 4, and our time is discretized into units of 15 minutes, then we look at a buffer equivalent to an hour. In theory, a larger buffer prefers longer patterns (like a person’s repeated work commute), and a shorter buffer prefers shorter patterns (e.g. someone is moving a long distance before, but not afterward).

Matching buffers are a preference – we select the larger of the matching buffer and the available elements. Consider missingness occurring in the middle of a time series of 10 elements. If elements 7 and 8 are missing, we may have at most a matching buffer of 2 following the gap, and at most a matching buffer of 6 preceding the gap.

The matching buffer has no impact on TWI because there is no matching.

This simulation applied matching buffers of 4, 8, 16, 24 and 32 elements, which is the equivalent of 1, 2, 4, 6 and 8 hours.

Candidate specificity

Candidate specificity describes the way in which the probability of candidate selection is generated. Because candidates are chosen independently in each imputation chain, we rely on a probabilistic mechanism for their selection in order to allow them to vary. If we have a low candidate specificity, then we have only a mild preference for better pattern matches. If we have a high candidate specificity, then we will be much less likely to select poor matches, but we also will increase the selection potential of a few best matches.

If we have zero candidate specificity, we are in the TWI case. We select at random from among the candidates with matching time periods and we lose the potential benefit of the subsequence matching.

On the other hand, if our candidate specificity is too high, all imputations may select the single best donor period and we will lose any variation.

Here’s an example of the mechanism of candidate selection:

query <- 1:10
ref1 <- 2:11
ref2 <- 0:9
ref3 <- 3:12
ref4 <- 11:20 

library(dtw)

r1 <- dtw(query, ref1)$normalizedDistance
r2 <- dtw(query, ref2)$normalizedDistance
r3 <- dtw(query, ref3)$normalizedDistance
r4 <- dtw(query, ref4)$normalizedDistance

dists <- c(r1, r2, r3, r4)
dists

## [1] 0.10 0.10 0.30 5.45

prob.powers <- c(0, 1, 2, 3, 4)
probFromDist_pretty <- function(prob.power, dists){
  probs <- (1/(dists^prob.power))/sum(1/(dists^prob.power))
  paste0("Prob.power ", prob.power, ": ", 
         paste0(round(probs, 2), collapse = ", "))
}
sapply(prob.powers, probFromDist_pretty, dists = dists)

## [1] "Prob.power 0: 0.25, 0.25, 0.25, 0.25"
## [2] "Prob.power 1: 0.43, 0.43, 0.14, 0.01"
## [3] "Prob.power 2: 0.47, 0.47, 0.05, 0"   
## [4] "Prob.power 3: 0.49, 0.49, 0.02, 0"   
## [5] "Prob.power 4: 0.5, 0.5, 0.01, 0"

This simulation applies candidate specificity powers of 2, 3, and 4 (with 0 for comparison).

DTW Bias

We can also judge the extent to which finding Dynamic Time Warping matches improves our ability to impute the missingness. We subtract the absolute Dynamic Time Warping bias from the absolute time-window-only bias. A positive number means that DTWBI outperforms TWBI.

We can use this to help guide parameter selection. Here’s an overview, where we can see the impact of the time windowing, the selection buffer and candidate specificity. Notably, these relationships differ based on where the missingness is located.

DTW Bias Middle

DTWBI outperforms TWI when the missingness occurs in the middle. If we look only at the difference in biases, the largest impact comes from a relatively short buffer window. We would expect higher variability for lower buffering. In data sets in which individuals have recorded multiple similar days against which to match, we would expect to find a tradeoff in which longer windows offered higher precision but more bias.

Here we are matching primarily generic trends where the immediacy of what precedes and follows the missingness is the strongest predictor of what occurs during it. Shorter matching buffers produce better results in this situation.

DTW Bias Middle 2

Let’s reduce our scope to a matching buffer of one hour. As the time window increases, the relative performance of DTWBI increases. It’s worth keeping in mind here that TWI only differs across time windows conditions (and number of imputations), so the relative performance here is an imperfect measure for time window selection in DTWBI.

DTW Bias Middle 3

Let’s facet on time window and look at candidate specificity. There seems to be no clear pattern when it comes to bias. We would expect a higher specificity (low tolerance for deviation from the matching pattern) to …

DTW Bias Middle 4

TWI can be thought of a special case of DTWBI in which the candidate specificity is “none.” Here we’re still looking at missingness in the middle, and in the 1 hour matching buffer. We’ve averaged over the imputations (here we’re using 3) within persons and conditions in order to make it comparable, but we can see that in this case, we’re both more precise than TWI and also less biased, although we have more instances of dramatically-wrong average imputations.

## notch went outside hinges. Try setting notch=FALSE.
## notch went outside hinges. Try setting notch=FALSE.
## notch went outside hinges. Try setting notch=FALSE.
## notch went outside hinges. Try setting notch=FALSE.
## notch went outside hinges. Try setting notch=FALSE.
## notch went outside hinges. Try setting notch=FALSE.
## notch went outside hinges. Try setting notch=FALSE.

DTW Bias Middle 5

We can zoom in to learn more about the distribution here, but we need to switch to a density-based setup to preserve the outlier information by squishing it into the limits. We know that there’s a little skew, but it’s hard to draw conclusions here about the windowing or candidate specificity. Theoretically we would want high specificity with larger windows (capturing identical tracks at different times) and lower specificity with smaller windows (capturing generic trip features). Let’s compare the true distance distribution and imputed ones.

DTW Bias Dist

Still working with the 1 hour buffer and missingness in the middle, we can look at the distributions of our imputations compared with the truth. The limits are collapsed here to be between 0 and 300Km, but the out-of-bound entries are squished into the rightmost area of the density.

Consistently across parameters, we have a small underestimation.

DTW Bias Dist 2

If we zoom in and consider only 0-100Km sets, we see a pretty robust set of distributions that sometimes overestimate and sometimes underestimate. A low Candidate Specification number potentially performs worse here than either a medium or high.

DTW End

So far we’ve only been working with missing in the middle, because that’s where DTWBI has the greatest gains over TWI. We see less impressive gains in the beginning and ending. Here we’re looking at the ending. It’s worth noting that there’s a great deal of discrepancy based on how we look at bias. If we first compute the mean bias for an individual over the DTWBI imputations and then compare this to the mean bias of each individual for the TWI imputations then plot the mean of this difference over conditions (below labeled Mean Difference of Biases) we generally do quite poorly across all conditions, but we improve as we increase the match buffer length.

If, however, we compute the mean bias across individuals within each DTWBI condition (below labeled Difference of Mean Bias), and compare that to the relevant mean bias from the equivalent TWI condition, we see that DTWBI performs better when the MatchBuffer is lower.

##    MatchBuffer CandSpecif window   BiasDiff          V2
## 1:        1 hr     Medium   1 hr -8163.9516    733.2012
## 2:        2 hr     Medium   1 hr -2559.8439  -3413.5939
## 3:        4 hr     Medium   1 hr   453.0581 -12405.6180
## 4:        6 hr     Medium   1 hr   786.1979 -14803.9305
## 5:        8 hr     Medium   1 hr   717.8537 -13559.0206

## Loading required package: gridExtra

DTWBI/TWI bias distribution

Here we’re looking only at medium Candidate Specificity and a 1-hour window. We can see how the distribution of the DTWBI biases corresponds to the change in Match Buffer length. DTWBI (pink) has the better median bias, while TWI (blue) has the better mean bias. TWI underestimates the distance while DTWBI more often overestimates the distance (which is more pronounced as the Match Buffer grows)

Distribution of distances across sources - 1 hr window

Here we can see the distribution not of the biases, but of the total travel distance under the truth (green), TWI (blue) and DTWBI (pink) imputation schemes. We see three imputations, within the one-hour window condition. In the 1-hour match buffer condition, there’s underestimation of the lower travel distances. In the 4- and 8-hour buffer conditions, there’s overestimation of the lower travel distances.

Distribution of distances across sources - 2 hr window

For comparison, here’s the two-hour window

Distribution of distances across sources - 3 hr window

And three-hour window

Ending ending

Note that here, the scale is inverted for the comparison between DTWBI and TWI for comparison. Before, a high number was better (abs(TWIbias) - abs(DTWbias)), but in order to compare it against the absolute bias, we need lower numbers to be better (abs(DTWbias) - abs(TWIbias)). In the below picture, the darker areas can be interpreted as preferential in terms of mean bias.

We may have an indication here that we prefer a lesser match buffer time, but the window information is still unclear.

Beginning

Let’s jump directly into the equivalent plot for missingness at the beginning. Here we see that what occurrs following the missingness does help us to make less biased predictions when imputing the gap. If we have a one-hour window, there’s signs of a preference for a 6-hour matching buffer and medium Candidate Specificity following the gap. The other windows make this less clear. As the window size increases, we also see the necessary match buffer length decreasing.

Difference of means vs mean of differences

We see something similar here for the beginning as for the end. I’m not totally clear how to think about this distinction.

Distribution of distances across sources - 2 hr window, medium Candidate Specificity

Really, there isn’t much difference here when it comes to the scale, making it difficult to see.

Distribution of bias across conditions

Here’s the “squished” distribution of bias across the conditions. Absolute bias greater than 20km is squished into the tails of the distribution. An optimal shape here would look like a lump at 0 without widening at the limits. We seem to find the relative best performance here with a high candidate specificity and a short window or a low specificity and a larger window. Even so, we are often off for any one set of imputations by a large amount.

### Distribution of bias without matching

For comparison, here’s the same plot for TWI (without taking the mean across conditions – each imputation actually has one direct comparison, both derived by chance. When we’ve been comparing until now, it’s against the mean of all TWI in the same window [and either within the same person or not]). It’s hard to say if it might be favorable.

## Warning: Removed 45 rows containing non-finite values (stat_ydensity).

Comparison measures from other paper

The most popular reference for dynamic time warping based imputation (https://www.sciencedirect.com/science/article/am/pii/S0167865517302751) offers some measures for diagnostic purposes. In these, the imputed time series is compared against the true time series.

FA2: This function returns the value of FA2 of two vectors corresponding to univariate signals X (true values) and Y (imputed values). This FA2 corresponds to the percentage of pairs of values (x_{i}, y_{i}) satisfying the condition 0,5 <= (Y_{i}/X_{i}) <= 2. The closer FA2 is to 1, the more accurate is the imputation model. Both vectors Y and X must be of equal length, on the contrary an error will be displayed. In both input vectors, eventual NA will be exluded with a warning diplayed.

fb: (Fractional Bias) - This function returns the value of FB of two vectors corresponding to univariate signals, indicating whether predicted values are underestimated or overestimated compared to true values. A perfect imputation model gets FB = 0. An acceptable imputation model gives FB <= 0.3. Both vectors Y and X must be of equal length, on the contrary an error will be displayed. In both input vectors, eventual NA will be exluded with a warning diplayed.

fsd: (Fraction of Standard Deviation) - This function returns the value of FSD of two vectors corresponding to univariate signals. Values of FSD closer to zero indicate a better performance method for the imputation task. Both vectors Y and X must be of equal length, on the contrary an error will be displayed. In both input vectors, eventual NA will be excluded with a warning displayed.

nmae: (Normalized Mean Absolute Error) - This function returns the value of NMAE of two vectors corresponding to univariate signals. A lower NMAE (NMAE ) value indicates a better performance method for the imputation task. Both vectors Y and X must be of equal length, on the contrary an error will be displayed. In both input vectors, eventual NA will be excluded with a warning displayed.

rmse: (Root Mean Square Error) - This function returns the value of RMSE of two vectors corresponding to univariate signals. A lower RMSE (RMSE ) value indicates a better performance method for the imputation task. Both vectors Y and X must be of equal length, on the contrary an error will be displayed. In both input vectors, eventual NA will be excluded with a warning displayed.

sim: (Similarity) - This function returns the value of similarity of two vectors corresponding to univariate signals. A higher similarity (Similarity ) highlights a more accurate method for completing missing values in univariate datasets. Both vectors Y and X must be of equal length, on the contrary an error will be displayed. In both input vectors, eventual NA will be excluded with a warning displayed.

Pink is TWI (and therefore varies only over windowing conditions). High candidate specificity is preferred for measures fa2, fb and fsd. RMSE is lower with a larger window and higher MatchBuffer.

Variance

dtwbidat <- disttot[Source == "DTWBI" & imp %in% 1:10]

# qbar = 1/m sum(Q_t)
# This is  the mean of the point estimates across imputations.

dtwbidat[, qbar := mean(SumDist), .(cgrp, power, window, match)]

Posterior variance of P(Q|Y_obs) is the sum of two variance components:

Within variance
Between variance

Uhat = 1/m * sum(Uhat_l) where Uhat_l is the vcov matrix of Qhat_l

Little unclear of how to interpret that here. Qhat_l is the estimate of the lth imputation, but in my case, is this across persons? Otherwise I’m not certain what it looks like. So Uhat_l is the variance of all SumDist for the lth imputation? This seems like it’s only relevant on the calculation of the statistic of interest at the end, but I’m looking to capture variation in general?

This is variance between the m estimates

B = 1/(m-1) * sum(Q_t - Qbar)^2

We can check if B is a confidence valid estimate of the variance due to missing data (1 + 1/m)*E(B|Y) >= Var(Qbar)

dtwbidat[, B := (1/(max(imp) - 1)) * sum((SumDist - qbar)^2), .(cgrp, power, window, match)]
# dtwbidat[, Uhatl := var(SumDist), .(power, window, match)]

# dtwbidat[, PooledMeanSumDist  := mean(SumDist), .(cgrp, power, window, match)]
# dtwbidat[, SE := sd(SumDist)/9, .(cgrp, power, window, match)]
# dtwbidat[, 1/10 * sum(SE^2), .(power, window, match)]

dtwbidat[, Vw := mean(var(SumDist)), .(power, window, match)]
# dtwbidat[, Vb := (1/9)*sum((SumDist - PooledMeanSumDist)^2), .(power, window, match)]
dtwbidat[, t := Vw + (1 + (1/max(imp)))*B, .(power, window, match)]
# dtwbidat[, .(Vw, Vb, t, B)]



# Within imputation variance
# average of the mean of the within variance estimate
# i.e. squared standard error in each imputed data set

Working here

From https://stefvanbuuren.name/fimd/sec-evaluation.html

Raw bias and percent bias
Coverage Rate
Average width
RMSE

# Raw bias

# dtwbidat[, rawbias := qbar - TrueDist]
# Not enough simulations to calculate an empirical CI
# What's the right sd to use?

# dtwbidat[, sd := sd(SumDist), .(window, power, match, cgrp)]
dtwbidat[, sd2 := sqrt(t), .(window, power, match, cgrp)]

dtwbidat[, CI_low := qbar - 1.96 * sd2/sqrt(max(imp))]
dtwbidat[, CI_high := qbar + 1.96 * sd2/sqrt(max(imp))]
dtwbidat[, Covered := TrueDist %between% list(CI_low, CI_high)]


dtwbidat[, .(RB = round(mean(qbar) - mean(TrueDist)),
             PB = 100 * abs((mean(qbar) - mean(TrueDist))/mean(TrueDist)),
             CV = round(sum(Covered) / length(Covered), 3),
             AW = round(mean(CI_high - CI_low)),
             RMSE = round(sqrt(mean((SumDist - TrueDist)^2)))),
         .(match, window, power)][order(power, match, window)]

##     match window power     RB        PB    CV     AW  RMSE
##  1:    16   1 hr     2  -8817  9.481548 0.933 145506 33021
##  2:    16   2 hr     2  -9490 10.204981 0.950 145040 32368
##  3:    16   3 hr     2  -8956  9.631235 0.950 145509 31118
##  4:    24   1 hr     2  -9238  9.934427 0.967 149715 32024
##  5:    24   2 hr     2  -8192  8.809680 0.950 151982 34897
##  6:    24   3 hr     2  -9375 10.081694 0.967 148204 32172
##  7:    32   1 hr     2 -10130 10.893448 0.950 146649 31297
##  8:    32   2 hr     2  -8789  9.450915 0.950 147901 34087
##  9:    32   3 hr     2 -10129 10.892371 0.933 148361 32662
## 10:     4   1 hr     2  -6564  7.059165 0.967 147667 29953
## 11:     4   2 hr     2  -4601  4.947449 0.983 146958 29499
## 12:     4   3 hr     2  -5037  5.416594 0.967 147507 30205
## 13:     8   1 hr     2  -7811  8.399451 0.950 146653 33025
## 14:     8   2 hr     2  -6851  7.367037 0.950 147070 31101
## 15:     8   3 hr     2  -5656  6.082756 0.967 146507 33137
## 16:    16   1 hr     3  -8250  8.871455 0.983 145415 31755
## 17:    16   2 hr     3  -8704  9.360096 0.950 146980 32736
## 18:    16   3 hr     3  -7330  7.882368 0.967 147457 32823
## 19:    24   1 hr     3  -7632  8.206823 0.950 148397 34126
## 20:    24   2 hr     3  -8956  9.630981 0.950 146532 32209
## 21:    24   3 hr     3 -10415 11.199942 0.950 145637 31255
## 22:    32   1 hr     3 -10718 11.525996 0.967 146145 30281
## 23:    32   2 hr     3  -8464  9.102403 0.967 146705 32607
## 24:    32   3 hr     3  -9397 10.105469 0.950 147531 31543
## 25:     4   1 hr     3  -3959  4.257900 0.967 145919 28531
## 26:     4   2 hr     3  -5143  5.530446 0.983 147207 30079
## 27:     4   3 hr     3  -4259  4.579838 0.983 146907 31013
## 28:     8   1 hr     3  -4538  4.880535 0.983 149598 31729
## 29:     8   2 hr     3  -6911  7.431709 0.967 148720 29963
## 30:     8   3 hr     3  -5371  5.775636 0.983 146873 30711
## 31:    16   1 hr     4  -7147  7.686133 0.983 144674 30538
## 32:    16   2 hr     4  -7789  8.375831 0.967 145549 31673
## 33:    16   3 hr     4  -6752  7.260555 0.967 145890 29937
## 34:    24   1 hr     4 -10044 10.800613 0.950 144789 30026
## 35:    24   2 hr     4  -8933  9.606252 0.967 147819 31351
## 36:    24   3 hr     4  -8535  9.177829 0.950 146126 30391
## 37:    32   1 hr     4  -9726 10.458508 0.950 146159 31897
## 38:    32   2 hr     4  -8195  8.813007 0.967 147688 34319
## 39:    32   3 hr     4  -8492  9.132342 0.967 146973 32188
## 40:     4   1 hr     4  -2694  2.897203 0.983 147019 28331
## 41:     4   2 hr     4  -5797  6.233983 0.967 146148 27430
## 42:     4   3 hr     4  -3714  3.994135 0.967 146750 29233
## 43:     8   1 hr     4  -6059  6.515865 0.967 146347 27806
## 44:     8   2 hr     4  -3883  4.175759 0.967 148148 30991
## 45:     8   3 hr     4  -5035  5.414769 0.967 147212 28617
##     match window power     RB        PB    CV     AW  RMSE

Plot of those

dtwbidat[, MatchBuffer := factor(match, levels = c(4, 8, 16, 24, 32), labels = c("1 hr", "2 hr", "4 hr", "6 hr", "8 hr"))]
res <- dtwbidat[, .(RB = round(mean(qbar) - mean(TrueDist)),
             PB = 100 * abs((mean(qbar) - mean(TrueDist))/mean(TrueDist)),
             CV = round(sum(Covered) / 600, 3),
             AW = round(mean(CI_high - CI_low)),
             RMSE = round(sqrt(mean((SumDist - TrueDist)^2)))),
         .(MatchBuffer, window, power)][order(power, MatchBuffer, window)]

ggplot(res, aes(fill = RB, x = MatchBuffer, y = window)) +
  geom_tile()+
  facet_wrap(~ power) +
  scale_fill_continuous(labels = scales::label_number_si(unit = "m"))+
  theme_minimal() +
  ggtitle("Raw bias")

ggplot(res, aes(fill = PB, x = MatchBuffer, y = window)) +
  geom_tile()+
  facet_wrap(~ power) +
  scale_fill_continuous(labels = scales::label_percent(scale = 1))+
  theme_minimal()+
  ggtitle("Percent bias")

ggplot(res, aes(fill = CV, x = MatchBuffer, y = window)) +
  geom_tile()+
  facet_wrap(~ power) +
  scale_fill_continuous(labels = scales::label_percent())+
  theme_minimal()+
  ggtitle("Coverage")

Bias calculation on missing data (as opposed to entire series)

disttot[Source != "Truth", Bias2 := MissedDist - ImputedDist]
disttot[, MatchBuffer := factor(match, levels = c(4, 8, 16, 24, 32), labels = c("1 hr", "2 hr", "4 hr", "6 hr", "8 hr"))]
ggplot(disttot[Source == "DTWBI" & imp %in% 1:5, .(Bias2 = mean(Bias2, na.rm = TRUE)), .(MatchBuffer, power, Source, window, imp)], aes(fill = Bias2, x = MatchBuffer, y = power)) +
  geom_tile() +
  facet_grid(rows = vars(imp), cols = vars(window), labeller = "label_both") +
  scale_fill_continuous(labels = scales::label_number_si(unit = "m"))+
  theme_minimal()

## https://stefvanbuuren.name/fimd/sec-whyandwhen.html
# dtwbidat <- disttot[Source == "DTWBI"]
# dtwbidat[, sum(Bias2^2), .(cgrp, match, power, window)]
# 
# ggplot(disttot[Source == "DTWBI" & imp %in% 1:5, .( = mean(Bias2, na.rm = TRUE)), .(MatchBuffer, power, Source, window, imp)], aes(fill = Bias2, x = MatchBuffer, y = power)) +
#   geom_tile() +
#   facet_grid(rows = vars(imp), cols = vars(window), labeller = "label_both") +
#   scale_fill_continuous(labels = scales::label_number_si(unit = "m"))+
#   theme_minimal()