Paper 3 Annotations

Complete trajectory reconstruction from sparse mobile phone data - Chen et al. (2019)

Comments

really important paper; may be one method of solving long gaps in paper 3. they use tensor factorization on repeated data, so this is a way of integrating the repeatability of human mobility into the mix.
also probably holds for long gap missing data in our study

Direct quotes

"Call Detail Records (CDR) are the current de-facto standard for mobile phone data used in human mobility studies." Call detail records are much sparser than the data we collect but are more reliably collected: they have little missing data themselves; rather, they are imperfect annotations of the underlying movement behavior, which itself must be interpreted.
We introduce our novel approach, CTR, which leverages well-known features of human mobility to customize tensor factorization
it is performed at quite different temporal resolutions by works in the literature, spanning from 15 minutes [14] to 1 hour [3, 4], or even
Indeed, when extensive knowledge of the user mobility is required by the nature of the investigation, the straightforward way to deal with incomplete trajectory data is that of filtering users so as to exclude from the analysis those with insufficient location information. This is typically performed by imposing thresholds on the amount of mobility data available in CDR, though there is hardly a standard to define them. As a reference, several important works opt for a choice of users who have a minimum average sampling frequency of 0.5 events per hour, and who are observed at two unique locations at least [3, 4, 14]. User filtering comes with a cost in terms of below-threshold users that are excluded from the analysis. Unfortunately, in practical cases involving CDR datasets, even the loose requirements set by the thresholds mentioned above risk to lead to severe reductions in the examined user population. For instance, two very well-known studies analyze just 1.67% [3] and 0.45% [4] of the total subscribers available in their datasets upon filtering, leaving substantial richness in the original data untapped.
by assuming that users do not move for some periods (e.g., in the order of hours) before and after every recorded event in the CDR dataset
17]. A different strategy hinges on probabilistic reconstruction, implemented by modeling positions between two CDR events as random variables
We first quantify the amount of missing information in the individual trajectories. To this end, we assume a total observation time T and discretize it into time intervals of duration Δt, which we will refer to as the temporal resolution. We define the completeness of a specific trajectory as the fraction of time intervals for which we have at least one location sample. As an example, let us assume that T = 7 days and the temporal resolution is Δt = 1 hour: then, a trajectory with locations in 80 different hours has a completeness of 80/(7 × 24) = 0.476
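The completeness metric is simple enough to sketch directly (my own illustration of the definition quoted above, not the authors' code; the hour-based timestamps are an assumption for the example):

```python
# Completeness: fraction of dt-length time bins within the observation
# window T that contain at least one location sample.

def completeness(sample_times_h, T_h, dt_h=1.0):
    """Fraction of dt_h-length bins in [0, T_h) holding >= 1 sample."""
    n_bins = int(T_h / dt_h)
    occupied = {int(t // dt_h) for t in sample_times_h if 0 <= t < T_h}
    return len(occupied) / n_bins

# The worked example from the quote: T = 7 days, dt = 1 hour,
# samples falling in 80 distinct hours.
T = 7 * 24
times = [h + 0.5 for h in range(80)]  # one sample in each of 80 different hours
print(round(completeness(times, T), 3))  # 80 / 168 = 0.476
```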
Table 2 summarizes the results. Goodness-of-fit tests consistently point at either of two distributions, i.e., Weibull and lognormal, depending on the combination of system parameters.
The PDF of the Weibull distribution is expressed as f_Weibull(x) = (k/λ)(x/λ)^(k−1) e^(−(x/λ)^k). Plots (b) and (c) of Fig. 3 show that the estimated Weibull parameters k and λ scale linearly (Pearson correlation coefficients of 0.89 and 0.98, respectively) with the most important system parameter, Δt. Also, the observation time T introduces an offset to the linear relationship. We thus model k̂ = [Δt × (4.569 − 0.01037 T) + 1086.5 − 1.14515 T] × 10⁻³ and λ̂ = [Δt × (1.814 − 0.00379 T) + 26.23 − 0.07659 T] × 10⁻³, and the CDF of CDR-based trajectory completeness as f_Weibull(x; k̂, λ̂).
We define the observed trajectory of non-null locations as L_O = {l_i | i ∈ O}; this is the user mobility information that can be derived from the mobile phone data.
The first step in CTR aims at reconstructing missing portions of a trajectory during the night hours
We first identify the most frequent location visited by the user during nighttime, which we name l̂_H. Then, we verify that l̂_H accounts for a conservative 80% (or more) of the positions of the user during the night period. If this is not the case, we deem that the result is not consistent enough to establish with high confidence the actual home location of the user, and we skip the nighttime trajectory reconstruction step
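The nighttime-anchor check boils down to a frequency test; a minimal sketch of that step (my own, using the 80% threshold from the quote):

```python
from collections import Counter

# Pick the most frequent nighttime location l_H and accept it as the
# home anchor only if it covers at least `threshold` of the nighttime
# positions; otherwise skip nighttime reconstruction for this user.

def detect_home(night_locations, threshold=0.8):
    counts = Counter(night_locations)
    loc, n = counts.most_common(1)[0]
    return loc if n / len(night_locations) >= threshold else None

print(detect_home(["A"] * 9 + ["B"]))     # 'A'  (90% >= 80%)
print(detect_home(["A", "B", "C", "A"]))  # None (50% < 80%)
```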
location tensor of a user X. A trajectory L_T is represented as daily and weekly combinations of one-day sub-trajectories
Tensor factorization exploits redundancy to recover missing data; in our context, redundancy is created by the regularity of human movement, which creates repeated patterns of visited locations over the many days and weeks covered by a CDR dataset, as discussed in Sect. 4.2.
First, the trajectory is split into multiple one-day sub-trajectories. Each one-day sub-trajectory is then converted into a one-dimensional vector: for instance, the sub-trajectory of the jth day of the ith week is denoted by x_i,j and satisfies x_i,j = (l^(x)_{n_{i,j,1}}, l^(y)_{n_{i,j,1}}, ..., l^(x)_{n_{i,j,Nt}}, l^(y)_{n_{i,j,Nt}}) ∈ R^{2Nt}, where l^(x)_k and l^(y)_k are the two coordinates that identify the location at time step k, for l_k = (l^(x)_k, l^(y)_k). Second, we enter all of the one-day sub-trajectories of a user trajectory into the location tensor by organizing them into a matrix of one-day sub-trajectories for all Nw weeks in the observation period T, i.e., X = [x_i,j]_{Nw×Nd} = {X_{i,j,k}}_{Nw×Nd×2Nt}
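The tensor layout is easy to mock up; the sketch below (my own illustration, not the paper's code) builds an Nw × Nd × 2Nt tensor with interleaved (x, y) coordinates and NaNs for missing slots, which is the shape a factorization-based recovery would operate on:

```python
import numpy as np

# Location tensor: week x day x (2 * time-slot) with interleaved
# x/y coordinates; unobserved slots stay NaN for later recovery.

def build_location_tensor(daily_locs, Nw, Nd, Nt):
    """daily_locs: dict mapping (week i, day j, slot k) -> (x, y)."""
    X = np.full((Nw, Nd, 2 * Nt), np.nan)
    for (i, j, k), (x, y) in daily_locs.items():
        X[i, j, 2 * k] = x
        X[i, j, 2 * k + 1] = y
    return X

# Toy example: 2 weeks, 7 days, 24 hourly slots, one observed location.
locs = {(0, 0, 8): (3.2, 7.5)}
X = build_location_tensor(locs, Nw=2, Nd=7, Nt=24)
print(X.shape)            # (2, 7, 48)
print(np.isnan(X).sum())  # 670 of 672 entries missing
```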
decomposes the tensor into hyper-parameters and then uses them to infer missing values
Indeed, (i) daily periodicities tend to be stronger than weekly ones [3], and (ii) consecutive time slots, hence nearby values in the same one-day sub-trajectory vector, show a strong correlation in terms of locations of the user [18]. In the light of these considerations, we customize the TF problem for location inference, by introducing two additional elements to the optimization problem
The first element emphasizes daily repetitive mobility patterns by means of Toeplitz matrices. Toeplitz matrices allow modelling relationships between specific elements of the location tensor.
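One way to see what a Toeplitz regularizer does (my own minimal illustration, not the authors' exact formulation): a banded first-difference Toeplitz matrix couples neighboring entries of a one-day sub-trajectory vector, so adding ||D v||² to the factorization objective penalizes reconstructions where consecutive time slots jump around:

```python
import numpy as np

# Banded Toeplitz matrix with rows (..., -1, 1, ...): each row links
# two consecutive entries, so ||D @ v||^2 is small when consecutive
# time slots hold similar values (temporal smoothness).

def first_difference_toeplitz(n):
    """(n-1) x n first-difference Toeplitz matrix."""
    D = np.zeros((n - 1, n))
    for r in range(n - 1):
        D[r, r] = -1.0
        D[r, r + 1] = 1.0
    return D

D = first_difference_toeplitz(5)
v = np.array([1.0, 1.0, 2.0, 2.0, 2.0])  # one jump between slots 2 and 3
penalty = float(np.sum((D @ v) ** 2))
print(penalty)  # 1.0
```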
towers. Instead, the solution of the enhanced TF problem in (4) returns locations whose coordinates are real values, which may not match the position of actual cellular towers.
We remark that the available fine-grained ground-truth trajectories bound our validation to a two-week period.
We ensure that the incomplete trajectories mimic the temporal sampling properties of those extracted from CDR as follows: (i) we construct an empirical distribution of the probability that events are recorded at each time slot, based on CDR data of all 1.8 million users in our dataset; (ii) we set the desired completeness value k that shall characterize the output incomplete trajectory; (iii) we retain a fraction k of the overall location samples in one complete trajectory according to a random selection of time slots that follows the empirical distribution above
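The three-step degradation procedure can be sketched as follows (my own sketch under assumptions: `slot_probs` stands in for the empirical per-slot recording distribution, and the weighted selection loop is one plausible reading of step iii):

```python
import random

# (i) empirical per-slot recording probabilities, (ii) a target
# completeness k, (iii) weighted random selection of slots to keep.

def degrade_trajectory(traj, slot_probs, k, seed=0):
    """traj: dict slot -> location; keep ~k of slots, weighted by slot_probs."""
    rng = random.Random(seed)
    slots = sorted(traj)
    n_keep = round(k * len(slots))
    weights = [slot_probs.get(s, 1e-9) for s in slots]
    kept = set()
    while len(kept) < n_keep:
        kept.add(rng.choices(slots, weights=weights)[0])
    return {s: traj[s] for s in slots if s in kept}

traj = {s: (s, s) for s in range(10)}
probs = {s: 0.2 if s < 5 else 0.8 for s in range(10)}  # later slots more likely
sparse = degrade_trajectory(traj, probs, k=0.5)
print(len(sparse))  # 5 slots retained out of 10
```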
performance metric we adopt is cell displacement, which expresses the MAE used in (1) in terms of mobile network cells
The level of reconstruction accuracy achieved by CTR is acceptable for metropolitan- scale analyses
As a preliminary result in their analysis, Gonzalez et al. find that the travel distances and radiuses of gyration aggregated over the whole user population follow truncated power laws (x + x₀)^(−β) e^(−x/κ). We confirm that this is also the case under complete trajectory data, as shown in the top plots of Fig. 9. The exponent values are also consistent, as Gonzalez et al. find β_r = 1.77 and β_rg = 1.65, whereas β_r = 1.78 and β_rg = 1.68 from our complete trajectory data. However, we remark a sensible difference in the cutoff values. Travel distances can be sensibly higher in complete data; also, κ_rg is at 400 km in [3], which is not far from the 340 km from our original trips. Our conclusion is that CDR data sparsity risks to both underestimate long trips, and overestimate the region within which the Lévy flight behavior of human mobility occurs.
that the extremely high uniqueness identified by De Montjoye et al. is mainly caused by the diverse temporal patterns of the mobile communications of each user, rather than by his/her distinctive mobility.
The mean predictability is at 81%, which is high yet quite far from the 93% found by Song et al

Variational embedding of a hidden Markov model to generate human activity sequences - Zhao and Zhang (2019)

Comments

potentially interesting way of solving long gaps through an HMM generating activity states based (in this case) on travel diary results.

Direct quotes

ysov et al., 2019; Valarezo et al., 2020; and Butepage et al., 2019) In the present study, a model process was devised to generate human activity sequences and the corresponding qualitative information using passively collected mobile data. The model uses a fully probabilistic process of generating an activity sequence to overcome the drawbacks of the deterministic estimation of human activity. An HMM and a VAE were incorporated into a single modeling framework such that the former generates activity sequences in a fully stochastic manner, and the latter moves the clustering task from an inconsistent observation space into a homogeneous latent space with a smaller dimension. Such an integrated approach is rarely found in the literature regarding inferences for human activity.
As far as we could ascertain, the present study is the first attempt to jointly formulate a VAE and an HMM to infer and synthesize chains of human activity based on a real mobile dataset. Researchers have recently attempted to apply the same integrated approach to automatic speech recognition (Ebbers
We set up a stochastic process to generate an activity sequence with prior distributions of variables, which are based on a Markov process and Gaussian emission with latent embedding. A variational inference (VI) approach was adopted to estimate the posterior distribution of latent variables conditional on
proposed a probabilistic process to generate an activity chain. The number of candidate activities must be predefined, even though each cannot be titled from scratch. The generation of an activity chain begins with the selection of the first activity. The initial activity within an activity chain is assumed to be randomly selected via a categorical probabilistic distribution p = (p_1, ..., p_Ns). Eq. (1) shows the probability that the initial activity will become a specific activity, s_1. Once the initial activity within an activity
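The generative chain described here is easy to sketch: a categorical draw for the initial activity s_1 followed by Markov transitions (the transition matrix A is my assumption of the standard HMM step; all probability values below are invented for illustration):

```python
import random

# Initial activity drawn from categorical distribution p; subsequent
# activities drawn from the transition row of the previous activity.

def generate_activity_chain(p, A, length, seed=42):
    rng = random.Random(seed)
    states = list(range(len(p)))
    chain = rng.choices(states, weights=p)  # initial activity s_1
    while len(chain) < length:
        chain.append(rng.choices(states, weights=A[chain[-1]])[0])
    return chain

p = [0.7, 0.2, 0.1]                # e.g. home, work, other (illustrative)
A = [[0.8, 0.15, 0.05],
     [0.3, 0.6, 0.1],
     [0.4, 0.3, 0.3]]
chain = generate_activity_chain(p, A, length=8)
print(len(chain))  # 8
```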
The proposed VAE-HMM can be trained in either a supervised or an unsupervised manner. The supervised training was for the inference of activities, and the unsupervised training was for the synthesis of observed features for activity sequences. Basically, feature data labeled with true activity types are necessary even for the unsupervised training so that the model performance can be validated. The present study utilized household travel diaries collected for the purpose of transportation planning in the Seoul metropolitan area.
The survey was conducted every 5 years for households that were randomly sampled by about 2.5% of the population. The trip diary data includes 7 different activity types and relevant features of household and individuals who participated in the survey. The trip diary data were collected only for a single day, and multi-date activity chains could not be analyzed. The activity chains extracted from the household travel diaries were used to train and test the proposed model. Several individual- or
Fig. 5 (a) compares the observed and estimated marginal distributions of features. As shown in the figure, the proposed VAE-HMM reproduced almost exact distributions for binary discrete variables of individual-specific features. For continuous features, the marginal distributions of synthesized features were also very close to those of the observed features except for several features with observed probabilistic density that deviated greatly from the normal distribution. Exceptions could have occurred because decoder output was
The global parameters are all fixed for the brevity of modeling. The model performance could be improved in the future by regarding the global parameters as random. Although random parameters increase the computation burden, the basic structure of the proposed VAE-HMM will not change. For this, we are devising a plausible graphical representation of the Bayesian modeling. Furthermore, the proposed VAE-HMM

Likelihood-based offline map matching of GPS recordings using global trace information - Knapen et al. (2018)

Comments

important paper on map matching; outlines the technique, the state of the art, etc. doesn't consider missing data, and the lowest sampling density is one fix per 30 seconds.

Direct quotes

The technique proposed in this paper is aimed at offline (batch) map matching of GPS traces.
Quddus et al. (2007) provide a comprehensive overview of online map matchers.
Nowadays map matchers are based on the multi-hypothesis technique (MHT) about the position of the vehicle.
Recent online map matchers, starting with Greenfeld (2002), usually incorporate topology constraints
Schüssler and Axhausen (2009) state that map matching of person traces requires high resolution network information
map matching techniques (both online and offline) are classified as (i) pure geometry based methods, (ii) topological methods, (iii) probabilistic methods and (iv) advanced procedures. Pure geometric methods are further classified by Quddus et al. (2007) as point to point matching (finding the nearest node or shape point), point to curve matching (finding the polyline to which the distance is minimal) and curve to curve matching (matching the vehicle trajectory against known roads).
problem of non-connected walks
The route candidates then are assigned a score and in order to avoid huge sets of route candidates, only the N route candidates having the best scores are kept (N = 30). Each GPS point can match at most one link in each route candidate and each link in the route needs at least one GPS fix
The paper concludes that the value reported in Marchal et al. (2005) is a valid one; the average score per GPS point does not significantly decrease with the candidate set size for N > 30. It also reports that the processing time per point is between 10 [ms] (for N = 20) and 75 [ms] (for N = 100).
Zhou and Golledge (2006) use a similar procedure implemented in ArcGIS. GPS recordings are processed sequentially and a pool of candidate solutions is kept. In a preprocessing stage, they first replace clusters of GPS points by their centroid (cluster reduction) but also add interpolated GPS points when the distance between two consecutive points is larger than half of the minimum length for the links in the buffer defined by the two GPS points. Then a 2-norm (distance) and a rotation measure are used to determine the weight for each point in the preprocessed dataset. A set of candidate partial paths is kept and extended so that a connected walk results from the method.
Feng and Timmermans (2013) use a Bayesian Belief Network (BBN) to replace the ad hoc rules used in map matchers not making use of the multi-hypothesis technique (MHT), to select the next road segment candidate in a route.
Chen et al. (2011) propose a probabilistic method to simultaneously detect the road segment sequence and the transportation modes used. The likelihood that a given multi-modal path in a network generates the observed sequence of smartphone data is estimated. The measurement equations establish the probability that a given path generates a given time series of measurements.
The methods mentioned above process the GPS points in chronological order
Brakatsoulas et al. (2005) propose three algorithms: (i) a greedy algorithm processing one point at a time using a distance and an angular criterion to select the next edge, (ii) a recursive local look-ahead method (inspecting up to 4 network links and GPS points ahead) and (iii) a global method that minimizes the Fréchet distance between curves
free space concept then is extended to free space surface in order to compare a curve C to a graph (each edge combined with C generates a free space and those are combined into a free space surface). The sequence of GPS coordinates constitutes a piecewise linear curve. For a given ε, the free space surface for such curve and each path in the graph is computed.
Wei et al. (2012) present a clear overview of the information and weight functions used in published maximal weight methods. A global method based on a hidden Markov model (HMM) and the Viterbi probability maximization is proposed.
shortest paths are inserted to connect each link matched by the first fix to each link matched by the second fix.
Deka and Quddus (2015) present a global method for trip based offline map matching that minimizes the additional data requirements like route choice, mobility patterns, etc.
In the method presented in this paper (i) a set of multiple initial matches is allowed, (ii) outliers are processed according to the GPS device accuracy, (iii) the concept of likelihood (as opposed to weight) is used, (iv) direction is not considered and (v) a graph is built that allows for enumeration of high likelihood candidate paths for posterior application of rules based on additional assumptions
If subsequent research aims to verify a particular hypothesis H to hold for W, then the map matcher shall not use any assumptions that affect the hypothesis H verification. Example: when the final result serves to assess speeding behavior the link selection stage shall not include speeding related behavior assumptions (e.g. respecting maximum speed on road segments)
Results of map matching are used in research projects that focus on the analysis of revealed travel behavior. In particular, researchers aim to extract properties of routes revealed by GPS traces in order to support route choice set generation (i.e. the same purpose as the one mentioned in Bierlaire et al. (2013)). This leads to the requirement to efficiently derive the route in the road network that has the highest probability to have generated the time series of GPS recordings. Such route in general corresponds to a graph theoretical walk (repeated visits of edges and vertices allowed) and not necessarily to a path (no repeated visits allowed)
On one hand, gap filling by means of shortest route segments in the case of missing recordings reduces the route complexity; hence it introduces bias and needs to be avoided.
This paper proposes a map matching method based on likelihood maximization in which the requirement to have at least one GPS record for each link is relaxed.
The road network is modeled by a directed graph G^T(V^T, E^T). The superscript T is used to identify the transportation graph. Each vertex v ∈ V^T corresponds to a node in the network. Each edge e ∈ E^T is associated to an ordered pair of nodes and identifies the set of lanes of a particular road segment usable to move from the source v_s to the target v_t. A bidirectional road segment is represented by two edges
The trace is subdivided into contiguous subtraces.
Subgraphs for chronologically consecutive subtraces are constructed so that they are not disjoint
Maximum likelihood (MLH) partial routes are determined for each entry-exit combination
A new acyclic digraph is built in which each edge represents a MLH partial route.
Each such part corresponds to a period in time (denoted by p) which is defined by the first and last recordings in the part.
construct a graph GU in which each vertex represents the assumption that a specific link l is used in a specific period p
In summary, the trace is partitioned. Each part corresponds to a time period. Each GPS record in a part matches the same set of links and these links constitute the subnetwork that is crossed during the time period. A computationally feasible method is proposed to find the maximum likelihood walks linking the entries to the exits in the subnetwork. The subnetworks then are assembled by connecting exits to entries. This results in a layering of subnetworks for which the maximum likelihood crossing walks are known. A simple recursive algorithm then is used to find the maximum likelihood walk for the observed trace.
The proposed method may fail if the GPS trace contains too many consecutive erroneous recordings.
Given the accuracy threshold a and the associated probability p, we derive that the probability to find Ne consecutive erroneous recordings is given by (1 − p)^Ne
The distance d between the true position and the measured one then is given by d = √(e_lon² + e_lat²) and has a Rayleigh distribution: d ~ Rayleigh(σ). The probability that the error is larger than d is determined as follows.
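The two quotes above combine into a small computation (my own sketch; σ and a values are invented for illustration): for a Rayleigh-distributed error, P(d > a) = exp(−a²/2σ²), and a run of Ne consecutive erroneous fixes has probability (1 − p)^Ne where p = P(d ≤ a):

```python
import math

# With independent Gaussian lon/lat errors of std sigma, the radial
# error d is Rayleigh(sigma), so P(d > a) = exp(-a^2 / (2 sigma^2)).

def prob_error_exceeds(a, sigma):
    return math.exp(-a * a / (2 * sigma * sigma))

def prob_consecutive_errors(a, sigma, Ne):
    """Probability of Ne consecutive fixes all exceeding threshold a."""
    return prob_error_exceeds(a, sigma) ** Ne

sigma = 5.0   # assumed GPS error std in metres
a = 15.0      # assumed accuracy threshold in metres
print(round(prob_error_exceeds(a, sigma), 5))   # 0.01111
print(prob_consecutive_errors(a, sigma, 3))     # vanishingly small
```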
We deliberately refrain from making the likelihood dependent on the speed because part of the traces are produced by vehicles in congested traffic.
because at least one correct link match is required to start the algorithm.
Multiple chronologically consecutive recordings may generate the same MLS (illustrated in Fig. 1). This allows to partition the sequence of recordings into contiguous subsequences such that two chronologically consecutive recordings belong to the same part if and only if they share the MLS
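The MLS-based partitioning is essentially a group-by over consecutive recordings; a toy sketch (my own, with invented timestamps and link-id sets standing in for matched-link sets):

```python
from itertools import groupby

# Consecutive recordings sharing the same matched-link set (MLS)
# fall into the same part; groupby on a chronologically ordered list
# yields exactly the contiguous subsequences described above.

recordings = [
    ("t1", frozenset({"a"})),
    ("t2", frozenset({"a"})),
    ("t3", frozenset({"a", "b"})),
    ("t4", frozenset({"a", "b"})),
    ("t5", frozenset({"c"})),
]
parts = [list(g) for _, g in groupby(recordings, key=lambda r: r[1])]
print([len(p) for p in parts])  # [2, 2, 1]
```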
Haklay et al. (2010) investigated the positional accuracy for OpenStreetMap roads in the Greater London area. 109 different roads having a total length of 328 [km] were compared to their counterpart in ITN (Integrated Transport Network) maps for which it can be assumed that the error is below 1 [m]. It is concluded that if 15 contributors are active in an area, the positional error for the road is well below 6 [m]. In complete areas the average error is 9.57 [m] with a standard deviation of 6.51 [m]. In incomplete areas the average error is 11.72 [m] and the standard deviation is 7.73 [m]. Completeness is defined as ‘a measure of the lack of data’ and examined for specific areas by Haklay (2010) using visual inspection of maps and by comparing (by means of GIS) the total road length found in OSM and in reference maps respectively.
A_n^+: accuracy based on the number of extra links; A_n^−: accuracy based on the number of missed links; A_d^+: accuracy based on the total length (distance) of the extra links; A_d^−: accuracy based on the total length (distance) of the missed links
The maximum is taken in order to avoid negative accuracy values in cases of extremely short trips where (|L̃⁺| + |L̃⁻|)/|L̃| > 1
The accuracy value based on distance (Ad) is slightly lower for the 2 [s] recording period than for the 5 [s] recording period. The effect is less prominent for the accuracy based on the number of correctly matched links (An). It may have been caused by incorrect selection of short links in the 2 [s] case because, on average, more GPS recordings are available for each link.
Processing time however heavily grows with the recording period and becomes too high. The matching radius RM is proportional to the recording period (time-space prism) and hence the SNTS grows, which causes more options to be evaluated
The simple heuristic poses a problem for short trips consisting of a single link.
The evaluation of these walks in GT using particular (travel behavior related) scoring functions requires a simple extension of the tool. This can be compared to the evaluation of the Fréchet distance for each path in Brakatsoulas et al. (2005).
The U-turn problem mentioned in Schüssler and Axhausen (2009) did not require any particular technique to be implemented in the proposed method
Future research – extensions:
1. Investigation of the number of near MLH sequences and of their properties is required. This includes estimating the effects on performance and accuracy of changes of the assumed global upper bound for the speed.
2. Investigation of the distribution for the results of behavior related scoring functions to the set of near MLH walks.
3. Determination and evaluation of a heuristic likelihood estimate that takes timing into account at a finer level. The current technique derives the link sequence order from topology and chronology but the GPS times are not used to determine consecutive positions on a single link. Extending the method in this way may solve the problem with single link trips mentioned in Section 5.1 item 2 but it may also improve the estimate of the likelihood and hence the accuracy of the method.
4. Computational performance for cases where the recording period is large needs to be increased.
Apart from the coordinates and the order induced by the timestamps, no other information is required.

An Interpolation Method for Trajectory Measurement Missing Data Based on Bidirectional Unequal Interval Kernel Adaptive Filtering - Xu et al. (2021)

Comments

uses a kernel method to interpolate short gaps in trajectories, working from both sides toward the middle of the gap, and runs multiple passes so that newly predicted points can influence the model. doesn't work well where transitions are not smooth. uses down-sampling of the points

Direct quotes

when the number of missing data points is large, the extrapolation accuracy of the ARMA model is limited, and the compensation accuracy of the spline function is affected by the number of nodes and the boundary points, and the interpolation precision is also not high. Therefore, this paper proposes a trajectory missing data interpolation method based on bidirectional unequal interval kernel adaptive filtering.
The kernel adaptive filter is essentially an adaptive filtering method that maps training samples to high-dimensional feature spaces and uses kernel methods in the mapping process
The basic idea of the bidirectional prediction method is that the pair of the training samples is formed by the measured data before and after the missing data segment for the kernel adaptive filter. Point-by-point prediction of missing data from forward and reverse is performed using two kernel adaptive filters. In this way, no matter from the forward prediction or the backward prediction, the time span of the missing data is half of the entire missing data segment, so the accuracy of the predicted data point can be greatly improved.
If the sample points with redundant information are down-sampled, the time interval of one-step prediction can be lengthened. Let the down-sampling interval be k; then the output of the prediction data point is t_(i+k). Through the down-sampling operation, the predicted time can be lengthened to compensate for the problem of reduced accuracy of the filtered prediction output data.
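The bidirectional idea described above can be sketched with a trivial stand-in predictor (a linear extrapolator replaces the kernel adaptive filter, which is not reimplemented here; the numbers are invented): the forward predictor fills the first half of the gap from preceding samples, the backward predictor fills the second half from following samples, so each side spans only half the gap.

```python
# Fill a gap from both ends toward the middle, using a one-step
# linear extrapolator as a stand-in for the kernel adaptive filter.

def linear_extrapolate(a, b, steps):
    """Predict `steps` points continuing the trend from a to b."""
    slope = b - a
    return [b + slope * (i + 1) for i in range(steps)]

before = [0.0, 1.0]   # last two samples before the gap
after = [7.0, 8.0]    # first two samples after the gap
gap_len = 5
half = gap_len // 2
fwd = linear_extrapolate(before[-2], before[-1], half)              # forward pass
bwd = linear_extrapolate(after[1], after[0], gap_len - half)[::-1]  # backward pass
filled = fwd + bwd
print(filled)  # [2.0, 3.0, 4.0, 5.0, 6.0]
```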
if the vehicle target has a significant acceleration or deceleration change in the missing data segment, causing the externally lost segment data to appear discontinuous, the performance of the filtered prediction interpolation method needs further verification.

A new fuzzy logic-based similarity measure applied to large gap imputation for uncorrelated multivariate time series - Tong et al. (2017)

Comments

compares performance of multiple imputation methods on time series data with long gaps. not explicitly gps, but useful because the method is explicitly for low-correlation situations. maybe good for long gaps when individuals don't have much data of their own

Direct quotes

imperfect time series can be modelled using fuzzy sets.
The objective of this study is to fill large missing values in uncorrelated multivariate time series
This method makes it possible to build a new hybrid similarity measure for finding similar values between subsequences of time series.
The proposed imputation method is based on the retrieval and the similarity comparison of available subsequences.
A gap is large when the duration is longer than the known change process.
Further, an important point of our approach is that each incomplete signal is processed as two separated time series, one time series before the considered gap and one time series after this gap
In the proposed approach, the dynamics and the shape of data before and after a gap are a key-point of our method.
Amelia II (Amelia II R-package) [57]: The algorithm uses the familiar expectation-maximization algorithm on multiple bootstrapped samples of the original incomplete data to draw values of the complete data parameters. The algorithm then draws imputed values from each set of bootstrapped parameters, replacing the missing values with the drawn values.
DTWUMI [62]: For each gap, this approach finds the most similar window to the subsequence after (resp. before) the gap based on the combination of shape-features extraction and Dynamic Time Warping algorithms. Then, the previous (resp. following) window of the most similar one in the incomplete signal is used to complete the gap.
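A minimal sketch of the DTWUMI idea quoted above, using plain DTW without the paper's shape-feature extraction; the window lengths and the exhaustive search strategy are simplifying assumptions of mine.

```python
import numpy as np

def dtw_dist(a, b):
    """Classic O(len(a)*len(b)) dynamic time warping distance."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def fill_gap_dtw(series, gap_start, gap_len, query_len=10):
    """Find the historical window most similar (by DTW) to the data just
    after the gap, then use the window preceding that match as the fill."""
    x = np.asarray(series, dtype=float)
    query = x[gap_start + gap_len : gap_start + gap_len + query_len]
    best, best_d = None, np.inf
    # scan candidate windows that lie fully before the gap
    for s in range(0, gap_start - gap_len - query_len + 1):
        cand = x[s + gap_len : s + gap_len + query_len]
        d = dtw_dist(cand, query)
        if d < best_d:
            best_d, best = d, s
    return x[best : best + gap_len].copy()
```

On periodic data, the window preceding the best DTW match is a close proxy for the gap itself, which is what makes this family of methods attractive for long gaps with little cross-variable correlation.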
The results clearly show that when a gap size is greater than 2%, the proposed method yields the highest similarity, R2, FA2, and the lowest RMSE, FB.
In contrast to the two datasets above, on the MAREL-Carnot data, na.approx indicates quite good results: the permanent second or third rank for the accuracy indices (the 1st order at 5% missing rate on R2 score), the lowest FSD (from 3% to 5% missing rates), and FB at some other levels of missing data. But when looking at the shape of imputation values generated from this method, it absolutely gives the worst results (Figure 6).
Other approaches (including FcM-based imputation, MI, MICE, Amelia, and missForest) exploit the relations between attributes to estimate missing values. However, the three considered datasets have low correlations between variables (roundly 0.2 for MAREL-Carnot data, ≈ 0.1 for simulated and synthetic datasets). So these methods do not demonstrate their performance for completing missing values in low/uncorrelated multivariate time series.

A Framework for Bus Trajectory Extraction and Missing Data Recovery for Data Sampled from the Internet - Tong et al. (2017)

Comments

Paper discusses using fuzzy C-means clustering for evaluating the likelihood that various recorded trajectories were generated by the same underlying movement behavior. I imagine this fits into paper 3 when we evaluate filling gaps with a person's own data, but also in establishing how predictable a person's own behavior is as a metric for variance estimation. Can shed light (maybe) on between vs within person variation.
this paper is good for gathering some matrix notation for defining trips over users over days

Direct quotes

In our work, noise removal is based on the anomaly detection method
To date, linear interpolation is the easiest, most conservative and most common interpolation method, which assumes that movement follows a straight-line path between two known points [23]. However, straight lines are not consistent with many types of path dynamics, especially for bus trajectories that include consecutive congested and non-congested roads
Unfortunately, the mean absolute error (MAE) of those cubic interpolation methods exhibits a worse performance than the straight linear interpolation in most cases for the sampled station-level sparse arrival data.
G(l) = [ 584.2000   3   1
         583.0167   7   2
         584.8333  12   3 ],  for l = 158
fuzzy C-means clustering (FCM) method is applied to cluster the data in G(m,l), and the trajectory fragment set is generated.
The first problem for FCM is how to measure the distance of trajectory data in order to get a large distance between different trajectories and a small distance in the same trajectory.
trajectory clustering by FCM, we construct an enhanced distance metric by considering the traveling time along the route direction
The X constructed by Equation (5) is called the forward clustering feature, and the one constructed by Equation (8) is called the backward clustering feature in this paper. Accordingly, we call the fuzzy C-means clustering with the forward clustering feature as forward FCM (F-FCM) and the one with backward clustering feature as backward FCM (B-FCM)
fuzzy membership function (MF) to describe the relationship of arrival data in a trajectory, i.e., the introduced MF in our AD procedure describes the connecting possibility between two arrival events G(m,l)_i and G(m,l)_k
In this paper, we extend the triangular MF to a pentagonal MF.
data range should be extended from historical data to make a better coverage for the future. The extensions are not symmetric for the lower bound and upper bound. The shortest traveling time is restricted by the highest vehicle velocity and distance between two stations, which can be expected once the shortest distance between two adjacent stations is given.
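The quotes above do not fully specify the pentagonal MF, so the sketch below is one plausible reading: a piecewise-linear function with five breakpoints (a, b, c, d, e) and an assumed shoulder level h. The asymmetric lower/upper extensions mentioned in the quote would be encoded in the choice of a and e.

```python
def pentagonal_mf(x, a, b, c, d, e, h=0.5):
    """One plausible pentagonal membership function (an assumption, not
    the paper's exact definition): zero outside [a, e], rising linearly
    to shoulder level h at b, peaking at 1 at c, then mirroring down
    through h at d back to zero at e."""
    if x <= a or x >= e:
        return 0.0
    if x <= b:
        return h * (x - a) / (b - a)
    if x <= c:
        return h + (1 - h) * (x - b) / (c - b)
    if x <= d:
        return 1 - (1 - h) * (x - c) / (d - c)
    return h * (e - x) / (e - d)
```

Choosing a closer to c than e is to c would give the asymmetric coverage the quote describes for the lower travel-time bound.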
arrival data missing randomly occurring along the stations.
contextual linear interpolation for missing data recovery (CLIMDR) for the cases of missing data inside the trajectory
The CLIMDR method is implemented based on the assumption that tA1,A2 and tA1,A3 are approximately linear
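A hedged sketch of that contextual idea: rather than a straight line, allocate the observed time difference across missing stations in proportion to a historical travel-time profile. The function and variable names are mine, not the paper's notation.

```python
def climdr_fill(times, profile):
    """times: observed arrival times with None at missing stations;
    profile: historical mean cumulative travel times, same length.
    Each gap is filled by splitting the observed time difference in
    proportion to the historical profile (a sketch of the contextual
    linear interpolation idea; assumes first/last entries observed)."""
    out = list(times)
    i = 0
    while i < len(out):
        if out[i] is None:
            lo = i - 1                      # last observed index
            hi = i
            while out[hi] is None:          # next observed index
                hi += 1
            for j in range(lo + 1, hi):
                frac = (profile[j] - profile[lo]) / (profile[hi] - profile[lo])
                out[j] = out[lo] + frac * (out[hi] - out[lo])
            i = hi
        i += 1
    return out
```

When the historical profile is itself linear this reduces to plain linear interpolation; the gain comes when the profile reflects congested segments with longer historical travel times.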
F-FCM method is suitable for the trajectories that have the right traveling direction, whereas the B-FCM method is suitable for those that have the wrong traveling direction.
Thus, missing data recovery is important for the trajectory mining of urban buses. The proposed CLIMDR method is easily implemented and outperforms when the dataset is large enough (thousands of trajectories or more)
As compared in the figure, the performance of our method is better than those of the other two methods for almost all stations, especially for the stations with congested roads and a large MAE.
The CLIMDR-n method is suitable for the cases where n consecutive data are missing in the trajectories, and it outperforms comparable methods in most cases.
One merit of the trajectory matrix is that the traveling time set between any two stations can be easily derived by subtracting the corresponding two columns from the matrix.
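That merit is easy to see with a tiny hypothetical arrival-time matrix (rows = bus runs, columns = stations, entries = arrival times in minutes; the values are made up for illustration):

```python
import numpy as np

# Hypothetical trajectory matrix: each row is one bus run,
# each column one station, entries are arrival times (minutes).
T = np.array([[600.0, 607.5, 615.2],
              [630.0, 639.1, 646.0],
              [660.0, 668.2, 677.4]])

# Travel-time set between station 0 and station 2 for every run:
# just subtract the corresponding columns.
travel_02 = T[:, 2] - T[:, 0]
```

The same column subtraction works for any station pair, which is what makes the matrix layout convenient for trajectory mining.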

Effects of Data Preprocessing Methods on Addressing Location Uncertainty in Mobile Signaling Data - Xu et al. (2021)

Comments

paper gives a method for what is effectively stay point detection using CDR but importantly points out that the mobility measures are dependent on the sort of preprocessing done and the parameters selected for e.g. distance and time

Direct quotes

choice of preprocessing methods could lead to changes in the data characteristics
The number of records of a trajectory vary notably from each other, with mean and median values of 59.3 and 41.0, respectively (Figure 1A). The interevent time, measured as the duration between two consecutive records in a trajectory, ranges from a few seconds to several hours
The two-stage clustering and time window–based methods are frequently used in existing studies to tackle location uncertainty issues in mobile phone data.
the two-stage clustering method is primarily used to generalize users' documented locations to derive their representative locations (e.g., stay locations). The time window–based method focuses explicitly on detecting and removing oscillations in the data.
in the first stage, we compare each observation with the subsequent one and merge them into a segment if they fall within a roaming distance of Δd1. We use the medoid of these observations (defined as the most visited cell in that segment) as its representative location. The representative location is then compared with the next observation, which will be merged into the same segment if they (representative location of the segment and the observation) fall within Δd1. The representative location of the segment will be updated as more observations are added. The clustering process will terminate until all segments are identified. This results in a sequence {(l'1, t'1, dur1), (l'2, t'2, dur2), ..., (l'n, t'n, durn)} where l'i, t'i, and duri denote the medoid, starting time, and the stay duration of the ith segment, respectively
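A sketch of that first-stage sequential clustering, assuming observations arrive as (cell id, timestamp) pairs with a lookup from cell id to coordinates; the stay-duration convention (a segment ends when the next one starts) is my assumption, not necessarily the paper's.

```python
from collections import Counter

def segment_stays(points, cells, times, d_thresh, dist):
    """Sequentially merge consecutive observations into a segment while
    each new observation stays within `d_thresh` of the segment's
    representative location, taken as the most-visited cell so far.
    Returns (medoid cell, start time, duration) per segment."""
    def medoid(cs):
        return Counter(cs).most_common(1)[0][0]

    segments = []
    cur_cells, start = [cells[0]], times[0]
    for c, t in zip(cells[1:], times[1:]):
        rep = medoid(cur_cells)
        if dist(points[rep], points[c]) <= d_thresh:
            cur_cells.append(c)               # still the same stay
        else:
            segments.append((rep, start, t - start))
            cur_cells, start = [c], t         # open a new segment
    segments.append((medoid(cur_cells), start, times[-1] - start))
    return segments
```

Using the most-visited cell as the representative location makes the segment robust to single-record ping-pong between adjacent towers, which motivates the medoid choice in the quote.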
By imposing a moving window with a fixed length (e.g., 5 minutes), the method aims to detect circular events—subsequences in a trajectory that start and end at the same cell—and identify those within the time window as oscillations
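The circular-event detection can be sketched as a scan for subsequences that start and end at the same cell within the window; the 5-minute threshold and the index conventions are illustrative assumptions.

```python
def find_oscillations(cells, times, window=300):
    """Flag circular events: index pairs (i, j) such that the trajectory
    returns to cells[i] at position j, visits at least one other cell in
    between, and does so within `window` seconds."""
    flagged = []
    for i in range(len(cells)):
        for j in range(i + 2, len(cells)):
            if times[j] - times[i] > window:
                break  # later j only get further away in time
            if cells[j] == cells[i] and any(c != cells[i] for c in cells[i + 1:j]):
                flagged.append((i, j))
    return flagged
```

What is then done with the flagged spans (dropping them, collapsing them to the anchor cell) is exactly the preprocessing choice the paper shows to matter for the mobility indicators.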
The result suggests that the way the oscillations are tackled in the mobile signaling data could have a notable impact on the estimation of OD trips.
this mobility indicator than performing the time window–based method, given that the latter is more or less a downsampling process of mobile phone trajectories. The result indicates that by ignoring structural properties of oscillations in a trajectory, the estimation of total stay time can be off by several hours or even longer
Our results demonstrate that the choice of data preprocessing methods could lead to changes in the data characteristics. Such changes, which are nontrivial, will further affect the characterization of human mobility patterns.
that cell tower oscillations tend to be more pronounced in densely populated areas. Ignoring the uncertainty issues in the mobile phone data will have a larger impact on these areas, where decision making on urban design and management is frequently needed

A distributionally robust optimization approach to reconstructing missing locations and paths using high-frequency trajectory data - Zhao and Zhang (2019)

Comments

this is about using map matching from users own historic trajectories in order to fill long gaps. probably going to be one of my primary references in doing that. kinda hard to follow though

Direct quotes

positioning errors could be larger than 2.5 m even with high-precision GPS devices (e.g., u-blox GPS modules). The missing observations, caused by the discontinuity and errors of the data, incur uncertainties and mean that the high-frequency trajectories cannot be directly used to observe individual location-duration-path choices. Methods to solve these uncertainties are data-driven models including machine learning and Distributionally Robust Optimization (DRO) models (e.g., DRO with moment bounds and DRO with likelihood bounds).
In this paper, we apply the DRO models with likelihood bounds in a data-driven optimization modeling framework to reconstruct the missing location-duration-path choices using high-frequency trajectory data over many days. The raw trajectory data will be first processed by a map-matching model to observe historical choices of location-duration-path. Furthermore, we can identify the missing observations in space and time dimensions from the map-matched results. For the missing observations, we then apply data-driven network-time prisms to reduce the search spaces for the potential location-duration-path choices. In the reduced search spaces, we formulate the DRO models with likelihood bounds to reconstruct the missing choices.
prisms (Fig. 1(b)) to reduce the search spaces for the estimation of missing location-duration-path choices. For each individual, we can identify a network-time prism as a reduced space of potential location-duration-path choices for a missing period. Then, two DRO models with likelihood bounds are applied to reconstruct the location-duration-path choices for the missing observations.
the many-day trajectory data, we can solve a map-matching problem to detect the locations/stays and match the paths from the observed trajectories for individuals. Based on the map-matched results, we identify a set of all possible location-duration-path choices for the missing observations for each individual.
White et al., 2000; Chen et al., 2003; Ochieng et al., 2003; Quddus et al., 2003; Yin and Wolfson, 2004; Meng, 2006; Tang et al., 2016). We use the map-matching model from Tang et al. (2016) to match the observed trajectories to the ground-truth network.
The data-driven network-time prism is an envelope of location-duration choices and path choices for the missing observations. The classic network-time prism estimates all reachable locations and paths considering constraints of geometry, connectivity, and speed limit (Miller, 1991; Tang et al., 2016). Different from the classic network-time prism, a data-driven network-time prism first defines a search space using the observed link travel time from many-vehicle trajectories over many days.
the time budget (i.e. the time period of missing observations) for the missing observations from location s to location r.
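The prism's role as a search-space reducer boils down to a feasibility test: a candidate missing location survives only if the observed travel times to and from it fit inside the time budget. A minimal sketch with hypothetical names and a dict of observed link travel times:

```python
def prism_locations(travel_time, s, r, budget, candidates):
    """Data-driven network-time prism as a filter: keep a candidate
    missing location n only if the observed travel time s -> n plus
    n -> r fits within the time budget of the missing period."""
    return [n for n in candidates
            if travel_time[(s, n)] + travel_time[(n, r)] <= budget]
```

Only the surviving candidates (and the paths between them) need to enter the DRO models, which is how the prism keeps the reconstruction problem tractable.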
Based on the calculation of path travel time from the driving records, we have two possible path choices for the inbound trip (i.e., trip index 0 from the observed location r in pair (r, s) to the missing location) and four possible path choices for the outbound trip (i.e., trip index 1 from the missing location to the observed location s in pair (r, s)) in Table 3. According to the optimal solution of the DRO model, we derive the probabilities for all the possible path choices
A higher confidence level increases the probability value of the missing choice. In the future, we can apply the proposed modeling framework to a joint location-duration-path model to find the most likely location-duration-path choices simultaneously for the missing observations if we have long-period (e.g. over one year) high-frequency trajectory data.

References

Chen, Guangshuo, Aline Carneiro Viana, Marco Fiore, and Carlos Sarraute. 2019. “Complete Trajectory Reconstruction from Sparse Mobile Phone Data.” EPJ Data Science 8 (1): 30. https://doi.org/10.1140/epjds/s13688-019-0206-8.
Jeong, Seungyun, Yeseul Kang, Jincheol Lee, and Keemin Sohn. 2021. “Variational Embedding of a Hidden Markov Model to Generate Human Activity Sequences.” Transportation Research. Part C, Emerging Technologies 131 (103347): 103347. https://doi.org/10.1016/j.trc.2021.103347.
Knapen, Luk, Tom Bellemans, Davy Janssens, and Geert Wets. 2018. “Likelihood-Based Offline Map Matching of GPS Recordings Using Global Trace Information.” Transportation Research Part C: Emerging Technologies 93: 13–35. https://doi.org/10.1016/j.trc.2018.05.014.
Li, Zhenxing, Shuyuan Zhao, and Dong Li. 2019. “An Interpolation Method for Trajectory Measurement Missing Data Based on Bidirectional Unequal Interval Kernel Adaptive Filtering.” In 2019 12th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI), 1–5. Suzhou, China: IEEE. https://doi.org/10.1109/CISP-BMEI48845.2019.8965696.
Phan, Thi-Thu-Hong, André Bigand, and Émilie Poisson Caillault. 2018. “A New Fuzzy Logic-Based Similarity Measure Applied to Large Gap Imputation for Uncorrelated Multivariate Time Series.” Applied Computational Intelligence and Soft Computing 2018: 1–15. https://doi.org/10.1155/2018/9095683.
Tong, Changfei, Huiling Chen, Qi Xuan, and Xuhua Yang. 2017. “A Framework for Bus Trajectory Extraction and Missing Data Recovery for Data Sampled from the Internet.” Sensors 17 (2). https://doi.org/10.3390/s17020342.
Xu, Yang, Xinyu Li, Shih-Lung Shaw, Feng Lu, Ling Yin, and Bi Yu Chen. 2021. “Effects of Data Preprocessing Methods on Addressing Location Uncertainty in Mobile Signaling Data.” Annals of the Association of American Geographers. Association of American Geographers 111 (2): 515–39. https://doi.org/10.1080/24694452.2020.1773232.
Zhao, Shuaidong, and Kuilin Zhang. 2019. “A Distributionally Robust Optimization Approach to Reconstructing Missing Locations and Paths Using High-Frequency Trajectory Data.” Transportation Research Part C: Emerging Technologies 102: 316–35. https://doi.org/10.1016/j.trc.2019.03.012.