
Electrical Engineering and Computer Science Department

Technical Report
NWU-EECS-08-01
January 9, 2006

Efficient Similarity Join of Large Sets of Spatio-temporal Trajectories


Hui Ding, Goce Trajcevski, Peter Scheuermann

Abstract
We address the problem of performing efficient similarity join for large sets of moving
object trajectories. Unlike previous approaches, which use a dedicated index in a
transformed space, our premise is that in many applications of location-based services,
the trajectories are already indexed in their native space, in order to facilitate the
processing of common spatio-temporal queries, e.g., range, nearest neighbor, etc. We
introduce a novel distance measure adapted from the classic Fréchet distance, which can
be naturally extended to support lower/upper bounding using the underlying indices of
moving object databases in the native space. This, in turn, enables efficient
implementation of various trajectory similarity joins. We report on extensive experiments
demonstrating that our methodology provides performance speed-up of trajectory
similarity join by more than 50% on average, while maintaining effectiveness comparable
to the well-known approaches for identifying trajectory similarity based on time-series
analysis.

Northrop Grumman Corp., contract: P.O.8200082518


NSF grant IIS-0325144/003

Keywords: Spatio-temporal trajectory, similarity join


Robust and Fast Similarity Join of Large Sets of Moving Object Trajectories

Hui Ding, Goce Trajcevski and Peter Scheuermann

Dept. of EECS, Northwestern University

2145 Sheridan Road

Evanston, IL 60208, U.S.A.

Abstract

We address the problem of performing efficient similarity join for large sets of moving object trajectories. Unlike previous approaches

which use a dedicated index in a transformed space, our premise is that in many applications of location-based services, the trajectories

are already indexed in their native space, in order to facilitate the processing of common spatio-temporal queries, e.g., range, nearest

neighbor, etc. We introduce a novel distance measure adapted from the classic Fréchet distance, which can be naturally extended to

support lower/upper bounding using the underlying indices of moving object databases in the native space. This, in turn, enables efficient

implementation of various trajectory similarity joins. We report on extensive experiments demonstrating that our methodology provides

performance speed-up of trajectory similarity join by more than 50% on average, while maintaining effectiveness comparable to the

well-known approaches for identifying trajectory similarity based on time-series analysis.

1 Introduction

The advances in Global Positioning Systems, wireless communication systems and miniaturization of computing devices

have brought an emergence of various applications in Location-Based Services (LBS). As a result, there is an increasing

need for efficient management of vast amounts of location-in-time information for moving objects. An important operation

on spatio-temporal trajectories, which is fundamental to many data mining applications, is the similarity join [17, 6]. Given

a user defined similarity measure, a similarity join identifies all pairs of objects that are similar based on a join predicate.

Efficient similarity joins are especially desirable for spatio-temporal trajectories, because the distance calculation between
trajectories is generally very expensive due to the intrinsic characteristics of the data.

Previous research efforts on efficient similarity search in time series data sets mainly follow the GEMINI framework [14,

21, 27, 10]: given a similarity measure on the time series, each trajectory is transformed into a point in a high-dimensional

metric space and an index is constructed in the transformed space using the defined measure (or the lower-bounding measure

if one is proposed). These transformed space approaches have proved efficient for a large number of different similarity

measures in a variety of time series application domains.

However, when it comes to moving object trajectories which constitute a special category of time series data, we observe

that one can perform the similarity join more efficiently using a different approach. The transformed space approaches

incur extra overheads building dedicated index structures and applying trajectory transformations. On the other hand, we can

exploit the fact that trajectories are often already indexed in their native space, in order to facilitate processing of the common

spatio-temporal queries such as range, nearest neighbor, etc. [24, 8, 22]. The main focus of this work is to provide efficient

and scalable similarity joins of spatio-temporal trajectories.

Our main contributions can be summarized as follows:

• We introduce a novel distance measure based on the Fréchet distance [2], which is highly effective in identifying spatio-

temporal similarity of trajectories.

• We propose lower and upper bounding approximations of the exact distance measure, which are straightforwardly

applicable to the spatio-temporal indices and can prune a significant portion of the search space.

• We present an efficient trajectory similarity join in the native space, which combines the distance calculations with

incremental accesses to the spatio-temporal indices.

• We conduct extensive experimental evaluations to show the efficiency and effectiveness of our proposed techniques.

The rest of this paper is organized as follows. Section 2 provides the necessary background. Section 3 formally defines our

distance metric and the approximation bounds. Section 4 elaborates on our index-based trajectory join framework. Section 6

presents our experimental results. Section 7 reviews related work and concludes the paper.

2 Preliminaries

In this section, we introduce the concept of spatio-temporal trajectories, and discuss the existing similarity measures and

the indexing of trajectories using R-tree.


2.1 Trajectories and Similarity Measures

We assume that objects move in a two-dimensional space, and that a trajectory Tr is a sequence of points

p1 , p2 , ..., pi , ..., pn , where each point pi represents the location of the moving object at time ti , and is of the form (xi , yi ,ti ),

for t1 < t2 < ... < ti < ... < tn . For a given trajectory, its number of points is called the point length (p-length) of the trajectory.

The time interval between t1 and tn is called the duration of the trajectory, denoted by ∆Tr. The portion of the trajectory

between two points pi and p j (inclusive) is called a segment and is denoted as si j . A segment between two consecutive points

is called a line segment.

Several distance measures for trajectories have been proposed in the literature. The Lp-norms [14] are the most common

similarity measures. For example, given two trajectories Tri and Trj of the same p-length, one can define the similarity

measure based on the Euclidean distances between the corresponding points as: L2(Tri, Trj) = √(∑k∈[1,n] dist(pik, pjk)²),

where dist(pik, pjk) = √((pik.x − pjk.x)² + (pik.y − pjk.y)²). While L2 can be calculated in time linear in the length of the

trajectories, it is sensitive to noise and lacks support for local time shifting, i.e., trajectories with similar motion

patterns that are out of phase. The Dynamic Time Warping (DTW) distance [21] overcomes the above problem by allowing

trajectories to be stretched along the temporal dimension, and is recursively defined as: DTW(Tri, Trj) = dist(pi1, pj1)

+ min(DTW(Rest(Tri), Rest(Trj)), DTW(Rest(Tri), Trj), DTW(Tri, Rest(Trj))), where Rest(Tri) = pi2, ..., pin. To reduce

the impact of the quadratic complexity of DTW on large data sets, a lower-bounding function together with a dedicated

indexing structure was used for efficient pruning [21]. Similar to DTW, other distance measures have also been proposed,

e.g., the Edit Distance on Real Sequence (EDR) [10] and the distance based on Longest Common Subsequence (LCSS) [27].

The commonality is that they all follow the transformed space approach, and are not designed to utilize the spatio-temporal

indices available in the native space. Recently, Pelekis et al. [23] identified several different similarity distances for trajectories,

and argued that each of them is more appropriate than the others in different settings.
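As a concrete illustration of the two measures above, the following Python sketch computes the L2-norm and an iterative dynamic-programming form of the DTW recursion. The (x, y, t) point layout follows Section 2.1; the function names are ours, not the paper's.

```python
import math

def dist(p, q):
    """Euclidean distance between two trajectory points (x, y, t), spatial part only."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def l2_distance(tr1, tr2):
    """L2-norm between two trajectories of the same p-length."""
    assert len(tr1) == len(tr2)
    return math.sqrt(sum(dist(p, q) ** 2 for p, q in zip(tr1, tr2)))

def dtw(tr1, tr2):
    """DTW distance: iterative rendering of the recursive definition above."""
    n, m = len(tr1), len(tr2)
    INF = float("inf")
    d = [[INF] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = dist(tr1[i - 1], tr2[j - 1])
            # min over the three recursive branches of the DTW definition
            d[i][j] = cost + min(d[i - 1][j], d[i - 1][j - 1], d[i][j - 1])
    return d[n][m]
```

Note how DTW, unlike L2, lets one trajectory "wait" while the other advances, which is exactly the local time shifting discussed above.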

2.2 R-tree Based Indexing of Trajectories

The R-tree and its variants have been widely used for indexing arbitrary dimensional data [22]. An R-tree is a B+-tree

like access method, where each R-tree node contains an array of (key, pointer) entries where key is a hyper-rectangle that

minimally bounds the data objects in the subtree pointed at by pointer. In a leaf node, the pointer is an object identifier,

while in a non-leaf node it is a pointer to a child node on the next lower level.

When indexing spatio-temporal trajectories with the transformed space approach, each trajectory is first transformed into
a single point in a high-dimensional (metric) space and a high-dimensional indexing structure is used to index these points.

Under this GEMINI framework [14], a high-dimensional R-tree is but one optional index structure.

However, spatio-temporal trajectories can also be indexed in their native space. Several such implementations have been

developed in the moving object database literature [24, 8, 22] for processing various spatio-temporal queries. Directly

indexing the entire trajectories may introduce large dead space and decrease the discriminating power of the index, hence

the general idea is to split a long trajectory into a number of segments, and index the segments [22]. Each leaf node in the

R-tree contains a number of 3-dimensional minimum bounding hyper-rectangles (MBR) that enclose the segments generated

from splitting, together with unique identifiers that match each segment to its corresponding trajectory. The segments of the

trajectories do not have to be of the same length, and a particular leaf node may contain segments from different trajectories.

The problem of optimally splitting the trajectories to support efficient data mining has recently been investigated in [3] and

is beyond the scope of this paper.
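The splitting scheme described above can be sketched as follows. This assumes the uniform, equal-length splitting adopted later in Section 3.1; the tuple layout of the 3-D MBRs is our own choice for illustration.

```python
def split_into_mbrs(trajectory, l):
    """Uniformly split a trajectory (a list of (x, y, t) points) into segments
    of p-length l, and return the 3-D MBR of each segment as
    ((xmin, xmax), (ymin, ymax), (tmin, tmax))."""
    mbrs = []
    for start in range(0, len(trajectory), l):
        seg = trajectory[start:start + l]
        xs = [p[0] for p in seg]
        ys = [p[1] for p in seg]
        ts = [p[2] for p in seg]
        mbrs.append(((min(xs), max(xs)), (min(ys), max(ys)), (min(ts), max(ts))))
    return mbrs
```

Each MBR would be stored in an R-tree leaf together with the identifier of its source trajectory, as described above.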

3 Spatio-Temporal Distance of Trajectories

Existing similarity measures are not directly applicable to spatio-temporal indices in the native space. Hence, in this

section we introduce our new distance measure based on the classical Fréchet distance [2], a popular illustration of which is

given by the "man walking a dog" example. The distance between the motion of the man and the dog is the minimal length of

leash needed. Spatio-temporal trajectories in real settings consist of series of coordinate points at discrete time stamps, and

the location of a moving object between these points is obtained via interpolation when needed. Hence it suffices to define a

discrete version of the Fréchet distance as follows [12]:

Let Tr1 = (p11 , . . . , p1n ) and Tr2 = (p21 , . . . , p2n ) be two trajectories. A coupling C between Tr1 and Tr2 is a sequence

{(p1a1 , p2b1 ), (p1a2 , p2b2 ), . . ., (p1ak , p2bk )} of distinct pairs such that a1 = 1, b1 = 1, ak = n, bk = n and for all ai and bi we have

ai+1 = ai or ai+1 = ai + 1, bi+1 = bi or bi+1 = bi + 1, i.e., the matching is monotonically non-decreasing. The length ‖C‖

of the coupling C is the maximum link over all the pairs in the coupling C, where a link is defined as the Euclidean distance

between the two points in the pair. That is, ‖C‖ = max_{i=1,...,k} dist(p1ai, p2bi). Finally, the discrete Fréchet distance between two

trajectories Tr1 and Tr2 is defined as δdF(Tr1, Tr2) := min{‖C‖ : C is a coupling of Tr1 and Tr2}. An important observation

is twofold: exploring all the possible couplings is exhaustive, and considering all pairs (p1ai, p2bi) without paying attention to

their temporal distances distorts the real spatio-temporal similarity of moving objects. Motivated by this, we use a temporal

matching window to constrain the possible point matching and define the w-constrained discrete Fréchet distance (wDF) as

follows:
DEFINITION 3.1. Given two trajectories Tr1 and Tr2, their w-constrained discrete Fréchet distance δwDF(Tr1, Tr2) :=

min{‖Cw‖ : Cw is a w-constrained coupling of Tr1 and Tr2, s.t. ∀(p1ai, p2bi) ∈ Cw, |p1ai.t − p2bi.t| ≤ w}, where w is a

parameter that determines the limit of the temporal matching window.

The importance of the temporal dimension is emphasized by the matching window. An idea similar to ours (temporal

matching window constraint) has also been used for other similarity measures [28], where a window size of 5% − 20% of

the entire trajectory duration ∆Tr is reported sufficient for most applications in terms of finding similar trajectories. Further

stretching the temporal matching window not only results in longer execution time of the distance function, but may also deteriorate

the accuracy of the distance measure due to over-matching. δwDF has the following properties:

(1) δwDF (Tr1 , Tr1 ) = 0, (2) δwDF (Tr1 , Tr2 ) = δwDF (Tr2 , Tr1 ) and (3) δwDF (Tr1 , Tr2 ) ≤ δwDF (Tr1 , Tr3 )+δwDF (Tr3 , Tr2 ).

Hence, we have:

PROPOSITION 3.1. δwDF defines a pseudo-metric on the set of spatio-temporal trajectories.

Due to space limitations, the proofs of the claims are omitted from this paper and are presented in [11].
The wDF distance can be computed by Algorithm 1 using dynamic programming and has a complexity of O((w/∆Tr)·n²).

However, unlike DTW and EDR, wDF is a pseudo-metric and can utilize the triangular inequality for pruning during similarity

search [1]. More importantly, wDF has led us to the derivation of two approximation distances that provide even greater

pruning power, which we discuss next.
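A minimal sketch of the triangle-inequality pruning that the pseudo-metric property enables: if a query's distance to a pivot trajectory and the pivot's precomputed distance to a candidate already differ by more than a range-query threshold ε, the candidate can be discarded without computing δwDF at all. The helper name and parameters are hypothetical, not from the paper.

```python
def can_prune(d_query_pivot, d_pivot_cand, epsilon):
    """By the triangular inequality, |d(q, p) - d(p, x)| is a lower bound on
    d(q, x); if that bound already exceeds epsilon, the candidate x cannot be
    within epsilon of the query q and the exact distance need not be computed."""
    return abs(d_query_pivot - d_pivot_cand) > epsilon
```

This is the standard pivot-based filtering available to any metric (or pseudo-metric) distance; measures such as DTW and EDR, which violate the triangle inequality, cannot use it safely.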

3.1 Efficiency and Approximation of wDF

For long trajectories, the brute force computation of wDF can be costly. We propose two efficient approximations that can

bound the exact wDF distance and are much faster to compute: one that guarantees a lower-bound and one that guarantees

an upper-bound for the exact wDF distance, respectively. The proposed approximations make use of a coarser representation

of the spatio-temporal trajectories, obtained through splitting a given trajectory into segments and representing them by the

sequence of MBRs that enclose the corresponding segments.

Consider two trajectories Tr1 and Tr2, each approximated by a sequence of MBRs, e.g., M1 = {MBR11, . . . , MBR1t},

M2 = {MBR21, . . . , MBR2s}. The lower-bound coupling CwL between M1 and M2 is defined as a monotonically non-decreasing

matching between the pairs of MBRs from each sequence. In particular, the link of a pair in the lower-bound coupling CwL is

defined as the MinDist between the two composing MBRs, i.e., the minimum spatial distance between any two points from

the respective MBRs (c.f. Figure 1 (a)). The length ‖CwL‖ of the lower-bound coupling CwL is max{MinDist(MBR1ui, MBR2vi)},
Algorithm 1 Computing the δwDF Distance

Input: Trajectory Tr1 = (p11, ..., p1n), Tr2 = (p21, ..., p2n), matching window w
Output: δwDF(Tr1, Tr2)

1: dF(1 : n, 1 : n) ⇐ −1.0 // initialize n by n array dF
2: return ComputeWDF(n, n)
3: ComputeWDF(i, j)
4: if dF(i, j) > −1.0 then
5:   return dF(i, j)
6: else if |p1i.t − p2j.t| > w then
7:   dF(i, j) := ∞
8: else if i == 1 and j == 1 then
9:   dF(i, j) := dist(p11, p21)
10: else if i > 1 and j == 1 then
11:   dF(i, j) := max(ComputeWDF(i − 1, 1), dist(p1i, p21))
12: else if i == 1 and j > 1 then
13:   dF(i, j) := max(ComputeWDF(1, j − 1), dist(p11, p2j))
14: else if i > 1 and j > 1 then
15:   dF(i, j) := max(min(ComputeWDF(i − 1, j), ComputeWDF(i − 1, j − 1), ComputeWDF(i, j − 1)), dist(p1i, p2j))
16: return dF(i, j)
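A bottom-up Python rendering of Algorithm 1 may be helpful (our transcription; the paper presents the memoized recursive form). Points are (x, y, t), and a pair whose timestamps differ by more than w contributes an infinite link cost.

```python
import math

def wdf(tr1, tr2, w):
    """w-constrained discrete Fréchet distance: iterative dynamic programming
    over the warping matrix, filling it row by row instead of recursing."""
    n, m = len(tr1), len(tr2)
    INF = float("inf")

    def link(i, j):
        p, q = tr1[i], tr2[j]
        if abs(p[2] - q[2]) > w:       # temporal matching window violated
            return INF
        return math.hypot(p[0] - q[0], p[1] - q[1])

    dF = [[INF] * m for _ in range(n)]
    dF[0][0] = link(0, 0)
    for i in range(1, n):              # first column: only Tr1 advances
        dF[i][0] = max(dF[i - 1][0], link(i, 0))
    for j in range(1, m):              # first row: only Tr2 advances
        dF[0][j] = max(dF[0][j - 1], link(0, j))
    for i in range(1, n):
        for j in range(1, m):
            dF[i][j] = max(min(dF[i - 1][j], dF[i - 1][j - 1], dF[i][j - 1]),
                           link(i, j))
    return dF[n - 1][m - 1]
```

The max-of-min structure mirrors lines 14-15 of Algorithm 1: each cell keeps the smallest achievable "longest leash" over all monotone couplings reaching it.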

Figure 1: Bounding the exact wDF distance with MBRs. (a) MinDist and MaxDist for MBRs; (b) construction of LBDδwDF using MinDist.


where u1 = v1 = 1, uk = t and vk = s. The w-constraining condition is specified over the time intervals of MBRs. Assuming

that MBR1i and MBR2j enclose segments (p1i1, ..., p1ik) and (p2j1, ..., p2jm) respectively, they will be considered as a possible pair

in a w-constrained coupling only if ∃ p1ia, p2jb s.t. |p1ia.t − p2jb.t| ≤ w. Formally:

DEFINITION 3.2. Given two sequences of MBRs M1 and M2 for trajectories Tr1 and Tr2 respectively, the lower-bound

distance of Tr1 and Tr2 is: LBDδwDF(M1, M2) := min{‖CwL‖ : CwL is a w-constrained lower-bound coupling of M1 and M2}.

Similarly, we define an upper-bound coupling CwU on the two sequences of MBRs M1 and M2 , where the link of a pair is

defined as the MaxDist between the two composing MBRs, provided that the temporal w-constraint holds:

DEFINITION 3.3. Given two sequences of MBRs M1 and M2 for trajectories Tr1 and Tr2 respectively, the upper-bound

distance of Tr1 and Tr2 is: UBDδwDF(M1, M2) := min{‖CwU‖ : CwU is a w-constrained upper-bound coupling of M1

and M2}.

The construction of LBDδwDF between two trajectories from their MBRs is illustrated in Figure 1 (b), and the relationship

of these two distance bounds and the exact distance is given by the following:

THEOREM 3.1. Given two trajectories Tr1 and Tr2, and the corresponding sequences of MBRs, M1 and M2, that approximate

them, for any matching window w the following holds: LBDδwDF(M1, M2) ≤ δwDF(Tr1, Tr2) ≤ UBDδwDF(M1, M2).

We note that Theorem 3.1 applies to arbitrary trajectory splitting strategies, and the problem of optimally splitting the

trajectories is beyond the scope of this paper [3]. For simplicity, we assume in the rest of this paper that the trajectories

are uniformly split into segments of equal length l. From their definitions, LBDδwDF and UBDδwDF can be computed using

the same dynamic programming algorithm for computing wDF, except that the MinDists/MaxDists between MBRs are used

instead of Euclidean distance between points. However, the amount of distance computation involved can be greatly reduced

because of the coarser representation. This can be illustrated by using the warping matrix concept [25] to describe the

relevant coupling between two trajectories. The values in the cells of the warping matrix denote the distances between the

corresponding matching MBRs/points. Figure 2 (b) shows the warping matrix between the MBRs of the two trajectories Tr1

and Tr2 , and Figure 2 (c) shows the warping matrix between the actual points of the two trajectories. Intuitively, calculating

LBDδwDF (UBDδwDF ) or wDF is the process of finding a path [25] from the lower-left corner to the upper-right corner that

minimizes the maximum value over all cells on the path. The amount of computation for LBDδwDF (UBDδwDF ) is significantly

less because of the reduced matrix size. Instead of computing the exact δwDF distance each time, we compute the LBDδwDF

and UBDδwDF with only O((w/∆Tr)·(n²/l²)) time, and exclude unqualified candidates, substantially reducing the number of wDF

distance calculations. Theorem 3.1 ensures that the MBR-based pruning will not introduce any false negatives.
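The MinDist/MaxDist link costs used by LBDδwDF and UBDδwDF can be computed per pair of MBRs as follows. The ((xmin, xmax), (ymin, ymax), (tmin, tmax)) MBR layout is an assumption of this sketch; only the spatial extents enter the distances, matching Figure 1 (a).

```python
def min_dist(mbr1, mbr2):
    """Minimum spatial distance between any two points of two 3-D MBRs."""
    s = 0.0
    for (lo1, hi1), (lo2, hi2) in zip(mbr1[:2], mbr2[:2]):
        gap = max(lo1 - hi2, lo2 - hi1, 0.0)   # 0 when the extents overlap
        s += gap * gap
    return s ** 0.5

def max_dist(mbr1, mbr2):
    """Maximum spatial distance between any two points of the two MBRs."""
    s = 0.0
    for (lo1, hi1), (lo2, hi2) in zip(mbr1[:2], mbr2[:2]):
        gap = max(hi1 - lo2, hi2 - lo1)        # farthest corners per axis
        s += gap * gap
    return s ** 0.5
```

Substituting these for the point-to-point Euclidean distance in the wDF dynamic program yields the bound computations described above.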


Figure 2: Warping matrices for calculating LBDδwDF /UBDδwDF and wDF: X cells are automatically excluded by the temporal

matching window, grey cells are potentially useful, and black cells are on the final path

Moreover, we can do even better with the approximation distances by further limiting the search space, using an idea

similar to early abandoning [30]. Consider the warping matrix MApprox in Figure 2 (b) for calculating LBDδwDF between

Tr1 and Tr2. Initially, it consists only of "X" cells and white cells, and all the white cells are assigned a value of ∞. We

access MinDists between the MBRs in ascending order, and update the values of the corresponding cells, e.g., cell(1, 2)=0.7,

cell(8, 9) = 0.8, cell(3, 4) = 0.9, ...(grey cells). After each update, we invoke the dynamic programming algorithm to compute

the LBDδwDF . At first the algorithm will quickly return ∞. After updating a cell (i, j), if we obtain the first path connecting

the two corners in the matrix, then this is the optimal path (since any path formed later will use a cell updated after cell (i, j)

and will have a larger MinDist value than cell (i, j)). Consequently, the LBDδwDF distance is equal to the MinDist value of

cell (i, j). At this point the distance calculation has been completed and the rest of the cells can be pruned. In the example

of Figure 2 (b), the critical cell after which we could find the path is cell(6, 6) and as a result, LBDδwDF (Tr1 , Tr2 ) equals 3.5.

Note that cells such as (7, 4) are never considered at any time, since their values are greater than 3.5. This important observation

is formalized as follows:

THEOREM 3.2. When calculating LBDδwDF (UBDδwDF ) between two trajectories Tr1 and Tr2 , if the pairwise distances

between the MBRs are incrementally accessed in ascending order of their MinDists (resp. MaxDists), the number of MinDists

(resp. MaxDists) calculated is minimized.


Theorem 3.2 requires that the MinDists/MaxDists are sorted in ascending order, which may incur an extra overhead.

However, the key observation is that such an ordering can be naturally obtained by maintaining a priority queue while

accessing the MBRs in the R-tree [17]. The worst case complexity is still bounded by O(m2 ), where m is the number of

MBRs in each trajectory. However, we observed in our experiments that in practice a significant speed-up can be achieved,

since not all m2 cells of the warping matrix need to be evaluated. In addition, although both LBDδwDF and UBDδwDF can be

calculated using Theorem 3.2, in practice we only invoke dynamic programming once to calculate LBDδwDF . The path for

LBDδwDF can then be used to calculate an upper-bound on UBDδwDF , which in practice approximates the actual UBDδwDF

distance very well.
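The ascending-MinDist access order required by Theorem 3.2 can be emulated with a priority queue. The flat sketch below enumerates all MBR pairs up front, whereas the paper obtains the ordering incrementally from the R-tree traversal [17]; the generator name and parameters are ours.

```python
import heapq

def ascending_mindist_pairs(m1, m2, mindist):
    """Yield (distance, i, j) over all MBR pairs from m1 x m2 in ascending
    order of mindist(m1[i], m2[j]), using a heap as the priority queue."""
    heap = [(mindist(a, b), i, j)
            for i, a in enumerate(m1) for j, b in enumerate(m2)]
    heapq.heapify(heap)
    while heap:
        yield heapq.heappop(heap)
```

Consuming this stream and stopping as soon as the warping matrix first admits a corner-to-corner path realizes the early-abandoning computation of LBDδwDF described above, without sorting pairs that are never needed.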

3.2 Improving wDF Against Noise

While the wDF distance can be sped up by applying LBDδwDF and UBDδwDF to existing spatio-temporal indices, one of

its weaknesses is that it is sensitive to noise/outliers in the trajectories. To see this, consider the definition of the wDF distance

(c.f. Definition 3.1) where the similarity of trajectory Tr1 and Tr2 is essentially determined by the minimum of the maximum

distance between the two sets of points from the respective trajectory [18], i.e., the link with the maximum value for a given

coupling. Hence, when there is noise at a single point, two otherwise close trajectories may be made arbitrarily far apart.

DTW also suffers from the same problem; however, the impact is mitigated, since DTW uses the sum of distances over all

pairs of points between two trajectories, which can in effect smooth out a single noise point/outlier. EDR and LCSS

distance measures are generally recognized as being more robust against noise/outliers, because they compute a similarity score

based on a matching threshold, and can leave noise/outliers unmatched to avoid their negative impact on the similarity

computation.

In this section we propose a novel technique, called time-aware adaptive median filtering (TAMF), to improve the

robustness of wDF. The purpose of TAMF is to effectively identify outliers in the data and exclude them from the distance

computation. Furthermore, TAMF is very lightweight and only poses a slight overhead compared to the O(n2 ) complexity

of computing wDF. Existing adaptive median filtering algorithms usually make certain global assumptions about the data in

order to detect noise candidates [16, 9]. However, if we consider a moving object trajectory as a stochastic process over time,

its statistical characteristics are usually time-varying. Hence, using global assumptions about the data may lead to inaccurate

noise detection and removal. TAMF exploits the statistics about the most recent past of the moving object, in order to improve

the precision when picking noise candidates.

The basic idea of TAMF is as follows: when calculating the wDF, instead of using only the Euclidean distance
between any pair of points in a coupling, we also take into account the Euclidean distances among their neighboring

points. Recall that dist(p1i, p2j) denotes the Euclidean distance between the ith and the jth points of trajectories Tr1 and Tr2,

respectively; let Si,jh be a window centered at (i, j) with half-width h, i.e., Si,jh = {(k, l) : |k − i| ≤ h, |l − j| ≤ h}. We compute

dist(p1k, p2l) for all the pairs of points in Si,jh, and sort these distance values. We use the median of this sorted list to

determine the "distance" between the two points p1i and p2j. As an example, suppose that we have two one-dimensional

trajectories: P = [(t1, 1), (t2, 2), (t3, 100), (t4, 2), (t5, 1)] and Q = [(t1, 1), (t2, 2), (t3, 3), (t4, 2), (t5, 1)]. When calculating the

"distance" between the third points of P and Q with a median filter window size 1, we obtain a list of distance values

(0, 0, 0, 0, 1, 1, 97, 98, 98). Using the median of this list, we determine that the "distance" between the third points of P and

Q is 1 instead of 97, and successfully remove the impact of the outlier. We call the wDF distance defined using TAMF the

median wDF (m-wDF).
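The running example can be reproduced with a small sketch of the median-window "distance" below. One-dimensional points are stored as (t, v) pairs; the function name is ours, and the adaptive window-size logic of TAMF (Algorithm 2) is deliberately omitted here.

```python
import statistics

def median_filtered_dist(P, Q, i, j, h):
    """TAMF-style "distance" between P[i] and Q[j]: the median of the pairwise
    distances over the (2h+1) x (2h+1) window centered at (i, j), clipped at
    the trajectory boundaries."""
    vals = []
    for k in range(max(0, i - h), min(len(P), i + h + 1)):
        for l in range(max(0, j - h), min(len(Q), j + h + 1)):
            vals.append(abs(P[k][1] - Q[l][1]))
    return statistics.median(sorted(vals))

# The one-dimensional example from the text, with an outlier at t3 in P:
P = [("t1", 1), ("t2", 2), ("t3", 100), ("t4", 2), ("t5", 1)]
Q = [("t1", 1), ("t2", 2), ("t3", 3), ("t4", 2), ("t5", 1)]
```

Here median_filtered_dist(P, Q, 2, 2, 1) returns 1 rather than the raw distance |100 − 3| = 97, filtering out the impact of the outlier.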


Figure 3: Noise and outlier patterns that may exist in the moving object trajectories

While the outlier in the example above can be easily filtered out with a simple median filter, the situation is much more

complicated in trajectories from real applications. TAMF considers three aspects of the noise that may exist in the data:

• A series of bursts of impulse noise (c.f. Figure 3.a): in order to remove such noise we need to adaptively adjust the

median filter window size: while a large filter window size may lead to unnecessary overhead, a small window size may result

in treating noisy values as correct output of the filter.

• Noisy patches that span several data samples (c.f. Figure 3.b): in order to detect such outliers we need to correctly detect

the size of the noise as well as its rising and falling edges, and adjust the median filter window size accordingly.

• Time-varying amplitude of the impulse noise (c.f. Figure 3.c): this has been overlooked by previous works on adaptive
median filtering and may affect the precise detection and removal of the previous two types of noise. As we have

explained, trajectories are usually time-varying stochastic processes and their characteristics develop over time, e.g.,

the amplitude of the impulse noise is not uniformly the same for the entire trajectory. For example, suppose we are

tracking vehicles on a road network: a data sample indicating a speed of 80mph when the object moves in an urban area is a

likely candidate for outliers, whereas when the object moves on an interstate highway, a data sample indicating a speed

of 30mph may suggest a (negative) impulse noise. This illustrates that in order to correctly identify noise candidates from

trajectories, we need to incorporate the dynamic local statistics of the trajectory data.

TAMF achieves time-awareness using a sliding window based approach. When computing median wDF, we maintain a

sliding window over the recent past outputs produced by the median filter. Suppose the size of the sliding window is N;

these values are simply stored using a FIFO queue in the main memory and are dynamically updated with every new output

from the median filter. We maintain three statistics about the queue: the maximum value in the queue smax , the minimum

value in the queue smin , and the average of the differences between neighboring values vavg . These statistics are used

to detect the presence of noise in the data.

The algorithm first detects whether there is an impulse in the median filter output in the while loop. If there is no impulse in

the median filter output, the algorithm then detects whether the center distance value itself is corrupted by noise/outliers, using

the minimum and maximum distance values as well as the average speed values maintained by the sliding window. If there is

an impulse in the median filter output, we increase the size of the median filter up to hMAX . The value of hMAX is determined

by the noise occurrence probability [16, 9] and is usually set to 7. If there is still impulse output from the median filter after the

window size has reached the maximum, the algorithm then examines whether there is a large noise patch with more than one

data sample. The maximum size of a noise patch that can be removed is specified by kMAX , and is usually set to 3. Impulses

that span more than 3 data samples are treated as intrinsic patterns of the data and are not removed. The complexity of TAMF

is dependent on the maximum size of the median filter window. As for the sliding window, our empirical study shows that a

sliding window size of ?? is most effective. The maintenance of the sliding window takes O(1) time and O(N) space.
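A minimal sketch of the sliding-window bookkeeping, assuming a FIFO queue of recent median-filter outputs. The class name is hypothetical, and for clarity the statistics are recomputed on demand here rather than maintained in O(1) as the paper does.

```python
from collections import deque

class SlidingStats:
    """FIFO window over the N most recent median-filter outputs, exposing the
    statistics used for noise detection: smax, smin, and vavg (the average
    difference between neighboring values)."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.values = deque()

    def push(self, v):
        """Append a new median-filter output, evicting the oldest when full."""
        self.values.append(v)
        if len(self.values) > self.capacity:
            self.values.popleft()

    @property
    def smax(self):
        return max(self.values)

    @property
    def smin(self):
        return min(self.values)

    @property
    def vavg(self):
        vs = list(self.values)
        if len(vs) < 2:
            return 0.0
        return sum(abs(b - a) for a, b in zip(vs, vs[1:])) / (len(vs) - 1)
```

A center value far outside [smin, smax], or a jump much larger than vavg, is what Algorithm 2 treats as a noise candidate.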

M-wDF can still be bounded using the MBRs in the native space, however, it is no longer a pseudo-metric, i.e., it does not

satisfy the triangular inequality which can be used for pruning. Hence, it is slower to perform similarity search or similarity

join using m-wDF, compared to using wDF. This actually presents a trade-off between the effectiveness and efficiency of the

similarity measure. We provide a more detailed experimental evaluation in Section 6.


Algorithm 2 Computing m-wDF Distance

Input: Trajectory Tr1 = (p11, ..., p1n), Tr2 = (p21, ..., p2n), warping matrix df[n][n] where df[i][j] = dist(p1i, p2j)

Output: δm-wDF(Tr1, Tr2)

1:  for each pair of points (p1i, p2j) do
2:    let smin, smax and vavg denote the current sliding window statistics
3:    initialize the size h of the filter window S^h_{i,j} to 1
4:    while h ≤ hMAX do
5:      sort the df elements within the window W; let xmin ← the minimum, xmax ← the maximum, and xmed ← the median distance value within W
6:      if xmed > xmin and xmed < xmax then
7:        no_impulse ← true
8:        break
9:      else
10:       increase h
11:   if no_impulse is true then
12:     if (df[i][j] > smin and df[i][j] < smax) and (df[i′][j′] − df[i][j] < λ × vavg or df[i][j] − xmed < λ × vavg) then
13:       keep the original df[i][j] value
14:     else
15:       df[i][j] ← xmed
16:   else
17:     for k from 1 to kMAX do
18:       if df[i−k][j−k] − df[i][j] < λ × vavg or df[i][j] − x^{−k}_med < λ × vavg then
19:         there is a noise patch of size k; increase the filter size
20:       else
21:         increase k
22: dF(1:n, 1:n) ⇐ −1.0 // initialize the n × n array dF
23: return ComputeWDF(n, n)
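The final call to ComputeWDF fills the warping matrix with the usual Fréchet dynamic program. For reference, here is a minimal unconstrained discrete Fréchet distance in Python, after Eiter and Mannila [12]; the w-constrained variant used by wDF would additionally restrict the matched indices to the temporal window:

```python
def discrete_frechet(tr1, tr2, dist):
    """Discrete Frechet distance between two point sequences via the
    standard O(n*m) dynamic program (Eiter & Mannila)."""
    n, m = len(tr1), len(tr2)
    INF = float("inf")
    dp = [[INF] * m for _ in range(n)]
    dp[0][0] = dist(tr1[0], tr2[0])
    for i in range(n):
        for j in range(m):
            if i == j == 0:
                continue
            best = min(
                dp[i - 1][j] if i > 0 else INF,
                dp[i][j - 1] if j > 0 else INF,
                dp[i - 1][j - 1] if i > 0 and j > 0 else INF,
            )
            # The coupling must cover the worst matched pair on the path.
            dp[i][j] = max(best, dist(tr1[i], tr2[j]))
    return dp[n - 1][m - 1]

euclid = lambda p, q: ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5
```

For two parallel horizontal segments one unit apart, the distance is exactly 1, since the diagonal coupling matches each point to its vertical neighbor.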

4 Index-Based Trajectory Join Under wDF

In this section, we present our framework for spatio-temporal similarity join of trajectories under the wDF distance

measure. Assuming that each trajectory is uniformly split into segments that are indexed by a 3-dimensional R-tree, we

describe the nearest neighbor join algorithm, and present several important variants.
4.1 Nearest Neighbor Join

The nearest neighbor join retrieves pairs of trajectories, where the second trajectory in each result pair is closer to the first one than any other trajectory from the second data set, using the wDF distance as the similarity measure.

The main inputs to the algorithm are the two trajectory sets S1 and S2 , indexed by disk-based R-trees R1 and R2 ,

respectively. The algorithm accesses both trees in a manner similar to the incremental distance join [17]: descending from

their roots simultaneously, and concurrently assigning the segments from the second set to the closest trajectory from the

first set. The algorithm maintains a set of pairs, where the first item in each pair is from R1 and the second one from R2 .

Each item can be either a node of the R-tree, or a leaf node entry, i.e., the MBR of a particular segment. The set of pairs

is managed by a priority queue, such that the dequeue operation will always remove the pair with the smallest MinDist. In

addition to the priority queue, the algorithm utilizes two data structures. The first is the warping matrix directory (WMD) that

maintains an entry for each trajectory from S1 , storing a list of incomplete LBDδwDF and UBDδwDF warping matrices between

that trajectory and a trajectory from S2 . Each entry in WMD also maintains an upper bound distance, which is the maximum

possible distance allowed to become an answer candidate. In addition, each entry has a flag that indicates whether the nearest

neighbor for this particular trajectory has been found. The second structure is the candidates table (CT) that stores for each

trajectory from S1 its candidate answers in a sorted list, in ascending order of the LBDδwDF .
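The traversal order described above can be sketched as a best-first loop over a priority queue keyed by MinDist. The tree accessors (min_dist, children, is_leaf) are hypothetical stand-ins for the R-tree API, and the WMD/CT bookkeeping is omitted:

```python
import heapq

def incremental_join(root1, root2, min_dist, children, is_leaf):
    """Best-first traversal of two trees: always expand the pair of
    elements with the smallest MinDist, yielding leaf pairs in
    non-decreasing MinDist order."""
    heap = [(min_dist(root1, root2), 0, root1, root2)]
    tick = 1  # tie-breaker so heapq never compares tree elements
    while heap:
        d, _, e1, e2 = heapq.heappop(heap)
        if is_leaf(e1) and is_leaf(e2):
            yield d, e1, e2                     # a segment pair
        elif is_leaf(e1):
            for c in children(e2):              # expand the node side
                heapq.heappush(heap, (min_dist(e1, c), tick, e1, c))
                tick += 1
        else:                                   # expand e1 (the full
            for c in children(e1):              # join balances depths)
                heapq.heappush(heap, (min_dist(c, e2), tick, c, e2))
                tick += 1

# Toy 1-D "trees": a node is a list of children, a leaf is a number.
is_leaf = lambda x: not isinstance(x, list)
children = lambda x: x
def min_dist(a, b):
    la = [a] if is_leaf(a) else a
    lb = [b] if is_leaf(b) else b
    return min(abs(p - q) for p in la for q in lb)
```

On the toy input below, leaf pairs come out strictly in order of their distances, which is what lets the real algorithm stop early once an entry's upper bound is beaten.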

The join process is illustrated in Algorithm 3. After initializing the relevant data structures, the main body of the algorithm

is a while loop that continuously processes the next pair dequeued:

• When both elements in the pair are MBRs of trajectory segments (lines 4-17), the algorithm first checks whether the corresponding entry from WMD is complete and, if so, simply discards the pair from further consideration. Otherwise, it performs early abandoning by checking whether the MinDist between the two MBRs is less than the upper bound distance (line 7). Then

the relevant warping matrices in the corresponding WMD entry are updated and the algorithm examines whether the update

generates a complete path in the LBDδwDF warping matrix. If so, the LBDδwDF and UBDδwDF distances are calculated. LBDδwDF

is used to insert a new entry into the candidates table, and UBDδwDF is used to update the upper bound distance of the entry.

Finally, if the pair’s MinDist is greater than the entry’s upper bound distance, this WMD entry is flagged complete and the

relevant warping matrices are discarded.

• When only the first element in the pair is the MBR of a segment (lines 18-22), the algorithm checks whether the corresponding entry in WMD is flagged complete, and if so the pair is discarded, since it (and any new pair formed by further descending the R-tree) cannot produce a better answer than the existing candidates. Otherwise the second node is
Algorithm 3 Index-Based Trajectory Join

Input: R-tree R1 , R2 ; Trajectory set S1 , S2 ; temporal matching window w

/* filtering stage */

1: priority queue Q.ENQUEUE(R1 .root,R2 .root,0)

2: while ! Q.ISEMPTY do

3: (e1,e2,mindist) ⇐ Q.DEQUEUE

4:  if both e1 and e2 are segment MBRs then
5:    Tr1 ⇐ trajectory of e1, Tr2 ⇐ trajectory of e2, E ⇐ WMD[Tr1]
6:    if E is flagged incomplete then
7:      if MinDist(e1, e2) ≤ E.upper_bound_dist then
8:        insert MinDist, MaxDist of e1, e2 into E
9:        if a path exists for the MinDist warping matrix between Tr1 and Tr2 then
10:         compute LBDδwDF and UBDδwDF between Tr1, Tr2
11:         if UBDδwDF < E.upper_bound_dist then
12:           E.upper_bound_dist ⇐ UBDδwDF
13:         insert Tr2 and LBDδwDF into CT[Tr1]
14:     else
15:       set the flag of E to complete
16:   else if E is flagged complete then
17:     discard the pair (e1, e2)
18: else if e1 is a segment MBR then
19:   if WMD[Tr1] is flagged complete then
20:     discard the pair (e1, e2)
21:   else
22:     expandElement(e1, e2, Q)
23: else if e2 is a segment MBR then
24:   expandElement(e2, e1, Q)
25: else if both e1 and e2 are nodes then
26:   expandBalancedElement(e1, e2, Q)

/* refinement stage */

27: for every entry Tri in CT do

28: compute δwDF (Tri ,Tr j ) for each candidate Tr j until the nearest neighbor is found
expanded by calling the function expandElement. expandElement pairs each child of the input node with the other input element if they are temporally within w, and inserts each resulting pair into the priority queue Q.

• When only the second element in the pair is the MBR of a segment (lines 23-24), the first node is expanded by calling expandElement, with e1 and e2 exchanged.

• When a pair of nodes is processed (lines 25-26), the algorithm expands one of the nodes by calling expandBalancedElement, which tries to keep the descent depths of the two trees balanced. The node to expand is the one at the shallower depth, or the one with the larger area if both nodes are at the same depth [17].

After the while loop terminates, the refinement step is performed on CT, using the triangular inequality of wDF for pruning (lines 27-28). For every entry of the candidates table, we examine the candidate trajectories in ascending order of their LBDδwDF and calculate the exact wDF distance, until either all the candidate trajectories have been examined, or the next LBDδwDF is greater than the smallest computed wDF distance value.
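The refinement step amounts to the following early-terminating scan; a sketch where wdf_stub stands in for the full wDF computation (the stub and its values are hypothetical):

```python
def refine_nearest(candidates, exact_dist):
    """Refinement step for one CT entry.  `candidates` is a list of
    (lower_bound, trajectory) pairs sorted by ascending lower bound;
    `exact_dist(traj)` computes the exact distance.  The scan stops as
    soon as the next lower bound exceeds the best exact distance found,
    since no later candidate can beat it."""
    best, best_traj = float("inf"), None
    for lower_bound, traj in candidates:
        if lower_bound > best:
            break                       # remaining candidates cannot win
        d = exact_dist(traj)
        if d < best:
            best, best_traj = d, traj
    return best_traj, best

candidates = [(1.0, "a"), (2.0, "b"), (5.0, "c")]  # (LBD, trajectory)
calls = []
def wdf_stub(t):                        # hypothetical exact-distance oracle
    calls.append(t)
    return {"a": 3.0, "b": 2.5, "c": 6.0}[t]
```

In the example, candidate "c" is never evaluated exactly: its lower bound 5.0 already exceeds the best exact distance 2.5.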

4.2 Variants of the Similarity Join

Algorithm 3 for nearest neighbor join requires minor modifications to calculate other similarity joins among trajectories

in our framework.

• k-nearest neighbor join (kNN join) [6]: A kNN join finds for each trajectory from S1 its k nearest neighbors from S2

in terms of the wDF distance. For each trajectory from S1 , after the first k candidates are added to the candidate table,

the minimum of their UBDδwDF is used as the upper bound. We continue to add new candidates as long as their LBDδwDF

distances are smaller than the current upper bound, and update the upper bound with the new tighter UBDδwDF if necessary.

In the refinement stage, we calculate the exact wDF distance for every candidate and select the k trajectories with the smallest

distance values.

• Range Join [6]: A range join finds for each trajectory from S1 all the trajectories from S2 that are within a given wDF

distance of it. For this extension, we simply retrieve for each trajectory in S1 all the candidates whose LBDδwDF is less than

the given distance threshold during the filtering stage, and refine the answers using the exact wDF distance.
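The range join's filter-and-refine logic can be sketched as follows; note that the upper bound also lets certain pairs be accepted without any exact computation (the function and stub names are ours):

```python
def range_join(pairs, eps, exact_dist):
    """Range-join refinement sketch.  `pairs` yields tuples
    (tr1, tr2, lower_bound, upper_bound) from the filtering stage.
    Pairs with lower_bound > eps are pruned; pairs with
    upper_bound <= eps are accepted without the exact computation."""
    result = []
    for tr1, tr2, lbd, ubd in pairs:
        if lbd > eps:
            continue                       # certainly out of range
        if ubd <= eps or exact_dist(tr1, tr2) <= eps:
            result.append((tr1, tr2))      # certain, or verified, match
    return result

# Hypothetical (tr1, tr2, LBD, UBD) tuples produced by the filter stage.
example_pairs = [("a", "x", 0.5, 0.8),     # UBD <= eps: accepted for free
                 ("a", "y", 0.5, 2.0),     # needs the exact distance
                 ("a", "z", 3.0, 4.0)]     # LBD > eps: pruned
calls = []
def exact_stub(t1, t2):
    calls.append((t1, t2))
    return 1.5                             # pretend exact distance
```

With eps = 1.0, only the middle pair triggers an exact computation; the first is accepted and the last rejected from the bounds alone.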

We also note that our framework can straightforwardly support the time interval join [5], where the kNN or range predicate

is defined using only some portions (segments) of trajectories within a specified time interval of interest. In this case we

retrieve only the index nodes and leaf node entries that intersect with the given time interval from the same index structure.
5 Applying Similarity Join for Efficient Clustering of Trajectories

In this section we demonstrate how our proposed techniques can be used to support efficient clustering of spatio-temporal

trajectories. Various clustering algorithms have been proposed in the literature [18, 19, 13]. These algorithms all require frequent distance computations between objects, so scaling them to large trajectory sets may be costly. However, their

efficiency can be significantly improved when equipped with our trajectory join framework. In the following, we first discuss

the application of our methodology to the partition-based algorithm k-medoids [18], and then we outline the similar benefits

in the context of Chameleon [19] and DBSCAN [13].

5.1 K-medoids Clustering for Trajectories

The k-medoids algorithm is a partition-based method that divides the trajectories into k groups and iteratively exchanges

objects between them until a predefined function, which evaluates the quality of the clusters, reaches a local optimum. The

most expensive operations in k-medoids are the initial assignment of the trajectories to their nearest medoids, and the iterative

reassignment of the trajectories after randomly replacing an existing medoid with a new one. However, these operations

actually correspond to a nearest neighbor join between the set of trajectories and the set of medoids [6], hence we can make

use of the nearest neighbor join algorithm presented in the previous section to expedite the process. The modification of the standard k-medoids algorithm [18], which utilizes spatio-temporal indexing and the wDF distance, is described as follows:

1. Randomly pick k trajectories as initial medoids and create an R-tree on the medoid set;

2. Call Algorithm 3 with the R-trees of trajectories & medoids as input;

3. Compute the quality measure QC = ∑1<i<k ∑Tr j ∈Ci δwDF (mi , Tr j );

4. Randomly pick trajectory mreplace to replace a randomly selected medoid mi in the medoids R-tree;

5. Call Algorithm 3 with the new medoids R-tree;

6. Update QC after the replacement and commit the change if it decreases;

7. Repeat 4-6 until no further improvement in QC .
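The steps above can be sketched as follows, with a plain nearest-medoid scan standing in for the index-based join of Algorithm 3 and absolute difference standing in for wDF (both are simplifying assumptions of this sketch):

```python
import random

def k_medoids(trajs, k, dist, iters=100, seed=0):
    """Sketch of the k-medoids loop: random initial medoids, then
    repeated random replacement, committing only improvements to Q_C."""
    rng = random.Random(seed)
    medoids = rng.sample(trajs, k)

    def quality(meds):
        # Q_C: sum of each trajectory's distance to its nearest medoid.
        return sum(min(dist(t, m) for m in meds) for t in trajs)

    q = quality(medoids)
    for _ in range(iters):
        i = rng.randrange(k)
        candidate = rng.choice(trajs)
        if candidate in medoids:
            continue
        trial = medoids[:i] + [candidate] + medoids[i + 1:]
        q_trial = quality(trial)           # step 5: full reassignment
        if q_trial < q:                    # step 6: commit if improved
            medoids, q = trial, q_trial
    return medoids, q

trajs = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]  # two obvious 1-D "clusters"
```

Each call to quality here is a full reassignment; the point of the next subsection is precisely to avoid recomputing it from scratch in step 5.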

The crucial benefit of our approach is received in step 5 of the above algorithm, where many trajectories remain closest to

the same medoid from the previous assignment. Hence, we can incrementally update the nearest medoid only when necessary,

instead of running Algorithm 3 from scratch every time. The trajectories affected by the medoid replacement were either

assigned to the cluster whose medoid is replaced, or assigned to other clusters but are now closer to the new medoid. All other

trajectories remain in the same clusters. By identifying such trajectories at an early stage during the replacement process, we

can prune a large number of distance calculations. This is achieved by modifying Algorithm 3 to cater for step 5, as described in Algorithm 4.

Algorithm 4 Incremental Reassignment

... ...

/* refinement stage */

mi ⇐ the medoid to be replaced by mreplace

Hash table H ⇐ trajectories that were in the cluster of mi

for every entry Tri in CT do

if Tri ∈ H or candidates contain mreplace then

compute δwDF (Tri ,Tr j ) for each candidate Tr j

We utilize an additional hash table H to store the trajectories that were in the replaced cluster. In the refinement stage, we

only calculate the distance for an entry in CT if: either the trajectory is in H, or the replaced medoid is one of the candidates.

These changes ensure that the distance calculation for the trajectories that remain in their old cluster will not be repeated,

and can significantly speed up the reassignment process which is dominated by the expensive distance calculation in the

refinement stage.
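The effect of this pruning can be illustrated with a small sketch: only trajectories of the replaced cluster, or those whose lower-bound distance to the new medoid beats their current assignment, trigger an exact computation. Here lbd and exact are hypothetical stand-ins for LBDδwDF and wDF, and the sketch re-solves replaced-cluster trajectories against the new medoid only (a full version would also consider the remaining medoids):

```python
def incremental_reassign(assign, cur_dist, old_m, new_m, lbd, exact):
    """Update assignment after replacing medoid old_m with new_m.
    `assign` maps trajectory -> current medoid and `cur_dist` maps
    trajectory -> distance to it; both are updated in place.  Returns
    how many exact distance computations were needed."""
    exact_calls = 0
    for t in list(assign):
        if assign[t] == old_m:
            # Cluster lost its medoid: this trajectory must be re-solved.
            exact_calls += 1
            assign[t], cur_dist[t] = new_m, exact(t, new_m)
        elif lbd(t, new_m) < cur_dist[t]:
            # Only a promising lower bound triggers the exact distance.
            exact_calls += 1
            d = exact(t, new_m)
            if d < cur_dist[t]:
                assign[t], cur_dist[t] = new_m, d
        # otherwise the trajectory certainly keeps its old cluster
    return exact_calls

assignment = {0: 0, 1: 0, 10: 10}     # trajectory -> its current medoid
distances = {0: 0.0, 1: 1.0, 10: 0.0}
calls = []
def exact(t, m):                      # stand-in for the exact wDF
    calls.append((t, m))
    return abs(t - m)
lbd = lambda t, m: 0.5 * abs(t - m)   # a valid lower bound for it
```

Replacing medoid 0 with 2 touches only the two trajectories of the replaced cluster; trajectory 10 is ruled out by its lower bound and costs nothing.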

5.2 Hierarchical and Density-Based Clustering

• Index-based Chameleon: The Chameleon algorithm [19] explores graph partitioning and dynamic modelling to achieve

high quality hierarchical clustering. The most expensive step, when applying Chameleon to spatio-temporal trajectories, is

to build the hyper-graph where each vertex of the graph represents a trajectory, and there exists an edge between two vertices

(trajectories) if one trajectory is among the k nearest neighbors of the other. However, the step of building this k-nearest

neighbor graph can be transformed to that of performing a k-nearest neighbor self-join, and we can utilize the techniques

presented in Section 4.2 to expedite the process.

• Index-based DBSCAN: To cluster trajectories using density-based algorithms such as DBSCAN [13], the critical task is

to identify dense regions using the input parameters ε and MinPts, by performing a range query for each trajectory. However,

instead of executing the individual range queries separately, we can apply the range join algorithm proposed in Section 4.2 to

concurrently perform the range queries for all the trajectories in the data set, where the join distance equals ε. The result of

this join can be stored in a matrix for look-up during the execution of the classic DBSCAN algorithm.

Note that trajectory pairs whose UBDδwDF is smaller than ε are guaranteed to be within the ε-neighborhood, so for them the exact distance computation can be avoided entirely.
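Once the range join has materialized the ε-neighborhoods, the classic DBSCAN loop needs no further distance computations at all; a compact sketch over a precomputed neighborhood map (names are ours):

```python
def dbscan_from_join(neighbors, min_pts):
    """Classic DBSCAN given a precomputed eps-neighborhood map
    (trajectory -> set of trajectories within eps, including itself),
    i.e. the output of the range self-join.  Returns trajectory ->
    cluster id, with -1 for noise."""
    labels, cluster = {}, 0
    for p in neighbors:
        if p in labels:
            continue
        if len(neighbors[p]) < min_pts:
            labels[p] = -1                  # provisionally noise
            continue
        cluster += 1                        # p is a core point
        labels[p] = cluster
        stack = list(neighbors[p])
        while stack:
            q = stack.pop()
            if labels.get(q) == -1:
                labels[q] = cluster         # noise becomes a border point
            if q in labels:
                continue
            labels[q] = cluster
            if len(neighbors[q]) >= min_pts:
                stack.extend(neighbors[q])  # q is core: keep expanding
    return labels

# Example neighborhoods: a-b-c form one chain, d is isolated.
neighborhoods = {"a": {"a", "b"}, "b": {"a", "b", "c"},
                 "c": {"b", "c"}, "d": {"d"}}
```

With MinPts = 2, the chain collapses into one cluster and the isolated trajectory is labeled noise, all without a single distance call inside the loop.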
6 Experimental Results

In this section, we empirically evaluate the efficiency and effectiveness of our proposed techniques.

We have implemented our similarity join framework in Java. All our experiments are executed on a PC with a Pentium IV

3.0 GHz CPU and 1 GB of memory. To evaluate the efficiency of the proposed algorithms, we use the network-based traffic

generator [7] and produce moving object trajectories based on the road networks of Oldenburg (OB) and San Francisco (SF).

To obtain some quantitative observations about the potential use of our framework for data mining applications, we use the

wDF distance for classification in data sets provided by the UCR Time Series Collection [20]. We index the trajectories with

a 3-dimensional R-tree and uniformly split each trajectory into segments. The resulting segments are then inserted into the

R-tree, where each data entry in a leaf node contains one segment. The page size is set to 4KB, and an LRU buffer with 1000 pages is used. Unless stated otherwise, the matching window size w is set to 15% of the entire trajectory duration.

6.1 Efficiency of Similarity Join

Although our results are independent of the trajectory splitting strategy adopted, before evaluating the performance of

our similarity join framework, we need to determine the trajectory splitting size for our data sets in order to remove its

impact from further experiments. Increasing the number of splits implies tighter bounds but may also increase the costs for

calculating them, whereas decreasing the number of splits deteriorates filtering effectiveness. We perform a nearest neighbor

join for 200 trajectories generated from the road networks of OB and SF, with 400 and 1000 points respectively. We vary the

number of points contained in each segment from 5 to 200 and the results are shown in Figure 4. Based on the results, we fix

the number of points in each segment to 20 in the following experiments.

With the uniform split model, we then evaluate the tightness of the two distance bounds. We use the road networks of OB

and SF to generate varying number of trajectories, and randomly pick one trajectory to perform a nearest neighbor query,

using the two distance bounds for pruning. We record the total number of times the exact wDF distance is calculated, and

divide this number by the total number of trajectories in the data set. The resulting ratio is shown in Figure 4. It can be observed that, using our approximate distance bounds, we need to perform less than 2% of the wDF distance calculations.

Next we evaluate the efficiency and scalability of our trajectory join algorithm. Due to space limitations, we focus on

the nearest neighbor join only. The next two sets of experiments compare the efficiency of three different approaches: (1)

our framework using the wDF distance, (2) the metric-based join [29] (essentially a sequential scan over the entire data set

but uses the triangular inequality for pruning as much as possible) with wDF as the distance metric and (3) similarity join on
[Figure 4: Impact of Segment Size and Tightness of Bounds — (a) total running time (sec) vs. number of points per segment; (b) percentage of wDF calculations performed vs. total number of trajectories; both panels show curves for p-length = 400 and p-length = 1000]

DTW distance with lower-bound indexing [21], as a representative of the transformed space approach. For the DTW based

approach, we implement the join as a batch of similarity search queries where each query is a trajectory from the first data set

that is used to search for its nearest neighbor in the second data set. We use the same parameters (the number of points in each segment/piece-wise approximation, the R-tree parameters, the splitting strategy, etc.) in both our framework and the

DTW implementation. We also take into account the time it takes for approach (1) and (3) to build the index structure.

Our first set of experiments reports the total running time of the nearest neighbor join as a function of the number of

trajectories. Figure 5 compares the performance of the three approaches on trajectories generated from road networks of

OB and SF, respectively. Each OB trajectory contains 400 points and each SF trajectory contains 1000 points. We observe

that our join framework clearly outperforms the metric-based join, yielding a speed-up of up to 10 times. Furthermore, our

approach scales well to large trajectory sets since the running time grows linearly with respect to the number of trajectories,

whereas the running time for metric-based join grows quadratically. While the DTW based approach also outperforms the

metric space based approach by a large factor, it is on average more than 2 times slower than our approach. This discrepancy

becomes even larger on the SF data set. The main reason is that when the number of points in each trajectory increases,

the dimensionality of the index structure used to index the trajectories in the transformed space, i.e., the index of the DTW

distance, also grows. This will reduce the selectivity of the index and admit more false positives that will need to be eliminated

with the expensive DTW distance calculation. Increasing the number of points per segment/piece-wise approximation can
[Figure 5: Scaling with Number of Trajectories — total running time (k sec) vs. number of trajectories for the metric-based, wDF-based, and DTW-based joins; (a) Oldenburg data set, (b) San Francisco data set]

not solve the problem, as it will yield a wider bounding envelope for DTW and loosen the lower bounds [21]. Working

in the native space, our approach does not have this problem of dimensionality. When the number of points per trajectory

increases, it only increases the total number of segments and the size of the R-tree structure. However, the extra accesses to

the indices are paid off by the reduction of false positives because of the lower/upper-bounds.

[Figure 6: Scaling with Trajectory Length — total running time (k sec) vs. number of points per trajectory for the metric-based, wDF-based, and DTW-based joins; (a) 400 trajectories, (b) 1000 trajectories]


Our next set of experiments investigates the similarity join performance with respect to the number of points per trajectory.

We fix the number of trajectories in OB and SF to 400 and 1000 respectively and increase the number of points in each

trajectory. From Figure 6, we can observe that our approach scales very well with the number of points per trajectory, and

consistently delivers a speedup of more than 2 times over the DTW based approach. The speedup increases as the number

of points per trajectory grows from 200 to 1000.

6.2 Effectiveness of wDF

In order to evaluate the effectiveness of our proposed similarity measure, we use a one-nearest neighbor classification

algorithm as suggested by Keogh et al. on 20 different data sets [20]. These data sets cover various application domains, e.g.,

robotics, industry, botany etc. For each group of data, a training set and a testing set are provided together with the correct

cluster labels. We compare the classification error ratio of wDF against that of L2 -norm and DTW from [20], as shown in

Table 1.

The classification error ratio of wDF is obtained by finding the optimal warping window size for the purpose of this comparison (as is DTW's), and the percentages in parentheses indicate the ratio of the matching window size w to the trajectory duration. We observe that the classification error rates yielded by wDF are clearly superior to those of the L2-norm, and are comparable with those of DTW (wDF wins in 7 data sets, ties in 2 data sets and loses in the rest). This is because while wDF can handle

local time shifting, it is more sensitive to noise than DTW. We note that using a uniform window size of 15% yields only

slightly different results [11].

Since wDF is sensitive to noise, one can alleviate this problem by applying a filtering technique similar to those of EDR and LCSS [27, 10] when determining wDF. We have considered using a median filter to protect the warping matrix from noise.

Our preliminary experiments indicate that the median filter substantially improves the effectiveness of wDF for classification

purposes, as illustrated in Table 2. However, there are two important issues that we need to address: (1) choosing the optimal

filter size, or properly adjusting it (for adaptive algorithms); (2) median filters need not yield metric distance, which may

slow down the refinement step of Algorithm 3. We will focus on these issues in the future work.

6.3 Clustering Speed-up

To demonstrate the application and benefits of our framework, we evaluate the performance of the three clustering

algorithms presented in Section 5, equipped with our index-based trajectory join. We compare our approach only with

the metric-based approach where the triangular inequality is used to prune the distance computation. The trajectories used in
Data set         L2-norm  DTW     wDF (w)
Synthetic Ctrl.  0.12     0.017   0.02 (13%)
Gun-Point        0.087    0.087   0.027 (0.7%)
CBF              0.148    0.004   0.027 (14.8%)
Face(all)        0.286    0.192   0.142 (2.3%)
OSU Leaf         0.483    0.384   0.421 (3%)
Swedish Leaf     0.213    0.157   0.182 (4.7%)
50 Words         0.369    0.242   0.301 (4.4%)
Trace            0.24     0.01    0 (4.4%)
Two Patterns     0.09     0.0015  0.0045 (14.5%)
Wafer            0.005    0.005   0.0045 (13.2%)
Face(Four)       0.216    0.114   0.307 (2.3%)
Lightning-2      0.246    0.131   0.229 (2.8%)
Lightning-7      0.425    0.288   0.329 (3.8%)
ECG              0.12     0.12    0.13 (1%)
Adiac            0.389    0.391   0.381 (6.25%)
Yoga             0.17     0.155   0.143 (2.1%)
Fish             0.217    0.16    0.181 (1.7%)
Beef             0.467    0.467   0.467 (2.3%)
Coffee           0.25     0.179   0.0714 (2.8%)
OliveOil         0.133    0.167   0.167 (0.5%)

Table 1: Effectiveness of wDF Distance

the experiments are generated from the SF road networks, with 600 points each.

We first compare the effects of incremental reassignments in k-medoids. We fix the number of trajectories to 6400 and

increase the number of clusters from 2 to 64. Figure 7 (a) shows that the incremental reassignment can speed up each

iteration over the naive reassignment by up to 3.5 times. The speed-up increases with the number of clusters since the

number of trajectories that need to be reassigned becomes smaller.

Next, we study the speed-up achieved by implementing the three clustering paradigms using our trajectory similarity join

framework. In all the experiments, we fix the number of clusters to be 10. For k-medoids, we force the reassignment process

to terminate after 20 iterations in each run. The results are shown in Figure 7 (b): the similarity join can dramatically speed
Data set      DTW     wDF     wDF with median filter
CBF           0.004   0.027   0.011
OSU Leaf      0.384   0.421   0.393
Swedish Leaf  0.157   0.182   0.166
50 Words      0.242   0.301   0.273
Two Patterns  0.0015  0.0045  0.00075
Lightning-7   0.288   0.329   0.29
Beef          0.467   0.467   0.433

Table 2: Effectiveness of Median Filter on wDF

[Figure 7: Speed-up of k-medoids — (a) speed-up ratio of the incremental reassignment vs. number of clusters; (b) total running time (k sec) of the metric-based vs. join-based approaches vs. number of trajectories]

up the k-medoids algorithm compared to the metric-based approach by more than an order of magnitude. For Chameleon, we use a 5-NN join to build the hyper-graph for agglomerative clustering. For DBSCAN, we set ε to ten times the maximum speed of the moving objects and MinPts to 3. The results for Chameleon and DBSCAN are shown in Figure 8. We observe that, due to the expensive cost of computing the wDF distance, the kNN join and the range join dominate the running time of Chameleon and DBSCAN, by 99.5% and 99% respectively (not shown in the figure). The performance of Chameleon can be improved by more than 5 times, and that of DBSCAN by more than 8 times. Furthermore, the improvement keeps growing with the number of trajectories.
[Figure 8: Speed-up of Chameleon & DBSCAN — total running time (k sec) vs. number of trajectories for the metric-based vs. join-based variants; (a) Chameleon, (b) DBSCAN]

7 Related Work & Concluding Remarks

The problem of turbo-charging the data mining process by similarity joins has been investigated in [6] for low-dimensional data. In this work, we focus on joining spatio-temporal trajectories, and the main goal is to utilize the index structure to prune a large number of the expensive distance calculations that dominate the join process. A trajectory join using the Lp-norms and a specialized index structure was presented in [5]. However, that approach cannot be straightforwardly extended to support different spatio-temporal similarity joins.

In [28] the indexing of LCSS and DTW using MBRs of trajectory segments is explored. However, the proposed lower-bound distances are calculated in conjunction with a query sequence, which makes an efficient extension to similarity joins questionable. The issue of what constitutes a semantically appropriate distance measure for trajectory similarity is addressed in [23]. [15] considers similarity search for trajectories using spatio-temporal indices and proposes a novel distance measure; however, that work does not address the similarity join of trajectories. We note that our idea of lower-bounding wDF using the MinDists of MBRs is similar to that proposed in [26], and the idea of early abandoning proposed in [30] is similar to our Theorem 3.2.

The key contribution of our work is the combination of similarity joins and spatio-temporal indices in the native space of

moving object trajectories, which yields an increased efficiency.

In this paper, we introduced a new similarity measure wDF for location-related time series data, based on Fréchet distance.

In order to compute the distance efficiently, we proposed two approximations for effective upper/lower-bounding. We then

combined these approximations with spatio-temporal indices in the native space for pruning, and presented a similarity join
framework under our distance measure that supports a number of different similarity join variants. Our experimental results

have demonstrated the efficiency and scalability of our proposed technique in the context of moving object trajectories, and

verified the effectiveness of our distance measure.

One immediate extension of this paper is to improve the robustness of our distance measure against outliers in the data.

Another interesting avenue of future work is to extend our approach towards more general types of motion and richer

representations of the trajectory models.

References

[1] Lei Chen and Raymond T. Ng. On the marriage of lp-norms and edit distance. In VLDB, pages 792–803, 2004.
[2] Helmut Alt and Michael Godau. Computing the fréchet distance between two polygonal curves. Int. J. Comput. Geometry Appl.,
5:75–91, 1995.
[3] Aris Anagnostopoulos, Michail Vlachos, Marios Hadjieleftheriou, Eamonn J. Keogh, and Philip S. Yu. Global distance-based
segmentation of trajectories. In KDD, pages 34–43, 2006.
[4] Axel Mosig. Efficient Algorithms for Shape and Pattern Matching. Ph.D. thesis, 2004.
[5] Petko Bakalov, Marios Hadjieleftheriou, Eamonn J. Keogh, and Vassilis J. Tsotras. Efficient trajectory joins using symbolic
representations. In Mobile Data Management, pages 86–93, 2005.
[6] Christian Böhm and Florian Krebs. The k-nearest neighbour join: Turbo charging the kdd process. Knowl. Inf. Syst., 2004.
[7] Thomas Brinkhoff. A framework for generating network-based moving objects. GeoInformatica, 6(2):153–180, 2002.
[8] V. Prasad Chakka, Adam Everspaugh, and Jignesh M. Patel. Indexing large trajectory data sets with SETI. In CIDR, 2003.
[9] R. H. Chan, Chung-Wa Ho, and M. Nikolova. Salt-and-pepper noise removal by median-type noise detectors and detail-preserving
regularization. IEEE Trans. on Image Processing, 14(10), 2005.
[10] Lei Chen, M. Tamer Özsu, and Vincent Oria. Robust and fast similarity search for moving object trajectories. In SIGMOD
Conference, pages 491–502, 2005.
[11] Hui Ding, Goce Trajcevski, and Peter Scheuermann. Efficient similarity join of spatio-temporal trajectories. In Technical Report
NWU-EECS-08-01, Northwestern University, 2007.
[12] Thomas Eiter and Heikki Mannila. Computing discrete fréchet distance. In Technical Report CD-TR 94/64, Technische Universitat
Wien, 1994.
[13] Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. A density-based algorithm for discovering clusters in large spatial
databases with noise. In KDD, pages 226–231, 1996.
[14] Christos Faloutsos, M. Ranganathan, and Yannis Manolopoulos. Fast subsequence matching in time-series databases. In SIGMOD
Conference, pages 419–429, 1994.
[15] Elias Frentzos, Kostas Gratsias, and Yannis Theodoridis. Index-based most similar trajectory search. In ICDE, 2007.
[16] H. Hwang and R. A. Haddad. Adaptive median filters: new algorithms and results. IEEE Trans. on Image Processing, 4(4), 1995.
[17] Gı́sli R. Hjaltason and Hanan Samet. Incremental distance join algorithms for spatial databases. In SIGMOD Conference, pages
237–248, 1998.
[18] Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers, CA, 2005.
[19] George Karypis, Eui-Hong Han, and Vipin Kumar. Chameleon: Hierarchical clustering using dynamic modeling. IEEE Computer,
32(8):68–75, 1999.
[20] E. Keogh, X. Xi, L. Wei, and C.A. Ratanamahatana. The UCR Time Series dataset. http://www.cs.ucr.edu/~eamonn/time_series_data/, 2006.
[21] Eamonn J. Keogh. Exact indexing of dynamic time warping. In VLDB, pages 406–417, 2002.
[22] Y. Manolopoulos, A. Nanopoulos, A.N. Papadopoulos, and Y. Theodoridis, editors. R-trees: Theory and Applications. Springer-
Verlag, 2006.
[23] Nikos Pelekis, Ioannis Kopanakis, Gerasimos Marketos, Irene Ntoutsi, Gennady L. Andrienko, and Yannis Theodoridis. Similarity
search in trajectory databases. In TIME, pages 129–140, 2007.
[24] Dieter Pfoser, Christian S. Jensen, and Yannis Theodoridis. Novel approaches in query processing for moving object trajectories. In
VLDB, pages 395–406, 2000.
[25] Yasushi Sakurai, Masatoshi Yoshikawa, and Christos Faloutsos. FTW: fast similarity search under the time warping distance. In
PODS, pages 326–337, 2005.
[26] Michail Vlachos, Dimitrios Gunopulos, and Gautam Das. Rotation invariant distance measures for trajectories. In KDD, 2004.
[27] Michail Vlachos, Dimitrios Gunopulos, and George Kollios. Discovering similar multidimensional trajectories. In ICDE, 2002.
[28] Michail Vlachos, Marios Hadjieleftheriou, Dimitrios Gunopulos, and Eamonn J. Keogh. Indexing multidimensional time-series.
VLDB J., 15(1):1–20, 2006.
[29] Jason Tsong-Li Wang and Dennis Shasha. Query processing for distance metrics. In VLDB, pages 602–613. Morgan Kaufmann,
1990.
[30] Li Wei, Eamonn J. Keogh, Helga Van Herle, and Agenor Mafra-Neto. Atomic wedgie: Efficient query filtering for streaming time
series. In ICDM, pages 490–497, 2005.

APPENDIX:

Proposition 3.1 proof (sketch, cf. [4]): Clearly, we have δwDF (Tr1 , Tr1 ) = 0 and δwDF (Tr1 , Tr2 ) = δwDF (Tr2 , Tr1 ); it
remains to prove the triangle inequality.
Let Tr1 = (p11 , . . . , p1n ), Tr2 = (p21 , . . . , p2n ) and Tr3 = (p31 , . . . , p3n ). We show that δwDF (Tr1 , Tr3 ) ≤ δwDF (Tr1 , Tr2 ) +
δwDF (Tr2 , Tr3 ). By the definition of δwDF , it suffices to construct a w-constrained coupling (u, v) such that dist(p1i , p3u(i) ) ≤
δwDF (Tr1 , Tr2 ) + δwDF (Tr2 , Tr3 ) and dist(p1v( j) , p3j ) ≤ δwDF (Tr1 , Tr2 ) + δwDF (Tr2 , Tr3 ) for all i, j ∈ [1 : n].
By the definition of δwDF , there exists a w-constrained coupling (µ, ν) between Tr1 and Tr2 such that dist(p1µ(s) , p2ν(s) ) ≤
δwDF (Tr1 , Tr2 ) for any s. Similarly, there exists a w-constrained coupling (κ, λ) between Tr2 and Tr3 such that dist(p2κ(t) , p3λ(t) ) ≤
δwDF (Tr2 , Tr3 ) for any t.
Next, we construct u(i) as follows: given i ∈ [1 : n], let p2α be the first point of Tr2 that is matched with p1i in (µ, ν), and let
p3u(i) be the first point of Tr3 that is matched with p2α in (κ, λ). Similarly, we construct v( j) as follows: given j ∈ [1 : n], let p2β
be the first point of Tr2 that is matched with p3j in (κ, λ), and let p1v( j) be the last point of Tr1 that is matched with p2β in (µ, ν).
We have now defined u(i) and v( j) for all i, j ∈ [1 : n]. From the construction process, (i, u(i)) and (v( j), j) form two monotonically
non-decreasing couplings between Tr1 and Tr3 . Furthermore, for any i, the points p1i , p2α and p3u(i) form a triangle, and
hence we have: dist(p1i , p3u(i) ) ≤ dist(p1i , p2α ) + dist(p2α , p3u(i) ) ≤ δwDF (Tr1 , Tr2 ) + δwDF (Tr2 , Tr3 ), by the definitions of (µ, ν) and (κ, λ).
Similarly, we have dist(p1v( j) , p3j ) ≤ dist(p1v( j) , p2β ) + dist(p2β , p3j ) ≤ δwDF (Tr1 , Tr2 ) + δwDF (Tr2 , Tr3 ). 
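As an illustration of the w-constrained discrete Fréchet distance δwDF used throughout the proofs, the Eiter–Mannila recurrence restricted to a band |i − j| ≤ w can be sketched as a small dynamic program. This is a minimal sketch, not the paper's implementation: the function name, the (x, y) tuple representation of trajectory points, and the equal-length assumption are ours.

```python
import math

def w_discrete_frechet(tr1, tr2, w):
    """Sketch of delta_wDF: discrete Frechet distance restricted to the
    band |i - j| <= w. tr1, tr2 are equal-length lists of (x, y) points.
    Assumed names/representation; based on the Eiter-Mannila recurrence."""
    n = len(tr1)
    INF = float("inf")
    # ca[i][j]: min over band-respecting couplings of prefixes (0..i, 0..j)
    # of the maximum matched-pair distance.
    ca = [[INF] * n for _ in range(n)]

    def dist(p, q):
        return math.hypot(p[0] - q[0], p[1] - q[1])

    for i in range(n):
        for j in range(max(0, i - w), min(n, i + w + 1)):
            d = dist(tr1[i], tr2[j])
            if i == 0 and j == 0:
                ca[i][j] = d
            else:
                # Best predecessor among the three allowed coupling moves.
                best = min(
                    ca[i - 1][j] if i > 0 else INF,
                    ca[i][j - 1] if j > 0 else INF,
                    ca[i - 1][j - 1] if (i > 0 and j > 0) else INF,
                )
                ca[i][j] = max(best, d)
    return ca[n - 1][n - 1]
```

For two parallel horizontal trajectories one unit apart, the band-constrained coupling matches same-index points, so the distance is 1.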

Theorem 3.1 proof (sketch): Let C be the coupling between Tr1 and Tr2 that has the minimum length. By definition, there
exists a pair of points in C, say (p1i , p2j ), whose distance is no smaller than that of any other pair in C, i.e., δwDF (Tr1 , Tr2 ) =
dist(p1i , p2j ). Further, assume that for given M1 and M2 , p1i is contained in MBB1u and p2j is contained in MBB2v . We have
MinDist(MBB1u , MBB2v ) ≤ dist(p1i , p2j ). By the definition of δwDF and by transitivity, for any other pair of MBBs we can show
that their MinDist is bounded by dist(p1i , p2j ). Hence, dist(p1i , p2j ) is no smaller than the length of any lower-bound coupling
between the two sequences of MBBs. The proof of δwDF (Tr1 , Tr2 ) ≤ UBDδwDF (M1 , M2 ) can be carried out similarly. 
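The MinDist bound underlying the lower-bound argument is the standard minimum distance between two axis-aligned bounding boxes: for any point p in one box and q in the other, dist(p, q) ≥ MinDist of the boxes. A minimal sketch, assuming a ((xmin, ymin), (xmax, ymax)) box representation of our choosing:

```python
import math

def min_dist(mbb1, mbb2):
    """Sketch: MinDist between two axis-aligned boxes, each given as
    ((xmin, ymin), (xmax, ymax)). Any p in mbb1 and q in mbb2 satisfy
    dist(p, q) >= min_dist(mbb1, mbb2)."""
    (ax1, ay1), (ax2, ay2) = mbb1
    (bx1, by1), (bx2, by2) = mbb2
    # Per-axis gap between the two intervals; 0 if they overlap on that axis.
    dx = max(bx1 - ax2, ax1 - bx2, 0.0)
    dy = max(by1 - ay2, ay1 - by2, 0.0)
    return math.hypot(dx, dy)
```

Overlapping boxes yield MinDist 0, so the bound degenerates gracefully when MBBs intersect.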
