We propose an influential-set-based processing model for moving top-k spatial keyword queries, which avoids the shortcoming of safe-region-based approaches that the update cost and the update frequency cannot be optimized simultaneously. Based on the model, we design a parallel query processing method and a parallel validation method for multicore processing platforms. The time complexities of the two algorithms are O((log|D| + q.k)/q.k) and O(log q.k), respectively, each of which is O(1/k) times the time complexity of the state-of-the-art method. The experimental results confirm the superiority of our algorithms over the state-of-the-art method.
In recent years, smart mobile devices, represented by smartphones, have not only grown explosively in number but have also greatly improved in processing capability and available network bandwidth. Smart mobile devices give users the ability to access information and services related to their current location anytime and anywhere. The rapid increase in the number of smart mobile devices motivates governments and enterprises to provide users with more and better location-based services (LBS). The increase in their processing power and communication bandwidth has made many previously infeasible LBS applications possible [
As an emerging service in the LBS field, the Moving top-k Spatial Keywords (MkSK) query has received increasing attention. The MkSK query provides mobile Internet users with search services for spatial keyword results [
Existing moving k nearest neighbor query algorithms usually consider only one factor, the position change of the query, and focus on optimizing indexes over the spatial relationships of the queried objects. The MkSK query must consider not only the relative positions of spatial objects but also the relevance between objects and query keywords. Keywords are not continuously distributed in the way spatial positions are, and they cannot be directly indexed by traditional spatial data structures (such as the R*-tree). Therefore, existing moving k nearest neighbor query algorithms cannot be directly applied to the MkSK query.
Current research on MkSK is mostly based on the safe region method. Literature [
To solve these problems, this paper proposes an MkSK query processing model based on the influential set [
To the best of our knowledge, the method proposed in this paper is the first to use the influential set for this problem, and also the first to process moving top-k spatial keyword queries in parallel.
In the following, we describe the keyword neighbors and their influential sets in Section 3 and the algorithm in Section 4. We discuss the experimental results in Section 5 and conclude our work in Section 6.
Given a set of objects D, each object p ∈ D contains a pair of data 〈λ, φ〉, where p.λ represents the location of the object and p.φ represents the keywords of the object. The spatial-keyword neighbor query q = 〈λ, φ, k〉 includes three parameters: q.λ is the position of the query point, q.φ is the set of query keywords, and q.k is the number of query results. The spatial-keyword neighbor query can be defined as follows:
Definition 1. (Spatial keyword k nearest neighbor query) Given an object set D and a query q = 〈λ, φ, k〉, the spatial keyword neighbor query result set N ⊆ D satisfies:
$$\begin{cases} |N| = k, \\ \forall p' \in N,\ \forall p'' \in D \setminus N,\ f(q, p') \le f(q, p''), \end{cases} \tag{1}$$
where
$$f(q, p') = g\left(\lVert q.\lambda, p'.\lambda \rVert,\ tr_{q.\varphi}(p'.\varphi)\right) \tag{2}$$
is the weighted distance between q and p′, which takes into account both the spatial distance $\lVert q.\lambda, p'.\lambda \rVert$ between q and p′ and the keyword relevance $tr_{q.\varphi}(p'.\varphi)$.
The weighted distance function f(q, p′) can be defined as needed; for example, in the literature [
$$f(q, p') = \frac{\lVert q.\lambda, p'.\lambda \rVert}{tr_{q.\varphi}(p'.\varphi)}, \tag{3}$$
In the literature [
$$f(q, p') = \alpha \cdot \lVert q.\lambda, p'.\lambda \rVert + (1 - \alpha) \cdot tr_{q.\varphi}(p'.\varphi), \tag{4}$$
The coefficient α adjusts the relative importance of the two terms. This paper uses the weighted distance defined in (3).
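As a minimal sketch of the two candidate forms above, the following Python snippet implements the ratio form (3) and the linear form (4) as plain functions. The function names, the Euclidean metric, and the default α = 0.5 are assumptions made for illustration; the text does not prescribe them.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two (x, y) locations (assumed metric)."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def ratio_distance(dist, relevance):
    """Form (3): spatial distance divided by keyword relevance."""
    if relevance <= 0:
        raise ValueError("relevance must be positive")
    return dist / relevance


def linear_distance(dist, relevance, alpha=0.5):
    """Form (4): alpha-weighted sum, reproduced as stated in the text."""
    return alpha * dist + (1 - alpha) * relevance


d = euclidean((0.0, 0.0), (3.0, 4.0))
print(ratio_distance(d, 2.0))        # 2.5
print(linear_distance(d, 2.0, 0.5))  # 3.5
```

With the ratio form, a more relevant object appears "closer" to the query, which is why the relevance must be kept strictly positive.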
The TFIDF model [
$$tr_{q.\varphi}(p.\varphi) = \sum_{\varphi \in q.\varphi} \left( tf(\varphi, p.\varphi) \cdot idf(\varphi) \right), \tag{6}$$
The function tf(φ, p.φ) represents the frequency of occurrences of φ in p.φ, and the function idf(φ) represents the reciprocal of the total number of objects in D that contain φ (Inverse Document Frequency, IDF).
This paper uses the TF-IDF model to calculate the relevance between objects and query keywords. In practical applications, we modify (6) to
$$tr_{q.\varphi}(p.\varphi) = \sum_{\varphi \in q.\varphi} \left( tf(\varphi, p.\varphi) \cdot idf(\varphi) \right) + c, \tag{7}$$
where c is a sufficiently small positive number. Its presence ensures $tr_{q.\varphi}(p.\varphi) \ne 0$, thus avoiding the divide-by-zero error in (3).
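The smoothed relevance in (7) can be illustrated with a short Python sketch. It assumes that each object's keywords are given as a list of strings, that tf is a raw occurrence count, and that idf is the reciprocal of the number of objects containing the keyword, as described above. The helper names and the value c = 1e-6 are hypothetical.

```python
from collections import Counter


def idf_table(objects):
    """idf(w) = 1 / (number of objects whose keyword list contains w)."""
    doc_freq = Counter()
    for keywords in objects:
        doc_freq.update(set(keywords))  # count each object at most once per word
    return {w: 1.0 / n for w, n in doc_freq.items()}


def relevance(query_keywords, object_keywords, idf, c=1e-6):
    """Smoothed relevance as in (7): sum of tf * idf over query keywords, plus c."""
    tf = Counter(object_keywords)
    return sum(tf[w] * idf.get(w, 0.0) for w in query_keywords) + c


objects = [["hotel", "pool"], ["hotel", "wifi", "wifi"], ["cafe"]]
idf = idf_table(objects)
print(relevance(["wifi"], objects[1], idf))  # tf = 2, idf = 1.0, plus c
print(relevance(["wifi"], objects[2], idf))  # no match: only c remains, still non-zero
```

Because the result is never zero, it can safely be used as the denominator of the ratio form (3).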
The spatial-keyword k neighbor query is a single query, while the moving spatial-keyword k neighbor query is a continuous query:
Definition 2. (Moving top-k Spatial Keywords Nearest Neighbor Query) Given a data set D and an initial query q = 〈λ, φ, k〉, the Moving top-k Spatial Keywords (MkSK) query is a process that continuously updates the query result N as q.λ changes. Throughout this process, the elements in N always satisfy the constraint of Definition 1.
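To make Definitions 1 and 2 concrete, the sketch below answers a query by exhaustive scan and simply recomputes the result at each new position. It is a naive baseline rather than the method of this paper. The dictionary layout, the precomputed `rel` field (standing in for the keyword relevance), and the function names are assumptions.

```python
import heapq
import math


def weighted_distance(q_loc, obj):
    """Ratio form: spatial distance divided by a precomputed positive relevance."""
    (x, y), rel = obj["loc"], obj["rel"]
    return math.hypot(x - q_loc[0], y - q_loc[1]) / rel


def top_k(q_loc, objects, k):
    """Definition 1 by exhaustive scan: the k objects with the smallest f(q, p)."""
    return heapq.nsmallest(k, objects, key=lambda o: weighted_distance(q_loc, o))


objects = [
    {"id": "a", "loc": (1.0, 0.0), "rel": 1.0},
    {"id": "b", "loc": (2.0, 0.0), "rel": 4.0},
    {"id": "c", "loc": (9.0, 0.0), "rel": 1.0},
]

# Definition 2, naive version: recompute N at every reported position.
for q_loc in [(0.0, 0.0), (8.0, 0.0)]:
    print(q_loc, [o["id"] for o in top_k(q_loc, objects, 2)])
# (0.0, 0.0) ['b', 'a']
# (8.0, 0.0) ['c', 'b']
```

Every position update costs a full scan of the data set here, which is the cost that the safe-region and influential-set methods aim to avoid.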
The moving top-k spatial keyword nearest neighbor query is a kind of moving spatial neighbor query. In practical applications, a client-server architecture is often used to handle moving spatial neighbor queries. Clients are generally mobile devices with weak computing and storage capabilities, such as mobile phones and onboard computers. During query processing, the main computation and data storage rely on a powerful central server.
The simplest way to process moving spatial k neighbor queries is to recalculate N at each update of q.λ. Because of its high computation and communication costs, this approach is not feasible in practice. At present, the two main categories of moving k neighbor query processing methods are based on the Safe Region (SR) and the Influential Set (IS) [
1) Based on the Safe Region (SR)
This method calculates a safe region for the current query result N. While the query point q is located in the region, the result set N is guaranteed to be correct. When q leaves the region, N and a new safe region must be recalculated.
The computational cost of this method includes a) the cost of checking the validity of the SR when q is updated (on the client) and b) the update cost of the SR (on the server). The SR update cost on the server is determined by the SR update frequency (denoted sf) and the average amount of computation per SR calculation (denoted sc), and is O(sf ⋅ sc). Lowering the SR update frequency requires calculating a more accurate SR boundary; that is, lowering sf increases sc. Conversely, lowering sc makes the safe region less accurate, and the safe region must then shrink to guarantee correctness, which increases sf. Therefore, it is difficult to optimize both sf and sc at the same time.
The literature [
2) Based on the Influential Set (IS)
This method finds the object p_n with the largest distance from q in the current k nearest neighbor result set N, and the object p_i with the smallest distance from q in the influential set IS(N) of N. N is valid if and only if f(q, p_n) ≤ f(q, p_i). The result set is always updated less frequently with the influential-set-based method than with the safe-region-based method. Moreover, when keywords are not involved, the average amount of computation for each new result set and its influential set can be reduced by pre-computing the Voronoi diagram, so the overall efficiency is better than that of the safe-region-based method [
In keyword search, however, the number of keywords is large and different keywords correspond to different Voronoi diagrams, so the keyword nearest neighbor query cannot be optimized with a pre-computed Voronoi diagram.
To the best of our knowledge, no existing method for the moving top-k spatial keyword nearest neighbor query is based on the influential set.
First, we extend the definition of impact set [
Definition 3. (Keyword influential set) Given a query q at its initial position and the result set N of its keyword k nearest neighbor query, the keyword influential set IS(N) ⊆ D of N is a set of objects satisfying
$$NN_k(q) = N \iff \forall p' \in N,\ \forall p'' \in IS(N),\ \frac{\lVert q.\lambda, p'.\lambda \rVert}{tr_{q.\varphi}(p'.\varphi)} \le \frac{\lVert q.\lambda, p''.\lambda \rVert}{tr_{q.\varphi}(p''.\varphi)}, \tag{8}$$
where $NN_k(q)$ is the set of keyword k nearest neighbors of q at its current position, and
$$tr_{q.\varphi}(p.\varphi) = \sum_{\varphi \in q.\varphi} \left( tf(\varphi, p.\varphi) \cdot idf(\varphi) \right). \tag{9}$$
Without considering keywords, the spatial k nearest neighbors of q can be found sequentially, from nearest to farthest, in the R*-tree [
Determining the search scope for N is the first problem that must be solved for the keyword k nearest neighbor query. Theorem 1 shows that there exists a circular region, centered at q.λ and with radius equal to the distance from q.λ to the most distant object that may become a keyword k nearest neighbor, such that the keyword k nearest neighbors of q all lie in this region.
Theorem 1. Given a set of objects D and a query q, consider any circle centered at q.λ that contains more than k objects. Let C be the set of all objects within the circle and $N_C$ the keyword k nearest neighbor set of q in C, and let
$$p_k = \arg\max_{p \in N_C} \frac{\lVert q.\lambda, p.\lambda \rVert}{tr_{q.\varphi}(p.\varphi)}, \tag{10}$$
$$tr_{q.\varphi}^{\max} = \max_{p \in D} tr_{q.\varphi}(p.\varphi). \tag{11}$$
If
$$\exists p_f \in C,\ tr_{q.\varphi}^{\max} \le tr_{q.\varphi}(p_k.\varphi) \cdot \frac{\lVert q.\lambda, p_f.\lambda \rVert}{\lVert q.\lambda, p_k.\lambda \rVert}, \tag{12}$$
then $N_C$ equals N, the keyword k nearest neighbor result set of q in D.
Proof: We prove the theorem by contradiction. Suppose that $N_C$, the first k objects of C sorted by weighted distance to q, is not the keyword k nearest neighbor set of q in D. That is, there is an object p′ that is a keyword k nearest neighbor of q in D but does not belong to $N_C$:
$$\exists p' \in D \setminus N_C,\ \exists p'' \in N_C,\ \frac{\lVert q.\lambda, p''.\lambda \rVert}{tr_{q.\varphi}(p''.\varphi)} > \frac{\lVert q.\lambda, p'.\lambda \rVert}{tr_{q.\varphi}(p'.\varphi)}. \tag{13}$$
We discuss the two cases p′ ∈ C and p′ ∈ D \ C separately.
1) p′ ∈ C
According to the definition of $N_C$,
$$\forall p \in C \setminus N_C,\ \forall p'' \in N_C,\ \frac{\lVert q.\lambda, p''.\lambda \rVert}{tr_{q.\varphi}(p''.\varphi)} \le \frac{\lVert q.\lambda, p.\lambda \rVert}{tr_{q.\varphi}(p.\varphi)}. \tag{14}$$
Formula (13) contradicts Formula (14).
2) p′ ∈ D \ C
It is known from (10) that
$$\forall p'' \in N_C,\ \frac{\lVert q.\lambda, p''.\lambda \rVert}{tr_{q.\varphi}(p''.\varphi)} \le \frac{\lVert q.\lambda, p_k.\lambda \rVert}{tr_{q.\varphi}(p_k.\varphi)}. \tag{17}$$
From (12),
$$\frac{\lVert q.\lambda, p_k.\lambda \rVert}{tr_{q.\varphi}(p_k.\varphi)} \le \frac{\lVert q.\lambda, p_f.\lambda \rVert}{tr_{q.\varphi}^{\max}}. \tag{18}$$
Since p′ ∉ C and p_f ∈ C, we have $\lVert q.\lambda, p_f.\lambda \rVert < \lVert q.\lambda, p'.\lambda \rVert$, so
$$\frac{\lVert q.\lambda, p_f.\lambda \rVert}{tr_{q.\varphi}^{\max}} < \frac{\lVert q.\lambda, p'.\lambda \rVert}{tr_{q.\varphi}^{\max}}. \tag{19}$$
From (11),
$$\frac{\lVert q.\lambda, p'.\lambda \rVert}{tr_{q.\varphi}^{\max}} \le \frac{\lVert q.\lambda, p'.\lambda \rVert}{tr_{q.\varphi}(p'.\varphi)}. \tag{20}$$
Combining (17)-(20) gives
$$\forall p'' \in N_C,\ \frac{\lVert q.\lambda, p''.\lambda \rVert}{tr_{q.\varphi}(p''.\varphi)} < \frac{\lVert q.\lambda, p'.\lambda \rVert}{tr_{q.\varphi}(p'.\varphi)}. \tag{21}$$
Formula (13) contradicts Formula (21).
Combining cases 1) and 2) proves the proposition.
Theorem 1 indicates that there is a circular region in which the keyword k neighbor query result set of q is equivalent to the query result set in the entire space. Theorem 2 points out that we can find a minimal circular region that satisfies the condition of Theorem 1 by continuously expanding the radius.
Theorem 2. Given an object set D and a query q, consider the circle centered at q.λ with radius r. There exists an object p ∈ D such that, when $r < \lVert q.\lambda, p.\lambda \rVert$, no object in the circle satisfies the inequality in Theorem 1, and when $r \ge \lVert q.\lambda, p.\lambda \rVert$, there is always an object in the circle that satisfies the inequality in Theorem 1.
Proof: Let C be the set of objects in the circle and N the keyword k nearest neighbor result set of q in D, and let
$$p_k = \arg\max_{p \in N} \frac{\lVert q.\lambda, p.\lambda \rVert}{tr_{q.\varphi}(p.\varphi)}, \tag{24}$$
$$r' = tr_{q.\varphi}^{\max} \cdot \frac{\lVert q.\lambda, p_k.\lambda \rVert}{tr_{q.\varphi}(p_k.\varphi)}, \tag{25}$$
$$p_f = \arg\min_{p \in D,\ \lVert q.\lambda, p.\lambda \rVert \ge r'} \lVert q.\lambda, p.\lambda \rVert. \tag{26}$$
When $r < \lVert q.\lambda, p_f.\lambda \rVert$, it follows from (26) that
$$\forall p \in C,\ \lVert q.\lambda, p.\lambda \rVert < r'. \tag{28}$$
Combining (25) and (28) shows that no object in C satisfies the inequality in Theorem 1. When $r \ge \lVert q.\lambda, p_f.\lambda \rVert$, the object p_f lies in C and satisfies the inequality by (25) and (26).
In summary, the proposition is established.
After the keyword k neighbor result set N of q is generated, Theorem 3 bounds the region in which its keyword influential set IS(N) lies. The objects in this region, excluding those in N, constitute IS(N).
Theorem 3. Given a set of objects D and a query q, let N be the keyword k nearest neighbor result set of q, and let $p_k$, $tr_{q.\varphi}^{\max}$ and $p_f$ be defined as in (10), (11) and (26). Let
$$r = \max\left\{ \lVert q.\lambda, p_f.\lambda \rVert,\ 2 \cdot tr_{q.\varphi}^{\max} \cdot \frac{\lVert q.\lambda, p_k.\lambda \rVert}{tr_{q.\varphi}(p_k.\varphi)},\ 2 \cdot \max_{p \in N} \lVert q.\lambda, p.\lambda \rVert \right\}, \tag{29}$$
and let C be the set of all objects inside the circle centered at q.λ with radius r. Then
$$NN_k(q) = N \iff \forall p' \in N,\ \forall p'' \in C \setminus N,\ \frac{\lVert p'.\lambda, q.\lambda \rVert}{tr_{q.\varphi}(p'.\varphi)} \le \frac{\lVert p''.\lambda, q.\lambda \rVert}{tr_{q.\varphi}(p''.\varphi)}. \tag{30}$$
Prove: Combining the definition of the impact set in [
∪ p ∈ N D N q . φ ( p ) ⊆ C , (31)
where $N_{q.\varphi}(p)$ represents the set of neighbors of the object p in the keyword Voronoi diagram corresponding to q.φ.
We prove this by contradiction. Suppose there is an object p′ ∈ D \ C such that
$$\exists p \in N,\ p' \in N_{q.\varphi}(p). \tag{32}$$
According to the definition of Voronoi’s neighbors [
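$$\frac{\lVert q.\lambda, p'.\lambda \rVert}{tr_{q.\varphi}(p'.\varphi)} \le \frac{\lVert q.\lambda, p'.\lambda \rVert}{\min\{ tr_{q.\varphi}(p'.\varphi),\ tr_{q.\varphi}(p.\varphi) \}} + \frac{\lVert q.\lambda, p.\lambda \rVert}{tr_{q.\varphi}(p.\varphi)} \tag{33}$$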
According to (17) there is ‖ p ′ . λ , p . λ ‖ ≥ ‖ q . λ , p . λ ‖ , at the same time t r q . φ max ≥ t r q . φ ( p ′ . φ ) , so
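$$\frac{\lVert q.\lambda, p'.\lambda \rVert}{tr_{q.\varphi}^{\max}} \le \frac{\lVert q.\lambda, p'.\lambda \rVert}{tr_{q.\varphi}(p'.\varphi)} \le \frac{2 \lVert q.\lambda, p.\lambda \rVert}{tr_{q.\varphi}(p.\varphi)} \le \frac{2 \lVert q.\lambda, p_k.\lambda \rVert}{tr_{q.\varphi}(p_k.\varphi)}, \tag{34}$$
that is,
$$\lVert q.\lambda, p'.\lambda \rVert \le 2 \cdot tr_{q.\varphi}^{\max} \cdot \frac{\lVert q.\lambda, p_k.\lambda \rVert}{tr_{q.\varphi}(p_k.\varphi)}. \tag{35}$$
Combining (29) and (35) gives p′ ∈ C, which contradicts p′ ∈ D \ C.
This proves the theorem.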
Based on the above theorems, this section proposes the Parallel Moving top-k Spatial Keyword (PMkSK) algorithm, which is based on the influential set. Since the Voronoi diagram of the keywords cannot be pre-computed, directly applying the influential set method to moving top-k spatial keyword queries would be very inefficient. However, we observe that the computation-intensive steps in calculating the influential set do not depend on one another. We decompose these steps and design a parallel algorithm that generates the k nearest neighbors and their influential set.
In the client-server architecture, the client first sends a query to the server; the server generates the query result set N and its influential set IS(N) and returns them to the client. Whenever its location changes, the client verifies the validity of N by checking IS(N) against its current location. If N is found to have expired, the client sends a new request to the server, and the process repeats until the user stops the query. For this architecture, we designed two separate algorithms: a generation algorithm that runs on the server and computes the keyword k nearest neighbor result set N and its influential set IS(N), and a verification algorithm that runs on the client and uses IS(N) to verify N.
With q.λ as the center, the circular search scope is expanded continuously until it contains an object p_f satisfying (12). According to Theorems 1 and 2, the keyword k nearest neighbors of q in the circular region are then the final query result N. From N and p_f, we calculate the radius r in Equation (29) and expand the search range to the circle of radius r. According to Theorem 3, the objects in this circle, excluding those in N, form IS(N).
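The generation procedure just described can be sketched sequentially in Python. This is only an illustration under stated assumptions, not the parallel algorithm of the paper: objects are dictionaries with a location `loc` and a precomputed positive relevance `rel`, the circle is grown by visiting objects in order of spatial distance instead of traversing an R*-tree, and the data set is assumed to hold at least k objects. The stopping test is inequality (12) rearranged, and the radius follows (29); the object at which the loop stops may be farther away than the p_f of (26), which only makes the returned set more conservative.

```python
import math


def dist(q_loc, p):
    """Spatial distance between the query location and an object."""
    return math.hypot(p["loc"][0] - q_loc[0], p["loc"][1] - q_loc[1])


def generate(q_loc, objects, k):
    """Return (N, IS) for a query at q_loc (sequential sketch, len(objects) >= k)."""
    def f(p):
        return dist(q_loc, p) / p["rel"]          # ratio weighted distance, form (3)

    tr_max = max(p["rel"] for p in objects)       # largest relevance in D, as in (11)
    circle = []
    for p_f in sorted(objects, key=lambda p: dist(q_loc, p)):
        circle.append(p_f)                        # grow the circle by one object
        if len(circle) < k:
            continue
        n_c = sorted(circle, key=f)[:k]           # keyword k-NN inside the circle
        r_prime = tr_max * f(n_c[-1])             # n_c[-1] plays the role of p_k
        if dist(q_loc, p_f) >= r_prime:           # inequality (12), rearranged
            break
    # Radius from (29), using the last p_f, r' and N found above.
    r = max(dist(q_loc, p_f), 2 * r_prime, 2 * max(dist(q_loc, p) for p in n_c))
    in_n = {id(p) for p in n_c}
    influential = [p for p in objects
                   if dist(q_loc, p) <= r and id(p) not in in_n]
    return n_c, influential


objects = [
    {"id": "a", "loc": (1.0, 0.0), "rel": 1.0},
    {"id": "b", "loc": (2.0, 0.0), "rel": 4.0},
    {"id": "c", "loc": (3.0, 0.0), "rel": 1.0},
    {"id": "d", "loc": (10.0, 0.0), "rel": 2.0},
]
n, inf = generate((0.0, 0.0), objects, 2)
print([p["id"] for p in n], [p["id"] for p in inf])  # ['b', 'a'] ['c', 'd']
```

In this sketch the whole data set is sorted up front for simplicity; an index-based traversal would instead retrieve objects lazily in distance order.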
In the above process, when the circular search scope is expanded, multiple objects newly added in the circle can be judged and sorted in parallel. The algorithm uses the asynchronous parallel random access machine (APRAM) [
In the algorithm, the linked list L can store spatial objects or MBR [
1) If the retrieved element is an MBR, then the internal elements of the MBR are first sorted using an odd-even sorting network [
2) If the retrieved element is an object, remove it and put it into the list C. Objects in C are kept sorted by their weighted distance to q, nearest first.
The CanTerminate(C) function calculates r′ from the first k nearest neighbors that C already contains and compares r′ with the distance between q.λ and the first element in T. According to Theorem 2, if r′ is the smaller of the two, C already contains the k nearest neighbors, so the loop can end. At this point, the first k elements in C are the result set N.
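A possible shape of this termination test is sketched below in Python. Since the algorithm listing is not reproduced here, the signature is an assumption: `candidates` stands for C as a list of (weighted distance, object id) pairs sorted nearest first, `next_dist` stands for the distance from q.λ to the first element of T, and `tr_max` is the maximum relevance from (11). The value r′ is computed as in (25), which equals `tr_max` times the weighted distance of the current k-th neighbor.

```python
def can_terminate(candidates, k, tr_max, next_dist):
    """Return True when no unvisited element can still enter the top-k.

    candidates: (weighted_distance, object_id) pairs sorted nearest first.
    next_dist:  spatial distance from the query to the nearest unvisited element.
    """
    if len(candidates) < k:
        return False                      # fewer than k results found so far
    f_k = candidates[k - 1][0]            # weighted distance of the current k-th neighbor
    r_prime = tr_max * f_k                # r' as in (25)
    return r_prime <= next_dist


found = [(0.5, "b"), (1.0, "a")]
print(can_terminate(found, 2, 4.0, 3.0))   # False: r' = 4.0 exceeds 3.0
print(can_terminate(found, 2, 4.0, 10.0))  # True: r' = 4.0 is within 10.0
```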
Given a result set N, its influential set IS(N), and the moved query q′ = 〈λ′, φ, k〉, Algorithm 2 uses IS(N) to verify, according to Definition 3, whether N is still valid for q′.
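The check behind this verification can be sketched sequentially in Python: N remains valid for the moved query as long as the largest weighted distance inside N does not exceed the smallest weighted distance inside IS(N). The dictionary layout, the precomputed `rel` field, and the treatment of an empty IS(N) as "valid" are assumptions of this sketch; the parallel version would replace the `max` and `min` scans with parallel reductions.

```python
import math


def f(q_loc, p):
    """Ratio weighted distance; 'rel' is the object's positive keyword relevance."""
    return math.hypot(p["loc"][0] - q_loc[0], p["loc"][1] - q_loc[1]) / p["rel"]


def is_valid(q_loc, result, influential):
    """N stays valid while its worst member is no farther than the best outsider."""
    if not influential:
        return True  # assumption: no competing objects means nothing can displace N
    worst_inside = max(f(q_loc, p) for p in result)
    best_outside = min(f(q_loc, p) for p in influential)
    return worst_inside <= best_outside


N = [{"loc": (1.0, 0.0), "rel": 1.0}, {"loc": (2.0, 0.0), "rel": 4.0}]
IS = [{"loc": (3.0, 0.0), "rel": 1.0}, {"loc": (10.0, 0.0), "rel": 2.0}]
print(is_valid((0.0, 0.0), N, IS))  # True
print(is_valid((4.0, 0.0), N, IS))  # False
```

When the check fails, the client would send a new request to the server for a fresh N and IS(N).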
Assuming the spatial keywords are evenly distributed, the regular Best-First algorithm requires a time complexity of O(log|D| + q.k) [
$$O\left(\frac{1}{q.k}\left(\log|D| + q.k\right)\right).$$
The number of objects in the first and second rows is O(q.k), and the time complexity required for taking the minimum and maximum values using the parallel balanced tree method is O(log m), where m is the number of objects [
The experiment uses real position data sets HOTEL and GN. The data set details are shown in
The experiment uses the state-of-the-art MSK-uvr moving keyword k nearest neighbor query processing method [
Observing
The above experimental results show that the PMkSK algorithm in this paper can take full advantage of parallel processing on a multi-core processor platform.
| Data set | Number of spatial objects | Number of keywords | Average number of keywords per object |
|---|---|---|---|
| HOTEL | 21,021 | 602 | 3 |
| GN | 1,868,821 | 222,407 | 4 |
At the same time, because the PMkSK algorithm adopts an update verification method based on the influential set rather than the safe region, its update cost and update frequency are better than those of the MSK-uvr algorithm.
This paper proposes a moving keyword k nearest neighbor query processing method based on the influential set, which avoids the inherent disadvantage of safe-region-based moving k nearest neighbor query processing methods that the update cost and the update frequency cannot both be optimized. A parallel query algorithm is designed to calculate the k nearest neighbor result set and obtain its influential set. The time complexity of the proposed server-side parallel query algorithm and client-side parallel verification algorithm is O(1/k) times that of the existing method. The experimental results show that the proposed parallel method is better suited to the widely used multi-core servers than the existing methods designed for single-core systems.
This work is supported by the National Key R & D Program of China (2016YFC0801607), the National Nature Science Foundation of China (61872071, 61872070), the Fundamental Research Funds for the Central Universities (N171604008, N171605001), and the Ministry of Education Joint Foundation for Equipment Pre-Research (6141A020333).
The authors declare no conflicts of interest regarding the publication of this paper.
Chen, K.L., Liu, Y.R. and Deng, Q.X. (2019) A Parallel Processing Method for Moving Top-K Spatial Keyword Query. Journal of Software Engineering and Applications, 12, 72-84. https://doi.org/10.4236/jsea.2019.124006