InProgress/tensors_prog/eng/001intro.tex (+1 -3: 1 addition & 3 deletions)
@@ -51,9 +51,7 @@
51
51
\begin{enumerate}
52
52
\item We rethink and improve the CFPQ algorithm based on tensor-product proposed by~\cite{10.1007/978-3-030-54832-2_6}.
53
53
We reduce this algorithm to operations over Boolean matrices.
54
-
As a result, all-path query semantics is handled, as opposed to the previous matrix-based solution capable of handling only the single-path semantics.
55
-
Best to our knowledge, our algorithm is the first CFPQ algorithm based on linear algebra which is capable to handle all-path query semantics.
56
-
Also, both regular and context-free grammars can be used as queries.
54
+
As a result, all-path query semantics is handled. Also, both regular and context-free grammars can be used as queries.
57
55
\item
58
56
We prove the correctness of the proposed algorithm and analyze its time complexity, thus providing an upper bound on the complexity of the CFPQ problem in terms of the size of the query (its context-free grammar) and the number of vertices in the input graph.
59
57
The proposed algorithm has subcubic complexity in terms of the grammar and the input graph sizes, which is comparable with the state-of-the-art solutions.
InProgress/tensors_prog/eng/002prelim.tex (+5 -5: 5 additions & 5 deletions)
@@ -4,7 +4,7 @@
4
4
5
5
In this section, we introduce some basic notation and definitions from graph theory and formal language theory which will be used in the rest of the paper.
We use a directed edge-labeled graph as a data model.
10
10
To introduce the \term{Language-Constraint Path Querying Problem}~\cite{barrett2000formal} over directed edge-labeled graphs we first give definitions for both languages and grammars.
Note that $\Pi$ can be infinite; thus, in practice, we should provide a way to build a finite representation of such paths with reasonable complexity instead of constructing $\Pi$ explicitly.
108
108
109
-
\subsection{Regular Path Queries and Finite State Machine}
109
+
\he{Regular Path Queries and Finite State Machine}
110
110
111
111
In \term{Regular Path Querying} (RPQ) the language $\mathcal{L}$ is regular.
112
112
This case is widespread and well-studied.
@@ -171,7 +171,7 @@ \subsection{Regular Path Queries and Finite State Machine}
171
171
Thus RPQ evaluation is an intersection of two FSMs.
172
172
The query result can also be represented as an FSM because regular languages are closed under intersection~\cite{automata:theory:10.5555/1177300}.
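As an illustrative sketch of this intersection (the two-state machines below are hypothetical and are not taken from the paper), assume that the query FSM and the graph each have two states and a single label $a$, with Boolean adjacency matrices $Q^a$ and $G^a$. Assuming the intersection is built label by label with the Kronecker product $\otimes$, the product machine for $a$ is:
\begin{align*}
Q^a =
\begin{pmatrix}
0 & 1 \\
0 & 0
\end{pmatrix},
\quad
G^a =
\begin{pmatrix}
0 & 1 \\
1 & 0
\end{pmatrix},
\quad
Q^a \otimes G^a =
\begin{pmatrix}
0 & 0 & 0 & 1 \\
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0
\end{pmatrix}
\end{align*}
Here row and column $2q + v$ correspond to the pair of query state $q$ and graph vertex $v$, so the product contains exactly the transitions $(0,0) \rightarrow (1,1)$ and $(0,1) \rightarrow (1,0)$: both machines move along $a$ simultaneously.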
173
173
174
-
\subsection{Context-Free Path Querying and Recursive State Machines}
174
+
\he{Context-Free Path Querying and Recursive State Machines}
175
175
176
176
An even more general case than RPQ is a \term{Context-Free Path Querying Problem (CFPQ)}, where one can use context-free languages as constraints.
177
177
These constraints are more expressive than regular ones.
@@ -310,7 +310,7 @@ \subsection{Context-Free Path Querying and Recursive State Machines}
310
310
$$
311
311
312
312
Matrix $M_1$ can be represented as a set of Boolean matrices as follows:
313
-
{\small
313
+
{\scriptsize
314
314
\begin{align*}
315
315
M_1^S =
316
316
\begin{pmatrix}
@@ -338,7 +338,7 @@ \subsection{Context-Free Path Querying and Recursive State Machines}
338
338
Also, an RSM can be viewed as an FSM over $\Sigma\cup N$.
339
339
In this work, we use this point of view to propose a unified algorithm to evaluate both regular and context-free path queries with zero overhead for regular queries.
340
340
341
-
\subsection{Graph Kronecker Product and Machines Intersection}
341
+
\he{Graph Kronecker Product and Machines Intersection}
342
342
343
343
In this section, we introduce the classic Kronecker product definition,
344
344
describe graph Kronecker product and its relation to Boolean matrices algebra,
InProgress/tensors_prog/eng/003_algorithm_example.tex (+3 -1: 3 additions & 1 deletion)
@@ -1,4 +1,4 @@
1
-
\subsection{An example}
1
+
\he{An example}
2
2
\label{example:section}
3
3
In this section, we introduce a detailed example to demonstrate the steps taken by the proposed algorithms.
4
4
Namely, consider the graph $\mathcal{G}$ presented in Figure~\ref{fig:example_input_graph} and the RSM $R$ presented in Figure~\ref{example:automata}.
@@ -16,6 +16,7 @@ \subsection{An example}
16
16
$\mathcal{M}_1$ and $\mathcal{M}_{2,(0)}$ matrices and collapse the result to the single Boolean matrix
17
17
$M_{3,(1)}$. For the sake of simplicity, we provide only
18
18
$M_{3,(1)}$, which is evaluated as follows.
19
+
\small
19
20
{
20
21
\renewcommand{\arraystretch}{0.5}
21
22
\setlength\arraycolsep{0.1pt}
@@ -76,6 +77,7 @@ \subsection{An example}
76
77
corresponding matrix block in the evaluated matrix $M_{3,{2}}$. The transitive closure
77
78
evaluation introduces three new paths $(0, 1) \rightarrow (2,1), (1, 0) \rightarrow (3,1)$ and $(0, 1) \rightarrow (3,1)$ (see Figure~\ref{fig:example_2_product}). Since only the path between vertices $(0,1)$ and
78
79
$(3,1)$ connects the start and final states in the automaton, the edge $(1,S,1)$ is added to the resulting graph.
InProgress/tensors_prog/eng/003_index_creation_algorithm.tex (+4 -3: 4 additions & 3 deletions)
@@ -1,4 +1,4 @@
1
-
\subsection{Index Creation Algorithm}
1
+
\he{Index Creation Algorithm}
2
2
3
3
The \textit{index creation} algorithm outputs the final adjacency matrix for the input graph with all pairs of vertices which are reachable through some nonterminal in the input grammar $G$, as well as the index matrix which is to be used to extract paths in the \textit{path extraction} algorithm.
\subsubsection{Application of Dynamic Transitive Closure}
82
+
83
+
\textbf{Application of Dynamic Transitive Closure.}
83
84
The most time-consuming steps of the algorithm are the computations of the Kronecker product and transitive closure.
84
85
Note that the adjacency matrix $\mathcal{M}_2$ is changed incrementally, i.e., elements (edges) are added to $\mathcal{M}_2$ at each iteration of the algorithm and are never deleted from it.
85
86
So it is not necessary to recompute the whole product or transitive closure if some appropriate data structure is maintained.
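A minimal sketch of why recomputation can be avoided, using one standard insert-only update rule (an assumption made for illustration, not necessarily the data structure used in this work): let $T$ be the Boolean transitive closure matrix before a new edge $(u, v)$ is added, and let $T^{=} = T \lor I$ be its reflexive version. Then the updated closure $T'$ satisfies
\begin{align*}
T'_{ij} = T_{ij} \lor \left( T^{=}_{iu} \land T^{=}_{vj} \right).
\end{align*}
Indeed, any new path from $i$ to $j$ must reach $u$ along old edges (or start at $u$), traverse the new edge, and finally get from $v$ to $j$ along old edges (or end at $v$). Under this rule, each inserted edge requires only lookups in column $u$ and row $v$ of $T$, i.e., $O(n^2)$ Boolean operations for $n$ vertices, instead of a full recomputation of the closure.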
@@ -159,7 +160,7 @@ \subsubsection{Application of Dynamic Transitive Closure}
159
160
%one to express it in terms of basic matrix operations.
160
161
% TODO: more accurate upper bound for the algorithm complexity
161
162
162
-
\subsubsection{Index creation for RPQ}
163
+
\textbf{Index creation for RPQ.}
163
164
In the case of the RPQ, the main \textbf{while} loop takes only one iteration to actually append data.
164
165
Since the input query is provided in the form of the regular expression, one can construct the corresponding RSM which consists of the single \textit{component state machine}.
165
166
This CSM is built from the regular expression and is labeled as $S$, for example, and has no \textit{recursive calls}.
InProgress/tensors_prog/eng/004evaluation.tex (+18 -18: 18 additions & 18 deletions)
@@ -12,21 +12,21 @@
12
12
We only measure the execution time of the algorithms themselves, thus we assume an input graph is loaded into RAM in the form of its adjacency matrix in the sparse format.
13
13
Note that the time needed to load an input graph into RAM is excluded from the time measurements.
14
14
15
-
\subsection{RPQ Evaluation}
15
+
\he{RPQ Evaluation}
16
16
17
17
To investigate the applicability of the proposed algorithm for regular path querying we gathered a dataset that consists of both real-world and synthetically generated graphs.
18
18
We generated the queries from the most popular RPQ templates.
19
19
20
-
\subsubsection{Dataset}
20
+
\he{Dataset}
21
21
22
22
We gathered several graphs that represent real-world data from different areas and are frequently used for the evaluation of the graph querying algorithms.
23
23
Namely, the dataset consists of three parts.
24
24
The first part is the set of LUBM graphs\footnote{Lehigh University Benchmark (LUBM) web page: \url{http://swat.cse.lehigh.edu/projects/lubm/}. Access date: 07.07.2020.}~\cite{10.1016/j.websem.2005.06.005} which have different numbers of vertices.
25
25
The second one is the set of graphs from the UniProt database\footnote{Universal Protein Resource (UniProt) web page: \url{https://www.uniprot.org/}. All files used can be downloaded via the link: \url{ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/rdf/}. Access date: 07.07.2020.}: \textit{proteomes}, \textit{taxonomy} and \textit{uniprotkb}.
26
-
The~last part consists of the RDF files \textit{mappingbased\_properties} from DBpedia\footnote{DBpedia project web site: \url{https://wiki.dbpedia.org/}. Access date: 07.07.2020.} and \textit{geospecies}\footnote{The Geospecies RDF: \url{https://old.datahub.io/dataset/geospecies}. Access date: 07.07.2020.}.
26
+
The~last part consists of the RDF files \textit{mappingbased\_properties} (\textit{mapping\_prop}) from DBpedia\footnote{DBpedia project web site: \url{https://wiki.dbpedia.org/}. Access date: 07.07.2020.} and \textit{geospecies}\footnote{The Geospecies RDF: \url{https://old.datahub.io/dataset/geospecies}. Access date: 07.07.2020.}.
27
27
A brief description of the graphs in the dataset is presented in Table~\ref{tbl:graphs_for_rpq}.
The most frequent relations from the given graph were used as symbols in the query template\footnote{The generator used is available as part of the CFPQ\_Data project: \url{https://github.com/JetBrains-Research/CFPQ_Data/blob/master/tools/gen_RPQ/gen.py}. Access date: 07.07.2020.}.
64
64
We used the same set of queries for all LUBM graphs to investigate the scalability of the proposed algorithm.
65
65
66
-
\begin{table}
66
+
\begin{table}[h]
67
67
\centering
68
68
\caption{Query templates for RPQ evaluation}
69
69
\label{tbl:queries_templates}
70
-
{\small
70
+
{\scriptsize
71
71
\renewcommand{\arraystretch}{1.2}
72
72
%\rowcolors{2}{black!2}{black!10}
73
73
\begin{tabular}{|c|c||c|c|}
@@ -96,7 +96,7 @@ \subsubsection{Dataset}
96
96
\end{table}
97
97
98
98
99
-
\subsubsection{Results}
99
+
\he{Results}
100
100
101
101
We averaged the execution time of index creation over 5 runs for each query.
102
102
Index creation time for LUBM graphs set is presented in Figure~\ref{fig:lubm_all_qs}.
@@ -105,7 +105,7 @@ \subsubsection{Results}
105
105
We conclude that our algorithm demonstrates reasonable performance to be applied to the real-world data analysis.
106
106
%\cho{Note that the accurate comparison of different approaches may be a promising direction of future research.}
On the other hand, querying \textit{taxonomy} in many cases requires significantly more time than querying other graphs, even though \textit{taxonomy} is not the biggest graph.
120
120
Finally, in most cases, query execution lasts less than 10 seconds, even for bigger graphs, and no query requires more than 52.17 seconds.
% \caption{Single path extraction for specific graph and query for our solution (\subref{fig:geo_tensors_rpq}, \%subref{fig:dbpedia_tensors_rpq}, \subref{fig:geo_tensors_cfpq}), and Azimov's (\subref{fig:geo_matrix_cfpq})}
164
164
%\end{figure}
165
165
166
-
\subsection{CFPQ Evaluation}
166
+
\he{CFPQ Evaluation}
167
167
168
168
We evaluate the applicability of the proposed algorithm to CFPQ processing over real-world graphs on a number of classic cases and compare it with Azimov's algorithm.
169
169
Currently, only a single-path version of Azimov's algorithm exists, and we use its PyGraphBLAS-based implementation. Note that it is not trivial to compare our results with the state-of-the-art results provided by~\cite{10.1145/3398682.3399163} (Azimov's algorithm) because our algorithm computes significantly more information. While the state-of-the-art solution computes only reachability facts or single-path semantics, our algorithm computes the data necessary to restore all possible paths.
170
170
171
-
\subsubsection{Dataset}
171
+
\he{Dataset}
172
172
173
173
We use CFPQ\_Data\footnote{CFPQ\_Data is a dataset for CFPQ evaluation which contains both synthetic and real-world data and queries \url{https://github.com/JetBrains-Research/CFPQ\_Data}. Access date: 07.07.2020.} dataset for evaluation.
174
174
Namely, we use relatively big RDF files and respective same-generation queries $G_1$~(Eq.~\ref{eqn:g_1}) and $G_2$~(Eq.~\ref{eqn:g_2}) which are used in other works for CFPQ evaluation.
@@ -206,7 +206,7 @@ \subsubsection{Dataset}
206
206
The detailed data about all the graphs used is presented in Table~\ref{tbl:graphs_for_cfpq}.
207
207
208
208
{\setlength{\tabcolsep}{0.2em}
209
-
\begin{table}
209
+
\begin{table*}[h]
210
210
\centering
211
211
{
212
212
\caption{Graphs for CFPQ evaluation: \textit{bt} is broaderTransitive, \textit{sco} is subClassOf}
@@ -232,14 +232,14 @@ \subsubsection{Dataset}
232
232
\hline
233
233
\end{tabular}
234
234
}
235
-
\end{table}
235
+
\end{table*}
236
236
}
237
-
\subsubsection{Results}
237
+
\he{Results}
238
238
239
239
We averaged the index creation time over 5 runs for both single-path Azimov's algorithm (\textbf{Mtx}) and the proposed algorithm (\textbf{Tns}) (see Table~\ref{tbl:CFPQ_index}).
240
240
241
241
{\setlength{\tabcolsep}{0.2em}
242
-
\begin{table}
242
+
\begin{table*}[h]
243
243
\centering
244
244
\caption{CFPQ evaluation results, time is measured in seconds}
We can see that while in some cases our solution is comparable to or just slightly better than Azimov's algorithm (\textit{enzyme, eclass\_514en, go}), there are cases when our solution is significantly faster (\textit{go-hierarchy}, up to 9 times faster), and cases when Azimov's algorithm is about 1.3 times faster (all memory aliases and \textit{geospecies} with the \textit{Geo} query).
@@ -285,7 +285,7 @@ \subsubsection{Results}
285
285
%While both methods demonstrate linear time on the length of the extracted path, our generic solution is more than 1000 times slower than Azimov's single path extraction procedure.
286
286
%We conclude that current generic all-path extraction procedure is not optimal for single path extraction.
287
287
288
-
\subsection{Conclusion}
288
+
\he{Conclusion}
289
289
290
290
We conclude that the proposed algorithm is applicable to real-world data processing: the algorithm allows one both to solve the reachability problem and to extract paths of interest in a reasonable time.
291
291
While index creation time (reachability query evaluation) is comparable with other existing solutions, the path extraction procedure should be improved in the future. However, the state-of-the-art solution computes only reachability facts or single-path semantics, whereas our algorithm computes the data necessary to restore all possible paths (all-paths semantics).