Commit bf90365

Text layout, update of contribution in intro.
1 parent ff74b45 commit bf90365

8 files changed, +32 -33 lines changed

InProgress/tensors_prog/eng/001intro.tex

Lines changed: 1 addition & 3 deletions
@@ -51,9 +51,7 @@
 \begin{enumerate}
 \item We rethink and improve the CFPQ algorithm based on tensor-product proposed by~\cite{10.1007/978-3-030-54832-2_6}.
 We reduce this algorithm to operations over Boolean matrices.
-As a result, all-path query semantics is handled, as opposed to the previous matrix-based solution capable of handling only the single-path semantics.
-Best to our knowledge, our algorithm is the first CFPQ algorithm based on linear algebra which is capable to handle all-path query semantics.
-Also, both regular and context-free grammars can be used as queries.
+As a result, all-path query semantics is handled. Also, both regular and context-free grammars can be used as queries.
 \item
 We prove the correctness and time complexity for the proposed algorithm thus providing an upper bound on the complexity of the CFPQ problem in dependence on the size of the query (its context-free grammar) and the number of vertices in the input graph.
 The proposed algorithm has subcubic complexity in terms of the grammar and the input graph sizes, which is comparable with the state-of-the-art solutions.

InProgress/tensors_prog/eng/002prelim.tex

Lines changed: 5 additions & 5 deletions
@@ -4,7 +4,7 @@
 
 In this section, we introduce some basic notation and definitions from graph theory and formal language theory which will be used in the rest of the paper.
 
-\subsection{Language-Constrained Path Querying Problem}
+\he{Language-Constrained Path Querying Problem}
 
 We use a directed edge-labeled graph as a data model.
 To introduce the \term{Language-Constraint Path Querying Problem}~\cite{barrett2000formal} over directed edge-labeled graphs we first give definitions for both languages and grammars.
@@ -106,7 +106,7 @@ \subsection{Language-Constrained Path Querying Problem}
 
 Note that $\Pi$ can be infinite, thus in practice, we should provide a way to build a finite representation of such paths with reasonable complexity, instead of explicit construction of the $\Pi$.
 
-\subsection{Regular Path Queries and Finite State Machine}
+\he{Regular Path Queries and Finite State Machine}
 
 In \term{Regular Path Querying} (RPQ) the language $\mathcal{L}$ is regular.
 This case is widespread and well-studied.
@@ -171,7 +171,7 @@ \subsection{Regular Path Queries and Finite State Machine}
 Thus RPQ evaluation is an intersection of two FSMs.
 The query result can also be represented as FSM because regular languages are closed under intersection~\cite{automata:theory:10.5555/1177300}.
 
-\subsection{Context-Free Path Querying and Recursive State Machines}
+\he{Context-Free Path Querying and Recursive State Machines}
 
 An even more general case than RPQ is a \term{Context-Free Path Querying Problem (CFPQ)}, where one can use context-free languages as constraints.
 These constraints are more expressive than regular ones.
@@ -310,7 +310,7 @@ \subsection{Context-Free Path Querying and Recursive State Machines}
 $$
 
 Matrix $M_1$ can be represented as a set of Boolean matrices as follows:
-{\small
+{\scriptsize
 \begin{align*}
 M_1^S =
 \begin{pmatrix}
@@ -338,7 +338,7 @@ \subsection{Context-Free Path Querying and Recursive State Machines}
 Also, an RSM can be viewed as an FSM over $\Sigma \cup N$.
 In this work, we use this point of view to propose a unified algorithm to evaluate both regular and context-free path queries with zero overhead for regular queries.
 
-\subsection{Graph Kronecker Product and Machines Intersection}
+\he{Graph Kronecker Product and Machines Intersection}
 
 In this section, we introduce the classic Kronecker product definition,
 describe graph Kronecker product and its relation to Boolean matrices algebra,

InProgress/tensors_prog/eng/003_algorithm_example.tex

Lines changed: 3 additions & 1 deletion
@@ -1,4 +1,4 @@
-\subsection{An example}
+\he{An example}
 \label{example:section}
 In this section, we introduce a detailed example to demonstrate the steps taken by the proposed algorithms.
 Namely, consider the graph $\mathcal{G}$ presented in Figure~\ref{fig:example_input_graph} and the RSM $R$ presented in Figure~\ref{example:automata}.
@@ -16,6 +16,7 @@ \subsection{An example}
 $\mathcal{M}_1$ and $\mathcal{M}_{2,(0)}$ matrices and collapse the result to the single Boolean matrix
 $M_{3,(1)}$. For the sake of simplicity, we provide only
 $M_{3,(1)}$, which is evaluated as follows.
+\small
 {
 \renewcommand{\arraystretch}{0.5}
 \setlength\arraycolsep{0.1pt}
@@ -76,6 +77,7 @@ \subsection{An example}
 corresponding matrix block in the evaluated matrix $M_{3,{2}}$. The transitive closure
 evaluation introduces three new paths $(0, 1) \rightarrow (2,1), (1, 0) \rightarrow (3,1)$ and $(0, 1) \rightarrow (3,1)$ (see Figure~\ref{fig:example_2_product}). Since only the path between vertices $(0,1)$ and
 $(3,1)$ connects the start and final states in the automaton, the edge $(1,S,1)$ is added to the resulting graph.
+\small
 {
 \renewcommand{\arraystretch}{0.5}
 \setlength\arraycolsep{0.1pt}

InProgress/tensors_prog/eng/003_index_creation_algorithm.tex

Lines changed: 4 additions & 3 deletions
@@ -1,4 +1,4 @@
-\subsection{Index Creation Algorithm}
+\he{Index Creation Algorithm}
 
 The \textit{index creation} algorithm outputs the final adjacency matrix for the input graph with all pairs of vertices which are reachable through some nonterminal in the input grammar $G$, as well as the index matrix which is to be used to extract paths in the \textit{path extraction} algorithm.
 
@@ -79,7 +79,8 @@ \subsection{Index Creation Algorithm}
 \EndFunction
 \end{algorithmic}
 \end{algorithm}
-\subsubsection{Application of Dynamic Transitive Closure}
+
+\textbf{Application of Dynamic Transitive Closure.}
 The most time-consuming steps of the algorithm are the computations of the Kronecker product and transitive closure.
 Note that the adjacency matrix $\mathcal{M}_2$ is changed incrementally i.e. elements (edges) are added to $\mathcal{M}_2$ at each iteration of the algorithm and are never deleted from it.
 So it is not necessary to recompute the whole product or transitive closure if some appropriate data structure is maintained.
@@ -159,7 +160,7 @@ \subsubsection{Application of Dynamic Transitive Closure}
 %one to express it in terms of basic matrix operations.
 % TODO: more accurate upper bound for the algorithm complexity
 
-\subsubsection{Index creation for RPQ}
+\textbf{Index creation for RPQ.}
 In the case of the RPQ, the main \textbf{while} loop takes only one iteration to actually append data.
 Since the input query is provided in the form of the regular expression, one can construct the corresponding RSM which consists of the single \textit{component state machine}.
 This CSM is built from the regular expression and is labeled as $S$, for example, and has no \textit{recursive calls}.

InProgress/tensors_prog/eng/003_paths_extraction_algorithm.tex

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
-\subsection{Paths Extraction Algorithm}
+\he{Paths Extraction Algorithm}
 After the index has been created, one can enumerate all paths between specified vertices.
 The index $M_3$ already stores data about all paths derivable from nonterminals.
 This data can be used to construct these paths. However, the set of such paths can be infinite.

InProgress/tensors_prog/eng/004evaluation.tex

Lines changed: 18 additions & 18 deletions
@@ -12,21 +12,21 @@
 We only measure the execution time of the algorithms themselves, thus we assume an input graph is loaded into RAM in the form of its adjacency matrix in the sparse format.
 Note, that the time needed to load an input graph into the RAM is excluded from the time measurements.
 
-\subsection{RPQ Evaluation}
+\he{RPQ Evaluation}
 
 To investigate the applicability of the proposed algorithm for regular path querying we gathered a dataset that consists of both real-world and synthetically generated graphs.
 We generated the queries from the most popular RPQ templates.
 
-\subsubsection{Dataset}
+\he{Dataset}
 
 We gathered several graphs that represent real-world data from different areas and are frequently used for the evaluation of the graph querying algorithms.
 Namely, the dataset consists of three parts.
 The first part is the set of LUBM graphs\footnote{Lehigh University Benchmark (LUBM) web page: \url{http://swat.cse.lehigh.edu/projects/lubm/}. Access date: 07.07.2020.}~\cite{10.1016/j.websem.2005.06.005} which have different numbers of vertices.
 The second one is the set of graphs from Uniprot database\footnote{Universal Protein Resource (UniProt) web page: \url{https://www.uniprot.org/}. All files used can be downloaded via the link: \url{ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/rdf/}. Access date: 07.07.2020.}: \textit{proteomes}, \textit{taxonomy} and \textit{uniprotkb}.
-The~last part consists of the RDF files \textit{mappingbased\_properties} from DBpedia\footnote{DBpedia project web site: \url{https://wiki.dbpedia.org/}. Access date: 07.07.2020.} and \textit{geospecies}\footnote{The Geospecies RDF: \url{https://old.datahub.io/dataset/geospecies}. Access date: 07.07.2020.}.
+The~last part consists of the RDF files \textit{mappingbased\_properties} (\textit{mapping\_prop}) from DBpedia\footnote{DBpedia project web site: \url{https://wiki.dbpedia.org/}. Access date: 07.07.2020.} and \textit{geospecies}\footnote{The Geospecies RDF: \url{https://old.datahub.io/dataset/geospecies}. Access date: 07.07.2020.}.
 A brief description of the graphs in the dataset is presented in Table~\ref{tbl:graphs_for_rpq}.
 
-\begin{table}
+\begin{table}[h]
 \centering
 \caption{Graphs for RPQ evaluation}
 \label{tbl:graphs_for_rpq}
@@ -50,7 +50,7 @@ \subsubsection{Dataset}
 Taxonomy & 5 728 398 & 14 922 125 \\
 \hline
 Geospecies & 450 609 & 2 201 532 \\
-Mappingbased\_properties & 8 332 233 & 25 346 359 \\
+Mapping\_prop & 8 332 233 & 25 346 359 \\
 \hline
 \end{tabular}
 }
@@ -63,11 +63,11 @@ \subsubsection{Dataset}
 The most frequent relations from the given graph were used as symbols in the query template\footnote{Used generator is available as part of CFPQ\_data project: \url{https://github.com/JetBrains-Research/CFPQ_Data/blob/master/tools/gen_RPQ/gen.py}. Access data: 07.07.2020.}.
 We used the same set of queries for all LUBM graphs to investigate the scalability of the proposed algorithm.
 
-\begin{table}
+\begin{table}[h]
 \centering
 \caption{Queries templates for RPQ evaluation}
 \label{tbl:queries_templates}
-{\small
+{\scriptsize
 \renewcommand{\arraystretch}{1.2}
 %\rowcolors{2}{black!2}{black!10}
 \begin{tabular}{|c|c||c|c|}
@@ -96,7 +96,7 @@ \subsubsection{Dataset}
 \end{table}
 
 
-\subsubsection{Results}
+\he{Results}
 
 We averaged the execution time of index creation over 5 runs for each query.
 Index creation time for LUBM graphs set is presented in Figure~\ref{fig:lubm_all_qs}.
@@ -105,7 +105,7 @@ \subsubsection{Results}
 We conclude that our algorithm demonstrates reasonable performance to be applied to the real-world data analysis.
 %\cho{Note that the accurate comparison of different approaches may be a promising direction of future research.}
 
-\begin{figure}
+\begin{figure}[h]
 \centering
 \includegraphics[width=0.5\textwidth]{LUBM_all.pdf}
 \caption{Index creation time for LUBM graphs}
@@ -119,7 +119,7 @@ \subsubsection{Results}
 On the other hand, \textit{taxonomy} querying in many cases requires significantly more time than for other graphs, while \textit{taxonomy} is not the biggest graph.
 Finally, in most cases, query execution lasts less than 10 seconds, even for bigger graphs, and no query requires more than 52.17 seconds.
 
-\begin{figure}
+\begin{figure}[h]
 \centering
 \includegraphics[width=0.5\textwidth]{other_all.pdf}
 \caption{Index creation time for real-world RDFs}
@@ -163,12 +163,12 @@ \subsubsection{Results}
 % \caption{Single path extraction for specific graph and query for our solution (\subref{fig:geo_tensors_rpq}, \%subref{fig:dbpedia_tensors_rpq}, \subref{fig:geo_tensors_cfpq}), and Azimov's (\subref{fig:geo_matrix_cfpq})}
 %\end{figure}
 
-\subsection{CFPQ Evaluation}
+\he{CFPQ Evaluation}
 
 We evaluate the applicability of the proposed algorithm to CFPQ processing over real-world graphs on a number of classic cases and compare them with Azimov's algorithm.
 Currently, only a single path version of Azimov's algorithm exists, and we use its implementation using PyGraphBLAS. Note that it is not trivial to compare our results with the state-of-the-art results provided by~\cite{10.1145/3398682.3399163} (Azimov's algorithm) because our algorithm computes significantly more information. While the state-of-the-art solution computes only reachability facts or a single-path semantics, our algorithm computes data necessary to restore all possible paths.
 
-\subsubsection{Dataset}
+\he{Dataset}
 
 We use CFPQ\_Data\footnote{CFPQ\_Data is a dataset for CFPQ evaluation which contains both synthetic and real-world data and queries \url{https://github.com/JetBrains-Research/CFPQ\_Data}. Access date: 07.07.2020.} dataset for evaluation.
 Namely, we use relatively big RDF files and respective same-generation queries $G_1$~(Eq.~\ref{eqn:g_1}) and $G_2$~(Eq.~\ref{eqn:g_2}) which are used in other works for CFPQ evaluation.
@@ -206,7 +206,7 @@ \subsubsection{Dataset}
 The detailed data about all the graphs used is presented in Table~\ref{tbl:graphs_for_cfpq}.
 
 {\setlength{\tabcolsep}{0.2em}
-\begin{table}
+\begin{table*}[h]
 \centering
 {
 \caption{Graphs for CFPQ evaluation: \textit{bt} is broaderTransitive, \textit{sco} is subCalssOf}
@@ -232,14 +232,14 @@ \subsubsection{Dataset}
 \hline
 \end{tabular}
 }
-\end{table}
+\end{table*}
 }
-\subsubsection{Results}
+\he{Results}
 
 We averaged the index creation time over 5 runs for both single-path Azimov's algorithm (\textbf{Mtx}) and the proposed algorithm (\textbf{Tns}) (see Table~\ref{tbl:CFPQ_index}).
 
 {\setlength{\tabcolsep}{0.2em}
-\begin{table}
+\begin{table*}[h]
 \centering
 \caption{CFPQ evaluation results, time is measured in seconds}
 \label{tbl:CFPQ_index}
@@ -267,7 +267,7 @@ \subsubsection{Results}
 fs & --- & --- & --- & --- & --- & --- & 470.49 & 370.73 \\
 \hline
 \end{tabular}
-\end{table}
+\end{table*}
 }
 
 We can see that while in some cases our solution is comparable or just slightly better than Azimov's algorithm (\textit{enzyme, eclass\_514en, go}), there are cases when our solution is significantly faster (\textit{go-hierarchy}, up to 9 times faster), and when Azimov's algorithm about 1.3 times faster (all memory aliases and \textit{geospecies} with \textit{Geo} query).
@@ -285,7 +285,7 @@ \subsubsection{Results}
 %While both methods demonstrate linear time on the length of the extracted path, our generic solution is more than 1000 times slower than Azimov's single path extraction procedure.
 %We conclude that current generic all-path extraction procedure is not optimal for single path extraction.
 
-\subsection{Conclusion}
+\he{Conclusion}
 
 We conclude that the proposed algorithm is applicable to real-world data processing: the algorithm allows one to solve both the reachability problem and to extract paths of interest in a reasonable time.
 While index creation time (reachability query evaluation) is comparable with other existing solutions, the paths extraction procedure should be improved in the future. However, the state-of-the-art solution computes only reachability facts or a single-path semantics, whereas our algorithm computes data necessary to restore all possible paths (all-paths semantics).

Binary file not shown (28 KB).

InProgress/tensors_prog/eng/main.tex

Lines changed: 0 additions & 2 deletions
@@ -55,7 +55,6 @@
 \label{sec:prelim}
 \input{002prelim}
 
-\clearpage
 \He{Context-free path querying by Kronecker product}
 \label{sec:algo}
 \input{003algo.tex}
@@ -68,7 +67,6 @@
 \label{sec:related}
 \input{005related.tex}
 
-\newpage
 \He{Conclusion and Future Work}
 \label{sec:conclusion}
 \input{006conclusion.tex}
