|
| 1 | +% !TEX TS-program = pdflatex |
| 2 | +% !TeX spellcheck = en_US |
| 3 | +% !TEX root = main.tex |
| 4 | + |
| 5 | +Language-constrained path querying~\cite{barrett2000formal} is a technique for graph navigation querying. |
| 6 | +This technique allows one to use formal languages as constraints on paths in edge-labeled graphs: a path satisfies constraints if the labels along it form a word from the specified language. |
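|  | + |
|  | +As a small illustration (our own example, not taken from the cited work), consider a graph with the edges $0 \stackrel{a}{\rightarrow} 1$ and $1 \stackrel{b}{\rightarrow} 2$ and the language $L = \{a^n b^n \mid n \geq 1\}$. |
|  | +The path $0 \rightarrow 1 \rightarrow 2$ spells the word $ab \in L$ and therefore satisfies the constraint, whereas the path $0 \rightarrow 1$ spells $a \notin L$ and does not. |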
| 7 | + |
| 8 | +Using regular languages as constraints, known as \textit{Regular Path Querying} (RPQ), is the most well-studied and widespread variant. |
| 9 | +Different aspects of RPQs are actively studied in graph databases~\cite{10.1145/2463664.2465216, 10.1145/3104031,10.1145/2850413}, while regular constraints are supported in such popular query languages as PGQL~\cite{10.1145/2960414.2960421} and SPARQL\footnote{Specification of regular constraints in SPARQL property paths: \url{https://www.w3.org/TR/sparql11-property-paths/}. Access date: 07.07.2020.}~\cite{10.1007/978-3-319-25007-6_1} (known as property paths). |
| 10 | +Nevertheless, RPQ efficiency can still be improved, and new solutions continue to appear~\cite{Wang2019,10.1145/2949689.2949711}. |
| 11 | + |
| 12 | +At the same time, using more powerful languages as constraints, namely context-free languages, has gained popularity in recent years. |
| 13 | +The \textit{Context-Free Path Querying} (CFPQ) problem was introduced by~\cite{Yannakakis} and is now used in many areas. |
| 14 | +For example, CFPQ is used for interprocedural static code analysis~\cite{10.1145/3158118,10.5555/271338.271343, YanSCA, Zheng:2008:DAA:1328897.1328464}. |
| 15 | +In this area, CFPQ is known as the context-free language reachability (\textit{CFL-reachability}) problem. |
| 16 | +CFPQ can also be used for biological data analysis~\cite{GraphQueryWithEarley}, graph segmentation in data provenance analysis~\cite{8731467}, and preserving data-flow information in machine-learning-based solutions for code analysis problems~\cite{10.1145/3428301}. |
| 17 | + |
| 18 | +Many algorithms for CFPQ have been proposed, but recently~\cite{Kuijpers:2019:ESC:3335783.3335791} showed that the state-of-the-art CFPQ algorithms are still not performant enough for practical use. |
| 19 | +This motivates further research into new algorithms for CFPQ. |
| 20 | + |
| 21 | +One promising way to achieve high-performance solutions for graph analysis problems is to reduce them to linear algebra operations. |
| 22 | +To facilitate this approach, the GraphBLAS~API~\cite{7761646}, a specification of basic linear algebra primitives, was proposed. |
| 23 | +Evaluations of the libraries that implement this API, such as SuiteSparse~\cite{10.1145/3322125} and CombBLAS~\cite{10.1177/1094342011403516}, show that reduction to linear algebra is a good way to utilize high-performance parallel and distributed computations for graph analysis. |
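|  | + |
|  | +A minimal sketch of such a reduction, with notation that is ours and not taken from the cited works: let $G$ be the $n \times n$ Boolean adjacency matrix of a graph, and let $f_k$ be a Boolean row vector marking the vertices reachable from a chosen start set in exactly $k$ steps. |
|  | +Over the Boolean semiring $(\lor, \land)$, one traversal step is then a single vector-matrix product: |
|  | +\[ |
|  | +  f_{k+1} = f_k \cdot G, \qquad (f_{k+1})_j = \bigvee_{i=1}^{n} \left( (f_k)_i \land G_{ij} \right). |
|  | +\] |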
| 24 | + |
| 25 | +\cite{Azimov:2018:CPQ:3210259.3210264} showed how to reduce CFPQ to matrix multiplication. |
| 26 | +Later, \cite{Mishin:2019:ECP:3327964.3328503} and~\cite{10.1145/3398682.3399163} showed that implementing Azimov's algorithm with appropriate linear algebra libraries yields a practical solution for CFPQ. |
| 27 | +However, Azimov's algorithm requires transforming the input grammar into Chomsky Normal Form. |
| 28 | +This increases the grammar size and hence worsens performance, especially for regular queries and complex context-free queries. |
| 29 | + |
| 30 | +To solve these problems, an algorithm based on automata intersection was proposed~\cite{10.1007/978-3-030-54832-2_6}. |
| 31 | +This algorithm is based on linear algebra and does not require the transformation of the input grammar. |
| 32 | +In this work, we improve this algorithm by reducing it to operations over Boolean matrices, thus simplifying its description and implementation. |
| 33 | +Additionally, we add support for all-paths query semantics. |
| 34 | +Under the \textit{all-paths query semantics}, a query evaluates to all paths satisfying the conditions of the query. |
| 35 | +All-paths semantics is necessary, for example, in biological data analysis~\cite{GraphQueryWithEarley}, where paths prove why the specified vertices are of interest (similar). |
| 36 | +In static code analysis (e.g., alias analysis), paths indicate why two names are aliases, which can be used to generate an informative error message for the user of the static analysis tool. Reporting all such reasons (all paths) shortens the feedback loop and provides a more detailed analysis. |
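|  | + |
|  | +To make the reduction to Boolean matrices mentioned above more concrete, the following is a sketch of the standard product construction for automata intersection; the symbols $A_l$, $G_l$, and $M$ are introduced here for illustration only and are not fixed by this text. |
|  | +Let $A_l$ and $G_l$ be the Boolean adjacency matrices of the query automaton and of the input graph, respectively, restricted to the edges labeled $l$. Then |
|  | +\[ |
|  | +  M = \bigvee_{l} A_l \otimes G_l, |
|  | +\] |
|  | +where $\otimes$ is the Kronecker product and $\bigvee$ is the element-wise disjunction. |
|  | +A path from $(q, u)$ to $(q', v)$ in the graph described by $M$ exists exactly when some path from $u$ to $v$ in the input graph spells a word that takes the automaton from state $q$ to state $q'$, so reachability questions reduce to computing the transitive closure of $M$. |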
| 37 | + |
| 38 | +We also show that this algorithm is performant enough for regular queries, so it is a good candidate for integration with real-world query languages: one algorithm can be used to evaluate both regular and context-free queries. |
| 39 | +Having a unified environment simplifies the development of querying tools by allowing common optimizations of the querying algorithm to be reused. |
| 40 | +Note that a real-world context-free query is likely to have a regular subquery that can be significant in size. |
| 41 | +Our algorithm can treat such regular subparts as a regular query, thus imposing little overhead compared to treating them as a generic context-free query. |
| 42 | +This makes a unified solution more promising in terms of performance. |
| 43 | + |
| 44 | +Moreover, we show that this algorithm opens the way to tackling a long-standing problem: the existence of a truly subcubic $O(n^{3-\epsilon})$ CFPQ algorithm~\cite{10.1145/1328438.1328460, Yannakakis}. |
| 45 | +Currently, the best result is an $O(n^3/\log{n})$ algorithm of~\cite{10.1145/1328438.1328460}. |
| 46 | +Also, there exist truly subcubic solutions that use fast matrix multiplication for some fixed subclasses of context-free languages~\cite{8249039}. |
| 47 | +Unfortunately, these solutions cannot be generalized to arbitrary CFPQs. |
| 48 | +In this work, we identify incremental transitive closure as a bottleneck on the way to achieving subcubic time complexity for CFPQ. |
| 49 | + |
| 50 | +To sum up, we make the following contributions. |
| 51 | +\begin{enumerate} |
| 52 | + \item We rethink and improve the tensor-product-based CFPQ algorithm proposed by~\cite{10.1007/978-3-030-54832-2_6}. |
| 53 | + We reduce this algorithm to operations over Boolean matrices. |
| 54 | + As a result, all-paths query semantics is handled. Also, both regular and context-free grammars can be used as queries. |
| 55 | + \item |
| 56 | + We prove the correctness of the proposed algorithm and establish its time complexity, thus providing an upper bound on the complexity of the CFPQ problem as a function of the size of the query (its context-free grammar) and the number of vertices in the input graph. |
| 57 | + The proposed algorithm has subcubic complexity in terms of the grammar and the input graph sizes, which is comparable with the state-of-the-art solutions. |
| 58 | + On the other hand, the algorithm does not require transforming the input grammar into Chomsky Normal Form. |
| 59 | + The transformation leads to at least a quadratic blow-up in grammar size; thus, by avoiding it, our algorithm achieves better time complexity in terms of the grammar size than other solutions. |
| 60 | + \item We demonstrate the interconnection between CFPQ and incremental transitive closure. |
| 61 | + We show that incremental transitive closure is a bottleneck on the way to achieving a faster CFPQ algorithm, both for the general case of arbitrary graphs and for special families of graphs, such as planar graphs. |
| 62 | + \item We implement the described algorithm and evaluate it on real-world data for both RPQ and CFPQ. |
| 63 | + The evaluation shows that the proposed algorithm is comparable with the existing solutions for CFPQ and RPQ; thus, it provides a promising way to handle both kinds of queries. |
| 64 | +\end{enumerate} |