Can you guess where this shape goes? That's right! It goes in the square hole!
Simple Linked Data validation.
Prototype; work in progress. Only returns Boolean pass/fail; debugging happens through logging to stdout/stderr. Uses shex.js to parse ShExC and ShapeMaps, and n3.js to parse RDF files.
The existing ShEx implementations that I have tried (shex.js, rudof, jena-shex) do not perform well for my use case: a `CLOSED` shape with a number of optional triples (i.e. `?` or `*`). shex.js takes 7 seconds for record B1 and 9 minutes for record B2, when using a simplified schema that does not recurse into related entities like authors, publishers, and taxonomic checklists.
This prototype evaluates all 2000+ records (1.4 million triples) against the full schema in 3 seconds. It leverages the fact that, at least in my data, most if not all triples can only fit a single constraint. This greatly reduces the number of choices that need to be made to find partitions of triples that satisfy the Boolean expression. I believe this is equivalent to the behavior prescribed by the specification.
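The idea can be illustrated with a minimal sketch. This is not the prototype's actual code: the function name `validateClosedShape` and the constraint objects (`predicate`, `test`, `min`, `max`) are hypothetical, and the shape is assumed to be a flat list of triple constraints without nested `OneOf` groups. Triples with exactly one candidate constraint are assigned directly; only the remaining ambiguous triples require a search.

```javascript
// Hypothetical, simplified model: a CLOSED shape is a flat list of triple
// constraints, each with a predicate, a value test, and min/max cardinality.
function validateClosedShape(triples, constraints) {
  const counts = new Array(constraints.length).fill(0);
  const ambiguous = []; // candidate lists of triples that fit several constraints

  for (const triple of triples) {
    const candidates = [];
    constraints.forEach((c, i) => {
      if (c.predicate === triple.predicate && c.test(triple.object)) {
        candidates.push(i);
      }
    });
    if (candidates.length === 0) return false; // CLOSED: no triple may stay unused
    if (candidates.length === 1) {
      counts[candidates[0]]++; // forced assignment, no choice to make
      if (counts[candidates[0]] > constraints[candidates[0]].max) return false;
    } else {
      ambiguous.push(candidates);
    }
  }

  const withinBounds = () =>
    constraints.every((c, i) => counts[i] >= c.min && counts[i] <= c.max);

  // Backtracking is only needed for the (ideally few) ambiguous triples.
  function search(k) {
    if (k === ambiguous.length) return withinBounds();
    for (const i of ambiguous[k]) {
      if (counts[i] >= constraints[i].max) continue; // would exceed max
      counts[i]++;
      if (search(k + 1)) return true;
      counts[i]--;
    }
    return false;
  }
  return search(0);
}

// Example with made-up predicates: one title, any number of creators,
// at most one issue date.
const exampleConstraints = [
  { predicate: 'dc:title', test: () => true, min: 1, max: 1 },
  { predicate: 'dc:creator', test: () => true, min: 0, max: Infinity },
  { predicate: 'dc:issued', test: () => true, min: 0, max: 1 },
];
const exampleTriples = [
  { predicate: 'dc:title', object: 'A title' },
  { predicate: 'dc:creator', object: 'ex:alice' },
  { predicate: 'dc:creator', object: 'ex:bob' },
];
console.log(validateClosedShape(exampleTriples, exampleConstraints));
```

When every triple has a single candidate, the `ambiguous` list stays empty and validation reduces to one pass over the triples plus a cardinality check.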
Notably though, non-`CLOSED` shapes or shapes allowing specific `EXTRA` triples give some or all triples an additional option, that of being unused, which greatly increases the number of choices to be made. Proper error reporting (i.e. finding almost-solutions) leads to similar additional choices for the algorithm: any triple that cannot be used as part of a solution could be discarded (or its object adjusted so that it does fit an existing constraint), and missing triples could be "added" instead of leading to early exits.
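As a rough illustration of this growth, the sketch below computes an upper bound on the number of triple-to-constraint assignments, ignoring any pruning. The function name `searchSpaceSize` and the input format are assumptions made for this example, not part of the prototype.

```javascript
// Hypothetical helper: upper bound on the number of assignments to consider.
// Each entry describes one triple: how many constraints it could fit, and
// whether it may also be left unused (non-CLOSED shape or matching EXTRA).
function searchSpaceSize(triples) {
  return triples.reduce(
    (product, t) => product * (t.candidates + (t.mayBeUnused ? 1 : 0)),
    1
  );
}

// 20 triples that each fit exactly one constraint.
const closedCase = Array.from({ length: 20 }, () => ({ candidates: 1, mayBeUnused: false }));
const openCase = Array.from({ length: 20 }, () => ({ candidates: 1, mayBeUnused: true }));

console.log(searchSpaceSize(closedCase)); // 1
console.log(searchSpaceSize(openCase)); // 1048576 (2 ** 20)
```

In the closed case there is a single assignment to check; once every triple may also be unused, the same 20 triples allow up to 2^20 combinations before pruning.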
Major:
- Public API
- CLI
- Tests
- Functional solution/error reporting
Features:
Schemas:
- Schemas that import other schemas
- Semantic actions
- (Annotations)
Shapes:
- Non-`CLOSED` shapes
- Shapes allowing specific `EXTRA` triples
- Shapes that `EXTENDS` other shapes
- `ABSTRACT` shapes
- `EXTERNAL` shapes
- Shapes with `restricts` (found in shex.js; not in spec?)
- `EachOf` or `OneOf` triple sets with `min` and/or `max` (a bit of a problem, to be honest)
Nodes:
- Validation of literals with datatype annotation (i.e. the string content)
- Configuration of RegExp engine
Optimization:
- Implement alternative methods of solving the Boolean expression of a `Shape`; switch between implementations depending on heuristics
- Possibly: try evaluating Boolean expression for early exits (comes with significant overhead)
- Possibly:
- Evaluate implementations for simple optimization gains