larsgw/the-square-hole

The Square Hole

Can you guess where this shape goes? That's right! It goes in the square hole!

Simple Linked Data validation.

Prototype; work in progress. Only returns Boolean pass/fail; debugging happens through logging to stdout/stderr. Uses shex.js to parse ShExC and ShapeMaps, and n3.js to parse RDF files.

The existing implementations of ShEx that I have tried (shex.js, rudof, jena-shex) do not perform well for my use case: a CLOSED shape with a number of optional triples (i.e. ? or *). shex.js takes 7 seconds for record B1 and 9 minutes for record B2, even when using a simplified schema that does not recurse into related entities like authors, publishers, and taxonomic checklists.

This prototype evaluates all 2000+ records (1.4 million triples) against the full schema in 3 seconds. It leverages the fact that, at least in my data, most if not all triples can only fit a single constraint. This greatly reduces the number of choices that need to be made to find potential partitions of triples that satisfy the Boolean expression. I believe this is equivalent to the behavior prescribed by the specification.
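The idea above can be sketched roughly as follows. This is a minimal illustration in Python, not the code of this prototype: the constraint model (a tuple of predicate, object test, min, max), the flat EachOf-style list of constraints, and all function names are assumptions made for the example. Triples with exactly one candidate constraint are assigned directly, and only the ambiguous ones are searched.

```python
from collections import Counter
from itertools import product
from math import inf

# Hypothetical model: a constraint is (predicate, object_test, min, max).
# A triple is (subject, predicate, object). None of these names are from the project.

def candidates(triple, constraints):
    """Indices of the constraints that this triple could satisfy."""
    _, predicate, obj = triple
    return [i for i, (p, test, _, _) in enumerate(constraints)
            if p == predicate and test(obj)]

def within_cardinality(counts, constraints):
    """Check every constraint's min/max against the number of assigned triples."""
    return all(lo <= counts.get(i, 0) <= hi
               for i, (_, _, lo, hi) in enumerate(constraints))

def validate_closed(triples, constraints):
    """Boolean pass/fail for one node against a CLOSED list of constraints."""
    fixed = Counter()   # triples with exactly one candidate: no choice to make
    ambiguous = []      # candidate lists of triples that fit several constraints
    for triple in triples:
        options = candidates(triple, constraints)
        if not options:
            return False  # CLOSED: a triple that fits nowhere fails the shape
        if len(options) == 1:
            fixed[options[0]] += 1
        else:
            ambiguous.append(options)
    # Only the ambiguous triples need to be searched.
    for choice in product(*ambiguous):
        if within_cardinality(fixed + Counter(choice), constraints):
            return True
    return False

is_str = lambda o: isinstance(o, str)
is_int = lambda o: isinstance(o, int)

constraints = [
    ("title", is_str, 1, 1),     # exactly one
    ("author", is_str, 0, inf),  # *
    ("year", is_int, 0, 1),      # ?
]
record = [
    ("b1", "title", "Flora"),
    ("b1", "author", "A"),
    ("b1", "author", "B"),
]

print(validate_closed(record, constraints))                            # True
print(validate_closed(record + [("b1", "isbn", "123")], constraints))  # False
```

When every triple has a single candidate, the search loop runs exactly once, so validation is linear in the number of triples for this simplified model.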

Notably though, non-CLOSED shapes and shapes allowing specific EXTRA triples give some or all triples an additional option, that of being unused, which greatly increases the number of choices to be made. Proper error reporting (i.e. finding almost-solutions) leads to similar additional choices for the algorithm: any triple that cannot be used as part of a solution could be discarded (or its object adjusted so that it does fit an existing constraint), and missing triples could be "added" instead of leading to early exits.
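To give a feel for the size of this effect, here is a small back-of-the-envelope sketch in Python. It assumes a naive enumeration in which each triple independently picks one of its candidate constraints, plus "unused" where that is allowed; the function name and inputs are hypothetical, and a real implementation would prune far below this upper bound.

```python
from math import prod

def search_space(option_counts, may_be_unused):
    """Upper bound on the number of assignments a naive search would enumerate.

    option_counts[i] is the number of constraints triple i could fit;
    may_be_unused[i] says whether triple i may also be left unused
    (non-CLOSED shape, or its predicate is listed under EXTRA).
    """
    return prod(n + (1 if unused else 0)
                for n, unused in zip(option_counts, may_be_unused))

# 20 triples that each fit exactly one constraint.
option_counts = [1] * 20

closed = search_space(option_counts, [False] * 20)
extra_on_five = search_space(option_counts, [True] * 5 + [False] * 15)
non_closed = search_space(option_counts, [True] * 20)

print(closed, extra_on_five, non_closed)  # 1 32 1048576
```

Under these assumptions, a record that needs no choices at all in a CLOSED shape turns into a search over 2^20 assignments once every triple may also be unused.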

Not yet implemented

Major:

  • Public API
  • CLI
  • Tests
  • Functional solution/error reporting

Features:

Schemas:

  • Schemas that import other schemas
  • Semantic actions
  • (Annotations)

Shapes:

  • Non-CLOSED shapes
  • Shapes allowing specific EXTRA triples
  • Shapes that extend other shapes (EXTENDS)
  • ABSTRACT shapes
  • EXTERNAL shapes
  • Shapes with restricts (found in shex.js; not in spec?)
  • EachOf or OneOf triple sets with min and/or max (a bit of a problem, to be honest)

Nodes:

  • Validation of literals with datatype annotation (i.e. the string content)
  • Configuration of RegExp engine

Optimization:

  • Implement alternative methods of solving the Boolean expression of a Shape; switch between implementations depending on heuristics
  • Possibly: try evaluating Boolean expression for early exits (comes with significant overhead)
  • Possibly:
  • Evaluate implementations for simple optimization gains
