Skip to content

feat(kg): make the way to map data to KG more clear #409

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Apr 30, 2025
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
162 changes: 98 additions & 64 deletions docs/docs/ops/storages.md
Original file line number Diff line number Diff line change
Expand Up @@ -87,24 +87,34 @@ You can find an end-to-end example [here](https://github.com/cocoindex-io/cocoin

## Property Graph Targets

Property graph is a graph data model where both nodes and relationships can have properties.
Property graph is a widely-adopted model for knowledge graphs, where both nodes and relationships can have properties.
[Graph database concepts](https://neo4j.com/docs/getting-started/appendix/graphdb-concepts/) has a good introduction to basic concepts of property graphs.

The following concepts will be used in the following sections:
* [Node](https://neo4j.com/docs/getting-started/appendix/graphdb-concepts/#graphdb-node)
* [Node label](https://neo4j.com/docs/getting-started/appendix/graphdb-concepts/#graphdb-labels), which represents a type of nodes.
* [Relationship](https://neo4j.com/docs/getting-started/appendix/graphdb-concepts/#graphdb-relationship), which describes a connection between two nodes.
* [Relationship type](https://neo4j.com/docs/getting-started/appendix/graphdb-concepts/#graphdb-relationship-type)
* [Properties](https://neo4j.com/docs/getting-started/appendix/graphdb-concepts/#graphdb-properties), which are key-value pairs associated with nodes and relationships.

### Data Mapping

In CocoIndex, you can export data to property graph databases.
This usually involves more than one collectors, and you export them to different types of graph elements (nodes and relationships).
In particular,
Data from collectors are mapped to graph elements in various types:

1. Rows from collectors → Nodes in the graph
2. Rows from collectors → Relationships in the graph (including source and target nodes of the relationship)

1. You can export rows from some collectors to nodes in the graph.
2. You can export rows from some other collectors to relationships in the graph.
3. Some nodes referenced by relationships exported in 2 may not exist as nodes exported in 1.
CocoIndex will automatically create and keep these nodes, as long as they're still referenced by at least one relationship.
This guarantees that all relationships exported in 2 are valid.
This is what you need to provide to define these mappings:

We provide common types `NodeMapping`, `RelationshipMapping`, and `ReferencedNode`, to configure for each situation.
They're agnostic to specific graph databases.
* Specify [nodes to export](#nodes-to-export).
* [Declare extra node labels](#declare-extra-node-labels), for labels to appear as source/target nodes of relationships but not exported as nodes.
* Specify [relationships to export](#relationships-to-export).

#### Nodes
In addition, the same node may appear multiple times, from exported nodes and various relationships.
They should appear as the same node in the target graph database.
CocoIndex automatically [matches and deduplicates nodes](#nodes-matching-and-deduplicating) based on their primary key values.

#### Nodes to Export

Here's how CocoIndex data elements map to nodes in the graph:

Expand All @@ -114,9 +124,9 @@ Here's how CocoIndex data elements map to nodes in the graph:
| a collected row | a node |
| a field | a property of node |

Note that the label used in different `NodeMapping`s should be unique.
Note that the label used in different `Nodes`s should be unique.

`cocoindex.storages.NodeMapping` is to describe mapping to nodes. It has the following fields:
`cocoindex.storages.Nodes` is to describe mapping to nodes. It has the following fields:

* `label` (type: `str`): The label of the node.

Expand All @@ -138,7 +148,7 @@ document_collector.export(
...
cocoindex.storages.Neo4j(
...
mapping=cocoindex.storages.NodeMapping(label="Document"),
mapping=cocoindex.storages.Nodes(label="Document"),
),
primary_key_fields=["filename"],
)
Expand Down Expand Up @@ -167,7 +177,32 @@ graph TD
classDef node font-size:8pt,text-align:left,stroke-width:2;
```

#### Relationships
#### Declare Extra Node Labels

If a node label needs to appear as source or target of a relationship, but not exported as a node, you need to [declare](../core/flow_def#target-declarations) the label with necessary configuration.

The dataclass to describe the declaration is specific to each target storage (e.g. `cocoindex.storages.Neo4jDeclarations`),
while they share the following common fields:

* `nodes_label` (required): The label of the node.
* Options for [storage indexes](../core/flow_def#storage-indexes).
* `primary_key_fields` (required)
* `vector_indexes` (optional)

Continuing the same example above.
Considering we want to extract relationships from `Document` to `Place` later (i.e. a document mentions a place), but the `Place` label isn't exported as a node, we need to declare it:

```python
flow_builder.declare(
cocoindex.storages.Neo4jDeclarations(
connection = ...,
nodes_label="Place",
primary_key_fields=["name"],
),
)
```

#### Relationships to Export

Here's how CocoIndex data elements map to relationships in the graph:

Expand All @@ -177,12 +212,12 @@ Here's how CocoIndex data elements map to relationships in the graph:
| a collected row | a relationship |
| a field | a property of relationship, or a property of source/target node, based on configuration |

Note that the type used in different `RelationshipMapping`s should be unique.
Note that the type used in different `Relationships`s should be unique.

`cocoindex.storages.RelationshipMapping` is to describe mapping to relationships. It has the following fields:
`cocoindex.storages.Relationships` is to describe mapping to relationships. It has the following fields:

* `rel_type` (type: `str`): The type of the relationship.
* `source`/`target` (type: `cocoindex.storages.NodeReferenceMapping`): Specify how to extract source/target node information from the collected row. It has the following fields:
* `source`/`target` (type: `cocoindex.storages.NodeFromFields`): Specify how to extract source/target node information from specific fields in the collected row. It has the following fields:
* `label` (type: `str`): The label of the node.
* `fields` (type: `Sequence[cocoindex.storages.TargetFieldMapping]`): Specify field mappings from the collected rows to node properties, with the following fields:
* `source` (type: `str`): The name of the field in the collected row.
Expand Down Expand Up @@ -218,13 +253,13 @@ doc_place_collector.export(
...
cocoindex.storages.Neo4j(
...
mapping=cocoindex.storages.RelationshipMapping(
mapping=cocoindex.storages.Relationships(
rel_type="MENTION",
source=cocoindex.storages.NodeReferenceMapping(
source=cocoindex.storages.NodeFromFields(
label="Document",
fields=[cocoindex.storages.TargetFieldMapping(source="doc_filename", target="filename")],
),
target=cocoindex.storages.NodeReferenceMapping(
target=cocoindex.storages.NodeFromFields(
label="Place",
fields=[
cocoindex.storages.TargetFieldMapping(source="place_name", target="name"),
Expand All @@ -250,58 +285,70 @@ graph TD
classDef: nodeRef
}

Doc_Chapter2@{
Doc_Chapter2_a@{
shape: rounded
label: "**[Document]**
**filename\\*: chapter2.md**"
classDef: nodeRef
}

Place_CrystalPalace@{
Doc_Chapter2_b@{
shape: rounded
label: "**[Document]**
**filename\\*: chapter2.md**"
classDef: nodeRef
}

Place_CrystalPalace_a@{
shape: rounded
label: "**[Place]**
**name\\*: Crystal Palace**
embedding: [0.1, 0.5, ...]"
classDef: nodeRef
classDef: node
}

Place_MagicForest@{
shape: rounded
label: "**[Place]**
**name\\*: Magic Forest**
embedding: [0.4, 0.2, ...]"
classDef: nodeRef
classDef: node
}

Place_CrystalPalace_b@{
shape: rounded
label: "**[Place]**
**name\\*: Crystal Palace**
embedding: [0.1, 0.5, ...]"
classDef: node
}

Doc_Chapter1:::nodeRef -- **[MENTION]**{location:12} --> Place_CrystalPalace:::nodeRef
Doc_Chapter2:::nodeRef -- **[MENTION]**{location:23} --> Place_MagicForest:::nodeRef
Doc_Chapter2:::nodeRef -- **[MENTION]**{location:56} --> Place_CrystalPalace:::nodeRef

Doc_Chapter1:::nodeRef -- **:MENTION** (location:12) --> Place_CrystalPalace_a:::node
Doc_Chapter2_a:::nodeRef -- **:MENTION** (location:23) --> Place_MagicForest:::node
Doc_Chapter2_b:::nodeRef -- **:MENTION** (location:56) --> Place_CrystalPalace_b:::node

classDef nodeRef font-size:8pt,text-align:left,fill:transparent,stroke-width:1,stroke-dasharray:5 5;
classDef node font-size:8pt,text-align:left,stroke-width:2;

```

#### Nodes Matching and Deduplicating

#### Nodes only referenced by relationships
The nodes and relationships we got above are discrete elements.
To fit them into a connected property graph, CocoIndex will match and deduplicate nodes automatically:

If a node appears as source or target of a relationship, but not exported using `NodeMapping`, CocoIndex will automatically create and keep these nodes until they're no longer referenced by any relationships.
* Match nodes based on their primary key values. Nodes with the same primary key values are considered as the same node.
* For non-primary key fields (a.k.a. value fields), CocoIndex will pick the values from an arbitrary one.
If multiple nodes (before deduplication) with the same primary key provide value fields, an arbitrary one will be picked.

:::note Merge of node values
:::note

If the same node (as identified by primary key values) appears multiple times (e.g. they're referenced by different relationships),
CocoIndex uses value fields provided by an arbitrary one of them.
The best practice is to make the value fields consistent across different appearances of the same node, to avoid non-determinism in the exported graph.

:::

If a node's label specified in `NodeReferenceMapping` doesn't exist in any `NodeMapping`, you need to [declare](../core/flow_def#target-declarations) a `ReferencedNode` to configure [storage indexes](../core/flow_def#storage-indexes) for nodes with this label.
The following options are supported:

* `primary_key_fields` (required)
* `vector_indexes` (optional)

Using the same example above.
After combining exported nodes and relationships, we get the knowledge graph with all information:
After matching and deduplication, we get the final graph:

```mermaid
graph TD
Expand All @@ -326,38 +373,22 @@ graph TD
label: "**[Place]**
**name\\*: Crystal Palace**
embedding: [0.1, 0.5, ...]"
classDef: nodeRef
classDef: node
}

Place_MagicForest@{
shape: rounded
label: "**[Place]**
**name\\*: Magic Forest**
embedding: [0.4, 0.2, ...]"
classDef: nodeRef
classDef: node
}

Doc_Chapter1:::node -- **[MENTION]**{location:12} --> Place_CrystalPalace:::nodeRef
Doc_Chapter2:::node -- **[MENTION]**{location:23} --> Place_MagicForest:::nodeRef
Doc_Chapter2:::node -- **[MENTION]**{location:56} --> Place_CrystalPalace:::nodeRef
Doc_Chapter1:::node -- **:MENTION** (location:12) --> Place_CrystalPalace:::node
Doc_Chapter2:::node -- **:MENTION** (location:23) --> Place_MagicForest:::node
Doc_Chapter2:::node -- **:MENTION** (location:56) --> Place_CrystalPalace:::node

classDef node font-size:8pt,text-align:left,stroke-width:2;
classDef nodeRef font-size:8pt,text-align:left,fill:transparent,stroke-width:1,stroke-dasharray:5 5;

```

Nodes with `Place` label in the example aren't exported explicitly using `NodeMapping`, so CocoIndex will automatically create them as long as they're still referenced by any relationship.
You need to declare a `ReferencedNode`:

```python
flow_builder.declare(
cocoindex.storages.Neo4jDeclarations(
...
referenced_nodes=[
cocoindex.storages.ReferencedNode(label="Place", primary_key_fields=["name"]),
],
),
)
```

### Neo4j
Expand Down Expand Up @@ -388,6 +419,9 @@ The `Neo4j` storage exports each row as a relationship to Neo4j Knowledge Graph.
Neo4j also provides a declaration spec `Neo4jDeclaration`, to configure indexing options for nodes only referenced by relationships. It has the following fields:

* `connection` (type: auth reference to `Neo4jConnectionSpec`)
* `relationships` (type: `Sequence[ReferencedNode]`)
* Fields for [nodes to declare](#nodes-to-declare), including
* `nodes_label` (required)
* `primary_key_fields` (required)
* `vector_indexes` (optional)

You can find an end-to-end example [here](https://github.com/cocoindex-io/cocoindex/tree/main/examples/docs_to_knowledge_graph).
24 changes: 10 additions & 14 deletions examples/docs_to_knowledge_graph/main.py
Original file line number Diff line number Diff line change
Expand Up @@ -93,35 +93,31 @@ def docs_to_kg_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.D
"document_node",
cocoindex.storages.Neo4j(
connection=conn_spec,
mapping=cocoindex.storages.NodeMapping(label="Document")),
mapping=cocoindex.storages.Nodes(label="Document")),
primary_key_fields=["filename"],
)
# Declare reference Node to reference entity node in a relationship
flow_builder.declare(
cocoindex.storages.Neo4jDeclarations(
cocoindex.storages.Neo4jDeclaration(
connection=conn_spec,
referenced_nodes=[
cocoindex.storages.ReferencedNode(
label="Entity",
primary_key_fields=["value"],
)
]
nodes_label="Entity",
primary_key_fields=["value"],
)
)
entity_relationship.export(
"entity_relationship",
cocoindex.storages.Neo4j(
connection=conn_spec,
mapping=cocoindex.storages.RelationshipMapping(
mapping=cocoindex.storages.Relationships(
rel_type="RELATIONSHIP",
source=cocoindex.storages.NodeReferenceMapping(
source=cocoindex.storages.NodeFromFields(
label="Entity",
fields=[
cocoindex.storages.TargetFieldMapping(
source="subject", target="value"),
]
),
target=cocoindex.storages.NodeReferenceMapping(
target=cocoindex.storages.NodeFromFields(
label="Entity",
fields=[
cocoindex.storages.TargetFieldMapping(
Expand All @@ -136,13 +132,13 @@ def docs_to_kg_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.D
"entity_mention",
cocoindex.storages.Neo4j(
connection=conn_spec,
mapping=cocoindex.storages.RelationshipMapping(
mapping=cocoindex.storages.Relationships(
rel_type="MENTION",
source=cocoindex.storages.NodeReferenceMapping(
source=cocoindex.storages.NodeFromFields(
label="Document",
fields=[cocoindex.storages.TargetFieldMapping("filename")],
),
target=cocoindex.storages.NodeReferenceMapping(
target=cocoindex.storages.NodeFromFields(
label="Entity",
fields=[cocoindex.storages.TargetFieldMapping(
source="entity", target="value")],
Expand Down
23 changes: 15 additions & 8 deletions python/cocoindex/storages.py
Original file line number Diff line number Diff line change
Expand Up @@ -37,7 +37,7 @@ class TargetFieldMapping:
target: str | None = None

@dataclass
class NodeReferenceMapping:
class NodeFromFields:
"""Spec for a referenced graph node, usually as part of a relationship."""
label: str
fields: list[TargetFieldMapping]
Expand All @@ -50,30 +50,37 @@ class ReferencedNode:
vector_indexes: Sequence[index.VectorIndexDef] = ()

@dataclass
class NodeMapping:
class Nodes:
"""Spec to map a row to a graph node."""
kind = "Node"

label: str

@dataclass
class RelationshipMapping:
class Relationships:
"""Spec to map a row to a graph relationship."""
kind = "Relationship"

rel_type: str
source: NodeReferenceMapping
target: NodeReferenceMapping
source: NodeFromFields
target: NodeFromFields

# For backwards compatibility only
NodeMapping = Nodes
RelationshipMapping = Relationships
NodeReferenceMapping = NodeFromFields

class Neo4j(op.StorageSpec):
"""Graph storage powered by Neo4j."""

connection: AuthEntryReference
mapping: NodeMapping | RelationshipMapping
mapping: Nodes | Relationships

class Neo4jDeclarations(op.DeclarationSpec):
class Neo4jDeclaration(op.DeclarationSpec):
"""Declarations for Neo4j."""

kind = "Neo4j"
connection: AuthEntryReference
referenced_nodes: Sequence[ReferencedNode] = ()
nodes_label: str
primary_key_fields: Sequence[str]
vector_indexes: Sequence[index.VectorIndexDef] = ()
Loading