Formatting training data for spancat training #13880

sam8beard · 2025-10-12T00:59:37Z

Hi there,

I'm taking a stab at building my own claim extraction pipeline (first-time spaCy user).

Upstream in my pipeline, I feed n docs to the NER component of the pretrained en_core_web_sm model, then identify target spans using my own dependency-parsing logic. I then construct a list of training data formatted for spancat:

training_data = [
    (
        # text
        "The report states that AI risks are increasing significantly.",
        # annots: each span is (start_token, end_token, label)
        {
            "spans": {
                "sc": [
                    (1, 2, "SOURCE"),
                    (2, 3, "CLAIM_VERB"),
                    (4, 8, "CLAIM_CONTENTS"),
                    (8, 9, "CLAIM_MOD"),
                ]
            }
        },
    )
]

Each start and end is a token index in the sentence: start is the first token of the span and end is the index just past its last token. I then pass this list to the function that holds my training loop, which creates an example from each tuple in training_data.

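from spacy.tokens import Span
from spacy.training import Example

# nlp is the loaded pipeline, e.g. spacy.load("en_core_web_sm")
examples = []
for text, annots in training_data:
    # Reference doc: carries the gold spans under the "sc" key
    reference = nlp.make_doc(text)
    reference.spans["sc"] = [
        Span(reference, start, end, label=label)
        for start, end, label in annots["spans"]["sc"]
    ]
    # Predicted doc: an unannotated doc for the same text
    predicted = nlp.make_doc(text)
    # Example.from_dict(doc, {}) would build a reference with no spans
    examples.append(Example(predicted, reference))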
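
Since the loop above passes start and end straight to Span, it can help to check which tokens each triple covers. The following is a minimal sketch in plain Python, not spaCy itself: `simple_tokenize` and `show_spans` are hypothetical helpers, and the regex split is only assumed to match how spaCy tokenizes this particular sentence. The end index is treated as exclusive, as in `doc[start:end]` slicing.

```python
import re

def simple_tokenize(text):
    # Hypothetical stand-in for a tokenizer: runs of word characters,
    # plus each punctuation mark as its own token.
    return re.findall(r"\w+|[^\w\s]", text)

def show_spans(text, spans):
    tokens = simple_tokenize(text)
    # end is exclusive, mirroring doc[start:end] slicing
    return [(label, " ".join(tokens[start:end])) for start, end, label in spans]

text = "The report states that AI risks are increasing significantly."
spans = [
    (1, 2, "SOURCE"),
    (2, 3, "CLAIM_VERB"),
    (4, 8, "CLAIM_CONTENTS"),
    (8, 9, "CLAIM_MOD"),
]

covered = show_spans(text, spans)
for label, span_text in covered:
    print(f"{label}: {span_text}")
```

Under this tokenization, SOURCE covers only "report" rather than "The report"; if the determiner is meant to be part of the span, the start index would need to be 0.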

I'm a bit confused about how I should be creating examples for my training loop. How should my training data be formatted for training my spancat component?
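
One possible route, sketched here under assumptions rather than given as the definitive answer, is to store the annotated docs in a DocBin, which is the .spacy format that the `spacy train` command reads. The sketch assumes a blank English pipeline (tokenizer only) in place of en_core_web_sm, keeps the "sc" spans key from the post, and uses a hypothetical output path `train.spacy` that is left commented out.

```python
import spacy
from spacy.tokens import DocBin, Span

# Assumption: a blank pipeline is enough here, since only the tokenizer is needed
nlp = spacy.blank("en")

training_data = [
    (
        "The report states that AI risks are increasing significantly.",
        {
            "spans": {
                "sc": [
                    (1, 2, "SOURCE"),
                    (2, 3, "CLAIM_VERB"),
                    (4, 8, "CLAIM_CONTENTS"),
                    (8, 9, "CLAIM_MOD"),
                ]
            }
        },
    )
]

doc_bin = DocBin()
for text, annots in training_data:
    doc = nlp.make_doc(text)
    # Token-index spans (end exclusive), stored under the "sc" key
    doc.spans["sc"] = [
        Span(doc, start, end, label=label)
        for start, end, label in annots["spans"]["sc"]
    ]
    doc_bin.add(doc)

# doc_bin.to_disk("train.spacy")  # hypothetical output path

# Round-trip in memory to confirm the spans survive serialization
restored = list(DocBin().from_bytes(doc_bin.to_bytes()).get_docs(nlp.vocab))
restored_spans = [(span.text, span.label_) for span in restored[0].spans["sc"]]
print(restored_spans)
```

The spans key matters on both sides: whatever key the docs use has to match the `spans_key` the spancat component is configured with, and "sc" is that setting's default.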
