Trying to understand predicted similarity scores during findAndLabel #1168

rkennedy-argus · 2025-06-18T18:23:16Z

I'm seeing some pairs during findAndLabel with similarity scores that are FAR higher than I would expect. For example:

+---------+----------------------------+----+--------+----+----+----+----+----------------------+----+-----+--------------------+-----------+------------+---------+-----+-----+---------+
|qid      |name                        |P17 |P159    |P169|P355|P414|P740|P856                  |P946|P1056|P1278               |P1320      |P1448       |P1454    |P1830|P3320|z_zsource|
+---------+----------------------------+----+--------+----+----+----+----+----------------------+----+-----+--------------------+-----------+------------+---------+-----+-----+---------+
|Q22976646|Janów Podlaski National Stud|Q36 |Q3383557|NULL|NULL|NULL|NULL|http://www.skjanow.pl/|NULL|NULL |NULL                |NULL       |NULL        |NULL     |NULL |NULL |wikidata |
|Q67809505|JANDA                       |Q213|Q36989  |NULL|NULL|NULL|NULL|NULL                  |NULL|NULL |315700TM558DWYUL8O10|cz/48154440|JANDA s.r.o.|Q15646299|NULL |NULL |wikidata |
+---------+----------------------------+----+--------+----+----+----+----+----------------------+----+-----+--------------------+-----------+------------+---------+-----+-----+---------+

        Zingg predicts the above records MATCH with a similarity score of 0.80

These rows are not even remotely similar, aside from having the same first letter in the name field. At first, I thought maybe it was because of the empty fields. But I judiciously applied the null_or_blank match type to rule that out and I'm still seeing this similarity score.

Here are the field definitions I'm working with:

[
  {
    "fieldName": "qid",
    "matchType": "exact,null_or_blank",
    "fields": "qid",
    "dataType": "string"
  },
  {
    "fieldName": "name",
    "matchType": "text,null_or_blank",
    "fields": "name",
    "dataType": "string"
  },
  {
    "fieldName": "P17",
    "matchType": "exact,null_or_blank",
    "fields": "P17",
    "dataType": "string"
  },
  {
    "fieldName": "P159",
    "matchType": "exact,null_or_blank",
    "fields": "P159",
    "dataType": "string"
  },
  {
    "fieldName": "P169",
    "matchType": "exact,null_or_blank",
    "fields": "P169",
    "dataType": "string"
  },
  {
    "fieldName": "P355",
    "matchType": "text,null_or_blank",
    "fields": "P355",
    "dataType": "string"
  },
  {
    "fieldName": "P414",
    "matchType": "exact,null_or_blank",
    "fields": "P414",
    "dataType": "string"
  },
  {
    "fieldName": "P740",
    "matchType": "exact,null_or_blank",
    "fields": "P740",
    "dataType": "string"
  },
  {
    "fieldName": "P856",
    "matchType": "exact,null_or_blank",
    "fields": "P856",
    "dataType": "string"
  },
  {
    "fieldName": "P946",
    "matchType": "exact,null_or_blank",
    "fields": "P946",
    "dataType": "string"
  },
  {
    "fieldName": "P1056",
    "matchType": "text,null_or_blank",
    "fields": "P1056",
    "dataType": "string"
  },
  {
    "fieldName": "P1278",
    "matchType": "exact,null_or_blank",
    "fields": "P1278",
    "dataType": "string"
  },
  {
    "fieldName": "P1320",
    "matchType": "exact,null_or_blank",
    "fields": "P1320",
    "dataType": "string"
  },
  {
    "fieldName": "P1448",
    "matchType": "exact,null_or_blank",
    "fields": "P1448",
    "dataType": "string"
  },
  {
    "fieldName": "P1454",
    "matchType": "text,null_or_blank",
    "fields": "P1454",
    "dataType": "string"
  },
  {
    "fieldName": "P1830",
    "matchType": "text,null_or_blank",
    "fields": "P1830",
    "dataType": "string"
  },
  {
    "fieldName": "P3320",
    "matchType": "text,null_or_blank",
    "fields": "P3320",
    "dataType": "string"
  }
]

sonalgoyal · 2025-06-18T19:17:10Z

Maintainer

How much labelling have you done so far? How many of those are matches?
How populated are the fields you have?
For the ones that are text, are they 7-8 words or more?

1 reply

rkennedy-argus Jun 18, 2025
Author

> How much labelling have you done so far? How many of those are matches?

I started with some pre-existing training data, following the instructions in https://docs.zingg.ai/latest/stepbystep/createtrainingdata/addowntrainingdata. This set includes 35 rows spanning 17 z_clusters, 9 of which are matching clusters and 8 of which are non-matching clusters.

After that, I did a round of findAndLabel in which every pair was non-matching. Nothing indicated whether I had done enough labeling, but when I previously moved on to train, it failed with an indication that I hadn't. Is there a good heuristic for knowing how much labeling is enough?

> How populated are the fields you have?

Here's the summary for my data:

D select column_name, approx_unique, count, null_percentage from (summarize wikidata);
+-------------+---------------+--------+-----------------+
| column_name | approx_unique | count  | null_percentage |
+-------------+---------------+--------+-----------------+
| qid         | 582690        | 596510 | 0.00            |
| P1056       | 5972          | 596510 | 94.65           |
| P127        | 19059         | 596510 | 95.35           |
| P1278       | 31378         | 596510 | 92.82           |
| P1320       | 61161         | 596510 | 91.67           |
| P1448       | 83816         | 596510 | 86.38           |
| P1454       | 1227          | 596510 | 81.18           |
| P159        | 40373         | 596510 | 62.68           |
| P169        | 3596          | 596510 | 99.42           |
| P17         | 480           | 596510 | 14.13           |
| P1830       | 4837          | 596510 | 99.05           |
| P3320       | 1036          | 596510 | 99.83           |
| P355        | 8456          | 596510 | 98.48           |
| P414        | 154           | 596510 | 97.58           |
| P740        | 5009          | 596510 | 98.01           |
| P856        | 201644        | 596510 | 66.89           |
| P946        | 5797          | 596510 | 99.03           |
| name        | 472560        | 596510 | 0.00            |
+-------------+---------------+--------+-----------------+

As you can see, it's quite sparse. name is the one field we're guaranteed to have.

> For the ones that are text, are they 7-8 words or more?

Not always. A number of them are single-word IDs for which we'd use deterministic matching if we were on enterprise (e.g. P1278, P1320, and P946). Some of them are multiple IDs concatenated with commas, where we're hoping the matching can detect subset overlaps (e.g. P1056, P1830, P3320, and P355).
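
To make "subset overlaps" concrete, here is a minimal Python sketch of two set-based measures over comma-separated ID strings. It only illustrates the kind of signal I'm hoping for; it is not a description of how Zingg's text match type scores fields, and the function names and sample values are made up for the example.

```python
def id_set(value):
    """Split a comma-separated ID string into a set; None or blank gives an empty set."""
    if value is None:
        return set()
    return {part.strip() for part in value.split(",") if part.strip()}


def jaccard(a, b):
    """Shared IDs divided by all distinct IDs; None when either side is empty."""
    sa, sb = id_set(a), id_set(b)
    if not sa or not sb:
        return None
    return len(sa & sb) / len(sa | sb)


def overlap_coefficient(a, b):
    """Shared IDs divided by the size of the smaller set; 1.0 means one side is a subset."""
    sa, sb = id_set(a), id_set(b)
    if not sa or not sb:
        return None
    return len(sa & sb) / min(len(sa), len(sb))


# Hypothetical values, not taken from the dataset in this thread.
left = "Q1,Q2,Q3"
right = "Q2,Q3"
print(jaccard(left, right))
print(overlap_coefficient(left, right))
print(jaccard(left, None))
```

The overlap coefficient is the one that rewards a strict subset, which is the case I care about; Jaccard penalizes the extra IDs on the longer side.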

sonalgoyal · 2025-06-30T14:57:18Z

Maintainer

Sorry for the late reply! For single-value columns, you can try switching to exact instead of text. For multiple IDs concatenated with commas, text may be OK. If any of the columns are less than 70-80% populated, you probably don't want them to be part of the model, as the signal may be too weak and confusing.
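
As a rough illustration of the population rule of thumb, the sketch below applies a 70% cutoff to the null percentages from the `summarize wikidata` output earlier in the thread. The threshold constant and the helper name are assumptions made for the example, not Zingg settings.

```python
# Null percentages copied from the summary table posted above.
NULL_PCT = {
    "qid": 0.00, "P1056": 94.65, "P127": 95.35, "P1278": 92.82,
    "P1320": 91.67, "P1448": 86.38, "P1454": 81.18, "P159": 62.68,
    "P169": 99.42, "P17": 14.13, "P1830": 99.05, "P3320": 99.83,
    "P355": 98.48, "P414": 97.58, "P740": 98.01, "P856": 66.89,
    "P946": 99.03, "name": 0.00,
}

# Assumed cutoff: the lower end of the 70-80% range mentioned above.
MIN_POPULATED_PCT = 70.0


def split_fields(null_pct, min_populated_pct):
    """Return (keep, drop) lists of column names based on how populated each column is."""
    keep, drop = [], []
    for name, pct in null_pct.items():
        populated = 100.0 - pct
        if populated >= min_populated_pct:
            keep.append(name)
        else:
            drop.append(name)
    return keep, drop


keep, drop = split_fields(NULL_PCT, MIN_POPULATED_PCT)
print("keep:", keep)
print("drop:", drop)
```

With these numbers only qid, P17, and name clear the cutoff; every other column is populated in less than 40% of rows.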

You typically want to label at least 40-50 matching pairs so that Zingg can build a good first model.
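
A quick way to check progress against that target is to count the matching clusters in the labeled data. The sketch below assumes each labeled row carries a `z_cluster` id and a `z_isMatch` flag where 1 means match, and it treats each matching cluster as one labeled pair; the sample rows and the target constant are hypothetical.

```python
# Assumed target: the lower end of the 40-50 matching pairs suggested above.
TARGET_MATCHING_PAIRS = 40


def count_matching_clusters(rows):
    """Count distinct z_cluster ids whose rows are labeled as a match (z_isMatch == 1)."""
    return len({row["z_cluster"] for row in rows if row["z_isMatch"] == 1})


def has_enough_matches(rows, target=TARGET_MATCHING_PAIRS):
    """True once the number of matching clusters reaches the target."""
    return count_matching_clusters(rows) >= target


# Hypothetical labeled rows: one matching pair and one non-matching pair.
rows = [
    {"z_cluster": "c1", "z_isMatch": 1},
    {"z_cluster": "c1", "z_isMatch": 1},
    {"z_cluster": "c2", "z_isMatch": 0},
    {"z_cluster": "c2", "z_isMatch": 0},
]

print(count_matching_clusters(rows))
print(has_enough_matches(rows))
```

By this count, the 9 matching clusters described earlier in the thread are still well short of the target, so more matching labels are likely needed before training.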
