BUG: Duplicate packages with same PURL break SBOM import and DejaCode component catalog #295

Closed
ghsa-retrieval opened this issue Apr 23, 2025 · 21 comments
Labels: bug, design needed, enhancement, HighPriority, integration

Comments

@ghsa-retrieval

ghsa-retrieval commented Apr 23, 2025

Describe the bug
Importing an SBOM results in errors for several packages:

{'__all__': ['Package with this Dataspace, Type, Namespace, Name, Version, Qualifiers, Subpath, Download URL and Filename already exists.']}

Comparing the content of the SBOM with the inventory and with existing packages revealed that the issue is caused by duplicate packages in the component catalog. Apparently there are packages with the same PURL, hash, type, name, and version. One is properly populated with scan data, while the other is not. (Edit: It seems that the above error occurs when there is an existing package without a download URL, see #295 (comment) for the two distinct cases.) I suspect the duplicate/broken one was formerly associated with a project that has since been deleted.

The even bigger issue is that while we can see those packages in the regular UI, they are not shown in the admin dashboard when searching by name. Thus, deleting the more than 400 affected ones is a bit of a challenge, given that the error message does not indicate which specific packages cause the issue.

It seems that there is some uniqueness constraint not properly checked when importing SBOMs, as all packages have been imported through SBOMs.

To Reproduce
Unclear

Expected behavior
DejaCode should not allow duplicate packages to be created through imported SBOMs.

Screenshots

[Three screenshots; images not available]

Context (OS, Browser, Device, etc.):
n.a.

ghsa-retrieval added the bug, design needed, and enhancement labels Apr 23, 2025
ghsa-retrieval changed the title from "BUG: Duplicate package with same PURL" to "BUG: Duplicate packages with same PURL break SBOM import and DejaCode component catalog" Apr 23, 2025
@ghsa-retrieval
Author

Going through them sorted by date in the admin dashboard, I think I managed to delete the affected ones manually.

@ghsa-retrieval
Author

ghsa-retrieval commented Apr 23, 2025

I also noticed that reimporting SBOMs on a product that already had an SBOM import can result in dependency count increasing, as it is seemingly not checking for duplicates. Perhaps this issue is related to that?

@ghsa-retrieval
Author

ghsa-retrieval commented Apr 23, 2025

The behavior is quite strange. For some packages, the SBOM import creates new package entries despite there already being one with the same PURL, albeit with a different / non-empty download URL. However, this does not happen for every package, so I'm still missing some factor that plays into this decision.

@ghsa-retrieval
Author

Running "Improve Package from PurlDB" fails with duplicate key value violates unique constraint "component_catalog_packag_dataspace_id_type_namesp_c6620419_uniq" DETAIL: Key (dataspace_id, type, namespace, name, version, qualifiers, subpath, download_url, filename)=(3, npm, , parse-json, 4.0.0, , , https://registry.npmjs.org/parse-json/-/parse-json-4.0.0.tgz, parse-json-4.0.0.tgz) already exists. since assigning the download_url would make it a fully duplicate package.

@ghsa-retrieval
Author

ghsa-retrieval commented Apr 23, 2025

It seems there are two issues here:

  • Some packages already exist in the catalog, and creating them again from the SBOM would violate the uniqueness constraint, resulting in the error: {'__all__': ['Package with this Dataspace, Type, Namespace, Name, Version, Qualifiers, Subpath, Download URL and Filename already exists.']}
    • It is unclear to me why the import does not simply reference the original package in a product package relationship, as it would usually do
  • DejaCode creates a new package and adds a product package relationship when the package imported from the SBOM deviates only in the download URL. It seems that this happens when we already have a package in the catalog that was enhanced from PurlDB and even scanned, and the one coming from the SBOM has no download URL at all. However, this is not sufficient as a criterion; something else must be at play, or there would be far more duplicate packages
    • Attempting to enhance the package in the product with data from PurlDB fails, because assigning the download URL would violate the uniqueness constraint that covers dataspace_id, type, namespace, name, version, qualifiers, subpath, download_url, and filename
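
The failure described in the last bullet can be anticipated before saving. The following is a minimal sketch in plain Python, not DejaCode code: `UNIQUE_FIELDS` mirrors the columns named in the constraint error, while `unique_key`, `would_collide`, and the dict-based package records are hypothetical stand-ins for illustration.

```python
# Columns covered by the unique constraint, as listed in the error message.
UNIQUE_FIELDS = (
    "dataspace_id", "type", "namespace", "name", "version",
    "qualifiers", "subpath", "download_url", "filename",
)


def unique_key(package):
    """Return the tuple of values the constraint compares (missing -> '')."""
    return tuple(package.get(field, "") for field in UNIQUE_FIELDS)


def would_collide(package, updates, catalog):
    """Return True if applying `updates` makes `package` equal to another record."""
    candidate_key = unique_key({**package, **updates})
    return any(
        other is not package and unique_key(other) == candidate_key
        for other in catalog
    )


# Hypothetical records: one complete package, one imported without a download URL.
complete = {
    "dataspace_id": 3, "type": "npm", "name": "parse-json", "version": "4.0.0",
    "download_url": "https://registry.npmjs.org/parse-json/-/parse-json-4.0.0.tgz",
    "filename": "parse-json-4.0.0.tgz",
}
incomplete = {"dataspace_id": 3, "type": "npm", "name": "parse-json", "version": "4.0.0"}
catalog = [complete, incomplete]

# Data an enrichment step might try to assign to the incomplete record.
purldb_data = {
    "download_url": complete["download_url"],
    "filename": complete["filename"],
}
print(would_collide(incomplete, purldb_data, catalog))  # True
```

A check like this would let a tool report which existing package blocks the update instead of surfacing the raw database error.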

DennisClark added the integration and HighPriority labels Apr 23, 2025
@ghsa-retrieval
Author

I'll investigate the issue further today and see if I can narrow down the conditions under which this happens.

@tdruez
Contributor

tdruez commented Apr 24, 2025

Some clarification about Package uniqueness from the source code:

# If one value of the filename, download_url, or any purl fields, changed,
# the Package is not a duplicate and can be created.
# Note that an empty string '' counts as a unique value.
#
# A `package_url` can be identical for multiple files.
# For example, ".zip" and ".whl" release of a Package may share the same `package_url`.
# Therefore, we only apply this unique constraint on `package_url` in the context of a
# `download_url` and `filename`.
# Also, a duplicated `download_url`+`filename` combination is allowed if any of the
# `package_url` fields is different.

Note that the filename + download_url combo predates the existence of PURL to define packages.

For example:

In DejaCode, those 2 packages may share the same pkg:pypi/[email protected] PURL but will exist as 2 separate packages.
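
The rules quoted above can be tried out against a toy table. This is only a sketch: it uses SQLite from the Python standard library rather than DejaCode's actual database, the table is reduced to five of the nine constrained columns, and the `demo` package, its URLs, and the `add_package` helper are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified stand-in for the package table: empty strings instead of NULLs,
# with a unique constraint across the PURL fields, download_url and filename.
conn.execute(
    """
    CREATE TABLE package (
        type TEXT NOT NULL DEFAULT '',
        name TEXT NOT NULL DEFAULT '',
        version TEXT NOT NULL DEFAULT '',
        download_url TEXT NOT NULL DEFAULT '',
        filename TEXT NOT NULL DEFAULT '',
        UNIQUE (type, name, version, download_url, filename)
    )
    """
)


def add_package(type_, name, version, download_url="", filename=""):
    """Insert a row; return False if the unique constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO package VALUES (?, ?, ?, ?, ?)",
                (type_, name, version, download_url, filename),
            )
        return True
    except sqlite3.IntegrityError:
        return False


results = [
    # Same PURL, different files: both rows are accepted.
    add_package("pypi", "demo", "1.0",
                "https://example.com/demo-1.0.zip", "demo-1.0.zip"),
    add_package("pypi", "demo", "1.0",
                "https://example.com/demo-1.0-py3-none-any.whl",
                "demo-1.0-py3-none-any.whl"),
    # Empty download_url and filename count as their own unique value.
    add_package("pypi", "demo", "1.0"),
    # A second row with the same empty values is a true duplicate.
    add_package("pypi", "demo", "1.0"),
]
print(results)  # [True, True, True, False]
```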

@tdruez
Contributor

tdruez commented Apr 24, 2025

that the issue is caused by duplicate packages in the component catalog. Apparently there are packages with the same PURL, hash, type, name, and version.

I'm able to reproduce the error locally. This is a bug in the importer logic that occurs when multiple packages with the same PURL, but different download_url or filename, are present in the Dataspace.
The logic to "get" the proper Package instance is missing the "empty" fields, raising the MultipleObjectsReturned exception.
Since no package instance is found, the import process tries to create a new package entry raising the "unique" constraint exception.

I'll provide a fix to ensure the proper existing package (from the multiple records) is fetched and assigned.
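
One way to picture the failure mode described here, as a sketch in plain Python rather than the actual importer code: `get_package`, `buggy_lookup`, and `strict_lookup` are hypothetical helpers, and `MultipleObjectsReturned` is a local stand-in for the Django exception of the same name.

```python
class MultipleObjectsReturned(Exception):
    """Local stand-in for Django's exception of the same name."""


PURL_FIELDS = ("type", "namespace", "name", "version", "qualifiers", "subpath")
ALL_FIELDS = PURL_FIELDS + ("download_url", "filename")


def get_package(catalog, lookup):
    """Return the single record matching every lookup value, or None."""
    matches = [
        package for package in catalog
        if all(package.get(key, "") == value for key, value in lookup.items())
    ]
    if len(matches) > 1:
        raise MultipleObjectsReturned(f"{len(matches)} packages match {lookup}")
    return matches[0] if matches else None


def buggy_lookup(data):
    """Drop empty values, so an empty download_url no longer narrows the query."""
    return {key: value for key, value in data.items() if value}


def strict_lookup(data):
    """Keep every constrained field, treating a missing value as ''."""
    return {key: data.get(key, "") for key in ALL_FIELDS}


# Two catalog records with the same PURL but different download_url/filename.
catalog = [
    {"type": "npm", "name": "parse-json", "version": "4.0.0",
     "download_url": "https://registry.npmjs.org/parse-json/-/parse-json-4.0.0.tgz",
     "filename": "parse-json-4.0.0.tgz"},
    {"type": "npm", "name": "parse-json", "version": "4.0.0"},
]
sbom_entry = {"type": "npm", "name": "parse-json", "version": "4.0.0", "download_url": ""}

try:
    get_package(catalog, buggy_lookup(sbom_entry))
    outcome = "found"
except MultipleObjectsReturned:
    # No instance is returned, so an importer would go on to create a new
    # record, which then collides with the existing empty-URL package.
    outcome = "ambiguous"

match = get_package(catalog, strict_lookup(sbom_entry))
print(outcome, match is catalog[1])  # ambiguous True
```

Including the empty values in the lookup selects the record whose `download_url` and `filename` are also empty, which is the one the subsequent create would have duplicated.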

For some packages the SBOM import will create new package entries, despite there already being one with the same PURL, albeit different / non-empty download URL. However, this does not happen for every package, so I'm still missing some factor that plays into this decision.

As explained above, the issue occurs when there's more than 1 package with the same PURL but different filename or download_url.


The even bigger issue is that while we can see those packages in the regular UI, they do not get shown in the admin's dashboard when searching for its name.

That would be a search bug. Have you tried the advanced syntax such as name=value?

I also noticed that reimporting SBOMs on a product that already had an SBOM import can result in dependency count increasing, as it is seemingly not checking for duplicates. Perhaps this issue is related to that?

That's a separate issue that should also be fixed.

@ghsa-retrieval
Author

ghsa-retrieval commented Apr 24, 2025

@tdruez Thank you for looking into this! I can confirm that your description is consistent with what I am seeing, having checked and cross-referenced data from the SBOM against product packages and packages in the catalog.

Besides the case where there are already two or more packages with the same PURL but different download URLs, I have also seen new packages created when there was only one existing package with the same PURL. This seems to happen when the checksums of the packages differ, not in value but in kind: one has SHA1 and the other SHA512, or one has both and the other just one of them. This can still cause trouble down the line, when applying data from PurlDB would essentially create the exact same package by filling in the missing data, at which point the uniqueness constraint is triggered.

If I'm not wrong with my assessment, this would fully describe the behavior I'm seeing.

@ghsa-retrieval
Author

ghsa-retrieval commented Apr 24, 2025

As far as I can tell, the package search does not allow using the advanced syntax to search in the identifier column. This might also be a formatting issue, since the PURL contains a colon and slashes, but quoting didn't help either. Just searching for the package name seems to exclude the ones with a duplicate PURL. For instance, searching the UI for "parse-json" results in both pkg:npm/[email protected] being shown, while the admin dashboard only returns one.

I'll check on the dependency count issue when importing SBOMs multiple times and, if that issue is still reproducible, file an additional issue. Edit: An issue has been filed: #297

Thank you very much for your efforts

@tdruez
Contributor

tdruez commented Apr 24, 2025

As far as I can tell the search in package does not allow to use the advanced syntax to search in the identifier column.

Right, that's because the identifier value is not a column in the database but a dynamic property.

@property
def identifier(self):
    """
    Provide a unique value to identify each Package.
    It is the Package URL if one exists; otherwise it is the Package Filename.
    """
    return self.package_url or self.filename

This exists for legacy support of packages created without a PURL.

So for instance searching the UI for "parse-json" results in both pkg:npm/[email protected] being shown, while the admin dashboard only returns one.

This would be a bug with the admin search then but I cannot manage to reproduce locally so far.

@ghsa-retrieval
Author

ghsa-retrieval commented Apr 24, 2025

@tdruez I can file a separate issue, as the underlying cause is unrelated to this ticket. It's not an important issue though: it is now clear what the problem is and how to work around it, and once the patch is done it should probably not be all that relevant.

The reason why duplicate packages occur somewhat more often on my system is that some projects have pulled the package from an internal package registry and others from the official package registry, so both coexist.

For others stumbling upon the same issue:

  • If you get the error shown in the initial post, compare the packages from the SBOM with the packages in the inventory. Any missing packages are the ones that have been rejected
    • Check if these packages are present multiple times with the same PURL
    • If any of the packages are not needed, e.g. not associated with any products, then consider removing them
  • If you are using Improve Package from PurlDB and get an error about uniqueness constraints, this means the attempt to add data to a package would make it equivalent to another one.
    • Remove the incomplete package from the project's inventory and instead link the existing complete one from the catalog
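
The "present multiple times with the same PURL" check above can be scripted once the package list is available as data. A minimal sketch, assuming each package has been exported as a dict; the keys `id`, `purl`, `download_url`, and `products`, the helper names, and the sample records are all hypothetical and do not reflect a DejaCode export format.

```python
from collections import defaultdict


def duplicates_by_purl(packages):
    """Group packages by PURL and keep only PURLs that occur more than once."""
    groups = defaultdict(list)
    for package in packages:
        groups[package["purl"]].append(package)
    return {purl: group for purl, group in groups.items() if len(group) > 1}


def removal_candidates(group):
    """Within a duplicate group, pick records with no download URL and no product."""
    return [
        package for package in group
        if not package.get("download_url") and not package.get("products")
    ]


# Made-up sample data for illustration.
packages = [
    {"id": 1, "purl": "pkg:npm/parse-json@4.0.0",
     "download_url": "https://registry.npmjs.org/parse-json/-/parse-json-4.0.0.tgz",
     "products": ["Product A"]},
    {"id": 2, "purl": "pkg:npm/parse-json@4.0.0", "download_url": "", "products": []},
    {"id": 3, "purl": "pkg:npm/left-pad@1.3.0", "download_url": "", "products": []},
]

for purl, group in duplicates_by_purl(packages).items():
    ids = [package["id"] for package in removal_candidates(group)]
    print(purl, "-> review for removal:", ids)
```

The output is only a review list; whether a record is safe to delete still needs a manual check.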

tdruez added a commit that referenced this issue Apr 24, 2025
tdruez added a commit that referenced this issue Apr 24, 2025
tdruez added a commit that referenced this issue Apr 24, 2025
@tdruez
Contributor

tdruez commented Apr 24, 2025

@ghsa-retrieval The initial import issue should be fixed by #298

We'll handle the PurlDB issue in a separate PR ;)

@ghsa-retrieval
Author

@tdruez Thank you so much. I'll create a build tomorrow and give it a try.

@ghsa-retrieval
Author

ghsa-retrieval commented Apr 25, 2025

@tdruez Unfortunately, the patch is not working as I would expect. I'm now seeing duplicate packages being created for all of them, even the ones that could previously be cleanly mapped to exactly one existing package (for which there was no other package with the same PURL in the catalog prior to the SBOM import).

@tdruez
Contributor

tdruez commented Apr 25, 2025

@ghsa-retrieval I see, the new package matching is a bit too restrictive. We probably want to be more flexible when only 1 package exists for a given PURL and use this package instead of creating a new record (even if the download_url value differs).

@ghsa-retrieval
Author

@tdruez At least if the download_url for the potential new package is empty. If there truly is a different one, I'd say creating a new one would be the correct way to go.
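
The rule proposed in the last two comments could be sketched as follows. This is an illustration of the suggestion as discussed here, not the logic that was eventually merged; `select_existing` is a hypothetical helper, and returning `None` for "no unambiguous match" is an assumption about how a caller would signal that a new package may be needed.

```python
def select_existing(candidates, incoming_download_url):
    """
    Pick the catalog package to reuse for an incoming SBOM entry.

    `candidates` are the catalog packages that share the incoming PURL.
    Returns the package to reuse, or None when there is no unambiguous match.
    """
    if incoming_download_url:
        # A real download URL was provided: only an exact match is reused.
        exact = [
            package for package in candidates
            if package["download_url"] == incoming_download_url
        ]
        return exact[0] if len(exact) == 1 else None
    if len(candidates) == 1:
        # No download URL provided and only one package has this PURL: reuse it.
        return candidates[0]
    # No download URL and several packages with this PURL: no clear choice.
    return None


official = {"id": 1, "download_url": "https://registry.npmjs.org/parse-json/-/parse-json-4.0.0.tgz"}
internal = {"id": 2, "download_url": "https://registry.internal.example/parse-json-4.0.0.tgz"}

print(select_existing([official], ""))            # reuses the single package
print(select_existing([official], internal["download_url"]))  # None
print(select_existing([official, internal], ""))  # None
```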

tdruez added a commit that referenced this issue Apr 28, 2025
tdruez added a commit that referenced this issue May 2, 2025
tdruez added a commit that referenced this issue May 2, 2025
tdruez added a commit that referenced this issue May 2, 2025
tdruez added a commit that referenced this issue May 2, 2025
@tdruez
Contributor

tdruez commented May 2, 2025

@ghsa-retrieval I've implemented and merged the refinements for the package matching logic in #300
You can give it a try and let me know ;)

@ghsa-retrieval
Author

ghsa-retrieval commented May 2, 2025

@tdruez Thank you very much! It seems to be working well: no errors, and everything imported. All packages that already existed got properly mapped. The only exception was a package that already had two packages in the catalog with the same PURL but different download URLs. I think that is expected behavior, because it would not be clear which one DejaCode is supposed to pick if the imported SBOM does not provide a download URL (empty) that matches either of the existing ones.

@tdruez
Contributor

tdruez commented May 5, 2025

It seem to be working well,

Thanks for the confirmation!

The only exception was a package that already had two packages in the catalog with the same PURL but different download URL. I think that is expected behavior because it would not be clear which one DejaCode is supposed to pick if the imported SBOM does not provide a download URL (empty) that matches either of the existing ones.

Yes, this is the expected behavior. DejaCode tries to re-use existing packages as much as possible, unless there is no clear way to make a choice.


Running "Improve Package from PurlDB" fails with duplicate key value violates unique constraint "component_catalog_packag_dataspace_id_type_namesp_c6620419_uniq" DETAIL: Key (dataspace_id, type, namespace, name, version, qualifiers, subpath, download_url, filename)=(3, npm, , parse-json, 4.0.0, , , https://registry.npmjs.org/parse-json/-/parse-json-4.0.0.tgz, parse-json-4.0.0.tgz) already exists. since assigning the download_url would make it a fully duplicate package.

I'll look into this one next before closing on this issue.

@tdruez
Contributor

tdruez commented May 5, 2025

Entered as #303
Closing this one.

tdruez closed this as completed May 5, 2025
3 participants