sunlight: specifying synchronous merging #79

Closed
jdeblasio opened this issue Jun 7, 2024 · 5 comments

Comments

@jdeblasio

Early drafts of the Sunlight spec noted that the inclusion of the leaf index in the SCT "limit[ed] Sunlight logs to a null Merge Delay" but that language was softened after it was observed that it was possible to identify a future leaf index without actually yet including the certificate in the tree.

Synchronous merging (i.e. a null merge delay) is a highly-desirable property from Chrome's perspective, and we would like to see this property added to the spec more explicitly.

Experience with RFC 6962 logs in the existing CT ecosystem has shown that one of the greatest risks to individual logs is the issuance of SCTs whose corresponding certificates are never included in the log's Merkle tree. Dropping certificates for which an SCT has been issued results in an unrecoverable loss of integrity, leading to the log's removal from the list of usable logs by CT-enforcing user agents.

Avoiding this risk is worth a lot to us. Logs commonly experience downtime, but as long as logs have durably included all certificates for which SCTs were issued, and resume correctly serving the required submission and read endpoints, these failures are typically fully recoverable. Downtime when RFC 6962 logs have not yet fully incorporated all pending certificates has led to several log failures due to either omitting entries entirely or rebuilding the tree in a way that resulted in a split view.

Logs that break their integrity guarantees not only pose risks to the directly-involved certificates, but also cause extended periods of reduced availability of CT logging for the entire web ecosystem. Replacement of a log is far from instantaneous -- it takes months to ensure that a new log is usable in all enforcing user agents. During that time, the WebPKI must rely on fewer remaining CT logs.

One wrinkle is that the current specification identifies an API, but largely does not dictate other log behavior. I'll provide a PR soon, but broadly, we'd like to propose the introduction of a "Log Behavior" section (mirroring a similar section in RFC 6962) that specifies that:

  • A log MUST incorporate the certificate into the Merkle Tree before returning the SCT.
  • To facilitate the usability of this log, add-chain and add-pre-chain APIs SHOULD return SCTs within a specified SLO.
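
The first proposed requirement can be pictured with a minimal sketch. This is a toy, in-memory illustration only, not Sunlight's implementation: `ToyLog`, its `add_chain` method, and the dictionary standing in for an SCT are hypothetical names, and durability (the part the rest of the thread debates) is not modeled at all. The only real-world piece is the RFC 6962 Merkle tree hashing (0x00 prefix for leaves, 0x01 for interior nodes).

```python
import hashlib


def leaf_hash(data: bytes) -> bytes:
    # RFC 6962 leaf hash: SHA-256 over a 0x00 prefix plus the leaf data.
    return hashlib.sha256(b"\x00" + data).digest()


def node_hash(left: bytes, right: bytes) -> bytes:
    # RFC 6962 interior node hash: SHA-256 over a 0x01 prefix plus both children.
    return hashlib.sha256(b"\x01" + left + right).digest()


def merkle_root(leaves: list) -> bytes:
    # RFC 6962 Merkle Tree Hash, computed recursively (fine for a toy).
    n = len(leaves)
    if n == 0:
        return hashlib.sha256(b"").digest()
    if n == 1:
        return leaf_hash(leaves[0])
    k = 1
    while k * 2 < n:  # largest power of two strictly less than n
        k *= 2
    return node_hash(merkle_root(leaves[:k]), merkle_root(leaves[k:]))


class ToyLog:
    """Hypothetical log that merges synchronously (null merge delay)."""

    def __init__(self):
        self.entries = []
        self.checkpoint = (0, merkle_root([]))  # (tree size, root hash)

    def add_chain(self, cert: bytes) -> dict:
        index = len(self.entries)
        # Step 1: incorporate the entry and recompute the tree head.
        self.entries.append(cert)
        self.checkpoint = (len(self.entries), merkle_root(self.entries))
        # Step 2: only now hand back the (stand-in) SCT with its leaf index.
        return {"leaf_index": index}


log = ToyLog()
sct = log.add_chain(b"cert-a")
# The tree already covers the returned index when the caller sees the SCT.
assert sct["leaf_index"] < log.checkpoint[0]
```

The point of the ordering is that no SCT can exist for an entry the tree does not already cover; what "incorporate" has to mean for storage is left open here.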
@FiloSottile
Member

I agree with the value of synchronous merging, and it's one of the major motivations of the Sunlight log implementation.

I am not sure we can effectively require it in this specification, though, or even at all in policy. Unless resorting to audits or code review, policies are limited to specifying externally observable behavior. "Any SCT must be eventually incorporated at the index in the extension" is already the requirement of the current document. "Any SCT must be observably incorporated before being returned" is a way stronger requirement that doesn't leave space for designs that offer immediate durability but eventual consistency of reads (e.g. putting a caching CDN in front of a Sunlight log). "A log MUST incorporate the certificate into the Merkle Tree before returning the SCT" is not specifying an externally observable behavior, and kind of gets into the durability properties of the log's internals (e.g. "incorporate" probably means running fsync? What about the RAID cache?).
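
To make "externally observable" concrete, here is a rough sketch of the kind of check a monitor could run from the outside, using only an SCT and a published tree size. All names (`Sct`, `check_inclusion_promise`, the three result strings) are hypothetical, and the check is deliberately simplified: a real monitor would also fetch the entry at that index and compare it with the certificate.

```python
from dataclasses import dataclass


@dataclass
class Sct:
    timestamp_ms: int  # SCT timestamp, milliseconds since the epoch
    leaf_index: int    # index promised in the SCT extension


def check_inclusion_promise(sct: Sct, tree_size: int, now_ms: int, mmd_ms: int) -> str:
    """Classify an SCT against an observed tree size (hypothetical helper)."""
    if sct.leaf_index < tree_size:
        # The published tree is large enough to cover the promised index.
        return "incorporated"
    if now_ms <= sct.timestamp_ms + mmd_ms:
        # Not covered yet, but the log is still within its merge delay.
        return "pending"
    # The merge delay has elapsed and the index is still not covered.
    return "violation"


DAY_MS = 24 * 60 * 60 * 1000
sct = Sct(timestamp_ms=1_000, leaf_index=5)
status = check_inclusion_promise(sct, tree_size=5, now_ms=2_000, mmd_ms=DAY_MS)
assert status == "pending"
```

With a merge delay of zero the "pending" branch disappears, which is why a caching layer that serves a slightly stale tree size would look like a violation under the stronger requirement.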

IMHO, applying pressure towards safer designs (such as towards synchronous merging, and away from SCTs) makes perfect sense, and that's what the SCT extension does. Mandating designs, on the other hand, feels complex and brittle. We can keep going further (in future iterations of the specification): for example we could require returning the STH along with the SCT next.

@AlCutter
Contributor

AlCutter commented Jun 8, 2024

+1 to Filippo's response above, this is where I landed too and for largely the same reasons.

@phbnf
Contributor

phbnf commented Jun 8, 2024

+1

"A log MUST incorporate the certificate into the Merkle Tree before returning the SCT": while I understand the desire, this is quite a strong requirement that does not bode well for distributed systems. Today CT's MMD is 24 hours. I'd be keen to reduce this by a few orders of magnitude, but not necessarily to drop it to zero, as that would narrow down design options quite a lot, so much that I think it would hinder the Sunlight spec's momentum towards a better ecosystem.
I also have a hard time seeing how this requirement would be reliably measured, and how failure to meet it would be reported.

@AGWA

AGWA commented Jun 11, 2024

I agree with @FiloSottile.

Also, it's unclear to me why a log which has trouble storing unmerged entries durably wouldn't also have trouble storing the sequenced tree durably. Indeed, analyzing past log failures involving data loss or corruption suggests that mandating synchronous merging wouldn't have helped in most cases:

  • Venafi, TrustAsia 2020, TrustAsia 2021: log incorporated entries but later signed an inconsistent view of the tree
  • Yeti 2022 #2, Yeti 2023: log operator deleted the database by accident, which also deleted incorporated entries
  • Yeti 2022, Nessie 2024: bit flip in incorporated entries
  • Mammoth 2024h1: corruption of unincorporated entries, however the root cause was using a database which doesn't handle out of space errors robustly; if this log had synchronous merging, the out of space condition would likely have been triggered by writing the table for sequenced entries rather than unsequenced entries, so the log would have died anyways
  • StartCom, WoSign, Nessie 2023: loss of unincorporated entries, however the log operators never posted detailed post mortems, so we don't know if synchronous merging would have helped

I believe that Sunlight logs will be more robust than past CT logs not because of the protocol but because of the following aspects of the Sunlight implementation:

  • Single write node instead of attempting to achieve high write availability
  • Using object storage instead of a database
  • Compare-and-swap protections

It would be beneficial to encourage all log implementations to make the above choices, but the Sunlight protocol spec doesn't seem like the right place for it.
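
As a rough illustration of the compare-and-swap idea in the list above (not Sunlight's actual code), the sketch below uses a hypothetical `ToyStore` with a version counter. A writer may only replace the stored checkpoint if the version it last read is still current, so a second writer working from a stale view is refused rather than silently forking the tree.

```python
class ToyStore:
    """Hypothetical stand-in for a storage backend with conditional writes."""

    def __init__(self):
        self.value = b""
        self.version = 0

    def read(self):
        # Return the current value together with the version it was read at.
        return self.value, self.version

    def compare_and_swap(self, expected_version: int, new_value: bytes) -> bool:
        # Refuse the write if someone else has updated the value since the read.
        if self.version != expected_version:
            return False
        self.value = new_value
        self.version += 1
        return True


store = ToyStore()
_, seen_by_a = store.read()  # writer A reads version 0
_, seen_by_b = store.read()  # writer B also reads version 0

a_ok = store.compare_and_swap(seen_by_a, b"checkpoint from writer A")
b_ok = store.compare_and_swap(seen_by_b, b"checkpoint from writer B")
print(a_ok, b_ok)  # True False
```

Writer B has to re-read the store and rebuild on top of A's checkpoint before it can publish anything, which is the property that guards against an accidental split view.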

@jdeblasio
Author

Thanks for the insights. Many of the benefits of synchronous inclusion are indeed achieved by durably storing unmerged entries, and by having much shorter MMDs, so I'll close this issue out.

Chrome will very likely try to address the second point by significantly reducing the allowable MMD of tiled logs accepted into Chrome's list. If anyone has opinions on what those restrictions should/should not be, feel free to reach out (directly to me, on ct-policy@, or wherever else is convenient.)

@jdeblasio closed this as not planned on Jun 24, 2024