Skip to content

Fix read_volatile intrinsic #216

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open

Conversation

LegNeato
Copy link
Contributor

@LegNeato LegNeato commented May 24, 2025

It would always generate PTX to load only a 1-bit integer (i1) from the provided pointer, regardless of the actual size of T. The alignment specified for the load was also based on the pointer's alignment, not the pointee's alignment.

This commit changes it to:

  1. Load the full llvm_type_of(T) instead of i1.
  2. Use the correct alignment of T for the load instruction.
  3. Removes an incorrect pointer cast that was based on the return ABI of T rather than the type of the memory being read.

Possibly fixes #208.

@LegNeato
Copy link
Contributor Author

LegNeato commented May 24, 2025

Note I have never touched this code before and we don't have tests, so I am very unsure of this change.

@LegNeato LegNeato force-pushed the fix208 branch 2 times, most recently from ce65c5e to e72a113 Compare May 24, 2025 15:35
@LegNeato LegNeato marked this pull request as draft May 24, 2025 15:35
@LegNeato LegNeato force-pushed the fix208 branch 3 times, most recently from 016c8b4 to fab7455 Compare May 24, 2025 15:59
@LegNeato

This comment was marked as outdated.

@brandonros

This comment was marked as outdated.

@brandonros

This comment was marked as outdated.

@LegNeato

This comment was marked as outdated.

@brandonros
Copy link

jedisct1/rust-ed25519-compact@master...brandonros:rust-ed25519-compact:master

Can you give this a look? I did this as a wild guess I'm afraid.

image

Can I do this atomic/fence part better/differently?

I see you are fixing read_volatile. Does write_volatile need the same fix by chance?

@brandonros
Copy link

brandonros commented May 25, 2025

Ignore that one, that's the "working" library (rust-ed25519-compact) actually. The non-working one is the more popular curve25519-dalek.

Here is the IllegalAddress one:

 let scalar = curve25519_dalek::Scalar::from_bytes_mod_order(input); // good
        let point = curve25519_dalek::constants::ED25519_BASEPOINT_TABLE * &scalar; // good
        let compressed_point = point.compress(); // problem in here
        /*let public_key_bytes = compressed_point.to_bytes();
        public_key_bytes*/

Let me figure out exactly which line in curve25519_dalek is giving us the problem, might give us a clue

@brandonros
Copy link

Update:

drilling it into 1 specific line from curve25519-dalek library

let mut input = [0u8; 32];
        input.copy_from_slice(&hashed_private_key_bytes[0..32]);
        let scalar = curve25519_dalek::Scalar::from_bytes_mod_order(input);
        let point = curve25519_dalek::constants::ED25519_BASEPOINT_TABLE * &scalar;
        let recip = point.Z.invert();
        let x = &point.X * &recip;
        let y = &point.Y * &recip;
        let mut s: [u8; 32];
        s = y.as_bytes();
        s[31] ^= x.is_negative().unwrap_u8() << 7; // <<< it is this line that causes IllegalAddress

will keep digging in

@brandonros
Copy link

If I convert this

impl FieldElement {
    /// Determine if this `FieldElement` is negative, in the sense
    /// used in the ed25519 paper: `x` is negative if the low bit is
    /// set.
    ///
    /// # Return
    ///
    /// If negative, return `Choice(1)`.  Otherwise, return `Choice(0)`.
    pub fn is_negative(&self) -> Choice {
        let bytes = self.as_bytes();
        (bytes[0] & 1).into()
    }

into this:

        let x_bytes = x.as_bytes();
        let x_is_negative = x_bytes[0] & 1;
        s[31] ^= x_is_negative << 7;

one IllegalAddress problem goes away and I get further

I ended up with this... I might be chasing a ghost/it might be a different issue (like some alignment issue or something that throws things off down the stack or something)

else {
        let mut input = [0u8; 32];
        input.copy_from_slice(&hashed_private_key_bytes[0..32]);
        let scalar = curve25519_dalek::Scalar::from_bytes_mod_order(input);
        let point = curve25519_dalek::constants::ED25519_BASEPOINT_TABLE * &scalar;
        let recip = point.Z.invert();
        let x = &point.X * &recip;
        let y = &point.Y * &recip;
        let mut s: [u8; 32];
        s = y.as_bytes();
        let x_bytes = x.as_bytes();
        let x_is_negative = x_bytes[0] & 1;
        s[31] ^= x_is_negative << 7;
        let compressed_point = curve25519_dalek::edwards::CompressedEdwardsY(s);
        let public_key_bytes = compressed_point.to_bytes();
        //public_key_bytes // if you try to return this, IllegalAddress
        [0u8; 32] // if you return this, everything else above it passes?
    }

I'll leave it on the branch for you to repro if you want

@brandonros
Copy link

sorry for the spam, journey continues

the function can return the public_key_bytes no problem, it's when you turn around and try to read it, which would make sense given your read_volatile but I'm not quite seeing where something is marked volatile in this mix, did you see something that tipped you off onto that? Is there some other stale read potentially?

Otherwise it's like a Drop or something that I've seen when playing with Rust and FFI before. Similar behavior.

impl CompressedEdwardsY {
    /// View this `CompressedEdwardsY` as an array of bytes.
    pub const fn as_bytes(&self) -> &[u8; 32] {
        &self.0
    }

    /// Copy this `CompressedEdwardsY` to an array of bytes.
    pub const fn to_bytes(&self) -> [u8; 32] {
        self.0
    }

I used to have do things like this std::mem::ManuallyDrop::new(buf).as_slice().as_ptr(), let's see if I can play with it.

@brandonros
Copy link

Even trying this

let public_key_bytes = compressed_point.to_bytes();
        output.copy_from_slice(&public_key_bytes[0..32]);
    let mut public_key_bytes = [0u8; 32];
    derrive_public_key(hashed_private_key_bytes, &mut public_key_bytes);

or this

        let mut output = [0u8; 32];
        output.copy_from_slice(&public_key_bytes[0..32]);
        output

does not seem to make it much better

@brandonros
Copy link

brandonros commented May 25, 2025

    // ed25519 private key -> public key (first 32 bytes only)
    cuda_std::println!("hashed_private_key_bytes: {:02x?}", hashed_private_key_bytes);
    let mut public_key_bytes = [0u8; 32];
    derrive_public_key(hashed_private_key_bytes, &mut public_key_bytes);
    //cuda_std::println!("public_key_bytes: {}", public_key_bytes[0]); // if you uncomment this, IllegalAddress

This is about as simple as I can boil it down to

@brandonros
Copy link

this was a pretty bad miss on my end...

if the final result is not needed, it's most likely (obviously) dropped by the compiler

let me add a println! in between each line/result and see if that helps

@brandonros
Copy link

        // 1
        let scalar = curve25519_dalek::Scalar::from_bytes_mod_order(input);
        cuda_std::println!("scalar: {:?}", scalar);

just this, fine

        // 1
        let scalar = curve25519_dalek::Scalar::from_bytes_mod_order(input);
        cuda_std::println!("scalar: {:?}", scalar);

        // 2
        let point = curve25519_dalek::constants::ED25519_BASEPOINT_TABLE * &scalar;
        cuda_std::println!("point: {:?}", point);
Found 1 CUDA devices
Starting device 0
[0] Loading module...
[0] Starting search loop...
hashed_private_key_bytes: [60, 95, b5, 5d, 3b, 22, 10, e2, 4e, ac, 07, 85, 73, f5, 53, 11, eb, 1e, c7, ea, 6b, f8, 03, cf, d3, 84, 22, 63, 60, 00, e7, 6c, 6d, 82, b7, 6f, 03, 46, d8, 9c, 5e, f1, 93, 46, 65, dc, 63, 73, 53, 0e, 7f, f0, 64, a8, 91, f8, 11, 55, cf, be, af, 27, 21, 07]

thread 'main' panicked at src/main.rs:123:32:
called `Result::unwrap()` on an `Err` value: IllegalAddress

by enabling the multiplication... it breaks even the "#1" working step...

will continue to dive in, don't expect you to read all this, hoping it helps

@LegNeato
Copy link
Contributor Author

No, the stream of consciousness is good and helps! Thanks for sticking with it. I had originally looked at ED25519_BASEPOINT_TABLE but I think I followed a red herring looking at blackbox, which lead me to read_volatile.

@brandonros
Copy link

brandonros commented May 25, 2025

So I did something a bit strange/foolish to help dive into this:

https://github.com/brandonros/nvptx-llvm-poc

Just to remove Rust-CUDA from the mix, I made a small pipeline of Rust -> LLVM IR -> NVPTX to see if it could shed insight.

llc: error: llc: output/kernel.ll:44:32: error: expected type
  %_8 = getelementptr inbounds nuw i8, ptr %point, i64 80
; Function Attrs: nounwind
define ptx_kernel void @ed25519_poc() unnamed_addr #2 {
start:
  %x_bytes = alloca [32 x i8], align 1
  %_16 = alloca [32 x i8], align 1
  %y = alloca [40 x i8], align 8
  %x = alloca [40 x i8], align 8
  %recip = alloca [40 x i8], align 8
  %point = alloca [160 x i8], align 8
  %scalar = alloca [32 x i8], align 1
  %input = alloca [32 x i8], align 1
  call void @llvm.memset.p0.i64(ptr noundef nonnull align 1 dereferenceable(32) %input, i8 0, i64 32, i1 false)
  call void @llvm.lifetime.start.p0(i64 32, ptr nonnull %scalar)
; call curve25519_dalek::scalar::Scalar::from_bytes_mod_order
  call void @_ZN16curve25519_dalek6scalar6Scalar20from_bytes_mod_order17ha97de3501966aa39E(ptr noalias nocapture noundef nonnull sret([32 x i8]) align 1 dereferenceable(32) %scalar, ptr noalias nocapture noundef nonnull readonly align 1 dereferenceable(32) %input) #8
  call void @llvm.lifetime.start.p0(i64 160, ptr nonnull %point)
  %_4 = load ptr, ptr @_ZN16curve25519_dalek7backend6serial3u649constants23ED25519_BASEPOINT_TABLE17h1495e1294bfbdf04E, align 8, !nonnull !2, !align !3, !noundef !2
; call <&curve25519_dalek$$edwards$$EdwardsBasepointTable$u20$as$u20$core$$ops$$arith$$Mul$LT$$RF$curve25519_dalek$$scalar$$Scalar$GT$$GT$::mul
  call void @"_ZN138_$LT$$RF$curve25519_dalek$$edwards$$EdwardsBasepointTable$u20$as$u20$core$$ops$$arith$$Mul$LT$$RF$curve25519_dalek$$scalar$$Scalar$GT$$GT$3mul17hb25e470272c01c40E"(ptr noalias nocapture noundef nonnull sret([160 x i8]) align 8 dereferenceable(160) %point, ptr noalias noundef nonnull readonly align 8 dereferenceable(30720) %_4, ptr noalias noundef nonnull readonly align 1 dereferenceable(32) %scalar) #8
  call void @llvm.lifetime.start.p0(i64 40, ptr nonnull %recip)
  %_8 = getelementptr inbounds nuw i8, ptr %point, i64 80
; call curve25519_dalek::field::<impl curve25519_dalek$$backend$$serial$$u64$$field$$FieldElement51$GT$::invert
# Build with cargo
pushd kernel
cargo clean
cargo +nightly build --release --target nvptx64-nvidia-cuda
popd

# Find the generated .ll file (cargo puts it in a nested directory)
find kernel/target/nvptx64-nvidia-cuda/release/deps -name "kernel.ll" -exec cp {} output/kernel.ll \;

# Convert .ll to .ptx
llc -march=nvptx64 -mcpu=sm_75 output/kernel.ll -o output/kernel.ptx

Does this help us? I am on a newer version of nightly...

Shall I try this code without the precomputed table, let me see how that goes...

@brandonros

This comment was marked as off-topic.

@LegNeato

This comment was marked as outdated.

@LegNeato LegNeato force-pushed the fix208 branch 2 times, most recently from cc392bb to efd57d0 Compare May 25, 2025 19:30
@brandonros

This comment was marked as outdated.

@brandonros

This comment was marked as outdated.

@brandonros

This comment was marked as outdated.

@brandonros

This comment was marked as outdated.

@brandonros

This comment was marked as outdated.

@LegNeato

This comment was marked as outdated.

@LegNeato

This comment was marked as outdated.

@brandonros

This comment was marked as outdated.

@brandonros

This comment was marked as outdated.

@LegNeato

This comment was marked as outdated.

@LegNeato

This comment was marked as outdated.

@brandonros

This comment was marked as outdated.

@LegNeato
Copy link
Contributor Author

I'm going to split the read_volatile changes from the address space changes.

LegNeato added a commit to LegNeato/Rust-CUDA that referenced this pull request May 26, 2025
This isn't fuly correct, as ideally we keep track
of what we have put into constant memory and when
it is filled up spill insead of only spilling when
a static is big. But, this is materially better than
what is there.

Fixes Rust-GPU#208.

See also the debugging and discussion in
Rust-GPU#216
LegNeato added a commit to LegNeato/Rust-CUDA that referenced this pull request May 26, 2025
This isn't fully correct, as ideally we keep track
of what we have put into constant memory and when
it is filled up spill insead of only spilling when
a static is big. But, this is materially better than
what is there (which is a runtime error).

An argument can be made to just _always_ use global
memory and we don't have to worry about getting the
packing right.

Fixes Rust-GPU#208.

See also the debugging and discussion in
Rust-GPU#216
LegNeato added a commit to LegNeato/Rust-CUDA that referenced this pull request May 26, 2025
This isn't fully correct, as ideally we keep track
of what we have put into constant memory and when
it is filled up spill instdead of only spilling when
a static is big. But, this is materially better than
what is there (which is a runtime error).

An argument can be made to just _always_ use global
memory and we don't have to worry about getting the
packing right.

Fixes Rust-GPU#208.

See also the debugging and discussion in
Rust-GPU#216
@LegNeato
Copy link
Contributor Author

@brandonros #217

LegNeato added a commit that referenced this pull request May 26, 2025
* Allow address spaces to propagate to LLVM

This looks like it was code that wasn't deleted after the refactor in
decda87

* Spill large statics from constant to global memory

This isn't fully correct, as ideally we keep track
of what we have put into constant memory and when
it is filled up spill instdead of only spilling when
a static is big. But, this is materially better than
what is there (which is a runtime error).

An argument can be made to just _always_ use global
memory and we don't have to worry about getting the
packing right.

Fixes #208.

See also the debugging and discussion in
#216

* Add `--use-constant-memory-space` flag, off by default

* Make it clear that `#[cuda_std::address_space(constant)]` still works
@brandonros
Copy link

@LegNeato I just confirmed we need this. Without it curve25519-dalek generates entirely wrong (silently) different results on GPU vs CPU. this fixes it.

It would always generate PTX to load only a 1-bit integer (`i1`)
from the provided pointer, regardless of the actual size of `T`.
The alignment specified for the load was also based on the pointer's
alignment, not the pointee's alignment.

This commit changes it to:
1. Load the full `llvm_type_of(T)` instead of `i1`.
2. Use the correct alignment of `T` for the load instruction.
3. Removes an incorrect pointer cast that was based on the return
   ABI of `T` rather than the type of the memory being read.

Possibly fixes Rust-GPU#208.
@brandonros
Copy link

are you good to merge, anything you want to wait for?

@LegNeato
Copy link
Contributor Author

LegNeato commented Jun 3, 2025

I'm not exactly sure this is correct in all cases, I hit a different issue while investigating the sha2 stuff and trying to track down if this is the cause. Not familiar with this area so fumbling around in the dark a bit.

@LegNeato LegNeato marked this pull request as ready for review June 3, 2025 11:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

curve25519_dalek crate = runtime error
2 participants