[doc] perlpacktut #19203

Open
zhijieshi opened this issue Oct 19, 2021 · 6 comments
zhijieshi commented Oct 19, 2021

Where

https://perldoc.perl.org/perlpacktut

Description

Issue 1:

The example at the end of "The Basic Principle" packs "byte contents from a string of hexadecimal digits".
The code is pack( 'H2' x 10, 30..39 ). It is not obvious that 30 is meant to be read as a string of hexadecimal digits.
Why make it unnecessarily confusing?

The following would be easier for beginners and would avoid the kind of "misunderstanding" this tutorial is meant to prevent.

my $s = pack( 'H2' x 10, '30'..'39');
print "$s\n";

Issue 2:

Since there are Unicode strings and byte strings, it is not clear which of them can be unpacked. Unpacking a Unicode string seems to give unexpected results.

#!/usr/bin/perl -w
use v5.34;
use utf8;
use strict;
use warnings;
use Encode qw(encode decode);

my $s = "0123456789😀";
my $b = encode "UTF8", $s;

say "Unpack unicode string 1: ",  unpack( '(H2)*', $s);
say "Unpack unicode string 2: ",  unpack( 'H*', $s);
say "Unpack bytes:            ", unpack( 'H*', $b);

{
use bytes;
say "Unpack unicode string 3: ",  unpack( 'H*', $s);
}

The output is:

Character in 'H' format wrapped in unpack at .\t.pl line 11.
Unpack unicode string 1: 3031323334353637383900
Character in 'H' format wrapped in unpack at .\t.pl line 12.
Unpack unicode string 2: 3031323334353637383900
Unpack bytes:            30313233343536373839f09f9880
Unpack unicode string 3: 30313233343536373839f09f9880
@Grinnz
Contributor

Grinnz commented Oct 19, 2021

Thank you. I agree with your first point, though it may be made even clearer by using strings containing hex digits A-F in the example.
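
A minimal sketch of what such an example might look like. The hex values below are illustrative and are not taken from perlpacktut; they are chosen so that several of them contain the digits A-F.

```perl
use strict;
use warnings;

# Each 'H2' consumes one two-character string of hex digits
# and produces one byte. The values are an illustrative choice.
my @hex = ('48', '65', '6C', '6C', '6F');

# 'H2' x @hex repeats the template once per list element.
my $s = pack('H2' x @hex, @hex);
print "$s\n";   # Hello
```

Writing the digits as quoted strings makes it explicit that pack receives text, not numbers.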

For point 2, the Unicode section probably needs to be rewritten as it's overly abstraction dependent, similar to your "use bytes" example which breaks the Perl string abstraction. I'm not sure exactly what you're suggesting is the problem there otherwise.

@zhijieshi
Author

For point 2, I would like to see some clarification in the tutorial. I agree that some sections may need to be rewritten. When I read the tutorial, I had these questions.

Q1: Can a Unicode string be unpacked? If it is not recommended, then the tutorial can state clearly "do not unpack Unicode strings".

Q2: The example in the tutorial seems to suggest that it is fine to unpack a Unicode string into "strings". If a Unicode string can be unpacked in some cases, when does it work?

while (<>) {
    my ($date, $desc, $income, $expend) =
        unpack("A10xA27xA7xA*", $_);
    $tot_income += $income;
    $tot_expend += $expend;
}

@Grinnz
Contributor

Grinnz commented Oct 19, 2021

It's a bit complex. The Perl string abstraction is simply a sequence of codepoints - not Unicode, nor bytes, until something interprets it as such. The 'a' and 'A' patterns for example will pass through a codepoint whether or not it fits in a byte, but other patterns like 'C' which are defined to operate on bytes have less obvious behavior (and unfortunately don't warn that you're doing something strange).

And your example has an additional complication. Unless you pass -CSD or add a decoding layer to STDIN or the files you are reading from, <> will return encoded bytes, not Unicode strings. So in that example unpack is likely receiving a byte string.
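
A small sketch of that difference, assuming UTF-8 input. The in-memory filehandle and the sample line are stand-ins for a real file or STDIN and are not part of the tutorial.

```perl
use strict;
use warnings;

# "café\n" as UTF-8 octets: the é takes two bytes (hypothetical sample data).
my $octets = "caf\xC3\xA9\n";

# No decoding layer: the line comes back as encoded bytes.
open my $raw, '<', \$octets or die "open failed: $!";
my $line_raw = <$raw>;
close $raw;

# With a decoding layer: the line comes back as characters.
open my $dec, '<:encoding(UTF-8)', \$octets or die "open failed: $!";
my $line_dec = <$dec>;
close $dec;

# A fixed-width 'A4' field counts bytes in one case, characters in the other.
my ($field_raw) = unpack('A4', $line_raw);   # "caf" plus the first byte of é
my ($field_dec) = unpack('A4', $line_dec);   # "café" as four characters

printf "raw length: %d, decoded length: %d\n",
    length($line_raw), length($line_dec);    # raw length: 6, decoded length: 5
```

So the column widths in a template like "A10xA27xA7xA*" mean bytes or characters depending on how the input was read.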

@zhijieshi
Author

Thanks for the explanation. To summarize, a string may contain a codepoint that consists of more than one byte. The 'a' or 'A' pattern works with those codepoints, while some other patterns work with bytes only.

@Grinnz
Contributor

Grinnz commented Oct 20, 2021

It's more accurate to say it may have a codepoint which cannot represent a byte because it is higher than 255. What it's represented by internally is immaterial (unless using "use bytes", which is why that is problematic).
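
One way to stay inside the string abstraction, sketched here on the assumption that the text is meant to be serialized as UTF-8: encode explicitly, then unpack the resulting bytes. The sample string is illustrative.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Two codepoints: U+00E9 fits in a byte, U+20AC does not.
my $chars = "\x{E9}\x{20AC}";

# Explicit encoding yields a string whose every element is 0..255.
my $bytes = encode('UTF-8', $chars);

# 'A*' passes codepoints through unchanged, so this is still two characters.
my ($text) = unpack('A*', $chars);

# 'H*' is defined on bytes, so apply it to the encoded form.
my $hex = unpack('H*', $bytes);

printf "%d characters, %d bytes, hex %s\n",
    length($text), length($bytes), $hex;   # 2 characters, 5 bytes, hex c3a9e282ac
```

This gives the same kind of result as the "use bytes" block in the original report without depending on how Perl stores the string internally.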

@khwilliamson
Contributor

I've never fully understood pack and unpack, and I no longer think it's just me.

Looking at @zhijieshi's first example, I would think that if it were changed to

my $s = pack( 'H2' x 26, '41'..'5A' );

things would be clear. But instead this comes out

ABCDEFGHIPQRSTUVWXY`abcdef

And if we make the first value in the range into a number containing a hex-only digit, we get

my $s = pack( 'H2' x 6, '4A'..'4F' );
Argument "4A" isn't numeric in range (or flop)

So, the numbers 30..39 are interpreted as hex, but not all hex numbers can be used here.
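
For comparison, a sketch that avoids applying the range operator to hex strings: iterate over the numeric values and format each one as two hex digits. This is only one possible way to write it and is not taken from the tutorial.

```perl
use strict;
use warnings;

# Range over the numbers 0x41..0x5A, then render each as two hex digits,
# so '..' never has to interpret a string such as '4A'.
my @hex = map { sprintf '%02X', $_ } 0x41 .. 0x5A;

my $s = pack('H2' x @hex, @hex);
print "$s\n";   # ABCDEFGHIJKLMNOPQRSTUVWXYZ
```

Here the list contains all 26 values, including those with the digits A-F.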

And this is near the beginning of a tutorial, talking about beginner-level stuff.


3 participants