Bug #21682

open

The result of IO#pos is inconsistent after using IO#ungetc.

Added by YO4 (Yoshinao Muramatsu) about 18 hours ago. Updated about 4 hours ago.

Status: Open
Assignee: -
Target version: -
[ruby-core:123776]

Description

In this issue, I propose modifying IO#ungetc to never change the file position and explain the reasons for this change.

The file position after using IO#ungetc may or may not change.
This behavior is not documented.
Since there are cases where the file position cannot be adjusted, I propose unifying the behavior so that the file position never changes.

Motivation

Ensure file position behavior is independent of I/O setup.
Avoid excessive exposure of implementation details.

Current Details

The file position after using IO#ungetc depends on the following conditions:

  • When encoding conversion is not used: It changes.
  • When encoding conversion is used: It does not change.

This corresponds internally to whether the ungetc character is stored in the IO object's rbuf or cbuf.

When using cbuf, the byte count of a character in cbuf may differ from its byte count in rbuf, making conversion from cbuf byte count to rbuf byte count potentially complex.
Furthermore, even if one reverse-converts a character in cbuf to calculate its byte count in rbuf, the practical significance of this calculation is questionable.
Presumably for these reasons, when a character exists in cbuf, Ruby does not subtract the buffered byte count from the actual file position to calculate the file position it reports.
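The byte-count mismatch between the two buffers can be observed from Ruby itself. The following is a minimal sketch, assuming UTF-8 as the internal encoding; `byte_counts` is a hypothetical helper for illustration and is not part of io.c.

```ruby
# Hypothetical helper: compares the size of one character in the external
# encoding (the form of the bytes held in rbuf) with its size in the
# internal encoding (the form held in cbuf after conversion).
# UTF-8 as the internal encoding is an assumption of this sketch.
def byte_counts(char, external, internal = "UTF-8")
  {
    external: char.encode(external).bytesize,
    internal: char.encode(internal).bytesize,
  }
end

byte_counts("\u00E9", "ISO-8859-1") # external: 1 byte, internal: 2 bytes
byte_counts("a", "UTF-16LE")        # external: 2 bytes, internal: 1 byte
```

Because the ratio differs per character and per encoding pair, a cbuf byte count cannot simply be subtracted from the file position.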

When the character is stored in rbuf, its byte count can be used directly to calculate the file position, so ungetc moves the file position.
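The two cases can be compared with a small script. This is a sketch under assumptions: it runs on a non-Windows platform, Encoding.default_internal is not set, and the mode strings "r:UTF-8" (no conversion) and "r:ISO-8859-1:UTF-8" (with conversion) are chosen only for illustration. The helper names are hypothetical.

```ruby
require "tempfile"

# Hypothetical helper: returns [pos after getc, pos after ungetc]
# for a file opened with the given mode string.
def pos_around_ungetc(path, mode)
  File.open(path, mode) do |f|
    c = f.getc
    before = f.pos
    f.ungetc(c)
    [before, f.pos]
  end
end

# Writes a small ASCII file and reports IO#pos around ungetc,
# once without and once with encoding conversion.
def ungetc_pos_report(bytes = "abc")
  Tempfile.create("ungetc") do |tmp|
    tmp.binmode
    tmp.write(bytes)
    tmp.flush
    {
      plain: pos_around_ungetc(tmp.path, "r:UTF-8"),
      converted: pos_around_ungetc(tmp.path, "r:ISO-8859-1:UTF-8"),
    }
  end
end

p ungetc_pos_report
```

With the behavior described above, the `plain` pair would be expected to show the position moving back after ungetc, while the two values of the `converted` pair stay equal.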

Problems

It seems relatively unknown that the side effects of ungetc can change depending on I/O setup.
The Ruby community doesn't seem to discuss avoiding the use of file positions after ungetc.
Behavior changing under little-known conditions reduces test coverage and leaves potential bugs.

On Windows, files opened in "r" mode perform CRLF conversion but operate without encoding conversion, delegating conversion to the C runtime library (CRT).
This is primarily for speed, but it modifies IO state in various places within io.c, reducing readability. It has also caused numerous bugs related to file position.
Unfortunately, on Windows, "r" and "rt" are not equivalent, because they differ in whether encoding conversion is used.
This means the behavior of ungetc cannot be predicted within the context of 'text mode'.

I'm working personally on switching to use encoding conversion instead of CRT read(), which means encoding conversion will be used for files opened with "r".
This change will alter the behavior of ungetc. This impact occurs because implementation details are being exposed excessively, which is undesirable.

Changes and Impact

IO#ungetc will no longer change the file position.
This affects not only IO#pos, but also implicitly impacts the file position when writing after ungetc.
TestIO#test_copy_stream_dst_rbuf depends on this behavior.
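The effect on writes can be sketched as follows. This assumes a non-Windows platform and a file opened without encoding conversion; `write_after_getc` is a hypothetical helper, not taken from the test suite.

```ruby
require "tempfile"

# Hypothetical helper: reads one character from a file containing "abc",
# optionally pushes it back with ungetc, writes "X", and returns the
# resulting file contents.
def write_after_getc(push_back:)
  Tempfile.create("ungetc-write") do |tmp|
    tmp.close
    File.binwrite(tmp.path, "abc")
    File.open(tmp.path, "r+:UTF-8") do |f|
      c = f.getc
      f.ungetc(c) if push_back
      f.write("X") # written at the position Ruby reports at this point
    end
    File.binread(tmp.path)
  end
end

p write_after_getc(push_back: false)
p write_after_getc(push_back: true)
```

Without the push-back, the write lands after the first character and the file becomes "aXc". With the push-back, the current behavior would be expected to give "Xbc", and the proposed behavior "aXc".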

Changing the behavior of IO#ungetbyte is outside the scope of this issue.
Changing it caused many tests to fail, but the details have not been investigated.

Since IO#ungetbyte changes the file position while IO#ungetc does not, mixing their use is no longer possible.
This means that performing ungetbyte after ungetc before reading the entire data will cause an exception.

/test/ruby/test_io.rb fails due to the above implicit dependency, and
/spec/ruby/core/io/ungetc_spec.rb fails due to a test explicitly checking IO#pos.

I do not know how to investigate the impact on real-world usage.

Another approach

Do not modify Ruby's behavior. Instead, document that the file position after using IO#ungetc cannot be trusted.
Alternatively, it might be possible to issue a warning during specific operations when characters from IO#ungetc remain pending.

In these cases, when Windows is switched to use encoding conversion for "r",
io_unread() and rb_io_tell() will have to take into account the number of bytes in cbuf in addition to rbuf to maintain compatibility.
I prefer to minimize environment-specific code, so this is not my ideal.

Updated by javanthropus (Jeremy Bopp) about 16 hours ago #2 [ruby-core:123779]

I think this is related to or maybe the same as #20889.

Updated by YO4 (Yoshinao Muramatsu) about 4 hours ago #3 [ruby-core:123780]

It was my mistake not to mention #20889.
Regarding #20889, I recall being unable to settle on an opinion of my own due to concerns about compatibility.

The situations differ between IO#ungetc and IO#ungetbyte.

  • ungetc has different behavior depending on the situation. ungetbyte is consistent in itself.
  • ungetbyte is sometimes used to implement functionality equivalent to peek, for example in io_strip_bom() in io.c.
    The instability of ungetc makes it unsuitable for this purpose. The current behavior of ungetbyte should be preserved.
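The peek-style use of ungetbyte mentioned above can be sketched in Ruby. `peek_byte` and `peek_demo` are hypothetical helpers for illustration (IO has no such methods), and this is not how io_strip_bom() itself is written.

```ruby
require "tempfile"

# Hypothetical helper: look at the next byte without consuming it,
# using getbyte followed by ungetbyte.
def peek_byte(io)
  byte = io.getbyte
  io.ungetbyte(byte) unless byte.nil?
  byte
end

# Peeks at the first byte of a file that starts with a UTF-8 BOM and
# returns [peeked byte, pos after the peek, next byte actually read].
def peek_demo
  Tempfile.create("peek") do |tmp|
    tmp.close
    File.binwrite(tmp.path, "\xEF\xBB\xBFabc".b)
    File.open(tmp.path, "rb") do |f|
      first = peek_byte(f)
      [first, f.pos, f.getbyte]
    end
  end
end

p peek_demo
```

The peek is only useful because ungetbyte restores the position consistently, so the byte read next is the one that was peeked.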

For these reasons, I think it is worth considering changes to IO#ungetc.

For this issue, I would like to limit the discussion to IO#ungetc.
