Description
Currently, if a delimiter is not provided when reading a file, it is detected automatically in detectdelimandguessrows
, specifically here:
for attempted_delim in (UInt8(','), UInt8('\t'), UInt8(' '), UInt8('|'), UInt8(';'), UInt8(':'))
cnt = bvc.counts[Int(attempted_delim) + 1]
# @show Char(attempted_delim), cnt, nlines
if cnt > 0 && cnt % nlines == 0
d = attempted_delim
break
end
end
A consequence of this is that if for example, ';'
, is the delimiter, another delimiter in the list may be chosen simply because the number of its occurrences is a multiple of nlines
. Consider the following example where the delimiter is ';'
:
Input1;Letter1;Test Date;Number;Price 1;Price 2
A 1.;B;Mar 1, 2025;100;$1;$1,024
A-1 2.;B;Mar 2, 2025;200;$2;$10
A 1;B;Mar 3, 2025;1,000;$3;$4
Currently, this is interpreted as
julia> CSV.read("../../Downloads/test.csv", DataFrame)
3×4 DataFrame
Row │ Input1;Letter1;Test Date;Number;Price 1;Price 2
│ String3 String15 String3 String31
─────┼─────────────────────────────────────────────────────────────────────
1 │ A 1.;B;Mar 1, 2025;100;$1;$1,024
2 │ A-1 2.;B;Mar 2, 2025;200;$2;$10
3 │ A 1;B;Mar 3, 2025;1,000;$3;$4
Providing the delimiter directly:
julia> CSV.read("../../Downloads/test.csv", DataFrame, delim = ';')
3×6 DataFrame
Row │ Input1 Letter1 Test Date Number Price 1 Price 2
│ String7 String1 String15 String7 String3 String7
─────┼──────────────────────────────────────────────────────────
1 │ A 1. B Mar 1, 2025 100 $1 $1,024
2 │ A-1 2. B Mar 2, 2025 200 $2 $10
3 │ A 1 B Mar 3, 2025 1,000 $3 $4
The number of counts for each delimiter are:
(Char(attempted_delim), cnt, nlines) = (',', 5, 4)
(Char(attempted_delim), cnt, nlines) = ('\t', 0, 4)
(Char(attempted_delim), cnt, nlines) = (' ', 12, 4)
(Char(attempted_delim), cnt, nlines) = ('|', 0, 4)
(Char(attempted_delim), cnt, nlines) = (';', 20, 4)
(Char(attempted_delim), cnt, nlines) = (':', 0, 4)
meaning that space (' '
) is chosen since 4 divides 12, even though for ';'
, 20 also divides 4 and is more abundant (and perhaps safe to assume the more likely candidate).
It would be nice instead to calculate this for each delimiter, choose the most abundant that fits the criteria, and then fall back to ','
otherwise. This avoids choosing characters based on their given order, unless there is some reason for this that I'm not aware of.