Skip to content

Determine delimiter detection by number of occurrences first #1159

Open
@burtonjosh

Description

@burtonjosh

Currently, if a delimiter is not provided when reading a file, it is detected automatically in detectdelimandguessrows, specifically here:

for attempted_delim in (UInt8(','), UInt8('\t'), UInt8(' '), UInt8('|'), UInt8(';'), UInt8(':'))
  cnt = bvc.counts[Int(attempted_delim) + 1]
  # @show Char(attempted_delim), cnt, nlines
  if cnt > 0 && cnt % nlines == 0
    d = attempted_delim
    break
  end
end

A consequence of this is that if for example, ';', is the delimiter, another delimiter in the list may be chosen simply because the number of its occurrences is a multiple of nlines. Consider the following example where the delimiter is ';':

Input1;Letter1;Test Date;Number;Price 1;Price 2
A 1.;B;Mar 1, 2025;100;$1;$1,024
A-1 2.;B;Mar 2, 2025;200;$2;$10
A 1;B;Mar 3, 2025;1,000;$3;$4

Currently, this is interpreted as

julia> CSV.read("../../Downloads/test.csv", DataFrame)
3×4 DataFrame
 Row │ Input1;Letter1;Test  Date;Number;Price  1;Price  2                  
     │ String3              String15           String3  String31           
─────┼─────────────────────────────────────────────────────────────────────
   1 │ A                    1.;B;Mar           1,       2025;100;$1;$1,024
   2 │ A-1                  2.;B;Mar           2,       2025;200;$2;$10
   3 │ A                    1;B;Mar            3,       2025;1,000;$3;$4

Providing the delimiter directly:

julia> CSV.read("../../Downloads/test.csv", DataFrame, delim = ';')
3×6 DataFrame
 Row │ Input1   Letter1  Test Date    Number   Price 1  Price 2 
     │ String7  String1  String15     String7  String3  String7 
─────┼──────────────────────────────────────────────────────────
   1 │ A 1.     B        Mar 1, 2025  100      $1       $1,024
   2 │ A-1 2.   B        Mar 2, 2025  200      $2       $10
   3 │ A 1      B        Mar 3, 2025  1,000    $3       $4

The number of counts for each delimiter are:

(Char(attempted_delim), cnt, nlines) = (',', 5, 4)
(Char(attempted_delim), cnt, nlines) = ('\t', 0, 4)
(Char(attempted_delim), cnt, nlines) = (' ', 12, 4)
(Char(attempted_delim), cnt, nlines) = ('|', 0, 4)
(Char(attempted_delim), cnt, nlines) = (';', 20, 4)
(Char(attempted_delim), cnt, nlines) = (':', 0, 4)

meaning that space (' ') is chosen since 4 divides 12, even though for ';', 20 also divides 4 and is more abundant (and perhaps safe to assume the more likely candidate).

It would be nice instead to calculate this for each delimiter, choose the most abundant that fits the criteria, and then fall back to ',' otherwise. This avoids choosing characters based on their given order, unless there is some reason for this that I'm not aware of.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions