Determine delimiter detection by number of occurrences first

Currently, if a delimiter is not provided when reading a file, it is detected automatically in `detectdelimandguessrows`, specifically here:

```
for attempted_delim in (UInt8(','), UInt8('\t'), UInt8(' '), UInt8('|'), UInt8(';'), UInt8(':'))
  cnt = bvc.counts[Int(attempted_delim) + 1]
  # @show Char(attempted_delim), cnt, nlines
  if cnt > 0 && cnt % nlines == 0
    d = attempted_delim
    break
  end
end
```
A consequence of this is that if, for example, `';'` is the delimiter, another candidate earlier in the list may be chosen simply because its number of occurrences is a multiple of `nlines`. Consider the following example, where the delimiter is `';'`:

```
Input1;Letter1;Test Date;Number;Price 1;Price 2
A 1.;B;Mar 1, 2025;100;$1;$1,024
A-1 2.;B;Mar 2, 2025;200;$2;$10
A 1;B;Mar 3, 2025;1,000;$3;$4
```

Currently, this is interpreted as:

```
julia> CSV.read("../../Downloads/test.csv", DataFrame)
3×4 DataFrame
 Row │ Input1;Letter1;Test  Date;Number;Price  1;Price  2                  
     │ String3              String15           String3  String31           
─────┼─────────────────────────────────────────────────────────────────────
   1 │ A                    1.;B;Mar           1,       2025;100;$1;$1,024
   2 │ A-1                  2.;B;Mar           2,       2025;200;$2;$10
   3 │ A                    1;B;Mar            3,       2025;1,000;$3;$4
```

Providing the delimiter directly gives the expected result:

```
julia> CSV.read("../../Downloads/test.csv", DataFrame, delim = ';')
3×6 DataFrame
 Row │ Input1   Letter1  Test Date    Number   Price 1  Price 2 
     │ String7  String1  String15     String7  String3  String7 
─────┼──────────────────────────────────────────────────────────
   1 │ A 1.     B        Mar 1, 2025  100      $1       $1,024
   2 │ A-1 2.   B        Mar 2, 2025  200      $2       $10
   3 │ A 1      B        Mar 3, 2025  1,000    $3       $4
```

The counts for each delimiter are:

```
(Char(attempted_delim), cnt, nlines) = (',', 5, 4)
(Char(attempted_delim), cnt, nlines) = ('\t', 0, 4)
(Char(attempted_delim), cnt, nlines) = (' ', 12, 4)
(Char(attempted_delim), cnt, nlines) = ('|', 0, 4)
(Char(attempted_delim), cnt, nlines) = (';', 20, 4)
(Char(attempted_delim), cnt, nlines) = (':', 0, 4)
```
meaning that space (`' '`) is chosen since 4 divides 12, even though 4 also divides 20 for `';'`, which is more abundant (and is perhaps safe to assume to be the more likely candidate).
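
For anyone who wants to check these numbers without the library, here is a minimal standalone sketch that counts the candidate bytes in the sample above. It assumes `nlines` is simply the number of newline characters in the text, which matches the 4 reported here but may not be how the library counts lines in general (e.g. with quoted fields or row sampling). The names `sample`, `candidates`, and `tally` are only illustrative.

```
# The sample file from above, inlined; `raw` keeps the `$` characters literal.
sample = raw"""
Input1;Letter1;Test Date;Number;Price 1;Price 2
A 1.;B;Mar 1, 2025;100;$1;$1,024
A-1 2.;B;Mar 2, 2025;200;$2;$10
A 1;B;Mar 3, 2025;1,000;$3;$4
"""

# Same candidate order as in the detection loop.
candidates = (',', '\t', ' ', '|', ';', ':')

bytes = codeunits(sample)
# Assumption: one line per newline character.
nlines = count(==(UInt8('\n')), bytes)

# Occurrences of each candidate byte in the whole text.
tally = [c => count(==(UInt8(c)), bytes) for c in candidates]

for (c, cnt) in tally
    qualifies = cnt > 0 && cnt % nlines == 0
    println(repr(c), ": count = ", cnt, ", qualifies = ", qualifies)
end
```

Both `' '` and `';'` pass the `cnt % nlines == 0` check; `' '` wins only because it comes first in the list.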

It would be nice instead to calculate this for each delimiter, choose the most abundant one that fits the criterion, and fall back to `','` otherwise. This avoids choosing characters based on their order in the list, unless there is some reason for that ordering that I'm not aware of.
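
A rough sketch of what I have in mind, not a patch against the actual source: `pick_delim`, `CANDIDATES`, and `hist` are hypothetical names, and `hist` stands in for `bvc.counts` (a byte histogram indexed by byte value + 1, as in the snippet at the top). It assumes `nlines > 0`, since `%` by zero throws for integers.

```
# Same candidates, in the same order, as the existing loop.
const CANDIDATES = (UInt8(','), UInt8('\t'), UInt8(' '), UInt8('|'), UInt8(';'), UInt8(':'))

# Hypothetical helper: return the most abundant candidate whose count is a
# positive multiple of `nlines`, falling back to ',' if none qualifies.
function pick_delim(hist::AbstractVector{<:Integer}, nlines::Integer)
    best = UInt8(',')   # fallback when no candidate qualifies
    bestcnt = 0
    for d in CANDIDATES
        cnt = hist[Int(d) + 1]
        # Same criterion as today, but keep scanning instead of breaking,
        # and only replace the current pick when this one is strictly more frequent.
        if cnt > 0 && cnt % nlines == 0 && cnt > bestcnt
            best, bestcnt = d, cnt
        end
    end
    return best
end

# Counts taken from the example in this issue.
hist = zeros(Int, 256)
hist[Int(UInt8(',')) + 1] = 5
hist[Int(UInt8(' ')) + 1] = 12
hist[Int(UInt8(';')) + 1] = 20

println(Char(pick_delim(hist, 4)))  # prints ;
```

Because the comparison is strict (`cnt > bestcnt`), ties are still broken by the existing list order, so files where only one candidate qualifies behave as they do now.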
