
read fallback to ntasks=1 not always working #1157

Open
@Crefok


Hey Julia Community,
I am very new to Julia, but what I have seen so far is very good. In a project I read some CSV files with CSV.jl. Threads.nthreads() is set to 60.
Some CSV files could not be read, and the program exits with the following error message:

ERROR: TaskFailedException

    nested task error: thread = 38 fatal error, encountered an invalidly quoted field while parsing around row = 127, col = 1: ""outcviusqgvvejbjwrbumoedfhtdyiorvqueekyfhwzegowxkzomzskinamwxiimajggitwcymyxnjtpuhtbwngpunlwelyfkpfo
    vqosvsysvoqkxgzaepzvrbrbneqpidrcrhgsmglapotilebnkntoecqywbxwaiiticlzbpslhkkyjvujwddoduzmixjpznipcptb
    
    ", error=INVALID: OK | QUOTED | EOF | INVALID_QUOTED_FIELD , check your `quotechar` arguments or manually fix the field in the file itself
    
    Stacktrace:
     [1] fatalerror(buf::Vector{UInt8}, pos::Int64, len::Int64, code::Int16, row::Int64, col::Int64)
       @ CSV ~/.julia/packages/CSV/XLcqT/src/file.jl:596
     [2] parsevalue!(::Type{…}, buf::Vector{…}, pos::Int64, len::Int64, row::Int64, rowoffset::Int64, i::Int64, col::CSV.Column, ctx::CSV.Context)
       @ CSV ~/.julia/packages/CSV/XLcqT/src/file.jl:804
     [3] parserow
       @ ~/.julia/packages/CSV/XLcqT/src/file.jl:646 [inlined]
     [4] parsefilechunk!(ctx::CSV.Context, pos::Int64, len::Int64, rowsguess::Int64, rowoffset::Int64, columns::Vector{…}, ::Type{…})
       @ CSV ~/.julia/packages/CSV/XLcqT/src/file.jl:556
     [5] multithreadparse(ctx::CSV.Context, pertaskcolumns::Vector{…}, rowchunkguess::Int64, i::Int64, rows::Vector{…}, wholecolumnslock::ReentrantLock)
       @ CSV ~/.julia/packages/CSV/XLcqT/src/file.jl:366
     [6] (::CSV.var"#34#39"{CSV.Context, Vector{Vector{CSV.Column}}, Int64, Int64, Vector{Int64}, ReentrantLock})()
       @ CSV ~/.julia/packages/WorkerUtilities/ey0fP/src/WorkerUtilities.jl:384
Stacktrace:
 [1] sync_end(c::Channel{Any})
   @ Base ./task.jl:455
 [2] macro expansion
   @ ./task.jl:487 [inlined]
 [3] CSV.File(ctx::CSV.Context, chunking::Bool)
   @ CSV ~/.julia/packages/CSV/XLcqT/src/file.jl:240
 [4] File
   @ ~/.julia/packages/CSV/XLcqT/src/file.jl:227 [inlined]
 [5] #File#32
   @ ~/.julia/packages/CSV/XLcqT/src/file.jl:223 [inlined]
 [6] #read#118
   @ ~/.julia/packages/CSV/XLcqT/src/CSV.jl:117 [inlined]
 [7] read
   @ ~/.julia/packages/CSV/XLcqT/src/CSV.jl:113 [inlined]
 [8] top-level scope
   @ ./REPL[208]:3
Some type information was truncated. Use `show(err)` to see complete types.

The data I am using here is generated by the following code:

using Random
using CSV

factor = 100
open(joinpath(@__DIR__, "test.csv"), "w") do file
    # Header row, then one ordinary unquoted row with four long fields
    write(file, "a;b;c;d\n")
    write(file, randstring('a':'z', 6*factor)*";"*randstring('a':'z', 6*factor)*";"*randstring('a':'z', 6*factor)*";"*randstring('a':'z', 6*factor)*"\n")
    for i in 1:1000
        # The first field is quoted and contains embedded newlines and a ';' delimiter;
        # the remaining three fields are plain
        write(file, "\""*randstring('a':'z', 1*factor)*"\n"*randstring('a':'z', 1*factor)*"\n"*"\n"*randstring('a':'z', 1*factor)*";"*randstring('a':'z', 1*factor)*"\""*
        ";"*randstring('a':'z', 6*factor)*";"*randstring('a':'z', 6*factor)*";"*randstring('a':'z', 6*factor)*"\n")
    end
end

I tried to generate a CSV file that looks similar to the real-world data I am facing. The real-world data has many more columns, but that doesn't matter here. The problem is caused by splitting the input file into several chunks and reading them in parallel. That is a very big advantage of this library and gives a large speedup when reading CSV files. A simple workaround is to set ntasks to 1, so that the file is read in a single task and parsed into a DataFrame.
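For reference, the workaround can be sketched on a much smaller file. The file name and contents below are made up for illustration; only the `ntasks=1` keyword and the other `CSV.read` arguments are taken from my real call.

```julia
using CSV
using DataFrames

# Illustrative file: the first field of the first data row is quoted and
# contains an embedded newline and a ';' delimiter.
path = joinpath(mktempdir(), "multiline.csv")
write(path, "a;b\n\"line one\nline two;still field a\";x\n\"plain\";y\n")

# ntasks=1 parses the whole file in one task, so no chunk boundary can
# land in the middle of a quoted multi-line field.
df = CSV.read(path, DataFrame; delim=';', quotechar='"', escapechar='"', ntasks=1)
```

This gives up the parallel speedup for the affected files, so it is only a workaround.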

On one execution I got the following message:

┌ Error: Multithreaded parsing failed and fell back to single-threaded parsing. This can happen if the input contains multi-line fields; otherwise, please report this issue.
└ @ CSV ~/.julia/packages/CSV/XLcqT/src/file.jl:579

After seeing this message, I wanted to share my findings and ask: why isn't this fallback used every time?
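Until that is answered, a user-side fallback could look roughly like the sketch below. `read_with_fallback` is a hypothetical helper of my own, not part of CSV.jl, and it assumes that any exception from the first attempt is worth one single-threaded retry.

```julia
using CSV
using DataFrames

# Hypothetical helper (not a CSV.jl function): try the caller's settings first,
# then retry with ntasks=1 if parsing throws.
function read_with_fallback(path; kwargs...)
    try
        return CSV.read(path, DataFrame; kwargs...)
    catch err
        @warn "Parsing failed; retrying with ntasks=1" exception = err
        # The trailing ntasks=1 overrides any ntasks passed in kwargs.
        return CSV.read(path, DataFrame; kwargs..., ntasks=1)
    end
end
```

It could be called as `read_with_fallback(joinpath(@__DIR__, "test.csv"); delim=';', quotechar='"', escapechar='"')`. The catch is deliberately broad, so unrelated errors such as a missing file are also retried once before they propagate.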

My code to read the csv file:

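using CSV
using DataFrames

# Try every ntasks value from 1 to 60; the loop stops at the first value that throws
for i in 1:60
    println(i)
    CSV.read(joinpath(@__DIR__, "test.csv"), DataFrame; quotechar='"', escapechar='"', delim=';', ntasks=i)
end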

I do this in a for loop to find the value of ntasks that crashes. Currently it is 8, but I would guess that depends on the input data.
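To collect every failing value instead of stopping at the first crash, the loop could be wrapped in a try/catch. This is only a sketch; `failing_ntasks` is a hypothetical helper name, not something CSV.jl provides.

```julia
using CSV
using DataFrames

# Hypothetical helper: return the ntasks values for which CSV.read throws.
function failing_ntasks(path, candidates; kwargs...)
    failed = Int[]
    for n in candidates
        try
            CSV.read(path, DataFrame; kwargs..., ntasks=n)
        catch
            # Record the value and keep going instead of aborting the loop.
            push!(failed, n)
        end
    end
    return failed
end
```

For the file above it would be called as `failing_ntasks(joinpath(@__DIR__, "test.csv"), 1:60; quotechar='"', escapechar='"', delim=';')`.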

I am currently using Julia version 1.10.7
with the packages CSV (v0.10.15) and DataFrames (v1.7.0). Their package UUIDs are:
CSV = "336ed68f-0bac-5ca0-87d4-7b16caf5d00b"
DataFrames = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0"
