Description
Hey Julia Community,
I am very new to Julia, but what I have seen so far is very good. In a project I read some CSV files with CSV.jl. Threads.nthreads() is set to 60.
Some CSV files could not be read, and the program exits with the following error message:
ERROR: TaskFailedException
nested task error: thread = 38 fatal error, encountered an invalidly quoted field while parsing around row = 127, col = 1: ""outcviusqgvvejbjwrbumoedfhtdyiorvqueekyfhwzegowxkzomzskinamwxiimajggitwcymyxnjtpuhtbwngpunlwelyfkpfo
vqosvsysvoqkxgzaepzvrbrbneqpidrcrhgsmglapotilebnkntoecqywbxwaiiticlzbpslhkkyjvujwddoduzmixjpznipcptb
", error=INVALID: OK | QUOTED | EOF | INVALID_QUOTED_FIELD , check your `quotechar` arguments or manually fix the field in the file itself
Stacktrace:
[1] fatalerror(buf::Vector{UInt8}, pos::Int64, len::Int64, code::Int16, row::Int64, col::Int64)
@ CSV ~/.julia/packages/CSV/XLcqT/src/file.jl:596
[2] parsevalue!(::Type{…}, buf::Vector{…}, pos::Int64, len::Int64, row::Int64, rowoffset::Int64, i::Int64, col::CSV.Column, ctx::CSV.Context)
@ CSV ~/.julia/packages/CSV/XLcqT/src/file.jl:804
[3] parserow
@ ~/.julia/packages/CSV/XLcqT/src/file.jl:646 [inlined]
[4] parsefilechunk!(ctx::CSV.Context, pos::Int64, len::Int64, rowsguess::Int64, rowoffset::Int64, columns::Vector{…}, ::Type{…})
@ CSV ~/.julia/packages/CSV/XLcqT/src/file.jl:556
[5] multithreadparse(ctx::CSV.Context, pertaskcolumns::Vector{…}, rowchunkguess::Int64, i::Int64, rows::Vector{…}, wholecolumnslock::ReentrantLock)
@ CSV ~/.julia/packages/CSV/XLcqT/src/file.jl:366
[6] (::CSV.var"#34#39"{CSV.Context, Vector{Vector{CSV.Column}}, Int64, Int64, Vector{Int64}, ReentrantLock})()
@ CSV ~/.julia/packages/WorkerUtilities/ey0fP/src/WorkerUtilities.jl:384
Stacktrace:
[1] sync_end(c::Channel{Any})
@ Base ./task.jl:455
[2] macro expansion
@ ./task.jl:487 [inlined]
[3] CSV.File(ctx::CSV.Context, chunking::Bool)
@ CSV ~/.julia/packages/CSV/XLcqT/src/file.jl:240
[4] File
@ ~/.julia/packages/CSV/XLcqT/src/file.jl:227 [inlined]
[5] #File#32
@ ~/.julia/packages/CSV/XLcqT/src/file.jl:223 [inlined]
[6] #read#118
@ ~/.julia/packages/CSV/XLcqT/src/CSV.jl:117 [inlined]
[7] read
@ ~/.julia/packages/CSV/XLcqT/src/CSV.jl:113 [inlined]
[8] top-level scope
@ ./REPL[208]:3
Some type information was truncated. Use `show(err)` to see complete types.
The data I am using here is generated by the following code:
using Random
using CSV

factor = 100
open(joinpath(@__DIR__, "test.csv"), "w") do file
    # header row
    write(file, "a;b;c;d\n")
    # one row without any quoted fields
    write(file, randstring('a':'z', 6*factor)*";"*randstring('a':'z', 6*factor)*";"*randstring('a':'z', 6*factor)*";"*randstring('a':'z', 6*factor)*"\n")
    for i in 1:1000
        # the first field is quoted and contains newlines, an empty line and a delimiter
        write(file, "\""*randstring('a':'z', 1*factor)*"\n"*randstring('a':'z', 1*factor)*"\n"*"\n"*randstring('a':'z', 1*factor)*";"*randstring('a':'z', 1*factor)*"\""
            *";"*randstring('a':'z', 6*factor)*";"*randstring('a':'z', 6*factor)*";"*randstring('a':'z', 6*factor)*"\n")
    end
end
I tried to generate a CSV file that looks similar to the real-world data I am facing. The real data has many more columns, but that doesn't matter here. The problem is caused by splitting the input file into several chunks and reading them in parallel, presumably because a chunk boundary can fall inside a quoted multi-line field. Parallel reading is a very big advantage of this library and makes reading CSV files much faster. A simple workaround is to set ntasks to 1, so the file is read in a single task and can be parsed into a DataFrame without problems.
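For reference, here is a minimal sketch of that workaround. The file name multiline.csv and its contents are made up for illustration; the keyword arguments are the same ones I use for my real data.

```julia
using CSV, DataFrames

# Hypothetical small file: the first data row has a quoted field
# that spans two lines and also contains the delimiter.
path = joinpath(mktempdir(), "multiline.csv")
write(path, "a;b\n\"first\nsecond;still a\";x\nplain;y\n")

# ntasks=1 parses the whole file in a single task, so it is never
# split into chunks that are handed to different threads.
df = CSV.read(path, DataFrame; delim=';', quotechar='"', escapechar='"', ntasks=1)
```

The trade-off is that large files are then read without any parallel speedup.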
On one run I got the following message instead:
┌ Error: Multithreaded parsing failed and fell back to single-threaded parsing. This can happen if the input contains multi-line fields; otherwise, please report this issue.
└ @ CSV ~/.julia/packages/CSV/XLcqT/src/file.jl:579
After seeing this message, I wanted to share my findings and ask: why isn't this fallback used every time?
My code to read the CSV file:
using CSV
using DataFrames
for i in 1:60
    println(i)
    CSV.read(joinpath(@__DIR__, "test.csv"), DataFrame; quotechar='"', escapechar='"', delim=';', ntasks=i)
end
I do this in a for loop to find the ntasks value that crashes. Currently it is 8, but I would guess that depends on the input data.
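A variant of the loop that does not stop at the first crash could look like the sketch below. The helper name failing_ntasks and its maxtasks keyword are hypothetical and only exist for this example; it simply records every ntasks value for which CSV.read throws.

```julia
using CSV, DataFrames

# Hypothetical helper: try every ntasks value from 1 to maxtasks
# and collect the ones for which parsing throws an exception.
function failing_ntasks(path; maxtasks=Threads.nthreads())
    failed = Int[]
    for n in 1:maxtasks
        try
            CSV.read(path, DataFrame; quotechar='"', escapechar='"', delim=';', ntasks=n)
        catch
            @warn "parsing failed" ntasks=n
            push!(failed, n)
        end
    end
    return failed
end
```

Calling failing_ntasks(joinpath(@__DIR__, "test.csv"); maxtasks=60) on the generated file would then list all crashing values in one go instead of only the first one.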
I am currently using Julia 1.10.7 with the CSV (v0.10.15) and DataFrames (v1.7.0) packages. Their package UUIDs are:
CSV = "336ed68f-0bac-5ca0-87d4-7b16caf5d00b"
DataFrames = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0"