Problem Analysis in Apache Doris StreamLoad Scenarios

Troubleshoot common Apache Doris StreamLoad import errors, including missing partitions, data type mismatches, and special character issues, for smooth syncing.

By li yy · Apr. 03, 25 · Analysis

Apache Doris provides multiple ways to import data, including StreamLoad, HdfsLoad (gradually replacing BrokerLoad), RoutineLoad, MySQLLoad, and others. StreamLoad is the most commonly used method, as many data synchronization tools like Flink, Spark, and DataX use it under the hood.

Since StreamLoad is the underlying mechanism for Flink Doris Connector, Spark Doris Connector, and DataX, most data import issues tend to occur with StreamLoad. This article will discuss common import errors and their solutions.
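
For reference, a StreamLoad import is just an HTTP request sent to a Doris frontend (FE) node, usually issued with curl. The sketch below shows the basic anatomy of such a request; the credentials, host, port, database (test), and table (tbl) are placeholders:

Shell
 
curl --location-trusted -u user:password -H "column_separator:," -T /path/to/data.csv http://fe_host:8030/api/test/tbl/_stream_load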

Several Common Data Import Errors

1. Partition Not Created

Schema

SQL
CREATE TABLE IF NOT EXISTS tb_dynamic_partition_test2 (
    `sid` LARGEINT NOT NULL COMMENT "Student ID",
    `name` VARCHAR(50) NOT NULL COMMENT "Student Name",
    `class` INT COMMENT "Class",
    `age` SMALLINT COMMENT "Age",
    `sex` TINYINT COMMENT "Gender",
    `phone` LARGEINT COMMENT "Phone",
    `address` VARCHAR(500) NOT NULL COMMENT "Address",
    `date` DATETIME NOT NULL COMMENT "Date of Data Entry"
)
ENGINE=olap
DUPLICATE KEY (`sid`, `name`)
PARTITION BY RANGE(`date`)()
DISTRIBUTED BY HASH (`sid`) BUCKETS 4
PROPERTIES (
    "dynamic_partition.enable"="true",
    "dynamic_partition.start"="-3",
    "dynamic_partition.end"="1",
    "dynamic_partition.time_unit"="DAY",
    "dynamic_partition.prefix"="p_",
    "dynamic_partition.replication_num"="1",
    "dynamic_partition.buckets"="4"
);

StreamLoad Command

Shell
 
curl --location-trusted -u root -H "column_separator:,"  -T /mnt/disk2/test.csv http://ip:8030/api/test/tb_dynamic_partition_test2/_stream_load


Error

Plain Text
 
Reason: no partition for this tuple. tuple=+---------------+---------------+--------------------+--------------------+--------------------+------------------+-----------------+----------------+


Solution

The error occurs because no existing partition covers the incoming rows. To resolve this, temporarily disable dynamic partitioning, add the missing partition manually, and then re-enable it:

SQL
-- Disable dynamic partition
ALTER TABLE tb_dynamic_partition_test2 SET ("dynamic_partition.enable" = "false");

-- Add partition
ALTER TABLE tb_dynamic_partition_test2 ADD PARTITION p_20240426 VALUES [("2024-04-26 00:00:00"), ("2024-04-27 00:00:00")) ("replication_num"="1");

-- Re-enable dynamic partition
ALTER TABLE tb_dynamic_partition_test2 SET ("dynamic_partition.enable" = "true");


After adding the partition, data should import successfully.
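
Before retrying the load, you can confirm that the new partition is in place (a quick check against the same table):

SQL
 
SHOW PARTITIONS FROM tb_dynamic_partition_test2;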

2. Data and Column Type Mismatch

Schema

SQL
CREATE TABLE IF NOT EXISTS test (
    `sid` LARGEINT NOT NULL COMMENT "Student ID",
    `name` VARCHAR(5) NOT NULL COMMENT "Student Name",
    `class` INT COMMENT "Class",
    `age` SMALLINT COMMENT "Age",
    `sex` TINYINT COMMENT "Gender",
    `phone` LARGEINT COMMENT "Phone",
    `address` VARCHAR(5) NOT NULL COMMENT "Address",
    `date` DATETIME NOT NULL COMMENT "Date of Data Entry"
)
ENGINE=olap
DUPLICATE KEY (`sid`, `name`)
DISTRIBUTED BY HASH (`sid`) BUCKETS 4
PROPERTIES (
    "replication_num"="1"
);


StreamLoad Command

Shell
 
curl --location-trusted -u root -H "column_separator:,"  -T /mnt/disk2/liyuanyuan/data/test.csv http://10.16.10.6:18739/api/test/test/_stream_load


Data

Plain Text
 
1, lisixxxxxxxxxxxxxxxxxxxx, 1001, 18, 1, 1008610010, bj, 2024-04-26


Error

Plain Text
 
Reason: column_name[name], the length of input is too long than schema. first 32 bytes of input str: [lisixxxxxxxxxxxxxxxxxxxx] schema length: 5; actual length: 24;


Solution

The name column's data length exceeds the schema definition. To fix this, increase the length of the name column.

SQL
 
ALTER TABLE test MODIFY COLUMN name VARCHAR(50);


Data should now import successfully.
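
Schema changes in Doris run asynchronously, so it is worth confirming that the job has finished and that the column now has the new length before retrying the load (a quick check, assuming the same test table):

SQL
 
-- Check the state of the schema change job
SHOW ALTER TABLE COLUMN;

-- Confirm the new column definition
DESC test;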

3. Mismatched Columns Between Data and Schema

Schema

SQL
CREATE TABLE IF NOT EXISTS test2 (
    `sid` LARGEINT NOT NULL COMMENT "Student ID",
    `name` VARCHAR(50) NOT NULL COMMENT "Student Name",
    `class` INT COMMENT "Class",
    `age` SMALLINT COMMENT "Age",
    `sex` TINYINT COMMENT "Gender",
    `phone` LARGEINT COMMENT "Phone",
    `address` VARCHAR(50) NOT NULL COMMENT "Address",
    `date` DATETIME NOT NULL COMMENT "Date of Data Entry"
)
ENGINE=olap
DUPLICATE KEY (`sid`, `name`)
DISTRIBUTED BY HASH (`sid`) BUCKETS 4
PROPERTIES (
    "replication_num"="1"
);


Data

Plain Text
 
1, xxxxxxxxxxxxxxxxxxxxxxx, 1001, 18, 1, 1008610010, beijing, 2024-04-26, test_column


StreamLoad Command

Shell
 
curl --location-trusted -u root -H "column_separator:,"  -T /mnt/disk2/liyuanyuan/data/test2.csv http://10.16.10.6:18739/api/test/test2/_stream_load


Error

Plain Text
 
Reason: actual column number in CSV file is more than schema column number. Actual number: 9, schema column number: 8.


Solution

The CSV row contains one more column than the table schema. If the extra column should be kept, add it to the schema:

SQL
 
ALTER TABLE test2 ADD COLUMN new_col VARCHAR(50);
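
Alternatively, if the extra field should be discarded rather than stored, the columns header can map it to a temporary name that is not a table column, following the same expression-mapping pattern shown later in this article (a sketch; dummy is a placeholder name, and the behavior is worth verifying on your Doris version):

Shell
 
curl --location-trusted -u root -H "column_separator:," -H "columns:sid,name,class,age,sex,phone,address,date,dummy" -T /mnt/disk2/liyuanyuan/data/test2.csv http://10.16.10.6:18739/api/test/test2/_stream_load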


4. Special Characters in CSV Causing Import Failure

Special characters in the data can cause issues during import, especially when a field contains the column delimiter itself (for example, a comma). A good solution in such cases is to use the JSON format instead of CSV.
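
For a plain StreamLoad request, the same idea applies: write the data as one JSON object per line and declare the format in the request headers (a sketch; the file path and endpoint are placeholders, and the field names follow the earlier schemas):

Shell
 
# test.json contains one JSON object per line, for example:
# {"sid": 1, "name": "li,si", "class": 1001, "age": 18, "sex": 1, "phone": 1008610010, "address": "bei,jing", "date": "2024-04-26 00:00:00"}
curl --location-trusted -u root -H "format:json" -H "read_json_by_line:true" -T /path/to/test.json http://ip:8030/api/test/test2/_stream_load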

If Spark or Flink is used for the import, set the following properties on the connector.

Solution

Java
 
properties.setProperty("format", "json");

properties.setProperty("read_json_by_line", "true");


Alternatively, keep the CSV format and handle the special characters with the appropriate escaping and enclose settings, as described in the next section.

Handling Special Characters in CSV Files

1. Data Containing Quotation Marks

When dealing with CSV files where data is enclosed in quotation marks, it’s important to configure StreamLoad with the appropriate settings.

Example Schema

SQL
CREATE TABLE IF NOT EXISTS test3 (
    `sid` LARGEINT NOT NULL COMMENT "Student ID",
    `name` VARCHAR(50) NOT NULL COMMENT "Student Name",
    `class` INT COMMENT "Class",
    `age` SMALLINT COMMENT "Age",
    `sex` TINYINT COMMENT "Gender",
    `phone` LARGEINT COMMENT "Phone",
    `address` VARCHAR(50) NOT NULL COMMENT "Address",
    `date` DATETIME NOT NULL COMMENT "Date of Data Entry"
)
ENGINE=olap
DUPLICATE KEY (`sid`, `name`)
DISTRIBUTED BY HASH (`sid`) BUCKETS 4
PROPERTIES (
    "replication_num"="1"
);


Data

Plain Text
 
"1","xxxxxxx","1001","18","1","1008610010","beijing","2024-04-26"


StreamLoad Command

Shell
 
curl --location-trusted -u root -H "column_separator:," -H "enclose:\"" -H "trim_double_quotes:true" -T /path/to/test3.csv http://ip:8030/api/test/test3/_stream_load


Solution

Plain Text
 
enclose: specifies the enclose character that wraps a field.
trim_double_quotes: when set to true, trims the outermost double quotation marks from each field while parsing the CSV file.



2. Data Containing Partial Quotes

When only some fields are quoted (typically the ones that contain the column separator), the same enclose and trim_double_quotes headers apply.

Data

Plain Text
 
"1","xx,x,x,xxx",1001,18,"1",1008610010,"bei,jing",2024-04-26


StreamLoad Command

Shell
 
curl --location-trusted -u root -H "column_separator:,"  -H "enclose:\"" -H "trim_double_quotes:true"  -T /mnt/disk2/liyuanyuan/data/test4.csv http://10.16.10.6:18739/api/test/test4/_stream_load


Handling Windows Line Endings

If data imported from Windows (with \r\n line endings) causes issues where queries do not return the expected results, check whether Windows line endings are present.

Solution

Use od -c to check for \r\n and specify the correct line delimiter during import:

Shell
 
-H "line_delimiter:\r\n"

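For example (a sketch; the file name and endpoint are placeholders), first confirm that the rows really end in \r\n, then pass the delimiter in the load request:

Shell
 
# Inspect the raw bytes; Windows-generated files end each line with \r \n
od -c test.csv | head

# Split rows on \r\n during the import
curl --location-trusted -u root -H "column_separator:," -H "line_delimiter:\r\n" -T test.csv http://ip:8030/api/test/test/_stream_load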

Using Expression in StreamLoad

Example 1

SQL
CREATE TABLE test_streamload (
    user_id BIGINT NOT NULL COMMENT "User ID",
    name VARCHAR(20) COMMENT "User Name",
    age INT COMMENT "User Age"
)
DUPLICATE KEY (user_id)
DISTRIBUTED BY HASH (user_id) BUCKETS 10
PROPERTIES (
    "replication_allocation" = "tag.location.default: 1"
);


StreamLoad Command

Shell
 
curl --location-trusted -u "root:" -T /path/to/data.csv -H "format:csv" -H "column_separator:," -H "columns:user_id,tmp,age,name=upper(tmp)" http://ip:8030/api/test/test_streamload/_stream_load

Here the columns header maps the first CSV field to user_id, reads the second field into a temporary column tmp, maps the third to age, and derives name as upper(tmp); for an input row 1,li,18, the load writes user_id=1, age=18, and name=LI.

Example 2

SQL
CREATE TABLE test_streamload2 (
    c1 INT,
    c2 INT,
    c3 VARCHAR(20)
)
DUPLICATE KEY (c1)
DISTRIBUTED BY HASH (c1) BUCKETS 10
PROPERTIES (
    "replication_allocation" = "tag.location.default: 1"
);


StreamLoad Command

Shell
 
curl --location-trusted -u "root:" -T /path/to/data.csv -H "format:csv" -H "column_separator:," -H "columns:c1,c2,A,B,C,c3=CONCAT(A,B,C)" http://ip:8030/api/test/test_streamload2/_stream_load

Here the first two CSV fields map to c1 and c2, the next three fields are read into temporary columns A, B, and C, and c3 is derived as CONCAT(A,B,C).


Conclusion

By understanding and addressing these common data import errors, you can significantly reduce the time spent troubleshooting and ensure smoother data synchronization with Apache Doris.
