-
Notifications
You must be signed in to change notification settings - Fork 3.5k
[feat](catalog) Support reading Hive table with MultiDelimitSerDe #51936
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
run buildall |
TPC-H: Total hot run time: 34300 ms
|
TPC-DS: Total hot run time: 185362 ms
|
ClickBench: Total hot run time: 29.37 s
|
@@ -88,6 +88,13 @@ public static String getFieldDelimiter(Table table) { | |||
DEFAULT_FIELD_DELIMITER, fieldDelim, serFormat)); | |||
} | |||
|
|||
public static String getMultiDelimitFieldDelimiter(Table table) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method is same as getFieldDelimiter()
?
CREATE DATABASE IF NOT EXISTS regression; | ||
USE regression; | ||
|
||
CREATE TABLE `multi_delimit_test`( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about adding array and map type to test 'mapkey.delim' and 'collection.delim' too?
} | ||
process_value_func(data, value_start, size - value_start, _trimming_char, splitted_values); | ||
} else { | ||
size_t start = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to add unit test for this algorithm
|
||
try { | ||
// Test 1: MultiDelimitSerDe with |+| delimiter | ||
hive_docker """ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I saw you already add hql in docker/thirdparties/docker-compose/hive/scripts/data/regression/multi_delimit_serde/create_table.hql
, why need to create here again?
|
||
hive_docker """INSERT OVERWRITE TABLE multi_delimit_test3 VALUES ('field1', 'field2', 'field3'), ('a', 'b', 'c')""" | ||
|
||
qt_03 """SELECT * FROM multi_delimit_test3 ORDER BY col1""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please also test the insert command (Use Doris to write to Hive) for each table type
} else if (serDeLib.equals(HiveMetaStoreClientHelper.HIVE_MULTI_DELIMIT_SERDE)) { | ||
TFileTextScanRangeParams textParams = new TFileTextScanRangeParams(); | ||
// set properties of MultiDelimitSerDe | ||
// 1. set column separator (support multi-character delimiters) | ||
textParams.setColumnSeparator(HiveProperties.getMultiDelimitFieldDelimiter(table)); | ||
// 2. set line delimiter | ||
textParams.setLineDelimiter(HiveProperties.getLineDelimiter(table)); | ||
// 3. set mapkv delimiter | ||
textParams.setMapkvDelimiter(HiveProperties.getMapKvDelimiter(table)); | ||
// 4. set collection delimiter | ||
textParams.setCollectionDelimiter(HiveProperties.getCollectionDelimiter(table)); | ||
// 5. set escape delimiter | ||
HiveProperties.getEscapeDelimiter(table).ifPresent(d -> textParams.setEscape(d.getBytes()[0])); | ||
// 6. set null format | ||
textParams.setNullFormat(HiveProperties.getNullFormat(table)); | ||
fileAttributes.setTextParams(textParams); | ||
fileAttributes.setHeaderType(""); | ||
fileAttributes.setEnableTextValidateUtf8( | ||
sessionVariable.enableTextValidateUtf8); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These code is similar to org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
, should be merged together
// hive will escape the field separator in string | ||
if (_escape_char != 0 && i > 0 && data[i - 1] == _escape_char) { | ||
continue; | ||
if (_value_sep_len == 1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It isbetter to abstract the branch code into two functions, refer to PlainCsvTextFieldSplitter:
void PlainCsvTextFieldSplitter::do_split(const Slice& line, std::vector<Slice>* splitted_values) {
if (is_single_char_delim) {
_split_field_single_char(line, splitted_values);
} else {
_split_field_multi_char(line, splitted_values);
}
}
5d3c224
to
1e01919
Compare
run buildall |
7d03769
to
1fe5222
Compare
TPC-H: Total hot run time: 33690 ms
|
run buildall |
TPC-H: Total hot run time: 34109 ms
|
TPC-DS: Total hot run time: 186356 ms
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
PR approved by anyone and no changes requested. |
ClickBench: Total hot run time: 29.2 s
|
run buildall |
TPC-H: Total hot run time: 33679 ms
|
TPC-DS: Total hot run time: 185403 ms
|
FE UT Coverage ReportIncrement line coverage |
e91eff9
to
f1b29b0
Compare
run buildall |
FE UT Coverage ReportIncrement line coverage |
d437c2a
to
8752d5c
Compare
run buildall |
FE UT Coverage ReportIncrement line coverage |
TPC-H: Total hot run time: 33928 ms
|
TPC-DS: Total hot run time: 184973 ms
|
ClickBench: Total hot run time: 29.69 s
|
b6182b8
to
3ebdfe2
Compare
run buildall |
3ebdfe2
to
ece7b05
Compare
run buildall |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
run p0 |
run buildall |
FE UT Coverage ReportIncrement line coverage |
run buildall |
1 similar comment
run buildall |
FE UT Coverage ReportIncrement line coverage |
run buildall |
What problem does this PR solve?
Issue Number: close #51846
Related PR: #xxx
Problem Summary:
Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)