-
Notifications
You must be signed in to change notification settings - Fork 3.5k
[feat](catalog) Support reading Hive table with MultiDelimitSerDe #51936
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
All reactions
Sorry, something went wrong.
run buildall |
All reactions
Sorry, something went wrong.
TPC-H: Total hot run time: 34300 ms
|
All reactions
Sorry, something went wrong.
TPC-DS: Total hot run time: 185362 ms
|
All reactions
Sorry, something went wrong.
ClickBench: Total hot run time: 29.37 s
|
All reactions
Sorry, something went wrong.
@@ -88,6 +88,13 @@ public static String getFieldDelimiter(Table table) { | |||
DEFAULT_FIELD_DELIMITER, fieldDelim, serFormat)); | |||
} | |||
|
|||
public static String getMultiDelimitFieldDelimiter(Table table) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method is same as getFieldDelimiter()
?
Sorry, something went wrong.
All reactions
CREATE DATABASE IF NOT EXISTS regression; | ||
USE regression; | ||
|
||
CREATE TABLE `multi_delimit_test`( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about adding array and map type to test 'mapkey.delim' and 'collection.delim' too?
Sorry, something went wrong.
All reactions
} | ||
process_value_func(data, value_start, size - value_start, _trimming_char, splitted_values); | ||
} else { | ||
size_t start = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to add unit test for this algorithm
Sorry, something went wrong.
All reactions
|
||
try { | ||
// Test 1: MultiDelimitSerDe with |+| delimiter | ||
hive_docker """ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I saw you already add hql in docker/thirdparties/docker-compose/hive/scripts/data/regression/multi_delimit_serde/create_table.hql
, why need to create here again?
Sorry, something went wrong.
All reactions
|
||
hive_docker """INSERT OVERWRITE TABLE multi_delimit_test3 VALUES ('field1', 'field2', 'field3'), ('a', 'b', 'c')""" | ||
|
||
qt_03 """SELECT * FROM multi_delimit_test3 ORDER BY col1""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please also test the insert command (Use Doris to write to Hive) for each table type
Sorry, something went wrong.
All reactions
} else if (serDeLib.equals(HiveMetaStoreClientHelper.HIVE_MULTI_DELIMIT_SERDE)) { | ||
TFileTextScanRangeParams textParams = new TFileTextScanRangeParams(); | ||
// set properties of MultiDelimitSerDe | ||
// 1. set column separator (support multi-character delimiters) | ||
textParams.setColumnSeparator(HiveProperties.getMultiDelimitFieldDelimiter(table)); | ||
// 2. set line delimiter | ||
textParams.setLineDelimiter(HiveProperties.getLineDelimiter(table)); | ||
// 3. set mapkv delimiter | ||
textParams.setMapkvDelimiter(HiveProperties.getMapKvDelimiter(table)); | ||
// 4. set collection delimiter | ||
textParams.setCollectionDelimiter(HiveProperties.getCollectionDelimiter(table)); | ||
// 5. set escape delimiter | ||
HiveProperties.getEscapeDelimiter(table).ifPresent(d -> textParams.setEscape(d.getBytes()[0])); | ||
// 6. set null format | ||
textParams.setNullFormat(HiveProperties.getNullFormat(table)); | ||
fileAttributes.setTextParams(textParams); | ||
fileAttributes.setHeaderType(""); | ||
fileAttributes.setEnableTextValidateUtf8( | ||
sessionVariable.enableTextValidateUtf8); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These code is similar to org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
, should be merged together
Sorry, something went wrong.
All reactions
// hive will escape the field separator in string | ||
if (_escape_char != 0 && i > 0 && data[i - 1] == _escape_char) { | ||
continue; | ||
if (_value_sep_len == 1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It isbetter to abstract the branch code into two functions, refer to PlainCsvTextFieldSplitter:
void PlainCsvTextFieldSplitter::do_split(const Slice& line, std::vector<Slice>* splitted_values) {
if (is_single_char_delim) {
_split_field_single_char(line, splitted_values);
} else {
_split_field_multi_char(line, splitted_values);
}
}
Sorry, something went wrong.
All reactions
5d3c224
to
1e01919
Compare
run buildall |
All reactions
Sorry, something went wrong.
7d03769
to
1fe5222
Compare
TPC-H: Total hot run time: 33690 ms
|
All reactions
Sorry, something went wrong.
run buildall |
All reactions
Sorry, something went wrong.
TPC-H: Total hot run time: 34109 ms
|
All reactions
Sorry, something went wrong.
TPC-DS: Total hot run time: 186356 ms
|
All reactions
Sorry, something went wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Sorry, something went wrong.
All reactions
PR approved by anyone and no changes requested. |
All reactions
Sorry, something went wrong.
ClickBench: Total hot run time: 29.2 s
|
All reactions
Sorry, something went wrong.
run buildall |
All reactions
Sorry, something went wrong.
TPC-H: Total hot run time: 33679 ms
|
All reactions
Sorry, something went wrong.
TPC-DS: Total hot run time: 185403 ms
|
All reactions
Sorry, something went wrong.
FE UT Coverage ReportIncrement line coverage |
All reactions
Sorry, something went wrong.
e91eff9
to
f1b29b0
Compare
run buildall |
All reactions
Sorry, something went wrong.
run buildall |
All reactions
Sorry, something went wrong.
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
All reactions
Sorry, something went wrong.
run buildall |
All reactions
Sorry, something went wrong.
TPC-H: Total hot run time: 33846 ms
|
All reactions
Sorry, something went wrong.
TPC-DS: Total hot run time: 184266 ms
|
All reactions
Sorry, something went wrong.
ClickBench: Total hot run time: 29.65 s
|
All reactions
Sorry, something went wrong.
FE UT Coverage ReportIncrement line coverage |
All reactions
Sorry, something went wrong.
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
All reactions
Sorry, something went wrong.
run cloud_p0 |
All reactions
Sorry, something went wrong.
PR approved by at least one committer and no changes requested. |
All reactions
Sorry, something went wrong.
…mitSerDe apache#51936 (apache#52772) bp apache#51936 Co-authored-by: lw112 <131352377+felixwluo@users.noreply.github.com>
morningman
suxiaogang223
Successfully merging this pull request may close these issues.
[Feature] Support reading Hive table with MultiDelimitSerDe
What problem does this PR solve?
Issue Number: close #51846
Related PR: #xxx
Problem Summary:
Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)