[BUG] Data generation failed for data maintenance in local mode with parallelism #221

@wjxiz1992

Describe the bug

(spark-rapids-benchmarks) ➜  nds git:(main) ✗ python nds_gen_data.py local 1 2 /data/tpcds/sf=1/updates --update 20 --range 1,2
Warning: This scale factor is valid for QUALIFICATION ONLY
dsdgen Population Generator (Version 3.2.0)
Copyright Transaction Processing Performance Council (TPC) 2001 - 2021
Writing s_customer_address ... Done
Writing s_call_center ... Done
Writing s_catalog_order and s_catalog_order_lineitem ... Done
Writing s_catalog_page ... Done
Writing s_customer ... Done
Writing s_inventory ... Done
Writing s_item ... Done
Writing s_promotion ... Done
Writing s_purchase and s_purchase_lineitem ... Done
Writing s_store ... Done
Writing s_warehouse ... Done
Writing s_web_order and s_web_order_lineitem ... Done
Writing s_web_page ... Done
Writing s_web_site ... Done
Writing s_zip_to_gmt ... Done
ERROR: /data/tpcds/sf=1/updates/delete_20.dat exists. Either remove it or use the FORCE option to overwrite it.

The error occurs because the native dsdgen (compiled by make in the tpcds-gen folder) generates the same delete_n.dat file for every child number, so the second child finds the file already present and aborts. A typical repro looks like this:

~/spark-rapids-benchmarks/nds/tpcds-gen/target/tools$ ./dsdgen -scale 1 -dir $PWD/sf1 -parallel 2 -child 1 -verbose -update 20
Warning: This scale factor is valid for QUALIFICATION ONLY
dsdgen Population Generator (Version 3.2.0)
Copyright Transaction Processing Performance Council (TPC) 2001 - 2021
Writing s_customer_address ... Done
Writing s_call_center ... Done
Writing s_catalog_order and s_catalog_order_lineitem ... Done
Writing s_catalog_page ... Done
Writing s_customer ... Done
Writing s_inventory ... Done
Writing s_item ... Done
Writing s_promotion ... Done
Writing s_purchase and s_purchase_lineitem ... Done
Writing s_store ... Done
Writing s_warehouse ... Done
Writing s_web_order and s_web_order_lineitem ... Done
Writing s_web_page ... Done
Writing s_web_site ... Done
Writing s_zip_to_gmt ... Done
~/spark-rapids-benchmarks/nds/tpcds-gen/target/tools$ ./dsdgen -scale 1 -dir $PWD/sf1 -parallel 2 -child 2 -verbose -update 20
ERROR: ~/spark-rapids-benchmarks/nds/tpcds-gen/target/tools/sf1/delete_20.dat exists. Either remove it or use the FORCE option to overwrite it.
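
The collision in the repro above can be modelled in a few lines of Python. This is only an illustrative sketch, not dsdgen's actual code: `delete_file_name` and `write_delete_file` are hypothetical helpers, and the only behavior taken from the report is that the delete file name depends on the update number but not on the child number.

```python
import os


def delete_file_name(update: int, child: int) -> str:
    # Model of the observed behavior: the child number is ignored,
    # so every child targets the same delete_<update>.dat file.
    return f"delete_{update}.dat"


def write_delete_file(out_dir: str, update: int, child: int, force: bool = False) -> str:
    """Hypothetical stand-in for one dsdgen child writing its delete file."""
    path = os.path.join(out_dir, delete_file_name(update, child))
    if os.path.exists(path) and not force:
        # Mirrors the error text shown in the report.
        raise FileExistsError(
            f"{path} exists. Either remove it or use the FORCE option to overwrite it."
        )
    with open(path, "w") as handle:
        handle.write(f"written by child {child}\n")
    return path
```

In this model, child 1 succeeds, child 2 fails with the same message as in the transcript, and child 2 succeeds once `force=True` is passed.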

A simple fix is to detect the update flag and, whenever update is on, always apply the overwrite_output behavior so that dsdgen overwrites the existing delete_n.dat file instead of failing.
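
A minimal sketch of that idea, assuming the script assembles the dsdgen command line as a list of strings. The function name `build_dsdgen_args` and its parameters are hypothetical and not taken from `nds_gen_data.py`; the `-force` spelling is an assumption based on the "FORCE option" named in the error message, written in the same style as the `-verbose` flag in the repro.

```python
from typing import List, Optional


def build_dsdgen_args(scale: int, data_dir: str, parallel: int, child: int,
                      update: Optional[int] = None,
                      overwrite_output: bool = False) -> List[str]:
    """Hypothetical helper: assemble dsdgen arguments for one child process."""
    args = ["-scale", str(scale),
            "-dir", data_dir,
            "-parallel", str(parallel),
            "-child", str(child),
            "-verbose"]
    if update is not None:
        args += ["-update", str(update)]
    # Proposed fix: force overwrite when the user asked for it OR when
    # update is on, because every child writes the same delete_<n>.dat.
    # Assumption: "-force" is how the FORCE option is spelled on the CLI.
    if overwrite_output or update is not None:
        args.append("-force")
    return args


if __name__ == "__main__":
    print(build_dsdgen_args(1, "/data/tpcds/sf=1/updates", 2, 2, update=20))
```

One consequence of this choice is that an update run always overwrites files already in the target directory, so existing output there is not protected.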