Commit a4bfdbb

options
1 parent 9db9277 commit a4bfdbb

File tree

1 file changed: +29 -0 lines changed

docs/hub/datasets-spark.md

Lines changed: 29 additions & 0 deletions
@@ -168,6 +168,20 @@ To filter the dataset and only keep dialogues in Chinese:
It is also possible to apply filters or remove columns on the loaded DataFrame, but it is more efficient to do it while loading, especially on Parquet datasets.

Indeed, Parquet contains metadata at the file and row group level, which makes it possible to skip entire parts of the dataset that don't contain samples satisfying the criteria. Columns in Parquet can also be loaded independently, which makes it possible to skip the excluded columns and avoid loading unnecessary data.
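As an aside, here is a minimal sketch of the row group statistics this skipping relies on, using `pyarrow` directly (`data.parquet` is a placeholder for any local Parquet file):

```python
import pyarrow.parquet as pq

# Parquet files store per-row-group, per-column statistics in their footer.
metadata = pq.ParquetFile("data.parquet").metadata
print(f"{metadata.num_rows} rows in {metadata.num_row_groups} row groups")

# Readers compare these min/max bounds against a filter to decide
# whether an entire row group can be skipped without reading it.
for i in range(metadata.num_row_groups):
    stats = metadata.row_group(i).column(0).statistics
    if stats is not None and stats.has_min_max:
        print(f"row group {i}: min={stats.min}, max={stats.max}")
```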
### Options
Here is the list of available options you can pass to `read.option()`:

* `config` (string): select a dataset subset/config
* `split` (string): select a dataset split (default is "train")
* `token` (string): your Hugging Face token

For Parquet datasets:

* `columns` (string): select a subset of columns to load, e.g. `'["id"]'`
* `filters` (string): skip files and row groups that don't match a criterion, e.g. `'["source", "=", "code_exercises"]'`. Filters are passed to [pyarrow.parquet.ParquetDataset](https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetDataset.html).

Any other option is passed as an argument to [datasets.load_dataset](https://huggingface.co/docs/datasets/en/package_reference/loading_methods#datasets.load_dataset).
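For example, a minimal sketch combining these options (the repository id `BAAI/Infinity-Instruct`, the `7M` config, and the column and filter values are illustrative placeholders, not part of this commit):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("demo").getOrCreate()

# Column pruning and filter pushdown happen at load time, so excluded
# columns and non-matching row groups are never downloaded.
df = (
    spark.read.format("huggingface")
    .option("config", "7M")
    .option("split", "train")
    .option("columns", '["id"]')
    .option("filters", '["source", "=", "code_exercises"]')
    .load("BAAI/Infinity-Instruct")
)
df.printSchema()
```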
### Run SQL queries

Once you have your PySpark DataFrame ready, you can run SQL queries using `spark.sql`:
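For instance, a minimal sketch (PySpark 3.4+ can reference a DataFrame in the query through named placeholders; the `source` column is an assumed example):

```python
# Registers df under the {df} placeholder and runs the query on it.
spark.sql("SELECT source, count(*) AS n FROM {df} GROUP BY source", df=df).show()
```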
@@ -234,3 +248,18 @@ Then, make sure you are authenticated and you can use the "huggingface" Data Sou
<img class="block dark:hidden" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/datasets-spark-infinity-instruct-chinese-only-min.png"/>
<img class="hidden dark:block" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/datasets-spark-infinity-instruct-chinese-only-dark-min.png"/>
</div>
### Mode
Two modes are available when pushing a dataset to Hugging Face:
* "overwrite": overwrite the dataset if it already exists
* "append": append the dataset to an existing dataset
### Options
Here is the list of available options you can pass to `write.option()`:
* `token` (string): your Hugging Face token
Contributions are welcome to add more options here, in particular `subset` and `split`.
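For instance, a minimal sketch passing a token explicitly (reading it from the `HF_TOKEN` environment variable is a convention assumed here, and `username/my-dataset` is again a placeholder):

```python
import os

# Authenticate the write with an explicitly provided Hugging Face token.
(
    df.write.format("huggingface")
    .mode("overwrite")
    .option("token", os.environ["HF_TOKEN"])
    .save("username/my-dataset")
)
```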
