docs/hub/datasets-spark.md (29 additions, 0 deletions)
@@ -168,6 +168,20 @@ To filter the dataset and only keep dialogues in Chinese:
It is also possible to apply filters or remove columns on the loaded DataFrame, but it is more efficient to do it while loading, especially on Parquet datasets.
Indeed, Parquet files contain metadata at the file and row group level, which makes it possible to skip entire parts of the dataset that don't contain samples satisfying the criteria. Columns in Parquet can also be loaded independently, which makes it possible to skip the excluded columns and avoid loading unnecessary data.
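As an illustration, here is a minimal sketch contrasting the two approaches (assuming an active `spark` session; the dataset name and column names are hypothetical, and the `filters`/`columns` options are described in the next section):

```python
# Less efficient: load the full dataset, then filter and select on the DataFrame
df = spark.read.format("huggingface").load("username/my_dataset")
df_zh = df.filter(df.language == "zh").select("id", "text")

# More efficient on Parquet datasets: push the filter and the column selection
# down to the reader, so whole row groups and unused columns are skipped at load time
df_zh = (
    spark.read.format("huggingface")
    .option("filters", '["language", "=", "zh"]')
    .option("columns", '["id", "text"]')
    .load("username/my_dataset")
)
```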
### Options
Here is the list of available options you can pass to `read.option()`:
* `config` (string): select a dataset subset/config
* `split` (string): select a dataset split (default is "train")
* `token` (string): your Hugging Face token
For Parquet datasets:
* `columns` (string): select a subset of columns to load, e.g. `'["id"]'`
* `filters` (string): skip files and row groups that don't match a criterion, e.g. `'["source", "=", "code_exercises"]'`. Filters are passed to [pyarrow.parquet.ParquetDataset](https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetDataset.html).
Any other option is passed as an argument to [datasets.load_dataset](https://huggingface.co/docs/datasets/en/package_reference/loading_methods#datasets.load_dataset).
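As a sketch of how these options combine in a single reader (the dataset name and option values are hypothetical; `revision` is shown only as an example of an extra keyword forwarded to `datasets.load_dataset`):

```python
df = (
    spark.read.format("huggingface")
    .option("config", "en")          # dataset subset/config (hypothetical)
    .option("split", "validation")   # dataset split
    .option("revision", "main")      # not listed above: forwarded to datasets.load_dataset
    .load("username/my_dataset")
)
```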
### Run SQL queries
Once you have your PySpark DataFrame ready, you can run SQL queries using `spark.sql`.
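A minimal sketch, assuming a DataFrame `df` loaded as above (the view name and the `source` column are hypothetical):

```python
# Register the DataFrame as a temporary view so it can be referenced in SQL
df.createOrReplaceTempView("my_dataset")

# Run a SQL query with spark.sql; the result is itself a DataFrame
top_sources = spark.sql(
    "SELECT source, COUNT(*) AS n FROM my_dataset GROUP BY source ORDER BY n DESC"
)
top_sources.show()
```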