Insights: huggingface/datasets
Overview
21 Pull requests merged by 11 people
-
Preserve formatting in concatenated IterableDataset
#7522 merged
May 19, 2025 -
fix string_to_dict test
#7571 merged
May 19, 2025 -
Implementation of iteration over values of a column in an IterableDataset object
#7564 merged
May 19, 2025 -
Refactor `Dataset.map` to reuse cache files mapped with different `num_proc`
#7434 merged
May 12, 2025 -
set dev version
#7563 merged
May 7, 2025 -
release: 3.6.0
#7562 merged
May 7, 2025 -
fix decoding tests
#7560 merged
May 7, 2025 -
fix aiohttp import
#7559 merged
May 7, 2025 -
Remove `aiohttp` from direct dependencies
#7294 merged
May 7, 2025 -
fix regression
#7558 merged
May 7, 2025 -
Document the HF_DATASETS_CACHE environment variable in the datasets cache documentation
#7532 merged
May 6, 2025 -
Rebatch arrow iterables before formatted iterable
#7553 merged
May 6, 2025 -
Avoid global umask for setting file mode.
#7547 merged
May 6, 2025 -
Enable xet in push to hub
#7552 merged
May 6, 2025 -
Add try_original_type to DatasetDict.map
#7544 merged
May 5, 2025 -
set dev version
#7542 merged
Apr 28, 2025 -
release: 3.5.1
#7541 merged
Apr 28, 2025 -
chore: fix typos
#7436 merged
Apr 28, 2025 -
correct use with polars example
#7524 merged
Apr 28, 2025 -
support pyarrow 20
#7540 merged
Apr 28, 2025
5 Pull requests opened by 5 people
-
Add custom fingerprint support to `from_generator`
#7533 opened
Apr 23, 2025 -
Change dill version in requirements
#7535 opened
Apr 24, 2025 -
Add `--merge-pull-request` option for `convert_to_parquet`
#7556 opened
May 6, 2025 -
add check if repo exists for dataset uploading
#7565 opened
May 9, 2025 -
Fixed typos
#7572 opened
May 19, 2025
12 Issues closed by 6 people
-
`concatenate_datasets` does not preserve Pytorch format for IterableDataset
#7515 closed
May 19, 2025 -
Large memory use when loading large datasets to a ZFS pool
#7546 closed
May 13, 2025 -
`Dataset.map` ignores existing caches and remaps when ran with different `num_proc`
#7433 closed
May 12, 2025 -
Image Feature in Datasets Library Fails to Handle bytearray Objects from Spark DataFrames
#7517 closed
May 7, 2025 -
Document the HF_DATASETS_CACHE env variable
#7457 closed
May 6, 2025 -
IterableDataset's state_dict shard_example_idx is always equal to the number of samples in a shard
#7475 closed
May 6, 2025 -
`IterableDataset` drops samples when resuming from a checkpoint
#7538 closed
May 6, 2025 -
[Errno 13] Permission denied: on `.incomplete` file
#7536 closed
May 6, 2025 -
The memory-disk mapping failure issue of the map function(resolved, but there are some suggestions.)
#7543 closed
Apr 30, 2025 -
`sort` after `filter` unreasonably slow
#7041 closed
Apr 29, 2025 -
How to solve "Spaces stuck in Building" problems
#7530 closed
Apr 22, 2025
14 Issues opened by 14 people
-
No Samsum dataset
#7573 opened
May 20, 2025 -
Dataset lib seems to broke after fssec lib update
#7570 opened
May 15, 2025 -
Dataset creation is broken if nesting a dict inside a dict inside a list
#7569 opened
May 13, 2025 -
`IterableDatasetDict.map()` call removes `column_names` (in fact info.features)
#7568 opened
May 13, 2025 -
interleave_datasets seed with multiple workers
#7567 opened
May 12, 2025 -
terminate called without an active exception; Aborted (core dumped)
#7566 opened
May 11, 2025 -
Issue with offline mode and partial dataset cached
#7551 opened
May 4, 2025 -
TypeError: Couldn't cast array of type string to null on webdataset format dataset
#7549 opened
May 2, 2025 -
Python 3.13t (free threads) Compat
#7548 opened
May 2, 2025 -
Networked Pull Through Cache
#7545 opened
Apr 30, 2025 -
`datasets.map(..., num_proc=4)` multi-processing fails
#7537 opened
Apr 25, 2025 -
TensorFlow RaggedTensor Support (batch-level)
#7534 opened
Apr 24, 2025 -
Deepspeed reward training hangs at end of training with Dataset.from_list
#7531 opened
Apr 21, 2025
19 Unresolved conversations
Sometimes conversations happen on old items that aren’t yet closed. Here is a list of all the Issues and Pull Requests with unresolved conversations.
-
Improved type annotation
#7429 commented on
May 15, 2025 • 13 new comments -
fix: loading of datasets from Disk(#7373)
#7489 commented on
Apr 24, 2025 • 2 new comments -
Iterating over values of a column in the IterableDataset
#7381 commented on
May 20, 2025 • 0 new comments -
Incompatibile dill version (0.3.9) in datasets 2.18.0 - 3.5.0
#7510 commented on
May 19, 2025 • 0 new comments -
Columns in the dataset obtained though load_dataset do not correspond to the one in the dataset viewer since 3.4.0
#7495 commented on
May 19, 2025 • 0 new comments -
Add the option of saving in parquet instead of arrow
#6903 commented on
May 19, 2025 • 0 new comments -
Datasets' cache not re-used
#3847 commented on
May 19, 2025 • 0 new comments -
Feature request: IterableDataset.push_to_hub
#5665 commented on
May 15, 2025 • 0 new comments -
Filter on dataset too much slowww
#1796 commented on
May 15, 2025 • 0 new comments -
Excessive warnings when resuming an IterableDataset+buffered shuffle+DDP.
#7444 commented on
May 13, 2025 • 0 new comments -
Faster downloads/uploads with Xet storage
#7526 commented on
May 12, 2025 • 0 new comments -
Auto-merge option for `convert-to-parquet`
#7527 commented on
May 7, 2025 • 0 new comments -
Data Studio Error: Convert JSONL incorrectly
#7528 commented on
May 6, 2025 • 0 new comments -
[Feature request] Indexing datasets by a customly-defined id field to enable random access dataset items via the id
#6532 commented on
May 5, 2025 • 0 new comments -
load_dataset can't work with symbolic links
#6764 commented on
Apr 29, 2025 • 0 new comments -
Dataset uses excessive memory when loading files
#7509 commented on
Apr 28, 2025 • 0 new comments -
HF_DATASETS_CACHE ignored?
#7480 commented on
Apr 28, 2025 • 0 new comments -
load_dataset for CSV files not working
#743 commented on
Apr 24, 2025 • 0 new comments -
MemoryError while creating dataset from generator
#7513 commented on
Apr 23, 2025 • 0 new comments