# Dataset Configuration Design

## Overview

This design provides a generic, configuration-driven approach to handling different datasets and models in MLPerf Inference harnesses. The system uses YAML configuration files to define dataset field mappings and model-specific settings, allowing the same harness code to work with different datasets and scenarios.

## Key Components

### 1. Dataset Configuration System (`harness/data/dataset_config.py`)

- **DatasetConfigLoader**: Loads YAML configuration files for datasets
- **DatasetConfig**: Data class containing dataset field mappings
- **ModelDatasetConfig**: Model-specific dataset configuration

### 2. Enhanced Dataset Processor (`harness/data/dataset_processor.py`)

- Automatically loads dataset configuration from YAML files
- Uses field mappings from config (`input_column`, `input_ids_column`, `output_column`)
- Falls back to defaults if config not available

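The fallback behaviour can be pictured as configured column names laid over a set of defaults. The following is a minimal sketch, not the processor's actual code: `resolve_columns` is a hypothetical helper, and the default names are assumptions borrowed from the example configuration in this document.

```python
from typing import Dict, Optional

# Assumed defaults for illustration only; the real defaults are defined
# in dataset_processor.py and may differ.
DEFAULT_COLUMNS: Dict[str, Optional[str]] = {
    "input_column": "text_input",
    "input_ids_column": "tok_input",
    "output_column": "ref_output",
    "input_lens_column": None,
}


def resolve_columns(
    config_fields: Optional[Dict[str, Optional[str]]],
) -> Dict[str, Optional[str]]:
    """Overlay configured column names on the defaults.

    Passing None (no config found) returns a copy of the defaults.
    """
    resolved = dict(DEFAULT_COLUMNS)
    if config_fields:
        # Accept only known keys so a typo in the YAML cannot add a stray mapping.
        resolved.update(
            {key: value for key, value in config_fields.items() if key in DEFAULT_COLUMNS}
        )
    return resolved
```

With this shape, a dataset that only renames its prompt column needs to specify just `input_column`; every other mapping keeps its default.
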
### 3. Base Harness (`harness/harness/base_harness.py`)

- Works for both Offline and Server scenarios
- Automatically loads dataset configuration
- Provides hooks for model-specific customizations:
  - `_pre_run_setup()`: Pre-run initialization
  - `_post_run_processing()`: Post-run processing
  - `_cleanup_custom()`: Custom cleanup

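One way to read these hooks is as a template method: the base class owns the run sequence and subclasses override only the steps they need. The sketch below uses a stand-in class, `SketchHarness`, rather than the real `BaseHarness`. The call order, the `_execute` placeholder, and running cleanup in a `finally` block are all assumptions made for illustration.

```python
from typing import Any, Dict, List


class SketchHarness:
    """Stand-in for BaseHarness showing one plausible hook order."""

    def __init__(self) -> None:
        self.events: List[str] = []

    def _pre_run_setup(self) -> None:
        """Hook: override for model-specific initialization."""

    def _post_run_processing(self) -> None:
        """Hook: override for model-specific post-processing."""

    def _cleanup_custom(self) -> None:
        """Hook: override to release model-specific resources."""

    def _execute(self) -> Dict[str, Any]:
        # Placeholder for the actual benchmark run (hypothetical method).
        return {"status": "ok"}

    def run(self) -> Dict[str, Any]:
        try:
            self._pre_run_setup()
            results = self._execute()
            self._post_run_processing()
            return results
        finally:
            # Cleanup runs even if setup or the run itself raises.
            self._cleanup_custom()


class LoggingHarness(SketchHarness):
    """Overrides two hooks and leaves the third at its no-op default."""

    def _pre_run_setup(self) -> None:
        self.events.append("setup")

    def _cleanup_custom(self) -> None:
        self.events.append("cleanup")
```

Because every hook defaults to a no-op, a subclass such as `LoggingHarness` can override any subset without touching `run()`.
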
### 4. Model-Specific Harnesses

- **DeepSeek R1** (`language/deepseek-r1/harness_deepseek_r1.py`): Extends BaseHarness
- **Llama 3.1 8B** (`harness/harness_llama3.1_8b.py`): Extends BaseHarness

## Configuration Files

Configuration files are stored in `harness/data/configs/`:

- `llama3.1-8b.yaml`: Llama 3.1 8B dataset configuration
- `deepseek-r1.yaml`: DeepSeek R1 dataset configuration

### Configuration Structure

```yaml
name: dataset-name
description: "Description"

fields:
  input_column: "text_input"
  input_ids_column: "tok_input"
  output_column: "ref_output"
  input_lens_column: null  # Optional

file_format: "auto"
total_sample_count: 4388

model_specific:
  default_model_name: "model-name"
```

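To show how this structure might map onto a data class, here is a minimal sketch. `DatasetConfigSketch`, its attribute names, and the `from_dict` helper are assumptions for illustration, not the actual contents of `dataset_config.py`. The input is the dictionary a YAML parser (for example PyYAML's `yaml.safe_load`) would return for the example above, written out by hand so the sketch has no third-party dependency.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class DatasetConfigSketch:
    """Illustrative mirror of the YAML layout; not the real DatasetConfig."""

    name: str
    input_column: str
    input_ids_column: str
    output_column: str
    input_lens_column: Optional[str] = None
    file_format: str = "auto"
    total_sample_count: Optional[int] = None
    model_specific: Dict[str, Any] = field(default_factory=dict)

    @classmethod
    def from_dict(cls, raw: Dict[str, Any]) -> "DatasetConfigSketch":
        # Flatten the nested `fields` mapping into attributes.
        columns = raw["fields"]
        return cls(
            name=raw["name"],
            input_column=columns["input_column"],
            input_ids_column=columns["input_ids_column"],
            output_column=columns["output_column"],
            input_lens_column=columns.get("input_lens_column"),
            file_format=raw.get("file_format", "auto"),
            total_sample_count=raw.get("total_sample_count"),
            model_specific=raw.get("model_specific") or {},
        )


# Parsed form of the example configuration above.
parsed = {
    "name": "dataset-name",
    "description": "Description",
    "fields": {
        "input_column": "text_input",
        "input_ids_column": "tok_input",
        "output_column": "ref_output",
        "input_lens_column": None,
    },
    "file_format": "auto",
    "total_sample_count": 4388,
    "model_specific": {"default_model_name": "model-name"},
}
config = DatasetConfigSketch.from_dict(parsed)
```

Flattening `fields` into attributes keeps call sites short (`config.input_column`), at the cost of the class no longer matching the file's nesting one-to-one.
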
## Usage Examples

### Using BaseHarness Directly

```python
from harness.base_harness import BaseHarness

harness = BaseHarness(
    model_name="deepseek-ai/DeepSeek-R1-0528",
    dataset_path="./dataset.pkl",
    dataset_name="deepseek-r1",
    scenario="Offline",  # or "Server"
    test_mode="performance"
)

results = harness.run()
```

### Creating a Model-Specific Harness

```python
from harness.base_harness import BaseHarness


class MyModelHarness(BaseHarness):
    def __init__(self, **kwargs):
        # Default to this model's dataset config unless the caller names another.
        kwargs.setdefault('dataset_name', 'my-dataset')
        super().__init__(**kwargs)

    def _pre_run_setup(self):
        # Model-specific setup
        pass
```

## Benefits

1. **Code Reusability**: Same harness code works for different datasets
2. **Easy Configuration**: Add new datasets by creating YAML files
3. **Scenario Agnostic**: Works for both Offline and Server scenarios
4. **Extensible**: Model-specific customizations via subclass hooks
5. **Maintainable**: All dataset info centralized in config files

## Adding New Datasets

1. Create a YAML config file in `harness/data/configs/`
2. Define the field mappings
3. Pass the dataset's name to the harness via the `dataset_name` parameter

No code changes needed!

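No code change is needed because the dataset name alone can select the configuration file. A minimal sketch of that lookup follows, assuming the loader maps `dataset_name` to `<dataset_name>.yaml` inside the config directory. The existing file names suggest this convention, but it is not confirmed here, and `find_config_path` is a hypothetical helper.

```python
from pathlib import Path
from typing import Optional

# Config directory named in this document, relative to the repository root.
CONFIG_DIR = Path("harness/data/configs")


def find_config_path(dataset_name: str, config_dir: Path = CONFIG_DIR) -> Optional[Path]:
    """Return the YAML path for a dataset, or None so the caller can use defaults."""
    candidate = config_dir / f"{dataset_name}.yaml"
    return candidate if candidate.is_file() else None
```

Returning `None` instead of raising lets the caller fall back to default field mappings when a dataset has no config file.
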
## Design Principles

- **Configuration over Code**: Dataset-specific info in YAML, not code
- **Inheritance Hierarchy**: BaseHarness → ModelHarness (if needed)
- **Backward Compatible**: Falls back to defaults if config not available
- **Extensible**: Hooks for model-specific behavior

## File Structure

```
harness/
├── data/
│   ├── dataset_config.py          # Configuration loader
│   ├── dataset_processor.py       # Enhanced processor
│   └── configs/
│       ├── llama3.1-8b.yaml       # Llama config
│       └── deepseek-r1.yaml       # DeepSeek config
├── harness/
│   └── base_harness.py            # Base harness (with dataset config support)
└── harness_llama3.1_8b.py         # Extends BaseHarness

language/
└── deepseek-r1/
    └── harness_deepseek_r1.py     # New DeepSeek harness
```