A logical, reasonably standardized, but flexible project structure for doing and sharing data science work.
Modified from http://drivendata.github.io/cookiecutter-data-science/
Requirements:
- Python 2.7 or 3.5
- Cookiecutter Python package >= 1.4.0: This can be installed with pip or conda, depending on how you manage your Python packages:
$ pip install cookiecutter
or
$ conda config --add channels conda-forge
$ conda install cookiecutter
To start a new project, run:
$ cookiecutter https://github.com/syhsu/cookiecutter-data-science
The directory structure of your new project looks like this:
{{cookiecutter.repo_name}}
├── .gitignore <- GitHub's excellent Python .gitignore customized for this project
├── LICENSE <- Your project's license.
├── requirements.txt <- The required packages for reproducing the analysis environment
├── README.md <- The top-level README for developers using this project.
├── Dockerfile <- The Dockerfile for running the code in src
│
├── data
│ ├── raw <- The original, immutable data dump, or a .json/.yaml file pointing to it
│ │ └── metadata.json <- Format still to be defined; updated in place from the previous version, i.e. only one record is kept (see the sketch below the tree)
│ ├── external <- Data from third-party sources
│ │ └── metadata.json
│ ├── interim <- Intermediate data that has been transformed.
│ │ └── metadata.json
│ └── final <- The final, canonical data sets for modeling.
│ └── metadata.json
│
├── docs <- Documentation, reports, references, and all other explanatory materials
│
├── notebooks <- Jupyter notebooks. Naming convention is a number (for ordering),
│ the creator's initials, and a short `_` delimited description, e.g.
│ `01_cp_exploratory_data_analysis.ipynb`. NOTE: clear notebook outputs before pushing to git!
│
├── models <- Trained and serialized models, model predictions, or model summaries
│ └── metadata.json
│
├── pipelines <- Pipelines and data workflows. Add a subfolder per orchestration tool used, e.g. airflow, kubeflow
│ ├── kubeflow
│ │ └── {{cookiecutter.repo_name}} <- Compiled Kubeflow pipelines
│ └── airflow
│
├── src <- Source folder for training and analysis code
│ └── components <- Kubeflow components or Airflow custom operators [align with Dbox design]
│
├── tests <- Testing code
│ └── components
└── setup.py
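
The metadata.json format is still to be defined. As a hypothetical sketch (field names and paths below are assumptions, not part of the template), each data folder could keep a single record that is overwritten on every update, so only the latest version is described:

```python
# Hypothetical sketch: the metadata.json schema is not yet defined for this template.
# Keeps exactly one record per data folder, replacing the previous version on each update.
import json
from datetime import datetime, timezone
from pathlib import Path


def update_metadata(folder, source, description):
    """Overwrite <folder>/metadata.json with one record describing the latest data dump."""
    record = {
        "source": source,              # assumed field names, for illustration only
        "description": description,
        "updated_at": datetime.now(timezone.utc).isoformat(),
    }
    Path(folder, "metadata.json").write_text(json.dumps(record, indent=2))


# Example: point data/raw at the latest dump.
update_metadata("data/raw", "s3://my-bucket/dump-2020-01.csv", "January raw data dump")
```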
To install development requirements:
$ pip install -r requirements.txt

To run the tests:
$ py.test tests
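
Tests under `tests/components` mirror the layout of `src/components`; py.test picks up any `test_*.py` file in the tests folder. A minimal sketch of such a test (module and function names are placeholders, not part of the template):

```python
# tests/components/test_example.py -- hypothetical placeholder test.

def normalize(values):
    # Stand-in for a helper that would normally be imported from src/components.
    total = sum(values)
    return [v / total for v in values]


def test_normalize_sums_to_one():
    assert abs(sum(normalize([1, 2, 3])) - 1.0) < 1e-9
```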