Commit 662fdad

Merge pull request ydataai#762 from pandas-profiling/develop
v2.12.0 release
2 parents 5756097 + 1d4c9b5 commit 662fdad

54 files changed: 668 additions and 471 deletions

.github/workflows/benchmark.yml

Lines changed: 42 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -0,0 +1,42 @@
1+
name: Performance Benchmarks
2+
3+
on:
4+
push:
5+
branches:
6+
- master
7+
- develop
8+
9+
jobs:
10+
benchmark:
11+
name: ${{ matrix.os }} x ${{ matrix.python }}
12+
runs-on: ${{ matrix.os }}
13+
strategy:
14+
fail-fast: false
15+
matrix:
16+
os: [ ubuntu-latest ] #, macos-latest, windows-latest ]
17+
python: ['3.8']
18+
steps:
19+
- uses: actions/checkout@v2
20+
with:
21+
fetch-depth: 0
22+
- uses: actions/setup-python@v1
23+
with:
24+
python-version: ${{ matrix.python }}
25+
- name: Run benchmark
26+
run: |
27+
pip install --upgrade pip setuptools wheel
28+
pip install -r requirements.txt
29+
pip install -r requirements-test.txt
30+
- run: make install
31+
- run: pytest tests/benchmarks/bench.py --benchmark-min-rounds 10 --benchmark-warmup "on" --benchmark-json benchmark.json
32+
- name: Store benchmark result
33+
uses: rhysd/github-action-benchmark@v1
34+
with:
35+
name: Pandas Profiling Benchmarks
36+
tool: 'pytest'
37+
output-file-path: benchmark.json
38+
github-token: ${{ secrets.GITHUB_TOKEN }}
39+
auto-push: true
40+
41+
comment-on-alert: true
42+
alert-comment-cc-users: '@sbrugman'

.github/workflows/commit.yml

Lines changed: 11 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,11 @@
1+
name: Lint Commit Messages
2+
on: [pull_request]
3+
4+
jobs:
5+
commitlint:
6+
runs-on: ubuntu-latest
7+
steps:
8+
- uses: actions/checkout@v2
9+
with:
10+
fetch-depth: 0
11+
- uses: wagoid/commitlint-github-action@v3

.github/workflows/ci.yml renamed to .github/workflows/release.yml

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,4 @@
1-
name: CI
1+
name: Release CI
22

33
on:
44
push:

.github/workflows/ci_test.yml renamed to .github/workflows/tests.yml

Lines changed: 49 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -1,9 +1,9 @@
1-
name: Tests and Coverage
1+
name: CI
22

33
on: push
44

55
jobs:
6-
build:
6+
test:
77
runs-on: ${{ matrix.os }}
88
strategy:
99
matrix:
@@ -33,7 +33,53 @@ jobs:
3333
pandas: "pandas>1.1"
3434
numpy: "numpy"
3535

36-
name: python ${{ matrix.python-version }}, ${{ matrix.os }}, ${{ matrix.pandas }}, ${{ matrix.numpy }}
36+
name: Tests | python ${{ matrix.python-version }}, ${{ matrix.os }}, ${{ matrix.pandas }}, ${{ matrix.numpy }}
37+
steps:
38+
- uses: actions/checkout@v2
39+
- name: Setup python
40+
uses: actions/setup-python@v2
41+
with:
42+
python-version: ${{ matrix.python-version }}
43+
architecture: x64
44+
- uses: actions/cache@v2
45+
if: startsWith(runner.os, 'Linux')
46+
with:
47+
path: ~/.cache/pip
48+
key: ${{ runner.os }}-${{ matrix.pandas }}-pip-${{ hashFiles('**/requirements.txt') }}
49+
restore-keys: |
50+
${{ runner.os }}-${{ matrix.pandas }}-pip-
51+
52+
- uses: actions/cache@v2
53+
if: startsWith(runner.os, 'macOS')
54+
with:
55+
path: ~/Library/Caches/pip
56+
key: ${{ runner.os }}-${{ matrix.pandas }}-pip-${{ hashFiles('**/requirements.txt') }}
57+
restore-keys: |
58+
${{ runner.os }}-${{ matrix.pandas }}-pip-
59+
60+
- uses: actions/cache@v2
61+
if: startsWith(runner.os, 'Windows')
62+
with:
63+
path: ~\AppData\Local\pip\Cache
64+
key: ${{ runner.os }}-${{ matrix.pandas }}-pip-${{ hashFiles('**/requirements.txt') }}
65+
restore-keys: |
66+
${{ runner.os }}-${{ matrix.pandas }}-pip-
67+
- run: |
68+
pip install --upgrade pip setuptools wheel
69+
pip install -r requirements.txt "${{ matrix.pandas }}" "${{ matrix.numpy }}"
70+
pip install -r requirements-test.txt
71+
- run: make install
72+
- run: make test
73+
coverage:
74+
runs-on: ${{ matrix.os }}
75+
strategy:
76+
matrix:
77+
os: [ ubuntu-latest ]
78+
python-version: [ 3.8 ]
79+
pandas: [ "pandas>1.1"]
80+
numpy: ["numpy"]
81+
82+
name: Coverage | python ${{ matrix.python-version }}, ${{ matrix.os }}, ${{ matrix.pandas }}, ${{ matrix.numpy }}
3783
steps:
3884
- uses: actions/checkout@v2
3985
- name: Setup python

.pre-commit-config.yaml

Lines changed: 5 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -5,7 +5,7 @@ repos:
55
- id: black
66
language_version: python3.8
77
- repo: https://github.com/nbQA-dev/nbQA
8-
rev: 0.5.9
8+
rev: 0.7.0
99
hooks:
1010
- id: nbqa-black
1111
additional_dependencies: [ black==20.8b1 ]
@@ -17,12 +17,12 @@ repos:
1717
additional_dependencies: [ pyupgrade==2.7.3 ]
1818
args: [ --nbqa-mutate, --py36-plus ]
1919
- repo: https://github.com/asottile/pyupgrade
20-
rev: v2.10.0
20+
rev: v2.12.0
2121
hooks:
2222
- id: pyupgrade
2323
args: ['--py36-plus','--exit-zero-even-if-changed']
2424
- repo: https://github.com/pycqa/isort
25-
rev: 5.7.0
25+
rev: 5.8.0
2626
hooks:
2727
- id: isort
2828
files: '.*'
@@ -31,8 +31,8 @@ repos:
3131
rev: "0.46"
3232
hooks:
3333
- id: check-manifest
34-
- repo: https://gitlab.com/pycqa/flake8
35-
rev: "3.8.4"
34+
- repo: https://github.com/PyCQA/flake8
35+
rev: "3.9.1"
3636
hooks:
3737
- id: flake8
3838
args: [ "--select=E9,F63,F7,F82"] #,T001

Makefile

Lines changed: 3 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -16,7 +16,9 @@ test:
1616
pytest tests/issues/
1717
pytest --nbval tests/notebooks/
1818
flake8 . --select=E9,F63,F7,F82 --show-source --statistics
19-
19+
pandas_profiling -h
20+
make typing
21+
2022
test_cov:
2123
pytest --cov=. tests/unit/
2224
pytest --cov=. --cov-append tests/issues/

README.md

Lines changed: 10 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -12,7 +12,7 @@
1212
<p align="center">
1313
<a href="https://pandas-profiling.github.io/pandas-profiling/docs/master/rtd/">Documentation</a>
1414
|
15-
<a href="https://join.slack.com/t/pandas-profiling/shared_invite/zt-l2iqwb92-9JpTEdFBijR2G798j2MpQw">Slack</a>
15+
<a href="https://join.slack.com/t/pandas-profiling/shared_invite/zt-oe5ol4yc-YtbOxNBGUCb~v73TamRLuA">Slack</a>
1616
|
1717
<a href="https://stackoverflow.com/questions/tagged/pandas-profiling">Stack Overflow</a>
1818
</p>
@@ -79,6 +79,7 @@ The following examples can give you an impression of what the package can do:
7979
* [Vektis](https://pandas-profiling.github.io/pandas-profiling/examples/master/vektis/vektis_report.html) (Vektis Dutch Healthcare data)
8080
* [Colors](https://pandas-profiling.github.io/pandas-profiling/examples/master/colors/colors_report.html) (a simple colors dataset)
8181
* [UCI Bank Dataset](https://pandas-profiling.github.io/pandas-profiling/examples/master/cbank_marketing_data/uci_bank_marketing_report.html) (banking marketing dataset)
82+
* [RDW](https://pandas-profiling.github.io/pandas-profiling/examples/master/rdw/rdw.html) (RDW, the Dutch DMV's vehicle registration 10 million rows, 71 features)
8283

8384

8485
Specific features:
@@ -211,7 +212,7 @@ profile.to_file("your_report.json")
211212

212213
Version 2.4 introduces minimal mode.
213214

214-
This is a default configuration that disables expensive computations (such as correlations and dynamic binning).
215+
This is a default configuration that disables expensive computations (such as correlations and duplicate row detection).
215216

216217
Use the following syntax:
217218

@@ -220,6 +221,8 @@ profile = ProfileReport(large_dataset, minimal=True)
220221
profile.to_file("output.html")
221222
```
222223

224+
Benchmarks are available [here](https://pandas-profiling.github.io/pandas-profiling/dev/bench/).
225+
223226
### Command line usage
224227

225228
For standard formatted CSV files that can be read immediately by pandas, you can use the `pandas_profiling` executable.
@@ -239,7 +242,7 @@ A set of options is available in order to adapt the report generated.
239242
* `progress_bar` (`bool`): If True, `pandas-profiling` will display a progress bar.
240243
* `infer_dtypes` (`bool`): When `True` (default) the `dtype` of variables are inferred using `visions` using the typeset logic (for instance a column that has integers stored as string will be analyzed as if being numeric).
241244

242-
More settings can be found in the [default configuration file](https://github.com/pandas-profiling/pandas-profiling/blob/master/src/pandas_profiling/config_default.yaml), [minimal configuration file](https://github.com/pandas-profiling/pandas-profiling/blob/master/src/pandas_profiling/config_minimal.yaml) and [dark themed configuration file](https://github.com/pandas-profiling/pandas-profiling/blob/master/src/pandas_profiling/config_dark.yaml).
245+
More settings can be found in the [default configuration file](https://github.com/pandas-profiling/pandas-profiling/blob/master/src/pandas_profiling/config_default.yaml) and [minimal configuration file](https://github.com/pandas-profiling/pandas-profiling/blob/master/src/pandas_profiling/config_minimal.yaml).
243246

244247
You find the configuration docs on the advanced usage page [here](https://pandas-profiling.github.io/pandas-profiling/docs/master/rtd/pages/advanced_usage.html)
245248

@@ -306,14 +309,15 @@ Types are a powerful abstraction for effective data analysis, that goes beyond t
306309
`pandas-profiling` currently, recognizes the following types: _Boolean, Numerical, Date, Categorical, URL, Path, File_ and _Image_.
307310

308311
We have developed a type system for Python, tailored for data analysis: [visions](https://github.com/dylan-profiler/visions).
309-
Selecting the right typeset drastically reduces the complexity the code of your analysis.
310-
Future versions of `pandas-profiling` will have extended type support through `visions`!
312+
Choosing an appropriate typeset can both improve the overall expressiveness and reduce the complexity of your analysis/code.
313+
To learn more about `pandas-profiling`'s type system, check out the default implementation [here](https://github.com/pandas-profiling/pandas-profiling/blob/develop/src/pandas_profiling/model/typeset.py).
314+
In the meantime, user customized summarizations and type definitions are now fully supported - if you have a specific use-case please reach out with ideas or a PR!
311315

312316
## Contributing
313317

314318
Read on getting involved in the [Contribution Guide](https://pandas-profiling.github.io/pandas-profiling/docs/master/rtd/pages/contribution_guidelines.html).
315319

316-
A low threshold place to ask questions or start contributing is by reaching out on the pandas-profiling Slack. [Join the Slack community](https://join.slack.com/t/pandas-profiling/shared_invite/zt-hfy3iwp2-qEJSItye5QBZf8YGFMaMnQ).
320+
A low threshold place to ask questions or start contributing is by reaching out on the pandas-profiling Slack. [Join the Slack community](https://join.slack.com/t/pandas-profiling/shared_invite/zt-oe5ol4yc-YtbOxNBGUCb~v73TamRLuA).
317321

318322
## Editor integration
319323

docsrc/source/pages/advanced_usage.rst

Lines changed: 72 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -165,3 +165,75 @@ It's possible to disable certain groups of features through configuration shorth
165165
r.set_variable("correlations", None)
166166
r.set_variable("missing_diagrams", None)
167167
r.set_variable("interactions", None)
168+
169+
170+
171+
172+
Customise plots
173+
---------------
174+
175+
A way how to pass arguments to the underlying matplotlib is to use the ``plot`` argument. It is possible to change the default format of images to png (default svg) using the key-pair ``image_format: "png"`` and also the resolution of the image using ``dpi: 800``.
176+
177+
An example would be:
178+
179+
.. code-block:: python
180+
181+
profile = ProfileReport(planets, title='Pandas Profiling Report', explorative=True,
182+
plot={
183+
'dpi':200,
184+
'image_format': 'png'
185+
})
186+
187+
188+
Furthermore, it is possible to change the default values of histograms, the options for that are the following:
189+
190+
histogram:
191+
x_axis_labels: True
192+
193+
# Number of bins (set to 0 to automatically detect the bin size)
194+
bins: 50
195+
196+
# Maximum number of bins (when bins=0)
197+
max_bins: 250
198+
199+
200+
201+
202+
203+
Customise correlation matrix
204+
-----------------------------
205+
206+
It's possible to directly access the correlation matrix as well. That is done with the ``plot`` argument and then with the `correlation` key. It is possible to customise the palett, one can use the following list used in seaborn or create [their own custom matplotlib palette](https://matplotlib.org/stable/gallery/color/custom_cmap.html). Supported values are
207+
208+
```
209+
'Accent', 'Accent_r', 'Blues', 'Blues_r', 'BrBG', 'BrBG_r', 'BuGn', 'BuGn_r', 'BuPu', 'BuPu_r', 'CMRmap', 'CMRmap_r', 'Dark2', 'Dark2_r', 'GnBu', 'GnBu_r', 'Greens', 'Greens_r', 'Greys', 'Greys_r', 'OrRd', 'OrRd_r', 'Oranges', 'Oranges_r', 'PRGn', 'PRGn_r', 'Paired', 'Paired_r', 'Pastel1', 'Pastel1_r', 'Pastel2', 'Pastel2_r', 'PiYG', 'PiYG_r', 'PuBu', 'PuBuGn', 'PuBuGn_r', 'PuBu_r', 'PuOr', 'PuOr_r', 'PuRd', 'PuRd_r', 'Purples', 'Purples_r', 'RdBu', 'RdBu_r', 'RdGy', 'RdGy_r', 'RdPu', 'RdPu_r', 'RdYlBu', 'RdYlBu_r', 'RdYlGn', 'RdYlGn_r', 'Reds', 'Reds_r', 'Set1', 'Set1_r', 'Set2', 'Set2_r', 'Set3', 'Set3_r', 'Spectral', 'Spectral_r', 'Wistia', 'Wistia_r', 'YlGn', 'YlGnBu', 'YlGnBu_r', 'YlGn_r', 'YlOrBr', 'YlOrBr_r', 'YlOrRd', 'YlOrRd_r', 'afmhot', 'afmhot_r', 'autumn', 'autumn_r', 'binary', 'binary_r', 'bone', 'bone_r', 'brg', 'brg_r', 'bwr', 'bwr_r', 'cividis', 'cividis_r', 'cool', 'cool_r', 'coolwarm', 'coolwarm_r', 'copper', 'copper_r', 'crest', 'crest_r', 'cubehelix', 'cubehelix_r', 'flag', 'flag_r', 'flare', 'flare_r', 'gist_earth', 'gist_earth_r', 'gist_gray', 'gist_gray_r', 'gist_heat', 'gist_heat_r', 'gist_ncar', 'gist_ncar_r', 'gist_rainbow', 'gist_rainbow_r', 'gist_stern', 'gist_stern_r', 'gist_yarg', 'gist_yarg_r', 'gnuplot', 'gnuplot2', 'gnuplot2_r', 'gnuplot_r', 'gray', 'gray_r', 'hot', 'hot_r', 'hsv', 'hsv_r', 'icefire', 'icefire_r', 'inferno', 'inferno_r', 'jet', 'jet_r', 'magma', 'magma_r', 'mako', 'mako_r', 'nipy_spectral', 'nipy_spectral_r', 'ocean', 'ocean_r', 'pink', 'pink_r', 'plasma', 'plasma_r', 'prism', 'prism_r', 'rainbow', 'rainbow_r', 'rocket', 'rocket_r', 'seismic', 'seismic_r', 'spring', 'spring_r', 'summer', 'summer_r', 'tab10', 'tab10_r', 'tab20', 'tab20_r', 'tab20b', 'tab20b_r', 'tab20c', 'tab20c_r', 'terrain', 'terrain_r', 'turbo', 'turbo_r', 'twilight', 'twilight_r', 'twilight_shifted', 'twilight_shifted_r', 'viridis', 'viridis_r', 'vlag', 'vlag_r', 'winter', 'winter_r'
210+
```
211+
212+
An example can be:
213+
214+
.. code-block:: python
215+
216+
from pandas_profiling import ProfileReport
217+
218+
profile = ProfileReport(df, title='Pandas Profiling Report', explorative=True,
219+
plot={
220+
'correlation':{
221+
'cmap': 'RdBu_r',
222+
'bad': '#000000'}}
223+
)
224+
225+
226+
Similarly, one can change the palette for *Missing values* using the ``missing`` argument, eg:
227+
228+
.. code-block:: python
229+
230+
from pandas_profiling import ProfileReport
231+
232+
profile = ProfileReport(df, title='Pandas Profiling Report', explorative=True,
233+
plot={
234+
'missing':{
235+
'cmap': 'RdBu_r'}}
236+
)
237+
238+
239+
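
The `plot` options documented in the advanced_usage.rst hunk above can be combined into one dictionary. A minimal sketch follows, assuming the histogram settings nest under `plot` in the same way as the `correlation` and `missing` keys shown in the hunk. The helper `build_plot_config` is hypothetical and not part of pandas-profiling; the `ProfileReport` call is left commented out so the snippet runs without the library or a DataFrame.

```python
from typing import Any, Dict, Optional


def build_plot_config(
    dpi: int = 200,
    image_format: str = "png",
    correlation_cmap: Optional[str] = None,
    missing_cmap: Optional[str] = None,
    histogram_bins: Optional[int] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: assemble the dict passed as ``plot=...``."""
    config: Dict[str, Any] = {"dpi": dpi, "image_format": image_format}
    if correlation_cmap is not None:
        # 'bad' mirrors the value used in the documented example.
        config["correlation"] = {"cmap": correlation_cmap, "bad": "#000000"}
    if missing_cmap is not None:
        config["missing"] = {"cmap": missing_cmap}
    if histogram_bins is not None:
        # Assumption: histogram options live under the 'histogram' key of `plot`.
        config["histogram"] = {"bins": histogram_bins}
    return config


plot_config = build_plot_config(
    correlation_cmap="RdBu_r",
    missing_cmap="RdBu_r",
    histogram_bins=50,
)

# With pandas-profiling installed and a DataFrame `df` available:
# from pandas_profiling import ProfileReport
# profile = ProfileReport(df, title="Pandas Profiling Report",
#                         explorative=True, plot=plot_config)
```

Keeping the options in one place makes it easier to reuse the same styling across several reports.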

docsrc/source/pages/changelog/v2_12_0.rst

Lines changed: 16 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -3,14 +3,27 @@ Changelog v2.12.0
33

44
🎉 Features
55
^^^^^^^^^^^
6-
- Add the number and the percentage of negative values for numerical variables `[695] <https://github.com/pandas-profiling/pandas-profiling/issues/695>`- (contributed by @gverbock).
6+
- Add the number and the percentage of negative values for numerical variables `[695] <https://github.com/pandas-profiling/pandas-profiling/issues/695>`_ (contributed by @gverbock)
77
- Enable setting of typeset/summarizer (contributed by @ieaves)
8+
- Allow empty data frames `[678] <https://github.com/pandas-profiling/pandas-profiling/issues/678>`_ (contributed by @spbail, @fwd2020-c)
9+
10+
🐛 Bug fixes
11+
^^^^^^^^^^^^
12+
- Patch args for great_expectations datetime profiler `[727] <https://github.com/pandas-profiling/pandas-profiling/issues/727>`_ (contributed by @jstammers)
13+
- Negative exponent formatting `[723] <https://github.com/pandas-profiling/pandas-profiling/issues/723>`_ (reported by @rdpapworth)
814

915
📖 Documentation
1016
^^^^^^^^^^^^^^^^
1117
- Fix link syntax (contributed by @ChrisCarini)
1218

19+
👷‍♂️ Internal Improvements
20+
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
21+
- Several performance improvements (minimal mode, duplicates, frequency table sorting)
22+
- Introduce ``pytest-benchmark`` in CI to monitor commit performance impact
23+
- Introduce ``commitlint`` in CI to start automating the changelog generation
24+
1325
⬆️ Dependencies
1426
^^^^^^^^^^^^^^^^^^
15-
- The `ipywidgets` dependency was moved to the `[notebook]` extra, so most of Jupyter will not be installed alongside this package by default (contributed by @akx).
16-
- Replaced the (testing only) `fastparquet` dependency with `pyarrow` (default pandas parquet engine, contributed by @kurosch).
27+
- The ``ipywidgets`` dependency was moved to the ``[notebook]`` extra, so most of Jupyter will not be installed alongside this package by default (contributed by @akx)
28+
- Replaced the (testing only) ``fastparquet`` dependency with ``pyarrow`` (default pandas parquet engine, contributed by @kurosch)
29+
- Upgrade ``phik``. This drops the hard dependency on numba (contributed by @akx)

docsrc/source/pages/changelog/v2_13_0.rst

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,5 @@
1-
Changelog vx.y.z
2-
----------------
1+
Changelog v2.13.0
2+
-----------------
33

44
🎉 Features
55
^^^^^^^^^^^

docsrc/source/pages/contribution_guidelines.rst

Lines changed: 5 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -9,6 +9,10 @@ Contributing a new feature
99

1010
* Ensure the PR description clearly describes the problem and solution.
1111
Include the relevant issue number if applicable.
12+
13+
Slack community
14+
---------------
15+
A low threshold place to ask questions or start contributing is by reaching out on the pandas-profiling Slack. `Join the Slack community <https://join.slack.com/t/pandas-profiling/shared_invite/zt-oe5ol4yc-YtbOxNBGUCb~v73TamRLuA>`_.
1216

1317
Developer tools
1418
---------------
@@ -61,4 +65,4 @@ Read Github's `open source legal guide <https://opensource.guide/legal/#does-my-
6165
More information
6266
----------------
6367

64-
Read more on getting involved in the `Contribution Guide <https://github.com/pandas-profiling/pandas-profiling/blob/master/CONTRIBUTING.md>`_ on Github.
68+
Read more on getting involved in the `Contribution Guide <https://github.com/pandas-profiling/pandas-profiling/blob/master/CONTRIBUTING.md>`_ on Github.

docsrc/source/pages/resources.rst

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -14,7 +14,7 @@ Notebooks
1414

1515
Articles
1616
--------
17-
17+
- `Bringing Customization to Pandas Profiling <https://medium.com/@ianeaves/customizing-pandas-profiling-summaries-b16714d0dac9>`_ (Ian Eaves, March 5, 2021)
1818
- `Beginner Friendly Data Science Projects Accepting Contributions <https://towardsdatascience.com/beginner-friendly-data-science-projects-accepting-contributions-3b8e26f7e88e>`_ (Adam Ross Nelson, January 18, 2021)
1919
- `Pandas profiling and exploratory data analysis with line one of code! <https://towardsdatascience.com/pandas-profiling-and-exploratory-data-analysis-with-line-one-of-code-423111991e58>`_ (Magdalena Konkiewicz, Jun 10, 2020)
2020
- `The Covid 19 health issue <https://concillier.squarespace.com/datasets/covid-19>`_ (Concillier Kitungulu, April 20, 2020)

docsrc/source/pages/support.rst

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -35,6 +35,10 @@ Users with a request for help on how to use `pandas-profiling` should consider a
3535
:alt: Questions: Stackoverflow "pandas-profiling"
3636
:target: https://stackoverflow.com/questions/tagged/pandas-profiling
3737

38+
Slack community
39+
---------------
40+
41+
`Join the Slack community <https://join.slack.com/t/pandas-profiling/shared_invite/zt-oe5ol4yc-YtbOxNBGUCb~v73TamRLuA>`_ and come into contact with other users and developers, that might be able to answer your questions.
3842

3943
Reporting a bug
4044
---------------
