Commit 4cc48f1

Merge pull request ydataai#516 from pandas-profiling/release/v2.9.0rc1
pandas-profiling v2.9.0rc1
2 parents be8baee + d971f23 commit 4cc48f1

77 files changed

+1381
-1236
lines changed

.travis.yml

Lines changed: 2 additions & 10 deletions
Original file line number | Diff line number | Diff line change
@@ -12,18 +12,10 @@ jobs:
1212
name: "Python 3.9-dev on Linux"
1313
python: 3.9-dev
1414
env: TEST=examples PANDAS=">=1"
15-
- os: windows
16-
name: "Python 3.8 on Windows"
17-
python: 3.8
18-
env: TEST=examples PANDAS=">=1"
19-
- os: osx
20-
name: "Python 3.8 on osx"
21-
python: 3.8
22-
env: TEST=examples PANDAS=">=1"
15+
before_install:
16+
- sudo apt-get -y install libopenblas-dev
2317

2418
allow_failures:
25-
- os: windows
26-
- os: osx
2719
- name: "Python 3.9-dev on Linux"
2820

2921
python:

Makefile

Lines changed: 1 addition & 1 deletion
@@ -31,7 +31,7 @@ install:
3131
pip install -e .[notebook,app]
3232

3333
lint:
34-
isort --apply
34+
isort --profile black .
3535
black .
3636

3737
typing:

README.md

Lines changed: 2 additions & 2 deletions
@@ -94,10 +94,10 @@ Specific features:
9494
* [Cats and Dogs](https://pandas-profiling.github.io/pandas-profiling/examples/master/features/cats-and-dogs.html) (demonstrates image analysis from the file system)
9595
* [Celebrity Faces](https://pandas-profiling.github.io/pandas-profiling/examples/master/features/celebrity-faces.html) (demonstrates image analysis with EXIF information)
9696
* [Website Inaccessibility](https://pandas-profiling.github.io/pandas-profiling/examples/master/features/website_inaccessibility_report.html) (demonstrates URL analysis)
97-
* [Orange prices](https://pandas-profiling.github.io/pandas-profiling/examples/master/themes/united_report.html) and [Coal prices](https://pandas-profiling.github.io/pandas-profiling/examples/master/features/flatly_report.html) (showcases report themes)
97+
* [Orange prices](https://pandas-profiling.github.io/pandas-profiling/examples/master/features/united_report.html) and [Coal prices](https://pandas-profiling.github.io/pandas-profiling/examples/master/features/flatly_report.html) (showcases report themes)
9898

9999
Tutorials:
100-
* [Tutorial: report structure using Kaggle data (advanced)](https://pandas-profiling.github.io/pandas-profiling/examples/master/kaggle/modify_report_structure.ipynb) (modify the report's structure) [![Open In Colab](https://camo.githubusercontent.com/52feade06f2fecbf006889a904d221e6a730c194/68747470733a2f2f636f6c61622e72657365617263682e676f6f676c652e636f6d2f6173736574732f636f6c61622d62616467652e737667)](https://colab.research.google.com/github/pandas-profiling/pandas-profiling/blob/master/examples/kaggle/modify_report_structure.ipynb) [![Binder](https://camo.githubusercontent.com/483bae47a175c24dfbfc57390edd8b6982ac5fb3/68747470733a2f2f6d7962696e6465722e6f72672f62616467655f6c6f676f2e737667)](https://mybinder.org/v2/gh/pandas-profiling/pandas-profiling/master?filepath=examples%2Fkaggle%2Fmodify_report_structure.ipynb)
100+
* [Tutorial: report structure using Kaggle data (advanced)](https://pandas-profiling.github.io/pandas-profiling/examples/master/tutorials/modify_report_structure.ipynb) (modify the report's structure) [![Open In Colab](https://camo.githubusercontent.com/52feade06f2fecbf006889a904d221e6a730c194/68747470733a2f2f636f6c61622e72657365617263682e676f6f676c652e636f6d2f6173736574732f636f6c61622d62616467652e737667)](https://colab.research.google.com/github/pandas-profiling/pandas-profiling/blob/master/examples/kaggle/modify_report_structure.ipynb) [![Binder](https://camo.githubusercontent.com/483bae47a175c24dfbfc57390edd8b6982ac5fb3/68747470733a2f2f6d7962696e6465722e6f72672f62616467655f6c6f676f2e737667)](https://mybinder.org/v2/gh/pandas-profiling/pandas-profiling/master?filepath=examples%2Fkaggle%2Fmodify_report_structure.ipynb)
101101

102102

103103
## Installation

docsrc/source/index.rst

Lines changed: 1 addition & 0 deletions
@@ -13,6 +13,7 @@
1313
pages/advanced_usage
1414
pages/big_data
1515
pages/sensitive_data
16+
pages/metadata
1617
pages/integrations
1718
pages/changelog
1819

docsrc/source/pages/advanced_usage.rst

Lines changed: 1 addition & 3 deletions
@@ -136,7 +136,7 @@ Then, change the configuration to your liking.
136136
137137
from pandas_profiling import ProfileReport
138138
139-
profile = ProfileReport(df, configuration_file="your_config.yml")
139+
profile = ProfileReport(df, config_file="your_config.yml")
140140
profile.to_file("report.html")
141141
142142
Sample configuration files
@@ -145,9 +145,7 @@ A great way to get an overview of the possible configuration is to look through
145145
The repository contains the following files:
146146
147147
- `default configuration file <https://github.com/pandas-profiling/pandas-profiling/blob/master/src/pandas_profiling/config_default.yaml>`_ (default),
148-
- `explorative configuration file <https://github.com/pandas-profiling/pandas-profiling/blob/master/src/pandas_profiling/config_explorative.yaml>`_ (with text, file and image features enabled),
149148
- `minimal configuration file <https://github.com/pandas-profiling/pandas-profiling/blob/master/src/pandas_profiling/config_minimal.yaml>`_ (minimal computation, optimized for performance)
150-
- `dark themed configuration file <https://github.com/pandas-profiling/pandas-profiling/blob/master/src/pandas_profiling/config_dark.yaml>`_ and `orange themed configuration file <https://github.com/pandas-profiling/pandas-profiling/blob/master/src/pandas_profiling/config_united.yaml>`_ (example of customizing styles).
151149
152150
Configuration shorthands
153151
------------------------

docsrc/source/pages/big_data.rst

Lines changed: 9 additions & 0 deletions
@@ -38,7 +38,16 @@ Sample
3838
profile = ProfileReport(sample, minimal=True)
3939
profile.to_file("output.html")
4040
41+
The reader of the report might want to know that the profile is generated using a sample from the data.
42+
An example of how to do this:
4143

44+
.. code-block:: python
45+
46+
description = "Disclaimer: this profiling report was generated using a sample of 5% of the original dataset."
47+
sample = large_dataset.sample(frac=0.05)
48+
49+
profile = sample.profile_report(description=description, minimal=True)
50+
profile.to_file("output.html")
4251
4352
Concurrency
4453
-----------
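
A note on the sampling disclaimer added to big_data.rst in this hunk: the hard-coded "5%" in the description can drift away from the `frac` argument. The following is a minimal sketch, using only pandas, that derives the text from the fraction so the two stay in sync. The `large_dataset` frame and the `frac` variable are illustrative assumptions and are not part of the commit; the resulting `description` would be passed to `profile_report(description=...)` as in the documentation snippet.

```python
import pandas as pd

# Hypothetical stand-in for the large dataset (illustration only).
large_dataset = pd.DataFrame({"value": range(1000)})

# Single source of truth for the sampling fraction.
frac = 0.05
sample = large_dataset.sample(frac=frac, random_state=0)

# Build the disclaimer from the same fraction that produced the sample.
description = (
    "Disclaimer: this profiling report was generated using a sample of "
    f"{frac:.0%} of the original dataset."
)
```

Fixing `random_state` is optional; it only makes the sample reproducible between runs.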

docsrc/source/pages/changelog.rst

Lines changed: 2 additions & 0 deletions
@@ -2,6 +2,8 @@
22
Changelog
33
=========
44

5+
.. include:: changelog/v2_9_0rc1.rst
6+
57
.. include:: changelog/v2_8_0.rst
68

79
.. include:: changelog/v2_7_1.rst
Lines changed: 38 additions & 0 deletions
@@ -0,0 +1,38 @@
1+
Changelog v2.9.0rc1
2+
-------------------
3+
4+
🎉 Features
5+
^^^^^^^^^^^
6+
- Working with sensitive data: Introduced ``sensitive=True`` option to mask non-aggregated data (such as samples, duplicates, frequency tables for categorical columns) [`#503 <https://github.com/pandas-profiling/pandas-profiling/issues/503>`_].
7+
- The sample section can be parametrized with a custom sample (for instance mock data).
8+
- Introduce shorthands for groups of parameters for styles and explorative mode [`#499 <https://github.com/pandas-profiling/pandas-profiling/issues/499>`_].
9+
- Metadata of a dataset can be added to the report (see documentation).
10+
- Numeric columns now report monotonicity information.
11+
- A pie chart can be generated for boolean and (low) categorical columns.
12+
13+
🐛 Bug fixes
14+
^^^^^^^^^^^^
15+
- NaT in date columns were interpreted as a date in 1680 by histograms [`#507 <https://github.com/pandas-profiling/pandas-profiling/issues/507>`_].
16+
- ValueError: ('widget type not understood', 'select') [`#493 <https://github.com/pandas-profiling/pandas-profiling/issues/493>`_].
17+
- Fixed regression in working with pandas' nullable integers [`#502 <https://github.com/pandas-profiling/pandas-profiling/issues/502>`_].
18+
- Formatting of precision of numeric values has been improved in a few places.
19+
20+
👷‍♂️ Internal Improvements
21+
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
22+
- Histograms used to be calculated at view time (single thread) and are now computed in parallel.
23+
- Matplotlib's rcParams are now modified through the contextmanager [`#494 <https://github.com/pandas-profiling/pandas-profiling/issues/494>`_].
24+
25+
📖 Documentation
26+
^^^^^^^^^^^^^^^^
27+
- Links to Colab and Binder notebooks [`#480 <https://github.com/pandas-profiling/pandas-profiling/issues/480>`_ and `#497 <https://github.com/pandas-profiling/pandas-profiling/issues/497>`_].
28+
- The documentation for sensitive data, large datasets and metadata has been extended.
29+
30+
🚨 Breaking changes
31+
^^^^^^^^^^^^^^^^^^^
32+
- ``bayesian_blocks`` binning has been removed, together with the ``astropy`` dependency.
33+
- Config files ``config_dark.yaml``, ``config_united.yaml`` and ``config_explorative.yaml`` have been removed in favour of shorthand for groups of parameters.
34+
35+
⬆️ Dependencies
36+
^^^^^^^^^^^^^^^^^^
37+
- ``isort`` updated to major version 5.
38+
- ``attrs`` is now required for classes.
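
The changelog entry "Numeric columns now report monotonicity information" can be illustrated with plain pandas. This is a sketch of what the property means, not the implementation used in this release; `describe_monotonicity` is a hypothetical helper introduced only for illustration.

```python
import pandas as pd


def describe_monotonicity(series: pd.Series) -> str:
    """Classify a series using pandas' built-in (non-strict) monotonicity checks."""
    increasing = series.is_monotonic_increasing
    decreasing = series.is_monotonic_decreasing
    if increasing and decreasing:
        # Both checks pass only when all values are equal (or the series is empty).
        return "constant"
    if increasing:
        return "increasing"
    if decreasing:
        return "decreasing"
    return "not monotonic"


examples = ([1, 2, 2, 5], [9, 4, 1], [3, 1, 2], [7, 7, 7])
labels = [describe_monotonicity(pd.Series(values)) for values in examples]
```

Note that pandas' checks are non-strict, so repeated values such as `[1, 2, 2, 5]` still count as increasing.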

docsrc/source/pages/config_variables.csv

Lines changed: 2 additions & 2 deletions
@@ -3,11 +3,11 @@ Parameter,Type,Default,Description
33
``variables.descriptions``,dict,{},"Ability to display a description alongside the descriptive statistics of each variable ({'var_name': 'Description'})."
44
``vars.num.quantiles``,list[float],"[0.05,0.25,0.5,0.75,0.95]","The quantiles to calculate. Note that .25, .5 and .75 are required for other metrics (median and IQR)."
55
``vars.num.skewness_threshold``,integer,20,"Warn if the skewness is above this threshold."
6-
``vars.num.low_categorical_threshold``,integer,5,"If the number of distinct values is equal to or smaller than this number, then the series is considered to be categorical. Set to 0 to disable."
6+
``vars.num.low_categorical_threshold``,integer,5,"If the number of distinct values is smaller than this number, then the series is considered to be categorical. Set to 0 to disable."
77
``vars.num.chi_squared_threshold``,float,0.999,"Set to zero to disable chi squared calculation."
88
``vars.cat.length``,boolean,True,"Check the string length and aggregate values (min, max, mean, median)."
99
``vars.cat.unicode``,boolean,False,"Check the distribution of characters and their Unicode properties. Often informative, but may be computationally expensive."
1010
``vars.cat.cardinality_threshold``,integer,50,"Warn if the number of distinct values is above this threshold."
1111
``vars.cat.n_obs``,integer,5,"Display this number of observations."
1212
``vars.cat.chi_squared_threshold``,float,0.999,"Same as above."
13-
``vars.bool.n_obs``,integer,3,"Same as above."
13+
``vars.bool.n_obs``,integer,3,"Same as above."
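
The reworded description of ``vars.num.low_categorical_threshold`` changes the boundary case from "equal to or smaller" to strictly "smaller". A minimal sketch of the documented semantics follows; `is_low_categorical` is a hypothetical helper written for this note and is not the library's code.

```python
def is_low_categorical(values, threshold=5):
    """Return True when the number of distinct values is strictly below `threshold`."""
    n_distinct = len(set(values))
    return n_distinct < threshold


four_distinct = is_low_categorical([1, 2, 3, 4, 1])            # 4 distinct values
five_distinct = is_low_categorical([1, 2, 3, 4, 5])            # exactly at the threshold
disabled = is_low_categorical([1, 1, 1], threshold=0)          # threshold 0 never matches
```

With a strict comparison, a threshold of 0 disables the check without any special casing, since a distinct count can never be negative.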

docsrc/source/pages/installation.rst

Lines changed: 2 additions & 0 deletions
@@ -32,6 +32,8 @@ If you are in a notebook (locally, at LambdaLabs, on Google Colab or Kaggle), yo
3232
!{sys.executable} -m pip install -U pandas-profiling[notebook]
3333
!jupyter nbextension enable --py widgetsnbextension
3434
35+
You may have to restart the kernel or runtime.
36+
3537
Using conda
3638
-----------
3739

docsrc/source/pages/integrations.rst

Lines changed: 3 additions & 4 deletions
@@ -102,11 +102,10 @@ Ensure to install ``pyqt5``. Via pip use the extras ``app``:
102102
pip install pandas-profiling[app]
103103
104104
105-
Streamlit (suggestion)
106-
~~~~~~~~~~~~~~~~~~~~~~
107-
108-
View progress at https://github.com/streamlit/streamlit/issues/693.
105+
Streamlit / Panel
106+
~~~~~~~~~~~~~~~~~
109107

108+
For more information on how to use ``pandas-profiling`` with Streamlit or Panel, see https://github.com/streamlit/streamlit/issues/693 and https://github.com/pandas-profiling/pandas-profiling/issues/491.
110109

111110
Cloud Integrations
112111
------------------

docsrc/source/pages/metadata.rst

Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
1+
========
2+
Metadata
3+
========
4+
5+
When sharing reports with coworkers or publishing online, you might want to include metadata of the dataset, such as author, copyright holder or a description. The supported properties are inspired by `https://schema.org/Dataset <https://schema.org/Dataset>`_. Currently supported are: "description", "creator", "author", "url", "copyright_year", "copyright_holder".
6+
7+
The following example generates a report with a "description", "copyright_holder", "copyright_year" and "url".
8+
You can find these properties in the "Overview" section under the "About" tab.
9+
10+
.. code-block:: python
11+
12+
report = df.profile_report(
13+
title="Masked data",
14+
dataset=dict(
15+
description="This profiling report was generated using a sample of 5% of the original dataset.",
16+
copyright_holder="StataCorp LLC",
17+
copyright_year="2020",
18+
url="http://www.stata-press.com/data/r15/auto2.dta",
19+
),
20+
)
21+
report.to_file(Path("stata_auto_report.html"))
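
Since metadata.rst lists exactly six supported properties, a typo in a key is easy to miss. The sketch below checks a `dataset` dict against that list before it is passed to the report. The helper name `unsupported_dataset_keys` is hypothetical, and whether the library itself rejects or ignores unknown keys is not stated in this commit.

```python
# Properties listed as supported in metadata.rst.
SUPPORTED_DATASET_KEYS = {
    "description",
    "creator",
    "author",
    "url",
    "copyright_year",
    "copyright_holder",
}


def unsupported_dataset_keys(dataset: dict) -> list:
    """Return the keys of `dataset` that are not in the documented list, sorted."""
    return sorted(set(dataset) - SUPPORTED_DATASET_KEYS)


dataset = dict(
    description="This profiling report was generated using a sample of 5% of the original dataset.",
    copyright_holder="StataCorp LLC",
    copyright_year="2020",
    url="http://www.stata-press.com/data/r15/auto2.dta",
)
problems = unsupported_dataset_keys(dataset)
typo_problems = unsupported_dataset_keys({"copyright-holder": "StataCorp LLC"})
```

The documentation snippet above also relies on `from pathlib import Path` being present for the final `to_file` call.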

docsrc/source/pages/sensitive_data.rst

Lines changed: 42 additions & 2 deletions
@@ -2,10 +2,50 @@
22
Sensitive data
33
==============
44

5-
When dealing with sensitive data, such as private health records, sharing a report that includes a sample would violate patient's privacy. The sample as well as duplicate rows are configurable. We can disable them using:
5+
When dealing with sensitive data, such as private health records, sharing a report that includes a sample would violate patients' privacy. The following shorthand groups together various options so that only aggregate information is provided in the report:
6+
7+
.. code-block:: python
8+
9+
report = df.profile_report(sensitive=True)
10+
11+
Moreover, pandas-profiling does not send data to external services, making it suitable for private data.
12+
13+
Sample and duplicates
14+
---------------------
15+
16+
The sample as well as duplicate rows are configurable. We can disable them using:
617

718
.. code-block:: python
819
920
report = df.profile_report(duplicates=None, samples=None)
1021
11-
Moreover, `pandas-profiling` does not send data to external services, making it suitable for private data.
22+
Alternatively, it's possible to bring your own custom sample.
23+
The following snippet demonstrates how to generate the report with mock data.
24+
Note that the "name" and "caption" keys are optional.
25+
26+
.. code-block:: python
27+
28+
# Replace with the sample you'd like to present in the report
29+
sample_custom_data = pd.DataFrame()
30+
sample_description = "Disclaimer: the following sample consists of synthetic data following the format of the underlying dataset."
31+
32+
report = df.profile_report(
33+
sample=dict(
34+
name="Mock data sample",
35+
data=sample_custom_data,
36+
caption=sample_description
37+
)
38+
)
39+
40+
.. warning::
41+
42+
Be aware when using ``pandas.read_csv`` with sensitive data such as phone numbers.
43+
Pandas' type guessing will by default coerce phone numbers such as ``0612345678`` to numeric.
44+
This leads to information leakage through aggregates (min, max, quantiles).
45+
To prevent this from happening, keep the string representation.
46+
47+
.. code-block:: python
48+
49+
pd.read_csv('filename.csv', dtype={'phone': str})
50+
51+
Note that type detection is hard. That is why we developed `visions <https://github.com/dylan-profiler/visions>`_, a type system to help developers solve these cases.
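
The warning about ``pandas.read_csv`` in this hunk can be reproduced with a small in-memory CSV. This is a sketch with made-up names and numbers; the column name `phone` follows the documentation snippet, everything else is an assumption for illustration.

```python
from io import StringIO

import pandas as pd

# Synthetic CSV content (not real phone numbers).
csv_text = "name,phone\nAlice,0612345678\nBob,0687654321\n"

# Default type inference parses the column as integers and drops the leading zero.
coerced = pd.read_csv(StringIO(csv_text))

# Forcing the column to str keeps the original representation.
kept = pd.read_csv(StringIO(csv_text), dtype={"phone": str})

# Aggregates on the coerced column expose an actual value from the data.
leaked_max = coerced["phone"].max()
```

With the coerced column, statistics such as the minimum and maximum are real phone numbers (minus the leading zero), which is the leakage the warning describes.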

docsrc/source/pages/support.rst

Lines changed: 6 additions & 5 deletions
@@ -30,12 +30,13 @@ You should provide the minimal information to reproduce this bug. `This guide <h
3030

3131
- the minimal code you are using to generate the report
3232

33-
- which environment you are using:
33+
- Which environment you are using:
3434

3535
- operating system (e.g. Windows, Linux, Mac)
3636
- Python version (e.g. 3.7)
37-
- jupyter notebook, console or IDE such as PyCharm
38-
- Package manager (e.g. pip, conda conda info)
39-
- packages (pip freeze > packages.txt or conda list)
37+
- Jupyter notebook (or cloud services like Google Colab, Kaggle Kernels, etc.), console or IDE (such as PyCharm, VS Code, etc.)
38+
- package manager (e.g. ``pip --version`` or ``conda info``)
39+
- packages (``pip freeze > packages.txt`` or ``conda list``)
4040

41-
- a sample or description of the dataset (df.head(), df.info())
41+
- a sample of the dataset (``df.sample()`` or ``df.head()``)
42+
- a description of the dataset (``df.info()``)

examples/features/mask_sensitive.py

Lines changed: 49 additions & 0 deletions
@@ -0,0 +1,49 @@
1+
from pathlib import Path
2+
3+
import pandas as pd
4+
5+
from pandas_profiling import ProfileReport
6+
from pandas_profiling.utils.cache import cache_file
7+
8+
if __name__ == "__main__":
9+
file_name = cache_file("auto2.dta", "http://www.stata-press.com/data/r15/auto2.dta")
10+
df = pd.read_stata(file_name)
11+
12+
# In case that a sample of the real data (cars) would disclose sensitive information, we can replace it with
13+
# mock data. For illustrative purposes, we use data based on cars from a popular game series.
14+
mock_data = pd.DataFrame(
15+
{
16+
"make": ["Blista Kanjo", "Sentinel", "Burrito"],
17+
"price": [58000, 95000, 65000],
18+
"mpg": [20, 30, 22],
19+
"rep78": ["Average", "Excellent", "Fair"],
20+
"headroom": [2.5, 3.0, 1.5],
21+
"trunk": [8, 10, 4],
22+
"weight": [1050, 1600, 2500],
23+
"length": [165, 170, 180],
24+
"turn": [40, 50, 32],
25+
"displacement": [80, 100, 60],
26+
"gear_ratio": [2.74, 3.51, 2.41],
27+
"foreign": ["Domestic", "Domestic", "Foreign"],
28+
}
29+
)
30+
31+
report = ProfileReport(
32+
df.sample(frac=0.25),
33+
title="Masked data",
34+
dataset=dict(
35+
description="This profiling report was generated using a sample of 25% of the original dataset.",
36+
copyright_holder="StataCorp LLC",
37+
copyright_year="2020",
38+
url="http://www.stata-press.com/data/r15/auto2.dta",
39+
),
40+
sensitive=True,
41+
sample=dict(
42+
name="Mock data sample",
43+
data=mock_data,
44+
caption="Disclaimer: this is synthetic data generated based on the format of the data in this table.",
45+
),
46+
vars=dict(cat=dict(unicode=True)),
47+
interactions=None,
48+
)
49+
report.to_file(Path("masked_report.html"))
