rishirelan
diff --git a/‎.travis.yml
Lines changed: 2 additions & 10 deletions b/‎.travis.yml
diff --git a/‎Makefile
Lines changed: 1 addition & 1 deletion b/‎Makefile
diff --git a/‎README.md
Lines changed: 2 additions & 2 deletions b/‎README.md
diff --git a/‎docsrc/source/index.rst
Lines changed: 1 addition & 0 deletions b/‎docsrc/source/index.rst
diff --git a/‎docsrc/source/pages/advanced_usage.rst
Lines changed: 1 addition & 3 deletions b/‎docsrc/source/pages/advanced_usage.rst
diff --git a/‎docsrc/source/pages/big_data.rst
Lines changed: 9 additions & 0 deletions b/‎docsrc/source/pages/big_data.rst
diff --git a/‎docsrc/source/pages/changelog.rst
Lines changed: 2 additions & 0 deletions b/‎docsrc/source/pages/changelog.rst
diff --git a/‎docsrc/source/pages/changelog/v2_9_0rc1.rst
Lines changed: 38 additions & 0 deletions b/‎docsrc/source/pages/changelog/v2_9_0rc1.rst
diff --git a/‎docsrc/source/pages/config_variables.csv
Lines changed: 2 additions & 2 deletions b/‎docsrc/source/pages/config_variables.csv
diff --git a/‎docsrc/source/pages/installation.rst
Lines changed: 2 additions & 0 deletions b/‎docsrc/source/pages/installation.rst
diff --git a/‎docsrc/source/pages/integrations.rst
Lines changed: 3 additions & 4 deletions b/‎docsrc/source/pages/integrations.rst
diff --git a/‎docsrc/source/pages/metadata.rst
Lines changed: 21 additions & 0 deletions b/‎docsrc/source/pages/metadata.rst
diff --git a/‎docsrc/source/pages/sensitive_data.rst
Lines changed: 42 additions & 2 deletions b/‎docsrc/source/pages/sensitive_data.rst
diff --git a/‎docsrc/source/pages/support.rst
Lines changed: 6 additions & 5 deletions b/‎docsrc/source/pages/support.rst
diff --git a/‎examples/features/mask_sensitive.py
Lines changed: 49 additions & 0 deletions b/‎examples/features/mask_sensitive.py
@@ -12,18 +12,10 @@ jobs:
     name: "Python 3.9-dev on Linux"
     python: 3.9-dev
     env: TEST=examples PANDAS=">=1"
-  - os: windows
-    name: "Python 3.8 on Windows"
-    python: 3.8
-    env: TEST=examples PANDAS=">=1"
-  - os: osx
-    name: "Python 3.8 on osx"
-    python: 3.8
-    env: TEST=examples PANDAS=">=1"
+    before_install:
+    - sudo apt-get -y install libopenblas-dev
 
   allow_failures:
-  - os: windows
-  - os: osx
   - name: "Python 3.9-dev on Linux"
 
 python:
 
@@ -31,7 +31,7 @@ install:
 	pip install -e .[notebook,app]
 
 lint:
-	isort --apply
+	isort --profile black .
 	black .
 
 typing:
 
@@ -94,10 +94,10 @@ Specific features:
 * [Cats and Dogs](https://pandas-profiling.github.io/pandas-profiling/examples/master/features/cats-and-dogs.html) (demonstrates image analysis from the file system)
 * [Celebrity Faces](https://pandas-profiling.github.io/pandas-profiling/examples/master/features/celebrity-faces.html) (demonstrates image analysis with EXIF information)
 * [Website Inaccessibility](https://pandas-profiling.github.io/pandas-profiling/examples/master/features/website_inaccessibility_report.html) (demonstrates URL analysis)
-* [Orange prices](https://pandas-profiling.github.io/pandas-profiling/examples/master/themes/united_report.html) and [Coal prices](https://pandas-profiling.github.io/pandas-profiling/examples/master/features/flatly_report.html) (showcases report themes)
+* [Orange prices](https://pandas-profiling.github.io/pandas-profiling/examples/master/features/united_report.html) and [Coal prices](https://pandas-profiling.github.io/pandas-profiling/examples/master/features/flatly_report.html) (showcases report themes)
 
 Tutorials:
-* [Tutorial: report structure using Kaggle data (advanced)](https://pandas-profiling.github.io/pandas-profiling/examples/master/kaggle/modify_report_structure.ipynb) (modify the report's structure) [![Open In Colab](https://camo.githubusercontent.com/52feade06f2fecbf006889a904d221e6a730c194/68747470733a2f2f636f6c61622e72657365617263682e676f6f676c652e636f6d2f6173736574732f636f6c61622d62616467652e737667)](https://colab.research.google.com/github/pandas-profiling/pandas-profiling/blob/master/examples/kaggle/modify_report_structure.ipynb) [![Binder](https://camo.githubusercontent.com/483bae47a175c24dfbfc57390edd8b6982ac5fb3/68747470733a2f2f6d7962696e6465722e6f72672f62616467655f6c6f676f2e737667)](https://mybinder.org/v2/gh/pandas-profiling/pandas-profiling/master?filepath=examples%2Fkaggle%2Fmodify_report_structure.ipynb)
+* [Tutorial: report structure using Kaggle data (advanced)](https://pandas-profiling.github.io/pandas-profiling/examples/master/tutorials/modify_report_structure.ipynb) (modify the report's structure) [![Open In Colab](https://camo.githubusercontent.com/52feade06f2fecbf006889a904d221e6a730c194/68747470733a2f2f636f6c61622e72657365617263682e676f6f676c652e636f6d2f6173736574732f636f6c61622d62616467652e737667)](https://colab.research.google.com/github/pandas-profiling/pandas-profiling/blob/master/examples/kaggle/modify_report_structure.ipynb) [![Binder](https://camo.githubusercontent.com/483bae47a175c24dfbfc57390edd8b6982ac5fb3/68747470733a2f2f6d7962696e6465722e6f72672f62616467655f6c6f676f2e737667)](https://mybinder.org/v2/gh/pandas-profiling/pandas-profiling/master?filepath=examples%2Fkaggle%2Fmodify_report_structure.ipynb)
 
 
 ## Installation
 
@@ -13,6 +13,7 @@
    pages/advanced_usage
    pages/big_data
    pages/sensitive_data
+   pages/metadata
    pages/integrations
    pages/changelog
 
 
@@ -136,7 +136,7 @@ Then, change the configuration to your liking.
 
   from pandas_profiling import ProfileReport
 
-  profile = ProfileReport(df, configuration_file="your_config.yml")
+  profile = ProfileReport(df, config_file="your_config.yml")
   profile.to_file("report.html")
 
 Sample configuration files
@@ -145,9 +145,7 @@ A great way to get an overview of the possible configuration is to look through
 The repository contains the following files:
 
 - `default configuration file <https://github.com/pandas-profiling/pandas-profiling/blob/master/src/pandas_profiling/config_default.yaml>`_ (default),
-- `explorative configuration file <https://github.com/pandas-profiling/pandas-profiling/blob/master/src/pandas_profiling/config_explorative.yaml>`_ (with text, file and image features enabled),
 - `minimal configuration file <https://github.com/pandas-profiling/pandas-profiling/blob/master/src/pandas_profiling/config_minimal.yaml>`_ (minimal computation, optimized for performance)
-- `dark themed configuration file <https://github.com/pandas-profiling/pandas-profiling/blob/master/src/pandas_profiling/config_dark.yaml>`_ and `orange themed configuration file <https://github.com/pandas-profiling/pandas-profiling/blob/master/src/pandas_profiling/config_united.yaml>`_ (example of customizing styles).
 
 Configuration shorthands
 ------------------------
 
@@ -38,7 +38,16 @@ Sample
   profile = ProfileReport(sample, minimal=True)
   profile.to_file("output.html")
 
+The reader of the report might want to know that the profile is generated using a sample from the data.
+An example of how to do this:
 
+.. code-block:: python
+
+  description = "Disclaimer: this profiling report was generated using a sample of 5% of the original dataset."
+  sample = large_dataset.sample(frac=0.05)
+
+  profile = sample.profile_report(dataset=dict(description=description), minimal=True)
+  profile.to_file("output.html")
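Editor's note on the sampling snippet above: a minimal sketch of how the disclaimer could be derived from the actual sample instead of being hard-coded, so the text cannot drift from the code. The frame `large_dataset` below is a hypothetical stand-in, and `random_state=42` is an arbitrary choice that only makes the draw reproducible; neither appears in the original change.

```python
import pandas as pd

# Hypothetical stand-in for the large dataset used in the documentation.
large_dataset = pd.DataFrame({"value": range(1000)})

# Fix the random seed so that the same rows are drawn on every run.
sample = large_dataset.sample(frac=0.05, random_state=42)

# Build the disclaimer from the actual sample size rather than hard-coding it.
fraction = len(sample) / len(large_dataset)
description = (
    "Disclaimer: this profiling report was generated using a sample of "
    f"{fraction:.0%} of the original dataset "
    f"({len(sample)} of {len(large_dataset)} rows)."
)
print(description)
```

The resulting string can then be passed to the report in place of the literal description shown in the diff.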
 
 Concurrency
 -----------
 
@@ -2,6 +2,8 @@
 Changelog
 =========
 
+.. include:: changelog/v2_9_0rc1.rst
+
 .. include:: changelog/v2_8_0.rst
 
 .. include:: changelog/v2_7_1.rst
 
@@ -0,0 +1,38 @@
+Changelog v2.9.0rc1
+-------------------
+
+🎉 Features
+^^^^^^^^^^^
+- Working with sensitive data: Introduced ``sensitive=True`` option to mask non-aggregated data (such as samples, duplicates, frequency tables for categorical columns) [`#503 <https://github.com/pandas-profiling/pandas-profiling/issues/503>`_].
+- The sample section can be parametrized with a custom sample (for instance mock data).
+- Introduce shorthands for groups of parameters for styles and explorative mode [`#499 <https://github.com/pandas-profiling/pandas-profiling/issues/499>`_].
+- Metadata of a dataset can be added to the report (see documentation).
+- Numeric columns now report monotonicity information.
+- A pie chart can be generated for boolean and (low) categorical columns.
+
+🐛 Bug fixes
+^^^^^^^^^^^^
+- NaT values in date columns were interpreted as a date in 1680 by histograms [`#507 <https://github.com/pandas-profiling/pandas-profiling/issues/507>`_].
+- ValueError: ('widget type not understood', 'select') [`#493 <https://github.com/pandas-profiling/pandas-profiling/issues/493>`_].
+- Fixed regression in working with pandas' nullable integers [`#502 <https://github.com/pandas-profiling/pandas-profiling/issues/502>`_].
+- Formatting of precision of numeric values has been improved in a few places.
+
+👷‍♂️ Internal Improvements
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+- Histograms used to be calculated at view time (single thread) and are now computed in parallel.
+- Matplotlib's rcParams are now modified through a context manager [`#494 <https://github.com/pandas-profiling/pandas-profiling/issues/494>`_].
+
+📖 Documentation
+^^^^^^^^^^^^^^^^
+- Links to Colab and Binder notebooks [`#480 <https://github.com/pandas-profiling/pandas-profiling/issues/480>`_ and `#497 <https://github.com/pandas-profiling/pandas-profiling/issues/497>`_].
+- The documentation for sensitive data, large datasets and metadata has been extended.
+
+🚨 Breaking changes
+^^^^^^^^^^^^^^^^^^^
+- ``bayesian_blocks`` binning has been removed, together with the ``astropy`` dependency.
+- Config files ``config_dark.yaml``, ``config_united.yaml`` and ``config_explorative.yaml`` have been removed in favour of shorthand for groups of parameters.
+
+⬆️ Dependencies
+^^^^^^^^^^^^^^^^^^
+- ``isort`` updated to major version 5.
+- ``attrs`` is now required for classes.
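Editor's note on the changelog entry "Numeric columns now report monotonicity information": the sketch below illustrates what such a classification could look like. It is not the library's implementation; the function name `monotonicity` and the returned labels are assumptions chosen for illustration.

```python
from typing import Sequence


def monotonicity(values: Sequence[float]) -> str:
    """Classify a numeric sequence by comparing each value with its successor."""
    if len(values) < 2:
        # A single value (or none) has no ordering to speak of.
        return "undefined"
    pairs = list(zip(values, values[1:]))
    if all(a < b for a, b in pairs):
        return "strictly increasing"
    if all(a > b for a, b in pairs):
        return "strictly decreasing"
    # Non-strict checks allow repeated values; a constant sequence
    # is reported as "increasing" because this check comes first.
    if all(a <= b for a, b in pairs):
        return "increasing"
    if all(a >= b for a, b in pairs):
        return "decreasing"
    return "not monotonic"


print(monotonicity([1, 2, 2, 5]))
```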
@@ -3,11 +3,11 @@ Parameter,Type,Default,Description
 ``variables.descriptions``,dict,{},"Ability to display a description alongside the descriptive statistics of each variable ({'var_name': 'Description'})."
 ``vars.num.quantiles``,list[float],"[0.05,0.25,0.5,0.75,0.95]","The quantiles to calculate. Note that .25, .5 and .75 are required for other metrics median and IQR."
 ``vars.num.skewness_threshold``,integer,20,"Warn if the skewness is above this threshold."
-``vars.num.low_categorical_threshold``,integer,5,"If the number of distinct values is equal to or smaller than this number, then the series is considered to be categorical. Set to 0 to disable."
+``vars.num.low_categorical_threshold``,integer,5,"If the number of distinct values is smaller than this number, then the series is considered to be categorical. Set to 0 to disable."
 ``vars.num.chi_squared_threshold``,float,0.999,"Set to zero to disable chi squared calculation."
 ``vars.cat.length``,boolean,True,"Check the string length and aggregate values (min, max, mean, median)."
 ``vars.cat.unicode``,boolean,False,"Check the distribution of characters and their Unicode properties. Often informative, but may be computationally expensive."
 ``vars.cat.cardinality_threshold``,integer,50,"Warn if the number of distinct values is above this threshold."
 ``vars.cat.n_obs``,integer,5,"Display this number of observations."
 ``vars.cat.chi_squared_threshold``,float,0.999,"Same as above."
-``vars.bool.n_obs``,integer,3,"Same as above."
+``vars.bool.n_obs``,integer,3,"Same as above."
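Editor's note on the reworded ``vars.num.low_categorical_threshold`` description: the change from "equal to or smaller" to "smaller" moves the boundary by one. The sketch below spells out the rule as the new wording describes it; the helper `is_low_categorical` is hypothetical and only mirrors the documented behaviour, not the library's code.

```python
from typing import Hashable, Iterable


def is_low_categorical(values: Iterable[Hashable], threshold: int = 5) -> bool:
    """Return True if the values would count as categorical under the documented rule."""
    if threshold <= 0:
        # "Set to 0 to disable."
        return False
    # Strict comparison: exactly `threshold` distinct values is NOT categorical.
    return len(set(values)) < threshold


print(is_low_categorical([1, 2, 3, 4, 1, 2]))     # 4 distinct values
print(is_low_categorical([1, 2, 3, 4, 5, 1, 2]))  # 5 distinct values
```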
@@ -32,6 +32,8 @@ If you are in a notebook (locally, at LambdaLabs, on Google Colab or Kaggle), yo
     !{sys.executable} -m pip install -U pandas-profiling[notebook]
     !jupyter nbextension enable --py widgetsnbextension
 
+You may have to restart the kernel or runtime.
+
 Using conda
 -----------
 
 
@@ -102,11 +102,10 @@ Ensure to install ``pyqt5``. Via pip use the extras ``app``:
   pip install pandas-profiling[app]
 
 
-Streamlit (suggestion)
-~~~~~~~~~~~~~~~~~~~~~~
-
-View progress at https://github.com/streamlit/streamlit/issues/693.
+Streamlit / Panel
+~~~~~~~~~~~~~~~~~
 
+For more information on how to use ``pandas-profiling`` with Streamlit or Panel, see https://github.com/streamlit/streamlit/issues/693 and https://github.com/pandas-profiling/pandas-profiling/issues/491.
 
 Cloud Integrations
 ------------------
 
@@ -0,0 +1,21 @@
+========
+Metadata
+========
+
+When sharing reports with coworkers or publishing online, you might want to include metadata of the dataset, such as author, copyright holder or a description. The supported properties are inspired by `https://schema.org/Dataset <https://schema.org/Dataset>`_. Currently supported are: "description", "creator", "author", "url", "copyright_year", "copyright_holder".
+
+The following example generates a report with a "description", "copyright_holder", "copyright_year" and "url".
+You can find these properties in the "Overview" section under the "About" tab.
+
+.. code-block:: python
+
+    report = df.profile_report(
+        title="Masked data",
+        dataset=dict(
+            description="This profiling report was generated using a sample of 5% of the original dataset.",
+            copyright_holder="StataCorp LLC",
+            copyright_year="2020",
+            url="http://www.stata-press.com/data/r15/auto2.dta",
+        ),
+    )
+    report.to_file("stata_auto_report.html")
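Editor's note on the metadata page: since only six properties are listed as supported, it may help to check a metadata dictionary before passing it as ``dataset=...``. The sketch below assumes the supported set is exactly the list quoted in the page; the helper `unsupported_properties` is hypothetical and not part of pandas-profiling.

```python
# Property names copied from the metadata page in this change.
SUPPORTED_PROPERTIES = {
    "description",
    "creator",
    "author",
    "url",
    "copyright_year",
    "copyright_holder",
}


def unsupported_properties(dataset: dict) -> list:
    """Return the keys of `dataset` that are not in the documented list, sorted."""
    return sorted(set(dataset) - SUPPORTED_PROPERTIES)


metadata = dict(
    description="This profiling report was generated using a sample of 5% of the original dataset.",
    copyright_holder="StataCorp LLC",
    copyright_year="2020",
    url="http://www.stata-press.com/data/r15/auto2.dta",
    licence="unknown",  # deliberately unsupported, to show the check firing
)
print(unsupported_properties(metadata))
```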
@@ -2,10 +2,50 @@
 Sensitive data
 ==============
 
-When dealing with sensitive data, such as private health records, sharing a report that includes a sample would violate patient's privacy. The sample as well as duplicate rows are configurable. We can disable them using:
+When dealing with sensitive data, such as private health records, sharing a report that includes a sample would violate patients' privacy. The following shorthand groups together various options so that only aggregate information is provided in the report:
+
+.. code-block:: python
+
+  report = df.profile_report(sensitive=True)
+
+Moreover, pandas-profiling does not send data to external services, making it suitable for private data.
+
+Sample and duplicates
+---------------------
+
+The sample as well as duplicate rows are configurable. We can disable them using:
 
 .. code-block:: python
 
   report = df.profile_report(duplicates=None, samples=None)
 
-Moreover, `pandas-profiling` does not send data to external services, making it suitable for private data.
+Alternatively, it's possible to bring your own custom sample.
+The following snippet demonstrates how to generate the report with mock data.
+Note that the "name" and "caption" keys are optional.
+
+.. code-block:: python
+
+  # Replace with the sample you'd like to present in the report
+  sample_custom_data = pd.DataFrame()
+  sample_description = "Disclaimer: the following sample consists of synthetic data following the format of the underlying dataset."
+
+  report = df.profile_report(
+      sample=dict(
+          name="Mock data sample",
+          data=sample_custom_data,
+          caption=sample_description,
+      )
+  )
+
+.. warning::
+
+   Be careful when using ``pandas.read_csv`` with sensitive data such as phone numbers.
+   Pandas' type guessing will by default coerce phone numbers such as ``0612345678`` to numeric.
+   This leads to information leakage through aggregates (min, max, quantiles).
+   To prevent this from happening, keep the string representation.
+
+   .. code-block:: python
+
+        pd.read_csv('filename.csv', dtype={'phone': str})
+
+   Note that type detection is hard. That is why we developed `visions <https://github.com/dylan-profiler/visions>`_, a type system to help developers solve these cases.
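Editor's note on the warning above: the following sketch shows the leak it describes on a tiny in-memory CSV. The file content and column names are made up for illustration; only ``pandas.read_csv`` and its ``dtype`` argument come from the documentation text.

```python
import io

import pandas as pd

# Hypothetical CSV content with two phone numbers that start with a zero.
csv_text = "name,phone\nalice,0612345678\nbob,0687654321\n"

# Default type inference parses the column as integers and drops the leading zero.
leaky = pd.read_csv(io.StringIO(csv_text))

# Forcing the string representation keeps the values out of numeric aggregates.
safe = pd.read_csv(io.StringIO(csv_text), dtype={"phone": str})

# The minimum of the numeric column is a real phone number minus its leading zero.
print(leaky["phone"].min())
print(safe["phone"].iloc[0])
```

With the numeric column, summary statistics such as the minimum and maximum are actual values from the data, which is exactly the information an aggregate-only report is meant to hide.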
@@ -30,12 +30,13 @@ You should provide the minimal information to reproduce this bug. `This guide <h
 
 - the minimal code you are using to generate the report
 
-- which environment you are using:
+- Which environment you are using:
 
         - operating system (e.g. Windows, Linux, Mac)
         - Python version (e.g. 3.7)
-        - jupyter notebook, console or IDE such as PyCharm
-        - Package manager (e.g. pip, conda conda info)
-        - packages (pip freeze > packages.txt or conda list)
+        - Jupyter notebook (or cloud services like Google Colab, Kaggle Kernels, etc.), console or IDE (such as PyCharm, VS Code, etc.)
+        - package manager (e.g. ``pip --version`` or ``conda info``)
+        - packages (``pip freeze > packages.txt`` or ``conda list``)
 
-- a sample or description of the dataset (df.head(), df.info())
+- a sample of the dataset (``df.sample()`` or ``df.head()``)
+- a description of the dataset (``df.info()``)
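Editor's note on the support checklist above: most of the requested environment details can be gathered with the standard library. The sketch below is one way to do so, assuming Python 3.8 or newer for ``importlib.metadata``; the function name `environment_report` and the default package list are illustrative choices, not part of the project.

```python
import platform
from importlib import metadata


def environment_report(packages=("pandas", "pandas-profiling")) -> str:
    """Collect operating system, Python version and package versions as text."""
    lines = [
        f"Operating system: {platform.platform()}",
        f"Python version: {platform.python_version()}",
    ]
    for name in packages:
        try:
            version = metadata.version(name)
        except metadata.PackageNotFoundError:
            version = "not installed"
        lines.append(f"{name}: {version}")
    return "\n".join(lines)


print(environment_report())
```

The printed text can be pasted into a bug report next to the output of ``pip freeze`` or ``conda list``.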
@@ -0,0 +1,49 @@
+from pathlib import Path
+
+import pandas as pd
+
+from pandas_profiling import ProfileReport
+from pandas_profiling.utils.cache import cache_file
+
+if __name__ == "__main__":
+    file_name = cache_file("auto2.dta", "http://www.stata-press.com/data/r15/auto2.dta")
+    df = pd.read_stata(file_name)
+
+    # If a sample of the real data (cars) would disclose sensitive information, we can replace it with
+    # mock data. For illustrative purposes, we use data based on cars from a popular game series.
+    mock_data = pd.DataFrame(
+        {
+            "make": ["Blista Kanjo", "Sentinel", "Burrito"],
+            "price": [58000, 95000, 65000],
+            "mpg": [20, 30, 22],
+            "rep78": ["Average", "Excellent", "Fair"],
+            "headroom": [2.5, 3.0, 1.5],
+            "trunk": [8, 10, 4],
+            "weight": [1050, 1600, 2500],
+            "length": [165, 170, 180],
+            "turn": [40, 50, 32],
+            "displacement": [80, 100, 60],
+            "gear_ratio": [2.74, 3.51, 2.41],
+            "foreign": ["Domestic", "Domestic", "Foreign"],
+        }
+    )
+
+    report = ProfileReport(
+        df.sample(frac=0.25),
+        title="Masked data",
+        dataset=dict(
+            description="This profiling report was generated using a sample of 25% of the original dataset.",
+            copyright_holder="StataCorp LLC",
+            copyright_year="2020",
+            url="http://www.stata-press.com/data/r15/auto2.dta",
+        ),
+        sensitive=True,
+        sample=dict(
+            name="Mock data sample",
+            data=mock_data,
+            caption="Disclaimer: this is synthetic data generated based on the format of the data in this table.",
+        ),
+        vars=dict(cat=dict(unicode=True)),
+        interactions=None,
+    )
+    report.to_file(Path("masked_report.html"))