Commit 11a295e

FPGrowth/FPMax and Association Rules with the existence of missing values (rasbt#1004) (rasbt#1106)
* Updated FPGrowth/FPMax and Association Rules with the existence of missing values
* Re-structure and document code
* Update unit tests
* Update CHANGELOG.md
* Modify the corresponding documentation in Jupyter notebooks
* Final modifications
1 parent d9713ea commit 11a295e

10 files changed: +1405 -106 lines changed

docs/sources/CHANGELOG.md

Lines changed: 9 additions & 2 deletions
@@ -18,11 +18,18 @@ The CHANGELOG for the current development version is available at
 
 ##### New Features and Enhancements
 
-- [`mlxtend.frequent_patterns.association_rules`](https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/association_rules/) Implemented three new metrics: Jaccard, Certainty, and Kulczynski. ([#1096](https://github.com/rasbt/mlxtend/issues/1096))
-- Integrated scikit-learn's `set_output` method into `TransactionEncoder` ([#1087](https://github.com/rasbt/mlxtend/issues/1087) via [it176131](https://github.com/it176131))
+- Implemented support for missing values in the input dataset of the FP-Growth and FP-Max algorithms. Added a new metric, Representativity, for the generated association rules ([#1004](https://github.com/rasbt/mlxtend/issues/1004) via [zazass8](https://github.com/zazass8)).
+Files updated:
+- [`mlxtend.frequent_patterns.fpcommon`]
+- [`mlxtend.frequent_patterns.fpgrowth`](https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/fpgrowth/)
+- [`mlxtend.frequent_patterns.fpmax`](https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/fpmax/)
+- [`mlxtend.frequent_patterns.association_rules`](https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/association_rules/)
+- [`mlxtend.frequent_patterns.association_rules`](https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/association_rules/) Implemented three new metrics: Jaccard, Certainty, and Kulczynski. ([#1096](https://github.com/rasbt/mlxtend/issues/1096))
+- Integrated scikit-learn's `set_output` method into `TransactionEncoder` ([#1087](https://github.com/rasbt/mlxtend/issues/1087) via [it176131](https://github.com/it176131))
 
 ##### Changes
 
+- [`mlxtend.frequent_patterns.fpcommon`] Added the `null_values` parameter to the `valid_input_check` signature so that input containing null values can be validated. Changed the return statement of `setup_fptree` to also return the disabled array it creates, and the signature of `generated_itemsets` to accept that array as a parameter. Added code in [`mlxtend.frequent_patterns.fpcommon`] and [`mlxtend.frequent_patterns.association_rules`](https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/association_rules/) to handle null values when `null_values` is True.
 - [`mlxtend.frequent_patterns.association_rules`](https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/association_rules/) Added optional parameter 'return_metrics' to only return a given list of metrics, rather than every possible metric.
 
 - Add `n_classes_` attribute to stacking classifiers for compatibility with scikit-learn 1.3 ([#1091](https://github.com/rasbt/mlxtend/issues/1091))

docs/sources/user_guide/frequent_patterns/association_rules.ipynb

Lines changed: 352 additions & 8 deletions
Large diffs are not rendered by default.

docs/sources/user_guide/frequent_patterns/fpgrowth.ipynb

Lines changed: 262 additions & 3 deletions
@@ -36,7 +36,9 @@
 "\n",
 "In general, the algorithm has been designed to operate on databases containing transactions, such as purchases by customers of a store. An itemset is considered as \"frequent\" if it meets a user-specified support threshold. For instance, if the support threshold is set to 0.5 (50%), a frequent itemset is defined as a set of items that occur together in at least 50% of all transactions in the database.\n",
 "\n",
-"In particular, and what makes it different from the Apriori frequent pattern mining algorithm, FP-Growth is an frequent pattern mining algorithm that does not require candidate generation. Internally, it uses a so-called FP-tree (frequent pattern tree) datastrucure without generating the candidate sets explicitly, which makes it particularly attractive for large datasets."
+"In particular, and what makes it different from the Apriori frequent pattern mining algorithm, FP-Growth is an frequent pattern mining algorithm that does not require candidate generation. Internally, it uses a so-called FP-tree (frequent pattern tree) datastrucure without generating the candidate sets explicitly, which makes it particularly attractive for large datasets.\n",
+"\n",
+"A new feature of this implementation handles the case where the input contains missing values [3]. The structure and logic of the algorithm are unchanged, but missing values are \"ignored\" when counting. This gives a more realistic indication of how frequently the generated items and itemsets occur. The support is computed differently: for a single item, the number of transactions in which that item is null is subtracted from the total number of transactions in the database to obtain the denominator of the support. For an itemset of more than one item, the number of transactions in which at least one of its items is null is subtracted from the total number of transactions."
 ]
 },
 {
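The support definition described in the added notebook paragraph can be sketched in plain pandas. This is only an illustration under stated assumptions, not the code this commit adds to `fpcommon`: the helper name `support_ignoring_nulls` is hypothetical, and the three columns are copied from the example DataFrame with missing values that appears later in this notebook diff.

```python
import numpy as np
import pandas as pd

# Three columns copied from the example DataFrame with missing values
# that this commit adds to the notebook (rows 0-4).
df = pd.DataFrame(
    {
        "Eggs": [True, True, True, False, True],
        "Kidney Beans": [True, True, True, np.nan, True],
        "Milk": [True, False, True, True, False],
    },
    dtype=object,
)


def support_ignoring_nulls(frame, items):
    """Hypothetical helper (not part of mlxtend): support of `items`,
    leaving rows with a null in any of the items out of the denominator."""
    cols = frame[list(items)]
    has_null = cols.isna().any(axis=1)    # null in at least one item
    all_true = cols.eq(True).all(axis=1)  # every item is present
    usable = len(frame) - int(has_null.sum())
    return float(all_true.sum()) / usable if usable else 0.0


for itemset in [("Kidney Beans",), ("Eggs",), ("Milk",), ("Eggs", "Kidney Beans")]:
    print(itemset, support_ignoring_nulls(df, itemset))
```

For this data the sketch gives 1.0, 0.8, 0.6 and 1.0, which agrees with the supports in the notebook output of `fpgrowth(..., null_values=True)` shown in this diff. Row 3 has a null in `Kidney Beans`, so it is dropped from the denominator both for that item and for the pair `(Eggs, Kidney Beans)`.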
@@ -49,6 +51,8 @@
 "\n",
 "[2] Agrawal, Rakesh, and Ramakrishnan Srikant. \"[Fast algorithms for mining association rules](https://www.it.uu.se/edu/course/homepage/infoutv/ht08/vldb94_rj.pdf).\" Proc. 20th int. conf. very large data bases, VLDB. Vol. 1215. 1994.\n",
 "\n",
+"[3] Ragel, A. and Crémilleux, B., 1998. \"[Treatment of missing values for association rules](https://link.springer.com/chapter/10.1007/3-540-64383-4_22)\". In Research and Development in Knowledge Discovery and Data Mining: Second Pacific-Asia Conference, PAKDD-98 Melbourne, Australia, April 15–17, 1998 Proceedings 2 (pp. 258-270). Springer Berlin Heidelberg.\n",
+"\n",
 "## Related\n",
 "\n",
 "- [FP-Max](./fpmax.md)\n",
@@ -479,6 +483,261 @@
 "fpgrowth(df, min_support=0.6, use_colnames=True)"
 ]
 },
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"The example below applies the algorithm to data with missing information, created by replacing randomly chosen entries of the original dataset with NaN."
+]
+},
+{
+"cell_type": "code",
+"execution_count": 3,
+"metadata": {},
+"outputs": [
+{
+"name": "stderr",
+"output_type": "stream",
+"text": [
+"C:\\Users\\User\\AppData\\Local\\Temp\\ipykernel_1940\\3278686283.py:9: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value 'nan' has dtype incompatible with bool, please explicitly cast to a compatible dtype first.\n",
+" df.iloc[idx[i], col[i]] = np.nan\n",
+"C:\\Users\\User\\AppData\\Local\\Temp\\ipykernel_1940\\3278686283.py:9: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value 'nan' has dtype incompatible with bool, please explicitly cast to a compatible dtype first.\n",
+" df.iloc[idx[i], col[i]] = np.nan\n",
+"C:\\Users\\User\\AppData\\Local\\Temp\\ipykernel_1940\\3278686283.py:9: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value 'nan' has dtype incompatible with bool, please explicitly cast to a compatible dtype first.\n",
+" df.iloc[idx[i], col[i]] = np.nan\n",
+"C:\\Users\\User\\AppData\\Local\\Temp\\ipykernel_1940\\3278686283.py:9: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value 'nan' has dtype incompatible with bool, please explicitly cast to a compatible dtype first.\n",
+" df.iloc[idx[i], col[i]] = np.nan\n",
+"C:\\Users\\User\\AppData\\Local\\Temp\\ipykernel_1940\\3278686283.py:9: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value 'nan' has dtype incompatible with bool, please explicitly cast to a compatible dtype first.\n",
+" df.iloc[idx[i], col[i]] = np.nan\n",
+"C:\\Users\\User\\AppData\\Local\\Temp\\ipykernel_1940\\3278686283.py:9: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value 'nan' has dtype incompatible with bool, please explicitly cast to a compatible dtype first.\n",
+" df.iloc[idx[i], col[i]] = np.nan\n",
+"C:\\Users\\User\\AppData\\Local\\Temp\\ipykernel_1940\\3278686283.py:9: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value 'nan' has dtype incompatible with bool, please explicitly cast to a compatible dtype first.\n",
+" df.iloc[idx[i], col[i]] = np.nan\n"
+]
+},
+{
+"data": {
+"text/html": [
+"<div>\n",
+"<style scoped>\n",
+" .dataframe tbody tr th:only-of-type {\n",
+" vertical-align: middle;\n",
+" }\n",
+"\n",
+" .dataframe tbody tr th {\n",
+" vertical-align: top;\n",
+" }\n",
+"\n",
+" .dataframe thead th {\n",
+" text-align: right;\n",
+" }\n",
+"</style>\n",
+"<table border=\"1\" class=\"dataframe\">\n",
+" <thead>\n",
+" <tr style=\"text-align: right;\">\n",
+" <th></th>\n",
+" <th>Apple</th>\n",
+" <th>Corn</th>\n",
+" <th>Dill</th>\n",
+" <th>Eggs</th>\n",
+" <th>Ice cream</th>\n",
+" <th>Kidney Beans</th>\n",
+" <th>Milk</th>\n",
+" <th>Nutmeg</th>\n",
+" <th>Onion</th>\n",
+" <th>Unicorn</th>\n",
+" <th>Yogurt</th>\n",
+" </tr>\n",
+" </thead>\n",
+" <tbody>\n",
+" <tr>\n",
+" <th>0</th>\n",
+" <td>False</td>\n",
+" <td>False</td>\n",
+" <td>False</td>\n",
+" <td>True</td>\n",
+" <td>False</td>\n",
+" <td>True</td>\n",
+" <td>True</td>\n",
+" <td>True</td>\n",
+" <td>True</td>\n",
+" <td>NaN</td>\n",
+" <td>NaN</td>\n",
+" </tr>\n",
+" <tr>\n",
+" <th>1</th>\n",
+" <td>False</td>\n",
+" <td>NaN</td>\n",
+" <td>True</td>\n",
+" <td>True</td>\n",
+" <td>False</td>\n",
+" <td>True</td>\n",
+" <td>False</td>\n",
+" <td>True</td>\n",
+" <td>True</td>\n",
+" <td>False</td>\n",
+" <td>True</td>\n",
+" </tr>\n",
+" <tr>\n",
+" <th>2</th>\n",
+" <td>True</td>\n",
+" <td>False</td>\n",
+" <td>False</td>\n",
+" <td>True</td>\n",
+" <td>False</td>\n",
+" <td>True</td>\n",
+" <td>True</td>\n",
+" <td>False</td>\n",
+" <td>False</td>\n",
+" <td>False</td>\n",
+" <td>False</td>\n",
+" </tr>\n",
+" <tr>\n",
+" <th>3</th>\n",
+" <td>False</td>\n",
+" <td>True</td>\n",
+" <td>False</td>\n",
+" <td>False</td>\n",
+" <td>NaN</td>\n",
+" <td>NaN</td>\n",
+" <td>True</td>\n",
+" <td>NaN</td>\n",
+" <td>False</td>\n",
+" <td>NaN</td>\n",
+" <td>True</td>\n",
+" </tr>\n",
+" <tr>\n",
+" <th>4</th>\n",
+" <td>False</td>\n",
+" <td>True</td>\n",
+" <td>False</td>\n",
+" <td>True</td>\n",
+" <td>NaN</td>\n",
+" <td>True</td>\n",
+" <td>False</td>\n",
+" <td>False</td>\n",
+" <td>NaN</td>\n",
+" <td>False</td>\n",
+" <td>False</td>\n",
+" </tr>\n",
+" </tbody>\n",
+"</table>\n",
+"</div>"
+],
+"text/plain": [
+" Apple Corn Dill Eggs Ice cream Kidney Beans Milk Nutmeg Onion \\\n",
+"0 False False False True False True True True True \n",
+"1 False NaN True True False True False True True \n",
+"2 True False False True False True True False False \n",
+"3 False True False False NaN NaN True NaN False \n",
+"4 False True False True NaN True False False NaN \n",
+"\n",
+" Unicorn Yogurt \n",
+"0 NaN NaN \n",
+"1 False True \n",
+"2 False False \n",
+"3 NaN True \n",
+"4 False False "
+]
+},
+"execution_count": 3,
+"metadata": {},
+"output_type": "execute_result"
+}
+],
+"source": [
+"import numpy as np\n",
+"from mlxtend.frequent_patterns import fpgrowth\n",
+"\n",
+"rows, columns = df.shape\n",
+"idx = np.random.randint(0, rows, 10)\n",
+"col = np.random.randint(0, columns, 10)\n",
+"\n",
+"for i in range(10):\n",
+"    df.iloc[idx[i], col[i]] = np.nan\n",
+"\n",
+"df"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"The same function as above is applied with `null_values=True` and a minimum support of 60%:"
+]
+},
+{
+"cell_type": "code",
+"execution_count": 6,
+"metadata": {},
+"outputs": [
+{
+"data": {
+"text/html": [
+"<div>\n",
+"<style scoped>\n",
+" .dataframe tbody tr th:only-of-type {\n",
+" vertical-align: middle;\n",
+" }\n",
+"\n",
+" .dataframe tbody tr th {\n",
+" vertical-align: top;\n",
+" }\n",
+"\n",
+" .dataframe thead th {\n",
+" text-align: right;\n",
+" }\n",
+"</style>\n",
+"<table border=\"1\" class=\"dataframe\">\n",
+" <thead>\n",
+" <tr style=\"text-align: right;\">\n",
+" <th></th>\n",
+" <th>support</th>\n",
+" <th>itemsets</th>\n",
+" </tr>\n",
+" </thead>\n",
+" <tbody>\n",
+" <tr>\n",
+" <th>0</th>\n",
+" <td>1.0</td>\n",
+" <td>(Kidney Beans)</td>\n",
+" </tr>\n",
+" <tr>\n",
+" <th>1</th>\n",
+" <td>0.8</td>\n",
+" <td>(Eggs)</td>\n",
+" </tr>\n",
+" <tr>\n",
+" <th>2</th>\n",
+" <td>0.6</td>\n",
+" <td>(Milk)</td>\n",
+" </tr>\n",
+" <tr>\n",
+" <th>3</th>\n",
+" <td>1.0</td>\n",
+" <td>(Eggs, Kidney Beans)</td>\n",
+" </tr>\n",
+" </tbody>\n",
+"</table>\n",
+"</div>"
+],
+"text/plain": [
+" support itemsets\n",
+"0 1.0 (Kidney Beans)\n",
+"1 0.8 (Eggs)\n",
+"2 0.6 (Milk)\n",
+"3 1.0 (Eggs, Kidney Beans)"
+]
+},
+"execution_count": 6,
+"metadata": {},
+"output_type": "execute_result"
+}
+],
+"source": [
+"fpgrowth(df, min_support=0.6, null_values=True, use_colnames=True)"
+]
+},
 {
 "cell_type": "markdown",
 "metadata": {},
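The `FutureWarning` captured in the stderr output of the added cell arises because NaN is written into columns whose dtype is bool. One way to avoid it, sketched here as an assumption rather than as what the notebook authors intend, is to cast the frame to `object` before masking, which is the dtype the warning-producing assignment ends up with anyway. The small frame, the seed, and the names `n_missing` and `df_missing` are hypothetical stand-ins for the notebook's one-hot encoded `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the one-hot encoded `df` used in the notebook.
df = pd.DataFrame(
    np.array([[True, False, True], [False, True, True], [True, True, False]]),
    columns=["Eggs", "Milk", "Onion"],
)

rng = np.random.default_rng(0)  # fixed seed so the masked cells are reproducible
n_missing = 4
rows = rng.integers(0, df.shape[0], n_missing)
cols = rng.integers(0, df.shape[1], n_missing)

# Cast to object first so NaN can sit next to True/False without pandas
# warning about a value that is incompatible with the bool dtype.
df_missing = df.astype(object)
for r, c in zip(rows, cols):
    df_missing.iloc[r, c] = np.nan

print(df_missing)
```

Because the row and column indices are drawn independently, the same cell can be picked twice, so the number of NaN entries is at most `n_missing`. The original `df` is left untouched, which keeps the earlier notebook cells reproducible.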
@@ -677,7 +936,7 @@
 "metadata": {
 "anaconda-cloud": {},
 "kernelspec": {
-"display_name": "Python 3 (ipykernel)",
+"display_name": "Python 3",
 "language": "python",
 "name": "python3"
 },
@@ -691,7 +950,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.10.10"
+"version": "3.12.7"
 },
 "toc": {
 "nav_menu": {},
