|
36 | 36 | "\n", |
37 | 37 | "In general, the algorithm has been designed to operate on databases containing transactions, such as purchases by customers of a store. An itemset is considered as \"frequent\" if it meets a user-specified support threshold. For instance, if the support threshold is set to 0.5 (50%), a frequent itemset is defined as a set of items that occur together in at least 50% of all transactions in the database.\n", |
38 | 38 | "\n", |
39 | | - "In particular, and what makes it different from the Apriori frequent pattern mining algorithm, FP-Growth is a frequent pattern mining algorithm that does not require candidate generation. Internally, it uses a so-called FP-tree (frequent pattern tree) data structure without generating the candidate sets explicitly, which makes it particularly attractive for large datasets." |
| 39 | + "In particular, and what makes it different from the Apriori frequent pattern mining algorithm, FP-Growth is a frequent pattern mining algorithm that does not require candidate generation. Internally, it uses a so-called FP-tree (frequent pattern tree) data structure without generating the candidate sets explicitly, which makes it particularly attractive for large datasets.\n", |
| 40 | + "\n", |
| 41 | + "This implementation also handles input that contains missing values [3]. The structure and logic of the algorithm stay the same, but missing values are \"ignored\" when the support is computed, which gives a more realistic indication of how frequently the generated items and itemsets occur. For a single item, the number of transactions in which that item is missing is subtracted from the total number of transactions in the database. For an itemset of more than one item, the number of transactions in which at least one item of the itemset is missing is subtracted from the total number of transactions in the database." |
40 | 42 | ] |
41 | 43 | }, |
42 | 44 | { |
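The support rule described in the added paragraph can be sketched with plain pandas. This is a minimal illustration under stated assumptions, not the code from this change: the helper name `support_ignoring_nulls` is hypothetical, and the small frame is transcribed from three columns of the notebook output shown further down in this diff.

```python
import numpy as np
import pandas as pd


def support_ignoring_nulls(df, items):
    """Support of `items`, discounting rows where any of them is missing.

    Hypothetical helper for illustration only; not part of mlxtend.
    """
    sub = df[list(items)]
    # Rows in which at least one item of the itemset is missing.
    has_null = sub.isna().any(axis=1)
    # Rows in which every item of the itemset is present (NaN is not True).
    all_present = sub.eq(True).all(axis=1)
    # Denominator: all transactions minus those with a missing value.
    denom = len(df) - has_null.sum()
    return all_present.sum() / denom if denom else float("nan")


# Three columns transcribed from the notebook output in this diff.
df = pd.DataFrame(
    {
        "Eggs": [True, True, True, False, True],
        "Kidney Beans": [True, True, True, np.nan, True],
        "Milk": [True, False, True, True, False],
    }
)

print(support_ignoring_nulls(df, ["Eggs"]))                  # 4 / 5 = 0.8
print(support_ignoring_nulls(df, ["Kidney Beans"]))          # 4 / (5 - 1) = 1.0
print(support_ignoring_nulls(df, ["Eggs", "Kidney Beans"]))  # 4 / (5 - 1) = 1.0
```

These values agree with the `support` column reported by the `null_values=True` example in the notebook, which suggests the sketch follows the same rule, although the actual implementation may differ in detail.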
|
49 | 51 | "\n", |
50 | 52 | "[2] Agrawal, Rakesh, and Ramakrishnan Srikant. \"[Fast algorithms for mining association rules](https://www.it.uu.se/edu/course/homepage/infoutv/ht08/vldb94_rj.pdf).\" Proc. 20th int. conf. very large data bases, VLDB. Vol. 1215. 1994.\n", |
51 | 53 | "\n", |
| 54 | + "[3] Ragel, A. and Crémilleux, B., 1998. \"[Treatment of missing values for association rules](https://link.springer.com/chapter/10.1007/3-540-64383-4_22)\". In Research and Development in Knowledge Discovery and Data Mining: Second Pacific-Asia Conference, PAKDD-98 Melbourne, Australia, April 15–17, 1998 Proceedings 2 (pp. 258-270). Springer Berlin Heidelberg.\n", |
| 55 | + "\n", |
52 | 56 | "## Related\n", |
53 | 57 | "\n", |
54 | 58 | "- [FP-Max](./fpmax.md)\n", |
|
479 | 483 | "fpgrowth(df, min_support=0.6, use_colnames=True)" |
480 | 484 | ] |
481 | 485 | }, |
| 486 | + { |
| 487 | + "cell_type": "markdown", |
| 488 | + "metadata": {}, |
| 489 | + "source": [ |
| 490 | + "The example below applies the algorithm to data with missing information, created by setting randomly chosen entries of the original dataset to NaN." |
| 491 | + ] |
| 492 | + }, |
| 493 | + { |
| 494 | + "cell_type": "code", |
| 495 | + "execution_count": 3, |
| 496 | + "metadata": {}, |
| 497 | + "outputs": [ |
| 498 | + { |
| 499 | + "name": "stderr", |
| 500 | + "output_type": "stream", |
| 501 | + "text": [ |
| 502 | + "C:\\Users\\User\\AppData\\Local\\Temp\\ipykernel_1940\\3278686283.py:9: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value 'nan' has dtype incompatible with bool, please explicitly cast to a compatible dtype first.\n", |
| 503 | + " df.iloc[idx[i], col[i]] = np.nan\n" |
| 516 | + ] |
| 517 | + }, |
| 518 | + { |
| 519 | + "data": { |
| 520 | + "text/html": [ |
| 521 | + "<div>\n", |
| 522 | + "<style scoped>\n", |
| 523 | + " .dataframe tbody tr th:only-of-type {\n", |
| 524 | + " vertical-align: middle;\n", |
| 525 | + " }\n", |
| 526 | + "\n", |
| 527 | + " .dataframe tbody tr th {\n", |
| 528 | + " vertical-align: top;\n", |
| 529 | + " }\n", |
| 530 | + "\n", |
| 531 | + " .dataframe thead th {\n", |
| 532 | + " text-align: right;\n", |
| 533 | + " }\n", |
| 534 | + "</style>\n", |
| 535 | + "<table border=\"1\" class=\"dataframe\">\n", |
| 536 | + " <thead>\n", |
| 537 | + " <tr style=\"text-align: right;\">\n", |
| 538 | + " <th></th>\n", |
| 539 | + " <th>Apple</th>\n", |
| 540 | + " <th>Corn</th>\n", |
| 541 | + " <th>Dill</th>\n", |
| 542 | + " <th>Eggs</th>\n", |
| 543 | + " <th>Ice cream</th>\n", |
| 544 | + " <th>Kidney Beans</th>\n", |
| 545 | + " <th>Milk</th>\n", |
| 546 | + " <th>Nutmeg</th>\n", |
| 547 | + " <th>Onion</th>\n", |
| 548 | + " <th>Unicorn</th>\n", |
| 549 | + " <th>Yogurt</th>\n", |
| 550 | + " </tr>\n", |
| 551 | + " </thead>\n", |
| 552 | + " <tbody>\n", |
| 553 | + " <tr>\n", |
| 554 | + " <th>0</th>\n", |
| 555 | + " <td>False</td>\n", |
| 556 | + " <td>False</td>\n", |
| 557 | + " <td>False</td>\n", |
| 558 | + " <td>True</td>\n", |
| 559 | + " <td>False</td>\n", |
| 560 | + " <td>True</td>\n", |
| 561 | + " <td>True</td>\n", |
| 562 | + " <td>True</td>\n", |
| 563 | + " <td>True</td>\n", |
| 564 | + " <td>NaN</td>\n", |
| 565 | + " <td>NaN</td>\n", |
| 566 | + " </tr>\n", |
| 567 | + " <tr>\n", |
| 568 | + " <th>1</th>\n", |
| 569 | + " <td>False</td>\n", |
| 570 | + " <td>NaN</td>\n", |
| 571 | + " <td>True</td>\n", |
| 572 | + " <td>True</td>\n", |
| 573 | + " <td>False</td>\n", |
| 574 | + " <td>True</td>\n", |
| 575 | + " <td>False</td>\n", |
| 576 | + " <td>True</td>\n", |
| 577 | + " <td>True</td>\n", |
| 578 | + " <td>False</td>\n", |
| 579 | + " <td>True</td>\n", |
| 580 | + " </tr>\n", |
| 581 | + " <tr>\n", |
| 582 | + " <th>2</th>\n", |
| 583 | + " <td>True</td>\n", |
| 584 | + " <td>False</td>\n", |
| 585 | + " <td>False</td>\n", |
| 586 | + " <td>True</td>\n", |
| 587 | + " <td>False</td>\n", |
| 588 | + " <td>True</td>\n", |
| 589 | + " <td>True</td>\n", |
| 590 | + " <td>False</td>\n", |
| 591 | + " <td>False</td>\n", |
| 592 | + " <td>False</td>\n", |
| 593 | + " <td>False</td>\n", |
| 594 | + " </tr>\n", |
| 595 | + " <tr>\n", |
| 596 | + " <th>3</th>\n", |
| 597 | + " <td>False</td>\n", |
| 598 | + " <td>True</td>\n", |
| 599 | + " <td>False</td>\n", |
| 600 | + " <td>False</td>\n", |
| 601 | + " <td>NaN</td>\n", |
| 602 | + " <td>NaN</td>\n", |
| 603 | + " <td>True</td>\n", |
| 604 | + " <td>NaN</td>\n", |
| 605 | + " <td>False</td>\n", |
| 606 | + " <td>NaN</td>\n", |
| 607 | + " <td>True</td>\n", |
| 608 | + " </tr>\n", |
| 609 | + " <tr>\n", |
| 610 | + " <th>4</th>\n", |
| 611 | + " <td>False</td>\n", |
| 612 | + " <td>True</td>\n", |
| 613 | + " <td>False</td>\n", |
| 614 | + " <td>True</td>\n", |
| 615 | + " <td>NaN</td>\n", |
| 616 | + " <td>True</td>\n", |
| 617 | + " <td>False</td>\n", |
| 618 | + " <td>False</td>\n", |
| 619 | + " <td>NaN</td>\n", |
| 620 | + " <td>False</td>\n", |
| 621 | + " <td>False</td>\n", |
| 622 | + " </tr>\n", |
| 623 | + " </tbody>\n", |
| 624 | + "</table>\n", |
| 625 | + "</div>" |
| 626 | + ], |
| 627 | + "text/plain": [ |
| 628 | + " Apple Corn Dill Eggs Ice cream Kidney Beans Milk Nutmeg Onion \\\n", |
| 629 | + "0 False False False True False True True True True \n", |
| 630 | + "1 False NaN True True False True False True True \n", |
| 631 | + "2 True False False True False True True False False \n", |
| 632 | + "3 False True False False NaN NaN True NaN False \n", |
| 633 | + "4 False True False True NaN True False False NaN \n", |
| 634 | + "\n", |
| 635 | + " Unicorn Yogurt \n", |
| 636 | + "0 NaN NaN \n", |
| 637 | + "1 False True \n", |
| 638 | + "2 False False \n", |
| 639 | + "3 NaN True \n", |
| 640 | + "4 False False " |
| 641 | + ] |
| 642 | + }, |
| 643 | + "execution_count": 3, |
| 644 | + "metadata": {}, |
| 645 | + "output_type": "execute_result" |
| 646 | + } |
| 647 | + ], |
| 648 | + "source": [ |
| 649 | + "import numpy as np\n", |
| 650 | + "from mlxtend.frequent_patterns import fpgrowth\n", |
| 651 | + "\n", |
| 652 | + "rows, columns = df.shape\n", |
| 653 | + "idx = np.random.randint(0, rows, 10)\n", |
| 654 | + "col = np.random.randint(0, columns, 10)\n", |
| 655 | + "\n", |
| 656 | + "for i in range(10):\n", |
| 657 | + " df.iloc[idx[i], col[i]] = np.nan\n", |
| 658 | + "\n", |
| 659 | + "df" |
| 660 | + ] |
| 661 | + }, |
| 662 | + { |
| 663 | + "cell_type": "markdown", |
| 664 | + "metadata": {}, |
| 665 | + "source": [ |
| 666 | + "The same function as above is then applied with `null_values=True` and a minimum support of 60%:" |
| 667 | + ] |
| 668 | + }, |
| 669 | + { |
| 670 | + "cell_type": "code", |
| 671 | + "execution_count": 6, |
| 672 | + "metadata": {}, |
| 673 | + "outputs": [ |
| 674 | + { |
| 675 | + "data": { |
| 676 | + "text/html": [ |
| 677 | + "<div>\n", |
| 678 | + "<style scoped>\n", |
| 679 | + " .dataframe tbody tr th:only-of-type {\n", |
| 680 | + " vertical-align: middle;\n", |
| 681 | + " }\n", |
| 682 | + "\n", |
| 683 | + " .dataframe tbody tr th {\n", |
| 684 | + " vertical-align: top;\n", |
| 685 | + " }\n", |
| 686 | + "\n", |
| 687 | + " .dataframe thead th {\n", |
| 688 | + " text-align: right;\n", |
| 689 | + " }\n", |
| 690 | + "</style>\n", |
| 691 | + "<table border=\"1\" class=\"dataframe\">\n", |
| 692 | + " <thead>\n", |
| 693 | + " <tr style=\"text-align: right;\">\n", |
| 694 | + " <th></th>\n", |
| 695 | + " <th>support</th>\n", |
| 696 | + " <th>itemsets</th>\n", |
| 697 | + " </tr>\n", |
| 698 | + " </thead>\n", |
| 699 | + " <tbody>\n", |
| 700 | + " <tr>\n", |
| 701 | + " <th>0</th>\n", |
| 702 | + " <td>1.0</td>\n", |
| 703 | + " <td>(Kidney Beans)</td>\n", |
| 704 | + " </tr>\n", |
| 705 | + " <tr>\n", |
| 706 | + " <th>1</th>\n", |
| 707 | + " <td>0.8</td>\n", |
| 708 | + " <td>(Eggs)</td>\n", |
| 709 | + " </tr>\n", |
| 710 | + " <tr>\n", |
| 711 | + " <th>2</th>\n", |
| 712 | + " <td>0.6</td>\n", |
| 713 | + " <td>(Milk)</td>\n", |
| 714 | + " </tr>\n", |
| 715 | + " <tr>\n", |
| 716 | + " <th>3</th>\n", |
| 717 | + " <td>1.0</td>\n", |
| 718 | + " <td>(Eggs, Kidney Beans)</td>\n", |
| 719 | + " </tr>\n", |
| 720 | + " </tbody>\n", |
| 721 | + "</table>\n", |
| 722 | + "</div>" |
| 723 | + ], |
| 724 | + "text/plain": [ |
| 725 | + " support itemsets\n", |
| 726 | + "0 1.0 (Kidney Beans)\n", |
| 727 | + "1 0.8 (Eggs)\n", |
| 728 | + "2 0.6 (Milk)\n", |
| 729 | + "3 1.0 (Eggs, Kidney Beans)" |
| 730 | + ] |
| 731 | + }, |
| 732 | + "execution_count": 6, |
| 733 | + "metadata": {}, |
| 734 | + "output_type": "execute_result" |
| 735 | + } |
| 736 | + ], |
| 737 | + "source": [ |
| 738 | + "fpgrowth(df, min_support=0.6, null_values=True, use_colnames=True)" |
| 739 | + ] |
| 740 | + }, |
482 | 741 | { |
483 | 742 | "cell_type": "markdown", |
484 | 743 | "metadata": {}, |
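The cell that injects missing values emits a pandas `FutureWarning` because NaN is written into boolean columns, and its result changes on every run because the random generator is unseeded. One possible alternative is sketched below, assuming that casting to `object` dtype first is acceptable for the example. The helper name `mask_random_cells`, the seed, and the toy frame are illustrative assumptions, and whether `fpgrowth` treats the resulting frame identically has not been checked here.

```python
import numpy as np
import pandas as pd


def mask_random_cells(df, n_cells, seed=0):
    """Return a copy of `df` with up to `n_cells` random entries set to NaN.

    Hypothetical helper for illustration only. Fewer than `n_cells` entries
    are masked when the same (row, column) pair is drawn more than once.
    """
    rng = np.random.default_rng(seed)  # seeded, so the result is reproducible
    out = df.astype(object)            # object columns can hold True, False and NaN
    rows = rng.integers(0, out.shape[0], size=n_cells)
    cols = rng.integers(0, out.shape[1], size=n_cells)
    for r, c in zip(rows, cols):
        out.iloc[r, c] = np.nan
    return out


# Toy one-hot encoded transactions (made-up data).
baskets = pd.DataFrame(
    {
        "Eggs": [True, False, True, True],
        "Milk": [False, True, True, False],
        "Onion": [True, True, False, False],
    }
)

masked = mask_random_cells(baskets, n_cells=5, seed=42)
print(masked)
```

Working on a copy leaves the original boolean frame untouched, so the earlier examples in the notebook can still be rerun on complete data.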
|
677 | 936 | "metadata": { |
678 | 937 | "anaconda-cloud": {}, |
679 | 938 | "kernelspec": { |
680 | | - "display_name": "Python 3 (ipykernel)", |
| 939 | + "display_name": "Python 3", |
681 | 940 | "language": "python", |
682 | 941 | "name": "python3" |
683 | 942 | }, |
|
691 | 950 | "name": "python", |
692 | 951 | "nbconvert_exporter": "python", |
693 | 952 | "pygments_lexer": "ipython3", |
694 | | - "version": "3.10.10" |
| 953 | + "version": "3.12.7" |
695 | 954 | }, |
696 | 955 | "toc": { |
697 | 956 | "nav_menu": {}, |
|