Update the project notebooks

kellyjadams · kellyjadams · commit 0b27d03a6d94 · 2024-06-06T22:50:41.000-07:00
Update explanations and fix typos.
diff --git a/3_Project/1_EDA_Intro.ipynb b/3_Project/1_EDA_Intro.ipynb
@@ -69,53 +69,10 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 1,
+   "execution_count": null,
    "id": "f3dac2c7",
    "metadata": {},
-   "outputs": [
-    {
-     "data": {
-      "application/vnd.jupyter.widget-view+json": {
-       "model_id": "d128595bf3844deaa2d23932ca313c0f",
-       "version_major": 2,
-       "version_minor": 0
-      },
-      "text/plain": [
-       "Downloading readme:   0%|          | 0.00/28.0 [00:00<?, ?B/s]"
-      ]
-     },
-     "metadata": {},
-     "output_type": "display_data"
-    },
-    {
-     "data": {
-      "application/vnd.jupyter.widget-view+json": {
-       "model_id": "32c68ae095ef486ca9ace3fa2c7b0ad5",
-       "version_major": 2,
-       "version_minor": 0
-      },
-      "text/plain": [
-       "Downloading data:   0%|          | 0.00/231M [00:00<?, ?B/s]"
-      ]
-     },
-     "metadata": {},
-     "output_type": "display_data"
-    },
-    {
-     "data": {
-      "application/vnd.jupyter.widget-view+json": {
-       "model_id": "44f90b12d99a4d23b34120b6fdcf8bb6",
-       "version_major": 2,
-       "version_minor": 0
-      },
-      "text/plain": [
-       "Generating train split:   0%|          | 0/785741 [00:00<?, ? examples/s]"
-      ]
-     },
-     "metadata": {},
-     "output_type": "display_data"
-    }
-   ],
+   "outputs": [],
    "source": [
     "# Importing Libraries\n",
     "import ast\n",
diff --git a/3_Project/2_Skill_Demand.ipynb b/3_Project/2_Skill_Demand.ipynb
@@ -755,7 +755,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Filters and sorts a DataFrame to get the top 10 skills percentages for these top 3 roles. After sorting the skills by descending percentage, reverse the order of these top 5 entries to use in a horizontal bar plot, which by default starts plotting from the bottom."
+    "Filters and sorts a DataFrame to get the top 5 skills percentages for these top 3 roles. After sorting the skills by descending percentage, reverse the order of these top 5 entries to use in a horizontal bar plot, which by default starts plotting from the bottom."
    ]
   },
   {
diff --git a/3_Project/3_Skills_Trend.ipynb b/3_Project/3_Skills_Trend.ipynb
@@ -65,7 +65,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Select only those job postings that are for Data Analysts and then extracts the month from each job's posting date to see when jobs are listed. Next, converts a column that lists skills into a usable list format. Finally, it rearranges the data so that each skill from the list gets its own row."
+    "Select only those job postings that are for Data Analysts and the job country is the United States. Then extract the month from each job's posting date to see when jobs are listed. Next, converts a column that lists skills into a usable list format. Finally, it rearranges the data so that each skill from the list gets its own row."
    ]
   },
   {
@@ -89,7 +89,7 @@
    "source": [
     "### Pivot in Prep for Plotting\n",
     "\n",
-    "Create a pivot table from the `df_DA_US_explode`, setting 'month' as the index, `job_skills` as the columns, and fills missing entries with zero. It adds a new row labeled `Total` that sums up counts across all months for each skill. Finally, it reorders the columns based on the total counts, displaying them from highest to lowest, and shows the updated pivot table. "
+    "Create a pivot table from the `df_DA_US_explode`, setting 'month' as the index, `job_skills` as the columns, and fills missing entries with zero."
    ]
   },
   {
@@ -524,7 +524,9 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "#### Sort columns by count and change month numbers to names"
+    "#### Sort columns by count and change month numbers to names\n",
+    "\n",
+    "It adds a new row labeled `Total` that sums up counts across all months for each skill. Finally, it reorders the columns based on the total counts, displaying them from highest to lowest, and shows the updated pivot table. "
    ]
   },
   {
@@ -983,7 +985,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Plot a line chart of the top five skills for data analysts, indexed by month. It excludes the 'Total' row and plots only the monthly data. Then it selects the first five columns and plots them. "
+    "Plot a line chart of the top 5 skills for data analysts, indexed by month. It selects the first five columns and plots them. "
    ]
   },
   {
@@ -1525,7 +1527,7 @@
    "source": [
     "## Plot Monthly Skill Demand \n",
     "\n",
-    "Creates a line plot for the top five skills of data analysts, shown as percentages of the total job entries per month, using the first five columns of the `df_DA_pivot_percent` DataFrame. Also the legend is moved outside of the plot for readability."
+    "Creates a line plot for the top five skills of data analysts, shown as percentages of the total job entries per month, using the first 5 columns of the `df_DA_pivot_percent` DataFrame. Also the legend is moved outside of the plot for readability."
    ]
   },
   {
diff --git a/3_Project/4_Salary_Analysis.ipynb b/3_Project/4_Salary_Analysis.ipynb
@@ -176,7 +176,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Filters the original dataset to focus on entries where the job title is 'Data Analyst', to create a new DataFrame `df_DA`. Converts the `job_skills` column, from string format to actual list objects if they are not null. Then it uses the `explode` method on the `job_skills` column to create a new row in the DataFrame for each skill associated with a job. Finally, it displays the first five entries of the `salary_year_avg` and `job_skills` columns."
+    "Filters the original dataset to only get rows where the job title is 'Data Analyst' and the country is 'United States', to create a new DataFrame `df_DA_US`. Drop NaN values from the 'salary_year_avg' column. Then it uses the `explode` method on the `job_skills` column to create a new row in the DataFrame for each skill associated with a job. Finally, it displays the first five entries of the `salary_year_avg` and `job_skills` columns."
    ]
   },
   {
@@ -276,7 +276,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Gets the top ten highest-paying skills for Data Analysts by calculating the median salary for each skill listed in the `df_DA`. It groups the data by job skills, computes the median salary, sorts these values in descending order, and then selects the top ten. This is then formatted into a new DataFrame (`df_DA_top_pay`) with a reset index and a renamed salary column labeled 'median_salary'."
+    "Gets the top ten highest-paying skills for Data Analysts by calculating the median salary for each skill listed in the `df_DA_US`. It groups the data by job skills, computes the median salary, sorts these values in descending order by median, and then selects the top 10. This is then formatted into a new DataFrame (`df_DA_top_pay`) with a reset index and a renamed salary column labeled 'median_salary'."
    ]
   },
   {
@@ -401,7 +401,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Calculates the count and median salary for each skill in `df_DA`. It groups the data by `job_skills`, aggregates it to find the count and median salary for each skill, and then sorts the results by count in descending order. It re-sorts this subset by median salary in descending order."
+    "Calculates the count and median salary for each skill in `df_DA_US`. It groups the data by `job_skills`, aggregates it to find the count and median salary for each skill, and then sorts the results by count in descending order by count. It re-sorts this subset by median salary in descending order."
    ]
   },
   {
diff --git a/3_Project/5_Optimal_Skills.ipynb b/3_Project/5_Optimal_Skills.ipynb
@@ -70,7 +70,7 @@
    "source": [
     "## Clean Data\n",
     "\n",
-    "Clean the data and convert the `job_skills` list to a list object. Get the salary_year_average for each job skill."
+    "Filters the original dataset to only get rows where the job title is 'Data Analyst' and the country is 'United States', to create a new DataFrame `df_DA_US`. Drop NaN values from the 'salary_year_avg' column. Then it uses the `explode` method on the `job_skills` column to create a new row in a new DataFrame (`df_DA_US_exploded`) for each skill associated with a job. Finally, it displays the first 5 entries of the `salary_year_avg` and `job_skills` columns."
    ]
   },
   {
@@ -169,7 +169,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Group the data by job skills and calculates the count and median salary for each skill, sorting the results in descending order by count. It then determines the total number of Data Analyst positions that have non-missing salary information. Then calculates the percentage that each skill count represents out of the total number of Data Analyst jobs. Finally, filter out any skills that don't have any jobs associated with them."
+    "Group the data by job skills and calculates the count and median salary for each skill, sorting the results in descending order by count. It then renames the columns. Calculates the percentage that each skill count represents out of the total number of Data Analyst jobs. Finally, filter out any skills that don't have any jobs associated with them."
    ]
   },
   {
@@ -330,7 +330,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Filters for Data Analyst skills that exceed a certain percentage (`skill_limit`) and plots a scatter plot to show how the median salary for these high-demand skills."
+    "Filters for Data Analyst skills that exceed a certain percentage (`skill_limit`)."
    ]
   },
   {
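
The updated explanations in `4_Salary_Analysis.ipynb` and `5_Optimal_Skills.ipynb` describe the same pipeline: filter to US Data Analyst postings, drop rows without a salary, explode `job_skills`, then aggregate count and median salary per skill. Below is a minimal sketch of that sequence on toy data, not the notebooks' actual code. The column names `job_title_short` and `job_country`, the result name `df_DA_skills`, the renamed columns `skill_count` and `skill_percent`, and the sample rows are all assumptions made for illustration. Only `salary_year_avg`, `job_skills`, `df_DA_US`, `df_DA_US_exploded`, and `median_salary` appear in the text above.

```python
import ast

import pandas as pd

# Hypothetical toy data. `job_title_short` and `job_country` are assumed
# column names; `salary_year_avg` and `job_skills` come from the notebook text.
df = pd.DataFrame({
    "job_title_short": ["Data Analyst", "Data Analyst", "Data Engineer", "Data Analyst"],
    "job_country": ["United States", "United States", "United States", "Canada"],
    "salary_year_avg": [90000.0, 110000.0, 130000.0, None],
    "job_skills": ["['sql', 'python']", "['sql', 'excel']", "['python']", "['sql']"],
})

# Convert the stringified lists into real Python lists, leaving missing values alone.
df["job_skills"] = df["job_skills"].apply(
    lambda s: ast.literal_eval(s) if pd.notna(s) else s
)

# Keep only US Data Analyst postings, then drop rows with no salary.
df_DA_US = df[
    (df["job_title_short"] == "Data Analyst")
    & (df["job_country"] == "United States")
].dropna(subset=["salary_year_avg"])

# One row per (posting, skill) pair.
df_DA_US_exploded = df_DA_US.explode("job_skills")

# Count and median salary per skill, sorted by count, with renamed columns.
df_DA_skills = (
    df_DA_US_exploded.groupby("job_skills")["salary_year_avg"]
    .agg(["count", "median"])
    .sort_values(by="count", ascending=False)
    .rename(columns={"count": "skill_count", "median": "median_salary"})
)

# Percentage of the filtered postings that mention each skill.
df_DA_skills["skill_percent"] = df_DA_skills["skill_count"] / len(df_DA_US) * 100

print(df_DA_skills)
```

Dividing by `len(df_DA_US)` rather than by the exploded row count keeps the percentage relative to postings, not to skill mentions. Whether the notebooks compute the denominator this way is not shown in the diff.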
