
SIH-PROJECT

Problem statement

The MGNREGA program houses a large volume of data (in excess of 50 TB) on various parameters across the country, consisting of year-on-year data from FY 2005-06 to date. As part of its reporting and monitoring activities, the program maintains a large and complex reporting framework of more than 600 reports. A solution is required that streamlines the reporting process (report generation), highlights and eliminates duplicate reports, properly categorizes reports, and flags reports as high/medium/low importance. An on-the-fly facility for dynamically generating reports by selecting the required filters/parameters may also be conceptualized, developed, and implemented with minimal gaps and errors. Sample data required: yes (reports available in the public domain).

Objective

Python Data Pre-processing using Spark DataFrames

  1. Loading data (loading the CSV file into HDFS, then from HDFS into Spark; see the sketch after this list)
  2. Exploring data
     2.1 Understanding the DataFrame schema
     2.2 Obtaining summary statistics
     2.3 GroupBy and aggregation
     2.4 Visualizing data
  3. Cleaning data
     3.1 Filtering data
  4. Streamlining the reporting data
  5. Eliminating duplicate reports / categorizing reports
  6. Highlighting high/medium/low importance reports
  7. Generating reports by selecting the required filters/parameters
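
A minimal PySpark sketch of steps 1-3, assuming the CSV has already been copied to HDFS at /data/fulldata.csv; the path and the column names 'State Name' and 'Job card Holders' are illustrative assumptions, not the project's actual schema:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("mgnrega-preprocessing").getOrCreate()

    # 1. Load the CSV from HDFS into a Spark DataFrame (path is an assumption)
    df = spark.read.csv("hdfs:///data/fulldata.csv", header=True, inferSchema=True)

    # 2.1 Understand the DataFrame schema
    df.printSchema()

    # 2.2 Obtain summary statistics for the numeric columns
    df.describe().show()

    # 2.3 GroupBy and aggregation: total job card holders per state
    df.groupBy("State Name").agg(F.sum("Job card Holders").alias("total_job_cards")).show()

    # 3.1 Filter the data, e.g. keep only rows with at least one job card holder
    df.filter(F.col("Job card Holders") > 0).show()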

Architecture Diagram

Output

To start all the daemon services
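
Assuming a standard Hadoop installation, the HDFS and YARN daemons are started with the bundled scripts, and jps confirms they are running:

    start-dfs.sh    # starts NameNode, DataNode(s), SecondaryNameNode
    start-yarn.sh   # starts ResourceManager, NodeManager(s)
    jps             # lists running JVM processes to verify the daemons are up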


PySpark

MapReduce

MapReduce Task

To Streamline the Data

spark-submit spark_stream_main.py localhost 9999

spark_stream_main.py - map-reduce style processing of live data streamed from localhost port 9999
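
The repository's spark_stream_main.py is not reproduced here; the following is a minimal sketch of the same pattern using the DStream API (available in Spark 2.x/3.x): a streaming job that map-reduces text arriving on a socket. Feed it test data first, e.g. with nc -lk 9999:

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="spark_stream_sketch")
    ssc = StreamingContext(sc, batchDuration=5)  # 5-second micro-batches

    # Receive live data from the host/port given on the command line (hard-coded here)
    lines = ssc.socketTextStream("localhost", 9999)

    # Classic map-reduce over each micro-batch: count occurrences per field
    counts = (lines.flatMap(lambda line: line.split(","))
                   .map(lambda field: (field, 1))
                   .reduceByKey(lambda a, b: a + b))
    counts.pprint()

    ssc.start()
    ssc.awaitTermination()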

Hadoop Web UI

Moving data from the local file system to HDFS.

Once the data is in HDFS, it is replicated and can be accessed from anywhere on the cluster.
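
Assuming the dataset sits in the local working directory, copying it into HDFS is a one-time step (the /data target directory is an illustrative choice):

    hdfs dfs -mkdir -p /data
    hdfs dfs -put fulldata.csv /data/
    hdfs dfs -ls /data    # verify the upload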

Dealing with the 50 TB dataset using functions

process_data.py implements the following operations (sketched after the list):

  1. Remove duplicate records.
  2. Remove duplicates based on a given column.
  3. Sort states by maximum job card holders.
  4. Sort states by minimum job card holders.
  5. Sort states by maximum job cards held by SC households.
  6. Sort states by minimum job cards held by SC households.
  7. Sort by state name.
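
A minimal sketch of those operations with the Spark DataFrame API, again under illustrative column-name assumptions ('State Name', 'Job card Holders', 'SC Job card Holders'):

    from pyspark.sql import functions as F

    # 1. Remove fully duplicate records
    df = df.dropDuplicates()

    # 2. Remove duplicates based on a given column
    df = df.dropDuplicates(subset=["State Name"])

    # 3/4. States sorted by job card holders, descending and ascending
    df.orderBy(F.col("Job card Holders").desc()).show()
    df.orderBy(F.col("Job card Holders").asc()).show()

    # 5/6. The same sorts on the SC job card column (column name is an assumption)
    df.orderBy(F.col("SC Job card Holders").desc()).show()
    df.orderBy(F.col("SC Job card Holders").asc()).show()

    # 7. Sort alphabetically by state name
    df.orderBy("State Name").show()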

Create a horizontal bar plot of job card holders per state (df_pandas is the aggregated result converted to pandas; see the sketch below):

    import matplotlib.pyplot as plt

    df_pandas.plot(kind='barh', x='State Name', y='Job card Holders', colormap='winter_r')
    plt.show()
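
df_pandas can be produced by aggregating in Spark first, so that only the small per-state summary is brought to the driver; a sketch under the same column-name assumptions:

    # Aggregate in Spark, then convert only the summary to pandas for plotting
    df_pandas = (df.groupBy("State Name")
                   .agg(F.sum("Job card Holders").alias("Job card Holders"))
                   .toPandas())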

After processing is complete, the data can be stored back into HDFS, where it is kept securely and can be accessed from anywhere on the cluster. When retrieving the data, the filters above can be applied again.
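
A sketch of writing the processed data back to HDFS and re-applying a filter on read; the output path is an assumption, and Parquet is one reasonable format choice:

    # Store the processed data back into HDFS
    df.write.mode("overwrite").parquet("hdfs:///data/processed/")

    # Later: read it back and apply the same filters again
    df2 = spark.read.parquet("hdfs:///data/processed/")
    df2.filter(F.col("Job card Holders") > 0).show()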

Predictive analytics

We have a large amount of historical data, and based on it we can predict what will happen next. For example, if 20,000 job cards were issued this year, what will the figure be next year? Here we performed predictive analytics using fulldata.csv.

prediction analysis.ipynb - this notebook contains the predictive-analytics code, and the output is plotted as a graph to visualize the prediction. Click Here! to see the predictive analytics report and graph.
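
The notebook itself is not reproduced here; a minimal sketch of the idea is a linear trend fit on yearly totals, assuming fulldata.csv has 'Year' and 'Job card Holders' columns (illustrative names):

    import numpy as np
    import pandas as pd

    data = pd.read_csv("fulldata.csv")
    yearly = data.groupby("Year")["Job card Holders"].sum()

    # Fit a straight line through the yearly totals and extrapolate one year ahead
    slope, intercept = np.polyfit(yearly.index, yearly.values, deg=1)
    next_year = yearly.index.max() + 1
    print(f"Predicted job cards for {next_year}: {slope * next_year + intercept:.0f}")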
