ADMAS UNIVERSITY
DEPARTMENT OF COMPUTER SCIENCE
SELECTED TOPICS IN COMPUTER SCIENCE
GROUP ASSIGNMENT
Group Members ID
1. Kssahun Alemneh .......................................................................................................... 1196\21
2. Habtamu Teketel ........................................................................................................... 0141\19
3. Demisew Bereded .......................................................................................................... 1152\21
Presentation date .................... Nov 20, 2024
Submission date ....................
Submitted to .................... Mr.
Introduction to Data Science
Overview of Data Science
Definition of data and information
Data types and representation
Data Value Chain
Data Acquisition
Data Analysis
Data Curation
Data Storage
Data Usage
Basic concepts of Big data
Characteristics of big data
Clustered computing
Benefits of clustered computing
Hadoop and its ecosystem
Introduction to Data Science
What is data science?
Data science is a multidisciplinary field that uses scientific methods, processes,
algorithms, and systems to examine large amounts of data and
extract insights from structured, semi-structured, and unstructured data.
Put simply, data science is analyzing data.
Dealing with huge amounts of data to find patterns, such as marketing patterns, is what data science is about.
Data science combines: statistics, data expertise, data engineering,
visualization, and advanced computing.
It is used to analyze large amounts of data and spot trends through formats like data
visualizations and predictive models.
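As a small added illustration (not part of the original slides), the sketch below shows the kind of analysis described above, assuming a hypothetical CSV file sales.csv with month and revenue columns:

```python
# A minimal sketch of data analysis with pandas (hypothetical sales.csv file).
import pandas as pd

# Load structured data into a table-like DataFrame.
df = pd.read_csv("sales.csv")          # assumed columns: month, revenue

# Summarize the data and spot a simple trend.
print(df.describe())                                    # basic statistics of numeric columns
monthly = df.groupby("month")["revenue"].mean()
print(monthly.sort_values(ascending=False).head())      # best-performing months
```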
Applications of data science
1. Healthcare
2. Transportation
3. Sports
4. Government
5. E-commerce
6. Social media
7. Logistics
8. Gaming
What are data and information?
Data can be defined as a representation of facts,
concepts, or instructions in a formalized manner
It can be described as unprocessed facts and figures.
It is represented with the help of characters such as
Alphabets (A-Z, a-z)
Digits (0-9), or
Special characters (+, -, /, *, <, >, =, etc.)
Measures of data in a file
1. Bit = 1/8 byte
2. Nibble = 4 bits = 1/2 byte
3. Byte = 8 bits
4. Kilobyte = 1024 bytes
5. Megabyte = 1024 KB
6. Gigabyte = 1024 MB
7. Terabyte = 1024 GB
8. Petabyte = 1024 TB
9. Exabyte = 1024 PB
10. Zettabyte = 1024 EB
11. Yottabyte = 1024 ZB
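To make these units concrete, here is a small Python sketch (added for illustration) that converts a raw byte count into the larger units listed above:

```python
# Convert a byte count into larger storage units (each unit = 1024 of the previous one).
UNITS = ["bytes", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def human_readable(num_bytes: float) -> str:
    for unit in UNITS:
        if num_bytes < 1024 or unit == UNITS[-1]:
            return f"{num_bytes:.2f} {unit}"
        num_bytes /= 1024

print(human_readable(5_000_000_000))   # about 4.66 GB
```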
What are data and information?
Information is the processed data on which decisions
and actions are based.
It is data that has been processed into a form that is
meaningful to the recipient and is of real or perceived
value in the current or prospective actions or decisions
of the recipient.
Information is interpreted data; it is created from
organized, structured, and processed data in a
particular context.
Data Processing Cycle
Data processing is the restructuring or reordering of
data by people or machines to increase its usefulness
and add value for a particular purpose.
Data processing consists of three basic steps: input, processing,
and output.
These three steps constitute the data processing cycle.
Input: data is prepared in some convenient form for
processing.
The form will depend on the processing machine.
Processing: the input data is changed to produce data
in more useful forms.
Output: the result of processing is delivered to the user.
[Diagram: Input → Processing → Output]
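As a tiny added illustration (not in the original slides), the cycle in miniature in Python:

```python
# Data processing cycle in miniature: input -> processing -> output.
raw_scores = ["70", "85", "92"]                 # input: unprocessed facts (strings)
numbers = [int(s) for s in raw_scores]          # processing: convert to a useful form
average = sum(numbers) / len(numbers)
print(f"Average score: {average:.1f}")          # output: information for the user
```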
Data types and their representation
Data types can be described from diverse perspectives.
In computer science and computer programming, for instance, a
data type is simply an attribute of data that tells the compiler or
interpreter how the programmer intends to use the data.
Data types from Computer programming perspective
Common data types include
Integers (int): used to store whole numbers, mathematically known
as integers
Booleans (bool): used to represent values restricted to one of two options:
true or false
Characters (char): used to store a single character
Floating-point numbers (float): used to store real numbers
Alphanumeric strings (string): used to store a combination of
characters and numbers
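For example, the common data types above look like this in Python (illustrative values; Python has no separate char type, so a one-character string stands in for it):

```python
count = 42                  # integer (int): whole number
is_valid = True             # Boolean (bool): true or false
grade = "A"                 # character: a single-character string in Python
price = 19.99               # floating-point number (float): real number
label = "Item42"            # alphanumeric string (str): characters and digits
print(type(count), type(is_valid), type(grade), type(price), type(label))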
Data types from Data Analytics perspective
A data type constrains the values that an expression, such as a variable or a function,
might take.
The data type defines the operations that can be done on the data, the meaning of
the data, and the way values of that type can be stored.
Structured Data
Structured data is data that adheres to a pre-defined data
model and is therefore straightforward to analyze.
Structured data conforms to a tabular format with a
relationship between the different rows and columns.
Common examples of structured data are Excel files or
SQL databases.
Each of these has structured rows and columns that can be
sorted.
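As a small illustration (hypothetical values), structured data can be represented as a sortable table of rows and columns, much like an Excel sheet or an SQL table:

```python
# Structured data: a pre-defined tabular model with rows and columns.
import pandas as pd

customers = pd.DataFrame(
    {"id": [1, 2, 3],
     "name": ["Abebe", "Sara", "Liya"],
     "balance": [250.0, 1200.5, 80.25]}
)
print(customers.sort_values("balance", ascending=False))  # rows can be sorted
```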
Data types from Data Analytics perspective
Semi-structured data
Semi-structured data is a form of structured data that does not
conform to the formal structure of data models associated with
relational databases or other forms of data tables, but nonetheless
contains tags or other markers to separate semantic elements and
enforce hierarchies of records and fields within the data.
Therefore, it is also known as a self-describing structure.
JSON and XML are common examples of semi-structured data.
JSON (JavaScript Object Notation) is a data interchange format that
uses human-readable text to store and transmit data.
XML (Extensible Markup Language) provides rules to define
any data.
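The sketch below (illustrative only, using Python's standard library and a made-up record) parses the same small record expressed as JSON and as XML; note how keys and tags describe the data themselves:

```python
import json
import xml.etree.ElementTree as ET

# The same record as self-describing JSON and XML.
json_text = '{"name": "Abebe", "age": 30, "city": "Addis Ababa"}'
xml_text = "<person><name>Abebe</name><age>30</age><city>Addis Ababa</city></person>"

record = json.loads(json_text)          # JSON -> Python dict
print(record["name"], record["age"])

root = ET.fromstring(xml_text)          # XML -> element tree
print(root.find("name").text, root.find("city").text)
```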
Data types from Data Analytics perspective
Unstructured Data
Unstructured data is information that either does not have a predefined
data model or is not organized in a pre-defined manner.
It cannot be displayed in rows and columns or stored in a relational database.
It requires more storage and is more difficult to manage and protect.
Unstructured information is typically text-heavy but may contain data such
as dates, numbers, and facts as well.
This results in irregularities and ambiguities that make it difficult to
understand using traditional programs as compared to data stored in
structured databases.
Common examples of unstructured data include audio or video files, or NoSQL databases.
Metadata – Data about Data
The last category of data type is metadata. From a
technical point of view, this is not a separate data
structure, but it is one of the most important elements for
Big Data analysis and big data solutions.
Metadata is data about data. It provides additional
information about a specific set of data.
In a set of photographs, for example, metadata could
describe when and where the photos were taken.
The metadata then provides fields for dates and
locations which, by themselves, can be considered
structured data. For this reason, metadata is
frequently used by Big Data solutions for initial analysis.
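As a simple illustration (hypothetical values), the metadata of a photo can itself be treated as structured data, which is why it lends itself to initial analysis:

```python
# Metadata: data about data. The photo bytes are unstructured,
# but these descriptive fields are structured and easy to query.
photo_metadata = {
    "file": "IMG_0042.jpg",
    "date_taken": "2024-11-20",
    "location": "Addis Ababa",
    "camera": "Phone camera",
    "size_bytes": 2_480_133,
}
print(photo_metadata["date_taken"], photo_metadata["location"])
```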
Data Value Chain
The Data Value Chain is introduced to describe the information flow
within a big data system.
It is a series of steps needed to generate value and useful insights from
data.
The Big Data Value Chain identifies the following key high-level
activities:
Data Acquisition: the process of collecting, filtering, and
cleaning data before it is put in a data warehouse.
Data acquisition is one of the major big data challenges in terms of
infrastructure requirements.
The infrastructure required to support the acquisition of big data
must deliver low, predictable latency in both capturing data and in
executing queries.
It must work in a distributed environment and support flexible and
dynamic data structures.
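A minimal sketch (assuming a hypothetical list of raw records) of the collect, filter, and clean step before data is loaded into a warehouse:

```python
# Data acquisition: collect, filter, and clean records before loading them.
raw_records = [
    {"id": 1, "amount": "100"},
    {"id": 2, "amount": ""},        # missing value -> filtered out
    {"id": 1, "amount": "100"},     # duplicate -> removed
]

cleaned, seen = [], set()
for rec in raw_records:
    if rec["amount"] and rec["id"] not in seen:    # filter incomplete/duplicate rows
        seen.add(rec["id"])
        cleaned.append({"id": rec["id"], "amount": float(rec["amount"])})  # clean types

print(cleaned)   # records ready to be put in the data warehouse
```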
Data Value Chain
Data Analysis: making the raw data acquired amenable to use in decision-
making.
Data analysis involves exploring, transforming, and modeling data with
the goal of highlighting relevant data, synthesizing and extracting useful hidden
information with high potential from a business point of view.
Related areas include data mining, business intelligence, and machine learning.
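A hedged sketch of the explore-transform-model idea, assuming a small hypothetical monthly sales series and using NumPy's polyfit for a simple straight-line model:

```python
import numpy as np

# Hypothetical monthly sales figures (exploring and modeling a trend).
months = np.array([1, 2, 3, 4, 5, 6])
sales = np.array([100, 120, 135, 160, 170, 195])

slope, intercept = np.polyfit(months, sales, 1)    # fit a straight-line model
print(f"Sales grow by about {slope:.1f} units per month")
print(f"Predicted sales for month 7: {slope * 7 + intercept:.0f}")
```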
Data Curation: the active management of data over its life cycle.
Data curation processes can be categorized into different activities such as
content creation, selection, classification, transformation, validation, and
preservation.
Data curation is performed by expert curators who are responsible for
improving the accessibility and quality of data.
Data curators (also known as scientific curators or data annotators) hold the
responsibility of ensuring that data are trustworthy, discoverable, accessible,
reusable, and fit for purpose.
A key trend for the curation of big data utilizes community and crowdsourcing
approaches.
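As an illustration of the validation and annotation activities mentioned above (hypothetical records and rules, not a standard tool):

```python
# Data curation: validate records and annotate them for later reuse.
records = [
    {"id": 1, "species": "wheat", "yield_kg": 420},
    {"id": 2, "species": "", "yield_kg": 310},       # fails validation
]

REQUIRED = ("id", "species", "yield_kg")

def curate(rec):
    valid = all(rec.get(field) not in (None, "") for field in REQUIRED)  # validation
    rec["quality"] = "trusted" if valid else "needs review"              # annotation
    return rec

print([curate(r) for r in records])
```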
Data Value Chain
Data Storage
Data storage is the persistence and management of data in a scalable way that
satisfies the needs of applications that require fast access to the data.
Relational Database Management Systems (RDBMS) have been the
main, and almost only, solution to the storage paradigm.
RDBMS provide the ACID (Atomicity, Consistency, Isolation, and Durability)
properties that guarantee database transactions.
NoSQL technologies have been designed with the scalability goal in
mind and present a wide range of solutions based on alternative data
models.
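A minimal sketch of relational storage with transactional guarantees, using Python's built-in sqlite3 module purely for illustration (table and values are made up):

```python
import sqlite3

# A tiny relational store: the transaction either commits fully or not at all (ACID).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")

with conn:                                    # this block is one transaction
    conn.execute("INSERT INTO orders (amount) VALUES (?)", (99.5,))
    conn.execute("INSERT INTO orders (amount) VALUES (?)", (150.0,))

print(conn.execute("SELECT COUNT(*), SUM(amount) FROM orders").fetchone())
```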
Data Usage
It covers the data-driven business activities that need access to data,
its analysis, and the tools needed to integrate the data analysis within
the business activity.
Data usage in business decision-making can enhance competitiveness
Basic concepts of big data
Big data refers to the non-traditional strategies and technologies needed
to gather, organize, process, and gain insights from large data
sets.
It involves large, complex, and diverse data sets.
Big data is a collection of data sets so large and complex that it
becomes difficult to process using traditional data processing
applications.
A "large dataset" here means a data set too large to reasonably
process or store with traditional tooling or on a single computer.
Big data is characterized by the 3 Vs and more:
Volume: large amounts of data / massive datasets are generated
Velocity: data is live-streamed or in motion
Variety: data comes in many different forms from diverse sources
Veracity: the quality, accuracy, or trustworthiness of the data
Clustered Computing
Because of the qualities of big data, individual computers are
often inadequate for handling the data at most stages.
To better address the high storage and computational needs
of big data, computer clusters are a better fit.
A cluster is a collection of loosely connected computers that work
together and act as a single entity.
Big data clustering software combines the resources of many
smaller machines, seeking to provide a number of benefits:
Cluster computing benefits
Resource Pooling
High Availability
Easy Scalability
Clustered Computing
Cluster computing benefits
Resource Pooling
Combining the available storage space to hold data is a clear benefit, but CPU and
memory pooling are also extremely important.
High Availability
Clusters can provide varying levels of fault tolerance and availability guarantees to
prevent hardware or software failures from affecting access to data and processing.
Easy Scalability
Clusters make it easy to scale horizontally by adding additional machines to the
group.
Using clusters requires a solution for managing cluster membership, coordinating resource
sharing, and scheduling actual work on individual nodes.
Cluster membership and resource allocation can be handled by software like
Hadoop’s YARN (Yet Another Resource Negotiator).
Hadoop and its Ecosystem
Hadoop is an open-source Apache software framework intended to make
interaction with big data easier.
A framework that allows for the distributed processing of large datasets
across clusters of computers using simple programming models.
It is inspired by a technical document published by Google.
Key characteristics of Hadoop
Economical: Its systems are highly economical as ordinary computers
can be used for data processing.
Reliable: It is reliable as it stores copies of the data on different machines
and is resistant to hardware failure
Scalable: It is easily scalable, both horizontally and vertically. A few extra
nodes help in scaling up the framework.
Flexible: It is flexible, and you can store as much structured and
unstructured data as you need and decide how to use it later.
Hadoop and its Ecosystem
The Hadoop ecosystem consists of a variety of components that work together
to provide a comprehensive framework for processing and managing large
datasets.
HDFS (Hadoop Distributed File System): designed to store large files
across multiple machines in a distributed manner.
YARN (Yet Another Resource Negotiator): the resource management layer
of Hadoop.
MapReduce: the programming model used for processing large
datasets in parallel across a Hadoop cluster.
MapReduce consists of two main tasks (see the word-count sketch after this list):
the Map task, which processes input data and generates intermediate key-
value pairs, and
the Reduce task, which aggregates these intermediate results.
Spark: a fast, in-memory data processing engine that can run on top of
Hadoop.
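To make the Map and Reduce tasks concrete, here is a plain-Python word-count sketch (illustrative only; it imitates the model on a single machine, whereas real MapReduce runs these tasks in parallel across the cluster):

```python
from collections import defaultdict

documents = ["big data is big", "data is everywhere"]   # hypothetical input splits

# Map task: emit intermediate (word, 1) key-value pairs.
intermediate = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group values by key, then Reduce task: aggregate the counts.
grouped = defaultdict(list)
for word, count in intermediate:
    grouped[word].append(count)
word_counts = {word: sum(counts) for word, counts in grouped.items()}

print(word_counts)   # {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```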
Big Data Life Cycle with Hadoop
Ingesting data into the system
The first stage of Big Data processing is Ingest.
The data is ingested or transferred to Hadoop from various sources
such as relational databases, systems, or local files.
Processing the data in storage
The second stage is Processing. In this stage, the data is stored and
processed.
The data is stored in the distributed file system.
Computing and analyzing data
The third stage is to Analyze. Here, the data is analyzed by
processing frameworks such as Pig, Hive, and Impala.
Visualizing the results
The fourth stage is Access, which is performed by tools such as Hue
and Cloudera Search.
In this stage, the analyzed data can be accessed by users.
Summary questions
1. What do you mean by data science?
2. What is the difference between structured and unstructured data?
3. What is the difference between BI (Business Intelligence) and data
science?
4. What are the benefits of cluster computing?
5. ____ is the process of gathering, filtering, and cleaning data before it is put
in a data warehouse.
6. _____ provides additional information about a specific set of data.
7. List the characteristics of big data:
a. __________
b. __________
c. __________
d. __________
End of chapter
THANK YOU