
INTERVIEW QUESTIONS

Roja Questions

Accenture
1. Uses of lit()

lit() is a function used to add a constant (literal) value as a column, typically together with withColumn().

Eg: added Language as "English", Order Type as "Manage Account", and Channel as "Middleware" to every record.
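A minimal PySpark sketch of lit() with withColumn() (the column names and values below are illustrative, not from a specific dataset):

from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "ORD-1")], ["id", "order_id"])

# lit() wraps a constant so the same value is added to every row
df2 = (df.withColumn("Language", lit("English"))
         .withColumn("Channel", lit("Middleware")))
df2.show()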

2. Differences between Row_Num, Rank, and Dense Rank

Row_Number() - Assigns a sequential number to every row in the order we specify; no two rows share the same number.

Rank() and Dense_Rank() are used wherever a ranking is required.

Rank() - Rows with the same value receive the same rank, and once the value changes the next rank skips ahead based on the row position, leaving gaps. Eg - 1,2,2,4,5

Dense_Rank() - Does not skip values; it follows the sequential numbering with no gaps. Eg - 1,2,2,3,4
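A small PySpark sketch showing all three window functions side by side (toy data, assumed schema):

from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import row_number, rank, dense_rank, col

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("a", 100), ("b", 200), ("c", 200), ("d", 300), ("e", 400)],
    ["name", "score"])

w = Window.orderBy(col("score"))
df.select("name", "score",
          row_number().over(w).alias("row_number"),  # 1,2,3,4,5
          rank().over(w).alias("rank"),              # 1,2,2,4,5
          dense_rank().over(w).alias("dense_rank")   # 1,2,2,3,4
          ).show()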

3. Difference between RDD and Dataframes.

RDD - A resilient distributed dataset: a collection of objects partitioned across the nodes of the cluster, which allows the data to be processed in parallel.

Dataframes - Dataframes can read and write data in various formats like CSV, JSON, AVRO, HDFS, and HIVE tables. They are already optimized to process large datasets for most pre-processing tasks, so we do not need to write complex functions on our own.

4. Snappy compression

Snappy is a compression/decompression library. It does not aim for maximum compression or compatibility with any other compression library; instead, it aims for very high speeds and reasonable compression.

5. How to drop duplicate values from a column without using dropDuplicates()?
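One possible approach, as a sketch: keep only the first row per value with a window function instead of dropDuplicates(). It assumes a DataFrame df with an id column a and the column to de-duplicate named b.

from pyspark.sql import Window
from pyspark.sql.functions import row_number, col

# keep one row per value of column "b" without calling dropDuplicates()
w = Window.partitionBy("b").orderBy(col("a"))
deduped = (df.withColumn("rn", row_number().over(w))
             .filter(col("rn") == 1)
             .drop("rn"))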
6. Usage of withColumn

withColumn() – Returns a new DataFrame by adding a column or replacing the existing column
that has the same name.

7. Drop duplicate values from columns from the Fruits Table.

A    B      C
1    apple  200
2    apple  300
3    bat    400

SQL> delete from Fruits where A not in (select min(A) from Fruits group by B);

—-------------------------------------------------------------------------------------------------------------------------

Tiger Analytics

1. Given table emp(id, name, salary), write a select query to pull all cols and a
flag column to identify whether the emp salary is less than/greater than/equal to
the average salary of all employees.

i) Select avg(sal) from emp; → 2073

ii) select e.*,
        CASE
          WHEN sal > 2073 THEN 'Salary is greater than avg salary'
          WHEN sal = 2073 THEN 'Salary is equal to avg salary'
          ELSE 'Salary is less than avg salary'
        END AS SalFlag
    from emp e;
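The same flag can also be computed in a single PySpark pass; a sketch assuming an emp DataFrame with empno, ename, and sal columns:

from pyspark.sql.functions import avg, col, when

avg_df = emp.agg(avg("sal").alias("avg_sal"))   # one-row DataFrame with the average
flagged = (emp.crossJoin(avg_df)
              .withColumn("SalFlag",
                          when(col("sal") > col("avg_sal"), "Salary is greater than avg salary")
                          .when(col("sal") == col("avg_sal"), "Salary is equal to avg salary")
                          .otherwise("Salary is less than avg salary"))
              .drop("avg_sal"))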
2. Consider table: geodata(id, country, postcode, region, city, state) --- id is
unique

1, IN, 400078, M1, Mumbai, MH

2, IN, 400078, M1, Mumbai, MH

3, IN, 411033, P2, Pune, MH

4, IN, 411044, P1, Pune, MH

5, IN, 411044, P1, Pune, MH

6, IN, 400078, M2, Mumbai, MH

Write a delete statement to remove duplicate records from a table.

3. Identify the Slowest executing statement.

dfA: DataFrame = spark.read.format("parquet")…options…load          #1  Schema: (a, b, c)

dfB: DataFrame = spark.read.format("parquet")…options…load          #2  Schema: (p, q, r, s, t, u, v)

df1 = dfA.filter("a=1")                                             #3

df2 = dfB.select("p", "q", "r", "s", "t")                           #4

df3 = df2.join(df1, df1("a") === df2("p"), "full")                  #5

df3.saveAsTable("default.myresult")                                 #6

df4 = df2.join(df1, df1("b") === df2("q"), "inner")                 #7 -- Slowest executing.

—-------------------------------------------------------------------------------------------------------------------------

Experis Questions

1. What technologies are used for data ingestion and data transformation? - PySpark

2. Are the technologies used on the cloud or on-premise? - Cloud

3. If it is on-premise, which Hadoop distribution is used?

4. You worked on PySpark, so what are the source data and source system for your transformations? – JSON and an AWS S3 bucket.

5. From where do you read the data? – We read the data from an S3 folder; the upstream team puts the data into the S3 bucket every day under a date-named prefix.

6. What kinds of transformations are used in Spark? - Filter, groupBy, and map are examples of transformations.

7. Who is your downstream team? – The order team.

8. What wide transformations have you used? - Functions such as groupByKey(), aggregateByKey(), aggregate(), join(), and repartition() are examples of wide transformations.

9. Basic schema of emp and dept tables. Find the highest salary for each department along with the emp details.

select e.* from emp e where e.sal = (select max(sal) from emp where deptno = e.deptno);
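A PySpark version of the same idea, sketched with a window over deptno (assumes an emp DataFrame with deptno and sal columns):

from pyspark.sql import Window
from pyspark.sql.functions import max as max_, col

w = Window.partitionBy("deptno")
top_per_dept = (emp.withColumn("max_sal", max_("sal").over(w))
                   .filter(col("sal") == col("max_sal"))
                   .drop("max_sal"))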

10. How do you write your outputs? What strategy do you follow?

Actual data pipelines don’t display the result. How do you save the output? What is your
strategy? What are the different things that you used in the pipeline?

11. Have you done Partitions?

12. To save the data have you used partitions?

13. Can you explain coalesce and repartition? When should we use repartition?

Coalesce – It is used only to decrease the number of partitions; it avoids a full shuffle, which makes it the efficient option.

Re-partition – It is used to increase or decrease the number of partitions of an RDD, DataFrame, or Dataset; it performs a full shuffle of the data. Use repartition when the partition count must grow or when the data needs to be evenly redistributed.
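A minimal sketch, assuming an existing DataFrame df:

# repartition(n) can increase or decrease the partition count; it does a full shuffle
df_more = df.repartition(200)

# coalesce(n) only decreases the partition count and avoids a full shuffle,
# which makes it the cheaper choice when writing fewer output files
df_fewer = df.coalesce(10)

print(df_more.rdd.getNumPartitions(), df_fewer.rdd.getNumPartitions())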

14. Cloud experience.

15. Have you deployed anything in production?

16. How did you deploy your data pipeline?

17. What kind of validations?

18. How do you access Spark? How do you log in to Databricks?

19. Have you used spark-shell?


Persistent System

1. Difference between the Array and List DataType?

A list collects items of multiple data types, while an array collects items of the same data type. List: [1, 'apple', 2.5], Array: [1, 2, 3]

A list cannot manage arithmetic operations directly, while an array can.

S.No. List vs Array

1. A list collects elements that usually belong to multiple data types; an array collects several items of the same data type.

2. A list cannot manage arithmetic operations; an array can manage arithmetic operations.

3. A list consists of elements that belong to different data types; an array consists of elements of the same data type.

4. In terms of flexibility, a list is preferable as it allows easy modification of data; an array does not allow easy modification of data.

5. A list consumes more memory; an array consumes less memory than a list.

6. A complete list can be accessed without any specific looping; in an array, a loop is typically needed to access the components.

7. A list favors a shorter sequence of data; an array favors a longer sequence of data.

2. Which one is faster when compared to List and Array?

An array is faster than a list in Python since all the elements stored in an array are homogeneous, i.e., they have the same data type, whereas a list can contain heterogeneous elements. Moreover, Python arrays (the array module) store their elements in a compact C-level buffer, which makes them more efficient for large amounts of homogeneous numeric data than general-purpose lists.

3. What is the Application of Decorators in Python?


A decorator is a design pattern in Python that allows a user to add new functionality to an
existing object without modifying its structure. Decorators are usually called before the definition
of a function you want to decorate.
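A small, self-contained example of a decorator that adds timing around an existing function without changing its body:

import functools
import time

def timed(func):
    """Log how long the wrapped function takes to run."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.time()
        result = func(*args, **kwargs)
        print(f"{func.__name__} took {time.time() - start:.3f}s")
        return result
    return wrapper

@timed
def load_data():
    time.sleep(0.1)
    return "done"

load_data()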

4. You will Write some code by importing Modules and Packages. Difference between
both?

● A Python module is a single Python file (a .py file), i.e., a combination of functions, classes, and global variables.

● A Python package is a collection of different Python modules with an __init__.py file.

● The __init__.py file works as a constructor for the Python package: it runs when the package is imported.

5. Difference between Iterators and Enumerators.

An Iterator can modify the underlying Collection during traversal (e.g., using the remove() method, which removes the current element from the Collection).

The Iterator has a remove() method.

The Enumeration interface acts as a read-only interface: one cannot make any modifications to the Collection while traversing its elements.

Enumeration does not have a remove() method.

6. Difference between Methods and Functions in Python.

A function is a standalone block of reusable code defined with def and called by its name. A method is a function defined inside a class and bound to an object: it is called on an instance (obj.method()) and implicitly receives that instance as its first argument (self).

Pyspark:

7. Let's say two dataframes and both the dataframes have the same schema. What is the
way to combine both dataframes into a single dataframe.

Using union() (or unionAll(), which in newer Spark versions is just an alias for union()).

Example - df.union(df2) or df.unionAll(df2). Both combine rows by column position, so the two schemas must match.

8. What is the way to add a new column M with a constant value of True?

df.withColumn("M", lit(True))
9. What is the difference between Datawarehouse and DataLake?

A data lake contains all an organization's data in a raw, unstructured form, and
can store the data indefinitely — for immediate or future use.

Storage costs are fairly inexpensive in a data lake vs data warehouse. Data lakes are
also less time-consuming to manage, which reduces operational costs.

A data warehouse contains structured data that has been cleaned and processed,
ready for strategic analysis based on predefined business needs.

Data warehouses cost more than data lakes, and also require more time to manage,
resulting in additional operational costs.

10. What are the actions performed in Pyspark?

count(), show(), collect(), take(), and first() are common actions (display() in Databricks). Note that glom() itself is a transformation; it is usually followed by collect().

11. Difference between ReduceByKey and GroupByKey

Both reduceByKey and groupByKey are wide transformations, which means both trigger a shuffle operation.

The key difference between reduceByKey and groupByKey is that reduceByKey does a map-side combine and groupByKey does not.

Map-side combine means the values for each key are partially aggregated within each partition (in the map phase) before the data is shuffled, so less data travels across the network.
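A small RDD sketch of the difference (assumes a SparkSession named spark already exists):

rdd = spark.sparkContext.parallelize([("a", 1), ("b", 2), ("a", 3)])

# reduceByKey partially combines values per key inside each partition before the shuffle
sums_reduce = rdd.reduceByKey(lambda x, y: x + y)     # -> ("a", 4), ("b", 2)

# groupByKey ships every value across the shuffle first, then we aggregate
sums_group = rdd.groupByKey().mapValues(sum)          # same result, more data shuffled

print(sums_reduce.collect(), sums_group.collect())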

12. What is the Broadcast Join?
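In a broadcast join, the smaller DataFrame is copied to every executor so the larger side does not have to be shuffled. A minimal sketch (df_large and df_small are illustrative names):

from pyspark.sql.functions import broadcast

# hint Spark to broadcast the small dimension table to all executors
joined = df_large.join(broadcast(df_small), on="id", how="inner")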

13. What is the difference between Persist and Cache functionality?

14. DF:

Product    Category  Revenue
Product A  Cat 1     30000
Product B  Cat 1     40000
Product C  Cat 1     10000
Product D  Cat 1     90000
Product E  Cat 2     20000
Product B  Cat 2     50000
Product D  Cat 2     20000
Product F  Cat 3     30000

Find the top 3 products by revenue in each category.
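One way to answer it, sketched in PySpark (assumes the table above is loaded as a DataFrame df with Product, Category, and Revenue columns):

from pyspark.sql import Window
from pyspark.sql.functions import dense_rank, col

w = Window.partitionBy("Category").orderBy(col("Revenue").desc())
top3 = (df.withColumn("rnk", dense_rank().over(w))
          .filter(col("rnk") <= 3)
          .drop("rnk"))
top3.show()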

15. Input =[1,2,3,4,5,6,7,8,9]

N=2(user input)

Output = [[1,2],[3,4],[5,6],[7,8],[9]]
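A plain-Python sketch of the chunking logic with the user-supplied N:

def chunk(values, n):
    """Split a list into consecutive sublists of length n (the last one may be shorter)."""
    return [values[i:i + n] for i in range(0, len(values), n)]

print(chunk([1, 2, 3, 4, 5, 6, 7, 8, 9], 2))
# [[1, 2], [3, 4], [5, 6], [7, 8], [9]]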

5th Company Interview Questions:

1. How will you trigger the notebook?
2. What Databricks version have you used?
3. What is the current runtime version?
4. What file sizes have you worked on?
5. How many workers are needed for a 10 GB file?
6. Spark architecture.
7. Difference between Parquet and JSON.
8. Data queueing.
9. How do you execute the job script?
10. Delta format.
11. How will you run the commands?
12. What is the default join in PySpark?
13. When a scheduled job fails after completing 50%, what will you do next?
14. How do you handle errors and eliminate the junk data?
15. What junk data have you handled in your project?

—-------------------------------------------------------------------------------------------------------------------------
Ayush Questions
CAREERNET
1. What all responsibilities or tools are you currently working on?

2. What databases have you worked on? Oracle Database

3. What are operators and their types?

Operators are used to perform operations on variables and values.

Python divides the operators into the following groups:

Arithmetic operators, Assignment operators, Comparison operators, Logical operators, Identity operators, Membership operators, and Bitwise operators.

4. Set Operators.

A Set is an unordered collection data type that is iterable, mutable, and has no duplicate
elements.

5. What are the scenarios in which we use Union and Intersection?

We use union when we want all the elements from both sets without duplicates.

We use intersection when we want only the elements that are common to both sets.

6. Syntax for performing a union of table A and table B

print("A U B:", A.union(B))

7. In the emp table, in dept 1 there are multiple employees with different salaries:

a) how to get the details of only the highest salary from the table, and

b) also, if the highest salary is shared by two employees, we should get the details of the most recently joined employee, not the older one.
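One possible PySpark approach, as a sketch (assumes an emp DataFrame with dept, salary, and hiredate columns; the column names are illustrative):

from pyspark.sql import Window
from pyspark.sql.functions import row_number, col

# a) highest salary per department; b) ties broken by the most recent hire date
w = Window.partitionBy("dept").orderBy(col("salary").desc(), col("hiredate").desc())
result = (emp.withColumn("rn", row_number().over(w))
             .filter(col("rn") == 1)
             .drop("rn"))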
8. Difference between List and Tuples.

The list is mutable and tuples are immutable. Iterating over a tuple is faster than iterating over a list.

9. What are the scenarios in which we use lists vs tuples?

A list is used when we expect the size or values of the collection to change. A tuple is used when the size and values should stay fixed, since a tuple cannot be changed.

10. What is the datatype of the elements of a list and a tuple?

Both tuples and lists can hold elements of any datatype – String, Int, Boolean, and so on.

11. What is the format of data you are getting from customers?

JSON

—-------------------------------------------------------------------------------------------------------------------------

T-systems
1. How to decide which query and where to run for cost optimization.

2. Hadoop, AWS Glue, and AWS Athena.

3. I have a 1 GB CSV file, and there are two project requirements:

a) one project has a storage constraint, so we need to save storage, and

b) the other project needs to read the data frequently using Spark,

so for each case, which file format should we use out of these file formats - Parquet, Avro, JSON, ORC?

a) Parquet, because it is a column-oriented data file format designed for efficient data storage, used mainly for its compression feature.

b) ORC - it stores data in columns, which enables users to read and decompress just the pieces they need.

4. Why choose different file formats?

5. Consider that in Python we have a tuple (1, 2, 3) and I just want to update the second element; can we do it, and if yes, how?

We can't change the values of a tuple in place because a tuple is immutable.
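The usual workaround is to build a new tuple rather than modifying the old one; a minimal sketch:

t = (1, 2, 3)

# "updating" the second element means constructing a new tuple
tmp = list(t)
tmp[1] = 20
t = tuple(tmp)              # (1, 20, 3)

# or, equivalently, with slicing
t = t[:1] + (99,) + t[2:]   # (1, 99, 3)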

6. Datatypes and data structures of python.


7. Out of 5 which are immutable and which are mutable - list, tuple, set, dictionary,
frozenset.

Mutable - List, Set, Dictionary

Immutable - Tuple, Frozenset

8. I have table1 containing a city name and a students column, where the students value is comma separated. Consider that the database is Postgres and it doesn't have any split functionality; write a SQL query to get the city name and the student count by splitting that students column.

CITY NAME   STUDENTS
A           1,1,1
B           1,2
C           2,3,4

—-------------------------------------------------------------------------------------------------------------------------

ADIDAS
1) What is your Tech stack in this project?
Databricks, AWS

2) What do you use for job scheduling?

There is a Jobs option in Databricks, using which we can schedule our jobs and attach the required notebook to run at a particular time.

3) What environments do you have?


Dev, QA, UAT, Prod

4) In production, how do you run the jobs or pipeline?

5) What is the nature of the data which you receive?


We receive data as a JSON file.

6) Where can you find the Jobs icon in the Databricks portal? Is it present by default in every Databricks environment, or do we need to install a package for it, and why do we need it?

The Jobs option is present by default in every Databricks workspace; no extra package is needed. It is used to schedule jobs and attach the required notebook to run at a particular time.

7) What are the basic key components of spark architecture?

8) Difference between a driver and an executor?


In Apache Spark, a driver is the main program that creates the SparkContext and runs the main
function. It acts as a coordinator, managing the execution of tasks on executors.

An executor is a worker process that runs on a node in a cluster. It is responsible for executing
tasks assigned to it by the driver and returning the results to the driver. Executors are the units
of parallelism in Spark, as they run multiple tasks concurrently to make full use of the available
resources in a cluster. In summary, the driver coordinates the work, while the executors do the
work.

9) What is the data storage layer in your case?


We save data in S3 Storage.

10) What is the maximum number of drivers you can have in a single spark application?
One driver

11) What is the concept of cores in a spark cluster?


The cores are used by Spark to parallelize the processing of data across multiple tasks. The
number of cores that a Spark application can use is determined by the configuration of the
cluster and the resources that are available. More cores generally mean faster processing but
also require more memory to be allocated to the Spark process.

12) What is the data format we use in s3 to store data and what is the most commonly used
data format?

13) What is the benefit of each file format?

CSV - CSV (Comma Separated Values) is the simplest file format where fields are separated by
using ‘comma’ as a delimiter. It is human-readable, compressible,
and most importantly platform-independent.

JSON - JSON is a self-describing, language-independent data format; it carries complex (nested) data types along with the data content.

AVRO - A row-based format that is widely supported within the Hadoop community and beyond, and ideal for long-term storage of essential data. It can be read and written in many languages such as Java, Scala, etc.
ORC - ORC is a self-describing, optimized file format that stores data in columns which enables
users to read and decompress just the pieces they need.

Parquet - Parquet is an open-source, column-oriented data file format designed for efficient
data storage and retrieval.

14) What is the architecture that you follow in Databricks? (Bronze, Silver, Gold)

15) How many workspaces do you have in Databricks?

DEV, QA, UAT, Prod

16) Basic difference between transformations & Actions in spark and some examples of both?

17) What is the difference between narrow and wide transformations?

A narrow transformation is one where each output partition depends on only one input partition, so no data has to move between partitions (no shuffle). Examples in PySpark are filter, map, and withColumn.

A wide transformation is one where an output partition can depend on many input partitions, so Spark has to shuffle data across the cluster. Examples are groupBy/groupByKey, join, distinct, and repartition.

In general, narrow transformations are cheap and can be pipelined together within a stage, while wide transformations introduce shuffle boundaries (new stages) and are usually the expensive steps in a job.
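A short sketch of the distinction (df is an illustrative DataFrame with Category and Revenue columns):

from pyspark.sql.functions import col

# narrow transformations: each output partition depends on one input partition, no shuffle
filtered = df.filter(col("Revenue") > 10000)
mapped = filtered.withColumn("rev_k", col("Revenue") / 1000)

# wide transformations: data is redistributed across partitions (shuffle)
totals = df.groupBy("Category").sum("Revenue")
joined = df.join(totals, on="Category")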

18) What is the concept of partitioning in Databricks?


In Apache Spark and Databricks, partitioning is the process of dividing a large dataset into
smaller, more manageable chunks called partitions. Each partition is processed independently
and in parallel by the executors in the cluster.
Partitioning is a key concept in distributed computing, as it enables Spark to scale processing to
large datasets by breaking them down into smaller, more manageable pieces. By partitioning
data, Spark can process each partition in parallel, allowing it to make full use of the available
resources in the cluster.

19) If you have data with 1 lakh mobile numbers then how many partitions you will create?
The number of partitions to create for data with 1 lakh mobile numbers depends on several
factors such as the size of the data, the available cluster resources (e.g., number of executors,
memory, CPU), and the desired level of parallelism.

By default, Spark creates one partition for each block of data in the file system, which is typically
128 MB. However, this can be adjusted according to the specific needs of the application.

As a general guideline, it is recommended to have 2-4 partitions per CPU core in the cluster.
This means that if you have a cluster with 8 cores, you would create 16-32 partitions. However,
the optimal number of partitions will depend on the specific use case, and it may be necessary
to experiment to determine the best value.

In the case of 1 lakh mobile numbers, you may create several partitions, each containing a
portion of the data. This allows Spark to process the data in parallel, making use of the available
resources in the cluster. The exact number of partitions to create will depend on the size of the
data and the available cluster resources.

20) What is the concept of bucketing?

In Spark/Databricks, bucketing distributes the rows of a table into a fixed number of buckets based on the hash of one or more columns (for example, df.write.bucketBy(8, "id").sortBy("id").saveAsTable("t")), which can reduce shuffle for joins and aggregations on those columns. Separately, bucketing in AWS refers to grouping objects into containers (buckets) within Amazon S3, providing a way to store, organize, and categorize data in the cloud.

21) What are delta tables in Databricks?


Delta tables are a feature in Databricks that allows you to store and manage large datasets in
an optimized and efficient manner, with fast, incremental updates and deletes, and the ability to
track and manage data changes over time through time travel.

Other Interview Questions.


1) What is lazy evaluation in spark and what are its advantages in spark?
2) Difference between Narrow and wide transformation.

RAMESH Sir
1. How do you get the data from the upstream team? Do you deal directly with clients?
2. Tell me about your day-to-day activities as a data engineer.
3. How do you access Spark? How do you log in to Databricks?
4. How do you write your outputs? What strategy do you follow? Actual data pipelines don't display the result. How do you save the output? What is your strategy? What are the different things that you used in the pipeline?
5. Difference between ReduceByKey and GroupByKey?
