INTERVIEW QUESTIONS - ALL Companies
Roja Questions
Accenture
1. Uses of lit()
lit() creates a column holding a constant literal value, so the same value is stamped onto every row.
Eg: adding Language as "English", Order Type as "Manage Account", and Channel as "Middleware".
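A minimal PySpark sketch of this usage (the column names and sample data are only illustrative):

from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "acct-1"), (2, "acct-2")], ["id", "account"])

# lit() stamps the same constant value onto every row
df = (df.withColumn("Language", lit("English"))
        .withColumn("OrderType", lit("Manage Account"))
        .withColumn("Channel", lit("Middleware")))
df.show()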
Row_Number() - Assigns a sequential number to each row within its window partition, in the order
given by the window's ORDER BY.
Rank() and Dense_Rank() are used wherever a ranking is required.
Rank() - Rows with the same value get the same rank, and the next distinct value skips ahead to
the current row number, leaving gaps. Eg - 1,2,2,4,5
Dense_Rank() - Rows with the same value get the same rank, but no numbers are skipped; the
ranks stay sequential. Eg - 1,2,2,3,4
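A small PySpark sketch contrasting the three (the sal column, names, and ordering are just illustrative):

from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import row_number, rank, dense_rank, col

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("A", 100), ("B", 200), ("C", 200), ("D", 300), ("E", 400)],
    ["name", "sal"])

w = Window.orderBy(col("sal"))

df.select("name", "sal",
          row_number().over(w).alias("row_num"),    # 1,2,3,4,5
          rank().over(w).alias("rank"),             # 1,2,2,4,5
          dense_rank().over(w).alias("dense_rank")  # 1,2,2,3,4
         ).show()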
RDD - It is a collection of objects capable of storing data partitioned across multiple nodes of the
cluster, and it allows that data to be processed in parallel.
Dataframes - Dataframes can read and write data in various formats like CSV, JSON, AVRO,
HDFS, and HIVE tables. They are already optimized to process large datasets for most
pre-processing tasks, so we do not need to write complex functions on our own.
4. Snappy compression
5. How to drop duplicate values from a column without using dropDuplicates().
6. Usage of withColumn
withColumn() – Returns a new DataFrame by adding a column or replacing the existing column
that has the same name.
A B C
1 apple 200
2 apple 300
3 bat 400
SQL> delete from Fruits where A in (select max(A) from Fruits group by B having count(*) > 1);
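If the same de-duplication is asked in PySpark without dropDuplicates(), one common sketch (table and column names follow the Fruits example above) uses a window and row_number:

from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import row_number, col

spark = SparkSession.builder.getOrCreate()
fruits = spark.createDataFrame(
    [(1, "apple", 200), (2, "apple", 300), (3, "bat", 400)], ["A", "B", "C"])

# keep only the first row (lowest A) for each value of B
w = Window.partitionBy("B").orderBy(col("A"))
deduped = (fruits.withColumn("rn", row_number().over(w))
                 .filter(col("rn") == 1)
                 .drop("rn"))
deduped.show()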
—-------------------------------------------------------------------------------------------------------------------------
Tiger Analytics
1. Given table emp(id, name, salary), write a select query to pull all cols and a
flag column to identify whether the emp salary is less than/greater than/equal to
the average salary of all employees.
select id, name, salary,
CASE
WHEN salary > (select avg(salary) from emp) THEN 'Salary is greater than the avg salary'
WHEN salary < (select avg(salary) from emp) THEN 'Salary is less than the avg salary'
ELSE 'Salary is equal to the avg salary'
END AS SalFlag
from emp;
2. Consider table: geodata(id, country, postcode, region, city, state) --- id is
unique
df1 = dfA.filter("a=1") #3
df3.saveAsTable("default.myresult") #6
—-------------------------------------------------------------------------------------------------------------------------
Experis Questions
1. What technologies are used for data ingestion and data transformation? - Pyspark
5. From where do you read the data? – We read the data from an S3 bucket where the
upstream team drops the files every day in a date-named folder.
6. What kind of transformations are used in spark? - Filter, groupBy, and map are examples of
transformations
9. Basic schema of emp tab and dept table. To find out the highest salary for each department
along with emp details.
select e.* from emp e
join (select deptno, max(sal) as max_sal from emp group by deptno) m
on e.deptno = m.deptno and e.sal = m.max_sal;
10. How do you write your outputs? What strategy do you follow?
Actual data pipelines don’t display the result. How do you save the output? What is your
strategy? What are the different things that you used in the pipeline?
13. Can you explain coalesce and repartition? When do we need to use repartition?
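A minimal sketch of the difference (the partition counts are arbitrary): repartition(n) performs a full shuffle and can increase or decrease the number of partitions, while coalesce(n) only merges existing partitions without a shuffle, so it is the cheaper choice when reducing partitions.

df = spark.range(0, 1000000)        # assumes an existing SparkSession named spark

df_more = df.repartition(200)       # full shuffle: increase parallelism or rebalance skewed data
df_less = df.coalesce(1)            # no shuffle: merge partitions, e.g. before writing one output file

print(df_more.rdd.getNumPartitions())   # 200
print(df_less.rdd.getNumPartitions())   # 1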
18. How will you access Spark? How do you log in to Databricks?
A list is used to collect items of multiple data types, while an array collects items of the
same data type. List: [1, 'apple', 2.5], Array: [1, 2, 3]
A list cannot manage arithmetic operators, while an array can manage arithmetic operations.
1. List: used to collect items that usually consist of elements of multiple data types. Array: collects several items of the same data type.
2. List: cannot manage arithmetic operations. Array: can manage arithmetic operations.
3. List: consists of elements that belong to different data types. Array: consists of elements that belong to the same data type.
4. List: when it comes to flexibility, the list is perfect as it allows easy modification of data. Array: not as suitable, as it does not allow easy modification of data.
6. List: the complete list can be accessed without any specific looping. Array: a loop is mandatory to access the components of the array.
An array is faster than a list in Python since all the elements stored in an array are
homogeneous, i.e., they have the same data type, whereas a list contains heterogeneous
elements. Moreover, Python's array module stores values compactly as raw C values, while a
list stores references to full Python objects, which makes arrays more memory-efficient.
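A small Python sketch of the distinction (the values are arbitrary):

from array import array

items = [1, "apple", 2.5]        # list: mixed data types are fine
nums = array("i", [1, 2, 3])     # array: every element must match the "i" (signed int) typecode

nums.append(4)                   # appending another int works
try:
    nums.append("x")             # a non-int raises TypeError
except TypeError as e:
    print("array rejects mixed types:", e)

print(items, nums.tolist())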
4. You will Write some code by importing Modules and Packages. Difference between
both?
A Python Module is a simple Python file (.py extension), i.e., a combination of numerous
functions and global variables. A Package is a directory of such modules (usually containing an
__init__.py file) that groups related modules under a common namespace.
The Iterator can do modifications (e.g., using the remove() method, which removes an element
from the Collection during traversal).
The Enumeration interface acts as a read-only interface; one cannot do any modifications to the
Collection while traversing its elements.
Modules and functions may appear similar in their purpose, which is reusability.
However, modules work on a larger scale, bundling classes, functions, and attributes to fulfill
larger functionality, while functions are targeted at specific activities on a smaller scale.
Pyspark:
7. Let's say we have two dataframes and both have the same schema. What is the way to
combine both dataframes into a single dataframe?
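When both dataframes share the same schema, union (or unionByName, which matches columns by name) is the usual way; a minimal sketch, assuming an existing SparkSession named spark:

df1 = spark.createDataFrame([(1, "a")], ["id", "val"])
df2 = spark.createDataFrame([(2, "b")], ["id", "val"])

combined = df1.unionByName(df2)   # df1.union(df2) also works when the column order is identical
combined.show()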
8. What is the way to add a new column M with a constant value of True?
from pyspark.sql.functions import lit
df = df.withColumn("M", lit(True))
9. What is the difference between Datawarehouse and DataLake?
A data lake contains all an organization's data in a raw, unstructured form, and
can store the data indefinitely — for immediate or future use.
Storage costs are fairly inexpensive in a data lake vs data warehouse. Data lakes are
also less time-consuming to manage, which reduces operational costs.
A data warehouse contains structured data that has been cleaned and processed,
ready for strategic analysis based on predefined business needs.
Data warehouses cost more than data lakes, and also require more time to manage,
resulting in additional operational costs.
Both reduceByKey and groupByKey are wide transformations, which means both trigger a
shuffle operation.
Map-side combine means the values for each key are partially aggregated within each partition
(in the map phase) before the shuffle; reduceByKey does this, whereas groupByKey does not,
so reduceByKey sends far less data across the network.
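A small RDD sketch of the two (the word counts are only an illustration; assumes an existing SparkContext named sc):

pairs = sc.parallelize([("a", 1), ("b", 1), ("a", 1), ("b", 1), ("a", 1)])

# reduceByKey combines values for each key on every partition before the shuffle (map-side combine)
counts_reduce = pairs.reduceByKey(lambda x, y: x + y)

# groupByKey shuffles every (key, value) pair first, then we aggregate afterwards
counts_group = pairs.groupByKey().mapValues(sum)

print(counts_reduce.collect())   # [('a', 3), ('b', 2)] (order may vary)
print(counts_group.collect())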
14. DF:
N=2(user input)
Output = [[1,2],[3,4],[5,6],[7,8],[9]]
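The expected output suggests splitting a list into chunks of size N; a minimal Python sketch under that assumption:

def chunk(values, n):
    # slice the list into consecutive pieces of length n (the last piece may be shorter)
    return [values[i:i + n] for i in range(0, len(values), n)]

print(chunk([1, 2, 3, 4, 5, 6, 7, 8, 9], 2))   # [[1, 2], [3, 4], [5, 6], [7, 8], [9]]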
—-------------------------------------------------------------------------------------------------------------------------
Ayush Questions
CAREERNET
1. What all responsibilities or tools are you currently working on?
Bitwise operators.
4. Set Operators.
A Set is an unordered collection data type that is iterable, mutable, and has no duplicate
elements.
We use union when we want All the elements from both sets without duplicates.
We use Intersection when we want only the common elements in both sets.
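A quick Python illustration (the sets are arbitrary):

a = {1, 2, 3}
b = {2, 3, 4}

print(a | b)   # union: {1, 2, 3, 4} - all elements from both sets, no duplicates
print(a & b)   # intersection: {2, 3} - only the common elements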
7. In the emp table, dept 1 has multiple employees whose salaries differ.
a) How to get the details of only the highest salary from the table, and
b) if the highest salary is shared by two employees, how to get the details of the most
recently joined employee, not the older one.
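A hedged PySpark sketch of one way to do this (emp_df and the column names dept, salary, and join_date are assumptions about the emp schema):

from pyspark.sql import Window
from pyspark.sql.functions import row_number, col

# rank employees within each dept: highest salary first, ties broken by the latest join date
w = Window.partitionBy("dept").orderBy(col("salary").desc(), col("join_date").desc())

top_emp = (emp_df.withColumn("rn", row_number().over(w))   # emp_df assumed to be already loaded
                 .filter(col("rn") == 1)
                 .drop("rn"))
top_emp.show()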
8. Difference between List and Tuples.
Lists are mutable and tuples are immutable. A list is slower to iterate over; tuples are much
faster when iteration is involved.
9. In what scenarios do we use lists and tuples?
A list is used when we know the collection may need to grow or shrink; a tuple is used when
the size of the collection cannot change.
Both tuples and lists can hold elements of any data type - String, Int, Boolean, and so on.
11. What is the format of data you are getting from customers?
JSON
—-------------------------------------------------------------------------------------------------------------------------
T-systems
1. How to decide which query to run and where to run it for cost optimization, and which of the
different file formats - Parquet, Avro, JSON, ORC - to use for each case.
5. Consider that in Python we have a tuple (1,2,3) and I just want to update the second
element. Can we do it, and if yes, how?
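Tuples are immutable, so the element cannot be changed in place; a common workaround is to build a new tuple, for example:

t = (1, 2, 3)

# option 1: convert to a list, modify, convert back
tmp = list(t)
tmp[1] = 20
t = tuple(tmp)

# option 2: slicing and concatenation
t = t[:1] + (20,) + t[2:]

print(t)   # (1, 20, 3)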
8. I have a table (table 1) containing a city name and a students column whose value is
comma-separated. Consider that the database is Postgres without any split functionality;
write a SQL query to get the city name and the student count by splitting that students
column.
A 1,1,1
B 1,2
C 2,3,4
—-------------------------------------------------------------------------------------------------------------------------
ADIDAS
1) What is your Tech stack in this project?
Databricks, AWS
6) Where can you find these job icons in the Databricks portal? Are they present by default in every
Databricks environment, or do we need to install a package for this, and why do we need this?
There is a Jobs option in Databricks, using which we can schedule our jobs and attach the
required notebook to run at a particular time.
An executor is a worker process that runs on a node in a cluster. It is responsible for executing
tasks assigned to it by the driver and returning the results to the driver. Executors are the units
of parallelism in Spark, as they run multiple tasks concurrently to make full use of the available
resources in a cluster. In summary, the driver coordinates the work, while the executors do the
work.
10) What is the maximum number of drivers you can have in a single spark application?
One driver
12) What is the data format we use in s3 to store data and what is the most commonly used
data format?
CSV - CSV (Comma Separated Values) is the simplest file format where fields are separated by
using ‘comma’ as a delimiter. It is human-readable, compressible,
and most importantly platform-independent.
JSON - It carries complex data types along with the data content; JSON is a self-describing,
language-independent data format.
AVRO - Widely supported within the Hadoop community and beyond, the Avro file format is
ideal for long-term storage of essential data.
It can be read and written in many languages such as Java, Scala, etc.
ORC - ORC is a self-describing, optimized file format that stores data in columns which enables
users to read and decompress just the pieces they need.
Parquet - Parquet is an open-source, column-oriented data file format designed for efficient
data storage and retrieval.
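A minimal PySpark sketch of writing and reading Parquet (assumes an existing SparkSession named spark and DataFrame named df; the S3 path is only a placeholder):

df.write.mode("overwrite").parquet("s3://my-bucket/output/orders/")   # hypothetical path

parquet_df = spark.read.parquet("s3://my-bucket/output/orders/")
parquet_df.printSchema()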
14) What is the architecture that you follow in Databricks? (Bronze, silver, gold)
16) Basic difference between transformations & Actions in spark and some examples of both?
A transformation (e.g., filter, map, select, groupBy) lazily builds a new DataFrame/RDD from an
existing one and is not executed until an action is called, whereas an action (e.g., count, collect,
show, write) triggers the actual computation and returns or saves a result.
A narrow transformation (map, filter) is one where each input partition contributes to at most one
output partition, so no data movement is needed. A wide transformation (groupBy, join,
reduceByKey) needs records from many input partitions, so it triggers a shuffle across the cluster.
In general, narrow transformations are cheap and pipelined, while wide transformations are the
expensive steps to watch when tuning a job.
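A short PySpark illustration of the laziness (assumes an existing SparkSession named spark):

df = spark.range(0, 100)

evens = df.filter(df.id % 2 == 0)               # transformation: narrow, nothing executes yet
by_mod = evens.groupBy(evens.id % 10).count()   # transformation: wide, plans a shuffle

print(by_mod.count())                           # action: triggers the whole job
by_mod.show(5)                                  # action: runs another job and prints 5 rows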
19) If you have data with 1 lakh mobile numbers, then how many partitions will you create?
The number of partitions to create for data with 1 lakh mobile numbers depends on several
factors such as the size of the data, the available cluster resources (e.g., number of executors,
memory, CPU), and the desired level of parallelism.
By default, Spark creates one partition for each block of data in the file system, which is typically
128 MB. However, this can be adjusted according to the specific needs of the application.
As a general guideline, it is recommended to have 2-4 partitions per CPU core in the cluster.
This means that if you have a cluster with 8 cores, you would create 16-32 partitions. However,
the optimal number of partitions will depend on the specific use case, and it may be necessary
to experiment to determine the best value.
In the case of 1 lakh mobile numbers, you may create several partitions, each containing a
portion of the data. This allows Spark to process the data in parallel, making use of the available
resources in the cluster. The exact number of partitions to create will depend on the size of the
data and the available cluster resources.
RAMESH Sir
1. How do you get the data from the upstream team, how do you deal directly with clients?
2. Tell me about your day-to-day activities as a data engineer.
3. How will you access Spark? How do you log in to Databricks?
4. How do you write your outputs? What strategy do you follow? Actual data pipelines don’t
display the result. How do you save the output? What is your strategy? What are the
different things that you used in the pipeline?
5. Difference between ReduceByKey and GroupByKey?