- Tuples and Lists: The difference between tuples and lists in Python is that lists are mutable and tuples are immutable. A list can store any objects we want — lists, sets, nested data, strings, floats, booleans, ints — and can be changed after creation by adding, removing, or replacing elements. A tuple, once created, cannot have its structure changed. The subtle but important part: if a tuple contains mutable objects such as nested lists, the values inside those nested objects can still be changed. Mutating the nested data is reflected in the tuple, but the tuple's own structure cannot change. Lists are defined with square brackets [].
- Tuples can't be changed by adding or removing elements (immutable). Tuples are defined with parentheses ().
Dicts:
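A minimal sketch of the mutability difference described above (the example values are my own):

```python
# Lists are mutable: the structure can change after creation.
numbers = [1, 2, 3]
numbers.append(4)        # add an element
numbers[0] = 10          # replace an element
# numbers is now [10, 2, 3, 4]

# Tuples are immutable: the structure is fixed.
point = (1, 2, 3)
# point.append(4)   -> AttributeError
# point[0] = 10     -> TypeError

# But a mutable object *inside* a tuple can still be changed:
record = ("id-1", [100, 200])
record[1].append(300)    # mutating the nested list is allowed
# record is now ("id-1", [100, 200, 300])
```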
- A dict is a collection of key-value pairs. Keys are unique, and a value can be any object. Values are accessed by their key, and key lookup is very fast (hash-based).
Decorators:
- Decorators are a powerful and flexible way to modify or extend the behavior of functions or methods without changing their actual code. A decorator is a function that takes another function as an argument and returns a new function with extended functionality.
- Decorators are often used for logging, authentication, and memoization, letting us add functionality to existing functions or methods in a clean, reusable way.
Generators:
- A generator function is a special type of function that returns an iterator object. Instead of using return to send back a single value, it uses yield to produce a series of results over time. The function generates a value, pauses its execution after each yield, and keeps its state between iterations.
- Generators are an optimization technique for processing big data: instead of loading a whole file into memory, items are produced one at a time.
Context managers:
- Context managers in Python are a resource-management mechanism. They are most commonly used to open a resource (file, network connection, etc.), perform an operation, and then automatically close or release it, even if an error has occurred.
- They control entry to and exit from a context (usually via the __enter__ and __exit__ methods).
- Most commonly used with the with statement.
- You can write your own context managers in two ways: a class with __enter__ and __exit__, or the @contextmanager decorator from contextlib.
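The decorator, generator, and context-manager mechanics above can be sketched together in one place (the names log_calls, squares, and Resource are my own illustrations):

```python
import functools
from contextlib import contextmanager

# Decorator: wraps a function to extend its behavior (here, logging calls).
def log_calls(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        print(f"calling {func.__name__}{args}")
        return func(*args, **kwargs)
    return wrapper

@log_calls
def add(a, b):
    return a + b

# Generator: yields values one at a time, keeping state between iterations.
def squares(n):
    for i in range(n):
        yield i * i          # execution pauses here after each value

# Context manager, class style: __enter__ / __exit__.
class Resource:
    def __enter__(self):
        print("acquire")
        return self
    def __exit__(self, exc_type, exc, tb):
        print("release")     # runs even if an error occurred in the block
        return False         # do not suppress exceptions

# Context manager, decorator style: @contextmanager + yield.
@contextmanager
def resource():
    print("acquire")
    try:
        yield "handle"
    finally:
        print("release")

result = add(2, 3)                # logs the call, then returns 5
first_squares = list(squares(4))  # [0, 1, 4, 9]
with Resource():
    pass
with resource() as handle:
    pass
```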
- Python Modules and Packages: A module is a Python file that contains Python code/scripts. A package is a folder that contains an __init__.py file, which makes the folder importable as a package.
- OOP:
- Inheritance is a fundamental concept in OOP that allows a class (the child class) to inherit attributes and methods from another class (the parent or base class). It gives us code reuse, modularity, and a hierarchical class structure.
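A minimal inheritance sketch (the Vehicle/Car names are my own illustration, not from the source):

```python
class Vehicle:                          # parent (base) class
    def __init__(self, brand):
        self.brand = brand

    def describe(self):
        return f"{self.brand} vehicle"

class Car(Vehicle):                     # child class: inherits __init__
    def describe(self):                 # and overrides describe()
        return f"{self.brand} car with 4 wheels"

v = Vehicle("Generic")
c = Car("Toyota")
# v.describe() -> "Generic vehicle"
# c.describe() -> "Toyota car with 4 wheels"
```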
- Encapsulation:
- Encapsulation is the concept of hiding the internal details of a class and restricting direct access to some of its components. Python doesn't enforce strict private/protected access the way Java/C++ do: _name is only a convention and __name merely triggers name mangling, so inner attributes can still be reached with the right syntax. Encapsulation allows controlled access through getter/setter methods or properties. It helps maintain data integrity and hides internal logic.
- Example of Encapsulation:
- Encapsulation in Python is like a bank account system where your account balance (data) is kept private. You can't change your balance by accessing the account database directly. Instead, the bank provides methods like deposit and withdraw to modify your balance safely.
- Private data (balance): your balance is stored securely. Direct access from outside is not allowed, protecting the data from unauthorized changes.
- Public methods (deposit and withdraw): these are the only ways to modify your balance. They check that your request follows the rules (e.g., that you have enough balance) before allowing changes.
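The bank-account analogy above can be sketched as follows (a minimal illustration; the class and method names are mine):

```python
class BankAccount:
    def __init__(self, initial_balance=0):
        self.__balance = initial_balance   # "private" via name mangling

    def deposit(self, amount):
        if amount <= 0:
            raise ValueError("deposit must be positive")
        self.__balance += amount

    def withdraw(self, amount):
        if amount > self.__balance:
            raise ValueError("insufficient funds")
        self.__balance -= amount

    @property
    def balance(self):                     # controlled, read-only access
        return self.__balance

account = BankAccount(100)
account.deposit(50)
account.withdraw(30)
# account.balance   -> 120
# account.__balance -> AttributeError (Python only mangles the name
#                      to _BankAccount__balance, it is not truly hidden)
```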
- In OOP, polymorphism allows methods in different classes to share the same name but perform different tasks; this is made possible through inheritance and interface design. Polymorphism builds on the other OOP principles, such as inheritance (sharing behavior) and encapsulation (hiding complexity), to create reliable and modular applications. It lets us treat different classes through a common parent class. Example: Animal(speak): Cat, Dog.
- Abstraction:
- Abstraction refers to creating abstract classes that define common features and methods but leave the implementation details to the concrete subclasses. It allows hiding the details and focusing on the essence of the objects.
- "Abstraction is doing a job without saying exactly how you do it." A car: you press the brake and the accelerator without knowing how the engine actually works. A shop: you want to buy a product, and you don't care whether a human or a robot cashier handles the sale, or whether you pay by card, cash, voucher, or mobile device. Example: Shape(area): Circle, Square.
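The Animal(speak) and Shape(area) examples above can be sketched as one polymorphism-plus-abstraction snippet (a minimal illustration under my own class names):

```python
from abc import ABC, abstractmethod
import math

# Polymorphism: different classes share the same method name.
class Animal:
    def speak(self):
        return "..."

class Cat(Animal):
    def speak(self):
        return "meow"

class Dog(Animal):
    def speak(self):
        return "woof"

# Abstraction: the abstract class defines *what*, subclasses define *how*.
class Shape(ABC):
    @abstractmethod
    def area(self):
        ...

class Circle(Shape):
    def __init__(self, r):
        self.r = r
    def area(self):
        return math.pi * self.r ** 2

class Square(Shape):
    def __init__(self, side):
        self.side = side
    def area(self):
        return self.side ** 2

sounds = [a.speak() for a in (Cat(), Dog())]      # treated via Animal
areas = [round(s.area(), 2) for s in (Circle(1), Square(3))]
# sounds -> ["meow", "woof"]; areas -> [3.14, 9]
```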
- SOLID PRINCIPLES
- S — Single Responsibility Principle (SRP): A class should have only one reason to change.
- A class/module should do one thing only and handle a single responsibility.
- In Python, this means splitting up the concerns: one class for business logic, another for database access, a third for validation, etc.
- O — Open/Closed Principle (OCP): Software entities should be open for extension but closed for modification.
- You should be able to add new functionality without changing existing code.
- In Python, this is often achieved through abstract base classes (ABCs), interfaces, and polymorphism.
- L — Liskov Substitution Principle (LSP): Subtypes must be substitutable for their base types.
- Subclasses should behave like their parent class without breaking expectations.
- In Python, if a function works with a base class, it should also work with any of its subclasses without issues.
- I — Interface Segregation Principle (ISP): Clients should not be forced to depend on interfaces they do not use.
- Prefer small, specific interfaces over large, general ones.
- In Python, break large abstract base classes into smaller, more focused interfaces.
- D — Dependency Inversion Principle (DIP): High-level modules should not depend on low-level modules; both should depend on abstractions. The base (parent) class should know nothing about the classes that inherit from it. Abstractions should not depend on details; details should depend on abstractions. Dependencies should be injected, not created inside the class.
- In Python, pass dependencies in via constructor parameters or methods instead of instantiating them directly inside the class.
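A small sketch of DIP (which also illustrates OCP); the Storage/ReportService names are my own assumptions. The high-level ReportService depends only on the abstraction and receives its dependency via the constructor:

```python
from abc import ABC, abstractmethod

class Storage(ABC):                     # the abstraction both sides depend on
    @abstractmethod
    def save(self, data: str) -> str:
        ...

class LocalStorage(Storage):            # low-level detail
    def save(self, data: str) -> str:
        return f"saved {len(data)} bytes locally"

class S3Storage(Storage):               # another detail; adding it requires
    def save(self, data: str) -> str:   # no change to ReportService (OCP)
        return f"uploaded {len(data)} bytes to S3"

class ReportService:                    # high-level module
    def __init__(self, storage: Storage):   # dependency injected,
        self.storage = storage              # not created inside the class
    def publish(self, report: str) -> str:
        return self.storage.save(report)

local_result = ReportService(LocalStorage()).publish("abc")
s3_result = ReportService(S3Storage()).publish("abcd")
```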
- Python & PySpark
- 1. How do you manage memory in Python? Use efficient data structures, generators instead of lists (when possible), and the gc module for garbage collection.
- 2. List comprehension vs. generator? A list comprehension returns a full list in memory; a generator yields items one by one and is more memory-efficient for large datasets.
- 3. Deep copy vs. shallow copy? A shallow copy copies the outer object but references the inner objects; a deep copy recursively copies all nested objects.
- 4. Explain OOP concepts in Python. Inheritance: reuse code from parent classes. Encapsulation: hide internal details with _ or __. Polymorphism: same interface, different implementations.
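The shallow-vs-deep distinction in question 3 can be demonstrated with the standard copy module:

```python
import copy

original = [[1, 2], [3, 4]]

shallow = copy.copy(original)      # new outer list, same inner lists
deep = copy.deepcopy(original)     # everything copied recursively

original[0].append(99)

# The shallow copy sees the change (it shares the inner lists):
# shallow[0] -> [1, 2, 99]
# The deep copy does not:
# deep[0]    -> [1, 2]
```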
- PySpark
- 1. RDD vs DataFrame? RDD: low-level, immutable, no schema, more complex. DataFrame: high-level, has a schema, optimized (Catalyst & Tungsten).
- 2. Transformations vs actions? Transformations are lazy operations (e.g., map, filter); actions trigger execution (e.g., count(), collect()).
- 3. Partitioning/repartitioning? Partitioning optimizes parallelism. Use repartition() to increase the number of partitions or coalesce() to decrease it.
- 4. PySpark performance optimization? Cache/persist where needed; avoid shuffles (e.g., use broadcast joins); optimize partitioning; prefer DataFrames over RDDs.
- ETL Processes
- 1. How do you design scalable pipelines? Modular, reusable components; parallel processing; handle schema changes dynamically.
- 2. How do you ensure ETL efficiency and data quality? Validate data at each step; monitor jobs and automate error handling; optimize transformations.
- 3. GDPR compliance? Mask/encrypt sensitive data; practice data minimization; log data access and processing.
- 4. Incremental vs. full loads? Incremental: process only new/changed data (faster and more efficient). Full load: load the entire dataset (simpler but slower).
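A toy sketch of incremental vs. full load from question 4 (the in-memory rows and the updated_at watermark column are my own assumptions):

```python
from datetime import date

# Source rows with a last-modified date.
rows = [
    {"id": 1, "updated_at": date(2024, 1, 1)},
    {"id": 2, "updated_at": date(2024, 2, 1)},
    {"id": 3, "updated_at": date(2024, 3, 1)},
]

def full_load(source):
    # Full load: process the entire dataset every run (simple, slower).
    return list(source)

def incremental_load(source, watermark):
    # Incremental: process only rows changed since the last run's watermark.
    return [r for r in source if r["updated_at"] > watermark]

everything = full_load(rows)                        # all 3 rows
delta = incremental_load(rows, date(2024, 1, 15))   # rows 2 and 3 only
```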
- Cloud & Containers (Optional)
- 1. Docker/OpenShift? Docker packages apps in containers; OpenShift orchestrates containers with security and scaling.
- 2. AWS experience? S3 (storage), Glue (ETL), EMR (big data processing); IAM for permissions, encryption for security.
- 3. Airflow? DAGs define task workflows; use operators and sensors; handle dependencies with set_upstream/set_downstream.
- General Data Engineering
- 1. Data lineage? Track data origin and transformations; helps with audits, debugging, and compliance.
- 2. Data quality/validation? Implement validation checks (nulls, formats); monitor with logs and alerts.
- 3. Handling large data volumes? Partitioning; distributed processing; optimized storage (columnar formats like Parquet).
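The validation checks in question 2 (nulls, formats) can be sketched as follows; the sample records and the simple email regex are my own assumptions:

```python
import re

records = [
    {"email": "a@example.com", "age": 30},
    {"email": None, "age": 25},            # null check should catch this
    {"email": "not-an-email", "age": 40},  # format check should catch this
]

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def validate(record):
    """Return a list of validation errors for one record."""
    errors = []
    if record["email"] is None:
        errors.append("email is null")
    elif not EMAIL_RE.match(record["email"]):
        errors.append("email has invalid format")
    return errors

# Collect the failing records so they can be logged or routed to alerts.
bad = {i: validate(r) for i, r in enumerate(records) if validate(r)}
# bad -> {1: ["email is null"], 2: ["email has invalid format"]}
```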