Hi There 👋

Hi, my name is Harsh. Welcome to my blog. Here I share new things I learn, along with past learnings, while exploring system architecture, database internals, observability, and backend infrastructure, one deep dive at a time.

Map Reduce

This article shares learnings from Google’s influential MapReduce paper and explores the challenges encountered while implementing a simplified version. Our system uses multiple worker processes, running on a single machine and communicating via RPC, to mimic key aspects of a distributed environment.

What is MapReduce

At its core, MapReduce is a programming model and an associated framework for processing and generating massive datasets using a parallel, distributed algorithm, typically on a cluster of computers. You might already be familiar with map and reduce operations from functional programming languages. For instance, in JavaScript, array.map() transforms each element of an array independently based on a mapper function, while array.reduce() iterates through an array, applying a reducer function to accumulate its elements into a single output value (e.g., a sum, or a new, aggregated object). ...
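To make the model concrete, here is a minimal single-process sketch of a word count in the MapReduce style. It is not the implementation from the article or the paper: the names map_fn, reduce_fn and run_mapreduce are hypothetical, and the "shuffle" step is just an in-memory grouping by key rather than anything distributed.

```python
from collections import defaultdict


def map_fn(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def reduce_fn(key, values):
    """Reduce step: combine all values emitted for one key into a single result."""
    return sum(values)


def run_mapreduce(documents, mapper, reducer):
    """Run map, group intermediate pairs by key (the shuffle), then reduce."""
    intermediate = defaultdict(list)
    for document in documents:
        for key, value in mapper(document):
            intermediate[key].append(value)
    return {key: reducer(key, values) for key, values in intermediate.items()}


counts = run_mapreduce(["the quick fox", "the lazy dog the end"], map_fn, reduce_fn)
print(counts)
```

In a real deployment the map calls and the reduce calls would run on different workers, with the grouped intermediate data moved between them; the sketch only shows the shape of the two user-supplied functions.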

Debugging Redis Latency

This article describes how we resolved high response times at work when executing Redis commands from a Node.js server against Dragonfly, a Redis-compatible database.

Background

After introducing metrics to our Node.js service, we started recording the overall response time whenever a Redis command was executed. We had a wrapper service around the ioredis Redis driver for interacting with our Redis-compatible database. Once we set up Grafana dashboards for metrics like cache latency, we saw unusually high p99 latency numbers, close to 200ms. This is a very large number, especially considering the underlying database query itself typically takes less than 10ms to complete.

To understand why this latency was so high, we needed more detailed insight than metrics alone could provide. As part of a broader effort to set up our observability stack, I had been exploring various tracing solutions, ranging from open-source SDKs (OpenTelemetry Node.js SDK) with a self-deployed trace backend to third-party managed solutions (Datadog, Middleware, etc.). For this investigation, we decided to proceed with a self-hosted Grafana Tempo instance to test the setup and feasibility. (So far, the setup is working great, and I’m planning a detailed blog post on our observability architecture soon.) With tracing set up, we could get a waterfall view of the path taken by the service while responding to things like HTTP requests or event processing, which we hoped would pinpoint the source of the delay in our Redis command execution. ...

Socket File Descriptor and TCP connections

Socket File Descriptors and Their Kernel Structures

A socket is a special type of file descriptor (FD) in Linux, represented as socket:[inode]. Unlike regular file FDs, socket FDs point to in-memory kernel structures, not disk inodes. The /proc/<pid>/fd directory lists all FDs for a process, including sockets. The inode number of a socket can be used to inspect its details via tools like ss and /proc/net/tcp.

Example: Checking Open FDs for Process 216

ls -l /proc/216/fd

Output: ...

Understanding Inodes and Disk Layout

Overall Organization Of Data In Disks

Assume we have a 256 KB disk.

Disk Blocks: The basic units of storage on the disk, each 4 KB in size. The disk is divided into these blocks, numbered from 0 to N-1 (where N is the total number of blocks; here N = 256 KB / 4 KB = 64).
Inode Bitmap (i): Block 1; a bitmap tracking which inodes are free (0) or in use (1).
Data Bitmap (d): Block 2; a bitmap tracking which data blocks are free (0) or allocated (1).
Inode Table (I): Blocks 3-7; an array of inodes, where each inode (256 bytes) holds metadata about a file, like size, permissions, and pointers to data blocks. Five 4 KB blocks hold 80 inode structures of 256 bytes each (5 x 4096 / 256 = 80).
Data Region (D): Blocks 8-63; the largest section, storing the actual contents of files and directories.

Inode

Every inode has a unique identifier called an inode number (or i-number). This number acts like a file’s address in the file system, allowing the operating system to quickly locate its inode. For example: ...
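The layout above is enough to work out where a given inode lives on disk. The sketch below uses the block size, inode size and inode table position from the text; the function name locate_inode is hypothetical, and it assumes inode numbers start at 0 and that inodes are packed contiguously from the start of block 3.

```python
BLOCK_SIZE = 4096             # 4 KB blocks, as in the layout above
INODE_SIZE = 256              # bytes per inode
INODE_TABLE_START_BLOCK = 3   # inode table occupies blocks 3-7
INODE_TABLE_BLOCKS = 5

INODES_PER_BLOCK = BLOCK_SIZE // INODE_SIZE            # 16
TOTAL_INODES = INODES_PER_BLOCK * INODE_TABLE_BLOCKS   # 80


def locate_inode(inumber):
    """Return (block number, byte offset within that block) for an inode.

    Assumes inode numbers start at 0 and inodes are stored back to back.
    """
    if not 0 <= inumber < TOTAL_INODES:
        raise ValueError(f"inode number {inumber} is outside 0..{TOTAL_INODES - 1}")
    block = INODE_TABLE_START_BLOCK + inumber // INODES_PER_BLOCK
    offset = (inumber % INODES_PER_BLOCK) * INODE_SIZE
    return block, offset


print(TOTAL_INODES)       # 80
print(locate_inode(17))   # (4, 256)
print(locate_inode(32))   # (5, 0)
```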

Files And Directories

Files and directories

File systems virtualize persistent storage (e.g., hard drives, SSDs) into user-friendly files and directories, adding a third pillar to OS abstractions (processes for CPU, address spaces for memory).

File Paths and System Calls

Files are organized in a tree-like directory structure, starting from the root (/). A file’s location is identified by its pathname (e.g., /home/user/file.txt). To interact with files, processes use system calls:

open(path, flags): Opens a file and returns a file descriptor (fd).
read(fd, buffer, size): Reads data from the file into a buffer using the fd.
write(fd, buffer, size): Writes data to the file via the fd.
close(fd): Closes the file, freeing the fd.

File Descriptors

A file descriptor is a small integer, private to each process, that identifies an open file. When a process calls open(), the operating system assigns it the lowest available fd (e.g., 3, 4, etc.). Every process starts with three default fds: ...

RAID (Redundant Array of Inexpensive Disks)

RAID Disks

Three axes on which disks are analysed:

Capacity - How much capacity is needed to store X bytes of data
Reliability - How fault-tolerant the disk is
Performance - Read and write speeds (sequential and random)

To make a logical disk (comprising a set of physical disks) reliable we need replication, so there is a trade-off with capacity and performance (write amplification).

When we talk about a collection of physical disks representing a single logical disk, we should know that a small amount of compute and some non-volatile RAM are also included to complete the disk controller component. This RAM is also used as a write-ahead log (WAL) for faster writes, similar to databases. In a way, this set of disks also faces challenges similar to those of distributed databases. ...

Multilevel Page table

Segmented Page Table

A page table can grow large. For a 32-bit address space and a 4 KB page size, 20 bits are used for the virtual page number, resulting in 2^20 page table entries (i.e. 4 MB for a single page table, at 4 bytes per entry). Each process has its own page table, so with many processes we could be storing hundreds of MB for page tables alone, which is not good.

For the page table above, with 4 bits for the VPN (virtual page number), only VPNs 0, 4, 14 and 15 are valid, i.e. pointing to a PFN (physical frame number); the other PTEs (page table entries) just take up space that is not used. We can use segmentation here, with base and bound registers for each page table, to store only the valid PTEs in the table. This again splits the virtual address so that it also contains segment bits identifying which segment the address belongs to (code, heap or stack). Instead of using the Base Page Table Register to query the page table, we now use Base Page Table Register[Segment] to get the page table's physical address for a given segment. ...
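The size of a linear page table follows directly from the address width and page size. The sketch below redoes that arithmetic; the function name is hypothetical, and the 4-byte page table entry is an assumption chosen because it matches the 4 MB figure in the text.

```python
def linear_page_table_bytes(address_bits, page_size, pte_size):
    """Size in bytes of a flat page table with one entry per virtual page.

    page_size must be a power of two; pte_size is the assumed entry size.
    """
    offset_bits = page_size.bit_length() - 1   # 4096 -> 12 offset bits
    vpn_bits = address_bits - offset_bits      # 32 - 12 = 20 VPN bits
    return (1 << vpn_bits) * pte_size          # 2^20 entries * entry size


per_process = linear_page_table_bytes(address_bits=32, page_size=4096, pte_size=4)
print(per_process // (1024 * 1024))        # 4 (MB per process)
print(100 * per_process // (1024 * 1024))  # 400 (MB for 100 processes)
```

With 100 processes the flat tables alone would take 400 MB, which is the overhead that segmented and multilevel page tables try to avoid.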

TLB

TLB

The translation look-aside buffer (TLB) is a CPU cache that is generally small, but because it sits close to the CPU, a TLB hit lets address translation complete in 1-5 CPU cycles.

CPU Cycle: One tick of the CPU clock. An instruction takes one or more cycles to execute, while CPU frequency refers to the number of these cycles that occur per second.

A TLB hit means that, for a given virtual address, the physical frame number was found in the TLB. A TLB hit benefits all the addresses that lie on the same page. In the image above, the page size is 16 bytes, so 4 INT variables fit in a single page, and a cached translation for VPN 07 serves the addresses at VPN 07 with page offsets of 0, 4, 8 and 12 bytes. This type of caching benefits from spatial locality of data, where a cache hit results in cache hits for surrounding data as well. If we cache data that is likely to be accessed again within the same time frame (like loop variables), then such caching benefits from temporal locality. ...
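The effect of spatial locality can be seen by counting hits and misses for a sequential array scan. This is a small sketch, not a model of real hardware: simulate_tlb is a hypothetical helper, the TLB is treated as large enough to never evict, and the array's start address (100) and length (10 four-byte integers) are assumptions picked for illustration with the 16-byte page size from the text.

```python
def simulate_tlb(addresses, page_size):
    """Count TLB hits and misses for a sequence of virtual addresses.

    The TLB is modelled as a set of cached VPNs with no eviction.
    """
    cached_vpns = set()
    hits = misses = 0
    for address in addresses:
        vpn = address // page_size
        if vpn in cached_vpns:
            hits += 1
        else:
            misses += 1
            cached_vpns.add(vpn)   # the translation is cached after the miss
    return hits, misses


PAGE_SIZE = 16          # bytes, as in the text
INT_SIZE = 4            # bytes per integer
ARRAY_START = 100       # assumed virtual address of the first element
accesses = [ARRAY_START + i * INT_SIZE for i in range(10)]

hits, misses = simulate_tlb(accesses, PAGE_SIZE)
print(hits, misses)     # 7 3
```

Only the first access to each page misses; every other element on that page is served from the cached translation.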

Page Tables

Page Tables

A page table contains the translation information from virtual page number to physical frame number. For a 32-bit address space and a page size of 4 KB (i.e. 2^32 bytes of memory divided into 4 KB chunks, each called a memory page), the virtual address is 32 bits wide: 12 bits (2^12 = 4 KB) are used as the offset inside a single page, and the remaining 20 bits are used as the virtual page number ...
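The 20/12 split described above is just a shift and a mask. Here is a minimal sketch; the helper names and the single-entry page table (a plain dict mapping VPN to PFN) are made up for illustration.

```python
OFFSET_BITS = 12                        # 2^12 = 4 KB page
OFFSET_MASK = (1 << OFFSET_BITS) - 1    # 0xFFF


def split_virtual_address(virtual_address):
    """Split a 32-bit virtual address into (virtual page number, page offset)."""
    return virtual_address >> OFFSET_BITS, virtual_address & OFFSET_MASK


def translate(virtual_address, page_table):
    """Translate using a dict of VPN -> PFN; raises KeyError for an unmapped page."""
    vpn, offset = split_virtual_address(virtual_address)
    pfn = page_table[vpn]
    return (pfn << OFFSET_BITS) | offset   # the offset is carried over unchanged


vpn, offset = split_virtual_address(0x12345678)
print(hex(vpn), hex(offset))               # 0x12345 0x678

example_table = {0x12345: 0x00042}         # hypothetical mapping
print(hex(translate(0x12345678, example_table)))   # 0x42678
```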

B-Tree Latch Optimisation

References

5.6 Problem

Generally, when traversing an index made up of a B-tree, we have to take a latch on it. In MySQL 5.6, the approach to taking latches depends on the operation we are doing:

If the operation is a read, taking a read latch is sufficient to prevent any writes to the pages we are accessing in the B-tree while reading.

If the operation is a write, there are again two possibilities:

Optimistic Locking: If the write is limited to modifying the leaf page only, without modifying the structure of the tree (merging or splitting), we use an optimistic locking approach: take a read latch on the root of the tree and a write latch only on the leaf node to be modified.

Pessimistic Locking: If the operation results in any restructuring of the tree itself, that will be known to us only after reaching the target leaf node and knowing its neighbours and parents. So the approach is to first try the optimistic locking described above and then fall back to pessimistic locking. Pessimistic locking involves taking a write latch on the root, resulting in full ownership of the tree by the current operation (until the operation is complete no other operation can take a read or write latch, so all other operations have to wait, even if they are read operations or involve only optimistic locking). When the leaf node is found, we take write latches on the leaf’s neighbours as well as its parent and do the restructuring; if the same restructuring needs to happen at the parent level, we take similar write latches recursively up the tree. ...