Commit edc404e

committed
Add Sales Rank solution
1 parent d484f80 commit edc404e

File tree

5 files changed

+415
-0
lines changed

Lines changed: 338 additions & 0 deletions
@@ -0,0 +1,338 @@
1+
# Design Amazon's sales rank by category feature
2+
3+
*Note: This document links directly to relevant areas found in the [system design topics](https://github.com/donnemartin/system-design-primer-interview#index-of-system-design-topics-1) to avoid duplication. Refer to the linked content for general talking points, tradeoffs, and alternatives.*
4+
5+
## Step 1: Outline use cases and constraints
6+
7+
> Gather requirements and scope the problem.
8+
> Ask questions to clarify use cases and constraints.
9+
> Discuss assumptions.
10+
11+
Without an interviewer to address clarifying questions, we'll define some use cases and constraints.
12+
13+
### Use cases
14+
15+
#### We'll scope the problem to handle only the following use cases
16+
17+
* **Service** calculates the past week's most popular products by category
18+
* **User** views the past week's most popular products by category
19+
* **Service** has high availability
20+
21+
#### Out of scope
22+
23+
* The general e-commerce site
    * Design components only for calculating sales rank
25+
26+
### Constraints and assumptions
27+
28+
#### State assumptions
29+
30+
* Traffic is not evenly distributed
31+
* Items can be in multiple categories
32+
* Items cannot change categories
33+
* There are no subcategories, e.g. `foo/bar/baz`
34+
* Results must be updated hourly
    * More popular products might need to be updated more frequently
36+
* 10 million products
37+
* 1000 categories
38+
* 1 billion transactions per month
39+
* 100 billion read requests per month
40+
* 100:1 read to write ratio
41+
42+
#### Calculate usage
43+
44+
**Clarify with your interviewer if you should run back-of-the-envelope usage calculations.**
45+
46+
* Size per transaction:
    * `created_at` - 5 bytes
    * `product_id` - 8 bytes
    * `category_id` - 4 bytes
    * `seller_id` - 8 bytes
    * `buyer_id` - 8 bytes
    * `quantity` - 4 bytes
    * `total_price` - 5 bytes
    * Total: ~40 bytes
* 40 GB of new transaction content per month
    * 40 bytes per transaction * 1 billion transactions per month
    * 1.44 TB of new transaction content in 3 years
    * Assume most are new transactions instead of updates to existing ones
* 400 transactions per second on average
* 40,000 read requests per second on average
61+
62+
Handy conversion guide:
63+
64+
* 2.5 million seconds per month
65+
* 1 request per second = 2.5 million requests per month
66+
* 40 requests per second = 100 million requests per month
67+
* 400 requests per second = 1 billion requests per month
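
The estimates above can be reproduced with a few lines of arithmetic. This is only a sketch: the constant names are made up for illustration, and it uses the same rounded figures as the text (40 bytes per transaction, 2.5 million seconds per month).

```python
# Rounded figures taken from the assumptions and the conversion guide
SECONDS_PER_MONTH = 2.5e6
TRANSACTIONS_PER_MONTH = 1_000_000_000
READS_PER_MONTH = 100_000_000_000
BYTES_PER_TRANSACTION = 40  # rounded down from the exact field total

# Field sizes in bytes, as listed under "Size per transaction"
FIELD_SIZES = {
    "created_at": 5,
    "product_id": 8,
    "category_id": 4,
    "seller_id": 8,
    "buyer_id": 8,
    "quantity": 4,
    "total_price": 5,
}

exact_bytes = sum(FIELD_SIZES.values())
monthly_gb = BYTES_PER_TRANSACTION * TRANSACTIONS_PER_MONTH / 1e9
three_year_tb = monthly_gb * 36 / 1000
writes_per_second = TRANSACTIONS_PER_MONTH / SECONDS_PER_MONTH
reads_per_second = READS_PER_MONTH / SECONDS_PER_MONTH

print(f"exact bytes per transaction: {exact_bytes}")
print(f"new content per month: {monthly_gb:.0f} GB")
print(f"new content in 3 years: {three_year_tb:.2f} TB")
print(f"average writes per second: {writes_per_second:.0f}")
print(f"average reads per second: {reads_per_second:.0f}")
```

The listed fields add up to 42 bytes, which the text rounds to ~40 bytes for easier mental arithmetic.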
68+
69+
## Step 2: Create a high level design
70+
71+
> Outline a high level design with all important components.
72+
73+
![Imgur](http://i.imgur.com/vwMa1Qu.png)
74+
75+
## Step 3: Design core components
76+
77+
> Dive into details for each core component.
78+
79+
### Use case: Service calculates the past week's most popular products by category
80+
81+
We could store the raw **Sales API** server log files on a managed **Object Store** such as Amazon S3, rather than managing our own distributed file system.
82+
83+
**Clarify with your interviewer how much code you are expected to write**.
84+
85+
We'll assume this is a sample log entry, tab delimited:
86+
87+
```
timestamp   product_id  category_id    qty     total_price   seller_id    buyer_id
t1          product1    category1      2       20.00         1            1
t2          product1    category2      2       20.00         2            2
t2          product1    category2      1       10.00         2            3
t3          product2    category1      3        7.00         3            4
t4          product3    category2      7        2.00         4            5
t5          product4    category1      1        5.00         5            6
...
```
97+
98+
The **Sales Rank Service** could use **MapReduce**, using the **Sales API** server log files as input and writing the results to an aggregate table `sales_rank` in a **SQL Database**. We should discuss the [use cases and tradeoffs between choosing SQL or NoSQL](https://github.com/donnemartin/system-design-primer-interview#sql-or-nosql).
99+
100+
We'll use a multi-step **MapReduce**:
101+
102+
* **Step 1** - Transform the data to `(category, product_id), sum(quantity)`
103+
* **Step 2** - Perform a distributed sort
104+
105+
```python
from mrjob.job import MRJob
from mrjob.step import MRStep


class SalesRanker(MRJob):

    def within_past_week(self, timestamp):
        """Return True if timestamp is within past week, False otherwise."""
        ...

    def mapper(self, _, line):
        """Parse each log line, extract and transform relevant lines.

        Emit key value pairs of the form:

        (category1, product1), 2
        (category2, product1), 2
        (category2, product1), 1
        (category1, product2), 3
        (category2, product3), 7
        (category1, product4), 1
        """
        timestamp, product_id, category_id, quantity, total_price, seller_id, \
            buyer_id = line.split('\t')
        if self.within_past_week(timestamp):
            # Fields arrive as strings; convert so the reducer can sum them
            yield (category_id, product_id), int(quantity)

    def reducer(self, key, values):
        """Sum values for each key.

        (category1, product1), 2
        (category2, product1), 3
        (category1, product2), 3
        (category2, product3), 7
        (category1, product4), 1
        """
        yield key, sum(values)

    def mapper_sort(self, key, value):
        """Construct key to ensure proper sorting.

        Transform key and value to the form:

        (category1, 2), product1
        (category2, 3), product1
        (category1, 3), product2
        (category2, 7), product3
        (category1, 1), product4

        The shuffle/sort step of MapReduce will then do a
        distributed sort on the keys, resulting in:

        (category1, 1), product4
        (category1, 2), product1
        (category1, 3), product2
        (category2, 3), product1
        (category2, 7), product3
        """
        category_id, product_id = key
        quantity = value
        yield (category_id, quantity), product_id

    def reducer_identity(self, key, values):
        """Emit each value unchanged; `values` is an iterator."""
        for value in values:
            yield key, value

    def steps(self):
        """Run the map and reduce steps."""
        return [
            MRStep(mapper=self.mapper,
                   reducer=self.reducer),
            MRStep(mapper=self.mapper_sort,
                   reducer=self.reducer_identity),
        ]


if __name__ == '__main__':
    SalesRanker.run()
```
176+
177+
The result would be the following sorted list, which we could insert into the `sales_rank` table:
178+
179+
```
(category1, 1), product4
(category1, 2), product1
(category1, 3), product2
(category2, 3), product1
(category2, 7), product3
```
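
To sanity-check the expected output, the two steps can be simulated locally in plain Python without a MapReduce cluster. This is a rough sketch under stated assumptions: the function names `sum_quantities` and `rank` are hypothetical, the sample lines are the log entries shown earlier, and the past-week filter is skipped because the sample timestamps (`t1`, `t2`, ...) are placeholders.

```python
from collections import defaultdict

# Sample log entries from above, tab delimited (header row omitted)
SAMPLE_LOG = [
    "t1\tproduct1\tcategory1\t2\t20.00\t1\t1",
    "t2\tproduct1\tcategory2\t2\t20.00\t2\t2",
    "t2\tproduct1\tcategory2\t1\t10.00\t2\t3",
    "t3\tproduct2\tcategory1\t3\t7.00\t3\t4",
    "t4\tproduct3\tcategory2\t7\t2.00\t4\t5",
    "t5\tproduct4\tcategory1\t1\t5.00\t5\t6",
]


def sum_quantities(lines):
    """Step 1: total quantity per (category_id, product_id)."""
    totals = defaultdict(int)
    for line in lines:
        _timestamp, product_id, category_id, quantity, *_rest = line.split("\t")
        totals[(category_id, product_id)] += int(quantity)
    return totals


def rank(totals):
    """Step 2: re-key as (category_id, total) and sort, as the shuffle would."""
    return sorted(
        ((category_id, total), product_id)
        for (category_id, product_id), total in totals.items()
    )


for key, product_id in rank(sum_quantities(SAMPLE_LOG)):
    print(key, product_id)
```

An in-memory sort only works for small inputs; the point of the MapReduce version is that the shuffle/sort phase distributes this work across machines.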
186+
187+
The `sales_rank` table could have the following structure:
188+
189+
```
id int NOT NULL AUTO_INCREMENT
category_id int NOT NULL
total_sold int NOT NULL
product_id int NOT NULL
PRIMARY KEY(id)
FOREIGN KEY(category_id) REFERENCES Categories(id)
FOREIGN KEY(product_id) REFERENCES Products(id)
```
198+
199+
We'll create an [index](https://github.com/donnemartin/system-design-primer-interview#use-good-indices) on `id`, `category_id`, and `product_id` to speed up lookups (log-time instead of scanning the entire table) and to keep the data in memory. Reading 1 MB sequentially from memory takes about 250 microseconds, while reading from SSD takes 4x and from disk takes 80x longer.<sup><a href=https://github.com/donnemartin/system-design-primer-interview#latency-numbers-every-programmer-should-know>1</a></sup>
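
As a rough illustration of the table and its indices, here is a sketch using Python's built-in `sqlite3` module with an in-memory database. Several things here are assumptions rather than part of the design: SQLite stands in for the production **SQL Database**, the index names and the `top_products` helper are hypothetical, and the foreign keys are left out because the `Categories` and `Products` tables are not defined in this document.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales_rank (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    category_id INTEGER NOT NULL,
    total_sold INTEGER NOT NULL,
    product_id INTEGER NOT NULL
);
-- The primary key is indexed implicitly; add the other two lookups
CREATE INDEX idx_sales_rank_category ON sales_rank (category_id);
CREATE INDEX idx_sales_rank_product ON sales_rank (product_id);
""")

# (category_id, total_sold, product_id), mirroring the sorted output above
rows = [(1, 1, 4), (1, 2, 1), (1, 3, 2), (2, 3, 1), (2, 7, 3)]
conn.executemany(
    "INSERT INTO sales_rank (category_id, total_sold, product_id) "
    "VALUES (?, ?, ?)",
    rows,
)


def top_products(conn, category_id, limit=10):
    """Return (product_id, total_sold) pairs, best seller first."""
    cursor = conn.execute(
        "SELECT product_id, total_sold FROM sales_rank "
        "WHERE category_id = ? ORDER BY total_sold DESC LIMIT ?",
        (category_id, limit),
    )
    return cursor.fetchall()


print(top_products(conn, 1))
```

Since the read path always filters by category and orders by units sold, a composite index on `(category_id, total_sold)` is worth raising as an alternative to separate single-column indices.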
200+
201+
### Use case: User views the past week's most popular products by category
202+
203+
* The **Client** sends a request to the **Web Server**, running as a [reverse proxy](https://github.com/donnemartin/system-design-primer-interview#reverse-proxy-web-server)
204+
* The **Web Server** forwards the request to the **Read API** server
205+
* The **Read API** server reads from the **SQL Database** `sales_rank` table
206+
207+
We'll use a public [**REST API**](https://github.com/donnemartin/system-design-primer-interview#representational-state-transfer-rest):
208+
209+
```
$ curl "https://amazon.com/api/v1/popular?category_id=1234"
```
212+
213+
Response:
214+
215+
```
[
    {
        "id": "100",
        "category_id": "1234",
        "total_sold": "100000",
        "product_id": "50"
    },
    {
        "id": "53",
        "category_id": "1234",
        "total_sold": "90000",
        "product_id": "200"
    },
    {
        "id": "75",
        "category_id": "1234",
        "total_sold": "80000",
        "product_id": "3"
    }
]
```
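
One way the **Read API** server might turn `sales_rank` rows into this response is sketched below. The function name `build_popular_response` and the tuple layout are assumptions made for illustration; values are serialized as strings only to match the sample response above.

```python
import json


def build_popular_response(rows):
    """Serialize sales_rank rows as JSON, best seller first.

    rows: iterable of (id, category_id, total_sold, product_id) tuples.
    """
    ordered = sorted(rows, key=lambda row: row[2], reverse=True)
    payload = [
        {
            "id": str(row_id),
            "category_id": str(category_id),
            "total_sold": str(total_sold),
            "product_id": str(product_id),
        }
        for row_id, category_id, total_sold, product_id in ordered
    ]
    return json.dumps(payload, indent=4)


# Rows deliberately out of order to show the sort
rows = [(53, 1234, 90000, 200), (100, 1234, 100000, 50), (75, 1234, 80000, 3)]
print(build_popular_response(rows))
```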
235+
236+
For internal communications, we could use [Remote Procedure Calls](https://github.com/donnemartin/system-design-primer-interview#remote-procedure-call-rpc).
237+
238+
## Step 4: Scale the design
239+
240+
> Identify and address bottlenecks, given the constraints.
241+
242+
![Imgur](http://i.imgur.com/MzExP06.png)
243+
244+
**Important: Do not simply jump right into the final design from the initial design!**
245+
246+
State you would 1) **Benchmark/Load Test**, 2) **Profile** for bottlenecks, 3) address bottlenecks while evaluating alternatives and trade-offs, and 4) repeat. See [Design a system that scales to millions of users on AWS]() for an example of how to iteratively scale the initial design.
247+
248+
It's important to discuss what bottlenecks you might encounter with the initial design and how you might address each of them. For example, what issues are addressed by adding a **Load Balancer** with multiple **Web Servers**? **CDN**? **Master-Slave Replicas**? What are the alternatives and **Trade-Offs** for each?
249+
250+
We'll introduce some components to complete the design and to address scalability issues. Internal load balancers are not shown to reduce clutter.
251+
252+
*To avoid repeating discussions*, refer to the following [system design topics](https://github.com/donnemartin/system-design-primer-interview#) for main talking points, tradeoffs, and alternatives:
253+
254+
* [DNS](https://github.com/donnemartin/system-design-primer-interview#domain-name-system)
255+
* [CDN](https://github.com/donnemartin/system-design-primer-interview#content-delivery-network)
256+
* [Load balancer](https://github.com/donnemartin/system-design-primer-interview#load-balancer)
257+
* [Horizontal scaling](https://github.com/donnemartin/system-design-primer-interview#horizontal-scaling)
258+
* [Web server (reverse proxy)](https://github.com/donnemartin/system-design-primer-interview#reverse-proxy-web-server)
259+
* [API server (application layer)](https://github.com/donnemartin/system-design-primer-interview#application-layer)
260+
* [Cache](https://github.com/donnemartin/system-design-primer-interview#cache)
261+
* [Relational database management system (RDBMS)](https://github.com/donnemartin/system-design-primer-interview#relational-database-management-system-rdbms)
262+
* [SQL write master-slave failover](https://github.com/donnemartin/system-design-primer-interview#fail-over)
263+
* [Master-slave replication](https://github.com/donnemartin/system-design-primer-interview#master-slave-replication)
264+
* [Consistency patterns](https://github.com/donnemartin/system-design-primer-interview#consistency-patterns)
265+
* [Availability patterns](https://github.com/donnemartin/system-design-primer-interview#availability-patterns)
266+
267+
The **Analytics Database** could use a data warehousing solution such as Amazon Redshift or Google BigQuery.
268+
269+
We might only want to store a limited time period of data in the database, while storing the rest in a data warehouse or in an **Object Store**. An **Object Store** such as Amazon S3 can comfortably handle the constraint of 40 GB of new content per month.
270+
271+
To address the 40,000 *average* read requests per second (higher at peak), traffic for popular content (and their sales rank) should be handled by the **Memory Cache** instead of the database. The **Memory Cache** is also useful for handling the unevenly distributed traffic and traffic spikes. With the large volume of reads, the **SQL Read Replicas** might not be able to handle the cache misses. We'll probably need to employ additional SQL scaling patterns.
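
A minimal cache-aside sketch of this read path follows, assuming a one-hour time-to-live to match the hourly update constraint. The class name `SalesRankCache` and the `load_from_db` callback are hypothetical, and a plain dictionary stands in for a real **Memory Cache** such as Memcached or Redis.

```python
import time


class SalesRankCache:
    """Cache-aside lookup with a time-to-live per category."""

    def __init__(self, load_from_db, ttl_seconds=3600, clock=time.monotonic):
        self._load = load_from_db  # called only on a miss or expiry
        self._ttl = ttl_seconds
        self._clock = clock        # injectable so expiry can be tested
        self._entries = {}         # category_id -> (stored_at, value)

    def get(self, category_id):
        now = self._clock()
        entry = self._entries.get(category_id)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]  # cache hit
        # Cache miss or stale entry: read from the database, then store
        value = self._load(category_id)
        self._entries[category_id] = (now, value)
        return value


def slow_lookup(category_id):
    """Stand-in for a query against a SQL Read Replica."""
    print(f"querying database for category {category_id}")
    return [("product2", 3), ("product1", 2), ("product4", 1)]


cache = SalesRankCache(slow_lookup)
cache.get(1)  # miss: hits the database
cache.get(1)  # hit: served from memory
```

Because the rankings change at most hourly, a simple expiry is likely sufficient here; the cache update strategies listed under Caching below cover the alternatives.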
272+
273+
400 *average* writes per second (higher at peak) might be tough for a single **SQL Write Master-Slave**, also pointing to a need for additional scaling techniques.
274+
275+
SQL scaling patterns include:
276+
277+
* [Federation](https://github.com/donnemartin/system-design-primer-interview#federation)
278+
* [Sharding](https://github.com/donnemartin/system-design-primer-interview#sharding)
279+
* [Denormalization](https://github.com/donnemartin/system-design-primer-interview#denormalization)
280+
* [SQL Tuning](https://github.com/donnemartin/system-design-primer-interview#sql-tuning)
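
As one illustration of sharding, the sketch below routes each category to a shard by hashing its id. Sharding on `category_id` is an assumption made here, not something the design prescribes, and `shard_for` is a hypothetical helper. A stable hash from `hashlib` is used because Python's built-in `hash()` for strings varies between processes.

```python
import hashlib


def shard_for(category_id, num_shards):
    """Map a category to a shard index in the range [0, num_shards)."""
    digest = hashlib.sha256(str(category_id).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


# Distribute the 1000 assumed categories across 4 shards
shards = {category_id: shard_for(category_id, 4) for category_id in range(1, 1001)}
counts = [list(shards.values()).count(index) for index in range(4)]
print(counts)
```

Simple modulo hashing remaps most keys when the shard count changes, and a few very popular categories can still create hot shards, so both points are worth discussing as trade-offs.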
281+
282+
We should also consider moving some data to a **NoSQL Database**.
283+
284+
## Additional talking points
285+
286+
> Additional topics to dive into, depending on the problem scope and time remaining.
287+
288+
### NoSQL
289+
290+
* [Key-value store](https://github.com/donnemartin/system-design-primer-interview#)
291+
* [Document store](https://github.com/donnemartin/system-design-primer-interview#)
292+
* [Wide column store](https://github.com/donnemartin/system-design-primer-interview#)
293+
* [Graph database](https://github.com/donnemartin/system-design-primer-interview#)
294+
* [SQL vs NoSQL](https://github.com/donnemartin/system-design-primer-interview#)
295+
296+
### Caching
297+
298+
* Where to cache
    * [Client caching](https://github.com/donnemartin/system-design-primer-interview#client-caching)
    * [CDN caching](https://github.com/donnemartin/system-design-primer-interview#cdn-caching)
    * [Web server caching](https://github.com/donnemartin/system-design-primer-interview#web-server-caching)
    * [Database caching](https://github.com/donnemartin/system-design-primer-interview#database-caching)
    * [Application caching](https://github.com/donnemartin/system-design-primer-interview#application-caching)
* What to cache
    * [Caching at the database query level](https://github.com/donnemartin/system-design-primer-interview#caching-at-the-database-query-level)
    * [Caching at the object level](https://github.com/donnemartin/system-design-primer-interview#caching-at-the-object-level)
* When to update the cache
    * [Cache-aside](https://github.com/donnemartin/system-design-primer-interview#cache-aside)
    * [Write-through](https://github.com/donnemartin/system-design-primer-interview#write-through)
    * [Write-behind (write-back)](https://github.com/donnemartin/system-design-primer-interview#write-behind-write-back)
    * [Refresh ahead](https://github.com/donnemartin/system-design-primer-interview#refresh-ahead)
312+
313+
### Asynchronism and microservices
314+
315+
* [Message queues](https://github.com/donnemartin/system-design-primer-interview#)
316+
* [Task queues](https://github.com/donnemartin/system-design-primer-interview#)
317+
* [Back pressure](https://github.com/donnemartin/system-design-primer-interview#)
318+
* [Microservices](https://github.com/donnemartin/system-design-primer-interview#)
319+
320+
### Communications
321+
322+
* Discuss tradeoffs:
    * External communication with clients - [HTTP APIs following REST](https://github.com/donnemartin/system-design-primer-interview#representational-state-transfer-rest)
    * Internal communications - [RPC](https://github.com/donnemartin/system-design-primer-interview#remote-procedure-call-rpc)
* [Service discovery](https://github.com/donnemartin/system-design-primer-interview#service-discovery)
326+
327+
### Security
328+
329+
Refer to the [security section](https://github.com/donnemartin/system-design-primer-interview#security).
330+
331+
### Latency numbers
332+
333+
See [Latency numbers every programmer should know](https://github.com/donnemartin/system-design-primer-interview#latency-numbers-every-programmer-should-know).
334+
335+
### Ongoing
336+
337+
* Continue benchmarking and monitoring your system to address bottlenecks as they come up
338+
* Scaling is an iterative process

solutions/system_design/sales_rank/__init__.py

Whitespace-only changes.