# Container Networking Talk Notes

- * Motivation. Why am I doing this?
- Some time back I looked into updating the networking layer in the Oracle managed
- networking service from using Flannel (an overlay network) to a solution which utilises
- the native networking features of the Oracle cloud (secondary VNICs + IPs). However, once
- I started digging in, I quickly found that I didnt understand the current solution...
-
- * Prerequsites. Here I am aiming to describe * container* networking from scratch.
- However, some networking concepts such as L2 vs. L3, subnets, CIDR ranges are assumed.
- However, I'll try my best to briefly describe these as we go..
-
- * No expert though, i.e. at the end of this talk you'll know everything I know about container networking!
+ * I work in the Oracle cloud infrastructure group, more specifically on
+ Kubernetes related stuff, and some time back I was given the task of looking
+ into updating the networking layer in the Oracle managed Kubernetes service
+ from using Flannel (an overlay network) to a solution which utilises the native
+ networking features of the Oracle cloud (secondary VNICs + IPs). Don't worry if
+ you don't know what Flannel is, or what an overlay network is, as that is
+ the point of this talk! However, once I started digging in, I quickly found
+ that I didn't understand how Flannel worked, and as it seemed a little wrong to
+ replace one solution with another without understanding how the
+ original worked, I started digging deeper, and then realised that I didn't
+ understand networking in general! Long story short: big rabbit hole, learnt
+ some stuff, and most importantly found that I really enjoyed this, so I thought
+ I would write a talk and come and spread the networking love!
+
+ * So, I'm Kris, and in the next 30 minutes or so, I'm going to attempt to explain
+ how a container on one computer on the internet can talk to a container on
+ another computer, somewhere else on the internet.

## Slide: The aim

* No NAT'ing going on.
* Host can talk to containers, and vice versa.

- * Contrast this with the default docker approach (RHS diagram).
- * i.e. Only containers on a node have unique IP addresses.
- * Processes inside containers accessed via port mapping (IP tables).
+ * Note: We are not covering the default docker model here, where
+ containers on different nodes can have the same IPs.

## Slide: The plan

- * Summarise the 4 steps.
+ * Going to work our way toward the general case in 4 steps.
+
+ * For each, we will explain the model via a diagram, show some code, run the code,
+ then test what we have created.

- * Summarise the demo setup, i.e. using pre-prepared/up vagrant environments.
+ * Each step will be created using Vagrant-based VMs.
+
+ * Summarise the 4 steps.

## Slide: Single network namespace diagram

* Describe the outer box (the node). Could be a physical machine, or a VM as in this case.

* Describe containers vs namespaces:
- * What is a container:
- * Cgroups: What a process can do. E.g:
- * Restrict memory
- * Restrict CPU
- * Restrict network bandwidth
-
- * Apparmour/secconf/capabilities: Security layer. E.g:
- * Restrict the system calls the contained process has access to.
-
- * Namespaces: What a process can see. E.g.
- * Mount namespace: Controls which parts of the file system the contained process can see.
- * Process namespace: Controls which other processes the contained process can see.
- * Network namespace: See below...
+ * Containers use a bunch of different Linux mechanisms to isolate the processes running inside,
+ both in terms of system calls, available resources, and what they can see, e.g. filesystems, other processes, etc.
+ However, from a network connectivity point of view, the only mechanism that matters here is the network
+ namespace, so from now on, whenever I say container, what I really mean is network namespace.

* What is a network namespace:
* Its own network stack containing:

* When created, it is empty, i.e. no interfaces, routing or IP tables rules.
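
A quick way to see this for yourself (the namespace name `demo` is just an example):

```
# Create a fresh network namespace and look inside it
sudo ip netns add demo
# Only a loopback device is present, and it is DOWN
sudo ip netns exec demo ip addr
# No routes at all
sudo ip netns exec demo ip route
# Clean up
sudo ip netns del demo
```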

* Describe VETH pair: Ethernet cable with a NIC on each end.
- * Describe the relevant routing from/to the network namespace, including the
- types of each of the routing rules used here. Note The 'aha' monment, when I worked out
- the possible types of routing rules => Key takeaway, understanding these is key!
+
+ * Describe the relevant routing from/to the network namespace:
+ * Directly connected route from the host to the network namespace.
+ * Default route out of the network namespace.
+
+ * Note: The 'aha' moment, when I worked out the possible types of routing rules
+ => Understanding these was, for me, the key to understanding networking in general.
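
For reference, a minimal sketch of the kind of commands involved - not the actual *setup.sh* (the veth names and the host-end 172.16.0.254 address are assumptions; the namespace name `con` and the 172.16.0.0/24 range come from the demo below):

```
# Create the network namespace
sudo ip netns add con

# Create a veth pair and move one end into the namespace
sudo ip link add veth0 type veth peer name ceth0
sudo ip link set ceth0 netns con

# Bring both ends (and loopback) up
sudo ip link set veth0 up
sudo ip netns exec con ip link set lo up
sudo ip netns exec con ip link set ceth0 up

# Address both ends: .1 inside the namespace, .254 on the host end
sudo ip netns exec con ip addr add 172.16.0.1/24 dev ceth0
sudo ip addr add 172.16.0.254/24 dev veth0

# Assigning the host-end address automatically adds the directly connected
# route for 172.16.0.0/24 via veth0. Inside the namespace, add a default
# route out via the host end of the veth pair:
sudo ip netns exec con ip route add default via 172.16.0.254
```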

## Code: Single network namespace setup.sh

```
./setup.sh
- # The interfaces + routes inside the network namespace
+ # The interfaces inside the network namespace
sudo ip netns exec con ip a
- sudo ip netns exec con ip r
- # The interfaces + routes on the node
- ip a
- ip r
# Pings the network namespace from the node
ping 172.16.0.1
- # Pings the node from the network namespace
- sudo ip netns exec con ping 10.0.0.10
```

* What is actually responding to the pings in these cases, as there is no process running
- inside the namespace who can respond in this case?
- Do a quick dive into ICMP here. It is a layer 3(.5?) protocol, i.e. we have
- an ICMP header inside of the IP packet, which defines a bunch of bits used in managing
- IP packetes. e.g.
- - Reporting that TTL has expired - More on this later...
- - Reporting that we need to fragment, but the DF bit is set - again, more on this later.
- - Bits for echo request and echo response (A.K.A ping).
- Therefore, in this case, it is the network stack in the kernel that is reponding to the
- ICMP echo request packet, with a ICMP echo request packet.
-
- For a more realistic example, We can run one (or more) real process in the network namespace
- (e.g. the python file server), and can the curl this from the node:
+ inside the namespace that can respond? It is the kernel network stack
+ inside the network namespace that is responding to these ICMP echo requests,
+ with ICMP echo reply packets.
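
To actually watch the request/reply pair, something like this works (assuming the host end of the veth pair is called veth0, as in the sketch above - the real name depends on *setup.sh*):

```
# In one terminal: watch ICMP traffic on the host end of the veth pair
sudo tcpdump -ni veth0 icmp
# In another terminal: ping the namespace; tcpdump shows an
# "ICMP echo request" line followed by an "ICMP echo reply" line
ping -c 3 172.16.0.1
```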

- ```
- # Runs the python file server in the background on port 8000, inside the network namespace
- sudo ip netns exec con python3 -m http.server 8000 &
- # Curls the python file server from the node
- curl 172.16.0.1:8000
- ```
+ * For a more realistic example, we would run one (or more) real processes in the network namespace.
+ However, for the purposes of testing connectivity, pinging is enough.

- Note: you can run multiple processes inside a network namespace, which roughly corresponds to a Kubernetes pod .
+ * Note: you can run multiple processes inside a network namespace, which is what happens inside Kubernetes pods.

## Slide: Diagram of multiple network namespaces on the same node

- * Describe the Linux bridge:
- * A single L2 broadcast domain, much like a switch, implemented in the kernel.
+ * Describe the Linux bridge: A single L2 broadcast domain, a virtual ethernet switch, implemented in the kernel.
* The bridge now has its own subnet.
* The bridge also has its own IP: Allows access from the outside.
* Describe the route for the subnet.
+ * Note: This corresponds to the default docker0 bridge.
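
Again, a rough sketch of the bridge plumbing - not the actual *setup.sh* (the bridge name br0, the veth names and the .1/.2/.3 addressing are assumptions; only the con1/con2 namespaces and the 172.16.0.0/24 range appear in the demo):

```
# Create the bridge and give it an address (this also gives the host a
# directly connected route for 172.16.0.0/24)
sudo ip link add br0 type bridge
sudo ip link set br0 up
sudo ip addr add 172.16.0.1/24 dev br0

# For each namespace: a veth pair, one end in the namespace, one end on the bridge
for i in 1 2; do
    sudo ip netns add con$i
    sudo ip link add veth$i type veth peer name ceth$i
    sudo ip link set ceth$i netns con$i
    sudo ip link set veth$i master br0
    sudo ip link set veth$i up
    sudo ip netns exec con$i ip link set lo up
    sudo ip netns exec con$i ip link set ceth$i up
    sudo ip netns exec con$i ip addr add 172.16.0.$((i + 1))/24 dev ceth$i
    sudo ip netns exec con$i ip route add default via 172.16.0.1
done
```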

## Code: Multiple network namespace setup.sh

@@ -125,26 +111,16 @@ Note: you can run multiple processes inside a network namespace, which roughly c

```
./setup.sh
- # The interfaces + routes inside a network namespace
- sudo ip netns exec con1 ip a
- sudo ip netns exec con1 ip r
- # The interfaces + routes on the node
+ # The interfaces on the node
ip a
- ip r
# Pings between the network namespaces
sudo ip netns exec con1 ping 172.16.0.3
# Pings the node from the network namespace
sudo ip netns exec con1 ping 10.0.0.10
```

- * When we ping between the network namespaces:
- * Highlight the TTL. Should be the default value, thus no routing is going on here!
- * Describe what the TTL is, and what happens when the TTL reaches zero.
- * Describe how the TTL is used, e.g. in the implementation of traceroute.
- * When we ping network namespace from node:
- * Highlight the TTL. Should be the same.
- * Mention that currently we cant get external traffic to the namespaces, as we are not fowarding IP packets.
- However, we will set this up in the next example.
+ * Highlight the TTL. Should be the default value, thus no routing is going on here!
+ * Describe what the TTL is, and what happens when the TTL reaches zero.
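
The TTL is easy to see directly in the ping output:

```
# The reply lines include a "ttl=" field: with no router in the path it should
# still be the sender's default (usually 64 on Linux)
sudo ip netns exec con1 ping -c 3 172.16.0.3
# The default TTL the kernel stamps on outgoing packets
sysctl net.ipv4.ip_default_ttl
```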

## Slide: Diagram of multiple network namespaces on different nodes but same L2 network

@@ -161,34 +137,27 @@ sudo ip netns exec con1 ping 10.0.0.10
* Talk through the *setup.sh*.
* Describe the parts common to the previous step.
* Describe the setup of the extra routes.
- * Explain the IP forwarding.
- * What does this do/why is it needed: Turns your Linux box into a router.
- * Is enabling this a security risk: Maybe, but it is required in this case!
+ * Explain the IP forwarding: Turns your Linux box into a router, which is
+ required in this case as the node has to forward the packets for any network
+ namespaces that live on that node.
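
Concretely, the per-node additions look something like this (a sketch using the addresses from the demos; the real *setup.sh* may differ):

```
# On node 10.0.0.10: send traffic for the other node's namespace subnet to that node
sudo ip route add 172.16.1.0/24 via 10.0.0.20
# On node 10.0.0.20, the mirror image:
#   sudo ip route add 172.16.0.0/24 via 10.0.0.10

# On both nodes: allow the kernel to forward packets that are not addressed to it
sudo sysctl -w net.ipv4.ip_forward=1
```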

## Demo: Multi node

On each node, run:

```
./setup.sh
- # The routes on the node
- ip r
```

- From 10.0.0.20:
-
- ```
- # Captures ICMP packetes on the veth10 interface connected to the bridge
- sudo tcpdump -ni veth10 icmp
- ```
-
- Then from 10.0.0.10:
+ From 10.0.0.10:

```
+ # The routes on the node
+ ip r
# Pings from a network namespace on one node to one on the other node
sudo ip netns exec con1 ping 172.16.1.2
- # Pings the same network namespace from the node
- ping 172.16.1.2
+ # Pings from a network namespace on one node to the other node
+ sudo ip netns exec con1 ping 10.0.0.20
```

* When we ping from a network namespace to another network namespace across nodes:

@@ -223,20 +192,13 @@ ping 172.16.1.2
* Talk through the *setup.sh*.
* Describe the parts common to the previous step.
* We need packet forwarding enabled here. This allows the node to act as a router, i.e.
- to accept and forward packets recieved, but no tdestined for, the IP of the node.
+ to accept and forward packets received, but not destined for, the IP of the node.
* Now no extra routes, but contains the socat implementation of the overlay.
* Describe *socat* in general. It creates 2 bidirectional byte streams, and transfers data between them.
* Describe how *socat* is being used here.
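
As a very rough sketch of the idea - not the repo's actual commands (the port, the tun addresses and the exact socat option names are assumptions, so check *setup.sh* and the socat man page):

```
# On node 10.0.0.10 (the mirror image runs on 10.0.0.20):
# - anything routed into tun0 is wrapped in UDP and sent to the other node
# - UDP payloads arriving from the other node are unwrapped and injected into tun0
sudo socat UDP:10.0.0.20:9000,bind=10.0.0.10:9000 \
    TUN:192.168.255.1/24,tun-name=tun0,iff-up &

# Send traffic for the other node's namespace subnet into the tunnel
sudo ip route add 172.16.1.0/24 dev tun0
```
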
- * Describe how this is similar to a VPN: How could we construct a virtual network
- between 2 hosts using socat (creating a VN!). For example, start the VN 'server' on the destination
- network using a tun device and the UDP tunnel. Start the VN 'client', with the same setup. Connect the UDP
- tunnel, then assign the tun device an address on the desination network. Just add encryption to this, and you'll
- have your very own VPN!
* Note the MTU settings, what is going on here? We reduce the MTU of the tun0
device as this allows for the encapsulation overhead (the outer IP header plus the 8-byte UDP header)
that will be added, thus ensuring that fragmentation does not occur.
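
A way to convince yourself the numbers work out (assuming the tun MTU ends up as 1472, i.e. 1500 minus the 20-byte outer IP and 8-byte UDP headers - the value in the real setup may differ):

```
# What MTU did the setup give the tun device?
ip link show tun0
# With an MTU of 1472: ICMP payload 1444 + 8 (ICMP) + 20 (IP) = 1472, so this
# don't-fragment ping fits exactly and should succeed
sudo ip netns exec con1 ping -M do -s 1444 -c 3 172.16.1.2
# One byte more and it fails with a fragmentation-needed error rather than
# being silently fragmented
sudo ip netns exec con1 ping -M do -s 1445 -c 3 172.16.1.2
```
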
- * Describe the scheme that is used to ensure that the kernel chooses packet sizes that dont cause fragmentation
- (using the DF bit in the IP packets, and the 'Fragmentation required' ICMP response.
* Reverse path filtering:
* What is this: Discards incoming packets from interfaces where they shouldn't be.
* Its purpose: A security feature to stop IP spoofed packets from being propagated.

@@ -245,9 +207,6 @@ ping 172.16.1.2
tunnel. However, the response will not (as it is destined for the node), thus the response
will emerge on a different interface from the one the request packet went out of. Therefore, the kernel
considers this suspicious, unless we tell it that all is ok.
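
The knob itself is a per-interface sysctl; which interfaces the real *setup.sh* relaxes is its own choice, but it is along these lines:

```
# Current setting (0 = off, 1 = strict, 2 = loose)
sysctl net.ipv4.conf.all.rp_filter
# Relax the check so the asymmetric replies described above are not dropped
# (the real setup may instead target individual interfaces, e.g. the physical NIC)
sudo sysctl -w net.ipv4.conf.all.rp_filter=2
```
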
- * Is it OK to turn this off? Again, maybe, as this is primarily for DOS attacks, thus aimed at
- routers on the public internet. The alternative is to ensure that packets from network
- namespaces to remote nodes also go via the overlay (which would involve src based routing!)

## Demo: Overlay network

@@ -257,20 +216,13 @@ On each node, run:
./setup.sh
```

- From 10.0.0.20:
-
- ```
- # Captures ICMP packetes on the veth20 interface connected to the bridge
- sudo tcpdump -ni veth10 icmp
- ```
-
From 10.0.0.10:

```
# Ping from a network namespace on one node to one on the other node
sudo ip netns exec con1 ping 172.16.1.2
- # Pings the same network namespace from the node
- ping 172.16.1.2
+ # Ping from a network namespace on one node to the other node
+ sudo ip netns exec con1 ping 10.0.0.20
```

* When we ping from a network namespace to a network namespace across nodes: