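Docker Networking: A Deep Dive into overlay
Deployment
First, deploy a Docker swarm cluster with two nodes, named node 1 and node 2; the steps for creating the swarm itself are omitted here. Next, create an overlay network and three services, each with a single instance, as follows: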
docker network create --opt encrypted --subnet 101.0.0.0/24 -d overlay net1
docker service create --name redis --network net1 redis
docker service create --name node --network net1 nvbeta/node
docker service create --name nginx --network net1 -p 1080:80 nvbeta/swarm_nginx
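These commands create a typical three-tier application. nginx is the front-end load balancer and distributes user requests to the node service. The node service is a web service that queries redis and returns the result to the user through nginx. For simplicity, only one node instance is created.
The logical view of the application is as follows:
[Figure: logical view of the application]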
Networks
Take a look at the networks that now exist in the Docker swarm:
$ docker network ls
NETWORK ID          NAME                DRIVER              SCOPE
cac91f9c60ff        bridge              bridge              local
b55339bbfab9        docker_gwbridge     bridge              local
fe6ef5d2e8e8        host                host                local
f1nvcluv1xnf        ingress             overlay             swarm
8vty8k3pejm5        net1                overlay             swarm
893a1bbe3118        none                null                local
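net1
The overlay network created above; it carries east-west traffic between containers.
docker_gwbridge
A bridge network created by Docker; it lets containers communicate with the host they run on.
ingress
An overlay network created by Docker; Docker swarm uses it to expose services to the outside world and to implement the routing mesh.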
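The net1 network
Every service was created with the "--network net1" option, so every container instance must have an interface attached to net1. Looking at node 1, two containers are deployed on this node: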
$ docker ps
CONTAINER ID   IMAGE                COMMAND                  CREATED       STATUS       PORTS      NAMES
eb03383913fb   nvbeta/node:latest   "nodemon /src/index.j"   2 hours ago   Up 2 hours   8888/tcp   node.1.10yscmxtoymkvs3bdd4z678w4
434ce2679482   redis:latest         "docker-entrypoint.sh"   2 hours ago   Up 2 hours   6379/tcp   redis.1.1a2l4qmvg887xjpfklp4d6n7y
To list all the network namespaces on node 1, first create a symbolic link to Docker's network namespace directory:
$ cd /var/run
$ sudo ln -s /var/run/docker/netns netns
$ sudo ip netns
be663feced43
6e9de12ede80
2-8vty8k3pej
1-f1nvcluv1x
72df0265d4af
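Comparing the namespace names with the network IDs in the Docker swarm, we can guess that net1 belongs to the namespace 2-8vty8k3pej, since the ID of net1 begins with 8vty8k3pej. This can be confirmed by comparing the interfaces in the namespace with the interfaces in the containers.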
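The prefix comparison described here can be sketched in a few lines of Python. This is only an illustration built on the two listings shown earlier; the function name match_namespaces is hypothetical, and the "number, dash, ID prefix" naming pattern is an observation from this output rather than a documented guarantee.

```python
def match_namespaces(netns_names, networks):
    """Map each '<n>-<id prefix>' namespace to the network whose ID starts with that prefix."""
    matches = {}
    for ns in netns_names:
        # Names such as "2-8vty8k3pej" carry a network ID prefix after the dash;
        # names without a dash (e.g. "be663feced43") are skipped.
        _, sep, prefix = ns.partition("-")
        if not sep:
            continue
        for net_id, name in networks.items():
            if net_id.startswith(prefix):
                matches[ns] = name
    return matches


# Values copied from the `ip netns` and `docker network ls` output above.
netns_names = ["be663feced43", "6e9de12ede80", "2-8vty8k3pej", "1-f1nvcluv1x", "72df0265d4af"]
networks = {
    "cac91f9c60ff": "bridge",
    "b55339bbfab9": "docker_gwbridge",
    "fe6ef5d2e8e8": "host",
    "f1nvcluv1xnf": "ingress",
    "8vty8k3pejm5": "net1",
    "893a1bbe3118": "none",
}

print(match_namespaces(netns_names, networks))
```

Only the two overlay networks end up with a matching namespace; the remaining three namespaces are left unexplained by this comparison.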
Interfaces in the containers:
$ docker exec node.1.10yscmxtoymkvs3bdd4z678w4 ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
11040: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP mode DEFAULT group default
    link/ether 02:42:65:00:00:03 brd ff:ff:ff:ff:ff:ff
11042: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default
    link/ether 02:42:ac:12:00:04 brd ff:ff:ff:ff:ff:ff
$ docker exec redis.1.1a2l4qmvg887xjpfklp4d6n7y ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
11036: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP mode DEFAULT group default
    link/ether 02:42:65:00:00:08 brd ff:ff:ff:ff:ff:ff
11038: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default
    link/ether 02:42:ac:12:00:03 brd ff:ff:ff:ff:ff:ff
Interfaces in the 2-8vty8k3pej namespace:
$ sudo ip netns exec 2-8vty8k3pej ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: br0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP mode DEFAULT group default
    link/ether 22:37:32:66:b0:48 brd ff:ff:ff:ff:ff:ff
11035: vxlan1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master br0 state UNKNOWN mode DEFAULT group default
    link/ether 2a:30:95:63:af:75 brd ff:ff:ff:ff:ff:ff
11037: veth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master br0 state UP mode DEFAULT group default
    link/ether da:84:44:5c:91:ce brd ff:ff:ff:ff:ff:ff
11041: veth3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master br0 state UP mode DEFAULT group default
    link/ether 8a:f9:bf:c1:ec:09 brd ff:ff:ff:ff:ff:ff
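Note br0: it is a Linux bridge device, and all the other interfaces (vxlan1, veth2 and veth3) are attached to it. vxlan1 is a Linux virtual network device of the VTEP type; it is a slave of br0 and implements the VXLAN functionality.
veth2 and veth3 are veth devices, which always come in pairs: one end sits in the namespace and the other in a container. In this setup, the interface index of the end inside the container is one less than that of the end inside the namespace. Therefore veth2 (11037) in the namespace pairs with eth0 (11036) in redis, and veth3 (11041) in the namespace pairs with eth0 (11040) in node.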
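The index rule can be applied mechanically. The sketch below is a minimal illustration, assuming the "container index + 1 = peer index" relationship observed in the listings above holds; the function name pair_veths and the dictionary layout are hypothetical, and the index values are copied from the ip link output.

```python
def pair_veths(container_ifaces, namespace_ifaces):
    """Pair container interfaces with namespace veths whose index is exactly one higher."""
    by_index = {idx: name for name, idx in namespace_ifaces.items()}
    pairs = {}
    for (container, iface), idx in container_ifaces.items():
        peer = by_index.get(idx + 1)
        if peer is not None:
            pairs[(container, iface)] = peer
    return pairs


# Interface indexes as reported by `ip link` inside each container.
container_ifaces = {
    ("node", "eth0"): 11040,
    ("node", "eth1"): 11042,
    ("redis", "eth0"): 11036,
    ("redis", "eth1"): 11038,
}
# Interface indexes as reported inside the 2-8vty8k3pej namespace.
namespace_ifaces = {"br0": 2, "vxlan1": 11035, "veth2": 11037, "veth3": 11041}

for (container, iface), peer in pair_veths(container_ifaces, namespace_ifaces).items():
    print(f"{container}:{iface} <-> {peer}")
```

Only the two eth0 interfaces find a peer in this namespace; the peers of the eth1 interfaces must live somewhere else.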
We can now confirm that net1 belongs to the namespace 2-8vty8k3pej. Based on what we know so far, the network topology looks like this:
                  node 1

  +-----------+      +-----------+
  |  nodejs   |      |   redis   |
  |           |      |           |
  +--------+--+      +--------+--+
           |                  |
           |                  |
           |                  |
           |                  |
      +----+------------------+-------+ net1
       101.0.0.3          101.0.0.8
       101.0.0.4(vip)     101.0.0.2(vip)
The docker_gwbridge network
Next, compare the interfaces in the redis and node containers with the interfaces on the node 1 host. The host interfaces are:
$ ip link
...
4: docker_gwbridge: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default
    link/ether 02:42:24:f1:af:e8 brd ff:ff:ff:ff:ff:ff
5: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default
    link/ether 02:42:e4:56:7e:9a brd ff:ff:ff:ff:ff:ff
11039: veth97d586b: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker_gwbridge state UP mode DEFAULT group default
    link/ether 02:6b:d4:fc:8a:8a brd ff:ff:ff:ff:ff:ff
11043: vethefdaa0d: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker_gwbridge state UP mode DEFAULT group default
    link/ether 0a:d5:ac:22:e7:5c brd ff:ff:ff:ff:ff:ff
10876: vethceaaebe: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker_gwbridge state UP mode DEFAULT group default
    link/ether 3a:77:3d:cc:1b:45 brd ff:ff:ff:ff:ff:ff
...
Three veth devices are attached to docker_gwbridge, with indexes 11039, 11043 and 10876. This can be confirmed with the following command:
$ brctl show
bridge name        bridge id            STP enabled    interfaces
docker0            8000.0242e4567e9a    no
docker_gwbridge    8000.024224f1afe8    no             veth97d586b
                                                       vethceaaebe
                                                       vethefdaa0d
Applying the veth pairing rule derived for net1, 11039 pairs with eth1 (11038) in redis, and 11043 pairs with eth1 (11042) in node. The network topology now looks like this:
                  node 1

 172.18.0.4         172.18.0.3
+----+------------------+----------------+ docker_gwbridge
     |                  |
     |                  |
     |                  |
     |                  |
  +--+--------+      +--+--------+
  |  nodejs   |      |   redis   |
  |           |      |           |
  +--------+--+      +--------+--+
           |                  |
           |                  |
           |                  |
           |                  |
      +----+------------------+----------+ net1
       101.0.0.3          101.0.0.8
       101.0.0.4(vip)     101.0.0.2(vip)
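docker_gwbridge closely resembles the default docker0 network (also called bridge, depending on the Docker version) in a single-host Docker setup: each container has an interface attached to it and can be reached from the host it runs on. There is a difference, however. Unlike docker0, docker_gwbridge is not the network through which published ports are exposed to the outside. When a service publishes a port with the -p option, a separate network named ingress handles it.
The ingress network
List the network namespaces on the node 1 host and the networks in the Docker swarm once more: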
$ sudo ip netns
be663feced43
6e9de12ede80
2-8vty8k3pej
1-f1nvcluv1x
72df0265d4af
$ docker network ls
NETWORK ID          NAME                DRIVER              SCOPE
cac91f9c60ff        bridge              bridge              local
b55339bbfab9        docker_gwbridge     bridge              local
fe6ef5d2e8e8        host                host                local
f1nvcluv1xnf        ingress             overlay             swarm
8vty8k3pejm5        net1                overlay             swarm
893a1bbe3118        none                null                local
Clearly, the ingress network belongs to the namespace 1-f1nvcluv1x. But what is the namespace 72df0265d4af for? First look at the interfaces in it:
$ sudo ip netns exec 72df0265d4af ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
10873: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default
    link/ether 02:42:0a:ff:00:03 brd ff:ff:ff:ff:ff:ff
    inet 10.255.0.3/16 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::42:aff:feff:3/64 scope link
       valid_lft forever preferred_lft forever
10875: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether 02:42:ac:12:00:02 brd ff:ff:ff:ff:ff:ff
    inet 172.18.0.2/16 scope global eth1
       valid_lft forever preferred_lft forever
    inet6 fe80::42:acff:fe12:2/64 scope link
       valid_lft forever preferred_lft forever
eth1 (10875) pairs with vethceaaebe (10876) on the host, and we can infer that eth0 (10873) is attached to the ingress network. The details of the ingress and docker_gwbridge networks support this:
$ docker network inspect ingress
[
    {
        "Name": "ingress",
        "Id": "f1nvcluv1xnfa0t2lca52w69w",
        "Scope": "swarm",
        "Driver": "overlay",
        ....
        "Containers": {
            "ingress-sbox": {
                "Name": "ingress-endpoint",
                "EndpointID": "3d48dc8b3e960a595e52b256e565a3e71ea035bb5e77ae4d4d1c56cab50ee112",
                "MacAddress": "02:42:0a:ff:00:03",
                "IPv4Address": "10.255.0.3/16",
                "IPv6Address": ""
            }
        },
        ....
    }
]
$ docker network inspect docker_gwbridge
[
    {
        "Name": "docker_gwbridge",
        "Id": "b55339bbfab9bdad4ae51f116b028ad7188534cb05936bab973dceae8b78047d",
        "Scope": "local",
        "Driver": "bridge",
        ....
        "Containers": {
            ....
            "ingress-sbox": {
                "Name": "gateway_ingress-sbox",
                "EndpointID": "0b961253ec65349977daa3f84f079ec5e386fa0ae2e6dd80176513e7d4a8b2c3",
                "MacAddress": "02:42:ac:12:00:02",
                "IPv4Address": "172.18.0.2/16",
                "IPv6Address": ""
            }
        },
        ....
    }
]
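The MAC and IP addresses of the endpoints in the output above match those in the network namespace 72df0265d4af. This namespace therefore belongs to a hidden container named "ingress-sbox", which has two interfaces: one attached to the host and the other attached to the ingress network.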
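The MAC and IP values in these listings follow a visible pattern: after a fixed 02:42 prefix, the remaining four bytes of each MAC spell out the interface's IPv4 address in hexadecimal. The sketch below decodes it. The pattern is read off the outputs in this article and is assumed to apply only to MAC addresses that Docker generated automatically; the helper name docker_mac_to_ipv4 is hypothetical.

```python
def docker_mac_to_ipv4(mac):
    """Decode the IPv4 address embedded in a MAC of the form 02:42:aa:bb:cc:dd."""
    octets = mac.lower().split(":")
    if len(octets) != 6 or octets[:2] != ["02", "42"]:
        raise ValueError(f"not a 02:42-prefixed MAC address: {mac}")
    # Each of the last four bytes is one decimal octet of the IPv4 address.
    return ".".join(str(int(octet, 16)) for octet in octets[2:])


# MAC addresses copied from the listings in this article.
for mac in ("02:42:0a:ff:00:03", "02:42:ac:12:00:02", "02:42:65:00:00:03"):
    print(mac, "->", docker_mac_to_ipv4(mac))
```

This gives a quick cross-check when matching an interface seen in a namespace against an endpoint reported by docker network inspect.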
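One of the features of Docker Swarm is the "routing mesh": a container that publishes a port can be reached through any node in the cluster, no matter which node it actually runs on. How is this done? Let's dig further.
In our application, only the nginx service publishes a port, mapping its port 80 to port 1080 on the host, yet nginx is not running on node 1.
Continue inspecting node 1: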
$ sudo iptables -t nat -nvL
...
...
Chain DOCKER-INGRESS (2 references)
 pkts bytes target     prot opt in     out     source               destination
    0     0 DNAT       tcp  --  *      *       0.0.0.0/0            0.0.0.0/0            tcp dpt:1080 to:172.18.0.2:1080
 176K   11M RETURN     all  --  *      *       0.0.0.0/0            0.0.0.0/0
$ sudo ip netns exec 72df0265d4af iptables -nvL -t nat
...
...
Chain PREROUTING (policy ACCEPT 0 packets, 0 bytes)
 pkts bytes target     prot opt in     out     source               destination
    9   576 REDIRECT   tcp  --  *      *       0.0.0.0/0            0.0.0.0/0            tcp dpt:1080 redir ports 80
...
...
Chain POSTROUTING (policy ACCEPT 0 packets, 0 bytes)
 pkts bytes target     prot opt in     out     source               destination
    0     0 DOCKER_POSTROUTING  all  --  *      *       0.0.0.0/0            127.0.0.11
   14   896 SNAT       all  --  *      *       0.0.0.0/0            10.255.0.0/16        ipvs to:10.255.0.3
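As you can see, the iptables rules forward traffic arriving on host port 1080 directly to the hidden container 'ingress-sbox'. Inside it, the POSTROUTING chain then rewrites the packet's source address to 10.255.0.3, the address of the interface attached to the ingress network.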
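To make the sequence of rewrites easier to follow, here is a toy Python model of the three NAT rules shown above. It is a simplified sketch, not a description of how the kernel processes packets: the Packet class and function names are hypothetical, the client and node addresses are made-up documentation addresses, and the load-balancing step that chooses the final destination is deliberately left out.

```python
import ipaddress
from dataclasses import dataclass, replace

INGRESS_NET = ipaddress.ip_network("10.255.0.0/16")


@dataclass(frozen=True)
class Packet:
    src: str
    dst: str
    dport: int


def host_dnat(pkt):
    """Host nat table, DOCKER-INGRESS chain: tcp dpt:1080 goes to 172.18.0.2:1080."""
    if pkt.dport == 1080:
        return replace(pkt, dst="172.18.0.2", dport=1080)
    return pkt


def sbox_redirect(pkt):
    """ingress-sbox nat PREROUTING: tcp dpt:1080 is redirected to port 80."""
    if pkt.dport == 1080:
        return replace(pkt, dport=80)
    return pkt


def sbox_snat(pkt):
    """ingress-sbox nat POSTROUTING: source becomes 10.255.0.3 for ingress destinations."""
    if ipaddress.ip_address(pkt.dst) in INGRESS_NET:
        return replace(pkt, src="10.255.0.3")
    return pkt


# Hypothetical client (203.0.113.7) connecting to a hypothetical node address (192.0.2.10).
incoming = Packet(src="203.0.113.7", dst="192.0.2.10", dport=1080)
after_dnat = host_dnat(incoming)
after_redirect = sbox_redirect(after_dnat)
print(after_dnat)
print(after_redirect)
```

The SNAT rule only fires once the destination lies inside 10.255.0.0/16, which is why the model applies it as a separate step after a backend on the ingress network has been chosen.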
Note the 'ipvs' match in the SNAT rule. 'ipvs' is a load balancer implemented in the Linux kernel:
$ sudo ip netns exec 72df0265d4af iptables -nvL -t mangle
Chain PREROUTING (policy ACCEPT 144 packets, 12119 bytes)
 pkts bytes target     prot opt in     out     source               destination
   87  5874 MARK       tcp  --  *      *       0.0.0.0/0            0.0.0.0/0            tcp dpt:1080 MARK set 0x12c
...
...
Chain OUTPUT (policy ACCEPT 15 packets, 936 bytes)
 pkts bytes target     prot opt in     out     source               destination
    0     0 MARK       all  --  *      *       0.0.0.0/0            10.255.0.2           MARK set 0x12c
...
...
The iptables rules mark the flow with 0x12c (= 300), and ipvs is configured as follows:
$ sudo ip netns exec 72df0265d4af ipvsadm -ln
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
FWM  300 rr
  -> 10.255.0.5:0                 Masq    1      0          0
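The nginx container on the other node has the IP address 10.255.0.5, and it is the only backend managed by the load balancer. Putting everything together, the complete set of network connections is shown in the figure below:
[Figure: complete network connections]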
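The link between the iptables mark and the ipvs entry can be sketched as follows: the mark 0x12c set in the mangle table is the same number as the firewall-mark service "FWM 300", and the "rr" scheduler hands out its backends in turn. This is an illustrative model only; parse_mark and VirtualService are hypothetical names, and real ipvs scheduling happens inside the kernel.

```python
from itertools import cycle


def parse_mark(rule_text):
    """Extract the mark from text such as 'MARK set 0x12c' and return it as an integer."""
    return int(rule_text.split("MARK set")[1].split()[0], 16)


class VirtualService:
    """A firewall-mark virtual service with round-robin ('rr') scheduling."""

    def __init__(self, fwmark, backends):
        self.fwmark = fwmark
        self._rotation = cycle(backends)

    def pick(self):
        """Return the next backend in round-robin order."""
        return next(self._rotation)


# One service keyed by firewall mark 300, with the single backend from the ipvsadm output.
services = {300: VirtualService(300, ["10.255.0.5"])}

mark = parse_mark("tcp dpt:1080 MARK set 0x12c")
backend = services[mark].pick()
print(mark, backend)
```

With a single backend every connection goes to 10.255.0.5; if the nginx service were scaled up, the same round-robin rotation would spread connections over the additional backends.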
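Summary
A lot of cool things happen behind the scenes in Docker Swarm networking, which makes it easy to develop applications that span multiple hosts, even across cloud environments. Digging into the low-level details helps with troubleshooting and debugging during development.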