Docker Networking: A Deep Dive into Overlay

Deployment

First, set up a Docker swarm cluster with two nodes, named node 1 and node 2 (creating the swarm itself is not covered here). Then create an overlay network and three services, each with a single replica:

 
  docker network create --opt encrypted --subnet 100.0.0.0/24 -d overlay net1

  docker service create --name redis --network net1 redis
  docker service create --name node --network net1 nvbeta/node
  docker service create --name nginx --network net1 -p 1080:80 nvbeta/swarm_nginx

These commands create a typical three-tier application. nginx is the front-end load balancer that distributes incoming user requests to the node service; node is a web service that queries redis and returns the result to the user through nginx. For simplicity, only a single node replica is created.

The logical view of the application:
[figure: logical view of the three-tier application]
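As a quick sanity check before digging in (a sketch; IDs and the time it takes for replicas to converge will differ in your cluster), confirm that all three services are up with one replica each, and see which node the nginx task landed on:

  $ docker service ls          # expect redis, node and nginx with REPLICAS 1/1
  $ docker service ps nginx    # shows the node running the nginx task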

Networks

Let's look at the networks that already exist in the Docker swarm:

 
  $ docker network ls
  NETWORK ID          NAME              DRIVER    SCOPE
  cac91f9c60ff        bridge            bridge    local
  b55339bbfab9        docker_gwbridge   bridge    local
  fe6ef5d2e8e8        host              host      local
  f1nvcluv1xnf        ingress           overlay   swarm
  8vty8k3pejm5        net1              overlay   swarm
  893a1bbe3118        none              null      local

net1

The overlay network we just created; it carries east-west traffic between the containers.

docker_gwbridge

A bridge network created by Docker; it allows containers to communicate with their host.

ingress

An overlay network created by Docker; the swarm uses it to expose services to the outside world and to provide the routing-mesh feature.
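These attributes can be read straight from Docker (a sketch; the -f template simply prints each network's driver, scope and subnet):

  $ docker network inspect -f \
        '{{.Name}}: {{.Driver}}/{{.Scope}} {{range .IPAM.Config}}{{.Subnet}}{{end}}' \
        net1 docker_gwbridge ingress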

The net1 network

Every service was created with the "--network net1" option, so every container instance must have an interface attached to net1. On node 1, two containers are deployed:

 
  $ docker ps
  CONTAINER ID   IMAGE                COMMAND                    CREATED       STATUS       PORTS      NAMES
  eb03383913fb   nvbeta/node:latest   "nodemon /src/index.j"     2 hours ago   Up 2 hours   8888/tcp   node.1.10yscmxtoymkvs3bdd4z678w4
  434ce2679482   redis:latest         "docker-entrypoint.sh"     2 hours ago   Up 2 hours   6379/tcp   redis.1.1a2l4qmvg887xjpfklp4d6n7y
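docker network inspect can also confirm, per host, which containers are attached to net1 and which net1 addresses they were given (a sketch; the Go template just walks the Containers map of the inspect output):

  $ docker network inspect -f \
        '{{range .Containers}}{{.Name}} -> {{.IPv4Address}}{{println}}{{end}}' net1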

By creating a symlink to Docker's network namespace directory, we can list all the network namespaces on node 1:

 
  $ cd /var/run
  $ sudo ln -s /var/run/docker/netns netns
  $ sudo ip netns
  be663feced43
  6e9de12ede80
  2-8vty8k3pej
  1-f1nvcluv1x
  72df0265d4af

Comparing the namespace names with the network IDs in the swarm, we can guess that net1, whose ID starts with 8vty8k3pej, lives in the 2-8vty8k3pej namespace. We can confirm this by comparing the interfaces in that namespace with the interfaces inside the containers.
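The naming is no accident: the namespace name embeds a prefix of the overlay network's ID, which you can print directly (a sketch; the full ID will of course differ in your cluster):

  $ docker network inspect -f '{{.Id}}' net1
  # starts with 8vty8k3pej..., matching the netns name 2-8vty8k3pej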

Interfaces in the containers:

 
  $ docker exec node.1.10yscmxtoymkvs3bdd4z678w4 ip link
  1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default
      link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
  11040: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP mode DEFAULT group default
      link/ether 02:42:65:00:00:03 brd ff:ff:ff:ff:ff:ff
  11042: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default
      link/ether 02:42:ac:12:00:04 brd ff:ff:ff:ff:ff:ff

 
  $ docker exec redis.1.1a2l4qmvg887xjpfklp4d6n7y ip link
  1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default
      link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
  11036: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP mode DEFAULT group default
      link/ether 02:42:65:00:00:08 brd ff:ff:ff:ff:ff:ff
  11038: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default
      link/ether 02:42:ac:12:00:03 brd ff:ff:ff:ff:ff:ff

Interfaces in the 2-8vty8k3pej namespace:

 
  $ sudo ip netns exec 2-8vty8k3pej ip link
  1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default
      link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
  2: br0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP mode DEFAULT group default
      link/ether 22:37:32:66:b0:48 brd ff:ff:ff:ff:ff:ff
  11035: vxlan1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master br0 state UNKNOWN mode DEFAULT group default
      link/ether 2a:30:95:63:af:75 brd ff:ff:ff:ff:ff:ff
  11037: veth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master br0 state UP mode DEFAULT group default
      link/ether da:84:44:5c:91:ce brd ff:ff:ff:ff:ff:ff
  11041: veth3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master br0 state UP mode DEFAULT group default
      link/ether 8a:f9:bf:c1:ec:09 brd ff:ff:ff:ff:ff:ff

Note br0: it is a Linux bridge, and all the other interfaces, including vxlan1, veth2 and veth3, are attached to it. vxlan1 is a VTEP (VXLAN Tunnel Endpoint), a Linux virtual network device enslaved to br0 that implements the VXLAN encapsulation.
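The VXLAN details of that VTEP can be inspected from inside the namespace (a sketch; the VNI value and forwarding entries depend on your cluster):

  # -d prints driver details: the VXLAN ID (VNI), the UDP data-plane port
  # (4789 by default) and the fact that the device is a slave of br0.
  $ sudo ip netns exec 2-8vty8k3pej ip -d link show vxlan1

  # The bridge forwarding database shows which remote VTEP (the other node's
  # address) each overlay MAC address is reachable through.
  $ sudo ip netns exec 2-8vty8k3pej bridge fdb show dev vxlan1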

veth2 and veth3 are veth devices. veth devices always come in pairs, with one end in the namespace and the other inside a container, and the interface index of the container end is always one less than that of the namespace end. So veth2 (11037) in the namespace is paired with eth0 (11036) in redis, and veth3 (11041) is paired with eth0 (11040) in node.
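You don't have to rely on the index-minus-one heuristic: for a veth device, /sys/class/net/<dev>/iflink holds the interface index of its peer (a sketch using the indexes from the listings above; yours will differ):

  $ docker exec node.1.10yscmxtoymkvs3bdd4z678w4 cat /sys/class/net/eth0/iflink
  # -> 11041, i.e. veth3 in the 2-8vty8k3pej namespace
  $ sudo ip netns exec 2-8vty8k3pej cat /sys/class/net/veth3/iflink
  # -> 11040, i.e. eth0 inside the node container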

We can now confirm that the net1 network lives in the 2-8vty8k3pej namespace. Based on what we know so far, the topology looks like this:

 
                     node 1

   +-----------+              +-----------+
   |  nodejs   |              |   redis   |
   |           |              |           |
   +--------+--+              +--------+--+
            |                          |
            |                          |
            |                          |
            |                          |
       -----+--------------------------+-----   net1
         101.0.0.3                 101.0.0.8
         101.0.0.4 (vip)           101.0.0.2 (vip)

The docker_gwbridge network

Now compare the interfaces inside the redis and node containers with the interfaces on the node 1 host. The host interfaces:

 
  $ ip link
  ...
  4: docker_gwbridge: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default
      link/ether 02:42:24:f1:af:e8 brd ff:ff:ff:ff:ff:ff
  5: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default
      link/ether 02:42:e4:56:7e:9a brd ff:ff:ff:ff:ff:ff
  11039: veth97d586b: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker_gwbridge state UP mode DEFAULT group default
      link/ether 02:6b:d4:fc:8a:8a brd ff:ff:ff:ff:ff:ff
  11043: vethefdaa0d: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker_gwbridge state UP mode DEFAULT group default
      link/ether 0a:d5:ac:22:e7:5c brd ff:ff:ff:ff:ff:ff
  10876: vethceaaebe: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker_gwbridge state UP mode DEFAULT group default
      link/ether 3a:77:3d:cc:1b:45 brd ff:ff:ff:ff:ff:ff
  ...

Three veth devices are attached to docker_gwbridge, with interface indexes 11039, 11043 and 10876, which can be confirmed with:

 
  $ brctl show
  bridge name       bridge id           STP enabled   interfaces
  docker0           8000.0242e4567e9a   no
  docker_gwbridge   8000.024224f1afe8   no            veth97d586b
                                                      vethceaaebe
                                                      vethefdaa0d

Applying the veth pairing rule we worked out for net1, veth97d586b (11039) is paired with eth1 (11038) in redis, and vethefdaa0d (11043) is paired with eth1 (11042) in node. The topology now looks like this:

 
                     node 1

      172.18.0.4                 172.18.0.3
 -----+--------------------------+-----   docker_gwbridge
      |                          |
      |                          |
      |                          |
      |                          |
   +--+--------+              +--+--------+
   |  nodejs   |              |   redis   |
   |           |              |           |
   +--------+--+              +--------+--+
            |                          |
            |                          |
            |                          |
            |                          |
       -----+--------------------------+-----   net1
         101.0.0.3                 101.0.0.8
         101.0.0.4 (vip)           101.0.0.2 (vip)

docker_gwbridge plays much the same role as the default docker0 bridge in standalone Docker (depending on the Docker version that network may simply be called bridge), but there is a difference: docker_gwbridge only provides outbound (egress) connectivity, giving overlay-attached containers on a host a path to the host and the external network, while east-west traffic between containers goes over the overlay network. When a container exposes a port to the outside world with the -p option, that inbound traffic is handled by yet another network, called ingress.
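The split is visible in the container's routing table (a sketch; 172.18.0.1 is the typical docker_gwbridge address and the exact subnets will differ on your hosts):

  $ docker exec node.1.10yscmxtoymkvs3bdd4z678w4 ip route
  # expect: a default route via docker_gwbridge (eth1) for egress traffic,
  # plus on-link routes for net1's subnet (eth0) and 172.18.0.0/16 (eth1)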

The ingress network

List the network namespaces on the node 1 host and the networks in the swarm once more:

 
  $ sudo ip netns
  be663feced43
  6e9de12ede80
  2-8vty8k3pej
  1-f1nvcluv1x
  72df0265d4af

  $ docker network ls
  NETWORK ID          NAME              DRIVER    SCOPE
  cac91f9c60ff        bridge            bridge    local
  b55339bbfab9        docker_gwbridge   bridge    local
  fe6ef5d2e8e8        host              host      local
  f1nvcluv1xnf        ingress           overlay   swarm
  8vty8k3pejm5        net1              overlay   swarm
  893a1bbe3118        none              null      local

Clearly the ingress network lives in the 1-f1nvcluv1x namespace. But what is the 72df0265d4af namespace for? Let's look at its interfaces first:

 
  $ sudo ip netns exec 72df0265d4af ip addr
  1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default
      link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
      inet 127.0.0.1/8 scope host lo
         valid_lft forever preferred_lft forever
      inet6 ::1/128 scope host
         valid_lft forever preferred_lft forever
  10873: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default
      link/ether 02:42:0a:ff:00:03 brd ff:ff:ff:ff:ff:ff
      inet 10.255.0.3/16 scope global eth0
         valid_lft forever preferred_lft forever
      inet6 fe80::42:aff:feff:3/64 scope link
         valid_lft forever preferred_lft forever
  10875: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
      link/ether 02:42:ac:12:00:02 brd ff:ff:ff:ff:ff:ff
      inet 172.18.0.2/16 scope global eth1
         valid_lft forever preferred_lft forever
      inet6 fe80::42:acff:fe12:2/64 scope link
         valid_lft forever preferred_lft forever

eth1 (10875) is paired with vethceaaebe (10876) on the host, and eth0 (10873) is attached to the ingress network. Inspecting the ingress and docker_gwbridge networks corroborates this:

 
  $ docker network inspect ingress
  [
      {
          "Name": "ingress",
          "Id": "f1nvcluv1xnfa0t2lca52w69w",
          "Scope": "swarm",
          "Driver": "overlay",
          ....
          "Containers": {
              "ingress-sbox": {
                  "Name": "ingress-endpoint",
                  "EndpointID": "3d48dc8b3e960a595e52b256e565a3e71ea035bb5e77ae4d4d1c56cab50ee112",
                  "MacAddress": "02:42:0a:ff:00:03",
                  "IPv4Address": "10.255.0.3/16",
                  "IPv6Address": ""
              }
          },
          ....
      }
  ]

  $ docker network inspect docker_gwbridge
  [
      {
          "Name": "docker_gwbridge",
          "Id": "b55339bbfab9bdad4ae51f116b028ad7188534cb05936bab973dceae8b78047d",
          "Scope": "local",
          "Driver": "bridge",
          ....
          "Containers": {
              ....
              "ingress-sbox": {
                  "Name": "gateway_ingress-sbox",
                  "EndpointID": "0b961253ec65349977daa3f84f079ec5e386fa0ae2e6dd80176513e7d4a8b2c3",
                  "MacAddress": "02:42:ac:12:00:02",
                  "IPv4Address": "172.18.0.2/16",
                  "IPv6Address": ""
              }
          },
          ....
      }
  ]

The MAC/IP addresses of these endpoints match the MAC/IP addresses seen in namespace 72df0265d4af. So that namespace belongs to the hidden container "ingress-sbox", which has two interfaces: one attached to the host (via docker_gwbridge) and one attached to the ingress network.

One of Docker Swarm's key features is the routing mesh: a container that publishes a port can be reached through any node in the cluster, regardless of which node it actually runs on. How is that done? Let's keep digging.

In our application, only the nginx service publishes a port, mapping its port 80 to port 1080 on the host, yet nginx is not running on node 1.
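Thanks to the routing mesh, the published port still answers on node 1 (a sketch; run it on node 1 itself, or point it at node 1's address):

  $ curl -s -o /dev/null -w '%{http_code}\n' http://localhost:1080
  # a 200 here is served by the nginx task running on the other node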

Let's keep looking at node 1:

 
  $ sudo iptables -t nat -nvL
  ...
  Chain DOCKER-INGRESS (2 references)
   pkts bytes target  prot opt in  out  source     destination
      0     0 DNAT    tcp  --  *   *    0.0.0.0/0  0.0.0.0/0    tcp dpt:1080 to:172.18.0.2:1080
   176K   11M RETURN  all  --  *   *    0.0.0.0/0  0.0.0.0/0

  $ sudo ip netns exec 72df0265d4af iptables -nvL -t nat
  ...
  Chain PREROUTING (policy ACCEPT 0 packets, 0 bytes)
   pkts bytes target    prot opt in  out  source     destination
      9   576 REDIRECT  tcp  --  *   *    0.0.0.0/0  0.0.0.0/0    tcp dpt:1080 redir ports 80
  ...
  Chain POSTROUTING (policy ACCEPT 0 packets, 0 bytes)
   pkts bytes target              prot opt in  out  source     destination
      0     0 DOCKER_POSTROUTING  all  --  *   *    0.0.0.0/0  127.0.0.11
     14   896 SNAT                all  --  *   *    0.0.0.0/0  10.255.0.0/16  ipvs to:10.255.0.3

As you can see, the iptables rules on the host forward traffic arriving on port 1080 straight to the hidden container ingress-sbox (172.18.0.2); inside the sandbox, the POSTROUTING chain then puts the packets onto the address 10.255.0.3, whose interface is attached to the ingress network.

Note the 'ipvs' in the SNAT rule. IPVS is a load balancer implemented inside the Linux kernel:

 
  $ sudo ip netns exec 72df0265d4af iptables -nvL -t mangle
  Chain PREROUTING (policy ACCEPT 144 packets, 12119 bytes)
   pkts bytes target  prot opt in  out  source     destination
     87  5874 MARK    tcp  --  *   *    0.0.0.0/0  0.0.0.0/0    tcp dpt:1080 MARK set 0x12c
  ...
  Chain OUTPUT (policy ACCEPT 15 packets, 936 bytes)
   pkts bytes target  prot opt in  out  source     destination
      0     0 MARK    all  --  *   *    0.0.0.0/0  10.255.0.2   MARK set 0x12c
  ...

The iptables rules mark the flow with 0x12c (= 300), and IPVS is configured to match that mark:

 
  $ sudo ip netns exec 72df0265d4af ipvsadm -ln
  IP Virtual Server version 1.2.1 (size=4096)
  Prot LocalAddress:Port Scheduler Flags
    -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
  FWM  300 rr
    -> 10.255.0.5:0                 Masq    1      0          0
On the other node, the nginx container was assigned the address 10.255.0.5 on the ingress network, and it is the only backend managed by this load balancer. Putting everything together, the complete set of network connections looks like this:
[figure: complete network connections across the two swarm nodes]
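A nice way to watch the ingress machinery at work (a sketch; task IPs and the sandbox namespace name will differ on your hosts) is to scale the published service and re-read the IPVS table in the ingress sandbox:

  $ docker service scale nginx=2
  $ sudo ip netns exec 72df0265d4af ipvsadm -ln
  # the FWM 300 virtual service should now list one masquerade backend per nginx task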

Summary

A lot of cool things happen behind Docker Swarm networking, and they make it easy to build applications that span multiple hosts, or even multiple clouds. Digging into the low-level details pays off when you need to locate and debug problems during development.