k8s series 06 - MetalLB as the load balancer

This article deploys MetalLB v0.12.1 as the LoadBalancer implementation on a vanilla k8s cluster, covering two deployment options: MetalLB's Layer2 mode and its BGP mode. Since the principles and configuration of BGP are fairly involved, only a simple BGP setup is shown here.

The k8s cluster used in this article runs v1.23.6 on CentOS 7, built with docker and flannel. I have written earlier posts on k8s basics and cluster setup; readers who need that background can refer to them.

1. How it works

1.1 Overview

Before we start, let's take a quick look at how MetalLB works.

MetalLB hooks into your Kubernetes cluster, and provides a network load-balancer implementation. In short, it allows you to create Kubernetes services of type LoadBalancer in clusters that don’t run on a cloud provider, and thus cannot simply hook into paid products to provide load balancers.

It has two features that work together to provide this service: address allocation, and external announcement.

MetalLB is a concrete LoadBalancer implementation for Kubernetes clusters, mainly used to expose services in a k8s cluster to the outside world. With MetalLB we can create services of type LoadBalancer in a k8s cluster without depending on a LoadBalancer provided by a cloud vendor.

It has two features that work together to provide this service: address allocation and external announcement, which map to the controller and speaker workloads deployed in k8s.

1.2 address allocation

Address allocation is easy to understand: we first give MetalLB a range of IPs, and it then assigns an IP to each LoadBalancer service according to the service's configuration. From the official documentation we know the LoadBalancer IP can either be specified manually or allocated automatically by MetalLB; we can also configure multiple IP ranges in the MetalLB configmap and decide per range whether automatic allocation is enabled.

Address allocation is handled by the controller, which runs as a deployment; it watches the state of services in the cluster and allocates IPs.
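
A minimal sketch of what such a configmap could look like in the legacy v0.12 configmap format (the second pool name and its IP range are made up purely for illustration): one pool is open for automatic allocation, while the other sets auto-assign: false and is only used when a service explicitly asks for it, e.g. via the metallb.universe.tf/address-pool annotation.

apiVersion: v1
kind: ConfigMap
metadata:
  namespace: metallb-system
  name: config
data:
  config: |
    address-pools:
    # pool that MetalLB may allocate from automatically
    - name: default
      protocol: layer2
      addresses:
      - 10.31.8.100-10.31.8.200
    # reserved pool: never auto-assigned, only handed out on explicit request
    - name: reserved
      protocol: layer2
      auto-assign: false
      addresses:
      - 10.31.9.0/24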

1.3 external announcement

The job of external announcement is to advertise the EXTERNAL-IP of LoadBalancer services to the network so that clients can actually reach that IP. MetalLB implements this in three ways: ARP/NDP and BGP. ARP and NDP correspond to Layer2 mode for IPv4 and IPv6 respectively, while the BGP routing protocol corresponds to BGP mode. External announcement is handled by the speaker, which runs as a daemonset; it sends ARP/NDP packets on the network, or establishes sessions with BGP routers and advertises BGP routes.

1.4 A note on networking

Neither Layer2 mode nor BGP mode uses the Linux network stack for the VIP itself, which means we cannot use tools such as the ip command to see exactly which node holds the VIP or the corresponding routes; what we do see is the IP attached to the kube-ipvs0 interface on every node. Also, both modes only steer traffic for the VIP to a particular node; how the request then reaches a pod, and with what load-balancing policy, is handled by kube-proxy.
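
Since everything past the VIP is in kube-proxy's hands, it is worth confirming that kube-proxy really is running in IPVS mode before going further. A quick check could look like the sketch below (nothing MetalLB-specific is assumed; the /proxyMode endpoint is served on kube-proxy's metrics port, 10249 by default, if that port is exposed locally).

# check the proxy mode configured in the kube-proxy configmap
$ kubectl get configmap kube-proxy -n kube-system -o yaml | grep -w mode

# or ask a running kube-proxy directly on a node; it should answer "ipvs"
$ curl -s http://127.0.0.1:10249/proxyMode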

The two modes each have their own advantages, drawbacks and limitations; let's deploy both first and then analyse them.

2. Preparation

2.1 System requirements

Before deploying MetalLB, we need to make sure the environment meets the minimum requirements:

  • A k8s cluster running version 1.13.0 or later, with no other load-balancer plugin installed
  • A CNI plugin on the cluster that is compatible with MetalLB
  • A range of IPv4 addresses reserved for MetalLB to use as LoadBalancer VIPs
  • For MetalLB's BGP mode, a router that supports the BGP protocol
  • For MetalLB's Layer2 mode, port 7946 must be reachable between all k8s nodes over both TCP and UDP, because memberlist is used for leader election; the port can be changed to suit your needs (a quick connectivity check is sketched below)
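
A minimal reachability check for that port between two nodes could look like the following sketch, assuming a netcat build that supports -z (otherwise bash's /dev/tcp works for the TCP part); the target IP is simply one of the other nodes in this lab:

# TCP check of the memberlist port from one node to another
$ nc -zv 10.31.8.11 7946

# bash-only fallback for the TCP check
$ timeout 3 bash -c '</dev/tcp/10.31.8.11/7946' && echo "7946/tcp reachable"

UDP 7946 also has to be open, but being connectionless it is hard to verify this way; checking the firewall rules on each node is more reliable.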

2.2 CNI compatibility

MetalLB publishes a compatibility list for the mainstream CNIs. Since MetalLB mostly relies on the kube-proxy component that ships with k8s for traffic forwarding, it is compatible with most CNIs without trouble.

| CNI | Compatibility | Main issues |
| --- | --- | --- |
| Calico | Mostly (see known issues) | Mainly BGP-mode compatibility, but the community provides workarounds |
| Canal | Yes | - |
| Cilium | Yes | - |
| Flannel | Yes | - |
| Kube-ovn | Yes | - |
| Kube-router | Mostly (see known issues) | The builtin external BGP peering mode is not supported |
| Weave Net | Mostly (see known issues) | externalTrafficPolicy: Local support depends on the version |

From this list it is clear that most combinations are fine; when compatibility problems do appear, the main cause is a conflict over BGP. In fact, BGP-related compatibility issues exist in almost every open-source k8s load balancer.

2.3 Cloud provider compatibility

In the list published by MetalLB, compatibility with most cloud providers is poor. The reason is simple: most cloud networks cannot run BGP, and the more generic Layer2 mode cannot be guaranteed to work either, because every cloud provider's network environment is different.

The short version is: cloud providers expose proprietary APIs instead of standard protocols to control their network layer, and MetalLB doesn’t work with those APIs.

Of course, if you are running on a cloud provider anyway, the best option is simply to use the LoadBalancer service offered by that provider.

3. Layer2 mode

3.1 Deployment environment

MetalLB is deployed here on a v1.23.6 k8s cluster built with docker and flannel.

| IP / CIDR | Hostname / purpose |
| --- | --- |
| 10.31.8.1 | tiny-flannel-master-8-1.k8s.tcinternal |
| 10.31.8.11 | tiny-flannel-worker-8-11.k8s.tcinternal |
| 10.31.8.12 | tiny-flannel-worker-8-12.k8s.tcinternal |
| 10.8.64.0/18 | podSubnet |
| 10.8.0.0/18 | serviceSubnet |
| 10.31.8.100-10.31.8.200 | MetalLB IP pool |

3.2 Configure ARP parameters

For Layer2 mode we need to enable strictARP in the cluster's ipvs configuration. Once it is on, kube-proxy stops answering ARP requests for addresses on interfaces other than kube-ipvs0, and MetalLB takes over that job.

Enabling strict ARP is equivalent to setting arp_ignore to 1 and arp_announce to 2, the same idea as the real-server configuration in LVS DR mode; see the explanation in my earlier article.

strict ARP configure arp_ignore and arp_announce to avoid answering ARP queries from kube-ipvs0 interface
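
If you want to see what this actually changes on a node, the corresponding kernel parameters can be inspected directly (just a sketch; with strictARP enabled in IPVS mode these are expected to be 1 and 2 respectively):

# inspect the ARP-related sysctls that strict ARP relies on
$ sysctl net.ipv4.conf.all.arp_ignore net.ipv4.conf.all.arp_announce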

# check the strictARP setting in kube-proxy
$ kubectl get configmap -n kube-system kube-proxy -o yaml | grep strictARP
      strictARP: false

# manually edit the configmap and set strictARP to true
$ kubectl edit configmap -n kube-system kube-proxy
configmap/kube-proxy edited

# or change it on the command line and review the diff first
$ kubectl get configmap kube-proxy -n kube-system -o yaml | sed -e "s/strictARP: false/strictARP: true/" | kubectl diff -f - -n kube-system

# once the diff looks right, apply the change
$ kubectl get configmap kube-proxy -n kube-system -o yaml | sed -e "s/strictARP: false/strictARP: true/" | kubectl apply -f - -n kube-system

# restart kube-proxy so the change takes effect
$ kubectl rollout restart ds kube-proxy -n kube-system

# confirm the change is in effect
$ kubectl get configmap -n kube-system kube-proxy -o yaml | grep strictARP
      strictARP: true

3.3 Deploy MetalLB

Deploying MetalLB is also very simple. The project offers three installation methods: manifest files (plain yaml), helm3, and Kustomize; here we use the manifest files.

Most official tutorials simply run kubectl against a yaml URL to keep the steps short. That is quick and convenient, but it leaves no local copy and makes later changes awkward, so I prefer downloading the yaml files and keeping them locally before deploying.

# download the two v0.12.1 deployment manifests
$ wget https://raw.githubusercontent.com/metallb/metallb/v0.12.1/manifests/namespace.yaml
$ wget https://raw.githubusercontent.com/metallb/metallb/v0.12.1/manifests/metallb.yaml

# if you want frr to handle BGP routing, download these two manifests instead
$ wget https://raw.githubusercontent.com/metallb/metallb/v0.12.1/manifests/namespace.yaml
$ wget https://raw.githubusercontent.com/metallb/metallb/v0.12.1/manifests/metallb-frr.yaml

After downloading the official yaml files, we prepare the configmap ahead of time. A reference file is provided on GitHub; Layer2 mode does not need much, so we only define the most basic parameters here:

  • protocol is set to layer2
  • addresses can be given as a CIDR (198.51.100.0/24) or as a first-last range (192.168.0.150-192.168.0.200); here we pick a range in the same subnet as the k8s nodes
$ cat > configmap-metallb.yaml <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  namespace: metallb-system
  name: config
data:
  config: |
    address-pools:
    - name: default
      protocol: layer2
      addresses:
      - 10.31.8.100-10.31.8.200
EOF

Now we can deploy. There are three steps:

  1. Deploy the namespace
  2. Deploy the deployment and the daemonset
  3. Apply the configmap
# create the namespace
$ kubectl apply -f namespace.yaml
namespace/metallb-system created
$ kubectl get ns
NAME              STATUS   AGE
default           Active   8d
kube-node-lease   Active   8d
kube-public       Active   8d
kube-system       Active   8d
metallb-system    Active   8s
nginx-quic        Active   8d

# deploy the deployment and daemonset, plus the other resources they need
$ kubectl apply -f metallb.yaml
Warning: policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
podsecuritypolicy.policy/controller created
podsecuritypolicy.policy/speaker created
serviceaccount/controller created
serviceaccount/speaker created
clusterrole.rbac.authorization.k8s.io/metallb-system:controller created
clusterrole.rbac.authorization.k8s.io/metallb-system:speaker created
role.rbac.authorization.k8s.io/config-watcher created
role.rbac.authorization.k8s.io/pod-lister created
role.rbac.authorization.k8s.io/controller created
clusterrolebinding.rbac.authorization.k8s.io/metallb-system:controller created
clusterrolebinding.rbac.authorization.k8s.io/metallb-system:speaker created
rolebinding.rbac.authorization.k8s.io/config-watcher created
rolebinding.rbac.authorization.k8s.io/pod-lister created
rolebinding.rbac.authorization.k8s.io/controller created
daemonset.apps/speaker created
deployment.apps/controller created

# this deploys the controller deployment, which watches the state of services
$ kubectl get deploy -n metallb-system
NAME         READY   UP-TO-DATE   AVAILABLE   AGE
controller   1/1     1            1           86s
# the speaker runs as a daemonset on every node to negotiate the VIP and send/receive ARP and NDP packets
$ kubectl get ds -n metallb-system
NAME      DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
speaker   3         3         3       3            3           kubernetes.io/os=linux   64s
$ kubectl get pod -n metallb-system -o wide
NAME                         READY   STATUS    RESTARTS   AGE    IP           NODE                                      NOMINATED NODE   READINESS GATES
controller-57fd9c5bb-svtjw   1/1     Running   0          117s   10.8.65.4    tiny-flannel-worker-8-11.k8s.tcinternal   <none>           <none>
speaker-bf79q                1/1     Running   0          117s   10.31.8.11   tiny-flannel-worker-8-11.k8s.tcinternal   <none>           <none>
speaker-fl5l8                1/1     Running   0          117s   10.31.8.12   tiny-flannel-worker-8-12.k8s.tcinternal   <none>           <none>
speaker-nw2fm                1/1     Running   0          117s   10.31.8.1    tiny-flannel-master-8-1.k8s.tcinternal    <none>           <none>

      
$ kubectl apply -f configmap-metallb.yaml
configmap/config created

3.4 Deploy a test service

We define our own service for testing. The test image is nginx, which by default returns the requesting client's IP and port.

$ cat > nginx-quic-lb.yaml <<EOF
apiVersion: v1
kind: Namespace
metadata:
  name: nginx-quic

---

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-lb
  namespace: nginx-quic
spec:
  selector:
    matchLabels:
      app: nginx-lb
  replicas: 4
  template:
    metadata:
      labels:
        app: nginx-lb
    spec:
      containers:
      - name: nginx-lb
        image: tinychen777/nginx-quic:latest
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 80

---

apiVersion: v1
kind: Service
metadata:
  name: nginx-lb-service
  namespace: nginx-quic
spec:
  externalTrafficPolicy: Cluster
  internalTrafficPolicy: Cluster
  selector:
    app: nginx-lb
  ports:
  - protocol: TCP
    port: 80 # match for service access port
    targetPort: 80 # match for pod access port
  type: LoadBalancer
  loadBalancerIP: 10.31.8.100
EOF

Note that in the manifest above the service's type field is set to LoadBalancer, and loadBalancerIP is set to 10.31.8.100.

Note: not every LoadBalancer implementation allows loadBalancerIP to be set.

If the implementation supports the field, the load balancer is created with the loadBalancerIP specified by the user.

If the loadBalancerIP field is not set, an ephemeral IP is assigned to the load balancer.

If loadBalancerIP is set but the implementation does not support the feature, the value is ignored.

# create the test service and check the result
$ kubectl apply -f nginx-quic-lb.yaml
namespace/nginx-quic created
deployment.apps/nginx-lb created
service/nginx-lb-service created

Check the service: TYPE has become LoadBalancer, and EXTERNAL-IP shows the 10.31.8.100 we specified.

# check the service status; TYPE has become LoadBalancer
$ kubectl get svc -n nginx-quic
NAME               TYPE           CLUSTER-IP    EXTERNAL-IP   PORT(S)        AGE
nginx-lb-service   LoadBalancer   10.8.32.221   10.31.8.100   80:30181/TCP   25h

If we now look at the full nginx-lb-service object in the cluster, we can see the ClusterIP, the LoadBalancer VIP, the nodeport details and the traffic policy (externalTrafficPolicy/internalTrafficPolicy) settings.

$ kubectl get svc -n nginx-quic nginx-lb-service -o yaml
apiVersion: v1
kind: Service
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","kind":"Service","metadata":{"annotations":{},"name":"nginx-lb-service","namespace":"nginx-quic"},"spec":{"externalTrafficPolicy":"Cluster","internalTrafficPolicy":"Cluster","loadBalancerIP":"10.31.8.100","ports":[{"port":80,"protocol":"TCP","targetPort":80}],"selector":{"app":"nginx-lb"},"type":"LoadBalancer"}}
  creationTimestamp: "2022-05-16T06:01:23Z"
  name: nginx-lb-service
  namespace: nginx-quic
  resourceVersion: "1165135"
  uid: f547842e-4547-4d01-abbc-89ac8b059a2a
spec:
  allocateLoadBalancerNodePorts: true
  clusterIP: 10.8.32.221
  clusterIPs:
  - 10.8.32.221
  externalTrafficPolicy: Cluster
  internalTrafficPolicy: Cluster
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  loadBalancerIP: 10.31.8.100
  ports:
  - nodePort: 30181
    port: 80
    protocol: TCP
    targetPort: 80
  selector:
    app: nginx-lb
  sessionAffinity: None
  type: LoadBalancer
status:
  loadBalancer:
    ingress:
    - ip: 10.31.8.100

Check the IPVS rules: forwarding rules exist for the ClusterIP, the LoadBalancer VIP and the nodeport. By default, creating a LoadBalancer also creates a nodeport service:

$ ipvsadm -ln
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  172.17.0.1:30181 rr
  -> 10.8.65.15:80                Masq    1      0          0
  -> 10.8.65.16:80                Masq    1      0          0
  -> 10.8.66.12:80                Masq    1      0          0
  -> 10.8.66.13:80                Masq    1      0          0
TCP  10.8.32.221:80 rr
  -> 10.8.65.15:80                Masq    1      0          0
  -> 10.8.65.16:80                Masq    1      0          0
  -> 10.8.66.12:80                Masq    1      0          0
  -> 10.8.66.13:80                Masq    1      0          0
TCP  10.8.64.0:30181 rr
  -> 10.8.65.15:80                Masq    1      0          0
  -> 10.8.65.16:80                Masq    1      0          0
  -> 10.8.66.12:80                Masq    1      0          0
  -> 10.8.66.13:80                Masq    1      0          0
TCP  10.8.64.1:30181 rr
  -> 10.8.65.15:80                Masq    1      0          0
  -> 10.8.65.16:80                Masq    1      0          0
  -> 10.8.66.12:80                Masq    1      0          0
  -> 10.8.66.13:80                Masq    1      0          0
TCP  10.31.8.1:30181 rr
  -> 10.8.65.15:80                Masq    1      0          0
  -> 10.8.65.16:80                Masq    1      0          0
  -> 10.8.66.12:80                Masq    1      0          0
  -> 10.8.66.13:80                Masq    1      0          0
TCP  10.31.8.100:80 rr
  -> 10.8.65.15:80                Masq    1      0          0
  -> 10.8.65.16:80                Masq    1      0          0
  -> 10.8.66.12:80                Masq    1      0          0
  -> 10.8.66.13:80                Masq    1      0          0

Use curl to verify that the service responds correctly.

$ curl 10.31.8.100:80
10.8.64.0:60854
$ curl 10.8.1.166:80
10.8.64.0:2562
$ curl 10.31.8.1:30974
10.8.64.0:1635
$ curl 10.31.8.100:80
10.8.64.0:60656

3.5 About the VIP

The LoadBalancer VIP shows up on the kube-ipvs0 interface of every k8s node:

$ ip addr show kube-ipvs0
5: kube-ipvs0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default
    link/ether 4e:ba:e8:25:cf:17 brd ff:ff:ff:ff:ff:ff
    inet 10.8.0.1/32 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.8.0.10/32 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.8.32.221/32 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.31.8.100/32 scope global kube-ipvs0
       valid_lft forever preferred_lft forever

Pinpointing which node actually holds the VIP is more awkward. One way is to take a machine in the same layer-2 network as the k8s cluster, look up the VIP in its ARP table, and then map the MAC address back to a node IP; that tells us which node currently owns the VIP.

$ arp -a | grep 10.31.8.100
? (10.31.8.100) at 52:54:00:5c:9c:97 [ether] on eth0

$ arp -a | grep 52:54:00:5c:9c:97
tiny-flannel-worker-8-12.k8s.tcinternal (10.31.8.12) at 52:54:00:5c:9c:97 [ether] on eth0
? (10.31.8.100) at 52:54:00:5c:9c:97 [ether] on eth0

$ ip a | grep 52:54:00:5c:9c:97
    link/ether 52:54:00:5c:9c:97 brd ff:ff:ff:ff:ff:ff

Alternatively, we can check the speaker pod logs, where the announcements of the service IP are recorded.

$ kubectl logs -f -n metallb-system speaker-fl5l8
{"caller":"level.go:63","event":"serviceAnnounced","ips":["10.31.8.100"],"level":"info","msg":"service has IP, announcing","pool":"default","protocol":"layer2","service":"nginx-quic/nginx-lb-service","ts":"2022-05-16T06:11:34.099204376Z"}
{"caller":"level.go:63","event":"serviceAnnounced","ips":["10.31.8.100"],"level":"info","msg":"service has IP, announcing","pool":"default","protocol":"layer2","service":"nginx-quic/nginx-lb-service","ts":"2022-05-16T06:12:09.527334808Z"}
{"caller":"level.go:63","event":"serviceAnnounced","ips":["10.31.8.100"],"level":"info","msg":"service has IP, announcing","pool":"default","protocol":"layer2","service":"nginx-quic/nginx-lb-service","ts":"2022-05-16T06:12:09.547734268Z"}
{"caller":"level.go:63","event":"serviceAnnounced","ips":["10.31.8.100"],"level":"info","msg":"service has IP, announcing","pool":"default","protocol":"layer2","service":"nginx-quic/nginx-lb-service","ts":"2022-05-16T06:12:34.267651651Z"}
{"caller":"level.go:63","event":"serviceAnnounced","ips":["10.31.8.100"],"level":"info","msg":"service has IP, announcing","pool":"default","protocol":"layer2","service":"nginx-quic/nginx-lb-service","ts":"2022-05-16T06:12:34.286130424Z"}

3.6 About the nodeport

As many careful readers will have noticed, when we create a LoadBalancer service, k8s also creates a nodeport service automatically by default. This behaviour is controlled by the allocateLoadBalancerNodePorts field of the Service, which defaults to true.

Different LoadBalancer implementations work differently: some depend on the nodeport for traffic forwarding, while others forward requests straight to the pods. MetalLB forwards request traffic directly to the pods via kube-proxy, so if we want to do without the nodeport we can set spec.allocateLoadBalancerNodePorts to false in the service, and no nodeport is allocated when the svc is created.

Be aware, though, that when you change an existing service from true to false, k8s does not clean up the existing ipvs rules automatically; we have to remove them ourselves.
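
A rough sketch of that manual cleanup is shown below. The nodeport 31405 is simply the value my service happened to get later in this section; use whatever your service was assigned. Also note that as long as a nodePort is still present in spec.ports of the Service, kube-proxy may recreate the rule, so clearing the nodePort field from the Service spec (for example with kubectl edit) is part of the cleanup as well.

# find the stale nodeport entries in the IPVS table
$ ipvsadm -ln | grep 31405

# delete a stale virtual service on this node (repeat per listen address and per node)
$ ipvsadm -D -t 10.31.8.1:31405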

Let's define and create the svc again:

apiVersion: v1
kind: Service
metadata:
  name: nginx-lb-service
  namespace: nginx-quic
spec:
  allocateLoadBalancerNodePorts: false
  externalTrafficPolicy: Cluster
  internalTrafficPolicy: Cluster
  selector:
    app: nginx-lb
  ports:
  - protocol: TCP
    port: 80 # match for service access port
    targetPort: 80 # match for pod access port
  type: LoadBalancer
  loadBalancerIP: 10.31.8.100

Looking at the svc and the ipvs rules again, the nodeport-related entries are gone.

$ ipvsadm -ln
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  10.8.62.180:80 rr
  -> 10.8.65.18:80                Masq    1      0          0
  -> 10.8.65.19:80                Masq    1      0          0
  -> 10.8.66.14:80                Masq    1      0          0
  -> 10.8.66.15:80                Masq    1      0          0
TCP  10.31.8.100:80 rr
  -> 10.8.65.18:80                Masq    1      0          0
  -> 10.8.65.19:80                Masq    1      0          0
  -> 10.8.66.14:80                Masq    1      0          0
  -> 10.8.66.15:80                Masq    1      0          0

$ kubectl get svc -n nginx-quic
NAME               TYPE           CLUSTER-IP    EXTERNAL-IP   PORT(S)   AGE
nginx-lb-service   LoadBalancer   10.8.62.180   10.31.8.100   80/TCP    23s

If you flip spec.allocateLoadBalancerNodePorts of an existing service from true to false, the existing nodeport is not removed automatically, so it is best to plan these parameters when the service is first created.

$ kubectl get svc -n nginx-quic nginx-lb-service -o yaml | egrep " allocateLoadBalancerNodePorts: "
  allocateLoadBalancerNodePorts: false
$ kubectl get svc -n nginx-quic
NAME               TYPE           CLUSTER-IP    EXTERNAL-IP   PORT(S)        AGE
nginx-lb-service   LoadBalancer   10.8.62.180   10.31.8.100   80:31405/TCP   85m

4. BGP mode

4.1 Network topology

The network topology of the test environment is very simple. To keep it distinct from the Layer2 setup above, the MetalLB range is changed to 10.9.0.0/16. The details are as follows:

| IP / CIDR | Hostname / purpose |
| --- | --- |
| 10.31.8.1 | tiny-flannel-master-8-1.k8s.tcinternal |
| 10.31.8.11 | tiny-flannel-worker-8-11.k8s.tcinternal |
| 10.31.8.12 | tiny-flannel-worker-8-12.k8s.tcinternal |
| 10.31.254.251 | OpenWrt |
| 10.9.0.0/16 | MetalLB BGP IP pool |

The three k8s nodes are directly connected to an OpenWrt router. OpenWrt acts as the default gateway for the k8s nodes and also runs BGP, routing requests for the VIPs used by MetalLB to the individual k8s nodes.

Before configuring anything we need to assign private AS numbers to the router and to the k8s nodes; the private AS range described on the wiki can be used as a reference. Here the router uses AS 64512 and MetalLB uses AS 64513.

4.2 Install the routing software

Taking a typical home OpenWrt router as an example, we first install the quagga packages; if your OpenWrt build ships the frr modules, using frr instead is recommended.

If you are on another Linux distribution (such as CentOS or Debian), frr is the recommended choice.

Install quagga directly with opkg on OpenWrt:

$ opkg update 
$ opkg install quagga quagga-zebra quagga-bgpd quagga-vtysh

If the OpenWrt version is recent enough, the frr packages can be installed directly with opkg:

$ opkg update 
$ opkg install frr frr-babeld frr-bfdd frr-bgpd frr-eigrpd frr-fabricd frr-isisd frr-ldpd frr-libfrr frr-nhrpd frr-ospf6d frr-ospfd frr-pbrd frr-pimd frr-ripd frr-ripngd frr-staticd frr-vrrpd frr-vtysh frr-watchfrr frr-zebra

If you go with frr, remember to enable bgpd in the daemons configuration and restart frr:

$ sed -i 's/bgpd=no/bgpd=yes/g' /etc/frr/daemons
$ /etc/init.d/frr restart

4.3 Configure BGP on the router

The configuration below uses frr as the example; with quagga you would likewise use vtysh or edit the configuration files directly, and the two differ very little.

Check that the daemons are listening on ports 2601 and 2605:

root@OpenWrt:~# netstat -ntlup | egrep "zebra|bgpd"
tcp        0      0 0.0.0.0:2601            0.0.0.0:*               LISTEN      3018/zebra
tcp        0      0 0.0.0.0:2605            0.0.0.0:*               LISTEN      3037/bgpd

Port 179 used by BGP is not being listened on yet because we have not configured anything. We can either configure it interactively with vtysh or edit the configuration file and restart the service.

Typing vtysh on the command line drops you into the vtysh configuration shell (similar to virsh in kvm virtualisation); note how the prompt changes.

root@OpenWrt:~# vtysh

Hello, this is Quagga (version 1.2.4).
Copyright 1996-2005 Kunihiro Ishiguro, et al.

OpenWrt#

Configuring on the command line is fairly tedious, so we can also edit the configuration file directly and then restart the service.

For quagga, the BGP configuration file is /etc/quagga/bgpd.conf by default; the path may differ between distributions and installation methods.

$ cat /etc/quagga/bgpd.conf
!
! Zebra configuration saved from vty
!   2022/05/19 11:01:35
!
password zebra
!
router bgp 64512
 bgp router-id 10.31.254.251
 neighbor 10.31.8.1 remote-as 64513
 neighbor 10.31.8.1 description 10-31-8-1
 neighbor 10.31.8.11 remote-as 64513
 neighbor 10.31.8.11 description 10-31-8-11
 neighbor 10.31.8.12 remote-as 64513
 neighbor 10.31.8.12 description 10-31-8-12
 maximum-paths 3
!
 address-family ipv6
 exit-address-family
 exit
!
access-list vty permit 127.0.0.0/8
access-list vty deny any
!
line vty
 access-class vty
!

With frr the configuration looks a little different; the file to edit is /etc/frr/frr.conf, and again the path may vary between distributions and installation methods.

$ cat /etc/frr/frr.conf
frr version 8.2.2
frr defaults traditional
hostname tiny-openwrt-plus
!
password zebra
!
router bgp 64512
 bgp router-id 10.31.254.251
 no bgp ebgp-requires-policy
 neighbor 10.31.8.1 remote-as 64513
 neighbor 10.31.8.1 description 10-31-8-1
 neighbor 10.31.8.11 remote-as 64513
 neighbor 10.31.8.11 description 10-31-8-11
 neighbor 10.31.8.12 remote-as 64513
 neighbor 10.31.8.12 description 10-31-8-12
 !
 address-family ipv4 unicast
 exit-address-family
exit
!
access-list vty seq 5 permit 127.0.0.0/8
access-list vty seq 10 deny any
!
line vty
 access-class vty
exit
!

Restart the service after changing the configuration:

# command to restart frr
$ /etc/init.d/frr restart
# command to restart quagga
$ /etc/init.d/quagga restart

After the restart, enter vtysh and check the BGP state:

tiny-openwrt-plus# show ip bgp summary

IPv4 Unicast Summary (VRF default):
BGP router identifier 10.31.254.251, local AS number 64512 vrf-id 0
BGP table version 0
RIB entries 0, using 0 bytes of memory
Peers 3, using 2149 KiB of memory

Neighbor        V         AS   MsgRcvd   MsgSent   TblVer  InQ OutQ  Up/Down State/PfxRcd   PfxSnt Desc
10.31.8.1       4      64513         0         0        0    0    0    never       Active        0 10-31-8-1
10.31.8.11      4      64513         0         0        0    0    0    never       Active        0 10-31-8-11
10.31.8.12      4      64513         0         0        0    0    0    never       Active        0 10-31-8-12

Total number of neighbors 3

Checking the router's listening ports again, BGP is now up and running:

$ netstat -ntlup | egrep "zebra|bgpd"
tcp        0      0 127.0.0.1:2605          0.0.0.0:*               LISTEN      31625/bgpd
tcp        0      0 127.0.0.1:2601          0.0.0.0:*               LISTEN      31618/zebra
tcp        0      0 0.0.0.0:179             0.0.0.0:*               LISTEN      31625/bgpd
tcp        0      0 :::179                  :::*                    LISTEN      31625/bgpd

4.4 Configure MetalLB for BGP

First we modify the configmap:

apiVersion: v1
kind: ConfigMap
metadata:
  namespace: metallb-system
  name: config
data:
  config: |
    peers:
    - peer-address: 10.31.254.251
      peer-port: 179
      peer-asn: 64512
      my-asn: 64513
    address-pools:
    - name: default
      protocol: bgp
      addresses:
      - 10.9.0.0/16

After the change, we re-apply the configmap and check MetalLB's state:

$ kubectl apply -f configmap-metal.yaml
configmap/config configured

$ kubectl get cm -n metallb-system config -o yaml
apiVersion: v1
data:
  config: |
    peers:
    - peer-address: 10.31.254.251
      peer-port: 179
      peer-asn: 64512
      my-asn: 64513
    address-pools:
    - name: default
      protocol: bgp
      addresses:
      - 10.9.0.0/16
kind: ConfigMap
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","data":{"config":"peers:\n- peer-address: 10.31.254.251\n  peer-port: 179\n  peer-asn: 64512\n  my-asn: 64513\naddress-pools:\n- name: default\n  protocol: bgp\n  addresses:\n  - 10.9.0.0/16\n"},"kind":"ConfigMap","metadata":{"annotations":{},"name":"config","namespace":"metallb-system"}}
  creationTimestamp: "2022-05-16T04:37:54Z"
  name: config
  namespace: metallb-system
  resourceVersion: "1412854"
  uid: 6d94ca36-93fe-4ea2-9407-96882ad8e35c

The router now shows BGP sessions established with all three k8s nodes.

tiny-openwrt-plus# show ip bgp summary

IPv4 Unicast Summary (VRF default):
BGP router identifier 10.31.254.251, local AS number 64512 vrf-id 0
BGP table version 3
RIB entries 5, using 920 bytes of memory
Peers 3, using 2149 KiB of memory

Neighbor        V         AS   MsgRcvd   MsgSent   TblVer  InQ OutQ  Up/Down State/PfxRcd   PfxSnt Desc
10.31.8.1       4      64513         6         4        0    0    0 00:00:45            3        3 10-31-8-1
10.31.8.11      4      64513         6         4        0    0    0 00:00:45            3        3 10-31-8-11
10.31.8.12      4      64513         6         4        0    0    0 00:00:45            3        3 10-31-8-12

Total number of neighbors 3

If the BGP session to a particular node fails to come up, restart the speaker on that node to retry the connection:

$ kubectl delete po speaker-fl5l8 -n metallb-system

4.5 Configure the Service

After the configmap change takes effect, the EXTERNAL-IP of existing services is not re-allocated:

$ kubectl get svc -n nginx-quic
NAME               TYPE           CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE
nginx-lb-service   LoadBalancer   10.8.4.92    10.31.8.100   80/TCP    18h

We can restart the controller so that it re-allocates an EXTERNAL-IP for our service:

$ kubectl delete po -n metallb-system controller-57fd9c5bb-svtjw
pod "controller-57fd9c5bb-svtjw" deleted

After the restart, check the svc again. If the LoadBalancer VIP in a svc was allocated automatically (i.e. no loadBalancerIP was specified), it should already have received a new IP and be working; but since this service's loadBalancerIP was manually pinned to 10.31.8.100 earlier, its EXTERNAL-IP now shows pending.

$ kubectl get svc -n nginx-quic
NAME               TYPE           CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE
nginx-lb-service   LoadBalancer   10.8.4.92    <pending>     80/TCP    18h

重新修改loadBalancerIP10.9.1.1,此时可以看到服务已经正常

$ kubectl get svc -n nginx-quic
NAME               TYPE           CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE
nginx-lb-service   LoadBalancer   10.8.4.92    10.9.1.1      80/TCP    18h

The controller log shows the following:

$ kubectl logs controller-57fd9c5bb-d6jsl -n metallb-system
{"branch":"HEAD","caller":"level.go:63","commit":"v0.12.1","goversion":"gc / go1.16.14 / amd64","level":"info","msg":"MetalLB controller starting version 0.12.1 (commit v0.12.1, branch HEAD)","ts":"2022-05-18T03:45:45.440872105Z","version":"0.12.1"}
{"caller":"level.go:63","configmap":"metallb-system/config","event":"configLoaded","level":"info","msg":"config (re)loaded","ts":"2022-05-18T03:45:45.610395481Z"}
{"caller":"level.go:63","error":"[\"10.31.8.100\"] is not allowed in config","event":"clearAssignment","level":"info","msg":"current IP not allowed by config, clearing","service":"nginx-quic/nginx-lb-service","ts":"2022-05-18T03:45:45.611009691Z"}
{"caller":"level.go:63","event":"clearAssignment","level":"info","msg":"user requested a different IP than the one currently assigned","reason":"differentIPRequested","service":"nginx-quic/nginx-lb-service","ts":"2022-05-18T03:45:45.611062419Z"}
{"caller":"level.go:63","error":"controller not synced","level":"error","msg":"controller not synced yet, cannot allocate IP; will retry after sync","op":"allocateIPs","service":"nginx-quic/nginx-lb-service","ts":"2022-05-18T03:45:45.611080525Z"}
{"caller":"level.go:63","event":"stateSynced","level":"info","msg":"controller synced, can allocate IPs now","ts":"2022-05-18T03:45:45.611117023Z"}
{"caller":"level.go:63","error":"[\"10.31.8.100\"] is not allowed in config","event":"clearAssignment","level":"info","msg":"current IP not allowed by config, clearing","service":"nginx-quic/nginx-lb-service","ts":"2022-05-18T03:45:45.617013146Z"}
{"caller":"level.go:63","event":"clearAssignment","level":"info","msg":"user requested a different IP than the one currently assigned","reason":"differentIPRequested","service":"nginx-quic/nginx-lb-service","ts":"2022-05-18T03:45:45.617089367Z"}
{"caller":"level.go:63","error":"[\"10.31.8.100\"] is not allowed in config","level":"error","msg":"IP allocation failed","op":"allocateIPs","service":"nginx-quic/nginx-lb-service","ts":"2022-05-18T03:45:45.617122976Z"}
{"caller":"level.go:63","event":"serviceUpdated","level":"info","msg":"updated service object","service":"nginx-quic/nginx-lb-service","ts":"2022-05-18T03:45:45.626039403Z"}
{"caller":"level.go:63","error":"[\"10.31.8.100\"] is not allowed in config","level":"error","msg":"IP allocation failed","op":"allocateIPs","service":"nginx-quic/nginx-lb-service","ts":"2022-05-18T03:45:45.626361986Z"}
{"caller":"level.go:63","event":"ipAllocated","ip":["10.9.1.1"],"level":"info","msg":"IP address assigned by controller","service":"nginx-quic/nginx-lb-service","ts":"2022-05-18T03:47:19.943434144Z"}

The speaker log shows the BGP session to the router being established, errors about the loadBalancerIP 10.31.8.100 that is no longer allowed by the config, and finally the BGP route being advertised for loadBalancerIP 10.9.1.1.

$ kubectl logs -n metallb-system speaker-bf79q

{"caller":"level.go:63","configmap":"metallb-system/config","event":"peerAdded","level":"info","msg":"peer configured, starting BGP session","peer":"10.31.254.251","ts":"2022-05-18T03:41:55.046091105Z"}
{"caller":"level.go:63","configmap":"metallb-system/config","event":"configLoaded","level":"info","msg":"config (re)loaded","ts":"2022-05-18T03:41:55.046268735Z"}
{"caller":"level.go:63","error":"assigned IP not allowed by config","ips":["10.31.8.100"],"level":"error","msg":"IP allocated by controller not allowed by config","op":"setBalancer","service":"nginx-quic/nginx-lb-service","ts":"2022-05-18T03:41:55.051955069Z"}
struct { Version uint8; ASN16 uint16; HoldTime uint16; RouterID uint32; OptsLen uint8 }{Version:0x4, ASN16:0xfc00, HoldTime:0xb4, RouterID:0xa1ffefd, OptsLen:0x1e}
{"caller":"level.go:63","event":"sessionUp","level":"info","localASN":64513,"msg":"BGP session established","peer":"10.31.254.251:179","peerASN":64512,"ts":"2022-05-18T03:41:55.052734174Z"}
{"caller":"level.go:63","level":"info","msg":"triggering discovery","op":"memberDiscovery","ts":"2022-05-18T03:42:40.183574415Z"}
{"caller":"level.go:63","level":"info","msg":"node event - forcing sync","node addr":"10.31.8.12","node event":"NodeLeave","node name":"tiny-flannel-worker-8-12.k8s.tcinternal","ts":"2022-05-18T03:44:03.649494062Z"}
{"caller":"level.go:63","error":"assigned IP not allowed by config","ips":["10.31.8.100"],"level":"error","msg":"IP allocated by controller not allowed by config","op":"setBalancer","service":"nginx-quic/nginx-lb-service","ts":"2022-05-18T03:44:03.655003303Z"}
{"caller":"level.go:63","level":"info","msg":"node event - forcing sync","node addr":"10.31.8.12","node event":"NodeJoin","node name":"tiny-flannel-worker-8-12.k8s.tcinternal","ts":"2022-05-18T03:44:06.247929645Z"}
{"caller":"level.go:63","error":"assigned IP not allowed by config","ips":["10.31.8.100"],"level":"error","msg":"IP allocated by controller not allowed by config","op":"setBalancer","service":"nginx-quic/nginx-lb-service","ts":"2022-05-18T03:44:06.25369106Z"}
{"caller":"level.go:63","event":"updatedAdvertisements","ips":["10.9.1.1"],"level":"info","msg":"making advertisements using BGP","numAds":1,"pool":"default","protocol":"bgp","service":"nginx-quic/nginx-lb-service","ts":"2022-05-18T03:47:19.953729779Z"}
{"caller":"level.go:63","event":"serviceAnnounced","ips":["10.9.1.1"],"level":"info","msg":"service has IP, announcing","pool":"default","protocol":"bgp","service":"nginx-quic/nginx-lb-service","ts":"2022-05-18T03:47:19.953912236Z"}

We test from any machine outside the cluster:

$ curl -v 10.9.1.1
* About to connect() to 10.9.1.1 port 80 (#0)
*   Trying 10.9.1.1...
* Connected to 10.9.1.1 (10.9.1.1) port 80 (#0)
> GET / HTTP/1.1
> User-Agent: curl/7.29.0
> Host: 10.9.1.1
> Accept: */*
>
< HTTP/1.1 200 OK
< Server: nginx
< Date: Wed, 18 May 2022 04:17:41 GMT
< Content-Type: text/plain
< Content-Length: 16
< Connection: keep-alive
<
10.8.64.0:43939
* Connection #0 to host 10.9.1.1 left intact

4.6 Check ECMP

Looking at the routes on the router, there is now a 10.9.1.1/32 route whose next hop consists of multiple IPs, which means ECMP is working.

tiny-openwrt-plus# show ip route
Codes: K - kernel route, C - connected, S - static, R - RIP,
       O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
       T - Table, v - VNC, V - VNC-Direct, A - Babel, F - PBR,
       f - OpenFabric,
       > - selected route, * - FIB route, q - queued, r - rejected, b - backup
       t - trapped, o - offload failure

K>* 0.0.0.0/0 [0/0] via 10.31.254.254, eth0, 00:04:52
B>* 10.9.1.1/32 [20/0] via 10.31.8.1, eth0, weight 1, 00:01:40
  *                    via 10.31.8.11, eth0, weight 1, 00:01:40
  *                    via 10.31.8.12, eth0, weight 1, 00:01:40
C>* 10.31.0.0/16 is directly connected, eth0, 00:04:52

Let's create a few more services for testing:

# kubectl get svc -n nginx-quic
NAME                TYPE           CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE
nginx-lb-service    LoadBalancer   10.8.4.92    10.9.1.1      80/TCP    23h
nginx-lb2-service   LoadBalancer   10.8.10.48   10.9.1.2      80/TCP    64m
nginx-lb3-service   LoadBalancer   10.8.6.116   10.9.1.3      80/TCP    64m

And check the router again:

tiny-openwrt-plus# show ip bgp
BGP table version is 3, local router ID is 10.31.254.251, vrf id 0
Default local pref 100, local AS 64512
Status codes:  s suppressed, d damped, h history, * valid, > best, = multipath,
               i internal, r RIB-failure, S Stale, R Removed
Nexthop codes: @NNN nexthop's vrf id, < announce-nh-self
Origin codes:  i - IGP, e - EGP, ? - incomplete
RPKI validation codes: V valid, I invalid, N Not found

   Network          Next Hop            Metric LocPrf Weight Path
*= 10.9.1.1/32      10.31.8.12                             0 64513 ?
*>                  10.31.8.1                              0 64513 ?
*=                  10.31.8.11                             0 64513 ?
*= 10.9.1.2/32      10.31.8.12                             0 64513 ?
*>                  10.31.8.1                              0 64513 ?
*=                  10.31.8.11                             0 64513 ?
*= 10.9.1.3/32      10.31.8.12                             0 64513 ?
*>                  10.31.8.1                              0 64513 ?
*=                  10.31.8.11                             0 64513 ?

Displayed  3 routes and 9 total paths


tiny-openwrt-plus# show ip route
Codes: K - kernel route, C - connected, S - static, R - RIP,
       O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
       T - Table, v - VNC, V - VNC-Direct, A - Babel, F - PBR,
       f - OpenFabric,
       > - selected route, * - FIB route, q - queued, r - rejected, b - backup
       t - trapped, o - offload failure

K>* 0.0.0.0/0 [0/0] via 10.31.254.254, eth0, 00:06:12
B>* 10.9.1.1/32 [20/0] via 10.31.8.1, eth0, weight 1, 00:03:00
  *                    via 10.31.8.11, eth0, weight 1, 00:03:00
  *                    via 10.31.8.12, eth0, weight 1, 00:03:00
B>* 10.9.1.2/32 [20/0] via 10.31.8.1, eth0, weight 1, 00:03:00
  *                    via 10.31.8.11, eth0, weight 1, 00:03:00
  *                    via 10.31.8.12, eth0, weight 1, 00:03:00
B>* 10.9.1.3/32 [20/0] via 10.31.8.1, eth0, weight 1, 00:03:00
  *                    via 10.31.8.11, eth0, weight 1, 00:03:00
  *                    via 10.31.8.12, eth0, weight 1, 00:03:00
C>* 10.31.0.0/16 is directly connected, eth0, 00:06:12

ECMP is configured correctly only when the routing table shows multiple next hops for our LoadBalancer IPs; otherwise the BGP configuration needs to be checked.
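
If only a single next hop shows up, one thing worth checking (a sketch for frr via vtysh; whether it is needed depends on your daemon's defaults, which is why the quagga example above sets maximum-paths 3 explicitly) is the multipath setting under the BGP address family:

tiny-openwrt-plus# configure terminal
tiny-openwrt-plus(config)# router bgp 64512
tiny-openwrt-plus(config-router)# address-family ipv4 unicast
tiny-openwrt-plus(config-router-af)# maximum-paths 3
tiny-openwrt-plus(config-router-af)# end
tiny-openwrt-plus# write memory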

5. Summary

5.1 Layer2 mode: pros and cons

Pros:

  • Highly general: unlike BGP mode it needs no BGP-capable router and works in almost any network environment, cloud provider networks being the exception

Cons:

  • All traffic for a VIP lands on a single node, which can easily become a bottleneck
  • When the node holding the VIP goes down, failover takes relatively long (typically around 10s); MetalLB relies on memberlist for leader election, and re-electing after a node failure takes longer than the VRRP protocol used by traditional keepalived
  • It is hard to locate the node holding the VIP: MetalLB offers no simple, direct way to see which node owns it, so you are basically left with packet captures or pod logs, which gets very painful as the cluster grows

Possible improvements:

  • If the environment allows it, consider BGP mode
  • If neither BGP mode nor Layer2 mode is acceptable, you are basically out of luck with the three mainstream open-source load balancers (all of them offer only Layer2 and BGP modes, work on similar principles, and share the same pros and cons)

5.2 BGP mode: pros and cons

The pros and cons of BGP mode are almost the mirror image of Layer2 mode.

Pros:

  • No single point of failure: with ECMP enabled, every node in the k8s cluster receives request traffic and takes part in load balancing and forwarding

Cons:

  • The requirements are demanding: a BGP-capable router is needed and the configuration is more complex;
  • ECMP failover is not particularly graceful, and how bad it is depends on the ECMP algorithm in use; when cluster nodes change and BGP sessions flap, all connections are re-hashed (using 3-tuple or 5-tuple hashing), which may affect some services;

The hash values used in routers are usually not stable, so whenever the size of the backend set changes (for example when a node's BGP session goes down), existing connections are effectively re-hashed at random, which means most existing connections suddenly get forwarded to a different backend, one that may have nothing to do with the previous one and knows nothing about the connection state.

Possible improvements:

MetalLB suggests several mitigations, listed here for reference:

  • Use a more stable ECMP algorithm, such as "resilient ECMP" or "resilient LAG", to reduce the impact on existing connections when the backend set changes
  • Pin the service to specific nodes to limit the possible impact
  • Make changes during low-traffic periods
  • Split the service across two services with different LoadBalancer IPs and use DNS to shift traffic between them
  • Add transparent, user-invisible retry logic on the client side
  • Put an ingress layer behind the LoadBalancer for more graceful failover (though not every service can sit behind an ingress)
  • Accept that there will be occasional bursts of reset connections. For low-availability internal services, this may be acceptable as-is.

5.3 Pros and cons of MetalLB

Here I try to summarise some objective facts as neutrally as possible; whether each one counts as a pro or a con may differ from person to person:

  • It has been open source for a long time (compared with other cloud-native load balancers) and has a reasonable community and some traction, but the project is still in beta
  • Deployment is quick and simple, with few parameters to configure by default, so it is easy to get started
  • The official documentation is thin, covering only basic configuration and explanations; for a deeper understanding you may need to read the source code
  • Advanced management and configuration is inconvenient; getting a precise picture of a service's current state can be awkward
  • Configmap changes do not take effect very gracefully; in many cases the pods have to be restarted by hand (see the sketch below)
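
A sketch of such a manual restart, using kubectl rollout restart on the two MetalLB workloads as named by the v0.12.1 manifest:

# restart the controller and all speakers so they re-read the configmap
$ kubectl rollout restart deployment/controller -n metallb-system
$ kubectl rollout restart daemonset/speaker -n metallb-system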

Overall, MetalLB, although still in beta, fills an important gap in this area nicely and has clearly influenced some of the later open-source projects of the same kind. From a production point of view, though, my impression is that it is currently more "available and workable" than genuinely pleasant to use. Considering that MetalLB started as a personal open-source project and only recently gained a dedicated organisation to maintain it, that is understandable, and I hope it keeps getting better.


Copyright notice: this is an original article by qq_36885515, licensed under CC 4.0 BY-SA. Please include a link to the original source and this notice when republishing.