Deploying a Prometheus + node-exporter + Grafana Monitoring System on Kubernetes 1.16

Table of Contents

Environment

1. Prometheus persistent installation
1.1 Keep all the Prometheus YAML files in one place
1.2 Generate the configuration file
1.3 Create the resource
1.4 Create the Prometheus Deployment manifest
1.5 Deploy the NFS service
1.6 Configure the NFS server
1.7 Create the prometheus-volume manifest
1.8 Create RBAC objects, since Prometheus needs access to in-cluster resources
1.9 Apply the RBAC manifest
1.10 With the ConfigMap, volume, and RBAC created, deploy Prometheus
1.11 Open the Prometheus web UI

2. Monitoring the cluster nodes
2.1 Create node-exporter
2.2 Check the node-exporter Pods
2.3 Update the Prometheus ConfigMap
2.4 Reload the configuration
2.5 Check the status

3. Install Grafana and monitor the Kubernetes cluster
3.1 Install Grafana with a Deployment and persistent storage
3.2 Add a PV and PVC for Grafana
3.3 Create a NodePort Service
3.4 Create the PV and PVC (these must be created in this order)
3.5 Check the results
3.6 Access the Grafana web UI via NodeIP:NodePort
3.7 Add a data source the first time Grafana starts


Environment

Host IP                       Role
192.168.126.111 (kmaster)     master
192.168.126.112 (knode1)      node1
192.168.126.113 (knode2)      node2
192.168.126.114 (knode3)      node3

Check the Kubernetes version.
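
For example, either of the usual commands shows the version (output depends on your cluster):

kubectl version --short
kubectl get nodes -o wide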

Files used during the installation:

Kubernetes1.16下部署Prometheus+node-exporter+Grafana监_Attachments.zip (CSDN download)

1. Prometheus persistent installation

We store Prometheus data on an NFS mount and manage the configuration file with a ConfigMap, and we deploy all of the Prometheus resources in the kube-system namespace.

1.1 Keep all the Prometheus YAML files in one place

mkdir /app/prometheus -p && cd /app/prometheus

1.2 Generate the configuration file

cat > prometheus.configmap.yaml <<EOF

apiVersion: v1

kind: ConfigMap

metadata:

  name: prometheus-config

  namespace: kube-system

data:

  prometheus.yml: |

    global:

      scrape_interval: 15s

      scrape_timeout: 15s

    scrape_configs:

    - job_name: 'prometheus'

      static_configs:

      - targets: ['localhost:9090']

EOF

1.3 Create the resource

[root@kmaster prometheus]# kubectl apply -f prometheus.configmap.yaml

configmap/prometheus-config created

[root@kmaster prometheus]# kubectl get configmaps -n kube-system |grep prometheus
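
To inspect the rendered prometheus.yml, you can also dump the ConfigMap:

kubectl describe configmap prometheus-config -n kube-system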

1.4 Create the Prometheus Deployment manifest

[root@kmaster prometheus]# cat > prometheus.deploy.yaml <<EOF

apiVersion: apps/v1

kind: Deployment

metadata:

  name: prometheus

  namespace: kube-system

  labels:

    app: prometheus

spec:

  selector:

    matchLabels:

      app: prometheus

  template:

    metadata:

      labels:

        app: prometheus

    spec:

      serviceAccountName: prometheus

      containers:

      - image: prom/prometheus:v2.4.3

        name: prometheus

        command:

        - "/bin/prometheus"

        args:

        - "--config.file=/etc/prometheus/prometheus.yml"

        - "--storage.tsdb.path=/prometheus"

        - "--storage.tsdb.retention=30d"

        - "--web.enable-admin-api"  # 控制对admin HTTP API的访问,其中包括删除时间序列等功能

        - "--web.enable-lifecycle"  # 支持热更新,直接执行localhost:9090/-/reload立即生效

        ports:

        - containerPort: 9090

          protocol: TCP

          name: http

        volumeMounts:

        - mountPath: "/prometheus"

          subPath: prometheus

          name: data

        - mountPath: "/etc/prometheus"

          name: config-volume

        resources:

          requests:

            cpu: 100m

            memory: 512Mi

          limits:

            cpu: 100m

            memory: 512Mi

      securityContext:

        runAsUser: 0

      volumes:

      - name: data

        persistentVolumeClaim:

          claimName: prometheus

      - configMap:

          name: prometheus-config

        name: config-volume

---

apiVersion: v1

kind: Service

metadata:

  namespace: kube-system

  name: prometheus

  labels:

    app: prometheus

spec:

  type: NodePort

  selector:

    app: prometheus

  ports:

  - name: http

    port: 9090

EOF

When starting the program, besides pointing it at prometheus.yml (from the ConfigMap), we use storage.tsdb.path to set where the TSDB data is stored and storage.tsdb.retention to control how long data is kept; the web.enable-admin-api flag turns on access to the admin API.

The web.enable-lifecycle flag enables hot reload: once the prometheus.yml in the ConfigMap has been updated, hitting localhost:9090/-/reload applies the change immediately.

We also add a securityContext with runAsUser set to 0. Prometheus normally runs as the nobody user, which can lead to permission problems (for example when writing to the data volume) if this is not configured.
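
A minimal sketch of triggering such a reload once the Deployment from section 1.10 is running, assuming the prometheus Service defined above is reachable via kubectl port-forward (section 2.4 does the same thing directly against the Service ClusterIP):

kubectl -n kube-system port-forward svc/prometheus 9090:9090 &
curl -X POST http://localhost:9090/-/reload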

1.5 Deploy the NFS service

[root@kmaster prometheus]# yum install nfs-utils rpcbind -y

1.6 Configure the NFS server

[root@kmaster prometheus]# mkdir -p /app/k8s

[root@kmaster prometheus]# systemctl start rpcbind

[root@kmaster prometheus]# systemctl enable rpcbind

[root@kmaster prometheus]# systemctl start nfs

[root@kmaster prometheus]# systemctl enable nfs

[root@kmaster prometheus]# echo "/app/k8s  192.168.126.0/24(rw,no_root_squash,sync)" >>/etc/exports

[root@kmaster prometheus]# exportfs -r     # re-export to apply the configuration
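
To verify the export is usable (nfs-utils must also be installed on every node so the kubelet can mount the volume), a quick check from any machine in the subnet:

showmount -e 192.168.126.111
mount -t nfs 192.168.126.111:/app/k8s /mnt && umount /mnt   # optional manual mount test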

1.7 Create the prometheus-volume manifest

[root@kmaster prometheus]# cat >prometheus-volume.yaml <<EOF

apiVersion: v1

kind: PersistentVolume

metadata:

  name: prometheus

spec:

  capacity:

    storage: 10Gi

  accessModes:

  - ReadWriteOnce

  persistentVolumeReclaimPolicy: Recycle

  nfs:

    server: 192.168.126.111

    path: /app/k8s

---

apiVersion: v1

kind: PersistentVolumeClaim

metadata:

  name: prometheus

  namespace: kube-system

spec:

  accessModes:

  - ReadWriteOnce

  resources:

    requests:

      storage: 10Gi

EOF

This creates a PV and a PVC using the simple NFS export above as the storage backend.

[root@kmaster prometheus]# kubectl create -f prometheus-volume.yaml

persistentvolume/prometheus created

persistentvolumeclaim/prometheus created
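
Both objects should end up in the Bound state; a quick check:

kubectl get pv prometheus
kubectl get pvc prometheus -n kube-system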

1.8 Create RBAC objects, since Prometheus needs access to in-cluster resources

[root@kmaster prometheus]# cat > prometheus-rbac.yaml <<EOF

apiVersion: v1

kind: ServiceAccount

metadata:

  name: prometheus

  namespace: kube-system

---

apiVersion: rbac.authorization.k8s.io/v1

kind: ClusterRole

metadata:

  name: prometheus

rules:

- apiGroups:

  - ""

  resources:

  - nodes

  - services

  - endpoints

  - pods

  - nodes/proxy

  verbs:

  - get

  - list

  - watch

- apiGroups:

  - ""

  resources:

  - configmaps

  - nodes/metrics

  verbs:

  - get

- nonResourceURLs:

  - /metrics

  verbs:

  - get

---

apiVersion: rbac.authorization.k8s.io/v1

kind: ClusterRoleBinding

metadata:

  name: prometheus

roleRef:

  apiGroup: rbac.authorization.k8s.io

  kind: ClusterRole

  name: prometheus

subjects:

- kind: ServiceAccount

  name: prometheus

  namespace: kube-system

EOF

1.9 Apply the RBAC manifest

[root@kmaster prometheus]# kubectl create -f prometheus-rbac.yaml

serviceaccount/prometheus created

clusterrole.rbac.authorization.k8s.io/prometheus created

clusterrolebinding.rbac.authorization.k8s.io/prometheus created
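
As an optional sanity check (run as a cluster-admin user), kubectl auth can-i can confirm that the ServiceAccount now has the expected read access:

kubectl auth can-i list nodes --as=system:serviceaccount:kube-system:prometheus
kubectl auth can-i get /metrics --as=system:serviceaccount:kube-system:prometheus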

1.10 With the ConfigMap, volume, and RBAC created, deploy Prometheus

[root@kmaster prometheus]#  kubectl create -f prometheus.deploy.yaml

deployment.apps/prometheus created
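
Wait until the Pod is running before moving on, for example:

kubectl -n kube-system rollout status deployment/prometheus
kubectl -n kube-system get pods -l app=prometheus -o wide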

1.11 Open the Prometheus web UI

Query the Service port with the following command.
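
For example, for the Service defined above in kube-system (the NodePort value is assigned by the cluster, so it will differ per environment):

kubectl get svc prometheus -n kube-system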

Then open NodeIP:NodePort in a browser.

2. Monitoring the cluster nodes

First we need to monitor the cluster nodes themselves. There are already plenty of mature solutions for this, such as Nagios or Zabbix, or we could even collect the data ourselves. Here we let Prometheus scrape node metrics exposed by node_exporter, which collects all the common server-level metrics such as cpu, diskstats, loadavg, meminfo and netstat; see the project's GitHub repo for the full list of collectors.

We deploy it with a DaemonSet controller so that every node runs exactly one Pod, and Pods are added or removed automatically as nodes join or leave the cluster.

[root@kmaster prometheus]# cat > prometheus-node-exporter.yaml <<EOF

apiVersion: apps/v1

kind: DaemonSet

metadata:

  name: node-exporter

  namespace: kube-system

  labels:

    name: node-exporter

    k8s-app: node-exporter

spec:

  selector:

    matchLabels:

      name: node-exporter

  template:

    metadata:

      labels:

        name: node-exporter

        app: node-exporter

    spec:

      hostPID: true

      hostIPC: true

      hostNetwork: true

      containers:

      - name: node-exporter

        image: prom/node-exporter:v0.16.0

        ports:

        - containerPort: 9100

        resources:

          requests:

            cpu: 0.15

        securityContext:

          privileged: true

        args:

        - --path.procfs

        - /host/proc

        - --path.sysfs

        - /host/sys

        - --collector.filesystem.ignored-mount-points

        - '"^/(sys|proc|dev|host|etc)($|/)"'

        volumeMounts:

        - name: dev

          mountPath: /host/dev

        - name: proc

          mountPath: /host/proc

        - name: sys

          mountPath: /host/sys

        - name: rootfs

          mountPath: /rootfs

      tolerations:

      - key: "node-role.kubernetes.io/master"

        operator: "Exists"

        effect: "NoSchedule"  # 使node_exporter可以调度到master节点

      volumes:

        - name: proc

          hostPath:

            path: /proc

        - name: dev

          hostPath:

            path: /dev

        - name: sys

          hostPath:

            path: /sys

        - name: rootfs

          hostPath:

            path: /

EOF

2.1 Create node-exporter

[root@kmaster prometheus]# kubectl create -f prometheus-node-exporter.yaml

daemonset.apps/node-exporter created

2.2 Check the node-exporter Pods

[root@kmaster prometheus]# kubectl get pod -n kube-system -o wide|grep node-exporter
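
Because the DaemonSet runs with hostNetwork: true, every node exposes the metrics directly on port 9100; a quick spot check against one of the nodes from the environment table:

curl -s http://192.168.126.112:9100/metrics | head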

2.3 Update the Prometheus ConfigMap

[root@kmaster prometheus]# cat prometheus.configmap.1.yaml

apiVersion: v1

kind: ConfigMap

metadata:

  name: prometheus-config

  namespace: kube-system

  annotations:

    prometheus.io/port: "9153"

    prometheus.io/scrape: "true"

data:

  prometheus.yml: |

    global:

      scrape_interval: 15s

      scrape_timeout: 15s

    scrape_configs:

    - job_name: 'prometheus'

      static_configs:

      - targets: ['localhost:9090']

    - job_name: 'kubernetes-node'

      kubernetes_sd_configs:

      - role: node

      relabel_configs:

      - source_labels: [__address__]

        regex: '(.*):10250'

        replacement: '${1}:9100'

        target_label: __address__

        action: replace

      - action: labelmap

        regex: __meta_kubernetes_node_label_(.+)

    - job_name: 'kubernetes-cadvisor'

      kubernetes_sd_configs:

      - role: node

      scheme: https

      tls_config:

        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt

      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

      relabel_configs:

      - action: labelmap

        regex: __meta_kubernetes_node_label_(.+)

      - target_label: __address__

        replacement: kubernetes.default.svc:443

      - source_labels: [__meta_kubernetes_node_name]

        regex: (.+)

        target_label: __metrics_path__

        replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor

    - job_name: kubernetes-apiservers

      kubernetes_sd_configs:

      - role: endpoints

      relabel_configs:

      - action: keep

        regex: default;kubernetes;https

        source_labels:

        - __meta_kubernetes_namespace

        - __meta_kubernetes_service_name

        - __meta_kubernetes_endpoint_port_name

      scheme: https

      tls_config:

        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt

        insecure_skip_verify: true

      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

    - job_name: 'kubernetes-service-endpoints'

      kubernetes_sd_configs:

      - role: endpoints

      relabel_configs:

      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]

        action: keep

        regex: true

      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]

        action: replace

        target_label: __scheme__

        regex: (https?)

      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]

        action: replace

        target_label: __metrics_path__

        regex: (.+)

      - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]

        action: replace

        target_label: __address__

        regex: ([^:]+)(?::\d+)?;(\d+)

        replacement: $1:$2

      - action: labelmap

        regex: __meta_kubernetes_service_label_(.+)

      - source_labels: [__meta_kubernetes_namespace]

        action: replace

        target_label: kubernetes_namespace

      - source_labels: [__meta_kubernetes_service_name]

        action: replace

        target_label: kubernetes_name

2.4 Reload the configuration

Apply the updated ConfigMap, then trigger a reload against the Prometheus Service ClusterIP (10.102.246.138 here, as reported by kubectl get svc -n kube-system). It can take up to a minute for the updated ConfigMap volume to be synced into the Pod, so the reload may need to be repeated.

[root@kmaster prometheus]# kubectl apply -f prometheus.configmap.1.yaml

configmap/prometheus-config configured

[root@kmaster prometheus]# curl -X POST http://10.102.246.138:9090/-/reload

2.5 Check the status
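
On the web UI, Status -> Targets should now list the prometheus, kubernetes-node, kubernetes-cadvisor, kubernetes-apiservers and kubernetes-service-endpoints jobs with their targets in the UP state. A rough command-line check against the Prometheus HTTP API (same ClusterIP as in 2.4):

curl -s http://10.102.246.138:9090/api/v1/targets | grep -o '"job":"[^"]*"' | sort | uniq -c
curl -s 'http://10.102.246.138:9090/api/v1/query?query=up'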

3. Install Grafana and monitor the Kubernetes cluster

3.1 Install Grafana with a Deployment and persistent storage

cat > grafana_deployment.yaml <<EOF

apiVersion: apps/v1

kind: Deployment

metadata:

  name: grafana

  namespace: kube-system

  labels:

    app: grafana

    k8s-app: grafana

spec:

  selector:

    matchLabels:

      k8s-app: grafana

      app: grafana

  revisionHistoryLimit: 10

  template:

    metadata:

      labels:

        app: grafana

        k8s-app: grafana

    spec:

      containers:

      - name: grafana

        image: grafana/grafana:5.3.4

        imagePullPolicy: IfNotPresent

        ports:

        - containerPort: 3000

          name: grafana

        env:

        - name: GF_SECURITY_ADMIN_USER

          value: admin

        - name: GF_SECURITY_ADMIN_PASSWORD

          value: 12345@com

        readinessProbe:

          failureThreshold: 10

          httpGet:

            path: /api/health

            port: 3000

            scheme: HTTP

          initialDelaySeconds: 60

          periodSeconds: 10

          successThreshold: 1

          timeoutSeconds: 30

        livenessProbe:

          failureThreshold: 3

          httpGet:

            path: /api/health

            port: 3000

            scheme: HTTP

          periodSeconds: 10

          successThreshold: 1

          timeoutSeconds: 1

        resources:

          limits:

            cpu: 300m

            memory: 1024Mi

          requests:

            cpu: 300m

            memory: 1024Mi

        volumeMounts:

        - mountPath: /var/lib/grafana

          subPath: grafana

          name: storage

      securityContext:

        fsGroup: 472

        runAsUser: 472

      volumes:

      - name: storage

        persistentVolumeClaim:

          claimName: grafana

EOF

Here we use the grafana 5.3.4 image and add health checks and resource requests/limits. The important variables are GF_SECURITY_ADMIN_USER and GF_SECURITY_ADMIN_PASSWORD, which set Grafana's admin account and password.

Grafana keeps dashboards, plugins and other data under /var/lib/grafana, so we persist this directory and declare a volume mount for it. Because the user id and group id changed in the 5.3.4 image, we also add a securityContext that sets the user ID.

3.2 Add a PV and PVC for Grafana

cat > grafana_volume.yaml <<EOF

apiVersion: v1

kind: PersistentVolume

metadata:

  name: grafana

spec:

  capacity:

    storage: 10Gi

  accessModes:

  - ReadWriteOnce

  persistentVolumeReclaimPolicy: Recycle

  nfs:

    server: 192.168.126.111

    path: /app/k8s

---

apiVersion: v1

kind: PersistentVolumeClaim

metadata:

  name: grafana

  namespace: kube-system

spec:

  accessModes:

  - ReadWriteOnce

  resources:

    requests:

      storage: 10Gi

EOF

3.3 Create a NodePort Service

cat > grafana_svc.yaml <<EOF

apiVersion: v1

kind: Service

metadata:

  name: grafana

  namespace: kube-system

  labels:

    app: grafana

spec:

  type: NodePort

  ports:

    - port: 3000

  selector:

    app: grafana

EOF

Because the group id changed after version 5.1 (you can avoid this class of error by choosing a pre-5.1 image), and because the /var/lib/grafana directory mounted from the PVC may not be owned by the grafana user, we also add a Job that fixes the ownership of the directory.

cat > grafana_job.yaml <<EOF

apiVersion: batch/v1

kind: Job

metadata:

  name: grafana-chown

  namespace: kube-system

spec:

  template:

    spec:

      restartPolicy: Never

      containers:

      - name: grafana-chown

        command: ["chown", "-R", "472:472", "/var/lib/grafana"]

        image: busybox

        imagePullPolicy: IfNotPresent

        volumeMounts:

        - name: storage

          subPath: grafana

          mountPath: /var/lib/grafana

      volumes:

      - name: storage

        persistentVolumeClaim:

          claimName: grafana

EOF

This uses a busybox image to chown the /var/lib/grafana directory to uid/gid 472:472.
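
After the Job is created in the next step, you can confirm it completed with something like:

kubectl -n kube-system get job grafana-chown
kubectl -n kube-system logs job/grafana-chown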

3.4 Create the PV and PVC (these must be created in this order)

[root@kmaster prometheus]# kubectl create -f grafana_volume.yaml

persistentvolume/grafana created

persistentvolumeclaim/grafana created

[root@kmaster prometheus]# kubectl create -f grafana_job.yaml

job.batch/grafana-chown created

[root@kmaster prometheus]# kubectl apply -f grafana_deployment.yaml

deployment.apps/grafana created

[root@kmaster prometheus]# kubectl create -f grafana_svc.yaml

3.5 Check the results

[root@kmaster prometheus]# kubectl get pod,svc -n kube-system |grep grafana
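
To pull out just the assigned NodePort (30183 in the next step, but the value is cluster-assigned and will differ):

kubectl get svc grafana -n kube-system -o jsonpath='{.spec.ports[0].nodePort}'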

3.6 Access the Grafana web UI via NodeIP:NodePort

http://192.168.126.111:30183

3.7 Add a data source the first time Grafana starts

Set the URL to: http://prometheus.kube-system.svc.cluster.local:9090
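
The data source can also be created through Grafana's HTTP API instead of the UI; a minimal sketch using the admin credentials and the NodePort from above:

curl -s -u admin:12345@com -H 'Content-Type: application/json' \
  -X POST http://192.168.126.111:30183/api/datasources \
  -d '{"name":"prometheus","type":"prometheus","access":"proxy","url":"http://prometheus.kube-system.svc.cluster.local:9090","isDefault":true}'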

Once the data source is added, the next step is to add a New dashboard.

Grafana provides many ready-made dashboard templates, much like a Docker image registry, and importing one is very easy: click Dashboard at the top and choose Import.

Dashboard templates used in this article:
Container monitoring templates: 315 8588 3146 8685
Host monitoring templates: 8919 9276 10467 10171 9965

Example queries for per-Pod memory and CPU usage (note that on Kubernetes 1.16 the cAdvisor labels pod_name/container_name were replaced by pod/container):

sort_desc(sum(container_memory_usage_bytes{image!="", pod!=""}) by (pod))

sum by (pod) (rate(container_cpu_usage_seconds_total{pod="$pod"}[1m]))


Copyright notice: this is an original article by qq_25518029, licensed under CC 4.0 BY-SA. Please include a link to the original source and this notice when republishing.