In the previous article, Prometheus-Operator hands-on tutorial, I installed Prometheus-Operator by hand together with the Prometheus, Alertmanager, ServiceMonitor and PrometheusRule resources, which made clear how everything fits together. In real production, however, there is a lot to monitor, both the Kubernetes cluster itself and the applications running on top of it, and adding all of that by hand is tedious. This article therefore records how to install prometheus-operator with Helm 3 to cut down on the manual steps.
Environment and versions used in this article:
Kubernetes 1.15.5 + Calico v3.6.5
Helm v3.0.0
InfluxDB chart version: 3.0.1, which ships InfluxDB 1.7.6
Prometheus-Operator chart version: 8.2.2, which ships Prometheus-Operator v0.34.0
Everything is installed into the monitoring namespace.
Prerequisites
In production you have to think about persistent storage for Prometheus. You can configure storage as described in the earlier Prometheus-Operator hands-on tutorial, but once you run several Prometheus servers, backing up their data gets cumbersome, so here InfluxDB is used as the remote-read / remote-write backend for the Prometheus servers.
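For reference, this is roughly what the integration looks like in a plain prometheus.yml; it is only a sketch, since in this article the same endpoints are configured later through the chart's prometheusSpec.remoteRead and prometheusSpec.remoteWrite values:

# prometheus.yml fragment (sketch): write samples to and read samples back from InfluxDB 1.x
remote_write:
  - url: "http://influxdb:8086/api/v1/prom/write?db=prometheus&u=prometheus&p=Pr0m123"
remote_read:
  - url: "http://influxdb:8086/api/v1/prom/read?db=prometheus&u=prometheus&p=Pr0m123"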
InfluxDB is also a fairly popular time-series database; see the official site for details.
Quoted from the official docs: https://prometheus.io/docs/operating/integrations
File Service Discovery For service discovery mechanisms not natively supported by Prometheus, file-based service discovery provides an interface for integrating.
Remote Endpoints and Storage The remote write and remote read features of Prometheus allow transparently sending and receiving samples. This is primarily intended for long term storage. It is recommended that you perform careful evaluation of any solution in this space to confirm it can handle your data volumes.
Alertmanager Webhook Receiver For notification mechanisms not natively supported by the Alertmanager, the webhook receiver allows for integration.
Management Prometheus does not include configuration management functionality, allowing you to integrate it with your existing systems or build on top of it.
Prometheus Operator : Manages Prometheus on top of Kubernetes
Promgen : Web UI and configuration generator for Prometheus and Alertmanager
Other
karma : alert dashboard
PushProx : Proxy to transverse NAT and similar network setups
Promregator : discovery and scraping for Cloud Foundry applications
Installing InfluxDB with Helm 3
Always remember to update the Helm repo before downloading or installing a chart:
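If the stable repo has not been added yet, add it first; the URL below is the one the stable charts used at the time of writing, so adjust it if your mirror differs:

$ helm repo add stable https://kubernetes-charts.storage.googleapis.com
$ helm repo update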
Search for the influxdb chart:
$ helm search repo influxdb
NAME              CHART VERSION  APP VERSION  DESCRIPTION
stable/influxdb   3.0.1          1.7.6        Scalable datastore for metrics, events, and rea...
stable/kapacitor  1.1.3          1.5.2        InfluxDB's native data processing engine. It ca...
Since there are quite a few parameters to customize, download the chart first and then edit values.yaml:
$ helm pull stable/influxdb
$ tar xf influxdb-3.0.1.tgz
Back up the default values so it is easy to roll back:
$ cd influxdb
$ cp values.yaml{,.ori}
Edit values.yaml as follows; only the changed parts are shown:
persistence:
  enabled: true
  storageClass: rook-ceph-rbd
  annotations:
  accessMode: ReadWriteOnce
  size: 8Gi
resources:
  requests:
    memory: 2Gi
    cpu: 0.5
  limits:
    memory: 3Gi
    cpu: 1
ingress:
  enabled: true
  tls: false
  hostname: influxdb.test.aws.test.com
  annotations:
    kubernetes.io/ingress.class: "nginx"
env:
  - name: INFLUXDB_ADMIN_ENABLED
    value: "true"
  - name: INFLUXDB_ADMIN_USER
    value: "admin"
  - name: INFLUXDB_ADMIN_PASSWORD
    value: "Adm1n123"
  - name: INFLUXDB_DB
    value: "prometheus"
  - name: INFLUXDB_USER
    value: "prometheus"
  - name: INFLUXDB_USER_PASSWORD
    value: "Pr0m123"
config:
  data:
    max_series_per_database: 0
    max_values_per_tag: 0
  admin:
    enabled: true
  http:
    auth_enabled: true
initScripts:
  enabled: true
  scripts:
    retention.iql: |+
      CREATE RETENTION POLICY "prometheus_retention_policy" on "prometheus" DURATION 180d REPLICATION 1 DEFAULT
Note:
The configuration above does the following:
Enables InfluxDB authentication
Creates an admin user with password Adm1n123
Creates the prometheus database
Creates a prometheus user with password Pr0m123; this user then has read/write access to the prometheus database
Sets a retention policy on the prometheus database: keep data for 180 days
Warning:
In my tests, if auth is enabled and the admin username/password are set through setDefaultUser (the prometheus job), the admin credentials do get set, but the database is never created and the extra user is never set up. So do not rely on setDefaultUser here; use the env approach instead. Keep in mind, though, that with env anyone who can exec into the container can read these secrets with the env command, so lock down the pods/exec permission accordingly.
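RBAC permissions are additive, so restricting pods/exec simply means never granting that subresource to ordinary users. As an illustration only (the Role name is made up for this example), a read-only Role that lets users inspect pods but not exec into them could look like this:

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader-no-exec    # hypothetical example name
  namespace: monitoring
rules:
- apiGroups: [""]
  resources: ["pods", "pods/log"]
  verbs: ["get", "list", "watch"]
  # "pods/exec" is deliberately not granted, so kubectl exec is denied to subjects bound only to this Role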
Reference: InfluxDB helm chart
Reference: InfluxDB docker
Reference: InfluxDB dockerfile
Install InfluxDB:
$ kubectl create ns monitoring $ cd influxdb $ helm install influxdb ./ --namespace monitoring NAME: influxdb LAST DEPLOYED: Thu Nov 21 18:04:11 2019 NAMESPACE: monitoring STATUS: deployed REVISION: 1 TEST SUITE: None NOTES: InfluxDB can be accessed via port 8086 on the following DNS name from within your cluster: - http://influxdb.monitoring:8086 You can easily connect to the remote instance with your local influx cli. To forward the API port to localhost:8086 run the following: - kubectl port-forward --namespace monitoring $(kubectl get pods --namespace monitoring -l app=influxdb -o jsonpath='{ .items[0].metadata.name }' ) 8086:8086 You can also connect to the influx cli from inside the container. To open a shell session in the InfluxDB pod run the following: - kubectl exec -i -t --namespace monitoring $(kubectl get pods --namespace monitoring -l app=influxdb -o jsonpath='{.items[0].metadata.name}' ) /bin/sh To tail the logs for the InfluxDB pod run the following: - kubectl logs -f --namespace monitoring $(kubectl get pods --namespace monitoring -l app=influxdb -o jsonpath='{ .items[0].metadata.name }' ) $ kubectl -n monitoring get all NAME READY STATUS RESTARTS AGE pod/influxdb-0 1/1 Running 0 31m NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE service/influxdb ClusterIP 10.100.50.188 <none> 8086/TCP,8083/TCP,8088/TCP 31m NAME READY AGE statefulset.apps/influxdb 1/1 31m
Log in to the influxdb-0 pod and verify that everything defined above is in place:
$ kubectl -n monitoring exec -it influxdb-0 -- bash
bash-4.4# influx
Connected to http://localhost:8086 version 1.7.6
InfluxDB shell version: 1.7.6
Enter an InfluxQL query
> show databases
ERR: unable to parse authentication credentials
Warning: It is possible this error is due to not setting a database. Please set a database with the command "use <database>".
> auth admin Adm1n123
> show databases
name: databases
name
----
prometheus
_internal
> show users
user       admin
----       -----
admin      true
prometheus false
> show grants for prometheus
database   privilege
--------   ---------
prometheus ALL PRIVILEGE
> use prometheus
Using database prometheus
> show retention policies
name                         duration   shardGroupDuration  replicaN  default
----                         --------   ------------------  --------  -------
autogen                      0s         168h0m0s            1         false
prometheus_retention_policy  4320h0m0s  168h0m0s            1         true
Everything is configured as expected. For more InfluxDB operations, refer to the InfluxDB documentation.
Then check InfluxDB's configuration file:
bash-4.4# cat /etc/influxdb/influxdb.conf
reporting-disabled = false
bind-address = ":8088"

[meta]
  dir = "/var/lib/influxdb/meta"
  retention-autocreate = true
  logging-enabled = true

[data]
  dir = "/var/lib/influxdb/data"
  wal-dir = "/var/lib/influxdb/wal"
  query-log-enabled = true
  cache-max-memory-size = 1073741824
  cache-snapshot-memory-size = 26214400
  cache-snapshot-write-cold-duration = "10m0s"
  compact-full-write-cold-duration = "4h0m0s"
  max-series-per-database = 0
  max-values-per-tag = 0
  trace-logging-enabled = false

[coordinator]
  write-timeout = "10s"
  max-concurrent-queries = 0
  query-timeout = "0s"
  log-queries-after = "0s"
  max-select-point = 0
  max-select-series = 0
  max-select-buckets = 0

[retention]
  enabled = true
  check-interval = "30m0s"

[shard-precreation]
  enabled = true
  check-interval = "10m0s"
  advance-period = "30m0s"

[admin]
  enabled = true
  bind-address = ":8083"
  https-enabled = false
  https-certificate = "/etc/ssl/influxdb.pem"

[monitor]
  store-enabled = true
  store-database = "_internal"
  store-interval = "10s"

[subscriber]
  enabled = true
  http-timeout = "30s"
  insecure-skip-verify = false
  ca-certs = ""
  write-concurrency = 40
  write-buffer-size = 1000

[http]
  enabled = true
  bind-address = ":8086"
  flux-enabled = true
  auth-enabled = true          # auth is enabled, so every operation must be authenticated
  log-enabled = true
  write-tracing = false
  pprof-enabled = true
  https-enabled = false
  https-certificate = "/etc/ssl/influxdb.pem"
  https-private-key = ""
  max-row-limit = 10000
  max-connection-limit = 0
  shared-secret = "beetlejuicebeetlejuicebeetlejuice"
  realm = "InfluxDB"
  unix-socket-enabled = false
  bind-socket = "/var/run/influxdb.sock"

[[graphite]]
  enabled = false
  bind-address = ":2003"
  database = "graphite"
  retention-policy = "autogen"
  protocol = "tcp"
  batch-size = 5000
  batch-pending = 10
  batch-timeout = "1s"
  consistency-level = "one"
  separator = "."
  udp-read-buffer = 0

[[collectd]]
  enabled = false
  bind-address = ":25826"
  database = "collectd"
  retention-policy = "autogen"
  batch-size = 5000
  batch-pending = 10
  batch-timeout = "10s"
  read-buffer = 0
  typesdb = "/usr/share/collectd/types.db"
  security-level = "none"
  auth-file = "/etc/collectd/auth_file"

[[opentsdb]]
  enabled = false
  bind-address = ":4242"
  database = "opentsdb"
  retention-policy = "autogen"
  consistency-level = "one"
  tls-enabled = false
  certificate = "/etc/ssl/influxdb.pem"
  batch-size = 1000
  batch-pending = 5
  batch-timeout = "1s"
  log-point-errors = true

[[udp]]
  enabled = false
  bind-address = ":8089"
  database = "udp"
  retention-policy = "autogen"
  batch-size = 5000
  batch-pending = 10
  read-buffer = 0
  batch-timeout = "1s"
  precision = "ns"

[continuous_queries]
  log-enabled = true
  enabled = true
  run-interval = "1s"

[logging]
  format = "auto"
  level = "info"
  supress-logo = false
Note: the open-source edition (InfluxDB OSS) does not support clustering, so remember to back up in production; see the official backup documentation.
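As a rough sketch of what a backup of this deployment could look like (the paths are examples and the exact influxd backup flags should be checked against the InfluxDB 1.7 documentation):

# take a portable backup inside the pod, then copy it off the cluster
$ kubectl -n monitoring exec influxdb-0 -- influxd backup -portable /tmp/influxdb-backup
$ kubectl cp monitoring/influxdb-0:/tmp/influxdb-backup ./influxdb-backup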
InfluxDB is installed; next up is prometheus-operator.
Installing prometheus-operator with Helm 3
Since there are many custom parameters to change, download the prometheus-operator chart first as well:
Update the Helm repo:
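As before, that is simply:

$ helm repo update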
Then search for prometheus-operator:
$ helm search repo prometheus-operator
NAME                         CHART VERSION  APP VERSION  DESCRIPTION
stable/prometheus-operator   8.2.2          0.34.0       Provides easy monitoring definitions for Kubern...
Download the prometheus-operator chart:
$ helm pull stable/prometheus-operator
$ tar xf prometheus-operator-8.2.2.tgz
$ ls
prometheus-operator  prometheus-operator-8.2.2.tgz
First, back up the default values.yaml:
$ cd prometheus-operator
$ cp values.yaml{,.ori}
$ ls
charts  Chart.yaml  CONTRIBUTING.md  crds  README.md  requirements.lock  requirements.yaml  templates  values.yaml  values.yaml.ori
Here is the directory layout of the whole chart:
$ tree ./ ./ ├── charts │ ├── grafana │ │ ├── Chart.yaml │ │ ├── ci │ │ │ ├── default-values.yaml │ │ │ ├── with-dashboard-json-values.yaml │ │ │ └── with-dashboard-values.yaml │ │ ├── dashboards │ │ │ └── custom-dashboard.json │ │ ├── README.md │ │ ├── templates │ │ │ ├── clusterrolebinding.yaml │ │ │ ├── clusterrole.yaml │ │ │ ├── configmap-dashboard-provider.yaml │ │ │ ├── configmap.yaml │ │ │ ├── dashboards-json-configmap.yaml │ │ │ ├── deployment.yaml │ │ │ ├── headless-service.yaml │ │ │ ├── _helpers.tpl │ │ │ ├── ingress.yaml │ │ │ ├── NOTES.txt │ │ │ ├── poddisruptionbudget.yaml │ │ │ ├── podsecuritypolicy.yaml │ │ │ ├── _pod.tpl │ │ │ ├── pvc.yaml │ │ │ ├── rolebinding.yaml │ │ │ ├── role.yaml │ │ │ ├── secret-env.yaml │ │ │ ├── secret.yaml │ │ │ ├── serviceaccount.yaml │ │ │ ├── service.yaml │ │ │ ├── statefulset.yaml │ │ │ └── tests │ │ │ ├── test -configmap.yaml │ │ │ ├── test -podsecuritypolicy.yaml │ │ │ ├── test -rolebinding.yaml │ │ │ ├── test -role.yaml │ │ │ ├── test -serviceaccount.yaml │ │ │ └── test.yaml │ │ └── values.yaml │ ├── kube-state-metrics │ │ ├── Chart.yaml │ │ ├── OWNERS │ │ ├── README.md │ │ ├── templates │ │ │ ├── clusterrolebinding.yaml │ │ │ ├── clusterrole.yaml │ │ │ ├── deployment.yaml │ │ │ ├── _helpers.tpl │ │ │ ├── NOTES.txt │ │ │ ├── podsecuritypolicy.yaml │ │ │ ├── psp-clusterrolebinding.yaml │ │ │ ├── psp-clusterrole.yaml │ │ │ ├── serviceaccount.yaml │ │ │ ├── servicemonitor.yaml │ │ │ └── service.yaml │ │ └── values.yaml │ └── prometheus-node-exporter │ ├── Chart.yaml │ ├── OWNERS │ ├── README.md │ ├── templates │ │ ├── daemonset.yaml │ │ ├── endpoints.yaml │ │ ├── _helpers.tpl │ │ ├── monitor.yaml │ │ ├── NOTES.txt │ │ ├── psp-clusterrolebinding.yaml │ │ ├── psp-clusterrole.yaml │ │ ├── psp.yaml │ │ ├── serviceaccount.yaml │ │ └── service.yaml │ └── values.yaml ├── Chart.yaml ├── CONTRIBUTING.md ├── crds │ ├── crd-alertmanager.yaml │ ├── crd-podmonitor.yaml │ ├── crd-prometheusrules.yaml │ ├── crd-prometheus.yaml │ └── crd-servicemonitor.yaml ├── README.md ├── requirements.lock ├── requirements.yaml ├── templates │ ├── alertmanager │ │ ├── alertmanager.yaml │ │ ├── ingress.yaml │ │ ├── podDisruptionBudget.yaml │ │ ├── psp-clusterrolebinding.yaml │ │ ├── psp-clusterrole.yaml │ │ ├── psp.yaml │ │ ├── secret.yaml │ │ ├── serviceaccount.yaml │ │ ├── servicemonitor.yaml │ │ └── service.yaml │ ├── exporters │ │ ├── core-dns │ │ │ ├── servicemonitor.yaml │ │ │ └── service.yaml │ │ ├── kube-api-server │ │ │ └── servicemonitor.yaml │ │ ├── kube-controller-manager │ │ │ ├── endpoints.yaml │ │ │ ├── servicemonitor.yaml │ │ │ └── service.yaml │ │ ├── kube-dns │ │ │ ├── servicemonitor.yaml │ │ │ └── service.yaml │ │ ├── kube-etcd │ │ │ ├── endpoints.yaml │ │ │ ├── servicemonitor.yaml │ │ │ └── service.yaml │ │ ├── kubelet │ │ │ └── servicemonitor.yaml │ │ ├── kube-proxy │ │ │ ├── endpoints.yaml │ │ │ ├── servicemonitor.yaml │ │ │ └── service.yaml │ │ ├── kube-scheduler │ │ │ ├── endpoints.yaml │ │ │ ├── servicemonitor.yaml │ │ │ └── service.yaml │ │ ├── kube-state-metrics │ │ │ └── serviceMonitor.yaml │ │ └── node-exporter │ │ └── servicemonitor.yaml │ ├── grafana │ │ ├── configmap-dashboards.yaml │ │ ├── configmaps-datasources.yaml │ │ ├── dashboards │ │ │ ├── etcd.yaml │ │ │ ├── k8s-cluster-rsrc-use.yaml │ │ │ ├── k8s-node-rsrc-use.yaml │ │ │ ├── k8s-resources-cluster.yaml │ │ │ ├── k8s-resources-namespace.yaml │ │ │ ├── k8s-resources-pod.yaml │ │ │ ├── k8s-resources-workloads-namespace.yaml │ │ │ ├── k8s-resources-workload.yaml │ │ │ ├── nodes.yaml │ │ 
│ ├── persistentvolumesusage.yaml │ │ │ ├── pods.yaml │ │ │ └── statefulset.yaml │ │ ├── dashboards-1.14 │ │ │ ├── apiserver.yaml │ │ │ ├── cluster-total.yaml │ │ │ ├── controller-manager.yaml │ │ │ ├── etcd.yaml │ │ │ ├── k8s-coredns.yaml │ │ │ ├── k8s-resources-cluster.yaml │ │ │ ├── k8s-resources-namespace.yaml │ │ │ ├── k8s-resources-node.yaml │ │ │ ├── k8s-resources-pod.yaml │ │ │ ├── k8s-resources-workloads-namespace.yaml │ │ │ ├── k8s-resources-workload.yaml │ │ │ ├── kubelet.yaml │ │ │ ├── namespace-by-pod.yaml │ │ │ ├── namespace-by-workload.yaml │ │ │ ├── node-cluster-rsrc-use.yaml │ │ │ ├── node-rsrc-use.yaml │ │ │ ├── nodes.yaml │ │ │ ├── persistentvolumesusage.yaml │ │ │ ├── pods.yaml │ │ │ ├── pod-total.yaml │ │ │ ├── prometheus-remote-write.yaml │ │ │ ├── prometheus.yaml │ │ │ ├── proxy.yaml │ │ │ ├── scheduler.yaml │ │ │ ├── statefulset.yaml │ │ │ └── workload-total.yaml │ │ └── servicemonitor.yaml │ ├── _helpers.tpl │ ├── NOTES.txt │ ├── prometheus │ │ ├── additionalAlertmanagerConfigs.yaml │ │ ├── additionalAlertRelabelConfigs.yaml │ │ ├── additionalPrometheusRules.yaml │ │ ├── additionalScrapeConfigs.yaml │ │ ├── clusterrolebinding.yaml │ │ ├── clusterrole.yaml │ │ ├── ingressperreplica.yaml │ │ ├── ingress.yaml │ │ ├── podDisruptionBudget.yaml │ │ ├── podmonitors.yaml │ │ ├── prometheus.yaml │ │ ├── psp-clusterrolebinding.yaml │ │ ├── psp-clusterrole.yaml │ │ ├── psp.yaml │ │ ├── rules │ │ │ ├── alertmanager.rules.yaml │ │ │ ├── etcd.yaml │ │ │ ├── general.rules.yaml │ │ │ ├── k8s.rules.yaml │ │ │ ├── kube-apiserver.rules.yaml │ │ │ ├── kube-prometheus-node-alerting.rules.yaml │ │ │ ├── kube-prometheus-node-recording.rules.yaml │ │ │ ├── kubernetes-absent.yaml │ │ │ ├── kubernetes-apps.yaml │ │ │ ├── kubernetes-resources.yaml │ │ │ ├── kubernetes-storage.yaml │ │ │ ├── kubernetes-system.yaml │ │ │ ├── kube-scheduler.rules.yaml │ │ │ ├── node-network.yaml │ │ │ ├── node.rules.yaml │ │ │ ├── node-time.yaml │ │ │ ├── prometheus-operator.yaml │ │ │ └── prometheus.rules.yaml │ │ ├── rules-1.14 │ │ │ ├── alertmanager.rules.yaml │ │ │ ├── etcd.yaml │ │ │ ├── general.rules.yaml │ │ │ ├── k8s.rules.yaml │ │ │ ├── kube-apiserver.rules.yaml │ │ │ ├── kube-prometheus-node-recording.rules.yaml │ │ │ ├── kubernetes-absent.yaml │ │ │ ├── kubernetes-apps.yaml │ │ │ ├── kubernetes-resources.yaml │ │ │ ├── kubernetes-storage.yaml │ │ │ ├── kubernetes-system-apiserver.yaml │ │ │ ├── kubernetes-system-controller-manager.yaml │ │ │ ├── kubernetes-system-kubelet.yaml │ │ │ ├── kubernetes-system-scheduler.yaml │ │ │ ├── kubernetes-system.yaml │ │ │ ├── kube-scheduler.rules.yaml │ │ │ ├── node-exporter.rules.yaml │ │ │ ├── node-exporter.yaml │ │ │ ├── node-network.yaml │ │ │ ├── node.rules.yaml │ │ │ ├── node-time.yaml │ │ │ ├── prometheus-operator.yaml │ │ │ └── prometheus.yaml │ │ ├── serviceaccount.yaml │ │ ├── servicemonitors.yaml │ │ ├── servicemonitor.yaml │ │ ├── serviceperreplica.yaml │ │ └── service.yaml │ └── prometheus-operator │ ├── admission-webhooks │ │ ├── job-patch │ │ │ ├── clusterrolebinding.yaml │ │ │ ├── clusterrole.yaml │ │ │ ├── job-createSecret.yaml │ │ │ ├── job-patchWebhook.yaml │ │ │ ├── psp.yaml │ │ │ ├── rolebinding.yaml │ │ │ ├── role.yaml │ │ │ └── serviceaccount.yaml │ │ ├── mutatingWebhookConfiguration.yaml │ │ └── validatingWebhookConfiguration.yaml │ ├── cleanup-crds.yaml │ ├── clusterrolebinding.yaml │ ├── clusterrole.yaml │ ├── crds.yaml │ ├── deployment.yaml │ ├── psp-clusterrolebinding.yaml │ ├── psp-clusterrole.yaml │ ├── psp.yaml │ ├── 
serviceaccount.yaml │ ├── servicemonitor.yaml │ └── service.yaml ├── values.yaml └── values.yaml.ori 33 directories, 229 files
Edit values.yaml directly; only the changed parts are shown below:
alertmanager:
  config:
    global:
      resolve_timeout: 5m
      smtp_from: alert@test.com
      smtp_smarthost: smtphm.qiye.163.com:465
      smtp_hello: alert@test.com
      smtp_auth_username: alert@test.com
      smtp_auth_password: xxxxxxx
      smtp_require_tls: false
      wechat_api_secret: Pxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxc
      wechat_api_corp_id: wxxxxxxxxx7
    route:
      group_by: ['job','alertname','instance']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 12h
      receiver: 'email-receiver'
      routes:
      - match_re:
          job: pushgw|grafana
        receiver: 'wechat-receiver'
    receivers:
    - name: 'email-receiver'
      email_configs:
      - to: xxxxxxx@test.com
        send_resolved: true
    - name: 'wechat-receiver'
      wechat_configs:
      - send_resolved: true
        agent_id: 1xxxxxx3
        to_user: '@all'
  ingress:
    enabled: true
    annotations:
      kubernetes.io/ingress.class: "nginx"
    hosts:
    - prometheus.test.aws.test.com
    paths:
    - /alertmanager
    tls: []
  alertmanagerSpec:
    image:
      repository: quay.azk8s.cn/prometheus/alertmanager
    logFormat: json
    replicas: 3
    retention: 24h
    externalUrl: http://prometheus.test.aws.test.com/alertmanager
    routePrefix: /alertmanager
    resources:
      requests:
        memory: 256Mi
        cpu: 100m
      limits:
        memory: 256Mi
        cpu: 100m

grafana:
  adminPassword: xxxxxxx
  ingress:
    enabled: true
    annotations:
      kubernetes.io/ingress.class: nginx
    hosts:
    - grafana.test.aws.microoak.cn
    path: /
    tls: []

kubeEtcd:
  enabled: true
  service:
    port: 2381
    targetPort: 2381

prometheusOperator:
  image:
    repository: quay.azk8s.cn/coreos/prometheus-operator
  configmapReloadImage:
    repository: quay.azk8s.cn/coreos/configmap-reload
  prometheusConfigReloaderImage:
    repository: quay.azk8s.cn/coreos/prometheus-config-reloader
  hyperkubeImage:
    repository: gcr.azk8s.cn/google-containers/hyperkube
  resources:
    limits:
      cpu: 500m
      memory: 500Mi
    requests:
      cpu: 100m
      memory: 100Mi

prometheus:
  ingress:
    enabled: true
    annotations:
      kubernetes.io/ingress.class: "nginx"
    hosts:
    - prometheus.test.aws.test.com
    paths:
    - /
    tls: []
  prometheusSpec:
    image:
      repository: quay.azk8s.cn/prometheus/prometheus
    retention: 1d
    logFormat: json
    remoteRead:
    - url: "http://influxdb:8086/api/v1/prom/read?db=prometheus&u=prometheus&p=Pr0m123"
    remoteWrite:
    - url: "http://influxdb:8086/api/v1/prom/write?db=prometheus&u=prometheus&p=Pr0m123"
    remoteWriteDashboards: true
    resources:
      requests:
        memory: 400Mi
        cpu: 0.5
      limits:
        memory: 800Mi
        cpu: 0.8
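Before installing, you can optionally render the chart locally to sanity-check what these values produce; helm template is a standard Helm 3 command and this step is not part of the original walkthrough:

$ helm template prometheus ./ --namespace monitoring | less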
First install the CRDs, so that installing prometheus-operator does not fail:
$ cd prometheus-operator
$ kubectl apply -f crds/
Then install prometheus-operator itself with CRD creation disabled (see the chart reference), again into the monitoring namespace:
$ cd prometheus-operator
$ helm install prometheus --namespace=monitoring ./ --set prometheusOperator.createCustomResource=false
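If you tweak values.yaml again later, the same release can be updated in place with a standard Helm 3 upgrade (a general usage note, not a step from the original walkthrough):

$ helm upgrade prometheus ./ --namespace monitoring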
Check the status:
$ kubectl -n monitoring get all NAME READY STATUS RESTARTS AGE pod/alertmanager-prometheus-prometheus-oper-alertmanager-0 2/2 Running 0 8m18s pod/alertmanager-prometheus-prometheus-oper-alertmanager-1 2/2 Running 0 8m18s pod/alertmanager-prometheus-prometheus-oper-alertmanager-2 2/2 Running 0 8m18s pod/influxdb-0 1/1 Running 0 15h pod/prometheus-grafana-c89877b8c-k87pb 2/2 Running 0 13m pod/prometheus-kube-state-metrics-57d6c55b56-qjpdx 1/1 Running 0 13m pod/prometheus-prometheus-node-exporter-frs6j 1/1 Running 0 13m pod/prometheus-prometheus-node-exporter-ktpzj 1/1 Running 0 13m pod/prometheus-prometheus-node-exporter-r2ngs 1/1 Running 0 13m pod/prometheus-prometheus-oper-admission-patch-9l55v 0/1 Completed 0 13m pod/prometheus-prometheus-oper-operator-9568b7df6-4nrhw 2/2 Running 0 13m pod/prometheus-prometheus-prometheus-oper-prometheus-0 3/3 Running 1 8m8s NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE service/alertmanager-operated ClusterIP None <none> 9093/TCP,9094/TCP,9094/UDP 8m18s service/influxdb ClusterIP 10.100.50.188 <none> 8086/TCP,8083/TCP,8088/TCP 15h service/prometheus-grafana ClusterIP 10.100.76.213 <none> 80/TCP 13m service/prometheus-kube-state-metrics ClusterIP 10.100.229.32 <none> 8080/TCP 13m service/prometheus-operated ClusterIP None <none> 9090/TCP 8m8s service/prometheus-prometheus-node-exporter ClusterIP 10.100.59.29 <none> 9100/TCP 13m service/prometheus-prometheus-oper-alertmanager ClusterIP 10.100.81.192 <none> 9093/TCP 13m service/prometheus-prometheus-oper-operator ClusterIP 10.100.225.101 <none> 8080/TCP,443/TCP 13m service/prometheus-prometheus-oper-prometheus ClusterIP 10.100.200.80 <none> 9090/TCP 13m NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE daemonset.apps/prometheus-prometheus-node-exporter 3 3 3 3 3 <none> 13m NAME READY UP-TO-DATE AVAILABLE AGE deployment.apps/prometheus-grafana 1/1 1 1 13m deployment.apps/prometheus-kube-state-metrics 1/1 1 1 13m deployment.apps/prometheus-prometheus-oper-operator 1/1 1 1 13m NAME DESIRED CURRENT READY AGE replicaset.apps/prometheus-grafana-c89877b8c 1 1 1 13m replicaset.apps/prometheus-kube-state-metrics-57d6c55b56 1 1 1 13m replicaset.apps/prometheus-prometheus-oper-operator-9568b7df6 1 1 1 13m NAME READY AGE statefulset.apps/alertmanager-prometheus-prometheus-oper-alertmanager 3/3 8m18s statefulset.apps/influxdb 1/1 15h statefulset.apps/prometheus-prometheus-prometheus-oper-prometheus 1/1 8m8s NAME COMPLETIONS DURATION AGE job.batch/prometheus-prometheus-oper-admission-patch 1/1 5m32s 13m
Open the Prometheus web UI via the ingress host configured earlier and check whether all targets are up:
Two of them are down:
For etcd:
The etcd static pod manifest lives in /etc/kubernetes/manifests/ by default:
$ cd /etc/kubernetes/manifests/
$ ll
total 16K
-rw------- 1 root root 1.9K Nov 12 13:30 etcd.yaml
-rw------- 1 root root 2.6K Nov 14 10:35 kube-apiserver.yaml
-rw------- 1 root root 2.7K Nov 14 10:44 kube-controller-manager.yaml
-rw------- 1 root root 1012 Nov 12 13:30 kube-scheduler.yaml
This Kubernetes version (1.15.5) does not yet add --listen-metrics-urls=http://127.0.0.1:2381 to the etcd manifest, although Kubernetes 1.16.3 already does. Since values.yaml sets etcd's metrics port to 2381, add the flag here as follows:
$ sudo vim etcd.yaml
spec:
  containers:
  - command:
    - etcd
    - --advertise-client-urls=https://172.17.0.7:2379
    - --cert-file=/etc/kubernetes/pki/etcd/server.crt
    - --client-cert-auth=true
    - --data-dir=/var/lib/etcd
    - --initial-advertise-peer-urls=https://172.17.0.7:2380
    - --initial-cluster=k8s01.test.awsbj.cn=https://172.17.0.7:2380
    - --key-file=/etc/kubernetes/pki/etcd/server.key
    - --listen-client-urls=https://127.0.0.1:2379,https://172.17.0.7:2379
    - --listen-peer-urls=https://172.17.0.7:2380
    - --listen-metrics-urls=http://0.0.0.0:2381
    - --name=k8s01.test.awsbj.cn
    - --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt
    - --peer-client-cert-auth=true
    - --peer-key-file=/etc/kubernetes/pki/etcd/peer.key
    - --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
    - --snapshot-count=10000
    - --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
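After kubelet restarts the etcd static pod, you can verify the metrics endpoint directly from the node (assuming curl is available on the host):

$ curl -s http://127.0.0.1:2381/metrics | head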
Check the etcd target again; it is now OK.
For kube-proxy:
Check the listening ports on this node:
$ sudo netstat -lnutp
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address     Foreign Address   State    PID/Program name
tcp   0      0      127.0.0.1:10249   0.0.0.0:*         LISTEN   31604/kube-proxy
...
By default it listens on 127.0.0.1. Enter the kube-proxy pod and look at the relevant command-line flags:
--bind-address 0.0.0.0            The IP address for the proxy server to serve on (set to 0.0.0.0 for all IPv4 interfaces and `::` for all IPv6 interfaces) (default 0.0.0.0)
--healthz-bind-address 0.0.0.0    The IP address for the health check server to serve on (set to 0.0.0.0 for all IPv4 interfaces and `::` for all IPv6 interfaces) (default 0.0.0.0:10256)
--healthz-port int32              The port to bind the health check server. Use 0 to disable. (default 10256)
--metrics-bind-address 0.0.0.0    The IP address for the metrics server to serve on (set to 0.0.0.0 for all IPv4 interfaces and `::` for all IPv6 interfaces) (default 127.0.0.1:10249)
--metrics-port int32              The port to bind the metrics server. Use 0 to disable. (default 10249)
So --metrics-bind-address defaults to 127.0.0.1. With the cause identified, change kube-proxy's YAML:
Its YAML is defined by daemonset.apps/kube-proxy in the kube-system namespace:
spec:
  containers:
  - command:
    - /usr/local/bin/kube-proxy
    - --config=/var/lib/kube-proxy/config.conf
    - --hostname-override=$(NODE_NAME)
    - --metrics-bind-address=0.0.0.0
Save and apply, wait a moment, and check the kube-proxy target in Prometheus again; it is still down.
Note that kube-proxy is started with a config file, --config=/var/lib/kube-proxy/config.conf, which is mounted from a ConfigMap by default. Look at the kube-proxy ConfigMap:
$ kubectl -n kube-system get cm kube-proxy -o yaml
apiVersion: v1
data:
  config.conf: |-
    apiVersion: kubeproxy.config.k8s.io/v1alpha1
    bindAddress: 0.0.0.0
    clientConnection:
      acceptContentTypes: ""
      burst: 10
      contentType: application/vnd.kubernetes.protobuf
      kubeconfig: /var/lib/kube-proxy/kubeconfig.conf
      qps: 5
    clusterCIDR: 10.101.0.0/16
    configSyncPeriod: 15m0s
    conntrack:
      maxPerCore: 32768
      min: 131072
      tcpCloseWaitTimeout: 1h0m0s
      tcpEstablishedTimeout: 24h0m0s
    enableProfiling: false
    healthzBindAddress: 0.0.0.0:10256
    hostnameOverride: ""
    iptables:
      masqueradeAll: false
      masqueradeBit: 14
      minSyncPeriod: 0s
      syncPeriod: 30s
    ipvs:
      excludeCIDRs: null
      minSyncPeriod: 0s
      scheduler: ""
      strictARP: false
      syncPeriod: 30s
    kind: KubeProxyConfiguration
    metricsBindAddress: 127.0.0.1:10249
    mode: ipvs
    nodePortAddresses: null
    oomScoreAdj: -999
    portRange: ""
    resourceContainer: /kube-proxy
    udpIdleTimeout: 250ms
    winkernel:
      enableDSR: false
      networkName: ""
      sourceVip: ""
$ kubectl -n kube-system edit cm kube-proxy
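In the editor, the only change needed is the metricsBindAddress line inside config.conf:

    # before
    metricsBindAddress: 127.0.0.1:10249
    # after
    metricsBindAddress: 0.0.0.0:10249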
Then delete the kube-proxy pods so they restart with the new configuration:
$ kubectl -n kube-system delete po -l k8s-app=kube-proxy
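Once the new pods are up, a quick check on any node should show the metrics port bound to all interfaces (0.0.0.0:10249 or :::10249) instead of 127.0.0.1:

$ sudo netstat -lnutp | grep 10249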
Check again; now it is OK:
For the unreachable kube-proxy target in Prometheus, editing kube-proxy's ConfigMap directly is what fixes it; changing the container start-up arguments alone has no effect.
Reference
All targets are now healthy:
The list of Grafana dashboards:
InfluxDB resource usage: this is useful for sizing InfluxDB's resource requests and limits so it does not run short on capacity.
With that, installing InfluxDB + Prometheus-Operator via Helm is done. Next, let's add scraping of InfluxDB's own metrics. Remember how that is defined? See the earlier Prometheus-Operator hands-on tutorial:
First, inspect the Prometheus resource that Helm installed to see which labels it uses to match ServiceMonitors and PodMonitors:
$ kubectl -n monitoring get prometheuses.monitoring.coreos.com prometheus-prometheus-oper-prometheus -o yaml --export
Flag --export has been deprecated, This flag is deprecated and will be removed in future.
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  generation: 1
  labels:
    app: prometheus-operator-prometheus
    chart: prometheus-operator-8.2.2
    heritage: Helm
    release: prometheus
  name: prometheus-prometheus-oper-prometheus
  selfLink: /apis/monitoring.coreos.com/v1/namespaces/monitoring/prometheuses/prometheus-prometheus-oper-prometheus
spec:
  alerting:
    alertmanagers:
    - name: prometheus-prometheus-oper-alertmanager
      namespace: monitoring
      pathPrefix: /alertmanager
      port: web
  baseImage: quay.azk8s.cn/prometheus/prometheus
  enableAdminAPI: false
  externalUrl: http://prometheus.test.aws.microoak.cn/
  listenLocal: false
  logFormat: json
  logLevel: info
  paused: false
  podMonitorNamespaceSelector: {}
  podMonitorSelector:
    matchLabels:
      release: prometheus
  portName: web
  remoteRead:
  - url: http://influxdb:8086/api/v1/prom/read?db=prometheus&u=prometheus&p=Pr0m123
  remoteWrite:
  - url: http://influxdb:8086/api/v1/prom/write?db=prometheus&u=prometheus&p=Pr0m123
  replicas: 1
  resources:
    limits:
      cpu: 0.5
      memory: 1500Mi
    requests:
      cpu: 0.5
      memory: 1500Mi
  retention: 1d
  routePrefix: /
  ruleNamespaceSelector: {}
  ruleSelector:
    matchLabels:
      app: prometheus-operator
      release: prometheus
  securityContext:
    fsGroup: 2000
    runAsNonRoot: true
    runAsUser: 1000
  serviceAccountName: prometheus-prometheus-oper-prometheus
  serviceMonitorNamespaceSelector: {}
  serviceMonitorSelector:
    matchLabels:
      release: prometheus
  version: v2.13.1
Now let's see which of these two resource types already exist:
$ kubectl -n monitoring get servicemonitors.monitoring.coreos.com
NAME                                                  AGE
prometheus-prometheus-oper-alertmanager               144m
prometheus-prometheus-oper-apiserver                  144m
prometheus-prometheus-oper-coredns                    144m
prometheus-prometheus-oper-grafana                    144m
prometheus-prometheus-oper-kube-controller-manager    144m
prometheus-prometheus-oper-kube-etcd                  144m
prometheus-prometheus-oper-kube-proxy                 144m
prometheus-prometheus-oper-kube-scheduler             144m
prometheus-prometheus-oper-kube-state-metrics         144m
prometheus-prometheus-oper-kubelet                    144m
prometheus-prometheus-oper-node-exporter              144m
prometheus-prometheus-oper-operator                   144m
prometheus-prometheus-oper-prometheus                 144m

$ kubectl -n monitoring get podmonitors.monitoring.coreos.com
No resources found.
Check the influxdb Service:
$ kubectl -n monitoring get svc --show-labels
NAME       TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                      AGE   LABELS
influxdb   ClusterIP   10.100.50.188   <none>        8086/TCP,8083/TCP,8088/TCP   17h   app=influxdb,chart=influxdb-3.0.1,heritage=Helm,release=influxdb

$ kubectl -n monitoring get svc influxdb -o yaml --export
Flag --export has been deprecated, This flag is deprecated and will be removed in future.
apiVersion: v1
kind: Service
metadata:
  creationTimestamp: null
  labels:
    app: influxdb
    chart: influxdb-3.0.1
    heritage: Helm
    release: influxdb
  name: influxdb
  selfLink: /api/v1/namespaces/monitoring/services/influxdb
spec:
  ports:
  - name: api
    port: 8086
    protocol: TCP
    targetPort: 8086
  - name: admin
    port: 8083
    protocol: TCP
    targetPort: 8083
  - name: rpc
    port: 8088
    protocol: TCP
    targetPort: 8088
  selector:
    app: influxdb
  sessionAffinity: None
  type: ClusterIP
Test InfluxDB's metrics endpoint:
$ curl 10.100.50.188:8086/metrics
go_gc_duration_seconds{quantile="0"} 7.6328e-05
go_gc_duration_seconds{quantile="0.25"} 0.000107326
go_gc_duration_seconds{quantile="0.5"} 0.000118199
go_gc_duration_seconds{quantile="0.75"} 0.000155008
go_gc_duration_seconds{quantile="1"} 0.07417821
go_gc_duration_seconds_sum 3.874605216
go_gc_duration_seconds_count 2198
go_goroutines 31
go_info{version="go1.11"} 1
go_memstats_alloc_bytes 4.37425856e+08
go_memstats_alloc_bytes_total 3.84833321384e+11
All the information we need is in place, so define a ServiceMonitor to let Prometheus scrape InfluxDB's metrics:
kind: ServiceMonitor
apiVersion: monitoring.coreos.com/v1
metadata:
  name: prometheus-prometheus-oper-influxdb
  namespace: monitoring
  labels:
    app: prometheus-influxdb
    release: prometheus
spec:
  endpoints:
  - path: /metrics
    interval: 30s
    port: api
  namespaceSelector:
    matchNames:
    - monitoring
  selector:
    matchLabels:
      release: influxdb
      app: influxdb
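Save the manifest to a file and apply it; the filename below is just an example:

$ kubectl apply -f influxdb-servicemonitor.yaml
$ kubectl -n monitoring get servicemonitors.monitoring.coreos.com | grep influxdb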
After a short while the operator regenerates the Prometheus configuration, and once Prometheus starts scraping you will see the new target:
Now let's look at the Prometheus data stored in InfluxDB:
$ kubectl -n monitoring exec -it influxdb-0 -- sh / Connected to http://localhost:8086 version 1.7.6 InfluxDB shell version: 1.7.6 Enter an InfluxQL query > auth prometheus Pr0m123 > use prometheus Using database prometheus > show measurements name: measurements name ---- :kube_pod_info_node_count: :node_memory_MemAvailable_bytes:sum ALERTS ALERTS_FOR_STATE APIServiceOpenAPIAggregationControllerQueue1_adds APIServiceOpenAPIAggregationControllerQueue1_depth APIServiceOpenAPIAggregationControllerQueue1_longest_running_processor_microseconds up ...... > select * from up limit 10 name: up time __name__ endpoint instance job namespace node pod prometheus prometheus_replica service value ---- -------- -------- -------- --- --------- ---- --- ---------- ------------------ ------- ----- 1574386880730000000 up https-metrics 172.17.0.7:10250 kubelet kube-system k8s01.test.awsbj.cn monitoring/prometheus-prometheus-oper-prometheus prometheus-prometheus-prometheus-oper-prometheus-0 prometheus-prometheus-oper-kubelet 1 1574386883246000000 up http 10.101.253.87:8080 prometheus-prometheus-oper-operator monitoring prometheus-prometheus-oper-operator-9568b7df6-4nrhw monitoring/prometheus-prometheus-oper-prometheus prometheus-prometheus-prometheus-oper-prometheus-0 prometheus-prometheus-oper-operator 1 1574386884733000000 up https-metrics 172.17.0.213:10250 kubelet kube-system k8s02.test.awsbj.cn monitoring/prometheus-prometheus-oper-prometheus prometheus-prometheus-prometheus-oper-prometheus-0 prometheus-prometheus-oper-kubelet 1 1574386884871000000 up http-metrics 172.17.0.213:10249 kube-proxy kube-system kube-proxy-9gtz5 monitoring/prometheus-prometheus-oper-prometheus prometheus-prometheus-prometheus-oper-prometheus-0 prometheus-prometheus-oper-kube-proxy 0 1574386885147000000 up https-metrics 172.17.0.7:10250 kubelet kube-system k8s01.test.awsbj.cn monitoring/prometheus-prometheus-oper-prometheus prometheus-prometheus-prometheus-oper-prometheus-0 kubelet 1 1574386886447000000 up http-metrics 172.17.0.230:10249 kube-proxy kube-system kube-proxy-pwb7l monitoring/prometheus-prometheus-oper-prometheus prometheus-prometheus-prometheus-oper-prometheus-0 prometheus-prometheus-oper-kube-proxy 0 1574386887508000000 up https-metrics 172.17.0.230:10250 kubelet kube-system k8s03.test.awsbj.cn monitoring/prometheus-prometheus-oper-prometheus prometheus-prometheus-prometheus-oper-prometheus-0 kubelet 1 1574386888264000000 up https-metrics 172.17.0.230:10250 kubelet kube-system k8s03.test.awsbj.cn monitoring/prometheus-prometheus-oper-prometheus prometheus-prometheus-prometheus-oper-prometheus-0 kubelet 1 1574386889313000000 up http-metrics 10.101.253.65:9153 coredns kube-system coredns-5c98db65d4-xn6fg monitoring/prometheus-prometheus-oper-prometheus prometheus-prometheus-prometheus-oper-prometheus-0 prometheus-prometheus-oper-coredns 1 1574386889485000000 up metrics 172.17.0.213:9100 node-exporter monitoring prometheus-prometheus-node-exporter-frs6j monitoring/prometheus-prometheus-oper-prometheus prometheus-prometheus-prometheus-oper-prometheus-0 prometheus-prometheus-node-exporter 1 >
And compare this with the up metric inside Prometheus:
What do we see:
Prometheus stores every metric, keyed by metric name, as an InfluxDB measurement; a measurement is roughly equivalent to a table.
A Prometheus sample (the value) becomes an InfluxDB field with the field key value, and it is always a float.
Prometheus labels become InfluxDB tags.
All # HELP and # TYPE lines are ignored by InfluxDB.
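As a quick sanity check of this mapping, you can inspect the tag and field keys of a measurement from the influx shell; this is plain InfluxQL, shown here for the up measurement, and you should see a single field key named value of type float plus one tag per Prometheus label:

> use prometheus
> SHOW TAG KEYS FROM "up"
> SHOW FIELD KEYS FROM "up"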
References:
https://zhuanlan.zhihu.com/p/79561704
https://github.com/helm/charts/tree/master/stable/prometheus-operator#configuration # Prometheus-Operator Helm chart configuration reference
https://docs.influxdata.com/influxdb/v1.7/supported_protocols/prometheus/ # how to configure Prometheus against an auth-enabled InfluxDB
That is all for this article; stay tuned for the next one.