Introduction
This article walks through the hands-on steps of building a Prometheus + Grafana service-monitoring system on K8s and integrating Loki + Promtail into it. It consolidates two earlier articles, one on Prometheus service monitoring and one on the LPG (Loki + Promtail + Grafana) log collection stack; refer to the originals for the details.
Prerequisites
- Install Docker
- Install Kubernetes
- Install JDK
Preparation
# Base image for packaging Spring Boot services
docker pull azul/zulu-openjdk-centos:17-jre-latest
# Loki + Promtail + Grafana
docker pull grafana/loki:3.5.1
docker pull grafana/promtail:3.5.1
docker pull grafana/grafana:12.0.1
# Prometheus Operator
docker pull quay.io/prometheus/prometheus:v3.4.1
docker pull quay.io/prometheus/alertmanager:v0.28.1
docker pull quay.io/prometheus/node-exporter:v1.9.1
docker pull quay.io/prometheus-operator/admission-webhook:v0.82.2
docker pull quay.io/prometheus-operator/prometheus-operator:v0.82.2
docker pull quay.io/prometheus-operator/prometheus-config-reloader:v0.82.2
docker pull quay.io/thanos/thanos:v0.38.0
docker pull registry.k8s.io/ingress-nginx/kube-webhook-certgen:v1.5.3
docker pull registry.k8s.io/kube-state-metrics/kube-state-metrics:v2.15.0
# If pulls from registry.k8s.io fail, pull through a mirror and re-tag the images afterwards
docker pull k8s.mirror.nju.edu.cn/ingress-nginx/kube-webhook-certgen:v1.5.3
docker tag k8s.mirror.nju.edu.cn/ingress-nginx/kube-webhook-certgen:v1.5.3 registry.k8s.io/ingress-nginx/kube-webhook-certgen:v1.5.3
docker pull k8s.mirror.nju.edu.cn/kube-state-metrics/kube-state-metrics:v2.15.0
docker tag k8s.mirror.nju.edu.cn/kube-state-metrics/kube-state-metrics:v2.15.0 registry.k8s.io/kube-state-metrics/kube-state-metrics:v2.15.0
# Helm chart for the Prometheus Operator
wget https://github.com/prometheus-community/helm-charts/releases/download/kube-prometheus-stack-73.2.0/kube-prometheus-stack-73.2.0.tgz
# Promtail binary and configuration file
wget https://github.com/grafana/loki/releases/download/v3.5.1/promtail-linux-amd64.zip
wget https://raw.githubusercontent.com/grafana/loki/v3.5.1/clients/cmd/promtail/promtail-local-config.yaml
# Prometheus node-exporter
wget https://github.com/prometheus/node_exporter/releases/download/v1.9.1/node_exporter-1.9.1.linux-amd64.tar.gz
Deploying Prometheus + Grafana on K8s
Here the Prometheus Operator is deployed via its Helm chart, which also includes the Grafana deployment.
Deploying the Prometheus Operator with Helm
# Add the chart repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
# Update the chart repositories
helm repo update
# Download the chart package
helm pull prometheus-community/kube-prometheus-stack --version 73.2.0
# Extract
tar -zxvf kube-prometheus-stack-73.2.0.tgz -C ~/
cd ~/kube-prometheus-stack
# Review the default configuration
vim values.yaml
# Deploy
helm install prometheus . --create-namespace --namespace monitoring
# Change the Prometheus and Grafana Services to NodePort, otherwise they cannot be reached from outside the cluster
kubectl patch svc prometheus-kube-prometheus-prometheus -n monitoring -p '{"spec": {"type": "NodePort"}}'
kubectl patch svc prometheus-grafana -n monitoring -p '{"spec": {"type": "NodePort"}}'
# Get the default Grafana credentials (admin / prom-operator)
kubectl --namespace monitoring get secrets prometheus-grafana -o jsonpath="{.data.admin-user}" | base64 -d ; echo
kubectl --namespace monitoring get secrets prometheus-grafana -o jsonpath="{.data.admin-password}" | base64 -d ; echo
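As an alternative to patching the Services afterwards, the same result can be achieved at install time with a values override. This is a minimal sketch; the grafana.service.type and prometheus.service.type fields are assumptions about this chart version, so verify them against the values.yaml opened above:
# values-nodeport.yaml -- expose Grafana and Prometheus via NodePort at install time
grafana:
  service:
    type: NodePort
prometheus:
  service:
    type: NodePort
# Install with the override applied
helm install prometheus . -f values-nodeport.yaml --create-namespace --namespace monitoring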
After deployment, check the service status and ports
# Check the Pods
kubectl get pods -n monitoring -o wide
# Output
[diginn@k8s-master-01 ~]$ kubectl get pods -n monitoring -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
alertmanager-prometheus-kube-prometheus-alertmanager-0 2/2 Running 2 (19m ago) 51m 10.244.151.185 k8s-master-01 <none> <none>
prometheus-grafana-76cd8bb66b-4ttx5 3/3 Running 0 51m 10.244.95.28 k8s-master-02 <none> <none>
prometheus-kube-prometheus-operator-5cfd684899-rd877 1/1 Running 1 (19m ago) 51m 10.244.151.184 k8s-master-01 <none> <none>
prometheus-kube-state-metrics-74b7dc4795-tmgdj 1/1 Running 0 51m 10.244.44.194 k8s-node-02 <none> <none>
prometheus-prometheus-kube-prometheus-prometheus-0 2/2 Running 2 (19m ago) 51m 10.244.151.182 k8s-master-01 <none> <none>
prometheus-prometheus-node-exporter-26xmn 0/1 Pending 0 51m <none> k8s-master-03 <none> <none>
prometheus-prometheus-node-exporter-2crj7 1/1 Running 1 (19m ago) 51m 192.168.137.121 k8s-master-01 <none> <none>
prometheus-prometheus-node-exporter-bktcd 1/1 Running 0 51m 192.168.137.122 k8s-master-02 <none> <none>
prometheus-prometheus-node-exporter-h6bh6 0/1 Pending 0 51m <none> k8s-node-01 <none> <none>
prometheus-prometheus-node-exporter-jqvkv 0/1 Pending 0 51m <none> k8s-node-03 <none> <none>
prometheus-prometheus-node-exporter-vdtzd 1/1 Running 0 51m 192.168.137.132 k8s-node-02 <none> <none>
# Check the Service ports
kubectl get svc -n monitoring -o wide
# Output
[diginn@k8s-master-01 ~]$ kubectl get svc -n monitoring -o wide
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE SELECTOR
alertmanager-operated ClusterIP None <none> 9093/TCP,9094/TCP,9094/UDP 52m app.kubernetes.io/name=alertmanager
prometheus-grafana NodePort 10.111.36.16 <none> 80:30182/TCP 52m app.kubernetes.io/instance=prometheus,app.kubernetes.io/name=grafana
prometheus-kube-prometheus-alertmanager ClusterIP 10.101.24.100 <none> 9093/TCP,8080/TCP 52m alertmanager=prometheus-kube-prometheus-alertmanager,app.kubernetes.io/name=alertmanager
prometheus-kube-prometheus-operator ClusterIP 10.110.144.229 <none> 443/TCP 52m app=kube-prometheus-stack-operator,release=prometheus
prometheus-kube-prometheus-prometheus NodePort 10.100.104.204 <none> 9090:31694/TCP,8080:31199/TCP 52m app.kubernetes.io/name=prometheus,operator.prometheus.io/name=prometheus-kube-prometheus-prometheus
prometheus-kube-state-metrics ClusterIP 10.103.22.102 <none> 8080/TCP 52m app.kubernetes.io/instance=prometheus,app.kubernetes.io/name=kube-state-metrics
prometheus-operated ClusterIP None <none> 9090/TCP 52m app.kubernetes.io/name=prometheus
prometheus-prometheus-node-exporter ClusterIP 10.100.244.250 <none> 9100/TCP 52m app.kubernetes.io/instance=prometheus,app.kubernetes.io/name=prometheus-node-exporter
Access the services' web UIs in a browser
Note: the service ports are listed in the PORT(S) column of the kubectl get svc -n monitoring -o wide output above
- Prometheus UI
http://192.168.137.121:31694
| URL | Description | Notes |
|---|---|---|
| / | Main page | |
| /metrics | Its own metrics | Metrics exposed by the Prometheus server itself |
| /config | Configuration | The runtime configuration; if a target service's metrics cannot be found in Grafana, compare the configuration and relabel rules here |
| /service-discovery | Service discovery | Shows each target's raw labels and the labels left after relabeling; use together with the /config page |
| /targets | Scrape target status | Status of every scrape target; useful for quick troubleshooting |
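The same information is also exposed through Prometheus's HTTP API, which is convenient for quick command-line checks. An example against the NodePort from the output above (31694; adjust the node address to your environment):
# Query the `up` metric: a value of 1 means the scrape target is reachable
curl -s 'http://192.168.137.121:31694/api/v1/query?query=up'
# List the active scrape targets and their health (same data as the /targets page)
curl -s 'http://192.168.137.121:31694/api/v1/targets'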
- Grafana UI
http://192.168.137.121:30182
| URL | Description | Notes |
|---|---|---|
| / | Main page | |
| /connections/datasources | Data sources | Connect data sources such as Prometheus, AlertManager, and Loki |
| /explore | Explore | Run LogQL (or PromQL) queries here to inspect the raw reported data (mainly logs) |
| /dashboards | Dashboards | Service monitoring dashboards |
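The chart already provisions Prometheus as a Grafana data source, but Loki (deployed in the next section) has to be added before its logs show up under /explore. One way is to declare it in the Helm values; this is a hedged sketch that assumes the chart's grafana.additionalDataSources field and the Loki Service address defined below:
# values fragment -- register Loki as an extra Grafana data source
grafana:
  additionalDataSources:
    - name: Loki
      type: loki
      access: proxy
      url: http://loki.lpg:3100
# Apply with: helm upgrade prometheus . -f <values-file> -n monitoring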
Deploying Loki on K8s
Since the Prometheus Operator deployment above already includes Grafana, there is no need to deploy Grafana again here.
The official deployment scheme is too resource-hungry (10+ Pods, and the count grows with the number of cluster nodes), so only a single node is deployed here.
K8s resource definitions
Loki deployment resource file: lpg-loki.yaml
# Loki ServiceAccount
apiVersion: v1
kind: ServiceAccount
metadata:
name: loki
namespace: lpg
---
# Loki Role
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: loki
namespace: lpg
rules:
- apiGroups: ["extensions"]
resources: ["podsecuritypolicies"]
verbs: ["use"]
resourceNames: [loki]
---
# Loki RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: loki
namespace: lpg
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: Role
name: loki
subjects:
- kind: ServiceAccount
name: loki
---
# Loki configuration
apiVersion: v1
kind: ConfigMap
metadata:
name: loki
namespace: lpg
labels:
app: loki
data:
loki.yaml: |
auth_enabled: false
server:
http_listen_port: 3100
common:
instance_addr: 127.0.0.1
path_prefix: /data/loki
storage:
filesystem:
chunks_directory: /data/loki/chunks
rules_directory: /data/loki/rules
replication_factor: 1
ring:
kvstore:
store: inmemory
schema_config:
configs:
- from: 2020-10-24
store: tsdb
object_store: filesystem
schema: v13
index:
prefix: index_
period: 24h
ruler:
alertmanager_url: http://localhost:9093
---
# Loki Service
apiVersion: v1
kind: Service
metadata:
name: loki
namespace: lpg
labels:
app: loki
spec:
type: NodePort
ports:
- port: 3100
protocol: TCP
name: http-metrics
targetPort: http-metrics
nodePort: 30310
selector:
app: loki
---
# Loki StatefulSet
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: loki
namespace: lpg
labels:
app: loki
spec:
podManagementPolicy: OrderedReady
replicas: 1
selector:
matchLabels:
app: loki
serviceName: loki
updateStrategy:
type: RollingUpdate
template:
metadata:
labels:
app: loki
spec:
serviceAccountName: loki
initContainers:
- name: chmod-data
image: busybox:1.28.4
imagePullPolicy: IfNotPresent
command: ["chmod","-R","777","/loki/data"]
volumeMounts:
- name: storage
mountPath: /loki/data
containers:
- name: loki
image: grafana/loki:3.5.1
imagePullPolicy: IfNotPresent
args:
- -config.file=/etc/loki/loki.yaml
ports:
- name: http-metrics
containerPort: 3100
protocol: TCP
        # Security context: run as UID/GID 1000
securityContext:
runAsUser: 1000
runAsGroup: 1000
livenessProbe:
httpGet:
path: /ready
port: http-metrics
scheme: HTTP
initialDelaySeconds: 45
readinessProbe:
httpGet:
path: /ready
port: http-metrics
scheme: HTTP
initialDelaySeconds: 45
volumeMounts:
- name: config
mountPath: /etc/loki
- name: storage
mountPath: /data
terminationGracePeriodSeconds: 4800
volumes:
- name: config
configMap:
name: loki
- name: storage
hostPath:
path: /home/diginn/lpg/loki
Modify the following properties if needed:
- *.namespace: lpg
- loki.hostPath.path: /home/diginn/lpg/loki
- loki.ports[*].nodePort: 30310
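For example, if only the namespace needs to change, it can be rewritten across the whole manifest in one go (the target namespace logging here is just an illustration; remember to create that namespace instead of lpg in the next step):
# Rewrite the namespace in every resource of the manifest
sed -i 's/namespace: lpg/namespace: logging/g' lpg-loki.yaml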
Deploy the resources
# Create the namespace
kubectl get namespace lpg
kubectl create namespace lpg
# Deploy
kubectl apply -f lpg-loki.yaml
# Delete (if you need to tear it down)
kubectl delete -f lpg-loki.yaml
After deployment, check the service status and ports
# Check the Pods
kubectl get pods -n lpg -o wide
# Output
[diginn@k8s-master-01 ~]$ kubectl get pods -n lpg -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
loki-0 1/1 Running 1 (99m ago) 112m 10.244.44.249 k8s-node-02 <none> <none>
# Check the Service ports
kubectl get services -n lpg -o wide
# Output
[diginn@k8s-master-01 ~]$ kubectl get services -n lpg -o wide
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE SELECTOR
loki NodePort 10.96.77.123 <none> 3100:30310/TCP 113m app=loki
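With the Pod running and the NodePort exposed, Loki can be smoke-tested directly over HTTP: check readiness, push a test log line, then query it back. A sketch using the NodePort from the output above (30310; adjust the node address to your environment):
# Readiness check -- should print "ready"
curl -s http://192.168.137.121:30310/ready
# Push one test log line through the push API (timestamp in nanoseconds)
curl -s -X POST http://192.168.137.121:30310/loki/api/v1/push \
  -H 'Content-Type: application/json' \
  -d "{\"streams\":[{\"stream\":{\"job\":\"smoke-test\"},\"values\":[[\"$(date +%s%N)\",\"hello loki\"]]}]}"
# Query it back; the same line should also appear in Grafana /explore once Loki is added as a data source
curl -s -G http://192.168.137.121:30310/loki/api/v1/query_range --data-urlencode 'query={job="smoke-test"}'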