Ahui's Blog

Notes on systems, networking, clusters, databases, distributed and cloud computing, and more

Testing a Ceph deployment on Kubernetes with Rook

Download the manifests:

mkdir rook
cd rook/

wget https://raw.githubusercontent.com/rook/rook/release-1.0/cluster/examples/kubernetes/ceph/common.yaml
wget https://raw.githubusercontent.com/rook/rook/release-1.0/cluster/examples/kubernetes/ceph/cluster.yaml
wget https://raw.githubusercontent.com/rook/rook/release-1.0/cluster/examples/kubernetes/ceph/operator.yaml

Edit the configuration:

vim cluster.yaml
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  cephVersion:
    image: ceph/ceph:v13.2.6-20190604 # Ceph version
    allowUnsupported: false
  dataDirHostPath: /data/ceph/rook  # host directory for Rook/Ceph data
  mon:
    count: 3
    allowMultiplePerNode: false
  dashboard:
    enabled: true
  network:
    hostNetwork: false
  rbdMirroring:
    workers: 0
  annotations:
  resources:
  storage:
    useAllNodes: true
    useAllDevices: true

Deploy:

kubectl create -f common.yaml 
kubectl create -f operator.yaml 
kubectl create -f cluster.yaml

Check the pod status:

[root@sh-ops-k8stest-master-dev-01 rook]# kubectl get all -n rook-ceph 
NAME                                      READY   STATUS    RESTARTS   AGE
pod/rook-ceph-agent-2qrlm                 1/1     Running   0          57s
pod/rook-ceph-agent-l6v7f                 1/1     Running   0          57s
pod/rook-ceph-agent-td7zz                 1/1     Running   0          57s
pod/rook-ceph-mon-a-68947cffc8-j6tpt      1/1     Running   0          27s
pod/rook-ceph-mon-b-65ccd7bcc6-nvm5h      1/1     Running   0          19s
pod/rook-ceph-operator-7f5ff79d9f-dknph   1/1     Running   0          88s
pod/rook-discover-css92                   1/1     Running   0          57s
pod/rook-discover-fvkcf                   1/1     Running   0          57s
pod/rook-discover-wgl56                   1/1     Running   0          57s

NAME                      TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
service/rook-ceph-mon-a   ClusterIP   10.248.255.203   <none>        6789/TCP   28s
service/rook-ceph-mon-b   ClusterIP   10.248.254.82    <none>        6789/TCP   21s

NAME                             DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
daemonset.apps/rook-ceph-agent   3         3         3       3            3           <none>          57s
daemonset.apps/rook-discover     3         3         3       3            3           <none>          57s

NAME                                 READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/rook-ceph-mon-a      1/1     1            1           27s
deployment.apps/rook-ceph-mon-b      1/1     1            1           19s
deployment.apps/rook-ceph-operator   1/1     1            1           88s

NAME                                            DESIRED   CURRENT   READY   AGE
replicaset.apps/rook-ceph-mon-a-68947cffc8      1         1         1       27s
replicaset.apps/rook-ceph-mon-b-65ccd7bcc6      1         1         1       19s
replicaset.apps/rook-ceph-operator-7f5ff79d9f   1         1         1       88s
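To confirm the Ceph cluster itself is healthy rather than just its pods, the Rook toolbox can be used to run ceph commands. A minimal sketch, assuming the release-1.0 branch ships a toolbox.yaml alongside the manifests above and labels the tools pod app=rook-ceph-tools:

wget https://raw.githubusercontent.com/rook/rook/release-1.0/cluster/examples/kubernetes/ceph/toolbox.yaml
kubectl create -f toolbox.yaml

# Run ceph status from inside the toolbox pod once it is Running.
kubectl -n rook-ceph exec -it \
  $(kubectl -n rook-ceph get pod -l app=rook-ceph-tools -o jsonpath='{.items[0].metadata.name}') \
  -- ceph status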

Finding a container by its docker overlay2 directory name

Occasionally a single container eats an unusually large amount of disk space. In that case you need to map a docker overlay2 directory name back to the container that owns it:

First, change into the overlay2 directory; its location can be found in docker's configuration file (/etc/docker/daemon.json). Then see which subdirectories use the most space.
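If daemon.json does not set the path explicitly, docker itself reports the data root; the overlay2 directory sits directly under it:

docker info 2>/dev/null | grep 'Docker Root Dir'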

[root@sh-saas-k8s1-node-qa-04 overlay2]# du -sc * | sort -rn  | more
33109420        total
1138888 20049e2e445181fc742b9e74a8819edf0e7ee8f0c0041fb2d1c9d321f73d8f5b
1066548 010d0a26a1fe5b00e330d0d87649fc73af45f9333fd824bf0f9d91a37276af18
943208  030c0f111675f6ed534eaa6e4183ec91d4c065dd4bdb5a289f4b572357667378
825116  0ad9e737795dd367bb72f7735fb69a65db3d8907305b305ec21232505241d044
824756  bf3c698966bc19318f3263631bc285bde07c6a1a4eaea25c4ecd3b7b8f29b3fd
661000  15763b72802e1e71cc943e09cba8b747779bf80fa35d56318cf1b89f7b1f1e71
575564  02eaa52e2f999dc387a9dee543028bada0762022cef1400596b5cc18a6223635
486780  4353c30611d7f51932d9af24bb1330db0fdb86faa9d9cae02ed618fa975c697a
486420  562a8874cc345b8ea830c1486c42211b288c886c5dca08e14d7057cacab984c1
486420  4f897e8cd355320b0e7ee1ecc9db5c43d5151f9afa29f1925fe264c88429be4c
448652  a8d0596d123fcc59983ce63df3f3acd40d5c930ed72874ce0a9efbc3234466de
448296  851cc4912edb9989e120acf241f26c82e9715e7fcb1d2bd19a002fcfb894f1f4
417780  20608baacae6bafcd4230a18a272857bc75703a6eefef5c9b40ba4ea19496b11
387388  43a8a76de3b5531e4c12f956f7bfcfcdb8fd38548faf20812cafa9a39813abc5

Then map the directory name to a container name:

[root@sh-saas-k8s1-node-qa-04 overlay2]#  docker ps -q | xargs docker inspect --format '{{.State.Pid}}, {{.Name}}, {{.GraphDriver.Data.WorkDir}}' | grep "20049e2e445181fc742b9e74a8819edf0e7ee8f0c0041fb2d1c9d321f73d8f5b"
4884, /k8s_taskmanager_flink-taskmanager-java-qa-7879d55f45-xbd74_public-tjpm-flink-java-qa_ad8bf915-a23f-11e9-be66-52540088db9a_0, /data/kubernetes/docker/overlay2/20049e2e445181fc742b9e74a8819edf0e7ee8f0c0041fb2d1c9d321f73d8f5b/work

If a directory cannot be matched to any container, it usually means the container has already been deleted but its directory was never cleaned up. In that case simply clean up:

docker system prune -a -f
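Note that docker system prune -a -f also deletes all unused images, not only leftover container data. To preview what is taking up space before pruning:

docker system df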

Creating a read-only cluster service account in k8s

Sometimes you need to create a read-only service account on a k8s cluster, for example for developers. Here is how to do it:

First create oms-viewonly.yaml:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: oms-viewonly
rules:
- apiGroups:
  - ""
  resources:
  - configmaps
  - endpoints
  - persistentvolumeclaims
  - pods
  - replicationcontrollers
  - replicationcontrollers/scale
  - serviceaccounts
  - services
  - nodes
  - persistentvolumes
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - bindings
  - events
  - limitranges
  - namespaces/status
  - pods/log
  - pods/status
  - replicationcontrollers/status
  - resourcequotas
  - resourcequotas/status
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - namespaces
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - apps
  resources:
  - daemonsets
  - deployments
  - deployments/scale
  - replicasets
  - replicasets/scale
  - statefulsets
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - autoscaling
  resources:
  - horizontalpodautoscalers
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - batch
  resources:
  - cronjobs
  - jobs
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - extensions
  resources:
  - daemonsets
  - deployments
  - deployments/scale
  - ingresses
  - networkpolicies
  - replicasets
  - replicasets/scale
  - replicationcontrollers/scale
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - policy
  resources:
  - poddisruptionbudgets
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - networking.k8s.io
  resources:
  - networkpolicies
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - storage.k8s.io
  resources:
  - storageclasses
  - volumeattachments
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - rbac.authorization.k8s.io
  resources:
  - clusterrolebindings
  - clusterroles
  - roles
  - rolebindings
  verbs:
  - get
  - list
  - watch

---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: oms-read 
  namespace: kube-system

---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: oms-read
  labels: 
    k8s-app: oms-read
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: oms-viewonly
subjects:
- kind: ServiceAccount
  name: oms-read
  namespace: kube-system

Then create it:
kubectl apply -f oms-viewonly.yaml

Finally, the token of the newly created SA can be retrieved with:
kubectl -n kube-system describe secret $(kubectl -n kube-system get secret | grep oms-read | awk '{print $1}')
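To hand the token to a developer, it can be wired into a kubeconfig as a credential. A minimal sketch; the cluster name mycluster and the context name are placeholders for this environment:

TOKEN=$(kubectl -n kube-system get secret \
  $(kubectl -n kube-system get secret | grep oms-read | awk '{print $1}') \
  -o jsonpath='{.data.token}' | base64 -d)
kubectl config set-credentials oms-read --token="$TOKEN"
kubectl config set-context oms-read@mycluster --cluster=mycluster --user=oms-read
kubectl config use-context oms-read@mycluster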

Common k8s batch operations on the command line

Clean up all Evicted pods with a single one-liner:
kubectl get pods --all-namespaces -o wide | grep Evicted | awk '{print $1,$2}' | xargs -L1 kubectl delete pod -n

For example:

kubectl get pods --all-namespaces -o wide | grep Evicted | awk '{print $1,$2}' | xargs -L1 kubectl delete pod -n 
pod "public-fe-tweeter-node-online-65fc7889-tjgn9" deleted
pod "flink-taskmanager-java-online-76b8c459f-5xqtv" deleted
pod "flink-taskmanager-java-online-76b8c459f-7hfdw" deleted
pod "flink-taskmanager-java-online-76b8c459f-jkb8l" deleted
pod "flink-taskmanager-java-online-76b8c459f-nwls4" deleted
pod "flink-taskmanager-java-online-76b8c459f-t7xxk" deleted
pod "saas-jcpt-saas-uc-base-tomcat-online-67bdb8585c-skzz5" deleted
pod "saas-jcpt-saas-uc-base-tomcat-online-69996b44-kcnqp" deleted
pod "saas-jcpt-saas-uc-base-tomcat-online-6cb9c6cb5-qjfj4" deleted
pod "saas-jcpt-saas-uc-base-tomcat-online-6cb9c6cb5-rr9nf" deleted
pod "saas-jcpt-saas-uc-base-tomcat-online-77948c97c5-2m4jf" deleted
pod "saas-jcpt-saas-uc-base-tomcat-online-77948c97c5-qjgh5" deleted
pod "saas-jcpt-saas-uc-base-tomcat-online-b4c456646-qr5wl" deleted
pod "saas-jcpt-saas-uc-base-tomcat-online-b4c456646-sshlr" deleted
pod "saas-jcpt-saas-uc-base-tomcat-online-b4c456646-wqnpq" deleted

You can also replace Evicted with OutOfcpu or other statuses.

Batch-labelling nodes:

for node in `kubectl get node | grep node | awk '{print $1}'`; do kubectl label node $node edug=traefik ; done 
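To verify the label landed, and to remove it again if needed (a trailing dash deletes a label); <node-name> is a placeholder:

kubectl get nodes -l edug=traefik
kubectl label node <node-name> edug-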

Finding the owning container ID from a process PID on a docker host

With docker, a single host often runs many containers, and a process inside one of them may drive up the load of the whole host. One command is enough to find the culprit.

# Find the container ID

docker inspect -f "{{.Id}}" $(docker ps -q) |grep <PID>

# Find the k8s pod name

docker inspect -f "{{.Id}} {{.State.Pid}} {{.Config.Hostname}}" $(docker ps -q) |grep <PID>

# If the PID belongs to a child process running inside a container, docker inspect alone will not show it

for i in  `docker ps |grep Up|awk '{print $1}'`;do echo \ &&docker top $i &&echo ID=$i; done |grep -A 10 <PID>

Reposted from: https://www.cnblogs.com/37yan/p/9559308.html
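A related trick: the cgroup file of any process embeds the full container ID, so the lookup can also go from a PID straight to its container without scanning them all. A sketch, with <PID> and <container-id> as placeholders:

# The cgroup paths contain the 64-character container ID.
cat /proc/<PID>/cgroup

# Map that ID back to a container name (and hence the k8s pod).
docker inspect --format '{{.Name}}' <container-id>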

Monitoring a kubernetes cluster with zabbix

I recently wrote a zabbix monitoring script for kubernetes clusters, mainly for alerting; performance monitoring is still handled by other means.

github URL

https://github.com/farmerluo/k8s_zabbix

About k8s_zabbix

k8s_zabbix uses zabbix to monitor kubernetes ingress, HPA, and pod status, among other things.

Template Check K8S Cluster Status.xml: the zabbix template; import this file into zabbix

check_k8s_status.py: the kubernetes monitoring script

userparameter_k8s.conf: configuration for the zabbix agent side; mind the path to the script

Notes on check_k8s_status.py

  • Configuration of the monitored k8s cluster:
conf.host = "https://10.10.88.20:8443"
conf.api_key['authorization'] = "xxxxxx.xxxxxxx.x-x-xxx-xxx-x"
  • The script monitors the access status of the traefik ingress and alerts on abnormal status codes in the 400–599 range. The traefik access logs must first be shipped into an elasticsearch cluster via fluentd or filebeat; the script then periodically queries those logs to track ingress health. Configuration inside the monitoring script (a sketch of the kind of ES query involved follows the config below):
# elasticsearch server config
es_server = [{"host": "10.16.252.50", "port": 9200},
             {"host": "10.16.252.50", "port": 9200},
             {"host": "10.16.252.50", "port": 9200}
             ]
# index name
es_index = "logstash-traefik-ingress-lb-*"
# ES query interval, in ms
es_query_duration = 60000

# status-code alerting and threshold configuration
# xxx.com is an example of a per-domain override
status_code_config = {
    'default': {'403': '90', '404': '90', '500': '2', '502': '2', '499': '70', '406': '70', '503': '5',
                '504': '5', '599': '2', 'other': '30', '429': '5', '430': '1'
                },
    'xxx.com': {'403': '100', '404': '100', '500': '70', '502': '70', '499': '100', '406': '80', '503': '70',
                '504': '70', '599': '0', 'other': '60', '429': '5', '430': '1'
                }
}
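For reference, a hedged sketch of the kind of aggregation the script could run against the index above, grouping the last minute of hits by host and status code. The field names request_host and status are assumptions and must match whatever fluentd/filebeat actually writes to elasticsearch:

curl -s 'http://10.16.252.50:9200/logstash-traefik-ingress-lb-*/_search' \
  -H 'Content-Type: application/json' -d '{
  "size": 0,
  "query": { "range": { "@timestamp": { "gte": "now-60s" } } },
  "aggs": {
    "by_host": {
      "terms": { "field": "request_host.keyword" },
      "aggs": { "by_status": { "terms": { "field": "status" } } }
    }
  }
}'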

An issue with flaky coredns resolution results

I noticed that coredns resolution inside k8s is a bit off: lookups frequently fail to return a result.

/ # nslookup kubernetes-dashboard.kube-system.svc.cluster.local
Server:         10.253.255.10
Address:        10.253.255.10:53

Non-authoritative answer:

*** Can't find kubernetes-dashboard.kube-system.svc.cluster.local: No answer

/ # nslookup kubernetes-dashboard.kube-system.svc.cluster.local
Server:         10.253.255.10
Address:        10.253.255.10:53

Name:   kubernetes-dashboard.kube-system.svc.cluster.local
Address: 10.253.255.40

*** Can't find kubernetes-dashboard.kube-system.svc.cluster.local: No answer

/ # nslookup kubernetes-dashboard.kube-system.svc.cluster.local
Server:         10.253.255.10
Address:        10.253.255.10:53

Name:   kubernetes-dashboard.kube-system.svc.cluster.local
Address: 10.253.255.40
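The nslookup above is busybox's, which typically sends both A and AAAA queries and can print "No answer" when one of the responses is empty, so its output alone can be misleading. A sketch for narrowing the problem down with dig and a look at the CoreDNS side; the debug image name is an assumption:

# Query only the A record, with the cluster search path applied.
kubectl run -it --rm dns-debug --image=tutum/dnsutils --restart=Never -- \
  dig kubernetes-dashboard.kube-system.svc.cluster.local A +search +short

# Inspect the CoreDNS configuration and recent logs for errors.
kubectl -n kube-system get configmap coredns -o yaml
kubectl -n kube-system logs -l k8s-app=kube-dns --tail=50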

(more…)

Day-to-day kubernetes usage and operations testing


A kubernetes 1.10 cluster was set up in the previous post; now let's run through some day-to-day usage tests.

1. Creating a deployment and service

Edit a yaml file:

vim nginx.yaml 
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: nginx-router
  namespace: test
spec:
  replicas: 2
  template:
    metadata:
      labels:
        app: nginx-router
    spec:
      containers:
      - name: nginx-router
        image: 172.21.248.242/base/nginx
        ports:
        - containerPort: 80

---
kind: Service
apiVersion: v1
metadata:
  name: nginx-router
  namespace: test
spec:
  selector:
    app: nginx-router
  ports:
    - protocol: TCP
      port: 80
      targetPort: 80

---
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: nginx-router-ingress
  namespace: test
  annotations:
    kubernetes.io/ingress.class: traefik
spec:
  rules:
  - host: "nginx.k8s.dev.huilog.com"
    http:
      paths:
      - backend:
          serviceName: nginx-router
          servicePort: 80

# Create the deployment and service:
[root@bs-ops-test-docker-dev-01 dev]# kubectl create -f nginx.yaml 
deployment.extensions "nginx-router" created
service "nginx-router" created
ingress.extensions "nginx-router-ingress" created
[root@bs-ops-test-docker-dev-01 dev]#
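To verify the result, list the objects in the test namespace and exercise the ingress by sending the Host header to a traefik entry point; <traefik-node-ip> is a placeholder for this environment:

kubectl -n test get deploy,svc,ingress -o wide
curl -H "Host: nginx.k8s.dev.huilog.com" http://<traefik-node-ip>/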

(more…)

Installing and configuring the traefik ingress on kubernetes

Traefik is an open-source reverse proxy and load balancer. Its biggest strength is that it integrates directly with common microservice systems and supports automatic, dynamic configuration. Supported backends currently include Docker, Swarm, Mesos/Marathon, Mesos, Kubernetes, Consul, Etcd, Zookeeper, BoltDB, the REST API, and more.
The architecture diagram is shown below:

It is worth pointing out that ingress controllers are effectively part of kubernetes: an ingress is the entry point for traffic reaching the cluster from outside, forwarding users' URL requests to the appropriate services. An Ingress plays the role of a load-balancing reverse proxy such as nginx or apache, and it also carries the rule definitions, i.e. the URL routing information; refreshing that routing information is the job of the Ingress controller.

The Ingress Controller can essentially be thought of as a watcher: it continuously talks to the kubernetes API to sense changes to backend services and pods in real time, e.g. pods being added or removed, services being created or deleted. Once it learns of such changes, the Ingress Controller combines them with the Ingress objects described below to generate a configuration, then updates the reverse-proxy load balancer and reloads its configuration, thereby providing service discovery.
(more…)

Installing and configuring a kubernetes V1.10 cluster

1. k8s cluster planning

1.1. Dependencies of kubernetes 1.10

k8s V1.10 is not supported or tested against every version of related packages such as etcd and docker; the recommended versions are:
– docker: 1.11.2 to 1.13.1 and 17.03.x
– etcd: 3.1.12
– The full details are below (a quick version-check sketch follows the list):

Reference: External Dependencies
* The supported etcd server version is 3.1.12, as compared to 3.0.17 in v1.9 (#60988)
* The validated docker versions are the same as for v1.9: 1.11.2 to 1.13.1 and 17.03.x (ref)
* The Go version is go1.9.3, as compared to go1.9.2 in v1.9. (#59012)
* The minimum supported go is the same as for v1.9: go1.9.1. (#55301)
* CNI is the same as v1.9: v0.6.0 (#51250)
* CSI is updated to 0.2.0 as compared to 0.1.0 in v1.9. (#60736)
* The dashboard add-on has been updated to v1.8.3, as compared to 1.8.0 in v1.9. (#57326)
* Heapster is the same as v1.9: v1.5.0. It will be upgraded in v1.11. (ref)
* Cluster Autoscaler has been updated to v1.2.0. (#60842, @mwielgus)
* Updates kube-dns to v1.14.8 (#57918, @rramkumar1)
* Influxdb is unchanged from v1.9: v1.3.3 (#53319)
* Grafana is unchanged from v1.9: v4.4.3 (#53319)
* CAdvisor is v0.29.1 (#60867)
* fluentd-gcp-scaler is v0.3.0 (#61269)
* Updated fluentd in fluentd-es-image to fluentd v1.1.0 (#58525, @monotek)
* fluentd-elasticsearch is v2.0.4 (#58525)
* Updated fluentd-gcp to v3.0.0. (#60722)
* Ingress glbc is v1.0.0 (#61302)
* OIDC authentication is coreos/go-oidc v2 (#58544)
* Updated fluentd-gcp updated to v2.0.11. (#56927, @x13n)
* Calico has been updated to v2.6.7 (#59130, @caseydavenport)
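Before installing, it is worth checking that the hosts' docker and etcd match the supported versions above. A quick sketch, assuming the etcd v3 client is on the PATH:

docker version --format '{{.Server.Version}}'
ETCDCTL_API=3 etcdctl version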

(more…)