Background
After our cluster was migrated to Kubernetes on the cloud, we could no longer attach a debugger directly from local machines, so we needed a new remote debugging solution.
While evaluating Alibaba's kt-connect, we came across a community issue: protocol-conversion problems affect middleware such as Redis and Kafka, so that option was ruled out. We then tried telepresence (from the same team as Ambassador); it covered our basic requirements, so we settled on it.
Installing traffic-manager
Tools of this kind need an agent on both sides: the cluster-side agent is called traffic-manager, and the local client is telepresence.
The telepresence and traffic-manager versions must match. We use version 2.4.6.
helm repo add datawire https://app.getambassador.io
helm repo update
kubectl create namespace ambassador
helm install traffic-manager --namespace ambassador datawire/telepresence
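The command above installs the latest chart. To keep the traffic-manager in step with a pinned client, you can also pin the chart version; a sketch, assuming the chart version tracks the telepresence release:
helm install traffic-manager --namespace ambassador datawire/telepresence --version 2.4.6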
Installing telepresence
macOS
# Install via brew:
brew install datawire/blackbird/telepresence
# OR install manually:
# 1. Download the 2.4.6 binary (~60 MB):
sudo curl -fL https://app.getambassador.io/download/tel2/darwin/amd64/2.4.6/telepresence -o /usr/local/bin/telepresence
# 2. Make the binary executable:
sudo chmod a+x /usr/local/bin/telepresence
Linux
# 1. Download the 2.4.6 binary (~50 MB):
sudo curl -fL https://app.getambassador.io/download/tel2/linux/amd64/2.4.6/telepresence -o /usr/local/bin/telepresence
# 2. Make the binary executable:
sudo chmod a+x /usr/local/bin/telepresence
Windows
When we adopted it, the Windows developer preview had only just shipped, so expect minor hiccups both in the install script and in day-to-day use.
- Download the official installer package: https://app.getambassador.io/download/tel2/windows/amd64/2.4.6/telepresence.zip
- Run the install commands in PowerShell as Administrator (put the zip at a drive root before extracting, or extraction may fail):
Expand-Archive -Path telepresence.zip
Remove-Item 'telepresence.zip'
cd telepresence
- It installs to C:\telepresence by default; edit install-telepresence.ps1 to change the install path:
Set-ExecutionPolicy Bypass -Scope Process
.\install-telepresence.ps1
- After installation, two entries are added to the system PATH:
C:\telepresence
C:\Program Files\SSHFS-Win\bin
- Confirm the installation succeeded:
PS C:\Users\iplas> telepresence.exe status
Root Daemon: Not running
User Daemon: Not running
- Install kubectl (recent Docker Desktop builds, e.g. those shipping Docker v20.10.7, install kubectl automatically): https://kubernetes.io/zh/docs/tasks/tools/install-kubectl-windows/
- Verify:
$ kubectl.exe version
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.2", GitCommit:"092fbfbf53427de67cac1e9fa54aaa09a28371d7", GitTreeState:"clean", BuildDate:"2021-06-16T12:59:11Z", GoVersion:"go1.16.5", Compiler:"gc", Platform:"windows/amd64"}
Server Version: version.Info{Major:"1", Minor:"18+", GitVersion:"v1.18.4-tke.6", GitCommit:"194201819cf1e5cf45d38f72ce1aac9efca4c7ff", GitTreeState:"clean", BuildDate:"2020-12-29T09:13:24Z", GoVersion:"go1.15.6", Compiler:"gc", Platform:"linux/amd64"}
- If no Server Version appears, kubeconfig is not configured. The config file can be fetched from Rancher and saved locally:
apiVersion: v1
kind: Config
clusters:
- name: "testk8s"
  cluster:
    server: "https://rancher.xxx.com.cn/k8s/clusters/local"
users:
- name: "testk8s"
  user:
    token: "kubeconfig-u-sr6p9:xxxx"
contexts:
- name: "testk8s"
  context:
    user: "testk8s"
    cluster: "testk8s"
current-context: "testk8s"
- Add a system variable: name = KUBECONFIG, value = ${path to the saved kubeconfig.yml} (a PowerShell sketch follows at the end of this list).
- Verify cluster connectivity:
$ kubectl cluster-info
Kubernetes control plane is running at https://rancher.xxx.com.cn/k8s/clusters/local
CoreDNS is running at https://rancher.xxx.com.cn/k8s/clusters/local/api/v1/namespaces/kube-system/services/kube-dns:dns-tcp/proxy
KubeDNSUpstream is running at https://rancher.xxx.com.cn/k8s/clusters/local/api/v1/namespaces/kube-system/services/kube-dns-upstream:dns/proxy
To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.
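The KUBECONFIG step above as a PowerShell sketch; the path E:\kubeconfig.yml is a placeholder for wherever you saved the file:
# persist for new shells:
setx KUBECONFIG "E:\kubeconfig.yml"
# setx does not affect the current session; for that, also run:
$Env:KUBECONFIG = "E:\kubeconfig.yml"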
Upgrading
- Quit the telepresence processes:
$ telepresence.exe quit
Telepresence Root Daemon quitting... done
Telepresence User Daemon quitting... done
2021-11-10 14:33:36 iplas@q /e/cmd
$ telepresence.exe status
Root Daemon: Not running
User Daemon: Not running
- Install the new version directly on top (no need to uninstall first); a download sketch follows.
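On macOS/Linux this just means re-running the download with the new version number in the url; a sketch with <new-version> as a placeholder:
sudo curl -fL https://app.getambassador.io/download/tel2/linux/amd64/<new-version>/telepresence -o /usr/local/bin/telepresence
sudo chmod a+x /usr/local/bin/telepresence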
Usage
Full interception
- Connect to the traffic-manager:
$ telepresence.exe connect
Launching Telepresence Root Daemon
Launching Telepresence User Daemon
Connected to context testk8s (https://rancher.xxx.com.cn/k8s/clusters/local)
- Intercept all of a deploy's traffic. Note that <service-name> in the template below is the deploy's name, and <remote-port> is the svc's port:
telepresence intercept <service-name> --port <local-port>[:<remote-port>] --env-file <path-to-env-file>
- Taking mobile-oss in test1-oss as an example (if the target svc exposes only one port, the remote port can be omitted):
$ telepresence.exe intercept -n test1-oss mobile-oss-deploy --port 8080
Using Deployment mobile-oss-deploy
intercepted
    Intercept name    : mobile-oss-deploy-test1-oss
    State             : ACTIVE
    Workload kind     : Deployment
    Destination       : 127.0.0.1:8080
    Volume Mount Point: T:
    Intercepting      : all TCP connections
- Check the intercept status:
$ telepresence.exe list -n test1-oss
activiti-service-deployment   : ready to intercept (traffic-agent not yet installed)
admin-service-deployment      : ready to intercept (traffic-agent not yet installed)
attachment-service-deployment : ready to intercept (traffic-agent not yet installed)
crm-service-deployment        : ready to intercept (traffic-agent not yet installed)
crmdb-services-deployment     : ready to intercept (traffic-agent not yet installed)
mobile-oss-deploy             : intercepted
    Intercept name    : mobile-oss-deploy-test1-oss
    State             : ACTIVE
    Workload kind     : Deployment
    Destination       : 127.0.0.1:8080
    Volume Mount Point: T:
    Intercepting      : all TCP connections
- Requests to that svc now flow to the local machine:
$ curl https://xxx-test1.xxx.com.cn/
/index
- Remove the intercept:
$ telepresence.exe leave mobile-oss-deploy-test1-oss
- Check the intercept status again:
$ telepresence.exe list -n test1-oss
activiti-service-deployment   : ready to intercept (traffic-agent not yet installed)
admin-service-deployment      : ready to intercept (traffic-agent not yet installed)
attachment-service-deployment : ready to intercept (traffic-agent not yet installed)
crm-service-deployment        : ready to intercept (traffic-agent not yet installed)
crmdb-services-deployment     : ready to intercept (traffic-agent not yet installed)
mobile-oss-deploy             : ready to intercept (traffic-agent already installed)
- Note that mobile-oss-deploy's state changed from intercepted back to ready to intercept, while the note in parentheses changed from traffic-agent not yet installed to traffic-agent already installed. This is because every intercepted deploy gets a sidecar named traffic-agent injected, using about 50 MB of memory each. The sidecar is not destroyed when you leave the intercept; it has to be removed manually:
## remove the agent from one deploy
$ telepresence.exe -n test1-oss uninstall -d mobile-oss-deploy
## remove everything
$ telepresence uninstall --everything
Partial interception
- telepresence also offers an intercept mode called preview url. It leaves the cluster's existing traffic untouched and routes only the traffic arriving via that url to the local machine.
- First log in to ambassador (preview urls only work while logged in):
$ telepresence.exe login
Launching browser authentication flow...
Login successful.
- With the local service confirmed running, create the intercept again:
$ telepresence.exe intercept -n test1-oss mobile-oss-deploy --port 8080:80
- Then visit https://reverent-dhawan-659.preview.edgestack.me/ to verify the intercept (access requires a logged-in session; if you call it from outside a browser, remember to include the cookie, as in the sketch below).
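A hedged sketch of a non-browser call; the cookie value is a placeholder you would copy from a logged-in browser session:
$ curl -H "Cookie: <ambassador-cloud-session-cookie>" https://reverent-dhawan-659.preview.edgestack.me/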
Starting the local service with the deploy's environment variables
- Maintaining environment variables by hand is tedious; in IntelliJ, the EnvFile plugin lets you reuse the cluster's environment variables.
- When creating the intercept, write the cluster's variables to a local file with the -e flag:
$ telepresence.exe intercept -n test1-oss mobile-oss-deploy --port 8080 -e /e/envfile/mobile-oss.env
Using Deployment mobile-oss-deploy
intercepted
    Intercept name    : mobile-oss-deploy-test1-oss
    State             : ACTIVE
    Workload kind     : Deployment
    Destination       : 127.0.0.1:8080
    Volume Mount Point: T:
    Intercepting      : all TCP connections
- spring boot/tomcat startup: after installing EnvFile, the run configuration gains an "EnvFile" tab; point it at the file saved in the previous step.
- maven startup: overwrite the local project config with the file from the previous step (or load it in the shell, as sketched below).
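Outside IntelliJ, one hedged alternative is to export the file's variables in a shell before starting maven; a bash sketch, assuming the env file holds plain KEY=value lines and the project is a spring-boot project:
# export everything defined while sourcing the file
$ set -a; source /e/envfile/mobile-oss.env; set +a
$ mvn spring-boot:run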
Use cases
- Local API self-testing: no intercept needed; just test against the api doc.
- Debugging interfaces called by other services (east-west traffic): only full interception works.
- Front-end/back-end joint debugging, where the entry url can be changed on the front end (north-south traffic): either full interception or a preview url works.
FAQs
How to troubleshoot a failed telepresence connection
- Check that the telepresence and traffic-manager versions match; we currently use 2.4.6 (a quick version check is sketched after this list).
- Check whether a socks proxy is active; remove http_proxy and https_proxy from the system variables.
- When in doubt, restart: try restarting, in order, the telepresence processes, the deploy being intercepted, and your local machine.
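A hedged version check (the deployment name and namespace below match the install section above):
# client and local daemon versions:
$ telepresence.exe version
# traffic-manager image tag in the cluster:
$ kubectl -n ambassador get deploy traffic-manager -o jsonpath='{.spec.template.spec.containers[0].image}'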
How to troubleshoot a failed intercept
- If the svc exposes multiple ports, the remote port must be specified explicitly (the default is 80).
- Check that the svc's port actually maps to the deploy's port (a kubectl sketch follows this list).
- A traffic-agent can bind to only one port; if an earlier intercept bound port A and you now need port B, uninstall the old traffic-agent first, then create the new intercept.
- The local service must be started, with its port open, before creating a preview url. Once created, you can inspect the intercept at https://app.getambassador.io/cloud/services .
- If the error mentions conflict, someone else has probably claimed the environment first; you have to wait for them to leave before you can connect (it seems you cannot force-disconnect someone else, so make a habit of leaving as soon as you are done).
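A hedged sketch for the port check; mobile-oss-svc is a hypothetical service name, substitute your own:
# the svc's port/targetPort pairs:
$ kubectl -n test1-oss get svc mobile-oss-svc -o jsonpath='{.spec.ports}'
# the container ports declared on the deploy:
$ kubectl -n test1-oss get deploy mobile-oss-deploy -o jsonpath='{.spec.template.spec.containers[0].ports}'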
Additional notes
- Switch the local startup configuration to the k8s one; since the local side has no namespace context, svc addresses must be namespace-qualified (see the sketch after this list).
- For service registration, components like xxl-job and nacos let you configure the registration ip manually and work fine. But anything that registers the local ip on its own, like the zookeeper client in our in-house framework (the cloud cannot reach intranet ips), must be reworked to make the registration ip configurable.
- If you are logged in via telepresence login, you can create a preview url (which intercepts only traffic arriving via that url and leaves the cluster alone); otherwise all traffic is intercepted.
- preview urls depend on ambassador cloud and require a logged-in session (mini-programs may need a global cookie wrapper).
- The local service must be up, with its port open, before creating a preview url (ambassador cloud presumably probes it).
- How preview urls work: a request is generated carrying a marker header; telepresence forwards it to ambassador cloud (since the url is created and exposed by ambassador), ambassador forwards it back into the cluster, and the traffic agent inside the cluster inspects the header, intercepts the request, and forwards it to the local machine.
- Every deploy that has ever been intercepted carries a traffic-agent (each agent uses about 50 MB of memory; if every deploy in every environment grows one, mind the total memory footprint).
- If the svc behind the deploy exposes multiple ports, specify which one to intercept (the svc's port) after the colon.
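The namespace-qualification sketch referenced above; mobile-oss-svc is a hypothetical service name:
# in-cluster short name; per the note above, qualify it with the namespace when running locally:
http://mobile-oss-svc:8080
# namespace-qualified, resolves through telepresence:
http://mobile-oss-svc.test1-oss:8080
# fully qualified form:
http://mobile-oss-svc.test1-oss.svc.cluster.local:8080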