Background

After our clusters were migrated to Kubernetes in the cloud, we could no longer connect and debug directly from local machines, so we needed a new remote-debugging solution.

While evaluating Alibaba's kt-connect, we found an issue in its community: because of a protocol-conversion problem, middleware such as Redis and Kafka is affected, so we ruled that option out. We then tried telepresence (from the same team as Ambassador); it covered our basic needs, so we settled on it.

Installing traffic-manager

Tools of this kind need a proxy on both the cluster and the local machine. The cluster-side agent is called traffic-manager; the local client is telepresence.

Make sure the telepresence and traffic-manager versions match. We use version 2.4.6.

helm repo add datawire  https://app.getambassador.io
helm repo update
kubectl create namespace ambassador
helm install traffic-manager --namespace ambassador datawire/telepresence
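The version-match requirement above is worth checking before every connect; a minimal sketch, with the two version strings hard-coded as examples (in practice, read them from `telepresence version` and from the traffic-manager Deployment's image tag):

```shell
# Example values; substitute the real local and cluster versions.
LOCAL_VERSION="2.4.6"
MANAGER_VERSION="2.4.6"

if [ "$LOCAL_VERSION" = "$MANAGER_VERSION" ]; then
  echo "versions match: $LOCAL_VERSION"
else
  echo "version mismatch: local $LOCAL_VERSION vs manager $MANAGER_VERSION" >&2
  exit 1
fi
```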

Installing telepresence

macOS

# Install via brew:
brew install datawire/blackbird/telepresence
 
 
# OR install manually:
# 1. Download the latest binary (~60 MB):
sudo curl -fL https://app.getambassador.io/download/tel2/darwin/amd64/2.4.6/telepresence -o /usr/local/bin/telepresence
 
# 2. Make the binary executable:
sudo chmod a+x /usr/local/bin/telepresence

Linux

# 1. Download the latest binary (~50 MB):
sudo curl -fL https://app.getambassador.io/download/tel2/linux/amd64/2.4.6/telepresence -o /usr/local/bin/telepresence
 
# 2. Make the binary executable:
sudo chmod a+x /usr/local/bin/telepresence

Windows

When we adopted it, the Windows developer preview had just been released, so minor issues may crop up in both the install script and day-to-day use.

  • Download the official package: https://app.getambassador.io/download/tel2/windows/amd64/2.4.6/telepresence.zip

  • Run the install commands in PowerShell as administrator (put the zip in the drive root before extracting, otherwise it may fail):

    Expand-Archive -Path telepresence.zip
    Remove-Item 'telepresence.zip'
    cd telepresence
    
  • It installs to C:\telepresence by default; edit install-telepresence.ps1 to change the install path:

    Set-ExecutionPolicy Bypass -Scope Process
    .\install-telepresence.ps1
    
  • After installation, two entries are added to the system PATH:

    C:\telepresence
    C:\Program Files\SSHFS-Win\bin
    
  • Confirm the installation succeeded:

    PS C:\Users\iplas> telepresence.exe status
    Root Daemon: Not running
    User Daemon: Not running
    
  • Install kubectl (newer Docker Desktop, e.g. v20.10.7, ships kubectl automatically): https://kubernetes.io/zh/docs/tasks/tools/install-kubectl-windows/

  • Verify:

    $ kubectl.exe version
    Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.2", GitCommit:"092fbfbf53427de67cac1e9fa54aaa09a28371d7", GitTreeState:"clean", BuildDate:"2021-06-16T12:59:11Z", GoVersion:"go1.16.5", Compiler:"gc", Platform:"windows/amd64"}
    Server Version: version.Info{Major:"1", Minor:"18+", GitVersion:"v1.18.4-tke.6", GitCommit:"194201819cf1e5cf45d38f72ce1aac9efca4c7ff", GitTreeState:"clean", BuildDate:"2020-12-29T09:13:24Z", GoVersion:"go1.15.6", Compiler:"gc", Platform:"linux/amd64"}
    
  • If no server info appears, kubeconfig has not been configured; the config file can be fetched from Rancher and saved locally:

    apiVersion: v1
    kind: Config
    clusters:
    - name: "testk8s"
      cluster:
        server: "https://rancher.xxx.com.cn/k8s/clusters/local"
    
    users:
    - name: "testk8s"
      user:
        token: "kubeconfig-u-sr6p9:xxxx"
    
    contexts:
    - name: "testk8s"
      context:
        user: "testk8s"
        cluster: "testk8s"
    
    current-context: "testk8s"
    
  • Add a system environment variable: name = KUBECONFIG, value = ${path where kubeconfig.yml is saved}

  • Verify cluster connectivity:

    $ kubectl cluster-info
    Kubernetes control plane is running at https://rancher.xxx.com.cn/k8s/clusters/local
    CoreDNS is running at https://rancher.xxx.com.cn/k8s/clusters/local/api/v1/namespaces/kube-system/services/kube-dns:dns-tcp/proxy
    KubeDNSUpstream is running at https://rancher.xxx.com.cn/k8s/clusters/local/api/v1/namespaces/kube-system/services/kube-dns-upstream:dns/proxy
    
    To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.
    
    

Upgrading

  • Quit the telepresence processes:

    $ telepresence.exe quit
    Telepresence Root Daemon quitting... done
    Telepresence User Daemon quitting... done
    
    2021-11-10 14:33:36 iplas@q /e/cmd
    $ telepresence.exe status
    Root Daemon: Not running
    User Daemon: Not running
    
  • Install the new version on top (no uninstall needed)

Usage

Full interception

  • Connect to traffic-manager:

    $ telepresence.exe connect
    Launching Telepresence Root Daemon
    Launching Telepresence User Daemon
    Connected to context testk8s (https://rancher.xxx.com.cn/k8s/clusters/local)
    
  • Intercept all traffic to a given deploy. Note that <service-name> here is the deploy's name, and <remote-port> is the svc's port:

    telepresence intercept <service-name> --port <local-port>[:<remote-port>] --env-file <path-to-env-file>
    
  • Example with mobile-oss in test1-oss (if the target svc exposes only one port, the remote port can be omitted):

    $ telepresence.exe intercept -n test1-oss mobile-oss-deploy --port 8080
    Using Deployment mobile-oss-deploy
    intercepted
        Intercept name    : mobile-oss-deploy-test1-oss
        State             : ACTIVE
        Workload kind     : Deployment
        Destination       : 127.0.0.1:8080
        Volume Mount Point: T:
        Intercepting      : all TCP connections
    
  • Check interception status:

    $ telepresence.exe list -n test1-oss
    activiti-service-deployment       : ready to intercept (traffic-agent not yet installed)
    admin-service-deployment          : ready to intercept (traffic-agent not yet installed)
    attachment-service-deployment     : ready to intercept (traffic-agent not yet installed)
    crm-service-deployment            : ready to intercept (traffic-agent not yet installed)
    crmdb-services-deployment         : ready to intercept (traffic-agent not yet installed)
    mobile-oss-deploy         : intercepted
        Intercept name    : mobile-oss-deploy-test1-oss
        State             : ACTIVE
        Workload kind     : Deployment
        Destination       : 127.0.0.1:8080
        Volume Mount Point: T:
        Intercepting      : all TCP connections
    
  • Requests to that svc now flow to the local machine:

    $ curl https://xxx-test1.xxx.com.cn/
    /index
    
  • Stop intercepting:

    $ telepresence.exe leave mobile-oss-deploy-test1-oss
    
  • Check interception status again:

    $ telepresence.exe list -n test1-oss
    activiti-service-deployment       : ready to intercept (traffic-agent not yet installed)
    admin-service-deployment          : ready to intercept (traffic-agent not yet installed)
    attachment-service-deployment     : ready to intercept (traffic-agent not yet installed)
    crm-service-deployment            : ready to intercept (traffic-agent not yet installed)
    crmdb-services-deployment         : ready to intercept (traffic-agent not yet installed)
    mobile-oss-deploy         : ready to intercept (traffic-agent already installed)
    
  • Note that mobile-oss-deploy changed from intercepted back to ready to intercept, and the note in parentheses changed from traffic-agent not yet installed to traffic-agent already installed. Every intercepted deploy gets a sidecar named traffic-agent (roughly 50Mi of memory). Leaving the intercept does not remove this sidecar; it must be uninstalled manually:

    ## uninstall the agent of a specific deploy
    $ telepresence.exe -n test1-oss uninstall -d mobile-oss-deploy
    
    ## uninstall all agents
    $ telepresence uninstall --everything
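The memory cost of lingering sidecars adds up across environments; a rough back-of-the-envelope sketch (deployment and environment counts are made-up examples, 50Mi per agent is the figure observed above):

```shell
# Example numbers: 6 deployments with an agent in each of 3 environments.
DEPLOYS_PER_ENV=6
ENVS=3
MI_PER_AGENT=50

TOTAL_MI=$((DEPLOYS_PER_ENV * ENVS * MI_PER_AGENT))
echo "estimated sidecar overhead: ${TOTAL_MI}Mi"
```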
    

Partial interception

  • telepresence also offers an interception mode called preview url. It leaves the cluster's existing traffic untouched and routes only requests made through that url to the local machine.

  • Log in to Ambassador first (preview urls only work while logged in):

    $ telepresence.exe login
    Launching browser authentication flow...
    Login successful.
    
  • With the local service confirmed running, create the intercept again:

    $ telepresence.exe intercept -n test1-oss mobile-oss-deploy --port 8080:80
    
  • Then visit https://reverent-dhawan-659.preview.edgestack.me/ to verify the interception (access requires the logged-in state; for clients other than a browser, remember to send the cookie)

Starting the local service with the deploy's environment variables

  • Maintaining local environment variables is tedious; in IntelliJ, the EnvFile plugin lets you reuse the cluster's variables

  • When creating the intercept, use the -e flag to dump the cluster variables to a local file:

    $ telepresence.exe intercept -n test1-oss mobile-oss-deploy --port 8080 -e /e/envfile/mobile-oss.env
    Using Deployment mobile-oss-deploy
    intercepted
        Intercept name    : mobile-oss-deploy-test1-oss
        State             : ACTIVE
        Workload kind     : Deployment
        Destination       : 127.0.0.1:8080
        Volume Mount Point: T:
        Intercepting      : all TCP connections
    
  • Spring Boot / Tomcat: after installing EnvFile, the run configuration gains an "EnvFile" section; point it at the file saved in the previous step

  • Maven: overwrite the local project config with the file saved in the previous step
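Outside the IDE, the dumped env file can also be loaded in a plain shell before starting the service; a sketch assuming the file contains simple KEY=value lines (the file name and variables below are made-up examples):

```shell
# Create an example env file of the kind `telepresence intercept -e` writes.
cat > mobile-oss.env <<'EOF'
SPRING_PROFILES_ACTIVE=test1
REDIS_HOST=redis-svc.test1-oss
EOF

set -a            # auto-export every variable assigned while sourcing
. ./mobile-oss.env
set +a

echo "profile: $SPRING_PROFILES_ACTIVE"   # prints "profile: test1"
# ...then start the service, e.g. `mvn spring-boot:run`
```

Note that real env files may contain values needing shell quoting; check the file before sourcing it.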

Use cases

  • Local API self-testing: no intercept needed; test directly with the api doc
  • Debugging APIs called by other services (east-west traffic): full interception only
  • Front-end/back-end joint debugging where the entry url can be changed on the front end (north-south traffic): either full interception or a preview url

FAQs

How to troubleshoot telepresence connection failures

  • Check that the telepresence and traffic-manager versions match; we currently use 2.4.6
  • Check for a socks proxy; remove http_proxy and https_proxy from the system variables
  • When in doubt, restart: the telepresence processes, the target deploy, then the local machine, in that order
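The proxy check above can also be done per-shell instead of editing the system variables; a sketch (the proxy value is an example):

```shell
# A stale proxy left over in the shell (example value).
export http_proxy="http://127.0.0.1:1080"

# Clear proxies for this shell only, before `telepresence connect`.
unset http_proxy https_proxy HTTP_PROXY HTTPS_PROXY

echo "http_proxy=${http_proxy:-<unset>}"   # prints "http_proxy=<unset>"
```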

How to troubleshoot intercept creation failures

  • If the svc exposes multiple ports, the remote port must be specified explicitly (the default is 80)
  • Check that the svc's port matches the deploy's port
  • A traffic-agent binds to a single port; to switch an existing intercept from port A to port B, uninstall the old traffic-agent first, then create the intercept again
  • The local service must be up, with its port open, before a preview url is created. Afterwards, interception status can be checked at https://app.getambassador.io/cloud/services
  • If the error mentions conflict, someone else probably grabbed the environment first; you have to wait for them to leave (it seems you cannot force-disconnect other users, so make a habit of leaving every time you finish)

Additional notes

  • Point the local startup config at k8s. Since there is no current namespace locally, svc hostnames must be fully qualified with their namespace
  • For service registration, components like xxl-job and nacos let you configure the registration ip manually and work fine. But components like the zookeeper integration in our in-house framework register the local ip (which the cloud cannot reach), so those need to be changed to make the registration ip configurable
  • After telepresence login you can create a preview url (which intercepts only traffic arriving through that url, leaving the cluster unaffected); otherwise all traffic is intercepted
  • preview urls rely on Ambassador Cloud and require the logged-in state (mini-programs may need a global cookie wrapper)
  • The local service must be up, with its port open, before a preview url is created (presumably Ambassador Cloud probes it)
  • How preview urls work: a request carrying a marker header is generated; telepresence forwards it to Ambassador Cloud (since the url is created and exposed by Ambassador), Ambassador forwards it back into the cluster, and the in-cluster traffic-agent inspects the header, intercepts the request, and forwards it on to the local machine
  • Every deploy that has ever been intercepted keeps a traffic-agent (each agent uses about 50Mi of memory; if every deploy in every environment grows one, watch total memory usage)
  • If the deploy's svc exposes multiple ports, specify which port to intercept (the svc's port) after the colon
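The namespace point in the first bullet can be handled with a one-off rewrite of the local config; a sketch using sed (the property name, service name, and namespace are examples):

```shell
# Locally there is no "current namespace", so a short svc name like
# redis-svc must become redis-svc.<namespace> (or the full
# redis-svc.<namespace>.svc.cluster.local form).
echo "redis.host=redis-svc:6379" |
  sed 's/redis-svc/redis-svc.test1-oss/'
# prints "redis.host=redis-svc.test1-oss:6379"
```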