dpvs study notes (5): the complete fullnat flow and the syn-proxy implementation

Introduction to fullnat

Compared with ordinary NAT, most companies use fullnat: it puts no constraints on the network segments involved and does not require any route configuration on the real servers.


Figure: fullnat

The idea is to perform both DNAT and SNAT on traffic in each direction, and fullnat scales very well. As the figure shows, the RS never sees the real client IP; from its point of view every request comes from the LB. This is the drawback of fullnat, and it is solved by installing the TOA module on the RS; a rough sketch of the option it relies on follows.
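
For reference, this is roughly what the TOA TCP option carried in LB-to-RS packets looks like. The field layout and the private option kind 254 follow the commonly deployed toa module; treat the exact values as an assumption and check the toa source you actually install.

/* Sketch of the TOA TCP option appended by the LB and parsed by the toa
 * kernel module on the RS. Option kind 254 / length 8 are assumptions
 * based on the common toa module, not taken from this article's source. */
#include <stdint.h>

struct toa_opt {
    uint8_t  opcode;   /* private TCP option kind, usually 254 */
    uint8_t  opsize;   /* total option length: 8 bytes */
    uint16_t port;     /* original client port, network byte order */
    uint32_t addr;     /* original client IPv4 address, network byte order */
};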

Introduction to synproxy

The most common DDoS attack is the SYN flood. It exploits the three-way handshake by only ever sending SYN packets; the kernel protocol stack allocates local memory for each of them, and once massive attack traffic is generated the machine can no longer handle normal traffic. In other words, it is enough to block the attack traffic at the first step of the handshake.


Figure: how syn proxy works

As the figure above shows, a normal request under syn proxy goes through four phases:

  1. The client sends a SYN. The LB answers this first handshake itself and does not forward it to the RS. In the SYN+ACK it sends back, the seq is generated by the syn-cookie algorithm and the window is set to 0, so no data may be carried during the handshake; this also means TCP Fast Open is not supported.
  2. When the client returns its ACK, the LB decodes the seq; if it matches the cookie algorithm, the traffic is legitimate. The LB then starts a three-way handshake with the backend RS, passing the client's window size through. Because the LB proxied the handshake, it also has to record the seq difference delta.
  3. During data exchange, besides the normal fullnat work, the LB has to compensate sequence numbers by delta.
  4. The connection is closed and cleaned up normally.

ipv4_rcv receives the client's SYN request

As mentioned in the previous article, ipv4_rcv receives the packet and finally invokes the hook:

INET_HOOK(INET_HOOK_PRE_ROUTING, mbuf, port, NULL, ipv4_rcv_fin)
int INET_HOOK(unsigned int hook, struct rte_mbuf *mbuf,
        struct netif_port *in, struct netif_port *out,
        int (*okfn)(struct rte_mbuf *mbuf))
{
    struct list_head *hook_list;
    struct inet_hook_ops *ops;
    struct inet_hook_state state;
    int verdict = INET_ACCEPT;

    state.hook = hook;
    hook_list = &inet_hooks[hook];

#ifdef CONFIG_DPVS_IPV4_INET_HOOK
    rte_rwlock_read_lock(&inet_hook_lock);
#endif

    ops = list_entry(hook_list, struct inet_hook_ops, list);

    if (!list_empty(hook_list)) {
        verdict = INET_ACCEPT;
        list_for_each_entry_continue(ops, hook_list, list) {
repeat:
            verdict = ops->hook(ops->priv, mbuf, &state);
            if (verdict != INET_ACCEPT) {
                if (verdict == INET_REPEAT)
                    goto repeat;
                break;
            }
        }
    }

#ifdef CONFIG_DPVS_IPV4_INET_HOOK
    rte_rwlock_read_unlock(&inet_hook_lock);
#endif

    if (verdict == INET_ACCEPT || verdict == INET_STOP) {
        return okfn(mbuf);
    } else if (verdict == INET_DROP) {
        rte_pktmbuf_free(mbuf);
        return EDPVS_DROP;
    } else { /* INET_STOLEN, e.g. a new SYN handled by synproxy step 1 */
        return EDPVS_OK;
    }
}

This hook runs the callbacks registered on INET_HOOK_PRE_ROUTING: dp_vs_pre_routing and dp_vs_in, but not every case runs both. For a new connection request, dp_vs_pre_routing returns INET_STOLEN, which means dp_vs_in is not executed, and neither is ipv4_rcv_fin, the okfn callback passed into INET_HOOK.
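
For context, the two callbacks are registered on the same hook point with different priorities, roughly as sketched below; the field names and priority values are written from memory and should be treated as an assumption rather than the literal DPVS source.

/* Hypothetical sketch of the PRE_ROUTING registration; a smaller priority
 * value is assumed to run earlier, so dp_vs_pre_routing sees the packet
 * before dp_vs_in does. */
static struct inet_hook_ops dp_vs_ops[] = {
    {
        .hook     = dp_vs_pre_routing,
        .hooknum  = INET_HOOK_PRE_ROUTING,
        .priority = 99,    /* runs first: may STEAL new SYNs for synproxy */
    },
    {
        .hook     = dp_vs_in,
        .hooknum  = INET_HOOK_PRE_ROUTING,
        .priority = 100,   /* runs next when the previous verdict was ACCEPT */
    },
};

Next is the (abridged) body of dp_vs_pre_routing: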

static int dp_vs_pre_routing(void *priv, struct rte_mbuf *mbuf,
                    const struct inet_hook_state *state)
{
   ...
    /* Synproxy: defence synflood */
    if (IPPROTO_TCP == iph.proto) {
        int v = INET_ACCEPT;
        if (0 == dp_vs_synproxy_syn_rcv(af, mbuf, &iph, &v))
            return v;
    }

    return INET_ACCEPT;
}

This is the body of dp_vs_pre_routing; skipping its first half, let's go straight to the implementation of dp_vs_synproxy_syn_rcv.

int dp_vs_synproxy_syn_rcv(int af, struct rte_mbuf *mbuf,
        const struct dp_vs_iphdr *iph, int *verdict)
{
    int ret;
    struct dp_vs_service *svc = NULL;
    struct tcphdr *th, _tcph;
    struct dp_vs_synproxy_opt tcp_opt;
    struct netif_port *dev;
    struct ether_hdr *eth;
    struct ether_addr ethaddr;

    th = mbuf_header_pointer(mbuf, iph->len, sizeof(_tcph), &_tcph);
    if (unlikely(NULL == th))
        goto syn_rcv_out;
    // first handshake: a pure SYN, and the target svc has syn-proxy protection enabled
    if (th->syn && !th->ack && !th->rst && !th->fin &&
            (svc = dp_vs_service_lookup(af, iph->proto, 
                                        &iph->daddr, th->dest, 0, NULL, NULL)) &&
            (svc->flags & DP_VS_SVC_F_SYNPROXY)) {
        /* if service's weight is zero (non-active realserver),
         * do noting and drop the packet */
        // if the svc weight is 0, there is effectively no usable backend
        if (svc->weight == 0) {
            dp_vs_estats_inc(SYNPROXY_NO_DEST);
            dp_vs_service_put(svc);
            goto syn_rcv_out;
        }

        dp_vs_service_put(svc);

        /* drop packet from blacklist: if the source address is blacklisted, bail out */
        if (dp_vs_blklst_lookup(iph->proto, &iph->daddr, th->dest, &iph->saddr)) {
            goto syn_rcv_out;
        }

    } else {
        if (svc)
            dp_vs_service_put(svc);
        return 1;
    }

    /* mbuf will be reused and ether header will be set.
     * FIXME: to support non-ether packets. */
    if (mbuf->l2_len != sizeof(struct ether_hdr))
        goto syn_rcv_out;

    /* update statistics */
    dp_vs_estats_inc(SYNPROXY_SYN_CNT);

    /* set tx offload flags */
    assert(mbuf->port <= NETIF_MAX_PORTS);
    dev = netif_port_get(mbuf->port);
    if (unlikely(!dev)) {
        RTE_LOG(ERR, IPVS, "%s: device eth%d not found\n",
                __func__, mbuf->port);
        goto syn_rcv_out;
    }
    if (likely(dev && (dev->flag & NETIF_PORT_FLAG_TX_TCP_CSUM_OFFLOAD)))
        mbuf->ol_flags |= (PKT_TX_TCP_CKSUM | PKT_TX_IP_CKSUM | PKT_TX_IPV4);

    /* reuse mbuf */
    syn_proxy_reuse_mbuf(af, mbuf, &tcp_opt);

    /* set L2 header and send the packet out
     * It is noted that "ipv4_xmit" should not used here,
     * because mbuf is reused. */
    eth = (struct ether_hdr *)rte_pktmbuf_prepend(mbuf, mbuf->l2_len);
    if (unlikely(!eth)) {
        RTE_LOG(ERR, IPVS, "%s: no memory\n", __func__);
        goto syn_rcv_out;
    }
    memcpy(&ethaddr, &eth->s_addr, sizeof(struct ether_addr));
    memcpy(&eth->s_addr, &eth->d_addr, sizeof(struct ether_addr));
    memcpy(&eth->d_addr, &ethaddr, sizeof(struct ether_addr));

    if (unlikely(EDPVS_OK != (ret = netif_xmit(mbuf, dev)))) {
        RTE_LOG(ERR, IPVS, "%s: netif_xmit failed -- %s\n",
                __func__, dpvs_strerror(ret));
    /* should not set verdict to INET_DROP since netif_xmit
     * always consume the mbuf while INET_DROP means mbuf'll
     * be free in INET_HOOK.*/
    }

    *verdict = INET_STOLEN;
    return 0;

syn_rcv_out:
    /* drop and destroy the packet */
    *verdict = INET_DROP;
    return 0;
}
  1. Check that this is a first-handshake packet carrying only SYN, and that the target service svc has syn-proxy protection enabled. If the source is blacklisted, bail out and return INET_DROP.
  2. syn_proxy_reuse_mbuf reuses the mbuf. Why call it "reuse"? Because the mbuf is modified in place and sent straight back to the client as the reply; this modification is the key part.
  3. The three memcpy calls swap the destination and source MAC addresses of the original mbuf, and then netif_xmit sends it back to the client.

Let's see what syn_proxy_reuse_mbuf actually does.

static void syn_proxy_reuse_mbuf(int af, struct rte_mbuf *mbuf,
        struct dp_vs_synproxy_opt *opt)
{
    uint32_t isn;
    uint32_t tmpaddr;
    uint16_t tmpport;
    struct iphdr *iph;
    struct tcphdr *th;
    int ip4hlen;

    iph = (struct iphdr*)ip4_hdr(mbuf);
    ip4hlen = ip4_hdrlen(mbuf);
    th = tcp_hdr(mbuf);

    if (mbuf_may_pull(mbuf, ip4hlen + (th->doff<< 2)) != 0)
        return;

    /* deal with tcp options */
    syn_proxy_parse_set_opts(mbuf, th, opt);

    /* get cookie */
    isn = syn_proxy_cookie_v4_init_sequence(mbuf, opt);

    /* set syn-ack flag */
    ((uint8_t *)th)[13] = 0x12;

    /* exchage ports */
    tmpport = th->dest;
    th->dest = th->source;
    th->source = tmpport;

    /* set seq(cookie) and ack_seq */
    th->ack_seq = htonl(ntohl(th->seq) + 1);
    th->seq = htonl(isn);

    /* exchage addresses */
    tmpaddr = iph->saddr;
    iph->saddr = iph->daddr;
    iph->daddr = tmpaddr;

    iph->ttl = dp_vs_synproxy_ctrl_synack_ttl;
    iph->tos = 0;

    /* compute checksum */
    if (likely(mbuf->ol_flags & PKT_TX_TCP_CKSUM)) {
        mbuf->l3_len = ip4hlen;
        mbuf->l4_len = ntohs(ip4_hdr(mbuf)->total_length) - ip4hlen;
        th->check = rte_ipv4_phdr_cksum(ip4_hdr(mbuf), mbuf->ol_flags);
    } else {
        if (mbuf_may_pull(mbuf, mbuf->pkt_len) != 0)
            return;
        tcp4_send_csum((struct ipv4_hdr*)iph, th);
    }

    if (likely(mbuf->ol_flags & PKT_TX_IP_CKSUM))
        iph->check = 0;
    else
        ip4_send_csum((struct ipv4_hdr*)iph);
}
  1. syn_proxy_parse_set_opts parses and sets the TCP options, including mss, window size and timestamp.
  2. syn_proxy_cookie_v4_init_sequence computes the cookie; internally it calls secure_tcp_syn_cookie(iph->saddr, iph->daddr, th->source, th->dest, ntohl(th->seq), rte_atomic32_read(&g_minute_count), data) (see the sketch after this list for the general idea).
  3. The dest and source ports are swapped.
  4. The sequence numbers are set: ack_seq is the client's seq plus one, and the returned SYN seq is the cookie just computed.
  5. The source and destination IP addresses are swapped.
  6. If the NIC cannot offload the checksum, ip4_send_csum computes it in software.
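
The details of secure_tcp_syn_cookie are out of scope here, but the stateless idea behind it can be sketched as below; this is a simplified illustration with an assumed hash function, not the DPVS algorithm itself.

/* Conceptual syn-cookie sketch: the ISN is a keyed hash of the 4-tuple plus
 * a coarse time counter, with a few low bits reserved for the encoded MSS
 * index. The check recomputes the hash (the real code also accepts the
 * previous time window), so no per-connection state is stored on the LB. */
#include <stdint.h>

static uint32_t mix32(uint32_t a, uint32_t b, uint32_t c, uint32_t secret)
{
    uint32_t h = a * 2654435761u ^ b * 2246822519u ^ c * 3266489917u ^ secret;
    h ^= h >> 16;
    return h * 2654435761u;
}

static uint32_t cookie_make(uint32_t saddr, uint32_t daddr, uint16_t sport,
                            uint16_t dport, uint32_t minute, uint32_t mss_idx)
{
    uint32_t ports = ((uint32_t)sport << 16) | dport;
    return (mix32(saddr ^ daddr, ports, minute, 0x5deece66u) & ~7u) | mss_idx;
}

/* returns the recovered mss index, or -1 when the cookie is forged or stale */
static int cookie_check(uint32_t saddr, uint32_t daddr, uint16_t sport,
                        uint16_t dport, uint32_t minute, uint32_t cookie)
{
    uint32_t ports = ((uint32_t)sport << 16) | dport;
    uint32_t base  = mix32(saddr ^ daddr, ports, minute, 0x5deece66u) & ~7u;
    return (cookie & ~7u) == base ? (int)(cookie & 7u) : -1;
}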

Summary of ipv4_rcv receiving the client's SYN: the LB allocates no memory resources at all at this point; it encodes the state into a cookie and stores it in the SYN+ACK's sequence number. If this is attack traffic, the LB bears almost no burden. This ends the first phase: dp_vs_pre_routing returns INET_STOLEN, INET_HOOK returns right away, and the okfn callback is never executed.

ipv4_rcv receives the client's ACK reply

As described above, the hook runs the callbacks registered on INET_HOOK_PRE_ROUTING: dp_vs_pre_routing and dp_vs_in. When the client's ACK arrives, dp_vs_pre_routing returns INET_ACCEPT and processing continues into dp_vs_in.
At this point no connection has been established yet, so nothing exists in the flow table. dp_vs_proto_lookup finds the protocol handler; here we only care about dp_vs_proto_tcp. Since conn_lookup finds no flow-table entry, execution continues into tcp_conn_sched.

static int tcp_conn_sched(struct dp_vs_proto *proto, 
                          const struct dp_vs_iphdr *iph,
                          struct rte_mbuf *mbuf, 
                          struct dp_vs_conn **conn,
                          int *verdict)
{
    struct tcphdr *th, _tcph;
    struct dp_vs_service *svc;
    assert(proto && iph && mbuf && conn && verdict);

    th = mbuf_header_pointer(mbuf, iph->len, sizeof(_tcph), &_tcph);
    if (unlikely(!th)) {
        *verdict = INET_DROP;
        return EDPVS_INVPKT;
    }

    /* Syn-proxy step 2 logic: receive client's 3-handshacke ack packet */
    /* When synproxy disabled, only SYN packets can arrive here.
     * So don't judge SYNPROXY flag here! If SYNPROXY flag judged, and syn_proxy
     * got disbled and keepalived reloaded, SYN packets for RS may never be sent. */
    if (dp_vs_synproxy_ack_rcv(iph->af, mbuf, th, proto, conn, iph, verdict) == 0) {
        /* Attention: First ACK packet is also stored in conn->ack_mbuf */
        return EDPVS_PKTSTOLEN;
    }

    /* only TCP-SYN without other flag can be scheduled */
    if (!th->syn || th->ack || th->fin || th->rst) {
#ifdef CONFIG_DPVS_IPVS_DEBUG
        char dbuf[64], sbuf[64];
        const char *daddr, *saddr;

        daddr = inet_ntop(iph->af, &iph->daddr, dbuf, sizeof(dbuf)) ? dbuf : "::";
        saddr = inet_ntop(iph->af, &iph->saddr, sbuf, sizeof(sbuf)) ? sbuf : "::";
        RTE_LOG(DEBUG, IPVS,
                "%s: [%d] try sched non-SYN packet: [%c%c%c%c] %s:%d->%s:%d\n", 
                __func__, rte_lcore_id(),
                th->syn ? 'S' : '.', th->fin ? 'F' : '.',
                th->ack ? 'A' : '.', th->rst ? 'R' : '.',
                saddr, ntohs(th->source), daddr, ntohs(th->dest));
#endif

        /* Drop tcp packet which is send to vip and !vport */
        if (g_defence_tcp_drop &&
                (svc = dp_vs_lookup_vip(iph->af, iph->proto, &iph->daddr))) {
            dp_vs_estats_inc(DEFENCE_TCP_DROP);
            *verdict = INET_DROP;
            return EDPVS_INVPKT;
        }

        *verdict = INET_ACCEPT;
        return EDPVS_INVAL;
    }

    svc = dp_vs_service_lookup(iph->af, iph->proto, 
                               &iph->daddr, th->dest, 0, mbuf, NULL);
    if (!svc) {
        /* Drop tcp packet which is send to vip and !vport */
        if (g_defence_tcp_drop &&
                (svc = dp_vs_lookup_vip(iph->af, iph->proto, &iph->daddr))) {
            dp_vs_estats_inc(DEFENCE_TCP_DROP);
            *verdict = INET_DROP;
            return EDPVS_INVPKT;
        }
        *verdict = INET_ACCEPT;
        return EDPVS_NOSERV;
    }

    *conn = dp_vs_schedule(svc, iph, mbuf, false);
    if (!*conn) {
        dp_vs_service_put(svc);
        *verdict = INET_DROP;
        return EDPVS_RESOURCE;
    }

    dp_vs_service_put(svc);

    return EDPVS_OK;
}

Note the comments: if dp_vs_synproxy_ack_rcv handles the packet, tcp_conn_sched returns EDPVS_PKTSTOLEN, and in the end dp_vs_in also returns STOLEN. Let's look at the dp_vs_synproxy_ack_rcv source in more detail.

/* Syn-proxy step 2 logic: receive client's Ack
 * Receive client's 3-handshakes ack packet, do cookie check and then
 * send syn to rs after creating a session */
int dp_vs_synproxy_ack_rcv(int af, struct rte_mbuf *mbuf,
        struct tcphdr *th, struct dp_vs_proto *pp,
        struct dp_vs_conn **cpp,
        const struct dp_vs_iphdr *iph, int *verdict)
{
    int res;
    struct dp_vs_synproxy_opt opt;
    struct dp_vs_service *svc;
    int res_cookie_check;

    /* Do not check svc syn-proxy flag, as it may be changed after syn-proxy step 1. */
    if (!th->syn && th->ack && !th->rst && !th->fin &&
            (svc = dp_vs_service_lookup(af, iph->proto, &iph->daddr,
                                        th->dest, 0, NULL, NULL))) {
        if (dp_vs_synproxy_ctrl_defer &&
                !syn_proxy_ack_has_data(mbuf, iph, th)) {
            /* Update statistics */
            dp_vs_estats_inc(SYNPROXY_NULL_ACK);
            /* We get a pure ack when expecting ack packet with payload, so
             * have to drop it */
            dp_vs_service_put(svc);
            *verdict = INET_DROP;
            return 0;
        }

        res_cookie_check = syn_proxy_v4_cookie_check(mbuf, ntohl(th->ack_seq) - 1, &opt);
        if (!res_cookie_check) {
            /* Update statistics */
            dp_vs_estats_inc(SYNPROXY_BAD_ACK);
            /* Cookie check failed, drop the packet */
            RTE_LOG(DEBUG, IPVS, "%s: syn_cookie check failed seq=%u\n", __func__,
                    ntohl(th->ack_seq) - 1);
            dp_vs_service_put(svc);
            *verdict = INET_DROP;
            return 0;
        }

        /* Update statistics */
        dp_vs_estats_inc(SYNPROXY_OK_ACK);

        /* Let the virtual server select a real server for the incoming connetion,
         * and create a connection entry */
        *cpp = dp_vs_schedule(svc, iph, mbuf, 1);
        if (unlikely(!*cpp)) {
            RTE_LOG(WARNING, IPVS, "%s: ip_vs_schedule failed\n", __func__);
            /* FIXME: What to do when virtual service is available but no destination
             * available for a new connetion: send an icmp UNREACHABLE ? */
            dp_vs_service_put(svc);
            *verdict = INET_DROP;
            return 0;
        }
        
        /* Release the service, we do not need it any more */
        dp_vs_service_put(svc);

        /* Do nothing but print a error msg when fail, because session will be
         * correctly freed in dp_vs_conn_expire */
        if (EDPVS_OK != (res = syn_proxy_send_rs_syn(af, th, *cpp, mbuf, pp, &opt))) {
            RTE_LOG(ERR, IPVS, "%s: syn_proxy_send_rs_syn failed -- %s\n",
                    __func__, dpvs_strerror(res));
        }
        /* Count in the ack packet (STOLEN by synproxy) */
        dp_vs_stats_in(*cpp, mbuf);

        /* Active session timer, and dec  refcnt.
         * Also steal the mbuf, and let caller return immediately */
        dp_vs_conn_put(*cpp);
        *verdict = INET_STOLEN;
        return 0;
    }
    return 1;
}

syn_proxy_v4_cookie_check decodes the seq cookie; if it does not match, the packet is attack or otherwise invalid traffic and is dropped. If the check passes, syn-proxy phase two runs: the LB calls dp_vs_schedule to create the connection towards the backend real server, and there are details worth noting here too.

/* select an RS by service's scheduler and create a connection */
struct dp_vs_conn *dp_vs_schedule(struct dp_vs_service *svc, 
                                  const struct dp_vs_iphdr *iph,
                                  struct rte_mbuf *mbuf,
                                  bool is_synproxy_on)
{
    uint16_t _ports[2], *ports; /* sport, dport */
    struct dp_vs_dest *dest;
    struct dp_vs_conn *conn;
    struct dp_vs_conn_param param;
    struct sockaddr_in daddr, saddr;
    int err;

    assert(svc && iph && mbuf);

    ports = mbuf_header_pointer(mbuf, iph->len, sizeof(_ports), _ports);
    if (!ports)
        return NULL;
        
    /* persistent service: session persistence */
    if (svc->flags & DP_VS_SVC_F_PERSISTENT)
        return dp_vs_sched_persist(svc, iph,  mbuf, is_synproxy_on);

    dest = svc->scheduler->schedule(svc, mbuf); // pick an RS using the service's scheduling algorithm
    if (!dest) {
        RTE_LOG(WARNING, IPVS, "%s: no dest found.\n", __func__);
#ifdef CONFIG_DPVS_MBUF_DEBUG
        dp_vs_mbuf_dump("found dest failed.", iph->af, mbuf);
#endif
        return NULL;
    }
        
    if (dest->fwdmode == DPVS_FWD_MODE_SNAT) {
        if (unlikely(iph->proto == IPPROTO_ICMP)) {
            struct icmphdr *ich, _icmph;
            ich = mbuf_header_pointer(mbuf, iph->len, sizeof(_icmph), &_icmph);
            if (!ich)
                return NULL;

            ports = _ports;
            _ports[0] = icmp4_id(ich);
            _ports[1] = ich->type << 8 | ich->code;

            /* ID may confict for diff host,
             * need we use ID pool ? */
            dp_vs_conn_fill_param(iph->af, iph->proto,
                                  &iph->daddr, &dest->addr,
                                  ports[1], ports[0],
                                  0, &param);
        } else {
            /* we cannot inherit dest (host's src port),
             * that may confict for diff hosts,
             * and using dest->port is worse choice. */
            memset(&daddr, 0, sizeof(daddr));
            daddr.sin_family = AF_INET;
            daddr.sin_addr = iph->daddr.in;
            daddr.sin_port = ports[1];
            memset(&saddr, 0, sizeof(saddr));
            saddr.sin_family = AF_INET;
            saddr.sin_addr = dest->addr.in;
            saddr.sin_port = 0;

            err = sa_fetch(NULL, &daddr, &saddr);
            if (err != 0) {
#ifdef CONFIG_DPVS_MBUF_DEBUG
                dp_vs_mbuf_dump("sa_fetch failed.", iph->af, mbuf);
#endif
                return NULL;
            }

            dp_vs_conn_fill_param(iph->af, iph->proto,
                                  &iph->daddr, &dest->addr,
                                  ports[1], saddr.sin_port,
                                  0, &param);
        }
    } else {
        if (unlikely(iph->proto == IPPROTO_ICMP)) {
            struct icmphdr *ich, _icmph;
            ich = mbuf_header_pointer(mbuf, iph->len, sizeof(_icmph), &_icmph);
            if (!ich)
                return NULL;

            ports = _ports;
            _ports[0] = icmp4_id(ich);
            _ports[1] = ich->type << 8 | ich->code;

            dp_vs_conn_fill_param(iph->af, iph->proto,
                                  &iph->saddr, &iph->daddr,
                                  ports[0], ports[1], 0, &param);
        } else {
            dp_vs_conn_fill_param(iph->af, iph->proto,
                                  &iph->saddr, &iph->daddr,
                                  ports[0], ports[1], 0, &param);
        }
    }

    conn = dp_vs_conn_new(mbuf, &param, dest,
            is_synproxy_on ? DPVS_CONN_F_SYNPROXY : 0);
    if (!conn) {
        if (dest->fwdmode == DPVS_FWD_MODE_SNAT && iph->proto != IPPROTO_ICMP)
            sa_release(NULL, &daddr, &saddr);
#ifdef CONFIG_DPVS_MBUF_DEBUG
        dp_vs_mbuf_dump("create conn failed.", iph->af, mbuf);
#endif
        return NULL;
    }

    dp_vs_stats_conn(conn);
    return conn;
}

According to the service's scheduling algorithm a backend real server is selected, the dest in the source. dp_vs_conn_new is then called to create the connection towards dest, and this is where syn proxy comes in.

struct dp_vs_conn * dp_vs_conn_new(struct rte_mbuf *mbuf,
                                   struct dp_vs_conn_param *param,
                                   struct dp_vs_dest *dest, uint32_t flags)
{
    struct dp_vs_conn *new;
    struct conn_tuple_hash *t;
    uint16_t rport;
    __be16 _ports[2], *ports;
    int err;

    assert(mbuf && param && dest);
    // allocate from a mempool; this matters, since malloc would be slow, especially at high connection rates
    if (unlikely(rte_mempool_get(this_conn_cache, (void **)&new) != 0)) {
        RTE_LOG(WARNING, IPVS, "%s: no memory\n", __func__);
        return NULL;
    }
    memset(new, 0, sizeof(struct dp_vs_conn));
    new->connpool = this_conn_cache;

    /* set proper RS port */
    if ((flags & DPVS_CONN_F_TEMPLATE) || param->ct_dport != 0)
        rport = param->ct_dport;
    else if (dest->fwdmode == DPVS_FWD_MODE_SNAT) {
        if (unlikely(param->proto == IPPROTO_ICMP)) {
            rport = param->vport;
        } else {
            ports = mbuf_header_pointer(mbuf, ip4_hdrlen(mbuf),
                                        sizeof(_ports), _ports);
            if (unlikely(!ports)) {
                RTE_LOG(WARNING, IPVS, "%s: no memory\n", __func__);
                goto errout;
            }
            rport = ports[0];
        }
    } else
        rport = dest->port;

    /* init inbound conn tuple hash */
    t = &tuplehash_in(new);
    t->direct   = DPVS_CONN_DIR_INBOUND; // inbound direction: traffic coming in from the outside
    t->af       = param->af;
    t->proto    = param->proto;
    t->saddr    = *param->caddr; // source is the external client address
    t->sport    = param->cport;
    t->daddr    = *param->vaddr; // destination is the service VIP
    t->dport    = param->vport;
    INIT_LIST_HEAD(&t->list);

    /* init outbound conn tuple hash */
    t = &tuplehash_out(new);
    t->direct   = DPVS_CONN_DIR_OUTBOUND; // outbound direction
    t->af       = param->af;
    t->proto    = param->proto;
    if (dest->fwdmode == DPVS_FWD_MODE_SNAT)
        t->saddr.in.s_addr    = ip4_hdr(mbuf)->src_addr;
    else
        t->saddr    = dest->addr;
    t->sport    = rport;
    t->daddr    = *param->caddr;    /* non-FNAT */
    t->dport    = param->cport;     /* non-FNAT */
    INIT_LIST_HEAD(&t->list);

    /* init connection */
    new->af     = param->af;
    new->proto  = param->proto;
    new->caddr  = *param->caddr;
    new->cport  = param->cport;
    new->vaddr  = *param->vaddr;
    new->vport  = param->vport;
    new->laddr  = *param->caddr;    /* non-FNAT */
    new->lport  = param->cport;     /* non-FNAT */
    if (dest->fwdmode == DPVS_FWD_MODE_SNAT)
        new->daddr.in.s_addr  = ip4_hdr(mbuf)->src_addr;
    else
        new->daddr  = dest->addr;
    new->dport  = rport;

    /* neighbour confirm cache (neighbour subsystem) */
    new->in_nexthop.in.s_addr = htonl(INADDR_ANY);
    new->out_nexthop.in.s_addr = htonl(INADDR_ANY);

    new->in_dev = NULL;
    new->out_dev = NULL;

    /* Controll member */
    new->control = NULL;
    rte_atomic32_clear(&new->n_control);

    /* caller will use it right after created,
     * just like dp_vs_conn_get(). */
    rte_atomic32_set(&new->refcnt, 1);
    new->flags  = flags;
    new->state  = 0;
#ifdef CONFIG_DPVS_IPVS_STATS_DEBUG
    new->ctime = rte_rdtsc();
#endif

    /* bind destination and corresponding trasmitter */
    // bind the xmit handlers that correspond to the forwarding mode
    err = conn_bind_dest(new, dest);
    if (err != EDPVS_OK) {
        RTE_LOG(WARNING, IPVS, "%s: fail to bind dest: %s\n",
                __func__, dpvs_strerror(err));
        goto errout;
    }

    /* FNAT only: select and bind local address/port */
    if (dest->fwdmode == DPVS_FWD_MODE_FNAT) {
        if ((err = dp_vs_laddr_bind(new, dest->svc)) != EDPVS_OK)
            goto unbind_dest;
    }

    /* add to hash table (dual dir for each bucket) */
    if ((err = conn_hash(new)) != EDPVS_OK)
        goto unbind_laddr;

    /* timer */
    new->timeout.tv_sec = conn_init_timeout;
    new->timeout.tv_usec = 0;

    /* synproxy: fields used by syn proxy */
    INIT_LIST_HEAD(&new->ack_mbuf);
    rte_atomic32_set(&new->syn_retry_max, 0);
    rte_atomic32_set(&new->dup_ack_cnt, 0);
    if ((flags & DPVS_CONN_F_SYNPROXY) && !(flags & DPVS_CONN_F_TEMPLATE)) {
        struct tcphdr _tcph, *th;
        struct dp_vs_synproxy_ack_pakcet *ack_mbuf;
        struct dp_vs_proto *pp;

        th = mbuf_header_pointer(mbuf, ip4_hdrlen(mbuf), sizeof(_tcph), &_tcph);
        if (!th) {
            RTE_LOG(ERR, IPVS, "%s: get tcphdr failed\n", __func__);
            goto unbind_laddr;
        }

        /* save ack packet */
        if (unlikely(rte_mempool_get(this_ack_mbufpool, (void **)&ack_mbuf) != 0)) {
            RTE_LOG(ERR, IPVS, "%s: no memory\n", __func__);
            goto unbind_laddr;
        }
        ack_mbuf->mbuf = mbuf;
        list_add_tail(&ack_mbuf->list, &new->ack_mbuf);
        new->ack_num++;
        sp_dbg_stats32_inc(sp_ack_saved);

        /* save ack_seq - 1 */
        new->syn_proxy_seq.isn =
            htonl((uint32_t) ((ntohl(th->ack_seq) - 1)));

        /* save ack_seq */
        new->fnat_seq.fdata_seq = htonl(th->ack_seq);

        /* FIXME: use DP_VS_TCP_S_SYN_SENT for syn */
        pp = dp_vs_proto_lookup(param->proto);
        new->timeout.tv_sec = pp->timeout_table[new->state = DPVS_TCP_S_SYN_SENT];
    }

    this_conn_count++;

    /* schedule conn timer */
    dpvs_time_rand_delay(&new->timeout, 1000000);
    if (new->flags & DPVS_CONN_F_TEMPLATE)
        dpvs_timer_sched(&new->timer, &new->timeout, conn_expire, new, true);
    else
        dpvs_timer_sched(&new->timer, &new->timeout, conn_expire, new, false);

#ifdef CONFIG_DPVS_IPVS_DEBUG
    conn_dump("new conn: ", new);
#endif
    return new;

unbind_laddr:
    dp_vs_laddr_unbind(new);
unbind_dest:
    conn_unbind_dest(new);
errout:
    rte_mempool_put(this_conn_cache, new);
    return NULL;
}
  1. tuplehash_in/tuplehash_out build the tuple hashes for both directions, which are used for later lookups (see the sketch after this list).
  2. conn_bind_dest binds the forwarding-mode specific callbacks, for example:
   case DPVS_FWD_MODE_FNAT:
        conn->packet_xmit = dp_vs_xmit_fnat;
        conn->packet_out_xmit = dp_vs_out_xmit_fnat;
  3. dp_vs_laddr_bind binds a local address/port on the LB, which is easy to understand since fullnat performs NAT in both directions.
  4. conn_hash inserts the connection into the flow table this_conn_tab.
  5. Handle synproxy: the mbuf is appended to the ack_mbuf list, the client's ack_seq is saved into fnat_seq.fdata_seq, and ack_seq - 1 is saved into syn_proxy_seq.isn.
  6. The connection is put under dpvs_timer_sched for timeout management.
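
A conceptual sketch of why two tuple hashes are needed is below (assumed key layout and hash, not the actual DPVS structures): the same conn is inserted under both its inbound tuple (client to VIP) and its outbound tuple (RS to client), so a packet arriving from either side hashes straight to it.

/* Conceptual sketch only: each dp_vs_conn is linked into this_conn_tab twice,
 * once per direction, keyed by the 5-tuple of that direction. */
#include <stdint.h>

struct tuple_key {
    uint32_t saddr, daddr;
    uint16_t sport, dport;
    uint8_t  proto;
};

static uint32_t tuple_hash(const struct tuple_key *k, uint32_t bucket_mask)
{
    uint32_t h = k->saddr ^ k->daddr
               ^ (((uint32_t)k->sport << 16) | k->dport) ^ k->proto;
    h ^= h >> 13;
    return (h * 2654435761u) & bucket_mask;   /* bucket index in the flow table */
}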

Back in dp_vs_synproxy_ack_rcv: after dp_vs_schedule has created the connection, syn_proxy_send_rs_syn is called to complete the connection establishment between the LB and the real server.

static int syn_proxy_send_rs_syn(int af, const struct tcphdr *th,
        struct dp_vs_conn *cp, struct rte_mbuf *mbuf,
        struct dp_vs_proto *pp, struct dp_vs_synproxy_opt *opt)
{
    int tcp_hdr_size;
    struct rte_mbuf *syn_mbuf, *syn_mbuf_cloned;
    struct rte_mempool *pool;
    struct iphdr *ack_iph;
    struct iphdr *syn_iph;
    struct tcphdr *syn_th; 

    if (!cp->packet_xmit) {
        RTE_LOG(WARNING, IPVS, "%s: packet_xmit is null\n", __func__);
        return EDPVS_INVAL;
    }

    /* Allocate mbuf from device mempool */
    pool = get_mbuf_pool(cp, DPVS_CONN_DIR_INBOUND);
    if (unlikely(!pool)) {
        //RTE_LOG(WARNING, IPVS, "%s: %s\n", __func__, dpvs_strerror(EDPVS_NOROUTE));
        return EDPVS_NOROUTE;
    }

    syn_mbuf = rte_pktmbuf_alloc(pool);
    if (unlikely(!syn_mbuf)) {
        //RTE_LOG(WARNING, IPVS, "%s: %s\n", __func__, dpvs_strerror(EDPVS_NOMEM));
        return EDPVS_NOMEM;
    }
    syn_mbuf->userdata = NULL; /* make sure "no route info" */

    /* Reserve space for tcp header */
    tcp_hdr_size = (sizeof(struct tcphdr) + TCPOLEN_MAXSEG
            + (opt->tstamp_ok ? TCPOLEN_TSTAMP_APPA : 0)
            + (opt->wscale_ok ? TCP_OLEN_WSCALE_ALIGNED : 0)
            /* SACK_PERM is in the palce of NOP NOP of TS */
            + ((opt->sack_ok && !opt->tstamp_ok) ? TCP_OLEN_SACKPERMITTED_ALIGNED : 0));
    syn_th = (struct tcphdr *)rte_pktmbuf_prepend(syn_mbuf, tcp_hdr_size);
    if (!syn_th) {
        rte_pktmbuf_free(syn_mbuf);
        //RTE_LOG(WARNING, IPVS, "%s:%s\n", __func__, dpvs_strerror(EDPVS_NOROOM));
        return EDPVS_NOROOM;
    }

    /* Set up tcp header */
    memset(syn_th, 0, tcp_hdr_size);
    syn_th->source = th->source;
    syn_th->dest = th->dest;
    syn_th->seq = htonl(ntohl(th->seq) - 1);
    syn_th->ack_seq = 0;
    *(((uint16_t *) syn_th) + 6) = htons(((tcp_hdr_size >> 2) << 12) | /*TH_SYN*/ 0x02);
    /* FIXME: what window should we use */
    syn_th->window = htons(5000);
    syn_th->check = 0;
    syn_th->urg_ptr = 0;
    syn_th->urg = 0;
    syn_proxy_syn_build_options((uint32_t *)(syn_th + 1), opt);

    /* Reserve space for ipv4 header */
    syn_iph = (struct iphdr *)rte_pktmbuf_prepend(syn_mbuf, sizeof(struct ipv4_hdr));
    if (!syn_iph) {
        rte_pktmbuf_free(syn_mbuf);
        //RTE_LOG(WARNING, IPVS, "%s:%s\n", __func__, dpvs_strerror(EDPVS_NOROOM));
        return EDPVS_NOROOM;
    }

    ack_iph = (struct iphdr *)ip4_hdr(mbuf);
    *((uint16_t *) syn_iph) = htons((4 << 12) | (5 << 8) | (ack_iph->tos & 0x1E));
    syn_iph->tot_len = htons(syn_mbuf->pkt_len);
    syn_iph->frag_off = htons(IPV4_HDR_DF_FLAG);
    syn_iph->ttl = 64;
    syn_iph->protocol = IPPROTO_TCP;
    syn_iph->saddr = ack_iph->saddr;
    syn_iph->daddr = ack_iph->daddr;

    /* checksum is done by fnat_in_handler */
    syn_iph->check = 0;

    /* Save syn_mbuf if syn retransmission is on */
    if (dp_vs_synproxy_ctrl_syn_retry > 0) {
        syn_mbuf_cloned = rte_pktmbuf_clone(syn_mbuf, pool);
        if (unlikely(!syn_mbuf_cloned)) {
            rte_pktmbuf_free(syn_mbuf);
            //RTE_LOG(WARNING, IPVS, "%s:%s\n", __func__, dpvs_strerror(EDPVS_NOMEM));
            return EDPVS_NOMEM;
        }

        syn_mbuf_cloned->userdata = NULL;
        cp->syn_mbuf = syn_mbuf_cloned;
        sp_dbg_stats32_inc(sp_syn_saved);
        rte_atomic32_set(&cp->syn_retry_max, dp_vs_synproxy_ctrl_syn_retry);
    }

    /* Count in the syn packet */
    dp_vs_stats_in(cp, mbuf);

    /* If xmit failed, syn_mbuf will be freed correctly */
    cp->packet_xmit(pp, cp, syn_mbuf);

    return EDPVS_OK;
}
  1. Allocate syn_mbuf from the mempool; it will be sent to the backend real server.
  2. Fill in the L4 header of syn_mbuf via syn_th: source and destination ports, seq, window, and so on.
  3. Fill in the L3 header of syn_mbuf via syn_iph: source and destination addresses, and so on.
  4. Call packet_xmit to send syn_mbuf to the RS. From conn_bind_dest above we know this ends up in dp_vs_xmit_fnat, whose key step is calling tcp_fnat_in_handler:
    /* 
     * for SYN packet
     * 1. remove tcp timestamp option
     *    laddress for different client have diff timestamp.
     * 2. save original TCP sequence for seq-adjust later.
     *    since TCP option will be change.
     * 3. add TOA option
     *    so that RS with TOA module can get real client IP.
     */
    if (th->syn && !th->ack) {
        tcp_in_remove_ts(th);
        tcp_in_init_seq(conn, mbuf, th);
        tcp_in_add_toa(conn, mbuf, th);
    }

    /* add toa to first data packet */
    if (ntohl(th->ack_seq) == conn->fnat_seq.fdata_seq
            && !th->syn && !th->rst && !th->fin)
        tcp_in_add_toa(conn, mbuf, th);

    tcp_in_adjust_seq(conn, th);

    /* L4 translation */
    th->source  = conn->lport;
    th->dest    = conn->dport;
  1. For a SYN packet, i.e. the first handshake with the RS, the TCP timestamp option is removed from the header, the seq is initialized, and the seq delta is recorded.
  2. The TOA option is added, so that an RS with the TOA module can recover the real client IP and port (a rough sketch follows).
  3. For ordinary packets on an established connection, tcp_in_adjust_seq compensates the seq difference.
    Finally the SYN packet is sent to the RS.
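
To make tcp_in_add_toa concrete, here is a rough conceptual sketch of appending the option when forwarding the client's packet to the RS; it assumes the toa_opt layout sketched earlier, ignores tailroom, doff-limit and checksum handling, and is not the actual DPVS function.

/* Conceptual sketch only: append a TOA option after the existing TCP options
 * and grow the data offset, so the RS-side toa module can read the original
 * client address. Real code must also check doff limits, mbuf tailroom and
 * update the checksum. */
#include <stdint.h>
#include <string.h>
#include <netinet/tcp.h>

static void add_toa_sketch(struct tcphdr *th, uint8_t *opt_end,
                           uint32_t client_ip, uint16_t client_port)
{
    struct toa_opt opt = {
        .opcode = 254,              /* private option kind (assumption) */
        .opsize = sizeof(opt),      /* 8 bytes */
        .port   = client_port,      /* network byte order */
        .addr   = client_ip,        /* network byte order */
    };
    memcpy(opt_end, &opt, sizeof(opt));   /* append after the current options */
    th->doff += sizeof(opt) >> 2;         /* TCP header grows by two 32-bit words */
}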

ipv4_rcv receives the RS's SYN+ACK reply

When the RS replies with its SYN+ACK, the packet is processed by dp_vs_in; the conn already exists in the flow table at this point, so processing enters the syn-proxy logic:

            /* Syn-proxy 3 logic: receive syn-ack from rs */
            if (dp_vs_synproxy_synack_rcv(mbuf, conn, prot,
                                          ip4_hdrlen(mbuf), &verdict) == 0) {
                dp_vs_stats_out(conn, mbuf);
                dp_vs_conn_put(conn);
                return verdict;
            }

Now let's look at the concrete implementation of dp_vs_synproxy_synack_rcv.

/* Syn-proxy step 3 logic: receive rs's Syn/Ack.
 * Update syn_proxy_seq.delta and send stored ack mbufs to rs. */
int dp_vs_synproxy_synack_rcv(struct rte_mbuf *mbuf, struct dp_vs_conn *cp,
        struct dp_vs_proto *pp, int ihl, int *verdict)
{
    struct tcphdr _tcph, *th;
    struct dp_vs_synproxy_ack_pakcet *tmbuf, *tmbuf2;
    struct list_head save_mbuf;
    struct dp_vs_dest *dest = cp->dest;
    unsigned conn_timeout = 0;

    th = mbuf_header_pointer(mbuf, ihl, sizeof(_tcph), &_tcph);
    if (unlikely(!th)) {
        *verdict = INET_DROP;
        return 0;
    }
    INIT_LIST_HEAD(&save_mbuf);

    if ((th->syn) && (th->ack) && (!th->rst) &&
            (cp->flags & DPVS_CONN_F_SYNPROXY) &&
            (cp->state == DPVS_TCP_S_SYN_SENT)) {
        cp->syn_proxy_seq.delta = htonl(cp->syn_proxy_seq.isn) - htonl(th->seq);
        cp->state = DPVS_TCP_S_ESTABLISHED;
        conn_timeout = dp_vs_get_conn_timeout(cp);
        if (unlikely((conn_timeout != 0) && (cp->proto == IPPROTO_TCP)))
            cp->timeout.tv_sec = conn_timeout;
        else
            cp->timeout.tv_sec = pp->timeout_table[cp->state];
        dpvs_time_rand_delay(&cp->timeout, 1000000);
        if (dest) {
            rte_atomic32_inc(&dest->actconns);
            rte_atomic32_dec(&dest->inactconns);
            cp->flags &= ~DPVS_CONN_F_INACTIVE;
        }

        /* Save tcp sequence for fullnat/nat, inside to outside */
        if (DPVS_FWD_MODE_NAT == cp->dest->fwdmode ||
                DPVS_FWD_MODE_FNAT == cp->dest->fwdmode) {
            cp->rs_end_seq = htonl(ntohl(th->seq) + 1);
            cp->rs_end_ack = th->ack_seq;
        }

        // ip_vs_synproxy_save_fast_xmit_info ?

        /* Free stored syn mbuf, no need for retransmition any more */
        if (cp->syn_mbuf) {
            rte_pktmbuf_free(cp->syn_mbuf);
            cp->syn_mbuf = NULL;
            sp_dbg_stats32_dec(sp_syn_saved);
        }

        if (list_empty(&cp->ack_mbuf)) {
            /*
             * FIXME: Maybe a bug here, print err msg and go.
             * Attention: cp->state has been changed and we
             * should still DROP the syn/ack mbuf.
             */
            RTE_LOG(ERR, IPVS, "%s: got ack_mbuf NULL pointer: ack-saved = %u\n",
                    __func__, cp->ack_num);
            *verdict = INET_DROP;
            return 0;
        }

        list_for_each_entry_safe(tmbuf, tmbuf2, &cp->ack_mbuf, list) {
            list_del_init(&tmbuf->list);
            cp->ack_num--;
            list_add_tail(&tmbuf->list, &save_mbuf);
        }
        assert(cp->ack_num == 0);

        list_for_each_entry_safe(tmbuf, tmbuf2, &save_mbuf, list) {
            list_del_init(&tmbuf->list);
            /* syn_mbuf will be freed correctly if xmit failed */
            cp->packet_xmit(pp, cp, tmbuf->mbuf);
            /* free dp_vs_synproxy_ack_pakcet */
            rte_mempool_put(this_ack_mbufpool, tmbuf);
            sp_dbg_stats32_dec(sp_ack_saved);
        }

        *verdict = INET_DROP;
        return 0;
    } else if ((th->rst) &&
            (cp->flags & DPVS_CONN_F_SYNPROXY) &&
            (cp->state == DPVS_TCP_S_SYN_SENT)) {
        RTE_LOG(DEBUG, IPVS, "%s: get rst from rs, seq = %u ack_seq = %u\n",
                __func__, ntohl(th->seq), ntohl(th->ack_seq));

        /* Count the delta of seq */
        cp->syn_proxy_seq.delta =
            ntohl(cp->syn_proxy_seq.isn) - ntohl(th->seq);
        cp->state = DPVS_TCP_S_CLOSE;
        cp->timeout.tv_sec = pp->timeout_table[cp->state];
        dpvs_time_rand_delay(&cp->timeout, 1000000);
        th->seq = htonl(ntohl(th->seq) + 1);
        //syn_proxy_seq_csum_update ?

        return 1;
    }
    return 1;
}
  1. First check the reply: it must be a SYN+ACK without RST, synproxy must be enabled on the conn, and the conn must currently be in the DPVS_TCP_S_SYN_SENT state.
  2. Update syn_proxy_seq.delta, the sequence-number difference (a small worked example follows this list).
  3. Set the conn state to DPVS_TCP_S_ESTABLISHED.
  4. Save the sequence numbers rs_end_seq and rs_end_ack.
  5. Free syn_mbuf, which is no longer needed for retransmission.
  6. Move the saved entries off the connection's ack_mbuf list.
  7. Call dp_vs_xmit_fnat (via packet_xmit) to send each saved ack_mbuf out. From where to where? These ACKs came from the client, and now they are forwarded to the backend real server; inside, the source and destination addresses and ports are rewritten and the seq is adjusted. At this point the three-way handshake with the RS is truly complete.
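
A tiny worked example of this delta bookkeeping (conceptual; the real code keeps everything in network byte order, which is omitted here): suppose the cookie ISN the LB sent the client was 1000 and the ISN the RS chose was 7000.

/* Conceptual sketch of the seq compensation after synproxy step 3: the LB
 * shifts RS->client sequence numbers by delta and client->RS ack numbers
 * back by delta, so each side keeps seeing the numbering from its own
 * handshake. */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t cookie_isn = 1000;                 /* ISN the LB answered the client with */
    uint32_t rs_isn     = 7000;                 /* ISN the RS chose in its SYN+ACK     */
    uint32_t delta      = cookie_isn - rs_isn;  /* wraps modulo 2^32                   */

    uint32_t seq_from_rs   = 7001;              /* first data byte from the RS  */
    uint32_t ack_from_clnt = 1001;              /* client acking the cookie + 1 */

    printf("to client: seq %u\n", (unsigned)(seq_from_rs + delta));   /* prints 1001 */
    printf("to rs:     ack %u\n", (unsigned)(ack_from_clnt - delta)); /* prints 7001 */
    return 0;
}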

ipv4_rcv forwarding normal data

There is no need to paste that code here: dp_vs_in looks up the flow table and, depending on the packet's direction, either xmit_inbound sends it to the backend RS or xmit_outbound sends it to the client; the corresponding callbacks are packet_xmit and packet_out_xmit. A minimal sketch of that dispatch is below.
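
This sketch assumes the DPVS types already shown above; it is conceptual, not the literal dp_vs_in body.

/* Conceptual sketch: once the conn is found in the flow table, the tuple
 * direction picks which bound transmitter handles the packet. */
static int forward_established(struct dp_vs_proto *pp, struct dp_vs_conn *conn,
                               struct rte_mbuf *mbuf, int dir)
{
    if (dir == DPVS_CONN_DIR_INBOUND)
        return conn->packet_xmit(pp, conn, mbuf);      /* client -> RS, e.g. dp_vs_xmit_fnat     */
    return conn->packet_out_xmit(pp, conn, mbuf);      /* RS -> client, e.g. dp_vs_out_xmit_fnat */
}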

Summary

The fullnat flow is quite complex, especially combined with syn proxy. The next article will analyse flow-table timeout management.
