Architecture Decision: Stateful SAML Session Support for Serverless Functions, Backed by Couchbase

Defining the Problem: The Inherent Conflict Between Stateless Serverless Compute and Stateful SAML Sessions

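The SAML 2.0 protocol, particularly its commonly used HTTP-Redirect and HTTP-POST bindings, is inherently a stateful, multi-step flow. A typical authentication flow looks like this: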

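1. The user visits the service provider (SP).
2. The SP builds an AuthnRequest and redirects the user's browser to the identity provider (IdP) with an HTTP 302. At this point the SP must persist the AuthnRequest ID together with the RelayState, which identifies the final destination the user should return to after successful authentication.
3. The user authenticates at the IdP.
4. The IdP builds a signed SAML Response and sends the user back, via the browser, to the SP's Assertion Consumer Service (ACS) endpoint.
5. The SP's ACS endpoint receives the Response, must verify that its InResponseTo attribute matches the ID of the AuthnRequest issued in step 2, and retrieves the stored RelayState to complete the final redirect.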

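The core challenge is the time gap between step 2 and step 5. The SP must keep state (at minimum the RequestID and the RelayState) across two independent HTTP requests. Traditional monolithic or VM-based applications usually solve this with an in-memory session or a local cache. In a serverless architecture, however, a function's lifetime is short and unpredictable, and two independent API Gateway requests may well be routed to two different function instances that share no memory. This stateless, on-demand compute model is in direct conflict with the stateful requirements of the SAML flow.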
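To make the gap between step 2 and step 5 concrete, here is a minimal Go sketch using only the standard library. `PendingAuth`, `MemStore`, and `simulate` are hypothetical names introduced for illustration; `MemStore` stands in for one instance's local memory, not for Couchbase. It shows that the lookup fails when the two steps run against separate memories and succeeds when they share one store.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// PendingAuth is the minimum state the SP keeps between sending the
// AuthnRequest (step 2) and receiving the Response (step 5).
type PendingAuth struct {
	RequestID  string
	RelayState string
	ExpiresAt  time.Time
}

// ErrUnknownRequest is returned when InResponseTo matches no live entry.
var ErrUnknownRequest = errors.New("unknown or expired SAML request ID")

// MemStore models the private memory of a single function instance.
type MemStore struct {
	mu      sync.Mutex
	pending map[string]PendingAuth
}

func NewMemStore() *MemStore {
	return &MemStore{pending: make(map[string]PendingAuth)}
}

// Put records the state created in step 2.
func (s *MemStore) Put(p PendingAuth) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.pending[p.RequestID] = p
}

// Take returns the entry for inResponseTo and removes it, so each
// request ID can be consumed at most once.
func (s *MemStore) Take(inResponseTo string, now time.Time) (PendingAuth, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	p, ok := s.pending[inResponseTo]
	delete(s.pending, inResponseTo)
	if !ok || now.After(p.ExpiresAt) {
		return PendingAuth{}, ErrUnknownRequest
	}
	return p, nil
}

// simulate runs step 2 against initStore and step 5 against acsStore,
// 30 seconds later, and returns the recovered RelayState.
func simulate(initStore, acsStore *MemStore) (string, error) {
	now := time.Now()
	initStore.Put(PendingAuth{
		RequestID:  "id_123",
		RelayState: "/dashboard",
		ExpiresAt:  now.Add(5 * time.Minute),
	})
	p, err := acsStore.Take("id_123", now.Add(30*time.Second))
	return p.RelayState, err
}

func main() {
	// Two instances with no shared memory: the ACS step cannot find the state.
	_, err := simulate(NewMemStore(), NewMemStore())
	fmt.Println("separate instance memory:", err)

	// One shared store (the role an external store plays): lookup succeeds.
	shared := NewMemStore()
	relay, err := simulate(shared, shared)
	fmt.Println("shared store:", relay, err)
}
```

The rest of the article is about what plays the role of the shared store when the compute layer cannot.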

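The question we face: for a multi-tenant platform that must integrate with hundreds of external enterprise IdPs, how do we build a SAML assertion relay service that is highly elastic, low-cost, and reliable? The service must reliably manage the transient state of thousands of concurrent SAML authentication flows on stateless serverless compute.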

Option A: Pure Google Cloud Functions with an External State Store

The first option considered uses the purest serverless model: implement every HTTP endpoint of the SAML flow as a Google Cloud Function (GCF) and handle state management through an external persistent store.

Architecture:

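Compute layer: two independent Google Cloud Functions (2nd gen).
- initiate-saml-auth: receives the initial request, generates the AuthnRequest, stores the RequestID and RelayState in the external state store, and returns an HTTP 302 redirect.
- consume-saml-assertion: acts as the ACS endpoint. It receives the SAMLResponse from the IdP, retrieves and validates the session from the external state store, processes the assertion, and then runs the follow-up logic.
State layer: Google Cloud Memorystore (Redis) or Couchbase Cloud. For scenarios that need persistence, multi-dimensional queries, and a richer data model, Couchbase is the stronger candidate, so we choose Couchbase as the state store.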

sequenceDiagram
    participant User
    participant GCF_Init as /sso/init (GCF)
    participant Couchbase
    participant IdP as External IdP
    participant GCF_ACS as /sso/acs (GCF)

    User->>+GCF_Init: GET /sso/init?tenant=acme&target=/dashboard
    GCF_Init->>GCF_Init: Generate SAML AuthnRequest (RequestID: id_123)
    GCF_Init->>+Couchbase: SET `sess:id_123` {relayState: "/dashboard"}, TTL: 5min
    Couchbase-->>-GCF_Init: OK
    GCF_Init-->>-User: HTTP 302 Redirect to IdP (with SAMLRequest)
    
    User->>+IdP: Authenticate
    IdP-->>-User: HTTP 200 auto-submitting form to /sso/acs (with SAMLResponse)

    User->>+GCF_ACS: POST /sso/acs
    GCF_ACS->>GCF_ACS: Parse SAMLResponse (InResponseTo: id_123)
    GCF_ACS->>+Couchbase: GET `sess:id_123`
    Couchbase-->>-GCF_ACS: {relayState: "/dashboard"}
    GCF_ACS->>+Couchbase: DELETE `sess:id_123`
    Couchbase-->>-GCF_ACS: OK
    GCF_ACS->>GCF_ACS: Validate SAML Assertion Signature & Content
    GCF_ACS-->>-User: HTTP 302 Redirect to /dashboard
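
The diagram carries the user-supplied `target` value through to the final redirect, so it is worth constraining it before it is stored as RelayState. The following Go sketch is one possible approach, assuming only same-site relative paths should be accepted; `SanitizeRelayState` is a hypothetical helper, not part of any SAML library.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// SanitizeRelayState returns target if it is a same-site absolute path
// (for example "/dashboard?tab=1") and "/" otherwise. Absolute URLs,
// protocol-relative URLs ("//host/..."), and backslash tricks are rejected.
func SanitizeRelayState(target string) string {
	const fallback = "/"
	if target == "" || strings.HasPrefix(target, "//") || strings.Contains(target, "\\") {
		return fallback
	}
	u, err := url.Parse(target)
	if err != nil {
		return fallback
	}
	if u.IsAbs() || u.Host != "" || !strings.HasPrefix(u.Path, "/") {
		return fallback
	}
	out := u.RequestURI() // path plus query; the fragment is dropped
	if strings.HasPrefix(out, "//") {
		return fallback
	}
	return out
}

func main() {
	for _, t := range []string{
		"/dashboard",
		"/reports?year=2023",
		"https://evil.example/phish",
		"//evil.example/phish",
		"dashboard",
	} {
		fmt.Printf("%q -> %q\n", t, SanitizeRelayState(t))
	}
}
```

Applying a check like this at /sso/init, before the value is written to the state store, means the ACS handler can treat the stored RelayState as already vetted.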

Core implementation (Node.js, GCF initiate-saml-auth):

// A simplified Google Cloud Function for initiating SAML authentication
const functions = require('@google-cloud/functions-framework');
const { Cluster } = require('couchbase');
const saml = require('saml2-js'); // A placeholder for a real SAML library

// --- Production-Level Considerations ---
// 1. Connection should be managed outside the handler to be reused across invocations.
// 2. Error handling must be robust, with structured logging.
// 3. Secrets (Couchbase credentials) must be managed via Secret Manager.

let couchbaseCluster;
let samlSessionBucket;

async function connectToCouchbase() {
    if (couchbaseCluster) return;
    try {
        couchbaseCluster = await Cluster.connect(process.env.CB_CONNECT_STRING, {
            username: process.env.CB_USERNAME,
            password: process.env.CB_PASSWORD,
        });
        const bucket = couchbaseCluster.bucket('saml-transient-state');
        samlSessionBucket = bucket.defaultCollection();
        console.log('Couchbase connection established.');
    } catch (err) {
        console.error('FATAL: Couchbase connection failed', err);
        // In a real app, this should trigger alerts and potentially halt the function.
        throw err;
    }
}

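// Global scope promise to handle async initialization.
// The no-op catch prevents an unhandled rejection at load time; the handler
// below reports the failure per request and retries the connection.
let couchbaseConnectionPromise = connectToCouchbase();
couchbaseConnectionPromise.catch(() => {});

functions.http('initiateSamlAuth', async (req, res) => {
    try {
        await couchbaseConnectionPromise;
    } catch (connErr) {
        // Retry on the next invocation instead of failing for the instance's lifetime.
        couchbaseConnectionPromise = connectToCouchbase();
        couchbaseConnectionPromise.catch(() => {});
        res.status(503).send('Service temporarily unavailable.');
        return;
    }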

    const tenantId = req.query.tenant;
    const relayState = req.query.target || '/';

    if (!tenantId) {
        res.status(400).send('Missing tenant identifier.');
        return;
    }

    try {
        // In a real multi-tenant system, you'd fetch this config from a persistent Couchbase bucket.
        const tenantConfig = {
            idp_slo_target_url: `https://idp.example.com/${tenantId}/slo`,
            idp_sso_target_url: `https://idp.example.com/${tenantId}/sso`,
            issuer: `urn:my-app:${tenantId}`,
            // Public key of IdP, private key of SP etc.
        };

        const sp = new saml.ServiceProvider({
            entity_id: tenantConfig.issuer,
            private_key: process.env.SP_PRIVATE_KEY,
            assert_endpoint: process.env.ACS_URL, // saml2-js names the ACS URL option `assert_endpoint`
            // ... other SP options
        });
        
        const idp = new saml.IdentityProvider({
            sso_login_url: tenantConfig.idp_sso_target_url,
            // ... other IdP options
        });

        sp.create_login_request_url(idp, {}, async (err, loginUrl, requestId) => {
            if (err) {
                console.error(`[${tenantId}] Failed to create SAML login request`, err);
                res.status(500).send('Internal Server Error');
                return;
            }

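            // The critical state-saving step. Errors are handled here because the
            // outer try/catch cannot see rejections raised inside this async callback.
            const sessionId = `saml_sess::${requestId}`;
            const sessionData = {
                tenantId,
                relayState,
                createdAt: new Date().toISOString()
            };

            try {
                // Set a short TTL (e.g., 5 minutes) to prevent stale state accumulation
                await samlSessionBucket.insert(sessionId, sessionData, { expiry: 300 });
            } catch (insertErr) {
                console.error(`[${tenantId}] Failed to persist SAML session`, insertErr);
                res.status(500).send('Internal Server Error');
                return;
            }

            console.log(`[${tenantId}] SAML session initiated: ${sessionId}`);
            res.redirect(loginUrl);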
        });

    } catch (error) {
        console.error(`[${tenantId}] Unhandled error during SAML initiation`, error);
        res.status(500).send('An unexpected error occurred.');
    }
});
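
The first production consideration listed in the code above is reusing the connection across invocations. As a language-neutral illustration, here is a small Go sketch of that pattern: cache a successful connection, but retry on the next call if the previous attempt failed. `Conn`, `LazyConn`, and `flakyDialer` are hypothetical stand-ins; no Couchbase SDK is involved.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Conn stands in for a database handle such as a cluster connection.
type Conn struct{ ID int }

// LazyConn caches a successful connection for reuse across warm
// invocations, and retries on the next call after a failed attempt.
type LazyConn struct {
	mu      sync.Mutex
	conn    *Conn
	connect func() (*Conn, error)
}

// Get returns the cached connection or attempts to establish one.
func (l *LazyConn) Get() (*Conn, error) {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.conn != nil {
		return l.conn, nil
	}
	c, err := l.connect()
	if err != nil {
		// Nothing is cached on failure, so the next call tries again.
		return nil, fmt.Errorf("connect: %w", err)
	}
	l.conn = c
	return c, nil
}

// flakyDialer returns a connect function that fails the first `failures`
// calls, plus a pointer to its attempt counter. It is a test double.
func flakyDialer(failures int) (func() (*Conn, error), *int) {
	attempts := 0
	return func() (*Conn, error) {
		attempts++
		if attempts <= failures {
			return nil, errors.New("cluster unreachable")
		}
		return &Conn{ID: attempts}, nil
	}, &attempts
}

func main() {
	dial, attempts := flakyDialer(1)
	lc := &LazyConn{connect: dial}

	_, err := lc.Get()
	fmt.Println("first call:", err)

	c1, _ := lc.Get()
	c2, _ := lc.Get()
	fmt.Println("connection reused:", c1 == c2, "dial attempts:", *attempts)
}
```

Unlike a one-shot initializer, this shape lets an instance recover from a failed connection during cold start without being recycled.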

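Trade-off analysis:

Pros:
- Minimal operations: there are no servers or container orchestration to manage. Deployment is a code upload.
- Cost efficiency: when login volume fluctuates widely or is low overall, pay-per-invocation pricing is very attractive. No requests means no cost.
- Automatic scaling: Google Cloud absorbs traffic spikes by adding instances, up to the configured instance limits and project quotas.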
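Cons:
- Cold-start latency: authentication is extremely sensitive to user experience, and GCF cold starts (anywhere from a few hundred milliseconds to several seconds) are a serious problem. After clicking "log in", the user has to wait for a function instance to start.
- Execution time limit: GCF (2nd gen) allows a maximum timeout of 60 minutes for HTTP triggers. That is enough for most cases, but it remains a potential risk when integrating with slow, legacy enterprise IdPs.
- Limited environment control: deep network customization is not possible, persistent disks cannot be attached, and control over the concurrency model is limited.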

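In a real project, cold-start latency is the main reason to reject Option A. Authentication is the front door of the system, and any perceptible delay erodes user trust.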

Option B: Knative Serving with Couchbase

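The second option looks for a balance between serverless elasticity and the performance of a traditional service. We use Knative Serving to deploy the SAML service as a scalable containerized application.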

Architecture:

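Compute layer: a single containerized Go or Java application deployed as a Knative Service. The service exposes two endpoints, /sso/init and /sso/acs. Knative automatically scales the number of container instances with traffic, from zero to N.
Platform layer: Google Kubernetes Engine (GKE) or Cloud Run for Anthos as the runtime for Knative.
State layer: Couchbase Cloud again stores the transient sessions.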

graph TD
    subgraph User Browser
        A[User]
    end
    
    subgraph Knative on GKE
        B(Knative Route)
        C{SAML Service Pod 1}
        D{SAML Service Pod N}
    end

    subgraph Couchbase Cloud
        E[Couchbase Cluster]
    end

    F[External IdP]

    A -->|1. GET /sso/init| B
    B -->|Load Balancing| C
    C -->|2. Write Session| E
    C -->|3. Redirect| A
    A -->|4. Authenticate| F
    F -->|5. Redirect to ACS| A
    A -->|6. POST /sso/acs| B
    B -->|Load Balancing, maybe new Pod| D
    D -->|7. Read Session| E
    D -->|8. Delete Session| E
    D -->|9. Validate Assertion| D
    D -->|10. Final Redirect| A

Knative Service definition (service.yaml):

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: saml-idp-broker
  namespace: default
spec:
  template:
    metadata:
      annotations:
        # For sustained traffic from key tenants, we can avoid cold starts.
        autoscaling.knative.dev/minScale: "1" 
        # For bursty traffic, allow up to 20 instances.
        autoscaling.knative.dev/maxScale: "20"
    spec:
      containerConcurrency: 50 # Each pod can handle up to 50 concurrent requests
      timeoutSeconds: 600 # 10 minute timeout, much more generous than GCF default
      containers:
        - image: gcr.io/my-project/saml-idp-broker:latest
          ports:
            - containerPort: 8080
          env:
            - name: CB_CONNECT_STRING
              valueFrom:
                secretKeyRef:
                  name: couchbase-secrets
                  key: connectionString
            - name: CB_USERNAME
              valueFrom:
                secretKeyRef:
                  name: couchbase-secrets
                  key: username
            - name: CB_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: couchbase-secrets
                  key: password
            - name: SP_PRIVATE_KEY
              valueFrom:
                secretKeyRef:
                  name: saml-sp-secrets
                  key: privateKey
          readinessProbe:
            httpGet:
              path: /healthz
          livenessProbe:
            httpGet:
              path: /healthz

Core implementation (Go, Knative Service):

package main

import (
	"crypto/x509"
	"encoding/pem"
	"fmt"
	"log"
	"net/http"
	"net/url"
	"os"
	"time"

	"github.com/couchbase/gocb/v2"
	"github.com/crewjam/saml/samlsp"
	"github.com/gin-gonic/gin"
)

// Global Couchbase connection variables
var cbCluster *gocb.Cluster
var samlSessionCollection *gocb.Collection

// connectCouchbase initializes the database connection.
// It's called once at startup.
func connectCouchbase() {
	connStr := os.Getenv("CB_CONNECT_STRING")
	username := os.Getenv("CB_USERNAME")
	password := os.Getenv("CB_PASSWORD")

	var err error
	cbCluster, err = gocb.Connect(connStr, gocb.ClusterOptions{
		Authenticator: gocb.PasswordAuthenticator{
			Username: username,
			Password: password,
		},
	})
	if err != nil {
		log.Fatalf("FATAL: Couchbase connection failed: %v", err)
	}

	bucket := cbCluster.Bucket("saml-transient-state")
	err = bucket.WaitUntilReady(5*time.Second, nil)
	if err != nil {
		log.Fatalf("FATAL: Couchbase bucket not ready: %v", err)
	}
	samlSessionCollection = bucket.DefaultCollection()
	log.Println("Couchbase connection established and bucket is ready.")
}

// getTenantSamlMiddleware is the core of our multi-tenant solution.
// It fetches tenant-specific SAML config from a persistent Couchbase bucket
// and creates a samlsp.Middleware instance on-the-fly for each request.
func getTenantSamlMiddleware(c *gin.Context) (*samlsp.Middleware, error) {
    // In a real app, tenantId would be determined from hostname, path, or token
	tenantID := c.Param("tenantId")
	if tenantID == "" {
		return nil, fmt.Errorf("tenant ID is missing")
	}

	// Fetch config from a persistent bucket (not the session one)
	// This is a simplified example. Production code needs caching here.
	// _, configDoc := tenantConfigCollection.Get(tenantID, nil)
    
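	// For demonstration, we use hardcoded config
	idpMetadataURL, err := url.Parse(fmt.Sprintf("https://idp.example.com/%s/metadata", url.PathEscape(tenantID)))
	if err != nil {
		return nil, fmt.Errorf("invalid IdP metadata URL: %w", err)
	}
	rootURL, err := url.Parse(fmt.Sprintf("https://sp.example.com/%s", url.PathEscape(tenantID)))
	if err != nil {
		return nil, fmt.Errorf("invalid SP root URL: %w", err)
	}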

	// Parse SP private key from env/secret
	keyPEM := os.Getenv("SP_PRIVATE_KEY")
	block, _ := pem.Decode([]byte(keyPEM))
	if block == nil {
		return nil, fmt.Errorf("failed to decode PEM block containing private key")
	}
	spKey, err := x509.ParsePKCS1PrivateKey(block.Bytes)
	if err != nil {
		return nil, fmt.Errorf("failed to parse private key: %w", err)
	}
    
    // The key part: using Couchbase as the state store (samlsp.RequestTracker)
	cbStore := &CouchbaseRequestTracker{
		Collection: samlSessionCollection,
		// SAML requests are short-lived. 5 minutes is a reasonable expiry.
		Expiry: 5 * time.Minute,
	}

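	// Assumes crewjam/saml v0.4+, where IdP metadata is fetched explicitly and
	// the request tracker is a field on the middleware rather than an option.
	idpMetadata, err := samlsp.FetchMetadata(c.Request.Context(), http.DefaultClient, *idpMetadataURL)
	if err != nil {
		return nil, fmt.Errorf("failed to fetch IdP metadata: %w", err)
	}

	samlSP, err := samlsp.New(samlsp.Options{
		URL: *rootURL,
		Key: spKey,
		// In production, the SP certificate would also be loaded.
		IDPMetadata: idpMetadata,
		CookieName:  fmt.Sprintf("token_%s", tenantID), // Tenant-specific session cookie
	})
	if err != nil {
		return nil, fmt.Errorf("failed to create SAML middleware: %w", err)
	}
	samlSP.RequestTracker = cbStore // Here we inject our Couchbase store
	return samlSP, nil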
}

// CouchbaseRequestTracker implements samlsp.RequestTracker interface.
// It stores and retrieves SAML request state in Couchbase.
type CouchbaseRequestTracker struct {
	Collection *gocb.Collection
	Expiry     time.Duration
}

// TrackRequest stores the SAML request ID. `index` is the RelayState.
func (t *CouchbaseRequestTracker) TrackRequest(w http.ResponseWriter, r *http.Request, samlRequestID string) (string, error) {
	relayState := samlsp.TrackedRequest{
		Index:         fmt.Sprintf("idx_%s", samlRequestID),
		SAMLRequestID: samlRequestID,
		// We could add more context here if needed
	}
	
	docID := fmt.Sprintf("saml_req::%s", relayState.Index)
	_, err := t.Collection.Insert(docID, relayState, &gocb.InsertOptions{
		Expiry: t.Expiry,
	})
	if err != nil {
		log.Printf("ERROR: Failed to track SAML request %s in Couchbase: %v", samlRequestID, err)
		return "", err
	}
	return relayState.Index, nil
}

// StopTrackingRequest deletes the state after it's been used.
func (t *CouchbaseRequestTracker) StopTrackingRequest(w http.ResponseWriter, r *http.Request, index string) error {
	docID := fmt.Sprintf("saml_req::%s", index)
	_, err := t.Collection.Remove(docID, nil)
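	// It's not a critical error if the key is already gone (e.g., expired).
	// NOTE: gocb v2 may wrap errors, so errors.Is(err, gocb.ErrDocumentNotFound)
	// is a more reliable check than this direct comparison.
	if err != nil && err != gocb.ErrDocumentNotFound {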
		log.Printf("WARN: Failed to stop tracking SAML request for index %s: %v", index, err)
	}
	return nil
}

// GetTrackedRequest retrieves the state.
func (t *CouchbaseRequestTracker) GetTrackedRequest(r *http.Request, index string) (*samlsp.TrackedRequest, error) {
	docID := fmt.Sprintf("saml_req::%s", index)
	getResult, err := t.Collection.Get(docID, nil)
	if err != nil {
		log.Printf("ERROR: Failed to get tracked SAML request for index %s: %v", index, err)
		return nil, err
	}

	var trackedRequest samlsp.TrackedRequest
	if err := getResult.Content(&trackedRequest); err != nil {
		return nil, err
	}
	return &trackedRequest, nil
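}

// GetTrackedRequests is also part of the samlsp.RequestTracker interface; the
// middleware uses it on the ACS request to learn which request IDs are valid.
// This sketch assumes the IdP echoes back the index as the RelayState form value.
func (t *CouchbaseRequestTracker) GetTrackedRequests(r *http.Request) []samlsp.TrackedRequest {
	index := r.FormValue("RelayState")
	if index == "" {
		return nil
	}
	tracked, err := t.GetTrackedRequest(r, index)
	if err != nil {
		return nil
	}
	return []samlsp.TrackedRequest{*tracked}
}
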

func main() {
	connectCouchbase()
	defer cbCluster.Close(nil)

	router := gin.Default()
	router.GET("/healthz", func(c *gin.Context) { c.Status(http.StatusOK) })
    
    // Dynamically handle different tenants
	tenantRouter := router.Group("/:tenantId")
	{
		tenantRouter.GET("/sso/init", func(c *gin.Context) {
			samlSP, err := getTenantSamlMiddleware(c)
			if err != nil {
				c.String(http.StatusInternalServerError, "Failed to initialize SAML SP: %v", err)
				return
			}
			// This handler will automatically generate the AuthnRequest and redirect.
			samlSP.HandleStartAuthFlow(c.Writer, c.Request)
		})

		tenantRouter.POST("/sso/acs", func(c *gin.Context) {
			samlSP, err := getTenantSamlMiddleware(c)
			if err != nil {
				c.String(http.StatusInternalServerError, "Failed to initialize SAML SP: %v", err)
				return
			}
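			// ServeHTTP only handles requests whose path equals the SP's configured
			// ACS (or metadata) path, so the SP's AcsURL must be set to match this
			// route. On a match it validates the assertion against the request
			// tracked in Couchbase and creates the session.
			samlSP.ServeHTTP(c.Writer, c.Request)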
		})
	}
	
	port := os.Getenv("PORT")
	if port == "" {
		port = "8080"
	}
	router.Run(":" + port)
}
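
One thing the tracker above does not do is bind the stored state to a tenant: the document key is derived from the request index alone. A possible hardening, sketched below with standard-library Go only, is to include the tenant in the key and to re-check it on the ACS side. `TrackedState`, `DocID`, and `VerifyTenant` are hypothetical names, and the key layout is an assumption, not the article's scheme.

```go
package main

import (
	"errors"
	"fmt"
)

// TrackedState is a hypothetical stored document that records which
// tenant started the flow alongside the SAML request ID.
type TrackedState struct {
	TenantID      string `json:"tenantId"`
	SAMLRequestID string `json:"samlRequestId"`
}

// ErrTenantMismatch signals that state was looked up under the wrong tenant.
var ErrTenantMismatch = errors.New("tracked request belongs to a different tenant")

// DocID builds a tenant-scoped document key. Tenant IDs are assumed not to
// contain the "::" separator; validate that where tenants are created.
func DocID(tenantID, index string) string {
	return fmt.Sprintf("saml_req::%s::%s", tenantID, index)
}

// VerifyTenant is a defense-in-depth check for the ACS handler: the tenant
// from the request route must equal the tenant recorded at /sso/init.
func VerifyTenant(routeTenant string, st TrackedState) error {
	if routeTenant == "" || st.TenantID != routeTenant {
		return ErrTenantMismatch
	}
	return nil
}

func main() {
	st := TrackedState{TenantID: "acme", SAMLRequestID: "id_123"}
	fmt.Println(DocID(st.TenantID, "idx_id_123"))
	fmt.Println("acme route:", VerifyTenant("acme", st))
	fmt.Println("other route:", VerifyTenant("globex", st))
}
```

With a tenant-scoped key, a response posted to another tenant's ACS route misses the document entirely; the explicit check covers the case where the key scheme changes later.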

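Trade-off analysis:

Pros:
- Controllable performance: setting minScale: "1" removes cold-start latency for key customers and keeps the login experience smooth.
- Robustness: longer timeouts and full control over the container environment make it better suited to integrating with unreliable external IdPs.
- Resource efficiency: tuning containerConcurrency lets a single Pod serve many concurrent requests. Compared with running functions at one request per instance (the 1st gen GCF model, and the default concurrency for 2nd gen), this can be more cost-effective under sustained load.
- Unified stack: the entire SAML logic lives in a single, testable, maintainable service rather than being spread across several independent functions.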
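Cons:
- Operational complexity: the GKE cluster and the Knative installation have to be managed. Cloud Run for Anthos simplifies this, but it is still more complex than plain GCF.
- Cost model: there is a cost floor. Even with zero traffic, minScale: "1" means at least one Pod is always running, and the GKE cluster itself carries a management fee.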

Final Architecture Decision and Rationale

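For an enterprise-grade, multi-tenant authentication hub, stability and predictable performance outweigh everything else. Login is the entry point to every business function, and any delay or instability translates directly into lost business and lost customer trust.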

We therefore choose Option B: Knative Serving with Couchbase.

The key trade-offs behind this decision are:

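Cold starts are unacceptable: for authentication, we cannot accept second-scale cold starts showing up in the P95 or P99 latency of Option A. Option B addresses this through the minScale setting, for every service that keeps at least one warm instance.
Paying for complexity is worth it: Option B costs more to operate, but in return it gives full control over performance, timeouts, and the runtime environment. That control matters when integrating with many third-party systems whose behavior is often unpredictable.
Cohesive state management: the Knative option lets us encapsulate the logic of the entire SAML flow, including the Couchbase interaction, in a single service. This is much clearer than splitting the logic across two independent GCFs that must coordinate state, and it is easier to integration-test.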

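In a real project we would set minScale: "0" for most ordinary tenants to save cost, and dynamically adjust the Knative Service configuration for large customers with SLAs or tenants with steady traffic, setting minScale to 1 or higher, to fine-tune the balance between performance and cost.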
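
The per-tenant policy described above could be expressed as a small function that produces the autoscaling annotations for a tenant's Knative Service. This is a sketch under assumptions: it presumes each tenant class is served by its own Knative Service, and `Tier`, `TierStandard`, `TierSLA`, and `ScaleAnnotations` are hypothetical names. The annotation keys and the maxScale value of 20 are taken from the service.yaml shown earlier.

```go
package main

import (
	"fmt"
	"strconv"
)

// Tier is a hypothetical tenant classification.
type Tier int

const (
	TierStandard Tier = iota // scale to zero when idle
	TierSLA                  // keep at least one warm instance
)

// ScaleAnnotations returns the Knative autoscaling annotations for a tenant
// tier. SLA tenants get at least one warm instance; standard tenants get zero.
func ScaleAnnotations(tier Tier, slaMinScale int) map[string]string {
	minScale := 0
	if tier == TierSLA {
		minScale = slaMinScale
		if minScale < 1 {
			minScale = 1
		}
	}
	return map[string]string{
		"autoscaling.knative.dev/minScale": strconv.Itoa(minScale),
		"autoscaling.knative.dev/maxScale": "20",
	}
}

func main() {
	fmt.Println("standard:", ScaleAnnotations(TierStandard, 0))
	fmt.Println("sla:", ScaleAnnotations(TierSLA, 2))
}
```

Keeping the policy in one function makes the cost floor explicit: every tenant placed in the SLA tier adds at least one always-on Pod.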

Limitations and Future Work

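The current architecture based on Knative and Couchbase is robust, but it has limitations. The management and availability of the Couchbase cluster itself sit on the critical path of the whole system. Couchbase Cloud offers a highly available managed service, but in extreme cases a network partition or service degradation can still affect the authentication flow. One possible improvement is a multi-region Couchbase deployment, with the Knative service deployed across multiple GKE clusters, for a higher level of disaster tolerance.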

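In addition, the architecture currently covers only SAML. As OIDC (OpenID Connect) adoption grows, the next iteration should extend the service into a more general identity protocol broker that handles both SAML and OIDC flows dynamically, reusing the underlying stateful session management pattern. Knative Eventing could also be integrated to broadcast authentication success and failure events asynchronously to downstream systems such as audit and risk control, building a more complete identity event hub.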
