Catalyzer: Sub-millisecond Startup for Serverless Computing with Initialization-less Booting

Abstract

Serverless computing promises cost-efficiency and elasticity for highly productive software development. To achieve this, a serverless sandbox system must address two challenges: strong isolation between function instances, and low startup latency to ensure a good user experience. While strong isolation can be provided by virtualization-based sandboxes, the initialization of the sandbox and the application causes non-negligible startup overhead. Conventional sandbox systems fall short on low-latency startup because of their application-agnostic nature: they can only reduce the latency of sandbox initialization through hypervisor and guest kernel customization, which is inadequate and does not mitigate the majority of the startup overhead.

This paper proposes Catalyzer, a serverless sandbox system design providing both strong isolation and extremely fast function startup. Instead of booting from scratch, Catalyzer restores a virtualization-based function instance from a well-formed checkpoint image and thereby skips the initialization on the critical path (init-less). Catalyzer boosts restore performance by recovering both user-level memory state and system state on demand. We also propose a new OS primitive, sfork (sandbox fork), to further reduce startup latency by directly reusing the state of a running sandbox instance. Fundamentally, Catalyzer removes the initialization cost by reusing state, which enables general optimizations for diverse serverless functions. The evaluation shows that Catalyzer reduces startup latency by orders of magnitude, achieves <1ms latency in the best case, and significantly reduces the end-to-end latency for real-world workloads.

1 Introduction

Serverless computing, the new trending paradigm in cloud computing, liberates developers from the burden of managing servers and is already supported by many platforms, including Amazon Lambda [2], IBM Cloud Function [1], Microsoft Azure Functions [3], and Google Cloud Functions [7]. In serverless computing, the unit of computation is a function. When a service request is received, the serverless platform allocates an ephemeral execution sandbox and instantiates a user-defined function to handle the request. This computing model shifts the responsibility of dynamically managing cloud resources to cloud providers, allowing developers to focus purely on their application logic. Moreover, cloud providers can manage their resources more efficiently.

The ephemeral execution sandboxes are typically containers [1], virtual machines [20, 44], or recently proposed lightweight virtualization designs [6, 8, 19, 35, 37, 41, 45]. However, container instances suffer from isolation issues since they share one kernel, which is error-prone. Virtual machines achieve better isolation but are too heavyweight for serverless functions. Lightweight virtualization designs like Google gVisor [8] and Amazon FireCracker [6] achieve high performance, easy resource management, and strong isolation by customizing the host-guest interface, e.g., gVisor uses a process abstraction interface.

Executing serverless functions with low latency is critical for user experience [21, 24, 28, 32, 38] and remains a significant challenge for virtualization-based sandbox designs. To illustrate the severity, we conduct an end-to-end evaluation on three benchmarks, DeathStar [22], E-business microservices, and image processing functions, and divide the latency into an “execution” part and a “boot” part (§6.4). We calculate the “Execution/Overall” ratio of the 14 tested serverless functions and present the CDF in Figure 1. For 12 of the 14 functions in gVisor, the ratio does not even reach 30%, indicating that startup dominates the overall latency. Long startup latency, especially for virtualization-based sandboxes, has become a significant challenge for serverless platforms.

Existing VM-based sandboxes [6, 8, 37] reduce startup latency through hypervisor customization, e.g., FireCracker can boot a virtual machine (micro VM) and a minimized Linux kernel in 100ms. However, none of them can reduce application initialization latency, such as JVM or Python interpreter setup time. Our studies on serverless functions (written in five programming languages) show that most of the startup latency comes from application initialization (Insight I).

This paper proposes Catalyzer, a general design to boost startup for serverless computing. The key idea of Catalyzer is to restore an instance from a well-formed checkpoint image and thereby skip the initialization on the critical path. The design is based on two additional insights: First, a serverless function in the execution stage typically accesses only a small fraction of the memory and files used in the initialization stage (Insight II); thus we can recover both application state (e.g., data in memory) and system state (e.g., file handles/descriptors) on demand. Second, sandbox instances of the same function possess almost the same initialized state (Insight III); thus it is possible to reuse most of the state of running sandboxes to spawn new ones. Specifically, Catalyzer adopts on-demand recovery of both user-level and system state, and proposes a new OS primitive, sfork (sandbox fork), to further reduce startup latency by directly reusing the state of a running sandbox instance. Fundamentally, Catalyzer eliminates the initialization cost by reusing state, which enables general optimizations for diverse serverless functions.
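While sfork is an OS-level primitive (§4), its core idea, paying initialization once in a template and reusing that state for every new instance, can be sketched with the POSIX fork primitive. The sketch below is illustrative only and uses hypothetical names, not Catalyzer's actual API:

```python
import os

def initialize_runtime():
    # Stand-in for expensive runtime setup (e.g., loading classes or
    # importing libraries); runs once, in the template process only.
    return {"libs_loaded": True, "answer": 42}

def sfork_and_handle(state):
    """Fork a new 'instance' that reuses the template's initialized state.

    The child inherits `state` via copy-on-write page sharing, so it can
    serve a request immediately with no per-instance initialization.
    """
    r, w = os.pipe()
    pid = os.fork()
    if pid == 0:                       # child: the new function instance
        os.close(r)
        result = state["answer"]       # reuse state built before the fork
        os.write(w, str(result).encode())
        os._exit(0)
    os.close(w)                        # parent: the template sandbox
    out = os.read(r, 64)
    os.close(r)
    os.waitpid(pid, 0)
    return int(out)

if __name__ == "__main__":
    template_state = initialize_runtime()    # paid once, off the critical path
    print(sfork_and_handle(template_state))  # each fork reuses it
```

The same template can be forked repeatedly, which is what makes fork boot scalable.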

We have implemented Catalyzer based on gVisor. We measure the performance with both micro-benchmarks and real-world applications developed in five programming languages.

The results show that Catalyzer can achieve <1ms startup latency for C-hello (the best case) and boot Java SPECjbb in <2ms, a 1000x speedup over baseline gVisor. We also present evaluations on server machines and share lessons learned from industrial development at Ant Financial. The main contributions of this paper are as follows:

  • A detailed analysis of latency overhead on serverless computing (§2).
  • A general design of Init-less booting that boosts startup of diverse serverless applications (§3 and §4).
  • An implementation of Catalyzer on a state-of-the-art serverless sandbox system, Google gVisor (§5).
  • An evaluation with micro-benchmarks and real-world serverless applications demonstrating the efficiency and practicality of Catalyzer (§6).
  • The experience of deploying Catalyzer on real platforms (§6.9).

2 Serverless Function Startup Breakdown

In this section, we evaluate and analyze the startup latency of serverless platforms with different system sandboxes (i.e., gVisor, FireCracker, Hyper Container, and Docker) and different language runtimes. Based on evaluation and analysis, we present our motivation that serverless functions should be executed with an initialization-less approach.

2.1 Background

Serverless Platform. In serverless computing, the developer sends a function to the serverless platform to execute. We use the term handler function for the target function, which can be written in different languages. The handler function is compiled offline together with a wrapper, which performs initialization and invokes the handler function. Wrapped programs (consisting of the wrapper and the handler function) execute safely within sandboxes, which can be containers [5, 40] or virtual machines (VMs) [6, 8, 10]. A gateway program runs on each server as a daemon; it accepts “invoke function” requests and starts a sandbox with two arguments: a configuration file and a rootfs containing both the wrapped program and runtime libraries. The arguments follow the OCI specification [12] and are compatible with most existing serverless platforms.
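For concreteness, an OCI-style configuration file passed to the sandbox might look like the following minimal sketch. The paths and arguments are illustrative, not taken from any particular platform:

```json
{
  "ociVersion": "1.0.0",
  "process": {
    "args": ["/wrapper", "--handler", "handler.main"],
    "cwd": "/"
  },
  "root": {
    "path": "rootfs"
  }
}
```

Here `process.args` names the wrapped program inside the rootfs, and `root.path` points at the rootfs containing the wrapped program and runtime libraries.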

gVisor Case Study. In this paper, we propose a general optimization that achieves sub-millisecond startup even for VM-based sandboxes like gVisor. In the following, we take gVisor as the example for analysis, implementation, and evaluation. For evaluation, we use server machines (§6.1) to reveal the performance improvement in an industrial environment.

On a serverless platform, the first step of invoking a function is to prepare a sandbox. In the case of gVisor, sandbox preparation includes four operations: configuration parsing, virtualization resource allocation (e.g., VCPUs and guest memory regions), root file system mounting, and guest kernel initialization (Figure 2). The guest kernel consists of two user processes: a sandbox process and an I/O process. The sandbox process sets up the virtualized resources, e.g., the extended page table (EPT), and prepares the guest kernel. The I/O process mounts the root file system according to the configuration file. Figure 2 shows that sandbox initialization takes non-negligible time (22.3ms) in gVisor. Since sandbox initialization depends on function-specific configurations, it is hard to use techniques like caching [31, 40] to reduce sandbox initialization overhead. The critical path of startup refers to the period from when the gateway process receives a request until the handler executes. We use the term offline for non-critical-path operations (e.g., caching).

After sandbox initialization, the sandbox runs the wrapped program specified in the configuration file. Taking Java as an example, the wrapped program first starts a JVM to initialize the Java runtime (e.g., loading class files), then executes the user-provided handler function. We define the application initialization latency as the period from when the wrapped program starts until the handler function is ready to run. As the following evaluation shows, application initialization dominates the total startup latency.
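The wrapper's role can be sketched in a few lines. The `wrapper` function and its arguments below are illustrative stand-ins, not the platform's actual interface; the point is that everything before the `handler(event)` call is application initialization:

```python
import importlib
import time

def wrapper(module_name, handler_name, event):
    """Illustrative wrapped program: perform application initialization
    (loading the user's module and its libraries), then invoke the handler."""
    t0 = time.monotonic()
    mod = importlib.import_module(module_name)   # application initialization
    handler = getattr(mod, handler_name)
    init_ms = (time.monotonic() - t0) * 1000.0   # time spent before the handler
    return handler(event), init_ms
```

For a heavyweight runtime, the time measured by `init_ms` is exactly the cost that init-less booting removes from the critical path.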

2.2 A Quantitative Analysis on Startup Optimizations

The design space of serverless sandboxes is shown in Figure 3.

Cache-based Optimizations. Many systems adopt the idea of caching for serverless function startup [17, 39, 40]. For example, Zygote is a cache-based design for optimizing latency, which has been used in Android [14] to instantiate new Java applications. SOCK [40] leverages the Zygote idea for serverless computing: by maintaining a cache of pre-warmed Python interpreters, functions can be launched with an interpreter that has already loaded the necessary libraries, achieving high startup performance. SAND [17] allows instances of the same application function to share a sandbox containing the function code and its libraries. However, caching is far from ideal for two reasons. First, a single machine is capable of running thousands of serverless functions, so caching all of them in memory introduces high resource overhead, and caching policies are hard to determine in the real world. Second, caching does not help with tail latency, which is dominated by “cold boots” in most cases.
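The tail-latency limitation can be seen in a toy model of a Zygote-style cache. The class below is a hypothetical sketch, not any system's real implementation; it only shows that whenever the working set of functions exceeds the cache capacity, invocations fall back to the cold-boot cost:

```python
from collections import OrderedDict

class WarmPool:
    """Toy Zygote-style cache: keep up to `capacity` pre-warmed sandboxes
    keyed by function id; a miss pays the full cold-boot cost."""

    def __init__(self, capacity, cold_ms, warm_ms):
        self.capacity, self.cold_ms, self.warm_ms = capacity, cold_ms, warm_ms
        self._pool = OrderedDict()           # func_id -> warm sandbox (LRU order)

    def invoke(self, func_id):
        if func_id in self._pool:            # warm hit: reuse cached runtime
            self._pool.move_to_end(func_id)
            return self.warm_ms
        self._pool[func_id] = True           # cold miss: initialize, then cache
        if len(self._pool) > self.capacity:
            self._pool.popitem(last=False)   # evict least-recently-used entry
        return self.cold_ms
```

With thousands of functions and limited memory, most invocations of infrequent functions miss the pool, so the tail latency stays at the cold-boot cost no matter how fast warm hits are.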

Optimizations on Sandbox Initialization. Besides caching, sandbox systems also optimize their initialization through customization. For example, SOCK [40] proposes a lean container, which is a customized container design for serverless computing, to mitigate the overhead of sandbox initialization. Compared with container-based approaches, VM-based sandboxes [6, 8, 10] provide stronger isolation and also introduce more costs to sandbox initialization. Researchers have proposed numerous lightweight virtualization techniques [6, 19, 26, 36, 37] to solve performance and resource utilization issues [18, 23, 25, 29] in traditional heavy-weight virtualization systems. These proposals have already stimulated significant interest in the serverless computing industry (e.g., Google’s gVisor [8] and Amazon’s FireCracker [6]).

Further, the lightweight virtualization techniques adopt various ways to optimize startup latency: customizing guest kernels [26, 36], customizing hypervisors [19, 37], or a combination of the two [6, 8]. For instance, FireCracker [6] can boot a virtual machine (micro VM) and a minimized Linux kernel in 100ms. Although different in design and implementation, today’s virtualization-based sandboxes share one common limitation: they cannot mitigate application initialization latency, such as that of the JVM or the Python interpreter.

To understand the latency overhead (including sandbox and application initialization), we evaluate the startup latency of four widely used sandboxes (i.e., gVisor, FireCracker, Hyper Container, and Docker) with different workloads, and present the latency distribution in Figure 4. The evaluation uses the sandbox runtime directly and does not count the cost of container management. The settings are the same as described in §6.1.

We highlight several interesting findings from the evaluation. First, much of the latency overhead comes from application initialization. Second, compared with C (142ms startup latency in gVisor), the startup latency is much higher for high-level languages like Java and Python. The main reason is that high-level languages usually need to initialize a language runtime (e.g., the JVM) before loading application code. Third, sandbox initialization latency is stable across workloads and dominates the latency overhead for simple functions like Python Hello.

The evaluation shows that much of the startup latency comes from application initialization instead of the sandbox. However, none of the existing virtualization-based sandboxes can reduce the application initialization latency caused by JVM or Python interpreters.

Checkpoint/Restore-based Optimizations. Checkpoint/restore (C/R) is a technique that saves the state of a running sandbox into a checkpoint image. The saved state includes both the application state (in the sandbox) and the sandbox state (e.g., the hypervisor). The sandbox can then be restored from the image and run seamlessly. Replayable Execution [43] leverages C/R techniques to mitigate the application initialization cost, but applies only to container-based systems. Compared with other C/R systems, Replayable optimizes memory loading using an on-demand approach to reduce startup latency. However, our evaluation shows that virtualization-based sandboxes incur high overhead to recover system state during the restore, a cost overlooked by prior work.

The major benefit of C/R is that it transforms application initialization costs into sandbox restore costs (init-less). We generalize the idea as init-less booting, shown in Figure 5. First, a func-image (short for function image) is generated offline, which saves the initialized state of a serverless function (offline initialization). The func-image can be stored locally or remotely, and a serverless platform needs to fetch a func-image first. After that, the platform can reuse the state saved in the func-image to boost function startup (func-load).

Challenges. C/R techniques reuse the serialized state (mostly application state) of a process to diminish application initialization cost, but rely on re-do operations to recover system state (i.e., in-kernel state such as opened files). A re-do operation recovers the state of a checkpointed instance and is necessary for correctness and compatibility. For example, a C/R system will re-do “open()” operations to re-open files that were open in the checkpointed process. However, re-do operations introduce performance overhead, especially for virtualization-based sandboxes.
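A re-do of “open()” can be sketched in a few lines. The metadata format and function names below are hypothetical simplifications of what a real C/R system records (real systems also track flags, locks, socket state, etc.):

```python
def checkpoint_fds(open_files):
    """Record, for each open file, the minimal state a re-do restore needs:
    the path and the current offset (a simplified view of C/R metadata)."""
    return [{"path": path, "offset": f.tell()} for path, f in open_files]

def redo_restore(meta):
    """Re-do the open() operations: re-open each file and seek back to the
    checkpointed offset, so the restored process sees its files 'still open'."""
    restored = []
    for m in meta:
        f = open(m["path"], "r")
        f.seek(m["offset"])
        restored.append(f)
    return restored
```

Each restored file requires a fresh `open()` plus a `seek()`; with many descriptors, these re-do operations add up on the critical path, which motivates performing them on demand instead.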

To analyze the performance effect, we implement a C/R-based init-less booting system on gVisor, called gVisor-restore, using the gVisor-provided checkpoint and restore mechanism [4]. We add a new syscall in gVisor that traps at the entry point of serverless functions. We use the term func-entry point for the entry point of a serverless function, which is either specified by developers or placed at the default location: the point right before the wrapped program invokes the handler function. The syscall is invoked by the func-entry point annotation and blocks until the checkpoint operation begins.

We evaluate the startup latency of gVisor-restore with different applications and compare it with unmodified gVisor. We use the sandbox runtime directly (i.e., runsc for gVisor) to exclude container management costs. As the result (Figure 6) shows, gVisor-restore successfully eliminates the application initialization overhead and achieves a 2x–5x speedup over gVisor. However, the startup latency is still high (400ms for a Java SPECjbb application and >100ms in other cases). Figure 2 shows that gVisor-restore spends 135.9ms on guest kernel recovery, classified into “Recover Kernel” and “Reconnect I/O” in the figure. “Recover Kernel” means recovering non-I/O system state, e.g., thread information, while I/O reconnection recovers I/O system state, e.g., re-opening a “supposedly open” file. For reusable state (“App memory” in the figure), the gVisor C/R mechanism compresses the saved data to reduce storage overhead, and must decompress, deserialize, and load the data into memory on the restore critical path, costing 128.8ms for a SPECjbb application. During the restore process in the SPECjbb case, gVisor recovers more than 37,838 objects (e.g., threads/tasks, mounts, sessionLists, and timers) in the guest kernel and loads 200MB of memory data.

Prior container-based C/R systems [43] have exploited on-demand paging to speed up application state recovery, but they still recover all system state on the critical path.

2.3 Overview

Our evaluation and analysis motivate us to propose Catalyzer, an init-less booting design for virtualization-based sandboxes, equipped with novel techniques to overcome the high latency of the restore process.

As shown in Figure 7, Catalyzer defines three kinds of booting: cold boot, warm boot, and fork boot. Precisely, cold boot means the platform must create a sandbox instance from a func-image through restore. Warm boot means there are running instances of the requested function; thus, Catalyzer can speed up the restore by sharing the in-memory state of running instances. Fork boot needs a dedicated sandbox template (a sandbox containing the initialized state) to skip initialization. Fork boot is a hot-boot mechanism [11, 40]: the platform knows a function may be invoked soon and prepares the running environment for it in advance. The significant contribution is that fork boot scales to boot any number of instances from a single template, while prior hot-boot designs can only serve a limited number of instances (depending on the cache size).

Catalyzer adopts a hybrid approach, combining C/R-based init-less booting with a new OS primitive, to implement cold, warm, and fork boot. Since a serverless function in the execution stage typically accesses only a small fraction of the memory and files used in the initialization stage, Catalyzer introduces on-demand restore for cold and warm boot to optimize the recovery of both application and system state (§3). In addition, Catalyzer proposes a new OS primitive, sfork (sandbox fork), to reduce the startup latency of fork boot by directly reusing the state of a template sandbox (§4). Fork boot achieves faster startup than warm boot but introduces more memory overhead; thus, it is more suitable for frequently invoked (hot) functions.
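The resulting dispatch among the three boot paths can be summarized in a few lines. This is an illustrative sketch of the decision hierarchy described above, not Catalyzer's scheduler:

```python
def choose_boot(has_template, running_instances):
    """Pick the fastest available boot path: fork boot when a sandbox
    template exists, warm boot when running instances can share their
    in-memory state, and cold boot (restore from the func-image) otherwise."""
    if has_template:
        return "fork"
    if running_instances > 0:
        return "warm"
    return "cold"
```

A platform would maintain templates only for hot functions, trading memory for the fastest path.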

3 On-demand Restore

The performance overhead of restore comes from two sources. First, the application and system state need to be decompressed, deserialized (metadata only), and loaded into memory. Second, re-do operations are necessary to recover system state, including multi-threaded contexts, the virtualization sandbox, and I/O connections.

As shown in Figure 8-a, Catalyzer accelerates restore by splitting the process into three parts: offline preparation, critical-path restore, and on-demand recovery. The preparation work, like decompression and deserialization, is mostly performed offline in the checkpoint stage. The loading of application state and the recovery of I/O-related system state are delayed through on-demand paging and I/O re-connection. Thus, Catalyzer performs only minimal work on the critical path, i.e., recovering non-I/O system state.

Specifically, Catalyzer proposes four techniques. First, overlay memory is a new memory abstraction that allows Catalyzer to directly map a func-image into memory, boosting application state loading (for cold boot). Sandboxes running the same function can share a “base memory mapping”, further omitting file mapping cost (for warm boot). Second, separated state recovery decouples deserialization from system state recovery on the critical path. Third, on-demand I/O reconnection delays I/O state recovery. Last, virtualization sandbox Zygote provides generalized virtualization sandboxes that are function-independent and can be used to reduce sandbox construction overhead.


Figure 8. On-demand restore. (a) Compared with prior approaches, Catalyzer leverages offline preparation and on-demand recovery to remove most work from the critical path. (b) Overlay memory allows a func-image to be mapped directly into memory to construct the Base-EPT, which can also be shared among instances through copy-on-write. (c) The operational flow shows how a gVisor sandbox is instantiated with on-demand restore.

3.1 Overlay Memory

Overlay memory is a design for on-demand application state loading through copy-on-write over file-based mmap. As shown in Figure 8-b, the design allows a “base memory mapping” to be shared among sandboxes running the same function, and relies on memory copy-on-write to ensure privacy.

Overlay memory uses a well-formed func-image for direct mapping, which contains uncompressed and page-aligned application state. During a cold boot, Catalyzer loads application state by directly mapping the func-image into memory (map-file operation). Catalyzer maintains two layered EPTs for each sandbox. The upper one is called Private-EPT, and the lower one is Base-EPT. Private-EPT is private to each sandbox, while Base-EPT is shared and read-only. During a warm boot, Catalyzer directly maps the Base-EPT for the new sandbox with the share-mapping operation. The main benefit comes from the avoidance of costly file loading.

The platform constructs the hardware EPT by merging entries from the Private-EPT with the Base-EPT, i.e., using an entry of the Private-EPT if it is valid, and the corresponding entry of the Base-EPT otherwise. The construction is efficient and triggered by hardware. The Base-EPT is read-only and thus can be inherited by new sandboxes through mmap, while the Private-EPT is established with copy-on-write when an EPT violation occurs on the Base-EPT.
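The merge rule can be sketched as follows. This is an illustrative model rather than gVisor code: an EPT is modeled as a flat slice of entries, and a zero entry stands for "not present".

```go
package main

import "fmt"

// Entry models an EPT entry; 0 means "not present".
type Entry uint64

// mergeEPT picks the Private-EPT entry when it is valid,
// otherwise falls back to the shared, read-only Base-EPT entry.
func mergeEPT(private, base []Entry) []Entry {
	merged := make([]Entry, len(base))
	for i := range base {
		if private[i] != 0 {
			merged[i] = private[i] // sandbox-local (copy-on-write) page
		} else {
			merged[i] = base[i] // shared page from the func-image
		}
	}
	return merged
}

func main() {
	base := []Entry{10, 20, 30, 40}
	private := []Entry{0, 99, 0, 0} // page 1 was written, hence privatized
	fmt.Println(mergeEPT(private, base)) // → [10 99 30 40]
}
```

In the real design this merge is performed by hardware on EPT violations; the sketch only captures the "private wins, base as fallback" rule.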

3.2 Separated State Recovery

C/R relies on metadata of system state (represented by objects in the sandbox) for re-do operations; the metadata is serialized before being saved into checkpoint images and deserialized during restore. The system state includes all guest-OS internal state, e.g., the thread list and timers. However, such a process is non-trivial for sandboxes implemented in high-level languages (e.g., Golang for gVisor), because the language abstraction hides the arrangement of state data. Even with the help of serialization tools such as Protobuf [16], metadata objects have to be processed one by one to recover, which causes huge overhead when the number of objects is large (e.g., 37,838 objects are recovered for the SPECjbb application in gVisor-restore, consuming >50ms).

Catalyzer proposes separated state recovery to overcome this challenge by decoupling deserialization from state recovery. During offline preparation, Catalyzer saves partially deserialized metadata objects into func-images. Specifically, Catalyzer first re-organizes the discrete in-memory objects into continuous memory, so they can be mapped back into memory with a single mmap operation instead of one-by-one deserialization. Then, Catalyzer replaces pointers in objects with zeroed placeholders and records all (pointer) reference relationships in a relation table, which maps offsets of pointers to offsets of pointer values. The metadata objects and the relation table together constitute the partially deserialized objects. "Partially" means that Catalyzer still needs to deserialize pointers at runtime using the relation table.

With the func-image, Catalyzer accomplishes state recovery in two stages: loading the partially deserialized objects from a func-image (stage-1), then reconstructing the object relationships (e.g., pointer relations) and recovering system state in parallel (stage-2). First, the objects and the saved relation table are mapped into the sandbox's memory with overlay memory. Second, the object reference relationships are re-established by replacing all placeholders with real pointers according to the relation table, and non-I/O system state is recovered on the critical path. Since each update is independent, this stage can be carried out in parallel. The design does not depend on a specific memory layout, which improves portability: a func-image can run on different machines.

3.3 On-demand I/O Reconnection

The numerous I/O operations performed during restore (e.g., opening files) add high latency to the critical path. Guided by our insight that much of the I/O-related state (e.g., files) will not be used after restore, Catalyzer adopts an on-demand I/O reconnection design. For example, a previously opened file "/home/user/hello.txt" may be accessed only for specific requests. Such unused I/O connections cannot be eliminated even by choosing a better checkpoint point, because serverless functions usually run atop a language runtime (e.g., JVM) and third-party libraries, so developers have no idea whether rarely used connections will ever be accessed.

Thus, we can re-establish the connections lazily, i.e., only when they are actually used. To achieve this, I/O reconnection is performed asynchronously off the restore critical path, and the sandbox guest kernel tracks each connection's status: a file descriptor is passed to functions but tagged as not yet re-opened in the guest kernel.

We observe that for a specific function, the I/O connections used immediately after booting are mostly deterministic. Thus, we introduce an I/O cache mechanism to further mitigate the latency of I/O reconnection. The I/O connection operations performed during cold boot are saved in a cache, which Catalyzer uses to guide a sandbox (in warm boot) to establish these connections on the critical path. Specifically, the cache stores file paths and the operations on each path, so Catalyzer can use this information as a hint to re-connect these I/O channels first. For I/O connections that miss in the cache (i.e., the non-deterministic ones), Catalyzer falls back to the on-demand strategy.

3.4 Virtualization Sandbox Zygote

On the restore critical path, a sandbox is constructed before application state loading and system state recovery. The challenges of reducing sandbox construction latency lie in two factors: first, sandbox construction depends on function-specific information (e.g., the path of the rootfs), so techniques like caching alone do not help; second, a sandbox is tightly coupled with system resources that are not directly re-usable (e.g., namespaces and hardware virtualization resources).

Catalyzer proposes a Virtualization Sandbox Zygote design that separates the function-dependent configuration from a general sandbox (Sandbox Zygote) and leverages a cache of Zygotes to mitigate sandbox construction overhead. A Zygote is a generalized virtualization sandbox used to generate a function-specific sandbox during restore. As described in Figure 2, a sandbox is constructed from a configuration file and a rootfs. Catalyzer proposes a base configuration and a base rootfs, which separate out function-specific details. Catalyzer caches a Zygote by parsing the base configuration file, allocating virtualization resources (e.g., VCPUs), and mounting the base rootfs. Upon function invocation, Catalyzer specializes a sandbox from a Zygote by importing function-specific binaries/libraries and appending the function-specific configuration to the Zygote. Virtualization Zygotes can be used in both cold boot and warm boot in Catalyzer.
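The specialize step can be sketched as below. The struct fields and names are our own illustrative model of "pre-allocated general shell + appended function-specific details", not Catalyzer's actual data structures.

```go
package main

import "fmt"

// Zygote is a pre-built, function-independent sandbox shell: the base
// configuration is parsed and virtualization resources are allocated
// ahead of time (modeled here as plain fields).
type Zygote struct {
	BaseConfig map[string]string
	VCPUs      int
	BaseRootFS string
}

// Sandbox is a Zygote specialized for one function.
type Sandbox struct {
	Config map[string]string
	VCPUs  int
	RootFS []string // layered: base + function-specific
}

// Specialize imports the function-specific binaries/libraries and
// appends the function's configuration onto the cached Zygote.
func (z *Zygote) Specialize(funcConfig map[string]string, funcRootFS string) *Sandbox {
	cfg := make(map[string]string, len(z.BaseConfig)+len(funcConfig))
	for k, v := range z.BaseConfig {
		cfg[k] = v
	}
	for k, v := range funcConfig { // function config overrides the base
		cfg[k] = v
	}
	return &Sandbox{
		Config: cfg,
		VCPUs:  z.VCPUs,
		RootFS: []string{z.BaseRootFS, funcRootFS},
	}
}

func main() {
	z := &Zygote{BaseConfig: map[string]string{"net": "none"}, VCPUs: 2, BaseRootFS: "/base"}
	sb := z.Specialize(map[string]string{"entry": "/bin/func"}, "/funcs/imgproc")
	fmt.Println(sb.Config["entry"], sb.RootFS) // → /bin/func [/base /funcs/imgproc]
}
```

The point of the split is that everything in `Zygote` is function-independent and can be cached, so only the cheap `Specialize` call remains on the invocation path.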

3.5 Putting All Together

The three elements (overlay memory, separated state, and I/O connections) are all included in the func-image. The workflow of cold and warm boot is shown in Figure 8-c. First, the function-specific configuration and its func-image (in "App rootFS") are passed to a Zygote to specialize a sandbox. Second, the function-specific rootfs indicated by the configuration is mounted for the sandbox. Then, the sandbox recovers system state using separated state recovery. After that, for warm boot, Catalyzer maps the Base-EPT's memory into the gVisor process read-only, using copy-on-write to preserve the privacy of the sandbox's memory; for cold boot, Catalyzer must first establish the Base-EPT by mapping the func-image into memory. At last, the guest kernel asynchronously recovers I/O connections, with the I/O cache assisting the process during warm boot.
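The steps above can be outlined as a sequence; the strings below are our own labels for the mechanisms of §3.1 through §3.4, not real control flow.

```go
package main

import "fmt"

// restoreSteps sketches the boot workflow of Figure 8-c. Each string
// stands in for a mechanism from §3.1-§3.4; this is an illustrative
// outline, not an implementation.
func restoreSteps(warm bool) []string {
	steps := []string{
		"specialize sandbox from a cached Zygote",              // §3.4
		"mount function-specific rootfs",                       //
		"separated state recovery (map objects, fix pointers)", // §3.2
	}
	if warm {
		steps = append(steps, "share-map Base-EPT read-only (CoW)") // §3.1
	} else {
		steps = append(steps, "map func-image to build Base-EPT") // §3.1
	}
	// performed off the critical path:
	steps = append(steps, "async on-demand I/O reconnection") // §3.3
	return steps
}

func main() {
	for i, s := range restoreSteps(true) {
		fmt.Printf("%d. %s\n", i+1, s)
	}
}
```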

4 sfork: Sandbox fork

Based on our Insight III, Catalyzer proposes a new OS primitive, sfork (sandbox fork), to further reduce startup latency by directly reusing the state of a running "template sandbox". A template sandbox is a special sandbox for a specific function that holds no information about user requests; thus, it can be used to instantiate sandboxes that serve requests. The basic workflow is shown in Figure 9-a. First, a template sandbox is generated through template initialization, containing clean system state at the func-entry point; then, when a request for the function arrives, the template sandbox sforks itself to reuse the initialized state directly. The state here includes both user state (application and runtime) and guest kernel state.

**Challenges.** An intuitive choice is to use the traditional fork to implement sfork. However, it is challenging to keep system state consistent using fork alone. First, most OS kernels (e.g., Linux) support only single-threaded fork, which means multi-threading information is lost after forking. Second, fork is not suitable for sandbox creation: a child process inherits its parent's shared memory mappings, file descriptors, and other system state that should not be shared between sandboxes. Third, fork clones all the user state in memory, some of which may depend on system state that has since changed. For example, consider a common case where the template sandbox issues the getpid syscall during initialization and stores the return value in a variable: after forking, the PID changes but the variable does not, leading to undefined behavior.

The clone syscall provides more flexibility with many options, but is still not sufficient. One major limitation is the handling of shared memory (mapped with MAP_SHARED flag). If a child sandbox inherits the shared memory, it will violate the isolation between parent and child sandboxes; if not, it may change the semantics of MAP_SHARED.

**Template Initialization.** To overcome these challenges, Catalyzer relies on user-space handling of most inconsistent state and introduces only minimal kernel modifications. We classify syscalls into three groups: denied, handled, and allowed. The allowed and handled syscalls are listed in Table 1. The handled syscalls require user-space logic to explicitly fix the related system state after sfork for consistency. For example, clone creates a new thread context for a sandbox, and the multi-threaded contexts should be recovered after sfork (Challenge-1). The denied syscalls are removed from the sandbox since they may lead to non-deterministic system state modification. We illustrate how Catalyzer keeps multi-threaded contexts and reuses inherited file descriptors (Challenge-2) after sfork with two novel techniques, transient single-thread and stateless overlay rootFS. The only kernel modification is a new CoW flag for shared memory mappings. We take advantage of Linux container technologies (USER and PID namespaces) to keep system state such as user IDs and process IDs consistent after sfork (Challenge-3).
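The three-way classification can be sketched as a policy table. The entries below are examples only; the authoritative list is the paper's Table 1, which we do not reproduce here.

```go
package main

import "fmt"

// Policy is the sfork treatment of a syscall.
type Policy int

const (
	Denied  Policy = iota // non-deterministic state change: removed from the sandbox
	Handled               // allowed, but user space must fix state after sfork
	Allowed               // safe with no extra work
)

// policyOf is an illustrative classification; these entries are
// examples standing in for the paper's Table 1, not the full list.
func policyOf(sys string) Policy {
	switch sys {
	case "clone", "mmap":
		return Handled // e.g., thread contexts rebuilt after sfork
	case "read", "write", "getpid":
		return Allowed
	default:
		return Denied
	}
}

func main() {
	for _, s := range []string{"clone", "getpid", "ptrace"} {
		fmt.Println(s, policyOf(s))
	}
}
```

A sandbox built this way rejects denied syscalls during template initialization, so the template's system state stays deterministic and safe to reuse.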

4.1 Multi-threading Fork

Sandboxes implemented in Golang (e.g., gVisor) are naturally multi-threaded, because the Golang runtime uses multiple threads for garbage collection and other background work. Specifically, threads in Golang fall into three categories: runtime threads, scheduling threads, and blocking threads (Figure 9-b). The runtime threads provide runtime functionality such as garbage collection and preemption; they are long-running and transparent to developers. The scheduling threads (M-threads in Figure 9-b) implement Golang's co-routine mechanism (i.e., Go routines). When a Go routine switches to the blocked state (e.g., executing a blocking system call like accept), the Golang runtime dedicates an OS thread to that Go routine.

Catalyzer proposes a transient single-thread mechanism to support multi-threaded sandbox fork. With this mechanism, a multi-threaded program can temporarily merge all its threads into a single thread (the transient single-thread), which is expanded back to multiple threads after sfork. The process is shown in Figure 9-b. First, we modify the Golang runtime in Catalyzer to support stoppable runtime threads: when notified to enter the transient single-thread state, they save their thread contexts in memory and terminate themselves. Then, the number of scheduling threads is configured to one through the Golang runtime. In addition, we add a time-out to all blocking threads; whenever the time-out triggers, each thread checks whether it should terminate to enter the transient single-thread state. Finally, the Golang program keeps only the m0 thread in the transient single-thread state, and expands to multiple threads again after sfork. Our modification is used only for template sandbox generation and does not affect program behavior after sfork.

4.2 Stateless Overlay RootFS

A sforked sandbox inherits the file descriptors and file systems of the template sandbox, which must be handled after sfork. Inspired by existing overlayFS designs [13] and the ephemeral nature of serverless functions [27, 28], Catalyzer employs a stateless overlay rootFS technique to achieve zero-cost handling of file descriptors and the rootFS. The idea is to keep all modifications to the rootFS in memory, where they are automatically cloned during sfork using copy-on-write (Figure 9-c).

Specifically, each sandbox uses two layers of file systems. The upper layer is the in-memory overlayFS, which is private to a sandbox and allows both read and write operations. The overlayFS is backed by a per-function FS server that manages the real rootFS. A sandbox cannot directly access persistent storage for security reasons; thus, it relies on the (read-only) file descriptors received from the FS server to access the rootFS. During sfork, besides the cloned overlayFS, the file descriptors owned by the template sandbox remain valid in the child sandbox: since they are read-only, they do not violate the isolation guarantee.
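The two-layer arrangement can be sketched as below. This is an illustrative model under our own naming: maps stand in for file systems, a plain copy stands in for the copy-on-write clone, and the FS server is reduced to the shared read-only lower map.

```go
package main

import "fmt"

// OverlayFS models the two layers: a private in-memory upper layer
// and a shared read-only lower layer served by the FS server.
type OverlayFS struct {
	upper map[string][]byte // private, writable, lives in memory
	lower map[string][]byte // shared rootFS, read-only
}

// Read checks the private upper layer first, then the shared rootFS.
func (o *OverlayFS) Read(path string) ([]byte, bool) {
	if d, ok := o.upper[path]; ok {
		return d, true
	}
	d, ok := o.lower[path]
	return d, ok
}

// Write keeps all modifications in the in-memory upper layer.
func (o *OverlayFS) Write(path string, data []byte) {
	o.upper[path] = data
}

// sforkFS clones only the in-memory upper layer (copy-on-write in the
// real design; a plain copy here) and shares the read-only lower layer.
func (o *OverlayFS) sforkFS() *OverlayFS {
	up := make(map[string][]byte, len(o.upper))
	for k, v := range o.upper {
		up[k] = v
	}
	return &OverlayFS{upper: up, lower: o.lower}
}

func main() {
	tpl := &OverlayFS{
		upper: map[string][]byte{},
		lower: map[string][]byte{"/etc/cfg": []byte("base")},
	}
	tpl.Write("/tmp/x", []byte("init"))
	child := tpl.sforkFS()
	child.Write("/tmp/x", []byte("child")) // does not leak back to the template
	a, _ := tpl.Read("/tmp/x")
	b, _ := child.Read("/etc/cfg")
	fmt.Println(string(a), string(b)) // → init base
}
```

Because writes never reach the lower layer, the template's read-only descriptors stay valid in every child, which is exactly why they can be inherited for free.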

Our lessons from industrial deployment show that persistent storage is still required by serverless functions in some cases, e.g., writing logs. Catalyzer allows the FS server to grant sandboxes file descriptors for the log files (with read/write permission). Overall, the majority of files are sforked with low latency, and only a small number of persistent files are copied for functionality.

4.3 Language Runtime Template for Cold Boot

Although on-demand restore provides promising cold-boot performance, it relies on a well-formed func-image containing uncompressed data (a larger image). Thus, we propose another option for cold boot: sfork with a language runtime template, a template sandbox shared by functions written in the same language. A language runtime template initializes the environment of the wrapped program (e.g., the JVM for Java) and loads the real function on demand to serve requests. Such a sandbox is instantiated differently for different languages, e.g., loading libraries in C or loading Class files in Java. For instance, a single Java runtime template is sufficient to boost our internal functions, as most of them are written in Java.
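The idea can be sketched as a template whose expensive, request-independent initialization happens once, while function loading is deferred. Names here are ours for illustration only.

```go
package main

import "fmt"

// RuntimeTemplate models a per-language template sandbox: the language
// runtime is initialized once, and the actual function is loaded only
// when a request arrives after sfork.
type RuntimeTemplate struct {
	lang        string
	initialized bool
}

// NewRuntimeTemplate performs the expensive, request-independent work
// (e.g., starting a JVM), done once per language rather than per function.
func NewRuntimeTemplate(lang string) *RuntimeTemplate {
	return &RuntimeTemplate{lang: lang, initialized: true}
}

// Instantiate stands in for sfork plus loading the real function code
// (libraries in C, Class files in Java, ...).
func (t *RuntimeTemplate) Instantiate(function string) string {
	return fmt.Sprintf("%s runtime ready, loaded %s", t.lang, function)
}

func main() {
	tpl := NewRuntimeTemplate("java")
	fmt.Println(tpl.Instantiate("com.example.Handler")) // → java runtime ready, loaded com.example.Handler
}
```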
