Original article: http://duartes.org/gustavo/blog/post/cpu-rings-privilege-and-protection/
You probably know intuitively that applications have limited powers in Intel x86 computers and that only operating system code can perform certain tasks, but do you know how this really works? This post takes a look at x86 privilege levels, the mechanism whereby the OS and CPU conspire to restrict what user-mode programs can do. There are four privilege levels, numbered 0 (most privileged) to 3 (least privileged), and three main resources being protected: memory, I/O ports, and the ability to execute certain machine instructions. At any given time, an x86 CPU is running in a specific privilege level, which determines what code can and cannot do. These privilege levels are often described as protection rings, with the innermost ring corresponding to highest privilege. Most modern x86 kernels use only two privilege levels, 0 and 3:

About 15 machine instructions, out of dozens, are restricted by the CPU to ring zero. Many others have limitations on their operands. These instructions can subvert the protection mechanism or otherwise foment chaos if allowed in user mode, so they are reserved to the kernel. An attempt to run them outside of ring zero causes a general-protection exception, like when a program uses invalid memory addresses. Likewise, access to memory and I/O ports is restricted based on privilege level. But before we look at protection mechanisms, let’s see exactly how the CPU keeps track of the current privilege level, which involves the segment selectors from the previous post. Here they are:
Translator's note: the "chaos" mentioned above would be something like a blue screen on Windows or a kernel oops on Linux.

The full contents of data segment selectors are loaded directly by code into various segment registers such as ss (stack segment register) and ds (data segment register). This includes the contents of the Requested Privilege Level (RPL) field, whose meaning we tackle in a bit. The code segment register (cs) is, however, magical. First, its contents cannot be set directly by load instructions such as mov, but rather only by instructions that alter the flow of program execution, like call. Second, and importantly for us, instead of an RPL field that can be set by code, cs has a Current Privilege Level (CPL) field maintained by the CPU itself. This 2-bit CPL field in the code segment register is always equal to the CPU’s current privilege level. The Intel docs wobble a little on this fact, and sometimes online documents confuse the issue, but that’s the hard and fast rule. At any time, no matter what’s going on in the CPU, a look at the CPL in cs will tell you the privilege level code is running with.
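To make the selector layout concrete, here is a minimal sketch that splits a 16-bit selector into its fields. The helper name `decode_selector` is made up for illustration, and the two sample values are the code segment selectors commonly seen on 32-bit Linux; treat them as assumptions rather than something this post states.

```python
from typing import NamedTuple


class Selector(NamedTuple):
    index: int      # bits 15..3: descriptor index into the GDT or LDT
    table: str      # bit 2 (TI): "GDT" when clear, "LDT" when set
    privilege: int  # bits 1..0: RPL in a data selector, CPL when read from cs


def decode_selector(value: int) -> Selector:
    """Split a 16-bit segment selector into its three fields."""
    if not 0 <= value <= 0xFFFF:
        raise ValueError("a segment selector is 16 bits wide")
    return Selector(
        index=value >> 3,
        table="LDT" if value & 0b100 else "GDT",
        privilege=value & 0b11,
    )


# Assumed sample values (typical 32-bit Linux code segment selectors).
USER_CS = 0x73
KERNEL_CS = 0x60

print(decode_selector(USER_CS))    # privilege field is 3: user mode
print(decode_selector(KERNEL_CS))  # privilege field is 0: kernel mode
```

Reading the low two bits of cs is all it takes to learn the current privilege level, which is the point the paragraph above makes.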
Keep in mind that the CPU privilege level has nothing to do with operating system users. Whether you’re root, Administrator, guest, or a regular user, it does not matter. All user code runs in ring 3 and all kernel code runs in ring 0, regardless of the OS user on whose behalf the code operates. Sometimes certain kernel tasks can be pushed to user mode, for example user-mode device drivers in Windows Vista, but these are just special processes doing a job for the kernel and can usually be killed without major consequences.
Translator's note: in other words, the CPU privilege level is unrelated to which user a program runs as. Newcomers often confuse CPU privilege levels with user permissions.
Due to restricted access to memory and I/O ports, user mode can do almost nothing to the outside world without calling on the kernel. It can’t open files, send network packets, print to the screen, or allocate memory. User processes run in a severely limited sandbox set up by the gods of ring zero. That’s why it’s impossible, by design, for a process to leak memory beyond its existence or leave open files after it exits. All of the data structures that control such things – memory, open files, etc – cannot be touched directly by user code; once a process finishes, the sandbox is torn down by the kernel. That’s why our servers can have 600 days of uptime – as long as the hardware and the kernel don’t crap out, stuff can run for ever. This is also why Windows 95 / 98 crashed so much: it’s not because “M$ sucks” but because important data structures were left accessible to user mode for compatibility reasons. It was probably a good trade-off at the time, albeit at high cost.
The CPU protects memory at two crucial points: when a segment selector is loaded and when a page of memory is accessed with a linear address. Protection thus mirrors memory address translation where both segmentation and paging are involved. When a data segment selector is being loaded, the check below takes place:

Since a higher number means less privilege, MAX() above picks the least privileged of CPL and RPL, and compares it to the descriptor privilege level (DPL). If the DPL is higher or equal, then access is allowed. The idea behind RPL is to allow kernel code to load a segment using lowered privilege. For example, you could use an RPL of 3 to ensure that a given operation uses segments accessible to user-mode. The exception is for the stack segment register ss, for which the three of CPL, RPL, and DPL must match exactly.
Translator's note: put another way, access is allowed when CPL <= DPL and RPL <= DPL. CPL is the privilege of the code currently running (kernel or application) and DPL is the privilege required to access the segment, so the requirement CPL <= DPL is easy to understand. RPL exists so the kernel can vet memory addresses handed to it from user mode: an application could otherwise pass the kernel a pointer to memory it does not own and borrow the kernel's privilege to corrupt other programs. In that situation the check should be made with the RPL of the address that was passed in. All of this is largely a historical leftover from the 386.
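The data segment check described above fits in a couple of lines. This is only a sketch of the rule as the text states it; the function names are invented for illustration and nothing here models the rest of descriptor validation (segment type, present bit, limits).

```python
def may_load_data_segment(cpl: int, rpl: int, dpl: int) -> bool:
    """Rule for ds/es/fs/gs: the weaker of CPL and RPL must still be
    at least as privileged as the descriptor requires."""
    return max(cpl, rpl) <= dpl


def may_load_stack_segment(cpl: int, rpl: int, dpl: int) -> bool:
    """Rule for ss: all three privilege levels must match exactly."""
    return cpl == rpl == dpl


# Kernel code (CPL 0) loading a kernel segment with full privilege.
print(may_load_data_segment(cpl=0, rpl=0, dpl=0))
# Kernel code deliberately lowering its privilege with RPL 3:
# a kernel-only segment is now off limits.
print(may_load_data_segment(cpl=0, rpl=3, dpl=0))
# The same lowered request against a user-accessible segment succeeds.
print(may_load_data_segment(cpl=0, rpl=3, dpl=3))
```

The second call shows why RPL is useful: the kernel can act on a user-supplied selector while being held to user-mode limits.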
In truth, segment protection scarcely matters because modern kernels use a flat address space where the user-mode segments can reach the entire linear address space. Useful memory protection is done in the paging unit when a linear address is converted into a physical address. Each memory page is a block of bytes described by a page table entry containing two fields related to protection: a supervisor flag and a read/write flag. The supervisor flag is the primary x86 memory protection mechanism used by kernels. When it is on, the page cannot be accessed from ring 3. While the read/write flag isn’t as important for enforcing privilege, it’s still useful. When a process is loaded, pages storing binary images (code) are marked as read only, thereby catching some pointer errors if a program attempts to write to these pages. This flag is also used to implement copy on write when a process is forked in Unix. Upon forking, the parent’s pages are marked read only and shared with the forked child. If either process attempts to write to the page, the processor triggers a fault and the kernel knows to duplicate the page and mark it read/write for the writing process.
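As a rough illustration of the paging check, the sketch below tests a user-mode access against the low flag bits of a page table entry. One detail to keep in mind: the hardware stores the supervisor flag as a user/supervisor bit (bit 2) that is clear for supervisor-only pages, so "supervisor flag on" in the text corresponds to that bit being 0. The constant and function names are illustrative, and the model ignores everything else in a real entry (frame address, accessed and dirty bits, and so on).

```python
PTE_PRESENT = 1 << 0   # page is mapped
PTE_WRITABLE = 1 << 1  # read/write flag: set means writes are allowed
PTE_USER = 1 << 2      # user/supervisor bit: clear means supervisor-only


def user_access_allowed(pte: int, write: bool) -> bool:
    """Would a ring 3 access to this page succeed, or fault?"""
    if not pte & PTE_PRESENT:
        return False  # not mapped at all
    if not pte & PTE_USER:
        return False  # supervisor-only page, invisible to ring 3
    if write and not pte & PTE_WRITABLE:
        return False  # read-only page, e.g. code or a copy-on-write page
    return True


kernel_page = PTE_PRESENT | PTE_WRITABLE
code_page = PTE_PRESENT | PTE_USER
heap_page = PTE_PRESENT | PTE_USER | PTE_WRITABLE

print(user_access_allowed(kernel_page, write=False))  # supervisor-only
print(user_access_allowed(code_page, write=True))     # read-only
print(user_access_allowed(heap_page, write=True))     # ordinary user data
```

A copy-on-write page looks like `code_page` here: the write faults, and the kernel then decides whether to copy the page and set the writable bit.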
Finally, we need a way for the CPU to switch between privilege levels. If ring 3 code could transfer control to arbitrary spots in the kernel, it would be easy to subvert the operating system by jumping into the wrong (right?) places. A controlled transfer is necessary. This is accomplished via gate descriptors and via the sysenter instruction. A gate descriptor is a segment descriptor of type system, and comes in four sub-types: call-gate descriptor, interrupt-gate descriptor, trap-gate descriptor, and task-gate descriptor. Call gates provide a kernel entry point that can be used with ordinary call and jmp instructions, but they aren’t used much so I’ll ignore them. Task gates aren’t so hot either (in Linux, they are only used in double faults, which are caused by either kernel or hardware problems).
Translator's note: the phrase "a segment descriptor of type system" was unclear to the translator. In Intel's terminology, system descriptors are a separate class from code and data segment descriptors, and gates belong to that class.
That leaves two juicier ones: interrupt and trap gates, which are used to handle hardware interrupts (e.g., keyboard, timer, disks) and exceptions (e.g., page faults, divide by zero). I’ll refer to both as an “interrupt”. These gate descriptors are stored in the Interrupt Descriptor Table (IDT). Each interrupt is assigned a number between 0 and 255 called a vector, which the processor uses as an index into the IDT when figuring out which gate descriptor to use when handling the interrupt. Interrupt and trap gates are nearly identical. Their format is shown below along with the privilege checks enforced when an interrupt happens. I filled in some values for the Linux kernel to make things concrete.

Both the DPL and the segment selector in the gate regulate access, while segment selector plus offset together nail down an entry point for the interrupt handler code. Kernels normally use the segment selector for the kernel code segment in these gate descriptors. An interrupt can never transfer control from a more-privileged to a less-privileged ring. Privilege must either stay the same (when the kernel itself is interrupted) or be elevated (when user-mode code is interrupted). In either case, the resulting CPL will be equal to the DPL of the destination code segment; if the CPL changes, a stack switch also occurs. If an interrupt is triggered by code via an instruction like int n, one more check takes place: the gate DPL must be at the same or lower privilege as the CPL. This prevents user code from triggering random interrupts. If these checks fail, you guessed it, a general-protection exception happens. All Linux interrupt handlers end up running in ring zero.
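The privilege checks on interrupt delivery can be sketched as follows. This is a simplified model under stated assumptions: `Gate`, `GeneralProtectionFault` and `deliver_interrupt` are names invented here, the target is assumed to be an ordinary (nonconforming) code segment, and the gate values in the usage lines are examples rather than a dump of any real IDT.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gate:
    dpl: int         # DPL of the gate descriptor itself
    target_dpl: int  # DPL of the code segment the gate's selector names


class GeneralProtectionFault(Exception):
    """Stands in for the CPU's general-protection exception."""


def deliver_interrupt(cpl: int, gate: Gate, software: bool) -> tuple[int, bool]:
    """Return (new CPL, whether a stack switch happens)."""
    # Extra check for `int n`: the gate must be reachable from the caller.
    if software and gate.dpl < cpl:
        raise GeneralProtectionFault("gate is not accessible from this ring")
    # An interrupt may keep or raise privilege, never lower it.
    if gate.target_dpl > cpl:
        raise GeneralProtectionFault("handler is less privileged than CPL")
    new_cpl = gate.target_dpl
    return new_cpl, new_cpl != cpl


syscall_gate = Gate(dpl=3, target_dpl=0)     # reachable via int n from ring 3
page_fault_gate = Gate(dpl=0, target_dpl=0)  # hardware and exceptions only

print(deliver_interrupt(3, syscall_gate, software=True))      # ring 3 to ring 0
print(deliver_interrupt(3, page_fault_gate, software=False))  # exception in user code
print(deliver_interrupt(0, page_fault_gate, software=False))  # kernel interrupted
```

Calling `deliver_interrupt(3, page_fault_gate, software=True)` raises `GeneralProtectionFault`, which is the case of user code trying to trigger an arbitrary interrupt with int n.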
During initialization, the Linux kernel first sets up an IDT in setup_idt() that ignores all interrupts. It then uses functions in include/asm-x86/desc.h (http://lxr.linux.no/linux+v2.6.25.6/include/asm-x86/desc.h#L322) to flesh out common IDT entries in arch/x86/kernel/traps_32.c. In Linux, a gate descriptor with “system” in its name is accessible from user mode and its set function uses a DPL of 3. A “system gate” is an Intel trap gate accessible to user mode. Otherwise, the terminology matches up. Hardware interrupt gates are not set here however, but instead in the appropriate drivers.
Three gates are accessible to user mode: vectors 3 and 4 are used for debugging and checking for numeric overflows, respectively. Then a system gate is set up for the SYSCALL_VECTOR, which is 0x80 for the x86 architecture. This was the mechanism for a process to transfer control to the kernel, to make a system call, and back in the day I applied for an “int 0x80” vanity license plate :). Starting with the Pentium Pro, the sysenter instruction was introduced as a faster way to make system calls. It relies on special-purpose CPU registers that store the code segment, entry point, and other tidbits for the kernel system call handler. When sysenter is executed the CPU does no privilege checking, going immediately into CPL 0 and loading new values into the registers for code and stack (cs, eip, ss, and esp). Only ring zero can load the sysenter setup registers, which is done in enable_sep_cpu().
Translator's note: the license plate is a programmer's joke. "int 0x80" could not be used as a plate number in China, though apparently it can where the author lives.
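Here is a small model of what sysenter does with those special-purpose registers. The three MSR numbers are the architectural ones for the sysenter setup registers; the function name, the dictionary representation, and the sample entry point and stack addresses are placeholders chosen for illustration, not values from any real kernel.

```python
# Model-specific registers that only ring 0 may write.
IA32_SYSENTER_CS = 0x174
IA32_SYSENTER_ESP = 0x175
IA32_SYSENTER_EIP = 0x176


def sysenter(msrs: dict[int, int]) -> dict[str, int]:
    """Register state right after sysenter, given the MSR contents."""
    cs = msrs[IA32_SYSENTER_CS] & 0xFFFC  # privilege bits forced to 0
    if cs == 0:
        # The CPU raises a general-protection exception for a null selector.
        raise ValueError("SYSENTER_CS was never set up")
    return {
        "cs": cs,
        "ss": cs + 8,  # stack segment is the next descriptor after cs
        "eip": msrs[IA32_SYSENTER_EIP],
        "esp": msrs[IA32_SYSENTER_ESP],
        "cpl": 0,      # no privilege check, straight into ring 0
    }


# Placeholder setup, standing in for what the kernel programs at boot.
msrs = {
    IA32_SYSENTER_CS: 0x60,         # assumed kernel code selector
    IA32_SYSENTER_ESP: 0xC1FFF000,  # hypothetical kernel stack top
    IA32_SYSENTER_EIP: 0xC0103000,  # hypothetical system call entry point
}
print(sysenter(msrs))
```

Because user code cannot write these registers, it has no say in where sysenter lands, which is what makes skipping the privilege checks safe.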
Finally, when it’s time to return to ring 3, the kernel issues an iret or sysexit instruction to return from interrupts and system calls, respectively, thus leaving ring 0 and resuming execution of user code with a CPL of 3. Vim tells me I’m approaching 1,900 words, so I/O port protection is for another day. This concludes our tour of x86 rings and protection. Thanks for reading!
References:
1. http://www.cis.syr.edu/~wedu/Teaching/CompSec/LectureNotes_New/Protection_80386.pdf
2. https://stackoverflow.com/questions/36617718/difference-between-dpl-and-rpl-in-x86