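I. Symptoms
- During load testing, several modules reported the error "unable to create thread: Resource temporarily unavailable".
- Both the service processes and supervisor hit this error. After repeated failed restart attempts the service exited, and then the container exited.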
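II. Root cause summary
In short, the scope in which the process resource limit (rlimit) configuration applies does not match the scope the kernel uses when it enforces the limit on a user. The ulimit configuration is read inside each container and applied per process, whereas for some resources (here, the number of threads) the kernel does not count per process: it counts the total across all processes that a single user ID owns on the whole host.
The per-user thread count is tracked by the kernel. Containers have isolated runtime environments, but to the kernel they are just more processes. For processes running under the same user ID, even in different containers, the kernel sees one cumulative thread count.
The limit value that actually takes effect, on the other hand, is set from user space: each container reads its own limits configuration.
In the default CentOS limits configuration, the soft limit on the number of processes for non-root users is 4096: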
root@cvm-172_16_30_8:~ # cat /etc/security/limits.d/20-nproc.conf
# Default limit for number of user's processes to prevent
# accidental fork bombs.
# See rhbz #432903 for reasoning.
* soft nproc 4096
root soft nproc unlimited
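When a system call creates a new thread, the kernel checks against the limit set from these values. This is why a process running under a given user ID can fail to create threads even though its own container holds few threads: the process's own limit is 4096, while from the kernel's point of view that user ID already owns more than 4096 threads across the machine.
The detailed mechanism is explained below.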
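As a minimal sketch of where that per-process value can be observed, the C helpers below read the calling process's own RLIMIT_NPROC with getrlimit. The names read_nproc_limit and print_limit are hypothetical and chosen only for this illustration. Note that the value read this way is the limit, not the user's host-wide task count that the kernel compares it against.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/resource.h>

/* Hypothetical helper: fetch this process's RLIMIT_NPROC values.
 * Returns 0 on success, -1 on failure (errno is set by getrlimit). */
int read_nproc_limit(rlim_t *soft, rlim_t *hard)
{
    struct rlimit rl;

    assert(soft != NULL && hard != NULL);
    if (getrlimit(RLIMIT_NPROC, &rl) != 0)
        return -1;
    *soft = rl.rlim_cur;
    *hard = rl.rlim_max;
    return 0;
}

/* Print one limit value, spelling out RLIM_INFINITY as "unlimited". */
void print_limit(const char *label, rlim_t value)
{
    if (value == RLIM_INFINITY)
        printf("%s: unlimited\n", label);
    else
        printf("%s: %llu\n", label, (unsigned long long)value);
}
```

Calling read_nproc_limit from a process started through su and from one started directly as container process 1 would be one way to compare the two cases discussed in this article.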
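III. Details
This part covers three topics:
- When the ulimit configuration takes effect.
- How the kernel checks whether a limit has been exceeded.
- How ulimit settings are inherited when Docker starts a container, which is how the problem is solved.
How the ulimit configuration takes effect
1. How processes were originally started in the container:
On startup the container runs entrypoint.sh. The script creates a user with the specified ID, fixes directory ownership, then uses su to switch to that user and run supervisor, which in turn starts the service processes: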
$ cat entrypoint.sh
#!/bin/sh
username="yibot"
#create user if not exists
# look the UID up directly: /etc/passwd lines start with the user name, not the UID
getent passwd "${YIBOT_UID}" > /dev/null 2>&1
if [ $? -ne 0 ]
then
useradd -u "${YIBOT_UID}" "${username}"
fi
mkdir -p /data/yibot/"${MODULE}"/log/ && \
mkdir -p /data/supervisor/ && \
chown -R "${YIBOT_UID}":"${YIBOT_UID}" /entrypoint && \
chown -R "${YIBOT_UID}":"${YIBOT_UID}" /data && \
su yibot -c "supervisord -n"
2. A brief introduction to PAM
PAM (Pluggable Authentication Modules) is a set of libraries that decouples applications needing authentication from the authentication mechanism itself. The modules are not part of the kernel, and the kernel performs no authentication of its own. Applications such as su and login now use this library.
Introduction to PAM: https://www.linuxjournal.com/article/5940
PAM man page: http://man7.org/linux/man-pages/man8/pam.8.html
PAM source: https://github.com/linux-pam/linux-pam/tree/master/libpam
3. PAM and reading the ulimit configuration
In the PAM source, the limits handling in https://github.com/linux-pam/linux-pam/blob/master/modules/pam_limits/pam_limits.c shows that every time this PAM session hook is invoked, it calls parse_config_file and then setup_limits:
retval = parse_config_file(pamh, pwd->pw_name, pwd->pw_uid, pwd->pw_gid, ctrl, pl);
retval = setup_limits(pamh, pwd->pw_name, pwd->pw_uid, ctrl, pl);
parse_config_file reads the limit configuration from the given configuration file and stores it in the pam_limit_s structure that pl points to. The structure is defined as follows:
/* internal data */
struct pam_limit_s {
int login_limit; /* the max logins limit */
int login_limit_def; /* which entry set the login limit */
int flag_numsyslogins; /* whether to limit logins only for a
specific user or to count all logins */
int priority; /* the priority to run user process with */
struct user_limits_struct limits[RLIM_NLIMITS];
const char *conf_file;
int utmp_after_pam_call;
char login_group[LINE_LENGTH];
};
The value of each limit is stored in the limits array. Each user_limits_struct entry holds the soft and hard limits:
struct user_limits_struct {
int supported;
int src_soft;
int src_hard;
struct rlimit limit;
};
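The limit member is populated in init_limits, which uses the getrlimit system call to fetch the current process's limits.
After the configuration file has been parsed, setup_limits uses the setrlimit system call to update the rlim values in the current process's PCB: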
for (i=0, status=LIMITED_OK; i<RLIM_NLIMITS; i++) {
int res;
if (!pl->limits[i].supported) {
/* skip it if its not known to the system */
continue;
}
if (pl->limits[i].src_soft == LIMITS_DEF_NONE &&
pl->limits[i].src_hard == LIMITS_DEF_NONE) {
/* skip it if its not initialized */
continue;
}
if (pl->limits[i].limit.rlim_cur > pl->limits[i].limit.rlim_max)
pl->limits[i].limit.rlim_cur = pl->limits[i].limit.rlim_max;
res = setrlimit(i, &pl->limits[i].limit);
if (res != 0)
pam_syslog(pamh, LOG_ERR, "Could not set limit for '%s': %m",
rlimit2str(i));
status |= res;
}
That is how the PAM library reads the limits configuration and applies it. The behavior of the getrlimit and setrlimit system calls is described later.
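The getrlimit, clamp, setrlimit sequence in setup_limits can be reproduced for a single resource in a few lines. This is a sketch, not pam_limits code: set_soft_limit is a hypothetical name, and it changes only the soft limit, whereas pam_limits may set both values from the configuration file.

```c
#include <assert.h>
#include <sys/resource.h>

/* Hypothetical helper mirroring the getrlimit -> clamp -> setrlimit sequence.
 * Changes the soft limit of `resource` for the calling process only.
 * Returns 0 on success, -1 on failure. */
int set_soft_limit(int resource, rlim_t soft)
{
    struct rlimit rl;

    assert(resource >= 0 && resource < RLIM_NLIMITS);
    if (getrlimit(resource, &rl) != 0)   /* start from the current values */
        return -1;
    rl.rlim_cur = soft;
    if (rl.rlim_cur > rl.rlim_max)       /* the soft limit may not exceed the hard limit */
        rl.rlim_cur = rl.rlim_max;
    return setrlimit(resource, &rl);     /* updates this process's rlim values */
}
```

For example, set_soft_limit(RLIMIT_NPROC, 4096) would have the same effect on the calling process as the "* soft nproc 4096" line has on a session opened through PAM.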
4. su and PAM:
su source: https://github.com/shadow-maint/shadow/blob/master/src/su.c
The latest su implementation contains a PAM conditional-compilation block:
#ifdef USE_PAM
ret = pam_start ("su", name, &conv, &pamh);
if (PAM_SUCCESS != ret) {
SYSLOG ((LOG_ERR, "pam_start: error %d", ret);
fprintf (stderr,
_("%s: pam_start: error %d\n"),
Prog, ret));
exit (1);
}
On a recent CentOS, running ldd on su confirms that this option is enabled:
root@cvm-172_16_30_8:~ # ldd /usr/bin/su | grep pam
libpam.so.0 => /lib64/libpam.so.0 (0x00007f4d429a6000)
libpam_misc.so.0 => /lib64/libpam_misc.so.0 (0x00007f4d427a2000)
The su man page says the same:
This version of su uses PAM for authentication, account and session management. Some configuration options found in other su implementations such as e.g. support of a wheel group have to be configured via PAM.
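In pam_start ("su", name, &conv, &pamh), PAM looks under /etc/pam.d/ for a file named su and loads its configuration. That file names the modules used during PAM authentication, which is what makes the mechanism pluggable.
Finally, when PAM opens the session, pam_open_session calls pam_sm_open_session in pam_limits, which parses the limits configuration files and applies them.
After switching users, su starts a shell by default, and that shell inherits the updated limits. The inheritance mechanism is described below.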
/*
* Use the shell and create an argv
* with the rest of the command line included.
*/
argv[-1] = cp;
execve_shell (shellstr, &argv[-1], environ);
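Every process started after that inherits these limits.
A PAM programming example: https://www.freebsd.org/doc/en_US.ISO8859-1/articles/pam/pam-sample-appl.html
This explains the behavior of the original entrypoint.sh: su invokes PAM, which reads the limits configuration inside the current container (/etc/security/limits.d/). For a non-root user, the process's nproc limit is therefore set to 4096.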
5. Behavior of the setrlimit system call
Kernel source: https://github.com/torvalds/linux
The setrlimit system call is defined as follows:
SYSCALL_DEFINE2(setrlimit, unsigned int, resource, struct rlimit __user *, rlim)
{
struct rlimit new_rlim;
if (copy_from_user(&new_rlim, rlim, sizeof(*rlim)))
return -EFAULT;
return do_prlimit(current, resource, &new_rlim, NULL);
}
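current is a pointer to the current process's PCB, the task_struct structure. do_prlimit first calls security_task_setrlimit, which is a security-module (LSM) permission check rather than the write itself; if the check passes, do_prlimit stores the new values in the rlim array reached through that PCB: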
int security_task_setrlimit(struct task_struct *p, unsigned int resource,
struct rlimit *new_rlim)
{
return call_int_hook(task_setrlimit, 0, p, resource, new_rlim);
}
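call_int_hook walks the list of hooks that security modules have registered for FUNC and calls each one in turn, stopping at the first hook that returns a non-zero value (a refusal). IRC is the result returned when no hook is registered: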
#define call_int_hook(FUNC, IRC, ...) ({ \
int RC = IRC; \
do { \
struct security_hook_list *P; \
\
hlist_for_each_entry(P, &security_hook_heads.FUNC, list) { \
RC = P->hook.FUNC(__VA_ARGS__); \
if (RC != 0) \
break; \
} \
} while (0); \
RC; \
})
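The control flow of the macro is easier to see without the kernel's list macros. The following user-space sketch replaces the hlist with a plain array; hook_fn, call_hooks, allow_all and deny_six are hypothetical names used only here, and -1 stands in for a real error code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space stand-in for one registered security hook. */
typedef int (*hook_fn)(int resource);

/* Call every hook in order and stop at the first non-zero (refusing) result.
 * Returns default_rc when no hook is registered, which is the role IRC plays
 * in call_int_hook; otherwise returns the result of the last hook called. */
int call_hooks(const hook_fn *hooks, size_t count, int default_rc, int resource)
{
    int rc = default_rc;

    assert(hooks != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        rc = hooks[i](resource);
        if (rc != 0)
            break;
    }
    return rc;
}

/* Two example hooks: one always allows, one refuses resource number 6. */
int allow_all(int resource) { (void)resource; return 0; }
int deny_six(int resource) { return resource == 6 ? -1 : 0; }
```

With the chain { allow_all, deny_six }, a request for resource 6 is refused by the second hook, while any other resource passes both hooks and yields 0.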
For reference, part of the definition of the PCB, task_struct, is shown below. The full definition is at https://github.com/torvalds/linux/blob/master/include/linux/sched.h
struct task_struct {
...
/* Real parent process: */
struct task_struct __rcu *real_parent;
/* Recipient of SIGCHLD, wait4() reports: */
struct task_struct __rcu *parent;
/*
* Children/sibling form the list of natural children:
*/
struct list_head children;
struct list_head sibling;
struct task_struct *group_leader;
...
/* Effective (overridable) subjective task credentials (COW): */
const struct cred __rcu *cred;
...
/* Signal handlers: */
struct signal_struct *signal;
...
}
Note: the kernel implements a generic doubly linked list with list_head and the list_entry macro.
struct signal_struct defines rlim:
struct signal_struct {
...
/*
* We don't bother to synchronize most readers of this at all,
* because there is no reader checking a limit that actually needs
* to get both rlim_cur and rlim_max atomically, and either one
* alone is a single word that can safely be read normally.
* getrlimit/setrlimit use task_lock(current->group_leader) to
* protect this instead of the siglock, because they really
* have no need to disable irqs.
*/
struct rlimit rlim[RLIM_NLIMITS];
...
}
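The rlim array holds the process's resource limit values, and it is the array that setrlimit ultimately modifies. Each process (the threads of one process share a signal_struct) therefore holds its own set of values.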
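How the kernel checks the nproc limit (the limit on the number of processes)
1. Total number of processes per user
The task_struct definition above contains a pointer to struct cred, which is defined as follows: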
struct cred {
...
kuid_t uid; /* real UID of the task */
kgid_t gid; /* real GID of the task */
kuid_t suid; /* saved UID of the task */
kgid_t sgid; /* saved GID of the task */
kuid_t euid; /* effective UID of the task */
kgid_t egid; /* effective GID of the task */
kuid_t fsuid; /* UID for VFS ops */
kgid_t fsgid; /* GID for VFS ops */
...
struct user_struct *user; /* real user ID subscription */
...
}
struct user_struct is defined as:
struct user_struct {
refcount_t __count; /* reference count */
atomic_t processes; /* How many processes does this user have? */
atomic_t sigpending; /* How many pending signals does this user have? */
...
}
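As shown below, the struct user_struct *user in the PCB refers to a single global object per UID, so processes is the number of tasks the user currently has running system-wide. (In Linux, processes and threads are almost the same; the kernel has no separate thread concept, so threads count toward this number.)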
http://www.mulix.org/lectures/kernel_workshop_mar_2004/things.pdf
In Linux, processes and threads are almost the same. The major difference is that threads share the same virtual memory address space.
2. struct user_struct *user is globally unique
The su implementation calls change_uid, which ultimately switches the UID through the setuid system call:
SYSCALL_DEFINE1(setuid, uid_t, uid)
{
return __sys_setuid(uid);
}
__sys_setuid calls set_user to perform the actual user switch. The parameter new is a copy of the cred structure in the current PCB:
long __sys_setuid(uid_t uid)
{
...
if (ns_capable_setid(old->user_ns, CAP_SETUID)) {
new->suid = new->uid = kuid;
if (!uid_eq(kuid, old->uid)) {
retval = set_user(new);
if (retval < 0)
goto error;
}
} else if (!uid_eq(kuid, old->uid) && !uid_eq(kuid, new->suid)) {
goto error;
}
}
The full implementation of set_user:
/*
* change the user struct in a credentials set to match the new UID
*/
static int set_user(struct cred *new)
{
struct user_struct *new_user;
new_user = alloc_uid(new->uid);
if (!new_user)
return -EAGAIN;
/*
* We don't fail in case of NPROC limit excess here because too many
* poorly written programs don't check set*uid() return code, assuming
* it never fails if called by root. We may still enforce NPROC limit
* for programs doing set*uid()+execve() by harmlessly deferring the
* failure to the execve() stage.
*/
if (atomic_read(&new_user->processes) >= rlimit(RLIMIT_NPROC) &&
new_user != INIT_USER)
current->flags |= PF_NPROC_EXCEEDED;
else
current->flags &= ~PF_NPROC_EXCEEDED;
free_uid(new->user);
new->user = new_user;
return 0;
}
Now look at alloc_uid:
struct user_struct *alloc_uid(kuid_t uid)
{
struct hlist_head *hashent = uidhashentry(uid);
struct user_struct *up, *new;
spin_lock_irq(&uidhash_lock);
up = uid_hash_find(uid, hashent);
spin_unlock_irq(&uidhash_lock);
...
}
In kernel/user.c, uidhashentry is defined as follows:
#define uidhashentry(uid) (uidhash_table + __uidhashfn((__kuid_val(uid))))
static struct kmem_cache *uid_cachep;
struct hlist_head uidhash_table[UIDHASH_SZ];
Together with the implementation of uid_hash_find:
static struct user_struct *uid_hash_find(kuid_t uid, struct hlist_head *hashent)
{
struct user_struct *user;
hlist_for_each_entry(user, hashent, uidhash_node) {
if (uid_eq(user->uid, uid)) {
refcount_inc(&user->__count);
return user;
}
}
return NULL;
}
This shows that for a given UID there is exactly one user_struct system-wide. The UID's hash entry is used to find the structure in a hash chain, and a pointer to it is handed back to the PCB.
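The lookup can be modeled in user space to show why two tasks with the same UID end up sharing one counter. This is a simplified sketch under stated assumptions: user_entry, get_user_entry, put_user_entry and UID_BUCKETS are hypothetical names, the hash is a plain modulo, and there is no locking, unlike the kernel code.

```c
#include <assert.h>
#include <stdlib.h>

#define UID_BUCKETS 8   /* illustrative size, not the kernel's UIDHASH_SZ */

/* Simplified stand-in for the kernel's user_struct. */
struct user_entry {
    unsigned int uid;
    int refcount;            /* how many holders point at this entry */
    int processes;           /* tasks currently owned by this uid */
    struct user_entry *next; /* next entry in the same hash bucket */
};

static struct user_entry *uid_table[UID_BUCKETS];

/* Return the single entry for `uid`, creating it on first use. */
struct user_entry *get_user_entry(unsigned int uid)
{
    struct user_entry **bucket = &uid_table[uid % UID_BUCKETS];
    struct user_entry *e;

    for (e = *bucket; e != NULL; e = e->next) {
        if (e->uid == uid) {
            e->refcount++;
            return e;
        }
    }
    e = calloc(1, sizeof(*e));
    if (e == NULL)
        return NULL;
    e->uid = uid;
    e->refcount = 1;
    e->next = *bucket;
    *bucket = e;
    return e;
}

/* Drop one reference; unlink and free the entry when none remain. */
void put_user_entry(struct user_entry *e)
{
    struct user_entry **pp;

    assert(e != NULL && e->refcount > 0);
    if (--e->refcount > 0)
        return;
    for (pp = &uid_table[e->uid % UID_BUCKETS]; *pp != NULL; pp = &(*pp)->next) {
        if (*pp == e) {
            *pp = e->next;
            break;
        }
    }
    free(e);
}
```

Two lookups for the same UID return the same pointer, so an increment of processes made through one is visible through the other, which is the property the rest of the analysis relies on.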
3. Checking whether a new process is allowed
The set_user function shown above already contains the following check:
if (atomic_read(&new_user->processes) >= rlimit(RLIMIT_NPROC) &&
new_user != INIT_USER)
current->flags |= PF_NPROC_EXCEEDED;
else
current->flags &= ~PF_NPROC_EXCEEDED;
rlimit(RLIMIT_NPROC) reads the nproc limit from the current process's PCB and compares it with the new user's total thread count.
The exec implementation, __do_execve_file, contains a similar check: https://github.com/torvalds/linux/blob/master/fs/exec.c
if ((current->flags & PF_NPROC_EXCEEDED) &&
atomic_read(&current_user()->processes) > rlimit(RLIMIT_NPROC)) {
retval = -EAGAIN;
goto out_ret;
}
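Other paths that create a process perform a similar check.
In addition, fork ends up in copy_creds, which executes atomic_inc(&p->cred->user->processes); to add 1 to the user's process count.
When new credentials are installed through commit_creds (as exec and setuid do) and the user has changed, the new user's processes count is incremented and the old user's count is decremented, so the task is counted under its new user.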
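IV. ulimit inheritance in Docker containers
1. How a child process inherits its parent's ulimit
When a process is forked, the fork implementation in kernel/fork.c copies the contents of the PCB. The relevant part is copy_signal: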
static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
{
...
task_lock(current->group_leader);
memcpy(sig->rlim, current->signal->rlim, sizeof sig->rlim);
task_unlock(current->group_leader);
...
}
The rlim array is copied in full, so unless setrlimit is called to change it, the child has the same limits as the parent.
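The inheritance can be observed from user space with fork and getrlimit. The sketch below has the child report its own view of a limit through a pipe; child_inherits_limit is a hypothetical helper name.

```c
#include <assert.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 1 if a forked child sees the same soft and hard limit as the
 * parent for `resource`, 0 if they differ, -1 on error. */
int child_inherits_limit(int resource)
{
    struct rlimit parent, child;
    int fds[2];
    int status;
    pid_t pid;

    assert(resource >= 0 && resource < RLIM_NLIMITS);
    if (getrlimit(resource, &parent) != 0 || pipe(fds) != 0)
        return -1;

    pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {
        /* Child: report its own view of the limit through the pipe. */
        struct rlimit mine;

        close(fds[0]);
        if (getrlimit(resource, &mine) != 0 ||
            write(fds[1], &mine, sizeof(mine)) != (ssize_t)sizeof(mine))
            _exit(1);
        _exit(0);
    }

    /* Parent: read the child's report, then reap the child. */
    close(fds[1]);
    ssize_t n = read(fds[0], &child, sizeof(child));
    close(fds[0]);
    if (waitpid(pid, &status, 0) < 0 || n != (ssize_t)sizeof(child))
        return -1;
    return parent.rlim_cur == child.rlim_cur &&
           parent.rlim_max == child.rlim_max;
}
```

Calling child_inherits_limit(RLIMIT_NPROC) in any process should return 1, since nothing between fork and the child's getrlimit changes the copied values.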
2. How Docker starts containers
According to the official documentation, when a container starts, process 1 inherits its rlim from the docker daemon:
https://docs.docker.com/engine/reference/commandline/run/
Note: If you do not provide a hard limit, the soft limit will be used for both values. If no ulimits are set, they will be inherited from the default ulimits set on the daemon. as option is disabled now. In other words, the following script is not supported:...
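Because the docker daemon normally runs as root, process 1 has the same rlim as root even when the container is run as a non-root user.
As long as nothing reads the in-container ulimit configuration through PAM (for example by running su inside the container, or by logging in remotely), all child processes inherit root's rlim as well.
In summary, the fix is to avoid running su to switch users inside the container before the service processes start. Any user can be specified when the container is started; this does not affect the ulimit values uniformly inherited from the docker daemon.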
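If a user switch inside the container cannot be avoided, one conceivable alternative (an assumption of this sketch, not something the setup described above used) is to change the UID with plain system calls instead of su, so that no PAM session is opened and the limits file is never read. switch_user_without_pam and nproc_limit_survives_switch are hypothetical names.

```c
#define _DEFAULT_SOURCE   /* for setgroups() under strict -std= modes */
#include <assert.h>
#include <grp.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: change to uid/gid with plain system calls, without
 * opening a PAM session. Returns 0 on success, -1 on failure. */
int switch_user_without_pam(uid_t uid, gid_t gid)
{
    /* Nothing to do when the process already runs as the target user. */
    if (getuid() == uid && geteuid() == uid &&
        getgid() == gid && getegid() == gid)
        return 0;

    /* Only a privileged caller may reset the supplementary groups. */
    if (geteuid() == 0 && setgroups(1, &gid) != 0)
        return -1;
    if (setgid(gid) != 0)   /* group first, while privileges are still held */
        return -1;
    if (setuid(uid) != 0)
        return -1;
    assert(geteuid() == uid);
    return 0;
}

/* Returns 1 if RLIMIT_NPROC is unchanged across the switch, 0 if it
 * changed, -1 on error. */
int nproc_limit_survives_switch(uid_t uid, gid_t gid)
{
    struct rlimit before, after;

    if (getrlimit(RLIMIT_NPROC, &before) != 0)
        return -1;
    if (switch_user_without_pam(uid, gid) != 0)
        return -1;
    if (getrlimit(RLIMIT_NPROC, &after) != 0)
        return -1;
    return before.rlim_cur == after.rlim_cur &&
           before.rlim_max == after.rlim_max;
}
```

Because setuid and setgid do not write to the rlim array, the process keeps the values inherited from process 1, which matches the inheritance chain described in this section.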