I. Symptoms
- During load testing, several modules reported the error "unable to create thread: Resource temporarily unavailable".
- The error appeared in both the service processes and supervisor. After several failed restart attempts the service exited, and then the container exited.
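The second half of that message is the standard text for the errno value EAGAIN, which is what Linux returns when thread creation hits a limit. The sketch below only reproduces the string. The helper name format_thread_error is hypothetical, and the "unable to create thread" prefix is copied from the log line above rather than from any known runtime source.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: formats a thread-creation failure the way the log
 * line in this article reads. err is the error number returned by a call
 * such as pthread_create(); EAGAIN means a resource limit was hit. */
int format_thread_error(int err, char *buf, size_t len)
{
    return snprintf(buf, len, "unable to create thread: %s", strerror(err));
}
```

Calling format_thread_error(EAGAIN, buf, sizeof buf) on a glibc system yields the same text that appeared in the logs.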
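II. Root Cause Summary
In short, the problem comes from a mismatch between the scope in which process resource limits are configured and the scope in which the kernel enforces them per user. The ulimit configuration is read inside the container and applied to a process. When the kernel checks certain resources (here, the number of threads), however, it does not count per process: it counts the total across all processes owned by that user on the whole machine.
The per-user thread count is tracked by the kernel. Containers isolate their runtime environments, but to the kernel they are just groups of processes. Processes running under the same user ID are counted together, even when they belong to different containers.
The limit values that actually take effect, on the other hand, are set from user space: each container reads its own limit configuration.
The default limits configuration on CentOS sets a soft limit of 4096 processes for non-root users: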
root@cvm-172_16_30_8:~ # cat /etc/security/limits.d/20-nproc.conf
# Default limit for number of user's processes to prevent
# accidental fork bombs.
# See rhbz #432903 for reasoning.
* soft nproc 4096
root soft nproc unlimited
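When a system call creates a new thread, the kernel compares two values: the nproc limit of the calling process and the total number of threads that user has on the machine. This is why a process running under a given UID could not create threads even though the thread count inside its own container was low: the process's limit was 4096, while the kernel's machine-wide count for that UID had already passed 4096.
The detailed mechanism is explained below.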
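III. Details
This part covers three topics:
- When the ulimit configuration takes effect.
- How the kernel checks a request against the limit.
- How ulimit settings are inherited when docker starts a container, which leads to the fix.
How the ulimit configuration takes effect
1. How processes were originally started in the container:
On startup the container runs entrypoint.sh. The script creates a user with the specified ID, fixes directory ownership, then uses su to switch to that user and run supervisor, which in turn starts the service processes: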
data-proxy git:(master) $ cat entrypoint.sh
#!/bin/sh
username="yibot"
# Create the user if no passwd entry has this UID (third field) yet
grep -q "^[^:]*:[^:]*:${YIBOT_UID}:" /etc/passwd
if [ $? -ne 0 ]
then
    useradd -u "${YIBOT_UID}" "${username}"
fi
mkdir -p /data/yibot/"${MODULE}"/log/ && \
mkdir -p /data/supervisor/ && \
chown -R "${YIBOT_UID}":"${YIBOT_UID}" /entrypoint && \
chown -R "${YIBOT_UID}":"${YIBOT_UID}" /data && \
su yibot -c "supervisord -n"
2. A brief introduction to PAM
PAM (Pluggable Authentication Modules) is a set of libraries. The modules are not part of the kernel, and the kernel itself performs no authentication. PAM exists to decouple applications that need authentication from the authentication mechanism itself. Applications such as su and login use it today.
PAM introduction: https://www.linuxjournal.com/article/5940
PAM man page: http://man7.org/linux/man-pages/man8/pam.8.html
PAM source: https://github.com/linux-pam/linux-pam/tree/master/libpam
3. PAM and reading the ulimit configuration
In the PAM source, limits are handled in https://github.com/linux-pam/linux-pam/blob/master/modules/pam_limits/pam_limits.c. Each time a PAM session is opened through this module, it calls parse_config_file and then setup_limits:
retval = parse_config_file(pamh, pwd->pw_name, pwd->pw_uid, pwd->pw_gid, ctrl, pl);
retval = setup_limits(pamh, pwd->pw_name, pwd->pw_uid, ctrl, pl);
parse_config_file reads the limit configuration from the given configuration file and stores it in the pam_limit_s structure that pl points to. The structure is defined as follows:
/* internal data */
struct pam_limit_s {
int login_limit; /* the max logins limit */
int login_limit_def; /* which entry set the login limit */
int flag_numsyslogins; /* whether to limit logins only for a
specific user or to count all logins */
int priority; /* the priority to run user process with */
struct user_limits_struct limits[RLIM_NLIMITS];
const char *conf_file;
int utmp_after_pam_call;
char login_group[LINE_LENGTH];
};
The value of each limit is stored in the limits array. Each user_limits_struct holds both the soft limit and the hard limit:
struct user_limits_struct {
int supported;
int src_soft;
int src_hard;
struct rlimit limit;
};
The limit member is filled in by init_limits, which calls the getrlimit system call to fetch the current process's limits.
After the configuration file is parsed, setup_limits calls the setrlimit system call to update the rlim values in the current process's PCB:
for (i=0, status=LIMITED_OK; i<RLIM_NLIMITS; i++) {
int res;
if (!pl->limits[i].supported) {
/* skip it if its not known to the system */
continue;
}
if (pl->limits[i].src_soft == LIMITS_DEF_NONE &&
pl->limits[i].src_hard == LIMITS_DEF_NONE) {
/* skip it if its not initialized */
continue;
}
if (pl->limits[i].limit.rlim_cur > pl->limits[i].limit.rlim_max)
pl->limits[i].limit.rlim_cur = pl->limits[i].limit.rlim_max;
res = setrlimit(i, &pl->limits[i].limit);
if (res != 0)
pam_syslog(pamh, LOG_ERR, "Could not set limit for '%s': %m",
rlimit2str(i));
status |= res;
}
That is how the PAM library reads and applies the limit configuration. The behavior of the getrlimit and setrlimit system calls is described later.
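The read, clamp, and apply sequence can be sketched in user space as follows. This is a minimal illustration, not the pam_limits code: the helper names clamp_soft_to_hard and apply_soft_limit are hypothetical, and only the clamp step mirrors the excerpt from setup_limits.

```c
#include <sys/resource.h>

/* Same rule as the clamp in setup_limits: a soft limit may never
 * exceed the hard limit. */
void clamp_soft_to_hard(struct rlimit *lim)
{
    if (lim->rlim_cur > lim->rlim_max)
        lim->rlim_cur = lim->rlim_max;
}

/* Hypothetical helper: read the calling process's current limit, overlay a
 * configured soft value, clamp it, and apply it to the calling process.
 * Returns 0 on success, -1 on failure with errno set. */
int apply_soft_limit(int resource, rlim_t configured_soft)
{
    struct rlimit lim;

    if (getrlimit(resource, &lim) != 0)
        return -1;
    lim.rlim_cur = configured_soft;
    clamp_soft_to_hard(&lim);
    return setrlimit(resource, &lim);
}
```

With the CentOS default shown earlier, the equivalent call would be apply_soft_limit(RLIMIT_NPROC, 4096), made by su on behalf of the non-root user.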
4. su and PAM:
su source: https://github.com/shadow-maint/shadow/blob/master/src/su.c
The current su implementation wraps its PAM support in conditional compilation:
#ifdef USE_PAM
ret = pam_start ("su", name, &conv, &pamh);
if (PAM_SUCCESS != ret) {
SYSLOG ((LOG_ERR, "pam_start: error %d", ret);
fprintf (stderr,
_("%s: pam_start: error %d\n"),
Prog, ret));
exit (1);
}
On a current CentOS release, running ldd on su confirms that this option is enabled:
root@cvm-172_16_30_8:~ # ldd /usr/bin/su | grep pam
libpam.so.0 => /lib64/libpam.so.0 (0x00007f4d429a6000)
libpam_misc.so.0 => /lib64/libpam_misc.so.0 (0x00007f4d427a2000)
The su man page says the same:
This version of su uses PAM for authentication, account and session management. Some configuration options found in other su implementations such as e.g. support of a wheel group have to be configured via PAM.
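In pam_start ("su", name, &conv, &pamh), PAM looks under /etc/pam.d/ for a file named su and loads its configuration. That file lists the modules to use during PAM authentication, which is what makes PAM pluggable.
Finally, when PAM opens the session, pam_open_session calls pam_sm_open_session in pam_limits, which parses the limits configuration files and applies them.
After switching users, su starts a shell by default, and that shell inherits the updated limits. The inheritance mechanism is described below.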
/*
* Use the shell and create an argv
* with the rest of the command line included.
*/
argv[-1] = cp;
execve_shell (shellstr, &argv[-1], environ);
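Every process started after that inherits these limits.
Appendix, a PAM programming example: https://www.freebsd.org/doc/en_US.ISO8859-1/articles/pam/pam-sample-appl.html
This explains the behavior of the original entrypoint.sh: su invokes PAM, which reads the limit configuration inside the current container (/etc/security/limits.d/). For a non-root user, the nproc soft limit of the process is set to 4096.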
5. Behavior of the setrlimit system call
Kernel source: https://github.com/torvalds/linux
The setrlimit system call is defined as follows:
SYSCALL_DEFINE2(setrlimit, unsigned int, resource, struct rlimit __user *, rlim)
{
struct rlimit new_rlim;
if (copy_from_user(&new_rlim, rlim, sizeof(*rlim)))
return -EFAULT;
return do_prlimit(current, resource, &new_rlim, NULL);
}
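current returns a pointer to the current process's PCB, that is, its task_struct. do_prlimit calls security_task_setrlimit, the security-module hook that can veto the change. If the change is allowed, do_prlimit then writes the new values into the limits held through that PCB: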
int security_task_setrlimit(struct task_struct *p, unsigned int resource,
struct rlimit *new_rlim)
{
return call_int_hook(task_setrlimit, 0, p, resource, new_rlim);
}
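The call_int_hook macro below iterates over the list of hooks registered under FUNC and calls each in turn, stopping at the first one that returns a non-zero value: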
#define call_int_hook(FUNC, IRC, ...) ({ \
int RC = IRC; \
do { \
struct security_hook_list *P; \
\
hlist_for_each_entry(P, &security_hook_heads.FUNC, list) { \
RC = P->hook.FUNC(__VA_ARGS__); \
if (RC != 0) \
break; \
} \
} while (0); \
RC; \
})
For reference, here is part of the definition of the PCB, task_struct. The full definition is at https://github.com/torvalds/linux/blob/master/include/linux/sched.h
struct task_struct {
...
/* Real parent process: */
struct task_struct __rcu *real_parent;
/* Recipient of SIGCHLD, wait4() reports: */
struct task_struct __rcu *parent;
/*
* Children/sibling form the list of natural children:
*/
struct list_head children;
struct list_head sibling;
struct task_struct *group_leader;
...
/* Effective (overridable) subjective task credentials (COW): */
const struct cred __rcu *cred;
...
/* Signal handlers: */
struct signal_struct *signal;
...
}
Note: the kernel implements a generic doubly linked list with list_head and the list_entry macro.
rlim is defined in struct signal_struct:
struct signal_struct {
...
/*
* We don't bother to synchronize most readers of this at all,
* because there is no reader checking a limit that actually needs
* to get both rlim_cur and rlim_max atomically, and either one
* alone is a single word that can safely be read normally.
* getrlimit/setrlimit use task_lock(current->group_leader) to
* protect this instead of the siglock, because they really
* have no need to disable irqs.
*/
struct rlimit rlim[RLIM_NLIMITS];
...
}
The rlim array holds the process's resource limit values, and it is what setrlimit ultimately modifies. Each process therefore holds its own set of values.
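That each process holds its own copy can be observed from user space. In the sketch below, a child lowers its own soft limit and the parent's value stays the same. The function name child_setrlimit_is_private is hypothetical, and RLIMIT_NOFILE is used only because lowering its soft limit is harmless and needs no privilege.

```c
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 1 if a child lowering its own RLIMIT_NOFILE soft limit leaves
 * the parent's limits unchanged, 0 if the parent's limits changed, and
 * -1 if any system call fails. */
int child_setrlimit_is_private(void)
{
    struct rlimit before, after;
    int status;
    pid_t pid;

    if (getrlimit(RLIMIT_NOFILE, &before) != 0)
        return -1;

    pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* Child: lower only the soft limit of this process. */
        struct rlimit lim = before;
        if (lim.rlim_cur > 16)
            lim.rlim_cur = 16;
        _exit(setrlimit(RLIMIT_NOFILE, &lim) == 0 ? 0 : 1);
    }

    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status) ||
        WEXITSTATUS(status) != 0)
        return -1;
    if (getrlimit(RLIMIT_NOFILE, &after) != 0)
        return -1;
    return after.rlim_cur == before.rlim_cur &&
           after.rlim_max == before.rlim_max;
}
```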
How the kernel checks the nproc limit (process count limit)
1. Total number of processes per user
The task_struct definition above contains a struct cred, defined as follows:
struct cred {
...
kuid_t uid; /* real UID of the task */
kgid_t gid; /* real GID of the task */
kuid_t suid; /* saved UID of the task */
kgid_t sgid; /* saved GID of the task */
kuid_t euid; /* effective UID of the task */
kgid_t egid; /* effective GID of the task */
kuid_t fsuid; /* UID for VFS ops */
kgid_t fsgid; /* GID for VFS ops */
...
struct user_struct *user; /* real user ID subscription */
...
}
struct user_struct is defined as:
struct user_struct {
refcount_t __count; /* reference count */
atomic_t processes; /* How many processes does this user have? */
atomic_t sigpending; /* How many pending signals does this user have? */
...
}
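As the next part shows, the struct user_struct *user in the PCB is globally unique per UID. processes is therefore the number of processes that user currently has running on the whole system (on Linux, processes and threads are almost the same, and the kernel has no separate thread concept):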
http://www.mulix.org/lectures/kernel_workshop_mar_2004/things.pdf
In Linux, processes and threads are almost the same. The major difference is that threads share the same virtual memory address space.
2. struct user_struct *user is globally unique
In its implementation, su calls change_uid, which ends up switching the UID through the setuid system call:
SYSCALL_DEFINE1(setuid, uid_t, uid)
{
return __sys_setuid(uid);
}
__sys_setuid calls set_user to perform the actual switch. The argument new is a copy of the cred structure in the current PCB:
long __sys_setuid(uid_t uid)
{
...
if (ns_capable_setid(old->user_ns, CAP_SETUID)) {
new->suid = new->uid = kuid;
if (!uid_eq(kuid, old->uid)) {
retval = set_user(new);
if (retval < 0)
goto error;
}
} else if (!uid_eq(kuid, old->uid) && !uid_eq(kuid, new->suid)) {
goto error;
}
}
The full implementation of set_user:
/*
* change the user struct in a credentials set to match the new UID
*/
static int set_user(struct cred *new)
{
struct user_struct *new_user;
new_user = alloc_uid(new->uid);
if (!new_user)
return -EAGAIN;
/*
* We don't fail in case of NPROC limit excess here because too many
* poorly written programs don't check set*uid() return code, assuming
* it never fails if called by root. We may still enforce NPROC limit
* for programs doing set*uid()+execve() by harmlessly deferring the
* failure to the execve() stage.
*/
if (atomic_read(&new_user->processes) >= rlimit(RLIMIT_NPROC) &&
new_user != INIT_USER)
current->flags |= PF_NPROC_EXCEEDED;
else
current->flags &= ~PF_NPROC_EXCEEDED;
free_uid(new->user);
new->user = new_user;
return 0;
}
Next, alloc_uid:
struct user_struct *alloc_uid(kuid_t uid)
{
struct hlist_head *hashent = uidhashentry(uid);
struct user_struct *up, *new;
spin_lock_irq(&uidhash_lock);
up = uid_hash_find(uid, hashent);
spin_unlock_irq(&uidhash_lock);
...
}
In kernel/user.c, uidhashentry is defined as follows:
#define uidhashentry(uid) (uidhash_table + __uidhashfn((__kuid_val(uid))))
static struct kmem_cache *uid_cachep;
struct hlist_head uidhash_table[UIDHASH_SZ];
Together with the implementation of uid_hash_find:
static struct user_struct *uid_hash_find(kuid_t uid, struct hlist_head *hashent)
{
struct user_struct *user;
hlist_for_each_entry(user, hashent, uidhash_node) {
if (uid_eq(user->uid, uid)) {
refcount_inc(&user->__count);
return user;
}
}
return NULL;
}
This shows that for a given UID the user_struct is globally unique. It is looked up in the hash chain selected by the UID's hash entry, and the pointer is handed back to the PCB.
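The lookup pattern can be modeled in user space with a small hash table. This is a toy sketch, not kernel code: struct toy_user, toy_alloc_uid, and the table size are hypothetical, and the locking and reference counting that the kernel performs are left out.

```c
#include <stddef.h>
#include <stdlib.h>

#define TOY_HASH_SZ 8

/* Toy stand-in for the kernel's user_struct: one instance per UID. */
struct toy_user {
    unsigned int uid;
    int processes;            /* shared by every task of this UID */
    struct toy_user *next;    /* hash-chain link */
};

static struct toy_user *toy_table[TOY_HASH_SZ];

/* Find the entry for uid, creating it on first use. A second call with
 * the same uid returns the same pointer, never a new entry. */
struct toy_user *toy_alloc_uid(unsigned int uid)
{
    struct toy_user **bucket = &toy_table[uid % TOY_HASH_SZ];
    struct toy_user *u;

    for (u = *bucket; u != NULL; u = u->next)
        if (u->uid == uid)
            return u;

    u = calloc(1, sizeof *u);
    if (u == NULL)
        return NULL;
    u->uid = uid;
    u->next = *bucket;
    *bucket = u;
    return u;
}
```

Two callers that pass the same UID receive the same pointer, so an increment of processes made through one is visible through the other. In this model, that sharing is what lets threads in different containers add up under one UID.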
3. Checking whether a new process is allowed
The set_user function shown above already contains this check:
if (atomic_read(&new_user->processes) >= rlimit(RLIMIT_NPROC) &&
new_user != INIT_USER)
current->flags |= PF_NPROC_EXCEEDED;
else
current->flags &= ~PF_NPROC_EXCEEDED;
rlimit(RLIMIT_NPROC) reads the nproc limit from the current process's PCB, and the result is compared with the new user's total thread count.
The exec implementation, __do_execve_file, contains a similar check: https://github.com/torvalds/linux/blob/master/fs/exec.c
if ((current->flags & PF_NPROC_EXCEEDED) &&
atomic_read(&current_user()->processes) > rlimit(RLIMIT_NPROC)) {
retval = -EAGAIN;
goto out_ret;
}
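Other paths that create a process perform a similar check.
In addition, fork ends up in copy_creds, which runs atomic_inc(&p->cred->user->processes); to add one to the user's process count.
When credentials change, as in setuid or exec, commit_creds moves the process from the old user's count to the new one: if the two user_struct pointers differ, it increments the new user's processes and decrements the old user's.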
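The comparison itself is small enough to model directly. The sketch below follows the condition in set_user shown earlier (count >= limit, with the init user exempt). It is a toy model: the function name and parameters are hypothetical, and it assumes the fork path applies the same comparison.

```c
#include <errno.h>

/* Toy model of the nproc check.
 *   user_processes: machine-wide process/thread count for the UID
 *   nproc_soft:     RLIMIT_NPROC soft limit of the calling process
 *   exempt:         nonzero when the check does not apply (for example
 *                   the init user in the set_user excerpt)
 * Returns 0 if creation may proceed, -EAGAIN otherwise. */
int toy_nproc_check(long user_processes, long nproc_soft, int exempt)
{
    if (user_processes >= nproc_soft && !exempt)
        return -EAGAIN;
    return 0;
}
```

In the incident described here, the first argument is the sum over every container running under the same UID, while the second is the 4096 that su applied inside one container. With a much larger limit inherited from the docker daemon, the same count passes the check.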
IV. ulimit Inheritance in docker Containers
1. How a child process inherits its parent's ulimit
When a process forks, the fork implementation in kernel/fork.c copies the contents of the PCB. The relevant part is copy_signal:
static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
{
...
task_lock(current->group_leader);
memcpy(sig->rlim, current->signal->rlim, sizeof sig->rlim);
task_unlock(current->group_leader);
...
}
The rlim array in the PCB is copied in full, so unless setrlimit is called to change it, the child has the same limits as its parent.
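The inheritance can be checked from user space as well. In this sketch the parent lowers its RLIMIT_NOFILE soft limit by one, forks, and the child compares what it sees with the parent's value. The function name child_inherits_soft_limit is hypothetical, and the parent restores its original limit before returning.

```c
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 1 if a forked child sees exactly the limits its parent had at
 * fork time, 0 if it sees something else, and -1 on any failure. */
int child_inherits_soft_limit(void)
{
    struct rlimit saved, probe;
    int result = -1;
    pid_t pid;

    if (getrlimit(RLIMIT_NOFILE, &saved) != 0 || saved.rlim_cur < 2)
        return -1;
    probe = saved;
    probe.rlim_cur = saved.rlim_cur - 1;   /* a value the child can recognize */
    if (setrlimit(RLIMIT_NOFILE, &probe) != 0)
        return -1;

    pid = fork();
    if (pid == 0) {
        /* Child: report through the exit status whether the limits match. */
        struct rlimit seen;
        int ok = getrlimit(RLIMIT_NOFILE, &seen) == 0 &&
                 seen.rlim_cur == probe.rlim_cur &&
                 seen.rlim_max == probe.rlim_max;
        _exit(ok ? 0 : 1);
    }
    if (pid > 0) {
        int status;
        if (waitpid(pid, &status, 0) == pid && WIFEXITED(status))
            result = (WEXITSTATUS(status) == 0);
    }

    /* Restore: raising the soft limit back up to the hard limit needs no
     * privilege. */
    setrlimit(RLIMIT_NOFILE, &saved);
    return result;
}
```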
2. How docker starts a container
According to the official documentation, process 1 in a new container inherits its rlim from the docker daemon:
https://docs.docker.com/engine/reference/commandline/run/
Note: If you do not provide a hard limit, the soft limit will be used for both values. If no ulimits are set, they will be inherited from the default ulimits set on the daemon. as option is disabled now. In other words, the following script is not supported:...
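Because the docker daemon normally runs as root, process 1 has the same rlim as root even when the container is run as a non-root user.
As long as nothing reads the in-container ulimit configuration through PAM (for example by running su inside the container, or by logging in remotely), all child processes inherit root's rlim as well.
In summary, the fix is to avoid running su inside the container before the service processes are started. Specify the desired user before the container starts instead; this does not affect ulimit inheritance from the docker daemon.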