Fatal Error: OSError: [Errno 12] Cannot allocate memory
Debugging background: I was using the code from https://github.com/arunmallya/packnet on GitHub.
While debugging, the following error appeared:
    main()
  File "main.py", line 378, in main
    manager.prune()
  File "main.py", line 263, in prune
    savename='_final', best_accuracy=accuracy)
  File "main.py", line 217, in train
    self.do_epoch(epoch_idx, optimizer)
  File "main.py", line 174, in do_epoch
    for batch, label in tqdm(self.train_data_loader, desc='Epoch: %d ' % (epoch_idx)):
  File "/home/rvlg/anaconda3/envs/torch/lib/python3.5/site-packages/tqdm/_tqdm.py", line 1032, in __iter__
    for obj in iterable:
  File "/home/rvlg/anaconda3/envs/torch/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 301, in __iter__
    return DataLoaderIter(self)
  File "/home/rvlg/anaconda3/envs/torch/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 158, in __init__
    w.start()
  File "/home/rvlg/anaconda3/envs/torch/lib/python3.5/multiprocessing/process.py", line 105, in start
    self._popen = self._Popen(self)
  File "/home/rvlg/anaconda3/envs/torch/lib/python3.5/multiprocessing/context.py", line 212, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/home/rvlg/anaconda3/envs/torch/lib/python3.5/multiprocessing/context.py", line 267, in _Popen
    return Popen(process_obj)
  File "/home/rvlg/anaconda3/envs/torch/lib/python3.5/multiprocessing/popen_fork.py", line 20, in __init__
    self._launch(process_obj)
  File "/home/rvlg/anaconda3/envs/torch/lib/python3.5/multiprocessing/popen_fork.py", line 67, in _launch
    self.pid = os.fork()
OSError: [Errno 12] Cannot allocate memory
Given what the code does, my first suspicion was that the machine was running out of memory, so I ran:
watch -n 2 nvidia-smi
watch -n 2 free -m
I monitored the machine for the whole run (GPU, physical memory, and swap) and found that memory was not the cause. The bug hunt went nowhere.
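The figures that `free -m` prints can also be logged from inside the training script, which makes it easier to line them up with the moment of failure. The following is only a minimal sketch, assuming a Linux machine that exposes /proc/meminfo; `parse_meminfo` and `memory_snapshot` are hypothetical helper names, not part of the packnet code.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict.

    Entries reported in kB are converted to MiB; entries without a
    unit (plain counts) are kept as they are.
    """
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if not sep or not fields:
            continue  # skip blank or malformed lines
        amount = int(fields[0])
        if len(fields) > 1 and fields[1] == "kB":
            amount //= 1024  # kB -> MiB
        values[key.strip()] = amount
    return values


def memory_snapshot(path="/proc/meminfo"):
    """Return the main RAM and swap figures in MiB (Linux only)."""
    with open(path) as handle:
        info = parse_meminfo(handle.read())
    names = ("MemTotal", "MemAvailable", "SwapTotal", "SwapFree")
    return {name: info.get(name) for name in names}
```

Calling `print(memory_snapshot())` at the start of each epoch would give a rough record of available memory over time.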
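I then changed approach. Judging from the failing code and the error message, the problem was in dataloader.py, so I searched Google for: dataloader OSError: [Errno 12] Cannot allocate memory
Sure enough, many other people had hit the same error during data loading, and they suggested several possible causes: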
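1. The machine is out of memory (already ruled out).
2. A system limit on the number of processes: raise the maximum process count as described in https://blog.csdn.net/m0_37644085/article/details/92795488 (tried, no effect).
3. Set pin_memory=False (tried, no effect).
4. Change the number of data-loading worker processes via num_workers. The value in use here was 4. Changing it to 1 had no effect, but changing it to 0 solved the problem!!! The program now runs. With num_workers=0 the data is loaded in the main process, so no worker is forked and the failing os.fork() call is never reached.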
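The fix in item 4 can also be applied automatically, falling back to in-process loading only when the workers cannot be started. This is a minimal sketch rather than anything from the packnet code: `iterate_with_fallback` and `make_loader` are hypothetical names, and it assumes the error is raised when the loader's iterator is created, as in the traceback above.

```python
import errno


def iterate_with_fallback(make_loader, num_workers=4):
    """Yield batches from make_loader(num_workers).

    make_loader is a callable that takes a worker count and returns
    an iterable loader. If starting the workers fails with ENOMEM
    (Errno 12), retry with 0 workers, i.e. load in the main process.
    """
    try:
        # Worker processes are forked when the iterator is created.
        iterator = iter(make_loader(num_workers))
    except OSError as exc:
        if exc.errno != errno.ENOMEM:
            raise  # a different OS error: do not hide it
        iterator = iter(make_loader(0))
    yield from iterator
```

With PyTorch this might be used as `for batch, label in iterate_with_fallback(lambda n: DataLoader(train_dataset, batch_size=32, shuffle=True, num_workers=n)):`, where `train_dataset` and the batch size are placeholders. Loading in the main process is usually slower, so the fallback trades speed for the ability to run at all.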
I am writing this post as a record: it took me more than two days of careful digging!!! I hope it helps others.