mpirun: unrecognized argument mca

Problem description:

I have a C++ solver that I need to run in parallel with the following command:

nohup mpirun -np 16 ./my_exec > log.txt & 

This command runs my_exec independently on 16 of the available processors on my node. It used to work perfectly.

Last week the HPC department performed an OS upgrade, and now, when launching the same command, I get two warning messages (for each processor). The first is:

--------------------------------------------------------------------------
WARNING: It appears that your OpenFabrics subsystem is configured to only
allow registering part of your physical memory. This can cause MPI jobs to
run with erratic performance, hang, and/or crash.

This may be caused by your OpenFabrics vendor limiting the amount of
physical memory that can be registered. You should investigate the
relevant Linux kernel module parameters that control how much physical
memory can be registered, and increase them to allow registering all
physical memory on your machine.

See this Open MPI FAQ item for more information on these Linux kernel module
parameters:

    http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages

  Local host:           tamnun
  Registerable memory:  32768 MiB
  Total memory:         98294 MiB

Your MPI job will continue, but may be behave poorly and/or hang.
--------------------------------------------------------------------------

Then I get output from my code telling me it believes I launched only one instance of the executable (Nprocs = 1 instead of 16):

# MPI IS ON; Nprocs = 1
Filename = ../input/odtParam.inp

# MPI IS ON; Nprocs = 1

***** Error, process 0 failed to create ../data/data_0/, or it was already there

Finally, the second warning message is:

--------------------------------------------------------------------------
An MPI process has executed an operation involving a call to the
"fork()" system call to create a child process. Open MPI is currently
operating in a condition that could result in memory corruption or
other system errors; your MPI job may hang, crash, or produce silent
data corruption. The use of fork() (or system() or other calls that
create child processes) is strongly discouraged.

The process that invoked fork was:

  Local host:          tamnun (PID 17446)
  MPI_COMM_WORLD rank: 0

If you are *absolutely sure* that your application will successfully
and correctly survive a call to fork(), you may disable this warning
by setting the mpi_warn_on_fork MCA parameter to 0.
--------------------------------------------------------------------------

After looking around online, I tried to follow the second warning's advice by setting the MCA parameter mpi_warn_on_fork to 0 with the command:

nohup mpirun --mca mpi_warn_on_fork 0 -np 16 ./my_exec > log.txt & 

which produced the following error message:

[mpiexec@tamnun] match_arg (./utils/args/args.c:194): unrecognized argument mca
[mpiexec@tamnun] HYDU_parse_array (./utils/args/args.c:214): argument matching returned error
[mpiexec@tamnun] parse_args (./ui/mpich/utils.c:2964): error parsing input array
[mpiexec@tamnun] HYD_uii_mpx_get_parameters (./ui/mpich/utils.c:3238): unable to parse user arguments

I am running RedHat 6.7 (Santiago). I have contacted the HPC department, but since I am at a university, it may take a day or two to get a response. Any help or guidance would be appreciated.

EDIT:

Indeed, I was compiling my code with Open MPI's mpic++ but running the executable with Intel's mpirun command, hence the error (after the OS upgrade, Intel's mpirun was set as the default). I had to put the path to Open MPI's mpirun at the beginning of the $PATH environment variable.
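The fix works because the shell resolves mpirun by scanning the directories in $PATH in order, so whichever implementation appears first wins. A minimal sandbox sketch of that behavior (the /tmp/demo directories and fake wrapper scripts are purely illustrative, not the cluster's real layout):

```shell
# Build two fake "mpirun" wrappers to stand in for the two installed stacks.
mkdir -p /tmp/demo/openmpi/bin /tmp/demo/intel/bin
printf '#!/bin/sh\necho "mpirun (Open MPI)"\n' > /tmp/demo/openmpi/bin/mpirun
printf '#!/bin/sh\necho "Intel mpirun"\n'      > /tmp/demo/intel/bin/mpirun
chmod +x /tmp/demo/openmpi/bin/mpirun /tmp/demo/intel/bin/mpirun

# Whichever directory comes first in PATH determines which mpirun runs.
env PATH=/tmp/demo/intel/bin:/tmp/demo/openmpi/bin mpirun   # prints: Intel mpirun
env PATH=/tmp/demo/openmpi/bin:/tmp/demo/intel/bin mpirun   # prints: mpirun (Open MPI)
```

Prepending Open MPI's real bin directory to $PATH in the shell startup file makes the fix persist across sessions.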

The code now runs as expected, but I still get the first warning message above (it no longer suggests using the MCA parameter mpi_warn_on_fork). I think (but am not sure) this is an issue I need to take up with the HPC department.

You have a typo in your command: it is mpi_warn_on_fork (you wrote "work") – Marco

Ha, right, I used the correct command; the typo was made while posting the question. – solalito

[mpiexec@tamnun] match_arg (./utils/args/args.c:194): unrecognized argument mca
[mpiexec@tamnun] HYDU_parse_array (./utils/args/args.c:214): argument matching returned error
[mpiexec@tamnun] parse_args (./ui/mpich/utils.c:2964): error parsing input array
                                  ^^^^^
[mpiexec@tamnun] HYD_uii_mpx_get_parameters (./ui/mpich/utils.c:3238): unable to parse user arguments
                                                  ^^^^^

In the last case you are using MPICH. MPICH is not Open MPI, and its process launcher does not recognize the --mca argument, which is specific to Open MPI (MCA stands for Modular Component Architecture, the basic framework that Open MPI is built upon). A typical case of mixing up multiple MPI implementations.

Thanks for your answer! However, I am unsure where to begin to fix it. Any suggestions? – solalito

First find out which MPI implementations are installed on the machine and how to switch between them. Also, make sure you use the mpirun from the same implementation that was used to compile the program. Compiling with one MPI implementation and running with the runtime of another (or vice versa) does not work; you instead get a bunch of singleton processes, each with rank 0 in its own MPI_COMM_WORLD, as you have already observed. –
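The diagnosis step above can be sketched as a few shell checks. The wrapper flags are implementation-specific (--showme is Open MPI's, -show is MPICH's), and the module lines assume the cluster uses environment modules, which is common on HPC systems but not guaranteed here:

```shell
# Which binaries does the shell actually resolve? The paths often reveal the stack.
for cmd in mpirun mpic++; do
  command -v "$cmd" || echo "$cmd: not found in PATH"
done

# Each implementation identifies itself in its version banner,
# e.g. "mpirun (Open MPI) x.y.z" versus MPICH's Hydra banner.
mpirun --version 2>&1 | head -n 1

# Wrapper introspection: Open MPI wrappers accept --showme, MPICH wrappers -show.
mpic++ --showme 2>/dev/null || mpic++ -show 2>/dev/null || true

# On clusters using environment modules (an assumption about this site):
#   module avail        # list installed MPI stacks
#   module load openmpi # select one for the current session
```

If the path printed for mpirun and the path printed for mpic++ belong to different installations, that mismatch is exactly the singleton-process symptom described above.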

See my edit. I will accept your answer since it put me on the right track (once the problem was diagnosed, the solution was easy). – solalito