Software support forum: Support about parallelization
  1. Hi all, 

    I need support with parallelizing a serial code, regarding both the fundamentals and their application in Matlab.

    The initial part of the code runs on the root, then the same script should run in parallel on all the processors, calling other scripts according to the user input data, and the root processor collects everything at the end.

    I've written it in Matlab, and I always get a transparency error. I bet I'm making a trivial mistake but can't see what it is...

    How can I get support?

    Thanks

    Patrizio

     
  2. Sorry about the delay in replying, but we've been tied up in meetings this week. Could you please supply us with the code that you are trying to use, since I'm afraid we can't be of any help from the description that you have given. If you'd rather not put your code in public, please email it to rse@warwick.ac.uk.

     
  3. Hi Chris, 

    thank you very much!

    Don't worry at all. Tomorrow afternoon I'll upload the files with a brief explanation.

    Thanks

    Patrizio

     
  4. Hi Chris, 

    find attached the relevant scripts:

    - the TDF_construct_3D_v3bis script runs on the root and contains a loop over a set of energy values (typically ~150-200); the for loop I plan to parallelize starts at line 210 and calls the script TDF_calc_3D_v3 at line 213

    - the script TDF_calc_3D_v3 would run on each core of the parallel "pool"; it needs the data contained in the workspace, which should be shared among all the cores, and contains further for loops over the following quantities: bands (typically 1 to 5 values), k-points (~10^4 to 10^5) and temperatures (~5 values). Inside all these nested for loops, i.e. for each k-point of each band at each temperature, it calls a set of scripts like the attached tau_ADP_3D_v3 (the loop structure is sketched just after this list).

    - TDF_calc_3D_v3 returns a set of temporary matrices that the TDF_construct_3D_v3bis script uses to build the TDF (TransportDistributionFunction) variables at each energy value
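
    In pseudocode, the overall loop structure is roughly as follows (a sketch only; the names, bounds and call signatures are illustrative, not the actual ones in the attached scripts):

    for id_E = 1:nE                         % energies, the loop I plan to parallelize
        for id_n = 1:n_bands                % bands, typically 1 to 5
            for id_k = 1:nK(id_E, id_n)     % k-points, their number depends on energy and band
                for id_T = 1:nT             % temperatures, ~5 values
                    % per-point relaxation time, e.g. via the attached tau_ADP_3D_v3
                    tau(id_k, id_T) = tau_ADP_3D_v3(id_E, id_n, id_k, id_T);
                end
            end
        end
    end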

    I planned to parallelize over the energy values since their number is known from the beginning, while the number of k-points changes according to the energy and band... I understand I'm reasoning as a physicist and not as a software engineer...

    BTW, if you wish to run the code, you need all the input data and files, plus the "main" script, which is a burden... I would have to send you at least 5 more files together with a dummy example that runs in a few minutes; let me know if you'd like these as well

    Thanks

    Patrizio

    3 attachments

     
  5. Hi,

    It looks like this is the classic Matlab parallel transparency error as detailed here. The problem is that by default only variables that are defined within the body of the parallel loop are available within the parallel loop. The only solutions that I know of are to use either the parallel.pool.Constant object to give you access to outside data (needs Matlab R2015b or newer) or the WorkerObjWrapper function. You will need to arrange to pass in all of your TDF_xx, TDF_xy etc. variables as well as your TDF_xx_temp etc. variables. There is a fairly good example of using parallel.pool.Constant here.
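
    A minimal sketch of the parallel.pool.Constant approach, assuming your TDF_calc step can be wrapped as a function (the names sharedData and TDF_calc_as_function are illustrative, not your actual identifiers):

    % build the shared data once on the client
    sharedData = struct('bands', bands, 'kpoints', kpoints, 'T', T);
    C = parallel.pool.Constant(sharedData);   % copied to each worker once, not per iteration

    parfor id_E = 1:nE
        % inside the loop, read (never write) the shared data through C.Value
        TDF_temp(id_E) = TDF_calc_as_function(id_E, C.Value);
    end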

     
  6. Hi Chris!

    Thank you so much for your quick and informative support!!!

    I'll go through all of these.

    Thanks!

    Patrizio

     
  7. Hi Chris,

    in the end I saved the workspace and attached it to the parpool

    save('WorkSpace','-v7.3','-nocompression');

    poolobj = gcp;
    addAttachedFiles(poolobj,{'WorkSpace.mat'})
    WorkersConstant = parallel.pool.Constant('WorkSpace.mat'); % in the end unused: the file name is passed directly below

    parfor id_E = 1:nE
        for id_n = 1:n_bands_transp

            % the big tau_calc routine, the actual tau_calc in the serial version
            [tau_temp, tau_matth_temp, tau_IIS_temp] = tau_calc_funct_v3(id_E, id_n, 'WorkSpace.mat');

            taus(id_E,id_n) = tau_temp;
            taus_matth(id_E,id_n) = tau_matth_temp;

        end
    end

    and all the other files are lumped into a function with subfunctions that load the whole workspace

    It's probably not the best practice but it works for now.
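
    The shape of that function is roughly as follows (a sketch only, not the real tau_calc_funct_v3 body):

    function [tau, tau_matth, tau_IIS] = tau_calc_funct_v3(id_E, id_n, ws_file)
        % each worker loads the attached workspace file into a local struct,
        % which keeps the parfor body transparent
        ws = load(ws_file);
        % ... the actual calculation, using ws.<variable> together with id_E and id_n ...
        tau = 0; tau_matth = 0; tau_IIS = 0;   % placeholders for the real outputs
    end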

    The problem is that when I run it on a cluster I get a lot of aborted workers... can you support me on this, or shall I raise another topic, or ...?

    Thanks

    Patrizio

    1 attachment

     
  8. Hi, 

    I'm afraid we're both on holiday this week, but will look as soon as we're back. 

    Heather

     
  9. Hi Heather, 

    no problem, enjoy your holidays!

    BTW, may I ask if there is a way to request more memory for the root processor (if it matters)? When I run the parallel for loop I need in principle no more than 1 GB on each processor, but the final collected struct can be more than 20 GB. I'm however thinking of reshuffling the code approach to skip this step...

     
  10. Hi, 

    I'm not a Matlab person, but the fact that your workers are crashing suggests that what you're trying to do there isn't valid (possibly the workers aren't actually getting independent copies). I doubt there's going to be any way around the deep interconnections in your code, as pointed out on the Matlab forums (https://www.mathworks.com/matlabcentral/answers/472239-parallel-computing-with-shared-variables-problem-with-struct), and that is probably going to make debugging this almost impossible. I think you will have to address the points from Edric Ellis there and refactor your code into functions that you can then simply use parfor on (roughly as sketched below).
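
    Very roughly, the shape being suggested there is something like this (I'm paraphrasing the Matlab answer; all names here are hypothetical):

    % one_energy_step.m -- a pure function: everything the iteration needs
    % comes in as explicit arguments, nothing from a shared workspace
    function out = one_energy_step(E, bands, temperatures)
        out = 0;   % placeholder for the real per-energy calculation
    end

    % which parfor can then use without any transparency tricks:
    parfor id_E = 1:nE
        results(id_E) = one_energy_step(E_values(id_E), bands, temperatures);
    end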

    This link https://www.mathworks.com/matlabcentral/answers/100816-how-do-i-locate-the-crash-dump-files-generated-by-matlab should get you the crash dumps if you want to try debugging. 

    By the way, it gets very awkward if you ask the same question here and on bugzilla (and externally) without mentioning that, and wastes everybody's time. In future, please mention on bugzilla or here if you've already posted somewhere else. 

     
  11. To add, when you locate the crash dumps, please do share them either here or on bugzilla, just in case there is something wrong on the cluster. 

     
  12. Hi Heather, 

    thanks for your messages; sorry that my former message (#7, 15:12, Tue 30 Jul 2019) was not clear...

    I did parallelize the code and it works properly on my desktop PC (4 workers), but when I run it on the cluster some workers (say 3 or 4 out of 10) sometimes crash. This is why I raised the point in Bugzilla, as I thought it was related to the cluster.

    In the link you posted, https://www.mathworks.com/matlabcentral/answers/100816-how-do-i-locate-the-crash-dump-files-generated-by-matlab, they explain that "If MATLAB is started with the '-logfile' option, a separate crash file is not created. Instead, the crash information is written to the end of the log file specified by the '-logfile' option." That's my case, and the log file is the one attached to my former message. In the future I'll post any relevant thing on Bugzilla only.

    About the cyclomatic complexity you mentioned, I'll definitely need support to sort it out.

    Thanks

    Patrizio

     
  13. Hi Chris, hi all, 

    just to confirm that the aborted workers issue was related to the overall memory available to the root worker.

    A submission file as follows:

    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=20
    #SBATCH --mem=75000mb
    #SBATCH --mem-per-cpu=2500mb

    and making Matlab run on all 20 CPUs, does not produce any aborted workers, as there is a sufficient amount of additional memory allocated for the root CPU. I couldn't find another way to allow more memory for the master than for the other workers.
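
    For completeness, on the Matlab side I just size the pool from the Slurm allocation (a sketch, assuming the standard SLURM_CPUS_PER_TASK environment variable):

    % open a pool matching the Slurm allocation; the client (root) process
    % then keeps the extra headroom from --mem for collecting the results
    nCpus = str2double(getenv('SLURM_CPUS_PER_TASK'));
    parpool('local', nCpus);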

    Before, the workers crashed and the calculation continued for as long as the remaining memory was sufficient.

    BTW I can confirm that the scheme

    save('WorkSpace','-v7.3','-nocompression');

    poolobj = gcp;
    addAttachedFiles(poolobj,{'WorkSpace.mat'})

    % mystruct is a sliced output variable: parfor assembles it from the
    % per-iteration assignments, so it needs no initialization beforehand
    parfor id_p = 1:Np

        struct_temp = myfunction(id_p, 'WorkSpace.mat');

        mystruct(id_p) = struct_temp;

    end

    where myfunction loads the workspace, solves the transparency issue.

    The memory issue came from the fact that each worker needs 1 to 2 GB, no more, but mystruct can be more than 20 GB, so the root needs much more memory than the workers.

    Thanks for your time

    Patrizio
