Creo Simulate. Large deflection analysis with contact.
Can anyone help me understand why it appears that only one core's worth of CPU is doing all the work (11 others are available)?
Thanks
Hi Charles,
I don't run many contact or LDA analyses, but from my experience with static analysis I suspect there are two reasons:
We've had great success speeding up Mechanica by installing the maximum memory (24 GB) in our machines (don't forget to set Mechanica to use 8192 MB) and putting both temporary files and results onto a 6 GB RAM disk. (For very large models, writing the results can take a long time, so if you can't create a RAM disk big enough for both temp files and results, test it with temp files on the hard drive and results on the RAM disk, as well as the other way round. Or buy an SSD...)
Running with lots of memory and a RAM disk, we now frequently see "CPU time" much greater than "Elapsed time" in the run summary file. This wasn't generally possible when writing data to a hard disk.
HTH,
Jonathan
Other useful pointers on SOLRAM setting, etc., in this thread:
Thanks,
For hardware, we are reasonably well set up with RAM and fast striped disks etc. It's not hardware.
Possibilities I pondered included:
My 48-hour study would then take (and I know some allowance is required for the time to farm tasks out to the various CPUs) roughly 4 hours?
I must have missed a trick.
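That "48 hours ÷ 12 cores ≈ 4 hours" expectation only holds if the whole run parallelises. A quick Amdahl's-law sketch (the parallel fractions below are illustrative assumptions, not measurements from this model) shows why a mostly serial solver barely benefits from extra cores:

```python
# Amdahl's law: overall speedup is limited by the serial fraction of the run.
def amdahl_speedup(parallel_fraction, cores):
    """Speedup when only part of the work can use all cores."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / cores)

# Assumed parallel fractions, for illustration only:
hours = 48.0
for frac in (0.25, 0.95):
    print(f"parallel {frac:.0%}: {hours / amdahl_speedup(frac, 12):.1f} h on 12 cores")
```

With only 25% of the work parallel, 12 cores take a 48-hour run down to about 37 hours, not 4; even at 95% parallel it is still over 6 hours.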
Hi Charles,
Have a close look at the timings in your .rpt file. Here's one (edited) from a small model I've just run, using the hard drive as temporary space:
------------------------------------------------------------
Mechanica Structure Version L-01-57:spg
Summary for Design Study "p78166_iss4t"
Tue Oct 09, 2012 16:20:11
------------------------------------------------------------
Run Settings
Memory allocation for block solver: 8192.0
Parallel Processing Status
Parallel task limit for current run: 4
Parallel task limit for current platform: 64
Number of processors detected automatically: 4
Elements: 3419
>> Pass 1 <<
Calculating Element Equations (16:20:20)
Total Number of Equations: 61764
Maximum Edge Order: 3
Solving Equations (16:20:23)
Post-Processing Solution (16:20:25)
Checking Convergence (16:20:25)
Resource Check (16:20:26)
Elapsed Time (sec): 16.21
CPU Time (sec): 13.28
Memory Usage (kb): 10093364
Wrk Dir Dsk Usage (kb): 67584
>> Pass 2 <<
Calculating Element Equations (16:20:26)
Total Number of Equations: 85652
Maximum Edge Order: 9
Solving Equations (16:20:31)
Post-Processing Solution (16:20:34)
Checking Convergence (16:20:35)
Calculating Disp and Stress Results (16:20:36)
Analysis "p78166_iss4t" Completed (16:20:40)
------------------------------------------------------------
Memory and Disk Usage:
Machine Type: Windows XP 64 Bit Edition
RAM Allocation for Solver (megabytes): 8192.0
Total Elapsed Time (seconds): 29.66
Total CPU Time (seconds): 24.81
Maximum Memory Usage (kilobytes): 10093364
Working Directory Disk Usage (kilobytes): 119808
Results Directory Size (kilobytes):
35622 .\p78166_iss4t
Maximum Data Base Working File Sizes (kilobytes):
101376 .\p78166_iss4t.tmp\kel1.bas
18432 .\p78166_iss4t.tmp\oel1.bas
------------------------------------------------------------
Run Completed
Tue Oct 09, 2012 16:20:40
------------------------------------------------------------
Total time is just under 30 seconds.
9 seconds is spent meshing (single-core)
3 seconds "calculating equations"
2 seconds "solving equations" on Pass 1 (I think this is the multi-core bit)
1 second "post-processing" and "checking convergence"
5 seconds "calculating equations" for Pass 2
3 seconds "solving equations"
2 seconds "post-processing" and "checking convergence"
4 seconds "calculating results".
So, of the roughly 30 seconds total, about 7 were spent in the multi-core sections, and only 25 seconds of CPU time were used in total.
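The stage timings read off the .rpt above can be totalled to estimate the parallel fraction (treating the "solving equations" stages as the multi-core portion, as assumed in this thread):

```python
# Per-stage timings (seconds) taken from the .rpt breakdown above.
stages = {
    "meshing": 9, "calc equations p1": 3, "solve p1": 2,
    "post/convergence p1": 1, "calc equations p2": 5, "solve p2": 3,
    "post/convergence p2": 2, "calc results": 4,
}
total = sum(stages.values())                         # ~29 s elapsed
parallel = stages["solve p1"] + stages["solve p2"]   # ~5 s solving
print(f"parallel fraction of this run: {parallel / total:.0%}")
```

Under 20% of this small run is spent where extra cores can help, which is why the elapsed time stays close to the CPU time.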
In short, "software architecture (not yet rewritten for multi-core)" is your answer - but it may not be possible to write the solver to use multiple cores for every stage of the process.
What do the times look like for your 48-hour run?
Are you setting Mechanica to use 8192 MB, or are you on the default 128? This makes a big difference. Note that you'll need about twice as much total RAM as the setting value though - so you need 16 GB installed to use 8192.
Solids: 3602
Elements: 3602
Contacts: 5
Parallel task limit for current run: 12
Parallel task limit for current platform: 64
Number of processors detected automatically: 12
Machine Type: Windows 7 64 Service Pack 1
RAM Allocation for Solver (megabytes): 16384
Total Elapsed Time (seconds): 173129.06
Total CPU Time (seconds): 184729.94
Maximum Memory Usage (kilobytes): 3509315
Working Directory Disk Usage (kilobytes): 2248967
Results Directory Size (kilobytes): 12655210
1000 steps (small load increments)
Typical iterations ...
Time Step 54 of 1000: 1.06000e+02
Contact Area: 4.95453e+02
Calculating Disp and Stress Results (15:30:21)
Solving Equations (15:30:28)
Time Step 55 of 1000: 1.08000e+02
Contact Area: 4.95924e+02
Calculating Disp and Stress Results (15:31:13)
Solving Equations (15:31:20)
Time Step 56 of 1000: 1.10000e+02
Contact Area: 4.96089e+02
Calculating Disp and Stress Results (15:32:05)
Solving Equations (15:32:12)
Stress and displacement results are calculated and written at the end of each iteration (1000 times for Pass 1 and 1000 times for Pass 2), though I don't think suppressing the writing of results would speed things up much.
OK... so each step is taking about 7 seconds to write results and about 45 seconds to solve equations. 1000 steps × 52 s ≈ 14 hours per pass, plus the other overheads.
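The per-step arithmetic, using the timestamp deltas quoted above:

```python
# Per-step costs estimated from the log timestamps above.
write_s, solve_s = 7, 45          # results write + equation solve per step
steps, passes = 1000, 2
per_pass_h = steps * (write_s + solve_s) / 3600
print(f"~{per_pass_h:.1f} h per pass, ~{passes * per_pass_h:.0f} h for both passes")
```

That accounts for roughly 29 of the 48 elapsed hours, with convergence checking and other overheads making up the rest.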
Interestingly, although you've allocated 16 GB for Mechanica it's only using about 3.5 GB - and 2.25 GB of temporary files. I know you're running a striped array, but I would definitely try reducing the Mechanica allocation to perhaps 6144 MB, creating a RAM disk around 6 GB, and using the RAM disk for your temporary files. I'm using a driver called IMDisk, but that's just one I found using Google - I have no affiliation, and others may also work fine.
I imagine that 12 cores all trying to read and write data simultaneously can probably saturate even a very fast RAID array...
I think the last time I used a RAM disk was on my Amstrad 1512 DD-CM running the GEM operating environment, to which I later added a maths co-processor and a 10 MB HDD.
I will look at the RAM disk, as the cumulative write time is about 7% of the total analysis time.
What advantage would we gain by reducing the RAM allocation? We are fortunate to have oodles of it, and Creo Simulate 1.0 permits double the block solver allocation; why not make sure that it all stays in memory?
I agree that disks (even SSDs) won't keep up with the CPU. RAM will, though. Which brings me back to the original question: it is the 45-second solve time per iteration, done while occupying only a single core's worth of CPU.
Thanks.
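If results writing really is only ~7% of the run, Amdahl-style reasoning puts a hard ceiling on what eliminating it can gain (a bound under that stated assumption, not a prediction):

```python
# Upper bound on speedup from removing the results-writing time entirely,
# assuming the 7% figure quoted above and that nothing else changes.
write_fraction = 0.07
max_speedup = 1.0 / (1.0 - write_fraction)
print(f"best case from eliminating writes: {max_speedup:.2f}x")
```

About 1.08x at best; any larger gain from a RAM disk (as reported later in this thread) must come from writes also stalling the solve, not just from the write time itself.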
Yes, I hadn't used a RAM drive since 386/486 days and boot floppy disks! However, on our systems it's transformed Mechanica from "quite useful" to an everyday tool - we've been running like this for almost a year now.
I suggested reducing RAM allocation just to ensure that you have enough space for the RAM disk, with more left over. On our 24 GB systems, 8 GB for Mechanica plus 6 GB of RAM disk leaves a healthy amount for xtop.exe and general Windows - if you have 64 GB or similar, then by all means leave Mechanica at 16 GB. However, your .rpt shows that Mechanica is only actually taking 3.5 GB, regardless of the amount you've allocated, so there's no need for more. Other people's studies have shown that Mechanica runs fastest with the allocation set to roughly half the installed memory, from which in this case you should first subtract the RAM drive.
The interesting bit is the 2.25 GB of temporary files. I strongly suspect that this data, or a large amount of it, is being written on every pass - and this is preventing the speed from exceeding one core's worth, even though the job may well be divided into 12 threads.
If you have enough RAM, then putting the results onto a RAM drive may also yield a big improvement as the results seem to involve a lot of non-sequential writes. However, with your 1000-step analysis I expect the biggest gain to be in the solve.
Note that your total CPU time is greater than the total elapsed time - so the job is definitely multi-threading at some point; just not very effectively.
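The CPU-to-elapsed ratio quantifies "not very effectively", using the totals from the .rpt quoted earlier:

```python
# CPU time / elapsed time = average number of busy cores over the run.
# Totals taken from the .rpt figures quoted above.
cpu_s, elapsed_s = 184729.94, 173129.06
avg_cores = cpu_s / elapsed_s
print(f"average parallelism: {avg_cores:.2f} of 12 cores")
```

An average of about 1.07 busy cores on a 12-core machine confirms that the run is overwhelmingly single-threaded.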
Somewhat nerd like, I have just been watching the disk and thread/cpu activity for a single time step LDA with contact analysis.
Each iteration, the msengine process bimbles along for about 5 minutes on CPU 8 only, with 7 threads, 8.23% of total CPU, and negligible HDD activity until the iteration is complete. Then there is a brief 'flurry' of activity (about 2 CPUs' worth) whilst there is a bit of I/O.
I presume this 'flurry' is associated with the rejigging of the stiffness matrix for the next iteration (Newton-Raphson fashion), together with the writing of temporary files.
So there is disappointingly little disk activity to eliminate (I will still try a RAM disk).
I am assuming it's Newton-Raphson convergence, but that method carries out linear solutions using out-of-balance loads, and linear Mechanica solutions use many CPUs.
Anyone from PTC want to put me out of my misery?
Ta
Hmm - interesting stats there (only 7 threads; little HDD activity).
Although you have more cores than we do, there may be something specific to LDA as my colleague is currently running a large, single-step contact analysis and all four of his cores are working hard (75-99% total CPU load).
What are you using to view the number of threads?
Yes, contact studies use lots of CPUs. Useful in a cold office on a winter's day.
Between each frenzied iteration it slows down - I guess it is deciding what the new spring stiffness values are to be for the next iteration, doing something interesting with the stiffness matrix (and writing to disk).
I am using the 'resource monitor' in W7x64.
I am a bit new to this; however, we are in the same boat. I am running an analysis on a silicone seal, which obviously has very large deformation (non-linear). The seal material is also non-linear, and the model has several contacts. I simplified the model as much as I could, but the element count is still around 15,000.
The model does run properly; however, it takes 6 days and I am still not at the full displacement yet.
Right now we are considering building a new computer for this purpose. Since it is obvious that the majority of the run takes place on one processor, we are evaluating the following two options:
Intel Core i7-5820K
Intel Core i7-4790K
I am currently using this one:
Intel Xeon E5640
Does anyone have experience with these two, or perhaps other suggestions?
We are also considering a few SSDs in a RAID and about 16 GB of DDR3 memory.
However, what I would really like to know is what sort of gains I may expect. If I shave the 6 days down to 4, I haven't really gained all that much; however, a 5x increase in speed would be worth it.
Any comments?
Some useful info here
http://communities.ptc.com/message/266735#266735
For single-core processes such as LDA, get the fastest single-core CPU possible.
Regards
Hello, Jonathan,
Have you seen these two links?
My OS is Windows 7 Pro 32-bit SP1, with two Intel Xeon X5960 CPUs (2 × 6 cores), but I have not done speed tests.
I have not noticed it running any faster.
Best Regards.
Denis.
We have implemented a RAM disk via IMDisk on the same computer and it increased the speed five-fold!
10 minute iterations are now only taking 2 minutes!
Amazing!
However, the CPU load only increased from 25% to 30%, suggesting that there is still a bottleneck somewhere and more room for improvement.
However, I should now have enough ammunition to go for a computer upgrade with faster RAM, and more of it...
Hi Jonathan,
Glad to hear the RAM disk solution helped!
However, I suspect that the remaining 'bottleneck' is that you have three out of four cores sitting mostly idle; as discussed at the beginning of this thread, LDA seems to use only one core.
Faster RAM is now unlikely to help you much (I'd expect a percentage improvement only) although more of it will allow a larger RAM disk and therefore solving larger analyses. A faster (single-thread score) CPU will also help - your E5640 looks pretty slow at 1166 (we're disposing of workstations with faster processors than that here), so replacing with an i7-4790k at over 2500 should double the speed, or even more. The 5820 is a little slower, around 2000, but depending on the price you may decide that's a more sensible route.
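For a mostly single-core LDA run, the expected gain scales roughly with the single-thread benchmark ratio. Using the approximate scores quoted above:

```python
# Approximate single-thread benchmark scores quoted in this thread.
scores = {"Xeon E5640": 1166, "i7-5820K": 2000, "i7-4790K": 2500}
base = scores["Xeon E5640"]
for cpu, score in scores.items():
    print(f"{cpu}: ~{score / base:.1f}x vs E5640")
```

So roughly 2.1x for the 4790K and 1.7x for the 5820K on the single-core portions - a rough guide only, since memory and disk behaviour also differ between platforms.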
As a wildcard, if cost matters and you're sure you don't need multi-thread power, consider a fast i3 or even a Pentium - I've recently upgraded my home gaming rig with a Pentium G3258 which was just £50 (CPU and cooler) and scores well over 2000, whilst a 4790k currently goes for over £250... Also note that the "k" is strictly speaking unnecessary unless you plan on overclocking.
Cheers,
another Jonathan
A quick question: what are your elapsed time and CPU time from an analysis, using the RAM disk? Is CPU time now roughly equal to elapsed time?
I have recently moved to an i7 machine with 32 GB of RAM and the speed doubled again.
There are times when the CPU is operating at 100% on all cores, but most of the time (as you mentioned, with LDA) it is only running on a single core.
It looks like I am now getting CPU-to-elapsed-time ratios of ~2.5:1, but the run has only just begun. I will let you know if it changes.
Another CPU upgrade doesn't seem worth it, considering it would only marginally increase the speed. If I went all out and got the 4.0 GHz i7, it might double the speed during the single-core periods, but it might also slow down the multi-core parts somewhat. I doubt it would double the overall speed...
The best solution would be for Creo to offer multi-core capability in LDA...