LDA, Only one core used?

346gnu · ‎Oct 08, 2012

Creo Simulate. Large deflection analysis with contact.

Can anyone help me understand why it appears that only 1 core's worth of cpu is doing all the work (11 others available)?

Thanks


JonathanHodgson · ‎Oct 08, 2012

Hi Charles,

I don't run many contact or LDA analyses, but from my experience with static analysis I suspect there are two reasons:

Certain stages of the analysis cannot be split up. Meshing is the first one; then, to solve the matrix, one core has to 'supervise' both dividing the job between multiple cores and re-combining the results. Actually writing the results to disk is also a single-core task, so it's only really the matrix solve that runs on multiple cores.
The limiting factor may be hard drive speed, in which case even if multiple cores are working they may all be waiting for temporary file access, giving a net usage no higher than one core.

We've had great success speeding up Mechanica by installing the maximum memory (24 GB) in our machines (don't forget to set Mechanica to use 8192 MB) and putting both temporary files and results onto a 6 GB RAM disk. (For very large models, writing the results can take a long time, so if you can't create a RAM disk big enough for both temp files and results, test it with temp files on the hard drive and results on the RAM disk, as well as the other way round. Or buy an SSD...)

Running with lots of memory and a RAM disk, we now frequently see "CPU time" much greater than "Elapsed time" in the run summary file. This wasn't generally possible when writing data to a hard disk.

HTH,

Jonathan

JonathanHodgson · ‎Oct 08, 2012

Other useful pointers on SOLRAM setting, etc., in this thread:

http://communities.ptc.com/message/183371#183371

346gnu · ‎Oct 09, 2012

Thanks,

For hardware, we are reasonably well set up with RAM and fast striped disks etc. It's not hardware.

Possibilities I pondered included:

software architecture (not yet rewritten for multi-core)
a licensing limitation on bought-in LDA solver technology

If all 12 cores were used, would my 48-hour study take roughly 4 hours (allowing something for the time needed to poke tasks at the various cpus)?

I must have missed a trick.

JonathanHodgson · ‎Oct 09, 2012

Hi Charles,

Have a close look at the timings in your .rpt file. Here's one (edited) from a small model I've just run, using the hard drive as temporary space:

------------------------------------------------------------

Mechanica Structure Version L-01-57:spg

Summary for Design Study "p78166_iss4t"

Tue Oct 09, 2012 16:20:11

------------------------------------------------------------

Run Settings

Memory allocation for block solver: 8192.0

Parallel Processing Status

Parallel task limit for current run: 4

Parallel task limit for current platform: 64

Number of processors detected automatically: 4

Elements: 3419

>> Pass 1 <<

Calculating Element Equations (16:20:20)

Total Number of Equations: 61764

Maximum Edge Order: 3

Solving Equations (16:20:23)

Post-Processing Solution (16:20:25)

Checking Convergence (16:20:25)

Resource Check (16:20:26)

Elapsed Time (sec): 16.21

CPU Time (sec): 13.28

Memory Usage (kb): 10093364

Wrk Dir Dsk Usage (kb): 67584

>> Pass 2 <<

Calculating Element Equations (16:20:26)

Total Number of Equations: 85652

Maximum Edge Order: 9

Solving Equations (16:20:31)

Post-Processing Solution (16:20:34)

Checking Convergence (16:20:35)

Calculating Disp and Stress Results (16:20:36)

Analysis "p78166_iss4t" Completed (16:20:40)

------------------------------------------------------------

Memory and Disk Usage:

Machine Type: Windows XP 64 Bit Edition

RAM Allocation for Solver (megabytes): 8192.0

Total Elapsed Time (seconds): 29.66

Total CPU Time (seconds): 24.81

Maximum Memory Usage (kilobytes): 10093364

Working Directory Disk Usage (kilobytes): 119808

Results Directory Size (kilobytes):

35622 .\p78166_iss4t

Maximum Data Base Working File Sizes (kilobytes):

101376 .\p78166_iss4t.tmp\kel1.bas

18432 .\p78166_iss4t.tmp\oel1.bas

------------------------------------------------------------

Run Completed

Tue Oct 09, 2012 16:20:40

------------------------------------------------------------

Total time is just under 30 seconds.

9 seconds is spent meshing (single-core)

3 seconds "calculating equations"

2 seconds "solving equations" on Pass 1 (I think this is the multi-core bit)

1 second "post-processing" and "checking convergence"

5 seconds "calculating equations" for Pass 2

3 seconds "solving equations"

2 seconds "post-processing" and "checking convergence"

4 seconds "calculating results".

So, of 29/30 seconds total, about 7 were spent in the multi-core sections, and only 25 seconds of CPU time were used in total.
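The breakdown above can be reproduced from the time stamps in the .rpt file. The following is a minimal sketch, assuming each stage runs until the next time stamp and that the run does not cross midnight; the `stage_durations` helper and the "Start-up and meshing" label are illustrative names, not part of Mechanica.

```python
import re
from datetime import datetime

# Stage lines copied from the .rpt excerpt above. The start time (16:20:11)
# is taken from the report header.
REPORT = """\
Calculating Element Equations (16:20:20)
Solving Equations (16:20:23)
Post-Processing Solution (16:20:25)
Checking Convergence (16:20:25)
Resource Check (16:20:26)
Calculating Element Equations (16:20:26)
Solving Equations (16:20:31)
Post-Processing Solution (16:20:34)
Checking Convergence (16:20:35)
Calculating Disp and Stress Results (16:20:36)
Analysis "p78166_iss4t" Completed (16:20:40)
"""

# A stage name followed by a time stamp in brackets at the end of the line.
STAMP = re.compile(r"^(.*?)\s*\((\d\d:\d\d:\d\d)\)\s*$")


def stage_durations(report, start="16:20:11"):
    """Return [(stage, seconds)]; each stage is assumed to last until the next stamp."""
    fmt = "%H:%M:%S"
    events = [("Start-up and meshing", datetime.strptime(start, fmt))]
    for line in report.splitlines():
        match = STAMP.match(line)
        if match:
            events.append((match.group(1), datetime.strptime(match.group(2), fmt)))
    # Pair each event with the one that follows it; the final "Completed"
    # line only closes the previous stage and has no duration of its own.
    return [
        (name, int((nxt - begin).total_seconds()))
        for (name, begin), (_, nxt) in zip(events, events[1:])
    ]


durations = stage_durations(REPORT)
for name, secs in durations:
    print(f"{secs:3d} s  {name}")
print("total:", sum(secs for _, secs in durations), "s")
```

Summing only the "Solving Equations" entries gives the time spent in the stage the post guesses is multi-core; which stages really run in parallel is not confirmed anywhere in this thread.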

In short, "software architecture (not yet rewritten for multi-core)" is your answer - but it may not be possible to write the solver to use multiple cores for every stage of the process.
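To put numbers on why 12 cores need not mean 12 times faster, here is a small sketch using Amdahl's law. The parallel fractions are hypothetical; nobody in this thread has measured what fraction of an LDA run is actually multi-threaded.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Amdahl's law: overall speed-up when only part of the job scales with cores."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


# Hypothetical parallel fractions, applied to a 48-hour run on 12 cores.
for p in (0.25, 0.50, 0.90, 1.00):
    speedup = amdahl_speedup(p, cores=12)
    print(f"parallel fraction {p:.2f}: {speedup:5.2f}x faster, "
          f"48 h -> {48 / speedup:5.1f} h")
```

Only the last case, where every stage scales perfectly, gets a 48-hour study down to 4 hours.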

What do the times look like for your 48-hour run?

Are you setting Mechanica to use 8192 MB, or are you on the default 128? This makes a big difference. Note that you'll need about twice as much total RAM as the setting value though - so you need 16 GB installed to use 8192.

346gnu · ‎Oct 09, 2012

Solids: 3602

Elements: 3602

Contacts: 5

Parallel task limit for current run: 12

Parallel task limit for current platform: 64

Number of processors detected automatically: 12

Machine Type: Windows 7 64 Service Pack 1
RAM Allocation for Solver (megabytes): 16384

Total Elapsed Time (seconds): 173129.06
Total CPU Time (seconds): 184729.94
Maximum Memory Usage (kilobytes): 3509315
Working Directory Disk Usage (kilobytes): 2248967

Results Directory Size (kilobytes): 12655210

1000 steps (small load increments)

Typical iterations ...

Time Step 54 of 1000: 1.06000e+02

Contact Area: 4.95453e+02

Calculating Disp and Stress Results (15:30:21)

Solving Equations (15:30:28)

Time Step 55 of 1000: 1.08000e+02

Contact Area: 4.95924e+02

Calculating Disp and Stress Results (15:31:13)

Solving Equations (15:31:20)

Time Step 56 of 1000: 1.10000e+02

Contact Area: 4.96089e+02

Calculating Disp and Stress Results (15:32:05)

Solving Equations (15:32:12)

Stresses and displacements are calculated and written at the end of each iteration, so 1000 times for pass 1 and 1000 times for pass 2, though I don't think preventing the writing of results would speed things up much.

JonathanHodgson · ‎Oct 10, 2012

OK... so each step is taking about 7 seconds to write results, and about 45 seconds to solve equations. 1000 × 52 = about 14 hours per pass, plus the other overheads.
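The arithmetic behind that estimate, as a quick sketch. It assumes every one of the 1000 steps takes the same time as steps 54 and 55 in the excerpt, which may well not hold across the whole run.

```python
from datetime import datetime


def seconds_between(first, second):
    """Whole seconds between two HH:MM:SS stamps on the same day."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(second, fmt) - datetime.strptime(first, fmt)
    return int(delta.total_seconds())


# Time stamps from time steps 54 and 55 of the report excerpt.
write_s = seconds_between("15:30:21", "15:30:28")  # Calculating Disp and Stress Results
solve_s = seconds_between("15:30:28", "15:31:13")  # Solving Equations

steps = 1000
per_pass_h = steps * (write_s + solve_s) / 3600

print(f"write {write_s} s, solve {solve_s} s per step")
print(f"{per_pass_h:.1f} h per pass, {2 * per_pass_h:.1f} h for two passes")
print(f"reported total elapsed: {173129.06 / 3600:.1f} h")
```

The gap between the two-pass figure and the reported total elapsed time is the "other overheads", plus any steps that ran slower than the two sampled here.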

Interestingly, although you've allocated 16 GB for Mechanica it's only using about 3.5 GB - and 2.25 GB of temporary files. I know you're running a striped array, but I would definitely try reducing the Mechanica allocation to perhaps 6144 MB, creating a RAM disk around 6 GB, and using the RAM disk for your temporary files. I'm using a driver called ImDisk, but that's just one I found using Google - I have no affiliation, and others may also work fine.

I imagine that 12 cores all trying to read and write data simultaneously can probably saturate even a very fast RAID array...

346gnu · ‎Oct 10, 2012

I think the last time I used a RAM disk was on my Amstrad1512 DD-CM with GEM operating environment to which I later added a math co-processor and 10MB HDD.

I will look at the RAM disk, as the cumulative write time is about 7% of total analysis time.

What advantage would we gain by reducing the RAM allocation? We are fortunate to have oodles of it, and Creo Simulate 1.0 permits double the block solver allocation; why not make sure that it all stays in memory?

I agree that disks (even SSDs) won't keep up with the cpu. RAM will though. Which brings me back to the original question: it is the 45-second solve time per iteration that matters, and it is done using only a single core's worth of cpu.

Thanks.

JonathanHodgson · ‎Oct 10, 2012

Yes, I hadn't used a RAM drive since 386/486 days and boot floppy disks! However, on our systems it's transformed Mechanica from "quite useful" to an everyday tool - we've been running like this for almost a year now.

I suggested reducing RAM allocation just to ensure that you have enough space for the RAM disk, with more left over. On our 24 GB systems, 8 GB of Mechanica plus 6 GB of RAM disk leaves a healthy amount for xtop.exe and general Windows - if you have 64 GB or similar, then by all means leave Mechanica at 16 GB. However, your .rpt shows that Mechanica is only actually taking 3.5 GB, regardless of the amount you've allocated, so there's no need for more. Other people's studies have shown that Mechanica runs fastest with the allocation set to roughly half the installed memory, from which in this case you should first subtract the RAM drive.
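As a rough sketch of that rule of thumb (solver allocation of about half the installed memory, after subtracting the RAM drive): the `memory_plan` helper is a made-up name, and the result is a guide to an upper limit rather than a setting anyone here has benchmarked.

```python
def memory_plan(installed_gb, ram_disk_gb):
    """Rule of thumb from this thread: solver gets about half of (installed - RAM disk)."""
    available = installed_gb - ram_disk_gb
    solver_gb = available / 2
    left_over_gb = available - solver_gb  # for xtop.exe, Windows and solver overhead
    return solver_gb, left_over_gb


# 24 GB is the machine described above; 64 GB is the "or similar" case.
for installed, disk in ((24, 6), (64, 6)):
    solver, rest = memory_plan(installed, disk)
    print(f"{installed} GB installed, {disk} GB RAM disk: "
          f"solver up to ~{solver:.0f} GB ({solver * 1024:.0f} MB), "
          f"{rest:.0f} GB left over")
```

For the 24 GB machine this gives about 9 GB, in line with the 8192 MB setting used above.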

The interesting bit is the 2.25 GB of temporary files. I strongly suspect that this data, or a large amount of it, is being written on every pass - and this is preventing the speed from exceeding one core's worth, even though the job may well be divided into 12 threads.

If you have enough RAM, then putting the results onto a RAM drive may also yield a big improvement as the results seem to involve a lot of non-sequential writes. However, with your 1000-step analysis I expect the biggest gain to be in the solve.

Note that your total CPU time is greater than the total elapsed time - so the job is definitely multi-threading at some point; just not very effectively.
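A quick way to read those two numbers is as an average number of cores kept busy. This sketch assumes the "Total CPU Time" in the .rpt is summed across all threads, which is how the figures in this thread appear to behave.

```python
def average_cores_busy(cpu_seconds, elapsed_seconds):
    """Average number of cores kept busy over the whole run."""
    return cpu_seconds / elapsed_seconds


# (Total CPU Time, Total Elapsed Time) from the two reports in this thread.
runs = {
    "small static model, 4 cores, hard drive": (24.81, 29.66),
    "48-hour LDA contact run, 12 cores": (184729.94, 173129.06),
}
for name, (cpu, elapsed) in runs.items():
    print(f"{name}: {average_cores_busy(cpu, elapsed):.2f} cores busy on average")
```

A value only just above 1 on a 12-core machine fits the picture of a run that multi-threads briefly and otherwise sits on one core.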

346gnu · ‎Oct 10, 2012

Somewhat nerd-like, I have just been watching the disk and thread/cpu activity for a single time step LDA with contact analysis.

Each iteration, the msengine process bimbles along for about 5 mins on cpu 8 only (7 threads, 8.23% of total), with negligible HDD activity until the iteration is complete. Then there is a brief 'flurry' of activity (2 cpus' worth-ish) whilst there is a bit of i/o.
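For reference, a whole-machine CPU percentage converts to "cpus' worth" as in the sketch below; `cores_worth` is just an illustrative helper.

```python
def cores_worth(total_cpu_percent, logical_cpus):
    """Convert a whole-machine CPU percentage into an equivalent number of cpus."""
    return total_cpu_percent / 100 * logical_cpus


# msengine during an iteration on the 12-cpu machine.
print(f"{cores_worth(8.23, 12):.2f} cpus' worth")
# One cpu fully busy on a 12-cpu machine, as a percentage of the total.
print(f"{100 / 12:.2f}% of total")
```

So 8.23% of a 12-cpu machine is almost exactly one cpu running flat out.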

I presume this 'flurry' is associated with the rejigging of the stiffness matrix for the next iteration (Newton-Raphson fashion) together with the writing of temporary files.

Thus there is a disappointingly small amount of disk activity to erode (I will still try a RAM disk).

I am assuming it's Newton-Raphson convergence, but this method carries out linear solutions using out-of-balance loads, and linear Mechanica solutions use many cpus.

Anyone from PTC want to put me out of my misery?

Ta

JonathanHodgson · ‎Oct 10, 2012

Hmm - interesting stats there (only 7 threads; little HDD activity).

Although you have more cores than we do, there may be something specific to LDA as my colleague is currently running a large, single-step contact analysis and all four of his cores are working hard (75-99% total CPU load).

What are you using to view the number of threads?

346gnu · ‎Oct 10, 2012

Yes, contact studies use lots of cpu's. Useful in a cold office on a winter's day.

Between each frenzied iteration it slows down, I guess deciding what the new spring stiffness values are to be for the next iteration, doing something interesting with the stiffness matrix (and writing to disk).

I am using the 'resource monitor' in W7x64.

jreeh · ‎Dec 30, 2014

I am a bit new to this. However, we are in the same boat. I am running an analysis on a silicone seal which obviously has a very large deformation (non-linear). The seal material is also non-linear and the model has several contacts. I simplified the model as much as I could; however, the element count is still around 15,000.

The model does run properly; however, it takes 6 days and I am still not at the full displacement yet.

Right now we are considering building a new computer for this purpose. Since it is obvious that the majority of the run takes place on one processor, we are evaluating the following two options:

Intel Core i7-5820K

Intel Core i7-4790K

I am currently using this one:

Intel Xeon E5640

Does anyone have experience with these two, or perhaps other suggestions?

We are also considering a few SSDs in a RAID and about 16 GB of DDR3 memory.

However, what I would really like to know is what kind of gains I can expect. If I shave the 6 days down to 4, I haven't really gained all that much. However, a 5x increase in speed would be worth it.

Any comments?

346gnu · ‎Dec 30, 2014

Some useful info here

http://communities.ptc.com/message/266735#266735

For single-core processes such as LDA, get the fastest single-core cpu possible.

Regards

DenisJaunin · ‎Dec 30, 2014

Hello, Jonathan,

Have you seen these two links?

My system is Windows 7 Pro 32-bit SP1 with 2 Intel Xeon X5960 (2 x 6 cores), but I have not done any speed tests.

I have not noticed it running any faster.

Best Regards.

Denis.

http://communities.ptc.com/servlet/JiveServlet/previewBody/3254-102-1-3881/Form_2317156_Multi-threading.pdf

http://communities.ptc.com/message/183371#183371#183371

jreeh · ‎Dec 30, 2014

We have implemented a RAM disk via ImDisk on the same computer and it increased the speed five-fold!

10 minute iterations are now only taking 2 minutes!

Amazing!

However, the CPU load only increased from 25% to 30%, suggesting that there is still a bottleneck somewhere and more room for improvement.

However, I should have enough ammunition to go for a computer upgrade with faster RAM and more of it...

JonathanHodgson · ‎Jan 05, 2015

Hi Jonathan,

Glad to hear the RAM disk solution helped!

However, I suspect that the remaining 'bottleneck' is that you have three out of four cores sitting mostly idle; as discussed at the beginning of this thread, LDA seems only to use one core.

Faster RAM is now unlikely to help you much (I'd expect a percentage improvement only) although more of it will allow a larger RAM disk and therefore solving larger analyses. A faster (single-thread score) CPU will also help - your E5640 looks pretty slow at 1166 (we're disposing of workstations with faster processors than that here), so replacing with an i7-4790k at over 2500 should double the speed, or even more. The 5820 is a little slower, around 2000, but depending on the price you may decide that's a more sensible route.
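The comparison above as a quick sketch. It assumes that run time in the single-core stages scales directly with the single-thread benchmark score, which is only a first approximation; the scores are the approximate figures quoted in this post.

```python
# Approximate single-thread benchmark scores as quoted in this post.
scores = {"Xeon E5640": 1166, "i7-4790K": 2500, "i7-5820K": 2000}


def relative_speed(candidate, baseline="Xeon E5640"):
    """Estimated single-core speed of `candidate` relative to `baseline`."""
    return scores[candidate] / scores[baseline]


for cpu in ("i7-4790K", "i7-5820K"):
    print(f"{cpu}: ~{relative_speed(cpu):.2f}x the E5640 on single-core work")
```

Any stage that is limited by disk or memory rather than the CPU will not speed up by this ratio.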

As a wildcard, if cost matters and you're sure you don't need multi-thread power, consider a fast i3 or even a Pentium - I've recently upgraded my home gaming rig with a Pentium G3258 which was just £50 (CPU and cooler) and scores well over 2000, whilst a 4790k currently goes for over £250... Also note that the "k" is strictly speaking unnecessary unless you plan on overclocking.

Cheers,

another Jonathan

JonathanHodgson · ‎Jan 05, 2015

A quick question: what are your elapsed time and CPU time from an analysis, using the RAM disk? Is CPU time now roughly equal to elapsed time?

jreeh · ‎Jan 05, 2015

I have recently moved to an i7 machine with 32 GB of RAM and the speed doubled again.

There are times when the CPU is operating at 100% on all cores, but most of the time (as you mentioned with LDA) it is only running on a single core.

It looks like I am now getting CPU to elapsed time ratios of ~2.5:1, but the run has only just begun. I will let you know if it changes.

Another CPU upgrade doesn't seem worth it considering it would only marginally increase the speed. If I went all out and got the 4.0 GHz i7 it might double the speed during the single-core periods, but it may also slow down the multi-core parts some. I doubt it would double the overall speed...

The best solution would be for CREO to offer multicore capabilities on LDA...