Skip to main content
12-Amethyst
February 25, 2023
Solved

How to speed up Mathcad Prime for large data sets?

  • February 25, 2023
  • 1 reply
  • 6933 views

Does anyone know how to speed up MCP for processing large data sets?

 

I am working with a 50832x14 size data set that is output in Excel from another program. I would prefer to filter, sort, and manipulate the data in MCP vs Excel. However, trying to execute a simple filter step in MCP on a data set this large has proven impossible. I can't get the function to complete (endless spinning wheel). Whereas I can complete this one step in Excel in about 1sec.

 

Does anyone know how to speed this up? Or any other tricks for processing large data sets in MCP? I would really, really prefer to avoid Excel but at this point, I'll have to use it.

Thanks!

Andy_C_0-1677346454287.pngAndy_C_1-1677346461019.pngAndy_C_2-1677346507400.pngAndy_C_3-1677346511013.png

 

 

 

 

 

Best answer by Werner_E

Give the attached version a try with a larger data set. I am not sure if it would perform OK as of the large vector "mask" (as many rows as the data set itself) which is used.

1 reply

25-Diamond I
February 25, 2023

A matrix of that size should normally be no problem in Mathcad.

But Prime often was reported here as being very slow with larger data sets.

I guess that the "submatrix" command slows down the calculation and you may try to replace it by using the row-selector.

I guess you could replace the whole filter function by a using the built-in "lookup()" function.

 

BTW, what the reason behind the last step in your function - getting rid of the first row if its first entry is 0? Why the if-statement? The first row will always consist of zeros because thats how you initialize it.

Your function does not cover the case if no row is found containing the given value in the appropriate column. The attempt to get rid of the first row with the last submatrix command will throw an error because there is no row number 2.

 

EDIT: I gave it a try and using the row operator instead of  "stack" and "submatrix" speeds up the calculation so it takes just half the time. But using match() and trim() (not lookup() as I had written above) really significantly cuts down calculation time to less than one hundredths.

 

The usage of your variable "flag" is quite complicate and actually you don't need this variable at all.
Your filter function returns all data rows which do NOT contain the given value in the specified column. Is this intended?

25-Diamond I
February 25, 2023

OK, her is the version using the row selector instead of stack and submatrix

Werner_E_7-1677361755706.png

 

And here is the one using the match and trim functions

Werner_E_1-1677360931623.png

I checked that both would return the same result as your original function (all rows NOT containing "val" in column "col".

I used a smaller data matrix with just 1001 entries.

Werner_E_2-1677361070271.png

Now I used a timer function to compare calculation speed of the three functions. The values seen show the time needed (on my rather slow oldtimer) for 100 calls to the functions.

Werner_E_8-1677361805912.png

 

Your full original data is now processed by FilterCol3a in about 0,2 seconds.

And because I suspected that you actually wanted the function to return just the rows which actually contain the given string I added a function to do this as well

Werner_E_4-1677361383683.png

and checked if the output of both would combine to the full data set

Werner_E_6-1677361502204.png

 

Prime 6 worksheet attached

 

 

EDIT: I was curious and tried timing the full data set

Werner_E_0-1677363110572.png

As you can see this time it took FilterCol2 not 60 times longer compared to FilterCol3a, but rather nearly 3000 times longer.

I did not wait for the original FilterCol to finish for obvious reasons 😉
But I guess that even FilterCol2 would be much faster if you program it to just return the matching rows and not all rows which don't match.

 

EDIT2: I rewrote and streamlined your initial FilterCol() function to return only the rows which contain the given value so it does the same as my FilterCol3b. Now there are much less stack() and submatrix()  functions to execute and calc time is significantly more reasonable now (but still 300 times more than FilterCol3b)

Werner_E_1-1677363647712.png

 

 

 

 

 

25-Diamond I
February 25, 2023

According your filter functions - do they, especially the function combine, also consider multiple occurrences?

I would guess that they are a but slower than mine on data matrices with more columns as lookup is called for every coloumn and not just once as the match command in my versions. Also your combine() may take a bit longer compared to the built in trim() function which I use to clean up.

But just give it a try. Give your function a unique name like FilterCol5a and call the the same way using the smaller dataset output2 as I did with the other functions in my sheet.

For the timer() to be useful you have to turn off Multithreading in the calculating options.

 

Concerning removing duplicates in a vector, I guess that functions which do so were posted here in the forum quite often. Maybe the search function can bring some of them to light. I had written that kind of utility function (should be a standard built-in) a couple of times and I guess every time differently ;-).

As we here have dealt with match and trim, I made up a tricky version using these functions which uses the try...on error to stop the iteration (iteration stops when trim throws an error because it would have to delete all rows in the vector):

Werner_E_0-1677366434802.png