Solved: How to speed up Mathcad Prime for large data sets?

Andy_C · ‎Feb 25, 2023

Does anyone know how to speed up MCP for processing large data sets?

I am working with a 50832x14 size data set that is output in Excel from another program. I would prefer to filter, sort, and manipulate the data in MCP vs Excel. However, trying to execute a simple filter step in MCP on a data set this large has proven impossible. I can't get the function to complete (endless spinning wheel). Whereas I can complete this one step in Excel in about 1sec.

Does anyone know how to speed this up? Or any other tricks for processing large data sets in MCP? I would really, really prefer to avoid Excel but at this point, I'll have to use it.

Thanks!

Werner_E · ‎Feb 26, 2023

Give the attached version a try with a larger data set. I am not sure if it would perform OK as of the large vector "mask" (as many rows as the data set itself) which is used.

View solution in original post

Werner_E · ‎Feb 25, 2023

A matrix of that size should normally be no problem in Mathcad.

But Prime often was reported here as being very slow with larger data sets.

I guess that the "submatrix" command slows down the calculation and you may try to replace it by using the row-selector.

I guess you could replace the whole filter function by a using the built-in "lookup()" function.

BTW, what the reason behind the last step in your function - getting rid of the first row if its first entry is 0? Why the if-statement? The first row will always consist of zeros because thats how you initialize it.

Your function does not cover the case if no row is found containing the given value in the appropriate column. The attempt to get rid of the first row with the last submatrix command will throw an error because there is no row number 2.

EDIT: I gave it a try and using the row operator instead of "stack" and "submatrix" speeds up the calculation so it takes just half the time. But using match() and trim() (not lookup() as I had written above) really significantly cuts down calculation time to less than one hundredths.

The usage of your variable "flag" is quite complicate and actually you don't need this variable at all.
Your filter function returns all data rows which do NOT contain the given value in the specified column. Is this intended?

Werner_E · ‎Feb 25, 2023

OK, her is the version using the row selector instead of stack and submatrix

And here is the one using the match and trim functions

I checked that both would return the same result as your original function (all rows NOT containing "val" in column "col".

I used a smaller data matrix with just 1001 entries.

Now I used a timer function to compare calculation speed of the three functions. The values seen show the time needed (on my rather slow oldtimer) for 100 calls to the functions.

Your full original data is now processed by FilterCol3a in about 0,2 seconds.

And because I suspected that you actually wanted the function to return just the rows which actually contain the given string I added a function to do this as well

and checked if the output of both would combine to the full data set

Prime 6 worksheet attached

EDIT: I was curious and tried timing the full data set

As you can see this time it took FilterCol2 not 60 times longer compared to FilterCol3a, but rather nearly 3000 times longer.

I did not wait for the original FilterCol to finish for obvious reasons 😉
But I guess that even FilterCol2 would be much faster if you program it to just return the matching rows and not all rows which don't match.

EDIT2: I rewrote and streamlined your initial FilterCol() function to return only the rows which contain the given value so it does the same as my FilterCol3b. Now there are much less stack() and submatrix() functions to execute and calc time is significantly more reasonable now (but still 300 times more than FilterCol3b)

Andy_C · ‎Feb 25, 2023

Thank you so much @Werner_E !! This is amazing. Huge help.

I was already working on a lookup-based re-write on your recommendation and just got something working. Now I will digest these new methods which look even better.

I am also trying to re-write a clunky unique values function that is also based on the submatrix function and is failing with this large data set. Do you have any recommendations for that?
Thanks again!

Andy_C · ‎Feb 25, 2023

Here are my new versions of filtering. I'd compare the compute time with yours, but I am not sure how your 'timer' function works.

Werner_E · ‎Feb 25, 2023

According your filter functions - do they, especially the function combine, also consider multiple occurrences?

I would guess that they are a but slower than mine on data matrices with more columns as lookup is called for every coloumn and not just once as the match command in my versions. Also your combine() may take a bit longer compared to the built in trim() function which I use to clean up.

But just give it a try. Give your function a unique name like FilterCol5a and call the the same way using the smaller dataset output2 as I did with the other functions in my sheet.

For the timer() to be useful you have to turn off Multithreading in the calculating options.

Concerning removing duplicates in a vector, I guess that functions which do so were posted here in the forum quite often. Maybe the search function can bring some of them to light. I had written that kind of utility function (should be a standard built-in) a couple of times and I guess every time differently ;-).

As we here have dealt with match and trim, I made up a tricky version using these functions which uses the try...on error to stop the iteration (iteration stops when trim throws an error because it would have to delete all rows in the vector):

Andy_C · ‎Feb 25, 2023

Ok, I just tested them both. Yes, both functions are able to handle multiple occurrences. They are quite quick compared to my original solution ;-D, but are a bit slower than your 3a and 3b as you expected.

Thank you for providing the 'removedups' function... I now see there are others in the forum archives. I will try this one first.

Werner_E · ‎Feb 25, 2023

As I already had criticized here

https://community.ptc.com/t5/Mathcad/How-to-remove-duplicates-in-a-matrix-column/m-p/293717/highlight/true#M114533

using match, lookup, etc. is not very helpful if you have to deal with small numbers because of their usually too large tolerance. Often we don't realize that we deal with small numbers when we use units like micrometer, but Mathcad (and Prime, too) converts all values to the base units which then results in small numbers.

This of course is also true with the function I just posted:

removedups should not remove anything in any of the three cases, but ...

Werner_E · ‎Feb 25, 2023

This small modification should do the job.

Still using "match", but this time match() only has to distinguish between 0 and 1 which it does flawlessly 😉

A similar modification could/should be applied to the FilterCol3() functions.

Andy_C · ‎Feb 25, 2023

Excellent point! I did not know that. I am going to update that function right now.

Thanks again!

Andy_C · ‎Feb 26, 2023

Hi @Werner_E wondering if you have any recommendations for the below. I tried to take your filter recommendations and expand that into a function that can filter a data set based on multiple criteria in multiple columns (i.e., like one can in Excel using data filters). Additionally, I find the 'isolate' or 'remove' options useful depending on what I need to do w/ the data.

I came up with a solution, but again, it is too slow for working with large data sets! Couldn't get a result in the end...

MC Prime 8.0, Windows 10 Pro

Werner_E · ‎Feb 26, 2023

Give the attached version a try with a larger data set. I am not sure if it would perform OK as of the large vector "mask" (as many rows as the data set itself) which is used.

Andy_C · ‎Feb 26, 2023

That's amazing, it worked like a charm!! It finished in 0.14s and my version would not compute so, haha, much appreciated!

I should have mentioned this use case in my first example, but is it possible to modify this to also allow for two filter criteria to be in the same column? I tried (1), [1], and {1,1] column variations paired with two vals but the logic isn't there I guess. The use is to be able to isolate multiple data lines in the same column at once vs taking several filter steps I suppose.

Thanks again!!

Werner_E · ‎Feb 27, 2023

Thats easy, simply initialize "mask" with zeros and use an OR instead of the AND.

But of course the function is now also returning different results with your initial examples compared to the original function you had posted:

You could add an additional argument (like the "type" argument you added) which decides if the multiple conditions should be ANDed or ORed. I would chose a similar approach as I had sketched in the worksheet - choosing one of the two option as default which is effect no matter what the argument is and choosing the other option only if a specific argument value is passed.
And I also would suggest that you also save include text description (function documentation) along with this utility function as its very likely that, if you are like me, you forget about the correct usage of this multi-argument function 😉

Werner_E · ‎Feb 27, 2023

Attached is a suggestion for a function with the aforementioned additional argument to choose between OR and AND.
Defaults are "iso" and "AND" if you provide an invalid argument.

I also got rid of the different initialisation of "mask" with 1's or zeros by initializing with the first actual mask given by "col" and "val".

I would expect performance of the function to slow down a bit because of the calls to the "maskFunc" function.

Werner_E · ‎Feb 28, 2023

Here I have added FilterMlt3(). This function automatically selects between AND and OR.
If the column number of a condition is equal to that of the predecessor, OR is chosen, otherwise AND.

EDIT: Use this function with care as it might not always give you the results you expect!

Reason: If, lets say b and c are conditions concerning the same column and should be ORed, and a and d are conditons for different columns, the functions calculates
a AND b OR c AND d

but it should calculate

a AND (b OR c) AND d.

Andy_C · ‎Feb 28, 2023

Thanks for following up @Werner_E !! I was just starting to work on a solution for this yesterday. Let me look at your new file.

I am starting to think that it may be better to a) just use two separate filter functions (1- for filtering multiple conditions in mult. rows, 2- for filtering multiple conditions), or, to add a condition input that defines the AND/OR logic for each column/criteria input. I'll have to think on this for a bit.

Thanks again for all your help!! 😎

Andy_C · ‎Mar 02, 2023

@Werner_E do you have any suggestions for creating a 'Match2' function that operates similarly to your FilterMlt creation above? I am able to filter a specific line of data out of a large set, but I need its index (row) w/in the larger set. I tried altering your slick FilterMlt func. but have flopped so far, haha.

Idea would look like this as far as input. Match2 finds the index/row(s) of a line of data that matches various criteria across various columns.

Werner_E · ‎Mar 02, 2023

One quick hack could be to add a column with the rows numbers in front of the data matrix

That way the original FilterMlt1() function would of course also return that column

and you could simply use this first column as your result

Turned into a function it would look like this

But more clean solution sure would be to modify the FilterMlt1() function to just do whats required. Up to the "try" command we have created a matrix of 0*s and 1*s, with the 1's at the positions with matched data. So when typ is "iso", we have to return the indices of the 1's, otherwise the indices of the 0's.

If you just need the "iso" functionality, you could of course get rid of the typ argument and let "match()" look for 1's only.

Andy_C · ‎Mar 05, 2023

Thank you @Werner_E ! Yes, I was already working on a similar approach to add an index column as a match return result, but that approach was getting clunky with some of my other routines involved.

Your second method is more what I had in mind but I couldn't work it out yet. So, thank you again! With each new conversation, I am learning more about implementing this vectorization approach. And, ultimately, what a simple modification to your FilterMlt function! Now, let me give it a try...