
How to speed up Mathcad Prime for large data sets?

Andy_C
11-Garnet


Does anyone know how to speed up MCP for processing large data sets?

 

I am working with a 50832x14 data set that is output to Excel from another program. I would prefer to filter, sort, and manipulate the data in MCP rather than Excel. However, trying to execute a simple filter step in MCP on a data set this large has proven impossible. I can't get the function to complete (endless spinning wheel), whereas I can complete this one step in Excel in about 1 second.

 

Does anyone know how to speed this up? Or any other tricks for processing large data sets in MCP? I would really, really prefer to avoid Excel but at this point, I'll have to use it.

Thanks!

Andy_C_0-1677346454287.pngAndy_C_1-1677346461019.pngAndy_C_2-1677346507400.pngAndy_C_3-1677346511013.png

 

 

 

 

 

ACCEPTED SOLUTION
Werner_E
25-Diamond I
(To:Andy_C)

Give the attached version a try with a larger data set. I am not sure if it will perform OK because of the large vector "mask" (with as many rows as the data set itself) which is used.


19 REPLIES
Werner_E
25-Diamond I
(To:Andy_C)

A matrix of that size should normally be no problem in Mathcad.

But Prime has often been reported here as being very slow with larger data sets.

I guess that the "submatrix" command slows down the calculation; you may try to replace it by using the row selector.

I guess you could also replace the whole filter function by using the built-in "lookup()" function.

 

BTW, what is the reason behind the last step in your function - getting rid of the first row if its first entry is 0? Why the if-statement? The first row will always consist of zeros because that's how you initialize it.

Your function does not cover the case where no row is found containing the given value in the appropriate column. In that case the attempt to get rid of the first row with the last submatrix command will throw an error because there is no row number 2.

 

EDIT: I gave it a try, and using the row operator instead of "stack" and "submatrix" speeds up the calculation so that it takes just half the time. But using match() and trim() (not lookup() as I had written above) really cuts calculation time down significantly, to less than a hundredth.

 

The usage of your variable "flag" is quite complicated, and actually you don't need this variable at all.
Your filter function returns all data rows which do NOT contain the given value in the specified column. Is this intended?
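For readers who think more easily in a general-purpose language, the behaviour discussed here (keep every row that does NOT contain the given value in the specified column) can be sketched in Python. This is an illustrative analogue, not Mathcad Prime code, and the names filter_col, data, col, and val are hypothetical:

```python
def filter_col(data, col, val):
    """Return the rows of `data` whose entry in column `col` is NOT `val`.

    Mirrors the behaviour of the original FilterCol function discussed in
    this thread: matching rows are dropped, non-matching rows are kept.
    """
    return [row for row in data if row[col] != val]

# A row is kept unless its column-1 entry equals "A":
data = [[1, "A"], [2, "B"], [3, "A"]]
kept = filter_col(data, 1, "A")   # only [2, "B"] survives
```

Note that if every row matches, this version simply returns an empty list instead of throwing an error, which is the edge case the original function did not cover.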

Werner_E
25-Diamond I
(To:Werner_E)

OK, here is the version using the row selector instead of stack and submatrix

Werner_E_7-1677361755706.png

 

And here is the one using the match and trim functions

Werner_E_1-1677360931623.png

I checked that both would return the same result as your original function (all rows NOT containing "val" in column "col").

I used a smaller data matrix with just 1001 entries.

Werner_E_2-1677361070271.png

Now I used a timer function to compare calculation speed of the three functions. The values seen show the time needed (on my rather slow oldtimer) for 100 calls to the functions.

Werner_E_8-1677361805912.png

 

Your full original data set is now processed by FilterCol3a in about 0.2 seconds.

And because I suspected that you actually wanted the function to return just the rows which do contain the given string, I added a function to do this as well

Werner_E_4-1677361383683.png

and checked that the outputs of both would combine to the full data set

Werner_E_6-1677361502204.png

 

Prime 6 worksheet attached

 

 

EDIT: I was curious and tried timing the full data set

Werner_E_0-1677363110572.png

As you can see, this time FilterCol2 took not 60 times longer than FilterCol3a, but nearly 3000 times longer.

I did not wait for the original FilterCol to finish for obvious reasons 😉
But I guess that even FilterCol2 would be much faster if you programmed it to return just the matching rows and not all the rows which don't match.

 

EDIT2: I rewrote and streamlined your initial FilterCol() function to return only the rows which contain the given value, so it does the same as my FilterCol3b. Now there are far fewer stack() and submatrix() functions to execute, and the calc time is significantly more reasonable (but still 300 times that of FilterCol3b).

Werner_E_1-1677363647712.png
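A plausible explanation for the timing gap (hedged, since the worksheet internals are only shown as screenshots): building the result row by row with stack() copies the growing intermediate matrix on every call, roughly O(n²) work overall, while a single match()-style sweep is O(n). A hypothetical Python sketch of the two strategies, both returning the matching rows:

```python
def filter_quadratic(data, col, val):
    """Mimics repeated stack(): the result is rebuilt for every appended row,
    so each append copies everything accumulated so far -> O(n^2) total."""
    result = []
    for row in data:
        if row[col] == val:
            result = result + [row]   # full copy of `result` on every hit
    return result

def filter_linear(data, col, val):
    """Mimics a single match()/trim()-style pass: one sweep over the data."""
    return [row for row in data if row[col] == val]
```

Both produce identical output; only the cost of building it differs, which is why the gap between the variants widens dramatically as the data set grows.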

 

 

 

 

 

Andy_C
11-Garnet
(To:Werner_E)

Thank you so much @Werner_E !! This is amazing. Huge help. 

I was already working on a lookup-based re-write on your recommendation and just got something working. Now I will digest these new methods which look even better.

 

I am also trying to re-write a clunky unique-values function that is likewise based on the submatrix function and fails with this large data set. Do you have any recommendations for that?
Thanks again!

Andy_C_0-1677364686705.png

 

Andy_C
11-Garnet
(To:Werner_E)

Here are my new versions of filtering. I'd compare the compute time with yours, but I am not sure how your 'timer' function works.

Andy_C_1-1677365167397.pngAndy_C_2-1677365171851.png

 

 

Werner_E
25-Diamond I
(To:Andy_C)

Regarding your filter functions - do they, especially the function combine, also consider multiple occurrences?

I would guess that they are a bit slower than mine on data matrices with more columns, as lookup is called for every column and not just once like the match command in my versions. Also, your combine() may take a bit longer compared to the built-in trim() function which I use to clean up.

But just give it a try. Give your function a unique name like FilterCol5a and call it the same way, using the smaller dataset output2, as I did with the other functions in my sheet.

For the timer() to be useful you have to turn off Multithreading in the calculating options.

 

Concerning removing duplicates in a vector, I guess that functions which do so have been posted here in the forum quite often. Maybe the search function can bring some of them to light. I have written that kind of utility function (it should be a standard built-in) a couple of times, and I guess differently every time ;-).

As we have dealt with match and trim here, I made up a tricky version using these functions which uses try...on error to stop the iteration (the iteration stops when trim throws an error because it would have to delete all rows in the vector):

Werner_E_0-1677366434802.png
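As a point of comparison, here is a hypothetical Python analogue of a duplicate remover that keeps first occurrences. It uses a hash set rather than Werner's match()/trim() iteration, so it only illustrates the intended behaviour, not his method:

```python
def remove_dups(vec):
    """Return `vec` with duplicate entries removed, preserving the order
    of first occurrences (the behaviour a 'removedups' utility should have)."""
    seen = set()
    out = []
    for x in vec:
        if x not in seen:   # keep only the first occurrence of each value
            seen.add(x)
            out.append(x)
    return out

unique = remove_dups([3, 1, 3, 2, 1])   # first occurrences: 3, 1, 2
```

In Python the set membership test is an exact comparison, which sidesteps the tolerance issue that comes up later in this thread for Mathcad's match-based approach.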

 

Andy_C
11-Garnet
(To:Werner_E)

OK, I just tested them both. Yes, both functions are able to handle multiple occurrences. They are quite quick compared to my original solution ;-D, but a bit slower than your 3a and 3b, as you expected.

 

Thank you for providing the 'removedups' function... I now see there are others in the forum archives. I will try this one first.

Andy_C_0-1677367497561.png

Andy_C_1-1677367541088.png

Andy_C_2-1677367545699.png

 

 

 




Werner_E
25-Diamond I
(To:Andy_C)

As I already criticized here

https://community.ptc.com/t5/Mathcad/How-to-remove-duplicates-in-a-matrix-column/m-p/293717/highlight/true#M114533

using match, lookup, etc. is not very helpful if you have to deal with small numbers, because of their usually too-large tolerance. Often we don't realize that we are dealing with small numbers when we use units like micrometers, but Mathcad (and Prime, too) converts all values to the base units, which then results in small numbers.

This of course is also true with the function I just posted:

Werner_E_0-1677377969298.png

removedups should not remove anything in any of the three cases, but ...

Werner_E
25-Diamond I
(To:Werner_E)

This small modification should do the job.

Werner_E_1-1677378345240.png

Still using "match", but this time match() only has to distinguish between 0 and 1, which it does flawlessly 😉

A similar modification could/should be applied to the FilterCol3() functions.

Werner_E_2-1677378409159.png
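The tolerance pitfall can be reproduced outside Mathcad. In this hypothetical Python sketch, TOL stands in for the (assumed) absolute match tolerance: comparing tiny base-unit values against it merges distinct entries, while comparing values exactly (the spirit of the 0/1-mask fix, where match() only ever sees 0 and 1) does not:

```python
TOL = 1e-6   # an assumed absolute match tolerance, for illustration only

def naive_dedup(vec):
    """Dedup by absolute-tolerance comparison: micrometre-scale values that
    were converted to base SI units all fall within TOL of each other and
    get merged even though they are distinct."""
    out = []
    for x in vec:
        if not any(abs(x - y) <= TOL for y in out):
            out.append(x)
    return out

def exact_dedup(vec):
    """Dedup by exact comparison - the tolerance never comes into play,
    analogous to matching a 0/1 mask instead of the small raw values."""
    out = []
    for x in vec:
        if x not in out:
            out.append(x)
    return out

small = [1e-7, 2e-7, 3e-7]   # three distinct values at micrometre scale
```

Here naive_dedup(small) wrongly collapses the list to a single entry, while exact_dedup(small) keeps all three.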

 

 

Andy_C
11-Garnet
(To:Werner_E)

Excellent point! I did not know that. I am going to update that function right now.

Thanks again!

Andy_C
11-Garnet
(To:Andy_C)

Hi @Werner_E, wondering if you have any recommendations for the below. I tried to take your filter recommendations and expand them into a function that can filter a data set based on multiple criteria in multiple columns (i.e., like one can in Excel using data filters). Additionally, I find the 'isolate' or 'remove' options useful depending on what I need to do with the data.

 

I came up with a solution, but again, it is too slow for working with large data sets! Couldn't get a result in the end...

Andy_C_1-1677458238396.png

 

MC Prime 8.0, Windows 10 Pro

 

Werner_E
25-Diamond I
(To:Andy_C)

Give the attached version a try with a larger data set. I am not sure if it will perform OK because of the large vector "mask" (with as many rows as the data set itself) which is used.
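The worksheet itself is attached rather than shown inline, but the masking idea as described (a 0/1 vector with as many entries as data rows, where each condition zeroes out the rows that fail it, i.e. AND semantics) might be sketched in Python like this; all names are hypothetical:

```python
def filter_multi(data, cols, vals, typ="iso"):
    """Keep rows matching ALL (col, val) condition pairs ('iso'),
    or the complement for any other `typ` value ('remove' semantics)."""
    mask = [1] * len(data)              # start all-ones, AND each condition in
    for col, val in zip(cols, vals):
        for i, row in enumerate(data):
            if row[col] != val:
                mask[i] = 0             # a failed condition knocks the row out
    keep = 1 if typ == "iso" else 0
    return [row for i, row in enumerate(data) if mask[i] == keep]

data = [[1, "a"], [1, "b"], [2, "a"]]
isolated = filter_multi(data, [0, 1], [1, "a"])          # col0==1 AND col1=="a"
removed  = filter_multi(data, [0, 1], [1, "a"], "rem")   # the complement
```

The mask really is as long as the data set, which is the memory concern Werner raises; the "iso" and "rem" outputs partition the rows between them.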

Andy_C
11-Garnet
(To:Werner_E)

That's amazing, it worked like a charm!! It finished in 0.14s and my version would not compute, so, haha, much appreciated!

 

I should have mentioned this use case in my first example, but is it possible to modify this to also allow two filter criteria to be in the same column? I tried (1), [1], and {1,1] column variations paired with two vals, but the logic isn't there I guess. The use case is to be able to isolate multiple data lines in the same column at once vs. taking several filter steps.


Thanks again!!

Andy_C_0-1677472044881.png

Andy_C_2-1677472144844.png

 

 

 

Werner_E
25-Diamond I
(To:Andy_C)

That's easy: simply initialize "mask" with zeros and use an OR instead of the AND.

Werner_E_0-1677492703793.png

Werner_E_1-1677492809602.png

 

But of course the function now also returns different results with your initial examples compared to the original function you had posted:

Werner_E_2-1677493018687.png

 

You could add an additional argument (like the "type" argument you added) which decides whether the multiple conditions should be ANDed or ORed. I would choose a similar approach to the one I sketched in the worksheet - choosing one of the two options as the default, which is in effect no matter what the argument is, and choosing the other option only if a specific argument value is passed.
I would also suggest that you save a text description (function documentation) along with this utility function, as it's very likely that, if you are like me, you will forget the correct usage of this multi-argument function 😉

 

Werner_E
25-Diamond I
(To:Werner_E)

Attached is a suggestion for a function with the aforementioned additional argument to choose between OR and AND.
The defaults are "iso" and "AND" if you provide an invalid argument.

I also got rid of the different initialisation of "mask" with 1's or zeros by initializing it with the first actual mask given by "col" and "val".

I would expect the performance of the function to slow down a bit because of the calls to the "maskFunc" function.
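The attached function is not reproduced in the thread, but its described behaviour (defaults "iso" and "AND", mask initialized from the first condition, OR selectable by argument) could look roughly like this hypothetical Python analogue:

```python
def filter_multi(data, cols, vals, typ="iso", logic="AND"):
    """Multi-criteria row filter; any unrecognized `typ`/`logic` value
    falls back to the defaults 'iso' (for `logic`, to 'AND')."""
    # Initialize the mask from the first (col, val) condition, avoiding the
    # separate all-ones / all-zeros initializations for AND vs OR.
    mask = [1 if row[cols[0]] == vals[0] else 0 for row in data]
    for col, val in zip(cols[1:], vals[1:]):
        cond = [1 if row[col] == val else 0 for row in data]
        if logic == "OR":
            mask = [a | b for a, b in zip(mask, cond)]
        else:                            # default: AND
            mask = [a & b for a, b in zip(mask, cond)]
    keep = 1 if typ == "iso" else 0
    return [row for i, row in enumerate(data) if mask[i] == keep]
```

With logic="OR" a row survives if it satisfies any of the conditions; with the default it must satisfy all of them.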

Werner_E
25-Diamond I
(To:Werner_E)

Here I have added FilterMlt3(). This function automatically selects between AND and OR.
If the column number of a condition is equal to that of its predecessor, OR is chosen; otherwise AND.

 

EDIT: Use this function with care as it might not always give you the results you expect!

Reason: If, let's say, b and c are conditions concerning the same column and should be ORed, and a and d are conditions on different columns, the function calculates
a AND b OR c AND d

but it should calculate

a AND (b OR c) AND d.
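The precedence problem described in the EDIT can be avoided by grouping the conditions per column first, ORing within each group and ANDing across groups, i.e. computing a AND (b OR c) AND d. A hypothetical Python sketch (not the attached FilterMlt3):

```python
def filter_grouped(data, cols, vals):
    """AND across columns, OR within a column.

    Conditions on the same column are collected into a set of accepted
    values, so they are ORed; distinct columns are then ANDed."""
    allowed = {}
    for col, val in zip(cols, vals):
        allowed.setdefault(col, set()).add(val)
    # A row survives if, for EVERY constrained column, its entry is
    # one of the accepted values for that column.
    return [row for row in data
            if all(row[col] in vs for col, vs in allowed.items())]
```

Because the grouping happens before any boolean combination, the order in which the conditions are listed no longer matters.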

 

Andy_C
11-Garnet
(To:Werner_E)

Thanks for following up @Werner_E !! I was just starting to work on a solution for this yesterday. Let me look at your new file.

 

I am starting to think that it may be better to a) just use two separate filter functions (one for filtering multiple conditions across multiple columns, one for filtering multiple conditions within a single column), or b) add a condition input that defines the AND/OR logic for each column/criteria input. I'll have to think on this for a bit.

 

Thanks again for all your help!! 😎

Andy_C
11-Garnet
(To:Andy_C)

@Werner_E do you have any suggestions for creating a 'Match2' function that operates similarly to your FilterMlt creation above? I am able to filter a specific line of data out of a large set, but I need its index (row) within the larger set. I tried altering your slick FilterMlt func. but have flopped so far, haha.

 

The idea would look like this as far as input goes. Match2 finds the index/row(s) of a line of data that matches various criteria across various columns.

Andy_C_0-1677776455099.png

 

Werner_E
25-Diamond I
(To:Andy_C)

One quick hack could be to add a column with the row numbers in front of the data matrix

Werner_E_0-1677787204808.png

That way the original FilterMlt1() function would of course also return that column

Werner_E_5-1677788045372.png

 

and you could simply use this first column as your result

Werner_E_2-1677787320427.png

Turned into a function it would look like this

Werner_E_3-1677787372287.png

 

But a cleaner solution would surely be to modify the FilterMlt1() function to do just what's required. Up to the "try" command we have created a matrix of 0's and 1's, with the 1's at the positions of the matched data. So when typ is "iso", we have to return the indices of the 1's, otherwise the indices of the 0's.

Werner_E_4-1677787680637.png

If you just need the "iso" functionality, you could of course get rid of the typ argument and let "match()" look for 1's only.
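The index-returning variant described above (return the row positions of the 1's in the mask for "iso", of the 0's otherwise) might look like this in a hypothetical Python analogue:

```python
def match_multi(data, cols, vals, typ="iso"):
    """Return the row INDICES (not the rows) matching all (col, val)
    conditions; any other `typ` returns the indices of the non-matches."""
    # Build the 0/1-style mask: True where every condition holds.
    mask = [all(row[col] == val for col, val in zip(cols, vals))
            for row in data]
    want = (typ == "iso")
    return [i for i, m in enumerate(mask) if m == want]

data = [[1, "a"], [2, "a"], [1, "b"]]
rows = match_multi(data, [0], [1])   # indices where column 0 equals 1
```

This gives the positions directly, with no need for the quick hack of prepending an index column to the data matrix.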

 

Andy_C
11-Garnet
(To:Werner_E)

Thank you @Werner_E! Yes, I was already working on a similar approach of adding an index column as a match return result, but that approach was getting clunky with some of my other routines involved.

 

Your second method is more what I had in mind, but I couldn't work it out yet. So, thank you again! With each new conversation, I am learning more about implementing this vectorization approach. And ultimately, what a simple modification to your FilterMlt function! Now let me give it a try...

 
