Predictive models: Predictive model is one of the best technique to perform predictive analytics. This is the development of models that are trained on historical data and make predictions on new data. These models are built in order to analyse the current data records in combination with some historical data.
Bias and variance are the two components of imprecision in predictive models. Bias in predictive models is a measure of model rigidity and inflexibility, and means that your model is not capturing all the signal it could from the data. Bias is also known as under-fitting. Variance on the other hand is a measure of model inconsistency, high variance models tend to perform very well on some data points and really bad on others. This is also known as over-fitting and means that your model is too flexible for the amount of training data you have and ends up picking up noise in addition to the signal.
If your model is performing really well on the training set, but much poorer on the hold-out set, then it’s suffering from high variance. On the other hand if your model is performing poorly on both training and test data sets, it is suffering from high bias.
Techniques to improve:
Add more data: Having more data is always a good idea. It allows the “data to tell for itself,” instead of relying on assumptions and weak correlations. Presence of more data results in better and accurate models. The question is when we should ask for more data? We cannot quantify more data. It depends on the problem you are working on and the algorithm you are implementing, example when we work with time series data, we should look for at least one-year data, And whenever you are dealing with neural network algorithms, you are advised to get more data for training otherwise model won’t generalize.
Feature Engineering: Adding new feature decreases bias on the expense of variance of the model. New features can help algorithms to explain variance of the model in more effective way. When we do hypothesis generation, there should be enough time spent on features required for the model. Then we should create those features from existing data sets.
Feature Selection: This is one of the most important aspects of predictive modelling. It is always advisable to choose important features in the model and build the model again only with important and significant features. e. let’s say we have 100 variables. There will be variables which drive most of the variance of a model. If we just select the number of features only on p-value basis, then we may still have more than 50 variables. In that case, you should look for other measures like contribution of individual variable to the model. If 90% variance of the model is explained by only 15 variables then only choose those 15 variables in the final model.
Multiple Algorithms: Hitting at the right machine learning algorithm is the ideal approach to achieve higher accuracy. Some algorithms are better suited to a particular type of data sets than others. Hence, we should apply all relevant models and check the performance.
Algorithm Tuning: We know that machine learning algorithms are driven by parameters. These parameters majorly influence the outcome of learning process. The objective of parameter tuning is to find the optimum value for each parameter to improve the accuracy of the model. To tune these parameters, you must have a good understanding of these meaning and their individual impact on model. You can repeat this process with a number of well performing models. For example: In random forest, we have various parameters like max_features, number_trees, random_state, oob_score and others. Intuitive optimization of these parameter values will result in better and more accurate models.
Cross Validation: Cross Validation is one of the most important concepts in data modeling. It says, try to leave a sample on which you do not train the model and test the model on this sample before finalizing the model. This method helps us to achieve more generalized relationships.
Ensemble Methods: This is the most common approach found majorly in winning solutions of Data science competitions. This technique simply combines the result of multiple weak models and produce better results. This can be achieved through many ways.
Bagging: It uses several versions of the same model trained on slightly different samples of the training data to reduce variance without any noticeable effect on bias. Bagging could be computationally intensive esp. in terms of memory.
Boosting: is a slightly more complicated concept and relies on training several models successively each trying to learn from the errors of the models preceding it. Boosting decreases bias and hardly affects variance.