Introduction to Orange Tool Part-2
Train Test Split:
The train-test split is appropriate when the dataset is sufficiently large. What counts as “sufficiently large” is specific to each predictive modeling problem: there must be enough data to split into train and test datasets, and each of the two must be a suitable representation of the problem domain. This in turn requires that the original dataset is also a suitable representation of the problem domain.
How to Configure the Train-Test Split
The procedure has one main configuration parameter: the size of the train and test sets. This is most commonly expressed as a fraction between 0 and 1 for either the train or the test dataset. For example, a training set size of 0.67 (67 percent) means that the remaining 0.33 (33 percent) is assigned to the test set.
There is no optimal split percentage.
You must choose a split percentage that meets your project’s objectives with considerations that include:
- Computational cost in training the model.
- Computational cost in evaluating the model.
- Training set representativeness.
- Test set representativeness.
Nevertheless, common split percentages include:
- Train: 80%, Test: 20%
- Train: 67%, Test: 33%
- Train: 50%, Test: 50%
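To make the arithmetic behind these percentages concrete, here is a minimal sketch in plain Python. The helper name split_sizes is hypothetical, and rounding the training size down is an assumption made for illustration; a given tool may round differently.

```python
def split_sizes(n_samples, train_fraction):
    """Return (n_train, n_test) for a training fraction strictly between 0 and 1."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    n_train = int(n_samples * train_fraction)  # assumption: round down
    return n_train, n_samples - n_train


# The common splits listed above, applied to a 150-row dataset.
for fraction in (0.80, 0.67, 0.50):
    n_train, n_test = split_sizes(150, fraction)
    print(f"train fraction {fraction:.2f}: {n_train} train rows, {n_test} test rows")
```

Whatever the fraction, the two sizes always add up to the full dataset, so no row is left unassigned.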
For the train-test split, I used the workflow below.
As usual, I load the iris.tab dataset, which ships with the Orange tool, in the File widget.
After that, I pass the whole dataset to the Data Sampler widget, where we partition it into train and test data.
I split the data in an 85:15 ratio, i.e. 85% train data and 15% test data. At the bottom, you can see that 127 data points are used for training and 23 data points for testing, out of a total of 150 data points.
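Outside the Orange canvas, the same kind of partition can be sketched in a few lines of plain Python. This is only an illustration of the idea, not how the Data Sampler widget is implemented: the function name, the fixed seed, and the choice to round the training size down are all assumptions.

```python
import random


def train_test_split_indices(n_samples, train_fraction, seed=0):
    """Shuffle row indices and cut them into a train part and a test part."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)       # fixed seed keeps the split repeatable
    n_train = int(n_samples * train_fraction)  # assumption: round down
    return indices[:n_train], indices[n_train:]


# 150 rows (the size of the iris dataset) split 85:15.
train_idx, test_idx = train_test_split_indices(150, 0.85)
print(len(train_idx), len(test_idx))  # 127 23
```

Every row ends up in exactly one of the two parts, which is what makes the test rows unseen data for the model.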
After splitting the data, I connect the Data Sampler to the Test & Score widget with two links, one for the train data and one for the test data:
Data Sample -> Data (train data)
Remaining Data -> Test Data (test data)
For model creation, I used the Random Forest, SVM (Support Vector Machine), and kNN (k-Nearest Neighbors) widgets. Each of these widgets is a machine learning algorithm. Connect all of them to the Test & Score widget.
The Test & Score widget needs two things:
(1) Data (train data and, optionally, test data)
(2) A machine learning algorithm
With a train-test split we always evaluate the model on the test data, so we have to specify that in the Test & Score widget.
As you can see, Test on test data is selected on the left side, i.e. the results shown on the right side come from testing on the test data.
We get very good results with all the algorithms (approx. 98% CA, i.e. classification accuracy).
What is the effect of splitting the data on the classification result / classification model?
As you can see, accuracy is a little higher with splitting, but that is not always the case. Here we have a very clean and small dataset (150 data points). With larger or noisier datasets, if you do not split your data, the model is scored on the same points it was trained on, so overfitting can go unnoticed. It is therefore always good to split the data into train and test sets, so that we learn how our model performs on unseen data (the test data).
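The reason unseen data matters can be shown with a deliberately extreme toy example. The Memorizer class below is hypothetical and is not one of the Orange learners: it simply remembers the training labels and guesses the majority class for anything new, so it is a caricature of an overfitted model.

```python
from collections import Counter


class Memorizer:
    """Toy classifier: recalls training labels, otherwise guesses the majority class."""

    def fit(self, inputs, labels):
        self.table = dict(zip(inputs, labels))
        self.default = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, inputs):
        return [self.table.get(x, self.default) for x in inputs]


def accuracy(actual, predicted):
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


def parity(n):
    return "even" if n % 2 == 0 else "odd"


# Made-up task for illustration: label integers as even or odd.
train_x, test_x = list(range(0, 6)), list(range(6, 10))
train_y = [parity(n) for n in train_x]
test_y = [parity(n) for n in test_x]

model = Memorizer().fit(train_x, train_y)
train_accuracy = accuracy(train_y, model.predict(train_x))
test_accuracy = accuracy(test_y, model.predict(test_x))
print(train_accuracy, test_accuracy)  # 1.0 0.5
```

Scored on its own training rows the model looks perfect, while the held-out rows show it has learned nothing that generalizes.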
Cross Validation:
Cross-validation is a resampling procedure used to evaluate machine learning models on a limited data sample.
We can do cross-validation using the Test & Score widget. Note that cross-validation does the splitting itself, so it is given the whole dataset rather than a separate train or test portion.
For cross-validation, I used the following workflow:
You are already familiar with this simple workflow, so we will focus directly on the Test & Score widget.
As you can see, I used Number of folds = 10, i.e. the data is divided into 10 parts, each part is used once as test data for models trained on the other 9 parts, and the 10 results are averaged. Cross-validation is a very powerful technique for evaluating a model. We can also use it to check whether our model is overfitted.
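The fold mechanics can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions: the function name and seed are made up, and the folds are not stratified by class, which a real tool may well do.

```python
import random


def k_fold_indices(n_samples, n_folds, seed=0):
    """Yield one (train_indices, test_indices) pair per fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every n_folds-th shuffled index, starting at position i.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, test_fold in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test_fold


# 150 rows and 10 folds, as in the workflow described above.
splits = list(k_fold_indices(150, 10))
for fold_number, (train, test) in enumerate(splits, start=1):
    print(f"fold {fold_number}: {len(train)} train rows, {len(test)} test rows")
```

Each row is used for testing exactly once and for training in the other nine folds; the reported score is the average over the ten evaluations.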
After that, the Test & Score widget is connected to the Confusion Matrix widget, where we can see the results. From the confusion matrix, we can select data and view it in the Data Table widget.
As you can see, I first select the misclassified data in the Confusion Matrix and then view it in the Data Table widget. This is how we can explore our results using the Confusion Matrix and Data Table widgets.
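For readers who want to see what the confusion matrix counts, here is a small plain-Python sketch. The labels are hypothetical values chosen for illustration, not results from the workflow above, and the function names are made up.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Count how often each (actual, predicted) label pair occurs."""
    return Counter(zip(actual, predicted))


def misclassified_indices(actual, predicted):
    """Return the row positions where the prediction differs from the actual label."""
    return [i for i, (a, p) in enumerate(zip(actual, predicted)) if a != p]


# Hypothetical labels, for illustration only.
actual = ["setosa", "versicolor", "virginica", "versicolor", "virginica"]
predicted = ["setosa", "versicolor", "versicolor", "versicolor", "virginica"]

matrix = confusion_matrix(actual, predicted)
wrong_rows = misclassified_indices(actual, predicted)
print(matrix[("virginica", "versicolor")])  # 1
print(wrong_rows)  # [2]
```

Selecting the misclassified cells in the widget corresponds to picking out the rows in wrong_rows here.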
What is the effect of cross-validation on model output/accuracy?
As you can see, our accuracy decreases when using cross-validation, but it is still a very good performance. Without cross-validation we test our model once; with cross-validation we test it K times (K = number of folds), each time on a different part of the dataset. That is why, after getting good accuracy on test data, you should always confirm that accuracy by performing cross-validation.
With cross-validation you can also compare models using metrics such as accuracy, precision, recall, F1-score, AUC, etc.
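As a rough guide to what some of these metrics measure, the sketch below computes accuracy, precision, recall, and F1-score by hand for a single class. The labels are hypothetical, the function names are made up, and how a tool averages per-class values into one number is not covered here.

```python
def accuracy(actual, predicted):
    """Share of rows where the prediction matches the actual label."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


def precision_recall_f1(actual, predicted, positive):
    """Precision, recall and F1-score, treating one label as the positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical labels, for illustration only.
actual = ["setosa", "versicolor", "virginica", "versicolor", "virginica"]
predicted = ["setosa", "versicolor", "versicolor", "versicolor", "virginica"]

ca = accuracy(actual, predicted)
precision, recall, f1 = precision_recall_f1(actual, predicted, "versicolor")
print(f"CA={ca:.2f} precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```

Here every actual versicolor row is found (recall 1.0), but one of the three versicolor predictions is wrong, which lowers precision.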
Read this article for more details on Cross-Validation here.
Conclusion:
I hope you can now work with the Orange tool by yourself. I tried to cover as many things as I could. Now you can explore more on your own.
Do check out more features of the Orange tool here.