Based on our evaluation, we decided to use k-Nearest Neighbour (kNN), a widely used machine learning algorithm, to predict the chromoprotein volumes from RGB values.
*Table 1: Comparison of RMSE values between Linear Regression and k-Nearest Neighbour (Z-score, k=3) across Cyan, Magenta, and Yellow chromoprotein volumes.*
To ensure that kNN works optimally with our dataset, several parameters needed to be set. One of the most critical parameters is k, which specifies how many neighbours the algorithm considers when making a prediction (Zhang, 2016). If k is set too low, the model risks overfitting, becoming overly sensitive to small variations in the data. If k is too large, on the other hand, the model may oversimplify and lose accuracy. The challenge, therefore, is to find a k value that achieves high accuracy without overfitting the data (Zhang, 2016). Additionally, scaling the RGB input values was essential to improve the model's accuracy. Since kNN computes distances directly on the raw feature values, it is sensitive to the scale of its inputs, so we scaled the RGB values to ensure that no single colour channel, for example Red, dominates the predictions. We explored two scaling methods to determine the most suitable one. Z-score normalisation scales the data based on its mean and standard deviation, making it effective when the dataset contains outliers (Andrade, 2021). Min-max normalisation, on the other hand, scales all data points between 0 and 1, which is useful when the dataset has a limited range of values (Lambert et al., 2024).
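As a minimal sketch of the two scaling options, the snippet below applies scikit-learn's `StandardScaler` (Z-score) and `MinMaxScaler` to a small RGB array; the values are illustrative placeholders, not our measured data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Illustrative RGB readings (one sample per row); not our measured data.
rgb = np.array([
    [120.0, 200.0, 35.0],
    [ 90.0, 180.0, 60.0],
    [140.0, 210.0, 20.0],
])

# Z-score normalisation: each channel is centred on its mean and
# divided by its standard deviation.
z_scaled = StandardScaler().fit_transform(rgb)

# Min-max normalisation: each channel is rescaled to the [0, 1] interval.
mm_scaled = MinMaxScaler().fit_transform(rgb)

print(z_scaled)
print(mm_scaled)
```

After either transformation, all three channels contribute on a comparable scale to the kNN distance calculation.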
To determine the best combination of scaling method and k value, we again used Root Mean Squared Error (RMSE) as our performance indicator (Ajala et al., 2022). For this measure to be reliable, RMSE must be calculated on unseen data that the model has not been trained on (Ajala et al., 2022). Therefore, we split the dataset into training and testing sets using an 80/20 ratio, where 80% of the data trains the model and 20% is reserved for testing. This approach helps us check the balance between overfitting, where the model is tuned too closely to the training data, and underfitting, where it fails to capture important patterns (Rácz et al., 2021). By computing RMSE on the test set, we can assess how well the model generalises to unseen data; the lower the RMSE value, the more accurate the model (Forrest et al., 2021). For our dataset, we compared the RMSE values for the different scaling methods and k values and, though the values are quite similar, concluded that Z-score normalisation with k=5 gives us the most accurate prediction.
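The sketch below illustrates this evaluation loop under stated assumptions: it uses scikit-learn, and the synthetic `X` (RGB readings) and `y` (chromoprotein volume for one colour) stand in for our real dataset. It performs the 80/20 split, fits a kNN regressor for each scaler and k value, and reports the test-set RMSE.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic stand-in data: X holds RGB readings, y the corresponding
# chromoprotein volume for one colour; replace with the real dataset.
rng = np.random.default_rng(0)
X = rng.integers(0, 256, size=(100, 3)).astype(float)
y = rng.uniform(0.0, 50.0, size=100)

# 80/20 train/test split, as described above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Compare both scaling methods and several k values on the held-out set.
for scaler in (StandardScaler(), MinMaxScaler()):
    for k in (3, 5, 7):
        model = make_pipeline(scaler, KNeighborsRegressor(n_neighbors=k))
        model.fit(X_train, y_train)
        rmse = mean_squared_error(y_test, model.predict(X_test)) ** 0.5
        print(f"{type(scaler).__name__}, k={k}: RMSE = {rmse:.3f}")
```

Wrapping the scaler and the regressor in a pipeline ensures the scaler is fitted only on the training data, so no information from the test set leaks into the model.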
*Table 4: Overview of RMSE values for the different scaling methods (Z-score, Min-Max) and k-values (3, 5 and 7) for the colours cyan, magenta and yellow.*