Impact of Training Sample Size on the Effects of Regularization in a Convolutional Neural Network-based Dental X-ray Artifact Prediction Model

Introduction: Advances in computers have allowed for the practical application of increasingly advanced machine learning models to aid healthcare providers with the diagnosis and inspection of medical images. Often, a lack of training data and computation time can be a limiting factor in the development of an accurate machine learning model in the domain of medical imaging. As a possible solution, this study investigated whether L2 regularization moderates the overfitting that occurs as a result of small training sample sizes. Methods: This study employed transfer learning experiments on a dental x-ray binary classification model to explore L2 regularization with respect to training sample size in five common convolutional neural network architectures. Model testing performance was investigated, and technical implementation details, including computation times, hardware considerations, performance factors, and practical feasibility, were described. Results: The experimental results showed a trend that smaller training sample sizes benefitted more from regularization than larger training sample sizes. Further, the results showed that applying L2 regularization did not add significant computational overhead and that the extra rounds of training required by L2 regularization were feasible when training sample sizes are relatively small. Conclusion: Overall, this study found that there is a window of opportunity in which the benefits of employing regularization can be most cost-effective relative to training sample size. It is recommended that training sample size be carefully considered when forming expectations of the achievable generalizability improvements that result from investing computational resources into model regularization.

Keywords: convolutional neural networks, machine learning, medical imaging, dental imaging, L2 regularization, computing, overfitting, computer vision


INTRODUCTION
Advances in computers have allowed for the practical application of increasingly advanced machine learning models to aid healthcare providers with the diagnosis and inspection of medical images. Despite the increasing possible complexity of machine learning models, the quality and size of training data are dominant factors affecting a machine learning model's performance in practice 1 . Of particular interest in medical diagnostic imaging, supervised learning methods require labeled data to generate efficient models. Unfortunately, readily available human-annotated data is difficult to come by, with its creation being expensive in terms of money and time 2,3 . Data labeling requires a group of domain-expert reviewers to manually annotate available data, which is often only a fraction of the available unlabeled data. Thus, training data availability can be a bottleneck in the development of a machine learning model in the domain of medical imaging.
Noting that convolutional neural networks (CNNs) analyzing images extract an immense number of features (proportionate to image resolution) from training data, it is intuitive that the pairing of small training sets and a large number of features faced in the domain of medical imaging makes overfitting an expected challenge in model development 4 . Previous literature suggests that regularization, a technique that decreases model complexity by imposing model weight penalties on the loss function, may effectively help alleviate some of the overfitting that occurs in a model trained on a small training sample size 5-7 . Acknowledging the existence of multiple regularization methods, this experiment focused only on L2 regularization, which is also known as ridge regression and weight decay 6 . L2 regularization adds a penalty on the squared sum of the model's weights to the loss function in order to prevent the weights from becoming too large 8 .

Equation 1: L2 regularization of a loss function

L_regularized(y, ŷ) = L(y, ŷ) + λ · Σᵢ wᵢ²  (i = 1, …, n)

Mathematically, the L2 regularization penalty is represented by Equation-1, where n is the number of weights in the model, wᵢ is the i-th weight, λ is the regularization strength, ŷ is the predicted output of the input, and y is the actual label of the input.
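As a minimal illustration (not the study's implementation; in PyTorch the same effect is typically obtained through an optimizer's weight_decay parameter), the penalty in Equation-1 can be sketched in plain Python:

```python
# Minimal sketch of Equation-1's penalty term; `weights` is a flat list of
# model weights and `lam` is the regularization strength (both illustrative).
def l2_penalty(weights, lam):
    return lam * sum(w * w for w in weights)

def regularized_loss(base_loss, weights, lam):
    # Total loss = original loss + L2 penalty on the weights.
    return base_loss + l2_penalty(weights, lam)
```

Setting lam (λ) to zero recovers the unregularized loss, which is how the baseline trials in this study were run.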
The purpose of this study was to explore how training sample size impacts the effects of L2 regularization in a convolutional neural network-based dental x-ray artifact prediction model that employs transfer learning. While the benefits of using regularization in convolutional neural networks have already been established in the literature, this study aims to take an academic approach in providing guidance on the practical benefits and costs of its use. It is hypothesized that L2 regularization could be a significant and feasible method to improve the accuracy generalizability of a convolutional neural network model. The objectives of this study were to:

METHODS
Experiments in this study were based on a dental x-ray binary classification model that classifies whether a given dental x-ray scan has been taken on a plate with significant damage ("yes") or not ("no"). The dataset used consists of 2928 augmented phosphor-storage plate x-ray images, of which 1404 have no significant artifacts ("no") while the remaining 1524 do have significant artifacts ("yes"), provided by the Department of Medical Imaging at the University of Toronto. These 2928 x-ray images were originally generated using image augmentation methods such as segmentation and superimposition on 339 phosphor-storage plates of varying degrees of damage that were provided by the Faculty of Dentistry at the University of Toronto.

Experiment Environment
Experiments were run using the standard Python-3.5.6 interpreter on a machine running Ubuntu 16.04 with a stock-clocked Intel i7-8700 processor, 32 GB of 2133 MHz DDR4 RAM, and a stock-clocked Nvidia GeForce GTX 1080 Ti with 11 GB GDDR5X VRAM. Relevant hardware specifications have been tabulated as supplementary information in Table-S1.
The development tools used in this study reflect the hardware components used in the experimental environment in order to maximize computational performance where possible. Special consideration is given to CPU vectorization and CUDA acceleration. This experiment based all its machine learning procedures on the PyTorch library 9 . PyTorch is an open-source deep learning library for Python that provides a high-level application programming interface (API) for developing deep learning models. It provides several modules that handle essential operations such as tensor computations, data loading, neural network design, model training, and evaluation. Under the hood, PyTorch is developed in C++ with CUDA support in order to maximize the library's performance.
Pre-trained ImageNet Models for PyTorch (pip: torchvision, pretrainedmodels)
ImageNet is a project that aims to provide a large image database containing over 14 million images for research use. The neural network architectures used in this study are implemented in the torchvision-0.2.2 and pretrainedmodels-0.7.4 Python libraries, which provide classes that implement the layers of an architecture while downloading pre-trained weights from training on ImageNet.

Pillow-SIMD (pip: pillow-simd)
Pillow is a commonly used library that offers a set of standard image I/O and manipulation procedures, including per-pixel manipulations, enhancing, and filtering. Pillow was used in this experiment to decode TIF-formatted x-ray images and to apply appropriate resizing and normalization transformations to the x-ray images as they are loaded to be processed by the CNN. The experiment implementations of this study use the Pillow-SIMD fork of the Pillow library, which is compiled to include Intel's "single instruction, multiple data" (SIMD) instructions that provide data-level parallelism in a single thread by performing the same CPU instruction on multiple data points simultaneously in a single CPU cycle 10 . By using SIMD (specifically, AVX2) instructions instead of the regular Pillow implementation, faster image processing performance and better CPU utilization are achieved on the experiment machine's Intel i7-8700 CPU.

CNN Architectures
The five CNN architectures studied are listed as follows: Inception-V3 11 , Inception-v4 12 , Inception-ResNet-v2 12 , ResNet-18 and ResNet-50 13 . These five well-established CNN architectures were studied due to their relatively smaller sizes compared to other architectures (such as the VGG family) and their successes in the annual ImageNet Large Scale Visual Recognition Challenge (ILSVRC). Further, for each of the CNN architectures studied, transfer-learning was employed. Transfer-learning is the usage of pre-trained models; previous image-understanding knowledge from training on possibly unrelated images benefits learning to differentiate dental x-ray image artifacts. Literature suggests that transfer-learning is useful even when the previously trained model was trained in a different domain of images (i.e. non-medical) and that it allows for increased model accuracy with less available training data 14 . When using transfer-learning, training computation time is also significantly reduced in most cases, as earlier layers of the CNN are frozen and the weights of the final layers are primarily operated on during training 15 . This plays an important role in making model development using lower training sample sizes feasible in practice.
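To make the freezing step concrete, the following hedged sketch partitions a model's parameter names into frozen and trainable groups; the parameter names and the "fc." prefix for the final layer are illustrative assumptions (in PyTorch one would set requires_grad = False on each frozen parameter instead):

```python
# Illustrative transfer-learning split: everything except the final
# classification head ("fc.") stays frozen at its pre-trained values.
def split_for_transfer(param_names, head_prefix="fc."):
    frozen = [n for n in param_names if not n.startswith(head_prefix)]
    trainable = [n for n in param_names if n.startswith(head_prefix)]
    return frozen, trainable

# Toy parameter list standing in for a CNN's named parameters.
params = ["conv1.weight", "layer1.0.conv1.weight", "fc.weight", "fc.bias"]
frozen, trainable = split_for_transfer(params)
```

Only the small trainable group is then updated during training, which is where the computation-time savings described above come from.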

Experimental Methodology
In order to investigate relationships between training sample size and generalizability improvements from regularization in a CNN model, an experiment was designed to map the following variables for a given CNN architecture conceptually:
• Independent variable: training sample size (n).
• Dependent variables: differences between model testing performance metrics when λ = 0 and λ = λ_optimal (e.g. accuracy, specificity, sensitivity, computation time).
The algorithm designed to run this experiment is simplified to the following steps:
1. First, randomly set aside 10% of the x-ray images as a testing set for hold-out testing.
   a. This way, hold-out validation is done on the same images for all the different experimental trials.
2. Then, randomly create a training set (70%) and validation set (20%) that compose the remaining 90%, which can be sub-sampled. These base sets are kept constant and sub-sampled in a predictable, ordered manner for experimental trials with different training sample sizes.
   a. That is, if T_n and V_n are training and validation sets of size n, for n ∈ N:

Research Articles
Training Sample Size Impact on CNN Regularization

   d. and gather the experiment results (accuracy, sensitivity, specificity, computation time).
7. Once the triple-nested loops are done and all the data points are gathered, return the data to be outputted into a CSV spreadsheet for analysis.
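The ordered sub-sampling in step 2 can be sketched as follows, under the assumption that the base set is shuffled once and prefixes are taken so that T_n ⊆ T_m whenever n ≤ m (the function name and seed are illustrative, not the study's code):

```python
import random

def make_nested_subsets(base_set, sizes, seed=0):
    # Shuffle the base set once with a fixed seed, then take prefixes so
    # that smaller training sets are always contained in larger ones.
    order = list(base_set)
    random.Random(seed).shuffle(order)
    return {n: order[:n] for n in sizes}
```

This keeps trials at different training sample sizes comparable: a larger trial sees everything a smaller trial saw, plus new images.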
With the steps of the experimental methodology expressed as a pseudo-code procedure as supplementary data in Figure-S1, it can be seen that the data is iteratively generated for all combinations of the three independent variables (architecture 'arch', training set size 'n', and lambda 'λ'). There are three important design choices to notice: the set splitting schema, validation method, and the lambda search algorithm.
This experiment followed a commonly used 70:20:10 training to validation to testing set split ratio 16,17 as it would result in large enough available training set sizes to experiment on while providing usable amounts of testing and validation data.
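A minimal sketch of this 70:20:10 split (the function and seed are illustrative assumptions, not the study's exact implementation):

```python
import random

def split_dataset(items, seed=0):
    # Shuffle, then carve off 10% for hold-out testing and 20% for
    # validation; the remaining 70% forms the base training set.
    items = list(items)
    random.Random(seed).shuffle(items)
    n = len(items)
    n_test, n_val = n // 10, n // 5
    test = items[:n_test]
    val = items[n_test:n_test + n_val]
    train = items[n_test + n_val:]
    return train, val, test
```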
With relation to the set splitting, hold-out validation was used as opposed to more thorough validation methods such as k-fold cross-validation to alleviate computation time in the experiment. Literature suggests that cross-validation may be a stronger validation method, but there is a significant computation time trade-off 18 . Although the dataset used in this experiment is not particularly large, the experimental procedure involves a massive number of repeated training and testing iterations (number of architectures × number of training sets × number of lambda search queries); so, hold-out validation is employed, as its decreased computation time is essential in practice.
To implement regularization, the lambda search algorithm employed in this experiment is a two-pass grid search. In the initial pass, a set of constant lambda values is queried for an architecture and training set pairing. Afterward, using a very basic interpolation heuristic, the three quartile values between the top two performing lambda values are also trialed. Rather than employing methods such as hyper-parameter learning and random search, a two-pass grid search is used in the procedure for experimental simplicity.
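The two-pass grid search described above can be sketched as follows; here `evaluate` stands in for training and validating a model with a given lambda, and the candidate grid is an assumption for illustration:

```python
def two_pass_search(evaluate, first_pass):
    # Pass 1: score each candidate lambda (e.g. validation accuracy).
    scores = {lam: evaluate(lam) for lam in first_pass}
    # Take the two best-performing lambdas from the first pass.
    lo, hi = sorted(sorted(scores, key=scores.get, reverse=True)[:2])
    # Pass 2: trial the three quartile values between the top two lambdas.
    for q in (0.25, 0.5, 0.75):
        lam = lo + (hi - lo) * q
        scores[lam] = evaluate(lam)
    # Return the best lambda seen across both passes.
    return max(scores, key=scores.get)
```

Each call to `evaluate` is a full training run, which is why the lambda search dominates the computational cost of regularization discussed later.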
With the gathered experimental results for accuracy and computation times, regression analysis was conducted to establish any relationships between training sample size and the accuracy improvements from L2 regularization (with successful lambda search) and computation time amongst the five different CNN architectures being investigated.

RESULTS
The experiment was run for each of the CNN architectures (Inception and ResNet families), and data was collected for training sample sizes of {16, 32, 64, 128, 256, 512, 1024, 1536, and 2048} into a table. In the table containing all raw data, there are 675 rows, one for each experiment trial (which is a combination of the three independent variables: architecture 'arch', training set size 'n', and regularization lambda 'λ'). Testing accuracy, sensitivity, and specificity were the dependent variables recorded in the rows for each experiment trial.
The raw data was then partitioned by architecture and training sample size pairings, and aggregate data grouped by architecture and training sample size, comparing the performance of the model trained without regularization and with regularization (using the optimal lambda amongst those found), was tabulated in Table-1. In particular, a proportional difference was calculated between the accuracies, sensitivities, and specificities of the model trained with an optimal lambda and the model trained without regularization, using the formula in Equation-2, where a performance measure P[λ] represents accuracy, sensitivity, or specificity for a particular lambda value given a fixed architecture and training sample size:

Equation 2: Formula used to calculate proportional differences

ΔP = (P[λ_optimal] − P[0]) / P[0]
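Taking P[0] as the unregularized performance measure, Equation-2 reduces to the following hypothetical helper (an illustration, not the study's code):

```python
def proportional_difference(p_optimal, p_baseline):
    # (P[lambda_optimal] - P[0]) / P[0]: the relative improvement of the
    # regularized model over the unregularized baseline.
    return (p_optimal - p_baseline) / p_baseline
```

For example, a testing accuracy rising from 0.80 without regularization to 0.90 with the optimal lambda corresponds to a proportional improvement of 0.125 (12.5%).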
In addition to the proportional differences being tabulated, the lambda values in Table-1 were highlighted in orange in the cases where a beneficial lambda was not found for an architecture and training sample size pairing. Finally, there is a column, denoted "Heuristic Find?", containing a Boolean value indicating whether or not the second-pass heuristic in the lambda search algorithm resulted in the discovery of the optimal lambda in the experiment.
To better visualize the data and any trends that may exist, particularly for accuracy, the proportional accuracy improvements for the different architectures were plotted using scatter plots, one per architecture, in Figures 1-5. In each of the plots, power regression was used to fit a power formula capturing the trend between training set size and proportional testing accuracy improvement. Cases where a beneficial lambda value was not found (and thus there was no accuracy improvement) were considered outliers and excluded from any trends. Through experimentation, power regression was deemed the best fit in comparison to linear and logarithmic regression; polynomial regression was not considered in order to fit a monotonic trend function.
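As a hedged sketch of the fitting step (the study's exact fitting routine is not specified), a power law y = a·x^b can be fitted by ordinary least squares in log-log space:

```python
import math

def fit_power(xs, ys):
    # Fit y = a * x**b by linear least squares on (log x, log y);
    # requires strictly positive xs and ys.
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    b = sum((u - mx) * (v - my) for u, v in zip(lx, ly)) / \
        sum((u - mx) ** 2 for u in lx)
    a = math.exp(my - b * mx)
    return a, b
```

The positive-only requirement of the log transform is consistent with excluding trials that showed no accuracy improvement, as described above.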
The power regression trend line information from the plots for each architecture was tabulated in Table-

DISCUSSION
Overall, the data gathered in the experiment showed that, in the cases where an optimal lambda could be found, regularization generally benefited models trained on smaller training sets more than those trained on larger ones. Additionally, there was an observed distinction between the behavior of regularization in the different architectures, with the observed regression lines having varying R² fits. Finally, computation times for any given experimental training sample size and neural network architecture did not vary significantly with the employment of regularization and choice of lambda.

Generalizability Improvements with Respect to Training Sample Size
Concerning the trends observed in the experimental results, two proposed patterns describe the relationship between accuracy improvement from regularization and training sample size in the data for all the architectures studied:
1. Minimum Information Threshold: having too small a training sample size below a threshold results in erratic regularization behavior, and
2. Diminishing Returns: generally, larger training sample sizes exhibit less proportionate accuracy improvements from regularization if an optimal lambda value is found.
The first pattern was expected when designing the experiment in terms of training sample sizes. Regularization prevents the weights in a CNN model from growing too large; this, in theory, prevents a model from overfitting to the training data 5-7 . Although regularization may have prevented the model from overfitting to the training data, the extremely small amount of training data resulted in a lack of information. Since transfer learning was employed in these cases, the models were being tested on their previous knowledge from ImageNet with very little learning done on the domain of dental x-ray artifacts, resulting in sporadic testing performance. The information threshold pattern can be seen in all the architectures' experimental trials, but this behavior is most pronounced in the ResNet-18 experiment trials, as suggested by the poor regression fit between proportional testing accuracy improvement and training sample size and the greatest number of local minima in the relation between those two variables.
Beyond the sporadic behavior of small training sample sizes, the second pattern of larger training sample sizes exhibiting smaller proportionate improvements in accuracy was also an expected result of the experiment; every architecture studied in this experiment exhibited decreasing accuracy improvements relative to training sample size. This relationship between the effectiveness of regularization in terms of generalization and training sample size can be rationalized through foundational theory in machine learning. Disregarding other factors that affect a machine learning model's testing performance, it is theoretically accepted that training on a small data sample is expected to result in more overfitting (and poorer generalizability) in comparison to training on a larger data sample 19 . Noting that regularization helps to prevent overfitting (and improve generalizability), it can be concluded that a machine learning model that is more heavily over-fitted as a result of a small training sample size can be expected to exhibit a more pronounced improvement in its generalizability (which is measured as testing accuracy in this study) through the employment of regularization.
Taken together, these two observed patterns form a theoretical window of opportunity with respect to training sample size in which the real-world benefit from regularization can be maximized. When observing the aggregate performance comparison with respect to training sample size, the different architectures exhibited different regularization behaviors. Regardless, all the architectures show lower general accuracy improvement from regularization with increasing training sample sizes. Thus, a recommendation is given that special consideration should be given to the usage of regularization on a model-by-model basis, with realistic expectations (derived from the training sample size) regarding potential generalizability improvements that can be achieved from invested computation time.

Sensitivity and Specificity
Despite the models in the experiment achieving a better overall testing accuracy from regularization when a beneficial lambda value was found, accuracy by itself may not be a sufficient performance measure in certain real-world machine learning model applications. Sensitivity (the true positive rate) and specificity (the true negative rate) are also significant performance measures, especially in medicine. No particular pattern could be uncovered for testing sensitivity and specificity when applying L2 regularization on different training sample sizes. The lack of an underlying pattern for sensitivity and specificity can be justified, as no measures were taken to treat these metrics in this study (although sensitivity-driven regularization methods exist in the literature) 20 .

Computational Performance and Efficiency of Regularization
While a machine learning model's testing performance is an important consideration, the available computational resources and the expected computational cost of model development are also important considerations in practice. The experimental computation times gathered in this study showed very little variation with regularization, as seen in the standard deviations. As expected, the results show that training a model on less data requires less computation time.
Overall, each architecture had differing average computation times with respect to training sample size, but they had one important statistic in common: a negligible variability of training time with respect to lambda choice and regularization. From this, it can be empirically and practically concluded that the addition of the L2 regularization term to the loss function in the PyTorch implementation can be considered a computationally cheap operation. The most significant expense of employing L2 regularization is in tuning and searching for an optimal lambda hyperparameter value, as it requires repeated model training.
When pairing diminishing performance improvements from applying regularization on models trained on larger amounts of data and the increased computation time of training on larger datasets, it is realized that regularization may not be an efficient means of improving model performance when lack of training data is not a significantly limited resource. More importantly, this highlights the trade-off between two important resources: computation and data. When training data is limited, computational resources could be used in an attempt to accommodate a model's shortcomings through regularization and hyperparameter tuning. Contrarily, if training data is in excess, it may be inefficient to employ regularization and hyperparameter tuning to the same extent.

Limitations
Although this study managed to show a trend between training sample sizes and the effects of L2 regularization on different CNN architectures, some limitations introduce uncertainty in the data. Primarily, they involve the (lack of an) optimal lambda search, dataset sampling, and the granularity of gathered data points. Despite the discovery of a few well-performing lambdas, there are a few trials in the experiment where a beneficial, let alone optimal, lambda value was not found. More thorough lambda search algorithms (such as hyperparameter training and randomized search) were not explored in the scope of this study. In addition to the innumerable possible lambda values, there is also an innumerable number of dataset-split combinations; if the dataset has high variance, there could be inconsistent and non-robust experimental results. Beyond lambda and dataset sampling options, the number of actual data points gathered in an experiment is also limited due to practical constraints on computation time (it would rarely be feasible to compute data for ~2000 training sample sizes from {16…2048}). All described limitations exist mainly because the running time costs of their solutions were greater than the available computational time and resources for this experiment.

Future Work
While this study managed to achieve its objectives, the primary being that it showed an association between generalizability improvements from L2 regularization and training sample size for different convolutional neural network architectures, future work should consider different experimental conditions (such as using more standardized datasets, cross-validation, hyper-parameter tuning, repeated trials, etc.) to validate the observed trend between testing accuracy improvement and training sample size. Additionally, it should consider exploring other regularization methods (such as L1 lasso regularization) to show whether they follow the same trend that is theorized in this study. Finally, it is suggested that a distinction between sensitivity and specificity be made when exploring regularization methods such as sensitivity-based regularization.

CONCLUSION
When limited training data is available, our study demonstrated that regularization was a feasible measure to employ to enhance a convolutional neural network's generalizability, as measured by testing accuracy. Regularization was not the most efficient use of computational resources when training data was not a significant limiting factor for a model's testing performance. Experiments revealed a minimum information threshold paired with diminishing returns, together forming a window of opportunity in which employing regularization can be beneficial with respect to training sample size. Thus, it is recommended that training sample size be carefully considered when forming expectations of the achievable generalizability improvements that result from investing computational resources into model regularization.