The Pacific Northwest National Laboratory (PNNL) is conducting confirmatory research for the U.S. Nuclear Regulatory Commission to validate computational models for ultrasonic testing (UT). The purpose of this research is to determine whether UT computational models adequately represent reality within their intended domain of application. This work assesses the reliability of such models by comparing them directly to empirical measurements. We propose a framework for assessing a model using both qualitative and quantitative validation indicators and metrics. The framework acknowledges the presence of errors and uncertainties in all models and experiments, and its quantitative validation stage uses a statistical approach that incorporates those uncertainties to infer the adequacy of the model. The framework is applied here to assess the adequacy of CIVA-UT Version 11 for modeling conventional ultrasonic nondestructive examination inspections. The validation study considers simple geometrical reflectors in isotropic, fine-grained, homogeneous materials inspected with conventional ultrasonic transducers. In particular, wrought stainless steel plates with electro-discharge machined notch reflectors varying in length, depth, and orientation (tilt from the normal) are evaluated. Several types of conventional ultrasonic transducers are investigated, spanning different wave modes, beam angles, and frequencies. In the qualitative analysis, model predictions were compared with empirical data in terms of the similarity of the relevant visual features of C-scan, B-scan, and A-scan images. Overall, the most prominent differences could be attributed to uncertainties in the validation process itself, including uncertainties in model parameters and measurement noise. The quantitative analysis used a metric that computes the effective amplitude of the flaw response under uncertainties in model parameters and measurement noise.
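The effective-amplitude comparison described above can be sketched in a simple Monte Carlo fashion. The code below is a minimal illustration, not the report's actual metric: the function names (`effective_amplitude`, `db`) and all numerical values (nominal amplitudes, uncertainty levels, noise standard deviations) are hypothetical, chosen only to show how model-parameter uncertainty and measurement noise can be folded into an amplitude estimate before comparing simulation and experiment in decibels.

```python
import math
import random

def db(amplitude, reference):
    """Amplitude ratio expressed in decibels."""
    return 20.0 * math.log10(amplitude / reference)

def effective_amplitude(nominal, rel_uncertainty, noise_std,
                        n_trials=10000, seed=0):
    """Monte Carlo estimate of an effective flaw-response amplitude:
    perturb the nominal response by a relative model-parameter
    uncertainty and additive measurement noise, then average the
    magnitudes of the resulting samples. (Illustrative only.)"""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        perturbed = nominal * (1.0 + rng.gauss(0.0, rel_uncertainty))
        total += abs(perturbed + rng.gauss(0.0, noise_std))
    return total / n_trials

# Hypothetical comparison: simulated vs. measured notch response,
# both normalized to the same calibration-reflector amplitude.
calibration = 1.0
simulated = effective_amplitude(nominal=0.45, rel_uncertainty=0.10,
                                noise_std=0.02)
measured = effective_amplitude(nominal=0.30, rel_uncertainty=0.15,
                               noise_std=0.05)
error_db = db(simulated, calibration) - db(measured, calibration)
```

In a study like this one, `error_db` would be tracked across many flaw/transducer combinations; large or widely varying values would indicate that the model cannot yet support quantitative decisions.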
The validation results show that the models in CIVA were reasonably accurate in representing their empirical counterparts qualitatively. When the results were compared using the quantitative metric employed in this study, however, accuracy varied widely and the associated empirical and simulation uncertainties were large. We therefore conclude that using these models for applications that require quantitative reasoning (such as predicting the detectability of small flaws or model-assisted probability of detection studies) is currently unreliable, and errors exceeding 20 dB could occur. Ultimately, models need to be assessed on whether they can predict if an inspector can find a flaw in a component. The quantitative results show that using models to render definitive decisions about the detectability of flaws would be challenging because of the large number of variables in the modeling process. This problem will be addressed in future work by bounding the uncertainties and specifying guidelines for the modeling process, as PNNL expands the scope to include realistic flaws in complex materials such as austenitic welds and inhomogeneous base material. The present study focused on validating the CIVA model by comparing simulation results to empirical data and characterizing the differences and limitations; once those limitations are understood, the detection question can be addressed more reliably.