Table 2: Comparison of the judges’ original scores for the evaluation trials and the predicted scores from the different model variants (means ± standard deviations), as well as the results of the Wilcoxon rank-sum test.

| Apparatus | Best/Worst (M ± SD) | Z | p | Nearest Neighbor (M ± SD) | Z | p | Three out of Five (M ± SD) | Z | p | Recurrent Neural Network (M ± SD) | Z | p |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Floor | 3.73 ± 1.22 | 1.47 | .140 | 3.46 ± 1.75 | 1.85 | .063 | 3.84 ± 1.69 | 0.34 | .735 | 3.70 ± 0.79 | 2.18 | .029* |
| Beam | 3.87 ± 2.38 | 0.85 | .394 | 3.65 ± 2.13 | 0.31 | .756 | 3.91 ± 2.11 | 0.89 | .372 | 4.30 ± 1.00 | 0.86 | .390 |
| Vault | 3.30 ± 2.10 | 0.06 | .947 | 3.34 ± 1.89 | 0.07 | .946 | 3.53 ± 1.75 | 0.39 | .695 | 3.10 ± 1.11 | 1.88 | .060 |

Note: * denotes a statistically significant difference between the original and predicted scores.
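
As a rough reading aid, the relation between the Z and p columns can be sketched as follows. This assumes the reported p values are two-sided and come from the normal approximation of the rank-sum statistic, which the table itself does not state; the function name `two_sided_p` is hypothetical and used only for illustration.

```python
import math


def two_sided_p(z: float) -> float:
    """Two-sided p value for a standard-normal Z statistic.

    Assumption: p = 2 * (1 - Phi(|z|)), which equals erfc(|z| / sqrt(2)).
    """
    return math.erfc(abs(z) / math.sqrt(2.0))


if __name__ == "__main__":
    # Z values copied from the Recurrent Neural Network column of Table 2.
    rnn_z = {"Floor": 2.18, "Beam": 0.86, "Vault": 1.88}
    for apparatus, z in rnn_z.items():
        print(f"{apparatus}: Z = {z:.2f}, p = {two_sided_p(z):.3f}")
```

Under this assumption, the Recurrent Neural Network column is reproduced to three decimals. Some other cells differ in the third decimal, presumably because Z is reported to only two decimals.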