Using ChimerSeq fusion annotations (https://www.kobic.re.kr/chimerdb_mirror/),
we train the model with 1218 samples for each exon and predict fusions within two genes using all 1218 samples or only 50, 100, 200, or 500 samples.
The results are as follows:
predict with all (1218) samples, accuracy: 66.0667%, precision: 67.3774%
predict with 500 samples, accuracy: 64.7900%, precision: 65.2067%
predict with 200 samples, accuracy: 64.8750%, precision: 65.1616%
predict with 100 samples, accuracy: 63.2100%, precision: 62.8252%
predict with 50 samples, accuracy: 62.0650%, precision: 60.6647%
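For readers less familiar with the two metrics, here is a minimal sketch of how accuracy and precision are conventionally computed from prediction counts. It assumes the standard definitions apply to the numbers above. The function name and its arguments are hypothetical and are not taken from our code.

```python
def accuracy_and_precision(tp, fp, tn, fn):
    """Return (accuracy, precision) as percentages.

    tp/fp/tn/fn are counts of true positives, false positives,
    true negatives and false negatives (hypothetical inputs).
    """
    total = tp + fp + tn + fn
    if total == 0:
        raise ValueError("no predictions to evaluate")
    # Accuracy: share of all predictions that are correct.
    accuracy = 100.0 * (tp + tn) / total
    # Precision: share of predicted fusions that are real fusions.
    # Undefined when nothing was predicted positive, so return NaN there.
    predicted_positive = tp + fp
    precision = 100.0 * tp / predicted_positive if predicted_positive else float("nan")
    return accuracy, precision


if __name__ == "__main__":
    acc, prec = accuracy_and_precision(tp=3, fp=1, tn=4, fn=2)
    print(f"accuracy: {acc:.4f}% precision: {prec:.4f}%")
```

Precision only looks at predicted fusions, so it can fall faster than accuracy when the model starts producing more false positives.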
Another point worth mentioning: incorporating the p-value into the learning model is important for robustness.
Here are the results from a model whose only difference from our final model is that it does not take the p-value into consideration:
predict with all (1218) samples, accuracy: 65.4250%, precision: 68.4211%
predict with 500 samples, accuracy: 66.5700%, precision: 67.5902%
predict with 200 samples, accuracy: 64.9400%, precision: 63.7115%
predict with 100 samples, accuracy: 62.9550%, precision: 60.3448%
predict with 50 samples, accuracy: 60.5450%, precision: 57.2951%
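We can see that after taking the p-value into account, accuracy and precision drop less when fewer samples are used. Going from all 1218 samples down to 50, accuracy falls by about 4.0 percentage points with the p-value versus about 4.9 without it, and precision falls by about 6.7 points versus about 11.1.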
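The comparison can be checked directly from the reported numbers. The short script below is only a sketch for that arithmetic: the dictionary and function names are hypothetical, and the values are copied from the two result lists above.

```python
# Reported results in percent: {number of samples: (accuracy, precision)}.
WITH_P_VALUE = {
    1218: (66.0667, 67.3774),
    500: (64.7900, 65.2067),
    200: (64.8750, 65.1616),
    100: (63.2100, 62.8252),
    50: (62.0650, 60.6647),
}
WITHOUT_P_VALUE = {
    1218: (65.4250, 68.4211),
    500: (66.5700, 67.5902),
    200: (64.9400, 63.7115),
    100: (62.9550, 60.3448),
    50: (60.5450, 57.2951),
}


def metric_drop(results, full=1218, reduced=50):
    """Return (accuracy drop, precision drop) in percentage points
    when going from `full` samples to `reduced` samples."""
    acc_full, prec_full = results[full]
    acc_reduced, prec_reduced = results[reduced]
    return acc_full - acc_reduced, prec_full - prec_reduced


if __name__ == "__main__":
    for label, results in (("with p value", WITH_P_VALUE),
                           ("without p value", WITHOUT_P_VALUE)):
        acc_drop, prec_drop = metric_drop(results)
        print(f"{label}: accuracy drop {acc_drop:.4f}, precision drop {prec_drop:.4f}")
```

Passing a different `reduced` value (for example 100 or 200) gives the drop at the other sample sizes.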