How to generate the training dataset?
We take the fusion annotations and generate a "positive sample" for each annotated pair of fusion genes.
Then we randomly pair up genes that already appear in the "positive samples".
Those random pairs are checked against the annotations to ensure they are not themselves fusion pairs; they become the "negative samples".
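A minimal sketch of this sampling scheme (the function name `build_samples` and all variable names here are illustrative, not the repository's actual code):

```python
import random

def build_samples(fusion_pairs, n_negatives=None, seed=0):
    """Build positive and negative gene-pair samples.

    fusion_pairs: iterable of (gene_a, gene_b) tuples from the annotation.
    """
    rng = random.Random(seed)
    positives = [tuple(p) for p in fusion_pairs]
    fusion_set = {frozenset(p) for p in positives}

    # Only genes that already occur in positive samples are used
    # to form the random pairs.
    genes = sorted({g for pair in positives for g in pair})

    n_negatives = n_negatives or len(positives)
    negatives = set()
    while len(negatives) < n_negatives:
        a, b = rng.sample(genes, 2)
        # Ensure the random pair is not an annotated fusion pair.
        if frozenset((a, b)) not in fusion_set:
            negatives.add((a, b))

    # Label 1 for fusion pairs, 0 for random non-fusion pairs.
    return [(p, 1) for p in positives] + [(p, 0) for p in negatives]
```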
In detail, for each gene-pair sample we compute three matrices: a correlation matrix, a partial correlation matrix, and a Pearson p-value matrix.
With those three matrices as "X" and the fusion label (1 or 0) as "Y", we build a data_loader.
Finally, we split the data_loader into an 80% train_loader for training and a 20% test_loader for testing.
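A sketch of the dataset construction and the 80/20 split, assuming the three matrices are stacked into a single tensor per sample (the shapes, names, and batch size are assumptions, not the repository's exact code):

```python
import torch
from torch.utils.data import TensorDataset, DataLoader, random_split

def make_loaders(corr, pcorr, pval, labels, batch_size=32, seed=0):
    """corr, pcorr, pval: float tensors of shape (N, T, T);
    labels: float tensor of shape (N,) with values 0.0 or 1.0."""
    # Stack the three matrices as channels: X has shape (N, 3, T, T).
    x = torch.stack([corr, pcorr, pval], dim=1)
    dataset = TensorDataset(x, labels)

    # 80% train / 20% test split.
    n_train = int(0.8 * len(dataset))
    n_test = len(dataset) - n_train
    gen = torch.Generator().manual_seed(seed)
    train_set, test_set = random_split(dataset, [n_train, n_test], generator=gen)

    train_loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
    test_loader = DataLoader(test_set, batch_size=batch_size)
    return train_loader, test_loader
```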
For training, we feed the three matrices into three separate LSTM networks and combine the outputs of the last LSTM cells of the three networks through a sequential head consisting of a dropout layer, a linear layer, and a Sigmoid activation layer.
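A minimal sketch of this architecture as described, assuming each row of a T x T matrix is fed to its LSTM as one time step (the hidden size, dropout rate, and layer count are assumptions):

```python
import torch
import torch.nn as nn

class FusionLSTM(nn.Module):
    """Three LSTMs, one per input matrix; their last hidden states
    are concatenated and passed through dropout -> linear -> sigmoid."""

    def __init__(self, seq_len, hidden=64, p_drop=0.5):
        super().__init__()
        # Each row of a T x T matrix is one time step of length T.
        self.lstms = nn.ModuleList(
            nn.LSTM(seq_len, hidden, batch_first=True) for _ in range(3)
        )
        self.head = nn.Sequential(
            nn.Dropout(p_drop),
            nn.Linear(3 * hidden, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        # x: (batch, 3, T, T); one channel per matrix.
        lasts = []
        for i, lstm in enumerate(self.lstms):
            _, (h_n, _) = lstm(x[:, i])   # h_n: (1, batch, hidden)
            lasts.append(h_n[-1])         # last cell's hidden state
        return self.head(torch.cat(lasts, dim=1)).squeeze(1)
```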
The model's final output is a float in [0, 1], from which we compute the MSELoss against the 0/1 label.
We then backpropagate the loss to optimize the model.
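The corresponding training loop would look roughly like this; the optimizer choice (Adam) and learning rate are assumptions, since only the MSELoss and backpropagation are stated above:

```python
import torch
import torch.nn as nn

def train(model, train_loader, epochs=10, lr=1e-3):
    criterion = nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.train()
    for epoch in range(epochs):
        for x, y in train_loader:
            optimizer.zero_grad()
            pred = model(x)               # float in [0, 1]
            loss = criterion(pred, y)     # MSE against the 0/1 label
            loss.backward()               # backpropagate the loss
            optimizer.step()              # update model parameters
```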
Using fusion annotations from FusionGDB (https://ccsm.uth.edu/FusionGDB/),
we trained a model whose performance on the test set is:
"
Accuracy on Test Set: 68.6777 %
total: 3199
real=1, predict=1: 655    real=1, predict=0: 879
real=0, predict=1: 123    real=0, predict=0: 1542
Precision on Test Set: 84.1902 %
"
Similarly, we trained a model using fusion annotations from ChimerSeq (https://www.kobic.re.kr/chimerdb_mirror/), with the following results:
"
Accuracy on Test Set: 66.0667 %
total: 3000
real=1, predict=1: 948    real=1, predict=0: 559
real=0, predict=1: 459    real=0, predict=0: 1034
Precision on Test Set: 67.3774 %
"
As you can see, the results are not without value, since many sequence-based methods perform worse.