How to generate the training dataset?
We take the fusion annotations and generate a "positive sample" for each annotated pair of fusion genes.
Then we randomly pair up genes that already appear in the "positive samples".
Those random pairs are checked against the annotations to ensure they are not themselves fusion pairs; they become the "negative samples".
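A minimal sketch of this sampling scheme (the function name `build_samples` and all variable names here are illustrative, not the repository's actual code):

```python
import random

def build_samples(fusion_pairs, n_negatives=None, seed=0):
    """Build positive and negative gene-pair samples.

    fusion_pairs: iterable of (gene_a, gene_b) tuples from the annotation.
    """
    rng = random.Random(seed)
    positives = [tuple(p) for p in fusion_pairs]
    fusion_set = {frozenset(p) for p in positives}

    # Only genes that already occur in positive samples are used
    # to form the random pairs.
    genes = sorted({g for pair in positives for g in pair})

    n_negatives = n_negatives or len(positives)
    negatives = set()
    while len(negatives) < n_negatives:
        a, b = rng.sample(genes, 2)
        # Ensure the random pair is not an annotated fusion pair.
        if frozenset((a, b)) not in fusion_set:
            negatives.add((a, b))

    # Label 1 for fusion pairs, 0 for random non-fusion pairs.
    return [(p, 1) for p in positives] + [(p, 0) for p in negatives]
```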
In detail, for each gene-pair sample we compute three matrices: a correlation matrix, a partial correlation matrix, and a Pearson p-value matrix.
With those three matrices as "X" and the fusion label (1 or 0) as "Y", we build a data_loader.
Finally, we split the data_loader into an 80% train_loader for training and a 20% test_loader for testing.
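A sketch of the dataset construction and the 80/20 split, assuming the three matrices are stacked into a single tensor per sample (the shapes, names, and batch size are assumptions, not the repository's exact code):

```python
import torch
from torch.utils.data import TensorDataset, DataLoader, random_split

def make_loaders(corr, pcorr, pval, labels, batch_size=32, seed=0):
    """corr, pcorr, pval: float tensors of shape (N, T, T);
    labels: float tensor of shape (N,) with values 0.0 or 1.0."""
    # Stack the three matrices as channels: X has shape (N, 3, T, T).
    x = torch.stack([corr, pcorr, pval], dim=1)
    dataset = TensorDataset(x, labels)

    # 80% train / 20% test split.
    n_train = int(0.8 * len(dataset))
    n_test = len(dataset) - n_train
    gen = torch.Generator().manual_seed(seed)
    train_set, test_set = random_split(dataset, [n_train, n_test], generator=gen)

    train_loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
    test_loader = DataLoader(test_set, batch_size=batch_size)
    return train_loader, test_loader
```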
For training, we feed the three matrices into three separate LSTM networks and combine the outputs of the last LSTM cells of the three networks through a sequential head consisting of a dropout layer, a linear layer, and a Sigmoid activation layer.
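A minimal sketch of this architecture as described, assuming each row of a T x T matrix is fed to its LSTM as one time step (the hidden size, dropout rate, and layer count are assumptions):

```python
import torch
import torch.nn as nn

class FusionLSTM(nn.Module):
    """Three LSTMs, one per input matrix; their last hidden states
    are concatenated and passed through dropout -> linear -> sigmoid."""

    def __init__(self, seq_len, hidden=64, p_drop=0.5):
        super().__init__()
        # Each row of a T x T matrix is one time step of length T.
        self.lstms = nn.ModuleList(
            nn.LSTM(seq_len, hidden, batch_first=True) for _ in range(3)
        )
        self.head = nn.Sequential(
            nn.Dropout(p_drop),
            nn.Linear(3 * hidden, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        # x: (batch, 3, T, T); one channel per matrix.
        lasts = []
        for i, lstm in enumerate(self.lstms):
            _, (h_n, _) = lstm(x[:, i])   # h_n: (1, batch, hidden)
            lasts.append(h_n[-1])         # last cell's hidden state
        return self.head(torch.cat(lasts, dim=1)).squeeze(1)
```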
The model's final output is a float in [0, 1], from which we compute the MSELoss against the 0/1 label.
We then backpropagate the loss to optimize the model.
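The corresponding training loop would look roughly like this; the optimizer choice (Adam) and learning rate are assumptions, since only the MSELoss and backpropagation are stated above:

```python
import torch
import torch.nn as nn

def train(model, train_loader, epochs=10, lr=1e-3):
    criterion = nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.train()
    for epoch in range(epochs):
        for x, y in train_loader:
            optimizer.zero_grad()
            pred = model(x)               # float in [0, 1]
            loss = criterion(pred, y)     # MSE against the 0/1 label
            loss.backward()               # backpropagate the loss
            optimizer.step()              # update model parameters
```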
Using fusion annotations from FusionGDB (https://ccsm.uth.edu/FusionGDB/),
we trained a model whose performance on the test set is:
"
Accuracy on Test Set: 68.6777 %
total: 3199
real=1, predict=1: 655    real=1, predict=0: 879
real=0, predict=1: 123    real=0, predict=0: 1542
Precision on Test Set: 84.1902 %
"
Similarly, we trained a model using fusion annotations from ChimerSeq (https://www.kobic.re.kr/chimerdb_mirror/), with the following results:
"
Accuracy on Test Set: 66.0667 %
total: 3000
real=1, predict=1: 948    real=1, predict=0: 559
real=0, predict=1: 459    real=0, predict=0: 1034
Precision on Test Set: 67.3774 %
"
As you can see, the results are not without value, since many sequence-based methods perform worse.