A Study of Epitope Prediction Using Deep Learning – Phase One

By Kamil Legault  
January 12, 2021

Background:

The human body’s immune reaction to infection by a pathogen, such as a virus, is characterized by the production of antibodies. These antibodies react to antigens, molecules found on the surface of the pathogen. Antibodies typically function by recognizing discrete regions known as antigenic determinants or B-cell epitopes. B-cell epitopes can be defined as clusters of amino acids that are recognized by secreted antibodies or B-cell receptors and are able to elicit an immune response. Successful prediction and detection of such epitopes play a vital role in vaccine development.

In the first step, we tried to reproduce the results published in Liu et al. (2020) by following their reported methodology using Python code and libraries. The data they used is available on their website, but we chose to re-write the pre-processing functions in Python using the Biopython package. The results were evaluated on similar random samples of the test sets, using an ensemble of 11 trained models.

Set-up:

We will be working with B-cell epitopes obtained from the Immune Epitope Database (IEDB). The IEDB catalogs experimental data on antibody and T-cell epitopes studied in humans, non-human primates, and other animal species in the context of infectious disease, allergy, autoimmunity, and transplantation. It contains the largest number of confirmed epitopes and non-epitopes.

For this paper, 240,563 peptides with lengths in the interval (10, 50], i.e., 11 to 50 residues, were downloaded from IEDB. These correspond to either linear epitopes or non-epitopes. The dataset is highly imbalanced: 25,884 positive samples versus 214,679 negative samples.

The authors of the paper also sought to determine which peptide length provided the best results. They did so by employing truncation and extension techniques, mapping peptides from IEDB to longer antigen sequences from the NCBI database. This produced 39 datasets of fixed-length peptides, with lengths ranging from 11 to 50. A sample is shown in the following table; a sketch of the truncation/extension idea follows it.

Peptide chain                                        Length
AAVDADTAALA                                          11
VIRGKKGSGGITIKKTGQALVFGIY                            25
AKKAAAPSGKKSAKAATAPAKAAAAPAKAAAAPAKAAA               38
GLLGWSPQAQGILETLPANPPPASTNRQSGRQPTPLSPPLRNTHPQAMQ    49
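
The exact truncation and extension rules are described in Liu et al. (2020); the sketch below only illustrates the general idea under our own assumptions: a peptide longer than the target length is trimmed symmetrically around its center, and a shorter one is extended with flanking residues from the parent antigen sequence in which it occurs. The helper fit_to_length and its arguments are hypothetical, not taken from the paper.

def fit_to_length(peptide: str, antigen: str, target_len: int) -> str:
    """Truncate or extend `peptide` to exactly `target_len` residues.

    Hypothetical illustration: truncation keeps a central window of
    the peptide; extension pulls flanking residues from the parent
    `antigen` sequence (assumed long enough) in which the peptide occurs.
    """
    start = antigen.find(peptide)
    if start < 0:
        raise ValueError("peptide not found in antigen")
    if len(peptide) >= target_len:
        # Truncate: keep the central target_len residues of the peptide.
        offset = (len(peptide) - target_len) // 2
        return peptide[offset:offset + target_len]
    # Extend: grow the window around the peptide within the antigen.
    pad = target_len - len(peptide)
    left = max(0, start - pad // 2)
    right = min(len(antigen), left + target_len)
    left = max(0, right - target_len)  # re-adjust if we hit the right edge
    return antigen[left:right]

# Example: force a length-11 peptide to length 25.
print(fit_to_length("AAVDADTAALA", "MKTAAVDADTAALAGGSSTTPPKKLLQQEERR", 25))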

Feature engineering:

To represent a peptide sequence as a numeric vector usable by a machine learning algorithm, each sequence was converted to a vector of dipeptide frequencies.

Recall that there are 20 standard amino acids; these form the basis of any protein or peptide found in animal cells. In order to retain some sequential information, the authors considered each ordered pair of amino acids as a distinct element. These pairs are referred to as dipeptides, and there are 400 possible combinations. A peptide of length n can be divided into n − 1 overlapping dipeptides. The relative frequencies of all 400 dipeptides in a peptide form a vector of 400 elements, called the dipeptide composition, whose elements sum to 1. An example is shown below.

Dipeptide    Frequency
AA           0
AD           0.25
AR           0
...          ...
VW           0.5
VY           0
VV           0.25
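
We implemented this featurization with Biopython and pandas; each peptide string is stored in column 'X' of the training DataFrame epitope_train: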
import pandas as pd
from Bio.Seq import Seq

# Enumerate all 400 possible dipeptides over the 20 standard amino acids.
aa = "ARNDCQEGHILKMFPSTWYV"
dipeptides = [i + j for i in aa for j in aa]

def dipeptide_composition(row):
    """Return the dipeptide composition of the peptide in column 'X'."""
    sequence = Seq(row["X"])
    # Count overlapping occurrences of each dipeptide in the sequence.
    counts = pd.Series({dp: sequence.count_overlap(dp) for dp in dipeptides})
    # Normalize by the total number of dipeptides (n - 1 for a peptide of
    # length n), so the 400 frequencies sum to 1.
    return counts / counts.sum()

features_aa = epitope_train.apply(dipeptide_composition, axis=1)
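
One subtlety: Seq.count_overlap counts overlapping occurrences, unlike str.count. For example, "AAA" contains two overlapping "AA" dipeptides, and count_overlap reports both. This is why a peptide of length n always yields exactly n − 1 dipeptide counts in total.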

Modelling:

Since the dataset is heavily imbalanced, the authors used an ensemble approach: 11 models trained independently on 11 balanced versions of the data. The process is as follows: 20,000 positive data points are randomly sampled from the original dataset, and 11 sets of 20,000 negative samples are drawn. Each of the 11 classifiers is then trained on the shared positive set paired with one of the negative sets. At scoring time, each classifier votes on each test sample, and the predicted label is decided by majority vote, as sketched below.
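
As a rough illustration (not the authors' exact code), the balanced resampling and majority vote could look like the following. Here positives and negatives (feature DataFrames with a label column "y"), X_test, and the build_model() factory are hypothetical names; build_model() is assumed to return a compiled network like the one defined next.

import numpy as np
import pandas as pd

N_MODELS, N_SAMPLES = 11, 20000

# One shared set of 20,000 positives and 11 independent negative sets.
pos = positives.sample(n=N_SAMPLES, random_state=0)
neg_sets = [negatives.sample(n=N_SAMPLES, random_state=k + 1)
            for k in range(N_MODELS)]

models = []
for neg in neg_sets:
    balanced = pd.concat([pos, neg]).sample(frac=1, random_state=0)  # shuffle
    X = balanced.drop(columns="y").values
    y = balanced["y"].values
    model = build_model()  # hypothetical: compiled network from next section
    model.fit(X, y, epochs=400, verbose=0)
    models.append(model)

# Majority vote over the 11 classifiers on the test set.
votes = np.stack([(m.predict(X_test).ravel() > 0.5).astype(int)
                  for m in models])
y_pred = (votes.sum(axis=0) > N_MODELS // 2).astype(int)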

The architecture of each network is quite simple: four fully connected hidden layers with ReLU activations, each followed by dropout, and a single sigmoid output unit:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

model = Sequential()
model.add(Dense(units=200, activation="relu", input_dim=X_trans.shape[1],
                kernel_initializer="random_normal"))
model.add(Dropout(0.4))
model.add(Dense(units=100, activation="relu",
                kernel_initializer="random_normal"))
model.add(Dropout(0.4))
model.add(Dense(units=40, activation="relu",
                kernel_initializer="random_normal"))
model.add(Dropout(0.4))
model.add(Dense(units=20, activation="relu",
                kernel_initializer="random_normal"))
model.add(Dropout(0.4))
model.add(Dense(units=1, activation="sigmoid",
                kernel_initializer="random_normal"))

model.compile(loss="binary_crossentropy", optimizer="rmsprop",
              metrics=["accuracy"])

Results:

We trained each model for 400 epochs, then averaged the 11 predictions to obtain one final prediction. This differs slightly from the paper, where a sample was counted as positive only if all classifiers agreed; we did not observe any change in results from this modification. The ROC curve for the model trained on the length-11 dataset is shown below, with an accuracy of 83%.

[Figure: ROC curve for the length-11 ensemble]
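
For reference, here is a minimal sketch of the averaging rule and the ROC computation, reusing models, X_test, and y_test from the ensemble sketch above and scikit-learn for the metrics:

import numpy as np
from sklearn.metrics import accuracy_score, auc, roc_curve

# Average the 11 sigmoid outputs instead of requiring a unanimous vote.
probs = np.mean([m.predict(X_test).ravel() for m in models], axis=0)
y_pred = (probs > 0.5).astype(int)

print("accuracy:", accuracy_score(y_test, y_pred))  # ~0.83 for length 11
fpr, tpr, _ = roc_curve(y_test, probs)
print("AUC:", auc(fpr, tpr))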

References

Liu, T., Shi, K., & Li, W. (2020). Deep learning methods improve linear B-cell epitope prediction. BioData Mining, 13, 1–13.

