Inspired by the humanization of mouse-derived antibodies, our project starts from antibody sequences raised in a source species (such as mouse) against an antigen of interest, automatically generates antibody sequences that may offer better efficacy and lower immunogenicity, and scores them comprehensively. That is, given a specific antibody sequence from one species and a target species in which the corresponding antibody is desired, we aim to generate a database of potential antibodies for the target species. The generated antibodies should preserve the CDR regions of the original antibody as much as possible, because these regions carry out specific recognition and binding of the antigen and thereby enable the function of antibody drugs.
<h2 class="title" id="1Design" style="color:#573674">Initial Model Design</h2>
<p>
Our project is mainly divided into three parts: potential antibody sequence generation, antibody sequence species specificity scoring, and antibody structure scoring. The species specificity scoring extends current mainstream tools to a wider range of species, and this part is the focus of our project. In the design phase, we divide the work into three components: data encoding and processing, model construction, and training and debugging. The overall framework of the model is a critical issue that we needed to iterate on and explore continuously. The iterative process presented below revolves around the core task of antibody sequence species specificity scoring.
</p>
<h2 class="title" id="2Design" style="color:#573674">Version II - Design</h2>
<p>
In the first version of the model, a core issue arose from the overly naive encoding method. We therefore surveyed mainstream amino acid encoding matrices, including BLOSUM matrices, PAM matrices, and Position-Specific Scoring Matrices (PSSMs). Ultimately, we chose the BLOSUM62 substitution scoring matrix, a widely applied encoding in protein analysis. Because the BLOSUM62 matrix is derived from statistics of aligned protein sequence blocks, we have good reason to believe that it captures both the physicochemical similarities between amino acids and substitution patterns observed in natural evolution.
</p>
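<p>As a minimal, self-contained sketch of this encoding idea, each residue can be mapped to its row of the BLOSUM62 matrix. Only five rows are hardcoded below for illustration; a real encoder would carry all 20 standard amino acids, and this is not our full pipeline.</p>

```python
# Partial BLOSUM62 rows (scores against A, R, N, D, C) -- illustration only;
# a complete encoder would use the full 20x20 matrix.
BLOSUM62_ROWS = {
    "A": [4, -1, -2, -2, 0],
    "R": [-1, 5, 0, -2, -3],
    "N": [-2, 0, 6, 1, -3],
    "D": [-2, -2, 1, 6, -3],
    "C": [0, -3, -3, -3, 9],
}

def encode(seq):
    """Encode an amino acid sequence as a list of BLOSUM62 row vectors."""
    return [BLOSUM62_ROWS[aa] for aa in seq]

vectors = encode("ARND")  # one row vector per residue
```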
<h2 class="title" id="3Design" style="color:#573674">Version III - Design</h2>
<p>
Once the encoding method was determined, we employed several traditional machine learning algorithms and compared their performance against the SVM method, with the aim of improving our model.
</p>
<h2 class="title" id="4Design" style="color:#573674">Version IV - Design</h2>
<p>
After accumulating experience from previous model iterations, we decided to leverage the PyTorch deep learning framework and employ state-of-the-art deep learning models for designing our classifier.</p>
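<p>A minimal PyTorch sketch of such a classifier is shown below. The sequence length, encoding dimension, and species count are illustrative assumptions, not our actual hyperparameters.</p>

```python
import torch
import torch.nn as nn

# Hypothetical dimensions: fixed-length sequences of 150 residues, each encoded
# as a 20-dim BLOSUM62 row; the number of species classes is illustrative.
SEQ_LEN, ENC_DIM, N_SPECIES = 150, 20, 8

class SpeciesClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(SEQ_LEN * ENC_DIM, 256),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(256, N_SPECIES),
        )

    def forward(self, x):
        return self.net(x)

model = SpeciesClassifier()
logits = model(torch.randn(4, SEQ_LEN, ENC_DIM))  # a batch of 4 encoded sequences
```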
<h2 class="title" id="5Design" style="color:#573674">Version V - Design</h2>
<p>
The core issue at this stage is that there are many species and the training data are imbalanced across them. A neural-network classification head with a predefined number of classes not only learns poorly from small-sample classes but is also difficult to extend to newly added species.
</p>
<h2 class="title" id="6Design" style="color:#573674">Version VI - Design</h2>
<p>
Because we observed significant outliers in the data, we aim to use a One-Class SVM model to identify clearly anomalous data points and remove them, so that these outliers do not unduly interfere with the downstream classifiers. The more compact data remaining in each class can then be used to complete the classification and scoring tasks with SVM models.
</p>
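<p>As a sketch of this filtering step, with toy data and an illustrative (untuned) nu value, scikit-learn's OneClassSVM can flag and drop the points it regards as outliers before the per-class classifier is fit:</p>

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
# Toy per-class feature matrix: a compact cluster plus a few distant outliers.
inliers = rng.normal(0.0, 0.5, size=(95, 2))
outliers = rng.normal(8.0, 0.5, size=(5, 2))
X = np.vstack([inliers, outliers])

# nu upper-bounds the fraction of training points treated as outliers;
# the value 0.1 here is illustrative, not tuned.
ocsvm = OneClassSVM(kernel="rbf", gamma="scale", nu=0.1).fit(X)
mask = ocsvm.predict(X) == 1  # +1 = inlier, -1 = outlier
X_clean = X[mask]             # compact data passed on to the SVM classifier
```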
<h2 class="title" id="7Design" style="color:#573674">Version VII - Design</h2>
<h3 class="title" style="color:#573674">A completely new attempt: Deep One-Class</h3>
<p>
...
...
<p>
We drew inspiration from Deep One-Class models used for image anomaly detection: we combined the descriptive loss of the multi-class task with the compactness loss of the one-class model in a weighted sum, and used this weighted loss to update the parameters of the deep learning model. This ensures that the model retains strong feature-extraction capability while each class's data are trained to be more compact.
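<p>A minimal PyTorch sketch of this weighted objective is given below; the mean-distance form of the compactness term and the weight <code>lam</code> are illustrative assumptions, not our exact formulation.</p>

```python
import torch
import torch.nn.functional as F

def compactness_loss(features):
    # Mean squared distance of each feature vector from the batch mean:
    # smaller values correspond to a tighter single-class representation.
    centered = features - features.mean(dim=0, keepdim=True)
    return centered.pow(2).sum(dim=1).mean()

def combined_loss(logits, labels, features, lam=0.1):
    # Descriptive (multi-class) loss plus weighted compactness loss.
    descriptive = F.cross_entropy(logits, labels)
    return descriptive + lam * compactness_loss(features)

logits = torch.randn(8, 5, requires_grad=True)   # toy class scores
features = torch.randn(8, 16)                    # toy extracted features
labels = torch.randint(0, 5, (8,))
loss = combined_loss(logits, labels, features)   # backpropagated during training
```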
</p>
<h2 class="title" id="7Build" style="color:#573674">Version VII - Build</h2>
<h2 class="title" id="8Design" style="color:#573674">Final Model - Design</h2>
<p>
After completing the model training, we further updated and adjusted the testing process several times, and the final model was determined as follows.
</p>