Commit 00c2b6ec authored by Chayarat Wangweera

fix-projectdesc

parent ea1cb393
Pipeline #443885 passed
......@@ -21,8 +21,6 @@ export function Description() {
classification to improve prediction accuracy.
</p>
<h3>Description</h3>
<h3>Intro</h3>
<p>
Breast cancer is the most common type of cancer that affects women around the world.
......@@ -73,103 +71,46 @@ export function Description() {
{"3.) Evaluation of immunogenicity of candidate neoantigens [8]"}
</p>
<p>
Our team has previously worked on a personalized neoantigen prediction
algorithm for DC vaccine development. The software predicts the
neoantigen suitable for the patient to create vaccines that are able to
eliminate the patient’s tumor. Once the algorithm identifies the
appropriate neoantigen for the vaccine, the next step is to validate the
target neoantigen gene expression level, which can be done using RNA
sequencing. This validation process can enhance the number of patients
who benefit from this personalized DC vaccine [9].
Our team has previously worked on a personalized neoantigen prediction algorithm for DC vaccine development. The software aims to predict a neoantigen candidate suitable for the patient to create vaccines capable of eliminating the patient’s tumor. Once the algorithm identifies the appropriate neoantigen for the vaccine, the target neoantigen is validated by measuring the gene expression level, which can be achieved using RNA sequencing (if this is RNA levels of a specific gene, you can propose qPCR, which is significantly cheaper than RNA-seq). Once validated, this process can significantly increase the number of patients who benefit from this personalized DC vaccine [9].
</p>
<p>
Obtaining mRNA sequences of neoantigens can be very expensive and
require expert lab personnel, leaving medical professionals with access
to only patient-derived genomic DNA sequences. The absence of
neoantigen’s mRNA expression leads to inaccuracies in predicting whether
the immune system will effectively target the neoantigen. We aim to
develop an alternative to mRNA sequencing to determine neoantigen
expression. ASPINE aims to use the DNA sequence to predict neoantigen
gene expression. We hypothesize that the promoter could be a strong
protein expression level prediction factor and a substitute for RNA in
the prediction process. Our Bangkok-NMH team sought to solve this
matter, which is how ASPINE came forward.
Obtaining mRNA sequences of neoantigens can be very expensive and require expert lab personnel, leaving medical professionals with access to only patient-derived genomic DNA sequences. The absence of neoantigen’s mRNA expression leads to inaccuracies in predicting whether the immune system will effectively target the neoantigen. We aim to develop an alternative to mRNA sequencing to determine neoantigen expression. ASPINE (Ai for Sequence-based Predictions and Identification of Neoantigen Expression) aims to use DNA sequences to predict neoantigen gene expression. We hypothesize that transcription promoters could be a strong predictive factor for protein expression and a substitute for mRNA sequencing in the prediction process. Our Bangkok-NMH team sought to solve this matter, which is how ASPINE was developed.
</p>
<p>
Considering the lack of mRNA sequencing data, ASPINE aims to predict
neoantigen gene expression levels using DNA sequence as an alternative
to mRNA. Our project focuses on building and evaluating a
machine-learning model that chronologically ranks the selected proteins
based on their expression level using promotor regions of the selected
gene. This information is crucial for choosing the right neoantigen to
develop a personalized DC vaccine for patients.
Considering the lack of patient-derived mRNA sequencing data, ASPINE aims to predict neoantigen gene expression levels using promoter DNA sequences as an alternative to mRNA. Our project focuses on building and evaluating a machine-learning model that chronologically ranks selected proteins based on their expression levels using known strengths of promoter regions found on the selected genes. This information is crucial for choosing the right neoantigen to develop a personalized DC vaccine for patients.
</p>
<p>{"{maybe something about what part of DNA we use}}"}</p>
<p>
Traditional RNA sequencing methods have been augmented by bioinformatic
approaches that predict neoantigens from tumor genomic data. However, as
Borden et al. highlighted in [10], these predictions can be limited by
the quality and availability of RNA samples, necessitating alternative
methods. This depicts a similar issue we faced during our DC cell
development process. The study demonstrates the importance of
high-quality genomic data and robust computational tools in accurately
predicting and prioritizing neoantigens, which is crucial for the
development of effective personalized cancer vaccines.
Traditional RNA sequencing methods have been augmented by bioinformatic approaches that predict neoantigens from tumor genomic data. However, as Borden et al. highlighted in [10], these predictions can be limited by the quality and availability of RNA samples, necessitating alternative methods. This mirrors an issue we faced during the DC cell vaccine development process. Borden et al. demonstrate the importance of high-quality genomic data and robust computational tools in accurately predicting and prioritizing neoantigens, which is crucial for the development of effective personalized cancer vaccines.
</p>
<p>
Several studies have explored alternatives to RNA sequencing for gene
expression prediction. J. Li and Y. Zhang [11] highlighted the
correlation between promoter sequences and gene expression levels and
suggested that promoter strength could predict protein expression. Their
research demonstrated that specific nucleotide sequences within
promoters are strong indicators of their transcriptional activity.
Similarly, a study by Hjörleifur Einarsson et al. [12] showed that
promoters' architecture and sequence significantly influence gene
expression variability and confer robustness to genetic variants. The
study found that certain promoter configurations can maintain consistent
gene expression levels even in the presence of genetic variants, making
them a robust tool for predicting gene expression.
Several studies have explored alternatives to RNA sequencing for gene expression prediction. One study highlighted the correlation between promoter sequences and gene expression levels and suggested that promoter strength could predict protein expression [11]. Their results demonstrated that specific nucleotide sequences within promoters are strong indicators of their transcription levels. Similarly, a study by Hjörleifur Einarsson et al. [12] showed that promoters' transcription start site architecture and sequence significantly influence gene expression variability and confer robustness to genetic variants. The study found that certain promoter configurations can maintain consistent gene expression levels even in the presence of genetic variants, making them a robust tool for predicting gene expression.
</p>
<p>
A study similar to our project attempted to predict gene expression
levels using genome sequence. Using a Convolutional Neural Network, the
study shows that promotor-proximal nucleotides strongly predict
transcriptional activity. A study by Vikram Agarwal and Jay Shendure
[13] demonstrated that deep convolutional neural networks could strongly
predict mRNA abundance using DNA sequences. The model they developed,
Xpresso, achieved high predictive power and was competitive with models
using thousands of biochemical datasets, underscoring the potential of
deep learning models in understanding gene regulatory mechanisms and
highlighting that the CpG dinucleotide content at core promoters is
strongly predictive of mRNA abundance.
A study similar to our project attempted to predict gene expression levels using genome sequences. They used a Convolutional Neural Network (CNN) to show that promoter-proximal nucleotides strongly predict transcriptional activity. A different study demonstrated that deep convolutional neural networks could strongly predict mRNA abundance using DNA sequences [13]. The model they developed, Xpresso, achieved high predictive power and was competitive with models using thousands of biochemical datasets, underscoring the potential of deep learning models in understanding gene regulatory mechanisms and highlighting that the CpG dinucleotide content at core promoters is strongly predictive of mRNA abundance.
</p>
<h3>METHODOLOGY</h3>
<p>
Our methodology comprises four different experiments. These experiments
represent a chronological step toward producing an efficient final model
for predicting neoantigen expression. Although these four experiments
use the same Xpresso model, they differ mainly in the input, training,
and validation datasets.
Our methodology consists of four different aims. These aims are designed to generate an efficient model for predicting neoantigen expression. Although these four aims use the same Xpresso model, they differ in the input, training, and validation datasets.
</p>
<p>The four experiments consist of</p>
<p>The four experiments consist of:</p>
<p>
{"A replication of the original Xpresso model and data"}
<br />
{"1. A replication of the original Xpresso model and data"}
{"2. An Xpresso model with a modified input genome length"}
<br />
{"2. Development of an Xpresso model with a modified input genome length"}
<br />
{
"3. An Xpresso model trained with our custom input data (existing breast cancer patient data) "
"3. Development of an Xpresso model trained with our custom input data (existing breast cancer patient data)"
}
<br />
{
"4. An Xpresso model trained with our custom input data (existing breast cancer patient data) and modified input genome length "
"4. Development of an Xpresso model trained with our custom input data (existing breast cancer patient data) and modified input genome length"
}
</p>
<h3>3.1 Data Preprocessing</h3>
<p>
Our model consists of two main categories of data as a model input: the
Our model consists of two main categories of data to be used as model inputs: the
input genome and the half-life data. Half-life features are the
components that determine mRNA stability. They consist of multiple data
types: the GC content (or the percentage of G and C nucleotides) and
......@@ -275,31 +216,13 @@ export function Description() {
</p>
<h3>3.2 Machine Learning Architecture</h3>
<p>
The first experiment replicates the Xpresso experiment from article
[13]. We used the sequence of 16,000 genes as the input, each consisting
of a 10,500 bp sequence. We also incorporated half-life data (Exon
junction density, GC content, and length of certain regions in ORF) into
the model input. Following the genome sequence input, the model will
predict and output median gene expression, making it a cell
type-specific model.
Our first experiment validates the Xpresso experiment from the original study [13]. We used the sequences of 16,000 genes as the input, each consisting of a 10,500 bp sequence. We also incorporated half-life data (Exon junction density, GC content, and length of certain regions in ORF) into the model input. Following the genome sequence input, the model will predict and output median gene expression, making it a cell type-specific model.
</p>
<p>
The second experiment utilizes the same model structure, input, output,
and half-life data as the first experiment. We reduced the input
sequence length from 10,500 bp to 1,000 ± 500 bp. We found theat the
range of ± 500 bp can effectively represent the whole sequence. This
method of reduction simulates the WES range that will be later used in
our fourth experiment as it is compatible with our real patient data. We
then used that modified input data to train the Xpresso model and
compared the results with the first experiment.
The second experiment utilizes the same model structure, input, output, and half-life data as the first experiment. We reduced the input sequence length from 10,500 bp to 1,000 ± 500 bp. We found that the range of ± 500 bp can effectively represent the whole sequence. This method of reduction simulates the WES range that will be used later in our fourth experiment as it is compatible with our real patient data. We then used that modified input data to train the Xpresso model and compared the results with the first experiment.
</p>
<p>
For the third experiment, we used our own WES input data obtained from
breast cancer patients mentioned earlier. We also incorporated
half-life-related data similar to the first two experiments. However,
the input and outputs are the cancer cell lines, making it a cell
type-agnostic iteration. We then compared the results with those of the
first two experiments.
For the third experiment, we used our own WES input data obtained from breast cancer patients mentioned previously. We also incorporated half-life data similar to the first two experiments. However, the input and outputs are the cancer cell lines, making it a cell type-agnostic iteration. Next, we compared the results with those of the first two experiments.
</p>
<p>
Finally, we used our own WES data from breast cancer patients, similar
......@@ -387,23 +310,10 @@ export function Description() {
of the model.
</p>
<p>
Overall, our dataset is considered very compatible with Xpresso. Looking
at the RMSE of our custom data compared to the original data (Experiment
1 vs. 3 and Experiments 2 vs. 4), the RMSE is lower for both
comparisons, indicating compatibility. Additionally, we also figured
that the inclusion of Half-life data is crucial in model efficiency. The
results were incomprehensive when half-life sata were not included.
Regarding sequence length, WES captures regulatory elements including
those that are further from TSS, resulting in data abundancy and
therefore lower accuracy.
Overall, our dataset and pipeline are highly compatible with Xpresso. Looking at the RMSE of our custom data compared to the original data (Experiment 1 vs. 3 and Experiment 2 vs. 4), the RMSE is lower for both comparisons, indicating compatibility. Additionally, we found that the inclusion of half-life data is crucial to model efficiency. The results were not viable when half-life data were excluded. Regarding sequence length, WES captures regulatory elements, including those that are further from the TSS, resulting in data abundance and, therefore, lower accuracy.
</p>
<p>
Potential improvements can be using hyper parameterization to increase
the accuracy of the prediction model. Another idea is a separation of
models; we will separate the models into dispersed and focused TSS. We
think that these models will perform better than a non-classified model
because they have different underlying mechanisms of transcription
initiation that are worth classifying.
The model could be improved by hyperparameter tuning to increase the accuracy of its predictions. Another approach would be a separation of the models; we will separate the models into dispersed and focused TSS. We think that these models will perform better than a non-classified model because they have different underlying mechanisms of transcription initiation that are worth classifying.
</p>
<p style={{ marginTop: 30, maxWidth: 600, fontSize: 14 }}>
......