Commit e4e6ed5a authored by wuttigaisorn's avatar wuttigaisorn

fix(fixed-content): fixed-content

parent d7646003
Pipeline #483645 passed
@@ -298,128 +298,6 @@ export function Engineering() {
these two types of transcription initiation have distinct regulatory
mechanisms that likely benefit from individualized modeling approaches.
</p>
<h3>Cycle 3 (Added 24th September)</h3>
<p>
During our research into how promoter sequence affects gene expression
and transcription (the process most influenced by DNA sequence), we
found that there are two modes of transcription initiation: focused and
dispersed. In focused initiation, almost all transcripts start from a
single transcription start site (TSS). In dispersed initiation,
transcripts start from multiple TSSs spanning about 100 bp. The two
modes have very different regulatory sequences and initiation
mechanisms. We therefore hypothesized that training separate models for
the focused and dispersed modes would improve the results, because each
model can then account for one additional feature: the transcription
mechanism.
</p>
<h3>Build</h3>
<p>
To implement this additional feature, we split the original patient
dataset into focused and dispersed groups. Each gene ID is associated
with numerous transcription start sites (TSSs). For each gene, we
measured the distance between the TSS at Q1 (25th percentile) and the
TSS at Q3 (75th percentile): if the distance is more than 5 base pairs,
the gene is classified as dispersed; otherwise, it is classified as
focused. After the split, one model is trained on the focused dataset
and one on the dispersed dataset in each of experiments 5 and 6.
</p>
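<p>
The classification rule above can be sketched in Python. This is a
minimal illustration, not our pipeline code: the function names, the
dictionary layout, and the use of unweighted quartiles over a plain list
of TSS positions are all assumptions made for the example.
</p>

```python
from statistics import quantiles

# From the text: a Q1-to-Q3 distance of more than 5 bp means "dispersed".
DISPERSED_THRESHOLD_BP = 5


def classify_tss_mode(tss_positions, threshold_bp=DISPERSED_THRESHOLD_BP):
    """Return 'dispersed' or 'focused' for one gene's TSS positions.

    Assumption: tss_positions is a list of genomic coordinates, one entry
    per observed TSS, and quartiles are taken without read-count weighting.
    """
    if len(tss_positions) < 2:
        # A single TSS has no spread, so treat it as focused.
        return "focused"
    q1, _median, q3 = quantiles(tss_positions, n=4, method="inclusive")
    return "dispersed" if (q3 - q1) > threshold_bp else "focused"


def split_by_mode(genes):
    """Split a {gene_id: [tss positions]} mapping into two mappings."""
    focused, dispersed = {}, {}
    for gene_id, positions in genes.items():
        target = dispersed if classify_tss_mode(positions) == "dispersed" else focused
        target[gene_id] = positions
    return focused, dispersed


# Hypothetical gene IDs and coordinates, for illustration only.
example = {
    "geneA": [100, 100, 100, 101, 101, 102],
    "geneB": [100, 120, 140, 160, 180],
}
focused_genes, dispersed_genes = split_by_mode(example)
```

<p>
In this toy input, geneA has a Q1-to-Q3 distance of 1 bp and lands in the
focused group, while geneB has a distance of 40 bp and lands in the
dispersed group.
</p>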
<p>
To test the hypothesis, we constructed two more experiments, 5 and 6,
for a total of four models. Each experiment has two model variants: a
dispersed TSS model and a focused TSS model. The models in experiment 5
mirror the structure of experiments 1 and 3 by using a long input
sequence. The models in experiment 6 mirror the structure of
experiments 2 and 4 by using a shortened input sequence.
</p>
<h3>Test</h3>
<p>
Using the same metrics, R-squared and RMSE, we compared the dispersed
and focused models of experiments 5 and 6 with experiments 3 and 4 to
see how performance changes when the data are split into focused and
dispersed transcription. These models are comparable because they all
use our custom patient data.
</p>
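<p>
For reference, the two metrics can be computed as in the sketch below.
It uses the standard definitions of R-squared and RMSE in plain Python;
the function and argument names are illustrative and are not taken from
our training code.
</p>

```python
import math


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    if ss_tot == 0:
        raise ValueError("R-squared is undefined when all actual values are equal")
    return 1.0 - ss_res / ss_tot


def rmse(actual, predicted):
    """Root mean squared error, in the same units as the target."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))
```

<p>
R-squared is measured relative to the variance of the target values in
each evaluation set, while RMSE is an absolute error. The two metrics can
therefore move in different directions when models are evaluated on
different subsets of the data.
</p>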
<p>
For experiment 5 (long input sequence), the dispersed model achieved an
R-squared of 0.358 and an RMSE of 0.423, a +0.029 change in R-squared
and a -0.017 change in RMSE relative to experiment 3 (long input
sequence). The focused model achieved an R-squared of 0.479 and an RMSE
of 0.539, a +0.15 improvement in R-squared but a higher error.
</p>
<p>
For experiment 6 (short input sequence), the dispersed model achieved an
R-squared of 0.328 and an RMSE of 0.419, a +0.034 change in R-squared
and a -0.024 change in RMSE relative to experiment 4 (short input
sequence). The focused model achieved an R-squared of 0.479 and an RMSE
of 0.539, a +0.185 improvement in R-squared but a 0.096 increase in
RMSE.
</p>
<img
className="image-center"
width={"80%"}
src="https://static.igem.wiki/teams/5251/gra-1.jpg"
/>
<p className="image-caption">
Figure 5: Actual value (x-axis) versus predicted value (y-axis).
Top-left: experiment 5 (dispersed); top-right: experiment 6 (dispersed);
bottom-left: experiment 5 (focused); bottom-right: experiment 6
(focused).
</p>
<p>Table 2: R-squared and RMSE of experiments 3, 4, 5, and 6.</p>
<table>
<tr>
<td>Experiment / Results</td>
<td>R-squared</td>
<td>Root mean squared error (RMSE)</td>
</tr>
<tr>
<td>3 (long input)</td>
<td>0.329</td>
<td>0.440</td>
</tr>
<tr>
<td>4 (short input)</td>
<td>0.294</td>
<td>0.443</td>
</tr>
<tr>
<td>5 (dispersed and long input)</td>
<td>0.358</td>
<td>0.423</td>
</tr>
<tr>
<td>6 (dispersed and short input)</td>
<td>0.328</td>
<td>0.419</td>
</tr>
<tr>
<td>5 (focused and long input)</td>
<td>0.479</td>
<td>0.529</td>
</tr>
<tr>
<td>6 (focused and short input)</td>
<td>0.479</td>
<td>0.539</td>
</tr>
</table>
<h3>Learn</h3>
<p>
To conclude, splitting the data improved R-squared for both the focused
and the dispersed models, modestly for the dispersed models and
substantially for the focused models (+0.15 to +0.185). However, RMSE
increased for the focused models and decreased for the dispersed models.
The dispersed model’s R-squared decreased when we shortened the input
(0.358 to 0.328), while the focused model’s stayed at 0.479.
</p>
<p>
The results support our hypothesis that modeling dispersed and focused
transcription separately helps predict the expression of neoantigen
proteins, most clearly in terms of R-squared. These results can be
useful for future teams working on predicting neoantigen expression.
</p>
<p>
[1] J. Ferlay et al., “Cancer incidence and mortality patterns in
Europe: Estimates for 40 countries in 2012,” Eur. J. Cancer, vol. 49,
......