unfinished ML arch

24e725d2 · Chayarat Wangweera · defc0399 · 24e725d2
Commit 24e725d2 authored 8 months ago by Chayarat Wangweera
--- a/src/contents/description.tsx
+++ b/src/contents/description.tsx
@@ -81,12 +81,7 @@ export function Description() {
                
               [Project name] aims to use the promoter sequence to determine neoantigen gene expression.
               We hypothesize that the promoter sequence could be a strong predictor of protein expression level and could substitute for RNA data in the prediction process. <br /><br />
-          </p>
-        </div>
-        <div className="col-lg-8">
-          <h2>related works</h2>
-          <p>
-              Traditional RNA sequencing methods have been augmented by bioinformatic approaches that predict neoantigens from tumor genomic data. However, as Borden et al. highlighted in [9], these predictions can be limited by the quality and availability of RNA samples, necessitating alternative methods.
+               Traditional RNA sequencing methods have been augmented by bioinformatic approaches that predict neoantigens from tumor genomic data. However, as Borden et al. highlighted in [10], these predictions can be limited by the quality and availability of RNA samples, necessitating alternative methods.
               This mirrors an issue we faced during our dendritic cell (DC) development process. 
              The study demonstrates the importance of high-quality genomic data and robust computational tools in accurately predicting and prioritizing neoantigens, which is crucial for the development of effective personalized cancer vaccines. <br /><br />
               Several studies have explored alternatives to RNA sequencing for gene expression prediction. J. Li and Y. Zhang [11] highlighted the correlation between promoter sequences and gene expression levels and suggested that promoter strength could predict protein expression. 
@@ -96,35 +91,67 @@ export function Description() {
               A study by Vikram Agarwal and Jay Shendure [13] demonstrated that deep convolutional neural networks could strongly predict mRNA abundance using promoter sequences. 
              The model they developed, Xpresso, achieved high predictive power and was competitive with models using thousands of biochemical datasets, underscoring the potential of deep learning models in understanding gene regulatory mechanisms and highlighting that the CpG dinucleotide content at core promoters is strongly predictive of mRNA abundance. <br /><br />

-              METHODOLOGY
-
-                Our methodologies are currently in the trial-and-error phase. The suggested steps in this writing are constantly evaluated and redesigned to create the most efficient ML model possible. The writing below explains the most recent methodologies/ model and data preprocessing architecture as of June 15th. 
-
-              3.1 Data Preprocessing
-
-                Our ML model utilizes genomic data from the GTF file (hg38.fa) obtained from a publicly available dataset from the National Institute of Health (NIH) and extracts required gene information, including gene IDs, chromosome starting-ending points, Median Protein Expression (MPE), Transcription Start Site (TSS) position, etc. This information is extracted using the Software Asset Management (SAM) tool. When extracting genetic information from the obtained GTF file, we make sure to filter out irrelevant aspects of the gene, such as histone genes and those genes on chromosome Y.
-
-                Our software uses the gene ID and the MPE value that we obtained, combining information from separate datasets. We identify the gene expression levels by matching the gene ID to the ones in the
+          </p>
+        </div>
+        <div className="col-lg-8">
+          <h2>Methodology</h2>
+          <p>
+          Our methodology is still in a trial-and-error phase. The steps described here are continually evaluated and redesigned to produce the most effective ML model possible.
+              The text below describes the most recent model and data preprocessing architecture as of June 19th. 
+          </p>
+          <h3> Data preprocessing </h3>
+          <p>
+              Our ML model uses genomic data from a GTF annotation file and the hg38 reference genome (hg38.fa), obtained from a publicly available dataset from the National Institutes of Health (NIH), and extracts the required gene information,
+              including gene IDs, chromosome start and end positions, Median Protein Expression (MPE), and Transcription Start Site (TSS) positions. This information is extracted using the Sequence Alignment/Map (SAM) tools. 
+              When extracting gene information from the GTF file, we filter out genes that are not relevant to the model, such as histone genes and genes on chromosome Y.
+              <br /><br />
+              Our software joins information from separate datasets using the gene ID and the MPE value. We identify gene expression levels by matching each gene ID to the entries in the
              Cap Analysis of Gene Expression (CAGE) file and locate the existing Transcription Start Sites (TSS).
               Our work focuses on the highest-scoring TSS out of five to identify the most active promoter region. Five hundred bases are extracted upstream and downstream of each selected TSS to capture the complete promoter region needed by the ML model. 
-
-              After preprocessing, gene ID, promoter sequence areas, and MPE values are formatted as a “DataFrame.” These data can be used for further analysis or as training data for machine learning models, such as our proposed ML model for neoantigen gene expression. The results are saved to a CSV file for ML training. 
-
-              From the CSV, the nucleotide sequences undergo tokenization with different length parameters currently being tested (for example, len = 4 would be ATGC | ATGC | ACGT | CCGA). The efficacy of different data structures is then evaluated by comparing the obtained MPE expression with the original after the dataset is used in the model. Regarding ML model training, the dataset is split into 80% training and 20% testing. After extracting the data, we ensure it is all transformed from lower to upper case and then apply a Log Transformation for more dispersed and balanced data.
-
+              <br /><br />
+              After preprocessing, the gene IDs, promoter sequences, and MPE values are stored in a DataFrame. These data can be used for further analysis or as training data for machine learning models, such as our proposed model for neoantigen gene expression. 
+              The results are saved to a CSV file for ML training. 
+              <br /><br />
+              From the CSV, the nucleotide sequences are tokenized, with different token lengths currently being tested (for example, len = 4 would give ATGC | ATGC | ACGT | CCGA). 
+              The efficacy of each data structure is then evaluated by comparing the MPE predicted by the model with the original value. For ML model training, the dataset is split into 80% training and 20% testing. After extraction, all sequences are converted from lower to upper case, and a log transformation is applied to the expression values so that they are more evenly distributed.
+              <br /><br />


               Figure 2: Gene sequence and given Median Protein Expression (MPE) from the dataset 
+
+          </p>
+          <h3> Machine Learning Architecture </h3>
+          <p> 
+            We are currently testing multiple model designs. The performance of each design is evaluated by how far its predictions deviate from the original MPE. Our experimental stage consists of three different designs of the ML workflow. 
+            <br /><br />
+            Figure 3: Different current ML workflows 
+            <br /><br />
+            a. The processed DNA sequences are fed into the Enformer model to predict MPE. <br />
+            b. The processed DNA sequences are tokenized and fed into a Recurrent Neural Network (RNN) model to predict MPE. <br />
+            c. The processed DNA sequences are passed through DNABERT and a traditional ML model to predict MPE. 
+            <br /><br />
+            Our team chose the DNABERT and Enformer models because both are designed to take DNA sequences as input. An RNN is also used because it is suited to sequence-based data and self-supervised learning. 
+            The selection process combines predictions from the HLA compatibility classifier (not included in this study) with the predicted expression level (this study) to ensure that selected proteins are both immunologically relevant and biologically practical for DC vaccine development.  
+            This dual filtering strategy ensures that only proteins that are potentially good matches for the patient's HLA type and are expressed at sufficient levels in tumor cells are chosen as candidates. 
          </p>
        </div>
        <div className="col-lg-4">
          <h2>References</h2>
          <hr />
          <p>
-            iGEM teams are encouraged to record references you use during the
-            course of your research. They should be posted somewhere on your
-            wiki so that judges and other visitors can see how you thought about
-            your project and what works inspired you.
+            [1] J. Ferlay et al., “Cancer incidence and mortality patterns in Europe: Estimates for 40 countries in 2012,” Eur. J. Cancer, vol. 49, no. 6, pp. 1374–1403, Apr. 2013, doi: 10.1016/j.ejca.2012.12.027. <br />
+            [2] N. Azamjah, Y. Soltan-Zadeh, and F. Zayeri, “Global Trend of Breast Cancer Mortality Rate: A 25-Year Study,” Asian Pac. J. Cancer Prev. APJCP, vol. 20, no. 7, pp. 2015–2020, 2019, doi: 10.31557/APJCP.2019.20.7.2015. <br />
+            [3] S. Mukem, H. Sriplung, E. McNeil, and V. Tangcharoensathien, “Breast cancer screening among women in Thailand: Analyses of population-based household surveys,” J. Med. Assoc. Thai., vol. 97, pp. 1106–18, Nov. 2014. <br />
+            [4] W. Insamran and S. Sangrajrang, “National Cancer Control Program of Thailand,” Asian Pac. J. Cancer Prev. APJCP, vol. 21, no. 3, pp. 577–582, Mar. 2020, doi: 10.31557/APJCP.2020.21.3.577. <br />
+            [5] L. Gelao et al., “Dendritic Cell-Based Vaccines: Clinical Applications in Breast Cancer,” Immunotherapy, vol. 6, no. 3, pp. 349–360, Mar. 2014, doi: 10.2217/imt.13.169. <br />
+            [6] L. Tang, R. Zhang, X. Zhang, and L. Yang, “Personalized Neoantigen-Pulsed DC Vaccines: Advances in Clinical Applications,” Front. Oncol., vol. 11, Jul. 2021, doi: 10.3389/fonc.2021.701777. <br />
+            [7] D. Qian, J. Li, M. Huang, Q. Cui, X. Liu, and K. Sun, “Dendritic cell vaccines in breast cancer: Immune modulation and immunotherapy,” Biomed. Pharmacother., vol. 162, p. 114685, Jun. 2023, doi: 10.1016/j.biopha.2023.114685. <br />
+            [8] T. Jiang et al., “Tumor neoantigens: From basic research to clinical applications,” J. Hematol. Oncol., vol. 12, Sep. 2019, doi: 10.1186/s13045-019-0787-5. <br />
+            [9] A. Nelde, H.-G. Rammensee, and J. S. Walz, “The Peptide Vaccine of the Future,” Mol. Cell. Proteomics, vol. 20, p. 100022, 2021, doi: 10.1074/mcp.R120.002309. <br />
+            [10] “Cancer Neoantigens: Challenges and Future Directions for Prediction, Prioritization, and Validation,” Front. Oncol. Accessed: May 31, 2024. [Online]. Available: https://www.frontiersin.org/journals/oncology/articles/10.3389/fonc.2022.836821/full <br />
+            [11] J. Li and Y. Zhang, “Relationship between promoter sequence and its strength in gene expression,” Eur. Phys. J. E, vol. 37, no. 9, p. 86, Sep. 2014, doi: 10.1140/epje/i2014-14086-1. <br />
+            [12] “Promoter sequence and architecture determine expression variability and confer robustness to genetic variants.” Accessed: May 31, 2024. [Online]. Available: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9844987/ <br />
+            [13] V. Agarwal and J. Shendure, “Predicting mRNA Abundance Directly from Genomic Sequence Using Deep Convolutional Neural Networks,” Cell Rep., vol. 31, no. 7, p. 107663, May 2020, doi: 10.1016/j.celrep.2020.107663.
          </p>
        </div>
      </div>
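
Note on the Methodology text added in this commit: the preprocessing steps it describes (a 500-base window on each side of the TSS, upper-casing, fixed-length tokenization, log transformation, and an 80/20 split) can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not the project's actual pipeline. The function names, the 0-based TSS index, the log10(x + 1) pseudocount, the non-overlapping tokens, and the fixed shuffle seed are all hypothetical, and strand orientation is ignored.

```python
import math
import random


def promoter_window(chrom_seq, tss, flank=500):
    """Return `flank` bases upstream and downstream of a TSS, upper-cased.

    Assumes `tss` is a 0-based index into `chrom_seq`; the window is
    clipped at the chromosome ends. Strand is ignored (assumption).
    """
    start = max(0, tss - flank)
    end = min(len(chrom_seq), tss + flank)
    return chrom_seq[start:end].upper()


def tokenize(seq, k=4):
    """Split a sequence into consecutive, non-overlapping tokens of length k.

    A trailing fragment shorter than k is dropped (assumption).
    """
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]


def log_transform(value):
    """log10(value + 1); the +1 pseudocount keeps zero expression defined."""
    return math.log10(value + 1.0)


def split_train_test(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of `rows` and split it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]
```

With `k=4`, `tokenize("ATGCATGCACGTCCGA")` reproduces the example in the text (ATGC | ATGC | ACGT | CCGA). Other token lengths can be compared by changing `k` only, which keeps the rest of the pipeline fixed while the tokenization is being evaluated.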