Title: Large Protein Language Models and Their Prompt-based Learning
Abstract: Protein language models (PLMs) provide a powerful representation of protein sequences and their evolution through pre-training on vast protein sequence datasets. We introduce a structure-aware PLM, S-PLM, which integrates sequence and structure information to improve protein prediction. The model uses a multi-view contrastive learning strategy to align protein sequences with their structures in a shared embedding space. S-PLM applies a Swin Transformer to contact-map images of AlphaFold-predicted structures and fuses the result with sequence-based embeddings from ESM-2. It also provides a comprehensive set of fine-tuning tools that increase its predictive capacity, surpassing other PLMs. Additionally, we developed Prot2Seq to extend PLMs to multitask protein prediction using an autoregressive language modeling method. By adding task-specific tokens to the decoder, Prot2Seq showed improved performance when trained on multiple tasks simultaneously within a single model. Furthermore, we implemented a Parameter-Efficient Fine-Tuning framework, PEFT-SP, which applies various prompting methods, such as Prompt Tuning and Adapter Tuning, to the ESM-2 model for predicting signal peptides. PEFT-SP achieved significantly higher prediction accuracy for signal peptides than other methods, especially when training data were limited. Our studies show great promise for PLMs and prompt-based learning in protein prediction tasks.
Bio: Dong Xu is Curators’ Distinguished Professor in the Department of Electrical Engineering and Computer Science, with appointments in the Christopher S. Bond Life Sciences Center and the Informatics Institute at the University of Missouri-Columbia. He obtained his Ph.D. from the University of Illinois, Urbana-Champaign in 1995 and did two years of postdoctoral work at the US National Cancer Institute. He was a Staff Scientist at Oak Ridge National Laboratory until 2003 before joining the University of Missouri, where he served as Department Chair of Computer Science from 2007 to 2016 and Director of the Information Technology Program from 2017 to 2020. Over the past 30 years, he has conducted research in many areas of computational biology and bioinformatics, including single-cell data analysis, protein structure prediction and modeling, protein post-translational modifications, protein localization prediction, computational systems biology, biological information systems, and bioinformatics applications in humans, microbes, and plants. Since 2012, his research has focused on the interface between bioinformatics and deep learning. He has published more than 400 papers with more than 21,000 citations and an H-index of 73 according to Google Scholar. He was elected a Fellow of the American Association for the Advancement of Science (AAAS) in 2015 and a Fellow of the American Institute for Medical and Biological Engineering (AIMBE) in 2020.