Open Protein Modeling Consortium

About

The Open Protein Modeling Consortium (OPMC) is a collaborative initiative aimed at bringing together the efforts of the protein research community. Its mission is to foster the sharing and co-development of resources, with an emphasis on individually trained decentralized models, helping to advance protein modeling through collective contributions. OPMC provides platforms/tools that support diverse protein prediction tasks, striving to make advanced protein modeling more accessible to researchers, regardless of their level of expertise in machine learning.

Features

SaprotHub is our collaborative community platform. It aims to empower biologists by enabling them to create and train their own models without advanced ML or coding expertise. We provide various preprocessed datasets and fine-tuned model checkpoints that users can use directly.

You can also try out SaprotHub in a Google Colab notebook without any setup required.

SaProt

Protein Language Modeling with Structure-aware Vocabulary

ColabSaprot SaprotHub

ProTrek

Navigating the Protein Universe through Tri-Modal Contrastive Learning

ColabProTrek ProTrekHub

ProtT5

Towards Cracking the Language of Life’s Code Through Self-Supervised Learning

ColabProtT5 T5Hub

METL

Biophysics-based protein language models for protein engineering

METL Colab and Hub

SeProt

A sequence-only ColabPLM family, including ColabESM1b, ColabESM2, and ColabProtBert, among others

ColabSeprot SeprotHub

ESM3

Simulating 500 million years of evolution with a language model

ESM-Play-V3

ESMC

ESM Cambrian: Revealing the mysteries of proteins with unsupervised learning

ESM-Play-VC

OPMC Authors

OPMC Senior Authors (Random Order)

Sergey Ovchinnikov
MIT

Martin Steinegger
Seoul National University

Kevin K. Yang
Microsoft Research

Michael Heinzinger
Institute of Computational Biology

Pascal Notin
Harvard Medical School

Pranam Chatterjee
University of Pennsylvania

Jia Zheng
Westlake University

Stan Z. Li
Westlake University

Xing Chang
Westlake University

Huaizong Shen
Westlake University

Fajie Yuan
Westlake University

Noelia Ferruz
The Centre for Genomic Regulation (CRG)

Rohit Singh
Duke University

Debora S. Marks
Harvard Medical School

Anping Zeng
Westlake University

Jijie Chai
Westlake University

Feng Ju
Westlake University

Anthony Gitter
University of Wisconsin-Madison

Anum Glasgow
Columbia University

Milot Mirdita
Seoul National University

Philip M. Kim
University of Toronto

Christopher Snow
Colorado State University

Vasilis Ntranos
University of California

Philip A. Romero
Duke University

Jianyi Yang
Shandong University

Caixia Gao
Chinese Academy of Sciences

Liang Hong
Shanghai Jiao Tong University

Michael Bronstein
University of Oxford

Tong Si
Chinese Academy of Sciences

Jianming Liu
Westlake University

OPMC Regular Authors (Random Order)

Jin Su
Westlake University

Tianli Tao
Westlake University

Chenchen Han
Westlake University

Jiawei Zhang
Westlake University

Yuliang Fan
Westlake University

Yuyang Tao
ShanghaiTech University

Fengyuan Dai
Westlake University

Xuting Zhang
Westlake University

Yuyang Zhou
Westlake University

Junjie Shan
Westlake University

Xibin Zhou
Westlake University

Yan He
Westlake University

Shiyu Jiang
Westlake University

Dacheng Ma
Westlake University

Yuan Gao
University of Chinese Academy of Sciences

Linqi Cheng
Department of Chemistry, Rice University

Xinzhe Zheng
Department of Chemistry, Rice University

Lei Chen
Shenzhen Lions King Hi-Tech Co., Ltd

Rui Long
Shenzhen Lions King Hi-Tech Co., Ltd

Lingjie Kong
South China Agricultural University

Zhongji Pu
Xianghu laboratory

Jiaming Guan
Hefei MiQro Era Digital Technology Co. Ltd

Tianyuan Zhang
Suzhou Polynovo Biotech Co., Ltd., Suzhou 215129, China

Cheng Li
Suzhou Polynovo Biotech Co., Ltd., Suzhou 215129, China

Qingyan Yuan
Suzhou Polynovo Biotech Co., Ltd., Suzhou 215129, China

Join Us

OPMC Membership and contributions

Steering committee Responsibilities:

The steering committee will assist in fulfilling several long-term academic responsibilities, including:

Providing Guidance: Offering constructive suggestions for OPMC and SaprotHub. Supporting academic activities such as organizing workshops and tutorials and participating in peer review processes.
Advising: Providing valuable advice for the SaprotHub community or the SaprotHub paper.
Facilitating involvement: Encouraging direct involvement of team members in the development of SaprotHub or OPMC.
Contributing Resources: Contributing high-quality open-source datasets and models to enhance OPMC's resources.
Reviewing Membership: Participating in the review process for new members joining OPMC.
Additional contributions: Any other contributions that substantially advance the OPMC community.

Regular Members (Developers) contributions:

The main contributions expected from regular members include:

Model contribution: Offering more fine-tuned protein language models (mainly adapters) for various protein function predictions.
Dataset contribution: Contributing more high-quality datasets to enrich the available resources.
Wet-lab validation: Conducting wet-lab experimental validation using Saprot that leads to new biological insights documented in a research paper.
Development Participation: Participating in the development of SaprotHub or ColabSaprot.
Additional contributions: Making other contributions or providing services that advance the goals of OPMC or SaprotHub.

Contact

Dr. Fajie Yuan (yuanfajie [AT] westlake.edu.cn)

Jin Su (sujin [AT] westlake.edu.cn)

FAQs

Q1: It seems like OPMC and SaprotHub are intertwined but not exactly the same.

Yes. OPMC is a grand goal, and in the SaprotHub paper it is primarily presented as a concept and vision. The paper introduces OPMC and implements SaprotHub as a pioneering example to drive the initial realization of OPMC. Achieving a broader implementation of OPMC requires continuous efforts from the entire community.

Q2: I'm very interested in the OPMC side of this project. Would I be able to support OPMC independently?

Yes, you can. OPMC is not tied exclusively to SaprotHub. SaprotHub serves as an initial implementation case within the broader OPMC concept. We also welcome the inclusion of new protein models in OPMC. There are generally two ways to contribute: either independently of SaprotHub, such as building ESMHub or ProtTransHub, or by joining SaprotHub. While SaprotHub is named after its first model, Saprot, it is not limited to Saprot alone and welcomes the inclusion of other language models. The concept of OPMC originated from the SaprotHub paper, so if you would like your protein model to be part of OPMC, or if you adopt a construction approach similar to SaprotHub's, we encourage you to cite the source paper. Also see Q9.

Q3: What's the relation between OPMC and the OpenFold Consortium?

The goal of the OpenFold Consortium is to develop free and open-source software tools. This differs from the goals of OPMC. OPMC aims to make it easy for all biologists (especially those without machine learning backgrounds and coding skills) to train their own protein models, and to share these models with the community members, allowing for integration and collaborative development on top of the existing community models.

Additionally, so far, the OpenFold Consortium seems to be focusing more on protein structure prediction, while OPMC is more focused on protein function prediction. Furthermore, the number of protein function task categories is far greater than the number of structure tasks. As a result, biologists often have to fine-tune large pre-trained protein models based on their own training data, which is a key feature of OPMC.

Q4: Is the idea to create a company that provides the resources for biologists to do model training? I'm unsure of the vision here, since a lot of model training is resource and data constrained. It would be hard to create something that allows "every biologist to train their own AI models with just a few clicks." Who would provide the resources in this case?

No. The primary motivation behind OPMC is to enable biologists to participate in protein model training and collaborative development. It does not involve creating a company or commercial operation.

Currently, we do not provide free training resources. Users can purchase access to GPUs, such as the A100, on platforms like Colab. OPMC primarily supports fine-tuning tasks or direct prediction tasks for protein language models, rather than pre-training. These tasks typically do not require excessively expensive computational power. With a budget of around $10, one can easily complete training and prediction tasks on several thousand samples. This cost is manageable for most individuals and academic institutions. Colab also offers free GPU resources, but they may not be sufficient for some of your tasks.

In the future, we may explore options such as applying for funding or accepting donations to provide some free computational resources to users. Please note that the purpose of SaprotHub and OPMC is not to provide free computational resources.

Q5: We are open sourcing models as well, so it would be interesting to collaborate once we release these models. However, it seems like currently the hub is geared towards SaProt as the main model of choice.

Saprot is the first model to join the hub, so we named it SaprotHub. However, SaprotHub can also accept other protein models, such as ESM. Of course, you can also independently develop your own model hub. In the future, we will create a webpage for OPMC that will include all the participating models.

The reason we adopted the Saprot model is that it is a near-universal model, capable of supporting any protein-level and residue-level prediction task, including regression, classification, ranking, as well as zero-shot mutational effect prediction and sequence design tasks. Saprot is also a state-of-the-art protein language model in the community.

We also hope to include as many other protein language models as possible, but with limited human resources we are unable to integrate all the existing protein language models ourselves. This is precisely the purpose of building a community. The development of ColabSaprot took us approximately 4 months; now that ColabSaprot is open source, it will be much easier to implement similar functionality for other protein language models.

We believe that with the joint efforts of the entire community, the OPMC community store can become more diverse, and biologists can choose the models that best suit their needs. Therefore, we sincerely invite researchers who are interested in OPMC to join us, and if you have better suggestions, we welcome you to join us in co-building OPMC.

Q6: ESM, AlphaFold, and OpenFold models are not mentioned in the hub.

OPMC mainly focuses on protein function prediction, so ESM is a good fit. AlphaFold and OpenFold target protein structure prediction and could be independent of OPMC or SaprotHub. Please note that developing another ColabESM will also take some time, and we welcome researchers to integrate ESM or other models into the hub.

The SaprotHub paper primarily focuses on collaboration and sharing within the framework of a single backbone model, because the Adapter technique greatly reduces storage and communication costs: users only need to operate on the adapters rather than the large backbone model. Since these models are based on the same backbone network, the input format, network, and parameter interfaces are more consistent, which serves as the basis for community sharing, collaboration, and co-construction.

Sharing between different backbone models remains a challenge at the moment; it is an interesting research direction, but there is no ideal solution yet. For example, if users want to collaborate between ESM15B and Saprot650M, they would need to upload and download the two complete models, significantly increasing communication and maintenance costs. As the number of integrated models increases, the demand on GPU performance would also increase rapidly.

However, by using the same backbone model and the Adapter mechanism, these difficulties can be elegantly solved. The Adapters in SaprotHub are much like grafting in biology. Just as a single tree can bear different kinds of fruit, Saprot acts like the trunk, and the adapters for various downstream tasks resemble the different fruits on the Saprot tree.
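The storage and communication argument above can be made concrete with rough arithmetic. The sketch below is only illustrative: it assumes a LoRA-style adapter (the text says only "Adapter"), and the hidden size, layer count, rank, and number of adapted matrices per layer are hypothetical values chosen for the example, not measurements from SaprotHub.

```python
# Back-of-the-envelope comparison: sharing one adapter vs. one full backbone.
# All numbers below are illustrative assumptions, not SaprotHub measurements.

def lora_param_count(hidden_size: int, num_layers: int, rank: int,
                     matrices_per_layer: int = 2) -> int:
    """Parameters added by LoRA-style adapters on square projection matrices.

    Each adapted (hidden_size x hidden_size) matrix gets two low-rank factors:
    one of shape (hidden_size x rank) and one of shape (rank x hidden_size).
    """
    per_matrix = 2 * hidden_size * rank
    return num_layers * matrices_per_layer * per_matrix


def size_in_mb(num_params: int, bytes_per_param: int = 4) -> float:
    """Storage size in megabytes, assuming 32-bit floats by default."""
    return num_params * bytes_per_param / 1e6


# Assumed backbone: roughly 650M parameters, 33 layers, hidden size 1280.
BACKBONE_PARAMS = 650_000_000
adapter_params = lora_param_count(hidden_size=1280, num_layers=33, rank=8)

print(f"backbone: {size_in_mb(BACKBONE_PARAMS):.0f} MB")
print(f"adapter:  {size_in_mb(adapter_params):.1f} MB")
print(f"adapter / backbone: {adapter_params / BACKBONE_PARAMS:.2%}")
```

Under these assumptions, each shared task costs a few megabytes rather than a few gigabytes, which is why many task-specific adapters can sit on one shared backbone.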

Additionally, due to the differences in model architecture, input, and output across different models, it is difficult to design a unified interface. Therefore, this paper serves as an initial exploration of OPMC and does not cover collaboration between different backbone models. This may require more effort from the community, but holds promise for the future.

Q8: How does this differentiate SaprotHub from Hugging Face?

SaprotHub primarily focuses on storing lightweight Adapters, whereas Hugging Face stores complete pre-trained models. SaprotHub adopts the Adapter mechanism, which enables biologists to easily share, co-develop, and collaborate within ColabSaprot.

The goal of SaprotHub is to allow all biologists to train their own protein models even without machine learning and coding background, while Hugging Face's objective is to open-source the model weights without considering the easy training aspect.

SaprotHub is dedicated to establishing the AI model community for proteins, while Hugging Face has a broader scope. Therefore, SaprotHub's Adapter store can be built on top of Hugging Face or developed independently.

Q9: If I develop other protein language models (PLMs) and online platforms following SaprotHub and ColabSaprot, can I be an author on the SaprotHub paper?

All OPMC members will be listed as authors on the SaprotHub paper before its final revision, which is expected to take 4-12 months. Please note that eligibility to become an OPMC regular member is determined by the steering committee.

If an OPMC author is granted authorship primarily for developing another PLMHub, they are required to acknowledge that this model automatically becomes part of the OPMC framework. If they publish a paper, they should mention this somewhere in the paper. Researchers using this new PLMHub should also cite the original OPMC literature.

Q10: How can I become a member of OPMC or an author of the SaprotHub paper?

Before the final revision of the SaprotHub paper, all OPMC members will be automatically included as authors. After the paper is published, individuals can still join OPMC but can no longer be listed as authors, because the journal requires the author list to be fixed in the final version.

Regarding how to join OPMC, please refer to the Join Us section.

All in all, if you can come up with some cool ideas or novel ways to enhance the impact and influence of Saprot, ColabSaprot, SaprotHub, or OPMC, you may have the opportunity to be listed as an author.

Q11: If I have made a lot of valuable contributions to OPMC, can I become an OPMC member together with my supervisor?

Yes. If you have made significant valuable contributions to OPMC, it is possible to become an OPMC member alongside your PhD/postgraduate or internship supervisor. However, the acceptance of your contributions and membership status will be determined by the steering committee. Generally, in such cases, you would need to demonstrate more substantial contributions compared to regular OPMC members. You need to provide official documentation of your relationship with your supervisor, such as an official letter or document confirming their role and support.