Open Protein Modeling Consortium

About

The Open Protein Modeling Consortium (OPMC) is a collaborative initiative designed to unify the efforts of the protein research community. Its mission is to facilitate the sharing and co-construction of resources, with a particular focus on individually trained decentralized models, thereby advancing protein modeling through collective contributions. OPMC offers a platform that supports a wide range of protein function predictions, aiming to make advanced protein modeling accessible to researchers irrespective of their machine learning expertise.

Features

We have a SaprotHub as a collaborative community, aiming to empowers biologists by enabling them to create and train their own models without the need for advanced ML and coding expertise. We provide various preprocessed datasets and fine-tuned model checkpoints for users to directly utilize.

You can also try out SaprotHub in a Google Colab notebook without any setup required.

SaProt

Protein Language Modeling with Structure-aware Vocabulary

ColabSaprot SaprotHub

ProTrek

Navigating the Protein Universe through Tri-Modal Contrastive Learning

ColabProTrek ProTrekHub

ProtT5

Towards Cracking the Language of Life’s Code Through Self-Supervised Learning

ColabProtT5 T5Hub

METL

Biophysics-based protein language models for protein engineering

METL Colab and Hub

OPMC Authors

OPMC Senior Authors (Random Order)

Sergey Ovchinnikov
MIT
Martin Steinegger
Seoul National University
Kevin Yang
Microsoft
Michael Heinzinger
Technische Universität München
Pascal Notin
Harvard University
Pranam Chatterjee
Duke University
Jia Zheng
Westlake University
Stan Z. Li
Westlake University
Xing Chang
Westlake University
Huaizong Shen
Westlake University
Noelia Ferruz
The Centre for Genomic Regulation (CRG)
Rohit Singh
Duke University
Debora Marks
Harvard University
Anping Zeng
Westlake University
Jijie Chai
Westlake University
Anthony Gitter
University of Wisconsin-Madison
Anum Glasgow
Columbia University
Milot Mirdita
Seoul National University
Philip M. Kim
University of Toronto
Christopher Snow
Colorado State University
Vasilis Ntranos
University of California
Philip A. Romero
Duke University
Jianyi Yang
Shandong University
Caixia Gao
Chinese Academy of Sciences

OPMC Regular Authors (Random Order)

Jiawei Zhang
Westlake University
Yuyang Tao
ShanghaiTech University
Fengyuan Dai
Westlake University
Xuting Zhang
Westlake University

Join Us

OPMC Membership and contributions

Steering committee Membership:

We plan to invite 20+ top researchers in the Al4protein field to join the steering committee and attract 50-100 regular members to build OPMC.

Steering committee Responsibilities:

The steering committee will assist in fulfilling several long-term academic responsibilities, including:

  1. Providing Guidance: Offering constructive suggestions for OPMC and SaprotHub. Supporting academic activities such as organizing workshops, tutorials, and participating in peer review processes.
  2. Advising: Providing valuable advice for the SaprotHub community or SaprotHub paper.
  3. Facilitating involvement: Encouraging direct involvement of team members in the development of SaprotHub or OPMC. Contributing Resources.
  4. Contributing high-quality open-source datasets and models to enhance OPMC's resources.
  5. Reviewing Membership: Participating in the review process for new members joining OPMC.
  6. Additional contributions: Any other contributions largely advance the OPMC community.

Regular Members (Developers) contributions:

The main contributions expected from regular members include:

  1. Model contribution: Offering more fine-tuned protein language models (mainly adapters) for various protein function predictions.
  2. Dataset contribution: Contributing more high-quality datasets to enrich the available resources.
  3. Wet Experiment Validation: Conducted wet experiment validation utilizing Saprot, leading to the discovery of intriguing biological insights, which were subsequently documented in a research paper.
  4. Development Participation: Participating in the development of SaprotHub or ColabSaprot.
  5. Additional contributions: Making other contributions or providing services that advance the goals of OPMC or SaprotHub.

Contact

Dr. Fajie Yuan (yuanfajie [AT] westlake.edu.cn)

Jin Su (sujin [AT] westlake.edu.cn)

FAQs

Q1: It seems like OPMC and SaprotHub are intertwined but not exactly the same.

Yes, OPMC is a grand goal, and in this paper, it is primarily presented as a concept and vision. The paper introduces OPMC and implements SaprotHub as a pioneering example to drive the initial realization of OPMC. Achieving a broader implementation of OPMC requires continuous efforts from the entire community.

Q2: I'm very interested in the OPMC side of this project? Would I be able to support OPMC independently?

Yes, you can. OPMC is not tied exclusively to SaprotHub. SaprotHub serves as an initial implementation case within the broader OPMC concept. We also welcome the inclusion of new protein models in OPMC. There are generally two ways to contribute: either independently of SaprotHub, such as building ESMHub or ProtTransHub, or by joining SaprotHub. While SaprotHub is named after its first model, Saprot, it is not limited to Saprot alone and welcomes the inclusion of other language models. The concept of OPMC originated from the SaprotHub paper, so if you would like your protein model to be part of OPMC or if you adopt the similar construction approach of SaprotHub, we encourage you to cite the source paper. Also see Q9.

Q3: What's the relation between OPMC and the OpenFold Consortium?

The goal of the OpenFold Consortium is to develop free and open-source software tools. This differs from the goals of OPMC. OPMC aims to make it easy for all biologists (especially those without machine learning backgrounds and coding skills) to train their own protein models, and to share these models with the community members, allowing for integration and collaborative development on top of the existing community models.

Additionally, so far, the OpenFold Consortium seems to be focusing more on protein structure prediction, while OPMC is more focused on protein function prediction. Furthermore, the number of protein function task categories is far greater than the number of structure tasks. As a result, biologists often have to fine-tune large pre-trained protein models based on their own training data, which is a key feature of OPMC.

Q4: Is the idea to create a company that provides the resources for biologists to do model training? I'm unsure the vision here, since a lot of model training is resource and data constrained. It would be hard to create something where "every biologist to train their own AI models with just a few clicks." Who would provide the resources in this case?

No, the primary motivation behind OPMC is to enable biologists to participate in protein model training and collaborative development, without direct involvement of creating a company or commercial operation.

Currently, we do not provide free training resources. Users have the option to purchase GPUs, such as the A100, on platforms like Colab. OPMC primarily supports fine-tuning tasks or direct prediction tasks for protein language models, rather than pre-training. These tasks typically do not require excessively expensive computational power. With a budget of around $10, one can easily complete training and prediction tasks on several thousand samples. This cost is manageable for most individuals and academic institutions. There are also free GPU resources in Colab but they may not be sufficient for some of your tasks.

In the future, we may explore options such as applying for funding or accepting donations to provide some free computational resources to users. Please note that the purpose of SaprotHub and OPMC is not to provide free computational resources.

Q5: We are open sourcing models as well, so it would be interesting to collaborate once we release these models. However, it seems like currently the hub is geared towards SaProt as the main model of choice.

Saprot is the first model to join the hub, so we named it SaprotHub. However, SaprotHub can also accept other protein models, such as ESM. Of course, you can also independently develop your own model hub. In the future, we will create a webpage for OPMC that will include all the participating models.

The reason we adopted the Saprot model is that it is a near-universal model, capable of supporting any protein and residue-level prediction task, including regression, classification, ranking, as well as zero-shot mutational effect prediction and sequence design tasks. Saprot is also the state-of-the-art protein language model in the community.

Additionally, we also hope to include as many other protein language models as possible, but due to limited human resources, we are unable to integrate all the existing protein language models. This is precisely the purpose of building a community (the development of ColabSaprot took us approximately 4 months. Of course, with the open-sourcing of ColabSaprot, it will be much easier to implement similar functionality for other protein language models).

We believe that with the joint efforts of the entire community, the OPMC community store can become more diverse, and biologists can choose the models that best suit their needs. Therefore, we sincerely invite researchers who are interested in OPMC to join us, and if you have better suggestions, we welcome you to join us in co-building OPMC.

Q6: ESM, AlphaFold, and Openfold models are not mentioned in the hub.

OPMC mainly focuses on protein function prediction. So ESM is a good fit, AlphaFold and OpenFold target at protein structure prediction, and could be independent of OPMC or SaprotHub. But please note that developing another ColabESM will also take some time, and we welcome researchers to integrate the ESM or other models onto the hub.

Q7: How do you foresee collaborating and sharing other open models?

The SaprotHub paper primarily focuses on collaboration and sharing within the framework of one backbone model, as this can greatly reduce storage and communication costs by leveraging Adapter technique – users just need to operate on the adapters rather than the large backbone model. Since these models are based on the same backbone network, the input format, network and parameter interfaces are more consistent, which serves as the basic for community sharing, collaboration and co-construction.

As for sharing between different backbone models, this remains a challenge at the moment, although it is an interesting research direction without an ideal solution yet. For example, if users want to collaborate between ESM15B and Saprot650M, they would need to upload and download the two complete models, significantly increasing communication and maintenance costs. As the number of models integrated increases, the demand on GPU performance would also increase rapidly.

However, by using the same backbone model and the Adapter mechanism, these difficulties can be elegantly solved. The Adapters in SaprotHub is just like grafting techniques in biology. Just as a single tree can bear different kinds of fruit, Saprot acts like the trunk, and the adapters for various downstream tasks resemble the different fruits on the Saprot tree.

Additionally, due to the differences in model architecture, input, and output across different models, it is difficult to design a unified interface. Therefore, this paper serves as an initial exploration of OPMC and does not cover collaboration between different backbone models. This may require more effort from the community, but holds promise for the future.

Q8: How does this differentiate SaprotHub from Hugging Face?

SaprotHub primarily focuses on storing lightweight Adapters, whereas Hugging Face stores complete pre-trained models. SaprotHub adopts the Adapter mechanism, which enables biologists to easily share, co-develop, and collaborate within ColabSaprot.

The goal of SaprotHub is to allow all biologists to train their own protein models even without machine learning and coding background, while Hugging Face's objective is to open-source the model weights without considering the easy training aspect.

SaprotHub is dedicated to establishing the AI model community for proteins, while Hugging Face has a broader scope. Therefore, SaprotHub's Adapter store can be built on top of Hugging Face or developed independently.

Q9: If I develop other protein language models (PLMs) and online platforms following SaprotHub and ColabSaprot, can I be an author on the SaprotHub paper?

All OPMC members will be listed as authors in the SaprotHub paper before its final revision, which is expected to take 4-12 months. Author's name will be included in the paper. Please note that eligibility to become an OPMC regular member is determined by the steering committee.

If an OPMC author is primarily granted authorship recognition for developing other PLMHubs, they are required to acknowledge that this model automatically becomes part of the OPMC framework. In case they publish a paper, they should menton this somewhere in the paper. Researchers using this new PLMHub should also cite the original OPMC literature.

Q10: How to be a member of OPMC or an author of SaprotHub.

Before the final revision of SaprotHub, all OPMC members will be automatically included as authors. However, after the publication of SaprotHub, individuals can still join OPMC but will not be able to be listed as authors in the paper, as it is subject to the requirements of the journal publication - the final version needs to determine the author list.

Regarding how to join OPMC, please refer to: here.

All in all, if you can come up with some cool ideas or novel ways to enhance the impact and influence of Saprot, ColabSaprot, SaprotHub, or OPMC, you may have the opportunity to be listed as an author.

Q11: If I have made a lot of valuable contributions to OPMC, can I become an OPMC member together with my supervisor?

Yes. If you have made significant valuable contributions to OPMC, it is possible to become an OPMC member alongside your PhD/postgraduate or internship supervisor. However, the acceptance of your contributions and membership status will be determined by the steering committee. Generally, in such cases, you would need to demonstrate more substantial contributions compared to regular OPMC members. You need to provide official documentation of your relationship with your supervisor, such as an official letter or document confirming their role and support.

Q12: Can I use copy the code of ColabSaprot for my own protein language model to build another ColabPLM?

Yes, you can. But you need to agree to join OPMC, and all papers that use your ColabPLM for research are encouraged to cite the SaprotHub paper, which is the source literature of OPMC.

Q13: Would the paper be submitted to a journal.

Yes, we plan to submit it to Nature Methods in the form of brief communication. The editor-in-chief of NBT said that the author list of OPMC could be confirmed in the final version. So you have several months time to contribute and join us.