The Open Protein Modeling Consortium (OPMC) is a collaborative initiative designed to unify the efforts of the protein research community. Its mission is to facilitate the sharing and co-construction of resources, with a particular focus on individually trained decentralized models, thereby advancing protein modeling through collective contributions. OPMC offers a platform that supports a wide range of protein function predictions, aiming to make advanced protein modeling accessible to researchers irrespective of their machine learning expertise.
We have a SaprotHub as a collaborative community, aiming to empowers biologists by enabling them to create and train their own models without the need for advanced ML and coding expertise. We provide various preprocessed datasets and fine-tuned model checkpoints for users to directly utilize.
You can also try out SaprotHub in a Google Colab notebook without any setup required.
Navigating the Protein Universe through Tri-Modal Contrastive Learning
ColabProTrek ProTrekHubTowards Cracking the Language of Life’s Code Through Self-Supervised Learning
ColabProtT5 T5HubWe plan to invite 20+ top researchers in the Al4protein field to join the steering committee and attract 50-100 regular members to build OPMC.
The steering committee will assist in fulfilling several long-term academic responsibilities, including:
The main contributions expected from regular members include:
Yes, OPMC is a grand goal, and in this paper, it is primarily presented as a concept and vision. The paper introduces OPMC and implements SaprotHub as a pioneering example to drive the initial realization of OPMC. Achieving a broader implementation of OPMC requires continuous efforts from the entire community.
Yes, you can. OPMC is not tied exclusively to SaprotHub. SaprotHub serves as an initial implementation case within the broader OPMC concept. We also welcome the inclusion of new protein models in OPMC. There are generally two ways to contribute: either independently of SaprotHub, such as building ESMHub or ProtTransHub, or by joining SaprotHub. While SaprotHub is named after its first model, Saprot, it is not limited to Saprot alone and welcomes the inclusion of other language models. The concept of OPMC originated from the SaprotHub paper, so if you would like your protein model to be part of OPMC or if you adopt the similar construction approach of SaprotHub, we encourage you to cite the source paper. Also see Q9.
The goal of the OpenFold Consortium is to develop free and open-source software tools. This differs from the goals of OPMC. OPMC aims to make it easy for all biologists (especially those without machine learning backgrounds and coding skills) to train their own protein models, and to share these models with the community members, allowing for integration and collaborative development on top of the existing community models.
Additionally, so far, the OpenFold Consortium seems to be focusing more on protein structure prediction, while OPMC is more focused on protein function prediction. Furthermore, the number of protein function task categories is far greater than the number of structure tasks. As a result, biologists often have to fine-tune large pre-trained protein models based on their own training data, which is a key feature of OPMC.
No, the primary motivation behind OPMC is to enable biologists to participate in protein model training and collaborative development, without direct involvement of creating a company or commercial operation.
Currently, we do not provide free training resources. Users have the option to purchase GPUs, such as the A100, on platforms like Colab. OPMC primarily supports fine-tuning tasks or direct prediction tasks for protein language models, rather than pre-training. These tasks typically do not require excessively expensive computational power. With a budget of around $10, one can easily complete training and prediction tasks on several thousand samples. This cost is manageable for most individuals and academic institutions. There are also free GPU resources in Colab but they may not be sufficient for some of your tasks.
In the future, we may explore options such as applying for funding or accepting donations to provide some free computational resources to users. Please note that the purpose of SaprotHub and OPMC is not to provide free computational resources.
Saprot is the first model to join the hub, so we named it SaprotHub. However, SaprotHub can also accept other protein models, such as ESM. Of course, you can also independently develop your own model hub. In the future, we will create a webpage for OPMC that will include all the participating models.
The reason we adopted the Saprot model is that it is a near-universal model, capable of supporting any protein and residue-level prediction task, including regression, classification, ranking, as well as zero-shot mutational effect prediction and sequence design tasks. Saprot is also the state-of-the-art protein language model in the community.
Additionally, we also hope to include as many other protein language models as possible, but due to limited human resources, we are unable to integrate all the existing protein language models. This is precisely the purpose of building a community (the development of ColabSaprot took us approximately 4 months. Of course, with the open-sourcing of ColabSaprot, it will be much easier to implement similar functionality for other protein language models).
We believe that with the joint efforts of the entire community, the OPMC community store can become more diverse, and biologists can choose the models that best suit their needs. Therefore, we sincerely invite researchers who are interested in OPMC to join us, and if you have better suggestions, we welcome you to join us in co-building OPMC.
OPMC mainly focuses on protein function prediction. So ESM is a good fit, AlphaFold and OpenFold target at protein structure prediction, and could be independent of OPMC or SaprotHub. But please note that developing another ColabESM will also take some time, and we welcome researchers to integrate the ESM or other models onto the hub.
The SaprotHub paper primarily focuses on collaboration and sharing within the framework of one backbone model, as this can greatly reduce storage and communication costs by leveraging Adapter technique – users just need to operate on the adapters rather than the large backbone model. Since these models are based on the same backbone network, the input format, network and parameter interfaces are more consistent, which serves as the basic for community sharing, collaboration and co-construction.
As for sharing between different backbone models, this remains a challenge at the moment, although it is an interesting research direction without an ideal solution yet. For example, if users want to collaborate between ESM15B and Saprot650M, they would need to upload and download the two complete models, significantly increasing communication and maintenance costs. As the number of models integrated increases, the demand on GPU performance would also increase rapidly.
However, by using the same backbone model and the Adapter mechanism, these difficulties can be elegantly solved. The Adapters in SaprotHub is just like grafting techniques in biology. Just as a single tree can bear different kinds of fruit, Saprot acts like the trunk, and the adapters for various downstream tasks resemble the different fruits on the Saprot tree.
Additionally, due to the differences in model architecture, input, and output across different models, it is difficult to design a unified interface. Therefore, this paper serves as an initial exploration of OPMC and does not cover collaboration between different backbone models. This may require more effort from the community, but holds promise for the future.
SaprotHub primarily focuses on storing lightweight Adapters, whereas Hugging Face stores complete pre-trained models. SaprotHub adopts the Adapter mechanism, which enables biologists to easily share, co-develop, and collaborate within ColabSaprot.
The goal of SaprotHub is to allow all biologists to train their own protein models even without machine learning and coding background, while Hugging Face's objective is to open-source the model weights without considering the easy training aspect.
SaprotHub is dedicated to establishing the AI model community for proteins, while Hugging Face has a broader scope. Therefore, SaprotHub's Adapter store can be built on top of Hugging Face or developed independently.
All OPMC members will be listed as authors in the SaprotHub paper before its final revision, which is expected to take 4-12 months. Author's name will be included in the paper. Please note that eligibility to become an OPMC regular member is determined by the steering committee.
If an OPMC author is primarily granted authorship recognition for developing other PLMHubs, they are required to acknowledge that this model automatically becomes part of the OPMC framework. In case they publish a paper, they should menton this somewhere in the paper. Researchers using this new PLMHub should also cite the original OPMC literature.
Before the final revision of SaprotHub, all OPMC members will be automatically included as authors. However, after the publication of SaprotHub, individuals can still join OPMC but will not be able to be listed as authors in the paper, as it is subject to the requirements of the journal publication - the final version needs to determine the author list.
Regarding how to join OPMC, please refer to: here.
All in all, if you can come up with some cool ideas or novel ways to enhance the impact and influence of Saprot, ColabSaprot, SaprotHub, or OPMC, you may have the opportunity to be listed as an author.
Yes. If you have made significant valuable contributions to OPMC, it is possible to become an OPMC member alongside your PhD/postgraduate or internship supervisor. However, the acceptance of your contributions and membership status will be determined by the steering committee. Generally, in such cases, you would need to demonstrate more substantial contributions compared to regular OPMC members. You need to provide official documentation of your relationship with your supervisor, such as an official letter or document confirming their role and support.
Yes, you can. But you need to agree to join OPMC, and all papers that use your ColabPLM for research are encouraged to cite the SaprotHub paper, which is the source literature of OPMC.
Yes, we plan to submit it to Nature Methods in the form of brief communication. The editor-in-chief of NBT said that the author list of OPMC could be confirmed in the final version. So you have several months time to contribute and join us.