With StableLM, Stability AI has released two large language models (LLMs) as open source, with 3 and 7 billion parameters respectively. The release is an alpha version. Developers are free to use, study, and adapt the models for research and commercial purposes, provided they respect the license.
StableLM-3B and StableLM-7B are licensed under CC BY-SA 4.0, a copyleft license that permits the licensed material to be reproduced and redistributed in any format. Anyone who works with StableLM may modify and build on the models for any purpose, including commercial ones. However, derivatives and products built on them always inherit the copyleft license.
StableLM is under a copyleft license
This means that new models derived from StableLM must credit the original author (Stability AI) and pass on the license in unmodified form. It is forbidden to turn works created in this way into closed source, for example by declaring them your own intellectual property or by changing the inherited license. These requirements cannot legally be overridden by additional clauses or technical measures; CC BY-SA 4.0 is considered a particularly strong copyleft license.
As announced by Stability AI CEO Emad Mostaque, StableLM aims to provide an open, transparent, and scalable alternative to proprietary AI models such as those from OpenAI. According to the release blog post, models with 15 to 65 billion parameters are to follow in the foreseeable future. The models of the StableLM series are intended to generate text and source code and, according to Mostaque, can serve as the basis for numerous applications. With these relatively small LLMs, the blog post says, Stability AI wants to show that even smaller models can deliver high performance, provided they have been trained appropriately and have an efficient architecture.
Are models based on The Pile open source?
Stability AI had previously supported the work of the grassroots AI group EleutherAI, which released Pythia, a series of smaller base models for research, in early April 2023. Pythia-12B is, among other things, the starting model for Dolly 2.0 from Databricks, and OpenAssistant from LAION is also based on the open-source Pythia models. Experience with earlier open-source models from EleutherAI, such as GPT-J and GPT-NeoX, also fed into the current release of StableLM.
StableLM was trained on a new experimental dataset that builds on “The Pile”, the well-known 800-gigabyte dataset for training large language models, but is about three times larger, with a total of 1.5 trillion tokens. The dataset is considered problematic because it probably also contains copyrighted works (what consequences this will have for open-source licenses is still unresolved and also affects other projects that use this dataset or Common Crawl data from the Internet). In terms of training data size, StableLM is on par with the proprietary LLaMA model from Meta AI, which was made available to selected research projects and, following a leak, is also circulating on the Internet in offshoots that range from semi-official to illegal.
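To put these figures in proportion, the following minimal sketch derives two rough numbers from the values quoted above: the training tokens available per model parameter, and the token count of The Pile implied by the "about three times larger" comparison. It assumes each model is trained on the full 1.5 trillion tokens, which the announcement does not confirm, and the function names are purely illustrative.

```python
# Figures quoted in the article; everything derived from them is a rough estimate.
DATASET_TOKENS = 1.5e12  # size of the StableLM training dataset in tokens
MODEL_PARAMS = {"StableLM-3B": 3e9, "StableLM-7B": 7e9}
PILE_SIZE_RATIO = 3  # "about three times larger" than The Pile


def tokens_per_parameter(tokens: float, params: float) -> float:
    """Return how many training tokens are available per model parameter."""
    return tokens / params


def implied_pile_tokens(tokens: float, ratio: float) -> float:
    """Return the token count of The Pile implied by the size ratio."""
    return tokens / ratio


# Assumption: each model sees the whole dataset once.
for name, params in MODEL_PARAMS.items():
    ratio = tokens_per_parameter(DATASET_TOKENS, params)
    print(f"{name}: about {ratio:.0f} tokens per parameter")

pile = implied_pile_tokens(DATASET_TOKENS, PILE_SIZE_RATIO)
print(f"Implied size of The Pile: about {pile / 1e12:.1f} trillion tokens")
```

Under these assumptions, the smaller model has more than twice as much training data per parameter as the larger one, which illustrates the idea of compensating for a small parameter count with extensive training.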
RedPajama Openly Recreates LLaMA: Base Dataset Available
Another open-source project works with a self-built dataset of similar size: in mid-April 2023, a few days before the release of StableLM, a high-profile research collaboration from the USA and Canada, together with partners, published the training dataset for RedPajama, which comprises 1.2 trillion tokens and was likewise modeled on the LLaMA paper. RedPajama plans to release a state-of-the-art, open-source model series with strong performance and thus rebuild the unreleased LLaMA under a free license. Unlike StableLM, the RedPajama dataset, at least, is under the Apache 2.0 license, so models and applications trained with it can also be used commercially without restrictions.
LAION and the Open Letter
The Large-Scale Artificial Intelligence Open Network (LAION e.V.), which is involved in RedPajama, had previously announced that members of the network want to create large, state-of-the-art AI language models with capabilities comparable to the most powerful commercial offerings. A petition to set up an international high-performance computing cluster for AI is currently running, partly in response to the open letter from the Future of Life Institute, whose signatories, including Elon Musk and other prominent figures, called for a pause in the development of large AI models.
At the same time, Musk announced his own AI company, which is to compete with OpenAI under the domain x.ai and the working title TruthGPT. It is unlikely that this will be open-source AI.
Research models co-published
Alongside the alpha versions of StableLM, Stability AI is releasing a number of instruction-fine-tuned research models. These research models were fine-tuned on a combination of open-source datasets for conversational agents, namely Alpaca, GPT4All, Dolly, ShareGPT, and HH. They are expressly not suitable for commercial purposes and may only be used for research. Their license is the non-commercial CC BY-NC-SA 4.0, analogous to the license of Stanford’s Alpaca (one of the many LLaMA offshoots that Meta AI permits for research).
The release blog post includes some sample conversations with StableLM-7B, the larger of the two models released so far. StableLM fits seamlessly into the emerging movement of open-source AI models. Stability AI names three keywords for its goals: transparency, accessibility, and support. The open-source models are meant to support users, not to replace them. The aim is efficient, specialized, and practical AI applications that can also be implemented with smaller models. Stability AI expressly does not want to take part in the race for “God-like AI”. According to the blog post, the focus is on everyday applications and on uses that increase productivity and allow people to be more creative.
For researchers, it is relevant that they can “look under the hood” of models published in this way in order to jointly improve the traceability and explainability of AI models, identify risks, and develop safeguards. Organizations in the private and public sectors can fine-tune open-source models for their own purposes without sharing sensitive data or giving up control of their AI capabilities.
Technical report follows
StableLM is available in the Stability AI GitHub repository. A technical report and benchmarks for performance comparison are not yet available but are to follow “in the near future”. Concurrent with the release, a crowdsourcing program for Reinforcement Learning from Human Feedback (RLHF), a common practice for fine-tuning large language models, is set to begin.
Community efforts such as OpenAssistant, a project that has collaboratively published a high-quality, quality-assured, and freely accessible base dataset for AI assistants, serve as a model. More details can be found in the blog post.