Dolly 2.0: Large AI language model freely available for commercial use

The software company Databricks has released Dolly 2.0 as open source. The name not only evokes the cloned sheep Dolly and OpenAI's similarly named image generator (DALL·E); the model is itself a clone of an open-source AI model (Pythia-12B from EleutherAI, more on this below). Similar to ChatGPT, the large language model is designed to interact with humans who give it natural-language instructions. According to the provider, what is special about the release is that Dolly 2.0 is freely available, specifically for commercial purposes and applications. Unlike with OpenAI, there are no fees for API access, and user data is not shared with third parties, according to the release blog post.

The 12-billion-parameter model is licensed for research and commercial use. Any organization can use it to create its own large language models (LLMs) and adapt them for its own purposes. Dolly 2.0 is an open-source, instruction-following, text-based AI language model, fine-tuned on a dataset built in-house by Databricks contributors. The company releases the code, the weights, and the fine-tuning dataset "databricks-dolly-15k" under a Creative Commons Attribution-ShareAlike 3.0 license: anyone may use, modify, and extend the dataset "for any purpose, including commercial applications".
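For readers who want to try the released weights themselves, the following is a minimal sketch using the Hugging Face transformers library. The hub ID databricks/dolly-v2-12b and the use of trust_remote_code for a bundled instruction-following generation pipeline are assumptions about how the weights are distributed, not details stated in this article:

```python
import torch
from transformers import pipeline

# Assumption: the released weights are hosted on the Hugging Face Hub
# under "databricks/dolly-v2-12b" and ship a custom instruction-following
# text-generation pipeline (hence trust_remote_code=True).
generate_text = pipeline(
    model="databricks/dolly-v2-12b",
    torch_dtype=torch.bfloat16,   # halves memory use on supported hardware
    trust_remote_code=True,
    device_map="auto",            # spread the 12B parameters across available devices
)

result = generate_text("Explain the difference between supervised and unsupervised learning.")
print(result[0]["generated_text"])
```

Loading in bfloat16 with device_map="auto" is a practical choice here: a 12-billion-parameter model in full precision would not fit on most single consumer GPUs.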

What is special, in addition to the free availability under an open-source license, is that the dataset was not derived second-hand from other AI models: more than 5,000 Databricks employees worked on the human-generated training dataset between March and April 2023, according to the blog post. According to the company, it is the first human-generated instruction dataset for training large language models.

Unlike the currently popular LLaMA offshoots, the Databricks model is an offshoot of EleutherAI's Pythia model family. The non-profit AI research group EleutherAI (roughly: "free AI") was formed in 2020 after OpenAI turned into a for-profit company through a partnership with Microsoft and moved away from its original open-source goals. With Pythia, EleutherAI released a suite of smaller language models in early April 2023 for the scientific study of large models, pre-trained on publicly available datasets.

EleutherAI's model family comprises 16 models ranging in size from 70 million to 12 billion parameters, with access to 154 checkpoints for each of these models. On top of that, the research group provides tools that researchers can use to reconstruct the training data for further investigation. Beyond this "highly controlled setup" for research, the Pythia models are available for free download on GitHub.
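The per-step checkpoints are what make Pythia attractive for studying how abilities emerge during training. As a hedged sketch: assuming the models are mirrored on the Hugging Face Hub under IDs such as EleutherAI/pythia-70m, with intermediate training states exposed as git revisions named like "step3000", an intermediate checkpoint could be loaded as follows:

```python
from transformers import AutoTokenizer, GPTNeoXForCausalLM

# Assumption: Pythia checkpoints are published on the Hugging Face Hub
# under "EleutherAI/pythia-<size>", with intermediate training states
# available as git revisions named "step<N>".
MODEL_ID = "EleutherAI/pythia-70m"   # smallest of the 16 models
REVISION = "step3000"                # one of the 154 training checkpoints

model = GPTNeoXForCausalLM.from_pretrained(MODEL_ID, revision=REVISION)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, revision=REVISION)

inputs = tokenizer("The capital of France is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=5)
print(tokenizer.decode(outputs[0]))
```

Comparing the same prompt across several revisions is the kind of "further investigation" the controlled setup is meant to enable.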

For Dolly 1.0, Databricks had used a dataset from the Stanford Alpaca project, which the Alpaca team had created with ChatGPT via the OpenAI API. OpenAI's Terms of Use prohibit using output obtained through the API to develop models that compete commercially with OpenAI.

The Databricks team took its cue for training Dolly 2.0 from a March 2022 OpenAI paper, "Training language models to follow instructions with human feedback". It states, for example, that OpenAI used a dataset of 13,000 instruction examples to teach the model instruction-following behavior. The challenge is that each of the 13,000 question-answer pairs must be an original and must not be copied from ChatGPT or the internet, as that would "contaminate the dataset". The 5,000-plus Databricks employees crowdsourced prompts and responses in seven task categories to add capabilities to the model (a sketch of inspecting the resulting dataset follows the list):

  • Open and closed Q&A, where open questions do not necessarily have a single correct answer, while for closed questions the answer is limited to a given body of knowledge or text excerpt
  • Extracting information from Wikipedia to answer factual questions
  • Summarizing information from Wikipedia
  • Brainstorming: open-ended collection of ideas and associations
  • Classifying text
  • Creative writing
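To see how these categories show up in practice, here is a minimal sketch of inspecting the dataset with the Hugging Face datasets library. The hub ID databricks/databricks-dolly-15k and the field names (instruction, context, response, category) are assumptions about the published format, not details from this article:

```python
from collections import Counter
from datasets import load_dataset

# Assumption: the dataset is hosted on the Hugging Face Hub under
# "databricks/databricks-dolly-15k" with the fields
# instruction / context / response / category.
ds = load_dataset("databricks/databricks-dolly-15k", split="train")

# How many of the 15,000 examples fall into each task category?
print(Counter(ds["category"]))

# Look at one record: the instruction, any reference text, and the answer.
example = ds[0]
print(example["instruction"])
print(example["context"])
print(example["response"])
```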


Dolly 2.0 summarizes a customer request (Image: Databricks)

The 15,000 question-answer pairs that make up the databricks-dolly-15k training dataset were collected through a "gamified" internal challenge. Databricks promotes the dataset as fact-based and of high quality because everyone involved is a professional who actively works with LLMs. According to Databricks, the model is less prone to hallucinations than the secondary models based on LLaMA, which were fed synthetically generated training datasets. Instructions and sample answers can be reviewed in the blog post.

Due to its performance and relatively small size, the model is not considered "SOTA" (state of the art). However, its creators expect it to serve as a starting point for further work and that more powerful large language models could emerge from it. In parts of the AI scene the model has been enthusiastically received: the approach of making it accessible as open source meets with approval and could also be of interest in connection with an ongoing LAION petition for an international AI computing cluster for creating open models.


Dolly 2.0 creates a tweet text for its own release (Image: Databricks)

Since the leak of Meta's model LLaMA, which can only be licensed for scientific purposes on request, derived models of questionable legality have been springing up everywhere. Universities like Stanford and Berkeley (which were officially allowed to work with LLaMA) have shown with small budgets and crowdsourcing that large models can be retrained with modest resources. Most of the time it remains a proof of concept: Stanford took the Alpaca demo offline again after a short time because the operating costs were not sustainable. The question of legal use remains unresolved, since Meta's model, like that of OpenAI and Microsoft, is not released as open source. It is therefore used in a legal gray area, and such models are useless for commercial purposes. Their practical suitability is also inconsistent; benchmarks are usually unavailable or not very convincing.

An exception is the LLaMA offshoot Vicuna, which is suitable for local use due to its small size and whose publishers claim that it achieves "90 percent of the performance of ChatGPT" (scientific data is not available). Although it is called open source, it sits in the same legal gray area as the other LLaMA offshoots, since it is based on LLaMA models and on training datasets related to GPT-4. Vicuna is an inter-university project by students from UC Berkeley, Carnegie Mellon University (CMU), Stanford, and UC San Diego. Presumably the project had a scientific license and did not rely on the unofficial BitTorrent with the leaked model data.

Recently, numerous offshoots of LLaMA appeared in quick succession. Their providers consistently created their fine-tuning datasets with GPT-4 and ChatGPT via the OpenAI API; that is, they used synthetic data instead of creating their own datasets. Legally, and also in terms of content, such tuned models stand on shaky ground, since OpenAI does not release its products as open source and its Terms of Service prohibit using training data derived from them to create new models if someone tries to capitalize on them. Alpaca, Koala, Vicuna, and GPT4All are therefore all unsuitable for commercial use.

However, OpenAI's use of private user data to create and fine-tune its own commercial GPT series is also considered problematic and is currently the subject of legal investigations in Canada, the USA, and Italy.

Databricks is the provider of a multi-cloud platform for data engineering, data science, and data analytics. The company was founded in 2013 by the developers of Apache Spark, who built a platform for automated cluster management and IPython-style notebooks around Spark. Databricks had not previously appeared as a provider of AI models; the first version of its large language model, Dolly, was released only about two weeks ago.

For more information, examples, and download tips, see the blog post on the Databricks website.


(sih)
