Navigating Commercial Use of Falcon Models: Key Considerations
Chapter 1: Understanding Falcon Models
The Falcon models have emerged as some of the top large language models based on public benchmarks. I previously discussed their features in an article titled:
Introduction to the Open LLM Falcon-40B: Performance, Training Data, and Architecture
These models excel at answering questions across various domains and at common-sense reasoning. Such capabilities make them appealing for commercial applications. Initially, Falcon models were not available for commercial use. However, at the end of May 2023 their license changed to Apache 2.0, which appears to permit commercial use.
Nonetheless, the question arises: Is an Apache 2.0 license sufficient for commercial applications? The answer is complex. The creators of a language model are free to release it under any license they choose, but that does not remove the need to verify whether the license is compatible with how the model was built, in particular with the data it was trained on.
Section 1.1: Legal Risks of Commercial Use
Utilizing Falcon models in commercial settings could expose your business to potential legal issues with OpenAI.
Falcon-instruct models are specialized versions of the Falcon models, designed to operate as chatbots. They have been trained on curated datasets that include "instructions," similar to the transformation of GPT-3 into ChatGPT. Understanding the training data of any model is crucial.
These models utilize:
- 150 million tokens from Baize, along with 5% of RefinedWeb data.
RefinedWeb is a dataset derived from Common Crawl, and its specific content remains unclear. However, the Baize dataset is accessible for review. According to its GitHub repository, it comprises 100,000 dialogues generated by having ChatGPT converse with itself. Essentially, Falcon-instruct has been shaped into a chatbot using data generated by ChatGPT.
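The kind of check described above can be sketched as a small provenance audit. This is only an illustration: the `DataSource` class and the `flag_synthetic_sources` function are hypothetical names invented for this example, and the entries simply restate what this article says about the two datasets. In practice you would fill in such a list by hand from model cards and dataset repositories.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class DataSource:
    """Hypothetical record describing one training dataset."""
    name: str
    origin: str                         # where the text came from
    generated_by: Optional[str] = None  # upstream model or service, if synthetic


def flag_synthetic_sources(sources: List[DataSource]) -> List[str]:
    """Return the names of datasets that were produced by another model or service."""
    return [s.name for s in sources if s.generated_by is not None]


# Entries restate the description given in this article; nothing is fetched online.
falcon_instruct_sources = [
    DataSource("Baize", origin="self-chat dialogues", generated_by="ChatGPT"),
    DataSource("RefinedWeb", origin="Common Crawl"),
]

flagged = flag_synthetic_sources(falcon_instruct_sources)
print(flagged)
```

Any dataset that ends up in `flagged` is one whose upstream terms of service deserve a closer look before the model is used in a product.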
Subsection 1.1.1: Implications of Using ChatGPT-Generated Data
The Falcon-instruct models are released under an Apache 2.0 license, effectively making them free chatbots trained with ChatGPT-generated data. This scenario resembles a type of machine learning "distillation" in which ChatGPT serves as the teacher and Falcon-instruct acts as the student. As a result of this training method, we can expect Falcon-instruct to behave much like ChatGPT.
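To make the teacher and student idea concrete, here is a minimal sketch of how teacher-generated dialogues are typically turned into supervised training examples for a student model. It is not the actual Falcon-instruct or Baize pipeline: the `(role, text)` turn format, the `to_training_pairs` function, and the toy dialogue are all assumptions made for illustration.

```python
from typing import List, Tuple


def to_training_pairs(dialogue: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Build (context, target) pairs where each target is a teacher reply.

    The student is trained to reproduce the teacher's reply given the
    conversation so far, which is why the teacher's style carries over.
    """
    pairs = []
    context: List[str] = []
    for role, text in dialogue:
        if role == "assistant" and context:
            # The teacher's reply becomes the training target for the student.
            pairs.append(("\n".join(context), text))
        context.append(f"{role}: {text}")
    return pairs


# Toy dialogue standing in for a teacher-generated transcript (hypothetical).
dialogue = [
    ("user", "Hello"),
    ("assistant", "Hi! How can I help?"),
    ("user", "Name a color."),
    ("assistant", "Blue."),
]

pairs = to_training_pairs(dialogue)
for context, target in pairs:
    print(repr(context), "->", repr(target))
```

Every target in `pairs` is text written by the teacher, which is exactly why the terms attached to the teacher's outputs matter for the student model.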
However, does OpenAI permit this arrangement? The answer is no. OpenAI explicitly restricts the development of competing models using outputs from its services. In their terms of service, one of the prohibitions states:
- using outputs from the Services to create models that compete with OpenAI.
I am not a legal expert, so this should not be construed as legal guidance, but it seems that Falcon-instruct models might violate OpenAI's terms. To my knowledge, OpenAI has not yet responded to this situation, and it remains uncertain how they might interpret "compete." As currently hosted on Hugging Face's hub, the Falcon-instruct models do not overtly rival OpenAI services.
Nonetheless, if you integrate a Falcon-instruct model into your product and it starts attracting users away from OpenAI's offerings, it is likely that OpenAI would take action.
Section 1.2: Fine-Tuning Falcon Models
If your goal is to fine-tune the original pre-trained Falcon model with your own datasets, further guidance can be found here:
Fine-tune Falcon-7B on Your GPU with TRL and QLoRa: A State-of-the-Art LLM Better than LLaMa for Free
Chapter 2: Conclusion
In summary, while the Falcon models present exciting opportunities for commercial applications, it's essential to navigate the legal landscape carefully to avoid potential conflicts with existing services like those offered by OpenAI.