Code generation systems like DeepMind’s AlphaCode, Amazon’s CodeWhisperer, and OpenAI’s Codex, which powers GitHub’s Copilot service, provide a tantalizing glimpse of what’s possible with AI today in computer programming. But so far, only a handful of such AI systems are freely available to the public and open source, reflecting the commercial incentives of the companies that build them.
In an effort to change that, the AI startup Hugging Face and ServiceNow Research, ServiceNow’s research and development arm, today launched BigCode, a new project that aims to develop “state-of-the-art” AI systems for code in an “open and accountable” way. The goal is to eventually release a data set large enough to train a code generation system, which will then be used to create a prototype using ServiceNow’s internal GPU cluster. The prototype is planned as a 15-billion-parameter model, larger than Codex (12 billion parameters) but smaller than AlphaCode (~41.4 billion parameters). In machine learning, parameters are the parts of an AI system learned from historical training data, and they essentially define the system’s skill on a problem, such as code generation.
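To make the idea of a parameter count concrete, here is a minimal sketch in Python that counts the learned values in a toy fully connected network. The layer sizes and function names are illustrative assumptions; this is not the architecture of Codex, AlphaCode, or the planned BigCode model.

```python
from typing import List


def dense_layer_param_count(n_inputs: int, n_outputs: int) -> int:
    """One weight per input-output pair, plus one bias per output."""
    return n_inputs * n_outputs + n_outputs


def total_param_count(layer_sizes: List[int]) -> int:
    """Sum the parameters of every dense layer between consecutive sizes."""
    return sum(
        dense_layer_param_count(n_in, n_out)
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


# Hypothetical toy network: 512 inputs, one hidden layer of 2048, 512 outputs.
toy_layer_sizes = [512, 2048, 512]
print(total_param_count(toy_layer_sizes))  # 2099712
```

Even this small toy network has about 2.1 million parameters, which gives a sense of scale for models measured in the billions.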
Inspired by Hugging Face’s BigScience effort to open up highly sophisticated text generation systems, BigCode will be open to anyone who has professional AI research experience and can commit time to the project, organizers say. The application form went live this afternoon.
“In general, we expect applicants to be affiliated with a research organization (either in academia or industry) and to work on the technical/ethical/legal aspects of [large language models] for coding applications,” ServiceNow wrote in a blog post. “Once the [code-generating system] is trained, we will evaluate its capabilities… We will strive to make the evaluation easier and broader so that we can learn more about the [system’s] abilities.”
In co-developing a code generation system that will be open source under a license that will allow developers to reuse it under certain terms and conditions, BigCode seeks to address some of the controversy surrounding the practice of AI-powered code generation, especially regarding fair use. The non-profit organization Software Freedom Conservancy, among others, has criticized GitHub and OpenAI for using public source code, not all of which is under a permissive license, to train and monetize Codex. Codex is available through OpenAI’s paid API, while GitHub recently started charging for access to Copilot. For their part, GitHub and OpenAI continue to maintain that Codex and Copilot do not conflict with any license terms.
The organizers of BigCode say they will make efforts to ensure that only files from repositories with permissive licenses enter the aforementioned training data set. Along the way, they say, they will work to establish “responsible” AI practices for training and sharing code generation systems of all types, asking for feedback from relevant stakeholders before making policy statements.
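As an illustration of what license-based filtering might look like, the sketch below keeps only files whose repository carries a license from a small allow-list of SPDX identifiers. The `SourceFile` record, the `keep_permissive` helper, and the allow-list itself are assumptions made for this example; the organizers have not published their actual criteria or tooling here.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional

# Assumed allow-list of well-known permissive licenses (SPDX identifiers).
# This is an illustrative set, not BigCode's actual list.
PERMISSIVE_LICENSES = frozenset(
    {"MIT", "Apache-2.0", "BSD-2-Clause", "BSD-3-Clause", "ISC"}
)


@dataclass(frozen=True)
class SourceFile:
    """Hypothetical record describing one candidate training file."""
    repo: str
    path: str
    license_id: Optional[str]  # SPDX identifier, or None if undetected


def keep_permissive(files: Iterable[SourceFile]) -> List[SourceFile]:
    """Keep files whose repository license is on the allow-list.

    Files with an unknown license (None) are dropped, the conservative choice.
    """
    return [f for f in files if f.license_id in PERMISSIVE_LICENSES]


candidates = [
    SourceFile("example/alpha", "main.py", "MIT"),
    SourceFile("example/beta", "lib.c", "GPL-3.0-only"),
    SourceFile("example/gamma", "util.go", None),
]
print([f.repo for f in keep_permissive(candidates)])  # ['example/alpha']
```

In practice, detecting a repository’s license reliably is harder than checking a single field, so a real pipeline would likely need additional checks.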
ServiceNow and Hugging Face did not provide a timeline for when the project might reach completion. But they expect it to explore several forms of code generation over the next few months, including systems that auto-complete and synthesize code from code snippets and natural language descriptions, and that work across a wide range of domains, tasks and programming languages.
Assuming the ethical, technical, and legal issues are ironed out someday, AI-based coding tools could significantly reduce development costs while allowing programmers to focus on more creative tasks. According to a study from the University of Cambridge, at least half of developers’ efforts are spent on debugging rather than active programming, costing the software industry an estimated $312 billion a year.