
CodeNet: A Large-Scale AI for Code Dataset for Learning a Diversity of Coding Tasks

Project Overview

The document explores the role of generative AI in education through the 'CodeNet' dataset, a large-scale resource for machine learning research aimed at improving software development productivity. The dataset comprises over 14 million code samples across 55 programming languages and supports a range of coding tasks, including code similarity assessment, translation, and performance optimization. Its annotations make it usable as a benchmark for AI applications in coding and are intended to encourage further research by both novice and experienced developers. The document also highlights an upcoming contest aimed at promoting diversity and inclusivity in data science by engaging aspiring data scientists in the effective use of AI tools. Together, these initiatives point to the potential of generative AI to improve teaching methods, foster creativity in software development, and support the next generation of tech professionals.

Key Applications

CodeNet dataset for AI in Coding

Context: The intended audience includes aspiring data scientists and software developers. The dataset serves both academic and practical applications in coding.

Implementation: The CodeNet dataset was curated from submissions to online judge platforms such as AIZU and AtCoder, and it provides a structured resource with annotations.
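As a rough illustration of how annotated submissions might be queried, the sketch below filters a small in-memory metadata table for accepted solutions in one language. The rows and the column names (`submission_id`, `problem_id`, `language`, `status`) are assumptions made for this example and should be checked against the dataset's own documentation before use.

```python
import csv
import io

# Hypothetical metadata rows; column names and values are assumptions
# for illustration, not the dataset's confirmed schema.
SAMPLE_METADATA = """submission_id,problem_id,language,status
s001,p00001,Python,Accepted
s002,p00001,C++,Wrong Answer
s003,p00002,Python,Accepted
s004,p00002,Java,Accepted
"""


def accepted_by_language(csv_text, language):
    """Return IDs of submissions marked Accepted and written in `language`."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        row["submission_id"]
        for row in reader
        if row["status"] == "Accepted" and row["language"] == language
    ]


python_ids = accepted_by_language(SAMPLE_METADATA, "Python")
print(python_ids)
```

Filtering on annotations like these is one plausible way to build task-specific subsets, for example pairing accepted solutions to the same problem for similarity experiments.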

Outcomes: Enhanced opportunities for research and development in AI for coding, improved coding practices, and educational tools for learning coding.

Challenges: Not all code samples are extensively commented, and the dataset is not suited to users looking for enterprise-level API code or advanced design patterns.

Implementation Barriers

Quality of Data

The quality of code samples varies, and many may lack extensive comments or documentation.

Proposed Solutions: Encourage thorough commenting practices among users and provide guidelines for high-quality submissions.
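The concern about sparse comments could be checked with a simple measurement. The sketch below is a minimal heuristic, not a method from the paper: it computes the fraction of non-blank lines that are full-line comments, assuming a single-character-sequence marker such as `#`. It ignores inline and block comments, so it only gives a coarse estimate.

```python
def comment_ratio(source, marker="#"):
    """Fraction of non-blank lines that begin with the comment marker.

    Assumption: only full-line comments are counted; inline comments
    and multi-line comment blocks are ignored.
    """
    lines = [line.strip() for line in source.splitlines() if line.strip()]
    if not lines:
        return 0.0
    commented = sum(1 for line in lines if line.startswith(marker))
    return commented / len(lines)


# Hypothetical sample: one comment line out of three non-blank lines.
sample = "# read input\nn = int(input())\n\nprint(n * 2)\n"
print(comment_ratio(sample))
```

A different `marker` (for example `//`) can be passed for languages with other line-comment syntax.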

Project Team

Ruchir Puri

Researcher

David S. Kung

Researcher

Geert Janssen

Researcher

Wei Zhang

Researcher

Giacomo Domeniconi

Researcher

Vladimir Zolotov

Researcher

Julian Dolby

Researcher

Jie Chen

Researcher

Mihir Choudhury

Researcher

Lindsey Decker

Researcher

Veronika Thost

Researcher

Luca Buratti

Researcher

Saurabh Pujar

Researcher

Shyam Ramji

Researcher

Ulrich Finkler

Researcher

Susan Malaika

Researcher

Frederick Reiss

Researcher

Contact Information

For information about the paper, please contact the authors.

Authors: Ruchir Puri, David S. Kung, Geert Janssen, Wei Zhang, Giacomo Domeniconi, Vladimir Zolotov, Julian Dolby, Jie Chen, Mihir Choudhury, Lindsey Decker, Veronika Thost, Luca Buratti, Saurabh Pujar, Shyam Ramji, Ulrich Finkler, Susan Malaika, Frederick Reiss

Source Publication: View Original Paper

Project Contact: Dr. Jianhua Yang

LLM Model Version: gpt-4o-mini-2024-07-18

Analysis Provider: OpenAI
