Combining static analysis with deep learning for type inference and code editing
Abstract
For many programming tasks, state-of-the-art machine learning techniques treat programs as sequences of tokens and encode only local syntactic information. While this approach has achieved impressive results on tasks such as code autocompletion and program synthesis, many other tasks require analyzing programs at the project level. In this thesis, we propose techniques that combine lightweight static analysis and code transformations with machine learning to tackle two challenging problems from this category.

We first focus on probabilistic type inference, where the goal is to predict missing type annotations for programs written in gradually typed languages such as TypeScript and Python. Global information is essential for this task, as the model needs to consider how a function is used throughout the project and be aware of new types defined elsewhere. Our first approach, LambdaNet, uses lightweight static analysis to generate a program abstraction called a type dependency graph, which is then processed by a graph neural network to make type predictions. Our more recent work, TypeT5, models type inference as a code-infilling task and fine-tunes a pre-trained code-infilling model on type annotation labels. To best utilize the transformer model's limited receptive field, TypeT5 uses static analysis to construct a dynamic context for each code element. At inference time, we also propose a sequential decoding scheme that incorporates previously predicted types into the dynamic context, allowing information exchange between distant but related code elements.

We then focus on contextual code change prediction, where the goal is to predict how to edit a piece of code based on other relevant changes made elsewhere in the same project. We introduce Coeditor, a fine-tuned CodeT5 model specifically designed for code editing tasks. We again model this task as code infilling, using a line-diff-based code change encoding scheme and employing static analysis to form large, customized model contexts that give the model the information it needs for prediction. Coeditor significantly outperforms the best code completion approach on a simplified single-round, single-edit task. In the proposed multi-round, multi-edit setting, Coeditor demonstrates substantial gains by iteratively conditioning on additional user edits. To encourage future research, we open-source our code, data, and model weights, and release a VSCode extension powered by our model for interactive usage.
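To make the code-infilling formulation of type inference concrete, the sketch below masks the annotation sites of a Python function with T5-style sentinel tokens and collects the original annotations as the labels a model would be fine-tuned to predict. This is only a minimal illustration of the idea; the function name and masking details are ours and do not reproduce TypeT5's actual implementation, which additionally builds a static-analysis-based context around each code element.

```python
import ast

def mask_annotations(source: str) -> tuple[str, list[str]]:
    """Replace parameter and return annotations with T5-style sentinel
    tokens and collect the original annotations as prediction labels.
    Requires Python 3.9+ for ast.unparse."""
    tree = ast.parse(source)
    labels: list[str] = []

    def mask(ann: ast.expr) -> ast.Name:
        token = f"<extra_id_{len(labels)}>"  # one sentinel per masked site
        labels.append(ast.unparse(ann))
        return ast.Name(id=token, ctx=ast.Load())

    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            for arg in node.args.args + node.args.kwonlyargs:
                if arg.annotation is not None:
                    arg.annotation = mask(arg.annotation)
            if node.returns is not None:
                node.returns = mask(node.returns)
    return ast.unparse(tree), labels

src = "def add(x: int, y: int) -> int:\n    return x + y\n"
masked, labels = mask_annotations(src)
# masked: "def add(x: <extra_id_0>, y: <extra_id_1>) -> <extra_id_2>: ..."
# labels: ['int', 'int', 'int']
```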
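The sequential decoding scheme can be summarized as the loop below, where earlier predictions are fed back into the context used for later elements. Here `build_context` and `predict_type` are hypothetical stand-ins for the static-analysis-based context construction and the fine-tuned infilling model; they are not part of TypeT5's public API.

```python
def sequential_decode(elements, build_context, predict_type):
    """Predict type signatures one code element at a time, conditioning
    each prediction on the types predicted so far.

    `elements` is an ordered list of code-element identifiers (e.g.,
    ordered by the call graph); `build_context` and `predict_type` are
    hypothetical helpers standing in for context construction and the
    fine-tuned model.
    """
    predicted: dict[str, str] = {}
    for elem in elements:
        context = build_context(elem, predicted)  # includes already-predicted types
        predicted[elem] = predict_type(context)   # conditioned on earlier predictions
    return predicted
```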
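As a rough illustration of a line-diff-based change encoding, the sketch below renders a before/after pair of code as a single sequence with add/delete markers, so that a transformer can read the change in context. The `<add>`/`<del>` markers and the use of difflib are illustrative assumptions; Coeditor's actual tokenization scheme differs in its details.

```python
import difflib

def encode_change(before: str, after: str) -> str:
    """Encode a code change as a line diff with <add>/<del> markers
    (illustrative only; not Coeditor's exact encoding)."""
    a, b = before.splitlines(), after.splitlines()
    out: list[str] = []
    for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(a=a, b=b).get_opcodes():
        if tag == "equal":
            out.extend(a[i1:i2])                              # unchanged lines kept verbatim
        if tag in ("delete", "replace"):
            out.extend("<del> " + line for line in a[i1:i2])  # removed lines
        if tag in ("insert", "replace"):
            out.extend("<add> " + line for line in b[j1:j2])  # added lines
    return "\n".join(out)

print(encode_change("x = 1\ny = 2\n", "x = 1\ny = 3\n"))
# x = 1
# <del> y = 2
# <add> y = 3
```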