ANPL: Towards Natural Programming with Interactive Decomposition

1SKL of Processors, Institute of Computing Technology, CAS
2University of Chinese Academy of Sciences
3Autodesk Research

Abstract

The advent of Large Language Models (LLMs) has shown promise in augmenting programming through natural interactions. However, while LLMs are proficient at compiling common usage patterns into a programming language such as Python, it remains a challenge to edit and debug an LLM-generated program. We introduce ANPL, a programming system that allows users to decompose user-specific tasks interactively. In an ANPL program, the user directly manipulates the “sketch”, which specifies the data flow of the generated program, and annotates the modules, or “holes”, with natural language descriptions, offloading the expensive task of generating functionality to the LLM. Given an ANPL program, the ANPL compiler generates a cohesive Python program that implements the functionality of the holes while respecting the data flow specified in the sketch. We deploy ANPL on the Abstraction and Reasoning Corpus (ARC), a set of unique tasks that are challenging for state-of-the-art AI systems, and show that it outperforms baseline programming systems that (a) lack the ability to decompose tasks interactively and (b) lack the guarantee that the modules can be correctly composed together. We obtain a dataset of 300/400 ARC tasks that were successfully decomposed and grounded in Python, providing valuable insights into how humans decompose programmatic tasks.

Overview
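
To illustrate the sketch/hole idea, here is a toy example of our own (the task and function names are not from the paper): the user fixes the data flow in `solve`, and each hole carries a natural-language description that the LLM compiles into Python. Below, the two holes are stubbed with hand-written implementations standing in for LLM-generated code:

```python
def flip_rows(grid):
    """Hole: 'reverse the order of the rows of the grid'."""
    return grid[::-1]

def flip_cols(grid):
    """Hole: 'reverse each row of the grid'."""
    return [row[::-1] for row in grid]

def solve(grid):
    # Sketch: the user fixes this data flow; the LLM fills the holes above.
    return flip_cols(flip_rows(grid))

print(solve([[1, 2], [3, 4]]))  # 180-degree rotation: [[4, 3], [2, 1]]
```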

User Study

We conducted a user study on 400 ARC training tasks to evaluate the effectiveness of ANPL compared with the original ChatGPT (GPT-3.5-turbo). Specifically, we compare system A (ANPL), system B (GPT with interaction, i.e., ChatGPT), system C (ANPL without interaction), and system D (vanilla GPT), where systems C and D measure the one-shot solving rates of systems A and B without further interaction.

Results of problem-solving rates. A: Solving rates of the four systems. B: The relationship between solving rate and time consumption. C: The relationship between solving rate and number of interactions. Trace Calculated (TC) means trace-mode operations are counted as interactions.

Results of problem-solving rates when users cannot see the generated Python code and interact only through the user interface and IO feedback. A: Solving rates of the four systems. B: The relationship between solving rate and time consumption. C: The relationship between solving rate and number of interactions. Trace Calculated (TC) means trace-mode operations are counted as interactions.

In conclusion:

  1. With high confidence, ANPL performs best and allows users to solve more problems, reaching a solving rate of 75.0% on average, while system B achieves 58.4% (↑ 28.25%).
  2. Programming with interaction (systems A and B) consistently outperforms programming without interaction (systems C and D).
  3. Even for one-shot code generation, ANPL without interaction (system C) outperforms vanilla GPT (system D).

The ANPL Compiler as an ANPL Program

Preserving the sketch & Generating the holes


def compiling(ANPL):
  while "ANPL program has a hole"(ANPL):
    hole = "return the first hole in ANPL program"(ANPL)
    generated_code = LLM(ANPL, hole.descriptions, hole.IOs)
    dependency_graph = "analyze generated code, return dependency graph with topology sort"(generated_code)
    entry_nodes = "return all entry nodes of the dependency graph with topology sort"(dependency_graph)
    if "the hole has been named by the user"(hole):
      new_dependency_graph = "remove unrelated nodes in the dependency graph"(dependency_graph, hole)
      ANPL = "fill the hole with generated code according to the dependency graph"(ANPL, generated_code, new_dependency_graph)
    elif len(entry_nodes) == 1:
      ANPL = "fill the hole with generated code according to the dependency graph"(ANPL, generated_code, dependency_graph)
    else:
      # The dependency graph has multiple entry nodes; re-generate the hole
      continue
  return ANPL

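The entry-node check above decides whether the LLM's generated code has a unique top-level function that can be spliced into the hole. A minimal runnable sketch of that check using Python's `ast` module (a simplification of our own; the actual compiler also performs topological sorting and handles user-named holes):

```python
import ast

def entry_nodes(generated_code):
    """Return the functions in generated_code that no other defined
    function calls. A single entry node means the generated code can
    be spliced into the hole unambiguously."""
    tree = ast.parse(generated_code)
    defined = {n.name for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)}
    called = set()
    for fn in ast.walk(tree):
        if isinstance(fn, ast.FunctionDef):
            for node in ast.walk(fn):
                if (isinstance(node, ast.Call)
                        and isinstance(node.func, ast.Name)
                        and node.func.id in defined):
                    called.add(node.func.id)
    return defined - called

code = """
def helper(x):
    return x + 1

def main(x):
    return helper(x) * 2
"""
print(entry_nodes(code))  # {'main'}: a single entry node, so the hole can be filled
```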

User Interface

The user interface consists of three operations: trace, edit, and resynthesis. We also provide a grid editor for users.
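
As a rough sketch of how the edit and resynthesis operations interact with a hole (the `Hole` dataclass and function signatures here are our own illustration, not the paper's implementation), editing a hole's description invalidates its generated code, and resynthesis regenerates that code from the current description:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Hole:
    name: str
    description: str            # natural-language spec the LLM compiles
    code: Optional[str] = None  # Python filled in by the ANPL compiler

def edit(hole: Hole, new_description: str) -> Hole:
    """'edit': revise a hole's description; its generated code is invalidated."""
    return Hole(hole.name, new_description, code=None)

def resynthesize(hole: Hole, llm: Callable[[str], str]) -> Hole:
    """'resynthesis': regenerate the hole's code from its current description."""
    return Hole(hole.name, hole.description, code=llm(hole.description))

# Demo with a fake LLM standing in for GPT.
fake_llm = lambda desc: f"# generated from: {desc}"
h = resynthesize(edit(Hole("f", "flip rows"), "flip columns"), fake_llm)
print(h.code)  # -> "# generated from: flip columns"
```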