
Automatic Prompt Engineer (APE)

APE Framework Overview

Automatic Prompt Engineer (APE) is a technique that leverages language models themselves to automatically generate, test, and refine prompts for specific tasks. Instead of manually crafting prompts through trial and error, APE uses systematic approaches to discover high-performing prompt formulations, often surpassing human-designed prompts.

Introduced by Zhou et al. (2022), APE frames instruction generation as a natural language program synthesis problem: the model is asked to propose instructions that would produce the desired outputs for the given inputs, and the candidates are then scored and selected automatically.

How APE Works

The APE process typically involves four key steps (a minimal code sketch follows the list):

  1. Prompt Generation: The model generates multiple candidate prompts based on input-output examples
  2. Evaluation: Each candidate prompt is tested on a validation set
  3. Selection: The best-performing prompts are identified based on accuracy or other metrics
  4. Iteration: The process can be repeated to further refine prompts
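
In code, this loop can be sketched roughly as follows. The sketch assumes a generic llm(prompt) -> str callable standing in for whatever model interface you use, plus a small labeled validation set; it is a minimal illustration under those assumptions, not the exact procedure from the paper.

def ape_loop(llm, demos, validation_set, n_candidates=5, n_rounds=2):
    """Minimal APE loop: generate candidate prompts, score them, keep the best.

    llm            -- hypothetical callable taking a prompt string, returning model text
    demos          -- list of (input, output) pairs used to seed prompt generation
    validation_set -- list of (input, expected_output) pairs used for scoring
    """
    demo_block = "\n".join(f"Input: {x}\nOutput: {y}" for x, y in demos)
    best_prompt, best_score = None, -1.0

    for _ in range(n_rounds):
        # 1. Prompt generation: ask the model for candidate instructions
        meta = (f"Here are input-output examples:\n{demo_block}\n\n"
                f"Write {n_candidates} different instructions that would map "
                f"each input to its output. One instruction per line.")
        candidates = [line.strip() for line in llm(meta).splitlines() if line.strip()]

        # 2-3. Evaluation and selection: score each candidate on the validation set
        for prompt in candidates:
            correct = sum(
                llm(f"{prompt}\n\nInput: {x}\nOutput:").strip() == y
                for x, y in validation_set
            )
            score = correct / len(validation_set)
            if score > best_score:
                best_prompt, best_score = prompt, score

        # 4. Iteration: seed the next round with the best instruction found so far
        demo_block += f"\n\nA strong instruction found so far: {best_prompt}"

    return best_prompt, best_score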

Use Cases

APE is particularly valuable when:

  • Manual prompt engineering is time-consuming or requires domain expertise
  • You need to optimize prompts for specific metrics (accuracy, consistency, style)
  • You are working with new tasks where effective prompt patterns aren't well established
  • You need to scale prompt optimization across many similar tasks

Implementation Patterns

Basic APE Pattern

Task: Generate a prompt that will help classify customer reviews as positive or negative.

Given these examples:
Input: "This product exceeded my expectations!"
Output: Positive

Input: "Waste of money, poor quality"
Output: Negative

Generate 5 different prompts that would work well for this classification task.

Iterative Refinement

Here are 3 prompts for sentiment analysis:
1. "Classify this review as positive or negative:"
2. "Determine the sentiment (positive/negative) of this customer feedback:"
3. "Is this review expressing a positive or negative opinion?"

Based on testing, prompt #2 performed best. Generate 3 improved variations of prompt #2.
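
A refinement round like this is easy to automate. The sketch below is illustrative and assumes the same hypothetical llm callable plus an evaluate(prompt, validation_set) helper that returns accuracy on held-out examples.

def refine_prompt(llm, evaluate, best_prompt, validation_set, n_variations=3, rounds=2):
    """Repeatedly ask the model for variations of the current best prompt
    and keep whichever variation scores highest on the validation set."""
    best_score = evaluate(best_prompt, validation_set)
    for _ in range(rounds):
        request = (f"The following prompt performed best so far:\n\"{best_prompt}\"\n\n"
                   f"Generate {n_variations} improved variations, one per line.")
        variations = [v.strip() for v in llm(request).splitlines() if v.strip()]
        for candidate in variations:
            score = evaluate(candidate, validation_set)
            if score > best_score:
                best_prompt, best_score = candidate, score
    return best_prompt, best_score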

Meta-Prompt for APE

You are an expert prompt engineer. Your task is to create effective prompts for [SPECIFIC TASK].

Requirements:
- The prompt should be clear and unambiguous
- It should work well across diverse inputs
- Include any necessary context or formatting instructions

Generate 5 candidate prompts, then explain why each might be effective.

Advanced Techniques

Chain-of-Thought APE

APE can be combined with chain-of-thought prompting to generate prompts that encourage step-by-step reasoning:

Generate a prompt that asks the model to solve math word problems by thinking step by step. The prompt should encourage the model to:
1. Identify what information is given
2. Determine what needs to be found
3. Set up the calculation
4. Solve and verify the answer

Multi-Objective APE

When optimizing for multiple criteria simultaneously, state each objective explicitly in the meta-prompt (a weighted-scoring sketch follows the example):

Create prompts for summarizing research papers that optimize for:
- Accuracy (capturing key findings)
- Brevity (under 100 words)
- Accessibility (understandable to non-experts)

Generate 3 prompts that balance these objectives.
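
One simple way to balance competing criteria is a weighted score over per-objective metrics. The sketch below is only illustrative: accuracy, brevity, and readability are hypothetical scoring functions assumed to return values in [0, 1].

def multi_objective_score(summary, reference,
                          accuracy, brevity, readability,
                          weights=(0.5, 0.25, 0.25)):
    """Combine several criteria into one score for ranking candidate prompts.

    accuracy, brevity, readability -- hypothetical scorers returning values in [0, 1]
    weights -- relative importance of each criterion (should sum to 1)
    """
    w_acc, w_brev, w_read = weights
    return (w_acc * accuracy(summary, reference)
            + w_brev * brevity(summary)          # e.g. penalize summaries over 100 words
            + w_read * readability(summary))     # e.g. a readability index scaled to [0, 1]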

Evaluation and Testing

Effective APE requires systematic evaluation:

Quantitative Metrics

  • Accuracy: Percentage of correct outputs on test cases
  • Consistency: Agreement of outputs across repeated runs (see the sketch after this list)
  • Efficiency: Token usage and response time
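
The first two metrics can be computed directly. The sketch below assumes the same hypothetical llm callable; it measures exact-match accuracy and uses agreement across repeated runs as a simple stand-in for output variance.

from collections import Counter

def accuracy(llm, prompt, test_cases):
    """Fraction of test cases where the model's output matches the expected label."""
    hits = sum(
        llm(f"{prompt}\n\nInput: {x}\nOutput:").strip() == expected
        for x, expected in test_cases
    )
    return hits / len(test_cases)

def consistency(llm, prompt, input_text, n_runs=5):
    """Fraction of repeated runs that agree with the most common output."""
    outputs = [llm(f"{prompt}\n\nInput: {input_text}\nOutput:").strip()
               for _ in range(n_runs)]
    most_common_count = Counter(outputs).most_common(1)[0][1]
    return most_common_count / n_runs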

Qualitative Assessment

  • Clarity: How well the prompt communicates the task
  • Robustness: Performance across edge cases and diverse inputs
  • Generalization: Effectiveness on unseen examples

Implementation Example

Here's a complete APE workflow for creating a code documentation prompt:

Step 1: Define the task
Task: Generate clear, concise documentation for Python functions

Step 2: Provide examples
Input: def calculate_area(radius): return 3.14159 * radius ** 2
Output: Calculates the area of a circle given its radius. Returns the area as a float.

Step 3: Generate candidate prompts
1. "Write documentation for this Python function:"
2. "Create a brief description of what this function does:"
3. "Generate a docstring explaining this function's purpose and return value:"
4. "Describe this function in one clear sentence:"
5. "Document this function including its purpose and output:"

Step 4: Test and evaluate
[Test each prompt on validation set]

Step 5: Select and refine best performer
Best: Prompt #3
Refined: "Generate a concise docstring for this Python function, explaining its purpose, parameters, and return value:"

Best Practices

For Prompt Generation

  • Provide diverse, high-quality examples
  • Be specific about desired output format and style
  • Include edge cases in your evaluation set
  • Consider the target model's capabilities and limitations

For Evaluation

  • Use held-out test sets to avoid overfitting
  • Evaluate on multiple metrics relevant to your use case
  • Test prompts with different model temperatures and sampling methods
  • Consider human evaluation for subjective tasks

For Iteration

  • Start with simple prompts and gradually increase complexity
  • Document what works and what doesn't for future reference
  • Consider prompt ensembles for critical applications (a minimal voting sketch follows this list)
  • Re-evaluate regularly as models and tasks evolve
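
A prompt ensemble can be as simple as majority voting across several optimized prompts, as in the sketch below (again assuming a hypothetical llm callable and a classification-style task).

from collections import Counter

def ensemble_classify(llm, prompts, input_text):
    """Query several optimized prompts and return the majority-vote label."""
    votes = [llm(f"{p}\n\nInput: {input_text}\nOutput:").strip() for p in prompts]
    return Counter(votes).most_common(1)[0][0]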

Limitations and Considerations

Computational Cost: APE requires multiple model calls for generation and evaluation, which can be expensive.

Evaluation Challenges: Defining good evaluation metrics can be difficult for subjective or creative tasks.

Overfitting Risk: Prompts optimized on small datasets may not generalize well.

Context Dependence: Optimal prompts may vary significantly across different models or domains.

Human Oversight: Automated prompt generation should be combined with human review, especially for sensitive applications.

Integration with Other Techniques

APE can be combined with other prompting techniques:

  • Few-shot learning: Generate prompts that include optimal examples
  • Chain-of-thought: Create prompts that encourage reasoning
  • Self-consistency: Develop prompts that remain effective when multiple outputs are sampled and aggregated

References

  • Zhou, Y., et al. (2022). Large Language Models Are Human-Level Prompt Engineers. arXiv preprint arXiv:2211.01910
  • Reynolds, L., & McDonell, K. (2021). Prompt Programming for Large Language Models: Beyond the Few-Shot Paradigm. Extended Abstracts of CHI 2021
  • Wei, J., et al. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. NeurIPS 2022