A Classification Prompt That Runs 260k Times a Day With 100% Accuracy
Developing a Comprehensive Framework for Evaluating Responses in a Domain-Specific Question-and-Answer Setting
In today's data-driven world, assessing the quality of information is crucial. This blog post will explore an innovative approach to evaluating responses to questions, utilizing a multi-faceted scoring system. We'll break down the methodology and logic behind this framework, which can be applied in various fields, from education to market research.
The Evaluation Framework
The prompt casts the model as an expert accountant specializing in assessing the accuracy of responses. The framework evaluates each response against three key criteria:
Accuracy
Relevance
Positivity
You are an expert accountant specializing in assessing the accuracy of responses to questions.
Question: '''{question}'''
Answer: '''{userMessage}'''
Each criterion is rated on a scale of 1 to 5, allowing for a nuanced assessment of the response. Let's dive into each aspect:
Accuracy (1-5 scale): This measures how correct and up-to-date the information in the response is.
1: Completely incorrect
2: Mostly incorrect, with some correct elements
3: Partially correct
4: Mostly correct, with minor inaccuracies
5: Fully accurate and up-to-date
The accuracy scale allows for a granular evaluation of the response's correctness. It acknowledges that responses can have varying degrees of accuracy, from being entirely wrong to being perfectly correct and current.
Accuracy:
1: Completely incorrect
2: Mostly incorrect, but with some correct elements
3: Partially correct
4: Mostly correct, with minor inaccuracies
5: Fully accurate and up-to-date
Relevance (1-5 scale): This assesses how well the response addresses the given question.
1: Completely unrelated to the question
2: Some relation to the question, but mostly off-topic
3: Relevant, but lacking focus or key details
4: Highly relevant, addressing the main aspects of the question
5: Directly relevant and precisely targeted to the question
The relevance scale helps determine whether the response actually answers the question at hand. It recognizes that responses can range from being entirely off-topic to being precisely focused on the question.
Relevance:
1: Completely unrelated to the question
2: Some relation to the question, but mostly off-topic
3: Relevant, but lacking focus or key details
4: Highly relevant, addressing the main aspects of the question
5: Directly relevant and precisely targeted to the question
Positivity (1-5 scale): This evaluates the overall tone and stance of the response towards the question topic.
1: Completely negative or disagreeing
2: Mostly negative, with some positive elements
3: Neutral or ambiguous
4: Mostly positive, with some negative elements
5: Entirely positive or agreeing
The positivity scale is unique in that it assesses the response's overall attitude towards the question topic. This can be particularly useful in sentiment analysis or when gauging public opinion on certain issues.
Positivity:
1: The response denies or disagrees without any elements suggesting a positive inclination towards the question topic.
2: The response denies or disagrees but includes some aspects that hint at a positive or favorable view related to the question topic.
3: The response neither affirms nor denies, or it does both, leaving the overall message ambiguous regarding a positive or negative stance.
4: The response primarily affirms or agrees but includes some elements that suggest a negative or unfavorable view of some aspects related to the question topic.
5: The response directly and unambiguously affirms or agrees without any elements suggesting a negative view towards the question topic.
Methodology and Logic
Input: The system takes two inputs, the original question and the user's response.
Expert Evaluation: The prompt has the model take on the role of an expert accountant, implying a high level of analytical skill and attention to detail in the assessment process.
Criteria-based Scoring: Each response is evaluated independently on the three criteria mentioned above.
Numeric Scale: The use of a 1-5 scale for each criterion allows for nuanced scoring, capturing subtle differences in quality.
Clear Definitions: Each point on the scale is clearly defined, reducing subjectivity in the assessment process.
JSON Output: The evaluation is presented in a structured JSON format, making it easy to parse and analyze programmatically, as the sketch below illustrates.
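To make the pipeline concrete, here is a minimal sketch in Python. The call_llm parameter is a placeholder for whatever client actually sends the prompt to a model; only the template structure and the JSON contract come from the prompt itself, and the three rating scales are elided for brevity (the full prompt appears later in the post).

import json

PROMPT_TEMPLATE = """You are an expert accountant specializing in assessing the accuracy of responses to questions.

Question: '''{question}'''
Answer: '''{userMessage}'''

(the Accuracy, Relevance, and Positivity scales go here verbatim; see the full prompt below)

Make sure to provide your evaluation in JSON format and ONLY the JSON, with separate ratings for each of the mentioned criteria as in the following example: {{"accuracy":3, "relevance": 1, "positivity": 2}}"""


def evaluate(question: str, user_message: str, call_llm) -> dict:
    # Fill the placeholders; the doubled braces in the template are .format()
    # escapes that become literal braces in the final prompt.
    prompt = PROMPT_TEMPLATE.format(question=question, userMessage=user_message)
    raw = call_llm(prompt)  # call_llm: any function mapping prompt text to a completion
    return json.loads(raw)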
Applications and Benefits
This evaluation framework can be applied in various scenarios:
Educational Assessment: Teachers can use this to grade student responses more objectively.
Customer Feedback Analysis: Companies can evaluate customer responses to surveys or product reviews.
Content Quality Control: Content creators can assess the quality of their work before publication.
AI Training: This framework can be used to evaluate and improve AI-generated responses.
The multi-dimensional approach (accuracy, relevance, positivity) provides a comprehensive view of response quality. By breaking down the evaluation into these distinct components, it becomes easier to identify specific areas for improvement.
Top 3 advantages of assessing accuracy, relevance, and positivity in a single prompt:
Comprehensive evaluation: By combining these three criteria, you get a holistic view of the response quality. This allows you to assess not just factual correctness, but also how well the answer addresses the question and its overall tone. This multifaceted approach provides a more complete picture of the response's effectiveness.
Efficiency and consistency: Evaluating all three criteria simultaneously ensures a consistent assessment methodology across responses. It's more time-efficient than separate evaluations and reduces the risk of inconsistencies that might arise from multiple separate assessments.
Balanced scoring: This approach allows for a more balanced evaluation of responses. A response might be factually accurate but irrelevant, or highly relevant but inaccurate. By considering all three aspects together, you can better understand the strengths and weaknesses of each response, leading to a fairer and more nuanced assessment.
Prompt Breakdown:
Role Assignment: The prompt begins with "You are an expert accountant specializing in assessing the accuracy of responses to questions." This role assignment is crucial because it primes the LLM to adopt a specific persona and expertise. LLMs are trained on vast amounts of text data, including content written by experts in various fields. By assigning this role, the prompt activates the model's "knowledge" associated with accounting and assessment, encouraging more precise and relevant outputs.
Clear Structure: The prompt uses a clear, consistent structure for each criterion (Accuracy, Relevance, Positivity). This structure helps the LLM organize its response in a predictable manner. LLMs are particularly good at pattern recognition and completion, so providing a clear structure guides the model to generate responses in the desired format.
Detailed Scales: For each criterion, the prompt provides a detailed 1-5 scale with specific descriptions for each level. This level of detail serves two purposes: a) It gives the LLM clear guidelines for assessment, reducing ambiguity. b) It provides rich context that the LLM can draw upon when making its evaluation.
Use of Quotation Marks: The prompt wraps the {question} and {userMessage} placeholders in triple quotes ('''). This is a common convention in programming that clearly delimits where the variable content begins and ends, helping the LLM separate the inserted text from the surrounding instructions.
Explicit Output Format: The prompt specifies that the output should be in JSON format and provides an example. This is crucial because LLMs can be instructed to produce structured outputs, making the results easily parseable by other systems (a defensive parsing sketch follows this list).
Focused Task: The prompt ends with a clear instruction to provide only the JSON evaluation, reinforcing the desired output format.
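Even with the instruction to return ONLY the JSON, models occasionally wrap the verdict in extra prose or code fences, so downstream code benefits from defensive parsing. The helper below is my own addition rather than part of the original post: it extracts the first JSON object from the completion and enforces the three-key, 1-to-5 contract.

import json
import re

EXPECTED_KEYS = {"accuracy", "relevance", "positivity"}

def parse_evaluation(raw: str) -> dict:
    # Grab the first {...} block in case the model added surrounding text.
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match is None:
        raise ValueError(f"no JSON object found in completion: {raw!r}")
    scores = json.loads(match.group(0))
    # Enforce the contract: exactly the three criteria, each an integer from 1 to 5.
    if set(scores) != EXPECTED_KEYS:
        raise ValueError(f"unexpected keys: {sorted(scores)}")
    if not all(isinstance(v, int) and 1 <= v <= 5 for v in scores.values()):
        raise ValueError(f"scores out of range: {scores}")
    return scores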
Prompt objectives:
Provides clear context and instructions
Leverages the model's ability to adopt roles and access relevant "knowledge"
Uses a structured format that aligns with the model's training on patterns and completions
Gives detailed criteria that the model can use as reference points
Specifies a clear, structured output format
By combining these elements, the prompt effectively guides the LLM to perform a complex evaluation task in a consistent and structured manner. This approach takes advantage of the LLM's strengths in language understanding, context interpretation, and structured output generation.
Full Prompt:
You are an expert accountant specializing in assessing the accuracy of responses to questions.
Question: '''{question}'''
Answer: '''{userMessage}'''
Accuracy:
1: Completely incorrect
2: Mostly incorrect, but with some correct elements
3: Partially correct
4: Mostly correct, with minor inaccuracies
5: Fully accurate and up-to-date
Relevance:
1: Completely unrelated to the question
2: Some relation to the question, but mostly off-topic
3: Relevant, but lacking focus or key details
4: Highly relevant, addressing the main aspects of the question
5: Directly relevant and precisely targeted to the question
Positivity:
1: The response denies or disagrees without any elements suggesting a positive inclination towards the question topic.
2: The response denies or disagrees but includes some aspects that hint at a positive or favorable view related to the question topic.
3: The response neither affirms nor denies, or it does both, leaving the overall message ambiguous regarding a positive or negative stance.
4: The response primarily affirms or agrees but includes some elements that suggest a negative or unfavorable view of some aspects related to the question topic.
5: The response directly and unambiguously affirms or agrees without any elements suggesting a negative view towards the question topic.
Make sure to provide your evaluation in JSON format and ONLY the JSON, with separate ratings for each of the mentioned criteria as in the following example: {{"accuracy":3, "relevance": 1, "positivity": 2}}
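Tying the pieces together, a single end-to-end evaluation might look like the sketch below, reusing PROMPT_TEMPLATE and parse_evaluation from the earlier sketches. The OpenAI client, model name, sample question, and sample answer are all illustrative assumptions; the original post does not say which provider or model serves the 260k daily calls.

from openai import OpenAI  # assumption: the OpenAI Python SDK; any provider works

client = OpenAI()

def call_llm(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # hypothetical model choice
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic output helps scoring consistency
    )
    return response.choices[0].message.content

question = "Can unused home-office expenses be carried forward to next year?"
answer = "Yes, in many jurisdictions unused amounts can be carried forward."
prompt = PROMPT_TEMPLATE.format(question=question, userMessage=answer)
print(parse_evaluation(call_llm(prompt)))
# e.g. {'accuracy': 4, 'relevance': 5, 'positivity': 5}  (illustrative output)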
Conclusion
This innovative framework offers a structured and comprehensive approach to evaluating responses. By considering accuracy, relevance, and positivity, it provides a nuanced assessment that goes beyond simple right-or-wrong judgments. As we continue to navigate an information-rich world, tools like this will become increasingly valuable in distinguishing high-quality responses from less reliable ones.