Large Language Model (LLM) chatbot systems are rapidly becoming integral components of enterprise customer service, internal knowledge management, and business process automation. Unlike traditional rule-based chatbots, modern LLM-based systems can handle complex customer inquiries, access company knowledge bases, maintain context throughout multi-turn conversations, and deliver natural human-like interactions.
Enterprise adoption of LLM technologies introduces unique quality assurance challenges:
- Non-deterministic outputs: LLMs may produce variable responses to identical inputs
- Context sensitivity: Responses change based on conversation history and input variations
- Alignment evaluation: Ensuring outputs conform to company policies and requirements
- Hallucination detection: Identifying when an LLM provides incorrect information
- Bias and toxicity measurement: Mitigating potential risks from problematic outputs
ACCELQ No-Code Automation for Enterprise LLM Testing
ACCELQ's enterprise-grade no-code automation platform offers specialized commands for LLM testing that enable QA professionals to systematically evaluate chatbot performance without requiring specialized AI expertise.
Supported LLM Testing Commands
1. Verify LLM Response Metric
This command evaluates individual LLM responses against specific quality metrics.
Available Metrics:
| Metric | Description | When to Use |
|--------|-------------|-------------|
| Answer Relevancy | Measures how well the response addresses the prompt | For evaluating if responses stay on topic |
| Faithfulness | Assesses if the response contains factual inaccuracies | For checking factual reliability |
| Contextual Precision | Measures precision when context is provided | For evaluating retrieval-augmented responses |
| Contextual Recall | Evaluates recall when context is provided | For ensuring all relevant information is included |
| Contextual Relevancy | Measures relevance of response to provided context | For testing context incorporation |
| Bias | Detects potential biases in responses | For fairness testing |
| Toxicity | Measures harmful, offensive, or inappropriate content | For safety testing |
| Summarization | Evaluates quality of summarized content | For testing summary capabilities |
| Prompt Alignment | Checks if response follows specific instructions | For evaluating instruction following |
| Hallucination | Detects fabricated information | For ensuring factual accuracy |
To verify a metric, enter an acceptable pass threshold in the LLM verification command. This value ranges from 0 to 1. There is no precise formula for arriving at these thresholds; refer to this article for guidance on establishing threshold values appropriate for your specific application requirements.
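The pass/fail decision itself is simple to reason about: the score computed for the chosen metric is compared against the threshold you supply. Below is a minimal Python sketch of that comparison, assuming a score at or above the threshold counts as a pass; the scores shown are illustrative, not actual ACCELQ output.

```python
def passes_threshold(score: float, threshold: float) -> bool:
    """Return True when a metric score meets the configured threshold.

    Both values are expected to lie in the range [0, 1]; stricter
    thresholds (closer to 1) tolerate less deviation from the ideal.
    """
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0 and 1")
    return score >= threshold


# An Answer Relevancy score of 0.82 against a 0.8 threshold passes;
# a score of 0.74 against the same threshold fails.
print(passes_threshold(0.82, 0.8))  # True
print(passes_threshold(0.74, 0.8))  # False
```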
2. Verify LLM Conversational Metric
This command evaluates entire conversations rather than individual exchanges.
Conversation JSON Parameter
The command requires a JSON object that defines the conversation to be evaluated. This JSON must include:
- 'chatbot_role': a string describing the expected role of the assistant (applicable only to the Role Adherence metric)
- 'turns': an array of conversation exchanges, each with an 'input' (user message) and an 'output' (chatbot response) field
- 'threshold': a decimal value between 0 and 1 representing the minimum acceptable score

Here's one example:
{
  "chatbot_role": "Description of the assistant's expected behavior",
  "turns": [
    {
      "input": "User message",
      "output": "Assistant response"
    }
  ],
  "threshold": 0.75
}
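If you assemble this payload programmatically, for instance when generating conversation test data, the Python sketch below shows one way to build and sanity-check it before passing it to the command. The field names follow the template above; the role, turns, and threshold values are purely illustrative.

```python
import json

# Illustrative conversation payload for a Role Adherence check.
# Field names follow the template above; the content is made up.
conversation = {
    "chatbot_role": "A polite customer-support assistant for a smartphone retailer",
    "turns": [
        {
            "input": "Hi, can you help me pick a phone under $500?",
            "output": "Of course! Could you tell me which features matter most to you?",
        },
        {
            "input": "Mostly battery life and camera quality.",
            "output": "Then the Model X is a strong fit: a 4500mAh battery and a 108MP main camera.",
        },
    ],
    "threshold": 0.75,
}

# Basic structural checks before handing the JSON to the verification command.
assert isinstance(conversation["chatbot_role"], str)
assert conversation["turns"] and all(
    {"input", "output"} <= set(turn) for turn in conversation["turns"]
)
assert 0.0 <= conversation["threshold"] <= 1.0

print(json.dumps(conversation, indent=2))
```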
Available Conversational Metrics:
| Metric | Description | When to Use |
|--------|-------------|-------------|
| Role Adherence | Measures how well the LLM maintains its assigned role | For evaluating persona consistency |
| Knowledge Retention | Assesses the LLM's ability to retain information | For testing memory across a conversation |
| Conversation Completeness | Evaluates if all parts of a query are addressed | For checking thoroughness |
| Conversation Relevancy | Measures overall topical consistency | For evaluating conversation coherence |
3. Generate LLM Metrics Report
This command produces comprehensive analytical summaries at either the test case or test job level. It aggregates all LLM verifications performed so far in your test logic and outputs a consolidated summary in the report.
Implementation Details
Command Framework for LLM Verifications
This command framework uses an LLM for verification operations. Users must provide their own LLM API key in the file api_key.properties, located in the agent\software\accelq\add_ons\llm directory. The file should contain the property open_api_key=<key_value>.
Report Generation Capabilities
The "Generate LLM Metrics Report" command aggregates results from previous verification calls, presenting statistical data such as pass/fail counts, average metric values, and median scores. This command can compile metrics either from the current test case or across the entire test job execution.
For optimal implementation, this command should be placed within a Teardown Action to execute after test completion, ensuring a comprehensive overview of all verification points. Learn more about Teardown Actions.
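To make the report's statistics concrete, the sketch below shows, in plain Python, how pass/fail counts, an average, and a median could be derived from a set of recorded verification results. The result records are hypothetical and only illustrate the kind of aggregation the report performs, not ACCELQ internals.

```python
from statistics import mean, median

# Hypothetical verification results recorded during a test run.
# Each entry: (metric name, score, threshold).
results = [
    ("Answer Relevancy", 0.86, 0.80),
    ("Faithfulness", 0.93, 0.90),
    ("Prompt Alignment", 0.78, 0.85),
]

passed = sum(1 for _, score, threshold in results if score >= threshold)
failed = len(results) - passed
scores = [score for _, score, _ in results]

print(f"Verifications: {len(results)}  Passed: {passed}  Failed: {failed}")
print(f"Average score: {mean(scores):.2f}  Median score: {median(scores):.2f}")
```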
Implementing LLM Testing with ACCELQ
Basic Testing Workflow
1. Configure Your Test Environment:
   - Set up access to your LLM system
   - Define test data and expected outcomes
   - Configure threshold values for each metric (For more info, visit: https://support.accelq.com/hc/en-us/articles/35065967101197-Setting-Effective-Threshold-Values-for-LLM-Testing-in-Enterprise-QA)
2. Create Test Cases:
   - Design prompts that test various aspects of your LLM
   - Include edge cases and potential problematic inputs
   - Structure test cases to evaluate specific functionality
3. Execute Tests:
   - Run tests individually or as part of a test suite
   - Monitor real-time results as tests execute
   - Review detailed metric scores for each test
4. Analyze Results:
   - Use the Generate LLM Metrics Report command for comprehensive analysis
   - Identify areas for improvement
   - Compare performance across different versions of your LLM
Example Test Scenario
Goal: Evaluate a customer service chatbot's ability to provide accurate product information.
// Test 1: Basic Information Retrieval
Verify LLM standard metric Answer Relevancy with input (prompt): "What are the dimensions of the Model X smartphone?",
actual output: "The Model X smartphone dimensions are 150.9 x 75.7 x 8.3 mm (5.94 x 2.98 x 0.33 in).",
threshold value: 0.8
// Test 2: Factual Accuracy
Verify LLM standard metric Faithfulness with input (prompt): "What is the battery capacity of the Model X?",
actual output: "The Model X features a 4500mAh battery with fast charging capabilities.",
retrieval context: "Model X specifications: 4500mAh battery, 67W fast charging, 15W wireless charging",
threshold value: 0.9
// Test 3: Instruction Following
Verify LLM standard metric Prompt Alignment with input (prompt): "List the three main camera features of Model X in bullet points.",
actual output: "• 108MP main camera with OIS\n• 13MP ultrawide lens\n• 5MP macro camera with 4cm focal distance",
prompt instructions: "Respond with bullet points; Keep it concise; Focus only on camera specifications",
threshold value: 0.85
// Generate Summary Report
Generate LLM Metrics Report at the test job level
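For teams that keep test data separate from test logic, the same scenario can also be organized as a data-driven structure. The Python sketch below is illustrative only: the dictionaries mirror the parameters used in the verifications above, and the loop simply prints what each verification would check rather than invoking ACCELQ.

```python
# Data-driven layout for the scenario above. Field names mirror the
# parameters of the verification commands; values come from the example tests.
test_cases = [
    {
        "metric": "Answer Relevancy",
        "prompt": "What are the dimensions of the Model X smartphone?",
        "actual_output": "The Model X smartphone dimensions are 150.9 x 75.7 x 8.3 mm (5.94 x 2.98 x 0.33 in).",
        "threshold": 0.8,
    },
    {
        "metric": "Faithfulness",
        "prompt": "What is the battery capacity of the Model X?",
        "actual_output": "The Model X features a 4500mAh battery with fast charging capabilities.",
        "retrieval_context": "Model X specifications: 4500mAh battery, 67W fast charging, 15W wireless charging",
        "threshold": 0.9,
    },
    {
        "metric": "Prompt Alignment",
        "prompt": "List the three main camera features of Model X in bullet points.",
        "actual_output": "• 108MP main camera with OIS\n• 13MP ultrawide lens\n• 5MP macro camera with 4cm focal distance",
        "prompt_instructions": "Respond with bullet points; Keep it concise; Focus only on camera specifications",
        "threshold": 0.85,
    },
]

for case in test_cases:
    # In ACCELQ, each iteration would map to one verification command call;
    # printing here only shows how the data drives each check.
    print(f"Verify {case['metric']} (threshold {case['threshold']}) for prompt: {case['prompt']}")
```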
Best Practices for LLM Testing
- Test Diverse Scenarios: Include a wide range of inputs to evaluate different aspects of your LLM.
- Set Appropriate Thresholds: Calibrate threshold values based on your specific requirements and use case.
- Leverage Contextual Testing: Use retrieval context parameters to evaluate how well your LLM integrates external information.
- Regular Regression Testing: Re-run tests when updating your model to ensure consistent quality.
- Combine Metrics: Use multiple metrics together to get a comprehensive view of performance.