Confidence Scores
Understanding what confidence scores mean and how to use them.
What is a Confidence Score?
A confidence score (0.0 to 1.0) represents how strongly the knowledge graph supports a claim:
{
"claim": "The Great Wall of China is visible from space",
"verified": false,
"confidence": 0.23
}
- 1.0 = Perfect match with verified knowledge
- 0.0 = No supporting evidence found
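For example, a response like this can be acted on programmatically. The snippet below is a minimal Python sketch; the field names mirror the example response above, and the 0.3 cutoff is illustrative only:
# Minimal sketch - field names follow the example response above;
# the 0.3 cutoff is illustrative, not a recommended default
response = {
    "claim": "The Great Wall of China is visible from space",
    "verified": False,
    "confidence": 0.23,
}
if not response["verified"] and response["confidence"] < 0.3:
    print("Claim lacks support in the knowledge graph - do not present as fact")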
Score Interpretation
High Confidence (0.85 - 1.0)
The claim strongly matches verified knowledge:
result = client.verify("Water boils at 100 degrees Celsius at sea level")
# confidence: 0.97 - well-established scientific fact
Interpretation: Very likely accurate. Safe to present without qualification.
Medium Confidence (0.60 - 0.84)
The claim is supported but with some uncertainty:
result = client.verify("Coffee was discovered in Ethiopia")
# confidence: 0.75 - historically accepted but not definitively proven
Interpretation: Probably accurate. Consider adding context or hedging language.
Low Confidence (0.30 - 0.59)
Limited support or conflicting evidence:
result = client.verify("Humans only use 10% of their brain")
# confidence: 0.35 - common myth, contradicts scientific consensus
Interpretation: Questionable accuracy. Verify through additional sources.
Very Low Confidence (0.0 - 0.29)
No support or contradicts known facts:
result = client.verify("The moon is made of cheese")
# confidence: 0.02 - contradicts verified scientific knowledge
Interpretation: Likely inaccurate. Do not present as fact.
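These four bands can be folded into a small helper when you need a quick programmatic read. This is an illustrative sketch using the ranges from this page, not part of the client library:
def interpret_confidence(score):
    # Bands mirror the ranges described above (illustrative helper, not an API call)
    if score >= 0.85:
        return "high"        # safe to present without qualification
    elif score >= 0.60:
        return "medium"      # probably accurate; consider hedging language
    elif score >= 0.30:
        return "low"         # questionable; verify through additional sources
    else:
        return "very_low"    # likely inaccurate; do not present as fact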
Factors Affecting Confidence
1. Direct Knowledge Match
Claims that directly match graph entries score highest:
"Paris is the capital of France"
→ Direct edge: Paris --capital_of--> France
→ Confidence boost: +0.40
2. Path Length
Shorter inference paths yield higher confidence:
"Dogs are animals"
→ 1-hop: dog --is_a--> animal
→ Confidence: 0.95
"Poodles are animals"
→ 2-hop: poodle --is_a--> dog --is_a--> animal
→ Confidence: 0.88
3. Edge Strength
Stronger relationships increase confidence:
"Chairs are furniture"
→ Edge strength: primary (1.0)
→ Confidence boost: +0.30
"Chairs are objects"
→ Edge strength: secondary (0.6)
→ Confidence boost: +0.18
4. Source Consensus
Multiple supporting paths increase confidence:
"Einstein was a physicist"
→ Path 1: Einstein --profession--> physicist
→ Path 2: Einstein --known_for--> relativity --field--> physics
→ Path 3: Einstein --worked_at--> Princeton --department--> physics
→ Combined confidence: 0.96
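The exact scoring formula is internal to the verification engine, but the interplay of these four factors can be approximated. The sketch below uses assumed weights (direct-match boost, per-hop decay, edge-strength boost, and a diminishing bonus per additional supporting path); it is an illustration, not the production algorithm:
def estimate_confidence(direct_match, hops, edge_strength, supporting_paths):
    # All weights below are assumptions for illustration only
    score = 0.50 if direct_match else 0.25   # direct knowledge match
    score += 0.30 * edge_strength            # edge strength in [0.0, 1.0]
    score *= 0.93 ** max(hops - 1, 0)        # longer inference paths decay
    for extra in range(1, supporting_paths): # diminishing consensus bonus
        score += 0.10 / extra
    return min(score, 1.0)

# A direct, primary-strength claim backed by three independent paths
# lands around 0.95 with these toy weights.
print(estimate_confidence(True, hops=1, edge_strength=1.0, supporting_paths=3))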
Using Confidence Thresholds
Setting Thresholds
Choose thresholds based on your risk tolerance:
# High-stakes application (medical, legal, financial)
THRESHOLD = 0.90
# General content verification
THRESHOLD = 0.70
# Exploratory/research use
THRESHOLD = 0.50
Threshold-Based Actions
def handle_verification(result):
    confidence = result["summary"]["avg_confidence"]
    if confidence >= 0.90:
        return "verified"
    elif confidence >= 0.70:
        return "likely_accurate"
    elif confidence >= 0.50:
        return "needs_review"
    else:
        return "flagged"
Confidence in API Responses
Per-Claim Confidence
{
"claims": [
{
"text": "The Earth is round",
"verified": true,
"confidence": 0.99
},
{
"text": "The Earth is 4.5 billion years old",
"verified": true,
"confidence": 0.92
}
]
}
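Per-claim scores let you act on individual claims rather than the whole response. A minimal sketch, assuming the response has been parsed into a dict shaped like the example above (the 0.70 cutoff is illustrative):
LOW_CONFIDENCE = 0.70  # illustrative cutoff

def claims_needing_review(response):
    # Collect claims that are unverified or fall below the cutoff
    return [
        claim for claim in response["claims"]
        if not claim["verified"] or claim["confidence"] < LOW_CONFIDENCE
    ]

response = {
    "claims": [
        {"text": "The Earth is round", "verified": True, "confidence": 0.99},
        {"text": "The Great Wall of China is visible from space",
         "verified": False, "confidence": 0.23},
    ]
}
for claim in claims_needing_review(response):
    print(f'Needs review: {claim["text"]} (confidence {claim["confidence"]:.2f})')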
Aggregate Confidence
{
"summary": {
"total_claims": 5,
"verified": 4,
"avg_confidence": 0.87,
"min_confidence": 0.72,
"max_confidence": 0.98
}
}
Confidence vs. Verified
These are related but distinct:
| Scenario | Verified | Confidence | Meaning |
|---|---|---|---|
| Strong match | true | 0.95 | Very likely true |
| Weak match | true | 0.65 | Probably true |
| No evidence | false | 0.40 | Unknown |
| Contradicts | false | 0.10 | Probably false |
# A claim can be "verified" with low confidence
{
"text": "This species was discovered in 1952",
"verified": true,
"confidence": 0.55 # Limited evidence, but what exists supports it
}
# A claim can be "unverified" with medium confidence
{
"text": "This product cures cancer",
"verified": false,
"confidence": 0.45 # Some marketing claims exist, but no scientific support
}
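In practice the two fields are usually read together. The helper below is an illustrative sketch; the category names and thresholds are assumptions chosen to match the table above, not part of the API:
def classify(claim):
    # Combine the verified flag with the confidence score
    if claim["verified"] and claim["confidence"] >= 0.85:
        return "supported"            # strong match
    elif claim["verified"]:
        return "weakly_supported"     # verified, but limited evidence
    elif claim["confidence"] <= 0.30:
        return "contradicted"         # unverified and evidence points against it
    else:
        return "unknown"              # unverified, evidence inconclusive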
Best Practices
1. Don't Over-Rely on Single Scores
Consider the full picture:
def assess_reliability(result):
    # Check multiple factors, not just the average
    avg = result["summary"]["avg_confidence"]
    min_conf = result["summary"]["min_confidence"]
    flagged_count = sum(1 for c in result["claims"] if not c["verified"])
    if avg > 0.85 and min_conf > 0.70 and flagged_count == 0:
        return "high_reliability"
    elif avg > 0.70:
        return "moderate_reliability"
    else:
        return "low_reliability"
2. Track Confidence Over Time
Monitor trends in your verification results:
import logging
from datetime import datetime

logger = logging.getLogger(__name__)

# Log for analysis
logger.info("verification_result", extra={
    "avg_confidence": result["summary"]["avg_confidence"],
    "claim_count": result["summary"]["total_claims"],
    "source": "gpt-4-output",
    "timestamp": datetime.now(),
})
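Those records can then be aggregated to spot drift. A minimal sketch, assuming the logged entries have been collected into a list of dicts carrying the fields shown above:
from collections import defaultdict
from statistics import mean

def daily_average_confidence(records):
    # records: dicts with a "timestamp" datetime and an "avg_confidence" float
    by_day = defaultdict(list)
    for record in records:
        by_day[record["timestamp"].date()].append(record["avg_confidence"])
    return {day: mean(scores) for day, scores in sorted(by_day.items())}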
3. Adjust Thresholds by Domain
Different domains may need different thresholds:
DOMAIN_THRESHOLDS = {
    "medical": 0.95,
    "legal": 0.90,
    "news": 0.80,
    "entertainment": 0.60,
    "casual": 0.50,
}
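A lookup helper with a fallback keeps the mapping easy to apply; the 0.70 default below is an assumption matching the general content threshold earlier on this page:
def threshold_for(domain, default=0.70):
    # Fall back to the general-content threshold for unlisted domains
    return DOMAIN_THRESHOLDS.get(domain, default)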
4. Communicate Uncertainty
When showing results to users:
def confidence_label(score):
    if score >= 0.90:
        return "Verified"
    elif score >= 0.70:
        return "Likely accurate"
    elif score >= 0.50:
        return "Unconfirmed"
    else:
        return "Disputed"