Confidence Scores

Understanding what confidence scores mean and how to use them.

What is a Confidence Score?

A confidence score (0.0 to 1.0) represents how strongly the knowledge graph supports a claim:

{
  "claim": "The Great Wall of China is visible from space",
  "verified": false,
  "confidence": 0.23
}

  • 1.0 = Perfect match with verified knowledge
  • 0.0 = No supporting evidence found
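
In code, the score is simply a field on the verification result. A minimal sketch, assuming the client returns the response above as a Python dict:

result = client.verify("The Great Wall of China is visible from space")

print(result["confidence"])  # e.g. 0.23
print(result["verified"])    # e.g. False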

Score Interpretation

High Confidence (0.85 - 1.0)

The claim strongly matches verified knowledge:

result = client.verify("Water boils at 100 degrees Celsius at sea level")
# confidence: 0.97 - well-established scientific fact

Interpretation: Very likely accurate. Safe to present without qualification.

Medium Confidence (0.60 - 0.84)

The claim is supported but with some uncertainty:

result = client.verify("Coffee was discovered in Ethiopia")
# confidence: 0.75 - historically accepted but not definitively proven

Interpretation: Probably accurate. Consider adding context or hedging language.

Low Confidence (0.30 - 0.59)

Limited support or conflicting evidence:

result = client.verify("Humans only use 10% of their brain")
# confidence: 0.35 - common myth, contradicts scientific consensus

Interpretation: Questionable accuracy. Verify through additional sources.

Very Low Confidence (0.0 - 0.29)

No support or contradicts known facts:

result = client.verify("The moon is made of cheese")
# confidence: 0.02 - contradicts verified scientific knowledge

Interpretation: Likely inaccurate. Do not present as fact.

Factors Affecting Confidence

1. Direct Knowledge Match

Claims that directly match graph entries score highest:

"Paris is the capital of France"
  → Direct edge: Paris --capital_of--> France
  → Confidence boost: +0.40

2. Path Length

Shorter inference paths yield higher confidence:

"Dogs are animals"
  → 1-hop: dog --is_a--> animal
  → Confidence: 0.95

"Poodles are animals"
  → 2-hop: poodle --is_a--> dog --is_a--> animal
  → Confidence: 0.88

3. Edge Strength

Stronger relationships increase confidence:

"Chairs are furniture"
  → Edge strength: primary (1.0)
  → Confidence boost: +0.30

"Chairs are objects"
  → Edge strength: secondary (0.6)
  → Confidence boost: +0.18

4. Source Consensus

Multiple supporting paths increase confidence:

"Einstein was a physicist"
  → Path 1: Einstein --profession--> physicist
  → Path 2: Einstein --known_for--> relativity --field--> physics
  → Path 3: Einstein --worked_at--> Princeton --department--> physics
  → Combined confidence: 0.96
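
One way to build intuition for how these factors interact is a toy model: score each supporting path from a base boost, a per-hop decay, and the edge strength, then combine independent paths noisy-OR style. This is an illustrative sketch, not the engine's actual scoring formula; every constant below is made up:

def path_score(direct_boost, hops, edge_strength, decay=0.93):
    # Illustrative only: each extra hop applies a decay factor, and
    # weaker edges scale the score down further.
    return min(1.0, direct_boost * (decay ** (hops - 1)) * edge_strength)

def combined_confidence(path_scores):
    # Noisy-OR style combination: each independent supporting path
    # removes a share of the remaining doubt.
    disbelief = 1.0
    for score in path_scores:
        disbelief *= (1.0 - score)
    return 1.0 - disbelief

# "Einstein was a physicist": a short profession edge plus two longer,
# weaker supporting paths
paths = [
    path_score(0.85, hops=1, edge_strength=1.0),  # profession edge
    path_score(0.70, hops=2, edge_strength=0.8),  # via relativity/physics
    path_score(0.60, hops=2, edge_strength=0.6),  # via Princeton/physics
]
print(round(combined_confidence(paths), 2))  # ≈ 0.95 in this toy model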

Using Confidence Thresholds

Setting Thresholds

Choose thresholds based on your risk tolerance:

# High-stakes application (medical, legal, financial)
THRESHOLD = 0.90
 
# General content verification
THRESHOLD = 0.70
 
# Exploratory/research use
THRESHOLD = 0.50

Threshold-Based Actions

def handle_verification(result):
    confidence = result["summary"]["avg_confidence"]
 
    if confidence >= 0.90:
        return "verified"
    elif confidence >= 0.70:
        return "likely_accurate"
    elif confidence >= 0.50:
        return "needs_review"
    else:
        return "flagged"

Confidence in API Responses

Per-Claim Confidence

{
  "claims": [
    {
      "text": "The Earth is round",
      "verified": true,
      "confidence": 0.99
    },
    {
      "text": "The Earth is 4.5 billion years old",
      "verified": true,
      "confidence": 0.92
    }
  ]
}
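
Per-claim scores let you handle mixed results instead of accepting or rejecting the whole response. A small sketch, assuming the response is parsed into a dict shaped like the JSON above:

REVIEW_THRESHOLD = 0.70

for claim in result["claims"]:
    if not claim["verified"] or claim["confidence"] < REVIEW_THRESHOLD:
        print(f"Needs review: {claim['text']} ({claim['confidence']:.2f})")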

Aggregate Confidence

{
  "summary": {
    "total_claims": 5,
    "verified": 4,
    "avg_confidence": 0.87,
    "min_confidence": 0.72,
    "max_confidence": 0.98
  }
}
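
If you only have the claims array, equivalent statistics are easy to recompute client-side. This sketch assumes the aggregate fields are plain count/average/min/max over the per-claim values:

confidences = [c["confidence"] for c in result["claims"]]

summary = {
    "total_claims": len(confidences),
    "verified": sum(1 for c in result["claims"] if c["verified"]),
    "avg_confidence": sum(confidences) / len(confidences),
    "min_confidence": min(confidences),
    "max_confidence": max(confidences),
}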

Confidence vs. Verified

These are related but distinct:

Scenario       Verified   Confidence   Meaning
Strong match   true       0.95         Definitely true
Weak match     true       0.65         Probably true
No evidence    false      0.40         Unknown
Contradicts    false      0.10         Probably false

# A claim can be "verified" with low confidence
{
  "text": "This species was discovered in 1952",
  "verified": true,
  "confidence": 0.55  # Limited evidence, but what exists supports it
}
 
# A claim can be "unverified" with medium confidence
{
  "text": "This product cures cancer",
  "verified": false,
  "confidence": 0.45  # Some marketing claims exist, but no scientific support
}
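
To act on both fields together, branch on verified first and then on confidence. The cutoffs below are illustrative, not defined by the API; they simply mirror the scenario table above:

def claim_status(claim):
    if claim["verified"]:
        # Evidence supports the claim; confidence says how strongly.
        return "strong_match" if claim["confidence"] >= 0.85 else "weak_match"
    # Evidence is absent or points the other way.
    return "contradicted" if claim["confidence"] < 0.30 else "unknown"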

Best Practices

1. Don't Over-Rely on a Single Score

Consider the full picture:

def assess_reliability(result):
    # Check multiple factors
    avg = result["summary"]["avg_confidence"]
    min_conf = result["summary"]["min_confidence"]
    flagged_count = sum(1 for c in result["claims"] if not c["verified"])
 
    if avg > 0.85 and min_conf > 0.70 and flagged_count == 0:
        return "high_reliability"
    elif avg > 0.70:
        return "moderate_reliability"
    else:
        return "low_reliability"

2. Track Confidence Over Time

Monitor trends in your verification results:

import logging
from datetime import datetime, timezone

logger = logging.getLogger(__name__)

# Log each verification result for later analysis
logger.info("verification_result", extra={
    "avg_confidence": result["summary"]["avg_confidence"],
    "claim_count": result["summary"]["total_claims"],
    "source": "gpt-4-output",
    "timestamp": datetime.now(timezone.utc).isoformat()
})

3. Adjust Thresholds by Domain

Different domains may need different thresholds:

DOMAIN_THRESHOLDS = {
    "medical": 0.95,
    "legal": 0.90,
    "news": 0.80,
    "entertainment": 0.60,
    "casual": 0.50
}
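
Then look the threshold up per request, falling back to a strict default for domains you haven't classified:

def threshold_for(domain):
    # Unknown domains get the most conservative threshold.
    return DOMAIN_THRESHOLDS.get(domain, 0.95)

passes = result["summary"]["avg_confidence"] >= threshold_for("news")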

4. Communicate Uncertainty

When showing results to users:

def confidence_label(score):
    if score >= 0.90:
        return "Verified"
    elif score >= 0.70:
        return "Likely accurate"
    elif score >= 0.50:
        return "Unconfirmed"
    else:
        return "Disputed"