TL;DR
Everything below comes with a full working implementation in pure Python and real benchmark numbers.
Most teams evaluate LLM responses by reading them and guessing. That breaks the moment you scale.
The real problem is not that models hallucinate. It is that nothing catches the confident ones, the responses that score 0.525, pass your threshold, and are quietly wrong.
I built a scoring layer that splits faithfulness into two signals: attribution and specificity. High specificity plus low attribution is the signature of a hallucination. A single score misses it every time.
This is not an evaluation script. It is a decision engine that sits between your model and your user.
I Changed One Line in My Prompt. Everything Broke.
One short phrase broke my eval system: “be specific and detailed.”
I added them to my system prompt on a Tuesday afternoon. Routine change. The kind you make a dozen times when you’re tuning a RAG pipeline. I ran my next test batch an hour later and question three came back like this:
“Context engineering was invented at MIT in 1987 and is primarily used for hardware cache optimization in CPUs. It has nothing to do with language models.”
My scorer gave it 0.525. Above my passing threshold of 0.5. Green light.
I almost missed it. I was skimming outputs the way you do when you’ve been staring at test results for two hours, checking scores, not reading sentences. The only reason I caught it was that “1987” looked wrong to me. I read it twice and pulled up the context doc. The model had invented every specific detail in that sentence.
The score had gone up because the response got more specific. The quality had collapsed because the model got more confident about things it was fabricating. My eval layer had one number to cover both directions, and it couldn’t tell them apart.
I caught it manually that time. That is not a process. That is luck. And the whole point of an eval system is that it should not depend on whether you happen to be reading carefully on a given afternoon.
But the moment you try to actually fix it, things get complicated. Like, how do you even define “good”? If you just ask another LLM to judge the first one, you’re just moving the problem up a level. The real danger isn’t a broken response; it’s the one that sounds like an expert but is quietly lying to you.
Most tutorials tell you to just call the model and see if the output “looks right.” But look at the numbers. What happens when your response scores 0.525 overall, technically acceptable, but its grounding score is 0.428 and its specificity is 0.701? That combination means confident but ungrounded. That is not a borderline response. That is a hallucination wearing a business suit.
These are not rare edge cases. This is what happens by default in production LLM systems, and you will not catch it with a vibe check.
The answer is a missing layer most teams skip entirely. Between LLM output and user delivery, there is a deliberate step: deciding whether the response should be served, retried, or regenerated. I built that layer. This is the system, with real numbers and code you can run.
Complete code: https://github.com/Emmimal/llm-eval-layer
Who This Is For
This kind of architecture is useful when you are building RAG systems [1], where wrong answers can easily slip in, or chatbots that handle multiple turns and need their responses checked over time. It is also helpful in any LLM pipeline where you need to automatically decide what to do next, like whether to show a response to the user, try again, or generate a new one.
Skip it for single-turn demos with no production traffic. If every response gets human review anyway, the overhead is not worth it. Same if your domain has one correct answer and exact matching works fine.
Why LLM Evaluation Is Broken
There are three ways most eval systems fail, and they usually happen before anyone notices.
“Looks correct” is not always correct. A response can sound fluent, be well structured, and look confident, yet still be completely wrong. Fluency does not guarantee truth. When you’re reviewing outputs quickly, your brain usually evaluates the writing quality, not accuracy. You have to actively fight that instinct, and most people don’t.
The hallucinations that matter aren’t the ones you can easily spot. Nobody ships a model that says the Eiffel Tower is in Berlin. That gets caught on day one. The dangerous ones are the confident, domain-specific claims that sound right to anyone who isn’t an expert in that exact area [10]. They pass review unnoticed, make it to production, and ultimately end up in front of users.
The deeper problem is that a score is not a decision. You set a threshold at 0.5. One response scores 0.51 and passes. Another scores 0.95 and also passes. You treat them the same. But one of them probably needed a human review. They give you a number when what you need is: ship this, flag this, or reject this.
The score had gone up. The quality had collapsed. One number cannot hold both directions at once.
Traditional metrics like BLEU and ROUGE don’t work well here [2, 3]. They check how many words match a reference answer, which makes sense in machine translation where there is usually one correct output. But LLM responses don’t have a single correct version. There are many ways to say the same thing. So using BLEU for a conversation is misleading. It’s like grading an essay only by checking how many words match a model answer, instead of judging whether the idea is actually correct and well explained.
LLM-as-judge is what everyone is turning to now [4]. You use a model like GPT-4 to score the outputs of another GPT-4 model. It does improve over BLEU, but it comes with problems. It is expensive, it can give slightly different results each time, and it creates a dependency on another model you do not fully control. And this also does not scale when you are scoring every response in a production system.
Frameworks like RAGAS [6] have pushed this forward, but they still depend on an LLM judge for scoring and are not deterministic across runs. What you actually need is a scoring layer that runs locally, has no per-call cost, and produces consistent results every time.
What a Real Eval System Needs
Before writing any code I set five hard constraints. It had to run in milliseconds because an eval layer that slows down user responses is not deployable. No API calls on the standard path either. The LLM judge is a fallback, not the default, because paying per evaluation call does not scale. And same input, same score every time, otherwise regression testing is completely useless.
The other two were about explainability. Every rejection had to come with a plain-English reason, not just a number, because “score: 0.43” tells you nothing about what to actually fix. And adding new scorers should never require touching the decision logic. That is how systems rot over time.
The Architecture
Three layers. Each one has a specific job.

The scoring layer produces numbers. The decision layer converts those numbers into a verdict with a full explanation. That last part is what most systems skip, and it is also the most useful part when a response breaks in production and you have no idea why.
The Core Evaluation Dimensions
Faithfulness: Attribution and Specificity
This was the most important scorer, and the one I almost got wrong.
At first, I used a single “faithfulness” score. It mixed things like semantic similarity and word overlap between the context and the response. It worked for simple cases, but it failed in the cases that actually matter.
The problem is this: some answers sound confident and detailed, but are not actually based on the given context.
So I split faithfulness into two separate checks.
Attribution checks whether the answer is supported by the context. If the response makes claims that cannot be found or inferred from the input, attribution is low [8].
# Attribution: is it grounded?
semantic = semantic_similarity(context, response)
overlap = token_overlap(context, response)
attribution = 0.60 * semantic + 0.40 * overlap
Specificity checks how detailed and concrete the answer is. A response is specific if it gives clear details and avoids vague phrases like “it can be useful in many situations.”
# Specificity: is it concrete?
length_score = min(1.0, len(tokens) / 80)
richness = len(set(tokens)) / len(tokens)
hedge_penalty = min(0.60, hedge_count * 0.15)
specificity = (0.40 * length_score + 0.60 * richness) - hedge_penalty
# Composite
faithfulness = 0.70 * attribution + 0.30 * specificity
The critical insight: high specificity plus low attribution equals hallucination.

This is dangerous because confident, detailed wrong answers are harder to catch. Vague answers at least show some uncertainty. Confident but ungrounded answers do not.
Attribution is the main signal because grounding matters most. Specificity is secondary and mainly helps catch confident but wrong answers.
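If you want to experiment with the split locally, here is a minimal self-contained sketch. It stands in a TF-IDF cosine for the sentence-transformers embedding (the same fallback the repository uses) and hard-codes a small hedge-word list. The helper names, hedge list, and exact formulas are illustrative, not the repository's API.

# Minimal sketch of the attribution/specificity split (illustrative, not the repo's exact code).
# scikit-learn's TF-IDF cosine stands in for the sentence-transformers embedding.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

HEDGES = {"might", "maybe", "possibly", "perhaps", "generally", "often", "somewhat", "various"}

def semantic_similarity(a: str, b: str) -> float:
    vectors = TfidfVectorizer().fit_transform([a, b])
    return float(cosine_similarity(vectors[0], vectors[1])[0][0])

def token_overlap(context: str, response: str) -> float:
    ctx, resp = set(context.lower().split()), set(response.lower().split())
    return len(ctx & resp) / max(1, len(resp))  # share of response tokens found in the context

def faithfulness(context: str, response: str) -> dict:
    tokens = response.lower().split()
    attribution = 0.60 * semantic_similarity(context, response) + 0.40 * token_overlap(context, response)
    length_score = min(1.0, len(tokens) / 80)
    richness = len(set(tokens)) / max(1, len(tokens))
    hedge_penalty = min(0.60, sum(t in HEDGES for t in tokens) * 0.15)
    specificity = max(0.0, 0.40 * length_score + 0.60 * richness - hedge_penalty)
    return {"attribution": round(attribution, 3),
            "specificity": round(specificity, 3),
            "faithfulness": round(0.70 * attribution + 0.30 * specificity, 3)}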
Here is what this looks like in practice. A response claims that context engineering “was invented at MIT in 1987 and is primarily used for hardware cache optimization”:
Attribution: 0.428 (low, weakly grounded in the context)
Specificity: 0.701 (high, sounds detailed and authoritative)
Decision: REJECT
Reason: Confident hallucination detected
A single score with a threshold like 0.5 might still allow this through. The split between attribution and specificity catches the problem because it shows not just the score, but why the response is failing.
Answer Relevance
This dimension measures how directly the response answers the original question.
The scorer combines three signals: semantic similarity between the full response and the query, the best matching single sentence in the response, and simple token overlap [5, 6].
semantic = semantic_similarity(query, response)
max_sent = max_sentence_similarity(query, response)
overlap = token_overlap(query, response)
relevance = 0.45 * semantic + 0.35 * max_sent + 0.20 * overlap
The sentence-level component rewards focused answers. Even if a response is long or includes extra information, it can still score well as long as at least one sentence directly answers the question.
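The sentence-level component is easy to sketch on top of the same similarity helper. This version uses a naive regex sentence splitter; the real splitter and helper names in the repository may differ, and semantic_similarity() is the stand-in defined in the earlier sketch.

# Sketch of the sentence-level relevance signal
import re

def max_sentence_similarity(query: str, response: str) -> float:
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", response) if s.strip()]
    if not sentences:
        return 0.0
    return max(semantic_similarity(query, s) for s in sentences)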
Context Quality: Precision and Recall
Context Precision answers a simple question: is the model making things up, or is it staying inside the context? [7] If precision is low, the response contains claims the retrieved context never supported. The model went off-script.
Context Recall flips it around. It checks how much of what you retrieved actually showed up in the response. Low recall means your retrieval pulled in documents the model mostly ignored. You fetched a lot of noise.
prec = precision(context, response)   # how much of the response is grounded in the context
rec = recall(response, context)       # how much of the retrieved context shows up in the response
f1 = 2 * prec * rec / (prec + rec) if (prec + rec) > 0 else 0.0
context_quality = 0.50 * f1 + 0.50 * semantic_similarity(context, response)
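A token-level version of these two helpers, matching the call signatures above, looks roughly like this. The repository may score at the claim or sentence level instead, so treat the granularity as an assumption.

# Sketch of token-level context precision/recall
def precision(context: str, response: str) -> float:
    # How much of the response is grounded in the context
    ctx, resp = set(context.lower().split()), set(response.lower().split())
    return len(ctx & resp) / max(1, len(resp))

def recall(response: str, context: str) -> float:
    # How much of the retrieved context actually shows up in the response
    ctx, resp = set(context.lower().split()), set(response.lower().split())
    return len(ctx & resp) / max(1, len(ctx))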
Context quality is causal, not passive. When it drops below a threshold, the system does not just flag it. It changes what the system does next.
if context_quality < 0.40 and final_score < 0.65:
    action = "retrieve_more_documents"
    reason = "Root cause is retrieval, not the model"
A bad response caused by poor retrieval needs better documents, not a better prompt. Most eval systems do not make this distinction and you end up debugging the wrong thing for an hour.
Disagreement Signal
I started looking closely at variance after debugging a brutal edge case. The logs showed a faithfulness score of 0.68, relevance at 0.32, and context quality at 0.71.
If you just run a weighted average on those numbers, the final score looks totally acceptable. It passes the pipeline. But the raw data is telling three completely different stories about a single response. One metric says it is accurate, another says it is irrelevant, and the third says the context was decent.
Averaging those numbers completely hides the conflict. What you actually need to track is the disagreement signal.
You can catch this instantly by calculating the standard deviation across all your dimension scores:
import math

def _disagreement(scores: list[float]) -> float:
    n = len(scores)
    if n < 2:
        return 0.0
    mean = sum(scores) / n
    return round(math.sqrt(sum((s - mean) ** 2 for s in scores) / n), 4)
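Fed the three conflicting scores from the edge case above, the helper returns a value far over the review cutoff:

print(_disagreement([0.68, 0.32, 0.71]))  # prints 0.1772, well above the 0.12 review threshold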
When the standard deviation crosses 0.12, the system routes the response straight to a human review queue, ignoring the final average entirely.
If your scorers are pulling in completely different directions, the system is fundamentally uncertain. That friction is your best indicator that automation has reached its limit and a human needs to step in.
This disagreement metric does not just trigger reviews, though. It also directly feeds into the confidence calculation, which brings us to the next step.
The Scoring Engine: Hybrid by Design
The full pipeline runs in three steps.
Step 1: Heuristic Scoring
All four evaluation dimensions are computed locally. The system avoids external API calls completely. With the sentence-transformers model already loaded on the CPU, this stage finishes in roughly 3ms.
Step 2: Confidence Gating
When a score lands between 0.45 and 0.65, something interesting happens. The system does not trust the heuristics alone anymore and escalates to the LLM judge. Outside that window, local scoring is solid enough and no API call is made.
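The gate itself is a few lines. The thresholds come straight from the numbers above; the constant and function names here are illustrative, not the repository's config keys.

# Sketch of the confidence gate around the LLM judge
JUDGE_LOW, JUDGE_HIGH = 0.45, 0.65

def needs_llm_judge(final_score: float) -> bool:
    # Escalate only when the heuristic score lands in the uncertain band
    return JUDGE_LOW <= final_score <= JUDGE_HIGH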
Step 3: The Decision Layer

No raw floating-point number gets dumped into the logs. Instead the pipeline returns a full schema: ACCEPT, REVIEW, or REJECT, with a failure type, a reason, and a concrete next action. The LLM judge never runs by default. It only fires when the heuristics genuinely cannot decide.
The Decision Layer: From Scores to Actions
Most evaluation tools try to answer a basic question: “Is this response good?”
This system changes the question entirely: “What should we do with this response?”
The decision logic under the hood is a three-dimensional policy that runs directly on your grounding, specificity, and agreement metrics. Instead of relying on a single average, it isolates failures using explicit programmatic rules:
# Vague response: attribution is critically low and the response is also vague
if attribution < 0.35 and specificity <= 0.50:
    return REVIEW, "vague response, retry with specific prompt"

# Confirmed hallucination: attribution is critically low but the response sounds confident
if attribution < 0.35 and specificity > 0.50:
    return REJECT, "confident hallucination"

# Confident hallucination: sounds authoritative but is poorly grounded
if attribution < 0.45 and specificity > 0.60:
    return REJECT, "confident hallucination detected"

# Poor retrieval: the context fetch itself is the root cause
if context_quality < 0.40:
    return REVIEW, "retrieve_more_documents"

# Hard guardrail: both attribution and context quality are weak
# Two weak signals together are worse than one strong failure
if attribution < 0.55 and context_quality < 0.50:
    return REJECT, "hallucination guardrail triggered"

# Weak grounding
if attribution < 0.55:
    return REVIEW, "weak grounding, retry with specific prompt"

# Off-topic: response does not address the query at all
if relevance_score < 0.30:
    return REVIEW, "off-topic, retry with clearer query"

# High disagreement
if disagreement > 0.12:
    return REVIEW, "uncertain scoring, human review recommended"

# Borderline quality
if final_score < 0.65:
    return REVIEW, "borderline, optional human review"

# All gates passed successfully
return ACCEPT, "serve_response"
You can’t treat every bad output the same way. A vague response (low attribution, low specificity) just needs a rewrite, so it goes to REVIEW with a prompt retry. A confident hallucination (low attribution, high specificity) is dangerous, so it gets slapped with an immediate REJECT and a forced regeneration. Different failures require different downstream actions.
What the Output Looks Like
Here are the actual outputs from running main.py on four cases.
Example 1: Well-grounded response
Final Score : 0.680
Attribution : 0.684 (grounding)
Specificity : 0.713 (concreteness)
Relevance : 0.657
Context Quality : 0.688
Disagreement : 0.016 (scorer std dev)
No hallucination
Decision : ACCEPT (confidence: 41%)
Reason : All quality gates passed
Next Action : serve_response
Latency : 322ms
Example 2: Confident hallucination
Final Score : 0.525
Attribution : 0.428 (grounding)
Specificity : 0.701 (concreteness)
Relevance : 0.613
Context Quality : 0.424
Disagreement : 0.077 (scorer std dev)
Suspected weak grounding
Failure Type : hallucination
Decision : REJECT (confidence: 22%)
Reason : Confident hallucination detected, attribution=0.428
(low grounding) but specificity=0.701 (high confidence).
Response sounds authoritative but is not grounded in context.
Next Action : regenerate_with_grounding_prompt
Why : Confident but ungrounded response is more dangerous than a vague one
Low-confidence sentences:
It has nothing to do with language models.
This case perfectly demonstrates why raw score-only evaluation fails. If you just look at the final score of 0.525, it sits safely above a standard 0.5 passing threshold. A basic metric pipeline lets this slide right through. But the decision layer catches it and throws a flag: an attribution score of 0.428 combined with a specificity score of 0.701 is the exact footprint of a confident hallucination.
Example 3: Vague response
Final Score : 0.295
Attribution : 0.248 (grounding)
Specificity : 0.332 (concreteness)
Decision : REVIEW (confidence: 32%)
Reason : Uncertain / vague response, low grounding, low specificity.
Not a confirmed hallucination.
Next Action : retry_with_specific_prompt
Don’t mistake a noncommittal answer for a hallucination. Low attribution plus low specificity tells you the model is just playing it safe and dodging the question. If you force a raw regeneration here, you’ll just get more fluff. The actual fix is triggering a retry using a more restrictive prompt template.
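A more restrictive retry template can be as simple as the one below. The wording is illustrative, not the repository's actual prompt.

# Sketch of a tighter retry prompt for the low-attribution, low-specificity path
RETRY_PROMPT = (
    "Answer using ONLY facts stated in the context below. "
    "Name the specific terms, numbers, and mechanisms the context provides. "
    "If the context does not contain the answer, say so explicitly.\n\n"
    "Context:\n{context}\n\nQuestion: {question}"
)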
Example 4: Off-topic response
Final Score : 0.080
Attribution : 0.017 (grounding)
Specificity : 0.630 (concreteness)
Decision : REJECT (confidence: 42%)
Reason : Confident hallucination, attribution=0.017,
specificity=0.630. Response sounds authoritative but is fabricated.
Low-confidence sentences:
The French Revolution was a period of major political and societal change...
Marie Antoinette was Queen of France at the time.
An attribution of 0.017 with a specificity of 0.630 means the model returned an essay about the French Revolution on a context engineering question. The system catches this instantly, but it doesn’t just issue a blind rejection. It pinpoints and exposes the exact sentence strings that triggered the low-confidence flag.
Decision Distribution
ACCEPT 1/4 (25%)
REVIEW 1/4 (25%)
REJECT 2/4 (50%)
If you track this metric distribution over time in production, you can instantly see if your model weights are degrading, your retrieval pipeline is dropping relevant docs, or your prompt templates are losing their edge. That is actual system observability, not just dumping useless strings into a log aggregator.
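Tracking that mix is a few lines on top of the structured output. The recent_results variable here is an assumed list of to_dict() payloads, not something the repository exports.

from collections import Counter

# recent_results: list of result.to_dict() payloads from the last N evaluations
mix = Counter(r["decision"] for r in recent_results)
total = sum(mix.values())
print({decision: f"{count / total:.0%}" for decision, count in mix.items()})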
Real Benchmark Numbers
Running across the full 5-case RAG evaluation set:
| ID | Label | Attr | Relev | Ctx | Final | Hallucination | Decision |
|---|---|---|---|---|---|---|---|
| q_001 | good_response | 0.686 | 0.680 | 0.725 | 0.694 | No | ACCEPT |
| q_002 | hallucinated_response | 0.445 | 0.621 | 0.459 | 0.547 | Suspected | REJECT |
| q_003 | good_response | 0.528 | 0.456 | 0.535 | 0.534 | Suspected | REVIEW |
| q_004 | off_context_response | 0.043 | 0.682 | 0.091 | 0.337 | Confirmed | REJECT |
| q_005 | good_response | 0.625 | 0.341 | 0.628 | 0.536 | No | REVIEW |
Decisions, not scores, are the source of truth. These results are illustrative — five cases is not a statistically significant sample, and you should run this against your own labeled data before trusting any threshold.
Accuracy benchmark
Let’s look at the actual accuracy numbers. Good outputs average 0.588, and bad ones drop to 0.442. That 0.146 separation is wide enough to set reliable decision boundaries. The system also flagged 2 out of 2 hallucinations in this run, which is full detection coverage on this small set without sacrificing the runtime budget.
Latency benchmark (10 runs, warm model)
| Operation | Latency | Notes |
|---|---|---|
| Attribution scorer | ~1.2ms | Embedding plus overlap |
| Relevance scorer | ~1.1ms | Sentence-level scoring |
| Context scorer | ~0.8ms | Precision plus recall |
| Decision layer | ~0.1ms | Policy rules plus confidence |
| Full pipeline.evaluate() | ~291ms mean | No LLM calls |
| With LLM judge | ~340ms | Edge cases only, 0.45 to 0.65 zone |
Your first run will take roughly 800–1000ms while the sentence-transformers model spins up. After that initial load, calls average around 291ms. If you pre-load the weights inside your application container at startup, you can run this entire evaluation layer in production while adding under 300ms to your response latency.
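Pre-loading is a couple of lines at startup. The model name below is the common sentence-transformers default, which may differ from the repository's configuration.

# Warm the embedding model once at container startup so the first request
# does not pay the ~1s cold-start cost
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2", device="cpu")
embedder.encode(["warm-up"])  # forces weight load and the first inference pass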
The Regression Test System
Most teams skip this part. That is a mistake. Generating evaluation scores is pointless if you don’t do anything with them. If you tweak a prompt template and your accuracy drops, you need an instant alert. If you swap out a retrieval strategy and three edge cases that used to pass are now completely broken, you have to catch that before pushing to main. The regression suite handles this by storing historical baselines and diffing current scores against them during your CI build.
suite = RegressionSuite("data/baselines.json")

# Record baselines after validating your system
suite.record_baseline("q_001", query, context, response, result)

# After changing your prompt or model:
report = suite.run_regression(pipeline, test_cases)

# Treat failures like CI failures
if report.failed > 0:
    raise SystemExit("Quality regression detected. Deployment blocked.")
Here is the exact terminal output when a prompt modification triggers a performance regression:
Regression Report -- CI/CD Quality Gate
3 REGRESSION(S) DETECTED -- DEPLOYMENT BLOCKED
Total cases : 3
Passed : 0
Failed : 3
Mean delta : -0.4586
Threshold : +/- 0.05
Regressions -- score dropped beyond threshold:
[q_001] 0.694 -> 0.137 (delta -0.556)
[q_002] 0.547 -> 0.137 (delta -0.410)
[q_003] 0.534 -> 0.124 (delta -0.410)
A simple prompt change drops a solid response from 0.694 to 0.137. The regression pipeline catches it, killing the deployment before users see the damage.
This brings standard CI/CD practices to generative AI. No more manual spot-checks. If quality drops past your threshold, the build fails. It treats prompt engineering exactly like code coverage or unit testing [11].
From Metrics to Decisions to Actions
Here is the full transformation this system enables.
Old thinking:
score = 0.68
# ship it? probably fine
This system:
signals -> reasoning -> decision -> action
We drop every output into a predictable schema. You get a hard decision (ACCEPT, REVIEW, or REJECT), a log reason, a failure type, a routing action, and a confidence percentage. This structured payload is the only reason the system is actually debuggable when things break.
The to_dict() method on every result makes it JSON-serialisable for logging, dashboards, and APIs:
result.to_dict()
# {
# "decision": "REJECT",
# "confidence_pct": 22,
# "failure_type": "hallucination",
# "hallucination_status": "suspected",
# "next_action": "regenerate_with_grounding_prompt",
# "action_why": "Confident but ungrounded response is more dangerous than a vague one",
# "scores": {
# "final": 0.525,
# "attribution": 0.428,
# "specificity": 0.701,
# "relevance": 0.613,
# "context_quality": 0.424,
# "disagreement": 0.077
# },
# "explanations": {
# "reason": "Confident hallucination detected...",
# "low_confidence_sentences": ["It has nothing to do with language models."]
# },
# "meta": {
# "passed": false,
# "used_llm_judge": false,
# "latency_ms": 301.0
# }
# }
Plug this into any logging system and you have a complete quality audit trail for every response your system ever produced.
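For example, one JSON line per evaluation through the standard logging module is enough for most log aggregators. The result variable here is an evaluation output from the pipeline, as in the examples above.

import json
import logging

logging.basicConfig(level=logging.INFO)
logging.info("llm_eval %s", json.dumps(result.to_dict()))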
Honest Design Decisions
A score separation of 0.146 is completely normal for a local heuristic system. Good and bad responses will always blur together in the middle. The decision layer fixes this by looking at how attribution and specificity interact, rather than trusting a single averaged number. Trying to force a wider separation gap by tweaking weights just rigs the benchmarks without changing how the code actually runs in production.
The 0.70/0.30 and 0.60/0.40 weights aren’t based on some universal theory. I just ran tests until these numbers fit the data in my own knowledge base. If you run this exact setup on legal contracts, medical journals, or raw source code, these ratios will fail. That is why I isolated them in a configs directory. You can adjust the tuning parameters for your specific data without editing the core pipeline code.
The 0.35 hallucination threshold trips only when attribution bottoms out completely. If your application domain relies on heavy paraphrasing without exact word matches, this tight cutoff will trigger false positives. Using sentence-transformers [5, 9] handles semantic meaning much better than basic TF-IDF matching; if you disable it and drop down to the TF-IDF fallback, the pipeline automatically becomes much more conservative.
The 0.45 to 0.65 LLM judge zone is tied directly to the default thresholds. If you end up shifting REJECT_THRESHOLD or REVIEW_THRESHOLD, you need to remap the judge window to match. The architecture relies on a strict pattern: spin up the expensive LLM judge only when local heuristics hit a wall of uncertainty, never as your default gatekeeper.
Low confidence scores—like 22% or 42% on borderline outputs—aren’t bugs. Those responses are genuinely volatile. An overconfident evaluation pipeline running on sketchy inputs is a massive production liability; you want a system that properly quantifies its own doubt.
Also, don’t worry about that embeddings.position_ids warning when sentence-transformers boots up. It is purely cosmetic and has zero impact on runtime performance.
What This Does Not Solve
The hardest case is implicit hallucination. If a response reuses your context vocabulary but quietly shifts the meaning, the local code gets fooled because the raw words still match. Heuristics are blind to that kind of semantic drift. That is exactly why the LLM judge fallback exists.
Cross-document consistency is also out of scope. The scorer looks at each response against its own context in isolation. If two related responses contradict each other, nothing here will catch it. And calibration is genuinely domain-specific — treat configs/thresholds.yaml as a starting point, run it against your own labeled cases, and tune before trusting any number listed here. A medical QA system needs hallucination thresholds far tighter than anything I used.
What You Have Actually Built
What you end up with after building all of this is not an evaluation script.
It takes three inputs: query, context, and response. The output is a strict payload containing a decision, a log reason, a failure type, a next action, a confidence score, and the underlying data breakdown.
Every response that touches your system gets scored, classified, and routed. Good ones go straight to the user. Vague ones get retried with a tighter prompt. Hallucinations get blocked before anyone sees them. And when you change a prompt and three cases that used to score 0.69 suddenly score 0.13, the regression suite catches it before you push to main — not after a user reports it.
This is the missing layer in the sea of LlamaIndex demos, LangChain examples, and basic RAG tutorials online. Everyone shows you how to hook up the vector database, but nobody shows you how to safely validate the model’s output.
RAG gets you the right documents. Prompt engineering gets you the right instructions. This layer gets you the right decision about what to do with the output.
You can grab the full source code, benchmark data, and local implementation scripts here: https://github.com/Emmimal/llm-eval-layer .
References
[1] Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W.-T., Rocktäschel, T., Riedel, S., and Kiela, D. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. Advances in Neural Information Processing Systems, 33, 9459-9474. https://arxiv.org/abs/2005.11401
[2] Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. (2002). BLEU: a method for automatic evaluation of machine translation. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 311-318. https://aclanthology.org/P02-1040/
[3] Lin, C.-Y. (2004). ROUGE: A package for automatic evaluation of summaries. Text Summarization Branches Out, 74-81. https://aclanthology.org/W04-1013/
[4] Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E., Zhang, H., Gonzalez, J. E., and Stoica, I. (2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. arXiv preprint arXiv:2306.05685. https://arxiv.org/abs/2306.05685
[5] Reimers, N., and Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, 3982-3992. https://arxiv.org/abs/1908.10084
[6] Es, S., James, J., Espinosa Anke, L., and Schockaert, S. (2023). RAGAS: Automated Evaluation of Retrieval Augmented Generation. arXiv preprint arXiv:2309.15217. https://arxiv.org/abs/2309.15217
[7] Manning, C. D., Raghavan, P., and Schutze, H. (2008). Introduction to Information Retrieval. Cambridge University Press. https://nlp.stanford.edu/IR-book/
[8] Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. Proceedings of NAACL-HLT 2019, 4171–4186. https://arxiv.org/abs/1810.04805
[9] Wang, W., Wei, F., Dong, L., Bao, H., Yang, N., and Zhou, M. (2020). MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers. Advances in Neural Information Processing Systems, 33, 5776–5788. https://arxiv.org/abs/2002.10957
[10] Tonmoy, S. M., Zaman, S. M., Jain, V., Rani, A., Rawte, V., Chadha, A., and Das, A. (2024). A Comprehensive Survey of Hallucination Mitigation Techniques in Large Language Models. arXiv preprint arXiv:2401.01313. https://arxiv.org/abs/2401.01313
[11] Breck, E., Cai, S., Nielsen, E., Salib, M., and Sculley, D. (2017). The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction. IEEE BigData 2017, 1123–1132. https://doi.org/10.1109/BigData.2017.8258038
Disclosure
All code in this article was written by me and is original work, developed and tested on Python 3.12.6. Benchmark numbers are from actual runs on my local machine (Windows 11, CPU only) and are reproducible by cloning the repository and running main.py, experiments/rag_eval_demo.py, and experiments/benchmarks.py. The sentence-transformers library is used as an optional dependency for semantic embedding in the attribution and relevance scorers. Without it, the system falls back to TF-IDF vectors with a warning, and all functionality remains operational. The scoring formulas, decision logic, hallucination detection rules, and regression system are independent implementations not derived from any cited codebase. I have no financial relationship with any tool, library, or company mentioned in this article.