How to Build an Effective Knowledge Base for AI Models | Towards Data Science

AI models are only as strong as their knowledge base. A well-structured, accurate knowledge base improves both the speed and the accuracy of a model, the two areas where current models often fall short. In fact, a recent study shows that major AI chatbots misinterpret every second query.

In this article, I walk through how you can build a reliable knowledge base, with detailed steps and the mistakes to avoid.

6 Steps to Build an Effective Knowledge Base

Steps to build a knowledge base | Image by author

Taking a systematic approach to building a knowledge base helps you create one that is standardized, scalable, and self-explanatory. Any new developer can then easily add to it or update it over time, which keeps the knowledge base current and reliable.

To make sure you get there, you can follow these six steps every time you start building a knowledge base:

1. Collect data

The main misconception in collecting data for a knowledge base is the assumption that more is better. It lands you in the classic “garbage in, garbage out” problem.

Prioritize value over volume, and collect the data that is relevant to your model. It can take the form of:

Factual and training content covering facts and procedures
Troubleshooting content in the form of instructional text or video
Historical data showing previous issues or execution logs
Real-time data covering live system status or the latest news feeds
Domain data that gives the model additional context

It is important to understand that your system does not need every kind of information. For example, if you are building a customer support chatbot, your model may only need factual content and training material that explain company policy and procedures. This ensures the model does not generate wrong or out-of-scope answers and sticks to what it has been given.

Tip: There is a growing trend of feeding AI-generated data into the knowledge bases of new AI models. I feel this practice is a bit of a double-edged sword. It gives you speed, but you need to check the output for reliability and fluff. Always refine the content so it gives clear answers, and validate the output before adding it to the knowledge base.

2. Clean the data and split it into chunks

Once your raw data is ready, clean it first. The cleaning process usually includes:

Removing duplicate and outdated content
Stripping irrelevant details such as headers, footers, and page numbers
Standardizing the content, both its format and its wording (consistent terminology)
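The cleaning steps above can be sketched in a few lines, assuming documents arrive as plain strings. The `TERM_MAP` entries, the page-number pattern, and all function names are illustrative assumptions, not part of any particular toolkit.

```python
import hashlib
import re

# Hypothetical terminology map: variant -> preferred term.
TERM_MAP = {"log-in": "login", "sign in": "login", "e-mail": "email"}

# Assumed page-number styles: "Page 3", "Page 3 of 10", or a bare number on its own line.
PAGE_LINE = re.compile(r"^\s*(page\s+\d+(\s+of\s+\d+)?|\d+)\s*$", re.IGNORECASE)


def strip_page_lines(text: str) -> str:
    """Drop lines that look like page numbers."""
    kept = [line for line in text.splitlines() if not PAGE_LINE.match(line)]
    return "\n".join(kept)


def standardize_terms(text: str) -> str:
    """Replace known variants with the preferred term (case-insensitive)."""
    for variant, preferred in TERM_MAP.items():
        text = re.sub(re.escape(variant), preferred, text, flags=re.IGNORECASE)
    return text


def clean_documents(docs: list[str]) -> list[str]:
    """Clean each document, then drop empty results and exact duplicates."""
    seen: set[str] = set()
    cleaned = []
    for doc in docs:
        text = standardize_terms(strip_page_lines(doc))
        text = re.sub(r"[ \t]+", " ", text).strip()
        digest = hashlib.sha256(text.lower().encode("utf-8")).hexdigest()
        if text and digest not in seen:
            seen.add(digest)
            cleaned.append(text)
    return cleaned
```

Hashing only catches exact duplicates after cleaning. Near-duplicates and outdated content still need a similarity check or a human review.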

The cleaned data is then broken into logical chunks, where each chunk holds one clear idea or topic.

Each chunk is assigned metadata that gives quick context about its content. This metadata helps AI models navigate the knowledge base faster and quickly fetch the chunks that hold the relevant details.
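As a rough illustration of chunks that carry metadata, the sketch below treats each blank-line-separated paragraph as one idea. That splitting rule, the `Chunk` class, and the metadata field names are assumptions chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    chunk_id: str
    text: str
    metadata: dict = field(default_factory=dict)


def chunk_by_paragraph(doc_id: str, text: str, topic: str) -> list[Chunk]:
    """Create one chunk per non-empty paragraph and attach basic metadata."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    return [
        Chunk(
            chunk_id=f"{doc_id}-{position}",
            text=paragraph,
            metadata={
                "doc_id": doc_id,
                "topic": topic,
                "position": position,
                "char_count": len(paragraph),
            },
        )
        for position, paragraph in enumerate(paragraphs)
    ]
```

Calling `chunk_by_paragraph("auth", document_text, topic="authentication")` returns chunks whose `metadata["topic"]` can later be used as a search filter.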

You can set role-based access on chunks to control which roles may read the information in each chunk. Several roles may be able to use the same model, but not all of them should see all the data. Chunking is the stage where you can set up security and access control within the model.
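A minimal sketch of role-based access at the chunk level, assuming each chunk stores the set of roles allowed to read it. The `SecuredChunk` class and the role names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class SecuredChunk:
    chunk_id: str
    text: str
    allowed_roles: set[str] = field(default_factory=set)


def visible_chunks(chunks: list[SecuredChunk], user_roles: set[str]) -> list[SecuredChunk]:
    """Return only the chunks that share at least one role with the user."""
    return [chunk for chunk in chunks if chunk.allowed_roles & user_roles]
```

Applying this filter before retrieval, rather than after generation, keeps restricted text out of the model's context altogether.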

Tip: A best practice I always follow is to chunk data around user queries rather than document structure. For example, say you have a document about authentication and access management. You can break it up by common user questions such as “How do I change my password?” and “What is the password policy?” You can then validate these chunks by testing them with real queries. A safe set is 10-12 questions.
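One way to validate query-based chunks is to measure how often a test question retrieves the chunk you expect. The sketch below uses a toy word-overlap retriever as a stand-in for a real one. `CHUNKS`, `keyword_retrieve`, and `hit_rate` are hypothetical names for this illustration.

```python
import re
from typing import Callable

# Hypothetical chunks, keyed by chunk id.
CHUNKS = {
    "change-password": "How do I change my password? Open Settings, choose Security, then Change password.",
    "password-policy": "What is the password policy? Passwords need 12 characters and expire every 90 days.",
}


def tokenize(text: str) -> set[str]:
    return set(re.findall(r"\w+", text.lower()))


def keyword_retrieve(query: str, top_k: int) -> list[str]:
    """Toy retriever: rank chunks by the number of words shared with the query."""
    query_tokens = tokenize(query)
    ranked = sorted(
        CHUNKS,
        key=lambda chunk_id: len(query_tokens & tokenize(CHUNKS[chunk_id])),
        reverse=True,
    )
    return ranked[:top_k]


def hit_rate(
    test_queries: dict[str, str],
    retrieve: Callable[[str, int], list[str]],
    top_k: int = 3,
) -> float:
    """Fraction of test queries whose expected chunk appears in the top_k results."""
    if not test_queries:
        return 0.0
    hits = sum(
        1 for query, expected in test_queries.items() if expected in retrieve(query, top_k)
    )
    return hits / len(test_queries)
```

With a set of 10-12 real questions, a hit rate well below 1.0 suggests the chunk boundaries do not match how users actually ask.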

3. Organize and index the data

Using an embedding model such as OpenAI text-embedding-3-large or BGE-M3, the text chunks are converted into lists of numbers called vectors.

AI models can search through vectors faster than through a large block of text. After vectorization, the metadata that was attached to the chunk is attached to the vector. The final record looks like this:

[ Vector (numbers) ] + [ Original text ] + [ Metadata ]
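The record layout above can be sketched as a plain dictionary. Because a real embedding model needs an API key or model weights, this example substitutes a toy hashed bag-of-words embedder. `toy_embed` and `build_record` are hypothetical helpers and do not produce meaningful semantic vectors.

```python
import hashlib
import math


def toy_embed(text: str, dim: int = 8) -> list[float]:
    """Stand-in for a real embedding model: hashed bag-of-words, L2-normalized."""
    vector = [0.0] * dim
    for token in text.lower().split():
        bucket = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % dim
        vector[bucket] += 1.0
    norm = math.sqrt(sum(value * value for value in vector))
    return [value / norm for value in vector] if norm else vector


def build_record(chunk_id: str, text: str, metadata: dict) -> dict:
    """Bundle vector, original text, and metadata into one record."""
    return {
        "id": chunk_id,
        "vector": toy_embed(text),
        "text": text,
        "metadata": metadata,
    }
```

In practice you would swap `toy_embed` for a call to your chosen embedding model and keep the rest of the record shape unchanged.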

4. Choose a platform for data storage

You can store this vector output in a vector database such as Pinecone, Milvus, or Weaviate for later retrieval. You can load the vector data by writing some simple Python code.

  from typing import Any

  import numpy as np


  # Vector Normalization + Metadata

  def normalize_l2(vector: list[float]) -> list[float]:
      """
      Return an L2-normalized copy of `vector`.
      Many vector stores use dot-product similarity. If you normalize vectors to
      unit length, dot-product becomes equivalent to cosine similarity.
      """
      arr = np.array(vector, dtype=np.float32)
      norm = np.linalg.norm(arr)
      if norm == 0:
          # A zero vector cannot be scaled to unit length; return an unchanged copy.
          return list(vector)
      return (arr / norm).tolist()


  def prepare_record(
      doc_id: str,
      embedding: list[float],
      text: str,
      source: str,
      extra_metadata: dict[str, Any] | None = None,
  ) -> dict:
      """
      Prepare a single record for vector DB upsert.
      Metadata serves two purposes:
      - Filtering: narrow down search to a subset
      - Context: keep the source and a text preview alongside each vector
      """
      metadata = {
          "source": source,
          "text_preview": text[:500],
          "char_count": len(text),
      }
      if extra_metadata:
          metadata.update(extra_metadata)

      return {
          "id": doc_id,
          "values": normalize_l2(embedding),
          "metadata": metadata,
      }


  # Vector Quantization

  # Scalar Quantization (SQ)

  def scalar_quantization(input_vec: list[float]) -> dict:
      """
      Compress a float32 vector to uint8 (one byte per dimension).
      The min and max are stored so the vector can be approximately restored.
      """
      input_arr = np.array(input_vec, dtype=np.float32)
      # Use lo/hi/span so the built-ins min, max, and range are not shadowed.
      lo, hi = input_arr.min(), input_arr.max()
      span = hi - lo
      if span == 0:
          # Constant vector: every value maps to code 0.
          quantized = np.zeros_like(input_arr, dtype=np.uint8)
      else:
          # Map [lo, hi] linearly onto the integer codes 0..255.
          quantized = ((input_arr - lo) / span * 255).astype(np.uint8)

      return {
          "quantized": quantized.tolist(),
          "min": float(lo),
          "max": float(hi),
      }


  def scalar_dequantization(record: dict) -> list[float]:
      """
      Reconstruct an approximate float32 vector from its uint8 codes.
      The result is close to, but not identical to, the original vector.
      """
      arr = np.array(record["quantized"], dtype=np.float32)
      return (arr / 255 * (record["max"] - record["min"]) + record["min"]).tolist()


  # Product Quantization / PQ

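  def train_product_quantizer(
      vectors: np.ndarray,
      num_subvectors: int = 8,
      num_centroids: int = 256,
      max_iterations: int = 20,
  ) -> list[np.ndarray]:
      """
      Split each vector into subvectors and cluster each subspace independently.
      Returns one codebook (array of centroids) per subspace.
      `vectors` is a 2-D array with at least `num_centroids` rows.
      """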
      from sklearn.cluster import KMeans

      dim = vectors.shape[1]
      assert dim % num_subvectors == 0, "dim must be divisible by num_subvectors"
      sub_dim = dim // num_subvectors

      codebooks = []
      for i in range(num_subvectors):
          sub_vectors = vectors[:, i * sub_dim : (i + 1) * sub_dim]
          kmeans = KMeans(n_clusters=num_centroids, max_iter=max_iterations, n_init=1)
          kmeans.fit(sub_vectors)
          codebooks.append(kmeans.cluster_centers_)

      return codebooks


  def pq_encode(vector: np.ndarray, codebooks: list[np.ndarray]) -> list[int]:
      """
      Encode a single vector into PQ codes (one uint8 per subvector)
      """
      num_subvectors = len(codebooks)
      sub_dim = len(vector) // num_subvectors
      codes = []

      for i, codebook in enumerate(codebooks):
          sub_vec = vector[i * sub_dim : (i + 1) * sub_dim]
          distances = np.linalg.norm(codebook - sub_vec, axis=1)
          codes.append(int(np.argmin(distances)))

      return codes


  def pq_decode(codes: list[int], codebooks: list[np.ndarray]) -> np.ndarray:
      """
      Reconstruct approximate vector from PQ codes
      """
      return np.concatenate(
        [codebook[code] for code, codebook in zip(codes, codebooks)]
      )

Tip: To speed up loading, I recommend using the batch insert option. During the loading stage you can normalize the vectors (bring them all to the same scale). After normalization, quantize (compress) them to optimize storage. This extra normalization and quantization step pays off in later retrieval.
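Batch insertion can be sketched independently of any one database. In the example below, `upsert` is a hypothetical callable standing in for whatever bulk-write method your vector database client exposes, and the batch size of 100 is an arbitrary assumption.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator


def batched(records: Iterable[dict], batch_size: int) -> Iterator[list[dict]]:
    """Yield successive lists of at most batch_size records."""
    iterator = iter(records)
    while batch := list(islice(iterator, batch_size)):
        yield batch


def load_in_batches(
    records: Iterable[dict],
    upsert: Callable[[list[dict]], None],
    batch_size: int = 100,
) -> int:
    """Send records to the store one batch at a time; return how many were sent."""
    total = 0
    for batch in batched(records, batch_size):
        upsert(batch)  # one network round trip per batch instead of per record
        total += len(batch)
    return total
```

Check your database's documented limits on request size before choosing the batch size.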

5. Optimize retrieval

To run retrieval against the vector database, you can use an orchestration framework such as LlamaIndex or LangChain.

LlamaIndex can navigate quickly through the vector database and reach the exact chunk that holds content related to the user's query.

LangChain then takes the chunk's data and transforms it to suit the user's query, for example by summarizing the text or drafting an email from it.

"""                                                                                                                             
  Hybrid Retrieval: combine the benefits of keyword search and vector similarity

  Where each approach shines:
  - Keywords: find exact matches, but miss queries that use synonyms
  - Embeddings: capture meaning, but can miss an exact keyword
  Hybrid combines both to get the best of each.
  """

  import math
  from collections import defaultdict
  from dataclasses import dataclass
  import numpy as np

  @dataclass
  class Document:
      id: str
      text: str
      embedding: list[float]


  class BestMatching25Index:
      def __init__(self, k1: float = 1.5, b: float = 0.75):
          # k1 controls term-frequency saturation;
          # b controls document-length normalization.
          self.k1 = k1
          self.b = b
          self.doc_lengths: dict[str, int] = {}
          self.avg_doc_length: float = 0
          self.doc_freqs: dict[str, int] = {} 
          self.term_freqs: dict[str, dict[str, int]] = {} 
          self.corpus_size: int = 0

      def _tokenize(self, text: str) -> list[str]:
          return text.lower().split()

      def index(self, documents: list[Document]) -> None:
          self.corpus_size = len(documents)

          for doc in documents:
              tokens = self._tokenize(doc.text)
              self.doc_lengths[doc.id] = len(tokens)
              self.term_freqs[doc.id] = {}

              seen_terms: set[str] = set()
              for token in tokens:
                  self.term_freqs[doc.id][token] = self.term_freqs[doc.id].get(token, 0) + 1
                  if token not in seen_terms:
                      self.doc_freqs[token] = self.doc_freqs.get(token, 0) + 1
                      seen_terms.add(token)

          # Guard against an empty corpus to avoid dividing by zero.
          if self.corpus_size:
              self.avg_doc_length = sum(self.doc_lengths.values()) / self.corpus_size

      def score(self, query: str, doc_id: str) -> float:
          query_terms = self._tokenize(query)
          doc_len = self.doc_lengths[doc_id]
          score = 0.0

          for term in query_terms:
              if term not in self.doc_freqs or term not in self.term_freqs.get(doc_id, {}):
                  continue

              tf = self.term_freqs[doc_id][term]
              df = self.doc_freqs[term]
              idf = math.log((self.corpus_size - df + 0.5) / (df + 0.5) + 1)
              tf_norm = (tf * (self.k1 + 1)) / (
                  tf + self.k1 * (1 - self.b + self.b * doc_len / self.avg_doc_length)
              )
              score += idf * tf_norm

          return score

      def search(self, query: str, top_k: int = 10) -> list[tuple[str, float]]:
          scores = [
              (doc_id, self.score(query, doc_id))
              for doc_id in self.doc_lengths
          ]
          scores.sort(key=lambda x: x[1], reverse=True)
          return scores[:top_k]


  class VectorIndex:
      """Brute-force vector index used as the semantic half of hybrid search.

      index() normalizes and stores each document embedding.
      search() ranks documents by cosine similarity to the query embedding.
      The module-level functions hybrid_search_weighted and
      reciprocal_rank_fusion merge its results with the BM25 results.
      """

      def __init__(self):
          self.documents: dict[str, np.ndarray] = {}

      def index(self, documents: list[Document]) -> None:
          for doc in documents:
              arr = np.array(doc.embedding, dtype=np.float32)
              norm = np.linalg.norm(arr)
              self.documents[doc.id] = arr / norm if norm > 0 else arr

      def search(self, query_embedding: list[float], top_k: int = 10) -> list[tuple[str, float]]:
          q = np.array(query_embedding, dtype=np.float32)
          q_norm = np.linalg.norm(q)
          if q_norm > 0:
              # Normalize so the dot product below equals cosine similarity.
              q = q / q_norm

          scores = [
              (doc_id, float(np.dot(q, emb)))
              for doc_id, emb in self.documents.items()
          ]
          scores.sort(key=lambda x: x[1], reverse=True)
          return scores[:top_k]

  def hybrid_search_weighted(
      query: str,
      query_embedding: list[float],
      bm25_index: BestMatching25Index,
      vector_index: VectorIndex,
      alpha: float = 0.5,
      top_k: int = 10,
  ) -> list[dict]:
      """Combine keyword and vector scores with a tunable weight.

      alpha = 1.0 → pure vector search
      alpha = 0.0 → pure keyword search
      alpha = 0.5 → equal weight (good starting point)
      """
      keyword_results = bm25_index.search(query, top_k=top_k * 2)
      vector_results = vector_index.search(query_embedding, top_k=top_k * 2)

      # Normalize (min-max) each score list to [0, 1]
      def normalize_scores(results: list[tuple[str, float]]) -> dict[str, float]:
          if not results:
              return {}
          scores = [s for _, s in results]
          min_s, max_s = min(scores), max(scores)
          rng = max_s - min_s
          if rng == 0:
              return {doc_id: 1.0 for doc_id, _ in results}
          return {doc_id: (s - min_s) / rng for doc_id, s in results}

      keyword_scores = normalize_scores(keyword_results)
      vector_scores = normalize_scores(vector_results)

      # Merge
      all_doc_ids = set(keyword_scores) | set(vector_scores)
      combined = []
      for doc_id in all_doc_ids:
          ks = keyword_scores.get(doc_id, 0.0)
          vs = vector_scores.get(doc_id, 0.0)
          combined.append({
              "id": doc_id,
              "score": alpha * vs + (1 - alpha) * ks,
              "keyword_score": ks,
              "vector_score": vs,
          })

      combined.sort(key=lambda x: x["score"], reverse=True)
      return combined[:top_k]

  def reciprocal_rank_fusion(
      *ranked_lists: list[tuple[str, float]],
      k: int = 60,
      top_n: int = 10,
  ) -> list[dict]:
      """
      Merge multiple ranked lists using RRF (Reciprocal Rank Fusion).

      RRF score = sum over all lists of: 1 / (k + rank)

      Why RRF over weighted combination?
      - No score normalization needed (works on ranks, not raw scores)
      - No alpha tuning needed
      - Robust across different score distributions
      - Available as a built-in fusion option in search engines such as Elasticsearch
      """
      rrf_scores: dict[str, float] = defaultdict(float)
      doc_details: dict[str, dict] = {}

      for list_idx, ranked_list in enumerate(ranked_lists):
          for rank, (doc_id, raw_score) in enumerate(ranked_list, start=1):
              rrf_scores[doc_id] += 1.0 / (k + rank)
              if doc_id not in doc_details:
                  doc_details[doc_id] = {}
              doc_details[doc_id][f"list_{list_idx}_rank"] = rank
              doc_details[doc_id][f"list_{list_idx}_score"] = raw_score

      results = []
      for doc_id, rrf_score in rrf_scores.items():
          results.append({
              "id": doc_id,
              "rrf_score": round(rrf_score, 6),
              **doc_details[doc_id],
          })

      results.sort(key=lambda x: x["rrf_score"], reverse=True)
      return results[:top_n]


  def hybrid_search_rrf(
      query: str,
      query_embedding: list[float],
      bm25_index: BestMatching25Index,
      vector_index: VectorIndex,
      top_k: int = 10,
  ) -> list[dict]:
      keyword_results = bm25_index.search(query, top_k=top_k * 2)
      vector_results = vector_index.search(query_embedding, top_k=top_k * 2)

      return reciprocal_rank_fusion(keyword_results, vector_results, top_n=top_k)

Tip: I recommend hybrid retrieval, based on both keywords and embeddings, for fast and accurate results. Keyword retrieval is best for exact terms (“password policy”). Embeddings are best for conceptual or meaning-based matches. LlamaIndex does well at hybrid retrieval, where it can look for both the exact terms and the context around the question.

6. Set up an automated update and refresh routine

The final step is to make sure your knowledge base always stays up to date. For this, you can implement selective forgetting: overwriting or deleting outdated and unnecessary data to keep the model accurate.
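Selective forgetting can be sketched as a rule that picks the records to delete. The example assumes each record carries a `doc_key` (shared by all versions of the same document) and an `updated_at` timestamp. Those field names and the 180-day cutoff are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone


def select_for_forgetting(
    records: list[dict],
    now: datetime,
    max_age_days: int = 180,
) -> tuple[list[dict], list[str]]:
    """Keep the newest version per doc_key if it is recent enough.

    Returns (records to keep, ids to delete from the vector store).
    """
    cutoff = now - timedelta(days=max_age_days)

    # Pass 1: find the newest version of every document.
    newest: dict[str, dict] = {}
    for record in records:
        current = newest.get(record["doc_key"])
        if current is None or record["updated_at"] > current["updated_at"]:
            newest[record["doc_key"]] = record

    # Pass 2: drop superseded versions and anything older than the cutoff.
    keep = [record for record in newest.values() if record["updated_at"] >= cutoff]
    keep_ids = {record["id"] for record in keep}
    forget_ids = [record["id"] for record in records if record["id"] not in keep_ids]
    return keep, forget_ids
```

Age is only a proxy for relevance, so a review step before the actual delete is a sensible safeguard for content that is old but still valid.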

How do you figure out which data to delete? Evaluation and observability platforms can help. In DeepEval, you can program test rules and queries that continuously check whether your AI model answers correctly. If the answers are wrong, the TruLens platform helps you trace the exact passage the answer was drawn from.

 """                                                                                                                             
  Knowledge Base Quality Monitoring

  Automated checks for knowledge base health:
  1. Retrieval quality: is it finding the right documents?
  2. Freshness detection: are documents stale or embeddings drifting?
  3. Unified pipeline: scheduled monitoring with alerts
  """

  import logging
  from datetime import datetime, timedelta
  from dataclasses import dataclass
  from typing import Callable

  import numpy as np

  logging.basicConfig(level=logging.INFO)
  logger = logging.getLogger("kb_monitor")


  def setup_deepeval_metrics():
      """Define retrieval quality metrics using DeepEval.

      DeepEval provides LLM-evaluated metrics — it uses a judge LLM to score
      whether retrieved context actually helps answer the question.
      """
      from deepeval.metrics import (
          AnswerRelevancyMetric,
          FaithfulnessMetric,
          ContextualPrecisionMetric,
          ContextualRecallMetric,
      )
      from deepeval.test_case import LLMTestCase

      metrics = {
          # Does the answer address the question?
          "relevancy": AnswerRelevancyMetric(threshold=0.7),
          # Is the answer grounded in the retrieved context (no hallucination)?
          "faithfulness": FaithfulnessMetric(threshold=0.7),
          # Are the top-ranked retrieved docs actually relevant?
          "context_precision": ContextualPrecisionMetric(threshold=0.7),
          # Did we retrieve all the docs needed to answer?
          "context_recall": ContextualRecallMetric(threshold=0.7),
      }

      return metrics, LLMTestCase


  def evaluate_retrieval_quality(
      rag_pipeline: Callable,
      test_cases: list[dict],
  ) -> list[dict]:
      """Run a set of test queries through your RAG pipeline and score them.

      Each test case should have:
      - query: the user question
      - expected_answer: ground truth answer (for recall/relevancy)
      """
      from deepeval import evaluate
      from deepeval.test_case import LLMTestCase
      from deepeval.metrics import (
          AnswerRelevancyMetric,
          FaithfulnessMetric,
          ContextualPrecisionMetric,
          ContextualRecallMetric,
      )

      results = []

      for tc in test_cases:
          # Run your actual RAG pipeline
          response = rag_pipeline(tc["query"])

          test_case = LLMTestCase(
              input=tc["query"],
              actual_output=response["answer"],
              expected_output=tc["expected_answer"],
              retrieval_context=response["retrieved_contexts"],
          )

          metrics = [
              AnswerRelevancyMetric(threshold=0.7),
              FaithfulnessMetric(threshold=0.7),
              ContextualPrecisionMetric(threshold=0.7),
              ContextualRecallMetric(threshold=0.7),
          ]

          for metric in metrics:
              metric.measure(test_case)

          results.append({
              "query": tc["query"],
              "scores": {m.__class__.__name__: m.score for m in metrics},
              "passed": all(m.is_successful() for m in metrics),
          })

      return results


  def setup_trulens_monitoring(rag_pipeline: Callable, app_name: str = "my_kb"):
      """Wrap your RAG pipeline with TruLens for continuous feedback logging.

      TruLens records every query + response + retrieved context, then
      runs feedback functions asynchronously to score each interaction.
      """
      from trulens.core import TruSession, Feedback, Select
      from trulens.providers.openai import OpenAI as TruLensOpenAI
      from trulens.apps.custom import TruCustomApp, instrument

      session = TruSession()

      # Feedback provider (uses an LLM to judge quality)
      provider = TruLensOpenAI()

      feedbacks = [
          # Is the response relevant to the query?
          Feedback(provider.relevance)
          .on_input()
          .on_output(),

          # Is the response grounded in retrieved context?
          Feedback(provider.groundedness_measure_with_cot_reasons)
          .on(Select.RecordCalls.retrieve.rets)
          .on_output(),

          # Is the retrieved context relevant to the query?
          Feedback(provider.context_relevance)
          .on_input()
          .on(Select.RecordCalls.retrieve.rets),
      ]

      # Wrap your pipeline so that every call is logged and scored.
      # `instrument` decorates methods, so it is applied per method, not to the class.
      class InstrumentedRAG:
          def __init__(self, pipeline):
              self._pipeline = pipeline

          @instrument
          def retrieve(self, query: str) -> list[str]:
              result = self._pipeline(query)
              return result["retrieved_contexts"]

          @instrument
          def query(self, query: str) -> str:
              result = self._pipeline(query)
              return result["answer"]

      instrumented = InstrumentedRAG(rag_pipeline)

      tru_app = TruCustomApp(
          instrumented,
          app_name=app_name,
          feedbacks=feedbacks,
      )

      return tru_app, session


  def get_trulens_dashboard_url(session) -> str:
      """Launch the TruLens dashboard to visualize quality over time."""
      session.run_dashboard(port=8501)
      return "http://localhost:8501"

  @dataclass
  class DocumentFreshness:
      doc_id: str
      last_updated: datetime
      last_embedded: datetime
      source_hash: str  # hash of source content at embedding time


  class FreshnessMonitor:
      """Detect stale documents and embedding drift."""

      def __init__(self, staleness_threshold_days: int = 30):
          self.threshold = timedelta(days=staleness_threshold_days)
          self.freshness_records: dict[str, DocumentFreshness] = {}

      def register(self, doc_id: str, source_hash: str) -> None:
          now = datetime.utcnow()
          self.freshness_records[doc_id] = DocumentFreshness(
              doc_id=doc_id,
              last_updated=now,
              last_embedded=now,
              source_hash=source_hash,
          )

      def check_staleness(self) -> dict:
          """Find documents that haven't been re-embedded recently."""
          now = datetime.utcnow()
          stale, fresh = [], []

          for doc_id, record in self.freshness_records.items():
              age = now - record.last_embedded
              if age > self.threshold:
                  stale.append({"id": doc_id, "days_stale": age.days})
              else:
                  fresh.append(doc_id)

          return {
              "total": len(self.freshness_records),
              "fresh": len(fresh),
              "stale": len(stale),
              "stale_documents": stale,
          }

      def check_content_drift(
          self, doc_id: str, current_source_hash: str
      ) -> bool:
          """Check if source content changed since last embedding."""
          record = self.freshness_records.get(doc_id)
          if not record:
              return True  # unknown doc, treat as drifted
          return record.source_hash != current_source_hash


  def detect_embedding_drift(
      old_embeddings: dict[str, list[float]],
      new_embeddings: dict[str, list[float]],
      drift_threshold: float = 0.1,
  ) -> dict:
      """Compare old vs new embeddings for the same documents.

      If your embedding model gets updated (or you switch models),
      existing vectors may no longer be compatible. This detects that.
      """
      drifted = []
      common_ids = set(old_embeddings) & set(new_embeddings)

      for doc_id in common_ids:
          old = np.array(old_embeddings[doc_id])
          new = np.array(new_embeddings[doc_id])

          # cosine distance: 0 = identical, 2 = opposite
          cos_sim = np.dot(old, new) / (np.linalg.norm(old) * np.linalg.norm(new))
          cos_dist = 1 - cos_sim

          if cos_dist > drift_threshold:
              drifted.append({
                  "id": doc_id,
                  "cosine_distance": round(float(cos_dist), 4),
              })

      return {
          "documents_compared": len(common_ids),
          "drifted": len(drifted),
          "drift_threshold": drift_threshold,
          "drifted_documents": sorted(drifted, key=lambda x: x["cosine_distance"], reverse=True),
      }

Using DeepEval together with TruLens automates the periodic checks of your knowledge base.
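The reports from these checks still have to be turned into something a scheduler can act on. The sketch below assumes input shapes that mirror the outputs of `evaluate_retrieval_quality` and `FreshnessMonitor.check_staleness` in the listing above. `build_alerts` and both thresholds are hypothetical.

```python
def build_alerts(
    quality_results: list[dict],
    staleness_report: dict,
    min_pass_rate: float = 0.9,
    max_stale_ratio: float = 0.2,
) -> list[str]:
    """Turn one monitoring run into a list of human-readable alerts."""
    alerts: list[str] = []

    # Retrieval quality: share of test queries where every metric passed.
    if quality_results:
        passed = sum(1 for result in quality_results if result["passed"])
        pass_rate = passed / len(quality_results)
        if pass_rate < min_pass_rate:
            failing = [r["query"] for r in quality_results if not r["passed"]]
            alerts.append(
                f"Retrieval pass rate {pass_rate:.0%} is below {min_pass_rate:.0%}; "
                f"failing queries: {failing}"
            )

    # Freshness: share of documents not re-embedded within the threshold.
    total = staleness_report.get("total", 0)
    if total:
        stale_ratio = staleness_report.get("stale", 0) / total
        if stale_ratio > max_stale_ratio:
            alerts.append(
                f"{stale_ratio:.0%} of documents are stale (limit {max_stale_ratio:.0%})"
            )

    return alerts
```

A cron job or workflow scheduler could run the checks nightly and forward any non-empty alert list to a chat channel or pager.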

Key Challenges in Building a Knowledge Base (+ Solutions)

Here are the common problems we have seen with knowledge bases:

1. Growing data quality errors

AI models built over years, even by reputable companies with solid teams, still hallucinate. Air Canada's well-known chatbot mishap is one example: the model promised a customer a refund under a policy that never existed.

Even though every engineer tries to put relevant content into the knowledge base, problems still show up in the results. In my experience, a lack of domain expertise leads to mistakes in judging what is relevant. Take off your technical hat and put on your domain hat to spot outdated, conflicting, and irrelevant information in your knowledge base.

2. Slow retrieval

An AI model that gives the right answer is not enough on its own. Users hate loading screens and latency, and they expect instant answers, at least from a machine.

Developers often get caught up in functionality and do not prioritize optimization, which is absolutely non-negotiable. Use the following tips to address common causes of slowness:

Adopt Hierarchical Navigable Small World (HNSW) or IVF indexes instead of flat indexes; they group related items together for faster retrieval.
Apply quantization (shrinking the vectors so they take up less memory) or recursive splitting (breaking content into smaller chunks) so that queries use less memory.
Keep your database and your AI service in the same cloud region for faster access.
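To show why an IVF index is faster than a flat one, here is a small pure-Python sketch: vectors are grouped around centroids, and a query scans only the `nprobe` nearest groups instead of every vector. The class and function names, the farthest-point seeding, and the fixed iteration count are assumptions for illustration. Production systems rely on the index types built into the vector database.

```python
import math


def farthest_point_init(vectors: list[list[float]], k: int) -> list[list[float]]:
    """Deterministic seeding: start with the first vector, then repeatedly
    add the vector farthest from the centroids chosen so far."""
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        farthest = max(vectors, key=lambda v: min(math.dist(v, c) for c in centroids))
        centroids.append(list(farthest))
    return centroids


def kmeans(vectors: list[list[float]], k: int, iterations: int = 10) -> list[list[float]]:
    """Plain Lloyd's algorithm with a fixed number of iterations."""
    centroids = farthest_point_init(vectors, k)
    for _ in range(iterations):
        buckets: list[list[list[float]]] = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(range(k), key=lambda c: math.dist(v, centroids[c]))
            buckets[nearest].append(v)
        for c, members in enumerate(buckets):
            if members:  # keep the previous centroid if a bucket is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


class IVFIndex:
    def __init__(self, vectors: dict[str, list[float]], num_lists: int = 4):
        self.centroids = kmeans(list(vectors.values()), num_lists)
        # One inverted list of (doc_id, vector) pairs per centroid.
        self.lists: list[list[tuple[str, list[float]]]] = [[] for _ in self.centroids]
        for doc_id, vec in vectors.items():
            self.lists[self._nearest_centroid(vec)].append((doc_id, vec))

    def _nearest_centroid(self, vec: list[float]) -> int:
        return min(range(len(self.centroids)), key=lambda c: math.dist(vec, self.centroids[c]))

    def search(self, query: list[float], top_k: int = 3, nprobe: int = 1) -> list[str]:
        """Scan only the nprobe lists whose centroids are closest to the query."""
        order = sorted(range(len(self.centroids)), key=lambda c: math.dist(query, self.centroids[c]))
        candidates = [item for c in order[:nprobe] for item in self.lists[c]]
        candidates.sort(key=lambda item: math.dist(query, item[1]))
        return [doc_id for doc_id, _ in candidates[:top_k]]
```

Raising `nprobe` trades speed for recall: probing every list gives the same answer as a flat scan, while probing one list touches only a fraction of the data.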

3. Poor scalability

To speed up deployment, developers often make poor design decisions that hurt long-term scalability. One such problem is following a monolithic architecture, where all data storage and query processing happen in a single, tightly coupled cluster. As model usage grows, CPU and RAM usage rises across the entire cluster for every query. To manage scale effectively, I recommend horizontal sharding (splitting the data across multiple smaller servers).
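Horizontal sharding can be illustrated with an in-memory stand-in, where each dictionary plays the role of one server. The hash-based routing rule, the `ShardedStore` class, and the dot-product scoring are assumptions for this sketch, not a description of how any specific database shards its data.

```python
import hashlib


def shard_for(doc_id: str, num_shards: int) -> int:
    """Route a document to a shard using a stable hash of its id."""
    digest = hashlib.sha256(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


class ShardedStore:
    def __init__(self, num_shards: int):
        # Each dict stands in for one smaller server.
        self.shards: list[dict[str, list[float]]] = [{} for _ in range(num_shards)]

    def upsert(self, doc_id: str, vector: list[float]) -> None:
        self.shards[shard_for(doc_id, len(self.shards))][doc_id] = vector

    def search(self, query: list[float], top_k: int = 3) -> list[str]:
        """Scatter-gather: take each shard's local top_k, then merge globally."""
        hits: list[tuple[str, float]] = []
        for shard in self.shards:
            local = sorted(
                ((doc_id, dot(query, vec)) for doc_id, vec in shard.items()),
                key=lambda hit: hit[1],
                reverse=True,
            )
            hits.extend(local[:top_k])
        hits.sort(key=lambda hit: hit[1], reverse=True)
        return [doc_id for doc_id, _ in hits[:top_k]]
```

Simple modulo routing forces most documents to move when the shard count changes, which is why larger systems often use consistent hashing instead.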

Another problem is cost that grows with scale, which usually happens when you do not quantize or compress vectors to optimize storage. Developers skip the quantization step to ship the model quickly. The downside goes unnoticed at first, but slowdowns and rising bills soon expose the gap.
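A quick back-of-the-envelope calculation shows how much storage quantization saves. The corpus size (one million vectors) and dimension (1024) are assumed numbers for illustration. The product quantization settings match the defaults in the listing earlier in this article (8 subvectors, 256 centroids).

```python
def raw_bytes(num_vectors: int, dim: int, bytes_per_value: int) -> int:
    """Storage for uncompressed or scalar-quantized vectors."""
    return num_vectors * dim * bytes_per_value


def pq_bytes(num_vectors: int, dim: int, num_subvectors: int = 8, num_centroids: int = 256) -> int:
    """Storage for product-quantized vectors plus their float32 codebooks."""
    codes = num_vectors * num_subvectors  # one uint8 code per subvector
    codebooks = num_subvectors * num_centroids * (dim // num_subvectors) * 4
    return codes + codebooks


NUM_VECTORS = 1_000_000  # assumed corpus size
DIM = 1024               # assumed embedding dimension

float32_size = raw_bytes(NUM_VECTORS, DIM, 4)  # 4 bytes per float32 value
uint8_size = raw_bytes(NUM_VECTORS, DIM, 1)    # 1 byte per value after scalar quantization
pq_size = pq_bytes(NUM_VECTORS, DIM)
```

Under these assumptions, float32 storage takes about 4.1 GB, scalar quantization about 1.0 GB, and product quantization about 9 MB, in exchange for some approximation error in similarity scores.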

A Knowledge Base Is a Clean Asset, Not a Data Warehouse

Building a knowledge base is not a one-time project. It is an evolving asset that needs regular improvement. The structure you build today will reveal its gaps tomorrow. Every failed query is feedback, and every successful retrieval validates your design choices.

I recommend starting small: pick the ten most common questions for your model, create clear documentation for them, and then test whether your model can deliver the right answers at the right time. Once you start getting the results you expect, you can repeat the process to expand your knowledge base.

The difference between a model that guesses and a model that knows comes down to this deliberate curation work. Continuous refinement makes the next search easier and the results more reliable.