AI basics – overview

Overview of large language models: capabilities, how they work, prompting, RAG, key concepts, and limits

This overview in the «Bare minimum» series outlines:

  • large language models;
  • architecture;
  • usage patterns;
  • trends.

Video (~9 min): Watch on YouTube

Large language models (LLMs)

What is it? — A tool first; then capabilities and use cases

Text generation

  • Articles, stories, marketing materials
  • Automated reports and documentation
  • Creative writing and content across genres

Question answering

  • Intelligent customer-support chatbots
  • Q&A systems for information retrieval
  • Virtual assistants for everyday tasks

Language analysis

  • Text classification by category and sentiment
  • Summarization of long documents
  • Translation between languages with context preserved

Programming and code

  • Code generation from natural-language descriptions
  • Debugging and fixing errors
  • Comments and explanations for complex code

Powerful — but it helps to know how they work and where they fail.

What is it technically? — A huge statistical machine

Definition and how it works

Neural models trained to predict the next token from its context: they exploit statistical regularities in language, using the entire preceding context to pick the most likely continuation.

Diagram: the model uses the full context to predict the next token.

Input context (Russian tongue-twister): «Карл у Клары украл…» ("Karl stole … from Klara")

Context analysis:

  • Карл (subject)
  • у Клары (from whom)
  • украл (action)

Prediction: analyzing all prior tokens and their relations, the model recognizes the familiar tongue-twister pattern and predicts «кораллы» ("corals") with ~85% probability.
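The flow above can be sketched in a few lines: the model assigns a probability to every token in its vocabulary, and the most likely one is picked. The probability table here is made up for illustration; a real LLM computes it over tens of thousands of tokens.

```python
def predict_next(probs):
    """Pick the most probable next token from a {token: probability} table."""
    return max(probs, key=probs.get)

# Hypothetical distribution after the context «Карл у Клары украл…»:
probs = {"кораллы": 0.85, "кларнет": 0.10, "деньги": 0.05}
print(predict_next(probs))  # кораллы
```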

Transformers and architecture

  • Self-attention over token sequences (sounds simple; in practice it isn't)
  • Parallel processing of the whole sequence instead of a purely recurrent pipeline
  • Multi-layer structure with billions of parameters

    Diagrams: self-attention mechanism; parallel vs recurrent processing; layer stack and parameter scale (weight heatmap)
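The points above can be made concrete with a minimal sketch of scaled dot-product self-attention, the core operation of a transformer layer. This is a single head with no masking or multi-head logic; the sizes and random weights are illustrative.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """X: (seq_len, d) token embeddings; Wq/Wk/Wv: (d, d) projection weights."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # how much each token attends to each other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the sequence
    return weights @ V                               # each output mixes all positions at once

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                          # 4 tokens, 8-dim embeddings
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (4, 8)
```

Note that every position is computed in the same matrix multiplications, which is exactly the parallelism that recurrent models lack.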

Training on text

  • Self-supervised learning on billions of text examples
  • Pre-training on broad data, then specialization
  • Scaling data and compute

How to use them? — Overview of techniques for working with LLMs

Prompt engineering

  • The craft of phrasing requests for better results
  • Structuring instructions with roles, examples, and context
  • Iteratively refining prompts for accurate answers

Deep dive by technique (zero-shot, few-shot, CoT, roles, step-back, …): AI basics – prompt engineering.
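As a sketch of structured prompting, assume a hypothetical `build_prompt` helper that combines a role, few-shot examples, and the user's question. The field layout is just a convention, not any particular API.

```python
def build_prompt(role, examples, question):
    """Assemble a prompt from a role instruction, Q/A examples, and a question."""
    shots = "\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
    return f"{role}\n\n{shots}\n\nQ: {question}\nA:"

prompt = build_prompt(
    role="You are a support assistant. Answer briefly and factually.",
    examples=[("How do I reset my password?",
               "Use the 'Forgot password' link on the login page.")],
    question="How do I change my email address?",
)
print(prompt)
```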

Fine-tuning

  • Adapting the model to specific tasks and domains
  • Using small labeled datasets
  • RLHF (reinforcement learning from human feedback)

RAG (Retrieval-Augmented Generation)

  • Extending the model with retrieval from a knowledge base
  • Combining external sources with generation
  • Reducing hallucinations by grounding in verified facts

More on stages, chunking, and pipeline flavors: AI basics – RAG systems.
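A minimal RAG loop under toy assumptions: word overlap stands in for embedding similarity, and the "knowledge base" is two hard-coded sentences. Real pipelines use vector search over chunked documents.

```python
def retrieve(query, docs, k=1):
    """Return the k docs sharing the most words with the query (toy similarity)."""
    def overlap(doc):
        return len(set(query.lower().split()) & set(doc.lower().split()))
    return sorted(docs, key=overlap, reverse=True)[:k]

docs = [
    "Refunds are processed within 14 days of purchase.",
    "Support is available on weekdays from 9:00 to 18:00.",
]
query = "How long do refunds take?"
context = retrieve(query, docs)[0]
# Ground the generation step in the retrieved snippet:
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
print(prompt)
```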

Chain-of-thought

  • Step-by-step reasoning for hard problems
  • Intermediate computation and logical steps
  • Better math and logic when the model is guided through steps
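In miniature, chain-of-thought is just a prompt that asks for intermediate steps before the answer. The cue phrase and the arithmetic below are illustrative.

```python
question = "A shop sold 3 boxes of 12 apples and 5 loose apples. How many in total?"
cot_prompt = question + "\nLet's think step by step, then state the final answer."

# The intermediate steps the cue encourages the model to spell out:
in_boxes = 3 * 12      # step 1: apples in boxes
total = in_boxes + 5   # step 2: add the loose apples
print(total)           # 41
```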

Popular models

GPT (OpenAI)

  • A family of models from GPT-3 through the current GPT-5-class releases; industry leaders
  • Commercial APIs with a wide range of capabilities
  • ChatGPT as the mass-market product built on these models

Claude (Anthropic)

  • Emphasis on safety and long context
  • Constitutional-style alignment with human values
  • Very large context (up to ~1M tokens in flagship offerings)

LLaMA

  • Open(ish) models from Meta for the research community
  • Base for many derivatives (Alpaca, Vicuna, …)
  • Compact variants for local deployment

Regional / domestic stacks

  • YandexGPT with strong Russian support
  • Sber’s GigaChat for business use
  • Vikhr and other specialized models for niche tasks

Key concepts

Token

  • Smallest unit of text the model processes
  • Can be a word, subword, or symbol
  • Examples: “hello” ≈ 1 token; “unpredictability” often 2–3 tokens
  • Tokenization splits text into a sequence of tokens
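A toy greedy longest-match tokenizer illustrates the idea. Real tokenizers (BPE, WordPiece) learn their vocabularies from data; this vocabulary is invented for the example.

```python
def tokenize(text, vocab):
    """Split text greedily into the longest subwords found in vocab."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):       # try the longest match first
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])              # unknown character: emit it alone
            i += 1
    return tokens

vocab = {"un", "predict", "ability", "hello"}
print(tokenize("hello", vocab))             # ['hello']                     -> 1 token
print(tokenize("unpredictability", vocab))  # ['un', 'predict', 'ability']  -> 3 tokens
```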

Temperature

  • Low (0.1–0.3): more predictable, precise text
  • Mid (0.7–0.9): balance of creativity and coherence
  • High (1.5–2.0): more creative, less coherent

    Infographic: LLM generation temperature from predictable to highly diverse output
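Mechanically, temperature divides the logits before the softmax. A quick sketch shows how a low T sharpens the distribution and a high T flattens it; the logits are made up.

```python
import math

def softmax_with_temperature(logits, T):
    """Softmax over logits scaled by temperature T."""
    scaled = [x / T for x in logits]
    m = max(scaled)                              # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    s = sum(exps)
    return [e / s for e in exps]

logits = [2.0, 1.0, 0.5]
low = softmax_with_temperature(logits, 0.2)      # near-deterministic
high = softmax_with_temperature(logits, 2.0)     # close to uniform
print(round(low[0], 3), round(high[0], 3))       # 0.993 0.481
```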

Model types by role

  • Foundation: pre-trained on large text corpora
  • Instruction-tuned: trained to follow user instructions
  • Chat: tuned for dialogue and multi-turn conversation
  • Specialized: tuned for code, medicine, law, etc.

    Diagram: from foundation models to specialized models (code, medicine, law)

Context window

  • Cap on how much text the model processes in one pass
  • Typical windows span roughly 8K–100K+ tokens; various techniques extend them further
  • Information loss on very long documents
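One common workaround for the cap is chunking: split a long document into overlapping windows that each fit the limit. The sizes below are illustrative.

```python
def chunk(tokens, window=8, overlap=2):
    """Split a token list into windows of `window` tokens overlapping by `overlap`."""
    step = window - overlap
    return [tokens[i:i + window] for i in range(0, len(tokens), step)]

tokens = list(range(20))            # stand-in for a tokenized document
chunks = chunk(tokens)
print(len(chunks), chunks[0])       # 4 [0, 1, 2, 3, 4, 5, 6, 7]
```

The overlap keeps some shared context between neighboring chunks, which softens the information loss at chunk boundaries.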

Problems and limitations

Hallucinations

  • The model may invent facts while sounding confident
  • Plausible but false content
  • Hard to verify everything it generates

Compute

  • Powerful GPUs/TPUs for training and inference
  • High energy use for training large models
  • Cost of building and running infrastructure

Ethics

  • Bias and stereotypes in training data
  • Safety risks and malicious use
  • Copyright and intellectual property issues

Future and trends

Multimodality

  • Text, images, audio, and video
  • Understanding and generating content in multiple formats
  • Integrating modalities for richer understanding

LLM agents

  • Autonomous systems for complex tasks
  • Planning actions and making decisions
  • Using external tools and APIs

More on planning, memory, tools, ReAct, and multi-agent setups: AI basics – LLM agents.
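The loop behind tool-using agents can be sketched as: pick a tool, run it, feed the result back into the context. The planner here is a stub (a fixed plan); in a real agent, an LLM call would generate it step by step.

```python
def calculator(expr):
    """Evaluate a simple arithmetic expression (demo only: never eval untrusted input)."""
    return str(eval(expr, {"__builtins__": {}}))

TOOLS = {"calculator": calculator}

def agent(task, plan):
    """Run each planned tool call and append its result to the context."""
    context = [task]
    for tool_name, tool_input in plan:      # a real agent would decide this at runtime
        result = TOOLS[tool_name](tool_input)
        context.append(f"{tool_name}({tool_input}) -> {result}")
    return context

trace = agent("What is 12 * 7 + 1?", [("calculator", "12 * 7 + 1")])
print(trace[-1])  # calculator(12 * 7 + 1) -> 85
```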

Model optimization

  • Quantization and distillation for speed
  • More efficient architectures
  • Trade-off between size and capability
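Quantization in its simplest form: map float32 weights to int8 with a single scale factor, then map them back. Real schemes use per-channel scales and calibration data; this only shows the size-vs-precision trade-off.

```python
import numpy as np

def quantize(w):
    """Map float weights to int8 using one symmetric scale factor."""
    scale = np.abs(w).max() / 127.0
    q = np.round(w / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Restore approximate float weights from int8 values."""
    return q.astype(np.float32) * scale

w = np.array([0.31, -1.20, 0.07, 0.96], dtype=np.float32)
q, scale = quantize(w)
restored = dequantize(q, scale)
print(q.dtype, np.max(np.abs(w - restored)))  # int8, small reconstruction error
```

Each weight now takes 1 byte instead of 4, at the cost of a rounding error bounded by half the scale factor.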

On-device / local

  • Models on personal devices
  • Privacy without sending data to the cloud
  • Specialized hardware for LLM inference