AI basics – overview

Overview of large language models: capabilities, how they work, prompting, RAG, key concepts, and limits

This overview in the «Bare minimum» series outlines:

  • large language models;
  • architecture;
  • usage patterns;
  • trends.

Video (~9 min): Watch on YouTube

Large language models (LLMs)

What is it? — A tool first; then capabilities and use cases

Text generation

  • Articles, stories, marketing materials
  • Automated reports and documentation
  • Creative writing and content across genres

Question answering

  • Intelligent customer-support chatbots
  • Q&A systems for information retrieval
  • Virtual assistants for everyday tasks

Language analysis

  • Text classification by category and sentiment
  • Summarization of long documents
  • Translation between languages with context preserved

Programming and code

  • Code generation from natural-language descriptions
  • Debugging and fixing errors
  • Comments and explanations for complex code

Powerful — but it helps to know how they work and where they fail.

What is it technically? — A huge statistical machine

Definition and how it works

Neural models trained to predict the next token from its context: they exploit statistical regularities in language, using the entire preceding context to pick the most likely continuation.

Diagram: the model uses the full context to predict the next token.

Input context (Russian tongue-twister): «Карл у Клары украл…» ("Karl stole … from Klara")

Context analysis:

  • Карл (subject)
  • у Клары (from whom)
  • украл (action)

Prediction: analyzing all prior tokens and their relations, the model recognizes the familiar tongue-twister pattern and predicts «кораллы» ("corals") with ~85% probability.
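The flow above can be sketched in a few lines: the model assigns a probability to every token in its vocabulary, and the most likely one is picked. The probability table here is made up for illustration; a real LLM computes it over tens of thousands of tokens.

```python
def predict_next(probs):
    """Pick the most probable next token from a {token: probability} table."""
    return max(probs, key=probs.get)

# Hypothetical distribution after the context «Карл у Клары украл…»:
probs = {"кораллы": 0.85, "кларнет": 0.10, "деньги": 0.05}
print(predict_next(probs))  # кораллы
```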

Transformers and architecture

  • Self-attention over token sequences (sounds simple; in practice it isn't)
  • Parallel processing of the whole sequence instead of a purely recurrent pipeline
  • Multi-layer structure with billions of parameters

    Diagrams: self-attention mechanism; parallel vs recurrent processing; layer stack and parameter scale (weight heatmap)
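The points above can be made concrete with a minimal sketch of scaled dot-product self-attention, the core operation of a transformer layer. This is a single head with no masking or multi-head logic; the sizes and random weights are illustrative.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """X: (seq_len, d) token embeddings; Wq/Wk/Wv: (d, d) projection weights."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # how much each token attends to each other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the sequence
    return weights @ V                               # each output mixes all positions at once

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                          # 4 tokens, 8-dim embeddings
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (4, 8)
```

Note that every position is computed in the same matrix multiplications, which is exactly the parallelism that recurrent models lack.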

Training on text

  • Self-supervised learning on billions of text examples
  • Pre-training on broad data, then specialization
  • Scaling data and compute

How to use them? — Overview of techniques for working with LLMs

Prompt engineering

  • The craft of phrasing requests for better results
  • Structuring instructions with roles, examples, and context
  • Iteratively refining prompts for accurate answers

Deep dive by technique (zero-shot, few-shot, CoT, roles, step-back, …): AI basics – prompt engineering.
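As a sketch of structured prompting, assume a hypothetical `build_prompt` helper that combines a role, few-shot examples, and the user's question. The field layout is just a convention, not any particular API.

```python
def build_prompt(role, examples, question):
    """Assemble a prompt from a role instruction, Q/A examples, and a question."""
    shots = "\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
    return f"{role}\n\n{shots}\n\nQ: {question}\nA:"

prompt = build_prompt(
    role="You are a support assistant. Answer briefly and factually.",
    examples=[("How do I reset my password?",
               "Use the 'Forgot password' link on the login page.")],
    question="How do I change my email address?",
)
print(prompt)
```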

Fine-tuning

  • Adapting the model to specific tasks and domains
  • Using small labeled datasets
  • RLHF (reinforcement learning from human feedback)

RAG (Retrieval-Augmented Generation)

  • Extending the model with retrieval from a knowledge base
  • Combining external sources with generation
  • Reducing hallucinations by grounding in verified facts

More on stages, chunking, and pipeline flavors: AI basics – RAG systems.
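A minimal RAG loop under toy assumptions: word overlap stands in for embedding similarity, and the "knowledge base" is two hard-coded sentences. Real pipelines use vector search over chunked documents.

```python
def retrieve(query, docs, k=1):
    """Return the k docs sharing the most words with the query (toy similarity)."""
    def overlap(doc):
        return len(set(query.lower().split()) & set(doc.lower().split()))
    return sorted(docs, key=overlap, reverse=True)[:k]

docs = [
    "Refunds are processed within 14 days of purchase.",
    "Support is available on weekdays from 9:00 to 18:00.",
]
query = "How long do refunds take?"
context = retrieve(query, docs)[0]
# Ground the generation step in the retrieved snippet:
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
print(prompt)
```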

Chain-of-thought

  • Step-by-step reasoning for hard problems
  • Intermediate computation and logical steps
  • Better math and logic when the model is guided through steps
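In miniature, chain-of-thought is just a prompt that asks for intermediate steps before the answer. The cue phrase and the arithmetic below are illustrative.

```python
question = "A shop sold 3 boxes of 12 apples and 5 loose apples. How many in total?"
cot_prompt = question + "\nLet's think step by step, then state the final answer."

# The intermediate steps the cue encourages the model to spell out:
in_boxes = 3 * 12      # step 1: apples in boxes
total = in_boxes + 5   # step 2: add the loose apples
print(total)           # 41
```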

Popular models

GPT (OpenAI)

  • A family of models from GPT-3 through the current GPT-5-class releases; industry leaders
  • Commercial APIs with a wide range of capabilities
  • ChatGPT as the mass-market product built on these models

Claude (Anthropic)

  • Emphasis on safety and long context
  • Constitutional-style alignment with human values
  • Very large context (up to ~1M tokens in flagship offerings)

LLaMA

  • Open(ish) models from Meta for the research community
  • Base for many derivatives (Alpaca, Vicuna, …)
  • Compact variants for local deployment

Regional / domestic stacks

  • YandexGPT with strong Russian support
  • Sber’s GigaChat for business use
  • Vikhr and other specialized models for niche tasks

Key concepts

Token

  • Smallest unit of text the model processes
  • Can be a word, subword, or symbol
  • Examples: “hello” ≈ 1 token; “unpredictability” often 2–3 tokens
  • Tokenization splits text into a sequence of tokens
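A toy greedy longest-match tokenizer illustrates the idea. Real tokenizers (BPE, WordPiece) learn their vocabularies from data; this vocabulary is invented for the example.

```python
def tokenize(text, vocab):
    """Split text greedily into the longest subwords found in vocab."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):       # try the longest match first
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])              # unknown character: emit it alone
            i += 1
    return tokens

vocab = {"un", "predict", "ability", "hello"}
print(tokenize("hello", vocab))             # ['hello']                     -> 1 token
print(tokenize("unpredictability", vocab))  # ['un', 'predict', 'ability']  -> 3 tokens
```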

Temperature

  • Low (0.1–0.3): more predictable, precise text
  • Mid (0.7–0.9): balance of creativity and coherence
  • High (1.5–2.0): more creative, less coherent

    Infographic: LLM generation temperature from predictable to highly diverse output
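Mechanically, temperature divides the logits before the softmax. A quick sketch shows how a low T sharpens the distribution and a high T flattens it; the logits are made up.

```python
import math

def softmax_with_temperature(logits, T):
    """Softmax over logits scaled by temperature T."""
    scaled = [x / T for x in logits]
    m = max(scaled)                              # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    s = sum(exps)
    return [e / s for e in exps]

logits = [2.0, 1.0, 0.5]
low = softmax_with_temperature(logits, 0.2)      # near-deterministic
high = softmax_with_temperature(logits, 2.0)     # close to uniform
print(round(low[0], 3), round(high[0], 3))       # 0.993 0.481
```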

Model types by role

  • Foundation: pre-trained on large text corpora
  • Instruction-tuned: trained to follow user instructions
  • Chat: tuned for dialogue and multi-turn conversation
  • Specialized: tuned for code, medicine, law, etc.

    Diagram: from foundation models to specialized models (code, medicine, law)

Context window

  • Cap on how much text the model processes in one pass
  • Typical windows span roughly 8K–100K+ tokens; various techniques extend them further
  • Information loss on very long documents
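One common workaround for the cap is chunking: split a long document into overlapping windows that each fit the limit. The sizes below are illustrative.

```python
def chunk(tokens, window=8, overlap=2):
    """Split a token list into windows of `window` tokens overlapping by `overlap`."""
    step = window - overlap
    return [tokens[i:i + window] for i in range(0, len(tokens), step)]

tokens = list(range(20))            # stand-in for a tokenized document
chunks = chunk(tokens)
print(len(chunks), chunks[0])       # 4 [0, 1, 2, 3, 4, 5, 6, 7]
```

The overlap keeps some shared context between neighboring chunks, which softens the information loss at chunk boundaries.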

Problems and limitations

Hallucinations

  • The model may invent facts while sounding confident
  • Plausible but false content
  • Hard to verify everything it generates

Compute

  • Powerful GPUs/TPUs for training and inference
  • High energy use for training large models
  • Cost of building and running infrastructure

Ethics

  • Bias and stereotypes in training data
  • Safety risks and malicious use
  • Copyright and intellectual property issues

Future and trends

Multimodality

  • Text, images, audio, and video
  • Understanding and generating content in multiple formats
  • Integrating modalities for richer understanding

LLM agents

  • Autonomous systems for complex tasks
  • Planning actions and making decisions
  • Using external tools and APIs

More on planning, memory, tools, ReAct, and multi-agent setups: AI basics – LLM agents.
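The loop behind tool-using agents can be sketched as: pick a tool, run it, feed the result back into the context. The planner here is a stub (a fixed plan); in a real agent, an LLM call would generate it step by step.

```python
def calculator(expr):
    """Evaluate a simple arithmetic expression (demo only: never eval untrusted input)."""
    return str(eval(expr, {"__builtins__": {}}))

TOOLS = {"calculator": calculator}

def agent(task, plan):
    """Run each planned tool call and append its result to the context."""
    context = [task]
    for tool_name, tool_input in plan:      # a real agent would decide this at runtime
        result = TOOLS[tool_name](tool_input)
        context.append(f"{tool_name}({tool_input}) -> {result}")
    return context

trace = agent("What is 12 * 7 + 1?", [("calculator", "12 * 7 + 1")])
print(trace[-1])  # calculator(12 * 7 + 1) -> 85
```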

Model optimization

  • Quantization and distillation for speed
  • More efficient architectures
  • Trade-off between size and capability
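Quantization in its simplest form: map float32 weights to int8 with a single scale factor, then map them back. Real schemes use per-channel scales and calibration data; this only shows the size-vs-precision trade-off.

```python
import numpy as np

def quantize(w):
    """Map float weights to int8 using one symmetric scale factor."""
    scale = np.abs(w).max() / 127.0
    q = np.round(w / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Restore approximate float weights from int8 values."""
    return q.astype(np.float32) * scale

w = np.array([0.31, -1.20, 0.07, 0.96], dtype=np.float32)
q, scale = quantize(w)
restored = dequantize(q, scale)
print(q.dtype, np.max(np.abs(w - restored)))  # int8, small reconstruction error
```

Each weight now takes 1 byte instead of 4, at the cost of a rounding error bounded by half the scale factor.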

On-device / local

  • Models on personal devices
  • Privacy without sending data to the cloud
  • Specialized hardware for LLM inference