This overview for the «Bare minimum» series covers:
- large language models;
- architecture;
- usage patterns;
- trends.
Video (~9 min): Watch on YouTube
Large language models (LLMs)
What is it? — A tool first; then capabilities and use cases
Text generation
- Articles, stories, marketing materials
- Automated reports and documentation
- Creative writing and content across genres
Question answering
- Intelligent customer-support chatbots
- Q&A systems for information retrieval
- Virtual assistants for everyday tasks
Language analysis
- Text classification by category and sentiment
- Summarization of long documents
- Translation between languages with context preserved
Programming and code
- Code generation from natural-language descriptions
- Debugging and fixing errors
- Comments and explanations for complex code
Powerful — but it helps to know how they work and where they fail.
What is it technically? - A huge statistical machine
Definition and how it works
Neural models trained to predict the next token from context.
They use statistical regularities in language to generate text.
An LLM uses the full context to pick the most likely next token.

LLM token prediction flow
Input context (Russian tongue-twister): «Карл у Клары украл…» ("Karl stole … from Klara")
Context analysis:
- Карл (subject)
- у Клары (whose / from whom)
- украл (action)
Prediction: the model analyzes all prior tokens and their relations, recognizes the familiar tongue-twister pattern
→ «кораллы» (“corals”)
Probability: 85%
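A minimal sketch of the idea, with made-up probabilities rather than real model output: the model scores every candidate continuation of the context, and the decoder picks from that distribution.

```python
# Toy next-token prediction: the probabilities below are illustrative, not real model output.
candidates = {
    "кораллы": 0.85,   # "corals": the expected tongue-twister continuation
    "кларнет": 0.07,   # "clarinet": appears in the second line of the tongue-twister
    "кошелёк": 0.05,   # "wallet": a plausible object of "stole"
    "машину": 0.03,    # "a car": grammatical but generic
}

# Greedy decoding: take the single most probable token.
best_token = max(candidates, key=candidates.get)
print(best_token, candidates[best_token])  # кораллы 0.85
```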

Transformers and architecture
- Self-attention over sequences (sounds simple; in practice, it isn't)
- Parallel processing of tokens instead of a purely recurrent pipeline
- Multi-layer structure with billions of parameters (often visualized as a weight heatmap)
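For intuition, here is a minimal single-head scaled dot-product attention sketch in NumPy; in a real transformer Q, K, and V come from learned projections, which this toy version skips.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Single-head self-attention: every position attends to every other position."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # pairwise similarity of positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the sequence
    return weights @ V                              # weighted mix of value vectors

rng = np.random.default_rng(0)
seq_len, d_model = 5, 8                             # toy sizes: 5 tokens, 8-dim embeddings
X = rng.normal(size=(seq_len, d_model))
out = scaled_dot_product_attention(X, X, X)         # Q = K = V = X to keep the sketch minimal
print(out.shape)  # (5, 8): one updated vector per token, all positions processed in parallel
```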

Training on text
- Self-supervised learning on billions of text examples
- Pre-training on broad data, then specialization
- Scaling data and compute
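The key point of self-supervision is that the labels come from the text itself: every position is a training example whose target is simply the next token. A word-level toy sketch:

```python
text = "the cat sat on the mat"
tokens = text.split()   # word-level "tokens", just for illustration

# Build (context, target) pairs the way next-token pretraining does.
pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]
for context, target in pairs:
    print(context, "->", target)
# ['the'] -> cat
# ['the', 'cat'] -> sat
# ...no human labeling required; the corpus itself supplies the targets.
```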
How to use them? - Overview of techniques for working with LLMs
Prompt engineering
- The craft of phrasing requests for better results
- Structuring instructions with roles, examples, and context
- Iteratively refining prompts for accurate answers
Deep dive by technique (zero-shot, few-shot, CoT, roles, step-back, …): AI basics – prompt engineering.
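A sketch of a structured prompt with a role, a few-shot example, and context; the system/user message layout follows the common chat-API convention, and the content is invented for illustration.

```python
messages = [
    {"role": "system",
     "content": "You are a support agent. Answer in two sentences, politely."},
    # Few-shot example showing the desired style of answer.
    {"role": "user", "content": "How do I reset my password?"},
    {"role": "assistant",
     "content": "Open Settings -> Security and choose 'Reset password'. "
                "A confirmation link will arrive by email."},
    # The real question, with the extra context the model needs.
    {"role": "user",
     "content": "Context: the user is on the mobile app, version 3.2.\n"
                "Question: why can't I upload a profile photo?"},
]
```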
Fine-tuning
- Adapting the model to specific tasks and domains
- Using small labeled datasets
- RLHF (reinforcement learning from human feedback)
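A toy PyTorch sketch of the adaptation idea: a "pretrained" backbone stays frozen while a small task head is trained on a handful of labeled examples. Real fine-tuning works on an actual LLM and text data; the shapes here are purely illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
backbone = nn.Linear(16, 32)            # stands in for the pretrained model
for p in backbone.parameters():
    p.requires_grad = False             # freeze the pretrained weights

head = nn.Linear(32, 2)                 # new task-specific classifier
opt = torch.optim.Adam(head.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

# Tiny "labeled dataset": 8 examples, 2 classes.
x = torch.randn(8, 16)
y = torch.randint(0, 2, (8,))

for step in range(200):
    loss = loss_fn(head(backbone(x)), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
print(round(loss.item(), 4))            # the loss drops as the head adapts to the labels
```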
RAG (Retrieval-Augmented Generation)
- Extending the model with retrieval from a knowledge base
- Combining external sources with generation
- Reducing hallucinations by grounding in verified facts
More on stages, chunking, and pipeline flavors: AI basics – RAG systems.
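A minimal RAG sketch: pick the most relevant snippet from a tiny invented knowledge base by word overlap, then splice it into the prompt so the answer is grounded in it. Real systems use embedding models and vector stores; only the shape of the pipeline is kept here.

```python
knowledge_base = [
    "Refunds are processed within 5 business days of the request.",
    "The premium plan includes priority support and a 1 TB storage quota.",
    "Two-factor authentication can be enabled in account security settings.",
]

def score(query: str, doc: str) -> int:
    """Crude relevance: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

query = "How long do refunds take?"
best_doc = max(knowledge_base, key=lambda d: score(query, d))   # retrieval step

prompt = (                                                       # augmentation step
    "Answer using only the context below.\n"
    f"Context: {best_doc}\n"
    f"Question: {query}"
)
print(prompt)   # this grounded prompt is what gets sent to the model
```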
Chain-of-thought
- Step-by-step reasoning for hard problems
- Intermediate computation and logical steps
- Better math and logic when the model is guided through steps
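Chain-of-thought is mostly a prompting pattern: ask the model to show intermediate steps before the final answer. A sketch of the two prompt styles (the wording is an illustration, not a fixed recipe):

```python
question = "A train leaves at 14:20 and the trip takes 1 h 55 min. When does it arrive?"

direct_prompt = f"{question}\nAnswer with the arrival time only."

cot_prompt = (
    f"{question}\n"
    "Think step by step: first add the hours, then the minutes, "
    "then state the final arrival time."
)
# The second prompt nudges the model into intermediate steps
# (14:20 + 1:00 = 15:20; 15:20 + 0:55 = 16:15) before committing to an answer.
print(cot_prompt)
```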
Popular models
GPT (OpenAI)
- A family spanning GPT-3, GPT-3.5, GPT-4, and later releases; an industry leader
- Commercial APIs with a wide range of capabilities
- ChatGPT as the mass-market product built on these models
Claude (Anthropic)
- Emphasis on safety and long context
- Constitutional-style alignment with human values
- Very large context (up to ~1M tokens in flagship offerings)
LLaMA
- Open(ish) models from Meta for the research community
- Base for many derivatives (Alpaca, Vicuna, …)
- Compact variants for local deployment
Regional / domestic stacks
- YandexGPT with strong Russian support
- Sber’s GigaChat for business use
- Vikhr and other specialized models for niche tasks
Key concepts
Token
- Smallest unit of text the model processes
- Can be a word, subword, or symbol
- Examples: “hello” ≈ 1 token; “unpredictability” often 2–3 tokens
- Tokenization splits text into a sequence of tokens
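A quick way to see tokenization in practice, assuming the tiktoken library is installed; other models use different tokenizers, so the exact splits and counts vary.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # an encoding used by several OpenAI models

for word in ["hello", "unpredictability"]:
    ids = enc.encode(word)
    print(word, "->", len(ids), "token(s):", [enc.decode([i]) for i in ids])
# "hello" is typically a single token; longer or rarer words split into several.
```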
Temperature
- Low (0.1–0.3): more predictable, precise text
- Mid (0.7–0.9): balance of creativity and coherence
- High (1.5–2.0): more creative, less coherent
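Temperature simply rescales the logits before the softmax, so low values sharpen the distribution and high values flatten it. A NumPy sketch with toy scores:

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    """Rescale logits by the temperature, then normalize into probabilities."""
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()                      # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()

logits = [4.0, 2.5, 1.0, 0.5]         # toy scores for four candidate tokens
for t in (0.2, 0.8, 1.8):
    print(t, np.round(softmax_with_temperature(logits, t), 3))
# T=0.2 -> almost all probability on the top token (predictable output)
# T=1.8 -> probability spread across candidates (more diverse, less coherent)
```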

Model types by role
- Foundation: pre-trained on large text corpora
- Instruction-tuned: trained to follow user instructions
- Chat: tuned for dialogue and multi-turn conversation
- Specialized: tuned for code, medicine, law, etc.

Context window
- Cap on how much text the model processes in one pass
- Various ways to extend context (roughly 8K–100K tokens in many systems)
- Information loss on very long documents
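A naive sketch of fitting a conversation into a fixed window: count the tokens and keep only the most recent ones (again assuming tiktoken; real systems use smarter strategies such as summarizing older turns).

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
context_window = 8    # toy limit; real windows are thousands of tokens

history = "user: hi\nassistant: hello!\nuser: summarize our chat so far please"
ids = enc.encode(history)
if len(ids) > context_window:
    ids = ids[-context_window:]          # drop the oldest tokens
print(len(ids), repr(enc.decode(ids)))   # whatever fell outside the window is simply lost
```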
Problems and limitations
Hallucinations
- The model may invent facts while sounding confident
- Plausible but false content
- Hard to verify everything it generates
Compute
- Powerful GPUs/TPUs for training and inference
- High energy use for training large models
- Cost of building and running infrastructure
Ethics
- Bias and stereotypes in training data
- Safety risks and malicious use
- Copyright and intellectual property issues
Future and trends
Multimodality
- Text, images, audio, and video
- Understanding and generating content in multiple formats
- Integrating modalities for richer understanding
LLM agents
- Autonomous systems for complex tasks
- Planning actions and making decisions
- Using external tools and APIs
More on planning, memory, tools, ReAct, and multi-agent setups: AI basics – LLM agents.
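A toy sketch of the agent loop: the "model" decides whether to call a tool, the tool result is appended to the history, and the loop continues until a final answer. The stub_llm function below is a hard-coded stand-in for a real LLM call.

```python
def calculator(expression: str) -> str:
    """Example external tool the agent can invoke (toy only, not safe for untrusted input)."""
    return str(eval(expression, {"__builtins__": {}}))

def stub_llm(history: list[str]) -> str:
    """Pretends to be the model: first asks for a tool, then produces the final answer."""
    if not any(line.startswith("TOOL RESULT") for line in history):
        return "CALL calculator: 37 * 24"
    return "FINAL: 37 days contain " + history[-1].split()[-1] + " hours"

history = ["TASK: how many hours are there in 37 days?"]
while True:
    step = stub_llm(history)
    history.append(step)
    if step.startswith("FINAL:"):
        print(step)                       # FINAL: 37 days contain 888 hours
        break
    tool_input = step.split(":", 1)[1].strip()
    history.append("TOOL RESULT: " + calculator(tool_input))
```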
Model optimization
- Quantization and distillation for speed
- More efficient architectures
- Trade-off between size and capability
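A toy sketch of post-training quantization: store weights as 8-bit integers plus a single scale factor, trading a little precision for a 4x smaller footprint than float32. Real schemes are per-channel and more careful, but the idea is the same.

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(scale=0.1, size=1000).astype(np.float32)   # pretend model weights

scale = np.abs(weights).max() / 127            # map the value range onto int8
quantized = np.round(weights / scale).astype(np.int8)
restored = quantized.astype(np.float32) * scale

print("bytes:", weights.nbytes, "->", quantized.nbytes)          # 4000 -> 1000
print("max error:", float(np.abs(weights - restored).max()))     # small rounding error
```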
On-device / local
- Models on personal devices
- Privacy without sending data to the cloud
- Specialized hardware for LLM inference
