Mathematics on Blog

Math Analysis — Derivatives

Fri, 13 Mar 2026 00:00:00 +0000

Without this you can’t understand how neural networks learn.

A common example is the speed of a car. We’ll do the same.

What is speed?

It’s the ratio of distance to time.

A car traveling at an average speed of 60 km/h — if we increase time \( t \) by one unit, distance \( s \) increases by 60 units. If we multiply \( t \) by 10, \( s \) is multiplied by 10 as well. The growth is linear; on a graph it’s a straight line.

s = 60t

But over time the car moves unevenly — speed increases and decreases.

s(t) — distance v(t) — velocity

Impossible

Distance traveled cannot decrease. Such an s(t) graph would be wrong:

s(t) — wrong

Self-check questions

1. When is the speed at its maximum?

Hint

Look at the v(t) graph. Where does it reach its highest value?

2. What does the slope of s(t) at a given point represent?

Hint

Velocity is the ratio of distance change to time. How does that relate to slope?

3. Why does v(t) first increase and then decrease?

Hint

v is the derivative of s. How does the slope of s(t) change from start to end? Recall the S-shape.

Essential Mathematics — Overview

Thu, 12 Mar 2026 00:00:00 +0000

This is the intro to the «Essential Mathematics» series — an overview of which math areas are needed for AI and machine learning, and what they are used for.

Linear Algebra

What you need: vectors, matrices, matrix multiplication, linear transformations, dot product, vector norm, eigenvalues and eigenvectors (advanced).

Why:

Vectors — the basic way to represent data: a word, an image, or a model’s «thought» is encoded as a vector of numbers
Matrices — all neural network operations (convolutions, attention, linear layers) boil down to matrix multiplication
Linear transformations — understanding how a layer «mixes» and combines features
Without linear algebra you cannot read formulas in papers or understand what the code does (e.g. W @ x is weight matrix times input vector)¹

Derivatives

What you need: derivatives, partial derivatives, gradient, chain rule, function optimization.

Why:

Neural networks are trained via gradient descent: we move weights in the direction that reduces the error
Gradient — a vector of partial derivatives showing which way to «tweak» each parameter
Chain rule is the backbone of backpropagation — the algorithm that computes gradients for all layers
Without calculus you cannot understand where weight updates come from or why training works at all

Probability and Statistics

What you need: distributions (normal, Bernoulli, etc.), expectation, variance, conditional probability, Bayes’ theorem, maximum likelihood estimation (MLE).

Why:

Models are often formulated probabilistically: «probability of the next word», «how confident is the model»
Loss functions often derive from likelihood (cross-entropy, MSE, etc.)
Statistics for working with data: normalization, quality assessment, understanding metrics (precision, recall, AUC)
Bayes — foundation of many classical algorithms and the way to «update beliefs» on new data

Computer Vision (CV)

On top of the general minimum: discrete convolution (2D), affine and projective transformations, basic image geometry (camera calibration, epipolar geometry).

Why:

Convolution — the core of CNNs: understanding how a filter «slides» over an image, and what kernel, padding, stride mean
Affine transformations (rotation, scale, shift) — for augmentations and understanding invariance
Projective geometry and homographies — for stereo vision, 3D reconstruction, AR
For classical methods (SIFT, optical flow) — image derivatives and per-pixel gradients are useful

Optional but helpful

Information theory (entropy, cross-entropy) — deeper understanding of loss functions and data compression
Convex optimization — when the problem is «nice» and when SGD converges
Differential geometry — for advanced topics (diffusion models, representations)

Practical minimum

To comfortably read papers and code:

Be able to multiply matrices and understand what a linear layer does
Understand what a gradient is and why it matters for training
Know basic distributions and quality metrics

You can pick up the rest as needed — when it appears in a specific problem or paper.

W @ x — in Python (NumPy, PyTorch) the matrix multiplication operator: the weight matrix W is multiplied by the input vector x. The result is the output vector. A linear layer essentially computes W·x + b. ↩︎

Math Analysis — Derivatives

Thu, 12 Mar 2026 00:00:00 +0000

Second article in the «Essential Mathematics» series — on derivatives, gradient, and chain rule.

Without this you can’t understand how neural networks learn.

Prerequisites: limits — foundation for defining the derivative and continuity.

What is a derivative
Summary

What is a derivative

Formal definition

The derivative of a function at a point shows the rate of change: how fast the function grows (or decreases) for a small shift in the argument.

For a single-variable function \( f(x) \), the derivative \( f'(x) \) is the limit of the ratio of the change in \( f \) to the change in \( x \) as the change goes to zero:

\[ f'(x) = \lim_{\Delta x \to 0} \frac{f(x + \Delta x) - f(x)}{\Delta x} \]

Here \( \Delta x \) is the increment of the argument (a small step). In the examples below we use \( h \) for the same quantity.

Geometrically — the slope of the tangent line at \( x \):

\( f'(x) > 0 \) — function is increasing
\( f'(x) < 0 \) — function is decreasing
\( f'(x) = 0 \) — possible minimum or maximum

If it didn’t click — open the simple explanation.

Simple explanation

The speedometer — that’s essentially a derivative: how fast the distance traveled changes. Accelerating — speed goes up. Braking — it drops. So the derivative answers: “how much does one quantity change when you change another one a little?”

A hill. The steepness of a slope — “how far down you go when you take a step forward.” Steep slope — large derivative. Gentle — small. Flat road — zero.

A function graph. If you have a graph \( y = f(x) \), the derivative at a point is the slope of the graph at that point. For a straight line \( y = kx + b \) the slope is the familiar coefficient \( k \). For a curve, each point has its own slope — that’s the derivative.

Graph steeply going up → positive derivative
Going down → negative derivative
Flat (top or bottom) → derivative equals zero

If it’s still not crystal clear — open the simplest explanation.

The simplest explanation

A car drives at constant speed. Distance \( s \) (km) depends on time \( t \) (h): \( s(t) = 60t \). In one hour we cover 60 km.

Speed is “how many km per hour.” The ratio of change in distance to change in time:

\[ \text{speed} = \frac{s(1) - s(0)}{1 - 0} = \frac{60 - 0}{1} = 60 \text{ km/h} \]

That’s the derivative: the rate of change of one quantity (distance) with respect to another (time).

Numeric example:

\( t \) (h)	\( s(t) \) (km)	\( s(t+0.5) - s(t) \)	\( \frac{\Delta s}{\Delta t} \)
0	0	30	60
1	60	30	60
2	120	30	60

Everywhere we get 60. Yes: 60 (the speed in km/h) is the derivative — \( s'(t) = 60 \). It’s constant because the speed doesn’t change.

The graph — a straight line. The slope of this line is also 60 — the same number:

What if the speed varied? First 30 min at 40 km/h, next 30 min at 80 km/h. Total: 20 + 40 = 60 km in 1 h. Average speed is 60 km/h — but at each moment it was either 40 or 80.

The graph shows a smooth variant: speed increases from 40 to 80 km/h. Green curve — distance s(t), dashed line — derivative s’(t) (instantaneous speed).

We compute the instantaneous speed (derivative) at several points — using \( \frac{s(t+h) - s(t)}{h} \) with small \( h = 0.1 \) h:

\( t \) (h)	\( s(t) \) (km)	\( s(t+0.1) - s(t) \)	\( \frac{\Delta s}{\Delta t} \)
0.1	4	4	40
0.25	10	4	40
0.5	20	—	40 (from left) / 80 (from right)
0.75	40	8	80
1	60	8	80

In the first half-hour the derivative is 40 everywhere; in the second, 80. At \( t = 0.5 \) the graph has a “kink”: slope 40 on the left, 80 on the right. The average 60 is just an average over the whole trip; the derivative tells you the speed at that exact moment.

Rules and basic derivative facts

Main rules and facts

The derivative is the rate of change. How fast the function grows (or decreases) for a small shift in the argument. Geometrically — the slope of the tangent to the graph.

Sign and behavior:

\( f'(x) > 0 \) — function is increasing
\( f'(x) < 0 \) — function is decreasing
\( f'(x) = 0 \) — possible minimum or maximum (peak, valley)

Main rules (brief):

Function	Derivative	Description
\( C \) (constant)	\( 0 \)	constant does not change; rate of change is zero
\( x \)	\( 1 \)	linear growth at rate 1
\( x^n \)	\( n \cdot x^{n-1} \)	exponent “comes down” as factor; power decreases by 1
\( kx + b \)	\( k \)	linear function: slope equals coefficient \( k \)
\( f + g \)	\( f' + g' \)	derivative of sum equals sum of derivatives
\( C \cdot f \)	\( C \cdot f' \)	constant can be factored out of the derivative

Examples: \( (5)' = 0 \), \( (x^2)' = 2x \), \( (x^3)' = 3x^2 \), \( (7x - 2)' = 7 \).

Why it matters in ML: training minimizes the loss. The derivative points in the direction of increase; gradient descent moves the opposite way (\( -\nabla L \))¹ so the loss decreases.

Why \( (x^n)' = n \cdot x^{n-1} \)?

For natural \( n \) we can derive from the definition. Example for \( n = 2 \):

\[ \frac{(x+h)^2 - x^2}{h} = \frac{x^2 + 2xh + h^2 - x^2}{h} = 2x + h \to 2x \]

As \( h \to 0 \) we get \( (x^2)' = 2x \). For \( n = 3 \): expanding \( (x+h)^3 \), the \( h^2 \) term vanishes in the limit, leaving \( 3x^2 \).

Intuition: the power \( n \) «comes down» as a multiplier; the exponent drops by 1. Higher power means steeper growth — so the derivative has \( n \) and \( x^{n-1} \).

Example: parabola \( y = x^2 \) — in detail

For \( f(x) = x^2 \) the formula gives \( f'(x) = 2x \). Each point has its own slope:

Point \( x \)	\( f(x) = x^2 \)	\( f'(x) = 2x \) (slope)
0	0	0
1	1	2
2	4	4
3	9	6
4	16	8

At \( x = 2 \): the parabola is tangent to a line with slope 4 (tangent \( y = 4x - 4 \) passes through (2, 4)).
At \( x = 4 \): slope is 8 — the parabola is steeper, tangent \( y = 8x - 16 \).
At \( x = 0 \): vertex of the parabola, slope 0 — tangent is horizontal.

On the graph: green curve — \( y = x^2 \); dashed lines — tangents at (2, 4) and (4, 16). Slopes 4 and 8 match \( f'(2) = 4 \), \( f'(4) = 8 \).

Example: square — \( x \) as side length, area \( A(x) = x^2 \)

ΔA is the increment (change) of area A. If area is given by function \( A(x) \), then:

\[ \Delta A = A(x + \Delta x) - A(x) \]

So ΔA is how much area is added when side \( x \) increases by Δx.

Increase the side by \( \Delta x \). To the original square we add: two “strips” of area \( x \cdot \Delta x \) (right and top) and a small square \( (\Delta x)^2 \) in the corner. So:

\[ \Delta A = 2x \cdot \Delta x + (\Delta x)^2,\quad \frac{\Delta A}{\Delta x} = 2x + \Delta x \to 2x \]

as \( \Delta x \to 0 \). Derivative \( A'(x) = 2x \) — for the same \( \Delta x \), larger \( x \) gives more area gain.

Change x and Δx — observe how the “strips” and \( \Delta A / \Delta x \approx 2x \) change.

It’s important to distinguish two cases.

\( \Delta A/\Delta x \approx 2x \) — yes, this is an approximation. It holds when \( \Delta x \) is small but nonzero. The smaller \( \Delta x \), the more accurate:

\[ \frac{\Delta A}{\Delta x} = 2x + \Delta x \approx 2x \quad \text{for small } \Delta x \]

The formula \( (x^n)' = nx^{n-1} \) — this is not an approximation, but an exact expression. The derivative is defined as the limit:

\[ f'(x) = \lim_{\Delta x \to 0} \frac{f(x + \Delta x) - f(x)}{\Delta x} \]

We take this limit first, then obtain the exact formula.

\( \Delta A/\Delta x \) — approximation, because \( \Delta x \) is still finite. \( A'(x) = 2x \) and \( (x^n)' = nx^{n-1} \) — exact formulas, because they are the result of the limit \( \Delta x \to 0 \).

In short: approximation applies to finite differences; the derivatives themselves and their formulas are exact.

Why \( (C \cdot f)' = C \cdot f' \)?

The constant can be factored out of the derivative because it does not depend on \( x \) and merely scales the rate of change of the function.

From the definition:

\[ (C \cdot f)'(x) = \lim_{h \to 0} \frac{C \cdot f(x+h) - C \cdot f(x)}{h} = C \cdot \lim_{h \to 0} \frac{f(x+h) - f(x)}{h} = C \cdot f'(x) \]

The constant \( C \) does not depend on \( h \), so it can be taken outside the limit.

Examples: \( (5x^2)' = 5 \cdot 2x = 10x \), \( (-3x^3)' = -3 \cdot 3x^2 = -9x^2 \).

Now a few numerical examples.

Numerical example: linear function \( g(x) = 3x - 1 \)

For a straight line the slope is the same everywhere. At any point, e.g. \( x = 5 \):

\[ \frac{g(5 + h) - g(5)}{h} = \frac{(3(5+h) - 1) - g(5)}{h} = \frac{14 + 3h - 14}{h} = \frac{3h}{h} = 3 \]

where \( g(5) = 3 \cdot 5 - 1 = 14 \).

For any \( h \neq 0 \) we get 3 — the derivative is constant, same as the coefficient of \( x \) in the line equation.

\( x \)	\( g(x) \)	\( g(x+0.1) - g(x) \)	\( \frac{g(x+0.1)-g(x)}{0.1} \)
5	14	0.3	3
10	29	0.3	3
-2	-7	0.3	3

Numerical example: quadratic function \( f(x) = x^2 \)

Approximate the derivative at \( x = 2 \) using “change in \( f \) over change in \( x \)”. Take a small step \( h \):

\[ \frac{f(2 + h) - f(2)}{h} = \frac{(2+h)^2 - 4}{h} \]

At \( x = 2 \) (\( f(2) = 4 \)):

\( h \)	\( f(2+h) \)	\( f(2+h) - f(2) \)	\( \frac{f(2+h)-f(2)}{h} \)
1	9	5	5
0.1	4.41	0.41	4.1
0.01	4.0401	0.0401	4.01
0.001	4.0004…	0.004…	4.001

The ratio tends to 4 → \( f'(2) = 4 \).

At \( x = 4 \) (\( f(4) = 16 \)):

\[ \frac{f(4 + h) - f(4)}{h} = \frac{(4+h)^2 - 16}{h} = \frac{8h + h^2}{h} = 8 + h \to 8 \]

\( h \)	\( f(4+h) \)	\( f(4+h) - f(4) \)	\( \frac{f(4+h)-f(4)}{h} \)
1	25	9	9
0.1	16.81	0.81	8.1
0.01	16.0801	0.0801	8.01

The ratio tends to 8 → \( f'(4) = 8 \). By \( (x^2)' = 2x \): at \( x = 2 \) we get 4, at \( x = 4 \) — 8. For a curve, the derivative is different at each point.

Why it matters in ML: training is minimization of the loss function. We need to know which direction to change the weights so that the loss decreases. The derivative points in the direction of increase; we go the opposite way so the loss drops.

Interactive exercise

Car and Bezier curve

1. Car. Change the speed — the distance graph \( s(t) \) and derivative table update. Derivative \( s'(t) \) = speed (constant for uniform motion).

2. Draw a Bezier curve. Drag 4 control points — a cubic Bezier curve is built. Place the orange point on the curve — the derivative calculation appears on the right.

Self-check tasks

Tasks with hints and verification

Solve the tasks. Open the hint if needed. Enter your answer and click “Check”.

Summary

Derivative — rate of change, direction of increase
Partial derivative — with respect to one variable, others fixed
Gradient — vector of partial derivatives, direction of steepest increase
Chain rule — how to compute derivatives along a chain of layers
Gradient descent — update weights in the \( -\nabla L \) direction ¹

PyTorch, TensorFlow, etc. compute gradients automatically (autograd) — but understanding what happens under the hood helps when you hit «exploding gradients» or «model won’t learn».

\( \nabla L \) — the gradient of the loss (vector of partial derivatives). It points in the direction of steepest increase of \( L \). So \( -\nabla L \) is the direction of steepest decrease; we move that way in gradient descent. ↩︎ ↩︎

Math Analysis — Example of Analysis (Finding πr²)

Thu, 12 Mar 2026 00:00:00 +0000

How do we find the area of a circle analytically?

Slicing into Rectangles

Idea: slice the circle into thin concentric rings and «unroll» each ring into a rectangle.

A ring of radius r and width Δr has circumference 2πr. Unrolled, it becomes a rectangle:

length: 2πr
width: Δr
area: 2πr · Δr

The total area is the sum of all such rings, from r = 0 to r = R. The smaller Δr, the better the approximation.

If we arrange these cut strips (each of length 2πr) on a graph from left to right — from r = 0 to r = R — we get a triangle. Base R, height 2πR: the area of the triangle is ½ · R · 2πR = πR².

Result: area of the circle S = πR². The same approach — slicing a shape into «small pieces», summing their areas, and taking the limit — underlies integral calculus and, as we will see, leads to derivatives.

Mathematics on Blog

Math Analysis — Derivatives

What is speed?

Essential Mathematics — Overview

Linear Algebra

Derivatives

Probability and Statistics

Computer Vision (CV)

Optional but helpful

Practical minimum

Math Analysis — Derivatives

Contents

What is a derivative

Formal definition

Simple explanation

The simplest explanation

Rules and basic derivative facts

Numerical example: linear function \( g(x) = 3x - 1 \)

Numerical example: quadratic function \( f(x) = x^2 \)

Interactive exercise

Self-check tasks

Summary

Math Analysis — Example of Analysis (Finding πr²)

Slicing into Rectangles