This document is derived from the lectures by Prof. Dr. T. Brox, Dr. O. Ronneberger, and Prof. Dr. M. Teschner at the University of Freiburg in the winter semester 2015/16 (WS1516).

- regular grid \(I_{ij}\) (intensity values)
- continuous function \(I: (\Omega \subset \Re^2) \rightarrow \Re\) with \((x, y) \mapsto I(x, y)\)
- \(\Omega\) is image domain, usually rectangular

- Ability to adapt to changing light intensities
- Retina: Cones (color vision), rods (gray, high sensitivity)
- Analog preprocessing (smoothing, edge detection)
- Sharp image only in the center; significantly reduces the amount of data before sending it to the brain
- “digital”: spiking or not spiking, timing
- Gabor filter: linear filter for edge detection
- Sparse coding: over-complete code, only few are active for certain signal
- lower cortical areas (primary visual cortex V1): neurons respond only to signals in a very small local area
- higher cortical areas: receptive fields larger (integration of information from lower areas)

- Pinhole: Object points \((X, Y, Z)\) are projected to \((x, y)\) by \(x = f \frac{X}{Z}; y = f \frac{Y}{Z}\). Problem: the aperture (size of the hole) trades off sharpness against light intensity
- Optical: light focusing; a large aperture and sharpness are possible at the same time. \(\frac{1}{S_1} + \frac{1}{S_2} = \frac{1}{f}\). Computer vision practice: pinhole camera model + correction of lens effects
- where the effect of lenses is more important: shape from defocus, image analysis in microscopy

- Discretize image domain: \(\{I_{ij} \,|\, i = 1, ..., N; j = 1, ..., M\}\)
- Grid points: pixels
- Grid: usually rectangular point grid, equal spacing \(h\) (Grid size)
- spacing is the same in all directions \(h = h_x = h_y\)
- if true spacing not known: \(h = 1\)
- Reducing the number of pixels: downsampling. Increasing it: upsampling

- Pros Continuous: scene is continuous, limiting case for successively finer grids, rotational invariance, exact length, subpixel accuracy
- Pros Discrete models: sensor data discrete, model closer to implementation, efficient optimization possible

- function can be represented by its frequency components (Fourier)
- Discrete signals can only represent frequencies up to a certain limit (**Nyquist frequency**)
- Ignoring the Nyquist frequency leads to aliasing artifacts (e.g. stepping along straight lines)
- Sampling of periodic signals is frequency modulation (multiplication of two periodic signals)
- It can lead to Moiré effects (special aliasing artifact)
- Videos: temporal aliasing

**The input signal can be reconstructed from its samples in a unique way if the sampling rate is at least two times the bandwidth of the input signal**

- two different frequencies may lead to the same discrete signal
- reduction of bandwidth (max freq) can be achieved by smoothing
- consequence: smooth your input image sufficiently before downsampling

- Decanting operator: ensures minimum smoothing necessary to avoid aliasing
- In case of overlap: intensity distribution to both cells according to overlap ratio
- normalization of intensity value in low res image by downsampling factor
- Operator is separable: can be applied sequentially along all axes

- Bilinear interpolation: weighted avg of neighboring pixels
- project fine grid point to available coarse grid
- weighted avg along x and y average: \[a_j = (1 - \alpha)I_{i,j} + \alpha I_{i+1,j}\] \[a_{j+1} = (1 - \alpha)I_{i,j + 1} + \alpha I_{i+1,j+1}\] \[I_{k,l} = (1 - \beta)a_j + \beta a_{j+1}\]
- general concept to retrieve values at points between grid points
- arbitrary dimensions
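
A minimal NumPy sketch of the bilinear interpolation formulas above (a hypothetical helper; it assumes `I` is indexed as `I[i, j]` with `i` along x, and omits boundary handling):

```python
import numpy as np

def bilinear_interpolate(I, x, y):
    """Evaluate image I at the continuous position (x, y) by weighted
    averaging of the four neighboring grid points, as in the formulas above."""
    i, j = int(np.floor(x)), int(np.floor(y))   # coarse grid cell
    alpha, beta = x - i, y - j                  # fractional offsets in x and y
    a_j  = (1 - alpha) * I[i, j]     + alpha * I[i + 1, j]      # average along x
    a_j1 = (1 - alpha) * I[i, j + 1] + alpha * I[i + 1, j + 1]
    return (1 - beta) * a_j + beta * a_j1                       # average along y
```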

- Discretization of co-domain \(\Re \rightarrow \{1, ..., N\}\)
- needed for a float or integer representation
- 256 gray scales, 8 bit per px
- human: only 40 gray scales, but thousands of colors
- optimal quantization by clustering
- assume: \(I(x, y) \in \Re; 0 \leq I(x, y) \leq 255\)

- Additive noise: \(I_{ij} = I^*_{ij} + n_{ij}\)
- Gaussian noise: \(G_\sigma(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp(-\frac{(x-\mu)^2}{2\sigma^2})\)
- Multiplicative noise: \(I_{ij}=I^*_{ij}(1 + n_{ij})\)
- Trick: \(\log I_{ij} = \log I^*_{ij} + \log(1+ n_{ij})\) (additive noise).
- Back-transform: \(I^*_{ij} = \exp(\log I^*_{ij})\)

- Impulse noise: certain percentage of pixels is replaced by one or two fixed values (pixel defect)
- salt-and-pepper: some pixels replaced by white or black values

- Uniform noise: a certain percentage of pixels is replaced by random values

- Measure: \(I\) versus \(I^*\) (ground truth)
- variance of image versus variance of noise
- variance of image: \(\sigma^2_I = \frac{1}{N}\sum_i(I^*_i - \mu)^2\)
- additive, zero-mean noise model: \(I_i = I^*_i + n_i\)
- variance of noise \(\sigma^2_n = \frac{1}{N}\sum_i(I^*_i - I_i)^2\)
- **Signal-to-noise ratio**: \(SNR = 10 \log_{10}\frac{\sigma^2_I}{\sigma^2_n} = 10 \log_{10}\frac{\sum_i(I^*_i - \mu)^2}{\sum_i(I^*_i - I_i)^2}\)
- **Peak signal-to-noise ratio**: \(PSNR = 10 \log_{10}\left(\frac{N(\max_i I^*_i - \min_i I^*_i)^2}{\sum_i(I^*_i - I_i)^2}\right)\)
- Unit: decibel (dB). The higher, the better (a sketch of both measures follows below)
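
A short NumPy sketch of both measures, assuming `I_true` is the ground truth \(I^*\) and `I_noisy` the observed image:

```python
import numpy as np

def snr_db(I_true, I_noisy):
    """SNR in dB: variance of the image versus variance of the noise."""
    var_signal = np.mean((I_true - I_true.mean()) ** 2)
    var_noise = np.mean((I_true - I_noisy) ** 2)
    return 10 * np.log10(var_signal / var_noise)

def psnr_db(I_true, I_noisy):
    """PSNR in dB: squared intensity range versus mean squared error."""
    peak = I_true.max() - I_true.min()
    mse = np.mean((I_true - I_noisy) ** 2)
    return 10 * np.log10(peak ** 2 / mse)
```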

- treat each pixel independently: \(u_{ij} = f(I_{ij})\)
- example: change brightness, \(u(x, y) = I(x, y) + b\) (clipping needed)

- \(u_{ij} = \alpha I_{ij}\)

- camera chips have different response curves, \(I' \propto I^\gamma\)
- correction: \(f(I(x, y)) = I_{max}(\frac{I(x, y)}{I_{max}})^\frac{1}{\gamma}\), \(\gamma > 0\)
- range \([0, I_{max}]\) not affected
- Dark areas become brighter without leading to oversaturation in the brighter areas

- histogram: the number of pixels that have a certain gray value
- histogram equalization: transform the image such that all gray values are equally frequent (a sketch follows below)
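
A sketch of histogram equalization for an 8-bit image, mapping gray values through the normalized cumulative histogram (the standard construction; exact binning conventions vary):

```python
import numpy as np

def equalize_histogram(I):
    """Map gray values through the normalized cumulative histogram so the
    resulting gray values become approximately equally frequent (I: uint8)."""
    hist = np.bincount(I.ravel(), minlength=256)   # gray value histogram
    cdf = hist.cumsum() / hist.sum()               # cumulative, normalized to [0, 1]
    lut = np.round(255 * cdf).astype(np.uint8)     # lookup table for all 256 values
    return lut[I]
```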

- Gray: \(I_\Delta = |I_1 - I_2|\)
- Color: \(I_\Delta = \frac{1}{3} \sum_{k=1}^{3} |I_{k, 1} - I_{k, 2}|\)

- improve the signal by taking neighboring points into account
- filters defined in the Fourier domain (global Frequency analysis on all pixels)
- **convolution theorem**: translates filtering in the Fourier domain to the spatial domain: \(F(f) \cdot F(h) = F(f * h)\)
- filters are usually defined in the spatial domain (\(h\) instead of \(F(h)\)): more interest in local analysis, easier to generalize to nonlinear filters, better behavior at boundaries

- With static filter \(h\): linear operation. \((f * h)(x) = \int h(-x')f(x + x')dx'\)
- in 2D: \((I * h)(x, y) = \int h(-x', -y')I(x + x', y + y')dx'dy'\)
- properties:
- linearity: \((\alpha f + \beta g) * h = \alpha (f * h) + \beta (g * h)\)
- shift invariance: \(f(x) * g(x + \delta) = (f * g)(x + \delta)\)
- commutativity \(f * g = g * f\)
- associativity \((f * g) * h = f * (g * h)\)
- correlation (for comparison): \((f \star h)(x) = \int h(x')f(x + x')dx'\)

- 1D: \((f * h)_i = \sum^n_{k=1} f_k h_{i-k}\)
- 2D: \((f * h)_{i,j} = \sum^{n, m}_{k=1,l=1} f_{k, l} h_{i-k, j-l}\)
- Separability: A filter is separable if \(h(x, y) = \delta(x, y) * h_1(x, \cdot) * h_2(\cdot, y)\), where \(\delta\) is the Dirac distribution: \(\delta(x) = \begin{cases}\infty & x = 0\\ 0 & \text{else}\end{cases}\) with \(\int \delta(x) dx = 1\)
- Separable filters can be applied successively as 1D convolutions, which reduces the runtime complexity from \(O(NMnm)\) to \(O(NM(n+m))\)

- Kernel: \(G_\sigma(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp(-\frac{x^2}{2\sigma^2})\)
- separable: \(I'(x, y) = I(x, y) * G(x,\cdot) * G(\cdot, y)\)
- the discrete stencil is derived by sampling on the interval \([-3\sigma, 3\sigma]\)
- runtime complexity \(O(NM\sigma)\) (Fourier domain: \(O(NM \log(NM))\)); a separable implementation is sketched below
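
A sketch of separable Gaussian smoothing with a stencil sampled on \([-3\sigma, 3\sigma]\); note that `np.convolve` with `mode='same'` zero-pads, which corresponds to Dirichlet rather than the preferred Neumann boundaries:

```python
import numpy as np

def gaussian_kernel(sigma):
    """Sample the 1D Gaussian on [-3*sigma, 3*sigma] and normalize to sum 1."""
    r = int(np.ceil(3 * sigma))
    x = np.arange(-r, r + 1)
    g = np.exp(-x ** 2 / (2 * sigma ** 2))
    return g / g.sum()

def gaussian_smooth(I, sigma):
    """Separable smoothing: two 1D convolutions instead of one 2D convolution."""
    g = gaussian_kernel(sigma)
    out = np.apply_along_axis(np.convolve, 1, I, g, mode='same')   # along x
    out = np.apply_along_axis(np.convolve, 0, out, g, mode='same') # along y
    return out
```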

- the boundary of \(\Omega\) is \(\partial \Omega\)
- Dirichlet boundary conditions: \(I(x, y) = 0 \space \forall (x, y) \in \partial \Omega\)
- homogeneous Neumann boundary conditions: \(\frac{\partial}{\partial n}I(x, y) = 0 \space \forall (x, y) \in \partial \Omega\) with \(n\) the outer normal vector and \(\frac{\partial I}{\partial n} = n^T\nabla I\) (**mirror the image** at the boundary). Preferred method (preserves the average intensity)

- alternative to Gauss filter
- Kernel: \(B_\sigma(x) = \begin{cases}\frac{1}{2\sigma} & |x| < \sigma \\ 0 & \text{else}\end{cases}\)
- simple averaging of neighboring values
- properties:
- separable
- not rotationally invariant
- much more efficient \(O(NM)\)
- result not as smooth
- successive convolution of box filter with itself yields filters that resemble the Gauss filter
- smoothing result of filter of order \(k\) is \(k\) times differentiable

- (fast) alternative to Gauss filter
- Idea: recursively propagate information in both directions of the signal
- \(\alpha\) smoothness parameter
- approximates Gaussian convolution
- relation between \(\sigma\) and \(\alpha\): \(\alpha \sigma = \frac{5}{2 \sqrt{\pi}}\)
- properties:
- separable
- rotationally invariant
- \(O(NM)\)
- hard/impossible to implement Neumann boundary conditions

- measuring the local change of intensities
- continuous functions: given by the derivative \(\frac{df}{dx}\)
- multidimensional functions: \(\nabla I = (\frac{\partial I}{\partial x}, \frac{\partial I}{\partial y})^T\) (gradient)
- alternative notation: \(\nabla I = (I_x, I_y)^T\)
- image must be differentiable

- ensure differentiability by combining with a small amount of Gaussian smoothing (**Gaussian derivatives**)
- convolution and differentiation are linear operations (applying the derivative to the image or to the Gaussian does not matter)
- \(\frac{\partial}{\partial x}(I(x) * G_\sigma(x)) = I(x) * \frac{\partial}{\partial x} G_\sigma(x)\)
- arbitrarily high order derivatives exist
- sample the derivative of the Gaussian filter for the implementation
- a common mask for the first derivative is \((-\frac{1}{2}, 0, \frac{1}{2})\) (**central difference**)

- higher derivatives than first derivative
- e.g. the Laplace filter: zero-crossings of this filter correspond to image edges
- \(\Delta I = \frac{\partial^2 I}{\partial x^2} + \frac{\partial^2 I}{\partial y^2}\)

- Another way to detect edges is by the gradient magnitude
- \(|\nabla I| = \sqrt{I^2_x + I^2_y}\) (a discrete sketch follows below)
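
A sketch of the gradient magnitude with the central-difference mask \((-\frac{1}{2}, 0, \frac{1}{2})\), assuming grid size \(h = 1\); in practice one would combine this with a small amount of Gaussian smoothing first (Gaussian derivatives):

```python
import numpy as np

def gradient_magnitude(I):
    """I_x, I_y by central differences (borders left at zero), then |grad I|."""
    I = I.astype(float)
    Ix = np.zeros_like(I)
    Iy = np.zeros_like(I)
    Ix[:, 1:-1] = 0.5 * (I[:, 2:] - I[:, :-2])   # d/dx with mask (-1/2, 0, 1/2)
    Iy[1:-1, :] = 0.5 * (I[2:, :] - I[:-2, :])   # d/dy
    return np.sqrt(Ix ** 2 + Iy ** 2)
```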

- formalize the model assumptions and cast the task as \(E(x) = A_1(x) + ... + A_n(x)\) (**energy/loss function**)
- solve \(x^* = \text{argmin}_x E(x)\) with \(x \in \Re^N\)

- model assumptions: similar to input image, result should be smooth
- formalize:
- similar: \(E_D(u_{i,j}) = \sum_{i,j} (u_{i,j} - I_{i,j})^2\)
- smooth: \(E_S(u_{i, j}) = \sum_{i,j} (u_{i, j+1} - u_{i, j})^2 + (u_{i+1,j} - u_{i, j})^2\)

- problem to solve: \(u^* = \text{argmin}_u (E_D(u) + \alpha E_S(u))\) (\(\alpha\) is the smoothness weight)

- transparency
- optimality
- theoretical aspects can be analysed
- Existence and uniqueness of solutions
- Stability of solutions with respect of the input data
- Difficulty of the problem class

- fewer parameters than in heuristic multi-step methods
- combination of approaches is easy

- Formalizing is ad-hoc (use statistical interpretations)
- choosing weighting parameter(s) not easy (can be learned)
- global optimization is hard

- **necessary condition**: \(\frac{\partial E}{\partial u} = 0 \Leftrightarrow \frac{\partial E}{\partial u_{i, j}} = 0 \space \forall i, j\)
- compute \(\frac{\partial E}{\partial u_{i, j}}\)
- write as a system of linear equations (matrix of size \(N \times N\) for \(N\) pixels, symmetric and positive definite)
- contains one main diagonal (central pixels) and four off-diagonals (one for each of the four neighbors of a pixel)
- the matrix is **sparse** (almost all entries are 0)
- since it is positive definite, the inverse \(A^{-1}\) exists and we can solve for \(u\) (a sketch follows below)
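
A sketch of assembling and solving this sparse system for the quadratic denoising energy above, assuming SciPy is available; the 2D matrix is built from 1D second-difference matrices with mirrored (Neumann) boundaries via Kronecker products:

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def second_diff_1d(n):
    """1D second-difference matrix with homogeneous Neumann boundaries."""
    L = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n)).tolil()
    L[0, 0] = L[-1, -1] = 1   # boundary rows: only one neighbor (mirroring)
    return L

def denoise_quadratic(I, alpha):
    """Solve dE/du = 0, i.e. (Id + alpha * L) u = I, for the quadratic energy."""
    N, M = I.shape
    L = sp.kron(second_diff_1d(N), sp.identity(M)) + \
        sp.kron(sp.identity(N), second_diff_1d(M))
    A = sp.identity(N * M) + alpha * L
    u = spla.spsolve(A.tocsr(), I.ravel().astype(float))
    return u.reshape(N, M)
```

Since \(A\) is symmetric and positive definite, an iterative solver such as `scipy.sparse.linalg.cg` is the usual choice for large images instead of the direct solve.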

- Convex
- Positive curvature
- No local minima
- Global minimum is unique
- Minimization by setting the derivative to 0
- Non-convex functions
- many local minima and maxima
- Global minimum may not be unique
- Global minimization is usually impossible, only heuristics exist

A function is **convex** if, for all \(\alpha \in [0, 1]\),

\(f((1 - \alpha)x_1 + \alpha x_2) \leq (1 - \alpha)f(x_1) + \alpha f(x_2)\)

A function is **strictly convex** if, for all \(\alpha \in (0, 1)\) and \(x_1 \neq x_2\),

\(f((1 - \alpha)x_1 + \alpha x_2) < (1 - \alpha)f(x_1) + \alpha f(x_2)\)

- every nonnegative linear combination of (strictly) convex functions is again (strictly) convex

- iterative solver, preserves sparsity of the matrix
- converges if the matrix is **strictly diagonally dominant**: \(|a_{ii}| > \sum_{j \neq i} |a_{ij}| \space \forall i\)
- decompose the matrix into \(A = D + M\) (\(D\): diagonal part, \(M\): off-diagonal part)
- for linear system: \(Ax = b \Leftrightarrow (D + M)x = b \Leftrightarrow Dx = b - Mx\)
- \(D^{-1}\): replace diagonal elements with inverse
- compute the solution \(x\) iteratively, starting at any \(x_0\): \(x_{k+1} = D^{-1}(b - Mx_{k})\)
- iterate until the **residual** \(r_k = Ax_k - b\) is smaller than a threshold or the change \((x_{k+1} - x_{k})^2\) becomes small; if it is 0, the iterate has **converged**
- properties:
- simple, parallelizable, few requirements
- slow
- convergence only for \(k \rightarrow \infty\)
- computation not in place
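
A dense-matrix sketch of the Jacobi iteration (for illustration only; a real implementation would exploit the sparsity of \(A\)):

```python
import numpy as np

def jacobi(A, b, tol=1e-8, max_iter=10000):
    """Jacobi iteration x_{k+1} = D^{-1} (b - M x_k) with A = D + M."""
    x = np.zeros_like(b, dtype=float)
    d = np.diag(A)              # diagonal part D (as a vector)
    M = A - np.diag(d)          # off-diagonal part
    for _ in range(max_iter):
        x_new = (b - M @ x) / d
        if np.linalg.norm(x_new - x) < tol:   # change became small: stop
            return x_new
        x = x_new               # not in place: old and new iterate coexist
    return x
```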

- \(Ax = b \Leftrightarrow Dx = b - Lx - Ux\): split the off-diagonal part \(M\) into the lower triangle \(L\) and the upper triangle \(U\)
- iterate, traversing the vector \(x\) from top to bottom and using the new values for the multiplication with the lower triangle: \(x_{k+1} = D^{-1} (b - Lx_{k+1} - Ux_{k})\)
- Converges if \(A\) is positive or negative definite
- properties:
- in place computation
- recursive propagation of information: faster

- accelerate Gauß-Seidel by over-relaxing the solution: \(x_{k+1} = (1 - \omega)x_k + \omega D^{-1} (b - Lx_{k+1} - Ux_{k})\) (\(\omega = 1\) is Gauß-Seidel; a sketch follows below)
- Converges for positive- or negative definite matrices (all eigenvalues positive or negative), if \(\omega \in (0, 2)\)
- Over-relaxation for \(\omega > 1\) faster convergence
- Under-relaxation for \(\omega < 1\) can help establish convergence in case of divergent iterative processes
- optimal \(\omega\) must be determined empirically
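
A sketch of SOR (and, for \(\omega = 1\), Gauss-Seidel): the loop over \(i\) uses already-updated entries of \(x\), i.e. the computation is in place:

```python
import numpy as np

def sor(A, b, omega=1.5, tol=1e-8, max_iter=10000):
    """Successive over-relaxation; omega = 1 reduces to Gauss-Seidel."""
    n = len(b)
    x = np.zeros_like(b, dtype=float)
    for _ in range(max_iter):
        x_old = x.copy()
        for i in range(n):
            # sum over all j != i; entries j < i are already updated (in place)
            s = A[i] @ x - A[i, i] * x[i]
            x[i] = (1 - omega) * x[i] + omega * (b[i] - s) / A[i, i]
        if np.linalg.norm(x - x_old) < tol:
            break
    return x
```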

- two non-zero vectors \(u\) and \(v\) are **conjugate** with respect to \(A\) if the inner product \(\langle u, v\rangle_A := u^T A v = 0\) (vectors orthogonal with respect to a special scalar product)
- set of \(n\) conjugate vectors \(\{p_k\}\) form basis of \(\Re^n\), so solution \(x^*\) of \(Ax = b\) can be expanded as \(x^* = \alpha_1 p_1 + ... + \alpha_n p_n\)
- coefficients \(\alpha_k\) are derived
- after \(n\) computations we obtain exact solution \(x^*\)
- Good choice of \(\{p_k\}\): few coefficients approximate solution well
- Start at some initial point \(x_0\)
- Let \(p_0\) be the residual \(r_0 = b - Ax_0\). This is the negative gradient of \(E(x) = \frac{1}{2}x^TAx - b^T x\), whose minimizer is \(x^*\) (hence the name CG)
- Iteratively compute parameters
- Stop when residual is small. Guaranteed solution after \(n\) iterations
- the matrix \(A\) must be symmetric and positive definite
- computing exact solution is not an option as \(n\) is the number of pixels
- the number of iterations needed for a good solution depends on the **condition number** of \(A\) (largest vs. smallest eigenvalue); the same holds for other iterative methods
- **preconditioners** \(P^{-1}\) are used to obtain a small condition number for \(P^{-1}A\): \(Ax = b \Leftrightarrow P^{-1}Ax = P^{-1}b\)
- most simple preconditioner: the Jacobi preconditioner

- previous linear solvers: only act locally
- due to sparsity of matrix, distributes information at a pixel to four neighbors
- idea of multigrid solvers: shorten distances by using coarser point of view
- additional effect: fewer entries, iterations are faster at coarse levels

- downsample image, create linear system from this
- compute an approximate solution at the coarse grid (e.g. SOR)
- take upsampled result as init guess for next finer grid
- Refine (e.g. SOR)
- properties:
- iterations at coarse level fast
- simple implementation
- coarse levels do not approximate original system well

- do not downsample image but error
- compute first solution at fine grid
- correct error at coarse grid
- refine result at finer grid

- combine cascading and correcting multigrid
- start at coarse grid with downsampled image
- at each finer level, apply a W-cycle

- considering problems with continuous variables
- segmentation and matching problems usually lead to integer problems
- these are combinatorial problems, few are solvable in polynomial time
- typical problems: linear programs, integer quadratic programs, second-order cone programs

- continuous formulation: \(u^*(x) = \text{argmin}_{u(x)}E(u(x))\) with \(u(x) : \Omega \rightarrow \Re\) (**energy functional**); optimization is based on the **calculus of variations**
- necessary condition(s) for a minimum: **Euler-Lagrange equations**
- advantage: consistency with the continuous model can be ensured (for a discrete energy there is no assurance that it is **consistent** with the continuous one, e.g. when measuring line lengths)
- disadvantage: gradients with respect to a function have to be computed

- discretization of continuous quantities can lead to artifacts if done wrong
- the solution of the problem should be independent of rotation (**rotation invariance**)
- **a discretization is called consistent if it converges to the continuous model for finer grid sizes**
- example: discrete approximation of the gradient magnitude; the approximation errors must vanish with the grid size

Calculus of variation and **Gâteaux derivative** to compute gradient of a functional \(\frac{\delta E(u(x))}{\delta u(x)}\) with respect to a function \(u(x)\)

- functional maps each element \(u\) of a vector space (e.g. a function) to a scalar \(E(u)\)
- the Gâteaux derivative generalizes the directional derivative to infinite-dimensional vector spaces: \(\frac{\delta E(u(x))}{\delta u(x)} \Big|_h = \lim_{\epsilon \rightarrow 0} \frac{E(u + \epsilon h) - E(u)}{\epsilon} = \frac{dE(u + \epsilon h)}{d \epsilon}\Big|_{\epsilon = 0}\)
- can be interpreted as projection of the gradient to the direction \(h\): \(\frac{\delta E(u(x))}{\delta u(x)} \Big|_h = \Bigg\langle\frac{dE(u)}{du}, h \Bigg\rangle = \int \frac{dE(u)}{du}(x)h(x)dx\)
- gradient \(\frac{dE(u)}{du}(x)\) is needed to minimize \(E(u)\)

\(E(u(x)) = \int_\Omega (u(x) - I(x))^2 + \alpha|\nabla u(x)|^2 dx\)

- rather than on a vector, optimization problem is on a function
- short writing: dependency of functions on \(x\) is clear: \(E(u) = \int_\Omega (u - I)^2 + \alpha|\nabla u|^2 dx\)
- condition for minimum: \(\frac{\delta E(u)}{\delta u} \Big|_h = 0 \qquad \forall h\)
- compute the Gâteaux derivative and simplify
- replace every \(u\) by \(u + \epsilon h\)
- differentiate with respect to \(\epsilon\) (at \(\epsilon = 0\))
- integration by parts to turn \(h_x\) and \(h_y\) into \(h\)
- directional derivative must be \(0\) for all directions
- yields two conditions
- gradient must be zero: Euler-Lagrange equation set to \(0\)
- boundary conditions must be \(0\) (**natural boundary conditions**, which coincide with the **Neumann boundary conditions**)

- Discretization of \(\frac{dE(u)}{du} = (u - I) - \alpha (u_{xx} + u_{yy}) = 0\) leads to a linear system of equations \(\frac{\delta E}{\delta u_{i, j}} = (u_{i, j} - I_{i, j}) - \alpha (u_{i-1,j} + u_{i,j-1} - 4 u_{i, j} + u_{i+1, j} + u_{i, j+1}) = 0\)
- solving for the minimizer of the discrete energy: the same linear solvers can be applied
- different discretization stages do not always lead to same algorithm

- a quadratic regularizer leads to blurred edges
- images are not smooth everywhere (**outliers** at edges)
- edge preservation can be achieved by using non-quadratic regularizers
- statistical interpretation of regularization

- energy minimization can also emerge from a probabilistic approach taking into account likelihoods and prior probabilities
- Bayes' formula: \(p(u|I) = \frac{p(I|u)p(u)}{p(I)}\)
- find the solution \(u\) that maximizes \(p(u|I)\) (**maximum a-posteriori (MAP)** approach)
- the marginal \(p(I)\) does not depend on \(u\) and can be ignored in the maximization

- MAP estimation: \(p(u|I) \propto p(I|u)p(u) \rightarrow \max\)
- noise model: independently and identically distributed Gaussian noise: \(p(I|u) \propto \prod_{x \in \Omega} \exp(-(u(x) - I(x))^2)\)
- a-priori model: the gradient is likely to be small (smoothness); Gaussian noise with variance \(\frac{1}{2\alpha}\) on the gradient: \(p(u) \propto \prod_{x \in \Omega} \exp(-\alpha|\nabla u(x)|^2)\)
- turn the maximization into a minimization of the negative logarithm (monotonic), yielding the energy minimization problem \(-\log p(I|u) - \log p(u) = \int_{\Omega} (u-I)^2 + \alpha|\nabla u|^2 \, dx \rightarrow \min\)

- other noise models lead to different penalizers in the energy function
- expecting outliers in the smoothness assumption, replace the Gaussian noise model with a Laplace distribution
- leads to an energy function that preserves image edges
- Tikhonov regularizer \(\Psi (s^2) = s^2\)

- Euler-Lagrange equation is nonlinear in the unknowns
- nonlinear system of equations: solved iteratively with **gradient descent**
- start with initial value
- iteratively move in direction of largest decrease in energy (negative gradient direction)
- converges to next local minimum
- if the energy is convex (a nonnegative linear combination of convex terms is again convex), gradient descent finds the global optimum
- slow convergence

- keep the nonlinear prefactor fixed (regaining a linear system) and compute updates iteratively
- proven to converge if linear system in each step is solved exactly
- only few iterations, much faster than gradient descent

- Given a sequence of images \(I(x, y, t)\), what is the motion of each pixel between subsequent frames
- vector field \((u, v)(x, y, t)\) that transforms the second image into the first one
- **optical flow**: move each point \((x, y)\) at time \(t\) such that it fits the point at time \(t + 1\)
- can be used for detection or segmentation of objects (**motion segmentation**)
- **scene flow**: estimate the 3D motion of objects; **structure from motion**: estimate the 3D structure
- the apparent motion in the image plane (**optic flow**) differs from the true motion
- ambiguities: the flow cannot be determined from the image data alone; any black pixel could be moved to any black pixel
- assumptions are needed for a unique solution; the neighborhood (**aperture**) determines the solution

- Gray value of moving pixel stays constant: \(I(x + u, y + v, t + 1) - I(x, y, t) = 0\)
- Constraint nonlinear in unknown flow: difficult estimation
- Linearize with Taylor expansion: \(I(x + u, y + v, t + 1) = I(x, y, t) + I_xu + I_yv + I_t + O(u^2, v^2)\)
- linearized **optic flow constraint**: \(I_xu + I_yv + I_t = 0\)
- only sufficiently precise if the images are smooth and the flow is small

- regarding a single pixel, we cannot uniquely determine its flow vector
- we need a second assumption to obtain a unique solution (two unknowns, one equation)
- assume that motion is constant within a local neighborhood
- results in multiple equations for the two unknowns: \(I_x(x', y', t)u + I_y(x',y', t)v + I_t(x',y',t) = 0 \qquad \forall (x',y') \in \mathcal{N}(x, y)\)
- over-determined system: in general there is no exact solution anymore
- we seek the solution that minimizes the total squared error: \(\text{argmin}_{u, v} \sum_{(x, y) \in \mathcal{N}}(I_x(x, y)u + I_y(x, y)v + I_t(x, y))^2\)
- condition for minimum is derivative to \(u\) and \(v\) are both zero
- linear system at each pixel: \(\begin{bmatrix}\sum I^2_x & \sum I_xI_y \\ \sum I_xI_y & \sum I^2_y\end{bmatrix} \begin{bmatrix}u \\ v\end{bmatrix} = \begin{bmatrix}- \sum I_xI_t \\ - \sum I_y I_t\end{bmatrix}\)
- instead of a uniformly weighted, box-shaped window, we can also use a Gaussian window
- the weighted sums over the window amount to a convolution
- unique solution only if system is not singular
- local neighborhood must contain non-parallel gradients
- otherwise aperture problem still present
- assumption of locally constant motion not realistic (boundaries)
- there are no direct constraints on the smoothness of the flow field
- advantages: fast and simple
- local method: point estimates are computed independently
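
A sketch of the Lucas-Kanade estimate at a single pixel, assuming two grayscale frames `I1`, `I2` and a square window of radius `r`; the 2x2 system is singular when all gradients in the window are parallel (aperture problem):

```python
import numpy as np

def lucas_kanade_at(I1, I2, y, x, r=7):
    """Solve the 2x2 linear system for the flow (u, v) in a (2r+1)^2 window."""
    I1 = I1.astype(float); I2 = I2.astype(float)
    Ix = 0.5 * (np.roll(I1, -1, axis=1) - np.roll(I1, 1, axis=1))  # central diff
    Iy = 0.5 * (np.roll(I1, -1, axis=0) - np.roll(I1, 1, axis=0))
    It = I2 - I1                                                   # temporal derivative
    win = np.s_[y - r:y + r + 1, x - r:x + r + 1]
    ix, iy, it = Ix[win].ravel(), Iy[win].ravel(), It[win].ravel()
    S = np.array([[ix @ ix, ix @ iy],
                  [ix @ iy, iy @ iy]])
    rhs = -np.array([ix @ it, iy @ it])
    return np.linalg.solve(S, rhs)   # flow vector (u, v)
```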

- global dependency between points due to global smoothness assumption
- global solution derived with variational methods
- basic model: Horn-Schunck method
- can be further extended to yield highly accurate estimates in many practical situations

- assumes gray value constancy too
- global smoothness of the flow field: \(|\nabla u|^2 + |\nabla v|^2 \rightarrow \min\)
- can be combined to energy functional, sought optical flow is minimizer of this energy
- energy is convex: we get a unique global optimum
- Use Gâteaux derivatives to obtain Euler-Lagrange equations
- discretized versions return linear equation system to solve, can be written in matrix-vector notation
- matrix is sparse with only few diagonals different from zero
- block structure (**data term** and **smoothness term**) due to the coupling of \(u\) and \(v\)
- can be solved with linear solvers (Gauss-Seidel, SOR, multigrid)

- illumination changes will “distract”
- motion discontinuities are blurred
- underestimation of large displacements
- outliers due to non-Gaussian noise or occlusion

- accuracy of optical flow: **average angular error**
- requires **ground truth**
- measures the 3D angle between the correct and the estimated flow vector
- also captures errors in the length of the 2D flow vector

- consider a **robust function** applied to the smoothness term to allow for motion discontinuities
- occlusions can be approached by applying a robust function to the data term
- the nonlinear system can be solved with **lagged nonlinearity**

- the gray value constancy assumption is not fulfilled due to illumination changes
- remedy: matching criterion that is invariant to illumination changes
- the gradient is invariant to additive changes, use that as alternative constraint

- the linearization of the constancy assumptions is only valid for small displacements (subpixel)
- to tackle larger displacements, we must refrain from such a linearization
- difficulties: non-convex, Euler-Lagrange equations get highly nonlinear
- can be handled with Gauss-Newton method and coarse-to-fine approach

- Partition of the image domain \(\Omega\) into several (usually disjoint) regions \(\Omega_i\)
- special case: two-region segmentation (foreground-background)
- difference to **edge detection**: closed contours are required
- edge detection is part of the solution, as it provides pieces of potential contours
- exponentially many possibilities
- hierarchical decomposition is ideal (**object segmentation**)
- impossible without prior knowledge
- can be seen as a grouping process combining pixels into **superpixels**

- intensity, color, texture, motion, disparity, depth
- **first-order features** vs. **second-order features**
- first-order: provided by the sensor
- second-order: derived from the data

- edge-based techniques employ an **edge indicator** or **pairwise** similarities between pixels; fit closed contours such that the contour coincides with strong edges (low similarities)
- region-based techniques employ a **statistical model** of each region; regions should be as homogeneous as possible, while pixels in different regions should be maximally different
- region-based techniques based on local statistics approach edge-based techniques in the limit

- most simple way to arrive at a segmentation
- convert intensity image into a binary image to label the two regions
- can be generalized to \(N\) regions by introducing \(N - 1\) thresholds
- Problems
- no threshold that separates the objects
- context is ignored

- region-based with very simple region model

- segmentation is similar to clustering: assign pixels to regions
- example: **k-means clustering**
- initialize: all pixels belong to a random region
- mean feature vector for each region
- move pixel to another region if this decreases the total distance
- iterate until pixels do not move any longer
- for \(K = 2\), k-means is like thresholding with an automatically determined threshold
- clustering ignores spatial context: only the features determine the assignment, not the position (see the sketch below)
- adding pixel coordinates to feature vector is not equivalent to enforcing smooth region contours
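
A sketch of k-means on the intensity feature alone (a hypothetical helper; note that the pixel position plays no role in the assignment, which illustrates the missing spatial context):

```python
import numpy as np

def kmeans_segmentation(I, K=2, iters=20, seed=0):
    """Iteratively assign each pixel to the region with the nearest mean
    intensity and recompute the means; K = 2 acts like auto-thresholding."""
    rng = np.random.default_rng(seed)
    f = I.ravel().astype(float)
    means = rng.choice(f, size=K)                         # random initialization
    for _ in range(iters):
        labels = np.argmin(np.abs(f[:, None] - means[None, :]), axis=1)
        for k in range(K):
            if np.any(labels == k):                       # avoid empty regions
                means[k] = f[labels == k].mean()
    return labels.reshape(I.shape)
```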

- **Region growing**: **seed points** are the initial regions
- for each point on the boundary: if a neighbor is similar enough, it is assigned to the region
- grow until there are no more similar pixels along the boundary
- **Region merging**: initially, every pixel is its own region
- the two most similar regions are successively merged into a larger region
- repeat until a similarity threshold or a given number of regions is reached
- some dissimilarity criteria:
- Euclidean distance of the feature means: \(d^2(R_1, R_2) = (\mu_1 - \mu_2)^2\)
- mean Euclidean distance along the common boundary

- illustrative description: regard the **image gradient magnitude** as mountains and let it rain
- water flows downhill and gathers in catchment basins: the regions
- regions meet at the watersheds (edges)
- tiny gradient fluctuations result in over-segmentation (pre-smooth the image!)

- use energy functional \(E(C) = - \int_0^1 |\nabla I(C(s))|^2 ds + \alpha \int_0^1 |C_s(s)|^2 ds\)
- \(C : [0, 1] \rightarrow \Omega\) is parametric contour. \(C_s\) denotes first derivative
- first term: **external energy** (depends on the input image)
- second term: **internal energy** (depends only on the contour model, not on the data)
- minimizing the external energy drives the contour to follow maxima of the gradient
- consider all possible contours and choose the one that best captures the maxima
- the external energy alone would lead to a fractal contour with infinite length
- this is avoided by the internal energy, which penalizes the length of the contour
- together: prefer a compromise of a short contour which captures as much image gradient as possible
- with calculus of variation and gradient descent: local minima

- **indicator function** \(\phi : \Omega \rightarrow [-1, 1]\)
- the zero-level line represents the contour: \(C = \{x \in \Omega \,|\, \phi(x) = 0\}\)
- to evolve \(C\), evolve \(\phi\)
- allows topological changes, can be applied in any dimension
- the **level set** function represents both the contour and the enclosed region

- energy minimization based on region statistics
- states the optimal separation of pixel intensities
- similar to k-means clustering (two-means), **but with an additional constraint on the length of the separating contour**
- expressed using an implicit representation of the contour; the region means can be found analytically as the mean intensities inside the two regions

- combinatorial optimization
- structure for **min-cut**:
- each pixel is represented by a **node**, connected to its neighbors via **edges**
- neighborhoods can have different complexity (most simple: 4-neighborhood)
- two extra nodes (source and target) connect to all pixel nodes

- goal: find minimum cut through graph that separates source and target
- source and target nodes correspond to regional models
- connection between neighbors enforces a regular labeling
- can be formulated as energy problem
- two terms:
- first term: comprises the links to the source and target nodes
- second term: comprises the **n-links** between neighboring pixels; usually fixed weights, but they can depend on the image gradient

- for fixed means \(\mu_1, \mu_2\) it can be solved in polynomial (avg. case: linear) time

- run classifier which yields a score \(S_i(x)\) for each class \(i\)
- minimize with level sets or graph cuts

- matching of local structures

- straightforward way to match point in images
- regard the image patch around each point in image 1
- compare it to the image patches around all points in image 2
- expensive: \(O(kN^2)\) with \(k\) the patch size and \(N\) the number of pixels in the image
- not invariant to typical appearance changes

- only limited number of matches needed
- idea: do not match all points, but only promising subsets
- requirements
- point must come with enough information for unique matching
- subset in image 2 must contain matches from subset in image 1

- choose points with high information content and clear localization (corner points)
- with structure tensor, measure cornerness (fast to compute)
- eigenvalue decomposition of the structure tensor
- smoothing integrates gradients from the neighborhood
- eigenvectors yield the dominant orientation in this neighborhood and perpendicular orientation
- eigenvalues yield the structure magnitude in these directions
- a large second eigenvalue indicates strong structures in multiple directions (corners)
- corner: local maxima of second eigenvalue
- problem: depends on image scale
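
A sketch of the cornerness measure, assuming SciPy is available; the structure tensor is integrated here with a simple box window, and the smaller eigenvalue of the 2x2 tensor is computed in closed form:

```python
import numpy as np
from scipy.signal import convolve2d

def cornerness(I, r=3):
    """Smaller eigenvalue of the smoothed structure tensor at every pixel;
    large values indicate structure in multiple directions (corners)."""
    I = I.astype(float)
    Ix = 0.5 * (np.roll(I, -1, axis=1) - np.roll(I, 1, axis=1))
    Iy = 0.5 * (np.roll(I, -1, axis=0) - np.roll(I, 1, axis=0))
    box = np.ones((2 * r + 1, 2 * r + 1)) / (2 * r + 1) ** 2
    smooth = lambda J: convolve2d(J, box, mode='same', boundary='symm')
    Jxx, Jxy, Jyy = smooth(Ix * Ix), smooth(Ix * Iy), smooth(Iy * Iy)
    tr = Jxx + Jyy                      # trace and determinant give the
    det = Jxx * Jyy - Jxy ** 2          # eigenvalues of the 2x2 tensor
    return tr / 2 - np.sqrt(np.maximum((tr / 2) ** 2 - det, 0))
```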

- consider a **Gaussian pyramid** of smoothed images
- the characteristic scale can be computed based on the **Laplacian**
- yields an estimate of the scale shift between two images
- uniqueness is not ensured:
- there may be multiple local maxima
- even the global maximum need not be unique
- the maximum operator is not robust; little noise can lead to a different outcome

- alternative to Harris-Laplace detector
- considers local maxima of Laplacian scale space
- advantage: does not mix apples and oranges (corner detector and Laplacian)
- Laplacian focuses on blobs rather than corners (complementary information, one might be interested in both)

- positive issue of interest points
- significantly reduced complexity
- with 100 detected points in both images, one has to compare only 10000 patches instead of 96 billion in a 640x480 image

- negative issues
- non-dense displacement fields (important matches might be missed)
- corresponding patches can be slightly shifted

- further problems (independent of interest points)
- patches in both images may look very different due to rotation, transformations, lighting changes, blurring
- subpixel accuracy not available

- **local descriptors** are vectors that contain information about the local neighborhood of an interest point
- simplest local descriptor: a block of certain size centered at the interest point
- goal: design local descriptors that are **invariant** under the mentioned transformations

- use characteristic scale from interest point detection
- choose and normalize the size of the blocks: structures have same scale in both images

- affine transformation: \(f(x) = Ax + t\)
- maps a circle to an ellipse (or vice-versa)
- approximation of a projective transformation
- parameters can be estimated (region detector, maximally stable extremal regions)

- maximally stable extremal regions
- encircled by large gradients
- obtained by watershed-like algorithm
- apart from scale also yields elongation of fitted ellipses
- affine invariance

- alternative to normalized neighborhood: derive invariant features within the fixed block
- gray value histogram: rotational invariance, invariant to blurring
- sensitive to lighting changes (bad)
- loss of information (very bad)

- histograms of the gradient direction (**orientation histograms**)
- invariant to (additive) lighting changes
- building block of many successful descriptors

- based on local assembly of orientation histograms and adaptive local neighborhoods
- extract SIFT feature points
- strongest responses of Laplacian scale space
- fit quadratic function to obtain subpixel accuracy
- create orientation histogram at selected scale
- the peak of the smoothed histogram gives the estimated orientation
- in case of two peaks, create two feature points
- estimation of position, scale and orientation
- affine invariance provided with MSER
- object recognition: dense sampling of such points at all positions and scales, no rotation invariance

- compute gradient orientation and magnitude at each pixel
- compute orientation indicator at each pixel
- create \(N \times M \times 8\) array and init with zero
- quantize the orientation at each pixel and add respective magnitude to the respective entry in array

- local integration results in orientation histogram (smooth array with gaussian kernel)
- smooth in orientation direction
- sample feature vectors from the image
- normalize the feature vector
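
A sketch of the first step, building the \(N \times M \times 8\) orientation array (the spatial and orientation smoothing, sampling, and normalization described above would follow):

```python
import numpy as np

def orientation_array(I, bins=8):
    """At each pixel, add the gradient magnitude to the array entry of the
    quantized gradient orientation; all other entries stay zero."""
    I = I.astype(float)
    Ix = 0.5 * (np.roll(I, -1, axis=1) - np.roll(I, 1, axis=1))
    Iy = 0.5 * (np.roll(I, -1, axis=0) - np.roll(I, 1, axis=0))
    mag = np.sqrt(Ix ** 2 + Iy ** 2)
    ang = np.mod(np.arctan2(Iy, Ix), 2 * np.pi)          # orientation in [0, 2*pi)
    k = np.minimum((ang / (2 * np.pi) * bins).astype(int), bins - 1)
    N, M = I.shape
    A = np.zeros((N, M, bins))
    ii, jj = np.meshgrid(np.arange(N), np.arange(M), indexing='ij')
    A[ii, jj, k] = mag
    return A
```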

- descriptor for matching edge maps
- insensitive to small deformations and spurious edges

- CNNs are trained on large datasets with class labels
- network learns a representation that is good for object classification
- intermediate layer outputs turn out to be good generic descriptors

- CNN trained to discriminate **surrogate classes**
- applied transformations: translation, rotation, scaling, color, contrast, brightness, blur
- transformations define invariance properties of the features to be learned

- reconstruct the 3D structure of a scene
- missing cue is depth of an observed point
- any point along the projection ray that passes the camera center and the image point could have caused the observation
- thus from position of a single image point we cannot determine the coordinates of the corresponding 3D point
- there are many ways (hence **shape from X**)
- prominent: reconstruction from binocular images
- the center of the second camera must not coincide with that of the first camera

- need two corresponding image points (**disparity estimation**)
- reconstruct the two projection rays; their crossing is the sought 3D point
- the projection properties of the camera are needed
- they are comprised in the projection matrix \(P\) (**camera calibration**)

- add further cameras
- geometrically, only two cameras needed
- practice
- find corresponding points
- no correspondence for occluded points
- accuracy is limited by pixel grid

- more cameras decrease the noise in correspondences and total accuracy of depth estimates is increased
- drawback: multiple cameras are more difficult to calibrate

- single moving camera observes static 3D point
- advantages
- plenty of images from single camera
- reasonable frame rate: displacements are small

- problems
- no fixed, calibrated camera setup: estimate the camera motion (**egomotion**) together with the structure
- the scene must be static, otherwise ambiguities arise

- sometimes the camera motion is approximately known (other sensors!)

- factorize the projection matrix \(P = KM\)
- \(K\) is the **camera calibration matrix** and contains the **camera's internal parameters**: \(\begin{bmatrix}\alpha_x & s & x_0 \\ 0 & \alpha_y & y_0 \\ 0 & 0 & 1 \end{bmatrix}\)
- \(M = (R|t)\) is the pose of the camera relative to a world coordinate system: a rotation matrix and a translation vector, each with 3 degrees of freedom (**external parameters**)

- camera’s internal parameters are calibrated
- calibrate external parameters online (can be estimated from point correspondences)
- internal parameters now stay fixed
- in case of autofocus this is not the case; the internal parameters can then also be estimated online (**autocalibration**), but this decreases the robustness of the estimation
- when \(P = KM\) is known (fully calibrated), we can estimate the 3D structure

- robotics: simultaneous localization and mapping (SLAM)
- localization errors accumulate over time
- when the camera returns to an earlier position and the localization errors are small enough, the image can be matched to the earlier image (**loop closing**)
- a special case of object recognition (instance recognition)
- loop closing is the only way to correct accumulated errors (**drift**)

- correspondences are established for points whose projection rays hit the surface in a nearly orthogonal direction
- if the projection is tangential to the surface, structures on the surface are severely deformed (no robust matching)
- **shape from silhouette** is a kind of contrary approach: shape is inferred from points where the surface is tangential
- silhouette: restricts the volume that can be occupied by the observed object
- points along projection rays that see background cannot belong to the object volume
- points along projection rays that see foreground may or may not belong to the object volume
- with more cameras, the volume that can be occupied is successively reduced
- the remaining volume is the maximal volume that is consistent with all silhouettes (**visual hull**)
- helps find good initializations for other reconstruction methods (e.g. stereo)

- finding the silhouette in the image (segmentation) and reconstructing the surface can also be considered as a single 3D segmentation problem
- task: find a surface that consistently separates all images into a foreground and a background region

- given a single image, compute shape, albedo, scattering, illumination
- inverse rendering: reconstruct from the observed image
- severely under-constrained problem

- single light source and surface element
- direction of light in polar coordinates \((\theta, \phi)\) and incoming energy \(E(\theta, \phi)\)
- light emitted by surface in directing \((\theta_e, \phi_e)\) given by \(L(\theta_e, \phi_e)\)
- the function describing the reflectance properties of a material is called the **bidirectional reflectance distribution function (BRDF)**: \(L(\theta_e, \phi_e) = f(\theta, \phi, \theta_e, \phi_e)E(\theta, \phi)\)
- more detailed BRDF models can have more than four parameters

- a **Lambertian surface** emits incoming light uniformly in all directions; the **radiance** only depends on the angle between the surface normal \(n\) and the light source direction \(s\): \(I(x, y) = R(X, Y, Z) = \rho s^T n\)
- rough materials (stone, textiles, …)
- most of these materials absorb some incoming light; the non-absorbed fraction is called the **albedo** \(\rho\)
- model often used in computer vision (stereo matching)
- constant \(\rho\) and a given light source direction \(s\) lead to basic shape from shading

- sought depth map \(Z(x, y)\) can be related to surface normal direction \(n\). Estimate \(n\)
- the derivatives of the surface \(S(x, y) = (x, y, Z(x, y))\) with respect to \(x\) and \(y\) yield two vector fields that span the tangential plane: \(\partial_xS = \begin{bmatrix}1\\0\\p\end{bmatrix} \qquad \partial_yS = \begin{bmatrix}0\\1\\q\end{bmatrix}\)
- cross product of two vector fields yields the normal field \(n = \frac{1}{\sqrt{p^2+q^2+1}} \begin{bmatrix}-p\\-q\\1\end{bmatrix}\)
- plug into the Lambertian radiance function: as in optical flow, one equation, two unknowns

- assuming smooth surface, leads to variational approach
- derive Euler-Lagrange equations, minimizer for \(p\) and \(q\) can be found by gradient descent
- Energy is non-convex and can have multiple local minima. Coarse-to-fine methods can be used as heuristic to find global minimum

- recent framework by Barron and Malik considers more components of inverse rendering
- includes multiple sources of prior knowledge
- priors learned from training data
- reflectance is mostly smooth and histogram is sparse
- shape is mostly smooth
- isotropy of shapes and the fact that an observed surface is likely to point towards observer
- natural lighting prior

- series of images with the focus set to different depths
- estimate the relative blur \(\Delta \sigma\) needed to match the other image
- the amount of blur needed determines the depth

- if the appearance of the texture is known (regular, repetitive pattern)
- usually the texture is not known; practical assumptions are that the texture is homogeneous, isotropic, and stationary
- from the repetitive nature of texture elements under these conditions, we can estimate the non-deformed shape of such an element and the surface that causes its deformation

- **instance recognition** seeks to detect the presence of a known object in a new image
- the object should also be localized in the image (detection)
- major problem: the object might look quite different (viewpoint, lighting, occlusions, …)
- important difference to **object class recognition** or object class localization: the same instance of an object class is seen in the two images
- there can be large differences between two instances of an object class (e.g. two dogs)

- two main parts: set of features that describe the object and a classifier that separates objects
- classical approach: use a handcrafted feature representation and learn the classifier
- deep learning: learn both
- typical classifiers: support vector machine (two-class), logistic regression (multi-class, used in deep networks)
- nearest neighbor classifier (multi-class)

- Basic idea: learn a decision function with maximum margin (distance to most critical training points)
- decision function modeled as linear combination of features \(y(x) = w^T \phi(x) + b\)
- the large margin concept leads to a convex optimization problem: \(\text{argmin}_{w,b} \frac{1}{2} ||w||^2\) subject to \(t_n(w^T \phi(x_n) + b) \geq 1\)
- efficient code publicly available

- assign the class label of most similar training point
- advantages
- simple concept
- works for multiple classes

- drawbacks
- all training samples must be stored (no generalization)
- the search for the most similar training sample consumes much time

- color
- local descriptors (e.g. SIFT)
- discriminative properties good
- easy to extract from input images
- describe the object locally (robust to occlusions, local variation)
- fixed level of abstraction

- deep representations
- multiple levels of abstraction
- learned from training data

- objects should be recognized even if their appearance has changed due to typical transformations
- descriptors and classifiers are called **invariant** with respect to a transformation
- tradeoff between invariance and the discriminative power of the descriptor
- instance recognition: invariance needed with respect to
- viewpoint
- background
- lighting
- partial occlusion

- class recognition: needs complex invariant features learned from training examples

- distortion corrected patches
- SIFT/HOG (orientation histograms)
- Shape context

- instead of manual descriptor design, let computer find optimal descriptor for a task from a training set
- task: object classification
- training set consists of images and their class labels
- \(\text{Image} \rightarrow f(I) \rightarrow \text{features} \rightarrow C(f(I)) \rightarrow \text{Class}\)

- shallow modeling of function \(f(I)\) is not efficient to cover all the variation that appears in an object class

- each layer produces more abstract representation of the input
- layers are similar in structure
- weights of connections between layers (filters) learned from training data
- fully connected layer
- 200x200 image: 40k hidden units, 2 billion parameters
- spatial correlation is local
- waste of resources, not enough training samples

- locally connected layer
- 200x200 images, 40k hidden units, Filter size: 10x10, 4M parameters
- parametrization is good when input image is registered (face recognition)
- stationary: statistics is similar at different location

- convolutional layer
- share the same parameters across different location: convolutions with learned kernels
- single layer: max-pooling replaces small local area of feature map by maximum value in that area (downsampling)
- data reduction, loss of localization accuracy
- allows for larger receptive fields in next layer
- ReLU nonlinearity sets negative values to \(0\) (no activation)

- up to 30 layers
- cost function for training the weights: the classification error on the training set, \(w^* = \text{argmin}_w -\sum_i y_i \log(f_w(x_i))\) (cross-entropy)
- optimization by gradient descent (**back-propagation**)

- initialize the weights
- compute \(f_w(x_i)\) by forward-propagating a sample through the network
- gradient with regard to a weight in a certain layer: use chain rule
- update the weights

- filters of different layers capture different scales due to max-pooling
- filters at low layers are well localized and consider only a small image area (small **receptive field**)
- filters at high layers are badly localized and capture context

- feature sharing: same features obtained at low layers can be useful for assembling complex features at higher layers
- successive abstraction
- features close to image input resemble simple edge filters
- features close to class output are invariant to the typical appearance variations

- image classification only provides a class label per image
- object localization: provide bounding box of the object
- object detection: bounding boxes for potentially many object instances in the image
- semantic segmentation: for each pixel, which object class does it belong to?
- instance segmentation: additionally separates different class instances

- define features in local window
- consider many (or all) positions and scales and make a binary decision: is this window a car or not?
- non-maximum-suppression: keep only local maxima
- convolutional networks and sliding window concept are redundant
- exploit this for greater efficiency

- input image \(\rightarrow\) extract region proposals \(\rightarrow\) compute CNN features \(\rightarrow\) classify regions
- instead of windows, only \(\approx 1000\) region proposals are considered
- faster implementations first compute the ConvNet activations of the whole image and use them to classify the proposal windows

- common benchmark needed
- ImageNet, PASCAL Visual Object Classes, Microsoft COCO
- training set and test set
- detection performance is assessed by precision-recall curves

- **precision**: which part of the detected objects is correct: \(\frac{\text{true positives}}{\text{true positives} + \text{false positives}}\)
- **recall**: which percentage of the objects is detected: \(\frac{\text{true positives}}{\text{true positives} + \text{false negatives}}\)
- precision-recall curves are obtained by varying the threshold on the classification score
- single number: **average precision** (a sketch follows below)
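
A sketch of computing a precision-recall curve and average precision from detection scores, assuming each detection has already been marked as a true or false positive:

```python
import numpy as np

def precision_recall(scores, is_tp, num_gt):
    """Precision-recall curve by sweeping the score threshold; 'is_tp' is a
    boolean array marking each detection as true positive, 'num_gt' is the
    number of ground-truth objects. Returns precision, recall, and AP."""
    order = np.argsort(-scores)              # highest scores first
    tp = np.cumsum(is_tp[order])
    fp = np.cumsum(~is_tp[order])
    precision = tp / (tp + fp)
    recall = tp / num_gt
    ap = np.trapz(precision, recall)         # area under the curve
    return precision, recall, ap
```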

- generate an image given a virtual camera, objects and light sources
- various techniques: rasterization, raytracing
- major research topics in computer graphics

- generate 2D images from 3D scenes
- transform geometric primitives such as lines and polygons into raster image representations
- 3D objects are approximated by vertices (points), lines and polygons

- processing stages comprise the rendering pipeline
- supported by commodity graphics hardware
- OpenGL and DirectX
- Task: 3D input (camera, objects, light source, textures) \(\rightarrow\) 2D output
- functionality: resolve visibility, lighting model, compute shadows, apply textures

- vertex processing / geometry stage / vertex shader
- process vertices independently in same way
- transformations per vertex, computes lighting per vertex

- geometry shader
- generates, modifies, discards primitives

- primitive assembly and rasterization / rasterization stage
- assembles primitives such as points, lines, triangles
- converts primitives into raster image
- generates fragments / pixel candidates
- fragment attributes are interpolated from vertices of a primitive

- fragment processing /fragment shader
- processes all fragments independently in the same way
- fragments are processed, discarded or stored in framebuffer

- position, orientate and scale each object and respective vertices in scene with model transform
- position and orientate camera by view transform
- i.e. the inverse view transform places the camera at the origin of the coordinate system, facing the z-direction
- entire scene is transformed with inverse view transform

- interaction of light sources and surfaces is represented with a lighting / illumination model
- lighting computes a color for each vertex (light source position and properties, materials)

- \(P\) transforms the view volume to canonical view volume
- view volume depends on camera properties (orthographic projection vs perspective projection)
- canonical view volume is a cube from \((-1, -1, -1)\) to \((1, 1, 1)\)
- view volume is specified by near, far, left, right, bottom, top
- view volume (cuboid or frustum) is transformed into a cube
- object inside the view volume are transformed accordingly
- orthographic
- combination of translation and scaling
- all objects are translated and scaled in the same way

- perspective
- complex transformation
- scaling factor depends on distance of an object to viewer
- objects farther away appear smaller

- primitives that intersect the boundary of the view volume are clipped

- projected primitive coordinates \((x_p, y_p, z_p)\) are transformed to screen coordinates \((x_s, y_s)\)
- screen coordinates together with depth value are window coordinates \((x_s, y_s, z_w)\)
- \((x_p, y_p)\) are translated and scaled from range \((-1, 1)\) to actual pixel positions \((x_s, y_s)\) on the display
- \(z_p\) is generally translated and scaled from the range of \((-1, 1)\) to \((0, 1)\) for \(z_w\)
- screen coordinates \((x_s, y_s)\) represent the pixel position of a fragment that is generated in a subsequent step
- \(z_w\), the depth value, is an attribute of this fragment used for further processing

Fragment attributes are processed and tests are performed

- texture lookup
- texturing
- fog
- antialiasing

- scissor test: check if fragment is inside a specified rectangle
- used for, e.g., masked rendering

- alpha test: check if the alpha value fulfills a certain requirement
- transparency

- stencil test: check if stencil value in the framebuffer at the position of the fragment fulfills a certain requirement
- masking, shadows

- depth test
- compare depth value of fragment and depth value of the framebuffer at the position of the fragment
- used for resolving visibility

- combines the fragment color with the framebuffer color at the position of the fragment
- usually: determined by the alpha values

- dithering
- finite number of colors
- map color value to one of the nearest renderable colors

- logical operations / masking

- graphics rendering API: display of geometric representations and attributes. Independent from OS and window system
- interaction with GPUs, hardware-accelerated

- fixed-function pipeline
- focus on parallelized implementation
- promoted by quasi-standards of all components of rasterization-based renderer

- fixed-function pipeline with programmable vertex and fragment processing
- optional: shaders (work on vertex / fragment)

- programmable vertex and fragment processing
- no fixed-function pipeline
- required: shaders

- focus on core functionality
- improved handling of large data
- improved flexibility
- implementation of non-standard effects not restricted to “misusing” pipeline functionality

- programming
- not just a set of parameters of standard functionality
- concepts of vertex and fragment processing are not “nice to know”, but required knowledge

- OpenGL Mathematics glm

- geometry shader
- optional, modify geometric primitives
- generation of geometry no longer restricted to CPU

- flexible use of texture data

- tessellation shader
- optional
- tessellate patches
- flexible generation of large and detailed geometries

- compute shader
- optional
- perform arbitrary computations
- are not part of the rendering pipeline

- data transfer to GPU (vertices with attributes and connectivity)
- vertex shader: program that is executed for each vertex; input and output is a vertex
- rasterizer
- fragment shader (each fragment)
- framebuffer update

- Vertex Buffer Object VBO
- copy memory from CPU to GPU
- arbitrary data: typically vertex attributes

- Vertex Array Object VAO
- link between VBO and shader programs
- specifies how to interpret VBO data
- specifies the mapping to input variables of shaders

written in OpenGL Shading Language GLSL, runs on GPU, vertex and fragment shader are mandatory

- works on vertices
- minimum functionality: transform from local to clip space
- can compute/set a color at vertices
- rasterizer can interpolate attributes from vertices to fragments

- colorize

- congruent transformations (Euclidean transformations)
- preserve shape and size
- translation, rotation, reflection

- similarity transformations
- preserve shape
- translation, rotation, reflection, scale

- preserve collinearity: points on a line are transformed to points on a line
- preserve proportions: ratios of distances between points are preserved
- preserve parallelism: parallel lines stay parallel
- angles and length are not preserved
- translation, rotation, reflection, scale, shear are affine
- orthographic projection is a combination of affine transformations
- perspective projection is not affine
- 3D point: \(p' = T(p) = Ap + t\) (\(A\): 3x3 matrix for scale and rotation; \(t\): translation)
- preserve affine combinations: \(T(\sum_i \alpha_i p_i) = \sum_i \alpha_i T(p_i) \qquad \forall \sum_i \alpha_i = 1\)
- e.g.: Line can be transformed by transforming control points
- all affine transformations are represented with one matrix-vector multiplication

- rendering pipeline transforms vertices, normals, colors, texture coordinates
- points (e.g. vertices) specify a location in space
- vectors (e.g. normals) specify a direction
- relations between points and vectors:
- \(p_1 - p_2 = \vec{v}\)
- \(p_1 + \vec{v} = p_2\)
- \(\vec{v_1} + \vec{v_2} = \vec{v_3}\)
- \(p_1 + p_2\) is undefined
- \(\vec{p} = p - O \qquad p = O + \vec{p}\)

- transformations can have different effects on points and vectors
- translation
- move point to a different position
- does not change the vector

- using homogeneous coordinates, transformations of vectors and points can be handled in a unified way

- \(\begin{bmatrix}x\\y\\z\\w\end{bmatrix}\) with \(w \neq 0\) are homogeneous coordinates of the 3D point \(\begin{bmatrix}\frac{x}{w}\\\frac{y}{w}\\\frac{z}{w}\end{bmatrix}\)
- \(\begin{bmatrix}\lambda x\\\lambda y\\\lambda z\\\lambda w\end{bmatrix}\) represent the same point \(\begin{bmatrix}\frac{\lambda x}{\lambda w}\\\frac{\lambda y}{\lambda w}\\\frac{\lambda z}{\lambda w}\end{bmatrix} = \begin{bmatrix}\frac{x}{w}\\\frac{y}{w}\\\frac{z}{w}\end{bmatrix}\) for all \(\lambda \neq 0\)

- vectors: set \(w = 0\) (translations then have no effect; see the check below)
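
A small numeric check of the point/vector distinction, using a hypothetical 4x4 translation matrix: with \(w = 1\) the point moves, with \(w = 0\) the vector is unchanged:

```python
import numpy as np

T = np.eye(4)
T[:3, 3] = [1.0, 2.0, 3.0]            # homogeneous translation by (1, 2, 3)

p = np.array([0.0, 0.0, 0.0, 1.0])    # point: w = 1
v = np.array([1.0, 0.0, 0.0, 0.0])    # vector (direction): w = 0

print(T @ p)   # [1. 2. 3. 1.] -> the point is translated
print(T @ v)   # [1. 0. 0. 0.] -> the vector is unaffected by translation
```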

- general form \[\begin{bmatrix}m_{00} & m_{01} & m_{02} & t_0 \\ m_{10} & m_{11} & m_{12} & t_1 \\ m_{20} & m_{21} & m_{22} & t_2 \\ p_{0} & p_{1} & p_{2} & w \\ \end{bmatrix}\]
- \(m_{ij}\): rotation, scale
- \(t_i\): translation
- \(p_i\): projection
- \(w\): analogous to the fourth component of points/vectors

- Translation
- Rotation
- inverse rotation
- mirroring / reflection
- scale
- shear

- rotation and reflection matrices are orthogonal \(RR^T = R^TR = I\) and \(R^{-1} = R^T\)
- \(R_1, R_2\) are orthogonal \(\Rightarrow R_1R_2\) is orthogonal
- rotation: \(\det R = 1\) reflection: \(\det R = -1\)
- length of a vector does not change \(||Rv|| = ||v||\)
- angles are preserved \(\langle Ru, Rv \rangle = \langle u, v \rangle\)

- translation: \(p_2 = p_1 - t\), i.e. \(p_2 = T(-t)\,p_1\)
- rotation: by the corresponding rotation matrix

- view transform can be seen as basis transform
- objects are placed with respect to a global coordinate system \(C_1 = (O_1, \{e_1, e_2, e_3\})\)
- camera is positioned at \(O_2\) and oriented along \(\{b_1, b_2, b_3\}\), given by viewing direction and up-vector
- after view transform, all objects are represented in the eye or camera coordinate system \(C_2 = (O_2, \{b_1, b_2, b_3\})\)
- placing and orienting the camera corresponds to the application of the inverse transform to the objects
- rotating the camera by \(R\) and translating it by \(T\) corresponds to translating the objects by \(T^{-1}\) and rotating them by \(R^{-1}\)
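A hedged lookAt-style sketch of this inverse transform: build the camera basis \(\{b_1, b_2, b_3\}\) from viewing direction and up-vector, then rotate by \(R^T\) and translate by \(-O_2\) (inputs assumed non-degenerate; `Vec3` is an illustrative helper type):

```cpp
#include <cmath>

struct Vec3 { float x, y, z; };
Vec3 cross(Vec3 a, Vec3 b) { return {a.y*b.z - a.z*b.y, a.z*b.x - a.x*b.z, a.x*b.y - a.y*b.x}; }
float dot(Vec3 a, Vec3 b)  { return a.x*b.x + a.y*b.y + a.z*b.z; }
Vec3 normalize(Vec3 v)     { float l = std::sqrt(dot(v, v)); return {v.x/l, v.y/l, v.z/l}; }

// Build a view matrix (row-major 4x4): the camera looks along -b3.
void lookAt(Vec3 eye, Vec3 viewDir, Vec3 up, float m[4][4]) {
    Vec3 b3 = normalize({-viewDir.x, -viewDir.y, -viewDir.z}); // camera back axis
    Vec3 b1 = normalize(cross(up, b3));                        // camera right axis
    Vec3 b2 = cross(b3, b1);                                   // camera up axis
    Vec3 basis[3] = {b1, b2, b3};
    for (int i = 0; i < 3; ++i) {
        m[i][0] = basis[i].x; m[i][1] = basis[i].y; m[i][2] = basis[i].z; // R^T rows
        m[i][3] = -dot(basis[i], eye); // translate objects by the inverse
        m[3][i] = 0.0f;
    }
    m[3][3] = 1.0f;
}
```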

- planes can be represented by a surface normal \(n\) and a point \(r\): all points \(p\) with \(n \cdot (p - r) = 0\) form the plane

- Composition is achieved by matrix multiplication.
- E.g.: apply \(T\) to \(p\) and then \(R\): \(R(Tp) = (RT)p\)
- Note that: \(TR \neq RT\)
- order matters!

- projection from \(v\) onto \(l\) maps point \(p\) onto \(p'\)
- \(p'\) is the intersection of the line through \(p\) and \(v\) with line \(l\)
- \(v\) is viewpoint, center of perspectivity
- \(l\) is the viewline
- the line through \(p\) and \(v\) is the projector
- \(v\) is not on line \(l\), and \(p \neq v\)
- if homogeneous component of viewpoint \(v\) is not equal to zero, we have perspective projection
- if \(v\) is at infinity, we have parallel projection

- location of viewpoint and orientation of viewline determine the type of projection
- parallel (viewpoint at infinity, parallel projectors)
- perspective (non-parallel projectors)

- 2D projection is represented by the matrix \(M = vl^T - (l^Tv) I_3\)
- line \(l = \{ax + by + c = 0\}\) is represented as \(l = (a, b, c)^T\)

- matrices \(M\) and \(\lambda M\) represent the same transformation \(\lambda M p = \lambda p'\)
- 2D transformations in homogeneous form \[\begin{bmatrix}m_{11} & m_{12} & t_x \\ m_{21} & m_{22} & t_y \\ w_x & w_y & h \\ \end{bmatrix}\]
- \(w_x\) and \(w_y\) map the homogeneous component \(w\) of a point to a value \(w'\) that depends on \(x\) and \(y\)
- therefore, the scaling of a point depends on \(x\) and / or \(y\)
- in perspective 3D projections, this is employed to scale the \(x\) and \(y\) components of a point with respect to \(z\), its distance to the viewer

- projection from \(v\) onto \(n\), maps a point \(p\) onto \(p'\)
- \(p'\) is the intersection of the line through \(p\) and \(v\) with plane \(n\)
- \(v\) is the viewpoint, the center of perspectivity
- \(n\) is the viewplane
- the line through \(p\) and \(v\) is a projector
- \(v\) is not on the plane \(n\), and \(p \neq v\)
- represented by the matrix \(M = vn^T - (n^Tv) I_4\)
- plane \(n = \{ax + by + cz + d = 0\}\) is represented as \(n = (a, b, c, d)^T\)

- projection transformation maps a view volume to the canonical view volume
- the view volume is specified by its boundary
- the canonical view volume is a cube from \((-1, -1, -1)\) to \((1, 1, 1)\)
- transformation maps
- from eye coordinates
- to clip coordinates
- to normalized device coordinates NDC
- \(x\) component of a point from (left, right) to \((-1, 1)\)
- \(y\) component of a point from (bottom, top) to \((-1, 1)\)
- \(z\) component of a point from (near, far) to \((-1, 1)\)

- to obtain the \(x\) and \(y\) components of a projected point, the point is first projected onto the near plane (viewplane)
- \(x_p = \frac{nx_e}{-z_e} \qquad y_p = \frac{ny_e}{-z_e}\)
- transform the view volume (the pyramidal frustum) to the canonical view volume
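For reference, the standard OpenGL-style frustum matrix realizing this mapping, with near plane \(n\), far plane \(f\), and boundaries left \(l\), right \(r\), bottom \(b\), top \(t\); its last row copies \(-z_e\) into the homogeneous component, so the homogeneous division performs the division by distance:

\[M = \begin{bmatrix}\frac{2n}{r-l} & 0 & \frac{r+l}{r-l} & 0 \\ 0 & \frac{2n}{t-b} & \frac{t+b}{t-b} & 0 \\ 0 & 0 & -\frac{f+n}{f-n} & -\frac{2fn}{f-n} \\ 0 & 0 & -1 & 0 \end{bmatrix}\]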

- view volume is represented by a cuboid
- the projection transform maps the cuboid to the canonical view volume

- Objects are transformed from object to eye space with \(\text{GL_MODELVIEW} \cdot p\)
- Objects are transformed from object space to clip space with \(\text{GL_PROJECTION} \cdot \text{GL_MODELVIEW} \cdot p\)
- colors are transformed with color matrix GL_COLOR
- texture coordinates are transformed with texture matrix GL_TEXTURE

- each matrix type has a stack
- matrix on top of stack is used
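A minimal fixed-function sketch of selecting matrix types and using the stack (the concrete values are illustrative):

```cpp
// Legacy (fixed-function) OpenGL: each matrix type has its own stack.
glMatrixMode(GL_PROJECTION);            // select the projection matrix stack
glLoadIdentity();
glFrustum(-1, 1, -1, 1, 1.0, 100.0);    // eye space -> clip space

glMatrixMode(GL_MODELVIEW);             // select the modelview matrix stack
glLoadIdentity();
glTranslatef(0.0f, 0.0f, -5.0f);        // view transform

glPushMatrix();                         // save the current top of the stack
glRotatef(45.0f, 0.0f, 1.0f, 0.0f);     // model transform for one object
// ... draw object ...
glPopMatrix();                          // restore: top of stack is used again
```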

- radiant energy \(Q\)
- radiant flux \(\Phi\) / radiant power \(P\)
- rate of flow of radiant energy per unit time: \(\Phi = \frac{dQ}{dt}\)

- flux density (irradiance, radiant exitance)
- radiant flux per unit area: \(E = \frac{d\Phi}{dA}\)

- irradiance measures overall radiant flux \(\Phi\) (light flow, photons per unit time) into a surface element
- radiant intensity
- radiant flux per unit solid angle \(I = \frac{d \Phi}{d \omega}\)
- radiant flux \(\Phi\) (light flow, photons per unit time) incident on, emerging from, or passing through a point in a certain direction

- 2D angle in 3D space
- measured in steradians
- \(d \omega = \frac{dA}{r^2}\)

- point light source with radiant intensity \(I\) in direction \(\omega\)
- irradiance along a ray in direction \(\omega\) is inversely proportional to the square of the distance \(r\) from the source \(E = \frac{I}{r^2}\)
- the number of photons emitted in direction \(d\omega\) and hitting surface area \(dA\) at distance \(r\) is inversely proportional to \(r^2\)
- the area subtended by a solid angle is proportional to \(r^2\)
- \(E = \frac{d \Phi}{d A} = \frac{d \Phi}{r^2 \, d\omega} = \frac{I}{r^2}\)

- radiant flux per unit solid angle per unit projected area incident on, emerging from, or passing through a surface element in a certain direction: \(L = \frac{d^2 \Phi}{d A_p \, d \omega} = \frac{d^2 \Phi}{d A \cos \theta \, d \omega}\)
- describes strength and direction of radiation
- constant radiance in all directions corresponds to a radiant intensity proportional to \(\cos \theta\); conversely, constant radiant intensity results in direction-dependent radiance
- measured by sensors
- computed in computer-generated images
- preserved along lines in space
- does not change with distance

- photons characterized by wavelength in visible spectrum \(390 \text{nm}\) to \(750 \text{nm}\)
- light consists of a set of photons
- the distribution of wavelengths within this set is referred to as spectrum of light
- spectra are perceived as colors
- if spectrum consists of dominant wavelength, humans perceive a “rainbow” color (monochromatic)
- equally distributed wavelengths are perceived as shade of gray (achromatic)
- otherwise, colors “mixed from rainbow colors” are perceived (chromatic)

- sensitive to radiation within the visible spectrum
- has sensors for day and night vision (cones, rods)
- perceived light is radiation spectrum weighted with absorption spectra of the eye
- in daylight, three cone signals are interpreted by the brain

- XYZ color space
- \(X = \int_\lambda I(\lambda) \bar{x}(\lambda) d\lambda\)
- \(Y = \int_\lambda I(\lambda) \bar{y}(\lambda) d\lambda\)
- \(Z = \int_\lambda I(\lambda) \bar{z}(\lambda) d\lambda\)

- spectrum \(I\) is represented by \(X, Y, Z\) given the color-matching functions \(\bar{x}, \bar{y}, \bar{z}\)
- color-matching functions correspond to spectral sensitivities of the cones of a standard observer
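A sketch of evaluating one of these integrals numerically; the sampled spectrum and matching function are assumed to share uniform wavelength spacing (illustrative samples, not actual CIE tables):

```cpp
#include <vector>

// Numerically integrate X = ∫ I(λ) x̄(λ) dλ with a simple Riemann sum.
// I and matchFn are assumed to be sampled at the same wavelengths with
// uniform spacing dLambda.
float tristimulus(const std::vector<float>& I,
                  const std::vector<float>& matchFn,
                  float dLambda) {
    float sum = 0.0f;
    for (size_t i = 0; i < I.size(); ++i)
        sum += I[i] * matchFn[i] * dLambda;
    return sum; // X (or Y, Z with the corresponding matching function)
}
```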

- XYZ represents color and brightness / luminance
- two values are sufficient to represent color
- \(x = \frac{X}{X + Y + Z}\)
- \(y = \frac{Y}{X + Y + Z}\)

- monochromatic colors are on the boundary
- the center is achromatic

- use three primary colors, can only reproduce colors within spanned triangle

- three primaries: red, green, blue
- light source color: what is contained in spectrum
- object color: what it reflects, what it absorbs

- compute interaction of objects with light
- account for variety of light sources and material properties
- but: keep it fast
- only use local information
- interaction between objects neglected
- interreflections among objects not handled

- diffuse: reflected into many different directions, matte surfaces
- specular (mirror like): is reflected into a dominant direction (shiny surface)
- material properties
- general: combination of both

- computation of diffuse reflection is governed by Lambert’s cosine law
- radiant intensity reflected from a “Lambertian” (matte, diffuse) surface is proportional to the cosine of the angle between view direction and surface normal \(n\)
- irradiance on an oriented surface patch above a Lambertian surface is constant
- irradiance on a surface is proportional to the cosine of the angle between surface normal \(n\) and light source direction \(l\)

- perceived radiance depends on viewing direction
- maximum radiance, if viewer direction \(v\) corresponds to the reflection direction \(r\) or, if the halfway vector \(h\) corresponds to the surface normal

- Phong and Blinn-Phong do not account for energy preservation
- radiance depends on angle \(\theta\) and exponent \(m\)
- radiant exitance from surface depends on \(m\)
- normalized Blinn-Phong versions account for energy conservation

- diffuse and specular reflection are modeled for directional light from a point light source or from parallel light
- ambient light is an approx. for indirect light sources
- the effect is generally approximated by a constant

- combines ambient, diffuse and specular components
- Phong allows to set different light colors for different components
- multiple light sources possible
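A compact Blinn-Phong sketch combining the three components for one light source; `ambient`, `diffuse` and `specular` are assumed to be pre-multiplied light and material colors, and all direction vectors are assumed normalized (helper names are illustrative):

```cpp
#include <algorithm>
#include <cmath>

struct Vec3 { float x, y, z; };
static float dot(Vec3 a, Vec3 b)    { return a.x*b.x + a.y*b.y + a.z*b.z; }
static Vec3  scale(Vec3 v, float s) { return {v.x*s, v.y*s, v.z*s}; }
static Vec3  add(Vec3 a, Vec3 b)    { return {a.x+b.x, a.y+b.y, a.z+b.z}; }

// n: surface normal, l: direction to light, v: direction to viewer,
// m: specular exponent (shininess).
Vec3 blinnPhong(Vec3 n, Vec3 l, Vec3 v,
                Vec3 ambient, Vec3 diffuse, Vec3 specular, float m) {
    float nl = std::max(dot(n, l), 0.0f);        // Lambert's cosine law
    Vec3 h = {l.x + v.x, l.y + v.y, l.z + v.z};  // halfway vector (unnormalized)
    float len = std::sqrt(dot(h, h));
    h = scale(h, 1.0f / len);
    // specular term only where the surface faces the light
    float nh = (nl > 0.0f) ? std::pow(std::max(dot(n, h), 0.0f), m) : 0.0f;
    return add(ambient, add(scale(diffuse, nl), scale(specular, nh)));
}
```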

- between object surface and light source
- irradiance of a surface illuminated by a point light source is inversely proportional to the squared distance from the surface to the light source
- light source attenuation

- between object surface and viewer
- atmospheric effects, e.g. fog, influence the visibility of objects
- refers to transparency of air
- at low visibility, radiance converges to a “fog color”

- specify whether lighting is evaluated per vertex or per fragment
- if per vertex: colors are interpolated or not
- per fragment: surface normals are interpolated across a primitive
- flat shading (Constant shading)
- evaluation of the lighting model per vertex
- primitives are colored with color of one specific vertex

- Gouraud shading
- evaluation of lighting model per vertex
- primitives are colored by bilinear interpolation of vertex colors

- Phong shading
- bilinear interpolation of vertex normals during rasterization
- evaluation of the lighting model per fragment

- mach bands are illusions due to our neural processing
- the bright bands at 45 degrees and 135 degrees are illusory. The intensity inside each square is the same

- flat shading (constant shading): simple, fast
- Gouraud shading: more realistic, suffers from Mach band effect, local highlights are not resolved if the highlight is not captured by a vertex
- Phong shading: highest quality, most expensive

Transform geometric primitives into a raster (pixel positions).

- components of the start and end points of a line are integers: \(p_b = (x_b, y_b)\) and \(p_e = (x_e, y_e)\)
- lines are represented as \(y = mx + b\) or \(F(x, y) = ax + by + c = 0\)
- algorithms are often restricted to \(0 \leq m \leq 1\)
- algorithms consist of an initialization step and a loop
- line rasterization algorithms are usually described for a subset of lines and generalized using symmetries
- incremental updates are often employed
- Bresenham avoids floating-point arithmetic
- improved algorithms address aliasing / clipping artifacts
- note that the algorithms do not compute all pixels that are intersected by the line
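A Bresenham sketch for the restricted case \(0 \leq m \leq 1\), generalized to other octants via symmetry; `setPixel` is an assumed framebuffer write, not from the lecture:

```cpp
void setPixel(int x, int y); // assumed: writes one pixel

// Bresenham line rasterization with integer arithmetic only (0 <= m <= 1).
void bresenham(int xb, int yb, int xe, int ye) {
    int dx = xe - xb, dy = ye - yb;
    int d     = 2 * dy - dx;     // initial decision variable
    int incE  = 2 * dy;          // update when stepping east
    int incNE = 2 * (dy - dx);   // update when stepping north-east
    int y = yb;
    for (int x = xb; x <= xe; ++x) {
        setPixel(x, y);
        if (d > 0) { y += 1; d += incNE; }  // choose NE pixel
        else       {         d += incE;  }  // choose E pixel
    }
}
```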

- center \((0, 0)\) and radius \(r\)
- implicit representation \(F(x, y) = x^2 + y^2 - r^2 = 0\)
- algorithms compute only one eighth of a circle
- incremental updates are often employed
- floating-point arithmetic is avoided
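A midpoint-circle sketch computing one octant and mirroring it via the eight-fold symmetry; again `setPixel` is an assumed helper, and the circle is centered at the origin:

```cpp
void setPixel(int x, int y); // assumed: writes one pixel

// Midpoint circle rasterization with integer arithmetic only.
void midpointCircle(int r) {
    int x = 0, y = r;
    int d = 1 - r; // decision variable for F(x, y) = x^2 + y^2 - r^2
    while (x <= y) {
        // mirror the computed octant into the remaining seven
        setPixel( x,  y); setPixel( y,  x); setPixel(-x,  y); setPixel(-y,  x);
        setPixel( x, -y); setPixel( y, -x); setPixel(-x, -y); setPixel(-y, -x);
        if (d < 0) { d += 2 * x + 3; }
        else       { d += 2 * (x - y) + 5; --y; }
        ++x;
    }
}
```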

- a polygon is defined by edges
- should be closed to allow inside / outside classification
- rasterization estimates all pixel positions inside a polygon
- in general simple, but
- if adjacent polygons share an edge, each pixel on the edge should belong to exactly one polygon
- no pixel along the edge should be missed
- no pixel along the edge should be rasterized twice

- estimates all pixel positions inside a polygon

- improve the realism in rendered images
- illustrate spatial relations between objects
- Goal: determination of shadowed parts in 3D scenes
- only geometry is considered
- not standard functionality of rasterization-based rendering pipeline

- project the primitives of occluders onto a receiver plane, based on the light source location
- projected primitives form shadow geometry
- implementation: draw receiver plane, disable depth test, draw shadow geometry, enable depth test, draw occluders
- issues: restricted to planar receivers, no self-shadowing, antishadows
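A sketch of building the shadow geometry via the projection matrix \(M = vn^T - (n^Tv)I_4\) from the projection section, with \(v\) the homogeneous light position and \(n\) the receiver plane \((a, b, c, d)\):

```cpp
// Shadow projection matrix: projects occluder vertices from the light
// position onto the receiver plane. light = (x, y, z, 1) for a point light,
// plane = (a, b, c, d) with ax + by + cz + d = 0.
void shadowMatrix(const float light[4], const float plane[4], float m[4][4]) {
    float nv = light[0]*plane[0] + light[1]*plane[1]
             + light[2]*plane[2] + light[3]*plane[3]; // n^T v
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            m[i][j] = light[i] * plane[j] - (i == j ? nv : 0.0f);
}
```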

- see shadow casting as visibility problem
- scene points visible from the light source are illuminated; scene points invisible from the light source are in shadow
- resolving visibility is standard functionality in rendering pipeline (z-buffer algorithm)
- algorithm:
- render scene from light source
- store all distances to visible (illuminated) scene points in a shadow map
- render scene from camera
- compare the distance of rendered scene points to the light with stored values in the shadow map
- if both distances are equal, the rendered scene point is illuminated

- use two depth buffers
- “usual” depth buffer for view point
- second depth buffer for light position (shadow map)

- render scene from light position into shadow map
- render scene from view position into depth buffer
- transform depth buffer values to shadow map values
- compare transformed depth buffer values and shadow map values to decide whether a fragment is shadowed or not

- discretized representation of depth values can cause an erroneous classification of scene points
- an offset (bias) on the shadow map value reduces these aliasing artifacts
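A per-fragment sketch of the comparison including the bias; `lightMVP`, `sampleShadowMap` and the depth-range convention are assumptions for illustration:

```cpp
float sampleShadowMap(float u, float v); // assumed: light-space depth in [0, 1]

// lightMVP: transforms world space to the light's clip space (row-major).
bool inShadow(const float worldPos[4], const float lightMVP[4][4], float bias) {
    float p[4] = {0, 0, 0, 0};
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            p[i] += lightMVP[i][j] * worldPos[j];
    // perspective division and mapping from [-1, 1] to [0, 1]
    float u     = 0.5f * (p[0] / p[3] + 1.0f);
    float v     = 0.5f * (p[1] / p[3] + 1.0f);
    float depth = 0.5f * (p[2] / p[3] + 1.0f);
    // farther from the light than the stored depth -> shadowed
    return depth - bias > sampleShadowMap(u, v);
}
```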

- in large scenes, sampling artifacts can occur
- uniform sampling of the shadow map can result in non-uniform resolution for shadows in a scene
- shadow map resolution tends to be too coarse for nearby objects and too high for distant objects
- Solution: perspective shadow maps
- scene and light are transformed using the projective transformation of the camera
- compute the shadow map in post-perspective space

- employ a polygonal representation of the shadow volume
- point-in-volume test
- implementation issues: classification of shadow volume polygons into front and back faces
- stencil buffer values count the number of intersected front and back faces
- algorithm: Z-pass
- render scene to initialize depth buffer
- stencil enter / leave counting approach
- render shadow volume twice using face culling
- do not update depth and color

- if a pixel is in shadow, its stencil value is non-zero
- otherwise, its stencil value is zero

- implementation
- render the scene with only ambient lighting
- render front facing shadow volume polygons to determine how many front face polygons are in front of the depth buffer pixels
- render back facing shadow volume polygons
- render the scene with full shading where stencil is zero

- can miss intersections due to near plane clipping

- basic Z-Fail
- counts the difference of occluded front and back faces
- misses intersections behind the far plane

- with depth clamping
- do not clip primitives at the far plane
- clamp the actual depth value to the far plane

- Z-fail with far plane at infinity
- adapt the perspective projection matrix

- add per-pixel surface details without raising geometric complexity of a scene
- keep number of vertices and primitives low
- textures are represented as 2D images
- can represent variety of properties: colors, normals, etc.
- positions of texture pixels are characterized by texture coordinates \((u, v)\) in texture space
- texture mapping is a transformation from object space to texture space \((x, y, z) \rightarrow (u, v)\)
- texture mapping is generally applied per fragment

- object space location
- texture coordinates are defined for object space locations
- if the object moves in world space, move texture along with it

- projector function
- maps a 3D object space location to a 2D parameter-space value

- corresponder functions
- map from parameter space to texture space
- texture-space values are used to obtain values from a texture

- value transform
- a function that transforms obtained texture values

- planar projection, cylindrical projection or spherical projection
- can be an initialization step for the surface parameterization
- use different projections for surface patches
- minimize distortions, avoid artifacts at seams
- several texture coordinates can be assigned to a single vertex

- one or more corresponder functions transform from parameter space to texture space
- e.g. transform \((u, v)\) into the valid range \([0, 1]\)
- repeat, mirror, clamp to edge, clamp to border
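Sketches of three of these corresponder functions for one coordinate (illustrative):

```cpp
#include <cmath>

// Repeat: keep the fractional part, tiling the texture.
float repeatWrap(float u) {
    return u - std::floor(u);
}

// Mirror: tile with period 2 and reflect the second half.
float mirrorWrap(float u) {
    float t = u - 2.0f * std::floor(u / 2.0f); // t in [0, 2)
    return (t <= 1.0f) ? t : 2.0f - t;
}

// Clamp to edge: clamp the coordinate into [0, 1].
float clampToEdge(float u) {
    return std::fmin(std::fmax(u, 0.0f), 1.0f);
}
```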

- arbitrary transformations, e.g. represented by a matrix in homogeneous form, can be applied to texture values
- texture blending functions
- define how to use the texture value
- replace the original surface color with texture color
- linearly combine the original surface color with the texture
- multiply, add, subtract surface color and texture color

- alternatively, texture values can be used in the computation of illumination model

- a 2D image is mapped / glued to a triangulated object surface
- certain aspects affect the quality
- interpolation of texture coordinates
- sampling
- quality of applied textures can be improved by
- perspective-correct interpolation
- considering magnification and minification

- simplified transparency model
- semitransparent objects are filters / attenuators of occluded objects
- refraction and object thickness are neglected

- algorithms are based on stipple patterns or on per-pixel color blending

- screen-door transparency
- transparent object is rendered with a fill / stipple pattern, e.g. checkerboard (pattern of opaque and transparent fragments)
- limited number of fill patterns results in limited number of transparency levels
- aliasing artifacts
- simple method

- alpha value: describes the opacity of a fragment, stored with the RGB color (RGBA)
- order matters
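In the framebuffer update, back-to-front blending uses the "over" operator: the fragment's alpha weights its color against what is already stored. A per-channel sketch; in OpenGL this corresponds to enabling blending with `glBlendFunc(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA)`:

```cpp
// "Over" operator for one color channel: the fragment (src) attenuates
// whatever is already in the framebuffer (dst).
float blendOver(float srcColor, float srcAlpha, float dstColor) {
    return srcAlpha * srcColor + (1.0f - srcAlpha) * dstColor;
}
```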

- polygons / fragments have to be rendered in sorted depth order
- hardware generally renders in object order
- depth test only returns the nearest fragment per pixel, sorting is not realized
- intersecting polygons have to be handled
- dynamic scenes require re-sorting

- exactly two depth layers for arbitrary viewing directions
- first depth layer defined by front faces
- second depth layer defined by back faces
- algorithm: render back faces in first pass, blend with front faces in second pass

- object-space methods
- use pre-computed spatial data structures (e.g. binary partition tree)
- useful for static geometry
- varying viewer positions and orientations can be handled

- screen-space methods
- employ the functionality of the rendering pipeline
- several rendering passes compute depth layers
- final pass renders the ordered depth layers
- useful for dynamic / deforming geometry and arbitrary views
- no pre-computation is required

- BSP tree is hierarchical spatial data structure
- 3D space is subdivided by means of arbitrarily oriented planes
- leaves represent convex space cells
- applications
- visible surface algorithm
- depth sorting
- collision detection

- precomputed for static scenes
- all planar primitives are represented in the tree
- balancing is less important, as the entire tree has to be traversed
- one rendering pass

- back-to-front rendering (see the sketch after this list)
- render the branch on the far side of the viewpoint
- render the root (node) polygon
- render the branch on the near side of the viewpoint
- recursively applied to sub-trees
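A recursive traversal sketch; the node layout and `renderPolygons` are assumed placeholders, not from the lecture:

```cpp
#include <memory>

// BSP node: splitting plane (n_x, n_y, n_z, d), polygons, two children.
struct BspNode {
    float plane[4];
    std::unique_ptr<BspNode> front, back;
    void renderPolygons() const; // assumed: draws the node's polygons
};

void renderBackToFront(const BspNode* node, const float eye[3]) {
    if (!node) return;
    // classify the viewpoint against the splitting plane
    float side = node->plane[0]*eye[0] + node->plane[1]*eye[1]
               + node->plane[2]*eye[2] + node->plane[3];
    const BspNode* nearChild = (side >= 0) ? node->front.get() : node->back.get();
    const BspNode* farChild  = (side >= 0) ? node->back.get()  : node->front.get();
    renderBackToFront(farChild, eye);  // far branch first
    node->renderPolygons();            // then the node's polygons
    renderBackToFront(nearChild, eye); // near branch last
}
```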

- not only visible surface generation, but depth sorting of all primitives per pixel position
- additional data structure
- can be pre-computed
- requires polygon splits
- dynamic scenes require an update

- use the functionality of rendering pipeline
- several rendering passes compute depth layers
- final pass renders the ordered depth layers
- useful for dynamic / deforming geometry
- no pre-computation is required
- algorithm
- first render pass gives the front-most fragment color / depth
- each successive render pass extracts the next-nearest fragment on a per-pixel basis (screen-space approach)
- two depth buffers are used

- object is rendered once for each depth layer
- two separate depth tests per fragment
- depth peeling strips away one depth layer with each successive rendering pass
- implementation based on shadow mapping

- screen-space algorithm
- multiple rendering passes generate depth layers per pixel position
- view dependent (in contrast to BSP)
- appropriate for dynamic scenes
- quality and performance are determined by number of rendering passes

Angle of incidence is equal to the angle of reflection.

- original and reflected geometry are rendered
- reflected geometry is generated with respect to the reflection plane with surface normal \(n = (n_x, n_y, n_z)\) and point \(p\) on the reflection plane
- semi-transparent reflection plane, e.g. with color blending (original geometry)
- reflection plane to stencil: only where stencil is set
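The reflected geometry can be generated per vertex; a sketch with unit normal \(n\) and point \(r\) on the reflection plane (`Vec3` is an illustrative helper type):

```cpp
struct Vec3 { float x, y, z; };

// Reflect a point across the mirror plane: p' = p - 2((p - r) . n) n,
// with n assumed to be a unit normal.
Vec3 reflectAcrossPlane(Vec3 p, Vec3 n, Vec3 r) {
    float d = (p.x - r.x)*n.x + (p.y - r.y)*n.y + (p.z - r.z)*n.z;
    return {p.x - 2.0f*d*n.x, p.y - 2.0f*d*n.y, p.z - 2.0f*d*n.z};
}
```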

- cube mapping: approximates reflections of the environment on arbitrary surfaces
- project environment onto an object-embedding shape (sphere or cube)
- view-dependent mapping
- approximate implementation of reflections off arbitrary surfaces
- cannot handle changing reflections of moving objects

- generate or load a 2D map of environment
- for each fragment of a reflective object: compute the normal \(n\)
- compute the reflection vector \(r\) from the view vector \(v\) and the normal \(n\) at the surface point
- use the reflection vector to compute an index into the environment map that represents the objects in the reflection direction
- use the texel data (texture value) from the environment map to color the current fragment
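A sketch of the reflection-vector step, with \(v\) pointing from the surface point to the viewer and \(n\) the unit normal (`Vec3` again illustrative):

```cpp
struct Vec3 { float x, y, z; };

// Reflection direction for the environment-map lookup: r = 2 (n . v) n - v,
// with n and v assumed normalized.
Vec3 reflectView(Vec3 v, Vec3 n) {
    float nv = n.x*v.x + n.y*v.y + n.z*v.z;
    return {2.0f*nv*n.x - v.x, 2.0f*nv*n.y - v.y, 2.0f*nv*n.z - v.z};
}
```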

- place a camera at the center of the object and take pictures in 6 directions

- orthographically project an image of a mirrored sphere
- map stores colors seen by reflected rays
- sphere map contains information about both the environment in front of the sphere and behind it
- can be obtained by ray tracing or by warping automatically generated cube maps

- disadvantages: maps are hard to create on the fly; sampling is non-linear, non-uniform, and viewpoint-dependent
- advantages: no interpolation across map seams
- objects should be small compared to environment
- issues with self-reflections and non-convex objects
- separate map for each object
- maps may need to be changed in case of a change in viewpoint due to non-uniform sampling
- translated objects might require a map update