This document is derived from the lecture by Prof. Dr. T. Brox, Dr. O. Ronneberger, and Prof. Dr. M. Teschner at the University of Freiburg in the winter semester 2015/16 (WS 15/16).
Introduction
Digital image
- regular grid I_{ij} (intensity values)
- continuous function I: \Omega \subset \mathbb{R}^2 \rightarrow \mathbb{R} with (x, y) \mapsto I(x, y)
- Ω is image domain, usually rectangular
Human vision
- Ability to adapt to changing light intensities
- Retina: Cones (color vision), rods (gray, high sensitivity)
- Analog preprocessing (smoothing, edge detection)
- Sharp image only in the center; significantly reduces the amount of data before sending it to the brain
- “digital”: spiking or not spiking, timing
- Gabor filter: linear filter for edge detection
- Sparse coding: over-complete code, only few are active for certain signal
- lower cortical areas (primary visual cortex V1): neurons respond only to signals in a very small local area
- higher cortical areas: receptive fields larger (integration of information from lower areas)
Camera
- Pinhole: object points (X, Y, Z) are projected to (x, y) by x = f\frac{X}{Z}; y = f\frac{Y}{Z}. Problem: trade-off between sharpness and light intensity, controlled by the aperture (size of the hole)
- Optics: light focusing; large aperture and sharpness possible. Thin-lens equation: \frac{1}{S_1} + \frac{1}{S_2} = \frac{1}{f}. Computer vision practice: pinhole camera model + correction of lens effects
- effect of lenses is more important: shape from defocus, image analysis in microscopy
Sampling
- Discretize image domain: \{I_{ij} | i = 1, ..., N; j = 1, ..., M\}
- Grid points: pixels
- Grid: usually rectangular point grid, equal spacing h (Grid size)
- spacing is the same in all directions: h = h_x = h_y
- if true spacing not known: h=1
- Reduce the number of pixels: downsampling. Increase: upsampling
Continuous vs discrete
- Pros Continuous: scene is continuous, limiting case for successively finer grids, rotational invariance, exact length, subpixel accuracy
- Pros Discrete models: sensor data discrete, model closer to implementation, efficient optimization possible
Aliasing and Moiré effect
- function can be represented by its frequency components (Fourier)
- Discrete signal can only represent frequencies up to a certain limit (Nyquist frequency)
- Ignoring the Nyquist frequency leads to aliasing artifacts (e.g. stepping along straight lines)
- Sampling of periodic signals is frequency modulation (multiplication of two periodic signals)
- It can lead to Moiré effects (special aliasing artifact)
- Videos: temporal aliasing
Nyquist-Shannon theorem
The input signal can be reconstructed from its samples in a unique way if the sampling rate is at least twice the bandwidth of the input signal
- two different frequencies may lead to the same discrete signal
- reduction of bandwidth (max freq) can be achieved by smoothing
- consequence: smooth your input image sufficiently before downsampling
Downsampling
- Decanting operator: ensures minimum smoothing necessary to avoid aliasing
- In case of overlap: intensity distribution to both cells according to overlap ratio
- normalization of intensity value in low res image by downsampling factor
- Operator is separable: can be applied sequentially along all axes
Upsampling
- Bilinear interpolation: weighted avg of neighboring pixels
- project fine grid point to available coarse grid
- weighted average along x, then along y (see the sketch after this list):
a_j = (1 - \alpha) I_{i,j} + \alpha I_{i+1,j}
a_{j+1} = (1 - \alpha) I_{i,j+1} + \alpha I_{i+1,j+1}
I_{k,l} = (1 - \beta) a_j + \beta a_{j+1}
- general concept to retrieve values at points between grid points
- arbitrary dimensions
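A minimal NumPy sketch of the scheme above (not from the lecture; the function names `bilinear_sample`/`upsample` and the clamping at the image border are my own choices):

```python
import numpy as np

def bilinear_sample(I, x, y):
    """Bilinearly interpolate image I at continuous coordinates (x, y).

    x indexes columns, y indexes rows; coordinates are clamped to the valid
    range so border samples simply repeat the edge pixels.
    """
    h, w = I.shape
    x = np.clip(x, 0, w - 1)
    y = np.clip(y, 0, h - 1)
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    alpha, beta = x - x0, y - y0          # fractional offsets
    # interpolate along x for the two neighboring rows, then along y
    a0 = (1 - alpha) * I[y0, x0] + alpha * I[y0, x1]
    a1 = (1 - alpha) * I[y1, x0] + alpha * I[y1, x1]
    return (1 - beta) * a0 + beta * a1

def upsample(I, factor=2):
    """Upsample I by projecting each fine grid point onto the coarse grid."""
    h, w = I.shape
    out = np.empty((h * factor, w * factor))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = bilinear_sample(I, j / factor, i / factor)
    return out
```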
Quantization
- Discretization of the co-domain: \mathbb{R} \rightarrow \{1, ..., N\}
- needed for float or integer representation
- 256 gray scales, 8 bit per px
- humans distinguish only about 40 gray levels, but thousands of colors
- optimal quantization by clustering
- assume: I(x, y) \in \mathbb{R}; 0 \leq I(x, y) \leq 255
Noise
- Additive noise: I_{ij} = I^*_{ij} + n_{ij}
- Gaussian noise: G_\sigma(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)
- Multiplicative noise: I_{ij} = I^*_{ij}(1 + n_{ij})
- Trick: \log I_{ij} = \log I^*_{ij} + \log(1 + n_{ij}) (additive noise).
- Backtransform: I^*_{ij} = \exp(\log I^*_{ij})
- Impulse noise: certain percentage of pixels is replaced by one or two fixed values (pixel defect)
- salt-and-pepper: some pixels replaced by white or black values
- Uniform noise: a certain percentage of pixels is replaced by random values
Signal-to-noise ratio (SNR)
- Measure: I versus I^* (ground truth)
- variance of image versus variance of noise
- variance of image: \sigma^2_I = \frac{1}{N}\sum_i (I^*_i - \mu)^2
- additive, zero-mean noise model: I_i = I^*_i + n_i
- variance of noise: \sigma^2_n = \frac{1}{N}\sum_i (I^*_i - I_i)^2
- Signal-to-noise ratio: SNR = 10\log_{10}\frac{\sigma^2_I}{\sigma^2_n} = 10\log_{10}\frac{\sum_i (I^*_i - \mu)^2}{\sum_i (I^*_i - I_i)^2}
- Peak signal-to-noise ratio: PSNR = 10\log_{10}\left(\frac{N(\max_i I^*_i - \min_i I^*_i)^2}{\sum_i (I^*_i - I_i)^2}\right)
- Unit: decibel (dB). Higher the better
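A small NumPy sketch of both measures, directly following the formulas above (function names `snr_db`/`psnr_db` are my own):

```python
import numpy as np

def snr_db(clean, noisy):
    """SNR in dB: variance of the (ground-truth) signal vs. variance of the noise."""
    clean = clean.astype(np.float64).ravel()
    noisy = noisy.astype(np.float64).ravel()
    signal_var = np.sum((clean - clean.mean()) ** 2)
    noise_var = np.sum((clean - noisy) ** 2)
    return 10 * np.log10(signal_var / noise_var)

def psnr_db(clean, noisy):
    """PSNR in dB, using the dynamic range of the ground-truth image as the peak."""
    clean = clean.astype(np.float64).ravel()
    noisy = noisy.astype(np.float64).ravel()
    peak = clean.max() - clean.min()
    mse = np.mean((clean - noisy) ** 2)
    return 10 * np.log10(peak ** 2 / mse)
```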
Point operations
- treat each pixel independently: u_{ij} = f(I_{ij})
- example: change brightness (u(x,y)=I(x,y)+b) (clipping needed)
Contrast enhancement
Gamma correction
- camera chips have different response curves: I_{out} \propto I_{in}^\gamma
- correction: f(I(x, y)) = I_{max}\left(\frac{I(x, y)}{I_{max}}\right)^{\frac{1}{\gamma}}, \gamma > 0
- range [0, I_{max}] not affected
- Dark areas become brighter without leading to oversaturation in the brighter areas
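A minimal sketch of the correction formula (not from the lecture; `gamma_correct` and the default I_max = 255 are my own assumptions):

```python
import numpy as np

def gamma_correct(I, gamma, I_max=255.0):
    """Apply f(I) = I_max * (I / I_max)^(1/gamma); gamma > 1 brightens dark areas."""
    I = np.clip(I.astype(np.float64), 0, I_max)
    return I_max * (I / I_max) ** (1.0 / gamma)
```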
Gray value histogram
- number of pixels that contain certain gray value
- histogram equalization: transform such that all gray values are equally frequent
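A possible NumPy sketch of histogram equalization via the cumulative histogram (assuming 8-bit images; the name `equalize_histogram` is mine):

```python
import numpy as np

def equalize_histogram(I, levels=256):
    """Map gray values through the normalized cumulative histogram so that
    the output histogram becomes approximately flat."""
    hist, _ = np.histogram(I.ravel(), bins=levels, range=(0, levels))
    cdf = hist.cumsum().astype(np.float64)
    cdf = (cdf - cdf[0]) / (cdf[-1] - cdf[0])            # normalize to [0, 1]
    lut = np.round(cdf * (levels - 1)).astype(np.uint8)  # gray-value lookup table
    return lut[I.astype(np.uint8)]
```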
Difference image
- Gray: I_\Delta = |I_1 - I_2|
- Color: I_\Delta = \frac{1}{3}\sum_{k=1}^{3} |I_{k,1} - I_{k,2}|
Linear filters
- improve signal by taking neighboring points into account
- filters defined in the Fourier domain (global Frequency analysis on all pixels)
- convolution theorem translates filtering in the Fourier domain to the spatial domain: \mathcal{F}(f) \cdot \mathcal{F}(h) = \mathcal{F}(f * h)
- filters are usually defined in the spatial domain (h instead of \mathcal{F}(h)): more interest in local analysis, easier to generalize to nonlinear filters, better behavior at boundaries
Convolution
- With a static filter h: linear operation. (f * h)(x) = \int h(-x') f(x + x')\, dx'
- in 2D: (I * h)(x, y) = \int h(-x', -y') I(x + x', y + y')\, dx'\, dy'
- properties:
- linearity: (\alpha f + \beta g) * h = \alpha(f * h) + \beta(g * h)
- shift invariance: f(x) * g(x + \delta) = (f * g)(x + \delta)
- commutativity: f * g = g * f
- associativity: (f * g) * h = f * (g * h)
- correlation (in contrast): (f \star h)(x) = \int h(x') f(x + x')\, dx'
Discrete convolution
- 1D: (f * h)_i = \sum_{k=1}^{n} f_k h_{i-k}
- 2D: (f * h)_{i,j} = \sum_{k=1}^{n}\sum_{l=1}^{m} f_{k,l} h_{i-k,j-l}
- Separability: a filter is separable if h(x, y) = \delta(x, y) * h_1(x, \cdot) * h_2(\cdot, y) (\delta is the Dirac distribution: \delta(x) = \infty for x = 0, 0 else, with \int \delta(x)\, dx = 1)
- Separable filters can be applied successively as 1D convolutions, which reduces the runtime complexity from O(NMnm) to O(NM(n + m))
Gauss filter
- Kernel: G_\sigma(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{x^2}{2\sigma^2}\right)
- separable: I'(x, y) = I(x, y) * G(x, \cdot) * G(\cdot, y)
- discrete stencil derived by sampling within a 3\sigma interval
- Runtime complexity O(NM\sigma) (Fourier domain: O(NM\log(NM)))
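A sketch of separable Gaussian smoothing with a stencil sampled in [-3\sigma, 3\sigma] (my own helper names; note that np.convolve pads with zeros, i.e. Dirichlet rather than the preferred mirrored boundary conditions):

```python
import numpy as np

def gaussian_kernel(sigma):
    """Sample the 1D Gaussian at integer positions in [-3*sigma, 3*sigma] and normalize."""
    r = int(np.ceil(3 * sigma))
    x = np.arange(-r, r + 1)
    k = np.exp(-x ** 2 / (2 * sigma ** 2))
    return k / k.sum()

def gaussian_smooth(I, sigma):
    """Separable Gaussian smoothing: one 1D convolution along each axis."""
    k = gaussian_kernel(sigma)
    I = I.astype(np.float64)
    tmp = np.apply_along_axis(lambda row: np.convolve(row, k, mode='same'), 1, I)
    return np.apply_along_axis(lambda col: np.convolve(col, k, mode='same'), 0, tmp)
```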
Boundary conditions
- Boundary of \Omega is \partial\Omega
- Dirichlet boundary conditions: I(x, y) = 0 \quad \forall (x, y) \in \partial\Omega
- Homogeneous Neumann boundary conditions: \frac{\partial}{\partial n} I(x, y) = 0 \quad \forall (x, y) \in \partial\Omega with n the outer normal vector and \frac{\partial I}{\partial n} = n^T \nabla I (mirror the image at the boundary). Preferred method (preserves the average intensity)
Box filter
- alternative to Gauss filter
- Kernel: B_\sigma(x) = \frac{1}{2\sigma} for |x| < \sigma, 0 else
- simple averaging of neighboring values
- properties:
- separable
- not rotationally invariant
- much more efficient O(NM)
- result not as smooth
- successive convolution of box filter with itself yields filters that resemble the Gauss filter
- smoothing result of filter of order k is k times differentiable
Recursive filter (Deriche filter)
- (fast) alternative to Gauss filter
- Idea: recursively propagate information in both directions of the signal
- α smoothness parameter
- approximates Gaussian convolution
- relation between \sigma and \alpha: \alpha\sigma = \frac{5}{2\sqrt{\pi}}
- properties:
- separable
- rotationally invariant
- O(NM)
- hard/impossible to implement Neumann boundary conditions
Derivative filters
- measuring the local change of intensities
- continuous functions: given by the derivative \frac{df}{dx}
- multidimensional functions: \nabla I = \left(\frac{\partial I}{\partial x}, \frac{\partial I}{\partial y}\right)^T (gradient)
- alternative notation: \nabla I = (I_x, I_y)^T
- image must be differentiable
Gaussian derivative
- ensure differentiability by combining with small amount of Gauss smoothing (Gaussian derivatives)
- Convolution and derivatives are linear operations (Applying derivative to image or to Gaussian does not matter)
- \frac{\partial}{\partial x}(I(x) * G_\sigma(x)) = I(x) * \frac{\partial}{\partial x} G_\sigma(x)
- arbitrarily high order derivatives exist
- sample the derivative of the Gaussian filter for implementation
- common mask for the first derivative is \left(-\frac{1}{2}, 0, \frac{1}{2}\right) (central difference)
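A minimal sketch of the central-difference mask and the resulting gradient magnitude (function names are my own; borders are left at zero for simplicity):

```python
import numpy as np

def gradient_central(I):
    """Approximate I_x and I_y with the central-difference mask (-1/2, 0, 1/2)."""
    I = I.astype(np.float64)
    Ix = np.zeros_like(I)
    Iy = np.zeros_like(I)
    Ix[:, 1:-1] = 0.5 * (I[:, 2:] - I[:, :-2])   # derivative along x (columns)
    Iy[1:-1, :] = 0.5 * (I[2:, :] - I[:-2, :])   # derivative along y (rows)
    return Ix, Iy

def gradient_magnitude(I):
    """Gradient magnitude |grad I|, usable as a simple edge indicator."""
    Ix, Iy = gradient_central(I)
    return np.sqrt(Ix ** 2 + Iy ** 2)
```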
Higher order derivatives
- higher derivatives than first derivative
- eg. Laplace filter: Zero-crossings of this filter correspond to image edges
- Another way to detect edges is by the gradient magnitude
Energy Minimization
- formalize model assumptions and cast the task as E(x) = A_1(x) + ... + A_n(x) (energy/loss function)
- solve x^* = \text{argmin}_x E(x) with x \in \mathbb{R}^N
Example: image denoising
- model assumptions: similar to input image, result should be smooth
- formalize:
- similar: E_D(u) = \sum_{i,j}(u_{i,j} - I_{i,j})^2
- smooth: E_S(u) = \sum_{i,j}(u_{i,j+1} - u_{i,j})^2 + (u_{i+1,j} - u_{i,j})^2
- problem to solve: u^* = \text{argmin}_u (E_D(u) + \alpha E_S(u)) (\alpha is the smoothness weight)
Advantages
- transparency
- optimality
- theoretical aspects can be analysed
- Existence and uniqueness of solutions
- Stability of solutions with respect of the input data
- Difficulty of the problem class
- fewer parameters than in heuristic multi-step methods
- combination of approaches is easy
Problems
- Formalizing is ad-hoc (use statistical interpretations)
- choosing weighting parameter(s) not easy (can be learned)
- global optimization is hard
Example: image denoising
- necessary condition: \frac{\partial E}{\partial u} = 0 \Leftrightarrow \frac{\partial E}{\partial u_{i,j}} = 0 \; \forall i, j
- compute \frac{\partial E}{\partial u_{i,j}}
- write as a system of linear equations (matrix of size N \times N for N pixels, symmetric and positive definite)
- contains one main diagonal (central pixels) and four off-diagonals (one for each of the four neighbors of a pixel)
- the matrix is sparse (almost all entries are 0)
- since A is positive definite, the inverse A^{-1} exists and we can solve for u
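A possible SciPy sketch of assembling and solving this sparse system for the quadratic denoising energy (my own function name; the Laplacian construction assumes a 4-neighborhood and grid size 1):

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def denoise_quadratic(I, alpha):
    """Minimize sum (u - I)^2 + alpha * sum of squared neighbor differences.

    Setting the derivative to zero gives (Id + alpha * L) u = I, where L is the
    graph Laplacian of the 4-neighborhood grid (boundary pixels simply have
    fewer neighbors, matching the natural boundary handling).
    """
    h, w = I.shape
    n = h * w
    idx = np.arange(n).reshape(h, w)
    rows, cols, vals = [], [], []
    for di, dj in ((0, 1), (1, 0)):               # each grid edge counted once
        a = idx[:h - di, :w - dj].ravel()
        b = idx[di:, dj:].ravel()
        rows += [a, b, a, b]
        cols += [a, b, b, a]
        vals += [np.ones_like(a), np.ones_like(b), -np.ones_like(a), -np.ones_like(b)]
    L = sp.csr_matrix((np.concatenate(vals).astype(np.float64),
                       (np.concatenate(rows), np.concatenate(cols))), shape=(n, n))
    A = sp.identity(n, format='csr') + alpha * L
    u = spla.spsolve(A, I.astype(np.float64).ravel())
    return u.reshape(h, w)
```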
Convexity
- Convex
- Positive curvature
- No local minima
- Global minimum is unique
- Minimization by setting the derivative to 0
- Non-convex functions
- many local minima and maxima
- Global minimum may not be unique
- Global minimization is usually impossible, only heuristics exist
A function is convex if
f((1 - \alpha)x_1 + \alpha x_2) \leq (1 - \alpha)f(x_1) + \alpha f(x_2) \quad \forall \alpha \in [0, 1]
A function is strictly convex if
f((1 - \alpha)x_1 + \alpha x_2) < (1 - \alpha)f(x_1) + \alpha f(x_2) \quad \forall \alpha \in (0, 1), x_1 \neq x_2
- every non-negative linear combination of (strictly) convex functions is again (strictly) convex
Jacobi method
- iterative solver, preserves sparsity of the matrix
- Converges if the matrix is strictly diagonally dominant: |a_{ii}| > \sum_{j \neq i} |a_{ij}| \quad \forall i
- decompose the matrix into A = D + M (D diagonal part, M off-diagonal part)
- for a linear system: Ax = b \Leftrightarrow (D + M)x = b \Leftrightarrow Dx = b - Mx
- D^{-1}: replace the diagonal elements by their reciprocals
- compute the solution x iteratively, starting at any x^0:
x^{k+1} = D^{-1}(b - Mx^k)
- iterate until the residual r^k = Ax^k - b is smaller than a threshold or the change (x^{k+1} - x^k)^2 becomes small. If it is 0, the iteration has converged.
- properties:
- simple, parallelizable, few requirements
- slow
- convergence only for k \rightarrow \infty
- computation not in place
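A minimal sketch of the Jacobi iteration for a dense NumPy matrix (my own function name; assumes a non-zero diagonal):

```python
import numpy as np

def jacobi(A, b, x0=None, tol=1e-8, max_iter=10000):
    """Jacobi iteration x^{k+1} = D^{-1}(b - M x^k) with A = D + M.

    Converges e.g. for strictly diagonally dominant A; the diagonal must be non-zero.
    """
    n = len(b)
    x = np.zeros(n) if x0 is None else x0.astype(np.float64).copy()
    d = np.diag(A)                      # diagonal part D (as a vector)
    M = A - np.diag(d)                  # off-diagonal part
    for _ in range(max_iter):
        x_new = (b - M @ x) / d         # not in place: needs the full old x
        if np.linalg.norm(A @ x_new - b) < tol:   # residual check
            return x_new
        x = x_new
    return x
```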
Gauß-Seidel method
- Ax = b \Leftrightarrow Dx = b - Lx - Ux: split the off-diagonal part M into the lower triangle L and the upper triangle U
- iterate, traverse the vector x from top to bottom and use the new values for the multiplication with the lower triangle
x^{k+1} = D^{-1}(b - Lx^{k+1} - Ux^k)
- Converges if A is positive or negative definite
- properties:
- in place computation
- recursive propagation of information: faster
Successive over-relaxation (SOR)
- accelerate Gauß-Seidel by over-relaxing the solution update
x^{k+1} = (1 - \omega)x^k + \omega D^{-1}(b - Lx^{k+1} - Ux^k) (\omega = 1 is Gauß-Seidel)
- Converges for positive- or negative definite matrices (all eigenvalues positive or negative), if ω∈(0,2)
- Over-relaxation for ω>1 faster convergence
- Under-relaxation for ω<1 can help establish convergence in case of divergent iterative processes
- optimal ω must be determined empirically
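A sketch of SOR for a dense NumPy matrix; with omega = 1 it reduces to Gauß-Seidel (function name and default omega are my own choices):

```python
import numpy as np

def sor(A, b, omega=1.5, x0=None, tol=1e-8, max_iter=10000):
    """Successive over-relaxation; omega = 1 is Gauss-Seidel.

    x is updated in place, so newly computed entries are used immediately
    (the 'L x^{k+1}' part of the update).
    """
    n = len(b)
    x = np.zeros(n) if x0 is None else x0.astype(np.float64).copy()
    for _ in range(max_iter):
        for i in range(n):
            sigma = A[i, :i] @ x[:i] + A[i, i + 1:] @ x[i + 1:]
            x[i] = (1 - omega) * x[i] + omega * (b[i] - sigma) / A[i, i]
        if np.linalg.norm(A @ x - b) < tol:    # residual check
            break
    return x
```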
Conjugate gradient (CG)
- two non-zero vectors u and v are conjugate with respect to A if the inner product \langle u, v \rangle_A := u^T A v = 0 (vectors orthogonal with respect to a special scalar product)
- a set of n conjugate vectors \{p_k\} forms a basis of \mathbb{R}^n, so the solution x^* of Ax = b can be expanded as x^* = \alpha_1 p_1 + ... + \alpha_n p_n
- coefficients αk are derived
- after n computations we obtain exact solution x∗
- Good choice of {pk}: few coefficients approximate solution well
- Start at some initial point x0
- Let p_0 be the residual r_0 = b - Ax_0. This is the negative gradient (steepest descent direction) of
E(x) = \frac{1}{2}x^T Ax - b^T x, the minimizer of which is x^* (hence the name CG)
- Iteratively compute parameters
- Stop when residual is small. Guaranteed solution after n iterations
- matrix A must be symmetric and positive definite
- computing exact solution is not an option as n is the number of pixels
- number of iterations to get good solution depends on the condition number of A (largest vs smallest eigenvalue). The same holds for other iterative methods
- preconditioners P^{-1} are used so that P^{-1}A has a small condition number:
Ax = b \Leftrightarrow P^{-1}Ax = P^{-1}b
- Most simple preconditioner: Jacobi preconditioner
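A plain (unpreconditioned) CG sketch for symmetric positive definite A, following the standard recurrences (function name is mine):

```python
import numpy as np

def conjugate_gradient(A, b, x0=None, tol=1e-10, max_iter=None):
    """Plain CG for symmetric positive definite A (no preconditioning)."""
    n = len(b)
    max_iter = n if max_iter is None else max_iter
    x = np.zeros(n) if x0 is None else x0.astype(np.float64).copy()
    r = b - A @ x                 # residual = steepest descent direction
    p = r.copy()                  # first search direction
    rs_old = r @ r
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rs_old / (p @ Ap)         # step length along p
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs_old) * p     # next A-conjugate direction
        rs_old = rs_new
    return x
```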
Multigrid methods
- previous linear solvers: only act locally
- due to the sparsity of the matrix, one iteration distributes the information at a pixel only to its four neighbors
- idea of multigrid solvers: shorten distances by using coarser point of view
- additional effect: fewer entries, iterations are faster at coarse levels
Unidirectional (cascadic) multigrid
- downsample image, create linear system from this
- compute an approximate solution at the coarse grid (e.g. SOR)
- take upsampled result as init guess for next finer grid
- Refine (e.g. SOR)
- properties:
- iterations at coarse level fast
- simple implementation
- coarse levels do not approximate original system well
Correcting (bidirectional) multi-grid
- do not downsample image but error
- compute first solution at fine grid
- correct error at coarse grid
- refine result at finer grid
Full multigrid
- combine cascadic and correcting multigrid
- start at coarse grid with downsampled image
- at each finer level, apply a W-cycle
Integer problems
- so far: problems with continuous variables
- segmentation and matching problems usually lead to integer problems
- these are combinatorial problems, few are solvable in polynomial time
- typical problems: linear programs, integer quadratic programs, second-order cone programs
Variational Methods
Continuous energies
- continuous formulation: u^*(x) = \text{argmin}_{u(x)} E(u(x)) with u: \Omega \rightarrow \mathbb{R}
- energy functional, optimization based on calculus of variation
- Necessary condition(s) for a minimum: Euler-Lagrange equations
- advantage: consistency with the continuous model; for a purely discrete energy there is no assurance that it is consistent with a continuous one (e.g. when measuring line lengths)
- disadvantage: gradients with respect to a function have to be computed
Consistency
- discretization of continuous quantities can lead to artifacts if done wrong
- the solution of the problem should be independent of rotation (rotation invariance)
- A discretization is called consistent if it converges to the continuous model for finer grid sizes
- Example: discrete approximation of the gradient magnitude: the error must vanish with decreasing grid size
Calculus of variation
Calculus of variation and the Gâteaux derivative to compute the gradient \frac{\delta E(u(x))}{\delta u(x)} of a functional with respect to a function u(x)
Gâteaux derivative
- functional maps each element u of a vector space (e.g. a function) to a scalar E(u)
- the Gâteaux derivative generalizes the directional derivative to infinite dimensional vector spaces:
\frac{\delta E(u(x))}{\delta u(x)}\Big|_h = \lim_{\epsilon \rightarrow 0} \frac{E(u + \epsilon h) - E(u)}{\epsilon} = \frac{dE(u + \epsilon h)}{d\epsilon}\Big|_{\epsilon = 0}
- can be interpreted as the projection of the gradient onto the direction h:
\frac{\delta E(u(x))}{\delta u(x)}\Big|_h = \left\langle \frac{dE(u)}{du}, h \right\rangle = \int \frac{dE(u)}{du}(x)\, h(x)\, dx
- the gradient \frac{dE(u)}{du}(x) is needed to minimize E(u)
Denoising example
E(u(x)) = \int_\Omega (u(x) - I(x))^2 + \alpha|\nabla u(x)|^2\, dx
- rather than over a vector, the optimization problem is over a function
- shorthand (the dependency of the functions on x is clear): E(u) = \int_\Omega (u - I)^2 + \alpha|\nabla u|^2\, dx
- condition for a minimum: \frac{\delta E(u)}{\delta u}\Big|_h = 0 \quad \forall h
- compute the Gâteaux derivative and simplify
- Add +\epsilon h to all occurrences of u
- differentiate with respect to \epsilon
- integration by parts to turn h_x and h_y into h
- directional derivative must be 0 for all directions
- yields two conditions
- gradient must be zero: Euler-Lagrange equation set to 0
- the boundary term must vanish (natural boundary conditions, coincide with homogeneous Neumann boundary conditions)
Implementation
- Discretization of \frac{dE(u)}{du} = (u - I) - \alpha(u_{xx} + u_{yy}) = 0 leads to a linear system of equations
\frac{\partial E}{\partial u_{i,j}} = (u_{i,j} - I_{i,j}) - \alpha(u_{i-1,j} + u_{i,j-1} - 4u_{i,j} + u_{i+1,j} + u_{i,j+1}) = 0
- to minimize the discrete energy, the same linear solvers can be applied
- different discretization stages do not always lead to same algorithm
Discontinuity preserving regularization
- A quadratic regularizer leads to blurred edges
- the smoothness assumption does not hold everywhere (outliers at edges)
- edge preservation can be achieved by using non-quadratic regularizers
- statistical interpretation of regularization
Bayesian approach
- Energy minimization can also emerge from a probabilistic approach taking into account likelihoods and prior probabilities
- Bayes formula: p(u|I) = \frac{p(I|u)p(u)}{p(I)}
- Find solution u that maximizes p(u|I) (maximum a-posteriori (MAP) approach)
- Marginal does not depend on u, can be ignored in maximization
Energy minimization
- MAP estimation: p(u|I) \propto p(I|u)p(u) \rightarrow \max
- Noise model: independently and identically distributed Gaussian noise: p(I|u) \propto \prod_{x \in \Omega} \exp(-(u(x) - I(x))^2)
- A-priori model: the gradient is likely to be small (smoothness). Gaussian distribution with variance \frac{1}{2\alpha} on the gradient:
p(u) \propto \prod_{x \in \Omega} \exp\left(-\frac{|\nabla u(x)|^2}{\frac{1}{\alpha}}\right)
- turn maximization into minimization of the negative logarithm (the logarithm is monotonic): energy minimization problem:
-\log p(I|u) - \log p(u) = \int_{\Omega} (u-I)^2 + \alpha|\nabla u|^2 dx \rightarrow \min
Interpretation
- Other noise models lead to different penalizers in the energy function
- Expecting outliers in the smoothness assumption, replace the Gaussian model with a Laplace distribution
- Leads to an energy function that preserves image edges
- Tikhonov regularizer \Psi (s^2) = s^2
Gradient descent
- Euler-Lagrange equation is nonlinear in the unknowns
- nonlinear system of equations
- gradient descent
- Iterative
- start with initial value
- iteratively move in direction of largest decrease in energy (negative gradient direction)
- converges to next local minimum
- if the energy is convex (e.g. a linear combination of convex terms), gradient descent will find a global optimum
- slow convergence
Faster: lagged diffusivity
- keep nonlinear prefactor fixed (regain linear system) and compute updates iteratively
- proven to converge if linear system in each step is solved exactly
- only few iterations, much faster than gradient descent
Motion estimation
- Given a sequence of images I(x, y, t), what is the motion of each pixel between subsequent frames
- Vector field (u, v)(x, y, t) that transforms the second image into the first one
- optical flow: move each point (x, y) at time t, such that it fits the corresponding point at time t + 1
- Can be used for detection or segmentation of moving objects (motion segmentation)
- scene flow: estimate 3D motion of objects
- structure from motion: estimate 3D structure
- the apparent motion in the image plane (optic flow) differs from the true motion
- Ambiguities: the motion cannot be determined from the image data alone; any black pixel could be moved to any black pixel
- assumptions are needed for a unique solution. The neighborhood (aperture) determines the solution
Optic flow constraint
- Gray value of moving pixel stays constant: I(x + u, y + v, t + 1) - I(x, y, t) = 0
- Constraint nonlinear in unknown flow: difficult estimation
- Linearize with Taylor expansion: I(x + u, y + v, t + 1) = I(x, y, t) + I_xu + I_yv + I_t + O(u^2, v^2)
- Linearized optic flow constraint: I_xu + I_yv + I_t = 0
- only sufficiently precise if the images are smooth and the flow is small
Lucas-Kanade method
- regarding a single pixel, we cannot uniquely determine its flow vector
- we need a second assumption to obtain a unique solution (two unknowns, one equation)
- assume that motion is constant within a local neighborhood
- results in multiple equations for the two unknowns
I_x(x', y', t)u + I_y(x', y', t)v + I_t(x', y', t) = 0 \qquad \forall x', y' \in \Re(x, y)
- over-determined system: not a unique solution anymore
- we seek a solution that minimizes the total squared error
\text{argmin}_{u, v} \sum_{x, y \in \Re}(I_x(x, y)u + I_y(x, y)v + I_t(x, y))^2
- condition for minimum is derivative to u and v are both zero
- linear system at each pixel:
\begin{bmatrix}\sum I_x^2 & \sum I_x I_y \\ \sum I_x I_y & \sum I_y^2\end{bmatrix} \begin{bmatrix}u\\v\end{bmatrix} = \begin{bmatrix}-\sum I_x I_t \\ -\sum I_y I_t\end{bmatrix}
- instead of a uniformly weighted, box-shaped window, we can also use a Gaussian window
- the windowed weighted sums amount to a convolution
- unique solution only if system is not singular
- local neighborhood must contain non-parallel gradients
- otherwise aperture problem still present
- assumption of locally constant motion not realistic (boundaries)
- there are no direct constraints on the smoothness of the flow field
- advantages: fast and simple
- local method: point estimates are computed independently
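A compact NumPy sketch of the per-pixel 2x2 normal equations, assuming a box-shaped window and central differences (names, window size and the singularity threshold are my own choices):

```python
import numpy as np

def lucas_kanade(I1, I2, window=7):
    """Estimate a flow vector (u, v) per pixel from two consecutive frames.

    Aggregates the linearized constraint I_x u + I_y v + I_t = 0 over a local
    window and solves the resulting 2x2 system at every pixel.
    """
    I1 = I1.astype(np.float64)
    I2 = I2.astype(np.float64)
    Ix = np.zeros_like(I1); Iy = np.zeros_like(I1)
    Ix[:, 1:-1] = 0.5 * (I1[:, 2:] - I1[:, :-2])
    Iy[1:-1, :] = 0.5 * (I1[2:, :] - I1[:-2, :])
    It = I2 - I1

    def box_sum(F):               # windowed sum via a separable box filter
        k = np.ones(window)
        F = np.apply_along_axis(lambda r: np.convolve(r, k, mode='same'), 1, F)
        return np.apply_along_axis(lambda c: np.convolve(c, k, mode='same'), 0, F)

    Sxx, Syy, Sxy = box_sum(Ix * Ix), box_sum(Iy * Iy), box_sum(Ix * Iy)
    Sxt, Syt = box_sum(Ix * It), box_sum(Iy * It)

    det = Sxx * Syy - Sxy ** 2
    valid = det > 1e-6            # skip (near-)singular systems: aperture problem
    u = np.zeros_like(I1); v = np.zeros_like(I1)
    u[valid] = (-Syy[valid] * Sxt[valid] + Sxy[valid] * Syt[valid]) / det[valid]
    v[valid] = ( Sxy[valid] * Sxt[valid] - Sxx[valid] * Syt[valid]) / det[valid]
    return u, v
```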
Global optic flow estimation
- global dependency between points due to global smoothness assumption
- global solution derived with variational methods
- basic model: Horn-Schunck method
- can be further extended to yield highly accurate estimates in many practical situations
Horn-Schunck model
- assumes gray value constancy too
- global smoothness of the flow field: |\nabla u|^2 + |\nabla v|^2 \rightarrow \min
- can be combined to energy functional, sought optical flow is minimizer of this energy
- energy is convex: we get a unique global optimum
- Use Gâteaux derivatives to obtain Euler-Lagrange equations
- discretized versions return linear equation system to solve, can be written in matrix-vector notation
- matrix is sparse with only few diagonals different from zero
- blocks structure (data term and smoothness term) due to coupling of u and v
- can be solved with linear solvers (Gauss-Seidel, SOR, Multigrid)
Problems
- illumination changes will “distract”
- motion discontinuities are blurred
- underestimation of large displacements
- outliers due to non-Gaussian noise or occlusion
Average angular error
- accuracy of optical flow: average angular error
- requires ground truth
- measures the 3D angle between correct and estimated flow vectors
- also captures errors in length of the 2D flow vector
Motion discontinuities and occlusions
- consider a robust function applied to the smoothness term to allow for motion discontinuities
- occlusions can be approached by applying a robust function to the data term
- nonlinear system can be solved with lagged nonlinearity
Illumination changes
- gray value constancy assumption not fulfilled due to illumination changes
- remedy: matching criterion that is invariant to illumination changes
- the gradient is invariant to additive changes, use that as alternative constraint
Large motion
- linearization of the constancy assumptions is only valid for small displacements (subpixel)
- to tackle larger displacements, we must refrain from such a linearization
- difficulties: non-convex, Euler-Lagrange equations get highly nonlinear
- can be handled with Gauss-Newton method and coarse-to-fine approach
Segmentation and Grouping
Segmentation
- Partition of the image domain \Omega into several (usually disjoint) regions \Omega_i
- special case: two-region segmentation (foreground-background)
- difference to edge detection: requires closed contours
- edge detection is part of the solution as it provides pieces of potential contours
- exponentially many possibilities
- a hierarchical decomposition would be ideal (object segmentation)
- impossible without prior knowledge
- can be seen as a grouping process combining pixels into superpixels
Feature space
- intensity, color, texture, motion, disparity, depth
- first-order features vs second-order features
- first-order: provided by sensor
- second-order: derived from data
Edge-based vs region-based methods
- edge-based techniques employ an edge indicator or pairwise similarities between pixels. Fit closed contours such that the contour coincides with strong edges (low similarities)
- region-based techniques employ a statistical model of each region. Regions should be as homogeneous as possible. Include pixels in different regions that are maximally different
- region-based techniques based on local statistics approach edge-based techniques in the limit
Thresholding
- simplest way to arrive at a segmentation
- convert intensity image into a binary image to label the two regions
- can be generalized to N regions by introducing N - 1 thresholds
- Problems
- no threshold that separates the objects
- context is ignored
- region-based with very simple region model
Clustering
- segmentation is similar to clustering: assign pixels to regions
- example: k-means clustering
- initialize: all pixels belong to random region
- mean feature vector for each region
- move pixel to another region if this decreases the total distance
- iterate until pixels do not move any longer
- for K = 2, k-means is like thresholding with an automatically determined threshold
- clustering ignores spatial context: only the features determine the assignment, not the position
- adding pixel coordinates to feature vector is not equivalent to enforcing smooth region contours
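A minimal sketch of k-means (Lloyd's algorithm) on pixel intensities (names and the random initialization are my own choices; only the gray value is used as a feature, so spatial context is ignored, as noted above):

```python
import numpy as np

def kmeans_segmentation(I, k=2, n_iter=50, seed=0):
    """Cluster pixel intensities into k regions and return a label image."""
    rng = np.random.default_rng(seed)
    f = I.astype(np.float64).ravel()
    means = rng.choice(f, size=k, replace=False)          # initial region means
    labels = np.zeros(f.shape, dtype=np.int64)
    for _ in range(n_iter):
        # assign each pixel to the closest region mean
        labels = np.argmin(np.abs(f[:, None] - means[None, :]), axis=1)
        new_means = np.array([f[labels == j].mean() if np.any(labels == j) else means[j]
                              for j in range(k)])
        if np.allclose(new_means, means):                  # no pixel moved any longer
            break
        means = new_means
    return labels.reshape(I.shape), means
```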
Greedy heuristics
- Region growing: seed points are initial regions
- for each point in boundary: if a neighbor is similar enough, it is assigned to the region
- grow until there are no more similar pixels along the boundary
- Region merging: initially all pixels are their own region
- two most similar regions are successively merged into a larger region
- repeat until similarity threshold or a given number of regions is reached
- some dissimilarity criteria
- Euclidean distance of the feature means: d^2(\Re_1, \Re_2) = (\mu_1 - \mu_2)^2
- mean Euclidean distance along the common boundary
Watershed segmentation
- Illustrative description: regard the image gradient magnitude as mountains and let it rain
- water flows downhill and gathers in catchment basins: the regions
- regions meet at the watersheds (edges)
- tiny gradient fluctuations result in over-segmentation (pre-smooth the image!)
Energy minimization: snakes model
- use energy functional
E(C) = - \int_0^1 |\nabla I(C(s))|^2 ds + \alpha \int_0^1 |C_s(s)|^2 ds
- C : [0, 1] \rightarrow \Omega is parametric contour. C_s denotes first derivative
- first term: external energy (depends on input image)
- second term: internal energy (independent of the image data)
- minimizing the external energy drives the contour to follow maxima of the gradient
- consider all possible contours and choose the one that best captures the maxima
- external energy alone would lead to a fractal contour with infinite length
- avoided by the internal energy, which penalizes the length of the contour
- together: prefer a compromise of a short contour which captures as much image gradient as possible
- with calculus of variation and gradient descent: local minima
Implicit representation of contours
- indicator function \phi: \Omega \rightarrow [-1, 1]
- the zero-level line represents the contour: C = \{x \in \Omega | \phi(x) = 0\}
- to evolve C, evolve \phi
- allows topological changes, can be applied in any dimension
- represents contour and enclosed region
- level set
Region-based active contours
- energy minimization based on region statistics
- states the optimal separation of pixel intensities
- similar to k-means clustering (two-means) but with an additional constraint on the length of the separating contour
- expressed using an implicit representation of the contour; the optimal region means can be found analytically as the mean intensities inside the two regions
Graph cuts for segmentation
- combinatorial optimization
- structure for min-cut
- each pixel represented by node, connected to neighbors via edges
- neighborhoods can have different complexity (most simple: 4-neighborhood)
- two extra nodes (source and target) connect to all pixel nodes
- goal: find minimum cut through graph that separates source and target
- source and target nodes correspond to regional models
- connection between neighbors enforces a regular labeling
- can be formulated as energy problem
- two terms:
- first term: comprises the links to the source and target nodes
- second term: comprises the n-links between neighboring pixels. Usually fixed weights, but can depend on image gradient
- for fixed means \mu_1, \mu_2 it can be solved in polynomial (avg. case: linear) time
Semantic segmentation
- run classifier which yields a score S_i(x) for each class i
- minimize with level sets or graph cuts
Interest points and local descriptors
- matching of local structures
Block matching
- straightforward way to match points in images
- regard the image patch around each point in image 1
- compare it to image patches around all points in image 2
- Expensive: O(kN^2) with k the patch size and N the number of pixels per image
- not invariant to typical appearance changes
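A basic SSD block-matching sketch for a single point (names and the limited search radius are my own; the reference point is assumed to lie far enough from the border of image 1):

```python
import numpy as np

def match_point(I1, I2, y, x, patch=7, search=15):
    """Find the best match in I2 for the patch around (y, x) in I1 (SSD criterion).

    A limited search radius keeps the cost manageable; exhaustive search over
    the whole image would give the O(k N^2) complexity mentioned above.
    """
    r = patch // 2
    I1 = I1.astype(np.float64); I2 = I2.astype(np.float64)
    ref = I1[y - r:y + r + 1, x - r:x + r + 1]
    best, best_ssd = (y, x), np.inf
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            yy, xx = y + dy, x + dx
            if yy - r < 0 or xx - r < 0 or yy + r + 1 > I2.shape[0] or xx + r + 1 > I2.shape[1]:
                continue
            cand = I2[yy - r:yy + r + 1, xx - r:xx + r + 1]
            ssd = np.sum((ref - cand) ** 2)     # sum of squared differences
            if ssd < best_ssd:
                best_ssd, best = ssd, (yy, xx)
    return best, best_ssd
```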
Interest points
- only limited number of matches needed
- idea: do not match all points, but only promising subsets
- requirements
- point must come with enough information for unique matching
- subset in image 2 must contain matches from subset in image 1
Corner detection
- choose points with high information content and clear localization (corner points)
- with structure tensor, measure cornerness (fast to compute)
- eigenvalue decomposition of the structure tensor
- smoothing integrates gradients from the neighborhood
- eigenvectors yield the dominant orientation in this neighborhood and perpendicular orientation
- eigenvalues yield the structure magnitude in these directions
- a large second eigenvalue indicates strong structures in multiple directions (corners)
- corner: local maxima of second eigenvalue
- problem: depends on image scale
Characteristic scale
- consider a Gaussian pyramid of smoothed images
- characteristic scale can be computed based on the Laplacian
- yields an estimate of scale shift between two images
- uniqueness is not ensured
- may be multiple local maxima
- even global maximum need not be unique
- maximum operator is not robust, little noise can lead to different outcome
- alternative to Harris-Laplace detector
- considers local maxima of Laplacian scale space
- advantage: does not mix apples and oranges (corner detector and Laplacian)
- Laplacian focuses on blobs rather than corners (complementary information, one might be interested in both)
Block matching at interest points
- positive issue of interest points
- significantly reduced complexity
- with 100 detected points in both images, one has to compare only 10000 patches instead of 96 billion in a 640x480 image
- negative issues
- non-dense displacement fields (important matches might be missed)
- corresponding patches can be slightly shifted
- further problems (independent of interest points)
- patches in both images may look very different due to rotation, transformations, lighting changes, blurring
- subpixel accuracy not available
Invariant: local descriptors
- local descriptors are vectors that contain information about the local neighborhood of an interest point
- simplest local descriptor: block of certain size centered at interest point
- goal: design local descriptors that are invariant under the mentioned transformations
scale invariance
- use characteristic scale from interest point detection
- choose and normalize the size of the blocks: structures have same scale in both images
affine invariance
- affine transformation: f(x) = Ax + t
- maps a circle to an ellipse (or vice-versa)
- approximation of a projective transformation
- parameters can be estimated (region detector, maximally stable extremal regions)
affine region detector
- maximally stable extremal regions
- encircled by large gradients
- obtained by watershed-like algorithm
- apart from scale also yields elongation of fitted ellipses
histograms
- alternative to normalized neighborhood: derive invariant features within the fixed block
- gray value histogram: rotational invariance, invariant to blurring
- sensitive to lighting changes (bad)
- loss of information (very bad)
- histogram of gradient directions (orientation histograms)
- invariant to (additive) lighting changes
- building block of many successful descriptors
SIFT/HOG descriptor
- based on local assembly of orientation histograms and adaptive local neighborhoods
- extract SIFT feature points
- strongest responses of Laplacian scale space
- fit quadratic function to obtain subpixel accuracy
- create orientation histogram at selected scale
- the peak of the smoothed histogram gives the estimated orientation
- in case of two peaks, create two feature points
- estimation of position, scale and orientation
- affine invariance provided with MSER
- object recognition: dense sampling of such points at all positions and scales, no rotation invariance
Algorithm
- compute gradient orientation and magnitude at each pixel
- compute orientation indicator at each pixel
- create N \times M \times 8 array and init with zero
- quantize the orientation at each pixel and add respective magnitude to the respective entry in array
- local integration results in orientation histogram (smooth array with gaussian kernel)
- smooth in orientation direction
- sample feature vectors from the image
- normalize the feature vector
Shape context
- descriptor for matching edge maps
- insensitive to small deformations and spurious edges
Descriptors learned with convolutional networks
- CNNs are trained on large datasets with class labels
- network learns a representation that is good for object classification
- intermediate layer outputs turn out to be good generic descriptors
Unsupervised training to trigger invariant features
- CNN to discriminate surrogate classes
- applied transformations: translation, rotation, scaling, color, contrast, brightness, blur
- transformations define invariance properties of the features to be learned
Shapes from X
- reconstruct the 3D structure of a scene
- missing cue is depth of an observed point
- any point along the projection ray that passes through the camera center and the image point could have caused the observation
- thus from the position of a single image point we cannot determine the coordinates of the corresponding 3D point
- many ways (hence shape from X)
- prominent: reconstruction from binocular images
- center of second camera must not coincide with the one of the first camera
Stereo reconstruction
- need two corresponding image points (disparity estimation)
- reconstruct the two projection rays; their intersection is the sought 3D point
- projection properties of camera needed
- comprised in the projection matrix P (camera calibration)
Multiview
- add further cameras
- geometrically, only two cameras needed
- practice
- find corresponding points
- no correspondence for occluded points
- accuracy is limited by pixel grid
- more cameras decrease the noise in the correspondences and increase the overall accuracy of the depth estimates
- drawback: multiple cameras are more difficult to calibrate
Structure from motion
- single moving camera observes static 3D point
- advantages
- plenty of images from single camera
- reasonable frame rate: displacements are small
- problems
- no fixed, calibrated camera setup: the camera motion (egomotion) must be estimated together with the structure
- scene must be static, otherwise ambiguities
- sometimes the camera motion is approximately known (other sensors!)
Decomposition of the projection matrix
- factorize projection matrix P = KM
- camera calibration matrix K, contains the camera's internal parameters
\begin{bmatrix}\alpha_x & s & x_0 \\ 0 & \alpha_y & y_0 \\ 0 & 0 & 1 \end{bmatrix}
- pose of the camera relative to a world coordinate system M = (R|t):
rotation matrix and a translation vector with 3 degrees of freedom (external parameters)
How it works
- camera’s internal parameters are calibrated
- calibrate external parameters online (can be estimated from point correspondences)
- internal parameters now stay fixed
- in case of autofocus, this is not the case. Internal parameters can also be estimated online (autocalibration), but this decreases the robustness of the estimation
- when P = KM (fully calibrated), we can estimate the 3D structure
loop closing
- robotics: self-localization and mapping (SLAM)
- localization errors accumulate over time
- when camera returns to earlier position and if localization errors are small enough, the image can be matched to earlier image (loop closing)
- special case of object recognition (instance recognition)
- loop closing is the only way to correct accumulated errors (drift)
Shape from silhouette
- correspondences: established for points whose projection rays hit surface in nearly orthogonal direction
- if projection is tangential to surface: structures on surface are severely deformed (no robust matching)
- shape from silhouette is a kind of contrary approach: the shape is inferred from points where the surface is tangential to the projection rays
- silhouette: restricts the volume that can be occupied by the observed object
- points along projection rays that see background cannot belong to the object volume
- points along projection rays that see foreground may or may not belong to object volume
- with more cameras, the volume that can be occupied is successively reduced
- remaining volume is the maximal volume that is consistent with all silhouettes (visual hull)
- helps find good initializations for other reconstruction methods (e.g. stereo)
as 3D segmentation
- finding the silhouette in the images (segmentation) and reconstructing the surface can also be considered as a single 3D segmentation problem
- task: find a surface that consistently separates all images into a foreground and a background region
shape from shading
- given a single image, compute shape, albedo, scattering, illumination
- inverse rendering: reconstruct from the observed image
- severely under-constrained problem
Bidirectional reflectance distribution function (BRDF)
- single light source and surface element
- direction of light in polar coordinates (\theta, \phi) and incoming energy E(\theta, \phi)
- light emitted by surface in directing (\theta_e, \phi_e) given by L(\theta_e, \phi_e)
- function describing reflectance properties of material called bidirectional reflectance distribution function (BRDF):
L(\theta_e, \phi_e) = f(\theta, \phi, \theta_e, \phi_e)E(\theta, \phi)
- in more detailed models, the BRDF can have more than four parameters
Lambertian reflectance
- a Lambertian surface emits incoming light uniformly in all directions; the radiance only depends on the angle between the surface normal n and the light source direction s: I(x, y) = R(X, Y, Z) = \rho s^T n
- rough materials (stone, textiles, …)
- most of these materials absorb some incoming light. Non-absorbed fraction is called albedo \rho
- model often used in computer vision (stereo matching)
- constant \rho and a given light source direction s lead to basic shape from shading
Horn
- sought depth map Z(x, y) can be related to surface normal direction n. Estimate n
- derivatives of surface S(x, y) = (x, y, Z(x, y)) with respect to x and y yield two vector fields that span the tangential plane
\partial_x S = \begin{bmatrix}1\\0\\p\end{bmatrix} \qquad \partial_y S = \begin{bmatrix}0\\1\\q\end{bmatrix}
- cross product of two vector fields yields the normal field
n = \frac{1}{\sqrt{p^2+q^2+1}} \begin{bmatrix}-p\\-q\\1\end{bmatrix}
- plug into the Lambertian radiance function: like in optical flow, one equation, two unknowns
Ikeuchi-Horn
- assuming smooth surface, leads to variational approach
- derive Euler-Lagrange equations, minimizer for p and q can be found by gradient descent
- Energy is non-convex and can have multiple local minima. Coarse-to-fine methods can be used as heuristic to find global minimum
SIRFS: Shape, Illumination and Reflectance from Shading
- recent framework by Barron and Malik considers more components of inverse rendering
- includes multiple sources of prior knowledge
- priors learned from training data
- reflectance is mostly smooth and histogram is sparse
- shape is mostly smooth
- isotropy of shapes and the fact that an observed surface is likely to point towards observer
- natural lighting prior
Shape from defocus
- series of images with focus set to different depths
- estimate the relative blur d \sigma needed to match the other image
- amount of blur needed determines the depth
Shape from texture
- if appearance of texture is known (regular, repetitive pattern)
- usually: texture not known. Practical assumptions are that the texture is homogeneous, isotropic and stationary
- from repetitive nature of texture elements under these conditions we can estimate the non-deformed shape of such an element and the surface that causes its deformation
Object Recognition and deep learning
- instance recognition seeks to detect the presence of a known object in a new image
- the object should also be localized in the image (detection)
- major problem: the object might look quite different (viewpoint, lighting, occlusions, …)
- important difference to object class recognition or object class localization: the same instance of an object class is seen in the two images
- there can be large differences between two instances of an object class (eg. two dogs)
Classification problem
- two main parts: set of features that describe the object and a classifier that separates objects
- classical approach: use a handcrafted feature representation and learn the classifier
- deep learning: learn both
- typical classifiers: support vector machine (two-class), logistic regression (multi-class, used in deep networks)
- nearest neighbor classifier (multi-class)
Support vector machine
- Basic idea: learn a decision function with maximum margin (distance to most critical training points)
- decision function modeled as linear combination of features y(x) = w^T \phi(x) + b
- the large margin concept leads to a convex optimization problem:
\text{argmin}_{w,b} \frac{1}{2} ||w||^2 subject to t_n(w^T \phi(x_n) + b) \geq 1
- efficient code publicly available
Nearest neighbor classifier
- assign the class label of most similar training point
- advantages
- simple concept
- works for multiple classes
- drawbacks
- all training samples must be stored (no generalization)
- the search for the most similar training sample is time-consuming
Feature representation
- color
- local descriptors (e.g. SIFT)
- discriminative properties good
- easy to extract from input images
- describe the object locally (robust to occlusions, local variation)
- fixed level of abstraction
- deep representations
- multiple levels of abstraction
- learned from training data
Invariance requirements in object recognition
- object should be recognized even if appearance has changed due to typical transformations
- descriptors and classifiers are called invariant with respect to transformation
- tradeoff between invariance and discriminative power of descriptor
- instance recognition: invariance needed with respect to
- viewpoint
- background
- lighting
- partial occlusion
- class recognition: needs complex invariant features learned from training examples
Some popular local descriptors
- distortion corrected patches
- SIFT/HOG (orientation histograms)
- Shape context
Feature learning
- instead of manual descriptor design, let computer find optimal descriptor for a task from a training set
- task: object classification
- training set consists of images and their class labels
- \text{Image} \rightarrow f(I) \rightarrow \text{features} \rightarrow C(f(I)) \rightarrow \text{Class}
- shallow modeling of function f(I) is not efficient to cover all the variation that appears in an object class
deep network for image classification
- each layer produces more abstract representation of the input
- layers are similar in structure
- weights of connections between layers (filters) learned from training data
- fully connected layer
- 200x200 image: 40k hidden units, 2 billion parameters
- spatial correlation is local
- waste of resources, not enough training samples
- locally connected layer
- 200x200 images, 40k hidden units, Filter size: 10x10, 4M parameters
- parametrization is good when input image is registered (face recognition)
- stationary: statistics are similar at different locations
- convolutional layer
- share the same parameters across different locations: convolutions with learned kernels
- single layer: max-pooling replaces small local area of feature map by maximum value in that area (downsampling)
- data reduction, loss of localization accuracy
- allows for larger receptive fields in next layer
- ReLU nonlinearity sets negative values to 0 (no activation)
Large ConvNet
- up to 30 layers
- cost function for training the weights, classification error on training set
w^* = \text{argmin}_w \left(-\sum_i y_i \log(f_w(x_i))\right)
- optimization by gradient descent (back-propagation)
Back-propagation
- initialize the weights
- Compute f_w(x_i) by forward-propagating a sample through the network
- gradient with regard to a weight in a certain layer: use chain rule
- update the weights
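A tiny NumPy sketch of one forward/backward pass for a two-layer fully connected net with ReLU and softmax cross-entropy; this is only an illustration of the chain rule used in back-propagation, not the lecture's ConvNet (all names and shapes are my own assumptions):

```python
import numpy as np

def backprop_step(x, y, W1, W2, lr=1e-2):
    """One gradient-descent step; x: input vector, y: one-hot target."""
    # forward pass
    h_pre = W1 @ x
    h = np.maximum(h_pre, 0.0)                 # ReLU sets negative values to 0
    logits = W2 @ h
    p = np.exp(logits - logits.max())
    p /= p.sum()                               # softmax probabilities
    loss = -np.sum(y * np.log(p + 1e-12))      # cross-entropy

    # backward pass (chain rule, layer by layer)
    d_logits = p - y                           # gradient of the loss w.r.t. logits
    dW2 = np.outer(d_logits, h)
    d_h = W2.T @ d_logits
    d_h_pre = d_h * (h_pre > 0)                # ReLU derivative
    dW1 = np.outer(d_h_pre, x)

    # weight update
    W1 -= lr * dW1
    W2 -= lr * dW2
    return loss
```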
Nice properties
- filters of different layers capture different scales due to max-pooling
- filters at low layers are well localized and consider only small image area (small receptive field)
- filters at high layers are badly localized and capture context
- feature sharing: same features obtained at low layers can be useful for assembling complex features at higher layers
- successive abstraction
- features close to image input resemble simple edge filters
- features close to class output are invariant to the typical appearance variations
Localization tasks
- image classification only provides a class label per image
- object localization: provide bounding box of the object
- object detection: bounding boxes for potentially many object instances in the image
- semantic segmentation: for each pixel, which object class does it belong to?
- instance segmentation: additionally separates different class instances
Sliding window approach
- define features in local window
- consider many (or all) positions and scales and make a binary decision: is this window a car or not?
- non-maximum suppression: keep only local maxima
- convolutional networks applied in a sliding-window fashion contain redundant computation
- exploit this for greater efficiency
deep network classification on object window proposals (R-CNN)
- input image \rightarrow extract region proposals \rightarrow compute CNN features \rightarrow classify regions
- instead of all windows, only \approx 1000 region proposals are considered
- faster implementations first compute the ConvNet activations of the whole image and use them to classify the proposal windows
Object recognition benchmarks
- common benchmark needed
- ImageNet, PASCAL Visual Object Classes, Microsoft COCO
- training set and test set
- detection performance is assessed by precision-recall curves
Precision-recall curves
- precision: which part of the detected objects is correct: \frac{\text{true positives}}{\text{true positives} + \text{false positives}}
- recall: which percentage of objects is detected \frac{\text{true positives}}{\text{true positives} + \text{false negatives}}
- Precision-recall curves obtained by different thresholds on the classification score
- single number: average precision
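A sketch of how such a curve and a simple (step-wise) average precision can be computed from detection scores (function name is mine; benchmarks like PASCAL use slightly different, interpolated AP definitions):

```python
import numpy as np

def precision_recall(scores, is_true_positive, n_objects):
    """Precision/recall over all score thresholds plus a simple average precision.

    scores: classifier scores of the detections, is_true_positive: boolean per
    detection, n_objects: number of ground-truth objects (denominator of recall).
    """
    order = np.argsort(-np.asarray(scores, dtype=np.float64))   # highest score first
    tp_flags = np.asarray(is_true_positive, dtype=np.float64)[order]
    tp = np.cumsum(tp_flags)
    fp = np.cumsum(1.0 - tp_flags)
    precision = tp / (tp + fp)
    recall = tp / n_objects
    # step-wise area under the precision-recall curve
    recall_prev = np.concatenate(([0.0], recall[:-1]))
    ap = np.sum((recall - recall_prev) * precision)
    return precision, recall, ap
```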
Rendering Pipeline
Rendering
- generate an image given a virtual camera, objects and light sources
- various techniques: rasterization, raytracing
- major research topics in computer graphics
Rasterization
- generate 2D images from 3D scenes
- transform geometric primitives such as lines and polygons into raster image representations
- 3D objects are approximated by vertices (points), lines and polygons
Rendering pipline
- processing stages comprise the rendering pipeline
- supported by commodity graphics hardware
- OpenGL and DirectX
- Task: 3D input (camera, objects, light source, textures) \rightarrow 2D output
- functionality: resolve visibility, lighting model, compute shadows, apply textures
Main stages
- vertex processing / geometry stage / vertex shader
- process vertices independently in same way
- transformations per vertex, computes lighting per vertex
- geometry shader
- generates, modifies, discards primitives
- primitive assembly and rasterization / rasterization stage
- assembles primitives such as points, lines, triangles
- converts primitives into raster image
- generates fragments / pixel candidates
- fragment attributes are interpolated from vertices of a primitive
- fragment processing /fragment shader
- processes all fragments independently in the same way
- fragments are processed, discarded or stored in framebuffer
Vertex processing (Geometry stage)
- position, orientate and scale each object and respective vertices in scene with model transform
- position and orientate camera by view transform
- i.e. inverse view transform is the transform that places camera at the origin of coordinate system, facing z-direction
- entire scene is transformed with inverse view transform
Lighting
- interaction of light sources and surfaces is represented with a lighting / illumination model
- lighting computes a color for each vertex (light source position and properties, materials)
- the projection transform P maps the view volume to the canonical view volume
- view volume depends on camera properties (orthographic projection vs perspective projection)
- canonical view volume is a cube from (-1, -1, -1) to (1, 1, 1)
- view volume is specified by near, far, left, right, bottom, top
- view volume (cuboid or frustum) is transformed into a cube
- objects inside the view volume are transformed accordingly
- orthographic
- combination of translation and scaling
- all objects are translated and scaled in the same way
- perspective
- complex transformation
- scaling factor depends on distance of an object to viewer
- objects farther away appear smaller
Clipping
- primitives that intersect the boundary of the view volume are clipped
- projected primitive coordinates (x_p, y_p, z_p) are transformed to screen coordinates (x_s, y_s)
- screen coordinates together with depth value are window coordinates (x_s, y_s, z_w)
- (x_p, y_p) are translated and scaled from range (-1, 1) to actual pixel positions (x_s, y_s) on the display
- z_p is generally translated and scaled from the range of (-1, 1) to (0, 1) for z_w
- screen coordinates (x_s, y_s) represent the pixel position of a fragment that is generated in a subsequent step
- z_w, the depth value, is an attribute of this fragment used for further processing
Fragment processing
Fragment attributes are processed and tests are performed
Attribute processing
- texture lookup
- texturing
- fog
- antialiasing
Tests
- scissor test: check if fragment is inside a specified rectangle
- used for, e.g., masked rendering
- alpha test: check if the alpha value fulfills a certain requirement
- stencil test: check if stencil value in the framebuffer at the position of the fragment fulfills a certain requirement
- depth test
- compare depth value of fragment and depth value of the framebuffer at the position of the fragment
- used for resolving visibility
Blending / Merging
- combines the fragment color with the framebuffer color at the position of the fragment
- usually: determined by the alpha values
- dithering
- finite number of colors
- map color value to one of the nearest renderable colors
- logical operations / masking
OpenGL
- graphics rendering API: display of geometric representations and attributes. Independent from OS and window system
- interaction with GPUs, hardware-accelerated
OpenGL 1.0
- fixed-function pipeline
- focus on parallelized implementation
- provided a quasi-standard for all components of a rasterization-based renderer
OpenGL 2.0
- fixed-function pipeline with programmable vertex and fragment processing
- optional: shaders (work on vertex / fragment)
OpenGL 3.0
- programmable vertex and fragment processing
- no fixed-function pipeline
- focus on core functionality
- improved handling of large data
- improved flexibility
- implementation of non-standard effects not restricted to “misusing” pipeline functionality
- programming
- not just a set of parameters of standard functionality
- concepts of vertex and fragment processing are not “nice to know”, but required knowledge
- OpenGL Mathematics glm
OpenGL 3.2
- geometry shader
- optional, modify geometric primitives
- generation of geometry no longer restricted to CPU
- flexible use of texture data
OpenGL 4.1
- tessellation shader
- optional
- tessellate patches
- flexible generation of large and detailed geometries
OpenGL 4.3
- compute shader
- optional
- perform arbitrary computations
- are not part of the rendering pipeline
GPU Data Flow
- data transfer to GPU (vertices with attributes and connectivity)
- vertex shader: program that is executed for each vertex. Input and output is vertex
- rasterizer
- fragment shader (each fragment)
- framebuffer update
Data Transfer
- Vertex Buffer Object VBO
- copy memory from CPU to GPU
- arbitrary data: typically vertex attributes
- Vertex Array Object VAO
- link between VBO and shader programs
- specifies how to interpret VBO data
- specifies the mapping to input variables of shaders
Shader
written in OpenGL Shading Language GLSL, runs on GPU, vertex and fragment shader are mandatory
Vertex shader
- works on vertices
- minimum functionality: transform from local to clip space
- can compute/set a color at vertices
- rasterizer can interpolate attributes from vertices to fragments
Fragment shader
- works on fragments; computes/sets the fragment color
Transformations
- congruent transformations (Euclidean transformations)
- preserve shape and size
- translation, rotation, reflection
- similarity transformations
- preserve shape
- translation, rotation, reflection, scale
- affine transformations
- preserve collinearity: points on a line are transformed to points on a line
- preserve proportions: ratios of distances between points are preserved
- preserve parallelism: parallel lines stay parallel
- angles and lengths are not preserved
- translation, rotation, reflection, scale, shear are affine
- orthographic projection is a combination of affine transformations
- perspective projection is not affine
- 3D point: p' = T(p) = Ap + t (A: 3x3 matrix for scale and rotation; t: translation)
- preserve affine combinations: T(\sum_i \alpha_i p_i) = \sum_i \alpha_i T(p_i) \qquad \forall \sum_i \alpha_i = 1
- e.g.: Line can be transformed by transforming control points
- all affine transformations are represented with one matrix-vector multiplication
Points and Vectors
- rendering pipeline transforms vertices, normals, colors, texture coordinates
- points (e.g. vertices) specify a location in space
- vectors (e.g. normals) specify a direction
- relations between points and vectors:
- p_1 - p_2 = \vec{v}
- p_1 + \vec{v} = p_2
- \vec{v_1} + \vec{v_2} = \vec{v_3}
- p_1 + p_2 undef
- \vec{p} = p - O \qquad p = O + \vec{p}
- transformations can have different effects on points and vectors
- translation
- move point to a different position
- does not change the vector
- using homogeneous coordinates, transformations of vectors and points can be handled in a unified way
Homogeneous Coordinates of Points
- \begin{bmatrix}x\\y\\z\\w\end{bmatrix} with w \neq 0 are homogeneous coordinates of the 3D point \begin{bmatrix}\frac{x}{w}\\\frac{y}{w}\\\frac{z}{w}\end{bmatrix}
- \begin{bmatrix}\lambda x\\\lambda y\\\lambda z\\\lambda w\end{bmatrix} represent the same point \begin{bmatrix}\frac{\lambda x}{\lambda w}\\\frac{\lambda y}{\lambda w}\\\frac{\lambda z}{\lambda w}\end{bmatrix} = \begin{bmatrix}\frac{x}{w}\\\frac{y}{w}\\\frac{z}{w}\end{bmatrix} for all \lambda \neq 0
Homogeneous Coordinates of Vectors
- a direction vector is represented with homogeneous component 0, i.e. \begin{bmatrix}x\\y\\z\\0\end{bmatrix}, so it is unaffected by translation
- general form of a homogeneous transformation matrix:
\begin{bmatrix}m_{00} & m_{01} & m_{02} & t_0 \\ m_{10} & m_{11} & m_{12} & t_1 \\ m_{20} & m_{21} & m_{22} & t_2 \\ p_{0} & p_{1} & p_{2} & w \\ \end{bmatrix}
- m_{ij}: rotation, scale
- t: translation
- p_i: projection
- w is analogous to the fourth component of points/vectors
Examples
- Translation
- Rotation
- inverse rotation
- mirroring / reflection
- scale
- shear
Orthogonal matrices
- rotation and reflection matrices are orthogonal RR^T = R^TR = I and R^{-1} = R^T
- R_1, R_2 are orthogonal \Rightarrow R_1R_2 is orthogonal
- rotation: \det R = 1 reflection: \det R = -1
- length of a vector does not change ||Rv|| = ||v||
- angles are preserved: \langle Ru, Rv \rangle = \langle u, v \rangle
- translation: p_2 = p_1 - t, i.e. p_2 = T(-t)p_1
- rotation: by rotation matrix
application
- view transform can be seen as basis transform
- objects are placed with respect to a global coordinate system C_1 = (O_1, \{e_1, e_2, e_3\})
- camera is positioned at O_2 and oriented at \{b_1, b_2, b_3\} given by viewing direction and up-vector
- after view transform, all objects are represented in the eye or camera coordinate system C_2 = (O_2, \{b_1, b_2, b_3\})
- placing and orienting the camera corresponds to the application of the inverse transform to the objects
- rotating the camera by R and translating it by T corresponds to translating the objects by T^{-1} and rotating them by R^{-1}
planes and normals
- planes can be represented by a surface normal n and a point r. All points p with n^T(p - r) = 0 form the plane.
- Composition is achieved by matrix multiplication.
- E.g.: apply T to p and then R: R(Tp) = (RT)p
- Note that: TR \neq RT
- order matters!
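A small NumPy sketch of points vs. vectors in homogeneous coordinates and of composition order (helper names and the concrete numbers are my own):

```python
import numpy as np

def translation(t):
    """4x4 homogeneous translation matrix."""
    T = np.eye(4)
    T[:3, 3] = t
    return T

def rotation_z(angle):
    """4x4 homogeneous rotation about the z-axis."""
    c, s = np.cos(angle), np.sin(angle)
    R = np.eye(4)
    R[:2, :2] = [[c, -s], [s, c]]
    return R

p = np.array([1.0, 0.0, 0.0, 1.0])      # point  (homogeneous component 1)
v = np.array([1.0, 0.0, 0.0, 0.0])      # vector (homogeneous component 0)

T = translation([0.0, 2.0, 0.0])
R = rotation_z(np.pi / 2)

print(T @ p, T @ v)                      # the vector is unaffected by translation
print((R @ T) @ p, (T @ R) @ p)          # composition order matters: TR != RT
```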
Projection
in 2D
- projection from v onto l maps point p onto p'
- p' is the intersection of the line through p and v with the line l
- v is viewpoint, center of perspectivity
- l is the viewline
- the line through p and v is the projector
- v is not on line l, p \neq v
- if homogeneous component of viewpoint v is not equal to zero, we have perspective projection
- if v is at infinity, we have parallel projection
Classification
- location of viewpoint and orientation of viewline determine the type of projection
- parallel (viewpoint at infinity, parallel projectors)
- perspective (non-parallel projectors)
General Case
- 2D projection is represented by the matrix M = v l^T - (l \cdot v) I_3
- l = \{ax + by + c = 0\} = (a, b, c)^T
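A short numpy sketch of this formula, projecting the point (2, 4) from the origin onto the line y = 1 (the example values are assumptions):

```python
import numpy as np

# Projection in 2D from viewpoint v onto line l, both in homogeneous coordinates.
# Line y = 1 -> 0*x + 1*y - 1 = 0 -> l = (0, 1, -1); viewpoint = origin -> v = (0, 0, 1).
l = np.array([0.0, 1.0, -1.0])
v = np.array([0.0, 0.0,  1.0])

M = np.outer(v, l) - (l @ v) * np.eye(3)   # M = v l^T - (l . v) I_3

p = np.array([2.0, 4.0, 1.0])              # point (2, 4)
q = M @ p
print(q / q[2])                            # [0.5 1.  1. ] -> intersection of the projector with y = 1
```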
Discussion
- matrices M and \lambda M represent the same transformation \lambda M p = \lambda p'
- 2D transformations in homogeneous form
\begin{bmatrix}m_{11} & m_{12} & t_x \\ m_{21} & m_{22} & t_y \\ w_x & w_y & h \\ \end{bmatrix}
- w_x and w_y map the homogeneous component w of a point to a value w' that depends on x and y
- therefore, the scaling of a point depends on x and / or y
- in perspective 3D projections, this is employed to scale the x and y components with respect to z, the distance to the viewer
in 3D
- projection from v onto n, maps a point p onto p'
- p' is the intersection of the line through p and v with plane n
- v is the viewpoint, center of perspectivity
- n is the viewplane
- the line through p and v is a projector
- v is not on the plane n, p \neq v
- represented by a matrix M = v n^T - (n \cdot v) I_4
- n = \{ax + by + cz + d = 0\} = (a, b, c, d)^T
OpenGL: View Volume
- projection transformation maps a view volume to the canonical view volume
- the view volume is specified by its boundary
- the canonical view volume is a cube from (-1, -1, -1) to (1, 1, 1)
- transformation maps
- from eye coordinates
- to clip coordinates
- to normalized device coordinates NDC
- x component of a point from (left, right) to (-1, 1)
- y component of a point from (bottom, top) to (-1, 1)
- z component of a point from (near, far) to (-1, 1)
Perspective projection
- to obtain the x and y components of a projected point, the point is first projected onto the near plane (viewplane)
- x_p = \frac{nx_e}{-z_e} \qquad y_p = \frac{ny_e}{-z_e}
- transform the view volume, the pyramidal frustum to the canonical view volume
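A hedged sketch of a glFrustum-style matrix that realizes this: x and y are divided by w = -z_e (the projection onto the near plane), and the frustum is mapped to the canonical view volume; the boundary values below are illustrative.

```python
import numpy as np

def frustum(l, r, b, t, n, f):
    """glFrustum-style perspective matrix; after division by w = -z_e the
    pyramidal frustum is mapped to the canonical view volume [-1, 1]^3."""
    return np.array([
        [2*n/(r-l), 0.0, (r+l)/(r-l), 0.0],
        [0.0, 2*n/(t-b), (t+b)/(t-b), 0.0],
        [0.0, 0.0, -(f+n)/(f-n), -2*f*n/(f-n)],
        [0.0, 0.0, -1.0, 0.0]])

P = frustum(-1.0, 1.0, -1.0, 1.0, 1.0, 10.0)
p_eye  = np.array([0.5, -0.5, -2.0, 1.0])   # a point in eye coordinates
p_clip = P @ p_eye
p_ndc  = p_clip / p_clip[3]                 # perspective division
print(p_ndc)                                # x, y match n*x_e/-z_e = 0.25, n*y_e/-z_e = -0.25 here
```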
Parallel projection
- view volume is represented by cuboid
- the projection transform maps the cuboid to the canonical view volume
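For the parallel case, a sketch of the affine map (glOrtho-style) from the cuboid (left, right) x (bottom, top) x (near, far) to the canonical cube; the boundary values below are illustrative.

```python
import numpy as np

def ortho(left, right, bottom, top, near, far):
    """Maps the cuboid view volume to the canonical cube [-1, 1]^3 (glOrtho-style)."""
    return np.array([
        [2.0 / (right - left), 0.0, 0.0, -(right + left) / (right - left)],
        [0.0, 2.0 / (top - bottom), 0.0, -(top + bottom) / (top - bottom)],
        [0.0, 0.0, -2.0 / (far - near), -(far + near) / (far - near)],
        [0.0, 0.0, 0.0, 1.0]])

P = ortho(-2.0, 2.0, -1.0, 1.0, 1.0, 10.0)
print(P @ np.array([-2.0,  1.0,  -1.0, 1.0]))  # corner (left, top, near)    -> [-1.  1. -1.  1.]
print(P @ np.array([ 2.0, -1.0, -10.0, 1.0]))  # corner (right, bottom, far) -> [ 1. -1.  1.  1.]
```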
OpenGL Matrices
- Objects are transformed from object to eye space with \text{GL_MODELVIEW} \cdot p
- Objects are transformed from object space to clip space with \text{GL_PROJECTION} \cdot \text{GL_MODELVIEW} \cdot p
- colors are transformed with color matrix GL_COLOR
- texture coordinates are transformed with texture matrix GL_TEXTURE
Matrix Stack
- each matrix type has a stack
- matrix on top of stack is used
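A plain-Python sketch of the stack semantics in the spirit of glPushMatrix / glMultMatrix / glPopMatrix (not actual OpenGL calls):

```python
import numpy as np

# Minimal sketch of a matrix stack: the matrix on top is the one currently applied.
stack = [np.eye(4)]                      # start with the identity

def push():
    stack.append(stack[-1].copy())       # duplicate the current top (like glPushMatrix)

def pop():
    stack.pop()                          # restore the previous matrix (like glPopMatrix)

def mult(M):
    stack[-1] = stack[-1] @ M            # right-multiply onto the current top (like glMultMatrix)

T = np.eye(4); T[:3, 3] = [1.0, 0.0, 0.0]
push(); mult(T)                          # local transform for one object
print(stack[-1][:3, 3])                  # [1. 0. 0.]
pop()
print(stack[-1][:3, 3])                  # [0. 0. 0.] -> back to the previous state
```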
Lighting
Radiometric Quantities
- radiant energy Q
- radiant flux \Phi (radiant power P)
- rate of flow of radiant energy per unit time: \Phi = \frac{dQ}{dt}
- flux density (irradiance, radiant exitance)
- radiant flux per unit area: E = \frac{d\Phi}{dA}
- irradiance measures overall radiant flux \Phi (light flow, photons per unit time) into a surface element
- radiant intensity
- radiant flux per unit solid angle I = \frac{d \Phi}{d \omega}
- radiant flux \Phi (light flow, photons per unit time) incident on, emerging from, or passing through a point in a certain direction
Solid Angle
- 2D angle in 3D space
- measured in steradians
- d \omega = \frac{dA}{r^2}
Inverse square law
- point light source with radiant intensity I in direction \omega
- irradiance along a ray in direction \omega is inversely proportional to the square of the distance r from the source E = \frac{I}{r^2}
- the number of photons emitted into d\omega and hitting the surface area dA at distance r is inversely proportional to r^2
- the area subtended by a solid angle is proportional to r^2
- E = \frac{d\Phi}{dA} = \frac{d\Phi}{r^2 d\omega} = \frac{I}{r^2}
Radiance
- radiant flux per unit solid angle per unit projected area incident on, emerging from, or passing through a surface element in a certain direction: L = \frac{d^2 \Phi}{dA_p \, d\omega} = \frac{d^2 \Phi}{dA \cos\theta \, d\omega}
- describes strength and direction of radiation
- constant radiance in all directions corresponds to a radiant intensity proportional to \cos\theta; constant radiant intensity, in contrast, results in direction-dependent radiance
- measured by sensors
- computed in computer-generated images
- preserved along lines in space
- does not change with distance
Visible color spectrum
- photons characterized by wavelength in visible spectrum 390 \text{nm} to 750 \text{nm}
- light consists of a set of photons
- the distribution of wavelengths within this set is referred to as spectrum of light
- spectra are perceived as colors
- if spectrum consists of dominant wavelength, humans perceive a “rainbow” color (monochromatic)
- equally distributed wavelengths are perceived as shade of gray (achromatic)
- otherwise, colors “mixed from rainbow colors” are perceived (chromatic)
Human Eye
- sensitive to radiation within the visible spectrum
- has sensors for day and night vision (cones, rods)
- perceived light is radiation spectrum weighted with absorption spectra of the eye
- in daylight, three cone signals are interpreted by the brain
CIE Color Space
- XYZ color space
- X = \int_\lambda I(\lambda) \bar{x}(\lambda) d\lambda
- Y = \int_\lambda I(\lambda) \bar{y}(\lambda) d\lambda
- Z = \int_\lambda I(\lambda) \bar{z}(\lambda) d\lambda
- spectrum I is represented by X, Y, Z given the color-matching functions \bar{x}, \bar{y}, \bar{z}
- color-matching functions correspond to spectral sensitivities of the cones of a standard observer
CIE xy Chromaticity diagram
- XYZ represents color and brightness / luminance
- two values are sufficient to represent color
- x = \frac{X}{X + Y + Z}
- y = \frac{Y}{X + Y + Z}
- monochromatic colors are on the boundary
- the center is achromatic
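A toy numerical sketch of the X, Y, Z integrals and the chromaticity coordinates; the sampled spectra below are made-up placeholders, not the real CIE 1931 tables.

```python
import numpy as np

# Toy sketch: spectrum I and color-matching functions sampled at a few wavelengths.
lam  = np.array([450.0, 550.0, 650.0])   # wavelengths in nm
I    = np.array([0.2, 1.0, 0.4])         # spectral distribution of the light (made up)
xbar = np.array([0.3, 0.4, 0.9])         # placeholder color-matching functions
ybar = np.array([0.04, 1.0, 0.3])
zbar = np.array([1.7, 0.01, 0.0])

X = np.trapz(I * xbar, lam)              # X = integral of I(lambda) xbar(lambda) dlambda
Y = np.trapz(I * ybar, lam)
Z = np.trapz(I * zbar, lam)

x, y = X / (X + Y + Z), Y / (X + Y + Z)  # chromaticity: brightness is divided out
print(X, Y, Z, x, y)
```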
Display Devices
- use three primary colors, can only reproduce colors within spanned triangle
RGB Color Space
- three primaries: red, green, blue
- light source color: what is contained in spectrum
- object color: what it reflects, what it absorbs
Lighting models
- compute interaction of objects with light
- account for variety of light sources and material properties
- but: keep it fast
- only use local information
- interaction between objects neglected
- interreflections among objects not handled
diffuse vs specular
- diffuse: reflected into many different directions, matte surfaces
- specular (mirror like): is reflected into a dominant direction (shiny surface)
- material properties
- general: combination of both
Lambert’s Cosine Law
- computation of diffuse reflection is governed by Lambert’s cosine law
- radiant intensity reflected from a “Lambertian” (matte, diffuse) surface is proportional to the cosine of the angle between view direction and surface normal n
- irradiance on an oriented surface patch above a Lambertian surface is constant
- irradiance on a surface is proportional to the cosine of the angle between surface normal n and light source direction l
specular reflection
- perceived radiance depends on viewing direction
- maximum radiance, if viewer direction v corresponds to the reflection direction r or, if the halfway vector h corresponds to the surface normal
- Phong and Blinn-Phong do not account for energy preservation
- radiance depends on angle \theta and exponent m
- radiant exitance from surface depends on m
- normalized Blinn-Phong versions account for energy conservation
ambient light
- diffuse and specular reflection are modeled for directional light from a point light source or from parallel light
- ambient light is an approx. for indirect light sources
- the effect is generally approximated by a constant
Phong illumination model
- combines ambient, diffuse, and specular components
- Phong allows to set different light colors for different components
- multiple light sources possible
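An illustrative per-light Blinn-Phong evaluation in Python; the material and light values are assumptions, not from the lecture.

```python
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v)

def blinn_phong(n, l, v, light_col, k_a, k_d, k_s, shininess):
    """Per-light evaluation: ambient + diffuse (Lambert) + specular (Blinn-Phong)."""
    n, l, v = normalize(n), normalize(l), normalize(v)
    h = normalize(l + v)                                   # halfway vector
    diffuse  = k_d * max(np.dot(n, l), 0.0)                # Lambert's cosine law
    specular = k_s * max(np.dot(n, h), 0.0) ** shininess   # highlight where h is close to n
    return k_a + light_col * (diffuse + specular)

color = blinn_phong(n=np.array([0.0, 0.0, 1.0]),
                    l=np.array([0.0, 1.0, 1.0]),
                    v=np.array([0.0, 0.0, 1.0]),
                    light_col=np.array([1.0, 1.0, 1.0]),
                    k_a=np.array([0.1, 0.1, 0.1]),
                    k_d=np.array([0.6, 0.2, 0.2]),
                    k_s=np.array([0.5, 0.5, 0.5]),
                    shininess=32.0)
print(color)
```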
Considering Distances
- between object surface and light source
- irradiance of a surface illuminated by a point light source is inversely proportional to the square of the distance from the surface to the light source
- light source attenuation
- between object surface and viewer
- atmospheric effects, e.g. fog, influence the visibility of objects
- refers to transparency of air
- low visibility, radiance converges to a “fog color”
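A hedged sketch of both distance effects; the attenuation constants and fog range are assumed example values.

```python
import numpy as np

# Light source attenuation with distance d_l, and linear fog with distance d_v to the viewer.
def attenuate(color, d_l, kc=1.0, kl=0.0, kq=0.1):
    return color / (kc + kl * d_l + kq * d_l * d_l)         # common 1/(kc + kl d + kq d^2) model

def apply_fog(color, fog_color, d_v, start=10.0, end=50.0):
    f = np.clip((end - d_v) / (end - start), 0.0, 1.0)      # f = 1 near, f = 0 far
    return f * color + (1.0 - f) * fog_color                # radiance converges to the fog color

c = np.array([0.8, 0.4, 0.2])
print(attenuate(c, d_l=3.0))
print(apply_fog(c, fog_color=np.array([0.6, 0.6, 0.6]), d_v=40.0))
```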
Shading models
- specify whether lighting is evaluated per vertex or per fragment
- if per vertex: colors are interpolated or not
- per fragment: surface normals are interpolated across a primitive
- flat shading (Constant shading)
- evaluation of the lighting model per vertex
- primitives are colored with color of one specific vertex
- Gouraud shading
- evaluation of lighting model per vertex
- primitives are colored by bilinear interpolation of vertex colors
- Phong shading
- bilinear interpolation of vertex normals during rasterization
- evaluation of the lighting model per fragment
Mach Band Effect
- mach bands are illusions due to our neural processing
- the bright bands at 45 degrees and 135 degrees are illusory. The intensity inside each square is the same
Shading models
- flat shading (constant shading): simple, fast
- Gouraud shading: more realistic, suffers from Mach band effect, local highlights are not resolved if the highlight is not captured by a vertex
- Phong shading: highest quality, most expensive
Rasterization
Transform geometric primitives into a raster (pixel positions).
Line
- the start and end points of a line have integer components p_b = (x_b, y_b) and p_e = (x_e, y_e)
- lines are represented as y = mx + b or F(x, y) = ax + by + c = 0
- algorithms are often restricted to 0 \leq m \leq 1
- algorithms consist of an initialization step and a loop
- line rasterization algorithms are usually described for a subset of lines and generalized using symmetries
- incremental updates are often employed
- Bresenham's algorithm avoids floating-point arithmetic
- improved algorithms address aliasing / clipping artifacts
- note that the algorithms do not compute all pixels that are intersected by the line
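A sketch of Bresenham's algorithm for the restricted slope range 0 \leq m \leq 1 (other octants follow from the symmetries mentioned above):

```python
def bresenham(xb, yb, xe, ye):
    """Integer-only line rasterization for slopes 0 <= m <= 1."""
    dx, dy = xe - xb, ye - yb
    d = 2 * dy - dx              # decision variable, initialized once
    y = yb
    pixels = []
    for x in range(xb, xe + 1):
        pixels.append((x, y))
        if d > 0:                # choose the NE pixel: also step in y
            y += 1
            d -= 2 * dx
        d += 2 * dy              # incremental update, no floating-point arithmetic
    return pixels

print(bresenham(0, 0, 8, 3))
```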
Circle
- center (0, 0) and radius r
- implicit representation F(x, y) = x^2 + y^2 - r^2 = 0
- algorithms compute only one eighth of a circle (the rest follows by symmetry)
- incremental updates are often employed
- floating-point arithmetic is avoided
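A sketch of a midpoint-style circle rasterization for one octant with integer arithmetic only; the decision variable is derived from the implicit form F(x, y) above.

```python
def midpoint_circle(r):
    """Rasterizes one octant of a circle centered at (0, 0); integer arithmetic only.
    The remaining seven octants follow from symmetry."""
    x, y = 0, r
    d = 1 - r                      # decision variable for F(x, y) = x^2 + y^2 - r^2
    pixels = [(x, y)]
    while x < y:
        if d < 0:                  # midpoint inside the circle: step east
            d += 2 * x + 3
        else:                      # midpoint outside: step south-east
            d += 2 * (x - y) + 5
            y -= 1
        x += 1
        pixels.append((x, y))
    return pixels

print(midpoint_circle(8))
```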
Polygons
- a polygon is defined by edges
- should be closed to allow inside / outside classification
- rasterization estimates all pixel positions inside a polygon
- in general simple, but
- if adjacent polygons share an edge, each pixel on the edge should belong to exactly one polygon
- no pixel along the edge should be missed
- no pixel along the edge should be rasterized twice
Shadow
- improve the realism in rendered images
- illustrate spatial relations between objects
- Goal: determination of shadowed parts in 3D scenes
- only geometry is considered
- not standard functionality of rasterization-based rendering pipeline
Projection Shadows
- project the primitives of occluders onto a receiver plane, based on the light source location
- projected primitives form shadow geometry
- implementation: draw receiver plane, disable depth test, draw shadow geometry, enable depth test, draw occluders
- issues: restricted to planar receivers, no self-shadowing, antishadows
Shadow Maps
- see shadow casting as visibility problem
- scene points are visible from light source (illuminated), others invisible from light source (in shadow)
- resolving visibility is standard functionality in rendering pipeline (z-buffer algorithm)
- algorithm:
- render scene from light source
- store all distances to visible (illuminated) scene points in a shadow map
- render scene from camera
- compare the distance of rendered scene points to the light with stored values in the shadow map
- if both distances are equal, the rendered scene point is illuminated
Scene rendering
- use two depth buffers
- “usual” depth buffer for view point
- second depth buffer for light position (shadow map)
- render scene from light position into shadow map
- render scene from view position into depth buffer
- transform depth buffer values to shadow map values
- compare transformed depth buffer values and shadow map values to decide whether a fragment is shadowed or not
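An illustrative per-fragment shadow-map test in Python; the light matrix, map contents, and bias are assumed placeholders.

```python
import numpy as np

def in_shadow(p_world, light_matrix, shadow_map, bias=1e-3):
    """light_matrix maps world space to the light's NDC; shadow_map stores the
    nearest depth seen from the light per texel (filled in the first render pass)."""
    p = light_matrix @ np.append(p_world, 1.0)
    p = p / p[3]                                        # perspective division
    u = int((p[0] * 0.5 + 0.5) * (shadow_map.shape[1] - 1))
    v = int((p[1] * 0.5 + 0.5) * (shadow_map.shape[0] - 1))
    depth_from_light = p[2] * 0.5 + 0.5                 # map to [0, 1] like the stored depths
    return depth_from_light > shadow_map[v, u] + bias   # farther than stored depth -> shadowed

shadow_map = np.full((4, 4), 0.5)                       # pretend everything at depth 0.5 is lit
print(in_shadow(np.array([0.0, 0.0, 0.0]), np.eye(4), shadow_map))  # depth 0.5 -> not shadowed
```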
Aliasing
- discretized representation of depth values can cause an erroneous classification of scene points
- offset of shadow map value reduces aliasing artifacts
Sampling
- in large scenes, sampling artifacts can occur
- uniform sampling of the shadow map can result in non-uniform resolution for shadows in a scene
- shadow map resolution tends to be too coarse for nearby objects and too high for distant objects
- Solution: perspective shadow maps
- scene and light are transformed using the projective transformation of the camera
- compute the shadow map in post-perspective space
Shadow volumes
- employ a polygonal representation of the shadow volume
- point-in-volume test
- implementation issues: classification of shadow volume polygons into front and back faces
- stencil buffer values count the number of intersected front and back faces
- algorithm: Z-pass
- render scene to initialize depth buffer
- stencil enter / leave counting approach
- render shadow volume twice using face culling
- do not update depth and color
- if a pixel is in shadow, the stencil value is non-zero
- otherwise, the stencil value is zero
- implementation
- render the scene with only ambient lighting
- render front facing shadow volume polygons to determine how many front face polygons are in front of the depth buffer pixels
- render back facing shadow volume polygons
- render the scene with full shading where stencil is zero
- can miss intersections due to near plane clipping
Z-Fail
- basic Z-Fail
- counts the difference of occluded front and back faces
- misses intersections behind the far plane
- with depth clamping
- do not clip primitives at the far plane
- clamp the actual depth value to the far plane
- Z-fail with far plane at infinity
- adapt the perspective projection matrix
Texture Mapping
- add per-pixel surface details without raising geometric complexity of a scene
- keep number of vertices and primitives low
- textures are represented as 2D images
- can represent variety of properties: colors, normals, etc.
- positions of texture pixels are characterized by texture coordinates (u, v) in texture space
- texture mapping is a transformation from object space to texture space (x, y, z) \rightarrow (u, v)
- texture mapping is generally applied per fragment
Pipeline
- object space location
- texture coordinates are defined for object space locations
- if the object moves in world space, move texture along with it
- projector function
- maps a 3D object space location to a 2D parameter-space value
- corresponder functions
- map from parameter space to texture space
- texture-space values are used to obtain values from a texture
- value transform
- a function that transforms obtained texture values
Projector functions
- planar projection, cylindrical projection or spherical projection
- can be an initialization step for the surface parameterization
- use different projections for surface patches
- minimize distortions, avoid artifacts at seams
- several texture coordinates can be assigned to a single vertex
Corresponder functions
- one or more corresponder functions transform from parameter space to texture space
- transform (u, v) into the valid range from 0 to 1, e.g.
- repeat, mirror, clamp to edge, clamp to border (see the sketch after this list)
- arbitrary transformations, e.g. represented by a matrix in homogeneous form, can be applied to texture values
- texture blending functions
- define how to use the texture value
- replace the original surface color with texture color
- linearly combine the original surface color with the texture
- multiply, add, subtract surface color and texture color
- alternatively, texture values can be used in the computation of illumination model
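A sketch of the wrap-mode corresponder functions listed above; the function names are illustrative.

```python
def repeat(u):
    return u % 1.0                                  # tile the texture

def mirror(u):
    t = u % 2.0
    return t if t <= 1.0 else 2.0 - t               # reflect every second tile

def clamp_to_edge(u):
    return min(max(u, 0.0), 1.0)                    # reuse the border texel

for u in (-0.25, 0.3, 1.7):
    print(u, repeat(u), mirror(u), clamp_to_edge(u))
```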
Image texturing
- a 2D image is mapped / glued onto a triangulated object surface
- certain aspects affect the quality
- interpolation of texture coordinates
- sampling
- the quality of applied textures can be improved by
- perspective-correct interpolation
- considering magnification and minification
Transparency and Reflection
- simplified transparency model
- semitransparent objects are filters / attenuators of occluded objects
- refraction and object thickness are neglected
- algorithms are based on stipple patterns, color blending per pixel
Stipple patterns
- screen-door transparency
- transparent object is rendered with a fill / stipple pattern, e.g. checkerboard (pattern of opaque and transparent fragments)
- limited number of fill patterns results in limited number of transparency levels
- aliasing artifacts
- simple methods
Color Blending
- alpha value: describes the opacity of a fragment, stored with the RGB color (RGBA)
- order matters
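A small sketch of back-to-front "over" blending showing that the order matters; the colors and alpha values are example choices.

```python
import numpy as np

# Back-to-front "over" blending of RGBA fragments onto an opaque background.
def blend_over(dst_rgb, src_rgb, src_alpha):
    return src_alpha * src_rgb + (1.0 - src_alpha) * dst_rgb

background = np.array([0.0, 0.0, 0.0])
red_glass  = (np.array([1.0, 0.0, 0.0]), 0.5)
blue_glass = (np.array([0.0, 0.0, 1.0]), 0.5)

# Blending red then blue gives a different result than blue then red.
c1 = blend_over(blend_over(background, *red_glass), *blue_glass)
c2 = blend_over(blend_over(background, *blue_glass), *red_glass)
print(c1, c2)   # [0.25 0.   0.5 ] vs. [0.5  0.   0.25]
```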
Depth ordering
- polygons / fragments have to be rendered in sorted depth order
- hardware generally renders in object order
- depth test only returns the nearest fragment per pixel, sorting is not realized
- intersecting polygons have to be handled
- dynamic scenes require re-sorting
Convex objects
- exactly two depth layers for arbitrary viewing directions
- first depth layer defined by front faces
- second depth layer defined by back faces
- algorithm: render back faces in first pass, blend with front faces in second pass
arbitrarily shaped objects
- object-space methods
- use pre-computed spatial data structures (e.g. binary partition tree)
- useful for static geometry
- varying viewer positions and orientations can be handled
- screen-space methods
- employ the functionality of the rendering pipeline
- several rendering passes compute depth layers
- final pass renders the ordered depth layers
- useful for dynamic / deforming geometry and arbitrary views
- no pre-computation is required
Binary space partitioning (BSP)
- BSP tree is hierarchical spatial data structure
- 3D space is subdivided by means of arbitrarily oriented planes
- leaves represent convex space cells
- applications
- visible surface algorithm
- depth sorting
- collision detection
- precomputed for static scenes
- all planar primitives are represented in the tree
- balancing is less important, as the entire tree has to be traversed
- one rendering pass
Querying (rendering)
- back to front rendering
- render the branch on the far side of the viewpoint
- render the root (node) polygon
- render the branch on the near side of the viewpoint
- recursively applied to sub-trees
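A minimal Python sketch of this back-to-front traversal; the node layout and the example tree are assumptions for illustration.

```python
# Back-to-front BSP traversal (node layout is an assumption for this sketch).
class Node:
    def __init__(self, plane, polygon, front=None, back=None):
        self.plane = plane          # (a, b, c, d) of a x + b y + c z + d = 0
        self.polygon = polygon
        self.front, self.back = front, back

def side(plane, p):
    a, b, c, d = plane
    return a * p[0] + b * p[1] + c * p[2] + d   # > 0: p lies in the front half-space

def render_back_to_front(node, viewpoint, draw):
    if node is None:
        return
    if side(node.plane, viewpoint) > 0:                     # viewer in front half-space
        render_back_to_front(node.back, viewpoint, draw)    # far branch first
        draw(node.polygon)
        render_back_to_front(node.front, viewpoint, draw)   # near branch last
    else:                                                   # viewer in back half-space
        render_back_to_front(node.front, viewpoint, draw)
        draw(node.polygon)
        render_back_to_front(node.back, viewpoint, draw)

# Tiny example: two polygons split by the plane x = 0, viewer at x = 5.
tree = Node((1.0, 0.0, 0.0, 0.0), "splitting polygon",
            front=Node((1.0, 0.0, 0.0, -2.0), "polygon near viewer"),
            back=Node((1.0, 0.0, 0.0, 2.0), "polygon far from viewer"))
render_back_to_front(tree, (5.0, 0.0, 0.0), print)
```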
Discussion
- not only visible surface generation, but depth sorting of all primitives per pixel position
- additional data structure
- can be pre-computed
- requires polygon splits
- dynamic scenes require an update
Depth peeling
- use the functionality of rendering pipeline
- several rendering passes compute depth layers
- final pass renders the ordered depth layers
- useful for dynamic / deforming geometry
- no pre-computation is required / can be employed
- algorithm
- first render pass gives the front-most fragment color / depth
- each successive render pass extracts the next-nearest fragment on a per-pixel basis (screen-space approach)
- two depth buffers are used
- object is rendered once for each depth layer
- two separate depth tests per fragment
- depth peeling strips away one depth layer with each successive rendering pass
- implementation based on shadow mapping
Discussion
- screen-space algorithm
- multiple rendering passes generate depth layers per pixel position
- view dependent (in contrast to BSP)
- appropriate for dynamic scenes
- quality and performance are determined by number of rendering passes
Reflection
Angle of incidence is equal to the angle of reflection.
Planar surfaces
- original and reflected geometry is rendered
- reflected geometry is generated with respect to the reflection plane with surface normal n = (n_x, n_y, n_z) and point p on the reflection plane
- semi-transparent reflection plane, e.g. with color blending (original geometry)
- render the reflection plane into the stencil buffer; draw the reflected geometry only where the stencil is set
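A sketch of a homogeneous reflection matrix for a plane given by a unit normal n and a point p on the plane; the plane and point values below are illustrative.

```python
import numpy as np

def reflection_matrix(n, p):
    """Homogeneous 4x4 reflection about the plane with unit normal n through point p."""
    n = n / np.linalg.norm(n)
    M = np.eye(4)
    M[:3, :3] = np.eye(3) - 2.0 * np.outer(n, n)
    M[:3, 3]  = 2.0 * np.dot(p, n) * n
    return M

# Mirror plane z = 1 (normal along z, containing the point (0, 0, 1)).
M = reflection_matrix(np.array([0.0, 0.0, 1.0]), np.array([0.0, 0.0, 1.0]))
print(M @ np.array([2.0, 3.0, 4.0, 1.0]))   # [ 2.  3. -2.  1.] -> mirrored across z = 1
```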
Arbitrary surfaces: Environment Mapping
- cube mapping: approximates reflections of the environment on arbitrary surfaces
- project the environment onto an object-embedding shape (sphere or cube)
- view-dependent mapping
- approximate implementation of reflections off arbitrary surfaces
- cannot handle changing reflections of moving objects
Steps
- generate or load a 2D map of environment
- for each fragment of a reflective object: compute the normal n
- compute the reflection vector r from the view vector v and the normal n at the surface point
- use the reflection vector to compute an index into the environment map that represents the objects in the reflection direction
- use the texel data (texture value) from the environment map to color the current fragment
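A sketch of the per-fragment reflection vector r = v - 2 (v \cdot n) n; the cube-map lookup is only hinted at by selecting the dominant axis, and all vectors are example values.

```python
import numpy as np

def reflect(v, n):
    n = n / np.linalg.norm(n)
    return v - 2.0 * np.dot(v, n) * n      # angle of incidence = angle of reflection

v = np.array([0.0, 0.0, -1.0])             # view vector: from the eye towards the surface
n = np.array([0.0, 1.0, 1.0])              # surface normal at the fragment (45 degrees tilted)
r = reflect(v, n)
face = np.argmax(np.abs(r))                # dominant axis hints at the cube-map face to sample
print(r, face)                             # [0. 1. 0.] 1 -> reflection points "up"
```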
Cubic mapping
- place a camera at the center of the object and take pictures in 6 directions
Spherical mapping
- orthographically project an image of a mirrored sphere
- map stores colors seen by reflected rays
- the sphere map contains information about both the environment in front of the sphere and behind the sphere
- can be obtained by ray tracing or by warping automatically generated cubic maps
Discussion
- disadvantages: maps are hard to create on the fly; sampling is non-linear, non-uniform, and view-point dependent
- advantages: no interpolation across map seams
- objects should be small compared to environment
- issues with self-reflections and non-convex objects
- separate map for each object
- maps may need to be changed in case of a change in viewpoint due to non-uniform sampling
- translated objects might require a map update