This document is derived from the lecture by Prof. Dr. G. Lausen at the University of Freiburg in the summer semester 2016 (SS16).

- Query:
- relational database: structured tables
- graph database: sets of graphs
- information retrieval system: sets of documents

- overwhelming amounts of data make it hard for users to pose precise queries. Recommender systems predict what a user is believed to be looking for

- Extract set of terms from document after stop word removal, stemming and lemmatization of document’s words
- the set of terms is \(T = \{t_1, ..., t_n\}\) and the set of documents is \(D = \{d_1, ..., d_m\}\). For each document \(d_i\) and each term \(t_k\) let \(w_{i, k} \in \mathbb{R}\) be a weight
- vector space model: consider matrix defined by the vectors of D’s documents (e.g. term-document matrix)

- term frequency: number of occurrences of term in given document
- \(i\) is term, \(j\) is document.
- \(freq(i, j)\) is number of occurrences of \(i\) in \(j\).
- \(OtherTerms(i, j)\) is the set of terms different from \(i\) in \(j\).
- \(maxOthers(i, j) = max_{z \in OtherTerms(i, j)}\{freq(z, j)\}\)

- normalized term frequency: \(TF(i, j) = \frac{freq(i, j)}{maxOthers(i, j)}\)
- Inverse document frequency:
- \(N\) number of recommendable documents
- \(n(i)\) number of documents from \(N\) containing keyword \(i\)
- \(IDF(i) = \log \frac{N}{n(i)}\)

- TF-IDF: \(TF-IDF(i, j) = TF(i, j) \cdot IDF(i)\)
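The TF-IDF weighting above can be sketched in Python; the toy documents and helper names are assumptions for illustration, not from the lecture:

```python
import math
from collections import Counter

def tf(term, doc_counts):
    """Normalized TF: freq of term divided by max freq among the *other* terms."""
    others = [c for t, c in doc_counts.items() if t != term]
    if not others or doc_counts[term] == 0:
        return 0.0          # formula undefined without other terms; 0 is an assumed fallback
    return doc_counts[term] / max(others)

def idf(term, all_docs):
    """IDF(i) = log(N / n(i)) over the collection of documents."""
    n_i = sum(1 for d in all_docs if term in d)
    return math.log(len(all_docs) / n_i)

docs = [Counter("the cat sat on the mat".split()),
        Counter("the dog sat".split()),
        Counter("cat and dog".split())]

# TF-IDF of "cat" in the first document
score = tf("cat", docs[0]) * idf("cat", docs)
```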

- Relationships between objects can naturally be represented as “subject-predicate-object” triples: \((s, p, o)\)
- Subjects, objects and predicates identified by URI (IRIs), can appear in either role
- Graphical interpretation: labeled directed graph
- Problem: the same property used several times complicates graph drawings

- property graph made up of nodes, relationships and properties
- nodes contain properties, typically as key-value pairs
- relationships connect and structure nodes
- relationships can also have properties

- based on
- personal profile of user
- popularity characteristic for a set of peers
- similarity to a given example
- personal profile and some additional knowledge
- combination of above

- computation problem: learn a function that predicts the relevance score for interesting items, typically not seen before by the user
- example: collaborative filtering, content-based recommendations

- RDF building blocks
- URIs
- literals
- blank nodes

- n-ary Relationships by blank nodes
- SPARQL
- PREFIX: mechanism for abbreviating URIs
- SELECT: identifies the variables to be returned in query answer
- FROM: names the graph to be queried
- WHERE: query pattern

```
# DATA
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
_:a foaf:givenName "Alice" .
_:b foaf:givenName "Bob" .
_:b dc:date "2005-04-04T04:04:04Z"^^xsd:dateTime .

# QUERY
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
SELECT ?name
WHERE { ?x foaf:givenName ?name .
        OPTIONAL { ?x dc:date ?date }
        FILTER (!bound(?date)) }
```

- A tuple from \((T \cup V) \times (U \cup V) \times (T \cup V)\) is a triple pattern
- A triple pattern is also a graph pattern. If `P1` and `P2` are graph patterns, then the expressions `(P1 AND P2)`, `(P1 OPT P2)` and `(P1 UNION P2)` are graph patterns
- If `P` is a graph pattern and `R` is a SPARQL built-in condition, then the expression `P FILTER R` is a graph pattern

```
P1 = ((?A, email, ?E) OPT (?A, webPage, ?W)).
P2 = (((?A, name, ?N) OPT (?A, email, ?E)) OPT (?A, webPage, ?W)).
P3 = ((?A, name, ?N) OPT ((?A, email, ?E) OPT (?A, webPage, ?W))).
P4 = ((?A, name, ?N) AND ((?A, email, ?E) UNION (?A, webPage, ?W))).
```

- a mapping \(\mu\) is a partial function \(\mu : V \rightarrow T\). For a triple pattern \(t\), \(\mu(t)\) denotes the triple obtained by replacing the variables in \(t\) according to \(\mu\)
- the domain \(dom(\mu)\) of \(\mu\) is the subset of \(V\) on which \(\mu\) is defined
- mappings \(\mu_1\) and \(\mu_2\) are compatible when for all \(x \in dom(\mu_1) \cap dom(\mu_2)\) it is the case that \(\mu_1(x) = \mu_2(x)\)
- disjoint domains are always compatible
- empty mapping is compatible with any other mapping
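Compatibility (and the join of mappings underlying AND evaluation) can be sketched with mappings as Python dicts; the variable names and values are assumed examples:

```python
def compatible(mu1, mu2):
    """Two mappings are compatible iff they agree on every shared variable."""
    return all(mu1[x] == mu2[x] for x in mu1.keys() & mu2.keys())

def join(mu1, mu2):
    """Union of two compatible mappings; None if incompatible."""
    return {**mu1, **mu2} if compatible(mu1, mu2) else None

mu1 = {"?A": "alice", "?E": "alice@example.org"}
mu2 = {"?A": "alice", "?W": "http://example.org/alice"}
mu3 = {"?A": "bob"}
```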

- SELECT: returns all, or a subset of the variables bound in query pattern match
- CONSTRUCT: returns an RDF graph constructed by substituting variables in set of triple templates
- ASK: returns boolean indicating whether a query pattern matches or not
- DESCRIBE: returns RDF graph that describes resources found

- each link’s vote is proportional to the importance of its source page
- if page \(j\) with importance \(r_j\) has \(k_j\) out-links, each link gets \(\frac{r_j}{k_j}\) votes
- Page \(j\)’s own importance is the sum of the votes on its in-links: \(r_j = \sum_{i \rightarrow j} \frac{r_i}{k_i}\)
- the transition matrix \(M\) has \(n\) rows and columns. Element \(m_{ij}\) of \(M\) has value \(\frac{1}{k_j}\) when page \(j\) has \(k_j\) out-going arcs and one of them points to page \(i\); otherwise \(m_{ij} = 0\)
- \(M\) is column-stochastic: the values in each column sum to \(1\). \(M\) is the transition matrix of a Markov process

- \(v\) is the distribution vector: its component for page \(i\) is the probability that the surfer currently visits page \(i\); \(m_{ij}\) is the probability of moving to page \(i\) next, given the surfer currently visits page \(j\)
- \(v_1 = M v_0\)
- \(v_2 = M v_1 = M(M v_0) = M^2 v_0\)
- after \(i\) steps \(v_i = M^i v_0\)
- Given that \(M\) is stochastic, the Markov process converges to a stationary distribution. The limit exists and has to satisfy \(v = Mv\), provided \(M\) represents a strongly connected (web) graph

Problem: web is not strongly connected graph!

Dead end (no outgoing links): \(v_\infty = \begin{pmatrix}0 \\ 0 \\ 0 \\ 0\end{pmatrix}\) Spider trap (infinite loop back to itself): \(v_\infty = \begin{pmatrix}0 \\ 0 \\ 1 \\ 0\end{pmatrix}\)

Note that this will not necessarily represent a possible distribution of the random surfer, but an approximation of the relative importance of the respective pages.

New iterative step:

\(v' = \beta M v + \frac{1 − \beta}{n} \cdot e\)

- \(e\) is \(n\)-dim vector with all \(1\)’s
- \(n\) number of nodes in graph
- \(\beta\) is chosen such that with probability \(1 - \beta\) the surfer teleports to a random page.
**Typical \(\beta\) is \(0.8\) to \(0.9\)**
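The iterative step \(v' = \beta M v + \frac{1-\beta}{n} e\) can be sketched as a small power iteration; the 3-page toy web below is an assumed example:

```python
def pagerank(M, beta=0.85, iters=100):
    """Power iteration with teleportation on a column-stochastic matrix M
    (M[i][j] = probability of moving from page j to page i)."""
    n = len(M)
    v = [1.0 / n] * n                       # start with the uniform distribution
    for _ in range(iters):
        v = [beta * sum(M[i][j] * v[j] for j in range(n)) + (1 - beta) / n
             for i in range(n)]
    return v

# Assumed toy web: 0 -> 1, 1 -> {0, 2}, 2 -> 0 (columns sum to 1)
M = [[0.0, 0.5, 1.0],
     [1.0, 0.0, 0.0],
     [0.0, 0.5, 0.0]]
ranks = pagerank(M)
```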

Example: search term *Jaguar* - how to rank pages?

- Ideally: private PageRank vector
- Topic-sensitive PageRank for small number of topics
- Interest in topics can be determined from pages recently viewed

\(v' = \beta M v + (1 - \beta) \frac{e_s}{|S|}\)

\(S\) is the teleport set, i.e. a set of pages belonging to the selected topic. \(e_s\) is a vector with \(1\)'s in the \(S\)-positions and \(0\)'s elsewhere.

Spam farm: a collection of pages designed to increase the PageRank of a certain page. From the spammer's view the web divides into inaccessible pages, accessible pages and own pages.

- Target page \(t\), \(m\) supporting pages, \(n\) pages in total. Given \(\beta\).
- \(x\) the amount of PageRank contributed to \(t\) by the accessible pages
- \(y\) PageRank of target page \(t\)
- Rank of each farm page is \(r = \frac{\beta y}{m} + \frac{1 - \beta}{n}\)

Thus the PageRank \(y\) of \(t\) is built from \(x\) plus \(\beta\) times the PageRank \(r\) of each of the \(m\) supporting pages (neglecting the small taxation term for \(t\) itself):

\(y = x + \beta m \left(\frac{\beta y}{m} + \frac{1-\beta}{n}\right) = x + \beta^2 y + \beta(1 - \beta)\frac{m}{n}\)

Solving for \(y\) gives \(y = \frac{x}{1-\beta^2} + \frac{\beta}{1+\beta} \cdot \frac{m}{n}\).

Conclusion: increase \(m\) (supporting pages) to increase \(y\).

TrustRank: Topic-sensitive PageRank where the “topic” is a set of pages believed to be trustworthy. Requires a (human-selected) seed set.

Assumptions:

- trust is additive
- the degree of trust decreases with distance
- trust is split across out-links

Spam mass of a page \(p\) is \(\frac{r - t}{r}\), where \(r\) is \(p\)'s PageRank and \(t\) its TrustRank. A spam mass close to 1 indicates spam.

Measure the similarity of nodes of same type by the similarity of structural context in respective graph. Example: Users, Tags, Web pages. Users place tags, tags are placed on pages, users place tags on pages. Note that only binary relationships are expressed for simplicity of structure.

A node of the same type at which a random walker starting at \(N\) will most likely be in the limit is considered most similar to \(N\).

\(v' = \beta M v + (1 - \beta) e_N\)

- Ranking of Web search results based on link structure
- search specific - in contrast to PageRank
- Idea
- pages with good content: authorities, quality measured in weighted in-links
- pages which link to pages which have good content are hubs; links are weighted depending on quality of pages they link to
- compute scores for hubs and authorities such that the recursive definition has a unique solution (equilibrium)

- Assumption: if a page receives many links from other pages, then it is receiving a kind of collective endorsement

- highly endorsed answers to queries are the authorities for query
- high-value lists are hubs for query
- assign each page \(p\) two scores: \(auth(p)\) and \(hub(p)\)
- initial values are equal to \(1\).
- Authority update rule: for each page \(p\) update \(auth(p)\) to be the sum of the hub scores of all pages that point to it
- Hub update rule: for each page \(p\) update \(hub(p)\) to be the sum of the authority scores of all pages that it points to
- Normalization step: Each authority score is divided down by the sum of all authority scores, and each hub score is divided down by the sum of all hub scores
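The authority, hub and normalization steps can be sketched directly; the toy link structure is an assumption:

```python
def hits(links, iters=50):
    """links[p] = list of pages that p points to; returns (hub, auth) dicts."""
    pages = sorted(links)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iters):
        # authority update: sum of hub scores of all pages pointing to p
        auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
        s = sum(auth.values())
        auth = {p: a / s for p, a in auth.items()}
        # hub update: sum of authority scores of all pages p points to
        hub = {p: sum(auth[q] for q in links[p]) for p in pages}
        s = sum(hub.values())
        hub = {p: h / s for p, h in hub.items()}
    return hub, auth

# Assumed toy graph: a -> b, a -> c, b -> c
links = {"a": ["b", "c"], "b": ["c"], "c": []}
hub, auth = hits(links)
```

Page `c`, endorsed by both others, ends up as the strongest authority; `a`, linking to both, as the strongest hub.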

Motivation: a search engine should not return items which are “too” similar: plagiarism, mirror pages, articles from the same source.

\(SIM(S, T) = \frac{|S \cap T|}{|S \cup T|}\)

*k-shingle* (or *k-gram*) is any substring of length \(k\) found within the document.

\(D\) is a document. The set of \(k\)-shingles appearing one or more times in \(D\) is denoted \(S^k(D)\).

White space can be ignored or replaced by a single blank.

Choosing a good value for \(k\) is critical: pick \(k\) large enough that the probability of any given shingle appearing in any given document is low. Rule of thumb: \(k = 5\) for short documents, \(k = 10\) for long documents.
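A minimal shingling sketch along these lines (the example strings are assumptions, not from the lecture):

```python
def shingles(text, k):
    """Set of character k-shingles; whitespace runs collapse to a single blank."""
    text = " ".join(text.split())
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def jaccard(s1, s2):
    """SIM(S, T) = |S ∩ T| / |S ∪ T|."""
    return len(s1 & s2) / len(s1 | s2)

s = shingles("abcd abcd", 4)   # {"abcd", "bcd ", "cd a", "d ab", " abc"}
```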

(Long) shingles can be compressed by hashing them to, for example, \(4\) bytes. To decide whether hashing saves space, compare the number of possible hash values with the number of possible shingles.

Real world scenario: News articles have a lot of stop words. A shingle can thus be formed out of a stop word followed by two other words.

Note: two pages with the same article and different surrounding material have a higher Jaccard similarity than two pages with the same surrounding material and different articles.

Scenario: near-duplicate detection for documents. With \(N = 1{,}000{,}000\) documents, a pairwise test for Jaccard similarity results in

\(\frac{N(N-1)}{2} \approx 5 \cdot 10^{11}\)

comparisons. At \(10^6\) comparisons per second, this takes almost \(6\) days.

Goal: Estimate Jaccard Similarity by breaking down large sets of shingles into signatures.

The characteristic matrix is a matrix that associates shingles with documents (sets of shingles). If a shingle is in a document, the value is \(1\), else \(0\).

Minhashing considers permutations of the rows of a characteristic matrix. In respect to a permutation, the minhash value of a document is given by the number of the first row in which the column has a 1.

The probability that the minhash function for a random permutation of rows produces the same value for two sets equals the Jaccard similarity of those sets.

Given \(n\) minhash functions \(h_1, h_2, ..., h_n\). For a collection of sets \(S_1, ..., S_m\) we construct a matrix SIG with columns assigned to the sets \(S_i\) with \(1 \leq i \leq m\), for each such set \(S_i\) its column is defined by the vector \([h_1(S_i), ..., h_n(S_i)]\)

**Algorithm**: check each row of characteristic matrix only once and can compute all hash functions simultaneously. \(SIG(i, c)\) refers to the \(i\)th hash function and column \(c\) (representing set \(S_c\)).

- Set \(SIG(i, c)\) to \(\infty\) for all \(i\) and \(c\)
- Compute \(h_1(r), ..., h_n(r)\) for each row \(r\) of the matrix
- for each row in the given order, for each column \(c\) do the following
- if \(c\) has \(0\) in row \(r\), do nothing
- if \(c\) has \(1\) in row \(r\), then for each \(i = 1, ..., n\) set \(SIG(i, c)\) to the smaller of the current value of \(SIG(i, c)\) and \(h_i(r)\)
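The row-scan algorithm above can be sketched as follows; the toy characteristic matrix and the two hash functions are assumed examples:

```python
INF = float("inf")

def minhash_signatures(matrix, hash_funcs):
    """matrix[r][c] is the 0/1 characteristic matrix; returns SIG with
    SIG[i][c] = min over rows r having matrix[r][c] == 1 of hash_funcs[i](r)."""
    n, m = len(hash_funcs), len(matrix[0])
    sig = [[INF] * m for _ in range(n)]
    for r, row in enumerate(matrix):
        hashes = [h(r) for h in hash_funcs]      # all h_i(r), computed once per row
        for c, bit in enumerate(row):
            if bit:                              # c has 1 in row r
                for i in range(n):
                    sig[i][c] = min(sig[i][c], hashes[i])
    return sig

# Assumed toy example: 5 shingles (rows), 2 sets (columns)
matrix = [[1, 0],
          [0, 1],
          [1, 1],
          [0, 1],
          [1, 0]]
h1 = lambda r: (r + 1) % 5
h2 = lambda r: (3 * r + 1) % 5
sig = minhash_signatures(matrix, [h1, h2])
```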

Problem: efficiently find pairs of documents with greatest similarity. For 1M documents, one would have to check \(\binom{1\,000\,000}{2}\) pairs (half a trillion!). Idea: only check pairs of documents whose similarity is above some lower bound.

- divide signature matrix in \(b\) bands of \(r\) rows
- for each band consider hash function over \(r\)-dimensional vectors. Similar pairs are hashed into the same bucket

\(b\) bands, \(r\) rows; let \(D_1\) and \(D_2\) be a pair of documents with Jaccard similarity \(s\). The probability that the minhash signatures of \(D_1\) and \(D_2\) agree in any one particular row is \(s\).

- probability that signature agree in all rows of particular band is \(s^r\)
- probability that signatures do not agree in all rows of particular band is \(1 - s^r\)
- probability that the signatures do not agree in all rows of any band is \((1 - s^r)^b\)
- probability that the signatures agree in all rows of at least one band is \(1 - (1 - s^r)^b\). This is the probability of becoming candidate pair

Threshold \(t\): the similarity at which the probability of becoming a candidate is \(\frac{1}{2}\). Treating \(t\) as a function of \(b\) and \(r\), \(\frac{1}{2} = 1 - (1 - t^r)^b\); an approximation is \(t \approx \left(\frac{1}{b}\right)^{\frac{1}{r}}\).
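The candidate-pair probability and the threshold approximation can be checked numerically; \(b = 20\), \(r = 5\) are assumed example values:

```python
def candidate_prob(s, r, b):
    """Probability that two documents with Jaccard similarity s agree in all
    rows of at least one of b bands of r rows: 1 - (1 - s^r)^b."""
    return 1 - (1 - s ** r) ** b

b, r = 20, 5
t = (1 / b) ** (1 / r)    # approximate 50%-threshold, about 0.55 here
```

Similarities well above `t` are almost certainly caught, ones well below it almost never, which gives the familiar S-curve.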

- fix \(k\) and construct for each document its set of \(k\)-shingles
- sort document-shingle pairs by shingle
- Fix \(n\) for the minhash signatures and compute them
- Fix the number of bands \(b\) and the number of rows \(r\) such that \(br = n\) and the threshold is \(t = \left(\frac{1}{b}\right)^{\frac{1}{r}}\)
- find candidate pairs
- check whether the threshold is indeed passed
- optionally, check whether the candidate documents are truly similar as expected

Build an inverted index. To discover candidates, compute for each set a prefix of length

\(|u| - \lceil t \cdot |u| \rceil + 1\)

where \(t\) is the Jaccard similarity threshold. Only compare records whose prefixes share an element. Note: this can miss similar records.

- each feature (word, shingle) of a document is assigned as dimension
- number of occurrences is the value in respective dimension

A *distance measure* is a function \(d(x, y)\) that takes two vectors \(x, y\) and produces a real number satisfying these axioms:

- \(d(x, y) \geq 0\)
- \(d(x, y) = 0\) iff \(x = y\)
- \(d(x, y) = d(y, x)\)
- \(d(x, y) \leq d(x, z) + d(z, y)\)

Jaccard: \(d(x, y) = 1 - SIM(x, y)\) Hamming: \(d([x_1, ..., x_n], [y_1, ..., y_n]) = |\{i \in \{1, ...,n\} | x_i \neq y_i\}|\) Euclidean Distance: \(d([x_1, ..., x_n], [y_1, ..., y_n]) = \sqrt{\sum^n_{i=1}(x_i - y_i)^2}\) Manhattan: \(d([x_1, ..., x_n], [y_1, ..., y_n]) = \sum^n_{i=1} | x_i - y_i |\) Cosine similarity: \(sim([x_1, ..., x_n], [y_1, ..., y_n]) = \frac{\vec{x} \cdot \vec{y}}{||\vec{x}|| \cdot || \vec{y} ||}\) Cosine distance: \(d(\vec{x}, \vec{y}) = 1 - sim(\vec{x}, \vec{y})\)

Given:

- information about users: ID, ratings, preferences, demographics, …
- information about products: ID, title, genre, directory, …

We would like to compute a ranking based on which relevant interesting items can be recommended.

Different types of RS are:

- collaborative: based on previous decisions of other users
- content-based: based on user’s previous decisions
- social: based on trust relations, friendships, …
- knowledge-based: based on user’s needs
- hybrid: combinations of various mechanisms

Typically a sparse matrix \(R\) with users \(U = \{u_1, ..., u_n\}\) and products \(P = \{p_1, ..., p_m\}\), containing ratings. The goal is to find \(pred(a, p)\), for an active user \(a \in U\) and all non-rated items \(p\), that maximizes \(a\)'s utility.

Assumptions: if users had similar taste in the past, they will have similar taste in the future, and preferences remain stable and consistent over time.

Rating \(r(a, q)\) of some item \(q\) for active user \(a\) is predicted based on ratings \(r(u, p)\) assigned to item \(p\) by other users \(u \in U\).

Predict the rating of an item by finding users similar to the active user.

Let \(P_{xy}\) be a set of all items co-rated by both users \(x\) and \(y\):

\(P_{xy} = \{p \in P | r_{x, p} \neq \emptyset, r_{y, p} \neq \emptyset\}\)

Compute similarity using cosine similarity or Pearson correlation. Cosine similarity is invariant to the length of the vectors; Pearson is invariant to separate changes of location and scale of the variables, i.e. linear transformations \(a + bx\) with constants \(a, b\) and \(b \neq 0\). Take the nearest neighbors and use them to compute the rating, either as a plain average or as an average weighted by the similarity factors, with normalisation.
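The Pearson correlation over co-rated items can be sketched as follows; the toy users and ratings are assumptions:

```python
def pearson(x_ratings, y_ratings):
    """Pearson correlation over the items P_xy co-rated by users x and y."""
    common = x_ratings.keys() & y_ratings.keys()
    n = len(common)
    if n == 0:
        return 0.0                     # assumed fallback: no co-rated items
    xs = [x_ratings[p] for p in common]
    ys = [y_ratings[p] for p in common]
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    den = (sum((a - mx) ** 2 for a in xs) * sum((b - my) ** 2 for b in ys)) ** 0.5
    return num / den if den else 0.0

alice = {"p1": 5, "p2": 3, "p3": 4}
bob   = {"p1": 3, "p2": 1, "p3": 2}    # same taste as Alice, shifted by 2
eve   = {"p1": 1, "p2": 5, "p3": 2}
```

Bob's ratings are a linear shift of Alice's, so their Pearson correlation is exactly 1, illustrating the location invariance mentioned above.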

User-based recommendation is problematic with a large number of users. As the matrix is sparse, only a few common ratings between two users are likely. Items are more stable than users. Thus: item-based methods are attractive!

Determine similarity of items by column-wise analysis of utility matrix.

Let \(U_{xy}\) be the set of all users who rated both items \(x\) and \(y\):

\(U_{xy} = \{u \in U | r_{u, x} \neq \emptyset, r_{u, y} \neq \emptyset\}\)

Use adjusted cosine similarity. The prediction function is the similarity weighted average of all ratings of similar items by the active user.

Given utility matrix \(R\). Analyse \(R\) to determine association rules: If user \(u\) liked a certain set of items \(Q \subseteq P\) then \(u\) will most probably also like item \(p \notin Q\).

Let \(a\) be the active user.

- Check whether \(a\) has already rated items such that the precondition of a rule is satisfied
- Recommend to \(a\) all items \(i\) appearing as the consequence of such a rule, whenever \(a\) has not rated \(i\) so far

Rules are written as \(X \Rightarrow Y\).

- Support of an itemset \(X\): \(supp(X) = \frac{\text{number of users having rated all items in } X}{\text{number of all users}}\)
- Confidence of \(X \Rightarrow Y\): \(conf(X \Rightarrow Y) = \frac{supp(X \cup Y)}{supp(X)}\)

Let \(s \in [0, 1]\) be a threshold. Itemset \(X\) is called frequent whenever \(supp(X) \geq s\).

Let \(c \in [0, 1]\) be a threshold. Determine association rules \(X \Rightarrow Y\) where \(X \cup Y\) is frequent and \(conf(X \Rightarrow Y) \geq c\)

Apriori property: every subset of a frequent itemset is also a frequent itemset. Whenever an itemset \(X\) is not frequent, any itemset \(Y \supseteq X\) is also not frequent.

Finding association rules:

1. Consider rules \(X \Rightarrow Y\) such that \(supp(X \cup Y) \geq s\) and the active user has rated all items in \(X\)
2. Verify \(\frac{supp(X \cup Y)}{supp(X)} \geq c\)
3. Compute the union of all items appearing in the consequence of such rules
4. Sort the items according to the confidence of the rule that predicted them. If multiple rules predict an item, take the one with the highest confidence
5. Return the first \(N\) elements of the ordered list as recommendations
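Support and confidence can be computed directly from the users' rated itemsets; the baskets below are assumed toy data:

```python
def supp(itemset, baskets):
    """Fraction of users whose rated items include the whole itemset."""
    return sum(1 for b in baskets if itemset <= b) / len(baskets)

def conf(X, Y, baskets):
    """conf(X => Y) = supp(X ∪ Y) / supp(X)."""
    return supp(X | Y, baskets) / supp(X, baskets)

baskets = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b"}]
```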

Base prediction on popularity differential between items for users.

Rating database \(R\), set of users \(U\); \(u_i = r_{u, i}\) denotes user \(u\)'s rating of item \(i\).

Items \(i, j: S_{j, i}(R)\) denotes the set of users that rated both \(i\) and \(j\); these are the rows in \(R\) which contain co-ratings for \(i\) and \(j\).

Average deviation of items \(i\) and \(j\)

\(dev_{j, i} = \sum_{u \in S_{j, i}(R)} \frac{r_{u, j} - r_{u, i}}{|S_{j, i}(R)|}\)

Prediction for an item \(j\) for user \(a\) based on item \(i\) is \(pred(a, j) = dev_{j, i} + a_i\)

Combination of all individual predictions:

\(pred(a, j) = \frac{\sum_{i \in Relevant(j)}(dev_{j, i} + a_i)}{|Relevant(j)|}\)
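The deviation and prediction formulas above can be sketched as follows; the tiny rating database is an assumed example:

```python
def slope_one(R, a, j):
    """Predict user a's rating for item j as the average of dev_{j,i} + a_i
    over all items i rated by a (the unweighted scheme from the notes)."""
    preds = []
    for i, a_i in R[a].items():
        if i == j:
            continue
        co = [u for u in R if i in R[u] and j in R[u]]   # S_{j,i}(R)
        if not co:
            continue
        dev = sum(R[u][j] - R[u][i] for u in co) / len(co)
        preds.append(dev + a_i)
    return sum(preds) / len(preds) if preds else None

R = {"alice": {"i1": 2, "i2": 5},
     "bob":   {"i1": 3, "i2": 2, "j": 3},
     "carol": {"i1": 4, "j": 2}}
p = slope_one(R, "alice", "j")
```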

Slope One algorithms are a simple approach to item-based recommendation. They are easy to implement and their accuracy is on par with more complicated and computationally more expensive algorithms.

Challenge: Good predictions when there are few ratings available. Idea: Add more information, exploit transitivity of customer tastes to augment given matrix. Recommendation thus becomes a graph-problem.

Filling in the blank entries “solves” the recommendation problem. The general theory of *singular-value decomposition* (SVD) tells us that a given matrix can be expressed as a product of other matrices with lower dimensionality. This makes sense if the reaction of users to items can be determined based on a small set of features of items and users (latent factors). To apply SVD to a matrix, it must be free of blank entries. As a solution, the UV-decomposition algorithm decomposes a given matrix \(M\) into matrices \(U, V\) such that \(M \sim UV\), where \(M\) may contain blank entries. As \(U\) and \(V\) do not have blank entries, \(UV\) does not have blank entries either; the product fills the blank entries of \(M\).

Example: reduce a \(5 \times 5\) matrix to a \(5 \times 2\) and a \(2 \times 5\) matrix. Observation: together they already contain fewer elements than the original matrix. Find \(U\) and \(V\) that closely approximate \(M\); construct \(U\) and \(V\) iteratively, continuously converging to \(M\). The process stops when the difference to \(M\) is small enough.

Difference: Root-Mean-Square Error (RMSE):

\(RMSE(M, M') = \sqrt{\frac{1}{k} \sum_{i=1}^n \sum_{j=1}^m(m_{ij} - m'_{ij})^2}\)

Note: \(k\) is the number of non-blank elements; a term is taken as \(0\) when \(m_{ij}\) is blank in \(M\).
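The RMSE over non-blank entries only can be sketched as follows; encoding blanks as `None` is an assumed convention:

```python
def rmse(M, M2):
    """Root-mean-square error over the non-blank (non-None) entries of M."""
    diffs = [(m - m2) ** 2
             for row, row2 in zip(M, M2)
             for m, m2 in zip(row, row2)
             if m is not None]                 # blank entries contribute nothing
    return (sum(diffs) / len(diffs)) ** 0.5

M  = [[5, None], [3, 4]]     # utility matrix with one blank entry
UV = [[5, 2],    [1, 4]]     # some approximation (assumed values)
```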

Let \(M\) be an \(m \times n\) matrix with rank \(r\). Find matrices \(U\), \(\Sigma\) and \(V\) with the following properties:

- \(U\) is \(m \times r\) column-orthonormal matrix
- \(V\) is \(n \times r\) column-orthonormal matrix, \(V^T\) is a row-orthonormal matrix
- \(\Sigma\) is a diagonal matrix. Elements of \(\Sigma\) are called singular values of \(M\)

Rating \(r(u, q)\) of some item \(q\) for user \(u\) is estimated based on the ratings \(r(u, p)\) assigned by user \(u\) to items \(p \in P\) that are similar to item \(q\).

The content of item \(p\), \(Content(p)\) typically is given as an item profile, i.e. set of attributes characterizing item \(p\). (Meta-Information, keywords)

Profile of user \(u\) contains tastes and preferences. (Given or learned)

Utility function \(r(u, p) = pred(ContentBasedProfile(u), Content(p))\)

Jaccard: \(SIM(D_1, D_2) = \frac{|D_1 \cap D_2|}{|D_1 \cup D_2|}\) Dice: \(SIM(D_1, D_2) = \frac{2|D_1 \cap D_2|}{|D_1| + |D_2|}\)

The predicted rating/interest is computed by analysing how the user liked similar items/documents in the past. Rocchio’s method: also take into account the user’s feedback on results.

Compute conditional probabilities for each possible rating value given Alice’s other ratings.

E.g.: given \(X = \{i_1 = 1, i_2 = 3, ...\}\), compute \(P(i_5 = 1 | X)\), \(P(i_5 = 2 | X)\), …

\(P(X | Y) = \frac{P(X, Y)}{P(Y)}\)

Thus:

\(P(Y | X) = \frac{P(X | Y)P(Y)}{P(X)}\)

The *naive Bayes* classifier assumes that the features (ratings) are conditionally independent:

\(P(Y | X) = \frac{\prod^d_{i=1} P(X_i | Y)P(Y)}{P(X)}\)

where \(d\) is number of features.

\(P(X)\) is considered constant and will be ignored.
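Scoring the rating values via \(P(Y)\prod_i P(X_i \mid Y)\), with \(P(X)\) dropped, can be sketched as follows; the toy labeled ratings are assumptions, and no smoothing is applied, so unseen feature values yield probability \(0\):

```python
def naive_bayes_scores(X, labeled):
    """Score each class y by P(y) * prod_i P(X_i | y); labeled is a list of
    (feature dict, class) pairs. The constant P(X) is ignored."""
    scores = {}
    for y in {label for _, label in labeled}:
        rows = [f for f, label in labeled if label == y]
        prior = len(rows) / len(labeled)                  # P(y)
        lik = 1.0
        for feat, val in X.items():                       # product of P(X_i | y)
            lik *= sum(1 for f in rows if f.get(feat) == val) / len(rows)
        scores[y] = prior * lik
    return scores

labeled = [({"i1": 1, "i2": 3}, 1),
           ({"i1": 1, "i2": 3}, 1),
           ({"i1": 5, "i2": 2}, 2)]
scores = naive_bayes_scores({"i1": 1, "i2": 3}, labeled)
```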

Remark: in many domains naive Bayes classifiers work well even when the conditional independence assumption does not hold. Data sparsity problems have to be solved.

Task: find a line \(w_1x_1 + w_2x_2 = b\) which separates the instances of two classes. The parameters \(w_1, w_2, b\) have to be learned.

In general: \(n\) dimensions, classification function \(\vec{w}^T \vec{x} = b\) has to be learned. (Hyperplane separating the two classes)

- Training set: data to which an ML algorithm is applied. Feature vector and label \((\vec{x}, y)\)
- Discover a function \(y = f(\vec{x})\) that best predicts the value \(y\) associated with \(\vec{x}\).
- \(y\) a real number: regression
- \(y\) a boolean value: binary classification
- \(y\) member of some finite set: multiclass classification

- Test set: validate the learned function.
- A high error rate signals overfitting of the model (too specific for the training set)
- cross-validation: the data is divided into \(k\) equal-sized chunks; each chunk in turn is treated as test data while the remaining chunks form the training data

Input is \(\vec{x} = [x_1, ..., x_d]\) with weights \(\vec{w} = [w_1, ..., w_d]\) and threshold \(\theta\)

Output is \(+1\) if \(\vec{w}\vec{x} > \theta\), \(-1\) if \(\vec{w}\vec{x} < \theta\).

Training a perceptron we try to find a weight vector \(\vec{w}\) and a threshold \(\theta\) such that all feature vectors with \(y = +1\) are on the positive side and all with \(y = -1\) on the negative side.

This only works for data that is linearly separable, in the sense that a separating hyperplane exists. There may exist more than one possible hyperplane.

Algorithm:

1. Initialize \(\vec{w} = \vec{0}\)
2. Pick a learning-rate parameter \(\eta\) (a small real number). This is crucial: if \(\eta\) is too small, convergence is slow; if it is too big, the boundary will “dance around” and might not converge
3. Consider each training example \(t = (\vec{x}, y)\) in turn:
   - Let \(y' = \vec{w}\vec{x}\)
   - If \(y'\) and \(y\) have the same sign, do nothing; \(t\) is properly classified
   - If \(y'\) and \(y\) have different signs or \(y' = 0\), then \(\vec{w} = \vec{w} + \eta y \vec{x}\)
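The training loop can be sketched as below. Folding the threshold \(\theta\) into the weight vector via a constant bias input is an assumed simplification, and the toy dataset is made up for illustration:

```python
def train_perceptron(data, eta=0.1, epochs=100):
    """data: list of (x, y) pairs with y in {+1, -1}; theta is folded into the
    weight vector as a bias weight on a constant input of 1."""
    w = [0.0] * (len(data[0][0]) + 1)           # last weight acts as -theta
    for _ in range(epochs):
        for x, y in data:
            xb = list(x) + [1.0]                # append the bias input
            y_pred = sum(wi * xi for wi, xi in zip(w, xb))
            if y * y_pred <= 0:                 # misclassified (or on boundary)
                w = [wi + eta * y * xi for wi, xi in zip(w, xb)]
    return w

def predict(w, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, list(x) + [1.0])) > 0 else -1

# Assumed linearly separable toy data: class +1 iff x1 + x2 > 1
data = [([0, 0], -1), ([1, 0], -1), ([0, 1], -1), ([1, 1], 1), ([2, 1], 1)]
w = train_perceptron(data)
```

Since the data is linearly separable, the perceptron convergence theorem guarantees the loop stops updating after finitely many mistakes.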

Maximize the distance \(\gamma\) between the hyperplane and any point of training set. A \(d\)-dimensional hyperplane has at least \(d + 1\) support vectors.

If data is not linearly separable, it can help to map it to a higher dimensional feature space where it is separable. For example: input space \(\vec{x} = (x_1, x_2)\) mapped into quadratic feature space \(f(\vec{x}) = (x_1^2, x_2^2, \sqrt{2}x_1x_2)\)

Given a set of points, group points into clusters using some distance notion. Distance can be **cosine distance**, **jaccard distance** or for example **euclidean distance**.

- agglomerative (bottom up): each point starts as its own cluster; repeatedly merge the nearest clusters into one
- divisive (top down): everything starts as one cluster; recursively split

Cluster distance: distance of centroids.

\(k\)-means for small datasets; BFR for large datasets (required memory: \(O(\text{clusters})\) rather than \(O(\text{data})\)). Idea of BFR: summarize points close enough to a centroid; group points that are close together but not close to any existing centroid.