This document is derived from the lecture by Prof. Dr. G. Lausen at the University of Freiburg in the summer semester 2016 (SS16).
-- DATA
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
_:a foaf:givenName "Alice".
_:b foaf:givenName "Bob" .
_:b dc:date "2005-04-04T04:04:04Z"^^xsd:dateTime .
-- QUERY
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
SELECT ?givenName
WHERE { ?x foaf:givenName ?givenName .
OPTIONAL { ?x dc:date ?date } .
FILTER (!bound(?date)) }
The query returns the given names of all people for whom no dc:date is bound, here "Alice".
If P1 and P2 are graph patterns, then the expressions (P1 AND P2), (P1 OPT P2) and (P1 UNION P2) are graph patterns.
If P is a graph pattern and R is a SPARQL built-in condition, then the expression (P FILTER R) is a graph pattern.
Examples:
P1 = ((?A, email, ?E) OPT (?A, webPage, ?W)).
P2 = (((?A, name, ?N) OPT (?A, email, ?E)) OPT (?A, webPage, ?W)).
P3 = ((?A, name, ?N) OPT ((?A, email, ?E) OPT (?A, webPage, ?W))).
P4 = ((?A, name, ?N) AND ((?A, email, ?E) UNION (?A, webPage, ?W))).
Problem: the web is not a strongly connected graph!
Dead end (no more outgoing links): \(v_\infty = \begin{pmatrix}0 \\ 0 \\ 0 \\ 0\end{pmatrix}\)
Spider trap (infinite loop back to itself): \(v_\infty = \begin{pmatrix}0 \\ 0 \\ 1 \\ 0\end{pmatrix}\)
Note that this does not necessarily represent a possible distribution of the random surfer, but an approximation of the relative importance of the respective pages.
New iterative step:
\(v' = \beta M v + \frac{1 - \beta}{n} \cdot e\)
Example: search term Jaguar - how to rank pages?
\(v' = \beta M v + (1 - \beta) \frac{e_s}{|S|}\)
\(S\) is the teleport set, i.e. the set of pages belonging to the selected topic. \(e_s\) is a vector with \(1\)'s in the \(S\)-positions and \(0\)'s elsewhere.
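A minimal numpy sketch of this iteration (not lecture code; the function name, the example matrix and the parameter values are illustrative assumptions). With \(S\) equal to all pages it reduces to the plain taxation step above.

```python
import numpy as np

def pagerank(M, beta=0.85, S=None, iters=50):
    n = M.shape[0]
    S = list(range(n)) if S is None else list(S)   # default teleport set: all pages
    e_S = np.zeros(n)
    e_S[S] = 1.0                                   # 1's in the S-positions, 0's elsewhere
    v = np.full(n, 1.0 / n)                        # start with the uniform distribution
    for _ in range(iters):
        v = beta * (M @ v) + (1 - beta) * e_S / len(S)   # v' = beta*M*v + (1-beta)*e_S/|S|
    return v

# Column-stochastic link matrix: page 0 -> 1, page 1 -> 0 and 2, page 2 -> 2 (spider trap).
M = np.array([[0.0, 0.5, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 0.5, 1.0]])
print(pagerank(M))            # taxation keeps the spider trap from absorbing everything
print(pagerank(M, S=[0]))     # topic-sensitive variant: teleport only to page 0
```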
Spam farm: a collection of pages designed to increase the PageRank of a certain target page. From the spammer's perspective the web consists of inaccessible pages, accessible pages and own pages.
Thus the PageRank \(y\) of the target page \(t\) is built from \(x\) plus, for each of the \(m\) supporting pages, \(\beta\) times that page's PageRank:
\(\beta(\beta \frac{y}{m} + \frac{1-\beta}{n})\)
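Summing over the \(m\) supporting pages and solving for \(y\) (a short worked step, following the standard spam-farm analysis, with \(x\) the contribution of the accessible pages and \(n\) the total number of pages): \(y = x + \beta m \left(\beta \frac{y}{m} + \frac{1-\beta}{n}\right) = x + \beta^2 y + \beta(1-\beta)\frac{m}{n}\), hence \(y = \frac{x}{1-\beta^2} + c \cdot \frac{m}{n}\) with \(c = \frac{\beta}{1+\beta}\).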
Conclusion: increase \(m\) (the number of supporting pages) to increase \(y\).
TrustRank: topic-sensitive PageRank where the “topic” is a set of pages believed to be trustworthy. Requires a (human-selected) seed set.
Assumption: trustworthy pages rarely link to spam pages (approximate isolation of the good set).
The spam mass of a page \(p\) is \(\frac{r - t}{r}\), where \(r\) is the PageRank and \(t\) the TrustRank of \(p\). A spam mass close to 1 indicates spam.
Measure the similarity of nodes of the same type by the similarity of their structural context in the respective graph. Example: Users, Tags, Web pages. Users place tags, tags are placed on pages, users place tags on pages. Note that only binary relationships are expressed, for simplicity of structure.
The node of the same type at which a random walker starting at \(N\) is most likely to be found in the limit is considered most similar to \(N\).
\(v' = \beta M v + (1 - \beta) e_N\)
Motivation: a search engine should not return items which are “too” similar, e.g. plagiarism, mirror pages, or articles from the same source.
\(SIM(S, T) = \frac{|S \cap T|}{|S \cup T|}\)
k-shingle (or k-gram) is any substring of length \(k\) found within the document.
Let \(D\) be a document. The set of \(k\)-shingles appearing one or more times in \(D\) is denoted \(S^k(D)\).
White space can be ignored or replaced by a single blank.
Choosing a good value for \(k\) is critical: pick \(k\) large enough that the probability of any given shingle appearing in any given document is low. Rule of thumb: \(k = 5\) for short documents, \(k = 10\) for long documents.
(Long) shingles can be compressed by hashing them to, for example, \(4\) bytes. To decide whether hashing is useful, compare the number of possible hash values with the number of possible shingles.
Real world scenario: News articles have a lot of stop words. A shingle can thus be formed out of a stop word followed by two other words.
Note: two pages with the same article and different surrounding have higher Jaccard Similarity than two pages with same surrounding and different articles.
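A small Python sketch of \(k\)-shingling and Jaccard similarity (illustrative only; the example sentences and function names are assumptions):

```python
def shingles(text: str, k: int = 5) -> set:
    """Return the set S^k(D) of all k-shingles of the document `text`."""
    text = " ".join(text.split())              # collapse white space to single blanks
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def jaccard(s: set, t: set) -> float:
    """SIM(S, T) = |S intersect T| / |S union T|."""
    return len(s & t) / len(s | t) if (s or t) else 0.0

d1 = shingles("The plane was forced to land")
d2 = shingles("A plane was forced to land immediately")
print(round(jaccard(d1, d2), 2))
```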
Scenario: near-duplicate detection for documents. With \(N = 1\,000\,000\) documents, a pairwise test for Jaccard similarity results in
\(\frac{N(N-1)}{2} \approx 5 \cdot 10^{11}\)
comparisons. At \(10^6\) comparisons per second, this takes roughly \(6\) days.
Goal: Estimate Jaccard Similarity by breaking down large sets of shingles into signatures.
The characteristic matrix is a matrix that associates shingles with documents (sets of shingles). If a shingle is in a document, the value is \(1\), else \(0\).
Minhashing considers permutations of the rows of a characteristic matrix. In respect to a permutation, the minhash value of a document is given by the number of the first row in which the column has a 1.
The probability that the minhash function for a random permutation of rows produces the same value for two sets equals the Jaccard similarity of those sets.
Given \(n\) minhash functions \(h_1, h_2, ..., h_n\) and a collection of sets \(S_1, ..., S_m\), we construct a signature matrix SIG with one column per set \(S_i\) (\(1 \leq i \leq m\)); the column of \(S_i\) is the vector \([h_1(S_i), ..., h_n(S_i)]\).
Algorithm: check each row of the characteristic matrix only once and compute all hash functions simultaneously. \(SIG(i, c)\) refers to the \(i\)th hash function and column \(c\) (representing set \(S_c\)).
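A Python sketch of this one-pass signature computation (illustrative; hash functions of the form \((a \cdot r + b) \bmod p\) and the input format are assumptions):

```python
import random

def minhash_signatures(char_matrix, n_hashes=100, seed=0):
    """char_matrix: list of rows, each row a list of 0/1 entries, one per set (column)."""
    n_cols = len(char_matrix[0])
    rnd = random.Random(seed)
    p = 4294967311                                   # a prime larger than the number of rows
    hashes = [(rnd.randrange(1, p), rnd.randrange(0, p)) for _ in range(n_hashes)]
    SIG = [[float("inf")] * n_cols for _ in range(n_hashes)]
    for r, row in enumerate(char_matrix):            # each row is inspected only once
        hr = [(a * r + b) % p for a, b in hashes]    # all hash functions at once
        for c, bit in enumerate(row):
            if bit:                                  # shingle r occurs in set S_c
                for i in range(n_hashes):
                    SIG[i][c] = min(SIG[i][c], hr[i])
    return SIG
```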
Problem: efficiently find pairs of documents with greatest similarity. For 1M documents, one would have to check \(\binom{10^6}{2}\) pairs (about half a trillion!). Idea: only check pairs of documents whose similarity is above some lower bound.
\(b\) bands, \(r\) rows per band, \(D_1\) and \(D_2\) a pair of documents with Jaccard similarity \(s\). The probability that the minhash signatures of \(D_1\) and \(D_2\) agree in any one particular row is \(s\).
Threshold \(t\): the similarity at which the probability of becoming a candidate is \(\frac{1}{2}\). Treating \(t\) as a function of \(b\) and \(r\): \(\frac{1}{2} = 1 - (1 - t^r)^b\), thus the approximation is \(t \approx \left(\frac{1}{b}\right)^{\frac{1}{r}}\).
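A small sketch of the banding analysis (the parameter choices are illustrative):

```python
def candidate_prob(s: float, b: int, r: int) -> float:
    """Probability that a pair with Jaccard similarity s becomes a candidate."""
    return 1 - (1 - s ** r) ** b

b, r = 20, 5                                   # 100 minhash rows split into 20 bands of 5 rows
print((1 / b) ** (1 / r))                      # approximate threshold t, about 0.55
print(candidate_prob(0.8, b, r))               # similar pair: almost certainly a candidate
print(candidate_prob(0.3, b, r))               # dissimilar pair: rarely a candidate
```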
Build an inverted index; to discover candidates, compute for each set a prefix of length
\(|u| - \lceil t \cdot |u| \rceil + 1\)
where \(t\) is the Jaccard similarity threshold. Only compare records whose prefixes share an element. Note: this can miss similar records.
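A one-line sketch of the prefix length (the example values are assumptions):

```python
import math

def prefix_length(set_size: int, t: float) -> int:
    """Number of leading elements to index for prefix filtering at threshold t."""
    return set_size - math.ceil(t * set_size) + 1

print(prefix_length(10, 0.8))   # a record with 10 elements: index its first 3 elements
```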
A distance measure is a function \(d(x, y)\) that takes two vectors \(x, y\) and produces a real number satisfying these axioms:
* Jaccard: \(d(x, y) = 1 - SIM(x, y)\)
* Hamming: \(d([x_1, ..., x_n], [y_1, ..., y_n]) = |\{i \in \{1, ..., n\} \mid x_i \neq y_i\}|\)
* Euclidean distance: \(d([x_1, ..., x_n], [y_1, ..., y_n]) = \sqrt{\sum^n_{i=1}(x_i - y_i)^2}\)
* Manhattan: \(d([x_1, ..., x_n], [y_1, ..., y_n]) = \sum^n_{i=1} | x_i - y_i |\)
* Cosine similarity: \(sim([x_1, ..., x_n], [y_1, ..., y_n]) = \frac{\vec{x} \cdot \vec{y}}{||\vec{x}|| \cdot || \vec{y} ||}\)
* Cosine distance: \(d(\vec{x}, \vec{y}) = 1 - sim(\vec{x}, \vec{y})\)
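A short Python sketch computing these measures for two example vectors (illustrative values):

```python
import math

x, y = [1.0, 2.0, 3.0], [2.0, 2.0, 5.0]

hamming   = sum(a != b for a, b in zip(x, y))                       # positions that differ
euclidean = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
manhattan = sum(abs(a - b) for a, b in zip(x, y))
cos_sim   = sum(a * b for a, b in zip(x, y)) / (
            math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))
cos_dist  = 1 - cos_sim
print(hamming, round(euclidean, 3), manhattan, round(cos_dist, 3))
```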
Given:
We would like to compute a ranking based on which relevant interesting items can be recommended.
Different types of RS are:
Typically a sparse matrix \(R\) containing ratings, with users \(U = \{u_1, ..., u_n\}\) and products \(P = \{p_1, ..., p_m\}\). The goal is to find \(pred(a, p)\), where \(a \in U\) is the active user, for all items \(p\) not yet rated by \(a\), such that \(a\)’s utility is maximized.
Assumptions: if users had similar tastes in the past, they will have similar tastes in the future, and preferences remain stable and consistent over time.
The rating \(r(a, q)\) of some item \(q\) for the active user \(a\) is predicted based on the ratings \(r(u, p)\) assigned to items \(p\) by other users \(u \in U\).
Predict the rating of an item by finding users similar to the active user.
Let \(P_{xy}\) be a set of all items co-rated by both users \(x\) and \(y\):
\(P_{xy} = \{p \in P | r_{x, p} \neq \emptyset, r_{y, p} \neq \emptyset\}\)
Compute similarity using cosine similarity or Pearson correlation. Cosine similarity is invariant to the length of the vectors; Pearson is invariant to separate changes of location and scale of the variables, i.e. linear transformations \(a + bx\) with constants \(a, b\) and \(b \neq 0\). Take the nearest neighbours and use them to compute the rating, either as a plain average or as an average weighted by the similarity factors, with normalisation.
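A Python sketch of one common user-based variant (mean-centred, similarity-weighted average over the \(k\) nearest neighbours; the toy data and parameter choices are illustrative assumptions, not the lecture's exact prediction formula):

```python
import math

def pearson(rx: dict, ry: dict) -> float:
    """Pearson correlation over the items co-rated by both users (P_xy)."""
    common = set(rx) & set(ry)
    if len(common) < 2:
        return 0.0
    mx = sum(rx[p] for p in common) / len(common)
    my = sum(ry[p] for p in common) / len(common)
    num = sum((rx[p] - mx) * (ry[p] - my) for p in common)
    den = math.sqrt(sum((rx[p] - mx) ** 2 for p in common)) * \
          math.sqrt(sum((ry[p] - my) ** 2 for p in common))
    return num / den if den else 0.0

def predict(a: dict, others: dict, q, k: int = 2) -> float:
    """Mean rating of the active user plus the weighted deviations of the k nearest neighbours."""
    ma = sum(a.values()) / len(a)
    sims = [(pearson(a, r), r) for r in others.values() if q in r]
    sims = sorted(sims, key=lambda t: -t[0])[:k]               # k nearest neighbours
    num = sum(s * (r[q] - sum(r.values()) / len(r)) for s, r in sims)
    den = sum(abs(s) for s, _ in sims)
    return ma + num / den if den else ma

ratings = {"u1": {"p1": 3, "p2": 4, "p3": 2}, "u2": {"p1": 4, "p2": 5, "p3": 5}}
alice   = {"p1": 3, "p2": 5}
print(round(predict(alice, ratings, "p3"), 2))
```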
User-based recommendation is problematic with a large number of users. As the matrix is sparse, only a few common ratings between two users are likely. Items are more stable than users. Thus: item-based methods are attractive!
Determine similarity of items by column-wise analysis of utility matrix.
Let \(U_{xy}\) be the set of all users who rated both items \(x\) and \(y\):
\(U_{xy} = \{u \in U | r_{u, x} \neq \emptyset, r_{u, y} \neq \emptyset\}\)
Use adjusted cosine similarity. The prediction function is the similarity weighted average of all ratings of similar items by the active user.
Given utility matrix \(R\). Analyse \(R\) to determine association rules: If user \(u\) liked a certain set of items \(Q \subseteq P\) then \(u\) will most probably also like item \(p \notin Q\).
Let \(a\) be the active user.
* Check whether \(a\) has already rated items such that the precondition of a rule is satisfied.
* Recommend to \(a\) all items \(i\) appearing as the consequence of such a rule, whenever \(a\) has not rated \(i\) so far.
Rules are written as \(X \Rightarrow Y\).
Let \(s \in [0, 1]\) be a threshold. Itemset \(X\) is called frequent whenever \(supp(X) \geq s\).
Let \(c \in [0, 1]\) be a threshold. Determine association rules \(X \Rightarrow Y\) where \(X \cup Y\) is frequent and \(conf(X \Rightarrow Y) \geq c\)
Apriori property: every subset of a frequent itemset is also a frequent itemset. Whenever an itemset \(X\) is not frequent, any itemset \(Y \supseteq X\) is also not frequent.
Finding association rules:
1. Consider rules \(X \Rightarrow Y\) such that \(supp(X \cup Y) \geq s\) and the active user has rated all items in \(X\).
2. Verify \(\frac{supp(X \cup Y)}{supp(X)} \geq c\).
3. Compute the union of all items appearing in the consequence of such rules.
4. Sort the items according to the confidence of the rule that predicted them. If there are multiple rules, take the one with the highest confidence.
5. Return the first \(N\) elements of the ordered list as recommendations.
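A small Python sketch of these steps over a toy transaction database (thresholds, data and single-item consequences are illustrative assumptions):

```python
from itertools import combinations

transactions = [{"a", "b", "c"}, {"a", "b"}, {"b", "c"}, {"a", "c"}, {"a", "b", "c"}]
s, c = 0.4, 0.6                                   # support and confidence thresholds
active = {"a", "b"}                               # items the active user liked

def supp(itemset):
    """Fraction of transactions containing the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

recommendations = {}
for item in {"a", "b", "c"} - active:             # candidate consequences Y = {item}
    for n in range(1, len(active) + 1):
        for X in combinations(active, n):         # preconditions already rated by the user
            X = set(X)
            if supp(X | {item}) >= s:             # X union Y is frequent
                conf = supp(X | {item}) / supp(X)
                if conf >= c:                     # rule X => {item} accepted
                    recommendations[item] = max(recommendations.get(item, 0), conf)

print(sorted(recommendations.items(), key=lambda kv: -kv[1]))
```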
Base prediction on popularity differential between items for users.
Rating database \(R\), set of users \(U\). \(u_i = r_{u, i}\) denotes user \(u\)’s rating of item \(i\).
Items \(i, j: S_{j, i}(R)\) denotes the set of users that rated both \(i\) and \(j\); these are the rows in \(R\) which contain co-ratings for \(i\) and \(j\).
Average deviation of items \(i\) and \(j\):
\(dev_{j, i} = \sum_{u \in S_{j, i}(R)} \frac{r_{u, j} - r_{u, i}}{|S_{j, i}(R)|}\)
Prediction for an item \(j\) for user \(a\) based on item \(i\) is \(pred(a, j) = dev_{j, i} + a_i\)
Combination of all individual predictions:
\(pred(a, j) = \frac{\sum_{i \in Relevant(j)}(dev_{j, i} + a_i)}{|Relevant(j)|}\)
Slope One algorithms are a simple approach to item-based recommendation. They are easy to implement and their accuracy is on par with more complicated and computationally more expensive algorithms.
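A Python sketch of the Slope One prediction above (the toy rating database is an illustrative assumption):

```python
def slope_one(ratings: dict, a: str, j: str):
    """ratings: user -> {item: rating}; predict user a's rating for item j."""
    preds = []
    for i, a_i in ratings[a].items():                       # items i already rated by a
        co = [u for u in ratings if i in ratings[u] and j in ratings[u]]   # S_{j,i}(R)
        if not co:
            continue
        dev_ji = sum(ratings[u][j] - ratings[u][i] for u in co) / len(co)  # average deviation
        preds.append(dev_ji + a_i)                          # individual prediction
    return sum(preds) / len(preds) if preds else None       # average over Relevant(j)

R = {"alice": {"i1": 2, "i2": 5},
     "u1":    {"i1": 3, "i2": 2, "i3": 3},
     "u2":    {"i1": 4, "i3": 3}}
print(slope_one(R, "alice", "i3"))
```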
Challenge: Good predictions when there are few ratings available. Idea: Add more information, exploit transitivity of customer tastes to augment given matrix. Recommendation thus becomes a graph-problem.
Filling the blank entries “solves” the recommendation problem. The general theory of singular-value decomposition tells us that a given matrix can be expressed as a product of other matrices with lower dimensionality. This makes sense if the reaction of users to items can be determined from a small set of features of items and users (latent factors). To apply SVD to a matrix, it must be free of blank entries. As a solution, the UV-decomposition algorithm decomposes a given matrix \(M\) into matrices \(U, V\) such that \(M \sim UV\), where \(M\) may contain blank entries. As \(U\) and \(V\) do not have blank entries, \(UV\) does not have blank entries either. The product fills the blank entries of \(M\).
Example: reduce a \(5 \times 5\) matrix to a \(5 \times 2\) and a \(2 \times 5\) matrix. Observation: this already contains fewer elements than the original matrix. Find \(U\) and \(V\) that closely approximate \(M\). Construct \(U\) and \(V\) iteratively, continuously converging towards \(M\). The process stops when the difference to \(M\) is small enough.
Difference: Root-Mean-Square Error (RMSE):
\(RMSE(M, M') = \sqrt{\frac{1}{k} \sum_{i=1}^n \sum_{j=1}^m(m_{ij} - m'_{ij})^2}\)
Note: \(k\) is the number of non-blank elements; a term contributes \(0\) when \(m_{ij}\) is blank in \(M\).
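A Python sketch of a UV-decomposition by simple gradient descent on this RMSE, skipping blank entries (dimensions, learning rate and iteration count are illustrative assumptions; the lecture's incremental optimisation may differ):

```python
import random

def uv_decompose(M, d=2, steps=5000, lr=0.01, seed=0):
    """Find U (n x d) and V (d x m) such that UV approximates the non-blank entries of M."""
    rnd = random.Random(seed)
    n, m = len(M), len(M[0])
    U = [[rnd.random() for _ in range(d)] for _ in range(n)]
    V = [[rnd.random() for _ in range(m)] for _ in range(d)]
    for _ in range(steps):
        for i in range(n):
            for j in range(m):
                if M[i][j] is None:                          # blank entries do not contribute
                    continue
                err = M[i][j] - sum(U[i][k] * V[k][j] for k in range(d))
                for k in range(d):                           # move U and V towards M
                    U[i][k] += lr * err * V[k][j]
                    V[k][j] += lr * err * U[i][k]
    return U, V

M = [[5, 2, 4, None, 3],
     [3, 1, 2, 4, 1],
     [2, None, 3, 1, 4],
     [2, 5, 4, 3, 5],
     [4, 4, 5, 4, None]]
U, V = uv_decompose(M)
print(round(sum(U[0][k] * V[k][0] for k in range(2)), 2))    # approximation of M[0][0]
```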
Let \(M\) be an \(m \times n\) matrix with rank \(r\). Find matrices \(U\), \(\Sigma\) and \(V\) with the following properties:
The rating \(r(u, q)\) of some item \(q\) for user \(u\) is estimated based on the ratings \(r(u, p)\) assigned by user \(u\) to items \(p \in P\) that are similar to item \(q\).
The content of item \(p\), \(Content(p)\), is typically given as an item profile, i.e. a set of attributes characterizing item \(p\) (meta-information, keywords).
Profile of user \(u\) contains tastes and preferences. (Given or learned)
Utility function \(r(u, p) = pred(ContentBasedProfile(u), Content(p))\)
Jaccard: \(SIM(D_1, D_2) = \frac{|D_1 \cap D_2|}{|D_1 \cup D_2|}\) Dice: \(SIM(D_1, D_2) = \frac{2|D_1 \cap D_2|}{|D_1| + |D_2|}\)
The predicted rating/interest is computed by analysing how the user liked similar items/documents in the past. Rocchio’s method: also take into account the user’s feedback on results.
Compute conditional probabilities for each possible rating value given Alice’s other ratings.
E.g.: given \(X = \{i_1 = 1, i_2 = 3, ...\}\), compute \(P(i_5 = 1 | X)\), \(P(i_5 = 2 | X)\), …
\(P(X | Y) = \frac{P(X, Y)}{P(Y)}\)
Thus:
\(P(Y | X) = \frac{P(X | Y)P(Y)}{P(X)}\)
The naive Bayes classifier assumes that the features (ratings) are conditionally independent:
\(P(Y | X) = \frac{\prod^d_{i=1} P(X_i | Y)P(Y)}{P(X)}\)
where \(d\) is number of features.
\(P(X)\) is considered constant and will be ignored.
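A Python sketch of this estimation on a toy rating table, using relative frequencies as probability estimates (the data and column layout are illustrative assumptions):

```python
# Training rows: ratings of items i1, i2 and of the target item (last column).
rows = [(1, 3, 1), (1, 3, 1), (2, 3, 2), (1, 2, 2), (2, 2, 2)]
x = (1, 3)                                                        # active user's ratings of i1, i2

scores = {}
for y in {r[-1] for r in rows}:                                   # possible target ratings Y = y
    rows_y = [r for r in rows if r[-1] == y]
    p_y = len(rows_y) / len(rows)                                 # prior P(Y = y)
    likelihood = 1.0
    for i, xi in enumerate(x):                                    # P(X_i = xi | Y = y)
        likelihood *= sum(r[i] == xi for r in rows_y) / len(rows_y)
    scores[y] = p_y * likelihood                                  # P(X) ignored (constant)

print(max(scores, key=scores.get), scores)                        # most probable rating
```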
Remark: in many domains naive Bayes classifiers work well even when the conditional independence assumption does not hold. Data sparsity problems have to be solved.
Task: find a line \(w_1x_1 + w_2x_2 = b\) which separates the instances of the two classes. The parameters \(w_1, w_2, b\) have to be learned.
In general: \(n\) dimensions, classification function \(\vec{w}^T \vec{x} = b\) has to be learned. (Hyperplane separating the two classes)
Input is \(\vec{x} = [x_1, ..., x_d]\) with weights \(\vec{w} = [w_1, ..., w_d]\) and threshold \(\theta\)
Output is \(+1\) if \(\vec{w}\vec{x} > \theta\), \(-1\) if \(\vec{w}\vec{x} < \theta\).
Training a perceptron we try to find a weight vector \(\vec{w}\) and a threshold \(\theta\) such that all feature vectors with \(y = +1\) are on the positive side and all with \(y = -1\) on the negative side.
This only works for data that is linearly separable, i.e. a separating hyperplane exists. There may be more than one possible hyperplane.
Algorithm:
1. Initialize \(\vec{w} = \vec{0}\).
2. Pick a learning-rate parameter \(\eta\) (a small real number). This is crucial: if \(\eta\) is too small, convergence is slow; if it is too big, the boundary will “dance around” and might not converge.
3. Consider each training example \(t = (\vec{x}, y)\) in turn.
   * Let \(y' = \vec{w} \cdot \vec{x}\).
   * If \(y'\) and \(y\) have the same sign, do nothing; \(t\) is properly classified.
   * If \(y'\) and \(y\) have different signs or \(y' = 0\), then \(\vec{w} = \vec{w} + \eta y \vec{x}\).
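A Python sketch of this training loop with threshold \(\theta = 0\) (the toy data and learning rate are illustrative assumptions):

```python
def train_perceptron(examples, eta=0.1, epochs=20):
    """examples: list of (x, y) with x a feature vector and y in {+1, -1}."""
    d = len(examples[0][0])
    w = [0.0] * d                                          # 1. initialise w = 0
    for _ in range(epochs):
        for x, y in examples:                              # 3. consider each example in turn
            y_pred = sum(wi * xi for wi, xi in zip(w, x))  # y' = w . x
            if y_pred * y <= 0:                            # wrong sign or y' = 0
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
    return w

data = [([2.0, 1.0], 1), ([3.0, 2.0], 1), ([-1.0, -2.0], -1), ([-2.0, -1.0], -1)]
print(train_perceptron(data))
```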
Maximize the distance \(\gamma\) between the hyperplane and any point of the training set. A \(d\)-dimensional hyperplane has at least \(d + 1\) support vectors.
If data is not linearly separable, it can help to map it to a higher dimensional feature space where it is separable. For example: input space \(\vec{x} = (x_1, x_2)\) mapped into quadratic feature space \(f(\vec{x}) = (x_1^2, x_2^2, \sqrt{2}x_1x_2)\)
Given a set of points, group them into clusters using some notion of distance. The distance can be, for example, cosine distance, Jaccard distance, or Euclidean distance.
Cluster distance: distance of centroids.
\(k\)-means for small datasets, BFR for large datasets (required memory: \(O(c)\), not \(O(d)\), where \(c\) is the number of clusters and \(d\) the number of data points). Idea of BFR: summarize points close enough to a centroid; group points that are close together but not close to any existing centroid.
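A plain \(k\)-means sketch with Euclidean distance (illustrative data; BFR builds on this idea but keeps only cluster summaries in memory):

```python
import random

def kmeans(points, k=2, iters=10, seed=0):
    rnd = random.Random(seed)
    centroids = rnd.sample(points, k)                      # pick k initial centroids
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                                   # assign each point to its nearest centroid
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[i].append(p)
        for i, cl in enumerate(clusters):                  # recompute centroids
            if cl:
                centroids[i] = tuple(sum(xs) / len(cl) for xs in zip(*cl))
    return centroids

pts = [(1.0, 1.0), (1.5, 2.0), (3.0, 4.0), (5.0, 7.0), (3.5, 5.0), (4.5, 5.0)]
print(kmeans(pts, k=2))
```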