This document is derived from the lecture by Prof. Dr. G. Lausen at the University of Freiburg in the summer semester 2016 (SS16).

Queries and Recommendations

Queries

Weighting terms: TF-IDF measure

RDF Graph Data Model

Property Graph Data Model

Recommendations

RDF-graphs and SPARQL

Example

# DATA
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
_:a foaf:givenName "Alice".
_:b foaf:givenName "Bob" .
_:b dc:date "2005-04-04T04:04:04Z"^^xsd:dateTime .

# QUERY
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
SELECT ?givenName
WHERE { ?x foaf:givenName ?givenName .
        OPTIONAL { ?x dc:date ?date } .
        FILTER (!bound(?date)) }

Semantics of SPARQL 1.0 (Definition 1)

Examples

P1 = ((?A, email, ?E) OPT (?A, webPage, ?W)).
P2 = (((?A, name, ?N) OPT (?A, email, ?E)) OPT (?A, webPage, ?W)).
P3 = ((?A, name, ?N) OPT ((?A, email, ?E) OPT (?A, webPage, ?W))).
P4 = ((?A, name, ?N) AND ((?A, email, ?E) UNION (?A, webPage, ?W))).

Definition 2

Query forms

PageRank

Random Surfer

Problem: the web is not a strongly connected graph!

Dead end (no outgoing links): \(v_\infty = \begin{pmatrix}0 \\ 0 \\ 0 \\ 0\end{pmatrix}\). Spider trap (infinite loop to itself): \(v_\infty = \begin{pmatrix}0 \\ 0 \\ 1 \\ 0\end{pmatrix}\).

Removing dead ends

Note that the result will not necessarily represent a possible distribution of the random surfer, but an approximation of the relative importance of the respective pages.

Treating Spider Traps (and dead ends)

New iterative step:

\(v' = \beta M v + \frac{1 − \beta}{n} \cdot e\)
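
A minimal sketch of this iteration in Python (an illustration, not from the lecture), assuming the column-stochastic transition matrix \(M\) is given as a dense numpy array. Passing a different teleport vector gives the biased variants further below: \(e_S/|S|\) for topic-sensitive PageRank and \(e_N\) for SimRank.

import numpy as np

def pagerank(M, beta=0.85, teleport=None, tol=1e-10, max_iter=100):
    """Iterate v' = beta*M*v + (1-beta)*teleport until convergence.
    M is assumed to be column-stochastic (dead ends already handled);
    teleport defaults to the uniform vector e/n."""
    n = M.shape[0]
    if teleport is None:
        teleport = np.full(n, 1.0 / n)       # uniform teleport: e/n
    v = np.full(n, 1.0 / n)                  # start with the uniform distribution
    for _ in range(max_iter):
        v_next = beta * M @ v + (1 - beta) * teleport
        if np.abs(v_next - v).sum() < tol:
            break
        v = v_next
    return v

# Toy 3-page web: page 0 links to 1 and 2, page 1 links to 2, page 2 links to 0.
M = np.array([[0.0, 0.0, 1.0],
              [0.5, 0.0, 0.0],
              [0.5, 1.0, 0.0]])
print(pagerank(M))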

Topic-Sensitive PageRank

Example: search term Jaguar - how to rank pages?

Biased random walk

\(v' = \beta M v + (1 - \beta) \frac{e_S}{|S|}\)

\(S\) is the teleport set, i.e. a set of pages belonging to the selected topic. \(e_S\) is a vector with \(1\)’s in the \(S\)-positions and \(0\)’s elsewhere.

TrustRank

Spam farm: a collection of pages built to increase the PageRank of a certain target page \(t\). The web is divided into inaccessible pages, accessible pages and the spammer’s own (supporting) pages.

Thus the PageRank \(y\) of \(t\) is built from \(x\) (the contribution of accessible pages) plus \(\beta\) times the PageRank of every supporting page:

\(y = x + \beta m \left(\beta \frac{y}{m} + \frac{1-\beta}{n}\right) = x + \beta^2 y + \frac{\beta(1-\beta)m}{n}\)

Solving for \(y\) gives \(y = \frac{x}{1-\beta^2} + \frac{\beta}{1+\beta} \cdot \frac{m}{n}\).

Conclusion: increase \(m\) (the number of supporting pages) to increase \(y\).

TrustRank: topic-sensitive PageRank where the “topic” is a set of pages believed to be trustworthy. Requires a (human-selected) seed set.

Assumptions:

The spam mass of a page \(p\) is \(\frac{r - t}{r}\), where \(r\) is the PageRank and \(t\) the TrustRank of \(p\). A spam mass close to \(1\) indicates spam.

SimRank

Measure the similarity of nodes of the same type by the similarity of their structural context in the respective graph. Example: users, tags, web pages. Users place tags, tags are placed on pages, users place tags on pages. Note that only binary relationships are expressed, for simplicity of structure.

A node of the same type where a random walker starting at \(N\) will most likely be in the limit is considered most similar to \(N\).

\(v' = \beta M v + (1 - \beta) e_N\)

Hubs and Authorities

HITS: Hypertext induced Topic Selection

Hubs and Authorities

Similar Items

Motivation: a search engine should not return items which are “too” similar: plagiarism, mirror pages, articles from the same source.

Jaccard similarity of sets

\(SIM(S, T) = \frac{|S \cap T|}{|S \cup T|}\)

Shingling

A k-shingle (or k-gram) is any substring of length \(k\) found within the document.

\(D\) is a document. The set of \(k\)-shingles that appear one or more times in \(D\) is denoted \(S^k(D)\).

White space can be ignored or replaced by a single blank.

Choosing a good value for \(k\) is critical: pick \(k\) large enough that the probability of any given shingle appearing in any given document is low. Rule of thumb: \(k = 5\) for short documents, \(k = 10\) for long documents.
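
A small sketch illustrating \(k\)-shingles and the Jaccard measure above; normalising whitespace to single blanks is the option mentioned earlier, and the character-level shingling is an assumption of this sketch.

def shingles(text, k=5):
    """Return the set of character k-shingles of a document.
    Runs of whitespace are replaced by a single blank, as suggested above."""
    normalized = " ".join(text.split())
    return {normalized[i:i + k] for i in range(len(normalized) - k + 1)}

def jaccard(s, t):
    """SIM(S, T) = |S intersect T| / |S union T|."""
    return len(s & t) / len(s | t) if s or t else 1.0

d1 = "the quick brown fox jumps over the lazy dog"
d2 = "the quick brown fox jumped over a lazy dog"
print(jaccard(shingles(d1), shingles(d2)))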

(Long) shingles can be compressed by hashing them to, for example, \(4\) bytes. To decide whether hashing is needed, compare the number of possible hash values with the number of possible shingles.

Real-world scenario: news articles contain a lot of stop words. A shingle can thus be formed from a stop word followed by the next two words.

Note: two pages with the same article and different surrounding have higher Jaccard Similarity than two pages with same surrounding and different articles.

Minhashing

Scenario: near-duplicate detection for documents. For \(N = 1\,000\,000\) documents, a pairwise test of the Jaccard similarity results in

\(\frac{N(N-1)}{2} = 5 \cdot 10^{11}\)

comparisons. At \(10^6\) comparisons per second, this takes almost \(6\) days.

Goal: Estimate Jaccard Similarity by breaking down large sets of shingles into signatures.

The characteristic matrix is a matrix that associates shingles with documents (sets of shingles). If a shingle is in a document, the value is \(1\), else \(0\).

Minhashing considers permutations of the rows of a characteristic matrix. With respect to a permutation, the minhash value of a document is given by the number of the first row in which the column has a 1.

The probability that the minhash function for a random permutation of rows produces the same value for two sets equals the Jaccard similarity of those sets.

Minhash signature matrix

Given \(n\) minhash functions \(h_1, h_2, ..., h_n\) and a collection of sets \(S_1, ..., S_m\), we construct a matrix SIG with one column per set \(S_i\), \(1 \leq i \leq m\); the column of \(S_i\) is the vector \([h_1(S_i), ..., h_n(S_i)]\).

Algorithm: check each row of the characteristic matrix only once and compute all hash functions simultaneously. \(SIG(i, c)\) refers to the \(i\)th hash function and column \(c\) (representing set \(S_c\)).

  1. Set \(SIG(i, c)\) to \(\infty\) for all \(i\) and \(c\)
  2. Compute \(h_1(r), ..., h_n(r)\) for each row \(r\) of the matrix
  3. for each row \(r\) in the given order, for each column \(c\) do the following
    1. if \(c\) has \(0\) in row \(r\), do nothing
    2. if \(c\) has \(1\) in row \(r\), then for each \(i = 1, ..., n\) set \(SIG(i, c)\) to the smaller of the current value of \(SIG(i, c)\) and \(h_i(r)\)
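
A sketch of the one-pass computation above, using random hash functions of the form \((a \cdot r + b) \bmod p\) in place of explicit row permutations (a common substitution, assumed here rather than taken from the lecture).

import random

def minhash_signatures(char_matrix, n_hashes, seed=0):
    """char_matrix[r][c] is 1 iff shingle r occurs in document c.
    SIG(i, c) starts at infinity and is lowered to min(SIG(i, c), h_i(r))
    whenever column c has a 1 in row r."""
    rng = random.Random(seed)
    n_cols = len(char_matrix[0])
    p = 4294967311  # a prime larger than any row index
    hashes = [(rng.randrange(1, p), rng.randrange(0, p)) for _ in range(n_hashes)]
    sig = [[float("inf")] * n_cols for _ in range(n_hashes)]
    for r, row in enumerate(char_matrix):
        h_vals = [(a * r + b) % p for a, b in hashes]   # h_1(r), ..., h_n(r)
        for c, bit in enumerate(row):
            if bit:
                for i in range(n_hashes):
                    if h_vals[i] < sig[i][c]:
                        sig[i][c] = h_vals[i]
    return sig

# Characteristic matrix for 4 shingles (rows) and 3 documents (columns).
C = [[1, 0, 1],
     [0, 1, 1],
     [1, 1, 0],
     [0, 0, 1]]
print(minhash_signatures(C, n_hashes=4))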

Locality-Sensitive Hashing

Problem: efficiently find the pairs of documents with the greatest similarity. For 1M documents we would have to check \(\binom{1\,000\,000}{2}\) pairs (half a trillion!). Idea: only check pairs of documents whose similarity is above some lower bound.

Determine candidate pairs

Analysis of banding technique

\(b\) bands, \(r\) rows per band, \(D_1\) and \(D_2\) a pair of documents with Jaccard similarity \(s\). The probability that the minhash signatures of \(D_1\) and \(D_2\) agree in any one particular row is \(s\); the probability that they agree in all \(r\) rows of one particular band is \(s^r\), and the probability that they become a candidate pair (i.e. agree in at least one band) is \(1 - (1 - s^r)^b\).

Threshold \(t\) at which the probability of becoming a candidate is \(\frac{1}{2}\). Treating \(t\) as a function of \(b\) and \(r\): \(\frac{1}{2} = 1 - (1 - t^r)^b\); an approximation of the solution is \(t \approx (1/b)^{1/r}\).
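
A small sketch of this banding analysis: the candidate probability \(1 - (1 - s^r)^b\) and the approximate threshold \((1/b)^{1/r}\); the values of \(b\) and \(r\) below are arbitrary.

def candidate_probability(s, b, r):
    """Probability that two documents with similarity s share at least one band."""
    return 1 - (1 - s ** r) ** b

b, r = 20, 5          # 100 minhash rows split into 20 bands of 5 rows
threshold = (1 / b) ** (1 / r)
print(f"approximate threshold t = {threshold:.3f}")
for s in (0.2, 0.4, 0.6, 0.8):
    print(s, round(candidate_probability(s, b, r), 4))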

Wrap up

  1. fix \(k\) and construct for each document its set of \(k\)-shingles
  2. sort document-shingle pairs by shingle
  3. Fix \(n\) for the minhash signatures and compute them
  4. Fix the number of bands \(b\) and the number of rows \(r\) such that \(br = n\) and the threshold \(t \approx (1/b)^{1/r}\)
  5. find candidate pairs
  6. check whether the threshold is indeed passed
  7. optionally, check whether the candidate documents are truly similar, as expected

Similarity joins

Build an inverted index. To discover candidates, compute for each set \(u\) a prefix of length

\(|u| - \lceil t \cdot |u| \rceil + 1\)

where \(t\) is the Jaccard similarity threshold. Two sets are compared only when their prefixes share an element. Note: this can miss similar records.
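
A rough sketch of the prefix-filtering idea with the prefix length above; sorting tokens by value as the global order and the final verification step are assumptions of this sketch, not necessarily the lecture's construction.

import math
from collections import defaultdict

def prefix_filter_join(records, t):
    """Return pairs of record ids whose Jaccard similarity is at least t.
    records: dict id -> set of tokens. Candidates are generated only for
    records sharing a token in their prefixes, then verified exactly."""
    index = defaultdict(set)            # token -> ids having it in their prefix
    candidates = set()
    for rid, tokens in records.items():
        ordered = sorted(tokens)        # global token order (here: by value)
        prefix_len = len(ordered) - math.ceil(t * len(ordered)) + 1
        for token in ordered[:prefix_len]:
            for other in index[token]:
                candidates.add(tuple(sorted((rid, other))))
            index[token].add(rid)
    result = []
    for a, b in candidates:             # verification: exact Jaccard similarity
        sa, sb = records[a], records[b]
        if len(sa & sb) / len(sa | sb) >= t:
            result.append((a, b))
    return result

recs = {1: {"a", "b", "c", "d"}, 2: {"a", "b", "c", "e"}, 3: {"x", "y", "z"}}
print(prefix_filter_join(recs, t=0.6))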

Similarity measures

A distance measure is a function \(d(x, y)\) that takes two vectors \(x, y\) and produces a real number satisfying these axioms:

  1. \(d(x, y) \geq 0\)
  2. \(d(x, y) = 0\) iff \(x = y\)
  3. \(d(x, y) = d(y, x)\)
  4. \(d(x, y) \leq d(x, z) + d(z, y)\)

Jaccard: \(d(x, y) = 1 - SIM(x, y)\)

Hamming: \(d([x_1, ..., x_n], [y_1, ..., y_n]) = |\{i \in \{1, ..., n\} \mid x_i \neq y_i\}|\)

Euclidean distance: \(d([x_1, ..., x_n], [y_1, ..., y_n]) = \sqrt{\sum^n_{i=1}(x_i - y_i)^2}\)

Manhattan: \(d([x_1, ..., x_n], [y_1, ..., y_n]) = \sum^n_{i=1} | x_i - y_i |\)

Cosine similarity: \(sim([x_1, ..., x_n], [y_1, ..., y_n]) = \frac{\vec{x} \cdot \vec{y}}{||\vec{x}|| \cdot ||\vec{y}||}\)

Cosine distance: \(d(\vec{x}, \vec{y}) = 1 - sim(\vec{x}, \vec{y})\)
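
A compact sketch of the vector-based measures above, using numpy:

import numpy as np

def euclidean(x, y):
    return np.sqrt(np.sum((x - y) ** 2))

def manhattan(x, y):
    return np.sum(np.abs(x - y))

def cosine_distance(x, y):
    sim = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
    return 1 - sim

def hamming(x, y):
    return int(np.sum(x != y))

x, y = np.array([1.0, 2.0, 3.0]), np.array([2.0, 2.0, 5.0])
print(euclidean(x, y), manhattan(x, y), cosine_distance(x, y))
print(hamming(np.array([1, 0, 1, 1]), np.array([1, 1, 0, 1])))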

Recommender systems (RS)

Given:

We would like to compute a ranking based on which relevant interesting items can be recommended.

Different types of RS are:

Collaborative

Typically a sparse matrix \(R\) of ratings with users \(U = \{u_1, ..., u_n\}\) and products \(P = \{p_1, ..., p_m\}\). The goal is to find \(pred(a, p)\) for an active user \(a \in U\) and all items \(p\) not yet rated by \(a\), maximizing \(a\)’s utility.

Assumptions: if users had similar taste in the past, they will have similar taste in the future, and preferences remain stable and consistent over time.

The rating \(r(a, q)\) of some item \(q\) for the active user \(a\) is predicted based on the ratings \(r(u, p)\) assigned to items \(p\) by other users \(u \in U\).

User-based recommendation

Predict the rating of an item by finding users similar to the active user.

Let \(P_{xy}\) be a set of all items co-rated by both users \(x\) and \(y\):

\(P_{xy} = \{p \in P | r_{x, p} \neq \emptyset, r_{y, p} \neq \emptyset\}\)

Compute similarity using cosine similarity or Pearson correlation. Cosine similarity is invariant to the length of the vectors; Pearson correlation is invariant to separate changes of location and scale of the variables, i.e. linear transformations \(a + bx\) with constants \(a, b\) and \(b \neq 0\). Take the nearest neighbours and use them to compute the rating, either as a plain average or as an average weighted by the similarity factors and normalised.
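
A sketch of user-based prediction: Pearson correlation over co-rated items and a similarity-weighted, mean-centred average over the \(k\) nearest neighbours. Encoding missing ratings as 0 and the choice of \(k\) are assumptions of this sketch.

import numpy as np

def pearson(R, x, y):
    """Pearson correlation of users x and y over their co-rated items."""
    co = (R[x] > 0) & (R[y] > 0)
    if co.sum() < 2:
        return 0.0
    rx, ry = R[x, co], R[y, co]
    dx, dy = rx - rx.mean(), ry - ry.mean()
    denom = np.sqrt((dx ** 2).sum() * (dy ** 2).sum())
    return float((dx * dy).sum() / denom) if denom else 0.0

def predict(R, a, p, k=2):
    """Predict the rating of item p for active user a from the k nearest neighbours."""
    neighbours = [(pearson(R, a, u), u) for u in range(R.shape[0])
                  if u != a and R[u, p] > 0]
    neighbours = sorted(neighbours, reverse=True)[:k]
    mean_a = R[a][R[a] > 0].mean()
    num = sum(sim * (R[u, p] - R[u][R[u] > 0].mean()) for sim, u in neighbours)
    den = sum(abs(sim) for sim, u in neighbours)
    return mean_a + num / den if den else mean_a

# Rows: users, columns: items, 0 = not rated.
R = np.array([[5, 3, 4, 4, 0],
              [3, 1, 2, 3, 3],
              [4, 3, 4, 3, 5],
              [3, 3, 1, 5, 4],
              [1, 5, 5, 2, 1]])
print(predict(R, a=0, p=4))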

Item-based recommendation

User-based recommendation is problematic with a large number of users. As the matrix is sparse, only a few common ratings between two users are likely. Items are more stable than users. Thus: item-based methods are attractive!

Determine similarity of items by column-wise analysis of utility matrix.

Let \(U_{xy}\) be the set of all users who co-rated both items \(x\) and \(y\):

\(U_{xy} = \{u \in U | r_{u, x} \neq \emptyset, r_{u, y} \neq \emptyset\}\)

Use adjusted cosine similarity. The prediction function is the similarity weighted average of all ratings of similar items by the active user.
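
A sketch of the adjusted cosine similarity between two item columns, centring each rating by the corresponding user's mean rating; 0 again encodes a missing entry, which is an assumption of this sketch.

import numpy as np

def adjusted_cosine(R, i, j):
    """Adjusted cosine similarity of items i and j (columns of R),
    restricted to users who rated both items."""
    co = (R[:, i] > 0) & (R[:, j] > 0)            # users who co-rated i and j
    if not co.any():
        return 0.0
    user_means = np.array([row[row > 0].mean() for row in R])
    di = R[co, i] - user_means[co]
    dj = R[co, j] - user_means[co]
    denom = np.linalg.norm(di) * np.linalg.norm(dj)
    return float(di @ dj / denom) if denom else 0.0

R = np.array([[5, 3, 4, 4, 0],
              [3, 1, 2, 3, 3],
              [4, 3, 4, 3, 5],
              [3, 3, 1, 5, 4],
              [1, 5, 5, 2, 1]])
print(adjusted_cosine(R, 0, 4))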

Association rules

Given the utility matrix \(R\). Analyse \(R\) to determine association rules: if user \(u\) liked a certain set of items \(Q \subseteq P\), then \(u\) will most probably also like an item \(p \notin Q\).

Let \(a\) be the active user.

  * Check whether \(a\) has already rated items such that the precondition of a rule is satisfied
  * Recommend to \(a\) all items \(i\) appearing as the consequence of such a rule, whenever \(a\) has not rated \(i\) so far

Rules are written as \(X \Rightarrow Y\).

Let \(s \in [0, 1]\) be a threshold. The support \(supp(X)\) of an itemset \(X\) is the fraction of users whose rated items contain \(X\). Itemset \(X\) is called frequent whenever \(supp(X) \geq s\).

Let \(c \in [0, 1]\) be a threshold. The confidence of a rule is \(conf(X \Rightarrow Y) = \frac{supp(X \cup Y)}{supp(X)}\). Determine association rules \(X \Rightarrow Y\) where \(X \cup Y\) is frequent and \(conf(X \Rightarrow Y) \geq c\).

Apriori property: every subset of a frequent itemset is also a frequent itemset. Whenever an itemset \(X\) is not frequent, any superset \(Y \supseteq X\) is also not frequent.

Finding association rules:

  1. Consider rules \(X \Rightarrow Y\) such that \(supp(X \cup Y) \geq s\) and the active user has rated all items in \(X\)
  2. Verify \(\frac{supp(X \cup Y)}{supp(X)} \geq c\)
  3. Compute the union of all items appearing in the consequence of such rules
  4. Sort the items according to the confidence of the rule that predicted them; if an item is predicted by multiple rules, take the one with the highest confidence
  5. Return the first \(N\) elements of the ordered list as recommendations
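
A small sketch over transaction sets (one set of liked items per user): support and confidence are computed by counting, rules meeting both thresholds are kept, and unseen consequents are ranked by confidence. The threshold values, the limit on the left-hand-side size and the single-item consequents are assumptions of this sketch.

from itertools import combinations

def recommend(transactions, active, s=0.4, c=0.6, n=3, max_lhs=2):
    """Recommend items via association rules X => {y}.
    transactions: list of item sets (one per user); active: items the
    active user has already liked. supp(X) is the fraction of users whose
    set contains X; conf(X => y) = supp(X union {y}) / supp(X)."""
    def supp(itemset):
        return sum(itemset <= t for t in transactions) / len(transactions)
    all_items = set().union(*transactions)
    best = {}                                    # item -> best confidence so far
    for size in range(1, max_lhs + 1):
        for lhs in combinations(sorted(active), size):
            lhs = frozenset(lhs)
            s_lhs = supp(lhs)
            if s_lhs == 0:
                continue
            for y in all_items - active:
                s_xy = supp(lhs | {y})
                conf = s_xy / s_lhs
                if s_xy >= s and conf >= c:      # frequent and confident rule
                    best[y] = max(best.get(y, 0.0), conf)
    ranked = sorted(best, key=best.get, reverse=True)
    return ranked[:n]

T = [{"a", "b", "c"}, {"a", "b"}, {"b", "c"}, {"a", "c", "d"}, {"b", "c", "d"}]
print(recommend(T, active={"b"}))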

Slope One predictors

Base prediction on popularity differential between items for users.

Rating database \(R\), set of users \(U\). \(u_i = r_{u, i}\) denotes user \(u\)’s rating of item \(i\).

For items \(i, j\), \(S_{j, i}(R)\) denotes the set of users that rated both \(i\) and \(j\); these are the rows in \(R\) which contain co-ratings for \(i\) and \(j\).

Average deviation of items \(i\) and \(j\)

\(dev_{j, i} = \sum_{u \in S_{j, i}(R)} \frac{r_{u, j} - r_{u, i}}{|S_{j, i}(R)|}\)

Prediction for an item \(j\) for user \(a\) based on item \(i\) is \(pred(a, j) = dev_{j, i} + a_i\)

Combination of all individual predictions:

\(pred(a, j) = \frac{\sum_{i \in Relevant(j)}(dev_{j, i} + a_i)}{|Relevant(j)|}\)

Slope One algorithms are a simple approach to item-based recommendation. They are easy to implement and their accuracy is on par with more complicated and computationally more expensive algorithms.
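
A sketch of the (unweighted) Slope One prediction following the two formulas above, with ratings stored as nested dicts:

def slope_one_predict(ratings, a, j):
    """Predict user a's rating for item j via average deviations dev_{j,i}.
    ratings: dict user -> dict item -> rating."""
    preds = []
    for i, a_i in ratings[a].items():            # items i the active user rated
        if i == j:
            continue
        co = [u for u, r in ratings.items() if i in r and j in r]
        if not co:
            continue
        dev_ji = sum(ratings[u][j] - ratings[u][i] for u in co) / len(co)
        preds.append(dev_ji + a_i)               # prediction based on item i
    return sum(preds) / len(preds) if preds else None

R = {
    "alice": {"item1": 2, "item2": 5},
    "bob":   {"item1": 3, "item2": 2, "item3": 5},
    "carol": {"item1": 4, "item3": 3},
}
print(slope_one_predict(R, "alice", "item3"))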

Data Sparsity and Cold-start Problem

Challenge: good predictions when only few ratings are available. Idea: add more information; exploit the transitivity of customer tastes to augment the given matrix. Recommendation thus becomes a graph problem.

Matrix dimensionality reduction

Filling in the blank entries “solves” the recommendation problem. The general theory of singular-value decomposition tells us that a given matrix can be expressed as a product of other matrices of lower dimensionality. This makes sense if the reaction of users to items can be determined from a small set of features of items and users (latent factors). To apply SVD to a matrix, it must be free of blank entries. As a solution, the UV-decomposition algorithm decomposes a given matrix \(M\) into matrices \(U, V\) such that \(M \sim UV\), where \(M\) may contain blank entries. As \(U\) and \(V\) do not have blank entries, \(UV\) does not have blank entries either; the product fills the blank entries of \(M\).

UV-Decomposition

Reduce a \(5 \times 5\) matrix to a \(5 \times 2\) and a \(2 \times 5\) matrix. Observation: this already contains fewer elements than the original matrix. Find \(U\) and \(V\) that closely approximate \(M\). Construct \(U\) and \(V\) iteratively, continuously converging to \(M\). The process stops when the difference to \(M\) is small enough.

Difference: Root-Mean-Square Error (RMSE):

\(RMSE(M, M') = \sqrt{\frac{1}{k} \sum_{i=1}^n \sum_{j=1}^m(m_{ij} - m'_{ij})^2}\)

Note: \(k\) is the number of non-blank elements; a term contributes \(0\) when \(m_{ij}\) is blank in \(M\).
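
A rough sketch of UV-decomposition: the RMSE over the non-blank entries (encoded as NaN) and a gradient-descent construction of \(U\) and \(V\). The lecture's incremental element-wise optimisation is replaced here by plain gradient descent, and the learning rate, dimension \(d\) and random initialisation are assumptions of this sketch.

import numpy as np

def rmse(M, M_approx):
    """RMSE over the k non-blank (non-NaN) entries of M."""
    mask = ~np.isnan(M)
    k = mask.sum()
    return np.sqrt(np.sum((M[mask] - M_approx[mask]) ** 2) / k)

def uv_decompose(M, d=2, steps=10000, lr=0.005, seed=0):
    """Find U (n x d) and V (d x m) with M ~ UV by gradient descent
    on the squared error over the non-blank entries only."""
    rng = np.random.default_rng(seed)
    n, m = M.shape
    U, V = rng.random((n, d)), rng.random((d, m))
    mask = ~np.isnan(M)
    M0 = np.where(mask, M, 0.0)
    for _ in range(steps):
        E = np.where(mask, M0 - U @ V, 0.0)      # error on known entries only
        U += lr * E @ V.T
        V += lr * U.T @ E
    return U, V

M = np.array([[5, 2, 4, 4, 3],
              [3, 1, 2, 4, 1],
              [2, np.nan, 3, 1, 4],
              [2, 5, 4, 3, 5],
              [4, 4, 5, 4, np.nan]], dtype=float)
U, V = uv_decompose(M)
print(rmse(M, U @ V))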

SVD

Let \(M\) be an \(m \times n\) matrix with rank \(r\). Find matrices \(U\), \(\Sigma\) and \(V\) with the following properties:

Content-based methods

The rating \(r(u, q)\) of some item \(q\) for user \(u\) is estimated based on the ratings \(r(u, p)\) assigned by user \(u\) to items \(p \in P\) that are similar to item \(q\).

The content of item \(p\), \(Content(p)\), is typically given as an item profile, i.e. a set of attributes characterizing item \(p\) (meta-information, keywords).

Profile of user \(u\) contains tastes and preferences. (Given or learned)

Utility function \(r(u, p) = pred(ContentBasedProfile(u), Content(p))\)

Jaccard: \(SIM(D_1, D_2) = \frac{|D_1 \cap D_2|}{|D_1 \cup D_2|}\) Dice: \(SIM(D_1, D_2) = \frac{2|D_1 \cap D_2|}{|D_1| + |D_2|}\)

The predicted rating/interest is computed by analysing how the user liked similar items/documents in the past. Rocchio’s method additionally takes the user’s feedback on results into account.

Learning models

Collaborative recommendation

Compute conditional probabilities for each possible rating value given Alice’s other ratings.

E.g.: \(X = \{i_1 = 1, i_2 = 3, ...\}\); compute \(P(i_5 = 1 | X)\), \(P(i_5 = 2 | X)\), …

Bayes

\(P(X | Y) = \frac{P(X, Y)}{P(Y)}\)

Thus:

\(P(Y | X) = \frac{P(X | Y)P(Y)}{P(X)}\)

The naive Bayes classifier assumes that the features (ratings) are conditionally independent:

\(P(Y | X) = \frac{\prod^d_{i=1} P(X_i | Y)P(Y)}{P(X)}\)

where \(d\) is number of features.

\(P(X)\) is considered constant and will be ignored.

Remark: in many domains naive Bayes classifiers work well even when the conditional independence assumption does not hold. The data sparsity problem has to be solved.
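
A sketch of the naive Bayes prediction over a small rating database (one dict per user), estimating the prior and the per-item likelihoods by counting; the Laplace smoothing is one simple way to address the data-sparsity remark and is an assumption of this sketch.

def naive_bayes_predict(R, active, target):
    """Return the most probable rating value for `target` given the
    active user's other ratings, assuming conditional independence.
    R: list of dicts item -> rating; active: dict item -> rating."""
    values = {r[target] for r in R if target in r}
    scores = {}
    for v in values:
        rows_v = [r for r in R if r.get(target) == v]
        prior = (len(rows_v) + 1) / (len(R) + len(values))     # Laplace smoothing
        likelihood = 1.0
        for item, rating in active.items():
            if item == target:
                continue
            match = sum(r.get(item) == rating for r in rows_v)
            likelihood *= (match + 1) / (len(rows_v) + 2)
        scores[v] = prior * likelihood                          # P(X) is ignored
    return max(scores, key=scores.get), scores

R = [
    {"i1": 1, "i2": 3, "i3": 3, "i4": 2, "i5": 1},
    {"i1": 2, "i2": 4, "i3": 2, "i4": 2, "i5": 4},
    {"i1": 1, "i2": 3, "i3": 3, "i4": 1, "i5": 1},
    {"i1": 2, "i2": 5, "i3": 2, "i4": 2, "i5": 4},
]
alice = {"i1": 1, "i2": 3, "i3": 3, "i4": 2}
print(naive_bayes_predict(R, alice, "i5"))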

Linear classifiers

Task: find a line \(w_1x_1 + w_2x_2 = b\) which separates the instances of the two classes. The parameters \(w_1, w_2, b\) have to be learned.

In general: \(n\) dimensions, a classification function \(\vec{w}^T \vec{x} = b\) has to be learned (a hyperplane separating the two classes).

Machine learning (ML)

Perceptron: A linear binary classifier

Input is \(\vec{x} = [x_1, ..., x_d]\) with weights \(\vec{w} = [w_1, ..., w_d]\) and threshold \(\theta\)

Output is \(+1\) if \(\vec{w}\vec{x} > \theta\), \(-1\) if \(\vec{w}\vec{x} < \theta\).

When training a perceptron we try to find a weight vector \(\vec{w}\) and a threshold \(\theta\) such that all feature vectors with \(y = +1\) are on the positive side and all with \(y = -1\) are on the negative side.

This only works for data that is linearly separable, in the sense that a separating hyperplane exists. There may exist more than one possible hyperplane.

Algorithm:

  1. Initialize \(\vec{w} = \vec{0}\)
  2. Pick a learning-rate parameter \(\eta\) (a small real number). This is crucial: if \(\eta\) is too small, convergence is slow; if it is too big, the boundary will “dance around” and might not converge
  3. Consider each training example \(t = (\vec{x}, y)\) in turn
    1. Let \(y' = \vec{w}\vec{x}\)
    2. If \(y'\) and \(y\) have the same sign, do nothing; \(t\) is properly classified
    3. If \(y'\) and \(y\) have different signs or \(y' = 0\), then set \(\vec{w} = \vec{w} + \eta y \vec{x}\)
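
A sketch of the training loop in the algorithm above with \(\theta = 0\), folding the threshold into a constant bias component of \(\vec{x}\); the fixed number of epochs is an assumption of this sketch.

import numpy as np

def train_perceptron(X, y, eta=0.1, epochs=100):
    """Learn w such that sign(w.x) matches the labels y in {-1, +1}.
    Follows the update rule above: whenever sign(w.x) != y (or w.x == 0),
    set w = w + eta * y * x."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        updated = False
        for x, label in zip(X, y):
            if np.sign(w @ x) != label:          # misclassified (or on the boundary)
                w += eta * label * x
                updated = True
        if not updated:                          # converged: all points classified
            break
    return w

# Linearly separable toy data; the last component 1 acts as a bias / -theta term.
X = np.array([[2.0, 3.0, 1.0], [1.0, 2.0, 1.0], [-1.0, -2.0, 1.0], [-2.0, -1.0, 1.0]])
y = np.array([1, 1, -1, -1])
w = train_perceptron(X, y)
print(w, np.sign(X @ w))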

Support-Vector Machines (SVM)

Maximize the distance \(\gamma\) between the hyperplane and any point of the training set. For a \(d\)-dimensional data set there are at least \(d + 1\) support vectors.

Transforming training set

If the data is not linearly separable, it can help to map it to a higher-dimensional feature space where it is separable. For example: the input space \(\vec{x} = (x_1, x_2)\) mapped into the quadratic feature space \(f(\vec{x}) = (x_1^2, x_2^2, \sqrt{2}x_1x_2)\).

Clustering

Given a set of points, group the points into clusters using some notion of distance, e.g. cosine distance, Jaccard distance or Euclidean distance.

Hierarchical clustering

Cluster distance: distance of centroids.

\(k\)-means for small datasets, BFR for large datasets (required memory: \(O(c)\) instead of \(O(d)\), where \(c\) is the number of clusters and \(d\) the number of data points). Idea of BFR: summarize points close enough to a centroid; group points that are close together but not close to any existing centroid.
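
A minimal \(k\)-means sketch for the small-dataset case (Euclidean distance, random initial centroids chosen from the points); BFR itself keeps per-cluster summaries instead of the points and is not shown here.

import numpy as np

def kmeans(points, k, iters=100, seed=0):
    """Assign each point to the nearest centroid, then recompute centroids."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # distance of every point to every centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centroids = np.array([points[labels == j].mean(axis=0)
                                  if (labels == j).any() else centroids[j]
                                  for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

pts = np.array([[1.0, 1.0], [1.5, 2.0], [1.2, 0.8],
                [8.0, 8.0], [8.5, 9.0], [7.8, 8.2]])
print(kmeans(pts, k=2))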