This document is derived from the lecture by Prof. Dr. G. Lausen at the University of Freiburg, semester SS16.

## Queries and Recommendations

• Query:
• relational database: structured tables
• graph database: sets of graphs
• information retrieval system: sets of documents
• with overwhelming amounts of data, users often cannot pose a precise query. A recommender system predicts what a user is believed to be looking for

### Queries

• Extract set of terms from document after stop word removal, stemming and lemmatization of document’s words
• set of terms is $$T = \{t_1, ..., t_n\}$$ and set of documents is $$D = \{d_1, ..., d_m\}$$. For each document $$d_i$$ and each term $$t_k$$ let $$w_{i, k} \in \Re$$ be a weight
• vector space model: consider matrix defined by the vectors of D’s documents (e.g. term-document matrix)

#### Weighting terms: TF-IDF measure

• term frequency: number of occurrences of term in given document
• $$i$$ is term, $$j$$ is document.
• $$freq(i, j)$$ is number of occurrences of $$i$$ in $$j$$.
• $$OtherTerms(i, j)$$ is the set of terms different from $$i$$ in $$j$$.
• $$maxOthers(i, j) = max_{z \in OtherTerms(i, j)}\{freq(z, j)\}$$
• normalized term frequency: $$TF(i, j) = \frac{freq(i, j)}{maxOthers(i, j)}$$
• Inverse document frequency:
• $$N$$ number of recommendable documents
• $$n(i)$$ number of documents from $$N$$ containing keyword $$i$$
• $$IDF(i) = log \frac{N}{n(i)}$$
• TF-IDF: $$TF-IDF(i, j) = TF(i, j) \cdot IDF(i)$$
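A minimal Python sketch of these definitions; the toy corpus (documents as lists of already stemmed and normalized terms) is made up for illustration:

```python
import math

# Toy corpus: documents given as lists of already-normalized terms.
docs = [
    ["apple", "banana", "apple", "cherry"],
    ["banana", "banana", "cherry"],
    ["apple", "apple", "apple"],
]

def tf(term, doc):
    """Normalized term frequency: freq(i, j) / maxOthers(i, j)."""
    freq = doc.count(term)
    others = [doc.count(t) for t in set(doc) if t != term]
    return freq / max(others) if others else float(freq)

def idf(term, docs):
    """Inverse document frequency: log(N / n(i))."""
    n = sum(1 for d in docs if term in d)
    return math.log(len(docs) / n) if n else 0.0

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)
```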

#### RDF Graph Data Model

• Relationships between objects can naturally be represented as “subject-predicate-object” triples: $$(s, p, o)$$
• Subjects, objects and predicates are identified by URIs (IRIs); a URI can appear in any of the roles
• Graphical interpretation: labeled directed graph
• Problem: the same property used several times complicates graph drawings

#### Property Graph Data Model

• property graph made up of nodes, relationships and properties
• nodes contain properties, typically as key-value pairs
• relationships connect and structure nodes
• relationships can also have properties

### Recommendations

• based on
• personal profile of user
• popularity characteristic for a set of peers
• similarity to a given example
• personal profile and some additional knowledge
• combination of above
• computation problem: learn a function that predicts relevance score for interesting item typically not seen before by the user
• examples: collaborative filtering, content-based recommendations

## RDF-graphs and SPARQL

• RDF building blocks
• URIs
• literals
• blank nodes
• n-ary relationships via blank nodes
• SPARQL
• PREFIX: mechanism for abbreviating URIs
• SELECT: identifies the variables to be returned in query answer
• FROM: names the graph to be queried
• WHERE: query pattern

### Example

-- DATA

```
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
_:a foaf:givenName "Alice" .
_:b foaf:givenName "Bob" .
_:b dc:date "2005-04-04T04:04:04Z"^^xsd:dateTime .
```

-- QUERY

```
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
SELECT ?givenName
WHERE { ?x foaf:givenName ?givenName .
        OPTIONAL { ?x dc:date ?date } .
        FILTER (!bound(?date)) }
```

The query returns the given names of people without a `dc:date` (here: "Alice").

### Semantics of SPARQL 1.0 (Definition 1)

• A tuple from $$(T \cup V) \times (U \cup V) \times (T \cup V)$$ is a triple pattern
• A triple pattern is also a graph pattern. If P1 and P2 are graph patterns, then the expressions (P1 AND P2), (P1 OPT P2) and (P1 UNION P2) are graph patterns
• If P is a graph pattern and R is a SPARQL built-in condition, then the expression P FILTER R is a graph pattern

#### Examples

P1 = ((?A, email, ?E) OPT (?A, webPage, ?W)).
P2 = (((?A, name, ?N) OPT (?A, email, ?E)) OPT (?A, webPage, ?W)).
P3 = ((?A, name, ?N) OPT ((?A, email, ?E) OPT (?A, webPage, ?W))).
P4 = ((?A, name, ?N) AND ((?A, email, ?E) UNION (?A, webPage, ?W))).

### Definition 2

• a mapping is a partial function $$\mu : V \rightarrow T$$. For a triple pattern $$t$$, $$\mu(t)$$ denotes the triple obtained by replacing the variables in $$t$$ according to $$\mu$$
• the domain $$dom(\mu)$$ of $$\mu$$ is the subset of $$V$$ on which $$\mu$$ is defined
• mappings $$\mu_1$$ and $$\mu_2$$ are compatible when for all $$x \in dom(\mu_1) \cap dom(\mu_2)$$ it is the case that $$\mu_1(x) = \mu_2(x)$$
• disjoint domains are always compatible
• empty mapping is compatible with any other mapping

### Query forms

• SELECT: returns all, or a subset of the variables bound in query pattern match
• CONSTRUCT: returns an RDF graph constructed by substituting variables in set of triple templates
• ASK: returns boolean indicating whether a query pattern matches or not
• DESCRIBE: returns RDF graph that describes resources found

### PageRank

• each link’s vote is proportional to the importance of its source page
• if page $$j$$ with importance $$r_j$$ has $$k_j$$ out-links, each link gets $$\frac{r_j}{k_j}$$ votes
• Page $$j$$’s own importance is the sum of the votes on its in-links: $$r_j = \sum_{i \rightarrow j} \frac{r_i}{k_i}$$
• transition matrix $$M$$ has $$n$$ rows and columns. Element $$m_{ij}$$ of $$M$$ has value $$\frac{1}{k_j}$$ when page $$j$$ has $$k_j$$ out-going arcs and one of them points to page $$i$$; otherwise $$m_{ij} = 0$$
• $$M$$ is column-stochastic: the values of each column sum up to $$1$$. $$M$$ is the transition matrix of a Markov process

#### Random Surfer

• $$v$$ is the probability distribution of the surfer's position: component $$v_j$$ is the probability that the surfer is currently at page $$j$$
• $$v_1 = M v_0$$
• $$v_2 = M v_1 = M(M v_0) = M^2 v_0$$
• after $$i$$ steps $$v_i = M^i v_0$$
• Given that $$M$$ is stochastic, the Markov process converges to a stationary distribution. The limit exists and has to satisfy $$v = Mv$$, provided $$M$$ represents a strongly connected (web) graph

Problem: the web is not a strongly connected graph!

Dead end (no outgoing links): $$v_\infty = \begin{pmatrix}0 \\ 0 \\ 0 \\ 0\end{pmatrix}$$ Spider trap (infinite loop to self): $$v_\infty = \begin{pmatrix}0 \\ 0 \\ 1 \\ 0\end{pmatrix}$$

Note that the limit will not necessarily represent a possible distribution of the random surfer, but an approximation of the relative importance of the respective pages.

##### Treating Spider Traps (and dead ends)

New iterative step:

$$v' = \beta M v + \frac{1 − \beta}{n} \cdot e$$

• $$e$$ is $$n$$-dim vector with all $$1$$’s
• $$n$$ number of nodes in graph
• $$\beta$$ is the probability that the surfer follows an out-link; with probability $$1 - \beta$$ the surfer teleports to a random page. Typical values of $$\beta$$ are $$0.8$$ to $$0.9$$
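The iterative step can be sketched with power iteration as below; the adjacency-dict representation and the uniform redistribution of a dead end's rank are illustrative assumptions, not part of the formula above:

```python
def pagerank(links, beta=0.85, iters=100):
    """Power iteration for v' = beta * M * v + (1 - beta)/n * e.

    links: dict mapping each page to the list of pages it links to
    (every page must appear as a key)."""
    nodes = sorted(links)
    n = len(nodes)
    v = {p: 1.0 / n for p in nodes}
    for _ in range(iters):
        nxt = {p: (1.0 - beta) / n for p in nodes}
        for j in nodes:
            out = links[j]
            if out:
                for i in out:
                    nxt[i] += beta * v[j] / len(out)
            else:
                # dead end: spread its rank uniformly over all pages
                for i in nodes:
                    nxt[i] += beta * v[j] / n
        v = nxt
    return v
```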

### Topic-Sensitive PageRank

Example: search term Jaguar - how to rank pages?

• Ideally: private PageRank vector
• Topic-sensitive PageRank for small number of topics
• Interest in topics can be determined from pages recently viewed

#### Biased random walk

$$v' = \beta M v + (1 - \beta) \frac{e_S}{|S|}$$

$$S$$ is the teleport set, i.e. a set of pages belonging to the selected topic. $$e_S$$ is a vector with $$1$$'s in the $$S$$-positions and $$0$$'s elsewhere.

### TrustRank

Spam farm: collection of pages to increase PageRank of certain page. Inaccessible pages, accessible pages and own pages.

• Target page $$t$$, $$m$$ supporting pages, $$n$$ pages in total. Given $$\beta$$.
• $$x$$ is the amount of PageRank contributed to $$t$$ by the accessible pages
• $$y$$ PageRank of target page $$t$$
• Rank of each farm page is $$r = \frac{\beta y}{m} + \frac{1 - \beta}{n}$$

Thus the PageRank $$y$$ of $$t$$ is built from $$x$$ plus $$\beta$$ times the PageRank of each of the $$m$$ supporting pages (neglecting the small teleport term for $$t$$ itself):

$$y = x + \beta m \left(\beta \frac{y}{m} + \frac{1-\beta}{n}\right) = x + \beta^2 y + \beta(1-\beta)\frac{m}{n}$$

Solving for $$y$$ yields $$y = \frac{x}{1-\beta^2} + \frac{\beta}{1+\beta} \cdot \frac{m}{n}$$.

Conclusion: increase $$m$$ (the number of supporting pages) to increase $$y$$.

TrustRank: topic-sensitive PageRank where the "topic" is a set of pages believed to be trustworthy. Requires a (human-selected) seed set.

Assumptions:

• the degree of trust decreases with distance
• trust is split across out-links

The spam mass of a page $$p$$ is $$\frac{r_p - t_p}{r_p}$$, where $$r_p$$ is its PageRank and $$t_p$$ its TrustRank. A spam mass close to $$1$$ indicates spam.

### SimRank

Measure the similarity of nodes of the same type by the similarity of their structural context in the respective graph. Example: users, tags, web pages. Users place tags, tags are placed on pages, users place tags on pages. Note that only binary relationships are expressed, for simplicity of structure.

A node of the same type where a random walker starting at $$N$$ will most likely be in the limit is considered most similar to $$N$$.

$$v' = \beta M v + (1 - \beta) e_N$$

### Hubs and Authorities

#### HITS: Hypertext Induced Topic Selection

• Ranking of Web search results based on link structure
• search specific - in contrast to PageRank
• Idea
• pages with good content: authorities, quality measured in weighted in-links
• pages which link to pages which have good content are hubs; links are weighted depending on quality of pages they link to
• compute scores to hubs and authorities such that the recursive definition has a unique solution (equilibrium)
• Assumption: if a page receives many links from other pages, then it is receiving a kind of collective endorsement

#### Hubs and Authorities

• highly endorsed answers to queries are the authorities for query
• high-value lists are hubs for query
• assign each page $$p$$ two scores: $$auth(p)$$ and $$hub(p)$$
• initial values are equal to $$1$$.
• Authority update rule: for each page $$p$$ update $$auth(p)$$ to be the sum of the hub scores of all pages that point to it
• Hub update rule: for each page $$p$$ update $$hub(p)$$ to be the sum of the authority scores of all pages that it points to
• Normalization step: Each authority score is divided down by the sum of all authority scores, and each hub score is divided down by the sum of all hub scores
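The update and normalization rules above can be sketched as follows; the dict-of-out-links graph representation and the fixed iteration count are illustrative assumptions:

```python
def hits(links, iters=50):
    """Iterate the authority and hub update rules with normalization.

    links: dict mapping each page to the list of pages it points to."""
    pages = sorted(set(links) | {q for out in links.values() for q in out})
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iters):
        # authority update: sum of hub scores of pages pointing to p
        auth = {p: sum(hub[q] for q in pages if p in links.get(q, ()))
                for p in pages}
        total = sum(auth.values())
        auth = {p: a / total for p, a in auth.items()}
        # hub update: sum of authority scores of pages p points to
        hub = {p: sum(auth[q] for q in links.get(p, ())) for p in pages}
        total = sum(hub.values())
        hub = {p: h / total for p, h in hub.items()}
    return hub, auth
```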

## Similar Items

Motivation: a search engine should not contain items which are "too" similar: plagiarism, mirror pages, articles from the same source.

### Jaccard similarity of sets

$$SIM(S, T) = \frac{|S \cap T|}{|S \cup T|}$$

### Shingling

k-shingle (or k-gram) is any substring of length $$k$$ found within the document.

$$D$$ is a document. The set of $$k$$-shingles appearing one or more times in $$D$$ is denoted $$S^k(D)$$.

White space can be ignored or replaced by a single blank.

Choosing a good value of $$k$$ is critical: pick $$k$$ large enough that the probability of any given shingle appearing in any given document is low. Rule of thumb: $$k = 5$$ for short documents, $$k = 10$$ for long documents.

(Long) shingles can be compressed by hashing them to, for example, $$4$$ bytes. To decide whether hashing is worthwhile, compare the number of possible hash values with the number of possible shingles.

Real world scenario: News articles have a lot of stop words. A shingle can thus be formed out of a stop word followed by two other words.

Note: two pages with the same article and different surrounding have higher Jaccard Similarity than two pages with same surrounding and different articles.
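A small sketch of character-level $$k$$-shingling and Jaccard similarity (whitespace collapsed to a single blank, as described above):

```python
def shingles(text, k):
    """Set of k-shingles (character k-grams) of a document."""
    text = " ".join(text.split())  # collapse whitespace to single blanks
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def jaccard(s, t):
    """Jaccard similarity |S & T| / |S | T| of two sets."""
    return len(s & t) / len(s | t) if s | t else 1.0
```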

### Minhashing

Scenario: near-duplicate detection for documents. For $$N = 1\,000\,000$$ documents, a pairwise test for Jaccard similarity results in

$$\frac{N(N-1)}{2} = 5 \cdot 10^{11}$$

comparisons. At $$10^6$$ comparisons per second, this takes $$5 \cdot 10^5$$ seconds, i.e. almost $$6$$ days.

Goal: Estimate Jaccard Similarity by breaking down large sets of shingles into signatures.

The characteristic matrix is a matrix that associates shingles with documents (sets of shingles). If a shingle is in a document, the value is $$1$$, else $$0$$.

Minhashing considers permutations of the rows of a characteristic matrix. With respect to a permutation, the minhash value of a document is the number of the first row in which its column has a $$1$$.

The probability that the minhash function for a random permutation of rows produces the same value for two sets equals the Jaccard similarity of those sets.

#### Minhash signature matrix

Given $$n$$ minhash functions $$h_1, h_2, ..., h_n$$. For a collection of sets $$S_1, ..., S_m$$ we construct a matrix SIG with columns assigned to the sets $$S_i$$ with $$1 \leq i \leq m$$, for each such set $$S_i$$ its column is defined by the vector $$[h_1(S_i), ..., h_n(S_i)]$$

Algorithm: check each row of characteristic matrix only once and can compute all hash functions simultaneously. $$SIG(i, c)$$ refers to the $$i$$th hash function and column $$c$$ (representing set $$S_c$$).

1. Set $$SIG(i, c)$$ to $$\infty$$ for all $$i$$ and $$c$$
2. Compute $$h_1(r), ..., h_n(r)$$ for each row $$r$$ of the matrix
3. for each row $$r$$ in the given order, for each column $$c$$ do the following:
   1. if $$c$$ has $$0$$ in row $$r$$, do nothing
   2. if $$c$$ has $$1$$ in row $$r$$, then for each $$i = 1, ..., n$$ set $$SIG(i, c)$$ to the smaller of the current value of $$SIG(i, c)$$ and $$h_i(r)$$
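The row-scan algorithm can be sketched as below; the characteristic matrix is passed row by row, and the concrete hash functions used in any run are free choices:

```python
def minhash_signatures(rows, hash_funcs):
    """Compute the minhash signature matrix SIG.

    rows: characteristic matrix as a list of rows, each a list of 0/1
          entries (one per set/column).
    hash_funcs: list of functions, each mapping a row number to an int."""
    n, m = len(hash_funcs), len(rows[0])
    sig = [[float("inf")] * m for _ in range(n)]      # step 1: all infinity
    for r, row in enumerate(rows):
        hashes = [h(r) for h in hash_funcs]           # step 2: h_1(r)..h_n(r)
        for c, bit in enumerate(row):
            if bit:                                   # step 3: only 1-entries
                for i in range(n):
                    sig[i][c] = min(sig[i][c], hashes[i])
    return sig
```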

### Locality-Sensitive Hashing

Problem: efficiently find the pairs of documents with greatest similarity. For 1M documents, one would have to check $$\binom{1\,000\,000}{2}$$ pairs (half a trillion!). Idea: only check pairs of documents whose similarity is above some lower bound.

#### Determine candidate pairs

• divide signature matrix in $$b$$ bands of $$r$$ rows
• for each band consider hash function over $$r$$-dimensional vectors. Similar pairs are hashed into the same bucket

#### Analysis of banding technique

$$b$$ bands, $$r$$ rows per band; let $$D_1$$ and $$D_2$$ be a pair of documents with Jaccard similarity $$s$$. The probability that the minhash signatures of $$D_1$$ and $$D_2$$ agree in any one particular row is $$s$$

• probability that the signatures agree in all rows of a particular band: $$s^r$$
• probability that the signatures do not agree in all rows of a particular band: $$1 - s^r$$
• probability that the signatures do not agree in all rows of any band: $$(1 - s^r)^b$$
• probability that the signatures agree in all rows of at least one band: $$1 - (1 - s^r)^b$$. This is the probability of becoming a candidate pair

The threshold $$t$$ is the similarity at which the probability of becoming a candidate is $$\frac{1}{2}$$. Treating $$t$$ as a function of $$b$$ and $$r$$, $$\frac{1}{2} = 1 - (1 - t^r)^b$$, thus an approximation is $$t \approx \left(\frac{1}{b}\right)^{\frac{1}{r}}$$.
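Both the S-curve and the threshold approximation translate directly into code; a sketch:

```python
def candidate_prob(s, b, r):
    """Probability that a pair with Jaccard similarity s agrees in all
    rows of at least one of b bands of r rows: 1 - (1 - s^r)^b."""
    return 1 - (1 - s ** r) ** b

def threshold(b, r):
    """Approximate similarity at which candidate_prob crosses 1/2."""
    return (1 / b) ** (1 / r)
```

With $$b = 20$$ and $$r = 5$$ the threshold is about $$0.55$$: pairs well above it are almost certainly candidates, pairs well below it almost never are.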

#### Wrap up

1. fix $$k$$ and construct for each document its set of $$k$$-shingles
2. sort document-shingle pairs by shingle
3. Fix $$n$$ for the minhash signatures and compute them
4. Fix the number of bands $$b$$ and the number of rows $$r$$ such that $$br = n$$, giving the threshold $$t \approx \left(\frac{1}{b}\right)^{\frac{1}{r}}$$
5. find candidate pairs
6. check whether the threshold is indeed passed
7. optionally, check whether the candidate documents are truly similar as expected

### Similarity joins

Build an inverted index. To discover candidates, compute for each set $$u$$ a prefix of length

$$|u| - \lceil t \cdot |u| \rceil + 1$$

where $$t$$ is the Jaccard similarity threshold. Only pairs of sets whose prefixes share an element need to be compared. Note: this can miss similar records.
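The prefix length formula is a one-liner; a sketch:

```python
import math

def prefix_length(size, t):
    """Number of leading elements of a set of the given size to index:
    |u| - ceil(t * |u|) + 1, with Jaccard similarity threshold t."""
    return size - math.ceil(t * size) + 1
```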

### Similarity measures

• each feature (word, shingle) of a document is assigned a dimension
• the number of occurrences is the value in the respective dimension

A distance measure is a function $$d(x, y)$$ that takes two vectors $$x, y$$ and produces a real number satisfying these axioms:

1. $$d(x, y) \geq 0$$
2. $$d(x, y) = 0$$ iff $$x = y$$
3. $$d(x, y) = d(y, x)$$
4. $$d(x, y) \leq d(x, z) + d(z, y)$$

Jaccard distance: $$d(x, y) = 1 - SIM(x, y)$$

Hamming distance: $$d([x_1, ..., x_n], [y_1, ..., y_n]) = |\{i \in \{1, ..., n\} \mid x_i \neq y_i\}|$$

Euclidean distance: $$d([x_1, ..., x_n], [y_1, ..., y_n]) = \sqrt{\sum^n_{i=1}(x_i - y_i)^2}$$

Manhattan distance: $$d([x_1, ..., x_n], [y_1, ..., y_n]) = \sum^n_{i=1} | x_i - y_i |$$

Cosine similarity: $$sim([x_1, ..., x_n], [y_1, ..., y_n]) = \frac{\vec{x} \cdot \vec{y}}{||\vec{x}|| \cdot || \vec{y} ||}$$

Cosine distance: $$d(\vec{x}, \vec{y}) = 1 - sim(\vec{x}, \vec{y})$$
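The listed measures in plain Python (vectors as tuples of numbers):

```python
import math

def euclidean(x, y):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def hamming(x, y):
    """Number of positions in which the vectors differ."""
    return sum(a != b for a, b in zip(x, y))

def cosine_sim(x, y):
    """Dot product divided by the product of the vector norms."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm

def cosine_dist(x, y):
    return 1 - cosine_sim(x, y)
```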

## Recommender systems (RS)

Given:

• information about users: ID, ratings, preferences, demographics, …
• information about products: ID, title, genre, director, …

We would like to compute a ranking based on which relevant interesting items can be recommended.

Different types of RS are:

• collaborative: based on previous decisions of other users
• content-based: based on user’s previous decisions
• social: based on trust relations, friendships, …
• knowledge-based: based on user’s needs
• hybrid: combinations of various mechanisms

### Collaborative

Typically a sparse matrix $$R$$ of ratings, with users $$U = \{u_1, ..., u_n\}$$ and products $$P = \{p_1, ..., p_m\}$$. The goal is to find $$pred(a, p)$$, where $$a \in U$$ is the active user, for all items $$p$$ not yet rated by $$a$$, maximizing $$a$$'s utility.

Assumptions: if users had similar taste in the past, they will have similar taste in the future, and preferences remain stable and consistent over time.

The rating $$r(a, q)$$ of some item $$q$$ for the active user $$a$$ is predicted based on the ratings $$r(u, p)$$ assigned to items $$p$$ by other users $$u \in U$$.

#### User-based recommendation

Predict the rating of an item by finding users similar to the active user.

Let $$P_{xy}$$ be a set of all items co-rated by both users $$x$$ and $$y$$:

$$P_{xy} = \{p \in P | r_{x, p} \neq \emptyset, r_{y, p} \neq \emptyset\}$$

Compute similarity using cosine similarity or Pearson correlation. Cosine similarity is invariant to the length of the vectors; Pearson is invariant to separate changes of location and scale of the variables, i.e. linear transformations $$a + bx$$ with constants $$a, b$$ where $$b \neq 0$$. Take the nearest neighbors and use them to compute the rating, either as a plain average or by weighting the average with the similarity factors and normalizing.
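A sketch of user-based prediction with Pearson correlation over co-rated items and a mean-centered, similarity-weighted average; the nested-dict rating store and the neighborhood size `k` are illustrative assumptions:

```python
def pearson(ra, rb):
    """Pearson correlation over the items co-rated in ra and rb
    (dicts mapping item -> rating)."""
    common = set(ra) & set(rb)
    if len(common) < 2:
        return 0.0
    ma = sum(ra[p] for p in common) / len(common)
    mb = sum(rb[p] for p in common) / len(common)
    num = sum((ra[p] - ma) * (rb[p] - mb) for p in common)
    den = (sum((ra[p] - ma) ** 2 for p in common) ** 0.5
           * sum((rb[p] - mb) ** 2 for p in common) ** 0.5)
    return num / den if den else 0.0

def predict(ratings, a, item, k=2):
    """Mean-centered, similarity-weighted average over the k most
    similar users who rated the item."""
    mean = lambda u: sum(ratings[u].values()) / len(ratings[u])
    neigh = sorted(((pearson(ratings[a], ratings[u]), u)
                    for u in ratings if u != a and item in ratings[u]),
                   reverse=True)[:k]
    num = sum(s * (ratings[u][item] - mean(u)) for s, u in neigh)
    den = sum(abs(s) for s, u in neigh)
    return mean(a) + num / den if den else mean(a)
```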

#### Item-based recommendation

User-based recommendation is problematic with a large number of users. As the matrix is sparse, only a few common ratings between two users are likely. Items are more stable than users. Thus: item-based methods are attractive!

Determine similarity of items by column-wise analysis of utility matrix.

Let $$U_{xy}$$ be the set of all users who rated both items $$x$$ and $$y$$:

$$U_{xy} = \{u \in U | r_{u, x} \neq \emptyset, r_{u, y} \neq \emptyset\}$$

Use adjusted cosine similarity. The prediction function is the similarity weighted average of all ratings of similar items by the active user.

#### Association rules

Given the utility matrix $$R$$. Analyse $$R$$ to determine association rules: if user $$u$$ liked a certain set of items $$Q \subseteq P$$, then $$u$$ will most probably also like an item $$p \notin Q$$.

Let $$a$$ be the active user.

• Check whether $$a$$ has already rated items such that the precondition of a rule is satisfied
• Recommend to $$a$$ all items $$i$$ appearing as the consequence of such a rule, whenever $$a$$ has not rated $$i$$ so far

Rules are written as $$X \Rightarrow Y$$.

• Support of an itemset $$X$$: $$supp(X) = \frac{\text{number of users having rated } X}{\text{number of all users}}$$
• Confidence of $$X \Rightarrow Y$$: $$conf(X \Rightarrow Y) = \frac{supp(X \cup Y)}{supp(X)}$$

Let $$s \in [0, 1]$$ be a threshold. Itemset $$X$$ is called frequent whenever $$supp(X) \geq s$$.

Let $$c \in [0, 1]$$ be a threshold. Determine association rules $$X \Rightarrow Y$$ where $$X \cup Y$$ is frequent and $$conf(X \Rightarrow Y) \geq c$$

Apriori property: every subset of a frequent itemset is also a frequent itemset. Whenever an itemset $$X$$ is not frequent, any itemset $$Y \supseteq X$$ is also not frequent.
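A levelwise (Apriori) sketch: candidates of size $$k$$ are generated only from frequent itemsets of size $$k - 1$$, using the pruning property above. Baskets here stand in for each user's set of liked items:

```python
from itertools import combinations

def support(itemset, baskets):
    """Fraction of baskets containing all items of the itemset."""
    return sum(1 for b in baskets if itemset <= b) / len(baskets)

def apriori(baskets, s):
    """All itemsets with support >= s, found levelwise."""
    level = {x for x in {frozenset([i]) for b in baskets for i in b}
             if support(x, baskets) >= s}
    frequent, k = set(level), 1
    while level:
        k += 1
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Apriori pruning: every (k-1)-subset must already be frequent
        level = {c for c in candidates
                 if all(frozenset(sub) in frequent
                        for sub in combinations(c, k - 1))
                 and support(c, baskets) >= s}
        frequent |= level
    return frequent

def rules(baskets, s, c):
    """Association rules X => Y with X u Y frequent and confidence >= c."""
    out = []
    for f in apriori(baskets, s):
        for size in range(1, len(f)):
            for x in map(frozenset, combinations(f, size)):
                conf = support(f, baskets) / support(x, baskets)
                if conf >= c:
                    out.append((x, f - x, conf))
    return out
```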

#### Support-Vector Machines (SVM)

Maximize the margin $$\gamma$$, the distance between the separating hyperplane and the nearest points of the training set. A $$d$$-dimensional hyperplane has at least $$d + 1$$ support vectors.

#### Transforming the training set

If data is not linearly separable, it can help to map it to a higher dimensional feature space where it is separable. For example: input space $$\vec{x} = (x_1, x_2)$$ mapped into quadratic feature space $$f(\vec{x}) = (x_1^2, x_2^2, \sqrt{2}x_1x_2)$$

## Clustering

Given a set of points, group the points into clusters using some distance notion. The distance can be cosine distance, Jaccard distance or, for example, Euclidean distance.

### Hierarchical clustering

• agglomerative (bottom-up): each point starts as its own cluster; repeatedly merge the nearest clusters into one
• divisive (top-down): everything is one cluster; recursively split

Cluster distance: distance of centroids.

$$k$$-means for small datasets, BFR for large datasets (required memory: $$O(\text{clusters})$$ rather than $$O(\text{data})$$). Idea of BFR: summarize points close enough to a centroid; group points that are close together but not close to any existing centroid.
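BFR builds on $$k$$-means; a plain Lloyd-style $$k$$-means sketch for small in-memory datasets (random initial centroids and a fixed iteration count are simplifying assumptions):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Lloyd's algorithm with Euclidean distance; points are tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # assignment step: each point goes to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda i: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[i])))
            clusters[j].append(p)
        # update step: move each centroid to the mean of its cluster
        centroids = [tuple(sum(xs) / len(c) for xs in zip(*c)) if c
                     else centroids[i] for i, c in enumerate(clusters)]
    return centroids, clusters
```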