Data Analysis and Query Languages

This document is derived of the lecture by Prof. Dr. G. Lausen at the University of Freiburg in semester SS16.

Queries and Recommendations

Query:
- relational database: structured tables
- graph database: sets of graphs
- information retrieval system: sets of documents
overwhelming amounts of data, can not pose query. Recommender systems predicts what a user is believed to be looking for

Queries

Extract set of terms from document after stop word removal, stemming and lemmatization of document’s words
set of terms is $T = \{t_1, ..., t_n\}$ and set of documents is $D = \{d_1, ..., d_m\}$. For each document $d_i$ and each term $t_k$ let $w_{i, k} \in \Re$ be a weight
vector space model: consider matrix defined by the vectors of D’s documents (e.g. term-document matrix)

Weighting terms: TF-IDF measure

term frequency: number of occurrences of term in given document
$i$ is term, $j$ is document.
- $freq(i, j)$ is number of occurrences of $i$ in $j$.
- $OtherTerms(i, j)$ is the set of terms different from $i$ in $j$.
- $maxOthers(i, j) = max_{z \in OtherTerms(i, j)}\{freq(z, j)\}$
normalized term frequency: $TF(i, j) = \frac{freq(i, j)}{maxOthers(i, j)}$
Inverse document frequency:
- $N$ number of recommendable documents
- $n(i)$ number of documents from $N$ containing keyword $i$
- $IDF(i) = log \frac{N}{n(i)}$
TF-IDF: $TF-IDF(i, j) = TF(i, j) \cdot IDF(i)$

RDF Graph Data Model

Relationships between objects can naturally be represented as “subject-predicate-object” triples: $(s, p, o)$
Subjects, objects and predicates identified by URI (IRIs), can appear in either role
Graphical interpretation: labeled direct graph
- Problem: same property used seveal times complicates graph drawings

Property Graph Data Model

property graph made up of nodes, relationships and properties
nodes contain properties, typically as key-value pairs
relationships connect and structure nodes
relationships can also have properties

Recommendations

based on
- personal profile of user
- popularity characteristic for a set of peers
- similarity to a given example
- personal profile and some additional knowledge
- combination of above
computation problem: learn a function that predicts relevance score for interesting item typically not seen before by the user
example: collaborative filtering, content-based recoomendations

RDF-graphs and SPARQL

RDF building blocks
- uris
- literals
- blank nodes
n-ary Relationships by blank nodes
SPARQL
- PREFIX: mechanism for abbreviating URIs
- SELECT: identifies the variables to be returned in query answer
- FROM: names the graph to be queries
- WHERE: query pattern

Example

-- DATA
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
_:a foaf:givenName "Alice".
_:b foaf:givenName "Bob" .
_:b dc:date "2005-04-04T04:04:04Z"^^xsd:dateTime .

-- QUERY
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
SELECT ?givenName
WHERE { ?x foaf:givenName ?name .
        OPTIONAL { ?x dc:date ?date } .
        FILTER (!bound(?date)) }

Semantics of SPARQL 1.0 (Definition 1)

A tuple from $(T \cup V) \times (U \cup V) \times (T \cup V)$ is a tripple pattern
A triple pattern is also a graph pattern. If P1 and P2 are graph patterns, then expessions (P1 AND P2), (P1 OPT P2) and (P1 UNION P2) are graph patterns
If P is graph pattern and R is SPARQL built-in condition, then the expession P FILTER R is graph pattern

Examples

P1 = ((?A, email, ?E) OPT (?A, webPage, ?W)).
P2 = (((?A, name, ?N) OPT (?A, email, ?E)) OPT (?A, webPage, ?W)).
P3 = ((?A, name, ?N) OPT ((?A, email, ?E) OPT (?A, webPage, ?W))).
P4 = ((?A, name, ?N) AND ((?A, email, ?E) UNION (?A, webPage, ?W))).

Definition 2

mapping \[ \mu : T \rightarrow T$. For a triple pattern$ t $we denote$ \mu(t) $the triple obtained by replacing the variables in$ t $according to$ \mu \]
the domain of $\mu$ $dom(\mu)$ is subset of $V$ where $\mu$ defined
mappings $\mu_1$ and $\mu_2$ are compatible when for all $x \in dom(\mu_1) \cap dom(\mu_2)$ it is the case that $\mu_1(x) = \mu_2(x)$
- disjoint domains are always compatible
- empty mapping is compatible with any other mapping

Query forms

SELECT: returns all, or a subset of the variables bound in query pattern match
CONSTRUCT: returns an RDF graph constructed by substituting variables in set of triple templates
ASK: returns boolean indicating whether a query pattern matches or not
DESCRIBE: returns RDF graph that describes resources found

Web Link Analysis

PageRank

each link’s vote is proportional to the importance of its source page
if page $j$ with importance $r_j$ has $k_j$ out-links, each link gets $\frac{r_j}{k_j}$ votes
Page $j$’s own importance is the sum of the votes on its in-links: $r_j = \sum_{i \rightarrow j} \frac{r_i}{k_i}$
transition matrix $M$ has $n$ rows and columns. Element $m_{ij}$ of $M$ has value $\frac{1}{k}$ when page $j$ has $k_j$ arcs out, and one of them is to page $i$. Otherwise \[ m_{ij} = 0\]
- M is stochastic, column values always sum up to $1$. Transition matrix of marcov process

Random Surfer

$v_i$ is probability that the surfer at next step is at page $i$ given that the surfer currently visits page $j$.
$v_1 = M v_0$
$v_2 = M V_1 = M(M v_0) = M^2 v_0$
after $i$ steps $v_i = M^i v_0$
Given that $M$ is stochastic, the markov process is stationary. A limit exists ans has to satisfy $v = Mv$ provided $M$ represents a strongly connected (web) graph

Problem: web is not strongly connected graph!

Dead end (no more outgoing link): $v_\infty = \begin{pmatrix}0 \\ 0 \\ 0 \\ 0\end{pmatrix}$ Spider Trap (infinitive loop to self): $v_\infty = \begin{pmatrix}0 \\ 0 \\ 1 \\ 0\end{pmatrix}$

Removing dead ends

Note that this will not necessarily represent possible distribution of random surfer, but approximation of relative importance of the respective pages

Treating Spider Traps (and dead ends)

New iterative step:

$v' = \beta M v + \frac{1 − \beta}{n} \cdot e$

$e$ is $n$-dim vector with all $1$’s
$n$ number of nodes in graph
$\beta$ is chosen to represent probability for a new random surfer to start at a random page with probability $1 - \beta$. Typical $\beta$ is $0.8$ to $0.9$

Topic-Sensitive PageRank

Example: search term Jaguar - how to rank pages?

Ideally: private PageRank vector
Topic-sensitive PageRank for small number of topics
Interest in topics can be determined from pages recently viewed

Biased random walk

$v' = \beta M v + (1 - \beta) \frac{e_s}{|S|})$

$S$ is the teleport set, i.e. a set of pages belonging to selected topic. $e_s$ is a vector $1$’s in the S-positions and $0$’s elsewhere

TrustRank

Spam farm: collection of pages to increase PageRank of certain page. Inaccessible pages, accessible pages and own pages.

Target page $t$, $m$ supporting pages, $n$ pages in total. Given $\beta$.
$x$ the amount of the PageRank contributed to by $t$ by the accessible pages $p$
$y$ PageRank of target page $t$
Rank of each farm page is $r = \frac{\beta y}{m} + \frac{1 - \beta}{n}$

Thus PageRank $y$ of $t$ is built from $x$, $\beta$ times PageRank of every supporting page

$\beta(\beta \frac{y}{m} + \frac{1-\beta}{n})$

Conclusion: Increase $m$ (supporting pages) to increase $y$.

TrustRank: Topic-sensitive PageRank where “topic” is a set of pages believed to be trustworthy. Required (human) seed.

Assumptions:

trust is additive
the degree of trust decreases with distance
trust is split across out-links

Spam mass of $p$ (Page) is $\frac{r - t}{r}$ (PageRank, TrustRank). Spam mass close to 1: spam.

SimRank

Measure the similarity of nodes of same type by the similarity of structural context in respective graph. Example: Users, Tags, Web pages. Users place tags, tags are placed on pages, users place tags on pages. Note that only binary relationships are expressed for simplicity of structure.

A node of the same type where a random walker starting at $N$ will most likely be in the limit is considered most similar to $N$.

$v' = \beta M v + (1 - \beta) e_N$

Hubs and Authorities

HITS: Hypertext induced Topic Selection

Ranking of Web search results based on link structure
search specific - in contrast to PageRank
Idea
- pages with good content: authorities, quality measured in weighted in-links
- pages which link to pages which have good content are hubs; links are weighted depending on quality of pages they link to
- compute scores to hubs and authorities such that the recursive definition has a unique solution (equilibrium)
Assumption: if page receives many links from other pages, then it is receiving kind of collective endorsement

Hubs and Authorities

highly endorsed answers to queries are the authorities for query
high-value lists are hubs for query
assign each page $p$ two scores: $auth(p)$ and $hub(p)$
initial values are equal to $1$.
Authority update rule: for each page $p$ update $auth(p)$ to be the sum of the hub scores of all pages that point to it
Hub update rule: for each page $p$ update $hub(p)$ to be the sum of the authority scores of all pages that it points to
Normalization step: Each authority score is divided down by the sum of all authority scores, and each hub score is divided down by the sum of all hub scores

Similar Items

Motivation: search engine should not contain items which are “too” similar. Plagiarism, mirror pages, articles from same source

Jaccard similarity of sets

$SIM(S, T) = \frac{|S \cap T|}{|S \cup T|}$

Shingling

k-shingle (or k-gram) is any substring of length $k$ found within the document.

$D$ is document. $k$-shingles appear one or moe times in $D$ is denoted $S^k(D)$.

White space can be ignored or replaced by single blank

Choosing a good $k$ value is critical, pick a $k$ large enough that the probability of any given shingle appearing in any given document is low. Rule of thumb: $k = 5$ for short documents, $k = 10$ for long documents.

(Long) shingles can be compressed by hashing them on for example $4$ bytes. To compute if hashing is needed, compute number of hashes possible vs number of shingles possible.

Real world scenario: News articles have a lot of stop words. A shingle can thus be formed out of a stop word followed by two other words.

Note: two pages with the same article and different surrounding have higher Jaccard Similarity than two pages with same surrounding and different articles.

Minhashing

Scenario: near-duplicate detection for documents. $N = 1 000 000$ documents, a pairwise test for Jaccard Similarity will result in

$\frac{N(N-1)}{2} = 5 \cdot 10^{11}$

comparisons. At $10^6$ comparisons per second, this will take $5$ days.

Goal: Estimate Jaccard Similarity by breaking down large sets of shingles into signatures.

The characteristic matrix is a matrix that associates shingles with documents (sets of shingles). If a shingle is in a document, the value is $1$, else $0$.

Minhashing considers permutations of the rows of a characteristic matrix. In respect to a permutation, the minhash value of a document is given by the number of the first row in which the column has a 1.

The probability that the minhash function for a random permutation of rows produces the same value for two sets equals the Jaccard similarity of those sets.

Minhash signature matrix

Given $n$ minhash functions $h_1, h_2, ..., h_n$. For a collection of sets $S_1, ..., S_m$ we construct a matrix SIG with columns assigned to the sets $S_i$ with $1 \leq i \leq m$, for each such set $S_i$ its column is defined by the vector $[h_1(S_i), ..., h_n(S_i)]$

Algorithm: check each row of characteristic matrix only once and can compute all hash functions simultaneously. $SIG(i, c)$ refers to the $i$th hash function and column $c$ (representing set $S_c$).

Set $SIG(i, c)$ to $\infty$ for all $i$ and $c$
Compute $h_1(r), ..., h_n(r)$ for each row $r$ of the matrix
for each row in the given order, for each column $c$ do the following
1. if $c$ has $0$ in row $r$, do nothing
2. if $c$ has $1$ in row $r$, then for each $i = 1, ..., n$ set $SIG(i, c)$ to the smaller of the current value of $SIG(i, c)$ and $h_i(r)$

Locality-Sensitive Hashing

Problem: efficiently find pairs of documents with greatest similarity. For 1M documents, would have to check $1 000 000 \choose 2$ pairs (half a trillion!). Idea: only check pairs of documents whose similarity is above some lower bound.

Determine candidate pairs

divide signature matrix in $b$ bands of $r$ rows
for each band consider hash function over $r$-dimensional vectors. Similar pairs are hashed into the same bucket

Analysis of banding technique

$b$ bans, $r$ rows, $D_1$ and $D_2$ pair of documents with Jaccard similarity $s$. Probability that the minhash signatures of $D_1$ and $D_2$ agree in any one particular row is $s$

probability that signature agree in all rows of particular band is $s^r$
probability that signatures do not agree in all rows of particular band is $1 - s^r$
probability that the signatures do not agree in all rows of any bands $(1 - s^r)^b$
probability that the signatures agree in all rows of at least one band is $1 - (1 - s^r)^b$. This is the probability of becoming candidate pair

Threshould $t$ at which probability of becoming a candidate is $\frac{1}{2}$. Treat $t$ as function of $b$ and $r$: $\frac{1}{2} = 1 - (1 - t^r)^b$ - thus approximation is $\frac{1}{b}^\frac{1}{r}$

Wrap up

fix $k$ and construct for each document its set of $k$-shingles
sort document-shingle pairs by shingle
Fix $n$ for the minhash signatures and compute them
Fix the number of bands $b$ and the number of rows $r$ such that $br = n$ and the threshold $t = \frac{1}{b}^\frac{1}{r}$
find candidate pairs
check whether threshould is passed indeed
optinally, check the candidate documents whether they are truly similar as expected

Similarity joins

Build inverted index, to discover candidates compute prefixes in set of length

$|u| - ceil(t \cdot |u|) + 1$

where $t$ is Jaccard Similarity threshold. Only compare when prefix elements are candidates in other prefixes. Note: this can miss similar records.

Similarity measures

each feature (word, shingle) of a document is assigned as dimension
number of occurrences is the value in respective dimension

A distance measure is a function $d(x, y)$ that takes two vectors $x, y$ and produces a real number satisfying these axioms:

$d(x, y) \geq 0$
$d(x, y) = 0$ iff $x = y$
$d(x, y) = d(y, x)$
$d(x, y) \leq d(x, z) + d(z, y)$

Jaccard: $d(x, y) = 1 - SIM(x, y)$ Hamming: $d([x_1, ..., x_n], [y_1, ..., y_n]) = |\{i \in \{1, ...,n\} | x_i \neq y_i\}|$ Euclidean Distance: $d([x_1, ..., x_n], [y_1, ..., y_n]) = \sqrt{\sum^n_{i=1}(x_i - y_i)^2}$ Manhattan: $d([x_1, ..., x_n], [y_1, ..., y_n]) = \sum^n_{i=1} | x_i - y_i |$ Cosine similarity: $sim([x_1, ..., x_n], [y_1, ..., y_n]) = \frac{\vec{x} \cdot \vec{y}}{||\vec{x}|| \cdot || \vec{y} ||}$ Cosine distance: $d(\vec{x}, \vec{y}) = 1 - sim(\vec{x}, \vec{y})$

Recommender systems (RS)

Given:

information about users: ID, ratings, preferences, demographics, …
information about products: ID, title, genre, directory, …

We would like to compute a ranking based on which relevant interesting items can be recommended.

Differente types of RS are:

collaborative: based on previous decisions of other users
content-based: based on user’s previous decisions
social: based on trust relations, friendships, …
knowledge-based: based on user’s needs
hybrid: combinations of various mechanisms

Collaborative

Typically sparse matrix with users $U = \{u_1, ..., u_n\}$ and products $P = \{p_1, ..., p_m\}$ $R$ containing ratings. Goals is to find $pred(a, p)$ where $a$ is an active user of $U$ for all non rated items $p$ that maximizes $a$’s utility.

Assumptions: If users had similar taste in past, they will have similar taste in the future and preferences remain stable and consistent over time.

Rating $r(a, q)$ of some item $q$ for active user $a$ is predicated based on ratings $r(u, p)$ assigned to item $p$ by other users $u \in U$.

User-based recommendation

Predict similarity of item by finding similar users to the active user.

Let $P_{xy}$ be a set of all items co-rated by both users $x$ and $y$:

$P_{xy} = \{p \in P | r_{x, p} \neq \emptyset, r_{y, p} \neq \emptyset\}$

Compute similarity using Cosine similarity or Pearson correlation. Cosine similarity is invariant to the length of vectors, Person is invariant to separate changes of location and scale of the variables, i.e. linear transformations $a + bx$ where $a, b$ constants with $b \neq 0$. Take nearest neighbors and use them to compute rating, either using average or weighting the average with similarity factors and normalisation.

Item-based recommendation

User-based recommendation is problematic with large number of users. As matrix is sparse, only a few common ratings between two users are likely. Items are more sable than users. Thus: Item-based methods are attractive!

Determine similarity of items by column-wise analysis of utility matrix.

Let $U_{xy}$ be a set of all users co-rated by both items $x$ and $y$:

$U_{xy} = \{u \in U | r_{u, x} \neq \emptyset, r_{u, y} \neq \emptyset\}$

Use adjusted cosine similarity. The prediction function is the similarity weighted average of all ratings of similar items by the active user.

Association rules

Given utility matrix $R$. Analyse $R$ to determine association rules: If user $u$ liked a certain set of items $Q \subseteq P$ then $u$ will most probably also like item $p \notin Q$.

Let $a$ be the active user. * Check whether $a$ has alerady rated items such that the precondition of rule is satisfied * recommend to $a$ all items $i$ appearing as consequence of such rule, whenever $a$ has not rated $i$ so far.

Rules are written as $X \Rightarrow Y$.

Support of $X \Rightarrow Y$: $supp(X) = \frac{\text{number of users having rated } X}{\text{number of all users}}$
Confidence of $X \Rightarrow Y$: $conf(X \Rightarrow Y) = \frac{supp(X \cup Y)}{supp(X)}$

Let $s \in [0, 1]$ be a threshold. Itemset $X$ is called frequent whenever $supp(X) \geq s$.

Let $c \in [0, 1]$ be a threshold. Determine association rules $X \Rightarrow Y$ where $X \cup Y$ is frequent and $conf(X \Rightarrow Y) \geq c$

Apriori property: Eavery subset of frequent itemset is also frequent itemset. Whenever an itemset $X$ is not frequent, any itemset $$ Y X$is also not frequent

Finding association rules: 1. consider$ X Y $such that$ supp(XY) s $and the active user has rated all items in$ X $$ 2. Verify $\frac{supp(XY)}{supp(X)} \geq c$ 3. Compute the union of all items appearing in the consequence of such rules 4. Sort the items according to confidence of the rule that predicted them. If there are multiple, take the one with the highest confidence 5. Return the first $N$ elements of the ordered list as recommendations

Slope on predictors

Base prediction on popularity differential between items for users.

Rating database $R$, sets of users $U$. $u_i = r_{u, i}$ user’s rating of item $i$.

Items $i, j: S_{j, i}(R)$ denotes the set of users that rated both $i$ and $j$; these are the rows in $R$ which contain co-ratings for $i$ and $j$.

Average deviation of items $i$ and $j$

$dev_{j, i} = \sum_{u \in S_{j, i}(R)} \frac{r_{u, j} - r_{u, i}}{|S_{j, i(R)}|}$

Prediction for an item $j$ for user $a$ based on item $i$ is $pred(a, j) = dev_{j, i} + a_i$

Combination of all individual predictions:

$pred(a, j) = \frac{\sum_{i \in Relevant(j)}(dev_{j, i} + a_i)}{|Relevant(j)|}$

Slope-on algorithms are a simple approach to items-based recoommendation. They are easy to implement and their accuracy is on par with more complicated and computationally more expensive algorithms.

Data Sparsity and Cold-start Problem

Challenge: Good predictions when there are few ratings available. Idea: Add more information, exploit transitivity of customer tastes to augment given matrix. Recommendation thus becomes a graph-problem.

Matrix dimensionality reduction

Filling blank entries “solves” the recommendation problem. General theory singular-value decomposition tells us that a given matrix can be expressed as product of other matrices with lower dimensionality. This makes sense if reaction of users on items can be determined based on small set of features of items and users (latent factors). To apply SVD to a matrix, it must be free from blank entries. As a solution, the UV-decomposition algorithm decomposes a given matrix $M$ intro matrices $U, V$ such that $M \sim UV$ where $M$ may contain blank entries. As $U$ and $V$ do not have blank entires, $UV$ does not have blank entries as well. Product fills blank entries in $M$.

UV-Decomposition

Reduce $5 \times 5$ matrix to $5 \times 2$ and $2 \times 5$ matrix. Observation: already contains less elements than original matrix. Find $U$ and $V$ that closely approximate $M$. Construct $U$ and $V$ iteratively, continuously converging to $M$. Process stops when difference to $M$ is small enough.

Difference: Root-Mean-Square Error (RMSE):

$RMSE(M, M') = \sqrt{\frac{1}{k} \sum_{i=1}^n \sum_{j=1}^m(m_{ij} - m'_{ij})^2}$

Note: $k$ is number of non blank elements, term will produce $0$ when $m_{ij}$ is blank in $M$.

SVD

Let $M$ be an \[ m \times n$matrix with rank$ r \]. Find matrices $U$, $\Sigma$ and $V$ with following properties:

$U$ is $m \times r$ column-orthonormal matrix
$V$ is $n \times r$ column-orthonormal matrix, $V^T$ is a row-orthonormal matrix
$\Sigma$ is a diagonal matrix. Elements of $\Sigma$ are called singular values of $M$

Content-based methods

Rating $r(u, q)$ of some item $q$ for user $u$ is estimated based on the ratings $r(u, p)$ assigned by user $u$ to itmes $p \in P$ that are similar to item $q$.

The content of item $p$, $Content(p)$ typically is given as an item profile, i.e. set of attributes characterizing item $p$. (Meta-Information, keywords)

Profile of user $u$ contains tastes and preferences. (Given or learned)

Utility function $r(u, p) = pred(ContentBasedProfile(u), Content(p))$

Jaccard: $SIM(D_1, D_2) = \frac{|D_1 \cap D_2|}{|D_1 \cup D_2|}$ Dice: $SIM(D_1, D_2) = \frac{2|D_1 \cap D_2|}{|D_1| + |D_2|}$

Prediction rating/interest is computed by analysing how user liked similar items/documents in the past. Rocchio’s method: also take into account feedback for results by user.

Learning models

Collaborative recommendation

Compute conditional probabilities for each possible rating value given Alice’s other ratings.

Eg.: $X = \{i_1 = 1, i_2 = 3, ...\}$ compute $P(i_5 = 1 | X)$, $P(i_5 = 2 | X)$, …

Bayes

$P(X | Y) = \frac{P(X, Y)}{P(Y)}$

Thus:

$P(Y | X) = \frac{P(X | Y)P(Y)}{P(X)}$

The naive Bayes assumes that features (ratings) are conditionally independent:

$P(Y | X) = \frac{\prod^d_{i=1} P(X_i | Y)P(Y)}{P(X)}$

where $d$ is number of features.

$P(X)$ is considered constant and will be ignored.

Remark: in many domains naive Bayes classifiers work well even when conditionally independence assumption does not hold. Data sparsity problems has to be solved.

Linear classifiers

Task: find line $w_1x_1 + w_2x_2 = b$ which separates instances of two classes. Parameters $w_1, w_2, b$ have to be learned

In general: $n$ dimensions, classification function $\vec{w}^T \vec{x} = b$ has to be learned. (Hyperplane separating the two classes)

Machine learning (ML)

Training set: data which an ML algorithm is applied. Feature vector and label $(\vec{x}, y)$
Discover a function $y = f(\vec{x})$ that best predicts the value $y$ associated with $\vec{x}$.
- $y$ a real number: regression
- $y$ a boolean value: binary classification
- $y$ member of some finite set: multiclass classification
Test set: validate the learned function.
- High error rate signals overfitting of model (to specific for the training set)
- cross-validation: data is divided into $k$ equal sized chunks, each chunk is treated as test data w.r.t. the remaining chunks form training data.

Perceptron: A linear binary classifier

Input is $\vec{x} = [x_1, ..., x_d]$ with weights $\vec{w} = [w_1, ..., w_d]$ and threshold $\theta$

Output is $+1$ if $\vec{w}\vec{x} > \theta$, $-1$ if $\vec{w}\vec{x} < \theta$.

Training a perceptron we try to find a weight vector $\vec{w}$ and a threshold $\theta$ such that all feature vectors with $y = +1$ are on the positive side and all with $y = -1$ on the negative side.

This will only work for data that is linearly separable in the sense that there exists a hyperplane. There may exist more than one possible hyperplane.

Algorithm: 1. Initialize $\vec{w} = \vec{0}$ 2. Pick learning-rate parameter $\eta$ (a small real number). This is crucial because if $\eta$ is too small then convergence is slow, if it is too big the boundary will “dance around” and might not converge 3. Consider each training example $t = (\vec{x}, y)$ in turn. * Let $y' = \vec{w}\vec{x}$ * If $y'$ and $y$ have the same sign, do nothing. $t$ is properly classified * If $ y’$ and $y$ have different signs or $y' = 0$ then $\vec{w} = \vec{w} + \eta y \vec{x}$

Support-Vector Machines (SVM)

Maximize the distance $\gamma$ between the hyperplane and any point of training set. A $d$-dimensional hyperplane has at least $d + 1$ support vectors.

Transforming training set

If data is not linearly separable, it can help to map it to a higher dimensional feature space where it is separable. For example: input space $\vec{x} = (x_1, x_2)$ mapped into quadratic feature space $f(\vec{x}) = (x_1^2, x_2^2, \sqrt{2}x_1x_2)$

Clustering

Given a set of points, group points into clusters using some distance notion. Distance can be cosine distance, jaccard distance or for example euclidean distance.

Hierarchical clustering

agglomerative (bottom up). Each point is a cluster, merge nearest clusters into one
divise: everything is one cluster, recursively split.

Cluster distance: distance of centroids.

$k$-means for small datasets. BFR for large datasets (required memory: $O(c)$ and not $O(d)$ where $c$ is clusters and $d$ is data). Idea of BFR: Summarize points close enough to centroid, group points that are close together but not close to any existing centroid.

Data Analysis and Query Languages SS16

Queries and Recommendations

Queries

Weighting terms: TF-IDF measure

RDF Graph Data Model

Property Graph Data Model

Recommendations

RDF-graphs and SPARQL

Example

Semantics of SPARQL 1.0 (Definition 1)

Examples

Definition 2

Query forms

Web Link Analysis

PageRank

Random Surfer

Removing dead ends

Treating Spider Traps (and dead ends)

Topic-Sensitive PageRank

Biased random walk

TrustRank

SimRank

Hubs and Authorities

HITS: Hypertext induced Topic Selection

Hubs and Authorities

Similar Items

Jaccard similarity of sets

Shingling

Minhashing

Minhash signature matrix

Locality-Sensitive Hashing

Determine candidate pairs

Analysis of banding technique

Wrap up

Similarity joins

Similarity measures

Recommender systems (RS)

Collaborative

User-based recommendation

Item-based recommendation

Association rules

Slope on predictors

Data Sparsity and Cold-start Problem

Matrix dimensionality reduction

UV-Decomposition

SVD

Content-based methods

Learning models

Collaborative recommendation

Bayes

Linear classifiers

Machine learning (ML)

Perceptron: A linear binary classifier

Support-Vector Machines (SVM)

Transforming training set

Clustering

Hierarchical clustering