# Models of Computation for Big Data


**Advanced Information and Knowledge Processing**

**SpringerBriefs in Advanced Information and Knowledge Processing**

**Series Editors**

Xindong Wu

*School of Computing and Informatics, University of Louisiana at Lafayette, Lafayette, LA, USA*

Lakhmi Jain

*University of Canberra, Adelaide, SA, Australia*

*SpringerBriefs in Advanced Information and Knowledge Processing* presents concise research in this exciting field. Designed to complement Springer’s *Advanced Information and Knowledge Processing* series, this Briefs series provides researchers with a forum to publish their cutting-edge research which is not yet mature enough for a book in the *Advanced Information and Knowledge Processing* series, but which has grown beyond the level of a workshop paper or journal article.

Typical topics may include, but are not restricted to: Big Data analytics, Big Knowledge, Bioinformatics, Business intelligence, Computer security, Data mining and knowledge discovery, Information quality and privacy, Internet of things, Knowledge management, Knowledge-based software engineering, Machine intelligence, Ontology, Semantic Web, Smart environments, Soft computing, Social networks.

SpringerBriefs are published as part of Springer’s eBook collection, with millions of users worldwide, and are available for individual print and electronic purchase. Briefs are characterized by fast, global electronic dissemination, standard publishing contracts, easy-to-use manuscript preparation and formatting guidelines, and expedited production schedules to assist researchers in distributing their research fast and efficiently.

Rajendra Akerkar

**Models of Computation for Big Data**

Rajendra Akerkar, Western Norway Research Institute, Sogndal, Norway

ISSN 1610-3947 e-ISSN 2197-8441 Advanced Information and Knowledge Processing

ISSN 2524-5198 e-ISSN 2524-5201 SpringerBriefs in Advanced Information and Knowledge Processing

ISBN 978-3-319-91850-1 e-ISBN 978-3-319-91851-8

Library of Congress Control Number: 2018951205

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2018

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland.

**Preface**

This book addresses algorithmic problems in the age of big data. Rapidly increasing volumes of diverse data from distributed sources create challenges for extracting valuable knowledge and commercial value from data. This motivates increased interest in the design and analysis of algorithms for rigorous analysis of such data.

The book covers mathematically rigorous models, as well as some provable limitations of algorithms operating in those models. Most of the techniques discussed come from research in the last decade, and many of the algorithms we discuss have important applications in Web data compression, approximate query processing in databases, network measurement, signal processing and so on. We discuss lower bound methods in some models, showing that many of the algorithms we present are optimal or near optimal. The book focuses on the underlying techniques rather than the specific applications.

This book grew out of my lectures for the course on big data algorithms. *Algorithmic aspects for modern data models* is a success in research, teaching and practice, which has to be attributed to the efforts of the growing number of researchers in the field, to name a few: Piotr Indyk, Jelani Nelson, S. Muthukrishnan and Rajiv Motwani. Their excellent work is the foundation of this book. This book is intended for both graduate students and advanced undergraduate students who satisfy the prerequisites in discrete probability, basic algorithmics and linear algebra.

I wish to express my heartfelt gratitude to my colleagues at Vestlandsforsking, Norway, and Technomathematics Research Foundation, India, for their encouragement in persuading me to consolidate my teaching materials into this book. I thank Minsung Hong for help with the LaTeX typing. I would also like to thank Helen Desmond and the production team at Springer. Thanks to the INTPART programme funding for partially supporting this book project. The love, patience and encouragement of my father, son and wife made this project possible.

**Rajendra Akerkar** **Sogndal, Norway** **May 2018**


## 1. Streaming Models


### 1.1 Introduction

In the analysis of big data there are queries that do not scale, since they need massive computing resources and time to generate exact results. Examples include count distinct, most frequent items, joins, matrix computations, and graph analysis. If approximate results are acceptable, there is a class of dedicated algorithms, known as streaming algorithms or sketches, that can produce results orders of magnitude faster and with provable error bounds. For interactive queries there may be no other practical option, and in the case of real-time analysis, sketches are the only recognized solution.

Streaming data is a sequence of digitally encoded signals used to represent information in transmission. For streaming data, the input to be operated on is not available all at once, but rather arrives as a continuous sequence. A data stream is thus a sequence of data elements that is far larger than the amount of available memory. More often than not, an element is simply an (integer) number from some range. However, it is often convenient to allow other data types, such as multidimensional points, metric points, graph vertices and edges, etc. The goal is to approximately compute some function of the data using only one pass over the data stream. The critical aspect in designing data stream algorithms is that any data element that has not been stored is ultimately lost forever. Hence, it is vital that data elements are properly selected and preserved.

Data streams arise in several real-world applications. For example, a network router must process terabits of packet data, which cannot all be stored by the router. At the same time, there are many statistics and patterns of the network traffic that are useful to know in order to detect unusual network behaviour. Data stream algorithms enable computing such statistics fast while using little memory. In streaming we want to maintain a sketch *F*(*X*) on the fly as *X* is updated. For example, if numbers arrive one at a time, we can keep a running sum, which is a streaming algorithm. The streaming setting appears in many places; for example, a router can monitor online traffic and sketch the volume of traffic to find traffic patterns.

The fundamental mathematical ideas to process streaming data are sampling and random projections. Sampling methods include universe sampling, reservoir sampling, etc. There are two main difficulties with sampling for streaming data. First, sampling is not a powerful primitive for many problems, since too many samples are needed for performing sophisticated analysis, and lower bounds on the number of samples are known. Second, as the stream unfolds, if the samples maintained by the algorithm get deleted, one may be forced to resample from the past, which is in general expensive or impossible in practice and, in any case, not allowed in streaming data problems. Random projections rely on dimensionality reduction, using projection along random vectors. The random vectors are generated by space-efficient computation of random variables. These projections are called sketches. There are many variations of random projections which are of a simpler type.

Sampling and sketching are two basic techniques for designing streaming algorithms. The idea behind sampling is simple to understand. Every arriving item is preserved with a certain probability, and only a subset of the data is kept for further computation. Sampling is also easy to implement, and has many applications.

Sketching is the other technique for designing streaming algorithms. Sketch techniques have undergone wide development within the past few years. They are particularly appropriate for the data streaming scenario, in which large quantities of data flow by and the sketch summary must continually be updated rapidly and compactly. A sketch-based algorithm creates a compact synopsis of the data which has been observed, and the size of the synopsis is usually much smaller than the full observed data. Each update observed in the stream potentially causes this synopsis to be updated, so that the synopsis can be used to approximate certain functions of the data seen so far. In order to build a sketch, we should be able either to perform a single linear scan of the input data (in no strict order), or to scan the entire stream which collectively builds up the input. Note that many sketches were originally designed for computations in situations where the input is never collected together in one place, but exists only implicitly as defined by the stream.

A sketch *F*(*X*) with respect to some function *f* is a *compression* of data *X*. It allows us to compute *f*(*X*) (approximately) given access only to *F*(*X*). A sketch of large-scale data is a small data structure that lets you approximate particular characteristics of the original data. The exact nature of the sketch depends on what you are trying to approximate as well as the nature of the data.
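As an illustration of the sampling idea, the following is a minimal sketch of reservoir sampling, one of the sampling methods mentioned earlier. It keeps a uniform sample of `k` items from a stream of unknown length, retaining the item at position `t` with probability `k/t`. The function name `reservoir_sample` and its parameters are illustrative choices, not notation from the text.

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = rng or random.Random()
    reservoir = []
    for t, item in enumerate(stream, start=1):
        if t <= k:
            # The first k items always enter the reservoir.
            reservoir.append(item)
        else:
            # Item t replaces a random slot with probability k / t.
            j = rng.randrange(t)  # uniform in {0, ..., t - 1}
            if j < k:
                reservoir[j] = item
    return reservoir

# One pass over a million items while storing only five of them.
sample = reservoir_sample(range(1_000_000), k=5, rng=random.Random(0))
```

Only the reservoir is kept in memory, so an item that is not selected when it arrives is lost, exactly as in the streaming setting described above.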

The goal of the streaming algorithm is to make one pass over the data and to use limited memory to compute functions of *x*, such as the frequency moments, the number of distinct elements, the heavy hitters, and treating *x* as a matrix, various quantities in numerical linear algebra such as a low rank approximation. Since computing these quantities exactly or deterministically often requires a prohibitive amount of space, these algorithms are usually randomized and approximate.

Many algorithms that we will discuss in this book are randomized, since randomization is often necessary to achieve good space bounds. A *randomized algorithm* is an algorithm that can toss coins and take different actions depending on the outcome of those tosses. Randomized algorithms have several advantages over deterministic ones. Usually, randomized algorithms tend to be simpler than deterministic algorithms for the same task. For example, the strategy of picking a random element to partition the problem into subproblems and recursing on one of the partitions is much simpler than its deterministic counterparts. Further, for some problems randomized algorithms have a better asymptotic running time than their deterministic counterparts. Randomization can be beneficial when the algorithm faces a lack of information, and it is also very useful in the design of online algorithms, where we seek a solution that is good for all inputs. Randomization, in the form of sampling, can help us estimate the size of exponentially large spaces or sets.
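The strategy of partitioning around a random element and recursing on one side can be illustrated with randomized selection (quickselect). This is a small sketch chosen here as an example of that strategy; it is not an algorithm analysed in this chapter, and the name `quickselect` is ours.

```python
import random

def quickselect(items, k, rng=None):
    """Return the k-th smallest element (0-indexed) of items."""
    rng = rng or random.Random()
    items = list(items)
    while True:
        pivot = rng.choice(items)  # the coin toss: a random pivot
        smaller = [v for v in items if v < pivot]
        equal = [v for v in items if v == pivot]
        if k < len(smaller):
            items = smaller  # answer lies among the smaller elements
        elif k < len(smaller) + len(equal):
            return pivot     # the pivot itself is the answer
        else:
            k -= len(smaller) + len(equal)
            items = [v for v in items if v > pivot]

data = [7, 2, 9, 4, 4, 1, 8]
median = quickselect(data, len(data) // 2)
```

The result does not depend on the coin tosses, only the running time does.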

### 1.2 Space Lower Bounds

The advent of cutting-edge communication and storage technology enables large amounts of raw data to be produced daily, and subsequently there is a rising demand to process this data efficiently. Since it is unrealistic for an algorithm to store even a small fraction of the data stream, its performance is typically measured by the amount of space it uses. In many scenarios, such as internet routing, once a stream element is examined it is lost forever unless explicitly saved by the processing algorithm. This, along with the sheer size of the data, makes multiple passes over the data impracticable.

Let us consider the distinct elements problem: find the number of distinct elements in a stream, where queries and additions are allowed. We denote by *s* the space of the algorithm, by *n* the size of the universe from which the elements arrive, and by *m* the length of the stream.

**Theorem 1.1** There is no deterministic exact algorithm for computing the number of distinct elements in $o(\min\{n, m\})$ space (Alon et al. 1999).

*Proof* Using a streaming algorithm with space *s* for the problem, we are going to show how to encode any $x \in \{0,1\}^n$ using only *s* bits. In other words, we are going to produce an injective mapping from $\{0,1\}^n$ to $\{0,1\}^s$, which implies that *s* must be at least *n*. We look for procedures *Enc* and *Dec* such that $Dec(Enc(x)) = x$ for all *x*, where *Enc*(*x*) is a function from $\{0,1\}^n$ to $\{0,1\}^s$.

In the encoding procedure, given a string *x*, devise a stream that contains *i* if and only if $x_i = 1$. Then *Enc*(*x*) is the memory content of the algorithm after processing that stream. In the decoding procedure, consider each *i* in turn: starting from the memory content *Enc*(*x*), add *i* at the end of the stream and then query the number of distinct elements. If the number of distinct elements increases, this implies that $x_i = 0$; otherwise it implies that $x_i = 1$. So we can recover *x* completely, which completes the proof.
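The encoding argument can be made concrete with a toy example. The sketch below assumes a hypothetical exact counter, `ExactDistinctCounter`, whose memory is simply the set of items seen; `encode` and `decode` follow the two procedures of the proof. The point is that any exact deterministic counter can be used this way to recover all *n* bits, so its memory cannot be smaller than *n* bits.

```python
class ExactDistinctCounter:
    """Toy deterministic exact counter; its memory is the set of items seen."""

    def __init__(self, state=frozenset()):
        self.seen = set(state)

    def add(self, item):
        self.seen.add(item)

    def query(self):
        return len(self.seen)

    def memory(self):
        return frozenset(self.seen)

def encode(x):
    """Enc(x): feed i for every i with x[i] == 1 and return the memory contents."""
    algo = ExactDistinctCounter()
    for i, bit in enumerate(x):
        if bit == 1:
            algo.add(i)
    return algo.memory()

def decode(memory, n):
    """Dec: for each i, restart from the saved memory, append i, and query."""
    x = []
    for i in range(n):
        algo = ExactDistinctCounter(memory)
        before = algo.query()
        algo.add(i)
        # The count increases exactly when i was not in the stream, i.e. x[i] == 0.
        x.append(0 if algo.query() > before else 1)
    return x

x = [1, 0, 0, 1, 1, 0, 1, 0]
recovered = decode(encode(x), len(x))
```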

Next we show that deterministic approximation algorithms also require linear space for this problem.

**Theorem 1.2** Any deterministic algorithm that provides a 1.1-approximation requires $\Omega(n)$ space.

*Proof* Suppose we had a collection *F* of subsets of $\{1, \ldots, n\}$ fulfilling the following: $|F| \ge 2^{cn}$ for some constant $c < 1$; $|S| = n/100$ for every $S \in F$; and $|S \cap T| \le n/2000 = |S|/20$ for all distinct $S, T \in F$.

Let us consider the algorithm to encode the vectors $x_S$ for all $S \in F$, where $x_S$ is the indicator vector of set *S*. The lower bound follows since we must have $s \ge \log_2 |F| \ge cn$. The encoding procedure is the same as in the previous proof.

In the decoding procedure, we iterate over all sets in *F* and test for each set *S* whether it corresponds to our initial encoded set. Let *M* be the memory contents of the streaming algorithm after having inserted the initial string. For each *S*, we initialize the algorithm with memory contents *M* and then feed element *i* for every $i \in S$. If *S* equals the initial encoded set, the true number of distinct elements does not change, so the reported count changes at most slightly, whereas if it does not, the true number almost doubles, to at least $1.95 \cdot n/100$. Considering the approximation guarantee of the algorithm, we see that if *S* is not our initial set then the reported number of distinct elements grows by a factor of at least $3/2$, while for the initial set it grows by a factor of at most $1.1^2 = 1.21$.

In order to confirm the existence of such a family of sets *F*, we partition $\{1, \ldots, n\}$ into $n/100$ intervals of length 100 each. To form a set *S* we select one number from each interval uniformly at random. Obviously, such a set has size exactly $n/100$. For two sets *S*, *T* selected uniformly at random as before, let $U_i$ be the random variable that equals 1 if they have the same number selected from interval *i*. So, $P[U_i = 1] = 1/100$. Hence the expected size of the intersection is just $E\left[\sum_{i=1}^{n/100} U_i\right] = \frac{n}{100} \cdot \frac{1}{100} = \frac{n}{10000}$. The probability that this intersection is bigger than five times its mean, that is, bigger than $n/2000$, is smaller than $e^{-c'n}$ for some constant $c'$, by a standard Chernoff bound. Finally, by applying a union bound over all pairwise intersections one can prove the result.

### 1.3 Streaming Algorithms

An important aspect of streaming algorithms is that these algorithms have to be approximate. There are a few things that one can compute exactly in a streaming manner, but there are many crucial quantities that cannot be computed that way, so we have to approximate. Most significant aggregates can be approximated online. There are two main tools: (1) Hashing, which maps item identifiers to pseudo-random values. (2) Sketching, which takes a very large amount of data and builds a very small sketch of it. Carefully done, the sketch can be used to obtain the values of interest; the challenge is to find a good sketch. All of the algorithms discussed in this chapter use sketching of some kind and some use hashing as well.

One popular streaming algorithm is HyperLogLog by Flajolet et al. Cardinality estimation is the task of determining the number of distinct elements in a data stream. While the cardinality can be easily computed using space linear in the cardinality, for several applications this is totally unrealistic and requires too much memory. Therefore, many algorithms that approximate the cardinality while using fewer resources have been developed. HyperLogLog is one of them. These algorithms play an important role in network monitoring systems, data mining applications, as well as database systems.

The basic idea is that if we have *n* samples that are hashed and inserted into the interval [0, 1), those *n* samples are going to make $n + 1$ intervals. Therefore, the average size of the $n + 1$ intervals has to be $1/(n+1)$. By symmetry, the average distance to the minimum of those hashed values is also going to be $1/(n+1)$. Furthermore, duplicate values will go exactly on top of previous values, thus *n* is the number of unique values we have inserted. For instance, if we have ten samples, the minimum is going to be right around 1/11. HyperLogLog is shown to be near optimal among algorithms that are based on order statistics.
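The minimum-of-hashes idea can be tried out directly. The sketch below is not HyperLogLog itself; it is a simplified estimator built only on the observation that the minimum of *n* hashed values is about $1/(n+1)$. It averages the minima of `k` hash functions and inverts the average. The helper `hash01`, the class `MinHashCounter`, and the use of SHA-256 as a stand-in for a random hash function are all assumptions made for illustration.

```python
import hashlib

def hash01(item, seed):
    """Map (seed, item) to a pseudo-random float in [0, 1) using SHA-256."""
    digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
    return (int.from_bytes(digest[:8], "big") >> 11) / 2**53

class MinHashCounter:
    """Estimate the number of distinct items from the minima of k hash functions."""

    def __init__(self, k=64):
        self.mins = [1.0] * k

    def add(self, item):
        for seed in range(len(self.mins)):
            h = hash01(item, seed)
            if h < self.mins[seed]:
                self.mins[seed] = h

    def estimate(self):
        # Each minimum is about 1 / (n + 1) on average, so invert the mean.
        mean_min = sum(self.mins) / len(self.mins)
        return 1.0 / mean_min - 1.0

counter = MinHashCounter()
for v in range(1000):
    counter.add(v)
first = counter.estimate()
for v in range(1000):
    counter.add(v)  # duplicates hash to the same points and change nothing
second = counter.estimate()
```

Only `k` floating-point numbers are stored, regardless of the stream length, and re-inserting the same items leaves the estimate unchanged.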

### 1.4 Non-adaptive Randomized Streaming

Non-trivial update time lower bounds for randomized streaming algorithms in the Turnstile Model were presented in (Larsen et al. 2014). Only a specific restricted class of randomized streaming algorithms, namely those that are non-adaptive, could be bounded.

Most well-known turnstile streaming algorithms in the literature are non-adaptive. Reference (Larsen et al. 2014) gives non-trivial update time lower bounds for both randomized and deterministic turnstile streaming algorithms, which hold when the algorithms are non-adaptive.

**Definition 1.1** A non-adaptive randomized streaming algorithm is an algorithm that may toss random coins before processing any elements of the stream, and in which, on any update operation, the words read from and written to memory are determined by the index of the updated element and the initially tossed coins.

These constraints mean that memory must not be read or written to based on the current state of the memory, but only according to the coins and the index. Comparing the above definition to sketches, a hash function chosen independently from any desired hash family can emulate these coins, enabling the update algorithm to find the specific words of memory to update using only the hash function and the index of the element to update. This makes the non-adaptive restriction fit the sketch-based Turnstile Model algorithms exactly. Both the Count-Min Sketch and the Count-Median Sketch are non-adaptive and support point queries.
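To see what non-adaptivity looks like in practice, here is a minimal Count-Min Sketch. All coins (the hash coefficients) are drawn in the constructor, and the cells touched by an update depend only on those coins and on the item index, never on the current table contents. The class layout and the hash family $((a \cdot i + b) \bmod P) \bmod \text{width}$ are illustrative assumptions; the text does not fix a particular implementation.

```python
import random

class CountMinSketch:
    P = 2**61 - 1  # a Mersenne prime used by the hash family

    def __init__(self, width, depth, seed=0):
        rng = random.Random(seed)  # all coins are tossed before the stream starts
        self.width = width
        self.coeffs = [(rng.randrange(1, self.P), rng.randrange(self.P))
                       for _ in range(depth)]
        self.table = [[0] * width for _ in range(depth)]

    def _cell(self, row, index):
        # The cell depends only on the coins (a, b) and the index: non-adaptive.
        a, b = self.coeffs[row]
        return ((a * index + b) % self.P) % self.width

    def update(self, index, delta=1):
        for row in range(len(self.table)):
            self.table[row][self._cell(row, index)] += delta

    def query(self, index):
        # Point query: the minimum over rows never underestimates
        # when all true counts are non-negative.
        return min(self.table[row][self._cell(row, index)]
                   for row in range(len(self.table)))

cms = CountMinSketch(width=64, depth=4, seed=1)
truth = {}
for t in range(200):
    item = t % 37
    cms.update(item)
    truth[item] = truth.get(item, 0) + 1
```

Because `update` accepts negative `delta`, the same structure also handles deletions in the turnstile setting.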

### 1.5 Linear Sketch

Many data stream problems cannot be solved with just a sample. We can instead make use of data structures which include a contribution from the entire input, rather than only the items picked in the sample. For instance, consider trying to count the number of distinct objects in a stream. It is easy to see that unless almost all items are included in the sample, we cannot tell whether the unsampled items are the same or distinct. Since a streaming algorithm gets to see each item in turn, it can do better. We consider a *sketch* as a compact data structure which summarizes the stream for certain types of query. It is a linear transformation of the stream: we can imagine the stream as defining a vector, and the algorithm computes the product of a matrix with this vector.

As we know, a data stream is a sequence of data, where each item belongs to the universe. A data streaming algorithm takes a data stream as input and computes some function of the stream. Further, the algorithm has access to the input only in a streaming fashion, i.e. the algorithm cannot read the input in another order, and in most cases it can read the data only once. Depending on how items in the universe are expressed in the data stream, there are two typical models:

*Cash Register Model*: Each item in the stream is an item of the universe. Different items come in an arbitrary order.

*Turnstile Model*: In this model we have a multi-set. Every incoming item is linked with one of two special symbols to indicate the dynamic changes of the data set. The turnstile model captures most practical situations in which the dataset may change over time. The model is also known as dynamic streams.

We now discuss the turnstile model in streaming algorithms. In the turnstile model, the stream consists of a sequence of updates where each update either inserts an element or deletes one, but a deletion cannot delete an element that does not exist. When there are duplicates, this means that the multiplicity of any element cannot go negative.

In the model there is a vector $x \in \mathbb{R}^n$ that starts as the all-zero vector, and then a sequence of updates arrives. Each update is of the form $(i, \Delta)$, where $\Delta \in \mathbb{R}$ and $i \in \{1, \ldots, n\}$. This corresponds to the operation $x_i \leftarrow x_i + \Delta$.

Given a function *f*, we want to approximate *f*(*x*). For example, in the distinct elements problem $\Delta$ is always 1 and $f(x) = |\{i : x_i \ne 0\}|$. The well-known approach for designing turnstile algorithms is **linear sketching**. The idea is to maintain in memory $y = \Pi x$, where $\Pi \in \mathbb{R}^{m \times n}$ is a matrix that is short and fat. That is, $m < n$, and typically *m* is much smaller than *n*. We can see that *y* is *m*-dimensional, so we can store it efficiently, but if we need to store the whole of $\Pi$ in memory then we will not get a space-wise better algorithm. Hence, there are two options in creating and storing $\Pi$. Either $\Pi$ is deterministic, so we can easily compute $\Pi_{ij}$ without keeping the whole matrix in memory; or $\Pi$ is defined by *k*-wise independent hash functions for some small *k*, so we can afford storing the hash functions and computing $\Pi_{ij}$.

Let $\Pi^i$ be the *i*th column of the matrix $\Pi$. Then $\Pi x = \sum_{i=1}^{n} \Pi^i x_i$. So, by storing $y = \Pi x$, when the update $(i, \Delta)$ occurs the new *y* equals $\Pi(x + \Delta e_i) = \Pi x + \Delta \Pi^i$, where $e_i$ is the *i*th standard basis vector. The first summand is the old *y* and the second summand is simply a multiple of the *i*th column of $\Pi$. This is how updates take place when we have a linear sketch.
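The update rule $y \leftarrow y + \Delta \Pi^i$ can be checked on a small example. In this sketch the matrix entries are $\pm 1$ values recomputed on demand from a seed by the hypothetical helper `pi_entry`, so the matrix is never stored. Seeding Python's `random.Random` per entry is a stand-in chosen for simplicity; it is not a *k*-wise independent hash family.

```python
import random

def pi_entry(row, col, seed=0):
    """Entry of the sketch matrix in {-1, +1}, recomputed on demand from a seed."""
    return 1 if random.Random(f"{seed}:{row}:{col}").random() < 0.5 else -1

class LinearSketch:
    def __init__(self, m, seed=0):
        self.m = m
        self.seed = seed
        self.y = [0] * m  # y = Pi x starts at zero because x starts at zero

    def update(self, i, delta):
        # x_i <- x_i + delta  corresponds to  y <- y + delta * (i-th column of Pi)
        for row in range(self.m):
            self.y[row] += delta * pi_entry(row, i, self.seed)

# A short turnstile stream of (index, delta) updates.
updates = [(3, 5), (7, 2), (3, -4), (0, 1), (7, -2)]
n, m = 10, 4
sketch = LinearSketch(m)
x = [0] * n
for i, delta in updates:
    sketch.update(i, delta)
    x[i] += delta

# Direct computation of Pi x from the final vector, for comparison.
direct = [sum(pi_entry(r, j) * x[j] for j in range(n)) for r in range(m)]
```

The sketch maintained update by update agrees with the product computed from the final vector, which is exactly what linearity guarantees.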

Now let us consider the Moment Estimation Problem (Alon et al. 1999). The problem of estimating (frequency) moments of a data stream has attracted a lot of attention since the inception of streaming algorithms. Let $F_p = \|x\|_p^p = \sum_{i=1}^{n} |x_i|^p$. We want to understand the space needed to solve the moment estimation problem as *p* changes. There is a transition point in the complexity of $F_p$ at $p = 2$. For $0 \le p \le 2$, $\mathrm{poly}(\log n / \varepsilon)$ space is achievable for a $(1 + \varepsilon)$-approximation with success probability $2/3$ (Alon et al. 1999; Indyk 2006). For $p > 2$ we need exactly $\Theta\left(n^{1 - 2/p}\, \mathrm{poly}(\log n / \varepsilon)\right)$ bits of space for a $(1 + \varepsilon)$-approximation with success probability $2/3$ (Bar-Yossef et al. 2004; Indyk and Woodruff 2005).

### 1.6 Alon–Matias–Szegedy Sketch

Streaming algorithms aim to summarize a large volume of data into a compact summary, by maintaining a data structure that can be incrementally modified as updates are observed. They allow the approximation of particular quantities. Alon–Matias–Szegedy (AMS) sketches (Alon et al. 1999) are randomized summaries of the data that can be used to compute aggregates such as the second frequency moment and sizes of joins. AMS sketches can be viewed as random projections of the data in the frequency domain on ±1 pseudo-random vectors. The key property of AMS sketches is that the product of projections on the same random vector of frequencies of the join attribute of two relations is an unbiased estimate of the size of join of the relations. While a single AMS sketch is inaccurate, multiple such sketches can be computed and combined using averages and medians to obtain an estimate of any desired precision.

In particular, the AMS Sketch is focused on approximating the sum of squared entries of a vector defined by a stream of updates. This quantity is naturally related to the Euclidean norm of the vector, and so has many applications in high-dimensional geometry, and in data mining and machine learning settings that use vector representations of data. The data structure maintains a linear projection of the stream with a number of randomly chosen vectors. These random vectors are defined implicitly by simple hash functions, and so do not have to be stored explicitly. Varying the size of the sketch changes the accuracy guarantees on the resulting estimation. The fact that the summary is a linear projection means that it can be updated flexibly, and sketches can be combined by addition or subtraction, yielding sketches corresponding to the addition and subtraction of the underlying vectors.

A common feature of the Count-Min and AMS sketch algorithms is that they rely on hash functions on item identifiers, which are relatively easy to implement and fast to compute.

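**Definition 1.2** A family $H$ of hash functions $h: [n] \to [m]$ is a *k*-wise independent hash family if for all distinct $i_1, \ldots, i_k \in [n]$ and all $j_1, \ldots, j_k \in [m]$,

$$\Pr_{h \in H}\left[h(i_1) = j_1 \wedge \cdots \wedge h(i_k) = j_k\right] = \frac{1}{m^k}.$$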

There are two versions of the AMS algorithm. The faster version, based on hashing, is also referred to as fast AMS to distinguish it from the original “slower” sketch, since each update is very fast.

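**Algorithm**:

1. Pick a random hash function $\sigma: [n] \to \{-1, +1\}$ from a four-wise independent family.

2. Let $\sigma_i = \sigma(i)$.

3. Let $y = \langle \sigma, x \rangle = \sum_{i=1}^{n} \sigma_i x_i$, output $y^2$.

4. $y^2$ is an unbiased estimator of $\|x\|_2^2$ with variance big-Oh of the square of its expectation.

5. Sample $y^2$ $m_1 = O(1/\varepsilon^2)$ independent times: $\{y_1^2, \ldots, y_{m_1}^2\}$. Use Chebyshev’s inequality to obtain a $(1 \pm \varepsilon)$ approximation with $2/3$ probability.

6. Let $\bar{y} = \frac{1}{m_1} \sum_{i=1}^{m_1} y_i^2$.

7. Sample $\bar{y}$ $m_2 = O(\log(1/\delta))$ independent times: $\{\bar{y}_1, \ldots, \bar{y}_{m_2}\}$. Take the median to get a $(1 \pm \varepsilon)$-approximation with probability $1 - \delta$.

Each of the hash functions takes $O(\log n)$ bits to store, and there are $O\left(\frac{1}{\varepsilon^2} \log \frac{1}{\delta}\right)$ hash functions in total.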

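**Lemma 1.1** $E[y^2] = \|x\|_2^2$.

*Proof* $E[y^2] = E\left[\left(\sum_i \sigma_i x_i\right)^2\right] = \sum_i x_i^2 + \sum_{i \ne j} x_i x_j E[\sigma_i \sigma_j] = \|x\|_2^2$, where $E[\sigma_i \sigma_j] = 0$ for $i \ne j$ since pair-wise independence.

**Lemma 1.2** $E\left[(y^2 - E[y^2])^2\right] \le 2 \|x\|_2^4$.

*Proof*

$$E\left[(y^2 - E[y^2])^2\right] = E\left[\Big(2 \sum_{i < j} \sigma_i \sigma_j x_i x_j\Big)^2\right] = 4 \sum_{i < j} x_i^2 x_j^2 \le 2 \|x\|_2^4,$$

where all cross terms vanish: for distinct indices, $E[\sigma_i^2 \sigma_j \sigma_k] = E[\sigma_j \sigma_k] = 0$ since pair-wise independence, and $E[\sigma_i \sigma_j \sigma_k \sigma_l] = 0$ since four-wise independence.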
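The steps of the algorithm can be sketched as follows. This is a minimal illustration, not a tuned implementation: the sign hash is taken as the low bit of a random degree-3 polynomial modulo the prime $2^{61} - 1$, a common way to obtain (approximately) four-wise independent signs, and the default group sizes `means=200` and `medians=5` are arbitrary choices rather than values derived from $\varepsilon$ and $\delta$.

```python
import random
import statistics

P = 2**61 - 1  # prime modulus for the polynomial hash

class AMSSketch:
    def __init__(self, means=200, medians=5, seed=0):
        rng = random.Random(seed)
        self.means = means
        # One random degree-3 polynomial (four coefficients) per counter.
        self.polys = [[rng.randrange(P) for _ in range(4)]
                      for _ in range(means * medians)]
        self.y = [0] * (means * medians)

    def _sign(self, poly, i):
        a, b, c, d = poly
        v = (((a * i + b) * i + c) * i + d) % P
        return 1 if v & 1 else -1

    def update(self, i, delta=1):
        # y <- y + delta * sigma(i), independently for every counter.
        for t, poly in enumerate(self.polys):
            self.y[t] += delta * self._sign(poly, i)

    def estimate(self):
        # Average y^2 within each group, then take the median of the group means.
        squares = [v * v for v in self.y]
        groups = [squares[g:g + self.means]
                  for g in range(0, len(squares), self.means)]
        return statistics.median(sum(g) / len(g) for g in groups)

sketch = AMSSketch()
for i in range(1, 51):
    sketch.update(i, i % 5 + 1)  # item i receives count (i mod 5) + 1
true_f2 = sum((i % 5 + 1) ** 2 for i in range(1, 51))
approx_f2 = sketch.estimate()
```

Since each counter is a linear function of the frequency vector, deletions are handled by passing a negative `delta`, and two sketches built with the same seed can be added entry by entry.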

In the next section we will present an idealized algorithm with infinite precision, given by Indyk (Indyk 2006). Though the sampling-based algorithms are simple, they cannot be employed for turnstile streams, and we need to develop other techniques.

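Let us call a distribution $D$ over $\mathbb{R}$ *p*-stable if for $z_1, \ldots, z_n$ drawn independently from this distribution and for all $x \in \mathbb{R}^n$ we have that $\sum_{i=1}^{n} z_i x_i$ is a random variable with distribution $\|x\|_p D$. Examples of such distributions are the Gaussians for $p = 2$ and, for $p = 1$, the Cauchy distribution, which has probability density function $\mathrm{pdf}(x) = \frac{1}{\pi (1 + x^2)}$.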

From probability theory, we know that the central limit theorem establishes that, in some situations, when independent random variables are added, their properly normalized sum tends toward a normal distribution even if the original variables themselves are not normally distributed. Hence, by the central limit theorem, an average of *d* samples from a distribution with finite variance approaches a Gaussian as *d* goes to infinity. This does not contradict the existence of *p*-stable distributions for $p < 2$, since those have infinite variance.

### 1.7 Indyk’s Algorithm

Indyk’s algorithm is one of the oldest algorithms that works on data streams. The main drawback of this algorithm is that it is a two pass algorithm, i.e., it requires two linear scans of the data which leads to high running time.

Let the *i*th row of $\Pi$ be $z_i$, as before, where $z_i$ comes from a *p*-stable distribution. Then consider $y_i = \sum_{j=1}^{n} z_{ij} x_j$. When a query arrives, output the median of all the $|y_i|$. Without loss of generality, let us suppose a *p*-stable distribution has median equal to 1, which in fact means that for *z* from this distribution $P(-1 \le z \le 1) = \frac{1}{2}$.

Let $\Pi = \{\pi_{ij}\}$ be an $m \times n$ matrix where every element $\pi_{ij}$ is sampled from a *p*-stable distribution, $D_p$. Given $x \in \mathbb{R}^n$, Indyk’s algorithm (Indyk 2006) estimates the *p*-norm of *x* as $\|x\|_p \approx \mathrm{median}_{i = 1, \ldots, m} |y_i|$, where $y = \Pi x$. In a turnstile streaming model, each element in the stream reflects an update to an entry in *x*. Whereas a naive algorithm would maintain *x* in memory and calculate $\|x\|_p$ at the end, hence needing $\Theta(n)$ space, Indyk’s algorithm stores *y* and $\Pi$. Combined with a space-efficient way to produce $\Pi$, we attain superior space complexity.
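For $p = 1$ the estimator can be sketched with Cauchy samples, since the median of the absolute value of a standard Cauchy variable is $\tan(\pi/4) = 1$. The version below is the idealized algorithm: it stores the whole $m \times n$ matrix explicitly, which is precisely the cost that the derandomization discussed later removes. The class name `CauchySketch` and the default `m=400` are illustrative assumptions.

```python
import math
import random
import statistics

class CauchySketch:
    """Idealized Indyk sketch for the 1-norm, with an explicitly stored matrix."""

    def __init__(self, n, m=400, seed=0):
        rng = random.Random(seed)
        # Standard Cauchy samples via the inverse CDF: tan(pi * (u - 1/2)).
        self.pi = [[math.tan(math.pi * (rng.random() - 0.5)) for _ in range(n)]
                   for _ in range(m)]
        self.y = [0.0] * m

    def update(self, i, delta):
        # Turnstile update (i, delta): y <- y + delta * (i-th column of Pi).
        for r, row in enumerate(self.pi):
            self.y[r] += delta * row[i]

    def estimate(self):
        # The median of |y_i| estimates the 1-norm of x.
        return statistics.median(abs(v) for v in self.y)

n = 20
x = [(-1) ** i * (i + 1) for i in range(n)]  # entries 1, -2, 3, ..., -20
sketch = CauchySketch(n)
for i, value in enumerate(x):
    sketch.update(i, value)
true_norm = sum(abs(v) for v in x)
approx_norm = sketch.estimate()
```

Increasing `m` tightens the estimate, in line with the variance bound of order $1/m$ on the indicator averages.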

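Let us suppose $\Pi$ is generated with $D_p$ such that if $Q \sim D_p$ then $\mathrm{median}(|Q|) = 1$. So, we assume the probability mass of $D_p$ assigned to the interval $[-1, 1]$ is 1/2. Moreover, let $I_{[a,b]}(x)$ be an indicator function defined as

$$I_{[a,b]}(x) = \begin{cases} 1 & x \in [a, b], \\ 0 & \text{otherwise.} \end{cases}$$

Let $Q_i$ be the *i*th row of $\Pi$. We have

$$y_i = \sum_{j=1}^{n} Q_{ij} x_j \sim \|x\|_p D_p, \tag{1.1}$$

which follows from the definition of *p*-stable distributions and noting that the $Q_{ij}$’s are sampled from $D_p$. This implies

$$E\left[I_{[-1,1]}\left(\frac{y_i}{\|x\|_p}\right)\right] = \frac{1}{2}, \tag{1.2}$$

since $y_i / \|x\|_p \sim D_p$.

Moreover, it is possible to show that

$$E\left[I_{[-1-\varepsilon,\, 1+\varepsilon]}\left(\frac{y_i}{\|x\|_p}\right)\right] = \frac{1}{2} + \Theta(\varepsilon), \tag{1.3}$$

$$E\left[I_{[-1+\varepsilon,\, 1-\varepsilon]}\left(\frac{y_i}{\|x\|_p}\right)\right] = \frac{1}{2} - \Theta(\varepsilon). \tag{1.4}$$

Next, consider the following quantities:

$$C_1 = \frac{1}{m} \sum_{i=1}^{m} I_{[-1-\varepsilon,\, 1+\varepsilon]}\left(\frac{y_i}{\|x\|_p}\right), \tag{1.5}$$

$$C_2 = \frac{1}{m} \sum_{i=1}^{m} I_{[-1+\varepsilon,\, 1-\varepsilon]}\left(\frac{y_i}{\|x\|_p}\right). \tag{1.6}$$

$C_1$ represents the fraction of $y_i$’s that satisfy $|y_i| \le (1 + \varepsilon) \|x\|_p$, and likewise, $C_2$ represents the fraction of $y_i$’s that satisfy $|y_i| \le (1 - \varepsilon) \|x\|_p$. Using linearity of expectation, we have $E[C_1] = \frac{1}{2} + \Theta(\varepsilon)$ and $E[C_2] = \frac{1}{2} - \Theta(\varepsilon)$. Therefore, in expectation, the median of $|y_i|$ lies in $[(1 - \varepsilon) \|x\|_p, (1 + \varepsilon) \|x\|_p]$ as desired.

The next step is to analyze the variance of $C_1$ and $C_2$. We have

$$\mathrm{Var}(C_1) = \frac{1}{m^2} \times m \times \mathrm{Var}\left(I_{[-1-\varepsilon,\, 1+\varepsilon]}\left(\frac{y_i}{\|x\|_p}\right)\right). \tag{1.7}$$

Since the variance of any indicator variable is not more than 1, $\mathrm{Var}(C_1) \le \frac{1}{m}$. Likewise, $\mathrm{Var}(C_2) \le \frac{1}{m}$. With an appropriate choice of *m* we can now trust that the median of $|y_i|$ is in the desired $\varepsilon$-range of $\|x\|_p$ with high probability.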

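Hence, Indyk’s algorithm works, but independently producing and storing all *mn* elements of $\Pi$ is expensive. To invoke the definition of *p*-stable distributions in Eq. (1.1), we need the entries in each row to be independent from one another. The rows need to be pairwise independent for the calculation of variance to hold.

Let us assume $w_i = \sum_{j=1}^{n} Q_{ij} x_j$, where the $Q_{ij}$’s are *k*-wise independent *p*-stable distribution samples, and suppose that

$$E\left[I_{[a,b]}\left(\frac{w_i}{\|x\|_p}\right)\right] \approx_{\varepsilon} E\left[I_{[a,b]}\left(\frac{y_i}{\|x\|_p}\right)\right]. \tag{1.8}$$

If we can make this claim, then we can use *k*-wise independent samples in each row instead of fully independent samples to invoke the same arguments in the analysis above. This has been shown for $k = \Omega(1/\varepsilon^p)$ (Kane et al. 2010). With this technique, we can specify $\Pi$ using only $O(k \lg n)$ bits; across rows, we only need to use a 2-wise independent hash function that maps a row index to an $O(k \lg n)$ bit seed for the *k*-wise independent hash function.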

Indyk’s approach for the $\ell_p$ norm is based on the property of the median. It is possible to construct estimators based on other quantiles, and they may even outperform the median estimator in terms of estimation accuracy. However, since the improvement is marginal for our parameter settings, we stick to the median estimator.

### 1.8 Branching Program

A *branching program* is built on a directed acyclic graph and works by starting at a source vertex, testing the values of the variables that each vertex is labeled with, and following the appropriate edge until a sink is reached; it accepts or rejects based on the identity of the sink. Here we picture the program as a grid of vertices representing its states. The program starts at a source vertex which is not part of the grid. At each step, the program reads *S* bits of input, reflecting the fact that space is bounded by *S*, and makes a decision about which vertex in the subsequent column of the grid to jump to. After *R* steps, the last vertex visited by the program represents the outcome. The entire input, which can be represented as a length-*RS* bit string, induces a distribution over the final states. Here we wish to generate the input string using fewer ($\ll RS$) random bits such that the original distribution over final states is well preserved. The following theorem addresses this idea.

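**Theorem 1.3** (Nisan 1992) There exists $h: \{0,1\}^t \to \{0,1\}^{RS}$ for $t = O(S \lg R)$ such that

$$\left| P_{x \sim U(\{0,1\}^{RS})}\left[f(B(x)) = 1\right] - P_{y \sim U(\{0,1\}^{t})}\left[f(B(h(y))) = 1\right] \right| \le \frac{1}{2^S} \tag{1.9}$$

for any branching program *B* and any function $f: \{0,1\}^S \to \{0,1\}$.

The function *h* can simulate the input to the branching program with only *t* random bits such that it is almost impossible to distinguish the outcome of the simulated program from that of the original program.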

To construct *h*, take a random sample *x* from $\{0,1\}^S$ and place *x* at the root. Repeat the following procedure to create a complete binary tree. At each vertex, create two children and copy the string over to the left child. For the right child, use a random 2-wise independent hash function $h_j: [2^S] \to [2^S]$ chosen for the corresponding level of the tree and record the result of the hash. Once we reach $\lg R$ levels, output the concatenation of all leaves, which is a length-*RS* bit string. Since each hash function requires $O(S)$ random bits and there are $\lg R$ levels in the tree, this function uses $O(S \lg R)$ bits total.
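The tree construction can be written as a short loop that doubles the list of blocks once per level. In this sketch each block is an integer of `S` bits, and the per-level hash is $((a \cdot x + b) \bmod P) \bmod 2^S$, used here as a simple stand-in for a 2-wise independent hash on *S*-bit strings; the function names `make_hash` and `nisan_generator` are illustrative.

```python
import random

P = 2**61 - 1  # prime larger than 2**S for the values of S used here

def make_hash(rng, s_bits):
    """Draw one hash function on S-bit blocks (stand-in for a 2-wise independent one)."""
    a, b = rng.randrange(1, P), rng.randrange(P)
    return lambda x: ((a * x + b) % P) % (1 << s_bits)

def nisan_generator(seed_block, hashes):
    """Expand one S-bit block into 2**len(hashes) blocks, in left-to-right leaf order."""
    blocks = [seed_block]
    for h in hashes:  # one hash function per level, from the root downwards
        # The left child copies the block, the right child applies the level's hash.
        blocks = [child for b in blocks for child in (b, h(b))]
    return blocks

rng = random.Random(0)
S, levels = 16, 3  # R = 2**levels = 8 output blocks of S bits each
hashes = [make_hash(rng, S) for _ in range(levels)]
x = rng.randrange(1 << S)
out = nisan_generator(x, hashes)
```

The seed consists of the root block plus one hash description per level, so its length grows with the number of levels, $\lg R$, rather than with the output length $RS$.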

One way to simulate randomized computations with deterministic ones is to build a pseudorandom generator, namely, an efficiently computable function *g* that can stretch a short uniformly random seed of *s* bits into *n* bits that cannot be distinguished from uniform ones by small space machines. Once we have such a generator, we can obtain a deterministic computation by carrying out the computation for every fixed setting of the seed. If the seed is short enough, and the generator is efficient enough, this simulation remains efficient. We will use Nisan’s pseudorandom generator (PRG) to derandomize Indyk’s algorithm. Specifically, when the column indexed by *x* is required, Nisan’s generator takes *x* as the input and, together with the original seed, outputs a sequence of pseudorandom values.

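1. Initialize $c_1 \leftarrow 0$, $c_2 \leftarrow 0$.
2. For $i = 1, \ldots, m$:
   a. Initialize $y \leftarrow 0$.
   b. For $j = 1, \ldots, n$:
      i. Update $y \leftarrow y + \Pi_{ij} x_j$.
   c. If $|y| \le (1 - \varepsilon)\|x\|_p$, then increment $c_1$.
   d. If $|y| \le (1 + \varepsilon)\|x\|_p$, then increment $c_2$.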

This procedure uses $O(\log n)$ bits and is a branching program that imitates the proof of correctness for Indyk’s algorithm. The algorithm succeeds if and only if at the end of the computation $c_1 < m/2$ and $c_2 > m/2$. The only source of randomness in this program are the $\Pi_{ij}$’s. We will apply Nisan’s PRG to generate these random numbers. We invoke Theorem 1.3 with the algorithm given above as *B* and an indicator function checking whether the algorithm succeeded or not as *f*. See that the space bound is $S = O(\log n)$ and the number of steps taken by the program is $R = O(mn)$, or $O(n^2)$ since $m \le n$. This means we can fool the proof of correctness of Indyk’s algorithm by using $O(\log^2 n)$ random bits to produce $\Pi$.

Indyk’s algorithm uses *p*-stable distributions, which only exist for $p \in (0, 2]$. We shall now consider the case $p > 2$.

**Theorem 1.4** $n^{1-2/p}\,\mathrm{poly}(\log n / \varepsilon)$ space is necessary and sufficient.

Details of the nearly optimal lower bound are discussed in (Bar-Yossef et al. 2004) and (Indyk and Woodruff 2005).

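In this section we discuss the algorithm of Andoni (Andoni 2012), which is based on (Andoni et al. 2011; Jowhari et al. 2011). We will focus on $\varepsilon = \Theta(1)$. In this algorithm, we let $\Pi = PD$. *P* is an $m \times n$ matrix, where each column has a single non-zero element that is either 1 or $-1$. *D* is an $n \times n$ diagonal matrix with $D_{ii} = u_i^{-1/p}$, where $u_i \sim \mathrm{Exp}(1)$.

That is to say, $\Pr[u_i > t] = e^{-t}$ for $t \ge 0$. So, same as the $0 < p \le 2$ case, we will keep $y = \Pi x$, but we estimate $\|x\|_p$ with

$$\|y\|_\infty = \max_i |y_i|. \tag{1.10}$$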

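**Theorem 1.5** $\Pr\left[\tfrac{1}{4}\|x\|_p \le \|y\|_\infty \le 4\|x\|_p\right] \ge 11/20$ for $m = \Theta(n^{1-2/p} \log n)$.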

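Let $z = Dx$, which means $y = Pz$. To prove Theorem 1.5 we will begin by showing that $\|z\|_\infty$ delivers a good estimate and then prove that applying *P* to *z* maintains it.

**Claim** $\Pr\left[\tfrac{1}{2}\|x\|_p \le \|z\|_\infty \le 2\|x\|_p\right] \ge 3/4$.

*Proof*

Let $q = \min\left\{ u_1/|x_1|^p, \ldots, u_n/|x_n|^p \right\}$, so that $\|z\|_\infty = q^{-1/p}$. We have

$$\Pr[q > t] = \Pr\left[\forall i,\ u_i > t|x_i|^p\right] \tag{1.11}$$

$$= \prod_{i=1}^{n} e^{-t|x_i|^p} \tag{1.12}$$

$$= e^{-t\|x\|_p^p}, \tag{1.13}$$

which implies that $q$ is exponentially distributed with rate $\|x\|_p^p$. Thus,

$$\Pr\left[\tfrac{1}{2}\|x\|_p \le \|z\|_\infty \le 2\|x\|_p\right] = \Pr\left[2^{-p}\|x\|_p^{-p} \le q \le 2^{p}\|x\|_p^{-p}\right] \tag{1.14}$$

$$= e^{-2^{-p}} - e^{-2^{p}} \tag{1.15}$$

$$\ge \frac{3}{4} \tag{1.16}$$

for $p > 2$.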

The following claim establishes that if we could maintain *Q* instead of *y* then we would have a better solution to our problem (here *Q* denotes the scaled vector *Dx*, written *z* above). However, we cannot store *Q* in memory because it is *n*-dimensional and $n \gg m$. Thus we need to analyze $PQ \in \mathbb{R}^m$.

**Claim**

Let . Then

Let us suppose each entry in *y* is a sort of counter, and the matrix *P* takes each entry in *Q*, hashes it to a random counter, and adds that entry of *Q* times a random sign to the counter. There will be collisions because $n > m$ and there are only *m* counters. These will cause different $Q_i$ to potentially cancel each other out or add together in a way that one might expect to cause problems. We shall show that there are very few large $Q_i$’s.

Interestingly, small $Q_i$’s and big $Q_i$’s might collide with each other. When we add the small $Q_i$’s, we multiply them with a random sign, so the expectation of the aggregate contribution of the small $Q_i$’s to each bucket is 0. We shall bound their variance as well, which will show that if they collide with big $Q_i$’s then with high probability this would not considerably change the affected counter. Ultimately, the maximal counter value (i.e., $\|y\|_\infty$) is close to the maximal $|Q_i|$ and so to $\|x\|_p$ with high probability.
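The counter view of $y = PDx$ can be tried out numerically. The following is a minimal sketch, assuming fully random hashing and signs (drawn from Python's `random` module) instead of limited-independence hash functions; the names `lp_norm` and `andoni_estimate` are illustrative, not taken from the cited papers.

```python
import random

def lp_norm(x, p):
    """Exact l_p norm, used only to compare against the estimate."""
    return sum(abs(v) ** p for v in x) ** (1.0 / p)

def andoni_estimate(x, m, p, rng):
    """Return ||y||_inf for y = P D x, an estimate of ||x||_p for p > 2."""
    y = [0.0] * m
    for xj in x:
        u = rng.expovariate(1.0)           # u_j ~ Exp(1)
        q = xj * u ** (-1.0 / p)           # entry of the scaled vector D x
        bucket = rng.randrange(m)          # h(j): row of the non-zero in column j
        sign = rng.choice((-1.0, 1.0))     # sigma(j): random sign for column j
        y[bucket] += sign * q
    return max(abs(v) for v in y)

rng = random.Random(1)
x = [float(i % 7 + 1) for i in range(300)]
print(lp_norm(x, 3), andoni_estimate(x, 256, 3, rng))
```

A single run only succeeds with constant probability, so in practice one would repeat the estimator independently and combine the results, for example by taking a median.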

### 1.8.1 Light Indices and Bernstein’s Inequality

Bernstein’s inequality is a more precise formulation of the classical Chebyshev inequality in probability theory, proposed by S.N. Bernshtein in 1911; it permits one to estimate the probability of large deviations by a monotone decreasing exponential function. In order to analyse the light indices, we will use *Bernstein’s inequality*.

**Theorem 1.6**

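(Bernstein’s inequality) Suppose $R_1, \ldots, R_n$ are independent, and for all *i*, $|R_i| \le K$, and $\mathrm{Var}\left(\sum_i R_i\right) = \sigma^2$. Then for all $t > 0$,

$$\Pr\left[\left|\sum_i R_i - \mathbb{E}\sum_i R_i\right| > t\right] \lesssim e^{-ct^2/\sigma^2} + e^{-ct/K}.$$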

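We show that the light indices together will not distort the heavy indices. Let us parametrize *P* as follows: choose a function $h : [n] \to [m]$ as well as a function $\sigma : [n] \to \{-1, 1\}$. Then,

$$P_{ij} = \begin{cases} \sigma(j) & \text{if } h(j) = i, \\ 0 & \text{otherwise.} \end{cases}$$

Therefore, *h* states which element of column *j* to make non-zero, and $\sigma$ states which sign to use for column *j*.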

The following light indices claim holds: with constant probability, for all $i \in [m]$,

$$\left|\sum_{j:\, h(j) = i,\ j \text{ light}} \sigma(j) Q_j\right| < T/10.$$

**Claim**

If $y_i$ has no heavy indices then the magnitude of $y_i$ is much less than *T*, so it does not interfere with the estimate. If $y_i$ is assigned the maximal $Q_j$, then by the previous claim that is the only heavy index assigned to $y_i$. Therefore, all the light indices assigned to $y_i$ would not change it by more than *T* / 10, and since $|Q_j|$ is within a factor of 2 of *T*, $y_i$ will still be within a constant multiplicative factor of *T*. If $y_i$ is assigned some other heavy index, then the corresponding $|Q_j|$ is less than 2*T* since it is less than the maximal $|Q_j|$. This claim concludes that $|y_i|$ will be at most 2.1*T*.

Ultimately:

$$y_i = \sum_{j:\, h(j) = i,\ j \text{ light}} \sigma(j) Q_j + \sigma(j_{\text{heavy}})\, Q_{j_{\text{heavy}}},$$

where the second term is added only if $y_i$ has a heavy index. By the triangle inequality,

$$|y_i| \in |Q_{j_{\text{heavy}}}| \pm \left|\sum_{j:\, h(j) = i,\ j \text{ light}} \sigma(j) Q_j\right| = |Q_{j_{\text{heavy}}}| \pm T/10.$$

Applying this to the bucket containing the maximal $|Q_j|$ shows that this bucket of *y* should hold at least 0.4*T*. Furthermore, by a similar argument all other buckets should hold at most 2.1*T*.

*Proof*

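Fix $i \in [m]$. Then for each light index *j*, define

$$\delta_j = \begin{cases} 1 & \text{if } h(j) = i, \\ 0 & \text{otherwise.} \end{cases}$$

Then

$$\sum_{j:\, h(j) = i,\ j \text{ light}} \sigma(j) Q_j = \sum_{j \text{ light}} \delta_j\, \sigma(j)\, Q_j.$$

We will call the *j*th term of the summand $R_j$ and then use Bernstein’s inequality.

1. We have $\mathbb{E}\left[\sum_j R_j\right] = 0$, since the $\sigma(j)$ represent random signs.

2. We also have $|R_j| \le |Q_j|$ since $\delta_j \le 1$ and $|\sigma(j)| \le 1$, and we iterate over light indices so $|Q_j|$ stays below the threshold that defines a light index; this threshold serves as $K$.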

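It remains only to compute $\sigma^2 = \mathrm{Var}\left(\sum_j R_j\right)$. If we condition on *Q*, then it implies that

$$\mathrm{Var}\left(\sum_j R_j \,\middle|\, Q\right) \le \frac{\|Q\|_2^2}{m}.$$

We need to take the randomness of *Q* into account. We will merely prove that $\sigma^2$ is small with high probability over the choice of *Q*. We will do this by computing the unconditional expectation of $\sigma^2$ and then using Markov’s inequality. Now

$$\mathbb{E}\|Q\|_2^2 = \sum_j x_j^2\, \mathbb{E}\left[u_j^{-2/p}\right]$$

and

$$\mathbb{E}\left[u_j^{-2/p}\right] = \int_0^\infty t^{-2/p} e^{-t}\, dt = \int_0^1 t^{-2/p} e^{-t}\, dt + \int_1^\infty t^{-2/p} e^{-t}\, dt.$$

The second integral trivially converges, and the former one converges because $p > 2$. This gives that $\mathbb{E}\|Q\|_2^2 = O(\|x\|_2^2)$, which gives that with high probability we will have $\sigma^2 \le O(\|x\|_2^2)/m$. To use Bernstein’s inequality, we will relate this bound on $\sigma^2$, which is given in terms of $\|x\|_2$, to a bound in terms of $\|x\|_p$, by using an argument based on Hölder’s inequality.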

**Theorem 1.7**

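(Hölder’s inequality) Let $f, g \in \mathbb{R}^n$. Then

$$\sum_i |f_i g_i| \le \|f\|_a \|g\|_b$$

for any $1 \le a, b \le \infty$ satisfying $1/a + 1/b = 1$.

Here $f_i = x_i^2$, $g_i = 1$, $a = p/2$, $b = 1/(1 - 2/p)$ gives

$$\|x\|_2^2 \le \|x\|_p^2 \cdot n^{1-2/p}.$$

Using the fact that we chose *m* to be $\Theta(n^{1-2/p} \log n)$, we can then obtain the following bound on $\sigma^2$ with high probability:

$$\sigma^2 \le O\!\left(\frac{\|x\|_2^2}{m}\right) \le O\!\left(\frac{\|x\|_p^2\, n^{1-2/p}}{m}\right) = O\!\left(\frac{\|x\|_p^2}{\log n}\right).$$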

Now let us use Bernstein’s inequality to prove the required result.

So the probability that the noise exceeds *T* / 10 can be made $1/\mathrm{poly}(n)$. But there are at most *n* buckets, which means that a union bound gives us that with at least constant probability all of the light index contributions are at most *T* / 10.

Distinct-element counting is used in SQL to efficiently count distinct entries in some column of a data table. It is also used in network anomaly detection, for example to track the rate at which a worm is spreading: you run distinct elements on a router to count how many distinct entities are sending packets with the worm signature through your router.

For more general moment estimation, there are other motivating examples as well. Imagine $x_i$ is the number of packets sent to IP address *i*. Estimating $\|x\|_\infty$ would give an approximation to the highest load experienced by any server. As elaborated earlier, $\|x\|_\infty$ is difficult to approximate in small space, so in practice we settle for the closest possible norm to the $\infty$-norm, which is the 2-norm.

### 1.9 Heavy Hitters Problem

Data stream algorithms have become an indispensable tool for analysing massive data sets. Such algorithms aim to process huge streams of updates in a single pass and store a compact summary from which properties of the input can be discovered, with strong guarantees on the quality of the result. This approach has found many applications in large scale data processing and data warehousing, as well as in other areas, such as network measurements, sensor networks and compressed sensing. One high-level application example is computing popular products. For example, *A* could be all of the page views of products on amazon.com yesterday. The heavy hitters are then the most frequently viewed products.

Given a stream of items with weights attached, find those items with the greatest total weight. This is an intuitive problem, which relates to several natural questions: given a stream of search engine queries, which are the most frequently occurring terms? Given a stream of supermarket transactions and prices, which items have the highest total euro sales? Further, this simple question turns out to be a core subproblem of many more complex computations over data streams, such as estimating the entropy, and clustering geometric data. Therefore, it is of high importance to design efficient algorithms for this problem, and understand the performance of existing ones.

If *A* is readily available in main memory, the problem can be solved efficiently: simply sort the array and do a linear scan over the result, outputting a value if and only if it occurs at least *n* / *k* times. But what about solving the Heavy Hitters problem with a single pass over the array?
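The in-memory baseline (sort, then one linear scan) can be sketched as follows. The function name `heavy_hitters_sorted` and the sample array are illustrative assumptions.

```python
def heavy_hitters_sorted(A, k):
    """Return the values that occur at least len(A)/k times in A."""
    n = len(A)
    B = sorted(A)
    out = []
    i = 0
    while i < n:
        j = i
        while j < n and B[j] == B[i]:
            j += 1                     # advance to the end of the run of equal values
        if (j - i) * k >= n:           # run length >= n/k, without floating point
            out.append(B[i])
        i = j
    return out

print(heavy_hitters_sorted([3, 1, 3, 2, 3, 1, 3, 5], 4))  # -> [1, 3]
```

This takes $O(n \log n)$ time and needs the whole array in memory, which is exactly what the streaming algorithms in this section try to avoid.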

In Point Query, we are given some $x \in \mathbb{R}^n$ updated in a turnstile model, with *n* large. Suppose that *x* has a coordinate for each string your search engine could see and $x_i$ is the number of times we have seen string *i*. We seek a function *query*(*i*) that, for $i \in [n]$, returns a value in $x_i \pm \varepsilon\|x\|_1$.

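In Heavy Hitters, we have the same *x* but we need to compute a set $L \subseteq [n]$ such that

1. $|x_i| \ge \varepsilon\|x\|_1 \Rightarrow i \in L$;
2. $|x_i| < \frac{\varepsilon}{2}\|x\|_1 \Rightarrow i \notin L$.

If we can solve Point Query with bounded space then we can solve Heavy Hitters with bounded space as well (but without efficient run-time). To do so, we just run Point Query with $\varepsilon/10$ on each $i \in [n]$ and output the set of indices *i* for which we had large estimates of $x_i$.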

Now let us define an *incoherent matrix*.

**Definition 1.3**

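$\Pi \in \mathbb{R}^{m \times n}$ is $\varepsilon$-*incoherent* if

1. For all *i*, $\|\Pi_i\|_2 = 1$, where $\Pi_i$ denotes the *i*th column of $\Pi$.
2. For all $i \ne j$, $|\langle \Pi_i, \Pi_j \rangle| \le \varepsilon$.

We also define a related object: a *code*.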

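**Definition 1.4** An $(\varepsilon, t, q, N)$-code is a set $C = \{C_1, \ldots, C_N\} \subseteq [q]^t$ such that for all $i \ne j$, $\Delta(C_i, C_j) \ge (1 - \varepsilon)t$, where $\Delta$ indicates Hamming distance.

The key property of a code can be summarized verbally: any two distinct words in the code agree in at most $\varepsilon t$ entries.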

There is a relationship between incoherent matrices and codes.

**Claim** Existence of an $(\varepsilon, t, q, n)$-code implies existence of an $\varepsilon$-incoherent $\Pi$ with $m = qt$.

*Proof* We construct $\Pi$ from $C$. We have a column of $\Pi$ for each $C_i \in C$, and we break each column vector into *t* blocks, each of size *q*. Then, the *j*th block contains a binary string of length *q* whose *a*th bit is 1 if the *j*th element of $C_i$ is *a* and 0 otherwise. Scaling the whole matrix by $1/\sqrt{t}$ gives the desired result.

**Claim** Given an $\varepsilon$-incoherent matrix, we can create a linear sketch to solve Point Query.

**Claim** A random code with $q = O(1/\varepsilon)$ and $t = O\left(\frac{1}{\varepsilon}\log N\right)$ is an $(\varepsilon, t, q, N)$-code.
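The block construction in the proof can be checked on a small example. The sketch below swaps in a Reed–Solomon style code (evaluations of low-degree polynomials over a prime field) instead of the random code of the last claim, because its distance is easy to verify: two distinct polynomials of degree less than *d* agree on at most $d - 1$ points, so $\varepsilon = (d-1)/t$. The function names are illustrative assumptions.

```python
from itertools import product
from math import sqrt

def reed_solomon_code(q, d):
    """All evaluation vectors over F_q (q prime) of polynomials of degree < d."""
    code = []
    for coeffs in product(range(q), repeat=d):
        word = [sum(c * pow(a, e, q) for e, c in enumerate(coeffs)) % q
                for a in range(q)]          # t = q evaluation points
        code.append(word)
    return code

def incoherent_matrix(code, q):
    """One column per codeword: t blocks of size q, one-hot per block, scaled by 1/sqrt(t)."""
    t = len(code[0])
    scale = 1.0 / sqrt(t)
    columns = []
    for word in code:
        col = [0.0] * (q * t)
        for j, a in enumerate(word):
            col[j * q + a] = scale          # a-th entry of the j-th block
        columns.append(col)
    return columns

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

q, d = 5, 2
cols = incoherent_matrix(reed_solomon_code(q, d), q)
eps = (d - 1) / q
print(len(cols), len(cols[0]), eps)
```

The inner product of two columns equals the number of positions where the two codewords agree divided by *t*, which is why a large Hamming distance translates directly into small coherence.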

### 1.10 Count-Min Sketch

Next we will consider another algorithm where the objective is to know the frequency of popular items. The idea is that we can hash each incoming item several different ways, and increment a count for that item in several different places, one place for each hash. Since each array that we use is much smaller than the number of unique items that we see, it will be common for more than one item to hash to a particular location. The trick is that for any of the most common items, it is very likely that at least one of the hashed locations for that item will only have collisions with less common items. That means that the count in that location will be mostly driven by that item. The problem is how to find the cell that only has collisions with less popular items.

In other words, Count-Min (CM) sketch is a compact summary data structure capable of representing a high-dimensional vector and answering queries on this vector, in particular point queries and dot product queries, with strong accuracy guarantees. Such queries are at the core of many computations, so the structure can be used in order to answer a variety of other queries, such as frequent items (heavy hitters), quantile finding, and join size estimation (Cormode and Muthukrishnan 2005). Since the data structure can easily process updates in the form of additions or subtractions to dimensions of the vector, which may correspond to insertions or deletions, it is capable of working over streams of updates, at high rates. The data structure maintains the linear projection of the vector with a number of other random vectors. These vectors are defined implicitly by simple hash functions. Increasing the range of the hash functions increases the accuracy of the summary, and increasing the number of hash functions decreases the probability of a bad estimate. These tradeoffs are quantified precisely below. Because of this linearity, CM sketches can be scaled, added and subtracted, to produce summaries of the corresponding scaled and combined vectors.

Thus for CM, we have streams of insertions, deletions, and queries of how many times an element has appeared. If the count of every element always stays non-negative, the setting is called the strict turnstile model. For example, at a music party, you will see lots of people come in and leave, and you want to know what happens inside. But you do not want to store everything that happened inside; you want to store it more efficiently.

One application of CM might be scanning over a library's corpus. There is a huge number of URLs, and you cannot remember every URL you have seen. But you want to answer queries about how many times you saw a given URL. What we can do is store a set of counting Bloom filters. Because a URL can appear multiple times, how would you answer the query given the set of counting Bloom filters?

We can take the minimal of all hashed counters to estimate the occurrence of a particular URL. Specifically: See that the previous analysis about the overflow of counting bloom filters does work.
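The "minimum over hashed counters" estimate can be sketched as a small class. This is a hedged illustration, not a reference implementation: the hash family $x \mapsto ((ax + b) \bmod p) \bmod \text{width}$, the use of Python's built-in `hash`, and the class name `CountMinSketch` are all assumptions made for the example.

```python
import random

class CountMinSketch:
    def __init__(self, width, depth, seed=0):
        rng = random.Random(seed)
        self.width = width
        self.p = 2_147_483_647                       # prime modulus for hashing
        self.params = [(rng.randrange(1, self.p), rng.randrange(self.p))
                       for _ in range(depth)]        # one (a, b) pair per row
        self.rows = [[0] * width for _ in range(depth)]

    def _cell(self, r, item):
        a, b = self.params[r]
        return ((a * hash(item) + b) % self.p) % self.width

    def update(self, item, delta=1):
        """Add delta to the item's counter in every row (deletions use delta < 0)."""
        for r, row in enumerate(self.rows):
            row[self._cell(r, item)] += delta

    def query(self, item):
        """Minimum over the item's counters; never below the true count
        when all true counts are non-negative."""
        return min(row[self._cell(r, item)] for r, row in enumerate(self.rows))

cm = CountMinSketch(width=64, depth=4)
for url in ["a.com", "b.com", "a.com", "c.com", "a.com"]:
    cm.update(url)
print(cm.query("a.com"))   # at least 3
```

Because every row only ever adds the counts of colliding items, each counter overestimates, and taking the minimum picks the row with the least collision noise.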

Then there is a question of how accurate the query is? Let *F*(*x*) be the real count of an individual item *x*. One simple bound of accuracy can be which tells us the average error for all single hashed places with regard to the real occurrence. So we know that it is always overestimated. For the total number of items , we have (*F*(*x*) is non-negative) where, in general, , . See that you do not even need *m* to be larger than *n*. If you have a huge number of items, you can choose *m* to be very small (*m* can be millions for billions of URLs). Now we have bound of occurrence estimation for each individual *i* in expectation. However, what we really need to concern is the query result. We know that

And now if I choose , If *F*(*x*) is concentrated in a few elements, the largest is proportional to roughly with the power law distribution. So if we choose *m* to be small, then you can estimate the top URLs pretty well.

In fact, you can show a better result for CM, which is rather than having your norm depend on the 1-norm. There could be a few elements having all of occurrences. For example, several people have been visiting google.com. The top few URLs have almost all the occurrences. Then probably for a given URL, it might collide some of them in some of the time. But probably one of them is not going to collide, and probably most of them are going to collide. So one can get in terms of *l*-1 norm but in terms of *l*-1 after dropping the top *k* elements. So given billions of URLs, you can drop the top ones and get *l*-1 norm for the residual URLs.

The Count-Min sketch has found a number of applications. For example, Indyk (Indyk 2003) used the Count-Min Sketch to estimate the residual mass after removing a set of items. This supports clustering over streaming data. Sarlós et al. (Sarlós et al. 2006) gave approximate algorithms for personalized page rank computations which make use of Count-Min Sketches to compactly represent web-size graphs.

### 1.10.1 Count Sketch

One of the fundamental problems on a data stream is that of finding the most frequently occurring items in the stream. We shall assume that the stream is large enough that memory-intensive solutions, such as sorting the stream or keeping a counter for each distinct element, are infeasible, and that we can only afford to process the data by making one or more passes over it. This problem arises in the context of search engines, where the streams in question are streams of queries sent to the search engine and we are interested in finding the most frequent queries handled in some period of time. A related problem, that of finding the items whose frequency changes the most between two streams, is also interesting in the context of search engine query streams, since the queries whose frequency changes most between two consecutive time periods can indicate which topics are increasing or decreasing in popularity at the fastest rate. Charikar et al. (2002) presented a simple data structure called a count-sketch and developed a 1-pass algorithm for computing the count-sketch of a stream. Using a count sketch, one can reliably estimate the frequencies of the most common items. They also showed that the count-sketch data structure is additive, i.e. the sketches for two streams can be directly added or subtracted. Thus, given two streams, we can compute the difference of their sketches, which leads to a 2-pass algorithm for computing the items whose frequency changes the most between the streams.

The Count Sketch (Charikar et al. 2002) is basically like CM, except that when you do hashing, you also associate a random sign function with each hash function *h*.

Then the query can be defined as The error can be converted from *l*-1 norm to *l*-2 norm.

On top of that, suppose everything else is 0, then . So we will have Then if there is nothing special going on, the query result would be *F*(*x*).
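A minimal Count Sketch can be written along the same lines as a Count-Min sketch. In this sketch each row has a bucket hash and a sign hash, and the query is taken to be the median of the signed counters, following Charikar et al. (2002); the concrete hash family and the class name `CountSketch` are assumptions for illustration.

```python
import random
from statistics import median

class CountSketch:
    def __init__(self, width, depth, seed=0):
        rng = random.Random(seed)
        self.width = width
        self.p = 2_147_483_647
        # per row: (a, b) for the bucket hash, (c, d) for the sign hash
        self.params = [tuple(rng.randrange(1, self.p) for _ in range(4))
                       for _ in range(depth)]
        self.rows = [[0] * width for _ in range(depth)]

    def _bucket_and_sign(self, r, item):
        a, b, c, d = self.params[r]
        x = hash(item)
        bucket = ((a * x + b) % self.p) % self.width
        sign = 1 if ((c * x + d) % self.p) % 2 == 0 else -1
        return bucket, sign

    def update(self, item, delta=1):
        for r, row in enumerate(self.rows):
            bucket, sign = self._bucket_and_sign(r, item)
            row[bucket] += sign * delta            # signed increment

    def query(self, item):
        estimates = []
        for r, row in enumerate(self.rows):
            bucket, sign = self._bucket_and_sign(r, item)
            estimates.append(sign * row[bucket])   # undo the sign
        return median(estimates)

cs = CountSketch(width=32, depth=5)
cs.update("x", 5)
print(cs.query("x"))   # -> 5, since nothing else was inserted
```

When only one item has a non-zero count, every row returns its exact count, which matches the special case discussed above; with other items present, colliding contributions enter with random signs and tend to cancel.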

### 1.10.2 Count-Min Sketch and Heavy Hitters Problem

The Count-Min (CM) Sketch is an example of a sketch that permits a number of related quantities to be estimated with accuracy guarantees, including point queries and dot product queries. Such queries are crucial for several computations, so the structure can be used to answer a variety of other queries, such as frequent items (heavy hitters), quantile finding, join size estimation, and so on. Let us consider the CM sketch, which can be used to solve the $\varepsilon$-approximate heavy hitters (HH) problem. It has been implemented in real systems: a predecessor of the CM sketch (i.e. the count sketch) has been implemented at Google on top of its MapReduce parallel processing infrastructure. The data structure used for this is based on hashing.

**Definition 1.5**

The Count-Min (CM) sketch (Cormode and Muthukrishnan 2005)

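1. Hashing $h_1, \ldots, h_L : [n] \to [t]$.
2. Counters $F_{a,b}$ for $a \in [L]$, $b \in [t]$.
3. $F_{a,b} = \sum_{i \in [n],\, h_a(i) = b} x_i$.
4. For $\varepsilon$-point query with failure probability $\delta$, set $t = 2/\varepsilon$, $L = \log_2(1/\delta)$.

And let *query*(*i*) output $\min_{1 \le r \le L} F_{r, h_r(i)}$ (assuming “strict turnstile”, for any *i*, $x_i \ge 0$).

**Claim** $\mathit{query}(i) = x_i \pm \varepsilon\|x\|_1$ w.p. $\ge 1 - \delta$.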

*Proof*

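CM sketch

1. Fix *i*, let $Z_j = 1$ if $h_r(j) = h_r(i)$, $Z_j = 0$ otherwise. Then $F_{r, h_r(i)} = x_i + \sum_{j \ne i} x_j Z_j$; call the second term the error *E*.
2. We have $\mathbb{E}[E] = \sum_{j \ne i} |x_j|\, \mathbb{E}[Z_j] = \sum_{j \ne i} |x_j| / t \le \frac{\varepsilon}{2}\|x\|_1$.
3. $\Pr[E > \varepsilon\|x\|_1] < 1/2$ by Markov’s inequality.
4. $\Pr\left[\min_r F_{r, h_r(i)} > x_i + \varepsilon\|x\|_1\right] < 1/2^L = \delta$.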

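**Theorem 1.8** There is an $\alpha$-Heavy Hitter (strict turnstile) algorithm that succeeds w.p. $1 - \eta$.

*Proof* We can perform point query with $\varepsilon = \alpha/4$ and $\delta = \eta/n$, which uses $O\left(\frac{1}{\alpha}\log\frac{n}{\eta}\right)$ space with query time $O\left(n \log\frac{n}{\eta}\right)$.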

The query time can be improved by arranging the *n* vector elements as the leaves of a binary tree.

This tree has $\log_2 n$ levels and the weight of each vertex is the sum of the elements in its subtree. Here we can utilise a *CountMin* algorithm for each level. The procedure:

1. Run Count-Min on each level, from the root downward, with error $\varepsilon = \alpha/4$ and $\delta = \eta\alpha/(4\log n)$.
2. Move down the tree starting from the root. For each vertex, run CountMin for each of its two children. If a child is a heavy hitter, i.e. CountMin returns $\ge \frac{3\alpha}{4}\|x\|_1$, continue moving down that branch of the tree.
3. Add to *L* any leaf of the tree that you point query and that has $\mathrm{CM}(i) \ge \frac{3\alpha}{4}\|x\|_1$.
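The three steps above can be sketched with one small Count-Min sketch per tree level. Everything concrete here is an assumption for illustration: the threshold $\frac{3\alpha}{4}\|x\|_1$ and the width $8/\alpha$ follow the parameter choice $\varepsilon = \alpha/4$, the universe is $\{0, \ldots, 2^{\text{levels}} - 1\}$, $\|x\|_1$ is tracked exactly with a single counter, and the class names are hypothetical.

```python
import random

class CM:
    """Tiny Count-Min sketch over integer keys."""
    P = 2_147_483_647

    def __init__(self, width, depth, rng):
        self.width = width
        self.ab = [(rng.randrange(1, self.P), rng.randrange(self.P))
                   for _ in range(depth)]
        self.rows = [[0] * width for _ in range(depth)]

    def _cells(self, key):
        return [((a * key + b) % self.P) % self.width for a, b in self.ab]

    def update(self, key, delta):
        for row, c in zip(self.rows, self._cells(key)):
            row[c] += delta

    def query(self, key):
        return min(row[c] for row, c in zip(self.rows, self._cells(key)))

class DyadicHeavyHitters:
    def __init__(self, levels, alpha, depth=5, seed=0):
        rng = random.Random(seed)
        self.levels, self.alpha, self.total = levels, alpha, 0
        width = max(1, round(8 / alpha))          # t = 2/eps with eps = alpha/4
        self.sketches = [CM(width, depth, rng) for _ in range(levels)]

    def update(self, i, delta=1):
        self.total += delta                        # ||x||_1 in the strict turnstile model
        for lev in range(1, self.levels + 1):
            # the ancestor of leaf i at level lev is its lev-bit prefix
            self.sketches[lev - 1].update(i >> (self.levels - lev), delta)

    def heavy_hitters(self):
        thresh = 0.75 * self.alpha * self.total
        frontier = [0]                             # the root
        for lev in range(1, self.levels + 1):
            sk = self.sketches[lev - 1]
            frontier = [c for v in frontier for c in (2 * v, 2 * v + 1)
                        if sk.query(c) >= thresh]  # keep only heavy children
        return frontier

hh = DyadicHeavyHitters(levels=4, alpha=0.25)
for i in range(16):
    hh.update(i)
hh.update(3, 50)
hh.update(9, 30)
print(hh.heavy_hitters())
```

Since Count-Min never underestimates non-negative counts, every ancestor of a true heavy hitter passes the threshold, so true heavy hitters are always reported; false positives remain possible.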

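The $\ell_1$ norm will be the same at every level since the weight of a parent vertex is exactly the sum of its children vertices. Next, if vertex *u* contains a heavy hitter amongst the leaves in its subtree, then *u* is a heavy hitter at its level. There are at most $2/\alpha$ vertices at any given level which are $\alpha/2$-heavy hitters at that level. This means that if all point queries are correct, we only touch at most $(2/\alpha)\log n$ vertices during the best-first search. For each $\mathrm{CM}_j$, we have $\varepsilon = \alpha/4$ and $\delta = \eta\alpha/(4\log n)$, so each sketch uses $O\left(\frac{1}{\alpha}\log\frac{\log n}{\alpha\eta}\right)$ space and the total space is $O\left(\frac{1}{\alpha}\log n \cdot \log\frac{\log n}{\alpha\eta}\right)$.

We know the heavy hitter guarantee is an $\ell_\infty/\ell_1$ guarantee. To be precise, $\|x - x'\|_\infty \le \varepsilon\|x\|_1$. You can get to $\ell_1/\ell_1$ for Heavy Hitters, and the CM sketch can give it with $\|x'\|_0 \le k$.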

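**Definition 1.6** $x_{\mathrm{tail}(k)}$ is *x* with the heaviest *k* coordinates in magnitude reduced to zero.

**Claim** If CM has $t \ge \Theta(k/\varepsilon)$ and $L = \Theta(\log(1/\delta))$, then w.p. $1 - \delta$, $x'_i = x_i \pm \frac{\varepsilon}{k}\|x_{\mathrm{tail}(k)}\|_1$.

Given $x'$ from the CM output ($x'_i = \mathit{query}(i)$). Let $T \subset [n]$ correspond to the largest *k* entries of $x'$ in magnitude. Now consider $y = x'_T$.

**Claim** $\|x - y\|_1 \le (1 + 3\varepsilon)\|x_{\mathrm{tail}(k)}\|_1$.

*Proof*

Let *S* denote $\mathrm{head}(x) \subset [n]$ and *T* denote $\mathrm{head}(x') \subset [n]$. We have

$$\begin{aligned}
\|x - y\|_1 &= \|x\|_1 - \|x_T\|_1 + \|x_T - x'_T\|_1 \\
&\le \|x\|_1 - \|x'_T\|_1 + 2\|x_T - x'_T\|_1 \\
&\le \|x\|_1 - \|x'_S\|_1 + 2\|x_T - x'_T\|_1 \\
&\le \|x\|_1 - \|x_S\|_1 + \|x_S - x'_S\|_1 + 2\|x_T - x'_T\|_1 \\
&\le \|x_{\mathrm{tail}(k)}\|_1 + 3\varepsilon\|x_{\mathrm{tail}(k)}\|_1.
\end{aligned}$$

The Count-Min sketch is a flexible data structure which now has applications within Data Stream systems, but also in Sensor Networks, Matrix Algorithms, Computational Geometry and Privacy-Preserving Computations.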

### 1.11 Streaming k-Means

The aim is to design light-weight algorithms that make only one pass over the data. Clustering techniques are widely used in machine learning applications as a way to summarise large quantities of high-dimensional data, by partitioning them into clusters that are useful for the specific application. The problem with many heuristics designed to implement some notion of clustering is that their outputs can be hard to evaluate. Approximation guarantees, with respect to some valid objective, are thus useful. The *k*-means objective is a simple, intuitive, and widely-used clustering objective for data in Euclidean space. However, although many clustering algorithms have been designed with the *k*-means objective in mind, very few have approximation guarantees with respect to this objective. The problem to solve is that *k*-means clustering requires multiple tries to get a good clustering and each try involves going through the input data several times.

This algorithm will do what is normally a multi-pass algorithm in exactly one pass. In general, the problem in *k*-means is that bad initial conditions lead to poor clusterings: some clusters are split while other clusters are joined together as one. Therefore you need to restart *k*-means. *k*-means is not only multi-pass, but you often have to carry out restarts and run it again. In the case of complex multi-dimensional data you will ultimately get bad results.

But if we could come up with a small representation of the data, a sketch, that would prevent such problems. We could do the clustering on the sketch instead of on the data. Suppose we can build the sketch in a single pass over the data; then we have effectively turned *k*-means into a single pass algorithm. Clustering with too many clusters is the idea behind the streaming *k*-means sketch. All of the actual clusters in the original data have several sketch centroids in them, which means you will have something in every interesting feature of the data, so you can cluster the sketch instead of the data. The sketch can represent all kinds of impressive distributions if you have enough clusters. So any kind of clustering you would like to do on the original data can be done on the sketch.

### 1.12 Graph Sketching

Several kinds of highly structured data are represented as graphs. Enormous graphs arise in any application where there is data about both basic entities and the relationships between these entities, e.g., web-pages and hyperlinks; IP addresses and network flows; neurons and synapses; people and their friendships. Graphs have also become the *de facto* standard for representing many types of highly-structured data. However, analysing these graphs via classical algorithms can be challenging given the sheer size of the graphs (Guha and McGregor 2012).

A simple approach to deal with such graphs is to process them in the data stream model where the input is defined by a stream of data. For example, the stream could consist of the edges of the graph. Algorithms in this model must process the input stream in the order it arrives while using only a limited amount of memory. These constraints capture different challenges that arise when processing massive data sets, e.g., monitoring network traffic in real time or ensuring I/O efficiency when processing data that does not fit in main memory. An immediate question is how to trade off size and accuracy when constructing data summaries and how to quickly update these summaries. Techniques that have been developed to reduce space usage have also been useful in reducing communication in distributed systems. The model also has deep connections with a variety of areas in theoretical computer science including communication complexity, metric embeddings, compressed sensing, and approximation algorithms.

Traditional algorithms for analyzing properties of a graph are not appropriate for massive graphs because of memory constraints. Often the graph itself is too large to be stored in memory on a single computer. There is a need for new techniques and new algorithms to solve graph problems such as checking if a massive graph is connected, if it is bipartite, or if it is *k*-connected, and approximating the weight of a minimum spanning tree. Moreover, storing a massive graph usually requires $O(n^2)$ memory, since that is the maximum number of edges the graph may have. In order to avoid using that much memory, one can impose a constraint. The *semi-streaming model* is a widely used model of computation which restricts algorithms to using only $O(n\,\mathrm{polylog}\,n)$ memory, where *polylog n* is a notation for a polynomial in $\log n$.

When processing big data sets, a core task is to construct synopses of the data. To be useful, a synopsis data structure should be easy to construct while also yielding good approximations of the relevant properties of the data set. An important class of synopses are sketches based on linear projections of the data. These are applicable in many models including various parallel, stream, and compressed sensing settings.

We discuss graph sketching where the graphs of interest encode the relationships between these entities. Sketching is connected to dimensionality reduction. The main challenge is to capture this richer structure and build the necessary synopses with only linear measurements.

Let $G = (V, E)$, where we see edges $e \in E$ in the stream. Let $|V| = n$ and $|E| = m$. We begin by providing some useful definitions:

**Definition 1.7** A graph is bipartite if we can divide its vertices into two sets such that any edge lies between vertices in opposite sets.

**Definition 1.8** A cut in a graph is a partition of the vertices into two disjoint sets. The cut size is the number of edges with endpoints in opposite sets of the partition.

**Definition 1.9 **A minimum spanning tree (MST) is a tree subgraph of the input graph that connects all vertices and has minimum weight among all spanning trees.

Given a connected, weighted, undirected graph *G*(*V*, *E*), for each edge $(u, v) \in E$, there is a weight *w*(*u*, *v*) associated with it. The Minimum Spanning Tree (MST) problem in *G* is to find a spanning tree *T* such that the weighted sum of the edges in *T* is minimized, i.e. such that $w(T) = \sum_{(u,v) \in T} w(u, v)$ is as small as possible.

For instance, the diagram below shows a graph, *G*, of nine vertices and 12 weighted edges. The bold edges form the edges of the MST, *T*. Adding up the weights of the MST edges, we get .

**Definition 1.10 **The order of a graph is the number of its vertices.

**Claim** Any deterministic algorithm for connectivity needs $\Omega(n)$ space.

*Proof* Suppose we have $x \in \{0,1\}^{n-1}$. As before, we will perform an encoding argument. We create a graph with *n* vertices $0, 1, \ldots, n-1$. The only edges that exist are as follows: for each *i* such that $x_i = 1$, we create an edge from vertex 0 to vertex *i*. The encoding of *x* is then the space contents of the connectivity streaming algorithm run on the edges of this graph. Then in decoding, by querying connectivity between 0 and *i* for each *i*, we can determine whether $x_i$ is 1 or 0. Thus the space of the algorithm must be at least $n - 1$, the minimum encoding length for compressing $\{0,1\}^{n-1}$.
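The encoding argument can be made concrete with any correct connectivity algorithm standing in for the streaming algorithm. The sketch below uses a plain union-find structure as that stand-in (an assumption; it is not space-efficient, it only illustrates encode and decode), and the names `ConnectivityStream`, `encode` and `decode` are hypothetical.

```python
class ConnectivityStream:
    """Union-find over n vertices, fed one edge at a time."""
    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, v):
        while self.parent[v] != v:
            self.parent[v] = self.parent[self.parent[v]]   # path halving
            v = self.parent[v]
        return v

    def add_edge(self, u, v):
        self.parent[self.find(u)] = self.find(v)

    def connected(self, u, v):
        return self.find(u) == self.find(v)

def encode(x):
    """Stream the edges (0, i) for every i with x_i = 1; the state is the encoding."""
    alg = ConnectivityStream(len(x) + 1)
    for i, bit in enumerate(x, start=1):
        if bit:
            alg.add_edge(0, i)
    return alg

def decode(alg, n):
    """Recover x by asking whether vertex 0 is connected to each vertex i."""
    return [1 if alg.connected(0, i) else 0 for i in range(1, n)]

x = [1, 0, 1, 1, 0]
print(decode(encode(x), len(x) + 1))   # -> [1, 0, 1, 1, 0]
```

Since every one of the $2^{n-1}$ bit strings is recovered exactly from the algorithm's state, that state must be able to take at least $2^{n-1}$ distinct values.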

For several graph problems, it turns out that $\Omega(n)$ space is required. This motivated the semi-streaming model for graphs (Feigenbaum et al. 2005), where the goal is to achieve $O(n \log^c n)$ space.

### 1.12.1 Graph Connectivity

Consider a dynamic graph stream in which the goal is to compute the number of connected components using memory limited to *O*(*n* polylog *n*). The idea is to use a basic algorithm and reproduce it using sketches. See the following algorithm.

**Algorithm**

*Step 1*: For each vertex, pick an edge that connects it to a neighbour.

*Step 2*: Contract the picked edges.

*Step 3*: Repeat until there are no more edges to pick in step 1.

*Result*: the number of connected components is the number of vertices at the end of the algorithm.

Finally, consider a non-sketch procedure, which is based on a simple staged process. In the first stage, we find an arbitrary incident edge for each vertex. We then collapse each of the resulting connected components into a supervertex. In each subsequent stage, we find an edge from every supervertex to another supervertex, if one exists, and collapse the connected components into new supervertices. It is not difficult to argue that this process terminates after $O(\log n)$ stages and that the set of edges used to connect supervertices in the different stages includes a spanning forest of the graph. From this we can deduce whether the graph is connected.
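The staged contraction process can be sketched without any sketching machinery, with the full edge list held in memory. The function name `count_components` and the union-find representation of supervertices are illustrative assumptions.

```python
def count_components(n, edges):
    """Return (number of connected components, number of contraction stages)."""
    comp = list(range(n))                  # comp[v] points towards v's supervertex

    def find(v):
        while comp[v] != v:
            comp[v] = comp[comp[v]]        # path halving
            v = comp[v]
        return v

    stages = 0
    while True:
        picked = {}                        # supervertex -> one outgoing edge
        for u, v in edges:
            ru, rv = find(u), find(v)
            if ru != rv:                   # edge leaves the supervertex
                picked.setdefault(ru, (ru, rv))
                picked.setdefault(rv, (rv, ru))
        if not picked:                     # no more edges to pick: done
            break
        for a, b in picked.values():       # contract the picked edges
            ra, rb = find(a), find(b)
            if ra != rb:
                comp[ra] = rb
        stages += 1
    return len({find(v) for v in range(n)}), stages

print(count_components(7, [(0, 1), (1, 2), (2, 0), (3, 4)]))  # -> (4, 1)
```

Every supervertex that still has an outgoing edge is merged with at least one other supervertex in each stage, which is why the number of stages is logarithmic in *n*.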

In the past few years, there has been significant work on the design and analysis of algorithms for processing graphs in the data stream model. Problems that have received substantial attention include estimating connectivity properties, finding approximate matchings, approximating graph distances, and counting the frequency of sub-graphs.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2018 Rajendra Akerkar, *Models of Computation for Big Data*, Advanced Information and Knowledge Processing

**2. Sub-linear Time Models**

Rajendra Akerkar

(1) Western Norway Research Institute, Sogndal, Norway


### 2.1 Introduction

Sub-linear time algorithms represent a new paradigm in computing, where an algorithm must give some sort of an answer after inspecting only a very small portion of the input. It has its roots in the study of massive data sets that occur more and more frequently in various applications. Financial transactions with billions of input data and Internet traffic analyses are examples of modern data sets that show unprecedented scale. Managing and analysing such data sets forces us to reconsider the old idea of efficient algorithms. Processing such massive data sets in more than linear time is by far too expensive and frequently linear time algorithms may be extremely slow. Thus, there is the need to construct algorithms whose running times are not only polynomial, but rather are sub-linear in *n*.

In this chapter we study sub-linear time algorithms which are aimed at helping us understand massive datasets. The algorithms we study inspect only a tiny portion of an unknown object and have the aim of coming up with some useful information about the object. Algorithms of this sort provide a foundation for principled analysis of truly massive data sets and data objects.

The aim of algorithmic research is to design efficient algorithms, where efficiency is typically measured as a function of the length of the input. For instance, the elementary school algorithm for multiplying two *n*-digit integers takes roughly $n^2$ steps, while more sophisticated algorithms have been devised which run in less than $n \log^2 n$ steps. It is still not known whether a linear time algorithm is achievable for integer multiplication. Obviously any algorithm for this task, as for any other non-trivial task, would need to take at least linear time in *n*, since this is what it would take to read the entire input and write the output. Thus, showing the existence of a linear time algorithm for a problem was traditionally considered to be the gold standard of achievement. Analogous to the reasoning that we used for multiplication, for most natural problems, an algorithm which runs in sub-linear time must necessarily use randomization and must give an answer which is in some sense imprecise. Nevertheless, in many situations a fast approximate solution is more useful than a slower exact solution.

Constructing a sub-linear time algorithm may seem to be an extremely difficult task since it allows one to read only a small fraction of the input. But in the last decade we have seen the development of sub-linear time algorithms for optimization problems arising in such diverse areas as graph theory, geometry, algebraic computations, and computer graphics. The main research focus has been on designing efficient algorithms in the framework of property testing, which is an alternative notion of approximation for decision problems. However, more recently, we see some major progress in sub-linear time algorithms in the classical model of randomized and approximation algorithms.

Let us begin by proving space lower bounds. The problems we are going to look at are $F_0$ (distinct elements), where specifically any algorithm that solves $F_0$ within a factor of $1 \pm \varepsilon$ must use $\Omega(\varepsilon^{-2} + \log n)$ bits. We are also going to discuss MEDIAN, or randomized exact median, which requires $\Omega(n)$ space. Finally, we will see $F_p$ or $\|x\|_p$, which requires $\Omega(n^{1-2/p})$ space for a 2-approximation.

Suppose we have Alice and Bob, and a function $f: X \times Y \to \{0,1\}$. Alice gets $x \in X$, and Bob gets $y \in Y$. They want to compute $f(x, y)$. Suppose that Alice starts the conversation.

Suppose she sends a message $m_1$ to Bob. Then Bob replies with $m_2$, and so on. After $k$ iterations, someone can say that $f(x, y)$ is determined. The aim for us is to minimize the total amount of communication, or $\sum_{i=1}^{k} |m_i|$, where the absolute value here refers to the length of the binary string.

One of the application domains for communication complexity is distributed computing. When we wish to study the cost of computing in a network spanning multiple cores or physical machines, it is very useful to understand how much communication is necessary, since communication between machines often dominates the cost of the computation. Accordingly, lower bounds in communication complexity have been used to obtain many negative results in distributed computing. All applications of communication complexity lower bounds in distributed computing to date have used only two-player lower bounds. The reason for this appears to be twofold: First, the models of multi-party communication favoured by the communication complexity community, the number-on-forehead model and the number in-hand broadcast model, do not correspond to most natural models of distributed computing. Second, two-party lower bounds are surprisingly powerful, even for networks with many players. A typical reduction from a two-player communication complexity problem to a distributed problem T finds a sparse cut in the network, and shows that, to solve T, the two sides of the cut must implicitly solve, say, set disjointness.

A *communication protocol* is a manner of discourse agreed upon ahead of time, where Alice and Bob both know $f$. There are two obvious protocols: Alice sends $x$ using $\lceil \log_2 |X| \rceil$ bits, or Bob sends $y$ to Alice using $\lceil \log_2 |Y| \rceil$ bits. The aim is to either beat these trivial protocols or prove that no better protocol exists.

Communication complexity is connected to streaming as follows: a communication complexity lower bound can yield a streaming lower bound. We will restrict our attention to 1-way protocols, where Alice just sends messages to Bob. Suppose that we had a lower bound for a communication problem: Alice has $x \in X$, Bob has $y \in Y$, and we know that the lower bound on the optimal communication complexity is $D^{\rightarrow}(f)$. The $D$ here refers to the fact that the communication protocol is deterministic. In the case of a streaming problem, Alice can run her streaming algorithm on $x$, the first half of the stream, and send the memory contents across to Bob. Bob can then load it, pass $y$, the second half of the stream, and calculate $f(x, y)$, the ultimate result. Hence the minimal amount of space necessary is $D^{\rightarrow}(f)$.

Exact and deterministic $F_0$ requires $\Omega(n)$ space. We will use a reduction: if $F_0$ were easy, the argument above would give a cheap protocol for a problem that is known to be hard. We use the *equality problem* (EQ), which is where $f(x, y) = 1$ if and only if $x = y$. We claim $D(\mathrm{EQ}) = \Omega(n)$. This is straightforward to prove in the one-way protocol, by using the pigeonhole principle (Nelson 2015).

In order to reduce EQ to $F_0$, let us suppose that there exists a streaming algorithm $A$ for $F_0$ that uses $S$ bits of space. Alice is going to run $A$ on her stream $x$, and then send the memory contents to Bob. Bob then queries $F_0$, and then for each $i \in y$, he can append $i$ and query as before, and so solve the equality problem. This solves EQ, which requires $\Omega(n)$ space, so $S$ must be $\Omega(n)$.
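The reduction above can be sketched in a few lines. This is only an illustration under stated assumptions: the class `ExactF0` is a hypothetical exact distinct-elements counter whose "memory contents" are simply the set of items seen, and `x`, `y` are given as sets of indices in $[n]$.

```python
class ExactF0:
    """Hypothetical exact F_0 algorithm: its memory is the set of seen items."""

    def __init__(self, memory=None):
        self.seen = set(memory) if memory is not None else set()

    def update(self, item):
        self.seen.add(item)

    def query(self):
        return len(self.seen)

    def memory(self):
        # What Alice would ship to Bob in the one-way protocol.
        return frozenset(self.seen)


def solve_eq_via_f0(x, y):
    """Decide x == y for two sets of indices using only F_0 queries."""
    alice = ExactF0()
    for i in x:
        alice.update(i)
    message = alice.memory()      # the single one-way message

    bob = ExactF0(message)
    before = bob.query()          # F_0 of Alice's stream, i.e. |x|
    for i in y:
        bob.update(i)
    after = bob.query()           # F_0 of the concatenated stream, i.e. |x ∪ y|
    # x == y exactly when |x| == |y| and appending y adds nothing new.
    return before == len(y) and after == before
```

The message length is exactly the space used by the streaming algorithm, which is why a communication lower bound for EQ transfers to a space lower bound for $F_0$.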

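Let us define the following quantities.

- $D(f)$ is the optimal cost of a deterministic protocol.
- $R^{pub}_{\delta}(f)$ is the optimal cost of a randomized protocol with failure probability $\delta$ in which there is a shared random string (written in the sky, say).
- $R^{pri}_{\delta}(f)$ is the same as above, but Alice and Bob each have private random strings.
- $D_{\mu,\delta}(f)$ is the optimal cost of a deterministic protocol with failure probability $\delta$ where $(x, y) \sim \mu$.

**Claim** $D(f) \ge R^{pri}_{\delta}(f) \ge R^{pub}_{\delta}(f) \ge D_{\mu,\delta}(f)$.

*Proof* The first inequality is obvious, since a deterministic protocol is a randomized protocol that ignores its random bits. The second inequality follows from the following scheme: Alice just uses the odd bits of the public string, and Bob just uses the even bits. The third inequality follows from an averaging argument. Suppose $P$ is a public random protocol with a random string $s$ such that for all $(x, y)$, $P_s(P \text{ is correct}) \ge 1 - \delta$. Then there exists a fixed $s^*$ such that the probability over $(x, y) \sim \mu$ of $P_{s^*}$ succeeding is large. See that $s^*$ depends on $\mu$.

If we want to have a lower bound on deterministic algorithms, we need to lower bound $D(f)$. If we want to have the lower bound of a randomized algorithm, we need to lower bound $R^{pri}_{\delta}(f)$.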

We need Alice to communicate the random bits over to Bob so that he can keep on running the algorithm, and we need to *include* these bits in the cost since we store the bits in memory. So, to lower bound randomized algorithms, we lower bound $D_{\mu,\delta}(f)$.

Interestingly, one can solve EQ using public randomness with a constant number of bits. If we want to solve EQ using private randomness, we need on the order of $\log n$ bits: Alice picks a random prime $p$, and she sends across $x \bmod p$ and the prime. Newman's theorem says that you can reverse the middle inequality in the claim above at an additive cost of $O(\log n)$.
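The two randomized protocols for EQ mentioned above can be sketched as follows. This is a minimal illustration, not a definitive implementation: the function names, the use of a seeded generator to stand in for the shared random string, and the choice of drawing the prime from the primes up to $n^2$ are all assumptions made for the example.

```python
import random


def eq_public_coin(x_bits, y_bits, shared_seed, repetitions=2):
    """Public-coin EQ: compare inner products with shared random vectors mod 2.

    Alice sends one bit per repetition; if x != y each repetition detects the
    difference with probability 1/2.
    """
    shared = random.Random(shared_seed)   # stands in for the string "in the sky"
    n = len(x_bits)
    for _ in range(repetitions):
        r = [shared.getrandbits(1) for _ in range(n)]
        alice_bit = sum(a & b for a, b in zip(x_bits, r)) % 2
        bob_bit = sum(a & b for a, b in zip(y_bits, r)) % 2
        if alice_bit != bob_bit:
            return False
    return True


def primes_up_to(limit):
    """Sieve of Eratosthenes."""
    sieve = [True] * (limit + 1)
    sieve[0:2] = [False, False]
    for i in range(2, int(limit ** 0.5) + 1):
        if sieve[i]:
            for j in range(i * i, limit + 1, i):
                sieve[j] = False
    return [i for i, is_prime in enumerate(sieve) if is_prime]


def eq_private_coin(x_bits, y_bits, rng):
    """Private-coin EQ: Alice sends a random prime p and x mod p (O(log n) bits).

    The prime range n^2 is an illustrative assumption, not taken from the text.
    """
    n = len(x_bits)
    p = rng.choice(primes_up_to(max(4, n * n)))
    x_val = int("".join(str(b) for b in x_bits), 2)
    y_val = int("".join(str(b) for b in y_bits), 2)
    return x_val % p == y_val % p
```

Both protocols never reject equal inputs; the error is one-sided and can be driven down by repetition.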

### 2.2 Fano’s Inequality

Fano’s inequality is a well-known information-theoretical result that provides a lower bound on worst-case error probabilities in multiple-hypotheses testing problems. It has important consequences in information theory and related fields. In statistics, it has become a major tool to derive lower bounds on minimax (worst-case) rates of convergence for various statistical problems such as nonparametric density estimation, regression, and classification.

Suppose you need to make some decision, and I give you some information that helps you to decide. Fano's inequality gives a lower bound on the probability that you end up making the wrong choice, as a function of your initial uncertainty and how informative my information was. Interestingly, it does not place any constraint on how you make your decision; that is, it gives a lower bound on your best-case error probability. If the bound is negative, then in principle you might be able to eliminate your decision error. If the bound is positive (i.e., binds), then there is no way for you to use the information I gave you to always make the right decision.

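Now let us consider a two-player problem: *INDEX*. Alice gets $x \in \{0,1\}^n$, and Bob gets $j \in [n]$, and $\mathrm{INDEX}(x, j) = x_j$.

We are going to show that *INDEX*, the problem of finding the $j$th element of a streamed vector, is hard. Then, we will show that this reduces to *GAPHAM*, or Gap Hamming, which will reduce to $F_0$. Also, INDEX reduces to *MEDIAN*. Finally, $\mathrm{DISJ}_t$ reduces (with $t = (2n)^{1/p}$) to $F_p$, $p > 2$.

**Claim** $R^{pub,\rightarrow}_{\delta}(\mathrm{INDEX}) \ge (1 - H_2(\delta))\,n$, where $H_2(\delta) = \delta \log_2 \frac{1}{\delta} + (1 - \delta)\log_2 \frac{1}{1-\delta}$ is the binary entropy function. If $\delta \approx 1/3$, this is about $0.08\,n$. In fact, the distributional complexity has the same lower bound.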

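Let us first elaborate some definitions. If we have random variables $X$ and $Y$, then

- $H(X) = \sum_x p_x \log \frac{1}{p_x}$ (*entropy*)
- $H(X, Y) = \sum_{x,y} p_{x,y} \log \frac{1}{p_{x,y}}$ (*joint entropy*)
- $H(X \mid Y) = E_y\, H(X \mid Y = y)$ (*conditional entropy*)
- $I(X; Y) = H(X) - H(X \mid Y)$ (*mutual information*)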

The *entropy* is the number of bits we need to send, in expectation, to communicate $X$. This can be achieved via Huffman coding (in the limit). The mutual information is how much of $X$ we learn by communicating $Y$.

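The following are some fundamental rules concerning these quantities.

**Lemma 2.1**

- Chain rule: $H(X, Y) = H(X) + H(Y \mid X)$.
- Chain rule for mutual information: $I(X, Y; Q) = I(X; Q) + I(Y; Q \mid X)$.
- Subadditivity: $H(X, Y) \le H(X) + H(Y)$.
- Chain rule + subadditivity: $H(X \mid Y) \le H(X)$.
- Basic: $H(X) \le \log |\mathrm{supp}(X)|$.
- $H(f(X)) \le H(X)$ for every function $f$.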
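The definitions and rules above can be checked numerically on a small joint distribution. The sketch below is illustrative only; the helper names and the representation of a joint distribution as a dictionary from pairs `(x, y)` to strictly positive probabilities are assumptions made for this example.

```python
from math import log2


def entropy(probs):
    """H = sum p * log2(1/p), ignoring zero-probability outcomes."""
    return sum(p * log2(1.0 / p) for p in probs if p > 0)


def marginal(joint, axis):
    """Marginal distribution of coordinate `axis` (0 for X, 1 for Y)."""
    out = {}
    for key, p in joint.items():
        out[key[axis]] = out.get(key[axis], 0.0) + p
    return out


def conditional_entropy(joint, given_axis):
    """H(other | given) = sum_g P(given = g) * H(other | given = g)."""
    total = 0.0
    for g, pg in marginal(joint, given_axis).items():
        cond = [p / pg for key, p in joint.items() if key[given_axis] == g]
        total += pg * entropy(cond)
    return total


def mutual_information(joint):
    """I(X; Y) = H(X) - H(X | Y)."""
    h_x = entropy(marginal(joint, 0).values())
    return h_x - conditional_entropy(joint, given_axis=1)


def binary_entropy(delta):
    """H_2(delta), the entropy of a coin with bias delta."""
    return entropy([delta, 1.0 - delta])
```

For instance, `1 - binary_entropy(1/3)` is the constant in front of $n$ in the INDEX lower bound for failure probability $1/3$.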

**Theorem 2.1** (Fano's Inequality) If there exist two random variables $X$, $Y$ and a predictor $g$ such that $P(g(Y) \ne X) \le \delta$, then $H(X \mid Y) \le H_2(\delta) + \delta \log_2(|\mathrm{supp}(X)| - 1)$. If $X$ is a binary random variable then the second term vanishes. If all you have is $Y$, then based on $Y$ you make a guess of $X$. Ultimately, for small $\delta$, $H(X \mid Y)$ is small.

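Let $\Pi$ be the transcript of the optimal communication protocol. It is a one-way protocol, so the transcript is just Alice's message, and we know that $R^{pub,\rightarrow}_{\delta}(\mathrm{INDEX}) \ge H(\Pi) \ge I(X; \Pi)$.

We know that for all $x$ and for all $j$, $P_s(\text{Bob is correct}) \ge 1 - \delta$, which implies that for all $j$, $P_{X \sim \mathrm{Unif},\, s}(\text{Bob is correct}) \ge 1 - \delta$, which then implies by Fano's inequality that

$$H(X_j \mid \Pi) \le H_2(\delta).$$

See that $\Pi$ is a random variable because of the random string in the sky, and also because it depends on $X$ (Nelson 2015). Applying the chain rule $n$ times, and using that conditioning does not increase entropy, we conclude

$$I(X; \Pi) = \sum_{i=1}^{n} I(X_i; \Pi \mid X_{<i}) = \sum_{i=1}^{n} \left( H(X_i \mid X_{<i}) - H(X_i \mid \Pi, X_{<i}) \right) \ge \sum_{i=1}^{n} \left(1 - H_2(\delta)\right) = n\,(1 - H_2(\delta)).$$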

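Since we have INDEX, let us use it to prove another lower bound, namely for MEDIAN. We want a randomized, exact median of $x_1, \ldots, x_n$ with probability $1 - \delta$. We shall use a reduction.

**Claim** INDEX on $\{0,1\}^n$ reduces to MEDIAN over the universe $\{0, 1, \ldots, 2n+2\}$, with stream length $2n - 1$. To solve INDEX, Alice inserts $2 + x_1, 4 + x_2, \ldots, 2n + x_n$ into the stream, and Bob inserts $n - j$ copies of $0$, and another $j - 1$ copies of $2n + 2$.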

Suppose that $n = 3$ and $x = 101$. Then Alice will choose 3, 4, 7 out of 2, 3, 4, 5, 6, 7. Bob cares about a particular index, suppose the first index. Bob is going to make this stream length 5, such that the median of the stream is exactly the element at the index he wants. Basically, we can insert $0$ or $2n + 2$ exactly where we want, moving the $j$th of Alice's elements to the middle, which we can then output; its parity is $x_j$.
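The reduction from INDEX to MEDIAN can be sketched directly. This is an illustration only: the function name is hypothetical, and the median is computed by sorting rather than by a streaming algorithm, since the point is the construction of the stream.

```python
def index_via_median(x, j):
    """Recover x_j (1-indexed) from the median of the combined stream.

    x is a list of bits x_1..x_n held by Alice; j in 1..n is held by Bob.
    """
    n = len(x)
    # Alice inserts 2i + x_i for i = 1..n.
    stream = [2 * i + x[i - 1] for i in range(1, n + 1)]
    # Bob pads with n - j copies of 0 and j - 1 copies of 2n + 2.
    stream += [0] * (n - j) + [2 * n + 2] * (j - 1)
    # The stream has 2n - 1 items, so the median is the n-th smallest.
    median = sorted(stream)[n - 1]
    # The median is 2j + x_j, whose parity is x_j.
    return median % 2
```

With $x = 101$ and $j = 1$ the stream is 3, 4, 7, 0, 0, whose median 3 is odd, matching $x_1 = 1$.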

Now we use INDEX to give a lower bound on the space usage of randomized exact $F_0$. We then present a lower bound for randomized approximate $F_0$, something that we have thus far been unable to do. We then provide a lower bound on $F_p$ via the disjointness problem. Then we move on to dimensionality reduction, distortion, and distributional Johnson–Lindenstrauss and the fact that it implies Johnson–Lindenstrauss.

### 2.3 Randomized Exact and Approximate $F_0$ Bound

We show that randomized exact $F_0$ requires $\Omega(n)$ space with constant failure probability $\delta < 1/2$.

*Proof*

We perform the following reduction from INDEX. Let Alice receive $x \in \{0,1\}^n$ and Bob receive $j \in [n]$. It is then Bob's job to find the $j$th index of $x$. They proceed in the following manner.

Alice runs our algorithm on $x$ (inserting every $i$ with $x_i = 1$) and sends both the memory contents of the algorithm and the size of the support of $x$ to Bob. Bob then appends $j$ to the stream and queries $F_0$ from the memory contents of the algorithm. If $F_0$ increases, Bob outputs 0, else he outputs 1. We conclude that, for $S$ equal to the space usage of the algorithm, $S + \log n = \Omega(n)$, where the $\log n$ term comes from sending the support size of $x$.

To show that randomized approximate $F_0$ has space lower bound $\Omega(\log n)$, we first state a theorem that can be found in Kushilevitz and Nisan (Kushilevitz and Nisan 1997). In general, it lower bounds the private-coin communication bound by the logarithm of the deterministic communication bound.

**Theorem 2.2** For $f: X \times Y \to \{0,1\}$, where $f$ is a communication problem, $R^{pri}_{\delta}(f) = \Omega(\log D(f))$.

*Proof* View a private-coin protocol for $f$ of cost $s$ as a two-player game between Alice and Bob on a binary tree of height $s$ with $2^s$ leaves. Then Alice and Bob can deterministically simulate the private randomized procedure on this tree. For any path from root to leaf, Alice can compute the probability that she would stay on the path given that Bob does as well. She can then send these probabilities to Bob for every single leaf. Bob can then compute the probabilities that he stays on the same paths and can output the final result accordingly. Sending $2^s$ probabilities, each to sufficient precision, costs $2^{O(s)}$ bits, hence $D(f) \le 2^{O(s)}$.

Now we prove that randomized approximate $F_0$ has space lower bound $\Omega(\log n)$.

*Proof*

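Let $C$ be a subset of $\{0,1\}^n$ such that every $c \in C$ has support of the same size, a constant fraction of $n$. Also, for all distinct $c, c' \in C$, the intersection $|c \cap c'|$ is at most a small constant fraction of that support size. Finally, $|C| = 2^{\Omega(n)}$. In essence, $C$ is a collection of subsets that are largely disjoint but very numerous. We know deterministic equality, EQ, on $C$ requires $\Omega(\log |C|) = \Omega(n)$ communication. Then using Theorem 2.2 (Kushilevitz and Nisan 1997) we have $R^{pri}_{\delta}(\mathrm{EQ}) = \Omega(\log n)$. Now we notice there is a natural reduction from EQ to randomized approximate $F_0$.

Namely, Alice runs the algorithm on her set $c$ and sends the memory contents to Bob. Bob then continues the run on his set $c'$ and determines whether the output for $F_0$ has roughly doubled. If it has, then $c \ne c'$; if not, then $c = c'$.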

### 2.4 *t*-Player Disjointness Problem

Let us consider the $t$-player disjointness problem, which we use for proving lower bounds for $F_p$ with $p > 2$. We have $t$ players $P_1, \ldots, P_t$. We assign an $n$-bit string $x_i \in \{0,1\}^n$, viewed as a subset of $[n]$, to player $P_i$. We are then promised that one of the following conditions holds.

1. For all $i \ne j$, we have $x_i \cap x_j = \emptyset$.
2. There exists $k \in [n]$ such that for all $i \ne j$, we have $x_i \cap x_j = \{k\}$.

The problem is then to decide which case holds with the least communication possible, where communication occurs from player 1 passing a message on to player 2 and so on, until player $t$ gives the final result.

The proof also uses an information theoretic approach, known as *information complexity* (Chakrabarti et al. 2001). The idea is the following chain of inequalities, where $\pi$ is the optimal $\delta$-error communication protocol for some function $f$:

$$|\pi| \ge H(\pi(X)) \ge I(X; \pi(X)),$$

where $X$ is the set of inputs given to the $t$ players, and $\pi(X)$ is the transcript of the communication protocol (or the "communication log") when the input is $X$ (see that it is a random variable since $\pi$ uses randomness). Then we define the *information complexity* $IC_{\mu,\delta}(f)$ as the minimum value of $I(X; \pi(X))$ achievable by any $\delta$-error protocol $\pi$ when $X$ is drawn from distribution $\mu$. Then we have that $R^{pub}_{\delta}(f) \ge IC_{\mu,\delta}(f)$ for all $\mu$. A variant of this approach was used by (Bar-Yossef et al. 2004) to obtain lower bounds for $t$-player disjointness, with improvements in (Chakrabarti et al. 2003). The sharp bound was shown in (Gronemeier 2009), with a later work showing how the arguments in (Bar-Yossef et al. 2004) could be strengthened to also get the sharp bound (Jayram 2009).

**Theorem 2.3** $R^{pub}_{1/3}(\mathrm{DISJ}_t) = \Omega(n/t)$.

*Remark 2.1* Whilst we do not prove the theorem, we know that it implies that some player sends $\Omega(n/t^2)$ bits, which we use to demonstrate the following claim.

**Claim** For $p > 2$, a randomized 1.1-approximation to $F_p$ needs $\Omega(n^{1-2/p})$ bits of space.

*Proof*

Consider $t = (2n)^{1/p}$ for the $t$-player disjointness problem. Each player $P_i$ generates a virtual stream containing $j$ if and only if $j \in x_i$. Further, we compute $F_p$ on the concatenation of these virtual streams. If all $x_i$ are disjoint then $F_p \le n$. Alternatively, $F_p \ge t^p = 2n$ because some element $k$ must appear at least $t$ times. Since the algorithm is a 1.1-approximation, we can distinguish between the two cases. This implies that the space usage $S$ of the algorithm satisfies $S = \Omega(n/t^2) = \Omega(n^{1-2/p})$, which proves the claim.
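The gap between the two promise cases can be seen on a small instance. The sketch below is illustrative: `f_p` computes the moment exactly from the concatenated virtual streams (no streaming algorithm is involved), and the concrete choice $n = 32$, $p = 3$, $t = 4$ in the usage note is an assumption picked so that $t = (2n)^{1/p}$ is an integer.

```python
def f_p(streams, p):
    """Exact p-th frequency moment of the concatenation of the given streams."""
    freq = {}
    for stream in streams:
        for item in stream:
            freq[item] = freq.get(item, 0) + 1
    return sum(count ** p for count in freq.values())


def distinguishable(low, high, approx=1.1):
    """True if an `approx`-factor approximation separates the two values."""
    return approx * low < high / approx
```

With four players holding disjoint blocks of eight elements, `f_p` returns $32 = n$; if instead every player holds one common element, that element alone contributes $4^3 = 64 = 2n$, and `distinguishable(32, 64)` confirms that a 1.1-approximation tells the cases apart.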

### 2.5 Dimensionality Reduction

Dimensionality reduction (Globerson et al. 2003) has been one of the key techniques used to facilitate the processing of streaming data. Dimension reduction algorithms are generally classified into feature selection, feature extraction and random projection. In simple words, dimension reduction converts data with a very large number of dimensions into data with fewer dimensions while conveying similar information concisely. These techniques are typically used in machine learning to obtain better features for a classification or regression task. The characteristics of streaming data require dimension reduction techniques to be as efficient as possible, so the algorithms commonly used for data streams are random projection and feature selection.

Dimensionality reduction transforms high-dimensional data into a lower-dimensional version such that, for the computational problem under consideration, solving the problem on the lower-dimensional data yields an approximate solution on the original data. Since the transformed data is low-dimensional, the algorithm runs faster.

One of the most used dimensionality reduction techniques is Principal Component Analysis (PCA). The key idea is to find a new coordinate system in which the input data can be expressed with far fewer variables without significant error. This new basis can be global or local and can fulfil very different properties. Big data, together with advanced computational resources, has attracted the attention of many researchers in Statistics, Computer Science and Applied Mathematics, who have developed a wide range of computational techniques dealing with the dimensionality reduction problem.

Dimensionality reduction techniques can be used in different ways including:

1. *Data dimensionality reduction*: Construct a compact low-dimensional encoding of a given high-dimensional data set.
2. *Data visualization*: Give an interpretation of a given data set in terms of intrinsic degrees of freedom, usually as a by-product of data dimensionality reduction.
3. *Preprocessing for supervised learning*: Simplify, reduce, and clean the data for subsequent supervised training.

In many large-scale data processing applications, local distances convey more useful information than large distances and are sufficient for uncovering low-dimensional structure. Moreover, there are a variety of situations that rely only on local distances, including nearest- neighbor search, the computation of vector quantization rate-distortion curves, and popular data-segmentation and clustering algorithms. In these cases, it is often desirable to reduce the dimension of the data set for reductions of storage requirements or algorithm running times. If the long distances are unimportant, we may be able to reduce the dimensionality only preserving the local information, and such reduction can be into a far lower dimension than what is possible when attempting to preserve distances between all pairs of points.

Several algorithms for dimensionality reduction have been developed.

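**Definition 2.1** Suppose we have two metric spaces, $(X, d_X)$ and $(Y, d_Y)$, and a function $f: X \to Y$. Then $f$ has distortion $D_f$ if for all $x, x' \in X$, $C_1 \cdot d_X(x, x') \le d_Y(f(x), f(x')) \le C_2 \cdot d_X(x, x')$, where $D_f = C_2 / C_1$.

We will focus on spaces in which $d_X(x, x') = \|x - x'\|_X$ (i.e. normed spaces).

Furthermore, if $\|\cdot\|_X$ is the $\ell_1$ norm, then in the worst case the target dimension must be large. That is, there exists a set of $n$ points $X$ such that for all functions $f: (X, \ell_1) \to (X', \ell_1^m)$ with distortion $D$, $m$ must be at least $n^{\Omega(1/D^2)}$ (Brinkman and Charikar 2005).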

More recently in 2010, Johnson and Naor (Johnson and Naor 2010) have proposed the following theorem.

**Theorem 2.4** Suppose $(X, \|\cdot\|_X)$ is a complete normed vector space or "Banach space" such that any $N$-point subset of $X$ can be mapped into an $O(\log N)$-dimensional subspace of $X$ with $O(1)$ distortion. Then every $n$-dimensional linear subspace of $X$ embeds into $\ell_2^n$ with distortion $2^{2^{O(\log^* n)}}$.

Dimension reduction for high dimensional metric data has been an extremely important paradigm in many application areas. In particular, the celebrated Johnson–Lindenstrauss Lemma has played a central role in a plethora of applications. We will now present the Johnson–Lindenstrauss lemma which is a result named after W. B. Johnson and J. Lindenstrauss concerning low-distortion embeddings of points from high-dimensional into low-dimensional Euclidean space. The lemma states that a small set of points in a high- dimensional space can be embedded into a space of much lower dimension in such a way that distances between the points are nearly preserved, the map used for the embedding is at least Lipschitz, and can even be taken to be an orthogonal projection.

### 2.5.1 Johnson Lindenstrauss Lemma

The Johnson–Lindenstrauss Lemma is a critical tool in the realm of dimensionality reduction and high dimensional approximate computational geometry. It is a classic result which implies that any set of $n$ real vectors can be compressed to $O(\log n)$ dimensions while only distorting pairwise Euclidean distances by a constant factor. It is also employed for data mining in domains that analyse intrinsically high dimensional objects such as images and text. However, while algorithms performing the dimensionality reduction have become increasingly sophisticated, there is little understanding of the behaviour of these embeddings in practice. In many practical instances the high-dimensional data is inherently low dimensional, and it is therefore desirable to reduce its dimension close to its inherent dimensionality, which is independent of the size of the data set.

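**Theorem 2.5** The Johnson–Lindenstrauss (JL) lemma (Johnson and Lindenstrauss 1984) states that for all $\varepsilon \in (0, 1/2)$ and all $x_1, \ldots, x_n \in \mathbb{R}^d$, there exists $\Pi \in \mathbb{R}^{m \times d}$ with $m = O(\varepsilon^{-2} \log n)$, such that for all $i$, $j$,

$$(1 - \varepsilon)\,\|x_i - x_j\|_2 \le \|\Pi x_i - \Pi x_j\|_2 \le (1 + \varepsilon)\,\|x_i - x_j\|_2.$$

We also present a distributional Johnson–Lindenstrauss theorem.

**Theorem 2.6** For all $\varepsilon, \delta \in (0, 1/2)$, there exists a distribution $\mathcal{D}_{\varepsilon,\delta}$ on matrices $\Pi \in \mathbb{R}^{m \times d}$ with $m = O(\varepsilon^{-2} \log(1/\delta))$, such that for all $x \in \mathbb{R}^d$, and $\Pi$ drawn from the distribution $\mathcal{D}_{\varepsilon,\delta}$,

$$P\left(\|\Pi x\|_2 \notin \left[(1 - \varepsilon)\|x\|_2,\ (1 + \varepsilon)\|x\|_2\right]\right) < \delta.$$

Now we illustrate that the distributional Johnson–Lindenstrauss theorem proves Johnson–Lindenstrauss.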

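**Claim** Distributional JL implies JL.

*Proof*

Set $\delta < 1/n^2$ and look at $x = x_i - x_j$ for $i < j$. Also see that $\Pi(x_i - x_j) = \Pi x_i - \Pi x_j$ by linearity. Then $P(\Pi \text{ fails on } x_i - x_j) < \delta$ for each pair, and so by a union bound over the $\binom{n}{2}$ pairs the probability that some pair fails is less than $\binom{n}{2}\delta < 1$. Hence some $\Pi$ in the support of the distribution preserves all pairwise distances.

Next we collect some notation and basic lemmas we will use (Nelson 2015).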

Throughout, for a random variable $X$, $\|X\|_p$ denotes $(E|X|^p)^{1/p}$. It is known that $\|\cdot\|_p$ is a norm for any $p \ge 1$. In mathematical analysis, Minkowski's inequality establishes that the $L^p$ spaces are normed vector spaces. It is also known that $\|X\|_p \le \|X\|_q$ whenever $p \le q$. Whenever we discuss $\|\cdot\|_p$, we will assume $p \ge 1$.

In the following part we will use two inequalities. Jensen's inequality was discovered by the Danish mathematician Johan Jensen in 1906 and appears in many forms depending on the context. It states that the convex transformation of a mean is less than or equal to the mean applied after convex transformation. It is a simple corollary that the opposite is true of concave transformations. The Khintchine inequality, named after Aleksandr Khinchin, is a theorem from probability that is also frequently used in analysis. The importance of the Rademacher functions and the Khintchine inequality in functional analysis lies mainly in their utility in the study of the geometry of Banach spaces. The role of the Rademacher functions in the theory of functional and trigonometric series and in the theory of Banach spaces is well known, and it is commonly attributed to their stochastic independence. One of the main manifestations of this stochastic independence is the Khintchine inequality. Moreover, the Khintchine inequality is an important auxiliary result frequently used to prove results concerning summability.

**Lemma 2.2** For $F$ convex, $F(E X) \le E F(X)$.

**Lemma 2.3** For $0 < p < q$, $\|X\|_p \le \|X\|_q$.

*Proof*

Define $f(x) = |x|^{q/p}$. Then $f$ is convex. Thus by Jensen's inequality, $f(E|X|^p) \le E f(|X|^p)$, that is, $(E|X|^p)^{q/p} \le E|X|^q$. Now raise both sides of the inequality to the power $1/q$.

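**Definition 2.2** The Gaussian distribution $N(\mu, \sigma^2)$ has density function $p(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(x-\mu)^2/(2\sigma^2)}$.

*Remark 2.2* If $g \sim N(0, 1)$ then $E\,g^p$ for integer $p$ is 0 for $p$ odd and is $\frac{p!}{2^{p/2}(p/2)!}$ for $p$ even.

**Lemma 2.4** (Khintchine inequality) For any $p \ge 1$, $x \in \mathbb{R}^n$, and $(\sigma_i)$ independent Rademachers,

$$\Big\|\sum_{i} \sigma_i x_i\Big\|_p \lesssim \sqrt{p}\,\|x\|_2.$$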

*Proof*

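Without loss of generality we can assume $p$ is an even integer. (If not, let $q$ be the smallest even integer larger than $p$; then it suffices to have the bound for $q$, since $\|\cdot\|_p \le \|\cdot\|_q$ by Lemma 2.3 and $q \le p + 2$.) Consider $(g_i)$ independent Gaussians of mean zero and variance 1. Expand $E\left(\sum_i \sigma_i x_i\right)^p$ into a sum of monomials.

Any monomial $\prod_j (\sigma_{i_j} x_{i_j})^{d_j}$ with odd exponents (i.e. some odd $d_j$) vanishes in expectation, as in the Gaussian case. Meanwhile, monomials with all $d_j$ being even have nonnegative coefficients and $E \prod_j \sigma_{i_j}^{d_j} = 1$.

Meanwhile if the $\sigma_i$ are replaced by Gaussians $g_i$, then $E \prod_j g_{i_j}^{d_j} \ge 1$. Thus the Rademacher $p$th moment is term-by-term dominated by the Gaussian case and thus $\|\sum_i \sigma_i x_i\|_p \le \|\sum_i g_i x_i\|_p$. But $\sum_i g_i x_i$ is a Gaussian with mean zero and variance $\|x\|_2^2$, so by Remark 2.2 its $p$-norm is at most a constant times $\sqrt{p}\,\|x\|_2$.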
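The term-by-term domination in the proof can be checked exactly for small $n$ by enumerating all sign patterns. The following sketch is illustrative only; the function names are hypothetical, and the Gaussian moment uses the closed form $(p-1)!!\,\|x\|_2^p$ for even $p$, which agrees with Remark 2.2.

```python
from itertools import product


def rademacher_moment(x, p):
    """E (sum_i sigma_i x_i)^p over uniform signs, by full enumeration."""
    total = 0.0
    for signs in product((-1, 1), repeat=len(x)):
        total += sum(s * xi for s, xi in zip(signs, x)) ** p
    return total / 2 ** len(x)


def gaussian_moment(x, p):
    """E (sum_i g_i x_i)^p for even p: (p - 1)!! * ||x||_2^p."""
    norm_sq = sum(xi * xi for xi in x)
    double_factorial = 1
    for k in range(p - 1, 0, -2):
        double_factorial *= k
    return double_factorial * norm_sq ** (p // 2)
```

For $p = 2$ the two moments coincide (both equal $\|x\|_2^2$); for larger even $p$ the Rademacher moment is strictly smaller whenever $x$ has at least two non-zero entries.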

So far we showed that the Rademacher case is bounded by the Gaussian case. In the following lines you will see another approach. Hereafter we will not require *p* to be an even integer. We further demonstrate a decoupling inequality which will be essential for the proof of Hanson–Wright inequality. Hanson–Wright will be used to prove distributional JL. We use to denote .

**Lemma 2.5**

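Let $x_1, \ldots, x_n$ be independent and mean zero, and $x_1', \ldots, x_n'$ identically distributed as the $x_i$ and independent of them. Then for any $(a_{i,j})$ and for all $p \ge 1$,

$$\Big\|\sum_{i \ne j} a_{i,j}\, x_i x_j\Big\|_p \le 4\,\Big\|\sum_{i,j} a_{i,j}\, x_i x_j'\Big\|_p.$$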

*Proof*

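Let $\eta_1, \ldots, \eta_n$ be independent Bernoulli random variables each of expectation $1/2$. Since $E\,\eta_i(1 - \eta_j) = 1/4$ for $i \ne j$, Jensen's inequality gives

$$\Big\|\sum_{i \ne j} a_{i,j}\, x_i x_j\Big\|_p = 4\,\Big\|E_\eta \sum_{i \ne j} a_{i,j}\, x_i x_j\, \eta_i (1 - \eta_j)\Big\|_p \le 4\,\Big\|\sum_{i \ne j} a_{i,j}\, x_i x_j\, \eta_i (1 - \eta_j)\Big\|_p. \tag{2.1}$$

Hence there must be some fixed vector $\eta' \in \{0,1\}^n$ which attains

$$\Big\|\sum_{i \ne j} a_{i,j}\, x_i x_j\Big\|_p \le 4\,\Big\|\sum_{i \in S} \sum_{j \notin S} a_{i,j}\, x_i x_j\Big\|_p,$$

where $S = \{i : \eta_i' = 1\}$. Let $x_S$ denote the $|S|$-dimensional vector corresponding to the $x_i$ for $i \in S$. Then

$$\Big\|\sum_{i \in S} \sum_{j \notin S} a_{i,j}\, x_i x_j\Big\|_p = \Big\|\sum_{i \in S} \sum_{j \notin S} a_{i,j}\, x_i x_j'\Big\|_p = \Big\|E_{x_{\bar{S}}} E_{x_S'} \sum_{i,j} a_{i,j}\, x_i x_j'\Big\|_p \le \Big\|\sum_{i,j} a_{i,j}\, x_i x_j'\Big\|_p,$$

where the second equality uses that the $x_i$ and $x_j'$ have mean zero and the last step is again Jensen's inequality.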

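The Hanson–Wright inequality is equivalent to the statement that there exists a constant $C > 0$ such that for all $\lambda > 0$

$$P_\sigma\left(\left|\sigma^T A \sigma - E\,\sigma^T A \sigma\right| > \lambda\right) \lesssim e^{-C \min\{\lambda^2 / \|A\|_F^2,\ \lambda / \|A\|\}}. \tag{2.2}$$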

**Theorem 2.7**

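(Hanson and Wright 1971) For random variables $\sigma_1, \ldots, \sigma_n$ whose distributions are symmetric about zero (independent Rademachers) and $A \in \mathbb{R}^{n \times n}$ real and symmetric, for all $p \ge 1$

$$\left\|\sigma^T A \sigma - E\,\sigma^T A \sigma\right\|_p \lesssim \sqrt{p}\,\|A\|_F + p\,\|A\|.$$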

*Proof*

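Without loss of generality we suppose that $p \ge 2$ (so that $p/2 \ge 1$). Then, writing $\sigma'$ for an independent copy of $\sigma$ and applying decoupling (Lemma 2.5) followed by Khintchine (Lemma 2.4),

$$\left\|\sigma^T A \sigma - E\,\sigma^T A \sigma\right\|_p \lesssim \left\|\sigma^T A \sigma'\right\|_p \lesssim \sqrt{p}\,\big\|\|A\sigma\|_2\big\|_p \tag{2.3}$$

$$= \sqrt{p}\,\big\|\|A\sigma\|_2^2\big\|_{p/2}^{1/2} \le \sqrt{p}\,\big\|\|A\sigma\|_2^2\big\|_{p}^{1/2} \le \sqrt{p}\,\Big(\|A\|_F^2 + \big\|\|A\sigma\|_2^2 - E\|A\sigma\|_2^2\big\|_p\Big)^{1/2} \tag{2.4}$$

$$\le \sqrt{p}\,\|A\|_F + \sqrt{p}\,\big\|\|A\sigma\|_2^2 - E\|A\sigma\|_2^2\big\|_p^{1/2} \lesssim \sqrt{p}\,\|A\|_F + \sqrt{p}\,\big\|\sigma^T A^T A \sigma'\big\|_p^{1/2} \tag{2.5}$$

$$\lesssim \sqrt{p}\,\|A\|_F + p^{3/4}\,\big\|\|A^T A \sigma\|_2\big\|_p^{1/2} \le \sqrt{p}\,\|A\|_F + p^{3/4}\,\|A\|^{1/2}\,\big\|\|A\sigma\|_2\big\|_p^{1/2}. \tag{2.6}$$

Writing $E = \big\|\|A\sigma\|_2\big\|_p^{1/2}$ and comparing (2.3) and (2.6), we see that for some constant $C > 0$,

$$E^2 - C p^{1/4} \|A\|^{1/2} E - C \|A\|_F \le 0.$$

Thus $E$ must be smaller than the larger root of the above quadratic equation, implying our desired upper bound on $E^2$, namely $E^2 \lesssim \|A\|_F + \sqrt{p}\,\|A\|$; substituting into (2.3) gives the theorem.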

The Johnson–Lindenstrauss Lemma is useful because it allows one to project high-dimensional data to a much lower-dimensional space while approximately preserving all of its metric properties. Several algorithms have runtimes that are exponential in the dimension, and with only logarithmic dimension these algorithms become polynomial. Note that the reduced dimension depends only on the number of points and not on the original dimension. The following lemma is sometimes called the Distributional Johnson–Lindenstrauss Lemma because of its connection to the JL lemma.

**Lemma 2.6**

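(Distributional Johnson–Lindenstrauss Lemma) For any integer $n > 1$ and $\varepsilon, \delta \in (0, 1/2)$, there exists a distribution $\mathcal{D}_{\varepsilon,\delta}$ over $\mathbb{R}^{m \times n}$ for $m = O(\varepsilon^{-2} \log(1/\delta))$ such that for any $x \in \mathbb{R}^n$ of unit Euclidean norm,

$$P_{\Pi \sim \mathcal{D}_{\varepsilon,\delta}}\left(\left|\|\Pi x\|_2^2 - 1\right| > \varepsilon\right) < \delta.$$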

*Proof*

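Write $\Pi_{i,j} = \sigma_{i,j}/\sqrt{m}$, where the $\sigma_{i,j}$ are independent Rademachers. Also overload $\sigma$ to mean these Rademachers arranged as a vector of length $mn$, by concatenating rows of $\Pi$. See that $\|\Pi x\|_2^2 = \sigma^T A_x \sigma$ where

$$A_x = \frac{1}{m} \begin{pmatrix} x x^T & & \\ & \ddots & \\ & & x x^T \end{pmatrix}. \tag{2.7}$$

Since $E\,\sigma^T A_x \sigma = \mathrm{tr}(A_x) = \|x\|_2^2 = 1$, we have

$$P\left(\left|\|\Pi x\|_2^2 - 1\right| > \varepsilon\right) = P\left(\left|\sigma^T A_x \sigma - E\,\sigma^T A_x \sigma\right| > \varepsilon\right),$$

where the right-hand side is bounded by the Hanson–Wright inequality (2.2) with $A = A_x$ (from (2.7)). Note that $A$ is a block-diagonal matrix with each block equaling $\frac{1}{m} x x^T$, and thus $\|A\| = \|x\|_2^2 / m = 1/m$. We also have $\|A\|_F^2 = m \cdot \|x\|_2^4 / m^2 = 1/m$. Thus Hanson–Wright allows

$$P\left(\left|\|\Pi x\|_2^2 - 1\right| > \varepsilon\right) \lesssim e^{-C \min\{\varepsilon^2 m,\ \varepsilon m\}},$$

which for $\varepsilon < 1$ is at most $\delta$ for $m \gtrsim \varepsilon^{-2} \log(1/\delta)$.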
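The random sign construction used in this proof is easy to try out. The sketch below is a minimal pure-Python illustration, not an optimized implementation; the function names are hypothetical, and the matrix sizes in the usage note are arbitrary choices rather than values prescribed by the lemma.

```python
import math
import random


def random_sign_matrix(m, d, rng):
    """m x d matrix with independent entries +-1/sqrt(m)."""
    scale = 1.0 / math.sqrt(m)
    return [[rng.choice((-scale, scale)) for _ in range(d)] for _ in range(m)]


def project(matrix, x):
    """Matrix-vector product Pi x."""
    return [sum(r * v for r, v in zip(row, x)) for row in matrix]


def distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def max_distortion(points, matrix):
    """Largest | ||Pi x_i - Pi x_j|| / ||x_i - x_j|| - 1 | over all pairs."""
    images = [project(matrix, p) for p in points]
    worst = 0.0
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            ratio = distance(images[i], images[j]) / distance(points[i], points[j])
            worst = max(worst, abs(ratio - 1.0))
    return worst
```

For example, projecting a handful of Gaussian points from dimension 100 down to dimension 300 or fewer rows and calling `max_distortion` shows how the worst pairwise distortion shrinks roughly like $1/\sqrt{m}$ as the number of rows $m$ grows.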

### 2.5.2 Lower Bounds on Dimensionality Reduction

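A lower bound on dimensionality reduction was initially proved by Jayram and Woodruff (Jayram and Woodruff 2013). Thereafter Kane et al. (Kane et al. 2011) also showed a lower bound. The lower bound states that any $(\varepsilon, \delta)$-DJL for $\varepsilon, \delta \in (0, 1/2)$ must have $m = \Omega(\min\{n,\ \varepsilon^{-2} \log(1/\delta)\})$. Since for all $x$ we have the probabilistic guarantee $P_{\Pi \sim \mathcal{D}_{\varepsilon,\delta}}\left(\left|\|\Pi x\|_2^2 - 1\right| < \varepsilon\right) \ge 1 - \delta$, it is true also for any distribution over $x$. If we select $x$ according to the uniform distribution over the sphere, then this implies that there exists a matrix $\Pi$ such that $P_x\left(\left|\|\Pi x\|_2^2 - 1\right| < \varepsilon\right) \ge 1 - \delta$. It is proved that this cannot happen for any fixed matrix $\Pi \in \mathbb{R}^{m \times n}$ unless $m$ satisfies the lower bound.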

In the case of linear maps, a lower bound of Larsen and Nelson (Larsen and Nelson 2014) shows us that $m$ should be at least $\Omega(\min\{n,\ \varepsilon^{-2} \log N\})$ for a hard set of $N$ points in $\mathbb{R}^n$. The hard set is shown to exist via the probabilistic method, and is constructed by taking the union of $\{0, e_1, \ldots, e_n\}$ in $\mathbb{R}^n$ together with sufficiently many random vectors. Now we take a fine finite net that approximates all possible linear maps that work for the simplex (i.e. $\{e_1, \ldots, e_n\}$) and then argue that for each linear map in the net, with some good probability there exists a point among the random ones whose length is not preserved by the map. Therefore, approximating the set of linear maps and tuning the parameters, we can take a union bound over all matrices in the net.

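A further lower bound was proved by Alon (Alon 2003). It shows that $m$ must be at least $\Omega\left(\varepsilon^{-2} \log n / \log(1/\varepsilon)\right)$ to preserve distances between the set of points $\{0, e_1, \ldots, e_n\}$. The hard set is the simplex $X = \{0, e_1, \ldots, e_n\}$. Let $f$ be the mapping. Transform the embedding so that $f(0) = 0$ (i.e. translate the embedding so that each point $x$ is actually mapped to $f(x) - f(0)$; this does not affect any distances). Now write $v_i = f(e_i)$. Then we have that $\|v_i\|_2 = 1 \pm \varepsilon$ since $f$ preserves the distance from $e_i$ to $0$. We also have for $i \ne j$ that

$$\|v_i - v_j\|_2^2 = \|v_i\|_2^2 + \|v_j\|_2^2 - 2\langle v_i, v_j \rangle.$$

This implies, since $\|e_i - e_j\|_2^2 = 2$ is preserved up to a factor $1 \pm O(\varepsilon)$, that

$$|\langle v_i, v_j \rangle| = O(\varepsilon).$$

Setting $w_i = v_i / \|v_i\|_2$ we have that $\|w_i\|_2 = 1$ and $|\langle w_i, w_j \rangle| = O(\varepsilon)$. Let $\Pi$ be the matrix that has the $w_i$ as columns. Then observe that $A = \Pi^T \Pi$ is an $n \times n$ matrix with 1 on the diagonal and elements with absolute value at most $O(\varepsilon)$ everywhere else. We have that

$$\mathrm{rank}(A) \le \mathrm{rank}(\Pi) \le m.$$

Let us consider the following lemma, which solves the problem for $\varepsilon = 1/\sqrt{n}$; we then bootstrap it to work for all values of $\varepsilon$.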

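**Lemma 2.7** A real symmetric $n \times n$ matrix $A$ that is $\varepsilon$-near to the identity matrix, i.e. its diagonal entries are 1 and its off-diagonal entries lie in $[-\varepsilon, \varepsilon]$, has $\mathrm{rank}(A) \ge \frac{n}{1 + \varepsilon^2 (n - 1)}$. In particular, for $\varepsilon = 1/\sqrt{n}$ we get $\mathrm{rank}(A) \ge n/2$.

*Proof* Let $\lambda_1, \ldots, \lambda_r$ be the non-zero eigenvalues of $A$, where $r = \mathrm{rank}(A)$. By the Cauchy–Schwarz inequality, we have $r \ge \frac{\left(\sum_i \lambda_i\right)^2}{\sum_i \lambda_i^2}$. By linear algebra the numerator is the trace of $A$, squared, and the denominator is the Frobenius norm of $A$, squared. We have that $\mathrm{tr}(A) = n$ and $\|A\|_F^2 \le n + n(n-1)\varepsilon^2$. Pushing these into the inequality, along with the fact that $\varepsilon^2 (n - 1) < 1$ when $\varepsilon = 1/\sqrt{n}$, we prove the lemma.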

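**Theorem 2.8** A real symmetric $n \times n$ matrix $A$ that is $\varepsilon$-near to the identity matrix, with $1/\sqrt{n} \le \varepsilon < 1/2$, must have $\mathrm{rank}(A) = \Omega\left(\varepsilon^{-2} \log n / \log(1/\varepsilon)\right)$.

*Proof* Define the matrix $A^{(k)}$ such that $(A^{(k)})_{i,j} = a_{i,j}^k$. We will build our proof on the following claim: it holds that $\mathrm{rank}(A^{(k)}) \le \binom{r + k - 1}{k}$ where $r = \mathrm{rank}(A)$. Assuming that the claim is true, we pick $k$ to be the closest integer to $\log n / (2 \log(1/\varepsilon))$. Thus $\varepsilon^k \lesssim 1/\sqrt{n}$, so $A^{(k)}$ is (up to constants) $1/\sqrt{n}$-near to the identity and by Lemma 2.7 we have that $\binom{r + k - 1}{k} \ge \mathrm{rank}(A^{(k)}) \gtrsim n$. Using the fact that $\binom{a}{b} \le (ea/b)^b$ and walking through the calculations we can get the desired result.

What remains is to prove the claim. Let $t_1, \ldots, t_r$ be a basis of the row-space of $A$. This means that for every row $a_i$ of $A$ there exist coefficients $\beta_{i,1}, \ldots, \beta_{i,r}$ such that $a_i = \sum_{q=1}^{r} \beta_{i,q}\, t_q$. Then observe that $(A^{(k)})_{i,j} = \left(\sum_{q=1}^{r} \beta_{i,q}\, (t_q)_j\right)^k$. It is easy to see that each row vector of this form is a linear combination of vectors of the form $\left((t_1)_j^{d_1} \cdots (t_r)_j^{d_r}\right)_j$, where $\sum_{q} d_q = k$. Counting these is a standard combinatorics problem of putting $k$ balls into $r$ bins with repetition, so the answer is $\binom{r + k - 1}{k}$.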
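The counting step at the end of the proof, the number of exponent vectors $(d_1, \ldots, d_r)$ with $\sum_q d_q = k$, can be verified by brute force for small parameters. The helper name below is hypothetical; the enumeration relies on the standard-library `itertools.combinations_with_replacement`, each output of which corresponds to one monomial of degree $k$ in $r$ variables.

```python
from itertools import combinations_with_replacement
from math import comb


def count_monomials(r, k):
    """Number of degree-k monomials in r variables, by explicit enumeration."""
    return sum(1 for _ in combinations_with_replacement(range(r), k))


def rank_bound(r, k):
    """Upper bound on rank(A^(k)) from the claim: C(r + k - 1, k)."""
    return comb(r + k - 1, k)
```

For every small `r` and `k` the two functions agree, which is exactly the stars-and-bars count used in the claim.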

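Given a subset $T$ of the unit sphere, for example $T = \{(x_i - x_j)/\|x_i - x_j\|_2 : i < j\}$, ideally we would like that for all $x \in T$, $\left|\|\Pi x\|_2^2 - 1\right| < \varepsilon$. We want that $P\left(\sup_{x \in T} \left|\|\Pi x\|_2^2 - 1\right| > \varepsilon\right) < \delta$.

**Definition 2.3** The Gaussian mean width of a set $T$ is defined as $g(T) = E_g \sup_{x \in T} \langle g, x \rangle$, where $g$ is a vector of independent standard Gaussians.

Suppose that $\Pi_{i,j} = \pm 1/\sqrt{m}$ for random signs, with $m \gtrsim \varepsilon^{-2}\left(g^2(T) + \log(1/\delta)\right)$. Then we have that $P\left(\sup_{x \in T} \left|\|\Pi x\|_2^2 - 1\right| > \varepsilon\right) < \delta$. Actually, we just need a distribution that decays as fast as a Gaussian, has variance one and zero mean.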

Let us give a simple example of the Gaussian mean width. For example, if $T$ is the simplex $\{e_1, \ldots, e_n\}$ then we have that $g(T) = E \max_i g_i$, which is roughly equal to $\sqrt{\log n}$ by standard computations. Actually, what Gordon's theorem tells us is that if the vectors of $T$ have a nice geometry then one can improve upon Johnson–Lindenstrauss. The more well-clustered the vectors are, the lower the dimension one can achieve.
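The simplex example can be checked with a short Monte Carlo estimate. This is a rough sketch under stated assumptions: the function name is hypothetical, the estimate uses plain sampling with the standard-library generator, and the comparison value $\sqrt{2 \ln n}$ is the usual upper bound on the expected maximum of $n$ standard Gaussians.

```python
import math
import random


def mean_width_of_basis_vectors(n, trials, rng):
    """Monte Carlo estimate of g(T) for T = {e_1, ..., e_n}.

    For this T, sup_x <g, x> is simply the largest coordinate of g.
    """
    total = 0.0
    for _ in range(trials):
        total += max(rng.gauss(0.0, 1.0) for _ in range(n))
    return total / trials


def sqrt_log_bound(n):
    """The classical upper bound sqrt(2 ln n) on E max_i g_i."""
    return math.sqrt(2.0 * math.log(n))
```

Running the estimate for increasing `n` shows the slow $\sqrt{\log n}$ growth, always staying below `sqrt_log_bound(n)` up to sampling noise.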

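We continue with the following claim: for all $T$ which is a subset of the unit sphere, $g(T) \lesssim \sqrt{\log N}$, where $N = |T|$.

*Proof*

Define $Z_x = \langle g, x \rangle$, where $g \sim N(0, I)$. Then

$$g(T) = E \sup_{x \in T} Z_x \lesssim \sqrt{\log |T|},$$

as we have illustrated. In fact if every vector in $T$ has norm at most $\alpha$, then one gets $g(T) \lesssim \alpha \sqrt{\log |T|}$. Let $T' \subseteq T$ be such that for all $x \in T$ there exists $x' \in T'$ such that $\|x - x'\|_2 \le \varepsilon$.

That is, $T'$ is an $\varepsilon$-net of $T$. This implies that $\langle g, x \rangle = \langle g, x' \rangle + \langle g, x - x' \rangle \le \langle g, x' \rangle + \varepsilon \|g\|_2$, which implies that

$$g(T) \le g(T') + \varepsilon\, E\|g\|_2 \lesssim \sqrt{\log |T'|} + \varepsilon \sqrt{n}.$$

Thus if $T$ is covered well by a small net, one can get a better bound. Let $T_0 \subseteq T_1 \subseteq \cdots \subseteq T$, such that $T_r$ is a $2^{-r}$-net of $T$ (we are assuming every vector in $T$ has at most unit norm). Then $x = x^{(0)} + \sum_{r \ge 1} \left(x^{(r)} - x^{(r-1)}\right)$, where $x^{(r)}$ is the closest point to $x$ in $T_r$. Then we have

$$g(T) \le E \sup_{x \in T} \langle g, x^{(0)} \rangle + \sum_{r \ge 1} E \sup_{x \in T} \langle g, x^{(r)} - x^{(r-1)} \rangle \lesssim \sqrt{\log |T_0|} + \sum_{r \ge 1} 2^{-r} \sqrt{\log\left(|T_r|\,|T_{r-1}|\right)} \lesssim \sum_{r \ge 0} 2^{-r} \sqrt{\log |T_r|}.$$

The last inequality holds since by the triangle inequality, $\|x^{(r)} - x^{(r-1)}\|_2 \le \|x^{(r)} - x\|_2 + \|x - x^{(r-1)}\|_2 \le 3 \cdot 2^{-r}$. Furthermore, $|T_{r-1}| \le |T_r|$, so $\log\left(|T_r|\,|T_{r-1}|\right) \le 2 \log |T_r|$.

Thus $g(T) \lesssim \sum_{r \ge 0} 2^{-r} \log^{1/2} \mathcal{N}(T, \ell_2, 2^{-r})$, where $\mathcal{N}(T, d, u)$ is the size of the best $u$-net of $T$ under metric $d$. Bounding this sum by an integral, we have that $g(T)$ is at most a constant factor times $\int_0^\infty \log^{1/2} \mathcal{N}(T, \ell_2, u)\, du$. This inequality is called Dudley's inequality.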

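Write $S_0 \subseteq S_1 \subseteq \cdots \subseteq T$, such that $|S_0| = 1$ and $|S_r| \le 2^{2^r}$. One can show that the Dudley bound is in fact

$$\inf_{\{S_r\}} \sum_{r \ge 0} 2^{r/2} \sup_{x \in T} d_{\ell_2}(x, S_r).$$

Write

$$\gamma_2(T) = \inf_{\{S_r\}} \sup_{x \in T} \sum_{r \ge 0} 2^{r/2}\, d_{\ell_2}(x, S_r).$$

It was shown by Fernique (Fernique 1975) that $g(T) \lesssim \gamma_2(T)$ for all $T$. Talagrand proved in (Talagrand 1996) that the lower bound is also true, and hence $g(T) = \Theta(\gamma_2(T))$; this is known as the "majorizing measures" theorem.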

### 2.5.3 Dimensionality Reduction for k-Means Clustering

Clustering is ubiquitous in science and engineering, with application areas ranging from social science and medicine to biology and the web. The most well-known clustering algorithm is the so-called *k*-means algorithm (Lloyd 1982). This method is an iterative expectation-maximization type approach that attempts to address the following objective. Given a set of Euclidean points and a positive integer *k* corresponding to the number of clusters, split the points into *k* clusters so that the total sum of the squared Euclidean distances of each point to its nearest cluster center is minimized.

In recent years, the high dimensionality of enormous datasets has created a significant challenge to the design of efficient algorithmic solutions for *k*-means clustering. First, ultra-high dimensional data force existing algorithms for *k*-means clustering to be computationally inefficient, and second, the existence of many irrelevant features may not allow the identification of the relevant underlying structure in the data (Guyon et al. 2005). Researchers have addressed these obstacles by introducing feature selection and feature extraction techniques. Feature selection selects a (small) subset of the actual features of the data, whereas feature extraction constructs a (small) set of artificial features based on the original features.

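Consider *m* points $P = \{p_1, p_2, \ldots, p_m\} \subseteq \mathbb{R}^n$ and an integer *k* denoting the number of clusters. The objective of *k*-means is to find a *k*-partition of *P* such that points that are “close” to each other belong to the same cluster and points that are “far” from each other belong to different clusters. A *k*-partition of *P* is a collection $\mathcal{S} = \{S_1, S_2, \ldots, S_k\}$ of *k* non-empty pairwise disjoint sets which covers *P*. Let $s_j = |S_j|$ be the size of $S_j$. For each set $S_j$, let $\mu_j \in \mathbb{R}^n$ be its centroid:

$$\mu_j = \frac{1}{s_j} \sum_{p_i \in S_j} p_i. \tag{2.8}$$

The *k*-means objective function is

$$F(P, \mathcal{S}) = \sum_{i=1}^{m} \left\| p_i - \mu(p_i) \right\|_2^2, \tag{2.9}$$

where $\mu(p_i)$ is the centroid of the cluster to which $p_i$ belongs. The objective of *k*-means clustering is to compute the optimal *k*-partition of the points in *P*,

$$\mathcal{S}_{opt} = \operatorname{argmin}_{\mathcal{S}} F(P, \mathcal{S}). \tag{2.10}$$

Now, the goal of dimensionality reduction for *k*-means clustering is to construct points

$$\hat{P} = \{\hat{p}_1, \hat{p}_2, \ldots, \hat{p}_m\} \subseteq \mathbb{R}^r \tag{2.11}$$

(for some parameter $r \ll n$) so that $\hat{P}$ approximates the clustering structure of *P*. Dimensionality reduction via feature selection constructs the $\hat{p}_i$’s by selecting actual features of the corresponding $p_i$’s, whereas dimensionality reduction via feature extraction constructs new artificial features based on the original features.

Assume that the optimum *k*-means partition of the points in $\hat{P}$ has been computed:

$$\hat{\mathcal{S}}_{opt} = \operatorname{argmin}_{\mathcal{S}} F(\hat{P}, \mathcal{S}). \tag{2.12}$$

A dimensionality reduction algorithm for *k*-means clustering constructs a new set $\hat{P}$ such that

$$F(P, \hat{\mathcal{S}}_{opt}) \le \gamma \cdot F(P, \mathcal{S}_{opt}), \tag{2.13}$$

where $\gamma > 0$ is the approximation ratio of $\hat{\mathcal{S}}_{opt}$. In other words, we require that computing an optimal partition $\hat{\mathcal{S}}_{opt}$ on the projected low-dimensional data and plugging it back to cluster the high-dimensional data implies a $\gamma$-factor approximation to the optimal clustering.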
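As a small illustration of the quantities involved, the following sketch evaluates the *k*-means objective of a given partition and the ratio between two partitions of the same points. The function names (`centroid`, `kmeans_objective`) and the toy data are hypothetical choices made for this example only; the sketch evaluates the objective and does not search for the optimal partition.

```python
from typing import List, Sequence

Point = Sequence[float]


def centroid(cluster: List[Point]) -> List[float]:
    """Coordinate-wise mean of the points in one cluster."""
    dim = len(cluster[0])
    return [sum(p[t] for p in cluster) / len(cluster) for t in range(dim)]


def kmeans_objective(partition: List[List[Point]]) -> float:
    """Sum of squared distances of every point to its own cluster centroid."""
    total = 0.0
    for cluster in partition:
        mu = centroid(cluster)
        for p in cluster:
            total += sum((x - c) ** 2 for x, c in zip(p, mu))
    return total


# Hypothetical toy data: four points in the plane, two candidate 2-partitions.
good = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 0.0), (10.0, 2.0)]]
bad = [[(0.0, 0.0), (10.0, 0.0)], [(0.0, 2.0), (10.0, 2.0)]]

cost_good = kmeans_objective(good)
cost_bad = kmeans_objective(bad)
# If `good` is the optimal partition, this ratio plays the role of the
# approximation ratio gamma achieved by `bad`.
ratio = cost_bad / cost_good
```

For this toy data the two costs are 4 and 100, so a reduction that led to the second partition would only achieve an approximation ratio of 25.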

### 2.6 Gordon’s Theorem

Given a set *S*, what measure of the complexity of *S* explains how many dimensions the projection needs in order to approximately preserve the norms of the points in *S*? This is captured by Gordon’s theorem. Oymak, Recht and Soltanolkotabi (Oymak et al. 2015) have shown that, with the right parameters, the Distributional Johnson–Lindenstrauss (DJL) lemma implies Gordon’s theorem. Basically take a DJL then for (where we hide constants in the inequalities) we get the guarantee of Gordon’s theorem. The proof works by preserving the sets (plus differences and sums of vectors in these sets) at different scales. The result is not exactly optimal because it is known suffices (see, for example, Dirksen 2015; Mendelson et al. 2007), but it provides a nice reduction from Gordon’s theorem to DJL. A classical result due to Gordon characterizes the precise trade-off between distortion, the “size” of the set and the amount of reduction in dimension for a subset of the unit sphere.

The main result is summarized in the following theorem.

**Theorem 2.9** (Oymak et al. 2015) Define , , , .

Let . Then if *D* satisfies -DJL for , then To illustrate why this implies Gordon’s theorem, we take the random sign matrix, e.g.,

. We know that this matrix satisfies -DJL for , which equals for all *r*. The theorem therefore applies and so we see that we get an guarantee with

. And since , this is approximately , which gives Gordon’s theorem. Different proofs yield that obviously suffices. To prove the above theorem, the lemma below suffices.

**Lemma 2.8 **For a given set *T*, let be the sequence that achieves the infimum in the

definition of . To achieve , it suffices that for all , the following hold simultaneously for all .

For all , (2.14)

For all , (2.15)

For all and , (2.16)

We also have (2.17)

Obviously, the first three conditions hold with high probability since they are all JL-type conditions. The third one is a bit less obvious since it is about dot products instead of norms. But notice that . So if , then preserving and means that

. If *u* and *v* don’t have unit norm you can scale them to achieve the above condition. So the third condition also follows from the DJL premise.

We now argue that the lemma suffices to prove the theorem.

**Claim** Lemma 2.8 implies Theorem 2.9.

*Proof*

Define Fix . We will show Define , and define So, .

Now define (2.18) We now proceed by bounding each of the four terms as follows.

Case : We have .

Case : We have

(2.19) We thus need to bound and .

We have . Next, we consider

(2.20) Now .

. Thus, using for ,

(2.21) Case : for . Thus Thus Case : By the triangle inequality, for any

(2.22) (2.23)

We have with the second inequality holding since for .

So, we also have Therefore Considering for all *r*,

Finally we arrive at where for .

### 2.7 Johnson–Lindenstrauss Transform

An enormous amount of the data stored and manipulated on computers can be represented as points in a high-dimensional space. However, algorithms for working with such data tend to become bogged down very rapidly as the dimension increases. It is therefore desirable to reduce the dimensionality of the data in a way that preserves its relevant structure. The Johnson–Lindenstrauss lemma is an important result in this respect.

Fast Johnson–Lindenstrauss Transform (FJLT) was introduced by Ailon and Chazelle in 2006 (Ailon and Chazelle 2009). We will discuss that transform in the next section. Another approach is to build a distribution supported over matrices that are sparse.

In high-dimensional computational geometry problems, one can employ the Johnson–Lindenstrauss transform to speed up an algorithm in two steps: (1) apply a Johnson–Lindenstrauss (JL) map to reduce the problem to a low dimension *m*, then (2) solve the lower-dimensional problem.

As *m* is made smaller, step (2) becomes faster. Yet one would also like step (1) to be as fast as possible. So far, the dimensionality reduction has been a dense matrix-vector multiplication (Nelson 2015).

There are two possible ways of speeding up step (1). One is to make the embedding matrix $\Pi$ sparse. This sometimes works: the AMS sketch, for example, can be replaced with a matrix each of whose columns has exactly one non-zero entry. The other way is to make $\Pi$ structured, i.e., it is still dense but has some structure that allows us to multiply by it faster.

Consider first the sparse approach. If $\Pi$ has *s* non-zero entries per column, then $\Pi x$ can be computed in time $O(s \cdot \|x\|_0)$, where $\|x\|_0 = |\{i : x_i \neq 0\}|$. The aim is then to make *s* and *m* as small as possible.

Achlioptas (Achlioptas 2003) showed that choosing the entries $\Pi_{ij}$ independently, equal to $+\sqrt{3/m}$ or $-\sqrt{3/m}$ with probability 1/6 each and to 0 with probability 2/3, gives DJL with $m = O(\varepsilon^{-2} \log(1/\delta))$, unchanged up to constant factors. It also provides a factor-3 speed-up, since in expectation only one third of the entries in $\Pi$ are non-zero. On the other hand, (Matousek 2008) proved that if $\Pi$ has independent entries then you cannot speed things up by more than a constant factor.

The first to exhibit a $\Pi$ without independent entries, and therefore to break this lower bound, were (Dasgupta et al. 2010), who got $m = O(\varepsilon^{-2} \log(1/\delta))$ with $s = \tilde{O}(\varepsilon^{-1} \log^3(1/\delta))$ non-zeros per column of $\Pi$. So depending on the parameters this could either be an improvement or not.

Kane and Nelson (Kane and Nelson 2014) showed that one can take $m = O(\varepsilon^{-2} \log(1/\delta))$ and $s = O(\varepsilon^{-1} \log(1/\delta))$, a strict improvement, by choosing exactly *s* entries in each column of $\Pi$ to be non-zero, choosing the signs of those entries at random, and normalizing appropriately. Alternatively, one can break each column of $\Pi$ into *s* blocks of size *m* / *s* and place exactly one non-zero entry in each block. The resulting matrix is exactly the count-sketch matrix.

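The analysis employs the Hanson–Wright inequality. For dense $\Pi$, we have seen that $\|\Pi x\|_2^2 = \|A_x \sigma\|_2^2$, where $A_x$ was an $m \times mn$ matrix whose *i*th row had $x^T / \sqrt{m}$ in the *i*th block of size *n* and zeros elsewhere. Then we said $\|A_x \sigma\|_2^2 = \sigma^T A_x^T A_x \sigma$, which was a quadratic form.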

We shall write where is a random variable indicating whether the corresponding entry of was chosen to be non-zero. (So the are not independent.) For every , define *x*(*r*) by . The claim is now that where is an matrix whose *i*th row contains in the *i*th block of size *n* and zeros elsewhere. Using the inequality, we observe that is a block-diagonal matrix as before. And since we’re bounding the difference between and its expectation, it is equivalent to bound where is with its diagonals zeroed out.

Now condition on and recall that the inequality says that for all , . Then, taking *p*-norms with respect to the and using the triangle inequality, we obtain the bound

If we can bound the right-hand-side, we will obtain required DJL result by application of Markov’s inequality, since is positive. Therefore, it suffices to bound the p-norms with respect to the of the operator and Frobenius norms of .

Since is block-diagonal and its *i*th block is where is the diagonal of , we have . But the operator norm of the difference of positive-definite matrices is at most the max of either operator norm. Since both matrices have operator norm at most 1, this concludes always.

See that we defined as the center of the product from before, but with the diagonals zeroed out. is a block-diagonal matrix with *m* blocks with We can state the Frobenius norm as where we define the expression in the parentheses to be .

**Claim** Let us assume the claim and show that the Frobenius norm is correct.

Now,

*Proof *Let us just fix column *i*. It has *s* nonzero elements somewhere. There’s another column

*j*, and the question is how many of the nonzero locations of *i* match with nonzero elements of *j*.

Let’s have be an indicator random variable for column *j* having a nonzero element in the *t*th nonzero row of *i*. Then .

However, the moments are dominated by the independent case.

The expected value of any is *s* / *n*. The product at the end is just in the independent case. Here, it is a conditional product,

So the sum is dominated by the independent case, which can be handled via Bernstein’s inequality.

The runtime to apply the sparse JL map to a vector $x$ is $O(s \cdot \|x\|_0)$, where $\|x\|_0$ is the number of non-zero entries of $x$.
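The block construction described in this section (each column split into *s* blocks of *m* / *s* rows, with one random sign placed in each block) can be sketched as follows. This is a minimal illustration under stated assumptions, not the construction analysed in the cited papers: the names `make_sparse_jl` and `apply_sparse_jl` are hypothetical, the randomness is fully independent (the analyses allow limited independence), and *s* is assumed to divide *m*.

```python
import math
import random
from typing import Dict, List, Tuple

Column = List[Tuple[int, float]]  # (row index, value) pairs of one column


def make_sparse_jl(n: int, m: int, s: int, seed: int = 0) -> List[Column]:
    """Store an m x n sparse sign matrix column by column.

    Each column has exactly s non-zeros, one per block of m // s rows,
    each equal to +1/sqrt(s) or -1/sqrt(s).
    """
    if m % s != 0:
        raise ValueError("this sketch assumes that s divides m")
    rng = random.Random(seed)
    block = m // s
    scale = 1.0 / math.sqrt(s)
    columns: List[Column] = []
    for _ in range(n):
        col: Column = []
        for b in range(s):
            row = b * block + rng.randrange(block)  # random row inside block b
            sign = rng.choice((-1.0, 1.0))
            col.append((row, sign * scale))
        columns.append(col)
    return columns


def apply_sparse_jl(columns: List[Column], m: int, x: Dict[int, float]) -> List[float]:
    """Compute the product with a sparse vector x given as {index: value}.

    Only the non-zero coordinates of x are visited, and each one touches
    s rows, so the work is proportional to s times the number of non-zeros.
    """
    y = [0.0] * m
    for j, xj in x.items():
        for row, value in columns[j]:
            y[row] += value * xj
    return y
```

A call such as `apply_sparse_jl(make_sparse_jl(50, 32, 4), 32, {7: 2.0})` maps a vector with a single non-zero coordinate to a vector with exactly the same Euclidean norm, because the *s* entries of a column each contribute $1/s$ to the squared norm.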

### 2.8 Fast Johnson–Lindenstrauss Transform

We now take another approach, which gives a running time of roughly $O(n \log n)$ and is better in cases where *x* is dense. It is based on Ailon and Chazelle (Ailon and Chazelle 2009) and is called the Fast Johnson–Lindenstrauss Transform (FJLT). The original construction in this area is due to Ailon and Chazelle (though fast Fourier ideas have existed much longer), but there are many others. Theoretically, they are all the same to a first approximation; practically, there can be a big difference between them.

Here is the definition of : where *P* is an sampling matrix (very sparse matrix in expectation, only a fraction of the elements are nonzero). *H* is times an orthogonal matrix, i.e. . Also

, and computing *Hx* should be fast for any *x*. *D* is an diagonal matrix with random signs along the diagonal. So, *HD* is an orthogonal matrix, meaning in particular that the Euclidean norm of vectors to which it is applied does not change.

We will use an $n \times n$ diagonal sampling matrix whose *i*th diagonal entry equals 1 with probability *m* / *n* and 0 otherwise, with the entries independent across *i*. An example of *H* is the unnormalized discrete Fourier transform. Another possibility for *H* is the unnormalized Hadamard matrix, where $H_{ij} = (-1)^{\langle i, j \rangle}$. Here *n* is a power of 2 and we interpret *i*, *j* as elements of $\{0, 1\}^{\log_2 n}$. Both of these matrices allow *Hx* to be computed in time $O(n \log n)$. In general, orthogonal matrices *F* whose entries all have magnitude $O(1/\sqrt{n})$ are called *bounded orthonormal systems*.
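To make the fast multiplication concrete, here is a minimal sketch of the fast Walsh–Hadamard transform, which applies the unnormalized Hadamard matrix to a vector using about $n \log_2 n$ additions instead of $n^2$ multiplications. The helper names (`fwht`, `hadamard_entry`, `randomized_hadamard`) are hypothetical, and the last function only illustrates the product *HDx* with random signs; it is not the complete transform of Ailon and Chazelle.

```python
import random
from typing import List


def fwht(x: List[float]) -> List[float]:
    """Unnormalized Walsh-Hadamard transform of x; len(x) must be a power of 2."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of 2")
    y = list(x)
    h = 1
    while h < n:
        # Butterfly step: combine entries that are h apart.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = y[i], y[i + h]
                y[i], y[i + h] = a + b, a - b
        h *= 2
    return y


def hadamard_entry(i: int, j: int) -> int:
    """Entry (-1)^<i, j>, with i and j read as bit vectors."""
    return -1 if bin(i & j).count("1") % 2 else 1


def randomized_hadamard(x: List[float], rng: random.Random) -> List[float]:
    """Flip the sign of each coordinate at random (D), then apply H."""
    signs = [rng.choice((-1.0, 1.0)) for _ in x]
    return fwht([s * v for s, v in zip(signs, x)])
```

Since the unnormalized *H* satisfies $H^T H = nI$, the output of `randomized_hadamard` has squared norm exactly *n* times that of the input; dividing by $\sqrt{n}$ recovers an orthogonal map.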

In (Ailon and Chazelle 2009) the following is shown. The key ingredient of their argument is the well-known Khintchine inequality from functional analysis.

**Claim**

If we restrict so that the above claim holds, then Bernstein implies that for we will have with probability . Thus by a union bound, the overall failure probability is .

Suppose we want to have rows, we can do this by using the matrix , where is for example a dense random sign matrix with rows.

The total time to apply is then . Rather different analysis can improve the dependence in *m* to be as follows.

**Theorem 2.10**

Let be an arbitrary unit norm vector, and suppose . Also let as described above with a number of rows equal to . Then

*Proof*

Let be an independent copy of , and let be uniformly random. Write so that .

(2.24) (2.25)

We will now bound . Define and see . Then

(2.26)

*H* is an unnormalized bounded orthonormal system.

Establishing and integrating above equations, we find that for some constant implying . By the Markov inequality and thus to achieve the theorem statement it suffices to set then choose

.

*Remark 2.3* The Fast Johnson–Lindenstrauss Transform gives a suboptimal *m*. To obtain the optimal *m*, one can compose the FJLT with a second embedding matrix, say a dense matrix with Rademacher entries having the optimal number of rows. In (Ailon and Chazelle 2009), this is improved by replacing the matrix *S* with a random sparse matrix *P*.

*Remark 2.4 *The analysis for the FJLT, such as the approach in (Ailon and Chazelle 2009),

would achieve a bound on *m* of . Such analyses operate by, using the notation of the proof of Theorem

, first conditioning on , then completing the proof using Bernstein’s inequality.

### 2.9 Sublinear-Time Algorithms: An Example

In this example, we discuss a type of approximation that makes sense for outputs of decision problems.

*Example 2.1* (sequence monotonicity, version 1) Given an ordered list $x_1, \ldots, x_n$ of elements (with partial order ‘$\le$’ on them), decide whether the list is monotone, i.e., whether $x_1 \le x_2 \le \cdots \le x_n$.

Instead of looking at each single sequence element, we consider the following version.

*Example 2.2* (sequence monotonicity, version 2) Given an ordered list $x_1, \ldots, x_n$ of elements (with partial order ‘$\le$’ on them) and a real fraction $\varepsilon \in [0, 1]$, decide whether the list is *$\varepsilon$-close to monotone*. A list is *$\varepsilon$-close to monotone* if it has a monotone subsequence of length at least $(1 - \varepsilon) n$; otherwise it is *$\varepsilon$-far from monotone*.

If the list is monotone, the test should pass with probability at least 3 / 4. If the list is $\varepsilon$-far from monotone, the test should fail with probability at least 3 / 4.

*Remark 2.5* The choice of 3 / 4 is arbitrary; any constant bounded away from 1 / 2 works equally well. We can boost the success probability from this constant to a different constant $1 - \beta$ by repeating the algorithm $O(\log(1/\beta))$ times and taking the majority answer.

*Remark 2.6 *The behaviour of the test on inputs that are very close to monotone, but are not

monotone, is undefined. (Those inputs are -close with .) We present some cases below: Select randomly and test .

We will show that complexity of such case is . For constant *c*: sequence to fail the test. Interestingly the test passes when it selects *i*, *j* from different groups.

If the test is repeated by repeatedly selecting new pairs *i*, *j*, each time discarding the old pair, and checking each such pair independently of the others, then pairs are needed.

However, if the test is repeated by selecting *k* indices and testing whether the subsequence induced by them is monotone, then samples are required using the Birthday Paradox. The Birthday Paradox states that in a random group of people, there is about a 50 % chance that two people have the same birthday. There are many reasons why this seems like a paradox.

In other way, select *i* randomly and test . For some constant *c*, and consider the following sequence (of *n* elements): Now, the longest monotone subsequence has length — rather small, therefore we expect this sequence to fail the test. However, the test passes unless the *i* it selects is a

*border point*, which happens with probability *c* / *n*.

Therefore we expect to need a linear number of samples before detecting an input that should be rejected. This would check that the sequence is locally monotone, and also monotone at large distances, but would not verify that it is monotone in middle-range gaps, and counter-examples can be found. However, there exists a correct algorithm with $O(\log n)$ samples that works by testing pairs at different distances $1, 2, 4, \ldots, 2^k$.

**Lemma 2.9** Without loss of generality, $x_1, \ldots, x_n$ are pairwise distinct.

*Proof* Replace each $x_i$ by the tuple $(x_i, i)$ and use dictionary order to compare ‘$(x_i, i)$’ to ‘$(x_j, j)$’: compare the first coordinate and use the second coordinate to break ties.

*Remark 2.7* This transformation does not harm the sublinearity of the algorithm because it does not require any pre-processing; it can be done on the fly as each element is examined and compared.

We use the following notation. ‘$[n]$’ indicates the set $\{1, \ldots, n\}$ of positive integers. ‘$\in_R$’ indicates assignment of a random member of the set on its right hand side (RHS) to the variable on its left hand side (LHS). If the distribution is not given, it is the uniform distribution. For example, ‘$x \in_R [3]$’ assigns to *x* one of the three smallest positive integers, chosen uniformly.

The procedure is:

- Repeat $O(1/\varepsilon)$ times:
  - Select an index $i \in [n]$ uniformly at random.
  - Query (obtain) the value $x_i$.
  - Do a binary search for $x_i$.
  - If either an inconsistency was found during the binary search, or $x_i$ was not found, then return **fail**.
- Return **pass**.

During the binary search, we maintain an interval of allowed values for the next value we query. The interval begins as $(-\infty, +\infty)$. The upper and lower bounds are updated whenever we take a step to the left or to the right, respectively. Whenever we query a value, we check that it lies in the interval and raise an inconsistency if it does not. This algorithm’s time complexity is $O(\frac{1}{\varepsilon} \log n)$, since the augmented binary search and the choosing of a random index cost $O(\log n)$ steps each, and those are repeated $O(1/\varepsilon)$ times. We will now show that the algorithm satisfies the required behavior. We will define which indices are ‘good’ and relate the number of bad indices to the length of a monotone sequence of elements at good indices.
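The tester described above can be sketched in a few lines. This is an illustrative sketch rather than a reference implementation: the names `is_good` and `monotonicity_test` and the constant `c = 2` in the number of repetitions are assumptions made here, and ties are broken by comparing the pairs (value, index) so that all elements can be treated as distinct.

```python
import math
import random
from typing import Optional, Sequence


def is_good(xs: Sequence[float], i: int) -> bool:
    """Augmented binary search for xs[i].

    Returns True if the search ends at index i and every queried value lies
    strictly inside the current interval of allowed values.
    """
    def key(k: int):
        return (xs[k], k)  # (value, index) pairs are pairwise distinct

    target = key(i)
    lo, hi = 0, len(xs) - 1
    lower, upper = None, None  # None stands for -infinity / +infinity
    while lo <= hi:
        mid = (lo + hi) // 2
        v = key(mid)
        if (lower is not None and v <= lower) or (upper is not None and v >= upper):
            return False  # inconsistency: value outside the allowed interval
        if v == target:
            return True  # keys are distinct, so mid == i here
        if target < v:
            hi, upper = mid - 1, v  # step left: tighten the upper bound
        else:
            lo, lower = mid + 1, v  # step right: tighten the lower bound
    return False  # the value was not found


def monotonicity_test(xs: Sequence[float], eps: float, c: int = 2,
                      rng: Optional[random.Random] = None) -> bool:
    """Return True (pass) or False (fail) after about c / eps random probes."""
    rng = rng or random.Random()
    for _ in range(math.ceil(c / eps)):
        i = rng.randrange(len(xs))
        if not is_good(xs, i):
            return False
    return True
```

On a sorted list every index is good, so the test always passes. On a strictly decreasing list of length 100 only the middle index survives the augmented search, so the test fails as soon as it probes any other position.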

**Definition 2.4** An index *i* is *good* if the augmented binary search for $x_i$ succeeds.

*Remark 2.8* If at least $\varepsilon n$ indices are bad, then $\Pr[\text{pass}] < 1/4$.

*Proof* Let *c* be the constant under the ‘$O(1/\varepsilon)$ repetitions’ clause. Then

$$\Pr[\text{pass}] \le (1 - \varepsilon)^{c/\varepsilon} < e^{-c} < \frac{1}{4}, \tag{2.27}$$

where the last inequality follows by setting *c* to a large enough constant value.

**Claim** The algorithm accepts good inputs with probability 1 and rejects bad inputs with probability at least 3 / 4.

*Proof* When the list is monotone, it passes with certainty since the binary search works and the $x_i$ are considered distinct. It remains to show that “far from monotone” lists are rejected with high probability.

Suppose that an input passes with probability greater than 1 / 4; we shall prove that it is $\varepsilon$-close to monotone. By Eq. (2.27), the number of bad indices is less than $\varepsilon n$. Hence more than $(1 - \varepsilon) n$ indices are good.

**Claim** If we delete every element at a bad index, the remaining sequence is monotone.

*Proof* Let $i < j$ be two good indices. Consider the paths in the binary-search tree from the root to *i* and to *j*. These two paths have a longest common prefix; let *z* be the last index on it. It is enough to prove that $x_i \le x_z \le x_j$. When the path to *i* is a prefix of the path to *j*, then $z = i$ and $x_i \le x_z$ holds trivially. Otherwise, *i* is a descendant of *z*’s left or right child. Since $i < j$, *i* must be a descendant of *z*’s left child; since *i* is good, $x_i$ must be smaller than $x_z$. Thus $x_i \le x_z$ always. By symmetry, $x_z \le x_j$. Hence $x_i \le x_j$.

### 2.10 Minimum Spanning Tree

Let us consider a connected undirected graph $G = (V, E)$ where the degree of each vertex is at most *d*. In addition, each edge (*i*, *j*) has an integer weight $w_{ij} \in \{1, \ldots, w\} \cup \{\infty\}$. The graph is given in an adjacency list format, and edges of weight $\infty$ do not appear in it. The aim is to find the weight of a minimum spanning tree (MST) of *G*. Specifically, if we let $w(T) = \sum_{(i,j) \in T} w_{ij}$ for $T \subseteq E$, then our objective is to find

$$M = \min_{T \text{ spans } G} w(T).$$

In other words, our objective is to select a subset of the edges of minimum total length such that all the vertices are connected. It is immediate that the resulting set of edges forms a spanning tree: every vertex must be included, and cycles do not improve connectivity and only increase the total length. Therefore, the problem is to find a spanning tree of minimum total length.

Many greedy procedures work for this problem. One can either start with the empty graph and consecutively add edges while avoiding forming cycles, or start with the entire graph and consecutively remove edges while maintaining connectivity. The crucial aspect is the order in which edges are considered for addition or deletion. We present three basic greedy procedures below, all of which lead to optimal tree constructions:

- *Kruskal’s algorithm*: Consider edges in increasing order of length, and pick each edge that does not form a cycle with the edges already picked.
- *Prim’s algorithm*: Start with an arbitrary node and call it the root component; at every step, grow the root component by adding to it the shortest edge that has exactly one end-point in the component.
- *Reverse delete*: Start with the entire graph, and consider edges for deletion in order of decreasing lengths. Remove an edge as long as the deletion does not disconnect the graph.
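As an illustration of the first of these procedures, the sketch below implements Kruskal's algorithm with a simple union-find structure to detect cycles. The function name `kruskal`, the edge representation `(weight, u, v)` and the example graph are hypothetical choices for this example.

```python
from typing import List, Tuple

Edge = Tuple[int, int, int]  # (weight, u, v) with vertices numbered 0..n-1


def kruskal(n: int, edges: List[Edge]) -> List[Edge]:
    """Return the edges of a minimum spanning forest of an n-vertex graph."""
    parent = list(range(n))

    def find(v: int) -> int:
        # Follow parent pointers to the representative, halving the path.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    tree: List[Edge] = []
    for weight, u, v in sorted(edges):  # increasing order of weight
        ru, rv = find(u), find(v)
        if ru != rv:  # the edge joins two components, so it closes no cycle
            parent[ru] = rv
            tree.append((weight, u, v))
    return tree


# Hypothetical example graph on 5 vertices.
example_edges = [(1, 0, 1), (2, 1, 2), (2, 0, 2), (1, 2, 3), (3, 3, 4), (3, 1, 4)]
example_tree = kruskal(5, example_edges)
example_weight = sum(weight for weight, _, _ in example_tree)
```

For this example graph the tree has four edges and total weight 7. Sorting dominates the running time, so this exact procedure is not sublinear; it serves only as a baseline for the estimation problem discussed in this section.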

We now consider a fundamental algorithm for finding the MST, which proceeds in phases. In each iteration, the minimum weight edge on each vertex is added and the resulting connected components are collapsed to form new vertices. This algorithm can be implemented in the dynamic stream setting in passes by imitating each iteration in passes of the dynamic graph stream. In the first pass, we -sample an incident edge on each vertex without considering the weights. Suppose we sample an edge with weight on vertex *v*. In the next pass, we repeat sample incident edges but we ignore all edges of weight at least on vertex *v* when we create the sketch. Repeating this process assures that we succeed in finding the minimum weight edge incident on each vertex. Hence the algorithm takes passes as claimed.

Since we are interested in sub-linear time algorithms for this problem and therefore cannot hope to find *M* exactly, we focus on finding a $(1 \pm \varepsilon)$-multiplicative estimate of *M*, that is, a weight $\hat{M}$ which satisfies

$$(1 - \varepsilon) M \le \hat{M} \le (1 + \varepsilon) M.$$

We see that $n - 1 \le M \le w (n - 1)$, where $n = |V|$. This follows since *G* is connected, and thus any spanning tree of it consists of $n - 1$ edges, and by the premise on the input weights.

In what follows, we relate the weight of a MST of *G* to the number of connected components in certain subgraphs of *G*. We begin by introducing the following notation for a graph *G*:

Let $G^{(i)}$ be the subgraph of *G* that consists of the edges having a weight of at most *i*. Let $c^{(i)}$ be the number of connected components in $G^{(i)}$.

Let us consider two simple cases. The first case is when $w = 1$, namely, all the edges of *G* have a weight of 1. In this case, it is clear that the weight of a MST is $n - 1$. Now, let us consider the case that $w = 2$, and let us focus on $G^{(1)}$. Clearly, one has to use $c^{(1)} - 1$ edges (of weight 2) to connect the connected components in $G^{(1)}$. This implies that the weight of a MST in this case is

$$M = (n - 1) - (c^{(1)} - 1) + 2\,(c^{(1)} - 1) = n - 2 + c^{(1)}.$$

We extend and formalize the intuition presented above. Specifically, we characterize the weight of a MST of *G* using the $c^{(i)}$’s, for any integer *w*.

**Claim** $M = n - w + \sum_{i=1}^{w-1} c^{(i)}$.

*Proof* Let $\alpha_i$ be the number of edges of weight *i* in any MST of *G*. It is well known that all minimum spanning trees of *G* have the same number of edges of weight *i*, and hence the $\alpha_i$’s are well defined. It is easy to validate that the number of edges having weight greater than $\ell$ is equal to the number of connected components in $G^{(\ell)}$ minus 1. That is, $\sum_{i > \ell} \alpha_i = c^{(\ell)} - 1$, where $c^{(0)}$ is set to be *n*. Therefore

$$M = \sum_{i=1}^{w} i\,\alpha_i = \sum_{\ell=0}^{w-1} \sum_{i=\ell+1}^{w} \alpha_i = \sum_{\ell=0}^{w-1} \left( c^{(\ell)} - 1 \right) = n - w + \sum_{i=1}^{w-1} c^{(i)}.$$
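The relation between the MST weight and the component counts, namely that the weight equals $n - w$ plus the sum over $i = 1, \ldots, w - 1$ of the number of connected components of the subgraph restricted to edges of weight at most *i*, can be checked on a small graph. The sketch below counts components exactly with union-find, whereas a sublinear algorithm would replace each exact count by an estimate; the function names and the example graph are hypothetical.

```python
from typing import List, Tuple

Edge = Tuple[int, int, int]  # (weight, u, v) with vertices numbered 0..n-1


def count_components(n: int, edges: List[Edge], max_weight: int) -> int:
    """Number of connected components when only edges of weight <= max_weight are kept."""
    parent = list(range(n))

    def find(v: int) -> int:
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    components = n
    for weight, u, v in edges:
        if weight <= max_weight:
            ru, rv = find(u), find(v)
            if ru != rv:
                parent[ru] = rv
                components -= 1
    return components


def mst_weight_from_components(n: int, edges: List[Edge], w: int) -> int:
    """MST weight of a connected graph with integer weights in {1, ..., w}."""
    return n - w + sum(count_components(n, edges, i) for i in range(1, w))


# Hypothetical example: 5 vertices, weights in {1, 2, 3}.
example_edges = [(1, 0, 1), (2, 1, 2), (2, 0, 2), (1, 2, 3), (3, 3, 4), (3, 1, 4)]
example_mst_weight = mst_weight_from_components(5, example_edges, 3)
```

Here the subgraph of weight-1 edges has 3 components and the subgraph of edges of weight at most 2 has 2 components, so the formula gives 5 − 3 + 3 + 2 = 7, which is the weight of a minimum spanning tree of this graph.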

### 2.10.1 Approximation Algorithm

The approximation algorithm estimates the weight of the MST.

See that there are *w* calls to . Recall that the running time of this procedure is , and hence, the running time of is . It is worth noting that rather than extracting from *G* for each call of

, that makes the algorithm non-sublinear time, we simply modify so it ignores edges with weight greater than *i*.

We establish that with high probability. For this purpose, recall that outputs an estimation of the number of connected components which satisfies whp. Consequently, we get that whp. Notice that , where the last inequality is valid for any *n*, i.e., . Therefore, , which completes the proof.

The modern procedure for finding an -multiplicative estimate of *M* has a running time of . On the lower bound side, it is known that the running time of any algorithm must be .

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2018 Rajendra Akerkar, *Models of Computation for Big Data*, Advanced Information and Knowledge Processing

**3. Linear Algebraic Models**

Rajendra Akerkar

(1) Western Norway Research Institute, Sogndal, Norway


### 3.1 Introduction

This chapter presents some of the fundamental linear algebraic tools for large scale data analysis and machine learning. Specifically, the focus is on large scale linear algebra, including iterative, approximate and randomized algorithms for basic linear algebra computations and matrix functions. In the last decade, several algorithms for numerical linear algebra have been proposed, with substantial gains in performance over older algorithms (Nelson et al. 2014; Nelson 2015). Algorithms of this type are mostly pass-efficient, requiring only a constant number of passes over the matrix data for creating samples or sketches, and other work. Most of these algorithms require at least two passes for their best performance guarantees with respect to error or failure probability. A one-pass algorithm is close to the streaming model of computation, where there is one pass over the data, and resource bounds are sublinear in the data size.

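**Definition 3.1** A distribution *D* over $\mathbb{R}^{m \times n}$ satisfies the $(\varepsilon, \delta, p)$-JL moment property if for any $x \in S^{n-1}$ we have

$$\mathbb{E}_{\Pi \sim D} \left| \|\Pi x\|_2^2 - 1 \right|^p \le \varepsilon^p \cdot \delta.$$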

*Example 3.1* 1.

. This induces JL moment property with and JL moment property with 2.

We have JL moment property with

**Claim**

*n* rows, we have

(3.1) *Proof *The proof is left as an exercise for the reader.

**Definition 3.2**

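Given a linear subspace $E \subseteq \mathbb{R}^n$, a matrix $\Pi \in \mathbb{R}^{m \times n}$ is an $\varepsilon$-**subspace embedding** for *E* if

$$\forall z \in E: \quad (1 - \varepsilon) \|z\|_2^2 \le \|\Pi z\|_2^2 \le (1 + \varepsilon) \|z\|_2^2.$$

We can relate these subspace embeddings to the approximate matrix multiplication methods: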

. The above statement . This is a better bound, which preserves *x*. As we know, any linear subspace is the column space of some matrix. Thus, we will represent them as matrices.

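**Claim** For any *A* of rank *d*, there exists a 0-subspace embedding $\Pi \in \mathbb{R}^{m \times n}$ with $m = d$, but no $\varepsilon$-subspace embedding $\Pi \in \mathbb{R}^{m \times n}$ with $\varepsilon < 1$ if $m < d$.

*Proof* Suppose that there is an $\varepsilon$-subspace embedding $\Pi \in \mathbb{R}^{m \times n}$ for $m < d$, and let *E* be the column span of *A*. Then the map $\Pi : E \to \mathbb{R}^m$ has a non-trivial kernel. That is, there is some $x \neq 0$ in *E* such that $\Pi x = 0$. However, $\|\Pi x\|_2^2 \ge (1 - \varepsilon) \|x\|_2^2 > 0$, which is a contradiction. For the $m = d$ case, begin by rotating the subspace *E* to become $\mathrm{span}(e_1, \ldots, e_d)$ via multiplication by an orthogonal matrix, and then project onto the first *d* coordinates.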

**Theorem 3.1**

(Singular value decomposition) Every of rank r has a singular value decomposition (SVD) where has orthonormal columns, , is diagonal with strictly positive entries on the diagonal, and has orthonormal columns so and When we have , we can set . There are procedures to compute in time:

**Theorem 3.2** (Demmel et al. 2007) We can approximate SVD well in time $\tilde{O}(n d^{\omega - 1})$, where $\omega$ is the constant in the exponent of the complexity of matrix multiplication. Here the tilde hides logarithmic factors in *n*.

**Definition 3.3**

Suppose we are given $A \in \mathbb{R}^{n \times d}$ and $b \in \mathbb{R}^n$, where $n \gg d$. We want to solve $Ax = b$; however, since the system is over-constrained, an exact solution does not exist in general. In the *least squares regression* (LSR) problem, we instead want to solve the equation in a specific approximate sense: we want to compute

$$x_{LS} = \operatorname{argmin}_{x \in \mathbb{R}^d} \|Ax - b\|_2^2.$$

The choice of the function to be optimized is not arbitrary. For example, assume that we have some system, and one of its parameters is a linear function of *d* other parameters. In practice, we experimentally observe a linear function plus some random error. Under certain assumptions (the errors have mean 0, have the same variance, and are independent), least squares regression is provably the best estimator out of a certain class of estimators.

Now consider that $\{Ax : x \in \mathbb{R}^d\}$ is the column span of *A*. Part of *b* lives in this column space, and part of it lives orthogonally to it. Then, the $A x_{LS}$ we need is the projection of *b* onto that column span. Let $A = U \Sigma V^T$ be the SVD of *A*. Then the projection of *b* satisfies $\mathrm{Proj}_A b = U U^T b$, hence we can set $x_{LS} = V \Sigma^{-1} U^T b$. Then we have $A x_{LS} = U \Sigma V^T V \Sigma^{-1} U^T b = U U^T b$. Thus, we can solve LSR in $O(nd^2)$ time.
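For a small dense problem, the least squares solution can also be obtained without an SVD by solving the normal equations $(A^T A) x = A^T b$, which characterise the same minimiser when *A* has full column rank. The sketch below swaps in this technique for illustration only (it is numerically less robust than the SVD route described in the text); the function name `least_squares` and the example data are hypothetical.

```python
from typing import List


def least_squares(A: List[List[float]], b: List[float]) -> List[float]:
    """Minimise ||Ax - b||_2 by solving the normal equations (A^T A) x = A^T b."""
    n, d = len(A), len(A[0])
    # Augmented d x (d + 1) system [A^T A | A^T b].
    M = [
        [sum(A[r][i] * A[r][j] for r in range(n)) for j in range(d)]
        + [sum(A[r][i] * b[r] for r in range(n))]
        for i in range(d)
    ]
    # Gaussian elimination with partial pivoting.
    for col in range(d):
        pivot = max(range(col, d), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("A does not have full column rank")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, d):
            factor = M[r][col] / M[col][col]
            for k in range(col, d + 1):
                M[r][k] -= factor * M[col][k]
    # Back substitution.
    x = [0.0] * d
    for i in range(d - 1, -1, -1):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, d))
        x[i] = (M[i][d] - tail) / M[i][i]
    return x


# Hypothetical example: fit y = x0 + x1 * t to the points (0, 1), (1, 3), (2, 4).
A_example = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
b_example = [1.0, 3.0, 4.0]
x_example = least_squares(A_example, b_example)
residual = [bi - sum(a * x for a, x in zip(row, x_example))
            for row, bi in zip(A_example, b_example)]
```

The residual of the computed solution is orthogonal to every column of *A*, which is exactly the statement that $Ax$ is the projection of *b* onto the column span.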

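**Claim** Let $x_{LS}$ denote the optimal least squares solution. If $\|\Pi A x - \Pi b\|_2 = (1 \pm \varepsilon) \|Ax - b\|_2$ for all *x*, then for $\tilde{x} = \operatorname{argmin}_x \|\Pi A x - \Pi b\|_2$ we have

$$\|A \tilde{x} - b\|_2 \le \frac{1 + \varepsilon}{1 - \varepsilon}\, \|A x_{LS} - b\|_2.$$

We are going to replace *A* with $\Pi A$. Then we just need the SVD of $\Pi A$, which only takes $O(md^2)$ time. If *m* is like *d* then this is faster. However, we still need to find $\Pi$ and apply it.

**Claim** If $\Pi$ is an $\varepsilon$-s.e. for $\mathrm{span}(\mathrm{cols}(A), b)$, then for $\tilde{x} = \operatorname{argmin}_x \|\Pi A x - \Pi b\|_2^2$ we have

$$\|A \tilde{x} - b\|_2^2 \le \frac{1 + \varepsilon}{1 - \varepsilon}\, \|A x_{LS} - b\|_2^2.$$

*Proof* Since $\tilde{x}$ minimises the sketched objective, $\|\Pi A \tilde{x} - \Pi b\|_2^2 \le \|\Pi A x_{LS} - \Pi b\|_2^2 \le (1 + \varepsilon) \|A x_{LS} - b\|_2^2$. Similarly for the left side of the inequality: $\|\Pi A \tilde{x} - \Pi b\|_2^2 \ge (1 - \varepsilon) \|A \tilde{x} - b\|_2^2$.

The total time to find $\tilde{x}$ includes the time to find $\Pi$, the time to compute $\Pi A$ and $\Pi b$, and $O(md^2)$ (the time to find the SVD of $\Pi A$).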

### 3.2 Sampling and Subspace Embeddings

As with approximate matrix multiplication, there are two possible methods we will examine: sampling, and a Johnson-Lindenstrauss (JL) method.

Let be a diagonal matrix with diagonal elements . is 1. If we sample the *i*th row *i* of *A* (which can be written as ), 0 otherwise. .

If we used the sampling approach for approximate matrix multiplication, we select proportional to the norm for each row and decide what should be. The number of rows is non-deterministic.

If we do not want any ’s to be 0 - then we just miss a row. Define . If we don’t set , it doesn’t make sense. Look at the event that we did sample row *i*. Then

Pick *x* which achieves the sup in the definition of . Then If , the previous expression evaluates to , thus, we are guaranteed to mess up because there is some *x* which makes our error too big. Therefore, we need some .

, if *A* has column rank.

**Definition 3.4 **Given *A*, the *i*th *leverage score* is **Claim ** *Proof *See that both and are basis independent i.e. if M is square/invertible, then:

and Choose *M* s.t. has orthonormal columns: . Then, wlog and: Which *x* achieves the sup in ? The vector itself. Thus .

**Theorem 3.3 **(Drineas et al. 2006) If we choose , then .

Each leverage score is the squared norm of a row of a matrix with *d* orthonormal columns. So none of the leverage scores can be bigger than 1, and they sum up to *d*. The minimum with 1 is needed because the multiplicative factor times the leverage score may exceed 1. We can analyse this using the non-commutative Khintchine inequality.

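**Definition 3.5** The *Schatten-p norm* of *A*, for $1 \le p \le \infty$, is $\|A\|_{S_p}$, the $\ell_p$-norm of the singular values of *A*.

If *A* has rank at most *d*, see that $\|A\|_{S_p} = \Theta(\|A\|_{S_\infty}) = \Theta(\|A\|)$ for $p \ge \log d$ (by Hölder’s inequality).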

**Theorem 3.4**

(Lust-Piquard and Pisier 1991) The total samples required is . for all *x* is the same as for all *y*. Call . Thus we want

. Therefore, want , i.e. .

The columns of *U* form an orthonormal basis for *E*. We want , i.e. . From Gordon’s theorem: If then suffices to have .

Thus, if we take a random Gaussian matrix, by Gordon’s theorem, it will preserve the subspace as long as it has at least rows.

We want our to have few rows. We should be able to find immediately. Multiplication with *A* should be fast. The problem is takes time *O*(*mnd*) using for loops, which takes time

. Hence, we want to use “fast ”.

**Definition 3.6 **An *oblivious subspace embedding* is a distribution *D* over s.t.

, : This distribution doesn’t depend on *A* or *U*. The Gaussian matrix provides an oblivious subspace embedding, however, Sarlós approach with a fast JL matrix solves too.

For any *d*-dimensional subspace *E* there exists a set $T \subset E$ of size $2^{O(d)}$ such that if $\Pi$ preserves every $x \in T$ up to $1 + \varepsilon$, then $\Pi$ preserves all of *E* up to $1 + O(\varepsilon)$.

This means that distributional JL automatically implies an oblivious subspace embedding. We would set the failure probability in JL to be $\delta / 2^{O(d)}$, which by a union bound gives a failure probability for the OSE of $\delta$.

### 3.3 Non-commutative Khintchine Inequality

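The non-commutative Khintchine inequality plays a vital role in recent developments in non-commutative functional analysis, and in particular in operator space theory. For the non-commutative Khintchine inequality, let $p \ge 1$, let $A_1, \ldots, A_n$ be matrices of the same dimensions, and let $\sigma_1, \ldots, \sigma_n$ be independent Bernoulli (random sign) variables. Then

$$\left( \mathbb{E} \Big\| \sum_i \sigma_i A_i \Big\|_{S_p}^p \right)^{1/p} \le C \sqrt{p}\, \max\left\{ \Big\| \Big( \sum_i A_i A_i^T \Big)^{1/2} \Big\|_{S_p}, \Big\| \Big( \sum_i A_i^T A_i \Big)^{1/2} \Big\|_{S_p} \right\}$$

for an absolute constant *C*. To take the square root of a positive semidefinite matrix, compute its singular value decomposition and take the square root of each of the singular values.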

We take We know the given expression is We wish to bound and we know where is the i’th row of . This all implies where . Now we do the usual trick with proving Bernstein. By convexity we interchange the expectation with the norm and obtain which is just the usual symmetrization trick assuming row of are independent. Then we simplify The following was observed by Cohen, noncommutative khintchine can be applied to sparse JL but Cohen is able to obtain for s containing dependent entries as opposed to independent entries. There is a conjecture that the multiplies in is basically an addition and has been useful in compressed sensing.

### 3.4 Iterative Algorithms

Some forms of randomization have been used for several years in linear algebra. For example, the starting vectors in Lanczos algorithms are always random. In recent years, new uses of randomization have been proposed, such as random mixing and random sampling, which can be combined to form random projections. These ideas have been explored theoretically and have found use in some specialized applications (e.g., data mining), but they have had little influence on mainstream numerical linear algebra. Tygert and Rokhlin (Rokhlin and Tygert 2008) and Avron et al. (Avron et al. 2010) shed light on using gradient descent.

**Definition 3.7 **For a matrix *A*, the *condition number* of *A* is the ratio of its largest and smallest singular values.

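Let $\Pi$ be a $1/4$ subspace embedding for the column span of *A*. Let $\Pi A = U\Sigma V^T$ be the SVD of $\Pi A$, and let $R = V\Sigma^{-1}$. Then, by orthonormality of *U*, $\|x\| = \|\Pi A R x\| = (1 \pm 1/4)\|ARx\|$ for all $x$, which means that $\tilde{A} = AR$ has a good condition number. The algorithm is then the following.

1. Pick $x^{(0)}$ such that $\|\tilde{A}x^{(0)} - b\| \le 1.1\,\|\tilde{A}x^* - b\|$, where $x^*$ is the optimal solution; this can be done using the reduction to subspace embeddings with $\varepsilon$ being constant.

2. Iteratively let $x^{(i+1)} = x^{(i)} + \tilde{A}^T(b - \tilde{A}x^{(i)})$ until some $x^{(n)}$ is obtained.

We will discuss an analysis using (Clarkson and Woodruff 2013). Observe that

$$\tilde{A}(x^{(i+1)} - x^*) = \tilde{A}\bigl(x^{(i)} + \tilde{A}^T(b - \tilde{A}x^{(i)}) - x^*\bigr) = (\tilde{A} - \tilde{A}\tilde{A}^T\tilde{A})(x^{(i)} - x^*),$$

where the last equality follows by expanding the RHS. All terms cancel except for $\tilde{A}\tilde{A}^T b$ versus $\tilde{A}\tilde{A}^T\tilde{A}x^*$, which are equal since $x^*$ is the optimal vector, so that $\tilde{A}x^*$ is the projection of *b* onto the column span of $\tilde{A}$.

Now let $AR = U'\Sigma'V'^T$ in SVD. Since $\tilde{A}$ has a good condition number, each iteration shrinks $\|\tilde{A}(x^{(i)} - x^*)\|$ by a constant factor, thus $O(\log(1/\varepsilon))$ iterations suffice to bring down the error to $\varepsilon$. Further, in every iteration, we have to multiply by *AR*; multiplying by *A* can be done in time proportional to the number of nonzero entries of *A*, $\|A\|_0$, and multiplication by *R* in time proportional to $d^2$. Ultimately, the pertinent term in the time complexity is $\|A\|_0 \log(1/\varepsilon)$, in addition to the time to find the SVD.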
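The iteration can be tried out on a tiny example. The sketch below assumes the update rule is the gradient step $x \leftarrow x + \tilde{A}^T(b - \tilde{A}x)$ applied to a matrix whose singular values are already close to 1 (here 1.1 and 0.9); the helper names, the matrix, the right-hand side and the iteration count are illustrative choices, and the preconditioning step that would produce such a matrix is not shown.

```python
def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m * xj for m, xj in zip(row, x)) for row in M]


def transpose(M):
    """Return the transpose of M as a list of rows."""
    return [list(col) for col in zip(*M)]


def iterative_least_squares(A_tilde, b, x0, iterations):
    """Repeat the update x <- x + A_tilde^T (b - A_tilde x)."""
    At = transpose(A_tilde)
    x = list(x0)
    for _ in range(iterations):
        residual = [bi - yi for bi, yi in zip(b, matvec(A_tilde, x))]
        step = matvec(At, residual)
        x = [xi + si for xi, si in zip(x, step)]
    return x


# Assumed well-conditioned matrix: orthogonal columns of norm 1.1 and 0.9.
A_tilde = [[1.1, 0.0],
           [0.0, 0.9],
           [0.0, 0.0]]
b = [2.2, 1.8, 5.0]  # the last coordinate cannot be fitted at all

x = iterative_least_squares(A_tilde, b, x0=[0.0, 0.0], iterations=60)
print(x)  # approaches the least-squares solution [2, 2]
```

In this example the error in each coordinate shrinks by a factor $|1 - \sigma^2|$ per step (0.21 and 0.19), which is why a number of iterations logarithmic in the target accuracy is enough.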

### 3.5 Sarlós Method

This method was proposed by Sarlós (Sarlós 2006), who asked what space and time bounds algorithms of this kind can achieve. Key applications of low-rank matrix approximation by SVD include recommender systems, information retrieval via Latent Semantic Indexing, Kleinberg’s HITS algorithm for web search, clustering, and learning mixtures of distributions.

Let Then, . We have We want . Since have same column span, so , so . Now, let be a - subspace embedding — then has smallest singular value at least . Therefore

Now suppose also approximately preserves matrix multiplication. Here *w* is orthogonal to the columns of *A*, so . Then, by the general approximate matrix multiplication property,

We have , so set error parameter to get so , as required.

Ultimately, need not be an -subspace embedding. It suffices to merely be a *c*-subspace embedding for some fixed constant , while giving approximate matrix multiplication with error . Thus using the Thorup-Zhang sketch, this reduction we only require and even , as opposed to the first reduction that needed .

### 3.6 Low-Rank Approximation

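The basic setting is a matrix *A* with *n*, *d* both large, e.g. *n* users rating *d* movies. Suppose users are linear combinations of a few (*k*) basic types. We want to discover this low-rank structure.

Given a matrix $A \in \mathbb{R}^{n \times d}$, we want to compute $A_k := \arg\min_{\operatorname{rank}(B) \le k} \|A - B\|_X$ for some norm $\|\cdot\|_X$. Some now argue that we should look for a non-negative matrix factorization; nevertheless, this version is still used.

**Theorem 3.5 **(Eckart-Young) Let $A = U\Sigma V^T$ be a singular-value decomposition of *A* where $\operatorname{rank}(A) = r$ and $\Sigma$ is diagonal with entries $\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r > 0$. Then under $\|\cdot\|_X = \|\cdot\|_F$, $A_k = U_k \Sigma_k V_k^T$ is the minimizer, where $U_k$ and $V_k$ are the first *k* columns of *U* and *V* and $\Sigma_k = \operatorname{diag}(\sigma_1, \dots, \sigma_k)$. Our output is then $U_k, \Sigma_k, V_k$. We can calculate $A_k$ in $O(nd^2)$ time, by calculating the SVD of *A*.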
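As a small worked instance of the Eckart-Young statement, consider a diagonal matrix, whose singular values can be read off directly. The matrix is an illustrative choice and does not come from the text.

```latex
A = \begin{pmatrix} 3 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 1 \end{pmatrix},
\qquad \sigma_1 = 3,\ \sigma_2 = 2,\ \sigma_3 = 1.

% The best rank-1 approximation keeps only the largest singular value:
A_1 = \begin{pmatrix} 3 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix},
\qquad \|A - A_1\|_F = \sqrt{\sigma_2^2 + \sigma_3^2} = \sqrt{5}.

% The best rank-2 approximation keeps the two largest:
A_2 = \begin{pmatrix} 3 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 0 \end{pmatrix},
\qquad \|A - A_2\|_F = \sqrt{\sigma_3^2} = 1.
```

In general the Frobenius error of the best rank-*k* approximation is the square root of the sum of the squared singular values that were discarded.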

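**Definition 3.8** $\operatorname{Proj}_A B$ is the projection of the columns of *B* onto the column space of *A*.

**Definition 3.9 **Let $A = U\Sigma V^T$ be a singular value decomposition. Then $A^+ = V\Sigma^{-1}U^T$ is called the *Moore-Penrose pseudoinverse* of *A*.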

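Now recall subspace embeddings and approximate matrix multiplication. We use them to compute $\tilde{A}_k$ with rank at most *k* such that $\|A - \tilde{A}_k\|_F \le (1+\varepsilon)\|A - A_k\|_F$, where $A_k$ denotes the best rank-*k* approximation of *A*, following Sarlós’ approach (Sarlós 2006). The first works to obtain a reasonable error guarantee (such as an additive $\varepsilon\|A\|_F$) were due to Papadimitriou et al. (Papadimitriou et al. 2000) and Frieze, Kannan and Vempala (Frieze et al. 2004).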

**Theorem 3.6**

Define . As long as is an 1 / 2 subspace embedding for a certain *k*- dimensional subspace and satisfies approximate matrix multiplication with error , then where is the best rank *k* approximation to , i.e., projecting the columns of *A* to *V*.

Firstly, let us verify that this algorithm is fast, and that compute fast. To satisfy the conditions in the above theorem, we know that can be chosen with e.g. using a random sign matrix or slightly larger *m* using a faster subspace embedding. We need to multiply . We can use a fast subspace embedding to compute fast, then we can compute the SVD of in time. Let denote the best rank-*k* approximation under Frobenius norm. We wish to compute .

Computing takes *O*(*mnd*) time, then computing the SVD of takes time. It is better than the time to compute the SVD of *A*, but we can do better if we approximate. By using the right combination of subspace embeddings, for constant the scheme described here can be made to take time (where hides factors). We will do instead for .

We want to compute . If is the argmin without the rank constraint, then the *argmin* with the rank constraint is , where denotes the best rank-*k* approximation under Frobenius error.

Rather than find , we use *approximate regression* to find an approximately optimal . That is, we compute where is an -subspace embedding for the column space of (see has rank *m*). Then output is . and thus . The second equality above holds since the matrix preserves Frobenius norms, and the first equality since has a column space orthogonal to the column space of . Next, suppose are two functions mapping the same domain to such that for all *x* in the domain. Then

. Now, let the domain be the set of all rank-*k* matrices, and let and . Then . Therefore

(3.2) (3.3) (3.4) where (

) used that

since has columns orthogonal to the column space of . Also, (

) used that since is the best Frobenius approximation to *A* in the column space of . Ultimately,

again used

and also used the triangle inequality So, we have established the following theorem that follows from the above calculations and Theorem

**Theorem 3.7**

Let be a 1 / 2 subspace embedding for a certain *k*-dimensional subspace , and suppose also satisfies approximate matrix multiplication with error . Let be an -subspace embedding for the column space of , where is the SVD (and hence has rank at most ). Let where

Then has rank *k* and In particular, the error is for .

Further, we show that actually is a good rank-*k* approximation to *A* (i.e. we prove Theorem.

*Proof*

We denote the first *k* columns of *U* and *V* as and and the remaining columns by and . Let *Y* be the column span of and the orthogonal projection operator onto *Y* as

*P*. Then,

Then we can bound the second term in that sum: Now we just need to show that :

Here superscript (*i*) means the *i*th column. Now we have a bunch of different approximate regression problems which have the following form: which has optimal value . Consider the problem as original regression problem. In this case optimal gives

. Now we can use the analysis of approximate least squares regression from Sect. 3.5.

Here, we have a bunch of , , with and . Here, . Hence . Conversely,

. Since , if all singular values of are at least , we have where *G* has as *i*th column. exactly same as approximate matrix multiplication of and *G*. Since columns of *G* and are orthogonal, we have , hence if is a sketch for approximate matrix multiplication of error , then since . Clearly , hence proved.

### 3.7 Compressed Sensing

Compressed or compressive sensing developed from questions raised about how efficiently signals, including audio, still images and video, can be acquired and represented.

Nowadays, varied sensing devices such as mobile phones and biomedical sensors are indispensable. Individually operating sensors normally form large-scale correlated sensor networks. These sensors therefore generate continuous flows of big sensing data that pose a key challenge: how to sense and transmit massive spatio-temporal data in an efficient manner. Many conventional distributed sensing schemes process input signals in the sensing devices to reduce the burden of network transmission. However, these conventional schemes are not well suited to resource-limited sensing devices because of their excessive energy and resource consumption. Compressive sensing sheds light on this problem by shifting the complexity burden of the encoding process to the decoder. It makes it possible to compress large amounts of input signals without much energy consumption. Recent advances in compressive sensing reduce this computational burden even further by random sampling, so that compressive sensing schemes are successfully applied to large-scale sensor networks.

Moreover, one encounters the task of inferring quantities of interest from measured information in computer science. For example, in signal and image processing, one would like to reconstruct a signal from measured data. When the information acquisition process is linear, the problem reduces to solving a linear system of equations. A compressible signal is one which is sparse in some basis, but not necessarily the standard basis. Here an approximately sparse signal is a sum of a sparse vector with a low-weight vector.

Consider $x \in \mathbb{R}^n$. If *x* is a *k*-sparse vector, we could represent it in a far more compressed manner. Thus, we define a measure of how “compressible” a vector is as a measure of how close it is to being *k*-sparse.

**Definition 3.10 **Let $x_{head(k)}$ be the *k* elements of largest magnitude in *x*. Let $x_{tail(k)}$ be the rest of *x*.

Therefore, we call *x* compressible if $\|x_{tail(k)}\|$ is small.

The goal here is to approximately recover *x* from few linear measurements. Suppose we have a matrix $\Pi \in \mathbb{R}^{m \times n}$ with rows $\alpha_1, \dots, \alpha_m \in \mathbb{R}^n$, so that the *i*th entry of $\Pi x$ equals $\langle \alpha_i, x \rangle$. We want to recover a vector $\tilde{x}$ from $\Pi x$ such that $\|x - \tilde{x}\|_p \le C_{\varepsilon,p,q}\,\|x_{tail(k)}\|_q$, where $C_{\varepsilon,p,q}$ is some constant dependent on an accuracy parameter $\varepsilon$, *p* and *q*. Depending on the problem formulation, one may or may not get to choose this matrix $\Pi$.
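The head/tail split of Definition 3.10 can be made concrete with a short sketch. The function names `head_tail` and `l1_norm`, the example vector, and the choice of the $\ell_1$ norm for measuring the tail are illustrative assumptions.

```python
def head_tail(x, k):
    """Split x into its k largest-magnitude entries (head) and the rest (tail).

    Both parts are returned as vectors of the same length as x, with zeros
    in the positions that belong to the other part.
    """
    order = sorted(range(len(x)), key=lambda i: abs(x[i]), reverse=True)
    keep = set(order[:k])
    head = [x[i] if i in keep else 0.0 for i in range(len(x))]
    tail = [0.0 if i in keep else x[i] for i in range(len(x))]
    return head, tail


def l1_norm(v):
    """Sum of absolute values of the entries of v."""
    return sum(abs(t) for t in v)


x = [5.0, -0.1, 0.0, 3.0, 0.2, -4.0]
head, tail = head_tail(x, k=3)
print(head)           # the three large entries, zeros elsewhere
print(l1_norm(tail))  # small, so x is close to 3-sparse
```

A vector is compressible in the sense above when the tail norm is small compared with the norm of the whole vector.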

There are many practical applications in which approximately sparse vectors appear. Pixelated images, for example, are usually approximately sparse in some basis *U*. For example, consider an *n* by *n* image $x \in \mathbb{R}^{n^2}$. Then $x = Uy$ for some basis *U*, and *y* is approximately sparse. Thus we can get measurements from $\Pi U y$.

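Assume that *n* is a power of two. Then:

1. Break the image *x* into squares of size four pixels.

2. Initialize a new image, with four regions $R_1, R_2, R_3, R_4$.

3. Each block of four pixels, *b*, in *x* has a corresponding single pixel in each of $R_1$, $R_2$, $R_3$ and $R_4$, based on its location. For each block *b* with pixel values $p_1, p_2, p_3$ and $p_4$, the pixel in $R_1$ is set to the average of the four values, and the pixels in $R_2$, $R_3$ and $R_4$ are set to signed differences of them.

4. Recurse on $R_1$, $R_2$, $R_3$ and $R_4$.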
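One level of this block transform can be sketched as follows. The exact update formulas are not given in the text, so the sketch assumes the standard Haar choice: the first region receives the block average and the other three receive horizontal, vertical and diagonal differences. The function name `haar_level` and the region layout are assumptions, and the recursion of step 4 is left out.

```python
def haar_level(img):
    """Apply one level of a 2x2 block (Haar-style) transform.

    img is a square list of rows whose side length is even. The result has
    the same shape and is divided into four quadrants:
    upper-left = block averages, upper-right = horizontal differences,
    lower-left = vertical differences, lower-right = diagonal differences.
    """
    n = len(img)
    half = n // 2
    out = [[0.0] * n for _ in range(n)]
    for r in range(half):
        for c in range(half):
            p1, p2 = img[2 * r][2 * c], img[2 * r][2 * c + 1]
            p3, p4 = img[2 * r + 1][2 * c], img[2 * r + 1][2 * c + 1]
            out[r][c] = (p1 + p2 + p3 + p4) / 4                # average
            out[r][c + half] = (p1 - p2 + p3 - p4) / 4         # left vs right
            out[r + half][c] = (p1 + p2 - p3 - p4) / 4         # top vs bottom
            out[r + half][c + half] = (p1 - p2 - p3 + p4) / 4  # diagonal
    return out


# A constant image has all of its energy in the averaging quadrant.
flat = [[7.0] * 4 for _ in range(4)]
for row in haar_level(flat):
    print(row)
```

On an image with large constant patches, three of the four quadrants are mostly zero, which is the sparsity the discussion relies on.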

Normally, pixels are relatively constant in certain regions. So, the values in all regions except for the first are usually relatively small. If you view images after this transform, the upper left hand regions will often be closer to white, while the rest will be relatively sparse. A signal is called sparse if most of its components are zero. Empirically, many real-world signals are compressible in the sense that they are well approximated by sparse signals, often after an appropriate change of basis. This explains why compression techniques such as JPEG, MPEG, or MP3 work extremely well in practice.

The basic approach to taking a photo is to first capture a high-resolution image in the standard basis, that is, a light magnitude for each pixel, and then to compress the picture later using a software tool, because photos are usually sparse in an appropriate basis. The compressed sensing approach asks: why not capture the image directly in a compressed form, i.e. in a representation where its sparsity shines through? For example, one can store random linear combinations of light intensities instead of the light intensities themselves. This idea leads to a reduction in the number of pixels needed to capture an image at a given resolution. Another application of compressed sensing is in magnetic resonance imaging (MRI), where reducing the number of measurements decreases the time necessary for a scan.

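**Theorem 3.8 **(Candès et al. 2006; Donoho 2006) There exists a $\Pi \in \mathbb{R}^{m \times n}$ with $m = O(k \log(n/k))$ and a poly-time algorithm *Alg* s.t. if $\tilde{x} = Alg(\Pi x)$ then $\|x - \tilde{x}\|_2 \le O(k^{-1/2})\,\|x_{tail(k)}\|_1$.

If *x* is actually *k*-sparse, 2*k* measurements are necessary and sufficient.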

### 3.8 The Matrix Completion Problem

A partial matrix is a rectangular array in which some entries are specified, while the remaining unspecified entries are free to be chosen from an indicated set. A completion of a partial matrix is a particular choice of values for the unspecified entries resulting in a conventional matrix. A matrix completion problem asks whether a given partial matrix has a completion of a desired type; for example, the positive definite completion problem asks which partial Hermitian matrices have a positive definite completion. The positive definite completion problem has received the most attention, due to its role in several applications in probability and statistics, image enhancement, systems engineering, etc., and to its relation with other completion problems, including spectral norm contractions and Euclidean distance matrices, the latter being important for the molecular conformation problem in chemistry. In a typical matrix completion problem, one seeks a description of the circumstances in which choices for the unspecified entries may be made from the same set so that the resulting ordinary matrix over that set is of a desired type. The classes of matrices of interest are typically inherited under permutation similarity, diagonal matrix multiplication and passage to principal submatrices. Completion problems have proved to be a useful perspective from which to study fundamental matrix structure.


While the problem of rank aggregation is old, modern applications – such as those found in web-applications like Netflix and Amazon – pose new challenges. First, the data collected are usually cardinal measurements on the quality of each item, such as 1–5 stars, received from voters. Second, the voters are neither experts in the rating domain nor experts at producing useful ratings. These properties manifest themselves in a few ways, including skewed and indiscriminate voting behaviours.

A motivation for the matrix completion or Netflix problem comes from user ratings of some products which are put into a matrix *M*. The entries of the matrix correspond to the *j*’th user’s rating of product *i*. We assume that there exists an ideal matrix that encodes the ratings of all the products by all the users. However, it is not possible to ask every user for an opinion about every product. We are only given some ratings of some users and we want to recover the actual ideal matrix *M* from this limited data. So matrix completion is the following problem:

**Problem**: Suppose there is some matrix $M \in \mathbb{R}^{n_1 \times n_2}$ of which you are given only some entries $(M_{ij})_{ij \in \Omega}$, with $|\Omega| \ll n_1 n_2$.

**Goal**: We want to recover the missing elements in *M*.

This problem is hard if we do not make any additional premises on the matrix *M*, since the missing entries could in principle be arbitrary. We will consider a recovery scheme that relies on the following premises.

1. *M* is (approximately) low rank.

2. Both the column space and the row space are “incoherent”. We say a space is incoherent when the projection of any standard basis vector onto this space has a small norm.

3. If $M = U\Sigma V^T$, then all the entries of $UV^T$ are bounded.

4. The subset $\Omega$ of revealed entries is chosen uniformly at random.

**Note 3.1 **There is work on adversarial recovery where the values are not randomly chosen.

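Under these premises we show that there exists an algorithm that needs a number of entries in *M* bounded by $|\Omega| \lesssim (n_1 + n_2)\, r\, \mathrm{poly}(\log(n_1 n_2)) \cdot \mu$. Here *r* is the rank of *M* and $\mu$ captures to what extent properties 2 and 3 above hold. One would naturally consider the following recovery method for the matrix *M*:

$$\min_X \ \operatorname{rank}(X) \quad \text{s.t.} \quad X_{ij} = M_{ij} \ \ \forall\, ij \in \Omega.$$

Alas, this optimization problem is *NP*-hard. Hence, let us consider the following alternative optimization problem in trace norm, or *nuclear norm*:

$$\min_X \ \|X\|_* \quad \text{s.t.} \quad X_{ij} = M_{ij} \ \ \forall\, ij \in \Omega,$$

where the nuclear norm of *X* is defined as the sum of the singular values of *X*, i.e. $\|X\|_* = \sum_i \sigma_i(X)$. This problem is a semi-definite program (SDP), and can be solved in time polynomial in $n_1 n_2$.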

While several heuristics have been developed across many disciplines, the general problem of finding the lowest rank matrix satisfying equality constraints is NP-hard. Most low-rank matrices can be recovered from most sufficiently large sets of entries by computing the matrix of minimum nuclear norm that agrees with the provided entries, and moreover the revealed set of entries can comprise a vanishing fraction of the entire matrix. The nuclear norm is equal to the sum of the singular values of a matrix and is the best convex lower bound of the rank function on the set of matrices whose singular values are all bounded by 1. The intuition behind this heuristic is that whereas the rank function counts the number of non-vanishing singular values, the nuclear norm sums their amplitudes. Moreover, the nuclear norm can be minimized subject to equality constraints via semi-definite programming.

### 3.8.1 Alternating Minimization

Alternating minimization is a widely used heuristic for matrix completion in which the goal is to recover an unknown low-rank matrix from a subsample of its entries. It continues to play an important role in practical approaches to the problem, and it formed an important component of the winning submission for the Netflix Prize. The iterative procedure behind Alternating Minimization (AM) is given below. We try to find an approximate rank-*k* factorization $M \approx XY$, where *X* has *k* columns and *Y* has *k* rows. Writing $(\cdot)_\Omega$ for the restriction of a matrix to the revealed entries, we do as follows:

1. initialize $X_0, Y_0$

2. **for** $\ell = 1, \dots, T$:

   a. $X_\ell \leftarrow \arg\min_X \|(M - XY_{\ell-1})_\Omega\|_F^2$

   b. $Y_\ell \leftarrow \arg\min_Y \|(M - X_\ell Y)_\Omega\|_F^2$

3. **return** $X_T, Y_T$
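For rank 1, both minimization steps have a closed form, which makes the template easy to try out. The sketch below is restricted to that special case ($k = 1$, noiseless data); the function name, the all-ones initialization, the example matrix and the set of revealed positions are illustrative assumptions and not the variants analysed in the cited works.

```python
def alternating_minimization_rank1(M, observed, iterations=100):
    """Fit M[i][j] ~ x[i] * y[j] using only the positions in `observed`.

    With y fixed, each x[i] is the least-squares solution over the revealed
    entries of row i, and symmetrically for y[j] with x fixed.
    """
    n_rows, n_cols = len(M), len(M[0])
    x = [0.0] * n_rows
    y = [1.0] * n_cols  # assumed initial guess
    for _ in range(iterations):
        for i in range(n_rows):
            cols = [j for j in range(n_cols) if (i, j) in observed]
            den = sum(y[j] ** 2 for j in cols)
            x[i] = sum(M[i][j] * y[j] for j in cols) / den if den else 0.0
        for j in range(n_cols):
            rows = [i for i in range(n_rows) if (i, j) in observed]
            den = sum(x[i] ** 2 for i in rows)
            y[j] = sum(M[i][j] * x[i] for i in rows) / den if den else 0.0
    return x, y


# Rank-1 ground truth u v^T with u = (1, 2, 3) and v = (2, 1, 3).
M = [[2.0, 1.0, 3.0],
     [4.0, 2.0, 6.0],
     [6.0, 3.0, 9.0]]
# Only six of the nine entries are revealed; the others are never read.
observed = {(0, 0), (0, 1), (1, 0), (1, 2), (2, 1), (2, 2)}

x, y = alternating_minimization_rank1(M, observed)
print(x[0] * y[2], x[1] * y[1], x[2] * y[0])  # estimates of the hidden entries
```

For general *k*, each step becomes a small least-squares problem per row of *X* or per column of *Y* instead of a scalar division.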

Rigorous analyses of modifications of the above AM template have been carried out in (Hardt 2014; Hardt and Wootters 2014). The work (Schramm and Weitz 2015) has also shown some performance guarantees when the revealed entries are *adversarial* except for random.

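Now let us elaborate the main theorem and related definitions. Throughout, $m = |\Omega|$ denotes the number of revealed entries and we take $n_1 \le n_2$.

**Definition 3.11 **Let $M = U\Sigma V^T$ be the singular value decomposition. (See that $U \in \mathbb{R}^{n_1 \times r}$ and $V \in \mathbb{R}^{n_2 \times r}$.)

**Definition 3.12 **Define the incoherence of the subspace *U* as $\mu(U) = \frac{n_1}{r} \max_i \|P_U e_i\|^2$, where $P_U$ is projection onto *U*. Similarly, the incoherence of *V* is $\mu(V) = \frac{n_2}{r} \max_i \|P_V e_i\|^2$, where $P_V$ is projection onto *V*.

**Definition 3.13** $\mu_0 = \max\{\mu(U), \mu(V)\}$.

**Definition 3.14** $\mu_1 = \|UV^T\|_\infty \sqrt{n_1 n_2 / r}$, where $\|UV^T\|_\infty$ is the largest magnitude of an entry of $UV^T$.

**Theorem 3.9 **If $m \gtrsim \max(\mu_1^2, \mu_0)\, n_2\, r \log^2 n_2$, then with high probability *M* is the unique solution to the semi-definite program $\min \|X\|_*$ s.t. $X_{ij} = M_{ij}$ for all $ij \in \Omega$.

We know that $1 \le \mu_0 \le n_2 / r$. The value $\mu_0$ can be as large as $n_2/r$, for example when a standard basis vector appears as a column of *V*, and it gets down to 1 in the best-case scenario where all the entries of *V* are of magnitude about $1/\sqrt{n_2}$ and all the entries of *U* are of magnitude about $1/\sqrt{n_1}$, for instance if you took a Fourier matrix and removed some of its columns.

Ultimately, the condition on *m* is a good bound if the matrix has low incoherence.

Reference (Candès and Tao 2010) proved that $m \gtrsim \mu_0\, n_2\, r \log n_2$ is essential: if you want to recover *M* over the random choice of $\Omega$ via SDP, then you need to sample at least that many entries. The condition in Theorem 3.9 is not entirely tight because of the square in the log factor and the dependence on $\mu_1^2$. However, the Cauchy-Schwarz inequality implies $\mu_1^2 \le \mu_0^2\, r$.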

The algorithm looks as follows when we want to minimize : Select , and a stepsize *t* and iterate (a)–(d) some number of times:

(a) (b) (c) (d)

**Definition 3.15 ** **Claim**

The dual of the trace norm is the operator norm: $\|A\|_* = \sup_{B : \|B\| \le 1} \langle A, B \rangle$.

This is logical since the dual of $\ell_1$ for vectors is $\ell_\infty$. Furthermore, the trace norm and operator norm are the $\ell_1$ and $\ell_\infty$ norms of the singular value vector, respectively.

**Lemma 3.1** *Proof ***(2)** **(3)**: AM-GM inequality: .

**(3)** **(1)**: We simply need to show an *X* and *Y* which gives .

Set . Given , then . i.e. write the SVD of *A* and apply *f* to each diagonal entry of . It is simple to verify that and that the square of the Frobenius norm of is exactly the trace norm.

**(1)** **(2)**:

Let *X*, *Y* be some matrices such that . Then

(3.5)

*Proof* .

By taking and we get the trace norm.

Write s.t. . Write .

Then using a similar argument to

, hence proved.

While the principle of alternating maximization is well known in the literature, it had not been used before in the context of the present topic. Since it is not a matrix decomposition method, it can also be adapted to large-scale problems using essentially actions of matrix exponentials on vectors.


**4. Assorted Computational Models**


This chapter presents some other computational models to tackle massive datasets efficiently. We will see formalized models for some massive data settings, and explore core algorithmic ideas arising in them. The models discussed are cell probe, online bipartite matching, MapReduce programming model, Markov chain, and crowdsourcing. Finally, we present some basic aspects of communication complexity.

### 4.1 Cell Probe Model

The cell-probe model is one of the significant models of computation for data structures, subsuming in particular the common word-RAM (random-access machine) model. We suppose that the memory is divided into fixed-size cells (words), and the cost of an operation is just the number of cells it reads or writes. Let $U = [m]$ be a universe of size *m*, and let $S \subseteq U$ with $|S| = n$. An algorithm is supplied with *S*, and it answers queries on elements of *S* or even *U*. The set *S* is kept in memory in *cells*, each of $\log m$ bits. The algorithm is executed in the following two stages:

1. Preprocessing: On receiving *S*, store *S* in memory in some suitable form. We denote the space utilised, measured in number of cells, by *s*.

2. Query: Given $x \in U$, return some information about *x* depending on the problem. Let *t* denote the maximum number of memory cell probes the algorithm takes to process each query.

The performance of the algorithm is measured only by the parameters *s* and *t*. The time taken by the preprocessing step is not counted, and in the ‘Query’ step only the number of memory probes matters. Further, no information is carried over from the preprocessing to the query stage unless explicitly stored in the data structure. One can imagine this as two distinct programs that communicate only through the stored data structure.

For each *S*, once it has been preprocessed, the execution of the Query algorithm can be represented by a decision tree. Given an $x \in U$, the algorithm proceeds as follows: depending on *x* it chooses a memory cell and *probes* it (reads its contents). Depending on the contents of that cell it probes some other cell, and so on. Let us fix *S*. For every $x \in U$, we have a decision tree. The vertices are labeled by pointers to memory cells. The root’s label is the pointer to the cell probed first. Each vertex has a child for every possible outcome of probing the location it points to. If *t* is the query complexity, then the depth of each tree is at most *t*.

### 4.1.1 The Dictionary Problem

In the dictionary problem, we are given a set $S \subseteq U$ which we need to store in memory in some form.

For each $x \in S$, we will have a memory cell in a data structure *T*, which contains *x* and a pointer to some memory location containing auxiliary data about *x*. Furthermore, the algorithm might allocate some additional cells which will help it in processing queries.

Given $x \in U$, the problem is to find *i* such that $T[i] = x$, or report that $x \notin S$. The aim is to simultaneously reduce *s* and *t*. A very simple approach is to store the characteristic bit vector of the set *S* in the preprocessing phase, and answer every query with a single probe. This scheme has $s = O(m)$ and $t = 1$. The standard approach is to maintain a sorted array for storing *S*, and to use binary search for locating elements. For this scheme, $s = n$ and $t = O(\log n)$.
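The two baseline schemes can be compared by counting probes explicitly. In the sketch below, the class `ProbeCountingMemory` and both query functions are hypothetical helpers; for simplicity the bit vector stores one bit per cell instead of packing several bits into a word.

```python
class ProbeCountingMemory:
    """An array of cells that counts how many of them are read."""

    def __init__(self, cells):
        self.cells = list(cells)
        self.probes = 0

    def read(self, index):
        self.probes += 1
        return self.cells[index]


def member_bitvector(memory, x):
    """Characteristic-vector scheme: one probe per query."""
    return memory.read(x) == 1


def member_sorted(memory, n, x):
    """Sorted-array scheme: binary search over the n stored elements."""
    lo, hi = 0, n - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        value = memory.read(mid)
        if value == x:
            return True
        if value < x:
            lo = mid + 1
        else:
            hi = mid - 1
    return False


m, S = 64, [3, 9, 10, 50]

bit_memory = ProbeCountingMemory(1 if v in S else 0 for v in range(m))
print(member_bitvector(bit_memory, 10), bit_memory.probes)

sorted_memory = ProbeCountingMemory(sorted(S))
print(member_sorted(sorted_memory, len(S), 50), sorted_memory.probes)
```

The first structure uses space proportional to the universe and a single probe; the second uses *n* cells and a number of probes logarithmic in *n*.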

For the dictionary problem we will use the *Fredman–Komlós–Szemerédi (FKS) scheme* from (Fredman et al. 1984). It achieves $s = O(n)$ and $t = O(1)$.

**Theorem 4.1 **(Fredman–Komlós–Szemerédi) There exists a solution to the dictionary problem with $s = O(n)$ and $t = O(1)$.

In the preprocessing phase, a good hash function *h* maps *S* into the memory without collisions. The algorithm would then store a description of *h*, and information about each $x \in S$ in the cell numbered *h*(*x*). In the Query phase, the algorithm on input *x* would read *h*, compute *h*(*x*) and look up that cell. But *h* must have a compact description, otherwise reading *h* itself will need too many probes. We can find an *h* with a compact description which, though not collision-free, results in sufficiently small buckets. Then for each bucket, we can find a second-level hash function that is collision-free and has a compact description. Putting this together, both the storage requirement and the number of probes will be small.

Let us begin with a claim. Select a hash function $h : U \to [n]$ uniformly at random from a family $\mathcal{H}$ of pairwise independent hash functions. For $i \in [n]$, let $B_i$ be the *i*th bucket; $B_i = \{x \in S : h(x) = i\}$. Let $b_i$ be the size of the *i*th bucket; $b_i = |B_i|$. The claim below shows that the expected sum of the squared bucket sizes is *O*(*n*).

**Claim** $\mathbb{E}\left[\sum_{i \in [n]} b_i^2\right] < 2n$.

*Proof*

For $u, v \in S$, let $X_{uv}$ be the indicator variable of the event $h(u) = h(v)$ over the choice of *h*. Now, for each $u \in S$, $\sum_{v \in S} X_{uv}$ is the number of elements in *S* that *u* clashes with (counting *u* itself), and is hence equal to $b_{h(u)}$. Summing over all $u \in S$, the sum $\sum_{u,v \in S} X_{uv}$ is equal to $\sum_i b_i^2$, and since $\mathbb{E}[X_{uv}] = 1/n$ for $u \ne v$ and $X_{uu} = 1$, we have $\mathbb{E}\left[\sum_i b_i^2\right] = n + n(n-1)/n < 2n$.

Now we can present the preprocessing algorithm. For every $S \subseteq U$ of size *n*, the aforementioned claim guarantees the existence of a hash function $h \in \mathcal{H}$ for which $\sum_i b_i^2 < 2n$. Fix such an *h*. From now onwards we will use $b_i$ to denote the value taken by the random variable $b_i$ when this *h* that we have fixed is chosen as the hash function. For each *i* with $b_i > 0$, let $\mathcal{H}_i$ be a family of pairwise independent hash functions $U \to [b_i^2]$. If we pick an $h_i$ randomly from this family, then the probability that $h_i$ has a collision within $B_i$ is at most $\binom{b_i}{2} / b_i^2 < 1/2$. Thus there is one function $h_i$ which is collision-free within $B_i$. For each *i* fix one such $h_i$. The algorithm proceeds as follows. Given *S*, it determines *h* and the $h_i$ as above. It allocates *n* chunks of memory, the *i*th being of size $b_i^2$. Call the *i*th chunk $T_i$. An array of *n* cells is allocated, one cell for each $i \in [n]$. The *i*th cell, say $P[i]$, contains the address of the first cell of $T_i$. An additional array of size *O*(*n*) is used to store a description of the functions $h_i$. The storage required is thus *O*(*n*) for all the chunks together, plus whatever is required to store the functions.

The Query algorithm on input $x \in U$ proceeds as follows: Read the description of *h* and compute $i = h(x)$. Read cell $P[i]$ and the description of $h_i$. Adding $h_i(x)$ to the contents of cell $P[i]$ gives a location *m*(*x*). If the cell at this location does not contain *x*, then conclude that $x \notin S$. If it does, then read on for auxiliary information about *x*. The choices of *h* and the $h_i$’s ensure that for every $x \in S$ we are mapped to a distinct memory cell.

We need now to describe how we efficiently store and compute the functions *h* and the $h_i$’s. Take a prime $p \ge m$ and, for $a, b \in \{0, \dots, p-1\}$ with $a \ne 0$, hash functions of the form $x \mapsto ((ax + b) \bmod p) \bmod n$ (with $b_i^2$ in place of *n* for $h_i$). Then *h* and each $h_i$ can be described using 2 cells (for storing *a* and *b*) and are computable in constant time. Hence $s = O(n)$ and $t = O(1)$.

Hence proved.
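The two-level construction can be sketched in a few lines. The code below is an illustration under stated assumptions: hash functions are taken of the form $((ax + b) \bmod p) \bmod \text{size}$ with the prime $p = 2^{31} - 1$, keys are distinct non-negative integers below $p$, the first level is retried until the squared bucket sizes sum to at most $4n$, and Python lists stand in for memory cells. All function names are hypothetical.

```python
import random

P = 2_147_483_647  # the prime 2^31 - 1; keys are assumed to be smaller


def make_hash(rng, size):
    """Draw a random function x -> ((a*x + b) mod P) mod size."""
    return (rng.randrange(1, P), rng.randrange(0, P), size)


def apply_hash(h, x):
    a, b, size = h
    return ((a * x + b) % P) % size


def build_fks(S, seed=0):
    """Build a two-level table for a non-empty list S of distinct keys."""
    rng = random.Random(seed)
    n = len(S)
    # First level: retry until the buckets are small on average.
    while True:
        h = make_hash(rng, n)
        buckets = [[] for _ in range(n)]
        for x in S:
            buckets[apply_hash(h, x)].append(x)
        if sum(len(B) ** 2 for B in buckets) <= 4 * n:
            break
    # Second level: a collision-free table of size b_i^2 for each bucket.
    tables = []
    for B in buckets:
        size = len(B) ** 2
        while True:
            h_i = make_hash(rng, size) if size else None
            table = [None] * size
            collision = False
            for x in B:
                slot = apply_hash(h_i, x)
                if table[slot] is not None:
                    collision = True
                    break
                table[slot] = x
            if not collision:
                break
        tables.append((h_i, table))
    return h, tables


def contains(structure, x):
    """Membership query: evaluate two hash functions, inspect one slot."""
    h, tables = structure
    h_i, table = tables[apply_hash(h, x)]
    if not table:
        return False
    return table[apply_hash(h_i, x)] == x


S = [3, 9, 10, 50, 200, 1000, 12345]
structure = build_fks(S)
print(contains(structure, 50), contains(structure, 51))
```

Each retry loop succeeds with constant probability, so the expected construction time is small, while a query touches only a constant number of stored values.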

### 4.1.2 The Predecessor Problem

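For a non-empty set $S \subseteq U$, and for every $x \in U$, the predecessor is defined as $\mathrm{Pred}(x) = \max\{y \in S : y \le x\}$, with a special value when no such *y* exists. We will prove an upper bound on *s* and *t* for the Predecessor Problem. We will present an algorithm for which $s = O(n \log m)$ and $t = O(\log\log m)$, which is not the best procedure. The most efficient algorithm for the cell probe model is due to *Beame and Fich* (Beame and Fich 2002), who show how to achieve $s = n^{O(1)}$ and $t = O\left(\min\left\{\frac{\log\log m}{\log\log\log m}, \sqrt{\frac{\log n}{\log\log n}}\right\}\right)$. They have shown this bound to be tight for deterministic algorithms.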

We use *X-tries*, which are closely related to van Emde Boas trees (see Cormen et al. 2009), to design a solution. One can think of each element of $[m]$ as a $\log m$ bit binary string. We build a complete binary tree of depth $\log m$, whose leaves correspond to elements of [*m*]. Each edge from a vertex to its left child is labelled 0 and each edge from a vertex to its right child is labelled 1, and the labels of the edges along the path from the root to a leaf *u* when concatenated give the binary representation of *u*. Call this tree *T*. We edit this tree by deleting all leaves that correspond to elements not in *S*, all vertices that become leaves because of these deletions and so on. Finally we have a binary tree whose leaves are exactly the elements of *S*. Call it $T'$. The number of leaves in $T'$ is *n*. As in every intermediate level there can be at most *n* vertices, and there are $\log m$ levels, the size of $T'$ is $O(n \log m)$.

See that every element of *S* is its own predecessor. For an element $u \notin S$, let *v* be the deepest ancestor of *u* in *T* that is also present in $T'$. (Such a *v* must exist since at least one ancestor of *u*, the root of *T*, is in $T'$.) By the construction above, *v* is not a leaf in $T'$. Now there are two possibilities.

1. *u* is in the right subtree of *v* in *T*. By choice of *v*, *v* does not have a right child in $T'$, but is not a leaf, so it has a left child. Clearly in this case the predecessor of *u* is the rightmost leaf of the left subtree of *v* in $T'$. In the preprocessing step we will identify such vertices *v* and create a link pointing from *v* to the rightmost leaf of its left subtree.

2. *u* is in the left subtree of *v* in *T*. By choice of *v*, *v* does not have a left child in $T'$. To find a predecessor, we need to go up from *v* until we reach, from its right child, a vertex that also has a left child, go to that left subtree, and report the rightmost leaf there. So in the preprocessing step we put a link from *v* to the rightmost leaf in the left subtree rooted at the deepest such ancestor of *v*. If there is no such ancestor of *v* in $T'$, then we link from *v* to a special vertex indicating that *u* has no predecessor in *S*.

Therefore for each $u \notin S$, once we get to the vertex *v*, we immediately obtain the predecessor by following the links. It remains to find the vertex *v*, given *u*. Let the bit string corresponding to *u* be $u_1 u_2 \dots u_k$ where $k = \log m$. So the path from the root to the vertex *v* we are looking for is $u_1 u_2 \dots u_j$ where $j = \max\{i : u_1 u_2 \dots u_i \text{ is a path from the root in } T'\}$. (Note: $j = k$ exactly when $u \in S$.) The idea is to do a binary search in $[k]$ to identify *v*. If we can check whether a binary string forms a path from the root to some vertex in $T'$ with *O*(1) probes, then we can get to the vertex *v* with $O(\log k) = O(\log\log m)$ probes. This will give us the required probe bound.
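The binary search over prefix lengths can be sketched as follows. This is an illustration, not the cell-probe data structure itself: a Python dictionary stands in for the *O*(1)-probe lookup of a prefix, every stored prefix keeps the smallest and largest leaf below it in place of the explicit links, and the class name `PrefixTrie` and its methods are hypothetical.

```python
class PrefixTrie:
    """Stores all prefixes of the w-bit keys of a non-empty set S."""

    def __init__(self, S, w):
        self.w = w
        elems = sorted(set(S))
        # prev[x] is the element of S just below x (None for the minimum).
        self.prev = {}
        last = None
        for x in elems:
            self.prev[x] = last
            last = x
        # nodes[(length, prefix)] = (smallest leaf, largest leaf) below it.
        self.nodes = {}
        for x in elems:
            for length in range(w + 1):
                key = (length, x >> (w - length))
                lo, hi = self.nodes.get(key, (x, x))
                self.nodes[key] = (min(lo, x), max(hi, x))
        self.lookups = 0

    def has_prefix(self, u, length):
        """Is the length-bit prefix of u a path from the root of the trie?"""
        self.lookups += 1
        return (length, u >> (self.w - length)) in self.nodes

    def predecessor(self, u):
        """Largest element of S that is <= u, or None if there is none."""
        if u in self.prev:
            return u
        # Invariant: the prefix of length lo is present, that of length hi is not.
        lo, hi = 0, self.w
        while hi - lo > 1:
            mid = (lo + hi) // 2
            if self.has_prefix(u, mid):
                lo = mid
            else:
                hi = mid
        min_leaf, max_leaf = self.nodes[(lo, u >> (self.w - lo))]
        next_bit = (u >> (self.w - lo - 1)) & 1
        if next_bit == 1:
            # u would hang off the missing right child: all leaves are smaller.
            return max_leaf
        # u would hang off the missing left child: all leaves are larger.
        return self.prev[min_leaf]


trie = PrefixTrie([3, 9, 10, 50, 200], w=8)
print(trie.predecessor(100), trie.predecessor(8), trie.predecessor(2))
```

With 8-bit keys the search inspects at most three prefix lengths per query, matching the $O(\log\log m)$ behaviour described above.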

### 4.2 Online Bipartite Matching

Introduced in 1990 by Karp, Vazirani, and Vazirani (Karp et al. 1990), online bipartite matching was one of the first problems to receive the attention of competitive analysis. In recent years, the problem of maximum online bipartite matching with dynamic posted prices has been studied, motivated by the real-world challenge of efficient parking allocation. Smart parking systems are being deployed in an increasing number of cities. Such systems allow commuters and visitors to see in real time, using cellphone applications or other digital methods, all available parking slots and their prices. In the original (offline) bipartite matching problem, we are given a graph and seek a maximum matching, i.e. a matching that contains the largest possible number of edges.

On the other hand, in an “online” bipartite matching problem, we observe vertices one by one and assign matches in an online fashion. Our goal is to find an algorithm that maximizes the competitive ratio *R*(*A*).

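**Definition 4.1**

(*Competitive ratio*)

$$R(A) := \inf_{I} \frac{\mathbb{E}[\mu_A(I)]}{\mu^*(I)} \qquad (4.1)$$

where $\mu_A(I)$ and $\mu^*(I)$ denote the size of matching for an algorithm *A* and maximum matching size respectively, given input *I* = {graph, arriving order}.

Obviously $R(A) \le 1$, but can we find a lower bound for *R*(*A*)?

**4.2.1 Basic Approach**

Consider any greedy algorithm *A* that matches each arriving vertex whenever one of its neighbours is still free. Since each edge chosen by *A* can block at most two edges of a maximum matching, we have $R(A) \ge 1/2$. On the other hand, for any deterministic algorithm *A*, we can find an adversarial input *I* such that $\mu_A(I) \le \mu^*(I)/2$.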
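A two-vertex instance already shows how a deterministic greedy rule can be held to half of the optimum. In the sketch below the function name and the vertex labels are illustrative; the rule assumed is "match the arriving vertex to its first free neighbour in a fixed order".

```python
def greedy_online_matching(arrivals):
    """Match each arriving left vertex to its first free neighbour.

    arrivals is a list of (u, neighbours) pairs in arrival order, where
    neighbours lists right vertices in the algorithm's fixed preference order.
    """
    taken = set()
    matching = {}
    for u, neighbours in arrivals:
        for v in neighbours:
            if v not in taken:
                taken.add(v)
                matching[u] = v
                break
    return matching


# u1 may use v1 or v2, while u2 may only use v1. The optimum matches both
# (u1-v2, u2-v1), but the greedy rule gives v1 to u1 and strands u2.
adversarial = [("u1", ["v1", "v2"]), ("u2", ["v1"])]
print(greedy_online_matching(adversarial))
```

Whatever fixed preference the deterministic rule has, the adversary can present the second vertex so that it needs exactly the right vertex that was just used.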

Consider the graph, where there is a perfect matching from *n* vertices on the left to *n* vertices on right, and the second half of *u*s are fully connected to the first half of *v*. Under this setting, the number of correctly matched vertices in the second half of *v* is at most *n* / 2. The expected number of correctly matched vertices in the first half is given by:

(4.2) (4.3) (4.4)

Since , the competitive ratio *R*: This randomized algorithm does not do better than 1 / 2.

**4.2.2 Ranking Method**

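Consider a graph *G* with a given arrival order of the vertices *u*. Instead of selecting a random edge, we randomly permute the *v*’s with a permutation $\pi$. We then match each arriving *u* to $\arg\min_{v \in N(u)} \pi(v)$, where $N(u)$ denotes the neighbors of *u* that are still unmatched.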
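The Ranking rule is short enough to state directly in code. The sketch below assumes the rule as described: draw one random permutation of the right vertices up front, then match every arriving left vertex to its free neighbour of best (lowest) rank. The function name, the seed parameter and the example graph are illustrative.

```python
import random


def ranking(arrivals, right_vertices, seed=0):
    """Online matching with one random ranking of the right vertices.

    arrivals is a list of (u, neighbours) pairs in arrival order.
    """
    rng = random.Random(seed)
    order = list(right_vertices)
    rng.shuffle(order)
    rank = {v: position for position, v in enumerate(order)}
    taken = set()
    matching = {}
    for u, neighbours in arrivals:
        free = [v for v in neighbours if v not in taken]
        if free:
            v = min(free, key=rank.__getitem__)
            taken.add(v)
            matching[u] = v
    return matching


# Left vertex i is adjacent to right vertices i, i+1, ..., 3, so a perfect
# matching of size 4 exists (i matched to i).
arrivals = [(i, list(range(i, 4))) for i in range(4)]
print(ranking(arrivals, range(4), seed=1))
```

Every run produces a maximal matching, so its size is always at least half of the optimum; the analysis that follows concerns the better guarantee that holds in expectation over the random permutation.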

Let us prove that this algorithm achieves a competitive ratio of $1 - 1/e$. We begin by defining our notation. The matching is denoted by . denotes the vertex matched to *v* in perfect matching. $G = (U, V, E)$, where *U*, *V*, *E* denote left vertices, right vertices and edges respectively.

**Lemma 4.1 **Let with permutation and arriving order induced by

respectively.

- augmenting path from
*x*downwards.

and , if *v* is not matched under , then *u* is matched to

**Lemma 4.2 **Let with .

**Lemma 4.3**

Let be the probability that the rank-*t* vertex is matched. Then (4.5)

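*Proof* Let *v* be the vertex with $\sigma(v) = t$. Note, since $\sigma$ is uniformly random, *v* is uniformly random.

Let $u = M^*(v)$. Denote by $R_t$ the set of left vertices that are matched to rank $\le t$ vertices on the right. We have $\mathbb{E}[|R_t|] = \sum_{s \le t} x_s$. If *v* is not matched, *u* is matched to some $v'$ such that $\sigma(v') < t$, or equivalently, $u \in R_{t-1}$. Hence,

$$1 - x_t = \Pr[v \text{ is not matched}] \le \Pr[u \in R_{t-1}] = \frac{\mathbb{E}[|R_{t-1}|]}{n} \le \frac{1}{n} \sum_{s \le t} x_s.$$

However this proof is not correct, since *u* and $R_{t-1}$ are not independent and thus $\Pr[u \in R_{t-1}] \ne \mathbb{E}[|R_{t-1}|]/n$ in general. Instead, we use the following lemma to complete the correct proof.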

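**Lemma 4.4** Given $\sigma$, let $\sigma^{(i)}$ be the permutation that is $\sigma$ with *v* moved to the *i*th rank. Let $u = M^*(v)$. If *v* is not matched by $\sigma$, then for every *i*, *u* is matched by $\sigma^{(i)}$ to some $v'$ such that $\sigma^{(i)}(v') \le \sigma(v)$.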

*Proof*

By Lemma 4.1, inserting *v* at the *i*th rank causes any change to be a move up.

*Proof* (*of Lemma 4.3, by Lemma 4.4*) Choose a permutation $\sigma'$ and a vertex *v* uniformly at random, and let $\sigma$ be $\sigma'$ with *v* moved to rank *t*. Let $u = M^*(v)$, and let $R_t$ be the set of left vertices that $\sigma'$ matches to right vertices of rank at most *t*. According to Lemma 4.4, if *v* is not matched by $\sigma$ (which happens with probability $1 - x_t$), then *u* is matched by $\sigma'$ to some $v'$ with $\sigma'(v') \le t$, or equivalently $u \in R_t$. Note, *u* and $\sigma'$ are now independent, so $\Pr[u \in R_t] = \mathbb{E}[|R_t|]/n = \frac{1}{n} \sum_{s \le t} x_s$ holds. Hence proved.

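With Lemma 4.3 we can bound the competitive ratio. Let $S_t := \sum_{s \le t} x_s$. The inequality $1 - x_t \le \frac{1}{n} \sum_{s \le t} x_s$ of (4.5) is equivalent to $S_t \left(1 + \frac{1}{n}\right) \ge 1 + S_{t-1}$. Solving the recursion, it can also be rewritten as $S_t \ge \sum_{s=1}^{t} \left(1 - \frac{1}{n+1}\right)^s$ for all *t*. The competitive ratio is thus at least $S_n / n$, which tends to $1 - 1/e$ as $n \to \infty$.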
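The limit can be checked numerically. The sketch below assumes the recursion takes the form $S_t (1 + 1/n) \ge 1 + S_{t-1}$ with $S_0 = 0$, where $S_t$ is the sum of the matching probabilities of the *t* top-ranked vertices, and iterates it with equality to obtain the smallest value of $S_n / n$ that the recursion allows. The function name `ranking_lower_bound` is an assumption made for the example.

```python
import math


def ranking_lower_bound(n):
    """Smallest S_n / n allowed by S_t * (1 + 1/n) >= 1 + S_{t-1}, S_0 = 0."""
    s = 0.0
    for _ in range(n):
        s = (1.0 + s) * n / (n + 1)   # the recursion taken with equality
    return s / n


for n in (1, 10, 100, 1000):
    print(n, ranking_lower_bound(n))
print("1 - 1/e =", 1 - 1 / math.e)
```

In closed form the iterated value is $1 - \left(\frac{n}{n+1}\right)^n$, which equals 1/2 for $n = 1$ and approaches $1 - 1/e$ as *n* grows.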

### 4.3 MapReduce Programming Model

A growing number of commercial and science applications in both classical and new fields process very large data volumes. Dealing with such volumes requires processing in parallel, often on systems that offer high compute power. For this type of parallel processing, the MapReduce paradigm (Dean and Ghemawat 2004) has found popularity. The key insight of MapReduce is that many processing problems can be structured into one or a sequence of phases, where a first step (Map) operates in fully parallel mode on the input data, and a second step (Reduce) combines the resulting data in some manner, often by applying a form of reduction operation. MapReduce programming models allow the user to specify these map and reduce steps as distinct functions; the system then provides the workflow infrastructure, feeding input data to the map, reorganizing the map results, and then feeding them to the appropriate reduce functions, finally generating the output.

While data streams are an efficient model of computation for a single machine, MapReduce has become a popular method for large-scale parallel processing. In the MapReduce model, each data item is a (key, value) pair. For example, suppose you have a text file ‘input.txt’ with 100 lines of text in it, and you want to find the frequency of occurrence of each word in the file. Each line of input.txt is treated as a value, and the offset of the line from the start of the file as its key, so (offset, line) is an input pair. For counting how many times a word occurs in input.txt, a single word is the output key and its frequency is the output value. Our input is (offset of a line, line) and our output is (word, frequency of word). A MapReduce computation consists of a Map phase, an optional Combine step, a Shuffle phase, and a Reduce phase:

*Map*: The map function operates on a single record at a time. Each item is processed by some *map* function, which emits a set of new (key, value) pairs.

*Combine*: The combiner applies reducer logic early, on the output of a single map process. Mapper output is collected into an in-memory buffer. The MapReduce framework sorts this buffer and executes the combiner on it, if you have provided one. Combiner output is written to disk.

*Shuffle*: In the shuffle phase, MapReduce partitions the data and sends it to the reducers. Each mapper sends a partition to each reducer. This step is handled by the framework and is transparent to the programmer. All items emitted in the map phase are grouped by key, and items with the same key are sent to the same reducer.

*Reduce*: During initialization of the reduce phase, each reducer copies its input partition from the output of each mapper. After copying all parts, the reducer first merges these parts and sorts all input records by key. In the Reduce phase, the reduce function is executed exactly once for each key found in the sorted output. The MapReduce framework collects all the values of a key and creates a list of values. The reduce function is executed on this list of values and the corresponding key. So a reducer receives a key together with the list of its values, and emits a new set of items.
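The four phases can be imitated on a single machine for the word-count example. The following is only a sketch of the data flow, not of a real MapReduce runtime: the function names are assumptions made for the example, each input line is treated as if it were handled by its own mapper, and no partitioning across machines takes place.

```python
from collections import defaultdict


def map_fn(offset, line):
    """Map: one (offset, line) record in, a (word, 1) pair out per word."""
    for word in line.split():
        yield word.lower(), 1


def reduce_fn(word, counts):
    """Reduce: one key with the list of its values in, (word, total) out."""
    yield word, sum(counts)


def combine(pairs):
    """Combine: apply the reducer logic to the output of a single mapper."""
    local = defaultdict(list)
    for key, value in pairs:
        local[key].append(value)
    combined = []
    for key, values in local.items():
        combined.extend(reduce_fn(key, values))
    return combined


def shuffle(mapper_outputs):
    """Shuffle: group all emitted values by key across mappers."""
    groups = defaultdict(list)
    for pairs in mapper_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def word_count(text):
    records, offset = [], 0
    for line in text.splitlines(keepends=True):
        records.append((offset, line))      # (offset of line, line)
        offset += len(line)
    mapper_outputs = [combine(map_fn(o, l)) for o, l in records]
    groups = shuffle(mapper_outputs)
    result = {}
    for word in sorted(groups):             # reducers see keys in sorted order
        for key, total in reduce_fn(word, groups[word]):
            result[key] = total
    return result


print(word_count("a b a\nb c\n"))
```

Reusing `reduce_fn` as the combiner works here because addition is associative and commutative; a reducer without that property could not be applied early in this way.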

MapReduce provides many significant advantages over parallel databases. First, it provides fine-grained fault tolerance for large jobs: a failure in the middle of a multi-hour execution does not require restarting the job from scratch. Second, MapReduce is very useful for handling data processing and data loading in a heterogeneous system with many different storage systems. Third, MapReduce provides a good framework for the execution of more complicated functions than are supported directly in SQL.

As datasets have grown to tera- and petabyte input sizes, two paradigms have emerged for developing algorithms that scale to such large inputs: streaming and MapReduce (Bahmani et al. 2012). In the streaming model, as we have seen, one assumes that the input can be read sequentially in a number of passes over the data, while the total amount of random access memory (RAM) available to the computation is sublinear in the size of the input. The goal is to reduce the number of passes needed, all the while minimizing the amount of RAM necessary to store intermediate results. When the input is a graph, the vertices *V* are known in advance, and the edges are streamed. The challenge in streaming algorithms lies in wisely using the limited amount of information that can be stored between passes.

Complementing streaming algorithms, MapReduce, and its open source implementation, Hadoop, has become the *de facto* model for distributed computation on a massive scale. Unlike streaming, where a single machine eventually sees the whole dataset, in MapReduce the input is partitioned across a set of machines, each of which can perform a series of computations on its local slice of the data. The process can then be repeated, yielding a multi-pass algorithm. It is well known that simple operations like sum and other holistic measures, as well as some graph primitives, like finding connected components, can be implemented in MapReduce in a work-efficient manner. The challenge lies in reducing the total number of passes.

### 4.4 Markov Chain Model

Randomization can be a useful tool for developing simple and efficient algorithms. So far, most of these algorithms have used independent coin tosses to generate randomness. In 1907, A. A. Markov began the study of an important new type of chance process. In this process, the outcome of a given experiment can affect the outcome of the next experiment. This type of process is called a Markov chain (Motwani and Raghavan 1995). Specifically, Markov chains represent and model the flow of information in a graph; they give insight into how a graph is connected and which vertices are important.

A *random walk* is a process for traversing a graph where at every step we follow an outgoing edge chosen uniformly at random. A *Markov chain* is similar except the outgoing edge is chosen according to an arbitrary fixed distribution.

One use of random walks and Markov chains is to sample from a distribution over a large universe. In general, we set up a graph over the universe such that if we perform a long random walk over the graph, the distribution of our position approaches the distribution we want to sample from. Given a random walk or a Markov chain we would like to know: How quickly can we reach a particular vertex? How quickly can we cover the entire graph? How quickly does our position in the graph become “random”? While random walks and Markov chains are useful algorithmic techniques, they are also useful in analyzing some natural processes.

**Definition 4.2**

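(*Markov Chain*) A Markov Chain $(X_t)_{t \in \mathbb{N}}$ is a sequence of random variables on some state space *S* which obeys the following property:

$$\Pr[X_{t+1} = j \mid X_t = i, X_{t-1} = i_{t-1}, \ldots, X_0 = i_0] = \Pr[X_{t+1} = j \mid X_t = i].$$

We take these probabilities as a *transition matrix P*, where $P_{ij} = \Pr[X_{t+1} = j \mid X_t = i]$. See that $\sum_{j \in S} P_{ij} = 1$ for all $i \in S$ is necessary for *P* to be a valid transition matrix.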

If the row vector $q_0$ is the distribution of *X* at time 0, the distribution of *X* at time *t* will then be $q_t = q_0 P^t$.
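The evolution of the distribution can be illustrated with a small pure-Python sketch. It assumes the distribution is stored as a row vector `dist` and that one step of the chain multiplies it by the transition matrix from the right; the two-state matrix `P` and the function names are hypothetical choices for the example.

```python
def step(dist, P):
    """One step of the chain: returns the row vector dist * P."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]


def distribution_at(dist0, P, t):
    """Distribution at time t when the chain starts from dist0."""
    dist = list(dist0)
    for _ in range(t):
        dist = step(dist, P)
    return dist


# Hypothetical two-state chain; every row of P sums to 1.
P = [[0.9, 0.1],
     [0.5, 0.5]]

print(distribution_at([1.0, 0.0], P, 1))
print(distribution_at([1.0, 0.0], P, 50))
print(distribution_at([0.0, 1.0], P, 50))
```

For this particular chain both starting points converge to the same distribution (5/6, 1/6), which is the kind of behaviour that ergodicity formalizes.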

**Theorem 4.2**

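(The Fundamental Theorem of Markov Chains) Let *X* be a Markov Chain on a finite state space $S = [n]$ satisfying the following conditions:

*Irreducibility*

There is a path between any two states which will be followed with positive probability, i.e. $\forall i, j \in [n]$, $\exists t$ such that $\Pr[X_t = j \mid X_0 = i] > 0$.

*Aperiodicity*

Let the period of a pair of states *u*, *v* be the GCD of the lengths of all paths between them in the Markov chain, i.e. $\gcd\{t \in \mathbb{N}_{>0} : \Pr[X_t = v \mid X_0 = u] > 0\}$. *X* is aperiodic if this is 1 for all *u*, *v*.

Then *X* is ergodic, that is, it has a unique stationary distribution $\pi$ to which the distribution of $X_t$ converges from every starting distribution. These conditions are necessary as well as sufficient.

For two states *u*, *v*, define $h_{u,v} := \mathbb{E}[\min\{t \ge 1 : X_t = v\} \mid X_0 = u]$. This is called the *hitting time* of *v* from *u*, and it obeys $h_{u,u} = 1/\pi_u$ for an ergodic chain with stationary distribution $\pi$.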

### 4.4.1 Random Walks on Undirected Graphs

We consider a random walk *X* on a graph *G* as before, but now with the premise that *G* is undirected.

Clearly, *X* will be irreducible iff *G* is connected. It can also be shown that it will be aperiodic iff *G* is not bipartite. The $\Rightarrow$ direction follows from the fact that every closed walk in a bipartite graph has even length, whereas the $\Leftarrow$ direction follows from the fact that a non-bipartite graph always contains a cycle of odd length.

We can always make a walk on a connected graph ergodic simply by adding self-loops to one or more of the vertices.

**4.4.1.1 Ergodic Random Walks on Undirected Graphs**

**Theorem 4.3** If the random walk *X* on *G* is ergodic, then its stationary distribution $\pi$ is given by $\pi_v = \frac{d(v)}{2m}$ for all $v \in V$, where $d(v)$ is the degree of *v* and $m = |E|$.

*Proof*

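Let $\pi$, with $\pi_v = d(v)/2m$, be as defined above. Then:

$$(\pi P)_v = \sum_{u : (u,v) \in E} \pi_u P_{uv} = \sum_{u : (u,v) \in E} \frac{d(u)}{2m} \cdot \frac{1}{d(u)} = \frac{d(v)}{2m} = \pi_v.$$

So as $\pi P = \pi$, $\pi$ is the stationary distribution of *X*.

In general, even on this subset of random walks, the hitting time will not be symmetric, as will be shown in our next example. So we define the commute time $C_{u,v} := h_{u,v} + h_{v,u}$.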
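The stationarity of the degree-proportional distribution can be verified exactly on a small example. The sketch below assumes an undirected graph given as adjacency lists and uses exact rational arithmetic; the graph (a triangle with one pendant vertex) and the function names are hypothetical choices for the illustration.

```python
from fractions import Fraction


def degree_distribution(adj):
    """pi_v = d(v) / 2m for an undirected graph given as adjacency lists."""
    two_m = sum(len(nbrs) for nbrs in adj.values())   # sum of degrees = 2m
    return {v: Fraction(len(nbrs), two_m) for v, nbrs in adj.items()}


def walk_step(dist, adj):
    """One step of the simple random walk applied to a distribution."""
    nxt = {v: Fraction(0) for v in adj}
    for u, nbrs in adj.items():
        for v in nbrs:
            nxt[v] += dist[u] / len(nbrs)   # leave u along a uniform edge
    return nxt


# Triangle 0-1-2 with a pendant vertex 3 attached to 2 (not bipartite).
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
pi = degree_distribution(adj)
print(pi)
print(walk_step(pi, adj) == pi)
```

One step of the walk leaves `pi` unchanged, which is exactly the statement $\pi P = \pi$.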

### 4.4.2 Electric Networks and Random Walks

A resistive electrical network is an undirected graph; each edge has a branch resistance associated with it. The electrical flow is determined by two laws: Kirchhoff’s law (preservation of flow: all the flow coming into a vertex leaves it) and Ohm’s law (the voltage across a resistor equals the product of the resistance and the current through it). View the graph *G* as an electrical network with unit resistors as edges. Let $R_{u,v}$ be the effective resistance between vertices *u* and *v*. The commute time between *u* and *v* in a graph is related to $R_{u,v}$ by $C_{u,v} = 2m R_{u,v}$. We get the following inequalities assuming this relation.

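If $(u, v) \in E$, then $C_{u,v} \le 2m$, since $R_{u,v} \le 1$. In general, $\forall u, v \in V$, $C_{u,v} \le 2m(n-1) < n^3$, where $n = |V|$.

We inject $d(v)$ amperes of current into every vertex $v \in V$. Then we select some vertex $u \in V$ and remove $2m$ current from *u*, leaving net current $d(u) - 2m$ at *u*. Now we get voltages $x_v$ for all $v \in V$. It can be shown that $x_v - x_u = h_{v,u}$ for all $v \ne u$. Let *L* be the Laplacian for *G* and *D* be the degree vector, then we have

$$L x = D - 2m \, \mathbf{1}_u \qquad (4.6)$$

where $\mathbf{1}_u$ is the indicator vector of *u*.

You might now see the connection between a random walk on a graph and an electrical network. Intuitively, electricity is made of electrons, each of which performs a random walk on the electric network. The conductance of an edge (the reciprocal of its resistance) is proportional to the probability of taking the edge.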
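The relation between commute time and effective resistance, $C_{u,v} = 2m R_{u,v}$, can be checked exactly on a small graph. The sketch below is an illustration under stated assumptions: hitting times are obtained from the linear equations $d(v) h_v - \sum_{w \sim v} h_w = d(v)$ with $h_{\text{target}} = 0$, effective resistance is obtained by grounding one vertex and injecting one unit of current at the other, and all names (`solve`, `hitting_times_to`, `effective_resistance`) and the example graph are hypothetical.

```python
from fractions import Fraction


def solve(A, b):
    """Gauss-Jordan elimination over exact fractions (A must be nonsingular)."""
    n = len(A)
    M = [[Fraction(x) for x in row] + [Fraction(bi)] for row, bi in zip(A, b)]
    for col in range(n):
        piv = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[piv] = M[piv], M[col]
        M[col] = [x / M[col][col] for x in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                f = M[r][col]
                M[r] = [x - f * y for x, y in zip(M[r], M[col])]
    return [row[n] for row in M]


def reduced_laplacian(adj, removed):
    """Laplacian of the unit-resistor network with one row and column deleted."""
    others = [v for v in adj if v != removed]
    idx = {v: k for k, v in enumerate(others)}
    L = [[0] * len(others) for _ in others]
    for v in others:
        L[idx[v]][idx[v]] = len(adj[v])
        for w in adj[v]:
            if w != removed:
                L[idx[v]][idx[w]] -= 1
    return others, idx, L


def hitting_times_to(adj, target):
    """h[v] = expected number of steps for the walk from v to reach target."""
    others, _, L = reduced_laplacian(adj, target)
    h = dict(zip(others, solve(L, [len(adj[v]) for v in others])))
    h[target] = Fraction(0)
    return h


def effective_resistance(adj, u, v):
    """Voltage at u when v is grounded and one unit of current enters at u."""
    others, idx, L = reduced_laplacian(adj, v)
    x = solve(L, [1 if w == u else 0 for w in others])
    return x[idx[u]]


# Triangle 0-1-2 with a pendant vertex 3 attached to 2.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
m = sum(len(nbrs) for nbrs in adj.values()) // 2
h_03 = hitting_times_to(adj, 3)[0]
h_30 = hitting_times_to(adj, 0)[3]
print(h_03, h_30, 2 * m * effective_resistance(adj, 0, 3))
```

On this graph the two hitting times differ (9 versus 13/3), while their sum, 40/3, equals $2m$ times the effective resistance 5/3.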

### 4.4.3 Example: The Lollipop Graph

This is one example of a graph where the cover time depends on the starting vertex. The lollipop graph on *n* vertices is a clique of $n/2$ vertices connected to a path of $n/2$ vertices. Let *u* be any vertex in the clique that does not neighbour a vertex in the path, and *v* be the vertex at the end of the path that does not neighbour the clique. Then $h_{u,v} = \Theta(n^3)$ while $h_{v,u} = \Theta(n^2)$. This is because it takes $\Theta(n)$ time to go from one vertex in the clique to another, and $\Theta(n^2)$ time to successfully proceed up the path, but when travelling from *u* to *v* the walk will fall back into the clique $\Theta(n)$ times as often as it makes a step along the path to the right, adding an extra factor of *n* to the hitting time.

To compute $h_{u,v}$, let $u'$ be the vertex common to the clique and the path. Clearly, the path has resistance $\Theta(n)$. $\Theta(n)$ current is injected in the path and $\Theta(n^2)$ current is injected in the clique.

Consider draining current from *v*. The current in the path is $\Theta(n^2)$, as $2m - 1 = \Theta(n^2)$ current is drained from *v*, and it enters *v* through the path, implying $x_{u'} - x_v = \Theta(n^3)$ using Ohm’s law ($V = IR$). Now consider draining current from *u* instead. The current in the path is now $\Theta(n)$, implying $x_v - x_{u'} = \Theta(n^2)$ by the same argument.

Since the effective resistance between the endpoints of any edge in the clique is less than 1 and $\Theta(n^2)$ current is injected, there can be only an $O(n^2)$ voltage gap between any 2 vertices in the clique.

We get $h_{u,v} = x_u - x_v = \Theta(n^3)$ in the former case and $h_{v,u} = x_v - x_u = \Theta(n^2)$ in the latter.

### 4.5 Crowdsourcing Model

Crowdsourcing techniques are very powerful when harnessed for the purpose of collecting and managing data. In order to provide sound scientific foundations for crowdsourcing and support the development of efficient crowdsourcing processes, adequate formal models must be defined. In particular, the models must formalize unique characteristics of crowd-based settings, such as the knowledge of the crowd and crowd-provided data; the interaction with crowd members; the inherent inaccuracies and disagreements in crowd answers; and evaluation metrics that capture the cost and effort of the crowd.

To work with the crowd, one has to overcome several challenges, such as dealing with users of different expertise and reliability, and whose time, memory and attention are limited; handling data that is uncertain, subjective and contradictory; and so on. Particular crowd platforms typically tackle these challenges in an ad hoc manner, which is application-specific and rarely sharable. These challenges, along with the evident potential of crowdsourcing, have raised the attention of the scientific community, and called for developing sound foundations and provably efficient approaches to crowdsourcing.

In cases where the crowd is utilised to filter, group or sort the data, standard data models can be used. The novelty here lies in cases when some of the data is harvested with the help of the crowd. One can generally distinguish between procuring two types of data: general data that captures truth that normally resides in a standard database, for instance, the locations of places or opening hours; versus individual data that concerns individual people, such as their preferences or habits.

### 4.5.1 Formal Model

We now present a combined formal model for the crowd mining setting of (Amarilli et al. 2014; Amsterdamer et al. 2013).

Let $I = \{i_1, i_2, \ldots, i_n\}$ be a finite set of item names. Define a *database* *D* as a finite bag (multiset) of *transactions* over *I*, s.t. each transaction $T \in D$ represents an occasion, e.g., a meal. We start with a simple model where every *T* contains an itemset $A \subseteq I$, reflecting, e.g., the set of food dishes consumed in a particular meal. Let *U* be a set of users. Every $u \in U$ is associated with a *personal database* $D_u$ containing the transactions of *u* (e.g., all the meals in *u*’s history). $|D_u|$ denotes the number of transactions in $D_u$. The frequency or *support* of an itemset $A \subseteq I$ in $D_u$ is $\text{supp}_u(A) := |\{T \in D_u \mid A \subseteq T\}| / |D_u|$. This individual significance measure will be aggregated to identify the overall frequent itemsets in the population. For example, in the domain of culinary habits, *I* may consist of different food items. A transaction $T \in D_u$ will contain all the items in *I* consumed by *u* in a particular meal. If, for instance, a set of food and drink items is frequent, it means that these items form a frequently consumed combination.

There can be dependencies between *itemsets* resulting from semantic relations between *items*. For instance, the itemset $\{\text{tea}\}$ is semantically implied by any transaction containing $\{\text{jasmine tea}\}$, since jasmine tea is a (kind of) tea. Such semantic dependencies can be naturally captured by a *taxonomy*. Formally, we define a taxonomy $\Psi$ as a partial order over *I*, such that $i \le i'$ indicates that item $i'$ is more specific than *i* (any $i'$ is also an *i*).

Based on $\le$, the semantic relationship between items, we can define a corresponding order relation on itemsets. For itemsets *A*, *B* we define $A \le B$ iff every item in *A* is implied by some item in *B*. We call the obtained structure the *itemset taxonomy* and denote it by $I(\Psi)$. $I(\Psi)$ is then used to extend the definition of the support of an itemset *A* to $\text{supp}_u(A) := |\{T \in D_u \mid A \le T\}| / |D_u|$, i.e., the fraction of transactions that *semantically imply* *A*.
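The taxonomy-aware support can be sketched as follows. This is a simplified illustration, not the model of the cited papers: it assumes the taxonomy is a tree stored as a child-to-parent dictionary (a general partial order may give an item several parents), and the item names, the personal database and the function names are all hypothetical.

```python
def implies(item, general, parent):
    """True if `item` equals `general` or lies below it in the taxonomy."""
    while item is not None:
        if item == general:
            return True
        item = parent.get(item)     # climb one level; None at the root
    return False


def semantically_implies(transaction, itemset, parent):
    """Every item of `itemset` is implied by some item of `transaction`."""
    return all(
        any(implies(t, a, parent) for t in transaction) for a in itemset
    )


def support(personal_db, itemset, parent):
    """Fraction of transactions that semantically imply `itemset`."""
    hits = sum(semantically_implies(T, itemset, parent) for T in personal_db)
    return hits / len(personal_db)


# Hypothetical taxonomy: jasmine tea is a kind of tea, tea is a kind of drink.
parent = {"jasmine tea": "tea", "tea": "drink"}
personal_db = [
    {"jasmine tea", "cake"},
    {"tea"},
    {"coffee"},
    {"cake"},
]
print(support(personal_db, {"tea"}, parent))
print(support(personal_db, {"jasmine tea"}, parent))
print(support(personal_db, {"tea", "cake"}, parent))
```

The transaction containing jasmine tea counts toward the support of the more general itemset containing tea, which plain subset-based support would miss.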

Amarilli et al. (2014) discuss the feasibility of *crowd-efficient* algorithms by examining the computational complexity of algorithms that achieve the upper crowd complexity bound. In all problem variants, the crowd complexity lower bound serves as a simple lower bound. For some variants, they show that, even when the crowd complexity is feasible, the underlying computational complexity may still be infeasible.

### 4.6 Communication Complexity

Communication complexity explores how much two parties need to communicate in order to compute a function whose output depends on information distributed over both parties. This mathematical model allows communication complexity to be applied in many different situations, and it has become a key component in the theoretical computer science toolbox. In the communication setting, Alice has some input *x* and Bob has some input *y*. They share some public randomness and want to compute *f*(*x*, *y*). Alice sends some message $m_1$, and then Bob responds with $m_2$, and then Alice responds with $m_3$, and so on. At the end, Bob outputs *f*(*x*, *y*). They can choose a protocol $\Pi$, which decides what each party sends next based on the messages seen so far and that party's input. The total number of bits transferred is $|\Pi| = \sum_i |m_i|$.

The communication complexity of the protocol $\Pi$ is $CC_\mu(\Pi) = \mathbb{E}_\mu[|\Pi|]$, where $\mu$ is a distribution over the inputs (*x*, *y*) and the randomness of the protocol. The communication complexity of the function *f* for a distribution $\mu$ is

$$CC_\mu(f) = \min_{\Pi \text{ solves } f \text{ with probability } \ge 3/4} CC_\mu(\Pi).$$

The communication complexity of the function *f* is

$$CC(f) = \max_\mu CC_\mu(f).$$

### 4.6.1 Information Cost

Information cost is related to communication complexity, as entropy is related to compression.

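Recall that the entropy is $H(X) = \sum_x p(x) \log \frac{1}{p(x)}$. Now, the mutual information $I(X; Y) = H(X) - H(X \mid Y)$ between *X* and *Y* is how much a variable *Y* tells you about *X*. It is actually interesting that we also have $I(X; Y) = H(Y) - H(Y \mid X)$.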
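Entropy and mutual information can be computed directly for small joint distributions. The sketch below assumes the joint distribution is given as a dictionary mapping pairs $(x, y)$ to probabilities and uses the identity $I(X; Y) = H(X) + H(Y) - H(X, Y)$, which is equivalent to both $H(X) - H(X \mid Y)$ and $H(Y) - H(Y \mid X)$; the function names and the two example distributions are assumptions made for the illustration.

```python
import math
from collections import defaultdict


def entropy(dist):
    """Shannon entropy in bits of a distribution {outcome: probability}."""
    return sum(p * math.log2(1 / p) for p in dist.values() if p > 0)


def marginals(joint):
    """Marginal distributions of X and Y from a joint {(x, y): probability}."""
    px, py = defaultdict(float), defaultdict(float)
    for (x, y), p in joint.items():
        px[x] += p
        py[y] += p
    return px, py


def mutual_information(joint):
    """I(X; Y) = H(X) + H(Y) - H(X, Y), symmetric in X and Y."""
    px, py = marginals(joint)
    return entropy(px) + entropy(py) - entropy(joint)


# Y is a copy of a uniform bit X: Y reveals everything about X.
copy = {(0, 0): 0.5, (1, 1): 0.5}
# X and Y are independent uniform bits: Y reveals nothing about X.
independent = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}

print(mutual_information(copy))
print(mutual_information(independent))
```

The symmetry of the formula in *X* and *Y* is what makes the two expressions for mutual information agree.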

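The information cost of a protocol $\Pi$ is

$$IC(\Pi) = I(X; \Pi \mid Y) + I(Y; \Pi \mid X).$$

This is how much Bob learns from the protocol about *X* plus how much Alice learns from the protocol about *Y*. The information cost of a function *f* is

$$IC(f) = \min_{\Pi \text{ solves } f} IC(\Pi).$$

For every protocol $\Pi$, we have $IC(\Pi) \le \mathbb{E}[|\Pi|] = CC(\Pi)$, because there are at most *b* bits of information if there are only *b* bits transmitted in the protocol. Taking the minimum over all protocols implies $IC(f) \le CC(f)$. This is analogous to Shannon’s result that the entropy is a lower bound on the expected length of a code, $H(X) \le \mathbb{E}[\ell(X)]$.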

It is really interesting that the asymptotic statement is true. Suppose we want to solve *n* copies of the communication problem. Alice is given $x_1, \ldots, x_n$ and Bob is given $y_1, \ldots, y_n$, and they want to solve $f(x_1, y_1), \ldots, f(x_n, y_n)$, each failing at most 1 / 4 of the time. We call this problem the direct sum $f^{\oplus n}$. Then, for all functions *f*, it is not hard to show that $IC(f^{\oplus n}) = n \cdot IC(f)$.

**Theorem 4.4**

(Braverman and Rao 2011) $\lim_{n \to \infty} \frac{CC(f^{\oplus n})}{n} = IC(f)$.

In the limit, this theorem suggests that information cost is the right notion.

### 4.6.2 Separation of Information and Communication

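The remaining question is, for a single function, whether $CC(f) \approx IC(f)$, in particular whether $CC(f) = O(IC(f))$. If this is true, it would prove the direct sum conjecture $CC(f^{\oplus n}) \gtrsim n \cdot CC(f)$. The recent paper by Ganor, Kol and Raz (Ganor et al. 2014) showed that it is not true. They gave a function *f* for which $IC(f) = k$ and $CC(f) \ge 2^{\Omega(k)}$. This is the best possible because it was known before this that $CC(f) \le 2^{O(IC(f))}$. The function that they gave has input size $2^{2^{2^k}}$. So, it is still open whether $CC(f) \lesssim IC(f) \log\log |\text{input}|$.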

A binary tree with depth $2^{2^k}$ is split into levels of width *k*. For every vertex *v* in the tree, there are two associated values $x_v$ and $y_v$. There is a random special level of width *k*. Outside this special level, we have $x_v = y_v$ for all *v*. We think of $x_v$ and $y_v$ as indicating which direction you ought to go. So, if they are both 0, you want to go in one direction. If they are both 1, you want to go in the other. Within the special level, the values $x_v$ and $y_v$ are uniform. At the bottom of the special level, *v* is *good* if the path to *v* follows the directions. The goal is to agree on any leaf that is a descendant of some good vertex.

If you knew where the special level was, then *O*(*k*) communication would suffice. The problem is that you do not know where the special level is. You can try binary searching to find the special level, taking $2^{O(k)}$ communication. This is apparently the best you can do.

We can construct a protocol with information cost only *O*(*k*). It is okay to transmit something very large as long as the amount of information contained in it is small. Alice can transmit her path and Bob just follows it; that is a large amount of communication, but it is not so much information, because Bob knows what the first set would be. The issue is that it still gives you about $2^k$ bits of information, by revealing where the special level is. The idea is instead that Alice chooses a noisy path which follows her directions 90% of the time and deviates 10% of the time. This path is transmitted to Bob. It can be shown that this protocol only has *O*(*k*) information. Therefore, solving many copies together can be more efficient.

### 4.7 Adaptive Sparse Recovery

Adaptive sparse recovery is like the conversation version of sparse recovery.

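In non-adaptive sparse recovery, Alice has $i \in [n]$ and sets $x = e_i + w$. She transmits $y = Ax = Ae_i + Aw$. Bob receives *y* and recovers $\hat{x} \approx x$, and from it $\hat{i} \approx i$. In this one-way conversation, the data-processing inequality gives $I(\hat{i}; i) \le I(y; i)$.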

In the adaptive case, we have something more of a conversation. Alice knows *x*. Bob sends $v_1$ and Alice sends back $\langle v_1, x \rangle$. Then, Bob sends $v_2$ and Alice sends back $\langle v_2, x \rangle$. And then, Bob sends $v_3$ and Alice sends back $\langle v_3, x \rangle$, and so on.

To show a lower bound, consider stage *r*. Define *P* as the distribution of *i* conditioned on the answers $y_1, \ldots, y_{r-1}$ seen so far. Then, the observed information by round *r* is $b = \log n - H(P)$. For a fixed *v* depending on *P*, with $i \sim P$, the information that the next measurement $\langle v, x \rangle$ reveals about *i* can be bounded, with some algebra (Lemma 3.1 in (Price and Woodruff 2013)), by $O(b + 1)$. It means that on average the number of bits that you get at the next stage is at most a constant factor times what you had at the previous stage. This implies that *R* rounds take $\Omega(R \log^{1/R} n)$ measurements. And in general, it takes $\Omega(\log \log n)$ measurements.

**Footnotes**

1. Some itemsets that are semantically equivalent are identified by this relation, e.g., $\{\text{tea}, \text{jasmine tea}\}$ is represented by the equivalent, more concise $\{\text{jasmine tea}\}$, because drinking jasmine tea is simply a case of drinking tea.

**References**

Achlioptas D (2003) Database-friendly random projections. J Comput Syst Sci 66(4):671–687

Ahn KJ, Guha S, McGregor A (2012) Analyzing graph structure via linear measurements. SODA 2012:459–467

Ailon N, Chazelle B (2009) The fast Johnson-Lindenstrauss transform and approximate nearest neighbors. SIAM J Comput 39(1):302–322

Alon N (2003) Problems and results in extremal combinatorics-I. Discret Math 273(1–3):31–53

Alon N, Matias Y, Szegedy M (1999) The space complexity of approximating the frequency moments. J Comput Syst Sci 58(1):137–147

Amarilli A, Amsterdamer Y, Milo T (2014) On the complexity of mining itemsets from the crowd using taxonomies. ICDT

Amsterdamer Y, Grossman Y, Milo T, Senellart P (2013) Crowd mining. SIGMOD

Andoni A (2012) High frequency moments via max-stability. Manuscript

Andoni A, Krauthgamer R, Onak K (2011) Streaming algorithms via precision sampling. FOCS:363–372

Avron H, Maymounkov P, Toledo S (2010) Blendenpik: Supercharging LAPACK’s least-squares solver. SIAM J Sci Comput 32(3):1217–1236

Bahmani B, Kumar R, Vassilvitskii S (2012) Densest subgraph in streaming and MapReduce. Proc VLDB Endow 5(5):454–465

Bar-Yossef Z, Jayram TS, Kumar R, Sivakumar D (2004) An information statistics approach to data stream and communication complexity. J Comput Syst Sci 68(4):702–732

Beame P, Fich FE (2002) Optimal bounds for the predecessor problem and related problems. JCSS 65(1):38–72

Braverman M, Rao A (2011) Information equals amortized communication. FOCS 2011:748–757

Brinkman B, Charikar M (2005) On the impossibility of dimension reduction in $\ell_1$. J ACM 52(5):766–788

Candès EJ, Recht B (2009) Exact matrix completion via convex optimization. Found Comput Math 9(6):717–772

Candès EJ, Tao T (2010) The power of convex relaxation: near-optimal matrix completion. IEEE Trans Inf Theory 56(5):2053–2080

Candès EJ, Romberg JK, Tao T (2006) Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information. IEEE Trans Inf Theory 52(2):489–509

Chakrabarti A, Khot S, Sun X (2003) Near-optimal lower bounds on the multi-party communication complexity of set disjointness. In: IEEE conference on computational complexity, pp 107–117

Chakrabarti A, Shi Y, Wirth A, Yao AC-C (2001) Informational complexity and the direct sum problem for simultaneous message complexity. FOCS:270–278

Charikar M, Chen K, Farach-Colton M (2002) Finding frequent items in data streams. ICALP 55(1)

Clarkson KL, Woodruff DP (2013) Low rank approximation and regression in input sparsity time. In: Proceedings of the 45th annual ACM symposium on the theory of computing (STOC), pp 81–90

Cormen TH, Leiserson CE, Rivest RL, Stein C (2009) Introduction to algorithms. MIT Press

Cormode G, Muthukrishnan S (2005) An improved data stream summary: the count-min sketch and its applications. J Algorithms 55(1):58–75

Dasgupta A, Kumar R, Sarlós T (2010) A sparse Johnson-Lindenstrauss transform. STOC:341–350

Dean J, Ghemawat S (2004) MapReduce: Simplified data processing on large clusters. In: Proceedings of the sixth symposium on operating system design and implementation (San Francisco, CA, Dec 6–8). Usenix Association

Demmel J, Dumitriu I, Holtz O (2007) Fast linear algebra is stable. Numer Math 108(1):59–91

Dirksen S (2015) Tail bounds via generic chaining. Electron J Probab 20(53):1–29

Donoho DL (2006) Compressed sensing. IEEE Trans Inf Theory 52(4):1289–1306

Drineas P, Mahoney MW, Muthukrishnan S (2006) Sampling algorithms for regression and applications. SODA 2006:1127–1136

Feigenbaum J, Kannan S, McGregor A, Suri S, Zhang J (2005) On graph problems in a semi-streaming model. Theor Comput Sci 348(2–3):207–216

Fernique X (1975) Regularité des trajectoires des fonctions aléatoires gaussiennes. Ecole d’Eté de Probabilités de Saint-Flour IV, Lecture Notes in Math 480:1–96

Fredman ML, Komlós J, Szemerédi E (1984) Storing a sparse table with O(1) worst case access time. JACM 31(3):538–544

Frieze AM, Kannan R, Vempala S (2004) Fast Monte-Carlo algorithms for finding low-rank approximations. J ACM 51(6):1025–1041

Ganor A, Kol G, Raz R (2014) Exponential separation of information and communication. ECCC, Revision 1 of Report No. 49

Globerson A, Chechik G, Tishby N (2003) Sufficient dimensionality reduction with irrelevance statistics. In: Proceeding of the 19th conference on uncertainty in artificial intelligence, Acapulco, Mexico

Gordon Y (1986–1987) On Milman’s inequality and random subspaces which escape through a mesh in $\mathbb{R}^n$. In: Geometric aspects of functional analysis vol 1317:84–106

Gronemeier A (2009) Asymptotically optimal lower bounds on the NIH-multi-party information complexity of the AND-function and disjointness. STACS, pp 505–516

Gross D (2011) Recovering low-rank matrices from few coefficients in any basis. IEEE Trans Inf Theory 57:1548–1566

Gross D, Liu Y-K, Flammia ST, Becker S, Eisert J (2010) Quantum state tomography via compressed sensing. Phys Rev Lett 105(15):150401

Guha S, McGregor A (2012) Graph synopses, sketches, and streams: a survey. PVLDB 5(12):2030–2031

Guyon I, Gunn S, Ben-Hur A, Dror G (2005) Result analysis of the NIPS 2003 feature selection challenge. In: Neural information processing systems. Curran & Associates Inc., Red Hook

Hanson DL, Wright FT (1971) A bound on tail probabilities for quadratic forms in independent random variables. Ann Math Stat 42(3):1079–1083

Hardt M (2014) Understanding alternating minimization for matrix completion. FOCS:651–660

Hardt M, Wootters M (2014) Fast matrix completion without the condition number. COLT:638–678

Indyk P (2003) Better algorithms for high-dimensional proximity problems via asymmetric embeddings. In: ACM-SIAM symposium on discrete algorithms

Indyk P (2006) Stable distributions, pseudorandom generators, embeddings, and data stream computation. J ACM 53(3):307–323

Indyk P, Woodruff DP (2005) Optimal approximations of the frequency moments of data streams. STOC:202–208

Jayram TS (2009) Hellinger strikes back: a note on the multi-party information complexity of AND. APPROX-RANDOM, pp 562–573

Jayram TS, Woodruff DP (2013) Optimal bounds for Johnson-Lindenstrauss transforms and streaming problems with subconstant error. ACM Trans Algorithms 9(3):26

Johnson WB, Lindenstrauss J (1984) Extensions of Lipschitz mappings into a Hilbert space. Contemp Math 26:189–206

Johnson WB, Naor A (2010) The Johnson-Lindenstrauss lemma almost characterizes Hilbert space, but not quite. Discret Comput Geom 43(3):542–553

Jowhari H, Saglam M, Tardos G (2011) Tight bounds for samplers, finding duplicates in streams, and related problems. PODS 2011:49–58

Kane DM, Meka R, Nelson J (2011) Almost optimal explicit Johnson-Lindenstrauss transformations. In: Proceedings of the 15th international workshop on randomization and computation (RANDOM), pp 628–639

Kane DM, Nelson J (2014) Sparser Johnson-Lindenstrauss transforms. J ACM 61(1):4:1–4:23

Kane DM, Nelson J, Woodruff DP (2010) An optimal algorithm for the distinct elements problem. In: Proceedings of the twenty-ninth ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems (PODS), pp 41–52

Karp RM, Vazirani UV, Vazirani VV (1990) An optimal algorithm for on-line bipartite matching. In: STOC ’90: Proceedings of the twenty-second annual ACM symposium on theory of computing. ACM Press, New York, pp 352–358

Keshavan RH, Montanari A, Oh S (2010) Matrix completion from noisy entries. J Mach Learn Res 99:2057–2078

Klartag B, Mendelson S (2005) Empirical processes and random projections. J Funct Anal 225(1):229–245

Kushilevitz E, Nisan N (1997) Communication complexity. Cambridge University Press, Cambridge

Lévy P (1925) Calcul des probabilités. Gauthier-Villars, Paris

Lloyd S (1982) Least squares quantization in PCM. IEEE Trans Inf Theory 28(2):129–137

Lust-Piquard F, Pisier G (1991) Non commutative Khintchine and Paley inequalities. Arkiv för Matematik 29(1):241–260

Matousek J (2008) On variants of the Johnson-Lindenstrauss lemma. Random Struct Algorithms 33(2):142–156

Mendelson S, Pajor A, Tomczak-Jaegermann N (2007) Reconstruction and subgaussian operators in asymptotic geometric analysis. Geom Funct Anal 1:1248–1282

Motwani R, Raghavan P (1995) Randomized algorithms. Cambridge University Press, Cambridge. ISBN 0-521-47465-5

Nelson J (2015) CS 229r: Algorithms for big data. Course, Web, Harvard

Nelson J, Nguyen HL, Woodruff DP (2014) On deterministic sketching and streaming for sparse recovery and norm estimation. Linear algebra and its applications, special issue on sparse approximate solution of linear systems. 441:152–167

Nisan N (1992) Pseudorandom generators for space-bounded computation. Combinatorica 12(4):449–461

Papadimitriou CH, Raghavan P, Tamaki H, Vempala S (2000) Latent semantic indexing: a probabilistic analysis. J Comput Syst Sci 61(2):217–235

Price E, Woodruff DP (2013) Lower bounds for adaptive sparse recovery. SODA 2013:652–663

Recht B (2011) A simpler approach to matrix completion. J Mach Learn Res 12:3413–3430

Recht B, Fazel M, Parrilo PA (2010) Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Rev 52(3):471–501

Rokhlin V, Tygert M (2008) A fast randomized algorithm for overdetermined linear least-squares regression. Proc Natl Acad Sci 105(36):13212–13217

Rubinfeld R (2009) Sublinear time algorithms. Tel-Aviv University, Course, Web

Sarlós T (2006) Improved approximation algorithms for large matrices via random projections. In: 47th annual IEEE symposium on foundations of computer science FOCS:143–152

Sarlós T, Benczúr AA, Csalogány K, Fogaras D, Rácz B (2006) To randomize or not to randomize: space optimal summaries for hyperlink analysis. In: International conference on world wide web (WWW)

Schramm T, Weitz B (2015) Low-rank matrix completion with adversarial missing entries. In: CoRR

Talagrand M (1996) Majorizing measures: the generic chaining. Ann Probab 24(3):1049–1103

Wright SJ, Nowak RD, Figueiredo MAT (2009) Sparse reconstruction by separable approximation. IEEE Trans Signal Process 57(7):2479–2493