Get Complete Project Material File(s) Now! »
Recursive RLS sampling
Another approach to multiple-step RLS sampling, called Recursive-RLS and reported in Algorithm 6, was recently proposed by Musco and Musco, (2017). For clarity, we also report an equivalent iterative version. Similarly to Many-Step-RLS, the algorithm’s goal is to approximate sufficiently well i in order to be able to perform approximate RLSs sampling and use Corollary 3.1. While Many-Step-RLS is a strictly bottom-up approach that starts with a large (small deff( ) and dictionary) and increasingly shrinks it (large deff( ) and dictionary), Recursive-RLS has both a top-down and bottom-up phase.
In the top-down phase it recursively randomly halves the number of columns in the map : at each level t for each column in t 1 a 1=2 coin flip is performed, and if it is successful the column is included in t. As a consequence we have a top-down chain of maps where t w.h.p. contains O(n=2t) atoms, and all atoms in t are included in t 1. The goal of the recursive chain is to reduce the size of t sufficiently so that it can be used to compute approximate RLSs with a small computational cost. This base-case is encountered after log(n) recursion, where the algorithm constructs a dictionary Ilog2(n) = f(1; i)g based on log2(n) that contains only a constant number of atoms. Starting from there, the algorithm performs a bottom-up pass. At each level t the dictionary It+1 from the lower level t + 1 is used to estimate approximate RLSs et;i for t. The RLSs estimator used by Musco and Musco, (2017) is identical to Eq. 3.2, although in their independent discovery Musco and Musco, (2017) derived it starting from P, C and ci (Definition 2.8) rather than , , and i (our definition). This makes it hard to see Algorithm 6 Recursive-RLS sampling (Musco and Musco, 2017) (Recursive version) Input: map , regularization , failure probability Output: Dictionary I
1: Sample by randomly including each column of w.p. 1=2
2: if has less than 16 columns then
3: Set I = f(1; i)g for all i in
4: else
5: Invoke Algorithm 6 on adjusting =2
6: Set I to the result returned by the recursive invocation
7: end if
8: For all i in compute et;i using I and Eq. 3.2 with » = 1=2.
Sequential RLS sampling without removal
While both Many-Step-RLS and Musco and Musco’s Recursive-RLS achieve near-linear runtime, they have two major drawbacks: (1) they require multiple passes over the data or alternatively random access to the dataset, and (2) they have inherent bottlenecks that make it difficult to parallelize them. In this section we focus on solving the first of these drawbacks with a simple sequential sampling approach that can operate in a single pass over the dataset, processing the dataset as if it was coming from a continuous stream of data. The parallelization problems will be addressed in the following section by showing how to generalize sequential sampling to operate on multiple streams at the same time.
To describe this new streaming setting, we will slightly modify our notation. Given our dataset D = f tgnt=1 we can define the partial dataset Dt that contains the first t samples. Associated with Dt we have all the corresponding quantities t, t, t, t. Similarly, our goal is now to generate dictionaries It that are (« ; )-accurate w.r.t. to their respective t. We can also define RLSs t;i and effective dimension dteff( ) w.r.t. the partial dataset t. For notational convenience, we still refer to t;i as a RLSs when t < i (when we have not Algorithm 7 Streaming Oracle RLS sampling without removal (SOracle-RLS-NoRem) Input: Regularization , accuracy « , budget q
1: Initialize I0 = ;
2: for t = f1; : : : ; ng do
3: receive t
4: compute pt = t;t using an oracle limited to Dt
5: draw qt B (pt; q)
6: if qt 6= 0, add 1 qt ; t to It .Expand pt q
7: end for
seen the sample yet), but in this case we will arbitrarily set t;i = 1 for all t < i. Remember also that in the streaming setting if a past sample i is not explicitly stored in the current dictionary It, we cannot access it, i.e., we delete from memory all samples that are not explicitly stored at each step.
Oracle RLS sampling in the streaming setting
From the previous sections’ results on RLS sampling, we know that choosing whether to include sample t or not in It with probability pt based on the sample’s final RLS n;t will be enough to generate an (« ; )-accurate dictionary. This is essentially what Oracle-RLS does, with a computationally unbounded oracle providing the exact RLSs n;t. Unfortunately, in the streaming setting at time t we did not yet see the n t future samples remaining in the stream, and not even the oracle can compute these quantities. We will now present SOracle-RLS-Rem, a simple oracle-based algorithm, reported in Algorithm 7, that generalizes Oracle-RLS to the streaming setting. We call this approach sequential RLS sampling without removal because once an atom is added to the dictionary It, it is never subsequently removed.
The basic idea behind SOracle-RLS-NoRem is to use the sample’s current RLS t;t to decide whether or or not to include it without waiting to see the rest of the stream. Even when using exact RLS t;t, this approach can be grouped in the family of approximate RLS sampling methods: since we cannot (this time in an absolute sense) compute the probabilities n;t, and we try to approximate them with t;t. Due to the oracle, all the samples in the dictionary remain independent, and if we can satisfy the invariant pi = pt = t;t n;t required by Proposition 2.5 we will generate an (« ; )-accurate dictionary. Moreover, if the sum P i;i is not much larger than dt ( ), we can use Proposition 2.2 to show that we t i=1 eff generate a small dictionary. We now focus on characterizing how RLSs and dteff( ) evolve over time as new samples arrive (e.g., the relationship between t;i and t+1;i).
Implementing the oracle and guarantees
We can now remove the necessity of an oracle, and introduce KORS, reported in Algorithm 8, our first single-pass RLS sampling algorithm that achieves near-linear runtime and is not based on an oracle. Note that since computing exact RLSs t;t would be too expensive, we will again resort to approximate RLSs et;t and approximate probabilities pet = et;t. Starting from an empty dictionary I0, at each time step the algorithm loads sample t into memory and combines the new sample (1; t) to the dictionary It 1 to constructs the temporary dictionary It; . Using the associated selection matrix St; , this augmented dictionary can be effectively used to compute et;t using the Eq. 3.2 RLS estimator, t;i = (1 « ) iT( tSt; St;T tT + ) 1 i e = 1 » ( iT i iT tSt; (St;T tT tSt; + IIt; ) 1St;T tT i): 3.4
Note that combining two dictionaries I and I0, each (« ; )-accurate w.r.t. to disjoint datasets D and D0 generates a dictionary that is (« ; )-accurate w.r.t. D [ D0. Therefore, if It 1 is (« ; )-accurate, combining it with (1; t) makes I an (« ; )-accurate dictionary, because (1; t) can be seen as a (0; 0)-accurate dictionary w.r.t. to t. Taking into account Lemma 3.3, we see that if I is (« ; )-accurate, then et;t is -accurate. This cycle of using an accurate dictionary to estimate accurate approximate RLSs, and accurate approximate RLSs to compute accurate dictionaries will be at the core of the analysis of the algorithm in Theorem 3.11. After computing the approximate RLS, KORS sets the approximate probability pet = et;t and draws a Binomial qt B(pet; q). If qt 6= 0, we choose to Expand It 1 and t is added with weight to the next dictionary It. If qt = 0, the sample is not added and will be discarded at the end of the iteration to free space. Similarly to Two-Step-RLS or Many-Step-RLS, all rows and columns for which St; is zero (all points outside the temporary dictionary It; ) do not influence the estimator, so they can be excluded from the computation. Therefore, once a sample t is dropped from memory we do not need to re-load it anymore for the rest of the algorithm’s execution. This means that we only need to load the dataset once and sequentially, making KORS a single-pass algorithm. This is vital in the streaming setting where we cannot go back and resample a discarded atom. Moreover, since the difference between It 1 and It 1; or It is a single atom t, et;i can be efficiently computed in O(jIt; j2) space and O(jIt; j2) time using an incremental version of Eq. 3.4 that exploits rank-1 updates. Overall, the algorithm will take O(n maxt jItj2) time and O(maxt jItj2) space to run.
Table of contents :
1 Introduction
1.1 Why Efficient Machine Learning?
1.2 Sequential dictionary learning in a streaming setting
1.3 Our contributions
1.4 Thesis structure
1.5 Notation
2 Dictionary Learning for Covariance Matrices
2.1 Nyström in an Euclidean space: leverage score sampling
2.1.1 Subspace approximation
2.1.2 Nyström sampling
2.1.3 Uniform sampling
2.1.4 Oracle leverage score sampling
2.2 Nyström in a RKHS: ridge leverage score sampling
2.2.1 Regularized subspace approximation
2.2.2 Oracle ridge leverage score sampling
2.3 Extended proofs
3 Sequential RLS Sampling
3.1 Multi-Pass RLS sampling
3.1.1 Two-Step RLS sampling
3.1.2 Many-Step RLS sampling
3.1.3 Recursive RLS sampling
3.2 Sequential RLS sampling without removal
3.2.1 Oracle RLS sampling in the streaming setting
3.2.2 Implementing the oracle and guarantees
3.3 Sequential RLS Sampling with removal
3.3.1 Oracle RLS sampling with removal
3.3.2 Implementing the oracle and guarantees
3.3.3 Comparison of all RLS sampling algorithms
3.3.4 Straightforward extensions and practical tricks
3.3.5 Open questions: lifelong learning and deterministic dictionary learning
3.4 Proof of main results
3.4.1 Decomposing the merge tree
3.4.2 Bounding the projection error kPfh;lg eP fh;lgk
3.4.3 Bounding the martingale Yh
3.4.4 Bound on predictable quadratic variationWh
3.4.5 Space complexity bound
3.5 Extended proofs
4 Sequential RLS Sampling in the Batch Setting
4.1 Sequential RLS sampling for kernel matrix approximation
4.1.1 Kernel matrix reconstruction
4.2 Efficient learning with kernels
4.2.1 Kernel ridge regression
4.2.2 Kernel principal component analysis
4.3 Sequential RLS sampling for graph spectral sparsification
4.3.1 Graph spectral sparsification in the semi-streaming setting
4.3.2 SQUEAK for graph Laplacian approximation
4.4 Efficient graph semi-supervised learning
4.4.1 Semi-supervised learning with graph sparsifiers
4.4.2 Bounding the stability
4.4.3 Experiments on spam detection
4.4.4 Recap and open questions: dynamic sparsifiers
5 Sequential RLS Sampling in the Online Setting
5.1 Online kernel learning and the kernelized online Newton step
5.1.1 Kernelized online newton step
5.2 Sketched KONS
5.2.1 Limitations of budgeted OKL algorithms
5.3 The PROS-N-KONS algorithm
5.3.1 Regret guarantees
5.3.2 Experiments on online regression and classification
5.3.3 Recap and open questions: do we need restarts?
5.4 Extended proofs
References