
On the Additive Properties of the Fat-Shattering Dimension

Ohad Asor, Hubert Haoyang Duan, and Aryeh Kontorovich

Abstract— The properties of the VC-dimension under various compositions are well understood, but this is much less the case for classes of continuous functions. In this brief, we show that a commonly used scale-sensitive dimension, V_γ, is much less well behaved under Minkowski summation than its VC cousin, while the fat-shattering dimension retains some compositional similarity to the VC-dimension. As an application, we analyze the fat-shattering dimension of trigonometric functions and series.

Index Terms— Combinatorial dimension, fat-shattering, Minkowski addition, scale-sensitive.

I. INTRODUCTION

Combinatorial dimensions play a central role in learning theory, as they allow one to precisely characterize the learnable function classes [1]–[4]. Unfortunately, in all but the simplest cases, these dimensions are quite difficult to compute exactly. Hence, it is useful to be able to estimate the combinatorial dimension of a complicated function family in terms of its simpler constituents. In the case of binary-valued function classes, the relevant combinatorial parameter is the VC-dimension, which is fairly well behaved under addition. In particular, suppose that F, G : Ω → {−1, 1} are two concept classes and define H : Ω → {−1, 1} by H = {sgn(f(x) + g(x)) : f ∈ F, g ∈ G}. Then [5], [6]

VCdim(H) = O(VCdim(F) + VCdim(G)).    (1)

Some lower bounds on the VC-dimension of compositions are also known [7], [8]. More generally, for k ≥ 2, given any concept classes F_1, F_2, …, F_k and a function u : {−1, 1}^k → {−1, 1}, one can define the concept class u(F_1, F_2, …, F_k) as the set consisting of all functions u(f_1, f_2, …, f_k) : Ω → {−1, 1} defined by

u(f_1, f_2, …, f_k)(x) = u(f_1(x), f_2(x), …, f_k(x))    (2)

where f_i ∈ F_i, i = 1, 2, …, k. Vidyasagar [9] provides an upper bound on the VC-dimension of this composition class in terms of the dimensions of the individual concept classes.

Theorem 1 ([9]): If VCdim(F_i) < ∞ for all i = 1, 2, …, k, then there is a finite α = α(k), depending only on k, such that

VCdim(u(F_1, F_2, …, F_k)) < α max_{1≤i≤k} VCdim(F_i).

Real-valued functions admit various analogues of the VC-dimension, going by the general name of scale-sensitive dimensions. Intuitively, these incorporate a measure of margin or scale into their notion of shattering (a formal definition will be given below). Unlike the VC-dimension, the compositional properties of the various scale-sensitive dimensions are far less well understood. In this brief, we explore this gap and study the additive properties of generalized versions of the VC-dimension for function classes. In Section IV, we show that some dimensions do not possess a property analogous to (1), while others are far better behaved under Minkowski summation, and even satisfy a generalized version of Theorem 1 [10]. Finally, in Section V, we analyze the fat-shattering dimension of trigonometric functions and series.

II. RELATED WORK

Since the seminal work of [1], [2], it has been known that, modulo measure-theoretic technicalities, a class of continuous-valued functions admits distribution-free risk bounds if and only if its scale-sensitive dimension is finite at each scale. This generalizes the analogous characterization of PAC-learnable Boolean function classes as those having a finite VC-dimension [11]. Regarding the latter, one commonly encounters situations where a complicated function class is constructed by combining simpler hypotheses via basic operations; examples include neural networks [5] and boosting [6]. This was the primary motivation behind results such as (1), as well as those in this brief.

A precursor to the scale-sensitive dimensions was Pollard's pseudo-dimension Pdim(·) [12], which, while capable of providing simple upper bounds [13], is too crude to characterize distribution-free convergence. One way of computing Pdim(F) is to calculate the VC-dimension of the set of all mappings (x, t) ↦ 1{f(x) > t}, indexed by f ∈ F [4]. This characterization implies that Pdim(F + G) ≤ O(Pdim(F) + Pdim(G)). Other approaches to learning continuous-valued functions by examining their discretized behavior include [14], [15].

A powerful alternative framework for analyzing the complexity of continuous function classes is that of Rademacher complexity [16], [17]. We will not define the Rademacher complexity R(·) of a function class F here, but note in passing that it is rather well behaved with respect to Minkowski addition and scalar multiplication [18]:

R(αF + βG) ≤ |α|R(F) + |β|R(G).

A recent example where Rademacher complexity is used to control approximation error may be found in [19].
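As a concrete, if toy, illustration of the composition bounds recalled above, the following sketch brute-forces the VC-dimension of two small threshold classes and of the composed class H = {sgn(f + g)} from (1) over a finite grid. The classes, the grid, and all names are invented here purely for illustration; the printed dimensions (1, 1, and 2) are consistent with the O(VCdim(F) + VCdim(G)) behavior of (1).

```python
from itertools import combinations

def vcdim(concepts, domain):
    """Brute-force VC-dimension of a finite concept class on a finite domain:
    the largest k such that some k-point subset is shattered."""
    patterns = {tuple(c(x) for x in domain) for c in concepts}
    for k in range(len(domain), 0, -1):
        for S in combinations(range(len(domain)), k):
            # S is shattered iff all 2^k sign patterns appear on it.
            if len({tuple(p[i] for i in S) for p in patterns}) == 2 ** k:
                return k
    return 0

domain = [i / 10 for i in range(10)]
F = [lambda x, t=t: 1 if x >= t else -1 for t in domain]   # right rays
G = [lambda x, t=t: 1 if x <= t else -1 for t in domain]   # left rays
# The class H from (1): pointwise sign of f + g (here: -1 exactly on an
# open interval), whose VC-dimension stays small.
H = [lambda x, f=f, g=g: 1 if f(x) + g(x) >= 0 else -1 for f in F for g in G]
print(vcdim(F, domain), vcdim(G, domain), vcdim(H, domain))  # 1 1 2
```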

III. DEFINITIONS

Let F be a collection of functions f : Ω → R and recall the definitions of the V_γ and fat-shattering dimensions [1], [2]: a set S ⊂ Ω is said to be V_γ-shattered by F if there exists some constant r ∈ R such that for each label assignment y ∈ {−1, 1}^S there is an f ∈ F satisfying

y(x)(f(x) − r) ≥ γ > 0    (3)

for all x ∈ S. The set S is γ-fat shattered by F, on the other hand, if there exists some function r : S → R, called the witness of shattering,


such that for each label assignment y ∈ {−1, 1}^S there is an f ∈ F satisfying

y(x)(f(x) − r(x)) ≥ γ > 0    (4)

for all x ∈ S. The V_γ dimension, denoted by V_γ(F), and the γ-fat shattering dimension, denoted by fat_γ(F), of F are, respectively, the cardinality of the largest set V_γ-shattered and γ-fat shattered by F. If there exist sets of arbitrarily large size that are V_γ-shattered, or γ-fat shattered, by F, then V_γ(F) = ∞, or fat_γ(F) = ∞, respectively. The only difference between the V_γ and the fat-shattering dimensions is that r is required to be a constant for V_γ-shattering, while it can be a real-valued function on S for γ-fat shattering. As a result,

V_γ(F) ≤ fat_γ(F)    (5)

for any function class F.
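For finite classes on finite sets, the definition above translates directly into a brute-force test. The sketch below is our own illustration, not from the paper: it checks γ-fat shattering for a supplied witness r (a complete test would also have to search over witnesses), and all names are invented.

```python
from itertools import product

def fat_shatters(fs, S, gamma, r):
    """Check whether the finite class `fs` gamma-fat shatters the points in S
    with the given witness r (a dict mapping each x in S to a threshold):
    every labeling y in {-1,+1}^S must be realized with margin gamma."""
    for y in product((-1, 1), repeat=len(S)):
        if not any(all(yi * (f(x) - r[x]) >= gamma for yi, x in zip(y, S))
                   for f in fs):
            return False
    return True

# Toy usage: these three functions cannot 1/2-fat shatter two points with
# witness 0, because the labeling (+1, -1) is never realized.
fs = [lambda x: 1.0, lambda x: -1.0, lambda x: 1.0 if x > 0 else -1.0]
S = [-1.0, 1.0]
r = {x: 0.0 for x in S}
print(fat_shatters(fs, S, 0.5, r))  # False
```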

Recall from basic analysis that a function u : X → Y between two metric spaces (X, d_X) and (Y, d_Y) is uniformly continuous if for every ε > 0 there exists δ = δ(ε) such that for all x, x′ ∈ X,

d_X(x, x′) < δ ⇒ d_Y(u(x), u(x′)) < ε.

If F, G are two sets of real-valued functions defined on the same domain, F + G denotes their Minkowski sum

F + G = {h = f + g : f ∈ F, g ∈ G}.

The natural numbers and integers are denoted by N and Z, respectively. For c = (c_1, c_2, …) ∈ R^N, we write ‖c‖_1 to denote Σ_{i∈N} |c_i|. Standard asymptotic order-of-magnitude notation O(·) and Ω(·) is used.

IV. ADDITIVE PROPERTIES

The next result shows that there is no V_γ-shattering analogue of (1), by exhibiting function families F, G such that V_γ(F) and V_γ(G) are small while V_γ(F + G) is large.

Theorem 2: There exist function classes F, G such that for all γ > 0, V_γ(F) = V_γ(G) = 1 and V_γ(F + G) = ∞.

Proof: For −∞ < a < b < ∞, let F be the collection of all increasing functions on [a, b] and let G be the collection of all decreasing functions on [a, b]. Clearly, V_γ(F) = V_γ(G) = 1 for all γ > 0. However, H = F + G = BV[a, b] is the collection of functions of bounded variation on [a, b] and is easily seen to have V_γ(H) = ∞ for all γ > 0; see [20]. □

On the other hand, the fat-shattering dimension is significantly better behaved under Minkowski summation. More generally, it admits a version of Theorem 1.

Theorem 3 ([10]): Let γ > 0, k ≥ 2, let F_1, F_2, …, F_k be function classes consisting of functions f : Ω → [−1, 1], and let u : [−1, 1]^k → [−1, 1] be uniformly continuous. Then, there are 0 < α, β < ∞, depending only on u, k, and γ, such that

fat_γ(u(F_1, F_2, …, F_k)) ≤ α Σ_{i=1}^{k} fat_β(F_i)

where u(F_1, F_2, …, F_k) consists of all functions u(f_1, f_2, …, f_k) : Ω → [−1, 1] as defined in (2).

Since addition is a uniformly continuous operation, we have the following result.

Corollary 4: Let F and G consist of functions f : Ω → [−0.5, 0.5] and H = F + G be their Minkowski sum. Then, there is a constant 0 < α < ∞ and a function β : (0, ∞) → (0, ∞) such that

fat_γ(H) ≤ α(fat_{β(γ)}(F) + fat_{β(γ)}(G)).
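The proof of Theorem 2 is constructive, and its idea is easy to demonstrate numerically: any ±γ labeling pattern is realized by a function of bounded variation, which splits via its Jordan decomposition into an increasing part plus a decreasing part. A minimal sketch of this, with our own invented points and labels (the same recipe works for every labeling, which is why V_γ(F + G) = ∞):

```python
import numpy as np

# Build h with h(x_i) = gamma * y_i, then split h into its Jordan
# decomposition h = f + g, where f is the running positive variation
# (nondecreasing, so f is in F) and g = h - f is nonincreasing (g is in G).
# Then f + g V_gamma-shatters the points with the constant witness r = 0.
gamma, n = 0.3, 8
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 1.0, n))
y = rng.choice([-1.0, 1.0], n)

h = gamma * y
dh = np.diff(h, prepend=h[0])          # increments of h along the points
f = np.cumsum(np.maximum(dh, 0.0))     # nondecreasing part
g = h - f                              # remaining part is nonincreasing

assert np.all(np.diff(f) >= 0) and np.all(np.diff(g) <= 0)
assert np.all(y * (f + g) >= gamma)    # margin gamma with witness r = 0
print("labeling realized by (increasing) + (decreasing)")
```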

It is instructive to contrast Corollary 4 with Theorem 2. In addition, the class of functions on [a, b] with bounded total variation norm has a well-behaved fat-shattering dimension, and thus a well-behaved V_γ dimension as well, by (5). Recall the definition of the total variation of a function f : [a, b] → R:

V_a^b(f) = sup_P Σ_{i=0}^{|P|−1} |f(x_{i+1}) − f(x_i)|    (6)

where the supremum is over the set of all finite partitions P = (a = x_0, x_1, …, x_{|P|} = b) of [a, b].

Theorem ([20]): Let F be the collection of all f : [a, b] → R such that V_a^b(f) ≤ L. Then

V_γ(F) = fat_γ(F) = 1 + ⌊L/(2γ)⌋.

In contradistinction to Theorem 2, it may be the case that V_γ(F) and V_γ(G) are large while V_γ(F + G) is small. We state the following simple lemma without proof.

Lemma 5: Let F be a collection of differentiable functions f : R → R such that for each f ∈ F, the derivative f′ vanishes on at most L points. Then, V_γ(F) ≤ L + 2 for all γ > 0.

Theorem 6: There exist function classes F, G such that for all γ > 0, V_γ(F) = V_γ(G) = ∞ and V_γ(F + G) ≤ 3.

Proof: Let F be the collection of all differentiable f : R → R such that f′ < −1 on (−∞, 0) and −1 ≤ f′ ≤ 1 on (0, ∞). Let G be the collection of all differentiable g : R → R such that −1 ≤ g′ ≤ 1 on (−∞, 0) and g′ > 1 on (0, ∞). Consider an h ∈ F + G. By construction, h′ < 0 on (−∞, 0) and h′ > 0 on (0, ∞). Since h′ only vanishes at x = 0, the claim follows by Lemma 5. □

Remark 7: An analogous construction exists for the VC-dimension. Let F : Ω_1 → {−1, 1} and G : Ω_2 → {−1, 1} be two concept classes defined over two disjoint sets Ω_1, Ω_2. Put Ω = Ω_1 ∪ Ω_2 and define F̄ = {f̄ : f ∈ F}, where f̄ : Ω → {−1, 1} is given by f̄(x) = f(x) for x ∈ Ω_1 and f̄(x) = 1 for x ∈ Ω_2, and define Ḡ analogously. Define

H = F̄ ∨ Ḡ = {h = f̄ ∨ ḡ : f̄ ∈ F̄, ḡ ∈ Ḡ}.

Then, VCdim(H) = 0, regardless of how large VCdim(F̄) or VCdim(Ḡ) might be; indeed, every h ∈ H is identically 1.

However, such a construction, with infinite dimension for F and G but finite dimension for F + G, cannot exist for the γ-fat shattering dimension.

Theorem 8: For any γ > 0, if F and G are nonempty function classes and fat_γ(F) = ∞ or fat_γ(G) = ∞, then fat_γ(F + G) = ∞.

Proof: Without loss of generality, let fat_γ(F) = ∞ and n ∈ N. There exists some set S ⊆ Ω, with |S| = n, which is γ-fat shattered by F with witness of shattering r : S → R. Then, F + G also γ-fat shatters S: pick any g ∈ G and consider the witness of shattering r + g : S → R; given a label assignment y ∈ {−1, 1}^S, there exists f ∈ F such that

y(x)(f(x) − r(x)) ≥ γ > 0

so for f + g ∈ F + G

y(x)([f(x) + g(x)] − [r(x) + g(x)]) ≥ γ > 0.

Hence, fat_γ(F + G) = ∞. □
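The theorem of [20] quoted above is matched by an explicit construction: ±γ step functions with at most n − 1 sign changes have total variation at most 2γ(n − 1) ≤ L, so they realize every labeling of n = 1 + ⌊L/(2γ)⌋ points with the constant witness r = 0. A small numerical check of this (our own sketch, with invented values of L and γ):

```python
import math
from itertools import product

L, gamma = 2.0, 0.3
n = 1 + math.floor(L / (2 * gamma))        # here n = 4, matching [20]

for y in product((-1.0, 1.0), repeat=n):
    values = [gamma * yi for yi in y]      # step function: f(x_i) = gamma * y_i
    # Total variation of the step function equals the sum of its jumps.
    tv = sum(abs(b - a) for a, b in zip(values, values[1:]))
    assert tv <= L                         # f stays in the class
    assert all(yi * v >= gamma for yi, v in zip(y, values))
print(f"all {2**n} labelings of {n} points realized within variation {L}")
```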


V. TRIGONOMETRIC FUNCTIONS AND SERIES

In this section, we give explicit upper and lower estimates on the fat-shattering dimension of trigonometric functions and series. Let F_sin be the family of functions f : R → R defined by F_sin = {f(x) = sin(αx) : α ∈ R} and define the concept class sgn(F_sin) : R → {−1, 1} by

sgn(F_sin) = {sgn(f) : f ∈ F_sin}.

A well-known construction due to Levin and Denker [4], [21] shows that VCdim(sgn(F_sin)) = ∞, providing an example of a concept class parametrized by a single real number with infinite VC-dimension. We extend this construction to the fat-shattering dimension.

Theorem 9: For all 0 < γ < 1, fat_γ(F_sin) = ∞.

Proof: Our proof is analogous to the construction in [21, Sec. 2.3]. Let n ∈ N and 0 < γ < 1 be given. Put β = (sin⁻¹ γ)/π and

b = 2⌈(1/2)/(1/2 − β)⌉ + 1.

Define x ∈ R^n by

x_i = b^{−i},  i = 1, …, n.    (7)

For each y ∈ {−1, 1}^n, define

α = b/2 + (1/2) Σ_{i=1}^{n} (b − y_i − 1) b^i.    (8)

It is straightforward to verify that

y_i sin(παx_i) ≥ γ,  i = 1, …, n. □
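The construction is easiest to appreciate numerically. The sketch below verifies the classical base-10 sign version of the Levin–Denker construction as presented in [21], not the margin-γ variant of (7) and (8): with α = π(1 + Σ_i z_i 10^i), z_i = (1 − y_i)/2, and x_i = 10^{−i}, the sign of sin(αx_i) equals y_i for every labeling y. All function names here are ours.

```python
import math
from itertools import product

def levin_denker_alpha(y):
    """alpha / pi as an exact integer: 1 + sum_i z_i * 10^i, z_i = (1-y_i)/2."""
    return 1 + sum(((1 - yi) // 2) * 10 ** (i + 1) for i, yi in enumerate(y))

n = 6
for y in product((-1, 1), repeat=n):
    m = levin_denker_alpha(y)
    for i in range(1, n + 1):
        # sin(alpha * 10^-i) = sin(pi * m / 10^i); reducing m mod 2*10^i
        # keeps the argument small so floating-point sin is trustworthy.
        s = math.sin(math.pi * (m % (2 * 10 ** i)) / 10 ** i)
        assert s * y[i - 1] > 0, (y, i)
print("all", 2 ** n, "labelings realized with correct signs")
```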

Obviously, fat_γ(F_sin) = 0 for γ > 1. The case of γ = 1 requires separate treatment.

Theorem 10: fat_1(F_sin) = 1.

Proof: Clearly, F_sin can shatter a single point at γ = 1. Suppose, contrary to our claim, that F_sin can shatter some two-point set {x_1, x_2} at γ = 1. This implies the existence of α_1, α_2 ∈ R such that

sin(α_1 x_1) = +1 ⟺ α_1 x_1 ∈ (2Z + 1/2)π
sin(α_1 x_2) = +1 ⟺ α_1 x_2 ∈ (2Z + 1/2)π
sin(α_2 x_1) = +1 ⟺ α_2 x_1 ∈ (2Z + 1/2)π
sin(α_2 x_2) = −1 ⟺ α_2 x_2 ∈ (2Z + 3/2)π.

Rearranging, we have

x_1/x_2 ∈ (2Z + 1/2)/(2Z + 1/2) ∩ (2Z + 1/2)/(2Z + 3/2).    (9)

In particular, (9) implies the existence of k, ℓ, m, n ∈ Z such that

(4k + 1)/(4ℓ + 1) = (4m + 1)/(4n + 3)

which is impossible. □

A partial converse to Theorem 9 is that restricting both the range of x and the frequency α yields a finite fat-shattering dimension.

Theorem 11: Let F be the family of functions f : [0, M] → R defined by F = {f(x) = sin(αx) : α ∈ [−A, A]} for some 0 < M, A < ∞. Then

fat_γ(F) ≤ ⌊MA/(2γ)⌋ + 1,  0 < γ < 1.    (10)

Proof: Since |f′| ≤ A for all f ∈ F, it suffices to prove that (10) holds when F is the collection of all A-Lipschitz functions on [0, M]. Appealing to [22, Th. 2], we have

fat_γ(F) ≤ M(2γ/A)

where M(·) is the packing number of [0, M]. Since M(ε) = ⌊M/ε⌋ + 1, the claim follows. □
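Theorem 11 is an upper bound, so it cannot be verified by finite search, but one can probe it numerically. The heuristic sketch below is ours: the values of M, A, and γ are invented, α is discretized to a grid, and the witness is fixed at r ≡ 0, so the search only explores a lower bound on fat_γ; it prints the Theorem 11 bound alongside the search outcome for comparison.

```python
import math
from itertools import product

M, A, gamma = 1.0, 8.0, 0.5
bound = math.floor(M * A / (2 * gamma)) + 1    # Theorem 11 bound: 9
alphas = [-A + 2 * A * k / 4000 for k in range(4001)]

def shatters(points):
    """Can {sin(alpha*x) : alpha on the grid} realize every labeling of
    `points` with margin gamma and witness r = 0? Grid search only, so a
    negative answer is merely numerical evidence, not a proof."""
    for y in product((-1, 1), repeat=len(points)):
        if not any(all(yi * math.sin(a * x) >= gamma for yi, x in zip(y, points))
                   for a in alphas):
            return False
    return True

pts = [M * (i + 1) / 3 for i in range(3)]
print("Theorem 11 bound:", bound, "| 3 points shattered:", shatters(pts))
```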

Remark 12: If either the range of x or the frequency α is unbounded, we have fat_γ(F) = ∞ for 0 < γ < 1. This is easily seen from the construction in Theorem 9. Suppose the range of x is [0, 1] and α is unbounded. Instead of the x and α constructed in (7) and (8), respectively, put x′ = x/b and α′ = bα. The case of bounded α and unbounded x is handled analogously.

Our observations regarding sine functions can be extended to trigonometric series.

Theorem 13: Let I ⊆ R be a (possibly unbounded) segment and let A, L ∈ (0, ∞]. Let F_{I,A,L} be the family of functions f : I → R defined by

F_{I,A,L} = {f(x) = Σ_{i=1}^{∞} c_i sin(α_i x) : c ∈ R^N, ‖c‖_1 ≤ L, α ∈ [−A, A]^N}.

Then

fat_γ(F_{I,A,L})
  = 0,                    γ > L
  = 1,                    γ = L
  ≤ ⌊|I|LA/(2γ)⌋ + 1,     |I| < ∞, γ < L < ∞, A < ∞
  = ∞,                    γ < L, ∞ ∈ {|I|, A}    (11)

where |I| is the length (Lebesgue measure) of I.

Proof: The first relation is obvious, since |Σ_{i∈N} c_i sin(α_i x)| ≤ ‖c‖_1 ≤ L. To prove the second relation, note that |Σ_{i∈N} c_i sin(α_i x)| = L if and only if |sin(α_i x)| = 1 for each i ∈ N. Hence, the argument from Theorem 10 can be applied termwise to show that F_{I,A,L} cannot shatter two points at γ = L. To prove the third relation, we observe that ‖c‖_1 < ∞ implies that Σ c_i sin(α_i x) converges pointwise. Furthermore, there is no loss of generality in assuming that each c_i > 0 (since the corresponding α_i can be replaced by its negative). Hence, ‖c‖_1^{−1} Σ_{i∈N} c_i sin(α_i x) is a convex combination of A-Lipschitz functions defined on a compact domain and is therefore itself Lipschitz with constant at most A. Thus, the packing bound for Lipschitz functions invoked in the proof of Theorem 11 applies here as well. The fourth relation is an immediate consequence of Remark 12, since in particular F_{I,A,L} contains functions of the form f(x) = L sin(αx). □
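For convenience, the case analysis (11) transcribes directly into code. The function below (our naming) returns the value, or in the third case the upper bound, that the theorem assigns to fat_γ(F_{I,A,L}):

```python
import math

def fat_bound_trig_series(len_I, A, L, gamma):
    """Value (first, second, fourth cases) or upper bound (third case)
    assigned by (11) to fat_gamma(F_{I,A,L}); len_I is |I|, and any of
    len_I, A, L may be math.inf."""
    if gamma > L:
        return 0
    if gamma == L:
        return 1
    if math.isinf(len_I) or math.isinf(A):
        return math.inf                    # fourth case of (11)
    if math.isinf(L):
        return math.inf                    # not covered by (11); trivial bound
    return math.floor(len_I * L * A / (2 * gamma)) + 1

print(fat_bound_trig_series(1.0, 10.0, 2.0, 0.25))       # 41
print(fat_bound_trig_series(math.inf, 10.0, 2.0, 0.25))  # inf
```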

Remark 14: The analysis above can be extended to more general trigonometric series of the form Σ_i a_i sin(α_i x) + Σ_i b_i cos(β_i x) in a straightforward fashion.

VI. CONCLUSION

We have investigated the behavior of various scale-sensitive dimensions under Minkowski addition, providing some positive and negative results. An intriguing future research direction would be to make the dependence of α and β on u, k, and γ in Theorem 3 more explicit. This could be of relevance, for example, to boosted regression [23] or continuous-output neural networks.


ACKNOWLEDGMENT

The authors would like to thank N. Alon for helpful correspondence and discussions.

REFERENCES

[1] N. Alon, S. Ben-David, N. Cesa-Bianchi, and D. Haussler, "Scale-sensitive dimensions, uniform convergence, and learnability," J. ACM, vol. 44, no. 4, pp. 615–631, 1997.
[2] P. L. Bartlett and P. M. Long, "Prediction, learning, uniform convergence, and scale-sensitive dimensions," J. Comput. Syst. Sci., vol. 56, no. 2, pp. 174–190, 1998.
[3] M. Kearns and U. Vazirani, An Introduction to Computational Learning Theory. Cambridge, MA, USA: MIT Press, 1994.
[4] V. N. Vapnik, The Nature of Statistical Learning Theory. New York, NY, USA: Springer-Verlag, 1995.
[5] E. B. Baum and D. Haussler, "What size net gives valid generalization?" Neural Comput., vol. 1, no. 1, pp. 151–160, 1989.
[6] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," J. Comput. Syst. Sci., vol. 55, no. 1, pp. 119–139, 1997.
[7] D. Eisenstat, "k-fold unions of low-dimensional concept classes," Inform. Process. Lett., vol. 109, nos. 23–24, pp. 1232–1234, 2009.
[8] D. Eisenstat and D. Angluin, "The VC dimension of k-fold union," Inform. Process. Lett., vol. 101, no. 5, pp. 181–184, 2007.
[9] M. Vidyasagar, A Theory of Learning and Generalization: With Applications to Neural Networks and Control Systems. London, U.K.: Springer-Verlag, 1997.
[10] H. H. Duan, "Bounding the fat shattering dimension of a composition function class built using a continuous logic connective," Waterloo Math. Rev., vol. 2, no. 1, pp. 4–20, 2012.
[11] A. Blumer, A. Ehrenfeucht, D. Haussler, and M. K. Warmuth, "Learnability and the Vapnik–Chervonenkis dimension," J. Assoc. Comput. Mach., vol. 36, no. 4, pp. 929–965, 1989.
[12] D. Pollard, Convergence of Stochastic Processes. New York, NY, USA: Springer-Verlag, 1984.
[13] M. Mohri, A. Rostamizadeh, and A. Talwalkar, Foundations of Machine Learning. Cambridge, MA, USA: MIT Press, 2012.
[14] C. Zhang, W. Bian, D. Tao, and W. Lin, "Discretized-Vapnik–Chervonenkis dimension for analyzing complexity of real function classes," IEEE Trans. Neural Netw. Learn. Syst., vol. 23, no. 9, pp. 1461–1472, Sep. 2012.
[15] C. Zhang and D. Tao, "Structure of indicator function classes with finite Vapnik–Chervonenkis dimensions," IEEE Trans. Neural Netw. Learn. Syst., vol. 24, no. 7, pp. 1156–1160, Jul. 2013.
[16] P. L. Bartlett and S. Mendelson, "Rademacher and Gaussian complexities: Risk bounds and structural results," J. Mach. Learn. Res., vol. 3, pp. 463–482, Nov. 2002.
[17] V. Koltchinskii and D. Panchenko, "Empirical margin distributions and bounding the generalization error of combined classifiers," Ann. Statist., vol. 30, no. 1, pp. 1–50, 2002.
[18] S. Boucheron, O. Bousquet, and G. Lugosi, "Theory of classification: A survey of some recent advances," ESAIM, Probab. Statist., vol. 9, pp. 323–375, Nov. 2005.
[19] G. Gnecco and M. Sanguineti, "Approximation error bounds via Rademacher's complexity," Appl. Math. Sci., vol. 2, nos. 1–4, pp. 153–176, 2008.
[20] H. U. Simon, "Bounds on the number of examples needed for learning functions," SIAM J. Comput., vol. 26, no. 3, pp. 751–763, 1997.
[21] C. J. C. Burges, "A tutorial on support vector machines for pattern recognition," Data Mining Knowl. Discovery, vol. 2, no. 2, pp. 121–167, 1998.
[22] L.-A. Gottlieb, L. Kontorovich, and R. Krauthgamer, "Efficient classification for metric data," in Proc. COLT, Haifa, Israel, Jun. 2010, pp. 433–440.
[23] N. Duffy and D. Helmbold, "Boosting methods for regression," Mach. Learn., vol. 47, pp. 153–200, 2002.
