
As I have mentioned to you several times, start with Vapnik; that's where modern ML begins. Breiman is great, but he got a few things wrong: for example, he thought boosting and bagging are fundamentally the same, i.e. that they are both just variance minimization methods. In fact boosting is what gives random forests their power. BTW, a less well known trick: use random rotations, not just random selection of features or data points. There's a lot more to ML than decision trees. One of its fundamental frameworks is convergence of empirical processes, which is essentially Glivenko-Cantelli on steroids. Vapnik and Chervonenkis provided the first modern breakthrough result, but a lot has happened since then.
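
A minimal sketch of the random-rotation trick (assuming numpy and scikit-learn, 0/1 labels; the class name RandomRotationForest is just for illustration, not anyone's reference implementation): each tree gets a bootstrap sample seen through its own random orthogonal rotation of feature space.

    # Sketch: random-rotation tree ensemble (illustration only)
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def random_rotation(d, rng):
        # QR of a Gaussian matrix gives a random orthogonal matrix;
        # fixing the signs of R's diagonal makes it Haar-distributed.
        q, r = np.linalg.qr(rng.normal(size=(d, d)))
        return q * np.sign(np.diag(r))

    class RandomRotationForest:
        def __init__(self, n_trees=100, seed=0):
            self.n_trees = n_trees
            self.rng = np.random.default_rng(seed)

        def fit(self, X, y):
            d = X.shape[1]
            self.members = []
            for _ in range(self.n_trees):
                R = random_rotation(d, self.rng)
                idx = self.rng.integers(0, len(X), len(X))  # bootstrap rows
                tree = DecisionTreeClassifier().fit(X[idx] @ R, y[idx])
                self.members.append((R, tree))
            return self

        def predict(self, X):
            # average the trees' 0/1 votes, then threshold at 1/2
            votes = np.mean([t.predict(X @ R) for R, t in self.members], axis=0)
            return (votes >= 0.5).astype(int)

The point is that axis-aligned splits, seen through a random rotation, become oblique splits in the original coordinates, which decorrelates the trees beyond what feature subsampling alone gives.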

I would say screw the hype, just read what it is. I think your disdain for the name 'machine learning' is getting in your way.

I strongly disagree with your allegation of academic dishonesty, though. Optimization has been a fundamental part of ML from the very start, with a strong overlap in the communities. I am sure Nemirovski will ring a bell. In any case, how can ML steal from applied math when it is applied math? The lingo differs here and there: estimating parameters becomes learning parameters, etc.

Andrew Ng's course is not for you. Go directly to the books and then to conference and journal papers.

Focus varies, but info theory, stats, signal processing, and approximation theory are pretty much the same thing, and so are their open problems.

BTW, you could not be more wrong about stochastic optimal control or repeated games being far from current ML. Oh, and another major, major tool in ML is the geometry of function spaces. You are reading the wrong sources, probably the populist ones. Get to the real stuff; given what I know of your taste in math, I think you will enjoy it.



Thanks, I'll keep that in mind.

"glivenko cantelli on steroids", good. Sounds like they actually did something.

Yes, I'm torqued by the new learning labels on old bottles of pure/applied math, but that is not in my way.

> The lingo differs here and there: estimating parameters becomes learning parameters, etc.

Rubs my fur the wrong way.

If they have some stuff beyond borrowing from Breiman, okay.

What's "in my way" now is my startup: I've got the math derived and typed into TeX and the 80,000 lines of typing for the code, with the code running, intended for production, and in alpha test, so just for now I no longer have any pressing math problems to solve.

But, in time, I may return to my math and tweak it a little to try to get some variance reduction. Maybe some of the better machine learning literature would help, or maybe I'll just derive it myself again.

Function space geometry is about where much of my core math is.

Thanks.


Heh! Indeed, it's all geometry :)

Happy to hear back from you. I am actually gladdened that your anomaly detection work has been getting some interest on HN lately. Hope something comes of it. I am now slowly coming to the conclusion that pushing better methods onto an existing stack would be really hard. Too much friction, too much politics. Perhaps the way is to create your own better cloud of servers, but that's really big league stuff. Not sure I have the stomach for that.

Curious whether you have given any thought to the choice of the metric space where you define your statistics. That might play an important role, from what I have seen. There might be an interesting manifold story there.

Big spoilsports are non-stationarities, and even bigger are those fat tails. If only everything had a moment generating function.
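
A tiny illustration of why the fat tails spoil things (a sketch, assuming numpy): the Cauchy distribution has no moment generating function and not even a mean, so its sample average never settles down, while the Gaussian average concentrates like 1/sqrt(n).

    import numpy as np

    rng = np.random.default_rng(0)
    for n in (10**2, 10**4, 10**6):
        # Cauchy: no MGF, no mean -- the running average keeps jumping around
        print("cauchy", n, rng.standard_cauchy(n).mean())
    for n in (10**2, 10**4, 10**6):
        # Gaussian: MGF exists -- the average shrinks toward 0 like 1/sqrt(n)
        print("normal", n, rng.standard_normal(n).mean())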

I see that you have been pointed to Abu-Mostafa; he is definitely a good source. Not that Andrew Ng is unaware of the stuff, far from it; he is fighting a different battle: to make parts of ML a commodity.

If you have time, you can browse the proceedings of COLT (the Conference on Learning Theory) and ICML.

> or maybe I'll just derive it myself again.

You almost always have to derive it yourself anyway even after you have seen the derivation by somebody else.


http://amlbook.com

https://www.youtube.com/watch?v=mbyG85GZ0PI (Incidentally, for graycat: Yaser Abu-Mostafa is a Muslim Egyptian immigrant from Cairo.) He covers the VC dimension in https://www.youtube.com/watch?v=Dc0sr0kdBVI and leaves the proof to an appendix of the book.


Thanks. I looked for a few minutes at two of his lectures. I'll keep the URL and watch his lectures during dinners.

Lucky he got out of Egypt before they strung him up! Such good people are what US immigration has, for some decades now, tried to be about. Maybe we will get back to it.

"Muslim"? I don't care if he is Zoroastrian either. Or worships some sun god. I don't care about his religion. I do care if he wants to blow up buildings. Somehow I doubt if he does.

Looking at his videos, my first-cut, crude guess is that he is looking at modern generalizations of old discriminant analysis. Yup, that can be important. Maybe it could be important for, say, one of my old interests, anomaly detection, as a doable alternative to Neyman-Pearson where often in practice we don't have nearly enough data. Maybe his interest is in medical diagnosis which, IIRC, was some of Breiman's interest.

But, first cut, it looks like, again, the criterion will be: does the model fit the data well? That is, we have little or nothing to recommend the model except that it fits the data well. But, then, in the case of his lectures, it looks like maybe he is making progress toward also knowing that the model will predict well. I'm looking forward to how he does that.

In contrast, if that is important: in my work in anomaly detection, discussed here on HN often enough, I found the false alarm rate from some derivations in applied probability, with no model fitting at all. Okay, I don't care if the cat is black or white as long as it catches mice.

From a glance, it looks like he is addressing what is meant by learning -- terrific! Not just throwing words around! Then he seems to be addressing when such learning is feasible, etc. Sounds good; I've wondered some about something like that.

But my interest now in what he is doing is a bit limited since the core math in my startup seems to be quite different.

Thanks.


> it looks like maybe he is making progress toward also knowing that the model will predict well. I'm looking forward to how he does that.

Yes, that's exactly it. By way of Vapnik and Chervonenkis' result (essentially a uniform law of large numbers), one upper-bounds the expected accuracy (over the unknown distribution) of a classifier in terms of the training error and another quantity that depends on the class of hypotheses that one is using. One can give bounds even when one is using an infinite class, for example all linear functions in the feature space, or some Hilbert space of functions.
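
In symbols, one standard textbook form of the VC bound (constants vary from source to source): with probability at least 1 - \delta over an i.i.d. sample of size n, simultaneously for every hypothesis h in a class of VC dimension d,

    R(h) \le \hat{R}_n(h) + \sqrt{\frac{8}{n} \left( d \ln\frac{2en}{d} + \ln\frac{4}{\delta} \right)}

where R(h) is the expected error over the unknown distribution and \hat{R}_n(h) is the training error; the square-root term is the "other quantity" that depends only on the hypothesis class, not on the distribution.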

This was one of _the_ major early breakthrough results. It's often quoted in the context of ML, but it really is a result in probability theory. Since they bound the most pessimistic situation possible, the bounds are quite loose (although achievable).

It also brought about a paradigm change in mindset. Since the optimal classifier is just the thresholded conditional density, early approaches had focused mostly on estimating this conditional density. But that's an impossible task. V&C showed that even if you do not have enough data to learn the density, you may have more than enough for good prediction accuracy. Don't learn the conditional density; just learn the discriminating function directly by optimizing its expected loss.
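
In one line (notation mine): instead of estimating the conditional density and thresholding it, pick the empirical risk minimizer over the class,

    \hat{h} = \arg\min_{h \in H} \frac{1}{n} \sum_{i=1}^{n} \ell(h(x_i), y_i)

for a loss \ell, and let the uniform convergence result above guarantee that this empirical minimizer also has small expected loss.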

People have since moved to different tools to bound expected prediction accuracy. You get a lot more reasonable bounds with, say, the PAC-Bayesian theorem.
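
For reference, one common McAllester-style form of the PAC-Bayesian bound (again, constants differ between sources): fix a prior \pi over hypotheses before seeing the data; then with probability at least 1 - \delta, simultaneously for all posteriors \rho,

    \mathbb{E}_{h \sim \rho}[R(h)] \le \mathbb{E}_{h \sim \rho}[\hat{R}_n(h)] + \sqrt{\frac{\mathrm{KL}(\rho \,\|\, \pi) + \ln(2\sqrt{n}/\delta)}{2n}}

Here the KL divergence from the prior plays the role the VC dimension played above, which is why a well-chosen prior can give much more reasonable numbers.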

The key thing is that these bounds are distribution-independent, non-asymptotic, and also dimensionality-independent.



