My life as a Ph.D. student, S01E03: Coursework so far

Apparently PhD students hate coursework, especially those irrelevant to their research. In CSCI-697 when every new PhD student is asked to introduce themself – including the thing that they are most and least looking forward to do, many mentioned coursework for the latter.

But I guess that really does not have anything to do with academic level, because every student hate coursework! Some student next door in my apartment building could party till after midnight even when the next day is a workday. Some people are not coming to the college to study! I won’t say that we don’t have those people in Tsinghua, but at least I didn’t personally know any.

I took CSCI-670 “Advanced Analysis of Algorithms”this semester. I didn’t really expect to learn much new knowledge. How advanced could it be? For many people, network flows are already pretty advanced. But for someone competed in algorithm contests in high school, even the mediocre ones like me, that’s a piece of cake.

Then, it turned out that – I’m not looking in the right direction. That’s one problem with a lot of algorithm contest competitors. What we take as advanced algorithms are really just difficult, delicate, cleverly constructed to solve a puzzle rather than a practical problem, and does not involve advanced mathematics.

Just as suggested by the name of the course, the advanced thing really is in the analysis. The algorithm itself could be pretty straightforward to implement. And the purpose of the course is not to teach you how to design algorithms. It talks about ideas not seen in basic algorithms and those delicate ones.

Our first topic is – PageRank. That’s an unexpected one! Isn’t that as simple as solving a linear system? That’s what we learned from our undergraduate network science course. Yes, but solving a linear system is not simple anymore when you have a billion variables! That’s the first important idea we are taught. In the world of big data, you need to change the way you see things. The data is there, but you can’t access all of it at once. You want to know something mathematically defined. How do you do it? you sample.

But sampling alone does not solve the problem. To calculate the exact PageRank you need to know the whole network anyway. So there is the second important idea: you don’t need to know everything exactly with full confidence to do useful things. Tolerating an error margin and a small chance of failure makes intractable problems tractable.

That’s like polling. To know exactly who’s gonna to be the president you need to ask pretty much everyone’s opinion. But if you allow an error of several percent, you just need to ask several thousands of people if you sample them from the correct distribution.

In our case, instead of the exact PageRank of each node, we want a set of significant nodes, with the property that any node with PageRank ≥ Δ is in the set and any node with PageRank ≤ Δ/2 is not in it, with high probability. For those node with PageRank in between, they can either be in or not in the output.

The gap is where the magic lies in. As long as the lower threshold is a constant fraction of the higher threshold, the problem is tractable, and after a lengthy mathematical analysis (that span several 2-hour lectures), it all boiled down to just randomly selecting some nodes, actually do the PageRank random walk, and record some frequency.

And the mathematical tool we used is the Chernoff bound. We learned that in our Mathematics for Computer Science course (taught by Yao himself!) in my freshman year, and in every theory related course we review it. It is no doubt the second most useful thing we learned in MfCS! (Wondering what’s the most useful thing? this course is unofficially known as LaTex Typesetting for Computer Science). It is in the same spirit of Markov’s and Chebyshev’s inequalities and the central limit theorem and looks quite fundamental, so I expected it to be as old. To my surprise, it is quite new in the sense that its namesake, Herman Chernoff, is still alive! And judging by the reaction when our lecturer said that he once met Chernoff in person in MIT, most other students are surprised too.

We are now on the second topic, the clustering problem. We will be constructing the k-nearest neighbor graph in arbitrary dimension in time O(n log^2 n), and the whole solution involved solving some quite varied problems.

It had something to do with the kissing number problem! That was another unexpected thing. I read about that problem in a discrete geometry book years ago, and now I know that it is more than a pure curiosity. Its solution is also interesting – we know the exact solution only in dimension d = 1, 2, 3, 4, 8 and 24! What an odd sequence! I know the 8 and 24 case have something to do with certain special lattices. I wish I were better at abstract algebra so that I can appreciate them!

Another related problem is – if you drive across Minnesota, the state of then thousand lakes, how many of them could you expect to see? The student that guessed 100 definitely had some good intuitions! We formulated an idealized version of the problem: if you have n non-overlapping circles on a sphere, then regardless of the configuration of the circles, a random great circle would in expectation cut through no more than 2√n of them. I actually proved this myself! I guess my mathematics is not bad after all.

And then yet another related problem – to pick the exact medium of n numbers you have to go through all of them. But what if you are doing divide-and-conquer and just need to pick some number close enough to the medium with high confidence? By iterative-middle-of-three method, with a constant error margin and failure rate the number of numbers we need to see is a constant regardless of n! Again we see the power of allowing error margin and chance of failure.

That’s pretty much everything up until now. No mind-blogging mathematics, but different ways of looking at the problem and putting everything together.

Share this:

Related

Leave a comment Cancel reply