My life as a Ph.D. student, S01E06: Entering working mode

I haven’t written anything here for two months… That’s what happens when you have research to do and no interesting story to tell. Now that several months has passed since I arrived here, the feeling of freshness is completely gone, and its time for serious work.

I don’t know how an average student spend the very earliest stage of his academic career – before his first publication. It must take some unusual talent to give big results and publish first-author top conference papers outright! I guess many have to do some mundane task as part of a large project, then get his first publication as a credit for his work. Then I should be happy that I don’t have those mundane task to do!

I’m basically still exploring GANs. I’ve probably found a way to generate high-resolution images with a single model. It would be really exciting if I can beat the state-of-the-art, but I’ll still consider it to be a good experience even if I can’t. I’ve learned a lot about neural networks, and I got a lot of interesting ideas to try, which I won’t tell you before I publish 🙂

The life is kind of dull, though. This is probably the first time in my life when I could keep working hard for weeks, but also the first time when I sometimes feel mentally exhausted. I know that that’s gonna be the norm. The world is in turmoil, but as long as WWIII does not happen and no one is shooting at you, it won’t make your life any different. As long as progress in your work keeps you cheered up things would be fine, but if you got stuck in your work and at the same time does not have a life, it would be bleak…

The semester is over. But you don’t really have holidays as a PhD student. You always have one conference submission deadline after another, and for me next year’s SIGGRAPH would be the first one. We are not sure if we could do what we want yet, but I hope we could, of course.

A little bit of thought on solving Rubik’s Cubes and the possibility of doing that with neural networks

I’ve been thinking about solving combination puzzles with computer for a long time.

I believe that every CS student have learned how to solve the 15-puzzle with heuristic search in elementary AI courses, or at least to solve the simpler 8-puzzle with BFS in elementary algorithm courses. At first sight, Rubik’s Cube may seem to be a similar puzzle, just a bit bigger.

Yes, they are all “sequential move” type of puzzles where you try to restore a bunch of things to some initial configuration by performing some operation sequentially. But Rubik’s Cube is definitely much more difficult, both for humans and computers. Size is not what really matters – you can make yourself an arbitrarily large (n^2-1)-puzzle, and it will still be easier than Rubik’s Cube, just quadratically more boring and time consuming. Apparently there are deep group theoretic reasons. Actually I once read about why some of these puzzles are more difficult than others in a Scientific American article. But I think this can be explained by 15-puzzle being more local than Rubik’s Cube: in Rubik’s Cube, one move could affect 1/3 of all blocks where in 15-puzzle, one move affect only one piece.

The consequence is that in 15-puzzle you can see clearly which configuration is closer to being solved – you can see it more easily as a human, and you can design a heuristic more easily for a computer. In Rubik’s Cube, this is much more difficult. I haven’t even seen one example of a good search heuristic for Rubik’s Cube. Maybe people just does not bother finding one: by memorizing some 200 move sequences and practice people could solve a Rubik’s Cube under 10 seconds, and the total number of states of a Rubik’s Cube, 43,252,003,274,489,856,000 (I can memorize this number) and the God’s number, 20, is just small enough to make it tractable to solve with a (bidirectional) brute-force search.

But when you have a larger combination puzzle – a 4×4×4 or 5×5×5 or bigger cube or a megaminx, the number of possible states explodes and brute-force search becomes mostly useless.

Somehow people could still solve them by finding patterns. Someone have even designed an algorithm to solve a n×n×n in Θ(n^2/log n) moves – what a strange time complexity, a log in the denominator!

Actually, finding solutions to all those exotic puzzles is my favorite part in this Rubik’s Cube game. The 200 or so formulas may have been found by brute-force computer search, but one you have played with a Rubik’s Cube for some time and if you are clever enough, you can construct those formulas yourself for other puzzles. You start by just randomly – or maybe purposefully – turning the faces and see what happens. Then you suddenly find a short move sequence that does something interesting – say, rotating a small number of pieces while leaving their position unchanged, or only moving some corner pieces but does not affect edge pieces.

Then you get a bunch of formula, each one of them will keep a certain set of pieces unchanged, move a second set of pieces in some desired fashion, and move the remaining pieces in someway you don’t really care.

Then with what you have at hand, you define your strategy. Be it edge-first, corner-first, layer-first or whatever you may want to call it, usually you have several stages. In each stage, you want to keep unchanged a larger set of pieces that are already restored than the previous stages, and try to restore several more pieces to advance to the next stage.

You might not notice it but what you are doing is actually repeatedly reducing the group of possible states to one of its subgroups, until you get a trivial group consisting only of the solved state.

The process might not be optimal, but it works.

But it seems that we have not formulated this process well enough so that we can let computers do it. I’ve been thinking about it for a while. May be we human and computers could cooperate: we define the stages so that the quotient group between two stages is not so large so that we can brute-force search for a formula that brings us from one stage to the next. It would be so good if the computer can also learn to define the stages. But that part does involve some insight into the structure of the puzzle.

Yesterday when I was dealing with the GAN thing, I suddenly came up with the idea of solving Rubik’s Cubes with a neural network.

I think that’s quite natural. These days people seem to believe that neural networks can do everything. But after a quick google search I found that few people have considered the same thing.

So it’s time to have a try! Brute-force search definitely is ineffective for solving complex combination puzzles but maybe neural networks could do the magic. The idea cannot be simpler: encode the state of each piece into one-hot vectors and just build a network to predict from the input state the mext move towards solving the puzzle.

Then it comes to the encoding, the training data and the structure of the network. I’ve seen some people represent the state of a Rubik’s Cube by one-hot encoding the color of stickers at each position… that sounds ridiculous. I think one-hot encoding the position of each piece makes more sense. We have 20 movable pieces, and each of them have 24 positions, so the input would be a 480-dimensional 0-1 vector. Note that this is actually more than what you need if you choose to use stickers: 48 movable stickers, each with 6 possible colors, producing a 288-dimensional 0-1 vector. But still, I think the position of the pieces is more intrinsic.

The training data would be obtained by scrambling the cube. We train the network to bring a state to its previous state in the scramble sequence. The scramble sequence should be generated from a random walk: with a fixed probability (say 0.05 which is the reciprocal of 20, the God’s number) we come back to the solved state, otherwise we choose randomly from 12 possible moves. This makes sense since it ensures that each state in the state space is visited exponentially more often from its nearest path to the solved state.

Then we come to the fun part – the structure. We could just use some plain fully-connected layers. But remember that there is symmetry in the Rubik’s Cube, as I mentioned in a previous post. So if we rotate the input state, the same rotation should apply to the prediction as well. This can be enforced through weight sharing. Imagine that all the neurons, including the input, are divided in to 24 classes, corresponding to 24 elements in the chiral octahedral symmetry group O. Then if there is a link between a neuron a a neuron b and under some spatial rotation neuron a becomes neuron c and neuron b becomes neuron d, then the weight between a and b and that between c and d should be identical! Now we have an interesting 24-fold weight sharing inside the network! This still looks like a fully connected network, yet it is strangely entangled in itself.

While I haven’t tried this yet, I’m expecting the network to do more than merely predicting the next step. By examining for each neuron which input state cause it to have maximum activation, we should be able to see what pattern the network is using to find the next move – what should be kept and what should be moved, which should help us find the stages in the human solving strategy.

玩玩我刚训练好的GAN……

吐个槽先……我感觉我的英文写作比中文好,但写英文还是挺费脑子的,写中文就可以很随便……毕竟中文是母语……

这是我第一次训练神经网络(跑别人写的代码不算),一周都在琢磨怎么把GAN训练出来,今天不想费脑子了,而且好像正好关于GAN的中文资料也很少,就用中文写了

Generative Adversarial Network,生成式对抗网络,是一种生成模型(废话)。基于神经网络的生成模型很常见,但主要是在序列形式的数据上训练RNN,比如生成文本啥的。GAN采取了不同的思路:对抗训练。两个神经网络:一个生成网络,学习从随机噪声向量产生与训练数据相似的样本,和一个判别网络,学习判别一个样本是来自训练数据还是来自生成网络。对抗的目标是,生成网络提高判别误差,判别网络降低判别误差。

对抗训练也不算是新思路了,比如让两个AI对弈或者对话啥的很久以前就有,但用于生成模型算是个比较新的用法。

训练过程也很简单:对生成网络和判别网络的训练交替进行,每一回合,首先从训练数据sample一个batch,标记为真,然后使用生成网络生成一个batch,标记为假,用这两个batch训练判别网络;再使用生成网络生成一个batch,标记为真,让判别网络进行判别,然后把误差反向传播到生成网络进行训练。

虽然思路很简单,但实际操作的时候还是比较tricky的,很难训练好。原因有几个:首先,生成网络的质量无法量化(使用判别误差显然是不行的),只能靠主观判断,而凭主观判断是难以看出训练到底进行到什么程度、有没有卡住的,也就导致很难找到合适learning rate。

其次,两个网络的训练速度是很不同的。为了确保两个网络共同进步,要使它们保持能力相当。原paper认为判别网络应当多训练。但我的实测结果是在使用相同的learning rate时判别网络对生成网络基本上是吊打,不知道是什么情况。如果尝试保持每一回合结束后的判别误差约等于random guess的话,每个回合大概要训练生成网络20次以上……我还没有做实验,不过感觉可能是和输入噪声变量的数量有关。

不过上面两点都是次要的。GAN最常见的失败模式是生成网络对所有输入给出相同的输出。这个很好理解:每一个回合,生成网络的最优策略当然是找到当前判别网络表现最差的那一个样本,然后对所有输入都给出那个输出。不过接下来判别网络马上就会发现这个样本是假的。然后下一个回合生成网络又会找到另一个判别网络表现最差的样本,就这样这个最差样本在样本空间里变来变去,两个网络捉迷藏……

这大概也是原paper认为判别网络应该多训练的原因:生成网络不可能一步就使得所有输入都输出那个最差样本,如果判别网络多训练,及时发现生成网络的输出变化趋势并将这个样本判别为假,让这个最差样本在样本空间里跑得比较快,生成网络就追不上,也就不会对所有输入都给出相同输出了。

同样道理,learning rate不能太大。

但实际操作的时候,还是总是会发生这种生成网络collapse的情况……

我是在MNIST手写数字数据集上尝试训练的。网络结构什么的……这么简单的数据集其实随便什么结构都好吧……判别网络输入接两层卷积,接一层全连接,接输出,使用batch bormalization。生成网络正好反过来。

在调整网络结构、调整learning rate、调整两个网络的训练速度比均无果之后,我想了个不太优美的办法。判别网络每次只判别一个样本。如果可以一次判别整个batch呢?虽然每一个样本都像是真的,但整个batch长得一样,明显就是假的嘛……

所以怎样让判别网络在判别单个样本的时候可以参照整个样本的信息?一个方法是,求整个batch的平均值,然后将每个样本与平均值的差值作为一个额外的channel。如果整个batch很相似,这个channel的值(的绝对值)会比较小,否则会比较大。

加了这个trick之后,还真的就训练出来了……看一下训练过程,可以发现生成网络还是会有将所有输入收敛到同一输出的趋势,但马上就要collapse的时候,判别网络习得了真batch和假batch的统计差异,然后再过几个回合,生成网络的输出就发散了。

看一下结果。以下是生成网络生成的数字:

211100

以下是真实数据:

true

还不错吧。不过也有一些明显的缺点:数字分布不均匀(1明显太多而8明显太少),以及有一些四不像的输出。再看看训练过程中输出的演化:

vis_x3

有一些很快就稳定了,还有一些跳来跳去,大概是位于输出不同数字的输入区域的边界上。

再看看其他好玩的事情,比如输入的128个标准正态分布随机变量都是干啥的。事实证明,不太好玩……由于128个太多了,而MNIST数据集很简单,没那么多的变化自由度,所以大部分变量只做了一点微不足道的工作,即使能看出来也难以解释到底是个啥作用……举个例子,以下每一组输出中只有一个输入从-2变化到2,其余变量固定:

var_vis_41

以下每组是随机两个输入之间的线性插值:

1d_vis_5

以下是四个输入之间的双线性插值:

2d_vis_8

针对在不同数字生成区域边界上存在的四不像输出,有可以改进的办法:由于MNIST是有class label的,可以做有监督学习:将class label转换成one-hot vector放在输入的种子里,确保不同数字生成区域分离,然后将判别网络改成判别真假+分类、生成网络的目标改成增加判别误差、减小分类误差,可以期望产生更好的输出。目前还在尝试中,结果如下

40000

看起来起码数字分布不均匀的问题是解决了……

这一周算是学习,接下来要把GAN投入实战来做本lab的一个项目了。

不过我自己是有其他打算的……GAN这么6的东西,赶紧拿来随机生成萌妹子啊!

想要高清无码大图是不太可能,而且看看一些在ImageNet之类的数据集上训练的结果可以知道,花花草草还行,想让GAN生成动物啥的结果会很猎奇……

但我想生成个眼睛啥的总还是可以做的吧,收集几万张图片也不是问题,有空试试。

My life as a Ph.D. student, S01E05: Still working on my first neural network…

I was planning to write a post after getting everything done but it took longer than what I like…

We are working on some generative models now and for the last week I was learning how to train a Generative Adversarial Network. GANs are super cool, but also notoriously difficult to train! They probably aren’t the best choice for a machine learning newbie like me.

But doing something challenging is fun! Using something that works out of the box is just too lame. And if it already works well, it likely isn’t worth studying.

Here I’d like to talk about what I’ve learned so far. Let’s see the result first. I’m getting something like this from a GAN trained on MNIST dataset:

27000

It is generating something. Not exactly digits but do resemble some hand written script…

I found that there are some practical issues not addressed in the DCGAN paper. It suggested using batch normalization. Batchnorm behaves differently when training and when evaluating. When training. it uses the mean and variance of the training batch. When evaluating, it uses the statistics of all the examples it has seen. It is logical to use evaluating mode for the generator when training the discriminator and vice versa.

In the original GAN paper, in each discriminator training round, the “true batch” from real data and the “false batch” from generated examples are fed into the discriminator separately. It is logical to make the distribution of each batch the same, so true and false examples should be mixed in the same batch.

To avoid making the wrong decision for the previous two issues I tried both ways round. But the suggestions given in the DCGAN paper still didn’t quite work out for me. I’ve followed them as much as I can. I can’t do the fractional-strided convolution for the generator network because that kind of convolution is not readily available in torch… But otherwise I did exactly what was advised. But still, the generator network keeps collapsing every input to a single output. According to what I read, this indeed is the most observed failure mode of GAN.

Then I came up with a not-so-elegant trick to solve this problem. The generator collapses because it thinks that output is the single best image to trick the discriminator into mistaking it for a real example from the dataset and learns to produce that output from any input. But then the discriminator quickly learns that that particular example is fake. Then the generator looks for the next output that can trick the discriminator…

This happens because the discriminator looks at a single example at a time. If we can somehow let the discriminator reject a batch if every example in the batch looks the same, then the generator should not be tempted to collapse every input into the same output anymore! How do we do this? My solution is, for each batch, take their mean. Then for each example in the batch, take its difference from the mean and concatenate the result with itself by adding them as additional channels. It is expected that if the batch is similar then the intensity in the additional channel would be small and the discriminator would be able to learn it.

But then there is a conflict with batchnorm. Since we want the discriminator to tell the difference between a true batch and a false bath, we cannot mix them in the same batch! But on the other hand, the difference of intensity between a true batch and a false batch is what makes the additional channels useful, with batchnorm, the differences are wiped out! To prevent this, we cannot let batchnorm normalize the batches individually and have to mix them!

So do we not use batchnorm? But without batchnorm, the training becomes very unstable.

The solution is to mix the two batches after their additional channels have been calculated separately.

This method actually works! the result is not perfect, but at least the generator does not collapse anymore! In the training process, you can see that at some point, the generator almost collapses but then the output soon starts to diverge.

At this point I’m not quite sure how to get the rest of things right. Tuning training parameters, or go model shopping?

And then there is one last mystery. The GAN paper says we should train the discriminator more. But in reality the discriminator constantly beat the crap out of the generator even if I train the generator 10 times as hard! Is this normal?

Actually the difficulty of balancing the training of the generator and the discriminator is another major factor that makes GAN hard to train, along with generator collapsing. The third factor probably is that since there are two networks competing, there is no single loss function to measure how well the training is going.

I’ll probably write something more elaborate if I do become better at training GANs.

And what is the one thing that I want to generate with a GAN?

Image of anime style eyes!

My life as a Ph.D. student, S01E04: Game on! (Well, not really)

USC is renowned for its School of Cinematic Arts. That’s at best remotely related to my work, though.

But only after I came did I discover that they have equally competitive game design programs.

I once played Cloud when I was in middle school. A beautiful game. And then, a while ago, I found that this game was made in USC and the creator is my compatriot and an alumnus of SJTU! Indie game development definitely was a rare thing back then in China, and arguably still is now.

I’m taking CSCI-522 “Game Engine Development” this semester. I did plan to take a graphics course, but my boss is not teaching this semester, and the other graphics course really is just elementary, so I opted to try something different.

I’m not planning to become a game developer. (Well, strictly speaking that possibility is not ruled out yet.) You don’t need a phd to make games!  Actually I doubt whether such courses will make you a “game developer”. My classmates are almost exclusively Master’s students. I don’t know whether they are MSCS or MFA, but either ways, being an expert in game engine is probably only gonna get you in to a big game company to work on their engines. You are still a programmer, just working with some different kind of program. You won’t be the next, say, Notch or Jonathan Blow. Does not sound cool.

But being cool is one thing, actually getting something to work is another thing! I barely know anything about the game development community but I guess in a big company you are guaranteed a good income while making a living as a indie game developer is much harder. If you think you had a great idea about game design but people don’t buy it, as often is the case, you’d probably rather learn to draw cute girls than designing games! At least you know someone’s gonna buy it if there is waifu! (No I’m joking)

But all those sort of things are none of my business. I’m taking this course to gain experience in working on giant complex projects. I heard that CSCI-522 is one of the more demanding courses offered by the compute science department. We are given a game engine that can barely do the basic rendering, animation, scripting, etc. and we will work on it so that in the end it will actually be something workable. Not being bold enough to read the source code of GCC or Linux kernel, this is the most complicated thing I’ve ever get my hands on.

In game engines efficiency is everything. Optimization is everything. That means you will work with a lower level language, and you won’t even want to use standard libraries. You don’t want generic solutions. You want something that works best for this particular program. So you do everything from scratch in C++: from memory management, to data structures, to runtime type information, to event system, and more.

And then, on top of that, you figure out how to abstract out the difference between hardware platforms and graphic APIs. Yes, we are gonna make it work on Windows and PSVita and iPad and more.

Then you do all the computation – physics and linear algebra and computational geometry, move objects around, pass control events around, gather draw calls and send them down the graphics pipeline.

Around that, you also need a script system and some sort of level editor. That really is a lot.

And while doing that, you use all kind of tricks to reduce coding work, do profiling, speed up compilation and help debugging.

Yes. Engineering. But as a graphics researcher, those are very useful skills if you want your research to be more than a paper and become something useful.

Thus far the technical part of the course has not gone beyond my existing knowledge, but I did learn some little tricks.

Here is one pre-class quiz question: you want to count the number of calls to certain functions. Suppose you want to implement several functions that take two integer arguments named a and b and return one integer. Define a macro called PROFILED_FUNC and a class called Tracker, such that

PROFILED_FUNC(func)
{
    //do some computation
    //return some value
}

defines such a function func, and when you call Tracker::stat(), it prints out the number of times each such function has been called.

The macro would be something like

#define PROFILED_FUNC(func)                            \
int func(int a, int b) {                               \
    static int callCount = 0;                          \
    if (callCount == 0) {                              \
        Tracker::register(#func, &callCount);          \
    }                                                  \
    callCount++;                                       \
    return _unprofiled_ ## func(a, b);                 \
}                                                      \
int _unprofiled_ ## func(int a, int b)

the # and ## are really alien to me! This made me believe more that I’ll never master C++.

And another trick: if you have a huge class definition in a header, you won’t want to include every time you use the class, since it slows down compilation. Time is money. It is common knowledge that we can forward declare a class, but that way we can only use a pointer to that class and cannot dereference it.

Now suppose you want to allocate a object of that class. Then you need to know its size. How do you avoid including the header?

You forward declare a function that allocate the object and return the pointer and put the definition in the implementation of that class.

And then there are other tricks. Thanks god that we don’t have template meta-programming!

闲得蛋疼,试论为什么人照镜子的时候会左右反转而不会上下反转

昨天看到这样一个问题:为什么人照镜子的时候会左右反转而不会上下反转?答案当然不是“人的眼睛是横着长的”……

之前我就见过一个类似的版本,然后直接看了答案——其实人照镜子的时候既不会左右反转也不会上下反转,而是会前后反转。当时我觉得“啊,好有道理”,然后就没有多想。不过昨天再次看到这个问题,觉得之前的答案并不完整:确实,人照镜子的时候是前后反转,但为什么大家普遍认为是左右反转呢?我决定以我的理解尝试解释一下这个问题。

我认为根本的问题在于人自身的对称性和对左右、前后、上下三组方向的理解方式的不同。

通常而言,上下是一个绝对概念,不依赖于人自身的姿势:在有重力的地方,重力的方向就是“下”,而重力的反方向就是“上”。而前后左右是相对概念,自己面向的方向是前,背对的方向是后。至于左右……为了避免引用宇称不守恒,不妨定义心脏所在的一侧为左(镜面人请自行把定义反过来)(没有心脏的人请自己看着办),对侧为右。因此人日常使用的“上下-左右-前后”坐标系既不是是世界坐标系也不是局部坐标系,而是二者的某种混合。由于人通常是直立的,这一世界坐标系中上、下的的定义方式与“头顶为上、脚底为下”的局部坐标系中的定义方式通常是一致的,因而不影响人描述其他事物相对于自身的方向。但当人处于其他姿势的时候就可能造成麻烦:比如人平躺的时候描述自己头顶所指的方向的物体,就不能仅仅使用上下左右前后。

照镜子的时候,人类判断镜中的像是否在某一方向上反转,也是采用相同的方式:对于上下,相对于世界坐标系进行判断;对于前后左右,相对于镜中的像的局部坐标系进行判断。如果镜子在头顶,问照镜子的人他在那个方向被反转了,很多人会同意镜中的像被上下反转了,因为它在世界坐标系中上下反转了。

但至此我们仍不能解释为什么事实上的前后反转会被解释为左右反转。显然左右反转并不是由于镜子置于我们的前方:如果镜子在我们的右侧,我们仍会判断为自己被左右反转。对于前述镜子置于上方的情况,如果追问其有没有被左右反转,我相信多数人会给出肯定的回答。事实上仅有上下反转,多数人却会解释为上下和左右同时反转,在两个方向反转,岂不是相当于并没有反转!问题显然在于前后和左右这两个局部坐标轴仍具有不同的性质。

日常生活中,我们总是可以将他人的局部坐标系经旋转平移与自己的局部坐标系重合。在描述某事物相对于他人的位置时,可以理解为先将其经同一旋转平移变换到自己的局部坐标系下然后进行判断。但在镜中,坐标系的手性会改变,理论上经过旋转平移是不能将镜中自己的像的局部坐标系变换为自己的局部坐标系的,而总是会在某一个方向被反转。因此前述“相对于镜中的局部坐标系进行判断”理论上是无法实现的。

但人体在左右方向具有对称性,因此如果我们在将镜中像的局部坐标系与自身的局部坐标系对齐的时候对齐前后轴而反转左右轴,我们的像仍然可以与自己重合,这就使得在实际上我们仍然可以“相对于镜中的局部坐标系”判断物体的位置关系。但这一过程中,左右是被反向的。

人体左右对称而前后不对称,导致了我们在将镜中的局部坐标系与自身的局部坐标系进行对齐的时候总是对其前后轴而反转左右轴,因此无论将镜子置于那个方向,我们镜中的像左右永远都会反转,而前后永远不会反转。至于上下则依镜子的方向而定,置于上下轴上则会上下反转,置于前后左右则不会上下反转。

那么假如人类不具有对称性,或者前后轴和左右轴均对称以至于无从选择变换局部坐标系的时候在哪一个方向方向对齐、在哪一个方向反转呢?

对于前一种情况,我认为总会有某一个方向,例如视觉器官所在的方向,会被认可为“正面”即“前方”,判断相对位置时会以这一方向对齐坐标系。对于后一种情况,大概根本不会存在“前后左右”这种相对方向的概念。

至此基本解决了为什么人照镜子的时候会左右反转而不会上下反转。但我认为我们在描述镜中其他物体的像的方向时仍有一些微妙之处。一些物体沿镜子排列时镜中的像在哪些轴被反向?垂直于镜子时呢?人观察的位置对结果有何影响?镜子的方向有何影响?物体自身的对称性又有何影响?如果物体表面上印有文字呢?人类对“上下左右前后”这一系列方向的认知,似乎远比我们自己意识到的复杂。

My life as a Ph.D. student, S01E03: Coursework so far

Apparently PhD students hate coursework, especially those irrelevant to their research. In CSCI-697 when every new PhD student is asked to introduce themself – including the thing that they are most and least looking forward to do, many mentioned coursework for the latter.

But I guess that really does not have anything to do with academic level, because every student hate coursework! Some student next door in my apartment building could party till after midnight even when the next day is a workday. Some people are not coming to the college to study! I won’t say that we don’t have those people in Tsinghua, but at least I didn’t personally know any.

I took CSCI-670 “Advanced Analysis of Algorithms”this semester. I didn’t really  expect to learn much new knowledge. How advanced could it be? For many people, network flows are already pretty advanced. But for someone competed in algorithm contests in high school, even the mediocre ones like me, that’s a piece of cake.

Then, it turned out that – I’m not looking in the right direction. That’s one problem with a lot of algorithm contest competitors. What we take as advanced algorithms are really just difficult, delicate, cleverly constructed to solve a puzzle rather than a practical problem, and does not involve advanced mathematics.

Just as suggested by the name of the course, the advanced thing really is in the analysis. The algorithm itself could be pretty straightforward to implement. And the purpose of the course is not to teach you how to design algorithms. It talks about ideas not seen in basic algorithms and those delicate ones.

Our first topic is – PageRank. That’s an unexpected one! Isn’t that as simple as solving a linear system? That’s what we learned from our undergraduate network science course. Yes, but solving a linear system is not simple anymore when you have a billion variables! That’s the first important idea we are taught. In the world of big data, you need to change the way you see things. The data is there, but you can’t access all of it at once. You want to know something mathematically defined. How do you do it? you sample.

But sampling alone does not solve the problem. To calculate the exact PageRank you need to know the whole network anyway. So there is the second important idea: you don’t need to know everything exactly with full confidence to do useful things. Tolerating an error margin and a small chance of failure makes intractable problems tractable.

That’s like polling. To know exactly who’s gonna to be the president you need to ask pretty much everyone’s opinion. But if you allow an error of several percent, you just need to ask several thousands of people if you sample them from the correct distribution.

In our case, instead of the exact PageRank of each node, we want a set of significant nodes, with the property that any node with PageRank ≥ Δ is in the set and any node with PageRank ≤ Δ/2 is not in it, with high probability. For those node with PageRank in between, they can either be in or not in the output.

The gap is where the magic lies in. As long as the lower threshold is a constant fraction of the higher threshold, the problem is tractable, and after a lengthy mathematical analysis (that span several 2-hour lectures), it all boiled down to just randomly selecting some nodes, actually do the PageRank random walk, and record some frequency.

And the mathematical tool we used is the Chernoff bound. We learned that in our Mathematics for Computer Science course (taught by Yao himself!) in my freshman year, and in every theory related course we review it. It is no doubt the second most useful thing we learned in MfCS! (Wondering what’s the most useful thing? this course is unofficially known as LaTex Typesetting for Computer Science). It is in the same spirit of Markov’s and Chebyshev’s inequalities and the central limit theorem and looks quite fundamental, so I expected it to be as old. To my surprise, it is quite new in the sense that its namesake, Herman Chernoff, is still alive! And judging by the reaction when our lecturer said that he once met Chernoff in person in MIT, most other students are surprised too.

We are now on the second topic, the clustering problem. We will be constructing the k-nearest neighbor graph in arbitrary dimension in time O(n log^2 n), and the whole solution involved solving some quite varied problems.

It had something to do with the kissing number problem! That was another unexpected thing. I read about that problem in a discrete geometry book years ago, and now I know that it is more than a pure curiosity. Its solution is also interesting – we know the exact solution only in dimension d = 1, 2, 3, 4, 8 and 24! What an odd sequence! I know the 8 and 24 case have something to do with certain special lattices. I wish I were better at abstract algebra so that I can appreciate them!

Another related problem is – if you drive across Minnesota, the state of then thousand lakes, how many of them could you expect to see? The student that guessed 100 definitely had some good intuitions! We formulated an idealized version of the problem: if you have n non-overlapping circles on a sphere, then regardless of the configuration of the circles, a random great circle would in expectation cut through no more than 2√n of them. I actually proved this myself! I guess my mathematics is not bad after all.

And then yet another related problem – to pick the exact medium of n numbers you have to go through all of them. But what if you are doing divide-and-conquer and just need to pick some number close enough to the medium with high confidence? By iterative-middle-of-three method, with a constant error margin and failure rate the number of numbers we need to see is a constant regardless of n! Again we see the power of allowing error margin and chance of failure.

That’s pretty much everything up until now. No mind-blogging mathematics, but different ways of looking at the problem and putting everything together.

Thoughts on how to formulate a Rubik’s Cube

Formulating Rubik’s Cubes and other sequential move puzzles is one thing that I have been thinking for a long time.

I learned how to solve a Rubik’s Cube when I was 13. But I’ve never been a speed-solving type of player, even now, I’m still using the most elementary methods and can barely solve a Rubik’s Cube in 50 seconds. My interest is in making cool patterns and collecting and solving puzzles of different shape and structure.

As a CS student it’s quite natural to want to write programs for everything you find worthy of doing with a computer. And so I wanted to write a Rubik’s Cube simulator! Not just a simple one, but one with which you can simulate any combinatoric puzzle – one to rule them all!

Then we are faced with the problem of how to represent these puzzles. They must be machine-readable, of course. Then I think it should be friendly to a human user. It should not only be human-readable: it is intended to be primarily hand-written, so it must be made sense easily, be concise in syntax, not tedious to write, but still with immense expressing power. That will make most existing formats undesirable: with xml you will probably end up writing more tags than useful content, and with json there is no structure.

Then, on some day last year, I found the key! What makes it possible to describe the complex structure of a combinatoric puzzle concisely is its symmetry. I started working at once and after several weeks I’ve got a simple parser for my custom cube description language and a simple simulator. The syntax resembles a general purpose programming language but much simpler.

Then it’s time for graduate school applications and I put that project aside. Now that I have some spare time, I’d like to pick it up.

Actually I want to start from scratch again. The last time, I didn’t even write a specification for my language. Implementing a language without a specification, that’s like writing a program without comment, only worse. And after a year I’ve got a lot of new thoughts on how to describe and implement a puzzle, so the basic mechanism of my program would probably change.

And my rendering was done in OpenGL 1 style (primarily due to the lack of good tutorials on OpenGL 3+). That definitely need to be changed.

This time, I’ll write the specification of my custom cube description language before writing the parser and explain all the ideas behind it, and provide annotated examples, so that someone other than me can actually understand this language and use it to model puzzles. Hopefully it will finally become a useful tool for those hobbyists who can not afford to by expensive puzzles.

(And this will also be the first project of Martian Computing Research! Wait for more to come, ahahaha!)

Here as the first part of the documentation I’ll explain the central idea of my design by trying to formulate the definition of the original Rubik’s Cube. The syntax here is not what actually used in the language, it just serves to illustrate the idea.

First, we will want to define all the blocks. Following the convention, we name the faces U (up), R (right), F (front), D (down), L (left) and B (back). It seems logical to simply name each block by all the faces it touches. Then we get something like this:

block U, D, R, L, F, B; // The center blocks
block UR, UL, UF, UB, DR, DL, DF, DB, RF, RB, LF, LB; // The edge blocks
block URF, UFL, ULB, UBR, DFR, DLF, DBL, DRB; // The corner blocks.

There are some ambiguities in the ordering of faces when one block touches multiple faces, but for now we just pick an arbitrary one.

Then we will want to define all possible states of a block. Here, a state is just a position and an orientation. The set of positions is the same as the set of blocks. How about the orientations? We could have something like position name plus rotation angle:

state UR, UR+180;
state URF, URF+120, URF+240;

But after a second thought we found that it would be a good idea to just cycle the face order of the name of a block to get names of states:

state UR, RU;
state URF, RFU, FUR;

This is logical since the number of orientations of a block is determined by the number of faces it touches. Note that “block URF is in state RFU” means that the U sticker on block URF is on face R, etc., so rotating a corner block 120 degrees clockwise correspond to cycle the face names in the state name backwards 1 step. This form also give us a lot of advantages. If block URF is at its own position, its rotation angle is obvious. What if it is at another position, say position DBL, with its U sticker on face L? There is no straightforward definition of rotation angle. With this form, we can get rid of rotation angle: since U sticker is on face L, R sticker is on face D and F sticker is on face B, block URF must be in state LDB! Actually this give us a meaningful definition of rotation angle: state name LDB is obtained by cycling position name DBL backward 2 steps, so the rotation angle is defined to be 240 degrees! Once we have defined rotation angle for every block in every position, we can verify that the sum of rotation angles of all blocks (modulus 360) is invariant under any operation, which is an important step in proving the total number of possible states of a Rubik’s cube.

Finally we will want to define all the operations. An operation moves blocks and change their states, so its definition comes in two parts: a geometric transformation and a table of state changes:

operation U {
    axis (0, 0, 1); // Assume that x axis points forawrds,
                    // y axis points rightwards and
                    // z axis points upwards
    angle -90;      // In mathematics positive angle usually
                    // means counterclockwise
    UR -> UF;
    RU -> FU;
    ...
    UB -> UR;
    BU -> RU;
    URF -> UFL;
    RFU -> FLU;
    FUR -> LUF;
    ...
    UBR -> URF;
    U -> U;         // The state does not change but geometric
                    // transformation still apply to block U
                    // so we include it.
}

If you know something about combinatorics, you should know that the change of states is actually a permutation. So we can write it more concisely in cycle notation:

operation U {
    axis (0, 0, 1);
    angle -90;
    state_cycle (UR, UF, UL, UB);
    state_cycle (RU, FU, LU, BU);
    state_cycle (URF, UFL, ULB, UBR);
    state_cycle (RFU, FLU, LBU, BRU);
    state_cycle (FUR, LUF, BUL, RUB);
    state_cycle (U);
}

We will then need to write this for every other operation! The inverse operations (U’, etc.) can be easily inferred by reversing the direction of the rotation and cycle so we don’t need to write them explicitly, but that’s still a lot of work! After that, we will have a full definition of a Rubik’s Cube.

Somehow we can feel the redundancy. But this redundancy is not like manually assigning the same value to each entry in a long array, which can easily be simplified with a for-loop. It is the result of the intrinsic symmetry, both geometric and algebraic, of the Rubik’s Cube. Actually, the Rubik’s Cube is probably the best real-life example of an object with a complex symmetry. If you check the “Group theory” article on Wikipedia, you can find a Rubik’s Cube right there on the beginning of the page!

But we only need a tiny portion of what group theory has to offer. We just focus on how the formulation of a Rubik’s Cube could be simplified.

Consider defining operation R. It would be good if we can just “rotate” the definition of operation U by -90 degrees around the x axis. What do we need to do this “rotation”? We need to rotate the axis around which operation U rotate the blocks (yes, we rotate a rotation!) and change the state names in the permutation such that U changes to R, R changes to D, etc., i.e. we do a permutation of face names. (And again, yes, we permute a permutation!) Let’s try it:

meta_rotation x {
    axis (1, 0, 0);
    angle -90;
    face_cycle (U, R, D, L);
}

apply this to operation U, we get something like:

operation R {
    axis (0, 1, 0);
    angle -90;
    state_cycle (RD, RF, RU, RB);
    state_cycle (DR, FR, UR, BR);
    state_cycle (RDF, RFU, RUB, RBD);
    state_cycle (DFR, FUR, UBR, BDR);
    state_cycle (FRD, URF, BRU, DRB);
    state_cycle (R);
}

Yes! this exactly what we wanted. Apply this meta-rotation twice more and we get the definitions of operations D and L. We can similarly define meta-rotations around other axes:

meta_rotation y {
    axis (0, 1, 0);
    angle -90;
    face_cycle (F, U, B, D);
}
meta_rotation z {
    axis (0, 0, 1);
    angle -90;
    face_cycle (R, F, L, B);
}

Apply y to operation U, we get definitions of operations F and B.

What if we apply z? Well, we get essentially the same definition of operation U, but it does look a bit different:

operation U {
    axis (0, 0, 1);
    angle -90;
    state_cycle (UF, UL, UB, UR);
    state_cycle (FU, LU, BU, RU);
    state_cycle (UFL, ULB, UBR, URF);
    state_cycle (FLU, LBU, BRU, RFU);
    state_cycle (LUF, BUL, RUB, FUR);
    state_cycle (U);
}

The state orders in the state cycle have changed! What does this tell us? We only need to say UR -> UF, then UF -> UL -> UB -> UR can all be generated by applying the appropriate meta-rotation! So to get the definition of all operations, we only need a part of one definition! We will get something like this:

apply (x, xx, xxx, y, yyy) {
    apply (z, zz, zzz) {
        operation U {
            axis (0, 0, 1);
            angle -90;
            U -> U;
            UR -> UF;
            RU -> FU;
            URF -> UFL;
            RFU -> FLU;
            FUR -> LUF;
        }
    }
}

If you know a bit about symmetry groups, you can see that we are actually applying the whole chiral octahedral symmetry group O to the minimal part of definition to get the full definition. the group O is of order 24 and here we are only explicitly writing out 1/24 of the full definition. Group O can be generated by any two of x, y and z, so we may write our definition as

generate_group (x, y) {
    operation U {
        axis (0, 0, 1);
        angle -90;
        U -> U;
        UR -> UF;
        RU -> FU;
        URF -> UFL;
        RFU -> FLU;
        FUR -> LUF;
    }
}

Comparing to the whole lot of things we need to write without symmetry, this is a big step forward.

We can also simplify the definition of states in the same way:

generate_group (x, y) {
    state U, UR, URF;
}

And also blocks. But this time we get URF, RFU and FUR which is actually the same block. But that’s not a problem, we can just do some tricks in the syntax if we really want to implement such a cube description language and allow blocks to have aliases:

generate_group (x, y) {
    block U;
    block UR alias RU;
    block URF alias RFU;
}

Note that we don’t need to write this explicitly

block URF alias FUR;

Because that will also be automatically generated for us!

Now we can write the definition of a Rubik’s Cube in full:

meta_rotation x {
    axis (1, 0, 0);
    angle -90;
    face_cycle (U, R, D, L);
}

meta_rotation y {
    axis (0, 1, 0);
    angle -90;
    face_cycle (F, U, B, D);
}

generate_group (x, y) {
    state U, UR, URF;
    block U, UR alias RU, URF alias RFU;
    operation U {
        axis (0, 0, 1);
        angle -90;
        U -> U, UR -> UF, RU -> FU;
        URF -> UFL, RFU -> FLU, FUR -> LUF;
    }
}

So simple! But this still is an “abstract” definition, to actually be able to visually simulate a Rubik’s Cube we need geometric models. But even that can be simplified. Remember that we have geometric transformations in our meta-rotation definition. We can define the 3D model, sticker color, etc. for one of each different kind of block and the rest will be handled by the symmetry group! One we get the idea of how to simplify the definition, adding more features is just a matter of syntax.

When Jesus come to find you

(I’m a catholic. No, I don’t believe in god, I’m a cat-holic, nyahahaha!)

I know that the United States is a religious country. In god they trust. But being confronted by a Christian Taiwanese immigrant asking if you are interested in hearing from Jesus was not something to be expected on the very first day after arriving here.

I did meet a few Christian people back in China. I knew a girl who is Catholic. When I was in high school, a neon Christian cross presumably belonging to a Protestant church can be seen from my dorm building. (I also knew an American teacher who is a Mormon, but that’s not Christianity anymore, right?)

Generally, in China, those Catholic people who try to sell me their belief piss me off. And those politics, those freedom of religion and human right thing involved there, pissed me off even more. There is no God and I don’t even care about what you are doing!

But this time, they did made me interested in what they are doing. Yes, that’s how you sell your belief! Be friendly and helpful and avoid talking about anything explicitly religious at first, instead of going straight to the (pointless) point.

So I paid several visits to their church. It is just a home church of Chinese Protestants, mostly students or alumni of USC. They does not seem to belong to any denomination, but I think they are affiliated with other such groups.

We ate food, read the Bible and sang hymns. That’s fine for me, does not sound so religious. But some new students has gone as far as getting baptized.

Do I believe in God, or in Jesus? Definitely not. I might not be bold enough to self identify as an atheist, but at least I’m an agnostic. I’d say I’m interested in the religion itself rather than believing it.

I do think, though, those people in the church are much better than an average Chinese person I may meet. Do they become good people because of their belief? It would be hard to give a scientific answer.

My life as a Ph.D. student, S01E02: because nearning is the key!

(Wait, shouldn’t that word be “learning”?)

(And how many keys do we have in the lab?)

Ugh, I hate machine learning. Deep learning especially.

I met machine learning for the first time when I was a freshman. Andrew Ng (I don’t need to introduce this guy, I guess) came to Tsinghua and gave a lecture about his work. It was so fascinating.

But when I got to work with machine learning for the first time, it was not so pleasant. We had a machine learning course. The lecturer is a theorist, with high self-esteem. (He asked which book did we use for our theory course. When someone mentioned that classic book by Sipser, he said we should at least use Modern Approach!) The course was mostly about proving one bound after another, and now I can remember none of them! We did have fun looking at some machine learning problems from a nonlinear programming or even game theoretic point of view, though. The only algorithm mentioned in the course was (kernel-) SVM, and that was not without a ton of theoretic analysis.

And he hates deep learning. He mentioned it only once, in the concluding part of the course, only to downplay it with sarcasm. Are all machine learning theorists like that?

(BTW the TA seemed to be a super lazy otaku girl)

And then, after learning abstract things for the whole semester, the final course project was a concrete one – to solve one problem on Kaggle! I don’t even have a decent learning algorithm in my repository.

But, challenge accepted!

And the result – it did not give me frustration. It gave me cancer. My best score on the benchmark (which is a mediocre one) was achieved with bugged code. Only after the course project deadline was I able to get a better score with a correct program.

Those with the best results were doing intern at an alumni’s computer vision startup. Surely, you can’t hope to beat state-of-the-art deep neural networks implemented with a powerful framework deployed on a cluster with your crappy hand-written naive algorithm running on your laptop!

What’s after that? The next semester, when I found out that that computational biology course is all about machine learning, I quit immediately. (Seriously, discovering new interaction between proteins and existing drugs by doing text mining on existing literature sounds just ridiculous.)

Well, I don’t really hate machine learning itself. It really is powerful. What I don’t like is the way people work with it, and the hype.

I’m not saying that you have to give a theory to justify your method. Theory of neural networks is hard, and somehow they just work super good in reality, I understand that. I mean, just dumping everything into your network, tuning your network at random, and hoping that magic will happen, that definitely is not the correct attitude! But, my perception is that, this is how a lot of people are doing machine learning right now.

And people are well aware of it! In China we call that “炼丹(liàn dān)” (or TCM if you prefer that), and at last year’s 21ccc one speaker called that “alchemy” – see, we even have internationally accepted terminology for those sort of things!

And the hype – the number of people doing machine learning is too damn high! It is not rocket science, but neither is it for dumb people who know only how to liàn dān!

I’ve avoided coming in touch with machine learning thus far. But now, it is coming for me!

Because, nearning is the key, ahahahaha!

Computer graphics today is not like decades ago when people focused on rendering. Those problems are largely solved. It has become much broader. Citing words from a talk I attended at MSRA, today, computer graphics is about generating novel content from existing content, in the form of images, videos, 3D models, or even something non-visual like text or sound. It is about capturing human’s creativity. And how do you do that? Learning, of course!

So I started learning to do learning!

The past several days were spent getting the environment sorted – to get cuda working on my Arch Linux machine was a bit of a pain for a casual linux user – the proprietary NVIDIA driver keeps crashing so I had to use bumblebee for my intel/NVIDIA dual GPU laptop, and I had to workaround the incompatibilities between cuda 7.5 and gcc 6. But I guess you can hardly call that a trouble.

I’m using torch because – that’s what my cooperating labmate is using. The first thing to try? Guess it – it’s the MNIST dataset!

The whole procedure was like – put a convolution layer – then a pooling layer – then another pair of such layers – then 3 fully connected layers – then just dump in the data and sit there watching the error drop.

No! That does not feel good. Did I do anything? The days when I was hopelessly tweaking my bugged crappy hand-written naive learning code were like a joke.

But anyway, this still is a step forward.

If learning is the key, then what is the key to learning? There must be something deep (no, I don’t mean a deeper network) that distinguishes groundbreaking machine learning research from liàn dān, something that you have to use your brain to figure out. We shall see.