A little bit of thought on solving Rubik’s Cubes and the possibility of doing that with neural networks

I’ve been thinking about solving combination puzzles with a computer for a long time.

I believe every CS student has learned how to solve the 15-puzzle with heuristic search in an introductory AI course, or at least how to solve the simpler 8-puzzle with BFS in an introductory algorithms course. At first sight, the Rubik’s Cube may seem to be a similar puzzle, just a bit bigger.

Yes, they are all “sequential move” puzzles, where you try to restore a bunch of things to some initial configuration by performing a sequence of operations. But the Rubik’s Cube is definitely much more difficult, both for humans and for computers. Size is not what really matters – you can make yourself an arbitrarily large (n^2-1)-puzzle, and it will still be easier than the Rubik’s Cube, just quadratically more boring and time-consuming. Apparently there are deep group-theoretic reasons; I once read a Scientific American article about why some of these puzzles are more difficult than others. But I think it can be explained by the 15-puzzle being more local than the Rubik’s Cube: in the Rubik’s Cube, one move affects about 1/3 of all the pieces, whereas in the 15-puzzle, one move affects only one piece.

The consequence is that in the 15-puzzle you can clearly see which configuration is closer to being solved – a human can see it more easily, and it is easier to design a heuristic for a computer. In the Rubik’s Cube, this is much more difficult. I haven’t seen even one example of a good search heuristic for the Rubik’s Cube. Maybe people just do not bother to find one: by memorizing some 200 move sequences and practicing, people can solve a Rubik’s Cube in under 10 seconds, and the total number of states of a Rubik’s Cube, 43,252,003,274,489,856,000 (I can memorize this number), together with God’s number, 20, is just small enough to make it tractable to solve with a (bidirectional) brute-force search.
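As a quick sanity check on the state count quoted above, here is the standard counting argument written out as arithmetic. The variable names are only for illustration.

```python
from math import factorial

# 8 corners can be permuted freely; each has 3 orientations,
# but the twist of the last corner is forced by the other seven.
corner_arrangements = factorial(8) * 3 ** 7

# 12 edges can be permuted freely; each has 2 orientations,
# but the flip of the last edge is forced by the other eleven.
edge_arrangements = factorial(12) * 2 ** 11

# The corner permutation and the edge permutation must have the same parity,
# which removes half of the combinations.
total_states = corner_arrangements * edge_arrangements // 2

print(f"{total_states:,}")
```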

But when you have a larger combination puzzle – a 4×4×4, a 5×5×5 or a bigger cube, or a Megaminx – the number of possible states explodes and brute-force search becomes mostly useless.

Somehow people can still solve them by finding patterns. Someone has even designed an algorithm that solves an n×n×n cube in Θ(n^2/log n) moves – what a strange bound, with a log in the denominator!

Actually, finding solutions to all those exotic puzzles is my favorite part of this Rubik’s Cube game. The 200 or so formulas may have been found by brute-force computer search, but once you have played with a Rubik’s Cube for some time, and if you are clever enough, you can construct such formulas yourself for other puzzles. You start by just randomly – or maybe purposefully – turning the faces and seeing what happens. Then you suddenly find a short move sequence that does something interesting – say, rotating a small number of pieces in place while leaving their positions unchanged, or moving only some corner pieces without affecting the edge pieces.

Then you get a bunch of formulas, each of which keeps a certain set of pieces unchanged, moves a second set of pieces in some desired fashion, and moves the remaining pieces in some way you don’t really care about.

Then, with what you have at hand, you define your strategy. Be it edge-first, corner-first, layer-first or whatever you may want to call it, you usually have several stages. In each stage, you keep unchanged a larger set of already-restored pieces than in the previous stage, and try to restore several more pieces to advance to the next stage.

You might not notice it, but what you are actually doing is repeatedly reducing the group of possible states to one of its subgroups, until you reach the trivial group consisting only of the solved state.

The process might not be optimal, but it works.

But it seems that we have not formulated this process well enough to let computers do it. I’ve been thinking about it for a while. Maybe we humans and computers could cooperate: we define the stages so that the quotient between two consecutive stages is small enough that the computer can brute-force search for a formula that brings us from one stage to the next. It would be great if the computer could also learn to define the stages, but that part does require some insight into the structure of the puzzle.

Yesterday when I was dealing with the GAN thing, I suddenly came up with the idea of solving Rubik’s Cubes with a neural network.

I think that’s quite natural. These days people seem to believe that neural networks can do everything. But after a quick Google search I found that few people have tried this.

So it’s time to give it a try! Brute-force search is definitely ineffective for solving complex combination puzzles, but maybe neural networks can do the magic. The idea could not be simpler: encode the state of each piece as a one-hot vector and build a network that predicts, from the input state, the next move towards solving the puzzle.

Then it comes to the encoding, the training data and the structure of the network. I’ve seen some people represent the state of a Rubik’s Cube by one-hot encoding the color of the sticker at each position… that sounds ridiculous. I think one-hot encoding the position of each piece makes more sense. We have 20 movable pieces, and each of them has 24 possible states (8 positions × 3 orientations for a corner, 12 positions × 2 orientations for an edge), so the input would be a 480-dimensional 0-1 vector. Note that this is actually more than what you need if you choose to use stickers: 48 movable stickers, each with 6 possible colors, give a 288-dimensional 0-1 vector. But still, I think the position of the pieces is more intrinsic.
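A minimal sketch of the piece-based encoding, assuming each of the 20 pieces has already been assigned a state index from 0 to 23. That numbering, and the name `encode_pieces`, are hypothetical; any fixed convention for combining position and orientation into one index would do.

```python
NUM_PIECES = 20         # 8 corners + 12 edges
STATES_PER_PIECE = 24   # 8 positions x 3 twists, or 12 positions x 2 flips

def encode_pieces(piece_states):
    """One-hot encode a list of 20 per-piece state indices into a 480-long 0-1 list."""
    if len(piece_states) != NUM_PIECES:
        raise ValueError("expected one state index per movable piece")
    vec = [0] * (NUM_PIECES * STATES_PER_PIECE)
    for piece, state in enumerate(piece_states):
        if not 0 <= state < STATES_PER_PIECE:
            raise ValueError("state index out of range")
        # Each piece owns its own block of 24 slots; exactly one slot is set.
        vec[piece * STATES_PER_PIECE + state] = 1
    return vec

# Made-up state indices, just to exercise the function.
example = encode_pieces(list(range(NUM_PIECES)))
```

The sticker-based alternative would be built the same way, with 48 blocks of 6 slots instead of 20 blocks of 24.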

The training data would be obtained by scrambling the cube. We train the network to bring a state back to its previous state in the scramble sequence. The scramble sequence should be generated by a random walk: with a fixed probability (say 0.05, the reciprocal of God’s number, 20) we jump back to the solved state; otherwise we choose randomly among the 12 possible moves. This makes sense since it ensures that each state is reached exponentially more often along its shortest path from the solved state than along longer ones.
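The random walk with restarts could be sketched like this. To keep the example self-contained it does not simulate a cube: the puzzle is passed in as a solved state, a move list and two functions, and the demo uses a toy 12-position dial as a stand-in. All names here are hypothetical.

```python
import random

def scramble_pairs(solved, moves, apply_move, inverse, steps, p_reset=0.05, rng=None):
    """Collect (state, move that undoes the last scramble move) training pairs."""
    rng = rng or random.Random()
    state = solved
    pairs = []
    for _ in range(steps):
        if rng.random() < p_reset:
            state = solved            # restart the walk from the solved state
            continue
        move = rng.choice(moves)
        state = apply_move(state, move)
        # The label is the move that brings `state` back to the previous state.
        pairs.append((state, inverse(move)))
    return pairs

# Toy stand-in puzzle (not a cube): a dial with 12 positions, turned by +1 or -1.
DIAL_MOVES = [+1, -1]

def turn_dial(state, move):
    return (state + move) % 12

def undo(move):
    return -move

pairs = scramble_pairs(0, DIAL_MOVES, turn_dial, undo, steps=1000, rng=random.Random(0))
```

For the real puzzle, `moves` would hold the 12 quarter turns and `apply_move` would be a cube simulator.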

Then we come to the fun part – the structure. We could just use some plain fully-connected layers. But remember that the Rubik’s Cube has symmetry, as I mentioned in a previous post. So if we rotate the input state, the same rotation should apply to the prediction as well. This can be enforced through weight sharing. Imagine that all the neurons, including the inputs, are divided into 24 classes, corresponding to the 24 elements of the chiral octahedral symmetry group O. If there is a link between a neuron a and a neuron b, and under some spatial rotation neuron a becomes neuron c and neuron b becomes neuron d, then the weight between a and b and the weight between c and d should be identical! Now we have an interesting 24-fold weight sharing inside the network! It still looks like a fully connected network, yet it is strangely entangled with itself.
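Here is a small sketch of that weight-sharing rule. Instead of the 24 rotations of the cube it uses a toy stand-in, the cyclic group C4 acting on 4 neurons, so that it fits in a few lines; the function names are hypothetical. The final check is that rotating the input and then applying the layer gives the same result as applying the layer and then rotating the output.

```python
import random

def tie_weights(n_in, n_out, perms_in, perms_out):
    """Group the (out, in) weight slots of a dense layer into shared parameters.

    perms_in[k][i] is the neuron that input neuron i is carried to by the k-th
    symmetry; perms_out[k] does the same for outputs. The lists must enumerate
    a whole group (identity included), so every orbit is filled in one pass.
    """
    slot_to_param = {}
    n_params = 0
    for o in range(n_out):
        for i in range(n_in):
            if (o, i) in slot_to_param:
                continue
            # Every slot reachable from (o, i) by a symmetry shares one weight.
            for p_in, p_out in zip(perms_in, perms_out):
                slot_to_param[(p_out[o], p_in[i])] = n_params
            n_params += 1
    return slot_to_param, n_params

def dense(slot_to_param, params, x, n_out):
    """Apply the tied layer: y[o] = sum_i W[o][i] * x[i]."""
    return [sum(params[slot_to_param[(o, i)]] * x[i] for i in range(len(x)))
            for o in range(n_out)]

def rotate(vec, perm):
    """Carry the value at neuron i to neuron perm[i]."""
    out = [0.0] * len(vec)
    for i, v in enumerate(vec):
        out[perm[i]] = v
    return out

# Toy stand-in for the 24 rotations: cyclic shifts of 4 neurons.
C4 = [[(i + k) % 4 for i in range(4)] for k in range(4)]
slot_to_param, n_params = tie_weights(4, 4, C4, C4)

rng = random.Random(0)
params = [rng.uniform(-1, 1) for _ in range(n_params)]
x = [rng.uniform(-1, 1) for _ in range(4)]

g = C4[1]
rotate_then_layer = dense(slot_to_param, params, rotate(x, g), 4)
layer_then_rotate = rotate(dense(slot_to_param, params, x, 4), g)
```

In this toy case the 16 weight slots collapse to 4 free parameters. For the cube, the same routine would be fed the 24 permutations that each rotation induces on the neurons.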

While I haven’t tried this yet, I’m expecting the network to do more than merely predict the next step. By examining, for each neuron, which input states cause it to have maximum activation, we should be able to see what patterns the network is using to find the next move – what should be kept and what should be moved – which should help us find the stages of a human solving strategy.




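A Generative Adversarial Network (GAN) is a kind of generative model (obviously). Generative models based on neural networks are common, but they are mostly RNNs trained on sequential data, for example to generate text. GAN takes a different approach: adversarial training. There are two neural networks: a generator network, which learns to produce samples resembling the training data from random noise vectors, and a discriminator network, which learns to tell whether a sample comes from the training data or from the generator. The adversarial objective is that the generator tries to increase the discrimination error while the discriminator tries to decrease it.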



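The idea is simple, but in practice it is rather tricky, and it is hard to train a GAN well. There are several reasons. First, the quality of the generator cannot be quantified (the discrimination error obviously won’t do), so you can only judge it subjectively. Subjective judgment makes it hard to tell how far the training has progressed or whether it is stuck, which in turn makes it hard to find a suitable learning rate.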

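Second, the two networks train at very different speeds. To make sure that they improve together, they have to be kept at comparable strength. The original paper argues that the discriminator should be trained more. But in my experiments, with the same learning rate, the discriminator basically crushes the generator, and I don’t know why. If I try to keep the discrimination error at about the level of a random guess at the end of each round, I have to train the generator more than 20 times per round… I haven’t run the experiment yet, but I suspect it has to do with the number of input noise variables.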



By the same reasoning, the learning rate cannot be too large.


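I tried training on the MNIST handwritten digit dataset. As for the network structure… for such a simple dataset pretty much any structure would do. The discriminator feeds the input through two convolutional layers, then one fully connected layer, then the output, and uses batch normalization. The generator is exactly the reverse.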

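After adjusting the network structure, the learning rate and the ratio of training speeds of the two networks, all to no avail, I came up with a not-so-elegant trick. The discriminator judges only one sample at a time. What if it could judge a whole batch at once? Even if every single sample looks real, a batch in which everything looks the same is obviously fake…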
















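For the neither-fish-nor-fowl outputs that appear on the boundaries between the regions that generate different digits, there is a possible improvement. Since MNIST has class labels, we can do supervised learning: convert the class label into a one-hot vector and put it into the input seed, so that the regions generating different digits are separated. Then change the discriminator to do real/fake discrimination plus classification, and change the generator’s objective to increasing the discrimination error while decreasing the classification error. This can be expected to produce better output. I’m still experimenting with it; the results so far are as follows.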







My life as a Ph.D. student, S01E05: Still working on my first neural network…

I was planning to write a post after getting everything done, but it is taking longer than I would like…

We are working on some generative models now and for the last week I was learning how to train a Generative Adversarial Network. GANs are super cool, but also notoriously difficult to train! They probably aren’t the best choice for a machine learning newbie like me.

But doing something challenging is fun! Using something that works out of the box is just too lame. And if it already works well, it likely isn’t worth studying.

Here I’d like to talk about what I’ve learned so far. Let’s see the result first. I’m getting something like this from a GAN trained on the MNIST dataset:


It is generating something. Not exactly digits, but they do resemble some handwritten script…

I found that there are some practical issues not addressed in the DCGAN paper. It suggests using batch normalization, but batchnorm behaves differently in training and in evaluation. When training, it uses the mean and variance of the current batch. When evaluating, it uses running statistics accumulated over the examples it has seen. It seems logical to use evaluation mode for the generator when training the discriminator, and vice versa.
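To make the two modes concrete, here is a toy batchnorm for a single scalar feature. It is only a sketch: the class name is made up, and the exponential moving average with momentum 0.1 is an assumption about how the running statistics are kept, not a claim about any particular library.

```python
class ToyBatchNorm:
    """Batch normalization for one scalar feature, without learned scale or shift."""

    def __init__(self, momentum=0.1, eps=1e-5):
        self.momentum = momentum      # assumed update rate for the running statistics
        self.eps = eps
        self.running_mean = 0.0
        self.running_var = 1.0
        self.training = True

    def __call__(self, xs):
        if self.training:
            # Training mode: normalize with the statistics of this very batch...
            mean = sum(xs) / len(xs)
            var = sum((x - mean) ** 2 for x in xs) / len(xs)
            # ...and fold them into the running statistics for later evaluation.
            self.running_mean += self.momentum * (mean - self.running_mean)
            self.running_var += self.momentum * (var - self.running_var)
        else:
            # Evaluation mode: normalize with what has been accumulated so far.
            mean, var = self.running_mean, self.running_var
        return [(x - mean) / (var + self.eps) ** 0.5 for x in xs]

bn = ToyBatchNorm()
batch = [10.0, 12.0, 14.0]
train_out = bn(batch)      # centered on the batch itself
bn.training = False
eval_out = bn(batch)       # centered on the running mean instead
```

The same batch gives quite different outputs in the two modes, which is why the choice of mode for each network matters.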

In the original GAN paper, in each discriminator training round, the “true batch” from real data and the “false batch” of generated examples are fed into the discriminator separately. It seems logical to make the distribution of each batch the same, so true and false examples should be mixed in the same batch.

To avoid making the wrong decision on the previous two issues, I tried both ways round. But the suggestions given in the DCGAN paper still didn’t quite work out for me. I followed them as closely as I could. I couldn’t do the fractional-strided convolution for the generator network because that kind of convolution is not readily available in torch… But otherwise I did exactly what was advised. Still, the generator network keeps collapsing every input to a single output. According to what I have read, this is indeed the most commonly observed failure mode of GANs.

Then I came up with a not-so-elegant trick to solve this problem. The generator collapses because it thinks that one particular output is the single best image for tricking the discriminator into mistaking it for a real example from the dataset, and it learns to produce that output from any input. But then the discriminator quickly learns that that particular example is fake. Then the generator looks for the next output that can trick the discriminator…

This happens because the discriminator looks at a single example at a time. If we can somehow let the discriminator reject a batch when every example in it looks the same, then the generator should no longer be tempted to collapse every input into the same output! How do we do this? My solution is to take, for each batch, the mean of its examples. Then, for each example in the batch, take its difference from the mean and concatenate that difference to the example as additional channels. The expectation is that if the examples in a batch are all similar, the intensity in the additional channels will be small, and the discriminator will be able to learn that.
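A rough sketch of the extra channels in plain Python, with images stored as nested lists (example, then channel, then a flat list of pixels). The layout and the function name are assumptions for illustration; in a real framework this would be a couple of tensor operations.

```python
def add_batch_difference_channels(batch):
    """Append (example - batch mean) to every example as additional channels."""
    n = len(batch)
    n_channels = len(batch[0])
    n_pixels = len(batch[0][0])
    # Per-channel, per-pixel mean over the whole batch.
    mean = [[sum(ex[c][p] for ex in batch) / n for p in range(n_pixels)]
            for c in range(n_channels)]
    out = []
    for ex in batch:
        diff = [[ex[c][p] - mean[c][p] for p in range(n_pixels)]
                for c in range(n_channels)]
        # Original channels first, then the difference channels.
        out.append([list(ch) for ch in ex] + diff)
    return out

# A collapsed batch: every example is the same one-channel, two-pixel "image".
collapsed = add_batch_difference_channels([[[0.5, 0.2]], [[0.5, 0.2]]])
# A diverse batch: the examples differ, so the extra channels are non-zero.
diverse = add_batch_difference_channels([[[0.0, 1.0]], [[1.0, 0.0]]])
```

For the collapsed batch the added channels are all zeros, which is exactly the signal the discriminator is supposed to pick up.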

But then there is a conflict with batchnorm. Since we want the discriminator to tell the difference between a true batch and a false batch, we cannot mix them in the same batch when computing the additional channels! On the other hand, the difference in intensity between a true batch and a false batch is what makes the additional channels useful, and if batchnorm normalizes each batch individually, that difference is wiped out! To prevent this, we cannot let batchnorm normalize the batches individually, so we have to mix them!

So do we not use batchnorm? But without batchnorm, the training becomes very unstable.

The solution is to mix the two batches after their additional channels have been calculated separately.
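The order of operations could be sketched as follows. The function `mixed_discriminator_batch` and the toy `spread` featurizer are hypothetical; `featurize` stands for whatever computes the additional channels for one batch.

```python
import random

def mixed_discriminator_batch(real, fake, featurize, rng=None):
    """Compute batch-level features per batch, then mix real and fake examples."""
    rng = rng or random.Random()
    # Step 1: the extra features are computed on each batch separately,
    # so a collapsed fake batch still shows up as "no spread".
    real_feats = featurize(real)
    fake_feats = featurize(fake)
    # Step 2: only now are the examples mixed, so that batchnorm in the
    # discriminator sees one combined batch.
    labelled = [(ex, 1) for ex in real_feats] + [(ex, 0) for ex in fake_feats]
    rng.shuffle(labelled)
    examples = [ex for ex, _ in labelled]
    labels = [label for _, label in labelled]
    return examples, labels

def spread(batch):
    """Toy featurizer on scalar examples: pair each value with its offset from the batch mean."""
    mean = sum(batch) / len(batch)
    return [(x, x - mean) for x in batch]

examples, labels = mixed_discriminator_batch(
    real=[1.0, 3.0], fake=[2.0, 2.0], featurize=spread, rng=random.Random(0))
```

Here both fake examples keep an offset of zero after mixing, while the real ones keep offsets of -1 and +1.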

This method actually works! The result is not perfect, but at least the generator does not collapse anymore! During training, you can see that at some point the generator almost collapses, but then the outputs soon start to diverge again.

At this point I’m not quite sure how to get the rest right. Should I tune the training parameters, or go model shopping?

And then there is one last mystery. The GAN paper says we should train the discriminator more. But in reality the discriminator constantly beats the crap out of the generator, even if I train the generator 10 times as hard! Is this normal?

Actually, the difficulty of balancing the training of the generator and the discriminator is another major factor that makes GANs hard to train, along with generator collapse. The third factor is probably that, since there are two competing networks, there is no single loss function that measures how well the training is going.

I’ll probably write something more elaborate if I do become better at training GANs.

And what is the one thing that I want to generate with a GAN?

Images of anime-style eyes!