Playing Chess - LLMs and Actual Chess AIs
ChatGPT at Chess
ChatGPT can’t play chess very well. For example, it can’t even find a mate in one on the third move of the game! It has no way to look ahead, and it sometimes makes illegal moves (see for example this animated game). Some have claimed that with the right prompting about legal moves it can play at the level of Stockfish (a top chess engine), but I don’t think that’s possible. The right prompts can improve its answers, but it can’t play above grandmaster level if it has no way to consider future moves. It plays the wrong move in simple positions even when given additional prompting. It’s possible it was somehow replaying an existing game so that it looked like it played well, but it doesn’t truly understand chess.
A large language model (LLM) can learn strategy games to some extent, but it needs to be trained on far more game data, and even then it won’t reach the level of a top chess AI. An LLM trained entirely on Othello games learned an internal model of the rules and made legal moves 99.99% of the time, but it didn’t master the game. LLMs are general-purpose learners, and one shouldn’t compare them to an AI that is purpose-built to play chess well.
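To make that recipe concrete: the training objective is ordinary next-token prediction over move transcripts, with no board state or rules ever shown to the model. Below is a minimal sketch of that setup; the toy games, tiny vocabulary, and small LSTM are my own illustration, not the GPT-style transformer and millions of games used in the actual Othello experiment.

```python
# Minimal sketch of "train a language model on game transcripts":
# next-move prediction over raw move tokens, nothing else.
# All data and model sizes here are illustrative toys.
import torch
import torch.nn as nn

# Toy "games": each game is a list of move tokens (e.g. Othello squares or chess SAN).
games = [
    ["d3", "c5", "f6", "f5", "e6"],
    ["d3", "e3", "f4", "c5", "b4"],
]

# Build a vocabulary over all moves seen in the data, plus a start token.
vocab = {"<start>": 0}
for g in games:
    for mv in g:
        vocab.setdefault(mv, len(vocab))

def encode(game):
    return torch.tensor([vocab["<start>"]] + [vocab[m] for m in game])

class NextMoveLSTM(nn.Module):
    """Predicts a distribution over the next move given the moves so far."""
    def __init__(self, vocab_size, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.lstm = nn.LSTM(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, tokens):
        h, _ = self.lstm(self.embed(tokens))
        return self.head(h)  # logits for the next move at every position

model = NextMoveLSTM(len(vocab))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for step in range(200):
    for g in games:
        ids = encode(g).unsqueeze(0)   # shape (1, seq_len)
        logits = model(ids[:, :-1])    # predict token t+1 from tokens up to t
        loss = loss_fn(logits.squeeze(0), ids[0, 1:])
        opt.zero_grad()
        loss.backward()
        opt.step()
```

The model only ever sees sequences of move tokens, yet apparently with enough data this kind of training is sufficient to induce an internal representation of the board and of which moves are legal.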
True Chess AI
Chess programs used to be hand-coded - programmers and chess masters developed the heuristics the computer would use to evaluate positions, such as a numerical value for each piece plus more complex positional rules. This was considered “AI” before machine learning took over the space. In 2017, DeepMind developed AlphaZero, which played chess against itself for a few hours (on very powerful hardware) and became far stronger than every previous chess program. Centuries of accumulated human chess knowledge, and decades of hand-built chess engines, were surpassed in a few hours by a system that started with zero knowledge of chess strategy.
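To give a flavor of what those hand-written heuristics look like, here is a deliberately simplified evaluation function - pure material counting with the textbook piece values, written with the python-chess library. Real classical engines layer hundreds of positional terms (king safety, pawn structure, mobility, and so on) on top of something like this.

```python
# A deliberately simplified, hand-coded evaluation in the spirit of classical
# engines: just sum up material using the textbook piece values.
import chess

PIECE_VALUES = {
    chess.PAWN: 1,
    chess.KNIGHT: 3,
    chess.BISHOP: 3,
    chess.ROOK: 5,
    chess.QUEEN: 9,
    chess.KING: 0,  # the king is never actually captured
}

def evaluate(board: chess.Board) -> float:
    """Positive scores favor White, negative favor Black."""
    score = 0.0
    for piece_type, value in PIECE_VALUES.items():
        score += value * len(board.pieces(piece_type, chess.WHITE))
        score -= value * len(board.pieces(piece_type, chess.BLACK))
    return score

board = chess.Board()
board.push_san("e4"); board.push_san("d5"); board.push_san("exd5")
print(evaluate(board))  # 1.0 - White is up a pawn
```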
During a game, AlphaZero evaluated only tens of thousands of moves per second, compared to tens of millions for Stockfish. How could it play so much better? It developed a better “intuition” for promising moves, so it could focus its search on a handful of candidates instead of grinding through an enormous branching tree. On every turn it ran many guided simulations of how the game might continue, using its learned judgment of positions to decide which candidate move led to the best outcomes. That “intuition” came from reinforcement learning - playing against itself during the initial training.
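The toy sketch below shows the shape of that idea: a (stand-in) policy narrows the search to a few promising candidates, and cheap simulated playouts estimate how well each one does. To be clear, this is a caricature for intuition only - AlphaZero’s actual search is PUCT-style Monte Carlo tree search guided by learned policy and value networks, not the random playouts used here.

```python
# Toy caricature of policy-guided search: a stand-in "policy" picks a few
# candidate moves, and cheap random playouts estimate how good each one is.
# This only illustrates why a good prior over moves lets you search far
# fewer positions; it is NOT AlphaZero's real algorithm.
import random
import chess

def policy_prior(board: chess.Board):
    """Stand-in for a learned policy network: here it scores moves at random.
    A real policy assigns high probability to strong moves."""
    return {m: random.random() for m in board.legal_moves}

def playout_result(board: chess.Board, max_plies=40) -> float:
    """Play random moves to the end (or a ply cap); score for the side to
    move at the start of the playout: 1 = win, 0 = loss, 0.5 = draw/unknown."""
    side = board.turn
    b = board.copy()
    for _ in range(max_plies):
        if b.is_game_over():
            break
        b.push(random.choice(list(b.legal_moves)))
    result = b.result(claim_draw=True)
    if result == "1-0":
        return 1.0 if side == chess.WHITE else 0.0
    if result == "0-1":
        return 1.0 if side == chess.BLACK else 0.0
    return 0.5

def choose_move(board: chess.Board, top_k=5, playouts=20):
    prior = policy_prior(board)
    candidates = sorted(prior, key=prior.get, reverse=True)[:top_k]
    best, best_score = None, -1.0
    for move in candidates:
        board.push(move)
        # After our move the opponent is to move, so flip the playout score.
        score = sum(1.0 - playout_result(board) for _ in range(playouts)) / playouts
        board.pop()
        if score > best_score:
            best, best_score = move, score
    return best

print(choose_move(chess.Board()))
```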
Later, DeepMind developed MuZero, which wasn’t even given the rules of the games it played. It learned an internal model of each game just by playing, soon reached AlphaZero’s level, and also mastered a whole suite of Atari arcade games. While MuZero started without any knowledge of the specific games, it still used reinforcement learning to improve, so it remains far more capable at games than an LLM.
Human-level Chess
In 2020, researchers from the University of Toronto, Cornell, and Microsoft Research developed Maia Chess, which is specifically designed to play chess like a human. They trained separate versions on games from players at different rating levels, with the goal of playing in the style of those players. It uses no game tree and looks ahead zero moves; it just tries to predict the most likely human move given the current position and the recent moves. It seems able to “intuit” threats and predict moves accordingly without actually calculating them. In that sense it is more similar to an LLM or an image-recognition system than to AlphaZero. You can play Maia on Lichess to try it out! Here’s an interesting result to ponder:
The version trained on 1100-rated games played at a higher level than 1100
The version trained on 1900-rated games played at a lower level than 1900
1100-rated players blunder frequently, but blunders still aren’t the majority of their moves. A system that predicts their most likely move on every turn therefore ends up blundering less often than they do. 1900-rated players blunder less often and can also calculate a few moves ahead, so a system that predicts their moves without any look-ahead won’t match their actual strength - although it would be interesting to see the results when combined with even a simple look-ahead. (The authors mention that adding look-ahead didn’t improve prediction accuracy, but it should still be able to improve the gameplay.)
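A toy simulation makes the 1100 result less mysterious (all the numbers here are invented): if the blunder in a position typically attracts only a minority of the probability mass of human moves, then always playing the single most popular move blunders far less often than sampling a move the way a real 1100 player effectively does.

```python
# Toy illustration (all numbers invented) of why playing the *most likely*
# 1100-level move blunders less than a typical 1100 player. Each simulated
# position has four candidate moves, one of which is a blunder; the human
# move distribution puts some mass on the blunder, but it is rarely the
# single most popular choice.
import random

random.seed(0)
N_POSITIONS = 10_000
human_blunders = 0
argmax_blunders = 0

for _ in range(N_POSITIONS):
    # Hypothetical human move distribution: index 0 is the blunder,
    # getting 10-40% of the probability mass.
    p_blunder = random.uniform(0.10, 0.40)
    rest = [random.random() for _ in range(3)]
    probs = [p_blunder] + [(1 - p_blunder) * r / sum(rest) for r in rest]

    # A human samples from this distribution; the predictor plays the mode.
    human_move = random.choices(range(4), weights=probs)[0]
    argmax_move = max(range(4), key=lambda i: probs[i])

    human_blunders += (human_move == 0)
    argmax_blunders += (argmax_move == 0)

print(f"human blunder rate:  {human_blunders / N_POSITIONS:.1%}")
print(f"argmax blunder rate: {argmax_blunders / N_POSITIONS:.1%}")
```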
I’m not aware of follow-up work on Maia, but the ideas behind it could be used to build a more practical chess AI for practice and feedback. Current chess programs judge every move against perfect play, but humans don’t actually play like that. For example, it’s sometimes worthwhile (particularly in speed chess) to set a “trap” - play a move that isn’t objectively best but gives the opponent a good chance of going wrong. Engines like Stockfish always assume best play from both sides, so they can’t appreciate that logic, but a Maia-style evaluator could recognize the value of such a move. Similarly, if you already have an advantage, it isn’t practically worthwhile to steer into a risky, complex position where one mistake could throw that advantage away. A Maia-style system could recognize such risks and recommend a safer path to victory.
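Here is a rough sketch of what such a “practical” move chooser could look like: instead of assuming the opponent replies perfectly, weight each legal reply by how likely a human is to play it, and score our move by its expected outcome. Both helper functions below are hypothetical placeholders for a real Maia-style predictor and a real engine evaluation.

```python
# Sketch of a "practical" move chooser: score a move by its expected outcome
# over likely human replies, rather than against the single best reply.
# Both helpers are placeholders, not real models.
import chess

def human_reply_distribution(board: chess.Board) -> dict:
    """Placeholder for a Maia-style predictor: probability of each legal
    reply by a human opponent at some rating. Here: uniform over replies."""
    moves = list(board.legal_moves)
    return {m: 1.0 / len(moves) for m in moves} if moves else {}

def engine_eval(board: chess.Board) -> float:
    """Placeholder for an engine evaluation, from the perspective of the
    player we are choosing a move for (higher is better)."""
    return 0.0

def practical_score(board: chess.Board, move: chess.Move) -> float:
    """Expected evaluation of `move`, averaging over likely human replies."""
    board.push(move)
    replies = human_reply_distribution(board)
    if not replies:                       # opponent has no reply (mate or stalemate)
        score = engine_eval(board)
    else:
        score = 0.0
        for reply, prob in replies.items():
            board.push(reply)
            score += prob * engine_eval(board)
            board.pop()
    board.pop()
    return score

board = chess.Board()
best = max(list(board.legal_moves), key=lambda m: practical_score(board, m))
print(best)
```

A “trap” move scores well under this kind of expectation precisely because the replies that refute it get little probability mass from a human-move model.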
Concluding Speculations
Deep learning systems are able to develop powerful intuitions about good chess moves. This is what allowed AlphaZero to outplay Stockfish while examining far fewer moves, and it’s what allows Maia to predict human moves so accurately without looking ahead at all.
If someone trained an LLM on millions of chess games (as was done here with GPT-2), it would build a better model of the rules and play better. With the right form of reinforcement learning, an LLM could probably get quite good at the game. I doubt it could reach Stockfish’s level, but there’s no need for that.
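For a sense of what “an LLM playing chess” looks like mechanically, the interface is just text in, text out: serialize the game so far as a move list and sample a continuation. The sketch below loads the stock GPT-2 checkpoint purely to show the mechanics - it has not been fine-tuned on chess, so its output will mostly be nonsense; a model actually trained on millions of PGN games would be loaded the same way under a different checkpoint name.

```python
# Text-in/text-out interface for a chess-playing LLM: the game so far is a
# move list, and the model samples a continuation. The stock "gpt2"
# checkpoint has NOT been fine-tuned on chess; it only shows the mechanics.
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

game_so_far = "1. e4 e5 2. Nf3 Nc6 3. Bb5 a6 4."
inputs = tokenizer(game_so_far, return_tensors="pt")

outputs = model.generate(
    **inputs,
    max_new_tokens=8,          # roughly one or two more moves
    do_sample=True,
    top_k=50,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0]))
```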
When Maia-style chess tools are developed, people will be able to get more practical feedback on their games, though it will also make cheating much harder to detect. Currently, sites like Chess.com and Lichess check whether a player’s moves match an engine’s choices suspiciously often, but against custom human-like AI systems that check will no longer be reliable.
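The basic check those sites run can be sketched in a few lines: replay a player’s games and count how often their move coincides with the engine’s top choice. Real anti-cheat systems are of course far more elaborate (timing patterns, accuracy curves, and so on), and the PGN filename, player name, and Stockfish-on-PATH assumption below are purely illustrative.

```python
# Naive engine-matching check: what fraction of a player's moves coincide
# with the engine's top choice? Assumes a Stockfish binary is on the PATH
# and that "my_games.pgn" / "SomePlayer" are stand-in names.
import chess
import chess.engine
import chess.pgn

def engine_match_rate(pgn_path: str, player: str, think_time: float = 0.1) -> float:
    engine = chess.engine.SimpleEngine.popen_uci("stockfish")
    matches, total = 0, 0
    try:
        with open(pgn_path) as f:
            while (game := chess.pgn.read_game(f)) is not None:
                color = chess.WHITE if game.headers.get("White") == player else chess.BLACK
                board = game.board()
                for move in game.mainline_moves():
                    if board.turn == color:
                        best = engine.play(board, chess.engine.Limit(time=think_time)).move
                        matches += (move == best)
                        total += 1
                    board.push(move)
    finally:
        engine.quit()
    return matches / total if total else 0.0

print(engine_match_rate("my_games.pgn", "SomePlayer"))
```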
It’s interesting to compare how different AI systems play chess, but also to compare chess to other domains. For example, why are LLMs not that good at chess yet surprisingly good at predicting protein structure? Perhaps the protein’s sequence already contains all the information needed to determine its shape, while in chess you genuinely need to think ahead. Text prediction is powerful, but it isn’t everything.