Brief summary in case you just want the data: here’s a table of exact win/loss/tie probabilities for every pair of two card preflop hands in Texas holdem:

Eugene d’Eon and I have been playing around computing Nash equilibria for extremely simplified versions of heads-up Texas holdem poker. For those who don’t know the details, in Texas holdem each player is dealt two cards face down, followed by five cards face up which are shared between all players (with betting at various points in between these cards). More details are available on Wikipedia.

To keep the problem manageable, we considered only “preflop” betting, after the private hands have been dealt but before the shared cards are known. As in the previous post, consider two players Alice and Bob, where Bob posts a blind of 1. Alice looks at her two cards, and decides whether to bet or fold. If Alice bets, Bob then looks at his cards and decides in turn.

In order to compute the Nash equilibrium, we need to know the probability of Alice or Bob winning for each pair of possible poker hands (there are 169 unique two card hands). Initially, we were using the table from pokerstove.com for this purpose. However, their data only has six significant digits, which, frankly, is not cool. If we’re going to fritter away time analyzing simplified poker games, we might as well do an excellent job while we’re at it.

Therefore, I set out to compute the exact answers. There are only a finite number of possible hands, after all, so each probability is a rational number. And the number of hands isn’t too large: there are 48 cards left out of which we choose 5, so for each set of private cards we need to consider

$$\binom{48}{5} = 1712304$$

possible sets of shared cards. We also need to consider various ways the suits of the preflop cards can overlap. For example, if you have a king and the opponent has an ace, it’s better to have different suits so that you win if your king hits a flush. (This doesn’t matter too much in practice, but is everything if you want the exact numbers.) Taking preflop suits into account adds a factor of between 1 and 7 depending on the pair of hands (1 for AKs vs. KQs, 7 for AKo vs. QJo). There are $\binom{169}{2} = 14196$ pairs of hands, for a total of 20 to 170 billion games to loop over (the exact number turns out to be 87303531744 = 87.3 billion). Assuming we can optimize comparison of two hands down to a few hundred cycles (say 400), this would require on the order of an hour on my 2.3 GHz laptop with four hyper-threaded cores.

That’s a while, so initially I thought I could massively speed up the computation by decoupling computation of flushes and straight flushes, the hands involving suit, from the remaining suit-agnostic hands. An initial computation ignoring suits entirely (parallelized using OpenMP) managed all 14196 pairs of hands in only 1.5 seconds. Unfortunately, when I went to add suit support, I realized that at the exact level suit isn’t decoupled from card value after all. For example, the probability of a flush tie depends on the value of each players’ cards since it determines whether their cards will be higher than the five suited cards that appear. Dealing with such intricacies with decoupled value and suit is possible, but would end up dramatically increasing the complexity of the code, raising the chance of a bug and lack of our treasured exactness. Therefore, I gave up on decoupling and went back to full suit aware traversal.

52 cards cry out for packing into 64-bit integers as a bit set, providing a fast way to represent and compute with poker hands. For example, straights can be detected by shifting our bit set four times and or’ing to get the set of values ignoring suit, then shifting 5 times and and’ing to check if we have 5 consecutive cards. In order to decouple the evaluation of Alice and Bob, the core routine takes one player’s 7 total cards as input and produces an integer score representing the value of the hand, taking into account the relevant number of kickers for each. This requires lexicographic comparison of the type of hand first, followed by one or more different aspects depending on the type of hand. For examples, full houses are compared first by triple then by pair, and two pair is compared first by pair value and then by the value of the kicker.

Parallelized with OpenMP to take advantage of all available CPU cores, the full computation of exact win probabilities for all hands (“exact.txt”) took about 45 minutes on my laptop.

My initial code stored the bit vector in value-suit major order (four aces followed by four kings and so on down to twos). Unfortunately, value-major order prevented certain optimizations, such as detecting flushes in all suits using a simultaneous parallel prefix circuit instead of 4 separate ones, and also forced the use of 128 bits for the score value computed for each hand since the score needed two blocks of 52 bits each to represent two levels of lexicographic comparison (trips followed by pairs or pairs followed by kickers). Switching to suit-major order (4 blocks of 13 bits), fixed both these problems, allowing the use of 32 bit ints for most of the score computation and the final result, and reduced the time required for exact.txt to 42.5 minutes.

Next up, GPUs. Translating the score routine into naive OpenCL, I got only mild speedups of about 17% compared with 8 CPU cores. Replacing all branching with OpenCL comparison, select, and max intrinsics reduced this by a further 21%. Replacing scalar computation with four wide vector operations (easy given branch free code) gave another 17%. exact.txt now took 16 minutes to compute (these numbers don’t quite add up, so my git logs are missing something).

The minor annoyance of the OpenCL translation was choosing how to do the reduction (summing up the number of wins and losses over all hands). I eventually settled on the simple two phase scheme of reducing blocks of 256 entries down to 1 on the GPU, then completing the reduction on the CPU. The major annoyance is the complete lack of OpenCL profiling tools for ATI chips on Mac, which means it’s highly likely further optimization possibilities exist. One additional weirdness: if I do a bunch of OpenCL kernel invocations while simultaneously printing only a few hundred total lines to standard out, I get a 30% performance hit. This is likely due to context switching on the GPU, but I have no evidence for such.

To shorten the rest of the story, remaining optimizations included memoizing over the various combinations of preflop suits, avoiding an unnecessary extra kernel invocation to deal with the fact that 1712304 isn’t a multiple of 256 (small kernel invocations turn out to be quite expensive), and shaving cycles off the hand score routine down to a total of about 174 operations per 7 card hand (ignoring vectorization). The final bit twiddling, branch free, vectorized poker hand evaluation routine is here. The full git repository is

https://github.com/girving/poker

5 minutes on my laptop to evaluate 175 billion 7 card hards means roughly 1/2 billion hand evaluations per second. At roughly 175 operations per hand, our utilization is under 100 Gops. AMD advertises 480-696 single precision Gflops for my card, so we’re significantly under peak. Many of our operations are 64 bit, and they’re all integers, but this still hints at possible further speedup. Nvidia seems to have way better profiling tools, so I may see if I can get those running on some of Amazon’s EC2 GPU instances.

Throughout the process of rewriting and optimizing the code, exact.txt never changed, which provides some confidence that it’s correct. exact.txt also agrees with the pokerstove results to within their sadly lacking finite precision.

As an example of fun things we can say armed with exact win probabilities, Alice’s optimal equity in the simplified game mentioned above with a fixed bet level of 2 is exactly

$$\frac{63007785172571}{271956117382950} \approx 0.231684$$

Good to know.