How would one efficiently compute the optimal strategy for a game of Wordle?
Fundamentals
The game of Wordle relies on a definition of the following components:
- The number of characters in each word: $N_c$ (5 in standard Wordle).
- The set of all possible characters $\Sigma$, which are simply the letters A-Z.
- The set of all theoretically possible words $W = \Sigma^{N_c}$. This is simply any string of length $N_c$, with characters from $\Sigma$.
- The set of (remaining) candidates $C$. We will denote the initial (complete) set of candidates using $C_0$, such that $C \subseteq C_0$.
- The set of all allowed words to guess $G$. In order for a game to always be winnable, we assume $C_0 \subseteq G$.
- The set of all possible responses $R$, where each response consists of the colors black (B), green (G) and yellow (Y).
- We denote the all-green (win) response as $r_{\text{win}}$ and denote the all-black response as $r_{\text{black}}$.
- The response function $f : C_0 \times G \to R$, that maps each candidate-guess pair $(c, g)$ to the corresponding response $f(c, g)$ that would be shown by Wordle if $c$ was the secret word and you guessed $g$.
We assume the following:
- each candidate word is equally likely to be the secret word.
- there is no limit on the number of guesses, as this simplifies calculations later on. This assumption is of course false, but since any reasonable heuristic is capable of never exceeding the limit of guesses, the optimal strategy will never do so either, making this limit irrelevant.
- we are playing the game in “easy mode”, meaning we get to pick from the entire list of guesses every time, instead of just the list of remaining candidates (hard mode).
Response Function
There are many equivalent ways to determine the output of the response function. The pseudo-code for one such way is provided as follows.
```
function compute_response(secret, guess):
    N_c = length(secret)
    assert N_c == length(guess)
    response = array(N_c, BLACK)
    used = array(N_c, false)
    // Pass 1: find greens
    for i in 0..N_c-1:
        if guess[i] == secret[i]:
            response[i] = GREEN
            used[i] = true
    // Pass 2: find yellows
    for i in 0..N_c-1:
        if response[i] == GREEN:
            continue
        for j in 0..N_c-1:
            if guess[i] == secret[j] AND !used[j]:
                response[i] = YELLOW
                used[j] = true
                break
    return response
```

Intuitively, the total number of distinct responses is derived as $|R| = 3^{N_c} = 243$. However, some of these responses are impossible, specifically the $N_c$ responses where $N_c - 1$ of the letters are green, and the remaining one is yellow. This results in a theoretically maximal number of $3^{N_c} - N_c = 238$ responses.
Note: If a guess contains a letter that does not occur in the secret word, that letter is guaranteed to get a black response. However, this does not mean that if a specific letter gets a black response, it does not occur in the secret word, as the guess may simply contain the letter more frequently than the secret word does.
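As a sanity check, the two-pass pseudocode translates directly to Python. This is a sketch; the single-character response codes `B`/`G`/`Y` are an encoding choice, not part of the game:

```python
def compute_response(secret: str, guess: str) -> str:
    """Two-pass response computation: greens first, then yellows."""
    assert len(secret) == len(guess)
    n = len(secret)
    response = ["B"] * n          # B = black by default
    used = [False] * n            # secret positions already consumed
    # Pass 1: find greens (exact position matches)
    for i in range(n):
        if guess[i] == secret[i]:
            response[i] = "G"
            used[i] = True
    # Pass 2: find yellows (letters present elsewhere, not yet consumed)
    for i in range(n):
        if response[i] == "G":
            continue
        for j in range(n):
            if guess[i] == secret[j] and not used[j]:
                response[i] = "Y"
                used[j] = True
                break
    return "".join(response)
```

The second assertion below illustrates the note above: guessing `speed` against `abide` yields `BBYBY` — the second `e` of the guess gets a black response even though `e` does occur in the secret, because the secret contains only one `e`.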
Partitions
Given a guess $g$, we may partition our candidate set $C$ into disjoint subsets $C_r = \{ c \in C \mid f(c, g) = r \}$, where $C_r$ is the set of candidates that would generate response $r$ if $g$ were guessed.
Properties:
- The partitions are disjoint and cover $C$: $\bigcup_{r \in R} C_r = C$, and hence $\sum_{r \in R} |C_r| = |C|$.
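A partition can be computed in a single pass over the candidates. Below is a sketch, using a compact counter-based variant of `compute_response` (equivalent to the two-pass version above):

```python
from collections import defaultdict

def response(secret, guess):
    """Compact response function: B/G/Y string."""
    out, counts = ["B"] * len(secret), {}
    for s, g in zip(secret, guess):
        if s != g:                       # count unmatched secret letters
            counts[s] = counts.get(s, 0) + 1
    for i, g in enumerate(guess):
        if g == secret[i]:
            out[i] = "G"
        elif counts.get(g, 0) > 0:
            out[i], counts[g] = "Y", counts[g] - 1
    return "".join(out)

def partition(candidates, guess):
    """Split the candidate set into disjoint subsets C_r, keyed by response r."""
    parts = defaultdict(list)
    for c in candidates:
        parts[response(c, guess)].append(c)
    return dict(parts)
```

By construction the resulting subsets are disjoint and their sizes sum to $|C|$.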
Useless Guesses
We say a guess $g$ is useless with respect to a non-empty candidate set $C$ if there exists a specific, singular response $r^*$ such that $C_{r^*} = C$. That is, $g$ fails to partition $C$.
Properties:
- Any guess is useless with respect to any singleton candidate set $\{c\}$.
- For any useless guess $g$, with singular response $r^*$ for some candidate set $C$, all other responses yield the empty set: $C_r = \emptyset$ for all $r \neq r^*$.
- If a guess $g$ is in the candidate set ($g \in C$), and there is more than one candidate remaining ($|C| > 1$), then $g$ is not useless.
- For any candidate set $C$, any guess can only be useful at most once. Repeating a guess will just yield the same response for all remaining candidates.
- If a guess $g$ is useless with respect to $C$, it will also be useless with respect to any non-empty subset of $C$, yielding the same singular response $r^*$.
- If a guess shares no letters with any candidate in $C$, then it is useless, and its singular response is the all-black response $r_{\text{black}}$.
- A guess is useful with respect to $C$ if and only if it is guaranteed to strictly reduce the size of the candidate set.
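The definition suggests a direct check — a guess is useless exactly when every candidate yields the same response. A sketch, with an early exit as soon as a second response appears (the `response` helper is a compact stand-in for `compute_response`):

```python
def response(secret, guess):
    """Compact response function: B/G/Y string."""
    out, counts = ["B"] * len(secret), {}
    for s, g in zip(secret, guess):
        if s != g:
            counts[s] = counts.get(s, 0) + 1
    for i, g in enumerate(guess):
        if g == secret[i]:
            out[i] = "G"
        elif counts.get(g, 0) > 0:
            out[i], counts[g] = "Y", counts[g] - 1
    return "".join(out)

def is_useless(candidates, guess):
    """A guess is useless w.r.t. C iff every candidate yields the same response."""
    seen = set()
    for c in candidates:
        seen.add(response(c, guess))
        if len(seen) > 1:
            return False      # the guess partitions C: it is useful
    return True
```

The assertions below exercise three of the listed properties: a singleton set makes every guess useless, a guess sharing no letters with any candidate is useless, and a guess inside a multi-candidate set is always useful.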
Equivalent Guesses
We say a guess $g_1$ is equivalent to another guess $g_2$ with respect to a candidate set $C$ (denoted $g_1 \sim_C g_2$), if they produce the same partitions of $C$.
Properties:
- If two guesses $g_1$ and $g_2$ are equivalent for a candidate set $C$, then they are also equivalent for any subset of $C$. Note the converse does not hold.
- Guess equivalence forms an equivalence relation on the set of guesses with respect to a candidate set $C$. This means the relation satisfies the following three properties for all $g, g_1, g_2, g_3 \in G$:
  - Reflexivity: $g \sim_C g$.
  - Symmetry: $g_1 \sim_C g_2 \implies g_2 \sim_C g_1$.
  - Transitivity: $g_1 \sim_C g_2 \wedge g_2 \sim_C g_3 \implies g_1 \sim_C g_3$.
Next, we would like to determine an efficient way to check whether two guesses are equivalent. First, we define $L_C \subseteq \Sigma$ as the subset of letters that appear in at least one candidate $c \in C$.
Next we recognize that for any letter $l \notin L_C$, any guess containing this letter is guaranteed to get a black response in the corresponding position. This means these letters are interchangeable with respect to $C$. To utilize this, we define the projection function $\pi_C$ that replaces any letter in a guess not found in $L_C$ with a generic wildcard (e.g. `*`). If the projections of two guesses are equal, then the guesses are guaranteed to be equivalent as well.
One could attempt to refine $\pi_C$ to turn this implication into a biconditional, but this would likely be no more efficient than evaluating the equivalence directly.
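A sketch of the projection, using `*` as the wildcard (the wildcard symbol itself is arbitrary):

```python
def candidate_letters(candidates):
    """L_C: the letters appearing in at least one candidate."""
    letters = set()
    for c in candidates:
        letters.update(c)
    return letters

def project(guess, letters, wildcard="*"):
    """pi_C: replace letters occurring in no candidate with a wildcard."""
    return "".join(ch if ch in letters else wildcard for ch in guess)
```

With candidates `abide` and `crane`, the guesses `sizes` and `wives` both project to `*i*e*`, so they are guaranteed to be equivalent: the letters s, z, w, v cannot appear anywhere in the secret and always score black.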
As a Markov Decision Process
Using the aforementioned definitions, the full game of Wordle may be defined as an infinite horizon Markov Decision Process (MDP) as follows.
Trivially, each state is just the set of remaining candidate words $C$. The set of all states then becomes the power set of the initial candidate set, $S = 2^{C_0}$. However, we must add a terminal win state for when the secret word has been guessed correctly. We recognize that the state with "no words remaining" ($C = \emptyset$) is always unreachable; this essentially frees up the empty state, which we will use to denote the win state instead.
The transition function defines the probability of transitioning to a next state $C'$ given current state $C$ and guess $g$. A guess $g$ induces the partitions $C_r$; the next state is $C_r$ with probability $|C_r| / |C|$, where the win response $r_{\text{win}}$ leads to the win state $\emptyset$ instead.
We define the cost function $c(C, g) = 1$, and discount factor $\gamma = 1$, such that the total cost corresponds to the total number of guesses made until the secret word was guessed.
The goal is to find an optimal policy, $\pi^*$, that minimizes the expected total cost (expected total number of guesses).
Substituting these definitions into the Bellman optimality equations and simplifying gives:

$$V(C) = \min_{g \in G} \Big[ 1 + \sum_{r \in R^-} \frac{|C_r|}{|C|} \, V(C_r) \Big], \qquad V(\emptyset) = 0$$

Where $R^- = R \setminus \{r_{\text{win}}\}$ is the set of all possible responses excluding the win response.
As a Partially Observable MDP
(TODO: explain that we can model it as a POMDP, and explain the equivalence with the aforementioned MDP)
The Infeasibility
The reason we can't compute the optimal strategy efficiently is that the number of states is exponential in the number of initial candidates: $|S| = 2^{|C_0|}$.
In order to make the algorithm feasible, we must drastically reduce the number of states to visit.
A first insight is to realize that not all states are reachable from $C_0$. However, an attempt to find all reachable states via a BFS traversal will quickly show that even this simplified problem is still intractable, showing the same exponential growth.
A next insight is then, that many of these states are only reachable by making terrible guesses that clearly don’t correspond to the optimal value. For example, a guess filtering out only one candidate will lead to a new distinct state, but this state is likely irrelevant to the optimal strategy.
Integer optimization
To avoid the computational overhead and precision issues of floating point arithmetic, we reformulate the objective function. Instead of minimizing the expected number of guesses (which requires division), we minimize the total number of guesses required to solve for all candidates in $C$.
Let $T(C)$ be the minimum total guesses for candidate set $C$. We can define the relationship to the expected value as:

$$T(C) = |C| \cdot V(C)$$

We can then update the Bellman equations as follows:

$$T(C) = \min_{g \in G} \Big[ |C| + \sum_{r \in R^-} T(C_r) \Big]$$

This gives the integer-only recurrence relation. Note that the term $|C|$ represents the fact that the current guess adds exactly 1 guess to the path of every candidate currently in the set.
Rewriting the recurrence relation in alternating recursive form we get:

$$T(C) = \min_{g \in G} T(C, g), \qquad T(C, g) = |C| + \sum_{r \in R^-} T(C_r)$$
We recognize the following base cases:
- Empty Set: $T(\emptyset) = 0$.
- Single Word: $T(\{c\}) = 1$.
  - The word is guessed immediately.
- Two Words: $T(\{c_1, c_2\}) = 3$.
  - 1 guess identifies the first word.
  - 2 guesses identify the second.
Memoization
We use memoization to store $T(C)$ for each set of candidates where $|C| > 2$ (i.e. it is not a base case). We use top-down dynamic programming instead of bottom-up dynamic programming, since bottom-up cannot utilize the large amount of pruning we intend to do.
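The integer recurrence, base cases, and memoization combine into a top-down search. The sketch below (keyed on a frozenset of candidates, with a compact stand-in for `compute_response`) has no pruning yet, so it is exponential in general and only illustrative for tiny sets:

```python
import math

def response(secret, guess):
    """Compact response function: B/G/Y string."""
    out, counts = ["B"] * len(secret), {}
    for s, g in zip(secret, guess):
        if s != g:
            counts[s] = counts.get(s, 0) + 1
    for i, g in enumerate(guess):
        if g == secret[i]:
            out[i] = "G"
        elif counts.get(g, 0) > 0:
            out[i], counts[g] = "Y", counts[g] - 1
    return "".join(out)

def min_total_guesses(candidates, guesses, memo=None):
    """T(C): minimum total number of guesses to solve every candidate in C."""
    memo = {} if memo is None else memo
    key = frozenset(candidates)
    n = len(key)
    if n <= 1:
        return n                  # T(empty) = 0, T(single) = 1
    if n == 2:
        return 3                  # guess one word, then the other
    if key in memo:
        return memo[key]
    win = "G" * len(next(iter(key)))
    best = math.inf
    for g in guesses:
        parts = {}
        for c in key:
            parts.setdefault(response(c, g), []).append(c)
        if len(parts) == 1:
            continue              # useless guess: it fails to split C
        # |C| for this guess, plus the subtrees of all non-win partitions
        total = n + sum(min_total_guesses(p, guesses, memo)
                        for r, p in parts.items() if r != win)
        best = min(best, total)
    memo[key] = best
    return best
```

On a toy alphabet `{a, b}` with two-letter words, the set `{aa, ab, bb}` needs a total of 5 guesses (one candidate solved at depth 1, two at depth 2), matching the pigeonhole bound $2|C| - 1$.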
Branch and Bound
We use a Branch and Bound strategy to prune guesses that cannot possibly yield an optimal solution. This branch and bound algorithm is abstractly defined as follows:
min_state_val: Given an upper bound $\beta$ and a state $C$, it returns $T(C)$ if $T(C) < \beta$, and $\infty$ otherwise.
- (Optional): Sort/filter the guesses.
- For each guess $g$: evaluate $\text{min\_guess\_val}(\min(\beta, \text{best value so far}), C, g)$.
- (Optional): For $C$, cache the best guess and value if we managed to beat $\beta$.
- Return the best value found ($\infty$ if no guess beat $\beta$).
min_guess_val: Given an upper bound $\beta$, a state $C$ and a guess $g$, it returns $T(C, g)$ if $T(C, g) < \beta$, and $\infty$ otherwise.
- Compute each of the partitions $C_r$ for the given guess $g$.
- If there is only a single partition, the guess was useless, so we return $\infty$.
- Determine the initial lower bound $T_{lb}(C, g)$ using the (sizes of the) partitions.
- If $T_{lb}(C, g) \geq \beta$, return $\infty$.
- For each partition $C_r$, we refine its contribution to the lower bound:
  - Replace the partition's lower bound $T_{lb}(C_r)$ with its exact value $T(C_r) = \text{min\_state\_val}(\beta_r, C_r)$, where $\beta_r$ is the budget left for this partition given the (partially refined) bounds on the others.
  - If the refined bound is $\geq \beta$, return $\infty$.
- At this point the refined bound equals $T(C, g)$, and $T(C, g) < \beta$, so we return it.
Finally, to determine $T(C_0)$ we first determine some initial upper bound $\beta_0$, and then call $\text{min\_state\_val}(\beta_0, C_0)$.
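The two mutually recursive functions can be sketched as follows. This is an illustration under simplifying assumptions, not the full algorithm: it uses only the simple pigeonhole bound $2|C| - 1$ (derived later as Bound 2) for the initial estimates, and omits memoization, guess sorting/filtering, and caching:

```python
import math

def response(secret, guess):
    """Compact response function: B/G/Y string."""
    out, counts = ["B"] * len(secret), {}
    for s, g in zip(secret, guess):
        if s != g:
            counts[s] = counts.get(s, 0) + 1
    for i, g in enumerate(guess):
        if g == secret[i]:
            out[i] = "G"
        elif counts.get(g, 0) > 0:
            out[i], counts[g] = "Y", counts[g] - 1
    return "".join(out)

def t_lb(n):
    """Pigeonhole lower bound on T(C) for |C| = n: 1 + 2(n - 1)."""
    return max(0, 2 * n - 1)

def min_guess_val(beta, cand, g, guesses):
    """Return T(C, g) if it is < beta, else infinity."""
    parts = {}
    for c in cand:
        parts.setdefault(response(c, g), []).append(c)
    if len(parts) == 1:
        return math.inf                      # useless guess
    win = "G" * len(g)
    subsets = [p for r, p in parts.items() if r != win]
    bound = len(cand) + sum(t_lb(len(p)) for p in subsets)
    if bound >= beta:
        return math.inf                      # cannot beat the incumbent
    for p in subsets:                        # refine each estimate to an exact value
        budget = beta - (bound - t_lb(len(p)))
        bound += min_state_val(budget, p, guesses) - t_lb(len(p))
        if bound >= beta:
            return math.inf
    return bound                             # the bound is now exact: T(C, g) < beta

def min_state_val(beta, cand, guesses):
    """Return T(C) if it is < beta, else infinity."""
    n = len(cand)
    if n <= 2:                               # base cases: 0, 1 or 3 total guesses
        base = n if n < 2 else 3
        return base if base < beta else math.inf
    best = math.inf
    for g in guesses:
        best = min(best, min_guess_val(min(best, beta), cand, g, guesses))
    return best
```

Note how the budget passed down for each partition subtracts the bounds already accumulated for the other partitions, so a subtree is abandoned as soon as it alone would push the total past $\beta$.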
Maintaining useful guesses
We observe that as the set of candidates shrinks, the set of useful guesses in $G$ also shrinks. Consequently, iterating over the full set of all allowed guesses (approx. 13,000) is wasteful when $|C|$ is small, as most guesses will yield zero information.
In order to solve this, we can maintain a reduced list of potentially useful guesses to pass down the recursion tree. Note this set must be a superset of the truly useful guesses: if we excluded guesses that are still capable of partitioning $C$, we would lose the guarantee of optimality.
On top of this we can sort the guesses based on how "useful" they are expected to be. This can be done using any of the heuristic scores. Interestingly enough, determining whether a guess has no letters in common with any of the remaining words, or directly determining whether it is capable of partitioning $C$, combines very well with computing the letter-frequency heuristic score, or the min-expected-remaining score, respectively.
Max distinct responses
We want to know the maximum number of candidates that can be distinguished in a single step.
Let $b(C, g)$ be the number of distinct responses generated by a guess $g$ against a candidate set $C$:

$$b(C, g) = |\{ f(c, g) : c \in C \}|$$

We define the Max Branching Factor $B(C)$ as the maximum possible distinct responses achievable by any single guess:

$$B(C) = \max_{g \in G} b(C, g)$$

The combinatorial limit for this max branching factor was previously determined to be $3^{N_c} - N_c = 238$. The practical limit is lower than this: due to correlations between English words, no single guess can actually achieve 238 partitions.
Computing $B(C)$ is an expensive operation ($O(|G| \cdot |C|)$). However, we rely on the property that the branching factor is monotonic with respect to set inclusion: if $C' \subseteq C$, then for all $g$, $b(C', g) \leq b(C, g)$. Consequently:

$$C' \subseteq C \implies B(C') \leq B(C)$$

This implies the max branching factor of the initial set of candidates, $B(C_0)$, is larger than or equal to any other max branching factor $B(C)$ for $C \subseteq C_0$.
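Both quantities follow directly from the definitions; a sketch (with the compact `response` stand-in), which also demonstrates the monotonicity on a toy two-letter alphabet:

```python
def response(secret, guess):
    """Compact response function: B/G/Y string."""
    out, counts = ["B"] * len(secret), {}
    for s, g in zip(secret, guess):
        if s != g:
            counts[s] = counts.get(s, 0) + 1
    for i, g in enumerate(guess):
        if g == secret[i]:
            out[i] = "G"
        elif counts.get(g, 0) > 0:
            out[i], counts[g] = "Y", counts[g] - 1
    return "".join(out)

def num_distinct_responses(cand, guess):
    """b(C, g): number of distinct responses g generates against C."""
    return len({response(c, guess) for c in cand})

def max_branching_factor(cand, guesses):
    """B(C): best achievable split over all allowed guesses."""
    return max(num_distinct_responses(cand, g) for g in guesses)
```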
Lower bounds on $T(C)$
We want admissible lower bounds (optimistic estimates) on the cost to perform pruning.
Bound 1 (Information Theoretic)
Each response from Wordle provides information that reduces the candidate set. To distinguish among $|C|$ possibilities requires at least $\log_{|R|} |C|$ responses in expectation. Converting this to total guesses:

$$T(C) \geq |C| \cdot \log_{|R|} |C|$$

Where $|R|$ is a constant independent of $C$, and the logarithm can be computed extremely efficiently on most computers.
Bound 2 (Minimum Depth / Pigeonhole)
We derive the absolute minimum number of total guesses required for a set of size $|C|$ by assuming the best-case scenario for every guess.
- We must pick a single guess $g$.
- If the secret word happens to be $g$, we solve it in 1 guess. This happens for at most one candidate in $C$.
- For the remaining $|C| - 1$ candidates, the guess is incorrect. Therefore, we need at least 1 additional guess to solve them, giving a path length of at least 2.

$$T(C) \geq 1 + 2(|C| - 1) = 2|C| - 1$$
Bound 3 (Capacity Bound Pigeonhole)
Using the branching factor we can tighten the lower bound for large sets ($|C| > B$, where $B = B(C_0)$). The standard pigeonhole bound assumes we can solve all remaining candidates at depth 2. However, we are physically limited by the number of distinct buckets (responses) available.
- Depth 1: We can identify at most 1 candidate (the secret word itself).
- Depth 2: We can identify at most $B - 1$ distinct candidates. (One distinct response per candidate, minus the "win" response used at depth 1.)
- Depth 3: Any remaining candidates must be solved at depth 3 or greater.
Derivation: If $|C| > B$, the minimum configuration of guesses is:
- $1$ candidate at cost 1.
- $B - 1$ candidates at cost 2.
- The remaining $|C| - B$ candidates at cost 3 (optimistically).

$$T(C) \geq 1 + 2(B - 1) + 3(|C| - B) = 3|C| - B - 1$$

Note: While this geometric capacity constraint extends to deeper levels (depth 4, 5, etc.), for standard Wordle the remaining candidates always fit within depth 3 (since $|C_0| \leq 1 + (B - 1) + (B - 1)^2$), making further expansion unnecessary.
Combining Lower Bounds
The tightest lower bound is just the maximum of all lower bounds. Assuming $|R| = 243$, Bound 2 is tighter than Bound 1 for all relevant candidate set sizes ($|C| < |R|^2$). On top of this, it's also more efficient to compute. Bound 3 is tighter than Bound 2 whenever $|C| > B$, but slightly less efficient to compute.
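Bounds 2 and 3 combine into a single integer-only function; a sketch, where `b_max` is an upper bound on the branching factor (e.g. $B(C_0)$), and Bound 1 is omitted since it is dominated in the relevant range:

```python
def lower_bound(n, b_max):
    """Admissible lower bound on T(C) for |C| = n, given branching limit b_max."""
    if n == 0:
        return 0
    if n <= b_max:
        return 2 * n - 1                  # Bound 2: 1 + 2(n - 1)
    # Bound 3: 1 word at depth 1, b_max - 1 at depth 2, the rest at depth 3.
    # For n > b_max this always exceeds Bound 2, so no max() is needed.
    return 1 + 2 * (b_max - 1) + 3 * (n - b_max)
```

For example, with the combinatorial limit $B = 238$, a set of 300 candidates gets the bound $3 \cdot 300 - 238 - 1 = 661$ instead of Bound 2's $599$.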
Lower bounds on $T(C, g)$
Finally, we may compute a lower bound on the specific total cost of a guess (with $T_{lb}(C, g) \leq T(C, g)$) as follows:

$$T_{lb}(C, g) = |C| + \sum_{r \in R^-} T_{lb}(C_r)$$

where $T_{lb}(C_r)$ is the tightest of the aforementioned lower bounds evaluated on the partition $C_r$.
Upper bounds on $T(C)$
Since we are minimizing cost, the total cost produced by any valid policy is a valid upper bound on the true minimal total cost $T(C)$.
Let $\pi$ be a heuristic policy function that returns a guess $\pi(C)$ for a candidate set $C$. We can calculate the total cost $T_\pi(C)$ by simulating the game tree using $\pi$. This is computed as:

$$T_\pi(C) = |C| + \sum_{r \in R^-} T_\pi(C_r), \qquad T_\pi(\emptyset) = 0$$

where the partitions $C_r$ are induced by the guess $g = \pi(C)$.
We use this as the first upper bound $\beta_0 = T_\pi(C_0)$. We then repeatedly refine this upper bound as we go on.
Heuristics
Pick Max Frequency
Goal: Maximize coverage of common letters across candidates.
First we define the number of candidate words containing a specific letter $l$:

$$\text{freq}(l) = |\{ c \in C : l \in c \}|$$

For each guess $g$, we sum the frequencies across the unique letters from $g$:
- $\text{score}(g) = \sum_{l \in \text{letters}(g)} \text{freq}(l)$, where $\text{letters}(g)$ is the set of distinct letters in $g$.
Computing the score of all guesses has a time complexity of $O((|C| + |G|) \cdot N_c)$.
The approximation of the optimal guess is the guess that maximizes the score:

$$g^* \approx \operatorname*{arg\,max}_{g \in G} \, \text{score}(g)$$
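The steps above can be sketched directly (a minimal illustration; no response computation is needed for this heuristic):

```python
def pick_max_frequency(candidates, guesses):
    """Heuristic: pick the guess whose distinct letters cover the most candidates."""
    freq = {}
    for c in candidates:
        for letter in set(c):         # count each candidate once per letter
            freq[letter] = freq.get(letter, 0) + 1
    return max(guesses, key=lambda g: sum(freq.get(l, 0) for l in set(g)))
```

Counting each candidate once per letter (via `set(c)`) matches the definition of $\text{freq}(l)$ as the number of candidate *words* containing the letter, not the number of letter occurrences.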
Pick Min Remaining
Goal: Minimize expected number of remaining candidates after the guess.
Expected remaining candidates for a given guess $g$:

$$\mathbb{E}\big[\,|C'| \mid g\,\big] = \sum_{r \in R} \frac{|C_r|}{|C|} \cdot |C_r| = \frac{1}{|C|} \sum_{r \in R} |C_r|^2$$

The factor $\frac{1}{|C|}$ is independent of the guess, so we leave it out of the score: $\text{score}(g) = \sum_{r \in R} |C_r|^2$.
Computing the score of all guesses has a time complexity of $O(|G| \cdot |C|)$.
The approximation of the optimal guess is the guess that minimizes the score:

$$g^* \approx \operatorname*{arg\,min}_{g \in G} \, \text{score}(g)$$
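A sketch of this heuristic, again with the compact `response` stand-in; only partition sizes are needed, not the partitions themselves:

```python
def response(secret, guess):
    """Compact response function: B/G/Y string."""
    out, counts = ["B"] * len(secret), {}
    for s, g in zip(secret, guess):
        if s != g:
            counts[s] = counts.get(s, 0) + 1
    for i, g in enumerate(guess):
        if g == secret[i]:
            out[i] = "G"
        elif counts.get(g, 0) > 0:
            out[i], counts[g] = "Y", counts[g] - 1
    return "".join(out)

def pick_min_remaining(candidates, guesses):
    """Heuristic: minimize the sum of squared partition sizes."""
    def score(g):
        sizes = {}
        for c in candidates:
            r = response(c, g)
            sizes[r] = sizes.get(r, 0) + 1
        return sum(n * n for n in sizes.values())
    return min(guesses, key=score)
```

For candidates `{crane, crate, cramp, trace}`, guessing `crane` splits them into four singletons (score 4), whereas `cramp` leaves `crane` and `crate` indistinguishable (score 6), so `crane` is preferred.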
Algorithmic Optimizations
Indices
Instead of storing actual letters, responses, and words, it is beneficial to store the indices of their respective collections instead. This saves space, and leads to more efficient serialization or lookups.
To determine the ordering of the collections we simply sort the set alphabetically, and then assign an index to each element based on this ordering.
- For guesses and candidates, ordering alphabetically is trivial.
- For the responses, we interpret a response as a length-$N_c$ string, where each of the symbols in the string corresponds to black (B), green (G), or yellow (Y). These strings are then sorted alphabetically. Equivalently, a response string may be efficiently mapped to an index by mapping the letters to numbers (B = 0, G = 1, Y = 2) and reading the string as a little-endian ternary number.
- For letters A-Z, we can map a letter to its index by simply subtracting the letter A, consequently assigning the numbers 0-25 to the letters.
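A sketch of both mappings, assuming the alphabetical digit assignment B = 0, G = 1, Y = 2:

```python
def response_to_index(resp):
    """Read a B/G/Y response string as a little-endian ternary number."""
    digits = {"B": 0, "G": 1, "Y": 2}
    return sum(digits[ch] * 3 ** i for i, ch in enumerate(resp))

def letter_to_index(letter):
    """Map 'A'..'Z' to 0..25 by subtracting the code of 'A'."""
    return ord(letter) - ord("A")
```

For $N_c = 5$ the response indices run from 0 (`BBBBB`) to 242 (`YYYYY`), with the all-green win response at index 121.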
Response Cache
The solver will rely heavily on partitioning candidate sets based on responses. As a consequence, the speed with which we can determine the response corresponding to a given secret word and guess is important. To solve this we precompute a ($|G|$ rows × $|C_0|$ columns) response matrix, stored in a single vector in row-major order (to improve cache locality), indexed using the combined indices of the guess and candidate, and containing the corresponding response indices.
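A sketch of building and indexing such a matrix (again with the compact `response` stand-in and the ternary response index):

```python
def response(secret, guess):
    """Compact response function: B/G/Y string."""
    out, counts = ["B"] * len(secret), {}
    for s, g in zip(secret, guess):
        if s != g:
            counts[s] = counts.get(s, 0) + 1
    for i, g in enumerate(guess):
        if g == secret[i]:
            out[i] = "G"
        elif counts.get(g, 0) > 0:
            out[i], counts[g] = "Y", counts[g] - 1
    return "".join(out)

def response_to_index(resp):
    """B/G/Y string read as a little-endian ternary number (B=0, G=1, Y=2)."""
    digits = {"B": 0, "G": 1, "Y": 2}
    return sum(digits[ch] * 3 ** i for i, ch in enumerate(resp))

def build_response_matrix(guesses, candidates):
    """Flattened row-major |G| x |C0| matrix of response indices."""
    n_cols = len(candidates)
    matrix = [0] * (len(guesses) * n_cols)
    for gi, g in enumerate(guesses):
        row = gi * n_cols
        for ci, c in enumerate(candidates):
            matrix[row + ci] = response_to_index(response(c, g))
    return matrix
```

A lookup is then just `matrix[guess_index * n_cols + candidate_index]`, so a whole row (one guess against every candidate) is contiguous in memory.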
Vector Maps
If the maximum index is quite small (as is the case with the letters and responses), one may also replace hash sets/maps with a vector or bitset, removing the need for hashing entirely, at the cost of allocating a little more space for sparse sets.
This cost of allocation can be remedied by reusing the same set multiple times. In such a case you maintain two vectors: one functioning as the set/map, and the other listing the indices of the first vector that are non-empty. Assigning values stays O(1), but clearing the vector takes O(n) instead of O(capacity), where capacity is the maximum index + 1.
For memory efficiency, it is likely that the first vector will be a bit-set. In such a case the O(n) reset operation is likely to be slower for dense sets than the O(capacity) operation. In this scenario one might consider a hybrid approach, where the vector map switches to bitset-level clearing instead of the backtracking approach whenever the number of elements in the set exceeds some threshold.
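A sketch of the two-vector scheme (sometimes called a "sparse set"); the hybrid clearing threshold is left out for brevity:

```python
class VectorSet:
    """Set of integers in [0, capacity), backed by a membership vector plus
    a list of occupied indices so that clear() is O(n), not O(capacity)."""

    def __init__(self, capacity):
        self.member = [False] * capacity  # the set/map vector
        self.items = []                   # indices currently in the set

    def add(self, i):
        if not self.member[i]:
            self.member[i] = True
            self.items.append(i)

    def __contains__(self, i):
        return self.member[i]

    def __len__(self):
        return len(self.items)

    def clear(self):
        for i in self.items:              # touch only the occupied slots
            self.member[i] = False
        self.items.clear()
```

Membership tests and insertions avoid hashing entirely; the `items` list is exactly the backtracking information needed for the cheap reset.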
In general, the most significant bottlenecks will be the way we represent candidates, and specifically how we partition these candidates. In order to compare the aforementioned methods, it is recommended to have the solver rely on an interface. In this case we can swap the backends easily, allowing the different approaches to be compared.
Hashing
If one relies on some sort of HashMap where the keys are based on the response, letter, guess, or candidate indices, you may benefit from using a simple hasher. Specifically, Hash-DoS protection is redundant here, yet it is included in the default Rust hasher.
Recursion vs Iteration
The branch and bound definitions describe a recursive approach. At the cost of readability, one may consider implementing the logic in a single iterative function to reduce the overhead of function calls.
Policy Evaluation
We define two abstractions:
- A response policy, which must implement a method named `guess` that accepts an optional response, and returns the next action or an error if the response was unexpected. If no response was provided, this indicates the policy must provide its first guess.
- A candidates policy, which must implement a method named `guess` that accepts the set of remaining candidates, and returns the next action, or an error if the remaining candidate set was unexpected.
A candidate set, or a history of previous guesses and their respective responses is a full description of the current state, however a single response (e.g. bbgyy) is not. Consequently, any response policy is responsible for keeping track of the state itself, whereas any candidates policy does not need to store any state.
Next we define a simple simulate function that relies on a response policy.
```
function simulate(policy, secret):
    guess := policy.guess(none)
    n_guesses := 1
    while guess != secret:
        response := compute_response(secret, guess)
        guess := policy.guess(response)
        n_guesses := n_guesses + 1
    return n_guesses
```

Note that `n_guesses` starts at 1 to count the initial guess, so that a first-guess win costs exactly 1 (matching the base case $T(\{c\}) = 1$). In order to utilize the same logic for any candidates policy, we write some wrapper logic (that implements the response policy abstraction) that maintains the set of remaining candidates internally, and passes this down to the candidates policy.
Next we define an evaluate function to determine the total number of guesses for a policy to guess all the candidates.
```
function evaluate(policy, initial_candidates):
    total_guesses := 0
    for secret in initial_candidates:
        total_guesses := total_guesses + simulate(policy, secret)
    return total_guesses
```

The expected number of guesses is then equal to total_guesses divided by the number of elements in initial_candidates.
Storing Policies
Because a policy is deterministic, its execution for a given set of initial candidates and allowed guesses can be perfectly modelled using a static decision tree. We do not need to store the dynamically shrinking list of remaining candidates at each step; the structure of the tree itself implicitly encodes the state based on the history of responses.
Any policy is represented using a tree, where:
- Guesses are nodes, with the initial guess at the root, and the leaf nodes being winning guesses (the secret words).
- Edges are the responses returned by the game.
While the theoretical state space of all possible guess-response combinations is massive, we only need to store the states reachable by strictly following the policy.
Any such policy tree may be serialized in two ways:
- compact: maintaining the usage of indices to indicate guesses, and responses.
- readable: as a JSON file, where each node object has two attributes: "guess", which is just a string representing the guess to make, and "children", which is a map of each response string to another node.
Any serialized policy tree must be accompanied by a hash of the alphabetically sorted set of allowed guesses and set of initial candidates. This hash can be checked to ensure the policy was created on the same set of initial candidates and allowed guesses.
Next we want an efficient way to determine these policy trees. First we define a separate simulate_history method (or rewrite the previous one) such that it returns the full history of guess-response pairs, given a response policy and a secret word.
```
function simulate_history(policy, secret):
    history := []
    guess := policy.guess(none)
    while guess != secret:
        response := compute_response(secret, guess)
        history.append((guess, response))
        guess := policy.guess(response)
    // Append the final winning guess.
    // The last response is always the win response (indicated by none).
    history.append((guess, none))
    return history
```

The policy tree of a response policy may then be determined as follows:
```
function capture(policy, initial_candidates):
    tree := {guess: none, children: {}}
    for secret in initial_candidates:
        history := simulate_history(policy, secret)
        node := tree
        for turn in history:
            if node.guess is not none and node.guess != turn.guess:
                throw Error("Policy is not deterministic!")
            node.guess := turn.guess
            if turn.response is none:
                break
            if turn.response not in node.children:
                node.children[turn.response] := {guess: none, children: {}}
            node := node.children[turn.response]
    return tree
```

Once such a tree has been captured, we can build some wrapper logic that implements the response policy abstraction for any given policy tree.