Information Theory Finds the Best Wordle Starting Words

Information theory can help people mathematically calculate the best starting guess for a popular online game

Hand of a smartphone user playing Wordle. — Wachiwit/Alamy Stock Photo

How did you spend the past few years as the COVID pandemic raged and limited our leisure options? Software developer Josh Wardle and his partner passed the time with crossword puzzles from the New York Times. At one point, Wardle remembered an idea for a similar game he had thought up a few years earlier.

The word game he then created, called Wordle, based on his last name, became a smash hit in 2022. Twitter timelines flooded with Wordle results. Even though the game revolves around guessing a word that changes daily, there is a lot of mathematics behind it.

Wardle came up with the basic idea back in 2013. You have six attempts to correctly determine a five-letter word. You first type a word—for example, “start”—by inputting letters into five free fields. After that, the fields change color. They become green if the letter appears in the exact place in the solution word, yellow if the letter is included in a different place in the solution and gray if the letter is not part of the solution. Following these clues, you can type a second word and gather information about the letters of the solution word until you discover the answer you are looking for. The principle is somewhat reminiscent of Mastermind, a game that was popular in the 1970s.
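
The coloring rules can be sketched in a few lines of Python. This is an illustrative sketch, not the game's actual code: the function name `feedback` and the G/Y/- encoding are invented here, and the treatment of repeated letters (each letter in the solution can justify at most one colored square) is an assumption that the description above does not spell out.

```python
from collections import Counter

def feedback(guess: str, solution: str) -> str:
    """Return one mark per letter: G = green, Y = yellow, - = gray."""
    marks = ["-"] * len(guess)
    # Count the solution letters that were not matched exactly.
    unmatched = Counter()
    for i, (g, s) in enumerate(zip(guess, solution)):
        if g == s:
            marks[i] = "G"
        else:
            unmatched[s] += 1
    # Assumption: a misplaced letter turns yellow only while unmatched
    # copies of it remain in the solution.
    for i, g in enumerate(guess):
        if marks[i] == "-" and unmatched[g] > 0:
            marks[i] = "Y"
            unmatched[g] -= 1
    return "".join(marks)

# Hypothetical solution "toast", chosen only for illustration.
print(feedback("start", "toast"))  # prints YYG-G
```

Here the S and the first T are in the solution but in the wrong place, the A and the final T are in the right place, and the R does not occur at all.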

You can enter any English word consisting of five letters, of which there are about 10,000. Because that list also contains highly unusual expressions such as “aahed” (the past tense of “aah”), however, the solution word is part of a much shorter list of 2,309 common English terms. The goal is to find the solution word in as few tries as possible. Adding to the thrill, you can’t play the game multiple times in a row. Every day there is only one solution word—and it’s the same word for all players around the world. This twist gives the game a social component that has probably contributed to its popularity.

An Unexpected Success

But a global crowd-pleaser wasn’t what Wardle was aiming for at all. He picked up his Wordle idea again in early 2021 to make an easy-to-use game to pass the time with his partner. For several months they were the only two users. At some point, their family members caught wind of the game, and Wardle decided in October 2021 to offer it on his personal website, free of charge and without advertising. Shortly thereafter Wordle went through the roof. Ninety users were playing Wordle every day on November 1, 2021; by January 1, 2022, the number had already reached 300,000. Another week later the game had two million users.

In January 2022 the New York Times announced it had acquired the rights to Wordle for a low seven-figure sum. This further increased the game’s reach. By March 2022 tens of millions of people around the world had already played Wordle at least once. A special feature of the game is that after playing, you can download the color code from your game (that is, the colored playing fields) as an emoji and share it on social media to compare yourself with others. Most people need about four tries on average to solve a Wordle. Anything less than that is considered a success.

If you’ve ever tried your hand at Wordle, then you know the result depends heavily on the starting word you choose. For instance, “start” is not a very smart first attempt because it contains the letter T twice. You’ve wasted one of five places where you could have gathered information about other letters. Of course, you could be lucky, and the solution word could also contain two Ts, but in all other cases the second T tells you little that the first did not. According to the New York Times, the most popular starting words are “adieu” and “audio.” Because both words consist mostly of vowels, they quickly make clear which vowels are in the solution word. But is that really the best choice?

Information Content versus Hit Rate

Maybe it’s better to start with a word such as “Texas.” If a rare letter such as X is contained in the solution word, you would clear out a huge amount of the 2,309 possible solutions in the first step. In fact, only 37 of the possible words contain an X. The probability is high, however, that no X appears in the solution word. In these cases, that information is hardly worth anything. If one knows that the solution does not have an X, the possibilities are merely reduced from 2,309 to 2,272. Therefore, the player must ask, “Do I value gaining as much information as possible? Or would I rather have a high probability of guessing a letter correctly?”

The fact that information and probability are related is not new. Mathematician Claude Shannon, founder of information theory, recognized this and defined a measure of information content with this relationship in mind. Suppose one has a space of possible events, in our case the 2,309 solution words of Wordle. One bit of information then corresponds to feedback that halves the solution space: for example, learning that the solution word contains the letter S (about half of all solutions have at least one S).

Two bits of information clear out three quarters of the solutions—such as when the solution word contains a T. And with three bits of information, only one eighth of all words remain. This means that the more likely a letter is to be contained in the solution, the smaller its information content is.

For each bit of information, the possibilities are halved. If a Wordle solution word contains the letter S, for example, this cuts half of the possible solution words. Credit: Spektrum der Wissenschaft/Manon Bischoff

This idea can be expressed mathematically. The probability (p) of finding a word with a certain property (such as the letter A) can be calculated by dividing the total number of words containing A (represented as M_A) by the number of all words (M). So p = M_A / M. At the same time, the information (I), meaning “The word contains an A,” reduces the space of all possibilities (M) by the factor (½)^I. We can present that as M_A = (½)^I × M.

By inserting one equation into the other, one arrives at a formula that combines information content and probability: p = (½)^I × M / M, so p = (½)^I. This can also be reversed and solved for I: I = –log₂ p.
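
As a quick numerical check of this formula, here is a short Python sketch that applies it to the counts quoted earlier for the letter X (37 of the 2,309 solution words). The function name `information_bits` is chosen here purely for illustration.

```python
from math import log2

def information_bits(matching: int, total: int) -> float:
    """Information I = -log2(p), where p = matching / total."""
    return -log2(matching / total)

TOTAL = 2309   # solution words, as stated in the article
WITH_X = 37    # solution words containing an X, as stated in the article

# Learning that an X is present: unlikely, hence very informative.
print(round(information_bits(WITH_X, TOTAL), 2))
# Learning that no X is present: likely, hence nearly worthless.
print(round(information_bits(TOTAL - WITH_X, TOTAL), 3))
# Halving, quartering and cutting to one eighth give 1, 2 and 3 bits.
print(information_bits(1, 2), information_bits(1, 4), information_bits(1, 8))
```

With these numbers, finding an X is worth roughly 5.96 bits, whereas ruling it out is worth only about 0.023 bits.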

Shannon came across this amazing connection between probability and information content in 1948. According to a 1971 article published in Scientific American, Shannon said, “My greatest concern was what to call [this new quantity I]. I thought of calling it ‘information,’ but the word was overly used, so I decided to call it ‘uncertainty.’ When I discussed it with [computer scientist, physicist and mathematician] John von Neumann, he had a better idea. Von Neumann told me, ‘You should call it entropy, for two reasons. In the first place your uncertainty function has been used in statistical mechanics under that name, so it already has a name. In the second place, and more important, no one knows what entropy really is, so in a debate you will always have the advantage.’”

Ever since, the quantity I, defined above, has been called entropy.

But back to Wordle. Entropy can help us find a suitable starting word. The higher the entropy of a word, the higher the information gain. A high entropy is always accompanied by a low hit rate, however, so you should find a balance of both factors to choose the best possible starting word.

You can calculate the entropy expectation value for all possible inputs, as mathematician Grant Sanderson did on his YouTube channel 3Blue1Brown. To do this, Sanderson proceeded as follows: first, for each of the 10,000 or so input words, he calculated the frequency of color patterns that could emerge based on the 2,309 solution words.

For example, five gray squares (all letters incorrect) can appear 250 times. A green one followed by four gray squares (first letter correct and in the right place), on the other hand, can appear only 15 times, and so on. The more often a color pattern can occur, the higher the probability of encountering it after a word has been entered. At the same time, the color code provides information that can be measured by entropy. Because some solution words are excluded, the solution space decreases.

To find out how much information you will get, on average, from an initial word, you can calculate the entropy for each possible associated color code and weight it with the probability of occurrence. In other words, you can calculate an expected value. As it turns out, the word “soare” (an obsolete term for a young hawk) performs best, with an expected value of 5.89 bits. This means that if you start with this word, the space of possible solution words shrinks on average by a factor of 2^–5.89, to about 1.7 percent of the possibilities. So on average, about 39 solution words are still possible (2,309 × 0.017 ≈ 39).
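
The expected-value calculation can be sketched as follows. This is not Sanderson's actual code: the function names, the G/Y/- encoding, the handling of repeated letters and the four-word toy list are all assumptions made for illustration, and a real run would use the full list of 2,309 solution words.

```python
from collections import Counter
from math import log2

def feedback(guess: str, solution: str) -> str:
    """Color pattern as a string: G = green, Y = yellow, - = gray."""
    marks = ["-"] * len(guess)
    unmatched = Counter()
    for i, (g, s) in enumerate(zip(guess, solution)):
        if g == s:
            marks[i] = "G"
        else:
            unmatched[s] += 1
    for i, g in enumerate(guess):
        if marks[i] == "-" and unmatched[g] > 0:
            marks[i] = "Y"
            unmatched[g] -= 1
    return "".join(marks)

def expected_bits(guess: str, solutions: list[str]) -> float:
    """Average information of a guess, assuming equally likely solutions."""
    # How often does each color pattern occur across all solutions?
    counts = Counter(feedback(guess, s) for s in solutions)
    total = len(solutions)
    # Weight the information of each pattern by its probability.
    return sum((n / total) * -log2(n / total) for n in counts.values())

# Toy solution list, far smaller than the real one.
toy = ["light", "might", "night", "sight"]
print(expected_bits("limns", toy))  # every word yields a different pattern
print(expected_bits("fight", toy))  # every word yields the same pattern
```

On this toy list, “limns” earns the full 2 bits because it separates all four words, while “fight” earns nothing because every candidate produces the same colors.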

Start with “Soare” to Do Well

A game of Wordle involves several guesses, not just one. By choosing a suitable combination of two consecutive words, it may be possible to narrow down the possible solutions further than by starting with soare.

Sanderson also followed this approach. He proceeded as follows: Suppose that after typing soare, you get five gray boxes. So you only know that the letters S, O, A, R and E are not part of the solution word. From this, Sanderson checked which second color pattern can emerge for all possible subsequent inputs and thus calculated the expected value for the entropy of the second input word. If after the start word soare, all fields are gray, the best choice for the second input is “clint.” (A clint, by the way, is a hard rock.)

Now you can search for the most appropriate second word for the other color patterns that may appear after you type soare. For example, for a green square followed by four gray squares, “thilk” (another obsolete term meaning “that” or “this”) gives the best result. If we now weight the entropy of the second words with the corresponding probabilities, we get a value of 4.11. That means with the start word soare, we gain, on average, 5.89 bits of information, and with the optimal second word, we gain another 4.11 bits. If one were to play Wordle perfectly, one would obtain an average of 10 bits of information after two attempts—that is, the solution space would be reduced by a factor of 2^–10, leaving an average of 2.25 solution words.

“Slane” as an Even Better Strategy

If you look at the optimal combination of two words, another selection turns out to be even more powerful: “slane” (a special spade for peat digging). This starting word provides an average of only 5.77 bits of information, but with an optimal second input, you receive another 4.27 bits on average. This brings the total to 10.04 bits and reduces the 2,309 possibilities to an average of 2.19 words.
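
The two-guess totals quoted for soare and slane can be checked with a few lines of Python. The sketch follows the article's own conversion, in which a gain of I bits shrinks the 2,309 possibilities by a factor of 2^–I; the function name `remaining` is hypothetical.

```python
SOLUTIONS = 2309  # number of possible solution words, from the article

def remaining(bits: float, total: int = SOLUTIONS) -> float:
    """Average number of candidates left after gaining `bits` of information."""
    return total * 2 ** -bits

# (start word, bits from first guess, bits from best second guess)
for start, first, second in [("soare", 5.89, 4.11), ("slane", 5.77, 4.27)]:
    total_bits = first + second
    print(start, round(total_bits, 2), round(remaining(total_bits), 2))
```

This prints 10.0 bits and 2.25 words for soare, and 10.04 bits and 2.19 words for slane.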

If you want to design a Wordle algorithm that is as masterful as possible, it is important to consider the second word choice. But for human players, this strategy probably doesn’t matter much. After all, it’s impossible to remember which subsequent word is most appropriate for every color pattern that occurs after slane. Therefore, it shouldn’t make much difference whether you start a game with soare or slane.

Nevertheless, it is quite useful to consider information theory when playing Wordle, as Quanta Magazine impressively illustrated. Suppose you start the game with “bloat” and get gray, gray, gray, yellow, yellow. Then you know the solution word contains an A and a T (but in different places) and no B, L or O. Second, you try your luck with “watch,” and you are almost there: the first field is gray; the other four are green. So the first letter is wrong, but all others are correct. How do you continue?

You could now simply guess, for example, “match.” But—assuming you are playing regular Wordle, rather than hard mode—from an information-theoretical perspective, you should enter “chimp.”

Sure, chimp can’t possibly be the solution. But it helps narrow down the options. After entering watch, there are still four words that come to mind: catch, hatch, match and patch. If you enter these one after the other, you can still win the game, but you may do poorly. Entering chimp, on the other hand, reveals which starting letter (C, H, M or P) is correct. Thus, you have won the game after four tries. If you like risk, you can of course try your luck and hope to guess the correct solution in the third attempt.
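
To see why chimp works, one can group the four remaining candidates by the color pattern each guess would produce. The following Python sketch does this. The names `feedback` and `partition`, the G/Y/- encoding and the handling of repeated letters are assumptions made for illustration.

```python
from collections import Counter, defaultdict

def feedback(guess: str, solution: str) -> str:
    """Color pattern as a string: G = green, Y = yellow, - = gray."""
    marks = ["-"] * len(guess)
    unmatched = Counter()
    for i, (g, s) in enumerate(zip(guess, solution)):
        if g == s:
            marks[i] = "G"
        else:
            unmatched[s] += 1
    for i, g in enumerate(guess):
        if marks[i] == "-" and unmatched[g] > 0:
            marks[i] = "Y"
            unmatched[g] -= 1
    return "".join(marks)

def partition(guess: str, candidates: list[str]) -> dict[str, list[str]]:
    """Group the candidates by the pattern the guess would produce."""
    groups = defaultdict(list)
    for word in candidates:
        groups[feedback(guess, word)].append(word)
    return dict(groups)

remaining = ["catch", "hatch", "match", "patch"]
print(partition("chimp", remaining))  # four patterns, one word each
print(partition("match", remaining))  # two patterns: one hit, three misses
```

After chimp, each candidate sits alone in its own group, so the fourth guess is certain. After match, three candidates still share the same pattern unless match happens to be the solution.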

In any case, I will use soare as my starting word in the future. Let’s see how many tries I need for the next Wordle. In Germany, where I live, the average number of attempts per player is 4.01. In the U.S., that number is 3.92. Maybe with the help of information theory, we’ll manage to beat the record holder, Sweden (average: 3.72 attempts), in the coming months.

This article originally appeared in Spektrum der Wissenschaft and was reproduced with permission.