Word suggestion algorithm

Status: Draft | 2025

Author: Victor Ma

Terminology

Term

Definition

Current cell

The cell that the cursor is on.

Current slot

The slot that the cursor is on.

Current filter

The filter for the current slot.

Secondary slot

The slot that intersects the current slot, at the current cell.

Secondary filter

The filter for the secondary slot.

Intersecting slot

A slot that intersects a given slot.

Intersecting slots

All the slots that intersect a given slot.

Word suggestion list

The UI element that shows a list of words that fit the current slot.

Word suggestion algorithm

An algorithm that generates word suggestions for a given slot or grid.

Dead end word

A word that results in an unfillable grid.

Grid fill

A one-to-one mapping from the slots of a grid to words.

Word suggestion list
The word suggestion list.

Our current algorithm

Our current word suggestion algorithm works like this, where \(n\) is the offset of the current cell in the current slot, and \(m\) is the offset of the current cell in the secondary slot:

  1. (Phase 1) Get the set of possible letters for the current cell, constrained by the current filter.

    1. Get the set of words that match the current filter.

    2. Return the set of letters that appear as the \(n^{\text{th}}\) letter, in the set of words.

  2. (Phase 2) Get the set of possible words for the secondary slot, and the set of possible letters for the current cell—both constrained by the secondary filter and the current filter.

    1. Get the set of words that satisfy both these constraints:

      • The word matches the secondary filter.

      • The \(m^{\text{th}}\) letter of the word is in the set of letters from Step 1.

    2. Get the set of letters that appear as the \(m^{\text{th}}\) letter, in the set of words.

    3. Return the set of words and the set of letters.

  3. (Phase 3) Get the set of possible words for the current slot, constrained by the current filter and secondary filter.

    1. Get the set of words that satisfy both these constraints:

      • The word matches the current filter.

      • The \(n^{\text{th}}\) letter of the word is in the set of letters from Step 2.

    2. Return the set of words.

  4. (Return) Return the set of words from Phase 2 and Phase 3.

The word_list_find_intersection () function implements our current word suggestion algorithm.

Limitations

Our current algorithm only considers two constraints:

  • The filter for the current slot.

  • The filter for the secondary slot.

This means that the algorithm is unaware of the other intersecting slots—and the constraints that they impose on the current slot.

Consider the following grid:

+---+---+---+---+
|   |   |   | Z |
+---+---+---+---+
|   |   |   | E |
+---+---+---+---+
|   |   |   | R |
+---+---+---+---+
| W | O | R |   | < current slot
+---+---+---+---+

4-Down begins with ZER, so the only word it can be is ZERO. This constrains the bottom-right cell to the letter O.

4-Across starts with WOR. We know that the bottom-right cell must be O, so that means 4-Across must be WORO. But WORO is not a word. So, this grid is unfillable, which means that 4-Across and 4-Down are also unfillable.

Now, suppose that the current slot is 4-Across, and the current cell is the bottom-right cell. Then, our word suggestion algorithm checks the filters for 4-Across and 4-Down. It correctly detects that the two slots are unfillable, and returns the empty set.

But what if the current cell is one of the other cells in 4-Across? Then, the algorithm never checks the filter for 4-Down, and so it doesn’t know about the constraint that it imposes on the current slot. As a result, the algorithm returns all the words that match the filter WOR?, like WORD and WORM—even though they don’t actually fit in the slot.

Consequences

There are two undesirable consequences of this:

  • Our algorithm can generate dead end words. These are frustrating for the user. The user might enter a word from the word suggestions list, reach a dead end, and then need to undo all their progress since adding the suggested word.

  • Our algorithm’s output can vary based on the current cell. This is semantically incorrect, because the word suggestions for a slot should not change based on the cursor’s position.

So, we should change our algorithm, in order to fix these issues.

Lookahead-based algorithm

One possible solution is to add lookahead to our word suggestion algorithm. The lookahead-based algorithm takes into account the intersecting slots of the current slot, when generating the word suggestions. Here’s how it works:

  1. Iterate through the intersecting slots of the current slot.

    • For each intersecting slot, determine the constraint that it imposes on the current slot.

  2. Return the set of words that meet these constraints:

    • The word meets the constraints imposed by the intersecting slots.

    • The word matches the current filter.

Consider this grid from before:

+---+---+---+---+
|   |   |   | Z |
+---+---+---+---+
|   |   |   | E |
+---+---+---+---+
|   |   |   | R |
+---+---+---+---+
| W | O | R |   | < current slot
+---+---+---+---+
  ^ current cell

Suppose that the current slot is 4-Across, and the current cell is the bottom-left cell. Then, the lookahead-based algorithm checks the filter for 4-Down. The algorithm sees that the filter constrains the bottom-right cell to the letter O. This makes the current filter be WORO, which doesn’t match any word. So, the lookahead-based algorithm returns the empty set.

Problem solved?

This approach solves the problem of our word suggestion algorithm producing different outputs for the same slot, based on the cursor’s position. Now, the current cell has no impact on the output.

This approach also significantly reduces the amount of dead end words in the output. However, it does not eliminate them altogether.

Still broken

Consider this grid:

+---+---+---+---+
|   |   |   | N |
+---+---+---+---+
| Q | U | I |   |
+---+---+---+---+
|   |   |   | X |
+---+---+---+---+
| W | E | S |   | < current slot
+---+---+---+---+

2-Across starts with QUI, so the only word it can be is QUIZ. This makes the filter for 4-Down be NZX?. No word matches this filter, and so, 4-Down is unfillable. This means that 4-Across is also unfillable, because it intersects with 4-Down.

Now, suppose that the current slot is 4-Across. Then, the lookahead-based algorithm (as currently defined) does not detect that the current slot is unfillable.

The algorithm sees that the filter for 4-Down is N?X?. But it does not see that 2-Across forces 4-Down to be NZX?. So, the algorithm thinks that 4-Down can be NEXT, and that 4-Across can be WEST—which is incorrect.

This failure is because the lookahead-based algorithm checks the intersecting slots of the current slot—but it does not check the intersecting slots of the intersecting slots of the current slot. In other words, the lookahead does not go deep enough to discover that the current slot unfillable.

Additional levels of lookahead

To handle this grid properly, we need to add another level of lookahead. To do this, we make the algorithm recursively perform the lookahead process on all the slots from the first level of lookahead. This gives the algorithm a more accurate set of possible words for the intersecting slots of the current slot. That, in turn, lets the algorithm more accurately constrain the current slot.

For each additional level of lookahead that we want to add, we increase the recursion depth by one.

Level

Constraints considered

0

Current filter.

1

Current filter, intersecting filters of the current slot.

2

Current filter, intersecting filters of the current slot, intersecting filters of the intersecting slots of the current slot.

The lookahead-based algorithm with two levels does handle the example grid properly. But it’s clear that we can always create a grid that trips up the algorithm—unless the algorithm has enough levels to visit every slot on the grid.

Unsolveable?

Consider the following grid. Every slot is unfillable. This is because 4-Across (ZXCVBN?) is unfillable, and that “unfillability” propagates to every other slot.

+---+---+---+---+---+---+---+
|   | U | A | L | I | T |   |
+---+---+---+---+---+---+---+
| U | ■ | ■ | ■ | ■ | ■ | O |
+---+---+---+---+---+---+---+
| I | ■ | A | X |   | ■ | G |
+---+---+---+---+---+---+---+
| E | ■ | ■ | ■ | B | ■ | U |
+---+---+---+---+---+---+---+
|   | H | U | M |   | ■ | R |
+---+---+---+---+---+---+---+
| ■ | ■ | ■ | ■ | ■ | ■ | T |
+---+---+---+---+---+---+---+
| Z | X | C | V | B | N |   |
+---+---+---+---+---+---+---+

Now, suppose the current slot is 2-Across (AX?). Then, our algorithm needs at least six levels of lookahead to properly detect that the grid is unfillable. Otherwise, it thinks that the grid can be filled with AXE, EBB, THUMB, QUIET, QUALITY, and YOGURTS (starting with 2-Across and spiralling out).

However, it’s not practical to add more than one or two levels of lookahead. The number of intersections that the algorithm checks is roughly \(s^n\), where \(s\) is the average slot size, and \(n\) is the number of levels.

So practically, the limit is one or two levels. This means that we cannot completely prevent dead end words with the lookahead-based algorithm.

AC-3-based algorithm

In order for our word suggestion algorithm to visit all the slots on the grid—and thus eliminate all dead end words—we need to use a different algorithm entirely. For this, we look to the field of constraint satisfaction problems (CSPs).

Express grid filling as a CSP

For a given grid and word list, we can express the problem, Find a valid grid fill for this grid, as a CSP. What follows is one possible CSP model for this problem, though other models exist.

Let \(V\) be the set of variables for the CSP, and

\[ V = \{a_1, a_2, \dots, a_m\} \cup \{d_1, d_2, \dots, d_n\}, \]

where \(a_i\) is the word in the \(i\)-Across slot and \(m\) is the number of across slots in the grid; and \(d_j\) is the word in the \(j\)-Down slot and \(n\) is the number of down slots in the grid. So, there is one variable for each slot, and each variable is set to the chosen word for that slot.

Let \(D\) be the set of domains for the CSP, and

\[ D = \{D_v \mid v \in V\}. \]

Let \(F_v\) be the set of words that match the filter for the slot that \(v\) represents, and that are in the word list, and let

\[ D_v = F_v. \]

In other words, the domain of a variable is the set of word-list words that match the variable’s slot’s filter.

Let \(I\) be the set of intersection constraints for the CSP, and

\[ I = \{I_{v_1v_2} \mid v_1, v_2 \in V \land \text{The slots for $v_1$ and $v_2$ intersect each other} \land \text{The slot for $v_1$ appears before the slot for $v_2$ in the clue list}\}, \]

where

\[\begin{split} \begin{aligned} I_{v_1v_2} ={} &\text{The values of $v_1$ and $v_2$ have the same letter} \\ &\text{at the offsets where their slots intersect}. \end{aligned} \end{split}\]

Let \(Q\) be the set of unique word constraints, and

\[ Q = \{Q_{v_1v_2} \mid v_1, v_2 \in V \land v_1 \neq v_2 \land \text{The slots for $v_1$ and $v_2$ have the same length} \land \text{The slot for $v_1$ appears before the slot for $v_2$ in the clue list}\}, \]

where

\[ Q_{v_1v_2} = \text{The value of } v_1 \neq \text{the value of } v_2. \]

Let \(C\) be the set of constraints for the CSP, and

\[ C = I \cup Q. \]

Solution to the CSP

A solution to this CSP is a valid grid fill for the given grid and word list. Suppose the CSP has a standard word list, and the following grid:

+---+---+---+---+
| ■ | ■ | ■ | N |
+---+---+---+---+
| T | I | M |   |
+---+---+---+---+
| ■ | ■ | ■ | X |
+---+---+---+---+
| W | E | S |   |
+---+---+---+---+

Then, the unique solution is

\[\begin{align*} a_1 &= \text{TIME} \\ a_2 &= \text{WEST} \\ d_1 &= \text{NEXT}.\\ \end{align*}\]

Arc consistency

However, our goal is not to find a solution to the CSP. Our goal is to get the set of possible words for a slot.

To do this, we make every variable arc consistent with every other variable. To do this, we can use the AC-3 algorithm.

Resources

Learning resources

Resources for learning about CSPs and the AC-3 algorithm:

Other resources

Miscellaneous resources related to CSPs and crosswords:

Algorithms comparison

The following is a comparison between the lookahead-based algorithm, and the AC-3-based algorithm. Assume that the lookahead-based algorithm uses a single level of lookahead.

Dead end words

The AC-3-based algorithm eliminates all dead end words. The lookahead-based algorithm does not eliminate all dead end words.

However, the lookahead-based algorithm does eliminate most of the dead end words. The reasoning for this is:

  1. The lookahead-based algorithm accounts for the slots closest to the current slot.

  2. The slots closest to the current slot impose most constraints on the current slot.

  3. Therefore, the lookahead-based algorithm eliminates most of the dead end words.

Closer slots impose more constraints

Closer slots impose more constraints on the current slot, because they are fewer steps removed from the current slot.

Suppose \(i_1\) is an intersecting slot of the current slot. Then, \(i_1\) has a large influence on the current slot, because \(i_1\) directly constrains the character set for one of the current slot’s cells.

Now, suppose \(i_2\) is an intersecting slot of \(i_1\). Then, the only way that \(i_2\) can constrain the current slot is by constraining \(i_1\). And the only way it can constrain \(i_1\) is by constraining the character set for one of \(i_1\)’s cells. So, \(i_2\) must constrain one of \(i_1\)’s cells to the point that \(i_1\) becomes constrained enough to add an additional constraint on the current slot.

And for a slot \(i_3\) that intersects \(i_2\), the constraining effect is another order of magnitude smaller, and so on, for each additional step removed a slot is.

So, the impact that a slot has on the current slot grows weaker, the further away it is. This means that the lookahead-based algorithm eliminates most of the dead end words.

Crossword aids

Here are some crossword aids that would be good to implement:

Crossword aid

Description

Fillability indicator

An indicator for the fillability of the grid. If at least one slot is unfillable, then the grid is unfillable.

Cell constraint heat map

A heat map overlaid on top of the grid that indicates how constrained each cell is. A cell that can be any letter from A to Z is completely unconstrained. A cell that can only be a single letter is maximally constrained.

Most-constrained-slot button

A button that moves the cursor to the most constrained slot. The most constrained slot is the slot that has the smallest set of possible words—and that is not fully filled.

The AC-3-based algorithm functions on the grid level. Its input is the entire grid, and its output is the set of possible words for every slot on the grid. This makes it well suited for implementing the crossword aids.

The lookahead-based algorithm functions on the slot level. In order to implement crossword aids with it, we need to turn the algorithm into a grid-level algorithm, like this:

For each slot in the grid:
  Run the lookahead-based algorithm

However, the crossword aids are less accurate with the lookahead-based algorithm. This is because the algorithm’s outputs may contain dead end words, which pollute the data that the crossword aids rely on. The more dead end words there are, the less accurate the crossword aids become.

Speed

The lookahead-based algorithm takes \(n\) times longer to run than our current word suggestion algorithm, where \(n\) is the average slot size.

Our current algorithm takes less than 16 ms to run (because we are able to hit 60 fps). So, even for the maximum slot size of 21 (extra-large grid), the lookahead-based algorithm runs in under a second.

Implementation difficulty

  • The lookahead-based algorithm is easy to implement.

  • The AC-3-based algorithm is difficult to implement.

Which algorithm?

We should implement the lookahead-based algorithm. It significantly reduces dead end words, it runs quickly, and it’s easy to implement.

Prior Art

The following is true about most crossword editors:

  • They use a grid-level word suggestion algorithm.

  • It takes a while for the word suggestion list to populate.

  • They have a fillability indicator.

  • They have a cell constraint heat map.

  • They have a most-constrained-slot button.

Open source editors:

Implementation

The following is a possible implementation for the lookahead-based word suggestion algorithm. It runs the intersection function on each cell of the current slot, and then returns the set intersection of all the outputs.

lookahead_based_algorithm (current_slot)
{
  final_words_set = NULL;

  for (cell : current_slot)
    {
      if (cell != "")
        continue;

      intersecting_slot = get_int_slot (current_slot, cell);
      words_set = intersection_func (current_slot, intersecting_slot);

      if (final_words_set == NULL)
        final_words_set = words_set;
      else
        final_words_set = (final_words_set) ∩ (words_set);

      if (final_words_set == {})
        return final_words_set;
    }

  return final_words_set;
}

Must be asynchronous

This algorithm runs the intersection function \(n\) times, where \(n\) is the size of the current slot. The intersection function itself takes a few milliseconds to run. This means that we cannot hit our target of 16 ms per frame, if we run the lookahead-based algorithm in the main thread. So, we must implement the algorithm asynchronously.

Next steps

  1. Finish the WIP parts of this design doc.

  2. Implement the lookahead-based algorithm synchronously.

  3. Discuss and decide next steps:

    • Implement the lookahead-based algorithm asynchronously?

    • Implement AC-3-based algorithm?

    • Hybrid approach?

Open questions

Areas for improvement