Apple Game Difficulty Analysis

Published March 22, 2026

Summary

We analyzed the boards of 500 rounds from the original Apple Game to find a simple metric that estimates board difficulty. The total sum of the board's numbers predicts difficulty best (r = -0.751), and we use it to assign a 6-level difficulty rating from ☆☆☆☆☆ to ★★★★★.

1. Overview

How can we determine the difficulty of a generated board? Computing the maximum possible score for a given board is a problem close to NP-hard, so calculating exact difficulty scores in bulk is practically impossible. Moreover, a high maximum score does not guarantee that the board feels easy to play: there are certainly edge cases where only an unusual opening move makes that score reachable.

Nevertheless, we wanted to find a way to roughly estimate difficulty in polynomial time, so we analyzed the board data from 500 rounds of the original game.

We assumed that the difficulty players actually perceive is reflected in the score a greedy solver obtains, and compared 10 metrics that are cheap to compute.

2. Difficulty Measurement Method

To measure difficulty, we used a simple greedy algorithm. It behaves like a player who has excellent eyesight and very fast hands but barely thinks ahead. Specifically:

  1. Find all rectangular regions on the board whose numbers sum to 10
  2. Among them, pick the region that contains the largest number (ties are broken in favor of fewer cells)
  3. Remove the numbers in the selected region
  4. Repeat until no more valid combinations remain

In short, at every step it removes the largest number that can currently be cleared.
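The four steps above can be sketched roughly as follows. This is a minimal illustration, not the solver used for the analysis: the names `find_best_rect` and `greedy_score` are hypothetical, removed cells are assumed to be stored as 0, rectangles are assumed to be allowed to span empty cells, and the score is assumed to be the number of cells removed.

```python
from typing import List, Optional, Tuple

Board = List[List[int]]           # assumption: 0 marks an already-removed cell
Rect = Tuple[int, int, int, int]  # (top, left, bottom, right), inclusive


def find_best_rect(board: Board) -> Optional[Rect]:
    """Return the sum-10 rectangle with the largest digit, then the fewest cells."""
    rows, cols = len(board), len(board[0])
    best_key, best_rect = None, None
    for top in range(rows):
        # Per-column aggregates over rows top..bottom, updated as bottom grows.
        col_sum = [0] * cols
        col_max = [0] * cols
        col_cnt = [0] * cols
        for bottom in range(top, rows):
            for c in range(cols):
                v = board[bottom][c]
                col_sum[c] += v
                col_max[c] = max(col_max[c], v)
                col_cnt[c] += 1 if v else 0
            for left in range(cols):
                total = largest = count = 0
                for right in range(left, cols):
                    total += col_sum[right]
                    largest = max(largest, col_max[right])
                    count += col_cnt[right]
                    if total > 10:
                        break  # values are non-negative, so wider is never valid
                    if total == 10:
                        key = (-largest, count)  # largest digit first, then fewer cells
                        if best_key is None or key < best_key:
                            best_key, best_rect = key, (top, left, bottom, right)
    return best_rect


def greedy_score(board: Board) -> int:
    """Repeatedly clear the preferred rectangle; return the number of cells removed."""
    board = [row[:] for row in board]  # work on a copy
    score = 0
    while (rect := find_best_rect(board)) is not None:
        top, left, bottom, right = rect
        for r in range(top, bottom + 1):
            for c in range(left, right + 1):
                if board[r][c]:
                    board[r][c] = 0
                    score += 1
    return score
```

Each cleared rectangle contains at least two non-empty cells, so the loop always terminates. The search is a brute-force scan over all rectangles, which is adequate for a 170-cell board.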

3. Difficulty Measurement Results

Figure 1. Score Distribution

Scores range from 69 to 149, with a mean of 113.1 and a standard deviation of 15.4.

4. Difficulty Estimation Metrics

Since running the greedy solver on every generated board is impractical, we looked for simpler metrics that can estimate difficulty.

4.1 Board Sum

A higher sum means the board contains more large numbers, and large numbers are harder to pair. A higher sum therefore tends to produce lower scores.

Figure 2. Board Sum vs Score (r = -0.751)

Metric                               Value
Correlation coefficient (r)          -0.751
Coefficient of determination (r²)    0.564
Regression formula                   score ≈ -0.33 × sum + 397
Mean score                           113.1
Score standard deviation             15.4

With r² = 0.564, about 56% of score variance can be explained by the board sum alone.
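As a quick worked example, the regression line and the r² figure can be reproduced from the reported values. The constant names and `predict_score` are hypothetical. The slope and intercept are the rounded values from the table, so predictions are only approximate.

```python
# Rounded values as reported in the table above.
R = -0.751
SLOPE = -0.33
INTERCEPT = 397.0


def predict_score(board_sum: int) -> float:
    """Estimate the greedy score from the board sum using the reported line."""
    return SLOPE * board_sum + INTERCEPT


# Share of score variance explained by the board sum alone.
r_squared = R ** 2

print(round(r_squared, 3))              # 0.564
print(round(predict_score(850), 1))     # 116.5
```

A board with sum 850 is therefore expected to score about 116.5. Individual boards can deviate substantially, since roughly 44% of the variance is left unexplained.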

4.2 Correlation by Digit

For each digit, we checked whether a higher count of that digit is associated with higher or lower scores.

Figure 3. Correlation Between Digit Count and Score

More small numbers (1, 2, 3) lead to higher scores, while more large numbers (7, 8, 9) lead to lower scores. 9 is the most constrained digit because it can only pair with 1, although its count (r = -0.309) correlates less strongly with score than the count of 8 (r = -0.392).
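A per-digit correlation of this kind can be computed along the following lines. This is a sketch under assumptions: the function names are hypothetical, `boards` is assumed to be a list of 2D integer grids, and `scores` is assumed to hold the greedy score of each board in the same order.

```python
from math import sqrt
from typing import Dict, List, Sequence

Board = List[List[int]]


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient; NaN if either input has no variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / sqrt(var_x * var_y)


def digit_count(board: Board, digit: int) -> int:
    """Number of cells on the board equal to the given digit."""
    return sum(row.count(digit) for row in board)


def digit_score_correlations(
    boards: Sequence[Board], scores: Sequence[float]
) -> Dict[int, float]:
    """Correlation between the count of each digit 1..9 and the score."""
    return {
        d: pearson_r([digit_count(b, d) for b in boards], scores)
        for d in range(1, 10)
    }
```

A positive value for a digit means boards with more of that digit tended to score higher. A negative value means the opposite.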

4.3 Comparison with Other Metrics

We also investigated other metrics that might predict difficulty beyond the board sum.

Metric           Description                                           r        |r|
Board Sum        Sum of all 170 cells                                  -0.751   0.751
High/Low Ratio   Large numbers / small numbers ratio                   -0.718   0.718
Low (1+2+3)      Combined count of 1, 2, 3                             +0.639   0.639
High (7+8+9)     Combined count of 7, 8, 9                             -0.617   0.617
Count of 1       Count of 1s                                           +0.418   0.418
Count of 9       Count of 9s                                           -0.309   0.309
Adjacent Pairs   Number of adjacent cell pairs summing to 10           +0.306   0.306
Min(#9, #1)      Smaller of count of 9s and 1s (available 9+1 pairs)   +0.112   0.112
Digit Count SD   Standard deviation of digit counts (skewness)         -0.066   0.066
Count of 5       Count of 5s                                           +0.056   0.056

Figure 4. Comparison of Score Prediction Power by Metric

Board Sum has the highest |r| at 0.751. This makes sense, since the sum is in effect a weighted sum of all digit counts. Among the individual digit counts in the table, the count of 1 has the highest |r| (0.418), ahead of the count of 9 (0.309).

An interesting finding is that the number of adjacent pairs has a relatively low correlation (r = +0.306). Although many adjacent pairs would seem to make the board easier, in practice which numbers are on the board matters more for the score than how many pairs are adjacent.
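Several of the metrics in the table are cheap to derive from a board in a single pass. The sketch below shows one possible reading of them. The exact definitions are assumptions: the High/Low Ratio is taken as the count of 7 to 9 divided by the count of 1 to 3, adjacency is taken as horizontal and vertical neighbors only, and the standard deviation is the population form. The function name `board_metrics` and its dictionary keys are hypothetical.

```python
from statistics import pstdev
from typing import Dict, List

Board = List[List[int]]


def board_metrics(board: Board) -> Dict[str, float]:
    """Compute simple difficulty metrics for one board (assumed definitions)."""
    cells = [v for row in board for v in row]
    counts = {d: cells.count(d) for d in range(1, 10)}
    low = counts[1] + counts[2] + counts[3]
    high = counts[7] + counts[8] + counts[9]

    # Count horizontally and vertically adjacent pairs that sum to 10.
    rows, cols = len(board), len(board[0])
    adjacent = 0
    for r in range(rows):
        for c in range(cols):
            if c + 1 < cols and board[r][c] + board[r][c + 1] == 10:
                adjacent += 1
            if r + 1 < rows and board[r][c] + board[r + 1][c] == 10:
                adjacent += 1

    return {
        "board_sum": sum(cells),
        "high_low_ratio": high / low if low else float("inf"),
        "low_count": low,
        "high_count": high,
        "adjacent_pairs": adjacent,
        "min_9_1": min(counts[9], counts[1]),
        "digit_count_sd": pstdev(list(counts.values())),
    }
```

All of these run in time linear in the number of cells, which is what makes them attractive as substitutes for running a solver.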

Figure 5. Score Scatter Plots for Top 4 Metrics

5. Estimated Difficulty Rating Method

We assigned a 6-level difficulty rating based on the board sum ranges.

Figure 6. Sum Distribution and Grade Boundaries

Grade   Name     Sum Range   Boards   Avg Score   Min   Max
0       ☆☆☆☆☆    0–799       23       132.8       115   148
1       ★☆☆☆☆    800–819     49       128.8       108   149
2       ★★☆☆☆    820–839     102      122.6       96    149
3       ★★★☆☆    840–859     97       115.9       79    140
4       ★★★★☆    860–879     104      107.2       77    133
5       ★★★★★    880+        125      98.2        69    123
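The sum ranges in the table translate directly into a lookup. The following is a minimal sketch, with hypothetical names `GRADE_LOWER_BOUNDS`, `estimated_grade`, and `grade_stars`:

```python
from bisect import bisect_right

# Lower bound of the sum range for grades 1..5; sums below 800 are grade 0.
GRADE_LOWER_BOUNDS = [800, 820, 840, 860, 880]


def estimated_grade(board_sum: int) -> int:
    """Map a board sum to a grade from 0 (easiest) to 5 (hardest)."""
    return bisect_right(GRADE_LOWER_BOUNDS, board_sum)


def grade_stars(grade: int) -> str:
    """Render a grade as five stars, filled ones first."""
    return "★" * grade + "☆" * (5 - grade)


print(grade_stars(estimated_grade(845)))  # ★★★☆☆
```

Because the boundaries are sorted, `bisect_right` counts how many lower bounds the sum has reached, which is exactly the grade number.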

Figure 7. Average Score by Grade

Average scores clearly decrease as the grade increases. The gap in average score between ☆☆☆☆☆ and ★★★★★ is about 35 points (132.8 vs 98.2).

6. Conclusion

After solving 500 original boards with a greedy solver, we found that board difficulty correlates strongly with the distribution of numbers on the board. In particular, the total sum of board numbers alone gives a reasonable estimate of difficulty. We also confirmed the following findings:

  • The board sum alone explains about 56% of score variance. A larger sum means more large numbers, making it harder to form pairs.
  • Among the large digits, 9 is the most constrained, because it can only pair with 1. Conversely, a higher count of 1 makes the board easier (r = +0.418), since 1 not only pairs with 9 but can also flexibly form combinations with other numbers.
  • On the other hand, metrics such as the number of adjacent pairs (r = +0.306) or the count of 5 (r = +0.056) have lower predictive power than intuition would suggest.
  • Interestingly, the count of 8 shows a stronger negative correlation with score than 9 (r = -0.392 vs -0.309). Since this solver processes large numbers first, 9 gets paired quickly with 1 while the board is still dense. However, 8 only gets its turn after all 9s are processed, and by then the board has many empty cells, making it harder to find valid rectangles. A strategy that processes 8 first might yield different results, which remains a topic for future research.

As a result, we assigned a 6-level "estimated difficulty" rating from ☆☆☆☆☆ to ★★★★★ based on the board sum. This allows us to present a reasonably convincing difficulty rating with very little computation. Feedback from actual players has been mostly positive regarding the difficulty grades, although there were occasional comments about boards feeling too hard despite a low star rating.