What’s the Best Initial Guess for Wordle?

Recently, a daily word game called Wordle has gone viral on Twitter. The rules are simple: guess a five-letter word in six tries. Some of the hints shown after guessing a word are displayed in the figure below.

I found the game interesting and challenging, since there are hundreds of thousands of words in English and it is not my native language. Luckily, I know Python! I calculated the statistics of letter occurrences in English and made an informed initial guess.


First things first, I loaded all the words from a .txt file provided on GitHub. Since the quiz only considers five-letter words, we eliminate all words of any other length. Some of the remaining words might not be in the quiz dictionary, but I just let that be.
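As a minimal sketch of that loading step (the filename and helper name here are my own assumptions; point it at wherever you saved the GitHub list):

```python
def load_five_letter_words(path):
    """Read one word per line; keep only five-letter alphabetic words."""
    with open(path) as f:
        words = [line.strip().lower() for line in f if line.strip()]
    return [word for word in words if len(word) == 5 and word.isalpha()]
```

With the list downloaded, `words_5_letters = load_five_letter_words("words.txt")` gives the pool of words the rest of the post works from.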

Next, we calculate how often each letter of the alphabet occurs in the five-letter words. This can be done easily with a Python dictionary: iterate through all the words and letters, count them, and normalize the data. The code is shown below.

import string

# Count how many times each letter appears across all five-letter words.
letter_count = dict.fromkeys(string.ascii_lowercase, 0)

for word in words_5_letters:
  for letter in word:
    letter_count[letter] += 1

# Normalize the counts into probabilities.
total_count = sum(letter_count.values())
letter_count_normalized = {key: value / total_count
                           for key, value in letter_count.items()}

# Sort the letters from most to least frequent.
sorted_letter_prob = {k: v for k, v in sorted(letter_count_normalized.items(),
                                              key=lambda item: item[1],
                                              reverse=True)}
Statistics 1: English Letter Probability in 5 Letters Words

The result shows that more than 10% of the letters are ‘a’, with ‘e’ trailing at almost 10% and ‘s’ at slightly above 8%. This differs slightly from the distribution over words of any length, shown in the image below. With these statistics, we can be confident that those letters should appear in our initial guess.

Frequency Table
English Letter Frequency (source)

For another statistic, let’s count the occurrence probability of each letter at each position in a word. Using the Pandas library, the code is shown below.

import pandas as pd

# One row per letter, one column per position (1-5).
# A fresh list per key avoids every row sharing the same list object.
data = {letter: [0, 0, 0, 0, 0] for letter in string.ascii_lowercase}
df = pd.DataFrame.from_dict(data, orient='index')
df.columns = [1, 2, 3, 4, 5]

# Count letter occurrences per position.
for word in words_5_letters:
  for count, letter in enumerate(word):
    df.loc[letter, count + 1] += 1

# Normalize each position into a probability distribution over letters.
df_transposed = df.transpose()
df_normalized = df_transposed.div(df_transposed.sum(axis=1), axis=0)

The result is shown in the figure below. One can see that the most common letters for the first to fifth positions in a word are ‘s’, ‘a’, ‘r’, ‘e’, and ‘s’. This hints at a good first word to guess. However, since using ‘s’ twice is not efficient, we can substitute the first letter with the next most frequent letter at that position, ‘c’.

Statistics 2: Letter Occurrence Probability in the Words

Hold up! What about Statistics 1? Yes! We should consider it as well, so let us do the calculation. Suppose we treat Statistics 1 and 2 as equally important; we can then score the most probable word by averaging the two criteria. The function to calculate the score is written below.

def count_score(word):
  """Average of the overall letter frequency (Statistics 1) and the
  positional letter frequency (Statistics 2), scaled to a percentage."""
  count_crit_1 = 0
  count_crit_2 = 0

  for count, letter in enumerate(word):
    # Both criteria are already on the same 0-1 scale, so a plain
    # average weights them equally.
    count_crit_1 += sorted_letter_prob[letter]
    count_crit_2 += df_normalized.iloc[count][letter]
  return (count_crit_1 + count_crit_2) / 2 * 100

Making an Informed Guess

To make an initial guess, let us iterate through all the five-letter words and see which one has the highest score. Note that we should not include words with repeated letters, since those are not efficient guesses. The code is shown below.

def letter_is_not_doubled(check_string):
  """Return True if no letter appears more than once in the string."""
  count = {}
  for s in check_string:
    if s in count:
      count[s] += 1
    else:
      count[s] = 1
  return all(value == 1 for value in count.values())

words_score = {}
for word in words_5_letters:
  if letter_is_not_doubled(word):
    words_score[word] = count_score(word)

sorted_words_score = {k: v for k, v in sorted(words_score.items(),
                                              key=lambda item: item[1],
                                              reverse=True)}

From this calculation, we find that the highest-scoring word is ‘tares’. Thus, we can use this word as our first guess, an informed guess!

Once you play the quiz, you will notice that one guess is rarely enough, so we need another. For our next guess, we do not want to include letters that already appeared in the first guess. Let us define a function to filter out the letters we want to exclude and calculate the scores again.

def not_contain_this_letter(word, not_contain):
    """Return True if none of the letters in `not_contain` appear in `word`."""
    return all(letter not in word for letter in not_contain)

# Exclude every letter used in the first guess.
not_contain = 'tares'
word_guess = [word for word in words_5_letters
              if not_contain_this_letter(word, not_contain)]

words_score_2 = {}
for word in word_guess:
    if letter_is_not_doubled(word):
        words_score_2[word] = count_score(word)

sorted_words_score_2 = {k: v for k, v in sorted(words_score_2.items(),
                                                key=lambda item: item[1],
                                                reverse=True)}

From this filtering, we find that the best word for the second guess is ‘colin’! Using the same technique, excluding the letters of both ‘tares’ and ‘colin’, we find that the best third guess, in case two are not enough, is ‘bumpy’.

There you go! Use ‘tares’ as your initial guess, followed by ‘colin’ for the second one, and ‘bumpy’ in case you need a third.

Now you can play Wordle with a statistically informed initial guess. Good luck!

Magnetometer Calibration Using Levenberg-Marquardt Algorithm

Recently, I worked on a magnetometer calibration method. This method is based on the Levenberg-Marquardt algorithm (LMA), a non-linear least-squares optimization algorithm. The method is implemented in ArduPilot and PX4, both open-source flight controller firmware.

I have to admit, reconstructing the mathematical formulation from code is not straightforward. I spent several days learning LMA from the basics before finally understanding the sphere-fit and ellipsoid-fit algorithms.

In case you are wondering about the mathematical part, I wrote up the formulation of the algorithm as a PDF, since it can’t be viewed on WordPress (unless I pay more for the plugin).

Click here for the document.

Python vs Julia: Speed Test on Fibonacci Sequence

Recently, MIT released a course on Computational Thinking, course code 18.S191, and it is available on YouTube. I can code in C++ and Python, so the creators’ claim that Julia is as fast as C and as easy as Python caught my interest.

Introduction to Julia

Julia was created in 2009 and first introduced to the public in 2012. The developers aimed at scientific computing, machine learning, data mining, and large-scale linear algebra. We might know these applications from Python, but Julia offers the programmer advantages over Python.
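To give a taste of the kind of micro-benchmark the speed test uses (a sketch in Python under my own assumptions; the post compares it against the equivalent Julia version), here is the naive recursive Fibonacci, a deliberately CPU-bound workload:

```python
import time

def fib(n):
    """Naive recursive Fibonacci: exponential time, ideal for speed tests."""
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)

start = time.perf_counter()
result = fib(25)
elapsed = time.perf_counter() - start
print(f"fib(25) = {result} in {elapsed:.3f} s")  # fib(25) = 75025
```

The same function in Julia is nearly line-for-line identical, which is what makes it a clean apples-to-apples timing comparison.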

Continue reading Python vs Julia: Speed Test on Fibonacci Sequence

Humanitarian Robotics: Autonomous Landmine Detection Rover

Although the wars themselves are over, their dangerous legacy is still tangible today. Landmines are among the threats left behind, killing 15,000–20,000 people every year according to the UN Mine Action Service. Demining efforts cost US$300–1,000 per mine and endanger the people doing the work, with one person killed and two injured for every 5,000 mines cleared.

HRATC 2017

Robots can be really helpful in solving this problem, as they are designed to do the “dull, dirty, dangerous, and difficult” tasks. In 2017, the IEEE Robotics and Automation Society’s Special Interest Group on Humanitarian Technology (RAS–SIGHT) held a competition: the Humanitarian Robotics and Automation Technology Challenge (HRATC), held at the 2017 International Conference on Robotics and Automation (ICRA’17).

Autonomous Landmine Detection Rover
Continue reading Humanitarian Robotics: Autonomous Landmine Detection Rover

Text Extraction from a Table Image, using PyTesseract and OpenCV

Extracting text from an image can be exhausting, especially when there is a lot to extract. One commonly known text extraction library is PyTesseract, an optical character recognition (OCR) tool. This library takes an image and gives you back its text.

PyTesseract is really helpful. The first time I learned about PyTesseract, I used it directly to detect some short text, and the result was satisfying. Then I used it to detect text from a table, and the algorithm failed to perform.

Figure 1. Direct use of PyTesseract to Detect Text in a Table

Figure 1 depicts the text detection result, with green boxes enclosing the detected words. You may notice that most of the text can’t be detected by the algorithm, especially the numbers. In my case, these numbers are the essential part of the data, giving me the daily COVID-19 case counts from a local government in my hometown. So, how do we extract this information?

Getting Started

When writing an algorithm, I always try to think as if I’m teaching the algorithm the way humans do. This way, I can easily put the idea into more detailed algorithms.

When you’re reading a table, the first thing you might notice is the cells. A cell is separated from its neighbors by borders (lines), which can be vertical or horizontal. After you identify a cell, you proceed to read the information within. Converting this into an algorithm, you may divide the work into three processes: cell detection, region of interest (ROI) selection, and text extraction.
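To make the first step concrete, here is a minimal numpy-only sketch of border detection on a binarized table image: rows and columns that are almost entirely ink are taken as grid lines. This is an illustrative stand-in for the morphological-kernel approach one would use with OpenCV; the function name and threshold are my own assumptions.

```python
import numpy as np

def find_grid_lines(binary, threshold=0.9):
    """Given a binary image (1 = ink, 0 = background), return the row and
    column indices that are mostly ink, i.e. the table's border lines."""
    row_fill = binary.mean(axis=1)   # fraction of ink per row
    col_fill = binary.mean(axis=0)   # fraction of ink per column
    h_lines = np.where(row_fill >= threshold)[0]
    v_lines = np.where(col_fill >= threshold)[0]
    return h_lines, v_lines

# Tiny synthetic "table": a 7x7 image with borders at rows/cols 0, 3, 6.
img = np.zeros((7, 7), dtype=float)
img[[0, 3, 6], :] = 1.0   # horizontal lines
img[:, [0, 3, 6]] = 1.0   # vertical lines

h, v = find_grid_lines(img)
print(h.tolist(), v.tolist())  # → [0, 3, 6] [0, 3, 6]
```

The cells (and hence the ROIs to hand to the OCR) are then the rectangles between consecutive horizontal and vertical lines.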

Continue reading Text Extraction from a Table Image, using PyTesseract and OpenCV

Monte Carlo Localization

Monte Carlo localization (MCL) is also known as particle filter localization. Given a map of an environment, the algorithm estimates the position and orientation of a robot as it moves and senses its surroundings. The algorithm uses particles to represent the distribution of likely states, with each particle representing a possible state.

The algorithm starts by distributing the particles over the configuration space. It then loops over its three main parts: motion update, sensor update, and resampling. During the motion update, particles are shifted toward new possible states. Whenever the robot senses the environment, the particles are reweighted using recursive Bayesian estimation. Lastly, particles with higher likelihood tend to survive the resampling process.
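Those three steps can be sketched in a 1-D toy example (my own simplification: the "sensor" here measures the robot's position directly with noise, rather than ranging against a real map):

```python
import numpy as np

rng = np.random.default_rng(0)

true_pos = 2.0                                   # robot's actual position
particles = rng.uniform(0.0, 10.0, size=2000)    # initial belief: uniform

for step in range(10):
    # Motion update: robot (and every particle) moves +0.5 with noise.
    true_pos += 0.5
    particles += 0.5 + rng.normal(0.0, 0.05, size=particles.size)

    # Sensor update: weight particles by likelihood of the measurement.
    z = true_pos + rng.normal(0.0, 0.1)
    weights = np.exp(-0.5 * ((particles - z) / 0.1) ** 2)
    weights /= weights.sum()

    # Resampling: higher-weight particles tend to survive.
    particles = rng.choice(particles, size=particles.size, p=weights)

estimate = particles.mean()
print(f"true={true_pos:.2f} estimate={estimate:.2f}")
```

After a few iterations the particle cloud collapses around the true position, which is exactly the convergence behavior MCL relies on.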

Continue reading Monte Carlo Localization

Programming for Robotics – ROS: Exercise 3

This post is a continuation of a previous project on learning ROS. The lecture and exercise are given by the Robotic Systems Lab – ETH Zurich and can be accessed through this website.

This time, the goal of the exercise is to make the robot hit the pillar in the simulation environment. The pillar position is found by taking the closest distance from the LIDAR measurement. The speed is set constant, and a simple P controller steers Husky toward the pillar. Both the speed and the P-gain are written in a param file, making them easy to tune without rebuilding the code. A marker is created in RViz to visualize the location of the pillar.
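The core of that controller can be sketched without any ROS machinery (a toy version under my own naming; in the exercise the ranges arrive as a sensor_msgs/LaserScan and the command goes out as a geometry_msgs/Twist):

```python
def pillar_command(ranges, angle_min, angle_increment, kp=0.5, speed=1.0):
    """Find the bearing of the closest LIDAR return (the pillar) and
    return a constant forward speed plus a proportional turn command."""
    i = min(range(len(ranges)), key=lambda k: ranges[k])
    bearing = angle_min + i * angle_increment   # angle of the pillar
    return speed, kp * bearing                  # (linear.x, angular.z)

# Five beams spanning -0.4..+0.4 rad; the closest return is dead ahead,
# so the proportional turn command is zero.
ranges = [5.0, 4.0, 2.0, 4.5, 5.0]
lin, ang = pillar_command(ranges, angle_min=-0.4, angle_increment=0.2)
print(lin, ang)  # → 1.0 0.0
```

Keeping `kp` and `speed` as parameters mirrors the param-file setup: both can be tuned without touching the controller code.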

Exercise 3

The video above shows my result for the exercise. You can find the exercise paper sheet on this link.

Thanks to Robotic Systems Lab – ETH Zurich for sharing this helpful course!

Programming for Robotics – ROS: Exercise 2

This post is a continuation of a previous project on learning ROS. The lecture and exercise are given by the Robotic Systems Lab – ETH Zurich and can be accessed through this website.

This second lecture focuses on the ROS package structure, the ROS C++ client library, ROS subscribers and publishers, the ROS parameter server, and RViz visualization. I haven’t included the Eclipse integration and programming part in my work.

Exercise 2

The video above shows my result for the exercise. You can find the exercise paper sheet on this link.

Thanks to Robotic Systems Lab – ETH Zurich for sharing this helpful course!

Programming for Robotics – ROS: Exercise 1

Recently, I found an interesting website for learning ROS and Gazebo, provided by the Robotic Systems Lab – ETH Zurich. Even though I’ve done several ROS projects, including one for my internship at ProtoSpace Toulouse – Airbus, the online course still has a lot to offer about ROS.

The first lecture is an introduction to ROS: masters, nodes, topics, and launch files. The lecture file can be found at this link. At the end of the lecture, an exercise is given which requires students to control a simulation in Gazebo using keyboard input, as well as to write a launch file.

Exercise 1

The video above is a simulation of the exercise, controlled by keyboard using teleop_twist_keyboard.

Continue reading Programming for Robotics – ROS: Exercise 1

Obstacle Avoidance for Quadcopter with Ultrasonic Sensor

In 2017, I built an obstacle avoidance system for a quadcopter using an ultrasonic sensor as my undergraduate thesis project. The goal was a proof of concept of the ellipsoid-based guidance proposed in the following paper. The paper takes a fixed-wing aircraft as its model. However, since implementing it on a fixed-wing platform is extremely laborious, the implementation was done on a quadcopter.

Quadcopter System

To mimic radar sensing, I used a rotating ultrasonic sensor in this project. The sensor is rotated by a servo commanded through an Arduino microcontroller. The sensing result is estimated as an ellipsoid geometry, taking into account the size and orientation of the obstacle as well as the aircraft’s velocity vector. A contact point on the ellipsoid is then set as the guidance reference.
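As a toy illustration of the sweep geometry (function name and frame convention are my own assumptions, not the thesis code): each servo angle plus ultrasonic range reading maps to an obstacle point in the vehicle body frame, and the set of points from one sweep is what the ellipsoid is fitted to.

```python
import math

def reading_to_body_point(servo_angle_deg, distance_m):
    """Convert one sweep sample (servo angle, ultrasonic range) into an
    (x, y) obstacle point in the body frame; x points forward."""
    a = math.radians(servo_angle_deg)
    return (distance_m * math.cos(a), distance_m * math.sin(a))

# A sweep from -60° to +60° in 30° steps, all returns at 2 m.
sweep = [(-60, 2.0), (-30, 2.0), (0, 2.0), (30, 2.0), (60, 2.0)]
points = [reading_to_body_point(a, d) for a, d in sweep]
print(points[2])  # straight-ahead sample → (2.0, 0.0)
```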

Continue reading Obstacle Avoidance for Quadcopter with Ultrasonic Sensor