1 Introduction

Password guessing is a common method an attacker will deploy to compromise end users. A human password guesser will often leverage auxiliary information (the spoken language, knowledge of the website the passwords were created for, or the demographics of the users) to tailor their guesses. However, many guessing attacks are now automated, and it is important to understand the capabilities of such automated attacks. In this paper, we investigate whether an automated guessing algorithm can similarly identify such patterns and leverage them to improve its guessing.

A commonly used method for guessing passwords involves taking wordlists of password guesses and making these guesses in an order that compromises as many users as possible. Guessing order matters to an attacker because they wish to compromise many users with a small number of guesses. In particular, an online attacker typically uses automated attacks and is limited to a certain number of guesses before a lockout is triggered. The challenge for the attacker is to discover which guesses will result in the highest success rates.

In this paper, we suggest and develop an explore-and-exploit protocol based on the classic multi-armed bandit problem (MAB). This protocol can be used to guess a password set effectively using guesses drawn from a selection of wordlists. Because the model learns over a small number of guesses, we view it as a technique that could prove effective in both online and offline guessing.

We see at least three potential offensive use-cases for such a guessing model. The first is the most direct and utilises the real-time convergence of the MAB: an online guesser, guessing a selection of users' passwords from a website, will learn from each success and use it to inform the next guess made against all users. A second approach involves an attacker gathering information by applying the multi-armed bandit to an offline leaked dataset of users from a given organisation. They can use the MAB to determine the optimum choice of wordlist and then carry out a tailored (online) attack on other users from the same organisation. This has the potential to be effective since exploiting the similarity of passwords created by users of the same organisation can significantly improve guessing returns [1]. A similar method could be deployed by learning information via online guessing against a subset of accounts: the MAB guesses against these accounts, learning characteristics until they are locked out. Once the MAB has highlighted the appropriate wordlist, that wordlist can be used against other users, avoiding triggering lockouts on potentially more valuable accounts.

While offensive use cases are interesting, a more immediate application of this work is to emphasise the need for organisations to steer users away from predictable password patterns. This work demonstrates that passwords differ measurably depending on the service they were created for and that an automated password guesser can take advantage of this to significantly improve guessing success. This indicates that websites should consider blocklisting passwords in a way that is tailored to their particular subject matter and users. In particular, websites that have experienced previous password leaks could restrict future users from choosing passwords which occurred with high frequency in that leak.

The paper continues as follows: Section 2 describes related work. Section 3 describes our set-up of the multi-armed bandit learning model. Section 4 demonstrates that our multi-armed bandit can be used to identify dataset source and that leveraging this knowledge improves guessing. We also show that the multi-armed bandit can improve guessing even when it is not leveraging specific knowledge about the dataset source. Section 5 demonstrates that language and nationality characteristics can also be derived during guessing and that these can be adaptively leveraged to improve guessing success. Section 6 provides a brief summary and conclusion. Appendix A provides the analysis behind the variable choices and implementation decisions for our model set-up.

2 Related work

The most widely used guessing strategy involves combining wordlists with word-mangling rules. The success of this strategy, as implemented by Hashcat and John the Ripper, is best highlighted by the success of both teams in the annual KoreLogic “Crack Me If You Can” contest and by the widespread use of both tools [2,3,4]. In 2005, Narayanan and Shmatikov employed Markov models to enable faster guessing [5]; a Markov model predicts the next character in a sequence based on the current character. In 2009, Weir et al. used probabilistic context-free grammars (PCFG) to guess passwords [6]. PCFGs characterise a password according to its “structures”, which can be password guesses or word-mangling templates that take dictionary words as input. In 2016, Wang et al. developed a targeted password guessing model which seeds guesses using users’ personally identifiable information (PII) [7]. They leverage existing probabilistic techniques, including Markov models, PCFGs and Bayesian theory, and create tags for specific PII associated with a user. In their most successful version, TarGuess-I, they use training with these type-based PII tags to create a semantic-aware PCFG. Independently, Li et al. also created a method for seeding password guesses with personal information; their guessing also extends the probabilistic context-free grammar method [8]. Also in 2016, Melicher et al. proposed the use of artificial neural networks, computational models inspired by biological neural networks, to generate novel password guesses based on training data [9]. In 2019, Hitaj et al. proposed using deep generative adversarial networks (GANs), which pit one neural network against another in a zero-sum game, to create password guesses [10]. Their PassGAN autonomously learns the distribution of real passwords from leaked passwords and can leverage this to generate new guesses. In 2020, Pasquini et al. [11, 12] introduced the idea of “password strong locality” to describe the grouping together of passwords that share fine-grained characteristics. This locality can be leveraged to train their learning model to generate passwords similar to those already seen and to help with password guessing.

Our model differs from previous work in that prior approaches use learning techniques to generate effective candidate guesses to try against passwords. In our work, we assume lists of guesses already exist in the form of multiple wordlists; our learning technique determines which wordlist will be most effective for guessing the particular password set and how to combine guesses from multiple wordlists in order to use them effectively. In particular, instead of creating a single ordered wordlist, our method splits wordlists according to their characteristics, e.g. a German password wordlist and a wordlist leaked from LinkedIn. In this way, the wordlists whose characteristics are relevant to the passwords being guessed can be chosen and the others ignored.

We believe this is an effective way to optimise guessing because research has shown that demographics and password source play a significant role in users’ password choices. In 2012, Malone and Maher investigated passwords created for different websites and found that nationality plays a role in users’ choice of passwords [13]. They also found that passwords often follow a theme related to the service they were created for; for example, a common LinkedIn password is “LinkedIn”. This result was further investigated by Wei et al. in their 2018 paper “The Password Doesn’t Fall Far: How Service Influences Password Choice” [14].

Researchers have also studied password sets drawn from particular languages. Sishi et al. studied the strength and content of passwords derived from seven South African languages [15]. Li et al. completed a large-scale study of Chinese web passwords [16]. Weir et al. used Finnish passwords as a basis for studying the use of PCFGs in the creation of password guesses and mangling rules [6]. Dell’Amico et al. [17] included both Italian and Finnish password sets and guessed them using English, Italian and Finnish wordlists. In this work, we show that nationality plays a role in password choice even when the spoken language is the same.

3 The multi-armed bandit

The learning technique we use is based on an adaptation of the classic multi-armed bandit problem. The multi-armed bandit problem describes the trade-off a gambler faces when presented with a number of different gambling machines. Each machine provides a random reward from a probability distribution specific to that machine. The crucial problem the gambler faces is how much time to spend exploring different machines and how much time to spend exploiting the machine that seems to offer the best rewards. The objective of the gambler is to maximise the sum of rewards earned through a sequence of lever pulls.

In our scenario we consider each different wordlist as a bandit that will give a certain distribution of successes when used. In order to make effective guesses, we want to explore the returns from each wordlist and also exploit the most promising wordlist. With each guess, we learn more about the distribution of the password set that we are trying to guess. Leveraging this knowledge, we can guess using the wordlist that best matches the password set distribution, thus maximising rewards.

3.1 Password guessing: problem set-up

Suppose we have n wordlists. Each wordlist, \(i = 1 \ldots n\), has a probability distribution \(p_i\), and \(\sigma _i(k)\) denotes the position of password k in wordlist i. So, the probability assigned to password k in wordlist i is \(p_{i,\sigma _i(k)}\).
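For illustration, a wordlist's distribution \(p_i\) and ranking \(\sigma_i\) can be built directly from password frequency counts in a leaked dataset. The following is a minimal sketch in Python; the function and variable names are ours, not those of the released code [20].

```python
from collections import Counter

def build_wordlist(passwords):
    """Build (probabilities, ranks) for one wordlist from a list of
    leaked passwords (one entry per user). probabilities[k] is the
    empirical probability assigned to password k, and ranks[k] is its
    position sigma_i(k) when passwords are ordered by frequency."""
    counts = Counter(passwords)
    total = sum(counts.values())
    ordered = [word for word, _ in counts.most_common()]
    probabilities = {word: counts[word] / total for word in ordered}
    ranks = {word: r for r, word in enumerate(ordered, start=1)}
    return probabilities, ranks
```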

Suppose we make m guesses where the words guessed are \(k_j\) for \(j = 1 \ldots m\). Each of these words is guessed against the N users in the password set, and we find \(N_j\), the number of users’ passwords compromised with guess number j.

To model the password set that we are trying to guess, we suppose it has been generated by choosing passwords from our n wordlists. Let \(q_i\) be the proportion of passwords from wordlist i that was used when generating the password set. Our aim will be to estimate \(q_1, \dots , q_n\) noting that

$$\begin{aligned} \sum _{i=1}^{n} q_i = 1 \qquad \text{ and } \qquad q_i \ge 0. \end{aligned}$$
(1)

This means that the \(q_i\) are coordinates of a point in a probability simplex. If the password set was really composed from the wordlists with proportions \(q_i\), the probability of seeing password k in the password set would be

$$\begin{aligned} Q_k := \sum _{i=1}^{n} q_i p_{i,\sigma _i(k)}. \end{aligned}$$
(2)

By construction, the \(Q_k\) are between 0 and 1, and because \(\sum _i q_i \le 1\), we also have \(\sum _k Q_k \le 1\).

3.2 Maximum likelihood estimation

Given this problem set-up, we will construct a likelihood function describing how well a given set of parameters \(q_1, \dots , q_n\) explains the password set. In this section, we introduce this likelihood function and describe the method used to converge to its unique maximum.

3.2.1 Likelihood function

We construct the following likelihood for our model with m guesses:

$$\begin{aligned} \mathcal {L} = \left( {\begin{array}{c} N \\ N_1 \, \cdots \, N_m \, (N - N_1 - \cdots - N_m) \end{array}} \right) Q_{k_1}^{N_1} Q_{k_2}^{N_2} \cdots Q_{k_m}^{N_m} \left( 1 - Q_{k_1} - \cdots - Q_{k_m} \right) ^{N - N_1 - \cdots - N_m}, \end{aligned}$$
(3)

where the first term is the multinomial coefficient giving the number of possible orderings in which all N users' passwords could be successfully guessed, with \(N_1, \dots , N_m\) being the successes for the guesses we have already made and \(N - N_1 - \cdots - N_m\) the successes for the guesses we have yet to make. Each factor \(Q_{k_j}^{N_j}\) is the probability with which we expect to see password \(k_j\) in the password set, raised to the power of the number of times it was actually seen. The final term represents the remaining guesses and states that they account for the users' passwords in the password set that have not yet been compromised.

Our goal is to maximise this likelihood function by choosing good estimates for \(q_1, \ldots q_n\) based on our observed rewards from each previous guess. Note that a single password can exist in multiple wordlists, so with each guess, we learn more about \(q_i\) for all of the wordlists. In fact, one of the interesting features of this model compared to a traditional multi-armed bandit model is that one lever pull can provide information about all the bandits.

We can take the \(\log \) of the likelihood function to obtain a simpler expression. In addition, we can drop the multinomial coefficient, which is simply a constant for any value of \(\vec {Q}\). This leaves us with

$$\begin{aligned} \log \mathcal {L} = \text {const} + N_1 \log Q_{k_1} + N_2 \log Q_{k_2} + \cdots + N_m \log Q_{k_m} + (N - N_1 - \cdots - N_m) \log \left( 1 - Q_{k_1} - \cdots - Q_{k_m} \right) . \end{aligned}$$
(4)

Note that if \(1 - Q_{k_1} - \cdots - Q_{k_m} = 0\), then the probability of a password lying outside the first m guesses is 0, so the number of such passwords observed must also be 0, i.e. \(N - N_1 - \cdots - N_m = 0\). Therefore, in this case the whole term is 0 and \(\log (0)\) is not an issue. A similar argument applies to the other terms: if some \(Q_{k_j}\) were zero, then \(N_j\) would also be zero, since that password would never be seen.

In [18], we prove that the log-likelihood function, \(\log \mathcal {L}\), is concave. This means that the likelihood function has a unique maximum value [19], making it a good candidate for numerical optimisation. We will use gradient descent to find the \(q_i\) that maximise \(\mathcal {L}\) after m guesses subject to the constraints (Eq. 1).
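As a concrete sketch of how Eq. 4 and its gradient with respect to \(q_1, \dots , q_n\) might be computed, consider the following; it assumes each wordlist is represented as a dictionary mapping passwords to probabilities (a password absent from a wordlist gets probability 0), and the function name is our own rather than part of the released code [20].

```python
import numpy as np

def log_likelihood_and_grad(q, wordlists, guesses, successes, N):
    """q: length-n array of current estimates q_i.
    wordlists: list of n dicts mapping password -> probability.
    guesses: the m passwords guessed so far; successes: the counts N_j.
    N: number of users in the password set being guessed.
    Returns (log-likelihood up to a constant, gradient w.r.t. q)."""
    # P[j, i] = probability wordlist i assigns to the j-th guessed password
    P = np.array([[wl.get(k, 0.0) for wl in wordlists] for k in guesses])
    Nj = np.asarray(successes, dtype=float)
    Q = P @ q                        # Q_{k_j} = sum_i q_i p_{i, sigma_i(k_j)}
    rem_users = N - Nj.sum()         # users not yet compromised
    rem_prob = 1.0 - Q.sum()         # probability mass left for future guesses

    # Terms with Q_{k_j} = 0 have N_j = 0 and contribute nothing (Sect. 3.2.1).
    loglik = float(np.sum(Nj * np.log(np.where(Q > 0, Q, 1.0))))
    if rem_prob > 0:
        loglik += rem_users * np.log(rem_prob)

    ratio = np.divide(Nj, Q, out=np.zeros_like(Nj), where=Q > 0)
    grad = P.T @ ratio
    if rem_prob > 0:
        grad -= rem_users * P.sum(axis=0) / rem_prob
    return loglik, grad
```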

3.2.2 Gradient descent

As we apply iterations of gradient descent to estimate the parameters \(q_1, \dots , q_n\) which maximise the likelihood function, we must maintain the constraints of the system (Eq. 1). To meet these constraints, we project the gradient vector onto a probability simplex and then adjust our step size so that we stay within that space.

With each iteration of gradient descent, we move a step in the direction that increases our likelihood function. The gradient \(\vec {g}\) is scaled by a factor \(\alpha \) to give a step size. This is further scaled by an amount \(\beta \) chosen so that \(\beta \le 1\), \(\beta ||\vec {g}|| \le 1\) and the updated point \(\vec {p} + \alpha \beta \vec {g}\) lies within the simplex.
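The following sketch shows one way such a constrained step could be implemented: the gradient is projected onto the sum-zero subspace (so the step preserves \(\sum _i q_i = 1\)) and \(\beta \) is backed off until all coordinates remain non-negative. The projection details are an assumption on our part; the released code [20] may differ.

```python
import numpy as np

def simplex_step(q, grad, alpha=0.1):
    """One gradient-ascent step from q (a point on the probability
    simplex), scaled by alpha and beta so the result stays on the simplex."""
    # Project the gradient onto the sum-zero subspace so sum_i q_i is preserved.
    g = grad - grad.mean()
    if np.allclose(g, 0.0):
        return q
    # Start with beta satisfying beta <= 1 and beta * ||g|| <= 1,
    # then halve it until the new point has no negative coordinates.
    beta = min(1.0, 1.0 / np.linalg.norm(g))
    for _ in range(50):
        new_q = q + alpha * beta * g
        if new_q.min() >= 0.0:
            return new_q
        beta /= 2.0
    return q  # no feasible step found without leaving the simplex
```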

3.3 Multi-armed bandit implementation

Implementing the multi-armed bandit involves a number of design choices. The key variables are the initial \(q_i\) values, the choice of word to guess and how large a step to take in the gradient descent. In Appendix A, we test multiple options for each of these variables. Below, we summarise the optimal implementation for our model.

Initialisation We expect the gradient descent estimates to improve with each guess made, since every guess provides more information. There are a number of ways to initialise the \(\hat{q}_i\) for the gradient descent that follows each guess. Based on our analysis (shown in Appendix A), we initialise the \(\hat{q}_i\) values to \(\hat{q}_i=1/n\) and reset them to this value before each new guess. This means we do not carry forward information from a previous bad estimate.

Guess choice Once we have generated our estimate of the \(\hat{q}\)-values, we want to use them to inform our next password guess. We will use what we denote the Q-method for the guess choice decision.

The Q-method uses the estimated \(\hat{q}\)-values to compute the probability of seeing each word k in the password set. Suppose, for example, a word k has probability \(p_1(k)\) in wordlist 1 and also occurs in wordlists 2 and 3 with probabilities \(p_2(k)\) and \(p_3(k)\), respectively. Using Eq. 2, we can compute the total probability of this word occurring in the password set by multiplying the probability of the password in each wordlist by the weighting assigned to that wordlist. So if wordlists 1, 2, 3 are weighted as \(q_1, q_2, q_3\), then the probability of password k occurring in a password set made up of these wordlists is \(p(k)=q_1p_1(k)+q_2p_2(k)+q_3p_3(k)\). We compute this for each candidate word k, determine which k has the highest probability of being in the password set, and use this word as the next guess.
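A minimal sketch of the Q-method, using the same dictionary representation of wordlists as above (names hypothetical):

```python
def choose_next_guess(q, wordlists, already_guessed):
    """Return the candidate password k with the highest estimated
    probability Q_k = sum_i q_i * p_i(k) of appearing in the password
    set, excluding words that have already been guessed."""
    candidates = set()
    for wl in wordlists:
        candidates.update(wl)            # union of all wordlist entries
    candidates -= set(already_guessed)

    def q_value(word):
        return sum(qi * wl.get(word, 0.0) for qi, wl in zip(q, wordlists))

    return max(candidates, key=q_value)
```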

Gradient descent step-size Recall that at each gradient descent iteration, we move a step in the direction that increases our likelihood function. The gradient is scaled by a factor \(\alpha \) which determines how far we move towards the perceived maximum. A step size which is too large could overstep the maximum and fail to converge; a step size that is too small means we may never reach the maximum value in a reasonable number of iterations.

From our analysis in Appendix A, we conclude that the best estimates for the q-values are provided by the constant-alpha method for determining step size. This method uses a fixed \(\alpha \) in each iteration; since the gradient shrinks as we approach the maximum, the effective step size is automatically reduced. In particular, \(\alpha =0.1\) is an effective value for our model.

Pseudo-code for our implementation of the multi-armed bandit is shown in Algorithm 1. The full code is also available on GitHub [20].
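To show how these pieces fit together, here is a sketch of the guessing loop in the spirit of Algorithm 1. It reuses the hypothetical helpers sketched above (log_likelihood_and_grad, simplex_step, choose_next_guess) and assumes a caller-supplied function guess(word) that returns the number of users compromised by that guess; it is an illustration of the loop structure, not the released implementation [20].

```python
import numpy as np

def multi_armed_bandit(wordlists, guess, N, num_guesses=100,
                       alpha=0.1, descent_iters=100):
    """Adaptively guess a password set of N users using n wordlists."""
    n = len(wordlists)
    guesses, successes = [], []
    for _ in range(num_guesses):
        # Reset q to uniform before re-estimating after each guess (Sect. 3.3).
        q = np.full(n, 1.0 / n)
        if guesses:  # nothing to learn from before the first guess
            for _ in range(descent_iters):
                _, grad = log_likelihood_and_grad(q, wordlists, guesses,
                                                  successes, N)
                q = simplex_step(q, grad, alpha)
        # Q-method: guess the word most likely to appear in the password set.
        word = choose_next_guess(q, wordlists, guesses)
        guesses.append(word)
        successes.append(guess(word))    # number of users compromised
    return guesses, successes, q
```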

4 Dataset source

In this section, we will investigate whether the multi-armed bandit can identify the source of a data leak, given a set of possible options. This information could be valuable for validating the source of a password leak or for helping a password guesser tailor their guesses. We will also show that the multi-armed bandit can adaptively choose between wordlist options to improve guessing success even when we have no information about the source of the password leak.

To do this, we utilise existing real-world leaked password datasets. The datasets we have chosen are computerbits.ie, hotmail.com, flirtlife.de and 000webhost.com. The 10 most popular passwords in each of these password leaks are shown in Fig. 1.

Algorithm 1: Multi-armed bandit

Computerbits.ie dataset: \(N=1795\) In 2009, 1795 users' passwords were leaked from the Irish website Computerbits.ie. The most popular words in this dataset include many Irish-orientated words: dublin, ireland, munster, celtic. Also, the second most popular password for the website Computerbits.ie was “computerbits”, reinforcing the idea that the service provider has an impact on users' choice of passwords [14]. Based on the website's domain and origin, we assume that the dominant nationality of the users in this dataset is Irish.

Hotmail.com dataset: \(N=7300\) Ten thousand users' passwords from the website Hotmail.com were made public in 2009 when they were uploaded to pastebin.com by an anonymous user; after cleaning, 7300 users' passwords remained. Though the cause is still unknown, it is suspected that the users were compromised by means of phishing scams [21]. The most popular password in this leak was “123456”, which occurred 48 times (0.66% of users chose this password).

Flirtlife.de dataset: \(N=98{,}912\) In 2006, over 100,000 passwords were leaked from the German dating site Flirtlife.de. A write-up by Heise online [22], a German security information website, states that the leaked file contained many usernames and passwords with typographic errors; it seems that attackers were harvesting the data during log-in attempts. After cleaning this data (using the methods specified in [13]), we were left with 98,912 users and 43,838 unique passwords. The passwords in this data are predominantly German and Turkish.

Fig. 1 Percentage the 10 most common passwords account for in each dataset

000webhost.com dataset: \(N=15{,}252{,}206\) In 2015, 15 million users' passwords were leaked from 000webhost.com [23]. The attacker exploited a bug in an outdated version of PHP. The passwords were stored in plaintext and were created under a composition policy requiring them to be at least 6 characters long and to include both letters and numbers. The leaked dataset was cleaned in the same way as in [23]. There are 10 million unique passwords in the dataset. The rank-1 password accounts for a surprisingly low 0.16% of users' passwords.

The above datasets were chosen because they are available online for others to replicate this work, they are used regularly within the literature [13, 23, 24], and because they each contribute interesting characteristics in terms of either user demographics or composition restrictions. Cleaning of the password datasets was completed according to the needs of the dataset; in particular, we ensured that each user only contributed a single password.

4.1 Dataset source 1: 100% from Flirtlife.de

Let us take the following scenario: A password guesser or security architect finds a list of passwords online. They suspect the passwords were leaked from a particular location (for example from the flirtlife.de website), but they do not know how to validate it. The investigator can read the passwordset into the multi-armed bandit along with a selection of candidate sources and determine if the suspected source is correct.

To test the effectiveness of such a test, we create a new passwordset containing 1000 users' passwords sampled without replacement from the flirtlife.de dataset. We read three candidate sources into the multi-armed bandit: the remaining 97,912 users' passwords from Flirtlife, 1795 users' passwords from computerbits.ie and 7300 users' passwords from hotmail.com. We then set the multi-armed bandit to guess the sampled passwordset. It makes one guess at a time and then computes the estimated weight of each wordlist. If the scheme is effective, it should identify that the sample is best matched to the Flirtlife dataset.
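For clarity, the construction of this experiment can be sketched as follows (the function is our own illustration; the input is assumed to hold one password per user):

```python
import random

def split_sample(passwords, sample_size=1000, seed=1):
    """Split a leaked password list into a target sample to guess and
    the remaining users, which form the candidate wordlist. Sampling is
    without replacement, so the two groups of users are disjoint."""
    rng = random.Random(seed)
    indices = set(rng.sample(range(len(passwords)), sample_size))
    target_set = [pw for i, pw in enumerate(passwords) if i in indices]
    remaining = [pw for i, pw in enumerate(passwords) if i not in indices]
    return target_set, remaining
```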

The results are shown in Fig. 2. By 10 guesses, the multi-armed bandit has assigned a 90% weighting to the Flirtlife dataset, showing that the sample most likely originated from this source. Note that the sample was chosen without replacement; therefore, the 1000 users in the sample are distinct from the 97,912 users in the Flirtlife dataset. The multi-armed bandit is able to effectively match the sample to the correct dataset because users of the same website choose similar passwords [1].

Figure 2(r) compares the guessing success for the Flirtlife sample using a selection of different dictionary options. The black line denotes the optimum guessing success: the result if every guess were correct and made in the optimum order. Guessing with the Hotmail and Computerbits datasets does poorly, as expected, since these have little in common with the Flirtlife sample. Guessing using information from users in the Flirtlife dataset (excluding those included in the sample) is the most effective method. Guessing with a combination of Computerbits, Hotmail and Flirtlife also does well, as Flirtlife makes up the majority of this combination (with 98,912 users compared to Computerbits and Hotmail's combined 9095 users). The multi-armed bandit model outlined in Algorithm 1 also results in high guessing success.

Fig. 2 Dataset source 1: 100% from Flirtlife.de. Left: estimating the weighting (q-value) of a 1000-user sample chosen without replacement from the Flirtlife dataset. The multi-armed bandit can identify that this sample was from Flirtlife, assigning it a 90% weighting. Right: guessing success for guessing using 3 individual wordlists, the 3 wordlists combined and the multi-armed bandit method. The black line shows optimal guessing success

Fig. 3 Dataset source 2. Left: estimating the weighting (q-value) of a 10,000-user password set created as 60% drawn from 000webhost, 30% from Hotmail and 10% from Computerbits. The black lines show the true weightings. After 3 guesses, the multi-armed bandit can accurately estimate the weightings. Right: guessing success. The multi-armed bandit provides the best guessing returns compared to guessing using individual wordlists or simply combining all the available wordlists

4.2 Dataset source 2: 60% 000webhost, 30% Hotmail, 10% Computerbits

We now investigate whether the multi-armed bandit can determine the weightings for a passwordset even when it was created as a combination from multiple sources. To do this, we create password sets from a particular mix of sources.

We create a password set made up of 10,000 users' passwords; 6000 were selected randomly from the 000webhost dataset, 3000 from the hotmail.com dataset and 1000 from the computerbits.ie dataset.

In Fig. 3(l), we show the weightings (\(\hat{q}\)-values) that the multi-armed bandit assigns to each of the candidate datasets: 000webhost, Hotmail and Computerbits. Since we created the password set, we know the true weightings are 0.6, 0.3 and 0.1, respectively; these are shown as the solid black horizontal lines. The multi-armed bandit does a good job of estimating the actual weightings: after just 3 guesses, it can accurately estimate the weightings that should be assigned to each of the three wordlists.

Fig. 4 Dataset source 3: 60% Flirtlife, 30% Hotmail, 10% Computerbits. Left: estimating the weighting (q-value) of this 10,000-user passwordset. After 5 guesses, the multi-armed bandit can accurately estimate the correct weightings. Right: guessing success. The multi-armed bandit provides the best guessing returns

Fig. 5 Dataset source 4. Left: estimating the weighting (q-value) of a 10,000-user password set created from all four wordlist options: 55% drawn from Hotmail, 30% from Flirtlife, 10% from 000webhost and 5% from Computerbits. Within 20 guesses, the multi-armed bandit can estimate the correct weightings. It estimates the correct order after 4 guesses. Right: guessing success. Again, the multi-armed bandit method provides the best guessing returns

In Fig. 3(r), we plot the number of successful password guesses for the different wordlist guessing options: guessing with each wordlist individually, combining all the wordlists together and using the multi-armed bandit to adaptively choose between the wordlists. Despite 000webhost being the dominant source of the passwords, it is not effective at guessing the passwords efficiently. When we guess solely using the 000webhost.com dictionary, we get lower guessing returns than when we use the multi-armed bandit to adaptively inform our guesses. We believe this is because of the flattened nature of the 000webhost distribution, which likely results from the composition restrictions placed on its passwords. Guessing the top password from the 000webhost wordlist compromises only 0.16% of its users, whereas guessing the top password in the Hotmail dictionary compromises 0.66% of its users. Also, because the 000webhost passwords are restricted by a composition policy, they do not accurately guess the 40% of passwords in the passwordset that come from hotmail.com and computerbits.ie. It is interesting that combining all the wordlists into “wordlists combined” does no better than 000webhost.com and significantly worse than the multi-armed bandit method. It is worth noting that the multi-armed bandit guessing would also be influenced by the high ranking of 000webhost passwords and their low guessing success. However, it still performs significantly better because it guesses using information and weightings from all the dictionaries.

4.3 Dataset source 3: 60% Flirtlife, 30% Hotmail, 10% Computerbits

To see whether the multi-armed bandit is still effective when the 000webhost.com set is not included, we create a new passwordset. This time, we take 60% of the passwords from Flirtlife.de, 30% from Hotmail.com and 10% from Computerbits.ie.

In Fig. 4(l), we plot the estimated q-values after the gradient descent was completed for each guess. Again, even after a small number of guesses, we have good predictions for how the password set is distributed between the three wordlists.

In Fig. 4(r), we show the number of users successfully compromised after each new guess. After 100 guesses, the multi-armed bandit method had compromised 795 users, in comparison to the 870 users compromised by guessing the correct password in the correct order for every guess.

4.4 Dataset source 4: passwords from all four sources

Finally, we create a new password set, this time made up of 10,000 users' passwords drawn from all 4 different wordlists: 55% were selected randomly from the hotmail.com dataset, 30% from the flirtlife.de dataset, 10% from the 000webhost dataset and 5% from the computerbits.ie dataset.

In Fig. 5(l), we plot the estimated q-values after the gradient descent has completed for each guess. The actual proportions are shown as solid horizontal lines. Within 20 guesses, the multi-armed bandit can estimate the correct weightings. It estimates the correct order after 4 guesses.

Fig. 6 Dataset source unknown. Left: estimating the weighting (q-value) for 4 candidate wordlists. Right: guessing success. Despite no link between the wordlists and the password set, the multi-armed bandit still provides the highest guessing success

Fig. 7 Irish users. Left: q-value estimates for the Irish password set from Computerbits.ie. The multi-armed bandit identifies that the passwordset is best linked to the Irish users wordlist. Right: guessing success. The multi-armed bandit and the Irish users wordlist offer similar guessing returns, and both are better than using a generic (all users) passwordset

Figure 5(r) shows the guessing returns for guessing with the individual wordlists, guessing with all the wordlists combined and guessing using the multi-armed bandit. The multi-armed bandit again performs best. The Flirtlife and Hotmail wordlists perform well, but the Computerbits, combined and 000webhost wordlists perform poorly.

4.5 Dataset source unknown

In the previous simulations, we showed that if we include the source dataset as a wordlist option, the multi-armed bandit can identify it and use this knowledge to improve guessing success. We now investigate whether the multi-armed bandit can still be leveraged to improve guessing success even if there is no obvious link between the wordlists provided and the passwordset.

To investigate this, we use the 2009 rockyou.com password leak which includes 32 million plaintext user credentials. This password set has been frequently used by researchers in the field and therefore allows effective comparison to other works. All four wordlists were used to guess the Rockyou password set: Computerbits, Hotmail, Flirtlife and 000webhost. However, this time, we have no a priori information about a relationship between the passwordset, Rockyou, and these wordlists.

Figure 6(l) shows the estimated breakdown of Rockyou between the four wordlists. Hotmail is assigned the highest weighting, followed by 000webhost, Flirtlife and Computerbits. Given the breadth of the audience demographic of each wordlist, this assessment of the breakdown seems logical: the nationality-specific websites, computerbits.ie and flirtlife.de, fall lowest, and 000webhost.com, which enforces composition restrictions, fares slightly worse than hotmail.com.

Fig. 8 German users. Left: q-value estimates for the German password set from flirtlife.de. The multi-armed bandit identifies German users as the best linked to the Flirtlife passwordset after 50 guesses. Right: guessing success. The multi-armed bandit offers significantly better guessing returns than guessing using the German users passwordset or the all users passwordset

In Fig. 6(r), we compare the multi-armed bandit's adaptive guessing (solid purple line) to guessing using each wordlist separately. The multi-armed bandit performs well. After 100 guesses, it has compromised 945,371 users (64% of optimum, 2.9% of the total), in comparison to 804,731 (54% optimum, 2.5% total), 703,041 (47% optimum, 2.2% total), 603,783 (41% optimum, 1.9% total) and 64,024 (4.3% optimum, 0.2% total) from Flirtlife, Hotmail, Computerbits and 000webhost, respectively. Notice that the combination guessing strategy follows the distribution of the 000webhost wordlist all the way until guess 59, when it eventually guesses the most popular password “123456”. Simply combining wordlists means that whichever wordlist has the most users in it will dominate. This gives good evidence to support our suggestion of splitting wordlists based on their characteristics and adaptively learning which wordlist to choose from, rather than the traditional method of creating one large wordlist for guessing.

5 Language and user nationality

It is well known that user demographics, such as nationality and language, play an important role in their password choices [13, 25, 26]. Indeed, this is information that human password guessers might look for when determining their guessing strategies. In this section, we investigate whether the multi-armed bandit can identify these characteristics and leverage them to improve guessing.

Determining the dominant language used in a password set would be a relatively simple task if we had the entire password set as a plaintext list. However, as passwords are hashed and salted or protected by a server, we must instead make guesses and, when a guess succeeds, reflect on which language seems to be producing the most successes. The multi-armed bandit can help us with this learning exercise.

Clearly, a simple method could tell the difference between, say, Chinese and English passwords. We are interested in the more challenging setting of distinguishing between Irish users’ passwords and English users’ passwords when the spoken language is the same, or between English and German passwords where both use the Latin alphabet. In this section, we will show that our learning methods are able to identify these subtle distinctions.

The two password sets we will try to guess are the computerbits.ie password set and the flirtlife.de password set. Computerbits.ie is made up of 1795 Irish users. Flirtlife.de is made up of 98,912 predominantly German and Turkish users. The two wordlists were drawn from the large set of 31 password leak datasets known as Collection #1 [27]. One of these password datasets was selected, and from it we extracted all the passwords whose corresponding email address contained the country-code top-level domain “.ie” and, separately, “.de”. These formed our nationality-specific user wordlists for Ireland and Germany, with 90,583 and 6,541,691 users, respectively.
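Extracting these nationality-specific wordlists amounts to filtering a combined leak by the email's country-code top-level domain. The sketch below assumes the leak is available as lines of the form email:password (this format, and the function name, are assumptions on our part; Collection #1 is distributed in several formats):

```python
def extract_by_tld(lines, tld=".ie"):
    """Collect passwords whose associated email address ends in the
    given country-code top-level domain, e.g. '.ie' or '.de'."""
    passwords = []
    for line in lines:
        try:
            email, password = line.rstrip("\n").split(":", 1)
        except ValueError:
            continue                 # skip lines that are not email:password
        domain = email.rsplit("@", 1)[-1].lower()
        if domain.endswith(tld):
            passwords.append(password)
    return passwords
```

Running such a filter once with “.ie” and once with “.de” yields the Irish users and German users wordlists described above.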

Irish passwords We are interested in whether the multi-armed bandit will match the distribution of the Irish password set computerbits.ie to the Irish wordlist extracted from the subset of Collection #1 (denoted “Irish users” from now on).

In Fig. 7(l), we included three wordlists: the hotmail.com leaked passwords, the flirtlife.de password set and the Irish users. Hotmail.com is an international website. However, it is suspected that the Hotmail users in the dataset we have were compromised by means of phishing scams aimed at the Latino community. Flirtlife is a dating site with predominantly German and Turkish users. Figure 7(l) plots the breakdown estimated by the multi-armed bandit. From the first guess, it estimates that the passwords in the computerbits.ie set match closely to the passwords in the Irish users wordlist. Notice that some weighting is assigned to the Hotmail wordlist but essentially none to the flirtlife.de password set.

In Fig. 7(r), we guess the passwords in the Irish computerbits.ie password set. The black line shows the returns for an optimum first 100 guesses. We also guess them using the order and passwords from the full Collection #1 password set that the Irish users were drawn from; we label this full dataset “all users”. We made 100 guesses against the 1795 users in the Computerbits password set, with the top 100 most popular words chosen in order from each wordlist. The wordlist composed of only Irish users performed better at guessing than the wordlist containing all users' passwords. We also include the guessing success for our multi-armed bandit model. It performs as well as guessing using the Irish users set, showing that it was able to quickly learn the nationality and adapt its guessing accordingly.

German passwords We now try to guess the flirtlife.de password set using the wordlist of German users. While flirtlife.de is a German dating site, its main users were both German and Turkish.

In Fig. 8(l), the multi-armed bandit does not link the Flirtlife passwords to the German users wordlist until the high-frequency passwords, roughly the first 50 guesses, have been made. However, in Fig. 8(r), the guessing success is still highest for the multi-armed bandit. The next best wordlist option is the German users wordlist, followed by guessing using all users' passwords. This indicates that while German passwords feature strongly, German is not the only nationality represented in the password set. Recall that Flirtlife is made up of users from two nationalities and languages: German and Turkish. The multi-armed bandit in this case is the best option for guessing as it takes into account guesses from all wordlists and adaptively chooses between them.

6 Conclusion

This research demonstrates that an automated password guesser can learn characteristics of a password set with each guess made and that it can leverage this information to improve guessing success.

We have shown that a multi-armed bandit model can adaptively choose between different wordlists to improve guessing success. We have also demonstrated that characteristics such as dataset source, language and nationality can be inferred from a leaked passwordset in an automated way using this multi-armed bandit (MAB) technique. Our MAB learning algorithm refines its estimate of the distribution of the password set it is guessing with every guess made. Importantly, it requires no a priori training. In many previous wordlist approaches, a single ordered wordlist is created; in our method, wordlists are separated based on their source or characteristics. In our examples, this separation of wordlists consistently improves guessing success over using a single ordered wordlist. This adaptive learning model demonstrates that a password guesser can learn about a password set with each guess made and emphasises the effectiveness of dynamic real-time analysis of guessing returns.

Knowing the potential of this guessing model is useful for both users and organisations. It provides evidence for the importance of guiding users away from passwords which reflect demographic characteristics or website-specific terms. It also demonstrates that password choices differ measurably depending on the service they were created for, which suggests that websites could consider tailored blocklisting techniques. In particular, websites that have experienced previous password leaks could restrict future users from choosing passwords which occurred with high frequency in that leak.