Published in Journal of Statistical Planning and Inference 141, issue 1, 597-601, 2011
which should be used for any reference to this work

Simple random sampling with over-replacement


Erika Antal ¹, Yves Tillé
Institute of Statistics, University of Neuchâtel, Pierre à Mazel 7, 2000 Neuchâtel, Switzerland

abstract

There are several ways to select units with replacement and an equal inclusion expectation. We present a new sampling design called simple random sampling with over-replacement. Its interest lies in the high variance produced for the Horvitz–Thompson estimator. This characteristic could be useful for resampling methods.

Keywords: Survey sampling; Simple random sampling with replacement; Discrete probability distribution; Resampling method

0. Introduction

There are several methods for drawing a sample, and different goals and situations call for different sampling designs. The most basic sampling procedures are simple random sampling with and without replacement. In this paper we show that there are several ways to select units with replacement with an equal inclusion expectation. We propose a new method in which units are repeated in the sample more often than under the usual simple random sampling with replacement. This sampling design, called simple random sampling with over-replacement, yields a larger variance of the estimator of the total. This property could be of interest for resampling methods. We show how to implement this design and compare it to simple random sampling with and without replacement.

1. Main concept and notation

A sampling design on a population $U = \{1, \dots, k, \dots, N\}$ is a procedure that allows us to randomly select statistical units. Some statistical units can be selected several times in the sample. In survey sampling theory, it is usual to define a sample as a subset of the population $U$. However, this definition is rather restrictive because it is limited to samples for which the units are selected only once, i.e. when sampling is done without replacement.
A more flexible notation consists in defining a sampling design by a nonnegative, discrete random vector $S = (S_1, \dots, S_k, \dots, S_N)^\top$, where $S_k$ is the number of times unit $k$ is selected in the sample. The same notation can thus be used to define sampling designs with or without replacement. If the sample is selected without replacement, then $S_k$ can only take the values 0 and 1. If the sample has a fixed sample size $n$, then $\sum_{k \in U} S_k = n$.
The inclusion expectation of unit $k$ is $\pi_k = E(S_k)$. Since a unit can be selected several times in the sample, $\pi_k$ can take any nonnegative value. The joint inclusion expectation of two units $k$ and $\ell$ is the expectation of the product of $S_k$ and $S_\ell$, i.e. $\pi_{k\ell} = E(S_k S_\ell)$. Moreover, $\Delta_{k\ell} = \mathrm{cov}(S_k, S_\ell) = \pi_{k\ell} - \pi_k \pi_\ell$. If the sample is selected without replacement, then the inclusion expectation is called inclusion probability.
Let $y_1, \dots, y_N$ denote the values taken on the units of the population by an interest variable $y$. Suppose now that we want to estimate the total of these values, $Y = \sum_{k \in U} y_k$. If all the $\pi_k > 0$, this total can be estimated without bias by $\hat{Y} = \sum_{k \in U} S_k y_k / \pi_k$. This estimator is called the Horvitz–Thompson estimator if the sample is selected without replacement and the Hansen–Hurwitz estimator if the sample is selected with replacement (see Hansen and Hurwitz, 1949; Horvitz and Thompson, 1952).
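As an illustration of the estimator $\hat{Y} = \sum_{k \in U} S_k y_k / \pi_k$ defined above, the following minimal sketch computes it from a vector of selection counts. The function name `estimate_total` and the small population used in the usage note are hypothetical and serve only as an example.

```python
from fractions import Fraction

def estimate_total(counts, y, pi):
    """Return sum_k S_k * y_k / pi_k using exact rational arithmetic.

    counts : selection counts S_k (0 for units not in the sample)
    y      : values y_k of the interest variable
    pi     : inclusion expectations pi_k, all strictly positive
    """
    if not (len(counts) == len(y) == len(pi)):
        raise ValueError("counts, y and pi must have the same length")
    if any(p <= 0 for p in pi):
        raise ValueError("all inclusion expectations must be positive")
    return sum(Fraction(s) * Fraction(v) / Fraction(p)
               for s, v, p in zip(counts, y, pi))
```

For instance, with a hypothetical population of $N = 4$ values `[3, 5, 2, 10]`, a fixed size $n = 2$ and $\pi_k = 1/2$, the count vector `[0, 2, 0, 0]` (unit 2 drawn twice) gives `estimate_total([0, 2, 0, 0], [3, 5, 2, 10], [Fraction(1, 2)] * 4) == 20`.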
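¹ Corresponding author. E-mail addresses: [email protected] (E. Antal), [email protected] (Y. Tillé).

The variance of $\hat{Y}$ is

$$\mathrm{var}(\hat{Y}) = \sum_{k \in U} \sum_{\ell \in U} \frac{y_k y_\ell}{\pi_k \pi_\ell} \Delta_{k\ell}.$$

If all the $\pi_{k\ell} > 0$, this variance can be estimated without bias by means of the following formula:

$$\widehat{\mathrm{var}}(\hat{Y}) = \sum_{k \in U} \sum_{\ell \in U} \frac{S_k S_\ell y_k y_\ell}{\pi_k \pi_\ell} \frac{\Delta_{k\ell}}{\pi_{k\ell}}. \qquad (1)$$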

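Nevertheless, this variance estimator is often very unstable. It can take negative values (see, for instance, Tillé, 2006, pp. 26–29). When the sampling design has a fixed sample size, the variance can be written as

$$\mathrm{var}(\hat{Y}) = -\frac{1}{2} \sum_{k \in U} \sum_{\ell \in U} \left( \frac{y_k}{\pi_k} - \frac{y_\ell}{\pi_\ell} \right)^2 \Delta_{k\ell},$$

and can be estimated by

$$\widehat{\mathrm{var}}(\hat{Y}) = -\frac{1}{2} \sum_{k \in U} \sum_{\ell \in U} S_k S_\ell \left( \frac{y_k}{\pi_k} - \frac{y_\ell}{\pi_\ell} \right)^2 \frac{\Delta_{k\ell}}{\pi_{k\ell}},$$

which can also be written under the quadratic form

$$\widehat{\mathrm{var}}_D(\hat{Y}) = \sum_{k \in U} \sum_{\ell \in U} \frac{S_k y_k}{\pi_k} \frac{S_\ell y_\ell}{\pi_\ell} D_{k\ell}, \qquad (2)$$

with

$$D_{k\ell} = \begin{cases} -\displaystyle\sum_{\substack{j \in U \\ j \neq k}} \frac{S_j \Delta_{kj}}{\pi_{kj}} & \text{if } k = \ell, \\[2ex] \dfrac{\Delta_{k\ell}}{\pi_{k\ell}} & \text{if } k \neq \ell. \end{cases}$$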

2. Simple random sampling without replacement

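A sampling design is said to be simple and without replacement if $\Pr(S = s) = n!(N-n)!/N!$ for all $s \in \mathcal{S}_n^N$, where $\mathcal{S}_n^N = \{ s \in \{0,1\}^N \mid \sum_{k=1}^N s_k = n \}$.

In simple random sampling without replacement, $\Delta_{k\ell} = -n(N-n)/\{N^2(N-1)\}$ if $k \neq \ell \in U$ and $\Delta_{kk} = n(N-n)/N^2$, $k \in U$, which gives the variance of the estimator of the total $\mathrm{var}(\hat{Y}) = N^2(N-n)\sigma^2/\{(N-1)n\}$, where $\sigma^2 = N^{-1} \sum_{k \in U} (y_k - \bar{Y})^2$ and $\bar{Y} = N^{-1} \sum_{k \in U} y_k$. Moreover, we have $\Delta_{k\ell}/\pi_{k\ell} = -(N-n)/\{N(n-1)\}$ when $k \neq \ell \in U$ and $\Delta_{kk}/\pi_{kk} = (N-n)/N$, $k \in U$, which gives $\widehat{\mathrm{var}}(\hat{Y}) = N^2(N-n)\hat{\sigma}^2/(Nn)$, where $\hat{\sigma}^2 = (n-1)^{-1} \sum_{k \in U} S_k (y_k - \hat{\bar{Y}})^2$ and $\hat{\bar{Y}} = n^{-1} \sum_{k \in U} S_k y_k$.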
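The variance formula for simple random sampling without replacement can be checked on a small case by enumerating every possible sample. The sketch below does this with exact fractions; the function name `srswor_check` and the test population are hypothetical and only illustrate the formulas of this section.

```python
from fractions import Fraction
from itertools import combinations

def srswor_check(y, n):
    """Enumerate all samples of size n drawn without replacement from y.

    Returns (expectation of the estimator, its exact variance,
    the closed form N^2 (N - n) sigma^2 / ((N - 1) n)).
    """
    N = len(y)
    samples = list(combinations(range(N), n))
    prob = Fraction(1, len(samples))          # equals n!(N-n)!/N!
    # With pi_k = n/N the estimator of the total is (N/n) * sample sum.
    estimates = [Fraction(N, n) * sum(y[k] for k in s) for s in samples]
    mean = sum(prob * e for e in estimates)
    var = sum(prob * (e - mean) ** 2 for e in estimates)
    ybar = Fraction(sum(y), N)
    sigma2 = sum((v - ybar) ** 2 for v in y) / N
    formula = Fraction(N**2 * (N - n)) * sigma2 / ((N - 1) * n)
    return mean, var, formula
```

Calling `srswor_check([3, 5, 2, 10, 7], 2)` returns an expectation equal to the true total 27 and an enumerated variance that coincides with the closed form.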

3. Simple random sampling with replacement

A sampling design is said to be simple and with replacement if

$$\Pr(S = s) = \frac{1}{N^n} \frac{n!}{s_1! \cdots s_k! \cdots s_N!} \quad \text{for all } s \in \mathcal{R}_n^N,$$

where $\mathcal{R}_n^N = \{ s \in \mathbb{N}^N \mid \sum_{k=1}^N s_k = n \}$. Vector $S$ therefore has a multinomial distribution. A well-known result is that a multinomial distribution can be derived from a sequence of independent Poisson random variables that are conditioned on their sum. More formally, consider $N$ independent Poisson random variables $X_1, \dots, X_N$ with the same parameter $\lambda$, i.e. $\Pr(X_k = x_k) = e^{-\lambda} \lambda^{x_k} / x_k!$, $x_k = 0, 1, 2, 3, \dots$. Then, one can prove that

$$\Pr\left( X_1 = x_1, \dots, X_N = x_N \,\middle|\, \sum_{i=1}^N X_i = n \right) = \frac{1}{N^n} \frac{n!}{x_1! \cdots x_k! \cdots x_N!},$$

for all $(x_1, \dots, x_N) \in \mathcal{R}_n^N$. The conditional distribution no longer depends on $\lambda$ (see Bol'shev, 1965; Johnson et al., 1997, p. 65).
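The conditioning result above can be verified numerically for small $N$ and $n$. The following sketch, with hypothetical helper names, enumerates $\mathcal{R}_n^N$, conditions independent Poisson probabilities on their sum, and compares the result with the multinomial probabilities. It is a floating-point check under these assumptions, not a proof.

```python
import math

def compositions(n, N):
    """Yield every vector of N nonnegative integers that sums to n."""
    if N == 1:
        yield (n,)
        return
    for first in range(n + 1):
        for rest in compositions(n - first, N - 1):
            yield (first,) + rest

def conditional_poisson(n, N, lam):
    """Distribution of N i.i.d. Poisson(lam) variables given that their sum is n."""
    joint = {
        x: math.prod(math.exp(-lam) * lam**xk / math.factorial(xk) for xk in x)
        for x in compositions(n, N)
    }
    total = sum(joint.values())
    return {x: p / total for x, p in joint.items()}

def multinomial(n, N):
    """Probabilities n! / (N^n * prod_k x_k!) of simple sampling with replacement."""
    return {
        x: math.factorial(n) / (N**n * math.prod(math.factorial(xk) for xk in x))
        for x in compositions(n, N)
    }
```

Comparing `conditional_poisson(3, 3, 0.7)` and `conditional_poisson(3, 3, 4.0)` with `multinomial(3, 3)` should show agreement up to rounding error, whatever the value of $\lambda$.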
Two ways of implementing simple random sampling with replacement are given in Tillé (2006, pp. 60–61). In simple random sampling with replacement, $\pi_k = n/N$ for all $k \in U$, and $\Delta_{k\ell} = -n(N-1)/\{N^2(N-1)\}$ when $k \neq \ell \in U$ and $\Delta_{kk} = n(N-1)/N^2$, $k \in U$, which gives the variance of the Hansen–Hurwitz estimator of the total $\mathrm{var}(\hat{Y}) = N^2 \sigma^2 / n$. Moreover, we have

$$\frac{\Delta_{k\ell}}{\pi_{k\ell}} = \begin{cases} \dfrac{N-1}{N-1+n} & \text{if } k = \ell, \\[2ex] -\dfrac{1}{n-1} & \text{if } k \neq \ell. \end{cases}$$

Although it is possible to construct an unbiased estimator of the variance by using expression (1), the result obtained is very strange and should not be used (see Tillé, 2006, p. 58). It is nevertheless possible to construct an unbiased estimator by using the quadratic form given in expression (2), based on the $D_{k\ell}$

$$D_{k\ell} = \begin{cases} 1 & \text{if } k = \ell, \\[1ex] -\dfrac{1}{n-1} & \text{if } k \neq \ell, \end{cases}$$

which gives $\widehat{\mathrm{var}}(\hat{Y}) = N^2 \hat{\sigma}^2 / n$.

4. Simple random sampling with over-replacement

Simple random sampling with replacement can be viewed as a conditional distribution of independent Poisson variables. What happens if, instead of using the Poisson distribution, we use another discrete distribution? If we use a sequence of geometric random variables conditioned on their sum, we obtain another sampling design with replacement with a fixed sample size. We have called this design simple random sampling with over-replacement because the repetitions of the units are more frequent than in the usual simple random sampling with replacement.
First, consider a sequence of $N$ independent geometric random variables $X_k$: $\Pr(X_k = x_k) = (1-p) p^{x_k}$, $x_k = 0, 1, 2, 3, \dots$, with parameter $p \in (0,1)$. The sample size $n_S = \sum_{k=1}^N X_k$ is random. Let us now calculate the conditional geometric sampling design. If $S_k$ denotes the random variable that gives the number of times unit $k$ is selected in the sample, we have, for all $(x_1, \dots, x_N) \in \mathcal{R}_n^N$ and with $q = 1 - p$,

$$\Pr(S_1 = x_1, \dots, S_N = x_N) = \Pr\left( X_1 = x_1, \dots, X_N = x_N \,\middle|\, \sum_{k=1}^N X_k = n \right) = \frac{\prod_{k=1}^N (1-p) p^{x_k}}{\sum_{z \in \mathcal{R}_n^N} \prod_{k=1}^N (1-p) p^{z_k}} = \frac{q^N p^n}{\sum_{z \in \mathcal{R}_n^N} q^N p^n} = \frac{1}{\mathrm{card}\, \mathcal{R}_n^N} = \binom{N+n-1}{n}^{-1}.$$


All the samples have exactly the same probability of being selected. By noting that

$$\# \mathcal{R}_n^N = \binom{N+n-1}{n} \quad \text{and} \quad \# \mathcal{R}_{n-j}^{N-1} = \binom{N-1+n-j-1}{n-j},$$

we can derive the marginal distribution of $S_k$:

$$\Pr(S_k = j) = \binom{N-1+n-j-1}{n-j} \bigg/ \binom{N+n-1}{n}, \quad j = 0, \dots, n,$$

which is an inverse (or negative) hypergeometric distribution (see Johnson et al., 1993, pp. 239, 264). We thus have $E(S_k) = n/N$ and

$$\mathrm{var}(S_k) = \frac{n(N-1)(N+n)}{N^2(N+1)}.$$
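The marginal distribution and its first two moments can be checked exactly for given $N$ and $n$. The sketch below uses hypothetical function names and assumes $N \geq 2$. It evaluates the probabilities with binomial coefficients and computes the mean and variance as fractions.

```python
from fractions import Fraction
from math import comb

def marginal_pmf(N, n):
    """Return [Pr(S_k = j) for j = 0..n] under over-replacement (N >= 2)."""
    denom = comb(N + n - 1, n)
    return [Fraction(comb(N - 1 + n - j - 1, n - j), denom) for j in range(n + 1)]

def moments(pmf):
    """Return the mean and variance of a pmf given as a list indexed by j."""
    mean = sum(j * p for j, p in enumerate(pmf))
    var = sum((j - mean) ** 2 * p for j, p in enumerate(pmf))
    return mean, var
```

For $N = 5$ and $n = 3$ the probabilities are $20/35$, $10/35$, $4/35$ and $1/35$. They sum to 1, the mean is $3/5 = n/N$, and the variance is $16/25$, which matches $n(N-1)(N+n)/\{N^2(N+1)\}$.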

This sampling design has a fixed sample size, which implies that $\sum_{k \in U} \mathrm{cov}(S_k, S_\ell) = \mathrm{cov}(n, S_\ell) = 0$. Moreover, since all the units are treated symmetrically, $\mathrm{cov}(S_k, S_\ell) = -\mathrm{var}(S_k)/(N-1)$ for $k \neq \ell$. The matrix of $\Delta_{k\ell}$ is thus given by

$$\Delta_{k\ell} = \frac{(N-1)(N+n)n}{N^2(N+1)} \times \begin{cases} 1 & \text{if } k = \ell, \\[1ex] -\dfrac{1}{N-1} & \text{if } k \neq \ell, \end{cases}$$

which allows us to compute the variance of the Hansen–Hurwitz estimator:

$$\mathrm{var}(\hat{Y}) = \frac{(N+n) N^2 \sigma^2}{(N+1) n}.$$

This variance is larger than the variance obtained under simple random sampling with replacement by the factor $(N+n)/(N+1)$, which exceeds 1 whenever $n > 1$.

Table 1
Comparison of the variance of the three simple designs.

Sampling design                  Variance of the estimator of the total
Simple without replacement       $(N-n) N^2 \sigma^2 / \{(N-1) n\}$
Simple with replacement          $N^2 \sigma^2 / n$
Simple with over-replacement     $(N+n) N^2 \sigma^2 / \{(N+1) n\}$

Simple random sampling with over-replacement can be implemented by a rejective procedure that consists in selecting geometric samples until a sample size $n$ is obtained. Tillé (2006, p. 34) also proposed a general sequential algorithm in order to quickly generate multivariate random variables. This algorithm is based on the computation at each step of the conditional distribution probabilities of the $S_k$, that is

$$\Pr(S_k = j \mid S_{k-1}, \dots, S_1) = \binom{N-k-1+n_k-j}{n_k-j} \bigg/ \binom{N-k+n_k}{n_k}, \quad j = 0, 1, 2, 3, \dots, n_k,$$

where $n_1 = n$ and

$$n_k = n - \sum_{j=1}^{k-1} S_j, \quad k = 2, \dots, N.$$
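The rejective procedure mentioned above can be sketched as follows. The function names are hypothetical, and the default choice $p = n/(N+n)$ is an assumption made here so that the expected total of the geometric draws equals $n$. The conditional distribution does not depend on $p$, so any value in $(0,1)$ gives the same design, only with a different acceptance rate.

```python
import random

def geometric(p, rng):
    """Draw X with Pr(X = x) = (1 - p) * p**x for x = 0, 1, 2, ..."""
    x = 0
    while rng.random() < p:
        x += 1
    return x

def rejective_over_replacement(N, n, rng=None, p=None):
    """Draw N independent geometric counts until their sum equals n."""
    rng = rng or random.Random()
    if p is None:
        p = n / (N + n)   # assumption: makes the expected sum equal to n
    while True:
        counts = [geometric(p, rng) for _ in range(N)]
        if sum(counts) == n:
            return counts
```

A call such as `rejective_over_replacement(6, 4, random.Random(12345))` returns six nonnegative counts that sum to 4. The number of rejected attempts grows with $N$ and $n$, which is one reason a sequential method may be preferable.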

Algorithm 1 is the application of the general algorithm presented in Tillé (2006, p. 34) to sampling with over-replacement. It provides an efficient implementation of sampling with over-replacement.

Algorithm 1. Algorithm for simple random sampling with over-replacement

For $k = 1, \dots, N$, unit $k$ is selected $S_k$ times, where

$$\Pr(S_k = j) = \binom{N-k-1+n_k-j}{n_k-j} \bigg/ \binom{N-k+n_k}{n_k}, \quad j = 0, 1, 2, 3, \dots, n_k.$$
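A minimal Python sketch of Algorithm 1 is given below. The function names are hypothetical, and drawing each $S_k$ by inverting the cumulative distribution is one possible choice among others. The last unit simply receives the remaining count $n_N$, which is what the conditional probabilities imply. The helper `sequential_probability` multiplies the conditional probabilities exactly, so that the uniformity of the design can be checked on small cases.

```python
import random
from fractions import Fraction
from math import comb

def over_replacement_sample(N, n, rng=None):
    """Draw the counts (S_1, ..., S_N) sequentially, as in Algorithm 1."""
    rng = rng or random.Random()
    counts = []
    remaining = n                                # this is n_k
    for k in range(1, N):                        # units 1, ..., N-1
        denom = comb(N - k + remaining, remaining)
        u = rng.random() * denom
        cumulative = 0
        j = 0
        for j in range(remaining + 1):           # invert the cumulative distribution
            cumulative += comb(N - k - 1 + remaining - j, remaining - j)
            if u < cumulative:
                break
        counts.append(j)
        remaining -= j
    counts.append(remaining)                     # unit N takes what is left
    return counts

def sequential_probability(counts):
    """Exact probability that the sequential scheme produces `counts`."""
    N, remaining = len(counts), sum(counts)
    prob = Fraction(1)
    for k, j in enumerate(counts[:-1], start=1):
        prob *= Fraction(comb(N - k - 1 + remaining - j, remaining - j),
                         comb(N - k + remaining, remaining))
        remaining -= j
    return prob
```

For $N = 3$ and $n = 2$ there are $\binom{4}{2} = 6$ possible count vectors, and `sequential_probability` returns $1/6$ for each of them, in agreement with the uniform distribution over $\mathcal{R}_n^N$.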

5. Discussion

Table 1 shows the variances of the three simple designs. Relative to simple random sampling with replacement, simple random sampling without replacement and simple random sampling with over-replacement occupy symmetric positions. Indeed, for simple random sampling without replacement, the finite population correction factor is $(N-n)/(N-1)$, and for simple random sampling with over-replacement, the over-replacement correction factor is $(N+n)/(N+1)$.
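The two correction factors can be compared numerically. The sketch below evaluates the three variances of Table 1 for given $N$, $n$ and $\sigma^2$. The function name and the numerical values in the usage note are hypothetical.

```python
from fractions import Fraction

def variances(N, n, sigma2):
    """Variances of the estimator of the total under the three simple designs."""
    base = Fraction(N**2) * sigma2 / n           # simple with replacement
    return {
        "without replacement": base * Fraction(N - n, N - 1),
        "with replacement": base,
        "with over-replacement": base * Fraction(N + n, N + 1),
    }
```

With $N = 10$, $n = 4$ and $\sigma^2 = 1$, the three variances are $50/3$, $25$ and $350/11$, so the design without replacement lies below simple random sampling with replacement and the design with over-replacement lies above it.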

Simple random sampling with over-replacement is of interest because it shows that there are several methods of sampling with replacement that have an equal inclusion expectation. It is also possible to define a large range of simple random sampling designs by combining several of them. For instance, one can select a subset of observations by simple random sampling with replacement and a second subset by simple random sampling with over-replacement. A large range of sampling designs with replacement can thus be defined, with different variances of the estimator of the total. Antal and Tillé (2010) have used simple random sampling with over-replacement to construct new bootstrap methods for complex sampling designs. The main idea consists of mixing simple random sampling with over-replacement with other sampling designs in order to construct ad hoc resampling designs that reproduce the correct estimator of variance in a complex sampling design. Sampling with over-replacement is thus not merely a mathematical curiosity but can be used in practical applications.

References

Antal, E., Tillé, Y., 2010. A direct bootstrap method for complex sampling designs from a finite population, submitted for publication. Technical Report, University of Neuchâtel.
Bol'shev, L.N., 1965. On a characterization of the Poisson distribution. Teoriya Veroyatnostei i ee Primeneniya 10, 64–71.
Hansen, M.H., Hurwitz, W.N., 1949. On the determination of the optimum probabilities in sampling. Annals of Mathematical Statistics 20, 426–432.
Horvitz, D.G., Thompson, D.J., 1952. A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association 47, 663–685.
Johnson, N., Kotz, S., Kemp, A., 1993. Univariate Discrete Distributions. Wiley, New York.
Johnson, N.L., Kotz, S., Balakrishnan, N., 1997. Discrete Multivariate Distributions. Wiley, New York.
Tillé, Y., 2006. Sampling Algorithms. Springer, New York.
