
Anyone know the probability density function for this?

Started by August 08, 2013 12:55 AM
18 comments, last by alvaro 11 years, 6 months ago

*Not really gamedev related, but the question was asked on a retro games site (World of Spectrum, for ZX Spectrum nostalgia)*

Let's say I have a piece of string of length 1. I make N cuts in the string (cut positions are drawn from a uniform(0, 1) distribution); what is the probability density function for the lengths of the N+1 pieces? (Note: I make all the cuts at once; I don't make a cut, discard a piece, make another cut in what is left, etc.)

It's pretty obvious the mean is going to be 1/N, and from a simulation (done in ZX Basic, lol) it looks to me like the limiting pdf is going to be an exponential distribution with lambda = 1/N.

The constraint is that if x0, x1, ..., xn are the lengths of the pieces then x0 + x1 + ... + xn = 1

Over to you guys! I'm sure there must be a name for this distribution, but a quick look at the 144 continuous pdf's on Wikipedia hasn't turned up anything.

"Most people think, great God will come from the sky, take away everything, and make everybody feel high" - Bob Marley

Well, let's take the leftmost piece of string. It doesn't really matter where we look since the cuts have a uniform distribution. Let's try to work out the probability that this piece of string has length \(l\) or less. Clearly this is easy - the piece has length greater than \(l\) exactly when none of the cuts falls within distance \(l\) of the left endpoint of the string, which happens with probability \((1 - l)^n\), where \(n\) is the number of cuts. So the probability that the piece of string has length \(l\) or less is just \(1 - (1 - l)^n\). This is, of course, valid for \(0 \leq l \leq 1\) only, since the probability that a cut falls outside that interval is \(0\); the density of the random variable is also \(0\) outside it.

In other words, \(\text{Pr}[0 \leq L \leq l] = 1 - (1 - l)^n\), where \(L\) is our random variable. Let's also note that because the cuts are made with a uniform distribution, it doesn't matter which piece of the string you pick - every piece has the same marginal distribution as the leftmost one (you can also use a symmetry argument: the uniform distribution is symmetric over the unit interval, and therefore so are the cuts) - and we can conclude that:

\[\displaystyle \text{Pr}[0 \leq L \leq l] = 1 - (1 - l)^n ~ ~ ~ \text{for} ~ ~ 0 \leq l \leq 1\]

This is the cumulative distribution function of the length of a piece of string. To get the probability density function of this random variable, differentiate:

\[\displaystyle f_L(l) = \frac{\text{d}}{\text{d} l} \left [ 1 - (1 - l)^n \right ] = n (1 - l)^{n - 1}\]

Let's verify this quickly by ensuring the mean value is as expected. Consider \(n\) cuts. Then the mean length is going to be:

\[\displaystyle E[L] = \int_{-\infty}^{+\infty} l \cdot f_L (l) ~ \text{d} l = \int_{0}^{1} l \cdot n (1 - l)^{n - 1} ~ \text{d} l = \frac{1}{n + 1}\]

As expected. For instance, with a single cut, you end up with a mean length of one half. For two cuts, one third, and so on.
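The mean can also be checked by straight simulation; a small Python sketch (my own check, treating the piece length as the minimum of the \(n\) uniform cut positions, as in the derivation above):

```python
import random
from statistics import mean

def leftmost_piece_length(n_cuts):
    # The leftmost piece ends at the smallest of the n uniform cut positions.
    return min(random.random() for _ in range(n_cuts))

for n in (1, 2, 9):
    avg = mean(leftmost_piece_length(n) for _ in range(200_000))
    print(n, round(avg, 4), 1 / (n + 1))  # empirical mean vs 1/(n+1)
```

With 200,000 samples the empirical means land within a few thousandths of 1/(n+1).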

To answer the question, then, the probability density function of the length of one piece of the string cut \(n\) times is:

\[\displaystyle f_L(l) = \begin{cases}n (1 - l)^{n - 1} ~ ~ ~ &\text{if} ~ ~ 0 \leq l \leq 1 \\ 0 ~ ~ ~ &\text{otherwise}\end{cases}\]
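Here's a quick empirical sanity check of that closed form in Python (the leftmost piece's length is just the smallest cut position, so simulating it is a one-liner; the names are mine):

```python
import random

def leftmost_piece_length(n_cuts):
    # Length of the leftmost piece = smallest of the n uniform cut positions.
    return min(random.random() for _ in range(n_cuts))

n, l, samples = 3, 0.25, 200_000
theory = 1 - (1 - l) ** n  # the CDF derived above
estimate = sum(leftmost_piece_length(n) <= l for _ in range(samples)) / samples
print(round(theory, 4), round(estimate, 4))
```

The two numbers agree to about two decimal places, as you'd expect from the Monte Carlo error at this sample size.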

Take as an example the case of three cuts, and the theoretically determined PDF looks like this:

(image: plot of the theoretical PDF for three cuts)

And the same PDF experimentally determined by programmatically cutting a piece of string three times (here's hoping I didn't screw up my implementation) with 500 bins and 5 million samples:

(image: histogram of the experimentally determined PDF)

So, yeah, here you go :) I don't know if this family of PDF's has a name, though.

Just to make sure this is what you meant, my implementation of the "string-cutting routine" was:


Select N uniform reals in [0, 1)
Sort them into a real array named "cut"
Calculate the length of each piece of string:
   cut[0] - 0
   cut[1] - cut[0]
   cut[2] - cut[1]
   ...
   1 - cut[N - 1] 

Resulting in N + 1 lengths.
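For reference, the same routine as a small Python sketch (my translation of the steps above, not the original implementation):

```python
import random

def cut_string(n_cuts):
    """Cut a unit-length string at n_cuts uniform positions;
    return the n_cuts + 1 piece lengths."""
    cut = sorted(random.random() for _ in range(n_cuts))  # uniform reals in [0, 1), sorted
    endpoints = [0.0] + cut + [1.0]
    # Each piece is the gap between consecutive endpoints.
    return [right - left for left, right in zip(endpoints, endpoints[1:])]

pieces = cut_string(3)
print(len(pieces), sum(pieces))  # 4 pieces, total length 1
```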

“If I understand the standard right it is legal and safe to do this but the resulting value could be anything.”


That looks pretty good, I'll look more into it later! I'll link to that post on the other forum thread.

Of course I meant to say the mean would be 1/(N+1) for N cuts ;) I was thinking N pieces when I said 1/N.

Here's the thread on WoS forums (warning, contains ZX Basic!) http://www.worldofspectrum.org/forums/showthread.php?t=44605

EDIT: Is that going to converge to an exponential distribution as I predicted?

EDIT2: Bonus question:

If the string isn't length 1 anymore, but length L (an integer), and the cuts are made at integral points only, what is the distribution then? What if we disallow 0-length pieces of string (i.e. all cuts at integers are distinct)?

(in other words, what is the discrete pdf?)

"Most people think, great God will come from the sky, take away everything, and make everybody feel high" - Bob Marley

What Bacterius said. You could have found it on Google by typing "minimum of uniform random variables".

Order statistics then huh? I don't think we did those in my prob & stats courses in uni (only did 2 years of that, I was pure maths only in my final year). And it was a long time ago! Nice one!

EDIT: And LOL @ "This is the cumulative distribution function of the length of a piece of string"

"Most people think, great God will come from the sky, take away everything, and make everybody feel high" - Bob Marley

What Bacterius said. You could have found it on Google by typing "minimum of uniform random variables".

Thanks for the keywords Alvaro! I didn't identify the problem as a minimum of random variables, but it makes a lot more sense that way.


EDIT: Is that going to converge to an exponential distribution as I predicted?

Not sure what you mean by exponential distribution; the PDF is always going to be polynomial in the length and exponential in the number of cuts, but the distribution does get skewed towards zero rather quickly with every additional cut.


EDIT2: Bonus question:

If the string isn't length 1 anymore, but length L (an integer), and the cuts are made at integral points only, what is the distribution then?

You should still be able to use the same reasoning above, using discrete sums instead of integrals (though it might require a few modifications). I would hope the PDF would look the same, only as a discrete approximation of the continuous version. Think about it: if you add enough points and rescale the PDF down by the number of points, you are essentially converging towards the continuous PDF by the definition of the integral!
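To illustrate, here's a quick Python sketch under one reading of the discrete setup (each of the n cuts chosen independently and uniformly from the integer positions 1..L-1, repeats allowed, looking at the leftmost piece); the formula in the comment is just my own discrete analogue of the CDF above:

```python
import random
from collections import Counter

def discrete_leftmost(L, n_cuts):
    # Each cut is an independent uniform pick from the integer positions
    # 1..L-1 (repeats allowed); the leftmost piece runs from 0 to the
    # smallest cut position.
    return min(random.randint(1, L - 1) for _ in range(n_cuts))

L, n, samples = 10, 3, 200_000
counts = Counter(discrete_leftmost(L, n) for _ in range(samples))
for k in range(1, L):
    # Discrete analogue of the CDF: Pr[min cut <= k] = 1 - ((L-1-k)/(L-1))^n
    theory = 1 - ((L - 1 - k) / (L - 1)) ** n
    empirical = sum(counts[j] for j in range(1, k + 1)) / samples
    print(k, round(theory, 4), round(empirical, 4))
```

The empirical cumulative probabilities track the formula closely, and as L grows the staircase converges towards the continuous CDF, as argued above.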


What if we disallow 0 length pieces of string (i.e. all cuts at integers are distinct)?

That is... more complicated, I think, since then the cuts are no longer independent :( and when that happens in statistics things tend to become hairy very quickly. Though in the limit (with a sufficiently large number of integers) you will still converge towards the original PDF, as the probability of cutting in the same location twice tends to zero. After all, you could say that the continuous version of the problem in your original post is really just the limit of this process, where the probability of cutting in the same spot is zero (the probability that two uniformly selected reals in any given non-empty interval are equal is, indeed, zero).

“If I understand the standard right it is legal and safe to do this but the resulting value could be anything.”


Yeah, although sums/discrete pdf's tend to be hairier than integrals in my experience. If 0-length pieces of string are disallowed it's going to be worse: then it's a case of picking subsets of {1, 2, ..., L-1} (points where you cut the string) and analysing the pdf for the string lengths resulting from that, and straight away it's a two-variable pdf (length L and number of cuts, N).

EDIT: I see that there is no section in wikipedia for the order statistic for a discrete uniform distribution either.

"Most people think, great God will come from the sky, take away everything, and make everybody feel high" - Bob Marley

I am sure the probability distribution converges to an exponential for large numbers of cuts.

My informal reasoning is that, since the cuts happen as independent random events, as the number of cuts grows the situation resembles a Poisson process.

Let's try it another way. We can change our probability distribution so its average is 1, by multiplying it by n+1. The resulting density function is n/(n+1)*(1-x/(n+1))^(n-1), whose limit is exp(-x). Undoing the multiplication by n+1, we get that for large values of n the distribution is well approximated by (n+1)*exp(-(n+1)*x), which is an exponential distribution with lambda = n+1.
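A quick numerical comparison (Python, my own check) of the exact density n(1-x)^(n-1) against the approximation (n+1)*exp(-(n+1)*x) for n = 1000:

```python
import math

def f_exact(l, n):
    # Exact density of one piece's length: n (1 - l)^(n - 1)
    return n * (1 - l) ** (n - 1)

def f_exp(l, n):
    # The proposed large-n approximation: (n + 1) exp(-(n + 1) l)
    return (n + 1) * math.exp(-(n + 1) * l)

n = 1000
for l in (0.0005, 0.001, 0.002):
    print(l, round(f_exact(l, n), 2), round(f_exp(l, n), 2))
```

At n = 1000 the two densities agree to within a fraction of a percent over the region where almost all of the mass sits.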

EDIT: Typo in a formula.

Yep, my intuition was that it was a Poisson process as well, sort of like modelling the distribution of the number of arrivals in a time period of 1 with mean arrival time 1/(n+1).

"Most people think, great God will come from the sky, take away everything, and make everybody feel high" - Bob Marley

I'd like to thank Paradigm Switcher, Bacterius and Alvaro for taking the time (and having the interest) to contribute to answering the question I posed on the WoSpectrum forum (that PS reposted here). I can certainly see that BASIC (particularly on a ZX emulator!) is going to be a very tedious way of making further progress!

In light of this, I have a question to Bacterius (or any of you): What software did you use to write/run the program/model you presented in post #2?

I'm looking for a simple way to get started in the statistics/programming/modelling field. I have access to R and MATLAB on Mac platform. While I have learnt to create basic plots and do some very simple data manipulations, I haven't progressed to writing programs within them - are they even appropriate? I am trained in Molecular biology but have only ancient Maths and BASIC programming skills that are 20 years old and largely forgotten.

This topic is closed to new replies.
