May 25, 2007

So far, all we’ve done is look at individual runs of the simulation. While this is all well and good to get a feel for evolving populations, we sometimes need to ask wider questions. For example, in my past post on Genetic Drift we claimed that Drift has less of an effect in larger populations, just by comparing two simulations. If we want to know how population size affects the strength of drift, we need to get a little more scientific and create replicates.

Replicates are multiple answers to the same question.


In our simulations, we use random numbers in order to simulate effects like Genetic Drift. The downside is that these random numbers introduce a certain level of ‘noise’ into the answer. By running the simulation multiple times we can take an average and get a clearer picture.

You can think of it in terms of a ‘population of results’; in order to understand the whole population, we need to take several samples. I’ve taken our simulation from the last entry, which finds the number of generations until Genetic Drift stops changing the population, and run it multiple times, putting the answer from each run into our histogram object. The output looks like this:

Now, I’m prepared to say that this looks like a normal distribution. It only looks skewed because the distribution sits close to zero on the X-Axis.
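The whole procedure — run the simulation until drift stops changing the population, record how many generations that took, repeat, and bin the answers — can be sketched like this. This is a toy Wright-Fisher-style drift model standing in for the blog’s actual simulation; all the names here are mine, not taken from the original code:

```python
import random
from collections import Counter

def generations_to_fixation(pop_size=50, start_freq=0.5):
    """Toy drift model (not the blog's simulation): count generations
    until the allele is either fixed or lost in the population."""
    count = int(pop_size * start_freq)  # copies of the allele
    generations = 0
    while 0 < count < pop_size:
        freq = count / pop_size
        # Each individual in the next generation inherits the allele
        # with probability equal to its current frequency.
        count = sum(1 for _ in range(pop_size) if random.random() < freq)
        generations += 1
    return generations

# Run replicates and bin the answers into a simple text histogram.
replicates = [generations_to_fixation() for _ in range(400)]
histogram = Counter(g // 10 * 10 for g in replicates)  # 10-generation bins
for bin_start in sorted(histogram):
    print(f"{bin_start:4d}-{bin_start + 9:<4d} {'#' * histogram[bin_start]}")
```

Each run gives a different answer, which is exactly why one run on its own tells us so little.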

So the replicates in a simulation are effectively samples taken from the theoretical population of answers, which is normally distributed around a mean value. This is why we run a lot of replicates and then take the mean and standard deviation as our answer.
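Summarising the replicates is then just a mean and a standard deviation. A minimal sketch, using made-up replicate values purely for illustration:

```python
import statistics

# Hypothetical replicate results: generations until drift
# stopped changing the population (numbers invented for illustration).
results = [62, 71, 58, 90, 45, 66, 73, 55, 80, 61]

mean = statistics.mean(results)
stdev = statistics.stdev(results)  # sample standard deviation
print(f"{mean:.1f} \u00b1 {stdev:.1f} generations")
```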

How Many Samples?

Now there’s the question. How do we know when we’ve got the right number of replicates? The only way to find out for sure is to look at the results of a given simulation for different numbers of replicates. Here’s a chart of the means from another set of simulations against the number of replicates run:

As you can see, once there are more than about 230 replicates, the mean answer seems to settle at a value between 60 and 70.

There’s a trade-off here; more replicates are more accurate, but take more time to produce. Now, I’m a careful type of guy, so I think I’ll use 400 replicates so that there’s a buffer against highly randomised data. I’ll use this number of replicates for all future simulations.
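You can check this settling behaviour for yourself by watching the running mean as replicates accumulate. A sketch, with a stand-in “simulation” (a noisy random draw — the true answer of 65 and the spread of 13 are invented for the example):

```python
import random
import statistics

random.seed(1)  # make the sketch repeatable

def one_replicate():
    # Stand-in for a full simulation run: a noisy answer
    # around an assumed true value of 65 generations.
    return random.gauss(65, 13)

# Watch the running mean settle as more replicates accumulate.
answers = []
for n in (10, 50, 100, 230, 400):
    while len(answers) < n:
        answers.append(one_replicate())
    print(f"{n:4d} replicates: mean = {statistics.mean(answers):.1f}")
```

The early means jump around; the later ones barely move, which is the same settling-down visible in the chart above.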

I also collected the frequency histograms for all these simulation runs, which I have plotted here in 3D to show how the distributions change with more replicates:

You can see that with a small number of replicates (towards the back of the 3D area) there is no real peak to the distributions, while for larger numbers of replicates (towards the front of the 3D area) there is a definite peak.

I won’t be going into any real statistical analysis of the results for the simulations that appear here; I don’t believe it’s really necessary.


If you’re interested in writing software, check out my other blog: Coding at The Coal Face


Well, it looks like Word 2007 has a blogging feature. I shall investigate further …