Stats question...

Started by kirkd February 03, 2003 11:16 AM

3 comments, last by kirkd 22 years ago

Author

505

February 03, 2003 11:16 AM

Let's say I have 50 items to choose from and I make 5000 selections from that pool of 50. The assumption is that each has a similar probability to be chosen and I would expect a 1/50 chance for each. Now, let's suppose that there is some bias for some of the items in the pool and they will be selected more often than others. What is the test for significance in this situation? In other words, how many times does an item need to be selected to say that it is being chosen for and its selection is not necessarily random. Thanks for any tips... -Kirk BTW, I'm not looking for a specific answer, but a way to solve the problem. [edited by - KirkD on February 3, 2003 12:18:17 PM]

pawn69

122

February 03, 2003 03:01 PM

Most random functions give you linear (or uniform) distribution. What you are probably looking for is a rand function that gives you 'normal' distribution, where items in the middle of the population are more common than items on the fringes.

Edit:
Numerical Recipies has source and documentation on normal and exponential random number generators.

[edited by - pawn69 on February 3, 2003 4:05:36 PM]

kirkd

Author

505

February 03, 2003 05:31 PM

Actually, this is not for the assessment of a random number generator. I''ve been developing some methods by way of evolutionary computation in which items are selected randomly and combined into a sigle organism. Over time, those organisms with higher fitness persist and others die off, then I look at the collection of items within the organisms. I''m looking for the significance based upon the total number of items selected across the population.

Here''s what''s going on. After the population has converged, I have 25 organism each with somewhat different structures. I can determine the "items" that made up those structures. Due to the stochastic nature of the algorithm, some of what is there is likely to be evolutionary tag-alongs. If I take the collection of all "items" from all organisms I get about 5000 total. Then I count the frequency of each item. Now, as an obvious example, if item #1 occured 2 times and item #2 occured 500 times, item #2 is chosen a significant number of times whereas #1 is likely a tag along.

My question is, what is the signficance test such that I can determine which items are truly significant versus those that are tag alongs.

I hope this makes sense.

-Kirk

LilBudyWizer

491

February 03, 2003 06:15 PM

A simple approach would be to add duplicates into the pool. It doesn''t give you very fine control and may eventually get unweldy. Another alternative is basically a histogram. Each element is assigned a value that is its part out of the whole. You don''t actually have to normalize to one if you just keep track of what they all sum to and call that one. You then generate a number and scan the pool accumulating each items value until you get a total greater than the value you generated. That is then the element you select.

How you actually implement depends on how much performance matters and the mix of accesses versus updates. If you update infrequently then you can track the cumulative sum and perform a binary search to select. If you update more frequently then you can do it as a tree where all elements are leaves and each inner node has the total for all elements under it. You don''t really need any elaborate balancing algorithm since you have a fixed number of nodes and it is a small number of nodes. If search time became a problem you could just periodically rebalance. You break it up into individual elements and start building from the bottom up. You join the two elements/partical trees and put the result back into the list of candidates. Basically a hoffman, coffmon or whatever encoding tree.

Keys to success: Ability, ambition and opportunity.

pawn69

122

February 03, 2003 06:31 PM

right, makes sense now.

First off, I think the solution will depend a lot on the data curve of the ordered occurances.

If its linear, then you cant really say which items are significant, and you will probably have to use an "everything below X% is tag along" approach.

If you get x^2, log, or something with steps then there is probably some analysis you can do. I''ll have to look through my stats book at home, to see if anything pops out.

Stats question...

This topic is closed to new replies.

Popular Topics

Recommended Tutorials

Stats question...

This topic is closed to new replies.

Popular Topics

Recommended Tutorials

Reticulating splines