Probability of a particular condition being true



Hi all,
I'm hoping this is an easy answer and I'm glad I can operate under anonymity to ask it!

I have a set of data which is a subset of a population. I'll be literal just in case it helps.

The subset of data is a subset of manufacturing orders.
Within this subset there are some orders which have transactional activity/records AFTER the order was received into stock. The reason for this is usually some administrative workaround that I am trying to understand in more detail.
To start, I'd like to know an approximation of how many orders have this particular condition. The easiest way for me to achieve this is to do some random sampling. Grab an individual PO, and check to see if there were additional transactions after physical receipt.

So, I'm looking for a formula! For example, if I had a subset of 1000 POs, how many would I have to randomly sample and check to be 90% confident that X% of the POs have condition (extra transactions) = True?

Thanks very much for your help on this!



I'm going to answer my own question, hopefully it helps someone

Hi, yep, found the answer.

Here we are talking about population proportion sampling. I'm sure if you google those magic words you can find more, but here's my crack at an explanation:

You can use the following simple formula for a 90% CI upper/lower bound:

CI = p +/- (1.645 * ((p * (1-p) / n) ^ 0.5))

Here 1.645 is the z value for a 90% confidence level, so if you want a different confidence level, change 1.645 to the appropriate z value (for example, 1.96 for 95%).

p = the proportion observed in your sample. If you tested 50 samples and 20 were True, then p = 20/50 = 0.4. If you aren't sure (for example, before you have sampled anything), use 0.5, which gives the widest interval, i.e. the worst case.

n = # of samples (in example above, 50)
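
For anyone who would rather run this than punch it into a calculator, here is a minimal Python sketch of the CI formula above, using the 20-out-of-50 example. The function name `proportion_ci` is just something I made up for illustration, and it assumes the simple normal approximation described here, nothing fancier.

```python
import math

def proportion_ci(successes, n, z=1.645):
    """Normal-approximation interval for a sample proportion.

    successes: number of sampled items where the condition was True
    n:         number of items sampled
    z:         z value for the confidence level (1.645 for 90%)
    """
    p = successes / n                         # observed sample proportion
    margin = z * math.sqrt(p * (1 - p) / n)   # half-width of the interval
    return p - margin, p + margin

# Example from the post: 50 POs sampled, 20 had extra transactions
low, high = proportion_ci(20, 50)
print(f"90% CI: {low:.3f} to {high:.3f}")  # 90% CI: 0.286 to 0.514
```

So with 20 out of 50, the estimate is roughly 40% give or take 11 percentage points.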

Alternatively, if you know what margin of error you want and what confidence level, use this to calculate the appropriate sample size (n). (Sorry, I like to use lots of parentheses to be sure when I don't have a proper equation writer font; I don't want you to have to go calling your dear aunt Sally.)

n = (( z^2) * (p * (1-p))) / (h^2)

h = acceptable margin of error (use 0.1 for a +/- 10%)
p = same as above
z = z value for your confidence level (again, for my use I pretty much stick with 1.645, which gives a relatively standard 90%)
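
The sample size formula can be sketched the same way. Again, the name `sample_size` is hypothetical, and rounding up to the next whole sample is my own assumption (you can't check a fraction of a PO), not part of the formula itself.

```python
import math

def sample_size(h, p=0.5, z=1.645):
    """Sample size for a target margin of error h.

    h: acceptable margin of error (0.1 for +/- 10%)
    p: expected proportion (0.5 is the worst case if you aren't sure)
    z: z value for the confidence level (1.645 for 90%)
    """
    n = (z ** 2) * (p * (1 - p)) / (h ** 2)
    return math.ceil(n)  # assumption: round up to a whole number of samples

print(sample_size(0.1))   # 68
print(sample_size(0.05))  # 271
```

Note how halving the margin of error roughly quadruples the number of POs you have to check, since h is squared in the denominator.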

There are assumptions, including homogeneity of the population.

In my case, the cost of being inaccurate is low. I'm really just trying to get a number out there that's in the ballpark, so I'm leaving out any deeper dive into assumptions and stipulations.

If anyone wants to come in and interrupt this conversation I'm having with myself to point out other important points, things I missed, or things I'm wrong on, please do!