Test vs Control File - Ensuring no significant differences between them

Hi all, I'm new here and would appreciate a little guidance if at all possible please.

A direct-mail file of prospective new customers will be used to test the effect on response of a new marketing message (Test) against the existing message (Control); the message is the only change being tested.

To ensure it's the message change, and not differences in the make-up of prospects within the test and control files, that drives the end results, the test and control files need to be similar, or at least not significantly different from each other, on the key test variable (for example, prospect score).
The test & control files will be randomly split 50:50 from the total prospect file using a random number generator. The total prospect file size could range from 200 to 5000+ records in total.
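The random 50:50 split described above could be sketched like this (a minimal illustration, assuming the prospect file is simply a list of records; the seed is fixed only so the split is repeatable):

```python
import random

def random_split(records, seed=42):
    """Randomly split records 50:50 into test and control groups."""
    rng = random.Random(seed)   # fixed seed so the split can be reproduced
    shuffled = records[:]       # copy so the original file order is untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]   # test, control

prospects = list(range(1000))   # stand-in for the real prospect file
test, control = random_split(prospects)
print(len(test), len(control))  # 500 500
```

With an odd total record count, one extra record simply lands in the control half, which is immaterial at these file sizes.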

So, I believe I have two independent samples for which I want to test the difference in means using a two-tailed test (H0: μ1 = μ2, H1: μ1 ≠ μ2). The sample sizes are >30, and since I can calculate the standard deviation of the total file from which the test and control files are derived, I should therefore be using a Z-test, right?
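For concreteness, a two-sample, two-tailed Z-test with a known standard deviation can be sketched as below (stdlib only; the prospect scores are made-up toy data, and `sigma` stands in for the standard deviation you would compute from the total file):

```python
import math
import statistics

def two_sample_z(x1, x2, sigma):
    """Two-tailed Z-test for a difference in means, treating sigma
    (the total file's standard deviation) as known."""
    n1, n2 = len(x1), len(x2)
    se = sigma * math.sqrt(1 / n1 + 1 / n2)          # standard error of the difference
    z = (statistics.mean(x1) - statistics.mean(x2)) / se
    # two-tailed p-value from the standard normal CDF (via the error function)
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p

# toy example: two files of 100 made-up prospect scores each
a = [50 + (i % 7) for i in range(100)]
b = [50 + ((i + 3) % 7) for i in range(100)]
z, p = two_sample_z(a, b, sigma=2.0)
print(z, p)
```

A large p-value here would mean the split gives no evidence of a difference in mean score between the two files.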

Q1: How do I choose an acceptable significance level - how close to 'similar' or how different is acceptable? i.e.
- Is a p-value just above a chosen significance level of 0.05 really showing me that the two files are similar enough to accept (i.e. fail to reject) H0, or would I be better served using a significance level of 0.1 or higher?
- Does a higher p-value mean the means of the key test variable within the test and control files are closer, and therefore that the make-up of both files is similar for that key variable and less likely to drive differences in response rates?
- However, do I really want similar, or just not significantly different, to ensure there are no selection impacts on my end results? See Q2.

Q2: With that in mind
- do I really only want to make sure they are not significantly different from each other, or is there merit in forcing them to be as similar as possible? i.e. sorting on the key variable and selecting every 2nd record as the test 50%.
- Does adding such bias render the overall test meaningless or would it still be valid?
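The sort-and-alternate idea in Q2 could be sketched as follows (a hypothetical illustration; `score` stands in for whatever the key variable is). Note that the result is a systematic sample, not a simple random one, which is exactly the concern the question raises:

```python
def systematic_split(records, key):
    """Sort on the key variable, then alternate records between test and
    control so the two halves have near-identical score distributions."""
    ordered = sorted(records, key=key)
    test = ordered[0::2]      # every 2nd record, starting at index 0
    control = ordered[1::2]   # the remaining records
    return test, control

# made-up records with an id and a key-variable score
scores = [{"id": i, "score": (i * 37) % 100} for i in range(20)]
test, control = systematic_split(scores, key=lambda r: r["score"])
```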

Q3: Would stratified sampling be a better solution, maintaining the proportions of key variables within each of the test and control files while still selecting randomly within each stratum?
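The stratified 50:50 split in Q3 might look like this (a minimal sketch, assuming each record can be assigned to a stratum such as a score band; the field names are hypothetical):

```python
import random
from collections import defaultdict

def stratified_split(records, stratum_of, seed=42):
    """Split 50:50 within each stratum, randomising inside each stratum,
    so test and control keep the same stratum proportions."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for r in records:
        strata[stratum_of(r)].append(r)   # group records by stratum
    test, control = [], []
    for members in strata.values():
        rng.shuffle(members)              # random selection within the stratum
        half = len(members) // 2
        test.extend(members[:half])
        control.extend(members[half:])
    return test, control

# made-up prospects with a score band as the stratifying variable
prospects = [{"id": i, "band": "high" if i % 3 == 0 else "low"}
             for i in range(300)]
test, control = stratified_split(prospects, lambda r: r["band"])
```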

Q4: If I use stratified sampling to split the test from the control, does this type of sampling affect the t-test or Z-test used to measure the differences between the test and control files (prior to mailing), or the significance testing used at the end of the campaign to measure the differences in the response results gained?

All and any help will be gratefully received, thank you in advance.