Collection of Data
Hypothesis testing, types of data, sampling methods, questionnaire design, capture-recapture
Hypotheses
A statistical investigation always begins with a hypothesis — a testable statement about a population. Before collecting any data, you must state clearly what you are trying to find out.
Hypothesis: A testable statement about a relationship or characteristic in a population. It must be specific and measurable.
Null hypothesis (H0): The default assumption that there is NO relationship or difference. For example: 'There is no correlation between height and shoe size.'
Alternative hypothesis (H1): The hypothesis that there IS a relationship or difference. For example: 'There is a positive correlation between height and shoe size.'
Always write the null hypothesis as 'There is no…' and the alternative hypothesis as 'There is a…'. The null hypothesis is what you assume to be true until evidence suggests otherwise.
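A student thinks that taller students have bigger shoe sizes. Write a suitable null and alternative hypothesis. Null hypothesis: 'There is no correlation between height and shoe size.' Alternative hypothesis: 'There is a positive correlation between height and shoe size.'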
Types of Data
Data can be classified in several ways. Understanding the type of data you have determines which statistical methods and diagrams are appropriate.
Qualitative data: Data that is descriptive and non-numerical. Examples: eye colour, favourite subject, gender.
Quantitative data: Data that is numerical and can be measured or counted. Examples: height, number of siblings, temperature.
Discrete data: Quantitative data that can only take specific, separate values (usually whole numbers). Examples: number of cars, shoe size, number of goals scored.
Continuous data: Quantitative data that can take any value within a range. Examples: height, weight, time, temperature.
Primary data: Data collected by the investigator themselves for a specific purpose. Usually more reliable, but time-consuming to collect.
Secondary data: Data that already exists because someone else collected it. Quicker to obtain, but may be less reliable or not perfectly suited to your investigation.
Shoe size is discrete (you cannot have size 7.3). Height is continuous (you can be any height). Be careful — 'number of' questions are almost always discrete.
Sampling Methods
It is usually impractical to collect data from every member of a population (a census). Instead, we select a sample. The method of sampling affects how representative the sample is.
Population: The complete set of individuals or items being studied.
Sample: A subset of the population selected for study.
Simple random sampling: Every member of the population has an equal chance of being selected. Method: assign a number to every member, then use a random number generator or a lottery to pick the sample.
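The numbering-and-random-selection method described above can be sketched in Python. This is a minimal illustration, not a prescribed method: the population of 500 numbered students, the function name `simple_random_sample` and the `seed` parameter are all assumptions made for the example.

```python
import random

def simple_random_sample(population, sample_size, seed=None):
    """Pick sample_size members without replacement, each equally likely."""
    rng = random.Random(seed)  # a seed makes the draw repeatable (optional)
    return rng.sample(population, sample_size)

# Hypothetical population: students numbered 1 to 500
students = list(range(1, 501))
sample = simple_random_sample(students, 50, seed=1)
print(len(sample))  # 50
```

Because `random.sample` draws without replacement, no student can be chosen twice.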
Stratified sampling: The population is divided into groups (strata) and a random sample is taken from each group in proportion to the group's size.
Sample from stratum = (Stratum size ÷ Total population) × Total sample size
A school has 300 Year 10 and 200 Year 11 students. A stratified sample of 50 is required.
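As a check on the school example, the stratified sampling formula can be applied to each year group in a few lines of Python. The function name `stratified_allocation` is made up for this sketch; in general the results may need rounding to whole numbers, which does not arise here.

```python
def stratified_allocation(strata_sizes, total_sample_size):
    """Apply (stratum size / total population) x total sample size to each stratum."""
    total_population = sum(strata_sizes.values())
    return {
        name: size * total_sample_size / total_population
        for name, size in strata_sizes.items()
    }

allocation = stratified_allocation({"Year 10": 300, "Year 11": 200}, 50)
print(allocation)  # {'Year 10': 30.0, 'Year 11': 20.0}
```

So the sample should contain 30 Year 10 students and 20 Year 11 students, which together make the required 50.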
Systematic sampling: Select every nth member from a list. First calculate the sampling interval, n = Population size ÷ Sample size. Then choose a random starting point between 1 and n and take every nth member from there.
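A short Python sketch of this procedure is given here, assuming the population size divides exactly by the sample size. The list of 500 numbered students and the function name `systematic_sample` are hypothetical.

```python
import random

def systematic_sample(population, sample_size, seed=None):
    """Take every nth member after a random start between 1 and n."""
    interval = len(population) // sample_size  # sampling interval n
    rng = random.Random(seed)
    start = rng.randint(1, interval)           # inclusive of both 1 and n
    # Positions are counted from 1, so subtract 1 to index the list
    positions = range(start, len(population) + 1, interval)
    return [population[p - 1] for p in positions][:sample_size]

# Hypothetical list: students numbered 1 to 500, sample of 50, so n = 10
students = list(range(1, 501))
sample = systematic_sample(students, 50, seed=3)
print(len(sample))  # 50
```

Only the starting point is random. Once it is chosen, every other member of the sample is fixed by the interval.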
Cluster sampling: The population is divided into clusters (e.g. schools in a city). Some clusters are randomly selected and all members of those clusters are sampled.
Quota sampling: The interviewer selects a set number of people from each category (e.g. 10 males and 10 females). Not random: the interviewer chooses who to approach.
Convenience (opportunity) sampling: Selecting whoever is available at the time. Quick but likely to be biased.
A sample is biased if it does not fairly represent the population. Bias can arise from the sampling method (e.g. only surveying friends), the questions asked, or the timing of data collection.
Questionnaire Design
A well-designed questionnaire collects accurate, unbiased data. Poor questionnaire design is a common source of bias.
Leading question: A question that suggests a particular answer. Example: 'Don't you agree that school should start later?' leads the respondent towards agreeing.
Loaded question: A question that contains an assumption. Example: 'How often do you waste time on social media?' assumes that the respondent uses social media and that it is a waste of time.
Random response method: Used for sensitive questions. Respondents secretly answer one of two questions (chosen randomly, e.g. by flipping a coin). The researcher does not know which question was answered, which protects anonymity while still allowing statistical analysis.
Checklist for a good questionnaire:
- Questions are clear and unambiguous
- Response boxes do not overlap (e.g. 1–5, 6–10, not 1–5, 5–10)
- An 'other' option or open-ended option is included where needed
- Questions are not leading or loaded
- Sensitive questions use the random response method
- The questionnaire is piloted (tested) before use
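The point about overlapping response boxes can be checked mechanically. The Python sketch below is an illustration only: it assumes each box is written as a (lowest, highest) pair of whole numbers with both ends included, and the function name `find_overlaps` is made up for the example.

```python
def find_overlaps(boxes):
    """Return pairs of neighbouring response boxes that share a value."""
    ordered = sorted(boxes)  # order the boxes by their lowest value
    return [
        (first, second)
        for first, second in zip(ordered, ordered[1:])
        if second[0] <= first[1]  # next box starts before this one ends
    ]

print(find_overlaps([(1, 5), (6, 10)]))  # []
print(find_overlaps([(1, 5), (5, 10)]))  # [((1, 5), (5, 10))]
```

In the second set of boxes, a respondent whose answer is 5 could tick either box, which is why the pair is reported.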
In the exam, you may be asked to criticise a questionnaire question. Look for: overlapping response boxes, missing response options, leading language, or vague wording.
Capture-Recapture
The capture-recapture method is used to estimate the size of a wildlife population (or any population that is difficult to count directly).
Method:
- Capture a sample of size M and mark/tag each individual
- Release them back into the population
- Wait for them to mix randomly with the rest of the population
- Capture a second sample of size n
- Count the number of marked individuals in the second sample: m
N = (M × n) ÷ m
Where: N = estimated population size, M = number marked in first capture, n = size of second sample, m = number of marked individuals in second sample
50 fish are caught, tagged and released. Later, 40 fish are caught and 8 are tagged. Estimate the population size.
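The fish example can be worked through with the formula N = (M × n) ÷ m. The Python sketch below is one way to check the arithmetic. The function name `estimate_population` is hypothetical, and the guard against m = 0 is an added assumption, since the formula cannot be used when no marked individuals are recaptured.

```python
def estimate_population(marked_first, second_sample_size, marked_in_second):
    """Capture-recapture estimate: N = (M x n) / m."""
    if marked_in_second == 0:
        raise ValueError("No marked individuals recaptured; cannot estimate N")
    return marked_first * second_sample_size / marked_in_second

# M = 50 tagged fish, n = 40 in the second catch, m = 8 of those tagged
estimate = estimate_population(50, 40, 8)
print(estimate)  # 250.0
```

The estimated population is 250 fish: 8 out of 40 (one fifth) of the second catch were tagged, so the 50 tagged fish are taken to be about one fifth of the whole population.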
The estimate relies on these assumptions:
- The population is closed (no births, deaths, immigration or emigration between samples)
- Marked individuals mix randomly with the rest of the population
- Marks do not affect the behaviour or survival of individuals
- Marks are not lost between samples
- Each individual has an equal probability of being captured