1

Collection of Data

Hypothesis testing, types of data, sampling methods, questionnaire design, capture-recapture

Hypotheses

A statistical investigation always begins with a hypothesis — a testable statement about a population. Before collecting any data, you must state clearly what you are trying to find out.

Hypothesis

A testable statement about a relationship or characteristic in a population. It must be specific and measurable.

Null Hypothesis (H₀)

The default assumption that there is NO relationship or difference. For example: 'There is no correlation between height and shoe size.'

Alternative Hypothesis (H₁)

The hypothesis that there IS a relationship or difference. For example: 'There is a positive correlation between height and shoe size.'

Exam Tip

Always write the null hypothesis as 'There is no…' and the alternative hypothesis as 'There is a…'. The null hypothesis is what you assume to be true until evidence suggests otherwise.

Types of Data

Data can be classified in several ways. Understanding the type of data you have determines which statistical methods and diagrams are appropriate.

Qualitative Data

Data that is descriptive and non-numerical. Examples: eye colour, favourite subject, gender.

Quantitative Data

Data that is numerical and can be measured or counted. Examples: height, number of siblings, temperature.

Discrete Data

Quantitative data that can only take specific, separate values (usually whole numbers). Examples: number of cars, shoe size, number of goals scored.

Continuous Data

Quantitative data that can take any value within a range. Examples: height, weight, time, temperature.

Primary Data

Data collected by the investigator themselves for a specific purpose. Usually more reliable, because you know exactly how it was collected, but time-consuming to gather.

Secondary Data

Existing data that was collected by someone else. Quicker to obtain, but it may be less reliable or not perfectly suited to your investigation.

Exam Tip

Shoe size is discrete (you cannot have size 7.3). Height is continuous (you can be any height). Be careful — 'number of' questions are almost always discrete.

Sampling Methods

It is usually impractical to collect data from every member of a population (a census). Instead, we select a sample. The method of sampling affects how representative the sample is.

Population

The complete set of individuals or items being studied.

Sample

A subset of the population selected for study.

Simple Random Sample

Every member of the population has an equal chance of being selected. Method: assign numbers to all members, then use a random number generator or lottery.
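
The numbering-and-random-draw method can be sketched in Python as follows. This is a minimal illustration, not a prescribed procedure: the function name, the population of 200 numbered students and the seed are all assumptions made for the example.

```python
import random

def simple_random_sample(population, sample_size, seed=None):
    """Choose sample_size members without replacement, each equally likely."""
    # The seed is optional and only makes the draw repeatable for checking.
    rng = random.Random(seed)
    return rng.sample(population, sample_size)

# Hypothetical population: students numbered 1 to 200.
students = list(range(1, 201))
chosen = simple_random_sample(students, 20, seed=1)
print(sorted(chosen))
```

Because the draw is made without replacement, no student can be picked twice.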

Stratified Sample

The population is divided into groups (strata) and a sample is taken from each group in proportion to the group's size.

Stratified Sampling Formula
Sample from stratum = (Stratum size ÷ Total population) × Total sample size
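
As a worked example of the formula above, the short sketch below applies it to each stratum in turn. The year groups, their sizes and the sample size of 60 are hypothetical figures chosen so that the answers come out as whole numbers.

```python
def stratified_allocation(stratum_sizes, total_sample_size):
    """Apply (stratum size / total population) * total sample size to each stratum."""
    total_population = sum(stratum_sizes.values())
    return {
        name: size * total_sample_size / total_population
        for name, size in stratum_sizes.items()
    }

# Hypothetical school of 600 students, with a total sample of 60.
year_groups = {"Year 9": 240, "Year 10": 210, "Year 11": 150}
allocation = stratified_allocation(year_groups, 60)
for name, amount in allocation.items():
    print(name, amount)
```

Here the sample contains 24 students from Year 9, 21 from Year 10 and 15 from Year 11. When the formula gives a decimal, round to the nearest whole number and check that the rounded values still add up to the total sample size.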

Systematic Sample

Select every nth member from a list. First, calculate the sampling interval: n = Population size ÷ Sample size. Then choose a random starting point between 1 and n and select every nth member from that point onwards.
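
The steps for a systematic sample can be sketched in Python as below. This is a rough illustration under stated assumptions: the function name and the list of 100 members are hypothetical, and the interval is rounded down when the population size does not divide exactly by the sample size.

```python
import random

def systematic_sample(population, sample_size, seed=None):
    """Pick a random start within the first interval, then take every nth member."""
    # Sampling interval n = population size / sample size (rounded down here).
    interval = len(population) // sample_size
    rng = random.Random(seed)
    # Random starting position between 1 and n inclusive.
    start = rng.randint(1, interval)
    # List positions in the notes start at 1, Python indexes from 0, so subtract 1.
    return population[start - 1::interval][:sample_size]

# Hypothetical list of 100 members, sample of 10, so the interval is 10.
names = [f"Member {i}" for i in range(1, 101)]
sample = systematic_sample(names, 10, seed=3)
print(sample)
```

Only the starting point is random. Once it is chosen, every other member of the sample is fixed by the interval.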

Cluster Sample

The population is divided into clusters (e.g. schools in a city). Some clusters are randomly selected and all members of those clusters are sampled.
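
A cluster sample can be sketched as a two-step process: pick whole clusters at random, then include everyone in them. The school names, student labels and function name below are hypothetical and only serve to illustrate the idea.

```python
import random

def cluster_sample(clusters, clusters_to_pick, seed=None):
    """Randomly choose whole clusters, then sample every member of each one."""
    rng = random.Random(seed)
    # Step 1: randomly select some of the clusters by name.
    chosen_names = rng.sample(sorted(clusters), clusters_to_pick)
    # Step 2: include all members of each selected cluster.
    members = [member for name in chosen_names for member in clusters[name]]
    return chosen_names, members

# Hypothetical city with four small schools.
schools = {
    "School A": ["A1", "A2", "A3"],
    "School B": ["B1", "B2"],
    "School C": ["C1", "C2", "C3", "C4"],
    "School D": ["D1", "D2", "D3"],
}
chosen_schools, sampled_students = cluster_sample(schools, 2, seed=5)
print(chosen_schools, sampled_students)
```

Note that the size of the final sample depends on which clusters happen to be chosen.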

Quota Sample

The interviewer selects a set number of people from each category (e.g. 10 males and 10 females). Not random — the interviewer chooses who to approach.

Opportunity (Convenience) Sample

Selecting whoever is available at the time. Quick but likely to be biased.

Bias

A sample is biased if it does not fairly represent the population. Bias can arise from the sampling method (e.g. only surveying friends), the questions asked, or the timing of data collection.

Questionnaire Design

A well-designed questionnaire collects accurate, unbiased data. Poor questionnaire design is a common source of bias.

Leading Question

A question that suggests a particular answer. Example: 'Don't you agree that school should start later?' — this leads the respondent towards agreeing.

Loaded Question

A question that contains an assumption. Example: 'How often do you waste time on social media?' — assumes the respondent uses social media and that it is a waste of time.

Random Response Method

Used for sensitive questions. Respondents secretly answer one of two questions (chosen randomly, e.g. by flipping a coin). The researcher does not know which question any individual answered, which protects anonymity. Because the chance of each question being chosen is known, the overall pattern of answers can still be analysed statistically.
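
One possible way to analyse random response data is sketched below. It goes beyond the description above, so treat every detail as an assumption: a fair coin decides the question, heads means the sensitive question, and tails means a harmless question whose 'yes' rate is already known. The function name and the survey figures are hypothetical.

```python
def estimate_sensitive_proportion(yes_answers, total_answers, harmless_yes_rate):
    """Estimate the 'yes' rate for the sensitive question (fair-coin design assumed)."""
    observed_yes_rate = yes_answers / total_answers
    # With a fair coin:
    #   observed rate = 0.5 * sensitive rate + 0.5 * harmless rate
    # Rearranging for the sensitive rate gives:
    return 2 * observed_yes_rate - harmless_yes_rate

# Hypothetical survey: 200 respondents, 70 said 'yes',
# and the harmless question is assumed to get 'yes' half the time.
estimate = estimate_sensitive_proportion(70, 200, harmless_yes_rate=0.5)
print(f"{estimate:.2f}")
```

In this made-up survey the estimate is about 0.20, suggesting that roughly 20% of respondents would answer 'yes' to the sensitive question, even though no individual answer can be traced back to it.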

Features of a Good Questionnaire
  • Questions are clear and unambiguous
  • Response boxes do not overlap and leave no gaps (e.g. 1–5, 6–10, not 1–5, 5–10)
  • An 'other' option or open-ended option is included where needed
  • Questions are not leading or loaded
  • Sensitive questions use the random response method
  • The questionnaire is piloted (tested) before use

Exam Tip

In the exam, you may be asked to criticise a questionnaire question. Look for: overlapping response boxes, missing response options, leading language, or vague wording.

Capture-Recapture

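The capture-recapture method is used to estimate the size of a wildlife population (or any population that is difficult to count directly). It relies on the idea that the proportion of marked individuals in a second sample should be roughly the same as the proportion of marked individuals in the whole population.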

Method
  • Capture a sample of size M and mark/tag each individual
  • Release them back into the population
  • Wait for them to mix randomly with the rest of the population
  • Capture a second sample of size n
  • Count the number of marked individuals in the second sample: m

Capture-Recapture Formula
N = (M × n) ÷ m

Where: N = estimated population size, M = number marked in first capture, n = size of second sample, m = number of marked individuals in second sample
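
As a worked example of the formula, the sketch below estimates a population from made-up figures: 40 animals marked in the first capture, a second sample of 50, and 8 marked animals found in it. The function and parameter names are hypothetical.

```python
def capture_recapture_estimate(marked_first, second_sample_size, marked_in_second):
    """Estimate population size N = (M * n) / m."""
    if marked_in_second == 0:
        # The formula divides by m, so it cannot be used when no marks are found.
        raise ValueError("No marked individuals were recaptured, so N cannot be estimated.")
    return marked_first * second_sample_size / marked_in_second

# Hypothetical figures: M = 40, n = 50, m = 8.
estimate = capture_recapture_estimate(marked_first=40, second_sample_size=50, marked_in_second=8)
print(estimate)  # 250.0
```

So N = (40 × 50) ÷ 8 = 250. The result is only an estimate, and it is only as good as the assumptions behind the method.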

Assumptions of Capture-Recapture
  • The population is closed (no births, deaths, immigration or emigration between samples)
  • Marked individuals mix randomly with the rest of the population
  • Marks do not affect the behaviour or survival of individuals
  • Marks are not lost between samples
  • Each individual has an equal probability of being captured