Assuming data are approximately normally distributed, and that the two
groups are independent random samples from their respective populations,
the null hypothesis of no difference $H_0: \mu_a = \mu_b$ (new policies
had no effect) can be tested against the alternative $H_a: \mu_a > \mu_b,$
using a Welch 2-sample t-test.
The test statistic is
$$T = \frac{\bar X_a - \bar X_b}{\text{SE}}, \text{ where }\;
\text{SE} = \sqrt{\frac{s_a^2}{23} + \frac{s_b^2}{18}}.$$
The $T$-statistic is approximately distributed according to Student's t distribution
with degrees of freedom $\nu,$ computed by the Welch–Satterthwaite approximation
$$\nu = \frac{\left(\frac{s_a^2}{n_a} + \frac{s_b^2}{n_b}\right)^2}
{\frac{(s_a^2/n_a)^2}{n_a - 1} + \frac{(s_b^2/n_b)^2}{n_b - 1}},$$
here with $n_a = 23$ and $n_b = 18.$ The number of degrees of freedom $\nu$ depends
on the sample sizes and sample variances in such a way that
$\min(n_a - 1, n_b - 1) \le \nu \le n_a + n_b - 2,$ and $\nu$ is nearer to
$n_a + n_b - 2$ when the sample variances are nearly equal. This test does not
assume that the population variances are equal, and that assumption should not be
made unless there is sound prior evidence that the population variances are nearly
equal.
Output from Minitab statistical software shows the test statistic, the degrees of freedom, and the
corresponding P-value.
Two-Sample T-Test

Sample   N  Mean  StDev  SE Mean
1       23  8.20   2.40     0.50
2       18  7.40   1.80     0.42

Difference = μ (1) - μ (2)
Estimate for difference: 0.800
T-Test of difference = 0 (vs >):
T-Value = 1.22  P-Value = 0.115  DF = 38
Because the P-value 0.115 exceeds the significance level 0.05 (5%), the difference
between the sample means is not large enough to reject the null hypothesis.
The company may be 'encouraged' that $\bar X_a = 8.20 > \bar X_b = 7.40,$
but the improvement is not enough to be called 'statistically significant' at the 5% level.
It would not be unusual for such a difference in means to occur
by random chance in samples as small as these.
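The Welch statistic and degrees of freedom quoted above can be reproduced directly from the summary statistics. A minimal sketch in Python (standard library only; the variable names are illustrative, not Minitab's):

```python
import math

# Summary statistics from the Minitab output above
n_a, mean_a, sd_a = 23, 8.20, 2.40   # After group  (sample 1)
n_b, mean_b, sd_b = 18, 7.40, 1.80   # Before group (sample 2)

# Welch standard error of the difference in means
va, vb = sd_a**2 / n_a, sd_b**2 / n_b
se = math.sqrt(va + vb)

# Test statistic
t_stat = (mean_a - mean_b) / se

# Welch–Satterthwaite degrees of freedom
nu = (va + vb)**2 / (va**2 / (n_a - 1) + vb**2 / (n_b - 1))

print(round(t_stat, 2))  # 1.22, matching Minitab's T-Value
print(nu)                # about 38.9; Minitab reports the truncated DF = 38
```

If SciPy is available, `scipy.stats.ttest_ind_from_stats(..., equal_var=False)` performs the same computation from summary statistics and also returns the P-value.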
By contrast, if both sample sizes had been ten times as large (with the
same sample means and SDs), then the last line of Minitab output would read:
T-Test of difference = 0 (vs >):
T-Value = 3.86 P-Value = 0.000 DF = 407
Thus the difference between the two sample means would have been highly
significant, with a P-value smaller than 0.0005.
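The scaled-up case can be checked the same way; the same sketch with only the sample sizes changed (each assumed to be multiplied by ten):

```python
import math

# Same means and SDs as before, but sample sizes ten times as large
n_a, mean_a, sd_a = 230, 8.20, 2.40
n_b, mean_b, sd_b = 180, 7.40, 1.80

va, vb = sd_a**2 / n_a, sd_b**2 / n_b
t_stat = (mean_a - mean_b) / math.sqrt(va + vb)
nu = (va + vb)**2 / (va**2 / (n_a - 1) + vb**2 / (n_b - 1))

print(round(t_stat, 2))  # 3.86, matching the Minitab T-Value above
print(round(nu))         # 407, matching the Minitab DF above
```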
Notes:
(1) A possible pitfall in analyzing these data lies in the use of
the terminology 'Before' and 'After'. This terminology is sometimes used
in a paired design, in which people in the same group give scores before
and after being subjected to some treatment (a drug, a training course, etc.).
That cannot be the case here because (a) we are told that a "new group" was
chosen for the questions after the change in policy, and (b) the
sample sizes are different.
(2) Some textbook authors show only the
'pooled' two-sample t test because $\nu = n_a + n_b - 2$ is very easy to
find. However, current statistical practice is to use the Welch (or
'separate-variances') t test unless there is good reason to believe that the
Before and After populations have equal variances. (By now there is a vast body of
simulation evidence that the pooled test may give incorrect results unless the
population variances are truly equal.)
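For these particular data the pooled test happens to agree closely with the Welch test, because the two sample variances are not far apart; this can be seen by computing the pooled statistic by hand (a sketch with illustrative variable names, using the standard pooled-variance formula):

```python
import math

# Same summary statistics as in the Minitab output
n_a, mean_a, sd_a = 23, 8.20, 2.40
n_b, mean_b, sd_b = 18, 7.40, 1.80

# Pooled variance: assumes the two population variances are equal
df_pooled = n_a + n_b - 2   # 39
sp2 = ((n_a - 1) * sd_a**2 + (n_b - 1) * sd_b**2) / df_pooled

t_pooled = (mean_a - mean_b) / math.sqrt(sp2 * (1 / n_a + 1 / n_b))

print(round(t_pooled, 2), df_pooled)  # 1.18 with 39 df, close to Welch's 1.22 with about 38.9 df
```

When the sample variances differ substantially, the two tests can disagree much more sharply, which is the point of the simulation evidence mentioned above.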
(3) Some textbook authors may
say that it is OK to do a z test (using the standard normal distribution
instead of Student's t distribution) because $n_a + n_b - 2 > 30,$ but
most practicing statisticians would not use such a test unless the
population standard deviations $\sigma_a$ and $\sigma_b$ are known.