He’d been wrong, there was a light at the end of the tunnel, and it was a flamethrower. – Terry Pratchett, Mort

Scaling methods

So if we have actually decided that we need to scale, what shall we do?

There are a large number of methods available to scale assessment marks, probably as many as can just be made up. I have seen a number suggested and used and I thought it would be interesting to list these and then look at how they change a distribution of marks for both a normal distribution and a non-normal distribution.

Constant change – the original mark has a constant added to it and then any marks above 100 are capped at 100 and any below 0 are capped at 0 (depending on the direction of scaling), $$x_{\text{new}} =a+x_{\text{original}}$$ This means that the change affects everyone the same, and assumes that were the exam easier every student would have gained the same number of additional marks (or lost the same number if the scaling is down). With only one scaling value, scaling to fix the mean is often undertaken. From Figures 1 and 2 it can be seen that the standard deviation and the general shape are not changed.
Percentage change – the original mark is multiplied by a scale factor and then any marks above 100 are capped at 100 (if the scaling increases the mark), $$x_{\text{new}}=ax_{\text{original}}$$ This means that the change affects the higher scores the most, and assumes that were the exam easier a student who got more marks originally would have gained more marks than a student who got fewer marks originally (or lost more marks if the scaling is down). With only one scaling value, scaling to fix the mean is often undertaken. From Figures 1 and 2 it can be seen that the standard deviation will increase slightly if scaling up (and reduce if scaling down) using this method; the general shape is not greatly changed, but some skew is added.
Inverse percentage change – for want of a better name, this acts the same way as the percentage change, but in reverse, i.e. the increase to students with a lower original mark is higher (and any marks below 0 would be capped at 0 if the scaling reduces the mark), $$x_{\text{new}} =\frac{100-a}{100}x_{\text{original}}+a$$ so a mark of 100 stays at 100 and a mark of 0 becomes $a$. This assumes that were the exam easier a student who got fewer marks originally would have gained more marks than a student who got more marks originally (or lost more marks if the scaling is down). This is based on the principle that it is often easier to gain marks at the lower level. With only one scaling value, scaling to fix the mean is often undertaken. From Figures 1 and 2 it can be seen that the standard deviation will decrease slightly if scaling up (and increase if scaling down) using this method, but the general shape is not greatly changed apart from some added skew.
Relative volatility – I am not sure if there is a real name for this, but as a chemical engineer I recognise it as having the same formula as relative volatility. It has the 0 and 100% points pinned and then the in-between points are scaled to a smooth curve, $$x_{\text{new}} =\frac{100ax_{\text{original}}}{100+(a-1)x_{\text{original}}}$$ This means the greater change happens around the middle, i.e. you might expect the students with the very highest and lowest marks to be more fixed. With only one scaling value, scaling to fix the mean is often undertaken. From Figures 1 and 2 it can be seen that the standard deviation will reduce slightly using this method, but the general shape is not greatly changed.
Power scale – similar to the relative volatility method, it has the 0 and 100% points pinned and then the in-between points are scaled to a smooth curve, $$x_{\text{new}} =100^{(1-a)}x_{\text{original}}^a$$ This means the greater change happens around the middle, i.e. you might expect the students with the very highest and lowest marks to be more fixed. However, in this case the peak in mark change is around 31% rather than around 45%. With only one scaling value, scaling to fix the mean is often undertaken. From Figures 1 and 2 it can be seen that the standard deviation will reduce slightly if scaling up (and increase if scaling down) using this method, but the general shape is not greatly changed.
Piecewise-linear approach – a set of key boundaries is mapped to chosen values, with linear interpolation between them. It is not shown in the figures as there are too many parameters to set, $$x_{\text{new}} =\begin{cases} \displaystyle \frac{30x_{\text{original}}}{a_{30}} & \text{for } x_{\text{original}} < a_{30} \\ \displaystyle 30+\frac{10(x_{\text{original}} -a_{30} )}{a_{40}-a_{30}} & \text{for } a_{30} \le x_{\text{original}} < a_{40} \\ \displaystyle 40+\frac{10(x_{\text{original}} -a_{40} )}{a_{50}-a_{40}} & \text{for } a_{40} \le x_{\text{original}} < a_{50} \\ \displaystyle 50+\frac{10(x_{\text{original}} -a_{50} )}{a_{60}-a_{50}} & \text{for } a_{50} \le x_{\text{original}} < a_{60} \\ \displaystyle 60+\frac{10(x_{\text{original}} -a_{60} )}{a_{70}-a_{60}} & \text{for } a_{60} \le x_{\text{original}} < a_{70} \\ \displaystyle 70+\frac{30(x_{\text{original}} -a_{70} )}{100-a_{70}} & \text{for } a_{70} \le x_{\text{original}} \end{cases}$$ Where $a_{30}$ is the value in the original distribution that we want to be equal to 30% in the new distribution, $a_{40}$ is the value in the original distribution that we want to be equal to 40% in the new distribution, etc. This can be simplified with fewer values or made more complex with more. Based on the values picked for $a_{30}$ to $a_{70}$, virtually any shape can be made, including a normal distribution with a set mean and standard deviation (though that suffers the same issue as the normalization scaling). It can also be used to scale part of the distribution and not other parts.
Arctan scale – this scaling can change the mean value and also compresses the extreme values, $$x_{\text{new}}=\mu_{\text{new}} +\frac{50}{\pi}\arctan\left(\frac{x_{\text{original}} -\mu_{\text{original}}}{a}\right)$$ This means that with two scaling values we can fix the mean to a desired new value and then use the $a$ parameter to scale the standard deviation to the desired value. From Figures 1 and 2 we can see that although the extreme values are compressed, the distribution doesn’t become normal.
Normalization – this scaling converts any distribution into the normal distribution with a specified mean and standard deviation, $$x_{\text{new}} =\mu_{\text{new}} +\sigma_{\text{new}}\mathcal{N}^{-1}\left(\frac{i-0.5}{n}\right)$$ Where $i$ is the mark rank order position, $n$ is the number of marks, and $\mathcal{N}^{-1}$ is the inverse normal distribution, i.e. finding the $z$ value from the probability. This makes any distribution normal with the required mean and standard deviation; however, it doesn’t just scale the marks, but it makes the marks competitive, i.e. the position in the class really defines the mark and not the original scored mark.
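As a rough illustration, the single-parameter methods above can be sketched in Python. This is only a sketch under some assumptions: the function names, the `fit_to_mean` helper, the search brackets, and the sample marks are all hypothetical, and bisection is just one possible way to find the value of $a$ that fixes the mean (the text above does not say how $a$ is chosen in practice).

```python
def clip(x, lo=0.0, hi=100.0):
    """Cap a mark to the 0-100 range."""
    return max(lo, min(hi, x))

def constant_change(x, a):
    return clip(x + a)

def percentage_change(x, a):
    return clip(a * x)

def inverse_percentage_change(x, a):
    # 100 stays at 100, 0 moves to a
    return clip((100.0 - a) / 100.0 * x + a)

def relative_volatility(x, a):
    # 0 and 100 are pinned; requires a > 0
    return 100.0 * a * x / (100.0 + (a - 1.0) * x)

def power_scale(x, a):
    # 0 and 100 are pinned; a < 1 scales up, a > 1 scales down
    return 100.0 ** (1.0 - a) * x ** a

def fit_to_mean(method, marks, target_mean, lo, hi, tol=1e-9):
    """Bisection search for the a in [lo, hi] that gives the target mean.

    Assumes (hypothetically) that the scaled mean is monotonic in a over
    the bracket, which holds for the five methods above.
    """
    def gap(a):
        return sum(method(x, a) for x in marks) / len(marks) - target_mean

    g_lo, g_hi = gap(lo), gap(hi)
    if g_lo * g_hi > 0:
        raise ValueError("target mean is not reachable inside the bracket")
    for _ in range(200):
        mid = (lo + hi) / 2.0
        g_mid = gap(mid)
        if abs(g_mid) < tol:
            return mid
        if g_lo * g_mid <= 0:
            hi = mid                 # root lies in [lo, mid]
        else:
            lo, g_lo = mid, g_mid    # root lies in [mid, hi]
    return (lo + hi) / 2.0

# Made-up marks, scaled up to a mean of 63 with each method
marks = [35, 48, 52, 60, 71]
brackets = {
    constant_change: (0.0, 50.0),
    percentage_change: (1.0, 2.0),
    inverse_percentage_change: (0.0, 100.0),
    relative_volatility: (1.0, 10.0),
    power_scale: (0.1, 1.0),
}
for method, (lo, hi) in brackets.items():
    a = fit_to_mean(method, marks, 63.0, lo, hi)
    scaled = [round(method(x, a), 1) for x in marks]
    print(method.__name__, round(a, 4), scaled)
```

Running this on a real set of marks makes it easy to see which students gain the most under each method before committing to one.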

Graph showing the scale-up produced by different methods with an original normal distribution

Figure 1. Example scale-up produced by different methods with an original normal distribution set of marks. Final distribution fitted to an average of 63%, and where possible the standard deviation was set to 10. Histogram shown with 5% groupings.

Graph showing the scale-up produced by different methods with an original non-normal distribution

Figure 2. Example scale-up produced by different methods with an original non-normal distribution set of marks. Final distribution fitted to an average of 63%, and where possible the standard deviation was set to 10. Histogram shown with 5% groupings.

Summary

So there are lots of options, but are any of them actually good? Any option where the mean and standard deviation are both fitted will cause the final marks to become more competitive, as they are basically fixed (within a small range) by the class position rather than the original mark. Other methods might then be more applicable to different situations. For example, the inverse percentage change could be more justified if the exam duration was reduced (e.g. a fire alarm with no option to extend the exam time), as it could be expected that students with a lower mark could have gained more marks with extra time because there were more marks still available to them. The constant change scale could be justified if there was an error on the paper that made a question impossible.
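To make the point about competitive marks concrete, here is a minimal sketch of the normalization scaling using `statistics.NormalDist` from the Python standard library (Python 3.8 or later). The function name and the two example classes are made up, and ties are simply broken by list order, which is an assumption of this sketch rather than part of the formula.

```python
from statistics import NormalDist

def normalize_marks(marks, new_mean, new_sd):
    """Rank-based normalization: x_new = mu_new + sigma_new * N^-1((i - 0.5) / n)."""
    n = len(marks)
    # Indices of the marks from lowest to highest (ties keep list order)
    order = sorted(range(n), key=lambda k: marks[k])
    std_normal = NormalDist()  # mean 0, standard deviation 1
    scaled = [0.0] * n
    for position, k in enumerate(order, start=1):
        z = std_normal.inv_cdf((position - 0.5) / n)
        scaled[k] = new_mean + new_sd * z
    return scaled

# Two hypothetical classes with very different raw marks but the same ranking
class_a = [10, 20, 30, 40, 50]
class_b = [58, 59, 60, 61, 95]
print(normalize_marks(class_a, 63, 10))
print(normalize_marks(class_b, 63, 10))
```

Both calls print the same list, because only the rank position $i$ and the class size $n$ enter the formula; the raw marks themselves have no influence beyond setting the order.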

So really we want to avoid bulk scaling like this, but if it has to be done, hopefully the above will help in selecting an option that reflects what is needed.