Overview of my research
I used Scottish Qualifications Authority (SQA) administrative data from 2002 to 2009 for my PhD, which investigated secondary school students’ subject choices and attainment in facilitating subjects, the traditional academic subjects that facilitate university entry. Secondary education in Scotland is characterised by substantial socio-economic inequalities in attainment and gendered patterns of performance. Individuals from the most deprived backgrounds do significantly and systematically less well than those from more affluent households, while boys underachieve compared to girls. Evaluating attainment in terms of the number of qualifications achieved ignores the importance of subject choice: some subjects are more important than others for progression to tertiary education and employment opportunities. I examined individuals’ decisions to stay on at school to take Highers (qualifications necessary for university access) and their decisions to take four or more Highers in facilitating subjects (the crucial number for entry to prestigious universities) using sequential logit analysis. Grade attainment in these facilitating subjects was examined using multinomial logit analysis. Gendered patterns of choice and attainment in Maths and Science were further explored, using logistic regression, to ascertain whether there might be a genetic component in terms of increased testosterone exposure in utero.
The SQA database of candidate results was set up in 2000. I made a direct approach to SQA to access this in 2009. A formal data share agreement between SQA and Stirling University was entered into for an initial period of two years; this was subsequently reviewed and extended on two further occasions. The SQA data are managed as three separate databases that can be linked as required: candidate results, candidate details, and presenting centre (school, college or other institution) details. The table below shows the data held in the three databases.
An SQA statistician performed the data linkage work in SAS (a process that takes three to four full working days), and a CSV file containing the linked data was released to Stirling University in early 2010. At the time, SQA did not charge for such work. I linked the Scottish Index of Multiple Deprivation (SIMD) to individuals and centres via their postcodes to provide an indicator of socio-economic background. I used SIMD quintiles as the specific measure because these are used routinely in the reporting of official Scottish Government statistics. All three component databases presented particular challenges, as outlined briefly below.
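The postcode linkage described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the routine actually used: the field names (`postcode`, `simd_quintile`), the postcodes and the quintile values are all invented, and the lookup is assumed to be a simple postcode-to-quintile table.

```python
def normalise_postcode(raw):
    """Upper-case a postcode and strip all whitespace so 'ab1 2cd' matches 'AB1 2CD'."""
    return "".join(raw.split()).upper() if raw else ""


def attach_simd(records, simd_lookup):
    """Return copies of the records with a 'simd_quintile' field (None when unmatched)."""
    lookup = {normalise_postcode(pc): q for pc, q in simd_lookup.items()}
    linked = []
    for rec in records:
        key = normalise_postcode(rec.get("postcode"))
        linked.append({**rec, "simd_quintile": lookup.get(key)})
    return linked


# Hypothetical example data: invented postcodes and quintiles.
simd_lookup = {"AB1 2CD": 1, "EF3 4GH": 5}
candidates = [
    {"candidate": "A", "postcode": "ab1 2cd"},
    {"candidate": "B", "postcode": "EF34GH"},
    {"candidate": "C", "postcode": None},
]

linked = attach_simd(candidates, simd_lookup)

# Records that failed to link are listed rather than silently dropped.
unmatched = [r["candidate"] for r in linked if r["simd_quintile"] is None]
```

Keeping unmatched records in the output, with an empty quintile, makes it easy to see how many observations would be lost if missing postcodes were left unrepaired.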
Summary of any challenges faced
Investigating, cleaning and rendering the raw administrative data was a slow, painstaking process. Initially, because of their sensitive nature (containing candidate names and addresses), the data had to be held securely on a Stirling University server. The very large size of the dataset (just under 1.5 million observations containing both independent and state school and college results) meant that it could not be accessed remotely and worked on efficiently. In the event, the data were cleaned and rendered on site at Stirling University over a four-year period (2010 to 2014), part-time, one day per week (as I was a full-time academic at another university). Once a variable for candidate household had been derived, the data were anonymised by removing names and addresses, making them portable and allowing the pace of the research to speed up.
A major drawback of using administrative data for research purposes is that they are not collected for this purpose. Administrative data entry can be highly variable. Data may be entered inconsistently by different individuals across years, or data that are superfluous from an administrative perspective may not be entered at all. The result is that much time can be spent trying to both rationalise and complete the dataset, as it was in my case, before derived variables can be created or information from other sources can be added. Frustratingly, these inconsistencies are often not detected until midway through a rendering process or, worse, a piece of analysis.
Candidate Results: I transformed candidates’ results (grades) into values for the various subjects they studied (indicated by the Product Codes). When an SQA qualification syllabus is reviewed, a new Product Code is generated for the revised qualification. This meant that the same subject, for instance Higher Economics, had multiple Product Codes that had to be rationalised before subject names could be attached for use as variable names. Once the Product Codes were rationalised and renamed as subjects, they were amended to incorporate the level of the award (e.g. Standard Grade or Higher) as indicated by the Product type. The different subject grades were converted into their UCAS points equivalent to enable both comparison of attainment across qualifications and the creation of aggregate absolute and relative measures of attainment.
Candidate Details: The main challenge here was to create a household id to identify sibling groups. This was carried out by grouping individuals according to their surname, first line of address and postcode using the group() function of Stata’s egen command. The major obstacle here was the inconsistent entering of names and addresses, all of which had to be checked and rationalised.
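The household grouping step can be illustrated with a minimal Python sketch. It is not the Stata routine used for this work. It assumes hypothetical field names (`surname`, `address1`, `postcode`) and invented records, and it only standardises case, punctuation and spacing before grouping.

```python
import re


def normalise(text):
    """Lower-case, drop punctuation and collapse runs of whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]", "", (text or "").lower())
    return " ".join(cleaned.split())


def assign_household_ids(records):
    """Give records sharing (surname, first address line, postcode) the same integer id."""
    ids = {}
    out = []
    for rec in records:
        key = (
            normalise(rec["surname"]),
            normalise(rec["address1"]),
            normalise(rec["postcode"]).replace(" ", ""),
        )
        # A new key gets the next unused id; a key seen before keeps its id.
        household_id = ids.setdefault(key, len(ids) + 1)
        out.append({**rec, "household_id": household_id})
    return out


# Hypothetical records: the first two differ only in case, punctuation and spacing.
people = [
    {"surname": "McDonald", "address1": "1 High St.", "postcode": "AB1 2CD"},
    {"surname": "MCDONALD", "address1": "1  High St", "postcode": "ab12cd"},
    {"surname": "Smith", "address1": "2 Low Road", "postcode": "AB1 2CD"},
]

grouped = assign_household_ids(people)
household_ids = [p["household_id"] for p in grouped]
```

A routine like this only catches formatting differences. Genuine spelling variants (for example "St" against "Street") would still end up in separate groups and need the kind of manual checking described above.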
Centre Details: When linking SIMD information to schools’ postcodes, I discovered that where schools had been closed or merged, their postcodes were missing. This information had to be checked and entered to maximise usable observations. To analyse the decision to stay on at school after age 16, I linked an indicator of youth employment by local authority (LA) area (sourced from the Labour Force Survey). In the process of linking this information, it became apparent that schools’ LAs had not always been entered. Again, to retain the maximum number of observations in the dataset, this information had to be retrieved and entered. Had I not undertaken this work, I would have lost approximately 5% of the data, which starts to bring into question whether or not an actual population is being examined.
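A quick way to gauge how much data is at risk before any linkage is to count the records with gaps in the linking fields. The Python sketch below is illustrative only: the centre records and the field names (`postcode`, `local_authority`) are hypothetical.

```python
def missing_share(records, fields):
    """Return the fraction of records in which at least one of `fields` is empty or absent."""
    if not records:
        return 0.0
    incomplete = sum(
        1
        for rec in records
        if any(not str(rec.get(f) or "").strip() for f in fields)
    )
    return incomplete / len(records)


# Hypothetical centre records with two kinds of gap: an empty string and a missing value.
centres = [
    {"centre": "S1", "postcode": "AB1 2CD", "local_authority": "Stirling"},
    {"centre": "S2", "postcode": "", "local_authority": "Fife"},
    {"centre": "S3", "postcode": "EF3 4GH", "local_authority": None},
    {"centre": "S4", "postcode": "IJ5 6KL", "local_authority": "Fife"},
]

share = missing_share(centres, ["postcode", "local_authority"])
```

Running a check like this at the outset shows whether the gaps are small enough to ignore or, as in the case described above, large enough to justify retrieving the missing information.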
Thoughts for fellow and future eCRUSADers
It is crucial to be aware that administrative data are collected for administrative purposes, not research, so they may not always be complete. If administrators are under pressure, the data entered will only be complete to the extent necessary to carry out the administrative task.
Also, they may not give you the depth of information that you would like. In my case, I had initially hoped to track candidates in terms of moving home and associated SIMD areas but only the current (last) address of a candidate is recorded, all previous addresses being replaced.
Check not just the completeness of the data in terms of the number of entries but also the consistency of those entries; for example, whether names and/or addresses are consistently spelled.
When creating complex derived variables or cleaning/rendering data for completeness, always test your routine on synthetically created problematic data first, then on a subset of your actual data. (And record your software routine/syntax used to enable reproduction of results!)
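As an illustration of testing a routine on synthetic problem data, here is a minimal Python sketch. The cleaning function, the test cases and all names are hypothetical and are not taken from the project. The idea is simply to list awkward inputs alongside the output you expect, and to report every case where the routine disagrees.

```python
def clean_surname(raw):
    """Trim, lower-case and strip apostrophes, hyphens and spaces from a surname."""
    if raw is None:
        return ""
    text = raw.strip().lower()
    for ch in ("'", "\u2019", "-", " "):
        text = text.replace(ch, "")
    return text


# Synthetic problem cases: (raw input, expected cleaned value).
SYNTHETIC_CASES = [
    ("  McDonald ", "mcdonald"),      # stray spaces
    ("MCDONALD", "mcdonald"),         # inconsistent case
    ("O'Neil", "oneil"),              # straight apostrophe
    ("O\u2019Neil", "oneil"),         # curly apostrophe
    ("Smith-Jones", "smithjones"),    # hyphenated surname
    ("", ""),                         # empty entry
    (None, ""),                       # missing entry
]


def run_synthetic_checks(func, cases):
    """Return (input, expected, actual) for every case where func gives the wrong result."""
    return [(raw, want, func(raw)) for raw, want in cases if func(raw) != want]


failures = run_synthetic_checks(clean_surname, SYNTHETIC_CASES)
```

An empty `failures` list means the routine handles every synthetic case. Only then is it worth running on a subset of the real data.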
I carried out much of my data rendering (e.g., replacing Product Codes with subject names) using Excel spreadsheets as this enables both comprehensive data management and ease of checking. I then converted these into Stata datafiles.
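For anyone who would rather script this step than use spreadsheets, the replacement of Product Codes with subject names can be sketched in Python as a lookup table. This is an assumption-laden illustration, not the method described above: the codes shown are invented, and the variable-naming convention (`higher_economics`) is hypothetical.

```python
# Hypothetical codes: real SQA Product Codes are not reproduced here.
PRODUCT_CODE_TO_SUBJECT = {
    "X001": "Economics",
    "X002": "Economics",    # revised syllabus, same subject
    "X010": "Mathematics",
}


def subject_variable(product_code, product_type, mapping=PRODUCT_CODE_TO_SUBJECT):
    """Return a name such as 'higher_economics', or None when the code is unmapped."""
    subject = mapping.get(product_code.strip().upper())
    if subject is None:
        return None
    return f"{product_type}_{subject}".lower().replace(" ", "_")


# Hypothetical result records.
results = [
    {"product_code": "X001", "product_type": "Higher", "grade": "A"},
    {"product_code": "x002", "product_type": "Higher", "grade": "B"},
    {"product_code": "X999", "product_type": "Standard Grade", "grade": "2"},
]

names = [subject_variable(r["product_code"], r["product_type"]) for r in results]

# Codes missing from the table are collected for manual review.
unmapped = sorted(
    {
        r["product_code"]
        for r in results
        if subject_variable(r["product_code"], r["product_type"]) is None
    }
)
```

Whichever tool is used, keeping the code-to-subject table in one place makes it straightforward to check and to reuse when new Product Codes appear.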
Specifically, if you are thinking of working with SQA data, you can either make a direct approach (as I did) to establish a data share agreement between SQA and your institution, or you can apply to access SQA data through Administrative Data Research (ADR) Scotland. To make a direct approach to SQA, contact firstname.lastname@example.org. In 2020, I was advised that, if a new request to SQA were successful, the whole acquisition process might take up to 3 months and that there was currently no charge for provision. You should note that direct access to candidate names and addresses is no longer possible for individual researchers; for data protection reasons this information is no longer released. If you want to identify siblings, for instance, this would need to be explored through ADR Scotland. Scottish Candidate Numbers are also no longer provided and would be replaced with pseudonymised identifiers. Nevertheless, while the data that would be released directly to researchers are more restricted than previously, they still contain identifiable panels of individuals, schools and LAs.
My best advice is to start early to explore what is available and, therefore, what is possible by way of analysis.