Welcome to another Researcher Experience post on eCRUSADers — our series where early career researchers share first-hand accounts of working with Scottish administrative data, to help others navigate the same journey. This post comes from Dr Tamsin Nash, a Clinical Research Fellow at the University of Glasgow and one of my own PhD students, which makes me especially proud to be sharing it here!
At a glance
- Pan-cancer cohort of ~500,000 patients diagnosed in Scotland between 2009 and 2024
- Linked health datasets and 2011 Census data across two separate PBPP panels
- Navigated ethics approval, IG training, and data sharing agreements across multiple institutions
- Practical tips on version control, code lists, and building relationships with data custodians
Overview of Tamsin’s research
My PhD project aims to quantify disparities in the quality of cancer care by socioeconomic position and ethnicity. To assess the quality of care received by individual patients, I requested health datasets linking detailed, individual-level information on diagnosis, treatment and outcomes from across the cancer pathway. I also linked individual-level data from the 2011 Census to quantify socioeconomic and ethnic disparities in treatment quality and survival. Data linkage was coordinated by the electronic Data Research and Innovation Service (eDRIS), who created a cohort of ~500,000 patients diagnosed with cancer in Scotland between 2009 and 2024, which I access through the National Safe Haven. This cohort included all cancer sites, but because prognosis varies significantly by tumour site, I ‘blocked’ the dataset into smaller virtual cohorts defined by ICD-O-3 topography and morphology codes. I am now analysing survival and treatment quality by tumour site, beginning with breast and lung cancer.
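
As an illustration only (this is not the project's actual code), the ‘blocking’ step could look something like the Python sketch below. It assumes each record carries an ICD-O-3 topography code in a field called `topography`, which is a hypothetical name, and it uses only the three-character site prefixes C50 (breast) and C34 (bronchus and lung). A real definition would also filter on morphology codes.

```python
from collections import defaultdict

# Illustrative mapping from ICD-O-3 topography prefix to tumour site.
# Only two sites are shown; a real analysis would cover every site of
# interest and would also take morphology codes into account.
SITE_PREFIXES = {
    "C50": "breast",
    "C34": "lung",
}


def assign_site(topography: str) -> str:
    """Return the tumour site for a topography code such as 'C50.9'."""
    prefix = topography.strip().upper()[:3]
    return SITE_PREFIXES.get(prefix, "other")


def block_by_site(records: list[dict]) -> dict[str, list[dict]]:
    """Group records into 'virtual cohorts' keyed by tumour site."""
    cohorts = defaultdict(list)
    for record in records:
        cohorts[assign_site(record["topography"])].append(record)
    return dict(cohorts)


# Made-up example records; the field names are assumptions.
example = [
    {"patient_id": 1, "topography": "C50.9"},
    {"patient_id": 2, "topography": "C34.1"},
    {"patient_id": 3, "topography": "C18.7"},
]
cohorts = block_by_site(example)
```

Keeping the prefix-to-site mapping in one place means the cohort definitions can be reviewed and changed without touching the grouping logic.
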
Datasets used:
Summary of challenges faced
Submitting my request during First Year PhD
Because of the time required for a data request to be drafted, submitted, approved, and released, I knew it was important to submit my data request early on in my PhD. However, this was tricky as I had not yet refined my research questions, which must be clearly described in the data request. This “catch-22” was the first challenge I faced in applying for data.
Fortunately, I had excellent supervisors who advised keeping my application and research questions broad, and requesting as much relevant data as possible. This allowed me to submit the first draft of my data request early on.
By the time the first draft had been submitted to the panel, I’d liaised with data custodians and met with my thesis panel, allowing me to refine the application in line with my developing research questions.
Deciding what data to request & meeting data custodians
Linking data from different sources requires researchers to establish a relationship with the data custodian. Those familiar with the data can explain exactly what is included, how it is collected and where to find the most up-to-date dictionaries.
For several datasets, e.g. SMR06, SMR00 and SMR01, there is a wealth of information online, including via RDS and eCRUSADers. For others, such as the QPI and National SACT datasets, previous use was more limited. For these datasets in particular, the advice and support of data custodians was helpful for a number of reasons.
Analysts were able to share and discuss the data dictionaries, ensuring I requested the correct variables. They were able to give insights into previous uses of the data and the novelty of my planned analysis, and in some cases shared methods and code to speed up my analysis and ensure it was consistent with previous work.
From the custodians’ perspective, understanding the aims and timeline of my project allowed them to plan workloads and suggest potential collaborations. As a result of these conversations, I included validation of several PHS datasets as a separate work package, strengthening the overall application.
Because I was linking multiple datasets, I had to arrange meetings with stakeholders from all the dataset owners. This often involved sending ‘cold’ emails to analysts and governance officers I had never met. Maintaining a flexible schedule to set up meetings and organising video calls to build trust and rapport with custodians was time-consuming, but very worthwhile in terms of writing a successful data application. Here again, good supervision was vital: it is unlikely that I’d have been able to do this without my supervisors’ insight into the data infrastructure and their organisational links to PHS, the NHS and NRS.
Linking requests for health datasets and Census data
Because I was applying for both health datasets and Census data, my request needed to be submitted to two separate panels: the Health and Social Care PBPP (HSC-PBPP) and the Statistics PBPP (S-PBPP). I knew that each panel would likely suggest amendments, meaning that if I submitted both requests at once, the two could diverge significantly. With the help of my supervisors and my eDRIS research coordinator, I decided to submit my HSC-PBPP application first, aiming to receive full approval before submitting my S-PBPP application. Both requests described the project in full, including all the data I planned to use, and explained that the applications were linked to one another.
Because the request forms were similar, I was able to re-use the majority of the text from my HSC-PBPP form, including the amendments I’d made following the HSC-PBPP panel review. As a result, the S-PBPP application was fairly detailed and comprehensive, and required only minor amendments after submission compared to the HSC-PBPP application. Another benefit of submitting the requests in series was that work could start on data extraction as soon as the HSC-PBPP application was approved. eDRIS kindly agreed to begin preparing the data before the S-PBPP request was approved, and to allow access to each health dataset in the safe haven as soon as it was ready. This meant that I was able to get familiar with the data and begin my analysis much sooner than if I had waited for both requests to be approved and for all the data to become available.
This arrangement also had implications for the way my data was indexed and linked. The standard process is for eDRIS to carry out both indexing and linkage. In this case, linkage was still coordinated by eDRIS, but indexing was done by NRS due to the inclusion of Census data.
Collecting supporting documents
I needed to submit multiple supporting documents alongside my applications, and gathering these took some organisation. The following documents in particular required some additional work:
1. IG Training
– For Health datasets, the MRC’s Research, GDPR and Confidentiality course is sufficient and can be completed quickly online.
– For Census data, the training must be the ONS Safe Researcher Training (ONS-SRT), which involves an interactive online course and exam. Booking and completion can take a few weeks.
– You also need to gather documentation for all researchers accessing the data.
2. Ethical approval
– Ethics for NHS data accessed through eDRIS is covered by the National Safe Haven governance, meaning I did not need to complete a separate ethics application.
– However, Census data was not covered and required separate ethical approval.
– I contacted the Edinburgh research office who suggested that because I was using some NHS data, I should apply for a research sponsor through ACCORD and submit a CR007-T19 ‘data only’ protocol to the Edinburgh Medical School Research Ethics Committee. My ACCORD sponsor reviewed the form and provided some minor feedback before submission.
– Much of the application reused content from my HSC-PBPP submission; the time from submission to approval was around 8 weeks.
3. Data Protection Impact Assessment
– This is completed through your university. Find out who the data protection officer is for your institution and contact them for advice.
4. Data sharing agreement
– This was organised by the Edinburgh research office, but they needed some time to prepare it in advance.
5. Data flow diagram
– eDRIS will make this for you, but you need to help them by telling them where data is coming from and going to.
Thoughts for fellow and future eCRUSADers
1. Future-proof your data
Projects often take several years, meaning datasets might be outdated before publication. Multiple versions of datasets may exist (e.g. SIMD data is available from 2012, 2016 and 2020; Census data is available from 2011, and now 2022), and variable names or definitions may change between versions. It’s worth checking with data custodians which version is best, and whether you can include multiple or future years’ datasets in your request.
2. Avoid manually copying code lists
For some datasets, such as PIS data, I had to compile lists of relevant BNF codes by drug category. If you can, avoid manual look-up and transcription, as you will inevitably make errors or omissions (or go insane). Instead, use curated code lists shared via platforms like eCRUSADers, RDS or GitHub, or included as supplementary materials in published studies. It’s much safer to use these types of tables, or to automate code selection (e.g. using R), than to manually type out the list yourself.
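
A minimal sketch of what automating code selection might look like is shown below, written in Python although the same idea carries over to R. It reads a category-to-prefix table from CSV text and matches codes by prefix. The column names, category names and code prefixes are placeholders invented for illustration and are not a validated BNF code list; in practice the table would come from a curated, shared source.

```python
import csv
import io
from typing import Optional

# Placeholder code list in CSV form. The categories and prefixes below are
# invented for illustration and are NOT a validated BNF code list.
CODE_LIST_CSV = """category,code_prefix
drug_group_a,9901
drug_group_a,9902
drug_group_b,9805
"""


def load_code_list(csv_text: str) -> dict[str, list[str]]:
    """Read a category -> list of code prefixes mapping from CSV text."""
    code_list: dict[str, list[str]] = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        code_list.setdefault(row["category"], []).append(row["code_prefix"])
    return code_list


def categorise(code: str, code_list: dict[str, list[str]]) -> Optional[str]:
    """Return the first category with a prefix matching the start of `code`."""
    for category, prefixes in code_list.items():
        if code.startswith(tuple(prefixes)):
            return category
    return None


codes = load_code_list(CODE_LIST_CSV)
```

Because the list lives in a table and not in the analysis script, it can be checked against its source, shared with others, and updated without retyping anything.
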
3. Be obsessive about version control
Version control was much more important than I initially realised. Losing changes between drafts can introduce serious errors to your final dataset. Multiple drafts of my application were sent back and forth between supervisors, stakeholders and advisors. If I could go back in time, I would create a robust system for naming, saving and backing up these drafts from the outset.
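
One possible way to make draft naming systematic is sketched below in Python. This is a suggestion and not a system used in the project: the naming pattern, the function names and the example file stem are all assumptions. The first helper builds a date-prefixed, zero-padded file name so drafts sort in order, and the second computes a checksum so two copies of a draft can be confirmed identical.

```python
import hashlib
from datetime import date
from pathlib import Path


def draft_name(stem: str, version: int, on: date, ext: str = "docx") -> str:
    """Build a sortable draft file name: ISO date, stem, zero-padded version."""
    return f"{on.isoformat()}_{stem}_v{version:02d}.{ext}"


def checksum(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


# Hypothetical example: the stem and date are made up for illustration.
name = draft_name("hsc-pbpp-application", 3, date(2024, 6, 3))
```

Here `name` is `2024-06-03_hsc-pbpp-application_v03.docx`. For code and plain-text documents, a dedicated tool such as Git is likely a better fit than file naming alone.
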
4. Revisions process
My HSC-PBPP and S-PBPP applications each went through three rounds of clarification with their respective panels. The questions asked were at times quite technical and methodological, which was daunting. I was very worried about describing my statistical plans in perfect detail. Now, I would say it’s actually more important to maintain momentum and respond quickly than to aim for perfection.
Public Benefit Privacy Panel Timelines
Project: PhD project “Understanding socioeconomic inequalities in cancer care: Data linkage to identify priorities for treatment equity”
| Stage | HSC-PBPP | S-PBPP |
|---|---|---|
| Preparation of application | August 2023 – June 2024 | August 2023 – November 2024 |
| Initial submission to approval | June 2024 – September 2024 | November 2024 – February 2025 |
| Approval to data access | September 2024 – March 2025 | February 2025 – October 2025 |
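
For anyone planning their own timeline, the elapsed time per stage can be worked out from the table with a few lines of Python. This is a rough sketch: the table gives months only, so the day of the month is set to 1 as a placeholder and the results are whole calendar months, not exact durations.

```python
from datetime import date


def months_between(start: date, end: date) -> int:
    """Whole calendar months from the start month to the end month."""
    return (end.year - start.year) * 12 + (end.month - start.month)


# Stage boundaries taken from the table above (day of month is a placeholder).
TIMELINE = {
    "HSC-PBPP": [date(2023, 8, 1), date(2024, 6, 1), date(2024, 9, 1), date(2025, 3, 1)],
    "S-PBPP": [date(2023, 8, 1), date(2024, 11, 1), date(2025, 2, 1), date(2025, 10, 1)],
}

for panel, dates in TIMELINE.items():
    # Months for each stage: preparation, submission to approval, approval to access.
    stages = [months_between(a, b) for a, b in zip(dates, dates[1:])]
    print(panel, stages, "total:", months_between(dates[0], dates[-1]))
```

On these dates, the time from starting to prepare an application to accessing the data was roughly 19 months for HSC-PBPP and 26 months for S-PBPP.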


