Course Round Up: An Introduction to Data Science for Administrative Data Research

Oct 26, 2021

Dates of course: May 2021
Organised by: Scottish Centre for Administrative Data (SCADR)
Post summary: I was lucky enough to bag myself a spot on this year’s SCADR course “An introduction to data science for administrative data research”. In this post, I present an overview of the course and its content. I also provide my thoughts on the parts I felt to be most useful for fellow eCRUSADers as well as where I felt there were things missing.

In a rush? Skip ahead in the contents to find out about:

  1. The course in a nutshell and overall thoughts
  2. The course structure
  3. The lectures
  4. The practical sessions
  5. Thoughts on the most useful parts
  6. What was missing or could be improved?
  7. Who should attend the course
  8. How much does it cost and what time commitment is involved
  9. When will the course be running again? 

1. The course in a nutshell and overall thoughts

Overall, the course is designed to introduce researchers to the world of administrative data, with a particular focus on Scotland. The course includes a series of lectures and practical sessions, allowing participants to get some hands-on experience working with a synthetic administrative dataset.

The course was well structured and organised, the content was extremely useful, and the instructors were incredibly knowledgeable. I would definitely recommend it to anyone starting out in research using administrative data.


2. The course structure

The course was split into a mixture of lectures and practical lab sessions over a four-week period. Since this year’s course was delivered solely online, there was also a live Q&A session each week.

 


3. The lectures

Week 1: Week 1 kicked off with an introduction to the course, the learning outcomes and to administrative data. Chris Dibben, SCADR Director, welcomed the participants and gave a very useful summary of the history of research using administrative data in Scotland and the UK. He also outlined the key stages a researcher will go through when carrying out research using administrative data (see figure below). The following lectures, delivered by experienced SCADR researchers, also covered some of the different sources of administrative data in the UK, the benefits and limitations of working with administrative data and the Five Safes Framework. There was also a brief introduction to programming in R to assist with the practical sessions.

Week 2: The focus of week 2 was to introduce participants to some of the administrative datasets available for research in Scotland (and the UK). These included:

  • Scottish Government Education and Analytical Services data and the education datasets (Scotland focus)
  • Department for Work and Pensions data (UK focus)
  • National Records of Scotland data including Scottish Longitudinal Study (Scotland focus)
  • Health data (Scotland focus)

Each lecturer discussed the datasets available, the type of information they contain and how to access them, and provided links and email addresses for finding out more.

A further lecture in week 2 covered record linkage, the different methods of linkage and some of the implications of incorrect linkage on research. The final two lectures to assist with the practical sessions looked at working with dates and times, and indexing, linking and joining data.
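To make the idea of different linkage methods concrete, here is a minimal sketch that contrasts exact (deterministic) matching with a simple similarity-threshold match. The course itself used R. The Python below, the record layouts, the names and the 0.85 threshold are all invented for illustration and are not course material, and the fuzzy matcher is only a rough stand-in for formal probabilistic linkage.

```python
from difflib import SequenceMatcher

# Hypothetical records from two sources describing the same two people.
education = [
    {"id": "E1", "name": "Catriona MacLeod", "dob": "2001-03-14"},
    {"id": "E2", "name": "Jon Smith", "dob": "2000-07-02"},
]
health = [
    {"id": "H1", "name": "catriona  macleod", "dob": "2001-03-14"},
    {"id": "H2", "name": "John Smith", "dob": "2000-07-02"},
]


def normalise(name):
    """Lower-case a name and collapse repeated whitespace."""
    return " ".join(name.lower().split())


def deterministic_link(a, b):
    """Link records that agree exactly on normalised name and date of birth."""
    index = {(normalise(r["name"]), r["dob"]): r["id"] for r in b}
    return {r["id"]: index.get((normalise(r["name"]), r["dob"])) for r in a}


def fuzzy_link(a, b, threshold=0.85):
    """Link on exact date of birth plus a name similarity above a threshold."""
    links = {}
    for r in a:
        best_id, best_score = None, 0.0
        for s in b:
            if r["dob"] != s["dob"]:
                continue  # require agreement on date of birth
            score = SequenceMatcher(
                None, normalise(r["name"]), normalise(s["name"])
            ).ratio()
            if score > best_score:
                best_id, best_score = s["id"], score
        links[r["id"]] = best_id if best_score >= threshold else None
    return links


exact = deterministic_link(education, health)
fuzzy = fuzzy_link(education, health)
```

In this toy example the exact match misses the "Jon" / "John" pair, while the fuzzy match links it. Lowering the threshold would recover more true matches but would also risk linking different people, which is the kind of trade-off that feeds through into research results.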

Week 3: Week 3 explored data provenance, law, trust and public engagement. The data provenance lecture, delivered by SCADR co-Director Iain Atherton, outlined the importance of understanding where your data come from, because ultimately this will affect what you get out of them. He also provided a useful outline of steps researchers should take when working with administrative datasets.

The lecture on law, trust and public engagement was incredibly helpful for any researcher starting out in administrative data research (I would go as far as to say that attendance at this lecture should be mandatory for anyone who wants to access administrative data for research!). The session highlighted how important it is to develop the social licence when carrying out research on unconsented personal information. It also covered some useful questions to ask ourselves when thinking about public engagement. For example, to whom does your research relate? Who will be impacted by the outcomes of your research? What conditions or social issues does your project explore? Do the findings have the potential to impact a wide proportion of the public?

The final week 3 lecture explored data visualisation in R.

Week 4: The lectures in week 4 gave participants insight into existing research that has used administrative data in Scotland. The projects covered used Scotland’s maternity dataset (SMR02), linked survey and DWP data, drug consignment data, and Scottish Government data on looked after children.

Each of the speakers also talked about the main challenges faced during those research projects and in particular imparted their words of wisdom for other researchers. The lecturers also pointed towards some published work using Scottish administrative data. For example:

Clemens, T., Dibben, C., Pearce, J., et al. Neighbourhood tobacco supply and individual maternal smoking during pregnancy: a fixed-effects longitudinal analysis using routine data. Tobacco Control 29, 7–14 (2020).

Pattaro, S., Bailey, N. & Dibben, C. Using Linked Longitudinal Administrative Data to Identify Social Disadvantage. Soc Indic Res 147, 865–895 (2020).


4. The practical sessions

Week 1: The practical sessions started in week 1 with an introduction to the R environment. Note that the course is designed for people who have used R before. However, even if you have no experience in R, the course instructors were very willing to help and the instructions for carrying out the data cleaning and analysis were extremely thorough.

The practical sessions use a synthetic Not in Education, Employment or Training (NEET) dataset. This dataset was created to mimic the relationships between variables in the original NEET data, but the observations do not pertain to real individuals.

Week 2: The focus of week 2 was first to handle the date and time variables in the dataset, create new variables and check for any anomalies, all of which are typical exercises when you work with administrative datasets. In the second part, participants had to convert the dataset from long to wide format (another common task when working with administrative datasets) and merge datasets together.

Week 3: Week 3 was all about visualisation: exploring the data to look for patterns and to check for any anomalies.

Week 4: The final practical session was focussed on modelling the data and producing tables of output.
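To give a flavour of the week 2 tasks (working with dates, checking for anomalies, reshaping from long to wide and merging), here is a minimal sketch. The practical sessions themselves were in R and used the synthetic NEET dataset. The Python below, along with every field name and value in it, is invented for illustration and is not taken from the course materials.

```python
from datetime import date

# Hypothetical long-format data: one row per person per spell.
spells = [
    {"person_id": 1, "spell": 1, "start": "2015-09-01", "end": "2016-06-30"},
    {"person_id": 1, "spell": 2, "start": "2016-09-01", "end": "2017-06-30"},
    {"person_id": 2, "spell": 1, "start": "2016-01-10", "end": "2015-12-01"},
]

# Hypothetical person-level lookup table to merge onto the wide data.
people = {1: {"sex": "F"}, 2: {"sex": "M"}}


def add_duration(row):
    """Parse the ISO date strings and derive a new 'days' variable."""
    start = date.fromisoformat(row["start"])
    end = date.fromisoformat(row["end"])
    return {**row, "days": (end - start).days}


def long_to_wide(rows):
    """Collapse to one record per person, with one 'days_<spell>' column per spell."""
    wide = {}
    for row in rows:
        person = wide.setdefault(row["person_id"], {"person_id": row["person_id"]})
        person[f"days_{row['spell']}"] = row["days"]
    return wide


def merge(wide, lookup):
    """Left join: keep every person in 'wide', adding lookup fields where present."""
    return [{**person, **lookup.get(pid, {})} for pid, person in wide.items()]


with_days = [add_duration(r) for r in spells]

# Anomaly check: a spell that ends before it starts has a negative duration.
anomalies = [r for r in with_days if r["days"] < 0]
clean = [r for r in with_days if r["days"] >= 0]

merged = merge(long_to_wide(clean), people)
```

Here the spell for person 2 is flagged as an anomaly because its end date precedes its start date. Whether to drop, correct or query such records is a judgement call that depends on the dataset and on advice from the data owner.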


5. What were the most useful parts?

  • The pre-recorded lectures worked very well. You could listen in your own time and speed up playback if you wanted to.
  • The links and information about using specific datasets in Scotland.
  • Hearing from researchers who have worked with specific administrative datasets in Scotland and seeing examples of published work using those datasets.
  • Learning top tips from experienced researchers that you can take into your own research.
  • Getting to do hands-on work with synthetic data, which gives researchers experience in dealing with common problems in administrative datasets.
  • The clear message that administrative data are often unconsented personal data and that it is vital we develop the social licence to use them.


6. What was missing or could be improved? 

  • More information on the specific data access processes and how long they typically take.
  • Information on the specific information governance training that researchers can do, for example the ONS Safe Researcher Training.
  • A practical session that includes carrying out disclosure control requests and checks.
  • A dataset that really looked like an administrative dataset: the NEET data were already pretty clean and even came with a codebook!
  • Information and guidance on how researchers can involve the public and patients in their research.
  • Information on how researchers can contact coders.


7. Who should attend this course?

The course is most suitable for those who are new to the world of administrative data research, particularly in Scotland. However, the course would also be useful for anyone working with administrative data in the UK, as many of the lessons learned will translate.


8. How much does it cost and what time commitment is involved?

In 2021, the course cost £120 per person. The course runs over four weeks, with both lectures and practical sessions. In total, SCADR estimate that you will need around 4-6 hours per week to watch the lectures and read the teaching materials, and a further 3-4 hours per week to join the practical sessions.


9. When will the course be running again?

To enquire when the next training course will run, you can email scadr@ed.ac.uk.
