Understanding data quality
There is a lot of talk about data quality, but what does it actually mean?
In principle, it means that the data are fit for the purpose for which you intend to use them. At a basic level, this means that the concepts and definitions used match your requirements, and that you are content that fit-for-purpose methods have been used during the collection, processing and analysis of the data.
It is essential, before undertaking analysis, that you take time to understand the data that you plan to use and, in doing so, assess their strengths and weaknesses.
You should ask yourself questions such as:
- How were the data collected, e.g. were they from a census, a survey or administrative data, and what mode(s) of data collection were used?
- If from a survey, what was used as the sample frame? What was the size and design of the sample? What was the response rate, and what was included in and excluded from its calculation (e.g. were blank questionnaires counted as responses)?
- What were the concepts, definitions and questions used to collect the data? Do they match your data requirements?
- How were the data captured, coded and cleaned (e.g. editing and imputation)?
- If the dataset is longitudinal, have there been any changes over time to the concepts, definitions, questions and methods used to collect, process and analyse the data?
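The question about what is counted in a response rate matters more than it might appear. The following is a minimal sketch in Python, using made-up figures and a hypothetical `response_rate` function, of how the same survey can report two different rates depending on whether blank questionnaires are treated as responses. It assumes a simple rate (responding units divided by eligible sampled units); real surveys often use more detailed definitions.

```python
def response_rate(complete, partial, blank, nonresponse,
                  count_blank_as_response=False):
    """Responding units divided by all eligible sampled units.

    All arguments are counts of sampled units. Whether returned-but-blank
    questionnaires count as responses is controlled by a flag, because
    this choice varies between surveys.
    """
    eligible = complete + partial + blank + nonresponse
    if eligible == 0:
        raise ValueError("no eligible sampled units")
    responders = complete + partial
    if count_blank_as_response:
        responders += blank
    return responders / eligible


# Hypothetical figures for illustration only: 1,000 eligible sampled units.
strict = response_rate(600, 50, 100, 250)
lenient = response_rate(600, 50, 100, 250, count_blank_as_response=True)

print(f"Blank questionnaires excluded: {strict:.0%}")
print(f"Blank questionnaires included: {lenient:.0%}")
```

With these illustrative counts the two definitions differ by ten percentage points, which is why it is worth checking the documentation for the exact calculation before comparing response rates across sources.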
The Office for National Statistics produces Quality and Methodology Information reports for its statistical outputs, which are a useful source of information about how the data have been collected, processed and analysed.
In the European and UK official statistical systems, the following quality dimensions are commonly used:
- Relevance – the degree to which the statistical product meets user needs in both coverage and content
- Accuracy and Reliability – accuracy is the proximity between an estimate and the unknown true value. Reliability is the closeness of the early estimates to subsequent estimated values
- Timeliness and Punctuality – timeliness refers to the time gap between publication and the reference period. Punctuality refers to the gap between planned and actual publication dates
- Accessibility and Clarity – accessibility is the ease with which users are able to access the data, also reflecting the format in which the data are available and the availability of supporting information. Clarity refers to the quality and sufficiency of the metadata, illustrations and accompanying advice
- Comparability and Coherence – comparability is the degree to which data can be compared over time and domain. Coherence is the degree to which data that are derived from different sources or methods, but refer to the same topic, are similar.
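Some of these dimensions can be expressed as simple quantities. The sketch below, in Python, shows one way to do so under stated assumptions: timeliness as the number of days between the end of the reference period and publication, punctuality as the number of days between planned and actual publication, and reliability as the size of the revision between an early and a later estimate. The function names, dates and estimates are hypothetical and chosen only for illustration.

```python
from datetime import date


def timeliness_days(reference_period_end, published):
    """Days between the end of the reference period and publication."""
    return (published - reference_period_end).days


def punctuality_days(planned, published):
    """Days between planned and actual publication (0 means on time)."""
    return (published - planned).days


def revision(early_estimate, later_estimate):
    """Change from an early estimate to a later one (a reliability indicator)."""
    return later_estimate - early_estimate


# Hypothetical release: a quarterly figure for the period ending 30 September.
reference_end = date(2016, 9, 30)
planned_release = date(2016, 11, 15)
actual_release = date(2016, 11, 17)

print("Timeliness (days):", timeliness_days(reference_end, actual_release))
print("Punctuality (days):", punctuality_days(planned_release, actual_release))
print("Revision:", round(revision(1.8, 2.1), 2))
```

Smaller values indicate better performance on each measure, although what counts as acceptable depends on the use you have in mind for the data.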