The aim of data preparation and data analysis is to increase or enhance data quality.
Why Data Preparation?
In science, working with data of questionable quality leads to false results, which not only hinder the gain of knowledge but can also have undesirable practical consequences. Careful data preparation and analysis are therefore essential. Data preparation comprises all well-founded and documented processing of, or changes to, the raw data that increase the validity and (re)usability of the data and prepare them for analysis. This includes:
- the creation of structured data sets from the raw data
- the annotation (commentary) of the data
- the anonymisation of the data records
- the data correction
- the data transformation
Data quality can be specified for quantitative data by a number of criteria, including:
- uniformity (e.g. of dates and currencies, use of acronyms)
- exclusion of duplicate values/duplicate data rows
- proper handling of missing values
- detection and treatment of outliers, although this often takes place only during data analysis
- plausibility of the response patterns
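Several of the criteria above can be checked programmatically. The following is a minimal sketch using only the Python standard library, not a prescribed procedure. The records, the field names (`id`, `date`, `weight_kg`), the two accepted date formats and the choice of Tukey's 1.5 x IQR rule for outliers are all assumptions made for illustration.

```python
from datetime import datetime
from statistics import quantiles

# Hypothetical raw records; all field names and values are invented.
raw_rows = [
    {"id": "P01", "date": "2023-01-05", "weight_kg": "70.2"},
    {"id": "P02", "date": "05.01.2023", "weight_kg": "68.9"},
    {"id": "P02", "date": "05.01.2023", "weight_kg": "68.9"},   # duplicate row
    {"id": "P03", "date": "2023-01-06", "weight_kg": ""},        # missing value
    {"id": "P04", "date": "2023-01-06", "weight_kg": "71.5"},
    {"id": "P05", "date": "2023-01-07", "weight_kg": "69.8"},
    {"id": "P06", "date": "2023-01-07", "weight_kg": "700.0"},   # likely typing error
]

DATE_FORMATS = ("%Y-%m-%d", "%d.%m.%Y")  # assumed input formats


def to_iso_date(text):
    """Return the date as ISO 8601 (YYYY-MM-DD), trying each assumed format."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognised date format: {text!r}")


def prepare(rows):
    """Unify dates, mark missing values as None, and drop exact duplicates."""
    seen, cleaned = set(), []
    for row in rows:
        record = {
            "id": row["id"],
            "date": to_iso_date(row["date"]),
            "weight_kg": float(row["weight_kg"]) if row["weight_kg"] else None,
        }
        key = tuple(record.items())
        if key not in seen:  # keep only the first copy of identical rows
            seen.add(key)
            cleaned.append(record)
    return cleaned


def flag_outliers(rows, field):
    """Return the ids of rows whose value lies outside 1.5 * IQR (Tukey's rule)."""
    values = [r[field] for r in rows if r[field] is not None]
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    low, high = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
    return [r["id"] for r in rows
            if r[field] is not None and not low <= r[field] <= high]


cleaned = prepare(raw_rows)
missing = [r["id"] for r in cleaned if r["weight_kg"] is None]
outliers = flag_outliers(cleaned, "weight_kg")
```

The sketch only flags missing values and outliers rather than deleting them, so that the decision on how to treat them can be made and documented separately.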
Data preparation fulfils several functions:
Avoidance of incorrect results
An evaluation of incomplete or incorrect data leads to incorrect results (the so-called garbage-in, garbage-out principle). Analysing an uncorrected data set that contains, for example, duplicates, typing errors or implausible answers can distort the entire result and lead to wrong conclusions. Such problems can be avoided if data quality is checked and ensured from the beginning. The difficulty with poor data quality is that it is only noticed once it is checked. If it is discovered late in the research process, for example only during or at the end of the data analysis, all previously performed analysis steps have often been in vain.
Avoidance of difficulties and delays in data analysis
The aim is to ensure that current and subsequent data analyses run smoothly, whether they are carried out by the researcher, research partners or other collaborators who wish to re-analyse the data or perform secondary analyses. This requires strict organisation and sufficient annotation of the data sets by means of metadata (e.g. exact details of when, where and by whom the data were collected, what variable names and measured values mean, etc.). In addition, uniform forms of presentation and compatible formatting are necessary for the exchange and evaluation of data. A poorly prepared data set can become unusable, for example if missing annotation or a missing code plan makes it impossible, or possible only with great effort, to reconstruct later what certain measured values actually mean.
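As an illustration of such metadata, a code plan can be kept in machine-readable form alongside the data. The sketch below is one possible layout, not a standard. The variable names, codes, collection details and the helper functions `undocumented_variables` and `decode` are hypothetical.

```python
# Hypothetical code plan (codebook): each variable gets a label and either
# a unit or a mapping from stored codes to their meaning.
codebook = {
    "sex": {
        "label": "Sex of participant",
        "codes": {1: "female", 2: "male", 9: "not stated"},
    },
    "weight_kg": {"label": "Body weight", "unit": "kg"},
}

# Hypothetical collection metadata: when, where and by whom.
dataset_metadata = {
    "collected_by": "J. Doe",
    "collected_on": "2023-01-05",
    "location": "Lab 2",
}

# Column names as they appear in the (invented) data file.
data_columns = ["sex", "weight_kg", "grp"]


def undocumented_variables(columns, codebook):
    """Return the columns that have no entry in the code plan."""
    return [c for c in columns if c not in codebook]


def decode(variable, value, codebook):
    """Translate a stored code into its documented meaning, if one exists."""
    codes = codebook[variable].get("codes")
    if codes is None:
        return value  # measured value, no coding to resolve
    return codes.get(value, f"undocumented code {value}")


gaps = undocumented_variables(data_columns, codebook)
```

Here `gaps` lists the column `grp`, which would have to be documented before the data set is passed on.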
Avoidance of ethical problems
Especially in qualitative research and in clinical studies, failure to anonymise the raw data can make participants identifiable. Unless the study participants have given explicit consent (e.g. permitted naming in an expert interview), identifiability violates not only the GSP but also data protection laws. This applies regardless of whether the identification actually results in noticeable adverse effects for an individual participant.
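A first step towards this can be sketched as follows: direct identifiers are removed and replaced by random participant codes. Strictly speaking this is pseudonymisation, and it may not be sufficient on its own, because combinations of the remaining fields can still identify a person. The field names, the set `DIRECT_IDENTIFIERS` and the function `pseudonymise` are assumptions for illustration.

```python
import secrets

# Assumed set of fields that identify a person directly.
DIRECT_IDENTIFIERS = {"name", "email"}


def pseudonymise(records, id_field="name"):
    """Replace direct identifiers with random participant codes.

    Returns the cleaned records and the key table (person -> code).
    The key table must be stored separately and protected, or destroyed.
    """
    key_table = {}
    output = []
    for record in records:
        person = record[id_field]
        if person not in key_table:
            code = "P-" + secrets.token_hex(4)
            while code in key_table.values():  # avoid an unlikely collision
                code = "P-" + secrets.token_hex(4)
            key_table[person] = code
        # Copy every field except the direct identifiers.
        safe = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
        safe["participant_id"] = key_table[person]
        output.append(safe)
    return output, key_table


# Invented example records.
interviews = [
    {"name": "Alice Example", "email": "alice@example.org", "answer": "yes"},
    {"name": "Bob Example", "email": "bob@example.org", "answer": "no"},
    {"name": "Alice Example", "email": "alice@example.org", "answer": "maybe"},
]
safe_records, key_table = pseudonymise(interviews)
```

Random codes are used instead of hashes of the name so that a code cannot be recomputed from a known name.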
Since data preparation is an intervention in the data, the procedure (the step from raw to secondary data) must be justified and documented accordingly in the ELN / results report. It is therefore essential to write down a defined algorithm for how the data are to be prepared, and to ensure that:
- deviations from it are justified,
- an evaluation schema is stored, and
- requirements are established for which data must be available.
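One way to make such a defined procedure reproducible is to write it as an ordered list of named steps, each with its justification, and to log what every step did. The sketch below assumes invented step names, justifications and field names. The resulting log is meant as material for the ELN entry, not as a replacement for it.

```python
def remove_duplicates(rows):
    """Keep only the first occurrence of identical rows."""
    seen, result = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            result.append(row)
    return result


def drop_missing_weight(rows):
    """Exclude rows without a weight measurement."""
    return [row for row in rows if row["weight_kg"] is not None]


# Hypothetical preparation plan: (step name, function, justification).
PREPARATION_PLAN = [
    ("remove_duplicates", remove_duplicates,
     "Identical rows stem from a repeated export"),
    ("drop_missing_weight", drop_missing_weight,
     "Weight is required for the planned analysis"),
]


def run_plan(rows, plan):
    """Apply each step in order and record row counts before and after."""
    log = []
    for name, step, justification in plan:
        rows_in = len(rows)
        rows = step(rows)  # each step returns a new list; raw data stay untouched
        log.append({
            "step": name,
            "justification": justification,
            "rows_in": rows_in,
            "rows_out": len(rows),
        })
    return rows, log


# Invented raw data.
raw = [
    {"id": "A", "weight_kg": 70.0},
    {"id": "A", "weight_kg": 70.0},
    {"id": "B", "weight_kg": None},
]
prepared, preparation_log = run_plan(raw, PREPARATION_PLAN)
```

Because the raw list is never modified, the same plan can be rerun later to reproduce the secondary data from the raw data.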
So that others can understand how the data are evaluated, an openly designed analysis plan should be stored in the ELN.
Spreadsheet programs (e.g. Excel) are used for simple evaluations, whereas statistical programs such as SPSS are preferred for more complicated ones.
For more information, see the module "Evaluation".