Use of secondary data analyses in research: Pros and Cons

LL Pederson; E Vingilis; CM Wickens; J Koval; RE Mann

ISSN: 2455-3484

Journal of Addiction Medicine and Therapeutic Science

Research Article Open Access Peer-Reviewed

LL Pederson*, E Vingilis, CM Wickens, J Koval and RE Mann

Author and article information

Adjunct Professor, Department of Family Medicine, Western University, Canada

*Corresponding author: Linda L Pederson, Ph.D, Adjunct Professor, Department of Family Medicine, Western University, 45 Orchard St London, Ontario N6J2R4; Canada, Tel: 519-661-9369; E-mail: lindap@mindspring.com

DOI: 10.17352/2455-3484.000039

Received: 26 June, 2020 | Accepted: 07 July, 2020 | Published: 08 July, 2020

Cite this as

Pederson LL, Vingilis E, Wickens CM, Koval J, Mann RE (2020) Use of secondary data analyses in research: Pros and Cons. J Addict Med Ther Sci 6(1): 058-060. DOI: 10.17352/2455-3484.000039

Main article text

Introduction

What are secondary data? Secondary data are data that were collected by someone other than the user, or that are used for a purpose other than the original one. A wide range of sources can be used as secondary data: censuses, information collected by government departments, organizational records, and data that were originally collected for other research purposes [1-3]. Yee and Niemeier [4] discuss the benefits of longitudinal data as compared to repeated cross-sectional information.

Use of repeated cross-sectional or longitudinal secondary data to explore social and health issues can provide comparative information about important environmental issues. For example, social or health-related information could be examined before, during and after the current COVID-19 pandemic to gain some understanding of the course and impact of the outbreak and to inform resource allocation. Using secondary analyses of survey data collected by the China CDC, Gao, et al. [5] were able to provide timely information demonstrating geographical differences and duration of coronavirus among health care workers in China.

Secondary data can answer two types of questions: descriptive and analytical. Hence, the information can be used to describe events or trends, or it can be used to examine relationships among variables cross-sectionally or longitudinally. Numerous secondary databases exist and many are available online (e.g., the European Bioinformatics Institute database [6] provides a searchable database of biologic sources that can be linked to survey data). The Centre for Addiction and Mental Health (CAMH) conducts repeated cross-sectional surveys of adults in Ontario, Canada (the CAMH Monitor). The Monitor has been used both descriptively and analytically and has provided important information on a multitude of health behaviors and policies.

Examples

An analysis of CAMH Monitor data from 1996-2006 provided important descriptive information about quitting smoking among individuals who were categorized as regular or occasional smokers. We found that the prevalence of having quit smoking for at least one year increased over time. In addition, females were more likely to show this increase than males, and older individuals more likely than younger ones [7]. These results provide us with the backdrop for examining additional questions in future research about why people quit, what programs might help people quit, and whether those who do quit are using new products that have become available such as e-cigarettes, waterpipes, smokeless tobacco and bidis. In addition, future research could be undertaken to explore whether methods of quitting have changed over time. Either survey questions could be developed to examine these issues or qualitative interviews could be used to supplement the information from the survey.

CAMH Monitor data have also been used descriptively to analyze effects of new legislation or policies by examining trends before and after the introduction of the legislation or policy, such as the potential impact of legislation on motor vehicle collisions in Ontario among smokers and nonsmokers. Legislation was enacted in Ontario in 2006 to prohibit smoking in vehicles when children and adolescents were present. We found that before the law was enacted the rate of reported collisions was higher among smokers than nonsmokers. Following the enactment of the legislation the rate among smokers decreased and there was no statistical difference between smokers and nonsmokers [8]. What is not known is whether drivers are in fact smoking while they are driving, whether they are aware of the legislation, and whether their patterns of smoking while driving changed because of the legislation. Another study examining cross-sectional CAMH data over time to assess legislative effects found that texting and driving declined after introduction of more severe penalties [9].
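The before-and-after comparison described above can be sketched as a pair of two-proportion z-tests, one for each period. This is only an illustration: the counts are invented, the function name `two_proportion_z` is hypothetical, and the test ignores survey weights and design effects that a real analysis of CAMH Monitor data would need to take into account. It is not the method used in [8].

```python
import math


def two_proportion_z(events_a, n_a, events_b, n_b):
    """Return (difference in proportions, z statistic, two-sided p-value)."""
    p_a = events_a / n_a
    p_b = events_b / n_b
    # Pooled proportion under the null hypothesis of equal rates.
    pooled = (events_a + events_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_a - p_b, z, p_value


# Hypothetical counts (invented for illustration, not CAMH Monitor results):
# (respondents reporting a collision, respondents surveyed)
before = {"smokers": (90, 1000), "nonsmokers": (180, 3000)}
after = {"smokers": (60, 1000), "nonsmokers": (174, 3000)}

results = {}
for label, period in (("before", before), ("after", after)):
    diff, z, p = two_proportion_z(*period["smokers"], *period["nonsmokers"])
    results[label] = (diff, z, p)
    print(f"{label}: difference={diff:.3f}, z={z:.2f}, p={p:.3f}")
```

With these invented counts, the smoker and nonsmoker rates differ detectably in the earlier period and not in the later one, which mirrors the pattern reported in the text without reproducing its data.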

Other examples of the use of CAMH Monitor data to evaluate policy interventions include Wickens, et al. [9], who assessed the impact of legislation to increase penalties for distracted driving on rates of texting and driving, and Mann, et al. [10], who evaluated the impact of legislation introducing administrative sanctions for impaired driving on rates of driving after drinking in the province. These secondary analyses can also be supplemented with qualitative interviews to provide some explanation and background for the original findings.

Other types of secondary databases are longitudinal, where large samples of individuals are followed over a number of years. For example, Wiesenthal and Vingilis [11] analyzed the Canadian National Population Health Survey (NPHS) descriptively and analytically to examine trends over time and relationships among variables. Specifically, they examined trajectories of distress in participants after they reported being injured in a motor vehicle collision. The NPHS, a Statistics Canada survey, is a repeated-measures longitudinal survey to monitor the health and wellbeing of 20,000 Canadians. Participants were interviewed biennially from 1994/95 to 2002/03 (5 waves of interviews over a 9-year span). Because of the longitudinal nature of the secondary database, hierarchical linear modelling was used to identify within-person trends; men experienced greater overall distress over time than women and a greater increase in distress over time. Moreover, the level of pre-injury distress predicted post-injury distress. This study revealed more complex and nuanced relationships among variables in their prediction of post-motor vehicle injury psychological distress.

This secondary database provided numerous benefits. First, motor vehicle injuries are rare events; however, a sample of 20,000 individuals interviewed over 9 years provided enough cases of motor vehicle injury to examine the effects of injuries on distress. Additionally, evidence had been mixed on whether pre-morbid distress predicted post-injury distress, as all previous studies had only retrospective data on pre-injury distress levels. The use of a longitudinal secondary database provided information on distress levels before the injury occurred. Finally, the large sample size of injured individuals in this secondary database allowed for examination of mediators and moderators of the effects.
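The idea of a within-person trend across survey waves can be illustrated with a simplified two-stage calculation: fit an ordinary least-squares slope to each participant's scores, then average the slopes by group. This is a stand-in for, not an implementation of, the hierarchical linear modelling used in [11]. The distress scores, participant identifiers, and the function name `within_person_slope` are all invented for the illustration.

```python
from statistics import mean


def within_person_slope(waves, scores):
    """Ordinary least-squares slope of one participant's scores on wave number."""
    w_bar, s_bar = mean(waves), mean(scores)
    sxx = sum((w - w_bar) ** 2 for w in waves)
    if sxx == 0:
        raise ValueError("need at least two distinct waves")
    sxy = sum((w - w_bar) * (s - s_bar) for w, s in zip(waves, scores))
    return sxy / sxx


# Five biennial waves, coded 0..4 (matching the 5-wave design in the text).
waves = [0, 1, 2, 3, 4]

# Hypothetical distress scores (invented values, not NPHS data).
panel = {
    "p1": {"sex": "M", "scores": [2, 3, 4, 5, 6]},
    "p2": {"sex": "M", "scores": [3, 3, 4, 4, 5]},
    "p3": {"sex": "F", "scores": [2, 2, 3, 3, 3]},
    "p4": {"sex": "F", "scores": [1, 2, 2, 2, 3]},
}

# Stage 1: one slope per participant (the within-person trend).
slopes = {pid: within_person_slope(waves, rec["scores"]) for pid, rec in panel.items()}

# Stage 2: average the slopes within each group.
mean_slope_by_sex = {
    sex: mean(slopes[pid] for pid, rec in panel.items() if rec["sex"] == sex)
    for sex in ("M", "F")
}
print(mean_slope_by_sex)
```

A hierarchical model estimates the individual and group levels jointly and handles missing waves and unequal precision, which this two-stage sketch does not; it is meant only to show what "within-person trend" refers to.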

Finally, secondary data can be administrative data, that is, official records such as hospital or police records. For example, an evaluation of new stunt driving legislation, using stunt driving charges and collision casualty statistics, identified a decrease in charges and collision casualties among young males after the 2007 street racing legislation was introduced [12,13]. In addition, different types of secondary data can complement each other. Hospital and police records can identify cases where individuals were apprehended or injured severely enough to go to hospital, while self-report data identify cases that might be missed by these more official sources.

Discussion

Of course, there are some important factors that need to be considered in the use of secondary data.

Pros: First, there is much information available that has been collected in the past. This information can be used to make important contributions to knowledge, provide recommendations for policy, and provide the backdrop for future research.

Second, because the information is already available, subsequent research can be conducted in a timely manner, without the longer timelines for submitting proposals for funding and collecting original data. This is particularly salient because events such as the introduction of policies, or historical events such as the current COVID-19 pandemic, often happen before researchers have any opportunity to prepare to collect the information needed to evaluate their impact. Third, large sample sizes are often available with secondary datasets, which is particularly important when investigating rare events. Moreover, certain types of secondary data have added benefits. For example, longitudinal secondary datasets have increased statistical power and can estimate a greater range of conditional probabilities compared to repeated cross-sectional secondary datasets [4].

The use of secondary data also gives researchers who have conducted the original surveys additional information that they can use to justify continuation of their original research. For example, there is strong epidemiological evidence connecting cannabis use to collision risk [13-16] that has spurred and informed experimental simulation studies examining precisely how cannabis affects driving [18,19].

Cons: As noted, secondary data may not provide all of the information of interest. Questions may not be worded as precisely as we would like for the specific research question. Analyses become more complicated if the question wording or methods of administration vary. In these cases, it is particularly difficult to decide how information from a range of years can be considered together. It is also critical to understand how the information was originally collected. Response rates to surveys have decreased over time, calling into question how representative the responses might be, which must be considered in the interpretation of secondary analyses. However, many well-designed surveys include sampling weights to counter the biases that may occur from non-representative sampling. Longitudinal secondary datasets can suffer from attrition, although this is sometimes addressed by replacing lost respondents [4].
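The role of the sampling weights mentioned above can be illustrated with a weighted prevalence estimate, in which each respondent counts in proportion to the number of people in the population they are taken to represent. The responses, weights, and the function name `weighted_prevalence` below are invented; real surveys supply their own weight variables and usually require design-based variance estimation as well, which this sketch omits.

```python
def weighted_prevalence(responses, weights):
    """Share of total weight carried by respondents who answered yes (True/1)."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w for r, w in zip(responses, weights) if r) / total


# Hypothetical sample (invented values): 1 = reports the behaviour, 0 = does not.
responses = [1, 1, 0, 0, 0]
# Hypothetical sampling weights: the two "yes" respondents come from an
# over-sampled group, so each represents fewer people in the population.
weights = [0.5, 0.5, 2.0, 1.0, 1.0]

unweighted = sum(responses) / len(responses)
weighted = weighted_prevalence(responses, weights)
print(f"unweighted={unweighted:.2f}, weighted={weighted:.2f}")
```

In this invented sample the unweighted estimate overstates the prevalence because the over-sampled group is also the group reporting the behaviour; applying the weights corrects for that imbalance.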

Online surveys are limited to those with access to the technology, may target sub-groups that are not the groups of interest in a secondary analysis, and are correlational, precluding cause-and-effect conclusions. Finally, ethics approval may be required if the information is being used for a purpose not originally proposed.

Conclusion

It is important to make note of the limitations when presenting the information from secondary data and what the potential impact on the interpretation of the results can be. Nevertheless, secondary analysis can make important contributions to knowledge as well as provide directions for future research and programs. Tripathy (2013) [20] notes that while secondary data analysis can make important contributions to knowledge, it is important to follow specific guidelines in the use of such information, one of the most important being anonymization of the information.

Acknowledgment

We would like to thank the reviewers for their suggestions and helpful comments.

References

Copyright

© 2020 Pederson LL, et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.