A scoping review on data integration in the field of infectious diseases, 2009-2018

Background and Aim: Little is known about data integration in public health research and its impact. This study aimed to summarize known collaboration information, the characteristics of the datasets used, the methods of data integration, and knowledge gaps. Materials and Methods: We reviewed papers on infectious diseases that used two or more datasets and were published during 2009-2018, before the coronavirus disease pandemic. Two independent researchers searched the Medline and Global Health databases using predetermined criteria. Results: Of the 2375 items retrieved, 2272 titles and abstracts were reviewed. Of these, 164 underwent full-text review; after excluding 11 papers that did not meet our inclusion criteria, 153 relevant articles remained. Of the 153 papers, 150 were single-country studies. Most papers were from North America (n=47). Viral diseases were the most commonly researched diseases (n=66), and many studies sought to define infection rates (n=62). Data integration usually employed unique national identifiers (n=37) or address-based identifiers (n=30). Most papers combined two data sources (n=121), and at least one data source typically included routine surveillance information. Conclusion: We found a growing usage of data integration in infectious diseases, emphasizing the advantages of data integration and linkage analysis, and reiterating its importance in public health emergency preparedness and response.


Introduction
The public health and socioeconomic consequences of past outbreaks and the current coronavirus disease 2019 (COVID-19) pandemic emphasize that the prevention and mitigation of outbreaks are critical. Outbreaks of West African Lassa fever, Ebola virus, Middle East respiratory syndrome coronavirus, and COVID-19 have shown that humans, animals, and the environment contribute to the development and spread of emerging infectious diseases [1][2][3][4].
The "One Health" initiative emphasizes the need for a collaborative and multisectoral approach to improving public health surveillance and risk management. International health organizations, including the World Health Organization and the Centers for Disease Control and Prevention, have long made it clear that cooperation among agencies and ministries in terms of data integration/sharing is of fundamental importance when seeking to increase public health surveillance capacity [5][6][7]. Collaboration efforts among governmental public health agencies have shown that no single actor/agency has all the knowledge and capacity about emerging infectious diseases. Such diseases are complex and associated with many uncertainties. Enhancements of national and global surveillance systems that collect and analyze the interconnections between people, animals, and the shared environment are critical when trying to prevent or rapidly respond to infectious diseases. Governments, scholars, and organizations worldwide have paid increased attention to public health surveillance and risk management. Recent advances in informatics have also greatly aided data integration [4,8].
Although environmental factors and zoonoses have become matters of public health concern that require a multisectoral approach such as the "One Health" initiative, including risk monitoring through public surveillance systems, surveillance systems remain independent [1,9,10]. Despite the increasing interest in multisector collaboration/data integration worldwide [11,12], little is known about how data are integrated by public health researchers and the impact of such work.
This study aimed to review recent publications on infectious diseases and summarize information on the sectors that collaborate, the characteristics of the datasets, the methods of data integration, and knowledge gaps.
Available at www.onehealthjournal.org/Vol.7/No.2/1.pdf

Ethical approval
This study is based on a literature review; hence, ethical approval was not necessary.

Study period and location
Data were extracted and interpreted from July 2019 to August 2019 at Seoul National University.

Scoping review
A scoping review seeks to define the direction and implications of future research by analyzing existing trends [13]. Such a review identifies research questions, searches for appropriate studies, selects studies using predefined criteria, and gathers and summarizes data [14,15].

Research questions
How has data integration in public health research on infectious diseases been presented? This study specifically explores the characteristics of the datasets used, the methods of data integration, and knowledge gaps.
We systematically searched the Medline and Global Health databases for papers published between January 1, 2009, and December 31, 2018, on data integration using search terms that had been identified by a group of experts in epidemiology and infectious diseases (Table-1).

Inclusion criteria
We included studies that used two or more different data sources to generate new information promoting public health, with a specific focus on infectious disease, under the following headings: Public health, infectious disease, collaboration, and study findings.

Exclusion criteria
Incident reports, modeling studies lacking actual data, reviews and perspectives, letters to editors, clinical trials, reports on laboratory procedures, and publications in languages other than English were excluded from the study.

Literature selection
Two researchers independently reviewed the article titles and abstracts of all the initial search results. Articles considered eligible by either reviewer underwent a full-text review. If the two reviewers disagreed on inclusion, a third reviewer made the final decision.

Data charting and synthesis
An Excel spreadsheet, developed in consultation with senior researchers, was used for data charting. The following information was extracted: (1) General information (year of publication, country of study, areas of study, and objectives), (2) the research topic (infectious disease classification according to the pathogenic agent), (3) database characteristics (numbers and types of data sources), (4) integration methods, and (5) collaboration information. The synthesis included quantitative analysis (frequency) and qualitative analysis (thematic) to identify topics, objectives, methods, and gaps in data integration and/or linkage in infectious disease research. In addition, to identify and categorize the sectors from which data were provided, we assigned datasets containing key information (results) to Group 1 and the other dataset to Group 2. If three or more datasets were integrated, the two principal datasets utilized for or associated with the primary study objective were selected. All datasets were classified into nine categories of sectors (A-I) in terms of data types and combinations. Finally, we evaluated the papers using four questions modified from the REporting of studies Conducted using Observational Routinely Collected Health Data (RECORD) checklist [16].
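The frequency part of this synthesis can be illustrated with a minimal sketch. The charting rows below are hypothetical placeholders, not the review's data, and the sector labels are assumptions chosen only to resemble the categories described above; each row is an ordered (Group 1, Group 2) pair for one paper.

```python
from collections import Counter

# Hypothetical charting rows (NOT the review's data): one ordered
# (Group 1 sector, Group 2 sector) pair per included paper.
charted = [
    ("routine surveillance", "routine surveillance"),
    ("routine surveillance", "hospital/clinical"),
    ("disease registry", "routine surveillance"),
    ("routine surveillance", "hospital/clinical"),
]

# Frequency of each sector as the primary (Group 1) dataset.
group1_counts = Counter(g1 for g1, _ in charted)
# Frequency of each sector as the secondary (Group 2) dataset.
group2_counts = Counter(g2 for _, g2 in charted)
# Frequency of each ordered (Group 1, Group 2) combination.
pair_counts = Counter(charted)

for pair, n in pair_counts.most_common():
    print(pair, n)
```

Counting ordered pairs keeps the primary/secondary distinction; an unordered tally would need each pair sorted before counting.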

Results
Of 2375 initially retrieved papers, 2272 titles and abstracts were reviewed (103 excluded: Not in the English language). Of these, 164 were selected for secondary full-text review and 2108 were excluded (1378 single-dataset papers, 347 review articles, 120 modeling studies, 114 laboratory procedures, 69 outbreak investigation reports, 16 clinical trials, 15 case reports, and 49 other works, including non-communicable disease articles and study protocols). After excluding 11 papers that did not meet our inclusion criteria, we collected data from 153 articles (Figure-1).
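The screening counts reported above can be cross-checked with simple arithmetic. The following sketch uses only the numbers stated in this paragraph; the variable names are illustrative.

```python
# Counts as reported in the Results paragraph above.
retrieved = 2375
non_english = 103
excluded_at_screening = {
    "single-dataset papers": 1378,
    "review articles": 347,
    "modeling studies": 120,
    "laboratory procedures": 114,
    "outbreak investigation reports": 69,
    "clinical trials": 16,
    "case reports": 15,
    "other works": 49,
}
excluded_at_full_text = 11

# Titles and abstracts screened after removing non-English items.
screened = retrieved - non_english
# Papers passed on to full-text review.
full_text = screened - sum(excluded_at_screening.values())
# Papers included in the final synthesis.
included = full_text - excluded_at_full_text

print(screened, full_text, included)  # 2272 164 153
```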
We reviewed 153 papers published from 2009 to 2018; most (n=150) were single-country studies. Table-2 presents the number of studies by country and only includes studies conducted in a single country (n=150). In Table-2, the United States accounted for the highest number of reports (37), followed by the United Kingdom (21) and Canada (10). According to the World Bank national income classification [17], high-income countries accounted for 107 reports; only two were from low-income countries. Table-3 shows that most studies explored viral diseases (n=66), including infection with the human immunodeficiency virus (HIV). Data integration was applied to estimate incidence and mortality (n=62), identify risk factors (n=49), assess program impact (n=20), and perform spatial analysis (n=17). In most data integration approaches, unique national identifiers (n=34) or address-based identifiers (n=30) were used.
In Table-4, we summarize the types of sectors from which datasets were provided and the pairs of datasets used. Routine surveillance (n=96) was the most frequently used primary source of the dataset (Group 1), followed by physical/clinical data (n=20) and disease registry data (n=12). For the secondary dataset (Group 2), routine surveillance (n=45) was again the most commonly used data source, followed by hospital/clinical data (n=34). A total of 28 papers integrated datasets derived from two different routine surveillance sources, and 25 papers used routine surveillance and hospital/clinical datasets. Table-5 shows the results of the evaluation of selected papers using items from RECORD. Most of the papers described the name of the dataset (n=149), geographic region (n=143), and timeframe (n=130) in the title or abstract; 10 studies did not include information on data integration.

Discussion
We reviewed and analyzed studies that sought to extend current public health knowledge in the field of infectious diseases by integrating two or more data sources. The novel information may not have been attainable had the data sources been analyzed individually. Our work emphasizes the potential utility of data integration and identifies gaps in how research is performed and reported.
First, we found that the use of data integration when studying infectious diseases was very rare in low-income countries that lack resources, deliver suboptimal care, and have poor public health systems. In such countries, the infectious disease burden is high and outbreaks occur frequently. However, it should be noted that data integration for public health surveillance may already exist in these countries, but for internal country use and not necessarily for wider research purposes. Countries with a lack of resources often face barriers and disincentives to publication (e.g., technology, cost, and publishing processes), leading to some publication bias. Given the global mobility of the population today, systematic cooperation among government agencies has become essential [11,18]. Of the studies reviewed, two were conducted in Ethiopia. One linked malaria infection records to local health information, identified differences in transmission patterns, and generated data aiding the development of future strategies [19]. Another study linked temporal data to spatial information when exploring the path of malaria propagation, yielding primary data facilitating appropriate preventative interventions [20]. The findings suggest that examining the current status of the establishment and linkage of monitoring systems across low-income countries should be a priority.
Second, most studies that linked data explored viral diseases, including HIV. New human viruses will continue to emerge, often in the absence of any vaccine or known therapeutic intervention. Virus-related research is crucial; we must understand the basic biology of new viral pathogens and human susceptibilities [21,22]. An effective, global surveillance system for novel viruses is urgently needed to track future transmission and help inform vaccine needs and development. In addition, a well-established national surveillance system can help clinicians promptly detect and respond to these threats and help establish local vaccination strategies.
Third, in terms of sources used for data integration, national registration numbers served as effective links connecting various data sources through individual numbers, names, and ages. Government-organized data linkages are possible in Australia, the United Kingdom, and Northern Europe. In Australia, only an "integrating authority" can link data provided by researchers belonging to a data integration partnership, whereas Scotland has a "Health and Social Care Data Integration" framework [23,24].
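As an illustration of linkage through a unique identifier, the following is a minimal sketch of deterministic one-to-one linkage. It is not the method of any reviewed study; the function name `link_on_identifier`, the field name `person_id`, and the sample records are all hypothetical, and the sketch assumes each identifier appears at most once in the secondary dataset.

```python
def link_on_identifier(primary, secondary, key="person_id"):
    """Deterministically link two lists of records on a shared identifier.

    Returns (linked, unmatched): merged records for identifiers found in
    both datasets, and primary records with no match. Assumes identifiers
    are unique within `secondary`; on duplicates the last record wins.
    """
    lookup = {row[key]: row for row in secondary}
    linked, unmatched = [], []
    for row in primary:
        match = lookup.get(row[key])
        if match is None:
            unmatched.append(row)
        else:
            # Primary (Group 1) values take precedence on field conflicts.
            linked.append({**match, **row})
    return linked, unmatched


# Hypothetical records for illustration only.
surveillance = [
    {"person_id": "A1", "disease": "hepatitis B"},
    {"person_id": "B2", "disease": "measles"},
]
hospital = [{"person_id": "A1", "admitted": True}]

linked, unmatched = link_on_identifier(surveillance, hospital)
print(len(linked), len(unmatched))  # 1 1
```

Returning the unmatched records alongside the linked ones makes the linkage rate easy to report, which is one of the details a reader needs to judge the integrated dataset.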
One of the key advantages of data integration is that it can proceed to levels beyond what is possible with individual data sources. Population sizes are expanded and the denominators increase, enhancing the reliability of the disease-related estimates. The combination of routine surveillance and hospital data can reduce information bias or misclassification bias by including information from laboratory and clinical examinations. For example, in Scotland, mortality data collected over many years were linked to hospital data on Clostridium difficile infections, yielding an incidence rate calculated from unbiased population-based data [25]. In addition, data integration helps identify new factors that must be considered when establishing or prioritizing public health policies. For example, in the USA, community risk factors for Campylobacter infection, a food-borne disease (which had been ignored in the past), were assessed by integrating Campylobacter surveillance data with socioeconomic and environmental information [26].
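The effect of a larger denominator on the reliability of an estimate can be made concrete with the standard error of a proportion. The figures below are illustrative assumptions, not values from any reviewed study: an estimated proportion of 0.05 and denominators of 1000 and 4000.

```latex
\mathrm{SE}(\hat{p}) = \sqrt{\frac{\hat{p}\,(1-\hat{p})}{n}}
\qquad
\begin{aligned}
n = 1000:&\quad \sqrt{\tfrac{0.05 \times 0.95}{1000}} \approx 0.0069 \\
n = 4000:&\quad \sqrt{\tfrac{0.05 \times 0.95}{4000}} \approx 0.0034
\end{aligned}
```

Under these assumptions, quadrupling the denominator halves the standard error, because the standard error scales with the inverse square root of n.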
In connection with the "One Health" initiative, data integration can be utilized for the control and prevention of zoonotic or vector-borne diseases by monitoring or exploring potential risk factors for humans. For instance, a Q-fever outbreak in goats in the Netherlands led to thousands of human cases. In such cases, data integrated from veterinary sectors and services could have enabled public health authorities to monitor outbreaks and avoid human cases [27]. Such merged data from multiple disciplines and sectors can reduce the time needed to detect the source of infection and identify the animal host reservoir of a human or zoonotic disease. In addition, practical resources can be shared and interpreted through data integration to strengthen cross-sector surveillance collaboration.
Finally, data integration not only yields useful insights but also extends the spatiotemporal scope of research. For example, a study from South Africa combined HIV monitoring and demographic data to assess the survival rate of children based on maternal HIV status. Such integrated data are important in terms of HIV treatment planning and adequate resource allocation [28]. Another study from Australia examined geographic clusters of babies born before arrival (BBA) by linking demographic and geographic information, thus identifying relevant geographic BBA status over time [29].
However, if the dataset communication standards vary, it is necessary to verify the accuracy of the data integration. Are the definitions of variables identical in both datasets? Were the principal data adequately identified? As data integration is not automated, researchers must critically examine every step of integration, including implementation and verification. Despite these difficulties, the expansion of research scope, especially that of large-scale epidemiological studies of infectious diseases, yields valuable new information on the pattern and direction of disease spread. Our ethical considerations included the need for patient privacy, data confidentiality and security, and respect for intellectual property.
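One of the verification questions above, whether variable definitions are identical in both datasets, can be checked mechanically when each dataset comes with a codebook. The sketch below assumes each codebook is a simple mapping from variable name to a textual definition; the function name `compare_definitions` and the example codebooks are hypothetical.

```python
def compare_definitions(codebook_a, codebook_b):
    """Return shared variables whose definitions differ between two codebooks.

    Each codebook maps a variable name to its definition. The result maps
    each conflicting variable to the pair (definition_a, definition_b).
    """
    shared = codebook_a.keys() & codebook_b.keys()
    return {
        name: (codebook_a[name], codebook_b[name])
        for name in sorted(shared)
        if codebook_a[name] != codebook_b[name]
    }


# Hypothetical codebooks for illustration only.
surveillance_codebook = {
    "age": "years at diagnosis",
    "sex": "M/F",
    "onset": "ISO date",
}
hospital_codebook = {
    "age": "years at admission",
    "sex": "M/F",
    "ward": "text",
}

mismatches = compare_definitions(surveillance_codebook, hospital_codebook)
print(mismatches)  # {'age': ('years at diagnosis', 'years at admission')}
```

A string comparison only flags differences in wording; deciding whether two differently worded definitions are equivalent still requires the manual review described above.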

Limitations
Our study had several limitations. First, we may have missed relevant papers despite our efforts to include broad search terms to find studies featuring data integration or the use of a multisectoral approach. Most of our results are presented within the public and medical health systems. We assume that this may be due to a lack of understanding or knowledge between human and animal or environmental sectors in terms of data features, such as types of surveillance, different entry points, and different search terms. We suggest that future studies include specific search terms or focus on specific diseases. Next, some papers did not adequately identify the sources of the datasets employed; we used our discretion to fill these gaps based on comments in the Discussions or Acknowledgments or the characteristics of the institutions involved (e.g., academia and government public health agencies). In addition, we were not able to look into how data analysis was carried out in the selected papers due to insufficient methodological descriptions in the papers. Future research should assess and report data analysis techniques for data integration. Finally, it is worth noting that there may be diverse routes to improving health security through data integration (e.g., early warning forecasting systems, rapid response and deployment of resources, preventing spill-over risk, reducing the incidence, and reducing cost). Since these are often not reported in the methods, future research should look into gray literature or government documents/publications.

Conclusion
Our findings identify the growing usage of data integration in infectious diseases, emphasizing the advantages of data integration and linkage analysis, and reiterating its importance in public health emergency preparedness and response. It is necessary to progress beyond the use of traditional public health data sources. Hence, we recommend that at least two such sources be integrated through collaboration among stakeholders to enhance public health surveillance. The results may garner continued technical and financial support for the "One Health" initiative; we provide a synopsis of the potential advantages of data integration and linkage in daily practice.

Authors' Contributions
SK, CR, and ST: Conceptualized the review article. SK and CR: Conducted the bibliography research and drafted the manuscript. SK and SJK: Drafted, reviewed, and revised the manuscript. All authors contributed intellectually and approved the final manuscript.