The article focuses on utilizing Big Data to address cardiovascular disease and discusses the practical challenges of doing so.
# (1) Missing data: Data may be missing because clinicians omit values they consider unnecessary, because patients refuse or disagree with data collection, or for reasons that cannot be resolved. When less than 10% of the data are missing, the problem is generally manageable; when 10-60% are missing, different methods give different results; and when more than 60% are missing, there is no valid statistical solution.
The paper mentions several solutions: (a) Complete-case analysis, (b) Available-case analysis, (c) Imputation techniques, (d) Mixed effects regression models, (e) Generalized estimating equations, (f) Pattern mixture models and selection models.
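To make (a) and (c) concrete, here is a minimal sketch in Python using pandas on a small illustrative dataset (the variables and values are my own, not from the paper), comparing complete-case analysis with simple mean imputation:

```python
import numpy as np
import pandas as pd

# Toy cardiovascular dataset with missing values (illustrative only).
df = pd.DataFrame({
    "age": [63, 71, np.nan, 58, 66],
    "systolic_bp": [140, np.nan, 155, 132, np.nan],
    "ldl": [3.4, 4.1, 3.9, np.nan, 2.8],
})

# (a) Complete-case analysis: keep only rows with no missing values.
complete_cases = df.dropna()

# (c) Simple imputation: replace each missing value with the column mean.
# (Mean imputation is the simplest member of the imputation family;
#  multiple imputation is usually preferred in practice.)
imputed = df.fillna(df.mean())

print(complete_cases)
print(imputed)
```

Note how complete-case analysis shrinks the sample, which is exactly why it becomes unreliable as the fraction of missing data grows.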
In addition, my suggestion for the missing-data issue is that data fields should be validated at the point of submission whenever the data are collected with further research in mind. There should also be an incentive program for patients, such as a monthly lucky draw.
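A sketch of what such submission-time validation could look like (the field names and plausibility ranges below are hypothetical, not from the paper):

```python
# Hypothetical submission-time validation: flag a record if a
# research-relevant field is missing or outside a plausible range.
REQUIRED_RANGES = {          # field name -> (min, max); values are illustrative
    "age": (18, 120),
    "systolic_bp": (60, 260),
    "ldl": (0.5, 15.0),
}

def validate_record(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record can be saved."""
    problems = []
    for field, (lo, hi) in REQUIRED_RANGES.items():
        value = record.get(field)
        if value is None:
            problems.append(f"{field} is missing")
        elif not (lo <= value <= hi):
            problems.append(f"{field}={value} outside plausible range [{lo}, {hi}]")
    return problems

print(validate_record({"age": 63, "systolic_bp": 140}))   # ['ldl is missing']
```

Catching an omission at the moment of entry is far cheaper than trying to recover the value statistically years later.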
# (2) Selection Bias: Because patients differ in geographic profile, insurance coverage, and medical history, the distributions of variables differ across treatment groups. Consequently, a large volume of data no longer guarantees a representative sample, which prevents valid inference and generates false-positive results.
The paper suggests (a) Propensity score analysis, (b) Instrumental variable analysis, (c) Mendelian randomization for genetic studies, (d) Considering results as hypothesis-generating, and (e) Validating through RCTs.
My opinion on this issue aligns with propensity score analysis, which matches patients with similar characteristics across treatment groups to reduce bias; a sketch of this idea is given below.
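As a minimal sketch of propensity score matching (the synthetic covariates, age and insurance status, and the use of scikit-learn's logistic regression are my own assumptions, not the paper's method):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Illustrative synthetic data: covariates and a treatment indicator.
rng = np.random.default_rng(0)
n = 500
age = rng.normal(65, 10, n)
insured = rng.integers(0, 2, n)
treated = rng.binomial(1, 1 / (1 + np.exp(-(0.03 * (age - 65) + 0.5 * insured))))
df = pd.DataFrame({"age": age, "insured": insured, "treated": treated})

# 1. Estimate the propensity score: P(treated | covariates).
ps_model = LogisticRegression().fit(df[["age", "insured"]], df["treated"])
df["ps"] = ps_model.predict_proba(df[["age", "insured"]])[:, 1]

# 2. Greedy 1:1 nearest-neighbour matching on the propensity score.
treated_df = df[df["treated"] == 1]
control_df = df[df["treated"] == 0].copy()
pairs = []
for idx, row in treated_df.iterrows():
    if control_df.empty:
        break
    match_idx = (control_df["ps"] - row["ps"]).abs().idxmin()
    pairs.append((idx, match_idx))
    control_df = control_df.drop(match_idx)   # match without replacement

print(f"{len(pairs)} matched pairs with comparable propensity scores")
```

Outcomes are then compared only within the matched pairs, so treatment groups resemble each other on the measured characteristics.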
# (3) Data Analysis / Training: A lack of formal training in informatics, coding, data analysis, and large-database handling, together with inefficient algorithms, adds to this complexity and results in suboptimal analysis and inefficient data processing.
In my opinion, this could be improved by providing formal training programs and by fostering collaboration between clinicians and data scientists. Analyzing data single-handedly can lead to doing the right thing in the wrong way.
# (4) Interpretation and Translational Applicability of Results: Studies that are complex and not self-explanatory, with poorly described variables, subjective assumptions in the analysis, and questionable data quality, can lead to unclear conclusions and biased interpretations.
To improve the interpretation and translational applicability of results, firstly, the variables and metadata in the datasets should be consistently and clearly defined so that they are easier to interpret and reuse across studies; standardization would address this. Secondly, validation through independent studies should be established to confirm whether replicating the same studies yields the same results.
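As a sketch of how such standardization could be enforced programmatically (the shared data dictionary, variable names, and units below are hypothetical):

```python
# Hypothetical shared data dictionary: every dataset used across studies is
# expected to describe its variables with an agreed unit.
DATA_DICTIONARY = {
    "age":         {"unit": "years"},
    "systolic_bp": {"unit": "mmHg"},
    "ldl":         {"unit": "mmol/L"},
}

def check_metadata(dataset_columns: dict) -> list[str]:
    """Flag variables that are undefined, or whose declared unit disagrees
    with the shared data dictionary."""
    issues = []
    for name, meta in dataset_columns.items():
        expected = DATA_DICTIONARY.get(name)
        if expected is None:
            issues.append(f"{name}: not in the shared data dictionary")
        elif meta.get("unit") != expected["unit"]:
            issues.append(f"{name}: unit {meta.get('unit')!r} differs from {expected['unit']!r}")
    return issues

# A study that reports LDL in mg/dL would be flagged before its data are pooled.
print(check_metadata({"ldl": {"unit": "mg/dL"}, "age": {"unit": "years"}}))
```

Such checks make cross-study pooling and replication mechanical rather than a matter of guessing what each variable meant.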
# (5) Privacy and Ethical Issues: Medical servers can be targeted by cybercriminals, and there is a risk that individuals could be identified from their information, compromising patient privacy.
The paper discusses (1) using broad consent models, (2) implementing a “social contract”, (3) continuous improvement of data security systems, and (4) balancing privacy protection with community benefits.
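Relating to point (3), a minimal sketch of one common security measure, pseudonymizing direct identifiers with a salted one-way hash before data are shared (the identifiers and fields are illustrative, and this specific measure is my assumption rather than the paper's proposal):

```python
import hashlib
import secrets

# Hypothetical pseudonymization step before data leave the hospital:
# direct identifiers are replaced with salted one-way hashes so records
# can still be linked across tables without exposing who the patient is.
SALT = secrets.token_hex(16)   # kept secret by the data custodian

def pseudonymize(patient_id: str) -> str:
    """Return a stable pseudonym for a patient identifier."""
    return hashlib.sha256((SALT + patient_id).encode()).hexdigest()[:16]

record = {"patient_id": "MRN-0042", "diagnosis": "hypertension"}
record["patient_id"] = pseudonymize(record["patient_id"])
print(record)   # the medical record number is no longer visible
```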
My concern is with balancing privacy protection against community benefits. If the primary aim is benefit, the system is open to abuse such as corruption and public manipulation. Even when there are no benefits, ethical standards and privacy policies should ensure that people are not harmed.