
Big data offers great benefits and opportunities, but it also poses great challenges for informaticians, who must retrieve specific data, process them, and transform them into valuable information or clinical applications. The following are such challenges, together with my suggestions on how to cope with them.
1. Multiple definitions. Some diseases or terms have synonyms or can be expressed in other words, so we have to think of all the synonyms that may have been used. Moreover, if the terms appear in unstructured data, we will encounter misspelled terms that an exact word search cannot find. In this scenario, redundant matching of terms (searching for several variants or near-matches of the same term) can be used to detect some misspelled words.
2. Unstructured data. In many hospitals, much of the data is still in unstructured form, for example scanned documents, images, histories, physical examinations, progress notes, and nurse notes. Transforming unstructured data into structured data is essential. However, we first have to decide which data need to be transformed. In addition, some data are found in scanned documents with handwriting that is difficult to read.
3. Missing data. Every database has missing data, most of which are not missing at random, leading to selection bias. If the missing data do not exceed 10 percent of all data, there are statistical methods to address the problem, including imputation techniques, mixed-effects regression models, generalized estimating equations, and inference techniques. However, as the proportion of missing values increases, the results become less reliable.
4. Data inconsistency. Inconsistencies may arise when data are duplicated. We have to verify consistency by comparing the duplicated copies with each other and with the original source.
5. Clinical applicability. There is evidence that most results derived from big data analyses are not confirmed when well-designed randomized controlled trials are conducted. Research on big data therefore shows a trend that still needs to be confirmed with a standard randomized controlled trial.
6. Legal and ethical issues. The privacy and confidentiality of patients are our main concerns. All identifiable data must be encrypted so that patients cannot be identified. Additionally, access to the database may need to be protected by a password.
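As an illustration of challenge 1, the sketch below maps free-text terms to one canonical term, first through a synonym table and then through approximate string matching to catch some misspellings. It is only a minimal sketch: the `SYNONYMS` table, the `normalize_term` function, and the 0.8 similarity cutoff are assumptions made for this example, not part of any established vocabulary or tool.

```python
from difflib import get_close_matches

# Hypothetical synonym table: each known variant maps to one canonical term.
SYNONYMS = {
    "myocardial infarction": "myocardial infarction",
    "heart attack": "myocardial infarction",
    "diabetes mellitus": "diabetes mellitus",
    "diabetes": "diabetes mellitus",
}


def normalize_term(raw, cutoff=0.8):
    """Return the canonical term for `raw`, or None if nothing is close enough."""
    # Lowercase and collapse whitespace so trivial differences do not matter.
    key = " ".join(raw.lower().split())
    if key in SYNONYMS:
        return SYNONYMS[key]
    # Approximate matching tolerates small misspellings of a known variant.
    candidates = get_close_matches(key, list(SYNONYMS), n=1, cutoff=cutoff)
    return SYNONYMS[candidates[0]] if candidates else None


if __name__ == "__main__":
    for text in ["Heart Attack", "myocardial infraction", "fracture"]:
        print(text, "->", normalize_term(text))
```

A lower cutoff catches more misspellings but also produces more false matches, so the threshold would need to be checked against real records before use.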
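For challenge 3, the following sketch checks the proportion of missing values in one variable against the 10 percent limit mentioned above and, if the limit is respected, fills the gaps with the mean of the observed values. Mean imputation is used here only because it is the simplest imputation technique to show; it assumes the values are missing at random, which the text notes is often not the case. The function names and the use of `None` to mark a missing value are assumptions for this example.

```python
from statistics import mean

MISSING_THRESHOLD = 0.10  # the 10 percent limit mentioned in the text


def missing_fraction(values):
    """Return the fraction of entries that are missing (None)."""
    if not values:
        return 0.0
    return sum(v is None for v in values) / len(values)


def impute_mean(values, threshold=MISSING_THRESHOLD):
    """Replace missing entries with the mean of the observed ones.

    Raises ValueError when too many values are missing for this simple
    approach to be defensible.
    """
    fraction = missing_fraction(values)
    if fraction > threshold:
        raise ValueError(
            f"{fraction:.0%} of values are missing, above the {threshold:.0%} limit"
        )
    observed = [v for v in values if v is not None]
    if not observed:
        return list(values)  # nothing to impute from (empty input)
    fill = mean(observed)
    return [fill if v is None else v for v in values]


if __name__ == "__main__":
    systolic = [120.0, 130.0, None, 125.0, 118.0, 122.0, 135.0, 128.0, 119.0, 121.0]
    print(missing_fraction(systolic))
    print(impute_mean(systolic))
```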
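For challenge 4, one possible way to check a duplicated data set is to compare it record by record with its source. The sketch below assumes each data set is a dictionary keyed by a record identifier; the `find_inconsistencies` function and the sample records are hypothetical.

```python
def find_inconsistencies(source, copy):
    """Compare two {record_id: record} mappings and list the differences."""
    issues = []
    # Walk over every identifier that appears in either data set.
    for record_id in sorted(set(source) | set(copy)):
        if record_id not in copy:
            issues.append((record_id, "missing in copy"))
        elif record_id not in source:
            issues.append((record_id, "missing in source"))
        elif source[record_id] != copy[record_id]:
            issues.append((record_id, "fields differ"))
    return issues


if __name__ == "__main__":
    source = {
        "p001": {"sex": "F", "age": 54},
        "p002": {"sex": "M", "age": 61},
        "p003": {"sex": "F", "age": 47},
    }
    copy = {
        "p001": {"sex": "F", "age": 54},
        "p002": {"sex": "M", "age": 16},  # digits transposed during copying
        "p004": {"sex": "M", "age": 70},
    }
    for record_id, problem in find_inconsistencies(source, copy):
        print(record_id, problem)
```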