Can you give an example of data that you think could be considered “Big Data”? Which characteristics of that data fit the 5 Vs, 7 Vs, or 10 Vs of big data?
We live in a world full of data collections that have been accumulating for many years and are growing tremendously over time. Since my background is in medicine, I can say that healthcare data is vast and that some of it is rarely used. For example, summaries of non-communicable diseases (e.g. diabetes, hypertension, dyslipidaemia, chronic kidney disease) are sent periodically to the national HDC (Health Data Centre) so that they can be visualised and acted upon. The data format is known as 43-Files, and it is mainly used for statistical and reimbursement purposes. Every hospital has been obliged to send these files for decades. Can you imagine how big these collections are?
These 43-Files collections fit the big data characteristics as follows:
1. Volume – this is the best-known characteristic of big data. As I mentioned, 43-Files have been collected from hospitals for every visit over the years. You can do the math; unfortunately, I do not have a precise figure for how big they are. Petabytes, I would guess.
2. Velocity – healthcare data is generated every second, whenever there is a patient encounter.
3. Variety – 43-Files may lack variety because they are stored in SQL databases (structured data) and exported as CSV (comma-separated values) files. They contain no images or other binary files, only plain text.
4. Variability – 43-Files are inconsistent and error-prone by nature: they come from various sources, human input formats vary, and the volume of patient visits is high.
5. Veracity and 6. Validity – as mentioned in No. 4, 43-Files are prone to error, so they probably contain some unusable data.
7. Vulnerability – 43-Files are unencrypted. Data protection is crucial, and the files must be transferred in a secure manner.
8. Volatility – yet to be determined for 43-Files.
9. Visualisation – yes; as noted above, the files are sent to the HDC to be visualised.
10. Value – 43-Files are intended for statistical purposes, but they also contain other data that may be useful after data cleaning.
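
To make the points about variability, validity and data cleaning a bit more concrete, here is a minimal Python sketch of the kind of cleaning I have in mind for a CSV export. Please treat it as an illustration only: the column names (hospital_code, patient_id, visit_date, diagnosis_code), the list of date formats and the sample rows are all made up by me and are not the real 43-Files layout. It simply rejects rows with missing required fields and normalises dates that were typed in different formats.

```python
import csv
import io
from datetime import datetime

# Hypothetical column names; the real 43-Files schema is not described here.
REQUIRED_FIELDS = ("hospital_code", "patient_id", "visit_date")
# Assumed date formats that different hospitals might use.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%Y%m%d")


def parse_date(text):
    """Return a date if text matches one of the assumed formats, else None."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None


def clean_rows(csv_text):
    """Split CSV rows into (valid, rejected) lists.

    Valid rows get their visit_date normalised to ISO format.
    Rejected rows are paired with the reason they were rejected.
    """
    valid, rejected = [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        missing = [f for f in REQUIRED_FIELDS if not (row.get(f) or "").strip()]
        if missing:
            rejected.append((row, "missing: " + ", ".join(missing)))
            continue
        visit_date = parse_date(row["visit_date"])
        if visit_date is None:
            rejected.append((row, "unrecognised date format"))
            continue
        row["visit_date"] = visit_date.isoformat()
        valid.append(row)
    return valid, rejected


# Made-up sample rows, for illustration only.
SAMPLE = """hospital_code,patient_id,visit_date,diagnosis_code
H001,P100,2020-01-15,E11
H002,P200,15/01/2020,I10
H003,,2020-01-16,E78
H004,P400,sometime in January,N18
"""

valid, rejected = clean_rows(SAMPLE)
print(len(valid), "valid rows;", len(rejected), "rejected rows")
for row, reason in rejected:
    print(row["hospital_code"], "->", reason)
```

Keeping the rejected rows together with a reason, instead of silently dropping them, seems the safer choice to me, because the error rate per hospital is itself a useful measure of veracity.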