The 4 V’s of big data have received a great deal of attention in the past few years, with many articles, opinions, criticisms, and hyped messages delivered on the subject. Objectively, the main point of the V-based characterization of big data is to highlight its most serious challenges: the capture, cleaning, curation, integration, storage, processing, indexing, search, sharing, transfer, mining, analysis, and visualization of large volumes of fast-moving, highly complex data.
It did not take long for many of us to add more V’s to the characterization of big data. Those who added new V’s to the list of big data challenges were sharing valuable lessons learned and best practices with the rest of us.
So, what are the V’s that represent big data’s biggest challenges? The V-based characterizations below represent ten different challenges associated with the main tasks involving big data (as mentioned earlier: capture, cleaning, curation, integration, storage, processing, indexing, search, sharing, transfer, mining, analysis, and visualization).
- Volume: = lots of data.
- Variety: = complexity, thousands or more features per data item, many data types, and many data formats.
- Velocity: = high rate of data and information flowing into and out of our systems in real time.
- Veracity: = necessary and sufficient data to test many different hypotheses, vast training samples for rich micro-scale model-building and model validation, micro-grained “truth” about every object in your data collection.
- Validity: = data quality, governance, master data management (MDM) on massive, diverse, distributed, heterogeneous, “unclean” data collections.
- Value: = the all-important V, characterizing the business value, ROI, and potential of big data to transform your organization from top to bottom.
- Variability: = dynamic, evolving, spatiotemporal data, time series, seasonal, and any other type of non-static behavior in your data sources, customers, objects of study, etc.
- Venue: = distributed, heterogeneous data from multiple platforms, from different owners’ systems, with different access and formatting requirements, private vs. public cloud.
- Vocabulary: = schema, data models, semantics, ontologies, taxonomies, and other content- and context-based metadata that describe the data’s structure, syntax, content, and provenance.
- Vagueness: = confusion over the meaning of big data (Is it Hadoop? Is it something that we’ve always had? What’s new about it? What are the tools? Which tools should I use? etc.)
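A few of these V's can be given rough, measurable proxies. The sketch below is only an illustration under stated assumptions, not a standard method: the function `profile_records`, the report keys, and the sample records are all hypothetical. It treats record count as a stand-in for Volume, the number of distinct fields and value types as a stand-in for Variety, and the fraction of records with no missing (`None`) values as a crude stand-in for Validity.

```python
from collections import Counter


def profile_records(records):
    """Return rough indicators for three of the V's on a list of dict records.

    Hypothetical helper for illustration only; real data profiling is
    considerably more involved.
    """
    field_counts = Counter()   # how often each field name appears
    type_names = set()         # names of the value types seen
    incomplete = 0             # records containing at least one None value

    for record in records:
        field_counts.update(record.keys())
        has_missing = False
        for value in record.values():
            if value is None:
                has_missing = True
            else:
                type_names.add(type(value).__name__)
        if has_missing:
            incomplete += 1

    total = len(records)
    return {
        "volume_records": total,
        "variety_fields": len(field_counts),
        "variety_types": sorted(type_names),
        # Only explicit None values count as missing; absent fields do not.
        "validity_complete_fraction": (total - incomplete) / total if total else 0.0,
    }


# Made-up sample data with inconsistent fields and one missing reading.
sample = [
    {"id": 1, "name": "sensor-a", "reading": 20.5},
    {"id": 2, "name": "sensor-b", "reading": None},
    {"id": 3, "name": "sensor-c", "reading": 19.0, "unit": "C"},
    {"id": 4, "tags": ["x", "y"]},
]

report = profile_records(sample)
print(report)
```

Proxies like these cover only the easily counted V's. Value, Vocabulary, and Vagueness in particular depend on business and semantic context that simple counts cannot capture.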