Before learning about Big Data in detail, let us quickly look at the motivations for Big Data and similar technologies.
We have already seen the three V’s of Big Data and, not surprisingly, the motivations for Big Data and similar technologies are based on these same three V’s: Variety, Velocity and Volume.
- Traditional relational databases were not designed to handle the huge volumes of data generated by today's data sources. Relational databases may be able to handle terabytes of data, but not petabytes and more. Big Data technologies such as Hadoop and NoSQL are better suited here.
- The speed (velocity) at which data arrives can also be very high today. Big Data technologies such as Hadoop and NoSQL can help here as well.
- Data is generated by a variety of sources as structured, semi-structured and unstructured data, and most of this data won’t fit into traditional database management systems, which mainly handle structured data.
As we saw, the volume, velocity and variety of data have increased over the years and are still increasing, ever faster. Rather than increasing the computing power of a single machine to handle this big data, we should use multiple computing resources working in parallel.
Traditional RDBMSs usually work with structured data using Structured Query Language (SQL). But not all data is structured and can be readily put into an RDBMS. Most of the data generated is unstructured or at best semi-structured, such as text, sensor data, audio, video, log files, XML, etc. Traditional databases were designed to handle only structured data, not this variety; hence we need an alternative to RDBMSs and SQL for processing it, and that is where Big Data comes in.
Hadoop mainly uses a programming model called MapReduce that helps multiple commodity machines share the workload and work on different parts of the data in parallel. Hadoop also replicates data across machines, which makes processing both faster and more reliable.
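To make the idea of splitting work into parts more concrete, here is a minimal sketch of the MapReduce pattern as a word count in plain Python. It does not use Hadoop or its APIs; the function names (map_phase, shuffle, reduce_phase, word_count) and the sample chunks are made up for illustration, and a thread pool on one machine stands in for the cluster of commodity machines.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(chunk):
    """Map step: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_chunks):
    """Shuffle step: group all emitted values by their key (the word)."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for word, count in pairs:
            groups[word].append(count)
    return groups


def reduce_phase(groups):
    """Reduce step: combine the values of each key into a single total."""
    return {word: sum(counts) for word, counts in groups.items()}


def word_count(chunks, workers=2):
    # Each chunk is mapped independently, so the chunks can be handled
    # in parallel. The thread pool is only a stand-in for separate machines.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_phase, chunks))
    return reduce_phase(shuffle(mapped))


# Hypothetical input: each string plays the role of one block of a large file.
chunks = ["big data is big", "data is everywhere"]
counts = word_count(chunks)
print(counts)
```

The key point is that map_phase never needs to see more than its own chunk, which is what allows the work to be spread over many machines.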
NoSQL databases can work well with semi-structured data of various forms. They can also scale out across machines more easily than traditional databases.
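As a rough illustration of what semi-structured data looks like, the sketch below loads a few JSON records that do not share a fixed set of fields, which is the kind of data that is awkward to force into one relational table. The records and field names are invented for this example, and plain Python dictionaries stand in for the documents a document-oriented NoSQL store might hold.

```python
import json

# Hypothetical records: each one has a different shape.
raw_records = [
    '{"user": "alice", "text": "hello", "tags": ["greeting"]}',
    '{"user": "bob", "photo": {"width": 640, "height": 480}}',
    '{"sensor": "t-17", "reading": 21.5}',
]

# Parse every record into a dictionary; no schema is declared up front.
documents = [json.loads(record) for record in raw_records]

# The union of all field names shows how much the shapes differ.
field_names = sorted({key for doc in documents for key in doc})

# Queries have to tolerate missing fields instead of relying on fixed columns.
users = [doc["user"] for doc in documents if "user" in doc]

print(field_names)
print(users)
```

A single relational table for these records would need a column for every field and would be mostly empty, which is one reason schema-flexible stores are attractive for such data.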
SOURCES OF DATA GENERATION
The data in Big Data may be generated by humans and/or by machines.
Human Generated Data
Humans may generate data either knowingly or unknowingly.
- Intentional Data
- Intentional data generated by humans includes photos, videos and text, as well as shares and likes on social media, etc.
- Metadata is data about data, and often accompanies the data contents without being noticed by the end user. This data is usually machine-readable because it usually follows some protocol.
- Examples of metadata include:
- Photograph Exif metadata, which contains additional information such as the location and time at which the image was taken.
- Cellphone call metadata, which contains the location and time details of the call.
- Email metadata, which contains additional fields such as to, cc and from.
- A tweet on Twitter carries a lot of metadata, often much larger than the tweet content itself.
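The email example above can be sketched with Python's standard email module, which separates the header fields (the metadata) from the message body. The message text and addresses below are invented for illustration.

```python
from email import message_from_string

# Hypothetical raw email: header lines, a blank line, then the body.
raw_message = """\
From: alice@example.com
To: bob@example.com
Cc: carol@example.com
Subject: Lunch?
Date: Mon, 01 Jan 2024 12:00:00 +0000

Are you free at noon?
"""

msg = message_from_string(raw_message)

# The headers are metadata: they follow a fixed format, so a program can
# read them without understanding the body at all.
metadata = {name: msg[name] for name in ("From", "To", "Cc", "Date")}

# The body is the content the sender actually wrote.
body = msg.get_payload()

print(metadata)
print(body)
```

Even in this tiny message, the metadata takes up more space than the content, much like the tweet example.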
Machine Generated Data
A lot of data is generated by machines or devices as a by-product of other processing.
- Machine generated data usually follows some protocol, and hence it can be read and analyzed more easily than human generated data.
- Example sources of machine generated data are:
- Cell phones exchanging data with the towers they connect to
- Readings from medical devices
- Web crawlers and spam bots
- There are many uses for this automatically generated data:
- Monitoring production lines
- Identifying and monitoring pets
- Infrastructure management
- Energy management
Reference: Lynda.com’s Techniques and Concepts of Big Data with Barton Poulson.