Table of Contents
What is Big Data?
- No single standard definition...
- “Big Data” is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.
- “Big Data” is data whose scale, diversity, and complexity require new architecture, techniques, algorithms, and analytics to manage it and extract value and hidden knowledge from it...
What is Big Data?
- “Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.” by Gartner
What is Big Data?

What is Big Data?
- Some make it 4 V’s

What is Big Data?
- Some make it 5 V’s

Big Data Characteristics

Data is the New King!

From Data To Wisdom

Quotable Quotes about Big Data
- “Information is the oil of the 21st century, and analytics is the combustion engine” by Gartner
- “Data is the crude oil of the 21st century. You need data scientists to refine it!” by Karthik
- “Without big data analytics, companies are blind and deaf, wandering out onto the web like deer on a freeway.” by Geoffrey Moore
Data Science
- An area that manages, manipulates, extracts, and interprets knowledge from tremendous amounts of data
- Data science is a multidisciplinary field of study whose goal is to address the challenges of big data

Why is Big Data hard?
- How to store?
- 1,000 × 1 TB hard drives are needed to store 1 PB of data
- How to move?
- Assuming an effective transfer rate of about 1 Gb/s, it takes about 2 hours to copy 1 TB and about 90 days to copy 1 PB
- How to search?
- Assuming each record is 1 KB and one machine can process 1,000 records per second, it takes about 277 CPU hours (~12 days) to scan 1 TB and about 32 CPU years to scan 1 PB (see the back-of-envelope check after this list)
- How to process?
- How to adapt existing algorithms to large-scale data
- How to design new algorithms
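A quick back-of-envelope check of the figures above, as a minimal sketch. The ~1 Gb/s effective rate and 1,000 records/second are the assumptions from the bullets; decimal units (1 TB = 10^12 bytes) are used throughout.

```python
# Back-of-envelope arithmetic for the bullets above (decimal units).
TB = 10**12   # bytes in a terabyte
PB = 10**15   # bytes in a petabyte

# Store: how many 1 TB drives does 1 PB need?
print(PB // TB)                          # 1000 drives

# Move: effective throughput of ~1 Gb/s = 125 MB/s
rate = 125 * 10**6                       # bytes per second
print(TB / rate / 3600)                  # ~2.2 hours per TB
print(PB / rate / 86400)                 # ~93 days per PB

# Search: 1 KB records, 1,000 records/second per machine
print((TB / 1000) / 1000 / 3600)         # ~278 CPU hours per TB
print((PB / 1000) / 1000 / 86400 / 365)  # ~32 CPU years per PB
```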
Why is Big Data hard?
- There is no one-size-fits-all solution
- Rapidly evolving technology
- Many different tools!
- Different computation models require new algorithms!
Solution to Big Data Processing
- Need to combine distributed storage with distributed processing to handle big data (a minimal sketch follows this list)
- Issues:
- Distributing computation across many machines
- Maximizing performance
- Minimizing I/O to disk
- Minimizing transfers across the network
- Combining the results of distributed computation
- Recovering from failures
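The bullets above describe the general pattern. Below is a minimal sketch of "distribute, then combine", using a local process pool as a stand-in for cluster nodes; the dataset, the partitioning, and the `count_words` function are illustrative, not from any particular framework.

```python
# Minimal map/combine sketch: each "node" (here, a local process)
# counts words in its own partition; partial results are then merged.
from collections import Counter
from multiprocessing import Pool

def count_words(partition):
    """Local computation on one partition of the data."""
    counts = Counter()
    for line in partition:
        counts.update(line.split())
    return counts

if __name__ == "__main__":
    # Toy "distributed" dataset: one list of lines per node.
    partitions = [
        ["big data is big", "data is the new oil"],
        ["clusters store big data", "clusters process big data"],
    ]
    with Pool(processes=len(partitions)) as pool:
        # Distribute: run count_words on every partition in parallel.
        partial_counts = pool.map(count_words, partitions)
    # Combine: merge the partial results into one global count.
    print(sum(partial_counts, Counter()).most_common(3))
```

Real frameworks such as Hadoop MapReduce and Spark implement this same pattern at cluster scale, adding data locality (to minimize network transfer) and automatic re-execution of failed tasks, which addresses the remaining issues in the list.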
Big Data

Infrastructure for Big Data

Example: Google Data Center
Google data centers in The Dalles, Oregon
Cluster Computing
- Cluster: a collection of individual PCs (compute nodes) connected by a high-performance network
- Each compute node is an independent entity with its own
- Processor
- Main memory
- One or multiple networking cards
- All compute nodes typically have access to a shared file system (distributed file system)
Scalability
- Scalability is the ability of a system to adapt to increased processing demands
- A system is said to be scalable if it can handle the addition of users and resources without a noticeable loss of performance or an increase in administrative complexity
Two types of scaling
- Scale up (= vertical scaling): make a single machine more powerful, e.g., more CPU and memory
- Scale out (= horizontal scaling): add more machines to the cluster
Scale Up vs. Scale Out
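As a toy illustration of the trade-off, the sketch below compares the two approaches for a perfectly parallelizable job; all rates and node counts are made-up round numbers, and real clusters lose some speedup to coordination and network transfer.

```python
# Toy comparison (hypothetical numbers): processing 100 TB of data,
# where one commodity node works through 1 TB per hour.
DATA_TB = 100
NODE_RATE = 1.0          # TB per hour per commodity node

# Scale up: one machine, say 8x faster than commodity hardware;
# beyond some point you simply cannot buy a bigger box.
print(DATA_TB / (NODE_RATE * 8))     # 12.5 hours

# Scale out: 50 commodity nodes, each scanning its own partition.
# This assumes ideal (linear) speedup.
print(DATA_TB / (NODE_RATE * 50))    # 2.0 hours
```

This is why big-data systems favor scale-out: commodity nodes can be added almost without bound, while a single machine hits a hard ceiling.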