Table of Contents
What is Big Data?
- No single standard definition...
- “Big Data” is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.
- “Big Data” is data whose scale, diversity, and complexity require new architecture, techniques, algorithms, and analytics to manage it and extract value and hidden knowledge from it...
What is Big Data?
- “Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.” by Gartner
What is Big Data?
- Some extend it to 4 V's
What is Big Data?
- Some extend it to 5 V's
Big Data Characteristics
Data is the New King!
From Data To Wisdom
Quotable Quotes about Big Data
- “Information is the oil of the 21st century, and analytics is the combustion engine” by Gartner
- “Data is the crude oil of the 21st century. You need data scientists to refine it!” by Karthik
- “Without big data analytics, companies are blind and deaf, wandering out onto the web like deer on a freeway.” by Geoffrey Moore
Data Science
- An area that manages, manipulates, extracts, and interprets knowledge from tremendous amounts of data
- Data science is a multidisciplinary field of study whose goal is to address the challenges of big data
Why is Big Data hard?
- How to store?
- 1,000 × 1 TB hard drives are required to store 1 PB of data
- How to move?
- Assuming a 10 Gbps network, it takes about 13 minutes to copy 1 TB, or about 9 days to copy 1 PB
- How to search?
- Assuming each record is 1 KB and one machine can process 1,000 records per second, it takes about 12 CPU days to process 1 TB and about 32 CPU years to process 1 PB
- How to process?
- How to adapt existing algorithms to work at large scale
- How to design new algorithms
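The estimates above can be checked with simple back-of-envelope arithmetic. A quick sketch in Python, assuming decimal units (1 TB = 10^12 bytes, 1 PB = 10^15 bytes) and the rates stated on the slide:

```python
# Back-of-envelope arithmetic for the "Why is Big Data hard?" numbers.
# Assumptions: decimal units, a 10 Gbps link, 1 KB records at 1,000 records/sec.

TB = 10**12
PB = 10**15

# How to move: a 10 Gbps link moves 1.25 GB per second
link_bytes_per_sec = 10e9 / 8
copy_1tb_min = TB / link_bytes_per_sec / 60       # ~13 minutes
copy_1pb_days = PB / link_bytes_per_sec / 86400   # ~9 days

# How to search: 1 KB records, one machine scanning 1,000 records/sec
record_size = 1000        # bytes
records_per_sec = 1000
scan_1tb_days = (TB / record_size) / records_per_sec / 86400           # ~12 CPU days
scan_1pb_years = (PB / record_size) / records_per_sec / (86400 * 365)  # ~32 CPU years

print(f"copy 1 TB: {copy_1tb_min:.0f} min, copy 1 PB: {copy_1pb_days:.1f} days")
print(f"scan 1 TB: {scan_1tb_days:.1f} CPU days, scan 1 PB: {scan_1pb_years:.1f} CPU years")
```

The point of the exercise: even at full line rate, a single machine is days to years away from touching a petabyte, which is why the processing itself must be distributed.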
Why is Big Data hard?
- There is no one-size-fits-all solution
- Rapidly-evolving technology
- Many different tools!
- Different computation model: need new algorithms!
Solution to Big Data Processing
- We need to combine distributed storage and distributed processing to handle big data
- Issues:
- Distributing computation across many machines
- Maximizing performance
- Minimizing I/O to disk
- Minimizing transfers across the network
- Combining the results of distributed computation
- Recovering from failures
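The "distribute, then combine" pattern behind these issues can be illustrated with a minimal single-process sketch (a word count; the partition data and function names are illustrative, not any framework's API — real systems such as Hadoop run the per-partition step on many nodes and merge partial results over the network):

```python
# Sketch of distributed computation: each partition is processed independently,
# then the partial results are combined into one final answer.
from collections import Counter
from functools import reduce

def map_partition(lines):
    """Runs on each node: produce a partial result from local data."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts  # aggregating locally minimizes network transfer

def combine(acc, partial):
    """Merge two partial results into one (the combining step)."""
    acc.update(partial)
    return acc

# Three "partitions" standing in for data spread across three nodes
partitions = [["big data big"], ["data is new king"], ["big data"]]
partials = [map_partition(p) for p in partitions]  # parallel in a real cluster
total = reduce(combine, partials, Counter())
print(total["big"], total["data"])  # 3 3
```

Failure recovery falls out of the same structure: because each partition's work is independent, a failed node's partitions can simply be reprocessed elsewhere.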
Big Data
Infrastructure for Big Data
Example: Google Data Center
Google data centers in The Dalles, Oregon
Cluster Computing
- Cluster: a collection of individual PCs (compute nodes) connected by a high-performance network
- Each compute node is an independent entity with its own
- Processor
- Main memory
- One or multiple networking cards
- All compute nodes typically have access to a shared file system (Distributed file system)
- Removes the necessity to replicate programs and data on all compute nodes
- All accesses to files require communication over the network
Scalability
- Scalability is the ability of a system to adapt to increased processing demands
- A system is said to be scalable if it can handle the addition of users and resources without suffering a noticeable loss of performance or increase in administrative complexity
Two types of scaling
- Scale up (= Vertical Scaling)
- Scales the computing resources on a single node, via parallel processing & faster memory/storage
- Specialized, expensive hardware
- Single point of failure (SPoF)
- Scale out (= Horizontal Scaling)
- Scales the computing across distributed nodes in a cluster
- Commodity hardware
- Any node may fail, but with no apparent loss of availability (fault tolerant)
Scale Up vs. Scale Out
Whatever system we choose, it has to scale for big data and big data processing, and it has to be economical!
Apache Hadoop is a scale-out platform!
Platform vs. Framework
- Platform
- an underlying computer system on which application programs can run.
- Ex) Windows platform, Linux platform, iOS platform, Android platform
- Framework
- an organizational structure in a specific language and possibly some libraries.
- Ex) .NET framework, EJB framework
End.