
Tag: Apache Spark



What are the differences between Apache Hadoop and Apache Spark?

What is Big Data? What size of data is considered big enough to be termed Big Data? There are many relative assumptions behind the term. For example, 50 terabytes of data might be Big Data for a startup, but not for companies like Google and Facebook, because they already have the infrastructure to store and process data at that scale. Apache Hadoop and Apache Spark are both Big Data analytics frameworks; they provide some of the most popular tools used to carry out common Big Data-related tasks.


Apache Spark Parallel Program Flows

Apache Spark Flows – Apache Spark consists of several purpose-built components, as we discussed in the introduction to Apache Spark. Let's see what a typical Spark program looks like. Imagine that a 300 MB log file is stored on a three-node HDFS cluster. The Hadoop Distributed File System (HDFS) automatically splits the file into 128 MB blocks and places each block on a separate node of the cluster.
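As a sketch of what that program flow could look like, the Scala snippet below counts the ERROR lines in such a log file. The HDFS path, the application name, and the ERROR marker are illustrative assumptions, not details from the post. Because each 128 MB HDFS block maps to at least one partition, the filter and the count run in parallel on all three nodes.

```scala
import org.apache.spark.sql.SparkSession

object LogErrorCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("LogErrorCount")
      .getOrCreate()

    // Hypothetical path: the 300 MB log file stored on the HDFS cluster.
    val lines = spark.sparkContext.textFile("hdfs:///logs/access.log")

    // Each 128 MB HDFS block becomes at least one partition, so this
    // filter-and-count is executed in parallel across the three nodes.
    val errorCount = lines.filter(_.contains("ERROR")).count()

    println(s"Lines containing ERROR: $errorCount")
    spark.stop()
  }
}
```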


Apache Spark Component Parallel Processing

Apache Spark consists of several purpose-built components, as we discussed in the introduction to Apache Spark. Spark provides high-level APIs in Java, Scala, Python, and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools. The Apache Spark components are: Spark Core, Spark SQL, Spark Streaming, Spark GraphX, and Spark MLlib.

These components make Spark a feature-packed unifying platform: it can be used for many tasks that previously had to be accomplished with several different frameworks. A brief description of each Apache Spark component follows.
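To illustrate that unifying-platform point, here is a minimal sketch that combines two of those components, Spark SQL and Spark MLlib, in a single application. The example data, column names, and choice of a linear regression model are assumptions made purely for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.LinearRegression

object UnifiedPipeline {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("UnifiedPipeline").getOrCreate()
    import spark.implicits._

    // Spark SQL: treat a small in-memory dataset as a structured table
    // (hypothetical columns for illustration).
    val sales = Seq((1.0, 10.0), (2.0, 21.0), (3.0, 29.0)).toDF("units", "revenue")
    sales.createOrReplaceTempView("sales")
    val filtered = spark.sql("SELECT units, revenue FROM sales WHERE revenue > 5")

    // Spark MLlib: fit a simple model on the same DataFrame, without
    // moving the data to a separate machine-learning framework.
    val features = new VectorAssembler()
      .setInputCols(Array("units"))
      .setOutputCol("features")
      .transform(filtered)

    val model = new LinearRegression()
      .setLabelCol("revenue")
      .fit(features)

    println(s"Learned coefficients: ${model.coefficients}")
    spark.stop()
  }
}
```

The point of the sketch is that the SQL query and the model training share one SparkSession and one DataFrame; no second framework or data copy is needed.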


Apache Spark Parallel Processing Introduction

Apache Spark is usually defined as a fast, general-purpose, distributed computing platform. Apache Spark provides high-level APIs in Java, Scala, Python, and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools, including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.
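As one example of those higher-level tools, here is a minimal Spark Streaming sketch in the style of the classic network word count: it counts the words arriving on a local socket in one-second micro-batches. The host, port, and batch interval are illustrative assumptions (e.g. a test source started with `nc -lk 9999`), not a prescribed setup.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object NetworkWordCount {
  def main(args: Array[String]): Unit = {
    // One-second micro-batches; master URL is supplied by spark-submit.
    val conf = new SparkConf().setAppName("NetworkWordCount")
    val ssc = new StreamingContext(conf, Seconds(1))

    // Hypothetical text source: a socket on localhost:9999.
    val lines = ssc.socketTextStream("localhost", 9999)

    // Count the words seen in each batch and print the result.
    val counts = lines
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```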