
Apache Spark Building Blocks

  • June 28, 2023

Meet the Author: Mr. Bharani Kumar

Bharani Kumar Depuru is a well-known IT personality from Hyderabad. He is the Founder and Director of Innodatatics Pvt Ltd and 360DigiTMG. An IIT and ISB alumnus with more than 17 years of experience, he has held prominent positions at IT majors such as HSBC, ITC Infotech, Infosys, and Deloitte. He is a sought-after IT consultant specialising in Industrial Revolution 4.0 implementation, Data Analytics practice setup, Artificial Intelligence, Big Data Analytics, Industrial IoT, Business Intelligence, and Business Management. Bharani Kumar is also the chief trainer at 360DigiTMG, with more than ten years of training experience, and has been making the IT transition journey easy for his students. 360DigiTMG is at the forefront of delivering quality education, bridging the gap between academia and industry.


Apache Spark is one of the best frameworks for managing enormous amounts of heterogeneous data. Benchmarks suggest that this open-source framework can outperform standard MapReduce by a factor of up to about 100. Spark was first developed at UC Berkeley's AMPLab in 2009, open-sourced in 2010, and later donated to the Apache Software Foundation. Let's quickly get a feel for Spark. For any activity or project to be completed, three crucial elements are required: the processing power (processor) that carries out the commands, the data to be processed, and the syntax and logic of the task. The major data stores that Spark can connect to are covered below.

Apache Spark

Apache Spark is extensively used for handling big data and performing a wide variety of tasks on both batch and streaming data; it can also be used for machine learning and for processing graph data. To begin with, Apache Spark is not a programming language. It is a framework/platform on which we can execute instructions (code). Spark brings a wide variety of features that make it a favourite among analysts in the big data space, and it interfaces with an impressive number of relevant data stores, both distributed and non-distributed. A few of these are HDFS, Cassandra, OpenStack Swift, Amazon S3, and Kudu.

One of Spark's strongest features is that it offers several language APIs, which opens it up to the whole programming community. Spark primarily supports the language APIs of Scala, Java, Python, and R. These language APIs deserve special attention, since they are a key factor in Spark's widespread acceptance within the development community.

Scala is an open-source programming language that was released in 2004. More than 70% of Spark's code was written in Scala. It is a statically typed language that combines functional and object-oriented programming principles.


Some of Spark's features are:

  • Open source
  • Distributed and parallel computation
  • In-memory computation
  • Suited to iterative and interactive applications
  • Distributed datasets
  • Batch and real-time applications
  • Programming in Scala, Java, Python, R, and SQL
  • Runs on commodity hardware
  • Fault tolerance
  • Scalability
  • Runs on both Windows and Linux
  • Written in the Scala programming language
  • Easy to use

Spark is renowned for its ability to process huge volumes of data quickly in a distributed setting. Spark leverages the cluster's distributed memory, which is how the magic happens: once the data has been loaded into memory from disk, all subsequent operations are performed entirely in memory.

Apache Spark's building blocks are the SparkContext, RDDs (the data objects), and operations.

SparkContext - Any application's starting point is the SparkContext. It makes it possible for the application to interact with data sources and manipulate data.

SparkSession - From Spark 2.0 onwards, a new unified entry point called SparkSession was introduced. SparkSession has built-in support for Hive (HiveContext) and SQL-like (SQLContext) operations.
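To make this concrete, here is a minimal Scala sketch of creating a SparkSession and obtaining the underlying SparkContext; the application name and the local master setting are illustrative placeholders, not part of the original article.

import org.apache.spark.sql.SparkSession

// Spark 2.0+ unified entry point; appName and master are placeholder values
val spark = SparkSession.builder()
  .appName("SparkBuildingBlocks")
  .master("local[*]")   // local mode for experimenting; a real cluster sets this via spark-submit
  .getOrCreate()

// The classic SparkContext is still available through the session
val sc = spark.sparkContext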

The next component is the data on which the processing is done. Spark uses a special data abstraction for this: the data is held in memory and processed in memory.

RDD: A Resilient Distributed Dataset is Spark's read-only, in-memory abstraction of a data object. RDDs are collections of records that are immutable, fault-tolerant, and partitioned.
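As a small illustrative sketch (reusing the sc from the SparkSession example above, with made-up sample values), an RDD can be created from a local collection and transformed without mutating the original:

// Distribute a local collection across 2 partitions
val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5), 2)

// RDDs are immutable: map() returns a new RDD and leaves the original untouched
val doubled = numbers.map(_ * 2)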

The DataFrame was introduced in the Spark 1.3 release. The major difference is that an RDD holds unstructured objects, whereas DataFrames are organised in a tabular fashion.

 

DataFrames are collections of organised data distributed across the nodes of a cluster. Spark's SQL and DataFrame APIs are comparable to relational database tables or Python's pandas DataFrames, and the schema of a DataFrame can be inferred automatically.
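A brief sketch of how a DataFrame with an inferred schema might be built from local data; the sample rows and column names are invented for illustration and assume the spark session created earlier:

import spark.implicits._

// Build a small DataFrame from local tuples; Spark infers the column names and types
val df = Seq(("alice", 34), ("bob", 29)).toDF("name", "age")
df.printSchema()   // shows the inferred schema: name as string, age as integer
df.show()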

Datasets were made available in version 1.6. Datasets are an extension of DataFrames. Before running the code, Spark can examine the schema to check what is being specified; in summary, when compiling object-oriented operations, Spark is able to check and evaluate the data types associated with the data objects.

The term "dataset" refers to a collection of tightly typed, organised data. Datasets' main objective is to offer a simple means of performing transformations on objects without sacrificing the benefits of Spark's efficiency and resilience.

3rd Component - Operations

Apache Spark supports two types of operations: Transformations and Actions.

Transformations take RDDs as input and produce one or more new RDDs as the result. The input RDDs themselves cannot be changed, since they are read-only, in-memory, immutable objects. Transformations are lazy operations that build up a lineage graph known as a DAG (Directed Acyclic Graph); these lineage graphs are also referred to as RDD operator or dependency graphs.

DAGs can be seen as the execution plan for the transformation operations that we want to perform.
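For example (reusing the numbers RDD from the earlier sketch), chaining transformations runs nothing; Spark only records the lineage, which toDebugString can print:

// Nothing executes here: Spark only records the lineage (DAG) of transformations
val pipeline = numbers.map(_ * 2).filter(_ > 4)

// Inspect the RDD's lineage / dependency graph
println(pipeline.toDebugString)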

Transformations can be classified as narrow transformations or wide transformations.

Narrow transformations: transformations where data does not need to move between partitions for the function to be executed; the required data resides in a single partition.
examples: map(), mapPartitions(), flatMap(), filter(), union()

Wide transformations: the data required for the computation resides in many partitions, so data moves across partitions. These are also known as shuffle transformations, since data gets shuffled while the operations are executed.
examples: groupByKey(), aggregateByKey(), join(), repartition()
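A small sketch contrasting the two kinds of transformation; the word list is sample data and sc is the SparkContext from the earlier sketch:

val words = sc.parallelize(Seq("spark", "hadoop", "spark", "hive"))

// Narrow: each output partition depends on a single input partition, so no shuffle is needed
val pairs = words.map(w => (w, 1))

// Wide: records with the same key must end up in the same partition, so data is shuffled
val grouped = pairs.groupByKey()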

TRANSFORMATIONS EXAMPLES:

General

  • map
  • filter
  • flatMap
  • mapPartitions
  • mapPartitionsWithIndex
  • groupBy
  • sortBy

Math / Statistical

  • sample
  • randomSplit

Set Theory / Relational

  • union
  • intersection
  • subtract
  • distinct
  • cartesian
  • zip

Data Structure / I/O

  • keyBy
  • zipWithIndex
  • zipWithUniqueId
  • zipPartitions
  • coalesce
  • repartition
  • repartitionAndSortWithinPartitions
  • pipe

Action Operations:

Actions are eager operations: while transformation operations produce new RDD(s), action operations produce results from the RDDs. The outcomes of actions are either returned to the Spark driver (the driver is the JVM process that coordinates the workers and task execution) or written to an external storage system.
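A few illustrative actions on the numbers RDD from the earlier sketches; the output path is a placeholder:

// Actions trigger execution of the lazily built DAG
val total    = numbers.reduce(_ + _)        // aggregated across partitions, returned to the driver
val howMany  = numbers.count()              // number of records
val firstTwo = numbers.take(2)              // small sample pulled back to the driver
numbers.saveAsTextFile("/tmp/numbers_out")  // written to external storage (the path must not already exist)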

ACTIONS EXAMPLES:

General

  • reduce
  • collect
  • aggregate
  • fold
  • first
  • take
  • foreach
  • top
  • treeAggregate
  • treeReduce
  • foreachPartition
  • collectAsMap
  • takeOrdered

Math / Statistical

  • count
  • takeSample
  • max
  • min
  • sum
  • histogram
  • mean
  • variance
  • stdev
  • sampleVariance
  • countApprox
  • countApproxDistinct

Set Theory / Relational

  • takeOrdered

Data Structure / I/O

  • saveAsTextFile
  • saveAsSequenceFile
  • saveAsObjectFile
  • saveAsHadoopDataset
  • saveAsHadoopFile
  • saveAsNewAPIHadoopDataset
  • saveAsNewAPIHadoopFile

Spark Optimization

Once an application is put into production, monitoring is crucial to maintain results and make sure that jobs complete successfully. The effectiveness of jobs is usually evaluated on a few factors, including runtime, storage space, and metrics for data shuffled across nodes. The majority of developers concentrate only on building applications and pay little attention to refactoring and optimising the code.

Optimisation is typically carried out at two levels: the cluster level and the application level. Cluster-level optimisation usually means using the hardware and the Spark cluster to their fullest potential. Spark runs in parallel, so the more hardware, the better; faster networks and more memory also improve performance, especially when shuffling data. A further benefit of recent Spark versions (1.6 and above) is autonomous memory management. If an RDD is too large to fit in memory, the default caching option of MEMORY_ONLY can be changed to MEMORY_AND_DISK as the storage level; with this option, partitions that do not fit in memory are spilled to disk and read back when needed, rather than being recalculated. Disk storage is therefore equally important when data temporarily moves from memory to disk. Additional cores can also be allocated to guarantee optimised results. The number of executors, the cores allotted to each executor, and the memory allotted to each executor are a few variables that may be tuned for optimal use of Spark jobs.
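A hedged sketch of these two knobs, reusing the numbers RDD from earlier; the storage-level call is standard, while the executor values in the comments are illustrative examples rather than recommendations:

import org.apache.spark.storage.StorageLevel

// Spill partitions that do not fit in memory to disk instead of recomputing them
numbers.persist(StorageLevel.MEMORY_AND_DISK)

// Executor sizing is usually supplied when the application is launched, for example:
//   spark-submit --conf spark.executor.instances=4 \
//                --conf spark.executor.cores=4 \
//                --conf spark.executor.memory=8g  my_app.jar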

Project Tungsten deserves special attention whenever optimisation in Spark is considered. Tungsten is enabled by default in Spark, so no setup is required to use it. Tungsten improves the CPU and memory efficiency of Spark applications. The following list summarises the project's primary optimisations:

  • Memory usage is reduced and memory management is optimised by replacing JVM objects with Spark's own binary representation, which also removes the cost of garbage collection.
  • Data is handled in a binary format (also known as the UnsafeRow format) rather than as JVM objects; because JVM objects are heavier than the binary format, this speeds up processing.
  • Spark's structured APIs generate bytecode for the written code, which gives good performance for large queries (see the sketch below).
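As a rough illustration of the last point, calling explain() on a structured query prints the query plans; in recent Spark versions the operators wrapped in WholeStageCodegen stages are the ones for which bytecode is generated (df is the sample DataFrame from the earlier sketch):

// Print the logical and physical plans for an aggregation;
// WholeStageCodegen markers in the physical plan indicate generated code
df.groupBy("name").count().explain(true)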

Writing code in Scala or Java optimises the application by eliminating serialisation problems, and it is strongly advised to write user-defined functions (UDFs) in Scala. For further optimisation, you can switch between the structured APIs and RDDs: DataFrame APIs can be used when intensive computation is required, and RDDs when greater control over the application is needed. Binary file formats are always preferable to text formats (CSV or JSON) because of their higher compression; the binary format is superior for both network transfer and storage. Columnar file formats (Parquet and ORC), which are also favoured with Spark, are best if you regularly need to read and compute over only certain fields of a table, and columnar formats compress data quickly. In addition, compression codecs such as Snappy or LZF provide very significant data compression.
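A sketch of both ideas, assuming the df DataFrame from the earlier sketches; the new column name and the output path are illustrative:

import org.apache.spark.sql.functions.udf

// A UDF written in Scala stays inside the JVM and avoids extra serialisation cost
val toUpper = udf((s: String) => s.toUpperCase)
val withUpper = df.withColumn("name_upper", toUpper(df("name")))

// Columnar, compressed output: Parquet with Snappy compression
withUpper.write
  .option("compression", "snappy")
  .parquet("/tmp/people_parquet")   // placeholder path; it must not already exist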

On the structured APIs (SQL, DataFrames, and Datasets), the Catalyst optimizer provides a considerable degree of performance optimisation. The written query is transformed from a logical query plan into a physical query plan, which contains details such as which files to read or which tables to join. Partitioning and bucketing strategies also help the data to be processed more quickly. Additionally, Spark SQL's shuffle partitions have a default value of 200, which sets the degree of parallelism; when we have a lot of data, we can reduce the size of the shuffle partitions by increasing their number. The value of spark.sql.shuffle.partitions can be modified for the appropriate optimisation. For RDDs, the number of partitions can be changed using coalesce() and repartition(). Spark SQL's join strategies (shuffle hash join, broadcast hash join, and Cartesian join) are useful for specific kinds of optimisation. When working with large queries, enabling spark.sql.codegen is a good idea. Slow-running jobs can be monitored in the Spark UI, and once spark.speculation is set to true, slow-running tasks can be re-executed on another node for faster completion.
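A few of these settings as a sketch; the values are illustrative, and spark and grouped come from the earlier sketches:

// Shuffle partitions default to 200; tune the number to the data volume
spark.conf.set("spark.sql.shuffle.partitions", "400")

// For RDDs, partitioning is controlled directly
val fewer = grouped.coalesce(4)       // merges partitions without a full shuffle
val more  = grouped.repartition(16)   // full shuffle into 16 partitions

// Speculative execution re-runs slow tasks on another node; core settings like this are
// normally supplied at launch time, e.g. spark-submit --conf spark.speculation=true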
