How to Set Up a Pseudo-distributed Cluster with Hadoop 3.2.1 and Apache Spark 3.0

Sriram
12 min read · Aug 14, 2020
https://techcrunch.com/2015/07/12/spark-and-hadoop-are-friends-not-foes/

This post is an installation guide for Apache Hadoop 3.2.1 and Apache Spark 3.0 (the latest stable versions at the time of writing). It assumes you have used Big Data frameworks like Hadoop and Apache Spark before and want to try out the latest versions of both environments for development purposes. That said, I will also briefly cover the fundamentals of Apache Hadoop and Apache Spark.

Note: This installation is not meant for a real-life / production environment. My next post will cover a multi-node cluster setup suitable for production.

What is the difference between Stand-Alone mode and pseudo-distributed mode?

Single Node (Local Mode or Standalone Mode)
Standalone mode is the default mode in which Hadoop runs. It is mainly used for debugging and does not really use HDFS.
In standalone mode, both input and output come from the local file system.

You also don’t need to do any custom configuration in the files mapred-site.xml, core-site.xml, and hdfs-site.xml.

Standalone mode is usually the fastest of the Hadoop modes, since it uses the local file system for all input and output.

Pseudo-distributed Mode
The pseudo-distributed mode is also known as a single-node cluster, where both the NameNode and the DataNode reside on the same machine.

In pseudo-distributed mode, all the Hadoop daemons run on a single node. Such a configuration is mainly used for testing, when we don’t need to think about resource limits or other users sharing the resources.

In this architecture, a separate JVM is spawned for every Hadoop component, and the components communicate with each other across network sockets, effectively producing a fully functioning mini-cluster on a single host.

So, in this mode, changes are required in all three configuration files: mapred-site.xml, core-site.xml, and hdfs-site.xml.

HDFS and MapReduce:

Apache Hadoop is a collection of open-source software utilities that facilitate using a network of many computers to solve problems involving massive amounts of data and computation. The core of Apache Hadoop consists of a storage part, known as the Hadoop Distributed File System (HDFS), and a processing part built on the MapReduce programming model.

Hadoop splits files into large blocks and distributes them across the nodes in a cluster. It then transfers packaged code to the nodes so they can process the data in parallel. This approach takes advantage of data locality, where nodes manipulate the data they already hold, allowing the dataset to be processed faster and more efficiently than in a more conventional architecture that relies on a parallel file system where computation and data are distributed via high-speed networking.

In short, HDFS gives us a base to store large datasets distributed across multiple nodes, and the MapReduce programming model gives us a faster, more efficient way to process and retrieve that data.
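To make the model concrete, here is a minimal word-count sketch using the examples jar that ships with Hadoop 3.2.1 (the same jar is used again in the test section at the end of this post). It assumes the pseudo-distributed cluster built in the steps below is already running and that Hadoop lives under /usr/local/hadoop:

# hedged sketch: assumes the daemons from the steps below are running
hdfs dfs -mkdir -p wordcount/in                       # input dir under /user/<you>
hdfs dfs -put /usr/local/hadoop/LICENSE.txt wordcount/in
# the "map" step emits (word, 1) pairs; the "reduce" step sums the counts per word
hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar \
  wordcount wordcount/in wordcount/out
hdfs dfs -cat wordcount/out/part-r-00000 | head       # word<TAB>count lines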

Useful Resources:

https://en.wikipedia.org/wiki/MapReduce

The base Apache Hadoop framework is composed of the following modules:

  • Hadoop Common — contains libraries and utilities needed by other Hadoop modules;
  • Hadoop Distributed File System (HDFS) — a distributed file-system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster;
  • Hadoop YARN — a platform responsible for managing computing resources in clusters and using them for scheduling users’ applications
  • Hadoop MapReduce — an implementation of the MapReduce programming model for large-scale data processing.

The term Hadoop is often used for both base modules and sub-modules and also the ecosystem, or collection of additional software packages that can be installed on top of or alongside Hadoop, such as Apache Pig, Apache Hive, Apache HBase, Apache Phoenix, Apache Spark, Apache ZooKeeper, Cloudera Impala, Apache Flume, Apache Sqoop, Apache Oozie, and Apache Storm. In this post we shall install Apache Spark along with Hadoop.

Installation of Hadoop:

Pre-req:

  1. A Linux distribution (a VM should work fine, but it is not recommended)
  2. Sudo privileges
  3. A decent computer with a stable internet connection (only needed for downloading the software)

Installation:

  1. Install Java
sudo apt update
sudo apt install openjdk-8-jdk openjdk-8-jre
# this command is for an ubuntu system

2. See the Hadoop wiki for known good Java versions. I used Java 8. Verify your installation using java -version.

(base) sriram@sriram-Inspiron-7572:~$ java -version
openjdk version "1.8.0_265"

OpenJDK Runtime Environment (build 1.8.0_265-8u265-b01-0ubuntu2~20.04-b01)
OpenJDK 64-Bit Server VM (build 25.265-b01, mixed mode)

3. To change the Java version used

(base) sriram@sriram-Inspiron-7572:~$ sudo update-alternatives --config java
There are 2 choices for the alternative java (providing /usr/bin/java).
Selection Path Priority Status
------------------------------------------------------------
0 /usr/lib/jvm/java-11-openjdk-amd64/bin/java 1111 auto mode
1 /usr/lib/jvm/java-11-openjdk-amd64/bin/java 1111 manual mode
* 2 /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java 1081 manual mode
Press <enter> to keep the current choice[*], or type selection number:

I have moved my java-8-openjdk-amd64 to /usr/local/ (personal preference). I suggest you do the same for the sake of this tutorial, or else note down its location carefully for $JAVA_HOME.
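If you want to follow the same layout, here is a sketch of the copy; the source path below is the default Ubuntu location, so adjust it if update-alternatives showed something different:

# copy the JDK to /usr/local (the original under /usr/lib/jvm stays intact)
sudo cp -r /usr/lib/jvm/java-8-openjdk-amd64 /usr/local/
ls /usr/local/java-8-openjdk-amd64/bin/java   # confirm the binary is in place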

4. Add JAVA_HOME to ~/.bashrc

Note: .bashrc is a powerful file; careless changes to it can break your shell environment. Use it carefully and make sure you don’t delete existing lines or add unnecessary ones. In this tutorial (and in most tutorials) you’ll find that instructors suggest the nano / vi text editors. People coming from a pure Windows background might find these hard to use, so I would recommend gedit or subl instead (just replace nano/vi with gedit).

$ nano ~/.bashrc # to open .bashrc (no sudo needed for your own file)

scroll to the end and paste these lines

# JAVA VARIABLES
export JAVA_HOME=/usr/local/java-8-openjdk-amd64
export PATH=$PATH:$JAVA_HOME/bin

Save and close (ctrl + s and ctrl + x for nano)
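A quick sanity check after saving, assuming the paths above:

source ~/.bashrc                 # reload the file in the current shell
echo $JAVA_HOME                  # should print /usr/local/java-8-openjdk-amd64
$JAVA_HOME/bin/java -version     # should match the java -version output above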

5. ssh must be installed and sshd must be running in order to use the Hadoop scripts that manage remote Hadoop daemons (i.e. if the optional start and stop scripts are to be used). Additionally, it is recommended that pdsh also be installed for better ssh resource management. (On Ubuntu:)

(base) sriram@sriram-Inspiron-7572:~$ sudo apt-get install ssh
(base) sriram@sriram-Inspiron-7572:~$ sudo apt-get install pdsh

Make sure you add this line to your ~/.bashrc file:

# this line is to ensure pdsh uses ssh
export PDSH_RCMD_TYPE=ssh

6. Setup passphraseless ssh

Now check that you can ssh to the localhost without a passphrase:

$ ssh localhost

If you cannot ssh to localhost without a passphrase, execute the following commands:

$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 0600 ~/.ssh/authorized_keys

Once you are done it should look like this:

(base) sriram@sriram-Inspiron-7572:~$ ssh localhost
Welcome to Ubuntu 20.04.1 LTS (GNU/Linux 5.4.0-42-generic x86_64)
* Documentation: https://help.ubuntu.com
* Management: https://landscape.canonical.com
* Support: https://ubuntu.com/advantage
* Are you ready for Kubernetes 1.19? It's nearly here! Try RC3 with
sudo snap install microk8s --channel=1.19/candidate --classic
https://microk8s.io/ has docs and details.
2 updates can be installed immediately.
0 of these updates are security updates.
To see these additional updates run: apt list --upgradable
Your Hardware Enablement Stack (HWE) is supported until April 2025.
*** System restart required ***
Last login: Fri Aug 14 13:17:31 2020 from 127.0.0.1

7. Download and extract the Hadoop 3.2.1 package to a location of your choice.

$ wget https://downloads.apache.org/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz #DOWNLOAD
$ tar xzf hadoop-3.2.1.tar.gz #EXTRACT
$ mv hadoop-3.2.1 hadoop #rename
$ sudo mv hadoop /usr/local/ #move into /usr/local (needs sudo)

You can also download manually from the link above, extract the files, and place them in any location. I placed Hadoop at /usr/local/.
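Optionally, verify the download against the checksum Apache publishes next to the tarball (a hedged step; it assumes the mirror still hosts the .sha512 file for 3.2.1):

wget https://downloads.apache.org/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz.sha512
sha512sum hadoop-3.2.1.tar.gz      # compute the local checksum
cat hadoop-3.2.1.tar.gz.sha512     # compare it with the published value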

8. Set Hadoop environment variables

Add these lines to your /etc/environment file:

PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/usr/local/hadoop/bin:/usr/local/hadoop/sbin"JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64/jre"

Add these lines of code to your bashrc file

# source /etc/environment so its PATH entries (including the hadoop bin and sbin directories) are picked up, letting you run the "hadoop" command from anywhere
source /etc/environment
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
# point the JVM at Hadoop's native libraries; this addresses the "WARN util.NativeCodeLoader" message (functionality is unaffected without it, but the native code is faster)
export HADOOP_OPTS="$HADOOP_OPTS -Djava.library.path=$HADOOP_HOME/lib/native"
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_INSTALL=$HADOOP_HOME

To apply the changes, run source ~/.bashrc (or open a new terminal).

9. Edit Config Files

This is the most important section of the module. Follow the steps carefully.

  • Add the following properties inside the <configuration> tags of the files listed below
  • i.e. replace the existing <configuration> … </configuration> block in each file

$HADOOP_HOME/etc/hadoop/core-site.xml:

<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/usr/local/hadoop/tmpdata</value>
  </property>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://127.0.0.1:9000</value>
  </property>
</configuration>

$HADOOP_HOME/etc/hadoop/hdfs-site.xml:

<configuration>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/usr/local/hadoop/dfsdata/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/usr/local/hadoop/dfsdata/datanode</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
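The directories referenced above don’t exist yet, and /usr/local/hadoop was moved there with sudo, so it is worth creating them and handing the tree to your user before formatting the NameNode. A sketch, assuming the paths used in the configs above:

sudo mkdir -p /usr/local/hadoop/tmpdata /usr/local/hadoop/dfsdata/namenode /usr/local/hadoop/dfsdata/datanode
sudo chown -R $USER /usr/local/hadoop   # give your user ownership of the whole tree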

$HADOOP_HOME/etc/hadoop/mapred-site.xml:

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

$HADOOP_HOME/etc/hadoop/yarn-site.xml:

<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>127.0.0.1</value>
  </property>
  <property>
    <name>yarn.acl.enable</name>
    <value>0</value>
  </property>
  <property>
    <name>yarn.nodemanager.env-whitelist</name>
    <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
  </property>
</configuration>

10. Edit hadoop-env.sh

The hadoop-env.sh file serves as the master file for Hadoop-wide environment settings and is picked up by the YARN, HDFS, and MapReduce scripts.

When setting up a single node Hadoop cluster, you need to define which Java implementation is to be utilized. Use the previously created $HADOOP_HOME variable to access the hadoop-env.sh file:

Note the value:

(base) sriram@sriram-Inspiron-7572:~$ $JAVA_HOME
bash: /usr/local/java-8-openjdk-amd64: Is a directory

The value is /usr/local/java-8-openjdk-amd64. Now open the hadoop-env.sh file:

sudo nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh

Uncomment the $JAVA_HOME variable (i.e., remove the # sign) and add the full path to the OpenJDK installation on your system. If you have installed the same version as presented in the first part of this tutorial, add the following line:

export JAVA_HOME=/usr/local/java-8-openjdk-amd64
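To confirm the edit took effect, a quick check (assuming the path above):

grep "^export JAVA_HOME" $HADOOP_HOME/etc/hadoop/hadoop-env.sh
# expected: export JAVA_HOME=/usr/local/java-8-openjdk-amd64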

11. Format the file system

$ hdfs namenode -format # or bin/hdfs namenode -format from inside $HADOOP_HOME

12. If everything has gone well so far, you should see output like this; Hadoop itself is installed successfully.

(base) sriram@sriram-Inspiron-7572:~$ hadoop version
Hadoop 3.2.1
Source code repository https://gitbox.apache.org/repos/asf/hadoop.git -r b3cbbb467e22ea829b3808f4b7b01d07e0bf3842
Compiled by rohithsharmaks on 2019-09-10T15:56Z
Compiled with protoc 2.5.0
From source with checksum 776eaf9eee9c0ffc370bcbc1888737
This command was run using /usr/local/hadoop/share/hadoop/common/hadoop-common-3.2.1.jar

13. Verify the installation

(base) sriram@sriram-Inspiron-7572:~$ start-all.sh
WARNING: Attempting to start all Apache Hadoop daemons as sriram in 10 seconds.
WARNING: This is not a recommended production deployment configuration.
WARNING: Use CTRL-C to abort.
Starting namenodes on [localhost]
Starting datanodes
localhost: datanode is running as process 33621. Stop it first.
Starting secondary namenodes [sriram-Inspiron-7572]
sriram-Inspiron-7572: secondarynamenode is running as process 33832. Stop it first.
Starting resourcemanager
Starting nodemanagers
(base) sriram@sriram-Inspiron-7572:~$ jps
35475 Jps
33621 DataNode
35111 NodeManager
33832 SecondaryNameNode
34954 ResourceManager
[Screenshots: NameNode, DataNode, and ResourceManager web UIs]

PORTS (on localhost):

8088: ResourceManager web UI

9870: NameNode web UI

9864: DataNode web UI
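Once the daemons from start-all.sh are up, you can confirm these web UIs respond, either in a browser or with curl if it is installed (the ports above are the Hadoop 3.x defaults):

curl -s -o /dev/null -w "%{http_code}\n" http://localhost:9870   # NameNode UI
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8088   # ResourceManager UI
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:9864   # DataNode UI
# each should print 200 once the corresponding daemon is up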

My bashrc file:

# source /etc/environment so its PATH entries (including the hadoop bin and sbin directories) are picked up, letting you run the "hadoop" command from anywhere
source /etc/environment
# JAVA VARIABLES
export JAVA_HOME=/usr/local/java-8-openjdk-amd64
export PATH=$PATH:$JAVA_HOME/bin
# HADOOP VARIABLES
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
# point the JVM at Hadoop's native libraries; this addresses the "WARN util.NativeCodeLoader" message (functionality is unaffected without it, but the native code is faster)
export HADOOP_OPTS="$HADOOP_OPTS -Djava.library.path=$HADOOP_HOME/lib/native"
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_INSTALL=$HADOOP_HOME
# this line is to ensure pdsh uses ssh
export PDSH_RCMD_TYPE=ssh
# SPARK VARIABLES
export SPARK_HOME=/usr/local/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin

14. Test a Basic Command

// Guess what these commands do? // (answer at the end)

(base) sriram@sriram-Inspiron-7572:/usr/local/hadoop$ bin/hdfs dfs -mkdir /user
(base) sriram@sriram-Inspiron-7572:/usr/local/hadoop$ bin/hdfs dfs -mkdir /user/sriram
(base) sriram@sriram-Inspiron-7572:/usr/local/hadoop$ hdfs dfs -mkdir /input
(base) sriram@sriram-Inspiron-7572:/usr/local/hadoop$ hdfs dfs -put etc/hadoop/*.xml /input
2020-08-14 15:16:02,263 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2020-08-14 15:16:03,116 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2020-08-14 15:16:03,300 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2020-08-14 15:16:03,759 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2020-08-14 15:16:03,931 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2020-08-14 15:16:04,104 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2020-08-14 15:16:04,288 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2020-08-14 15:16:04,405 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2020-08-14 15:16:04,524 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
(base) sriram@sriram-Inspiron-7572:/usr/local/hadoop$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar grep input output 'dfs[a-z.]+'
2020-08-14 15:18:41,134 INFO client.RMProxy: Connecting to ResourceManager at /127.0.0.1:8032
2020-08-14 15:18:41,853 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/sriram/.staging/job_1597397135082_0001
2020-08-14 15:18:42,045 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2020-08-14 15:18:42,277 INFO mapreduce.JobSubmitter: Cleaning up the staging area /tmp/hadoop-yarn/staging/sriram/.staging/job_1597397135082_0001
org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://127.0.0.1:9000/user/sriram/input
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:332)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:274)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:396)
at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:310)
at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:327)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:200)
at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1570)
at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1567)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1567)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1588)
at org.apache.hadoop.examples.Grep.run(Grep.java:78)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
at org.apache.hadoop.examples.Grep.main(Grep.java:103)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:71)
at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144)
at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:74)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:323)
at org.apache.hadoop.util.RunJar.main(RunJar.java:236)
(base) sriram@sriram-Inspiron-7572:/usr/local/hadoop$ bin/hdfs dfs -cat output/*
cat: `output/part-r-00000': No such file or directory
cat: `output/_SUCCESS': No such file or directory
(base) sriram@sriram-Inspiron-7572:/usr/local/hadoop$
The changes reflected in HDFS:
(base) sriram@sriram-Inspiron-7572:/usr/local/hadoop$ cat output/part-r-00000 
1 dfsadmin
1 dfs.replication

Code: [answer]

  1. Created a directory input on HDFS
  2. hdfs dfs -put etc/hadoop/*.xml /input : copied all the .xml config files into it
  3. Ran the grep example, which writes every string matching the regex dfs[a-z.]+ (together with its count) into output
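Note that the InvalidInputException above happened because the files were put into /input while the example job looked for the relative path input (which resolves to /user/sriram/input). Here is a sketch of the same test with consistent, user-relative paths, assuming the daemons are running:

hdfs dfs -mkdir -p /user/$USER          # your home directory on HDFS
hdfs dfs -mkdir input                   # resolves to /user/$USER/input
hdfs dfs -put $HADOOP_HOME/etc/hadoop/*.xml input
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar \
  grep input output 'dfs[a-z.]+'
hdfs dfs -cat output/part-r-00000       # matched strings with their counts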

15. Finishing things off

(base) sriram@sriram-Inspiron-7572:/usr/local/hadoop$ stop-all.sh
WARNING: Stopping all Apache Hadoop daemons as sriram in 10 seconds.
WARNING: Use CTRL-C to abort.
Stopping namenodes on [localhost]
Stopping datanodes
Stopping secondary namenodes [sriram-Inspiron-7572]
Stopping nodemanagers
Stopping resourcemanager

Apache Spark

Apache Spark is an open-source, distributed processing system used for big data workloads. It utilizes in-memory caching and optimized query execution for fast analytic queries against data of any size. It provides development APIs in Java, Scala, Python and R, and supports code reuse across multiple workloads — batch processing, interactive queries, real-time analytics, machine learning, and graph processing. You’ll find it used by organizations from every industry, including FINRA, Yelp, Zillow, DataXu, Urban Institute, and CrowdStrike. Apache Spark has become one of the most popular big data distributed processing frameworks, with 365,000 meetup members in 2017.


Apache Spark vs. Apache Hadoop

Outside of the differences in the design of Spark and Hadoop MapReduce, many organizations have found these big data frameworks to be complementary, using them together to solve a broader business challenge.

Hadoop is an open source framework that has the Hadoop Distributed File System (HDFS) as storage, YARN as a way of managing computing resources used by different applications, and an implementation of the MapReduce programming model as an execution engine. In a typical Hadoop implementation, different execution engines are also deployed such as Spark, Tez, and Presto.

Spark is an open source framework focused on interactive query, machine learning, and real-time workloads. It does not have its own storage system, but runs analytics on other storage systems like HDFS, or other popular stores like Amazon Redshift, Amazon S3, Couchbase, Cassandra, and others. Spark on Hadoop leverages YARN to share a common cluster and dataset as other Hadoop engines, ensuring consistent levels of service, and response.

In this post I will not dive deep into the Spark framework, but just give a quick installation guide.

Installation:

  1. Download Spark 3.0.0 (the package pre-built for Apache Hadoop 3.2) from the Apache Spark downloads page and place it at /usr/local (a download/extract sketch follows the snippet below)
  2. Add the following lines to your bashrc (change the location if you have extracted Spark in a different place)
# Spark Variables
export SPARK_HOME=/usr/local/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
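For step 1, here is a hedged sketch of downloading and extracting the package; the archive URL and file name below assume the "pre-built for Apache Hadoop 3.2 and later" build of Spark 3.0.0, so adjust them if you pick a different package:

wget https://archive.apache.org/dist/spark/spark-3.0.0/spark-3.0.0-bin-hadoop3.2.tgz
tar xzf spark-3.0.0-bin-hadoop3.2.tgz
sudo mv spark-3.0.0-bin-hadoop3.2 /usr/local/spark   # matches SPARK_HOME above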

3. Add the following line to $SPARK_HOME/bin/load-spark-env.sh

export SPARK_LOCAL_IP="127.0.0.1"

4. Verify installation

start-all.sh # To start all hadoop-daemons
spark-shell --master yarn # start spark with YARN
2020-08-14 17:53:50,165 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2020-08-14 17:54:00,660 WARN yarn.Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
Spark context Web UI available at http://localhost:4040
Spark context available as 'sc' (master = yarn, app id = application_1597405003831_0005).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.0.0
      /_/

Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_252)
Type in expressions to have them evaluated.
Type :help for more information.
scala>
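Beyond opening the shell, a quick end-to-end smoke test is to submit the bundled SparkPi example to YARN. This is a sketch; the examples jar name below matches the Spark 3.0.0 / Scala 2.12 distribution, so adjust it if yours differs:

spark-submit --master yarn --deploy-mode client \
  --class org.apache.spark.examples.SparkPi \
  $SPARK_HOME/examples/jars/spark-examples_2.12-3.0.0.jar 10
# look for a line like "Pi is roughly 3.14..." in the driver output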

Check out my GitHub repo that covers basics of Spark and SparkML. More articles on spark and sparkml will be posted soon.


Sriram

Computer Science Undergraduate. Passionate about Machine learning, AI , Big Data. Also working on blockchain / Dapps.