
Spark is a data processing system that can handle large data sets rapidly and spread processing tasks across many devices, either on its own or in conjunction with other distributed computing resources. These two characteristics are critical in the fields of big data and machine learning, which require massive computing resources to process large data sets. Spark also relieves developers of some of the programming burden associated with these activities by providing an easy-to-use API that abstracts away much of the grunt work of distributed computing and big data processing.

Authentication involves confirming that users are who they claim to be, based on the information they submit, and that this information exactly matches the identity stored in our system or with a third-party data provider. Authentication is typically not part of a framework because the user is unknown until they supply information, particularly in applications that let users communicate or enter data. In Spark, authorization and authentication can be performed by the methods described below. Since all of Spark's users have already registered, an authentication layer is required to protect the whole system: before you can connect to the Spark API, you must first pass Spark authentication, which comes in a variety of flavors. The purpose of authorization, in turn, is to ensure that users can access only the services they have requested. Authentication and authorization are perhaps the two most critical players in keeping the infrastructure secure, but a security strategy involves far more than that.

Authentication Handler

It is the software component that manages the authentication process. Its parts are the Token, Credential, Adapter, and Request Filter. First, tokens accompanying a request are accepted and decoded by the application for validation. Credentials are then checked to confirm that the user's username and password are correct when access is requested. Next, tokens are issued, either by reusing an existing token for a user or by creating a new token for an existing user; if the credentials do not match, an exception is thrown. If no authentication value is supplied, the Request Filter steps in and authenticates all current users.
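The handler flow described above can be sketched as follows. This is an illustrative sketch only: the names (`AuthHandler`, `issue_token`, `filter_request`) are hypothetical and not part of any real Spark API.

```python
import secrets

class AuthError(Exception):
    """Raised when credentials or tokens do not match (the 'exception is thrown' step)."""
    pass

class AuthHandler:
    def __init__(self, users):
        self.users = users    # username -> password store (credential check)
        self.tokens = {}      # token -> username (issued tokens)

    def authenticate(self, username, password):
        # Credential step: verify the username/password pair
        if self.users.get(username) != password:
            raise AuthError("invalid credentials")
        return self.issue_token(username)

    def issue_token(self, username):
        # Token step: reuse an existing token for the user, or mint a new one
        for tok, user in self.tokens.items():
            if user == username:
                return tok
        tok = secrets.token_hex(16)
        self.tokens[tok] = username
        return tok

    def filter_request(self, token):
        # Request-filter step: every incoming request must carry a valid token
        if token not in self.tokens:
            raise AuthError("unauthenticated request")
        return self.tokens[token]
```

A caller would first `authenticate(...)` to obtain a token, then pass that token with each request so `filter_request(...)` can admit or reject it.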

Spark currently supports shared-secret authentication for RPC channels. Authentication is enabled with the spark.authenticate configuration parameter. The exact mechanism for generating and distributing the shared secret depends on the deployment. Except where noted below, the secret must be defined by setting the spark.authenticate.secret configuration option. In that case, all Spark applications and daemons share the same secret, which limits the security of these deployments, especially on multi-tenant clusters.
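As a minimal sketch, the shared-secret settings above might look like this in spark-defaults.conf (the secret value is a placeholder; generate your own):

```properties
# spark-defaults.conf — enable shared-secret RPC authentication.
spark.authenticate          true
# Shared by all applications and daemons unless the cluster manager
# generates per-application secrets (see YARN/Kubernetes below).
spark.authenticate.secret   my-shared-secret
```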

The REST Submission Server and MesosClusterDispatcher do not support this authentication mechanism. All network access to the REST API and MesosClusterDispatcher (ports 6066 and 7077, respectively, by default) should therefore be restricted to hosts that are trusted to submit jobs.

Authorization is required for each resource, and applications fall into this category; as a result, almost every Spark program will have its own authorization configuration. Authorization granularity is at the database, table, and partition level. However, Spark's Hive support does not accept GRANT and REVOKE.

JWT (JSON Web Token)

A JSON Web Token consists of three parts: a header, a payload, and a signature. The header has two fields: the hashing algorithm and the token type. The payload carries all of the data we want to send, while the signature is computed over the encoded header and payload using a secret key. Combining these three elements produces the JWT. When a user logs in with their credentials, the server verifies the request and returns a token containing the user's identity; the token is stored on the client and lets the user access the application. When the user requests access to a resource, the token is attached to the authorization header and sent to the server. If the token checks out, the server grants access to the resource. A filter that implements the desired authentication method is needed; there are no built-in authentication filters in Spark.
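The header/payload/signature construction described above can be shown with a small standard-library sketch of HS256 token creation and verification. This is a didactic illustration, not production code or any Spark API; the function names are our own.

```python
import base64
import hashlib
import hmac
import json

def b64url(data: bytes) -> str:
    # Base64url-encode without padding, as JWTs require
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def make_jwt(payload: dict, secret: str) -> str:
    # Header names the hashing algorithm and the token type
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = b64url(json.dumps(header).encode()) + "." + b64url(json.dumps(payload).encode())
    # Signature = HMAC-SHA256 over "header.payload" with the secret key
    sig = hmac.new(secret.encode(), signing_input.encode(), hashlib.sha256).digest()
    return signing_input + "." + b64url(sig)

def verify_jwt(token: str, secret: str) -> dict:
    header_b64, payload_b64, sig_b64 = token.split(".")
    signing_input = header_b64 + "." + payload_b64
    expected = hmac.new(secret.encode(), signing_input.encode(), hashlib.sha256).digest()
    # Constant-time comparison of the recomputed and presented signatures
    if not hmac.compare_digest(b64url(expected), sig_b64):
        raise ValueError("signature mismatch")
    # Restore base64 padding before decoding the payload
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))
```

In practice, the server would call something like `make_jwt` at login and run `verify_jwt` inside the authentication filter on every request.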

Where an authentication filter is present, Spark also supports UI access control, and ACLs can be configured separately for each application. Spark distinguishes between "view" permissions (who is allowed to see the application's UI) and "modify" permissions (who can do things such as kill jobs in a running application). The JwtGenerator interface and JwtParser interface are two JWT interfaces in Spark that generate and parse tokens, respectively.

Kubernetes

Spark can also generate a unique authentication secret for each Kubernetes application. The secret is propagated to executor pods through environment variables. This means that any user with permission to list pods in the namespace where the Spark application is running can also see its authentication secret.

Yarn

Spark on YARN generates and distributes shared secrets automatically. Each application uses a unique shared secret. In the YARN case, this feature relies on YARN RPC encryption to secure the distribution of secrets.

Here are some additional security measures that support authorization and authentication:

1. Files are encrypted, so you cannot read them even if you have access to them. For example, shuffle files and shuffle spills are temporary files saved on local disks.

2. Spark offers SSL support throughout its component hierarchy, so the user can apply a common SSL configuration while still being able to customize each component separately.

3. Spark supports Kerberos if you want to use it to authenticate your identity. In YARN and Mesos modes, the delegation token must be configured.

4. Applications or sessions that are never closed will run into problems when they exceed the maximum token lifetime. Spark renews the token automatically in this situation, but in YARN mode you must configure your long-running applications accordingly.

5. Spark supports AES-based encryption for messages exchanged between the client and the Spark server. RPC authentication must be enabled and configured correctly before encryption can be enabled.
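A minimal configuration sketch for the AES-based RPC encryption in point 5 might look like this in spark-defaults.conf (shown here as an illustration of the dependency on authentication, not a complete hardening guide):

```properties
# spark-defaults.conf — AES-based encryption for RPC traffic.
# Encryption requires RPC authentication to be enabled first.
spark.authenticate              true
spark.network.crypto.enabled    true
```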

