
1. Given the sfpd RDD, to create a pair RDD consisting of tuples of the form
(category, 1), in Scala use?
--> val pairs = sfpd.map(x => (x(Category), 1))  // assumes Category is the index of the category field

2. repartition(5) is the same as coalesce(5, shuffle=true). State true or false.
--> True
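
A minimal sketch of the equivalence, assuming sc is an existing SparkContext:

    val rdd = sc.parallelize(1 to 100, 10)
    val a = rdd.repartition(5)               // always shuffles
    val b = rdd.coalesce(5, shuffle = true)  // same behaviour
    // in Spark's source, repartition(n) simply calls coalesce(n, shuffle = true)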

3. Which is true for running Spark on Hadoop YARN?
--> There are two deploy modes: client and cluster.

4. What is dynamic allocation?
--> Dynamic allocation is a property whereby executors can be released back to the
cluster resource pool if they are idle for a specified period of time.

5. Accumulators can be incremented and read from Spark workers? T or F?
--> FALSE (workers can only add to an accumulator; its value can be read only by the driver)
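
A minimal sketch of why, assuming sc is an existing SparkContext and the Spark 2.x accumulator API:

    val acc = sc.longAccumulator("badRecords")
    sc.parallelize(Seq(1, -2, 3)).foreach { x =>
      if (x < 0) acc.add(1)  // workers can increment...
    }
    println(acc.value)       // ...but only the driver can read the result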

6. The keys transformation returns an RDD of the keys from a key-value pair RDD?
T or F
--> TRUE

7. groupByKey is less efficient than reduceByKey?
--> TRUE (reduceByKey combines values locally before the shuffle, so less data moves across the network)
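
A sketch of the difference, assuming sc is an existing SparkContext; both give the same counts, but reduceByKey pre-aggregates within each partition before shuffling:

    val pairs = sc.parallelize(Seq(("a", 1), ("a", 1), ("b", 1)))
    val viaGroup  = pairs.groupByKey().mapValues(_.sum)  // ships every (key, 1) pair across the network
    val viaReduce = pairs.reduceByKey(_ + _)             // combines locally first, then shuffles partial sums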

8) Which partitioner class is used to order keys according to the sort order of the
given type?
--> RangePartitioner
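
A usage sketch, assuming sc is an existing SparkContext:

    import org.apache.spark.RangePartitioner
    val pairs = sc.parallelize(Seq((5, "e"), (1, "a"), (3, "c")))
    // samples the keys to pick range boundaries, so partitions follow the keys' sort order
    val ranged = pairs.partitionBy(new RangePartitioner(2, pairs))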

9. The primary machine learning API for Spark is now the ____-based API?
--> DataFrame

10. An existing RDD unhcrRDD contains refugee data; how do you total the refugee
count per country in Scala?
--> val country = unhcrRDD.map(x => (x(0), x(3))).reduceByKey((a, b) => a + b)  // assumes x(3) is numeric

11. The number of stages in a job is usually the number of RDDs in the DAG; the
scheduler can truncate the lineage when?
--> the RDD is cached or persisted

12. Combining a set of filtered edges and filtered vertices from a graph creates
what structure?
--> a subgraph

13. What RDD function returns max, min, count, mean, and standard deviation?
--> stats()
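
A quick sketch, assuming sc is an existing SparkContext; on a numeric RDD, stats() computes everything in a single pass and returns a StatCounter:

    val nums = sc.parallelize(Seq(1.0, 2.0, 3.0, 4.0))
    val s = nums.stats()
    println(s"count=${s.count} mean=${s.mean} stdev=${s.stdev} max=${s.max} min=${s.min}")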

14. Spark broadcast variables and setting variables in your driver program in
PySpark are the same?
--> False

15. Which of the following in Scala will give the top 10 resolutions, assuming
sfpdDF is a DataFrame registered as the table sfpd?
--> sqlContext.sql("SELECT resolution, count(incidentnum) AS inccount FROM sfpd
GROUP BY resolution ORDER BY inccount DESC LIMIT 10")
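
For comparison, the same query through the DataFrame API might look like this sketch (column names taken from the question):

    import org.apache.spark.sql.functions.{count, desc}
    sfpdDF.groupBy("resolution")
      .agg(count("incidentnum").as("inccount"))
      .orderBy(desc("inccount"))
      .limit(10)
      .show()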

16. Given the pair RDD country that contains tuples (country, count), which one gets the
country with the lowest refugee count in Scala?
Ans: val low = country.map(x => (x._2, x._1)).sortByKey().first  // ascending sort puts the smallest count first

17. Which parameters are required for windowed operations such as reduceByKeyAndWindow?
--> window length and sliding interval
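
A streaming sketch showing both parameters, assuming pairs is an existing DStream of (String, Int) tuples:

    import org.apache.spark.streaming.Seconds
    // window length: aggregate the last 30s of data; sliding interval: recompute every 10s
    val windowed = pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))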

18. What are some of the things you can monitor in the Spark web UI?
--> All of the above

19. Which of the following is not a feature of Spark?
--> It is cost efficient.

20) How do you enable dynamic allocation?
Ans: spark.dynamicAllocation.enabled=true
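
A configuration sketch; the external shuffle service is typically also required so executors can be removed without losing shuffle data (an assumption about your cluster setup):

    import org.apache.spark.SparkConf
    val conf = new SparkConf()
      .set("spark.dynamicAllocation.enabled", "true")
      .set("spark.shuffle.service.enabled", "true")  // usually needed alongside dynamic allocation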

21. Which of the commands below is used to remove a broadcast variable bvar from memory?
--> bvar.unpersist()
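
A lifecycle sketch, assuming sc is an existing SparkContext:

    val bvar = sc.broadcast(Map("a" -> 1, "b" -> 2))
    sc.parallelize(Seq("a", "b")).map(k => bvar.value(k)).collect()
    bvar.unpersist()  // drops the cached copies on the executors; re-broadcast on next use
    // bvar.destroy() would remove it everywhere, after which it cannot be used again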

22. A DataFrame can be created from an existing RDD. You would create a DataFrame
from an existing RDD by inferring the schema using case classes in which case?
--> if all your users are going to need the dataset parsed in the same way
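
A sketch of inferring a schema with a case class; the Incident fields and file path here are hypothetical, and spark is assumed to be an existing SparkSession:

    case class Incident(incidentnum: String, category: String)
    import spark.implicits._
    val df = spark.sparkContext.textFile("sfpd.csv")  // hypothetical path
      .map(_.split(","))
      .map(a => Incident(a(0), a(1)))
      .toDF()  // column names and types come from the case class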

23. What is a DStream internally?
--> a continuous series of RDDs

24. The MEMORY_AND_DISK_SER storage level specifies what storage options for an RDD?
--> in memory, on disk, serialized
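
A usage sketch, assuming rdd is an existing RDD:

    import org.apache.spark.storage.StorageLevel
    // serialized objects in memory; partitions that don't fit spill to disk
    rdd.persist(StorageLevel.MEMORY_AND_DISK_SER)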

25) Which partition sizes hinder Spark performance?
Ans: Both too small and too large

26) Which DataFrame method is used to remove a column from the resultant DataFrame?
Ans: drop()

27) What is the difference between foreach and map?
Ans: foreach is an action and map is a transformation

28) What is the difference between take(1) and first()?
Ans: take(1) returns a list with one element from the RDD; first() returns the
element itself, not wrapped in a list.
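
A tiny sketch, assuming sc is an existing SparkContext:

    sc.parallelize(Seq(10, 20)).take(1)   // Array(10)
    sc.parallelize(Seq(10, 20)).first()   // 10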

29) Caching can use disk if memory is not available. T or F?
Ans: TRUE

30) Spark SQL translates commands into code; this code is processed by?
Ans: executor nodes

31)

32) Apache Spark has APIs in?
Ans: All of the above

33) PySpark is a cluster computing framework that runs on a cluster of commodity
hardware and performs data unification. T or F.
Ans: True

34) Which function is used to call a program written in shell script/Perl from PySpark?
Ans: pipe()
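
A sketch of pipe(), with ./tolower.sh standing in for any shell or Perl script (hypothetical script name); sc is assumed to be an existing SparkContext:

    // each element is written to the script's stdin, one per line;
    // each line of the script's stdout becomes an element of the new RDD
    val piped = sc.parallelize(Seq("A", "B")).pipe("./tolower.sh")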

35) ___ leverages Spark Core's fast scheduling capability for performing streaming
analytics?
Ans: Spark Streaming

36) We can create a DataFrame using?
Ans: All of the above

37) Which DStream output operation is used to write output to the console?
Ans: pprint()

38) What is the default partitioner class used by Spark?
Ans: HashPartitioner
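
For contrast with the RangePartitioner in question 8, a sketch (pairs as any key-value RDD):

    import org.apache.spark.HashPartitioner
    // assigns a key to a partition by key.hashCode modulo the partition count
    val hashed = pairs.partitionBy(new HashPartitioner(4))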

39) Some ways of improving the performance of your Spark app include?
Ans: All of the above

40) The Dataset API was introduced in which Spark release?
Ans: Spark 1.6
