
UNIT 1 - PART A

1. What are the four V’s of Big Data?

A. Volume
B. Variety
C. Veracity
D. All of the above

2. What are the different features of Big Data Analytics?

A. Open-Source
B. Scalability
C. Data Recovery
D. All the above

3. _________ is a general-purpose computing model and runtime system for
distributed data analytics.

A. MapReduce
B. Drill
C. Oozie
D. None of the above

4. The examination of large amounts of data to see what patterns or other useful
information can be found is known as

A. Data examination
B. Information analysis
C. Big data analytics
D. Data analysis

5. According to analysts, for what can traditional IT systems provide a foundation
when they’re integrated with big data technologies like Hadoop?

A. Big data management and data mining
B. Data warehousing and business intelligence
C. Management of Hadoop clusters
D. Collecting and storing unstructured data

6. Big data analysis does all of the following except

A. Collects data
B. Spreads data
C. Organizes data
D. Analyzes data

7. What are the main components of Hadoop?

A. MapReduce
B. HDFS
C. YARN
D. All of these

8. Listed below are the steps followed in the Big Data analytics process, except

A. Data Preprocessing
B. Data Cleaning
C. Data Analysis
D. Data Dissemination

9. The word 'Big Data' was coined in the year
A. 2000
B. 1970
C. 1998
D. 2005

10. Infer the best answer: which industries employ the use of so-called "Big
Data" in their day-to-day operations?

A. Weather forecasting
B. Marketing
C. Healthcare
D. All of the above

11. What makes Big Data analysis difficult to optimize?

A. Big Data is not difficult to optimize
B. Both data and cost-effective ways to mine data to make business sense out
of it
C. The technology to mine data
D. All of the above

12. The new source of big data that will trigger a Big Data revolution in the years to
come is
A. Business transactions
B. Social media
C. Transactional data and sensor data
D. RDBMS

13. The word 'Big data' was coined by


A. Roger Mougalas
B. John Philips
C. Simon Woods
D. Martin Green

14. Concerning the forms of Big Data, which one of these is the odd one out?
A. Structured
B. Unstructured
C. Processed
D. Semi-Structured

15. Big Data applications benefit the media and entertainment industry by
A. Predicting what the audience wants
B. Ad targeting
C. Scheduling optimization
D. All of the above

PART B
1. Discuss statistical techniques

A. Point estimation, Testing Hypothesis, Correlation, and Regression.

B. Data Cleaning, Data Preprocessing, Data Analyzing, Data storage

C. Single layer, Multi-layer , Recurrent , Self-organized

D. Knowledge mining, Knowledge extraction, Data/pattern analysis, Data Archaeology,
Data dredging
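
As a quick illustration of two of the techniques in option A, here is a minimal
Python sketch of point estimation (the sample mean) and correlation; the data
values are made up for illustration:

import statistics

sample = [4.1, 3.9, 4.5, 4.0, 4.3]   # hypothetical measurements
x = [1, 2, 3, 4, 5]
y = [2.0, 4.1, 5.9, 8.2, 9.8]

# Point estimation: the sample mean serves as an estimate of the population mean.
print(statistics.mean(sample))

# Correlation: Pearson's r between two paired samples (requires Python 3.10+).
print(statistics.correlation(x, y))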

2. Identify the statement that is true about statistics.

A. Statistics is used to process complex problems in the real world
B. Statistics is used to process simple problems in the virtual world
C. Statistics is used to process simple problems in the real world
D. None of the above

3. __________ statistics uses the data to provide descriptions of the population, either
through numerical calculations or graphs or tables.

A. Descriptive
B. Quantitative
C. Inferential
D. Qualitative

4. Categorize big data types with examples.

A. Structured – Fixed format, Semi structured - text files, images, videos, Unstructured -
XML, JSON, Social Network

B. Structured – text files, images, videos, Semi structured - XML, JSON, Social
Network, Unstructured - Fixed format

C. Structured – Fixed format, Semi structured - XML, JSON, Social Network,
Unstructured – text files, images, videos

D. Structured – XML, JSON, Social Network, Semi structured - text files, images,
videos, Unstructured - Fixed format

5. Identify the statement related to veracity in Big Data.

A. Related to a size which is enormous
B. Heterogeneous sources and the nature of data
C. The speed of generation of data.
D. Uncertainty due to data inconsistency and incompleteness

UNIT 2 - PART A
1. Define Data Streaming.
A. Stream data is always unstructured data.
B. Stream data is always continuous data with high velocity.
C. Stream elements are always semi-structured data.
D. Stream data is always structured data.

2. Which of the following statements about sampling are correct?

A. Sampling reduces the amount of data fed to a subsequent data mining
algorithm and keeps the statistical properties of the data intact

B. Sampling increases the amount of elements in a data stream

C. Sampling does not use statistical concepts

D. Sampling increases the amount of data fed to the algorithm

3. A Bloom filter guarantees only


A. False Positives
B. False Negatives
C. False Positives And False Negatives
D. False Positives Or False Negatives, Depending On The Bloom Filter Type

4. The DGIM algorithm was developed to estimate the number of 1's in the window
A. With an error no more than 50%
B. With an error more than 50%
C. With an error of 25%
D. With no error
5. Name the applications that use data streams.
A. Mining social network
B. Apache Spark Streaming
C. Cisco Connected Streaming Analytics
D. Oracle Stream Analytics

6. Which algorithm should be used to approximate the number of distinct elements in a
data stream?
A. Misra-Gries
B. Alon-Matias-Szegedy
C. DGIM
D. Flajolet-Martin
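
For reference, a minimal Python sketch of the Flajolet-Martin idea: hash each
element, track the maximum number of trailing zeros R seen in any hash value, and
estimate the distinct count as roughly 2^R. This simplified version uses a single
hash function; practical implementations average over many.

import hashlib

def trailing_zeros(n):
    # Number of trailing zero bits in n (0 is treated as having none).
    if n == 0:
        return 0
    return (n & -n).bit_length() - 1

def fm_estimate(stream):
    # Track R, the maximum number of trailing zeros over all hashed elements;
    # the number of distinct elements is then estimated as roughly 2**R.
    r = 0
    for item in stream:
        h = int(hashlib.md5(str(item).encode()).hexdigest(), 16)
        r = max(r, trailing_zeros(h))
    return 2 ** r

print(fm_estimate(["a", "b", "a", "c", "b", "d"]))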

7. Which of the following statements about standard Bloom filters is correct?


A. It is possible to delete an element from a Bloom filter.
B. A Bloom filter always returns the correct result.
C. It is possible to alter the hash functions of a full Bloom filter to create more space.
D. A Bloom filter always returns TRUE when testing for a previously added
element.

8. Which of the following streaming windows show valid bucket representations according
to the DGIM rules?
A. 1 0 1 1 1 0 1 0 1 1 1 1 0 1 0 1
B. 1 0 1 1 1 0 0 0 0 1 1 0 0 0 1 0 1 1 1 0 0 1
C. 1 1 1 1 0 0 1 1 1 0 1 0 1
D. 1 0 1 1 0 0 0 1 0 1 1 1 0 1 1 0 0 1 0 1 1
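
A valid DGIM representation groups the 1's into buckets whose sizes are powers of
two, with at most two buckets of each size and sizes non-decreasing toward the
older end of the window. A simplified Python sketch of the bucket-maintenance rule
(ignoring window expiry) follows:

def dgim_buckets(bits):
    # Maintain DGIM buckets over a bit stream. Each bucket is (end_time, size):
    # sizes are powers of two and at most two buckets of each size may coexist.
    buckets = []  # ordered newest first
    for t, bit in enumerate(bits):
        if bit != 1:
            continue
        buckets.insert(0, (t, 1))
        size = 1
        # Whenever three buckets of one size exist, merge the two oldest.
        while sum(1 for _, s in buckets if s == size) > 2:
            olds = [i for i, (_, s) in enumerate(buckets) if s == size]
            i, j = olds[-2], olds[-1]               # the two oldest of this size
            buckets[i] = (buckets[i][0], size * 2)  # keep the newer end timestamp
            del buckets[j]
            size *= 2
    return buckets

print(dgim_buckets([1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1]))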

9. Which attribute is not indicative of data streaming?


A. Limited amount of memory
B. Limited amount of processing time
C. Limited amount of input data
D. Limited amount of processing power

10. Other research directions of sentiment analysis or opinion mining are:

A. Subjectivity/objectivity identification.

B. The more fine-grained analysis model is called the feature/aspect-based sentiment
analysis.

C. Identifying relevant entities, extracting their features/aspects, and determining
whether an opinion expressed on each feature/aspect is positive, negative or neutral.

D. None of the above


11. The "Twitter datastream" contains tuples of the form:
(messageID, message, userID_of_posting_user, in_reply_to_messageID,
time_of_posting, language_of_message).

You can assume that messageIDmessageID and userIDuserID are unique, i.e. every


message has a unique identifier and every user has a unique identifier. If the message is not
posted in reply to any other message, we
have in_reply_to_messageID=nullin_reply_to_messageID=null. Examples of tuples in that
stream are:

(124324234324, "@Nelly: I had breakfast just now!", 33523232, 122192225674,


"23/11/2014", "English").
(435345332432, "Sitting in Paris, drinking a coffee", null, 122198435674, "24/11/2014",
"English").

We want to answer queries by sampling roughly 1/10th of the data. What is a good sampling
strategy to answer the following query: What is the fraction of messages written in English?

A. Sample userIDs and include all messages by a user
B. Sample language_of_message and include all messages of a language
C. Generate a random number r between 0 and 9 and sample the tuple if r == 0
D. Sample messageIDs and include all messages with a particular messageID

12. Consider the same data as in the previous question. What is a good sampling strategy to
answer the following query: What fraction of users post only in English?
A. Sample userIDs and include all messages by a user
B. Sample language_of_message and include all messages of a language
C. Sample by the combined key (userID, language_of_message)
D. Generate a random number r between 0 and 9 and sample the tuple if r == 0

13. Consider the same data as in the previous two questions. What is a good sampling
strategy to answer the following query: How many replies does a tweet have on
average?
A. Sample (userID, in_reply_to_messageID)
B. Sample userIDs and include all messages by a user
C. Sample in_reply_to_messageIDs
D. Generate a random number r between 0 and 9 and sample the tuple if r == 0
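
The key idea behind questions 12 and 13 is to sample by hashing a key rather than
flipping a coin per tuple, so that all tuples sharing that key are kept or dropped
together. A minimal Python sketch, using toy tuples in the question's schema:

import hashlib

# Toy tuples in the question's schema:
# (messageID, message, userID_of_posting_user, in_reply_to_messageID, time, language)
stream = [
    (1, "hello", 10, None, "23/11/2014", "English"),
    (2, "bonjour", 11, None, "23/11/2014", "French"),
    (3, "hi again", 10, 1, "24/11/2014", "English"),
]

def keep_user(user_id, buckets=10):
    # Hash the userID into one of 10 buckets and keep only bucket 0, so a sampled
    # user's messages are either all retained or all dropped (a 1/10 user sample).
    h = int(hashlib.md5(str(user_id).encode()).hexdigest(), 16)
    return h % buckets == 0

sample = [t for t in stream if keep_user(t[2])]
print(sample)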

14. A system that processes data and instructions without any delay is called a
A. Real time system
B. Online system
C. Offline system
D. Instruction system

15. Mention the algorithm that is not used for data streaming
A. Bloom filter
B. AMS method
C. Flajolet-Martin Algorithm
D. Real time Analytics

PART B
1. Discuss Sentiment Analysis.

A. Sentiment analysis, also known as opinion mining, refers to the use of natural
language processing, text analysis, and computational linguistics to identify and
extract subjective information in source materials.

B. Sentiment analysis, also known as opinion mining, refers to data that is continuously
generated by different sources.

C. Sentiment analysis, also known as opinion mining, refers to queries or processing over
data within a rolling time window, or on just the most recent data record.

D. Sentiment analysis, also known as opinion mining, refers to analyzing, systematically
extracting information from, or otherwise dealing with, data sets that are too large or
complex to be dealt with by traditional data-processing application software.

2. Identify the use of Decaying Window in Data Stream.

A. The decaying window algorithm allows you to identify a network of many computers to
solve problems involving massive amounts of data and computation

B. The decaying window algorithm allows you to identify information from, or otherwise
deal with, data sets that are too large or complex to be dealt with by traditional
data-processing application software

C. The decaying window algorithm allows you to identify the most popular, or
trending, elements in an incoming data stream

D. The decaying window algorithm allows you to identify the infrastructure supporting the
data lifecycle from acquisition, preparation, integration, and execution
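
A naive Python sketch of the decaying-window idea behind option C: every arriving
element multiplies all existing scores by (1 - c) for a small constant c, then adds
1 to its own score, so recently frequent elements surface as trending. (Real
implementations update scores lazily rather than touching every counter per arrival.)

def decaying_counts(stream, c=1e-6):
    # On each arrival, multiply every existing score by (1 - c), then add 1 to
    # the arriving element's score; large scores mark the trending elements.
    scores = {}
    for item in stream:
        for key in scores:
            scores[key] *= (1 - c)
        scores[item] = scores.get(item, 0) + 1
    return scores

print(decaying_counts(["a", "b", "a", "a", "c"], c=0.1))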

3. Select the Tools/Techniques that can be used with sentiment analysis

A. Latent semantic analysis
B. Semantic Orientation, i.e., based on Pointwise Mutual Information
C. Grammatical dependency relations
D. SentiWordNet
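
As a toy illustration of the lexicon-based flavor of these techniques (resources
such as SentiWordNet assign real polarity scores; the mini-lexicon below is made up):

# Hypothetical mini-lexicon; real systems use resources such as SentiWordNet.
LEXICON = {"good": 1, "great": 2, "bad": -1, "awful": -2, "love": 2, "hate": -2}

def sentiment(text):
    # Sum the polarity of each known word; the sign gives the overall label.
    score = sum(LEXICON.get(word, 0) for word in text.lower().split())
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(sentiment("I love this great phone"))       # positive
print(sentiment("bad screen and awful battery"))  # negative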

4. Select the type of Algorithm used for the below steps


i. An array of n bits, initially all 0’s.
ii. A collection of hash functions h1, h2, . . . , hk. Each hash function maps “key”
values to n buckets, corresponding to the n bits of the bit-array.
iii. Set to 1 each bit that is hi(K) for some hash function hi and some key value K
in S.

A. Bloom Filter Algorithm


B. Flajolet-Martin Algorithm
C. DGIM Algorithm
D. Alon-Matias-Szegedy Algorithm
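
The three steps above map directly onto code. A minimal Python sketch of a Bloom
filter (the bit-array size and the MD5-based hash construction are arbitrary
illustrative choices):

import hashlib

class BloomFilter:
    def __init__(self, n_bits=64, k=3):
        self.n = n_bits
        self.k = k
        self.bits = [0] * n_bits            # (i) an array of n bits, all 0's

    def _positions(self, key):
        # (ii) k hash functions, each mapping a key to one of the n buckets
        for i in range(self.k):
            digest = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.n

    def add(self, key):
        # (iii) set to 1 each bit h_i(K) for every hash function h_i
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # A True answer may be a false positive; False is always correct.
        return all(self.bits[pos] for pos in self._positions(key))

bf = BloomFilter()
bf.add("alice")
print(bf.might_contain("alice"))   # True
print(bf.might_contain("bob"))     # very likely False
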
5. The DGIM algorithm was developed to estimate the count of 1's occurring within the
last k bits of a stream window of size N. Which of the following statements is true
about estimating the number of 0's with DGIM?
A. The number of 0's cannot be estimated at all.
B. The number of 0's can be estimated with a maximum guaranteed error.
C. To estimate the number of 0's and 1's with a guaranteed maximum error, DGIM
has to be employed twice, once creating buckets based on 1's, and once creating
buckets based on 0's.
D. Not estimated

UNIT 3 - PART A

1. Hadoop was written in _____________ language


A. C++
B. Python
C. Java
D. Scala

2. Hadoop is a framework that works with a variety of related tools. Common
cohorts include ____________
A. MapReduce, Hive and HBase
B. MapReduce, MySQL and Google Apps
C. MapReduce, Hummer and Iguana
D. MapReduce, Heron and Trumpet

3. The MapReduce algorithm contains two important tasks, namely __________.


A. mapped, reduce
B. mapping, Reduction
C. Map, Reduction
D. Map, Reduce

4. In how many stages does the MapReduce program execute?


A. 2
B. 3
C. 4
D. 5

5. HDFS works in a __________ fashion.


A. Worker-master
B. master-slave
C. worker/slave
D. All of the above

6. What is full form of HDFS?


A. Hadoop Distributed File System
B. Hadoop Distributed Field System
C. Hadoop Distributed File Search
D. Hadoop Distributed Field Search

7. ______ task, which takes the output from a map as an input and combines those
data tuples into a smaller set of tuples.
A. Map
B. Reduce
C. NameNode
D. DataNode

8. Which of the following is used to schedule jobs and track the jobs assigned to the
Task Tracker?
A. SlaveNode
B. MasterNode
C. JobTracker
D. Task Tracker
9. Which of the following is used for the execution of a Mapper or a Reducer on a slice
of data?

A. Task
B. Job
C. Mapper
D. PayLoad

10. ________ NameNode is used when the Primary NameNode goes down.

A. Rack
B. Data
C. Secondary
D. Both A and B

11. The minimum amount of data that HDFS can read or write is called a _____________.
A. Datanode
B. Namenode
C. Block
D. None of the above

12. For every node (Commodity hardware/System) in a cluster, there will be a _________.

A. Datanode
B. Namenode
C. Block
D. None of the above

13. Which of the following are the Goals of HDFS?

A. Fault detection and recovery


B. Huge datasets
C. Hardware at data
D. All of the above

14. What was Hadoop named after?


A. Creator Doug Cutting’s favorite circus act
B. Cutting’s high school rock band
C. The toy elephant of Cutting’s son
D. A sound Cutting’s laptop made during Hadoop development

15. _________ has the world’s largest Hadoop cluster.

A. Apple
B. Datamatics
C. Facebook
D. None of the mentioned

PART B
1. Illustrate how rack awareness works in HDFS.

A. It refers to the knowledge of cluster topology, or more specifically how the
different data nodes are distributed across the racks of a Hadoop cluster.

B. It refers to the knowledge of a collection of large and complex data sets that makes
it difficult to process using relational database management tools or traditional data
processing applications.

C. It refers to the knowledge of analyzing larger sets of data representing them as
data flows.

D. It refers to the knowledge of online transactional data, while Hadoop is a big data
analytics system that focuses on data warehousing and data lake use cases.

2. Discuss the role of Job Tracker.

A. It works as a slave node for the Job Tracker. It receives tasks and code from the
Job Tracker and applies that code on the file. This process can also be called a Mapper.

B. It will throw an exception saying that the output file directory already exists.

C. The role of Job Tracker is to accept the MapReduce jobs from client and process
the data by using NameNode. In response, NameNode provides metadata to Job
Tracker.

D. It distributes simple, read-only text/data files and/or complex types such as jars, archives,
and others.

3. Select the type of scheduler used in MapReduce for the following:
Scheduler designed to allow sharing a large cluster while giving each
organization a minimum capacity guarantee.
A. Fair Scheduler
B. TaskTracker Scheduler
C. Job Tracker Scheduler
D. Capacity Scheduler

4. Illustrate the general-purpose computing model and runtime system for distributed
data analytics.
A. MapReduce - is a software framework for distributed processing of large data
sets on computing clusters.
B. Drill - is an open-source software framework that supports data-intensive
distributed applications for interactive analysis of large-scale datasets
C. Oozie - is a server-based workflow scheduling system to manage Hadoop jobs
D. None of the mentioned
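
A minimal single-process Python sketch of the MapReduce model (word count): the map
step emits key-value pairs, a shuffle groups them by key, and the reduce step
combines each group into a smaller set of tuples. Real Hadoop MapReduce jobs are
typically written in Java and run across a cluster:

from collections import defaultdict
from itertools import chain

def mapper(line):
    # Map: emit a (word, 1) pair for every word in an input line.
    return [(word, 1) for word in line.split()]

def reducer(word, counts):
    # Reduce: combine all values for one key into a single output tuple.
    return (word, sum(counts))

lines = ["big data big ideas", "data streams"]
pairs = chain.from_iterable(mapper(line) for line in lines)

groups = defaultdict(list)   # shuffle: group intermediate pairs by key
for word, one in pairs:
    groups[word].append(one)

print([reducer(word, counts) for word, counts in groups.items()])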

5. Select the output of a mapper task


A. The Key-value pair of all the records of the dataset.
B. The Key-value pair of all the records from the input split processed by the
mapper
C. Only the sorted Keys from the input split
D. The number of rows processed by the mapper task.

UNIT 4 - PART A
1. Give Expansion for HITS
A. Hypervisor inductive topic search
B. Hashups Induced Topic Search
C. Hypervisor Induced Topic Search
D. Hyperlink Induced Topic Search

2. Which algorithm is mainly used by Google to determine the importance of a particular
page?
A. Singular value decomposition Algorithm
B. HITS Algorithm
C. PageRank Algorithm
D. None of the above
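
A compact Python sketch of the PageRank power iteration on a tiny hypothetical link
graph (damping factor d = 0.85; this toy graph has no dead ends, which a full
implementation would have to handle):

def pagerank(links, d=0.85, iters=50):
    # links maps each page to the list of pages it points to.
    pages = list(links)
    n = len(pages)
    rank = {p: 1 / n for p in pages}
    for _ in range(iters):                    # the passes, i.e. iterations
        new = {p: (1 - d) / n for p in pages}
        for p, outs in links.items():
            for q in outs:
                new[q] += d * rank[p] / len(outs)
        rank = new
    return rank

print(pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]}))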

3. In Hubs and Authorities algorithm, the authority update rule is calculated using
A. For each page p, update auth(p) to be the sum of the hub scores of all pages that
point to it.
B. For each page p, update auth(p) to be the sum of the authority scores of all pages that
point to it.
C. For each page p, update auth(p) to be the sum of the hub scores of all pages that it
points to.
D. For each page p,update auth(p) to be the sum of the authority scores of all pages that it
points to.
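
The authority and hub update rules translate directly into code. A minimal Python
sketch of the Hubs and Authorities (HITS) iteration on a hypothetical link graph,
with normalization so the scores stay bounded:

def hits(links, iters=20):
    # links maps each page to the list of pages it points to.
    pages = set(links) | {q for outs in links.values() for q in outs}
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iters):
        # Authority update: sum of the hub scores of all pages pointing to p.
        auth = {p: sum(hub[q] for q in links if p in links[q]) for p in pages}
        # Hub update: sum of the authority scores of all pages p points to.
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        a, h = sum(auth.values()) or 1, sum(hub.values()) or 1
        auth = {p: v / a for p, v in auth.items()}   # normalize
        hub = {p: v / h for p, v in hub.items()}
    return auth, hub

print(hits({"A": ["B"], "B": ["C"], "C": ["B"]}))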

4. The data type to be visualized may include


A. One-dimensional data
B. Text and hypertext
C. Hierarchies, graph, Algorithm and Software
D. All the above

5. Identify the correct statements about how data is collected for recommender
systems.
A. Asking a user to rate an item on a sliding scale.
B. Asking a user to rank a collection of items from favorite to least favorite.
C. Presenting two items to a user and asking him/her to choose the better one of them.
D. Asking a user to create a list of items that he/she likes.
6. How can data be visualized?
A. Graphs
B. Charts
C. Maps
D. All the above

7. Common use cases for data visualization include


A. Politics
B. Sales and marketing
C. Healthcare
D. All of the above

8. List out the challenges faced by collaborative filtering.

A. Modularity, Scalability, Shilling attacks


B. Data sparsity, Scalability, Shilling attacks
C. Data sparsity, Modularity, Scalability
D. Shilling attacks, Modularity, Data sparsity

9. The several passes required by the PageRank computation are called


A. Approximation
B. Iterations
C. Conditions
D. Modularity

10. Select the correct recommendation system algorithm


A. Collaborative page ranking
B. Collaborative filtering
C. Item based recommendation system
D. Object based recommendation system

11. The collaborative filtering algorithm is divided into


A. Item based.
B. User based.
C. User based, Item based.
D. Memory-based, Model based

12. The Model based filtering uses the techniques like


A. Association Rule
B. Clustering
C. Classification
D. All the above
13. Which of the following statements is not true?
A. Data visualization include the ability to absorb information quickly
B. Data visualization is another form of visual art
C. Data visualization decreases the insights and take slower decisions
D. Data presentation architecture

14. Which of the following algorithm is less expensive and scalable?


A. User-Based Collaborative Filtering
B. Item-to-Item Collaborative Filtering
C. User-to-Item Collaborative Filtering
D. None of these

15. User-Based Collaborative Filtering methods –


A. Are based on user's similarity only
B. In such methods complexity grows linearly with the number of customers and
items
C. Suffers the problem of sparsity of recommendations on the data set
D. None of these

PART B

1. Identify the type of filtering based on past user behavior. Each user’s rating,
purchasing, or viewing history allows the system to establish associations between
users with similar behavior and between items of interest to the same users.

A. Association based filtering
B. Collaborative based filtering
C. Frequent based filtering
D. Content based filtering
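
A toy Python sketch of the user-similarity step at the heart of collaborative
filtering: cosine similarity over co-rated items, computed on a made-up ratings
matrix. Users with similar rating behavior (u1 and u2 below) would be used to
recommend each other's items:

from math import sqrt

# Hypothetical user -> item -> rating data.
ratings = {
    "u1": {"A": 5, "B": 3, "C": 4},
    "u2": {"A": 4, "B": 3, "C": 5},
    "u3": {"A": 1, "B": 5},
}

def cosine(u, v):
    # Similarity over the items both users have rated.
    common = set(u) & set(v)
    num = sum(u[i] * v[i] for i in common)
    den = sqrt(sum(u[i] ** 2 for i in common)) * sqrt(sum(v[i] ** 2 for i in common))
    return num / den if den else 0.0

print(cosine(ratings["u1"], ratings["u2"]))   # high: similar behavior
print(cosine(ratings["u1"], ratings["u3"]))   # lower: dissimilar behavior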

2. Discuss the algorithm developed by Jon Kleinberg.

A. Identifies good authorities and hubs. A higher authority weight occurs if the page
is pointed to by many good hubs. A higher hub weight occurs if the page points to
many good authorities.
B. PageRank is an algorithm used by Google Search to rank web pages in their search engine
results

C. Identifies incoming and outgoing links

D. Identifies dead ends and spider traps

3. Discuss Data Visualization.

A. Data Visualization is used to communicate information clearly and efficiently to users
by the usage of information graphics such as tables and charts.

B. Data Visualization helps users in analyzing a large amount of data in a simpler way.

C. Data Visualization makes complex data more accessible, understandable, and usable.

D. All the above

4. Recommender systems can be defined as:

A. Systems that evaluate quality based on the preferences of others with a
similar point of view
B. Systems that evaluate quality based on the purchase history of any particular
person only
C. Systems that evaluate quality based on the demand of items
D. Systems that evaluate quality based on the association rule mining techniques

5. The approach is basically based on user profiles using features extracted from the
content of the items the user has evaluated in the past and a history of the user's
interaction with the recommender system.
A. Content Based Filtering
B. Collaborative Based Filtering
C. Memory Based Filtering
D. Item Based Filtering

UNIT 5 - PART A
1. How to change the column data type in Hive?
A. ALTER and CHANGE
B. ALTER
C. CHANGE
D. SET
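
For example (a sketch, assuming a hypothetical table t with a column c typed INT):
ALTER TABLE t CHANGE c c BIGINT; here the CHANGE clause of ALTER TABLE restates the
column name and assigns the new data type.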

2. Which of the following format is supported by MongoDB?


A. SQL
B. XML
C. BSON
D. All of the above

3. The key components of Cassandra


A. Cluster
B. Commit log
C. Data center
D. All the above

4. Integral literals by default are assumed to be _________


A. SMALL INT
B. INT
C. BIG INT
D. TINY INT

5. Which of the following is a NoSQL Database Type?

A. SQL
B. Document databases
C. JSON
D. All of the mentioned

6. Which of the following is a wide-column store?


A. Cassandra
B. Riak
C. Hive
D. Mongodb

7. NoSQL databases are mainly used for handling large volumes of ______________
data.
A. Structured
B. Unstructured
C. Semistructured
D. All the above

8. Columns in HBase are organized into


A. Column Group
B. Column Families
C. Column list
D. Column base

9. Impala is an integrated part of a ____________ enterprise data hub.


A. Microsoft
B. IBM
C. Cloudera
D. Apache

10. Which of the following are the key components of the Hive architecture?


A. User Interface
B. Metastore
C. HiveQL Process Engine
D. All of the above

11. MongoDB is a _____________________ type of database.


A. key value
B. column based
C. object oriented
D. Relational document

12. HBase is a distributed ________ database built on top of the Hadoop file system.
A. Column-oriented
B. Tuple -oriented
C. Row-oriented
D. None of the above

13. Point out the correct statement


A. Hive is not a relational database, but a query engine that supports
parts of SQL
B. Hive is a relational database with SQL support
C. Pig is a relational database with SQL support
D. None of the above

14. Most NoSQL databases support automatic __________ meaning that you get high
availability and disaster recovery.
A. Processing
B. Scalability
C. Replication
D. all of the mentioned

15. ___________________ is a query language for Hive to process and analyze
structured data in a Metastore.
A. SQL
B. NO SQL
C. HiveQL
D. All the above

PART B

1. Select the distributed column-oriented database built on top of the Hadoop file
system. It is an open-source project and is horizontally scalable.
A. Hbase
B. Sharding
C. Nosql
D. Hive

2. Identify the Massively Parallel Processing (MPP) SQL query engine for processing huge
volumes of data that is stored in a Hadoop cluster.
A. Cassandra
B. Impala
C. MongoDB
D. Hbase

3. Choose a data warehouse software project built on top of Apache Hadoop for
providing data query and analysis. It gives an SQL-like interface to query data stored
in various databases and file systems that integrate with Hadoop.
A. Mongodb
B. NoSql
C. HIVE
D. Cassandra

4. Identify the open source, distributed and decentralized storage system
(database) for managing very large amounts of structured data spread out across
the world.
A. Mongodb
B. NoSql
C. HIVE
D. Cassandra
5. Documents in the same collection do not need to have the same set of fields or
structure, and common fields in a collection's documents may hold different types of
data. This is known as
A. dynamic schema
B. mongodb
C. mongo
D. Embedded Documents
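
A minimal pymongo sketch of this dynamic-schema behavior (assumes pymongo is
installed and a MongoDB server is running locally; the database and collection
names are arbitrary):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
people = client["demo"]["people"]

# Documents in one collection need not share fields, and a common field
# ("id" here) may even hold values of different types.
people.insert_one({"id": 1, "name": "Asha", "skills": ["hive", "spark"]})
people.insert_one({"id": "two", "name": "Ravi", "city": "Chennai"})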
