
HIVE CASE STUDY

BY SK.MUDASSAR AND S.SANDHYA DUTTA

PROBLEM STATEMENT:
With online sales growing in popularity, IT companies are looking for ways to
improve their sales by analysing customer behaviour and learning about product trends.
Their websites also make it easy for users to find the things they need without having
to hunt for them. Needless to say, big data analyst is one of the most sought-after
job profiles of the decade. This assignment therefore asks you, as a big data analyst,
to extract data and generate insights from a real-world data set from an e-commerce
company.

DATASETS: Oct-2019, Nov-2019 and Attribute Description.

 Attribute description: Provides information about each attribute.
 Oct-2019: Customer data for the month of October.
 Nov-2019: Customer data for the month of November.

CASE STUDY:
1. We uploaded the files to S3 before launching the Hadoop EMR cluster.

2. Under /user, we created a directory called "case-study."

3. We loaded the S3 files into the "case-study" directory.

4. We then logged in to Hive.

5. We created the database "Case Study" and checked that it was present.

6. Using the OpenCSVSerde, we created a table called "sales."


Query :
Create table if not exists sales(event_time timestamp, event_type string, product_id string, category_id string,
category_code string, brand string, price float, user_id int, user_session string) ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.OpenCSVSerde' STORED AS TEXTFILE LOCATION '/user/case-study/'
TBLPROPERTIES("skip.header.line.count"="1");

7. We then checked to see if the tables had been created.

8. We also double-checked the table's data.


9. We then set up partitioning and bucketing.

10. Sales_data is a new table that is partitioned by "event_type" and clustered by "category_id."

11. We loaded the data into the "sales_data" table.

12. We then compared the optimised table sales_data with the original table sales.

A query on the sales table took 3.4 seconds, whereas on sales_data it took 0.4 seconds.
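Steps 9 to 11 could be sketched roughly as follows. This is a minimal sketch and not necessarily the statements used in the case study: the bucket count of 4, the use of dynamic partitioning and the casts in the SELECT are assumptions, since the text only records the partition column and the clustering column.

```sql
-- Allow dynamic partitions so each event_type value gets its own partition.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- Column names and types follow the "sales" table from step 6.
CREATE TABLE IF NOT EXISTS sales_data (
    event_time    timestamp,
    product_id    string,
    category_id   string,
    category_code string,
    brand         string,
    price         float,
    user_id       int,
    user_session  string
)
PARTITIONED BY (event_type string)
CLUSTERED BY (category_id) INTO 4 BUCKETS   -- bucket count is an assumption
STORED AS TEXTFILE;

-- OpenCSVSerde exposes every column as a string, so values are cast on the way in.
-- Assumption: event_time is stored in a format that Hive's cast to timestamp accepts.
-- The partition column (event_type) must come last in the SELECT list.
INSERT INTO TABLE sales_data PARTITION (event_type)
SELECT cast(event_time AS timestamp),
       product_id,
       category_id,
       category_code,
       brand,
       cast(price AS float),
       cast(user_id AS int),
       user_session,
       event_type
FROM sales;
```

Partitioning by event_type means a query that filters on one event type only has to read that partition, which is one plausible reason for a difference in query times like the one reported above.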
Solutions

 Find the total revenue generated due to purchases made in October.

Query : Select sum(cast(price as decimal(9,2))) as OCT_REV from sales_data where
month(event_time)=10;

The total revenue for October is 35012658.32.
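The query above sums price over every October row, whatever the event type. If only purchase events should count towards revenue, a filter on event_type could be added. The sketch below assumes that purchase events are stored with the value 'purchase', which the text does not state; the figure it returns would differ from the one reported above.

```sql
-- Assumption: purchase events carry event_type = 'purchase'.
-- sales_data is partitioned by event_type, so this filter also lets
-- Hive read only the matching partition.
SELECT sum(cast(price AS decimal(9,2))) AS oct_rev
FROM sales_data
WHERE event_type = 'purchase'
  AND month(event_time) = 10;
```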

 Write a query to yield the total sum of purchases per month in a single output.

Query: Select month(event_time) as month_id, sum(cast(price as decimal(9,2))) as revenue from
sales_data group by month(event_time);

The total revenue for October is 35012658.32.
The total revenue for November is 37646247.91.

 Write a query to find the change in revenue generated due to purchases
from October to November.

Query : Select nov_rev, oct_rev, (nov_rev - oct_rev) as change_in_rev from (select sum(case when
month(event_time)=11 then cast(price as decimal(9,2)) else null end) as nov_rev, sum(case when
month(event_time)=10 then cast(price as decimal(9,2)) else null end) as oct_rev from sales_data) rev;

The total revenue for October is 35012658.32.
The total revenue for November is 37646247.91.
The change in revenue is 2633589.59.

 Find distinct categories of products. Categories with null category code can be
ignored.

Query : Select distinct category_code from sales_data where category_code is not null;

 Find the total number of products available under each category.

Query: Select count(category_code) count, category_code from sales_data where category_code is
not null group by category_code;
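Note that count(category_code) counts rows, that is events, in each category rather than products. If the number of distinct products per category is wanted, a variant along these lines may be closer to the question. The alias product_count is illustrative, and the empty-string check is an assumption about how missing codes are stored.

```sql
SELECT category_code,
       count(DISTINCT product_id) AS product_count
FROM sales_data
WHERE category_code IS NOT NULL
  AND category_code <> ''   -- assumption: missing codes may be stored as empty strings
GROUP BY category_code;
```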


 Which brand had the maximum sales in October and November combined?

Query: Select brand, count(*) as sales_count from sales_data group by brand order by sales_count
desc limit 5;

O/P: The unnamed (blank) brand comes first with 3645290, followed by runail.

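Because the top row is the blank brand, a variant that skips rows without a brand name may give a more useful ranking. This is only a sketch: it assumes missing brands appear either as NULL or as an empty string.

```sql
SELECT brand,
       count(*) AS sales_count
FROM sales_data
WHERE brand IS NOT NULL
  AND brand <> ''           -- assumption: missing brands may be stored as empty strings
GROUP BY brand
ORDER BY sales_count DESC
LIMIT 5;
```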
 Which brands increased their sales from October to November?

Query : Select brand, count(case when month(event_time)=11 then 1 else null end) as nov_sales,
count(case when month(event_time)=10 then 1 else null end) as oct_sales from sales_data group by
brand having nov_sales > oct_sales;

 Your company wants to reward the top 10 users of its website with a Golden
Customer plan. Write a query to generate a list of top 10 users who spend the
most.

Query : Select user_id, count(user_session) as user_session_count from sales_data group by user_id
order by user_session_count desc limit 10;
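The query above ranks users by how many session rows they have rather than by the amount they spent. A sketch that ranks by spending instead could look like this. It assumes that purchase events are stored with event_type = 'purchase', and total_spent is an illustrative alias.

```sql
SELECT user_id,
       sum(cast(price AS decimal(9,2))) AS total_spent
FROM sales_data
WHERE event_type = 'purchase'   -- assumption about the stored value
GROUP BY user_id
ORDER BY total_spent DESC
LIMIT 10;
```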
