
INDUSTRIAL INTERNSHIP

WEEKLY PERFORMANCE REPORT (WPR)

Student Name: Vikas Gupta


Supervisor Name: B. P. Mishra/ Shivani Mishra
Coordinator/Team Leader Name: Namira Rangrej
Mentor Name: Pranshu Sharma
Organization: CureYa
Hours Worked: Monday: 1 hr, Tuesday: 1 hr, Wednesday: 3 hrs, Thursday: 2 hrs, Friday: 4 hrs

Summarize your thoughts regarding your internship this week. Include duties you have performed,
facts and procedures you have learned, skills you have mastered, and observations you have made.

WEEK 4, From 17 May – 21 May, 2021


Insights: Like the previous one, this was a restless week, with a little extra chaos from the new
role assigned to me. I had not realized how hard it would be to handle even a team of two people.
Remaining calm in frustrating situations was the learning of the week. The students could not
understand the task given to them, so they did what they understood; I should have prepared a few
points in advance to explain the task properly. However, this was the first time I was assigned
the site-supervisor role to handle a small team, so I have learnt my lesson and can be more
cautious when this type of task comes my way in the future. I am therefore thankful to the CureYa
management for giving me this opportunity. I learnt that managing yourself and
handling/coordinating with others are quite distinct things. I believe that “everything
happens for a reason”, so I hold no grudges.

Monday:
Feature Engineering:
One-hot encoding using get_dummies.
Advantages:
 Straightforward to implement
 Does not require hours of variable exploration
 Does not massively expand the feature space (number of columns in the dataset)
Disadvantages:
 Does not add any information that may make the variable more predictive
 Does not keep the information of the ignored labels
Other feature engineering techniques: imputing, handling outliers, binning, log transform,
one-hot encoding, grouping operations, scaling.
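A minimal sketch of the get_dummies approach described above; the column names and values here are made up purely for illustration:

```python
import pandas as pd

# Toy dataset with one categorical column (hypothetical values)
df = pd.DataFrame({
    "fuel_type": ["Petrol", "Diesel", "CNG", "Petrol"],
    "price": [5.6, 7.2, 3.1, 4.9],
})

# One-hot encode fuel_type; drop_first=True drops the first dummy
# column to avoid redundant (perfectly collinear) features
encoded = pd.get_dummies(df, columns=["fuel_type"], drop_first=True)
print(encoded.columns.tolist())
```

Note that get_dummies only encodes the labels it sees in the data at hand, which relates to the "does not keep the information of the ignored labels" point above.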

Tuesday:
Watched Krish Naik’s explanations of machine learning pipelines:

I. Advanced House Price Prediction - Feature Selection
II. Advanced House Price Prediction - Feature Engineering
III. Advanced House Price Prediction - Exploratory Data Analysis

Wednesday:
Completed tasks assigned by supervisor Shivani Mishra:
 Article on Data Collection related to Covid-19
 Handling of different tasks under supervisor role

Thursday:
Completed task assigned by mentor Pranshu Sharma:
 Exploratory Data Analysis (EDA) on Vehicle Price Prediction (CarPriceDekho)
Steps are given below:
1) Importing the dependencies
2) Reading the dataset
3) Checking the shape of the dataset
4) Checking for numerical and categorical values
5) Checking for unique values in the Seller Type, Fuel Type, Transmission, and Owner columns
6) Checking for null values
7) Insights of the data
8) The columns in the data frame
9) Creating a new feature by eliminating the year column
10) Encoding: one-hot encoding
11) Checking for correlation between dependent and independent variables
12) Plotting a heatmap of the correlation matrix
13) Splitting the data into dependent and independent variables
14) Importing the dependencies
15) Getting the important features
16) Plotting a graph of important features
17) Splitting the data into train and test sets
18) Importing the Random Forest Regressor
19) Hyperparameter tuning
20) Building the model
21) Checking the performance
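A condensed sketch of the main steps above (feature creation, encoding, splitting, and the Random Forest model); the toy data and column names are placeholders, not the actual CarDekho dataset:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Toy stand-in for the vehicle dataset (hypothetical values)
df = pd.DataFrame({
    "Year": [2014, 2017, 2011, 2019, 2015, 2013, 2016, 2012],
    "Present_Price": [5.6, 9.9, 3.0, 12.5, 6.8, 4.1, 8.2, 3.6],
    "Kms_Driven": [27000, 6900, 52000, 4000, 31000, 45000, 18000, 60000],
    "Fuel_Type": ["Petrol", "Diesel", "Petrol", "Diesel",
                  "Petrol", "CNG", "Diesel", "Petrol"],
    "Selling_Price": [3.35, 7.25, 1.5, 10.1, 4.4, 2.2, 6.0, 1.9],
})

# Step 9: replace the raw year with the vehicle's age
df["Age"] = 2021 - df["Year"]
df = df.drop(columns=["Year"])

# Step 10: one-hot encode the categorical column
df = pd.get_dummies(df, columns=["Fuel_Type"], drop_first=True)

# Steps 13 and 17: split into features/target, then train/test
X = df.drop(columns=["Selling_Price"])
y = df["Selling_Price"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Steps 18-21: fit a Random Forest and check performance
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print(model.score(X_train, y_train))  # R^2 on the training data
```

Hyperparameter tuning (step 19) would typically wrap this model in a search over parameters such as n_estimators and max_depth; that part is omitted here for brevity.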

Friday:
Five components of a data stack:

1. Collection: Collect the data you need to understand how your product or service is being
used.
2. Integration: Move all of your data into the data warehouse in a timely and reliable manner.
3. Data Warehouse: All the data you collect should be stored in a data warehouse so that you
can use it effectively.
4. Transformation: Raw data is not always suitable for analytics; it needs to be cleaned up
before analysis, e.g. by removing duplicates and incorrect data.
5. Visualization: Represent the data visually to help understand it and spot patterns.

All-in-one product analytics tools

Because the questions you want to answer are relatively simple, you can use a plug-and-play
analytics product: just add a “script” tag to your site or app. It is basically an entire data
stack in one product.
Pros: Low-effort setup; relatively easy to use, at least early on.
Cons: Measuring custom or niche things can be hard or impossible.

Product           Price      Notes
Mixpanel          From free  Aimed at a slightly less technical user
Heap              From free
Amplitude         From free
Indicative        From free  Can connect to a data warehouse, allowing more customization
Google Analytics  From free  Heavily adopted in the marketing vertical

Mixpanel example.

Initial Traction:
Constraints:
 More questions than you have time to answer
 All-in-one tools can’t support the custom analysis you need
Typical questions you want to answer with data:
 How does shipping time impact likelihood of a second order?
 Does website loading time vary by user? Does it affect activation rate?
 What activities in a first visit increase the likelihood of retention?

The time has come to:


● Make your first data hire (more later)
● Add to your data stack:
o Data collection
o Data integration
o Data warehouse
o Visualization

Data Collection

All-in-one products handle data collection, but now you need that data in a data warehouse, so it’s
time to move to a dedicated data collection solution. Segment is the most popular choice, allowing
you to relatively easily track events in your product using their open-source APIs.

Product   Price                              Notes
Segment   From free, more likely $120/month  Not only a data integration tool, but also helps
                                             you move data to other places like Adwords
Snowplow  ?                                  More focused on event data collection; more
                                             complex to set up and work with

Data Integration

By this stage you likely have many sources of data: event data, marketing data, customer support
data, financial data, etc. Data integration tools help you move these data sources into your data
warehouse.

Product       Price                              Notes
Fivetran      $1.00/credit                       Polished, but can get expensive
Stitch        $100/month                         Pricing increased recently, unfortunately
Segment       From free, more likely $120/month  You may already use Segment for event data
                                                 collection by this stage; it is unlikely to
                                                 support all of your needs, as it primarily
                                                 handles event data
Supermetrics  ?                                  Specific to sales and marketing

Data Warehouse

Data warehouses are optimised for performing analytics on large amounts of data. Your data team
can use SQL to do custom analytics, combining all of your data sources to spot patterns and trends.

Product    Price                              Notes
BigQuery*  Usage-based, 1 TB free each month  Part of Google Cloud
Snowflake  Usage-based

Visualization / BI

A BI tool will help you visualize the data you’re collecting in the data warehouse, which it will
connect to directly. This is a small selection - there is a lot of choice in this area!

Product      Price                           Notes
Data Studio  Free                            Impressively powerful for a free product, though
                                             not particularly intuitive to use; part of
                                             Google Cloud
Mode         Free trial, unclear after this  Pretty powerful, well liked by the data science
                                             community
Metabase     Free                            An open-source solution; takes a bit of work to
                                             set up
Looker       From ~$30K/annum                Probably too expensive at this stage, but
                                             probably the best of the more expensive options

Transformation

By this stage your data team will be spending a lot of time transforming your various data sources
to an analytics-ready format. BI tools provide some functionality in this area, but it probably
makes sense to start using a dedicated “data modelling” platform.

Product    Price                Notes
Dataform   Free                 BigQuery users only; transitioning to Google Cloud
dbt Cloud  From $50/month/user

Optimising

Constraints:
● Low-hanging-fruit product insights are gone; more complex analytics is required
● Your data team is growing, and they need to be able to collaborate effectively
Typical questions you want to answer with data:
● How long after someone is on our site should we show them a customer support chat?
● If someone wants to cancel their subscription, should we offer them a discount? How much?
● What is the most efficient way to pick parcels in the warehouse to reduce shipping times?
By this stage the data stack is almost complete - you have collection, integration, BI and a
warehouse. There are two changes you might want to make to help your data team be successful:
● Switch to a more developer-friendly BI tool like Looker
● Start using a dedicated transformation platform

Example data stack progression:
Pre-product: N/A
Early usage: Amplitude
Initial traction: Segment + Fivetran + BigQuery + Data Studio (+ Amplitude)

Optimising: Segment + Fivetran + BigQuery + Dataform + Looker


Student Signature: Vikas Gupta Date: 21/5/2021

Head Co-ordinator Signature: Date:

Instructions: After the completed report has been signed by both the student and the head
coordinator, the head coordinator shall scan the form to PDF format and email it to Director-1
([email protected]) of the company. Specific problems, concerns, or suggestions from
either the student or head coordinator should be emailed separately to the C.E.O. ([email protected])
of the company.
