R Package dplyr Comparison of functions

In R, you can accomplish the same task in different ways.

This R document explains functions from the R package dplyr and in some places compares those functions with base functions.

# import dplyr library
# we are going to work with R's built-in dataset airquality
library(dplyr)
head(airquality)

##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 5    NA      NA 14.3   56     5   5
## 6    28      NA 14.9   66     5   6

Filter function--dplyr vs subset vs [ ]

You might have guessed what the filter function does here. It filters (subsets, slices) the rows of the data depending on one or more conditions. I am discussing here the three commonly used ways of subsetting.

# filter data with Wind > 7.0 for the month of May
#dplyr way
filter(airquality,Wind > 7.0, Month == 5)
# base way
#first
airquality[airquality$Wind > 7.0 & airquality$Month == 5,]
#second
subset(airquality,Wind > 7.0 & Month == 5)
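
One practical difference between these three approaches is how they treat missing values in the condition. The following is a small sketch, not part of the examples above: the Ozone column and the threshold of 100 are chosen only for illustration, because Ozone contains NA values while Wind and Month do not.

```r
library(dplyr)

# Ozone contains missing values, so the condition Ozone > 100 is NA for some rows
dplyr_rows  <- filter(airquality, Ozone > 100)
subset_rows <- subset(airquality, Ozone > 100)
base_rows   <- airquality[airquality$Ozone > 100, ]

# filter() and subset() keep only rows where the condition is TRUE
nrow(dplyr_rows) == nrow(subset_rows)

# [ ] also returns one all-NA row for every missing Ozone value
nrow(base_rows) - nrow(dplyr_rows) == sum(is.na(airquality$Ozone))

# wrapping the condition in which() makes [ ] drop those rows as well
base_clean <- airquality[which(airquality$Ozone > 100), ]
nrow(base_clean) == nrow(dplyr_rows)
```

So when the filtering column can contain NA, the [ ] form needs which() (or an explicit !is.na() check) to give the same rows as filter.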

I don't recommend using the subset function unless you completely understand how it works.

Use caution when using subset. For further reading, look at the Stack Overflow thread, http://stackoverflow.com/questions/9860090/in-r-why-is-better-than-subset

Mutate function

The mutate function is used to create new variables without affecting existing variables. In other words, it creates the new variable and keeps the old variables (you will understand why this is even a thing to note). It can also be used to transform existing variables.

The transmute function (you read it right, it's transmute, not transmutate) does the same, but it drops all the variables except the newly created one. So if you only want to keep the new variable, use transmute; otherwise use mutate.

# mutate transforms the variable and keeps the existing variables
mutate(airquality,TempInC = ((Temp - 32) * 5 / 9))

# transmute transforms a variable and drops the existing variables (I said variables)
# it keeps only the new variable and drops all other variables
transmute(airquality,TempInC = ((Temp - 32) * 5 / 9))

# base function
# somehow I find it easier to use than mutate
# note: the column must be referenced as airquality$Temp, a bare Temp is not found
airquality$TempInC <- (airquality$Temp - 32) * 5 / 9
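
To see the difference between mutate and transmute at a glance, here is a minimal sketch that counts the resulting columns. The names aq, with_new and only_new are made up for this illustration, and the code works on a fresh copy taken from the datasets package so that earlier changes to airquality do not affect the counts.

```r
library(dplyr)

# start from a pristine copy of the built-in dataset
aq <- datasets::airquality

with_new <- mutate(aq, TempInC = (Temp - 32) * 5 / 9)
only_new <- transmute(aq, TempInC = (Temp - 32) * 5 / 9)

ncol(aq)        # 6 original columns
ncol(with_new)  # 7: the original columns plus TempInC
ncol(only_new)  # 1: only TempInC

# base R equivalent, done on the copy
aq$TempInC <- (aq$Temp - 32) * 5 / 9
all.equal(aq$TempInC, with_new$TempInC)
```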

Arrange function

The arrange function is used to sort the rows of the data by one or more variables.

# dplyr--arrange
arrange(airquality,Month,desc(Temp))

# base--order()
# Note: the minus sign (-) sorts in descending order
airquality[order(airquality$Month,-airquality$Temp),]
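
As a quick check, the sketch below compares the sort keys produced by the two approaches. It also shows one case where they differ: the minus sign only works on numeric columns, while desc also handles character columns. The small months_tbl data frame is a made-up example built from R's built-in month.name vector.

```r
library(dplyr)

dplyr_sorted <- arrange(airquality, Month, desc(Temp))
base_sorted  <- airquality[order(airquality$Month, -airquality$Temp), ]

# the sort keys come out in the same order with both approaches
identical(dplyr_sorted$Month, base_sorted$Month)
identical(dplyr_sorted$Temp, base_sorted$Temp)

# -name would fail for a character column, desc(name) works
months_tbl <- data.frame(name = month.name[5:9])
months_desc <- arrange(months_tbl, desc(name))
months_desc
```

Note that the base result keeps the original row names, while arrange renumbers the rows.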

group_by and summarise functions

First let me explain the summarise function, then we move on to group_by.

The summarise function takes a vector as input and outputs a single value. You can ask for the min, max, mean, sd, var, median, etc. of a vector, and summarise outputs the result. Of course, base R will give you all these summary stats too, but there is a catch: summarise respects the grouping created by the group_by function, while base functions called directly on a column do not. I will explain with examples.

# Both base and summarise give you the same output for a normal df/tbl
mean(airquality$Temp)

## [1] 77.88235

summarise(airquality, mean(Temp))

## mean(Temp)
## 1 77.88235

There is a subtle difference in the outputs of these two. The first returns a double, the latter returns a data frame (which is a list internally). But that doesn't concern us; the key difference shows up when they are used with the group_by function.
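
The type difference can be inspected directly. This is a small sketch; the column name mean_temp is my own choice, added only so the result is easy to refer to.

```r
library(dplyr)

base_mean  <- mean(airquality$Temp)
dplyr_mean <- summarise(airquality, mean_temp = mean(Temp))

typeof(base_mean)   # "double": a plain numeric vector of length 1
class(dplyr_mean)   # "data.frame": one row, one column
typeof(dplyr_mean)  # "list": a data frame is a list of columns

# pull the column out of the data frame to get a plain number again
dplyr_mean$mean_temp == base_mean
```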

If you know SQL, you may be misled by the group_by function. Here group_by doesn't return the output for each group as you might expect; instead it creates a new grouped table. This table can then be used to perform many operations on the grouped variable.

grouped_table<-group_by(airquality,Month)
head(grouped_table)

## Source: local data frame [6 x 6]
## Groups: Month [1]
##
##   Ozone Solar.R  Wind  Temp Month   Day
##
## 1    41     190   7.4    67     5     1
## 2    36     118   8.0    72     5     2
## 3    12     149  12.6    74     5     3
## 4    18     313  11.5    62     5     4
## 5    NA      NA  14.3    56     5     5
## 6    28      NA  14.9    66     5     6

# Dimensions for both original dataset and grouped dataset
dim(airquality)

## [1] 153 6

dim(grouped_table)

## [1] 153 6

Both have the same dimensions, and the first records of grouped_table look the same as those of the original dataset. But grouped_table is grouped by the Month variable. You can see the 'Groups' line denoting the variable(s) used in the group_by function.

In general you can group by more than one variable and ask the summarise function for output. Now we ask for the average (mean) using the summarise and mean functions and compare the results.

mean(grouped_table$Temp)

## [1] 77.88235

summarise(grouped_table,mean(Temp))

## # A tibble: 5 x 2
## Month mean(Temp)
##
## 1 5 65.54839
## 2 6 79.10000
## 3 7 83.90323
## 4 8 83.96774
## 5 9 76.90000
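
The plain mean() call above ignores the grouping, but base R does have its own tools for grouped summaries. As a hedged sketch (tapply and aggregate are standard base functions; the column name mean_temp is my own), the per-month means can be reproduced like this:

```r
library(dplyr)

# dplyr: one row per Month, with a named summary column
dplyr_means <- summarise(group_by(airquality, Month), mean_temp = mean(Temp))

# base R: tapply() returns a named vector, aggregate() a data frame
tapply_means    <- tapply(airquality$Temp, airquality$Month, mean)
aggregate_means <- aggregate(Temp ~ Month, data = airquality, FUN = mean)

all.equal(dplyr_means$mean_temp, unname(as.vector(tapply_means)))
all.equal(dplyr_means$mean_temp, aggregate_means$Temp)
```

The difference is mostly one of workflow: with dplyr the grouping is stored in the table once and every later summarise call uses it, while in base R the grouping variable is passed to each call.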

dplyr::distinct vs base::unique

From the names you can guess what both unique and distinct do. Both give you the unique/distinct values. distinct works on a table and returns a table, while unique also works with vectors and lists (of course).

distinct(airquality,Month)

## Month
## 1 5
## 2 6
## 3 7
## 4 8
## 5 9

unique(airquality$Month)

## [1] 5 6 7 8 9
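
A short sketch of how the two relate, and of what distinct can do with more than one column. The variable names are made up for this illustration; .keep_all is an existing argument of distinct.

```r
library(dplyr)

# distinct() returns a data frame; unique() on a column returns a vector
distinct_months <- distinct(airquality, Month)
unique_months   <- unique(airquality$Month)
identical(distinct_months$Month, unique_months)

# unique() also accepts a data frame, and then returns a data frame
unique(airquality["Month"])

# distinct() can look at several columns at once, or keep the other
# columns of the first row found for each value via .keep_all
month_day   <- distinct(airquality, Month, Day)
first_per_m <- distinct(airquality, Month, .keep_all = TRUE)
nrow(month_day)
nrow(first_per_m)
```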

dplyr::sample_n/sample_frac vs base::sample

The dplyr sample_n and sample_frac functions are wrappers around the base sample.int function.

sample_n(airquality,size=2)

##     Ozone Solar.R Wind Temp Month Day
## 124    96     167  6.9   91     9   1
## 65     NA     101 10.9   84     7   4

sample_frac(airquality,size=0.01)

##     Ozone Solar.R Wind Temp Month Day
## 57     NA     127  8.0   78     6  26
## 135    21     259 15.5   76     9  12
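
For comparison, here is a sketch of the base way of drawing random rows with sample.int and [ ]. The seed value and variable names are arbitrary choices for this illustration. In dplyr 1.0.0 and later, slice_sample is the suggested replacement for sample_n and sample_frac (it takes n for a row count or prop for a fraction), so it is shown as well.

```r
library(dplyr)

set.seed(42)  # arbitrary seed, only to make the sketch repeatable

# base R: draw row indices with sample.int(), then subset with [ ]
n_rows    <- nrow(airquality)
base_n    <- airquality[sample.int(n_rows, size = 2), ]
base_frac <- airquality[sample.int(n_rows, size = round(0.01 * n_rows)), ]

# dplyr >= 1.0.0: slice_sample() with a row count
dplyr_n <- slice_sample(airquality, n = 2)

nrow(base_n)
nrow(base_frac)
nrow(dplyr_n)
```

The rows drawn will differ from the output shown above, since sampling is random.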

Piping

The symbol %>% is the pipe operator, which is used to connect pieces of code and run them together without saving intermediate results.

Simply put, this operator sends the left-side value as the first argument to the right-side function. You can also use the . (dot) placeholder if you want to pass the left-side value to a different argument. Let me show you,

airquality %>% group_by(Month) %>%
  summarise(mean_wind=mean(Wind)) %>%
  arrange(desc(mean_wind))

## # A tibble: 5 x 2
## Month mean_wind
##
## 1 5 11.622581
## 2 6 10.266667
## 3 9 10.180000
## 4 7 8.941935
## 5 8 8.793548

The airquality data is used as the first argument of the group_by function. Then the intermediate grouped table is passed as the first argument of the summarise function. Finally the summarised table is passed to the arrange function, which produces the output.
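
To make the mechanics concrete, the sketch below writes the same pipeline as nested function calls and checks that both give the same result. It also shows the dot placeholder; the lm call with Ozone ~ Wind is my own example and is not part of the analysis above.

```r
library(dplyr)

piped <- airquality %>%
  group_by(Month) %>%
  summarise(mean_wind = mean(Wind)) %>%
  arrange(desc(mean_wind))

# the same computation written as nested calls, read from the inside out
nested <- arrange(
  summarise(group_by(airquality, Month), mean_wind = mean(Wind)),
  desc(mean_wind)
)
identical(piped$mean_wind, nested$mean_wind)

# the dot placeholder puts the left side somewhere other than the first
# argument; here it becomes the data argument of lm()
wind_model <- airquality %>% lm(Ozone ~ Wind, data = .)
coef(wind_model)
```

The nested form has to be read inside-out, which is the main reason the piped form is easier to follow.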


This is commonly used when experimenting with data. It also helps to reduce the number of temporary variables created while doing analysis.

Other functions to look for

na_if - converts a given suspicious value to NA
coalesce - picks the first non-missing value at each position when you input more than one vector of the same length. Inspired by SQL COALESCE
tbl - creates a table from data
recode - replaces values in both numeric and character vectors. Numeric based on position and character based on name.
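
Here is a minimal sketch of three of these helpers on small made-up vectors (the values, including -99 as a placeholder for missing data, are invented for illustration; tbl is not covered here).

```r
library(dplyr)

# na_if(): turn a placeholder value into NA
readings <- c(1, -99, 3)
cleaned  <- na_if(readings, -99)
cleaned

# coalesce(): take the first non-missing value at each position
primary <- c(1, NA, 3)
backup  <- c(9, 2, 9)
filled  <- coalesce(primary, backup)
filled

# recode(): replace character values by name; unmatched values are kept
codes   <- c("a", "b", "a")
recoded <- recode(codes, a = "apple")
recoded
```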

This completes our introduction to dplyr. It should help you start working with data. Have fun!

About Rang Technologies:

Headquartered in New Jersey, Rang Technologies has dedicated over a decade delivering innovative solutions and best talent to help businesses get the most out of the latest technologies in their digital transformation journey.

BY: SANTHOSH SUBRAMANIAN, Jul 12 2016
