
PyTorch Tutorial

08. Dataset and DataLoader

Lecturer : Hongpu Liu Lecture 8-1 PyTorch Tutorial @ SLAM Research Group
Revision: Manual data feed

import numpy as np
import torch

xy = np.loadtxt('diabetes.csv.gz', delimiter=',', dtype=np.float32)
x_data = torch.from_numpy(xy[:, :-1])
y_data = torch.from_numpy(xy[:, [-1]])

……

for epoch in range(100):
    # 1. Forward (use all of the data)
    y_pred = model(x_data)
    loss = criterion(y_pred, y_data)
    print(epoch, loss.item())
    # 2. Backward
    optimizer.zero_grad()
    loss.backward()
    # 3. Update
    optimizer.step()

Terminology: Epoch, Batch-Size, Iterations

# Training cycle
for epoch in range(training_epochs):
    # Loop over all batches
    for i in range(total_batch):

Definition: Epoch
One forward pass and one backward pass of all the training examples.

Definition: Batch-Size
The number of training examples in one forward/backward pass.

Definition: Iteration
Number of passes, each pass using [batch size] number of examples.
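
To make the three terms concrete, here is a small arithmetic sketch. The numbers (10,000 examples, a batch size of 1,000, 100 epochs) are made up for illustration and do not come from the lecture; the names training_epochs and total_batch follow the training cycle shown on this slide.

```python
import math

num_examples = 10_000     # hypothetical dataset size
batch_size = 1_000        # examples per forward/backward pass
training_epochs = 100     # full passes over the dataset

# Iterations per epoch: how many batches are needed to see every example once.
# ceil() accounts for a smaller final batch when the sizes do not divide evenly.
total_batch = math.ceil(num_examples / batch_size)

# Total number of parameter updates over the whole training run.
total_iterations = training_epochs * total_batch

print(total_batch)        # 10
print(total_iterations)   # 1000
```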

DataLoader: batch_size=2, shuffle=True

[Figure: Dataset -> Shuffle -> Queue -> Loader -> Iterable Loader]

Dataset:  Sample 1, Sample 2, Sample 3, Sample 4, Sample 5, Sample 6, Sample 7, Sample 8
Queue (after shuffle):  Sample 4, Sample 7, Sample 8, Sample 1, Sample 5, Sample 2, Sample 6, Sample 3
Iterable Loader:
    Batch 1: Sample 4, Sample 7
    Batch 2: Sample 8, Sample 1
    Batch 3: Sample 5, Sample 2
    Batch 4: Sample 6, Sample 3
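
The behaviour in the figure can be reproduced with a minimal sketch. It assumes the eight samples are just the integers 1 to 8 wrapped in a TensorDataset, which is a stand-in chosen here and not the lecture's dataset. Because shuffle=True draws a new random order on every pass, the printed batches will generally differ from the figure and from run to run.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Eight samples numbered 1..8, standing in for "Sample 1" .. "Sample 8".
samples = torch.arange(1, 9)
dataset = TensorDataset(samples)

# batch_size=2 groups the shuffled samples into 4 batches of 2.
loader = DataLoader(dataset, batch_size=2, shuffle=True)

# Each batch is a list with one tensor, because the dataset holds one tensor.
batches = [batch[0].tolist() for batch in loader]
for number, batch in enumerate(batches, start=1):
    print(f"Batch {number}: {batch}")
```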

How to define your Dataset

Dataset is an abstract class. We can define our own class that inherits from it.

import torch
from torch.utils.data import Dataset
from torch.utils.data import DataLoader

class DiabetesDataset(Dataset):
    def __init__(self):
        pass

    def __getitem__(self, index):
        pass

    def __len__(self):
        pass

dataset = DiabetesDataset()
train_loader = DataLoader(dataset=dataset,
                          batch_size=32,
                          shuffle=True,
                          num_workers=2)

DataLoader is a class that helps us load data in PyTorch:

    from torch.utils.data import DataLoader

DiabetesDataset inherits from the abstract class Dataset:

    class DiabetesDataset(Dataset):

The expression dataset[index] will call the magic function __getitem__:

    def __getitem__(self, index):
        pass

The magic function __len__ returns the length of the dataset; the expression len(dataset) calls it:

    def __len__(self):
        pass
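
The two magic functions can be tried out without PyTorch, since indexing and len() are plain Python protocols. The sketch below uses a hypothetical ToyDataset class that does not inherit from Dataset; it only illustrates which expression triggers which method.

```python
class ToyDataset:
    """Plain Python stand-in for a Dataset subclass (hypothetical name)."""

    def __init__(self, items):
        self.items = list(items)

    def __getitem__(self, index):
        # Called for the expression dataset[index].
        return self.items[index]

    def __len__(self):
        # Called for the expression len(dataset).
        return len(self.items)


dataset = ToyDataset(["a", "b", "c"])
print(dataset[1])    # b
print(len(dataset))  # 3
```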


Construct a DiabetesDataset object:

    dataset = DiabetesDataset()

Initialize the loader with the batch size, the shuffle flag, and the number of worker processes:

    train_loader = DataLoader(dataset=dataset,
                              batch_size=32,
                              shuffle=True,
                              num_workers=2)
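
As a runnable counterpart to the skeleton, here is a minimal sketch with the three methods filled in. RandomDataset and its random data are assumptions made for illustration, not part of the lecture; num_workers is set to 0 here so that loading happens in the main process and the sketch runs the same way on every platform.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class RandomDataset(Dataset):
    """Hypothetical dataset of random features and 0/1 labels."""

    def __init__(self, num_samples=100, num_features=8):
        self.x_data = torch.randn(num_samples, num_features)
        self.y_data = torch.randint(0, 2, (num_samples, 1)).float()

    def __getitem__(self, index):
        # One (features, label) pair per index.
        return self.x_data[index], self.y_data[index]

    def __len__(self):
        return self.x_data.shape[0]


dataset = RandomDataset()
# num_workers=0 loads data in the main process (no worker processes).
loader = DataLoader(dataset=dataset, batch_size=32, shuffle=True, num_workers=0)

batch_shapes = [(tuple(x.shape), tuple(y.shape)) for x, y in loader]
print(len(loader))       # 4 batches: 32 + 32 + 32 + 4 samples
print(batch_shapes[0])   # ((32, 8), (32, 1))
print(batch_shapes[-1])  # ((4, 8), (4, 1))
```

The loader stacks the individual samples returned by __getitem__ into batch tensors, which is why the first dimension of each batch equals the batch size (or the remainder for the last batch).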

Extra: num_workers in Windows

The implementation of multiprocessing is different on Windows, which uses spawn instead of fork. So the following code will cause an error:

train_loader = DataLoader(dataset=dataset,
                          batch_size=32,
                          shuffle=True,
                          num_workers=2)
……
for epoch in range(100):
    for i, data in enumerate(train_loader, 0):
        ……

RuntimeError:
    An attempt has been made to start a new process before the
    current process has finished its bootstrapping phase.

    This probably means that you are not using fork to start your
    child processes and you have forgotten to use the proper idiom
    in the main module:

        if __name__ == '__main__':
            freeze_support()
            ...

    The "freeze_support()" line can be omitted if the program
    is not going to be frozen to produce an executable.

So we have to wrap the code in an if-clause to protect it from being executed multiple times.

Extra: num_workers in Windows

train_loader = DataLoader(dataset=dataset,
                          batch_size=32,
                          shuffle=True,
                          num_workers=2)
……
if __name__ == '__main__':  # protects the loop from executing multiple times
    for epoch in range(100):
        for i, data in enumerate(train_loader, 0):
            # 1. Prepare data

Example: Diabetes Dataset

class DiabetesDataset(Dataset):
    def __init__(self, filepath):
        xy = np.loadtxt(filepath, delimiter=',', dtype=np.float32)
        self.len = xy.shape[0]
        self.x_data = torch.from_numpy(xy[:, :-1])
        self.y_data = torch.from_numpy(xy[:, [-1]])

    def __getitem__(self, index):
        return self.x_data[index], self.y_data[index]

    def __len__(self):
        return self.len

dataset = DiabetesDataset('diabetes.csv.gz')
train_loader = DataLoader(dataset=dataset, batch_size=32, shuffle=True, num_workers=2)
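
The two slices in __init__ differ in one detail that is easy to miss: xy[:, [-1]] indexes with a list and keeps the labels as a column matrix, while a plain xy[:, -1] would return a flat vector. The sketch below shows this on three made-up CSV rows read from an in-memory string; the values are assumptions for illustration, not data from diabetes.csv.gz.

```python
import io
import numpy as np

# Three made-up rows: two feature columns followed by one label column.
csv_text = "0.1,0.2,1\n0.3,0.4,0\n0.5,0.6,1\n"
xy = np.loadtxt(io.StringIO(csv_text), delimiter=',', dtype=np.float32)

features = xy[:, :-1]      # every column except the last
labels_2d = xy[:, [-1]]    # list index keeps the column axis
labels_1d = xy[:, -1]      # plain integer index drops it

print(features.shape)      # (3, 2)
print(labels_2d.shape)     # (3, 1)
print(labels_1d.shape)     # (3,)
```

Keeping the labels two-dimensional means each batch of labels has shape (batch, 1), the same shape a model with a single output unit produces.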

Example: Using DataLoader

for epoch in range(100):
    for i, data in enumerate(train_loader, 0):
        # 1. Prepare data
        inputs, labels = data
        # 2. Forward
        y_pred = model(inputs)
        loss = criterion(y_pred, labels)
        print(epoch, i, loss.item())
        # 3. Backward
        optimizer.zero_grad()
        loss.backward()
        # 4. Update
        optimizer.step()

Classifying Diabetes
# 1. Prepare dataset: Dataset and DataLoader
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class DiabetesDataset(Dataset):
    def __init__(self, filepath):
        xy = np.loadtxt(filepath, delimiter=',', dtype=np.float32)
        self.len = xy.shape[0]
        self.x_data = torch.from_numpy(xy[:, :-1])
        self.y_data = torch.from_numpy(xy[:, [-1]])

    def __getitem__(self, index):
        return self.x_data[index], self.y_data[index]

    def __len__(self):
        return self.len

dataset = DiabetesDataset('diabetes.csv.gz')
train_loader = DataLoader(dataset=dataset,
                          batch_size=32,
                          shuffle=True,
                          num_workers=2)

# 2. Design model using Class: inherit from nn.Module
class Model(torch.nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.linear1 = torch.nn.Linear(8, 6)
        self.linear2 = torch.nn.Linear(6, 4)
        self.linear3 = torch.nn.Linear(4, 1)
        self.sigmoid = torch.nn.Sigmoid()

    def forward(self, x):
        x = self.sigmoid(self.linear1(x))
        x = self.sigmoid(self.linear2(x))
        x = self.sigmoid(self.linear3(x))
        return x

model = Model()

# 3. Construct loss and optimizer: using PyTorch API
# reduction='mean' replaces the deprecated size_average=True.
criterion = torch.nn.BCELoss(reduction='mean')
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# 4. Training cycle: forward, backward, update
# The guard is needed where worker processes are spawned (e.g. Windows).
if __name__ == '__main__':
    for epoch in range(100):
        for i, data in enumerate(train_loader, 0):
            # 1. Prepare data
            inputs, labels = data
            # 2. Forward
            y_pred = model(inputs)
            loss = criterion(y_pred, labels)
            print(epoch, i, loss.item())
            # 3. Backward
            optimizer.zero_grad()
            loss.backward()
            # 4. Update
            optimizer.step()

The following dataset loaders are available in torchvision.datasets

• MNIST
• Fashion-MNIST
• EMNIST
• COCO
• LSUN
• ImageFolder
• DatasetFolder
• Imagenet-12
• CIFAR
• STL10
• PhotoTour

Example: MNIST Dataset

import torch
from torch.utils.data import DataLoader
from torchvision import transforms
from torchvision import datasets

train_dataset = datasets.MNIST(root='../dataset/mnist',
                               train=True,
                               transform=transforms.ToTensor(),
                               download=True)
test_dataset = datasets.MNIST(root='../dataset/mnist',
                              train=False,
                              transform=transforms.ToTensor(),
                              download=True)

train_loader = DataLoader(dataset=train_dataset,
                          batch_size=32,
                          shuffle=True)
test_loader = DataLoader(dataset=test_dataset,
                         batch_size=32,
                         shuffle=False)

for batch_idx, (inputs, target) in enumerate(train_loader):
    ……

Exercise 8-1

• Build DataLoader for
    • Titanic dataset: https://www.kaggle.com/c/titanic/data
• Build a classifier using the DataLoader
