Lecture 08 Dataset and Dataloader

PyTorch Tutorial
08. Dataset and DataLoader
Lecturer: Hongpu Liu, PyTorch Tutorial @ SLAM Research Group
Revision: Manual data feed
import numpy as np
import torch

xy = np.loadtxt('diabetes.csv.gz', delimiter=',', dtype=np.float32)
x_data = torch.from_numpy(xy[:, :-1])
y_data = torch.from_numpy(xy[:, [-1]])
……
for epoch in range(100):
    # 1. Forward (uses all of the data in every step)
    y_pred = model(x_data)
    loss = criterion(y_pred, y_data)
    print(epoch, loss.item())
    # 2. Backward
    optimizer.zero_grad()
    loss.backward()
    # 3. Update
    optimizer.step()
Terminology: Epoch, Batch-Size, Iterations
# Training cycle
for epoch in range(training_epochs):
    # Loop over all batches
    for i in range(total_batch):

Definition: Epoch
One forward pass and one backward pass of all the training examples.
Definition: Batch-Size
The number of training examples in one forward/backward pass.
Definition: Iteration
Number of passes, each pass using [batch size] number of examples.
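The three definitions are linked by simple arithmetic, which the sketch below works through. The sample count, batch size, and epoch count are made-up numbers, and `iterations_per_epoch` is a hypothetical helper, not a PyTorch function.

```python
import math

def iterations_per_epoch(num_examples: int, batch_size: int) -> int:
    """Batches needed to see every example once (the last batch may be smaller)."""
    return math.ceil(num_examples / batch_size)

# Hypothetical numbers: 10,000 training examples, 1,000 examples per batch.
num_examples = 10_000
batch_size = 1_000
training_epochs = 3

total_batch = iterations_per_epoch(num_examples, batch_size)
print(total_batch)                    # 10 iterations make up one epoch
print(training_epochs * total_batch)  # 30 iterations over 3 epochs
```

With these numbers, `total_batch` plays the same role as the inner loop bound in the training cycle above.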
DataLoader: batch_size=2, shuffle=True
[Figure: the samples in the Dataset are shuffled, then the Loader groups them into Batch 1 to Batch 4 of two samples each, giving an iterable Loader.]
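The shuffle-then-batch behaviour can be tried on a toy dataset. This is a minimal sketch, assuming eight integer samples wrapped in `TensorDataset` (a stand-in chosen for illustration, not the lecture's dataset).

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Eight toy samples numbered 1..8, standing in for "Sample 1" ... "Sample 8".
samples = torch.arange(1, 9).unsqueeze(1)   # shape (8, 1)
dataset = TensorDataset(samples)

loader = DataLoader(dataset, batch_size=2, shuffle=True)

# Each batch is a list with one tensor of shape (2, 1); flatten it for printing.
batches = [batch[0].flatten().tolist() for batch in loader]
for i, b in enumerate(batches, 1):
    print(f"Batch {i}: {b}")   # the order changes between runs because shuffle=True
```

Every sample appears in exactly one of the four batches; only the order differs from run to run.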
How to define your Dataset
Dataset is an abstract class. We can define our own class that inherits from it.

import torch
from torch.utils.data import Dataset
from torch.utils.data import DataLoader

class DiabetesDataset(Dataset):
    def __init__(self):
        pass

    def __getitem__(self, index):
        pass

    def __len__(self):
        pass

dataset = DiabetesDataset()
train_loader = DataLoader(dataset=dataset,
                          batch_size=32,
                          shuffle=True,
                          num_workers=2)
Notes on the code:
• DataLoader is a class that helps us load data in PyTorch.
• DiabetesDataset inherits from the abstract class Dataset.
• __getitem__: the expression dataset[index] calls this magic function.
• __len__: this magic function returns the length of the dataset.
• dataset = DiabetesDataset() constructs the DiabetesDataset object.
• DataLoader(...) initializes the loader with the batch size, the shuffle flag, and the number of worker processes.
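To see the two magic functions in action, here is a small sketch. `SquaresDataset` is a hypothetical toy class invented for this illustration; it is not part of PyTorch or of the lecture's code.

```python
import torch
from torch.utils.data import Dataset

class SquaresDataset(Dataset):
    """Hypothetical toy dataset: sample i is the pair (i, i * i)."""

    def __init__(self, n):
        self.x = torch.arange(n, dtype=torch.float32)
        self.y = self.x ** 2

    def __getitem__(self, index):
        # Called by the expression dataset[index]
        return self.x[index], self.y[index]

    def __len__(self):
        # Called by len(dataset)
        return self.x.shape[0]

dataset = SquaresDataset(5)
x3, y3 = dataset[3]
print(len(dataset))           # 5
print(x3.item(), y3.item())   # 3.0 9.0
```

A DataLoader relies on these same two methods: `__len__` tells it how many indices exist, and `__getitem__` fetches the sample for each index.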
Extra: num_workers in Windows
train_loader = DataLoader(dataset=dataset,
                          batch_size=32,
                          shuffle=True,
                          num_workers=2)
……
for epoch in range(100):
    for i, data in enumerate(train_loader, 0):
        ……

The implementation of multiprocessing is different on Windows, which uses spawn instead of fork. So the code above will cause:

RuntimeError:
    An attempt has been made to start a new process before the
    current process has finished its bootstrapping phase.

    This probably means that you are not using fork to start your
    child processes and you have forgotten to use the proper idiom
    in the main module:

        if __name__ == '__main__':
            freeze_support()
            ...

    The "freeze_support()" line can be omitted if the program
    is not going to be frozen to produce an executable.

So we have to wrap the code with an if-clause to protect it from executing multiple times.
Extra: num_workers in Windows
train_loader = DataLoader(dataset=dataset,
                          batch_size=32,
                          shuffle=True,
                          num_workers=2)
……
if __name__ == '__main__':
    for epoch in range(100):
        for i, data in enumerate(train_loader, 0):
            # 1. Prepare data
            ……

Wrapping the training loop in the if-clause protects it from executing again each time a worker process is started.
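A self-contained version of the guarded pattern might look like the sketch below. The zero-filled tensors are placeholder data and `count_batches` is a hypothetical helper introduced only for this example.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def count_batches(num_workers=0):
    """Iterate once over a toy loader and return how many batches it produced."""
    data = TensorDataset(torch.zeros(64, 8), torch.zeros(64, 1))  # placeholder data
    loader = DataLoader(data, batch_size=32, shuffle=True, num_workers=num_workers)
    return sum(1 for _ in loader)

if __name__ == '__main__':
    # Worker processes are started when the loader is iterated, so the
    # iteration is only reachable from inside this guard.
    print(count_batches(num_workers=2))
```

Keeping the iteration inside a function that is called only under the guard means a spawned worker can import the module without re-running the loop.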
Example: Diabetes Dataset
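import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class DiabetesDataset(Dataset):
    def __init__(self, filepath):
        # Load the whole file into memory; the last column is the label
        xy = np.loadtxt(filepath, delimiter=',', dtype=np.float32)
        self.len = xy.shape[0]
        self.x_data = torch.from_numpy(xy[:, :-1])
        self.y_data = torch.from_numpy(xy[:, [-1]])

    def __getitem__(self, index):
        return self.x_data[index], self.y_data[index]

    def __len__(self):
        return self.len

dataset = DiabetesDataset('diabetes.csv.gz')
train_loader = DataLoader(dataset=dataset, batch_size=32, shuffle=True, num_workers=2)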
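The two slices in `__init__` differ in a way that is easy to miss, so here is a short NumPy sketch on a made-up 3-row table (the values are invented; only the shapes matter).

```python
import numpy as np

# Made-up table: 3 rows, 2 feature columns, 1 label column (the last one).
xy = np.array([[0.1, 0.2, 1.0],
               [0.3, 0.4, 0.0],
               [0.5, 0.6, 1.0]], dtype=np.float32)

x = xy[:, :-1]       # every column except the last
y_2d = xy[:, [-1]]   # a list index keeps the column axis
y_1d = xy[:, -1]     # an integer index drops the column axis

print(x.shape)     # (3, 2)
print(y_2d.shape)  # (3, 1)
print(y_1d.shape)  # (3,)
```

Indexing with `[-1]` keeps the labels as an (N, 1) matrix, the same shape a model with a single output unit produces, so the loss compares tensors of equal shape.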
Example: Using DataLoader

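for epoch in range(100):
    for i, data in enumerate(train_loader, 0):
        # 1. Prepare data
        inputs, labels = data
        # 2. Forward
        y_pred = model(inputs)
        loss = criterion(y_pred, labels)
        print(epoch, i, loss.item())
        # 3. Backward (clear old gradients first so they do not accumulate)
        optimizer.zero_grad()
        loss.backward()
        # 4. Update
        optimizer.step()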
Classifying Diabetes
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

# 1. Prepare dataset (Dataset and DataLoader)
class DiabetesDataset(Dataset):
    def __init__(self, filepath):
        xy = np.loadtxt(filepath, delimiter=',', dtype=np.float32)
        self.len = xy.shape[0]
        self.x_data = torch.from_numpy(xy[:, :-1])
        self.y_data = torch.from_numpy(xy[:, [-1]])

    def __getitem__(self, index):
        return self.x_data[index], self.y_data[index]

    def __len__(self):
        return self.len

dataset = DiabetesDataset('diabetes.csv.gz')
train_loader = DataLoader(dataset=dataset,
                          batch_size=32,
                          shuffle=True,
                          num_workers=2)

# 2. Design model using Class (inherit from nn.Module)
class Model(torch.nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.linear1 = torch.nn.Linear(8, 6)
        # linear2 and linear3 are assumed here: the model has to end in a
        # single output so that y_pred matches the (N, 1) labels.
        self.linear2 = torch.nn.Linear(6, 4)
        self.linear3 = torch.nn.Linear(4, 1)
        self.sigmoid = torch.nn.Sigmoid()

    def forward(self, x):
        x = self.sigmoid(self.linear1(x))
        x = self.sigmoid(self.linear2(x))
        x = self.sigmoid(self.linear3(x))
        return x

model = Model()

# 3. Construct loss and optimizer (using PyTorch API)
# reduction='mean' replaces the deprecated size_average=True
criterion = torch.nn.BCELoss(reduction='mean')
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# 4. Training cycle (forward, backward, update)
if __name__ == '__main__':
    for epoch in range(100):
        for i, data in enumerate(train_loader, 0):
            # 1. Prepare data
            inputs, labels = data
            # 2. Forward
            y_pred = model(inputs)
            loss = criterion(y_pred, labels)
            print(epoch, i, loss.item())
            # 3. Backward
            optimizer.zero_grad()
            loss.backward()
            # 4. Update
            optimizer.step()
The following dataset loaders are available in torchvision.datasets:
• MNIST
• Fashion-MNIST
• EMNIST
• COCO
• LSUN
• ImageFolder
• DatasetFolder
• Imagenet-12
• CIFAR
• STL10
• PhotoTour
Example: MNIST Dataset
import torch
from torch.utils.data import DataLoader
from torchvision import transforms
from torchvision import datasets

train_dataset = datasets.MNIST(root='../dataset/mnist',
                               train=True,
                               transform=transforms.ToTensor(),
                               download=True)
test_dataset = datasets.MNIST(root='../dataset/mnist',
                              train=False,
                              transform=transforms.ToTensor(),
                              download=True)

train_loader = DataLoader(dataset=train_dataset,
                          batch_size=32,
                          shuffle=True)
test_loader = DataLoader(dataset=test_dataset,
                         batch_size=32,
                         shuffle=False)

for batch_idx, (inputs, target) in enumerate(train_loader):
    ……
Exercise 8-1
• Build DataLoader for the Titanic dataset: https://www.kaggle.com/c/titanic/data
• Build a classifier using the DataLoader