Last Updated on July 12, 2022
When we build and train a Keras deep learning model, the training data can be provided in several different ways. Presenting the data as a NumPy array or a TensorFlow tensor is a common one. Creating a Python generator function and letting the training loop read data from it is another way. Yet another way of providing data is to use a tf.data dataset.
In this tutorial, we will see how we can use a tf.data dataset for a Keras model. After finishing this tutorial, you will learn:
- How to create and use a tf.data dataset
- The benefit of doing so compared to a generator function
Let's get started.

A Gentle Introduction to tensorflow.data API
Photo by Monika MG. Some rights reserved.
Overview
This article is split into four sections; they are:
- Training a Keras Model with NumPy Array and Generator Function
- Creating a Dataset using tf.data
- Creating a Dataset from Generator Function
- Dataset with Prefetch
Training a Keras Model with NumPy Array and Generator Function
Before we see how the tf.data API works, let's review how we usually train a Keras model.
First, we need a dataset. An example is the Fashion-MNIST dataset that comes with the Keras API. It has 60,000 training samples and 10,000 test samples of 28×28 pixels in grayscale, and the corresponding classification label is encoded with integers 0 to 9.
The dataset is a NumPy array. Then we can build a Keras model for classification, and with the model's fit() function, we provide the NumPy array as data.
The complete code is as follows:
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.keras.datasets.fashion_mnist import load_data
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.models import Sequential

(train_image, train_label), (test_image, test_label) = load_data()
print(train_image.shape)
print(train_label.shape)
print(test_image.shape)
print(test_label.shape)

model = Sequential([
    Flatten(input_shape=(28,28)),
    Dense(100, activation="relu"),
    Dense(100, activation="relu"),
    Dense(10, activation="sigmoid")
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics="sparse_categorical_accuracy")
history = model.fit(train_image, train_label,
                    batch_size=32, epochs=50,
                    validation_data=(test_image, test_label), verbose=0)

print(model.evaluate(test_image, test_label))

plt.plot(history.history['val_sparse_categorical_accuracy'])
plt.show()
Running this code will print out the following:
(60000, 28, 28)
(60000,)
(10000, 28, 28)
(10000,)
313/313 [==============================] - 0s 392us/step - loss: 0.5114 - sparse_categorical_accuracy: 0.8446
[0.5113903284072876, 0.8446000218391418]
It also creates the following plot of validation accuracy over the 50 epochs we trained our model:
The other way of training the same network is to provide the data from a Python generator function instead of a NumPy array. A generator function is one with a yield statement that emits data while the function runs in parallel with the data consumer. A generator for the Fashion-MNIST dataset can be created as follows:
def batch_generator(image, label, batchsize):
    N = len(image)
    i = 0
    while True:
        yield image[i:i+batchsize], label[i:i+batchsize]
        i = i + batchsize
        if i + batchsize > N:
            i = 0
This function is supposed to be called with the syntax batch_generator(train_image, train_label, 32). It will scan the input arrays in batches indefinitely. Once it reaches the end of the array, it will restart from the beginning.
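As a quick sanity check (a sketch on our part, not from the original example), we can pull a few batches from the generator and confirm their shapes; the names reuse the Fashion-MNIST arrays loaded earlier:

# Illustrative check: draw a few batches and confirm 32 samples each
gen = batch_generator(train_image, train_label, 32)
for _ in range(3):
    images, labels = next(gen)
    print(images.shape, labels.shape)  # (32, 28, 28) (32,)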
Training a Keras model with a generator is similar, using the fit() function:
history = model.fit(batch_generator(train_image, train_label, 32),
                    steps_per_epoch=len(train_image)//32,
                    epochs=50, validation_data=(test_image, test_label), verbose=0)
Instead of providing the data and labels, we only need to provide the generator, as it gives out both. When data are presented as a NumPy array, we can tell how many samples there are by looking at the length of the array, so Keras can complete one epoch once the entire dataset has been used. However, our generator function will emit batches indefinitely, so we need to tell it when an epoch ends, using the steps_per_epoch argument to the fit() function.
While in the above code we provided the validation data as a NumPy array, we can also use a generator instead and specify the validation_steps argument.
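For example, a sketch (not part of the original code) of using generators for both training and validation might look like the following, with validation_steps telling Keras how many batches make up one validation pass:

# Sketch: generators for both training and validation data;
# steps_per_epoch and validation_steps mark the end of each pass
history = model.fit(batch_generator(train_image, train_label, 32),
                    steps_per_epoch=len(train_image)//32,
                    epochs=50,
                    validation_data=batch_generator(test_image, test_label, 32),
                    validation_steps=len(test_image)//32,
                    verbose=0)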
The following is the complete code using the generator function, for which the output is the same as the previous example:
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.keras.datasets.fashion_mnist import load_data
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.models import Sequential

(train_image, train_label), (test_image, test_label) = load_data()
print(train_image.shape)
print(train_label.shape)
print(test_image.shape)
print(test_label.shape)

model = Sequential([
    Flatten(input_shape=(28,28)),
    Dense(100, activation="relu"),
    Dense(100, activation="relu"),
    Dense(10, activation="sigmoid")
])

def batch_generator(image, label, batchsize):
    N = len(image)
    i = 0
    while True:
        yield image[i:i+batchsize], label[i:i+batchsize]
        i = i + batchsize
        if i + batchsize > N:
            i = 0

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics="sparse_categorical_accuracy")
history = model.fit(batch_generator(train_image, train_label, 32),
                    steps_per_epoch=len(train_image)//32,
                    epochs=50, validation_data=(test_image, test_label), verbose=0)
print(model.evaluate(test_image, test_label))

plt.plot(history.history['val_sparse_categorical_accuracy'])
plt.show()
Creating a Dataset using tf.data
Given that we have the Fashion-MNIST data loaded, we can convert it into a tf.data dataset, like the following:
...
dataset = tf.data.Dataset.from_tensor_slices((train_image, train_label))
print(dataset.element_spec)
This prints the dataset’s spec, as follows:
(TensorSpec(shape=(28, 28), dtype=tf.uint8, name=None), TensorSpec(shape=(), dtype=tf.uint8, name=None))
We can see the data is a tuple (as we passed a tuple as the argument to the from_tensor_slices() function), where the first element is of shape (28,28) and the second element is a scalar. Both elements are stored as 8-bit unsigned integers.
If we did not present the data as a tuple of two NumPy arrays when we created the dataset, we can also combine them later. The following creates the same dataset, but first creates separate datasets for the image data and the labels before combining them:
...
train_image_data = tf.data.Dataset.from_tensor_slices(train_image)
train_label_data = tf.data.Dataset.from_tensor_slices(train_label)
dataset = tf.data.Dataset.zip((train_image_data, train_label_data))
print(dataset.element_spec)
This will print the same spec:
(TensorSpec(shape=(28, 28), dtype=tf.uint8, name=None), TensorSpec(shape=(), dtype=tf.uint8, name=None))
The zip() function on datasets is similar to the zip() function in Python in the sense that it matches data one by one from multiple datasets into tuples.
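To see the pairing behavior in isolation, here is a small toy sketch (not from the original tutorial):

import tensorflow as tf

# zip() pairs elements one-by-one from the two datasets
ds_a = tf.data.Dataset.range(3)                        # 0, 1, 2
ds_b = tf.data.Dataset.from_tensor_slices(["a", "b", "c"])
for number, letter in tf.data.Dataset.zip((ds_a, ds_b)):
    print(number.numpy(), letter.numpy())              # 0 b'a', 1 b'b', 2 b'c'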
One benefit of using a tf.data dataset is the flexibility in handling the data. Below is the complete code on how we can train a Keras model using a dataset, in which the batch size is set on the dataset:
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.keras.datasets.fashion_mnist import load_data
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.models import Sequential

(train_image, train_label), (test_image, test_label) = load_data()
dataset = tf.data.Dataset.from_tensor_slices((train_image, train_label))

model = Sequential([
    Flatten(input_shape=(28,28)),
    Dense(100, activation="relu"),
    Dense(100, activation="relu"),
    Dense(10, activation="sigmoid")
])
# compile the model before training, same as in the earlier examples
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics="sparse_categorical_accuracy")

history = model.fit(dataset.batch(32),
                    epochs=50,
                    validation_data=(test_image, test_label),
                    verbose=0)

print(model.evaluate(test_image, test_label))

plt.plot(history.history['val_sparse_categorical_accuracy'])
plt.show()
This is the simplest use case of a dataset. If we dive deeper, we can see that a dataset is just an iterator. Therefore we can print out each sample in a dataset using the following:
for image, label in dataset:
    print(image)  # array of shape (28,28) in tf.Tensor
    print(label)  # integer label in tf.Tensor
The dataset has many functions built in. The batch() we used before is one of them. If we create batches from the dataset and print them, we have the following:
for image, label in dataset.batch(32):
    print(image)  # array of shape (32,28,28) in tf.Tensor
    print(label)  # array of shape (32,) in tf.Tensor
in which each item we get from a batch is not a single sample but a batch of samples. We also have functions such as map(), filter(), and reduce() for sequence transformation, or concatenate() and interleave() for combining with another dataset. There are also repeat(), take(), take_while(), and skip(), like our familiar counterparts from Python's itertools module. A full list of the functions can be found in the API documentation.
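As an illustration (a sketch, not code from the original post), map() can cast the uint8 images to floats and rescale them, filter() can keep only samples with a given label, and take() can limit the number of samples:

# Sketch: common transformations on the Fashion-MNIST dataset created above
def scale(image, label):
    # cast uint8 pixels to float32 in [0, 1]
    return tf.cast(image, tf.float32) / 255.0, label

scaled = dataset.map(scale)                                   # transform every sample
only_zeros = dataset.filter(lambda image, label: label == 0)  # keep label 0 only
first_ten = dataset.take(10)                                  # first 10 samples only
print(scaled.element_spec)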
Creating a Dataset from Generator Function
So far, we have seen how a dataset can be used in place of a NumPy array in training a Keras model. Indeed, a dataset can also be created out of a generator function. But instead of a generator function that produces a batch, as we saw in one of the examples above, here we make a generator function that produces one sample at a time. The following is the function:
import numpy as np
import tensorflow as tf

def shuffle_generator(image, label, seed):
    idx = np.arange(len(image))
    np.random.default_rng(seed).shuffle(idx)
    for i in idx:
        yield image[i], label[i]

dataset = tf.data.Dataset.from_generator(
    shuffle_generator,
    args=[train_image, train_label, 42],
    output_signature=(
        tf.TensorSpec(shape=(28,28), dtype=tf.uint8),
        tf.TensorSpec(shape=(), dtype=tf.uint8)))
print(dataset.element_spec)
This function randomizes the input array by shuffling an index vector, then yields one sample at a time. Unlike the previous example, this generator will end when the samples from the array are exhausted.
We create a dataset from the function using from_generator(). We need to provide the name of the generator function (instead of an instantiated generator) and also the output signature of the dataset. This is required because the tf.data.Dataset API cannot infer the dataset spec before the generator is consumed.
Running the above code will print the same spec as before:
(TensorSpec(shape=(28, 28), dtype=tf.uint8, name=None), TensorSpec(shape=(), dtype=tf.uint8, name=None))
Such a dataset is functionally equivalent to the dataset that we created previously. Hence we can use it for training as before. The following is the complete code:
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
from tensorflow.keras.datasets.fashion_mnist import load_data
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.models import Sequential

(train_image, train_label), (test_image, test_label) = load_data()

def shuffle_generator(image, label, seed):
    idx = np.arange(len(image))
    np.random.default_rng(seed).shuffle(idx)
    for i in idx:
        yield image[i], label[i]

dataset = tf.data.Dataset.from_generator(
    shuffle_generator,
    args=[train_image, train_label, 42],
    output_signature=(
        tf.TensorSpec(shape=(28,28), dtype=tf.uint8),
        tf.TensorSpec(shape=(), dtype=tf.uint8)))

model = Sequential([
    Flatten(input_shape=(28,28)),
    Dense(100, activation="relu"),
    Dense(100, activation="relu"),
    Dense(10, activation="sigmoid")
])
# compile the model before training, same as in the earlier examples
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics="sparse_categorical_accuracy")

history = model.fit(dataset.batch(32),
                    epochs=50,
                    validation_data=(test_image, test_label),
                    verbose=0)

print(model.evaluate(test_image, test_label))

plt.plot(history.history['val_sparse_categorical_accuracy'])
plt.show()
Dataset with Prefetch
The real benefit of using a dataset is to use prefetch().
Using a NumPy array for training is probably the best option for performance. However, it means we need to load all the data into memory. Using a generator function for training allows us to prepare one batch at a time, so the data can be loaded from disk on demand, for example. However, using a generator function to train a Keras model means either the training loop or the generator function is running at any one time. It is not easy to make the generator function and Keras' training loop run in parallel.
Dataset is the API that allows the generator and the training loop to run in parallel. If you have a generator that is computationally expensive (e.g., doing image augmentation in real time), you can create a dataset from such a generator function and then use it with prefetch(), as follows:
...
history = model.fit(dataset.batch(32).prefetch(3),
                    epochs=50,
                    validation_data=(test_image, test_label),
                    verbose=0)
The number argument to prefetch() is the size of the buffer. Here we ask the dataset to keep three batches in memory ready for the training loop to consume. Whenever a batch is consumed, the dataset API will resume the generator function to refill the buffer asynchronously in the background. Therefore we can allow the training loop and the data preparation algorithm inside the generator function to run in parallel.
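As a variation (an assumption on our part, not shown in the original code), recent versions of TensorFlow also let you pass tf.data.AUTOTUNE so the buffer size is tuned automatically:

# Sketch: let tf.data pick the prefetch buffer size dynamically
history = model.fit(dataset.batch(32).prefetch(tf.data.AUTOTUNE),
                    epochs=50,
                    validation_data=(test_image, test_label),
                    verbose=0)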
It is worth mentioning that, in the previous section, we created a shuffling generator for the dataset API. Indeed, the dataset API also has a shuffle() function to do the same, but we may not want to use it unless the dataset is small enough to fit in memory.
The shuffle() function, same as prefetch(), takes a buffer size argument. The shuffle algorithm fills the buffer with elements from the dataset and draws one element randomly from it. The consumed element is then replaced with the next element from the dataset. Hence we need the buffer to be as large as the dataset itself to make a truly random shuffle. We can demonstrate this limitation with the following snippet:
import tensorflow as tf
import numpy as np

n_dataset = tf.data.Dataset.from_tensor_slices(np.arange(10000))
for n in n_dataset.shuffle(10).take(20):
    print(n.numpy())
The output from the above looks like the following:
9
6
2
7
5
1
4
14
11
17
19
18
3
16
15
22
10
23
21
13
We can see that the numbers are shuffled only around their neighborhood, and we never see large numbers early in the output.
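For comparison, a sketch (not in the original post): if the buffer is as large as the dataset itself, the draw covers the whole range:

# Sketch: a buffer covering all 10,000 elements gives a full shuffle,
# at the cost of holding the entire dataset in memory
for n in n_dataset.shuffle(10000).take(20):
    print(n.numpy())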
Further Reading
More about the tf.data dataset can be found in its API documentation:
- tf.data.Dataset API: https://www.tensorflow.org/api_docs/python/tf/data/Dataset
Summary
In this post, you have seen how we can use a tf.data dataset and how it can be used in training a Keras model.
Specifically, you learned:
- How to train a model using data from a NumPy array, a generator function, and a dataset
- How to create a dataset using a NumPy array or a generator function
- How to use prefetch with a dataset to make the generator and the training loop run in parallel