## Feeding data to TensorFlow
TensorFlow is designed to work efficiently with large amounts of data, so it's important not to starve your TensorFlow model of input if you want to get the best performance out of it. There are several ways to feed data to TensorFlow.
### Constants
The simplest approach is to embed the data in your graph as a constant:
```python
import tensorflow as tf
import numpy as np
actual_data = np.random.normal(size=[100])
data = tf.constant(actual_data)
```
This approach can be very efficient, but it's not very flexible. One problem is that in order to use your model with another dataset you have to rewrite the graph. Another is that you have to load all of your data at once and keep it in memory, which only works for small datasets.
### Placeholders
Using placeholders solves both of these problems:
```python
import tensorflow as tf
import numpy as np
data = tf.placeholder(tf.float32)
prediction = tf.square(data) + 1
actual_data = np.random.normal(size=[100])
with tf.Session() as sess:
    sess.run(prediction, feed_dict={data: actual_data})
```
The placeholder operator returns a tensor whose value is supplied through the feed_dict argument of Session.run. Note that calling Session.run without feeding a value for data in this case will raise an error.
### Python ops
Another way to feed data to TensorFlow is through Python ops:
```python
import numpy as np
import tensorflow as tf

def py_input_fn():
    # The returned NumPy dtype must match the Tout argument below.
    return np.random.normal(size=[100]).astype(np.float32)

data = tf.py_func(py_input_fn, [], tf.float32)
```
Python ops let you wrap a regular Python function as a TensorFlow operation; the wrapped function is executed by the Python interpreter every time the op is evaluated.
### Dataset API
The recommended way of reading data in TensorFlow, however, is through the Dataset API:
```python
import numpy as np
import tensorflow as tf

actual_data = np.random.normal(size=[100])
dataset = tf.data.Dataset.from_tensor_slices(actual_data)
data = dataset.make_one_shot_iterator().get_next()
```
If you need to read your data from files, it may be more efficient to write it in TFRecord format and use TFRecordDataset to read it:
```python
dataset = tf.data.TFRecordDataset(path_to_data)
```
See the [official docs](https://www.tensorflow.org/api_guides/python/reading_data#Reading_from_files) for an example of how to write your dataset in TFRecord format.
The Dataset API makes it easy to build efficient data-processing pipelines. For example, this is how we process our data in the accompanying framework (see
[trainer.py](https://github.com/vahidk/TensorflowFramework/blob/master/trainer.py)):
```python
dataset = ...
dataset = dataset.cache()
if mode == tf.estimator.ModeKeys.TRAIN:
    dataset = dataset.repeat()
    dataset = dataset.shuffle(batch_size * 5)
dataset = dataset.map(parse, num_parallel_calls=8)
dataset = dataset.batch(batch_size)
```
After reading the data, we use the Dataset.cache method to cache it in memory for improved efficiency. In training mode we repeat the dataset indefinitely, which lets us process the whole dataset many times, and we shuffle it so that batches are drawn with different sample distributions. Next, we use Dataset.map to preprocess the raw records and convert the data to a format the model can use. Finally, we create batches of samples by calling Dataset.batch.