Should text transformers return arrays?

Question

mheilman opened this issue 8 years ago

The text transformers currently return generators, but the sklearn doc page for TransformerMixin says transformers should return arrays.

Also, the skflow text classification example that uses VocabularyProcessor converts the generator to an array.

I think this means that VocabularyProcessor can't be used in a Pipeline (except maybe if one writes a generator-to-array transformer).
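A generator-to-array transformer of the kind mentioned above could look roughly like the sketch below. The class name GeneratorToArray is hypothetical and is not part of skflow or scikit-learn. It only follows the fit/transform convention; for real use in a Pipeline it would normally also inherit from sklearn.base.BaseEstimator and sklearn.base.TransformerMixin. It assumes every row produced by the generator has the same length.

```python
import numpy as np


class GeneratorToArray:
    """Materialize an iterable of equal-length rows into a 2-D NumPy array.

    Hypothetical helper for illustration; not part of skflow or scikit-learn.
    """

    def __init__(self, dtype=np.int64):
        self.dtype = dtype

    def fit(self, X, y=None):
        # Stateless: there is nothing to learn from the data.
        return self

    def transform(self, X):
        # list() consumes the generator, so the whole dataset ends up in memory.
        return np.array(list(X), dtype=self.dtype)

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)


# Stand-in for the generator a text transformer might return.
rows = (np.array([i, i + 1, i + 2]) for i in range(4))
arr = GeneratorToArray().fit_transform(rows)
print(arr.shape)
```

Note that materializing the generator gives up the memory savings that motivated returning a generator in the first place.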

Illia Polosukhin · Answer 1 · Fri Mar 11 2016 08:57:36 GMT+0800 (China Standard Time)

It's done this way because you can have a very big dataset that you don't want to hold in memory.

VocabularyProcessor should be usable with Pipeline if you use one of the skflow estimators. If it isn't, we can fix that (I didn't actually try it, so let me know if you run into an issue). Streaming data is one of the important parts of skflow and TensorFlow, and sklearn doesn't exactly support it, so we will be extending the interface here.
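To illustrate the streaming idea in general terms, here is a minimal sketch of consuming a generator in fixed-size batches so that only one batch is in memory at a time. It is not skflow's data feeder; the function name iter_batches is an assumption made for this example.

```python
from itertools import islice


def iter_batches(rows, batch_size):
    """Yield lists of up to batch_size items from any iterable, lazily."""
    it = iter(rows)
    while True:
        # islice pulls at most batch_size items without reading further ahead.
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


# Stand-in for a large stream of transformed documents.
stream = (i * i for i in range(7))
for batch in iter_batches(stream, 3):
    print(batch)
```

Each batch could then be handed to an estimator that trains incrementally, for example a scikit-learn estimator that provides partial_fit.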

Michael Heilman · Answer 2 · Fri Mar 11 2016 11:07:37 GMT+0800 (China Standard Time)

Yeah, I definitely understand the value of not loading up everything into memory. I wish scikit-learn had better support for out-of-core learning.

Anyway, I modified this example to have it directly pass in the output of VocabularyProcessor to TensorFlowRNNClassifier rather than wrapping the processor output in a numpy array as done here (comments removed for conciseness; see the lines below # CHANGED HERE):

import numpy as np
from sklearn import metrics
import pandas
import tensorflow as tf
import skflow

train = pandas.read_csv('dbpedia_csv/train.csv', header=None)
X_train, y_train = train[2], train[0]
test = pandas.read_csv('dbpedia_csv/test.csv', header=None)
X_test, y_test = test[2], test[0]

MAX_DOCUMENT_LENGTH = 10
vocab_processor = skflow.preprocessing.VocabularyProcessor(MAX_DOCUMENT_LENGTH)
n_words = len(vocab_processor.vocabulary_)

EMBEDDING_SIZE = 50

def input_op_fn(X):
    word_vectors = skflow.ops.categorical_variable(X, n_classes=n_words,
        embedding_size=EMBEDDING_SIZE, name='words')
    word_list = skflow.ops.split_squeeze(1, MAX_DOCUMENT_LENGTH, word_vectors)
    return word_list

classifier = skflow.TensorFlowRNNClassifier(rnn_size=EMBEDDING_SIZE,
    n_classes=15, cell_type='gru', input_op_fn=input_op_fn,
    num_layers=1, bidirectional=False, sequence_length=None,
    steps=1000, optimizer='Adam', learning_rate=0.01, continue_training=True)

# CHANGED HERE
X_train = vocab_processor.fit_transform(X_train)
classifier.fit(X_train, (y for y in y_train))  # Also make the label argument a generator to avoid a ValueError from data_feeder.py:88

Running that leads to the following exceptions:

/data/src/skflow/examples# python text_classification_builtin_rnn_model.py
W tensorflow/core/common_runtime/executor.cc:1102] 0x9ff0ca0 Compute status: Invalid argument: Index 1 at offset 0 in Tindices is out of range
     [[Node: words/embedding_lookup/embedding_lookup = Gather[Tindices=DT_INT64, Tparams=DT_FLOAT, validate_indices=true, _device="/job:localhost/replica:0/task:0/cpu:0"](words/words_embeddings/read, words/embedding_lookup/Reshape)]]
Traceback (most recent call last):
  File "/opt/conda/envs/3.4/lib/python3.4/site-packages/tensorflow/python/client/session.py", line 571, in _do_call
    return fn(*args)
  File "/opt/conda/envs/3.4/lib/python3.4/site-packages/tensorflow/python/client/session.py", line 555, in _run_fn
    return tf_session.TF_Run(session, feed_dict, fetch_list, target_list)
tensorflow.python.pywrap_tensorflow.StatusNotOK: Invalid argument: Index 1 at offset 0 in Tindices is out of range
     [[Node: words/embedding_lookup/embedding_lookup = Gather[Tindices=DT_INT64, Tparams=DT_FLOAT, validate_indices=true, _device="/job:localhost/replica:0/task:0/cpu:0"](words/words_embeddings/read, words/embedding_lookup/Reshape)]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "text_classification_builtin_rnn_model.py", line 37, in <module>
    classifier.fit(X_train, (y for y in y_train))
  File "/data/src/skflow/skflow/estimators/base.py", line 243, in fit
    feed_params_fn=self._data_feeder.get_feed_params)
  File "/data/src/skflow/skflow/trainer.py", line 118, in train
    feed_dict=feed_dict)
  File "/opt/conda/envs/3.4/lib/python3.4/site-packages/tensorflow/python/client/session.py", line 315, in run
    return self._run(None, fetches, feed_dict)
  File "/opt/conda/envs/3.4/lib/python3.4/site-packages/tensorflow/python/client/session.py", line 511, in _run
    feed_dict_string)
  File "/opt/conda/envs/3.4/lib/python3.4/site-packages/tensorflow/python/client/session.py", line 564, in _do_run
    target_list)
  File "/opt/conda/envs/3.4/lib/python3.4/site-packages/tensorflow/python/client/session.py", line 586, in _do_call
    e.code)
tensorflow.python.framework.errors.InvalidArgumentError: Index 1 at offset 0 in Tindices is out of range
     [[Node: words/embedding_lookup/embedding_lookup = Gather[Tindices=DT_INT64, Tparams=DT_FLOAT, validate_indices=true, _device="/job:localhost/replica:0/task:0/cpu:0"](words/words_embeddings/read, words/embedding_lookup/Reshape)]]
Caused by op 'words/embedding_lookup/embedding_lookup', defined at:
  File "text_classification_builtin_rnn_model.py", line 37, in <module>
    classifier.fit(X_train, (y for y in y_train))
  File "/data/src/skflow/skflow/estimators/base.py", line 217, in fit
    self._setup_training()
  File "/data/src/skflow/skflow/estimators/base.py", line 148, in _setup_training
    self._inp, self._out)
  File "/data/src/skflow/skflow/estimators/rnn.py", line 108, in _model_fn
    self.initial_state)(X, y)
  File "/data/src/skflow/skflow/models.py", line 227, in rnn_estimator
    X = input_op_fn(X)
  File "text_classification_builtin_rnn_model.py", line 20, in input_op_fn
    embedding_size=EMBEDDING_SIZE, name='words')
  File "/data/src/skflow/skflow/ops/embeddings_ops.py", line 77, in categorical_variable
    return embedding_lookup(embeddings, tensor_in)
  File "/data/src/skflow/skflow/ops/embeddings_ops.py", line 50, in embedding_lookup
    embeds_flat = tf.nn.embedding_lookup(params, ids_flat, name)
  File "/opt/conda/envs/3.4/lib/python3.4/site-packages/tensorflow/python/ops/embedding_ops.py", line 86, in embedding_lookup
    validate_indices=validate_indices)
  File "/opt/conda/envs/3.4/lib/python3.4/site-packages/tensorflow/python/ops/gen_array_ops.py", line 423, in gather
    validate_indices=validate_indices, name=name)
  File "/opt/conda/envs/3.4/lib/python3.4/site-packages/tensorflow/python/ops/op_def_library.py", line 655, in apply_op
    op_def=op_def)
  File "/opt/conda/envs/3.4/lib/python3.4/site-packages/tensorflow/python/framework/ops.py", line 2040, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/opt/conda/envs/3.4/lib/python3.4/site-packages/tensorflow/python/framework/ops.py", line 1087, in __init__
    self._traceback = _extract_stack()

I'm not really sure what's going on there without taking a deeper look at it, but I'm pretty sure it means a pipeline won't work with this example either, because the output from the processor gets passed along in the same way.
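One possible explanation, not confirmed anywhere in this thread: in the script above, n_words is read from vocab_processor.vocabulary_ before fit_transform is called, so the embedding table may be sized for a vocabulary that has not been built yet. That would be consistent with "Index 1 ... is out of range". The sketch below reproduces only that ordering effect with a toy class; ToyVocabulary and its reserved unknown-word entry are assumptions for illustration, not skflow's implementation.

```python
class ToyVocabulary:
    """Hypothetical stand-in for a vocabulary processor; id 0 means 'unknown word'."""

    def __init__(self):
        self.vocabulary_ = {"<UNK>": 0}

    def fit(self, documents):
        for doc in documents:
            for token in doc.split():
                # Assign the next free id to tokens not seen before.
                self.vocabulary_.setdefault(token, len(self.vocabulary_))
        return self

    def transform(self, documents):
        # Lazy, like the generator-returning transformers discussed in this thread.
        for doc in documents:
            yield [self.vocabulary_.get(token, 0) for token in doc.split()]


docs = ["the cat sat", "the dog ran"]
processor = ToyVocabulary()

n_words_before = len(processor.vocabulary_)  # read before fitting
processor.fit(docs)
ids = list(processor.transform(docs))
n_words_after = len(processor.vocabulary_)   # read after fitting

max_id = max(max(row) for row in ids)
# An embedding table with n_words_before rows cannot hold index max_id.
print(n_words_before, n_words_after, max_id)
```

If this is what happens in the real script, reading the vocabulary size only after the processor has been fitted would be the thing to try first.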

Illia Polosukhin · Answer 3 · Wed Oct 26 2016 12:29:27 GMT+0800 (China Standard Time)

The preprocessing code was removed from TensorFlow in favor of future preprocessing tools and, for now, scikit-learn tools. Sorry!