Dataset loading

Question

danintheory opened this issue 9 years ago

I'm trying to train an HMM that classifies lines in an HTML document as belonging to a certain zone or class (e.g. body, header, footer, title, etc.). Thus, each sequence is a document and each sample is a line. For each line, I compute a number of floating-point features.

What is the correct input format for this data in order to train a model with seqlearn? I'm having trouble understanding how to format the data from the documentation.

Lars · Answer 1 · Wed May 06 2015 17:17:36 GMT+0800 (China Standard Time)

All lines in the entire set of HTML documents would be one big matrix X. Each row in this matrix is a sample (line). All the labels of all the lines are a single target vector y of the same length (len(y) == X.shape[0]).

The sequence boundaries go in a separate array lengths that contains the length of each sequence (document), in the same order as the rows of X, so that lengths.sum() == X.shape[0].

So, suppose you have a function that computes the features of a single line as a vector (1-d NumPy array):

import numpy as np

def features(line):
    return np.array([feature1(line), feature2(line)])

... then you should be able to construct the input as follows:

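X, y, lengths = [], [], []

# Assumes training_set yields (doc, labels) pairs with one label per line of doc.
for doc, labels in training_set:
    lines = doc.splitlines()
    lengths.append(len(lines))
    for line, label in zip(lines, labels):
        X.append(features(line))
        y.append(label)

X, y, lengths = map(np.asarray, [X, y, lengths])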
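
To make the layout concrete, here is a minimal sketch with made-up numbers: two documents of 3 and 2 lines, two float features per line. The feature values and label strings are purely illustrative. It checks the invariants described above and shows how the flat X can be cut back into per-document blocks using lengths alone.

```python
import numpy as np

# Toy data (hypothetical): 2 documents, 3 + 2 lines, 2 float features per line.
X = np.array([
    [0.1, 1.0],   # doc 0, line 0
    [0.4, 0.0],   # doc 0, line 1
    [0.9, 0.5],   # doc 0, line 2
    [0.2, 0.3],   # doc 1, line 0
    [0.7, 0.8],   # doc 1, line 1
])
y = np.array(["header", "body", "footer", "title", "body"])
lengths = np.array([3, 2])

# Invariants the layout must satisfy: one label per row,
# and the sequence lengths must add up to the number of rows.
assert len(y) == X.shape[0]
assert lengths.sum() == X.shape[0]

# Document boundaries are the cumulative sums of lengths (last one dropped).
boundaries = np.cumsum(lengths)[:-1]
docs_X = np.split(X, boundaries)   # list of per-document feature blocks
docs_y = np.split(y, boundaries)   # list of per-document label vectors
```

If either assertion fails on real data, a document was probably appended to lengths without all of its lines being added to X and y.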

Does that answer your question?

Dan Roberts · Answer 2 · Wed May 06 2015 23:19:43 GMT+0800 (China Standard Time)

Thank you, that was very helpful. So the X matrix can be a dense matrix, where each row is a feature vector and each column is a different sample? And the feature vectors can be float valued with multiple features being nonzero?

I was confused by trying to deconstruct the included conll.py file in the example. I wasn't sure if features had to be translated to a list of strings (such as ["feature1:val1", "feature2:val2"]) and then encoded into a sparse matrix using the FeatureHasher from sklearn. Also, from your documentation, I wasn't sure what this line

"Make sure the training set (X) is one-hot encoded; if more than one feature in X is on, the emission probabilities will be multiplied."

meant in terms of having my feature vectors as floats, many of which are nonzero.

Finally, I wasn't sure (from deconstructing conll.py) whether features from the previous and subsequent sample in the sequence need to be included in the current sample (as is shown in the conll.py example).

Thanks again for all your help!

Lars · Answer 3 · Thu May 07 2015 04:30:14 GMT+0800 (China Standard Time)

X may be either a dense array or a sparse matrix. It follows scikit-learn conventions.

Re: one-hot encoding, that's because the HMM is meant to deal with categorical data and each feature should represent the identity of an event as a boolean. I think you should be using a StructuredPerceptron if your data is anything different (sorry, hadn't thought about this earlier, I very seldom use HMMs).
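
As an illustration of what "one-hot encoded" means for the HMM case, the sketch below turns one categorical observation per line into a matrix where exactly one column is on in each row. The events list (HTML tag names) is a hypothetical stand-in for whatever categorical observation you have; this is plain NumPy and not part of seqlearn.

```python
import numpy as np

# Hypothetical categorical observation per line, e.g. the dominant HTML tag.
events = ["div", "p", "p", "h1", "div"]

# Map each distinct category to a column index.
vocab = sorted(set(events))
index = {name: i for i, name in enumerate(vocab)}
ids = np.array([index[e] for e in events])

# One row per sample, with exactly one column set to 1.
X_onehot = np.zeros((len(events), len(vocab)), dtype=int)
X_onehot[np.arange(len(events)), ids] = 1

# Each row sums to 1, i.e. only one feature is "on" per sample.
assert (X_onehot.sum(axis=1) == 1).all()
```

Float-valued feature vectors with several nonzero entries do not fit this form, which is why a different model such as StructuredPerceptron is suggested for that kind of data.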