This is more of a question regarding best practices. I have a large amount of labeled text data and am trying to train a binary classifier. When I optimize hyperparameters in a pipeline, the vectorization step takes a huge amount of time. Is it acceptable to vectorize the text in the training set once before cross-validation, rather than re-vectorizing the same training text over and over?
Here is a snippet of my code:
    from nltk.tokenize import RegexpTokenizer
    from scipy import stats
    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
    from sklearn.preprocessing import Normalizer
    from sklearn.linear_model import SGDClassifier
    from sklearn.pipeline import Pipeline
    from sklearn.cross_validation import StratifiedKFold  # older scikit-learn API (pre-0.18)
    from sklearn.grid_search import RandomizedSearchCV    # older scikit-learn API (pre-0.18)

    # Vectorize once, outside the pipeline, so the counts aren't recomputed per fold
    tokenizer = RegexpTokenizer(r'\b[a-zA-Z]{2,}\b')
    vect = CountVectorizer(tokenizer=tokenizer.tokenize, ngram_range=(1, 2))
    X_train_vect = vect.fit_transform(X_train)

    # Pipeline starts from the precomputed counts: tf-idf -> normalize -> classify
    pipe = Pipeline([('vect', TfidfTransformer()),
                     ('norm', Normalizer()),
                     ('clf', SGDClassifier())])

    skf = StratifiedKFold(y_train, n_folds=5, shuffle=True)

    params = {'vect__norm': ['l1', 'l2'],
              'vect__use_idf': [True, False],
              'clf__loss': ['hinge', 'log', 'modified_huber'],
              'clf__alpha': stats.expon(scale=0.0001)}

    gs = RandomizedSearchCV(estimator=pipe, param_distributions=params,
                            n_iter=50, scoring='accuracy', verbose=3, cv=skf)
    gs.fit(X_train_vect, y_train)
I've tried incorporating the CountVectorizer into the pipeline and didn't see a difference. Any thoughts/suggestions?
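For reference, this is roughly what that in-pipeline variant looked like (a sketch, assuming the same tokenizer, fold object, and search settings as above; the step names 'count' and 'tfidf' here are just my labels, so the parameter prefixes change accordingly):

    # In-pipeline variant: CountVectorizer is the first step, so it is refit
    # on the raw training text inside every CV fold and candidate evaluation
    pipe = Pipeline([('count', CountVectorizer(tokenizer=tokenizer.tokenize,
                                               ngram_range=(1, 2))),
                     ('tfidf', TfidfTransformer()),
                     ('norm', Normalizer()),
                     ('clf', SGDClassifier())])

    params = {'tfidf__norm': ['l1', 'l2'],
              'tfidf__use_idf': [True, False],
              'clf__loss': ['hinge', 'log', 'modified_huber'],
              'clf__alpha': stats.expon(scale=0.0001)}

    gs = RandomizedSearchCV(estimator=pipe, param_distributions=params,
                            n_iter=50, scoring='accuracy', verbose=3, cv=skf)
    gs.fit(X_train, y_train)  # raw text goes in; vectorization happens per fold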