python - Precomputing CountVectorizer


This is more of a question about best practices. I have a large amount of labeled text data and am trying to train a binary classifier. When I optimize hyperparameters in a pipeline, the vectorization step takes a huge amount of time. Is it acceptable to vectorize the text in the training set once, before cross-validation, rather than re-vectorizing the same training text over and over?

Here is a snippet of my code:

    tokenizer = RegexpTokenizer(r'\b[a-zA-Z]{2,}\b')
    vect = CountVectorizer(tokenizer=tokenizer.tokenize, ngram_range=(1, 2))

    X_train_vect = vect.fit_transform(X_train)

    pipe = Pipeline([('vect', TfidfTransformer()),
                     ('norm', Normalizer()),
                     ('clf', SGDClassifier())])

    skf = StratifiedKFold(y_train, n_folds=5, shuffle=True)

    params = {'vect__norm': ['l1', 'l2'],
              'vect__use_idf': [True, False],
              'clf__loss': ['hinge', 'log', 'modified_huber'],
              'clf__alpha': stats.expon(scale=0.0001)}

    gs = RandomizedSearchCV(estimator=pipe,
                            param_distributions=params,
                            n_iter=50,
                            scoring='accuracy',
                            verbose=3,
                            cv=skf)

    gs.fit(X_train_vect, y_train)

I've tried incorporating the CountVectorizer into the pipeline itself and didn't see a difference in results. Any thoughts/suggestions?
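For reference, the alternative mentioned above (putting the CountVectorizer inside the pipeline, so it is re-fit on every cross-validation fold) would look roughly like this. This is a minimal sketch using toy data, not the actual training set; the step names and the `random_state` are my own additions:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.preprocessing import Normalizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# CountVectorizer as the first pipeline step: it is re-fit on each CV
# training fold, which avoids leaking test-fold vocabulary into the
# features, at the cost of repeating the tokenization work per fold.
pipe = Pipeline([('count', CountVectorizer(ngram_range=(1, 2))),
                 ('tfidf', TfidfTransformer()),
                 ('norm', Normalizer()),
                 ('clf', SGDClassifier(random_state=0))])

# toy data just to show the pipeline runs end-to-end on raw text
X_train = ["good movie", "bad movie", "great film", "terrible film"]
y_train = [1, 0, 1, 0]

pipe.fit(X_train, y_train)
preds = pipe.predict(["good film", "bad film"])
```

The trade-off is correctness versus speed: fitting the vectorizer inside each fold is the statistically clean choice, while precomputing it once on the whole training set reuses vocabulary learned from data that ends up in the validation folds.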
