The corpus is created, stopwords are defined, and cleansing is done (removePunctuation, removeNumbers, tolower...).
The corpus has been stemmed. The function executed correctly and works as it should, but...
I need to know which words are being stemmed to each common root. Is this possible using the tm package, or with another package?
For example, terma1, terma2, termb1, termb2, termb3: all of them are stemmed to term, and the new corpus reflects term. However, I need to know which words are associated with each root word, and therefore the optimal output should be:
    Term      Stem
    terma1    term
    terma2    term
    termb1    term
    termb2    term
    termb3    term
    ...
    worda1    word
    wordb1    word
    wordb2    word
    wordb3    word
    wordc1    word
In the tm package there is a function, stemCompletion, that allows you to complete each stemmed word given a specific dictionary.
To obtain the output, proceed as follows:
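    library(tm)
    library(data.table)  # needed for data.table()

    data("crude")  # example corpus shipped with tm

    # Complete each stem, using the corpus as the dictionary
    words <- stemCompletion(c("compan", "entit", "suppl"), crude)
    stemmed <- names(words)    # the stems
    stemcomp <- unname(words)  # their completions
    data.table(stemmed, stemcomp)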
References: stemCompletion {tm}
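Note that stemCompletion returns a single completion per stem, so on its own it does not list every original word behind a root. A minimal sketch of one way to get the full word-to-stem table is to stem the unstemmed vocabulary with wordStem from SnowballC and group the words by stem. The `vocab` vector below is made up for illustration; with a real corpus it would have to be the list of terms taken before stemming.

```r
library(SnowballC)

# Hypothetical vocabulary; replace with the unstemmed terms of your corpus
vocab <- c("connect", "connected", "connecting", "connection", "running", "runs")

# Stem every word and keep the original next to its stem
stems <- wordStem(vocab, language = "english")
mapping <- data.frame(term = vocab, stem = stems, stringsAsFactors = FALSE)

# Group the original words under their common root
by_stem <- split(mapping$term, mapping$stem)

print(mapping)
print(by_stem)
```

The `mapping` data frame has the two-column shape asked for in the question, and `by_stem` gives the same information as a list keyed by root.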
[Update: more German words]
I tried to verify the behavior with German umlauts:
    library(SnowballC)
    library(tm)
    library(data.table)

    text <- c("für", "aktuelle", "nachrichten", "und", "themen", "bilder",
              "und", "videos", "aus", "den", "bereichen", "news",
              "wirtschaft", "politik", "können", "fremdschämen",
              "lebensmüde", "erklärungsnot")

    # Stem the words, then try to complete the stems against the originals
    stemmed <- wordStem(text, language = "porter")
    completed <- stemCompletion(stemmed, text)
    comparison <- data.table(text, stemmed, completed)
In the comparison table you can see that the original words with German umlauts are not being stemmed. However, if I try to complete the stem "f":

    stemCompletion("f", text)

I obtain the correct word "für". This is strange; maybe you can follow up from here and try to find a workaround.
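One possible direction for a workaround, sketched under assumptions rather than confirmed as a fix: "porter" selects the English Porter stemmer, so the German words might be handled better with language = "german", and pairing each original word with its stem directly avoids the stemCompletion step. The word list is a shortened copy of the one used in the update.

```r
library(SnowballC)

text <- c("für", "können", "fremdschämen", "lebensmüde", "erklärungsnot",
          "nachrichten", "bereichen")

# Assumption: the German Snowball stemmer suits these words better than "porter"
stemmed <- wordStem(text, language = "german")

# Keep each original word next to its stem; no completion step is involved
comparison <- data.frame(text, stemmed, stringsAsFactors = FALSE)
print(comparison)
```

Whether the resulting stems are acceptable for a given corpus would need to be checked by inspecting the `comparison` table.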