nlp - R - Package tm - Which terms correspond to each common root after stemming?


The corpus has been created, stopwords defined, and cleansing done (removePunctuation, removeNumbers, tolower...).

The corpus is ready to be stemmed. The function executes correctly and works as it should, but...

I need to know which words are being stemmed to each common root. Is this possible using the tm package, or with some other package?

For example, terma1, terma2, termb1, termb2 and termb3 are all stemmed to term, and the new corpus will reflect term. However, I need to know which words are associated with each root word, and therefore the optimal output should be:

    term     stem
    terma1   term
    terma2   term
    termb1   term
    termb2   term
    termb3   term
    ...
    worda1   word
    wordb1   word
    wordb2   word
    wordb3   word
    wordc1   word

In the tm package there is a function, stemCompletion, that lets you complete each stemmed word given a specific dictionary.

To obtain the output, proceed as follows:

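    library(tm)
    library(data.table)  # needed for data.table() below

    data("crude")  # example corpus shipped with tm, used here as the dictionary

    # Complete each stem against the words found in the dictionary
    words <- stemCompletion(c("compan", "entit", "suppl"), crude)

    stemmed  <- names(words)   # the stems that were passed in
    stemcomp <- unname(words)  # the completion chosen for each stem

    data.table(stemmed, stemcomp)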
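Since stemCompletion returns one completion per stem, it does not by itself list every original word behind a root. A minimal sketch of the term-to-stem table asked for in the question, assuming the unstemmed vocabulary is still available: stem each distinct term with wordStem from SnowballC and keep the original next to its root. The `terms` vector here is a hypothetical stand-in; in practice it could be taken from the corpus before stemming, for example with Terms(DocumentTermMatrix(corpus)).

```r
library(SnowballC)

# Hypothetical vocabulary, standing in for the terms of the unstemmed corpus.
terms <- c("company", "companies", "supply", "supplies", "supplied")

# One row per original term, with the stem it is reduced to.
mapping <- data.frame(
  term = terms,
  stem = wordStem(terms, language = "english"),
  stringsAsFactors = FALSE
)

# Group the original terms under each common root.
by_root <- split(mapping$term, mapping$stem)

print(mapping)
print(by_root)
```

The `mapping` data frame has the same shape as the desired output above, and `by_root` is a named list with one entry per root, which may be handier for looking up all the words behind a single stem.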

References: stemCompletion {tm}

[Update: more German words]

I tried to verify the behavior with German vowels (umlauts):

    library(SnowballC)
    library(tm)
    library(data.table)

    text <- c("für", "aktuelle", "nachrichten", "und", "themen", "bilder",
              "und", "videos", "aus", "den", "bereichen", "news", "wirtschaft",
              "politik", "können", "fremdschämen", "lebensmüde", "erklärungsnot")

    # Stem every word, then try to complete each stem using the original words
    stemmed    <- wordStem(text, language = "porter")
    completed  <- stemCompletion(stemmed, text)
    comparison <- data.table(text, stemmed, completed)

In the table comparison you can see that the original words with German vowels are not being stemmed. However, if you try to complete a given stem such as "f" with stemCompletion("f", text), you obtain the correct word "für". This is strange; maybe you can follow up from here and try to find a workaround.
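One thing that may be worth trying, as a sketch rather than a confirmed fix: the test above uses language = "porter", which is the English Porter stemmer, while SnowballC also ships a German stemmer (listed by getStemLanguages()). The variable names below are hypothetical, and whether the German stemmer gives the roots you expect for the umlaut words is an assumption to check on your own data.

```r
library(SnowballC)

# A few of the umlaut words from the test above.
words_de <- c("für", "können", "fremdschämen", "lebensmüde", "erklärungsnot")

# Confirm that a German stemmer is available in this SnowballC installation.
has_german <- "german" %in% getStemLanguages()

# Stem with the German Snowball stemmer instead of the English Porter one.
stemmed_de <- wordStem(words_de, language = "german")

# Put the originals next to their stems for inspection.
comparison_de <- data.frame(word = words_de, stem = stemmed_de,
                            stringsAsFactors = FALSE)
print(comparison_de)
```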

