transformation drops documents

Kev1807 · Post by **Kev1807** » Sun Jan 06, 2019 9:44 pm

Hello everyone,

After spending a lot of time recently on the database and on the non-technical part of the project, here is a new problem:

I want to read in a PDF with pdftools and then use

Code:

tm_map(corpus, removeWords, stopwords('english'))

to remove the stopwords from it.

Code:

install.packages("tm")
install.packages("SnowballC")
install.packages("RColorBrewer")
install.packages("wordcloud")
install.packages("pdftools")

library("tm")
library("SnowballC")
library("RColorBrewer")
library("wordcloud")
library("pdftools")

text <- pdf_text("H:/HEIBOX/My Library/Ebooks/Pack.pdf")

corpus <- Corpus(VectorSource(text))

clean_data <- tm_map(corpus, removeWords, stopwords('english'))

On the last line I get the following warning:

Code:

In tm_map.SimpleCorpus(corpus, removeWords, stopwords("english")) :
  transformation drops documents

I thought this was caused by an outdated R version (I was still on 3.5.0), so I updated, but the problem persists.

Unfortunately the command is not executed, so I can only continue to a limited extent.
Can anyone give me a hint?

Post by **EDi** » Sun Jan 06, 2019 11:46 pm

Kev1807 wrote:
Unfortunately the command is not executed, so I can only continue to a limited extent.
Can anyone give me a hint?

Warnings do not abort the computation; the faulty part could be at least partially corrected (that is the definition of a warning in R).

An error, by contrast, cannot be recovered from, and the computation aborts.

Since you have a warning and have not posted a reproducible example, I do not believe your statement that nothing is executed.

I'm sure you have already found, read, and understood this.
Have you tried the data.frame solution?
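
The distinction between a warning and an error described in this post can be sketched in base R. This is only an illustration: the functions f_warn and f_err are made-up names and have nothing to do with tm.

```r
# f_warn and f_err are hypothetical helpers, not part of any package.

# A warning is signalled, but the function still finishes and returns a value.
f_warn <- function(x) {
  warning("something looks odd")
  x * 2
}

# An error is signalled: execution stops and no value is returned.
f_err <- function(x) {
  stop("cannot continue")
  x * 2
}

# The result is computed even though a warning was raised.
res <- suppressWarnings(f_warn(21))

# The error handler supplies NA because f_err never returns.
out <- tryCatch(f_err(21), error = function(e) NA)

print(res)
print(out)
```

In other words, a call that only warns still hands back an object that can be inspected, which is the point being made about tm_map().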

Kev1807 · Post by **Kev1807** » Mon Jan 07, 2019 4:49 pm

EDi wrote: ↑Sun Jan 06, 2019 11:46 pm
Unfortunately the command is not executed, so I can only continue to a limited extent.
Can anyone give me a hint?
Warnings do not abort the computation; the faulty part could be at least partially corrected (that is the definition of a warning in R).

An error, by contrast, cannot be recovered from, and the computation aborts.

Since you have a warning and have not posted a reproducible example, I do not believe your statement that nothing is executed.

Wow. So if I want to remove words and they still show up afterwards, the command has been executed? It doesn't look that way to me.

I can't upload the PDF here, that would be file sharing! And the code itself does work ...

EDi wrote:
I'm sure you have already found, read, and understood this.
Have you tried the data.frame solution?

Yes. The error message is gone, but the punctuation and, for example, the words I excluded manually (e.g. "the") were still not removed ... The text may be somewhat poorly scanned (it is from the 19th century, after all), but it should still be possible to remove colons and semicolons, shouldn't it? In another text "the" also remains; I just tried it.

For this work I am following a tutorial that definitely used VectorSource. Did something change somewhere in some update? (That is, regarding the original question.) Or rather, why exactly does this error message appear now?

The word cloud made with the exact tutorial code from https://www.springboard.com/blog/text-mining-in-r/ gives a completely different result; for example, the words that were not removed are clearly visible! (Middell-Wordcloud.png follows the tutorial directly.)
Middell-WC.png is the one I created without looking at the tutorial.

Unfortunately, rebuilding it does not work either!

Kind regards and many thanks!

Athomas · Post by **Athomas** » Tue Jan 08, 2019 4:07 pm

So, I have now reproduced this using a free copy of "A Midsummer Night's Dream".
I get the same warning as you, but also a fully populated clean_data object: a hefty list with 108 elements, one for each (book) page.

It contains (among other things) the thinned-out contents. For example,

THESEUS
No epilogue, I pray you; for your play needs no
excuse. Never excuse; for when the players are all
dead, there needs none to be blamed. Marry, if he
that writ it had played Pyramus and hanged himself
in Thisbe's garter, it would have been a fine
tragedy: and so it is, truly; and very notably
discharged. But come, your Bergomask: let your
epilogue alone.

in the original becomes

THESEUS\r\n No epilogue, I pray ; play needs \r\n excuse. Never excuse; players \r\n dead, needs none blamed. Marry, \r\n writ played Pyramus hanged \r\n Thisbe's garter, fine\r\n tragedy: , truly; notably\r\n discharged. But come, Bergomask: let \r\n epilogue alone.\r\n

in the "cleaned" version.

What do you want, if not that?!

Post by **EDi** » Tue Jan 08, 2019 11:47 pm

With this reproducible example everything works as it should. There is also no error message, because I use DataframeSource. I use magrittr for readability and have added a small function that prints the word frequencies (and returns the corpus again) so that the individual steps can be followed.

I suspect you forgot tolower() (or assumed it was already included). stripWhitespace and other functions are certainly useful too when working with PDFs...

If this does not match your expectations, please post a reproducible example.

Kev1807 wrote:
Wow. So if I want to remove words and they still show up afterwards, the command has been executed? It doesn't look that way to me.

It seems we have different understandings of "the command is executed". Mine refers purely to the execution, yours apparently to the returned result. See also https://adv-r.hadley.nz/conditions.html ... conditions

Code:

library(tm)
library(magrittr)

# some data
text <- c('The warden is best.', 'Happy Potter is the smartest.', 'I think I spider.')
df <- data.frame(doc_id = seq_along(text), text = text, stringsAsFactors = FALSE)
corpus <- Corpus(DataframeSource(df))

# function that just prints the table of words
print_words <- function(corpus) {
  print(rowSums(as.matrix(TermDocumentMatrix(corpus))))
  return(corpus)
}


corpus %>% 
  print_words() %>%
  # this drops the 'the' in Happy Potter
  tm_map(removeWords, stopwords('english')) %>%
  print_words()  %>%
  # tolower drops also the other the
  tm_map(tolower) %>%
  tm_map(removeWords, stopwords('english')) %>%
  print_words() %>%
  # this drops potter
  tm_map(removeWords, 'potter') %>%
  print_words()

Kev1807 · Post by **Kev1807** » Wed Jan 09, 2019 6:47 pm

I copied your suggestion directly and added the PDF part:

Code:

install.packages("tm")
install.packages("magrittr")
install.packages("pdftools")
library("tm")
library("magrittr")
library("pdftools")

# some data
setwd("H:/HEIBOX/My Library/Ebooks/")
text <- pdf_text("Pack.pdf")
df <- data.frame(doc_id = seq_along(text), text = text, stringsAsFactors = FALSE)
corpus <- Corpus(DataframeSource(df))

# function that just prints the table of words
print_words <- function(corpus) {
  print(rowSums(as.matrix(TermDocumentMatrix(corpus))))
  return(corpus)
}


corpus %>% 
  print_words() %>%
  # this drops the 'the' 
  tm_map(removeWords, stopwords('english')) %>%
  print_words()  %>%
  # tolower drops also the other the
  tm_map(tolower) %>%
  tm_map(removeWords, stopwords('english')) %>%
  print_words() %>%
  # this drops potter
  tm_map(removeWords, 'potter') %>%
  print_words()


# TermDocumentMatrix
dtm <- TermDocumentMatrix(corpus)
m <- as.matrix(dtm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
head(d,10)

Result of head(d,10):

Code:

       word freq
the     the 6404
and     and 3045
was     was 1493
his     his  933
that   that  933
had     had  910
for     for  840
with   with  730
anson anson  652
were   were  588

By contrast, the following script:

Code:

# Install
install.packages("tm")  # for text mining
install.packages("SnowballC") # for text stemming
install.packages("wordcloud") # word-cloud generator 
install.packages("RColorBrewer") # color palettes
install.packages("pdftools") # pdftools for opening PDFs

# Load
library("tm")
library("SnowballC")
library("wordcloud")
library("RColorBrewer")
library("pdftools")

# Which directory should be searched?
# Format: "file path"
setwd("H:/HEIBOX/My Library/EBOOKS/")

# Prepare the text
text <- pdf_text("Pack.pdf")
# Turn the text into a corpus
docs <- Corpus(VectorSource(text))

inspect(docs)
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
docs <- tm_map(docs, toSpace, "/")
docs <- tm_map(docs, toSpace, "@")
docs <- tm_map(docs, toSpace, "\\|")
# Convert all text to lower case
docs <- tm_map(docs, content_transformer(tolower))
# We don't need numbers - or do we?
docs <- tm_map(docs, removeNumbers)
# Remove the stopwords
docs <- tm_map(docs, removeWords, stopwords("english"))
# Custom list of stopwords to remove,
# defined as a character vector
docs <- tm_map(docs, removeWords, c("blabla1", "blabla2"))
# Remove punctuation
docs <- tm_map(docs, removePunctuation)
# Remove white space
docs <- tm_map(docs, stripWhitespace)
# Text stemming
docs <- tm_map(docs, stemDocument)

# TermDocumentMatrix
dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
head(d, 10)

gives the expected result:

Code:

> head(d, 10)
               word freq
anson         anson  912
ship           ship  539
lord           lord  343
admir         admir  249
command     command  239
centurion centurion  232
offic         offic  232
men             men  225
time           time  225
great         great  217

Since the second one works, this is not a big deal, but I would still find it interesting to know what the cause is ...
Two differences come to mind offhand:
plain text turned into a data frame vs. a PDF as VectorSource, and the use of magrittr.

In case it is relevant, @EDi: the output of exactly your code also still looks very "the"-heavy on my machine.

Code:

> text <- c('The warden is best.', 'Happy Potter is the smartest.', 'I think I spider.')
> df <- data.frame(doc_id = seq_along(text), text = text, stringsAsFactors = FALSE)
> corpus <- Corpus(DataframeSource(df))
> print_words <- function(corpus) {
+   print(rowSums(as.matrix(TermDocumentMatrix(corpus))))
+   return(corpus)
+ }
> corpus %>% 
+   print_words() %>%
+   # this drops the 'the' in Happy Potter
+   tm_map(removeWords, stopwords('english')) %>%
+   print_words()  %>%
+   # tolower drops also the other the
+   tm_map(tolower) %>%
+   tm_map(removeWords, stopwords('english')) %>%
+   print_words() %>%
+   # this drops potter
+   tm_map(removeWords, 'potter') %>%
+   print_words()
    best.       the    warden     happy    potter smartest.   spider.     think 
        1         2         1         1         1         1         1         1 
    best.       the    warden     happy    potter smartest.   spider.     think 
        1         1         1         1         1         1         1         1 
    best.    warden     happy    potter smartest.   spider.     think 
        1         1         1         1         1         1         1 
    best.    warden     happy smartest.   spider.     think 
        1         1         1         1         1         1 
<<SimpleCorpus>>
Metadata:  corpus specific: 1, document level (indexed): 0
Content:  documents: 3

I'm starting to not understand this anymore.
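
One difference between the two PDF scripts in this post may explain the differing head(d, 10) results. It is offered here as an assumption, since the thread does not resolve the question. In the first script the result of the %>% chain is only printed by print_words() and never assigned, so the later TermDocumentMatrix(corpus) call still sees the untouched corpus. The second script reassigns docs after every tm_map() call. A minimal sketch of that effect follows; the helper clean_corpus is hypothetical, and VCorpus is used instead of Corpus only to keep the example independent of the SimpleCorpus warning.

```r
library(tm)

text <- c("The warden is best.", "Happy Potter is the smartest.")
corpus <- VCorpus(VectorSource(text))

# Hypothetical helper: lower-case the text, then remove English stopwords.
clean_corpus <- function(x) {
  x <- tm_map(x, content_transformer(tolower))
  tm_map(x, removeWords, stopwords("english"))
}

# Result discarded: `corpus` itself is not modified by the call.
invisible(clean_corpus(corpus))
the_before <- "the" %in% Terms(TermDocumentMatrix(corpus))

# Result assigned: the cleaned corpus is what gets counted.
cleaned <- clean_corpus(corpus)
the_after <- "the" %in% Terms(TermDocumentMatrix(cleaned))

print(the_before)
print(the_after)
```

In the console output of EDi's example pasted above, the count for "the" goes from 2 to 1 and then disappears from the table, which is consistent with the transformations taking effect inside the chain.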

Deutsches R-Forum
