Generic extraction of the main text content from HTML files, removing ads, sidebars, and headers with the boilerpipe <https://github.com/kohlschutter/boilerpipe> Java library. The boilerpipe extraction heuristics perform robustly across a wide range of website templates.
| Field | Value |
| --- | --- |
| Version: | 1.3.2 |
| Imports: | rJava | 
| Suggests: | RCurl | 
| Published: | 2021-05-19 | 
| DOI: | 10.32614/CRAN.package.boilerpipeR | 
| Author: | See AUTHORS file (boilerpipeR author details) |
| Maintainer: | Mario Annau <mario.annau at gmail.com> | 
| BugReports: | https://github.com/mannau/boilerpipeR/issues | 
| License: | Apache License (== 2.0) | 
| URL: | https://github.com/mannau/boilerpipeR | 
| NeedsCompilation: | no | 
| Materials: | NEWS | 
| In views: | NaturalLanguageProcessing, WebTechnologies | 
| CRAN checks: | boilerpipeR results | 
| Reference manual: | boilerpipeR.html , boilerpipeR.pdf | 
| Vignettes: | Introduction to the tm.plugin.webmining Package (source, R code) | 
| Package source: | boilerpipeR_1.3.2.tar.gz | 
| Windows binaries: | r-devel: boilerpipeR_1.3.2.zip, r-release: boilerpipeR_1.3.2.zip, r-oldrel: boilerpipeR_1.3.2.zip | 
| macOS binaries: | r-release (arm64): boilerpipeR_1.3.2.tgz, r-oldrel (arm64): boilerpipeR_1.3.2.tgz, r-release (x86_64): boilerpipeR_1.3.2.tgz, r-oldrel (x86_64): boilerpipeR_1.3.2.tgz | 
| Old sources: | boilerpipeR archive | 
Please use the canonical form https://CRAN.R-project.org/package=boilerpipeR to link to this page.