| License: | LGPL-2 | LGPL-2.1 | LGPL-3 [expanded from: LGPL] | 
| Title: | A Text Mining Toolkit for Chinese | 
| Type: | Package | 
| LazyLoad: | yes | 
| Author: | Jian Li | 
| Maintainer: | Jian Li <rweibo@sina.com> | 
| Description: | A text mining toolkit for Chinese, which includes facilities for Chinese string processing, Chinese NLP support, and encoding detection and conversion. It also provides functions to support the 'tm' package for Chinese text. | 
| Version: | 0.2-13 | 
| Date: | 2019-08-04 | 
| Depends: | R (≥ 3.0.0), utils | 
| Suggests: | tm | 
| RoxygenNote: | 6.1.1 | 
| NeedsCompilation: | yes | 
| Packaged: | 2019-08-08 03:36:59 UTC; jli | 
| Repository: | CRAN | 
| Date/Publication: | 2019-08-08 04:40:02 UTC | 
GBK character set
Description
GBK character set including some useful information.
Usage
data(GBK)
Format
A data frame with 8 columns.
- GBK: Chinese characters in UTF-8.
- py0: Unique Pinyin of each character.
- py: Pinyin string of each character.
- Radical: The radical of the character ('Bu Shou' in Chinese).
- Stroke_Num_Radical: The number of strokes ('Bi Hua' in Chinese).
- Stroke_Order: The stroke order ('Bi Shun' in Chinese).
- Structure: The structure of the character ('Zi Ti Jie Gou' in Chinese).
- Freq: Frequency of the character in the Sogou news corpus from all sites between June and July 2012.
Author(s)
Jian Li <rweibo@sina.com>
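To illustrate the columns listed above, the following sketch loads the data set and looks up one character. It assumes the tmcn package is installed and does nothing otherwise; the character U+4E2D is an arbitrary choice, and `expected_cols` is a name introduced here for illustration.

```r
# Column names as documented in the Format list above.
expected_cols <- c("GBK", "py0", "py", "Radical", "Stroke_Num_Radical",
                   "Stroke_Order", "Structure", "Freq")

# Guarded: the body only runs when tmcn is installed.
if (requireNamespace("tmcn", quietly = TRUE)) {
  data(GBK, package = "tmcn")
  # Check that the documented columns are present.
  print(all(expected_cols %in% names(GBK)))
  # Look up the pinyin, radical and frequency of U+4E2D.
  print(GBK[GBK$GBK == "\u4e2d", c("GBK", "py", "Radical", "Freq")])
}
```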
National Taiwan University Semantic Dictionary
Description
National Taiwan University Semantic Dictionary.
Usage
data(NTUSD)
Format
A list with 4 components.
- positive_chs: Positive words in simplified Chinese.
- negative_chs: Negative words in simplified Chinese.
- positive_cht: Positive words in traditional Chinese.
- negative_cht: Negative words in traditional Chinese.
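One common use of such word lists is a simple lexicon-based sentiment count. The sketch below is only an illustration: `score_tokens` is a hypothetical helper, not part of tmcn, and the text is assumed to have been segmented into words by some other tool beforehand.

```r
# Hypothetical helper (not part of tmcn): number of positive tokens
# minus number of negative tokens, given two word lists.
score_tokens <- function(tokens, positive, negative) {
  sum(tokens %in% positive) - sum(tokens %in% negative)
}

# Toy lexicons, for illustration only.
toy_score <- score_tokens(c("good", "bad", "good"),
                          positive = "good", negative = "bad")
print(toy_score)

# With the real dictionary, assuming tmcn is installed.
if (requireNamespace("tmcn", quietly = TRUE)) {
  data(NTUSD, package = "tmcn")
  tokens <- c("\u5f00\u5fc3", "\u96be\u8fc7")  # hypothetical segmented tokens
  print(score_tokens(tokens, NTUSD$positive_chs, NTUSD$negative_chs))
}
```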
Dictionary of simplified and traditional Chinese
Description
Dictionary of simplified and traditional Chinese.
Usage
data(SIMTRA)
Format
A data frame with 2 columns.
- Sim: A simplified Chinese string.
- Tra: A traditional Chinese string.
Sport news.
Description
Sport news.
Usage
data(SPORT)
Format
A data frame with 6 columns.
- id: ID of the news.
- time: Time of the news.
- title: Title of the news.
- class: Class of the news: 'B' means basketball, 'F' means football.
- abstract: Abstract of the news.
- content: Content of the news.
Dictionary of Chinese stop words
Description
Dictionary of Chinese stop words.
Usage
data(STOPWORDS)
Format
A data frame with 1 column.
- word: A character vector of the stop words.
Print the UTF-8 codes of a string.
Description
Print the UTF-8 codes of a string.
Usage
catUTF8(string, file = "")
Arguments
| string | A character vector. | 
| file | A  | 
Value
No results.
Author(s)
Jian Li <rweibo@sina.com>
Examples
catUTF8("hello")
Create a Chinese term-document matrix or a document-term matrix.
Description
Create a Chinese term-document matrix or a document-term matrix.
Usage
createDTM(string, language = c("zh", "en"), tokenize = NULL, removePunctuation = TRUE, 
  removeNumbers = TRUE, removeStopwords = TRUE)
createTDM(string, language = c("zh", "en"), tokenize = NULL, removePunctuation = TRUE, 
  removeNumbers = TRUE, removeStopwords = TRUE)
Arguments
| string | A character vector. | 
| language | The language type: 'zh' means Chinese, 'en' means English. | 
| tokenize | A tokenizer function. | 
| removePunctuation | Whether to remove punctuation. | 
| removeNumbers | Whether to remove numbers. | 
| removeStopwords | Whether to remove the stop words. | 
Details
Package "tm" is required.
Value
An object of class TermDocumentMatrix or class DocumentTermMatrix.
Author(s)
Jian Li <rweibo@sina.com>
Create a word frequency data.frame.
Description
Create a word frequency data.frame.
Usage
createWordFreq(obj, onlyCN = TRUE, nosymbol = TRUE, stopwords = NULL,
  useStopDic = FALSE)
Arguments
| obj | A character vector or  | 
| onlyCN | Whether to keep only Chinese words. | 
| nosymbol | Whether to keep symbols. | 
| stopwords | A character vector of stop words. | 
| useStopDic | Whether to use the default stop words. | 
Value
A data.frame.
Author(s)
Jian Li <rweibo@sina.com>
Examples
createWordFreq(c("a", "a", "b", "c"), onlyCN = FALSE, nosymbol = TRUE, useStopDic = FALSE)
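For comparison, the same kind of word count can be sketched in base R with table(). This is an analogue for illustration, not the package's implementation; the column names and ordering returned by createWordFreq may differ.

```r
words <- c("a", "a", "b", "c")

# Count each distinct word and turn the table into a data.frame.
freq <- as.data.frame(table(words), stringsAsFactors = FALSE)

# Most frequent words first.
freq <- freq[order(-freq$Freq), ]
print(freq)
```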
Get the current encoding of the locale.
Description
Get the current encoding of the locale.
Usage
getCharset()
Value
A character string giving the encoding.
Author(s)
Jian Li <rweibo@sina.com>
Examples
getCharset()
Indicate whether the encoding of the input string is BIG5.
Description
Indicate whether the encoding of the input string is BIG5.
Usage
isBIG5(string, combine = FALSE)
Arguments
| string | A character vector. | 
| combine | Whether to combine all the strings. | 
Value
Logical value.
Author(s)
Jian Li <rweibo@sina.com>
Examples
isBIG5("hello")
Indicate whether the encoding of the input string is GB18030.
Description
Indicate whether the encoding of the input string is GB18030.
Usage
isGB18030(string, combine = FALSE)
Arguments
| string | A character vector. | 
| combine | Whether to combine all the strings. | 
Value
Logical value.
Author(s)
Jian Li <rweibo@sina.com>
Examples
isGB18030("hello")
Indicate whether the encoding of the input string is GB2312.
Description
Indicate whether the encoding of the input string is GB2312.
Usage
isGB2312(string, combine = FALSE)
Arguments
| string | A character vector. | 
| combine | Whether to combine all the strings. | 
Value
Logical value.
Author(s)
Jian Li <rweibo@sina.com>
Examples
isGB2312("hello")
Indicate whether the encoding of the input string is GBK.
Description
Indicate whether the encoding of the input string is GBK.
Usage
isGBK(string, combine = FALSE)
Arguments
| string | A character vector. | 
| combine | Whether to combine all the strings. | 
Value
Logical value.
Author(s)
Jian Li <rweibo@sina.com>
Examples
isGBK("hello")
Indicate whether the encoding of the input string is UTF-8.
Description
Indicate whether the encoding of the input string is UTF-8.
Usage
isUTF8(string, combine = FALSE)
Arguments
| string | A character vector. | 
| combine | Whether to combine all the strings. | 
Value
Logical value.
Author(s)
Jian Li <rweibo@sina.com>
Examples
isUTF8("hello")
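The ASCII string "hello" is valid in every encoding listed here, so it does not show much. The sketch below builds the same two-character Chinese text as UTF-8 and as raw GBK bytes, checks both with base R's validUTF8(), and then, assuming tmcn is installed, passes them to the detection functions. No result is claimed for the tmcn calls.

```r
utf8_text <- "\u4e2d\u6587"                                # two characters, UTF-8
gbk_text  <- rawToChar(as.raw(c(0xd6, 0xd0, 0xce, 0xc4)))  # same text as GBK bytes

# Base R check: only the first string is a valid UTF-8 byte sequence.
utf8_ok <- validUTF8(c(utf8_text, gbk_text))
print(utf8_ok)

# The UTF-8 form uses 3 bytes per character, the GBK form 2.
byte_len <- nchar(c(utf8_text, gbk_text), type = "bytes")
print(byte_len)

# Package detection functions, only if tmcn is available.
if (requireNamespace("tmcn", quietly = TRUE)) {
  print(tmcn::isUTF8(utf8_text))
  print(tmcn::isGBK(gbk_text))
}
```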
Extract the left or right substrings in a character vector.
Description
Extract the left or right substrings in a character vector.
Usage
left(string, n)
right(string, n)
Arguments
| string | A character vector. | 
| n | How many characters. | 
Value
A character vector.
Author(s)
Jian Li <rweibo@sina.com>
Examples
left("hello", 3)
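A base R analogue of taking the first or last n characters, for comparison. It is a sketch of the idea, not the package's implementation; the tmcn calls run only if the package is installed.

```r
x <- c("hello", "world!")
n <- 3

# First n characters of each element.
left_n <- substr(x, 1, n)

# Last n characters of each element.
right_n <- substring(x, nchar(x) - n + 1)

print(left_n)
print(right_n)

if (requireNamespace("tmcn", quietly = TRUE)) {
  print(tmcn::left(x, n))
  print(tmcn::right(x, n))
}
```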
Convert UTF-8 code strings back to Chinese characters.
Description
Convert UTF-8 code strings back to Chinese characters.
Usage
revUTF8(string, utype = "R")
Arguments
| string | A character vector. | 
| utype | UTF-8 string type, the default is R type, such as "<U+XXXX>". | 
Value
A character vector.
Author(s)
Jian Li <rweibo@sina.com>
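No example is given for this function. The sketch below shows, in base R, what decoding the R-style "<U+XXXX>" form involves; it is an illustration under that assumption about the input format, not the package's implementation. The revUTF8 call runs only if tmcn is installed, and no output is claimed for it.

```r
escaped <- "<U+4E2D><U+6587>"

# Pull out the four-digit hex codes between "<U+" and ">".
hex <- regmatches(
  escaped,
  gregexpr("(?<=<U\\+)[0-9A-Fa-f]{4}(?=>)", escaped, perl = TRUE)
)[[1]]

# Turn the code points back into a single UTF-8 string.
decoded <- intToUtf8(strtoi(hex, base = 16L))
print(decoded)

if (requireNamespace("tmcn", quietly = TRUE)) {
  print(tmcn::revUTF8(escaped))
}
```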
Set locale to Simplified Chinese/Traditional Chinese/UK.
Description
Set locale to Simplified Chinese/Traditional Chinese/UK.
Usage
setchs(rev = FALSE)
setcht(rev = FALSE)
setuk(rev = FALSE)
Arguments
| rev | Whether to set the locale back. | 
Value
No results.
Author(s)
Jian Li <rweibo@sina.com>
Examples
setchs()
setchs(rev = TRUE)
Return Chinese stop words.
Description
Return Chinese stop words.
Usage
stopwordsCN(stopwords = NULL, useStopDic = TRUE)
Arguments
| stopwords | A character vector of stop words. | 
| useStopDic | Whether to use the default stop words. | 
Value
A vector of stop words.
Author(s)
Jian Li <rweibo@sina.com>
Examples
stopwordsCN("yes", useStopDic = FALSE)
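A typical use of a stop word vector is to filter a list of tokens. In the sketch below the tokens and the custom stop word are hypothetical, and the text is assumed to be segmented already; the filtering itself is plain base R.

```r
tokens <- c("\u6211", "\u7684", "\u4e66")  # hypothetical pre-segmented tokens
custom <- "\u7684"                           # one user-supplied stop word

# Keep only the tokens that are not in the stop word vector.
kept <- tokens[!tokens %in% custom]
print(kept)

# Combining the custom words with the package dictionary,
# assuming tmcn is installed.
if (requireNamespace("tmcn", quietly = TRUE)) {
  all_stop <- tmcn::stopwordsCN(custom, useStopDic = TRUE)
  print(length(all_stop))
  print(tokens[!tokens %in% all_stop])
}
```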
Mixed case capitalizing.
Description
Capitalize the first letter of every word.
Usage
strcap(string, strict = FALSE)
Arguments
| string | A character vector. | 
| strict | Whether strict. | 
Value
A character vector with the first letter of each word capitalized.
Author(s)
Jian Li <rweibo@sina.com>
Examples
strcap("the quick red fox jumps over the lazy brown dog")
Extract matched substrings by regular expression.
Description
Extract matched substrings by regular expression.
Usage
strextract(string, pattern, invert = FALSE, ignore.case = FALSE,
  perl = FALSE, useBytes = FALSE)
Arguments
| string | A character vector. | 
| pattern | A character string containing a regular expression to be matched in the given character vector. | 
| invert | A logical value: if TRUE, extract the non-matched substrings. | 
| ignore.case | If FALSE, the pattern matching is case sensitive and if TRUE, case is ignored during matching. | 
| perl | A logical value. Should perl-compatible regexps be used? | 
| useBytes | A logical value. If TRUE the matching is done byte-by-byte rather than character-by-character. | 
Value
A character vector with the matched or non-matched substrings.
Author(s)
Jian Li <rweibo@sina.com>
Examples
txt1 <- c("\t(x1)a(aa2)a ", " bb(bb)")
strextract(txt1, "\\([^)]*\\)")
txt2 <- c("  Ben Franklin and Jefferson Davis", "\tMillard Fillmore")
strextract(txt2, "(?<first>[[:upper:]][[:lower:]]+)", perl = TRUE)
Pad a string to a specified length with a padding character.
Description
Pad a string to a specified length with a padding character.
Usage
strpad(string, width = 0, side = c("left", "right", "both"),
  pad = " ")
Arguments
| string | A character vector. | 
| width | The number of characters of the string after padding. | 
| side | Which side to pad. | 
| pad | The padding character. | 
Value
A character vector after padding.
Author(s)
Jian Li <rweibo@sina.com>
Examples
strpad(1:5, width = 4, pad = "0")
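For comparison, fixed-width padding can be sketched in base R with formatC(). This is an analogue, not the package's implementation, and formatC() only pads with zeros or spaces. The strpad call with side = "right" runs only if tmcn is installed, and no output is claimed for it.

```r
# Left-pad numbers with zeros to width 4.
zero_padded <- formatC(1:5, width = 4, flag = "0")
print(zero_padded)

# Right-pad a string with spaces to width 5 ("-" left-justifies).
right_padded <- formatC("ab", width = 5, flag = "-")
print(right_padded)

if (requireNamespace("tmcn", quietly = TRUE)) {
  print(tmcn::strpad("ab", width = 5, side = "right", pad = "*"))
}
```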
Trim whitespace from a string.
Description
Trim whitespace from a string.
Usage
strstrip(string, side = c("both", "left", "right"))
Arguments
| string | A character vector. | 
| side | Which side of the string to trim: 'both', 'left' or 'right'. | 
Value
The trimmed character vector.
Author(s)
Jian Li <rweibo@sina.com>
Examples
strstrip(c("\taaaa ", " bbbb    "))
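The effect of the side argument can be illustrated with base R's trimws(), which takes a similar `which` argument. This is an analogue for comparison, not the package's implementation; the strstrip call runs only if tmcn is installed.

```r
x <- c("\taaaa ", " bbbb    ")

# Trim both sides, then only the left side.
both_trimmed <- trimws(x, which = "both")
left_trimmed <- trimws(x, which = "left")

print(both_trimmed)
print(left_trimmed)

if (requireNamespace("tmcn", quietly = TRUE)) {
  print(tmcn::strstrip(x, side = "left"))
}
```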
Convert Chinese text to pinyin format.
Description
Convert Chinese text to pinyin format.
Usage
toPinyin(string, capitalize = FALSE)
Arguments
| string | A character vector. | 
| capitalize | Whether to capitalize the first letter of each word. | 
Value
A character vector in pinyin format.
Author(s)
Jian Li <rweibo@sina.com>
Examples
toPinyin("the quick red fox jumps over the lazy brown dog")
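The example above passes English text, which contains nothing to transliterate. A sketch with Chinese input follows; it assumes tmcn is installed and does nothing otherwise, and no output is claimed. The input is written with Unicode escapes so that the snippet does not depend on the encoding of the source file.

```r
cn_text <- "\u4e2d\u6587"  # two Chinese characters

if (requireNamespace("tmcn", quietly = TRUE)) {
  # Plain pinyin, then with the first letter of each word capitalized.
  print(tmcn::toPinyin(cn_text))
  print(tmcn::toPinyin(cn_text, capitalize = TRUE))
}
```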
Convert a Chinese text from simplified to traditional characters and vice versa.
Description
Convert a Chinese text from simplified to traditional characters and vice versa.
Usage
toTrad(string, rev = FALSE)
Arguments
| string | A Chinese string vector. | 
| rev | Reverse. TRUE means traditional to simplified. Default is FALSE. | 
Value
The converted character vector.
Author(s)
Jian Li <rweibo@sina.com>
Examples
toTrad("hello")
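A sketch with Chinese input, converting in both directions. It assumes tmcn is installed and does nothing otherwise; no output is claimed, and the sample word is an arbitrary choice.

```r
simplified <- "\u4e2d\u56fd"  # a two-character word in simplified Chinese

if (requireNamespace("tmcn", quietly = TRUE)) {
  # Simplified to traditional (the default direction).
  traditional <- tmcn::toTrad(simplified)
  print(traditional)
  # Traditional back to simplified.
  print(tmcn::toTrad(traditional, rev = TRUE))
}
```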
Convert encoding of Chinese string to UTF-8.
Description
Convert encoding of Chinese string to UTF-8.
Usage
toUTF8(cnstring)
Arguments
| cnstring | A Chinese string vector. | 
Value
The converted character vector.
Author(s)
Jian Li <rweibo@sina.com>
Examples
toUTF8("hello")