This is a package for Chinese text segmentation, keyword extraction, and part-of-speech tagging. You can use worker() to initialize a worker, and then use [] or segment() to do the segmentation.
## Loading required package: jiebaRD
## Using default settings to initialize a worker.
cutter = worker()
### Note: Chinese characters cannot be displayed here.
segment( "This is a good day!" , cutter )
## [1] "This" "is" "a" "good" "day"
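As a minimal sketch of the two call styles mentioned above, the snippet below segments the same sentence with segment() and with the [] operator and compares the results. The variable names are illustrative, and the example assumes that jiebaR and its jiebaRD dictionaries are installed.

```r
library(jiebaR)

# Initialize a worker with the default settings.
cutter <- worker()

text <- "This is a good day!"

# Two ways to run the segmentation with the same worker.
by_function <- segment(text, cutter)
by_bracket  <- cutter[text]

print(by_function)
print(identical(by_function, by_bracket))
```

Both forms use the same worker, so the choice between them is mostly a matter of style.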
You can use a file path as input.
## [1] "temp" "dat"
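The output above appears to come from a path that did not exist, so the string itself was segmented as ordinary text. A hedged sketch of the file case follows: it writes a small sample file (hypothetical name and content) to a temporary directory and passes its path to segment(). With the default write setting, the result is expected to go to a new file next to the input, so the return value is printed rather than assumed.

```r
library(jiebaR)

cutter <- worker()

# Write a small sample file to a temporary location (hypothetical content).
input_path <- file.path(tempdir(), "jiebaR_sample.txt")
writeLines("This is a good day!", input_path)

# Because input_path exists on disk, it is treated as a file, not as text.
result <- segment(input_path, cutter)
print(result)
```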
You can initialize multiple engines simultaneously.
cutter2 = worker(type = "mix",
                 dict = "some_path/jieba.dict.utf8",
                 hmm = "some_path/hmm_model.utf8",
                 user = "some_path/test.dict.utf8",
                 detect = TRUE, symbol = FALSE,
                 lines = 1e+05, output = NULL
)
cutter2 ### Print information of worker
Worker Type: Mix Segment
Detect Encoding : TRUE
Default Encoding: UTF-8
Keep Symbols : FALSE
Output Path :
Write File : TRUE
Max Read Lines : 1e+05
Fixed Model Components:
$dict
[1] "dict/jieba.dict.utf8"
$hmm
[1] "dict/hmm_model.utf8"
$user
[1] "dict/test.dict.utf8"
$detect $encoding $symbol $output $write $lines can be reset.
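To illustrate that several engines can coexist in one session, here is a minimal sketch that creates two workers of different types and runs both on the same text. The names mix_cutter and mp_cutter are illustrative, and the default dictionaries are assumed rather than the some_path files shown above.

```r
library(jiebaR)

# Two independent workers with different segmentation types.
mix_cutter <- worker(type = "mix")
mp_cutter  <- worker(type = "mp")

text <- "This is a good day!"

mix_result <- segment(text, mix_cutter)
mp_result  <- segment(text, mp_cutter)

print(mix_result)
print(mp_result)
```

Each worker keeps its own settings, so changing one does not affect the other.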
The public settings of a worker can be read and modified with $, for example cutter$symbol = T. Private settings are fixed when the engine is initialized, and you can read them with cutter$PrivateVariable, where PrivateVariable stands for the name of the setting.
## [1] "UTF-8"
## [1] TRUE
## [1] FALSE
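A short sketch of reading and changing public settings, assuming a worker created with the defaults. It reads two settings, switches symbol on, and segments a sentence again so that the effect can be inspected.

```r
library(jiebaR)

cutter <- worker()

# Read public settings.
print(cutter$encoding)
print(cutter$symbol)

# Keep symbols in the output from now on.
cutter$symbol <- TRUE

with_symbols <- segment("This is a good day!", cutter)
print(with_symbols)
```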
You can use a custom dictionary. jiebaR is able to identify new words, but adding your own new words can ensure higher accuracy. imewlconverter is a good tool for dictionary construction.
## [1] "/github/workspace/pkglib/jiebaRD/dict"
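One possible way to supply your own words is to pass a user dictionary file through the user argument of worker(). The sketch below is illustrative only: the file name, the sample word, and the one-entry-per-line "word tag" layout are assumptions, so check them against the user dictionary shipped in the directory shown above.

```r
library(jiebaR)

# Hypothetical user dictionary: one entry per line, a word followed by a tag.
user_dict <- file.path(tempdir(), "my_user.dict.utf8")
writeLines(enc2utf8("机器学习 n"), user_dict, useBytes = TRUE)

# Build a worker that loads the custom dictionary alongside the default one.
custom_cutter <- worker(user = user_dict)

result <- segment("我喜欢机器学习", custom_cutter)
print(result)
```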
Part-of-speech tagging: the functions [.tagger and tagging() tag each word in a sentence after segmentation, using labels compatible with ICTCLAS.
## eng eng
## "hello" "world"
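A minimal tagging sketch, assuming a worker of type "tag" and an illustrative input string. The result is a character vector of words whose names carry the tags.

```r
library(jiebaR)

# A worker dedicated to part-of-speech tagging.
tagger <- worker("tag")

tagged <- tagging("hello world", tagger)

# Words are the values; tags are the names.
print(tagged)
print(names(tagged))
```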
The keyword extraction worker uses the MixSegment model to cut words and the TF-IDF algorithm to find the keywords.
## 11.7392
## "fun"
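A hedged sketch of keyword extraction, assuming a worker of type "keywords" limited to one keyword through topn. The input string is illustrative; the names of the returned vector are expected to hold the TF-IDF scores.

```r
library(jiebaR)

# Keep only the single highest-scoring keyword.
keys <- worker("keywords", topn = 1)

top <- keywords("words of fun", keys)
print(top)
```

Raising topn returns more keywords, ordered by score.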
The simhash worker extracts keywords from an input and computes its simhash value. Given two inputs, it finds the keywords of each and then computes the Hamming distance between them.
## $simhash
## [1] "3804341492420753273"
##
## $keyword
## 11.7392
## "hello"
## $distance
## [1] 0
##
## $lhs
## 11.7392
## "hello"
##
## $rhs
## 11.7392
## "hello"
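The two result shapes above can be reproduced with a sketch like the following, assuming a worker of type "simhash" and two illustrative input strings. simhash() handles a single input, and distance() compares two.

```r
library(jiebaR)

simhasher <- worker("simhash", topn = 1)

# Simhash value and keywords for a single input.
fingerprint <- simhash("hello world", simhasher)
print(names(fingerprint))

# Hamming distance between the simhash values of two inputs.
comparison <- distance("hello world", "hello world!", simhasher)
print(comparison$distance)
```

A small distance suggests that the two inputs share the same top keywords.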