Title: | Chinese Text Segmentation |
---|---|
Description: | Chinese text segmentation, keyword extraction and speech tagging for R. |
Authors: | Qin Wenfeng, Wu Yanyi |
Maintainer: | Qin Wenfeng <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.11 |
Built: | 2024-10-25 04:32:51 UTC |
Source: | https://github.com/qinwf/jiebar |
Keywords symbol to find keywords.
## S3 method for class 'keywords'
jiebar <= code
## S3 method for class 'keywords'
jiebar[code]
jiebar |
jiebaR Worker. |
code |
A Chinese sentence or the path of a text file. |
Qin Wenfeng <http://qinwenfeng.com>
## Not run: 
words = "hello world"
test1 = worker("keywords", topn = 1)
test1 <= words
## End(Not run)
Deprecated.
## S3 method for class 'qseg'
qseg <= code
## S3 method for class 'qseg'
qseg[code]
qseg
qseg |
a qseg object |
code |
a string |
qseg an environment
Quick mode is deprecated and scheduled to be removed in v0.11.0. If you want to keep this feature, please submit an issue on the GitHub page to let me know.
Quick mode symbol to do segmentation, keyword extraction and speech tagging. This symbol will initialize a quick_worker when it is first called, and will do segmentation or other types of work immediately.
You can reset the default model settings using $, and they will change the default the next time you use quick mode. If you only want to change a parameter temporarily, you can reset the settings of quick_worker$. get_qsegmodel, set_qsegmodel, and reset_qsegmodel are also available for managing quick mode settings.
Qin Wenfeng <http://qinwenfeng.com>
## Not run: 
qseg <= "This is test"
qseg <= "This is the second test"
## End(Not run)

## Not run: 
qseg <= "This is test"
qseg$detect = T
qseg
get_qsegmodel()
## End(Not run)
Text segmentation symbol to cut words.
## S3 method for class 'segment'
jiebar <= code
## S3 method for class 'segment'
jiebar[code]
jiebar |
jiebaR Worker. |
code |
A Chinese sentence or the path of a text file. |
Qin Wenfeng <http://qinwenfeng.com>
## Not run: 
words = "hello world"
test1 = worker()
test1 <= words
## End(Not run)
Simhash symbol to compute simhash.
## S3 method for class 'simhash'
jiebar <= code
## S3 method for class 'simhash'
jiebar[code]
jiebar |
jiebaR Worker. |
code |
A Chinese sentence or the path of a text file. |
Qin Wenfeng <http://qinwenfeng.com>
## Not run: 
words = "hello world"
test1 = worker("simhash", topn = 1)
test1 <= words
## End(Not run)
Tagger symbol to tag words.
## S3 method for class 'tagger'
jiebar <= code
## S3 method for class 'tagger'
jiebar[code]
jiebar |
jiebaR Worker. |
code |
A Chinese sentence or the path of a text file. |
Qin Wenfeng <http://qinwenfeng.com>
## Not run: 
words = "hello world"
test1 = worker("tag")
test1 <= words
## End(Not run)
Apply list input to a worker
apply_list(input, worker)
input |
a list of characters |
worker |
a worker |
cutter = worker()
apply_list(list("this is test", "that is not test"), cutter)
apply_list(list("this is test", list("that is not test", "ab c")), cutter)
The default dictionary paths, used by segmentation and other functions.
DICTPATH HMMPATH USERPATH IDFPATH STOPPATH
character
This function uses the Simhash worker to extract keywords from two inputs and then computes the Hamming distance between them.
distance(codel, coder, jiebar)
vector_distance(codel, coder, jiebar)
codel |
For |
coder |
For |
jiebar |
jiebaR worker |
Qin Wenfeng
http://en.wikipedia.org/wiki/Hamming_distance
## Not run: 
words = "hello world"
simhasher = worker("simhash", topn = 1)
simhasher <= words
distance("hello world", "hello world!", simhasher)
vector_distance(c("hello", "world"), c("hello", "world", "!"), simhasher)
## End(Not run)
Edit the default user dictionary.
edit_dict(name = "user")
name |
the name of dictionary including |
There are three columns in the system dictionary, separated by spaces. The first column is the word, the second column is the frequency of the word, and the third column is the speech tag, using labels compatible with ictclas.
There are two columns in the user dictionary. The first column is the word, and the second column is the speech tag, using labels compatible with ictclas. Frequencies of words in the user dictionary are set by user_weight in the worker function. If you want to provide the frequency of a new word, you can put it in the system dictionary.
The stop words dictionary has only one column, which contains the stop words.
The ictclas speech tag : http://t.cn/RAEj7e1
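An illustrative sketch of the three file formats described above; these entries are hypothetical examples, not lines from the shipped dictionaries, and the # lines are annotations only, not part of the files:

```
# system dictionary: word, frequency, ictclas speech tag
云计算 4845 n
测试 20 v

# user dictionary: word, ictclas speech tag
云计算 n

# stop words dictionary: one word per line
的
```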
This function detects the encoding of input files. You can also check encoding with the checkenc package, which is on GitHub.
file_coding(file)
filecoding(file)
file |
A file path. |
The function chooses the most likely encoding; the detection is more reliable for large input text files.
The encoding of file
Wu Yongwei, Qin Wenfeng
https://github.com/adah1972/tellenc
https://github.com/qinwf/checkenc
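A minimal sketch of checking a file's encoding (assuming the jiebaR package is installed; the temporary file here is a made-up example):

```r
library(jiebaR)

# write a small sample file, then detect its encoding
tmp = tempfile(fileext = ".txt")
writeLines("file encoding test", tmp)
file_coding(tmp)  # returns the detected encoding as a character string
```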
This function helps remove some words in the segmentation result.
filter_segment(input, filter_words, unit = 50)
input |
a string vector |
filter_words |
a string vector of words to be removed. |
unit |
the number of words per regular expression; the default is 50. A long list of words forms a big regular expression, which may or may not be accepted: the POSIX standard only requires support for patterns up to 256 bytes. So unit is used to split the words into chunks. |
filter_segment(c("abc", "def", " ", "."), c("abc"))
This function returns the frequency of words.
freq(x)
x |
a vector of words |
The frequency of words
Qin Wenfeng
freq(c("a", "a", "c"))
Generate IDF dict from a list of documents.
get_idf(x, stop_word = STOPPATH, path = NULL)
x |
a list of character |
stop_word |
stopword path |
path |
output path |
The input list contains multiple character vectors of words; each vector represents a document.
Stop words will be removed from the result.
If path is not NULL, it will write the result to the path.
a data.frame or a file
https://en.wikipedia.org/wiki/Tf-idf#Inverse_document_frequency_2
get_idf(list(c("abc", "def"), c("abc", " ")))
Deprecated.
get_qsegmodel()
set_qsegmodel(qsegmodel)
reset_qsegmodel()
qsegmodel |
a list which has the same structure as the return value of get_qsegmodel |
These functions get and modify the quick mode model. get_qsegmodel returns the default model parameters. set_qsegmodel modifies the quick mode model using a list with the same structure as the return value of get_qsegmodel. reset_qsegmodel resets the model to the original jiebaR default.
Qin Wenfeng <http://qinwenfeng.com>
## Not run: 
qseg <= "This is test"
qseg <= "This is the second test"
## End(Not run)

## Not run: 
qseg <= "This is test"
qseg$detect = T
qseg
get_qsegmodel()
model = get_qsegmodel()
model$detect = F
set_qsegmodel(model)
reset_qsegmodel()
## End(Not run)
Get tuples from the segmentation result.
get_tuple(x, size = 2, dataframe = T)
x |
a character vector or list |
size |
an integer >= 2 |
dataframe |
whether to return a data.frame |
get_tuple(c("sd", "sd", "sd", "rd"), 2)
This is a package for Chinese text segmentation, keyword extraction and speech tagging with Rcpp and cppjieba.
You can use custom dictionaries. jiebaR can also identify new words, but adding new words to the dictionary ensures higher accuracy.
Qin Wenfeng <http://qinwenfeng.com>
CppJieba https://github.com/aszxqw/cppjieba;
JiebaR https://github.com/qinwf/jiebaR;
### Note: Can not display Chinese characters here.
## Not run: 
words = "hello world"
engine1 = worker()
segment(words, engine1)
# "./temp.txt" is a file path
segment("./temp.txt", engine1)
engine2 = worker("hmm")
segment("./temp.txt", engine2)
engine2$write = T
segment("./temp.txt", engine2)
engine3 = worker(type = "mix", dict = "dict_path", symbol = T)
segment("./temp.txt", engine3)
## End(Not run)

## Not run: 
### Keyword Extraction
engine = worker("keywords", topn = 1)
keywords(words, engine)
### Speech Tagging
tagger = worker("tag")
tagging(words, tagger)
### Simhash
simhasher = worker("simhash", topn = 1)
simhash(words, simhasher)
distance("hello world", "hello world!", simhasher)
show_dictpath()
## End(Not run)
Keyword Extraction worker uses the MixSegment model to cut words and the TF-IDF algorithm to find the keywords. dict, hmm, idf, stop_word and topn should be provided when initializing the jiebaR worker.
keywords(code, jiebar)
vector_keywords(code, jiebar)
code |
For |
jiebar |
jiebaR Worker. |
There is a symbol <= for this function.
A vector of keywords with weights.
Qin Wenfeng
http://en.wikipedia.org/wiki/Tf-idf
## Not run: 
### Keyword Extraction
keys = worker("keywords", topn = 1)
keys <= "words of fun"
## End(Not run)
Add user word
new_user_word(worker, words, tags = rep("n", length(words)))
worker |
a jieba worker |
words |
the new words |
tags |
the tags of the new words; default "n" |
cc = worker()
new_user_word(cc, "test")
new_user_word(cc, "do", "v")
These functions print the worker settings.
## S3 method for class 'inv'
print(x, ...)
## S3 method for class 'jieba'
print(x, ...)
## S3 method for class 'simhash'
print(x, ...)
## S3 method for class 'keywords'
print(x, ...)
## S3 method for class 'qseg'
print(x, ...)
x |
The jiebaR Worker. |
... |
Other arguments. |
Qin Wenfeng
The function uses initialized engines for word segmentation. You can initialize multiple engines simultaneously using worker(). Public settings of a worker can be read and modified using $, such as WorkerName$symbol = T. Some private settings are fixed when the engine is initialized, and you can read them via WorkerName$PrivateVarible.
segment(code, jiebar, mod = NULL)
code |
A Chinese sentence or the path of a text file. |
jiebar |
jiebaR Worker. |
mod |
change the default result type; the value can be "mix", "hmm", "query", "full" or "mp" |
There are four kinds of models:
Maximum probability segmentation model uses a Trie tree to construct a directed acyclic graph and uses a dynamic programming algorithm. It is the core segmentation algorithm. dict and user should be provided when initializing the jiebaR worker.
Hidden Markov Model uses the HMM model to determine the status set and observed set of words. The default HMM model is based on the People's Daily language library. hmm should be provided when initializing the jiebaR worker.
MixSegment model uses both the Maximum probability segmentation model and the Hidden Markov Model to construct segmentation. dict, hmm and user should be provided when initializing the jiebaR worker.
QuerySegment model uses MixSegment to construct segmentation and then enumerates all the possible long words in the dictionary. dict, hmm and qmax should be provided when initializing the jiebaR worker.
There is a symbol <= for this function.
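The mod argument of segment() can be sketched as follows (a hedged example, assuming jiebaR is installed; English input is used only because Chinese characters cannot be displayed here):

```r
library(jiebaR)

cutter = worker()                               # default type is "mix"
segment("this is a test", cutter)               # use the worker's own model
segment("this is a test", cutter, mod = "hmm")  # switch to HMM for this call only
segment("this is a test", cutter, mod = "mp")   # maximum probability model
```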
Show the default dictionaries' path. HMMPATH, DICTPATH, IDFPATH, STOPPATH and USERPATH can be changed in the default environment.
show_dictpath()
Qin Wenfeng
Simhash worker uses the keyword extraction worker to find the keywords and uses the simhash algorithm to compute the simhash. dict, hmm, idf and stop_word should be provided when initializing the jiebaR worker.
simhash(code, jiebar)
vector_simhash(code, jiebar)
code |
For |
jiebar |
jiebaR Worker. |
There is a symbol <= for this function.
Qin Wenfeng
MS Charikar - Similarity Estimation Techniques from Rounding Algorithms
## Not run: 
### Simhash
words = "hello world"
simhasher = worker("simhash", topn = 1)
simhasher <= words
distance("hello world", "hello world!", simhasher)
## End(Not run)
Compute the Hamming distance of simhash values.
simhash_dist(x, y)
simhash_dist_mat(x, y)
x |
a character vector of simhash value |
y |
a character vector of simhash value |
a character vector
simhash_dist("1", "1")
simhash_dist("1", "2")
tobin("1")
tobin("2")
simhash_dist_mat(c("1", "12", "123"), c("2", "1"))
The function uses the Speech Tagging worker to cut words and tags each word after segmentation, using labels compatible with ictclas. dict, hmm and user should be provided when initializing the jiebaR worker.
tagging(code, jiebar)
code |
a Chinese sentence or the path of a text file |
jiebar |
jiebaR Worker |
There is a symbol <= for this function.
Qin Wenfeng
The ictclas speech tag : http://t.cn/RAEj7e1
## Not run: 
words = "hello world"
### Speech Tagging
tagger = worker("tag")
tagger <= words
## End(Not run)
Convert a simhash value to binary.
tobin(x)
x |
simhash value |
Tag a character vector.
vector_tag(string, jiebar)
string |
a character vector of segmented words. |
jiebar |
jiebaR Worker. |
## Not run: 
cc = worker()
(res = cc["this is test"])
vector_tag(res, cc)
## End(Not run)
This function can initialize jiebaR workers. You can initialize different kinds of workers, including mix, mp, hmm, query, full, tag, simhash, and keywords. See Details for more information.
worker(type = "mix", dict = DICTPATH, hmm = HMMPATH, user = USERPATH,
  idf = IDFPATH, stop_word = STOPPATH, write = T, qmax = 20, topn = 5,
  encoding = "UTF-8", detect = T, symbol = F, lines = 1e+05,
  output = NULL, bylines = F, user_weight = "max")
type |
The type of jiebaR workers including |
dict |
A path to main dictionary, default value is |
hmm |
A path to Hidden Markov Model, default value is |
user |
A path to user dictionary, default value is |
idf |
A path to inverse document frequency, default value is |
stop_word |
A path to stop word dictionary, default value is |
write |
Whether to write the output to a file, or return the result as an object. This value is only used when the input is a file path. The default value is TRUE. The value is used for segmentation and speech tagging workers. |
qmax |
Max query length of words, and the value
is used for |
topn |
The number of keywords, and the value is used for
|
encoding |
The encoding of the input file. If encoding
detection is enable, the value of |
detect |
Whether to detect the encoding of input file
using |
symbol |
Whether to keep symbols in the sentence. |
lines |
The maximal number of lines to read at one time when input is a file. The value is used for segmentation and speech tagging workers. |
output |
A path to the output file; by default the worker will generate a file name from the system time stamp. The value is used for segmentation and speech tagging workers. |
bylines |
whether to return the result by the lines of the input file |
user_weight |
the weight of user dictionary words: "min", "max" or "median" |
The package uses initialized engines for word segmentation, and you can initialize multiple engines simultaneously. You can also reset the public model settings using $, such as WorkerName$symbol = T. Some private settings are fixed when an engine is initialized, and you can read them via WorkerName$PrivateVarible.
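A short sketch of reading and changing public settings through $ (assuming jiebaR is installed; which settings are public or private follows the description above):

```r
library(jiebaR)

cc = worker()
cc$symbol      # read a public setting
cc$symbol = T  # modify it; takes effect on the next call
cc$dict        # a private setting: the dictionary path, fixed at initialization
```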
Maximum probability segmentation model uses a Trie tree to construct a directed acyclic graph and uses a dynamic programming algorithm. It is the core segmentation algorithm. dict and user should be provided when initializing the jiebaR worker.
Hidden Markov Model uses the HMM model to determine the status set and observed set of words. The default HMM model is based on the People's Daily language library. hmm should be provided when initializing the jiebaR worker.
MixSegment model uses both the Maximum probability segmentation model and the Hidden Markov Model to construct segmentation. dict, hmm and user should be provided when initializing the jiebaR worker.
QuerySegment model uses MixSegment to construct segmentation and then enumerates all the possible long words in the dictionary. dict, hmm and qmax should be provided when initializing the jiebaR worker.
FullSegment model enumerates all the possible words in the dictionary.
Speech Tagging worker uses the MixSegment model to cut words and tags each word after segmentation, using labels compatible with ictclas. dict, hmm and user should be provided when initializing the jiebaR worker.
Keyword Extraction worker uses the MixSegment model to cut words and the TF-IDF algorithm to find the keywords. dict, hmm, idf, stop_word and topn should be provided when initializing the jiebaR worker.
Simhash worker uses the keyword extraction worker to find the keywords and uses the simhash algorithm to compute the simhash. dict, hmm, idf and stop_word should be provided when initializing the jiebaR worker.
This function returns an environment containing segmentation settings and the worker. Public settings can be modified using $.
### Note: Can not display Chinese characters here.
## Not run: 
words = "hello world"
engine1 = worker()
segment(words, engine1)
# "./temp.txt" is a file path
segment("./temp.txt", engine1)
engine2 = worker("hmm")
segment("./temp.txt", engine2)
engine2$write = T
segment("./temp.txt", engine2)
engine3 = worker(type = "mix", dict = "dict_path", symbol = T)
segment("./temp.txt", engine3)
## End(Not run)

## Not run: 
### Keyword Extraction
engine = worker("keywords", topn = 1)
keywords(words, engine)
### Speech Tagging
tagger = worker("tag")
tagging(words, tagger)
### Simhash
simhasher = worker("simhash", topn = 1)
simhash(words, simhasher)
distance("hello world", "hello world!", simhasher)
show_dictpath()
## End(Not run)