Paoding (庖丁解牛) Word Segmentation Tool: Usage Tutorial

Posted: 2017-09-29 13:59:56

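Original: http://blog.csdn.net/Pc620/article/details/6280489
Today I wanted to test how well the Paoding (庖丁) segmenter works, so I wrote a small test program that reads text from a file and prints the segmentation result to the console.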
 
Environment: Win7 + Eclipse

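Steps:
1. Edit the paoding-dic-home.properties file inside paoding-analysis.jar: remove the # in front of "#paoding.dic.home=dic", and change the dic after the equals sign to the actual path of the dic folder on your machine, e.g. F://workspace//data//dic
(Note: to edit a file inside paoding-analysis.jar, open the jar with WinRAR, edit paoding-dic-home.properties directly in Notepad or WordPad, and save.)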
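
To double-check the edit from step 1, here is a minimal sketch that parses properties text with the standard java.util.Properties class and reads the paoding.dic.home key. The class name DicHomeCheck and its helper are hypothetical and are not part of Paoding; the sketch only shows how the JDK parses such a line, not how Paoding itself loads its configuration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper, not part of Paoding: shows how java.util.Properties
// reads the paoding.dic.home line before and after removing the '#'.
final class DicHomeCheck {

    // Returns the value of paoding.dic.home, or null if the key is absent
    // (for example because the line is still commented out).
    static String dicHome(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty("paoding.dic.home");
    }

    public static void main(String[] args) {
        // Still commented out: the key is not defined.
        System.out.println(dicHome("#paoding.dic.home=dic\n"));
        // Uncommented, with forward slashes: the path is read as written.
        System.out.println(dicHome("paoding.dic.home=F:/workspace/data/dic\n"));
        // Single backslashes are escape characters in .properties files,
        // so they are dropped and the path is mangled.
        System.out.println(dicHome("paoding.dic.home=F:\\workspace\\data\\dic\n"));
    }
}
```

The last call is the reason to write the path with forward slashes (or doubled backslashes) rather than single backslashes.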
 
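2. Import four jars into the project: paoding-analysis.jar, commons-logging.jar, lucene-analyzers-2.2.0.jar and lucene-core-2.2.0.jar:
① Create a lib folder under the project and copy the four jars into it;
② Right-click the project -> Properties -> Java Build Path, select the third tab on the right, Libraries, click Add JARs…, and add the four jars;
③ Then select the fourth tab, Order and Export, tick the four jars, and click OK.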
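
If you want to confirm that the jars from step 2 are really on the build path before running the test, a rough sketch like the one below can help. ClasspathCheck is a hypothetical helper. The probe class names are assumptions about what each jar contains: PaodingAnalyzer and Analyzer are taken from the imports of the test program, and LogFactory is the usual entry point of commons-logging.

```java
// Hypothetical helper: probes for one well-known class per jar.
final class ClasspathCheck {

    // Returns true if the named class can be located, without initializing it.
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String[] probes = {
            "net.paoding.analysis.analyzer.PaodingAnalyzer", // paoding-analysis.jar
            "org.apache.commons.logging.LogFactory",         // commons-logging.jar
            "org.apache.lucene.analysis.Analyzer",           // lucene-core-2.2.0.jar
        };
        for (String name : probes) {
            System.out.println(name + " -> " + (isOnClasspath(name) ? "found" : "MISSING"));
        }
    }
}
```

Run it from the same Eclipse project, so that it sees the same build path as the test program.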
 
3. Create a main class and write the small test program as follows:
import java.io.*;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import net.paoding.analysis.analyzer.PaodingAnalyzer;
 
public class FenciTest {
 
    public static void main(String[] args)
    {
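       Analyzer analyzer = new PaodingAnalyzer();
       File file = new File("F://Work//workSpace//FenciTest//data//test1.txt");
       String docText = readText(file);
       if (docText == null) {
           // readText has already printed the error; there is nothing to segment
           return;
       }

       // The first argument is a Lucene field name, not the text itself
       TokenStream tokenStream = analyzer.tokenStream("text", new StringReader(docText));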
       try {
           Token t;
           //System.out.println(docText);
           while ((t = tokenStream.next()) != null)
           {
               System.out.println(t);
           }
       } catch (IOException e) {
           e.printStackTrace();
       }
 
    }
   
    private static String readText(File file) {
       String text = null;
       try
       {
           InputStreamReader read1 = new InputStreamReader(new FileInputStream(file), "GBK");
           BufferedReader br1 = new BufferedReader(read1);   
           StringBuffer buff1 = new StringBuffer();    
           while((text = br1.readLine()) != null)
           {   
              // "\r\n" is the line break; the original "/r/n" appended four literal characters
              buff1.append(text).append("\r\n");
           }   
           br1.close();        
           text = buff1.toString();
       } 
       catch(FileNotFoundException e) 
       {  
           System.out.println(e); 
       } 
       catch(IOException e) 
       {  
           System.out.println(e); 
       }
       return text;
    }
}
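
As a side note, the readText helper above can be written more compactly on Java 7 or later. The following is only a sketch under that assumption (the Eclipse setup in this post may use an older JDK), and ReadTextSketch is a hypothetical name. Unlike the helper above, it lets the IOException propagate instead of returning null, and it closes the reader even when reading fails.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical variant of readText: reads a whole file with an explicit charset.
final class ReadTextSketch {

    // Reads all lines and joins them with "\r\n", as the test program does.
    // Note: Files.newBufferedReader throws on bytes that are invalid in the
    // given charset, whereas InputStreamReader silently replaces them.
    static String readText(Path file, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = Files.newBufferedReader(file, charset)) {
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line).append("\r\n");
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Write a small GBK-encoded sample file, then read it back.
        Charset gbk = Charset.forName("GBK");
        Path tmp = Files.createTempFile("fenci", ".txt");
        Files.write(tmp, "\u4e2d\u6587\n\u5206\u8bcd".getBytes(gbk));
        System.out.print(readText(tmp, gbk));
        Files.delete(tmp);
    }
}
```

Passing the charset as a parameter makes it easy to switch from GBK to UTF-8 if the input file is saved in a different encoding.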
 

Note: this test program works with Lucene 2.2 but not with Lucene 3.0, because 3.0 removed the tokenStream.next() method. For details see: http://www.cnblogs.com/LeftNotEasy/archive/2010/01/14/1647778.html
 
 
 
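4. Run the program. It prints messages like the following (信息 is the INFO log level as printed under a Chinese locale):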
2011-3-26 20:05:29 net.paoding.analysis.knife.PaodingMaker getProperties
信息: config paoding analysis from: F:/Work/workspace/FenciTest/file:/F:/Work/workspace/FenciTest/lib/paoding-analysis.jar!/paoding-analysis.properties;F:/Work/workspace/FenciTest/file:/F:/Work/workspace/FenciTest/lib/paoding-analysis.jar!/paoding-analysis-default.properties;F:/Work/workspace/FenciTest/file:/F:/Work/workspace/FenciTest/lib/paoding-analysis.jar!/paoding-analyzer.properties;F:/Work/workspace/FenciTest/file:/F:/Work/workspace/FenciTest/lib/paoding-analysis.jar!/paoding-dic-home.properties;F:/Work/workspace/FenciTest/data/dic/paoding-dic-names.properties;F:/Work/workspace/FenciTest/file:/F:/Work/workspace/FenciTest/lib/paoding-analysis.jar!/paoding-knives.properties;F:/Work/workspace/FenciTest/file:/F:/Work/workspace/FenciTest/lib/paoding-analysis.jar!/paoding-knives-user.properties
2011-3-26 20:05:29 net.paoding.analysis.knife.PaodingMaker createPaodingWithKnives
信息: add knike: net.paoding.analysis.knife.CJKKnife
2011-3-26 20:05:29 net.paoding.analysis.knife.PaodingMaker createPaodingWithKnives
信息: add knike: net.paoding.analysis.knife.LetterKnife
2011-3-26 20:05:29 net.paoding.analysis.knife.PaodingMaker createPaodingWithKnives
信息: add knike: net.paoding.analysis.knife.NumberKnife
 
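This is normal and means the "knives" were loaded successfully; the actual segmentation results are printed after these messages.
That completes the small Paoding segmentation program.
P.S. Paths must not contain Chinese characters, so it is best to avoid Chinese in all directory names.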
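
To catch the path problem from the P.S. early, you can check the dictionary and data paths before handing them to Paoding. The sketch below is a hypothetical helper and is stricter than the P.S. requires: it flags any non-ASCII character, not only Chinese ones.

```java
// Hypothetical helper: flags paths that contain any non-ASCII character.
final class PathCheck {

    // Returns true if every character of the path is plain ASCII.
    static boolean isAsciiOnly(String path) {
        for (int i = 0; i < path.length(); i++) {
            if (path.charAt(i) > 0x7F) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isAsciiOnly("F:/workspace/data/dic"));    // ASCII only
        System.out.println(isAsciiOnly("F:/\u5de5\u4f5c/data/dic")); // contains Chinese
    }
}
```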
 