Hive UDF: Filtering Chinese and English Punctuation from Strings

Date: 2020-06-26 Category: Programmer's Life

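Working with Hive sometimes calls for writing a UDF. Here is a simple one that removes all Chinese and English punctuation from a string.

I am a Java beginner, so the code may not be pretty; please go easy on me. For reference only.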

package com.fccs.utils;

import org.apache.hadoop.hive.ql.exec.UDF;

/**
 * String-replacement UDF with regular-expression support.
 * @author yqj@fccs.com
 * @date 2015-5-28
 * @version 1.0
 */
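public class F_str_replace extends UDF {

    /**
     * Hive entry point. NULL or blank input yields the literal string "null"
     * (a string, not SQL NULL); everything else is passed to get_str_replace.
     */
    public String evaluate(String str, String... args) {
        if (str == null || str.trim().isEmpty()) {
            return "null";
        }
        return get_str_replace(str.trim(), args);
    }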

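    /**
     * Strips punctuation from a string, or applies a caller-supplied regex replacement.
     * NULL or blank input returns the literal string "null"; all whitespace is removed from the result.
     * Example: get_str_replace("金成·江南春城 (·竹海水韵)")
     * returns "金成江南春城竹海水韵".
     * @param subject the input string
     * @param args optional arguments: args[0] is the replacement string, args[1] is the regular expression
     * @use get_str_replace(subject, replacement, pattern)
     * @return the processed string
     */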
    private static String get_str_replace(String subject, String... args) {
        if (subject == null || subject.trim().isEmpty()) {
            return "null";
        }
        subject = subject.trim();

        // ASCII punctuation, Unicode punctuation, and symbol characters outside \p{P}.
        // Applied in a single pass, so punctuation inside the replacement is not replaced again.
        final String punctuation = "[\\p{Punct}\\p{P}～｀＄＾＋＝｜＜＞￥×]";

        String text;
        if (args == null || args.length == 0) {
            // No extra arguments: delete every punctuation character.
            text = subject.replaceAll(punctuation, "");
        } else if (args.length == 1) {
            // One argument: replace each punctuation character with args[0].
            String replacement = args[0] == null ? "" : args[0];
            text = subject.replaceAll(punctuation, replacement);
        } else {
            // Two arguments: args[0] is the replacement, args[1] is a custom regex.
            String replacement = args[0];
            String pattern = args[1];
            text = subject.replaceAll(pattern, replacement);
        }

        return text.replaceAll("\\s+", ""); // remove all whitespace, not just extra spaces
    }

}
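
To try the default punctuation-stripping behavior without a Hive installation, the regex logic can be exercised on its own. The following is a minimal sketch rather than the UDF itself: the class name PunctuationFilterDemo and the helper strip are hypothetical names introduced here, and the pattern is assumed to cover the same default character set as the UDF above, written as one character class.

```java
import java.util.regex.Pattern;

// Hypothetical stand-alone demo; not part of the original UDF.
final class PunctuationFilterDemo {

    // ASCII punctuation (\p{Punct}), Unicode punctuation (\p{P}), and a few
    // symbol characters that belong to neither class.
    private static final Pattern PUNCTUATION =
            Pattern.compile("[\\p{Punct}\\p{P}～｀＄＾＋＝｜＜＞￥×]");

    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Removes punctuation first, then all whitespace, mirroring the
    // no-argument call form of the UDF. Unlike the UDF, null stays null.
    static String strip(String input) {
        if (input == null) {
            return null;
        }
        String withoutPunctuation = PUNCTUATION.matcher(input).replaceAll("");
        return WHITESPACE.matcher(withoutPunctuation).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(strip("金成·江南春城 (·竹海水韵)")); // 金成江南春城竹海水韵
        System.out.println(strip("Hello, world! 你好，世界。")); // Helloworld你好世界
    }
}
```

Note that Java's default \s matches only ASCII whitespace, so a full-width space (U+3000) would survive both this sketch and the UDF unless it is added to the pattern explicitly.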

Please credit the source when reposting: https://www.heiqu.com/ccd95b871acb46fbcd2c34781a1b88fa.html