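Every day users perform all kinds of actions on the website, such as viewing pages and placing orders. The site records these actions as user behavior logs and stores them on HDFS. The log format is as follows: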
17:03:35.012ᄑpageviewᄑ{"device_id":"4405c39e85274857bbef58e013a08859","user_id":"0921528165741295","ip":"61.53.69.195","session_id":"9d6dc377216249e4a8f33a44eef7576d","req_url":"http://www.bigdataclass.com/product/1527235438747427"}
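This is JSON-like unstructured text that records what a user leaves behind when visiting the site. The JSON part carries the attributes device_id, user_id, ip, session_id, and req_url, and it is preceded by a time of day and an event name, for example 17:03:20.586ᄑpageviewᄑ. We want an MR (MapReduce) program to turn this text into a format the data warehouse can read, such as one JSON string per record. Note that the sample output below belongs to a different record than the sample line above (its req_url differs). The output looks like this: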
{"time_log":1527584600586,"device_id":"4405c39e85274857bbef58e013a08859","user_id":"0921528165741295","active_name":"pageview","ip":"61.53.69.195","session_id":"9d6dc377216249e4a8f33a44eef7576d","req_url":"http://www.bigdataclass.com/my/0921528165741295"}
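The transformation described above can be sketched in plain Java before wiring it into a mapper. This is only a minimal sketch under stated assumptions, not the course's actual implementation: the class name LogLineSketch and the method transform are hypothetical, the calendar date is assumed to be supplied by the caller (the log line only carries a time of day), the times are assumed to be in UTC+8, and the JSON is spliced as a string to keep the example dependency-free, whereas a real job would more likely parse the payload with fastjson.

```java
import java.time.LocalDate;
import java.time.LocalTime;
import java.time.ZoneOffset;

// Hypothetical helper: turns one raw log line into one JSON string.
class LogLineSketch {
    // Field separator seen in the sample log line (the character U+1111).
    private static final String SEP = "\u1111";
    // Assumption: log times are local to UTC+8.
    private static final ZoneOffset OFFSET = ZoneOffset.ofHours(8);

    // logDate is an assumption: the line has no date, so the caller provides it.
    static String transform(String line, LocalDate logDate) {
        // Limit 3 keeps the JSON payload intact even if it contains the separator.
        String[] parts = line.split(SEP, 3);
        if (parts.length != 3 || !parts[2].trim().startsWith("{")) {
            throw new IllegalArgumentException("unexpected log line: " + line);
        }
        // Combine the assumed date with the time of day, e.g. 17:03:20.586.
        long timeLog = logDate.atTime(LocalTime.parse(parts[0]))
                .toInstant(OFFSET)
                .toEpochMilli();
        // Splice the two new fields in front of the existing ones.
        // Assumes a non-empty JSON object and an event name that needs no escaping.
        String payload = parts[2].trim();
        return "{\"time_log\":" + timeLog
                + ",\"active_name\":\"" + parts[1] + "\","
                + payload.substring(1);
    }

    public static void main(String[] args) {
        String line = "17:03:20.586" + SEP + "pageview" + SEP
                + "{\"device_id\":\"4405c39e85274857bbef58e013a08859\","
                + "\"user_id\":\"0921528165741295\"}";
        System.out.println(transform(line, LocalDate.of(2018, 5, 29)));
    }
}
```

With the 17:03:20.586 prefix and the assumed date 2018-05-29, time_log comes out as 1527584600586, the value shown in the sample output. The field order differs from that sample, which does not matter for JSON consumers.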
Tools: IntelliJ IDEA, Maven, JDK 1.8
Steps:
Configure pom.xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 ">
    <modelVersion>4.0.0</modelVersion>

    <groupId>netease.bigdata.course</groupId>
    <artifactId>etl</artifactId>
    <version>1.0-SNAPSHOT</version>

    <dependencies>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>2.7.6</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>fastjson</artifactId>
            <version>1.2.4</version>
        </dependency>
    </dependencies>

    <build>
        <sourceDirectory>src/main</sourceDirectory>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-assembly-plugin</artifactId>
                <configuration>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>

        </plugins>
    </build>

</project>