POC Issues 0327
/usr/iop/4.2.5.0-0000/hbase/lib/hbase-client.jar,/usr/iop/4.2.5.0-0000/hbase/lib/hbase-protocol.jar,/usr/iop/4.2.5.0-0000/hbase/lib/hbase-server.jar,/usr/iop/4.2.5.0-0000/hbase/lib/guava-12.0.1.jar,/usr/iop/4.2.5.0-0000/hbase/lib/htrace-core-3.1.0-incubating.jar,/usr/iop/4.2.5.0-0000/hbase/lib/protobuf-java-2.5.0.jar,/usr/iop/4.2.5.0-0000/hbase/lib/hbase-hadoop2-compat.jar,/usr/iop/4.2.5.0-0000/hbase/lib/...-2.2.0.jar,/usr/iop/4.2.5.0-0000/hbase/lib/htrace-core-3.1.0-incubating.jar,/usr/iop/4.2.5.0-0000/hbase/lib/hbase-...,/usr/iop/4.2.5.0-0000/hadoop/lib/hadoop-lzo-0.5.1.jar --m...
sc.textFile("hdfs:/Data/csc_insights/disability/rdz/cbs/member/preferences/ONETIME_CBS_TCBECSP_20171018021626.DAT").map(l => (l.substring(0, 10).trim())).toDF.show(5)
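The `substring(0, 10).trim()` step above pulls a fixed-width leading key out of each record. The same extraction can be sanity-checked locally; a minimal sketch in Python, where the sample record is invented:

```python
# Sketch: emulate the Scala `l.substring(0, 10).trim()` key extraction
# on a fixed-width record. The sample line below is hypothetical.
def extract_key(line: str, width: int = 10) -> str:
    """Take the first `width` characters and strip surrounding blanks."""
    return line[:width].strip()

sample = "AB12345   |rest-of-the-fixed-width-record"
print(extract_key(sample))  # -> AB12345
```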
Filter Condition
HBase table
import org.apache.spark.sql.{SQLContext, _}
import org.apache.spark.sql.execution.datasources.hbase._
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.client.HBaseAdmin
import org.apache.hadoop.hbase.{HTableDescriptor,HColumnDescriptor}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.hbase.client.{Put,HTable}
import org.apache.hadoop.fs.{Path, FileAlreadyExistsException, FileSystem}
import org.apache.hadoop.hbase.client._
import org.apache.hadoop.hbase.spark._
val conf = HBaseConfiguration.create()
conf.addResource(new Path("/etc/hbase/4.2.5.0-0000/0/hbase-site.xml"))
conf.addResource(new Path("/etc/hbase/4.2.5.0-0000/0/core-site.xml"))
val hbaseContext = new HBaseContext(sc, conf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
val accountMapping = s"""rowkey INTEGER :key, SRC_SYS_NM STRING b:SRC_SYS_NM, RPT_NUM STRING b:RPT_NUM""".stripMargin
val accountdf = sqlContext.read.format("org.apache.hadoop.hbase.spark").option("hbase.columns.mapping", accountMapping).option("hbase.table", "T_RPT_NUM_ACT").load().persist
accountdf.registerTempTable("edms_qa_test")
sqlContext.sql("select count(distinct(value)) from Sample").show
CREATE VIEW T_CLM_PY
( ROWKEY VARCHAR PRIMARY KEY,
b."SRC_SYS_NM" VARCHAR,
b."CLM_GUID" VARCHAR,
b."CLM_NUM" VARCHAR,
b."PMT_ID" VARCHAR,
b."PY_HIST_IND" VARCHAR,
e."PAYMENTCONTACTHISTORY" VARCHAR);
Process:
1. The comparison is done using Spark. MapReduce comparison is not applicable in this new build.
2. As of now, this supports and reads data from the Hive, HDFS & HBase components.
3. This process reads the data from Hive tables, HDFS files or HBase tables and creates & loads them into Spark.
4. The output files are written to HDFS. Since this is Spark, it is not possible to write log files to the local Unix system.
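The compare step described above (load two sources, then diff them) can be illustrated with plain Python collections standing in for the two Spark dataframes; the sample rows are invented:

```python
# Sketch of the source-vs-target comparison idea: rows present on one
# side but not the other are the mismatches. Plain Python sets stand in
# for the Spark dataframes; the sample rows are hypothetical.
def compare(source_rows, target_rows):
    src, tgt = set(source_rows), set(target_rows)
    return {
        "missing_in_target": sorted(src - tgt),
        "extra_in_target": sorted(tgt - src),
        "matched": len(src & tgt),
    }

source = [("1", "CBS"), ("2", "UDS"), ("3", "DPA")]
target = [("1", "CBS"), ("3", "DPA"), ("4", "EOS")]
result = compare(source, target)
print(result)
```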
POCs
1. HDFS vs HDFS comparison
/*HDFS vs HDFS*/
spark-submit --master local[*] --class cog.bravo.sparkComp.SparkComp_3 /hadoop/dsa/qa/jmichael2/Bravo_Next_Gen_Biz_QA_Unix/DataLake_Processing_QA/DataLake_Test_Case_Executor/cogBrv3897_Metlife.jar \
/hadoop/dsa/qa/jmichael2/Bravo_Next_Gen_Biz_QA_Unix/DataLake_Processing_QA/DataLake_Test_Case_Executor "logs" "TC_1" \
SparkSQL_Obj "" "" "<SparkSQLObj><object><source_type>FILE</source_type><source>/user/jmich...</source><alias>csv_src</alias><header>false</header><delimiter>...</delimiter></object><sql> select * from csv_src </sql></SparkSQLObj>" "" "" "" "" "" "" "" "" "" "" "," "1" \
SparkSQL_Obj "" "" "<SparkSQLObj><object><source_type>FILE</source_type><source>/user/jmich...</source><alias>csv_src</alias><header>false</header><delimiter>...</delimiter></object><sql> select * from csv_src </sql></SparkSQLObj>" "" "" "" "" "" "" "" "" "" "" "," "1" \
true "true" "/user/jmichael2/test" "" "true"
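Each source is handed to the comparator as a `<SparkSQLObj>` XML payload on the command line. A small sketch of assembling and reading back such a payload (the tag names come from the command above; the path and values here are placeholders, not the real POC inputs):

```python
import xml.etree.ElementTree as ET

# Build a <SparkSQLObj> payload shaped like the command-line argument
# above. All field values are placeholder assumptions.
def build_payload(source_type, source, alias, sql,
                  header="false", delimiter=","):
    root = ET.Element("SparkSQLObj")
    obj = ET.SubElement(root, "object")
    for tag, val in [("source_type", source_type), ("source", source),
                     ("alias", alias), ("header", header),
                     ("delimiter", delimiter)]:
        ET.SubElement(obj, tag).text = val
    ET.SubElement(root, "sql").text = sql
    return ET.tostring(root, encoding="unicode")

payload = build_payload("FILE", "/user/placeholder/src.csv", "csv_src",
                        "select * from csv_src")
parsed = ET.fromstring(payload)
print(parsed.find("object/alias").text)  # -> csv_src
print(parsed.find("sql").text)           # -> select * from csv_src
```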
Output files:
2. HIVE vs HIVE comparison
/*HIVE vs HIVE*/
spark-submit --master local[*] --packages org.apache.hbase:hbase-common:1.0.0,org.apache.hbase:hbase-client:1.0.0,org.apache.hbase:hbase-server:1.0.0 --class cog.bravo.sparkComp.SparkComp_3 /hadoop/dsa/qa/jmichael2/Bravo_Next_Gen_Biz_QA_Unix/DataLake_Processing_QA/DataLake_Test_Case_Executor/cogBrv3897_Metlife.jar \
/hadoop/dsa/qa/jmichael2/Bravo_Next_Gen_Biz_QA_Unix/DataLake_Processing_QA/DataLake_Test_Case_Executor "logs" "TC_1" \
SparkSQL_Obj "" "" "<SparkSQLObj><object><source_type>HIVE</source_type><source>select * fr...</source><alias>DPA</alias></object><sql> select SUB... </sql></SparkSQLObj>" "" "" "" "" "" "" "" "" "" "" "," "1" \
SparkSQL_Obj "" "" "<SparkSQLObj><object><source_type>HIVE</source_type><source>select * fr...</source><alias>DPA</alias></object><sql> select SUB... </sql></SparkSQLObj>" "" "" "" "" "" "" "" "" "" "" "," "1" \
true "true" "/user/jmichael2/test" "" "true"
Output Files:
3. HBASE vs HIVE comparison
/*HIVE vs HBASE - NT10*/
spark-submit --master local[*] --packages org.apache.hbase:hbase-common:1.0.0,org.apache.hbase:hbase-client:1.0.0,org.apache.hbase:hbase-server:1.0.0 --class cog.bravo.sparkComp.SparkComp_3 /home/METNET/jmichael2/cogBrv3897_Metlife0320.jar \
/home/METNET/jmichael2 "logs" "TC_1" \
SparkSQL_Obj "" "" "<SparkSQLObj><object><source_type>HIVE</source_type><source>select * fr...</source><alias>DPA</alias></object><sql> select SUB... </sql></SparkSQLObj>" "" "" "" "" "" "" "" "" "" "" "," "1" \
SparkSQL_Obj "" "" "<SparkSQLObj><object><source_type>HBASE</source_type><source>T_RPT_NUM_ACT</source><alias>DPA_TGT</alias><columns></columns></object><sql> select B_R... </sql></SparkSQLObj>" "" "" "" "" "" "" "" "" "" "" "," "1" \
true "true" "/user/jmichael2/test" "" "true"
It supports the other source-type combinations as well.
Resolution:
To resolve this issue we need to make sure that the HBASE_CONF_PATH environment variable is set before launching sqlline.py. This variable should point to the HBase config directory.
> We need to load RDZ data into Phoenix for doing the RDZ against EOS comparison
> Phoenix supports importing only files that are in .csv format
> Any tables that are created in Phoenix are also created in HBase
> Hence, we always need to append a primary key to the RDZ file
> This primary key will be taken as the row_key when the data is inserted into HBase
> When we load the data using the below two options, we are getting an error:
...-21-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool -Dfs.permissions.umask-mode=000 --table Sample_edms_qa --input user/jmichael2/test/ONETIME_UDS_DPA_Conversion_trial.csv
...ONETIME_UDS_DPA_Conversion_20180119011722.csv
Root Cause:
This happens when the user has an incorrect value defined for "zookeeper.znode.parent" in the hbase-site.xml sourced on the client side, or, in the case of a custom API, when "zookeeper.znode.parent" was incorrectly specified.
For example, the default "zookeeper.znode.parent" is set to "/hbase-unsecure", but if you incorrectly specify it as, say, "/hbase", as opposed to what we have set up in the cluster, we will encounter this exception.
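A quick way to confirm what the client-side hbase-site.xml actually carries for `zookeeper.znode.parent` is to read the property out of the file; a sketch, where the XML string below is a minimal stand-in for a real hbase-site.xml:

```python
import xml.etree.ElementTree as ET

# Read zookeeper.znode.parent out of an hbase-site.xml-style document,
# falling back to the cluster default when the property is absent.
def znode_parent(xml_text, default="/hbase-unsecure"):
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name") == "zookeeper.znode.parent":
            return prop.findtext("value")
    return default

site_xml = """<configuration>
  <property>
    <name>zookeeper.znode.parent</name>
    <value>/hbase-unsecure</value>
  </property>
</configuration>"""
print(znode_parent(site_xml))  # -> /hbase-unsecure
```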
Resolution:
The solution here would be to update the hbase-site.xml / source the same hbase-site.xml from the cluster, or update the HBase API to correctly point to the "zookeeper.znode.parent" value as updated in the cluster.
This also occurs because the default Phoenix configurations are hitting timeout limits; increase the timeout to 1200.
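The "append a primary key to the RDZ file" step above can be sketched as a small pre-processing pass that prefixes each CSV row with a running sequence number before the Phoenix bulk load; the sample rows and delimiter are assumptions:

```python
import csv
import io

# Prefix every CSV row with a sequence number so the Phoenix
# CsvBulkLoadTool has a unique leading PRIMARY KEY (the HBase row_key).
# In the POC this would run over the ONETIME_* extract files.
def add_primary_key(src_text, start=1):
    reader = csv.reader(io.StringIO(src_text))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for seq, row in enumerate(reader, start=start):
        writer.writerow([seq] + row)
    return out.getvalue()

raw = "CBS,MEMBER,2017\nUDS,DPA,2018\n"
print(add_primary_key(raw))  # -> 1,CBS,MEMBER,2017 / 2,UDS,DPA,2018
```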