POC Issues 0327


spark-shell --packages com.hortonworks:shc-core:1.1.1-2.1-s_2.11 --repositories http://repo.hortonworks.com/content/groups/public/ --files /etc/hbase/4.2.5.0-0000/0/hbase-site.xml \
--jars /usr/iop/4.2.5.0-0000/hbase/lib/hbase-client.jar,/usr/iop/4.2.5.0-0000/hbase/lib/hbase-protocol.jar,/usr/iop/4.2.5.0-0000/hbase/lib/hbase-common.jar,/usr/iop/4.2.5.0-0000/hbase/lib/hbase-server.jar,/usr/iop/4.2.5.0-0000/hbase/lib/guava-12.0.1.jar,/usr/iop/4.2.5.0-0000/hbase/lib/htrace-core-3.1.0-incubating.jar,/usr/iop/4.2.5.0-0000/hbase/lib/zookeeper.jar,/usr/iop/4.2.5.0-0000/hbase/lib/protobuf-java-2.5.0.jar,/usr/iop/4.2.5.0-0000/hbase/lib/hbase-hadoop2-compat.jar,/usr/iop/4.2.5.0-0000/hbase/lib/hbase-hadoop-compat.jar,/usr/iop/4.2.5.0-0000/hbase/lib/metrics-core-2.2.0.jar,/usr/iop/4.2.5.0-0000/hbase/lib/htrace-core-3.1.0-incubating.jar,/usr/iop/4.2.5.0-0000/hbase/lib/hbase-spark.jar,/usr/iop/4.2.5.0-0000/hive2/lib/hive-hbase-handler.jar,/usr/iop/4.2.5.0-0000/hadoop/lib/hadoop-lzo-0.5.1.jar \
--master yarn

Reading HDFS File

sc.textFile("hdfs:/Data/csc_insights/disability/rdz/cbs/member/preferences/ONETIME_CBS_TCBECSP_20171
(l.substring(0, 10).trim())).toDF.show(5)

Querying HDFS File


textFile.distinct.count
textFile.registerTempTable("Sample")
val result=sqlContext.sql("select count(distinct(value)) from Sample")
result.show

Filter Condition

HBase table
import org.apache.hadoop.hbase.spark
import org.apache.spark.sql.{SQLContext, _}
import org.apache.spark.sql.execution.datasources.hbase._
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.client.HBaseAdmin
import org.apache.hadoop.hbase.{HTableDescriptor,HColumnDescriptor}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.hbase.client.{Put,HTable}
import org.apache.hadoop.fs.{Path, FileAlreadyExistsException, FileSystem}
import org.apache.hadoop.hbase.client._
import org.apache.hadoop.hbase.spark._
val conf = HBaseConfiguration.create()
conf.addResource(new Path("/etc/hbase/4.2.5.0-0000/0/hbase-site.xml"))
conf.addResource(new Path("/etc/hbase/4.2.5.0-0000/0/core-site.xml"))
val hbaseContext = new HBaseContext(sc, conf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
val accountMapping = s"""rowkey INTEGER :key, SRC_SYS_NM STRING b:SRC_SYS_NM, RPT_NUM STRING b:RPT_NUM""".stripMargin
val accountdf = sqlContext.read.format("org.apache.hadoop.hbase.spark").option("hbase.columns.mapping", accountMapping).option("hbase.table", "T_RPT_NUM_ACT").load().persist
accountdf.registerTempTable("edms_qa_test")

val result1 = sqlContext.sql("select count(*) from edms_qa_test where SRC_SYS_NM = 'UDS'")
result1.show

val result2 = sqlContext.sql("select * from edms_qa_test limit 5")
result2.show

Joining Hbase and HDFS file
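A minimal sketch of such a join, reusing the Sample temp table (HDFS file) and the edms_qa_test temp table (HBase) registered above; the join condition is an assumption for illustration only:

// Join the HDFS-backed temp table with the HBase-backed temp table
// (matching RPT_NUM to the file's single "value" column is an assumed example key)
val joined = sqlContext.sql("select h.rowkey, h.SRC_SYS_NM, h.RPT_NUM from edms_qa_test h join Sample s on h.RPT_NUM = s.value")
joined.show(5)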


CREATE VIEW T_CLM_PY
( ROWKEY VARCHAR PRIMARY KEY,
b."SRC_SYS_NM" VARCHAR,
b."CLM_GUID " VARCHAR,
b."CLM_NUM" VARCHAR,
b."PMT_ID" VARCHAR,
b."PY_HIST_IND" VARCHAR,
e."PAYMENTCONTACTHISTORY" VARCHAR);

CREATE VIEW T_CLM
( ROWKEY VARCHAR PRIMARY KEY,
b."SRC_SYS_NM" VARCHAR,
b."CLM_NUM" VARCHAR,
e."PAYMENTCONTACTHISTORY" VARCHAR);

SELECT COUNT(*) FROM "T_CLM_PY";

SELECT * FROM "T_CLM_PY" WHERE "e"."PAYMENTCONTACTHISTORY" IS NOT NULL LIMIT 5;


The Bravo team has given a build which will do the below comparisons using Spark:
a. Base object: (Hive or HDFS or HBASE) vs (Hive or HDFS or HBASE)
b. Multiple objects: (Hive & HDFS) vs (Hbase_1 & Hbase_2) and all other combinations as well

Process:
1. The comparison is done using Spark; MapReduce-based comparison is not available (NA) in this new build.
2. As of now, it supports reading data from Hive, HDFS & HBASE.
3. The process reads data from Hive tables, HDFS files, or HBASE tables and loads them into Spark SQL tables (see the sketch after this list).
4. The output files are written to HDFS. Since this is Spark, it is not possible to write log files to the local Unix server [the way our current build works is NA].
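A rough sketch of that flow with two illustrative sources (the paths, table name and schema handling are placeholders, not the Bravo build's internals):

// Illustrative only: load one HDFS file and one Hive table, register both as Spark SQL tables,
// and write the row-level differences back to HDFS (assumes both sides share the same schema).
val srcDF = sqlContext.read.option("delimiter", ",").csv("/user/jmichael2/test/src_file.csv")
val tgtDF = sqlContext.sql("select * from some_db.some_table")
srcDF.registerTempTable("src")
tgtDF.registerTempTable("tgt")
val srcOnly = srcDF.except(tgtDF)   // rows present only in the source
val tgtOnly = tgtDF.except(srcDF)   // rows present only in the target
srcOnly.write.mode("overwrite").csv("/user/jmichael2/test/src_only")
tgtOnly.write.mode("overwrite").csv("/user/jmichael2/test/tgt_only")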

Below are the pre-requisites:


1. Java 1.8
2. Spark 2.1
3. Access to Namenode [NT10] - We are seeing issues with ZooKeeper services on edge node [ET01]; on Namenode [NT10], it is running successfully
4. Access to write files to HDFS layer

POCs
1. HDFS vs HDFS comparison
/*HDFS vs HDFS*/
spark-submit --master local[*] --class cog.bravo.sparkComp.SparkComp_3 /hadoop/dsa/qa/jmichael2/Bravo_Next_Gen_Biz_QA_Unix/DataLake_Processing_QA/DataLake_Test_Case_Executor/cogBrv3897_Metlife.jar \
/hadoop/dsa/qa/jmichael2/Bravo_Next_Gen_Biz_QA_Unix/DataLake_Processing_QA/DataLake_Test_Case_Executor "logs" "TC_1" \
SparkSQL_Obj "" "" "<SparkSQLObj>
<object> <source_type>FILE</source_type> <source>/user/jmich <alias>csv_src</alias><header>false</header><delimiter </object>
<sql> select * from csv_src </sql></SparkSQLObj>" "" "" "" "" "" "" "" "" "" "" "," "1" \
SparkSQL_Obj "" "" "<SparkSQLObj>
<object> <source_type>FILE</source_type> <source>/user/jmich <alias>csv_src</alias><header>false</header><delimiter </object>
<sql> select * fr </sql></SparkSQLObj>" "" "" "" "" "" "" "" "" "" "" "," "1" \
true "true" "/user/jmichael2/test" "" "true"

Output files:
2. HIVE vs HIVE comparison
/*HIVE vs HIVE*/
spark-submit --master local[*] --packages org.apache.hbase:hbase-common:1.0.0,org.apache.hbase:hbase-client:1.0.0,org.apache.hbase:hbase-server:1.0.0 --class cog.bravo.sparkComp.SparkComp_3 /hadoop/dsa/qa/jmichael2/Bravo_Next_Gen_Biz_QA_Unix/DataLake_Processing_QA/DataLake_Test_Case_Executor/cogBrv3897_Metlife.jar \
/hadoop/dsa/qa/jmichael2/Bravo_Next_Gen_Biz_QA_Unix/DataLake_Processing_QA/DataLake_Test_Case_Executor "logs" "TC_1" \
SparkSQL_Obj "" "" "<SparkSQLObj>
<object> <source_type>HIVE</source_type> <source>select * fr <alias>DPA</object>
<sql> select SUB </sql></SparkSQLObj>" "" "" "" "" "" "" "" "" "" "" "," "1" \
SparkSQL_Obj "" "" "<SparkSQLObj>
<object> <source_type>HIVE</source_type> <source>select * fr <alias>DPA</object>
<sql> select SUB </sql></SparkSQLObj>" "" "" "" "" "" "" "" "" "" "" "," "1" \
true "true" "/user/jmichael2/test" "" "true"

Output Files:

3. HBASE vs HIVE
/*HIVE vs HBASE - NT10*/
spark-submit --master local[*] --packages org.apache.hbase:hbase-common:1.0.0,org.apache.hbase:hbase-client:1.0.0,org.apache.hbase:hbase-server:1.0.0 --class cog.bravo.sparkComp.SparkComp_3 /home/METNET/jmichael2/cogBrv3897_Metlife0320.jar \
/home/METNET/jmichael2 "logs" "TC_1" \
SparkSQL_Obj "" "" "<SparkSQLObj>
<object> <source_type>HIVE</source_type> <source>select * fr <alias>DPA</object>
<sql> select SUB </sql></SparkSQLObj>" "" "" "" "" "" "" "" "" "" "" "," "1" \
SparkSQL_Obj "" "" "<SparkSQLObj>
<object> <source_type>HBASE</source_type> <source>T_RPT_NUM_ <alias>DPA_TGT</alias> <columns></object> <sql> select B_R </sql></SparkSQLObj>" "" "" "" "" "" "" "" "" "" "" "," "1" \
true "true" "/user/jmichael2/test" "" "true"
1. Querying Larger data sets
Issue: Error: Operation timed out. (state=TIM01,code=6000)
Root Cause: This is primarily seen with queries running on larger data sets because the default Phoenix configurations hit their timeout limits.

Resolution:
To resolve this issue, make sure the HBASE_CONF_PATH environment variable is set before launching sqlline.py. This variable should point to the HBase config directory (a client-side sketch for the Spark connector follows the steps below).

1) Update or add the following configs to hbase-site.xml


--> phoenix.query.timeoutMs=1800000
--> hbase.regionserver.lease.period = 1200000
--> hbase.rpc.timeout = 1200000
--> hbase.client.scanner.caching = 1000
--> hbase.client.scanner.timeout.period = 1200000
2) Restart HBase services to make these changes effective.
3) export HBASE_CONF_PATH=/etc/hbase/conf
4) Launch sqlline.py
5) Run the same query that was failing
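If the same scans are run through the Spark HBase connector shown earlier rather than through sqlline.py, the HBase client values above can also be raised programmatically on the HBaseConfiguration object (conf) created in that snippet; an illustrative sketch (the Phoenix-specific key still belongs in hbase-site.xml):

// Client-side equivalents of the scanner/RPC settings listed above
conf.set("hbase.rpc.timeout", "1200000")
conf.set("hbase.client.scanner.timeout.period", "1200000")
conf.set("hbase.client.scanner.caching", "1000")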

2. Running join queries


Issue: Size of hash cache (104857608 bytes) exceeds the maximum allowed size (104857600 bytes) [100 MB]
Root Cause: The hash cache available for processing the join is not large enough
Resolution: Increase the maximum size of the hash cache
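The limit in this message is typically governed by the Phoenix property phoenix.query.maxServerCacheBytes (default 104857600 bytes); raising it in the client-side hbase-site.xml lifts the 100 MB cap, for example:

--> phoenix.query.maxServerCacheBytes = 209715200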
3. Unable to load RDZ data

> We need to load RDZ data into Phoenix for the RDZ vs. EOS comparison
> Phoenix supports importing only .csv files
> Any tables that are created in Phoenix are also created in HBASE
> Hence, we always need to append a primary key to the RDZ file (a sketch of one way to do this follows this list)
> This primary key is taken as the row_key when the data is inserted into HBASE
> When we load the data using the below two options, we get errors
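One way to append such a key before attempting the load is to generate it in Spark and rewrite the file; a minimal sketch (the output path is an assumption):

// Prepend a generated row key to each line of the RDZ .csv so Phoenix/HBase has a primary key
val rdz = sc.textFile("/user/jmichael2/test/ONETIME_UDS_DPA_Conversion_trial.csv")
val withKey = rdz.zipWithIndex().map { case (line, idx) => s"$idx,$line" }
withKey.saveAsTextFile("/user/jmichael2/test/ONETIME_UDS_DPA_Conversion_with_key")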

Option 1: Using phoenix-psql method


phoenix-psql -t Sample_edms_qa localhost /home/METNET/jmichael2/ONETIME_UDS_DPA_Conversion_20180119011722.csv

Rootcause & Resolution: Unable to find


Option 2: Using the Phoenix client jar and CsvBulkLoadTool
hadoop jar /usr/iop/4.2.5.0-0000/phoenix/phoenix-4.8.1-HBase-1.2.0-IBM-21-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool -Dfs.permissions.umask-mode=000 --table Sample_edms_qa --input user/jmichael2/test/ONETIME_UDS_DPA_Conversion_trial.csv

Root Cause:
This happens when the user has an incorrect value defined for "zookeeper.znode.parent" in the hbase-site.xml sourced on the client side, or, in the case of a custom API, when "zookeeper.znode.parent" was incorrectly updated to a wrong location.
For example, the default "zookeeper.znode.parent" is set to "/hbase-unsecure", but if you incorrectly specify it as, say, "/hbase" as opposed to what is set up in the cluster, you will encounter this exception while trying to connect to the HBase cluster.

Resolution:
The solution is to update the hbase-site.xml (source the same hbase-site.xml from the cluster) or update the HBase API to point to the correct "zookeeper.znode.parent" value as configured in the HBase cluster.
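When connecting from the Spark snippets above rather than from sqlline.py, the same value can also be pinned explicitly on the client configuration; for example, using the cluster default noted above:

// Point the client at the znode parent actually used by the cluster
conf.set("zookeeper.znode.parent", "/hbase-unsecure")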
1. Timeout error during counts on huge tables
Cause:
The above error (java.util.concurrent.TimeoutException: Futures timed out after [300 seconds]) suggests that there has been a timeout.
The stack trace shows that the query plan uses broadcast joins; awaitResult has a default timeout of 300 seconds for the broadcast wait time in broadcast joins.
The error is displayed when this default timeout value is exceeded.
Solution:
To resolve this issue, increase the default value of 300 for spark.sql.broadcastTimeout to 1200.
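For the Spark version used here, the value can be raised either on the SQL context or at submit time; for example:

// Raise the broadcast-join wait time from the 300-second default to 1200 seconds
sqlContext.setConf("spark.sql.broadcastTimeout", "1200")
// or at launch: spark-submit --conf spark.sql.broadcastTimeout=1200 ...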