Parking spot analysis and migration techniques

My Hive cluster runs on 3 worker PCs. The name node has a Core i5 and 20 GB of RAM, and the 2 data nodes each have a Core i5 and 16 GB of RAM. All of the machines use SATA HDDs and run CentOS 7. The Hadoop cluster runs Hortonworks 2.4, managed through Ambari. If you are not familiar with Hortonworks, download the sandbox from the Hortonworks website to start your journey with Hive: https://hortonworks.com/products/sandbox/.

Today I am not going to cover the Hortonworks installation itself; I will write another post describing how to install Hortonworks on multiple nodes using Ambari.


hive:zurich> select count(*) from parking_events_text_partition;
   _col0
-----------
 229966246

Here you can see that there are almost 230 million records in my parking table.

select count(distinct(parking_status_event_id)) as total from zurich.parking_events_text_orc where parking_type = "Spot" and parking_city = "XXXXX" and parking_neighborhood = "XXXXXX" and parking_area = "XXXXXXXX" and parking_free =  0 and parking_event_time between from_utc_timestamp('2014-01-01 00:00:00.0','CET') and from_utc_timestamp('2015-08-01 00:00:00.0','CET')

Now I want to count the distinct parking status events from 2014-01-01 to 2015-08-01 in parking_events_text_orc, filtered on parking_free = 0 for a specific parking city, parking neighborhood, and parking area.

Let's start the Hive shell and run the query against the ORC-format table:

[root@namenode pyspark]# hive
WARNING: Use "yarn jar" to launch YARN applications.

Logging initialized using configuration in file:/etc/hive/2.4.2.0-258/0/hive-log4j.properties
hive> use zurich;
OK
Time taken: 2.04 seconds
hive> select count(distinct(parking_status_event_id)) as total from zurich.parking_events_text_orc where parking_type = "Spot" and parking_city = "XXXXX" and parking_neighborhood = "XXXXXX" and parking_area = "XXXXXXX" and parking_free =  0 and parking_event_time between from_utc_timestamp('2014-01-01 00:00:00.0','CET') and from_utc_timestamp('2015-08-01 00:00:00.0','CET');
Query ID = root_20170329193604_0956af6f-3720-4ef8-804a-18cf2ef36c10
Total jobs = 1
Launching Job 1 out of 1


Status: Running (Executing on YARN cluster with App id application_1490785713790_0001)

--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 ......         RUNNING     35         24       11        0       3       0
Reducer 2             INITED    118          0        0      118       0       0
Reducer 3             INITED      1          0        0        1       0       0
--------------------------------------------------------------------------------

VERTICES: 00/03 [====>>----------------------] 15% ELAPSED TIME: 25.89 s

--------------------------------------------------------------------------------

The query is still running at this point; after about 48 seconds the job completes:

--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 ..........   SUCCEEDED     35         35        0        0       3       0
Reducer 2 ......   SUCCEEDED     15         15        0        0       0       0
Reducer 3 ......   SUCCEEDED      1          1        0        0       0       0
--------------------------------------------------------------------------------

VERTICES: 03/03 [==========================>>] 100% ELAPSED TIME: 48.14 s

--------------------------------------------------------------------------------
OK
29943
Time taken: 53.44 seconds, Fetched: 1 row(s)

It takes almost 53 seconds to produce the output.

Now let's check the timing for the parking_events_text_parquet table, where the data is stored in Parquet format.

hive> select count(distinct(parking_status_event_id)) as total from zurich.parking_events_text_parquet where parking_type = "Spot" and parking_city = "XXXXXX" and parking_neighborhood = "XXXXXXX" and parking_area = "XXXXXXX" and parking_free =  0 and parking_event_time between from_utc_timestamp('2014-01-01 00:00:00.0','CET') and from_utc_timestamp('2015-08-01 00:00:00.0','CET');
Query ID = root_20170329193806_c755690f-8839-4d12-aa4c-cc6122187cd6
Total jobs = 1
Launching Job 1 out of 1


Status: Running (Executing on YARN cluster with App id application_1490785713790_0001)

--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 ..........   SUCCEEDED     50         50        0        0       5       0
Reducer 2 ......   SUCCEEDED      1          1        0        0       0       0
Reducer 3 ......   SUCCEEDED      1          1        0        0       0       0
--------------------------------------------------------------------------------

VERTICES: 03/03 [==========================>>] 100% ELAPSED TIME: 68.80 s

--------------------------------------------------------------------------------
OK
29943
Time taken: 69.697 seconds, Fetched: 1 row(s)

It took about 70 seconds to complete.

Now let's analyze the same query against the Hive table stored in plain text format.

hive> select count(distinct(parking_status_event_id)) as total from zurich.parking_events_text_partition where parking_type = "Spot" and parking_city = "XXXXXX" and parking_neighborhood = "XXXXXX" and parking_area = "XXXXXX" and parking_free =  0 and parking_event_time between from_utc_timestamp('2014-01-01 00:00:00.0','CET') and from_utc_timestamp('2015-08-01 00:00:00.0','CET');
Query ID = root_20170329193950_b3f851d2-f49f-4977-931b-6815f953d31f
Total jobs = 1
Launching Job 1 out of 1


Status: Running (Executing on YARN cluster with App id application_1490785713790_0001)

--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1                RUNNING     36          0       20       16       0       0
Reducer 2             INITED     18          0        0       18       0       0
Reducer 3             INITED      1          0        0        1       0       0
--------------------------------------------------------------------------------

VERTICES: 00/03 [>>--------------------------] 0% ELAPSED TIME: 69.74 s

Notice that almost 70 seconds have already elapsed and the query is still at 0% when the Hive table is stored as plain text.

Hive data migration

Now it's time to migrate the Hive data from one server to another. The data was not that big (under 35 GB, about 229 million rows), but I still had to move it between servers. Here is what I actually did.

I ran a command to export the query results in CSV format.
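The export command itself isn't shown above; a sketch along these lines would do it. The output path is a placeholder, and the tab-to-comma conversion with tr is my assumption, since the Hive CLI prints result rows tab-separated by default:

```shell
# Hedged sketch: dump the partitioned text table to a local CSV file.
# The table name comes from this post; the output path is a placeholder.
#
#   hive -e 'select * from zurich.parking_events_text_partition;' \
#     | tr '\t' ',' > /tmp/parking_events.csv
#
# The tab-to-comma conversion itself, demonstrated on a sample row:
printf 'id1\tSpot\t12\n' | tr '\t' ','
```

For very large exports, Hive's INSERT OVERWRITE DIRECTORY is another common route, writing the files directly to HDFS instead of through the CLI.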

You can see that almost 37 GB of data was generated once the shell command completed.

Then I created a tar archive to compress the data, which brought it down to around 3.9 GB.
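The compress-and-copy step could look like the following sketch. The paths, the hostname, and the use of gzip are assumptions; the post only states that a tar file of about 3.9 GB was produced. The commands below run on a small stand-in file so the flow is reproducible:

```shell
# Create a stand-in for the real ~37 GB CSV export (placeholder path):
mkdir -p /tmp/parking_export
printf 'id1,Spot,12\n' > /tmp/parking_export/parking_events.csv

# Compress the export directory with tar + gzip:
tar -czf /tmp/parking_events.tar.gz -C /tmp parking_export

# Copy the archive to the destination server (hostname is a placeholder):
#   scp /tmp/parking_events.tar.gz root@newserver:/tmp/

# Verify the archive contents:
tar -tzf /tmp/parking_events.tar.gz
```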

Now we are almost done moving the data between servers. The next challenge is on the destination side: to import the data you need to recreate the table, which was partitioned in the source Hive database. The schema looked like this:

create table parking_events_text_partition
(
	parking_id string,
	parking_type string,
	parking_size int,
	parking_free int,
	parking_longitude double,
	parking_latitude double,
	parking_permit_types string,
	parking_country string,
	parking_city string,
	parking_neighborhood string,
	parking_area string,
	parking_address string,
	parking_event_type string,
	parking_event_time timestamp,
	parking_status_event_id string
)
partitioned by (`date` string)  
ROW FORMAT DELIMITED  
FIELDS TERMINATED BY ',';

Parquet

create table parking_events_text_parquet
(
	parking_id string,
	parking_type string,
	parking_size int,
	parking_free int,
	parking_longitude double,
	parking_latitude double,
	parking_permit_types string,
	parking_country string,
	parking_city string,
	parking_neighborhood string,
	parking_area string,
	parking_address string,
	parking_event_type string,
	parking_event_time timestamp,
	parking_status_event_id string,
	`date` string
) STORED AS parquet;
insert into parking_events_text_parquet select * from parking_events_text_partition;

ORC

create table parking_events_text_orc
(
	parking_id string,
	parking_type string,
	parking_size int,
	parking_free int,
	parking_longitude double,
	parking_latitude double,
	parking_permit_types string,
	parking_country string,
	parking_city string,
	parking_neighborhood string,
	parking_area string,
	parking_address string,
	parking_event_type string,
	parking_event_time timestamp,
	parking_status_event_id string,
	`date` string
) STORED AS orc;
insert into parking_events_text_orc select * from parking_events_text_partition;

Now you need to restore the data into your Hive table.
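On the destination server, the restore could be sketched as follows. The extraction path and the partition value are placeholders, and the LOAD DATA statement assumes the exported files are split per date partition; if they are not, loading into an unpartitioned staging table and inserting with dynamic partitioning is the usual alternative. A small stand-in archive is created here so the shell steps are reproducible:

```shell
# Stand-in archive (the real one is the ~3.9 GB parking_events.tar.gz):
mkdir -p /tmp/demo_src /tmp/demo_restore
printf 'id1,Spot,12\n' > /tmp/demo_src/parking_events.csv
tar -czf /tmp/demo_src/parking_events.tar.gz -C /tmp/demo_src parking_events.csv

# 1. Extract the archive:
tar -xzf /tmp/demo_src/parking_events.tar.gz -C /tmp/demo_restore

# 2. Load the extracted file into the partitioned table, from the hive
#    shell (the date value is a placeholder):
#
#   load data local inpath '/tmp/demo_restore/parking_events.csv'
#   into table zurich.parking_events_text_partition
#   partition (`date`='2014-01-01');

ls /tmp/demo_restore
```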