Hive & HDFS to HBase/Phoenix using Pig scripts

A few days ago I had to import data from Hive into an HBase/Phoenix database for my project, and the scripts I used at that time are given below.

Hive table schema

col_name	data_type
roadid	int
year	int
month	int
day	int
hour	int
total_trips	bigint

Here is a snapshot of the Hive table data that was stored into the Phoenix table:

0	1612342	2016	4	1	3	3
1	1612342	2016	4	1	4	1
2	1612342	2016	4	1	5	8
3	1612342	2016	4	2	4	1
4	1612342	2016	4	2	7	4
5	1612342	2016	4	2	8	2
6	1612342	2016	4	2	9	2
7	1612342	2016	4	2	10	2
8	1612342	2016	4	2	11	2
9	1612342	2016	4	2	12	4
10	1612342	2016	4	3	11	1
11	1612342	2016	4	3	12	2
12	1612342	2016	4	3	13	4
13	1612342	2016	4	3	14	2
14	1612342	2016	4	3	15	3
15	1612342	2016	4	3	16	2
16	1612342	2016	4	3	17	4
17	1612342	2016	4	4	18	4
18	1612342	2016	4	6	2	1
19	1612342	2016	4	7	2	1

Pig Script

REGISTER /usr/hdp/current/phoenix-client/phoenix-client.jar

waypoints = LOAD 'zurich.zurich_waypoints_orc_cum' USING org.apache.hive.hcatalog.pig.HCatLoader();

STORE waypoints INTO 'hbase://waypoint_cum' USING
org.apache.phoenix.pig.PhoenixHBaseStorage('namenode.selise.ch', '-batchSize 5000');

Here you can see that I have used the phoenix-client.jar that ships with my Hortonworks distribution; I simply registered the jar so the Pig script can use the Phoenix storage classes.
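PhoenixHBaseStorage writes into an already existing Phoenix table, so waypoint_cum has to be created up front (for example from sqlline). Here is a minimal sketch of what that DDL could look like, assuming roadid, year, month, day and hour together form the composite primary key (the key choice is my assumption; the column names mirror the Hive schema above):

CREATE TABLE IF NOT EXISTS waypoint_cum (
    roadid      INTEGER NOT NULL,
    year        INTEGER NOT NULL,
    month       INTEGER NOT NULL,
    day         INTEGER NOT NULL,
    hour        INTEGER NOT NULL,
    total_trips BIGINT
    CONSTRAINT pk PRIMARY KEY (roadid, year, month, day, hour)
);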

Sample HDFS data that I stored in my Phoenix DB using a Pig script:

1887574,2016-04-05T10:26:43.000Z,8.6782,47.4279,c5a3d5c06036818187068cbf82d81f87,2016,04,05,10
1887574,2016-04-05T10:27:29.000Z,8.6829,47.4375,c5a3d5c06036818187068cbf82d81f87,2016,04,05,10
1887574,2016-04-05T10:28:54.000Z,8.7012,47.451,c5a3d5c06036818187068cbf82d81f87,2016,04,05,10
1822482,2016-04-07T16:26:50.000Z,8.5743,47.417,36b454bb2f66270d192d55fb35f68a59,2016,04,07,16
1974545,2016-04-07T16:27:32.000Z,8.5743,47.4221,36b454bb2f66270d192d55fb35f68a59,2016,04,07,16
1974545,2016-04-07T16:27:37.000Z,8.5742,47.4222,36b454bb2f66270d192d55fb35f68a59,2016,04,07,16
1974545,2016-04-07T16:27:42.000Z,8.5742,47.4223,36b454bb2f66270d192d55fb35f68a59,2016,04,07,16
1974545,2016-04-07T16:27:47.000Z,8.5742,47.4224,36b454bb2f66270d192d55fb35f68a59,2016,04,07,16
1974545,2016-04-07T16:27:52.000Z,8.5741,47.4225,36b454bb2f66270d192d55fb35f68a59,2016,04,07,16
1974545,2016-04-07T16:28:17.000Z,8.5741,47.4225,36b454bb2f66270d192d55fb35f68a59,2016,04,07,16
1974545,2016-04-07T16:28:18.000Z,8.574,47.4226,36b454bb2f66270d192d55fb35f68a59,2016,04,07,16
1974545,2016-04-07T16:28:21.000Z,8.5739,47.4227,36b454bb2f66270d192d55fb35f68a59,2016,04,07,16
1974545,2016-04-07T16:28:22.000Z,8.5738,47.4227,36b454bb2f66270d192d55fb35f68a59,2016,04,07,16
1974545,2016-04-07T16:28:27.000Z,8.5735,47.423,36b454bb2f66270d192d55fb35f68a59,2016,04,07,16
1974545,2016-04-07T16:28:28.000Z,8.5734,47.4231,36b454bb2f66270d192d55fb35f68a59,2016,04,07,16

The script I used to import the HDFS data into Phoenix/HBase:

REGISTER /usr/hdp/current/phoenix-client/phoenix-client.jar

waypoints = LOAD 'hdfs://datanode1.selise.ch/home/admin/pyspark/zurich-road-waypoints-trimmed/part-m-000[0-9]*'
USING PigStorage(',')
AS (roadid:int, capture_date:chararray, longitude:double, latitude:double, tripid:chararray, year:int, month:int, day:int, hour:int);

STORE waypoints INTO 'hbase://waypoint' USING
org.apache.phoenix.pig.PhoenixHBaseStorage('namenode.selise.ch', '-batchSize 5000');
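As with the first script, the waypoint Phoenix table has to exist before the STORE runs. A sketch that mirrors the PigStorage schema above; the choice of primary key columns here is my assumption:

CREATE TABLE IF NOT EXISTS waypoint (
    roadid       INTEGER NOT NULL,
    capture_date VARCHAR NOT NULL,
    longitude    DOUBLE,
    latitude     DOUBLE,
    tripid       VARCHAR NOT NULL,
    year         INTEGER,
    month        INTEGER,
    day          INTEGER,
    hour         INTEGER
    CONSTRAINT pk PRIMARY KEY (roadid, capture_date, tripid)
);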

Now let's run the scripts to test:


pig  -x mapreduce hive2phoenix.pig

pig -x mapreduce hdfs2phoenix.pig

1.7 billion weather records analysis: around 28 years of data for two different places

Hello guys, today I am going to give you some idea of how I handled 1.7 billion weather records, which cover around 28 years of data. Our main target was to generate some analytics from this weather data. We handled only temperature and humidity data, which comes to around 22GB.

Now, one of the main challenges was how to download the data from the client's server. The URL provided by the client was good enough for downloading, and I did it with some basic shell commands. At first I considered downloading the data with xargs, but I changed my mind and did not use it, because I was in no hurry to download the data and my local PC's bandwidth was not high enough to benefit from xargs.

You could use xargs to download the data from a specific location

downloader.sh

d=1990-01-01
while [ "$d" != 2017-04-05 ]; do 
  echo  https://xxxxxxxx.com/"$d"T00:00:00ZXXXXXX/xxxxxxxxxx/xxxxxx.xxxxxxxxx/csv
  d=$(date -I -d "$d + 1 day")
done
[root@namenode.selise.ch ~]# chmod +x downloader.sh
[root@namenode.selise.ch ~]# ./downloader.sh > urls.txt
[root@namenode.selise.ch ~]# cat urls.txt | xargs -n 1 -P 6 wget -P data/

but what I actually used for the download is given below

downloader.sh

d=1990-01-01
while [ "$d" != 2017-04-05 ]; do 
  echo  wget https://xxxxxxxx.com/"$d"T00:00:00ZXXXXXX/xxxxxxxxxx/xxxxxx.xxxxxxxxx/csv
  d=$(date -I -d "$d + 1 day")
done

chmod +x downloader.sh
./downloader.sh > urls.txt
chmod +x urls.txt
./urls.txt

It took almost 48 hours on my low-bandwidth connection.

Now I am ready to store the data in HDFS for analysis:

hadoop fs -put *.csv /user/root/weather

Storing the data in HDFS is also time-consuming.
Let's see how many records there are (counted from the final ORC table that we create below):


hive> select count(*) from WeatherTemperatureHumidity_orc_update;
OK
1719639510
Time taken: 1.792 seconds, Fetched: 1 row(s)

Now let's create the external Hive table over the HDFS data, and the ORC tables derived from it:

CREATE EXTERNAL TABLE IF NOT EXISTS WeatherTemperatureHumidity(
        validdate string,
        temperature float,
        humidity float
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ';'
    STORED AS TEXTFILE
    location '/user/root/weather';
CREATE TABLE IF NOT EXISTS WeatherTemperatureHumidity_orc(
        validdate string,
        temperature double,
        humidity double,
        suburb string)
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ';'
    STORED AS ORC;
    
INSERT OVERWRITE TABLE WeatherTemperatureHumidity_orc SELECT w.*, "xxxx" FROM WeatherTemperatureHumidity w;
DROP TABLE IF EXISTS WeatherTemperatureHumidity_orc_update;

CREATE TABLE IF NOT EXISTS WeatherTemperatureHumidity_orc_update (
        `year` string,
        `month` string,
        `date` string,
        validdate string,
        temperature double,
        humidity double,
        suburb string)
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ';'
    STORED AS ORC;

INSERT OVERWRITE TABLE WeatherTemperatureHumidity_orc_update select year(to_date(w.validdate)), month(to_date(w.validdate)), to_date(w.validdate), w.validdate, w.temperature, w.humidity, w.suburb from WeatherTemperatureHumidity_orc w where w.validdate is not NULL;

Now you are ready to start your analytics on the Hive table. Here I have tried to find the minimum and maximum temperature per year from the weather Hive table; the same pattern works for humidity, as shown in the sketch further below.

hive> select `year`, min(temperature), max(temperature) from WeatherTemperatureHumidity_orc_update where suburb='XXXX' group by `year` order by `year` desc;
Query ID = root_20170406182432_92f15cf5-5a23-4250-8739-288676e0589d
Total jobs = 1
Launching Job 1 out of 1


Status: Running (Executing on YARN cluster with App id application_1491480905379_0001)

--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 ..........   SUCCEEDED     38         38        0        0       0       0
Reducer 2 ......   SUCCEEDED    253        253        0        0       0       0
Reducer 3 ......   SUCCEEDED      1          1        0        0       0       0
--------------------------------------------------------------------------------
VERTICES: 03/03  [==========================>>] 100%  ELAPSED TIME: 48.97 s
--------------------------------------------------------------------------------
OK
2017    -12.399999618530273     20.100000381469727
2016    -6.900000095367432      31.200000762939453
2015    -9.699999809265137      33.099998474121094
2014    -15.800000190734863     30.700000762939453
2013    -13.199999809265137     30.5
2012    -19.299999237060547     30.700000762939453
2011    -12.199999809265137     30.0
2010    -15.199999809265137     29.399999618530273
2009    -16.5   28.5
2008    -12.800000190734863     27.700000762939453
2007    -14.199999809265137     29.600000381469727
2006    -11.0   29.200000762939453
2005    -15.600000381469727     28.5
2004    -10.699999809265137     26.5
2003    -15.0   31.700000762939453
2002    -13.300000190734863     29.799999237060547
2001    -16.899999618530273     30.100000381469727
2000    -16.899999618530273     28.700000762939453
1999    -15.0   28.0
1998    -15.0   29.600000381469727
1997    -11.399999618530273     26.799999237060547
1996    -16.799999237060547     27.100000381469727
1995    -19.399999618530273     28.799999237060547
1994    -13.699999809265137     29.700000762939453
1993    -18.600000381469727     28.700000762939453
1992    -18.100000381469727     29.799999237060547
1991    -18.799999237060547     30.600000381469727
1990    -11.699999809265137     29.100000381469727
NULL    NULL    NULL
Time taken: 53.314 seconds, Fetched: 29 row(s)

Now let's look at another suburb:

hive> select `year`, min(temperature), max(temperature) from WeatherTemperatureHumidity_orc_update where suburb='XXXXXXX' group by `year` order by `year` desc;
Query ID = root_20170406182754_271dd6cb-b69e-4fd3-b4c0-3e70d284fa1d
Total jobs = 1
Launching Job 1 out of 1


Status: Running (Executing on YARN cluster with App id application_1491480905379_0001)

--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 ..........   SUCCEEDED     38         38        0        0       0       0
Reducer 2 ......   SUCCEEDED    253        253        0        0       0       0
Reducer 3 ......   SUCCEEDED      1          1        0        0       0       0
--------------------------------------------------------------------------------
VERTICES: 03/03  [==========================>>] 100%  ELAPSED TIME: 47.21 s
--------------------------------------------------------------------------------
OK
2017    -12.399999618530273     19.700000762939453
2016    -7.300000190734863      30.600000381469727
2015    -9.100000381469727      32.79999923706055
2014    -15.300000190734863     30.600000381469727
2013    -13.199999809265137     30.299999237060547
2012    -19.5   30.600000381469727
2011    -12.399999618530273     29.899999618530273
2010    -15.399999618530273     29.200000762939453
2009    -16.600000381469727     28.299999237060547
2008    -13.0   27.5
2007    -14.399999618530273     29.399999618530273
2006    -11.199999809265137     29.0
2005    -15.699999809265137     28.399999618530273
2004    -10.800000190734863     26.399999618530273
2003    -15.199999809265137     31.5
2002    -13.399999618530273     29.700000762939453
2001    -16.899999618530273     29.899999618530273
2000    -16.799999237060547     28.600000381469727
1999    -15.199999809265137     27.899999618530273
1998    -15.100000381469727     29.399999618530273
1997    -11.300000190734863     26.600000381469727
1996    -17.0   27.0
1995    -19.5   28.700000762939453
1994    -13.899999618530273     29.600000381469727
1993    -18.799999237060547     28.5
1992    -18.200000762939453     29.600000381469727
1991    -18.899999618530273     30.5
1990    -11.800000190734863     29.0
NULL    NULL    NULL
Time taken: 50.031 seconds, Fetched: 29 row(s)

It takes around 61 seconds to complete over 1.7 billion records using a 3-node Hadoop cluster, where all my nodes use an Intel i5 with 16GB of RAM and my name node has 20GB.
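The same pattern gives the yearly humidity range; for example, swapping the temperature column for humidity on the same table:

select `year`, min(humidity), max(humidity) from WeatherTemperatureHumidity_orc_update where suburb='XXXX' group by `year` order by `year` desc;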

Parking spot analysis and migration techniques

My Hive cluster runs on 3 worker PCs. My name node has a Core i5 and 20GB of RAM, and the 2 other data nodes have Core i5 processors with 16GB of RAM. All of my PCs use SATA HDDs and run CentOS 7. My Hadoop cluster runs on Ambari and Hortonworks 2.4. If you don't know about Hortonworks, please download the sandbox from the Hortonworks website to start your journey with Hive: https://hortonworks.com/products/sandbox/.

Today I am not going to give instructions for the Hortonworks installation. I will write another post where you will find the procedure for installing Hortonworks on multiple nodes using Ambari.


hive:zurich> select count(*) from parking_events_text_partition;
   _col0
-----------
 229966246

Here you can see that there are almost 230 million records in my parking table.

select count(distinct(parking_status_event_id)) as total from zurich.parking_events_text_orc where parking_type = "Spot" and parking_city = "XXXXX" and parking_neighborhood = "XXXXXX" and parking_area = "XXXXXXXX" and parking_free =  0 and parking_event_time between from_utc_timestamp('2014-01-01 00:00:00.0','CET') and from_utc_timestamp('2015-08-01 00:00:00.0','CET')

Here I am counting the distinct parking status events from 2014-01-01 to 2015-08-01 in my parking_events_text_orc table, filtered on parking_free = 0, for a specific parking city, parking neighborhood and parking area.

Let's start the Hive shell to run the query against the ORC-format Hive table.

[root@namenode pyspark]# hive
WARNING: Use "yarn jar" to launch YARN applications.

Logging initialized using configuration in file:/etc/hive/2.4.2.0-258/0/hive-log4j.properties
hive> use zurich;
OK
Time taken: 2.04 seconds
hive> select count(distinct(parking_status_event_id)) as total from zurich.parking_events_text_orc where parking_type = "Spot" and parking_city = "XXXXXX" and parking_neighborhood = "XXXXXX" and parking_area = "XXXXXXX" and parking_free =  0 and parking_event_time between from_utc_timestamp('2014-01-01 00:00:00.0','CET') and from_utc_timestamp('2015-08-01 00:00:00.0','CET');

Here is the query running over the ORC-format Hive table:

hive> select count(distinct(parking_status_event_id)) as total from zurich.parking_events_text_orc where parking_type = "Spot" and parking_city = "XXXXX" and parking_neighborhood = "XXXXXX" and parking_area = "XXXXXXX" and parking_free =  0 and parking_event_time between from_utc_timestamp('2014-01-01 00:00:00.0','CET') and from_utc_timestamp('2015-08-01 00:00:00.0','CET');
Query ID = root_20170329193604_0956af6f-3720-4ef8-804a-18cf2ef36c10
Total jobs = 1
Launching Job 1 out of 1


Status: Running (Executing on YARN cluster with App id application_1490785713790_0001)

--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 ......         RUNNING     35         24       11        0       3       0
Reducer 2             INITED    118          0        0      118       0       0
Reducer 3             INITED      1          0        0        1       0       0
--------------------------------------------------------------------------------

VERTICES: 00/03 [====>>----------------------] 15% ELAPSED TIME: 25.89 s

--------------------------------------------------------------------------------

The query is running at this moment.

hive> select count(distinct(parking_status_event_id)) as total from zurich.parking_events_text_orc where parking_type = "Spot" and parking_city = "XXXXX" and parking_neighborhood = "XXXXXXX" and parking_area = "XXXXXXX" and parking_free =  0 and parking_event_time between from_utc_timestamp('2014-01-01 00:00:00.0','CET') and from_utc_timestamp('2015-08-01 00:00:00.0','CET');
Query ID = root_20170329193604_0956af6f-3720-4ef8-804a-18cf2ef36c10
Total jobs = 1
Launching Job 1 out of 1


Status: Running (Executing on YARN cluster with App id application_1490785713790_0001)

--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 ..........   SUCCEEDED     35         35        0        0       3       0
Reducer 2 ......   SUCCEEDED     15         15        0        0       0       0
Reducer 3 ......   SUCCEEDED      1          1        0        0       0       0
--------------------------------------------------------------------------------

VERTICES: 03/03 [==========================>>] 100% ELAPSED TIME: 48.14 s

--------------------------------------------------------------------------------
OK
29943
Time taken: 53.44 seconds, Fetched: 1 row(s)

And it takes almost 53 seconds to generate the output.

Now let's see the duration for the parking_events_text_parquet table, where the Hive table is in Parquet format.

hive> select count(distinct(parking_status_event_id)) as total from zurich.parking_events_text_parquet where parking_type = "Spot" and parking_city = "XXXXXX" and parking_neighborhood = "XXXXXXX" and parking_area = "XXXXXXX" and parking_free =  0 and parking_event_time between from_utc_timestamp('2014-01-01 00:00:00.0','CET') and from_utc_timestamp('2015-08-01 00:00:00.0','CET');
Query ID = root_20170329193806_c755690f-8839-4d12-aa4c-cc6122187cd6
Total jobs = 1
Launching Job 1 out of 1


Status: Running (Executing on YARN cluster with App id application_1490785713790_0001)

--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 ..........   SUCCEEDED     50         50        0        0       5       0
Reducer 2 ......   SUCCEEDED      1          1        0        0       0       0
Reducer 3 ......   SUCCEEDED      1          1        0        0       0       0
--------------------------------------------------------------------------------

VERTICES: 03/03 [==========================>>] 100% ELAPSED TIME: 68.80 s

--------------------------------------------------------------------------------
OK
29943
Time taken: 69.697 seconds, Fetched: 1 row(s)

And it took almost 69 seconds to complete.

Now let's analyze the same Hive table stored as plain text.

hive> select count(distinct(parking_status_event_id)) as total from zurich.parking_events_text_partition where parking_type = "Spot" and parking_city = "XXXXXX" and parking_neighborhood = "XXXXXX" and parking_area = "XXXXXX" and parking_free =  0 and parking_event_time between from_utc_timestamp('2014-01-01 00:00:00.0','CET') and from_utc_timestamp('2015-08-01 00:00:00.0','CET');
Query ID = root_20170329193950_b3f851d2-f49f-4977-931b-6815f953d31f
Total jobs = 1
Launching Job 1 out of 1


Status: Running (Executing on YARN cluster with App id application_1490785713790_0001)

--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1                RUNNING     36          0       20       16       0       0
Reducer 2             INITED     18          0        0       18       0       0
Reducer 3             INITED      1          0        0        1       0       0
--------------------------------------------------------------------------------

VERTICES: 00/03 [>>--------------------------] 0% ELAPSED TIME: 69.74 s

Now you can see that 69 seconds have already elapsed at 0% progress when the Hive table is stored as plain text.

Hive data migration

Now it's time to migrate the Hive data from one server to another. The data was not that big, less than 35GB and about 229 million records, but I still had to move it between servers. What I actually did at that time is given below.

First I ran a command to export the table in CSV format; a sketch of one way to do that is shown below.
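The exact export command is not shown here; one way to dump the table to comma-delimited text from Hive is roughly the following (a minimal sketch, where /tmp/parking_export is just an example path):

INSERT OVERWRITE LOCAL DIRECTORY '/tmp/parking_export'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT * FROM zurich.parking_events_text_partition;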

You could see that almost 37GB of data was generated once the export command completed.

Then I created a tar archive to compress the data, which brought it down to around 3.9GB.

Now we are almost done moving the data from one server to the other. The next challenge is importing that data on the new server: you need to recreate the table you are importing into, which was partitioned in the original Hive setup. The schema looked like this:

create table parking_events_text_partition
(
	parking_id string,
	parking_type string,
	parking_size int,
	parking_free int,
	parking_longitude double,
        parking_latitude double,
	parking_permit_types string,
	parking_country string,
	parking_city string,
	parking_neighborhood string,
	parking_area string,
	parking_address string,
	parking_event_type string,
	parking_event_time timestamp,
	parking_status_event_id string
)
partitioned by (`date` string)  
ROW FORMAT DELIMITED  
FIELDS TERMINATED BY ',';

Parquet

create table parking_events_text_parquet
(
	parking_id string,
	parking_type string,
	parking_size int,
	parking_free int,
	parking_longitude double,
        parking_latitude double,
	parking_permit_types string,
	parking_country string,
	parking_city string,
	parking_neighborhood string,
	parking_area string,
	parking_address string,
	parking_event_type string,
	parking_event_time timestamp,
	parking_status_event_id string,
    `date` string
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS parquet;
insert into parking_events_text_parquet select * from parking_events_text_partition;

ORC

create table parking_events_text_orc
(
	parking_id string,
	parking_type string,
	parking_size int,
	parking_free int,
	parking_longitude double,
        parking_latitude double,
	parking_permit_types string,
	parking_country string,
	parking_city string,
	parking_neighborhood string,
	parking_area string,
	parking_address string,
	parking_event_type string,
	parking_event_time timestamp,
	parking_status_event_id string,
        `date` string
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS orc;
insert into parking_events_text_orc select * from parking_events_text_partition;

Now you need to restore your data into your Hive table. Since the target table is partitioned, a simple approach is to load the exported CSV into a plain text staging table and then insert from it with dynamic partitioning; a sketch follows.
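This is a minimal sketch under a few assumptions: the exported CSV files were copied to /tmp/parking_export on the new server, and parking_events_text_staging is a hypothetical staging table that mirrors the column list shown above, stored as text.

-- staging table: same columns as the export, including the `date` column last
CREATE TABLE IF NOT EXISTS parking_events_text_staging
(
	parking_id string,
	parking_type string,
	parking_size int,
	parking_free int,
	parking_longitude double,
	parking_latitude double,
	parking_permit_types string,
	parking_country string,
	parking_city string,
	parking_neighborhood string,
	parking_area string,
	parking_address string,
	parking_event_type string,
	parking_event_time timestamp,
	parking_status_event_id string,
	`date` string
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;

-- load the exported CSV files into the staging table
LOAD DATA LOCAL INPATH '/tmp/parking_export' INTO TABLE parking_events_text_staging;

-- dynamic partitioning so each `date` value lands in its own partition
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE parking_events_text_partition PARTITION (`date`)
SELECT * FROM parking_events_text_staging;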

Biketrips and parking data benchmark using hive and presto

Today I will give you some idea regarding Hive and Presto. Hive is a distributed data warehouse system where you can read, write and manage your data through SQL. Presto is another SQL engine, built for interactive analytic queries against data ranging from terabytes to petabytes. At Facebook, Presto is reportedly used by 1,000 employees against around 300 petabytes of data, running more than 30,000 queries over that Presto cluster.

Recently I set up a Presto cluster in my local environment, with 3 worker nodes and a coordinator, running Presto on top of Hive. I experimented with the Presto cluster over a 229 million record data set. It is basically parking spot statistical data from one of my small-scale big data projects, where I had to handle and migrate the data from one server to another. But that was not the main issue; I mainly had to measure performance after being told that Apache Presto is a good choice for interactive analytic queries.

Here are some of the Hive and Presto benchmarks I recorded during the migration from one server to the other.


Figure: Here is my parking events text parquet schema which was used during my migration


Figure: Year-wise number of parking events with Presto over Hive in ORC format; it takes 2:01 for 1 worker node.


Figure: Year-wise parking events over Hive in ORC format; it takes only 53 seconds, a big gap compared to Presto, because Hive is running on a 3-node cluster while Presto is on a single node.
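The aggregate behind these year-wise figures is a simple group-by; roughly, it looks like the sketch below (in Presto the same table would be addressed through the Hive connector, e.g. hive.zurich.parking_events_text_orc, and the aliases here are just for illustration):

select year(parking_event_time) as event_year, count(*) as total_events
from zurich.parking_events_text_orc
group by year(parking_event_time)
order by event_year desc;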


Figure: Total number of entries with Presto over Hive in ORC format; it takes 0:03, which is very fast for 229M records.


Figure: Total number of entries with Hive in ORC format; it takes around 0.29 seconds, which is quite a bit longer than Presto.

Now it's time to have a look at Hive in Parquet format.

Figure: Presto takes almost 6 to 9 seconds on a single node.


Figure: Now you can see that Hive takes only 0.51 seconds with 3 worker nodes.

Biketrips data analysis

Here is another data set: bike trips. I have loaded the data for comparison. You can easily download the data from the internet.

ORC format for both presto and hive

PARQUET format for both presto and hive

Uber data analysis

select * from trips_orc where trip_id is not NULL limit 10
 
trips_orc.trip_id	trips_orc.vendor_id	trips_orc.pickup_datetime	trips_orc.dropoff_datetime	trips_orc.store_and_fwd_flag	trips_orc.rate_code_id	trips_orc.pickup_longitude	trips_orc.pickup_latitude	trips_orc.dropoff_longitude	trips_orc.dropoff_latitude	trips_orc.passenger_count	trips_orc.trip_distance	trips_orc.fare_amount	trips_orc.extra	trips_orc.mta_tax	trips_orc.tip_amount	trips_orc.tolls_amount	trips_orc.ehail_fee	trips_orc.improvement_surcharge	trips_orc.total_amount	trips_orc.payment_type	trips_orc.trip_type	trips_orc.pickup	trips_orc.dropoff	trips_orc.cab_type	trips_orc.precipitation	trips_orc.snow_depth	trips_orc.snowfall	trips_orc.max_temperature	trips_orc.min_temperature	trips_orc.average_wind_speed	trips_orc.pickup_nyct2010_gid	trips_orc.pickup_ctlabel	trips_orc.pickup_borocode	trips_orc.pickup_boroname	trips_orc.pickup_ct2010	trips_orc.pickup_boroct2010	trips_orc.pickup_cdeligibil	trips_orc.pickup_ntacode	trips_orc.pickup_ntaname	trips_orc.pickup_puma	trips_orc.dropoff_nyct2010_gid	trips_orc.dropoff_ctlabel	trips_orc.dropoff_borocode	trips_orc.dropoff_boroname	trips_orc.dropoff_ct2010	trips_orc.dropoff_boroct2010	trips_orc.dropoff_cdeligibil	trips_orc.dropoff_ntacode	trips_orc.dropoff_ntaname	trips_orc.dropoff_puma
0	2	201	2016-01-01 00:39:36.0	NULL	1	-73	40.68061065673828	-73.92427825927734	40.69804382324219	1	1	8	0.5	0.5	1.86	0	NULL	0.3	11.16	1	1	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL
1	2	201	2016-01-01 00:39:18.0	NULL	1	-73	40.72317504882813	-73.92391967773438	40.76137924194336	1	3	15.5	0.5	0.5	0	0	NULL	0.3	16.8	2	1	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL
2	2	201	2016-01-01 00:39:48.0	NULL	1	-73	40.67610549926758	-74.0131607055664	40.64607238769531	1	3	16.5	0.5	0.5	4.45	0	NULL	0.3	22.25	1	1	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL
3	2	201	2016-01-01 00:38:32.0	NULL	1	-73	40.66957855224609	-74.00064849853516	40.68903350830078	1	3	13.5	0.5	0.5	0	0	NULL	0.3	14.8	2	1	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL
4	2	201	2016-01-01 00:39:22.0	NULL	1	-73	40.68285369873047	-73.94071960449219	40.66301345825195	1	2	12	0.5	0.5	0	0	NULL	0.3	13.3	2	1	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL
5	2	201	2016-01-01 00:39:35.0	NULL	1	-73	40.74645614624023	-73.86774444580078	40.74211120605469	1	1	7	0.5	0.5	0	0	NULL	0.3	8.3	2	1	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL
6	2	201	2016-01-01 00:39:21.0	NULL	1	-73	40.74619674682617	-73.88619232177734	40.74568939208984	1	0	5	0.5	0.5	0	0	NULL	0.3	6.3	2	1	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL
7	2	201	2016-01-01 00:39:36.0	NULL	1	-73	40.80355834960938	-73.94915008544922	40.79412078857422	1	1	7	0.5	0.5	0	0	NULL	0.3	8.3	2	1	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL
8	2	201	2016-01-01 00:39:52.0	NULL	1	-73	40.70281600952148	-73.97157287597656	40.67972564697266	1	2	12	0.5	0.5	2	0	NULL	0.3	15.3	1	1	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL
9	2	201	2016-01-01 00:39:23.0	NULL	1	-73	40.75664138793945	-73.91754913330078	40.73965835571289	1	1	9	0.5	0.5	1.6	0	NULL	0.3	11.9	1	1	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL

Here is another data set: 2016 Uber trips for New York City. I have used only two to three months of data for the analysis.

Next time I will give you more detailed ideas about Presto and Hive. Today I have focused on a single-node Presto cluster, but next time I will introduce a multi-node Presto cluster experiment for your convenience.