Biketrips and parking data benchmark using hive and presto

Today I will give you some idea regrading hive and presto. Hive is a distributed data warehouse system where you could be able to read, write and manage your data through SQL. And presto is a another SQL Engine for interactive analytic queries against data. Presto can handle terabytes to petabyte range data. Currently presto is managing 300Petabytes of data in Facebook by 1000 engineers. They are running more than 30000 queries over this presto Custer.

Recently I have setup the presto cluster in my local environment. Where I have used 3 worker node and a coordinator and I have used presto over hive. I have experimented the presto Custer over 229 Million data sets. Its basically a parking spot statistical data which was one my small scale bigdata project where I had to handle and migrate the data from one server to another server. But this is not the main issues, actually I had to experiment the server performance once I informed apache presto is a kind of good choice for data interactive analytic queries.

Here are some of hive and presto benchmark what I have found during the migration from one server to another server.


Figure: Here is my parking events text parquet schema which was used during my migration


Figure: Year wise number of parking events when presto over hive is orc format and it takes 2:01 seconds for 1 worker nodes


Figure: Year wise parking events over hive orc format and it takes only 53 seconds which quite huge compare to presto because hive running for 3 node cluster and presto is for single node.


Figure: Total number of entry when presto over hive orc format and it takes 0:03 seconds which is very fast for 229M records


Figure: Total number of entry when hive orc format and it takes around 0.29 secons which is quite larger than presto.

Now its time to have a look on the hive as a parquet format.

Figure: Presto takes almost 6 to 9 second for single node


Figure: Now you can see that hive takes only 0.51 second for 3 worker node.

Biketrips data analysis

Here is a another data set which is biketrips. I have loaded the data for comparison. You could easily download he data from internet.

ORC format for both presto and hive

PARQUET format for both presto and hive

Uber data analysis

select * from trips_orc where trip_id is not NULL limit 10
 
trips_orc.trip_id	trips_orc.vendor_id	trips_orc.pickup_datetime	trips_orc.dropoff_datetime	trips_orc.store_and_fwd_flag	trips_orc.rate_code_id	trips_orc.pickup_longitude	trips_orc.pickup_latitude	trips_orc.dropoff_longitude	trips_orc.dropoff_latitude	trips_orc.passenger_count	trips_orc.trip_distance	trips_orc.fare_amount	trips_orc.extra	trips_orc.mta_tax	trips_orc.tip_amount	trips_orc.tolls_amount	trips_orc.ehail_fee	trips_orc.improvement_surcharge	trips_orc.total_amount	trips_orc.payment_type	trips_orc.trip_type	trips_orc.pickup	trips_orc.dropoff	trips_orc.cab_type	trips_orc.precipitation	trips_orc.snow_depth	trips_orc.snowfall	trips_orc.max_temperature	trips_orc.min_temperature	trips_orc.average_wind_speed	trips_orc.pickup_nyct2010_gid	trips_orc.pickup_ctlabel	trips_orc.pickup_borocode	trips_orc.pickup_boroname	trips_orc.pickup_ct2010	trips_orc.pickup_boroct2010	trips_orc.pickup_cdeligibil	trips_orc.pickup_ntacode	trips_orc.pickup_ntaname	trips_orc.pickup_puma	trips_orc.dropoff_nyct2010_gid	trips_orc.dropoff_ctlabel	trips_orc.dropoff_borocode	trips_orc.dropoff_boroname	trips_orc.dropoff_ct2010	trips_orc.dropoff_boroct2010	trips_orc.dropoff_cdeligibil	trips_orc.dropoff_ntacode	trips_orc.dropoff_ntaname	trips_orc.dropoff_puma
0	2	201	2016-01-01 00:39:36.0	NULL	1	-73	40.68061065673828	-73.92427825927734	40.69804382324219	1	1	8	0.5	0.5	1.86	0	NULL	0.3	11.16	1	1	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL
1	2	201	2016-01-01 00:39:18.0	NULL	1	-73	40.72317504882813	-73.92391967773438	40.76137924194336	1	3	15.5	0.5	0.5	0	0	NULL	0.3	16.8	2	1	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL
2	2	201	2016-01-01 00:39:48.0	NULL	1	-73	40.67610549926758	-74.0131607055664	40.64607238769531	1	3	16.5	0.5	0.5	4.45	0	NULL	0.3	22.25	1	1	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL
3	2	201	2016-01-01 00:38:32.0	NULL	1	-73	40.66957855224609	-74.00064849853516	40.68903350830078	1	3	13.5	0.5	0.5	0	0	NULL	0.3	14.8	2	1	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL
4	2	201	2016-01-01 00:39:22.0	NULL	1	-73	40.68285369873047	-73.94071960449219	40.66301345825195	1	2	12	0.5	0.5	0	0	NULL	0.3	13.3	2	1	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL
5	2	201	2016-01-01 00:39:35.0	NULL	1	-73	40.74645614624023	-73.86774444580078	40.74211120605469	1	1	7	0.5	0.5	0	0	NULL	0.3	8.3	2	1	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL
6	2	201	2016-01-01 00:39:21.0	NULL	1	-73	40.74619674682617	-73.88619232177734	40.74568939208984	1	0	5	0.5	0.5	0	0	NULL	0.3	6.3	2	1	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL
7	2	201	2016-01-01 00:39:36.0	NULL	1	-73	40.80355834960938	-73.94915008544922	40.79412078857422	1	1	7	0.5	0.5	0	0	NULL	0.3	8.3	2	1	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL
8	2	201	2016-01-01 00:39:52.0	NULL	1	-73	40.70281600952148	-73.97157287597656	40.67972564697266	1	2	12	0.5	0.5	2	0	NULL	0.3	15.3	1	1	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL
9	2	201	2016-01-01 00:39:23.0	NULL	1	-73	40.75664138793945	-73.91754913330078	40.73965835571289	1	1	9	0.5	0.5	1.6	0	NULL	0.3	11.9	1	1	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL

Here is a another data set which is 2016 uber trips for New York city. I have used only 3/2 month data for analysis.

Next time I will give you the more details idea over presto and hive. Today I have focused the single node presto cluster but next time I will introduce the multi node presto cluster experiment for your convenience.

Leave a Reply