How to set up a multi-node Presto cluster

First of all, download the presto-server tarball and the Presto CLI jar. Here both are fetched from Maven Central.

[root@namenode admin]# wget https://repo1.maven.org/maven2/com/facebook/presto/presto-server/0.164/presto-server-0.164.tar.gz
--2017-03-29 12:22:26--  https://repo1.maven.org/maven2/com/facebook/presto/presto-server/0.164/presto-server-0.164.tar.gz
Resolving repo1.maven.org (repo1.maven.org)... 151.101.24.209
Connecting to repo1.maven.org (repo1.maven.org)|151.101.24.209|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 406230611 (387M) [application/x-gzip]
Saving to: ‘presto-server-0.164.tar.gz’

 0% [                                                                                                                                                     ] 981,144     68.6KB/s  eta 69m 7s
[root@namenode admin]# tar -xvf presto-server-0.164.tar.gz
[root@namenode admin]# cd presto-server-0.164/
[root@namenode presto-server-0.164]# ls
bin  etc  lib  NOTICE  plugin  README.txt
[root@namenode presto-server-0.164]# wget https://repo1.maven.org/maven2/com/facebook/presto/presto-cli/0.164/presto-cli-0.164-executable.jar
--2017-03-29 12:28:41--  https://repo1.maven.org/maven2/com/facebook/presto/presto-cli/0.164/presto-cli-0.164-executable.jar
Resolving repo1.maven.org (repo1.maven.org)... 151.101.24.209
Connecting to repo1.maven.org (repo1.maven.org)|151.101.24.209|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 15093095 (14M) [application/java-archive]
Saving to: ‘presto-cli-0.164-executable.jar’

 5% [======>                                                                                                                                              ] 794,624      202KB/s  eta 72s    

[root@namenode presto-server-0.164]# mv presto-cli-0.164-executable.jar presto
[root@namenode presto-server-0.164]# sudo chmod +x presto

Now create a folder named "etc" in your Presto installation directory, and a folder named "catalog" inside it. In the catalog folder, create two properties files: hive.properties and jmx.properties. Then create four more files directly inside etc: config.properties, jvm.config, log.properties and node.properties. The resulting folder structure is shown below.

[root@namenode presto-server-0.164]# tree etc/
etc/
├── catalog
│   ├── hive.properties
│   └── jmx.properties
├── config.properties
├── jvm.config
├── log.properties
└── node.properties

1 directory, 6 files
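
The layout above can be scaffolded with a few commands. This is only a sketch: PRESTO_HOME is an assumed variable name (not something Presto itself reads), and it falls back to a scratch directory so the script can be tried without touching a real installation.

```shell
#!/bin/sh
# Sketch: create the etc/ layout shown above with empty files.
# PRESTO_HOME is an assumed variable; by default a scratch directory is used.
set -e
PRESTO_HOME="${PRESTO_HOME:-$(mktemp -d)}"

mkdir -p "$PRESTO_HOME/etc/catalog"

# Two catalog files plus the four top-level files.
# touch leaves the content of already existing files untouched.
for f in catalog/hive.properties catalog/jmx.properties \
         config.properties jvm.config log.properties node.properties; do
  touch "$PRESTO_HOME/etc/$f"
done

# List what exists now, relative to etc/.
(cd "$PRESTO_HOME/etc" && find . -type f | sort)
```

Point PRESTO_HOME at the unpacked presto-server-0.164 directory to create the files in place, then fill them with the contents shown in the following sections.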

Now you are ready to fill in the files in your Presto etc folder. Below is the content of each file; you can simply paste it into the file of the same name and adjust the host names to your environment.
hive.properties

[root@namenode presto-server-0.164]# cat etc/catalog/hive.properties
hive.metastore.uri=thrift://datanode2.selise.ch:9083
connector.name=hive-hadoop2

jmx.properties

[root@namenode presto-server-0.164]# cat etc/catalog/jmx.properties
connector.name=jmx

config.properties

[root@namenode presto-server-0.164]# cat etc/config.properties
coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=8081
query.max-memory=800MB
query.max-memory-per-node=200MB
discovery-server.enabled=true
discovery.uri=http://namenode.selise.ch:8081

jvm.config

[root@namenode presto-server-0.164]# cat etc/jvm.config
-server
-Xmx800M
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:+UseGCOverheadLimit
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
-XX:OnOutOfMemoryError=kill -9 %p

node.properties

[root@namenode presto-server-0.164]# cat etc/node.properties
node.environment=dev
node.id=presto-node-cordinator
node.data-dir=/root/datap

log.properties

[root@namenode presto-server-0.164]# cat etc/log.properties
com.facebook.presto=INFO

Now create the data directory where Presto will store its logs and other data. I have set node.data-dir=/root/datap in node.properties, so that is the directory to create.
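
A small sketch of that step: read node.data-dir from node.properties and create the directory. To keep it safe to try, it works on a scratch copy of node.properties whose data directory sits under a temporary folder instead of /root/datap; the variable names work and data_dir are made up for this example.

```shell
#!/bin/sh
# Sketch: create the directory named by node.data-dir.
# Uses a scratch copy of node.properties, not the real file.
set -e
work="$(mktemp -d)"

cat > "$work/node.properties" <<EOF
node.environment=dev
node.id=presto-node-cordinator
node.data-dir=$work/datap
EOF

# Take everything after the first "=" on the node.data-dir line.
data_dir="$(grep '^node\.data-dir=' "$work/node.properties" | cut -d= -f2-)"

mkdir -p "$data_dir"
echo "data directory ready: $data_dir"
```

On a real node the same grep and mkdir lines can be run against etc/node.properties.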

Now you are ready to start your presto server:

[root@namenode presto-server-0.164]# ls
bin  etc  lib  NOTICE  plugin  presto  presto-cli-0.170-executable.jar  README.txt

[root@namenode presto-server-0.164]# ./bin/launcher run
2017-03-29T13:00:43.264+0600    INFO    main    Bootstrap       hive.metastore                                     thrift      thrift
2017-03-29T13:00:43.264+0600    INFO    main    Bootstrap       hive.allow-add-column                              false       false                                Allow Hive connector to add column
2017-03-29T13:00:43.264+0600    INFO    main    Bootstrap       hive.allow-drop-table                              false       false                                Allow Hive connector to drop table
2017-03-29T13:00:43.264+0600    INFO    main    Bootstrap       hive.allow-rename-column                           false       false                                Allow Hive connector to rename column
2017-03-29T13:00:43.264+0600    INFO    main    Bootstrap       hive.allow-rename-table                            false       false                                Allow Hive connector to rename table
2017-03-29T13:00:43.264+0600    INFO    main    Bootstrap       hive.security                                      legacy      legacy
2017-03-29T13:00:43.264+0600    INFO    main    Bootstrap
2017-03-29T13:00:44.572+0600    INFO    main    com.facebook.presto.metadata.StaticCatalogStore -- Added catalog hive using connector hive-hadoop2 --
2017-03-29T13:00:44.574+0600    INFO    main    com.facebook.presto.execution.resourceGroups.InternalResourceGroupManager       -- Loading resource group configuration manager --
2017-03-29T13:00:44.575+0600    INFO    main    com.facebook.presto.execution.resourceGroups.InternalResourceGroupManager       -- Loaded resource group configuration manager legacy --
2017-03-29T13:00:44.575+0600    INFO    main    com.facebook.presto.security.AccessControlManager       -- Loading system access control --
2017-03-29T13:00:44.575+0600    INFO    main    com.facebook.presto.security.AccessControlManager       -- Loaded system access control allow-all --
2017-03-29T13:00:44.926+0600    INFO    main    com.facebook.presto.server.PrestoServer ======== SERVER STARTED ========

You now have a single-node cluster, where the coordinator and a worker run on the same machine.

Let's set up the multi-worker Presto cluster

I have three machines that act as workers (namenode.selise.ch is also the coordinator):

namenode.selise.ch
datanode1.selise.ch
datanode2.selise.ch

Next I copied presto-server-0.164.tar.gz to datanode1.selise.ch and datanode2.selise.ch and configured each of them as shown below. Here is the worker node configuration for datanode1.selise.ch.

[root@namenode presto-server-0.164]# scp -r ../presto-server-0.164.tar.gz datanode1.selise.ch:/home/admin

[root@namenode presto-server-0.164]# ssh datanode1.selise.ch
Last login: Mon Mar 27 14:48:46 2017 from namenode.selise.ch

[root@datanode1 ~]# cd /home/admin
[root@datanode1 admin]# tar -xvf presto-server-0.164.tar.gz
[root@datanode1 admin]# cd presto-server-0.164/

[root@datanode1 presto-server-0.164]# tree etc/
etc/
├── catalog
│   ├── hive.properties
│   └── jmx.properties
├── config.properties
├── jvm.config
├── log.properties
└── node.properties

1 directory, 6 files

config.properties

[root@datanode1 presto-server-0.164]# cat etc/config.properties
coordinator=false
http-server.http.port=8081
query.max-memory=800MB
query.max-memory-per-node=200MB
discovery.uri=http://namenode.selise.ch:8081

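Here I have disabled the coordinator role by setting coordinator=false, and discovery.uri points to my Presto coordinator machine, namenode.selise.ch. Compared with the coordinator's file, the node-scheduler.include-coordinator and discovery-server.enabled lines are gone.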
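
The worker file can also be derived mechanically from the coordinator's config.properties. The following is a sketch of that idea using sed on scratch copies; the file names coordinator.properties and worker.properties are made up for the example.

```shell
#!/bin/sh
# Sketch: turn the coordinator's config.properties into a worker config.
set -e
work="$(mktemp -d)"

# Scratch copy of the coordinator configuration shown earlier.
cat > "$work/coordinator.properties" <<'EOF'
coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=8081
query.max-memory=800MB
query.max-memory-per-node=200MB
discovery-server.enabled=true
discovery.uri=http://namenode.selise.ch:8081
EOF

# Flip the coordinator flag and drop the two coordinator-only lines.
sed -e 's/^coordinator=true$/coordinator=false/' \
    -e '/^node-scheduler\.include-coordinator=/d' \
    -e '/^discovery-server\.enabled=/d' \
    "$work/coordinator.properties" > "$work/worker.properties"

cat "$work/worker.properties"
```

The printed result has the same five lines as the worker config.properties above.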

node.properties

[root@datanode1 presto-server-0.164]# cat etc/node.properties
node.environment=dev
node.id=presto-node-1
node.data-dir=/root/datap

Now do the same for datanode2.selise.ch, giving it its own node.id, because every node in the cluster needs a unique one. Then restart the coordinator and use the Presto shell to check that everything is working.
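
A sketch of generating one node.properties per worker with a distinct node.id. The id presto-node-1 for datanode1 comes from the file above; presto-node-2 for datanode2 is my assumption, and the files are written to a scratch directory rather than to the workers themselves.

```shell
#!/bin/sh
# Sketch: write a node.properties per worker host, each with its own node.id.
set -e
out="$(mktemp -d)"

n=1
for host in datanode1.selise.ch datanode2.selise.ch; do
  mkdir -p "$out/$host"
  cat > "$out/$host/node.properties" <<EOF
node.environment=dev
node.id=presto-node-$n
node.data-dir=/root/datap
EOF
  n=$((n + 1))
done

# Show the generated ids, one line per host file.
grep '^node\.id=' "$out"/*/node.properties

# Print any id that occurs more than once (prints nothing here).
cat "$out"/*/node.properties | grep '^node\.id=' | sort | uniq -d
```

Each generated file can then be copied into the etc folder of the matching worker, for example with scp.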

[root@namenode.selise.ch presto-server-0.164]# ./bin/launcher run
2017-03-27T13:55:06.696+0600    INFO    main    Bootstrap       hive.allow-rename-table                            false       false                                Allow Hive connector to rename table
2017-03-27T13:55:06.696+0600    INFO    main    Bootstrap       hive.security                                      legacy      legacy
2017-03-27T13:55:06.696+0600    INFO    main    Bootstrap
2017-03-27T13:55:07.456+0600    INFO    main    com.facebook.presto.metadata.StaticCatalogStore -- Added catalog hive using connector hive-hadoop2 --
2017-03-27T13:55:07.457+0600    INFO    main    com.facebook.presto.execution.resourceGroups.InternalResourceGroupManager       -- Loading resource group configuration manager --
2017-03-27T13:55:07.459+0600    INFO    main    com.facebook.presto.execution.resourceGroups.InternalResourceGroupManager       -- Loaded resource group configuration manager legacy --
2017-03-27T13:55:07.459+0600    INFO    main    com.facebook.presto.security.AccessControlManager       -- Loading system access control --
2017-03-27T13:55:07.459+0600    INFO    main    com.facebook.presto.security.AccessControlManager       -- Loaded system access control allow-all --
2017-03-27T13:55:07.482+0600    INFO    main    com.facebook.presto.server.PrestoServer ======== SERVER STARTED ========
[root@namenode presto-server-0.164]# ./presto --server namenode.selise.ch:8081 --catalog hive
presto> use zurich;
presto:zurich> select count(*) from parking_events_text_partition;
   _col0
-----------
 229966246
(1 row)

Query 20170327_075512_00001_hgdm2, FINISHED, 3 nodes
Splits: 1,187 total, 1,187 done (100.00%)
6:50 [230M rows, 32.3GB] [561K rows/s, 80.7MB/s]

The query summary reports 3 nodes, so the three-node Presto cluster is up and ready for work. Enjoy your Presto cluster.
To be continued...
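
As a sanity check on the query summary above, the elapsed time 6:50 is in minutes:seconds, and the reported throughput follows from it. The sketch below redoes the arithmetic with awk; the variable names are made up, and I am assuming the CLI counts 1 GB as 1024 MB, which is what makes the numbers match.

```shell
#!/bin/sh
# Sketch: recompute the throughput figures from the query summary.
set -e
elapsed="6:50"      # minutes:seconds, as printed by the Presto CLI
rows=229966246      # result of count(*)
gigabytes=32.3      # data read, as printed in the summary

# minutes:seconds -> seconds
seconds="$(echo "$elapsed" | awk -F: '{ print $1 * 60 + $2 }')"

# rows per second in thousands, and MB per second (assuming 1 GB = 1024 MB)
rows_per_s="$(awk -v r="$rows" -v s="$seconds" 'BEGIN { printf "%.0fK", r / s / 1000 }')"
mb_per_s="$(awk -v g="$gigabytes" -v s="$seconds" 'BEGIN { printf "%.1fMB", g * 1024 / s }')"

echo "$seconds s, $rows_per_s rows/s, $mb_per_s/s"
```

This prints "410 s, 561K rows/s, 80.7MB/s", which agrees with the 561K rows/s and 80.7MB/s in the summary line.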

Biketrips and parking data benchmark using Hive and Presto

Today I will give you some idea of Hive and Presto. Hive is a distributed data warehouse system that lets you read, write and manage your data through SQL. Presto is another SQL engine, built for interactive analytic queries, and it can handle data in the terabyte to petabyte range. At Facebook, Presto currently runs against about 300 petabytes of data, where about 1,000 engineers run more than 30,000 queries a day on the Presto cluster.

Recently I set up a Presto cluster in my local environment, with 3 worker nodes and a coordinator, running Presto over Hive. I tested the Presto cluster on a data set of 229 million records. It is parking spot statistics from one of my small-scale big data projects, where I had to handle the data and migrate it from one server to another. The migration itself was not the main issue; I wanted to test the server performance after I was told that Presto is a good choice for interactive analytic queries.

Here are some of the Hive and Presto benchmark results I collected during the migration from one server to another.


Figure: My parking events text Parquet schema, which was used during the migration.


Figure: Number of parking events per year with Presto over Hive in ORC format; the query takes 2:01 (minutes:seconds) on 1 worker node.


Figure: Number of parking events per year with Hive in ORC format; the query takes only 53 seconds, far less than Presto, because Hive was running on a 3-node cluster and Presto on a single node.


Figure: Total number of entries with Presto over Hive in ORC format; the query takes 0:03 (3 seconds), which is very fast for 229M records.


Figure: Total number of entries with Hive in ORC format; the query takes around 0.29 seconds, which is considerably longer than Presto.

Now it is time to look at Hive with the Parquet format.

Figure: Presto takes about 6 to 9 seconds on a single node.


Figure: Hive takes only 0.51 seconds on 3 worker nodes.

Biketrips data analysis

Here is another data set, biketrips, which I loaded for comparison. You can easily download the data from the internet.

ORC format for both presto and hive

PARQUET format for both presto and hive

Uber data analysis

select * from trips_orc where trip_id is not NULL limit 10
 
trips_orc.trip_id	trips_orc.vendor_id	trips_orc.pickup_datetime	trips_orc.dropoff_datetime	trips_orc.store_and_fwd_flag	trips_orc.rate_code_id	trips_orc.pickup_longitude	trips_orc.pickup_latitude	trips_orc.dropoff_longitude	trips_orc.dropoff_latitude	trips_orc.passenger_count	trips_orc.trip_distance	trips_orc.fare_amount	trips_orc.extra	trips_orc.mta_tax	trips_orc.tip_amount	trips_orc.tolls_amount	trips_orc.ehail_fee	trips_orc.improvement_surcharge	trips_orc.total_amount	trips_orc.payment_type	trips_orc.trip_type	trips_orc.pickup	trips_orc.dropoff	trips_orc.cab_type	trips_orc.precipitation	trips_orc.snow_depth	trips_orc.snowfall	trips_orc.max_temperature	trips_orc.min_temperature	trips_orc.average_wind_speed	trips_orc.pickup_nyct2010_gid	trips_orc.pickup_ctlabel	trips_orc.pickup_borocode	trips_orc.pickup_boroname	trips_orc.pickup_ct2010	trips_orc.pickup_boroct2010	trips_orc.pickup_cdeligibil	trips_orc.pickup_ntacode	trips_orc.pickup_ntaname	trips_orc.pickup_puma	trips_orc.dropoff_nyct2010_gid	trips_orc.dropoff_ctlabel	trips_orc.dropoff_borocode	trips_orc.dropoff_boroname	trips_orc.dropoff_ct2010	trips_orc.dropoff_boroct2010	trips_orc.dropoff_cdeligibil	trips_orc.dropoff_ntacode	trips_orc.dropoff_ntaname	trips_orc.dropoff_puma
0	2	201	2016-01-01 00:39:36.0	NULL	1	-73	40.68061065673828	-73.92427825927734	40.69804382324219	1	1	8	0.5	0.5	1.86	0	NULL	0.3	11.16	1	1	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL
1	2	201	2016-01-01 00:39:18.0	NULL	1	-73	40.72317504882813	-73.92391967773438	40.76137924194336	1	3	15.5	0.5	0.5	0	0	NULL	0.3	16.8	2	1	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL
2	2	201	2016-01-01 00:39:48.0	NULL	1	-73	40.67610549926758	-74.0131607055664	40.64607238769531	1	3	16.5	0.5	0.5	4.45	0	NULL	0.3	22.25	1	1	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL
3	2	201	2016-01-01 00:38:32.0	NULL	1	-73	40.66957855224609	-74.00064849853516	40.68903350830078	1	3	13.5	0.5	0.5	0	0	NULL	0.3	14.8	2	1	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL
4	2	201	2016-01-01 00:39:22.0	NULL	1	-73	40.68285369873047	-73.94071960449219	40.66301345825195	1	2	12	0.5	0.5	0	0	NULL	0.3	13.3	2	1	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL
5	2	201	2016-01-01 00:39:35.0	NULL	1	-73	40.74645614624023	-73.86774444580078	40.74211120605469	1	1	7	0.5	0.5	0	0	NULL	0.3	8.3	2	1	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL
6	2	201	2016-01-01 00:39:21.0	NULL	1	-73	40.74619674682617	-73.88619232177734	40.74568939208984	1	0	5	0.5	0.5	0	0	NULL	0.3	6.3	2	1	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL
7	2	201	2016-01-01 00:39:36.0	NULL	1	-73	40.80355834960938	-73.94915008544922	40.79412078857422	1	1	7	0.5	0.5	0	0	NULL	0.3	8.3	2	1	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL
8	2	201	2016-01-01 00:39:52.0	NULL	1	-73	40.70281600952148	-73.97157287597656	40.67972564697266	1	2	12	0.5	0.5	2	0	NULL	0.3	15.3	1	1	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL
9	2	201	2016-01-01 00:39:23.0	NULL	1	-73	40.75664138793945	-73.91754913330078	40.73965835571289	1	1	9	0.5	0.5	1.6	0	NULL	0.3	11.9	1	1	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL

Here is another data set: 2016 Uber trips for New York City. I have used only 2 to 3 months of data for the analysis.

Next time I will go into more detail on Presto and Hive. Today I focused on the single-node Presto cluster; next time I will cover the multi-node Presto cluster experiment.