Analyzing 1.7 billion weather records: around 28 years of data for two different places

Hello guys, today I am going to give some idea of how I handled 1.7 billion weather records, which is around 28 years of data. Our main target was to generate some analytics from this weather data. We handled only temperature and humidity data, which comes to around 22 GB.

Now, one of the main challenges was how to download the data from the client's server. The client-provided URL was good enough to download the data with some basic shell commands. At first I considered downloading the data with xargs, but I changed my mind and did not use it: I was in no hurry to download the data, and my local PC's bandwidth was not high enough to get much benefit from parallel downloads.

You could use xargs to download the data from a specific location, running several downloads in parallel (-P 6 below runs six wget processes at once):

downloader.sh

#!/bin/bash
# Emit one URL per day, from 1990-01-01 up to (but not including) 2017-04-05
d=1990-01-01
while [ "$d" != 2017-04-05 ]; do
  echo "https://xxxxxxxx.com/${d}T00:00:00ZXXXXXX/xxxxxxxxxx/xxxxxx.xxxxxxxxx/csv"
  d=$(date -I -d "$d + 1 day")  # GNU date: advance one day, ISO format
done
[root@namenode.selise.ch ~]# chmod +x downloader.sh
[root@namenode.selise.ch ~]# ./downloader.sh > urls.txt
[root@namenode.selise.ch ~]# cat urls.txt | xargs -n 1 -P 6 wget -P data/

but what I actually used for the download is given below; it fetches the files one at a time:

downloader.sh

#!/bin/bash
# Emit one wget command per day; the output itself becomes a download script
d=1990-01-01
while [ "$d" != 2017-04-05 ]; do
  echo wget "https://xxxxxxxx.com/${d}T00:00:00ZXXXXXX/xxxxxxxxxx/xxxxxx.xxxxxxxxx/csv"
  d=$(date -I -d "$d + 1 day")  # GNU date: advance one day, ISO format
done

chmod +x downloader.sh
./downloader.sh > urls.txt
chmod +x urls.txt
./urls.txt

It took almost 48 hours on my low-bandwidth connection.
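As a side note, wget can also read a list of plain URLs directly (the format the first script produces) and resume interrupted downloads, which avoids making the URL list executable. A minimal sketch:

# Read URLs from urls.txt, resume partial downloads, save files into data/
wget -c -P data/ -i urls.txt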

Now I am ready to store the data in HDFS for analysis:

hadoop fs -put *.csv /user/root/weather
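One caveat: when copying multiple files, put fails if the target directory does not already exist, so create it first; afterwards you can verify the upload. A minimal sketch:

# Create the destination directory in HDFS (safe if it already exists)
hadoop fs -mkdir -p /user/root/weather
# Verify the upload afterwards
hadoop fs -ls /user/root/weather | head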

Storing the files is also time-consuming. Let's see how many records there are (the count below is from the final ORC table, which we will create next):


hive> select count(*) from WeatherTemperatureHumidity_orc_update;
OK
1719639510
Time taken: 1.792 seconds, Fetched: 1 row(s)

Now let's create the external Hive table over the raw HDFS data, along with ORC tables for querying. ORC is a compressed columnar format, so aggregations over it run much faster than over plain text:

CREATE EXTERNAL TABLE IF NOT EXISTS WeatherTemperatureHumidity(
        validdate string,
        temperature float,
        humidity float
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ';'
    STORED AS TEXTFILE
    location '/user/root/weather';
CREATE TABLE IF NOT EXISTS WeatherTemperatureHumidity_orc(
        validdate string,
        temperature double,
        humidity double,
        suburb string)
    STORED AS ORC;  -- ROW FORMAT is only meaningful for text tables; ORC does not need it
    
INSERT OVERWRITE TABLE WeatherTemperatureHumidity_orc SELECT w.*, "xxxx" FROM WeatherTemperatureHumidity w;
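Before moving on, it is worth a quick sanity check that the ORC copy holds the same number of rows as the raw text table; a simple way, using the two tables defined above:

-- These two counts should match if the copy succeeded
select count(*) from WeatherTemperatureHumidity;
select count(*) from WeatherTemperatureHumidity_orc;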
CREATE TABLE IF NOT EXISTS WeatherTemperatureHumidity_orc_update (
        `year` string,
        `month` string,
        `date` string,
        validdate string,
        temperature double,
        humidity double,
        suburb string)
    STORED AS ORC;
    
INSERT OVERWRITE TABLE WeatherTemperatureHumidity_orc_update select year(to_date(w.validdate)), month(to_date(w.validdate)), to_date(w.validdate), w.validdate, w.temperature, w.humidity, w.suburb from WeatherTemperatureHumidity_orc w where w.validdate is not NULL;

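Since every query below filters by suburb and groups by year, one variation worth trying is a year-partitioned table, so Hive can prune whole partitions instead of scanning all 1.7 billion rows. This is only a sketch, not what I ran; the table name WeatherTemperatureHumidity_orc_part is made up for illustration, and dynamic partitioning has to be switched on first:

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

CREATE TABLE IF NOT EXISTS WeatherTemperatureHumidity_orc_part (
        validdate string,
        temperature double,
        humidity double,
        suburb string)
    PARTITIONED BY (`year` string)
    STORED AS ORC;

-- The partition column value must come last in the SELECT list
INSERT OVERWRITE TABLE WeatherTemperatureHumidity_orc_part PARTITION (`year`)
select w.validdate, w.temperature, w.humidity, w.suburb, year(to_date(w.validdate))
from WeatherTemperatureHumidity_orc w
where w.validdate is not NULL;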
Now you are ready to start your analytics from the Hive table. Here I have tried to find out the min and max temperature and humidity from the weather Hive table.

hive> select `year`, min(temperature), max(temperature) from WeatherTemperatureHumidity_orc_update where suburb='XXXX' group by `year` order by `year` desc;
Query ID = root_20170406182432_92f15cf5-5a23-4250-8739-288676e0589d
Total jobs = 1
Launching Job 1 out of 1


Status: Running (Executing on YARN cluster with App id application_1491480905379_0001)

--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 ..........   SUCCEEDED     38         38        0        0       0       0
Reducer 2 ......   SUCCEEDED    253        253        0        0       0       0
Reducer 3 ......   SUCCEEDED      1          1        0        0       0       0
--------------------------------------------------------------------------------
VERTICES: 03/03  [==========================>>] 100%  ELAPSED TIME: 48.97 s
--------------------------------------------------------------------------------
OK
2017    -12.399999618530273     20.100000381469727
2016    -6.900000095367432      31.200000762939453
2015    -9.699999809265137      33.099998474121094
2014    -15.800000190734863     30.700000762939453
2013    -13.199999809265137     30.5
2012    -19.299999237060547     30.700000762939453
2011    -12.199999809265137     30.0
2010    -15.199999809265137     29.399999618530273
2009    -16.5   28.5
2008    -12.800000190734863     27.700000762939453
2007    -14.199999809265137     29.600000381469727
2006    -11.0   29.200000762939453
2005    -15.600000381469727     28.5
2004    -10.699999809265137     26.5
2003    -15.0   31.700000762939453
2002    -13.300000190734863     29.799999237060547
2001    -16.899999618530273     30.100000381469727
2000    -16.899999618530273     28.700000762939453
1999    -15.0   28.0
1998    -15.0   29.600000381469727
1997    -11.399999618530273     26.799999237060547
1996    -16.799999237060547     27.100000381469727
1995    -19.399999618530273     28.799999237060547
1994    -13.699999809265137     29.700000762939453
1993    -18.600000381469727     28.700000762939453
1992    -18.100000381469727     29.799999237060547
1991    -18.799999237060547     30.600000381469727
1990    -11.699999809265137     29.100000381469727
NULL    NULL    NULL
Time taken: 53.314 seconds, Fetched: 29 row(s)

The NULL row at the bottom most likely comes from rows whose validdate is not a parseable date (CSV header lines, for example), since year(to_date(...)) returns NULL for those. Now let's look at another suburb:

hive> select `year`, min(temperature), max(temperature) from WeatherTemperatureHumidity_orc_update where suburb='XXXXXXX' group by `year` order by `year` desc;
Query ID = root_20170406182754_271dd6cb-b69e-4fd3-b4c0-3e70d284fa1d
Total jobs = 1
Launching Job 1 out of 1


Status: Running (Executing on YARN cluster with App id application_1491480905379_0001)

--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 ..........   SUCCEEDED     38         38        0        0       0       0
Reducer 2 ......   SUCCEEDED    253        253        0        0       0       0
Reducer 3 ......   SUCCEEDED      1          1        0        0       0       0
--------------------------------------------------------------------------------
VERTICES: 03/03  [==========================>>] 100%  ELAPSED TIME: 47.21 s
--------------------------------------------------------------------------------
OK
2017    -12.399999618530273     19.700000762939453
2016    -7.300000190734863      30.600000381469727
2015    -9.100000381469727      32.79999923706055
2014    -15.300000190734863     30.600000381469727
2013    -13.199999809265137     30.299999237060547
2012    -19.5   30.600000381469727
2011    -12.399999618530273     29.899999618530273
2010    -15.399999618530273     29.200000762939453
2009    -16.600000381469727     28.299999237060547
2008    -13.0   27.5
2007    -14.399999618530273     29.399999618530273
2006    -11.199999809265137     29.0
2005    -15.699999809265137     28.399999618530273
2004    -10.800000190734863     26.399999618530273
2003    -15.199999809265137     31.5
2002    -13.399999618530273     29.700000762939453
2001    -16.899999618530273     29.899999618530273
2000    -16.799999237060547     28.600000381469727
1999    -15.199999809265137     27.899999618530273
1998    -15.100000381469727     29.399999618530273
1997    -11.300000190734863     26.600000381469727
1996    -17.0   27.0
1995    -19.5   28.700000762939453
1994    -13.899999618530273     29.600000381469727
1993    -18.799999237060547     28.5
1992    -18.200000762939453     29.600000381469727
1991    -18.899999618530273     30.5
1990    -11.800000190734863     29.0
NULL    NULL    NULL
Time taken: 50.031 seconds, Fetched: 29 row(s)
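The humidity side of the analysis follows exactly the same pattern, for example:

select `year`, min(humidity), max(humidity) from WeatherTemperatureHumidity_orc_update where suburb='XXXX' group by `year` order by `year` desc;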

Each of these queries takes around 50-60 seconds to complete over 1.7 billion records on a 3-node Hadoop cluster, where all nodes use Intel i5 processors with 16 GB of RAM, and the name node has 20 GB.
