Real-time streaming using Spark Streaming, Kafka and PySpark

These days, real-time streaming is one of the most challenging tasks in the data industry. The data might come from sensors, fleet tracking, traffic, cellular networks, website activity or web crawling, and you may need to analyze it in real time for your application or your end users.

Spark Streaming and Apache Storm are technologies that can help you a lot with real-time streaming. My choice is Spark Streaming.

Here I have implemented a basic streaming application that covers Spark Streaming, Kafka and basic crawling using Python.
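A minimal sketch of the consuming side of such a pipeline might look like the following. The `crawl-pages` topic name and the local broker address are assumptions for illustration, and `extract_words` is a hypothetical helper; the Spark wiring is wrapped in a function so the file can be imported without a running cluster.

def extract_words(record):
    # each Kafka record arrives as a (key, value) pair; tokenize the value
    return record[1].split()

def run_stream(broker="localhost:9092", topic="crawl-pages"):
    # pyspark imports are local so this file can be imported without Spark installed
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils

    sc = SparkContext(appName="CrawlerStream")
    ssc = StreamingContext(sc, 5)  # 5-second micro-batches

    stream = KafkaUtils.createDirectStream(
        ssc, [topic], {"": broker})

    # word counts over each micro-batch of crawled page text
    counts = stream.flatMap(extract_words) \
                   .map(lambda w: (w, 1)) \
                   .reduceByKey(lambda a, b: a + b)

    ssc.awaitTermination()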

will be continued …

Historical cricket data download and analysis.

Today I’m going to give you some idea of how to download data for your experimental work. I have implemented a basic Python script using the pandas HTML reader, where I use a URL and an iterator over the different result pages.

import pandas as pd

for x in range(1, 4009):
    # the base URL has been omitted here; only the paging query parameters remain
    url1 = ";page=" + str(x) + ";template=results;type=batting;view=innings"

    # read_html returns a list of all tables found on the page
    data1 = pd.read_html(url1)

    df1 = data1[2]
    df1.to_csv(path_or_buf='/Users/mohiulalamprince/work/python/cricstat/player-info-{:05d}.csv'.format(x), sep=',')
    print("{}  =>  [DONE]".format(x))

Here I have downloaded the data from 1877 to February 26th, 2017, and hosted it in my Amazon S3 cricstat bucket for your analysis. Here is the download location where you can get the data at your convenience.

Here is my first analysis, where I have tried to find the highest run totals across the world using Spark, an in-memory big data framework. You could follow my
Jupyter notebook PySpark installation guide. If you don’t know about Spark, then follow my Spark setup guide.

from pyspark import SQLContext
from pyspark import SparkContext
from pyspark.sql.types import *
from pyspark.sql.functions import udf

sc = SparkContext()
sqlContext = SQLContext(sc)

df = sqlContext.read.format('com.databricks.spark.csv') \
    .options(header='true', inferschema='true') \
    .load('/Users/mohiulalamprince/work/python/cs/player-info.csv').cache()

def formatting_unbitten_and_dnb(run):
    # drop the '*' that marks a not-out innings and map DNB/TDNB to '0'
    if run.endswith('*'):
        return run[:-1]
    elif run == 'TDNB' or run == 'DNB':
        return '0'
    return run.strip()

ignore_dnb = udf(formatting_unbitten_and_dnb, StringType())
df = df.withColumn("Runs", ignore_dnb("Runs"))
df.registerTempTable("players")
sqlContext.sql("select Player, sum(Inns), count(Inns), sum(Runs) as total_runs from players group by Player order by total_runs desc").show()
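The cleaning rule used in the UDF above is easy to sanity-check in plain Python before wiring it into Spark; the function is restated here standalone:

def clean_runs(run):
    # same rule as the UDF: drop the not-out '*' and map DNB/TDNB to '0'
    if run.endswith('*'):
        return run[:-1]
    if run in ('TDNB', 'DNB'):
        return '0'
    return run.strip()

print(clean_runs('104*'))   # '104'
print(clean_runs('DNB'))    # '0'
print(clean_runs(' 37 '))   # '37'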

And here is the analysis output:


Here is how you can calculate the total number of players between 1877 and 2017:

from pyspark import SQLContext
from pyspark import SparkContext

sc = SparkContext()
sqlContext = SQLContext(sc)

df = sqlContext.read.format('com.databricks.spark.csv') \
    .options(header='true', inferschema='true') \
    .load('/Users/mohiulalamprince/work/python/cricstat/player-info-[0-9]*')

# total number of distinct players across all downloaded pages'Player').distinct().count()
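The `player-info-[0-9]*` pattern in the load path is an ordinary glob over the zero-padded filenames produced by the download script; you can check its behavior with Python's `fnmatch` (the file names here are illustrative):

from fnmatch import fnmatch

names = ['player-info-00001.csv', 'player-info-04008.csv', 'notes.txt']
matched = [n for n in names if fnmatch(n, 'player-info-[0-9]*')]
print(matched)  # ['player-info-00001.csv', 'player-info-04008.csv']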

Here is another cricket data analysis, from 2005 to 2017, where we will calculate the highest run scorer during this time frame.

data location :
Here is the command you need to start your task.

Now you can see that Spark is running.

Now we need to load the CSV data from 2005 to 2017 for our analysis. I have used the Databricks CSV loader for my processing.

runsByPlayer = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load("/Users/mohiulalamprince/work/p

And I have used Spark SQL for the analysis. Spark SQL is very powerful; it basically accepts standard SQL commands.

runsByPlayer.registerTempTable("stat")
sqlContext.sql("select player_one, sum(runs) as total_runs from stat group by player_one order by total_runs desc").show()
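For intuition, the group-by and sum that this query performs is equivalent to the following plain-Python aggregation over (player, runs) rows; the player names and scores are made up for illustration:

from collections import defaultdict

rows = [('Tendulkar', 100), ('Ponting', 80), ('Tendulkar', 50)]

totals = defaultdict(int)
for player, runs in rows:
    totals[player] += runs          # group by player_one, sum(runs)

# order by total_runs desc
ranking = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
print(ranking)  # [('Tendulkar', 150), ('Ponting', 80)]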

And here is our result:

A few statistics:

Here is my GitHub location for the data and the source code. You could use it for your analysis.