
The difference between statistics and analytics and how numbers provide insights into cricket

A comprehensive and cerebral study performed by Dr Srinivas Bhogle of cricket analytics as it was yesterday, as it is today, and how it should be tomorrow.

cricketcountry.com Written by Srinivas Bhogle
Published: Jun 26, 2013, 11:28 AM (IST)
Edited: Jun 26, 2013, 11:28 AM (IST)


A few months ago I was invited to give a talk at the Chennai Mathematical Institute. I was free to pick any subject, but it had to be connected in some way to cricket! And since I was speaking at an elite math institute, there also had to be some curves and equations.

I guess I got unduly ambitious and decided that I would do a quick review of everything I know in cricket analytics. You may ask: “What’s ‘cricket analytics’? And how is it different from ‘cricket statistics’?”

Let me explain. I view cricket analytics as something with a little more modeling and a little more data processing. ‘Cricket statistics’, on the other hand, would just be lists of ‘highest’, ‘lowest’, ‘first’, ‘last’ cricketing numbers. For example a list that goes Brian Lara 400*, Matthew Hayden 380, Brian Lara 375, Garfield Sobers 365* …

To nobody’s surprise my talk went on and on. To my pleasant surprise most of the audience stayed behind till I reached the end. They asked some great questions too, and occasionally caused me great embarrassment. Here’s a sampler: “That photo that you are trying to pass off as Lance Gibbs is actually Joel Garner!”

It was!

In this article, I plan to go over my entire talk. There were almost 20 slides, so this narration is going to take some time. I must also add the disclaimer that this is my view of cricket analytics as it was yesterday, is today, and should be tomorrow. I’m sure others can say it better. As Amitabh Bachchan sang in Kabhi Kabhie:

I can’t think of a better place to start than with the good old cricket score book. As a little schoolboy I had an intimate association with this book and my happiest childhood moment was when ML Jaisimha leaned over my right shoulder to ask: “How many sixes did Santosh Reddy hit today?”


For a long time I thought that there couldn’t be a better creation than this humble score book, but I now realise that it has one really serious weakness: it doesn’t connect the batsman to the bowler on a ball-by-ball basis. It can tell you that Sachin Tendulkar hit a six, it can tell you that Anil Kumble was hit for a six, but it cannot confirm that Tendulkar hit that six off Kumble. The only time the printed scorecard links the batsman to the bowler is when the bowler dismisses the batsman. This weakness has now been corrected by using relational databases.

Let me next elaborate on what I have called ‘cricket statistics’. Here are some cricket records that truly excited me when I was a schoolboy and a teenager … the kind of numbers that my father or uncle passed down to me.


The most romantic story was about how Don Bradman was bowled by Eric Hollies for a duck, and how if he had scored just four runs in his last innings, his Test average would have been 100.

Then there’s the romance of Garfield Sobers scoring an unbeaten 365 to overtake Len Hutton’s 364; Lance Gibbs surpassing the fiery Freddie Trueman’s record of 307 Test wickets … and Bapu Nadkarni bowling maiden after maiden in the 1964 Test match against England at Madras [Chennai].
 
There were other big numbers: Hanif Mohammad’s 499 in a First-Class match and getting run out as he attempted to get to 500, Jim Laker taking all 10 wickets in an innings (Anil Kumble would later equal that record) and so on.

Notice that we are talking of just numbers, with practically no ‘processing’. The batting average (total runs scored divided by number of completed innings) and the bowling average (total runs conceded divided by total wickets taken) do involve some division, but they have weaknesses, especially the bowling average, which doesn’t recognise a bowler’s run-containing ability (Nadkarni, for example, must have hated it).

More interesting questions like ‘who is currently the best Test player’ were rarely asked, possibly because such questions didn’t have easy answers and required much more processing. It wasn’t until the mid-1980s, with the introduction of the Deloitte Ratings, that we started asking such questions (the name changed many times over the years: Deloitte -> Coopers & Lybrand -> PricewaterhouseCoopers … eventually to Reliance ICC rankings). I remember there was quite a flutter when Dilip Vengsarkar briefly topped the list.


Rather surprisingly (or, perhaps, not so surprisingly!) a description of the ratings formula is not easily available. But based on a study of the numbers, and some inadvertent leaks on the Internet, it would appear that the rating is an intelligent manipulation of the average: start with the average and scale it up or down based on factors such as quality of opposition, state of the match etc. It is also almost surely configured to give greater weight to more recent performances.

There are also apparently different formulae for bowlers and batsmen, although there is an effort to keep them in the same ballpark.

The Reliance ICC team also produces rankings for the best ODI batsmen, bowlers and all-rounders, as always with their hallowed formula shrouded in mystery. I have never liked the idea of hiding the formula. In fact, I worked with Professor MJ Manohar Rao (who, sadly, passed away 10 years ago) to create the most valuable player index (MVPI) that we continue to publish on rediff.com.


We’ve remarked earlier that having different rating schemes for batsmen and bowlers isn’t ideal; it would be much nicer if we had a single index that measures overall cricketing ability, instead of measuring batting, bowling and fielding ability separately. Our MVPI does this: it collapses a player’s batting, bowling and fielding performances into a ‘run equivalent’: so every performance can be expressed in terms of ‘runs’ — even if it initially feels awkward to say that Anil Kumble’s bowling performance was equivalent to a score of 133 ‘runs’.

Remember too that in ODI (or T20) matches it isn’t enough for a batsman to just score a lot of runs; he must also score them briskly. Likewise, it isn’t enough for a bowler to take wickets; he must also concede as few runs as possible. So how would we handle this?

Let us suppose that Virender Sehwag scores 50 in 30 balls and Rahul Dravid scores 50 in 45 balls. In an ODI context, Sehwag’s 50 would seem like 75 while Dravid’s 50 may seem like 55. The MVPI formula therefore gives Sehwag and Dravid ‘bonus’ runs; with obviously a bigger bonus for Sehwag because he scored at a faster clip. Of course if someone had taken 80 balls to score 50, then the 50 would have seemed like 35. And there would have been a ‘penalty’.

The MVPI therefore adds or subtracts runs to make it conform to a par. An alternative approach, used for example in the Castrol Index, is to multiply or divide the actual score to attain the right par.

The MVPI formulation revolves around such a par criterion. If the par score in an ODI is considered to be 250, then we expect every batsman to score five runs in every six balls, and every bowler to concede five runs per over. We also assume that every wicket a bowler gets is worth 25 ‘runs’. So if Sehwag scores 50 in 30 balls, when he was expected to score only 25, we give him a bonus of 50 – 25 = 25 and argue that his 50 is equivalent to 75. Likewise, if Kumble has figures of 10-3-17-4, his four wickets are worth 25 * 4 = 100 runs; also he conceded only 17 runs when he was expected to concede 50 … so we give him a bonus of 50 – 17 = 33 ‘runs’ and his overall performance is worth 100 + 33 = 133 ‘runs’.
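The arithmetic above is simple enough to put in a few lines of Python. This is only a minimal sketch of the par idea as described in this article (par score of 250, 25 ‘runs’ per wicket); the function names are made up for illustration, fielding is left out, and the published MVPI may well include further adjustments.

```python
# Sketch of the par-based 'run equivalent' described in the text.
# Assumptions taken from the article: par score 250 in 50 overs,
# each wicket worth 25 'runs'. Function names are illustrative only.
PAR_SCORE = 250
OVERS = 50
RUNS_PER_WICKET = 25


def batting_run_equivalent(runs, balls):
    """Actual runs plus a bonus (or penalty) relative to the par scoring rate."""
    expected = PAR_SCORE * balls / (OVERS * 6)  # five runs every six balls
    return runs + (runs - expected)


def bowling_run_equivalent(overs, runs_conceded, wickets):
    """Value of wickets plus runs saved (or leaked) relative to par."""
    expected = PAR_SCORE * overs / OVERS  # five runs per over
    return RUNS_PER_WICKET * wickets + (expected - runs_conceded)


print(batting_run_equivalent(50, 30))     # Sehwag, 50 off 30 balls -> 75.0
print(bowling_run_equivalent(10, 17, 4))  # Kumble, 10-3-17-4 -> 133.0
```

Note that maidens play no role in this sketch; only overs bowled, runs conceded and wickets taken enter the calculation.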

Let me end by mentioning another very worthy index — the Impact Index (II) devised by Jaideep Varma and others. This is a relative index in which every performance is evaluated using a 5-point scale. II doesn’t really care how many runs a batsman scored, or how many wickets a bowler took; it asks how much better a batsman or bowler scored relative to other players and therefore how impactful this performance was in the context of the match. I have often imagined how wonderful it would be if we could marry MVPI to II.

So much about ranking players. But what about ranking teams?


It seems elementary that teams that win more matches must be ranked higher. If you win with higher margins, or win the series, your team is probably even better. But most of us would agree that to be called a champion team the team must win against stronger opposition and win matches away from home. These two are the more significant criteria — and that’s why I have drawn bigger bubbles to represent them.

Many ranking schemes have been proposed, including the ranking scheme that we ran for a decade on rediff.com. However, the official ICC ranking scheme continues to be the one devised by ICC’s David Kendix. The ICC scheme, which accords a higher weight for wins against stronger teams, performs reasonably well, but suffers the weakness of not accounting for the home-away difference. To explain away this deficiency, many analysts argue that the difference is neutralized because every country plays the other successively at home and away. This explanation is not quite tenable because the rating value decays with time, and even more because at any given point we may not have equal home-away parity between every pair of teams. The Indian Test team looked pretty awful after being drubbed 4-0 away in Australia … till they returned the 4-0 compliment at home!

Our scheme does modestly better than the ICC scheme because it accommodates both the opposition strength and home-away factors, and I am convinced that a vastly superior rating scheme can be devised simply by tweaking our formula. But I also recognise that ICC will not change its ways … just see how steadfast they are in backing the D/L rule without ever giving a fair trial to V Jayadevan’s rain rule!
In fact, that’s what we are going to talk about next: rain rules!


Till ODI matches came along, it was okay for cricket matches to end as draws, and indeed a large number of matches were drawn; the very first cricket series I followed (Tiger Pataudi’s India against Mike Smith’s England in 1964) ended with five dreary draws! But ODI matches demanded a result even if the match was interrupted or disrupted by rain.

Initially the approach was naive (perhaps even ‘dumb’) and the revised target was only dependent on the run rate. But soon captains started getting smart. In particular, the wily Arjuna Ranatunga would always opt to field first on a cloudy day if he won the toss. The reason was trivial: it is a lot easier to score 125 in 25 overs than 250 in 50 overs if all your ten batsmen are allowed to bat. With the run-rate rule, you could encounter absurd situations where Pakistan would win at 151 for nine in 25 overs in response to India’s 300 for two in 50 overs.
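To see how the absurd result arises, here is a small Python sketch of the run-rate comparison using the numbers in the example above. The function names are hypothetical; the point is simply that wickets never enter the calculation.

```python
def run_rate(runs, overs):
    """Runs scored per over."""
    return runs / overs


def chasing_side_wins_on_run_rate(first_runs, first_overs, chase_runs, chase_overs):
    """Old run-rate rule: the chasing side wins if its run rate is higher.

    Wickets lost are ignored entirely, which is exactly the weakness.
    """
    return run_rate(chase_runs, chase_overs) > run_rate(first_runs, first_overs)


print(run_rate(300, 50))  # India: 6.0 runs per over
print(run_rate(151, 25))  # Pakistan: 6.04 runs per over
print(chasing_side_wins_on_run_rate(300, 50, 151, 25))  # True
```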

So, before the 1991-92 World Cup in Australia and New Zealand, they decided to come up with a new rain rule — and they apparently requested the venerable Richie Benaud to devise something sensible. Now Benaud was a great captain, and is a greater commentator, but analytics was clearly not his cup of tea. He came up with something that seemed profound: he said that if India scored 300 in 50 overs, and Pakistan had only 25 overs to bat … then Pakistan must score what India scored in their 25 most productive overs out of 50.

Think of the Manhattan that plots the runs scored in overs 1, 2, 3 … up to 50. Now rearrange the Manhattan so that the tallest building comes first on the left, then the next tallest and so on. Benaud’s rule effectively said that if 25 (or more generally ‘x’) overs were lost, cut off the runs equivalent to the height of the 25 (‘x’) shortest buildings at the right end of the rearranged Manhattan, and score as much as the height of the 25 (or 50 – ‘x’) tallest buildings to win.

I hope I didn’t confuse the reader. As an illustration, consider the shocking case of the England-South Africa World Cup semi-final in 1992. Batting first, England batted for 45 overs (so England’s Manhattan only had 45 buildings). When South Africa were chasing, they needed 22 to win in 13 balls when there was a rain interruption. Two overs were deemed to be lost. So the two shortest buildings at the right end of the Manhattan, respectively of height 0 and 1 runs, were cut off. The balls reduced from 13 to 1, but the target reduced only from 22 to 21!
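The ‘rearranged Manhattan’ can be sketched in Python as follows. The over-by-over scores below are invented purely for illustration (they are not England’s actual figures), and the function name is hypothetical; the sketch only shows why dropping the least productive overs barely moves the target.

```python
def most_productive_overs_total(over_runs, overs_available):
    """Runs scored in the highest-scoring `overs_available` overs.

    Sorting in descending order is the 'rearranged Manhattan';
    slicing keeps the tallest buildings and cuts off the shortest.
    """
    ranked = sorted(over_runs, reverse=True)
    return sum(ranked[:overs_available])


# Hypothetical six-over innings that includes a maiden and a one-run over.
over_runs = [4, 0, 10, 1, 8, 6]

full_total = most_productive_overs_total(over_runs, 6)
reduced_total = most_productive_overs_total(over_runs, 4)  # two overs lost

print(full_total, reduced_total)  # 29 28
```

A third of the overs disappear, yet the total to be matched falls by just one run, which is the same effect that hurt South Africa.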

It was this sorry mishap that cleared the way for the D/L method devised by Frank Duckworth and Tony Lewis. D/L’s chief merit was that it recognised that targets must be set based not just on ‘overs remaining’, but on a combination of the ‘overs remaining’ and ‘wickets remaining’ resources. The D/L method could also seamlessly accommodate interruptions at any stage of the match, and multiple interruptions. After the horror of Richie’s method, Frank and Tony’s method was much, much better!
I have written about the D/L method innumerable times (and have even been mentioned in the D/L book), and I have no doubt that D/L constituted a very big step forward in solving the rain rule problem. My only grouse is that the International Cricket Council (ICC) has shut the rain rule door after D/L and not given worthy challengers like Jayadevan’s VJD method a fair chance.


What really puzzles me is the way D/L is hyped and considered to be very complicated. I can understand if a cricket commentator with a BA in English, who only studied Chaucer and Shakespeare, throws up his hands in despair. But the average cricket fan mustn’t let the method defeat him (as India waited to defeat Pakistan in the Champions Trophy, someone messaged on Cricinfo that he finds his college calculus simpler than D/L!).

The general idea is to think of a ‘resource’. When an ODI innings is starting, the batting side has all its 50 overs and 10 wickets available. You therefore say that it has all its 100 percent resources available. As overs deplete and wickets fall, this resource diminishes. D/L merely creates a table that tells you how this resource diminishes from 100% to 0% as the innings advances from 50 overs to 0 overs and from 10 wickets available to 0 wickets available. It is essentially a table with 300 rows (for balls) and 10 columns (for wickets).

That’s one part of D/L; the other part is to determine how to reset the target using the resource table if there is an interruption. Remember that in the era of simple run rates we calculated the target by looking at the ratio of overs available to both teams at the time of interruption. Example: India bats all 50 overs to score 300, and Pakistan are 100 for no loss after 20 overs when the match is abandoned. Pakistan’s par target then would have been: (20/50) * 300 = 120 and they would be declared losers. Now, instead of a ratio using overs, we use a ratio using the resource percentage. In our example, while India used up 100% of its resource, Pakistan might just have used up 30% at the time of interruption (remember they have all 10 wickets in hand, and the resource percent judiciously combines overs used and wickets lost) and their par target might just be (30/100) * 300 = 90. So they’d be declared winners by D/L.
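The two par calculations in this example differ only in what goes into the ratio. A minimal Python sketch, with hypothetical function names and with the 30% resource figure taken from the illustration above rather than from an actual D/L table:

```python
def par_by_run_rate(first_innings_total, overs_used, overs_total=50):
    """Par score under the old rule: scale the target by overs used."""
    return first_innings_total * overs_used / overs_total


def par_by_resources(first_innings_total, resource_used_pct, first_resource_pct=100.0):
    """Par score in the D/L spirit: scale the target by resources used."""
    return first_innings_total * resource_used_pct / first_resource_pct


pakistan_score = 100  # 100 for no loss after 20 overs

old_par = par_by_run_rate(300, 20)   # 120.0
dl_par = par_by_resources(300, 30)   # 90.0 (30% is the article's illustrative figure)

print(pakistan_score > old_par)  # False: behind par on run rate
print(pakistan_score > dl_par)   # True: ahead of par on resources
```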

This is the key idea. Of course, things get complicated if the first innings of the ODI match is itself interrupted, or when there are multiple interruptions, or when scores tend to be too high or low … but the D/L rationale is always to set the target by comparing the resources used up.

To summarize: the D/L method uses (a) a 300 x 10 table of resource percentages and (b) a rule or a formula to reset targets in every interruption situation.

So what does V Jayadevan’s method do? Well, Jayadevan proposes his own criterion to populate the 300 x 10 table, and a different rule to reset targets based on what he calls the normal and target curves. I have described his method elsewhere in considerable detail. Essentially Jayadevan tries to repair Benaud’s Manhattan project involving most productive overs. Jayadevan recognised that the most productive overs criterion was perfect if the interruption happened between innings, but misbehaved badly for interruptions within innings. He therefore sought to correct that misbehaviour.


The D/L vs Jayadevan debate has gone on for over a decade, and is frequently seen as an India vs the Rest of the World conflict. I am one of the interlocutors in this conflict and am often asked if my technical judgment is clouded by my nationality. Perhaps yes, perhaps no. Back in 2001 I did a comparison of the two methods and cast my vote in favour of Jayadevan because I thought he was ahead by a whisker.

A lot has changed since; both D/L and Jayadevan have significantly upgraded their methods, and both now need computers to reset targets. This happened after D/L got a big scare during the India-Australia 2003 World Cup final; the match, undeniably Australia’s, was briefly going India’s way during a 10-12 ball interval when Sehwag was firing all guns. If rain had ended play at that point India could have become undeserving winners and D/L would’ve got an instant burial.

Where do I stand in the D/L vs Jayadevan debate today? I’m still with Jayadevan — although both methods compute nearly the same target in most ‘normal’ match situations — chiefly because I think ICC is being completely unfair in denying him an opportunity to demonstrate his rain rule. I still find the debate fascinating, and see this as another example of the engineer vs mathematician debate that crops up ever so often in information management.

If we look at T20 cricket, however, I can say without hesitation that D/L is simply not good enough. We have to realise and accept that T20 is a very different sort of animal.


So how does the rain rule go for a T20 match? Shout, scream, smile or gasp when you hear this, but the dumb rule is simply to pretend that a T20 match is an ODI match with the first 30 overs lost for either team! Or simply erase the top 180 rows from our 300 x 10 table. It was this ridiculous construct that got Paul Collingwood hot under the collar and Chris Gayle grinning like a contented cat when West Indies easily defeated England in a 2009 World T20 match.

You could view the situation this way: there is a long trouser that doesn’t fit you, but you are being forced to wear it. What could you do? Either cut the trouser, or perhaps shrink it. Using the D/L ODI rain rule for T20 is like cutting the trouser. In an exercise with *Rajeeva Karandikar we tried to see if we could ‘shrink’ the D/L trouser from ODI to T20 size (by ‘shrinking’ we assume that a T20 game evolves just like an ODI; only everything happens faster). We got mixed results: shrinking wasn’t better than cutting, but it wasn’t worse either.

In reality, T20 indeed appears to be a different animal. To devise a rain rule for T20 we must return to the drawing board, instead of tinkering with an ODI rain rule. So you might wonder why D/L haven’t done this so far. I’m sure they are at it, but their problem appears to be that they don’t have enough international T20 match data because not enough matches are played. But what about all that data from six seasons of the Indian Premier League (IPL) or the Big Bash? Oh, but IPL is just a silly Indian league that most Englishmen pretend not to notice. And isn’t the IPL all fixed?

Actually you must approach the problem differently. The only real requirement is to create that 120 x 10 table such that the resource diminishes after every ball and every wicket. There are so many statistical and probabilistic techniques that’ll help you build such a table, and even ensure that your table gets smarter as the months and years roll by. One remarkable exercise at Simon Fraser University actually created such a table. And their big finding was that D/L over-estimates the amount of resource available in mid-innings by almost 5%. Because there’s more resource apparently available, D/L thinks the batting side has the potential to score more and therefore sets a lower-than-expected target for mid-innings interruptions. Now you know why Gayle was grinning when WI easily scampered past England’s target.
But let us now move to another truly exciting application of D/L-like resource tables: we call it the ‘pressure index’!


Imagine you are in a meeting with your phone switched off and there’s an India-Pakistan match under way. As soon as the meeting ends, you check the score. To get an idea of who’s winning the chase you need to know three variables: the runs scored, the wickets lost, and the overs remaining.

Wouldn’t it be wonderful if all this information can be condensed into one number? Well, that number is what we call the pressure index.

We define the pressure index using the idea of a par score. A par score is what the chasing team must score to be on level terms with the bowling team (with all the rain in the ongoing Champions Trophy everyone’s talking of the par score). So if the chasing team is exactly at the par score we say that it has a pressure index of exactly 100. If a wicket falls at that point the par score rises and therefore the pressure index goes up to 115 or 120 or whatever (our formula is devised so that the max pressure index value is 200). If, on the other hand, the batsman hits three consecutive boundaries at that point then the chasing team has scored more than the par score and might have a pressure index of 92 or 95 (min pressure index value is 0). In the West Indies vs South Africa Champions Trophy match, the West Indies pressure index was below 100 when Kieron Pollard hit that unfortunate shot leading to his dismissal. His dismissal pushed the pressure index up to exactly 100 and the match ended as a tie.

The pressure index keeps changing ball after ball. In a close match it will fluctuate this way and that from 100. If we now plot the ball-by-ball change in the pressure index we obtain what we call the ‘pressure map’. In today’s age of smartphones and mobile connectivity the best way to report a match could be by using the pressure map. Major events on the map (such as a dismissal, or a high scoring streak) can be hyperlinked so that the map becomes the cricket fan’s one-stop cricket match reporting tool.

We carried the pressure index calculation live on rediff.com during the 2007 World Cup. But India’s early elimination killed off all interest. Fortunately, our reporting around the paisa vasool index in IPL 2008 on rediff.com and in Hindustan Times was much more successful.


So what then is the paisa vasool index (PVI)? It is really something quite straightforward, and based on the most valuable player index (MVPI) that I have described earlier in this post. PVI provides a good estimate of a player’s value in a professional cricket tournament such as the IPL.

Recall that MVPI collapses a player’s performance into a single variable that we can call ‘runs’ (in quotation marks). The higher the MVPI, the more ‘runs’ a player is contributing. The PVI is obtained by dividing a player’s earning (in US$) by his MVPI, and is therefore seen to be the amount (in US$) that the franchise owner pays the player for every ‘run’ scored.
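As a quick illustration, the division can be sketched in Python. The players, earnings and MVPI values below are entirely hypothetical and chosen only to show the calculation; a lower PVI means the franchise pays less for each ‘run’.

```python
def paisa_vasool_index(earnings_usd, mvpi_runs):
    """US$ paid by the franchise per MVPI 'run' contributed."""
    if mvpi_runs <= 0:
        raise ValueError("PVI is undefined for a non-positive MVPI")
    return earnings_usd / mvpi_runs


# Hypothetical players: (earnings in US$, MVPI 'runs' over the tournament)
players = {
    "Player A": (1_000_000, 500.0),
    "Player B": (200_000, 400.0),
}

pvi = {name: paisa_vasool_index(pay, runs) for name, (pay, runs) in players.items()}

# Cheapest 'runs' first
for name in sorted(pvi, key=pvi.get):
    print(name, pvi[name])
```

Here Player B, at US$500 per ‘run’, comes out ahead of Player A at US$2,000 per ‘run’, even though Player A contributed more in absolute terms.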

The best buys are therefore players with the lowest possible PVI, i.e. players who contribute the most ‘runs’ at the least cost. In fact, an analysis using PVI even allows you to obtain Moneyball-like inferences.
There is however one weakness in PVI: it estimates the worth of a cricketer only on the playing field.

Players like Sachin Tendulkar or Sourav Ganguly are immensely valuable even if they don’t perform on the cricket field (Tendulkar sends TV ratings soaring; in his prime, Ganguly could single-handedly fill up Eden Gardens). So a more realistic estimate of a player’s value must be based not just on on-field performance, but also on his perceived brand value.

Now, one of the top talking points in an event like the IPL relates to the points table: which will be the four top teams in the table?

Usually such discussions involve a long series of complex ‘if-then’ arguments, with the clear picture being elusive till almost the very end. That’s because we attempt only deterministic arguments. But what if we used probabilistic arguments instead?


What I will now describe is an idea from Rajeeva Karandikar. To illustrate the argument let us imagine we are looking at the IPL6 points table. IPL6 had 9 teams, or 36 distinct team pairs such as MI-KKR, KKR-RCB, SRH-CSK etc., etc. For each pair, let us write down our estimate of the win-loss probability. For example, for MI-KKR it could be 0.6-0.4 if we think MI has a 60% probability of winning.

With these probabilities we run a simulation. This means we simply tell the computer to pretend that IPL6 was played over and over, say 10,000 times (it would be impossible to do this in real life, but on a computer it will only take a minute!). The computer therefore ends up with 10,000 possible IPL6 points tables. Looking at these tables, a simple counting process will enable us to identify which team is likely to be first, second, third and fourth. It is also clear that we can repeat the same simulation process to identify the likely finalists and the likely winner.
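A bare-bones version of such a simulation might look like the Python sketch below. It is not Karandikar’s implementation: the function name is made up, it uses four teams instead of nine, every win probability other than the 0.6 for MI-KKR is a placeholder, each pair meets only once, and ties on points are broken at random rather than by net run rate.

```python
import random
from collections import Counter
from itertools import combinations


def simulate_top_finishes(teams, win_prob, top_k=4, n_sims=10_000, seed=0):
    """Estimate each team's probability of finishing in the top `top_k`.

    win_prob[(a, b)] is the assumed probability that team a beats team b.
    """
    rng = random.Random(seed)
    top_counts = Counter()
    for _ in range(n_sims):
        points = {team: 0 for team in teams}
        for a, b in combinations(teams, 2):
            winner = a if rng.random() < win_prob[(a, b)] else b
            points[winner] += 2  # two points for a win
        # Rank by points; a random number breaks ties (a simplification).
        ranked = sorted(teams, key=lambda t: (points[t], rng.random()), reverse=True)
        top_counts.update(ranked[:top_k])
    return {team: top_counts[team] / n_sims for team in teams}


teams = ["MI", "KKR", "RCB", "CSK"]
win_prob = {
    ("MI", "KKR"): 0.6,   # the example from the text
    ("MI", "RCB"): 0.5,   # the remaining values are placeholders
    ("MI", "CSK"): 0.5,
    ("KKR", "RCB"): 0.5,
    ("KKR", "CSK"): 0.4,
    ("RCB", "CSK"): 0.5,
}

chances = simulate_top_finishes(teams, win_prob, top_k=2)
for team, p in chances.items():
    print(team, round(p, 3))
```

Home and away probabilities, as suggested below, could be handled by keying `win_prob` on (home team, away team) and playing each pair twice.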

The process can also be easily refined. For example, we could factor in different win probabilities for home-away matches since IPL6 trends indicate significantly higher home win probabilities. It will also be a lot of fun; I can for example visualize a series of attractive contests built around this simulation idea on a cricket portal.

Next we pose an intriguing ODI batting question: If you have to chase a big total would you rather have Adam Gilchrist in your batting line-up or Herschelle Gibbs?


To answer this question, let us think hard about which ODI performance is uppermost in our mind when we think of Gilchrist and Gibbs. The Gilchrist knock I remember most is his 172 against Zimbabwe in 2004. He had enough overs left to get past 200, but he just threw the opportunity away. My favourite Gibbs knock is his 175 at the Wanderers in 2006 as South Africa chased down Australia’s 434 for four.

With Gilchrist one feels that he starts strongly and becomes more vulnerable as the innings advances. With Gibbs it is just the opposite; he starts tentatively but looks rock solid as the innings progresses. Can we model this phenomenon? Do some ODI batsmen ‘age’ well as the innings progresses, and others ‘age’ poorly? This was the question that MRLN Panchanana and T Krishnan (who taught me statistics over 35 years ago at the Indian Statistical Institute) posed some years ago. Their answer: Yes! They fitted a Weibull distribution and showed how batsmen with a beta value below 1 (like Tendulkar) age well, while batsmen with a beta greater than 1 (like Ponting) age poorly.
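Why should beta matter? A standard property of the Weibull distribution is that its hazard rate (here, loosely, the chance of getting out ‘right now’ given that you have survived so far) falls over time when the shape parameter beta is below 1 and rises when beta is above 1. The Python sketch below shows just that property; the scale value and the time points are arbitrary, and it makes no claim about how the authors measured ‘time’ (balls faced or runs scored) or what values they actually fitted.

```python
def weibull_hazard(t, beta, scale):
    """Weibull hazard rate: (beta / scale) * (t / scale) ** (beta - 1)."""
    return (beta / scale) * (t / scale) ** (beta - 1)


SCALE = 30.0  # hypothetical scale parameter, for illustration only

for beta in (0.8, 1.2):
    early = weibull_hazard(10, beta, SCALE)  # early in the innings
    late = weibull_hazard(60, beta, SCALE)   # well set
    trend = "falls" if late < early else "rises"
    print(f"beta={beta}: hazard {trend} as the innings progresses")
```

With beta = 0.8 the hazard falls (the batsman ‘ages’ well); with beta = 1.2 it rises (the batsman ‘ages’ poorly).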

Our next chat is about how we can pictorially depict cricketers. Can we caricature their faces so that we can recognise who is a batsman and who is a bowler? Better still, can we use their faces to spot cricketing similarities between two cricketers?


This is an old visualisation idea: when there are a lot of variables associated with a person or object it becomes difficult to depict all of them holistically. We then use faces to depict them: the roundness of the face may be linked to ‘runs scored’, the extent of the smile may be linked to ‘strike rate’, the curvature of the eyebrows may be linked to ‘fielding acumen’, the length of the nose to ‘economy rate’, the loop of the ears to ‘wickets taken’ and so on. When you do this, every player has a ‘face’, and the look on the face can instantly tell you the attributes of the player.

Look at ‘Matthew Hayden’ and ‘Ricky Ponting’. It is easy to see that both have very similar skills (which we recognise to be batting skills). Or look at ‘Scott Styris’ and ‘Sanath Jayasuriya’. They look similar because both were batting all-rounders.

We did this visualisation for cricketers right through our 2007 World Cup coverage on rediff.com; it was a fun project (of course it would have been more fun if we had more Indian faces). But if there are thousands of individuals sharing dozens of traits then it is easy to see how these pictures can suddenly become very informative!

I was pleasantly surprised when I saw a 2008 article in NYT using the same idea to describe traits of baseball coaches.

We will next address a question that has been asked about Sachin Tendulkar all through his glittering cricketing career: Does Tendulkar let you down in a crisis? A lot of folks contend that he isn’t a match-winner like Dravid or Laxman. For someone like me watching cricket for close to half a century, this debate has a déjà vu feeling. Back in the 1970s we were saying the same thing about Sunil Gavaskar vs Gundappa Viswanath or Dilip Vengsarkar.


I won’t write too much about this because I can scarcely better the compelling writing and arguments presented by Arunabha Sengupta, who argues that there is a cognitive fallacy in the reasoning. We are confusing two events … Does Sachin Tendulkar fail in a crisis? Or, is there a crisis because Sachin Tendulkar fails?

This confusion arises because many of us find it hard to understand conditional probability. If it is indeed true that Sachin Tendulkar fails in a crisis then the probability of the event ‘Sachin fails’ given the event that ‘there is a crisis’ must be very high … say 75% or more. So can we compute the actual probability?
We can … if readers have ever studied probability in school or college they would recall that we can do this if we use Bayes’ theorem!

To compute, we first need to guess some probabilities: what’s the probability that Tendulkar fails? Given his staggering record, very low. Let us say just 0.2. So the probability that Tendulkar does not fail is 0.8. What’s the probability that there is a crisis if Tendulkar fails? Historically that’s pretty high … say 0.7. And what’s the probability that there’s a crisis if Tendulkar does not fail? Pretty low … I’d say 0.3. If we now do the arithmetic we find that the probability that Tendulkar fails in a crisis is just under 40%, i.e., in less than four cases out of 10!
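For readers who want to check the arithmetic, here it is as a short Python sketch using the guessed probabilities above (the function name is hypothetical). Bayes’ theorem says P(fails | crisis) = P(crisis | fails) × P(fails) / P(crisis), where P(crisis) adds up the two ways a crisis can arise.

```python
def prob_fail_given_crisis(p_fail, p_crisis_given_fail, p_crisis_given_no_fail):
    """Bayes' theorem: P(fail | crisis) from the three guessed inputs."""
    p_no_fail = 1 - p_fail
    # Total probability of a crisis, whether or not he fails
    p_crisis = p_crisis_given_fail * p_fail + p_crisis_given_no_fail * p_no_fail
    return p_crisis_given_fail * p_fail / p_crisis


answer = prob_fail_given_crisis(0.2, 0.7, 0.3)
print(round(answer, 3))  # 0.368, i.e. 0.14 / (0.14 + 0.24)
```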

Let us now gaze into our crystal ball, and see what the future of cricket analytics might look like as we enter the brave new world of high speed communication and big data. I think it is going to be really exciting and enjoyable.

It is easy to see that the first future conflict in cricket analytics will relate to the DRS (for some time it was called UDRS for ‘umpire decision review system’, but umpires clearly didn’t find this amusing). And it is just as easy to see that DRS is eventually here to stay: cameras keep getting better, algorithms keep getting smarter, errors keep getting expensive, and TV viewers and sponsors keep demanding more excitement.

I won’t go into too much DRS detail because there are great compilations already available, including the one by Kartikeya Date that I rate highly, but DRS is all about using great imagery (as in Hot Spot), great gadgets (for example, the Snickometer) and great algorithms (as in Hawk-Eye) to improve the quality of decisions on the cricket field.

Shorn of polemics and controversy, a cold-blooded view is that if DRS does better than the umpires we must have it. We already see numbers telling us that umpires succeed 93% of the time while DRS gets it right 98% of the time. It is also evident that in the future the umpires’ percentage will drop even as the DRS percentage rises (as a parallel, see how dependent doctors are now on medical tests).

I am also amused by the view that no technology should be accepted unless it is 100% accurate. This is often an unattainable ideal, because costs zoom as you try to improve by even a fraction of a percentage point. If we want to wait for 100% accuracy, it will be a really long wait!

If pictures from the 2013 Champions Trophy are any indicator, Hot Spot does indeed look improved. The Snicko was always reliable … so the debate is now really down to how well Hawk-Eye performs. Given the character of technology, we should expect Hawk-Eye to keep getting better, although it might initially also get costlier.

With the debate about how well or how poorly Hawk-Eye performs still going on, I asked Dr Rajeeva Karandikar if there wasn’t a simpler way to answer this question. He said there was!

What does Hawk-Eye really do? In simple terms it models the trajectory of the ball bowled by the bowler and checks if the ball would have gone on to hit the stumps.

Why not then ask a team of bowlers to actually bowl a few hundred deliveries with the intention of hitting the stumps? For each delivery we ‘freeze’ action as soon as the ball pitches and ask Hawk-Eye to predict its trajectory. We then compare what Hawk-Eye predicts with what actually happens. Did the ball really hit the stumps when Hawk-Eye said it would? Did it really miss the stumps as Hawk-Eye said it would? If there was a mismatch then what was the margin of error?

To make the analysis more robust we could carry out this experiment at different cricket grounds, with different cricket balls, with different pitch wear and tear (we could even deliberately create bowler’s footmarks to the extent umpires would allow) and in different climatic conditions. And then we would count and compute!
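The ‘count and compute’ step could look something like the sketch below. Everything in it is an assumption made for illustration: the `Delivery` record, the `summarise` function and the five sample deliveries are invented, and none of it is real Hawk-Eye data or output.

```python
from dataclasses import dataclass


@dataclass
class Delivery:
    predicted_hit: bool  # did the prediction say the ball would hit the stumps?
    actual_hit: bool     # did the ball actually hit the stumps?
    error_mm: float      # gap between predicted and actual position at the stumps


def summarise(deliveries):
    """Count agreements and mismatches, and average the margin of error."""
    n = len(deliveries)
    agree = sum(d.predicted_hit == d.actual_hit for d in deliveries)
    wrongly_hit = sum(d.predicted_hit and not d.actual_hit for d in deliveries)
    wrongly_miss = sum(d.actual_hit and not d.predicted_hit for d in deliveries)
    return {
        "deliveries": n,
        "agreement": agree / n,
        "predicted_hit_but_missed": wrongly_hit,
        "predicted_miss_but_hit": wrongly_miss,
        "mean_error_mm": sum(d.error_mm for d in deliveries) / n,
    }


# Invented sample, purely to show the bookkeeping.
sample = [
    Delivery(True, True, 4.0),
    Delivery(True, False, 22.0),
    Delivery(False, False, 9.0),
    Delivery(False, True, 15.0),
    Delivery(True, True, 6.0),
]
summary = summarise(sample)
for key, value in summary.items():
    print(f"{key}: {value}")
```

In a real experiment one would keep a separate summary for each ground, ball type and weather condition, so that the comparison shows where the predictions hold up and where they do not.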

But cricket’s biggest future worry is of course ‘fixing’: spot fixing, ball fixing, player fixing, match fixing or whatever! How are we going to fix this?

Some years ago I wrote a mad post on a blog I used to write for Castrol Cricket asking, “What can cricket learn from Google?” It was just wild speculation that we could use big data analytics to uncover evidence of match fixing.

Even today that ramble probably qualifies as senile fantasy … but I’m not so sure about tomorrow. A couple of weeks ago I was reading the much-acclaimed book Big Data and was surprised to find a reference to match fixing in Japan’s sumo wrestling events. So other folks too are waking up to the big data opportunities in sport.

Essentially big data is all about discovering associations, and match fixing is about deliberately creating associations and correlations. If big data techniques become sufficiently powerful they’ll surely find the dirt.

All current approaches to deter match-fixing are based on denial of service (jamming mobile phones in and around a cricket field is a big joke). The principle is: “Make it harder and harder to fix”. I rather fancy that this principle should instead be: “Make it easier and easier to find”.

Let us end by asking how the cricket analytics story is likely to play itself out tomorrow and the day after.

Long, long ago, cricket was played on a cricket ground, and the rest of the world only came to know what was happening via radio or via the next morning’s newspaper. Today, cricket is ‘played’ much more on TV screens, and tomorrow it will be played on computer and communication networks worldwide.

The game is adapting to the changing canvas. IPL wouldn’t be what it is without TV and Internet; even betting and fixing are far more prevalent now because today’s cricket matches can be seen everywhere in real time.

Our cricket contests are now much more data-centric and almost certain to use Twitter or Facebook (does anyone even recall those spot-the-ball contests of the 1960s and 1970s on Sport & Pastime or Sportsweek?). We are now seeing ads on TV inviting us to watch matches on the computer, instead of on TV! Today’s cricket nostalgia would involve browsing on YouTube, instead of old cricket books and magazines featuring Neville Cardus, Ray Robinson or Jack Fingleton.

Surprisingly, cricket websites still aren’t embracing this new world. Cricinfo is playing out a huge nostalgic trip as it looks back to 20 years ago. Have they thought enough about how Cricinfo will be 20 years later? Are they even aware of the phenomenal value of their cricket statistics? Have they realised that text mining of their ball-by-ball summaries is likely to be the richest source of cricketing information? When I created pressure maps during the 2007 World Cup, I depended heavily on Cricinfo’s summaries. So why isn’t Cricinfo itself preparing pressure maps and selling them?

I see this as a big opportunity. Someone should create that ultimate one-stop cricket portal; this post itself contains several ideas and suggestions for such a portal: instant par scores and run scoring strategies after every ball; what-if scenarios after every ball (Dwayne Bravo wished that the umpires had given his team just one more ball in that West Indies vs South Africa match … so what could’ve happened if West Indies had that extra ball?); ongoing pressure index and pressure map; Chernoff faces; quizzes and contests about the IPL points table; drama and discussion accompanying DRS; round-up of the latest player ratings; sale of cricket merchandise … and I could go on, but I guess it is now time to stop!

(Dr Srinivas Bhogle obtained his Ph.D. in 1983 from the University of Paris V, for a thesis on hypergraphs and information theory, and his Bachelor’s and Master’s degrees from the Indian Statistical Institute, Kolkata. He is a Director and India Country Manager of TEOCO Software Pvt Ltd. Earlier he was Vice President, Analytics, at Cranes Software International Limited (CSIL), and prior to that, headed the Information Management Division at National Aerospace Laboratories, Bangalore for over 20 years.)

(*Dr (Prof) Rajeeva L Karandikar collaborated with Dr Srinivas Bhogle for the above article. Dr Karandikar obtained his Ph.D. at the Indian Statistical Institute (ISI), Kolkata in 1981, joined the institute’s Delhi faculty three years later and has since headed it. He has been a visiting professor at several universities in the USA and Europe. In 2006, he moved to Cranes Software International Limited as Executive Vice President – Analytics. In 2010, he returned to academia and is now the Director of Chennai Mathematical Institute. The focus of his research work has been stochastic processes and he has also been involved in numerous consultancy projects over the years.)