R Competition Brings Out the Best in Data Analytics

Published on 3 July 2012

Copyright 2011 Revolution Analytics
WHITE PAPER
R Competition Brings Out the Best in Data Analytics
R Provides a Winning Edge for Competitive Data Scientists
By David Smith
It’s often been said that competition brings out the best in us. We are all attracted to contests;
our passion for competing seems hardwired into our souls. Apparently, even predictive modelers
find the siren song of competition irresistible.
That’s what a small Australian firm named Kaggle has discovered – when given the chance, data
scientists love to duke it out, just like everyone else. Kaggle describes itself as “an innovative
solution for statistical/analytics outsourcing.” That’s a very formal way of saying that Kaggle
manages competitions among the world’s best data scientists.
Here’s how it works: Corporations, governments and research laboratories are confronted with
complex statistical challenges. They describe the problems to Kaggle and provide datasets. Kaggle
converts the problems and the data into contests that are posted on its website. The contests
feature cash prizes ranging in value from $100 to $3 million. Kaggle’s clients range in size from tiny
startups to multinational corporations such as Ford Motor Company and government agencies
like NASA.
“The idea is that someone comes to us with a problem, we put it up on our website, and then
people from all over the world can compete to see who can produce the best solution,” says
Anthony Goldbloom, Kaggle’s founder and CEO.
In essence, Kaggle has developed a remarkably effective global platform for crowdsourcing thorny
analytic problems. What’s especially attractive about Kaggle’s approach is that it is truly a win-win
scenario – contestants get access to real-world data (that has been carefully “anonymized” to
eliminate privacy concerns) and prize sponsors reap the benefits of the contestants’ creativity.
It is not surprising that many Kaggle contestants use programs or packages written in R, the open-
source programming language designed specifically for data analysis. Created by two university
professors in New Zealand, R has emerged as the lingua franca of statistical analysts worldwide.
Because R enables analysts to visualize and model data very rapidly, it has become a favorite for
handling the kind of extremely large, complex data that have become increasingly common in
today’s networked global economy.
R is also uniquely suited for competitions such as those managed by Kaggle. That’s because the
competitions tend to focus on prototyping and modeling, rather than on execution.
“R is a really powerful prototyping tool. It has so many packages, that just about anything you
could think to try is readily available. So in that regard, it gives participants the opportunity to
experiment with techniques that would otherwise be cumbersome to implement,” says
Goldbloom.
Jeremy Howard, a highly successful Kaggle contestant, was introduced to Goldbloom at an R user
group meeting in Melbourne. Goldbloom asked if he was the same Jeremy Howard whose name
appeared so frequently on Kaggle’s lists of leading contenders. “We got to talking and it turned
out we were kindred spirits,” says Howard, who now serves as Kaggle’s chief data scientist.
At a recent R user group meeting, Howard talked to a packed room about his winning strategies.
“R is great for running six different models and seeing what works,” says Howard. “R is definitely
an important tool in a data miner’s arsenal.”
Because R is an open-source project, there are literally thousands of free R packages available for
downloading. Many of them, notes Howard, include cutting-edge statistical analytics. “You can
jump into R, try something and find out quickly if it works for you. That’s really nice.”
Kaggle was recently featured in a Wall Street Journal article focusing on a $3 million prize offered
by Heritage Provider Network Inc., a California-based physicians group. The prize will be awarded
to the data analyst who can develop the best model for predicting the number of days a patient is
likely to spend in the hospital over the next year, according to the Journal. Kaggle is handling the
competition, which is the largest yet of its kind.
The scope and scale of the Heritage competition aren’t likely to scare off any of Kaggle’s die-hard
contestants. “Competitions are a platform for data scientists to test the robustness of their
algorithms and theories,” says Ming-Hen Tsai, a Kaggle contestant and former undergrad at
National Taiwan University. The emergence of data mining competitions such as those run by
Kaggle helps data scientists share their knowledge more openly and effectively, says Ming-Hen.
And the availability of open-source packages and algorithms from the worldwide R community
makes it easier to “see beneath the hood” of highly complex analytic processes, he notes.
“Revealing the methods for solving analytic problems is important for advancing the data mining
community,” says Ming-Hen. “Although many experiments result in papers, there aren’t enough
comprehensive studies of the statistical methods used. Many papers just report results on specific
sets of data.”
The lack of transparency makes it more difficult to recreate experiments, which in turn slows the
advancement of data mining as a science. “I think we should have more open-source software
implementing state-of-the-art algorithms,” says Ming-Hen. “Let the performance speak for itself,
and the world can judge which is best.”
Kaggle recently ran a competition to create the best recommendation system for R packages.
When he prepared the data for the contest, Howard used a variety of statistical methods –
including programs written in R.
Now, Kaggle contestants have an additional weapon to deploy in their quest for victory. Kaggle
has partnered with Revolution Analytics to provide Revolution R Enterprise free of charge to
Kaggle competitors for use in the competitions. Because this enhanced distribution of R scales to
the big-data problems that are now part of many Kaggle contests, competitors can apply R to
those problems, and competition sponsors can implement the resulting models in production
settings, thanks to the commercial-grade enhancements of Revolution R Enterprise.
Reflecting the field of statistics itself, Kaggle has generated unexpected results. It has become an
informal recruiting ground for companies looking for the best and brightest data analysts.
“When you’re talking to a prospective employer, you can say, ‘Look at my Kaggle profile.’ It’s by
far the best reputation tool in data science,” says Howard. “People who are successful in Kaggle
competitions can work anywhere they want to in this field.”
Revolution R Enterprise—Available free to Kaggle Competitors
Through a new partnership with Revolution Analytics, participants in current Kaggle competitions
can now download and use a FREE, full-featured version of Revolution R Enterprise software to
create their submissions. Built upon the powerful open source R language, this advanced
analytics software brings higher performance, 'Big Data' scalability, and greater productivity to R
at a fraction of the cost of traditional statistics products.
Kaggle competitors can download Revolution R Enterprise by registering at
http://info.revolutionanalytics.com/Kaggle.html
About David Smith
David is the Vice President of Marketing at Revolution Analytics, the leading commercial provider
of software and support for the open source R statistical computing language. David is the
co-author, with Bill Venables, of the official R manual An Introduction to R. He is also the editor of
Revolutions (http://blog.revolutionanalytics.com), the leading blog focused on the R language, and
one of the originating developers of ESS: Emacs Speaks Statistics. You can follow David on Twitter
as @revodavid.
About Revolution Analytics
Revolution Analytics delivers advanced analytics software at half the cost of existing solutions. Led
by predictive analytics pioneer and SPSS co-founder Norman Nie, the company brings high
performance, productivity, and enterprise readiness to open source R, the most powerful
statistics language in the world.
In the last 10 years, R has exploded in popularity and functionality and has emerged as the data
scientists’ tool of choice. Today R is used by over 2 million analysts worldwide in academia and at
cutting-edge analytics-driven companies such as Google, Facebook, and LinkedIn. To equip R for
the demands and requirements of all business environments, Revolution R Enterprise builds on
open source R with innovations in big data analysis, integration and user experience.
The company’s flagship Revolution R product is available as both a workstation and a server-based
offering.
Revolution R Enterprise Server is designed to scale and meet the mission-critical production needs
of large organizations such as Merck, Bank of America and Mu Sigma, while Revolution R
Workstation offers productivity and development tools for individuals and small teams that need
to build applications and analyze data.
Revolution Analytics is committed to fostering the growth of the R community. The company
sponsors the Inside-R.org community site and local user groups worldwide, and offers free licenses
of Revolution R Enterprise to everyone in academia to broaden adoption by the next generation
of data scientists. Revolution Analytics is headquartered in Palo Alto, Calif., and backed by North
Bridge Venture Partners and Intel Capital.
Please visit us at www.revolutionanalytics.com