At Searchmetrics, big data is not only the heart and soul of our Enterprise Suite, it’s the passion that drives our team. That’s why we celebrate when one of our own directs such enthusiasm toward helping make the world work and play better.
Data scientist Abhishek Thakur recently took home awards in the U.S. as part of a team that created an algorithm to identify at-risk populations for cervical cancer, a treatable disease that kills up to 275,000 women a year. The Kaggle Cervical Cancer Screening Competition, sponsored by Genentech, challenged some of the world’s top data scientists, called Kaggle Masters, to examine 250GB of anonymized health records and use machine learning models to predict which women will miss or ignore screening schedules.
With just 52 days to complete the project, Abhishek started the competition alone and early on developed an algorithm that reached 90 percent accuracy. But was that enough? Against fellow Kaggle Masters, he didn’t think so. Ultimately, Berlin-based Abhishek joined forces with two U.S. Kaggle Masters and Masters from two other countries (Hungary and Croatia) to create an algorithm that reached about 96 percent accuracy (the big-prize winner beat them by 0.3 percentage points).
“At Searchmetrics, I work on some very confidential research projects that require a lot of data munging, feature engineering and machine learning models,” he said. “When I go home, I like to take part in machine learning competitions. I can translate skills from one to the other.”
Abhishek and his team combined research and scoured forums as they analyzed patient data covering the different practitioners patients had seen and the diagnoses they had received. Using variables such as location and different diseases among the populace, they created a single machine learning model. Ultimately, the team took home two prizes, the Challenge prize and the Insights Award, and split a $20,000 purse (the overall prize pool was $100,000).
Cervical cancer can be prevented through early administration of the HPV vaccine and regular Pap smear screenings, which indicate the presence of precancerous cells. It is also sometimes curable by the removal of the early-stage cancerous tissue that is identified through Pap smears. Screening and early treatment can lead to potential cures in about 95% of women at risk for cervical cancer.
One of the plots from their analysis is depicted below. It shows how the screener percentage varies with the age group of patients.
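For readers curious how a figure like this could be computed, here is a minimal sketch in Python. It is not the team's code: the function name, the record layout (an age group plus a yes/no screener flag) and the sample values are all made up for illustration, and "screener percentage" is assumed to mean the share of patients in each age group who were screened.

```python
from collections import defaultdict


def screener_percentage_by_age_group(patients):
    """Return {age_group: percentage of patients flagged as screeners}.

    `patients` is an iterable of (age_group, is_screener) pairs.
    This layout is hypothetical, not the competition's actual schema.
    """
    totals = defaultdict(int)
    screeners = defaultdict(int)
    for age_group, is_screener in patients:
        totals[age_group] += 1
        if is_screener:
            screeners[age_group] += 1
    # Sort the groups so the result reads in age order.
    return {g: 100.0 * screeners[g] / totals[g] for g in sorted(totals)}


# Made-up records, for illustration only.
sample = [
    ("18-24", True), ("18-24", False),
    ("25-34", True), ("25-34", True), ("25-34", False), ("25-34", True),
    ("35-44", False), ("35-44", True),
]

result = screener_percentage_by_age_group(sample)
for group, pct in result.items():
    print(f"{group}: {pct:.1f}%")
```

Plotting the resulting percentages against the age groups would give a chart of the same general kind as the one described here.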
One of the insights the jury liked a lot was a word cloud. This was created by one of the team members and is shown below. The word cloud shows which drugs were prescribed to patients with a high screener percentage.
Identifying at-risk populations will make education and other intervention efforts more effective, ultimately reducing the number of women who die from this disease. “Different people working on different parts of machine learning algorithms can result in some amazing things,” Abhishek says.
We think so too. Congrats, Abhishek! And thank you to all the data scientists who use their powers for good.