Abhishek Thakur

Approaching (Almost) Any Machine Learning Problem

Abhishek Thakur June 30, 2020

This is not a traditional book.
The book has a lot of code. If you don't like the code-first approach, do not buy this book. Making code available on GitHub is not an option.

This book is for people who have some theoretical knowledge of machine learning and deep learning and want to dive into applied machine learning. The book doesn't explain the algorithms but is more oriented towards how and what you should use to solve machine learning and deep learning problems. The book is not for you if you are looking for pure basics. The book is for you if you are looking for guidance on approaching machine learning problems. The book is best enjoyed with a cup of coffee and a laptop/workstation where you can code along.

Table of contents:
- Setting up your working environment
- Supervised vs unsupervised learning
- Cross-validation
- Evaluation metrics
- Arranging machine learning projects
- Approaching categorical variables
- Feature engineering
- Feature selection
- Hyperparameter optimization
- Approaching image classification & segmentation
- Approaching text classification/regression
- Approaching ensembling and stacking
- Approaching reproducible code & model serving

There are no sub-headings. Important terms are written in bold.

AutoCompete: A Framework for Machine Learning Competitions

AutoML @ ICML Jun 2015
In this paper, we propose AutoCompete, a highly automated machine learning framework for tackling machine learning competitions. The framework has been learned, validated and improved over a period of more than two years by participating in online machine learning competitions. It aims at minimizing human interference required to build a first useful predictive model and to assess the practical difficulty of a given machine learning challenge. The proposed system helps in identifying data types, choosing a machine learning model, tuning hyper-parameters, avoiding over-fitting and optimization for a provided evaluation metric. We also observe that the proposed system produces better (or comparable) results with less runtime as compared to other approaches.

Computer Vision for Head Pose Estimation: Review of a Competition

Scandinavian Conference on Image Analysis 2015
This paper studies the prediction of head pose from still images, and summarizes the outcome of a recently organized competition, where the task was to predict the yaw and pitch angles of an image dataset with 2790 samples with known angles. The competition received 292 entries from 52 participants, the best ones clearly exceeding the state-of-the-art accuracy. In this paper, we present the key methodologies behind selected top methods, summarize their prediction accuracy and compare with the current state of the art.

Parallel Processing Architecture for ECG Signal Analysis

International Journal of Machine Learning and Computing June 28, 2013
Research in detecting QRS peaks in ECG signals has progressed to an acceptable extent and hence has gained adequate confidence with respect to the validity of the outputs produced. In view of the dynamics associated with ECG signals, their variants among subjects owing to varied types of problems encountered; it has become essential to, continuously, expand the scope of analysis to provide more and useful information from the ECG data. This warrants for a flexible architecture for ECG signal analysis. This paper presents one such flexible architecture. The authors are working towards identification of appropriate interfaces and their definitions.


Patents

Classification of keywords

Issued October 24, 2017 US US9798820B1
A computer-implemented method of classifying a keyword in a network comprises: identifying a plurality of candidate categories, comprising: converting a plurality of search results related to the keyword into a plurality of search vectors, wherein each of the plurality of search results indicates a related resource in the network; converting a plurality of resources into a plurality of category vectors, wherein each of the plurality of resources is classified in one or more categories of a set of categories; and determining, for the plurality of category vectors, a plurality of similarity values indicating similarity to the plurality of search vectors; processing the plurality of candidate categories; and classifying the keyword by selecting the candidate category having a highest similarity value within the plurality of similarity values, a corresponding system, computing device and non-transitory computer-readable storage medium.
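
The similarity-based selection described in this abstract can be illustrated with a minimal sketch. It rests on assumptions the abstract does not state: bag-of-words count vectors, cosine similarity as the similarity value, and taking the best-matching pair per category. All function names and the toy data are hypothetical and are not taken from the patent.

```python
import math
from collections import Counter


def to_vector(text):
    """Hypothetical vectorizer: lowercase bag-of-words term counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify_keyword(search_results, category_resources):
    """Pick the candidate category with the highest similarity value.

    search_results: texts of search results for the keyword.
    category_resources: dict mapping category name -> texts already
    classified in that category.
    Assumption: a category's score is its best search/resource pair.
    """
    search_vectors = [to_vector(r) for r in search_results]
    scores = {}
    for category, resources in category_resources.items():
        category_vectors = [to_vector(r) for r in resources]
        scores[category] = max(
            (cosine(s, c) for s in search_vectors for c in category_vectors),
            default=0.0,
        )
    best = max(scores, key=scores.get)
    return best, scores


# Toy data, invented purely for illustration.
results = [
    "python pandas dataframe tutorial",
    "python machine learning library",
]
categories = {
    "programming": [
        "python programming language tutorial",
        "java library reference",
    ],
    "cooking": ["pasta recipe with tomato sauce"],
}
best, scores = classify_keyword(results, categories)
print(best, scores)
```

Taking the maximum over pairs is only one way to aggregate; averaging the similarity values per category would be an equally plausible reading of the abstract.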

Classification of search queries

Issued September 19, 2017 US US9767182B1
A computer-implemented method of classifying a search query in a network comprises: classifying a plurality of search queries into categories, comprising: applying predetermined rules to each of the plurality of search queries, wherein the predetermined rules are indicative of the categories and each of the plurality of search queries is associated with search results in the network; determining, for each of the plurality of search queries, similarity values indicating similarity to each of the categories based on the applied predetermined rules; and training a machine learning module, comprising: applying the machine learning module to a plurality of training sets, wherein each of the plurality of training sets is based on one of the plurality of classified search queries and at least one of the respective one or more similarity values, a corresponding system, computing device and non-transitory computer-readable storage medium.


Courses

- Computer Animation
- Foundations of Graphics
- Foundations of Vision and Audio
- Image Processing: Retrieval and Analysis
- Intelligent Information Systems
- Mobile Robots
- Network Security
- Pearls of Algorithms
- Temporal Information Systems
- User Centered Software Design

Honors & Awards

Winner: Naive Bees Classification Challenge

Drivendata.Org, Metis

Dec 2015

Developed a deep learning algorithm to distinguish between bumblebees and honeybees using images.
The model scored 0.9956 area under the ROC curve on the private set.
2nd/2225 Springleaf Marketing Response Challenge

Kaggle.com

Oct 2015
3rd / 3514 Otto Group Product Classification Challenge

Kaggle.com

May 2015

Our team ranked 3rd out of 3500+ participants. It was the largest Kaggle competition to date.
Rank 3rd - Countable Care: Modeling Women's Health Care Decisions

Drivendata.Org

Mar 2015

Recent literature suggests that the demand for women’s health care will grow over 6% by 2020. Given how rapidly the health landscape has been changing over the last 15 years, it’s increasingly important that we understand how these changes affect what care people receive, where they go for it, and how they pay. Through the National Survey of Family Growth, the CDC provides one of the few nationally representative datasets that dives deep into the questions that women face when thinking about their health.

The task was to predict what drives women’s health care decisions in America.
Winner - Box Plots for Education

DrivenData.Org

Jan 2015
Rank 10th - KDD Cup 2014

Jul 2014

KDD Cup is the annual Data Mining and Knowledge Discovery competition organized by ACM Special Interest Group on Knowledge Discovery and Data Mining, the leading professional organization of data miners.
Rank 7th - Acquire Valued Shoppers Challenge

Jul 2014

### Predict which shoppers will become repeat buyers

Ranked 7th out of ~900 participants

The Acquire Valued Shoppers Challenge asked participants to predict which shoppers are most likely to repeat purchase. To aid with algorithmic development, they had provided complete, basket-level, pre-offer shopping history for a large set of shoppers who were targeted for an acquisition campaign. The incentive offered to that shopper and their post-incentive behavior was also provided.

This challenge provided almost 350 million rows of completely anonymised transactional data from over 300,000 shoppers. It was one of the largest problems run on Kaggle to date.
Rank 10th - The Random Number Grand Challenge

Apr 2014

Decode a sequence of pseudorandom numbers
Rank 4th - Crowdflower Partly Sunny with a Chance of Hashtags

Dec 2013

In this competition you are provided a set of tweets related to the weather. The challenge is to analyze the tweet and determine whether it has a positive, negative, or neutral sentiment, whether the weather occurred in the past, present, or future, and what sort of weather the tweet references.
Rank 6th - StumbleUpon Evergreen Classification Challenge

Oct 2013

StumbleUpon is a user-curated web content discovery engine that recommends relevant, high quality pages and media to its users, based on their interests. While some pages we recommend, such as news articles or seasonal recipes, are only relevant for a short period of time, others maintain a timeless quality and can be recommended to users long after they are discovered. In other words, pages can either be classified as "ephemeral" or "evergreen". The ratings we get from our community give us strong signals that a page may no longer be relevant - but what if we could make this distinction ahead of time? A high quality prediction of "ephemeral" or "evergreen" would greatly improve a recommendation system like ours.
Many people know evergreen content when they see it, but can an algorithm make the same determination without human intuition? Your mission is to build a classifier which will evaluate a large set of URLs and label them as either evergreen or ephemeral. Can you out-class(ify) StumbleUpon?
Rank 10th - Cause Effect Challenge by CHALEARN, 2013

Sep 2013

The problem of attributing causes to effects is pervasive in science, medicine, economy and almost every aspect of our everyday life involving human reasoning and decision making. What affects your health? the economy? climate changes? The gold standard to establish causal relationships is to perform randomized controlled experiments. However, experiments are costly while non-experimental "observational" data collected routinely around the world are readily available. Unraveling potential cause-effect relationships from such observational data could save a lot of time and effort.
Consider for instance a target variable B, like occurrence of "lung cancer" in patients. The goal would be to find whether a factor A, like "smoking", might cause B. The objective of the challenge is to rank pairs of variables {A, B} to prioritize experimental verifications of the conjecture that A causes B.
As is known, "correlation does not mean causation". More generally, observing a statistical dependency between A and B does not imply that A causes B or that B causes A; A and B could be consequences of a common cause. But, is it possible to determine from the joint observation of samples of two variables A and B that A should be a cause of B?
Rank 16th - Amazon Employee Access Challenge, 2013

When an employee at any company starts work, they first need to obtain the computer access necessary to fulfill their role. This access may allow an employee to read/manipulate resources through various applications or web portals. It is assumed that employees fulfilling the functions of a given role will access the same or similar resources. It is often the case that employees figure out the access they need as they encounter roadblocks during their daily work (e.g. not able to log into a reporting portal). A knowledgeable supervisor then takes time to manually grant the needed access in order to overcome access obstacles. As employees move throughout a company, this access discovery/recovery cycle wastes a nontrivial amount of time and money.
There is a considerable amount of data regarding an employee’s role within an organization and the resources to which they have access. Given the data related to current employees and their provisioned access, models can be built that automatically determine access privileges as employees enter and leave roles within a company. These auto-access models seek to minimize the human involvement required to grant or revoke employee access.

Objective:
The objective of this competition was to build a model, learned using historical data, that will determine an employee's access needs, such that manual access transactions (grants and revokes) are minimized as the employee's attributes change over time. The model will take an employee's role information and a resource code and will return whether or not access should be granted.

Languages

- English: Full professional proficiency
- German: Limited working proficiency
- Hindi: Full professional proficiency
- Python: Full professional proficiency
- Norwegian: Elementary proficiency

Norway 154K followers 500+ connections

Articles by Abhishek

- Clickbaits Revisited: Deep Learning on Title + Content Features to Tackle Clickbaits
- Is That a Duplicate Quora Question?
- The One With the Anime (or Hentai?)
