About
Get my book: Approaching (Almost) Any Machine Learning Problem for FREE:…
Articles by Abhishek
-
Clickbaits Revisited: Deep Learning on Title + Content Features to Tackle Clickbaits
By Abhishek Thakur
Activity
-
AutoTrain makes finetuning embedding models easy! Want to fine-tune embedding models with Sentence Transformers? Now you can—no coding needed! Why…
Liked by Abhishek Thakur
Experience
Education
Licenses & Certifications
Publications
-
Approaching (Almost) Any Machine Learning Problem
Abhishek Thakur
This is not a traditional book.
The book has a lot of code. If you don't like the code-first approach, do not buy this book. Making code available on GitHub is not an option.
This book is for people who have some theoretical knowledge of machine learning and deep learning and want to dive into applied machine learning. The book doesn't explain the algorithms but is more oriented towards how and what you should use to solve machine learning and deep learning problems. The book is not for you if you are looking for pure basics. The book is for you if you are looking for guidance on approaching machine learning problems. The book is best enjoyed with a cup of coffee and a laptop/workstation where you can code along.
Table of contents:
- Setting up your working environment
- Supervised vs unsupervised learning
- Cross-validation
- Evaluation metrics
- Arranging machine learning projects
- Approaching categorical variables
- Feature engineering
- Feature selection
- Hyperparameter optimization
- Approaching image classification & segmentation
- Approaching text classification/regression
- Approaching ensembling and stacking
- Approaching reproducible code & model serving
There are no sub-headings. Important terms are written in bold.
-
AutoCompete: A Framework for Machine Learning Competitions
AutoML @ ICML
In this paper, we propose AutoCompete, a highly automated machine learning framework for tackling machine learning competitions. The framework has been learned, validated and improved over a period of more than two years by participating in online machine learning competitions. It aims at minimizing human interference required to build a first useful predictive model and to assess the practical difficulty of a given machine learning challenge. The proposed system helps in identifying data types, choosing a machine learning model, tuning hyper-parameters, avoiding over-fitting and optimization for a provided evaluation metric. We also observe that the proposed system produces better (or comparable) results with less runtime as compared to other approaches.
-
Computer Vision for Head Pose Estimation: Review of a Competition
Scandinavian Conference on Image Analysis
This paper studies the prediction of head pose from still images, and summarizes the outcome of a recently organized competition, where the task was to predict the yaw and pitch angles of an image dataset with 2790 samples with known angles. The competition received 292 entries from 52 participants, the best ones clearly exceeding the state-of-the-art accuracy. In this paper, we present the key methodologies behind selected top methods, summarize their prediction accuracy and compare with the current state of the art.
-
Parallel Processing Architecture for ECG Signal Analysis
International Journal of Machine Learning and Computing
Research in detecting QRS peaks in ECG signals has progressed to an acceptable extent and hence has gained adequate confidence with respect to the validity of the outputs produced. In view of the dynamics associated with ECG signals and their variation among subjects owing to the varied types of problems encountered, it has become essential to continuously expand the scope of analysis to provide more useful information from the ECG data. This calls for a flexible architecture for ECG signal analysis. This paper presents one such flexible architecture. The authors are working towards identification of appropriate interfaces and their definitions.
Patents
-
Classification of keywords
Issued US US9798820B1
A computer-implemented method of classifying a keyword in a network comprises: identifying a plurality of candidate categories, comprising: converting a plurality of search results related to the keyword into a plurality of search vectors, wherein each of the plurality of search results indicates a related resource in the network; converting a plurality of resources into a plurality of category vectors, wherein each of the plurality of resources is classified in one or more categories of a set of categories; and determining, for the plurality of category vectors, a plurality of similarity values indicating similarity to the plurality of search vectors; processing the plurality of candidate categories; and classifying the keyword by selecting the candidate category having a highest similarity value within the plurality of similarity values, a corresponding system, computing device and non-transitory computer-readable storage medium.
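The selection step in the abstract above (vectorize search results, vectorize categorized resources, compare, pick the most similar category) can be illustrated with a minimal sketch. This is not the patented method: the abstract does not say how vectors are built or which similarity measure is used, so the bag-of-words counts, the cosine similarity, the merging of all texts into a single vector per side, the function names, and the toy data are all assumptions made for illustration.

```python
import math
from collections import Counter


def to_vector(texts):
    """Merge text snippets into one bag-of-words count vector (assumed representation)."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return counts


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors (assumed similarity measure)."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify_keyword(search_results, category_resources):
    """Return the candidate category most similar to the keyword's search results."""
    search_vector = to_vector(search_results)
    scores = {
        category: cosine_similarity(search_vector, to_vector(resources))
        for category, resources in category_resources.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical toy data: search results for one keyword and already-categorized resources.
search_results = ["football match score", "league football results"]
category_resources = {
    "sports": ["football league match", "tennis score results"],
    "cooking": ["pasta recipe", "baking bread recipe"],
}

best, scores = classify_keyword(search_results, category_resources)
print(best)
```

In this toy setup the "cooking" resources share no tokens with the search results, so that category scores zero and "sports" is selected. A real system would likely use weighted or learned vectors and keep separate vectors per search result and per resource, as the abstract describes.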
-
Classification of search queries
Issued US US9767182B1
A computer-implemented method of classifying a search query in a network comprises: classifying a plurality of search queries into categories, comprising: applying predetermined rules to each of the plurality of search queries, wherein the predetermined rules are indicative of the categories and each of the plurality of search queries is associated with search results in the network; determining, for each of the plurality of search queries, similarity values indicating similarity to each of the categories based on the applied predetermined rules; and training a machine learning module, comprising: applying the machine learning module to a plurality of training sets, wherein each of the plurality of training sets is based on one of the plurality of classified search queries and at least one of the respective one or more similarity values, a corresponding system, computing device and non-transitory computer-readable storage medium.
Courses
- Computer Animation
- Foundations of Graphics
- Foundations of Vision and Audio
- Image Processing: Retrieval and Analysis
- Intelligent Information Systems
- Mobile Robots
- Network Security
- Pearls of Algorithms
- Temporal Information Systems
- User Centered Software Design
Honors & Awards
-
Winner: Naive Bees Classification Challenge
Drivendata.Org, Metis
Developed a deep learning algorithm to distinguish between bumblebees and honeybees using images.
The model scored 0.9956 area under the ROC curve on the private set.
-
2nd/2225 Springleaf Marketing Response Challenge
Kaggle.com
-
3rd / 3514 Otto Group Product Classification Challenge
Kaggle.com
Our team ranked 3rd out of 3500+ participants. It was the largest Kaggle competition to date.
-
Rank 3rd - Countable Care: Modeling Women's Health Care Decisions
Drivendata.Org
Recent literature suggests that the demand for women’s health care will grow over 6% by 2020. Given how rapidly the health landscape has been changing over the last 15 years, it’s increasingly important that we understand how these changes affect what care people receive, where they go for it, and how they pay. Through the National Survey of Family Growth, the CDC provides one of the few nationally representative datasets that dives deep into the questions that women face when thinking about their health.
The task was to predict what drives women’s health care decisions in America.
-
Winner - Box Plots for Education
DrivenData.Org
-
Rank 10th - KDD Cup 2014
-
KDD Cup is the annual Data Mining and Knowledge Discovery competition organized by ACM Special Interest Group on Knowledge Discovery and Data Mining, the leading professional organization of data miners.
-
Rank 7th - Acquire Valued Shoppers Challenge
-
### Predict which shoppers will become repeat buyers
Ranked 7th out of ~900 participants
The Acquire Valued Shoppers Challenge asked participants to predict which shoppers are most likely to repeat purchase. To aid with algorithmic development, the organizers provided complete, basket-level, pre-offer shopping history for a large set of shoppers who were targeted for an acquisition campaign. The incentive offered to each shopper and their post-incentive behavior were also provided.
This challenge provided almost 350 million rows of completely anonymised transactional data from over 300,000 shoppers. It was one of the largest problems run on Kaggle to date.
-
Rank 10th - The Random Number Grand Challenge
-
Decode a sequence of pseudorandom numbers
-
Rank 4th - Crowdflower Partly Sunny with a Chance of Hashtags
-
In this competition you are provided a set of tweets related to the weather. The challenge is to analyze the tweet and determine whether it has a positive, negative, or neutral sentiment, whether the weather occurred in the past, present, or future, and what sort of weather the tweet references.
-
Rank 6th - StumbleUpon Evergreen Classification Challenge
-
StumbleUpon is a user-curated web content discovery engine that recommends relevant, high quality pages and media to its users, based on their interests. While some pages we recommend, such as news articles or seasonal recipes, are only relevant for a short period of time, others maintain a timeless quality and can be recommended to users long after they are discovered. In other words, pages can either be classified as "ephemeral" or "evergreen". The ratings we get from our community give us strong signals that a page may no longer be relevant - but what if we could make this distinction ahead of time? A high quality prediction of "ephemeral" or "evergreen" would greatly improve a recommendation system like ours.
Many people know evergreen content when they see it, but can an algorithm make the same determination without human intuition? Your mission is to build a classifier which will evaluate a large set of URLs and label them as either evergreen or ephemeral. Can you out-class(ify) StumbleUpon?
-
Rank 10th - Cause Effect Challenge by CHALEARN, 2013
-
The problem of attributing causes to effects is pervasive in science, medicine, economy and almost every aspect of our everyday life involving human reasoning and decision making. What affects your health? The economy? Climate change? The gold standard to establish causal relationships is to perform randomized controlled experiments. However, experiments are costly while non-experimental "observational" data collected routinely around the world are readily available. Unraveling potential cause-effect relationships from such observational data could save a lot of time and effort.
Consider for instance a target variable B, like occurrence of "lung cancer" in patients. The goal would be to find whether a factor A, like "smoking", might cause B. The objective of the challenge is to rank pairs of variables {A, B} to prioritize experimental verifications of the conjecture that A causes B.
As is known, "correlation does not mean causation". More generally, observing a statistical dependency between A and B does not imply that A causes B or that B causes A; A and B could be consequences of a common cause. But is it possible to determine from the joint observation of samples of two variables A and B that A should be a cause of B?
-
Rank 16th - Amazon Employee Access Challenge, 2013
-
When an employee at any company starts work, they first need to obtain the computer access necessary to fulfill their role. This access may allow an employee to read/manipulate resources through various applications or web portals. It is assumed that employees fulfilling the functions of a given role will access the same or similar resources. It is often the case that employees figure out the access they need as they encounter roadblocks during their daily work (e.g. not able to log into a reporting portal). A knowledgeable supervisor then takes time to manually grant the needed access in order to overcome access obstacles. As employees move throughout a company, this access discovery/recovery cycle wastes a nontrivial amount of time and money.
There is a considerable amount of data regarding an employee’s role within an organization and the resources to which they have access. Given the data related to current employees and their provisioned access, models can be built that automatically determine access privileges as employees enter and leave roles within a company. These auto-access models seek to minimize the human involvement required to grant or revoke employee access.
Objective:
The objective of this competition was to build a model, learned using historical data, that will determine an employee's access needs, such that manual access transactions (grants and revokes) are minimized as the employee's attributes change over time. The model will take an employee's role information and a resource code and will return whether or not access should be granted.
Languages
-
English
Full professional proficiency
-
German
Limited working proficiency
-
Hindi
Full professional proficiency
-
Python
Full professional proficiency
-
Norwegian
Elementary proficiency
More activity by Abhishek
-
Which task shall be added next? 😉 AutoTrain: https://lnkd.in/dUmxgcUX
Liked by Abhishek Thakur
-
Which task shall be added next? 😉 AutoTrain: https://lnkd.in/dUmxgcUX
Shared by Abhishek Thakur
-
In case I didn't mention, you can use both new Gemma 2 models with AutoTrain 🚀
Liked by Abhishek Thakur