<h1>Data Science - Portfolio</h1>
<p>Smitan Pradhan · <a href="https://smitan94.github.io/feed.xml">https://smitan94.github.io/feed.xml</a></p>

<h2 id="forec-app">ForeC App - A local solution to a global pandemic</h2>
<p>Published 2020-11-04 · <a href="https://smitan94.github.io/ForeC">https://smitan94.github.io/ForeC</a></p>

<h3 id="problem-statement">PROBLEM STATEMENT:</h3>
<p>2020 has not been a kind year to any of us. While many of us have been fortunate enough not to be directly impacted, millions of others have not been so lucky. Beyond the physical and mental toll of this ravaging virus, our economies have suffered, people have lost their jobs, and a lot of dreams were put on hold. Our team decided to put together a web application that could help local businesses in Australia make better decisions while learning to live with the new “COVID normal” life.</p>
<p>Two of the key problem statements that we identified were:</p>
<ul>
<li>Abrupt lockdowns in different states affecting businesses</li>
<li>The rise in misinformation leading to fewer people getting diagnosed</li>
</ul>
<h3 id="key-features">KEY FEATURES:</h3>
<ul>
<li>
<p>Forecasted case data at the local government area (LGA) level, giving businesses an opportunity to plan ahead for further lockdowns driven by local outbreaks in the future (a minimal forecasting sketch follows this list)</p>
</li>
<li>
<p>A medically approved chatbot that can help triage people with COVID-19 symptoms anonymously</p>
</li>
</ul>
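<p>The post does not describe the forecasting technique that was used, so the snippet below is only a minimal sketch of per-LGA case forecasting, assuming daily case counts and a simple exponential smoothing baseline from statsmodels; the file layout and column names are illustrative assumptions.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical sketch: per-LGA daily case forecasts with exponential smoothing.
# Column names ("lga", "date", "cases") are assumptions, not the app's real schema.
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

def forecast_per_lga(df: pd.DataFrame, horizon: int = 14) -> pd.DataFrame:
    """Fit a simple additive-trend model per LGA and forecast `horizon` days ahead."""
    out = []
    for lga, grp in df.groupby("lga"):
        series = (grp.set_index("date")["cases"]
                     .asfreq("D")       # daily frequency; missing days become NaN
                     .fillna(0.0))
        model = ExponentialSmoothing(series, trend="add").fit()
        fc = model.forecast(horizon).rename("forecast_cases").reset_index()
        fc["lga"] = lga
        out.append(fc)
    return pd.concat(out, ignore_index=True)

# cases = pd.read_csv("lga_daily_cases.csv", parse_dates=["date"])   # assumed file
# print(forecast_per_lga(cases).head())
</code></pre></div></div>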
<p><a href="https://youtu.be/817jH0bHaQk">Link to the video</a></p>
<h3 id="future-plans-and-areas-of-improvement">FUTURE PLANS AND AREAS OF IMPROVEMENT:</h3>
<ul>
<li>
<p>Use better and more advanced forecasting techniques (with help from a domain specialist) to improve our predictions</p>
</li>
<li>
<p>Include more features in the chatbot so it can identify more intents, such as displaying localised news and emergency numbers, and act as a symptom checker for other diseases as well</p>
</li>
</ul>

<p><em>How can we as Data Scientists help our community during this pandemic?</em></p>

<h2 id="white-box-vs-black-box">White box vs Black Box models: Importance of interpretable model in today’s world</h2>
<p>Published 2020-09-02 · <a href="https://smitan94.github.io/Black%20Box%20vs%20White%20Box">https://smitan94.github.io/Black%20Box%20vs%20White%20Box</a></p>

<h3 id="introduction">INTRODUCTION:</h3>
<p>Machine Learning has come a long way in the last decade or so. With rapid breakthroughs in the fields of computer vision and natural language processing, companies are constantly trying to catch up with new technologies and methodologies. A quick look at some of the best performing models in Kaggle competitions will reveal the supremacy of models such as XGBoost and neural networks; however, when it comes to making real-world decisions, companies often do not trust them. This is not only because the workings of these models are complicated to explain to a layman “business” person, but also because we as data scientists often cannot explain exactly how these models reach a decision. This leads to a level of mistrust being placed on them.
So, is the alternative to keep using models which give lower accuracy? Absolutely not!
Many researchers have shown that there is no significant difference in the performance of machine learning models, if tuned properly, on most “structured” data science problems. Unstructured problems such as image classification and text analysis are heavily dependent on the sequence of the data (either words or pixels), and hence feature extraction becomes a key part of the process (which is not true for structured problems). In this project, we have tried to match the accuracy achieved by some of the more sophisticated models using simpler models.</p>
<h3 id="problem-statement">PROBLEM STATEMENT:</h3>
<p>Our client asked us to create a model that can identify the people who are most likely to default on their credit card loans next month. The dataset provided to us contains their historical payment trends, balance amounts, and how many times they have defaulted in the last 6 months, along with their demographic details.
Sneak peek at the data: A quick look at the data gave us some interesting insights -</p>
<ul>
<li>Women tend to default less than men as per the dataset</li>
<li>Younger people (aged between 21 and 25) and older people (aged over 55) tend to default more than any other age group</li>
<li>People with only “High School” as the highest level of education seem to default more</li>
</ul>
<h3 id="approach">APPROACH:</h3>
<p>Keeping the ultimate business objective in mind, and having looked at the trends present in the data, we decided to follow the approach below:</p>
<ol>
<li>
<p>Create a model that can not only correctly identify the people who are about to default next month but also ensure that the model is not biased against a particular demographic segment (a simple check for this is sketched after this list).
This point became a very critical part of the project, since we did not want our model to rely heavily on the demographic features of the population. Demographic features can mislead a model because we are only working on a sample of the data and cannot be sure of the methodology behind the sampling. Also, in the real world, it is often not possible to simply ignore one major segment of the population, and hence we wanted our model to focus more on payment history than on demographic attributes.</p>
</li>
<li>
<p>Compare the performance of black box models vs white box models</p>
</li>
</ol>
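<p>As a minimal sketch of the kind of bias check point 1 calls for, the snippet below compares the model’s predicted default rate across a demographic attribute; the attribute, the column names, and the toy data are illustrative assumptions rather than the project’s actual code.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical bias check: compare predicted default rates across a demographic
# attribute (here "sex"); a large gap suggests the model leans on demographics.
import pandas as pd

def predicted_rate_by_group(df: pd.DataFrame, group_col: str, pred_col: str = "pred_default") -> pd.Series:
    """Share of customers flagged as likely defaulters within each group."""
    return df.groupby(group_col)[pred_col].mean()

# Illustrative data, not from the client's dataset.
scored = pd.DataFrame({
    "sex": ["F", "F", "M", "M", "M", "F"],
    "pred_default": [0, 1, 1, 0, 1, 0],
})
rates = predicted_rate_by_group(scored, "sex")
print(rates)                               # F ~0.33, M ~0.67 on this toy data
print("max gap:", rates.max() - rates.min())
</code></pre></div></div>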
<h3 id="methodology">METHODOLOGY:</h3>
<p>To ensure we meet our business objectives, we undertook multiple steps that include:</p>
<ol>
<li>
<p>Feature engineering: It was imperative to create new features from the existing columns in the dataset as doing so enables us to derive more information from the dataset and find hidden relationships between variables</p>
<p>1.1 Cumulative points based on the number of times they have already defaulted. People who have defaulted multiple times are given a negative score while people who have always paid the credit on time have a positive score.</p>
<p>1.2 Average payment made to the bank over the last 6 months, compared against the latest amount of loan taken from the bank. For example, if a person has been paying $500 on average to the bank but has taken a loan of $10,000 in the current month, then it’s highly likely that this person will default in the next month.</p>
<p>1.3 Removal of records when there was no clear indication of why people have been marked as “defaulter” despite their records showing that they have made due payments for the last 6 months. Albeit subtle, this step prevents the model from learning from cases wherein the default status is not dependent on the payment trend but rather focuses on other features present in the dataset.</p>
</li>
<li>
<p>Working with imbalanced data: The number of people defaulting on their credit card loans is as small as 1% of the entire dataset. To enable our model to learn equally from both classes, defaulters and non-defaulters, it was important to grow the defaulters group with synthetic data. This prevents the model from being biased towards the non-defaulters group (a sketch of the engineered payment-ratio feature and this oversampling step follows this list).</p>
</li>
</ol>
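<p>Below is a minimal sketch of steps 1.2 and 2 using pandas and imbalanced-learn’s SMOTE; the column names (PAY_AMT1 to PAY_AMT6, BILL_AMT1, default) are assumptions about the dataset rather than its actual schema.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical sketch of steps 1.2 and 2: an "average payment vs latest bill"
# ratio feature, then SMOTE to synthetically grow the minority (defaulter) class.
import pandas as pd
from imblearn.over_sampling import SMOTE

def add_payment_ratio(df: pd.DataFrame) -> pd.DataFrame:
    pay_cols = [f"PAY_AMT{i}" for i in range(1, 7)]
    df = df.copy()
    df["avg_payment_6m"] = df[pay_cols].mean(axis=1)
    # A high latest bill relative to the usual payment hints at a likely default.
    df["bill_to_avg_payment"] = df["BILL_AMT1"] / (df["avg_payment_6m"] + 1.0)
    return df

def balance_classes(df: pd.DataFrame, target: str = "default"):
    X = df.drop(columns=[target])
    y = df[target]
    X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
    return X_res, y_res
</code></pre></div></div>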
<h3 id="result">RESULT:</h3>
<p>The class imbalance in the dataset makes identification of defaulters challenging. The associated cost of not identifying a defaulter far outweighs the cost of misjudging a person to be a defaulter. This is where metrics such as recall and precision become important.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Recall = # of people classified as defaulters / # of people who actually defaulted
Precision = # of people correctly classified as defaulters / # of people classified as defaulters
</code></pre></div></div>
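<p>For reference, the same metrics can be computed with scikit-learn; the labels below are purely illustrative.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Computing precision and recall with scikit-learn (illustrative labels only).
from sklearn.metrics import precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # 1 = actually defaulted
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # 1 = classified as a defaulter

print("precision:", precision_score(y_true, y_pred))  # 3 correct out of 4 flagged = 0.75
print("recall:   ", recall_score(y_true, y_pred))     # 3 flagged out of 4 actual  = 0.75
</code></pre></div></div>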
<p>There is usually a trade-off between precision and recall: higher recall leads to lower precision and vice versa. You can read more about the precision-recall trade-off <a href="https://towardsdatascience.com/beyond-accuracy-precision-and-recall-3da06bea9f6c">here</a>.
Here is a comparison of precision and recall across our four models for the people defaulting on their credit card. This is the relevant view since the number of people defaulting is very low in our dataset.</p>
<table>
<thead>
<tr>
<th>Model</th>
<th>Type</th>
<th>Precision (for default cases)</th>
<th>Recall (for default cases)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Random Forest</td>
<td>Black Box Model</td>
<td>85%</td>
<td>64%</td>
</tr>
<tr>
<td>XGBoost</td>
<td>Black Box Model</td>
<td>90%</td>
<td>76%</td>
</tr>
<tr>
<td>Logistic Regression</td>
<td>White Box Model</td>
<td>73%</td>
<td>66%</td>
</tr>
<tr>
<td>Generalised Additive Model (GAM)</td>
<td>White Box Model</td>
<td>92%</td>
<td>80%</td>
</tr>
</tbody>
</table>
<p>Unsurprisingly, and as stated earlier, XGBoost is the better performer of the two black box models.</p>
<p>The white box model, logistic regression, is not a strong performer. This is because some of the features in the dataset do not have a linear relationship with the target, an issue often faced in real-world datasets, and hence logistic regression is unable to generalise well.</p>
<p>A Generalised Additive Model (GAM) overcomes the limitations of logistic regression. GAM is a powerful yet simple technique that combines much of the predictive power of black box models with the interpretability of logistic regression. It achieves this because the relationships between the features and the target are not assumed to be linear: each feature gets its own smooth function (which can be linear or non-linear), and these functions are summed to produce the prediction.</p>
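<p>As a rough illustration, fitting such a model with the pyGAM library might look like the sketch below; the file name, feature selection, and column names are assumptions, not the code used in this project.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Minimal sketch of fitting a GAM for the default problem with pyGAM;
# feature choice and column names are illustrative assumptions.
import pandas as pd
from pygam import LogisticGAM, s

df = pd.read_csv("credit_default.csv")              # assumed file name
features = ["LIMIT_BAL", "AGE", "avg_payment_6m"]    # assumed feature set
X, y = df[features].values, df["default"].values

# One smooth term per feature: prediction = g( f1(x1) + f2(x2) + f3(x3) )
gam = LogisticGAM(s(0) + s(1) + s(2)).fit(X, y)
gam.summary()

# Each fitted smooth can be inspected on its own, which is what makes the model interpretable.
XX = gam.generate_X_grid(term=0)
print(gam.partial_dependence(term=0, X=XX)[:5])      # shape of f1 over the first feature
</code></pre></div></div>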
<p>The main advantages of GAM are as follows:</p>
<ol>
<li>Easy to interpret</li>
<li>Flexible predictor functions can uncover hidden patterns in the data</li>
<li>Regularisation of predictor functions helps avoid overfitting</li>
</ol>
<p>You can also read this article <a href="https://multithreaded.stitchfix.com/blog/2015/07/30/gam/">here</a>, which explains in a very simple way what GAM is about!</p>

<p><em>In today’s world, more and more focus is on ensuring that black box models can be interpreted, and on whether they are really required.</em></p>

<h2 id="sedans-and-colours">Sedans and Colours - Are certain colours more prevalent in sedans?</h2>
<p>Published 2020-06-20 · <a href="https://smitan94.github.io/Sedans%20and%20Colours%20-%20South%20Australia">https://smitan94.github.io/Sedans%20and%20Colours%20-%20South%20Australia</a></p>

<p>Recently, I came across an article in <a href="https://www.economist.com/britain/2018/01/18/the-link-between-the-colour-of-cars-and-the-economy">The Economist</a> which discussed how the colour of the cars sold in the United Kingdom reflects the public’s mood about the economy. It attributed white to an optimistic sentiment towards the economy (and towards the ruling government), while black was naturally attributed to “darker” times. This article intrigued me, and I wanted to see for myself whether there is any trend in the colour of the cars sold in Australia.</p>
<p>Aim: While the primary aim was to identify the popularity of each brand along with the prevalence of specific colours, I also wanted to observe whether there was any relevant association between specific brands of sedans and their colours. Additionally, I wanted to know whether the trend was different for the luxury segment of cars when compared to other segments. This comes from my personal opinion that people with luxury cars usually want their car to stand out a bit compared to other commercial cars.</p>
<p>Key Insights -</p>
<p>According to this <a href="https://www.axalta.com/gb/en_GB/news-releases/axalta-2019-color-popularity-report.html">report</a>, white dominates the overall car market globally with approximately 38% market share, while black is a distant second with 19%. Grey and silver are the third and fourth most prevalent colours respectively. Together, these four colours represent 80% of the total car market. I wanted to see whether similar trends are followed in South Australia as well.</p>
<ul>
<li>
<p>Holden (once a marquee Australian automobile company, and soon to be retired entirely) is the most popular car brand down south</p>
</li>
<li>
<p>Like the rest of the world, South Australians also prefer white sedans the most; however, non-traditional colours such as red and blue are also widely popular</p>
</li>
<li>
<p>For the luxury car segment, we see that “silver” and “black” are more popular than white. This seems interesting to me and might require comprehensive market research to uncover the additional factors driving it. It should also be noted that colours such as “white” and “grey” have become more popular in recent years, so this trend might reverse in the future</p>
</li>
<li>
<p>Another interesting trend is that while the total number of sedans sold across all colour segments shows a negative trend, indicating an overall decline in the sedan market, the “grey” colour stands out since it shows positive year-over-year growth</p>
</li>
</ul>
<p><a href="https://public.tableau.com/profile/smitan.pradhan#!/vizhome/CarsvsColoursVisualizationSouthAustralia/Dashboard?publish=yes">Dashboard - Link to tableau</a></p>
<p>About the dataset:</p>
<p>Dataset: I came across three datasets from the Department of Transport of South Australia on the Australian government website, covering the period 2017-2019. These datasets list the total count of each type of transport sold within the state along with other information such as “brand”, “colour of the vehicle”, “type of the vehicle”, and “manufactured year”. I added an additional column, “year”, to map each record to the correct dataset (a short sketch of this preparation step follows below). For the purpose of this analysis, I have selected only “Sedans”, since that is the vehicle type I could most easily associate myself with. “Pig Trailer”, “Dog Trailer”, and “Caravan Vehicle” were some of the other terms that I learnt while exploring this dataset.</p>
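<p>A minimal sketch of that preparation step, assuming the three yearly extracts are CSV files with “Body Type” and “Colour” columns (the file and column names are assumptions about the data.gov.au extracts):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Combine the three yearly files, tag each record with its year, keep only sedans.
import pandas as pd

files = {2017: "sa_registrations_2017.csv",
         2018: "sa_registrations_2018.csv",
         2019: "sa_registrations_2019.csv"}

frames = []
for year, path in files.items():
    df = pd.read_csv(path)
    df["year"] = year                  # map each record to its source dataset
    frames.append(df)

vehicles = pd.concat(frames, ignore_index=True)
sedans = vehicles[vehicles["Body Type"].str.upper() == "SEDAN"]
print(sedans.groupby(["year", "Colour"]).size().head())
</code></pre></div></div>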
<p>Assumptions made -</p>
<p>Since the dashboard was created at a very high level to get an intuitive understanding of the underlying trends present in the data, the following caveats should be noted while making any inference from the dashboard -</p>
<ul>
<li>
<p>The data reported by the government does not reflect (or explain) possible redundancy in the dataset. For example, if a car was bought in 2017 and sold in 2019, there is a high probability that both records are present in the dataset. For our analysis, we have assumed each record to be unique.</p>
</li>
<li>
<p>I have created the luxury segment based on this <a href="https://www.caradvice.com.au/734435/premium-new-car-sales/">link</a>. I realise that certain brands in the non-luxury segments also have highly priced cars; however, for the sake of simplicity I have used the list mentioned earlier</p>
</li>
<li>
<p>The “Other” sedan segment in the dashboard primarily refers to brands that have either closed or will close in the near future. A lot of the brands in this segment are legacy companies and have already closed many of their operations within the state. I did not feel they needed to be clubbed under either the “luxury” or the “non-luxury” segment, as there may have been multiple factors determining the colours of these brands apart from the most common one - supply and demand.</p>
</li>
</ul>
<p>Source - <a href="https://data.gov.au">Australian government</a></p>

<p><em>Which colour sedans do people prefer buying, and is there a difference in the trend between different segments of cars?</em></p>

<h2 id="tweet-sentiment-analysis">Tweet Sentiment Analysis - Naive Bayes Classifier</h2>
<p>Published 2020-03-06 · <a href="https://smitan94.github.io/Tweet%20Sentiment%20Analysis%20-%20Naive%20Bayes%20Classifier">https://smitan94.github.io/Tweet%20Sentiment%20Analysis%20-%20Naive%20Bayes%20Classifier</a></p>

<p>Today, a large amount of the data around us is in unstructured formats such as bills, receipts, articles, and blogs. If we are to truly realise the combined potential of machine learning and data, it is essential to be able to extract information from these formats.</p>
<p>Background: Sentiment analysis, a.k.a. opinion mining, is the process of extracting the subjective information that underlies a text using text analysis and natural language processing methods. This can be an opinion, a judgment, or a feeling about a particular topic or subject. The most common type of sentiment analysis consists of classifying a statement as ‘positive’, ‘negative’ or ‘neutral’.</p>
<p>Recently, analysing this data to understand brand reputation has become a big part of what companies do. It provides companies with a more quantifiable way to get feedback on their products and any relevant associations. Gathering this form of feedback used to be both time consuming and resource intensive, and hence a lot of companies are now investing significantly in managing their social media accounts along with a social media mining team.</p>
<p>Problem Statement: We are provided with a large, semi-cleaned Twitter corpus consisting of ~30K tweets. Our main objective is to classify these tweets as positive, negative, or neutral and to check the accuracy of our model.</p>
<p><a href="https://github.com/Smitan94/Data-Science/blob/master/Tweet%20Sentiment%20Analysis.ipynb">Link to the project code</a></p>
<p>Key Learnings -</p>
<p>The most important step in sentiment analysis is the way you pre-process, or clean, the data. In this analysis, pre-processing was an iterative process: after cleaning the data for the first time, I validated my model against a validation dataset, performed an error analysis (manually verifying which tweets were mislabelled), and then cleaned the data again accordingly. After repeating these steps a couple of times, I settled on the steps listed below (a sketch of this pipeline, together with the classifier, follows the table) -</p>
<table>
<thead>
<tr>
<th>Steps taken</th>
<th>Reason</th>
</tr>
</thead>
<tbody>
<tr>
<td>Removing all links from the tweets</td>
<td>We do not get any information regarding sentiments from the links themselves</td>
</tr>
<tr>
<td>Removing all sentences with less than 3 words</td>
<td>Not enough words to gain any useful information</td>
</tr>
<tr>
<td>Converting all the words to lowercase</td>
<td>This is to ensure we have standardised words across tweets</td>
</tr>
<tr>
<td>Stopwords and punctuations have been removed</td>
<td>Not enough information from these words</td>
</tr>
<tr>
<td>Lemmatized the words</td>
<td>This is to ensure we do not have multiple forms of the same word</td>
</tr>
<tr>
<td>Removing special characters (“@”,”$”,etc.) from the text</td>
<td>Do not get information about sentiment from these characters</td>
</tr>
<tr>
<td>Removing numbers from the tweets</td>
<td>Do not get information about sentiment from these characters</td>
</tr>
<tr>
<td>Removing repetitive characters from the tweet (e.g. goooaaalll changed to goal)</td>
<td>We gain the same information from both forms of the word, and this minimises the total number of word features present in our dataset</td>
</tr>
</tbody>
</table>
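<p>A compact sketch of how the cleaning steps in the table and a Naive Bayes classifier could be wired together with NLTK and scikit-learn; the file name, column names, and exact regular expressions are assumptions rather than the notebook’s actual code.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical cleaning + classification pipeline for the tweet dataset.
import re
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

nltk.download("stopwords"); nltk.download("wordnet")
STOP = set(stopwords.words("english"))
LEMMA = WordNetLemmatizer()

def clean_tweet(text: str) -> str:
    text = re.sub(r"http\S+|www\.\S+", " ", text)    # drop links
    text = text.lower()                               # standardise case
    text = re.sub(r"(.)\1{2,}", r"\1", text)          # goooaaalll -> goal
    text = re.sub(r"[^a-z\s]", " ", text)             # drop punctuation, @, $, digits
    words = [LEMMA.lemmatize(w) for w in text.split() if w not in STOP]
    return " ".join(words)

df = pd.read_csv("tweets.csv")                        # assumed file and columns
df["clean"] = df["text"].map(clean_tweet)
df = df[df["clean"].str.split().str.len() >= 3]       # drop very short tweets

model = Pipeline([("bow", CountVectorizer()), ("nb", MultinomialNB())])
model.fit(df["clean"], df["label"])                   # labels: positive/negative/neutral
</code></pre></div></div>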
<p>Key Insight -</p>
<p>As part of the exploratory data analysis, I plotted the frequency of the top 20 words for the tweets classified as positive, negative, and neutral. We notice that a lot of the top words present in the neutral list, such as “good”, “but”, “day”, and “work”, are present in all three types of tweets. In these cases, it is the adverbs and adjectives that make the difference in the tweet. For example, “beautiful day”, “day sucks”, and “Leave during day time” would be classified as “positive”, “negative” and “neutral” respectively. This acts as a major drawback for our model.</p>
<p>Limitations -</p>
<p>Since we are using a Naive Bayes classifier, the sequence of the words is not taken into consideration while classifying the tweets. This matters because tweets are very short (historically capped at 140 characters) and a lot of words overlap across all three labels. We will take word order into consideration in our next model.</p>

<p><em>To classify the sentiments behind a large corpus of tweets.</em></p>

<h2 id="covid-19-knowledge-distiller">COVID-19 Knowledge Distiller</h2>
<p>Published 2020-01-06 · <a href="https://smitan94.github.io/COVID-19%20Knowledge%20Distiller">https://smitan94.github.io/COVID-19%20Knowledge%20Distiller</a></p>

<p>In a pandemic, clinicians and researchers urgently need rapid, high-quality information to inform diagnostics and therapeutics relating to the disease. We were tasked with providing solutions for this.</p>
<p>Background: Traditional research models that produce trustworthy and methodologically sound results take time, which does not fit well with a pandemic context where research has to be fast-tracked. The ongoing coronavirus disease 2019 (COVID-19) pandemic has demonstrated the volume and velocity of scientific information that can be produced in a short period of time. For COVID-19, some of these traditional delays have been circumvented, as many medical journals have prioritized publications related to COVID-19 and there is greater use of preprint servers to make research findings immediately available in an open format.</p>
<p>In the context of the COVID-19 pandemic, researchers and clinicians require a reliable model to mine the published literature for novel insights, emerging risk factors, and therapeutics to inform their work in combating the pandemic.</p>
<p>Problem Statement: We need to present an innovative text mining and analytical tool that will aid clinicians and researchers in extracting valuable insights from large datasets of literature.</p>
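<p>To make the idea concrete, here is a small illustrative sketch of one way such a tool could rank articles by relevance to a risk-factor query, using TF-IDF and cosine similarity; the file name, column names, and query are assumptions, not the approach in the linked Kaggle notebook.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Rank article abstracts by similarity to a risk-factor query (illustrative only).
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

papers = pd.read_csv("metadata.csv")                  # assumed literature metadata file
abstracts = papers["abstract"].fillna("")

vec = TfidfVectorizer(stop_words="english", max_features=50_000)
doc_matrix = vec.fit_transform(abstracts)

query = "covid-19 coronavirus clinical risk factors comorbidities"
scores = cosine_similarity(vec.transform([query]), doc_matrix).ravel()

top = scores.argsort()[::-1][:10]                      # ten most relevant abstracts
print(papers.iloc[top][["title"]].assign(score=scores[top]))
</code></pre></div></div>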
<p><a href="https://www.kaggle.com/ravikiranbhaskar/covid-knowledge-distiller">Link to the project code</a></p>
<p>For this task, we formed a team of data scientists, software engineers, clinicians, and medical researchers to enable a credible and informed approach to developing the text-mining model. Our model automates the knowledge discovery process, aiding researchers and clinicians in their pursuit of appropriate treatment and management of COVID-19 cases. It does this by identifying whether a given medical article is related to COVID-19 and assessing its relevance to the competition task of identifying clinical risk factors embedded in the literature. The assumption here is that the supplied databases collectively contain relevant information suitable for extraction. While the tool we developed was customised to automatically identify COVID-19-related risk factors, it can potentially be expanded to extract other useful information from medical literature and to build knowledge bases.</p>

<p><em>To search through 45K research articles and generate insights to help medical teams fight the pandemic.</em></p>

<h2 id="gesture-recognition">Gesture Recognition - Neural Network</h2>
<p>Published 2020-01-05 · <a href="https://smitan94.github.io/Gesture%20Recognition%20-%20Neural%20Network">https://smitan94.github.io/Gesture%20Recognition%20-%20Neural%20Network</a></p>

<p>We were hired by a smart TV manufacturer that aims to add gesture recognition capabilities to its next-generation TVs, set to be released within the next year.</p>
<p>Background: Everywhere we look, we see devices doing more than they were originally meant to. We have everything from everyday household devices such as smart watches, smart bulbs, and smart TVs to sophisticated machinery such as smart cars and smart medical devices. All of these things have one common theme: they reduce the manual effort required to do a particular task. Examples of this are voice recognition to activate your speakers and voice assistants like Siri and Cortana, which are able to recognise a voice, decipher its meaning, and then act on it. Gesture recognition takes this one step further.</p>
<p>Problem Statement: We need to build a predictive model using advanced Deep Learning algorithms that can recognise which of 5 gestures was performed and then act accordingly.</p>
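<p>For illustration only (the starter code for this project came from IIIT-B and UpGrad), a 3D-convolutional classifier for short gesture clips might be set up as below; the input size of 30 frames at 120x120 RGB and the layer sizes are assumptions.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Illustrative 3D-CNN sketch for classifying a short clip into one of 5 gestures.
from tensorflow.keras import layers, models

def build_gesture_model(num_classes: int = 5):
    model = models.Sequential([
        layers.Input(shape=(30, 120, 120, 3)),          # frames, height, width, channels
        layers.Conv3D(16, kernel_size=3, activation="relu"),
        layers.MaxPooling3D(pool_size=2),
        layers.Conv3D(32, kernel_size=3, activation="relu"),
        layers.MaxPooling3D(pool_size=2),
        layers.GlobalAveragePooling3D(),
        layers.Dense(64, activation="relu"),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

build_gesture_model().summary()
</code></pre></div></div>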
<p><a href="https://github.com/Smitan94/Data-Science/blob/master/Gesture%20recognition%20-%20Neural%20Network.ipynb">Link to the project code</a></p>
<p>This case study wouldn’t have been possible without the help of my teammate Keerthi Gayam. Thanks, Keerthi, for your big help on this.</p>
<p>Please note that the main body of the code was provided to us by IIIT-B and UpGrad; we worked only on adding and tuning the hyper-parameters. This has been highlighted in our code.</p>

<p><em>To learn and categorise 5 different gestures from a set of videos using deep learning algorithms.</em></p>

<h2 id="telecom-churn">Predicting churn in the Telecom Industry - Advanced ML</h2>
<p>Published 2020-01-04 · <a href="https://smitan94.github.io/Telecom%20Churn%20Case%20Study%20-%20Advanced%20ML">https://smitan94.github.io/Telecom%20Churn%20Case%20Study%20-%20Advanced%20ML</a></p>

<p>In the telecom industry, customers can choose from multiple service providers and actively switch from one operator to another. In this highly competitive market, the industry experiences an average annual churn rate of 15-25%. Given that it costs 5-10 times more to acquire a new customer than to retain an existing one, customer retention has become even more important than customer acquisition.</p>
<p>For many incumbent operators, retaining highly profitable customers is the number one business goal.</p>
<p>Background: To reduce customer churn, telecom companies need to predict which customers are at high risk of churning. We have been hired by a telecom industry giant to look at customer-level data, identify the customers at high risk of churn, and determine the main indicators of churn.</p>
<p>Problem Statement: We need to build a predictive model using advanced Machine Learning algorithms in order to predict the customers at high risk of churn along with the key indicators of churn.</p>
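<p>As an illustrative sketch only (not the approach in the linked notebook), a random forest can flag high-risk customers and surface candidate churn indicators via feature importances; the file name, the “churn” column, and the 0.7 threshold are assumptions.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical churn model: flag high-risk customers and list key indicators.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

df = pd.read_csv("telecom_customers.csv")             # assumed file name
X, y = df.drop(columns=["churn"]), df["churn"]
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

clf = RandomForestClassifier(n_estimators=300, random_state=42).fit(X_train, y_train)

# Customers whose predicted churn probability exceeds 0.7 are flagged as high risk.
high_risk = X_test[clf.predict_proba(X_test)[:, 1] > 0.7]

# Feature importances act as a first cut at the "key indicators of churn".
indicators = pd.Series(clf.feature_importances_, index=X.columns).sort_values(ascending=False)
print(indicators.head(10))
</code></pre></div></div>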
<p><a href="https://github.com/Smitan94/Data-Science/blob/master/Advanced%20ML%20Telecom%20Churn.ipynb">Link to the project code</a></p>
<p>This case study was completed with the help of my teammate Koushal Deshpande. Thanks, Koushal, for your help and your key insights!</p>

<p><em>To identify the customers who are at high risk of switching their telecom provider and to highlight which services they use most.</em></p>

<h2 id="housing-prices-australia">Determining the Housing Prices in Australia - Lasso and Ridge Regression</h2>
<p>Published 2020-01-03 · <a href="https://smitan94.github.io/Australia%20Housing%20Prices%20-%20L1-L2%20regression">https://smitan94.github.io/Australia%20Housing%20Prices%20-%20L1-L2%20regression</a></p>

<p>In this case study, we have been engaged by Surprise Housing, a US-based housing company that uses data analytics to purchase houses below their actual value and flip them at a higher price.</p>
<p>Background: The company wants to enter the Australian market and is hence looking at prospective properties to buy. It wants to understand which factors affect prices and exactly how those factors influence them. The company would then adjust its strategy and concentrate on areas that will yield a high return.</p>
<p>Problem Statement: We need to build a regression model using regularisation in order to predict the actual value of the prospective properties and guide the company in making the best decision.</p>
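<p>A brief sketch of what regularised regression looks like with scikit-learn’s cross-validated Lasso and Ridge; the file name and the “SalePrice” target column are assumptions, and the features are presumed to be numeric after encoding.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Illustrative regularised regression for price prediction.
import pandas as pd
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("housing.csv")                        # assumed file name
X, y = df.drop(columns=["SalePrice"]), df["SalePrice"]

# Cross-validated choice of the regularisation strength (alpha) for each model.
lasso = make_pipeline(StandardScaler(), LassoCV(cv=5, random_state=42)).fit(X, y)
ridge = make_pipeline(StandardScaler(), RidgeCV(alphas=[0.1, 1.0, 10.0, 100.0])).fit(X, y)

# Lasso drives the coefficients of unimportant features to exactly zero,
# which is one way to surface the key factors affecting house prices.
coefs = pd.Series(lasso[-1].coef_, index=X.columns)
print(coefs[coefs != 0].sort_values(key=abs, ascending=False).head(10))
</code></pre></div></div>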
<p><a href="https://github.com/Smitan94/Data-Science/blob/master/Advanced%20%20%20regression%20-%20House%20Prices.ipynb">Link to the project code</a></p>Smitan PradhanTo identify the key factors affecting the house prices in AustraliaClustering countries on socio-economic factors - Clustering Model and Principle Component Analysis2020-01-02T00:00:00+00:002020-01-02T00:00:00+00:00https://smitan94.github.io/Segmenting%20Countries%20-%20PCA<p>In this case study, we have been selected by HELP (an international humanitarian organisation) which is committed to fighting poverty and providing the people of backward countries with basic amenities and relief during the time of disasters and natural calamities.</p>
<p>Background: After its recent funding programmes, HELP has been able to raise around $10 million. Now the CEO of the NGO needs to decide how to use this money strategically and effectively. The main difficulty in making this decision is choosing the countries that are in the direst need of aid.</p>
<p>Problem Statement: We need to categorise the countries using socio-economic and health factors that determine the overall development of each country. We will then suggest the countries which the CEO needs to focus on the most, along with visualisations and reasoning.</p>
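<p>A minimal sketch of the segmentation approach, assuming a country-level CSV with one row per country and numeric indicator columns (the file and column names are assumptions):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Standardise the indicators, reduce them with PCA, then cluster with k-means.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

countries = pd.read_csv("country_data.csv")            # assumed file name
features = countries.drop(columns=["country"])

X = StandardScaler().fit_transform(features)
X_pca = PCA(n_components=3).fit_transform(X)            # keep the main components

labels = KMeans(n_clusters=3, random_state=42, n_init=10).fit_predict(X_pca)
countries["cluster"] = labels

# Cluster means on the original indicators reveal which cluster is worst off
# (e.g. low income and GDP per capita, high child mortality).
print(countries.groupby("cluster").mean(numeric_only=True))
</code></pre></div></div>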
<p><a href="https://github.com/Smitan94/Data-Science/blob/master/Clustering%20Assignment%20-%20Smitan.ipynb">Link to the project code</a></p>Smitan PradhanIdentifying countries in dire need of help by segmenting them on various socio-economic factorsAssigning lead score to customers in Telecom Industry - Simple Logistic Regression2020-01-01T00:00:00+00:002020-01-01T00:00:00+00:00https://smitan94.github.io/Lead%20Score-%20Advanced%20regression<p>In this case study, we have been assigned by an education company named X Education which sells online courses to industry professionals to identify the right leads.</p>
<p>Background: On any given day, many professionals who are interested in the courses land on the company’s website and browse; a lot of leads are generated at this initial stage, but only a few of them convert into paying customers. In the middle stage, the potential leads need to be nurtured well (i.e. educating the leads about the product, communicating with them regularly, etc.) in order to get a higher lead conversion rate.</p>
<p>Problem Statement: To build a model that assigns a lead score to each lead such that customers with a higher lead score have a higher chance of conversion and customers with a lower lead score have a lower chance. The CEO, in particular, has given a ballpark target lead conversion rate of around 80%.</p>
<p>We also need to discuss the important variables behind high lead scores.</p>
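<p>As a compact sketch of how a logistic regression can yield such a score, the snippet below scales the predicted conversion probability to a 0-100 lead score and inspects the standardised coefficients; the file name and the “converted” column are assumptions.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Illustrative lead scoring with logistic regression.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

leads = pd.read_csv("leads.csv")                       # assumed file name
X, y = leads.drop(columns=["converted"]), leads["converted"]

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)).fit(X, y)

# Scale the predicted conversion probability to a 0-100 lead score.
leads["lead_score"] = (model.predict_proba(X)[:, 1] * 100).round()

# The magnitude of the standardised coefficients is one way to discuss
# which variables drive high lead scores.
coefs = pd.Series(model[-1].coef_[0], index=X.columns)
print(coefs.sort_values(key=abs, ascending=False).head(10))
</code></pre></div></div>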
<p><a href="https://github.com/Smitan94/Data-Science/blob/master/Logistic%20Regression%20-%20Lead%20Score.ipynb">Link to the project code</a></p>
<p>This case study was completed with the help of my teammate Koushal Deshpande. Thanks, Koushal, for your help and your key insights!</p>

<p><em>Assigning a lead score to each customer and identifying the key drivers of their score.</em></p>