Data Science Leadership Series : Part 1 – The need for AI Product Management

People keep talking about data science being such a rewarding and lucrative career, but I feel it's time to talk about the serious gaps that have been bugging this field when it comes to operationalizing successful data teams. A few tech companies, like the Googles and Microsofts of the world, have got this working like a well-oiled …

The Power of Distributed XGBoost: Efficient and Cost-Effective Training for Petabytes of Data

Featured ~ Harini Kannan

In the era of big data, managing and processing large volumes of information is a challenge faced by many organizations. As a data science professional, one must constantly explore innovative techniques to extract meaningful insights from massive datasets. You would be surprised how many data teams in really large organizations still default to neural network …

Explaining P-value to a non technical audience

10 Dec 2021 ~ Harini Kannan

Wikipedia defines p-value as "the probability of obtaining test results at least as extreme as the results actually observed, under the assumption that the null hypothesis is correct". Well, if we give this definition in, say, a presentation to a product or business team, we will most probably receive piercing, puzzled looks. One of the major …
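To make the quoted definition concrete, here is a minimal sketch using only the Python standard library. The scenario is an assumption made up for illustration and is not from the post: a coin lands heads 16 times in 20 flips, and the null hypothesis is that the coin is fair. The helper name `binomial_p_value` is hypothetical.

```python
from math import comb


def binomial_p_value(successes: int, trials: int, p_null: float = 0.5) -> float:
    """One-sided p-value: P(X >= successes) when X ~ Binomial(trials, p_null)."""
    return sum(
        comb(trials, k) * p_null**k * (1 - p_null) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# Illustrative numbers only: 16 heads in 20 flips, null hypothesis = fair coin.
p_value = binomial_p_value(successes=16, trials=20)
print(f"p-value: {p_value:.4f}")  # p-value: 0.0059
```

Read in plain language: if the coin really were fair, a result this lopsided (or more so) would show up in roughly 6 out of every 1,000 repetitions of the experiment.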

Understanding Multicollinearity and Confounding Variables in Regression

10 Oct 2021 (updated 13 Oct 2022) ~ Harini Kannan

Multicollinearity: When two or more of the predictors are correlated, the phenomenon is called multicollinearity. It affects the resulting coefficients by masking the underlying individual weights of the correlated variables, which is why model weights are not the same as feature importance. Ways to deal with multicollinearity: Looking at the Variance Inflation Factor (VIF), which measures the …
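As a rough illustration of the VIF mentioned above, the sketch below uses the standard textbook definition, VIF = 1 / (1 - R²), where R² comes from regressing one predictor on all the others. The function name `variance_inflation_factors` and the synthetic data are assumptions for this example, not taken from the post, and NumPy is assumed to be installed.

```python
import numpy as np


def variance_inflation_factors(X: np.ndarray) -> np.ndarray:
    """Return one VIF per column of X, using VIF_j = 1 / (1 - R_j^2)."""
    n_rows, n_cols = X.shape
    vifs = np.empty(n_cols)
    for j in range(n_cols):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress column j on the remaining columns plus an intercept.
        design = np.column_stack([np.ones(n_rows), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        r_squared = 1.0 - residuals.var() / target.var()
        vifs[j] = 1.0 / (1.0 - r_squared)
    return vifs


# Synthetic predictors: x2 is nearly a copy of x1, x3 is independent.
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = x1 + rng.normal(scale=0.1, size=500)
x3 = rng.normal(size=500)

vifs = variance_inflation_factors(np.column_stack([x1, x2, x3]))
print(vifs)
```

With this data the first two VIFs come out large and the third stays close to 1, which is the pattern that flags `x1` and `x2` as the collinear pair.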

Unnest (explode) a column of list in Pandas

8 Oct 2021 ~ Harini Kannan

In Python, when you have a list of lists and convert it directly to a pandas DataFrame, you get columns of lists. This may seem overwhelming, but fear not! Pandas comes to our rescue once again: use pandas.DataFrame.explode().

import pandas as pd
df = pd.DataFrame({'col1': [[0, 1, 2], 'foo', [], [3, 4]], 'col2': 1, …
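Since the snippet in the excerpt is cut off, here is a separate, self-contained sketch of the same `DataFrame.explode()` call. The column names `tags` and `post_id` and their values are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical data: one column holds lists, the other holds scalars.
df = pd.DataFrame({"tags": [["a", "b"], ["c"], []], "post_id": [1, 2, 3]})

# explode() emits one row per list element and repeats the other columns.
# An empty list becomes a single row containing NaN.
exploded = df.explode("tags").reset_index(drop=True)
print(exploded)
```

By default `explode()` repeats the original index label for every element of a list, so the `reset_index(drop=True)` call is there only to give the result a clean 0..n-1 index.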

RStudio in Docker – now share your R code effortlessly!

25 May 2019 ~ Harini Kannan ~ 4 Comments

If you are a full-time data science practitioner and have passed through the stages of starting out with the Titanic dataset and working through the various exercises on Kaggle, you would know by now that we wish real-world data problems were that simple, but they are not! This post is about just one …