Outlier detection and removal: z score, standard deviation | Feature engineering tutorial python # 3

codebasics

4 years ago

114,343 views

Comments:

@whimsicalkins5585 - 01.11.2023 09:51

Thanks very much for your simple and clear code.

@ethiotube4805 - 28.08.2023 21:43

can you provide mock interview?

@sadikaljarif9635 - 06.08.2023 16:51

Why do we choose the height column? Why don't we choose the weight column?

@renanaoki714 - 25.04.2023 16:01

Thanks!

@piyush_clashroyale5730 - 08.04.2023 09:35

How is 3 chosen as the cutoff for both the standard deviation method and the z-score method?
Could someone please explain?
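
A short sketch that may help with this question, assuming the column is roughly normally distributed. Since z = (x - mean) / std, a cutoff of |z| > 3 is the same rule as "more than 3 standard deviations from the mean", which is why the same number appears in both methods. The value 3 is a convention rather than a law: under a normal distribution about 99.7% of values fall inside that range. The helper name `normal_coverage` is made up for this illustration.

```python
import math

def normal_coverage(k: float) -> float:
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

# Coverage for the usual 1, 2 and 3 standard deviation cutoffs.
for k in (1, 2, 3):
    print(f"within {k} std: {normal_coverage(k):.4f}")
```

A stricter or looser cutoff (2.5, 4, ...) is a judgment call that depends on how much data you can afford to drop.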

@saifansari6459 - 19.12.2022 20:43

Excellent explanation of every topic; it really helps me a lot in my data science career. Thanks!

@estherugwueke5409 - 03.12.2022 14:58

How can you apply this rule when you have about 10 features? Do you do them one by one?
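
One possible way to avoid going column by column, sketched with pandas. The function name `drop_zscore_outliers`, the column names and the generated data are all assumptions for illustration, not something shown in the video. It computes z-scores for several numeric columns at once and keeps only rows that are within the threshold in every one of them.

```python
import numpy as np
import pandas as pd

def drop_zscore_outliers(df: pd.DataFrame, columns, threshold: float = 3.0) -> pd.DataFrame:
    """Keep rows whose z-score is within the threshold in every listed column."""
    z = (df[columns] - df[columns].mean()) / df[columns].std()
    keep = (z.abs() <= threshold).all(axis=1)
    return df[keep]

# Hypothetical data: two numeric features with one extreme value planted in each.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "height": rng.normal(66, 4, 1000),
    "weight": rng.normal(160, 25, 1000),
})
df.loc[0, "height"] = 120.0
df.loc[1, "weight"] = 900.0

cleaned = drop_zscore_outliers(df, ["height", "weight"])
print(len(df), len(cleaned))
```

Note that a row is dropped if any one column flags it, so the number of removed rows grows with the number of columns checked. Rows with missing values in the checked columns are also dropped by this sketch, because a NaN z-score fails the comparison.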

@pukyalligator - 23.11.2022 10:22

Great Video. Thx!!

@hrush1k - 24.10.2022 05:51

You can also use seaborn to plot the bell curve. It's much easier than the matplotlib method.
seaborn.histplot(data=df.height, kde=True)
kde is the kernel density estimate line.

@anirbaniitgn8407 - 09.10.2022 14:50

Everything works when you apply the z-score to find outliers that are all on one side, either positive or negative. If both positive and negative outliers are present together, it does not work!
data = [1, 2, 2, 2, 3, 1, 1, -19, 2, 2, 2, 3, 1, 1, 2, 19, 25]
Try it with this simple dataset.
With the IQR method you can detect all three: -19, 19, 25.
But with the z-score it does not work.
I don't know the reason. If you know, sir, please let us know.
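
A likely explanation, offered as a sketch rather than the video author's answer: the mean and standard deviation are themselves computed from the outliers. In this 17-value dataset the three extreme values inflate the standard deviation to roughly 8.6, so even 25 ends up only about 2.6 standard deviations from the mean and nothing crosses 3. This is often called masking, and it is worst in small samples. The quartiles used by the IQR method barely move when a few values are extreme, so it still catches them.

```python
import numpy as np

data = np.array([1, 2, 2, 2, 3, 1, 1, -19, 2, 2, 2, 3, 1, 1, 2, 19, 25])

# Z-score: the extreme values inflate the standard deviation they are measured against.
z = (data - data.mean()) / data.std()
z_outliers = data[np.abs(z) > 3]

# IQR: quartiles are hardly affected by a few extreme values.
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
iqr_outliers = data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)]

print("std:", round(float(data.std()), 2))
print("max |z|:", round(float(np.abs(z).max()), 2))
print("z-score outliers:", z_outliers)
print("IQR outliers:", iqr_outliers)
```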

@flaviobrienza7697 - 17.09.2022 18:08

A little suggestion to make it simpler: in the z-score method, I can take the absolute value with np.abs and write a single < 3 condition for the new dataframe.
In addition, to visualize the curve it is better to use sns.histplot with kde=True.
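
The suggestion above can be sketched as follows. The `heights` array is generated here as a hypothetical stand-in for the height column; the point is only that the single `np.abs(z) < 3` condition selects the same values as the two-sided condition.

```python
import numpy as np

rng = np.random.default_rng(42)
heights = rng.normal(66, 4, 10_000)   # hypothetical stand-in for df.height

z = (heights - heights.mean()) / heights.std()

two_sided = heights[(z > -3) & (z < 3)]   # two-condition form
absolute = heights[np.abs(z) < 3]         # single-condition form

print(len(heights), len(absolute))
```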

@chandrasekhar_m - 10.09.2022 13:59

I think thanks is not enough for your teaching skills! Really amazing 👌 This opened my eyes about outliers!

@research__7644 - 24.08.2022 15:53

BRUH... why would you remove one column? This just ruins the purpose.

@kakmca - 13.03.2022 16:13

Wah... extra-ordinary explanation sir. Thank you...

@Medjdiptiranjan - 11.03.2022 12:48

You are simply amazing. Your simple explanation is helping a lot. Thanks a trillion!

@beautyisinmind2163 - 07.03.2022 07:20

hello sir, can we learn personally from you? and how can we contact you

@python360 - 02.03.2022 16:13

Great tutorial, thanks for using readily available sample CSV as well. ☑☑

@sarfrazhussain9851 - 25.02.2022 21:21

Nice effort

@learnerlearner4090 - 20.02.2022 17:33

Your videos are easy to understand. Thanks so much!

@anandshimpi8011 - 06.01.2022 08:19

Really amazing lecture, sir. My interest in data science is increasing.

@Deepsim - 24.12.2021 01:51

Your tutorial is so clear. Well done!

@AryanFelix - 24.11.2021 21:34

How do we determine the Z-Score range for Skewed data? Do I use the same range on either side (like -3 to 3) or can I use different values like -1 to 3 (for left skewed data) after looking at the histogram plot?

Thanks in advance!

@modhua4497 - 13.11.2021 16:26

Does this work only if the feature is normally distributed? Most of the features in real world data are not normally distributed.
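
The mean and standard deviation cutoff does lean on the data being roughly bell-shaped. One commonly used alternative for other shapes, sketched here as an option and not as the method from the video, is the modified z-score, which swaps the mean for the median and the standard deviation for the median absolute deviation (MAD). The constant 0.6745 and the cutoff 3.5 are the values usually quoted for it; the `income` data is generated purely for illustration.

```python
import numpy as np

def modified_zscore(x: np.ndarray) -> np.ndarray:
    """Robust z-score based on the median and the median absolute deviation (MAD)."""
    median = np.median(x)
    mad = np.median(np.abs(x - median))
    if mad == 0:
        raise ValueError("MAD is zero; the modified z-score is undefined for this data")
    return 0.6745 * (x - median) / mad

# Hypothetical right-skewed feature (log-normal) with one planted extreme value.
rng = np.random.default_rng(1)
income = np.append(rng.lognormal(mean=10, sigma=0.5, size=500), 5_000_000)

scores = modified_zscore(income)
outliers = income[np.abs(scores) > 3.5]   # 3.5 is the commonly cited cutoff
print(len(outliers), outliers.max())
```

On strongly skewed data a symmetric cutoff will still flag more points in the long tail, so transforming the feature first (for example with a log) or using percentile limits may suit the data better.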

@siddharthmodi2740 - 18.10.2021 11:16

woww! what a simple and easy to understand tutorial. Love it. Thank you sir.

@user-mq7xq1hi2q - 09.09.2021 22:04

Thanks!

@Kingcolumbian - 19.08.2021 17:50

You know Python, but you don't know much about statistics when it comes to identifying outliers in normally distributed data.

@priyantangupta5176 - 14.08.2021 18:12

Hello! Your lesson is very helpful for me. Can you explain how I can find outliers using multiple parameters? I want to find the outliers using all the columns of my data together. What should I do?
Thank you in advance.
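
One standard technique for judging all columns together is the Mahalanobis distance, which measures how far a row is from the column means while accounting for how the columns vary together. The sketch below is only an illustration: the video does not cover it, the data is generated, and the cutoff of 3 is an arbitrary choice for the example.

```python
import numpy as np

def mahalanobis_distances(X: np.ndarray) -> np.ndarray:
    """Distance of each row from the column means, scaled by the covariance matrix."""
    centered = X - X.mean(axis=0)
    inv_cov = np.linalg.pinv(np.cov(X, rowvar=False))
    return np.sqrt(np.einsum("ij,jk,ik->i", centered, inv_cov, centered))

# Hypothetical correlated height/weight data.
rng = np.random.default_rng(7)
height = rng.normal(66, 4, 500)
weight = 2.5 * height + rng.normal(0, 5, 500)
X = np.column_stack([height, weight])

# A tall but very light row: each value is ordinary on its own, the combination is not.
X = np.vstack([X, [74.0, 140.0]])

d = mahalanobis_distances(X)
flagged = np.where(d > 3)[0]   # cutoff of 3 is an illustrative choice
print(flagged)
```

The planted last row would pass a per-column z-score check, since neither value is more than 3 standard deviations from its own column mean, yet it stands out once the relationship between the columns is taken into account.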

@ajaykushwaha-je6mw - 27.07.2021 13:36

I have a question; kindly answer. Suppose we have 20 columns and we remove outliers from all 20 columns. We exclude a small amount of data for each column, so altogether we lose a huge amount of data. Is this a correct way to handle outliers?

@ajaykushwaha-je6mw - 27.07.2021 13:34

Is removing outliers the better option, or is replacing outliers with another value the better option?
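
For comparison with removal, here is a minimal sketch of the replacement route, often called capping or winsorizing: values beyond mean ± 3 standard deviations are clipped to those limits instead of being dropped, so no rows are lost. The function name `cap_outliers` and the data are assumptions for illustration. Which option is better depends on whether the extreme values are errors or genuine observations.

```python
import numpy as np

def cap_outliers(x: np.ndarray, n_std: float = 3.0) -> np.ndarray:
    """Clip values to the range mean ± n_std * std instead of dropping them."""
    lower = x.mean() - n_std * x.std()
    upper = x.mean() + n_std * x.std()
    return np.clip(x, lower, upper)

rng = np.random.default_rng(3)
heights = np.append(rng.normal(66, 4, 1000), [20.0, 130.0])  # hypothetical data

capped = cap_outliers(heights)
print(len(capped) == len(heights), capped.min(), capped.max())
```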

@AlonAvramson - 24.07.2021 07:11

Thank you!

@0SIGMA - 27.06.2021 15:17

Hey, why can't we use 'StandardScaler' and delete all outliers?

@likhithsasank8017 - 15.06.2021 19:59

Thank you so much, sir. Your way of teaching is so clear and easy to understand.

@Artech.Ranjit - 10.06.2021 10:24

How do you decide on 3 as the threshold for z-score values? For example, you used zscore > 3.

@dipto624 - 30.04.2021 06:43

man!! I was struggling with how to use statistics in EDA. I knew std, mean n all but couldn't use them in the EDA flow. u just cleared my confusion!!!! u won't believe how long I have been struggling with this.. thank god I found this video.. u r a great teacher.. I had the tools but couldn't use them. u just taught me how to use it..

@rsinh3792 - 28.04.2021 14:58

Sir, a reviewer has asked me this question and I don't know how to address it. Can you please guide me? "Use some statistical significant test such as T-test or ANOVA to prove you validate the proposed diagnostic model on patients and quality improvements of your method". I have two datasets. Dataset 1 was used to train the model and dataset 2 was used to validate the trained model. I have trained the ML model, deployed it, validated it on new data, and presented the results. Actually, I have understood the question. Shall I apply the statistical test between the performance metrics of the trained model results and the validation results? Please help me, sir.

@barkhapaswan5807 - 15.04.2021 19:53

🙌🙌🙌

@shounaksushantadasgupta8440 - 10.04.2021 05:01

How do I remove outliers from a dataframe that has categorical as well as continuous data? With the percentile technique I am getting NaN values in the categorical columns.
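
A guess at the cause, since the original code is not shown: applying a comparison to the whole dataframe and indexing with the result turns every cell that fails the test into NaN. One way around it is to build a boolean mask from the numeric column only and use that mask to select whole rows, so categorical columns pass through untouched. The frame, column names and the 5%/95% limits below are made up for this sketch.

```python
import pandas as pd

# Hypothetical mixed-type frame.
df = pd.DataFrame({
    "gender": ["M", "F", "F", "M", "F", "M"],
    "height": [65.0, 63.5, 66.2, 64.8, 120.0, 67.1],
})

lower, upper = df["height"].quantile([0.05, 0.95])

# Boolean mask built from the numeric column only, then used to select whole rows.
mask = df["height"].between(lower, upper)
cleaned = df[mask]
print(cleaned)
```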

@lamphantung5450 - 24.03.2021 07:49

Sir, thank you for this topic, it's very useful. Can you make more outlier removal tutorials, such as using unsupervised learning for outlier detection?

@pythongui5199 - 03.03.2021 20:59

Very nice

@trinayanbharadwaj146 - 27.02.2021 08:54

How can we apply this to multiple columns?
Is there a shorter way, or do we have to do it manually for every column?

@Hale-xn6ec - 05.02.2021 13:35

It is a really beneficial and useful video on this topic, thank you!

@srishtikumari6664 - 24.12.2020 09:30

Very well explained sir!!
Worth watching

@ssrriinniivvaass - 22.12.2020 07:27

Hi Sir,
How do I decide Z score values, does it depend on my data or is it always -3 to +3?

@tucomax - 05.12.2020 02:34

Question, say you have a df of drink consumption and if you don't want to eliminate the outliers but instead replace them with NaN and keep the zero values of the dataframe, what would you do? Thanks
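
One possible approach, sketched with pandas `Series.mask`, which replaces values where a condition is True (with NaN by default). Because the condition is based on the z-score and not on the value being zero, genuine zeros stay in place. The `drinks` column and its values are invented for this example.

```python
import pandas as pd

# Hypothetical drink-consumption column with genuine zeros and one extreme value.
df = pd.DataFrame({"drinks": [0, 2, 1, 0, 3, 2, 1, 0, 2, 1, 3, 2, 0, 1, 2, 250]})

z = (df["drinks"] - df["drinks"].mean()) / df["drinks"].std()

# Values with |z| > 3 become NaN; everything else, zeros included, is kept.
df["drinks_clean"] = df["drinks"].mask(z.abs() > 3)
print(df["drinks_clean"].isna().sum(), (df["drinks_clean"] == 0).sum())
```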

@yogeshbharadwaj6200 - 20.11.2020 18:42

Thanks for the very detailed explanation, sir...

@SurajKumar-bw9oi - 04.11.2020 17:13

Don't worry about the histogram plotting code; one simple alternative is:

import seaborn as sns
sns.histplot(df['XYZ'], kde=True)  # distplot is deprecated since seaborn 0.11

@abdeali004 - 02.11.2020 09:37

Great Greaaaaat and a fulll too Greaaattttt explanation man. Loved it.
