How do I find and remove duplicate rows in pandas?

Data School

7 years ago

104,901 views

Comments:

Kris - 02.02.2023 15:47

Kevin, your videos are super helpful! Thank you!!!

Jessica Fletcher - 31.12.2022 19:48

OMG I WANT TO THANK YOU SOOOO MUCH 😊 I'd been stuck on this problem for days, and the way you explain it makes it so much easier than how I learned it in class. I was so happy not to see that error message 😂 Thank you

Mukul Pandey - 20.11.2022 09:51

Would love to have more videos like this.

Ildar Gabitov - 08.11.2022 20:50

Thank you so much, you made my day. I finally found the line of code that I really needed to finish my task :) (Code Line 17)

Cradle of Relaxation - 29.10.2022 19:20

This is so helpful!
Pandas has the best duplicate handling I've used, better than spreadsheets or SQL.

Alishba Khan - 15.10.2022 12:34

Thank you so much 💕 your videos are really amazing... Can you tell me how to read a CSV that has no header on the first line, and set the first row with non-null values as the header?
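
One possible sketch in pandas, assuming a hypothetical file 'data.csv' where the real header is the first row with no missing values:

import pandas as pd

# Read everything as data, with no header row
df = pd.read_csv('data.csv', header=None)

# Locate the first row that has no null values
first_full = df.notna().all(axis=1).idxmax()

# Promote that row to the header and keep only the rows after it
df.columns = df.loc[first_full]
df = df.loc[first_full + 1:].reset_index(drop=True)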

karthik kadadevarmath - 13.10.2022 18:17

How do I remove just the repeated names? For example, I have multiple rows with the same name, but the same name has multiple heart rate measurements, and I just want a single name. For example, imagine this is the table:
name heart rate
Aaron 79
Aaron 80
Aaron 90
I want the name to display only once.
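
A minimal sketch of one way to do this in pandas, assuming the columns are named 'name' and 'heart rate': keep every measurement but blank out the repeated names.

import pandas as pd

df = pd.DataFrame({'name': ['Aaron', 'Aaron', 'Aaron'],
                   'heart rate': [79, 80, 90]})

# duplicated() marks every occurrence after the first,
# so this blanks the name on all but its first row
df.loc[df['name'].duplicated(), 'name'] = ''
print(df)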

Minah'A - 11.08.2022 07:45

Just found your channel, watched this as the first of your videos, and pressed subscribe!!! Your explanation of the idea as a whole is very remarkable 😃 Thanks a lot.

Soman Talha - 03.08.2022 13:16

beneficial videos. ❤

Tito Lee - 25.06.2022 22:51

Thank you! You sound like Kamala Harris lol

Chandrapati Bhanuprakash | AP19110010155 - 20.04.2022 18:28

It helps me a lot. Can you explain how we get the count of each duplicated value?
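
One hedged sketch: group on every column and count each group (toy columns below stand in for your own).

import pandas as pd

df = pd.DataFrame({'age': [20, 20, 30], 'zip_code': ['111', '111', '222']})

# Count how many times each distinct row occurs,
# and keep only the ones that appear more than once
counts = df.groupby(list(df.columns)).size()
print(counts[counts > 1])   # (20, '111') appears twice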

Ishkatan - 14.04.2022 15:54

Good lesson, but the datatype has to match. I found I had to process my pandas tables with .astype(str) before this worked.
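
A small illustration of the datatype point: a string '1' and an integer 1 compare as different, so rows that look identical are not flagged until the types are unified.

import pandas as pd

df = pd.DataFrame({'id': ['1', 1], 'x': ['a', 'a']})

print(df.duplicated().sum())              # 0: '1' (str) != 1 (int)
print(df.astype(str).duplicated().sum())  # 1: now both rows match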

Harpreet sandhu - 05.03.2022 01:26

How do I drop a column in which 95% of the values are the same, in Python?
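
One possible sketch: measure the share taken by each column's most frequent value and drop the columns at or above 95%.

import pandas as pd

# Assuming df is your DataFrame:
# fraction of rows occupied by each column's most common value
top_share = df.apply(lambda col: col.value_counts(normalize=True, dropna=False).max())

# Drop the columns dominated by a single value
df = df.drop(columns=top_share[top_share >= 0.95].index)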

Captain America RY - 09.01.2022 16:35

thank you ...!!!

Anant Gosai - 25.11.2021 12:29

That was so accurate, thanks a lot genius!

Rajiv Jani - 27.09.2021 23:16

If I have a dataframe with a million rows and 15 columns, how do I figure out whether any column in my dataframe has a mixed data type?
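
One hedged way to check: object-dtype columns are the usual suspects, and mapping each cell to its Python type exposes genuine mixtures.

# Assuming df is your DataFrame:
# object columns can hide mixed Python types
for col in df.select_dtypes(include='object').columns:
    n_types = df[col].map(type).nunique()
    if n_types > 1:
        print(col, 'mixes', n_types, 'different types')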

Isaac - 29.08.2021 02:51

I love you, sir.

zma314125 - 25.08.2021 22:51

Thank you!

Rational Indian - 22.08.2021 11:54

Brilliant video.

m marva - 03.08.2021 22:24

Thank you for this content! I have a question: how can we handle quasi-redundant values in different columns? (Imagine two different columns whose values agree about 80% of the time.) Thanks a lot
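
A rough sketch of one approach, with hypothetical columns 'a' and 'b': quantify how often the two columns agree before deciding whether one is redundant.

# Assuming df is your DataFrame:
# share of rows where the two columns carry the same value
agreement = (df['a'] == df['b']).mean()
if agreement >= 0.8:
    df = df.drop(columns=['b'])  # 'b' adds little beyond 'a'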

Dejan Jovanovic - 04.07.2021 15:19

HOW DO YOU KNOW WHAT I NEED? YOU ARE MY FAVORITE TEACHER FROM NOW ON

Linda fl - 22.06.2021 02:59

Hello, thank you for the video. I'm wondering if you can make some tutorials about API requests.

Imad Uddin - 18.06.2021 06:42

Thanks a lot. It was a great help. Much appreciated!

Reaz Ahmed - 16.06.2021 04:22

How do I access the IPython/Jupyter Notebook link? It is not available in the GitHub repository.

HarshInDublin - 11.06.2021 00:20

Thanks for the video

Asad Ghnaim - 02.06.2021 00:35

When I use the parameter keep=False, I get fewer rows than keep='first' and keep='last' combined. What is the reason for that??
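
That is expected: in a group of n identical rows, keep='first' marks n-1 rows and keep='last' marks n-1 rows (2n-2 combined), while keep=False marks all n, and n is smaller than 2n-2 for any group of three or more. For example:

import pandas as pd

df = pd.DataFrame({'x': [1, 1, 1, 2]})

print(df.duplicated(keep='first').sum())  # 2 (all but the first 1)
print(df.duplicated(keep='last').sum())   # 2 (all but the last 1)
print(df.duplicated(keep=False).sum())    # 3 (every 1), less than 2 + 2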

Anastasia - 27.04.2021 20:46

Jeez you just saved me so much work for a seemingly unsolvable project 🙏☕

Halil Durmaz - 27.04.2021 01:48

Clean and informative!

Oasis God - 17.04.2021 07:02

Great video. But I'd just like to find the duplicates in one column, then go to another column and find its duplicates, then another, and keep only one row with certain information.

Tony Gonsa - 06.04.2021 19:19

Very methodical explanation

Balaji Bhaskarrao Kondhekar - 22.03.2021 06:40

You have done a very good job of explaining the DataFrame and have made it very easy to understand, especially for people who work in Excel.
Best wishes from me

Asif Sohail - 01.03.2021 00:01

How can we efficiently find near-duplicates in a dataset?
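
One standard-library sketch for the text case: score pairwise string similarity with difflib and flag pairs above a threshold (dedicated fuzzy-matching libraries scale better, but the idea is the same).

from difflib import SequenceMatcher

def nearly_equal(a, b, threshold=0.9):
    # Ratio of matching characters over total length
    return SequenceMatcher(None, a, b).ratio() >= threshold

print(nearly_equal('Data School', 'Data School'))  # True
print(nearly_equal('Data School', 'Data Schol'))   # True (ratio ~0.95)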

Antony Joy - 03.02.2021 19:04

This is the case of complete duplicates. So what should we do when we have to deal with incomplete duplicates? E.g. age, gender and occupation are the same, but zip is different.
Could you also make a video on that, please?
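
The subset parameter shown in the video handles exactly this: pass only the columns that define a duplicate. A hedged sketch using the commenter's column names:

# Assuming df is your DataFrame: rows match when age, gender and
# occupation all agree, even if zip differs
df = df.drop_duplicates(subset=['age', 'gender', 'occupation'])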

Brian Waweru - 01.02.2021 16:11

Wait Kevin, keep='first' means the rows marked as duplicates are the ones towards the bottom, i.e. the ones with a higher index. So keep='last' means...?? Oh man, I'm getting mixed up. Could someone please explain it to me? Kevin, please?
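
A tiny example may untangle it: keep='first' keeps the first copy and marks the later ones as duplicates; keep='last' keeps the last copy and marks the earlier ones.

import pandas as pd

df = pd.DataFrame({'x': [1, 1, 2]})

print(df.duplicated(keep='first').tolist())  # [False, True, False]
print(df.duplicated(keep='last').tolist())   # [True, False, False]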

Dr sheldon cooper - 07.01.2021 06:57

Amazing, and thanks bro. The right place for data queries.

Carlos Fernando Aguirre Toro - 09.11.2020 23:33

Great video. This helped me tremendously.
How would you go about finding duplicates case-insensitively on a certain field?
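
One hedged sketch, assuming the field is a hypothetical string column 'name': normalize the case first, then test for duplicates on the normalized values.

# Assuming df is your DataFrame: mark every row whose
# lowercased 'name' appears more than once
dupes = df['name'].str.lower().duplicated(keep=False)
print(df[dupes])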

Cable Master - 14.10.2020 18:50

Really, your teaching method is very good, and your videos give a lot of knowledge. Thanks, Data School.

Bald is sexy - 31.08.2020 11:31

Love u brother. U r changing so many lives, thank u.... The best teacher award goes to Data School.

HongYee Gan - 30.08.2020 18:01

Wow! You were already teaching data science in 2014, when it was not even popular! Btw, your videos are really good: you speak slowly and clearly, easy to understand and follow. Kudos to you!

goldensleeves - 24.08.2020 20:57

At the end, are you saying that "age" + "zip code" must TOGETHER be duplicated? Or are you saying "age" duplicates and "zip code" duplicates are removed individually from their respective columns? Thanks
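
It is the former: with subset=['age', 'zip_code'], a row is a duplicate only when its (age, zip code) pair matches another row's pair, as a small example shows.

import pandas as pd

df = pd.DataFrame({'age': [20, 20, 20],
                   'zip_code': ['11111', '11111', '22222']})

# Only rows 0 and 1 share the same (age, zip_code) pair
print(df.duplicated(subset=['age', 'zip_code'], keep=False).tolist())
# [True, True, False]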

Atul Rahangdale - 12.08.2020 18:03

Thanks for the awesome videos on Pandas. I was able to automate some Excel reporting at my work.. but I'm stuck with something very complex (it's complex for me!). Could you please help with some complex Excel calculations using Python?
For example, suppose I have data in the format below.
db_instance Hostname Disk_group disk_path disk_size disk_used header_status
abc_cr host1 data01 dev/mapper/asm01 240 90 Member
abc_cr host1 data01 dev/mapper/asm02 240 100 Member
abc_cr host1 data01 dev/mapper/asm03 240 60 Member
abc_xy host1 data01 dev/mapper/asm01 240 90 Member
abc_xy host1 data01 dev/mapper/asm02 240 100 Member
abc_xy host1 data01 dev/mapper/asm03 240 60 Member
abc_cr host1 acfs01 dev/mapper/asm04 90 30 Member
abc_cr host1 acfs01 dev/mapper/asm05 90 60 Member
abc_xy host1 acfs01 dev/mapper/asm04 90 30 Member
abc_xy host1 acfs01 dev/mapper/asm05 90 60 Member
host1 unassigned dev/mapper/asm06 180 0 Candidate
host1 unassigned dev/mapper/asm07 180 0 Former
res_du host2 data01 dev/mapper/asm01 240 90 Member
res_du host2 data01 dev/mapper/asm02 240 100 Member
res_du host2 data01 dev/mapper/asm03 240 60 Member
res_hg host2 data01 dev/mapper/asm01 240 90 Member
res_hg host2 data01 dev/mapper/asm02 240 100 Member
res_hg host2 data01 dev/mapper/asm03 240 60 Member
res_pq host2 acfs01 dev/mapper/asm04 90 30 Member
res_pq host2 acfs01 dev/mapper/asm05 90 60 Member
res_mn host2 acfs01 dev/mapper/asm04 90 30 Member
res_mn host2 acfs01 dev/mapper/asm05 90 60 Member
host2 unassigned dev/mapper/asm06 180 0 Candidate
host2 unassigned dev/mapper/asm07 180 0 Former

As you can see, disk_path is duplicated for each host because of the multiple db_instance values. (Even though you see similar disk_paths for host1 & host2, they are actually different disks on the storage end; admins just follow similar naming conventions when they configure disks on the host side, resulting in similar disk_paths for different hosts.)
My questions are:
1. How to remove duplicate disk_paths for each host? (Considering only the two columns Hostname & disk_path; that's how I remove duplicates in Excel. I am not worried about db_instance.)
2. Once we remove the duplicates, how to calculate the total size of 'Member' disks, and also the total size of 'Candidate' and 'Former' disks combined?
3. How to add another column, 'Percent used', which is the result of 'disk_used'/'disk_size'*100 for each row?

Thanks in advance!
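
A hedged sketch of the three steps, assuming the table above has been loaded into a DataFrame df with those column names:

# 1. Keep one row per (Hostname, disk_path) pair
deduped = df.drop_duplicates(subset=['Hostname', 'disk_path'])

# 2. Total size of 'Member' disks, and of 'Candidate' + 'Former' combined
member_total = deduped.loc[deduped['header_status'] == 'Member', 'disk_size'].sum()
spare_total = deduped.loc[deduped['header_status'].isin(['Candidate', 'Former']),
                          'disk_size'].sum()

# 3. Percent used, computed row by row
deduped = deduped.assign(percent_used=deduped['disk_used'] / deduped['disk_size'] * 100)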

Abylai Mustafa - 20.07.2020 23:29

Live long and prosper!

Bharati N - 18.07.2020 08:43

How to remove leading and trailing spaces in a data frame?
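
A minimal sketch: apply str.strip to every string (object) column.

# Assuming df is your DataFrame: strip leading/trailing
# whitespace in every string column
str_cols = df.select_dtypes(include='object').columns
df[str_cols] = df[str_cols].apply(lambda col: col.str.strip())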

Shashwat Paul - 10.07.2020 22:26

I have watched a lot of your videos, and I must say that the way you explain things is really good. Just to inform you, I am new to programming, let alone Python.
I want to learn a new thing from you. Let me give you a brief: I am working on a dataset to predict app ratings from the Google Play Store. There is an attribute named "Rating" which has a lot of null values. I want to replace those null values using a median based on another attribute named "Reviews". But I want to split the attribute "Reviews" into multiple categories like:
1st category would be for the reviews less than 100,000,
2nd category would be for the reviews between 100,001 and 1,000,000,
3rd category would be for the reviews between 1,000,001 and 5,000,000 and
4th category would be for the reviews anything more than 5,000,000.
Although I tried a lot, I failed to create multiple categories. I was able to create only 2 categories using the command below:
gps['Reviews Group'] = [1 if x <= 1000000 else 2 for x in gps['Reviews']]
(gps is the dataset.)
I replaced the null values using the command below:
gps['Rating'] = gps.groupby('Reviews Group')['Rating'].transform(lambda x: x.fillna(x.median()))

Please help me create multiple categories for "Reviews" as mentioned above and replace all the null values in "Rating".
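
pd.cut builds all four categories in one step; a hedged sketch reusing the commenter's gps DataFrame and column names:

import numpy as np
import pandas as pd

# Bin 'Reviews' into the four requested groups (right-inclusive bins)
bins = [0, 100_000, 1_000_000, 5_000_000, np.inf]
gps['Reviews Group'] = pd.cut(gps['Reviews'], bins=bins, labels=[1, 2, 3, 4])

# Fill missing ratings with each group's median rating
gps['Rating'] = gps.groupby('Reviews Group')['Rating'].transform(
    lambda x: x.fillna(x.median()))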

Jordy Leffers - 05.06.2020 14:19

lol, just when I thought you wouldn't cover the exact subject I was looking for: there came the bonus! Thanks!

Emanuele Co - 24.05.2020 14:19

You are the greatest teacher in the world

Cyrus Lam - 19.04.2020 11:55

I was able to remove the duplicate data from my CSV file~~~ Thank you.
However, I suggest you could do a bit more in this video: show the resulting list after the deletion. Such as:
>> new_data = df.drop_duplicates(keep='first')
>> new_data.head(24898)
If you add that, I think this video will be even better~~~

Mahdi Bouaziz - 10.04.2020 11:07

You're amazing. We need more videos on your channel.

ARPIT MITTAL - 27.03.2020 07:29

Very useful videos.. Can you please tell me how to find duplicates of just one specific row?
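
One hedged sketch: compare every row against the row of interest (here index 0, a hypothetical target) and keep the exact matches.

# Assuming df is your DataFrame: all rows identical to the row at index 0
target = df.loc[0]
print(df[(df == target).all(axis=1)])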
