How do I find and remove duplicate rows in pandas?

Data School

7 years ago

104,901 views


Kris - 02.02.2023 15:47

Kevin your videos are super helpful! thank you!!!

Jessica Fletcher - 31.12.2022 19:48

OMG I WANT TO THANK YOU SOOOO MUCH 😊 I had been stuck on the problem for days, and the way you explain it makes it so much easier than how I learned it in class. I was so happy not to see that error message 😂 Thank you

Mukul Pandey - 20.11.2022 09:51

love to have more videos like this

Ildar Gabitov - 08.11.2022 20:50

Thank you so much, you made my day. Finally I found the line of code that I really needed to finish my task :) (Code Line 17)

Cradle of Relaxation - 29.10.2022 19:20

This is so helpful!
Pandas has the best duplicates handling. Better than spreadsheets and SQL.

Alishba Khan - 15.10.2022 12:34

Thank you so much 💕 Your videos are really amazing... Can you tell me how to read a CSV (without a header on the first line) and set the first row with non-null values as the header?

karthik kadadevarmath - 13.10.2022 18:17

How do I remove just the repeated names? For example, I have multiple entries with the same name, but each one has a different heart rate measurement. I just want a single name. Imagine this is the table:
name heart rate
Aaron 79
Aaron 80
Aaron 90
I want the name to display only once.

Minah'A - 11.08.2022 07:45

Just found your channel, watched this as my first video of yours, and pressed subscribe!!! Your explanation of the idea as a whole is remarkable 😃 Thanks a lot.

Soman Talha - 03.08.2022 13:16

beneficial videos. ❤

Tito Lee - 25.06.2022 22:51

Thank you! you sound like Kamala Harris lol

Chandrapati Bhanuprakash | AP19110010155 - 20.04.2022 18:28

It helps me a lot. Can you explain how we get the count of each duplicated value?
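
One possible way to get those counts, sketched on made-up data (the column name `name` and the values are assumptions for illustration, not from the video): count every value with `value_counts()`, then keep only the values that occur more than once.

```python
import pandas as pd

# Made-up data for illustration
df = pd.DataFrame({"name": ["Ann", "Bob", "Ann", "Cid", "Bob", "Ann"]})

# Count how often each value occurs, then keep only values seen more than once
counts = df["name"].value_counts()
dup_counts = counts[counts > 1]
print(dup_counts)

# For whole-row duplicates, group by all columns and take the group sizes
row_counts = df.groupby(list(df.columns)).size()
print(row_counts[row_counts > 1])
```

Note that `groupby` skips rows with missing values in the grouping columns by default, so the second approach may need `dropna=False` on data with NaNs.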

Ishkatan - 14.04.2022 15:54

Good lesson, but the datatype has to match. I found I had to process my pandas tables with .astype(str) before this worked.
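
A small sketch of the situation described above, using hypothetical data in which the same ID is stored once as an integer and once as a string. Whether `.astype(str)` is the right cast depends on the data; it is shown here only because the comment mentions it.

```python
import pandas as pd

# Hypothetical data: the same ID stored once as an int and once as a string
df = pd.DataFrame({"id": [1, "1", 2], "city": ["Oslo", "Oslo", "Rome"]})

# 1 and "1" are different values, so no row is flagged
print(df.duplicated().sum())

# After casting everything to a common type, the second row is flagged
as_text = df.astype(str)
print(as_text.duplicated().sum())
```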

Harpreet sandhu - 05.03.2022 01:26

How do I drop a column in which 95% of the values are the same in Python?
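
One way this could be approached, as a rough sketch: compute the share of the most frequent value in each column and drop the columns at or above a cutoff. The data, the column names, and the variable `threshold` are all made up for illustration.

```python
import pandas as pd

# Made-up data: "flag" is almost constant, "score" is not
df = pd.DataFrame({
    "flag": ["y"] * 39 + ["n"],   # 97.5% of rows share one value
    "score": list(range(40)),     # every value is different
})

threshold = 0.95  # hypothetical cutoff

# Share of the most frequent value in each column (missing values count as a value)
top_share = df.apply(lambda col: col.value_counts(normalize=True, dropna=False).iloc[0])

# Drop the columns whose most frequent value reaches the cutoff
to_drop = top_share[top_share >= threshold].index
reduced = df.drop(columns=to_drop)
print(reduced.columns.tolist())
```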

Captain America RY - 09.01.2022 16:35

thank you ...!!!

Anant Gosai - 25.11.2021 12:29

That was so accurate, thanks a lot genius!

Rajiv Jani - 27.09.2021 23:16

If I have a DataFrame with a million rows and 15 columns, how do I figure out whether any columns in my DataFrame have mixed data types?
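
A possible check, sketched on a tiny made-up frame: map every cell to its Python type and count the distinct types per column. This visits every cell, so on a million rows it may be slow; checking only the `object` columns would be a reasonable shortcut.

```python
import pandas as pd

# Made-up data: column "a" mixes int, str and float
df = pd.DataFrame({"a": [1, "2", 3.0], "b": ["x", "y", "z"], "c": [1, 2, 3]})

# Number of distinct Python types found in each column
type_counts = df.apply(lambda col: col.map(type).nunique())

# Columns holding more than one type
mixed_cols = type_counts[type_counts > 1].index.tolist()
print(mixed_cols)
```

One caveat: a missing value (NaN) is a float, so a text column with gaps will also show up as mixed under this check.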

Isaac - 29.08.2021 02:51

i love you, sir.

zma314125 - 25.08.2021 22:51

Thank you!

Rational Indian - 22.08.2021 11:54

Brilliant video .

m marva - 03.08.2021 22:24

Thank you for this content! I have a question: how can we handle quasi-redundant values in different columns? (Imagine two different columns whose values are 80% similar.) Thanks a lot

Dejan Jovanovic - 04.07.2021 15:19


Linda fl - 22.06.2021 02:59

Hello, thank you for the video. I'm wondering if you could make some tutorials about API requests.

Imad Uddin - 18.06.2021 06:42

Thanks a lot. It was a great help. Much appreciated!

Reaz Ahmed - 16.06.2021 04:22

How do I access the IPython/Jupyter Notebook link? It is not available in the GitHub repository.

HarshInDublin - 11.06.2021 00:20

Thanks for the video

Asad Ghnaim - 02.06.2021 00:35

When I use the parameter keep=False, I get fewer rows than with first and last combined. What is the reason for that?
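
A small made-up example that may explain the difference: when a value appears three or more times, its middle occurrences are flagged by both `keep='first'` and `keep='last'`, so adding those two counts together counts them twice, while `keep=False` counts each row once.

```python
import pandas as pd

# Made-up column: "a" appears three times, "b" twice, "c" once
s = pd.Series(["a", "a", "a", "b", "b", "c"])

n_first = s.duplicated(keep="first").sum()  # all but the first of each group: 2 + 1
n_last = s.duplicated(keep="last").sum()    # all but the last of each group: 2 + 1
n_all = s.duplicated(keep=False).sum()      # every member of each group: 3 + 2

print(n_first, n_last, n_all)
```

Here the middle "a" is flagged under both `keep='first'` and `keep='last'`, which is why 3 + 3 is larger than 5.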

Anastasia - 27.04.2021 20:46

Jeez you just saved me so much work for a seemingly unsolvable project 🙏☕

Halil Durmaz - 27.04.2021 01:48

Clean and informative !

Oasis God - 17.04.2021 07:02

Great video. But I'd like to find the duplicates in one column, then go to another column and find its duplicates, then another, and in the end keep only one row with certain information.

Tony Gonsa - 06.04.2021 19:19

Very methodical explanation

Balaji Bhaskarrao Kondhekar - 22.03.2021 06:40

You have done a very good job of explaining DataFrames and making them easy to understand, especially for people who work in Excel.
Best wishes from me

Asif Sohail - 01.03.2021 00:01

How can we efficiently find near duplicates from a dataset?

Antony Joy - 03.02.2021 19:04

This is the case of complete duplicates. What should we do when we have to deal with incomplete duplicates? E.g., age, gender and occupation are the same but zip is different.
Could you also make a video on that, please?

Brian Waweru - 01.02.2021 16:11

Wait, Kevin: keep=first means the rows flagged as duplicates are the ones towards the bottom, meaning they have a higher index. And keep=last means...? Oh man, I'm getting mixed up. Could someone please explain this to me? Kevin, please?
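
A tiny made-up example may help here. With `keep='first'` the first occurrence counts as the original and the later copies are flagged; with `keep='last'` it is the other way round. The column names and values are assumptions for illustration.

```python
import pandas as pd

# Made-up data: rows 0, 2 and 3 are identical
df = pd.DataFrame({"age": [30, 41, 30, 30], "zip": ["111", "222", "111", "111"]})

# keep="first": the first occurrence is treated as the original,
# so the later copies (rows 2 and 3) are flagged
first_flags = df.duplicated(keep="first").tolist()

# keep="last": the last occurrence is treated as the original,
# so the earlier copies (rows 0 and 2) are flagged
last_flags = df.duplicated(keep="last").tolist()

print(first_flags)
print(last_flags)
```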

Dr sheldon cooper - 07.01.2021 06:57

Amazing and thanks bro , the right place for data queries

Carlos Fernando Aguirre Toro - 09.11.2020 23:33

Great video. This helped me tremendously.
How would you go about finding duplicates "case insensitive" with a certain field?
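
One option, sketched on made-up data: build a lower-cased copy of the field, test that copy for duplicates, and use the resulting mask on the original frame so the original casing is kept. The column name `email` is an assumption for illustration.

```python
import pandas as pd

# Made-up data: the first two emails differ only in letter case
df = pd.DataFrame({
    "email": ["Ann@x.org", "ann@X.org", "bob@x.org"],
    "age": [30, 30, 41],
})

# Lower-case a copy of the field and test that copy for duplicates
key = df["email"].str.lower()
mask = key.duplicated(keep="first")

# Filter the original frame, so the surviving rows keep their original casing
deduped = df[~mask]
print(deduped)
```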

Cable Master - 14.10.2020 18:50

Really, your teaching method is very good, and your videos give a lot of knowledge. Thanks, Data School

Bald is sexy - 31.08.2020 11:31

Love u brother. U r changing so many lives, thank u... The best teacher award goes to Data School.

HongYee Gan - 30.08.2020 18:01

Wow! You were already teaching data science in 2014, when it was not even popular! Btw, your videos are really good. You speak slowly and clearly, which makes them easy for me to follow. Kudos to you!

goldensleeves - 24.08.2020 20:57

At the end are you saying that "age" + "zip code" must TOGETHER be duplicates? Or are you saying "age" duplicates and "zip code" duplicates must remove their individual duplicates from their respective columns? Thanks
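
As far as the pandas documentation describes `subset`, the listed columns are compared together: a row is flagged only when its whole combination of values repeats an earlier row. A made-up example (column names `age` and `zip_code` assumed) shows the difference from checking each column on its own.

```python
import pandas as pd

# Made-up data for illustration
df = pd.DataFrame({
    "age": [30, 30, 30, 41],
    "zip_code": ["111", "111", "222", "111"],
})

# Both columns together: only row 1 repeats the (age, zip_code) pair of row 0
together = df.duplicated(subset=["age", "zip_code"]).tolist()

# Each column checked on its own gives different answers
age_only = df.duplicated(subset=["age"]).tolist()
zip_only = df.duplicated(subset=["zip_code"]).tolist()

print(together, age_only, zip_only)
```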

Atul Rahangdale - 12.08.2020 18:03

Thanks for the awesome videos on pandas. I was able to automate a few Excel reports at work, but I'm stuck with something very complex (it's complex for me!). Could you please help with some complex Excel calculations using Python?
For example, suppose I have data in the format below.
db_instance  Hostname  Disk_group  disk_path         disk_size  disk_used  header_status
abc_cr       host1     data01      dev/mapper/asm01  240        90         Member
abc_cr       host1     data01      dev/mapper/asm02  240        100        Member
abc_cr       host1     data01      dev/mapper/asm03  240        60         Member
abc_xy       host1     data01      dev/mapper/asm01  240        90         Member
abc_xy       host1     data01      dev/mapper/asm02  240        100        Member
abc_xy       host1     data01      dev/mapper/asm03  240        60         Member
abc_cr       host1     acfs01      dev/mapper/asm04  90         30         Member
abc_cr       host1     acfs01      dev/mapper/asm05  90         60         Member
abc_xy       host1     acfs01      dev/mapper/asm04  90         30         Member
abc_xy       host1     acfs01      dev/mapper/asm05  90         60         Member
             host1     unassigned  dev/mapper/asm06  180        0          Candidate
             host1     unassigned  dev/mapper/asm07  180        0          Former
res_du       host2     data01      dev/mapper/asm01  240        90         Member
res_du       host2     data01      dev/mapper/asm02  240        100        Member
res_du       host2     data01      dev/mapper/asm03  240        60         Member
res_hg       host2     data01      dev/mapper/asm01  240        90         Member
res_hg       host2     data01      dev/mapper/asm02  240        100        Member
res_hg       host2     data01      dev/mapper/asm03  240        60         Member
res_pq       host2     acfs01      dev/mapper/asm04  90         30         Member
res_pq       host2     acfs01      dev/mapper/asm05  90         60         Member
res_mn       host2     acfs01      dev/mapper/asm04  90         30         Member
res_mn       host2     acfs01      dev/mapper/asm05  90         60         Member
             host2     unassigned  dev/mapper/asm06  180        0          Candidate
             host2     unassigned  dev/mapper/asm07  180        0          Former

As you can see, disk_path is duplicated for each host because of the multiple db_instance values. (Even though you see similar disk_paths for host1 and host2, they are actually different disks on the storage end. Admins follow similar naming conventions when they configure disks on the host side, which results in similar disk_paths for different hosts.)
My questions are, how do I:
1. remove duplicate disk_path values for each host? (Considering only the two columns Hostname and disk_path; that's how I remove duplicates in Excel. I am not worried about db_instance.)
2. once the duplicates are removed, calculate the total size of 'Member' disks, and also the total size of 'Candidate' and 'Former' disks combined?
3. add another column, 'Percent used', which is the result of 'disk_used'/'disk_size'*100 for each row?

Thanks in advance!
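
The three steps in this question could be sketched roughly as follows. Only a handful of rows in the shape of the table are used, and the names `disks`, `member_total`, `spare_total` and `percent_used` are made up for illustration.

```python
import pandas as pd

# A few rows in the shape of the table above (a reduced sample, not the full data)
rows = [
    ("abc_cr", "host1", "data01", "dev/mapper/asm01", 240, 90, "Member"),
    ("abc_xy", "host1", "data01", "dev/mapper/asm01", 240, 90, "Member"),
    ("abc_cr", "host1", "data01", "dev/mapper/asm02", 240, 100, "Member"),
    (None, "host1", "unassigned", "dev/mapper/asm06", 180, 0, "Candidate"),
    (None, "host1", "unassigned", "dev/mapper/asm07", 180, 0, "Former"),
    ("res_du", "host2", "data01", "dev/mapper/asm01", 240, 90, "Member"),
]
cols = ["db_instance", "Hostname", "Disk_group", "disk_path",
        "disk_size", "disk_used", "header_status"]
df = pd.DataFrame(rows, columns=cols)

# 1. Keep one row per (Hostname, disk_path) pair, ignoring db_instance
disks = df.drop_duplicates(subset=["Hostname", "disk_path"]).copy()

# 2. Total size of Member disks, and of Candidate and Former disks combined
is_member = disks["header_status"] == "Member"
is_spare = disks["header_status"].isin(["Candidate", "Former"])
member_total = disks.loc[is_member, "disk_size"].sum()
spare_total = disks.loc[is_spare, "disk_size"].sum()

# 3. Percent used for each row
disks["percent_used"] = disks["disk_used"] / disks["disk_size"] * 100

print(member_total, spare_total)
print(disks)
```

Because `subset` includes `Hostname`, the same disk_path on host1 and host2 is kept as two separate disks.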

Abylai Mustafa - 20.07.2020 23:29

long live and prosper!

Bharati N - 18.07.2020 08:43

How do I remove leading and trailing spaces in a DataFrame?
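
A minimal sketch of one way to do this, on made-up data: apply `str.strip()` to each text column and leave the other columns untouched.

```python
import pandas as pd

# Made-up data with stray spaces in the text column
df = pd.DataFrame({"name": ["  Ann", "Bob  ", " Cid "], "age": [30, 41, 25]})

# Strip whitespace in every text column; numeric columns are skipped
for col in df.columns:
    if pd.api.types.is_string_dtype(df[col]):
        df[col] = df[col].str.strip()

print(df["name"].tolist())
```

Stripping first can matter for this video's topic, since "Ann" and " Ann" are not treated as duplicates.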

Shashwat Paul - 10.07.2020 22:26

I have watched a lot of your videos, and I must say that the way you explain things is really good. Just to let you know, I am new to programming, let alone Python.
I want to learn something new from you. Let me give you a brief. I am working on a dataset to predict app ratings from the Google Play Store. There is an attribute named "Rating" which has a lot of null values. I want to replace those null values with a median based on another attribute named "Reviews". But I want to split the attribute "Reviews" into multiple categories:
1st category would be for reviews less than 100,000,
2nd category would be for reviews between 100,001 and 1,000,000,
3rd category would be for reviews between 1,000,001 and 5,000,000 and
4th category would be for reviews of anything more than 5,000,000.
Although I tried a lot, I failed to create multiple categories. I was able to create only 2 categories using the command below:
gps['Reviews Group'] = [1 if x <= 1000000 else 2 for x in gps['Reviews']]
gps is the Data Set.
I replaced the Null Values using the below command:
gps['Rating'] = gps.groupby('Reviews Group')['Rating'].transform(lambda x: x.fillna(x.median()))

Please help me create multiple categories for "Reviews" as mentioned above and replace all the Null Values in "Rating".
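
One way the four groups could be built is with `pd.cut`, sketched here on a tiny made-up stand-in for `gps` (the values are invented). The bin edges are an assumption about the intended boundaries: exactly 100,000 reviews falls into group 1 in this sketch.

```python
import numpy as np
import pandas as pd

# Tiny made-up stand-in for the gps data set
gps = pd.DataFrame({
    "Reviews": [50, 80_000, 500_000, 900_000, 2_000_000, 3_000_000, 6_000_000, 9_000_000],
    "Rating": [4.0, np.nan, 3.5, np.nan, 4.5, np.nan, 4.8, np.nan],
})

# Right-inclusive bins: up to 100,000 / up to 1,000,000 / up to 5,000,000 / above
bins = [-np.inf, 100_000, 1_000_000, 5_000_000, np.inf]
gps["Reviews Group"] = pd.cut(gps["Reviews"], bins=bins, labels=[1, 2, 3, 4])

# Fill each missing Rating with the median Rating of its Reviews Group
gps["Rating"] = (
    gps.groupby("Reviews Group", observed=True)["Rating"]
    .transform(lambda x: x.fillna(x.median()))
)

print(gps)
```

If a group has no ratings at all, its median is NaN and the missing values in that group stay missing.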

Jordy Leffers - 05.06.2020 14:19

lol, just when I felt you wouldn't handle the exact subject I was looking for: there came the bonus! Thanks!

Emanuele Co - 24.05.2020 14:19

You are the greatest teacher in the world

Cyrus Lam - 19.04.2020 11:55

I was able to remove the duplicate data from my CSV file~~~ Thank you.
However, I suggest you could do more in this video. I think you could show the resulting list after the deletion, such as:
>> new_data = df.drop_duplicates(keep='first')
>> new_data.head(24898)
If you add it, I think this video will be even better~~~

Mahdi Bouaziz - 10.04.2020 11:07

you're amazing we need more videos in your channel

ARPIT MITTAL - 27.03.2020 07:29

Very useful videos. Can you please tell me how to find the duplicates of just one specific row?
