Python 3 Programming Tutorial - Parsing Websites with re and urllib

sentdex

9 years ago

196,800 views



Comments:

Jeff Rojas
Jeff Rojas - 27.12.2021 05:28

Hello Sentdex, can you give me a hint on where I should start searching? Here is my situation: I have a project where a user can post an image or text. Now another user likes the post and wants to share it to his or her own wall, just like Facebook. Please and thank you.

Reply
MangDalin
MangDalin - 09.02.2021 12:49

wow 1M sub

Reply
Dtomper
Dtomper - 09.11.2020 15:22

Thank you

Reply
Sam Mraz
Sam Mraz - 28.07.2020 15:00

Thank god

Reply
Yi Shao
Yi Shao - 27.05.2020 21:25

Amazing. Love you. You make parsing so easy to understand.

Reply
RAHIM ZAHI
RAHIM ZAHI - 18.05.2020 02:53

Thank you bruh ❤

Reply
mister tech
mister tech - 21.03.2020 14:46

Great Bruh

Reply
Lokesh Bhirud
Lokesh Bhirud - 06.03.2020 16:25

How do I replace spaces with a symbol in Python using regular expressions?
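
A minimal sketch of one way to do that with re.sub, using an underscore as the example symbol:

import re

text = "parsing websites with re and urllib"
replaced = re.sub(r'\s+', '_', text)  # replace every run of whitespace with '_'
print(replaced)  # parsing_websites_with_re_and_urllib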

Reply
Patrick
Patrick - 15.02.2020 10:07

I tried this script on a different URL and got a 403 Forbidden error... do some websites block parsing via scripts?
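
Some sites do reject the default urllib user agent. A minimal sketch of one common workaround, sending a browser-like User-Agent header (the URL and header value here are only placeholders):

import urllib.request

url = 'https://example.com/'
headers = {'User-Agent': 'Mozilla/5.0'}  # pretend to be a regular browser
req = urllib.request.Request(url, headers=headers)
resp_data = urllib.request.urlopen(req).read()
print(resp_data[:200])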

Reply
FrenchyFred
FrenchyFred - 23.11.2019 06:13

@sentdex Hi there!! Thanks for your great tutorial! I'm a newbie at Python and programming in general, and I have a problem right now that's kind of like what you show here. I've extracted a table from a website (using its API) and the results come in as text (CSV). I get around 20 different statistics (it's sports-related) and I only need 3 of them. So I would like to eliminate all the data that I don't need and just keep those 3. Would you recommend the same library modules (re and urllib) or another module for that? As I said, it looks to be the same kind of thing you're showing here, the difference being that I need to basically remove stats instead of text when I scrape it and just keep the ones I need. Thanks again for your great tutorials!!
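
For CSV text, the standard csv module is probably a better fit than re. A minimal sketch, assuming the response is already a string and the three wanted columns are called 'player', 'goals' and 'assists' (made-up names):

import csv
import io

csv_text = "player,goals,assists,shots\nSmith,2,1,5\nJones,0,3,2\n"  # sample data
reader = csv.DictReader(io.StringIO(csv_text))
for row in reader:
    print(row['player'], row['goals'], row['assists'])  # keep only the columns you need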

Reply
Hasti Bozorgi
Hasti Bozorgi - 02.09.2019 16:50

Hi,
Thanks for this series of tutorials. I am new to this field and need help. I'm trying to write code for scraping several web pages and don't know how I should start.
I tried several times but haven't had a successful run ☹ I hope you can help me 🙏
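
One common starting point is to put the single-page code from this video inside a loop over the page URLs. A minimal sketch, assuming the site uses a ?page=N pattern (the URL is a placeholder):

import re
import urllib.request

for page in range(1, 4):  # pages 1 to 3
    url = 'https://example.com/articles?page={}'.format(page)
    resp_data = urllib.request.urlopen(url).read().decode('utf-8', errors='ignore')
    paragraphs = re.findall(r'<p>(.*?)</p>', resp_data)
    print(url, len(paragraphs), 'paragraphs found')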

Reply
Suhas NM
Suhas NM - 15.07.2019 13:16

How do I save the data you have extracted to a file?
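
A minimal sketch of writing the scraped paragraphs to a text file, assuming they are already in a list called paragraphs:

paragraphs = ['first paragraph', 'second paragraph']  # whatever re.findall returned
with open('output.txt', 'w', encoding='utf-8') as f:
    for each_p in paragraphs:
        f.write(each_p + '\n')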

Reply
Logomonic Learning
Logomonic Learning - 10.07.2019 22:15

How do I get the full playlist? It's not in the user's profile. In fact, it is a totally different person, but I want this guy's!

Reply
Chengyao Zheng
Chengyao Zheng - 20.06.2019 09:39

import re
What was re, though? I'm trying to recall this part now and I can't remember what it is.
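
re is Python's built-in regular expressions module; findall is the function this series leans on. A tiny reminder sketch:

import re

print(re.findall(r'\d+', 'abc 12 def 345'))  # ['12', '345']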

Reply
Hama Hawlery
Hama Hawlery - 11.05.2019 02:17

It does not print anything in the terminal. I think it may be because of "eachp".
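
Python names are case-sensitive, so eachp and eachP are different variables; if the loop variable and the name used inside the loop don't match exactly, nothing useful comes out. A minimal sketch of how the loop should look:

import re

resp_data = '<p>first</p><p>second</p>'  # stand-in for the downloaded page
paragraphs = re.findall(r'<p>(.*?)</p>', resp_data)
for eachP in paragraphs:
    print(eachP)  # this name must match the loop variable exactly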

Reply
Yavor Daskaloff
Yavor Daskaloff - 12.02.2019 12:46

data = urllib.parse.urlencode(values)
data = data.encode('utf-8')

These two lines. You assign different values to the same variable. How does that work?
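
Each assignment just rebinds the name data to a new object: the first line builds the URL-encoded string, the second replaces it with the UTF-8 bytes of that string, which is what urlopen expects as POST data. A small sketch showing the intermediate values:

import urllib.parse

values = {'s': 'basics', 'submit': 'search'}
data = urllib.parse.urlencode(values)   # 's=basics&submit=search' (a str)
print(type(data), data)
data = data.encode('utf-8')             # b's=basics&submit=search' (bytes)
print(type(data), data)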

Reply
A Jim Fan
A Jim Fan - 07.02.2019 04:37

So how does regex code work exactly? Is it one after the other? Would '.*?' yield different results than '*.?' or '?*.'?
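
Order matters: '.*?' means any character (.), repeated (*), as few times as possible (?), while '*.?' and '?*.' are not valid patterns at all, because a quantifier like * or ? needs something before it to repeat. A small sketch comparing non-greedy and greedy matching:

import re

html = '<p>one</p><p>two</p>'
print(re.findall(r'<p>(.*?)</p>', html))  # non-greedy: ['one', 'two']
print(re.findall(r'<p>(.*)</p>', html))   # greedy: ['one</p><p>two']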

Reply
Alex Lasareishvili
Alex Lasareishvili - 12.01.2019 21:54

Thanks for your video.
I have one question: instead of specifying the sample URL in the code, would it be possible to provide it via input?
What I mean is, I work with web-based tools that contain the same data fields with different values, of course, like support tickets, let's say.
I want a script where I can paste my ticket URL and have it parsed for specific fields like ticket number, customer name, etc., and populate an Excel table with the parsed data.
I sometimes have a lot of tickets to deal with, and opening all the URLs in separate tabs is just not an option, so I'm trying to consolidate everything in an Excel file (for now) to quickly see which ticket is in what state, when they are scheduled, etc.
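
The URL can indeed come from input() instead of being hard-coded. A minimal sketch, assuming the parsing part stays as in the video (the actual regex and fields would have to match your ticket pages, which I can't see):

import re
import urllib.request

url = input('Paste the ticket URL: ')
resp_data = urllib.request.urlopen(url).read().decode('utf-8', errors='ignore')
paragraphs = re.findall(r'<p>(.*?)</p>', resp_data)
for each_p in paragraphs:
    print(each_p)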

Reply
Richard C
Richard C - 27.11.2018 05:32

How would I do this in Django?

Reply
HoodedWarrior
HoodedWarrior - 16.09.2018 09:57

It may work on <p>, but for scraping useful stuff like links it gets tricky, especially if you want to get the href and also the value inside the tags.
I did use a library for that before, but now I want to try without one.
EDIT: never mind, doing a second findall on the result of the first for further filtering does it. Also, you could use those URL results to traverse through all the results and filter those as well... hmm.
Thanks, good tutorial.
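
For what it's worth, a single findall with two capture groups can also pull the href and the link text in one pass; a minimal sketch (it assumes href is the first attribute, which is one reason people usually reach for an HTML parser instead):

import re

html = '<p>See <a href="https://example.com/a">first link</a> and <a href="https://example.com/b">second</a>.</p>'
for href, text in re.findall(r'<a href="(.*?)">(.*?)</a>', html):
    print(href, '->', text)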

Reply
Nitesh Jaiswal
Nitesh Jaiswal - 08.09.2018 15:10

Please show how to process JSON data using urllib and string slicing.
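
String slicing works but is fragile for JSON; the built-in json module is the usual route once urllib has fetched the text. A minimal sketch (the URL is a placeholder):

import json
import urllib.request

url = 'https://example.com/data.json'  # placeholder endpoint
raw = urllib.request.urlopen(url).read().decode('utf-8')
data = json.loads(raw)  # now an ordinary Python dict or list
print(data)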

Reply
SoldierGaming
SoldierGaming - 13.08.2018 01:38

content = []
paragraphs = re.findall(r'<p>(.*?)</p>', str(respData))
for eachP in paragraphs:          # loop over the list itself, not str(paragraphs)
    content.append(eachP)
sentence = ' '.join(content)      # join everything into one readable string

* This just cleans the output a little more, so you are not reading it in a downwards fashion.

Reply
Vijay Suresh
Vijay Suresh - 06.08.2018 21:55

Thanks for the vid. Can anyone help me with how to send a username and password to handle an authentication popup, so I can automate it in Chrome?

Reply
Nikunj Parmar
Nikunj Parmar - 25.07.2018 21:25

You are awesome!

Reply
Void Beats
Void Beats - 18.07.2018 00:46

@sentdex

values = {'s': 'basics',
          'submit': 'search'}

I have tried to put in some other links but it does not work; it only works with the link that you posted.

Reply
James Jemima
James Jemima - 07.07.2018 01:55

Instead of importing urllib.request and urllib.parse individually, is it possible to just import urllib as a whole library?
In the same respect, since in the last vid you said you mostly only use re.findall(), can we just import re.findall instead of the whole re library module?
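
In Python 3 a plain 'import urllib' does not automatically pull in the request and parse submodules, which is why they are imported by name; a single function, on the other hand, can be imported on its own. A small sketch of both:

import urllib.request   # the submodule must be named; plain 'import urllib' is not enough
import urllib.parse
from re import findall  # import just the one function from re

print(findall(r'\d+', 'answer: 42'))  # ['42']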

Reply
Josh Thomas
Josh Thomas - 05.07.2018 23:53

This is AWESOME! Thanks a lot!

Reply
Andreas Papadakis
Andreas Papadakis - 04.07.2018 20:29

Hi, great video!

I just have a question, when you do this it doesn't save the webpage as "Complete" but rather as "HTML, only". Is there a way to do Complete using urllib?

Reply
Rohan Naidu
Rohan Naidu - 24.06.2018 11:02

How can you do this with Google? I am not able to achieve this with Google; it's just blank after execution.
But I'm curious to read the paragraph data, or any normal English data, in the HTML source code of Google.

Reply
Hoora RM
Hoora RM - 05.05.2018 23:35

Hi and thank you for the great tutorial.
I have extracted my paragraphs as you said, but inside the <p> tags there are so many <a href="...">some stuff in between</a> tags!!
I want to somehow delete the <a href="..."> junk as well. I don't know how you didn't run into them in your work :D
Let me know if you have any comment on this.
Thanks in advance for all the great videos you have uploaded for everyone! :)
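
One rough way to drop those is re.sub: strip the opening <a ...> tag but keep the text inside it, then strip the closing </a>. A minimal sketch (an HTML parser would be more robust):

import re

p = 'Read <a href="https://example.com">this page</a> for more.'
cleaned = re.sub(r'<a href=".*?">', '', p)  # remove opening <a ...> tags
cleaned = re.sub(r'</a>', '', cleaned)      # remove closing tags
print(cleaned)  # Read this page for more.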

Reply
masteraravind
masteraravind - 25.03.2018 22:44

Could you explain how to parse HTML data which has two columns and has to go through a login authentication system?

Reply
Problem
Problem - 24.12.2017 18:47

Help me please. When I run the program it gives me this error: AttributeError: module 'urllib' has no attribute 'encode'
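
That error usually means the call went to the top-level urllib package instead of urllib.parse. A minimal sketch of those lines as they should look in Python 3:

import urllib.parse
import urllib.request

values = {'s': 'basics', 'submit': 'search'}
data = urllib.parse.urlencode(values)  # note: urllib.parse, not plain urllib
data = data.encode('utf-8')            # bytes, ready to pass to urlopen as POST data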

Reply
Walker Ward
Walker Ward - 21.12.2017 08:14

Awesome videos! Keep it up

Reply
Jagmohan Yadav
Jagmohan Yadav - 15.12.2017 18:27

Commendable contribution; I appreciate your effort to teach others.

Reply
Fernando Pinheiro
Fernando Pinheiro - 29.11.2017 22:27

thanks!!!!

Reply
whistler6318
whistler6318 - 26.11.2017 21:11

Thank you for taking the time to make these videos... You are a great teacher

Reply
Finn Buhse
Finn Buhse - 04.11.2017 20:33

Very good, but how do I integrate the fake ID info so I can get into Google with this?

Reply
bharath9190
bharath9190 - 26.08.2017 13:42

Usually everyone gives an introduction using a single-page website; what about a website which has 100 pages in it?? Try to make a tutorial on that!!!

Reply
We Rate Bikes
We Rate Bikes - 25.08.2017 06:12

What happens if there's no closing (</p>) tag on the page?
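
Then that paragraph simply never matches; findall only returns spans where both the opening and the closing tag are present. A tiny sketch:

import re

html = '<p>closed one</p><p>this one never closes'
print(re.findall(r'<p>(.*?)</p>', html))  # ['closed one'] - the unclosed <p> is dropped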

Reply
Jake Ambrose
Jake Ambrose - 19.08.2017 22:27

Been watching the entire series. No clue what's going on, lol. Hope I can make my own tutorials one day.

Reply
Mahfuz Shahin
Mahfuz Shahin - 05.08.2017 21:58

Super, boss!

Reply
JAIDEEP BOMMIDI
JAIDEEP BOMMIDI - 30.07.2017 19:26

Hi,

Great video. Wonderful explanation.

I have a small doubt.

I need to copy the website URL which is currently open in a browser using Python code, instead of manually copy-pasting the URL.

Then assign it to the url variable.

And use the code which is given by you in this video.

Please help me with the code to copy the URL using Python.

Regards,
Jaideep.

Reply
pulkit gupta
pulkit gupta - 24.07.2017 13:03

Please make detailed lectures on urllib.

Reply