BeautifulSoup is NOT the king of HTML Parsers (try this one)

John Watson Rooney

1 year ago

24,570 views


Comments:

@extropiantranshuman - 15.10.2023 05:53

actually if you'd like to take things up a step - adding in categories into your videos would speed this up.

@extropiantranshuman - 15.10.2023 05:53

selectolax wants me to install cython! It's not even python. That's already more. At least give us a warning lol.

@extropiantranshuman - 15.10.2023 05:50

why can't we have something where we can just import a pack, type in the websource to pull from, and just type in what we want to pull, and where it goes? Where's the template for that? Why so much extra stuff?

@extropiantranshuman - 15.10.2023 05:49

beautifulsoup really is clunky, because it doesn't auto-turn web script automatically into html. Having something already do that really helps!

@extropiantranshuman - 15.10.2023 05:44

truth be told - we really just need an interface - where people can click on the places on a website that would need to be copied, along with the direction of the copies and let it roll. Why code it when you can just get it to run?

@extropiantranshuman - 15.10.2023 05:43

I'm not convinced these are ideal, but do agree that beautifulsoup is honestly way too clunky for what it needs to do.

@extropiantranshuman - 15.10.2023 05:42

it's weird that it's working off of css - but then again - having the option of having html + css is really helpful.

@alisheik3076 - 24.09.2023 17:05

Hello sir,
When I try to code the same as above, it's throwing an error: <built-in method text of selectolax.parser.Node object at 0x000001E78E494A40>
Please help how to rectify this error.
Thanks

@RonWaller - 18.08.2023 22:06

Are you in Seattle? Seattle fan? just noticed your shirt.

@sulaimanahmed013 - 14.06.2023 12:54

Updates on selectolax? How's it goin for you?

@LuicMarin - 29.05.2023 21:18

Great video! It would be cool to see one on inspecting request/response headers without Selenium.

@philwebb59 - 20.05.2023 19:36

Your videos are terrific at encouraging me to try new things, but latency isn't a problem. I've never been successful converting your scripts to run on "real" websites without getting blocked for life, even when adding a time.sleep(60) after each pull. I think the html-world just doesn't like me. &^) That said, I haven't found a good example of using selectolax to parse tables. Gonna take another look through your videos. Also, I see selectolax has modest and lexbor engines. Wonder what the pros and cons?

@MrSettler - 16.04.2023 15:20

bs4 has built-in support for CSS selectors using soup.select() or soup.select_one()

@hsider - 24.03.2023 19:06

Personally I don't mind BeautifulSoup latency; it serves as a request delay. If the parsing takes some time, that's good, especially if I have a loop making multiple requests to the same website. Nice video of course 👍
Edit: I forgot to mention profiling: Python has the cProfile and pstats libs to profile and nicely display the time consumed by funcs and IO. It may help you compare these new libraries instead of comparing syntax only. From what I've tested so far, the requests connection takes some time (often > 10s), so in my understanding it's the requests library that takes the time, not the parsing :) Hope this helps.
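
The profiling suggestion can be sketched with the standard library alone. Here the stdlib `html.parser` stands in for whichever parser is being measured, and `LinkCounter`, `parse_page`, and the synthetic page are hypothetical names and data chosen for the demo:

```python
import cProfile
import io
import pstats
from html.parser import HTMLParser


class LinkCounter(HTMLParser):
    """Stand-in parser for the demo: counts <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links += 1


def parse_page(html):
    parser = LinkCounter()
    parser.feed(html)
    return parser.links


# Synthetic page; swap in a real response body and your own parse function.
html = "<ul>" + "<li><a href='#'>item</a></li>" * 5000 + "</ul>"

profiler = cProfile.Profile()
profiler.enable()
link_count = parse_page(html)
profiler.disable()

# Sort by cumulative time and keep the ten most expensive calls.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer).sort_stats("cumulative")
stats.print_stats(10)
report = buffer.getvalue()
print(report)
```

Wrapping the network call and the parsing call in separate profiled functions would show which of the two dominates the run time.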

@pavelerokhin1512 - 05.01.2023 19:22

Your videos are super helpful and you're also a handsome man :)

@djmill8000 - 26.12.2022 11:37

Just import pandas and do a pd.read_html

@00flydragon00 - 21.12.2022 22:08

What is scraping used for in the industry? Most of the scraping videos I have seen focus on "home projects".

Selectolax looks cool tho!

@aaroncatolico7550 - 21.12.2022 18:33

Hey John, which parser is quickest? I've been using Python 'Requests' library with the 'regex' library. Anything faster than this?

@wanderingfool7136 - 01.12.2022 22:21

Going to give this a try on a new script I'm writing for a client today! Thanks for everything you do 🙏🙏🙏

@AhmedThahir2002 - 01.12.2022 14:12

Is selectolax faster than scrapy?

@seangibbons4713 - 29.11.2022 01:44

As someone learning to code, your videos are a godsend. Keep up the great work. You're helping a lot of amateurs get their footing.

@SaMi-se2qs - 27.11.2022 09:39

Can we use it for dynamic websites?

@chillydoog - 24.11.2022 20:10

Awesome. I'm going to build a best chili dog scraper.

@AS-fj7ox - 17.11.2022 07:35

Thanks! That was so koool. Little correction: on line 14 in the selectolax .py file you need to add "()" to ".text" in order to call the method properly.

@s6yx - 17.11.2022 02:10

Can't use selectolax to scrape items based on div style attributes like I can with BeautifulSoup, unfortunately.

@Frugtoy - 13.11.2022 10:33

I'm waiting for a tool with regular expressions built in (many sites generate dynamic class names), and I don't like the existing workarounds.
<div class = dhdhdh_rddhuud_hello_text>
Hello </div>
....
....
....
<div class = dhdhdh_3773_7372fb_hello_text>
World </div>

Wanna do some parser.find(p.*hello_text)
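
The idea of matching class names by regular expression can be sketched with only the standard library. This is not an API of selectolax or BeautifulSoup; `ClassRegexCollector` is a hypothetical helper, and the markup is a cleaned-up version of the snippet above:

```python
import re
from html.parser import HTMLParser


class ClassRegexCollector(HTMLParser):
    """Collects the text of <div> elements whose class matches a regex."""

    def __init__(self, pattern):
        super().__init__()
        self.pattern = re.compile(pattern)
        self.depth = 0  # greater than 0 while inside a matching <div>
        self.buffer = []
        self.results = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth:
            # Nested <div> inside a match: track it so we close correctly.
            self.depth += 1
        elif self.pattern.search(dict(attrs).get("class") or ""):
            self.depth = 1
            self.buffer = []

    def handle_data(self, data):
        if self.depth:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.results.append("".join(self.buffer).strip())


html = """
<div class="dhdhdh_rddhuud_hello_text">Hello </div>
<div class="other">skip me</div>
<div class="dhdhdh_3773_7372fb_hello_text">World </div>
"""

collector = ClassRegexCollector(r"hello_text$")
collector.feed(html)
print(collector.results)
```

For this particular suffix case, the standard CSS attribute selector `div[class$="hello_text"]` expresses the same match in parsers that implement it, and BeautifulSoup's `find_all` also accepts a compiled regex for `class_`.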

@danielhangan - 12.11.2022 15:04

Can you do a LinkedIn company scraper video?

@SlackOps - 12.11.2022 00:33

Please I need an aliexpress web scraping tutorial

@bakasenpaidesu - 10.11.2022 17:23

UPDATE: I tried the selectolax and its really fast.... about 20x+

@nanjack5277 - 10.11.2022 10:09

Hi sir, months ago I met a web scraping project where only an XPath selector could get the exact element. Which library should I use that is nearly as fast as selectolax?

@codified1 - 09.11.2022 21:09

Please upload a video about how to solve a form based captcha.

@felipejardim2517 - 09.11.2022 15:55

Awesome! I'll try !
I really like BeautifulSoup because I can find elements in html using combinations, for example:
class + attributes
regex on attribute value
I confess that I'm still not that good at finding elements by the css selector
do you have any content about it? :D

@geniusdavid - 08.11.2022 10:09

Usually skip over sponsors but this is actually interesting 🧐 will check it out indeed.

@drac.96 - 08.11.2022 02:23

John, great video. I would like to know your thoughts on a few things. First, how would you approach crawling a website that uses GraphQL and requires scrolling down the page to get more data? Is it possible to retrieve this data without using a huge library like Playwright or Selenium to crawl it? Can we still get the data we want with our authentication cookies?

@danlee1027 - 08.11.2022 02:23

Great if speed is key to scale as you say.

@marcossahade9369 - 08.11.2022 02:04

What about requests-html? It supports CSS and XPath.

@BrandonJacobson - 07.11.2022 23:04

Perfect timing. I’m going to create my own headline news scraper and this is perfect. Thank you!

@karim_ghibli - 07.11.2022 19:36

You said "pure css selector(s)" multiple times in this video, I may have missed where you explain it, but what do you mean by "pure css selector"?
Selectolax does look pretty clean. For now I don't really care about scalability, but as long as it's as readable as BS4 (if not more readable), it definitely looks like something I wanna give a go next time I need to do some HTML parsing. Thanks for introducing this!

@bakasenpaidesu - 07.11.2022 18:00

BeautifulSoup does have CSS selectors:

soup.select_one("h1.className")

@xilllllix - 07.11.2022 17:16

thanks for introducing this to us, john!

@ianrickey208 - 07.11.2022 17:08

Nice! We are about to redesign our crawlers and I was starting to review parsers.

@loverboykimi - 07.11.2022 17:08

Gosh. It is really FAST.

@gitgosc7075 - 07.11.2022 17:05

Can you make a series about Neovim configuration for web scraping? ;) - thanks for another great video!

@srikanthkoltur6911 - 07.11.2022 16:54

Thanks for the introduction of the new parsing Library it is really worth a shot
I was using scrapy for everything 😅

@miguellopez7089 - 07.11.2022 16:44

So cool! Will experiment with it one day 🤌🏽
