
Recursion ➰ for Paginated Web Scraping

294 ratings | 8886 views
Sponsored by: Brilliant, thanks! Be one of the first 200 people to sign up with this link and get 20% off your annual subscription to Brilliant.org! https://brilliant.org/DevTips/

We figure out how to deal with the paginated search results in our web scrape. RECURSION is our tool - not as difficult as you might think!!

🗿 MILESTONES
⏯ 00:12 Fika 🍪
⏯ 13:10 Extracting the next page number with regex
⏯ 16:50 Encounter with prettier... 🌋
⏯ 18:39 ➰ Recap
⏯ 20:15 TIME FOR RECURSION 😎
⏯ 29:00 Quick Google rant 🌋
⏯ 29:23 ➰➰ Rerecap by Commenting the Code

See the previous episode, where we explain Puppeteer and find the data to scrape ▶️ https://www.youtube.com/watch?v=pixfH6yyqZk

The code used in this video is on GitHub 🗒 https://github.com/chhib/scraper/tree/5d00bb08c6ec4ea8cbaec7ea78fb90a10a864f8b

Puppeteer - headless Chrome browser for scraping (instead of PhantomJS) 🔪 https://github.com/GoogleChrome/puppeteer

The editor is called Visual Studio Code and is free. Look for the Live Share extension to share your environment with friends. 💻 https://code.visualstudio.com/

DevTips is a weekly show for YOU who want to be inspired 👍 and learn 🖖 about programming. Hosted by David and MPJ - two notorious bug generators 💖 and teachers 🤗. Exploring code together and learning programming along the way - yay!

DevTips has a sister channel called Fun Fun Function, check it out! ❤️ https://www.youtube.com/funfunfunction

#recursion #webscraping #nodejs
Text Comments (61)
Kainar Ilyasov (25 days ago)
Did the same thing with another web site. Everything is the same, but sometimes it returns an empty array [ ], and sometimes it scrapes only 10 pages even though there are 14. Why is that? I am so tired.
Josh Martin (1 month ago)
Just use a set: each time, update the set with the pages that are new to the list.
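A minimal sketch of Josh's Set idea (names and URL shapes are made up for illustration): a shared Set remembers every page seen so far, so the scrape can stop as soon as a result set contains nothing new.

```javascript
// Deduplicate scraped page URLs with a Set. Recursion (or a loop)
// can stop when a page yields no URL we haven't already visited.
const visited = new Set();

function collectNewUrls(urls) {
  // Keep only URLs not seen before, and record them as seen.
  const fresh = urls.filter((url) => !visited.has(url));
  fresh.forEach((url) => visited.add(url));
  return fresh;
}

console.log(collectNewUrls(['/page/1', '/page/2'])); // both are new
console.log(collectNewUrls(['/page/2', '/page/3'])); // only /page/3 is new
```

The Set gives O(1) membership checks, so this stays cheap even across many pages.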
Dragos Vlasie (1 month ago)
Is this the end of the series? I was looking forward to seeing what would happen next.
Sarah Britny (2 months ago)
#Help_me. There is #No page number, only a Next button, and I can't copy anything from this site. Now how do I scrape it? The link is http://best-toy-importers.com/global-toy-importer-directory/wpbdp_category/germany/ Please make a step-by-step video about this. It's a request.
simone icardi (3 months ago)
...and you just answered my question on the previous video! Thanks! I enjoyed these two on web scraping so much.
No Body (3 months ago)
How do you know that you need to write div.compact? You don't explain this in the video. You also don't explain why .map(compact => ... What does this mean, and why?
David Poindexter (3 months ago)
Kinda just imagining their SEO/Analytics folks freaking out at the web hits all over their site! Just to keep developers on their toes, send over an IE6 web user agent. (I'm kidding, don't do that)
Anton Kristensen (4 months ago)
You could use the HTTP response status code to stop the recursion... You could probably also create another instance of Puppeteer that runs in parallel to check if there is a next page, instead of using the same instance; that would perhaps double the speed.
DevTips (4 months ago)
Regarding the first suggestion: the site still returns 200, so that won't work in this particular case. If we were to do this on thousands of pages and multiple sites - yes, that's a cool idea. At this stage, though, I think that is a bit too much overoptimization.
Mazen Mahari (4 months ago)
Yo, can you guys do basic Java programs? I'm studying it in school.
GifCo (4 months ago)
lol change schools!! Don't waste your time learning that crap. And definitely don't ask a JavaScript-dedicated channel to teach it.
justvashu (4 months ago)
I would have used the "next" button in the navigation and used its href to get the next page, until there are no more next pages.
DevTips (4 months ago)
Great idea!
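A sketch of justvashu's follow-the-next-link approach. The page-fetching part is mocked with a plain object here; in the real scraper, Puppeteer would pull the next button's href out of the DOM. All names and URL shapes are hypothetical.

```javascript
// Follow "next" links until a page has none. `pages` stands in for
// the site: it maps each URL to its partners and its next-page href.
const pages = {
  '/partners?page=1': { partners: ['a', 'b'], next: '/partners?page=2' },
  '/partners?page=2': { partners: ['c'], next: null },
};

function scrapeAll(url) {
  const { partners, next } = pages[url];
  // Base case: no next link means we are on the last page.
  if (!next) return partners;
  return partners.concat(scrapeAll(next));
}

console.log(scrapeAll('/partners?page=1')); // [ 'a', 'b', 'c' ]
```

The nice property is that the site itself tells you when to stop - no regex and no guessed page count.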
Francisco Jaimes Freyre (4 months ago)
Great videos! Does anybody know a channel that is fun like this one or Fun Fun Function, but that uses Python?
Ty Hopp (4 months ago)
These two web scraping vids are awesome! Would love to see one on building a crawler 🕸
No Body (3 months ago)
I would like to see how he saves the output to file.txt now (only the text, with a new line for every partner).
Abhishek Kumar (4 months ago)
Thank u for this awesome video
Razva Cos (4 months ago)
Hi David! How do you use puppeteer to make ssr?
mike quinn (4 months ago)
This video explains that https://www.youtube.com/watch?v=lhZOFUY1weo
spoooget (4 months ago)
I'm impressed that you didn't get an error saying 'browser is not defined'!
spoooget (4 months ago)
Yeah! Because my thought was that your extractPartners would need to know what browser is when it is evaluated. But I'm happy to be proven wrong - it's a great way to learn. I really enjoyed this web scraping series. Hope the fika was tasty ;)
DevTips (4 months ago)
You mean because it is used at the beginning of the function? The function is not run until it is called. At *const partners = extractPartners(firstUrl)* - that's when we need browser to be defined, and it has been, just above. The code is not run from top to bottom!
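The point about call time versus definition time can be boiled down to a few lines (the names here mimic the video's code but the values are made up):

```javascript
// extractPartners references `browser`, which is only assigned later.
// That's fine: the body doesn't run until the function is called.
let browser;

function extractPartners(url) {
  // `browser` is looked up when the function RUNS, not when it is defined.
  return `${browser} fetched ${url}`;
}

browser = 'headless-chrome'; // assigned before the first call below
console.log(extractPartners('/partners?page=1'));
// -> headless-chrome fetched /partners?page=1
```

Had extractPartners been called before the assignment, `browser` would have been undefined - the ordering of calls, not of declarations, is what matters.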
Alexzander Flores (4 months ago)
Why all the regex stuff instead of just passing the page number as an argument and creating the URL in the method?
Sultan Arziev (4 months ago)
I would say "aha, next page" but not "aha, next URL". You took a worse solution and now you have to explain why it is good.
DevTips (4 months ago)
Why call a variable X instead of Y? It is just one way of solving it; there are thousands of ways. Here we were lucky the pattern was so simple - for the next site it may not be. Sure, though, it could be done differently; using regex for this exact example was perhaps slightly overengineered. In programming there is never only one way of solving something. I like not stuffing parameters into the function; I think it looks neat, and it is simple to understand what's going on when browsing through the code. By passing the URL it is easy to scan and understand: "aha, the function will use that URL and get partners out of it".
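For comparison, Alexzander's alternative sketched out: recurse on a page number and build the URL inside the function, so no regex is needed. The base URL and the mocked fetch are hypothetical stand-ins for the real Puppeteer scrape.

```javascript
// Build the URL from a page number instead of regexing it back out.
const BASE_URL = 'https://example.com/partners?page='; // hypothetical

// `fetchPartners` stands in for the real per-page Puppeteer scrape.
const fakeSite = { 1: ['a', 'b'], 2: ['c'], 3: [] };
const fetchPartners = (pageNumber) => fakeSite[pageNumber] || [];

function extractPartners(pageNumber = 1) {
  const url = `${BASE_URL}${pageNumber}`; // would be passed to page.goto
  const partners = fetchPartners(pageNumber);
  // Base case: an empty page means we are past the last page.
  if (partners.length === 0) return [];
  return partners.concat(extractPartners(pageNumber + 1));
}

console.log(extractPartners()); // [ 'a', 'b', 'c' ]
```

The trade-off both sides describe: this version keeps the URL construction in one place, while the video's version keeps the function's signature down to a single, self-explanatory URL argument.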
Lungile Madi (4 months ago)
Can you do a video on how to track a user's exact location (maybe a visitor to your website) using the IP, MAC address, or any other way - except the lame geolocation from JavaScript, which requires user permission? Please don't just get the address of the server farms, which is as far as I got... Please try to get the user's exact location, like when we use Google Earth, so we can see the user's house or office, depending on where they are... Awesome videos, you know I'm a subscriber!!!!
Lungile Madi (4 months ago)
+GifCo btw, it is possible using the MAC address
Lungile Madi (4 months ago)
Ah, maybe you don't think on my level, brah... You can have sensitive information and not misuse it. I do have their passwords if they log into my website. Don't be dumb, brah.
GifCo (4 months ago)
lol, would you like their passwords and door keys as well?? WTF dude, there are reasons this is NOT possible.
Lungile Madi (4 months ago)
Oh cool... not for perverted reasons - solely with the intention of improving the user experience.
mike quinn (4 months ago)
So let me see if I have this correct: you want to be able to see a user's exact location (house/office/whatever) without them granting express permission that allows you to do so? I don't think so.
Aaron (4 months ago)
While the cat's away the fika comes out to play.
Bogdan Lupu (4 months ago)
Good video. I know it is beyond the scope of the lesson, but shouldn't the result be saved to a css or JSON file for further manipulation?
Bogdan Lupu (4 months ago)
+DevTips thanks. Btw, it's csv, not css - my phone tried to autocorrect me, wrongly. I parsed it to JSON. Nice and easy.
DevTips (4 months ago)
You're right - it will go into a database
Lungile Madi (4 months ago)
Hi David. Can you please address the legality of web scraping? After I watched your last video with Mattias, I got really excited and did some examples of my own. However, I later found out that web scraping can carry legal consequences if done wrong. So I read the terms of use of a few websites, and I found that web scraping is prohibited in all of them. Can you advise us on how to use this properly? We could go to jail out of ignorance. Otherwise, thanks for the videos.
Lungile Madi (4 months ago)
Oh ok cool... Thanks for the videos!!!
DevTips (4 months ago)
I can't give legal advice. I don't know where you're located, and it depends on the jurisdiction. But you should definitely be aware of the terms of use. This often comes up with APIs: many allow for a lot of fun things... until you read the terms of use for the API. :( Swedish sites typically do not have terms of use (I don't know if it is implied through our constitution somehow), so Mattias and I are not very used to that even being an issue.
Lungile Madi (4 months ago)
Cool. So just to be clear, breaching a company's terms of use will not result in a "cyber-crime" prison sentence or a fraud charge? So it's safe to say the worst that could happen - apart from being sued for copyright infringement if you reuse the content - is getting blocked?
DevTips (4 months ago)
It depends heavily on the jurisdiction. In Sweden, even personal information like how much you earn and where you live is public, so we are pretty used to that. I'm sure it is more restricted in other countries. As we argued in the previous video, if the content is there in the public domain, it ought to be available to anyone, server or human alike. Still, any publisher is of course allowed to do what they please. If they want to block your IP because you drain their resources or do things they suspect are not OK, they have every right to do that (I presume!).
Aamir (4 months ago)
Hi David, I guess it'd be more interesting and catchy if you added some sound effects to the intro ;)
Drew Lomax (4 months ago)
David, great video. As for that h1 tag... they have a history of funny h1 tags on these landing pages. A little over a year ago, before the "360" rebranding changed their marketing site, I was looking at how they formatted their markup for SEO on one of their product pages. I noticed that the h1 tag was in the markup and said, for example, "Google Tag Manager...", but it was not visible to the user. If I remember correctly, on desktop the h1 tag had display:none attached to it. Then, once the hamburger menu breakpoint was crossed, it was still display:none until you opened the menu, at which point display:none was removed and the h1 tag was wrapped around an img element with an image of the stylized "Google Tag Manager...". The actual text "Google Tag Manager..." in the h1 tag was hidden with CSS and probably used as a fallback. After some research on Matt Cutts' blog, I found out that this is semi-okay to do.
Tim Plumb (4 months ago)
Using a regular expression to pull out the page number seems a little odd to me rather than simply passing the function a base URL and the initial page number. Still, a great video.
ConquerJS (4 months ago)
DevTips (4 months ago)
I'm in Regex Anonymous. I use it to make coffee.
Stephen James (4 months ago)
Use a regex capture group to get the URL before the page number - don't hard-code it.
DevTips (4 months ago)
Great suggestion!
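Stephen's suggestion in a few lines: capture both the base URL and the page number in one match, so neither is hard-coded. The URL shape is a hypothetical example, not the site from the video.

```javascript
// Capture groups: group 1 is everything up to the page number,
// group 2 is the number itself.
const url = 'https://example.com/partners?page=7'; // hypothetical shape

const [, base, page] = url.match(/^(.*page=)(\d+)$/);

// Build the next page's URL without hard-coding the base.
const nextUrl = `${base}${Number(page) + 1}`;
console.log(nextUrl); // https://example.com/partners?page=8
```

If the site's URL pattern changes, only the regex needs updating - the rest of the scraper stays untouched.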
Olaf Wrieden (4 months ago)
David, please bring back the music when you timelapse :) Interested to see where this project is going. Keep it up, always looking forward to the next episode of this series.
DevTips (4 months ago)
Yeah, cool! I'll try doing that more - it just takes time, so I try to get something out even when I don't have time to add the finishing touches.
Charlye Castro (4 months ago)
Great Vid! You guys should go over Docker next
Gaurav Thakur (4 months ago)
Which text editor are you using?
Jonas Røssum (4 months ago)
It's Visual Studio Code.
296k (4 months ago)
Are you from Eastern Europe?
296k (4 months ago)
+Субота I like your name. The best day of the week 🙂
Субота (4 months ago)
Mordredur (4 months ago)
Wouldn't an infinite loop that breaks when there is a 404 work?
roju (4 months ago)
+DevTips I had to write a recursive and a non-recursive function that prints a binary tree for a tech interview. Writing that without recursion was surprisingly hard. I always just defaulted to recursion on that one.
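The interview task roju describes - walking a binary tree - is the classic case where recursion feels natural, because the structure itself is recursive. A tiny sketch (the node shape is made up):

```javascript
// A tiny binary tree: each node has a value and optional children.
const tree = {
  value: 1,
  left: { value: 2, left: null, right: null },
  right: { value: 3, left: null, right: null },
};

function collectPreorder(node, out = []) {
  // Recursive pre-order walk: node first, then left, then right.
  if (!node) return out;
  out.push(node.value);
  collectPreorder(node.left, out);
  collectPreorder(node.right, out);
  return out;
}

console.log(collectPreorder(tree)); // [ 1, 2, 3 ]
```

The non-recursive version roju found hard essentially has to manage this call stack by hand, with an explicit stack of nodes still to visit.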
DevTips (4 months ago)
Also, I expect to crawl nested structures later on, like category trees. Then it would also be more difficult with loops.
Mordredur (4 months ago)
DevTips Thanks for the answer. An infinite loop solution would have been a boring video :)
DevTips (4 months ago)
Sure, it would work. It's the never-ending discussion: which is best, a loop or recursion? Many argue loops are easier to comprehend and that's why you should use them. But that has also had the effect that people rarely use recursion and do not understand it. The purpose of the video is to explain recursion through an example you can relate to, instead of the traditional Fibonacci numbers example - who uses that in the real world? JavaScript is a functional language, and the recursive approach fits better, in my opinion. Haskell, also a functional language, doesn't even have loops; you have to recurse. (It is still a 200 OK response in this example. I take it you mean when reaching a page with no items on it.)
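For comparison, the loop version discussed in this thread, stopping on the "200 OK but no more pages" condition rather than a 404. The page data is mocked; names are hypothetical.

```javascript
// `pages` stands in for the site: partners per page, plus a flag
// telling us whether a next page exists.
const pages = {
  1: { partners: ['a'], hasNext: true },
  2: { partners: ['b', 'c'], hasNext: false },
};

function scrapeWithLoop() {
  const all = [];
  let pageNumber = 1;
  // Loop until a page reports no next page. A 404-based break would
  // not fire here, since every page still returns 200 OK.
  while (true) {
    const { partners, hasNext } = pages[pageNumber];
    all.push(...partners);
    if (!hasNext) break;
    pageNumber += 1;
  }
  return all;
}

console.log(scrapeWithLoop()); // [ 'a', 'b', 'c' ]
```

Both versions do the same work; the recursive one pushes the "rest of the pages" onto the call stack, the loop keeps it in the `pageNumber` counter.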
Yadagalla Jaswanth Kumar (4 months ago)
Awesome lesson, really practical
