Effortlessly Create Your Own Image Dataset for Deep Learning
Chapter 1: Introduction to Dataset Creation
When you start out in computer vision, the first datasets you meet are usually MNIST and Fashion MNIST, which contain grayscale images of handwritten digits and of Zalando clothing items, respectively. Real-world images, however, are usually more complex and in color. This guide shows how to build your own image dataset efficiently, even if you have no coding experience.
In this tutorial, we will scrape images of dresses from Vinted, a widely used marketplace for second-hand clothing. We will use Octoparse, a user-friendly web scraping tool that doesn't require programming knowledge. Let's dive in!
Table of Contents:
- What is Octoparse?
- How to Download Octoparse
- Scraping Dresses Using Octoparse
Section 1.1: Understanding Octoparse
Octoparse is a web scraping application that lets you gather data from web pages without writing code; a few clicks are enough. It can auto-detect the data on a page, extract information across multiple pages, and scroll through pages automatically.
Octoparse also offers more advanced capabilities, including extracting HTML elements, handling AJAX-loaded content, and providing API access so that data can be retrieved without opening the application. It suits a range of scenarios, from simple data extraction from Wikipedia to more involved tasks like scraping Google Maps data. You can find a detailed list of use cases here.
Section 1.2: Downloading Octoparse
To get started, download the Octoparse application, which is available for Windows and Mac. After installation, log in with your Octoparse account; if you don't have one yet, you can sign up for free. The basic version is free, and Octoparse also offers paid Standard, Professional, and Enterprise plans with additional features. If you're considering using Octoparse for business, be sure to check out the Summer Sale for significant discounts.
Chapter 2: Scraping Images from Vinted
Now, it's time to define what we want to scrape using Octoparse. In this mini-project, we will extract images of dresses from Vinted. This process can be broken down into three straightforward steps: Create a Task, Edit the Task, and Run the Task.
Section 2.1: Creating Your Task
To begin, create a task to collect your desired images. Copy the URL of the Vinted page that lists all the dresses, paste it into Octoparse, and click the "Start" button.
Once you do this, the webpage will appear in Octoparse, similar to how it looks in your browser.
To automatically extract the data from the webpage, simply click “Autodetect web page data” in the Tips panel. This will reveal all the relevant elements for each dress, including the price, image link, and number of favorites. You will see a preview of the data in a table.
Section 2.2: Editing Your Task
At this point, you will see the automated steps that Octoparse has taken to scrape the webpage. The workflow designer on the right illustrates the steps involved in extracting the images and their details. Click the “Extract Data” button to highlight the elements associated with each dress.
We are primarily interested in the image links, so we can remove any unnecessary columns from the Data Preview manually. Additionally, the “Pagination” box allows you to adjust how you navigate between pages. If Octoparse doesn't correctly identify the pagination button, you can change it manually by deleting the Matching XPath, clicking the diagonal arrow icon, and selecting the “>” button.
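If you would rather prune the columns after exporting the data to a CSV file than in the Data Preview, the same cleanup can be sketched in a few lines of Python. This is only an illustration: the column names `price`, `image_url`, and `favorites` are assumptions (the names Octoparse generates for your task will likely differ), and `keep_columns` is a hypothetical helper, not part of Octoparse.

```python
import csv
import io


def keep_columns(csv_text, wanted):
    """Return CSV text that contains only the columns listed in `wanted`."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    # extrasaction="ignore" silently drops every column not listed in `wanted`.
    writer = csv.DictWriter(
        out, fieldnames=wanted, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        writer.writerow(row)
    return out.getvalue()


# Hypothetical rows, shaped like the table described above.
sample = (
    "price,image_url,favorites\n"
    "12.00,https://example.com/a.jpg,3\n"
    "8.50,https://example.com/b.jpg,0\n"
)
links_only = keep_columns(sample, ["image_url"])
```

Here `links_only` holds a one-column table with just the image links, which is all the dataset needs.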
You can customize further options in the “Click to Paginate” box. Select “Scroll down the page after it’s loaded” to reveal additional settings: Scroll Area, Wait seconds, and Scroll times. Set the wait time to 10 seconds and the scroll count to 6.
Section 2.3: Running Your Task
Having completed the previous steps, it's time to execute the scraping task. Click the “Run” button in the upper right corner. You will be presented with two options: “Run on your device” or “Run in the Cloud.” For this project, we will choose to run it on our device.
The data can be exported in various formats, such as CSV and JSON, or to SQL Server and MySQL databases. For simplicity, we will choose CSV. Note that the exported file contains only the links to the images, not the images themselves. To download the images from these links, you can use a Chrome extension called Tab Save: simply paste the links into the form it provides.
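As an alternative to Tab Save, the links can also be downloaded with a short Python script. The sketch below is not part of the Octoparse workflow and uses only the standard library. It assumes the links sit in a column named `image_url` and that files should be named `dress_00000.jpg`, `dress_00001.jpg`, and so on; both are hypothetical choices to adapt to your export. The browser-like `User-Agent` header is also an assumption, added because some image hosts reject requests without one.

```python
import csv
import io
import os
import urllib.request
from urllib.parse import urlparse


def read_image_urls(csv_text, column="image_url"):
    """Return the unique, non-empty URLs of one CSV column, in file order."""
    seen, urls = set(), []
    for row in csv.DictReader(io.StringIO(csv_text)):
        url = (row.get(column) or "").strip()
        if url and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


def target_filename(url, index):
    """Build a numbered file name, keeping the URL's extension when known."""
    ext = os.path.splitext(urlparse(url).path)[1].lower()
    if ext not in {".jpg", ".jpeg", ".png", ".webp"}:
        ext = ".jpg"  # assumption: treat unknown extensions as JPEG
    return f"dress_{index:05d}{ext}"


def download_images(urls, out_dir="dresses", timeout=30):
    """Download every URL into out_dir and return the paths that were saved."""
    os.makedirs(out_dir, exist_ok=True)
    saved = []
    for index, url in enumerate(urls):
        path = os.path.join(out_dir, target_filename(url, index))
        request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
        try:
            with urllib.request.urlopen(request, timeout=timeout) as response:
                data = response.read()
        except (OSError, ValueError) as error:
            # Network errors, HTTP errors, and timeouts are OSError subclasses;
            # malformed URLs raise ValueError. Skip the link and carry on.
            print(f"skipped {url}: {error}")
            continue
        with open(path, "wb") as handle:
            handle.write(data)
        saved.append(path)
    return saved


def main(csv_path):
    """Read the exported CSV and download every image it links to."""
    # utf-8-sig also copes with a byte-order mark at the start of the file.
    with open(csv_path, newline="", encoding="utf-8-sig") as handle:
        urls = read_image_urls(handle.read())
    saved = download_images(urls)
    print(f"downloaded {len(saved)} of {len(urls)} images")
```

Calling `main("vinted_dresses.csv")`, where the file name is a placeholder for your own export, fills a `dresses` folder with the images. Duplicate links are downloaded only once, and a failed link is skipped rather than stopping the whole run.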
Final Thoughts
Congratulations! You have successfully learned how to scrape images using Octoparse. In this tutorial, we explored the auto-detection feature of the software and how to manually adjust settings for data extraction. You’ve gathered images with just a few clicks, without needing any knowledge of HTML, CSS, or Python. Thank you for reading! I hope this guide proves useful for your future computer vision projects. Have a great day!