
Effortlessly Create Your Own Image Dataset for Deep Learning


Chapter 1: Introduction to Dataset Creation

When you start out in computer vision, the first datasets you usually meet are MNIST and Fashion MNIST. They contain small 28×28 grayscale images of handwritten digits and of Zalando clothing items, respectively. Real-world images, however, are usually larger, in color, and far more varied. This guide shows how to build your own image dataset efficiently, even if you have no coding experience.

In this tutorial, we will scrape images of dresses from Vinted, a widely used marketplace for second-hand clothing. The tool we will use is Octoparse, a web scraping application that doesn't require programming knowledge. Let's dive in!

Table of Contents:

  • What is Octoparse?
  • How to Download Octoparse
  • Scraping Dresses Using Octoparse

Section 1.1: Understanding Octoparse

Octoparse is a web scraping application for collecting data from web pages. As mentioned above, no coding skills are necessary; a few clicks are enough. The software can auto-detect the data on a page, extract information across multiple pages, and scroll through pages automatically.

Octoparse also offers more advanced capabilities, including extracting specific HTML elements, handling AJAX-loaded content, and API access, so that you can work with your data without opening the application. It can be applied in many scenarios, from simple data extraction from Wikipedia to more sophisticated tasks such as scraping Google Maps data. A detailed list of use cases is available on the Octoparse website.

Section 1.2: Downloading Octoparse

To get started with Octoparse, download the application, which is available for Windows and Mac. After installation, log in with your Octoparse account. If you don't have an account yet, you can sign up for free. The basic plan is free, and Octoparse also offers paid Standard, Professional, and Enterprise plans with additional features. If you're considering using Octoparse for business, be sure to check out the Summer Sale for significant discounts.

Chapter 2: Scraping Images from Vinted

Now, it's time to define what we want to scrape using Octoparse. In this mini-project, we will extract images of dresses from Vinted. This process can be broken down into three straightforward steps: Create a Task, Edit the Task, and Run the Task.

Section 2.1: Creating Your Task

To begin, create a task to collect the images you want. Copy the URL of the Vinted page that lists all the dresses, paste it into Octoparse, and click the "Start" button.

Octoparse URL entry screenshot

Once you do this, the webpage will appear in Octoparse, similar to how it looks in your browser.

Screenshot of Octoparse displaying the webpage

To automatically extract the data from the webpage, simply click “Autodetect web page data” in the Tips panel. This will reveal all the relevant elements for each dress, including the price, image link, and number of favorites. You will see a preview of the data in a table.

Section 2.2: Editing Your Task

At this point, you will see the workflow that Octoparse has generated to scrape the webpage. The workflow designer on the right shows each step involved in extracting the images and their details. Click the “Extract Data” button to highlight the elements associated with each dress.

Octoparse Extract Data button screenshot

We are primarily interested in the image links, so we can remove any unnecessary columns from the Data Preview manually. Additionally, the “Pagination” box allows you to adjust how you navigate between pages. If Octoparse doesn't correctly identify the pagination button, you can change it manually by deleting the Matching XPath, clicking the diagonal arrow icon, and selecting the “>” button.

Pagination settings in Octoparse
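
If the "Matching XPath" field is unfamiliar, the following minimal sketch may help to show what such an expression selects. It uses Python's standard library on a small, made-up HTML fragment; the markup, the `rel="next"` attribute, and the function name `find_next_link` are assumptions for illustration only and do not reflect Vinted's actual page structure or the XPath that Octoparse generates.

```python
import xml.etree.ElementTree as ET

# Hypothetical pagination markup; a real page will look different.
SAMPLE = """
<nav class="pagination">
  <a href="/dresses?page=1">1</a>
  <a href="/dresses?page=2">2</a>
  <a href="/dresses?page=2" rel="next">&gt;</a>
</nav>
"""


def find_next_link(markup):
    """Return the href of the "next page" link, or None if there is none."""
    root = ET.fromstring(markup.strip())
    # ElementTree supports a subset of XPath: this matches any <a>
    # descendant whose rel attribute equals "next".
    link = root.find(".//a[@rel='next']")
    return None if link is None else link.get("href")


print(find_next_link(SAMPLE))  # prints /dresses?page=2
```

An XPath is simply a path-like query that picks out one element on the page, here the “>” link. Selecting the button by clicking it in Octoparse achieves the same thing without writing the expression by hand.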

You can further customize options in the “Click to Paginate” box. Select “Scroll down the page after it’s loaded,” which will unveil additional options for editing, such as Scroll Area, Wait seconds, and Scroll times. Set the waiting time to 10 seconds and the scroll count to 6.

Octoparse pagination options screenshot
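
These settings also determine how long a run takes. As a rough sketch, assuming the wait is applied once per scroll (an assumption about how the setting behaves, not something stated in the tool's documentation), the scrolling time alone can be estimated as follows. The function names and the page count of 20 are hypothetical.

```python
def scroll_seconds_per_page(wait_seconds, scroll_times):
    """Scrolling time for one page, assuming one wait per scroll."""
    return wait_seconds * scroll_times


def estimated_minutes(pages, wait_seconds=10, scroll_times=6):
    """Lower bound on total scrolling time in minutes (page loads excluded)."""
    return pages * scroll_seconds_per_page(wait_seconds, scroll_times) / 60


print(scroll_seconds_per_page(10, 6))  # prints 60
print(estimated_minutes(20))           # prints 20.0
```

With a 10-second wait and 6 scrolls, each page spends about a minute scrolling, so a hypothetical 20-page run would need at least 20 minutes.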

Section 2.3: Running Your Task

Having completed the previous steps, it's time to execute the scraping task. Click the “Run” button in the upper right corner. You will be presented with two options: “Run on your device” or “Run in the Cloud.” For this project, we will choose to run it on our device.

Run Task button in Octoparse

The data can be exported to file formats such as CSV and JSON, or directly into SQL Server and MySQL databases. For simplicity, we will choose CSV. Because the other columns were removed, the file contains only the links to the images you have collected. To download the images from these links, you can use a Chrome extension called Tab Save: simply paste the links into the form it provides.
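
For readers who do know a little Python, the download step can also be scripted instead of using a browser extension. The following is a minimal sketch using only the standard library. The file name `vinted_dresses.csv`, the column name `image_url`, the output folder, and the `dress_00000` naming scheme are all assumptions; adjust them to match the header and paths of your own export.

```python
import csv
import urllib.request
from pathlib import Path
from urllib.parse import urlparse


def read_image_urls(csv_path, column="image_url"):
    """Collect unique, non-empty URLs from one column of the exported CSV."""
    urls = []
    # utf-8-sig tolerates a byte-order mark at the start of the file.
    with open(csv_path, newline="", encoding="utf-8-sig") as f:
        for row in csv.DictReader(f):
            url = (row.get(column) or "").strip()
            if url and url not in urls:
                urls.append(url)
    return urls


def fetch_bytes(url, timeout=30):
    """Download one URL and return its raw bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def download_images(urls, out_dir, fetch=fetch_bytes):
    """Save each URL as a numbered file in out_dir; return the saved paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    for index, url in enumerate(urls):
        # Keep the extension from the URL path, defaulting to .jpg.
        suffix = Path(urlparse(url).path).suffix or ".jpg"
        target = out / f"dress_{index:05d}{suffix}"
        try:
            target.write_bytes(fetch(url))
        except OSError as error:  # network errors and timeouts
            print(f"skipped {url}: {error}")
            continue
        saved.append(target)
    return saved


if __name__ == "__main__":
    # Hypothetical file, column, and folder names.
    links = read_image_urls("vinted_dresses.csv", column="image_url")
    files = download_images(links, "dresses")
    print(f"saved {len(files)} of {len(links)} images")
```

The `fetch` parameter is there so the download logic can be tried out without network access by passing in a stand-in function. Links that fail to download are skipped and reported rather than stopping the whole run.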

Final Thoughts

Congratulations! You have successfully learned how to scrape images using Octoparse. In this tutorial, we explored the auto-detection feature of the software and how to manually adjust settings for data extraction. You’ve gathered images with just a few clicks, without needing any knowledge of HTML, CSS, or Python. Thank you for reading! I hope this guide proves useful for your future computer vision projects. Have a great day!
