How to Create Your Own Web Crawler in JavaScript

Web crawlers, also known as spiders or bots, are programs designed to automatically browse the web, fetch pages, and extract useful data. They're the foundation of search engines like Google, price monitoring tools, SEO analyzers, and more.
In this tutorial, we'll build a simple web crawler using JavaScript and Node.js, and learn how to adapt it to our own use case.
⚠️ Disclaimer: Always check a website's robots.txt and terms of service before crawling. Unauthorized crawling can lead to legal issues or IP bans.
What is a Web Crawler?
Let's first understand the basics of a crawler. A web crawler is a program that systematically browses the internet, collecting information from the pages of websites. Common use cases include:
Search engines like Google indexing pages.
Price comparison tools scraping e-commerce data.
SEO tools collecting meta tags and performance metrics.
Research gathering public datasets.
Example: Googlebot crawls billions of pages daily to keep search results fresh.
When We Need a Custom Web Crawler
While there are many third-party APIs and ready-made scraping tools available, sometimes the prebuilt options don't meet our exact needs, for example when we want specific functionality that no existing API provides. A custom crawler lets us fully customize the process and focus on a specific niche, such as checking how Google or Bing bots see our pages or what content they might be missing, so we can pinpoint where we are falling behind.
For example, imagine we are running a blogging platform and some of our articles are not ranking well on search engines. By building a custom crawler, we can see how Googlebot crawls our site, analyze which pages are being indexed, and detect missing meta tags or blocked resources. This helps us fix SEO issues and improve our site's visibility, something most generic scraping tools can't do with the same precision (see the short sketch below).
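To make that concrete, here is a rough sketch of what such a check could look like, using the cheerio parser we install later in this tutorial. The function name and the specific tags it checks are illustrative choices on my part, not a complete SEO audit:
const cheerio = require('cheerio');

// Hypothetical helper: flag pages that are missing basic SEO tags.
function auditSeoTags(html, url) {
  const $ = cheerio.load(html);
  const issues = [];
  if (!$('title').text().trim()) issues.push('missing <title>');
  if (!$('meta[name="description"]').attr('content')) issues.push('missing meta description');
  if (!$('link[rel="canonical"]').attr('href')) issues.push('missing canonical link');
  if (issues.length > 0) console.log(`${url}: ${issues.join(', ')}`);
  return issues;
}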
Sometimes, React websites aren't fully crawlable by bots. Check out our guide on how to do SEO in React to fix this issue.
Below is a simple, step-by-step way to build a basic crawler in JavaScript:
1. Prerequisites
Before you start, ensure you have the following:
Node.js installed (v18+ recommended).
Basic understanding of JavaScript and HTTP requests.
A code editor like VS Code.
We'll also use these npm packages:
axios – for making HTTP requests.
cheerio – for parsing and manipulating HTML.
p-limit – for controlling concurrency.
Install them with:
npm init -y
npm install axios cheerio p-limit
2. Setting Up the Project
Create a folder for your project:
mkdir web-crawler
cd web-crawler
Inside, create an index.js file:
touch index.js
Your project structure should look like this:
web-crawler/
├── index.js
├── package.json
└── node_modules/
3. Fetching Web Pages
Let's start by fetching the raw HTML of a web page.
const axios = require('axios');

async function fetchPage(url) {
  try {
    const { data } = await axios.get(url, {
      headers: {
        'User-Agent': 'MyWebCrawler/1.0',
      },
    });
    console.log(`Fetched: ${url}`);
    return data;
  } catch (error) {
    console.error(`Error fetching ${url}:`, error.message);
    return null;
  }
}

fetchPage('https://example.com');
Pro Tip: Always set a custom User-Agent to identify your crawler.
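Another optional tweak (my suggestion, not something the rest of the tutorial depends on) is to give axios a timeout so a single slow or unresponsive server cannot stall the whole crawl. The 10-second value is just an example:
// Variant of fetchPage with a request timeout.
async function fetchPageWithTimeout(url) {
  try {
    const { data } = await axios.get(url, {
      headers: { 'User-Agent': 'MyWebCrawler/1.0' },
      timeout: 10000, // give up after 10 seconds
    });
    return data;
  } catch (error) {
    console.error(`Error fetching ${url}:`, error.message);
    return null;
  }
}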
4. Parsing HTML with Cheerio
We’ll use Cheerio to extract data from the HTML.
const cheerio = require('cheerio');

async function parsePage(html, url) {
  const $ = cheerio.load(html);
  console.log(`Title of ${url}:`, $('title').text());
  const links = [];
  $('a').each((_, element) => {
    const href = $(element).attr('href');
    if (href && href.startsWith('http')) {
      links.push(href);
    }
  });
  return links;
}
Now, combine it with fetchPage:
(async () => {
  const html = await fetchPage('https://example.com');
  if (html) {
    const links = await parsePage(html, 'https://example.com');
    console.log('Links found:', links);
  }
})();
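Note that parsePage keeps only absolute http(s) links and silently drops relative ones. If you also want to follow relative links, one possible refinement (my own variant, not part of the original snippet) is to resolve each href against the page URL:
// Hypothetical variant of parsePage that also resolves relative links.
function extractLinks(html, pageUrl) {
  const $ = cheerio.load(html);
  const links = [];
  $('a').each((_, element) => {
    const href = $(element).attr('href');
    if (!href) return;
    try {
      const absolute = new URL(href, pageUrl).href; // resolves relative paths
      if (absolute.startsWith('http')) links.push(absolute); // skips mailto:, javascript:, etc.
    } catch {
      // ignore hrefs that are not valid URLs
    }
  });
  return links;
}
You could then call extractLinks(html, url) in place of the link-collecting loop inside parsePage.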
5. Adding URL Queues
To avoid crawling the same page multiple times, maintain a queue and visited set.
const visited = new Set();
const queue = ['https://example.com'];

async function crawl() {
  while (queue.length > 0) {
    const url = queue.shift();
    if (visited.has(url)) continue;
    const html = await fetchPage(url);
    if (!html) continue;
    visited.add(url);
    const links = await parsePage(html, url);
    links.forEach(link => {
      if (!visited.has(link)) {
        queue.push(link);
      }
    });
  }
}

crawl();
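Keep in mind that this loop runs until the queue is empty, which on a large site means an effectively unbounded crawl. A simple safeguard (my addition, with an arbitrary cap) is to stop after a fixed number of pages:
const MAX_PAGES = 100; // arbitrary limit for this sketch

// Same crawl loop as above, but it stops once enough pages have been visited.
async function crawlLimited() {
  while (queue.length > 0 && visited.size < MAX_PAGES) {
    const url = queue.shift();
    if (visited.has(url)) continue;
    const html = await fetchPage(url);
    if (!html) continue;
    visited.add(url);
    const links = await parsePage(html, url);
    links.forEach(link => {
      if (!visited.has(link)) queue.push(link);
    });
  }
}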
6. Handling Rate Limiting
If you crawl too fast, websites may block your IP. Use the p-limit package to limit concurrency.
const pLimit = require('p-limit');
const limit = pLimit(2); // 2 concurrent requests

async function crawlWithLimit() {
  const promises = queue.map(url =>
    limit(() => fetchPage(url))
  );
  await Promise.all(promises);
}
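As written, crawlWithLimit only throttles the URLs that are already in the queue; it does not feed newly discovered links back in. One way to combine the concurrency limit with the queue from the previous step (a sketch, not the only possible design) is to drain the queue in concurrent batches. Also note that recent major versions of p-limit are published as ESM-only, so if require('p-limit') fails, either switch to import or install an older CommonJS release:
// Sketch: process the queue in concurrent batches while still deduplicating URLs.
async function crawlConcurrently() {
  while (queue.length > 0) {
    const batch = queue.splice(0, queue.length); // take everything queued so far
    const results = await Promise.all(
      batch.map(url =>
        limit(async () => {
          if (visited.has(url)) return [];
          visited.add(url);
          const html = await fetchPage(url);
          return html ? parsePage(html, url) : [];
        })
      )
    );
    results.flat().forEach(link => {
      if (!visited.has(link)) queue.push(link);
    });
  }
}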
7. Respecting robots.txt
Before crawling, check if you're allowed to crawl a website.
const axios = require('axios');

async function checkRobots(url) {
  try {
    const baseUrl = new URL(url).origin;
    const { data } = await axios.get(`${baseUrl}/robots.txt`);
    console.log('robots.txt:', data);
  } catch {
    console.log('No robots.txt found');
  }
}

checkRobots('https://example.com');
Note: This doesn't automatically enforce rules, but it's a good starting point.
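If you want the crawler to actually honor those rules rather than just print them, here is a deliberately simplified sketch that only handles Disallow lines under User-agent: * (it ignores Allow rules, wildcards, and per-bot sections):
// Very simplified robots.txt check: collect Disallow rules for "User-agent: *"
// and reject URLs whose path starts with any of them.
function isAllowedByRobots(robotsTxt, url) {
  const path = new URL(url).pathname;
  const lines = robotsTxt.split('\n').map(line => line.trim());
  let appliesToAllAgents = false;
  const disallowed = [];
  for (const line of lines) {
    if (/^user-agent:/i.test(line)) {
      appliesToAllAgents = line.split(':')[1].trim() === '*';
    } else if (appliesToAllAgents && /^disallow:/i.test(line)) {
      const rule = line.split(':')[1].trim();
      if (rule) disallowed.push(rule);
    }
  }
  return !disallowed.some(rule => path.startsWith(rule));
}
For anything serious, a dedicated parser such as the robots-parser package on npm handles wildcards, Allow rules, and crawl delays far more faithfully than this sketch.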
8. Full Code Example
Here’s a simplified version of the complete crawler:
const axios = require('axios');
const cheerio = require('cheerio');

const visited = new Set();
const queue = ['https://example.com'];

async function fetchPage(url) {
  try {
    const { data } = await axios.get(url, { headers: { 'User-Agent': 'MyCrawler/1.0' } });
    console.log(`Fetched: ${url}`);
    return data;
  } catch (error) {
    console.error(`Failed to fetch ${url}:`, error.message);
    return null;
  }
}

async function parsePage(html, url) {
  const $ = cheerio.load(html);
  const links = [];
  $('a').each((_, el) => {
    const href = $(el).attr('href');
    if (href && href.startsWith('http')) links.push(href);
  });
  return links;
}

async function crawl() {
  while (queue.length > 0) {
    const url = queue.shift();
    if (visited.has(url)) continue;
    const html = await fetchPage(url);
    if (!html) continue;
    visited.add(url);
    const links = await parsePage(html, url);
    links.forEach(link => {
      if (!visited.has(link)) queue.push(link);
    });
  }
}

crawl();
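If you want to keep what the crawler finds, one straightforward extension (not part of the original script) is to write the visited URLs to a JSON file once the crawl finishes, for example by replacing the bare crawl() call at the end with:
const fs = require('fs');

crawl().then(() => {
  // Persist every URL we visited for later analysis.
  fs.writeFileSync('crawled-urls.json', JSON.stringify([...visited], null, 2));
  console.log(`Saved ${visited.size} URLs to crawled-urls.json`);
});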
Conclusion
Building a web crawler in JavaScript is a great way to understand how search engines and scraping tools work. In this guide, we covered:
Fetching pages with axios.
Parsing HTML using cheerio.
Implementing a crawl queue.
Respecting web crawling ethics with robots.txt.
With this foundation, you can expand your crawler by adding data storage, handling dynamic pages with tools like Puppeteer, or scaling it for larger websites.
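For pages that render most of their content with client-side JavaScript, a headless browser can stand in for fetchPage. Here is a minimal sketch using Puppeteer (you would need to run npm install puppeteer first; the function name and the wait option are just reasonable defaults, not the only choice):
const puppeteer = require('puppeteer');

// Fetch the fully rendered HTML instead of the raw server response.
async function fetchRenderedPage(url) {
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'networkidle0' });
    return await page.content();
  } catch (error) {
    console.error(`Failed to render ${url}:`, error.message);
    return null;
  } finally {
    await browser.close();
  }
}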
If you found this useful, don't forget to follow Techolyze.