When you’re browsing the web for business or pleasure, you might want to know what’s behind the site you’re on. You might want to know who owns the domain, where it was registered, whether it’s connected to other sites, or whether it’s even safe to access. Some of this information is easy to find, and some not so much. If you want to dive into the nitty-gritty of domains and websites, you’ll need the proper tools and guides.
In this guide, we’ll explore all the options for a great OSINT investigation of websites and domains, using a collection of tools we’ve gathered from across the internet.
Our focal points for investigation will be the following:
- Internal pages & subdomains.
- Who’s behind the site?
- Related brands & websites.
- Infrastructure & threats.
- Web archives.
Internal Pages & Subdomains
A subdomain is a domain that is part of a larger domain. For example, if the site example.com has a community section, it might live at the subdomain community.example.com. Each website has multiple pages for its various activities and offerings; some are easily reached from the main page, while others are more hidden.
Locating and analyzing the pages and subdomains of a site can provide an in-depth look at the company’s activities, and may reveal new projects that have not yet been published.
This list of tools will provide you with all you need to find and study pages & subdomains.
- Robots.txt – A robots.txt file tells search engine crawlers which pages or files the crawler can or can’t request from a site. It is used mainly to avoid overloading the site with requests. From an OSINT perspective, we can read this file (served at example.com/robots.txt) to learn about the internal pages of the site, especially those the owners want to hide from search engines.
- Dorks – Google hacking, also known as Google dorking, is a technique that uses Google Search’s advanced operators to find information that a website exposes, often unintentionally, in Google’s index. Use the site: operator to retrieve the indexed pages of a website (site:example.com). You can also filter the results to the last month, week, or day to see the most recent changes to the website.
- SecurityTrails – One of the best tools to retrieve the subdomains of a website. Aside from the names, you’ll get the servers that host each subdomain, which helps you analyze them and understand how they relate to each other.
- GitHub SubScraper – SubScraper is a subdomain enumeration tool that uses a variety of techniques to find potential subdomains of a given target. Unlike SecurityTrails, this script runs live and retrieves the most recent subdomains.
- Hacker-Target’s ‘Extract Links’ tool – This tool offers a fast and easy way to scrape links from a web page. Listing the links, domains, and resources that a page points to tells you a lot about the page. The tool parses the HTML of the page, extracts its links, and displays the hrefs (“page links”) in plain text for easy copying or review.
- Dead link checker – Dead Link Checker crawls through your website, identifying broken links for you to correct. For OSINT we can use it to identify all the internal pages on a site.
- Link gopher – This nice extension helps you organize your results. Extract all links from a web page, sort them, remove duplicates, and display them in a new tab for copy and paste into other systems.
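To make the first and last few tools on this list more concrete, here is a minimal Python sketch of what they do under the hood: reading the paths listed in a robots.txt file, and pulling the internal links out of a page’s HTML. It uses only the standard library and works on text you have already downloaded. The function names (`robots_paths`, `internal_links`) are our own, chosen for illustration, and are not part of any of the tools above.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


def robots_paths(robots_txt: str) -> dict:
    """Collect Disallow, Allow and Sitemap values from robots.txt text."""
    found = {"disallow": [], "allow": [], "sitemap": []}
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if ":" not in line:
            continue
        field, value = line.split(":", 1)
        field, value = field.strip().lower(), value.strip()
        if field in found and value and value not in found[field]:
            found[field].append(value)
    return found


class _LinkCollector(HTMLParser):
    """Gather the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def internal_links(html: str, base_url: str) -> list:
    """Return sorted, de-duplicated absolute links on the base host or its subdomains."""
    collector = _LinkCollector()
    collector.feed(html)
    host = urlsplit(base_url).hostname or ""
    links = set()
    for href in collector.hrefs:
        absolute = urljoin(base_url, href)  # resolve relative links
        link_host = urlsplit(absolute).hostname or ""
        if link_host == host or link_host.endswith("." + host):
            links.add(absolute)
    return sorted(links)
```

Paths that appear under Disallow are often the interesting ones, since they are the pages the owner would rather keep out of search results. Links that resolve to a different hostname under the same domain are a cheap way to spot subdomains.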
Who’s Behind The Site?
At times during an investigation, we would like to know who the site owner is, so we can measure the risk or track the person behind an illegal activity.
WHOIS is an Internet service used to look up registration information about a domain name.
When registering a new domain, site owners provide some information about their identity and contact details. These tools give you what you need to investigate domain registration details (unless the owner has paid to hide them).
- Who.is – Search the whois database and look up domain and IP owner information. Just enter the domain name, and you’ll see who registered it.
- Whoxy – Another whois tool. You can search for a domain and get its historical whois records to investigate its earlier registrations. You can also run a reverse whois search to find all domains registered with a specific Email, Name, Phone number, or IP.
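The sites above are front ends to the WHOIS protocol, which is simple enough to try yourself: the client opens a TCP connection to port 43, sends the domain name followed by a line break, and reads back plain text. The sketch below shows that exchange plus a small parser for the ‘Key: value’ lines in the reply. The default server `whois.iana.org` is our assumption for a starting point; its reply usually names the registry’s own WHOIS server in a `refer:` line, which you would then query in turn. Field names vary between registries, so treat the parsed keys as a rough guide.

```python
import socket


def whois_query(domain: str, server: str = "whois.iana.org", timeout: float = 10.0) -> str:
    """Send one WHOIS query: plain text over TCP port 43."""
    with socket.create_connection((server, 43), timeout=timeout) as sock:
        sock.sendall(domain.encode("ascii") + b"\r\n")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # server closes the connection when it is done
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_whois(text: str) -> dict:
    """Group 'Key: value' lines into a dict of lists, skipping comment lines."""
    record = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("%", "#", ">>>")) or ":" not in line:
            continue
        key, value = line.split(":", 1)
        value = value.strip()
        if value:
            # Keys such as "Name Server" repeat, so keep every value.
            record.setdefault(key.strip().lower(), []).append(value)
    return record
```

A typical use would be `parse_whois(whois_query("example.com"))`, followed by a second query against whatever server the `refer` key points to.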
Related Brands & Websites
Locating and mapping the digital ecosystem of a website helps us understand the business structure, and even predict new activities based on the relations between digital metadata. Many companies that manage multiple brands do not make the connection to the main brand public, and only by investigating those digital relations can we establish it.
- Related: – Using Google’s related: operator, we can retrieve sites that share the same audience characteristics as our site. We can also use the ‘Google Similar Pages’ Chrome extension to get websites similar to the site we’re on.
- BuiltWith – Under the relationship profile tab, we’ll see all the analytics tags used by the website (Google Analytics, FB pixel, Hotjar, etc.) and how long they’ve been detected. The OSINT beauty of analytics tags is that site owners usually share the same tag ID across all their websites and manage them under one account. BuiltWith will show us all the other domains that use the same tag, and will help us establish connections between sites and brands we might never have connected otherwise.
- Crt.sh – Another element that is usually shared within sites belonging to the same owner is the security certificate. Using this tool, just enter an Identity (Domain Name, Organization Name, etc), a Certificate Fingerprint (SHA-1 or SHA-256), or a crt.sh ID and it will retrieve all connected certificates. You can also use the advanced search for a more targeted search.
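The shared-tag idea is easy to try by hand on pages you have already saved. The sketch below looks for a few common tag IDs in raw HTML and reports which IDs appear on more than one site. The regular expressions are our own approximations of the usual ID shapes (`UA-…`, `G-…`, `GTM-…`, and the number passed to the Facebook pixel’s `fbq('init', …)` call), so expect the occasional false positive. This is not how BuiltWith itself works, just a small illustration of the principle.

```python
import re

# Approximate shapes of common tag IDs; assumptions, not official formats.
TAG_PATTERNS = {
    "google_analytics_ua": re.compile(r"\bUA-\d{4,10}-\d{1,4}\b"),
    "google_analytics_ga4": re.compile(r"\bG-[A-Z0-9]{6,12}\b"),
    "google_tag_manager": re.compile(r"\bGTM-[A-Z0-9]{4,9}\b"),
    "facebook_pixel": re.compile(r"fbq\(\s*['\"]init['\"]\s*,\s*['\"](\d{5,20})['\"]"),
}


def extract_tag_ids(html: str) -> set:
    """Return a set of (tag_type, tag_id) pairs found in the HTML."""
    ids = set()
    for name, pattern in TAG_PATTERNS.items():
        for match in pattern.finditer(html):
            # Patterns with a capture group keep only the captured ID.
            value = match.group(1) if pattern.groups else match.group(0)
            ids.add((name, value))
    return ids


def shared_tag_ids(pages: dict) -> dict:
    """Map each tag ID seen on two or more sites to the sorted list of those sites.

    `pages` maps a site name to the HTML saved from it.
    """
    seen = {}
    for site, html in pages.items():
        for tag in extract_tag_ids(html):
            seen.setdefault(tag, set()).add(site)
    return {tag: sorted(sites) for tag, sites in seen.items() if len(sites) > 1}
```

Two sites that share an ID are a lead worth checking, not proof of common ownership: agencies and templates sometimes reuse a tag across unrelated clients.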
Infrastructure & Threats
We would like to identify the infrastructure on which the site is built, from a security perspective. The tools, platforms, or scripts used by the site can expose it to compromise, or even cause it to crash in the event of a malfunction.
- CheckShortURL – A known method among hackers is to hide a malicious site behind a short link and fool people into clicking on the ‘innocent’ URL. CheckShortURL is a link-expansion service: it lets you retrieve the original URL from a shortened link before clicking on it and visiting the destination.
- DNSdumpster – DNS recon & research. Find & lookup DNS records. By understanding the DNS and mapping the servers used by the site, we’ll be able to assess the risk and connections of the site.
- BuiltWith – Under the ‘Detailed Technology Profile’ BuiltWith will provide us with a detailed list of all the platforms, tools, and technologies used by the site, with the dates they were first and last detected. Additionally, we can search for specific technology and get all the sites that use it as part of their infrastructure. We can also explore its usage trends.
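Expanding a short link, as CheckShortURL does, comes down to following HTTP redirects one hop at a time and recording each `Location` header instead of loading the final page. Here is a minimal sketch of that idea using the standard library. The function names are ours, and we assume the shortener answers HEAD requests; some servers don’t, in which case a GET would be needed.

```python
import http.client
from urllib.parse import urljoin, urlsplit

REDIRECT_CODES = (301, 302, 303, 307, 308)


def head_once(url: str, timeout: float = 10.0):
    """Send a single HEAD request and return (status, Location header or None)."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("HEAD", path)
        response = conn.getresponse()
        return response.status, response.getheader("Location")
    finally:
        conn.close()


def expand_url(url: str, head=head_once, max_hops: int = 10) -> list:
    """Follow redirects hop by hop and return every URL in the chain."""
    chain = [url]
    for _ in range(max_hops):  # cap the hops so redirect loops terminate
        status, location = head(chain[-1])
        if status not in REDIRECT_CODES or not location:
            break
        chain.append(urljoin(chain[-1], location))  # Location may be relative
    return chain
```

The last entry in the returned list is the real destination, and the entries in between show any tracking or cloaking hops along the way. Because only headers are requested, the destination page’s content is never downloaded.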
Web Archives
A glimpse at the history of a site can provide us with many insights into its past activities, structure, and strategies. Sometimes, in the early days of a website, its owners tend to share more ‘sensitive’ information. In other cases, the site won’t be live or reachable at the moment of our search, so we can fall back on the latest archived version of it.
- Google cache – Simply paste the website URL into the Google search bar and click search. Once the results appear, find the one you want, click the little triangle next to it, and then click Cached. That will bring you to the stored copy of the page. Alternatively, once you are on the page you want to see cached, erase the ‘https://’ in the address bar and insert ‘cache:’ in its place. The same can be done with most other major search engines.
- Waybackmachine – A non-profit library of millions of historical snapshots of websites. Enter the URL, and you’ll be able to explore older versions of the site, and track changes that occur over time.
- Archive.is – Like the Waybackmachine, this tool takes a ‘snapshot’ of a webpage that will always be online even if the original page disappears. It saves a text and a graphical copy of the page for better accuracy and provides a short and reliable link to an unalterable record of any web page including those from Web 2.0 sites.
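If you need to check many URLs, the Wayback Machine can also be queried programmatically. The sketch below uses the Internet Archive’s availability endpoint, which returns a small JSON document describing the snapshot closest to a given date. The helper names are ours, and the code reads only the fields we rely on (`archived_snapshots`, `closest`, `available`, `url`, `timestamp`), so adjust it if the response you get looks different.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

AVAILABILITY_API = "https://archive.org/wayback/available"


def availability_url(page_url: str, timestamp: str = None) -> str:
    """Build the query URL; timestamp is YYYYMMDD (or longer) if given."""
    params = {"url": page_url}
    if timestamp:
        params["timestamp"] = timestamp
    return AVAILABILITY_API + "?" + urlencode(params)


def closest_snapshot(api_response: dict):
    """Return {'url', 'timestamp'} for the closest snapshot, or None if there is none."""
    closest = api_response.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    return {"url": closest.get("url"), "timestamp": closest.get("timestamp")}


def lookup(page_url: str, timestamp: str = None, timeout: float = 10.0):
    """Query the API over the network and return the closest snapshot, if any."""
    with urlopen(availability_url(page_url, timestamp), timeout=timeout) as response:
        return closest_snapshot(json.load(response))
```

For example, `lookup("example.com", "20150101")` asks for the capture nearest to 1 January 2015. A `None` result only means that no snapshot was reported for that URL, not that the page never existed.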