Scraper

The Website source repository lets you scrape a public-facing website to pull in articles that might help your users. Because websites and rich content vary widely, read the following information carefully when using this content source.

If Amplitude offers an integration for your documentation platform, use the integration instead. Dedicated Resource Center integrations offer a better end-to-end experience. The website scraper is a powerful tool, but treat it as a fallback option.

Website options

The website scraper provides unique options for source setup. Use this information with the procedure on the main Source Repository page.

Extract from URLs

There are two options when specifying the exact URLs you want to include:

  • URLs: Add as many URLs as you want, from individual links up to your entire website. Click Add URL to add each link.
    • For each URL you enter, Amplitude recursively follows and scrapes as many links as possible on that page.
    • Add multiple URLs if you have support articles that aren't reachable from the main URL.
    • This method might not find every applicable link.
  • Upload: Upload a CSV or XML sitemap file of the specific URLs you want to include.
    • Use this method to specify the exact URLs and pages that form the source repository.
    • When you provide a sitemap, you hard-code the source repository and prevent it from automatically incorporating new articles. You then add new articles manually.
    • CSV and XML files don't require any special formatting beyond the list of URLs. A standard sitemap works.
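If you use the Upload option, a minimal XML sitemap looks like the following. The URLs here are placeholders for illustration:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/help/getting-started</loc></url>
  <url><loc>https://example.com/help/billing</loc></url>
</urlset>
```

Each `<url>` entry becomes a candidate page for the source repository.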

Advanced options

Click Advanced beneath the URL section to access the following options:

  • Only include these URL paths: Filters the source repository by including only URLs that match the paths you specify. This is most useful when adding an entire website.
  • Exclude these URL paths: Filters the source repository by excluding URLs that match the paths you specify. This is useful if, for example, you want to keep your company's blog posts out of the source repository.
  • Override default selectors: Lets you specify only the content selectors that you want to include, or ignore particular website elements. For more information, review the following sections.
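Amplitude doesn't document the exact matching rules for the path filters, but their combined effect can be sketched as simple prefix matching. This is a minimal illustration, and the paths are hypothetical:

```python
from urllib.parse import urlparse

def keep_url(url, include_paths=(), exclude_paths=()):
    """Sketch of include/exclude path filtering.

    Assumed logic: if any include paths are set, the URL's path must
    start with one of them; a URL matching any exclude path is dropped.
    """
    path = urlparse(url).path
    if include_paths and not any(path.startswith(p) for p in include_paths):
        return False
    return not any(path.startswith(p) for p in exclude_paths)
```

For example, `keep_url("https://example.com/blog/post", exclude_paths=("/blog",))` returns `False`, so that page never enters the source repository.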

The website scraper pulls in all information about a page by default, not just the main content. This can include the page header, metadata, and other information. As a result, Resource Center articles can include both the main page content and extraneous content. Amplitude uses heuristics to identify and remove as much extraneous content as possible, but some content may still appear in the article. Use content selectors to target and identify content to add to the Resource Center article.

Selectors override what the scraper considers the main content on your page. The website scraper uses the selector to pull only content associated with that selector. You can create multiple selectors to target different pieces of content.

When you have multiple selectors, the website scraper checks each selector in the order that you created them. If the scraper doesn't find any of the selectors on a page, it falls back to saving content based on Amplitude's heuristics.
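Amplitude doesn't publish its selector-matching internals, but the ordered-fallback behavior described above can be sketched locally with Python's built-in `html.parser`. The class names used here are hypothetical:

```python
from html.parser import HTMLParser

class SelectorExtractor(HTMLParser):
    """Minimal sketch: collect text inside elements carrying a given class."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.depth = 0          # > 0 while inside a matching element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1     # nested element inside the matched one
        elif self.class_name in (dict(attrs).get("class") or "").split():
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())

def extract(html, selectors):
    """Try each selector in order; return None (fall back to heuristics) if none match."""
    for class_name in selectors:
        parser = SelectorExtractor(class_name)
        parser.feed(html)
        if parser.chunks:
            return " ".join(parser.chunks)
    return None
```

For example, `extract(page_html, ["main-content", "article-body"])` returns the text of the first selector that matches anything, mirroring how the scraper tries selectors in creation order before falling back.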

Re-sync excluded pages

The website scraper automatically ignores some pages. These typically include landing pages or pages that only contain links and no other text. Usually, these types of pages don't contain information suitable for a Resource Center article, and you can safely remove them from the source repository.

Occasionally, however, you might want to re-include these pages so that they stay active in your source repository.

Re-sync pages

  1. Go to Guides and Surveys > Content and select the source repository you want.
  2. Click Sync errors.
  3. Search through the excluded pages and select the ones you want to re-include.
  4. Click Force Sync and Publish.
