Blog Scrapes

This article will dive into the steps necessary to run a scrape on a blog. We may need to use this tool during the build and launch process as some dealers may be unable or unwilling to provide us with an easily uploadable import file. Please note that much of this process is trial and error and there are situations in which the scrape tool will simply not work. In those scenarios, we may need to resort to going back to the dealer to see if we can get the file, or in the case of dealers coming to us from another LV brand, we may need to get in contact with the team in question directly to see if they can provide us with a CSV import.

Setting Up The Scrape

Logging In

First, we are going to access the dealer’s blog by logging in to their account and navigating to Websites > Blog. The name in the top left corner should be the name of the dealer you are working on.


If it’s not the correct name, then double check to make sure you are logged in to the correct account. Occasionally, if an account has multiple domains/sites, you may not be able to log in to the correct site from the dealer’s account. You may need to log in to the main InteractRV account, and select the site from the list of all sites.


You can use the Search function on the All Sites page to find what you need.

Once you’ve found the site you need, you can go to that site’s Dashboard by hovering over the URL area and selecting “Dashboard”.

Adjusting The Theme

Because the iteration of WordPress associated with our regular theme is so old, the Scrape plugin does not work correctly with that theme. In order to run the scraper, we will first need to change the theme to a “newer” theme that allows the plugin to work as intended.

Navigate to Appearance > Themes in the left-hand menu.

 

Once you’ve arrived at the Themes page, you will see a handful of different options for themes. We are going to select the theme Radiate by hovering over it and clicking the “Activate” button.

 

This will automatically install the new theme so that the plugin will work correctly. Note that the blog itself will lose the styling it previously had that matched it to the site. This is okay – we will put it back in place after we finish with the scrape.

Starting the Scraper

Next, to begin scraping, we will navigate to the Scrapes > Add New option on the left-hand side of the page.

This will begin the scraping process by prompting you to name the Scrape and choose a “format” for the scrape.

  • The Name of the scrape is not important as we will likely not have multiple scrapes, but if for some reason we do need to scrape from multiple sources to have a complete blog, make sure these are labeled in a way that allows you to easily distinguish between them.

  • The Task Type, when we are scraping content from another blog, should be “Serial” as this will pull up the correct prompts for a blog with multiple posts on it.

Once we select “Serial”, we will be led through a series of prompts that will allow us to visually choose the specific item we are trying to pull. I will review each of these in the next section of the guide.

Selecting Your Targets

Request

First things first, the scraper needs to know where it will be pulling the scrape from. Cookies and Request Origin can be ignored, but Source URL should be filled out with the URL of the blog you are trying to scrape from.

Once you add this URL, the scraper will point to it and allow you to select further options to start pinpointing important elements of each post. The first one is Post Item. You’ll want to select the blue icon to the right of the field to visually select the Post Item.

This will pull up a window that displays a preview of the blog you’re trying to scrape. When you hover over certain elements, they will turn a red color to indicate that’s what you’re selecting. You want to try and position your mouse in a way that entirely covers ONE single post item in the list of posts with the red hover. Ensure you are not selecting the whole list of posts – it must be ONE individual post item.

Once you’ve gotten your red box in the right place, click on it to select it. This will populate the field with the “value” the scraper is looking for.

You may find value in using the “Exact match only” button later. Do not turn it on during the first attempt, as it should not be necessary to pull the right data.
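
If it helps to picture what the visual selector is doing, it is essentially recording a pattern that matches one post element so that it can be repeated for every entry on the page. Below is a rough Python sketch (the URL and CSS selectors are made up for illustration and are not part of the scrape tool) showing the difference between grabbing the whole post list and grabbing one individual post item.

```python
# Rough sketch only: the URL and CSS selectors below are placeholders;
# the real values depend entirely on the blog being scraped.
import requests
from bs4 import BeautifulSoup

html = requests.get("https://example-dealer.com/blog/", timeout=30).text
soup = BeautifulSoup(html, "html.parser")

# Too broad: this grabs the container that wraps EVERY post on the page.
whole_list = soup.select(".blog-posts")            # usually matches one wrapper element

# What we actually want: a selector that matches ONE element per post,
# so the scraper can repeat it for every entry in the list.
post_items = soup.select(".blog-posts article.post")

print(len(whole_list), "wrapper(s) vs", len(post_items), "individual post items")
```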

Moving on, the “Next Page” value needs to be selected. Similarly to the last prompt, this will also pull up a preview of the blog page, and you will need to select the correct area with the red hover before clicking. In this example dealer, they don’t have a Next Page button, just a “Load More”, so I am going to try selecting this to see if it works. However, in many cases, a blog will have a clear “Next” or arrow icon that leads to the next page that you can select.

You may also need to make use of the “Use URL parameter” here if the regular selection method does not work. In that case, you would select that option and “Add new parameter” to add in the appropriate information.
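
For a sense of what the URL parameter option is getting at, many blogs expose their pagination directly in the address bar. The following is a rough Python sketch using an assumed “paged” parameter (a common WordPress convention) and a placeholder URL; the real parameter name has to be confirmed on the blog you’re scraping.

```python
# Illustration only: assumes the source blog paginates with a "paged"
# query parameter; the real parameter name and page count must be
# confirmed on the blog itself.
import requests

base_url = "https://example-dealer.com/blog/"

for page in range(1, 6):  # first five pages as an example
    resp = requests.get(base_url, params={"paged": page}, timeout=30)
    if resp.status_code == 404:
        break  # ran out of pages
    print(f"Page {page}: {resp.url} -> {resp.status_code}")
```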

The scraper should now (hopefully) have all the information it needs to correctly guess what to pull for the following sections of the scrape.

Taxonomy

If you notice that the blog you’re scraping has categorized entries, you can set that up here. Make sure they are using Categories specifically – not just Tags, as Tags are handled in a separate step later. If they do not use Categories, you can skip this section.

Under “Create Categories”, you’ll want to select the option “Category”. This will prompt you to select an area on the blog entries where the categories are being displayed.

Once you select the blue icon to the right of the Value field, you can once again use the red hover to select the categories. Importantly, you’ll want to select the area on the chosen entry that displays ALL of the categories for that post, not just one of the categories.

After you’ve selected that, you’ll also want to indicate what separator is being used between the various categories. We can see from the hover that there is a comma between each category on the post in question, so we will put a comma in that field.

We can ignore the list of existing Categories.
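
As a quick illustration of what the separator field does, the scraper takes the text of the category area you selected and splits it on that separator to get the individual category names. Here is a tiny Python sketch with an invented sample string:

```python
# Invented sample text for illustration; the real string is whatever
# appears in the category area you selected on the post.
scraped_category_text = "RV Tips, Maintenance, Travel"
separator = ","

categories = [name.strip() for name in scraped_category_text.split(separator)]
print(categories)  # ['RV Tips', 'Maintenance', 'Travel']
```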

Post

Now we will select some various sections of the individual post to use as an example for the scraper to pull the correct content into the correct fields. Keep in mind that the post it’s going to show you as an example is the one that you selected as an individual post back in the “Request” section of the scrape. If for some reason the individual post being pulled up is a bad example or doesn’t contain the elements the scraper needs to learn the content layout from, go back to that step and select a different individual post.

First, let’s select the Title of the post. Similarly to previous steps, we will click the blue icon to the right of the Value field to pull up a preview, and use the hover to select the Title of the entry in the preview.

Next, we’ll select the Content. You can try to allow the scraper to automatically detect the content by leaving the setting on “Detect automatically”, but I typically find you will get better results if you manually select the content. To do so, switch the setting to “Select from source”, and once again use the blue icon to the right of the Value field to select this. This process can be a little complicated as it can sometimes be tough to pick the right “box” that encompasses the content of the entry without the title or other parts of the page. Occasionally, it can be helpful to utilize the “Disable Styles” checkbox in the top left to remove the styling from the page, as this can make it more obvious where the content begins and ends on the page.

 

Selecting with styling on

 

Selecting with styling off

 


Once you’ve selected the content, we should also make sure that HTML tags are enabled (we want the dealer to keep any HTML styling they’ve utilized) and that we download any images they are using to our media library.
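
For a mental model of why picking the right content “box” matters, the sketch below (Python; the “.entry-content” selector and URL are placeholder assumptions, not values from the tool) pulls the inner HTML of an assumed content container and lists the image URLs inside it – roughly the information the scraper keeps when HTML tags are enabled and images are downloaded.

```python
# Sketch only: ".entry-content" and the URL are placeholder values;
# the real content selector is whatever box you picked in the preview.
import requests
from bs4 import BeautifulSoup

html = requests.get("https://example-dealer.com/blog/sample-post/", timeout=30).text
soup = BeautifulSoup(html, "html.parser")

content = soup.select_one(".entry-content")
if content is not None:
    post_html = content.decode_contents()            # keeps the dealer's HTML styling
    image_urls = [img.get("src") for img in content.select("img")]
    print(len(post_html), "characters of content,", len(image_urls), "images to download")
```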

The Excerpt section is not necessary; as our layout automatically pulls an excerpt from the content of the post, we do not need to ensure this is selected correctly. We can leave this on “Generate from content”.

The Tags section may or may not be necessary depending on the content of the blog. If you notice that the dealer is using tags on their current blog, we may want to pull these over. These are not the same as Categories and should not be treated as such, so make sure you thoroughly investigate the setup of their current blog. We want to preserve their organization as much as we can in the import process. The example dealer only used Categories, not Tags, but if you see that a dealer is using Tags, you’ll want to use the blue icon on the Value field to select them accordingly. Remember, you’ll want to select the ENTIRE tags area, not just one individual tag!

Finally, the Featured Image field. We do want to try and port this over as much as possible, so we should set it up properly. The scraper may detect this automatically, but I try to manually make sure it is selecting the right thing, just in case.

We should also ensure this is set to “Select from source” and not “Select from media library”.

 

Customize

These fields are not necessary for a first go around. They may potentially help in troubleshooting, but for now, we will leave them untouched.

Publish

Most of the default Publish settings can be left untouched. However, if the posts on the dealer’s current blog have a visible publish date, we will want to port that information over. Under Date, we’re going to choose “Select from source” and use the icon to the right of the Value field to visually select the part of the post that has the date on it.
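
If you want to sanity-check that the visible date can actually be read as a real date, here is a small Python sketch; the date string and format are assumptions and should be matched to whatever the dealer’s posts actually display.

```python
# Assumed example: the dealer's blog shows dates like "November 13, 2024".
# Adjust the format string to whatever their posts actually display.
from datetime import datetime

visible_date = "November 13, 2024"
parsed = datetime.strptime(visible_date, "%B %d, %Y")
print(parsed.strftime("%Y-%m-%d"))  # 2024-11-13, a WordPress-friendly format
```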

 

Filter

The default settings on this section can also generally be left alone for the first run. We may need to tweak these depending on the results of our first scrape.

Schedule

Most of these settings can remain untouched; however, we do not want this to run as a periodic scrape. We only need it to run once, so we are going to uncheck “Unlimited” under Total Runs so that the number is set to 1.

Others

This section does not need to be adjusted on the first run.

Running the Scrape

Submitting the Scrape

Once all of the information in the last section has been filled out appropriately, it’s time to run the scrape!
Just select the “Save” button in the bottom right of the page.

This will take you to a page with the list of all Scrapes, and the scrape will begin to run.

Is It Done Yet?

To the right-hand side of the Scrape name, there will be a “Schedules” section where you can determine if the scrape is done running. As long as it still says “Last Complete: None”, the scrape is still running.

We’ll want to wait until Last Complete has a date and time that it finished to check and see if it pulled all of the content appropriately. Sometimes this can complete quickly (within 5-10 minutes), but it can take up to a few hours depending on how much content is available to be pulled over. If you set it to run and it still says “None” in Last Complete the following day, then there is an issue, and the scrape should be cancelled and re-run (or possibly have its parameters adjusted).

Checking Content

Once the scrape is complete, we’ll want to double check and ensure that the content that was pulled in looks correct and there were no issues in what was pulled over.

Generally, we’ll want to check a handful of entries to make sure that the entirety of the content of the entry was pulled over, as well as the appropriate images. We also want to make sure that the number of entries looks correct – if what got scraped only goes back a few months, while their current blog goes back years, then something may have gone wrong.
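
One optional way to sanity-check the entry count, assuming both the source blog and our blog are WordPress sites with the public REST API enabled (which will not always be the case), is to compare the post totals the API reports. This is a rough sketch with placeholder URLs, not a required step.

```python
# Rough sketch with placeholder URLs; only works if both sites are
# WordPress installs with the public REST API enabled.
import requests

def post_total(site_url: str) -> str:
    resp = requests.get(f"{site_url}/wp-json/wp/v2/posts", params={"per_page": 1}, timeout=30)
    # WordPress reports the total number of published posts in this header.
    return resp.headers.get("X-WP-Total", "unknown")

print("Source blog posts:", post_total("https://example-dealer.com"))
print("Scraped blog posts:", post_total("https://example-dealer-dev.com"))
```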

Troubleshooting

Coming soon!

Final Steps

Once you’ve completed the scrape and confirmed that the content came over as much as possible, we will need to switch the blog back to its original theme.

Navigate to Appearance > Themes on the left-hand side to get back to the list of themes for the blog.

Since we no longer need the updated theme to use the Scrape tool, we can switch back to the original theme, “Interact RV w/Bootstrap 3”. Hover over it and select “Activate” to turn on this theme.

When you activate this theme, a screen will appear asking you to choose settings for implementing the theme. You will want to change ALL of these dropdowns to “No”, and then click “Save Changes”.

The old theme will now be activated again. You’ll want to navigate to the blog (/blog on the dev and/or live domain name for the dealer) to ensure that the styling is correct and that no adjustments need to be made to make the blog look like it’s part of the regular site. With that, the process is complete!