Part 2: Processes
The commands for processing pages to extract links or articles are written in this section. Each process section should start with the BEGIN_PROCESS <name> command, where <name> is the process name, and should end with END.
INCLUDE_LINKS = "index.html"
EXCLUDE_LINKS = "/SPECIALS/"
REPEAT_FOR_ALL_LINKS LIST_1 GetArticle
In Example 1.0, after START_URL and SOURCE are set, a process named "FirstProcess" is called. This process is shown in Example 1.1. In this process the DOWNLOAD_PAGE command is called. Usually this command takes one URL at a time from the URL list and downloads the page. On the initial call of DOWNLOAD_PAGE, since there is nothing in the URL list, START_URL is used.
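The fallback described above can be pictured with a small Python sketch. It is only an illustration of the stated behaviour, not NewsRaider's implementation: the function name next_url, the use of a deque and the example.com addresses are all assumptions made for the example.

```python
from collections import deque

def next_url(url_list, start_url):
    """Pick the URL a DOWNLOAD_PAGE-like step would fetch next (illustrative only)."""
    if url_list:
        # Assumption: URLs are taken from the front of the list, one at a time.
        return url_list.popleft()
    # Nothing has been extracted yet, so fall back to the start URL.
    return start_url

urls = deque()
first = next_url(urls, "http://example.com/start.html")
urls.append("http://example.com/a/index.html")
second = next_url(urls, "http://example.com/start.html")
```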
The INCLUDE_ALL_LINKS, INCLUDE_LINKS and EXCLUDE_LINKS commands set the rules for extracting URLs; the GET_LINKS command then populates the specified URL list.
Once the URLs are extracted to the URL list, the "GetArticle" process is called repeatedly for all the extracted links in the specified list. The command REPEAT_FOR_ALL_LINKS is used for this.
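As a rough picture of how the include and exclude rules narrow down a list of links, here is a short Python sketch. It assumes plain substring matching, which this guide does not confirm; filter_links and the example.com links are hypothetical names used only for illustration.

```python
def filter_links(links, include=None, exclude=None):
    """Keep links containing `include` (if given) and not containing `exclude`."""
    kept = []
    for link in links:
        # Assumption: a rule matches when its text appears anywhere in the link.
        if include is not None and include not in link:
            continue
        if exclude is not None and exclude in link:
            continue
        kept.append(link)
    return kept

found = [
    "http://example.com/2005/WORLD/index.html",
    "http://example.com/SPECIALS/2005/index.html",
    "http://example.com/about.html",
]
# Mirrors INCLUDE_LINKS = "index.html" and EXCLUDE_LINKS = "/SPECIALS/"
list_1 = filter_links(found, include="index.html", exclude="/SPECIALS/")
```

Each link left in list_1 would then be handed, one at a time, to the article process.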
//Example 1.2
BEGIN_PROCESS GetArticle
SECTION_FROM "<head>" to "</body>"
ARTICLE_FROM "<!--endclickprintexclude--><p>" to "<!--endclickprintinclude-->"
INCLUDE_IMAGES = ".jpg"
FIND_LINE "<title>CNN.com -"
VAR_REMOVE_FROM VAR_START to "<title>CNN.com - "
VAR_REMOVE_FROM " -" TO VAR_END
Example 1.2 shows the process of extracting an article.
The DOWNLOAD_NR_PAGE command downloads an article page here. When downloading articles, it is advised to use DOWNLOAD_NR_PAGE rather than DOWNLOAD_PAGE in NewsRaider, since this checks for duplicate articles and ignores them.
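The idea of skipping duplicates can be sketched in Python as follows. How NewsRaider actually recognises a duplicate article is not described here, so this sketch simply assumes that a URL seen before counts as a duplicate; download_if_new and fake_fetch are hypothetical helpers.

```python
def download_if_new(url, seen_urls, fetch):
    """Fetch `url` unless it was already seen; return None for a duplicate."""
    if url in seen_urls:
        # Assumption: duplicates are detected by URL alone.
        return None
    seen_urls.add(url)
    return fetch(url)

def fake_fetch(url):
    # Stand-in for a real download so the sketch runs without a network.
    return "<html>page for " + url + "</html>"

seen = set()
first = download_if_new("http://example.com/a.html", seen, fake_fetch)
second = download_if_new("http://example.com/a.html", seen, fake_fetch)
```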
In order to process the individual articles we need to separate the bits of the article page we want from those that we don't. The best way to do this is to open up the article pages and look at their HTML source to find out which bits we want and what makes them "special". The most important part of this is to find out which part of the page matters, and where in the page the article text begins and ends. The commands we use for this are SECTION_FROM and ARTICLE_FROM.
This script will take all of the text between the ARTICLE_START= and ARTICLE_END= markers and use it as the main page text of the NewsRaider article corresponding to that URL.
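Taking the text between a start marker and an end marker can be illustrated with a few lines of Python. This is a sketch under assumptions: it treats both markers as excluded from the result and uses a made-up page; text_between is a hypothetical helper, not a NewsRaider function.

```python
def text_between(text, start, end):
    """Return the text after the first `start` marker up to the next `end` marker."""
    i = text.find(start)
    if i == -1:
        return None
    i += len(start)
    j = text.find(end, i)
    if j == -1:
        return None
    return text[i:j]

page = (
    "<html><head><title>T</title></head><body>"
    "<!--endclickprintexclude--><p>Body text.</p>"
    "<!--endclickprintinclude--></body></html>"
)
# First narrow down to the section, then to the article inside it.
section = text_between(page, "<head>", "</body>")
article = text_between(
    section, "<!--endclickprintexclude--><p>", "<!--endclickprintinclude-->"
)
```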
This is all very well: we have the text we want, but NewsRaider is more powerful than that. We can specify the category for the article, who wrote it, when it was published and so forth.
To accomplish these tasks, Raid Script has some very simple yet powerful commands that allow anyone to manipulate text-based content, even if they have no programming experience.
To illustrate, let's see how we get the Title for the CNN Articles that the example script processes. We will jump in at the deep end with the example, and explain it afterwards:
The last bit is all we are interested in, onwards from the "//Acquire Title" comment. Let's go through it step by step.
FIND_LINE "<title>CNN.com -"
This tells NewsRaider to find the first line of the article with the text "<title>CNN.com -" in it. What you must understand here is that Raid Script operates on the source HTML, not the "rendered" web page, so the line we are finding isn't actually visible in your web browser, but it's there nonetheless.
You don't need to understand HTML to write raid scripts, but you will need to understand how to view HTML page sources.
Once NewsRaider has found that line it will make it current. Think of this as meaning that it will copy the entire contents of that line into memory, just like copy and paste does in a word processor.
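In Python terms, the search just described might look like the sketch below. It assumes a simple "contains this text" test on each line of the page source; find_line and the sample source are illustrative only.

```python
def find_line(source, needle):
    """Return the first line of `source` that contains `needle`, or None."""
    for line in source.splitlines():
        if needle in line:
            return line
    return None

source = (
    "<html>\n"
    "<head>\n"
    "<title>CNN.com - Experts: Tsunami disaster might ease terrorism"
    " - Jan 7, 2005</title><script\n"
    "</head>\n"
)
current = find_line(source, "<title>CNN.com -")
```

Note that the whole line is returned, not just the part that matched.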
Now to the next line of the script:
VAR=LINE
VAR is at the core of doing all of the clever stuff with your raid scripts. VAR is a variable; it's the only true variable currently supported by Raid Script (it's all we need to do some complex stuff simply). If you don't know what a variable is, then think of it as a place where text is stored and can be changed.
We saw above that the FIND_LINE command finds a line that contains some text and then copies the entire line into memory. The place it copies it to is called LINE. LINE is a variable like VAR but it can only be copied from. Nothing else can be done with LINE. It is what is termed an Input variable.
Let's assume that the line found by FIND_LINE was:
"<title>CNN.com - Experts: Tsunami disaster might ease terrorism - Jan 7, 2005</title><script"
So that means that the contents of LINE will also be this text and, after the VAR=LINE, so too will the contents of VAR.
The script has filled VAR with something useful; now we need to change it into exactly what we want:
VAR_REMOVE_FROM VAR_START TO "<title>CNN.com -"
The command VAR_REMOVE_FROM is one of the commands you will be using often.
It says "Remove all text from the start of VAR up to and including whatever is specified in the quotation marks". (For a more detailed explanation of the commands, see the commands section.)
Note that VAR_START is an item that means "whatever is at the beginning of VAR". If you always knew what that text was, you could use the literal text in quotation marks instead.
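The effect of this form of the command can be approximated in Python as below. This is a sketch, not NewsRaider's code: remove_from_start is a hypothetical helper, and leaving VAR unchanged when the marker is missing is an assumption.

```python
def remove_from_start(var, marker):
    """Drop everything from the start of `var` up to and including `marker`."""
    idx = var.find(marker)
    if idx == -1:
        # Assumption: if the marker is not found, VAR is left as it is.
        return var
    return var[idx + len(marker):]

var = (
    "<title>CNN.com - Experts: Tsunami disaster might ease terrorism"
    " - Jan 7, 2005</title><script"
)
# Marker written with the trailing space, as in Example 1.2.
trimmed = remove_from_start(var, "<title>CNN.com - ")
```

In this sketch the marker has to include the trailing space, as it does in Example 1.2; without it a leading space would remain. Whether NewsRaider trims that space itself is not stated here.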
So, after this command the contents of VAR will be:
"Experts: Tsunami disaster might ease terrorism - Jan 7, 2005</title><script"
That's starting to look like a title for a NewsRaider article, but it still has the stuff at the end we don't want. To get rid of that we use VAR_REMOVE_FROM again, but this time we use the VAR_END item rather than VAR_START:
VAR_REMOVE_FROM " -" TO VAR_END
VAR_REMOVE_FROM in this case says "Find the first occurrence of whatever is in the quotation marks and remove all text from there up until the end of VAR." The result of this command will be:
"Experts: Tsunami disaster might ease terrorism"
The final line in this part of the script is:
We saw how LINE was an Input Variable that could only provide text to VAR. Well, TITLE is an Output Variable that can only take text from VAR. NewsRaider knows that once it has finished processing the article, it should use whatever text is in TITLE as the article title.
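Putting the title steps together, the whole walk-through can be mimicked in Python roughly as follows. It is a sketch under assumptions (substring matching, markers left alone when not found, a made-up page source); extract_title is a hypothetical name, and its return value stands in for the text that ends up in TITLE.

```python
def extract_title(source):
    """Mimic the title steps: find the line, copy it to VAR, trim both ends."""
    # FIND_LINE: the first line containing the marker becomes LINE.
    line = None
    for candidate in source.splitlines():
        if "<title>CNN.com -" in candidate:
            line = candidate
            break
    if line is None:
        return None
    var = line  # LINE is copied into VAR
    # Remove from the start of VAR up to and including the start marker.
    start_marker = "<title>CNN.com - "
    idx = var.find(start_marker)
    if idx != -1:
        var = var[idx + len(start_marker):]
    # Remove from the first " -" up to the end of VAR.
    idx = var.find(" -")
    if idx != -1:
        var = var[:idx]
    return var  # this is the text that would be handed to TITLE

source = (
    "<html>\n"
    "<head>\n"
    "<title>CNN.com - Experts: Tsunami disaster might ease terrorism"
    " - Jan 7, 2005</title><script\n"
    "</head>\n"
)
title = extract_title(source)
```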