(somewhat) Social

Entries in Connect (1)

Sunday
Jan162011

Loading MS Connect To A DB–Part 4: Loading Old Data

In Parts 2 and 3 of this series we started loading data from the RSS feed and keeping it up to date. The only trouble is that we are missing all of the old data. In an effort to get this we’ll use a Google search and then parse the results into the database.

Interesting thing I found in working on this script, it doesn’t matter how many results Google or Bing say that they are returning, you can never actually get past result 1,000, it’s an API limitation apparently. Try one of these searches to see what I mean:

http://www.google.com/search?q=SQL+server&num=100&as_qdr=all&start=901
http://www.bing.com/search?q=sql+server&go=&form=QBLH&filt=all&first=995

Google will actually return an error whereas Bing will just keep giving you the final page of results over and over.

 

 

Building Our Search Query

I found best results using Google, in particular there is some good information on the URL parameters for searches, something I could not find for Bing.

The search sent into Google was SQL site:connect.microsoft.com/SQLServer. This gave me a basic URL of http://www.google.com/search?q=SQL+%2Bactive+site%3Aconnect.microsoft.com%2FSQLServer

The URL parameter num=100 will return 100 results per page. To allow for paging through results we send start= with a numerical value (we’re not paging with the initial query and so won’t actually pass any value to this).

This gives us an initial Google search URL of http://www.google.com/search?q=SQL+%2Bactive+site%3Aconnect.microsoft.com%2FSQLServer&num=100&as_qdr=all&start=

 

Paging Through Search Results

Using the Net.WebClient object again we grab the initial URL into a variable. Then we use our old friend regex to look and see how many results the search actually returned. After all, there’s no point in paging through 800 results if only 20 are returned.

This value gets loaded into $ResultTally.

$PrimaryURL = "http://www.google.com/search?q=SQL+%2Bactive+site%3Aconnect.microsoft.com%2FSQLServer&num=100&as_qdr=all&start="
$PrimaryPulledURL = (new-object Net.WebClient).DownloadString("$PrimaryURL") 
$regexResults = "<div id=resultStats>[^>]*?<nobr"
$Results = [regex]::Matches($PrimaryPulledURL, "$regexResults") | % { $_.Value -replace ",", "" } | Select -First 1 
$ResultsCount = [regex]::Matches($Results, "\d{1,5}") | Select Value
$ResultTally = [int]$ResultsCount.Value

 

In the case of this query Google was returning 2,420 results. With the API limitation we can never access more than 1,000 results and so set $ResultTally to be the results count or 1,000, whichever is greater. We’ll also initialize a $Count variable which will be used for paging purposes.

if ($ResultTally -gt 1000) { $ResultTally = 1000 } 
$Count = 1

Now we have our results we can use this to page through the search results. This is a case of taking the $PrimaryURL of http://www.google.com/search?q=SQL+%2Bactive+site%3Aconnect.microsoft.com%2FSQLServer&num=100&as_qdr=all&start= and appending the value of $Count. For each iteration of the loop we add 100 to the value of $Count until we reach the value of $ResultTally. Each loop will call our ParseResults function.

do {
$URL = "$PrimaryURL$Count"
ParseResults $URL $Count
$Count = $Count + 100
}
While ($Count -le $ResultTally)

 

Parsing The Search Results

As a part of this script a table is used to temporarily hold the results of the search. This is created and destroyed as a part of the Powershell script. It has no use beyond this process and so compartmentalizes things nicely.

The ParseResults function is called with a URL. We again use the Net.WebClient object and regex to go through the web page and pull out all of the Connect URLs. Each of these URLs we then load into the database.

$SearchURL = (new-object Net.WebClient).DownloadString("$URL")
$regexUp = "http://connect.microsoft.com/SQLServer/feedback/details[^>]*?`""
$up = [regex]::Matches($SearchURL, "$regexup")
foreach ($ConnectURLRegex in $up)
{
$ConnectURL = $ConnectURLRegex.Value.Substring(0,$ConnectURLRegex.Value.LastIndexOf("`""))
invoke-sqlcmd @params -Query "insert into ConnectItems_Temp (URL) values ('$ConnectURL')"
}

You might look at this and wonder why there is no SQL instance or database connection information being passed to invoke-sqlcmd. As it happens they are.

In previous scripts (which I will be updating) each and every call to the database has had to have the connection information explicitly defined. This can be a serious pain should you wish to change the SQL instance or database as you’d need to change it in multiple locations. I was reading a Blog post by Don Jones (blog|twitter) on Splatting and decided to give it a go.

Now, rather than defining it each time, a single hashtable can be stored in a variable and used.

$params = @{'server'='.';
       'Database'='DenaliConnect'}

 

Final Steps

Now that the table contains a list of URLs we can execute a process very much like that used in Part 3 updating each row by pulling and then parsing the data on the page.

Once that’s done we just load the data into the main table (avoiding duplicates) and drop the work table.

You can download the full script here.

 

Hopefully you’ve enjoyed this short series of posts and have given it a try on your local machine. Let me know if you did and what results you got.