Per my last post, I've been architecting a fabric application that would consume Twitter.com messages (timelines in the Twitter API vernacular), store them in SQL Server and generate some usage reports that I wanted to post in a graphical way on the web.
Implementing the fabric client that consumes the REST messages from Twitter.com was straightforward. Loading the Twitter public user timeline with xmlDoc.Load works well, and so does parsing the last twenty messages out of the resulting document.
// xmlDoc is an XmlDocument, url points at the public timeline,
// and twitter_chunk is the ArrayList that collects the parsed statuses.
xmlDoc.Load(url);
XmlNodeList Statuses = xmlDoc.SelectNodes("/statuses/status");
foreach (XmlNode Status in Statuses)
{
    tw_status = new Twiter_Status();
    // created_at looks like "Fri May 04 19:39:21 +0000 2007";
    // keep month, day, time and year, drop the weekday and the UTC offset.
    String[] date_parts = Status.ChildNodes[0].InnerText.Split(new char[1] { ' ' });
    tw_status.created_at = date_parts[1] + " " + date_parts[2] + " " + date_parts[3] + " " + date_parts[5];
    tw_status.id = System.Convert.ToInt32(Status.ChildNodes[1].InnerText);
    tw_status.text = Status.ChildNodes[2].InnerText;
    //Get the user sub-section
    XmlNode User = Status.ChildNodes[3];
    tw_status.user_id = User.ChildNodes[0].InnerText;
    tw_status.name = User.ChildNodes[1].InnerText;
    tw_status.screen_name = User.ChildNodes[2].InnerText;
    tw_status.location = User.ChildNodes[3].InnerText;
    tw_status.description = User.ChildNodes[4].InnerText;
    tw_status.profile_image_url = User.ChildNodes[5].InnerText;
    tw_status.twiter_url = User.ChildNodes[6].InnerText;
    twitter_chunk.Add(tw_status);
}
//Sort the arraylist to get the last id
//(ArrayList.Sort() with no arguments requires Twiter_Status to implement IComparable)
twitter_chunk.Sort();
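The loop above reads every field by position, so it depends on Twitter never reordering or adding elements. As a rough sketch of an alternative, the same parsing can look elements up by name instead. ParsedStatus and StatusParser are hypothetical names made up for this illustration (the real Twiter_Status class isn't shown here), and the date format string is my assumption about how created_at is laid out, based on the sample timestamp.

```csharp
using System;
using System.Collections;
using System.Globalization;
using System.Xml;

// Hypothetical stand-in for Twiter_Status; only a few fields are shown.
public class ParsedStatus : IComparable
{
    public long Id;
    public DateTimeOffset CreatedAt;
    public string Text;
    public string ScreenName;

    // ArrayList.Sort() with no arguments relies on IComparable; order by status id.
    public int CompareTo(object obj)
    {
        ParsedStatus other = (ParsedStatus)obj;
        return Id.CompareTo(other.Id);
    }
}

public static class StatusParser
{
    // Assumed layout of created_at, e.g. "Fri May 04 19:39:21 +0000 2007".
    private const string CreatedAtFormat = "ddd MMM dd HH:mm:ss zzz yyyy";

    public static ArrayList Parse(XmlDocument doc)
    {
        ArrayList chunk = new ArrayList();
        foreach (XmlNode status in doc.SelectNodes("/statuses/status"))
        {
            ParsedStatus s = new ParsedStatus();
            // Look up children by element name rather than by index.
            s.CreatedAt = DateTimeOffset.ParseExact(
                status.SelectSingleNode("created_at").InnerText.Trim(),
                CreatedAtFormat, CultureInfo.InvariantCulture);
            s.Id = long.Parse(status.SelectSingleNode("id").InnerText,
                CultureInfo.InvariantCulture);
            s.Text = status.SelectSingleNode("text").InnerText;
            s.ScreenName = status.SelectSingleNode("user/screen_name").InnerText;
            chunk.Add(s);
        }
        // Ascending by id, so the last element carries the highest id.
        chunk.Sort();
        return chunk;
    }
}
```

Storing the id as a long rather than an Int32 just leaves headroom as the ids grow, and keeping the parsed DateTimeOffset preserves the UTC offset that the string-splitting approach throws away.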
So, once I have the twenty Twitter messages loaded up in my twitter_chunk ArrayList object, they can be loaded into the TwitterChunk request attribute, which is then fired into the fabric via the fabric Submit API.
try
{
    request = new FabricRequest("Twitter_App", "CS_Process");
    request["TwitterChunk"] = twitter_chunk;
    fabric.Submit(request);
    inflight++;
}
catch (Exception ex)
{
    Console.WriteLine("Error on Submit:" + ex.Message);
}
From the perspective of getting the messages out of Twitter and into the fabric, that's all there is. From my initial testing, a single dual-core box should have little trouble extracting the hundreds of Twitter messages that arrive each minute. What I've also learned during my testing is that I really need the scalability of the fabric to transform and clean up the data. I found out very quickly that Twitter is an international service that has gained a lot of popularity in the Far East. While a message like the one below means something to someone who can read double-byte character sets, SQL Server REALLY doesn't like data formatted like this:
<status>
  <created_at>Fri May 04 19:39:21 +0000 2007</created_at>
  <id>950117512</id>
  <text>@eichi 装置つ&#12369;てるので2&#24180;くらい、補&#12390;いあ ;わせた&#12425;5年くらい&#12363;もしれませ&#12435;。</text>
  <user>
    <id>953213</id>
    <name> @adsfadsf </name>
    <screen_name>adsfasdf</screen_name>
    <location>Tokyo, Japan</location>
    <description>歯列矯正(口中エ 521;ステウト)中ウェッ&#12502;デザイナ 2540;&#12290;歯列 {vertical-align: baseline;} 口 {position: relative;} 顎 {width: 15em;} 上 顎 {z-index: 1;} </description>
    <profile_image_url>http://assets2.twitter.com/system/user/profile_image/953213/normal/180.jpg?1176830312</profile_image_url>
    <url>http://asdfasdf/blog/</url>
    <protected>false</protected>
  </user>
</status>
So the fabric has plenty to do: making sense of data like this, and downloading and caching any of the related information we want to use!
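To give a feel for what that cleanup might involve, here is a minimal sketch of one possible step. It assumes the numeric character references (the &#12369; style sequences) reach the transform code as literal text; XmlDocument normally resolves them on its own, in which case the decode call changes nothing. StatusTextLoader, the dbo.TwitterStatus table, and its column names are all hypothetical, invented for this example. The point is only that Unicode text wants an NVARCHAR parameter rather than VARCHAR.

```csharp
using System;
using System.Data;
using System.Data.SqlClient;
using System.Net;

// Hypothetical helper; not part of the fabric application shown in this post.
public static class StatusTextLoader
{
    // Turn references such as "&#12369;" into the characters they stand for.
    public static string DecodeText(string raw)
    {
        return WebUtility.HtmlDecode(raw);
    }

    // Build a parameterized INSERT. The table and column names are assumptions.
    // The caller still has to assign a SqlConnection and call ExecuteNonQuery().
    public static SqlCommand BuildInsert(long id, string rawText)
    {
        SqlCommand cmd = new SqlCommand(
            "INSERT INTO dbo.TwitterStatus (id, status_text) VALUES (@id, @text)");
        cmd.Parameters.Add("@id", SqlDbType.BigInt).Value = id;
        // NVarChar keeps the Japanese text intact; a VarChar column or parameter
        // can only hold what the database code page is able to represent.
        cmd.Parameters.Add("@text", SqlDbType.NVarChar, 4000).Value = DecodeText(rawText);
        return cmd;
    }
}
```

Passing the text as a parameter also sidesteps quoting problems with whatever punctuation ends up in a status message.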
At this point you can really think of this application as a massively scalable Twitter Extract, Transform, Load (ETL) process. I'll drill into the transform and load parts of this application next time!
Until next time….
Mark







