Fabric Twitter ETL

by marksu on May 16, 2007

in Editorial

Per my last post, I’ve been architecting a fabric application that consumes Twitter.com messages (timelines, in the Twitter API vernacular), stores them in SQL Server, and generates some usage reports that I want to post graphically on the web.

The implementation of the fabric client that consumes the REST messages from Twitter.com was straightforward. Loading the Twitter public user timeline with xmlDoc.Load works well, as does parsing the last twenty messages out of the resulting xmlDoc.

// xmlDoc, tw_status and twitter_chunk (an ArrayList) are declared earlier.
xmlDoc.Load(url);

XmlNodeList Statuses = xmlDoc.SelectNodes("/statuses/status");

foreach (XmlNode Status in Statuses)
{
    tw_status = new Twiter_Status();

    // created_at looks like "Fri May 04 19:39:21 +0000 2007";
    // keep month, day, time and year, dropping the weekday and UTC offset.
    String[] date_parts = Status.ChildNodes[0].InnerText.Split(new char[1] { ' ' });
    tw_status.created_at = date_parts[1] + " " + date_parts[2] + " " + date_parts[3] + " " + date_parts[5];
    tw_status.id = System.Convert.ToInt32(Status.ChildNodes[1].InnerText);
    tw_status.text = Status.ChildNodes[2].InnerText;

    // Get the user sub-section
    tw_status.user_id = Status.ChildNodes[3].ChildNodes[0].InnerText;
    tw_status.name = Status.ChildNodes[3].ChildNodes[1].InnerText;
    tw_status.screen_name = Status.ChildNodes[3].ChildNodes[2].InnerText;
    tw_status.location = Status.ChildNodes[3].ChildNodes[3].InnerText;
    tw_status.description = Status.ChildNodes[3].ChildNodes[4].InnerText;
    tw_status.profile_image_url = Status.ChildNodes[3].ChildNodes[5].InnerText;
    tw_status.twiter_url = Status.ChildNodes[3].ChildNodes[6].InnerText;
    twitter_chunk.Add(tw_status);
}

// Sort the ArrayList to get the last id
// (ArrayList.Sort() relies on Twiter_Status implementing IComparable)
twitter_chunk.Sort();
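
The positional lookups above (ChildNodes[0], ChildNodes[3].ChildNodes[2] and so on) depend on Twitter never reordering its elements. A minimal sketch of a more defensive variant, which reads each field by element name and sorts by id, might look like the following. StatusRow and TimelineParser are hypothetical names for illustration and are not part of the application above; the second status in the sample XML is made up, and the date format assumes the feed always reports UTC as +0000, as in the sample in this post.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Xml;

// Hypothetical stand-in for the Twiter_Status class used above.
public class StatusRow : IComparable<StatusRow>
{
    public long Id;
    public DateTime CreatedAt;
    public string Text;
    public string ScreenName;

    // Order by id so that Sort() leaves the highest (latest) id last.
    public int CompareTo(StatusRow other)
    {
        return Id.CompareTo(other.Id);
    }
}

public static class TimelineParser
{
    // Matches e.g. "Fri May 04 19:39:21 +0000 2007".
    // Assumption: the offset is always the literal "+0000".
    const string TwitterDateFormat = "ddd MMM dd HH:mm:ss '+0000' yyyy";

    // Read a child element by name; return "" if it is missing.
    static string Field(XmlNode node, string xpath)
    {
        XmlNode child = node.SelectSingleNode(xpath);
        return child == null ? "" : child.InnerText;
    }

    public static List<StatusRow> Parse(string xml)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(xml);

        List<StatusRow> rows = new List<StatusRow>();
        foreach (XmlNode status in doc.SelectNodes("/statuses/status"))
        {
            StatusRow row = new StatusRow();
            row.Id = long.Parse(Field(status, "id"), CultureInfo.InvariantCulture);
            row.CreatedAt = DateTime.ParseExact(
                Field(status, "created_at"),
                TwitterDateFormat,
                CultureInfo.InvariantCulture,
                DateTimeStyles.AssumeUniversal | DateTimeStyles.AdjustToUniversal);
            row.Text = Field(status, "text");
            row.ScreenName = Field(status, "user/screen_name");
            rows.Add(row);
        }

        rows.Sort(); // uses StatusRow.CompareTo
        return rows;
    }

    public static void Main()
    {
        // First status is trimmed from the sample in this post; the second is invented.
        string xml =
            "<statuses>" +
            "<status><created_at>Fri May 04 19:39:21 +0000 2007</created_at>" +
            "<id>950117512</id><text>hello</text>" +
            "<user><screen_name>adsfasdf</screen_name></user></status>" +
            "<status><created_at>Fri May 04 19:39:02 +0000 2007</created_at>" +
            "<id>950117400</id><text>earlier</text>" +
            "<user><screen_name>example</screen_name></user></status>" +
            "</statuses>";

        List<StatusRow> rows = Parse(xml);
        Console.WriteLine("Last id: " + rows[rows.Count - 1].Id);
    }
}
```

Looking fields up by name costs an XPath evaluation per field, but a new or reordered element in the feed no longer shifts every index.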

So, once the twenty Twitter messages are loaded into my twitter_chunk ArrayList, it can be assigned to the TwitterChunk request attribute and fired into the fabric via the fabric Submit API.

try
{
    request = new FabricRequest("Twitter_App", "CS_Process");
    request["TwitterChunk"] = twitter_chunk;
    fabric.Submit(request);
    inflight++;
}
catch (Exception ex)
{
    Console.WriteLine("Error on Submit:" + ex.Message);
}

From the perspective of getting the messages out of Twitter and into the fabric, that’s all there is. From my initial testing, a single dual-core box shouldn’t have much trouble extracting the hundreds of messages per minute that Twitter produces. What I’ve also learned during testing is that I really need the scalability of the fabric to transform and clean up the data. I found out very quickly that Twitter is an international service that has gained a lot of popularity in the Far East. While a message like the one below means something to someone who can read double-byte character sets, SQL Server REALLY doesn’t like data formatted like this:

<status>
  <created_at>Fri May 04 19:39:21 +0000 2007</created_at>
  <id>950117512</id>
  <text>@eichi &#35013;&#32622;&#12388;&#12369;&#12390;&#12427;&#12398;&#12391;&#65298;&#24180;&#12367;&#12425;&#12356;&#12289;&#35036;&#12390;&#12356;&#12354;&#12431;&#12379;&#12383;&#12425;&#65301;&#24180;&#12367;&#12425;&#12356;&#12363;&#12418;&#12375;&#12428;&#12414;&#12379;&#12435;&#12290;</text>
  <user>
    <id>953213</id>
    <name> @adsfadsf </name>
    <screen_name>adsfasdf</screen_name>
    <location>Tokyo, Japan</location>
    <description>&#27503;&#21015;&#30703;&#27491;(&#21475;&#20013;&#12456;&#12521;&#12473;&#12486;&#12454;&#12488;)&#20013;&#12454;&#12455;&#12483;&#12502;&#12487;&#12470;&#12452;&#12490;&#12540;&#12290;&#27503;&#21015; {vertical-align: baseline;} &#21475; {position: relative;} &#38990; {width: 15em;} &#19978; &#38990; {z-index: 1;} </description>
    <profile_image_url>http://assets2.twitter.com/system/user/profile_image/953213/normal/180.jpg?1176830312</profile_image_url>
    <url>http://asdfasdf/blog/</url>
    <protected>false</protected>
  </user>
</status>
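
The &#NNNNN; runs in this sample are XML numeric character references, each naming one Unicode code point. As a minimal sketch (not the fabric's actual transform step), an XML parser resolves them to ordinary Unicode text by the time InnerText is read. Storing that text would then call for a Unicode column type such as NVARCHAR, which is an assumption about the schema rather than something shown in this post. CharRefDemo and DecodeText are hypothetical names.

```csharp
using System;
using System.Xml;

public static class CharRefDemo
{
    // Returns the <text> content of a single <status> element,
    // with numeric character references already resolved by the parser.
    public static string DecodeText(string statusXml)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(statusXml);
        return doc.SelectSingleNode("/status/text").InnerText;
    }

    public static void Main()
    {
        // The first two references from the sample message in this post.
        string xml = "<status><text>@eichi &#35013;&#32622;</text></status>";
        string text = DecodeText(xml);

        // "@eichi " is seven characters, followed by one char per reference.
        Console.WriteLine(text.Length);
        // Code point of the first decoded character, in hex.
        Console.WriteLine(((int)text[7]).ToString("X4"));
    }
}
```

Seen this way, the markup itself is not malformed; the cleanup work is in carrying the decoded text through to the database without squeezing it into a non-Unicode column.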

So the fabric has plenty to do trying to make sense of this and downloading and caching any of the relevant information we want to use!

At this point you can really think of this application as a massively scalable Twitter Extract, Transform, Load (ETL) pipeline. I’ll drill into the transform and load parts of the application next time!

 

Until next time…

 

Mark

 
