In this assignment you will download some of the mailing list data from http://mbox.dr-chuck.net/, run the data cleaning and modeling process, and take some screen shots. You will then run two visualizations of the email data you have retrieved and processed. The first is a word cloud that visualizes the frequency distribution of words. While a word cloud might seem a little silly and over-used, it is actually a very engaging way to visualize a frequency distribution or histogram, and it is a nice continuation of the frequency/counting assignments we have been doing in this class. The second is a timeline that shows how the data changes over time. You are provided the base code for the two visualizations but will need to edit it to improve the data output. Finally, you will need to create your own visualization using the spidered data.
Here is a copy of the Sakai Developer Mailing list from 2006-2014.
http://mbox.dr-chuck.net/
The base programs (gmane.py, gmodel.py, gword.py and gline.py) are found in gmane.zip, which also includes sample generated gword.js (with gword.htm) and gline.js (with gline.htm) files.
You can install the SQLite browser (http://sqlitebrowser.org/) if you would like to view and modify the databases used for this assignment.
Project Structure
gmane.py
The gmane.py file is provided for you.  It operates as a spider in that it runs slowly and retrieves one mail message per second so as to avoid getting throttled. It stores all of its data in a database and can be interrupted and re-started as often as needed. It may take many hours to pull all the data down. So you may need to restart several times. You should download and process at least 1000 messages for the data visualizations to work – but more data is always better.
The base URL (http://mbox.dr-chuck.net/) is hard-coded in gmane.py. Make sure to delete the content.sqlite file if you switch the base URL.
Navigate to the folder where you extracted gmane.zip.
Here is the start of a sample run of gmane.py retrieving messages from the Sakai developer list:
python gmane.py
How many messages:10
http://mbox.dr-chuck.net/sakai.devel/5/6 9443
john@caret.cam.ac.uk 2005-12-09T13:32:29+00:00 re: lms/vle rants/comments
http://mbox.dr-chuck.net/sakai.devel/6/7 3586
s-githens@northwestern.edu 2005-12-09T13:32:31-06:00 re: sakaiportallogin and presense
http://mbox.dr-chuck.net/sakai.devel/7/8 10600
john@caret.cam.ac.uk 2005-12-09T13:42:24+00:00 re: lms/vle rants/comments
The program scans content.sqlite from 1 up to the first message number not already spidered and starts spidering at that message. It continues spidering until it has spidered the desired number of messages or it reaches a page that does not appear to be a properly formatted message.
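The resume step described above can be sketched as follows. This is a minimal illustration, not the code in gmane.py. It assumes content.sqlite holds a table named Messages with an integer id column; both names are assumptions made for this example, so check the real schema in the SQLite browser.

```python
import sqlite3

def first_unspidered_id(conn):
    """Return the lowest message number, counting up from 1, that has no row yet."""
    expected = 1
    # Assumed schema: a Messages table with an integer id column.
    for (msg_id,) in conn.execute("SELECT id FROM Messages ORDER BY id"):
        if msg_id != expected:
            break  # found a gap, so spidering would resume here
        expected += 1
    return expected

# Demonstration on an in-memory database rather than the real content.sqlite.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Messages (id INTEGER UNIQUE, subject TEXT)")
conn.executemany("INSERT INTO Messages (id) VALUES (?)", [(1,), (2,), (3,), (5,)])
print(first_unspidered_id(conn))  # 4
```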
Sometimes a message is missing. Perhaps administrators can delete messages or perhaps they get lost; I don’t know. If your spider stops and it seems to have hit a missing message, open content.sqlite in the SQLite browser and add a row with the missing id, leaving all the other fields blank, and then restart gmane.py. This will unstick the spidering process and allow it to continue. These empty messages will be ignored in the next phase of the process.
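If you prefer to add the placeholder row from Python rather than by hand, a sketch along these lines should work. The table name Messages and its columns are assumptions made for this example, and the id 42 is only a stand-in for whichever message number is missing in your run.

```python
import sqlite3

def add_placeholder(conn, missing_id):
    """Insert a row holding only the missing id; every other column stays NULL."""
    # Assumed schema: Messages(id INTEGER UNIQUE, ...). OR IGNORE makes reruns harmless.
    conn.execute("INSERT OR IGNORE INTO Messages (id) VALUES (?)", (missing_id,))
    conn.commit()

# Demonstration on an in-memory database; connect to "content.sqlite" for the real file.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Messages (id INTEGER UNIQUE, email TEXT, subject TEXT)")
add_placeholder(conn, 42)
print(conn.execute("SELECT id, email, subject FROM Messages").fetchall())
# [(42, None, None)]
```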
One nice thing is that once you have spidered all of the messages and have them in content.sqlite, you can run gmane.py again to get new messages as they get sent to the list. gmane.py will quickly scan to the end of the already-spidered pages and check if there are new messages and then quickly retrieve those messages and add them to content.sqlite.
The content.sqlite data is pretty raw, with an inefficient data model, and not compressed. This is intentional as it allows you to look at content.sqlite to debug the process. It would be a bad idea to run any queries against this database as they would be slow.
gmodel.py
The second process is running the program gmodel.py. gmodel.py reads the rough/raw data from content.sqlite and produces a cleaned-up and well-modeled version of the data in the file index.sqlite. The file index.sqlite will be much smaller (often 10X smaller) than content.sqlite because it also compresses the header and body text.
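To get a feel for why compressing header and body text shrinks the database, here is a small sketch using Python's zlib module. The text above does not say which compression scheme gmodel.py uses, so treat zlib and the helper names pack and unpack as assumptions made for illustration.

```python
import zlib

def pack(text):
    """Compress a string into bytes suitable for storing in a BLOB column."""
    return zlib.compress(text.encode("utf-8"))

def unpack(blob):
    """Reverse pack(): decompress the bytes and decode them back to a string."""
    return zlib.decompress(blob).decode("utf-8")

# Email bodies repeat a lot of text (quoted replies, signatures), so they compress well.
body = "Re: sakai build question\n" * 200
packed = pack(body)
print(len(body), len(packed))    # original size vs. compressed size
print(unpack(packed) == body)    # True
```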
Each time gmodel.py runs, it completely wipes out and rebuilds index.sqlite, allowing you to adjust its parameters and edit the mapping tables in content.sqlite to tweak the data cleaning process.
Running gmodel.py works as follows:
python gmodel.py
Loaded allsenders 1588 and mapping 28 dns mapping 1
1 2005-12-08T23:34:30-06:00 ggolden22@mac.com
251 2005-12-22T10:03:20-08:00 tpamsler@ucdavis.edu
501 2006-01-12T11:17:34-05:00 lance@indiana.edu
751 2006-01-24T11:13:28-08:00 vrajgopalan@ucmerced.edu

The gmodel.py program performs a number of data cleaning steps.
Domain names are truncated to two levels for .com, .org, .edu, and .net; other domain names are truncated to three levels. So si.umich.edu becomes umich.edu and caret.cam.ac.uk becomes cam.ac.uk. Also, email addresses are forced to lower case, and some of the @gmane.org addresses like the following
arwhyte-63aXycvo3TyHXe+LvDLADg@public.gmane.org
are converted to the real address whenever there is a matching real email address elsewhere in the message corpus.
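The domain truncation and lower-casing rules can be sketched as below. This is an illustration of the rules as stated, not the code from gmodel.py; the function names clean_domain and clean_email are made up for this example, and the @gmane.org address mapping is left out.

```python
TWO_LEVEL_TLDS = {"com", "org", "edu", "net"}

def clean_domain(domain):
    """Keep two levels for .com/.org/.edu/.net and three levels otherwise."""
    pieces = domain.lower().split(".")
    keep = 2 if pieces[-1] in TWO_LEVEL_TLDS else 3
    return ".".join(pieces[-keep:])

def clean_email(email):
    """Lower-case an address and truncate its domain with clean_domain()."""
    email = email.strip().lower()
    if "@" not in email:
        return email  # not a recognizable address; leave it alone
    user, _, domain = email.rpartition("@")
    return user + "@" + clean_domain(domain)

print(clean_domain("si.umich.edu"))        # umich.edu
print(clean_domain("caret.cam.ac.uk"))     # cam.ac.uk
print(clean_email("John@Caret.Cam.AC.UK")) # john@cam.ac.uk
```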
When you are done, you will have a nicely indexed version of the email in index.sqlite. This is the file to use to do data analysis. With this file, data analysis will be really quick.
gbasic.py
The first and simplest data analysis answers two questions: “who sent the most mail?” and “which organization sent the most mail?” This is done using gbasic.py:
python gbasic.py
How many to dump? 5
Loaded messages= 51330 subjects= 25033 senders= 1584
Top 5 Email list participants
steve.swinsburg@gmail.com 2657
azeckoski@unicon.net 1742
ieb@tfd.co.uk 1591
csev@umich.edu 1304
david.horwitz@uct.ac.za 1184
Top 5 Email list organizations
gmail.com 7339
umich.edu 6243
uct.ac.za 2451
indiana.edu 2258
unicon.net 2055
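The counting behind this kind of report can be sketched with collections.Counter. This is not the code in gbasic.py, which reads from index.sqlite; the function name top_counts and the sample addresses are made up for illustration.

```python
from collections import Counter

def top_counts(senders, how_many):
    """Return the most frequent senders and the most frequent organizations."""
    by_sender = Counter(senders)
    # The organization is taken to be everything after the "@".
    by_org = Counter(sender.split("@")[-1] for sender in senders)
    return by_sender.most_common(how_many), by_org.most_common(how_many)

senders = [
    "alice@example.edu", "bob@example.edu", "alice@example.edu",
    "carol@example.com", "bob@example.edu", "alice@example.edu",
]
people, orgs = top_counts(senders, 2)
print(people)  # [('alice@example.edu', 3), ('bob@example.edu', 2)]
print(orgs)    # [('example.edu', 5), ('example.com', 1)]
```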
gword.py
There is a simple visualization of the word frequency in the subject lines in the file gword.py:
python gword.py
Range of counts: 33229 129
Output written to gword.js
This produces the file gword.js, which has the top 100 words found in the emails. You can view them in a word cloud using the file gword.htm. Once you get gword.py to work, you will need to enhance the program to filter the output as follows:

The output should only contain words made up of letters that are 4 letters or longer (no numbers).
The output should remove common words (stop words).
The output should remove sakai and email, which are common words in the output that are not meaningful for the word cloud.
The output should use content from the subjects of the emails.

The filters should be added in place of the line “words = text.split(" ")” in the sample program.
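One way the filters listed above might look is sketched here. The function name filter_words is hypothetical, and the stop word list is a tiny placeholder chosen for this example; a real solution would use a fuller list.

```python
import string

# Placeholder stop word list for illustration only; extend it for real use.
STOP_WORDS = {"with", "from", "that", "this", "have", "will", "when", "what"}
# Words named in the assignment as not meaningful for the word cloud.
EXCLUDED_WORDS = {"sakai", "email"}

def filter_words(text):
    """Split text into words and keep only those that pass every filter."""
    kept = []
    for raw in text.split():
        word = raw.strip(string.punctuation).lower()
        if len(word) < 4:
            continue  # too short
        if not word.isalpha():
            continue  # contains digits or other non-letter characters
        if word in STOP_WORDS or word in EXCLUDED_WORDS:
            continue
        kept.append(word)
    return kept

print(filter_words("Re: Sakai 2.5 release with email notification"))
# ['release', 'notification']
```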
gline.py
A second visualization is in gline.py. It visualizes email participation by organizations over time.
python gline.py
Loaded messages= 51330 subjects= 25033 senders= 1584
Top 10 Oranizations
['gmail.com', 'umich.edu', 'uct.ac.za', 'indiana.edu', 'unicon.net', 'tfd.co.uk', 'berkeley.edu', 'longsight.com', 'stanford.edu', 'ox.ac.uk']
Output written to gline.js
Its output is written to gline.js which is visualized using gline.htm.
Change the gline.py program to show the message count by month instead of by year. You can switch from a by-year to a by-month visualization by changing only a few lines in gline.py (gline.js is regenerated each time gline.py runs). The puzzle is to figure out the smallest change that accomplishes this.
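As a hint about the general idea, and not the actual code in gline.py, the timestamps shown in the sample runs start with the year and month (for example 2005-12-09T13:32:29+00:00), so the grouping period can be chosen by how many leading characters are used as the key. The function name counts_by_period and the sample rows below are made up for illustration.

```python
from collections import Counter

def counts_by_period(rows, key_length):
    """Count messages per (period, organization).

    key_length=4 groups by year ("2005"); key_length=7 groups by month ("2005-12").
    """
    counts = Counter()
    for sent_at, org in rows:
        counts[(sent_at[:key_length], org)] += 1
    return counts

rows = [
    ("2005-12-09T13:32:29+00:00", "cam.ac.uk"),
    ("2005-12-22T10:03:20-08:00", "ucdavis.edu"),
    ("2006-01-12T11:17:34-05:00", "indiana.edu"),
    ("2005-12-09T13:42:24+00:00", "cam.ac.uk"),
]
by_year = counts_by_period(rows, 4)
by_month = counts_by_period(rows, 7)
print(by_year[("2005", "cam.ac.uk")])            # 2
print(sorted({period for period, _ in by_month}))  # ['2005-12', '2006-01']
```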
Your Own Visualization:
Once you have gotten the visualization to work for gword and gline you should create one other visualization to display the data in a different way. When creating your own visualization:

It must output data that is different (at least slightly) from that used in gword and gline.
It must use a different chart type than used in gword and gline.

You can create a Bubble chart.  This chart can be used as an alternative to the word cloud. Instead of JSON data this uses csv data. A sample bubble chart is shown in sampleBubble.htm using the csv data in flare.csv (in the zip file for the assignment).  If you were to choose the bubble chart you would need to:

Create a new Python file (gbubble.py) that is like your final gword.py except that the output is different
Change the output to the csv format seen in flare.csv
Output data for all words with a count of more than 10 (or 50 if you downloaded lots of the data)
Output actual count data instead of the scaled font size for the word cloud.
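The output step of such a program might be sketched as follows. The header row "id,value" is an assumption about the layout of flare.csv, so compare it against the file in the zip before relying on it; the function name write_bubble_csv and the sample counts are also made up for illustration.

```python
import csv
import io

def write_bubble_csv(counts, out, min_count=10):
    """Write word,count rows for every word whose count exceeds min_count."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["id", "value"])  # assumed header; check flare.csv
    kept = [(word, count) for word, count in counts.items() if count > min_count]
    # Largest counts first, using the actual counts rather than scaled font sizes.
    for word, count in sorted(kept, key=lambda item: item[1], reverse=True):
        writer.writerow([word, count])

counts = {"release": 120, "build": 45, "typo": 3}
buffer = io.StringIO()
write_bubble_csv(counts, buffer)
print(buffer.getvalue())
```

To write a real file instead of an in-memory buffer, pass open("gbubble.csv", "w", newline="") as the out argument.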

You could also choose another visualization to use with your data. d3 supports a wide variety of visualizations that you can use with your data: https://github.com/d3/d3/wiki/Gallery
Some other URLs for other visualization ideas:
https://developers.google.com/chart/
https://developers.google.com/chart/interactive/docs/gallery/motionchart
https://code.google.com/apis/ajax/playground/?type=visualization#motion_chart_time_formats
https://developers.google.com/chart/interactive/docs/gallery/annotatedtimeline
http://bost.ocks.org/mike/uberdata/
http://nltk.org/install.html
Submitting Your Work
Please Upload Your Submission:

A screen shot of you running the gmane.py application to produce the content.sqlite database.
A screen shot of you running the gmodel.py application to produce the index.sqlite database.
A screen shot of you running the gbasic.py program to compute basic histogram data on the messages you have retrieved.
A screen shot of the word cloud visualization for the messages you have retrieved, before you applied the filters.
A screen shot of the word cloud visualization for the messages you have retrieved, after you have applied the appropriate filters.
A screen shot of the timeline visualization for the messages you have retrieved, by year.
A screen shot of the by-month visualization for the messages.
A screen shot of the new visualization.
A zip file containing all of the py, js, csv, htm and sqlite files you used as a part of the assignment.

Rubric

Criteria (Pts)

A screen shot of you running the gmane.py application to produce the content.sqlite database. (5.0 pts)
A screen shot of you running the gmodel.py application to produce the index.sqlite database. (5.0 pts)
A screen shot of you running the gbasic.py program to compute basic histogram data on the messages you have retrieved. (5.0 pts)
A screen shot of the word cloud visualization for the messages you have retrieved, before you applied the filters. (5.0 pts)
gword.py edited to apply appropriate filters and to get data from the subject of the emails. (20.0 pts)
A screen shot of the word cloud visualization for the messages you have retrieved, after you have applied the appropriate filters. (10.0 pts)
A screen shot of the timeline visualization for the messages you have retrieved, by year. (5.0 pts)
gline.py edited to output data by month and year. (15.0 pts)
A screen shot of the by-month visualization for the messages. (10.0 pts)
A new .py file (e.g. gbubble.py) containing code to output the necessary data for another visualization. (10.0 pts)
A new data file (e.g. gbubble.csv) containing data for the new visualization. (5.0 pts)
A screen shot of the new visualization. (5.0 pts)

Total Points: 100.0

 
 
