Skip to content

KuangSandBox

Joe Hellerstein edited this page Dec 8, 2015 · 1 revision

Assignment 4

Notices and corrections

  1. Do this assignment as PARTNERS!
  2. PLEASE READ INSTRUCTIONS ENTIRELY before starting.

Overview

Hello gumshoes! Justice needs your help! The year is 2001 and the Enron Scandal is in full effect. The authorities have taken over Enron emails to find out: who are the Enron kingpins?. They have started by implementing a basic database-driven web-application for search and visualization: EnronSearch.

In this assignment, you will help authorities find the Enron kingpins by scaling up EnronSearch to handle the large email dataset, and rendering the search results in a meaningful way. You will exercise your knowledge of web/boolean search, database tuning and SQL. You will need to write code in Ruby using the Rails framework, against a PostgreSQL database.

===== What's provided ===== We provide a working Ruby on Rails application providing search and visualization over the Enron emails. It includes:

  • Two sets (large and small) of Enron emails populating a PostgreSQL table called Links:

    links(id:integer, to:string, from:string, body:text)

  • A Rails ActiveRecord that wraps the table, and provides access to its tuples as Ruby objects

  • Using the Link ActiveRecord model and find_by_sql API, we provide a naive boolean search over the emails using the SQL like construct. This approach will not scale to the full size of the Enron emails -- which you'll fix!

  • A PostgreSQL table/ActiveRecord for the inverted index table:

    terms(id:integer, term:string, email_id:integer, pos:integer)

  • A basic visualization of search results. Example: Image(http://cs186berkeley.net/export/50/code/images/graph-ThuApr2312054407002009.png)

You can see an instance of the application here: [http://sphere.cs.berkeley.edu:3000]

===== Objective =====

Your task is the create a search engine that scales to search and visualize the Enron emails using an inverted index. There are many existing text search libraries out there (Lucene (Java), Ferret/Sphinx/Xapian/Solr ((Ruby)), tsearch (PostgreSQL)). We will not use any of these packages, and will instead create a simple one ourselves.

Summary of required tasks:

  1. Populate the inverted index: fill the Terms table from the Links table
  2. Implement boolean keyword search using the Terms table/ActiveRecord. We provide an example search function using the Links table/ActiveRecord.
  3. Determine which index(es) you would add to the Terms table to optimize its performance for your boolean search query
  4. Improve the search results visualization

Optional tasks:

  1. Order the search results in a relevant way. Google has PageRank -- what will you do for EnronSearch? Note that the Terms table has a pos field.

Tools for the assignment

Your course account has been set up with access to PostgreSQL 8.2.4 and all necessary Ruby/Rails binaries. We use Ruby 1.8.6 and Rails 2.1.2. In addition we use the will_paginate and ruby-graphviz gems. These were compiled for Solaris, so you must run on one of the following machines:

  • rhombus
  • pentagon
  • cube
  • sphere
  • po
  • torus If you have modified your PostgreSQL environment variables in your course account (e.g. in ~/.cshrc) you MUST remove those modifications or you may find that PostgreSQL will not work.

You can develop on your own machine if you like. You'll need to install the Ruby/Rails stack, gems and GraphViz. We don't support custom configurations and provide no additional details; be aware that we will test your queries on the version of PostgreSQL / Rails used by the class accounts, and you will not get credit unless it runs in that version. So it's wise to double-check your code on the class accounts before turning in.

Setting up the software stack

First, clear your previous PostgreSQL setup from ~ directory: the src checkout, pgsql and pgdata. Then set up the database.

> /bin/rm -rf $PGDATA/
> initdb
> pg_ctl -l logfile start
> createdb
> psql # just to make sure it works

Check out the homework, set up Rails,

> cd ~
> svn co file:///home/ff/cs186/sp09/hw4 /home/tmp/$USER/
> ln -s /home/tmp/$USER/hw4
> cd ~/hw4/enron

Create the configuration file ~/hw4/enron/config/database.yml by copying ~/hw4/enron/config/database.yml.example

cp ~/hw4/enron/config/database.yml.example ~/hw4/enron/config/database.yml

Edit database.yml and put in your user account:

cs186-xx

Load data

> cd ~/hw4
> psql -d enron_development < ~/hw4/data/enron-small.sql # start with the small dataset for testing
> cd hw4/enron
> rake db:create
> rake db:migrate

To start Rails, just type:

> server # this is aliased to "script/server webrick -p $WEBPORT" for you

Test that the web app is alive by visiting http://:/, you should see "Hello Enron!"

Check out the NaiveSearch

The main logic for NaiveSearch is in enron/app/controllers/links_controller.rb::nsearch

You can try out the life-cycle of a search as follows:

  1. Visit http://:/nsearch, type in search terms, for example: 'skillings lay' to search for the two kingpins
  2. The routing policy in enron/config/routes.rb will route the request to the links_controller mentioned above and the function nsearch
  3. Search terms are passed via HTTP POST in the params object. The nsearch function parses the query terms and generates a SQL statement to directly search for answers in the links table via the Link ~ActiveRecord object.

Given the search terms of "davaid foster wallace", the generated SQL looks like this:

SELECT * FROM "links" 
WHERE ((lower(links.body) like E'%david%' or lower(links.from) like E'%david%' or lower(links.to) like E'%david%') 
and (lower(links.body) like E'%foster%' or lower(links.from) like E'%foster%' or lower(links.to) like E'%foster%') 
and (lower(links.body) like E'%wallace%' or lower(links.from) like E'%wallace%' or lower(links.to) like E'%wallace%')) 
LIMIT 30 OFFSET 0
  1. Matching emails are returned by the nsearch to the search results portion of the webpage.
  2. Emails are also fed to the build_graph function in enron/app/controllers/application.rb, which generates an image file in enron/public/graph/graph-.png
  3. This image location is accessible over the web, since it is in the public folder. The image is included in the results

Implementing EnronSearch

===== Your tasks =====

  1. Populate the inverted index from the links table. First go to http://:/update_index -- you'll find that this routes to the action enron/app/controllers/terms_controller.rb::update_index. In this method, you will see stub code and comments for you to fill in the logic to populate the inverted index. We provide a function for you to parse terms out of an email -- this way we know what terms you end up with.

  2. Add the appropriate index to the inverted index table. According to boolean search, we will be doing many self-joins of the term table to intersect email_id lists. Specify the DDL statement for CREATE INDEX ... in db/terms_index.sql. Run this sql file to install the index. The file is provided for you on SVN checkout, place your answer there. Note: We expect answers will vary widely. If you feel like your answer requires explanation, write a sentence or two as -- comments in the .sql file you turn-in. We'll look at it on-demand.

  3. Implement http://:/search. The route is already set up, routing HTTP requests to enron/app/controllers/terms_controller.rb::search. Note that this is different from nsearch and lives in a different controller. In this method, you will see stub code and comments indicating where to implement the search function. You are not limited to changing terms_controller.rb. Here, you'll probably want to use the find_by_sql interface, and interact with both Term and Link models.

  4. Modify the build_graph function to create a simple visualization that best captures the answer to the question who are the Enron kingpins? given a particular search result set. We are looking for visual effectiveness and scalability. Try to search for a term like "enron" -- how will your visualization handle so many results? Be creative!

  5. When ready, load in the large dataset. How well do you scale?

    psql -d enron_development < ~/hw4/data/enron-large.sql

===== Getting started tips =====

  • If you're experienced with Rails, dive right in. If not, you'll be up to speed in no time by checking out the Technology reference section below
  • Start by just starting the server and play with the interface; use it to understand the nsearch implementation
  • You can find your postgres / rails server port with {{{

echo $PGPORT echo $WEBPORT }}}

  • Start with the small dataset. Then migrate to the larger one, once you've flushed out your functionality.

What to turn in

You will turn in your entire enron/ application folder as an archive - enron.tar.gz Requirements:

  1. Your submission must run after we unarchive and run the following commands: rake db:create, rake db:migrate.
  2. The archive must have a file in enron/db/terms_index.sql with the index(es) to make terms go fast

Technology reference

Ruby Ruby is a dynamic, open-source, scripting language. While learning ruby is obviously not a major focus of this class you will need a cursory understanding in order to complete this assignment. We recommend looking at some of the following resources before you get started.

Since this is a new language for some people, please don't hesitate to ask the TAs or newgroup questions about ruby basics, as we don't want you to waste time fighting with the language (of course, after you search the web and read other forum postings).

Rails Ruby On Rails is an open-source web development framework written in Ruby. Its goal is to make it easy to rapidly design, implement and deploy database backed, interactive websites. We will be providing you with a partially implemented site so that you can focus on the interesting parts and not on learning the ins and outs of Ruby/Rails. However, if you are interested in learning more there are a number of great quick start tutorials and screencasts just a web search away.

Links send out to the mailing list, regurgitated here: A few of my favorite resources for learning Ruby/Rails:

Understanding MVC Rails is based around the Model-View-Controller design pattern.

It is common to split an application into separate layers that run on different computers: presentation (UI), domain logic, and data access. In MVC the presentation layer is further separated into view and controller. MVC is often seen in web applications, where the view is the actual HTML page, and the controller is the code that gathers dynamic data and generates the content within the HTML. Finally, the model is represented by the actual content, usually stored in a database, and the business rules that transform that content based on user actions.

Though MVC comes in different flavors, control flow generally works as follows:

  1. The user interacts with the user interface in some way (e.g. presses a button).
  2. A controller handles the input event from the user interface, often via a registered handler or callback.
  3. The controller notifies the model of the user action, possibly resulting in a change in the model's state. (e.g. controller updates user's Shopping cart).
  4. A view uses the model (indirectly) to generate an appropriate user interface (e.g. the view produces a screen listing the shopping cart contents). The view gets its own data from the model. The model has no direct knowledge of the view.
  5. The user interface waits for further user interactions, which begins then a new cycle.

By decoupling models and views, MVC helps to reduce the complexity in architectural design, and to increase flexibility and reuse.

You will see this functionality separated into directories in app/ in the skeleton provided.

There are three controllers:

  • application.rb - This is the parent controller for all other controllers in your app. You should not need to put any code here. You can see that we put some helper functions here.
  • links_controller.rb - This controller handles interactions with the Link model and underlying table. You can test them by visiting http://:/links/. We have implemented our naive search here, http://:/nsearch
  • terms_controller.rb - This controller has a number of functions. First it handles interactions with the Term model and underlying table, and serves as our inverted index. Comments are provided describing the specs for each function. You can test them by visiting http://:/terms/.

ActiveRecord ActiveRecord is an Object-Relational Mapping library used by RubyOnRails. It's fairly full featured and makes a number of assumptions about the layout of your schema. If you are familiar with this library feel free to use any basic (no plug-ins) features you. If not, don't worry, as we have provided a list of these conventions and overrides, along with few simple example below about using ActiveRecord to run arbitrarily complex database queries over the Boats/Sailors example from lecture.

Conventions:

  1. Tables are the plural of the class name for any active record instance. (I.e. class Sailor < ActiveRecord:Base would refer to data in the Sailors table)
  2. The primary key for any entity is id. a. You can override this by putting set_primary_key :key or set_primary_keys [:key1, :key2] at the top of the class definition
  3. Foreign keys are always named _id a. You can override this by specifiying :foreign_key => [:song] after the belongs_to or has_many

Example:

Sailors(sid, sname, rating, age)
Boats(bid, bname, color)
Reserves(sid, bid, day)

class Sailor < ActiveRecord:Base
    set_primary_key :sid
    has_many :reservations, :foreign_key => :sid
end

class Boat < ActiveRecord:Base
    set_primary_key :bid
    has_many :reservations, :foreign_key => :bid
end

class Reservation < ActiveRecord:Base
    set_primary_keys :sid, :bid, :day
    belongs_to :sailor, :foreign_key => :sid
    belongs_to :boat, :foreign_key => :bid
end

Make a new sailor:

s = Sailor.new
s[:sid] = 1
s[:sname] = "Michael"
s[:rating] = 10
s[:age] = 24
s.save

Find a sailor by name:

s = Sailor.find_by_sname("Michael")

Find all red boats:

b = Boat.find_all_by_color("red")

Find all reservations by a sailor:

s = Sailor.find_by_sname("Michael")
r = s.reservations

These are the basics, but sometimes you want to be able to write more complex queries. For this you can call back on full SQL. If you wanted all sailors who have reserved a boat today you could write:

s = Sailor.find_by_sql("SELECT DISTINCT sailor.* FROM sailors JOIN reservations ON sailors.sid = reservations.sid WHERE reservation.day = <today>")

You can even add computed columns. For example to find all sailors along with their reservation counts you could write:

s = Sailor.find_by_sql("SELECT sailor.*, count(*) as reservation_count JOIN reservations ON sailors.sid = reservations.sid GROUP BY sailors.sid")
s[0] #First sailor in the list
s[0][:sname] #name of the first sailor in the list
s[0][:reservation_count] #Reservation count for the first sailor in the list