Name | Mike Izbicki (call me Mike) |
[email protected] | |
Office | Adams 216 |
Office Hours | See Issue #243 |
Zoom Link | See Issue #227 |
Webpage | izbicki.me |
Research | Machine Learning (see izbicki.me/research.html for some past projects) |
Fun facts:
- grew up in San Clemente (~1hr south of Claremont, on the beach)
- 7 years in the navy
- nuclear submarine officer, personally converted >10g of uranium into pure energy
- worked at National Security Agency (NSA)
- left Navy as a conscientious objector
- phd/postdoc at UC Riverside
- taught in DPRK (i.e. North Korea)
- my wife is pregnant and due to have a baby April 18th
- I'll be taking 2 weeks paternity leave when the baby comes
What is big data?
Depends entirely on the person who is talking
- Most non-computer scientists (muggles) think anything bigger than 1 GB is big data
- Facebook considers "tens of petabytes" to be a "SMALL data problem"
- One of the biggest problems in industry is that people apply tools built for "Facebook big data" to "muggle big data". A major goal of this course is to teach you why this is bad and how to avoid it.
- For us, "big data" means:
  - managing a cluster of computers to solve a computational problem; if it can be solved on a single computer, it's SMALL data
  - all the interesting/applied parts of upper division computer science compressed into a single course
We will work with the following three datasets:
- All geolocated tweets sent from 2017-today, 4 terabytes
- The common crawl of the web since 2008, >1 petabyte
- The internet archive, >50 petabytes as of 2014
By the end of this course, you will build your own "google" search engine. You will manage a cluster of machines that work together to:
- download all the data from the internet
- extract key information from the HTML
- store it in a format suitable for sub 200ms queries
- and serve the data in a webpage
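To make the four steps above concrete, here is a minimal single-machine sketch in Python. It is not the course's implementation. The download step is replaced by two hard-coded pages at hypothetical example.com URLs, the standard library's `html.parser` and `sqlite3` modules stand in for the cluster tooling, and `search` is a naive substring scan rather than real full text search. The names `TitleAndTextParser`, `extract`, `build_index`, and `search` are invented for this illustration.

```python
import sqlite3
from html.parser import HTMLParser


class TitleAndTextParser(HTMLParser):
    """Collect the <title> and the visible text of one HTML page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.text_parts = []
        self._in_title = False
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def extract(html):
    """Step 2: extract key information (title, body text) from the HTML."""
    parser = TitleAndTextParser()
    parser.feed(html)
    parser.close()
    return parser.title.strip(), " ".join(parser.text_parts)


def build_index(pages):
    """Step 3: store the extracted data; `pages` maps URL -> raw HTML."""
    conn = sqlite3.connect(":memory:")  # stand-in for a real database server
    conn.execute(
        "CREATE TABLE pages (url TEXT PRIMARY KEY, title TEXT, body TEXT)"
    )
    for url, html in pages.items():
        title, body = extract(html)
        conn.execute("INSERT INTO pages VALUES (?, ?, ?)", (url, title, body))
    conn.commit()
    return conn


def search(conn, term):
    """Step 4: answer a query (naive substring match, not full text search)."""
    pattern = f"%{term}%"
    rows = conn.execute(
        "SELECT url, title FROM pages "
        "WHERE title LIKE ? OR body LIKE ? ORDER BY url",
        (pattern, pattern),
    )
    return rows.fetchall()


if __name__ == "__main__":
    # Step 1 (downloading) is faked with two hard-coded, hypothetical pages.
    demo_pages = {
        "https://example.com/a": "<title>Cats</title><p>All about cats.</p>",
        "https://example.com/b": "<title>Dogs</title><p>All about dogs.</p>",
    }
    print(search(build_index(demo_pages), "dogs"))
```

A substring scan like this reads every row on every query, which is exactly the kind of design that stops working once the data no longer fits on one machine.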
In order to make your search engine scalable, we will use the following technologies:
- Docker containers
  - used to easily deploy code to thousands of computers
  - requires concepts from operating systems, networks, architecture; closely related to "virtual machines"
  - widely used in industry, see https://stackshare.io/docker
- Databases
  - stores and accesses the data efficiently
    - application and database on same computer (SQLite, covered in CS40)
    - application and database on different computers (Postgres), our focus
    - database on a cluster of computers in the same datacenter (Postgres + extensions like Citus)
    - database on a cluster of computers spread throughout the world (YugabyteDB, CockroachDB)
  - SQL to manipulate data, Python to build applications
  - NoSQL (e.g. MongoDB, CouchDB) sucks and you should probably never use it (strongly held personal opinion)
  - Postgres implements full text search in 70+ languages using custom libraries I've written
  - Postgres is widely used in industry, see https://stackshare.io/postgresql
With these technologies, you can create a fully functioning, highly scalable web business
- former CMC student Biniyam Asnake created the business NextDorm as his senior thesis (slightly different tech stack, but same ideas)
Who should take this course?
This course is designed for data science majors, not computer science majors. I'm happy to have CS majors in this course (and I think you'll find this course fun), but know that:
- you probably have not fully met the prereqs for this course
- some material in this course will duplicate material in your other CS courses
- this is especially true of CSCI133 Databases
- the course number CSCI143 comes from the fact that all CMC upper division CS courses start with CSCI14, and the 3 is for databases
Prerequisites:
- Discrete math: CSCI055 or MATH055
  - Basic probability / counting
  - Basic graph theory
- Foundations of data science: CSCI 036, ECON 122, or ECON 160
  - Basic machine learning
  - Basic SQL (also covered in CSCI040 Computing for the Web; not covered in any computer science class except CSCI133 Databases, which you should not take if you take this course)
  - Regular expressions (for CS majors, typically covered in a theory of computing or compilers class)
- Data structures: CSCI046 or CSCI70 (Mudd) or CSCI62 (Pomona)
  - All courses cover:
    - Big-oh notation
    - Balanced binary search trees
  - CSCI046 covers:
    - Basic Unix shell commands
    - Advanced git
    - Vim text editor
    - Analyzing multi-gigabyte Twitter datasets
  - The data structures prerequisite, CSCI040, covers:
    - Markdown
    - HTML / CSS
    - Basic SQL
    - Programming web servers with the `flask` library
    - Web scraping with the `requests` and `bs4` libraries
- Takeaway:
  - I am expecting that you have basic familiarity with the Linux terminal, git, and SQL joins.
  - If you haven't seen those concepts before, expect to spend extra time during those weeks catching up.
  - There are also extra assignments that some students will have to complete, depending on their background.
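If you are unsure whether you have "basic familiarity with SQL joins", the following self-check may help. It is only an illustrative sketch: the `users` and `tweets` tables and their contents are made up, and Python's built-in `sqlite3` module is used purely so the example runs without any setup. You should be able to predict both result lists before running it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE tweets (
        id INTEGER PRIMARY KEY,
        user_id INTEGER REFERENCES users(id),
        text TEXT
    );
    INSERT INTO users VALUES (1, 'alice'), (2, 'bob'), (3, 'carol');
    INSERT INTO tweets VALUES (10, 1, 'hello'), (11, 1, 'world'), (12, 2, 'hi');
    """
)

# Inner join: only users with at least one matching tweet appear.
inner_counts = conn.execute(
    """
    SELECT users.name, count(*)
    FROM users
    JOIN tweets ON tweets.user_id = users.id
    GROUP BY users.name
    ORDER BY users.name
    """
).fetchall()

# Left join: every user appears; users without tweets get a count of 0
# because count(tweets.id) ignores the NULLs produced by the join.
left_counts = conn.execute(
    """
    SELECT users.name, count(tweets.id)
    FROM users
    LEFT JOIN tweets ON tweets.user_id = users.id
    GROUP BY users.name
    ORDER BY users.name
    """
).fetchall()

print(inner_counts)
print(left_counts)
```

If the difference between the two outputs is not obvious to you, plan on reviewing joins before the database weeks.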
Relation to other CS courses:
One purpose of this course is to provide DS majors with an overview of CS concepts. Therefore, there is a lot of material in this course that is covered in other upper division CS courses required for CS majors.
- Overlapping concepts
  - CSCI105 Computer Systems (10% overlap)
    - types of storage: tape vs HDD vs SSD vs NVMe vs RAM
    - RAID
    - parallel vs distributed architectures
  - CSCI135 Operating Systems (10% overlap)
    - permissions systems
    - processes vs threads
    - virtual machines vs containers
  - CSCI125 Networking (10% overlap)
    - private vs public networks
    - IP addresses
    - TCP ports
    - virtual networks
  - CSCI121 Software Development (10% overlap)
    - version control systems (i.e. git)
    - test driven development / continuous integration
    - microservices vs monolithic architectures
    - 12 factor applications
  - CSCI133 Databases (50% overlap)
    - SQL
    - ACID/MVCC/transactions
    - indexing techniques
  - A lot of the concepts we'll be covering "should" be covered in other CS courses, but because CS professors are often more theory minded than practice minded, they don't get covered. In that sense, this course is similar to the Missing Semester of Your CS Education course taught at MIT.
- Concepts we don't cover from CSCI133 Databases
  - relational algebra
  - technical implementation details / C programming
  - relationship between the database and operating system
- Big data concepts from a CS perspective that we will not talk about:
  - Frameworks for distributed computation (e.g. Apache Hadoop, Apache Spark)
  - Distributed filesystems (e.g. HDFS, IPFS); we will talk about S3
  - Geo-distributed databases
Textbook:
Big data is a rapidly changing field, and all currently printed textbooks are both incomplete and already out of date. Therefore, we won't be using a textbook. Instead, we will be using online documentation. The main references we will use are given below, but I will provide more specific links each week.
Assignments:
- Weekly labs (worth `2**1` points)
- Weekly quizzes (worth `2**2` or `2**3` or `2**4` points)
- Weekly projects (worth `2**3` or `2**4` or `2**5` points)
- 2 exams (worth `2**6` points each)
- Non-graduating students will complete a final project due during finals week.
- Occasional extra credit assignments
Late Work Policy:
You lose `2**i` points on every assignment, where `i` is the number of days late.
It is usually better to submit a correct assignment late than an incorrect one on time.
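Because the penalty doubles every day, it is worth seeing the arithmetic spelled out. The sketch below is one reading of the policy, not an official grading script. It assumes that on-time work (`i = 0`) loses nothing and that a score cannot drop below zero; neither detail is stated above. The function names are invented for this example.

```python
def late_penalty(days_late):
    """Points lost for submitting `days_late` days after the deadline."""
    if days_late <= 0:
        return 0  # assumption: on-time work is not penalized
    return 2 ** days_late


def score_after_penalty(raw_score, days_late):
    """Apply the late penalty to a raw score."""
    # assumption: the penalty cannot push a score below zero
    return max(0, raw_score - late_penalty(days_late))


if __name__ == "__main__":
    # A 2**5-point project with a perfect raw score, submitted 0-5 days late.
    for days in range(6):
        print(days, late_penalty(days), score_after_penalty(2 ** 5, days))
```

Under these assumptions a perfect `2**5`-point project keeps 24 points at 3 days late and 0 points at 5 days late, while a `2**1`-point lab is worth nothing after a single late day.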
Grade Schedule:
Your final grade will be computed according to the following standard table, with the caveats described below.
If your grade satisfies | then you earn |
---|---|
95 ≤ grade | A |
90 ≤ grade < 95 | A- |
87 ≤ grade < 90 | B+ |
83 ≤ grade < 87 | B |
80 ≤ grade < 83 | B- |
77 ≤ grade < 80 | C+ |
73 ≤ grade < 77 | C |
70 ≤ grade < 73 | C- |
67 ≤ grade < 70 | D+ |
63 ≤ grade < 67 | D |
60 ≤ grade < 63 | D- |
60 > grade | F |
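The table can be read as a simple lookup from the highest cutoff downward. The sketch below assumes `grade` is a percentage and applies no rounding, since the table does not say whether grades are rounded. It also ignores the caveat tasks. `GRADE_CUTOFFS` and `letter_grade` are names made up for this example.

```python
# (minimum percentage, letter) pairs copied from the table, highest first.
GRADE_CUTOFFS = [
    (95, "A"),
    (90, "A-"),
    (87, "B+"),
    (83, "B"),
    (80, "B-"),
    (77, "C+"),
    (73, "C"),
    (70, "C-"),
    (67, "D+"),
    (63, "D"),
    (60, "D-"),
]


def letter_grade(grade):
    """Return the letter for a percentage grade, before any caveats."""
    for cutoff, letter in GRADE_CUTOFFS:
        if grade >= cutoff:
            return letter
    return "F"  # anything below 60


if __name__ == "__main__":
    for example in (95, 94.9, 83, 59.9):
        print(example, letter_grade(example))
```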
Caveats:
There are 2 "caveat tasks" in this course. These tasks should be easy, and everyone will get full credit on the task just for completing the task. If you don't complete one of the tasks, however, your grade (from the table above) will be docked 10%. (For example, an A- grade would become a B- grade.) You have the entire semester (until I submit grades) to complete these tasks.
You can find the details about the caveat tasks at:
Technology Policy:
- You MUST complete all programming assignments on the lambda server.
- You MUST use either vim or emacs to complete all programming assignments. In particular, you may not use the GitHub text editor, VSCode, IDLE, or PyCharm for any reason.
  This also means you MAY NOT use the GitHub interface to edit files for a pull request.
- You MAY NOT share your lambda server credentials with anyone else.
Violations of any of these policies will be treated as academic integrity violations.
Collaboration Policy
- There are no restrictions on what you can post to GitHub Issues. In particular, you are highly encouraged to post detailed questions/answers/comments with lots of code.
- You are highly encouraged to collaborate with students
  - in class/lab,
  - in the QCL,
  - and in office hours.
- You MAY NOT look at another student's code (or have another human look at your code) in any other context.
- You MAY NOT look at another student's code on GitHub.
All projects are developed as open source projects, and so the code is published openly online. The benefits of this model include: (1) you actually learn how to develop/contribute to open source projects; (2) future employers can see that you have GitHub activity. Please do not abuse this privilege.
I've tried to design the course to be as accessible as possible for people with disabilities. (We'll talk a bit about how to design accessible software in class too!) If you need any further accommodations, please ask.
I want you to succeed and I'll make every effort to ensure that you can.