Assignment 3
MET CS 777 – Big Data Analytics
GitHub Classroom Invitation Link
In this assignment you will implement Batch Gradient Descent to fit a line to a two-dimensional
data set. You will implement a set of Spark jobs that learn the parameters of such a line from the
New York City taxi trip reports for the year 2013. The data set was released under FOIL (the
Freedom of Information Law) and made public by Chris Whong
(https://chriswhong.com/open-data/foil_nyc_taxi/). See Assignment 1 for details about this data set.
We would like to train a linear model that relates travel distance in miles to fare amount (the
money that is paid to the taxi).
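As a rough illustration of the technique named above, the following is a minimal single-machine sketch of batch gradient descent for a line y = m * x + b. It is not the Spark implementation the assignment asks for. The cost function (half the mean squared error), the function name, the learning rate, the iteration count and the toy data are all assumptions made for this example.

```python
def batch_gradient_descent(xs, ys, learning_rate=0.05, num_iterations=5000):
    """Fit y = m * x + b by batch gradient descent.

    Assumed cost: (1 / (2n)) * sum((m * x + b - y) ** 2).
    learning_rate and num_iterations are illustrative defaults only.
    """
    n = len(xs)
    m, b = 0.0, 0.0
    for _ in range(num_iterations):
        # Residuals of the current line on every data point (the full batch).
        errors = [m * x + b - y for x, y in zip(xs, ys)]
        # Partial derivatives of the assumed cost with respect to m and b.
        grad_m = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        # Step against the gradient.
        m -= learning_rate * grad_m
        b -= learning_rate * grad_b
    return m, b


# Toy data lying exactly on the line y = 2x + 1 (made up for this sketch).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0 * x + 1.0 for x in xs]
m, b = batch_gradient_descent(xs, ys)
```

In a Spark job, the two sums inside the loop would typically be computed once per iteration with a map followed by a reduce over the distributed data, while m and b stay on the driver.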
2 Taxi Data Set – Same data set as Assignment 1
This is the same data set as used for Assignment 1. Please have a look at the table description.
The data set is in Comma Separated Values (CSV) format. When you read a line and split it
on the comma character ",", you get a string array of length 17. With index numbers starting from
zero, for this assignment we need index 5, trip_distance (trip distance in miles), and index 11,
fare_amount (fare amount in dollars), as stated in the following table.
Index                      Field           Description
5  (this is our X-axis)    trip_distance   trip distance in miles
11 (this is our Y-axis)    fare_amount     fare amount in dollars
Table 1: Taxi Data Set fields
Data Clean-up Step
• Remove all taxi rides that are shorter than 2 minutes or longer than 1 hour.
• Remove all taxi rides that have a "fare_amount" of less than 3 dollars or more than 200 dollars.
• Remove all taxi rides that have a "trip_distance" of less than 1 mile or more than 50 miles.
• Remove all taxi rides that have a "tolls_amount" of less than 3 dollars.
You can also preprocess the data and store it in your own cluster storage.
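The clean-up rules above can be sketched as a plain Python function that parses one CSV line and either returns the (trip_distance, fare_amount) pair or rejects the row. This is only a sketch under stated assumptions: the text gives indices 5 and 11 only, so the positions used for the trip duration (index 4, assumed to be in seconds) and for tolls_amount (index 15) are assumptions that should be checked against the Assignment 1 table. The function name is hypothetical.

```python
# Column positions in the 17-field CSV row.
TRIP_TIME_SECS_IDX = 4   # ASSUMPTION: not stated in this text; duration assumed in seconds
TRIP_DISTANCE_IDX = 5    # given in Table 1
FARE_AMOUNT_IDX = 11     # given in Table 1
TOLLS_AMOUNT_IDX = 15    # ASSUMPTION: not stated in this text


def parse_and_clean(line):
    """Return (trip_distance, fare_amount) for a ride that passes the
    clean-up rules, or None if the row is malformed or filtered out."""
    fields = line.split(",")
    if len(fields) != 17:
        return None
    try:
        duration = float(fields[TRIP_TIME_SECS_IDX])
        distance = float(fields[TRIP_DISTANCE_IDX])
        fare = float(fields[FARE_AMOUNT_IDX])
        tolls = float(fields[TOLLS_AMOUNT_IDX])
    except ValueError:
        # Non-numeric value, for example a header row.
        return None
    if not (120 <= duration <= 3600):   # keep rides of 2 minutes to 1 hour
        return None
    if not (3 <= fare <= 200):          # keep fares of 3 to 200 dollars
        return None
    if not (1 <= distance <= 50):       # keep distances of 1 to 50 miles
        return None
    if tolls < 3:                       # drop rides with tolls below 3 dollars
        return None
    return distance, fare
```

With an RDD of text lines, a function like this could be applied as lines.map(parse_and_clean).filter(lambda r: r is not None), and the cleaned result could then be stored as mentioned above.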
3 Obtaining the Dataset
Small data set (93 MB compressed, 384 MB uncompressed) for implementation and testing purposes
(roughly 2 million taxi trips). The data sets are available on Google Cloud Storage and Amazon S3.
You can download or access them using the following internal URLs:
Small Data Set    gs://metcs777/taxi-data-sorted-small.csv.bz2
Large Data Set    gs://metcs777/taxi-data-sorted-large.csv.bz2
Table 2: Data set on Google Cloud Storage – URLs
Small Data Set    s3://metcs777/taxi-data-sorted-small.csv.bz2
Large Data Set    s3://metcs777/taxi-data-sorted-large.csv.bz2
Table 3: Data set on Amazon AWS – URLs
4 Assignment Tasks
4.1 Task 1 : Simple Linear Regression (4 points)
We want to fit a simple line to our data (trip_distance, fare_amount) and use it to predict
fare_amount from the travel distance.
Consider a Simple Linear Regression model given in equation (1). What are the regression
coefficients for your model?
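Equation (1) is not reproduced in this text, so the sketch below assumes the usual form y = m * x + b, with x the trip distance and y the fare amount. Under that assumption, the ordinary least-squares coefficients can be computed in closed form from five running sums. The function name and the sample points are made up for illustration.

```python
def fit_simple_linear_regression(points):
    """Return (m, b) minimizing the squared error of y = m * x + b.

    points is an iterable of (x, y) pairs. The model form is an
    assumption, since equation (1) is missing from the text.
    """
    n = 0
    sum_x = sum_y = sum_xy = sum_xx = 0.0
    for x, y in points:
        n += 1
        sum_x += x
        sum_y += y
        sum_xy += x * y
        sum_xx += x * x
    denominator = n * sum_xx - sum_x ** 2
    if denominator == 0:
        raise ValueError("need at least two points with distinct x values")
    m = (n * sum_xy - sum_x * sum_y) / denominator   # slope
    b = (sum_y - m * sum_x) / n                      # intercept
    return m, b


# Made-up points lying exactly on y = 2.5x + 3.
points = [(1.0, 5.5), (2.0, 8.0), (3.0, 10.5), (4.0, 13.0)]
m, b = fit_simple_linear_regression(points)
```

Because the formulas depend only on n and four sums, the same computation fits a distributed setting: each record can be mapped to a tuple such as (1, x, y, x * y, x * x) and the tuples added element-wise in a single reduce.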
The solutions for m slope of the l