STAT GR5206 Homework 2 assignment

    STAT GR5206 Homework 2 [100 pts]
    Due 8:00pm Monday, October 16th on Canvas
    Your homework should be submitted on Canvas using RMarkdown. Please submit both a
    knitted .pdf file and a raw .Rmd file. (If you are having trouble knitting to .pdf, come
    to office hours and we’ll try to sort it out, but for the homework, knit to .html and then
    convert to .pdf before handing it in.) We will not (and cannot) accept any other formats.
    Please clearly label the questions in your responses and support your answers by textual
    explanations and the code you use to produce the result. Note that you cannot answer the
    questions by observing the data in the “Environment” section of RStudio or in Excel – you
    must use coded commands.
    Goals: regular expressions, character functions in R, and web scraping.
    In this assignment, we’re going to scrape the 2017-2018 Brooklyn Nets Regular Season
    Schedule (they’re a basketball team from Brooklyn that plays in the NBA). We will take the
    regular season schedule from and reassemble the game listings
    in an R data frame for computational use.
    To do this, perform the following tasks:
    i. Use the readLines() command we studied in class to load the NetsSchedule.html file
    into a character vector in R. Call the vector nets1718.
    a. How many lines are in the NetsSchedule.html file?
    b. What is the total number of characters in the file?
    c. What is the maximum number of characters in a single line of the file?
    ii. Open NetsSchedule.html as a webpage. This should happen if you simply click on
    the file. You should see a table listing all the games scheduled for the 2017-2018 NBA
    season. There are a total of 82 regular season games scheduled. Who are they playing
    first, and when? Who are they playing last, and when?
    iii. Now, open NetsSchedule.html using a text editor. To do this you may need to right-
    click on the file and tell your computer to use a text editor to open the file. What
    line in the file holds information about the first game of the regular season (date, time,
    opponent)? What line provides the date, time, and opponent for the final game? It
    may be helpful to use CTRL-F or COMMAND-F here and also work between the file in R
    and in the text editor.
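
    The counting steps in task (i) can be sketched on a toy file. This is only an
    illustration under stated assumptions: the five-line file written to a temporary
    path below is a hypothetical stand-in for NetsSchedule.html, and the names demo,
    n_lines, n_chars, and max_chars are invented for the example.

```r
# Hypothetical stand-in for NetsSchedule.html: a tiny file in a temp location.
tmp <- tempfile(fileext = ".html")
writeLines(c("<html>", "<body>", "<p>Hello, Nets!</p>", "</body>", "</html>"), tmp)

# readLines() returns a character vector with one element per line of the file.
demo <- readLines(tmp)

n_lines   <- length(demo)       # number of lines in the file
n_chars   <- sum(nchar(demo))   # total characters (newline characters not counted)
max_chars <- max(nchar(demo))   # number of characters in the longest line

print(c(lines = n_lines, chars = n_chars, longest = max_chars))
```

    Note that nchar() is vectorized, so it returns one count per line; summing or
    taking the maximum of that vector gives the file-level answers.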
    Using NetsSchedule.html we’d like to extract the following variables: the date, the game
    time (ET), the opponent, and whether the game is home or away. Looking at the file in
    the text editor, locate each of these variables. For the next part of the homework we use
    regular expressions to extract this information.
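
    As a rough sketch of the regular-expression workflow, the example below runs
    grep(), regexpr(), and regmatches() on a made-up character vector. Everything in
    it is an assumption for illustration: the toy lines, the date and time formats,
    and both patterns are invented, and the real file will need patterns written
    against its actual HTML.

```r
# Invented lines; the layout of the real NetsSchedule.html may differ.
toy <- c(
  "<table>",
  "<tr><td>Mon, Jan 1</td><td>7:00 PM</td></tr>",
  "<tr><td>Wed, Jan 3</td><td>7:30 PM</td></tr>",
  "</table>"
)

# Assumed formats: "Ddd, Mmm D" for dates and "H:MM AM/PM" for times.
date_pattern <- "[A-Z][a-z]{2}, [A-Z][a-z]{2} [0-9]{1,2}"
time_pattern <- "[0-9]{1,2}:[0-9]{2} [AP]M"

# grep() returns the indices of the lines that contain a match.
hits <- grep(date_pattern, toy)

# regexpr() locates the first match in each line; regmatches() pulls it out.
game_lines <- toy[hits]
dates <- regmatches(game_lines, regexpr(date_pattern, game_lines))
times <- regmatches(game_lines, regexpr(time_pattern, game_lines))

print(hits)
print(dates)
print(times)
```

    Checking length(hits) against the expected number of games is a quick way to
    tell whether a pattern is too loose or too strict.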
    iv. Write a regular expression that will capture the date of the game. Then, using the
    grep() function, find the lines in the file that correspond to the games. Make sure
    that grep() finds 82 lines, and that the first and last locations grep() finds match the
    first and last games you found in (ii).
    v. Using the expression you wrote in (iv) along with the functions regexpr() and regmatches(),
    extract the dates from the text file. Store this information in a vector called date
    for use below. HINT: We did something like this in class.
    vi. Use the same strategy as in (iv) and (v) to create a time vector that stores the time
    of the game.
    vii. We would now like to gather information about whether the game is home or away.
    This information is indicated in the schedule by either an ‘@’ or a ‘vs’ in front of the
    opponent. If the Nets are playing ‘@’ their opponent’s court, the game is away. If the
    Nets are playing ‘vs’ the opponent, the game is at home.
    Capture this information using a regular expression. You may want to use the HTML
    code around these values to guide your search. Then extract this information and use
    it to create a vector called home which takes the value 1 if the game is played at home
    or 0 if it is away.
    HINT: In my solution, I use the fact that in each line, the string <li class="game-status">
    appears before this information. So my regular expression searches for that string
    followed by ‘@’ or that string followed by ‘vs’. After I’ve extracted these strings, I use
    gsub() to finally extract just the ‘@’ or the ‘vs’.
    viii. Finally, we would like to find the opponent. Again, capture this information using a
    regular expression. Extract these values and save them to a vector called opponent.
    Again, to write your regular expression you may want to use the HTML code around
    the names to guide your search.
    ix. Construct a data frame of the four variables in the following order: date, time,
    opponent, home. Print rows 1 to 10 of the data frame. Does the data match the first 10
    games as seen from the web browser?
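
    The extract-then-clean idea from the hint in (vii), and the final assembly in
    (ix), might look roughly like the sketch below. It is not the instructor's
    solution. The toy lines, the team names, the <a href="#"> markup around the
    opponent, and the date and time values are all hypothetical; only the
    <li class="game-status"> string comes from the hint.

```r
# Invented lines; only the game-status string is taken from the hint in (vii).
toy <- c(
  '<li class="game-status">@</li><a href="#">Springfield</a>',
  '<li class="game-status">vs</li><a href="#">Shelbyville</a>'
)

# Step 1: match the known prefix followed by "@" or "vs".
status_pattern <- '<li class="game-status">(@|vs)'
status_raw <- regmatches(toy, regexpr(status_pattern, toy))

# Step 2: strip the prefix with gsub(), leaving just "@" or "vs".
status <- gsub('<li class="game-status">', "", status_raw, fixed = TRUE)
home <- as.integer(status == "vs")   # 1 = home game, 0 = away game

# Opponent: match the (assumed) link markup, then remove the tags around the name.
opp_raw  <- regmatches(toy, regexpr('<a href="#">[A-Za-z ]+</a>', toy))
opponent <- gsub("<[^>]*>", "", opp_raw)

# Stand-ins for the date and time vectors built in (v) and (vi).
date <- c("Mon, Jan 1", "Wed, Jan 3")
time <- c("7:00 PM", "7:30 PM")

# Assemble the four variables in the requested column order.
schedule <- data.frame(date, time, opponent, home, stringsAsFactors = FALSE)
print(head(schedule, 10))   # head() returns at most the first 10 rows
```

    Matching with the surrounding HTML and then deleting it keeps the pattern
    specific enough to avoid stray ‘@’ or ‘vs’ text elsewhere in a line.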