Extracting data from Twitter for #machinelearningflashcards

I’m a fan of Chris Albon’s recent project #machinelearningflashcards on Twitter where generalized topics and methodologies are drawn out with key takeaways. It’s a great approach to sharing concepts about machine learning for everyone and a timely refresher for those of us who frequently forget algorithm basics.

I leveraged Maëlle Salmon’s recent blog post on the Faces of #rstats Twitter heavily as a tutorial for this attempt at extracting data from Twitter to download the #machinelearningflashcards.

Source Repo for this work: jasdumas/ml-flashcards

Directions

1. Load libraries:

For this project I used rtweet to connect the Twitter API to search for relevant tweets by the hash tag, dplyr to filter and pipe things, stringr to clean up the tweet description, and magick to process the images.

Note: I previously ran into trouble when downloading ImageMagick and detailed the errors and approaches, if you fall into the same trap I did: How to install imagemagick on MacOS

library(rtweet)
library(dplyr)
library(magick)
library(stringr)
library(kableExtra)
library(knitr)

2. Get tweets for the hash tag and only curated tweets for Chris Albon’s work:

ml_tweets <- search_tweets("#machinelearningflashcards", n = 500, include_rts = FALSE) %>% filter(screen_name == 'chrisalbon')

mt <- ml_tweets[1:3, 1:5]

kable(mt, format = "html") %>%
  kable_styling(bootstrap_options = "striped", 
                full_width = F) 

<?xml version=”1.0” encoding=”UTF-8”?>

screen_name	user_id	created_at	status_id	text
chrisalbon	11518572	2017-05-09 22:51:43	862077650772164608	Mean Squared Error #machinelearningflashcards https://t.co/K1iDqLV5DD
chrisalbon	11518572	2017-05-09 18:15:39	862008178527031296	R-Squared #machinelearningflashcards https://t.co/73gR8tb5PA
chrisalbon	11518572	2017-05-09 16:23:04	861979845563105280	Motivation For Kernel PCA #machinelearningflashcards https://t.co/AhLB91gHBh

3. Get the text within the tweet to add to the file name by removing the hash tag and URL link with some light regex:

ml_tweets$clean_text <- ml_tweets$text
ml_tweets$clean_text <- str_replace(ml_tweets$clean_text,"#[a-zA-Z0-9]{1,}", "") # remove the hashtag
ml_tweets$clean_text <- str_replace(ml_tweets$clean_text, " ?(f|ht)(tp)(s?)(://)(.*)[.|/](.*)", "") # remove the url link
ml_tweets$clean_text <- str_replace(ml_tweets$clean_text, "[[:punct:]]", "") # remove punctuation

4. Write a function to download images of the flashcards from the media_url column and append the file name from the cleaned tweet text description and save into a folder:

save_image <- function(df){
  for (i in c(1:nrow(df))){
    image <- try(image_read(df$media_url[i]), silent = F)
  if(class(image)[1] != "try-error"){
    image %>%
      image_scale("1200x700") %>%
      image_write(paste0("data/", ml_tweets$clean_text[i],".jpg"))
  }
 
  }
   cat("Function complete...\n")
}

5. Apply the function:

save_image(ml_tweets)

## Function complete...

At the end of this process you can view all of the #machinelearningflashcards in one location! Thanks to Chris Albon for his work on this, and I’m looking forward to re-running this script to gain additional knowledge from new #machinelearningflashcards that are developed in the future!

Extracting data from Twitter for #machinelearningflashcards

A tutorial for machine learning - learning

Extracting data from Twitter for #machinelearningflashcards

A tutorial for machine learning - learning

Directions

1. Load libraries:

2. Get tweets for the hash tag and only curated tweets for Chris Albon’s work:

3. Get the text within the tweet to add to the file name by removing the hash tag and URL link with some light regex:

4. Write a function to download images of the flashcards from the media_url column and append the file name from the cleaned tweet text description and save into a folder:

5. Apply the function: