Playing Around with Twitter Data in R

Osaka.R #4 (2010/12/02)
@a_bicky

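About me
• Takeshi Arabiki (M2, second-year master's student at Osaka University)
o Twitter: @a_bicky
o Hatena: id:a_bicky (↑ I was young five years ago...)

• "Studying" natural language processing and machine learning (please don't ask "What about your research?")

• Blog
Arabiki Nikki (あらびき日記) http://d.hatena.ne.jp/a_bicky/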

Today's topics
Playing with Twitter data in R
• First half
Reproducing the Twilog graphs in R!

• Second half
Clustering users based on their tweets

Twilog
Saves your Twitter posts in a blog-like format
Searchable too, which is handy!

http://twilog.org/

Twilog Stats
Statistics about your own tweets

Let's do Twilog Stats in R

The twitteR package
# Load the twitteR package
library(twitteR)
# Fetch 3,200 tweets by @a_bicky (3,200 is the maximum)
tweets <- userTimeline("a_bicky", n = 3200)
str(tweets[[1]]) # print the information held for the first tweet

Output
Formal class 'status' [package "twitteR"] with 10 slots
..@ text : chr "RではもしかしてNULL文字を取り除くことできない！？ #r" ← tweet text ("Could it be that R can't strip NUL characters!? #r")
..@ favorited : logi FALSE
..@ replyToSN : chr(0)
..@ created : POSIXct[1:1], format: "2010-12-01 14:17:31" ← tweet time (UTC)
..@ truncated : logi FALSE
..@ replyToSID : num(0)
..@ id : int 1430261760 ← tweet ID (too many digits, so it has been rounded)
..@ replyToUID : num(0)
..@ statusSource: chr "web"
..@ screenName : chr "a_bicky" ← account name of the author

Accessing the elements
> str(tweets[[1]]) # check the data structure
Formal class 'status' [package "twitteR"] with 10 slots
..@ text : chr "RではもしかしてNULL文字を取り除くことできない！？ #r"
..@ favorited : logi FALSE
..@ replyToSN : chr(0)
..@ created : POSIXct[1:1], format: "2010-12-01 14:17:31"
..@ truncated : logi FALSE
..@ replyToSID : num(0)
..@ id : int 1430261760
..@ replyToUID : num(0)
..@ statusSource: chr "web"
..@ screenName : chr "a_bicky"
> tweets[[1]]@text # tweet text
[1] "RではもしかしてNULL文字を取り除くことできない！？ #r"
> tweets[[1]]@created # tweet time
[1] "2010-12-01 14:17:31 UTC"

Basic stats
Let's display the equivalents of these figures

Total number of tweets

# Print the number of elements
print(length(tweets))

Output
[1] 2964
* Fewer than 3,200 because official retweets are excluded

Number of days with tweets
We want the date part of each tweet's timestamp
# Extract the tweet timestamps
dates <- sapply(tweets, function(x) x@created)
# Convert back to a date-time object (POSIXct)
dates <- structure(dates, class = c("POSIXt", "POSIXct"),
                   tzone = "Asia/Tokyo")
# Format used when printing dates (month/day)
form <- "%m/%d"
# Convert to month/day strings
days <- format(dates, form)
# Print the number of unique dates
print(length(unique(days)))

Output
[1] 129
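
The conversion step is needed because sapply() simplifies its results into a plain vector, which drops the POSIXct class and leaves only seconds since the epoch. A minimal sketch of that behaviour, using two made-up timestamps in place of x@created; the values and the use of as.POSIXct() with an origin (rather than structure()) are assumptions for illustration:

```r
# Two made-up timestamps standing in for tweets[[i]]@created (assumed values)
created <- list(as.POSIXct("2010-12-01 14:17:31", tz = "UTC"),
                as.POSIXct("2010-12-01 23:30:00", tz = "UTC"))

# sapply() simplifies to a bare numeric vector: the POSIXct class is lost
secs <- sapply(created, function(x) x)
print(class(secs))   # "numeric"

# Rebuild date-times from the seconds, displayed in Japan time
dates <- as.POSIXct(secs, origin = "1970-01-01", tz = "Asia/Tokyo")

# Month/day strings; the second tweet falls on the next day in JST
days <- format(dates, "%m/%d")
print(days)          # "12/01" "12/02"
```

Because the dates are formatted in Asia/Tokyo, a tweet late in the UTC day is counted on the following calendar day.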

sapply
sapply(X, FUN, ..., simplify = TRUE, USE.NAMES = TRUE)

Passes the elements of X one at a time to FUN as its first argument and returns the results collected together

> tmp <- sapply(list(a = "this is a", b = "this is b", c = "this is c"),
+               function(x) { cat("begin "); print(x); cat("end\n"); return(x) })
begin [1] "this is a"
end
begin [1] "this is b"
end
begin [1] "this is c"
end
> tmp
          a           b           c 
"this is a" "this is b" "this is c" 

Number of days without tweets
Covers the period between the oldest tweet and the newest tweet
# Every date from the oldest tweet to the newest one
alldays <- format(seq(as.Date(dates[length(dates)]),
                      as.Date(dates[1]), by = "days"),
                  format = form) # form <- "%m/%d"
# days as a factor
fdays <- factor(days, levels = alldays, ordered = TRUE)
# Number of tweets per date
dtable <- table(fdays)
# Print the number of dates with zero tweets
print(sum(dtable == 0))

Output
[1] 0

Sorry...

seq
Generates a sequence of values
> seq(1, 10, by = 2)
[1] 1 3 5 7 9

> seq(as.Date("2010-12-01"), as.Date("2010-12-31"), by = "days")
[1] "2010-12-01" "2010-12-02" "2010-12-03" "2010-12-04" "2010-12-05"
...
[26] "2010-12-26" "2010-12-27" "2010-12-28" "2010-12-29" "2010-12-30"
[31] "2010-12-31"

> seq(as.Date("2010-12-01"), as.Date("2010-12-31"), by = "2 days")
[1] "2010-12-01" "2010-12-03" "2010-12-05" "2010-12-07" "2010-12-09"
[6] "2010-12-11" "2010-12-13" "2010-12-15" "2010-12-17" "2010-12-19"
[11] "2010-12-21" "2010-12-23" "2010-12-25" "2010-12-27" "2010-12-29"
[16] "2010-12-31"

A caveat
table() only counts values that actually occur
> days <- c("12/01", "12/03")
> alldays <- c("12/01", "12/02", "12/03")
> table(days)
days
12/01 12/03 
    1     1 
> table(factor(days, levels = alldays))

12/01 12/02 12/03 
    1     0     1 
> # This works too
> (diffdays <- setdiff(alldays, days))
[1] "12/02"
> length(diffdays) # number of days without tweets
[1] 1

Average and maximum tweets per day

# Print the mean, rounded to one decimal place
print(round(mean(dtable), 1))
# Print the maximum
print(max(dtable))

Output
[1] 23
[1] 74

Number of characters tweeted
# Number of characters in each tweet
tnchar <- sapply(tweets, function(x) nchar(x@text))
# Number of characters per day
dnchar <- tapply(tnchar, fdays, sum)
# Days without tweets come out as NA, so set them to 0
dnchar <- ifelse(is.na(dnchar), 0, dnchar)
# Print the total number of characters
print(sum(tnchar))
# Print the mean number of characters per tweet
print(round(mean(tnchar), 1))
# Print the mean number of characters per day
print(round(mean(dnchar)))

Output
[1] 165674
[1] 55.9
[1] 1284
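
The NA handling matters because tapply() returns NA for a factor level that has no observations. A small sketch with made-up numbers (three tweets, one empty day); the data is assumed, only the tapply() pattern comes from the slide:

```r
# Made-up data: character counts of three tweets and the day each was posted
tnchar <- c(10, 20, 5)
fdays <- factor(c("12/01", "12/01", "12/03"),
                levels = c("12/01", "12/02", "12/03"))

# Sum per day; 12/02 has no tweets, so its cell is NA rather than 0
dnchar <- tapply(tnchar, fdays, sum)
print(dnchar)

# Replace the NA cells with 0 before taking means or totals
dnchar[is.na(dnchar)] <- 0
print(dnchar)
```

Without the replacement, mean(dnchar) would itself be NA as soon as one day had no tweets.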

tapply
tapply(X, INDEX, FUN = NULL, ..., simplify = TRUE)

Groups the elements of X that share the same INDEX value, passes each group to FUN,
and returns the results collected together

> # Sum the odd-numbered elements and the even-numbered elements separately
> tmp <- tapply(1:4, c("odd", "even", "odd", "even"),
+               function(x) { cat("begin "); print(x); cat("end\n"); return(sum(x)) })
begin [1] 2 4
end
begin [1] 1 3
end
> tmp
even  odd 
   6    4 

Communication rate
# Extract the text of each tweet
texts <- sapply(tweets, function(x) x@text)
# Number of tweets that contain an account name
ncomu <- sum(grepl("(?<!\\w)@\\w+(?!@)", texts, perl = TRUE))
# Communication rate
print(round(ncomu / length(tweets), 3))

Output
[1] 0.358

Tweets over the last 30 days

# Reuse fdays (the factor of tweet dates) from the basic stats
fdays30 <- fdays
# If the period covers more than 30 days, drop the older data
if (length(levels(fdays)) > 30) {
  fdays30 <- fdays[!(fdays %in% levels(fdays)[1:(length(levels(fdays)) - 30)]),
                   drop = TRUE]
}
plot(fdays30, xlab = "", ylab = "", col = "red", border = "red", space = 0.7)


[Figure: the R output compared with the Twilog graph]

The ggplot2 package
# Load the ggplot2 package
library(ggplot2)
# Which data to use
c <- ggplot(mapping = aes(fdays30))
# How to plot it
c <- c + geom_bar(fill = "red", alpha = 0.7, width = 0.7) + xlab("") + ylab("")
# Plot!
print(c)

http://www.slideshare.net/syou6162/tsukuba

Pretty, but...

the axis labels are too dense

Fixing the tick labels

# Reduce the tick labels to 15
c <- c +
  scale_x_discrete(breaks = levels(fdays30)[seq(1, length(levels(fdays30)), length.out = 15)])
# Plot!
print(c)

After that, the only difference left is the date notation

Tweets per month
# Graph color
color <- "yellow3"
# Express the tweet timestamps as year/month
months <- format(dates, "%Y/%m")
# months as a factor
fmonths <- factor(months, levels = rev(unique(months)), ordered = TRUE)
# Plot with plot.factor
plot(fmonths, xlab = "", ylab = "", col = color, border = color, space = 0.7)
# Plot with ggplot2
c <- ggplot(mapping = aes(fmonths))
c <- c + geom_bar(fill = color, alpha = 0.7, width = 0.7) + xlab("") + ylab("")
print(c)

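Tweets per day of the week
# Graph color
color <- "blue"
# Make %a produce English weekday names
Sys.setlocale("LC_TIME", "en_US")
# Express the tweet timestamps as weekdays
wdays <- format(dates, "%a")
# wdays as a factor
fwdays <- factor(wdays, levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat",
                                   "Sun"), ordered = TRUE)
plot(fwdays, xlab = "", ylab = "", col = color, border = color, space = 0.7)
c <- ggplot(mapping = aes(fwdays))
# Bar layer as in the monthly plot (not shown on the original slide)
c <- c + geom_bar(fill = color, alpha = 0.7, width = 0.7) + xlab("") + ylab("")
print(c)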

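Tweets per hour of the day

# Graph color
color <- "chocolate1"
# Express the tweet timestamps as hours
times <- format(dates, "%k")
# times as a factor
ftimes <- factor(times, levels = sprintf("%2d", 0:23), ordered = TRUE)
plot(ftimes, xlab = "", ylab = "", col = color, border = color, space = 0.7)
c <- ggplot(data.frame(), aes(ftimes))
# Bar layer as in the monthly plot (not shown on the original slide)
c <- c + geom_bar(fill = color, alpha = 0.7, width = 0.7) + xlab("") + ylab("")
print(c)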

Cumulative number of tweets
# Graph color
color <- "green"
# Cumulative number of tweets
cumsums <- cumsum(table(fdays))
# Range of the vertical axis
yrange <- c(0, cumsums[length(cumsums)])
# Trend over the last 30 days
y <- cumsums[-(1:(length(cumsums) - 30))]
# Plot with plot.default
plot(y, ylim = yrange, col = color, type = "l", xaxt = "n", xlab = "", ylab = "")
axis(1, labels = names(y), at = 1:min(30, length(cumsums)))
# Plot with ggplot2 (form <- "%m/%d")
c <- ggplot(mapping = aes(as.Date(levels(fdays30), format = form), y))
c <- c + geom_line(color = color) + xlab("") + ylab("") + ylim(yrange)
print(c)


The notation is different!!
29-Nov vs. 11/29

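Fixing the tick labels

# Specify the format of the tick labels (form <- "%m/%d")
# (this is the ggplot2 API as of 2010; in ggplot2 2.0.0 and later the
#  equivalent is scale_x_date(date_labels = form))
c <- c + scale_x_date(format = form)
# Plot!
print(c)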


11/29

Favmemo
Manage your Twitter favs with tags

http://favmemo.com/

Favmemo

Feel free to use it!
http://favmemo.com/

favolog
Easily manage your Twitter favorites

http://favolog.org/

Favmemo

P-please use it, if you like...

http://favmemo.com/

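Clustering
"On hearing 'C': those who think 'carbon?' are the chemistry cluster. Those who think 'capacitance? coulomb?' are the physics cluster. Those who think 'integration? combinations?' are the math cluster. Those who think 'the language?' are the computer science cluster. Those who say 'hey, stop it, at least I got the credit!' are university students. Those who say 'nice!' are men. That should be about right."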

I want to group similar people together!

Approach
http://www.slideshare.net/bob3/tokyo-webmining5

Who favorites whom

Who follows whom

http://www.slideshare.net/guest91c5ac/twitters-social-network-analysis

I'm not great with anything social, so...

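Document clustering
• Treat all of one user's tweets, concatenated, as one document

• Vectorize with the bag-of-words model
Split the document into morphemes and use their frequencies as the vector elements

• After that, it is ordinary clustering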

As a trial, let's cluster the people I follow
It might even help with building lists...
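
The bag-of-words step above can be sketched in base R. This is only an illustration under simplifying assumptions: the two "users" and their texts are made up, and tokens are split on spaces, whereas the talk uses morphological analysis with RMeCab for Japanese text.

```r
# Made-up documents: one string per user (all of that user's tweets joined)
docs <- c(alice = "r is fun and r is free",
          bob   = "python is fun")

# Naive tokenization on spaces (a stand-in for morphological analysis)
tokens <- strsplit(docs, " ", fixed = TRUE)

# Vocabulary shared by all documents
vocab <- sort(unique(unlist(tokens)))

# One column per user, one row per term, cells are term frequencies
bow <- vapply(tokens,
              function(tok) as.integer(table(factor(tok, levels = vocab))),
              integer(length(vocab)))
rownames(bow) <- vocab
print(bow)
```

Each column is the vector for one user, which is the same orientation as the term-by-document matrix used later in the talk.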

Preprocessing the data
# tweetVec holds the tweets of the people I follow, plus my own
# Remove URLs
tweetVec <- gsub("http:\\/\\/[\\w.#&%@\\/\\-\\?=]*", " ",
                 tweetVec, perl = TRUE)
# Remove hashtags
tweetVec <- gsub("(?<!\\w)#\\w+", " ", tweetVec, perl = TRUE)
# Remove user names
tweetVec <- gsub("(?<!\\w)@\\w+(?!@)", " ", tweetVec, perl = TRUE)
# Ideally we would also apply Unicode normalization

The RMeCab package
# Load the RMeCab package
library(RMeCab)
# Parts of speech to extract (all of them)
pos <- c("その他","フィラー","感動詞","記号","形容詞","助詞",
         "助動詞","接続詞","接頭詞","動詞","副詞","名詞","連体詞")
# Example vector (the tongue twister "sumomo mo momo mo momo no uchi")
print(docMatrixDF("すもももももももものうち", pos))

Output
to make data frame

OBS.1
うち 1
すもも 1
の 1
も 2
もも 2

Term-tweet matrix
# Extract content words only
pos <- c("感動詞","形容詞","接続詞","動詞","副詞","名詞","連体詞")
# Build the term-tweet matrix
tweetMat <- docMatrixDF(tweetVec, pos)
# Weight by IDF
idf <- globalIDF(tweetMat)
tweetMat <- tweetMat * idf
# Normalize
tweetMat <- t(t(tweetMat) * mynorm(tweetMat))
# tweetMat <- docMatrixDF(tweetVec, pos, weight = "tf*idf*norm")
# gives the same result
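
To make the weighting and normalization concrete, here is a base-R sketch on a tiny made-up matrix. The IDF formula used here, log2(N / df) + 1, is an assumption for illustration and is not guaranteed to match what globalIDF() computes; the normalization scales each column (user) to unit Euclidean length.

```r
# Made-up term-by-user count matrix (rows: terms, columns: users)
m <- matrix(c(2, 0, 1,
              1, 1, 1,
              0, 3, 0),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("r", "is", "python"), c("u1", "u2", "u3")))

# Document frequency: in how many users' documents each term occurs
df <- rowSums(m > 0)

# Assumed IDF formula: rarer terms get larger weights
idf <- log2(ncol(m) / df) + 1

# Multiply every row by its term's IDF (idf recycles down the columns)
w <- m * idf

# Scale each column to unit length so long and short documents are comparable
w <- t(t(w) / sqrt(colSums(w^2)))
print(round(w, 3))
```

After this step every column has length 1, so the dot product of two columns is their cosine similarity.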

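Hierarchical clustering
# Build the distance matrix between the tweet documents (one per user)
d <- dist(t(tweetMat))
# Complete linkage
hc <- hclust(d)
# Show the dendrogram
plot(hc)
# Cut where the number of clusters becomes 5
hclabel <- cutree(hc, k = 5)
# Distribution of the clusters
print(table(hclabel))

Output
hclabel
 1  2  3  4  5 
85  3  1  1  3 
Huh?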

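Single-pass clustering
1. Take one document
It becomes the centroid of the first cluster
[Figure: one centroid, the green cluster]

2. Take another document
Is its similarity to an existing cluster above the threshold?
Yes → put it in the most similar cluster
No → make it the centroid of a new cluster
[Figure: two centroids, the green and orange clusters]

3. Repeat until no documents are left
[Figure: three centroids, the green, orange, and blue clusters]
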
# Store the names / IDs
users <- names(tweetVec)
# Similarity function (cosine similarity)
sim <- function(x, y) {
  sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
}
# Threshold
th <- 0.1
# Variable that holds the result
clusters <- list(centroid = list(), user = list())
# Visit the tweet documents (accounts) in random order
index <- sample(1:ncol(tweetMat))
clusters$user[[1]] <- users[index[1]]
clusters$centroid[[1]] <- tweetMat[,index[1]]

for (i in index[-1]) {
  x <- tweetMat[,i]
  # Skip zero vectors
  if (all(x == 0)) next
  sims <- sapply(clusters$centroid, sim, x) # similarity to each existing cluster
  if (max(sims) < th) {
    # Not similar enough to any cluster: start a new one
    target <- length(clusters$user) + 1
    clusters$user[[target]] <- users[i]
    clusters$centroid[[target]] <- x
  } else {
    # Join the most similar cluster
    target <- which.max(sims)
    clusters$user[[target]] <- c(clusters$user[[target]], users[i])
    # Update the centroid (running mean)
    clusters$centroid[[target]] <- clusters$centroid[[target]] + (x -
      clusters$centroid[[target]]) / length(clusters$user[[target]])
  }
}

# Size of each cluster
print(sapply(clusters$user, length))

Output
[1] 85 1 4 1 1 1

Huh, again?
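
To check the single-pass logic independently of the tweet data, the same procedure can be run on a tiny made-up matrix where the expected grouping is obvious. This is a sketch, not the talk's program: the function name single_pass, the fixed visiting order (columns left to right, instead of a random order), and the toy vectors are all assumptions.

```r
# Cosine similarity between two vectors
cosine <- function(x, y) sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))

# Single-pass clustering over the columns of mat, visited left to right.
# Returns a list with one integer vector of column indices per cluster.
single_pass <- function(mat, th) {
  centroids <- list(mat[, 1])   # the first document seeds the first cluster
  members <- list(1L)
  for (i in seq_len(ncol(mat))[-1]) {
    x <- mat[, i]
    sims <- vapply(centroids, cosine, numeric(1), y = x)
    if (max(sims) < th) {
      # Below the threshold for every cluster: start a new cluster
      centroids[[length(centroids) + 1]] <- x
      members[[length(members) + 1]] <- i
    } else {
      # Join the most similar cluster and update its centroid (running mean)
      k <- which.max(sims)
      members[[k]] <- c(members[[k]], i)
      centroids[[k]] <- centroids[[k]] + (x - centroids[[k]]) / length(members[[k]])
    }
  }
  members
}

# Four made-up documents: a and b point one way, c and d another
toy <- cbind(a = c(1, 0), b = c(0.9, 0.1), c = c(0, 1), d = c(0.1, 0.9))
res <- single_pass(toy, th = 0.5)
print(res)   # columns 1 and 2 in one cluster, columns 3 and 4 in another
```

The result depends on both the threshold and the visiting order, which is one reason a real run can end up with one huge cluster and a few singletons.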

k-means
# Cluster into 5 clusters
km <- kmeans(t(tweetMat), 5)
print(km$size)

Output
[1] 22 42 12 13 4

Oh! That looks kind of promising!!
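
One thing to keep in mind: kmeans() starts from randomly chosen centers, so cluster sizes like the ones above can differ from run to run. A minimal sketch, on made-up two-dimensional points rather than tweetMat, of fixing the seed and using several restarts (the nstart argument) to make a run repeatable:

```r
# Fix the random seed so the example is repeatable
set.seed(1)

# Made-up data: 10 points near (0, 0) and 10 points near (5, 5)
pts <- rbind(matrix(rnorm(20, mean = 0, sd = 0.1), ncol = 2),
             matrix(rnorm(20, mean = 5, sd = 0.1), ncol = 2))

# nstart = 10 tries ten random initializations and keeps the best one
km <- kmeans(pts, centers = 2, nstart = 10)
print(km$size)
```

On well-separated data like this the two groups are recovered regardless of the start; on sparse, high-dimensional tweet vectors the choice of seed and nstart can matter much more.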

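Validating the result

Sorry

Excuse

To begin with, Osaka.R is an R study group, so rather than going into specialist talk about statistical analysis or machine learning, the emphasis ought to be on conveying what can be done with R, and <snip> so this is by no means laziness but an appropriate measure to keep the audience from falling asleep. It is not because I ran out of time. No, of course not.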

Who is most similar to me?
# Actually, column 1 holds my own tweets
me <- tweetMat[,1]
others <- as.data.frame(tweetMat[,-1])
# So, who could it be?
print(users[which.max(sapply(others, sim, me)) + 1])

Output
[1] "5482822"

Who is that?

id <- users[which.max(sapply(others, sim, me)) + 1]
# getUsers() is a home-made function
who <- getUsers(as.integer(id))
imgUri <- who$profile_image_url
library(RCurl) # provides getURLContent()
if (grepl("png", substr(imgUri, nchar(imgUri) - 2, nchar(imgUri)),
          ignore.case = TRUE)) {
  # PNG: read the downloaded bytes directly
  library(png); library(pixmap)
  png <- getURLContent(imgUri)
  img <- readPNG(png)
  plot(pixmapRGB(img))
} else {
  # Otherwise: save to a temporary file and read it as JPEG
  library(ReadImages)
  jpg <- file("tmp.jpg", "wb")
  writeBin(as.vector(getURLContent(imgUri)), jpg)
  close(jpg)
  jpg <- read.jpeg("tmp.jpg")
  plot(jpg)
}

My heart is pounding...

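Summary
• Reproduced the statistics shown by Twilog Stats

• Tried to cluster the people I follow, and it did not go well...

• Reproducing the Twilog Stats statistics feels like a good way to pick up the basics

• Even slides as shabby as these took real effort. Hats off to all the presenters.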

Description

As of July 2010, Twitter has changed their authentication structure to require OAuth. Until such a time that OAuth authentication is enabled in R and/or this requirement changes, these functions will not work properly.

[In short] Data that requires authentication can no longer be fetched

It is possible to authenticate via OAuth from R!
http://www.slideshare.net/tor_ozaki/r-oauth-for-twitter-5232160

http://d.hatena.ne.jp/a_bicky/20101119/1290162523

Why not get friendly with R
by using familiar data like Twitter?

Thank you for listening

Downloading the programs
Programs
https://github.com/abicky/osakar4_abicky

How to get the programs running
http://www.slideshare.net/abicky/intro-6027373

Supplement: factor
What does drop = TRUE do?
> (f <- factor(1:3)) # a factor
[1] 1 2 3
Levels: 1 2 3
> f[-2] # remove the 2nd element (the level is kept)
[1] 1 3
Levels: 1 2 3
> f[-2, drop = TRUE] # remove the 2nd element (the level is dropped too)
[1] 1 3
Levels: 1 3

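Supplement: communication rate
# Extract the text of each tweet
texts <- sapply(tweets, function(x) x@text)
# Number of tweets that contain an account name
# (What on earth is this regular expression!?)
ncomu <- sum(grepl("(?<!\\w)@\\w+(?!@)", texts, perl = TRUE))
# Communication rate
print(round(ncomu / length(tweets), 3))

(?<!pattern) : negative lookbehind
(?!pattern) : negative lookahead
For details, see:

"Mastering lookahead and lookbehind in regular expressions!" - Arabiki Nikki (あらびき日記)
cf. http://d.hatena.ne.jp/a_bicky/20100530/1275195072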
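
The effect of the negative lookbehind is easiest to see on a few sample strings. The texts below are made up for illustration; the pattern is the one from the slide. The lookbehind (?<!\\w) rejects an "@" that directly follows a word character, which keeps e-mail addresses from being counted as mentions.

```r
# Made-up sample texts
texts <- c("@a_bicky hello",              # reply at the start
           "thanks @user!",               # mention in the middle
           "mail me at foo@example.com",  # "@" right after a word character
           "no mention here")

# Same pattern as on the slide, with Perl-compatible syntax enabled
hits <- grepl("(?<!\\w)@\\w+(?!@)", texts, perl = TRUE)
print(hits)         # TRUE TRUE FALSE FALSE

# Share of texts that contain a mention
print(mean(hits))   # 0.5
```

Lookaround assertions need perl = TRUE here; R's default (extended) regular expressions do not support them.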

Supplement: ggplot2
A serious environment problem
> # Using a global variable causes no problem
> x <- c(1, 2, 1); ggplot(mapping = aes(x)) + geom_bar()
stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
> # Error, because by default y is looked up as a global variable
> (function(y) print(ggplot(mapping = aes(y)) + geom_bar()))(x)
Error in eval(expr, envir, enclos) : object 'y' not found
> # Specifying the environment makes it work
> (function(y) print(ggplot(mapping = aes(y),
+ environment = environment()) + geom_bar()))(x)
> # Or pass a data frame as well
> (function(y) print(ggplot(data = as.data.frame(y), mapping = aes(y))
+ + geom_bar()))(x)

Supplement: days of the week
Weekday names are displayed differently depending on the locale setting

> Sys.getlocale()
[1] "ja_JP.utf-8/ja_JP.utf-8/C/C/ja_JP.utf-8/ja_JP.utf-8"
> format(as.Date("2010-12-02"), "%a")
[1] "木"
> Sys.setlocale("LC_TIME", "en_US.utf-8")
[1] "en_US.utf-8"
> Sys.getlocale()
[1] "ja_JP.utf-8/ja_JP.utf-8/C/C/en_US.utf-8/ja_JP.utf-8"
> format(as.Date("2010-12-02"), format = "%a")
[1] "Thu"
