Big Questions for Social Media Big Data: Representativeness, Validity and Other
Methodological Pitfalls
Zeynep Tufekci
University of North Carolina at Chapel Hill
All content following this page was uploaded by Zeynep Tufekci on 06 May 2014.
Samples drawn using different hashtags can differ in important dimensions, as hashtags are embedded in particular cultural and socio-political frameworks. In some cases, the hashtag is a declaration of particular sympathy. In other cases, there may be warring messages as the hashtag emerges as a contested cultural space. For example, two years of regular monitoring of activity (checking for at least an hour once a week) on the hashtags #jan25 and #Bahrain shows their divergent nature. Those who choose to use #jan25 are almost certain to be sympathetic to the Egyptian revolution, while #Bahrain tends to be used by both supporters and opponents of the uprising in Bahrain. Data I systematically sampled on three occasions showed that only about 1 in 100 #jan25 tweets were neutral while the rest were all supporting the revolution. Only about 5 out of 100 #Bahrain tweets were neutral, and 15 out of 100 were strongly opposed to the uprising, while the rest, 80 out of 100, were supportive. In contrast, #cairotraffic did not exhibit any overt signs of political preference. Consequently, since the hashtag users are a particular community, and thus prone to selection biases, it would be difficult to generalize from their behavior to other samples. Political users may be more prone to retweeting, say, graphic content, whereas non-political users may react with aversion. Hence, questions such as “does graphic content spread quickly on Twitter” or “do angry messages diffuse more quickly” might have quite different answers if the sample is drawn through different hashtags.

Hashtag analyses can also be affected by user activity patterns. An analysis of twenty hashtags used during the height of Turkey’s Gezi Park protests in June 2013 (#occupygezi, #occupygeziparki, #direngeziparki, #direnankara, #direngaziparki, etc.) shows a steep rise in activity on May 30th when the protests began, dropping off by June 3rd (Figure 1). Looking at this graph, one might conclude that either the protests had died down, or that people had stopped talking about the protests on Twitter. Both conclusions would be very mistaken, as revealed by the author’s interviews with hundreds of protesters on the ground during the protests, online ethnography that followed hundreds of people active in the protests (some of them also interviewed offline), and monitoring of Twitter, trending topics, news coverage and the protests themselves.

What had happened was that as soon as the protests became the dominant story, large numbers of people continued to discuss them heavily (almost to the point that no other discussion took place on their Twitter feeds) but stopped using the hashtags except to draw attention to a new phenomenon or to engage in “trending topic wars” with ideologically opposing groups. While the protests continued, and even intensified, the hashtags died down. Interviews revealed two reasons for this. First, once everyone knew the topic, the hashtag was at once superfluous and wasteful on the character-limited Twitter platform. Second, hashtags were seen only as useful for attracting attention to a particular topic, not for talking about it.

In August 2013, a set of stairs near Gezi Park which had been painted in rainbow colors was painted over in drab gray by the local municipality. This sparked outrage as a symbolic moment, and many people took to Twitter under the hashtag #direnmerdiven (roughly “#occupystairs”). The hashtag quickly and briefly trended and then disappeared from the trending list as well as users’ Twitter streams. However, this would be a misleading measure of activity on the painting of the stairs, as monitoring a group who had been using the hashtag showed that almost all of them continued to talk about the topic intensively on Twitter, but without the hashtag. Over the next week, hundreds, maybe thousands, of stairs in Turkey were painted in rainbow colors as a form of protest, a phenomenon not at all visible in any data drawn from the hashtag.

Finally, most hashtags used to build big datasets are successful hashtags: ones that became well known, were distributed widely, and generated a large amount of interest. It is likely that the dynamics of such events differ significantly from those of less successful ones. In sum, hashtag datasets should be seen as self-selected samples with data “missing not at random” and interpreted accordingly (Allison, 2001; Meiman and Freund, 2012; Outhwaite et al, 2007).

All this is not to argue that hashtag datasets are not useful. On the contrary, they can provide illuminating glimpses into specific cultural and socio-political conversations. However, hashtag dataset analyses need to be accompanied by a thorough discussion of the culture surrounding the specific hashtag, and analyzed with careful consideration of selection and sampling biases.
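
The stance proportions reported for #jan25 and #Bahrain can be turned into a small worked illustration of why a hashtag-drawn sample may give hashtag-specific answers. The Python sketch below uses the stance mix per 100 tweets as stated in the text. The per-stance retweet rates are hypothetical values chosen only for illustration; they are assumptions, not findings of this paper, and the function name is likewise illustrative.

```python
# Stance mix per 100 sampled tweets, as reported in the text.
STANCE_MIX = {
    "#jan25": {"supportive": 99, "neutral": 1, "opposed": 0},
    "#Bahrain": {"supportive": 80, "neutral": 5, "opposed": 15},
}

# HYPOTHETICAL probability that a user of each stance retweets a piece of
# graphic content. These numbers are assumptions for illustration only.
ASSUMED_RETWEET_RATE = {"supportive": 0.30, "neutral": 0.05, "opposed": 0.02}


def expected_retweet_rate(stance_counts, rate_by_stance):
    """Weighted average retweet rate implied by a sample's stance mix."""
    total = sum(stance_counts.values())
    weighted = sum(
        count * rate_by_stance[stance]
        for stance, count in stance_counts.items()
    )
    return weighted / total


# One expected rate per hashtag sample, using the same assumed behavior.
rates = {
    tag: expected_retweet_rate(mix, ASSUMED_RETWEET_RATE)
    for tag, mix in STANCE_MIX.items()
}

for tag, rate in rates.items():
    print(f"{tag}: expected retweet rate {rate:.4f}")
```

Under these assumed rates, the same question (“how often is graphic content retweeted?”) yields a different answer for each sample purely because of who uses each hashtag, even though individual behavior is held fixed.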
There might be ways to structure the sampling of Twitter datasets so that the hashtag is not the sole criterion. For example, Freelon, Lynch and Aday (2014) extracted a dataset first based on the use of the word “Syria” in Arabic or English, and then extracted hashtags from that dataset while also performing analyses on the wider dataset. Another method might be to use the hashtag to identify a sample of users and then collect tweets of those users (who will likely drop using the hashtag) rather than collecting the tweets via the hashtag.

Above all, hashtag analyses should start from the principle of understanding user behavior first, and should follow the user rather than following the hashtag.

3. The Missing Denominator: We Know Who Clicked But We Don’t Know Who Saw Or Could:

One of the biggest methodological dangers of big data analyses is insufficient understanding of the denominator. It’s not enough to know how many people have “liked” a Facebook status update, clicked on a link, or “retweeted” a message without knowing how many people saw the item and chose not to take any action. We rarely know the characteristics of the sub-population that sees the content even though that is the group, and not the entire population, from which we are sampling. Normalization is rarely done, or may even be actively decided against because the results start appearing more complex or more trivial (Cha, 2008).

While the denominator is often not calculable, it may be possible to estimate. One measure might be “potential exposure,” corresponding to the maximum number of people who may have seen a message. However, this highlights another key issue: the data is often proprietary (Boyd and Crawford, 2012). It might be possible to work with the platforms to get estimates of visibility, click-through and availability. For example, Facebook researchers have disclosed that the mean and median fraction of a user’s friends that see status update posts is about 34 to 35%, though the distribution of the variable seems to have a large spread (Bernstein et al., 2013).

With some disclosure from proprietary platforms, it may be possible to calculate “likely” exposure numbers based on “potential” exposure, similar to the way election polls model “likely” voters or TV ratings try to measure people watching a show rather than just being in the room where the TV is on. Steps in this direction are likely to be complex and difficult, but without such efforts, our ability to interpret raw numbers will remain limited. The academic community should ask for more disclosure and access from the commercial platforms.

It’s also important to normalize by underlying populations when comparing “clicks,” “links,” or tweets. For example, Aday et al. (2012) compare numbers of clicks on bit.ly links in tweets containing hashtags associated with the Arab uprisings and conclude that “new media outlets that use bit.ly are more likely to spread information outside the region than inside it.” This is an important finding. However, interpretation of this finding should take into account the respective populations of Twitter users in the countries in question. Egypt’s population is about 80 million, about 1 percent of the global population. Any topic of global interest about Egypt could very easily generate a larger absolute number of clicks outside the country even if the activity within the country remained much more concentrated in relative proportions. Finally, the size of these datasets makes traditional measures like statistical significance less valuable (Meiman and Freund, 2012), a problem exacerbated by lack of information about the denominator.

4. Missing the Ecology for the Platform:

Most existing big data analyses of social media are confined to a single platform (often Twitter, as discussed). However, most of the topics of interest in such studies, such as influence or information flow, can rarely be confined to the Internet, let alone to a single platform. The difficulty in obtaining high-quality multi-platform data does not mean that we can treat a single platform as a closed and insular system. Information in human affairs flows through all available channels.

The emergent media ecology is a mix of old and new media which is not strictly segregated by platform or even by device. Many “viral” videos take off on social media only after being featured on broadcast media, which often follows their being highlighted on intermediary sites such as Reddit or Buzzfeed. Political news flowing out of Arab Spring uprisings to broadcast media was often curated by sites such as Nawaat.org that had emerged as trusted local information brokers. Analysis from Syria shows a similar pattern (Aday et al. 2014). As these examples show, the object of analysis should be this integrated ecology, and there will be significant shortcomings in analyses which consider only a single platform.

Link analyses on hashtag datasets for the Arab uprisings show that the most common links from social media are to the websites of broadcast media (Aday et al. 2012). The most common pattern was that users alternate between Facebook, Twitter, broadcast media, cell-phone conversations, texting, face-to-face and other methods of interaction and information sharing (Tufekci & Wilson, 2012).

These challenges do not mean single-platform analyses are not valuable. However, all such analyses must take into account that they are not examining a closed system and that there may be effects which are not visible because the relevant information is not contained within that platform. Methodologically, single-platform studies can be akin to looking for our keys under the light. More research, admittedly much more difficult and expensive than scraping data
from one platform, is needed to understand broader patterns of connectivity. Sometimes, the only way to study people is to study people.

Inferences and Interpretations

The question of inference from analyses of social media big data remains underconceptualized and underexamined. What’s a click? What does a retweet mean? In what context? By whom? How do different communities interpret these interactions? As with all human activities, interpreting online imprints engages layers of complexity.

1. What’s in a Retweet? Understanding our Data:

The same act can have multiple, even contradictory meanings. In many studies, for example, retweets or mentions are used as proxies for influence or agreement. This may hold in some contexts; however, there are many conceptual steps and implicit assumptions embedded in this analysis. It is clear that a retweet is information exposure and/or reaction; however, after that, its meaning could range from affirmation to denunciation to sarcasm to approval to disgust. In fact, many social media acts which are designed as “positive” interactions by the platform engineers, ranging from Retweets on Twitter to even “Likes” on Facebook, can carry a range of meanings, some quite negative.

Figure 2: Retweeted widely, but mostly in disgust

As an example, take the recent case of the Twitter account of fashion store @celebboutique. In July 2012, the account tweeted with glee that the word “#aurora” was trending and attributed this to the popularity of a dress named #aurora in its shop. The hashtag was trending, however, because Aurora, Colorado was the site of a movie theatre massacre on that day. There was an expansive backlash against @celebboutique’s crass and insensitive tweet. There were more than 200 mentions and many hundreds of retweets with angry messages in as little as sixty seconds. The tweet itself, too, was retweeted thousands of times (see Figure 2). After about an hour, the company realized its mistake and stepped in. This was followed by more condemnation: a few hundred mentions per minute at a minimum (for more analysis, see Gilad, 2012). Hence, without understanding the context, the spike in @celebboutique mentions could easily be misunderstood.

Polarized situations provide other examples of “negative retweets.” For example, during the Gezi protests in Turkey, the mayor of Ankara tweeted personally from his account, often until late hours of the night, engaging Gezi protesters individually in his idiosyncratic style, which involved the use of “ALL CAPS” and colorful language. He became highly visible among supporters as well as opponents of these protests. His visibility, combined with his style, meant that his tweets were widely retweeted, but not always by supporters. Gezi protesters would retweet his messages and then follow the retweet with a negative or mocking message. His messages were also retweeted without comment by people whose own Twitter timelines made clear that their intent was to “expose” or ridicule, rather than agree. A simple aggregation would find that thousands of people were retweeting his tweets, which might be interpreted as influence or agreement.

One of the most cited Twitter studies (Kwak et al.) grapples with how to measure influence, and asks whether the number of followers or the number of retweets is a better measure. That paper settles on retweets, stating that “The number of retweets for a certain tweet is a measure of the tweet’s popularity and in turn of the tweet writer’s popularity.” The paper then proceeds to rank users by the total number of retweets, and refers to this ranking alternatively as influence or popularity. Another important social media study, based on Twitter, speaks of in-degree (number of followers) as a user’s popularity, and retweets as influence (Cha et al., 2010). Both are excellent studies of retweet and following behavior, but in light of the factors discussed above, “influence” and “popularity” may not be the best terms to use for the variables they are measuring. Some portion of retweets and follows are, in fact, negative or mocking, and do not represent “influence” in the way it is ordinarily understood. The scale of such behavior remains an important, unanswered question (Freelon, 2014).

2. Engagement Invisible to Machines: Subtweets, Hate-Links, Screen Captures and Other Methods:

Social media users engage in practices that alter their visibility to machine algorithms, including subtweeting, discussing a person’s tweets via “screen captures,” and hate-linking. All these practices can blind big data analyses to this mode of activity and engagement.

Subtweeting is the practice of making a tweet referring to a person algorithmically invisible to that person (and consequently to automated data collection) even as the reference remains clear to those “in the know.” This manipulation of visibility can be achieved by referring to a person who has a Twitter handle without either “mentioning” this handle, or by inserting a space between the @
sign and the handle, or by using their regular name or a nickname rather than the handle, or even sometimes deliberately misspelling the name. In some cases, the reference can only be understood in context, as there is no mention of the target in any form. These many forms of subtweeting come with different implications for big data analytics.

For example, a controversial article by Egyptian-American Mona Eltahawy sparked a massive discussion in Egypt’s social media. In a valuable analysis of this discussion, sociologists Alex Hanna and Marc Smith extracted the tweets which mentioned her or linked to the article. Their network analysis revealed great polarization in the discussion, with two distinctly clustered groups. However, while watching this discussion online, I noticed that many of the high-profile bloggers and young Egyptian activists discussing the article (and greatly influencing the conversation) were indeed subtweeting. Later discussions with this community revealed that this was a deliberate choice that had been made because many people did not want to give Eltahawy “attention,” even as they wanted to discuss the topic and her work.

Figure 3: Two people “subtweeting” each other without mentioning names. The exchange was clear enough, however, to be reported in newspapers.

In another example drawn from my primary research on Turkey, Figure 3 shows a subtweet exchange between two prominent individuals that would be unintelligible to anyone who did not already follow the broader conversation and was not intimately familiar with the context. While each person is referring to the other, there are no names, nicknames, or handles. In addition, neither follows the other on Twitter. It is, however, clearly a direct engagement and conversation, if a negative one. A broad discussion of this “Twitter spat” on Turkish Twitter proved people were aware of this as a two-way conversation. It was so well understood that it was even reported in newspapers.

While the true prevalence of this behavior is hard to establish, exactly because the activity is hidden from large-scale, machine-led analyses, observations of Turkish Twitter during the Gezi protests of June 2013 revealed that such subtweets were common. In order to get a sense of its scale, I undertook an online ethnography in December 2013, during which two hundred Twitter users from Turkey, assembled as a purposive sample including ordinary users as well as journalists and pundits, were followed for an hour at a time, totaling at least 10 hours of observation dedicated to catching subtweets. This resulted in a collection of 100 unmistakable subtweets; many more were undoubtedly missed because they are not always obvious to observers. In fact, the subtweets were widely understood and retweeted, which increases the importance of such practices. Overall, the practice appears common enough to be described as routine, at least in Turkey.

Figure 4: Algorithmically Invisible Engagement: A columnist responds to critics by screen captures.

Using screen captures rather than quotes is another practice that adds to the invisibility of engagement to algorithms. A “caps” is done when Twitter users reference each other’s tweets through screen captures rather than links, mentions or quotes. An example is shown in Figure 4. This practice is so widespread that a single hour following the same purposive sample resulted in more than 300 instances in which users employed such “caps.”

Yet another practice, colloquially known as “hate-linking,” limits the algorithmic visibility of engagement, although this one is potentially traceable. “Hate-linking” occurs when a user links to another user’s tweet rather than mentioning or quoting the user. This practice, too, would skew analyses based on mentions or retweets, though in this case, it is at least possible to look for such links.

Subtweeters, “caps” users, and hate-linkers are obviously a smaller community than tweeters as a whole. While it is unclear how widespread these practices truly are, studying Turkish Twitter shows that they are not uncommon, at least in that context. Other countries might have specific
social media practices that confound big data analytics in different ways. Overall, a simple “scraping” of Turkish Twitter might produce a polarized map of groups not talking to each other, whereas the reality is a polarized situation in which contentious groups are engaging each other but without the conventional means that make such conversations visible to algorithms and to researchers.

3. Limits of Methodological Analogies and Importing Network Methods from Other Fields:

Do social media networks operate through similar mechanisms as networks of airlines? Does information work the way germs do? Such questions are rarely explicitly addressed even though many papers import methods from other fields on the implicit assumption that the answer is a yes. Studies that do look at this question, such as Romero et al. (2011) and Lerman et al. (2010), are often limited to a single platform, or a few, which limits their explanatory power because information among humans does not diffuse in a single platform whereas viruses do, indeed, diffuse in a well-defined manner. To step back further, even representing social media interactions as a network requires a whole host of implicit and important assumptions that should be considered explicitly rather than assumed away (Butts, 2009).

Epidemiological or contagion-inspired analyses often treat connected edges in social media networks as if they were “neighbors” in physical proximity. In epidemiology, it is reasonable to treat “physical proximity” as a key variable, assuming that adjacent nodes are “susceptible” to disease transmission for very good reason: the underlying model is a well-developed, empirically verified germ theory of disease in which small microbes travel in actual space (where distance matters) to infect the next person by entering their body. This physical process has well-understood properties, and underlying probabilities can often be calculated with precision.

Creating an analogy from social media interactions to physical proximity may be reasonable and justified under certain conditions and assumptions, but this step is rarely subjected to critical examination (Salathé et al, 2013). There are significant differences between germs and information traveling in social media networks. Adjacency in social media is multi-faceted; it cannot always be mapped to physical proximity; and human “nodes” are subject to information from a wide range of sources, not just those they are connected to in a particular social media platform. Finally, whether there is a straightforward relation between information exposure and the rate of “influence,” as there often is for exposure to a disease agent and the rate of infection, is something that should be empirically investigated, not assumed.

There are clearly similar dynamics in different types of networks, human and otherwise, and the different fields can learn much from each other. However, importation of methods needs to rely on more than some putative universal, context-independent property of networked interaction simply by virtue of the existence of a network.

4. Field Effects: Non-Network Interactions

Another difference between spatial or epidemiological networks and human social networks is that human social information flows do not occur only through node-to-node networks but also through field effects, large-scale societal events which impact a large group of actors contemporaneously. Big events, weather occurrences, etc. all have society-wide field effects and often do not diffuse solely through interpersonal interaction (although they also do greatly impact interpersonal interaction by affecting the agenda, mood and disposition of individuals).

For example, most studies agree that Egyptian social media networks played a role in the uprising which began in Egypt in January 2011 and were a key conduit of protest information (Lynch, 2012; Aday et al, 2012; Tufekci and Wilson, 2012). However, there was almost certainly another important information diffusion dynamic. The influence of the Tunisian revolution itself on the expectations of the Egyptian populace was a major turning point (Ghonim, 2012; Lynch, 2012). While analysis of networks in Egypt might not have revealed a major difference between the second and third week of January 2011, something major had changed in the field. To translate it into epidemiological language, due to the Tunisian revolution and the example it presented, all the “nodes” in the network had a different “susceptibility” and “receptivity” to information about an uprising. The downfall of the Tunisian president, which showed that even an enduring autocracy in the Middle East was susceptible to street protests, energized the opposition and changed the political calculation in Egypt. This information was diffused through multiple methods, and broadcast media played a key role. Thus, the communication of the Tunisia effect to the Egyptian network was not necessarily dependent on the network structure of social media.

Figure 5: Clear meaning only in context and time.

Social media itself is often incomprehensible without reference to field events outside it. For example, take the tweet in Figure 5. The tweet merely states: “Getting
crowded under that bus.” Strangely, it has been retweeted more than sixty times and favorited more than 50. For those following in real time, this was an obvious reference to New Jersey Governor Chris Christie’s press conference in which he blamed multiple aides for the closing of a bridge which caused massive traffic jams, allegedly to punish a mayor who did not endorse him. Without understanding the Chris Christie press conference, neither the tweet nor the many retweets of it are interpretable.

The turn to networks as a key metaphor in social sciences, while fruitful, should not diminish our attention to the multi-scale nature of human social interaction.

[...] sensitive to. That hashtag indeed trended worldwide. Similar coordinated campaigns are common in Turkey and occurred almost every day during the contentious protests of June 2013.

Such behaviors, aimed at avoiding detection, amplifying a signal, or other goals, by deliberate gaming of algorithms and metrics, should be expected in all analyses of human social media. Currently, many studies do take into account “gaming” behaviors such as spam and bots; however, coordinated or active attempts by actual people to alter metrics or results, which often can only be discovered through qualitative research, are rarely taken into account.