Mirror of Project Veritas Google Leak 2019
The folders in this repo came from the "Everything" zip (Don't Be Evil.zip); this README is my own commentary on what's good in the leak.
- Several conversation threads about Breitbart and InfoWars being blocked for hate speech and conspiracy theory, respectively.
- A few slides of Google powerpoints suggesting human reviewers of news videos, maybe taken out of context, not too interesting.
- a list of keywords that Google blacklists
from YouTube searches (or videos? not sure)
- Mass shooting attacks in Las Vegas, New York, Texas etc.
- Conspiracy theories like Crisis Actors
- Porn queries
- Depression? Colgate?
- Just some threads (3 PDF files) of googlers wanting to add an easter egg to Google Translate to say what "covfefe" means, which looks like they had problems adding it and backed out of the idea altogether.
- Easter egg translates "cov fe'fe" and "covfefe" into
( ̄\_(ツ)_/ ̄).
- Had problems with Arabic trying to actually translate it into something else in their language and had to roll it back.
Not much to see here.
- Lots of documents on how Google is combatting fake news, by withholding AdSense on sites they deem to be misrepresenting information. Sounds reasonable to me.
- Study end-to-end how a publisher's traffic is promoted (i.e. on social media), the content of the site, how the publisher represents their content, their relationship to other scammers, and other properties by manual human review to see if it falls under the Fake News policy. Sounds reasonable.
- Only AdSense is blocked on these Take News Sites, but DoubleClick ads may still be used on those sites (global politics and "ads as a platform" standards)
There's also a couple resumes in here of Googlers where they describe the products they've worked on and the features they've added. Pretty entertaining look at the "behind the scenes" at Google but nothing crazy jumps out at me. Worked with Google search ranking algorithms.
Fwd_ Fake News-letter 11_27_ Efforts to combat spread of (mis_dis)information - Google Groups.pdf
Some interesting stuff in here:
Goal: Establish “single point of truth” for definition of “news” across Google products. Mitigate risk of low-quality sources and misinformation in Google News corpus.
Goal: Establish and streamline news escalation processes to detect and handle misinformation across products during crises. Install 24/7 team of trained analysts ready to make policy calls and take actions across news surfaces including News, News 360 and Feed.
These could be abused maybe? No indication they intend to misuse them.
news black list site for google now.txt
# Manual list of sites excluded from appearing as Google Now stories to read # results. The urls are used with a UrlMatcher, and should be in the format # specified in: webutil/urlutil/urlmatcher.h
Notice: per the filename and comment at the top of the file, this is a blacklist for stories appearing in the Google Now app on Android. Some sites (GTA 5 Mods, APK Mirror) make sense to be suppressed from the feed IMHO. You also don't want to throw a page from 4Chan on somebody unexpectedly.
Some sites I recognized or have heard of before:
- apkmirror.com: Android app mirror site, not sure why it's on here.
- play.google.com, drive.google.com, docs.google.com
- torrentfreak.com and several similar (thepiratebay)
- newsbusters.org (pointed out in Veritas interview, I don't otherwise know of it)
- glennbeck.com (know the name, don't know about him though)
- yugiohblog.konami.com: a site about the Yu-Gi-Oh trading card game?
- smosh.com: YouTuber site
- queerty.com - LGBT magazine
- voat.co - a Reddit clone, login required now o.O
- reddit.com/r/interestingasfuck - oh they even filter by subreddits, how nice
Lots of domains that sound like fake news sites, but I didn't check them out myself. See the file for yourself.
This section of the file has what looks like a bunch of fake news sites (.com.co domain suffix? really?):
# START: sites flagged for peddling hoax stories. abcnews.com.co/ actionnews3.com/ cbsnews.com.co/ channel-7-news.com/ civictribune.com/ drudge-report.co/ independencetribune.com/ nbc.com.co/ neonnettle.com/ now8news.com/ tdtalliance.com/ theracketreport.com/ therightists.com/ thirdestatenewsgroup.com/ tipsforsurvivalists.com/ worldnewsdailyreport.com/ # END: sites flagged for peddling hoax stories.
I pinged a few and a lot of them don't even exist anymore.
Page level domain restrction 2017_10_02_us_las-vegas-attack-deadliest-us-mass-shooting-trnd_index.pdf
A bug ticket within Google to add "page level domain restriction" on several links to news articles talking about the Las Vegas attack. Some example pages:
427 URLs in total.
- Are you adding or removing violations?
- Which violations would you like to add?
- Please provide a brief justification for this request.
Las Vegas Mandalay Bay Shooting
Per a commenter in the bug ticket thread this is a "URL takedown request", but not clear what service. Probably Google Search listing. Could be just adding a "Sensitive" label to these links too.
PowerPoint slide about responding quickly to real-world events.
Detect real world events
Is this Query Trending?
Fast triggering: <5 mins after the event
Fast serving: 5ms average / 40ms 99% percentile
They use Twitter as a signal for rapid-fire tweets about breaking news
Updates Google search autocomplete quickly (type "p" and auto-suggest "prince dead" as one example in the slides)
To be checked.
To be checked.
Machine Learning Fairness
What it seems they're going for at Google:
- Machine learning collects data from the real world and then produces results based on real world data, which isn't always comfortable with people. Things like implicit stereotypes or social biases that fully exist in the real world get reflected "by default" when machine learning studies the real world.
- Google is working on algorithms to try and steer the "actual results" into a more equal output, so no group of people feel marginalized by the Google machine learning results.
- Similar in concept to a "random number generator" in many videogames or music playing programs aren't actually random and have algorithmic bias to seem more in line with what the user expects. If your music player picks the same song twice in a row -- that's true randomness and happens, so instead it runs an algorithm to be more in line with the ideal expectation of the user.
Some people may have a problem with that and expect that Google Search reflects the world exactly as it is without changing it and I'm sure arguments could be made for either side.
The folder contains some examples:
- Google search for "american inventors" returns mostly African American
inventors (George Washington Carver, Lewis Howard Latimer...).
- A discussion thread about how it's a bug in the search algorithm.
- Google Image Team is working on forcing diversity into search results to work around the bug.
As is the 'our algorithm isn't actually that smart, it's just that "African American" has the word "American" in it'
Other examples of why ML Fairness was needed to intervene in Google Search:
- The term "doctor" would primarily return photos of men.
- Real-world diversity in surgeons: 81% Male, 19% Female
- 89,000 results for "male surgeon" on Google
- 119,000 results for "female surgeon" on Google
Still working through powerpoints on this but so far not very worrisome.