Technology today is so invasive that buzzwords like privacy, cookies and tracking appear in every news outlet that covers the web. But these reports usually just scratch the surface, leaving much of the underlying machinery in the dark. This causes the average internet denizen to fear or blame the wrong thing. In this article, I'll attempt to describe the different pieces that make up online tracking, in a way that goes beyond cookies and scripts.
The HTTP protocol
The web is built on top of HTTP. It's a protocol where the client requests what it wants from the server, and the server responds with what it has. At its most basic, it's also a stateless protocol: an individual request-response pair has no association with any other request-response pair. This makes the protocol simple - client sends a bunch of text, server returns a bunch of text, rinse and repeat. But without the ability to associate the multiple requests of a single client with one another, things like authentication, sessions and contextually-correct responses are impossible.
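To make the "bunch of text" idea concrete, here is a minimal sketch of what an HTTP/1.1 exchange looks like on the wire. The helper names `build_request` and `parse_response` are my own, made up for illustration, and the parser assumes a simple, well-formed response with no repeated headers.

```python
def build_request(method: str, path: str, host: str, headers: dict = None) -> str:
    """Build the text a client sends: request line, headers, blank line."""
    lines = [f"{method} {path} HTTP/1.1", f"Host: {host}"]
    for name, value in (headers or {}).items():
        lines.append(f"{name}: {value}")
    return "\r\n".join(lines) + "\r\n\r\n"


def parse_response(raw: str):
    """Split the text a server returns into status code, headers and body."""
    head, _, body = raw.partition("\r\n\r\n")
    status_line, *header_lines = head.split("\r\n")
    _version, status, _reason = status_line.split(" ", 2)
    headers = dict(line.split(": ", 1) for line in header_lines)
    return int(status), headers, body


request_text = build_request("GET", "/index.html", "news.example")
status, headers, body = parse_response(
    "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<h1>Hello</h1>"
)
```

Nothing in the request identifies who is asking. Two identical requests look the same to the server, which is exactly the statelessness described above.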
Cookies are small pieces of text stored in the browser. What makes cookies special is that they are sent to the server with an HTTP request, and received from the server with an HTTP response. When a request is made to a domain, any cookies stored for that domain are attached to the headers of the request (the Cookie header), making them readable by the server. On the flip side, when a server responds, it can optionally set cookies in the headers of the response (the Set-Cookie header), which the client uses to update its local stash of cookies. This makes cookies a popular way to store and sync state between client and server.
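The round trip can be sketched with Python's standard `http.cookies` module. This is a simplified illustration, not how a real browser works: the cookie jar is a plain dictionary, the function names are invented for this example, and attributes such as expiry and domain scoping are ignored.

```python
from http.cookies import SimpleCookie


def server_set_cookie(name: str, value: str) -> str:
    """Server side: build the value of a Set-Cookie response header."""
    cookie = SimpleCookie()
    cookie[name] = value
    cookie[name]["path"] = "/"
    return cookie[name].OutputString()


def client_store(jar: dict, set_cookie_header: str) -> None:
    """Client side: update the local stash from a Set-Cookie header."""
    parsed = SimpleCookie()
    parsed.load(set_cookie_header)
    for name, morsel in parsed.items():
        jar[name] = morsel.value


def client_cookie_header(jar: dict) -> str:
    """Client side: build the Cookie request header for the next request."""
    return "; ".join(f"{name}={value}" for name, value in jar.items())


def server_read_cookies(cookie_header: str) -> dict:
    """Server side: read the cookies the client sent back."""
    parsed = SimpleCookie()
    parsed.load(cookie_header)
    return {name: morsel.value for name, morsel in parsed.items()}


jar = {}
client_store(jar, server_set_cookie("session_id", "abc123"))
seen_by_server = server_read_cookies(client_cookie_header(jar))
```

After the first response, every later request to the same domain carries `session_id`, which gives the server a way to tie otherwise unrelated requests together.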
Tracking only happens if something on the page makes a request to a tracking domain, carrying with it the unique ID of the client and optional metadata. If nothing on the page makes this request, then the page is effectively a dead zone as far as trackers are concerned. This is where ads, analytics, social media buttons and similar widget scripts come in. In order for these scripts to work, they make requests to their respective domains to load things like functionality, assets and data. These same requests can also be used to pass tracking information to their respective domains, effectively making them the trackers.
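Here is a toy simulation of the tracker's side of that request. Everything in it is hypothetical: the `Tracker` class, the `tracker_id` cookie name and the example URLs are made up, and the page the script runs on is passed in as a `referer` argument to stand in for the metadata a real request would carry.

```python
import uuid
from collections import defaultdict


class Tracker:
    """A pretend tracking server that logs which pages each client ID visits."""

    def __init__(self):
        self.visits = defaultdict(list)

    def handle_request(self, cookies: dict, referer: str) -> dict:
        # Reuse the ID the browser sent, or mint a new one on first contact.
        client_id = cookies.get("tracker_id") or uuid.uuid4().hex
        self.visits[client_id].append(referer)
        # The returned dict stands in for a Set-Cookie response header.
        return {"tracker_id": client_id}


tracker = Tracker()
browser_cookies = {}  # the browser's cookies for the tracker's domain

# Two unrelated sites embed a script from the same tracking domain.
pages = ["https://news.example/article", "https://shop.example/shoes"]
for page in pages:
    response_cookies = tracker.handle_request(browser_cookies, referer=page)
    browser_cookies.update(response_cookies)
```

After the loop, `tracker.visits` holds a single ID mapped to both pages: the same client has been recognized across two different websites.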
Tracking is only effective if you can follow users wherever they go. This means having tracker scripts on as many websites as possible, which would normally require code-level access to each site to add the script. But trackers don't need that access when website owners have an incentive to add the scripts themselves. Analytics scripts provide user insights, social media buttons provide instant access to an audience, and ads provide additional profit. Because of these incentives, website operators voluntarily add these scripts to their websites, effectively doing the work for the trackers and spreading their presence across the internet. This ubiquity, coupled with the unique client ID, allows trackers to follow a user virtually anywhere on the internet.
The biggest reason why tracking exists is to gather data about the user. This data can then be used to power personalized content, improve marketing campaigns, improve user experience, improve website performance, and so on. One goal in particular is increasing conversion - the practice of making a drive-by user take action beyond just viewing the content. This could be a purchase, a subscription, a user registration, a newsletter sign-up, and so on. Because higher conversion rates are highly sought after, targeted advertising is very profitable, and user data even more valuable.
Real-time bidding is a protocol between ad publisher and ad supplier which facilitates automated buying and selling of ads. It starts when a user visits a page that requests an ad from an ad publisher. The ad publisher broadcasts this request to ad suppliers for bidding, together with metadata about the user. Each ad supplier selects its most relevant ad based on the request metadata and any historical data it has on the user, and submits a bid. The ad publisher then selects the highest bid and renders the winner's ad. In addition, the winning ad supplier is allowed to make a client request back to its own servers for its own tracking. The data used in the process, both from the ad publisher and from the ad supplier's own tracking, will then be used by the ad supplier to submit better bids in the future.
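The auction step can be sketched in a few lines. This is a deliberately simplified model that follows the description above (highest bid wins). The supplier functions, ad texts and prices are invented for illustration, and real exchanges add pricing rules, timeouts and much more.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Bid:
    supplier: str
    ad: str
    price: float


# A bidder takes the request metadata and returns a bid, or None to pass.
Bidder = Callable[[dict], Optional[Bid]]


def shoe_supplier(request: dict) -> Optional[Bid]:
    # Bids high only when the user metadata suggests an interest in shoes.
    interested = "shoes" in request.get("interests", [])
    return Bid("shoe_ads", "Running shoes, 20% off", 2.50 if interested else 0.10)


def generic_supplier(request: dict) -> Optional[Bid]:
    # Knows nothing about the user, so it always bids the same amount.
    return Bid("generic_ads", "Try our newsletter", 0.50)


def run_auction(request: dict, suppliers: List[Bidder]) -> Optional[Bid]:
    """Broadcast the request to every supplier and pick the highest bid."""
    bids = []
    for bidder in suppliers:
        bid = bidder(request)
        if bid is not None:
            bids.append(bid)
    return max(bids, key=lambda bid: bid.price, default=None)


winner = run_auction({"interests": ["shoes"]}, [shoe_supplier, generic_supplier])
```

With the shoe interest in the request metadata, the shoe supplier outbids the generic one. Without it, the generic ad wins. That difference is why suppliers value user data.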
Most major browsers have some form of tracker blocker. Safari has its Intelligent Tracking Prevention, powered by on-device machine learning. Firefox has its own Tracking Protection, powered by Disconnect. Chrome blocks ads that do not follow the Better Ads Standards. Opera comes with an ad blocker that's compatible with Adblock Plus block lists. And of course, there are third-party extensions like Adblock Plus. While they all take different approaches and block to different degrees, they all have one common goal: prevent the client from being given a persistent ID. Because once an ID is established, a rogue request can send it to the tracking domain, and it's game over.
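For the list-based blockers, the core check is roughly "does this request go to a domain on the list?". Below is a minimal sketch of that idea. The `BLOCKLIST` entries are made-up example domains, and real filter lists support far richer rules than plain domain matching.

```python
from urllib.parse import urlsplit

# Hypothetical block list; real lists contain many thousands of entries.
BLOCKLIST = {"tracker.example", "ads.example"}


def is_blocked(url: str, blocklist: set = BLOCKLIST) -> bool:
    """Return True if the URL's host is a listed domain or one of its subdomains."""
    host = urlsplit(url).hostname or ""
    labels = host.split(".")
    # Check the full host and every parent domain against the list.
    return any(".".join(labels[i:]) in blocklist for i in range(len(labels)))


requests_on_page = [
    "https://news.example/article.html",
    "https://cdn.tracker.example/pixel.gif",
    "https://ads.example/banner.js",
]
allowed = [url for url in requests_on_page if not is_blocked(url)]
```

Only the request to `news.example` survives the filter. The blocked requests are never sent, so the tracking domains never get the chance to set or read an ID.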
Tracker blocking is just a short-term solution to tracking, a bandage on the bigger problem: corporations aggressively farming user data, to the point of invading privacy. The proper solution to this issue is regulation of how these corporations operate, transparency about what data is being collected, accountability in the event of breaches, as well as fine-grained user-level controls that let users opt their data in or out. The EU's General Data Protection Regulation (GDPR) is one such regulation, which I believe is the right way to approach this problem. Not more ad blocker filters.
Whether it's used for serving authenticated content or just relevant ads, knowing the user behind the browser is essential to the web. Otherwise, we'd all just be staring at the same, pre-generated, non-dynamic content all the time. However, one must also exercise caution when on the web. Until data regulations become global or become part of the technology itself, what goes on the web stays indefinitely on the web.
Hopefully this gave you a 10,000-foot overview of how the web works, how tracking works, how your data is being used on the internet, and how tracking can be addressed. As always, let me know in the comments what you think about the article.