Filter Duplicate Boosts
Revision as of 00:20, 25 January 2024
| Filter Duplicate Boosts | |
| --- | --- |
| Description | Prevent duplicated boosts in public timelines like in home timelines |
| Part Of | Mastodon/Hacking |
| Contributors | Jonny |
| Has Git Repository | https://github.com/NeuromatchAcademy/mastodon |
| Completion Status | Stub |
| Active Status | Inactive |
- Pull Request: https://github.com/NeuromatchAcademy/mastodon/pull/36
Problem
People like boosts in the local timeline, but don't like seeing the same boost a zillion times.
Public feeds work differently than home feeds:
- Public feeds use a public_feed model that is a scoped database query
def get(limit, max_id = nil, since_id = nil, min_id = nil)
scope = public_scope
scope.merge!(without_local_only_scope) unless allow_local_only?
scope.merge!(without_replies_scope) unless with_replies?
scope.merge!(without_reblogs_scope) unless with_reblogs?
scope.merge!(local_only_scope) if local_only?
scope.merge!(remote_only_scope) if remote_only?
scope.merge!(account_filters_scope) if account?
scope.merge!(media_only_scope) if media_only?
scope.merge!(language_scope) if account&.chosen_languages.present?
scope.cache_ids.to_a_paginated_by_id(limit, max_id: max_id, since_id: since_id, min_id: min_id)
end
- Home feeds use a feed_manager class that inserts posts into a persistent list, which applies the boost filter at the time of that insertion:
def add_to_feed(timeline_type, account_id, status, aggregate_reblogs: true)
timeline_key = key(timeline_type, account_id)
reblog_key = key(timeline_type, account_id, 'reblogs')
if status.reblog? && (aggregate_reblogs.nil? || aggregate_reblogs)
# If the original status or a reblog of it is within
# REBLOG_FALLOFF statuses from the top, do not re-insert it into
# the feed
rank = redis.zrevrank(timeline_key, status.reblog_of_id)
return false if !rank.nil? && rank < FeedManager::REBLOG_FALLOFF
# The ordered set at `reblog_key` holds statuses which have a reblog
# in the top `REBLOG_FALLOFF` statuses of the timeline
if redis.zadd(reblog_key, status.id, status.reblog_of_id, nx: true)
# This is not something we've already seen reblogged, so we
# can just add it to the feed (and note that we're reblogging it).
redis.zadd(timeline_key, status.id, status.id)
else
# Another reblog of the same status was already in the
# REBLOG_FALLOFF most recent statuses, so we note that this
# is an "extra" reblog, by storing it in reblog_set_key.
reblog_set_key = key(timeline_type, account_id, "reblogs:#{status.reblog_of_id}")
redis.sadd(reblog_set_key, status.id)
return false
end
else
# A reblog may reach earlier than the original status because of the
# delay of the worker delivering the original status, the late addition
# by merging timelines, and other reasons.
# If such a reblog already exists, just do not re-insert it into the feed.
return false unless redis.zscore(reblog_key, status.id).nil?
redis.zadd(timeline_key, status.id, status.id)
end
true
end
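The home-feed rule above can be sketched without Redis. This is an illustrative stand-in (not Mastodon's code): a plain Ruby array of status IDs, newest first, replaces the Redis sorted set, and Array#index plays the role of redis.zrevrank.

```ruby
# A minimal sketch of the home-feed dedup rule, using a plain array of status
# IDs (newest first) in place of the Redis sorted set at timeline_key.
REBLOG_FALLOFF = 40 # mirrors FeedManager::REBLOG_FALLOFF

def reblog_recently_seen?(timeline_ids_newest_first, reblog_of_id)
  rank = timeline_ids_newest_first.index(reblog_of_id)
  # A boost is skipped when the boosted status (or a boost of it) already
  # appears within the top REBLOG_FALLOFF entries of the timeline.
  !rank.nil? && rank < REBLOG_FALLOFF
end
```

The key point for what follows: this filtering happens at insertion time into a persistent per-user list, which is exactly what the stateless public feed query cannot do.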
Options
- Model - Make some tricky scope that can filter repeated boosts in the existing public_feed model
- Controller - Filter repeated boosts from the cached feed when serving the public_feed endpoint
- Frontend - Filter repeated boosts in the web UI (but wouldn't work on apps, which is bad)
Implementation
Actually pretty damn simple. Add an additional scope in public_feed.rb
The core of the scope is this:
def without_duplicate_reblogs(limit, max_id, since_id, min_id)
  inner_query = Status.select('DISTINCT ON (reblog_of_id) statuses.id')
                      .reorder(reblog_of_id: :desc, id: :desc)

  Status.where(statuses: { reblog_of_id: nil })
        .or(Status.where(id: inner_query))
end
which selects either:

- Statuses that aren't a boost (reblog_of_id == nil)
- Statuses that are a boost,
  - sorted by the boosted status ID and then by the ID of the boost itself,
  - using Postgres' [DISTINCT ON](https://www.postgresql.org/docs/current/sql-select.html) clause to select only the first matching row.
  - Since id is a snowflake ID, and thus chronological, this will be the most recent boost.
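The DISTINCT ON behavior can be simulated in plain Ruby (hashes stand in for status rows; this is illustrative, not ActiveRecord): after sorting by (reblog_of_id DESC, id DESC), keep only the first row per reblog_of_id.

```ruby
# Simulate SELECT DISTINCT ON (reblog_of_id) ... ORDER BY reblog_of_id DESC, id DESC
boosts = [
  { id: 4, reblog_of_id: 2 },
  { id: 5, reblog_of_id: 2 },
  { id: 3, reblog_of_id: 1 },
]
kept = boosts
  .sort_by { |s| [-s[:reblog_of_id], -s[:id]] }
  .uniq { |s| s[:reblog_of_id] }
kept.map { |s| s[:id] } # => [5, 3]
```

Because snowflake IDs are chronological, the surviving row per boosted status is the most recent boost of it.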
There are some problems with this naive implementation though:

- The inner query will select all boosts from all time, every time, because the outer pagination parameters passed to PublicFeed.get don't propagate to the inner query.
- The inner query necessarily needs to sort by reblog_of_id first, so just applying a LIMIT would filter on the recency of the original post rather than the boost, meaning we would miss most boosts most of the time.

So we add the pagination information from PublicFeed.get, mimicking Paginable.to_a_paginated_by_id's use of the parameters, to make a WHERE filter selecting whatever statuses would also be included in the page that is being fetched:
def without_duplicate_reblogs(limit, max_id, since_id, min_id)
  inner_query = Status.select('DISTINCT ON (reblog_of_id) statuses.id').reorder(reblog_of_id: :desc, id: :desc)

  if min_id.present?
    inner_query = inner_query.where(Status.arel_table[:id].gt(min_id))
  elsif since_id.present?
    inner_query = inner_query.where(Status.arel_table[:id].gt(since_id))
  end
  inner_query = inner_query.where(Status.arel_table[:id].lt(max_id)) if max_id.present?
  inner_query = inner_query.limit(limit) if limit.present?

  Status.where(statuses: { reblog_of_id: nil })
        .or(Status.where(id: inner_query))
end
But that still doesn't quite get us there, since the first page of a local feed load doesn't have any pagination information in it except for the limit, so we still have the same problem as before. So instead we create a set of n candidate statuses, where n is some arbitrary multiple of limit, chosen on the assumption that whatever is being filtered out by the other scopes is fewer than n, i.e. we are considering the same set of candidate statuses as the outer scope. This should be relatively fast because we're just taking a slice of the index, rather than applying anything more complex in SQL.

Recall that this is still one of several scopes ANDed together, so we will not return any boosts that are filtered by the other scopes, even if they are present in this scope.
def without_duplicate_reblogs(limit, max_id, since_id, min_id)
  candidate_statuses = Status.select(:id).reorder(id: :desc)

  if min_id.present?
    candidate_statuses = candidate_statuses.where(Status.arel_table[:id].gt(min_id))
  elsif since_id.present?
    candidate_statuses = candidate_statuses.where(Status.arel_table[:id].gt(since_id))
  end
  candidate_statuses = candidate_statuses.where(Status.arel_table[:id].lt(max_id)) if max_id.present?
  if limit.present?
    limit *= 5
    candidate_statuses = candidate_statuses.limit(limit)
  end

  inner_query = Status
                .where(id: candidate_statuses)
                .select('DISTINCT ON (reblog_of_id) statuses.id')
                .reorder(reblog_of_id: :desc, id: :desc)

  Status.where(statuses: { reblog_of_id: nil })
        .or(Status.where(id: inner_query))
end
But one MORE problem: since we're just considering one page at a time, we will still get duplicated boosts across pages. So...

- When no minimum ID is provided, we use the "multiply the limit" strategy to avoid querying all statuses from all time
- When a minimum ID is provided, we
  - Use no limit
  - If a maximum ID is also provided, add 1 day's worth of time to the ID so we also don't query arbitrarily into the future when fetching past pages
def without_duplicate_reblogs(limit, max_id, since_id, min_id)
candidate_statuses = Status.select(:id).reorder(id: :desc)
if min_id.present?
candidate_statuses = candidate_statuses.where(Status.arel_table[:id].gt(min_id))
elsif since_id.present?
candidate_statuses = candidate_statuses.where(Status.arel_table[:id].gt(since_id))
elsif limit.present?
limit *= 5
candidate_statuses = candidate_statuses.limit(limit)
end
if max_id.present?
max_time = Mastodon::Snowflake.to_time(max_id)
max_time += 1.day
max_id = Mastodon::Snowflake.id_at(max_time)
candidate_statuses = candidate_statuses.where(Status.arel_table[:id].lt(max_id))
end
inner_query = Status
.where(id: candidate_statuses)
.select('DISTINCT ON (reblog_of_id) statuses.id')
.reorder(reblog_of_id: :desc, id: :desc)
Status.where(statuses: { reblog_of_id: nil })
.or(Status.where(id: inner_query))
end
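The "+1 day" arithmetic in the max_id branch can be sketched in isolation. This is a hedged sketch, assuming Mastodon-style snowflake IDs whose upper bits encode a millisecond UNIX timestamp shifted left by 16 (mirroring what Mastodon::Snowflake.to_time and id_at do between them); the helper name is ours, not Mastodon's.

```ruby
# Widen max_id by one day, assuming snowflake IDs of the form
# (millisecond_timestamp << 16) | sequence_bits.
MS_PER_DAY = 24 * 60 * 60 * 1000

def widen_max_id_by_one_day(max_id)
  ms = max_id >> 16          # millisecond timestamp encoded in the ID
  (ms + MS_PER_DAY) << 16    # an ID one day later (sequence bits zeroed)
end
```

Zeroing the sequence bits is harmless here: the widened value is only used as an upper bound on the candidate slice, not returned to the client.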
Example Query
SELECT "statuses"."id", "statuses"."updated_at"
FROM "statuses"
INNER JOIN "accounts" ON "accounts"."id" = "statuses"."account_id"
WHERE "statuses"."visibility" = $1
AND "accounts"."suspended_at" IS NULL
AND "accounts"."silenced_at" IS NULL
AND (
statuses.reply = FALSE
OR statuses.in_reply_to_account_id = statuses.account_id
)
AND (
"statuses"."reblog_of_id" IS NULL
OR "statuses"."id" IN (
SELECT DISTINCT ON (reblog_of_id) statuses.id
FROM "statuses"
WHERE "statuses"."deleted_at" IS NULL
AND "statuses"."id" IN (
SELECT "statuses"."id" FROM "statuses"
WHERE "statuses"."deleted_at" IS NULL
AND "statuses"."id" < 111819125737828654
ORDER BY "statuses"."id" DESC
LIMIT $2
)
ORDER BY "statuses"."reblog_of_id" DESC, "statuses"."id" DESC
)
)
AND (
"statuses"."local" = $3
OR "statuses"."uri" IS NULL
)
AND "statuses"."deleted_at" IS NULL
AND 1=1
AND "statuses"."id" < 111813463418866657
ORDER BY "statuses"."id" DESC LIMIT $4
[["visibility", 0], ["LIMIT", 100], ["local", true], ["LIMIT", 20]]