Module:Histogram


This module contains various functions for generating histograms from data, as well as storing and retrieving those histograms from Cargo.

API Functions

_generate_table

Given:

  • t1: string, the name of the source table.
  • t2: string, the name of the destination table.
  • cols: list of strings, columns from t1.

This function generates histograms of all the given columns in the source table and writes the histograms to the destination table.

Input columns must be numeric. The destination table must have three columns:

  • attribute: string, the name of the column.
  • value: numeric, the value in the attribute.
  • count: numeric, the number of points in the data set whose attribute value is equal to value.

I.e., it is a pivot table.
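
As an illustration, such a destination table could be declared roughly as follows (the table name EnemyHistograms is hypothetical, and the exact field types are up to the wiki):

{{#cargo_declare:_table=EnemyHistograms
|attribute=String
|value=Float
|count=Integer
}}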

By convention, nil entries are ignored.

The function will return a simple message upon success.
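
As a rough sketch, a call from another Lua module could look like the following (the "Enemies" and "EnemyHistograms" table names and the column names are hypothetical):

local Histogram = require("Module:Histogram")

-- Histogram two (hypothetical) numeric columns of the source table and write
-- the result into the destination table's attribute/value/count schema.
local msg = Histogram._generate_table("Enemies", "EnemyHistograms", {"hp", "dodge"})
mw.log(msg)  -- "Successfully generated histograms. ..."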

generate_table

As above, but called from Wikitext via #invoke rather than Lua. Note that this is not well supported and is only used in test cases.
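
For reference, the #invoke form takes the source table, the destination table, and then the column names as positional parameters (the names here are hypothetical):

{{#invoke:Histogram|generate_table|Enemies|EnemyHistograms|hp|dodge}}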

_load_table

Given:

  • t: string, the name of a table of histograms.

This function loads a collection of histograms, hs, from the given table.

hs is a table of attribute-histogram pairs. Each histogram is a list of data points sorted by value, where each data point has two attributes, value and count. Each histogram also carries a total field, the sum of its counts, which _interpolate_cdf uses as the size of the data set.
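
As a sketch, with a hypothetical "EnemyHistograms" table containing a hypothetical hp attribute, the result has roughly this shape:

local Histogram = require("Module:Histogram")
local hs = Histogram._load_table("EnemyHistograms")

-- hs = {
-- 	hp = {
-- 		{value = 10, count = 3},
-- 		{value = 12, count = 1},
-- 		total = 4,  -- sum of the counts; used by _interpolate_cdf
-- 	},
-- 	-- one entry per attribute stored in the table
-- }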

_interpolate_cdf

Given:

  • h: a single histogram.
  • x: numeric, a number to be interpolated.

This function evaluates a pseudo-cumulative distribution function of the histogram at x and returns the result.

Properties of the Pseudo-CDF

Let f denote the pseudo-CDF. f is not a true CDF but rather has the following properties:

  • Let x1 <= x2 <= ... <= xn be the observed points in the original data set, sorted in ascending order, and let xi, xi+1, ..., xj be the set of all data points whose value equals x. Then f(x) = (i - 1 + j) / (2 * n), i.e., the midpoint of the interval [(i - 1) / n, j / n] spanned by the points equal to x (see the worked example after this list). This is different from the usual empirical CDF, which by convention takes the right endpoint j / n (or, very rarely, the left endpoint (i - 1) / n).
  • If x lies in between two different points xi and xi+1, then f(x) interpolates linearly between f(xi) and f(xi+1).
  • If x < x1, then f(x) = 0, and if x > xn, then f(x) = 1.
  • If x is nil, then f(x) = nil also.
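
As a small worked example (with made-up numbers), take the data set {1, 1, 3, 5}, so n = 4, stored as a histogram in the same shape that _load_table returns:

local Histogram = require("Module:Histogram")

local h = {
	{value = 1, count = 2},
	{value = 3, count = 1},
	{value = 5, count = 1},
	total = 4,
}

Histogram._interpolate_cdf(h, 1)  -- (1 - 1 + 2) / (2 * 4) = 0.25
Histogram._interpolate_cdf(h, 3)  -- (3 - 1 + 3) / (2 * 4) = 0.625
Histogram._interpolate_cdf(h, 5)  -- (4 - 1 + 4) / (2 * 4) = 0.875
Histogram._interpolate_cdf(h, 2)  -- linear interpolation: (0.25 + 0.625) / 2 = 0.4375
Histogram._interpolate_cdf(h, 0)  -- 0 (below the smallest observed value)
Histogram._interpolate_cdf(h, 6)  -- 1 (above the largest observed value)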

Motivations

The primary motivation for creating the pseudo-CDF is to answer the following question: given a set of observed data h and a value x, how 'big' is x relative to the other values in h? And how do we express this concept numerically?

The regular CDF F(x) is a good starting point. F(x), ranging from 0 to 1, is defined as the proportion of elements in the observed data that are less than or equal to x. An undesirable property of this definition is that in small data sets it is biased towards higher values. For example, in a data set where every element is equal to 3, F(3) = 1, and 1 is a big value. Occasionally the CDF is defined the opposite way, as the proportion of elements strictly less than x, but this is also undesirable, for the opposite reason: the same data set would give F(3) = 0.

Intuitively, we'd like to say that F(3) = 0.5, because 3 is an average value in that distribution. Hence, for every empirically observed point x, we define the pseudo-CDF as the midpoint of the range of proportions that the points equal to x occupy in the distribution.

But why do we need to interpolate? Why can't we just precompute and then look up the value in the histogram table directly? There is a more subtle problem here which has to do with MediaWiki's update model.

Suppose we have a table of histograms—for example, of all enemy stats—and we add a new enemy. The problem is that MediaWiki won't automatically update the histogram table whenever data is added. Thus the histogram table may become out-of-sync with the current data.

If the newly added enemy happens to have an attribute value that isn't in the distribution, then we can't directly look up its CDF from a cached value. Thus we must produce a notion of "bigness" or "smallness" that only depends on the existing distribution data. A similar problem occurs when updating any existing data as well. The most reasonable solution here is to interpolate between the previously stored data points to produce a best-effort estimate of the (pseudo-)CDF for x. Given that the distribution is not likely to drift that much between individual patches, we are guaranteeing availability at the cost of a very small amount of accuracy.

There are various other ways to deal with the problem. We could, for example, trigger a histogram update with every cargo_store, either automatically in the template or by instructing users to manually purge and regenerate the existing data. The latter introduces a level of complexity too high for the average wiki contributor, while the former would inefficiently recompute the entire data set on every edit.

A Final Note about MIN/MAX Queries

With a sufficiently large data set, even loading the entire histogram on every page view will become expensive. In that case we would need to implement a windowing query that, for a given point, finds the largest stored point smaller than it and the smallest stored point larger than it in the histogram. The database should be able to answer both of those queries in roughly logarithmic time, rather than the linear scan used now.
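
A rough sketch of that pair of queries through the Cargo Lua API, with a hypothetical "EnemyHistograms" table and a hypothetical hp attribute (note that this only fetches the two neighbouring points; the cumulative count up to the lower neighbour would still have to be stored or queried separately):

local Cargo = mw.ext.cargo

local x = 12  -- the point whose pseudo-CDF we want

-- Largest stored value <= x.
local below = Cargo.query("EnemyHistograms", "value,count", {
	where = ("attribute = 'hp' AND value <= %s"):format(x),
	orderBy = "value DESC",
	limit = 1,
})
-- Smallest stored value >= x.
local above = Cargo.query("EnemyHistograms", "value,count", {
	where = ("attribute = 'hp' AND value >= %s"):format(x),
	orderBy = "value ASC",
	limit = 1,
})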

However, this is unnecessary at the moment, and it is not clear that the Darkest Dungeon Wiki will ever have enough data to warrant this.

Test Cases

These templates define a simple data set along with a corresponding histogram table:


local Cargo = mw.ext.cargo
local getArgs = require("Module:Arguments").getArgs

local function _compute_histograms(data)
	local hs = {}
	for _, row in ipairs(data) do
		for attr, val in pairs(row) do
			-- Cargo's Lua API returns field values as strings; normalise
			-- them to numbers so that equal values share one histogram bin.
			-- tonumber() also maps blank/missing entries to nil, which are
			-- ignored by convention.
			val = tonumber(val)
			if val ~= nil then
				if hs[attr] == nil then
					hs[attr] = {}
				end
				local h = hs[attr]
				if h[val] == nil then
					h[val] = 0
				end
				h[val] = h[val] + 1
			end
		end
	end
	return hs
end

local function _generate_table(t1, t2, cols)
	local data = Cargo.query(t1, table.concat(cols, ","))
	local hs = _compute_histograms(data)

	local frame = mw.getCurrentFrame()
	for attr, h in pairs(hs) do
		for val, cnt in pairs(h) do
			-- Equivalent to {{#cargo_store:|_table=...|...}}; the leading ""
			-- is the empty unnamed parameter right after the colon.
			frame:callParserFunction("#cargo_store", {"",
				_table=t2,
				attribute=attr,
				value=val,
				count=cnt
			})
		end
	end

	return ("Successfully generated histograms. " ..
		"View table [[Special:CargoTables/%s|here]]."):format(t2)
end

-- This is not well supported and is only used for testing.
local function generate_table(frame)
	local args = getArgs(frame)
	local t1 = args[1]
	local t2 = args[2]
	-- args from getArgs is metatable-backed, so collect the remaining
	-- positional parameters (the column names) by hand.
	local cols, i = {}, 3
	while args[i] do
		table.insert(cols, args[i])
		i = i + 1
	end
	return _generate_table(t1, t2, cols)
end

local function _load_table(t)
	local rows = Cargo.query(
		t,
		"attribute,value,count",
		{orderBy="attribute,value"}
	)
	local hs = {}
	for _, row in ipairs(rows) do
		local attr = row.attribute
		-- Cargo's Lua API returns field values as strings, so convert them back to numbers.
		local val = tonumber(row.value)
		local cnt = tonumber(row.count)
		if hs[attr] == nil then
			hs[attr] = {}
			hs[attr].total = 0
		end
		table.insert(hs[attr], {value=val, count=cnt})
		hs[attr].total = hs[attr].total + cnt
	end
	return hs
end

local function _interpolate_cdf(h, x)
	if x == nil then
		return nil
	end
	-- With no data there is nothing to compare against; return a neutral 0.5.
	if #h == 0 then
		return 0.5
	end
	if x < h[1].value then
		return 0
	end
	if x > h[#h].value then
		return 1
	end
	if #h == 1 then
		return 0.5
	end
	-- cdf accumulates the number of data points "to the left" of the current
	-- bin, counting the bin itself as half below and half above its value.
	local cdf = h[1].count / 2
	-- x is guaranteed to lie within [h[1].value, h[#h].value], so some
	-- interval [h[i].value, h[i+1].value] below contains it.
	for i = 1, #h - 1 do
		local delta = (h[i].count + h[i + 1].count) / 2
		if x <= h[i + 1].value then
			-- Interpolate linearly between f(h[i].value) and f(h[i+1].value).
			local lambda = (x - h[i].value) / (h[i + 1].value - h[i].value)
			return (cdf + lambda * delta) / h.total
		else
			cdf = cdf + delta
		end
	end
end

return {
	_generate_table=_generate_table,
	generate_table=generate_table,
	_load_table=_load_table,
	_interpolate_cdf=_interpolate_cdf
}