Module:Histogram
This module contains various functions for generating histograms from data, as well as for storing those histograms in Cargo and retrieving them.
API Functions
_generate_table
Given:
t1
: string, the name of the source table.
t2
: string, the name of the destination table.
cols
: list of strings, columns from t1.
This function generates histograms of all the given columns in the source table and writes the histograms to the destination table.
Input columns must be numeric. The destination table must have three columns:
attribute
: string, the name of the column.
value
: numeric, a value taken by the attribute.
count
: numeric, the number of points in the data set whose attribute value is equal to value.
I.e., it is a pivot table.
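For example (with made-up data, not taken from the Wiki), a source column hp with values {3, 3, 5} would produce the following rows in the destination table:

attribute  value  count
hp         3      2
hp         5      1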
By convention, nil entries are ignored.
The function will return a simple message upon success.
generate_table
As above, but called from Wikitext via #invoke rather than Lua. Note that this is not well supported and is only used in test cases.
_load_table
Given:
t
: string, the name of a table of histograms.
This function loads the histograms hs from the given table. hs will be a table of attribute-histogram pairs. Each histogram will be a sorted list of data points, where each data point has two attributes, value and count.
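For instance, continuing the hypothetical hp example above, the structure returned by _load_table would have the following shape (the attribute name and data values here are illustrative; the value/count/total fields are the ones the function actually produces):

```lua
local hs = {
    hp = {
        -- Sorted list of data points for the "hp" attribute.
        {value = 3, count = 2},
        {value = 5, count = 1},
        -- Total number of data points, maintained by _load_table.
        total = 3,
    },
}
```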
_interpolate_cdf
Given:
h
: a single histogram.
x
: numeric, a number to be interpolated.
This function returns a pseudo-cumulative distribution function of the histogram, evaluated at x.
Properties of the Pseudo-CDF
Let f denote the pseudo-CDF. f is not a true CDF, but it has the following properties:
- Let x1 <= x2 <= ... <= xn be the observed points in the original data set, and let xi, xi+1, ..., xj-1, xj be the set of all data points with value x. Then f(x) = (i + j - 1) / (2 * n), the midpoint of (i - 1) / n (the proportion of points strictly below x) and j / n (the proportion of points at or below x). That is to say, it takes the middle of the range of all data points with value x. This is different from the usual discrete CDF, which by convention takes the rightmost point (or, very rarely, the leftmost point).
- If x lies between two adjacent points xi and xi+1, then f(x) interpolates linearly between f(xi) and f(xi+1).
- If x < x1, then f(x) = 0, and if x > xn, then f(x) = 1.
- If x is nil, then f(x) = nil also.
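As a concrete illustration of these properties, the following standalone sketch evaluates the pseudo-CDF on the data set {1, 3, 3, 5} (so n = 4). It uses the same histogram representation as _interpolate_cdf below (a sorted list of value/count pairs plus a total field), but reimplements the function without any MediaWiki dependencies so it can be run by itself:

```lua
-- Minimal standalone reimplementation of the pseudo-CDF, for illustration only.
local function interpolate_cdf(h, x)
    if x == nil then return nil end
    if x < h[1].value then return 0 end
    if x > h[#h].value then return 1 end
    if #h == 1 then return 0.5 end
    -- Running (unnormalized) pseudo-CDF at h[i].value.
    local cdf = h[1].count / 2
    for i = 1, #h - 1 do
        local delta = (h[i].count + h[i + 1].count) / 2
        if x <= h[i + 1].value then
            local lambda = (x - h[i].value) / (h[i + 1].value - h[i].value)
            return (cdf + lambda * delta) / h.total
        end
        cdf = cdf + delta
    end
end

-- Histogram of the data set {1, 3, 3, 5}: n = 4 points.
local h = {
    {value = 1, count = 1},
    {value = 3, count = 2},
    {value = 5, count = 1},
    total = 4,
}

print(interpolate_cdf(h, 3))  -- 0.5: the two 3s sit in the middle of the data
print(interpolate_cdf(h, 1))  -- 0.125: (i + j - 1) / (2 * n) with i = j = 1
print(interpolate_cdf(h, 2))  -- 0.3125: linear interpolation between f(1) and f(3)
print(interpolate_cdf(h, 0))  -- 0: below the smallest observed point
print(interpolate_cdf(h, 6))  -- 1: above the largest observed point
```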
Motivations
The primary motivation for creating the pseudo-CDF is to answer the following question: given a set of observed data h and a value x, how 'big' is x relative to the other values in h? And how do we express this concept numerically?
The regular CDF F(x) is a good starting point. F(x), ranging from 0 to 1, is defined as "the proportion of elements less than or equal to x in the observed data". An undesirable property of this definition is that in small data sets it is biased towards higher values. For example, in a data set where all the elements are equal to 3, F(3) = 1, and 1 is a big value. Rarely, the CDF is defined in the opposite way, as the proportion of elements strictly less than x. But this is also undesirable, for the opposite reason.
Intuitively, we'd like to say that F(3) = 0.5, because 3 is an average value in the distribution. Hence, for every empirically observed point x, we define the pseudo-CDF as the median of its proportionate range in the distribution.
But why do we need to interpolate? Why can't we just precompute and then look up the value in the histogram table directly? There is a more subtle problem here which has to do with MediaWiki's update model.
Suppose we have a table of histograms—for example, of all enemy stats—and we add a new enemy. The problem is that MediaWiki won't automatically update the histogram table whenever data is added. Thus the histogram table may become out-of-sync with the current data.
If the newly added enemy happens to have an attribute value that isn't in the distribution, then we can't directly look up its CDF from a cached value. Thus we must produce a notion of "bigness" or "smallness" that depends only on the existing distribution data. A similar problem occurs when updating any existing data as well. The most reasonable solution here is to interpolate between the previously stored data points to produce a best-effort estimate of the (pseudo-)CDF for x. Given that the distribution is not likely to drift that much between individual patches, we are guaranteeing availability at the cost of a very small amount of accuracy.
There are various other ways to deal with the problem. We could, for example, trigger a histogram update with every cargo_store, either automatically in the template or by instructing users to manually purge and regenerate the existing data. The latter introduces a level of complexity too high for the average Wiki contributor; not to mention, it would be inefficient to update the entire data set with every edit.
A Final Note about MIN/MAX Queries
With a sufficiently large data set, even loading the entire histogram on every page will become expensive. In that case we would need to implement a windowing query that computes, for a given point, the largest histogram point smaller than it and the smallest histogram point larger than it. The database should be able to answer both of those queries in roughly logarithmic time, instead of the linear scan the current implementation performs.
However, this is unnecessary at the moment, and it is not clear that the Darkest Dungeon Wiki will ever have enough data to warrant this.
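If it ever did become necessary, such a windowing query could look roughly like the following sketch. This is an untested assumption, not part of the module: it reuses the module's Cargo = mw.ext.cargo binding and assumes the where, orderBy, and limit arguments of cargo queries, and the function name bracket is made up.

-- Hypothetical sketch: fetch only the two rows bracketing x instead of the
-- whole histogram, so the database can use an index on (attribute, value).
local function bracket(t, attr, x)
    local below = Cargo.query(t, "value,count", {
        where = ("attribute = '%s' AND value <= %f"):format(attr, x),
        orderBy = "value DESC",
        limit = 1,
    })
    local above = Cargo.query(t, "value,count", {
        where = ("attribute = '%s' AND value >= %f"):format(attr, x),
        orderBy = "value ASC",
        limit = 1,
    })
    return below[1], above[1]
end

The interpolation would then proceed exactly as in _interpolate_cdf, but between these two points only; the cumulative count up to the lower point would also need to be stored or queried.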
Test Cases
These templates define a simple data set along with a corresponding histogram table:
local Cargo = mw.ext.cargo
local getArgs = require("Module:Arguments").getArgs
local function _compute_histograms(data)
    local hs = {}
    for _, row in ipairs(data) do
        for attr, val in pairs(row) do
            if val ~= nil then
                if hs[attr] == nil then
                    hs[attr] = {}
                end
                local h = hs[attr]
                if h[val] == nil then
                    h[val] = 0
                end
                h[val] = h[val] + 1
            end
        end
    end
    return hs
end
local function _generate_table(t1, t2, cols)
    local data = Cargo.query(t1, table.concat(cols, ","))
    local hs = _compute_histograms(data)
    local frame = mw.getCurrentFrame()
    for attr, h in pairs(hs) do
        for val, cnt in pairs(h) do
            frame:callParserFunction("#cargo_store", {"",
                _table=t2,
                attribute=attr,
                value=val,
                count=cnt
            })
        end
    end
    return ("Successfully generated histograms. " ..
        "View table [[Special:CargoTables/%s|here]]."):format(t2)
end
-- This is not well supported and is only used for testing.
local function generate_table(frame)
    local args = getArgs(frame)
    local t1 = args[1]
    local t2 = args[2]
    -- Workaround for metatable issues
    local cols, i = {}, 3
    while args[i] do
        table.insert(cols, args[i])
        i = i + 1
    end
    -- Propagate the success message back to the wikitext caller.
    return _generate_table(t1, t2, cols)
end
local function _load_table(t)
    local rows = Cargo.query(
        t,
        "attribute,value,count",
        {orderBy="attribute,value"}
    )
    local hs = {}
    for _, row in ipairs(rows) do
        local attr = row.attribute
        -- BUG: Cargo doesn't load numerical values correctly.
        local val = tonumber(row.value)
        local cnt = tonumber(row.count)
        if hs[attr] == nil then
            hs[attr] = {}
            hs[attr].total = 0
        end
        table.insert(hs[attr], {value=val, count=cnt})
        hs[attr].total = hs[attr].total + cnt
    end
    return hs
end
local function _interpolate_cdf(h, x)
    if x == nil then
        return nil
    end
    if #h == 0 then
        return 0.5
    end
    if x < h[1].value then
        return 0
    end
    if x > h[#h].value then
        return 1
    end
    if #h == 1 then
        return 0.5
    end
    -- At this point x lies in [h[1].value, h[#h].value], so exactly one of
    -- the intervals [h[i].value, h[i+1].value] contains it. Bounding the
    -- loop at #h - 1 guarantees h[i + 1] always exists.
    local cdf = h[1].count / 2
    for i = 1, #h - 1 do
        local delta = (h[i].count + h[i + 1].count) / 2
        if x <= h[i + 1].value then
            local lambda = (x - h[i].value) / (h[i + 1].value - h[i].value)
            return (cdf + lambda * delta) / h.total
        end
        cdf = cdf + delta
    end
end
return {
    _generate_table=_generate_table,
    generate_table=generate_table,
    _load_table=_load_table,
    _interpolate_cdf=_interpolate_cdf
}