titipata / scipdf_parser

Python PDF parser for scientific publications: content and figures

Wrong parsing when dealing with double column papers

ktgiahieu opened this issue

Hope you are well.

I am reporting a bug where the Abstract and Introduction are wrongly parsed. I have emphasized the errors in bold text in the parsed JSON below, and I have also attached the original paper.

Since I need this for my current research, could you take a look and give me some advice on what might have gone wrong here as soon as possible? I am happy to help fix the bug and submit a pull request.

Thank you in advance.
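For reference, here is roughly how I produced the parse below. This is a minimal sketch using scipdf.parse_pdf_to_dict, assuming a local GROBID service is already running as the README describes; paper11.pdf refers to the attached paper:

import scipdf

# Minimal reproduction sketch: parse the attached paper into a dict.
# Assumes a local GROBID service is running (see the README), since
# scipdf sends the PDF to GROBID for structure extraction.
article_dict = scipdf.parse_pdf_to_dict('paper11.pdf')

print(article_dict['title'])
print(article_dict['abstract'])
for section in article_dict['sections']:
    print(section['heading'], section['text'][:80])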

paper11.pdf
{
"authors": "Sahra Ghalebikesabi; Harrison Wilde; Jack Jewson; Arnaud Doucet; Sebastian Vollmer; Chris Holmes",
"pub_date": "",
"title": "Mitigating Statistical Bias within Differentially Private Synthetic Data",
"abstract": "Increasing interest in privacy-preserving machine learning has led to new and evolved approaches for generating private synthetic data from undisclosed real data. However, mechanisms of privacy preservation can significantly reduce the utility of synthetic data, which in turn impacts downstream tasks such as learning predictive models or inference. We propose several re-weighting strategies using privatised likelihood ratios that not only mitigate statistical bias of downstream estimators but also have general applicability to differentially private generative models. Through large-scale empirical evaluation, we show that private importance weighting provides simple and effective privacycompliant augmentation for general applications of synthetic data.
Recent literature has proposed techniques to decrease this bias by modifying the training processes of private algorithms. These approaches are specific to a particular synthetic data generating method (Zhang et al., 2018;Frigerio et al., 2019;Neunhoeffer et al., 2020), or are query-based (Hardt and Rothblum, 2010;Liu et al., 2021) and are thus not generally applicable. Hence, we propose several postprocessing approaches that aid mitigating the bias induced by the DP synthetic data. While there has been extensive research into estimating models directly on protected data without leaking privacy, we argue that releasing DP synthetic data is crucial for rigorous statistical analysis. This makes providing a framework to debias inference on this an important direction of future research that goes beyond the applicability of any particular DP estimator. Because of the post-processing theorem (Dwork et al., 2014), any function on the DP synthetic data is itself DP. This allows deployment of standard statistical analysis tooling that may otherwise be unavailable for DP estimation. These include 1) exploratory data analysis, 2) model verification and analysis of model diagnostics, 3) private release of (newly developed) models for which no DP analogue has been derived, 4) the computation of con-",
"sections": [
{
"heading": "INTRODUCTION",
"text": "The prevalence of sensitive datasets, such as electronic health records, contributes to a growing concern for violations of an individual's privacy. In recent years, the notion of Differential Privacy (Dwork et al., 2006) has gained popularity as a privacy metric offering statistical guarantees. This framework bounds how much the likelihood of a randomised algorithm can differ under neighbouring real datasets. We say two datasets D and D are neighbouring when they differ by at most one observation. A randomised algorithm g : M \u2192 R satisfies ( , \u03b4)-differential privacy for , \u03b4 \u2265 0 if and only if for all neighbouring datasets D, D and all subsets S \u2286 R, we have Pr(g(D) \u2208 S) \u2264 \u03b4 + e Pr(g(D ) \u2208 S).\nThe parameter is referred to as the privacy budget; smaller quantities imply more private algorithms.\nInjecting noise into sensitive data according to this paradigm allows for datasets to be published in a private manner. With the rise of generative modelling approaches, such as Generative Adversarial Networks (GANs) (Goodfellow et al., 2014), there has been a surge of literature proposing generative models for differentially private (DP) synthetic data generation and release (Jordon et al., 2019;Xie et al., 2018;Zhang et al., 2017). These generative models often fail to capture the true underlying distribution of the real data, possibly due to flawed parametric assumptions and the injection of noise into their training and release mechanisms.\nThe constraints imposed by privacy-preservation can lead to significant differences between nature's true data generating process (DGP) and the induced synthetic DGP (SDGP) (Wilde et al., 2020). This increases the bias of estimators trained on data from the SDGP which reduces their utility.\nfidence intervals of downstream estimators through the nonparametric bootstrap, and 5) the public release of a data set to a research community whose individual requests would otherwise overload the data curator. This endeavour could facilitate the release of data on public platforms like the UCI Machine Learning Repository (Lichman, 2013) or the creation of data competitions, fuelling research growth for specific modelling areas.\nThis motivates our main contributions, namely the formulation of multiple approaches to generating DP importance weights that correct for synthetic data's issues. In particular, this includes:\n\u2022 The bias estimation of an existing DP importance weight estimation method, and the introduction of an unbiased extension with smaller variance (Section 3.3).\n\u2022 An adjustment to DP Stochastic Gradient Descent's sampling probability and noise injection to facilitate its use in the training of DP-compliant neural networkbased classifiers to estimate importance weights from combinations of real and synthetic data (Section 3.4).\n\u2022 The use of discriminator outputs of DP GANs as importance weights that do not require any additional privacy budget (Section 3.5).\n\u2022 An application of importance weighting to correct for the biases incurred in Bayesian posterior belief updating with synthetic data motivated by the results from (Wilde et al., 2020) and to exhibit our methods' wide applicability in frequentist and Bayesian contexts (Section 3.1).",
"n_publication_ref": 8,
"n_figure_ref": 0
}
]
}
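To pinpoint the errors concretely: the abstract field ends mid-word with "the computation of con-", and its continuation "fidence intervals of downstream estimators" appears in the middle of the INTRODUCTION text instead. A hypothetical quick check, reusing the article_dict from the snippet above:

# Hypothetical check illustrating the misplacement: the abstract ends
# mid-word ("...con-") and its continuation ("fidence intervals ...")
# shows up inside the INTRODUCTION section text.
abstract = article_dict['abstract']
intro = next(s for s in article_dict['sections'] if s['heading'] == 'INTRODUCTION')

print(abstract.rstrip().endswith('con-'))      # True for paper11.pdf
print('fidence intervals' in intro['text'])    # True for paper11.pdf

This looks like the two columns are being read in the wrong order around the abstract/introduction boundary, which would match the double-column issue in the title.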