The following example shows how to process your images with the Newspaper Segmentation API in Python.

First you need to install the client.

pip install arcanum-newspaper-segmentation-client

To run the following commands, you need to have access to AWS's Textract service enabled.

import numpy as np
from newspaper_segmentation_client import run_newspaper_segmentation

# provide your own API key
API_KEY = "<api key>"

with open("your-image-file.jpg", "rb") as image_file:  # you can use other binary file-like objects like io.BytesIO
    segmented_page = run_newspaper_segmentation(image_file, api_key=API_KEY)
    # alternatively, help us by providing the correct DPI value
    # (use this call instead of the one above; the file object can only be read once)
    # segmented_page = run_newspaper_segmentation(image_file, api_key=API_KEY, dpi=150)

    # print number of articles on this page
    print(len(segmented_page["articles"]))

    # extract block mask for the first block with numpy
    layout_components = np.load(segmented_page["layout_components"])["layout_components"]
    first_block_mask = layout_components == segmented_page["articles"][0]["blocks"][0]["id"]

Input image limitations

Please note that Textract imposes some limitations on the input image. See the AWS Textract documentation to learn more.

Output format

If the request is processed successfully, the following output is returned as a Python dictionary.

  • articles: (list) The articles on the page.
    • confidence: (float) Confidence score associated with the article. Floating point value between zero and one.
    • blocks: (list) The blocks of the article.
      • bounds: (list) The bounds of the block. They need to be interpreted in the rotated image's coordinate space, with the top left corner being (0, 0). The values are provided in the following order: xmin, ymin, xmax, ymax.
      • label: (str) Label of the block. Possible labels are: Artifact, Advertising, Picture, Table, Title, Roof_Title, Subtitle, Header, Footer, Intermediate, Lead, Abstract, Caption, Listing, Body, Footnote, Byline
      • confidence: (float) Confidence score associated with the label. Floating point value between zero and one.
      • id: (int) The identifier of the block. It can be used to extract a more detailed mask from layout_components.
  • layout_components: (str) The mask of the blocks. It is a 2-dimensional numpy array that has the same aspect ratio as the original image and holds a block identifier for each pixel. Please note that its dimensions differ from those of the original image. You need to resize it to the size of the original image in order to get the block identifier for each pixel of the image.
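
The resizing step mentioned for layout_components could be sketched as follows. This is a minimal illustration, not part of the client: resize_mask_nearest is a hypothetical helper, the small array stands in for the decoded mask, and the target size is made up. Nearest-neighbour sampling is assumed here because the mask holds identifiers, and interpolating between them would produce values that belong to no block.

```python
import numpy as np


def resize_mask_nearest(mask: np.ndarray, height: int, width: int) -> np.ndarray:
    """Resize a 2-D identifier mask to (height, width) by nearest-neighbour sampling.

    Hypothetical helper for illustration; it is not provided by the client.
    """
    # For every target row/column, pick the source row/column it falls into.
    rows = np.arange(height) * mask.shape[0] // height
    cols = np.arange(width) * mask.shape[1] // width
    return mask[rows[:, None], cols[None, :]]


# Synthetic stand-in for the decoded layout_components array (2 x 3 pixels).
small_mask = np.array([[0, 1, 1],
                       [0, 2, 2]])

# Made-up size of the original image: 4 rows x 6 columns.
full_mask = resize_mask_nearest(small_mask, height=4, width=6)

# Boolean mask of the block with id 2, now at the original image resolution.
block_id = 2
block_mask = full_mask == block_id
print(full_mask.shape, int(block_mask.sum()))
```

In real use, the height and width would come from the original image, and block_id from the id entry of a block in the articles list.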

If the request results in an error, the status code of the response will be other than 200 and the error message is returned in the errorMessage entry of the output dictionary.
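
A defensive way to read the output dictionary might look like the sketch below. It only assumes what is described above: an errorMessage entry on failure and an articles list on success. The get_articles helper and both sample dictionaries are made up for illustration, and whether the client itself raises an exception on errors is not covered here.

```python
def get_articles(response: dict) -> list:
    """Return the articles of an output dictionary, or raise if it reports an error.

    Hypothetical helper for illustration; it is not provided by the client.
    """
    if "errorMessage" in response:
        raise RuntimeError(f"Segmentation failed: {response['errorMessage']}")
    return response["articles"]


# Made-up output dictionaries that follow the documented structure.
ok_response = {
    "articles": [
        {
            "confidence": 0.9,
            "blocks": [
                {"bounds": [10, 20, 110, 60], "label": "Title", "confidence": 0.95, "id": 1},
                {"bounds": [10, 70, 110, 300], "label": "Body", "confidence": 0.88, "id": 2},
            ],
        }
    ],
}
failed_response = {"errorMessage": "example error message"}

# Collect the block labels of every article on the page.
labels_per_article = [
    [block["label"] for block in article["blocks"]]
    for article in get_articles(ok_response)
]
print(labels_per_article)

try:
    get_articles(failed_response)
except RuntimeError as error:
    print(error)
```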
