Let’s talk about the Google Vision API and how to write a simple Java program to extract text from an image.
Google Cloud’s Vision API offers powerful pre-trained machine learning models through REST and RPC APIs. Assign labels to images and quickly classify them into millions of predefined categories. Detect objects and faces, read printed and handwritten text, and build valuable metadata into your image catalog.
This is the description of the Vision API that you will see on its API page, and it is a fairly modest summary of what Google has achieved with its ML models.
If you have ever used Google Lens, you will already be aware of its capabilities — and yes, the capabilities are limitless.
What if I told you that, for a nominal amount, you can subscribe to those APIs and have Google Lens-like functionality integrated into your own app?
By registering for Google Cloud’s Vision API, you can access just that with an API key.
Having read the description, I was intrigued and decided to try it out by writing a simple text extractor tool using Google’s TextAnnotation client.
Here’s what I did:
Step 1: If you already have a Google account, go to GCP console and create a new Project. Name it anything you want.
Step 2: Make sure you have billing enabled for your account. Don’t worry: most of the APIs are free to try out, so you won’t incur any charges unless you use them heavily.
Step 3: Enable the Vision API for your project.
Step 4: Create a service account so that you can create an API key for using the Vision API from your codebase.
Step 5: Create a service account key and download it for use in your project.
Now you are ready to invoke the API from your code.
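As an aside (this is my own addition, not one of the official steps): instead of hard-coding the key-file path, the Google client libraries can also discover the key automatically if the GOOGLE_APPLICATION_CREDENTIALS environment variable points at the downloaded JSON file. A quick stdlib-only sanity check that the variable is set and the file exists:

```java
import java.nio.file.Files;
import java.nio.file.Paths;

public class KeyFileCheck {
    // Returns a human-readable status for the given key-file path
    // (null means the environment variable is not set).
    static String check(String keyPath) {
        if (keyPath == null) {
            return "GOOGLE_APPLICATION_CREDENTIALS is not set";
        }
        if (!Files.exists(Paths.get(keyPath))) {
            return "Key file not found at: " + keyPath;
        }
        return "Key file found at: " + keyPath;
    }

    public static void main(String[] args) {
        // The Google client libraries look for this variable when no
        // credentials are passed to them explicitly.
        System.out.println(check(System.getenv("GOOGLE_APPLICATION_CREDENTIALS")));
    }
}
```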
Step 6: Create a Java project. I created one using Maven, and this is the POM file:
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.textextractor</groupId>
    <artifactId>textextractor</artifactId>
    <version>1.0-SNAPSHOT</version>

    <!-- The BOM keeps the Google Cloud client library versions consistent. -->
    <dependencyManagement>
        <dependencies>
            <dependency>
                <groupId>com.google.cloud</groupId>
                <artifactId>libraries-bom</artifactId>
                <version>19.2.1</version>
                <type>pom</type>
                <scope>import</scope>
            </dependency>
        </dependencies>
    </dependencyManagement>

    <dependencies>
        <dependency>
            <groupId>com.google.cloud</groupId>
            <artifactId>google-cloud-vision</artifactId>
        </dependency>
        <dependency>
            <groupId>com.google.cloud</groupId>
            <artifactId>google-cloud-storage</artifactId>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.8.1</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>
        </plugins>
    </build>
</project>
Next, I created a class with a main method to use the API for demonstration purposes. This is the code:
import com.google.api.gax.core.FixedCredentialsProvider;
import com.google.auth.oauth2.GoogleCredentials;
import com.google.cloud.vision.v1.*;
import com.google.cloud.vision.v1.Feature.Type;
import com.google.common.collect.Lists;
import com.google.protobuf.ByteString;

import java.io.FileInputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class TextExtractionDemo {

    public static void main(String... args) throws Exception {
        GoogleCredentials credentials = GoogleCredentials
                .fromStream(new FileInputStream("<PATH_TO_API_KEY_FILE>"))
                .createScoped(Lists.newArrayList("https://www.googleapis.com/auth/cloud-platform"));
        detectDocumentText("<PATH_TO_IMAGE_FILE>", credentials);
    }

    public static void detectDocumentText(String filePath, GoogleCredentials credentials) throws IOException {
        List<AnnotateImageRequest> requests = new ArrayList<>();
        ByteString imgBytes = ByteString.readFrom(new FileInputStream(filePath));
        Image img = Image.newBuilder().setContent(imgBytes).build();
        Feature feat = Feature.newBuilder().setType(Type.DOCUMENT_TEXT_DETECTION).build();
        AnnotateImageRequest request =
                AnnotateImageRequest.newBuilder().addFeatures(feat).setImage(img).build();
        requests.add(request);

        ImageAnnotatorSettings imageAnnotatorSettings =
                ImageAnnotatorSettings.newBuilder()
                        .setCredentialsProvider(FixedCredentialsProvider.create(credentials))
                        .build();

        // Initialize the client used to send requests. It only needs to be created once
        // and can be reused for multiple requests; the try-with-resources block closes it
        // automatically, cleaning up any remaining background resources.
        try (ImageAnnotatorClient client = ImageAnnotatorClient.create(imageAnnotatorSettings)) {
            BatchAnnotateImagesResponse response = client.batchAnnotateImages(requests);
            for (AnnotateImageResponse res : response.getResponsesList()) {
                if (res.hasError()) {
                    System.out.format("Error: %s%n", res.getError().getMessage());
                    return;
                }
                // For the full list of available annotations, see http://g.co/cloud/vision/docs
                TextAnnotation annotation = res.getFullTextAnnotation();
                // Walk the page > block > paragraph > word > symbol hierarchy,
                // rebuilding the text at each level. This is not needed for the
                // final print below, but shows how to access fine-grained results.
                for (Page page : annotation.getPagesList()) {
                    StringBuilder pageText = new StringBuilder();
                    for (Block block : page.getBlocksList()) {
                        StringBuilder blockText = new StringBuilder();
                        for (Paragraph para : block.getParagraphsList()) {
                            StringBuilder paraText = new StringBuilder();
                            for (Word word : para.getWordsList()) {
                                StringBuilder wordText = new StringBuilder();
                                for (Symbol symbol : word.getSymbolsList()) {
                                    wordText.append(symbol.getText());
                                }
                                paraText.append(wordText).append(" ");
                            }
                            blockText.append(paraText);
                        }
                        pageText.append(blockText);
                    }
                }
                System.out.println("Complete annotation:");
                System.out.println(annotation.getText());
            }
        }
    }
}
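The POM above also pulls in google-cloud-storage, and the Vision API can read images directly from a Cloud Storage bucket via a gs:// URI instead of uploading local bytes. A minimal sketch of that variant — the bucket and object names are placeholders, and credentials are assumed to be discoverable from the environment (e.g. the key file from Step 5):

```java
import com.google.cloud.vision.v1.*;
import com.google.cloud.vision.v1.Feature.Type;
import java.util.Collections;

public class GcsTextExtractionDemo {
    public static void main(String[] args) throws Exception {
        // Point the request at an object in Cloud Storage instead of local bytes.
        ImageSource source = ImageSource.newBuilder()
                .setGcsImageUri("gs://<YOUR_BUCKET>/<YOUR_IMAGE>.jpg")
                .build();
        Image img = Image.newBuilder().setSource(source).build();
        Feature feat = Feature.newBuilder().setType(Type.DOCUMENT_TEXT_DETECTION).build();
        AnnotateImageRequest request = AnnotateImageRequest.newBuilder()
                .addFeatures(feat).setImage(img).build();

        // create() with no settings falls back to Application Default Credentials.
        try (ImageAnnotatorClient client = ImageAnnotatorClient.create()) {
            BatchAnnotateImagesResponse response =
                    client.batchAnnotateImages(Collections.singletonList(request));
            for (AnnotateImageResponse res : response.getResponsesList()) {
                System.out.println(res.getFullTextAnnotation().getText());
            }
        }
    }
}
```

This is handy when the images already live in a bucket, since nothing has to be downloaded to the machine running the code.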
To test the API, I used the following image file:

And on running the program, I got exactly what I wanted:
Complete annotation:
STOP SAYING
“I WISH”,
START SAYING
“I WILL”.
With the quotes included.
I must say that I was positively blown away by the accuracy.
Apart from the text extraction APIs, the Vision API also includes:
- Face detection
- Landmark detection
- Label detection
- OCR for handwritten text
And several other APIs with surprisingly high accuracy.
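Trying any of these mostly comes down to swapping the feature type in the request. A minimal sketch for label detection, under the same setup as above (the image path is a placeholder):

```java
import com.google.cloud.vision.v1.*;
import com.google.cloud.vision.v1.Feature.Type;
import com.google.protobuf.ByteString;
import java.io.FileInputStream;
import java.util.Collections;

public class LabelDetectionDemo {
    public static void main(String[] args) throws Exception {
        ByteString imgBytes = ByteString.readFrom(new FileInputStream("<PATH_TO_IMAGE_FILE>"));
        Image img = Image.newBuilder().setContent(imgBytes).build();
        // LABEL_DETECTION instead of DOCUMENT_TEXT_DETECTION is the only real change.
        Feature feat = Feature.newBuilder().setType(Type.LABEL_DETECTION).build();
        AnnotateImageRequest request = AnnotateImageRequest.newBuilder()
                .addFeatures(feat).setImage(img).build();

        // Uses Application Default Credentials when no settings are passed.
        try (ImageAnnotatorClient client = ImageAnnotatorClient.create()) {
            BatchAnnotateImagesResponse response =
                    client.batchAnnotateImages(Collections.singletonList(request));
            for (AnnotateImageResponse res : response.getResponsesList()) {
                for (EntityAnnotation label : res.getLabelAnnotationsList()) {
                    // Each label comes with a confidence score between 0 and 1.
                    System.out.printf("%s (%.2f)%n", label.getDescription(), label.getScore());
                }
            }
        }
    }
}
```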
So before you go around training your own models, give this a go first. I am sure that, like me, you won’t be disappointed.
Links I followed to set this up:
It’s fairly straightforward, and everything you need to know is clearly explained, so you won’t have a tough time implementing this.