Showing posts with label technology. Show all posts

Thursday, February 16, 2017

Robert Kehrer’s Industry Trends and Outlook – #RootsTech

Robert Kehrer at RootsTech 2017

Robert Kehrer, product manager at FamilySearch, took part in a panel discussion titled “Industry Trends and Outlook” at the Innovators Summit portion of RootsTech 2017. Robert wrestles with big data technology problems at FamilySearch.

One of the hardest things Robert faced in preparing his presentation was narrowing down the areas that he wanted to talk about. He narrowed things down to three categories of innovation: technology, process, and data.

The first technology innovation he sees coming is automated transcription: the ability of a computer to transcribe a document. There have been some recent advances, particularly in the area of handwriting recognition. Today automated transcription works well on typescript documents and pretty well on print handwriting. The ability to recognize cursive writing is showing promise. However, there are really messy documents that automated transcription is not likely to handle.

Robert Kehrer says automated transcription of some documents is harder based on handwriting style

Another area where technology innovation is happening is named entity recognition. A computer takes transcribed text and, using a process called natural language processing, picks out the names, dates, locations, relationships, and so forth. Progress is being made in this area.

Innovation is happening in neural networks and machine learning and is important in combination with automated transcription and named entity recognition. Machine learning is not difficult to understand when demonstrated with a simple example. Machine learning could make it possible to show the machine many images of the name William. Subsequently, when names are shown to the machine, it can pick out those that are William.

Robert Kehrer demystifies machine learning

Don’t think that these technologies are going to replace human indexers. These technologies must be trained using data indexed by people. And these technologies free up people to do what only people can do.

Innovation is happening in fuzzy search advancements. Fuzzy is a funny word that he used to refer to non-exact search results. This is familiar stuff like wildcards and name variants. Robert feels like there could be some innovation here less complicated than an artificial intelligence hint matching system but more sophisticated than the search engines of today.
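To make “fuzzy” a little more concrete, here is a minimal sketch of the three familiar flavors of non-exact matching: wildcards, name variants, and close spellings. It is not how any genealogy site actually searches. The variant table, the function names, and the 0.8 similarity threshold are all my own assumptions for illustration.

```python
import fnmatch
from difflib import SequenceMatcher

# Hypothetical name-variant table; a real site would use a far larger one.
NAME_VARIANTS = {
    "william": {"william", "wm", "will", "bill", "willem"},
}

def wildcard_match(pattern, names):
    """Return names matching a wildcard pattern such as 'Wil*' (case-insensitive)."""
    return [n for n in names if fnmatch.fnmatchcase(n.lower(), pattern.lower())]

def variant_match(query, names):
    """Return names that are listed variants of the query name."""
    variants = NAME_VARIANTS.get(query.lower(), {query.lower()})
    return [n for n in names if n.lower() in variants]

def similar_names(query, names, threshold=0.8):
    """Return names whose spelling is close to the query (threshold is an assumption)."""
    q = query.lower()
    return [n for n in names
            if SequenceMatcher(None, q, n.lower()).ratio() >= threshold]

names = ["William", "Wm", "Willem", "Wilhelm", "Walter"]
print(wildcard_match("Wil*", names))
print(variant_match("William", names))
print(similar_names("William", names))
```

Each function is deliberately simple. The middle ground Robert describes would presumably combine signals like these rather than rely on any one of them.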

DNA is having, and will continue to have, a massive impact on genealogy.

Process innovations are going to be important as well. Today, organizations have a centralized process for determining what records to acquire. Robert thinks we will see more distributed decision making on what collections to digitize. He envisions a world where local archives, libraries, church congregations (like LDS stakes and wards), and individuals take the responsibility to identify, digitize, and index collections. We see this a little already with apps like FamilySearch Memories or BillionGraves.

Data innovation was Robert’s final category. There is a lot of data out there that is highly valuable, but there is a risk that it will be lost. Records can be at-risk because of poor archival conditions, political instability, natural disaster, or scheduled destruction. India destroys their censuses before the decade is over. Lastly, there are hundreds of millions of “records” stored in memorized genealogies in certain cultures, many throughout Africa. FamilySearch has an active and growing program to capture these “oral genealogies.”

Robert Kehrer says some records are at risk because of poor archival condition.
Robert Kehrer says some records are at risk because of political instability.
Robert Kehrer says some records are at risk because of natural disaster.
Robert Kehrer says some records are at risk because of scheduled destruction.

The last data innovation is one of Robert’s hopes. There is so much good genealogy data locked up in the record managers on genealogists’ computers. It is not shared freely. Robert envisions a world where tree data is more readily available and shared more freely among all the different sites. Websites could compete on best features, user experience, and records rather than on availability of member submitted trees.

Thursday, January 5, 2017

#RootsTech 2017 Semifinalists Announced for Innovator Showdown

RootsTech's Innovator Showdown

Did RootsTech ever announce the Innovator Showdown 2017 semifinalists? I can’t find an official announcement anywhere. RootsTech silently updated the Innovator Showdown webpage, but fortunately allowed a couple of the judges, Jill Ball and Christine Woodcock, to personally make the announcement and the news spread around the blogosphere.

The ten semifinalists are:

Champollion 2.0


CSI: Crowd Sourced Indexing

Cuzins

Double Match Triangulator

Emberall

JoyFLIPS

Kindex

OldNews USA

QromaTag

RootsFinder

According to Christine, there were 42 submissions to the contest and the judges were given 21 of them for consideration. Christine said there were four criteria for their review:

Family History
Submissions must be directly or indirectly related to family history.

Quality of Idea
Includes creativity and originality.

Implementation of Idea
Includes how well the idea was executed by the developer.

Potential Impact
Will users get excited about this, is it applicable, does it solve a genuine problem?

The Innovator Showdown will be held Friday, 10 February 2017, at 10:30 MST during RootsTech and can be viewed online at Rootstech.org.

Wednesday, July 13, 2016

FamilySearch Family Tree Outage Minimal

FamilySearch Family Tree Now Overshadows NFS

The scheduled outage and system upgrade of FamilySearch on 27 June 2016 seems to have gone smoothly. The upgrade is an attempt to prevent performance problems. The upgrade provides “a new technology that should provide better scaling with traffic,” said FamilySearch’s Joe Martel. “That means as more people use the site it shouldn't bog down.” If the upgrade was successful, Sunday afternoon system failures should be a thing of the past.

The upgrade included breaking the synchronization link between FamilySearch Family Tree and the archaic New FamilySearch (NFS), said Ron Tanner, Family Tree product manager. The most touted benefits of the break are the ability to merge IOUSes (“Individuals of Unusual Size”) and the cessation of stupid data changes attributed to the FamilySearch or LDS Membership accounts.

Joe said that another benefit of the new system is that FamilySearch will be able to improve and enhance features faster.

“The cutover was a HUGE effort,” Joe said. “Hat's off to the engineering teams and planning that went into this.” According to fellow blogger Renee Zamora, the system was scheduled to go offline at 12:30 Monday morning. While FamilySearch had warned users the outage could go 24 hours, Holly Hansen reported on Facebook that it was back online by 6:00am.

I haven’t heard any reports of significant problems with the new system. “I'm guessing we'll see a few glitches but nothing monumental has turned up,” Joe said. I’ve seen minor issues. (There was a report that you can’t directly change a name from uppercase to mixed case. There was a report that %22 replaced quotation marks in custom facts.) I’ve seen comments about the system being faster. Joe has said that FamilySearch will need to tune the new system configuration.

There will still be situations where merging persons is not possible, according to Ron, but the system will tell you the exact reason. “There are some restrictions we had to put in place for those with lots of relationships, etc.” A merge is not allowed if the combined person would exceed certain limits. According to Renee, these are the current limits:

  • Note length: 10,752 characters
  • Person notes: 50, characters: 215,040
  • Relationship notes: 12, characters: 129,024
  • All person and relationship notes characters: 386,320
  • Conclusions: 200
  • Person source: 200
  • Relationship source: 50
  • Memories: 1000
  • Person not a match: 400
  • Discussions: 20
  • Couple relationships: 200
  • Sets of parents: 50
  • Number of children: 400

Ron says these numbers are changing as needed. I’ve already seen a report that the parent limit has changed to 100 and discussions to 50.
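For the programmers in the audience, the merge rule can be sketched in a few lines. This is only my illustration, not FamilySearch code: the field names and the `merge_blockers` function are hypothetical, and I am assuming the combined count is simply the sum of the two persons’ counts, which may not be how the real system tallies things. The limits are a subset of those Renee reported.

```python
# Limits copied from the list above; the field names are hypothetical.
MERGE_LIMITS = {
    "conclusions": 200,
    "person_sources": 200,
    "relationship_sources": 50,
    "memories": 1000,
    "discussions": 20,
    "couple_relationships": 200,
    "sets_of_parents": 50,
    "children": 400,
}

def merge_blockers(person_a, person_b, limits=MERGE_LIMITS):
    """Return the reasons a merge would be refused (empty list if allowed).

    Assumes the combined count is the simple sum of both records' counts,
    ignoring any de-duplication a real system might perform.
    """
    reasons = []
    for field, limit in limits.items():
        combined = person_a.get(field, 0) + person_b.get(field, 0)
        if combined > limit:
            reasons.append(f"{field}: combined {combined} exceeds limit {limit}")
    return reasons

print(merge_blockers({"discussions": 15, "children": 12},
                     {"discussions": 8, "children": 3}))
```

Because the limits are passed in as a parameter, changed values (such as the reported new parent and discussion limits) can be supplied without touching the function.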

FamilySearch calls the new system “Tree Foundation,” according to FamilySearch engineer, Randy Wilson. It uses a database technology called Cassandra. “Our relational database just couldn’t be made to go much faster and so there was concern that Family Tree would tip over at some point soon,” he said. The new system can “scale horizontally.” That means that FamilySearch can easily add more computer servers to meet demand. “That doesn’t necessarily mean that response time will be faster, but, rather, that more people should be able to use it at once,” Randy said. He noted that this technology change will not magically fix all performance problems, but the change eliminates a fundamental bottleneck that was important to fix.
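Here is a toy illustration of what “scale horizontally” means. It is not Cassandra’s actual partitioner, and the key names and node counts are made up. The sketch simply hashes each record key to pick a server, then shows that doubling the number of servers roughly halves the load on the busiest one.

```python
import hashlib
from collections import Counter

def node_for(key, node_count):
    """Pick a node by hashing the key (stable across runs, unlike built-in hash())."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % node_count

def load_per_node(keys, node_count):
    """Count how many keys land on each node."""
    return Counter(node_for(k, node_count) for k in keys)

# Hypothetical record keys, for illustration only.
keys = [f"person-{i}" for i in range(10_000)]
for n in (4, 8):
    print(n, "nodes, busiest node holds", max(load_per_node(keys, n).values()))
```

Simple modulo hashing like this reshuffles most keys whenever a server is added. Systems such as Cassandra use consistent hashing instead so that adding a server moves only a small share of the data. As Randy notes, none of this makes a single request faster; it lets more requests be served at once.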

Wednesday, May 25, 2016

Ancestry.com Preparing Large German Record Collection

Fraktur font city directory sample

And I’m guessing it is German city directories.

A number of years ago Ancestry.com developed technology allowing its computers to “read” (technically called OCR, or optical character recognition) and interpret U.S. city directories. (See “Data Extraction Technology at Ancestry.com.”) While the technology sometimes produces silly results, overall, it allowed Ancestry to publish over two billion records in record time and at minimal cost. The tradeoff seems reasonable.

Ancestry has nearly 700 German city directories but has yet to apply the technology to them. “The main challenge with German is the common use of the Gothic or Fraktur font when printing,” says Laryn Brown, Ancestry product manager. “This special script-like font is particularly difficult to recognize using the best OCR [optical character recognition] tools available today. Words and especially names can be misread as many of the characters in this special font are extremely similar.”

Ancestry’s solution is a quality-assurance check that compares the names the computer thinks it’s reading against a list of known names. If the computer sees something not in the name list, a reviewer is alerted. If the computer has correctly read a genuine name that is simply missing from the name list, the reviewer adds it. Otherwise, the misread name is mapped to the correct name or deleted. The results of these reviews are fed back to the computer so it can learn from its mistakes. It re-reads the books and the process is repeated.
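The review loop can be sketched roughly as follows. This is my own simplified reading of the process, not Ancestry’s code: the function names, the three reviewer actions, and the sample names are assumptions, and “learning from mistakes” is reduced here to a lookup table of corrections.

```python
def check_names(ocr_names, known_names):
    """Split OCR output into accepted names and strings needing human review."""
    accepted, flagged = [], []
    for name in ocr_names:
        (accepted if name in known_names else flagged).append(name)
    return accepted, flagged

def apply_review(flagged, decisions, known_names, corrections):
    """Record reviewer decisions: add a real name, map a misreading, or delete."""
    for name in flagged:
        action, target = decisions[name]
        if action == "add":
            known_names.add(name)
        elif action == "map":
            corrections[name] = target
        elif action == "delete":
            corrections[name] = None

def reread(ocr_names, corrections):
    """Simulate the next pass: apply learned corrections, dropping deleted strings."""
    fixed = (corrections.get(n, n) for n in ocr_names)
    return [n for n in fixed if n is not None]

known = {"Müller", "Schmidt"}
corrections = {}
ocr = ["Müller", "Schmibt", "Schneider", "Straße"]

accepted, flagged = check_names(ocr, known)
decisions = {
    "Schmibt": ("map", "Schmidt"),   # misread letter
    "Schneider": ("add", None),      # real name missing from the list
    "Straße": ("delete", None),      # not a name at all
}
apply_review(flagged, decisions, known, corrections)
accepted2, flagged2 = check_names(reread(ocr, corrections), known)
print(accepted2, flagged2)
```

After one round of review, the second pass accepts all three real names and flags nothing, which is the point of repeating the process.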

“When these records are finished, a random list of German words in a very difficult-to-read font will have been turned into a set of records about people that looks a lot like an annual census,” said Laryn.

This new collection will become available over several years and will include millions of pages of new content.

Thursday, March 24, 2016

Full URLs in Citations?

Dorothea Lange, “Destitute pea pickers in California. Mother of seven children. Age thirty-two. Nipomo, California,” 1936

When citing a web page, one must decide whether to use a full URL or a URL to the home page. One usually cites the website home page and includes the additional information necessary to guide users to the target page. Citing the full URL is a good alternative under two conditions: 1. The URL is long lived. 2. The URL is not too long. The longer the URL, the harder it is for a user to enter it without making a typographical error.[1]

How do you know if a URL is long lived?

URLs suffer from a process called link rot. For various reasons, they cease to work. Companies cease to exist or rename or reorganize websites. Some URLs are set to expire within minutes. Others never work anywhere but on your computer in your current browser. How might you know? Try copying and pasting the URL into a different browser. If it fails, you know right away it is not long lived.

For example, NARA included descriptive pamphlets at the beginning of their microfilm publications that sometimes contain rich information. Some are available online only through the NARA microfilm store. In the store, the product page for each microfilm contains a link (“View Important Publication Details”) to download the pamphlet. Unfortunately, the URL of the pamphlet (such as the one for M1328) is nearly impossible to obtain, is seven lines long, and won’t work again. And the URL of a product page expires immediately; even refreshing the page sends you back to the welcome page. The only way to access a pamphlet in the store is through a lengthy set of instructions.

Another class of URLs that fails to work are URLs of records found using databases at your public library or through their website.

Some publishers provide URLs that they intend to work for a considerable amount of time. How long? Let’s say they will work almost to eternity. However, remember that “Internet time” runs much faster than regular time. “Eternity” is no more than 30 years away.

What are some of the systems and websites providing long lived URLs?

PURL and the GPO

The U.S. Government Publishing Office utilizes a system called PURL (persistent uniform resource locator) for some online publications.

As part of the online dissemination of Federal information, the FDLP uses persistent uniform resource locators (PURLs) to provide stable URLs to online Federal information. When a user clicks on a PURL, the request is routed to the Federal publication. As Federal agencies redesign and remove information from their sites, GPO staff reroute PURL entries to the appropriate location.[2]

For example, the PURL for the tri-fold brochure, USCIS Genealogy Program, is http://purl.fdlp.gov/GPO/gpo64668. When you enter that URL into your browser, the GPO server reroutes you to the current location of the brochure, wherever that might be. Similarly, http://purl.fdlp.gov/GPO/gpo26239 sends you to Guide to Tracing Your American Indian Ancestry. Apparently, GPO even supports some government publications on non-government websites. http://purl.fdlp.gov/GPO/gpo43102 sends you to a poster, National Atlas of the United States of America. Presidential Elections, 1789-2008, on the University of Iowa’s website.

The GPO PURL system will work if resources haven’t been altogether removed from the Internet and if GPO personnel have the time to update the links. I would use PURL links in a citation unless they were too long.

ARK and FamilySearch

FamilySearch provides long-lived URLs for its historical records, record images, IGI, and personal genealogies. Any URL containing “ark:” (archival resource key) or “pal:” (persistent archival link) is expected to work for a long time. I consider these safe to use in citations. Also, I think it is safe to remove the question mark and everything past it.

URLs to collections, persons in Family Tree, photos, user uploaded documents, wiki articles, and other pages don’t contain the “ark:” characters so I don’t consider them long lived.
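If you tidy URLs with a script, the rule of thumb above can be sketched as follows. The function names are mine, and the example identifier is made up purely for illustration. Whether trimming the query string is always safe is my opinion, not a FamilySearch guarantee.

```python
from urllib.parse import urlsplit, urlunsplit

def is_persistent(url):
    """True if the URL path contains an 'ark:' or 'pal:' identifier."""
    path = urlsplit(url).path
    return "ark:" in path or "pal:" in path

def citation_url(url):
    """Drop the query string ('?' and everything after it) and any fragment."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))

# Made-up identifier, for illustration only.
example = "https://familysearch.org/ark:/61903/1:1:XXXX-XXX?cc=0000000"
if is_persistent(example):
    print(citation_url(example))
```

A URL that fails the `is_persistent` check is one I would not rely on in a citation, whatever it looks like after trimming.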

LOC Digital IDs and Handles

Online items on the Library of Congress website often have a permanent URL containing a digital ID.

To find a permanent URL for an item first look at the bottom of the item record. In some collections, you will find shorter permanent addresses in the "Digital ID" field of the item record. The URLs begin with "http://hdl..." and are called "handles" or "handle addresses."[3]

The URL https://www.loc.gov/item/mfd.45004/ currently leads to three death certificates from the Frederick Douglass family. But that URL may not work in the future. On that page one can find a digital ID in URL form: http://hdl.loc.gov/loc.mss/mfd.45004. If you use the digital ID URL, the LOC computers will interpret it and generate a URL that currently works. Go to it and you find yourself back at https://www.loc.gov/resource/mfd.45004. LOC has the latitude of changing the latter URL, but the digital ID URL is longer lived. I consider it safe to use in a citation.

The URL https://www.loc.gov/item/fsa1998021539/PP/ used to point to an instance of a famous Dorothea Lange photograph (shown at the top of this article).[4] That link is broken now and I don’t know the digital ID, so I am unable to return to that webpage. You can see the original, unretouched photograph using digital ID URL http://hdl.loc.gov/loc.pnp/ppmsca.12883.

CONTENTdm and Reference URLs

CONTENTdm is software many universities use to display their digital collections. It has a reputation for links that fail. Let’s say I search the Robert Hawley Milne papers from Lewis University on the CARLI digital collections website and find the birth certificate of Flora Jane Putnam. The URL displayed by my browser is http://collections.carli.illinois.edu/cdm/singleitem/collection/lew_rhm/id/311/rec/4. It is not guaranteed to work if I change browsers or clear my cookies or use it tomorrow. If I poke around, I find a link labeled “Reference URL.” I click it and am rewarded with this URL: http://collections.carli.illinois.edu/cdm/ref/collection/lew_rhm/id/311. If you wish to share a URL to Flora’s birth certificate, share this one. But I wouldn’t use it in a citation. Why?

If an institution switches from CONTENTdm to another software solution, CONTENTdm reference links will break. This is true for the software systems employed by most universities and small- to medium-sized archives. That brings us full circle.

Conclusion

It is often better to cite a homepage and include information inherent to a digital artifact—information that is likely to survive a switch from one software solution to another. That information can then be used with the search function. Digital identifiers, titles, and author/creators are information likely to survive.

In the Flora Jane Putnam example, the digital artifact title is “Birth Certificate for Flora Jane Putnam” and the identifier is “Flora Jane Putnam Birth Certificate 1893.tif.” One or both of these are likely to survive. I could cite the certificate and the digital artifact like this:

Illinois Department of Public Health, certified copy of delayed record of birth no. 201472, Flora Jane Putnam (1893); Robert Hawley Milne Papers; Canal and Regional History Collection; Lewis University Library, Romeoville, IL; digital image, (http://www.lewisu.edu : accessed 18 March 2016), search the library’s Milne digital collection for "Flora Jane Putnam Birth Certificate 1893".

My personal practice is to use complete URLs sparingly and, if there is any doubt as to their persistence, include enough other information that a user can find the webpage even after the URL has rotted. 

 


Portions of this article were adapted, with permission, from a post made on the BCG ACTION mailing list.

SOURCES

     1.  Elizabeth Shown Mills, Evidence Explained: Citing History Sources from Artifacts to Cyberspace, third edition, Adobe Digital Edition, (Baltimore, Maryland: Genealogical Publishing, 2015), 59, 269, 283, 597, 626, 767.
     2.  Federal Depository Library Program Persistent URL Home Page (http://purl.access.gpo.gov : accessed 19 March 2016).
     3.  “Frequently Asked Questions,” The Library of Congress: American Memory (https://memory.loc.gov/ammem : accessed 19 March 2016), Bookmarking [and] Linking.
     4.  Dorothea Lange, “Destitute pea pickers in California. Mother of seven children. Age thirty-two. Nipomo, California,” 1936; retouched photograph of Florence Thompson with left thumb removed, LC-USF34-T01-009058-C (b&w film dup. neg.); Farm Security Administration/Office of War Information Black-and-White Negatives collection; Prints and Photographs Division; Library of Congress, Washington, D.C.; digital image (http://hdl.loc.gov/loc.pnp/fsa.8b29516 : accessed 19 March 2016).

Wednesday, February 3, 2016

#BYU Family History Technology Workshop

Amy Harris gave her wish list to developers and researchers at the BYU Family History Technology Workshop.

Before #RootsTech, before #InnovatorSummit, there was the Brigham Young University Family History Technology Workshop. Now in its 16th year, the one-day workshop brings together developers and researchers tackling some of genealogy’s most thorny challenges.

Amy Harris, an associate professor of history at BYU and an accredited genealogist, provided the workshop’s keynote yesterday. Amy currently serves as the director of the Family History Program at BYU. She spoke to the topic “A Genealogist's Technological Wish List: Teaching, Filtering, and Mapping.”

“We are engaged in similar work,” Amy said of genealogists and technologists. “We are solving puzzles or mysteries.” Amy went through her wish list of things she wished technology would do to improve the work of historians and genealogists.

Amy wishes applications could be more instructional, teaching users to be better. It doesn’t have to be FamilySearch that makes the FamilySearch website more usable. It could be a popup app that explained in which situations a record collection might be useful. Developers wouldn’t need to develop the instructional resources. It could point users to existing resources. Apps could help with situation-specific research problems, walking users through the process of figuring out which records should be used at each step of the process.

She wishes there were instructional OCR technology. She wishes there were help for citation standards. She wishes there were technology helping users evaluate record hints in Ancestry or FamilySearch trees. She wishes tree software better helped users work through the challenges of naming schemes that don’t carry the same surname from one generation to the next.

Amy wishes programs helped users understand and use changing jurisdictions. It would be great to have an app that showed all the different jurisdictions for a place, overlaying the boundaries on a map and allowing for boundaries that changed over time. Just a few examples of different jurisdictions in England are civil registration districts, poor law unions, Church of England parishes and dioceses, and Quaker monthly meeting boundaries.

In short, Amy wishes there were apps that were informed by advanced research methodology and helped users apply it.

Thursday, September 24, 2015

Ancestry Insider Named One of Family Tree Magazine’s Top Blogs

The Ancestry Insider is one of Family Tree Magazine's top 5 blogs.

I learned last week that Family Tree Magazine has named me one of the top five genealogy blogs. Thank you, Family Tree Magazine. I am grateful for your editorial staff’s encouragement that I stick with blogging even though I’m a poor writer. Hopefully, I make up for it by presenting useful content.

Speaking of which, enough about me. Let me give you something actually useful. The other four genealogy blogs are

Family Tree Magazine honored YouTube as a sixth choice. I’ve discovered the same thing of late. Here are just some of the useful channels I’ve looked at:

Read David A. Fryxell’s take on these blogs at http://familytreemagazine.com/article/best-genealogy-blogs-2015.

Read his comments about the other “101 Best Websites,” 2015 at http://familytreemagazine.com/article/101-best-websites-2015 and:

 

Thursday, August 27, 2015

The Future Will Bring Automated Indexing Tools – #BYUFHGC

Jake Gehring presenting at the 2015 BYU Conference on Family History and Genealogy

“It’s not that we don’t like our [indexing] volunteers,” said Jake Gehring. “We would just rather have them work on things that only [humans] can do.” Jake is director of content development for FamilySearch and presented at the BYU Conference on Family History and Genealogy last month. This article is the third and last article about his presentation. In the first article I reported on Jake’s premise that FamilySearch Indexing is not keeping up with the number of records FamilySearch is acquiring and additional means are needed. In the second article I reported about two of those means: increasing the efficiency of human indexers and working with commercial partners. In today’s article I will report on the third means: increased automation via computers.

In the third part of his presentation, Jake spoke about “the really far-out stuff, HAL9000 kind of stuff.”

Jake showed a screen shot that we saw in Robert Kehrer’s keynote. (See “Kehrer Talks FamilySearch Transformations” on my blog.) The screen showed a color-coded obituary.

Obituary with parts of speech color coded by FamilySearch automated obituary indexing system

FamilySearch trained a computer to identify the different parts of speech. They trained the computer how to discern meaning out of a bunch of words. Notice in the example above that names of people are identified in dark green, places in brown, dates in dark blue, relationships in salmon, events in pale green, clock times in a steel blue (or would you call that a dark sky blue?), organizations in red, and buildings in goldenrod (or would you call that a mustard?).

They basically teach the computer to read. The computer is willing to extract a lot more detail from an obituary than a volunteer can easily do. And it can work really, really fast. For obituaries, computers can do in about a week and a half what it takes all of FamilySearch’s volunteers three and a half years to do. This is why in a few weeks FamilySearch is going to stop having volunteers index the current obituary project. In fact, FamilySearch has already published about 37 million obituaries this way. You may already have found and used an obituary that was indexed by a smart computer.

This applies to obituaries published since about 1977. Since that time, most obituaries have been published and stored digitally. Before 1977, things look a lot different. Because those obituaries are not already digital, it is a pretty nasty OCR problem. [OCR converts the printed page to text so that the computer can subsequently try to make sense of it.] The problem is so severe that computers can recognize only about half of the words in pre-1900 newspapers.

If you were at RootsTech you may have seen the last thing Jake showed. A company named Planet entered its ArgusSearch into the Innovator Challenge. ArgusSearch is a system that reads the handwriting of documents that have not been indexed. You type in something like “Steinberg” and the program shows some records that might match that name. It won’t find all the matches. And it may return some results that aren’t matches. But this is still useful. This technology is still young, but an application like this is likely to hit real life in the next ten years.

Planet's ArgusSearch automatically read handwritten names in census records without an index.

Jake summarized by saying that while indexing is going really well, never better, it is unfortunately just not good enough to give us all the records we need. [FamilySearch does not index all the records they acquire.] “We need to do much better. It’s not that we are not quite there; we are way behind and getting further behind every year,” he said. There are three areas that FamilySearch needs to utilize. FamilySearch needs to increase the efficiency of its indexing volunteers. FamilySearch needs more help from for-profit publishers who can bring more resources to the table. And FamilySearch needs to use computer technology to make images searchable with little or no human intervention.

“It’s an exciting time to be alive. Can you imagine the explosion of document availability once we make a bit more headway in a few of these areas?”

Jake took a couple of questions:

Q. How easy is it to use tools like Google Translate to translate Spanish records?

A. Google Translate is better at modern, generic words. If you type in the text of a letter, you would be able to get the gist of it, but it may not handle archaic words or words specific to a vital record. As long as you know a small set of terms, you can usually get by without a computerized translator. There is no magic tool currently available.

Q. Why do we sometimes key so very little from a record? While we have someone looking at the document, shouldn’t they be extracting more?

A. Because we publish both indexes and images, we index the minimal amount necessary to find the image. Why index something that no one will ever use in a search? Cook County, Illinois death certificates are an example where we indexed something that didn’t need to be. We indexed the deceased’s address, but who will ever search using the address? Sometimes we don’t get it quite right, but that’s the general principle.

Q. When will we be able to correct published indexes?

A. After ten years of this being in the top three requested features, we’re starting to implement the feature that allows you to contribute corrections. We are rapidly approaching the point when this will be available. I’m not really authorized to say “soon,” but we have our eyes on that feature.

Wednesday, August 26, 2015

FamilySearch Should Increase Indexing Efficiency and Utilize Partnerships

Jake Gehring presenting at the 2015 BYU Conference on Family History and Genealogy

FamilySearch is not keeping up with indexing the records it digitizes and improvements in three ways could help fix this, according to FamilySearch director of content development, Jake Gehring. Yesterday I presented the first part of my remarks about his presentation at the 2015 BYU Conference on Family History and Genealogy (#BYUFHGC). Today I’ll present the second part, covering the first two of the three ways, increasing efficiency and partnering. Tomorrow I’ll present the third way, increased use of computerization.

Today’s FamilySearch Indexing (FSI) system is somewhat inefficient. FSI primarily utilizes a double-blind indexing methodology, sometimes described as A+B+arbitrate. Two indexers independently index a batch of records. If there are any differences, even one letter in one record, the entire batch is sent to a third person to arbitrate between the two values, or supply a value of their own. It turns out that 97% of all batches have at least one difference, even though what is keyed is the same for 70% of the fields. As a result, almost all records are looked at by three people. There’s a good argument that that is wasteful. For certain kinds of records and certain kinds of people [and certain kinds of fields, I might add], a single keyer is sufficient. The accuracy doesn’t get any better when two more people are involved. Within the last year FamilySearch switched to single keying for newspapers, since typeset material can usually be read without error. You wouldn’t want to do this for certain types of records or for beginning indexers.
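A tiny sketch shows why nearly every batch ends up in arbitration even when the two indexers mostly agree. The records and function names below are invented for illustration; this is not the FSI code, only the A+B+arbitrate rule as described above.

```python
def needs_arbitration(batch_a, batch_b):
    """A batch goes to an arbitrator if the two keyings differ anywhere."""
    return batch_a != batch_b

def field_agreement(batch_a, batch_b):
    """Fraction of individual fields on which the two indexers agree."""
    pairs = [(a, b)
             for rec_a, rec_b in zip(batch_a, batch_b)
             for a, b in zip(rec_a, rec_b)]
    return sum(a == b for a, b in pairs) / len(pairs)

# Made-up batch: two records of (given name, surname, year), keyed twice.
batch_a = [("Henry", "Smith", "1850"), ("Mary", "Jones", "1852")]
batch_b = [("Henry", "Smith", "1850"), ("Mary", "Jonas", "1852")]

print(needs_arbitration(batch_a, batch_b))
print(field_agreement(batch_a, batch_b))
```

Here five of six fields agree, yet the single-letter difference sends the whole batch to a third person. The larger the batch, the more likely at least one such difference exists.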

A more efficient methodology is referred to as A+review. One person keys the information and a second person reviews what is keyed. All the reviewer does is indicate whether the information is correct or not. This could easily be done, even on a cell phone. This method is about 40% more efficient than the double-blind methodology because FamilySearch knows when a record needs to be keyed a second time. FamilySearch is actively working on this kind of methodology to increase the efficiency of indexing.

Jake showed three entirely new, experimental types of indexing: keyboardless indexing, free-form indexing, and casual “micro-indexing.” Some do not even have working prototypes.

Jake showed an indexing system that allows productive use of devices without keyboards, such as smart phones. If you’ve used photo recognition in Photoshop, you have seen the paradigm before. He showed a slide showing 12 snippets of a name, such as “Henry.” (See my version, below.) These had been read from documents by a computerized handwriting recognition system. But since computers aren’t too good at reading handwriting, it presents its results to a person for verification. The person marks any that the computer got wrong. Where the computer had a good second guess, it could present that as well, allowing the person to select an alternate name, such as “Kerry.” For pre-printed forms, this works great and allows easy indexing on devices without keyboards, such as cell phones.

Grid of 12 handwritten name snippets, each indexed as Henry or Kerry
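The verification step lends itself to a small sketch. This is only my guess at the bookkeeping, not Jake’s system: the `Snippet` class, its field names, and the two sets of taps are all hypothetical. The person never types; they only mark snippets as wrong and, where offered, tap the computer’s second guess.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Snippet:
    image_id: str                        # hypothetical identifier for the snippet
    best_guess: str                      # what the handwriting recognizer read
    second_guess: Optional[str] = None   # alternate reading, if it had a good one


def apply_review(snippets: list[Snippet],
                 rejected: set[str],
                 use_alternate: set[str]) -> tuple[dict[str, str], list[str]]:
    """Combine the computer's guesses with the person's taps.

    `rejected` holds ids the person marked as wrong; `use_alternate` holds
    ids where the person tapped the second guess instead.
    Returns (indexed names by id, ids still needing another method).
    """
    indexed: dict[str, str] = {}
    unresolved: list[str] = []
    for s in snippets:
        if s.image_id not in rejected:
            indexed[s.image_id] = s.best_guess
        elif s.image_id in use_alternate and s.second_guess is not None:
            indexed[s.image_id] = s.second_guess
        else:
            unresolved.append(s.image_id)
    return indexed, unresolved


# Made-up example: three snippets the computer read as "Henry".
snippets = [
    Snippet("img1", "Henry"),
    Snippet("img2", "Henry", "Kerry"),
    Snippet("img3", "Henry"),
]
indexed, unresolved = apply_review(
    snippets, rejected={"img2", "img3"}, use_alternate={"img2"}
)
print(indexed)     # {'img1': 'Henry', 'img2': 'Kerry'}
print(unresolved)  # ['img3']
```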

Jake showed the FamilySearch Pilot Tool, a system for free-form indexing. It is currently live, as a pilot. A large portion of the screen is a browser showing a record on FamilySearch.org. Along the right side is a pane where an indexer can enter names, dates, and places extracted from the document. (See the screen shot, below.) A person would use the tool to index any record that they care about and a short time later the record would be searchable. You wouldn’t have to ask for anyone’s permission. You wouldn’t have to index all the names. Anyone could take any collection desired and do some indexing. FamilySearch is very interested in tools that let you index as you go. To join the pilot, send Jake an email. (I see someone has also posted the link online. See “FamilySearch Pilots Web-Based Indexing Extension” on the Tennessee GenWeb website.) There is no arbitration. If you care enough to index the image, you probably care enough to be accurate. But that supposition is something yet to be validated.

The FamilySearch Pilot Tool for indexing

“Micro-indexing” could be used to make images more usable. It would be nice to be able to browse unindexed images more easily. FamilySearch is very interested in an upgrade to the current browse experience. Jake showed an animated artist’s rendition of a tool, reminding us that this is just a research and development idea.

FamilySearch is interested in making it easier to find records in images that have not yet been indexed.

In micro-indexing the system might ask you really simple questions, like, “What kind of record is this?” and have you click the record type. By asking volunteers to do tiny tasks, FamilySearch might be able to gather enough information to make images easy to browse by record type, locality, and time. Just because FamilySearch doesn’t have the time to index the images doesn’t mean they can’t be made easy to browse.

This is a mock-up of what a micro-indexing tool might look like.
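Jake didn’t say how the answers to those tiny questions would be combined. One plausible approach, and it is purely my assumption, is a simple vote: accept a record type for an image once enough volunteers agree. The function name and both thresholds below are invented for illustration.

```python
from collections import Counter
from typing import Optional


def tally_record_type(answers: list[str],
                      min_votes: int = 3,
                      min_share: float = 0.6) -> Optional[str]:
    """Return the agreed record type for one image, or None if undecided.

    Both thresholds are made-up values: wait for at least `min_votes`
    answers, then require the top answer to hold `min_share` of them.
    """
    if len(answers) < min_votes:
        return None
    label, count = Counter(answers).most_common(1)[0]
    return label if count / len(answers) >= min_share else None


print(tally_record_type(["census", "census", "census", "birth"]))  # census
print(tally_record_type(["census", "birth"]))                      # None
print(tally_record_type(["census", "birth", "death"]))             # None
```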

In addition to talking about increasing the efficiency of indexing, Jake talked about partnering. FamilySearch is fine with the concept of trading data with other companies. FamilySearch provides images and the partner creates indexes. They may even get exclusive use of the indexes for a while. For example, a lot of Mexico church and civil records are being indexed right now by Ancestry.com. We all get the value of it eventually. FamilySearch has similar projects going on with Findmypast (I didn’t catch the project names) and MyHeritage (Danish census and church records, and Swedish household names). This increases the rate of indexing by bringing more indexers to the table.

Thursday, May 7, 2015

Ancestry Hacker Internal Contest

As a former software engineer, I was interested to see an article about an internal programmers’ contest at Ancestry. They gave their programmers two days to produce some cool product. That’s not much time to do anything useful, so it presents just the sort of challenge that ignites a programmer’s creativity and competitive nature. Participation as a team builds unity and camaraderie. “Teams find themselves really energized by sitting and working together to brainstorm and solve problems without as much regard to roles and process,” said Christopher Bradford, Ancestry VP of engineering.

Prizes were awarded in two categories, one for usefulness, and one just for fun. The prize winner in the serious category provided a different search experience, better utilizing the data itself to help users narrow down search results. A prize winner in the fun category was a dungeon-type game that pits your skills against those of your ancestors. To see the other prize-winning ideas, see “2015 Hack Days at Ancestry” on the Ancestry tech blog.

Don’t look for any of these features or products on your grocer’s shelves anytime soon. That’s not the purpose of the contest and not the typical outcome. But, hopefully, happier programmers lead to happier customers.

Image courtesy of fotographic1980 at FreeDigitalPhotos.net.

Thursday, February 12, 2015

GEDCOM Replacement is Here (#RootsTech #RTATEAM)

At RootsTech two years ago, Ryan Heaton of FamilySearch talked about a GEDCOM replacement: GEDCOM X. (See “Ryan Heaton: A New GEDCOM.”) Today, GEDCOM X is a reality. Heaton’s presentation this year was titled “The Ecosystem of Genealogical Data Exchange.” I believe they recorded it; you’ll probably be able to view it yourself at some point. A warning is warranted, however. This was an Innovator’s Summit session. The target audience for the presentation was software engineers.

Heaton spoke of a genealogical ecosystem of information exchange. Family Group Sheets and other forms imposed structure on the exchange of genealogical data. Computer programs enforced it. Now we use the Internet and data types are tightly defined.

The elements of a genealogical data ecosystem are

  • records
  • persons
  • relationships
  • sources
  • citations
  • analyses (How did I make this inference? What makes me believe this information is true?)
  • research (For example, what are the to-do items in my research plan?)

The actors are

  • Systems
  • Users

He talked about information flow between users and systems. By user he meant a desktop genealogy tree management program like Ancestral Quest, Legacy, or RootsMagic. By system he meant an online tree manager. The information flow can be from:

  • User to user. This exchange has been done with GEDCOM or proprietary file formats of the desktop genealogy tree managers. Users can generally import the proprietary data files of competing tree managers, but generally can’t export in competitors’ formats. There can also be data loss. There is limited exchange capability for citation metadata; elements or formatting are often lost. International character sets are sometimes mishandled.
  • System to/from user. This is done with publicly facing interfaces (APIs).
  • System to system. This is usually done using bulk exchange formats.

User-to-User

Several things inhibit user-to-user exchange. There is no specification commonly used by desktop tree managers to exchange citation metadata. Tree management software vendors lack incentive to make it easy for you to move to their competitors. This includes FamilySearch, which doesn’t necessarily want you to download all your data in one step.

System-to-user

Many desktop managers have the capability to exchange data with online tree systems. The desktop managers use APIs that allow desktop programs to talk to online tree managers like FamilySearch’s Family Tree and MyHeritage’s tree.

The FamilySearch API conforms to GEDCOM X. A significant number of partners are using GEDCOM X to talk with the FamilySearch Family Tree. (I think his point here is that desktop tree managers know the API and could use it to exchange with each other if they chose to.)
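Since Heaton ran out of time before showing GEDCOM X itself, here is a small sketch of what a GEDCOM X style JSON document looks like, built with nothing but Python’s standard library. The property names and type URIs follow the GEDCOM X JSON format as I understand it; check them against the published specification before relying on them. The helper functions and the people are made up, and this does not call the FamilySearch API.

```python
import json


def person(pid, full_name, gender, birth_date=None, birth_place=None):
    """Build a GEDCOM X style person (hypothetical helper, illustrative only)."""
    p = {
        "id": pid,
        "names": [{"nameForms": [{"fullText": full_name}]}],
        "gender": {"type": f"http://gedcomx.org/{gender}"},
    }
    if birth_date or birth_place:
        fact = {"type": "http://gedcomx.org/Birth"}
        if birth_date:
            fact["date"] = {"original": birth_date}
        if birth_place:
            fact["place"] = {"original": birth_place}
        p["facts"] = [fact]
    return p


def parent_child(parent_id, child_id):
    """Build a parent-child relationship; person1 is the parent."""
    return {
        "type": "http://gedcomx.org/ParentChild",
        "person1": {"resource": f"#{parent_id}"},
        "person2": {"resource": f"#{child_id}"},
    }


# Made-up family, not taken from any real tree.
doc = {
    "persons": [
        person("P1", "Henry Smith", "Male", "1850", "Ohio"),
        person("P2", "Mary Smith", "Female"),
    ],
    "relationships": [parent_child("P1", "P2")],
}

text = json.dumps(doc, indent=2)
print(text)
```

The point is that persons, facts, and relationships travel as tightly defined data types, which is what would let one tree manager hand them to another without guessing at the meaning.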

System-to-system

Companies sometimes strike business deals to share their data. A non-genealogical example is Open Archives, with OAI-PMH and A2A. FamilySearch has also done bulk data exchanges with Ancestry.com, MyHeritage, and findmypast. FamilySearch gives them a big Atom feed that transfers GEDCOM X data sets.

What inhibits genealogical dataflow?

  • Security
  • Budget constraints
  • Data loss
  • Feature mismatches
  • Lack of well-established specifications
  • User reluctance to share
  • Programmer awareness

Heaton ran out of time before he could talk about GEDCOM X directly. But I think his message was that GEDCOM X is here. It is alive. Vendors use it to exchange data with FamilySearch, but not with each other.

Tuesday, December 9, 2014

RootsTech Announces Techie Contest with $25,000 in Cash Prizes

RootsTech has announced that it is offering $25,000 in cash prizes to encourage development of cool new family history apps and technology. “The contest will culminate with a hybrid Shark Tank, America’s Got Talent-like live event where judges and thousands of viewers will decide the winners,” said Paul Nauta, FamilySearch spokesperson. The event will occur 13 February 2015. “A panel of five genealogy, technology, and business gurus will judge four finalists from around the world in a showdown at the Salt Palace Convention Center in Salt Lake City, Utah,” said Nauta.

The first prize award is $10,000, second prize is $7,000, and third prize is $3,000. The People’s Choice award, determined by audience voting, will be $5,000.

A marketing research firm studied potential product ideas by conferring with 11 senior executives of family history organizations. Three were from Ancestry.com, one from MyHeritage, one from BrightSolid, two from FindMyPast, two from FamilySearch, one from the Federation of Genealogical Societies, and one was the managing director (i.e., president) of FamilySearch. The study showed that the older, niche target of record-oriented genealogists and historians represents just 1% of the potential market. If the current market is worth $4 billion and is just 1% of the potential, the potential market is $400 billion.

The Ancestry Insider is an official RootsTech ambassador.

The potential market consists of people who are “generally younger, digital, tech savy, socially connected, mobile and experience driven,” according to the study. They are unwilling to invest the time and effort in traditional research like searching for names and dates, but still want to experience what traditional researchers want: “content that fosters joyful, shared and satisfying family history experiences that bond generations.”

For more information, read the FamilySearch announcement and visit the RootsTech Innovator’s Challenge website.

Tuesday, February 25, 2014

#RootsTech – Developer Challenge 2014

RootsTech was born partly out of a conference for software engineers. It retains that aspect today, including a contest, the Developer Challenge.

“The annual RootsTech Developer Challenge rewards developers who introduce the most innovative, new concepts to family history,” according to the RootsTech website. The challenge is to “create an application or service that introduces a compelling new concept or innovation to family history.” Winners were announced at the end of the keynote session Friday.

First prize went to “Saving Memories Forever,” a smart phone app by Harvey and Jane Baker, of St. Louis, Missouri.

“The Bakers saw that smart phones could serve as a mobile recording studio and as a tool to upload stories seamlessly to a private website,” wrote FamilySearch’s Thom Reed. “The app provides prompts and questions to encourage recording life stories and make them available for generations. It creates an easy way to connect families through the richness of voice and the warmth of storytelling.”

I ate dinner with the Bakers on Wednesday and found them to be very nice people. The Bakers won $2,000 cash and a Dell laptop computer. Click the picture above to view a 100-second introductory video of Saving Memories Forever.

Second prize went to “Find-A-Record,” a creation of John Clark and Justin York of Genealogy Systems LLC in Provo, Utah. They won $1,000.

“Find-A-Record is a searchable worldwide index of records collections,” wrote Reed. “A family history researcher can enter available information about where and when their ancestors lived and discover the various record collections available. The search is integrated with popular online trees through a browser extension.” Click the picture to see a 90 second video.

Third prize went to PhotoFaceMatch, a technology developed by Charley Smart and Steve Miller of Eclipse Identity Recognition Corporation. PhotoFaceMatch uses facial recognition technology to compare a set of photographs of a known person against a photograph of an unidentified person and determines if there is a potential match. Third prize was $500. Click the picture to see a 90-second introductory video.

Congratulations to all who participated in this year’s Developer Challenge.